Nowadays, the volume of data that financial companies manage is so large that it has become necessary to address this problem; a solution can be found in Big Data techniques applied to massive financial datasets for segmenting risk groups. In this paper, the challenge of large datasets is approached through a set of Monte Carlo experiments using well-known techniques and algorithms. In addition, a linear mixed model (LMM) has been implemented as a novel incremental contribution to the calculation of the credit risk of financial companies. These computational experiments were run with several combinations of dataset sizes and shapes to cover a wide variety of cases. The results reveal that large datasets require Big Data techniques and algorithms that yield faster and unbiased estimators. Big Data can help to extract the value of data so that better decisions can be made without runtime becoming a limiting factor. With these techniques, financial companies would face less risk when predicting which clients will meet their payments; consequently, more people could gain access to credit loans.
Any credit rating system that enables the automatic assessment of the risk associated with a banking operation is called credit scoring. This risk may depend on several customer and credit characteristics, such as solvency, type of credit, maturity, loan amount, and other features inherent in financial operations. Credit scoring is an objective system for approving credit that does not depend on the analyst's discretion.
In the 1960s, coinciding with the massive demand for credit cards, financial companies began applying credit scoring techniques as a means of assessing their exposure to insolvency risk (Altman, 1998). At the same time, the United States also began to develop and apply credit scoring techniques to assess credit risk and to estimate the probability of default (Escalona Cortés, 2011).
In this research experiment, the intention was to generate files that could represent loans for any bank branch, without depending on a specific country or particular institution. For this reason, we simulated random datasets ranging from N = 2,000 to N = 100,000 records and from p = 1 to p = 250 explanatory variables.
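The paper does not specify the exact data-generating process, so the following is a minimal sketch of how such random loan datasets could be simulated in Python with NumPy. The function name `simulate_credit_dataset` and the choice of standard-normal explanatory variables with a logistic default label are assumptions, not the authors' procedure.

```python
import numpy as np

def simulate_credit_dataset(n_records, n_vars, seed=0):
    """Simulate a synthetic loan dataset: n_records rows, n_vars
    standard-normal explanatory variables, and a binary default
    label drawn from a logistic model with random coefficients."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_records, n_vars))
    # Scale coefficients so the linear predictor stays well-behaved as p grows
    beta = rng.normal(scale=1.0 / np.sqrt(n_vars), size=n_vars)
    prob_default = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = rng.binomial(1, prob_default)
    return X, y

# One of the smaller experiment sizes mentioned in the text
X, y = simulate_credit_dataset(n_records=2000, n_vars=10)
print(X.shape, y.shape)  # (2000, 10) (2000,)
```

Sweeping `n_records` up to 100,000 and `n_vars` up to 250 would reproduce the range of dataset sizes and shapes described above.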
Eight methods were proposed: QDA, CART, PRUNECART, LM, LMM, LSVM, GLMLOGIT, and NN. For each method, measures of effectiveness and efficiency were calculated. The most effective methods are GLMLOGIT and LMM. LM, LMM, and GLMLOGIT are also the most computationally efficient methods, followed by QDA and CART. For large datasets, LMM requires the least elapsed time, whereas GLMLOGIT is fastest on small datasets.
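To illustrate how effectiveness and efficiency can be measured for one of these methods, the sketch below fits a logistic regression (the GLMLOGIT analogue) by Newton-Raphson on simulated data, recording elapsed time as the efficiency measure and classification accuracy as the effectiveness measure. This is an assumed, self-contained implementation for illustration only; the paper's actual estimators and metrics may differ.

```python
import time
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson (IRLS);
    returns the coefficient vector including an intercept."""
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend intercept column
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Xb @ beta)))   # fitted probabilities
        W = mu * (1.0 - mu)                       # IRLS weights
        grad = Xb.T @ (y - mu)
        hess = (Xb * W[:, None]).T @ Xb
        beta += np.linalg.solve(hess, grad)       # Newton step
    return beta

# Simulated data in the spirit of the experiments described above
rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 10))
true_beta = rng.normal(size=10)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

t0 = time.perf_counter()
beta = fit_logit(X, y)
elapsed = time.perf_counter() - t0                # efficiency: elapsed time
Xb = np.hstack([np.ones((5000, 1)), X])
pred = 1.0 / (1.0 + np.exp(-(Xb @ beta))) > 0.5
accuracy = (pred == y).mean()                     # effectiveness: accuracy
print(f"accuracy={accuracy:.3f}, elapsed={elapsed:.3f}s")
```

Repeating such a loop over each method and each (N, p) combination, and averaging over Monte Carlo replications, is the general pattern behind the comparison reported above.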