Comparing Data Mining Models in Loan Default Prediction: A Framework and a Demonstration
In the banking sector, credit risk assessment is an important process for ensuring that loans are repaid on time and that banks maintain their credit performance. Despite the considerable business effort devoted to credit scoring each year, a high rate of loan default remains a major issue. With the availability of large volumes of banking data and advanced analytics tools, data mining algorithms can be applied to develop a credit scoring platform and to address the loan default problem. This paper puts forward a framework for comparing four classification algorithms (logistic regression, decision tree, neural network, and XGBoost) using a public dataset. Confusion matrix and Monte Carlo simulation benchmarks are used to evaluate their performance. We find that XGBoost outperforms the other three, more traditional models. We also offer practical recommendations and directions for future research.
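The evaluation idea described above can be illustrated with a minimal sketch. It assumes binary labels (1 = default) and treats the Monte Carlo element as repeated random train/test splits; the paper's actual procedure, dataset, and metrics are not specified in this abstract. All function names are hypothetical, and the two toy models are stand-ins for logistic regression, decision tree, neural network, and XGBoost, which would be plugged in through the same interface.

```python
import random
from statistics import mean


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 means default."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Derive common summary metrics from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def monte_carlo_compare(X, y, models, n_runs=100, test_frac=0.3, seed=0):
    """Average test accuracy of each model over repeated random splits.

    `models` maps a name to a fit function: fit(X_train, y_train) must
    return a predict(x) callable that outputs 0 or 1. (Assumed interface.)
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    n_test = max(1, int(len(y) * test_frac))
    scores = {name: [] for name in models}
    for _ in range(n_runs):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        y_true = [y[i] for i in test]
        for name, fit in models.items():
            predict = fit([X[i] for i in train], [y[i] for i in train])
            y_pred = [predict(X[i]) for i in test]
            counts = confusion_matrix(y_true, y_pred)
            scores[name].append(metrics(*counts)["accuracy"])
    return {name: mean(vals) for name, vals in scores.items()}


# Toy stand-in models (hypothetical, for illustration only).
def fit_always_zero(X_train, y_train):
    """Baseline that never predicts default."""
    return lambda x: 0


def fit_threshold_rule(X_train, y_train):
    """Fixed rule on the first feature; ignores the training data."""
    return lambda x: 1 if x[0] > 0.5 else 0
```

Calling `monte_carlo_compare(X, y, {"baseline": fit_always_zero, "rule": fit_threshold_rule})` returns one averaged accuracy per model. Averaging over many random splits reduces the dependence of the ranking on any single partition of the data.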