Comparing Data Mining Models in Loan Default Prediction: A Framework and a Demonstration


  • Cuong Nguyen, Faculty of Management, University of Dong A, Vietnam
  • Liang Chen, Paul and Virginia Engler College of Business, West Texas A&M University, USA



In the banking sector, credit risk assessment is an important process for ensuring that loans are repaid on time and that banks maintain their credit performance. Despite the considerable effort that businesses devote to credit scoring every year, a high percentage of loan defaults remains a major issue. With the availability of large volumes of banking data and advanced analytics tools, data mining algorithms can be applied to build a credit scoring platform and to address the loan default problem. This paper puts forward a framework for comparing four classification algorithms (logistic regression, decision tree, neural network, and XGBoost) on a public dataset. Confusion matrix and Monte Carlo simulation benchmarks are used to evaluate their performance. We find that XGBoost outperforms the other three traditional models. We also offer practical recommendations and directions for future research.
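The evaluation idea in the abstract, confusion-matrix metrics averaged over Monte Carlo runs, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's actual procedure: the function names, the single numeric feature, the repeated random train/test split, and the two stand-in classifiers (a majority-class baseline and a one-feature threshold rule) are all hypothetical placeholders for the four algorithms the paper compares.

```python
import random
from statistics import mean


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 means default."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(cm):
    tp, fp, tn, fn = cm
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def recall(cm):
    """Share of actual defaults that the model flags."""
    tp, _, _, fn = cm
    return tp / (tp + fn) if (tp + fn) else 0.0


def fit_majority(x_train, y_train):
    """Hypothetical baseline: always predict the majority training class."""
    majority = 1 if sum(y_train) * 2 >= len(y_train) else 0
    return lambda x: majority


def fit_threshold(x_train, y_train):
    """Hypothetical stand-in model: cut halfway between the class means."""
    pos = [x for x, t in zip(x_train, y_train) if t == 1]
    neg = [x for x, t in zip(x_train, y_train) if t == 0]
    if not pos or not neg:
        return fit_majority(x_train, y_train)
    cut = (mean(pos) + mean(neg)) / 2
    higher_is_default = mean(pos) > mean(neg)
    return lambda x: int((x > cut) == higher_is_default)


def monte_carlo_compare(models, xs, ys, n_runs=100, test_frac=0.3, seed=0):
    """Average test accuracy per model over repeated random splits.

    `models` maps a name to a fit function that returns a predict function.
    """
    rng = random.Random(seed)
    idx = list(range(len(ys)))
    n_test = max(1, int(len(ys) * test_frac))
    scores = {name: [] for name in models}
    for _ in range(n_runs):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        x_tr = [xs[i] for i in train]
        y_tr = [ys[i] for i in train]
        y_te = [ys[i] for i in test]
        for name, fit in models.items():
            predict = fit(x_tr, y_tr)
            y_pred = [predict(xs[i]) for i in test]
            scores[name].append(accuracy(confusion_matrix(y_te, y_pred)))
    return {name: mean(vals) for name, vals in scores.items()}


if __name__ == "__main__":
    # Toy data: one made-up feature per loan, label 1 = default.
    xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    ys = [0, 0, 0, 0, 1, 1, 1, 1]
    models = {"majority": fit_majority, "threshold": fit_threshold}
    print(monte_carlo_compare(models, xs, ys, n_runs=50, test_frac=0.25))
```

In a real comparison, each entry in `models` would wrap a fitted logistic regression, decision tree, neural network, or XGBoost classifier, and metrics such as `recall` would likely matter more than raw accuracy because defaults are usually the minority class.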






How to Cite

Nguyen, C., & Chen, L. (2022). Comparing Data Mining Models in Loan Default Prediction: A Framework and a Demonstration. Journal of Information Technology and Computer Science, 7(1), 1–8.