Abstract
Class imbalance is a common challenge in classification modeling. A model trained on imbalanced classes is likely to perform poorly when predicting new data. Several approaches exist to correct class imbalance, including random oversampling, random undersampling, and the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTE-NC). Each method balances the class distribution of the dataset in a different way. Previous research has not compared classification performance on imbalanced classes handled by these three methods. Therefore, this study evaluates three classification models (Gradient Boosting, Random Forest, and Extremely Randomized Trees) on imbalanced class data handled by each method. The results show that random undersampling gives the best performance for two of the three classification models (Random Forest and Gradient Boosting).
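As a rough illustration of the two random resampling strategies named above, the following is a minimal sketch using only the Python standard library. The function name `random_resample`, its parameters, and the toy data are hypothetical and are not taken from the study. SMOTE-NC is not shown: instead of duplicating or discarding rows, it synthesizes new minority-class rows.

```python
import random
from collections import Counter, defaultdict


def random_resample(rows, labels, mode="under", seed=0):
    """Balance classes by random under- or oversampling (illustrative only)."""
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # Undersampling shrinks every class to the smallest class size;
    # oversampling grows every class to the largest class size.
    sizes = [len(members) for members in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if mode == "under":
            # Draw without replacement, discarding the surplus rows.
            chosen = rng.sample(members, target)
        else:
            # Keep all rows and add duplicates drawn with replacement.
            chosen = members + rng.choices(members, k=target - len(members))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (hypothetical): 8 rows of class 0 and 2 rows of class 1.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

_, y_under = random_resample(rows, labels, mode="under")
_, y_over = random_resample(rows, labels, mode="over")
print(Counter(y_under), Counter(y_over))
```

Undersampling discards majority-class rows, so it reduces the amount of training data, while oversampling repeats minority-class rows exactly. In either case, resampling would normally be applied to the training split only, so that the test data keeps its original class distribution.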
Keywords
- Classification
- Imbalanced Class
- Random Oversampling
- Random Undersampling
- SMOTE-NC