Abstract

A common challenge in classification modeling is class imbalance in the data. If modeling proceeds without addressing the imbalance, the resulting classifier is likely to perform poorly when predicting new data. Several approaches exist to correct class imbalance, including random oversampling, random undersampling, and the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTE-NC). Each balances the class distribution of the dataset in a different way. Classification performance on imbalanced data handled by these three methods has not been compared in previous research. This study therefore evaluates three classification models (Gradient Boosting, Random Forest, and Extremely Randomized Trees) on imbalanced class data balanced with each method. The results show that random undersampling gives the best performance for two of the three models (Random Forest and Gradient Boosting).
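The two random resampling strategies named in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library, and the function names, the toy data, and the choice to balance every class to the smallest (or largest) class size are assumptions made for illustration, not the procedure used in the study. SMOTE-NC is not covered here, since it synthesizes new minority rows from nearest neighbors rather than sampling existing ones.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(labels):
    """Map each class label to the list of row indices that carry it."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return groups


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(labels)
    target = min(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        # Sampling without replacement: majority rows are discarded.
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_oversample(rows, labels, seed=0):
    """Duplicate rows of smaller classes at random up to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(labels)
    target = max(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        keep.extend(idx)
        # Sampling with replacement: minority rows are repeated.
        keep.extend(rng.choices(idx, k=target - len(idx)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 8 majority rows (class 0) and 2 minority rows (class 1),
# each with one continuous and one nominal feature.
rows = [(float(i), "a" if i % 2 else "b") for i in range(10)]
labels = [0] * 8 + [1] * 2

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)

under_counts = Counter(under_labels)
over_counts = Counter(over_labels)
print(under_counts, over_counts)
```

In a typical workflow, resampling of this kind is applied to the training split only, so that the test split keeps the original class distribution.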

Keywords

  • Classification
  • Imbalanced Class
  • Random Oversampling
  • Random Undersampling
  • SMOTE-NC
