Performance of Random Oversampling, Random Undersampling, and SMOTE-NC Methods in Handling Imbalanced Class in Classification Models

Andika Putri Ratnasari

doi:10.18535/ijsrm/v12i04.m03

Abstract

A common challenge in classification modeling is class imbalance in the data. A model trained on imbalanced classes is likely to perform poorly when predicting new data. Several methods exist to correct this imbalance, including random oversampling, random undersampling, and the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTE-NC). Each method balances the class distribution of the dataset in a different way. A comparison of classification performance on imbalanced classes handled by these three methods has not been carried out in previous research. This study therefore evaluates three classification models (Gradient Boosting, Random Forest, and Extremely Randomized Trees) on imbalanced class data treated with each method. The results show that random undersampling gives the best performance for two of the three models (Random Forest and Gradient Boosting).
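The two random resampling strategies named in the abstract can be illustrated with a minimal sketch. It is written in plain Python under stated assumptions and is not the study's implementation: the helper names (`random_oversample`, `random_undersample`), the toy data, and the fixed seed are all hypothetical. SMOTE-NC is not reproduced here, because it synthesizes new minority rows from nearest neighbours rather than copying or dropping existing ones.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(rows, labels):
    """Map each class label to the list of rows that carry it."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes (sampling with
    replacement) until every class is as large as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset (without replacement) of the larger classes so
    that every class is as small as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: 8 majority rows (class 0) and 2 minority rows (class 1),
# each with one continuous and one nominal feature.
X = [[i, "a" if i % 2 else "b"] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # Counter({0: 8, 1: 8})
print(Counter(y_under))  # Counter({0: 2, 1: 2})
```

Oversampling keeps every original row but repeats minority rows, while undersampling discards majority rows. In practice, resampling is usually applied to the training split only, so that the evaluation data keeps its original class distribution.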

Keywords

Classification
Imbalanced Class
Random Oversampling
Random Undersampling
SMOTE-NC

References

G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R. New York: Springer, 2013. doi: 10.1007/978-1-4614-7138-7.
A. Natekin and A. Knoll, “Gradient boosting machines, a tutorial,” Front. Neurorobot., vol. 7, 2013, doi: 10.3389/fnbot.2013.00021.
S. M. Lundberg et al., “From local explanations to global understanding with explainable AI for trees,” Nat. Mach. Intell., vol. 2, no. 1, pp. 56–67, 2020, doi: 10.1038/s42256-019-0138-9.
P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Mach. Learn., vol. 63, no. 1, pp. 3–42, 2006, doi: 10.1007/s10994-006-6226-1.
R. Siringoringo, “Klasifikasi Data Tidak Seimbang Menggunakan Algoritma SMOTE dan K-Nearest Neighbor,” J. ISD, vol. 3, no. 1, pp. 44–49, 2018.
W. Chaipanha and P. Kaewwichian, “Smote Vs. Random Undersampling for Imbalanced Data-Car Ownership Demand Model,” Communications, vol. 24, no. 3, pp. D105–D115, 2022, doi: 10.26552/com.C.2022.3.D105-D115.
S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, “Handling imbalanced datasets: A review,” Science, vol. 30, no. 1, pp. 25–36, 2006.
Q. H. Doan, S. H. Mai, Q. T. Do, and D. K. Thai, “A cluster-based data splitting method for small sample and class imbalance problems in impact damage classification,” Appl. Soft Comput., vol. 120, p. 108628, 2022, doi: 10.1016/j.asoc.2022.108628.
D. T. Utari, “Integration of Svm and Smote-Nc for Classification of Heart Failure Patients,” BAREKENG J. Ilmu Mat. dan Terap., vol. 17, no. 4, pp. 2263–2272, 2023.
M. A. Ganaie, M. Hu, A. K. Malik, M. Tanveer, and P. N. Suganthan, “Ensemble deep learning: A review,” Eng. Appl. Artif. Intell., vol. 115, 2022, doi: 10.1016/j.engappai.2022.105151.
L. Breiman, “Random Forests,” Mach. Learn., vol. 45, pp. 5–32, 2001.
M. Aria, C. Cuccurullo, and A. Gnasso, “A comparison among interpretative proposals for Random Forests,” Machine Learning with Applications, vol. 6, p. 100094, 2021, doi: 10.1016/j.mlwa.2021.100094.
J. Ali, R. Khan, N. Ahmad, and I. Maqsood, “Random forests and decision trees,” IJCSI Int. J. Comput. Sci. Issues, vol. 9, no. 5, pp. 272–278, 2012.
S. Han, H. Kim, and Y. S. Lee, “Double random forest,” Mach. Learn., vol. 109, no. 8, pp. 1569–1586, 2020.
S. E. Suryana, B. Warsito, and S. Suparti, “Penerapan Gradient Boosting Dengan Hyperopt Untuk Memprediksi Keberhasilan Telemarketing Bank,” J. Gaussian, vol. 10, no. 4, pp. 617–623, 2021, doi: 10.14710/j.gauss.v10i4.31335.
J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Ann. Stat., vol. 29, no. 5, pp. 1189–1232, 2001, doi: 10.1214/aos/1013203451.
R. Kohavi and F. Provost, “Glossary of Terms,” Mach. Learn., vol. 30, pp. 271–274, 1998.
J. C. Obi, “A comparative study of several classification metrics and their performances on data,” World Journal of Advanced Engineering Technology and Sciences, vol. 8, no. 1, pp. 308–314, 2023, doi: 10.30574/wjaets.2023.8.1.0054.
