Optimization of K-Means in Disease Clustering of Pregnant Women Using Random Forest

Rezqiwati Ishak, Nurmawanti Nurmawanti, Amiruddin Bengnga

Abstract


The health of pregnant women is an important part of the public health system, and clustering disease data can support risk identification and better treatment planning. However, traditional clustering methods such as K-Means often separate clusters poorly, especially when irrelevant attributes are included. This study aims to optimize K-Means for clustering disease data from pregnant women by applying Random Forest-based attribute selection. Of the six available attributes (age, weight, height, gestational age, systolic blood pressure, and diastolic blood pressure), three were selected according to their Random Forest importance scores: systolic pressure, diastolic pressure, and gestational age. Using only these three attributes raised the Silhouette Score by 0.21 (from 0.23 to 0.44), indicating better cluster separation, and lowered the Davies-Bouldin Index by 0.69 (from 1.50 to 0.81), indicating more compact and better-separated clusters. A visualization of the clusters using Principal Component Analysis (PCA) supports these results. In addition, the Elbow method indicates an optimal number of clusters of k=3, reinforcing the conclusion that choosing the right attributes and number of clusters improves clustering quality. Overall, the results indicate that Random Forest-based feature selection can optimize K-Means for clustering disease data from pregnant women, which is expected to improve the effectiveness of diagnosis and treatment planning.
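The two validity metrics reported in the abstract can be illustrated with a minimal plain-Python sketch. It is not the study's implementation: the data points, the fixed cluster labels (used here in place of a K-Means run), the "irrelevant" third column, and all function names are hypothetical and chosen only to show how an uninformative attribute tends to lower the Silhouette Score and raise the Davies-Bouldin Index. The Random Forest ranking step is not reproduced.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (higher is better)."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        # Mean distance from p to the other members of each cluster.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            continue  # singleton cluster: contributes a silhouette of 0
        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def davies_bouldin_index(points, labels):
    """Mean worst-case ratio of cluster scatter to centroid distance (lower is better)."""
    clusters = sorted(set(labels))
    centroids, scatter = {}, {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        centroids[c] = centroid
        scatter[c] = sum(math.dist(p, centroid) for p in members) / len(members)
    worst = [
        max((scatter[c] + scatter[o]) / math.dist(centroids[c], centroids[o])
            for o in clusters if o != c)
        for c in clusters
    ]
    return sum(worst) / len(clusters)

# Hypothetical, already-scaled toy data: two informative columns plus one
# column that is unrelated to the cluster structure.
all_attributes = [
    (0.0, 0.0, 0.0), (0.2, 0.1, 3.0), (0.1, 0.2, 1.5),
    (2.0, 2.0, 3.0), (2.2, 2.1, 0.0), (2.1, 2.2, 1.5),
    (4.0, 0.0, 1.5), (4.2, 0.1, 3.0), (4.1, 0.2, 0.0),
]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # fixed labels standing in for K-Means output (k=3)

# "Selected" attributes: drop the uninformative third column.
selected_attributes = [row[:2] for row in all_attributes]

sil_all = silhouette_score(all_attributes, labels)
sil_selected = silhouette_score(selected_attributes, labels)
dbi_all = davies_bouldin_index(all_attributes, labels)
dbi_selected = davies_bouldin_index(selected_attributes, labels)

print(f"Silhouette     all={sil_all:.2f}  selected={sil_selected:.2f}")
print(f"Davies-Bouldin all={dbi_all:.2f}  selected={dbi_selected:.2f}")
```

On this toy data, dropping the uninformative column gives a higher Silhouette Score and a lower Davies-Bouldin Index, the same direction of change the abstract reports. The magnitudes are not comparable with the study's figures.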


Keywords


Clustering; Attribute Selection; Importance Score; Silhouette Score; Davies-Bouldin Index

DOI: https://doi.org/10.37905/jjeee.v7i1.28374

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Published by:
Electrical Engineering Department
Faculty of Engineering
State University of Gorontalo
Jenderal Sudirman Street No.6, Gorontalo City, Gorontalo Province, Indonesia
Tel. 0435-821175; 081340032063
Email: redaksijjeee@ung.ac.id / redaksijjeee@gmail.com
