Optimization of K-Means Attribute Selection Using Correlation Matrix in Patient Disease Clustering

Amiruddin Bengnga; Rezqiwati Ishak

doi:10.37905/jjeee.v7i2.28010


Abstract

Patient health is a critical element of public health systems, and grouping disease data can support risk identification and more efficient treatment planning. However, conventional clustering methods such as K-Means often fail to separate clusters well, especially when the attributes used are irrelevant or redundant. This study aims to improve the clustering of patient health data by applying attribute selection based on a Correlation Matrix and Heatmap before running the K-Means algorithm. The data are standardized with StandardScaler, and the number of clusters is determined with the Elbow Method, which indicates three clusters. Attribute selection reduces redundancy and retains important features such as age, height, and body mass index (BMI). The analysis shows that attribute selection markedly improved clustering performance: the Silhouette Score rose from 0.20 to 0.54 (higher is better) and the Davies-Bouldin Index (DBI) fell from 1.60 to 0.63 (lower is better). Visualization of the clustering results with Principal Component Analysis (PCA) shows a clearer separation between clusters, reflecting distinct patient characteristics. These findings confirm the importance of attribute selection in clustering: it yields better-separated groups that can help in understanding patient health patterns and designing more appropriate interventions.

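The workflow summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the synthetic data, the column names (`age`, `height_cm`, `bmi`, `weight_kg`), the correlation threshold of 0.8, and the helper names `drop_correlated`, `elbow_inertias`, and `evaluate_kmeans` are all assumptions made for the example. The abstract does not state which threshold or selection rule was used, and the scores printed here will not match the reported values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler


def drop_correlated(df, threshold=0.8):
    """Drop the later attribute of each pair whose |Pearson r| exceeds threshold.

    The threshold is an assumed value; the abstract does not specify one.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def elbow_inertias(df, k_values=range(1, 8), seed=0):
    """Return the K-Means inertia for each k, for plotting an elbow curve."""
    X = StandardScaler().fit_transform(df)
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


def evaluate_kmeans(df, k=3, seed=0):
    """Standardize, cluster with K-Means, and compute both evaluation metrics."""
    X = StandardScaler().fit_transform(df)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return {
        "silhouette": silhouette_score(X, labels),  # higher is better
        "dbi": davies_bouldin_score(X, labels),     # lower is better
        "pca_2d": PCA(n_components=2).fit_transform(X),  # for a 2-D scatter plot
        "labels": labels,
    }


# Synthetic stand-in for patient records (hypothetical, not the study's data).
rng = np.random.default_rng(42)
n = 300
patients = pd.DataFrame({
    "age": rng.normal(45, 15, n),
    "height_cm": rng.normal(165, 6, n),
    "bmi": rng.normal(25, 5, n),
})
# Weight is derived from BMI and height, so it is largely redundant with BMI.
patients["weight_kg"] = patients["bmi"] * (patients["height_cm"] / 100) ** 2

selected = drop_correlated(patients, threshold=0.8)
before = evaluate_kmeans(patients, k=3)
after = evaluate_kmeans(selected, k=3)

print("kept attributes:", list(selected.columns))
print("all attributes      silhouette=%.2f  DBI=%.2f" % (before["silhouette"], before["dbi"]))
print("selected attributes silhouette=%.2f  DBI=%.2f" % (after["silhouette"], after["dbi"]))
```

Plotting the correlation matrix as a heatmap (for example with seaborn) and the list returned by `elbow_inertias` against k would reproduce the two visual steps the abstract mentions; they are left out here to keep the sketch short.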

Keywords

Heatmap; Silhouette Score; K-Means; Principal Component Analysis (PCA); Elbow




This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Published by:
Electrical Engineering Department
Faculty of Engineering
State University of Gorontalo
Jalan B.J.Habibie Desa Moutong Kecamatan Tilongkabila Kabupaten Bone Bolango
Telp. 0435-821175; 081340032063
