Application of Principal Component Analysis for Variable Reduction in the K-Means Clustering Algorithm

Istina Alya Rosyada, Dina Tri Utari

Abstract


K-Means is a widely used clustering algorithm. One disadvantage is that its clustering performance degrades when the data contain a large number of variables. This problem can be overcome by combining K-Means with Principal Component Analysis (PCA) as a variable reduction method. This study uses seven public welfare indicator variables for West Java Province in 2021 to measure the welfare level of its districts/cities. Based on the eigenvalues, the analysis retained two principal components. K-Means clustering on the PCA-reduced variables produced three clusters as the best solution, containing 12, 8, and 7 districts/cities respectively.
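
The two-stage workflow described in the abstract (PCA for variable reduction, then K-Means on the retained components) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It makes several assumptions: the data are synthetic stand-ins (27 rows by 7 columns, mirroring the 27 districts/cities and seven indicators), since the actual BPS figures are not reproduced here; components are retained when their eigenvalue exceeds 1, which is one common reading of "based on eigenvalues"; and the function names `pca_by_eigenvalue` and `kmeans` are hypothetical helpers written for this sketch.

```python
import numpy as np


def pca_by_eigenvalue(X, threshold=1.0):
    """Project X onto the principal components whose eigenvalue exceeds threshold.

    Assumption: the eigenvalue > 1 rule is used as the retention criterion.
    """
    # Standardize each variable, so the covariance of Z is the correlation matrix of X.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > threshold
    return Z @ eigvecs[:, keep], eigvals


def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initial centroids drawn from the data."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Recompute each centroid; keep the old one if a cluster becomes empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(points, centroids), centroids


# Synthetic stand-in data (NOT the study's data): 27 regions, 7 indicators
# driven by two latent factors plus noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(27, 2))
loadings = rng.normal(size=(2, 7))
X = latent @ loadings + 0.3 * rng.normal(size=(27, 7))

scores, eigvals = pca_by_eigenvalue(X)
labels, centroids = kmeans(scores, k=3)

print("eigenvalues:", np.round(eigvals, 3))
print("components retained:", scores.shape[1])
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Because the input is synthetic, the number of retained components and the cluster sizes printed here will not match the figures reported in the abstract. In practice the number of clusters would also be chosen with a validity index rather than fixed at three in advance.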

Keywords


clustering; K-Means; Principal Component Analysis; public welfare


DOI: https://doi.org/10.37905/jjps.v5i1.18733



Copyright (c) 2024 Jambura Journal of Probability and Statistics

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

