Unsupervised Feature Selection Based on Self-configuration Approaches using Multidimensional Scaling

Ridho Ananda; Atika Ratna Dewi; Maifuza Binti Mohd Amin; Miftahul Huda; Gushelmi Gushelmi

doi:10.34312/jjom.v5i2.20397

Abstract

Researchers often collect many features so that the principal information is not lost. However, a large number of features can cause problems: irrelevant or redundant features reduce the reliability of the analysis results. One solution is feature selection, whose methods are divided into supervised and unsupervised. Supervised feature selection can only be carried out on data containing labels. Unsupervised feature selection follows three approaches: correlation, configuration, and variance. This study proposes an unsupervised feature selection method that combines the correlation and configuration approaches using multidimensional scaling (MDS). The proposed algorithm, MDS-Clustering, uses hierarchical and non-hierarchical clustering. Its results are compared with existing feature selection methods under three schemes, in which 75%, 50%, and 25% of the features are selected. The data used in this study are UCI datasets. Validity is assessed with the goodness-of-fit of the proximity matrix (GoFP) and the accuracy of the classification algorithm. The comparison shows that the proposed feature selection is worth recommending as a new approach in the feature selection process, and that on certain data it outperforms the existing feature selection methods.
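The general idea of clustering features in an MDS configuration can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the details, so the correlation-based dissimilarity, the use of classical MDS, average-linkage hierarchical clustering, and the rule of keeping the feature nearest each cluster centroid are all assumptions, and the function names `classical_mds` and `select_features` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def classical_mds(dissim, n_dims=2):
    """Embed objects from a dissimilarity matrix with classical (Torgerson) MDS."""
    n = dissim.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared dissimilarities to obtain a Gram matrix.
    gram = -0.5 * centering @ (dissim ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Negative eigenvalues (non-Euclidean dissimilarities) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def select_features(X, n_keep, n_dims=2):
    """Return sorted column indices of one representative feature per cluster."""
    corr = np.corrcoef(X, rowvar=False)
    # Assumed dissimilarity: strongly correlated features lie close together.
    dissim = np.sqrt(np.clip(2.0 * (1.0 - np.abs(corr)), 0.0, None))
    coords = classical_mds(dissim, n_dims)
    # Assumed clustering: average-linkage hierarchy cut into at most n_keep groups.
    tree = linkage(coords, method="average")
    labels = fcluster(tree, t=n_keep, criterion="maxclust")
    selected = []
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        centroid = coords[members].mean(axis=0)
        distances = np.linalg.norm(coords[members] - centroid, axis=1)
        # Assumed rule: keep the feature closest to its cluster centroid.
        selected.append(int(members[np.argmin(distances)]))
    return sorted(selected)


# Synthetic example: three independent signals, each with a near-duplicate column.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])
print(select_features(X, n_keep=3))
```

On this synthetic data, each near-duplicate pair should fall into one cluster, so one column from each pair is kept. Replacing the hierarchical step with k-means on `coords` would give a non-hierarchical variant of the same sketch.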

Keywords

Feature Selection; Multidimensional Scaling; Clustering
