Optimizing of IndoBERT Embedding with Ditto Whitening for Measuring Research Title Similarity

Rezqiwati Ishak, Amiruddin Bengnga

Abstract


Measuring the semantic similarity of research titles is a crucial component in maintaining academic originality and preventing topic duplication in higher education. However, embeddings from IndoBERT, a pretrained Indonesian language model, are known to suffer from anisotropy, which causes many titles to receive high similarity scores despite being semantically distinct. This study aims to improve the quality of IndoBERT embeddings through Ditto Whitening and to evaluate its impact on research title similarity measurement. The dataset comprises 7,785 undergraduate thesis titles collected from six disciplinary domains, embedded using mean pooling and L2 normalization, and evaluated before and after whitening. An intrinsic evaluation assessed embedding isotropy, the cosine similarity distribution, global bias toward the mean vector, and hubness, supported by visualizations of the embedding space using t-SNE, UMAP, and cosine similarity heatmaps. Experimental results demonstrate substantial improvements in embedding quality: the Cosine Pair Mean fell from 0.559 to −0.000145, MeanCos-to-Mean fell from 0.748 to 0.0068, and Hubness Skew fell from 1.60 to 0.68. The isotropy of the embeddings also increased markedly, reflecting a more uniform vector distribution. These findings confirm that Ditto Whitening effectively improves the isotropy of IndoBERT embeddings and directly enhances the accuracy of research title similarity detection and academic document retrieval systems, thereby supporting topic management and research quality assurance in higher education.
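The intrinsic metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it applies plain PCA-style whitening to synthetic anisotropic vectors rather than to IndoBERT outputs, it omits the Ditto weighting step, and the function names (l2_normalize, fit_whitening, cosine_pair_mean, mean_cos_to_mean, hubness_skew) as well as the k = 10 neighborhood size are assumptions chosen for illustration. The metric definitions are likewise one plausible reading of the names used in the abstract.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale every row to unit length.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def fit_whitening(x, eps=1e-12):
    # Estimate the mean and a linear map W such that (x - mu) @ W
    # has (approximately) identity covariance.
    mu = x.mean(axis=0, keepdims=True)
    cov = np.cov((x - mu).T)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + eps))
    return mu, w

def cosine_pair_mean(x):
    # Mean cosine similarity over all distinct ordered pairs (i != j).
    x = l2_normalize(x)
    sim = x @ x.T
    n = len(x)
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))

def mean_cos_to_mean(x):
    # Mean cosine similarity between each vector and the mean vector.
    x = l2_normalize(x)
    m = x.mean(axis=0)
    m = m / max(np.linalg.norm(m), 1e-12)
    return float((x @ m).mean())

def hubness_skew(x, k=10):
    # Skewness of the k-occurrence counts: how often each point appears
    # in the k-nearest-neighbor lists (by cosine) of the other points.
    x = l2_normalize(x)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    nn = np.argpartition(-sim, k, axis=1)[:, :k]
    counts = np.bincount(nn.ravel(), minlength=len(x)).astype(float)
    c = counts - counts.mean()
    return float((c ** 3).mean() / (c ** 2).mean() ** 1.5)

# Synthetic stand-in for sentence embeddings (an assumption, not real data):
# a large shared offset makes every vector point in a similar direction,
# which mimics the anisotropy described in the abstract.
rng = np.random.default_rng(0)
raw = 3.0 * np.ones((1, 32)) + rng.normal(size=(500, 32))
emb = l2_normalize(raw)

mu, w = fit_whitening(emb)
whitened = l2_normalize((emb - mu) @ w)

before_cpm = cosine_pair_mean(emb)
after_cpm = cosine_pair_mean(whitened)

print("Cosine Pair Mean :", before_cpm, "->", after_cpm)
print("MeanCos-to-Mean  :", mean_cos_to_mean(emb), "->", mean_cos_to_mean(whitened))
print("Hubness Skew     :", hubness_skew(emb), "->", hubness_skew(whitened))
```

On this synthetic data the pairwise mean cosine drops from a high value to approximately zero after whitening, which mirrors the direction of the reported change in Cosine Pair Mean. The magnitudes are not comparable to the study's results, and the direction of the hubness change on synthetic data should not be read as evidence either way.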


Keywords


IndoBERT; Whitening; Embedding; Semantic Similarity; Research Title




DOI: https://doi.org/10.37905/jjeee.v8i1.35554


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Published by:
Electrical Engineering Department
Faculty of Engineering
State University of Gorontalo
Jalan B.J.Habibie Desa Moutong Kecamatan Tilongkabila Kabupaten Bone Bolango
Telp. 0435-821175; 081340032063
