Exploratory Data Analysis on TMDb Top Rated Movies Dataset Using Winsorization Approach

M Fauzan Sayifullah, Ulfatun Nadifa, Wahab Musa, Rahmat Hidayat Dongka, Ade Irawaty Tolango, Zainudin Bonok

Abstract


This study presents an Exploratory Data Analysis (EDA) of the top-rated movies dataset from The Movie Database (TMDb), covering releases from 1902 to 2026. The objectives are to clean the data, identify distributional biases, and prepare the dataset for predictive modeling. The approach comprises missing-value imputation, evaluation of skewness, and correlation analysis. The findings reveal strong positive skewness in the popularity and vote count variables, as well as a temporal bias toward modern-era movies. To handle extreme values (outliers), the Interquartile Range (IQR) method was combined with capping (Winsorization). As a result, the data distribution became more stable without losing the representative information carried by blockbuster movies. Correlation analysis revealed a strong positive relationship (correlation of 0.62) between popularity and vote count, and multicollinearity (correlation of 0.97) between the month and quarter variables, which should be resolved in the subsequent machine learning phase.
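The IQR-based capping described above can be illustrated with a minimal sketch. It is not the authors' code: the synthetic log-normal data, the column name `vote_count`, the helper `iqr_cap`, and the fence multiplier of 1.5 (Tukey's usual convention) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def iqr_cap(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside the fences [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper; k = 1.5 is assumed, not taken from the paper.
    """
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Values beyond a fence are replaced by the fence itself, so no rows are dropped.
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Synthetic right-skewed data standing in for a column such as vote count.
rng = np.random.default_rng(0)
votes = pd.Series(rng.lognormal(mean=6.0, sigma=1.2, size=5000), name="vote_count")

capped = iqr_cap(votes)

# Compare sample skewness before and after capping.
print(f"skewness before: {votes.skew():.2f}")
print(f"skewness after:  {capped.skew():.2f}")
print(f"rows kept: {len(capped)} of {len(votes)}")
```

Because capping replaces extreme values rather than deleting rows, very popular titles stay in the dataset, only with their most extreme values pulled in to the fence. This is consistent with the abstract's point about retaining information from blockbuster movies.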

Keywords


EDA; IQR; TMDb Dataset; Movie; Winsorization


Copyright (c) 2026 International Journal of Embedded Computer Engineering

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.