Clustering and Classification of Retail Sales Data: A Big Data and Data Mining Analysis

Ahmad Bilal Almagribi; Sri Redjeki

doi:10.56347/jics.v4i2.303

Authors Ahmad Bilal Almagribi, Sri Redjeki

Affiliations

Ahmad Bilal Almagribi: Universitas Teknologi Digital Indonesia

Sri Redjeki: Universitas Teknologi Digital Indonesia

Published 2025-11-30

Section Article

DOI https://doi.org/10.56347/jics.v4i2.303

Issue Vol. 4 No. 2 (2025): November

Abstract

In the evolving retail landscape, data-driven decision-making has become essential for understanding customer behavior and predicting sales trends. This study integrates clustering and classification techniques to analyze retail sales data comprising 1,000 transactions obtained from Kaggle. Using the K-Means algorithm, three optimal customer clusters were identified through the Elbow Method, achieving an average within-centroid distance of 25,272.635 and a Davies–Bouldin Index of 0.443, indicating clear cluster separation. The subsequent classification phase compared the predictive performance of three algorithms—Naïve Bayes, Decision Tree, and Random Forest—on 70:30 training-to-testing data partitions. The Naïve Bayes algorithm attained 94.67% accuracy, while both Decision Tree and Random Forest achieved perfect classification accuracy of 100%. These findings highlight the robustness and adaptability of tree-based models for complex retail datasets, outperforming probabilistic methods in terms of accuracy and generalization. The results suggest that the integration of clustering and classification provides retailers with a powerful analytical framework for identifying high-value customer segments, optimizing marketing strategies, and enhancing inventory management. Despite achieving strong outcomes, the study acknowledges dataset limitations and recommends future research involving larger and more diverse datasets, as well as additional features, to expand model scalability and predictive precision.
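The Davies–Bouldin Index quoted above rewards clusters that are tight relative to the distance between their centroids, so lower values indicate better separation. The sketch below shows how such a score can be computed from cluster assignments. It is an illustration only, not the authors' implementation: the function names, the toy points, and the two labelings are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a hard clustering (lower is better).

    Assumes Euclidean distance and that no two clusters share a centroid.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    keys = sorted(clusters)
    cents = {k: centroid(clusters[k]) for k in keys}
    # Scatter: mean distance from each member to its own centroid.
    scatter = {
        k: sum(math.dist(p, cents[k]) for p in clusters[k]) / len(clusters[k])
        for k in keys
    }
    # For each cluster, keep its worst (largest) similarity ratio
    # against any other cluster, then average those worst cases.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in keys
            if j != i
        )
        for i in keys
    ]
    return sum(worst) / len(keys)


# Hypothetical toy data: three well-separated pairs of 2-D points.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2), (5, 10), (5, 12)]
good = davies_bouldin(toy_points, [0, 0, 1, 1, 2, 2])  # pairs kept together
bad = davies_bouldin(toy_points, [0, 1, 2, 0, 1, 2])   # pairs split apart

print(f"compact labeling: {good:.3f}, scrambled labeling: {bad:.3f}")
```

On the toy data the labeling that keeps each pair together scores well below the scrambled one, which is the sense in which a low value such as the reported 0.443 is read as clear separation.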

Keywords

Retail Analytics; K-Means; Naïve Bayes; Decision Tree; Random Forest

Author Biographies

Ahmad Bilal Almagribi

Master of Information Technology, Universitas Teknologi Digital Indonesia, Bantul Regency, Special Region of Yogyakarta, Indonesia.

Sri Redjeki

Master of Information Technology, Universitas Teknologi Digital Indonesia, Bantul Regency, Special Region of Yogyakarta, Indonesia.

Almagribi, A. B., & Redjeki, S. (2025). Clustering and Classification of Retail Sales Data: A Big Data and Data Mining Analysis. Journal Innovations Computer Science, 4(2), 242-253. https://doi.org/10.56347/jics.v4i2.303

Volume: 4
Issue: 2
Pages: 242-253
Published: 2025-11-30
Section: Article
ISSN: 2961-970X

This work is licensed under a Creative Commons Attribution 4.0 International License.

Articles in this journal are published under the Creative Commons Attribution Licence (CC-BY 4.0). This means that users may share and adapt the articles published on this website in a reasonable manner, but they must give appropriate credit to the creator and indicate the changes they have made. Users must not apply additional restrictions, but must publish the work under the same license (CC-BY 4.0).
