Clustering and Classification of Retail Sales Data: A Big Data and Data Mining Analysis

Abstract

In the evolving retail landscape, data-driven decision-making has become essential for understanding customer behavior and predicting sales trends. This study combines clustering and classification techniques to analyze a retail sales dataset of 1,000 transactions obtained from Kaggle. The K-Means algorithm, with the number of clusters selected by the Elbow Method, identified three customer clusters with an average within-centroid distance of 25,272.635 and a Davies–Bouldin Index of 0.443, indicating clear cluster separation. The subsequent classification phase compared the predictive performance of three algorithms (Naïve Bayes, Decision Tree, and Random Forest) on a 70:30 training-to-testing split. Naïve Bayes attained 94.67% accuracy, while both Decision Tree and Random Forest achieved 100% accuracy. These findings highlight the robustness and adaptability of tree-based models for complex retail datasets: on this dataset they outperformed the probabilistic method in accuracy and generalization. The results suggest that integrating clustering and classification gives retailers a practical analytical framework for identifying high-value customer segments, optimizing marketing strategies, and improving inventory management. Despite these strong outcomes, the study acknowledges the limitations of its dataset and recommends that future research use larger and more diverse datasets, as well as additional features, to improve model scalability and predictive precision.
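The abstract reports cluster quality as a Davies–Bouldin Index, where lower values mean tighter and better separated clusters. The following is a minimal sketch of how that index is commonly computed, not the authors' implementation. The function names and the four-point toy dataset are hypothetical and exist only for illustration; they are not taken from the study's Kaggle data.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


def davies_bouldin(points, labels):
    """Davies-Bouldin Index for a labelled clustering (lower is better).

    For each cluster i, take the worst ratio
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j)
    over all other clusters j, then average those worst ratios.
    """
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    keys = sorted(clusters)
    centers = {k: centroid(clusters[k]) for k in keys}

    # Scatter: mean Euclidean distance from a cluster's points to its centroid.
    scatter = {
        k: sum(math.dist(p, centers[k]) for p in clusters[k]) / len(clusters[k])
        for k in keys
    }

    total = 0.0
    for i in keys:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in keys
            if j != i
        )
    return total / len(keys)


# Hypothetical toy data: two compact clusters ten units apart.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_labels = [0, 0, 1, 1]
toy_index = davies_bouldin(toy_points, toy_labels)
```

In the toy case each cluster has a scatter of 1 and the centroids are 10 units apart, so the index is (1 + 1) / 10 = 0.2. Moving the two clusters closer together raises the value, which is why a small index is read as clear separation.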

How to Cite

Almagribi, A. B., & Redjeki, S. (2025). Clustering and Classification of Retail Sales Data: A Big Data and Data Mining Analysis. Journal Innovations Computer Science, 4(2), 242-253. https://doi.org/10.56347/jics.v4i2.303

Article Details

  • Volume: 4
  • Issue: 2
  • Pages: 242-253
  • Published:
  • Section: Article
  • Copyright: 2025
  • ISSN: 2961-970X

License

Articles in this journal are published under the Creative Commons Attribution Licence (CC-BY 4.0). This means that users may share and adapt the articles published on this website in a reasonable manner, but they must give appropriate credit to the creator and indicate the changes they have made. Users must not apply additional restrictions, but must publish the work under the same license (CC-BY 4.0).
