Comparative Analysis of Distance Measures in the K-Means Algorithm Based on Sum of Square Error

Stendy Budi Hartono Sakur(1*)
(1) Politeknik Negeri Nusa Utara
(*) Corresponding Author
DOI : 10.35889/progresif.v19i2.1276

Abstract

Marketing strategy needs to follow the habits of visitors and buyers, which are closely tied to people's income levels. Visitor data can be mined to reveal the characteristics of each record. The purpose of this research is to compare distance measures in the k-means clustering algorithm in terms of the optimal value of k obtained and the time required. K-means clustering is run with the Euclidean, Manhattan, Minkowski, Chebyshev, and Canberra distances to measure the closeness between objects. The value of k is determined with the Elbow curve built from the Sum of Square Error (SSE), with the Mean of Square Error (MSE) also taken into account. The results show that the Euclidean, Manhattan, Minkowski, and Chebyshev distances all produce appropriate groupings, so the latter three can serve as alternatives to the Euclidean distance. The Manhattan distance was the fastest at 1.70 seconds, followed by the Euclidean distance at 1.78 seconds, the Minkowski distance at 1.82 seconds, the Chebyshev distance at 2.30 seconds, and the Canberra distance at 2.48 seconds. In conclusion, the Euclidean, Manhattan, Minkowski, and Chebyshev distances can be used to measure closeness between objects with good accuracy, while the Canberra distance does not give accurate results. The research produced five groups with different income and spending characteristics, which can serve as a basis for developing marketing strategies.

Keywords: K-means; Euclidean; Manhattan; Minkowski; Chebyshev; Canberra; Sum of square error; Mean of square error
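
The comparison summarized in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes a plain Lloyd-style k-means in which only the assignment step uses the chosen distance and centroids are updated as arithmetic means. It further assumes that SSE is the sum of squared Euclidean deviations from the assigned centroid, that MSE is SSE divided by the number of points, and that the Minkowski exponent is p = 3. The function names and the toy (income, spending) pairs are hypothetical and are not taken from the study's dataset.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p=3):
    # p is an assumed exponent; p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    # Terms whose denominator is zero are skipped (treated as 0).
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total


def kmeans(points, k, dist, max_iter=100, seed=0):
    """Lloyd-style k-means: assign with `dist`, update centroids as means."""
    points = [tuple(p) for p in points]
    centroids = random.Random(seed).sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid under the chosen distance.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Update step: mean of the members; an empty cluster keeps its centroid.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members)
                                     for c in zip(*members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def sse(points, labels, centroids):
    # Assumed definition: squared Euclidean deviation from the assigned centroid.
    return sum(euclidean(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))


if __name__ == "__main__":
    # Hypothetical (income, spending) pairs, for illustration only.
    data = [(15, 39), (16, 81), (17, 6), (60, 50), (62, 55),
            (61, 48), (120, 15), (126, 80), (124, 20)]
    for k in range(1, 6):
        labels, centroids = kmeans(data, k, manhattan)
        total = sse(data, labels, centroids)
        print(f"k={k}  SSE={total:.1f}  MSE={total / len(data):.1f}")
```

Plotting SSE against k and looking for the point where the curve flattens gives the Elbow reading described in the abstract. Passing `euclidean`, `minkowski`, `chebyshev`, or `canberra` in place of `manhattan` repeats the run under another distance.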

 



