A Comparative Performance Analysis of Vision Transformers and ConvNeXt Models for Image Recognition of the Balinese Keris Warangka
Abstract
The application of attention mechanisms in image recognition has emerged as a new paradigm in computer vision, serving as a foundational approach in generative AI. Two state-of-the-art models frequently referenced in recent studies are Vision Transformers (ViT), introduced by Google, and ConvNeXt, developed by Meta (Facebook) AI Research. However, their application in recognizing local cultural imagery, such as the warangka (sheath) of the Balinese keris, remains highly limited. The urgency of this study lies in evaluating the effectiveness of AI models in supporting technology-based cultural preservation. This study aims to compare the performance of these two models in the classification and recognition of warangka keris (Balinese keris sheaths). The methodology involves data augmentation, feature extraction, patch processing (for ViT), model construction, evaluation, and image recognition analysis using Grad-CAM. The dataset combines primary and secondary sources: primary data were collected through field visits to keris-making workshops in Bali, while secondary data were obtained from previous studies. The keris sheath image classes used in this study are 'Sesrengatan', 'Kojongan', 'Batun Poh', 'Kekandikan', and 'Beblatungan'. The study developed image classification models that achieved an accuracy of 82% with the ViT model and 97% with the ConvNeXt model. The recognition process effectively highlighted the most significant regions of each image, providing valuable insight for future generative AI research.
Keywords: Attention, ConvNeXt, Keris Bali, Vision Transformers
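To make the compared pipelines concrete, the sketch below shows how the two backbones could be prepared for the five warangka classes. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the exact model variants, framework, or input size, so the timm model names, the 224x224 resolution, and the training step are hypothetical choices. The patch processing step for ViT means each 224x224 image is split into a 14x14 grid of 16x16 patches, i.e., 196 tokens, before self-attention; ConvNeXt instead passes the image through modernized convolutional stages.

```python
# Minimal sketch (assumptions: timm backbones, ImageNet weights, 224x224
# inputs; the paper does not specify its exact variants or framework).
import timm
import torch
import torch.nn as nn

CLASSES = ['Sesrengatan', 'Kojongan', 'Batun Poh', 'Kekandikan', 'Beblatungan']

def build_model(name: str) -> nn.Module:
    # Load an ImageNet-pretrained backbone with a fresh 5-way head
    # for the warangka classes.
    return timm.create_model(name, pretrained=True, num_classes=len(CLASSES))

# ViT-B/16: each 224x224 image becomes 196 patch tokens (a 14x14 grid of
# 16x16 patches) processed by self-attention.
vit = build_model('vit_base_patch16_224')
# ConvNeXt-Tiny: the same image is processed by modernized conv stages.
convnext = build_model('convnext_tiny')

def train_step(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    # One identical optimization step for either model, so the comparison
    # differs only in the backbone.
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training both models on the same augmented data with the same budget ensures that an accuracy gap, such as the 82% versus 97% reported here, is attributable to the backbone rather than the setup.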
Abstrak
The application of attention in image recognition has become a new approach to image understanding and has the potential to serve as a benchmark in the development of generative artificial intelligence. Two recent and widely studied models are Vision Transformers (ViT) from Google and ConvNeXt from Meta AI. However, their application to local cultural imagery such as the warangka of the Balinese keris remains very limited. The urgency of this research lies in evaluating the effectiveness of artificial intelligence models in supporting technology-based cultural preservation. This study aims to compare the performance of ViT and ConvNeXt in the classification and recognition of Balinese keris warangka images. The methods used include data augmentation, feature extraction, patch processing (for ViT), model construction, testing, and Grad-CAM analysis. The data combine primary data (gathered during visits to Balinese keris-making workshops) and secondary data from various sources. The keris image classes used are 'Sesrengatan', 'Kojongan', 'Batun Poh', 'Kekandikan', and 'Beblatungan'. The results show accuracies of 82% (ViT) and 97% (ConvNeXt), and the most important regions of each image were successfully identified, providing a benchmark for generative work.
Keywords: Attention; ConvNeXt; Keris Bali; Vision Transformers
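The Grad-CAM analysis mentioned in the abstract can be sketched as follows. This is a generic implementation of the published Grad-CAM technique, not the authors' code: the heatmap is the ReLU of a chosen layer's activations weighted by the spatial mean of their gradients. It applies directly to convolutional feature maps such as a late ConvNeXt stage; for ViT, the (batch, tokens, dim) activations must first be reshaped into the patch grid (14x14 for 196 patches).

```python
# Generic Grad-CAM sketch (an assumption: the paper does not describe its
# exact tooling). Hooks capture a layer's activations and gradients; the
# heatmap is the ReLU of the gradient-weighted activation sum.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(image.unsqueeze(0))   # image: (3, H, W) tensor
        logits[0, class_idx].backward()      # gradient w.r.t. one class score
    finally:
        h1.remove(); h2.remove()
    # Weight each channel by its average gradient, then sum and rectify.
    weights = grads['g'].mean(dim=(2, 3), keepdim=True)           # (1, C, 1, 1)
    cam = F.relu((weights * acts['a']).sum(dim=1, keepdim=True))  # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[1:], mode='bilinear',
                        align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()  # (H, W) in [0, 1]
```

For example, grad_cam(convnext, image, convnext.stages[-1], class_idx) would highlight the sheath regions driving a given class prediction; the stages[-1] attribute path is an assumption about timm's ConvNeXt layout.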