A Contextual and Embedded Approach for Part-of-Speech Tagging for Assamese-English Code-Mixed Text

Afsana Laskar

doi:10.52783/jes.8292

Published: Apr 25, 2024

DOI: https://doi.org/10.52783/jes.8292

Keywords:

Assamese-English, POS tagging, HMM, LSTM, XLM-RoBERTa, CRF

Afsana Laskar, Shikhar Kumar Sarma, Jessica Saikia, Dikshita Borah

Abstract

Part-of-speech (POS) tagging is an essential step in natural language processing (NLP) that assigns each word in a text to its appropriate grammatical category, such as noun, verb, or adjective. POS tagging of code-mixed text is challenging because of language mixing and the lack of large annotated datasets, especially for low-resource languages like Assamese. This paper presents a comparative analysis of different POS tagging models, with the aim of developing a hybrid system for POS tagging of Assamese-English code-mixed text. Each model's performance is evaluated on an Assamese-English code-mixed dataset using metrics such as accuracy, precision, and recall. Based on these findings, we propose a contextual and embedded POS tagging system that combines the strengths of the XLM-RoBERTa and CRF models to enhance overall tagging accuracy.
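The evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `tagging_metrics`, the choice of macro averaging over tags, and the toy tag sequences are all assumptions made for illustration, since the abstract does not say how precision and recall were aggregated.

```python
from collections import Counter


def tagging_metrics(gold, pred):
    """Token-level accuracy plus macro-averaged precision and recall.

    Hypothetical helper: macro averaging is an assumption, not a detail
    taken from the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one token is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct tag
        else:
            fp[p] += 1   # predicted tag was wrong
            fn[g] += 1   # gold tag was missed

    labels = sorted(set(gold) | set(pred))
    precisions, recalls = [], []
    for tag in labels:
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        # A tag that is never predicted (or never occurs) scores 0.0 here.
        precisions.append(tp[tag] / predicted if predicted else 0.0)
        recalls.append(tp[tag] / actual if actual else 0.0)

    return {
        "accuracy": sum(tp.values()) / len(gold),
        "macro_precision": sum(precisions) / len(labels),
        "macro_recall": sum(recalls) / len(labels),
    }


# Toy gold and predicted tag sequences (illustrative only, not from the dataset).
gold = ["PRON", "NOUN", "VERB", "ADJ", "NOUN"]
pred = ["PRON", "NOUN", "VERB", "NOUN", "NOUN"]
scores = tagging_metrics(gold, pred)
```

Macro averaging weights every tag equally, so rare tags that a model never predicts pull the score down. Micro or weighted averaging would be equally valid choices and would give different numbers on imbalanced tag sets.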

Issue

Vol. 20 No. 3 (2024)

Section

Articles

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
