ACE-Mix: A Dataset for Assamese-English Code-Mixed Language Processing

Afsana Laskar, Shikhar Kumar Sarma, Jessica Saikia, Dikshita Borah

Abstract

Code-mixing plays a crucial role in easing communication in linguistically diverse societies. With easy access to the internet and social media platforms, there has been an unprecedented rise in the use of multiple languages in communication. In a multilingual and multiscript society such as India, people often switch between languages on social media. Code-mixing is the use of two or more languages within the same sentence or passage. The phenomenon has become more common in the last few years as communication channels have become more widely available. It has become increasingly important in natural language processing tasks because it is prevalent across many domains and can be used to group users accurately by their regional and linguistic traits. It poses both a significant challenge and an opportunity for traditional NLP systems, which predominantly depend on monolingual resources even when processing multilingual input. Finding a suitable dataset for low-resource languages is difficult.

