Code-Mixed Text Analysis & Generation

Developing tools and methodologies for analyzing multilingual and code-switched text

Code-mixing is the tendency of multilingual speakers to alternate between two or more languages. It occurs predominantly in speech and in informal text such as user-generated content (UGC): comments, posts, and the like.

Code-mixed text is prevalent in online social networks (OSNs) and UGC. Even if it makes up only a small percentage of content, at scale it is still a problem worth the effort. The motivations for code-mixing are sociolinguistic and psycholinguistic: for example, a speaker may want to sound informal, may not know a word in their primary language, or may want to express a particular emotion.

Tweet expressing an opinion by mixing English and Hindi.

Code-mixing occurs at various levels: across sentences, within a sentence, and within a word.

Different ways of code-mixing

Our work on code-mixed text deals with three primary aspects:

Analysis of code-mixed text: improving methods to linguistically analyze code-mixed text
Generation of code-mixed text: improving generation by building quality-control measures over synthetic code-mixed text
Build a toolkit that combines various data resources, models, and pipelines suitable for code-mixed text

Overview of our work in code-mixing

Syntactic Analysis of Code-Mixing

To capture the variety of code-mixing within and across corpora, measures based on Language ID (LID) tags, such as the Code-Mixing Index (CMI), have been proposed. The syntactic variety and patterns of code-mixing, and their relationship to the performance of computational models, remain underexplored. In this work, we investigate a collection of English (en)-Hindi (hi) code-mixed datasets through a syntactic lens and propose SyMCoM, an indicator of syntactic variety in code-mixed text with intuitive theoretical bounds. We train a state-of-the-art en-hi PoS tagger (93.4% accuracy) to reliably compute PoS tags on a corpus, demonstrate the utility of SyMCoM by applying it to various syntactic categories across a collection of datasets, and compare the datasets using the measure. Please refer to our paper (Kodali et al., 2022) for more details; it was accepted at Findings of ACL 2022.

Comparing SyMCoM to LID-based metrics
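
To make the contrast between an LID-based measure and a syntactic one concrete, here is a minimal Python sketch. It assumes the commonly used utterance-level form of CMI and a per-category dominance score in the spirit of SyMCoM; the exact definitions are in the paper, so treat the formulas, the function names, and the hand-tagged toy sentence as illustrative assumptions rather than the released implementation.

```python
from collections import Counter

def cmi(lid_tags, neutral="other"):
    """Code-Mixing Index of one utterance, in [0, 100] (assumed formulation)."""
    n = len(lid_tags)
    lang_counts = Counter(t for t in lid_tags if t != neutral)
    u = n - sum(lang_counts.values())  # language-independent tokens
    if n == u:  # no language-tagged tokens at all
        return 0.0
    return 100.0 * (1 - max(lang_counts.values()) / (n - u))

def symcom_per_category(tokens, l1="en", l2="hi"):
    """tokens: list of (lid, pos) pairs. Returns {pos: score in [-1, 1]}.

    +1 means the category is realised only in l1, -1 only in l2,
    0 means both languages contribute equally (assumed formulation).
    """
    counts = Counter((pos, lid) for lid, pos in tokens if lid in (l1, l2))
    scores = {}
    for pos in {p for p, _ in counts}:
        c1, c2 = counts[(pos, l1)], counts[(pos, l2)]
        scores[pos] = (c1 - c2) / (c1 + c2)
    return scores

def symcom_sentence(tokens, l1="en", l2="hi"):
    """Frequency-weighted mean of |per-category score|, in [0, 1]."""
    per_cat = symcom_per_category(tokens, l1, l2)
    pos_counts = Counter(pos for lid, pos in tokens if lid in (l1, l2))
    total = sum(pos_counts.values())
    if total == 0:
        return 0.0
    return sum(pos_counts[p] / total * abs(s) for p, s in per_cat.items())

# Hand-tagged toy sentence (hypothetical, not taken from any dataset):
# "movie bahut acchi thi but ending was boring"
toy = [("en", "NOUN"), ("hi", "ADV"), ("hi", "ADJ"), ("hi", "AUX"),
       ("en", "CCONJ"), ("en", "NOUN"), ("en", "AUX"), ("en", "ADJ")]

print(cmi([lid for lid, _ in toy]))   # 37.5
print(symcom_per_category(toy))       # NOUN: 1.0, ADV: -1.0, ADJ: 0.0, ...
print(symcom_sentence(toy))           # 0.5
```

CMI sees only that 5 of the 8 tokens are English, whereas the per-category scores show which syntactic categories each language contributes: here the nouns are all English while adjectives and auxiliaries are split evenly.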

Unravelling Acceptability in Code-Mixed Sentences

Current computational approaches for analysing or generating code-mixed sentences do not explicitly model “naturalness” or “acceptability” of code-mixed sentences, but rely on training corpora to reflect distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text.

Comparing code-mixed sentences for their acceptability.

To this end, we construct Cline, a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind, with 16,642 sentences drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter, curate, and compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT’s zero- and few-shot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgements using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training. Please refer to our paper (Kodali et al., 2024) for more details; the paper is under review at a journal.
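
Two of the metrics named above, Number of Switch Points and Burstiness, depend only on the sequence of per-token language tags. The sketch below assumes commonly used definitions: a switch point is a change of language between adjacent language-tagged tokens, and burstiness is (sigma - mu) / (sigma + mu) over the lengths of same-language runs. The function names, the use of the population standard deviation, and the choice to skip neutral tokens are assumptions made for illustration, not the paper's released code.

```python
from itertools import groupby
from statistics import mean, pstdev

def _language_tokens(lid_tags, neutral="other"):
    """Drop language-independent tokens (punctuation, named entities, ...)."""
    return [t for t in lid_tags if t != neutral]

def switch_points(lid_tags, neutral="other"):
    """Number of positions where the language changes between adjacent tokens."""
    langs = _language_tokens(lid_tags, neutral)
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)

def burstiness(lid_tags, neutral="other"):
    """(sigma - mu) / (sigma + mu) over same-language span lengths, in [-1, 1).

    -1 means perfectly regular switching (all spans the same length);
    values closer to 1 mean short spans interleaved with very long ones.
    """
    langs = _language_tokens(lid_tags, neutral)
    spans = [len(list(group)) for _, group in groupby(langs)]
    if not spans:
        return 0.0
    mu, sigma = mean(spans), pstdev(spans)
    return (sigma - mu) / (sigma + mu)

tags = ["en", "hi", "hi", "hi", "en", "en", "en", "en"]
print(switch_points(tags))  # 2
print(burstiness(tags))     # negative: span lengths 1, 3, 4 are fairly regular
```

Neither function looks at which words are switched or whether the switch is grammatical, which is one intuition for why such surface metrics can correlate poorly with human acceptability judgements.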

Task-Oriented Dialog Dataset for Code-mixed Languages

Efforts on task-oriented dialogue agents have predominantly concentrated on a few widely spoken languages, limiting global adoption of dialogue technology. We created a multi-domain, large-scale, high-quality task-oriented dialogue benchmark by translating the Chinese RiSAWOZ data into Hindi and code-mixed English-Hindi. The dataset was translated in parallel into other languages (English, French, and Korean) by other collaborators in the work, and together these form the X-RiSAWOZ datasets. Please refer to our paper (Moradshahi et al., 2023) for more details; it was accepted at Findings of ACL 2023.

Our Ongoing Efforts

Adapting Multilingual Language Models for Code-Mixed Settings: Multilingual language models are increasingly used in code-mixed settings, and we are working on adapting them to these settings. We aim to leverage all available data resources (labeled and unlabeled, monolingual and code-mixed) to improve model performance. Figure 6.1 presents various data availability scenarios for code-mixed tasks. Ideally, access to both labeled and unlabeled data allows for continued pre-training followed by fine-tuning. However, labeled or unlabeled data might not be available for the languages concerned, yet be present for other languages, whether monolingual or code-mixed. At the other extreme, neither task-specific labeled data nor unlabeled data is available. This raises the question of what the optimal strategy is for building models of code-mixing under different resource availability scenarios. We intend to use recent parameter-efficient and modular techniques, such as adapters and model augmentation, to exploit the different data resources effectively and improve the performance of multilingual language models on code-mixed tasks. We expect this work to be completed by December 2024.
Toolkit Development: We are developing a toolkit comprising all the resources and models developed in the thesis, for easy access and use by the research community. The toolkit will provide the tools needed to standardize pipelines for computational code-mixing research. It will contain resources created as part of the thesis, along with previously published resources released by the research community. The toolkit will be released as an open-source project and will be maintained for future updates and contributions. The essential components of the toolkit are as follows:
- Data Resources
  - Large-scale synthetic unlabeled corpora for En-Hi, En-Kn, and En-Te, generated using GCM.
  - Annotated task-specific datasets (Dialog, Acceptability) for En-Hi combined
  - Tools for creating/curating code-mixed data: a modified GCM toolkit for ease of use; a rule-based approach for replacing tokens/phrases.
- Analysis Tools
  - Tools for computing various code-mixing metrics (e.g., CMI, SyMCoM)
  - Visualization tools for code-mixed sentences
- Models - Trained checkpoints, and codebase for training
  - PoS Tagger
  - Acceptability Classifier
  - Multilingual pretrained models domain-adapted to En-Hi code-mixed text.
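
As an illustration of the rule-based token replacement mentioned under Data Resources, the sketch below swaps tokens of a romanized Hindi sentence for English equivalents from a small lexicon and tags each output token with its language. The function name, the lexicon, the `max_replacements` cap, and the example sentence are all hypothetical; the rules the toolkit actually uses are not described here.

```python
def replace_tokens(tokens, lexicon, src_lang="hi", tgt_lang="en",
                   max_replacements=None):
    """Replace tokens found in `lexicon` and return (token, lid) pairs.

    tokens: list of source-language tokens
    lexicon: dict mapping lowercase source tokens to target-language tokens
    max_replacements: optional cap on how many tokens are swapped
    """
    mixed = []
    replaced = 0
    for tok in tokens:
        under_cap = max_replacements is None or replaced < max_replacements
        if tok.lower() in lexicon and under_cap:
            mixed.append((lexicon[tok.lower()], tgt_lang))
            replaced += 1
        else:
            mixed.append((tok, src_lang))
    return mixed

# Hypothetical one-entry lexicon and sentence, for illustration only.
lexicon = {"kitab": "book"}
source = "mujhe yeh kitab bahut pasand hai".split()

mixed = replace_tokens(source, lexicon)
print(" ".join(tok for tok, _ in mixed))  # mujhe yeh book bahut pasand hai
print([lid for _, lid in mixed])          # ['hi', 'hi', 'en', 'hi', 'hi', 'hi']
```

Returning language tags alongside the tokens means the output can be passed directly to LID-based metric functions without a separate language identification step.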

For more details on our work, please feel free to reach out to: Prashant Kodali

Related Publications

2024

ACM
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Manish Shrivastava, and Ponnurangam Kumaraguru

2024

Current computational approaches for analysing or generating code-mixed sentences do not explicitly model "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, Burstiness, which are used to filter/curate/compare code-mixed corpora have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT’s zero- and few-shot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgments using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training.
@inproceedings{kodali2024humanjudgementspredictivemodels, title = {From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences}, author = {Kodali, Prashant and Goel, Anmol and Asapu, Likhith and Bonagiri, Vamshi Krishna and Govil, Anirudh and Choudhury, Monojit and Shrivastava, Manish and Kumaraguru, Ponnurangam}, year = {2024}, booktitle = {}, }

2023

ACL
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents

Mehrad Moradshahi, Tianhao Shen, Kalika Bali, Monojit Choudhury, Gael de Chalendar, Anmol Goel, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Nasredine Semmar, Sina Semnani, Jiwon Seo, Vivek Seshadri, Manish Shrivastava, Michael Sun, Aditya Yadavalli, Chaobin You, Deyi Xiong, and Monica Lam

In Findings of the Association for Computational Linguistics: ACL 2023, 2023

Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
@inproceedings{moradshahi-etal-2023-x, title = {{X}-{R}i{SAWOZ}: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents}, author = {Moradshahi, Mehrad and Shen, Tianhao and Bali, Kalika and Choudhury, Monojit and de Chalendar, Gael and Goel, Anmol and Kim, Sungkyun and Kodali, Prashant and Kumaraguru, Ponnurangam and Semmar, Nasredine and Semnani, Sina and Seo, Jiwon and Seshadri, Vivek and Shrivastava, Manish and Sun, Michael and Yadavalli, Aditya and You, Chaobin and Xiong, Deyi and Lam, Monica}, year = {2023}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2023}, }

2022

ACL
SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing

Prashant Kodali, Anmol Goel, Monojit Choudhury, Manish Shrivastava, and Ponnurangam Kumaraguru

In Findings of the Association for Computational Linguistics: ACL 2022, 2022

Code mixing is the linguistic phenomenon where bilingual speakers tend to switch between two or more languages in conversations. Recent work on code-mixing in computational settings has leveraged social media code mixed texts to train NLP models. For capturing the variety of code mixing in, and across corpus, Language ID (LID) tags based measures (CMI) have been proposed. Syntactical variety/patterns of code-mixing and their relationship vis-a-vis computational model’s performance is underexplored. In this work, we investigate a collection of English(en)-Hindi(hi) code-mixed datasets from a syntactic lens to propose, SyMCoM, an indicator of syntactic variety in code-mixed text, with intuitive theoretical bounds. We train SoTA en-hi PoS tagger, accuracy of 93.4%, to reliably compute PoS tags on a corpus, and demonstrate the utility of SyMCoM by applying it on various syntactical categories on a collection of datasets, and compare datasets using the measure.
@inproceedings{kodali-etal-2022-symcom, title = {{S}y{MC}o{M} - Syntactic Measure of Code Mixing A Study Of {E}nglish-{H}indi Code-Mixing}, author = {Kodali, Prashant and Goel, Anmol and Choudhury, Monojit and Shrivastava, Manish and Kumaraguru, Ponnurangam}, year = {2022}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2022}, }