Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data

Code-mixing is a common phenomenon in multilingual societies around the world and is especially common in social media texts. Traditional NLP systems, usually trained on monolingual corpora, do not perform well on code-mixed texts. Training specialized models for code-switched texts is difficult due to the lack of large-scale datasets. Translating code-mixed data into standard languages like English could improve performance on various code-mixed tasks, since we can use transfer learning from state-of-the-art English models for processing the translated data. This paper focuses on two sequence-level classification tasks for English-Hindi code-mixed texts, which are part of the GLUECoS benchmark: Natural Language Inference and Sentiment Analysis. We propose using various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance. We further fine-tune these models on the translated code-mixed datasets and achieve state-of-the-art performance in both tasks. To translate English-Hindi code-mixed data to English, we use mBART, a pre-trained multilingual sequence-to-sequence model that has shown competitive performance on various low-resource machine translation pairs and has also shown performance gains in languages that were not in its pre-training corpus.


Introduction
In the last decade, social media has become a significant part of the lives of a large population in the world. Unlike previously popular communication platforms, online messaging is very informal, and in recent years, it has led to an increase in the usage of emojis, slang, and even a hybrid form of language, code-mixed language.
Code-mixed language is a mixture of multiple languages where words belonging to different languages are interleaved with each other in the same conversation. It is commonly used by multilingual speakers. It does not follow a formally defined structure and often varies from person to person, although some studies (Poplack, 1980;Belazi et al., 1994) have proposed linguistic constraints on code-switching. Code-mixing and code-switching are similar terms that slightly differ technically, but they are often used interchangeably by the research community. We will also be using them interchangeably in our paper.
In this paper, we work with English-Hindi code-mixed data. English-Hindi code-mixed language, often called Hinglish, is very common in India because of the large number of bilingual speakers who often use English in their professional lives while using Hindi in their personal lives. An example of an English-Hindi code-mixed sentence from a dataset released by Dhar et al. (2018) is shown below:
• Original Sentence: My brother always told me ki in retrospect, badi dikkatein chhoti lagti hain.
• Gloss: [My brother always told me] that [in retrospect], big problems small seem are.
• Translation: My brother always told me that, in retrospect, big problems seem to be small.
Although there is a large population globally that communicates using code-mixed languages, annotated datasets remain scarce even when the monolingual constituent languages have large-scale datasets. Recent work suggests that multilingual models trained on several monolingual datasets perform well with zero-shot cross-lingual transfer in code-switched settings (Patwa et al., 2020; Khanuja et al., 2020b). However, Khanuja et al. (2020b) conclude that their model had varying performance across tasks and especially struggled with NLI and sentiment analysis tasks. Another challenge with code-mixed language research is that, unlike monolingual data, there are no formal data sources like news articles or books written in code-mixed languages. Instead, most research uses informal sources such as social media texts or messages, which are usually challenging to obtain. Also, most of the data is written in the Roman script, and Hindi words are transliterated informally without any standard rules. Individuals generally provide a rough phonetic transcription of the intended word, which can vary from individual to individual due to any number of factors, including regional or dialectal differences in pronunciations, differing conventions of transcription, or simple idiosyncrasy (Roark et al., 2020). This makes it challenging to prepare reliable datasets to train robust deep learning models. Most of the existing datasets focus on a few language pairs and have been prepared by several shared task organizers.
To address these issues, we propose translating the code-mixed data to English (a high-resource language) and applying powerful models trained on English data to perform sequence-level classification tasks on the translated data. To translate the code-mixed data to English, we propose using mBART, a pre-trained multilingual sequence-to-sequence model. We experiment with our pipeline on two English-Hindi code-mixed sequence classification tasks of the GLUECoS (Khanuja et al., 2020b) benchmark: Natural Language Inference and Sentiment Analysis. We achieve state-of-the-art performance in both tasks. The code for our proposed system is available at https://github.com/devanshg27/cm_translatify.
The main contributions of our work are as follows:
• We explore the effectiveness of using mBART for low-resource code-mixed Hinglish-English translation with transfer learning from Hindi-English translation.
• We propose performing sequence-level classification on code-mixed data by first translating it to English and then using powerful models trained on English data to classify the translated data.
• We achieve state-of-the-art performance on two classification tasks of the GLUECoS benchmark, Natural Language Inference and Sentiment Analysis, with absolute increases of 12.4% and 5.3%, respectively.
The rest of the paper is organized as follows. We first discuss prior work on code-mixed language processing, machine translation, Natural Language Inference, and Sentiment Analysis. We then describe the translation system we use and show the effect of different training choices. Next, we describe our pipeline for code-mixed sequence-level classification on the two chosen tasks, Natural Language Inference and Sentiment Analysis, and compare its performance against past work. We conclude by highlighting our main findings and giving directions for future work.

Related Work
Code-mixing occurs when a speaker uses words belonging to different languages interleaved with each other in the same conversation. With the rise of social media and messaging platforms, there has been a significant increase in code-mixed language usage.
Although shared tasks have helped advance code-switching research, most of them require building specialized systems for the specific task and language pair due to the limited dataset sizes. Recently, large pre-trained multilingual models have been used for various code-mixed tasks (Patwa et al., 2020; Khanuja et al., 2020b).
Machine Translation refers to translating a text from a source language to its counterpart in a target language using machines. It has widespread applications in the real world and has been an active area of research.
Earlier works in machine translation mostly focused on statistical or rule-based approaches. In contrast, neural machine translation gained popularity in the last decade after Kalchbrenner and Blunsom (2013) successfully proposed the first DNN model for translation. Recent works use transformer-based approaches (Vaswani et al., 2017). Some approaches utilize multilingual pre-training (Song et al., 2019; Conneau and Lample, 2019); however, these works focus only on monolingual language pairs. Despite the significant usage of English-Hindi code-mixing, there has been little work regarding English-Hindi code-mixed translation (Srivastava and Singh, 2020; Singh and Solorio, 2018; Dhar et al., 2018), which leads to a massive gap in communication, as these texts can only be understood by people who are proficient in both languages.
Natural Language Inference is the task of determining if the given "premise" supports a given "hypothesis" and classifying the hypothesis as true (entailment), false (contradiction), or undetermined (neutral). It is arguably one of the most fundamental tasks in natural language understanding. Wang et al. (2018) and Yin et al. (2019) suggest that various NLP tasks can be reduced to Natural Language Inference, which makes it an even more valuable task to solve.
Natural Language Inference for English texts has been an active area of research. It has been extensively studied under different tasks such as RTE (Recognizing Textual Entailment) (Dagan et al., 2006), NLI (Natural Language Inference) (Bowman et al., 2015), and FEVER (Fact Extraction and VERification) (Thorne et al., 2018). In recent years, large-scale pre-trained models (Devlin et al., 2019) have dominated these tasks and have achieved close-to-human performance. Although NLI on English data has seen many advances, there has been little work on NLI for code-mixed data. Khanuja et al. (2020a) release the first NLI dataset for code-mixed languages. It consists of conversations from Hindi movies (Bollywood) as premises. Chakravarthy et al. (2020) compare the effectiveness of various approaches on the dataset.
Sentiment Analysis is the task of understanding the sentiment expressed in a text and classifying the text into positive, negative, or neutral classes. It has several applications such as customer feedback, marketing, and social media monitoring. There has been extensive research on sentiment analysis of English texts with various shared tasks and datasets. Sentiment analysis for code-mixed texts is an essential task due to the widespread usage of code-mixed texts on social media in multilingual societies. There has been some work related to code-mixed sentiment analysis with a few shared tasks (Patra et al., 2018; Patwa et al., 2020). The participants of the task organized by Patwa et al. (2020) explored various approaches such as pre-trained language models, RNNs, CNNs, and word embeddings.

Translating Code-Mixed Text
In this section, we describe our proposed model, which uses mBART to translate code-mixed texts to English.

mBART
We fine-tune mBART, which is a multilingual sequence-to-sequence denoising auto-encoder. It has been pre-trained using the BART objective on large-scale monolingual corpora of 25 languages extracted from Common Crawl. Both English and Hindi are part of the pre-training corpus, with 55,608 million tokens (300.8 GB) and 1,715 million tokens (20.2 GB), respectively. It uses a standard sequence-to-sequence Transformer architecture (Vaswani et al., 2017), with 12 encoder layers and 12 decoder layers, a model dimension of 1024, and 16 attention heads, resulting in ∼680 million parameters.

Data Preparation
We use the datasets released by Dhar et al. (2018) and Srivastava and Singh (2020); the statistics of the datasets are provided in Table 1. Since both datasets contain Hindi words in Roman script, we use the CSNLI library (Bhat et al., 2017, 2018) as a preprocessing step. It transliterates the Hindi words to Devanagari and also performs text normalization. We split each dataset into an 8:1:1 train:validation:test split. We merge the training and validation sets of the two datasets and use the merged datasets for all our experiments. We also use the dataset released by Kunchukuttan et al. (2018), which contains parallel sentences for English and Hindi. We use its training set, which contains 1,609,682 sentences, for training our systems.
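The split-and-merge step described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code: the function names, the shuffling, and the fixed seed are assumptions, and each dataset is assumed to be a list of (code-mixed, English) sentence pairs.

```python
import random


def split_811(pairs, seed=0):
    """Shuffle and split examples into train/validation/test at roughly 8:1:1.

    The shuffle and the seed are assumptions made for this sketch.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_val = len(items) // 10
    n_test = len(items) // 10
    n_train = len(items) - n_val - n_test
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


def build_training_data(dataset_a, dataset_b, seed=0):
    """Merge the train and validation portions; keep the two test sets apart."""
    a_train, a_val, a_test = split_811(dataset_a, seed)
    b_train, b_val, b_test = split_811(dataset_b, seed)
    return {
        "train": a_train + b_train,
        "valid": a_val + b_val,
        "test_a": a_test,
        "test_b": b_test,
    }
```

Keeping the two test sets separate in the sketch mirrors the fact that translation quality is reported per dataset.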

Optimization
We use the implementation of mBART available in the fairseq library 3. We fine-tune on 4 Nvidia GeForce RTX 2080 Ti GPUs with an effective batch size of 1024 tokens per GPU. We use the Adam optimizer ($\epsilon = 10^{-6}$, $\beta_1 = 0.9$, $\beta_2 = 0.98$) (Kingma and Ba, 2015) with 0.2 label smoothing, 0.3 dropout, 0.1 attention dropout, and polynomial decay learning rate scheduling. We validate the models every 8000 steps and select the best checkpoint based on the lowest validation loss. To train our systems efficiently, we prune mBART's vocabulary by removing the tokens which are not present in any of the datasets mentioned in the previous section.
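The vocabulary pruning step can be illustrated with a small sketch. It is not the fairseq implementation: the function names are hypothetical, the vocabulary is assumed to be a list of subword strings indexed by token id, the corpora are assumed to be already tokenized into subwords, and the set of special tokens that is always kept is an assumption.

```python
def prune_vocabulary(vocab, corpora, always_keep=("<s>", "<pad>", "</s>", "<unk>")):
    """Return (new_vocab, kept_ids) for tokens seen in at least one corpus.

    vocab:       list of token strings; the list index is the token id
    corpora:     iterable of corpora, each an iterable of token lists
    always_keep: special tokens retained regardless of usage (an assumption)
    """
    used = set(always_keep)
    for corpus in corpora:
        for tokens in corpus:
            used.update(tokens)
    kept_ids = [i for i, token in enumerate(vocab) if token in used]
    new_vocab = [vocab[i] for i in kept_ids]
    return new_vocab, kept_ids


def prune_embeddings(embedding_rows, kept_ids):
    """Select the embedding rows that belong to the kept token ids."""
    return [embedding_rows[i] for i in kept_ids]
```

Because the kept ids preserve their original order, the same index list can be used to slice both the input embedding matrix and the output projection of the pre-trained model.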
We compare the following 3 strategies for fine-tuning mBART:
• mBART-cm: We fine-tune mBART on the merged dataset with parallel English-Hindi code-mixed sentences. We fine-tune for 20,000 steps with 2,500 warm-up steps and a learning rate of $3 \times 10^{-5}$.
• mBART-hien: We fine-tune mBART on the dataset with parallel English-Hindi sentences. We fine-tune for 80,000 steps with 2,500 warm-up steps and a learning rate of $3 \times 10^{-5}$.
• mBART-hien-cm: We fine-tune mBART on the dataset with parallel English-Hindi sentences for 80,000 steps with 2,500 warm-up steps and a learning rate of $3 \times 10^{-5}$, followed by further fine-tuning on the merged dataset with parallel English-Hindi code-mixed sentences for 10,000 steps with 2,500 warm-up steps and a learning rate of $10^{-5}$.
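To make the three schedules concrete, the sketch below computes the learning rate at a given step for linear warm-up followed by polynomial decay. The step counts, warm-up lengths, and peak learning rates are taken from the list above; the decay power of 1.0 and the final learning rate of 0 are assumptions, since they are not stated, and the function name is hypothetical.

```python
def polynomial_decay_lr(step, peak_lr, warmup_steps, total_steps,
                        end_lr=0.0, power=1.0):
    """Learning rate at `step` for linear warm-up, then polynomial decay.

    end_lr and power are assumed defaults, not values reported in the text.
    """
    if warmup_steps > 0 and step <= warmup_steps:
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (peak_lr - end_lr) * remaining ** power + end_lr


# Settings of the three fine-tuning strategies described above.
SCHEDULES = {
    "mBART-cm": dict(peak_lr=3e-5, warmup_steps=2500, total_steps=20000),
    "mBART-hien": dict(peak_lr=3e-5, warmup_steps=2500, total_steps=80000),
    "mBART-hien-cm (second stage)": dict(peak_lr=1e-5, warmup_steps=2500,
                                         total_steps=10000),
}

if __name__ == "__main__":
    for name, cfg in SCHEDULES.items():
        midpoint = (cfg["warmup_steps"] + cfg["total_steps"]) // 2
        print(name, polynomial_decay_lr(midpoint, **cfg))
```

With a power of 1.0 the decay is linear, so the learning rate halfway between the end of warm-up and the final step is half the peak value.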

Results
We use BLEU scores as the metric for comparing our systems. The scores are computed using the SacreBLEU library (Post, 2018) after tokenization using the TweetTokenizer available with the NLTK library (Bird et al., 2009). The scores of our systems are shown in Table 2. We find that mBART-hien, which was only fine-tuned for Hindi-English translation, performs considerably worse than the other models, showing that fine-tuning on English-Hindi code-mixed data improves the performance substantially. We also find that mBART-hien-cm has the best performance among the systems we consider. It uses transfer learning from Hindi-English translation to improve Hinglish-English translation.

Model            Dhar et al. (2018)    Srivastava and Singh (2020)
mBART-hien       17.2                  16.7
mBART-cm         30.5                  31.6
mBART-hien-cm    31.7                  33.0

Table 2: BLEU scores of our systems on the test sets of the two datasets.

Figure 1: The working of our pipeline for the task of code-mixed Sentiment Analysis is demonstrated on an example (with minor edits) from the dataset (the details of the dataset are discussed later). The input "Sabse bakwaas was pin ball" is transliterated and normalized to "सबसे बकवास was pin ball", translated to "The worst was the pin ball", and classified as Negative by the fine-tuned classification model.

3 https://github.com/pytorch/fairseq
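For readers unfamiliar with the metric, the following is a simplified, self-contained sketch of corpus-level BLEU. It is a stand-in for illustration only: the reported scores come from SacreBLEU with TweetTokenizer tokenization, whereas this sketch assumes pre-tokenized input, a single reference per hypothesis, and no smoothing, so its numbers are not comparable to Table 2.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale (one reference each, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyp = "my brother always told me that big problems seem small".split()
    ref = "my brother always told me that big problems seem to be small".split()
    print(round(corpus_bleu([hyp], [ref]), 2))
```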

Code-Mixed Sequence-level Classification
In this section, we describe our approach for code-mixed sequence-level classification tasks using our translation system. Our pipeline is shown in Figure 1. We evaluate the performance of our pipeline on two tasks: Natural Language Inference and Sentiment Analysis.
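The overall flow of the pipeline can be sketched as three composed stages. The sketch below is schematic: the function name is hypothetical and the three stages are passed in as plain callables, with toy lookup tables standing in for the transliteration and normalization step, the translation model, and the fine-tuned English classifier. The example sentence is the one shown in Figure 1.

```python
def translate_and_classify(texts, normalize, translate, classify):
    """Run each text through normalization, translation, and classification."""
    labels = []
    for text in texts:
        normalized = normalize(text)   # Roman-script Hindi -> Devanagari, cleaned up
        english = translate(normalized)  # code-mixed -> English
        labels.append(classify(english))  # English-only classifier
    return labels


# Toy stand-ins for the three stages (assumptions for illustration only).
TOY_NORMALIZER = {"Sabse bakwaas was pin ball": "सबसे बकवास was pin ball"}
TOY_TRANSLATOR = {"सबसे बकवास was pin ball": "The worst was the pin ball"}


def toy_classifier(english_text):
    """Label a sentence Negative if it contains the word 'worst'."""
    return "Negative" if "worst" in english_text.lower() else "Neutral"


if __name__ == "__main__":
    print(translate_and_classify(
        ["Sabse bakwaas was pin ball"],
        normalize=TOY_NORMALIZER.get,
        translate=TOY_TRANSLATOR.get,
        classify=toy_classifier,
    ))
```

Passing the stages as callables keeps the classifier independent of the translation system, which is what allows English-only checkpoints to be reused unchanged.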

Data Preparation
For Natural Language Inference, we use the dataset released by Khanuja et al. (2020a), which is part of the GLUECoS benchmark. The dataset consists of code-mixed conversations from Hindi movies (Bollywood) as premises, annotated with hypotheses that are either entailed or contradicted by the conversational premise. The statistics for the dataset are shown in Table 4. Since the dataset contains Hindi words in Roman script, we use the CSNLI library to transliterate the Hindi words to Devanagari and perform text normalization. The data is then translated to English using our best-performing translation system, mBART-hien-cm. The dataset is split into a train set and a test set with 1792 and 447 premise-hypothesis pairs, respectively. We further split off a validation set from the train set, which gives a final train:validation:test split of 4:1:1.25.

System Overview
Our systems use different models which have shown competitive performance on Natural Language Inference for English texts. We use publicly available checkpoints for each model, which have been fine-tuned for Natural Language Inference on various English datasets such as SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018), FEVER-NLI (Nie et al., 2019), and ANLI (R1, R2, R3) (Nie et al., 2020). We fine-tune the checkpoints further on the code-mixed data translated to English. The details about the checkpoints we use are shown in Table 3.

Optimization
For the implementation of our systems, we use the HuggingFace Transformers library and the AdamW optimizer ($\epsilon = 10^{-8}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay $= 0.01$) available in PyTorch (Paszke et al., 2019) with a learning rate of $10^{-6}$. All models were fine-tuned using 4 Nvidia GeForce RTX 2080 Ti GPUs with a batch size of 8. The maximum sequence length was 512 for (1) and (2) and 256 for the other models. We fine-tune the models for 5 epochs with validation every 100 steps and choose the model with the best performance on the validation set. We use cross-entropy as the loss function.
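The checkpoint selection rule (validate every 100 steps, keep the checkpoint with the best validation performance) can be sketched independently of any particular training framework. In the sketch below the function name is hypothetical, and `train_step` and `evaluate` are placeholder callables supplied by the caller; nothing here is taken from the released code.

```python
def fine_tune_with_selection(train_step, evaluate, total_steps, eval_every=100):
    """Train for total_steps and return (best_step, best_score).

    train_step(step) performs one optimizer update.
    evaluate(step) returns a validation score where higher is better.
    """
    best_step = None
    best_score = float("-inf")
    for step in range(1, total_steps + 1):
        train_step(step)
        if step % eval_every == 0:
            score = evaluate(step)
            # Strict comparison keeps the earliest checkpoint on ties.
            if score > best_score:
                best_step, best_score = step, score
    return best_step, best_score
```

In a real run the caller would also save the model weights whenever a new best score is recorded, so that the selected checkpoint can be restored for evaluation on the test set.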

Results
We compare the performance of our systems against the system with the highest test set performance discussed in Chakravarthy et al. (2020) and the baselines provided by Khanuja et al. (2020b). The performance of our systems is shown in Table 5. All our systems perform better than the previous state-of-the-art. We find that (2) performs better than (1), which shows that transfer learning from a larger English dataset improves the performance on code-mixed texts. The confusion matrix for the predictions from our best model is shown in Figure 2. We find that the performance of our system on entailed and contradictory statements is similar.

Model                                    Accuracy
mBERT (Khanuja et al., 2020b)            61.09
Mod. mBERT (Khanuja et al., 2020b)       63.1
mod-mBERT (Chakravarthy et al., 2020)    62.41

Table 5 (excerpt): accuracy of the prior systems we compare against.

Data Preparation
For Sentiment Analysis, we use the dataset released by Patra et al. (2018), which is part of the GLUECoS benchmark. The dataset was created by collecting code-mixed tweets using common Hindi words as search keywords. The tweets were annotated with word-level language tags and sentiment tags (positive, negative, or neutral). A transliterated version of the dataset is also provided where the Hindi words are in the Devanagari script. We use the transliterated version and translate it to English using mBART-hien-cm after normalizing the text with the DevanagariNormalizer function available in the IndicNLP Library (Kunchukuttan, 2020). The statistics for the dataset are shown in Table 6. We use the provided train:validation:test split, which is in the ratio 8:1:1.

System Overview
We use the following models which have shown competitive performance on sentiment analysis of English tweets:
(1) BERTweet (Nguyen et al., 2020): A large-scale pre-trained language model for English tweets which has been pre-trained on a large corpus of 850M English tweets. It has the same architecture as BERT base with ∼110M parameters.
(2) RoB-RT (Barbieri et al., 2020): The pre-trained RoBERTa base model which has been re-trained on a corpus of 58M English tweets. It has ∼125M parameters.
We use publicly available checkpoints of the above models, which have been fine-tuned on the sentiment analysis dataset released for SemEval-2017 Task 4 (Rosenthal et al., 2017), which is part of the TweetEval (Barbieri et al., 2020) benchmark. The dataset consists of ∼60,000 tweets. We fine-tune the checkpoints further for sentiment analysis of code-mixed tweets that have been translated to English.

Optimization
For the implementation of our systems, we again use the HuggingFace Transformers library and the AdamW optimizer ($\epsilon = 10^{-8}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay $= 0.01$) available in PyTorch with a learning rate of $10^{-6}$. All models were fine-tuned using 4 Nvidia GeForce RTX 2080 Ti GPUs with a batch size of 16. The maximum sequence length was 128 for (1) BERTweet and 512 for (2) RoB-RT. We fine-tune the models for 5 epochs with validation every 100 steps and choose the model with the best performance on the validation set. We use cross-entropy as the loss function.

Model                                 Performance
[...] (Khanuja et al., 2020b)         58.24
Mod. mBERT (Khanuja et al., 2020b)    59.35
(1) BERTweet                          64.6 ±0.3
(2) RoB-RT base                       64.6 ±0.4

Table 7 (excerpt): performance of our systems and the baselines on code-mixed sentiment analysis.
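As a reminder of what the loss computes for a three-way sentiment classifier, the sketch below evaluates cross-entropy directly from unnormalized scores (logits). It is a plain-Python illustration with hypothetical function names, not the PyTorch implementation used for training; the class order is an assumption.

```python
import math

CLASSES = ("negative", "neutral", "positive")  # assumed label order


def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    log_normalizer = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_normalizer - logits[target_index]


def mean_cross_entropy(batch_logits, batch_targets):
    """Average the per-example loss over a batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets)]
    return sum(losses) / len(losses)
```

The loss is small when the model puts most of its probability on the gold class and grows without bound as that probability approaches zero.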

Results
We compare the performance of our systems against the system achieving the highest score in the task organized by Patra et al. (2018) and the two best-performing baselines provided by Khanuja et al. (2020b). The performance of our systems is shown in Table 7. Both the systems we consider have similar performance and perform better than the current state-of-the-art. The confusion matrix for the predictions from our best model is shown in Figure 3. We find that our model struggles with negative sentiment tweets and misclassifies them as neutral sentiment in 37% of cases.
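A per-class error rate such as the share of negative tweets predicted as neutral is read off a row-normalized confusion matrix. The sketch below shows that computation on made-up toy labels; the function names and the label order are assumptions, and the toy data does not reproduce the reported figures.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")  # assumed label order


def confusion_matrix(gold, predicted, labels=LABELS):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, predicted))
    return [[counts[(g, p)] for p in labels] for g in labels]


def row_rates(matrix):
    """Normalize each row so that it sums to 1 (rows with no examples stay 0)."""
    rates = []
    for row in matrix:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates


if __name__ == "__main__":
    # Toy labels for illustration only.
    gold = ["negative", "negative", "negative", "neutral", "positive"]
    pred = ["negative", "neutral", "neutral", "neutral", "positive"]
    rates = row_rates(confusion_matrix(gold, pred))
    print("negative predicted as neutral:", rates[0][1])
```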

Conclusion
In this paper, we demonstrate that mBART can be used to translate English-Hindi code-mixed sentences to English and show that transfer learning from Hindi-English translation improves its performance on code-mixed translation. We evaluate how our translation system can be used to improve performance in code-mixed sequence classification tasks. We develop a pipeline that uses our translation system to translate code-mixed data to English and then uses large-scale pre-trained English models for the downstream tasks. Our experiments show that our pipeline achieves state-of-the-art performance on two tasks of the GLUECoS benchmark: Natural Language Inference and Sentiment Analysis.
The performance of our pipeline shows that improving code-mixed translation can improve the performance of several code-mixed tasks. In future work, we would like to improve our translation system by creating a larger parallel corpus or synthetically generating parallel sentences for data augmentation. We would also like to extend our system to other code-mixing language pairs.