CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences

Code-mixed languages are very popular in multilingual societies around the world, yet the resources to enable robust systems on such languages lag behind. A major contributing factor is the informal nature of these languages, which makes it difficult to collect code-mixed data. In this paper, we propose our system for Task 1 of CALCS 2021: a machine translation system from English to Hinglish in a supervised setting. Translating in this direction can help expand the set of resources for several tasks by translating valuable datasets from high-resource languages. We propose to use mBART, a pre-trained multilingual sequence-to-sequence model, and fully utilize its pre-training by transliterating the Roman-script Hindi words in the code-mixed sentences to Devanagari script. We evaluate how expanding the input by concatenating Hindi translations of the English sentences improves mBART's performance. Our system achieves a BLEU score of 12.22 on the test set. Further, we perform a detailed error analysis of our proposed systems and explore the limitations of the provided dataset and metrics.


Introduction
Code-mixing 2 is the mixing of two or more languages where words from different languages are interleaved with each other in the same conversation. It is a common phenomenon in multilingual societies across the globe. In the last decade, due to the increase in the popularity of social media and various online messaging platforms, there has been an increase in various forms of informal writing, such as emojis, slang, and the usage of code-mixed languages.
1 https://code-switching.github.io/2021
2 Code-switching is another term that slightly differs in its meaning but is often used interchangeably with code-mixing in the research community. We will also follow this convention and use both terms interchangeably in our paper.
Due to the informal nature of code-mixing, code-mixed languages do not follow a prescriptively defined structure, and the structure often varies with the speaker. Nevertheless, some linguistic constraints (Poplack, 1980; Belazi et al., 1994) have been proposed that attempt to determine how languages mix with each other.
Given the increasing use of code-mixed languages by people around the globe, there is a growing need for research related to code-mixed languages. A significant challenge to research is that there are no formal sources like books or news articles in code-mixed languages, and studies have to rely on sources like Twitter or messaging platforms. Another challenge with Hinglish, in particular, is that there is no standard system of transliteration for Hindi words, and individuals provide a rough phonetic transcription of the intended word, which often varies with individuals.
In this paper, we describe our systems for Task 1 of CALCS 2021, which focuses on translating English sentences to English-Hindi code-mixed sentences. The code-mixed language is often called Hinglish. It is commonly used in India because many bilingual speakers use both Hindi and English frequently in their personal and professional lives. The translation systems could be used to augment datasets for various Hinglish tasks by translating datasets from English to Hinglish. An example of a Hinglish sentence from the provided dataset (with small modifications) is shown below: • Hinglish Sentence: Bahut strange choice thi ye.
• Gloss of Hinglish Sentence: Very [strange choice] was this.
• English Sentence: This was a very strange choice.
We propose to fine-tune mBART for the given task by first transliterating the Hindi words in the target sentences from Roman script to Devanagari script to utilize its pre-training. We further translate the English input to Hindi using pre-existing models and show improvements in translation quality when using the parallel sentences as input to the mBART model. The code for our systems, along with the error analysis, is public 3 .
The main contributions of our work are as follows: • We explore the effectiveness of fine-tuning mBART to translate to code-mixed sentences by utilizing the Hindi pre-training of the model in Devanagari script. We further explore the effectiveness of using parallel sentences as input.
• We propose a normalized BLEU score metric to better account for the spelling variations in the code-mixed sentences.
• Along with BLEU scores, we analyze the code-mixing quality of the reference translations and the generated outputs, and propose that measures of code-mixing should be part of the evaluation and analysis of code-mixed translation.
The rest of the paper is organized as follows. We discuss prior work related to code-mixed language processing, machine translation, and synthetic generation of code-mixed data. We describe our translation systems and compare the performances of our approaches. We discuss the amount of codemixing in the translations predicted by our systems and discuss some issues present in the provided dataset. We conclude with a direction for future work and highlight our main findings.

Background
Code-mixing occurs when a speaker switches between two or more languages in the context of the same conversation. It has become popular in multilingual societies with the rise of social media applications and messaging platforms.
Although workshops on code-mixed language processing have gained traction, the field lacks standard datasets for building robust systems. The small size of available datasets is a major factor limiting the scope of code-mixed systems.
Machine Translation refers to the use of software to translate text from one language to another. In the current state of globalization, translation systems have widespread applications and are consequently an active area of research.
Neural machine translation has gained popularity only in the last decade, while earlier works focused on statistical or rule-based approaches. Kalchbrenner and Blunsom (2013) first proposed a DNN model for translation, following which transformer-based approaches (Vaswani et al., 2017) have taken center stage. Some approaches utilize multilingual pre-training (Song et al., 2019; Conneau and Lample, 2019); however, these works focus only on monolingual language pairs. Although a large number of multilingual speakers in a highly populous country like India use English-Hindi code-mixed language, only a few studies (Srivastava and Singh, 2020; Singh and Solorio, 2018; Dhar et al., 2018) have attempted the problem. Enabling translation systems for this language pair can bridge the communication gap for many people and further improve the state of globalization in the world.
Synthetic code-mixed data generation is a plausible option for building resources for code-mixed language research and is closely related to translation. While translation focuses on retaining the meaning of the source sentence, generation is a simpler task, requiring attention only to the quality of the synthetic data generated. Pratapa et al. (2018) started by exploring linguistic theories to generate code-mixed data. Later works attempt the problem using several approaches, including Generative Adversarial Networks (Chang et al., 2019), an encoder-decoder framework (Gupta et al., 2020), pointer-generator networks (Winata et al., 2019), and a two-level

System Overview
In this section, we describe our proposed systems for the task, which use mBART (Liu et al., 2020) to translate English to Hinglish.

Data Preparation
We use the dataset provided by the task organizers for our systems; statistics of the dataset are provided in Table 1. Since the target sentences in the dataset contain Hindi words in Roman script, we use the CSNLI library 4 (Bhat et al., 2017) as a preprocessing step. It transliterates the Hindi words to Devanagari and also performs text normalization. We use the provided train:validation:test split, which is in the ratio 8:1:1.
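As a minimal sketch of this preprocessing step, the following substitutes a simple dictionary lookup for CSNLI's full language-identification, normalization, and transliteration pipeline; the dictionary entries are toy examples, not part of the actual system:

```python
# Sketch of the preprocessing step: Roman-script Hindi tokens in a
# Hinglish sentence are mapped to Devanagari while English tokens are
# left untouched. The toy lookup below stands in for CSNLI, which also
# performs language identification and text normalization.

# Toy transliteration lookup (illustrative entries only).
ROMAN_TO_DEVANAGARI = {
    "bahut": "बहुत",
    "thi": "थी",
    "ye": "ये",
}

def to_devanagari(sentence: str) -> str:
    """Replace Roman-script Hindi tokens with their Devanagari forms."""
    out = []
    for token in sentence.lower().split():
        # Tokens found in the lookup are treated as Hindi; everything
        # else (English words, punctuation) passes through unchanged.
        out.append(ROMAN_TO_DEVANAGARI.get(token, token))
    return " ".join(out)

print(to_devanagari("Bahut strange choice thi ye"))
# "बहुत strange choice थी ये"
```

The example sentence is the one from the introduction; after this step, the Hindi portions of the target match the Devanagari text mBART saw during pre-training.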

Model
We fine-tune mBART, a multilingual sequence-to-sequence denoising auto-encoder pre-trained with the BART (Lewis et al., 2020) objective on large-scale monolingual corpora of 25 languages, including English and Hindi. It uses a standard sequence-to-sequence Transformer architecture (Vaswani et al., 2017), with 12 encoder and 12 decoder layers, a model dimension of 1024, and 16 attention heads, resulting in ∼680 million parameters. To train our systems efficiently, we prune mBART's vocabulary by removing the tokens that are not present in either the provided dataset or the dataset released by Kunchukuttan et al. (2018), which contains 1,612,709 parallel sentences for English and Hindi.
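The pruning step can be sketched as follows, with a plain whitespace tokenizer standing in for mBART's SentencePiece model and a hypothetical list of special tokens (language codes, sentence markers) that must always be retained:

```python
# Sketch of vocabulary pruning: keep only those tokens from the full
# mBART vocabulary that occur in the task data or the parallel en-hi
# corpus. A whitespace tokenizer stands in for SentencePiece here.

def prune_vocab(full_vocab, corpora):
    """Return the subset of full_vocab that appears in the corpora."""
    seen = set()
    for corpus in corpora:
        for sentence in corpus:
            seen.update(sentence.split())
    # Special tokens must be kept regardless of corpus frequency
    # (illustrative set; the real list follows mBART's conventions).
    specials = {"<s>", "</s>", "<pad>", "en_XX", "hi_IN"}
    return [tok for tok in full_vocab if tok in seen or tok in specials]

full_vocab = ["<s>", "</s>", "en_XX", "hi_IN", "the", "cat", "zebra"]
corpora = [["the cat sat"], ["cat ate"]]
print(prune_vocab(full_vocab, corpora))
# ['<s>', '</s>', 'en_XX', 'hi_IN', 'the', 'cat']
```

Pruning shrinks the embedding and output-projection matrices, which dominate the parameter count, and so reduces memory use during fine-tuning.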
4 https://github.com/irshadbhat/csnli
We compare the following two strategies for fine-tuning mBART: • mBART-en: We fine-tune mBART on the train set, feeding the English sentences to the encoder and decoding Hinglish sentences. We use beam search with a beam size of 5 for decoding.
• mBART-hien: We fine-tune mBART on the train set, feeding the English sentences along with their parallel Hindi translations to the encoder and decoding Hinglish sentences. For feeding the data to the encoder, we concatenate the Hindi translations, followed by a separator token '##', followed by the English sentence. We use the Google NMT system 5 (Wu et al., 2016) to translate the English source sentences to Hindi. We again use beam search with a beam size of 5 for decoding.
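The mBART-hien input construction can be sketched as follows; the Hindi translation in the example is illustrative, not taken from the dataset:

```python
# Sketch of the mBART-hien encoder input: the Hindi translation of the
# source sentence is prepended to the English sentence, with the two
# separated by a '##' token.

def build_hien_input(hindi: str, english: str) -> str:
    return f"{hindi} ## {english}"

src = build_hien_input("यह बहुत अजीब पसंद थी ।",
                       "This was a very strange choice.")
print(src)
```

The separator lets the encoder distinguish the two parallel views of the same content while attending across both.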

Post-Processing
We transliterate the Hindi words in our predicted translations from Devanagari to Roman. We use the following methods to transliterate a given Devanagari token, applying the first method that yields a transliteration: 1. When transliterating the Hindi words in the target sentences from Roman to Devanagari (as discussed in Section 3.1), we store the most frequent Roman spelling of each Hindi word in the train set. If the current Devanagari token has an entry in this lookup, we use it directly.
2. We use the publicly available Dakshina Dataset (Roark et al., 2020) which has 25,000 Hindi words in Devanagari script along with their attested romanizations. If the current Devanagari token is available in the dataset, we use the transliteration with the maximum number of attestations from the dataset.
3. We use the indic-trans library 6 (Bhat et al., 2015) to transliterate the token from Devanagari to Roman.
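The three-way fallback above can be sketched as follows; the lookup tables are toy stand-ins for the train-set and Dakshina dictionaries, and a dummy function replaces the indic-trans library:

```python
# Sketch of the post-processing cascade for a single Devanagari token:
# 1) most frequent Roman spelling from the train set,
# 2) most attested Dakshina romanization,
# 3) a rule-based transliterator (indic-trans in the paper).

def back_transliterate(token, train_lookup, dakshina_lookup, fallback):
    if token in train_lookup:        # 1. train-set spelling
        return train_lookup[token]
    if token in dakshina_lookup:     # 2. Dakshina spelling
        return dakshina_lookup[token]
    return fallback(token)           # 3. rule-based fallback

train_lookup = {"बहुत": "bahut"}          # toy entries
dakshina_lookup = {"थी": "thi"}
fallback = lambda t: f"<translit:{t}>"    # stand-in for indic-trans

print([back_transliterate(t, train_lookup, dakshina_lookup, fallback)
       for t in ["बहुत", "थी", "ये"]])
```

Ordering the cascade this way prefers spellings actually observed in the task data, so the output matches the reference conventions more often than a purely rule-based transliterator would.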

Implementation
We use the implementation of mBART available in the fairseq library 7 (Ott et al., 2019). We fine-tune on 4 Nvidia GeForce RTX 2080 Ti GPUs with an effective batch size of 1024 tokens per GPU. We use the Adam optimizer (ε = 10^−6, β1 = 0.9, β2 = 0.98) (Kingma and Ba, 2015) with 0.3 dropout, 0.1 attention dropout, 0.2 label smoothing, and polynomial-decay learning rate scheduling. We fine-tune the model for 10,000 steps with 2,500 warm-up steps and a learning rate of 3 × 10^−5. We validate the model after every epoch and select the checkpoint with the best BLEU score on the validation set.

Evaluation Metrics
We use the following two evaluation metrics for comparing our systems: 1. BLEU: The BLEU score (Papineni et al., 2002) is the official metric used on the leaderboard. We calculate the score using the SacreBLEU library 8 (Post, 2018) after lowercasing and tokenizing with the TweetTokenizer available in the NLTK library 9 (Bird et al., 2009).
2. BLEU_normalized: Instead of calculating BLEU scores on texts where the Hindi words are transliterated to Roman, we calculate the score on texts where the Hindi words are in Devanagari and the English words are in Roman. We transliterate the target sentences using the CSNLI library, and we use the outputs of our systems before post-processing (Section 3.3). We again use the SacreBLEU library after lowercasing and tokenization with the TweetTokenizer.

Table 2 shows the BLEU scores of the outputs generated by the models described in Section 3.2. In Hinglish sentences, Hindi tokens are transliterated to Roman script, which results in spelling variation. Since the BLEU score measures token/n-gram overlap between hypothesis and reference, the lack of a canonical spelling for transliterated words lowers the BLEU score and can mischaracterize translation quality. To estimate the variety of Roman spellings for a Hindi word, we back-transliterate the Hindi words in each code-mixed sentence to Devanagari and aggregate the number of different spellings observed for each Devanagari token. Figure 1 shows the extent of this phenomenon in the dataset released as part of this shared task; it is evident that many Hindi words have multiple Roman spellings. Thus, even if the model generates the correct Devanagari token, the BLEU score will be understated due to spelling variation in the transliterated reference sentence. By back-transliterating Hindi tokens to Devanagari, the BLEU_normalized score thus provides a better representation of translation quality.
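The spelling-variation analysis behind BLEU_normalized can be sketched as follows; a toy lookup stands in for the CSNLI back-transliteration, and the example words are illustrative:

```python
# Sketch of the spelling-variation count: Roman-script Hindi tokens are
# back-transliterated to Devanagari, and the distinct Roman spellings
# observed for each Devanagari form are aggregated.
from collections import defaultdict

def count_spelling_variants(sentences, roman_to_dev):
    """Map each Devanagari form to the set of Roman spellings seen."""
    variants = defaultdict(set)
    for sentence in sentences:
        for token in sentence.split():
            if token in roman_to_dev:  # token is Roman-script Hindi
                variants[roman_to_dev[token]].add(token)
    return variants

# Toy lookup: three Roman spellings of the same Devanagari word.
roman_to_dev = {"nahi": "नहीं", "nahin": "नहीं", "nhi": "नहीं"}
sents = ["mujhe nahi pata", "nahin bhai", "nhi yaar"]
variants = count_spelling_variants(sents, roman_to_dev)
print({k: sorted(v) for k, v in variants.items()})
# {'नहीं': ['nahi', 'nahin', 'nhi']}
```

Under surface-form BLEU, a hypothesis containing "nahi" gets no credit against a reference containing "nahin", even though both realize the same Devanagari word; scoring in Devanagari collapses all three spellings into one token.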

Error Analysis of Translations of Test set
Since the BLEU score primarily looks at n-gram overlaps, it does not provide any insight into the quality of the generated output or the errors therein. To analyse the quality of translations on the test set, we randomly sampled 100 sentences (> 10% of the test set) from the outputs generated by the two models, mBART-en and mBART-hien, and bucketed them into various categories. Table 3 shows the categories of errors and their corresponding frequencies. The Mistranslated/Partially Translated category indicates that the generated translation has little or no semantic resemblance to the source sentence. Sentences where multi-word expressions or named entities are wrongly translated form the second category. The Morphology/Case Marking/Agreement/Syntax Issues category indicates sentences where most of the semantic content is faithfully captured in the generated output, but errors at the grammatical level render the output less fluent. mBART-hien makes fewer errors than mBART-en, but that can possibly be attributed to the fact that this model generates a higher number of Hindi tokens: while its code-mixing quality is lower, it makes fewer grammatical errors. A more extensive and fine-grained analysis of these errors would undoubtedly help improve the models' characterization, and we leave it for future work.

Code Mixing Quality of generated translations
In the code-mixed machine translation setting, it is essential to observe the quality of the code-mixing in the generated translations. While BLEU scores indicate how close we are to the target translation in terms of n-gram overlap, a measure like the Code-Mixing Index (CMI) provides a means to assess whether the generated output actually mixes the two languages. Relying on the BLEU score alone can misrepresent translation quality, as a model could generate monolingual outputs and still obtain a reasonable BLEU score through n-gram overlap. If a measure of code-mixing intensity, like CMI, is also part of the evaluation regime, the code-mixing quality of the generated outputs can be assessed as well. Figure 2 shows the distribution of CMI for the outputs generated by our models (mBART-en and mBART-hien) for both the validation and test sets. Figure 2 and Table 4 show that the code-mixing quality of the two models is more or less similar across the validation and test sets. The high percentage of sentences with a CMI score of 0 shows that in many sentences the model does not actually code-mix. We also find that even though the outputs generated by the mBART-hien model have a higher BLEU_normalized score, their average CMI is lower and the percentage of sentences with a CMI score of 0 is higher. This suggests that mBART-hien produces sentences with less code-mixing, which we believe can be attributed to the model's propensity to generate a higher percentage of Hindi words, as shown in Table 5. We also find that in the train set, more than 20% of the sentences have a CMI score of 0. Replacing such samples with sentence pairs that have a higher degree of code-mixing would help train the model to generate better code-mixed outputs. Further analysis using different measures of code-mixing could provide deeper insights; we leave this for future work.

Table 6: Analysis of the annotated sentence pairs.
  Meaning of target similar to source         759
  Meaning of target distorted from source     141
  Total                                       900
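The CMI computation used in this analysis can be sketched as follows, using one common formulation of the index and assuming token-level language tags ('en', 'hi', or 'univ' for language-independent tokens such as punctuation and named entities):

```python
# Sketch of the Code-Mixing Index (CMI) for a single sentence:
#   CMI = 100 * (1 - max_lang / (n - u)),
# where n is the number of tokens, u the number of language-independent
# tokens, and max_lang the token count of the dominant language.
# CMI is 0 when the sentence has only language-independent tokens.
from collections import Counter

def cmi(tags):
    n = len(tags)
    u = sum(1 for t in tags if t == "univ")
    if n == u:
        return 0.0
    counts = Counter(t for t in tags if t != "univ")
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

# "Bahut strange choice thi ye" -> hi en en hi hi
print(cmi(["hi", "en", "en", "hi", "hi"]))  # 40.0
print(cmi(["hi", "hi", "hi"]))              # 0.0 (monolingual)
```

A monolingual output scores 0 regardless of its BLEU score, which is exactly the failure mode that BLEU alone cannot surface.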

Erroneous Reference Translations in the dataset
We randomly sampled ∼10% (900 sentence pairs) of the parallel sentences from the train and validation sets and annotated them for translation errors. For annotation, we classified each sentence pair into one of two classes: 1) Error - the semantic content of the target is distorted compared to the source; 2) No Error - the semantic content of the source and target are similar, though the target might have minor errors. Minor translation errors attributable to agreement issues, case-marker issues, pronoun errors, etc. were placed in the No Error bucket. Of the 900 manually annotated samples, 141 (i.e., 15% of the annotated pairs) had targets whose meaning was distorted compared to the source sentence. One such example is shown below: • English Sentence: I think I know the football player it was based on.
• Translation of Hinglish Sentence: I thought that this is about football player. Table 6 shows the analysis of this annotated subset. The annotated file with all 900 examples can be found in our code repository. Filtering such erroneous examples from the training and validation sets, and augmenting the dataset with better-quality translations, would certainly help improve translation quality.

Discussion
In this paper, we presented our approaches for English to Hinglish translation using mBART. We analyse our models' outputs and show that translation quality can be improved by including parallel Hindi translations along with the English sentences when translating English to Hinglish. We also discuss the limitations of using BLEU scores for evaluating code-mixed outputs and propose using BLEU_normalized, a slightly modified version of BLEU. To understand the code-mixing quality of the generated translations, we propose that a code-mixing measure, like CMI, should also be part of the evaluation process. Along with the working models, we have analysed the models' shortcomings through error analysis of their outputs. Further, we have also presented an analysis of the shared dataset: the percentage of sentences that are not code-mixed, and the erroneous reference translations. Removing such pairs and replacing them with better samples will help improve the translation quality of the models.
As part of future work, we would like to improve translation quality by augmenting the current dataset with parallel sentences that have a higher degree of code-mixing and good reference translations. We would also like to further analyse the nature of code-mixing in the generated outputs and study the possibility of constraining the models to generate translations with a certain degree of code-mixing.