IITP-MT at CALCS2021: English to Hinglish Neural Machine Translation using Unsupervised Synthetic Code-Mixed Parallel Corpus

This paper describes the system submitted by the IITP-MT team to the Computational Approaches to Linguistic Code-Switching (CALCS 2021) shared task on MT for English→Hinglish. We submit a neural machine translation (NMT) system trained on a synthetic code-mixed (CM) English-Hinglish parallel corpus. We propose an approach to create a code-mixed parallel corpus from a clean parallel corpus in an unsupervised manner. It is an alignment-based approach, and we do not use any linguistic resources to explicitly mark tokens for code-switching. We also train an NMT model on the gold corpus provided by the workshop organizers, augmented with the generated synthetic code-mixed parallel corpus. The model trained over the generated synthetic CM data achieves 10.09 BLEU points on the given test set.


Introduction
In this paper, we describe our submission to the shared task on Machine Translation (MT) for English → Hinglish at CALCS 2021. The objective of this shared task is to generate Hinglish (Hindi-English code-mixed, with Hindi words romanized) data from English. For this task, we submit an NMT system trained on a synthetic code-mixed English-Hinglish parallel corpus. We generate the synthetic corpus in an unsupervised fashion, and the methodology followed to generate the data is independent of the languages involved. Since the target Hindi tokens are written in Roman script, during the synthetic corpus creation we transliterate the Hindi tokens from Devanagari script to Roman script.
Code-mixing (CM) is a very common phenomenon in social media content, product descriptions and reviews, the educational domain, etc. For better understanding and ease of writing, users write posts and comments on social media in a code-mixed fashion. It is not always consistent or convenient to translate all the words, especially named entities, quality-related terms, etc.
However, translating in a code-mixed fashion requires code-mixed parallel training data. It is possible to generate a code-mixed parallel corpus from a clean parallel corpus; by the term 'clean parallel corpus', we refer to a parallel corpus consisting of non-code-mixed parallel sentences. Generally, noun tokens, noun phrases, and adjectives are the major candidates to be preserved as they are (without translation) in the code-mixed output. Identifying them requires explicit token marking using a parser or tagger (part of speech, named entity, etc.) to find the eligible candidate tokens for code-mixed replacement. Since such a method depends on linguistic resources, it is limited to high-resource languages only.
We introduce an alignment-based unsupervised approach for generating code-mixed data from a parallel corpus, which can be used to train NMT models for code-mixed text translation.
The paper is organized as follows. In section 2, we briefly mention some notable works on translation and generation of synthetic code-mixed corpus. In section 3, we describe our approach to generate synthetic code-mixed corpus along with the system description. Results are described in section 4. Finally, the work is concluded in section 5.

Related Works
Translation of code-mixed data has gained popularity in recent times. Menacer et al. (2019) conducted experiments on translating Arabic-English CM data to pure Arabic and/or pure English with Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) approaches. Dhar et al. (2018) proposed an MT augmentation pipeline which takes a CM sentence, determines the most dominant language, and translates the remaining words into that language; the resulting monolingual sentence can then be translated into another language with existing MT systems. Yang et al. (2020) used the code-mixing phenomenon to propose a pre-training strategy for NMT. Song et al. (2019) augmented code-mixed data with clean data while training the NMT system and reported that this type of data augmentation improves the translation quality of constrained words such as named entities. Singh and Solorio (2017) worked towards translating code-mixed comments from social media.

There have also been some efforts towards creating code-mixed data. Gupta et al. (2020) proposed an Encoder-Decoder based model which takes an English sentence along with linguistic features as input and generates a synthetic code-mixed sentence. Pratapa et al. (2018) explored 'Equivalence Constraint' theory to generate synthetic code-mixed data, which is used to improve the performance of a Recurrent Neural Network (RNN) based language model. While Winata et al. (2019) proposed a method to generate code-mixed data using a pointer-generator network, Garg et al. (2018) explored SeqGAN for code-mixed data generation.

System Description
In this section, we describe the synthetic parallel corpus creation, dataset and experimental setup of our system.

Unsupervised Synthetic Code-Mixed Corpus Creation
We utilize an existing parallel corpus to create synthetic code-mixed data. First, we learn word-level alignments between the source and target sentences of a given parallel corpus of a specific language pair. We use the implementation of the fast_align algorithm (Dyer et al., 2013) (https://github.com/clab/fast_align/) to obtain the alignment matrix. Let X = {x_1, x_2, ..., x_m} be the source sentence and Y = {y_1, y_2, ..., y_n} be the target sentence. We consider as candidate tokens only those alignment pairs {x_j, y_k} [for j = 1, ..., m and k = 1, ..., n] which have a one-to-one mapping. By 'one-to-one mapping', we mean that neither x_j nor y_k is aligned to any token on the other side except y_k and x_j, respectively. The obtained candidate token set is further pruned by removing the pairs where x_j is a stopword. Based on the resulting candidate set, each source token x_j is replaced with its aligned target token y_k. The generated code-mixed sentence is of the form CM = {x_1, x_2, ..., y_k, y_l, ..., x_m}. Figure 1 shows an example of an English-Hindi code-mixed sentence generated through this method.
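The replacement step above can be sketched as follows. This is an illustrative Python sketch rather than our actual implementation: it assumes the fast_align links have already been parsed into (source index, target index) pairs, and the stopword list shown is a stand-in for whichever list is actually used.

```python
from collections import Counter

# Illustrative stopword list; a stand-in for the actual list used.
STOPWORDS = {"the", "is", "a", "an", "to", "of", "and"}

def make_code_mixed(src_tokens, tgt_tokens, links):
    """Replace one-to-one-aligned, non-stopword source tokens with their
    aligned target tokens. `links` is a list of (src_idx, tgt_idx) pairs,
    e.g. parsed from fast_align's Pharaoh-format output ("0-0 1-2 ...")."""
    # Count how many links each source / target position participates in.
    src_degree = Counter(j for j, _ in links)
    tgt_degree = Counter(k for _, k in links)
    out = list(src_tokens)
    for j, k in links:
        # Keep only one-to-one links, and skip stopwords before replacing.
        if src_degree[j] == 1 and tgt_degree[k] == 1 \
                and src_tokens[j].lower() not in STOPWORDS:
            out[j] = tgt_tokens[k]
    return out
```

For example, with hypothetical tokens, `make_code_mixed(["I", "like", "mangoes"], ["mujhe", "aam", "pasand", "hain"], [(1, 2), (2, 1)])` yields `["I", "pasand", "aam"]`, while a target token aligned to two source tokens blocks both replacements.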

Romanization of the Hindi text
The task is to generate Hinglish data in which Hindi words are written in Roman script, but in the generated synthetic code-mixed corpus, Hindi words are written in Devanagari script. To convert the Devanagari script to Roman script, we utilize a Python-based transliteration tool. We also create another version of the synthetic code-mixed corpus by replacing two consecutive vowels with a single vowel (Belinkov and Bisk, 2018). We call this version the synthetic code-mixed corpus with user patterns. The main reason to create this noisy version of the corpus is to simulate the writing patterns of users when they write romanized code-mixed sentences in real life; for example, a user may write 'Paani' (water) as 'Pani'. This vowel replacement is done only on the target side (Hinglish) of the synthetic code-mixed corpus; the source (English) side is kept as it is. The gold corpus provided by the organizers is not modified in any way.
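One simple reading of this noising step can be sketched as below, under the assumption that "two consecutive vowels" means a doubled (identical) Roman vowel, as in the 'Paani' → 'Pani' example; the actual rule used may differ.

```python
import re

def collapse_double_vowels(text):
    """Collapse a doubled Roman vowel into a single one, e.g. 'paani' -> 'pani'.
    Only the romanized Hinglish (target) side is passed through this."""
    # Backreference \1 matches a repeat of the vowel just captured.
    return re.sub(r"([aeiou])\1", r"\1", text, flags=re.IGNORECASE)
```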

Dataset
We use the English-Hindi IIT Bombay parallel corpus (Kunchukuttan et al., 2018). We tokenize and truecase English using the Moses tokenizer and truecaser scripts (Koehn et al., 2007), and tokenize Hindi with the Indic-nlp-library. We remove sentences with length greater than 150 tokens and create the synthetic code-mixed corpus from the resulting corpus as described earlier.
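The length filter can be sketched as follows; it is our assumption here, for illustration, that the 150-token limit applies to either side of a pair, and the whitespace split stands in for the actual Moses / Indic-nlp tokenization.

```python
def filter_by_length(src_lines, tgt_lines, max_len=150):
    """Drop parallel sentence pairs where either side exceeds max_len tokens.
    Whitespace tokenization is a stand-in for the actual tokenizers used."""
    return [(s, t) for s, t in zip(src_lines, tgt_lines)
            if len(s.split()) <= max_len and len(t.split()) <= max_len]
```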
The statistics of data used in the experiments are shown in Table 1.

Experimental Setup
We conduct experiments with the Transformer-based Encoder-Decoder NMT architecture (Vaswani et al., 2017). We use 6-layer Encoder and Decoder stacks with 8 attention heads. Embedding and hidden sizes are set to 512, and the dropout rate is set to 0.1. The feed-forward layer consists of 2048 cells. The Adam optimizer (Kingma and Ba, 2015) is used for training, with 8,000 warmup steps and an initial learning rate of 2. We use SentencePiece (Kudo and Richardson, 2018) with a joint vocabulary size of 50K. Models are trained with the OpenNMT toolkit (https://opennmt.net/) (Klein et al., 2017) with a batch size of 2048 tokens until convergence, and checkpoints are created every 10,000 steps. All checkpoints created during training are averaged, and the averaged parameters are used as the best parameters for each model. During inference, the beam size is set to 5.
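For concreteness, the hyperparameters above roughly correspond to an OpenNMT-py configuration along the following lines. This is a sketch reconstructed from the reported numbers, not our actual configuration file, and flag names may differ across OpenNMT versions.

```yaml
# Sketch of a Transformer configuration matching the reported settings.
encoder_type: transformer
decoder_type: transformer
layers: 6                     # 6-layer encoder and decoder stacks
heads: 8                      # attention heads
word_vec_size: 512            # embedding size
rnn_size: 512                 # hidden size
transformer_ff: 2048          # feed-forward layer cells
dropout: 0.1
optim: adam
learning_rate: 2
warmup_steps: 8000
batch_size: 2048
batch_type: tokens            # batch size counted in tokens
save_checkpoint_steps: 10000
```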

Results
We train two models: a baseline model trained on the gold standard corpus, and a second model trained on the synthetic code-mixed data. We upload our models' predictions on the test set provided by the organizers to the shared task leaderboard (https://ritual.uh.edu/lince/leaderboard). The test set contains 960 sentences. Our model achieved a BLEU (Papineni et al., 2002) score of 10.09. Table 2 shows the BLEU scores obtained by the trained models on the Development and Test sets. Table 3 shows some sample translations.

Conclusion
In this paper, we described our submission to the shared task on MT for English → Hinglish at CALCS 2021. We submitted a system trained on a synthetic code-mixed corpus generated in an unsupervised way. We trained NMT models on the synthetic code-mixed corpus and on the gold standard data provided by the organizers. On the test set, the model trained over the gold data provided by the workshop achieves 2.45 BLEU points, while the model trained over our generated synthetic CM data yields a BLEU score of 10.09. We believe the proposed method for generating synthetic code-mixed data can be very useful for training MT systems in code-mixed settings, as it does not require any linguistic resources.