A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning

Code-mixing, the interleaving of two or more languages within a sentence or discourse, is ubiquitous in multilingual societies. The lack of code-mixed training data is one of the major concerns for the development of end-to-end neural network-based models to be deployed for a variety of natural language processing (NLP) applications. A potential solution is to either manually create or crowd-source code-mixed labelled data for the task at hand, but this requires considerable human effort and is often infeasible because of the language-specific diversity in code-mixed text. To circumvent the data scarcity issue, we propose an effective deep learning approach for automatically generating code-mixed text from English to multiple languages without any parallel data. To train the neural network, we create synthetic code-mixed texts from an available parallel corpus by modelling various linguistic properties of code-mixing. Our code-mixed text generator is built upon the encoder-decoder framework, where the encoder is augmented with linguistic and task-agnostic features obtained from a transformer-based language model. We also transfer knowledge from a neural machine translation (NMT) model to warm-start the training of the code-mixed generator. Experimental results and in-depth analysis show the effectiveness of our proposed code-mixed text generation approach on eight diverse language pairs.


Introduction
Multilingual content is very prominent on social media, especially in multilingual communities such as the Indian ones. Code-mixing is a common expression of multilingualism in informal text and speech, in which there is a switch between two languages, frequently with one rendered in the character set of the other. It has long been a means of communication in multi-cultural and multi-lingual societies, and varies according to the culture, beliefs, and moral values of the respective communities.
Linguists have studied the phenomenon of code-mixing, put forward many linguistic hypotheses (Belazi et al., 1994; Pfaff, 1979; Poplack, 1978), and formulated various constraints (Sankoff and Poplack, 1981; Di Sciullo et al., 1986; Joshi, 1982) in an attempt to define a general rule for code-mixing. However, these constraints cannot be postulated as a universal rule covering all scenarios of code-mixing, particularly for syntactically divergent languages (Berk-Seligson, 1986).
In recent times, pre-trained language model based architectures (Devlin et al., 2019; Radford et al., 2019) have become the state-of-the-art models for language understanding and generation. The data to train such models comes from huge corpora, available in the form of Wikipedia, book corpora, etc. Although these are readily available in various languages, there is a scarcity of comparable amounts of data in code-mixed form that could be used to train state-of-the-art transformer (Vaswani et al., 2017) based language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and XLM (Lample and Conneau, 2019). The existing benchmark datasets for various NLP tasks could also be transformed into a code-mixed setup, and subsequently be leveraged to assess a model's flexibility under the multilingual framework. Creating large-scale code-mixed datasets for such tasks is expensive and time-consuming, as it requires considerable human effort and language expertise to generate them manually. Therefore, it is necessary to build an automated code-mixed generation system capable of modeling the intra-sentential language-switching phenomenon.
In this paper, we formulate the code-mixed phenomenon using a feature-rich and pre-trained language model assisted encoder-decoder paradigm. The feature-rich encoder helps the model capture the linguistic phenomenon of code-mixing, especially deciding when to switch between the two languages. Similarly, the pre-trained language model provides task-agnostic features which help to encode generic properties of the input. We adopt a gating mechanism to fuse the features of the pre-trained language model and the encoder. Additionally, we perform transfer learning to learn a prior distribution from a pre-trained NMT model, whose weights are used to initialize the code-mixed generation network. Transfer learning guides the code-mixed generator to produce syntactically correct and fluent sentences.
We summarize the contributions of our work below: (i) We propose a robust and generic method for code-mixed text generation. Our method exploits the capabilities of linguistic feature-rich encoding and a pre-trained language model assisted encoder to capture code-mixed formation across languages. Our model is further tailored to generate syntactically correct, adequate and fluent code-mixed sentences using the prior knowledge acquired through transfer learning. (ii) To warm-start the training, we devise a robust and generic technique to automatically create synthetic code-mixed sentences by modeling linguistic properties using a parallel corpus. To the best of our knowledge, this is the first attempt to propose a generic method that produces correct and fluent code-mixed sentences for multiple language pairs. The generated synthetic dataset will be a useful resource for machine translation and multilingual applications. (iii) We demonstrate, with detailed empirical evaluations, the effectiveness of our proposed approach on eight different language pairs, viz. English-Hindi (en-hi), English-Bengali (en-bn), English-Malayalam (en-ml), English-Tamil (en-ta), English-Telugu (en-te), English-French (en-fr), English-German (en-de) and English-Spanish (en-es).

Related Work
In the literature, there have been efforts to create code-mixed texts by leveraging linguistic properties. Pratapa et al. (2018) explored the equivalence constraint theory to construct artificial code-mixed data to reduce the perplexity of an RNN-based language model. Winata et al. (2018) proposed a multitask learning framework to address the issue of data scarcity in the code-mixed setting. In particular, they leveraged linguistic information through a shared syntax representation, jointly learned over Part-of-Speech (PoS) tagging and language modeling on code-switched utterances. Garg et al. (2018) exploited SeqGAN for the generation of synthetic code-mixed language sequences. Most recently, Winata et al. (2019a) utilized a language-agnostic meta-representation method to represent code-mixed sentences. There are also other studies (Adel et al., 2013a,b, 2015; Choudhury et al., 2017; Winata et al., 2018; Gonen and Goldberg, 2018; Samanta et al., 2019) on code-mixed language modelling.
In contrast to these existing works, we first provide a linguistically motivated technique to create code-mixed datasets for multiple languages with the help of parallel corpora (English to the respective language). Thereafter, we utilize this data to develop a neural model that generates code-mixed sentences from English sentences. Our current work has a wider scope, as the underlying architecture can be used to harvest code-mixed data for various NLP tasks, not only the language modelling and speech recognition tasks that have generally been the focus in the literature. Moreover, whereas previous studies considered only a few language pairs for code-mixing, we propose an approach that is effective in generating code-mixed sentences for eight language pairs of diverse origins and linguistic properties.

Synthetic Code-Mixed Generation
We follow the matrix language frame (MLF) (Myers-Scotton, 1997; Joshi, 1982) theory to generate the code-mixed text. It is less restrictive and can easily be applied to many language pairs. According to MLF, a code-mixed text has a dominant language (the matrix language) and an inserted language (the embedded language). The insertions can be words or larger constituents, and they must comply with the grammatical frame of the matrix language. However, random word insertions can lead to unnatural code-mixed sentences, which are very rare in practice.
A linguistically informed strategy for inserting words or constituents can improve the quality of code-mixed text; it has also been shown in the literature (Gupta et al., 2018b) that such a strategy benefits the quality of generated code-mixed text. In our work, we utilize parallel corpora to learn the alignments between English and the other languages. Given a pair of parallel sentences, we identify words from the English side and substitute their aligned counterparts in the target sentence with the identified English words to synthesize English-embedded code-mixed sentences. The input to our synthetic code-mixed generation algorithm (details are in the Appendix) is a parallel sentence pair. We use the Indic-nlp-library 1 to tokenize the sentences of the Indic languages, and a Moses-based tokenizer 2 to tokenize the European and English language texts. Thereafter, we learn the alignment matrix, which guides the selection of the words or phrases to be mixed into the sentence.
We use the official implementation 3 of the fast-align algorithm (Dyer et al., 2013) to obtain the alignment matrix, which is then used to construct the aligned phrases between the parallel sentences. We extract the PoS (mainly adjectives), named entities (NE) and noun phrases (NP) from the English sentences, and insert them into the appropriate places of their target-language counterparts. We use the Stanford library 4 Stanza (Qi et al., 2020) to extract these linguistic units. For example, the NE 'Mahatma Gandhi' of type 'Person' is mixed into an En-Hi 5 code-mixed sentence.
The need for replacing the aligned noun phrases can be understood from the parallel sentences shown in Fig 1. In the given example, 'girl' and 'red umbrella' are the noun phrases 6 in the English sentence. To obtain the corresponding code-mixed sentence, their aligned phrases 'लड़की' and 'लाल छाता' need to be replaced with their English counterparts 'girl' and 'red umbrella', respectively. Similarly, we can see the need for choosing adjective words to be mixed into the code-mixed sentence with the following example: • En: The situation in Mumbai has not yet come to normal.
In this example, the adjective 'normal' is present in the English sentence. To form the corresponding code-mixed sentence, this adjective has to be inserted into it: the target-language (here, Hindi) word 'सामान्य' needs to be replaced with the word 'normal' in the En-Hi code-mixed sentence. We show some samples in Table 1, and provide more details in the Appendix.
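The substitution procedure described above can be sketched as follows. This is a minimal illustration, not the exact algorithm from the Appendix: the alignment format (a list of English-to-target index pairs, as produced by fast-align) and the way mixable English spans (NEs, NPs, adjectives) are passed in are assumptions made for the example.

```python
def make_synthetic_cm(en_tokens, tgt_tokens, alignment, mix_spans):
    """Replace target-side words aligned to the selected English spans
    (NEs, noun phrases, adjectives) with their English counterparts,
    keeping the target language as the matrix-language frame."""
    # Map each target index to the English indices aligned to it.
    tgt2en = {}
    for en_i, tgt_i in alignment:
        tgt2en.setdefault(tgt_i, []).append(en_i)
    # Collect the English token indices chosen for mixing.
    mixed_idx = set()
    for start, end in mix_spans:       # half-open [start, end) spans
        mixed_idx.update(range(start, end))
    out, emitted = [], set()
    for j, tok in enumerate(tgt_tokens):
        en_ids = [i for i in tgt2en.get(j, []) if i in mixed_idx]
        if en_ids:
            for i in sorted(en_ids):   # substitute each English word once
                if i not in emitted:
                    out.append(en_tokens[i])
                    emitted.add(i)
        else:
            out.append(tok)            # keep the matrix-language word
    return out
```

For the Fig 1 example, the target words aligned to the span 'red umbrella' would be replaced by the English phrase, while the rest of the Hindi sentence is left untouched.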

Methodology
We depict the architecture of our proposed model in Figure 2. Problem Statement: Given an English sentence E having m words e_1, e_2, ..., e_m, the task is to generate the code-mixed sentence Ĉ having a sequence of n words Ĉ = {y_1, y_2, ..., y_n}.

Sub-word Vocabulary
The task of generation using neural networks requires a fixed-sized vocabulary. To deal with the problem of Out-of-Vocabulary (OOV) words, we use the Byte-pair encoding (BPE) (Sennrich et al., 2016), and segment the words into sub-words. The sub-word based tokenization schemes inspired by BPE have become the norm in most of the advanced models including the very popular family of contextual language models like XLM (Lample and Conneau, 2019), GPT-2 (Radford et al., 2019), etc. In this work, we process the language pairs with the vocabulary created using the BPE.
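To illustrate how a BPE vocabulary arises, the following toy sketch learns merge operations by repeatedly merging the most frequent adjacent symbol pair, in the spirit of Sennrich et al. (2016). Real systems use the pre-learned BPE codes and tooling described in the experimental setup; this is only a conceptual sketch.

```python
import collections
import re

def learn_bpe(words, num_merges):
    """Toy BPE: start from characters (plus an end-of-word marker) and
    repeatedly merge the most frequent adjacent symbol pair."""
    vocab = collections.Counter(" ".join(w) + " </w>" for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = collections.Counter()
        for word, freq in vocab.items():
            syms = word.split()
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs as a whole-symbol pair.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = collections.Counter(
            {pattern.sub("".join(best), w): f for w, f in vocab.items()}
        )
    return merges
```

Applying the learned merges to unseen words segments them into known sub-words, which is what keeps OOV tokens out of the generation vocabulary.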

Feature-rich and Pre-trained Language Model Assisted Encoder
We introduce an encoder that is equipped with both linguistic features and pre-trained language model features. We first discuss the linguistic feature encoding added to the standard long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) encoder, and later describe the pre-trained language model feature assisted encoder. To encode the input English sentence, we use a two-layered LSTM network. First, we tokenize the English sentence into sub-word tokens using BPE. Each sub-word is mapped to a real-valued vector through an embedding layer. In addition, we incorporate linguistic features in the form of NE and PoS tags. The motivation to use these linguistic features comes from the synthetic code-mixed text generation itself (cf. Section 3), where these features guide the generation process by selecting the words to either replace with their aligned English words or keep unchanged in the code-mixed sentence. In neural generation, explicit linguistic features help the decoder decide whether to copy from the English (source) sentence or generate from the vocabulary.
The network takes the concatenation of the word embedding u_t, the NE encoding n_t and the PoS encoding p_t at each time step t, and generates the hidden state as follows:

h_t = LSTM(h_{t-1}, [u_t ⊕ n_t ⊕ p_t])

We compute the forward and backward hidden states →h_i and ←h_i, and obtain the encoder representation as the concatenation of the two hidden states, h_i = [→h_i ⊕ ←h_i].

For the task-agnostic features, we use the pre-trained XLM model (Lample and Conneau, 2019), trained with the masked language model (MLM) objective, where the task is to predict the masked words of a sentence given the remaining words. The TLM objective is an extension of MLM to parallel sentences: the input is the concatenation of the source and target sentences, a random word is masked from the concatenated sentence, and the rest of the words are used to predict it. The XLM model, trained with multiple objective functions on different languages together, has shown effectiveness on cross-lingual classification and machine translation. Since it deals with multiple languages and sets the state of the art in language generation tasks, we adopt the pre-trained XLM model to extract language model features for code-mixed generation, which is reminiscent of both the cross-lingual and generation paradigms. For the given input sentence E : {e_1, e_2, ..., e_m}, we extract the language model features L : {l_1, l_2, ..., l_m}. The extracted language model features are fused with the linguistic features as follows:

h*_t = FFN(h_t),  l*_t = FFN(l_t)
g = σ(W_g [h*_t ⊕ l*_t])
f_t = (g ⊙ h*_t) + ((1 − g) ⊙ l*_t)

where ⊕ and ⊙ are the concatenation and element-wise multiplication operators. First, we project both features h_t and l_t into the same vector space as h*_t and l*_t via feed-forward networks. Thereafter, we learn the gate value g, which controls how much of each feature becomes part of the final encoder representation f_t.
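The gated fusion step can be sketched in NumPy as follows. The tanh feed-forward projections and the single gate matrix W_g are assumptions, since the exact parameterization is not fully specified in the text; only the gating pattern (a sigmoid gate interpolating two projected features) is taken from the description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_t, l_t, W_h, W_l, W_g):
    """Fuse the encoder state h_t with the LM feature l_t via a learned
    gate: project both into a shared space, then mix element-wise."""
    h_star = np.tanh(W_h @ h_t)       # projected linguistic feature
    l_star = np.tanh(W_l @ l_t)       # projected LM feature
    # Gate computed from the concatenation of both projections.
    g = sigmoid(W_g @ np.concatenate([h_star, l_star]))
    # Convex element-wise combination controlled by the gate.
    f_t = g * h_star + (1.0 - g) * l_star
    return f_t
```

Because both projections pass through tanh and the gate is a convex weight, each component of f_t stays in (-1, 1), giving the decoder a bounded fused representation.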

Decoding with Pointer Generator
We use a one-layer LSTM network with the attention mechanism (Bahdanau et al., 2015) to generate the code-mixed sentence y_1, y_2, ..., y_n one word at a time. To deal with rare or unknown words, the decoder has the flexibility to copy words from the source via the pointing mechanism (See et al., 2017; Gulcehre et al., 2016). The LSTM decoder reads the word embedding u_{t-1} and the hidden state s_{t-1} to generate the hidden state s_t at time step t. Concretely,

s_t = LSTM(s_{t-1}, u_{t-1})

Similar to See et al. (2017), we compute the attention distribution α_t and the context vector c_t. The generation probability is computed as follows:

p_gen = σ(W_a c_t + W_b s_t + W_s u_{t-1})

where W_a, W_b and W_s are weight matrices and σ is the sigmoid function. We also consider copying words from the English sentence. The probability of copying a word w from the English sentence at time step t is computed as:

P_copy(w) = Σ_i α_{t,i} 1{w == w_i}

where 1{w == w_i} denotes the vector of length m having the value 1 where w == w_i, and 0 otherwise. The final probability distribution over the dynamic vocabulary (English and code-mixed sentence vocabulary) is calculated as:

P(w) = p_gen P_vocab(w) + (1 − p_gen) P_copy(w)   (6)
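The final distribution of equation (6) can be computed over the dynamic (extended) vocabulary as in the following sketch. The id layout (generation vocabulary ids first, source-only words appended after) is an assumption for illustration.

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attn, src_ids, ext_vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * P_copy(w), where
    P_copy(w) sums the attention mass over source positions holding w.
    src_ids[i] is the extended-vocabulary id of the i-th source token."""
    p = np.zeros(ext_vocab_size)
    p[: len(p_vocab)] = p_gen * p_vocab        # generation part
    for pos, wid in enumerate(src_ids):        # copy part via attention
        p[wid] += (1.0 - p_gen) * attn[pos]
    return p
```

Words that appear both in the vocabulary and in the source accumulate probability from both terms, while source-only (OOV) words receive mass solely through the copy term.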

Transfer Learning for Code-mixing
Transfer learning aims to improve performance on a task by using knowledge learned from a closely related task, and has shown promise in solving various problems (Torrey and Shavlik, 2010; Pan and Yang, 2009). Given an English sentence, its target-language translation (Hi) and its code-mixed version (En-Hi) share many words. Because of this underlying similarity between machine translation and code-mixed sentence generation, we adapt the transfer learning approach used in machine translation (Zoph et al., 2016; Kocmi and Bojar, 2017) to code-mixed text generation.
We first train an NMT model on a large corpus of parallel sentences, as discussed in Section 3. Next, we initialize the code-mixed text generation model with the trained NMT model, and then train it on the synthetic code-mixed dataset. Rather than initializing the code-mixed model with random parameters, we initialize it with the weights of the NMT model, thereby obtaining a strong prior distribution for code-mixed text generation. When the code-mixed generation model is initialized with the NMT weights, it starts with the prior knowledge of translating English sentences into the target language, and is then fine-tuned to adapt to the code-mixed phenomenon.
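The warm start itself amounts to copying every compatible NMT parameter into the code-mixed model before fine-tuning. A sketch of that idea, with hypothetical parameter names and assuming parameters are stored as name-to-array dictionaries:

```python
import numpy as np

def warm_start(cm_params, nmt_params):
    """Initialize the code-mixed generator from a trained NMT model:
    copy every parameter whose name and shape match, and leave the
    rest (e.g. newly added feature/gating layers) at their fresh
    initialization."""
    loaded = []
    for name, value in nmt_params.items():
        if name in cm_params and cm_params[name].shape == value.shape:
            cm_params[name] = value.copy()
            loaded.append(name)
    return cm_params, loaded
```

Fine-tuning then proceeds on the synthetic code-mixed data from this initialization instead of from random weights.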

Results and Analysis
We evaluate the performance of our proposed approach on the synthetic code-mixed text from eight different language pairs. The datasets can be found here 8 . We compare the performance of our proposed code-mixed generation model with the (i) Seq2Seq (Sutskever et al., 2014), (ii) Attentive-Seq2Seq (Bahdanau et al., 2015) and (iii) Pointer Generator (See et al., 2017) baselines.

Experimental Setup
In our experiments, we use the same vocabulary for both the encoder and decoder. For the language pairs en-hi, en-es, en-de and en-fr, we use the BPE codes 9 learned on 15 languages to segment the sentences into sub-words, and use the associated vocabulary 10 to index the sub-words. For the language pairs en-bn, en-ml, en-ta and en-te, we use the BPE codes 11 learned on 100 languages from the XLM model to segment the sentences into sub-words, and use the corresponding vocabulary to index the sub-words. The same vocabulary is used to extract the pre-trained language model features and in the corresponding NMT model for transfer learning. We use aligned multilingual word embeddings 12 of dimension 300 for the language pairs en-es, en-de, en-fr, en-hi and en-bn from Bojanowski et al. (2017). The hidden dimension of all the LSTM cells is set to 512. We use the pre-trained XLM model 14 to extract language model features of dimension 1024 for the en-hi, en-es, en-de and en-fr language pairs. For the remaining language pairs, a pre-trained model 15 trained with the MLM objective is used to extract the language model features. We use beam search with a beam size of 4 to generate the code-mixed sentences. The Adam (Kingma and Ba, 2015) optimizer is used to train the model with (i) β_1 = 0.9, (ii) β_2 = 0.999, and (iii) ϵ = 10^-8, and an initial learning rate of 0.0001. The maximum lengths of the English and code-mixed token sequences are set to 60 and 30, respectively. We set 5 as the minimum number of decoding steps for each code-mixed language pair. We use the en-hi development dataset to tune the network hyper-parameters. All model updates use a batch size of 16.
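Beam-search decoding with a fixed beam size and a minimum number of decoding steps, as configured above, can be sketched generically as follows. Here step_fn is a hypothetical stand-in for the decoder's next-token distribution; the real decoder scores sub-words with the pointer-generator distribution.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=4, max_len=30, min_len=5):
    """Generic beam search: step_fn(prefix) -> {token: prob}. Keeps the
    beam_size best partial hypotheses by total log-probability and
    blocks the end token before min_len decoding steps."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:             # hypothesis already complete
                finished.append((seq, score))
                continue
            for tok, p in step_fn(seq).items():
                if tok == eos and len(seq) - 1 < min_len:
                    continue               # enforce minimum decoding steps
                candidates.append((seq + [tok], score + math.log(p)))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    finished.extend(b for b in beams if b[0][-1] == eos)
    best = max(finished or beams, key=lambda x: x[1])
    return best[0]
```

With beam_size=4, max_len=30 and min_len=5 this mirrors the decoding configuration reported above.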

Quantitative Analysis
We report the results of our proposed model in Table 2 and Table 3, along with comparisons to the three baselines. The Pointer Generator baseline is the strongest among the baselines, achieving a maximum Bleu score of 21.45 for the en-de code-mixed language pair. Our proposed model achieves a maximum Bleu score of 24.89 for the en-fr pair; the minimum Bleu score we achieve is 14.81 for the en-te pair. We obtain lower Bleu scores for en-ta and en-te than for the other language pairs because the numbers of training samples for these pairs are very low (11,380 and 9,105, respectively). Among the European languages, our model attains the highest performance for the en-fr pair, while for the Indian languages it reports comparable performance for the en-hi and en-bn pairs. We also perform an ablation study to assess the efficacy of the model's components. We remove one component at a time from the proposed model and report the results for each language pair in Table 2 and Table 3.

Table 4 (samples):
en-de  Input:     The real problem is statesponsored lawlessness.
       Reference: Das real problem ist die vom statesponsored lawlessness.
       PG:        Das echtes problem ist die vom statesponsored Gesetz.
       Proposed:  Das real problem ist vom statesponsored lawlessness.
en-es  Input:     However we have proposed some minor changes.
       Reference: Con todo hemos propuestos algunas minor changes.
       (-) TL:    Con todo hemos propuestos algunas minor cambios.

The removal of BPE brings down the Bleu score by 0.57 (en-ta) to 0.84 (en-de); BPE helps the model mitigate the OOV issue by providing sub-word level information. Similarly, removing the PoS feature reduces the Bleu score by 0.26 (en-es) to 0.58 (en-te). The NE feature helps the en-bn code-mixed language pair the most: we observe a decrease of 1.0 Bleu points when it is removed. The LM feature, obtained from the pre-trained language model, helps the model obtain a better encoded representation; the ablation study reveals that its removal decreases the Bleu score by up to 1.36 points, with a near-similar impact on each language pair. Finally, transfer learning also proves to be an integral component of the proposed model, contributing a maximum of 2.25 Bleu points for en-fr and a minimum of 1.02 for en-te. The difference between the maximum and minimum contributions may be attributed to the fact that we have a sufficiently large parallel corpus (197,922 sentences) to train the en-fr NMT model, compared to the en-te parallel corpus (10,105). We follow the bootstrap test (Dror et al., 2018), which confirms that the performance improvements over the baselines are statistically significant (p < 0.005).

Qualitative Analysis
We assess the quality of the generated code-mixed text and show samples in Table 4. We observe that the code-mixed sentences generated by the PG model are able to copy the entities from the given English sentence, but the generated sentences are incomplete and less fluent than the reference sentences. For example, in the en-hi pair, the PG-based code-mixed sentence misses the word 'main' and copies 'India's' rather than generating 'India का', which would read as a more natural, human-like code-mixed sentence.
Our analysis also reveals that the code-mixed sentences generated without transfer learning lack fluency; examples can be seen in the (-) TL outputs (in Table 4) for en-hi and en-fr. In contrast, the output of the proposed model benefits from both the pointer generator and transfer learning to produce adequate, fluent and complete human-like code-mixed sentences. We observe that the proposed model learns when to switch between the languages, and when to either copy an entity/phrase from the English sentence or generate from the vocabulary. This can be seen in the en-hi language pair, where the model copies the phrase 'main strength' from the English sentence and also switches between the languages at the appropriate time step by generating the correct word from the vocabulary.
We perform a human evaluation to judge the quality of the generated code-mixed text. We randomly sample 100 English sentences from the en-hi code-mixed dataset and ask three English and Hindi speakers to manually formulate the corresponding code-mixed sentences. These are then used to evaluate the quality of the generated code-mixed sentences. We ask the speakers to score (from 1 to 5) each machine-generated code-mixed sentence with respect to the human-generated one; the rating reflects how natural and human-like the code-mixed sentence sounds compared to the human one. A score of 1 indicates strong disagreement between the machine-generated and the human-formulated code-mixed sentence; similarly, 2, 3, 4 and 5 are the categorical scores for Disagreement, Not Sure, Agreement and Strong Agreement, respectively.
We also compute the automatic evaluation metrics BLEU, Rouge-L and Meteor. The comparison between the different approaches on human and automatic evaluation metrics is reported in Table 5. The reported human evaluation score is the average over the three human experts. The proposed model achieves a human evaluation (naturalness) score of 3.26, compared to 4.19 for the synthetic generation. Note that the synthetic text generation algorithm requires a parallel corpus, whereas our neural generation model does not require any parallel data except at the time of warm-starting with the synthetic data. Our model also achieves a better human evaluation score (3.26) than the strongest baseline, the pointer generator (2.34).
Error Analysis: We closely analyze the outputs of our proposed model to understand the challenges it faces. We take up the en-hi language pair, study the errors made by the proposed approach, and categorize them into the following types: (1) Reference Inaccuracy: Errors in the word alignment phase propagate and lead to inaccurate reference code-mixed sentences. Since we use these sentences to train the generator model, they introduce errors into the generated code-mixed sentences as well. This issue could be reduced with an improved alignment algorithm.
(2) Missing/Incorrect Words: This is one of the most common error types, where the model generates incorrect words/phrases. Missing or incorrect words cause fluency problems in the generated code-mixed sentence. We also observed that the majority of missing words are function words, while incorrectly generated words belong to the content word category.
(3) Factual Inaccuracy: Our proposed model sometimes generates factually incorrect NEs. These errors were mainly seen in longer sentences, where the model was confused about which entity to copy/generate in the given context. (4) Code-Mixed Inaccuracy: The model sometimes produces sentences which either violate code-mixing theory or are unnatural (not human-like). (5) Rare Language Pairs: The system makes more errors on the en-ta and en-te language pairs, which can be explained by the comparatively smaller number of training samples for these pairs; these errors could be reduced by training the system with a sufficient number of samples. (6) Others: We place the remaining errors in the 'others' category, which includes repeated words, inadequate sentence generation, extra word generation, etc. We also observe that the majority of errors occur when the input sentence is longer than 12 words.
We randomly take a sample of 100 sentences from the generated En-Hi code-mixed text and categorize the errors using the six aforementioned types. We find that the top-3 most frequent error types (Missing/Incorrect Words, Reference Inaccuracy and Code-Mixed Inaccuracy) account for 27.21%, 23.37%, and 17.44% of the errors, respectively.

Conclusion
In this paper, we have proposed an effective neural method that couples linguistic and pre-trained feature representations with transfer learning to generate code-mixed sentences. To train and evaluate the proposed approach, we have introduced a linguistically motivated method for code-mixed sentence generation using the parallel sentences of any particular language pair. Our experimental results and in-depth analysis show that the feature representations and transfer learning together effectively improve the model performance and the quality of the generated code-mixed sentences. We have shown the effectiveness of the proposed approach on eight different language pairs. In future work, we plan to explore unsupervised neural approaches for code-mixed text generation.