Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences

Training code-switched language models is difficult due to lack of data and the complexity of the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this issue. However, these require external word alignments or constituency parsers, which produce erroneous results for distant language pairs. We propose a sequence-to-sequence model using a copy mechanism to generate code-switching data by leveraging parallel monolingual translations from a limited source of code-switching data. The model learns how to combine words from parallel sentences and identifies when to switch from one language to the other. Moreover, it captures code-switching constraints by attending to and aligning the words in the inputs, without requiring any external knowledge. Based on experimental results, the language model trained with the generated sentences achieves state-of-the-art performance and improves end-to-end automatic speech recognition.


Introduction
Code-switching is a common linguistic phenomenon in multilingual communities, in which a person begins speaking or writing in one language and then switches to another in the same sentence. It arises in response to social factors as a way of communicating in a multicultural society. In practice, code-switching varies with the traditions, beliefs, and normative values of the respective communities. Linguists have studied the code-switching phenomenon and proposed a number of linguistic theories (Poplack, 1978; Pfaff, 1979; Poplack, 1980; Belazi et al., 1994). Code-switching is not produced indiscriminately, but follows syntactic constraints. Many linguists have formulated various constraints to define a general rule for code-switching (Poplack, 1978, 1980; Belazi et al., 1994). However, these constraints cannot be postulated as a universal rule for all code-switching scenarios, especially for languages that are syntactically divergent (Berk-Seligson, 1986), such as English and Mandarin, whose word alignments can have an inverted order.
Building a language model (LM) and an automatic speech recognition (ASR) system that can handle intra-sentential code-switching is known to be a difficult research challenge. The main reason lies in the unpredictability of code-switching points in an utterance and data scarcity. Creating a large-scale code-switching dataset is also very expensive. Therefore, code-switching data generation methods to augment existing datasets are a useful workaround.
Existing methods that apply equivalence constraint theory to generate code-switching sentences (Li and Fung, 2012; Pratapa et al., 2018) may suffer performance issues, as they inherit erroneous results from the word aligner and the part-of-speech (POS) tagger. Thus, this approach is neither reliable nor effective. Recently, Garg et al. (2018) proposed a SeqGAN-based model to generate code-switching sentences. Indeed, the model learns how to generate new synthetic sentences. However, the distribution of the generated sentences is very different from that of real code-switching data, which leads to underperforming results.
To overcome the challenges in the existing works, we introduce a neural-based code-switching data generator model using pointer-generator networks (Pointer-Gen) (See et al., 2017) to learn code-switching constraints from a limited source of code-switching data and leverage their translations in both languages. Intuitively, the copy mechanism can be formulated as an end-to-end solution to copy words from parallel monolingual sentences by aligning and reordering the word positions to form a grammatical code-switching sentence. This method solves the two issues in the existing works by removing the dependence on the aligner or tagger, and by generating new sentences with a distribution similar to the original dataset. Interestingly, this method can learn the alignment effectively without a word aligner or tagger. As an additional advantage, we demonstrate its interpretability by showing the attention weights learned by the model, which represent the code-switching constraints. Our contributions are summarized as follows:
• We propose a language-agnostic method to generate code-switching sentences using a pointer-generator network (See et al., 2017) that learns when to switch and copy words from parallel sentences, without using external word alignments or constituency parsers. By using the generated data in language model training, we achieve the state-of-the-art performance in perplexity and also improve the end-to-end ASR on an English-Mandarin code-switching dataset.
• We present an implementation applying the equivalence constraint theory to languages that have significantly different grammar structures, such as English and Mandarin, for sentence generation. We also show the effectiveness of our neural-based approach in generating new code-switching sentences compared to the equivalence constraint and Seq-GAN (Garg et al., 2018).
• We thoroughly analyze our generation results and further examine how our model identifies code-switching points to show its interpretability.

Generating Code-Switching Data
In this section, we describe our proposed model to generate code-switching sentences using a pointer-generator network. Then, we briefly list the assumptions of the equivalence constraint (EC) theory, and explain our application of EC theory to sentence generation. We call the dominant language the matrix language (L1) and the inserted language the embedded language (L2), following the definitions from Myers-Scotton (2001). Let us define Q = {Q_1, ..., Q_T} as a set of L1 sentences and E = {E_1, ..., E_T} as a set of L2 sentences, where T is the number of sentences, and each Q_t = {q_{1,t}, ..., q_{m,t}} and E_t = {e_{1,t}, ..., e_{n,t}} are sentences with m and n words, respectively. E is the set of parallel sentences corresponding to Q.

Pointer-Gen
Pointer-Gen was initially proposed to learn when to copy words directly from the input to the output in text summarization, and it has since been successfully applied to other natural language processing tasks, such as comment generation. Pointer-Gen leverages the information from the input to ensure high-quality generation, especially when the output sequence consists of elements from the input sequence, as code-switching sequences do. We propose to use Pointer-Gen by leveraging parallel monolingual sentences to generate code-switching sentences. The approach is depicted in Figure 1. The pointer-generator model is trained on concatenated sequences of parallel sentences (Q, E) to generate code-switching sentences, constrained by code-switching texts. The words of the input are fed into the encoder, a bidirectional long short-term memory (LSTM) that produces a hidden state h_t in each step t. The decoder is a unidirectional LSTM receiving the word embedding of the previous word. The attention distribution a_t is a standard attention with general scoring (Luong et al., 2015); it considers all encoder hidden states to derive the context vector h*_t. For each decoding step, a generation probability p_gen ∈ [0,1] is calculated, which weighs the choice between generating the next token from the vocabulary and copying a word from the source text:

p_gen = σ(w_{h*}^T h*_t + w_s^T s_t + w_x^T x_t + b_ptr),    (1)

where w_{h*}, w_s, and w_x are trainable parameters and b_ptr is a scalar bias. The vocabulary distribution P_voc(w) is calculated from the concatenation of the decoder state s_t and the context vector h*_t. The vocabulary distribution P_voc(w) and the attention distribution a_t are then weighted by p_gen and summed to obtain the final distribution P(w):

P(w) = p_gen P_voc(w) + (1 − p_gen) Σ_{i:w_i=w} a_t^i.    (2)
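The mixing of the two distributions in Eq. (2) can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code; the function name and arguments are our own:

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attention, src_ids):
    """Mix the generation and copy distributions as in Eq. (2).

    p_gen     -- scalar gate in [0, 1] for this decoding step
    p_vocab   -- (vocab_size,) softmax over the output vocabulary
    attention -- (src_len,) attention weights over the input tokens
    src_ids   -- (src_len,) vocabulary id of each input token
    """
    p_final = p_gen * p_vocab
    # Scatter-add the copy probabilities onto the ids of the source words;
    # a word appearing twice in the input accumulates both attention weights.
    np.add.at(p_final, src_ids, (1.0 - p_gen) * attention)
    return p_final
```

Because both input distributions sum to one and the gate is convex, the output is again a valid probability distribution, and source words can receive probability mass even when they are out of the decoder vocabulary.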

Figure 1: Pointer-Gen model, which includes an RNN encoder and an RNN decoder. The parallel sentence is the input of the encoder, and in each decoding step, the decoder generates a new token.
Figure 2: Example of the equivalence constraint (Li and Fung, 2012). Solid lines show the alignment between the matrix language (top) and the embedded language (bottom); switching is permissible there. The dotted lines denote impermissible switching.
We use beam search to select the N-best code-switching sentences.

Equivalence Constraint
Studies on the EC (Poplack, 1980, 2013) show that code-switching only occurs where it does not violate the syntactic rules of either language. An example of English-Mandarin mixed-language sentence generation is shown in Figure 2, where EC theory does not allow the word "其实" to come after "是" in Chinese, or the word "is" to come after "actually". Pratapa et al. (2018) apply the EC to English-Spanish language modeling with a strong assumption. We are working with English and Mandarin, which have distinctive grammar structures (e.g., part-of-speech tags), so applying a constituency parser would give us erroneous results. Thus, we simplify sentences into a linear structure, and we allow lexical substitution on non-crossing alignments between parallel sentences. Alignments between an L1 sentence Q_t and an L2 sentence E_t comprise pairs of source and target word indices (a_i, b_i), where the source indices form a vector sorted in ascending order. The alignment between a_i and b_i does not satisfy the constraint if there exists a pair a_j and b_j such that (a_i < a_j and b_i > b_j) or (a_i > a_j and b_i < b_j). If a switch occurs at such a point, it changes the grammatical order in both languages; thus, the switch is not acceptable. During the generation step, we allow any switches that do not violate the constraint. We generate synthetic code-switching data by the following steps:

1. Align the L1 sentences Q and the L2 sentences E using fast_align (Dyer et al., 2013). We use the mapping from the L1 sentences to the L2 sentences.
2. Permute the alignments from step (1) and use them to generate new sequences by replacing a phrase in the L1 sentence with the aligned phrase in the L2 sentence.
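The non-crossing condition above can be checked directly on the alignment pairs. The following sketch (function names are ours, not from the paper) tests whether a switch at a given aligned pair would violate the constraint:

```python
def crosses(i, j, alignment):
    """True if alignment pairs i = (a_i, b_i) and j = (a_j, b_j) cross,
    i.e. the switch would invert the grammatical order in one language."""
    (a_i, b_i), (a_j, b_j) = alignment[i], alignment[j]
    return (a_i < a_j and b_i > b_j) or (a_i > a_j and b_i < b_j)

def switch_is_permissible(k, alignment):
    """A switch at alignment pair k is allowed only if that pair does not
    cross any other pair (the alignment is locally order-preserving)."""
    return all(not crosses(k, j, alignment)
               for j in range(len(alignment)) if j != k)
```

For an inverted-order pair such as English-Mandarin alignments (0, 1) and (1, 0), neither position admits a switch, which matches the "is"/"其实" example in Figure 2.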
End-to-End Code-Switching ASR

The end-to-end ASR model accepts a spectrogram as the input, instead of log-Mel filterbank features (Zhou et al., 2018), and predicts characters. It consists of N encoder and decoder layers. Convolutional layers are added to learn a universal audio representation and generate the input embedding. We employ multi-head attention to allow the model to jointly attend to information from different representation subspaces at different positions. For proficiency in recognizing the individual languages, we train a multilingual ASR system on monolingual speech. The idea is to use it as a pretrained model and transfer its knowledge while training the model on code-switching speech. This is an effective method to initialize the parameters of low-resource ASR tasks such as code-switching. A catastrophic forgetting issue arises when we train one language after the other. Therefore, we solve the issue by applying a multi-task learning strategy: we jointly train on speech from both languages, taking the same number of samples for each language in every batch to retain the information of both tasks.
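The balanced-batch strategy can be sketched as a simple sampler that over-samples the smaller corpus by cycling through it. This is an illustrative sketch under our own naming; the paper does not specify its sampler implementation:

```python
import itertools
import random

def balanced_batches(english_utts, mandarin_utts, batch_size, seed=0):
    """Yield batches containing the same number of utterances per language,
    cycling over each shuffled corpus so the smaller one is over-sampled
    and neither language is forgotten during joint training."""
    rng = random.Random(seed)
    per_lang = batch_size // 2
    en = itertools.cycle(rng.sample(english_utts, len(english_utts)))
    zh = itertools.cycle(rng.sample(mandarin_utts, len(mandarin_utts)))
    while True:
        yield ([next(en) for _ in range(per_lang)] +
               [next(zh) for _ in range(per_lang)])
```

Because both iterators cycle independently, every batch is exactly half English and half Mandarin regardless of the corpus size ratio.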
At inference time, we use beam search, selecting the best sub-sequence scored using the softmax probability of the characters. We define P(Y) as the score of a sentence Y. We incorporate the language model probability p_lm(Y) to select more natural code-switching sequences from the generation candidates, and a word count is added to avoid generating very short sentences. P(Y) is calculated as follows:

P(Y) = α log P_trans(Y|X) + β log p_lm(Y) + γ wc(Y),    (3)

where α is the parameter to control the decoding probability from the probability of characters from the decoder P_trans(Y|X), β is the parameter to control the language model probability p_lm(Y), and γ is the parameter to control the effect of the word count wc(Y).
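The rescoring in Eq. (3) is a log-linear combination, which can be written as a one-line scoring function. The coefficient values below are placeholders for illustration, not the tuned values from the paper:

```python
def beam_score(log_p_trans, log_p_lm, n_words,
               alpha=1.0, beta=0.3, gamma=0.1):
    """Score one ASR hypothesis as in Eq. (3): decoder log-probability,
    shallow LM fusion, and a word-count bonus that discourages the beam
    from preferring overly short outputs."""
    return alpha * log_p_trans + beta * log_p_lm + gamma * n_words
```

During beam search, each surviving hypothesis is ranked by this score, so a candidate that the code-switching LM finds more natural can overtake one with a slightly higher acoustic score.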
We use the SEAME Phase II English-Mandarin code-switching corpus, following the data splits from Winata et al. (2018a). The details are depicted in Table 1. We tokenize words using the Stanford NLP toolkit (Manning et al., 2014). For monolingual speech datasets, we use HKUST (Liu et al., 2006), comprising spontaneous Mandarin Chinese telephone speech recordings, and Common Voice, an open accented-English dataset collected by Mozilla. We split Chinese words into characters to avoid word boundary issues, similarly to Garg et al. (2018). We generate L1 sentences and L2 sentences by translating the training set of SEAME Phase II into English and Chinese using the Google NMT system (to enable reproduction of the results, we release the translated data). Then, we use them to generate 270,531 new code-switching sentences, thrice the size of the training set. Table 2 shows the statistics of the newly generated sentences. To calculate the complexity of our real and generated code-switching corpora, we use the following measures:

Switch-Point Fraction (SPF)
This measure calculates the number of switch-points in a sentence divided by the total number of word boundaries (Pratapa et al., 2018). We define a "switch-point" as a position within the sentence at which the languages of the words on either side differ.
Code Mixing Index (CMI) This measure counts the number of switches in a corpus (Gambäck and Das, 2014). At the utterance level, it can be computed by finding the most frequent language in the utterance and then counting the frequency of the words belonging to all other languages present:

CMI(x) = 100 · (N(x) − max_i(t_i) + P(x)) / (2 N(x)),

where N(x) is the number of tokens in utterance x, t_i is the number of tokens in language i, and P(x) is the number of code-switching points in utterance x. We compute this metric at the corpus level by averaging the values over all sentences.
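Both metrics operate on the per-token language labels of an utterance. The sketch below implements them as described above; note that Gambäck and Das define several CMI variants, and this follows the one with the switch-point term P(x):

```python
def spf(langs):
    """Switch-point fraction: switch-points / word boundaries."""
    boundaries = len(langs) - 1
    switches = sum(a != b for a, b in zip(langs, langs[1:]))
    return switches / boundaries if boundaries > 0 else 0.0

def cmi(langs):
    """Utterance-level Code Mixing Index (Gambäck and Das, 2014),
    normalized so a fully monolingual utterance scores 0."""
    n = len(langs)
    if n == 0:
        return 0.0
    max_lang = max(langs.count(lang) for lang in set(langs))
    p = sum(a != b for a, b in zip(langs, langs[1:]))
    return 100.0 * (n - max_lang + p) / (2.0 * n)
```

For example, the label sequence en-en-zh-en has two switch-points over three boundaries (SPF = 2/3) and a CMI of 37.5, while any monolingual utterance scores 0 on both.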

LM Training Strategy Comparison
We generate code-switching sentences using three methods: EC theory, SeqGAN (Garg et al., 2018), and Pointer-Gen. To find the best way of leveraging the generated data, we compare the following training strategies: (1) is the baseline, training with only real code-switching data (rCS). (2a-2c) train with only augmented data. (3a-3c) train with the concatenation of augmented data and rCS. (4a-4c) run a two-step training: first training the model only with augmented data, then fine-tuning with rCS. Our early hypothesis is that the results from (2a) and (2b) will not be as good as the baseline, but that combining them with rCS will outperform the baseline. We expect the result of (2c) to be on par with (1), since Pointer-Gen learns patterns from the rCS dataset and generates sequences with similar code-switching points.

Experimental Setup
In this section, we present the settings we use to generate code-switching data, and train our language model and end-to-end ASR.

Pointer-Gen
The pointer-generator model has 500-dimensional hidden states. We use 50k words as our vocabulary for the source and target. We optimize the training by Stochastic Gradient Descent with an initial learning rate of 1.0 and decay of 0.5. We generate the three best sequences using beam search with five beams, and sample 270,531 sentences, thrice the amount of the code-switched training data.
EC We generate 270,531 sentences, thrice the amount of the code-switched training data. To make a fair comparison, we limit the number of switches to two per sentence so as to obtain SPF and CMI values similar to those of Pointer-Gen.

SeqGAN
We implement the SeqGAN model using a PyTorch implementation, and use our best trained LM baseline as the generator in SeqGAN. We sample 270,531 sentences from the generator, thrice the amount of the code-switched training data (with a maximum sentence length of 20).
LM In this work, we focus on sentence generation, so we evaluate all data with the same LM architecture for comparison: a two-layer LSTM with a hidden size of 200, unrolled for 35 steps. The embedding size is equal to the LSTM hidden size for weight tying (Press and Wolf, 2017). We optimize our model using SGD with an initial learning rate of 20. If there is no improvement during evaluation, we reduce the learning rate by a factor of 0.75. In each step, we apply dropout to both the embedding layer and the recurrent network. The gradient is clipped to a maximum of 0.25. We optimize the validation loss and apply early stopping after five iterations without any improvement. In the fine-tuning step of training strategies (4a-4c), the initial learning rate is set to 1.
End-to-end ASR We convert the inputs into normalized frame-wise spectrograms from 16-kHz audio. Our transformer model consists of two encoder layers and two decoder layers. An Adam optimizer with Noam warmup is used for training, with an initial learning rate of 1e-4. The model has a hidden size of 1024, a key dimension of 64, and a value dimension of 64. The training data are randomly shuffled every epoch. Our character set is the concatenation of English letters, the Chinese characters found in the corpus, spaces, and apostrophes. In the multilingual ASR pretraining, we train the model for 18 epochs. Since the sizes of the datasets differ, we over-sample the smaller dataset. The fine-tuning step takes place after the pretraining, using code-switching data. At inference time, we explore the hypotheses using beam search with eight beams and a batch size of 1.

Table 3: Results of perplexity (PPL) on the valid and test sets for different training strategies. We report the overall PPL, the code-switching points PPL (en-zh) and (zh-en), and the monolingual segments PPL (en-en) and (zh-zh).

Evaluation Metrics
We employ the following metrics to measure the performance of our models.
Token-level Perplexity (PPL) For the LM, we calculate the PPL over characters in Mandarin Chinese and words in English. The reason is that some Chinese words inside the SEAME corpus are not well tokenized, and the tokenization results are inconsistent; using characters instead of words in Chinese alleviates these word boundary issues. The PPL is calculated by taking the exponential of the average token-level loss. To show the effectiveness of our approach in modeling the probability of switching, we split the perplexity computation into monolingual segments (en-en) and (zh-zh), and code-switching segments (en-zh) and (zh-en).
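The segment-wise split can be computed by bucketing each token's loss by the language pair of the previous and current token. This is a minimal sketch with our own function names, not the evaluation code from the paper:

```python
import math

def perplexity(losses):
    """PPL = exp(mean token-level cross-entropy loss)."""
    return math.exp(sum(losses) / len(losses))

def segment_perplexities(token_langs, losses):
    """Bucket per-token losses by the (previous, current) token languages,
    yielding monolingual (en-en, zh-zh) and code-switching (en-zh, zh-en)
    perplexities. token_langs and losses are aligned per token."""
    buckets = {}
    for (prev, cur), loss in zip(zip(token_langs, token_langs[1:]),
                                 losses[1:]):
        buckets.setdefault((prev, cur), []).append(loss)
    return {pair: perplexity(vals) for pair, vals in buckets.items()}
```

A high (en-zh) or (zh-en) perplexity relative to the monolingual buckets then directly indicates that the model struggles at switch-points.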
Character Error Rate (CER) For our ASR, we compute the overall CER and also report the individual CERs for Mandarin Chinese (zh) and English (en). The metric calculates the distance between two sequences as the Levenshtein distance.
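CER is the Levenshtein edit distance normalized by the reference length, which a standard dynamic-programming recurrence computes in O(|ref| · |hyp|):

```python
def levenshtein(ref, hyp):
    """Edit distance between two character sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)
```

Since Chinese is scored at the character level, the same function covers both languages; per-language CERs are obtained by restricting ref and hyp to the segments of that language.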

Results & Discussion
Figure 4: Visualization of the pointer-generator attention weights on the input words at each time-step during inference. The y-axis indicates the generated sequence, and the x-axis indicates the input words. We show the code-switching points where our model attends to words in the L1 and L2 sentences. Left: ("no","没有") and ("then","然后"); right: ("we","我们"), ("share","一起"), and ("room","房间").

LM In Table 3, we can see the perplexities of the test set evaluated with different training strategies. Pointer-Gen consistently performs better than the state-of-the-art models EC and SeqGAN. Comparing the models trained using only generated samples, (2a-2b) leads to
the undesirable results also mentioned by Pratapa et al. (2018), but this does not apply to Pointer-Gen (2c): we achieve results similar to the model trained using only real code-switching data, rCS. This demonstrates the quality of our data generated using Pointer-Gen. In general, combining any generated samples with real code-switching data improves the language model performance on both code-switching segments and monolingual segments. Applying concatenation is less effective than the two-step training strategy, and the two-step training strategy achieves the state-of-the-art performance.
As shown in Table 2, we generate new n-grams, including code-switching phrases. This leads to a more robust model trained with both generated data and real code-switching data. We can clearly see that Pointer-Gen-generated samples have a distribution more similar to the real code-switching data than SeqGAN does, which shows the advantage of our proposed method.

Effect of Data Size
To understand the importance of data size, we train our model with different amounts of generated data. Figure 3 shows the PPL of the models for each amount. An interesting finding is that our model trained with only 78K samples of Pointer-Gen data (the same number of samples as rCS) achieves a PPL similar to the model trained with only rCS, while SeqGAN and EC have significantly higher PPLs. We can also see that 10K samples of Pointer-Gen data are as good as 270K samples of EC data. In general, the number of samples is positively correlated with the improvement in performance.

ASR Evaluation
We evaluate our proposed sentence generation method on an end-to-end ASR system. Table 4 shows the CER of our ASR systems, as well as the individual CER on each language. Based on the experimental results, pretraining is able to reduce the error rate by 1.64%, as it corrects the spelling mistakes in the prediction. After we add LM (rCS) to the decoding step, the error rate can be reduced to 32.25%. Finally, we replace the LM with LM (Pointer-Gen → rCS), and it further decreases the error rate by 1.18%.

Model Interpretability
We can interpret a Pointer-Gen model by extracting its attention matrices and analyzing the activation scores. We show a visualization of the attention weights in Figure 4. Each square in the heatmap corresponds to the attention score of an input word. In each time-step, the attention scores are used to select the words to be generated. As we can observe in the figure, in some cases our model attends to words that are translations of each other, for example, ("no","没有"), ("then","然后"), ("we","我们"), ("share","一起"), and ("room","房间"). This indicates that the model can identify code-switching points, word alignments, and translations without being given any explicit information.

Table 5: The most common English and Mandarin Chinese part-of-speech tags that trigger code-switching. We report the frequency ratio of Pointer-Gen-generated sentences compared to the real code-switching data, and provide an example for each POS tag.

Table 5 shows the most common English and Mandarin Chinese POS tags that trigger code-switching. The distribution of word triggers in the Pointer-Gen data is similar to that in the real code-switching data, indicating our model's ability to learn similar code-switching points. Nouns are the most frequent English word triggers; they are used to construct an optimal interaction by using cognate words and to avoid confusion. English adverbs such as "then" and "so" act as phrase or sentence connectors between two language phrases for intra-sentential and inter-sentential code-switching. On the other hand, Chinese transitional words such as the measure word "个" or the associative word "的" are frequently used as inter-lingual word associations.

Related Work
Code-switching language modeling research has focused on building models that handle mixed-language sentences and on generating synthetic data to solve the data scarcity issue. The first statistical approach using a linguistic theory was introduced by Li and Fung (2012), who adapted the EC on monolingual sentence pairs during the decoding step of an ASR system. Ying and Fung (2014) implemented a functional-head constraint lattice parser with a weighted finite-state transducer to reduce the search space of a code-switching ASR system. Then, Adel et al. (2013a) extended recurrent neural networks (RNNs) by adding POS information to the input layer and a factorized output layer with a language identifier. The factorized RNNs were also combined with an n-gram backoff model using linear interpolation (Adel et al., 2013b), and syntactic and semantic features were added to them (Adel et al., 2015). Baheti et al. (2017) adapted an effective curriculum learning approach by training a network with monolingual corpora of two languages, and subsequently training on code-switched data. A further investigation of EC and curriculum learning showed an improvement in English-Spanish language modeling (Pratapa et al., 2018), and a multi-task learning approach was introduced to train the syntax representation of languages by constraining the language generator (Winata et al., 2018a). Garg et al. (2018) proposed to use SeqGAN (Yu et al., 2017) for generating new mixed-language sequences. Winata et al. (2018b) leveraged character representations to address out-of-vocabulary words in code-switching named entity recognition. Finally, Winata et al. (2019) proposed a method to represent code-switching sentences using language-agnostic meta-representations.

Conclusion
We propose a novel method for generating synthetic code-switching sentences using Pointer-Gen, which learns how to copy words from parallel corpora. Our model learns code-switching points by attending to and aligning the parallel input words, without requiring any word alignments or constituency parsers. More importantly, it can be effectively used for languages that are syntactically different, such as English and Mandarin Chinese. Our language model trained on the generated data outperforms equivalence constraint theory-based models. We also show that the learned language model can be used to improve the performance of an end-to-end automatic speech recognition system.