Modeling Code-Switch Languages Using Bilingual Parallel Corpus

Language modeling is the technique of estimating the probability of a sequence of words. A bilingual language model is expected to model the sequential dependency of words across languages, which is difficult due to the inherent lack of suitable training data as well as the diverse syntactic structures across languages. We propose a bilingual attention language model (BALM) that simultaneously optimizes a language modeling objective and a quasi-translation objective to model both the monolingual and the cross-lingual sequential dependency. The attention mechanism learns the bilingual context from a parallel corpus. BALM achieves state-of-the-art performance on the SEAME code-switch database, reducing the perplexity by 20.5% over the best reported result. We also apply BALM to bilingual lexicon induction and language normalization tasks to validate the idea.


Introduction
Monolingual language modeling has enabled many NLP tasks (Devlin et al., 2019; Dai et al., 2019; Radford et al., 2019). However, bilingual language modeling has not been as well studied. Recent advances in cross-lingual word embedding (CLWE), which projects words of different languages into a shared embedding space for cross-lingual representation (Devlin et al., 2019; Lample and Conneau, 2019), make some cross-lingual applications possible. Unfortunately, these embeddings are not optimized to model the sequential dependency for word prediction in a bilingual text.
In this paper, we propose a bilingual language model that learns word embeddings to represent equivalent words between two languages and, more importantly, models the sequential dependency of words across languages at the same time. For instance, the model should be able to predict the appropriate word to fill in the blank of a code-switched sentence meaning "The movie last night ( )", given the bilingual context around the blank. Such a sentence is an example of code-switching or code-mixing (henceforth, CS), where a bilingual speaker alternates words of two or more languages within a single sentence. The switches can happen at sentence boundaries or word boundaries, and for some agglutinative languages even within words. Code-switching is common in spoken and, to some extent, written communication in many multilingual societies, such as Southeast Asia. Hence, the study of code-switching in linguistics and bilingual language modeling is becoming imperative, especially for NLP tasks such as code-switching automatic speech recognition (ASR) (Adel et al., 2013b; Li and Fung, 2013) and cross-lingual language normalization. It is tempting to think that, given enough code-switching text data, bilingual language modeling could be approached in the same way as monolingual modeling. The main challenge is the lack of such CS data. We note that CS mainly occurs in the spoken form, and CS does not occur in every sentence. Therefore, collecting enough pure CS data is simply not practical or even feasible (Lee et al., 2017; Pratapa et al., 2018).
The problem is further exacerbated by the syntactic constraints of two diverse languages, such as Chinese and English. Three dominant theories seek to explain the syntactic formation of CS sentences: the Matrix Language Frame theory (Myers-Scotton, 1997), which holds that a CS sentence conforms to the grammar of its matrix language; the Equivalence Constraint theory (Poplack, 2000; Sankoff, 1998), which further constrains the intra-sentential CS points to syntactic boundaries shared by both languages; and the Functional Head Constraint theory (Di Sciullo et al., 1986; Belazi et al., 1994), which imposes constraints on the functional head and its complements. A bilingual language model should be able to predict a word, either in the matrix language or otherwise, given either a bilingual or a monolingual context. Therefore, it has to respect the respective monolingual word sequential dependencies, the cross-lingual word correspondence, as well as the switching rules between languages. The contributions of this paper are summarized as follows:

1. We propose an attention-based, autoregressive model, the bilingual attention language model (BALM), that not only learns the latent alignment from a parallel corpus for cross-lingual word embedding but also captures the word sequential dependency.
2. Adhering to the Matrix Language Frame theory (Myers-Scotton, 1997) and the Equivalence Constraint theory (Poplack, 2000; Sankoff, 1998), we implement an objective function that jointly optimizes the cross-entropy loss as the monolingual constraint and the quasi-translation loss as the cross-lingual constraint.
3. We show that BALM can learn from bilingual parallel data without the need for CS data. When adapted on CS data, it outperforms the best reported result on the SEAME dataset in the perplexity test. We also successfully apply BALM to bilingual lexicon induction and language normalization tasks to validate the idea.

Related Work
Several prior studies related to bilingual language modeling are the inspiration for this work. Cross-lingual correspondence: Several studies focus on projecting words of different languages onto a common embedding space to establish cross-lingual correspondence. One idea is to train a model using bilingual information from corpora aligned at the sentence level (Zou et al., 2013; Hermann and Blunsom, 2014; Luong et al., 2015) or document level (Vulic and Moens, 2016; Levy et al., 2017). Another is to exploit isomorphic structure (Conneau et al., 2017; Artetxe et al., 2018), dictionaries (Mikolov et al., 2013; Faruqui and Dyer, 2014; Huang et al., 2015; Zhang et al., 2016), shared cognates and vocabulary (Hauer et al., 2017; Smith et al., 2017), or numerals (Artetxe et al., 2017) through ad-hoc projection.
As the above approaches do not explicitly consider the sequential dependency of words, the resulting embeddings do not encode word ordering information. Multilingual techniques such as M-BERT (Devlin et al., 2019) and XLM (Lample and Conneau, 2019) do not explicitly model the syntactic constraints of CS as formulated in the Equivalence Constraint theory, and thus do not make full use of information that could potentially improve their performance.
Code-switching modeling: Another school of thought extends monolingual language modeling techniques to accommodate code-switched content. Adel et al. (2013b, 2014) use factored language models and recurrent neural network (RNN) language models to improve the bilingual language model for CS ASR rescoring. They include additional linguistic information such as Part-of-Speech tags and language identifiers to improve model generalization. Inversion constraints (Li and Fung, 2013) and Functional Head constraints (Li and Fung, 2014) have also been used in language models for the ASR decoding process. Other work uses cross-lingual embedding to tie the input and output layers and incorporates classes in the RNN language model. While these models are effective, they rely on the availability of CS training data and are therefore not easily scalable. To address this, we propose a way to make use of existing, abundant parallel corpora. The method is explained in Section 3.3.
Code-switching text generation: Closer to our line of research, Pratapa et al. (2018) propose to use synthetic data generated following the Equivalence Constraint theory, while others apply the Matrix Language Frame theory. In these works, a parser or an aligner is required to process the parallel corpus, followed by the standard monolingual language modeling process. Such techniques suffer from inaccurate alignment or parsing errors, which are carried forward when training the language model. More recently, Winata et al. (2019) propose a technique to generate neural-based synthetic data from parallel sentences, in which a Pointer-Gen network is used to synthesize CS data without an external aligner or parser. In this paper, we propose to learn the bilingual context and the CS language model jointly by attending to the parallel sentences directly, without the need for an external aligner or parser, or for explicitly generating synthetic data.

Bilingual Attention Language Model
Next, we discuss the motivation and the theoretical formulation of the proposed Bilingual Attention Language Model (BALM). In a bilingual text, we could encounter a sequence of words, $w = w^{l_1}_1, w^{l_2}_2, \dots, w^{l_2}_t, \dots, w^{l_1}_T$, code-mixed between languages $l_1$ and $l_2$. However, such code-mixed training data are not easily available. Let us assume that only a sentence-level parallel corpus between $l_1$ and $l_2$ is available to us.
Assuming the validity of the Matrix Language Frame theory and the Equivalence Constraint theory, the above code-switch sentence, $w$, can be constructed from two parallel sentences, $w^{l_1} = w^{l_1}_1, w^{l_1}_2, \dots, w^{l_1}_{T_1}$ and $w^{l_2} = w^{l_2}_1, w^{l_2}_2, \dots, w^{l_2}_{T_2}$. In the monolingual case, the language model maximizes the log-likelihood of $p(w_t \mid w_{<t})$, which effectively captures the monolingual word sequential dependency. In the CS case, we would also like to maximize $p(w_t \mid w_{<t})$, but the bilingual context $w_{<t}$ is non-existent during training. In the subsequent sections, we explain how to encode the bilingual context using an attention mechanism.

Background
A bilingual language model has to be built on a common word representation, and the continuous-space word embedding is an effective solution. We first draw some principled insights from the cross-lingual word embedding (CLWE) literature, which motivates this work.
Building on the idea of CLWE, we refer to the general form of the CLWE loss function, $\mathcal{J}$, as summarized in the literature. The monolingual language constraint, $\mathcal{L}$, which could be implemented with negative sampling, preserves the monolingual integrity. Importantly, there also has to be a cross-lingual constraint, $\Omega$, which could be the mean squared error (MSE) between the $l_2$ embedding space, $X^{l_2} = \{x^{l_2}_i\}$, and the transformed $l_1$ embedding space, $X^{l_1} = \{x^{l_1}_i\}$. We use $x_i$ to denote the embedding of a word $w_i$, which is also referred to as a token, and $v$ to denote the vocabulary size. The cross-lingual constraint $\Omega$ maps the two monolingual embedding spaces into a common space using a transformation matrix $A$. The CLWE network can also be jointly learned (Luong et al., 2015) with the alignment information as the regularization loss, $\Omega$. While CLWE lays the foundation for many cross-lingual applications, it is not designed to model word sequential dependency.
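One plausible instantiation of this general form, with the MSE implementation of $\Omega$ described above (a sketch; the authors' exact equations may differ), is:

```latex
\mathcal{J} = \mathcal{L}\left(X^{l_1}\right) + \mathcal{L}\left(X^{l_2}\right) + \Omega\left(X^{l_1}, X^{l_2}\right),
\qquad
\Omega\left(X^{l_1}, X^{l_2}\right) = \sum_{i=1}^{v} \left\lVert A\,x^{l_1}_i - x^{l_2}_i \right\rVert^2 .
```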

Bilingual Objective
We draw inspiration from the CLWE loss function and extend the objective to model word sequential dependency while preserving the general form. The monolingual objective, $\mathcal{L}(X^l)$, as formulated in Equation 3, is set to be the cross-entropy loss between the target distribution, $y^l$, and the predicted distribution, $\log p(w^l_t \mid w^l_{<t})$, for the respective language, which preserves the monolingual word sequential order.
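A sketch of this monolingual cross-entropy objective, with $y^l_t$ taken as the one-hot target distribution at step $t$ (the exact form of Equation 3 may differ):

```latex
\mathcal{L}\left(X^{l}\right) = - \sum_{t=1}^{T} y^{l}_t \cdot \log p\left(w^{l}_t \mid w^{l}_{<t}\right).
```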
This allows the bilingual language model to adhere to the monolingual syntactic rules of the Matrix Language Frame and the Equivalence Constraint theories during word prediction, so that the dominant language still abides by its own syntactic principles. We also define a quasi-translation loss, $\Omega$, that optimizes the model to learn the correspondence of tokens between languages as well as the dependency between the current token in $l_1$ and the preceding context in $l_2$. The quasi-translation loss can be interpreted as satisfying the code-switching principles described by the two theories.
Equation 4 is the quasi-translation loss, $\Omega_{l_1 l_2 \to l_1}$, used when predicting a word in $l_1$ given a bilingual context. Similarly, we have $\Omega_{l_1 l_2 \to l_2}$ for predicting a word in $l_2$.
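A sketch of the two quasi-translation losses, consistent with the description above (the exact form of Equation 4 may differ): each token of one language is predicted conditioned on the full sentence of the other language plus the preceding tokens of its own language.

```latex
\Omega_{l_1 l_2 \to l_1} = - \sum_{t=1}^{T_1} y^{l_1}_t \cdot \log p\left(w^{l_1}_t \mid w^{l_2}, w^{l_1}_{<t}\right),
\qquad
\Omega_{l_1 l_2 \to l_2} = - \sum_{t=1}^{T_2} y^{l_2}_t \cdot \log p\left(w^{l_2}_t \mid w^{l_1}, w^{l_2}_{<t}\right).
```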

Bilingual Attention
Motivated by the self-attention model (Vaswani et al., 2017), we hypothesize that an autoregressive translation-cum-language modeling objective could leverage parallel sentences to learn the bilingual context. To start with, let us consider a monolingual case that deals with $l_1$. We define a transformer language model, $f$, using a causal mask (Radford et al., 2019), which can be further broken down into individual layers $f^n$ for a total of $N$ layers. The model takes in the embedding, $x^{l_1}_t = \mathrm{embed}(w^{l_1}_t)$, of each word $w^{l_1}_t$ in $l_1$ at the first layer, $f^1_1$, and the output encodes the contextual information as a weighted sum of the preceding context, $f^1 = f^1_2(\mathrm{Attention}(x^{l_1}_{<t}))$. In this way, the output of the last layer, $f^N_2$, contains the information necessary for decoding $p(w^{l_1}_t \mid w^{l_1}_{<t})$. This process is carried out on each monolingual side of the parallel data, for $l_1$ and $l_2$ respectively, to minimize the loss function in Equation 3.

Figure 1: (a) Trained on a parallel sentence pair $l_1 l_2$, "i like you" and "我喜欢你", BALM learns to predict the next $l_2$ word, "你", given its $l_2$ context $x^{l_2}_{<3}$, "我喜欢", and its whole-sentence translation $x^{l_1}_{<5}$, "i like you". (b) During perplexity evaluation, BALM estimates the probability $p(\text{"you"} \mid w_{<5})$, given a bilingual context $w_{<5}$, "他也是 like". (c) Normalizing an $l_1 l_2$ code-switch sentence to $l_1$ with BALM by generating the $l_1$ sentence sequentially in an auto-regressive manner. $x = \mathrm{embed}(w)$ is the cross-lingual word embedding layer, and the transpose of the embedding weight is used as the output projection layer to decode the word distribution.
Extending the monolingual context to include words of the other language, we enable the model to learn from a bilingual context, as shown in Figure 1a. The question is how to find the appropriate context in both $l_1$ and $l_2$ to predict a word in $l_2$. The attention mechanism with the quasi-translation loss provides a solution. Figure 1a illustrates the $l_1 l_2 \to l_2$ training case.
At the last layer, the encoded output for time step $t$ in $l_2$ will be $f^N_2(\mathrm{Attention}(x^{l_1}, x^{l_2}_{\le t}))$. It is important to note that the model architecture allows a learnable alignment between the current word, $x_t$, and its preceding context in its own language $l_2$, as well as the whole-sentence translation $x^{l_1}$ in $l_1$. The use of the preceding context can be seen as an autoregressive process over the words in a sentence.
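A minimal sketch of this attention pattern, assuming a PyTorch implementation in which the $l_1$ translation and the $l_2$ sentence are concatenated into one input sequence (the concatenation layout and helper name are illustrative, not taken from the authors' code):

```python
import torch

def bilingual_attention_mask(len_l1: int, len_l2: int) -> torch.Tensor:
    """Attention mask for the l1l2 -> l2 training case.

    Positions 0..len_l1-1 hold the full l1 translation; positions
    len_l1..len_l1+len_l2-1 hold the l2 sentence.  Each l2 position may
    attend to the whole l1 sentence and, causally, to its own l2 prefix.
    True marks an allowed attention edge.
    """
    total = len_l1 + len_l2
    mask = torch.zeros(total, total, dtype=torch.bool)
    # l1 positions attend causally within l1 (monolingual objective).
    mask[:len_l1, :len_l1] = torch.tril(torch.ones(len_l1, len_l1, dtype=torch.bool))
    # l2 positions attend to the entire l1 translation ...
    mask[len_l1:, :len_l1] = True
    # ... and causally to the preceding l2 tokens.
    mask[len_l1:, len_l1:] = torch.tril(torch.ones(len_l2, len_l2, dtype=torch.bool))
    return mask

# Example: "i like you" (3 tokens) paired with "我 喜欢 你" (3 tokens).
print(bilingual_attention_mask(3, 3).int())
```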
As the predicted word always follows its preceding context sequentially, the word order in the matrix language matters in BALM. However, the attention mechanism does not attempt to distinguish word order within the encoded context, which is a weighted sum of the bilingual context (see discussions in Section 3.5). This can be observed in the quasi-translation loss, as formulated in Equation 4.

Training and Inference
During training, we use the two sides of the parallel corpus independently as two monolingual corpora and both sides together as the bilingual constraint. When presented with monolingual text in l 1 or l 2 , the network learns to attend to the words in either l 1 or l 2 using a causal mask for monolingual word prediction. When presented with l 1 l 2 parallel sentences, and predicting a word in l 1 or l 2 , the network learns to attend to the bilingual context for word prediction.
To summarize, given a parallel corpus, BALM is trained with four input → output pairs: $l_1 \to l_1$, $l_2 \to l_2$, $l_1 l_2 \to l_1$, and $l_1 l_2 \to l_2$. The bilingual attention in theory allows BALM to take any of $l_1$, $l_2$, or $l_1 l_2$ as input and generate any of $l_1$, $l_2$, or $l_1 l_2$ as output, in six possible combinations. $l_1 l_2 \to l_1, l_2$ represents the code-switch language modeling task of our interest. For brevity, we only illustrate the case of $l_1 l_2 \to l_2$ in Figure 1a; the four training configurations are sketched below.
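A small sketch of how the four training configurations could be enumerated from one parallel sentence pair (the helper and the concatenation order are illustrative assumptions, not the authors' implementation; the per-mode masking follows the attention pattern described in Section 3.3):

```python
def make_training_examples(pair):
    """Yield the four input -> target configurations used to train BALM.

    `pair` is a (l1_tokens, l2_tokens) tuple from the parallel corpus.
    Each example is (context_tokens, target_tokens, mode).
    """
    l1, l2 = pair
    yield (l1,      l1, "l1 -> l1")      # monolingual objective on l1
    yield (l2,      l2, "l2 -> l2")      # monolingual objective on l2
    yield (l2 + l1, l1, "l1l2 -> l1")    # quasi-translation objective
    yield (l1 + l2, l2, "l1l2 -> l2")    # quasi-translation objective

for example in make_training_examples((["i", "like", "you"], ["我", "喜欢", "你"])):
    print(example)
```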
At run-time inference, we do not have the two parallel sentences, but rather a code-switch sentence that consists of a mixture of words $w_{<t}$ from the two languages, as in Figure 1b. To predict $p(w^{l_2}_t \mid w_{<t})$ for a code-switch sentence at run time, we assume that the model has encountered some variants of the bilingual context through $\mathrm{Attention}(x^{l_1}, x^{l_2}_{<t})$. In this way, the model can estimate the run-time probability according to the similarity between the encoding of the code-switch sequence, $w_{<t}$, and the learned bilingual representations. The attention-based alignment is expected to find the appropriate bilingual context that was trained under the objective function to maximize $p(w^{l_2}_t \mid w^{l_1}, w^{l_2}_{<t})$.

Positional Embedding
In stark contrast to the masked language model (MLM), which employs positional embedding on top of its sequence ordering invariant setup, BALM does not use positional embedding. We argue that under the auto-regressive objective, positional embedding is not necessary.
In BALM, the amount of information in an auto-regressive setup is strictly increasing. Taking one of its intermediate layers as an example, the hidden representation for the current token, $h_t$, is a weighted sum of the previous tokens, where the weights are computed through the learned query and key matrices, $A_Q$ and $A_K$:

$h_t = a_{1,t} x_1 + a_{2,t} x_2 + \cdots + a_{t,t} x_t, \qquad a_{n,m} = (A_K x_n) \cdot (A_Q x_m).$

In comparison, for an RNN layer the hidden state is a gated sum of the previous hidden states, i.e., $h_t = \tanh(W_h h_{t-1} + W_x x_t)$. The difference is that the weight matrix $W_h$ of the RNN is applied to the gated sum, $h_{t-1}$, at each time step, while the weight of the attention model, $a_{n,m}$, is a similarity comparison of the current token's query with the previous tokens' keys.
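A minimal numerical sketch of this weighted sum under a causal mask (NumPy; the function name is illustrative, and the softmax normalization, left implicit in the equations above, is made explicit here):

```python
import numpy as np

def causal_self_attention(X, A_Q, A_K, normalize=True):
    """Compute h_t as a weighted sum of x_1..x_t.

    X is (T, d); A_Q, A_K are (d, d) learned query/key projections.
    a[n, t] = (A_K x_n) . (A_Q x_t), restricted to n <= t by the causal mask.
    """
    T, _ = X.shape
    Q, K = X @ A_Q.T, X @ A_K.T                # Q[t] = A_Q x_t, K[n] = A_K x_n
    scores = K @ Q.T                           # scores[n, t] = (A_K x_n) . (A_Q x_t)
    H = np.zeros_like(X)
    for t in range(T):
        a = scores[: t + 1, t]                 # weights over the causal prefix
        if normalize:
            a = np.exp(a - a.max())
            a = a / a.sum()
        H[t] = a @ X[: t + 1]                  # h_t = sum_n a[n, t] * x_n
    return H

X = np.random.randn(5, 8)
A = np.random.randn(8, 8)
print(causal_self_attention(X, A, A).shape)    # (5, 8)
```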
The two networks are similar in the sense that they both compute the weights and incorporate the past information. They only differ in their implementation. We argue that the sequential information is already included in the attention model under an auto-regressive setup. Thus the positional encoding is not necessary. This is corroborated by Irie et al. (2019), which shows that the removal of positional encoding slightly improves the language model performance. By dropping the positional embedding, we can mix the bilingual context, as discussed in Section 3.3.

Datasets
We evaluate the language models on the text transcripts of the South East Asia Mandarin-English (SEAME) corpus (LDC2015S04) (Lee et al., 2017), a well-documented database for spontaneous conversational speech code-switching between Chinese Mandarin (ZH) and English (EN). A large number of CS studies were reported on SEAME.
We adopt a slightly different setup, as we focus on how BALM is able to learn from a parallel corpus alone without the need for CS training data. We use the SEAME data mainly for adaptation and evaluation. We split the SEAME Phase II text transcripts equally into three portions, labeled as Adapt, Valid, and Test in Table 1. Such a split also ensures that each component within the Test data, e.g. Test EN, is of sufficient size.
Additionally, we also split the dataset following approximately the same proportion as in the previous works (Winata et al., 2019; Lee et al., 2019) for a fair benchmarking, labeled as Train, Dev, and Eval respectively. We use a random split of 1.1M/60.8K/60.3K for the number of tokens in Train/Dev/Eval as compared to 1.2M/65K/60K in the previous works.
We use a bilingual parallel corpus from TED and OpenSubtitles (Tiedemann, 2012; Lison and Tiedemann, 2016) for BALM training because they are text transcripts of spontaneous speech similar to SEAME. The English text is tokenized using the NLTK tokenizer (Bird et al., 2009), while the Chinese text is tokenized using the Stanford Word Segmenter (Chang et al., 2008). We also develop a test set of 200 sentences for the language normalization experiments, labeled as SEAME Norm.

Experimental Setup
We conduct a series of experiments, namely BALM, Synthetic CS, CS-Only, and Mono, using the same BALM network architecture to evaluate different modeling strategies.
During training, we construct a 50K vocabulary consisting of the most frequent words in the combined SEAME and parallel datasets, of which 17.7K are unique Chinese words and 32.3K are unique English words. Only for the benchmarking in Table 3 do we use the SEAME vocabulary, a subset of the 50K vocabulary, for the perplexity evaluation, in order to meaningfully compare the perplexity with the prior work on the SEAME corpus. Unless otherwise stated, we train for 60 epochs with 100K lines per epoch and adapt for 17 epochs with the full Adapt dataset. We use the Adam optimizer (Kingma and Ba, 2014) for all experiments.

BALM: The attention mechanism largely follows the implementation of GPT (Radford et al., 2019), with 384-dimensional hidden states, 12 layers, and 12 heads. While Dai et al. (2019) report state-of-the-art results using a recurrence mechanism within the attention, we exclude this in our experiments for two reasons. Firstly, the context beyond the given parallel sentence is not meaningful after shuffling the sentences; furthermore, attending the target sequence to context beyond the source sequence may introduce noise and depart from the theoretical motivation of the experiment. Secondly, for many downstream tasks like ASR, the decoding remains at the utterance level.
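For concreteness, the setup above can be summarized as a configuration sketch (names and structure are illustrative, not taken from the authors' code):

```python
# Hypothetical configuration mirroring the setup described in this section.
balm_config = dict(
    n_layer=12, n_head=12, d_model=384,   # GPT-style decoder (Radford et al., 2019)
    vocab_size=50_000,                    # 17.7K Chinese + 32.3K English words
    tie_embeddings=True,                  # output projection = transpose of embedding
    positional_embedding=None,            # BALM drops positional embedding (Section 3.5)
    optimizer="adam",                     # Kingma and Ba (2014)
    train_epochs=60, lines_per_epoch=100_000,
    adapt_epochs=17,                      # adaptation on SEAME Adapt
)
```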
We first train BALM on the parallel corpus as described in Section 3.4. The trained network is then adapted with SEAME Adapt to bridge the domain gap, namely from $l_1 l_2 \to l_1$ and $l_1 l_2 \to l_2$ towards $l_1 l_2 \to l_1 l_2$.

Synthetic CS: In this contrastive experiment, we remove the bilingual constraint, i.e. Equation 4, from BALM and use offline synthetic CS text for training. The idea of synthetic CS is motivated by the Matrix Language Frame theory. Phrase alignment is performed on the same parallel dataset in Table 1 using Giza++ (Och and Ney, 2003). The aligned parallel sentences are then used to randomly switch phrases between the languages with an empirical probability of 0.7. At the same time, the phrase table is used to inhibit switches within frequently occurring phrases. We train the same BALM network on both the synthetic CS data and the monolingual sides of the parallel data. The model is finally adapted with SEAME Adapt.

Mono & CS-Only
In the Mono setting, we simply use the parallel corpus as two independent monolingual corpora without any form of bilingual constraint. The monolingual sentences are passed in alternation between the two languages to ensure a balanced training curriculum. The model is finally adapted with SEAME Adapt. This is similar to Multilingual BERT pre-training under causal masking, followed by fine-tuning on the task dataset. The CS-Only model is trained only on the SEAME Adapt data without involving the parallel data.

Positional Embedding: We also implement the sinusoidal encoding matrix (Vaswani et al., 2017) and a learned weight matrix for the positional embedding in models PE-S and PE-L, respectively. Both models are built on top of the BALM model using the same training data. The positional embedding is added element-wise to the word embedding layer. For the learned matrix in PE-L, we treat it as another lookup table and simply extend the embedding matrix with an additional entry for each position $pos$. In the case of sinusoidal encoding, the extended matrix is fixed to

$PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/384}), \qquad PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/384}).$
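A small sketch of the fixed sinusoidal matrix used by PE-S (NumPy; the function name is illustrative):

```python
import numpy as np

def sinusoidal_pe(max_pos: int, d_model: int = 384) -> np.ndarray:
    """Fixed sinusoidal positional encoding (Vaswani et al., 2017)."""
    pe = np.zeros((max_pos, d_model))
    pos = np.arange(max_pos)[:, None]          # positions 0..max_pos-1
    i = np.arange(0, d_model, 2)[None, :]      # even dimensions 0, 2, 4, ...
    pe[:, 0::2] = np.sin(pos / 10000 ** (i / d_model))
    pe[:, 1::2] = np.cos(pos / 10000 ** (i / d_model))
    return pe

# Added element-wise to the word embeddings: x = embed(w) + sinusoidal_pe(T)[:T]
print(sinusoidal_pe(10).shape)                 # (10, 384)
```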

CS Point Perplexity
The perplexity test on SEAME Test CS describes the overall performance of the model on CS sentences. However, as shown in Table 1, code-switching only takes place at an average switch-point fraction (SPF) of 23% within the CS sentences. We would therefore like to take a closer look at how the model performs at those CS points, which is the main focus of this work. A lower perplexity suggests a better word prediction ability. The perplexity is evaluated on SEAME Test CS, in which we only include the perplexity of words that are preceded by a word of a different language.
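A sketch of how this restricted perplexity could be computed, assuming token-level log-probabilities from the language model and per-token language labels (the helper is illustrative, not the authors' evaluation script):

```python
import math

def cs_point_perplexity(lang_ids, token_log_probs):
    """Perplexity restricted to code-switch points.

    A position t is a CS point if its language differs from that of the
    preceding token.  `token_log_probs[t]` is log p(w_t | w_<t) from the
    language model; `lang_ids[t]` is e.g. 'EN' or 'ZH'.
    """
    log_probs = [lp for t, lp in enumerate(token_log_probs)
                 if t > 0 and lang_ids[t] != lang_ids[t - 1]]
    return math.exp(-sum(log_probs) / len(log_probs))

# Example: "他 也 是 like you"; only the "like" position is a CS point.
print(cs_point_perplexity(["ZH", "ZH", "ZH", "EN", "EN"],
                          [-2.1, -1.3, -0.9, -4.2, -1.8]))
```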
For the bilingual lexicon induction (BLI) experiment, the same parallel corpus in Table 1 is used for training, and the same dictionary is used for testing for all models.
VecMap (Artetxe et al., 2018) is a projection-based CLWE alignment method which gives robust results using an unsupervised strategy (Glavaš et al., 2019). The respective monolingual embeddings are trained using fastText (Bojanowski et al., 2017) with the default setup and 384 dimensions; the two monolingual embedding spaces are then mapped using VecMap. BiSkip (Luong et al., 2015) is jointly trained with a word alignment constraint; we prepare the alignment using fast_align (Dyer et al., 2013) following a procedure similar to that outlined in the paper. For the BALM model, we use the embedding from the model without the SEAME adaptation phase for a fair comparison. These three models represent three distinct categories of CLWE implementation, i.e. projection-based, jointly learned, and deep-learning-based embeddings for VecMap, BiSkip, and BALM, respectively.

Language Normalization
Suppose that $l_1$ is the matrix language in a code-switch sentence $w$. We would like to replace all $l_2$ tokens in $w$ with their $l_1$ equivalents, which is referred to as $l_1 l_2 \to l_1$. The normalized sentence $\hat{w}^{l_1}$ can be expressed as $\hat{w}^{l_1} = \arg\max_{w^{l_1}} p(w^{l_1} \mid w)$.
In practice, when $w$ is presented to BALM, as illustrated in Figure 1c, the network predicts a sequence of tokens in the matrix language one by one. The generated tokens $w^{l_1}_{i<t}$ become the context for the next token $w^{l_1}_t$ in an auto-regressive manner. The sequence with the highest probability is computed using beam search, which terminates when the eos token is observed.
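A greedy sketch of this normalization loop (the paper uses beam search; `model.next_token_distribution` is a hypothetical interface returning a token-to-probability mapping, not the authors' API):

```python
def normalize_greedy(model, cs_tokens, bos="<bos>", eos="<eos>", max_len=50):
    """Generate the l1 normalization of a code-switch sentence token by token.

    The code-switch sentence `cs_tokens` stays in the context, and l1 tokens
    are generated auto-regressively until `eos` (greedy choice shown here;
    beam search would instead keep the k best partial sequences).
    """
    generated = [bos]
    while len(generated) < max_len:
        probs = model.next_token_distribution(cs_tokens + generated)
        next_token = max(probs, key=probs.get)   # arg max over the vocabulary
        if next_token == eos:
            break
        generated.append(next_token)
    return generated[1:]                         # drop the bos marker
```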

Perplexity Evaluation
We conduct two perplexity (PPL) test experiments, one for comparing the variations of BALM, another for benchmarking against the state-of-the-art.
Comparing the variations of BALM, we report the overall test PPL as well as the PPL of each component, i.e. Test EN/ZH and Test CS, for each model discussed in Section 4.2. As observed in Table 2, BALM outperforms all other variations, with a PPL of 118.25 on SEAME Test. Mono, Synthetic CS, and BALM all benefit from the use of data beyond SEAME Adapt, and BALM represents the most effective use of the bilingual parallel corpus. All results are reported for the model that performs best on the SEAME Valid dataset.
Benchmarking against the state-of-the-art, we show in Table 3 that BALM achieves a PPL of 103.20 on SEAME Eval, which is a 20.52% reduction over the best reported result. The monolingual data contribute to a better word embedding, which is an integral part of BALM. As the quality of the word embedding improves, so does the word prediction at the CS points. We also observe that Synthetic CS shows an 8.6% PPL reduction, from 554.71 to 506.81, with the inclusion of the synthetic CS data. This is consistent with the observations in Pratapa et al. (2018).
We further observe that BALM, which is trained on exactly the same parallel data as Synthetic CS but with a different objective function, outperforms Synthetic CS by 5.73%. This suggests that the quasi-translation loss function is an effective regularizer to enforce the linguistic constraints governing CS. We also confirm our aforementioned hypothesis that the self-attention mechanism is able to attend to the appropriate bilingual context for word prediction without violating the grammar of the matrix language, by qualitatively analysing the sentences generated from the model before adaptation on SEAME Adapt.

Positional embedding
Both the sinusoidal encoding and the learned encoding matrix degrade the model performance, by 14.4% and 21.2% respectively. This result confirms our hypothesis that the attention mechanism is able to encode the mixed context well without positional embedding. The improvement of BALM over BALM+PE in the monolingual PPL also demonstrates that dropping the positional embedding is in fact beneficial.

Table 5: BLI accuracy (%) for different methods, using the same parallel corpus in Table 1 for training and the same dictionary for testing.

Method                          EN-ZH     ZH-EN
VecMap (Artetxe et al., 2018)   57.13%    48.46%
BiSkip (Luong et al., 2015)     35.54%    33.39%
BALM (our work)                 56.24%    55.87%
Vocabulary Coverage             38.84%    31.72%

Bilingual Lexicon Induction
As shown in Table 5, when inferring ZH (Chinese) words from EN (English), BALM (56.24%) shows comparable performance with VecMap (57.13%), which reported state-of-the-art results in CLWE. This comparable performance supports the premise that the model is able to find word-level correspondence, which enables the subsequent bilingual context encoding. Moreover, BALM significantly outperforms VecMap in the reverse direction, ZH-EN, with an absolute 7.41% improvement (48.46% → 55.87%). Two points are worth noting. Firstly, Glavaš et al. (2019) point out that BLI cannot be used as the only metric to assess word embedding quality, and we do not intend to do so. Secondly, VecMap does not require the corpus to be parallel while our method does, so the comparison does not showcase the full ability of VecMap. However, the focus of this paper is not on comparing cross-lingual word embedding methods; we use the BLI performance as evidence that BALM does not compromise its CLWE quality while focusing on sequential modeling.

Language Normalization
As a code-switch sentence follows the syntactic structure of the matrix language, we assume that the matrix language is known in advance, for example, English for sentences 1-3 and Chinese for sentences 4-6 in Table 4. We observe that mistakes can sometimes take the form of a poor translation; however, the normalized sentence still maintains an appropriate structure in the matrix language. The 6th sentence of Table 4 is an example, which is wrongly normalized to "to do my assignment" (in the sense of task) instead of "hand in my assignment" (in the sense of homework). We report the WER on SEAME Norm between the normalized text and the reference. We observe in Table 2 that, with a WER of 19.73%, BALM outperforms the other models, in line with the perplexity tests.

Conclusion
We note that BALM is an implementation of $l_1 l_2 \to l_1 l_2$. The experiments show that it outperforms all state-of-the-art models in the literature on similar tasks. The results validate the idea of bilingual attention. The same BALM can be used in $l_1 l_2 \to l_1$ or $l_2$ for language normalization. It can be further extended to $l_1 \to l_1 l_2$ or $l_2 \to l_1 l_2$ for code-switch sentence generation, and $l_1 \to l_2$ or $l_2 \to l_1$ for machine translation.