Meta-Transfer Learning for Code-Switched Speech Recognition

An increasing number of people in the world today speak mixed languages as a result of being multilingual. However, building a speech recognition system for code-switching remains difficult due to limited resources and the expense and significant effort required to collect mixed-language data. We therefore propose a new learning method, meta-transfer learning, to transfer knowledge to a code-switched speech recognition system in a low-resource setting by judiciously extracting information from high-resource monolingual datasets. Our model learns to recognize the individual languages and transfers that knowledge so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data. Based on experimental results, our model outperforms existing baselines on speech recognition and language modeling tasks, and converges faster.


Introduction
In bilingual or multilingual communities, speakers can easily switch between different languages within a conversation (Wang et al., 2009). People who know how to code-switch will mix languages in response to social factors as a way of communicating in a multicultural society. Generally, code-switching speakers switch languages by inserting words or phrases from the embedded language into the matrix language. This can occur within a sentence, which is known as intra-sentential code-switching, or between two matrix-language sentences, which is called inter-sentential code-switching (Heredia and Altarriba, 2001).
Learning a code-switching automatic speech recognition (ASR) model has been a challenging task for decades due to data scarcity and the difficulty of capturing similar phonemes across different languages. Several approaches have focused on generating synthetic speech data from monolingual resources (Nakayama et al., 2018). However, these methods are not guaranteed to generate natural code-switching speech or text. Another line of work explores the feasibility of leveraging large monolingual speech data in pre-training and then fine-tuning the model on a limited amount of code-switching data, which has been found useful for improving performance (Li et al., 2011). However, the transferability of these pre-training approaches is not optimized for extracting useful knowledge from each individual language in the context of code-switching, and even after the fine-tuning step, the model forgets the previously learned monolingual tasks.

In this paper, we introduce a new method, meta-transfer learning, to learn to transfer knowledge from source monolingual resources to a code-switching model. Our approach extends model-agnostic meta-learning (MAML) (Finn et al., 2017) to not only train with monolingual source language resources but also to optimize the update on the code-switching data. This allows the model to leverage monolingual resources in a way that is optimized for recognizing code-switching speech. Figure 1 illustrates the optimization flow of the model. Different from joint training, meta-transfer learning computes the first-order optimization using the gradients from monolingual resources constrained to the code-switching validation set. Thus, instead of learning one model that generalizes to all tasks, we focus on judiciously extracting useful information from the monolingual resources.
The main contribution of this paper is a novel method to efficiently transfer information from monolingual resources to a code-switched speech recognition system. We show the effectiveness of our approach in terms of error rate, and show that our approach also converges faster. We further show that our approach is applicable to other natural language tasks, such as code-switching language modeling.

Related Work
Meta-learning. Our idea of learning to transfer knowledge from source monolingual resources to a code-switching model comes from MAML (Finn et al., 2017). Probabilistic MAML (Finn et al., 2018) is an extension of MAML with better classification coverage. Meta-learning has been applied to natural language and speech processing (Hospedales et al., 2020). MAML has also been extended to the personalized text generation domain, where it successfully produces more persona-consistent dialogue. Gu et al. (2018) and Qian and Yu (2019) propose to apply meta-learning to low-resource learning, and Yu et al. (2020) apply MAML to hypernym detection. Several applications have also been proposed in speech, such as cross-lingual speech recognition (Hsu et al., 2019), speaker adaptation (Klejch et al., 2018, 2019), and cross-accent speech recognition (Winata et al., 2020).

Code-Switching ASR
Li and Fung (2012) introduce a statistical method to incorporate a linguistic theory into a code-switching speech recognition system, and Adel et al. (2013a,b) explore syntactic and semantic features on recurrent neural networks (RNNs). Baheti et al. (2017) adapt effective curriculum learning by training a network with monolingual corpora of two languages, and subsequently training on code-switched data. Pratapa et al. (2018) propose methods to generate artificial code-switching data using a linguistic constraint. Winata et al. (2018) propose to leverage syntactic information to improve the identification of the location of code-switching points and improve language model performance. Finally, Garg et al. (2018) propose new neural-based methods using SeqGAN and a pointer-generator (Pointer-Gen) to generate diverse synthetic code-switching sentences that are sampled from the real code-switching data distribution.

Meta-Transfer Learning
We aim to effectively transfer knowledge from source domains to a specific target domain. We denote our model by f_θ with parameters θ. Our model accepts a set of speech inputs X = {x_1, ..., x_n} and generates a set of utterances Y = {y_1, ..., y_m}. The training involves a set of speech datasets in which each dataset is treated as a task T_i. Each task is distinguished as either a source task D_src or a target task D_tgt. For each training iteration, we randomly sample a set of data as training data D^tra and a set of data as validation data D^val. In this section, we present and formalize the method.

Setup
To facilitate good generalization of the model on the code-switching data, we sample the source dataset D_src from monolingual English (en), monolingual Chinese (zh), and code-switching (cs) corpora, and choose the target dataset D_tgt only from the code-switching corpus. The code-switching data samples in D_src and D_tgt are disjoint. In this case, we exploit the meta-learning update in meta-transfer learning to acquire knowledge from the monolingual English and Chinese corpora, and optimize the learning process on the code-switching data. Then, we slowly fine-tune the trained model to move closer to the code-switching domain, avoiding aggressive updates that can push the model to a worse position.
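The sampling setup above can be sketched as follows; the utterance identifiers, split sizes, and batch size are illustrative placeholders, not the actual corpora:

```python
import random

# Illustrative sketch: D_src draws from en, zh, and one portion of the
# cs corpus, while D_tgt draws only from a disjoint cs portion.
random.seed(0)
en = [f"en_{i}" for i in range(100)]
zh = [f"zh_{i}" for i in range(100)]
cs = [f"cs_{i}" for i in range(100)]

cs_src, cs_tgt = cs[:50], cs[50:]            # disjoint cs splits
D_src = {"en": en, "zh": zh, "cs": cs_src}
D_tgt = cs_tgt

def sample_iteration(batch=8):
    """One training iteration: sample D^tra from D_src and D^val from D_tgt."""
    d_tra = {name: random.sample(data, batch) for name, data in D_src.items()}
    d_val = random.sample(D_tgt, batch)
    return d_tra, d_val

d_tra, d_val = sample_iteration()
```

Because the cs utterances are split before sampling, the code-switching examples seen in the inner (training) step never overlap with those used for the meta-validation step.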

Meta-Transfer Learning Algorithm
Our approach extends the meta-learning paradigm to adapt knowledge learned from source domains to a specific target domain. This approach captures useful information from multiple resources for the target domain and updates the model accordingly. Figure 1 presents the general idea of meta-transfer learning. The goal of meta-transfer learning is not to generalize to all tasks, but to acquire crucial knowledge to transfer from monolingual resources to the code-switching domain. As shown in Algorithm 1, for each adaptation step on T_i, we compute updated parameters θ'_{T_i} via stochastic gradient descent (SGD) as follows:

θ'_{T_i} = θ − α ∇_θ L_{D^tra_{T_i}}(f_θ),

where α is the learning rate of the inner optimization. Then, a cross-entropy loss L_{D^val} is calculated from the learned model upon the generated text given the audio inputs on the target domain:

L_{D^val}(f_{θ'_{T_i}}) = − Σ_{(x, y) ∈ D^val} log p(y | x; θ'_{T_i}).

We define the objective as follows:

min_θ Σ_{T_i} L_{D^val}(f_{θ'_{T_i}}),

where D^tra_{T_i} ∼ (D_src, D_tgt) and D^val ∼ D_tgt. We minimize the loss of f_{θ'_{T_i}} upon D^val. Then, we apply gradient descent on the meta-model parameters θ with meta-learning rate β:

θ ← θ − β ∇_θ Σ_{T_i} L_{D^val}(f_{θ'_{T_i}}).
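To make the update concrete, the loop can be sketched with a first-order approximation on a toy scalar model; the task weights, step sizes, and synthetic data below are illustrative stand-ins, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(theta, x, y):
    """Mean squared error of y = theta * x and its gradient w.r.t. theta."""
    pred = theta * x
    loss = np.mean((pred - y) ** 2)
    grad = np.mean(2 * (pred - y) * x)
    return loss, grad

# Toy "tasks": two source languages (en, zh) and the target (cs),
# each generating y = w * x plus a little noise.
tasks = {"en": 1.0, "zh": 1.4, "cs": 1.2}
def sample(w, n=32):
    x = rng.normal(size=n)
    return x, w * x + 0.01 * rng.normal(size=n)

alpha, beta = 0.1, 0.05   # inner and outer step sizes
theta = 0.0
for step in range(200):
    x_val, y_val = sample(tasks["cs"])       # D^val ~ D_tgt
    outer_grad = 0.0
    for w in tasks.values():                 # D^tra ~ (D_src, D_tgt)
        x_tr, y_tr = sample(w)
        _, g = loss_and_grad(theta, x_tr, y_tr)
        theta_i = theta - alpha * g          # inner SGD adaptation
        # First-order approximation: take the validation gradient at
        # the adapted parameters as the meta-gradient contribution.
        _, g_val = loss_and_grad(theta_i, x_val, y_val)
        outer_grad += g_val
    theta -= beta * outer_grad               # meta-update on theta

final_loss, _ = loss_and_grad(theta, *sample(tasks["cs"], n=1000))
```

Although the inner steps are taken on all tasks, the meta-gradient is always evaluated on target-domain validation data, so theta drifts toward the parameters that best serve the target task after adaptation.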

Model Description
We build our speech recognition model on a transformer-based encoder-decoder (Dong et al., 2018). The encoder applies a VGG-style convolutional front end (Simonyan and Zisserman, 2015) to learn a language-agnostic audio representation and generate input embeddings. The decoder receives the encoder outputs and applies multi-head attention to the decoder input. We apply a mask to the decoder attention layer to avoid any information flow from future tokens. During the training process, we optimize the next-character prediction by shifting the transcription by one. Then, we generate the prediction by maximizing the log probability of the sub-sequence using beam search.
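The causal masking and one-character target shift can be sketched as follows; this is an illustrative snippet, not the paper's implementation, and the `<sos>` start token is an assumption:

```python
import numpy as np

def causal_mask(t):
    """Boolean mask where position i may attend only to positions <= i."""
    return np.tril(np.ones((t, t), dtype=bool))

# Shift the transcription by one for next-character prediction:
# the decoder sees each character and must predict the following one.
transcription = list("ni hao")
decoder_input = ["<sos>"] + transcription[:-1]   # shifted right by one
targets = transcription                           # next-character labels

mask = causal_mask(len(decoder_input))
```

At training time the mask is applied inside the decoder's self-attention so that the prediction for position i cannot peek at characters after i, which matches what beam search sees at inference time.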

Language Model Rescoring
To further improve the prediction, we incorporate Pointer-Gen LM (Winata et al., 2019) into the beam search process to select the best sub-sequence scored using the softmax probability of the characters. We define P(Y) as the score of a predicted sentence. We add the pointer-generator language model p_lm(Y) to rescore the predictions, and we also include a word count wc(Y) to avoid generating very short sentences. P(Y) is calculated as follows:

P(Y) = α log p(Y | X) + β log p_lm(Y) + γ wc(Y),

where α is the parameter to control the decoding probability, β is the parameter to control the language model probability, and γ is the parameter to control the effect of the word count.
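A minimal sketch of this rescoring inside a beam, assuming hypothetical hypotheses and log-probabilities (the weights mirror the linear combination above; they are not the paper's tuned values):

```python
def rescore(log_p_dec, log_p_lm, n_words, alpha=1.0, beta=0.1, gamma=0.1):
    """Combine decoder log-prob, LM log-prob, and a word-count bonus."""
    return alpha * log_p_dec + beta * log_p_lm + gamma * n_words

# Each beam hypothesis: (text, decoder log-prob, LM log-prob).
# The sentences and scores are made-up placeholders.
beam = [
    ("我 要 check 一下", -4.0, -9.0),
    ("我 要 chat 一下",  -4.2, -12.0),
    ("我 要",            -4.4, -5.0),
]
best = max(beam, key=lambda h: rescore(h[1], h[2], len(h[0].split())))
```

The word-count term counteracts the decoder's bias toward short outputs: without it, truncated hypotheses with fewer accumulated log-probability terms tend to win the beam.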

Dataset
We use SEAME Phase II, a conversational English-Mandarin Chinese code-switching speech corpus that consists of spontaneously spoken interviews and conversations (Nanyang Technological University, 2015). The data statistics and code-switching metrics, such as the code-mixing index (CMI) (Gambäck and Das, 2014) and switch-point fraction (Pratapa et al., 2018), are shown in Table 1. For monolingual speech datasets, we use HKUST (Liu et al., 2006) as the monolingual Chinese dataset and Common Voice (Ardila et al., 2019) as the monolingual English dataset. We use 16 kHz audio inputs and up-sample the HKUST data from 8 kHz to 16 kHz.
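The 8 kHz to 16 kHz up-sampling can be illustrated with simple linear interpolation; this is only a sketch (the paper does not specify the resampling method), and a production pipeline would likely use a polyphase resampler such as `scipy.signal.resample_poly`:

```python
import numpy as np

def upsample_2x(audio_8k):
    """Double the sample count of a 1-D signal via linear interpolation."""
    n = len(audio_8k)
    t_out = np.linspace(0, n - 1, 2 * n)   # twice as many time points
    return np.interp(t_out, np.arange(n), audio_8k)

# 0.1 s of a 440 Hz tone at 8 kHz as a stand-in for an HKUST utterance.
audio = np.sin(2 * np.pi * 440 * np.arange(800) / 8000)
audio_16k = upsample_2x(audio)
```

Up-sampling does not add high-frequency content above 4 kHz; it only puts the monolingual Chinese audio on the same 16 kHz grid as the other corpora so one front end can process all inputs.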

Experiment Settings
Our transformer model consists of two encoder layers and four decoder layers with a hidden size of 512, an embedding size of 512, a key dimension of 64, and a value dimension of 64. The input for all experiments is a spectrogram computed with a 20 ms window and shifted every 10 ms. Our label set has 3765 characters and includes all of the English and Chinese characters from the corpora, spaces, and apostrophes. We optimize our model using Adam and start training with a learning rate of 1e-4. We fine-tune our model using SGD with a learning rate of 1e-5 and apply early stopping on the validation set. We choose α = 1, β = 0.1, and γ = 0.1. At every iteration, we draw the batch samples uniformly at random.
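The spectrogram framing above translates into the following sample counts at the 16 kHz input rate (a small arithmetic check, not the paper's feature-extraction code):

```python
# A 20 ms window shifted every 10 ms at a 16 kHz sampling rate.
sample_rate = 16_000
win = int(0.020 * sample_rate)   # samples per analysis window
hop = int(0.010 * sample_rate)   # samples between frame starts

def num_frames(n_samples):
    """Number of full windows that fit into an utterance."""
    return 1 + (n_samples - win) // hop if n_samples >= win else 0

frames_in_1s = num_frames(sample_rate)   # frames in one second of audio
```

So each frame covers 320 samples with a 160-sample hop, yielding roughly 100 frames per second of speech.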
We conduct experiments with the following approaches: (a) training on CS only, (b) joint training on EN + ZH, (c) joint training on EN + ZH + CS, and (d) meta-transfer learning. We then fine-tune the (b), (c), and (d) models on CS, and apply LM rescoring to our best model. We evaluate our models using beam search with a beam width of 5 and a maximum sequence length of 300, and measure quality using character error rate (CER).

Results
The results are shown in Table 2. Generally, adding the monolingual EN and ZH data to the training set effectively reduces error rates. There is a significant margin between only CS and joint training (1.64%) or meta-transfer learning (4.21%). According to the experimental results, meta-transfer learning consistently outperforms the joint-training approaches, which shows its effectiveness for language adaptation.
The fine-tuning approach helps to improve the performance of the trained models, especially for joint training (EN + ZH). We observe that joint training (EN + ZH) without fine-tuning cannot predict mixed-language speech, while joint training on EN + ZH + CS is able to recognize it. However, according to Table 3, adding a fine-tuning step badly affects the previously learned knowledge (e.g., EN: 11.84% → 63.85%, ZH: 31.30% → 78.07%). Interestingly, the model trained with meta-transfer learning does not suffer from catastrophic forgetting even though the loss objective does not explicitly cover both monolingual languages. As expected, joint training on EN + ZH + CS achieves decent performance on all tasks, but it does not optimally improve CS.
The language model rescoring using Pointer-Gen LM further improves the performance of the meta-transfer learning model.

Figure 2 depicts the dynamics of the validation loss per iteration on CS, EN, and ZH. As we can see from the figure, meta-transfer learning converges faster than only CS and joint training, and reaches the lowest validation loss. For the validation losses on EN and ZH, both joint training (EN + ZH + CS) and meta-transfer learning achieve a similar loss at the same iteration, while only CS ends at a much higher validation loss. This shows that meta-transfer learning is not only optimized for the code-switching domain, but also preserves the generalization ability in the monolingual domains, as shown in Table 3.

Language Modeling Task
We further evaluate our meta-transfer learning approach on a language modeling task. We simply take the transcriptions of the same datasets and build a two-layer LSTM-based language model following the configuration in prior work. To further improve performance, we apply fine-tuning with an SGD optimizer, using a learning rate of 1.0 and decaying the learning rate by 0.25x for every epoch without any improvement in validation performance. To prevent the model from over-fitting, we apply an early stop of 5 epochs.
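The decay-on-plateau schedule with early stopping can be sketched as below; the toy validation-perplexity curve is made up for illustration:

```python
def fine_tune_schedule(val_ppls, lr=1.0, decay=0.25, patience=5):
    """Track the learning rate across epochs: multiply lr by `decay`
    after every epoch with no validation improvement, and stop after
    `patience` consecutive non-improving epochs."""
    best, stale, lrs = float("inf"), 0, []
    for ppl in val_ppls:
        lrs.append(lr)
        if ppl < best:
            best, stale = ppl, 0
        else:
            stale += 1
            lr *= decay              # decay on no improvement
            if stale >= patience:    # early stop
                break
    return lrs, best

lrs, best = fine_tune_schedule(
    [70.0, 66.0, 64.0, 64.5, 64.4, 64.3, 64.2, 64.1, 64.0])
```

After the first plateau the learning rate drops geometrically (1.0 → 0.25 → 0.0625 → ...), so later epochs make only small refinements before training halts.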
As shown in Table 4, the meta-transfer learning approach outperforms the joint-training approach. The language modeling results follow a trend similar to the speech recognition results: meta-transfer learning without additional fine-tuning performs better than joint training with fine-tuning. Compared to our baseline model (only CS), meta-transfer learning reduces the test set perplexity by 3.57 points (65.71 → 62.14), and the post fine-tuning step reduces the test set perplexity even further, from 62.14 to 61.97.

Conclusion
We propose a novel method, meta-transfer learning, to transfer knowledge to a code-switched speech recognition system in a low-resource setting by judiciously extracting information from high-resource monolingual datasets. Our model learns to recognize the individual languages and transfers that knowledge so as to better recognize mixed-language speech by conditioning the optimization objective on the code-switching domain. Based on experimental results, our training strategy outperforms joint training even without an additional fine-tuning step, and it requires fewer iterations to converge.
In this paper, we have shown that our approach can be effectively applied to both speech processing and language modeling tasks. In future work, we will further explore the generalizability of our meta-transfer learning approach to more downstream multilingual tasks.