Cross-lingual Spoken Language Understanding with Regularized Representation Alignment

Despite the promising results of current cross-lingual models for spoken language understanding systems, they still suffer from imperfect cross-lingual representation alignments between the source and target languages, which makes the performance sub-optimal. To cope with this issue, we propose a regularization approach to further align word-level and sentence-level representations across languages without any external resource. First, we regularize the representation of user utterances based on their corresponding labels. Second, we regularize the latent variable model (Liu et al., 2019a) by leveraging adversarial training to disentangle the latent variables. Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios, and our model, trained in a few-shot setting with only 3% of the target language training data, achieves comparable performance to the supervised training with all the training data.


Introduction
Data-driven neural-based supervised training approaches have shown effectiveness in spoken language understanding (SLU) systems (Goo et al., 2018; Chen et al., 2019; Haihong et al., 2019). However, collecting large amounts of high-quality training data is not only expensive but also time-consuming, which makes these approaches hard to scale to low-resource languages, where training data is scarce. Cross-lingual adaptation has naturally arisen to cope with this issue: it leverages the training data in rich-resource source languages and minimizes the requirement for training data in low-resource target languages. In general, there are two challenges in cross-lingual adaptation. First, the imperfect alignment of word-level representations between the source and target languages limits the adaptation performance. Second, even if we assume that the word-level alignment is perfect, the sentence-level alignment is still imperfect owing to grammatical and syntactical variances across languages. Therefore, we emphasize that cross-lingual methods should focus on the alignment of word-level and sentence-level representations, and increase the robustness to inherently imperfect alignments.
In this paper, we concentrate on the cross-lingual SLU task (as illustrated in Figure 1), and we consider both few-shot and zero-shot scenarios. To improve the quality of cross-lingual alignment, we first propose a Label Regularization (LR) method, which utilizes the slot label sequences to regularize the utterance representations. We hypothesize that if the slot label sequences of user utterances are close to each other, these user utterances should have similar meanings. Hence, we regularize the distance of utterance representations based on the corresponding representations of label sequences to further improve the cross-lingual alignments.
Then, we extend the latent variable model (LVM) proposed by Liu et al. (2019a). The LVM generates a Gaussian distribution instead of a feature vector for each token, which improves the adaptation robustness. However, there are no additional constraints on generating distributions, making the latent variables easily entangled for different slot labels. To handle this issue, we leverage Adversarial training to regularize the LVM (ALVM). We train a linear layer to fit latent variables to a uniform distribution over slot types. Then, we optimize the latent variables to fool the trained linear layer into outputting the correct slot type (a one-hot vector). In this way, latent variables of different slot types are encouraged to disentangle from each other, leading to a better alignment of cross-lingual representations.
The contributions of our work are summarized as follows:
• We propose LR and ALVM to further improve the alignment of cross-lingual representations, which do not require any external resources.
• Our model outperforms the previous state-of-the-art model in both zero-shot and few-shot scenarios on the cross-lingual SLU task.
• Extensive analysis and visualizations are made to illustrate the effectiveness of our approaches.

Related Work
Cross-lingual Transfer Learning Cross-lingual transfer learning is able to circumvent the requirement of enormous training data by leveraging the learned knowledge in the source language and learning inter-connections between the source and the target language. Artetxe et al. (2017) and Conneau et al. (2018) conducted cross-lingual word embedding mapping with zero or very few supervision signals. Recently, pre-training cross-lingual language models on large amounts of monolingual or bilingual resources has proved effective for downstream tasks (e.g., natural language inference) (Conneau and Lample, 2019; Devlin et al., 2019; Pires et al., 2019; Huang et al., 2019). Additionally, many cross-lingual transfer algorithms have been proposed to solve specific cross-lingual tasks, for example, named entity recognition (Xie et al., 2018; Mayhew et al., 2017; Liu et al., 2020a), part-of-speech tagging (Kim et al., 2017; Zhang et al., 2016), entity linking (Zhang et al., 2013; Sil et al., 2018; Upadhyay et al., 2018b), personalized conversations (Lin et al., 2020), and dialog systems (Upadhyay et al., 2018a; Chen et al., 2018).

Cross-lingual Task-oriented Dialog Systems
Deploying task-oriented dialogue systems in low-resource domains (Bapna et al., 2017; Wu et al., 2019; Liu et al., 2020b) or languages (Chen et al., 2018; Liu et al., 2019a,b), where the number of training samples is limited, is a challenging task.

Methodology
Our model architecture and proposed methods are depicted in Figure 2; they combine label regularization (LR) and the adversarial latent variable model (ALVM) to conduct intent detection and slot filling. In the few-shot setting, the input user utterances are in both the source and target languages, while in the zero-shot setting, the user utterances are only in the source language. Note that the source and the target each consist of a single language.

Motivation
Intuitively, when the slot label sequences are similar, we expect the corresponding representations of user utterances across languages to be similar. For example, when the slot label sequences contain the weather slot and the location slot, the user utterances should be asking for the weather forecast somewhere. However, the representations of utterances across languages cannot always meet these requirements because of the inherently imperfect alignments in word-level and sentence-level representations. Therefore, we propose to leverage existing slot label sequences in the training data to regularize the distance of utterance representations. When a few training samples are available in the target language (i.e., few-shot setting), we regularize the distance of utterance representations between the source and target languages based on their slot labels. Given this regularization, the model explicitly learns to further align the sentence-level utterance representations across languages so as to satisfy the constraints. Additionally, it can also implicitly align the word-level BiLSTM hidden states across languages because sentence-level representations are produced based on them.
When zero training samples are available in the target language (i.e., zero-shot setting), we regularize the utterance representations in the source language. This helps to better distinguish the utterance representations and to cluster similar utterance representations based on the slot labels, which increases the generalization ability in the target language.

Implementation Details
Figure 2 (Left) illustrates an utterance encoder and a label encoder that generate the representations for utterances and labels, respectively.
We denote the user utterance as $w = [w_1, w_2, \ldots, w_n]$, where $n$ is the length of the utterance. Similarly, we represent the slot label sequence as $s = [s_1, s_2, \ldots, s_n]$. We combine a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) and an attention layer (Felbo et al., 2017) to encode and produce the representations for user utterances and slot label sequences: the BiLSTM encodes the embedded tokens into hidden states, and the attention layer combines these hidden states, weighted by their attention scores, into a single representation. In this process, the superscripts $w$ and $s$ denote utterance and label, respectively, $v$ is a trainable weight vector in the attention layer, $\alpha_i$ is the attention score for each token $i$, $E$ denotes the embedding layers for utterances and label sequences, and $u$ and $l$ denote the representations of the utterance $w$ and the slot label sequence $s$, respectively.
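As a minimal sketch of the attention pooling step described above, the following Python function turns a sequence of hidden states into a single representation. It assumes a dot-product score between each hidden state and the weight vector v, followed by a softmax; the function name `attention_pool` and the plain-list vector format are illustrative choices rather than details given in the paper, and the BiLSTM that would produce the hidden states is omitted.

```python
import math


def attention_pool(hidden_states, v):
    """Pool per-token hidden states into one vector (illustrative sketch).

    hidden_states: list of n vectors (lists of floats), one per token.
    v: scoring vector of the same dimension (trainable in a real model).
    Returns (pooled_vector, attention_scores).
    """
    # Assumed scoring rule: m_i = h_i . v
    scores = [sum(h_k * v_k for h_k, v_k in zip(h, v)) for h in hidden_states]

    # Softmax over tokens, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(m - top) for m in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Representation = attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    pooled = [
        sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)
    ]
    return pooled, alphas
```

Under this sketch, the same routine would be applied once to the utterance hidden states (giving u) and once to the label-sequence hidden states (giving l), each with its own weight vector.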
In each iteration of the training phase, we randomly select two samples for the label regularization. As illustrated in Figure 2 (Left), we first calculate the cosine similarity of two utterance representations $u_a$ and $u_b$, and the cosine similarity of two label representations $l_a$ and $l_b$. Then, we minimize the distance between these two cosine similarities. The objective function can be described as follows:

$$L^{lr} = \mathrm{MSE}\big(\cos(u_a, u_b),\ \cos(l_a, l_b)\big), \quad (8)$$

where the superscript $lr$ denotes label regularization, and MSE represents mean square error. In the zero-shot setting, both samples $u_a$ and $u_b$ come from the source language, while in the few-shot setting, one sample comes from the source language and the other comes from the target language.
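The label regularization objective described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representations are plain lists of floats, the helper names `cosine` and `label_regularization_loss` are hypothetical, and the mean square error between two scalar similarities reduces to their squared difference.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def label_regularization_loss(u_a, u_b, l_a, l_b):
    """Squared gap between utterance similarity and label-sequence similarity.

    u_a, u_b: utterance representations of the two sampled examples.
    l_a, l_b: representations of their slot label sequences.
    """
    return (cosine(u_a, u_b) - cosine(l_a, l_b)) ** 2
```

The loss is zero when the two utterances are exactly as similar as their label sequences, and grows as the two similarities diverge, which is the constraint the regularizer is meant to enforce.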
Since the features of labels and utterances are in different vector spaces, we choose not to share the parameters of their encoders. During training, it is easy to produce expressive representations for user utterances owing to the large number of training samples, but it is difficult for label sequences since the objective function $L^{lr}$ is the only supervision. This supervision is weak at the beginning of the training since utterance representations are not yet sufficiently expressive, which makes the label regularization approach unstable and less effective. To ensure the representations for slot label sequences are meaningful, we conduct pre-training for the label sequence encoder.

Label Sequence Encoder Pre-training
We leverage the large amount of source language training data to pre-train the label sequence encoder. Concretely, we use the model architecture illustrated in Figure 2 to train the SLU system in the source language, and at the same time, we optimize the label sequence encoder based on the objective function $L^{lr}$ in Eq (8). The label sequence encoder learns to generate meaningful label sequence representations that differ based on their similarities since the extensive source language training samples ensure the high quality of the utterance encoder.

Adversarial Latent Variable Model
In this section, we first give an introduction to the latent variable model (LVM) (Liu et al., 2019a), and then we describe how we incorporate the adversarial training into the LVM.

Latent Variable Model
Point estimation in cross-lingual adaptation is vulnerable due to the imperfect alignments across languages. Hence, as illustrated in Figure 2 (Right), the LVM generates a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ for both word-level and sentence-level representations instead of a feature vector, which eventually improves the robustness of the model's cross-lingual adaptation ability. In the LVM formulation, $W^S_l$ and $W^I_l$ are trainable parameters that generate the mean and variance for the word-level hidden states $h_i$ and the sentence-level representation $r$, respectively, from user utterances. $q^S_i \sim \mathcal{N}(\mu^S_i, (\sigma^S_i)^2 I)$ and $q^I \sim \mathcal{N}(\mu^I, (\sigma^I)^2 I)$ are the generated Gaussian distributions, from which the latent variables $z^S_i$ and $z^I$ are sampled, and $p^S_i$ and $p^I$ are the predictions for the slot of the $i$-th token and the intent of the utterance, respectively.
During training, all the sampled points from the same generated distribution are trained to predict the same slot label, which makes the adaptation more robust. At inference time, the means $\mu^S_i$ and $\mu^I$ are used in place of $z^S_i$ and $z^I$, respectively, to make the prediction deterministic.
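The difference between training-time sampling and inference-time prediction can be illustrated with a short Python sketch. It assumes the common reparameterization z = μ + σ·ε with ε drawn from a standard normal, and assumes the variance is stored as a log-variance; neither detail is stated in the text above, and the function names are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2 I) via z = mu + sigma * eps (assumed form)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def latent_for_prediction(mu, log_var, training, rng=None):
    """Return a sampled latent during training and the mean at inference."""
    if training:
        if rng is None:
            rng = random.Random()
        return sample_latent(mu, log_var, rng)
    # Inference: use the mean so the prediction is deterministic.
    return list(mu)
```

Because every sample drawn from the same distribution is trained toward the same label, the classifier sees a neighborhood around each mean rather than a single point, which is the source of the robustness described above.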

Adversarial Training
Since there are no constraints enforced on the latent Gaussian distribution during training, the latent distributions of different slot types are likely to be close to each other. Hence, the distributions for the same slot type in different user utterances or languages might not be clustered well, which could hurt the cross-lingual alignment and prevent the model from distinguishing slot types when adapting to the target language.
To improve the cross-lingual alignment of latent variables, we propose to make the latent variables of different slot types more distinguishable by adding adversarial training to the LVM. As illustrated in Figure 2 (Right), we train a fully connected layer to fit latent variables into a uniform distribution over slot types. At the same time, the latent variables are regularized to fool the trained fully connected layer by predicting the correct slot type. In this way, the latent variables are trained to be more recognizable. In other words, the generated distributions for different slot types are more likely to repel each other, and those for the same slot type are more likely to be close to each other, which leads to a more robust cross-lingual adaptation. We denote the size of the whole training data as $J$ and the length of data sample $j$ as $|Y_j|$. Note that in the few-shot setting, $J$ includes the number of data samples in the target language. The process of adversarial training can be described as follows:

$$p_{jk} = \mathrm{FC}(z^S_{jk}),$$
$$L^{fc} = \sum_{j=1}^{J} \sum_{k=1}^{|Y_j|} \mathrm{MSE}(p_{jk}, U),$$
$$L^{lvm} = \sum_{j=1}^{J} \sum_{k=1}^{|Y_j|} \mathrm{MSE}(p_{jk}, y^S_{jk}),$$

where FC consists of a linear layer and a Softmax function, $z^S_{jk}$ and $p_{jk}$ are the latent variable and generated distribution, respectively, for the $k$-th token in the $j$-th utterance, MSE represents the mean square error, $U$ represents the uniform distribution, and $y^S_{jk}$ represents the slot label. The slot label is a one-hot vector where the value for the correct slot type is one and zero otherwise. We optimize $L^{fc}$ to train only FC to fit a uniform distribution, and $L^{lvm}$ is optimized to constrain the LVM to generate more distinguishable distributions for slot predictions. Different from the well-known adversarial training (Goodfellow et al., 2014), where the discriminator learns to distinguish the classes and the generator learns to make the features indistinguishable, in our approach the FC layer, acting as the discriminator, is trained to generate a uniform distribution, and the generator is regularized to make latent variables distinguishable by slot types.
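To make the two adversarial objectives concrete, the following Python sketch computes both losses for a batch of latent variables. It is an illustration under assumptions rather than the authors' implementation: the losses are summed over utterances and tokens, vectors are plain lists, and all function names are hypothetical. The sketch only evaluates the losses; in actual training, the first loss would update the FC parameters only and the second would update everything except FC.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def fc_forward(z, weight, bias):
    """Linear layer followed by Softmax; weight has one row per slot type."""
    logits = [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(weight, bias)]
    return softmax(logits)


def adversarial_losses(latents, slot_ids, weight, bias):
    """Return (loss_fc, loss_lvm) accumulated over all tokens.

    latents: list of utterances, each a list of per-token latent vectors.
    slot_ids: gold slot index for every token, same nesting as latents.
    """
    num_slots = len(bias)
    uniform = [1.0 / num_slots] * num_slots
    loss_fc, loss_lvm = 0.0, 0.0
    for utt_latents, utt_slots in zip(latents, slot_ids):
        for z, slot in zip(utt_latents, utt_slots):
            p = fc_forward(z, weight, bias)
            one_hot = [1.0 if i == slot else 0.0 for i in range(num_slots)]
            loss_fc += mse(p, uniform)    # FC is pushed toward uniform output
            loss_lvm += mse(p, one_hot)   # latents are pushed toward the gold slot
    return loss_fc, loss_lvm
```

The two losses pull in opposite directions on the same FC output, so the latent variables can only lower the second loss by becoming separable enough that even an FC trained toward uniformity still reveals the slot type.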

Optimization
In the objective functions for the slot filling and intent detection tasks, $p^S_{jk}$ and $y^S_{jk}$ are the prediction and label, respectively, for the slot of the $k$-th token in the $j$-th utterance, and $p^I_j$ and $y^I_j$ are the intent prediction and label, respectively, for the $j$-th utterance.
The model is optimized by minimizing the overall objective function in Eq (18), where $\alpha$ and $\beta$ are hyper-parameters, $L^{fc}$ only optimizes the parameters in FC, and $L^{lvm}$ optimizes all the model parameters excluding FC.
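A rough Python sketch of how the individual losses might be combined is given below. Two points are assumptions rather than statements from the text: the slot and intent objectives are taken to be summed cross-entropy losses, and the overall objective is taken to be the sum of the task losses and the label regularization loss plus the two adversarial terms weighted by α and β. All function and argument names are hypothetical.

```python
import math


def cross_entropy(probs, gold_index):
    """Negative log-probability assigned to the gold class (assumed loss)."""
    return -math.log(probs[gold_index])


def task_losses(slot_probs, slot_gold, intent_probs, intent_gold):
    """Sum slot losses over all tokens and intent losses over all utterances."""
    loss_slot = sum(
        cross_entropy(p, g)
        for utt_probs, utt_gold in zip(slot_probs, slot_gold)
        for p, g in zip(utt_probs, utt_gold)
    )
    loss_intent = sum(cross_entropy(p, g) for p, g in zip(intent_probs, intent_gold))
    return loss_slot, loss_intent


def total_objective(loss_slot, loss_intent, loss_lr, loss_fc, loss_lvm,
                    alpha=1.0, beta=1.0):
    """Assumed combination: task losses + LR loss + weighted adversarial terms."""
    return loss_slot + loss_intent + loss_lr + alpha * loss_fc + beta * loss_lvm
```

In a real training loop the scalar sum alone is not sufficient, because the FC loss must update only the FC parameters and the LVM loss must update everything else; this is typically handled with separate parameter groups or by detaching tensors, which the sketch leaves out.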

Dataset
We conduct our experiments on the multilingual spoken language understanding (SLU) dataset proposed by Schuster et al. (2019), which contains English, Spanish, and Thai across the weather, reminder, and alarm domains. The corpus includes 12 intent types and 11 slot types, and the data statistics are shown in Table 1.

Training Details
The utterance encoder is a 2-layer BiLSTM with a hidden size of 250 and dropout rate of 0.1, and the size of the mean and variance in the latent variable model is 150. The label encoder is a 1-layer BiLSTM with a hidden size of 150, and 100-dimensional embeddings for label types. We use the Adam optimizer with a learning rate of 0.001. We use accuracy to evaluate the performance of intent detection and BIO-based f1-score to evaluate the performance of slot filling. For the adversarial training, we observe that the latent variable model is not able to make slot types recognizable if the FC is too strong. Hence, we first learn a good initialization for FC by setting both the $\alpha$ and $\beta$ parameters in Eq (18) to 1 in the first two training epochs, and then we gradually decrease the value of $\alpha$. We use the refined cross-lingual word embeddings in Liu et al. (2019a) to initialize the cross-lingual word embeddings in our models and keep them fixed during training. We use the delexicalization (delex.) in Liu et al. (2019a), which replaces the tokens that represent numbers, time, and duration with special tokens. We use 36 training samples in Spanish and 21 training samples in Thai on the 1% few-shot setting, and 108 training samples in Spanish and 64 training samples in Thai on the 3% few-shot setting. Our models are trained on a GTX 1080 Ti. The number of parameters for our models is around 5 million.
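The warm-up schedule for the adversarial weights can be sketched as follows. Only the first two epochs (both weights equal to 1) come from the description above; the geometric decay, its rate of 0.8, the optional floor, and keeping β at 1 afterwards are illustrative assumptions, since the exact decay rule is not given. The function name is hypothetical.

```python
def adversarial_weights(epoch, warmup_epochs=2, decay=0.8, floor=0.0):
    """Return (alpha, beta) for a 0-based epoch index.

    During warm-up both weights are 1 so that FC gets a good initialization.
    Afterwards alpha shrinks geometrically (assumed rule) so that FC does not
    become too strong, while beta is held at 1 (also an assumption).
    """
    beta = 1.0
    if epoch < warmup_epochs:
        return 1.0, beta
    steps_after_warmup = epoch - warmup_epochs + 1
    alpha = max(floor, decay ** steps_after_warmup)
    return alpha, beta
```

Any monotonically decreasing rule would fit the description equally well; the geometric form is chosen here only because it is simple to state and tune.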

Baselines
We compare our model to the following baselines.
BiLSTM-CRF This is the same cross-lingual SLU model structure as Schuster et al. (2019).
BiLSTM-LVM We replace the conditional random field (CRF) in BiLSTM-CRF with the LVM proposed in Liu et al. (2019a).
Multi. CoVe Multilingual CoVe (Yu et al., 2018) is a bidirectional machine translation system that tends to encode phrases with similar meanings into similar vector spaces across languages. Schuster et al. (2019) used it for the cross-lingual SLU task.
Multi. CoVe w/ auto-encoder Based on Multilingual CoVe, Schuster et al. (2019) added an auto-encoder objective so as to produce better-aligned representations for semantically similar sentences across languages.
Multilingual BERT (M-BERT) It is a single language model pre-trained from monolingual corpora in 104 languages (Devlin et al., 2019), which is surprisingly good at cross-lingual model transfer.
Mixed Language Training (MLT) Liu et al. (2019b) utilized keyword pairs to generate mixed language sentences for training cross-lingual task-oriented dialogue systems, which achieves promising zero-shot transfer ability.
CoSDA-ML Qin et al. (2020) proposed a multilingual code-switching data augmentation framework to enhance the cross-lingual systems based on M-BERT (Devlin et al., 2019). It is a concurrent work of this paper.
XL-SLU It is a previous state-of-the-art model in the zero-shot cross-lingual SLU task, which combines Gaussian noise, cross-lingual embeddings refinement, and the LVM (Liu et al., 2019a).
Translate Train Schuster et al. (2019) trained a supervised machine translation system to translate English data into the target language, and then trained the model on the translated dataset.
All-shot Settings We train the BiLSTM-CRF model (Lample et al., 2016) on all the target language training samples, and on both the source and target language training set.

Few-shot Setting
Quantitative Analysis The few-shot results are illustrated in Table 2, from which we can clearly see consistent improvements made by label regularization and adversarial training. For example, on the 1% few-shot setting, our model improves on BiLSTM-LVM in terms of accuracy/f1-score by 1.85%/1.16% in Spanish, and by 4.16%/6.93% in Thai. Our model also surpasses a strong baseline, M-BERT, while our BiLSTM-based model has far fewer parameters than M-BERT. For example, on the 1% few-shot setting, our model improves on M-BERT in terms of accuracy/f1-score by 3.80%/3.83% in Thai. Instead of generating a feature point like CRF, the LVM creates a more robust cross-lingual adaptation by generating a distribution for the intent or each token in the utterance. However, distributions generated by the LVM for the same slot type across languages might not be sufficiently close. Incorporating adversarial training into the LVM alleviates this problem by regularizing the latent variables and making them more distinguishable. This improves the performance in both intent detection (a sentence-level task) and slot filling (a word-level task) by 0.92%/3.16% in Spanish and by 1.89%/4.67% in Thai on the 1% few-shot setting. This indicates that both sentence-level and word-level representations are better aligned across languages.
In addition, LR aims to further align the sentence-level representations of target language utterances into a semantically similar space of source language utterances. As a result, there are 0.93%/2.82% improvements in intent detection for Spanish/Thai on the 1% few-shot setting after we add LR to BiLSTM-LVM. Interestingly, the performance gains are not only on intent detection but also on slot filling, with an improvement of 1.77%/3.94% in Spanish/Thai. This is attributed to the fact that utterance representations are produced based on word-level representations from the BiLSTM. Therefore, the alignment of word-level representations is implicitly improved in this process. Furthermore, combining LR and ALVM further tackles the inherent difficulties of cross-lingual adaptation and achieves state-of-the-art few-shot performance. Notably, by leveraging only 3% of the target language training samples, the results of our best model are on par with the supervised training on all the target language training data.

Adaptation ability to unrelated languages
From Table 2, we observe impressive improvements in Thai, a language unrelated to English, by utilizing our proposed approaches, especially when the number of target language training samples is small. For example, compared to the BiLSTM-LVM, our best model significantly improves the accuracy and f1-score by ∼4%/∼7% in intent detection and slot filling in Thai in the few-shot setting on 1% data. Additionally, in the same setting, our model surpasses the strong baseline, M-BERT, in terms of accuracy and f1-score by ∼4%. This illustrates that our approaches provide strong adaptation robustness and are able to tackle the inherent adaptation difficulties to unrelated languages.
Comparison between Spanish and Thai To make a fair comparison for the few-shot performance in Spanish and Thai, we increase the training size of Thai to the same as 3% Spanish training samples, as depicted in Table 3. We can see that there is still a performance gap between Spanish and Thai (3.11% in the intent detection task and 8.15% in the slot filling task). This is because Spanish is grammatically and syntactically closer to English than Thai, leading to a better quality of cross-lingual alignment.
Figure 3: Visualization for latent variables of parallel word pairs in English and Thai over different models trained on 1% target language training set. We choose the word pairs "temperature-อุณหภูมิ" and "tomorrow-พรุ่ง" from the parallel sentences "what will be the temperature tomorrow" and "อุณหภูมิ จะ อยู่ ท เท่า ไหร่ พรุ่ง" in English and Thai, respectively. To draw the contour plot, we sample 3000 points from the distribution of latent variables for the selected words, use PCA to project those points into 2D, and calculate the mean and variance for each word.

Visualization of Latent Variables
The effectiveness of LR and ALVM can be clearly seen from Figure 3. The former approach decreases the distance between latent variables for words with similar semantic meanings in different languages. For the latter approach, to make the distributions for different slot types distinguishable, our model regularizes the latent variables of different slot types to be far from each other, and eventually it also improves the alignment of words with the same slot type. Incorporating both approaches further improves the word-level alignment across languages. This further supports the robustness of our proposed approaches when adapting from the source language (English) to the unrelated language (Thai).

Zero-shot Setting
From Table 2, we observe the remarkable improvements made by LR and ALVM on the state-of-the-art model XL-SLU in the zero-shot setting, and the slot filling performance of our best model in Spanish is on par with the strong baseline Translate Train, which leverages large amounts of bilingual resources. LR improves the adaptation robustness by making the word-level and sentence-level representations of similar utterances distinguishable. In addition, integrating adversarial training with the LVM further increases the robustness by disentangling the latent variables for different slot types. However, the performance boost for slot filling in Thai is limited. We conjecture that the inherent discrepancies in cross-lingual word embeddings and language structures for typologically different language pairs make the word-level representations between them difficult to align in the zero-shot scenario. We notice that Multilingual CoVe with auto-encoder achieves slightly better performance than our model on the slot filling task in Thai. This is because this baseline leverages large amounts of monolingual and bilingual resources, which largely benefits the cross-lingual alignment between English and Thai. CoSDA-ML, a concurrent work of our model, utilizes additional augmented multilingual code-switching data, which significantly improves the zero-shot cross-lingual performance.

Effectiveness of Label Sequence Encoder Pre-training
Label sequence encoder pre-training helps the label encoder to generate more expressive representations for label sequences, which ensures the effectiveness of the label regularization approach.
From Table 4, we can clearly observe the consistent performance gains made by pre-training in both few-shot and zero-shot scenarios.

Conclusion
Current cross-lingual SLU models still suffer from imperfect cross-lingual alignments between the source and target languages. In this paper, we propose label regularization (LR) and the adversarial latent variable model (ALVM) to regularize and further align the word-level and sentence-level representations across languages without utilizing any additional bilingual resources. Experiments on the cross-lingual SLU task illustrate that our model achieves a remarkable performance boost compared to the strong baselines in both zero-shot and few-shot scenarios, and our model has a robust adaptation ability to unrelated target languages in the few-shot scenario. In addition, the visualization of latent variables further supports that our approaches are effective at improving the alignment of cross-lingual representations.

Figure 1: Illustration of cross-lingual spoken language understanding systems, where English is the source language and Spanish is the target language.

Table 1: Number of utterances for the multilingual SLU dataset. English is the source language, and Spanish and Thai are the target languages.

Table 2: Cross-lingual SLU results (averaged over three runs). † denotes supervised training on all the target language training samples. ‡ denotes supervised training on both the source and target language datasets. The bold numbers denote the best results in the few-shot or zero-shot settings. The underlined numbers represent results that are comparable (distances are within 1%) to the all-shot experiment with all the target language training samples. The results of Multi. CoVe and Multi. CoVe + Auto-encoder are taken from Schuster et al. (2019), and the results of XL-SLU in the zero-shot settings are taken from Liu et al. (2019a). Delex. denotes the delexicalization in Liu et al. (2019a), which replaces the tokens that represent numbers, time, and duration with special tokens. The 1% few-shot setting uses 36 training samples in Spanish and 21 in Thai, and the 3% few-shot setting uses 108 training samples in Spanish and 64 in Thai.

Table 3: Results of few-shot learning on 5% Thai training data, which are averaged over three runs. We make the training samples in Thai the same as the 3% Spanish training samples (108).

Table 4: Results of the ablation study for the label sequence encoder pre-training (averaged over three runs). Our model refers to the one that combines LR, ALVM and delex. with BiLSTM-LVM.