Zero-shot Cross-lingual Dialogue Systems with Transferable Latent Variables

Despite the surging demand for multilingual task-oriented dialogue systems (e.g., Alexa, Google Home), relatively little research has addressed multilingual or cross-lingual scenarios. Hence, we propose a zero-shot adaptation of task-oriented dialogue systems to low-resource languages. To tackle this challenge, we first use a very small set of parallel word pairs to refine the aligned cross-lingual word-level representations. We then employ a latent variable model to cope with the variance of similar sentences across different languages, which is induced by imperfect cross-lingual alignments and inherent differences between languages. Finally, the experimental results show that, even though we utilize far fewer external resources, our model achieves better adaptation performance on the natural language understanding task (i.e., intent detection and slot filling) than the current state-of-the-art model in the zero-shot scenario.


Introduction
Task-oriented dialogue systems have been widely adopted in industry (e.g., Amazon Alexa, Google Home, Apple Siri, Microsoft Cortana) as virtual agents that tend to the needs of users. However, these agents have mostly been trained on monolingual datasets, which are often expensive to build or acquire. To cope with the scarcity of dialogue data in low-resource languages, we are motivated to look into cross-lingual dialogue systems that can adapt with very little or no training data in the target language.
The task of zero-shot adaptation of dialogue systems to different languages is relatively new and has not yet been explored thoroughly. The main approach of previous work (Upadhyay et al., 2018; Chen et al., 2018; Schuster et al., 2019) on this task is to use aligned cross-lingual word embeddings between the source and target languages. However, this method suffers from imperfect alignments between the source and target language embeddings. This can be attributed not only to the noise in aligning two different embeddings, but also to the inherent discrepancies between languages such as Thai and English, which come from entirely different roots. To address such variance in the alignment, we turn to probabilistic modeling with latent variables, as it has been used successfully in several recent task-oriented dialogue systems (Wen et al., 2017; Zhao et al., 2017; Le et al., 2018).
However, we notice that naively using latent variables does not help the model improve much in slot filling and intent prediction. We hypothesize that the variance of the cross-lingual word embeddings is too large for the model to learn any meaningful latent variables. Hence, we propose to first refine the cross-lingual embeddings with ∼10 seed word pairs related to the dialogue domains. We then add Gaussian noise (Zheng et al., 2016) to further compensate for the imperfect alignment of the cross-lingual embeddings.
As a result, the combination of these methods allows us to build a transferable latent variable model that learns a distribution of training-language inputs that is invariant to noise in the cross-lingual word embeddings. This enables our model to capture the variance of semantically similar sentences across different languages, and to achieve state-of-the-art results in zero-shot adaptation from English to Spanish and Thai on the natural language understanding task (i.e., intent prediction and slot filling) on the dataset proposed by Schuster et al. (2019), even though we use far fewer external resources (i.e., ∼10 seed word pairs) while others utilize large bilingual corpora. We further visualize the learned latent variables to confirm that words and sentences with the same meaning have similar distributions.

Related Work
Cross-lingual transfer learning, one of the low-resource topics (Gu et al., 2018; Lee et al., 2019; Xu et al., 2018), has attracted growing attention recently, alongside the rapid development of cross-lingual word embeddings. Artetxe et al. (2017) proposed a self-learning framework and utilized a small word dictionary to learn the mapping between source and target word embeddings. Conneau et al. (2018) leveraged adversarial training to learn a linear mapping from a source to a target space without using parallel data. Joulin et al. (2018) utilized the Relaxed CSLS loss to optimize this mapping problem. Winata et al. (2019) introduced a method that leverages cross-lingual meta-representations for code-switching named entity recognition by combining multiple monolingual word embeddings. Chen et al. (2018) proposed a teacher-student framework leveraging bilingual data for cross-lingual transfer learning in dialogue state tracking. Upadhyay et al. (2018) leveraged joint training and cross-lingual embeddings for zero-shot and almost zero-shot transfer learning in intent prediction and slot filling. Finally, Schuster et al. (2019) utilized Multilingual CoVe embeddings obtained from training machine translation systems, as in McCann et al. (2017). The main difference between our work and previous work is that our model does not leverage any external bilingual data other than 11 word pairs for embedding refinement.

Methodology
Our model consists of a refined cross-lingual embedding layer followed by a BiLSTM (Hochreiter and Schmidhuber, 1997) which parameterizes the Latent Variable Model, as illustrated in Figure 2. We jointly train our model to predict both slots and user intents. We denote w = [w_1, ..., w_T] as the input words and e = [e_1, ..., e_T] as the word embeddings of w. The slot at time-step t is s_t, while the intent for each sentence w is denoted as I. Note that only matrices are bold-faced.

Cross-lingual Embeddings Refinement
To further refine the cross-lingual alignments for our task, we draw on the hypothesis that domain-related words are more important than others. Hence, as shown in Figure 1, we propose to refine the cross-lingual word embeddings (Joulin et al., 2018) using very few parallel word pairs, which are obtained by selecting 11 English words related to the dialogue domains (weather, alarm, and reminder) and translating them using bilingual lexicons. We refine the embeddings by leveraging the framework proposed in Artetxe et al. (2017).
Let X and Z be the aligned cross-lingual word embeddings of the two languages, and let X_{i*} and Z_{j*} be the embeddings of the i-th source word and the j-th target word. We denote a binary dictionary matrix D: D_{ij} = 1 if the i-th source-language word is aligned with the j-th target-language word, and D_{ij} = 0 otherwise. The goal is to find the optimal mapping matrix W^* by minimizing:

$$W^* = \arg\min_{W} \sum_{i} \sum_{j} D_{ij} \, \lVert X_{i*} W - Z_{j*} \rVert^2 \tag{1}$$

Following Artetxe et al. (2016), with orthogonal constraints, mean centering, and length normalization, we can maximize the following instead:

$$W^* = \arg\max_{W} \mathrm{Tr}\left( X W Z^\top D^\top \right) \tag{2}$$

We iteratively optimize Equation 2 until the distances between domain-related seed words fall below a certain threshold after refinement. Figure 1 illustrates the better alignment of domain-related words after refinement.
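As an illustration of one refinement step, the sketch below solves the trace objective Tr(X W Z^T D^T) under an orthogonality constraint, using the closed-form SVD solution from the Artetxe et al. line of work (W = U V^T, where U S V^T is the SVD of X^T D Z). The function names, the toy data, and the exact order of the normalization steps are assumptions made for this example, not the authors' implementation; the iterative stopping rule based on seed-word distances is not shown.

```python
import numpy as np

def normalize_rows(m):
    """Length-normalize rows, mean-center columns, then renormalize (assumed order)."""
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    m = m - m.mean(axis=0, keepdims=True)
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def solve_mapping(X, Z, D):
    """Return the orthogonal W that maximizes Tr(X W Z^T D^T)."""
    # With X^T D Z = U S V^T, the maximizer over orthogonal matrices is U V^T.
    U, _, Vt = np.linalg.svd(X.T @ D @ Z)
    return U @ Vt

# Toy data (hypothetical): the source space is a rotated copy of the target space.
rng = np.random.default_rng(0)
Z = normalize_rows(rng.normal(size=(50, 8)))    # target embeddings
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # hidden orthogonal map
X = Z @ Q.T                                     # source embeddings, so X @ Q == Z

# Dictionary matrix with 11 seed pairs (word i aligned with word i).
D = np.zeros((50, 50))
seed = np.arange(11)
D[seed, seed] = 1.0

W = solve_mapping(X, Z, D)
mean_seed_distance = np.linalg.norm(X[seed] @ W - Z[seed], axis=1).mean()
```

In this noise-free toy setting the 11 seed pairs are enough to recover the hidden rotation; with real embeddings the seed distances only shrink, which is why a distance threshold serves as the stopping criterion.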

Gaussian Noise Injection
To cope with the noise in alignments, we inject Gaussian noise into the English embeddings, so that the trained model is more robust to variance. This is a regularization method that improves generalization to unseen inputs in different languages, particularly languages from different roots such as Thai and Spanish. The final embeddings are e^* = [e_1 + N_1, ..., e_T + N_T], where N_t ∼ N(0, 0.1I).
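A minimal sketch of this noise injection, assuming NumPy arrays for the embeddings. The function name `inject_noise` is hypothetical. Note that a variance of 0.1 corresponds to a standard deviation of sqrt(0.1), which is what the sampler expects.

```python
import numpy as np

def inject_noise(e, variance=0.1, rng=None):
    """Return e + N with N ~ N(0, variance * I), freshly sampled on every call."""
    if rng is None:
        rng = np.random.default_rng()
    # rng.normal takes the standard deviation, not the variance.
    return e + rng.normal(0.0, np.sqrt(variance), size=e.shape)

# Toy check on a zero "embedding" matrix: the output is then pure noise.
e = np.zeros((20000, 4))
noisy = inject_noise(e, rng=np.random.default_rng(0))
```

Calling the function once per training batch gives different noise in each iteration, while the frozen embeddings themselves stay unchanged.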

Latent Variable Model (LVM)
Even given a near-perfect cross-lingual embedding, there is still noise caused by the inherent discrepancies between the source and target languages. This noise is amplified when combined with imperfect alignment, and it makes point estimation vulnerable to the small but non-negligible differences across languages. Using latent variables instead allows us to model a distribution that captures the variance of semantically similar sentences across different languages. The whole training process is defined as follows:

$$[h_1, \ldots, h_T] = \mathrm{BiLSTM}(e^*)$$

$$m_t = w_a^\top h_t, \qquad a_t = \frac{\exp(m_t)}{\sum_{j=1}^{T} \exp(m_j)}, \qquad v = \sum_{t=1}^{T} a_t h_t$$

$$[\mu_t^S ; \log (\sigma_t^S)^2] = W_r^S h_t, \qquad [\mu^I ; \log (\sigma^I)^2] = W_r^I v$$

$$z_t^S \sim q_t^S(z \mid h_t), \qquad z^I \sim q^I(z \mid v)$$

$$p_t^S(s_t \mid z_t^S) = \mathrm{Softmax}(W_g^S z_t^S), \qquad p^I(I \mid z^I) = \mathrm{Softmax}(W_g^I z^I)$$

where the attention vector v is obtained by following Felbo et al. (2017), w_a is the weight matrix for the attention layer, and W_{r,g}^{S,I} are trainable parameters. The superscripts S and I refer to slot prediction and intent detection respectively; the subscript "r" refers to "recognition", for obtaining the mean and variance vectors, while the subscript "g" refers to "generation", for predicting the slots and intents. q_t^S ∼ N(µ_t^S, (σ_t^S)^2 I) and q^I ∼ N(µ^I, (σ^I)^2 I) are the posterior approximations from which we sample our latent vectors z_t^S and z^I. Finally, p_t^S and p^I are the predictions for the slot of the t-th token and the intent of the utterance respectively. The objective functions for slot filling and intent prediction are:

$$\mathcal{L}^S = -\sum_{t=1}^{T} \log p_t^S(s_t \mid z_t^S), \qquad \mathcal{L}^I = -\log p^I(I \mid z^I)$$

hence, the final objective function to minimize is

$$\mathcal{L} = \mathcal{L}^S + \mathcal{L}^I$$

The model prediction is not deterministic during training, since the latent variables z_t^S and z^I are sampled from Gaussian distributions. Therefore, at inference time, we use the means µ_t^S and µ^I in place of z_t^S and z^I respectively to make the prediction deterministic.
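The recognition and generation steps described in this section can be sketched in NumPy as follows. This is an illustrative reading, not the authors' code: the BiLSTM outputs `h` are replaced by random vectors, the function names and all shapes are assumptions, the recognition weights are assumed to output the mean and the log-variance concatenated, and the loss is plain cross-entropy on the sampled predictions. Sampling uses the reparameterization z = µ + σ·ε, and the mean µ is used when no random generator is passed, which mirrors the deterministic inference described above.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=-1, keepdims=True)

def attention_pool(h, w_a):
    """Attention-weighted sum of hidden states: (T, H) -> (H,)."""
    a = softmax(h @ w_a)   # one attention weight per time step
    return a @ h

def latent_head(x, W_r, W_g, rng=None):
    """Recognition (mean, log-variance) followed by generation (softmax).

    x: (..., H), W_r: (H, 2L), W_g: (L, C). Row-vector convention.
    With rng=None the mean is used in place of a sample (inference).
    """
    mu, log_var = np.split(x @ W_r, 2, axis=-1)
    if rng is None:
        z = mu
    else:
        z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    return softmax(z @ W_g)

# Toy dimensions and random stand-ins for the BiLSTM states (all hypothetical).
rng = np.random.default_rng(0)
T, H, L, n_slots, n_intents = 5, 6, 3, 4, 2
h = rng.normal(size=(T, H))
w_a = 0.1 * rng.normal(size=H)
W_r_S, W_g_S = 0.1 * rng.normal(size=(H, 2 * L)), 0.1 * rng.normal(size=(L, n_slots))
W_r_I, W_g_I = 0.1 * rng.normal(size=(H, 2 * L)), 0.1 * rng.normal(size=(L, n_intents))
v = attention_pool(h, w_a)

# Training-time forward pass: latent vectors are sampled.
p_slot_train = latent_head(h, W_r_S, W_g_S, rng=rng)
p_intent_train = latent_head(v, W_r_I, W_g_I, rng=rng)
gold_slots, gold_intent = np.array([0, 1, 1, 0, 2]), 1
loss = (-np.log(p_slot_train[np.arange(T), gold_slots]).sum()
        - np.log(p_intent_train[gold_intent]))

# Inference: means replace the samples, so repeated calls agree.
p_slot = latent_head(h, W_r_S, W_g_S)
p_intent = latent_head(v, W_r_I, W_g_I)
```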

Dataset
We conduct our experiments under the zero-shot scenario on the multilingual task-oriented dialogue dataset presented by Schuster et al. (2019). Our model is trained only on the English data and then tested zero-shot on the Spanish and Thai test sets. We delexicalize words by replacing the tokens that represent numbers, time (such as am, pm), and duration (such as 30min) with the special tokens <number>, <time>, and <last> respectively.
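A possible token-level delexicalizer is sketched below. The three special tokens come from the text; the regular expressions are assumptions chosen to cover the quoted examples (am, pm, 30min) and are not the authors' rules. Each token is replaced one-for-one, so slot labels stay aligned with the tokens.

```python
import re

# Ordered rules: the first matching pattern wins. All patterns are illustrative.
_RULES = [
    (re.compile(r"^\d+(min|mins|minutes|h|hr|hrs|hours)$"), "<last>"),   # duration
    (re.compile(r"^(\d{1,2}(:\d{2})?)?(am|pm)$"), "<time>"),             # time
    (re.compile(r"^\d+([.,:]\d+)*$"), "<number>"),                       # number
]

def delexicalize(tokens):
    """Replace number, time, and duration tokens with special tokens."""
    out = []
    for tok in tokens:
        low = tok.lower()
        for pattern, tag in _RULES:
            if pattern.match(low):
                out.append(tag)
                break
        else:
            out.append(tok)   # no rule matched: keep the token unchanged
    return out

example = delexicalize("set alarm for 7 am in 30min".split())
```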

Training Details
During training, we freeze the word embeddings of the primary language (i.e., English); for the zero-shot cross-lingual test, we simply replace them with the corresponding aligned word embeddings of the unseen target language (i.e., Spanish or Thai). We use a bi-directional LSTM with a hidden dimension of 250, and a latent variable model whose mean and variance vectors each have size 100. Gaussian noise with zero mean and variance 0.1 is injected dynamically across iterations. We use accuracy to evaluate intent prediction, and the F1 score computed over the standard BIO structure to evaluate slot filling. Note that we never use any target-language evaluation data to select the model for zero-shot cross-lingual adaptation; instead, we use the English validation set and an early-stopping strategy based on the slot F1 score.
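For concreteness, a span-level F1 over BIO tags can be computed as in the sketch below. It is a simplified stand-in for the usual CoNLL-style scoring, not the evaluation script used in the paper: the function names are hypothetical, and a stray I- tag with no matching B- tag is ignored here, which is an assumption (some scorers treat it as the start of a span).

```python
def bio_spans(tags):
    """Extract (start, end, label) spans from one BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def slot_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 over exactly matching spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

gold = [["O", "B-time", "I-time", "O", "B-loc"]]
pred = [["O", "B-time", "I-time", "O", "O"]]
f1 = slot_f1(gold, pred)   # one of two gold spans is found, and no spurious spans
```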

Word Pairs
We choose the number of word pairs based on the vocabulary size of the corpus. Intuitively, the larger the vocabulary is, the more words we need to align across languages, and the more word pairs we need to achieve good performance. We select 11 domain-related words that occur frequently in the English training set; this is around 0.25% of the vocabulary size of the English training set. The English seed words we selected are weather, forecast, temperature, rain, hot, cold, remind, forget, alarm, cancel, and tomorrow, which are related to the three dialogue domains (weather, alarm, and reminder). We translate them using bilingual dictionaries. The corresponding translations in Spanish and Thai are clima, pronóstico, temperatura, lluvia, caliente, frío, recordar, olvidar, alarma, cancelar, mañana and อากาศ, พยากรณ์, อุณหภูมิ, ฝน, ร้อน, หนาว, เตือน, ลืม, เตือน, ยกเลิก, พรุ่ง respectively.
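The seed pairs listed above can be turned into a binary dictionary matrix (entry (i, j) is 1 when source word i is aligned with target word j) roughly as follows. The English-Spanish pairs are taken from the text in the order given; the helper name `build_dictionary` and the toy vocabularies are assumptions for illustration.

```python
import numpy as np

# English-Spanish seed pairs, in the order listed in the text.
SEED_PAIRS_ES = [
    ("weather", "clima"), ("forecast", "pronóstico"),
    ("temperature", "temperatura"), ("rain", "lluvia"),
    ("hot", "caliente"), ("cold", "frío"),
    ("remind", "recordar"), ("forget", "olvidar"),
    ("alarm", "alarma"), ("cancel", "cancelar"),
    ("tomorrow", "mañana"),
]

def build_dictionary(pairs, src_vocab, tgt_vocab):
    """Binary matrix D with D[i, j] = 1 if src word i is aligned with tgt word j."""
    src_index = {w: i for i, w in enumerate(src_vocab)}
    tgt_index = {w: j for j, w in enumerate(tgt_vocab)}
    D = np.zeros((len(src_vocab), len(tgt_vocab)))
    for src, tgt in pairs:
        if src in src_index and tgt in tgt_index:   # skip out-of-vocabulary pairs
            D[src_index[src], tgt_index[tgt]] = 1.0
    return D

# Toy vocabularies (hypothetical): the seed words plus one unaligned word each.
src_vocab = [s for s, _ in SEED_PAIRS_ES] + ["the"]
tgt_vocab = ["el"] + [t for _, t in SEED_PAIRS_ES]
D = build_dictionary(SEED_PAIRS_ES, src_vocab, tgt_vocab)
```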

Evaluation
We implement and evaluate the following models:

Zero-shot SLU
Upadhyay et al. (2018) used cross-lingual embeddings (Bojanowski et al., 2017) for zero-shot transfer learning.

Conditional Random Fields (CRF)
We reproduce the baseline model in Schuster et al. (2019), and also add embedding noise, cross-lingual refinement, and delexicalization.

Latent Variable Model (LVM) - Ours
We replace the CRF module with latent variables and also apply it to intent prediction.
In addition, we directly compare with the baseline models reported in Schuster et al. (2019):

Multi. CoVe w/ auto
They combined Multilingual CoVe with an auto-encoder objective and then used the trained encoder with the CRF model.

Figure 3: Visualization of latent variables on words (left) and sentences (right). Left: We choose "weather-clima-อากาศ" and "evening-noche-เย็น" from parallel sentences. English: "What will the weather be like this evening", Spanish: "Cómo será el clima esta noche", Thai: "ตอน เย็น นี้ อากาศ จะ เป็น อย่างไร". Right: We choose two English sentences and show their distributions and those of the corresponding Spanish and Thai translations. The plot legend distinguishes English, Spanish, and Thai; the sentence labels shown in the plot are "What will the weather be like this evening" and "Cancel tuesday alarm clock".
Translate Train
They trained a supervised machine translation system to translate English data into the target language and then trained the CRF model on this translated dataset.

Results & Discussion
From Table 1, LVM generally outperforms the CRF models. This is because, for semantically equivalent words (e.g., weather and clima), LVM treats such nearby points as belonging to the same distribution, whereas CRF is more likely to classify them differently. This is visible in Figure 3, where the latent variables show similar distributions for semantically similar sentences and words. In addition, adding only Gaussian noise to the vanilla BiLSTM improves prediction performance significantly, which implies that noise injection makes our model more robust to the noisy signals that come from the target embedding inputs.
Furthermore, cross-lingual embedding refinement is clearly more effective for Spanish than for Thai. We attribute this to the quality of the alignments in the two languages. Spanish is much more lexically and grammatically similar to English than Thai is, so word-level embedding refinement works reasonably well. Jointly incorporating all three methods (Gaussian noise injection, cross-lingual embedding refinement, and delexicalization) further reduces the noise in the inputs and makes the model more robust to noise, which helps LVM approximate the distribution more easily.
Finally, in Table 2, we ablate the LVM to check whether the performance boost comes simply from the increased number of parameters. Comparing against variants that remove the LVM or replace it with an MLP, we see clear performance gains from using the LVM.

Conclusion
In this paper, we propose a transferable latent variable model that focuses on improving the zero-shot cross-lingual adaptation of the natural language understanding task to low-resource languages. We show that a combination of 1) cross-lingual embedding refinement, 2) Gaussian noise injection, and 3) latent variables is effective in coping with the variance of semantically similar sentences across different languages, and the visualizations of the latent variables confirm this. We leverage very few resources (i.e., 11 seed word pairs) and achieve state-of-the-art performance for English-to-Spanish and English-to-Thai in the zero-shot cross-lingual scenario.