Cross-lingual Transfer Learning with Data Selection for Large-Scale Spoken Language Understanding

A typical cross-lingual transfer learning approach for boosting model performance on a target language is to pre-train the model on all available supervised data from another language. However, in large-scale systems this leads to long training times and high computational requirements. In addition, characteristic differences between the source and target languages raise a natural question of whether source data selection can improve the knowledge transfer. In this paper, we address this question and propose a simple but effective language-model-based source-language data selection method for cross-lingual transfer learning in large-scale spoken language understanding. The experimental results show that with data selection i) the amount of source data, and hence the training time, is reduced significantly and ii) model performance is improved.


Introduction
Spoken Language Understanding (SLU) plays an important role in spoken language technology and is typically divided into two sub-tasks: Intent Classification (IC) and Slot Filling (SF). While the former identifies a speaker's intent, the latter extracts semantic constituents from the natural language query. Recently, there have been emerging efforts on Cross-Lingual Transfer Learning (CLTL) methods to reduce data requirements in deep neural network (DNN) based SLU. A typical approach is to pre-train the model on labeled data from a richly resourced language, and then either apply it directly to a target language (Upadhyay et al., 2018) or fine-tune it on a smaller amount of supervised data from a target language (Do and Gaspers, 2019). Both in SLU and in other NLP tasks, however, prior work on CLTL typically utilized all available source data to transfer knowledge, as CLTL has mainly been investigated in rather small academic settings. In large-scale settings with millions of utterances, this would lead to costly training times, high computational requirements and optimization difficulties. Moreover, the different characteristics of the source and target languages raise a natural question of whether source-language data selection, in which only the most relevant source instances are picked for pre-training, can improve CLTL performance, as a considerable amount of source utterances might be "irrelevant" to the target language or even yield negative transfer.
Addressing these questions, in this paper we explore source-language data selection for CLTL in SLU, focusing especially on large-scale settings in which we assume the existence of a large amount (millions) of source data and a moderate amount (thousands) of target data. Since the effectiveness of pre-training in CLTL depends on the similarity between the source data distribution and the true distribution of the target language, we propose a source-language data selection method which computes the relevance score of each source instance to the target language by using several N-gram language model based metrics. Our method is designed to satisfy both the IC and SF sub-tasks in a multi-task training scenario, and to select data from multiple source languages, both of which have rarely been studied in the literature.
Our experimental results show that our proposed data selection method: i) improves CLTL performance in large-scale settings, while reducing the amount of source data significantly ii) brings higher gains to SF but does not hurt IC, and iii) can select data efficiently from multiple source languages for a single target language.

Related work
Prior work on CLTL for SLU has mainly focused on using machine translation (e.g. García et al. (2012); He et al. (2013); Gaspers et al. (2018)). Only recently have approaches based on cross-lingual joint training and cross-lingual supervised pre-training for DNNs been proposed. The former transfers knowledge in an SLU system by jointly training on (relatively balanced) source and target data (e.g. Li et al. (2018)). The latter is usually used when the amount of source data is significantly larger than the amount of target data. In particular, the SLU model is pre-trained on a large amount of supervised source data, and then either tested directly on the target language (e.g. Upadhyay et al. (2018)) or fine-tuned on a smaller amount of supervised target data (e.g. Do and Gaspers (2019)). Our CLTL method follows the line of Do and Gaspers (2019), but instead of utilizing all available source data, we aim at selecting the subset of the source data most relevant to the target language.
Data selection has been studied in the field of domain adaptation, with most of the work targeting machine translation (Axelrod et al., 2011; van der Wees et al., 2017). These approaches usually rank sentence pairs in a large bitext from a source domain according to their difference in cross-entropy or perplexity with respect to a target-domain corpus, and then select the top n sentence pairs to train a machine translation system for the target domain. Although this task also deals with multiple languages, it is not a CLTL problem. Applications of data selection to other tasks are relatively rare, e.g., dependency parsing (Plank and van Noord, 2011), sentiment analysis (Remus, 2012), and POS tagging (Ruder and Plank, 2017).
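The cross-entropy-difference ranking used in this line of work can be sketched as follows. This is a minimal illustration, not the method of any cited paper; the unigram language models passed as plain functions are hypothetical stand-ins for real in-domain and general-domain LMs.

```python
import math

def cross_entropy(tokens, lm):
    # Per-token cross-entropy of a sentence under a language model;
    # `lm(w)` returns the probability of token w (a unigram stand-in).
    return -sum(math.log(lm(w)) for w in tokens) / len(tokens)

def rank_by_ce_difference(sentences, in_domain_lm, general_lm):
    # Rank sentences by H_in(s) - H_gen(s); a lower difference means the
    # sentence looks more in-domain relative to the general corpus.
    return sorted(sentences,
                  key=lambda s: cross_entropy(s, in_domain_lm)
                              - cross_entropy(s, general_lm))
```

Taking the top n sentences of the returned ranking then yields the selected training set for the target domain.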
Several common data metrics have been proposed to rank the relevance of source instances to the target domain, e.g., word similarity measures and diversity. However, to the best of our knowledge, data selection has not yet been explored for DNN-based CLTL in SLU. In addition, the two challenges tackled in this paper, i.e. applying data selection in a multi-task training scenario and dealing with multiple source languages, have rarely been studied in the literature.

Spoken language understanding

Task definition
Suppose that for a language l with word vocabulary V_l, intent vocabulary I_l and slot vocabulary S_l, we have a set of utterances, each annotated with an intent label and with a slot label per word. The task of SLU is divided into two sub-tasks: i) intent classification, which learns a function mapping each unlabeled utterance to an intent label ∈ I_l, and ii) slot filling, which learns a function mapping each unlabeled token to a slot label ∈ S_l.
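For concreteness, the two annotation layers can be illustrated as below. The utterance, intent name and slot labels are hypothetical examples, not taken from the paper's data; the BIO tagging scheme shown is one common convention for slot filling.

```python
# A hypothetical annotated utterance: IC assigns one intent per
# utterance, while SF assigns one slot label per token.
utterance = ["play", "songs", "by", "miles", "davis"]
intent = "PlayMusic"                             # label from I_l
slots = ["O", "O", "O", "B-Artist", "I-Artist"]  # labels from S_l

# SF is a sequence labeling task: exactly one slot label per token.
assert len(slots) == len(utterance)
```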

Model
Our multi-task SLU model consists of: i) a shared embedding layer which is the concatenation of a 1-dimensional convolutional neural network based character embedding and a word embedding; ii) a shared encoder, a two-layer bi-directional highway Long Short-Term Memory network (Srivastava et al., 2015), which takes the embedding layer outputs as inputs and learns a contextual, fixed-dimensional representation for each token; iii) two decoders for the SF and IC sub-tasks, each consisting of a stack of two dense layers and a softmax layer on top.
The two sub-tasks are trained jointly via a weighted loss function: L = α_i L̂_i + α_s L̂_s, where L̂_i and L̂_s are the normalized cross-entropy losses with label smoothing (Szegedy et al., 2016) of IC and SF, respectively.

Cross-lingual transfer learning
Given a target language l_t with a limited supervised data set D_{l_t} divided into a training set D^T_{l_t} and a validation set D^V_{l_t}, CLTL aims at improving the SLU performance on l_t by leveraging the larger supervised data sets D_{l_s1}, ..., D_{l_sN} from N source languages l_s1, ..., l_sN. A common idea behind CLTL methods is to map the source and target data into a shared space, so that knowledge can be transferred between languages.
In this paper, we assume the availability of a multilingual word embedding function W : V_l → R^d which maps a word of any language l into a shared d-dimensional space, where V_l is the vocabulary of language l. In addition, for a source language l_si for which a bilingual dictionary D_{l_si,l_t} : V_{l_si} → V_{l_t} is available, a word w can alternatively be mapped into the shared space by using W(D_{l_si,l_t}(w)). The word embedding layer in our SLU model is fixed to the mapping from the word vocabulary to this shared space and is not updated during training. In contrast, the character embedding layer in our model is initialized randomly and updated during training. Our CLTL training strategy consists of two phases: First, the model is pre-trained on the source data D_{l_s1} ∪ ... ∪ D_{l_sN} for T_s epochs and validated on D^V_{l_t}. Second, the model is fine-tuned on the target data D^T_{l_t} for T_t epochs and validated on D^V_{l_t}.
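The two-phase schedule can be sketched as below. `model.fit` is a hypothetical stand-in interface for one training phase; the data arguments correspond to the source-language union, D^T_{l_t} and D^V_{l_t} from the text, and the default epoch counts match the Setup section.

```python
# Sketch of the two-phase CLTL training schedule (hypothetical interface).
def cltl_train(model, source_data, target_train, target_dev,
               source_epochs=6, target_epochs=25):
    # Phase 1: pre-train on the combined source-language data,
    # validating against the target-language dev set.
    model.fit(source_data, epochs=source_epochs, validation_data=target_dev)
    # Phase 2: fine-tune on the target-language training data.
    model.fit(target_train, epochs=target_epochs, validation_data=target_dev)
    return model
```

Note that the target dev set is used for validation in both phases, so pre-training progress is measured directly on the target language.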

Data selection
The effectiveness of pre-training in CLTL depends on the similarity between the source data distribution and the true distribution of the target language. Let us consider each component of our model. First, for the word embedding layer, obtaining similar distributions of the source data and the target language can be considered an "easy" task thanks to the multilingual word embedding. Second, for the character embedding layer and the encoder, it depends on how similar the character patterns and the word patterns of the source data and the target language are, respectively. Finally, for the decoders, similar distributions can be expected given good representations provided by the encoder. We therefore propose a relevance metric for a source utterance u with respect to the target language:

R(u) = Σ_{k=1..M} α_k f_k(u),

where f_k and α_k are an attribute value and its importance weight, respectively, and M is the total number of attributes. Each attribute is associated with an N-gram word- or character-based language model trained on the target language, which can be used to estimate the similarity of a pattern to the target language.

Let us consider an attribute f_k and its N-gram language model LM_k trained on the target language l_t. Given an utterance u = w_1 ... w_n in a source language l_si and the bilingual dictionary D_{l_si,l_t} : V_{l_si} → V_{l_t} mapping a word in l_si to a word in l_t, we call S the set of N-grams generated from D_{l_si,l_t}(w_1) ... D_{l_si,l_t}(w_n). The attribute value f_k(u) is computed as the average language model score of the elements in S. (Empirically, the mapping W(D_{l_si,l_t}(w)) works well for French as the target language, while W(w) is better for German.) We then normalize f_k(u) at the intent level over I_u, the set of utterances (from all source languages) having the same intent as u. Using the proposed relevance metric, the source data can be ranked in descending order, and only the top-K utterances are selected for pre-training.
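The scoring pipeline can be sketched as follows. The bilingual dictionary and the LM scoring function are toy stand-ins, and the per-intent min-max normalization is an assumption of this sketch; the exact normalization formula is not spelled out above.

```python
from collections import defaultdict

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def attribute_value(utterance, bilingual_dict, lm_score, n=2):
    # f_k(u): average target-language LM score over the N-grams of the
    # word-by-word translated utterance; `lm_score` stands in for LM_k.
    translated = [bilingual_dict.get(w, w) for w in utterance]
    grams = ngrams(translated, n)
    return sum(lm_score(g) for g in grams) / len(grams) if grams else 0.0

def relevance_scores(raw_values, intents, weights):
    # R(u) = sum_k alpha_k * f_k(u), with each f_k normalized within the
    # set I_u of utterances sharing u's intent (min-max is an assumption).
    by_intent = defaultdict(list)
    for i, intent in enumerate(intents):
        by_intent[intent].append(i)
    scores = [0.0] * len(intents)
    for k, alpha in enumerate(weights):
        for idxs in by_intent.values():
            vals = [raw_values[i][k] for i in idxs]
            lo, hi = min(vals), max(vals)
            for i in idxs:
                norm = (raw_values[i][k] - lo) / (hi - lo) if hi > lo else 0.0
                scores[i] += alpha * norm
    return scores
```

Ranking the source utterances by these scores in descending order and keeping the top K% then gives the pre-training set.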

Data
Supervised data For large-scale experiments, we extracted random samples from a large-scale SLU system. The data are representative of user requests to voice-enabled devices and are labeled with intents and slots. We include four languages in our experiments, i.e. English (EN), German (DE), French (FR) and Spanish (ES). DE and FR are used as the target languages in our experiments. Data statistics can be found in Table 1.
Unlabelled data sets For each of the target languages (DE and FR), we build N-gram language models on unlabelled data sets in that language. We make use of 3M DE sentences and 1M FR sentences which are freely available from the Leipzig unlabelled corpus collection (Goldhahn et al., 2012). In addition, we collect 500K DE and 2.5K FR unlabelled utterances of a similar nature to the labelled utterances from the SLU system.
Pre-trained resources We use pre-trained 300-dimensional multilingual word embeddings and bilingual dictionaries provided by  and , respectively.

Setup
We carry out four experiments with 10K and 20K target data in DE and FR (see Table 1 for the experiment names and the labelled data statistics). We conduct experiments transferring from one source language (EN) to one target language (DE), and from three source languages (EN, DE, ES) to one target language (FR). In each experiment, we compare four different training strategies: i) Mono (w/o CLTL): an SLU model is trained only on the supervised target data. ii) CLTL: all supervised source data is used for pre-training, followed by fine-tuning on the target-language data. iii) CLTL-RD: a random K% of the original source utterances is used for pre-training. iv) CLTL-DS: the K% most relevant source utterances, selected using our proposed relevance metric, are used for pre-training.
Settings The convolutions used for character embeddings have window sizes of 2, 3 and 4, each consisting of 64 filters. The sizes of the LSTM and dense layers are set to 300. All dropout keep probabilities are set to 0.9. Hyper-parameter tuning on the development set results in α_i = 0.2, α_s = 0.8 and a label smoothing rate of 0.1. We use the Adam optimizer with learning rate 0.001. Exponential decay is applied to the learning rate with decay steps = 500 and decay rate = 0.95. For CLTL, the numbers of training epochs T_s and T_t are set to 6 and 25, respectively. For data selection, we use four N-gram language models, i.e. word-based bi-gram and tri-gram language models and character-based bi-gram and tri-gram language models. The four importance weights are set to 1.0 each. For evaluation we use the standard metrics, i.e. F1, precision and recall for slot filling (computed using the CoNLL 2002 script) and accuracy for intent classification.

Results and discussions
Exp.     Model     Slot P        Slot R        Slot F1       Intent Acc.
10K-DE   Mono      76.4 ± 1.6    75.4 ± 1.4    75.9 ± 1.5    87.9 ± 0.4
         CLTL      79.7 ± 1.8    77.6 ± 1.1    78.7 ± 1.5    89.5 ± 0.3
         CLTL-RD   79.3 ± 0.9    77.0 ± 0.5    78.1 ± 0.6    89.5 ± 0.4
         CLTL-DS   80.1 ± 0.6    78.6 ± 0.7    79.4 ± 0.6    90.0 ± 0.3

Table 2: Performance of CLTL on large-scale data sets. K is set to 50 (%) in CLTL-RD and CLTL-DS. Reported results are the mean and std. values of 3 runs.

Is 100% better than 50%? Table 2 shows the performance of the different training strategies in our experiments. In the CLTL-RD and CLTL-DS settings, K is set to 50 (%). This helps to answer the question of whether using the full source data (100%) is better than using just 50%. Interestingly, although 100% (CLTL) is better than a random 50% (CLTL-RD) in 3 out of 4 experiments, it is surpassed by our selected 50% in all of the experiments. These results not only demonstrate the effectiveness of our proposed data selection metric but also suggest a potentially powerful application of source-language data selection in cross-lingual transfer learning.

Slot filling vs. intent classification As shown in Table 2, our data selection method tends to bring higher gains to SF than to IC. One possible reason is that IC is the easier of the two sub-tasks (fewer categories, single-label classification vs. sequence label decoding, etc.). However, it is important to stress that our data selection does not hurt IC, meaning that the method is useful for joint learning.
One vs. several source languages The similar trends in the experiments with DE as target and with FR as target show that our proposed data selection method can be applied in both single-source and multi-source transfer learning. To compare the utility of multiple sources vs. a single source, we ran an experiment on 10K-FR using only English as the source language. The mean slot F1 and IC accuracy dropped from 81.0% and 91.2% to 79.9% and 90.8%, respectively, potentially because English is not the closest of the source languages to French; with multiple sources, the model can likely pick more relevant source utterances. This indicates that our method works in both settings, with higher gains when using multiple source languages.
Value of K and importance weights of the language models? One may ask whether it is possible to optimize the importance weights of the language models and choose K automatically. A possible solution is Bayesian optimization (Ruder and Plank, 2017). Instead of selecting the top K% of utterances, we define a threshold θ: u is selected if R(u) ≥ θ. Then θ and the α_k become hyper-parameters which can be optimized via Bayesian optimization, with the score of the CLTL model on a development set as the objective. However, while this could improve performance, it is expensive, especially in large-scale settings, i.e. the optimization would likely add more computation time than is saved by training on a subset.
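The thresholded variant can be sketched as follows. For illustration the expensive Bayesian optimization is replaced by a simple random search over candidate thresholds; `dev_score` is a hypothetical callback that trains a CLTL model on the selection and returns its dev-set score.

```python
import random

def select_by_threshold(scored_utterances, theta):
    # Thresholded variant: keep u whenever R(u) >= theta,
    # instead of taking a fixed top-K%.
    return [u for u, r in scored_utterances if r >= theta]

def tune_theta(scored_utterances, dev_score, candidates, trials=5, seed=0):
    # Stand-in for Bayesian optimization: sample candidate thresholds and
    # keep the one maximizing the dev-set score of a model trained on the
    # resulting selection (dev_score is a hypothetical, expensive callback).
    rng = random.Random(seed)
    best_theta, best = None, float("-inf")
    for theta in rng.sample(candidates, k=min(trials, len(candidates))):
        score = dev_score(select_by_threshold(scored_utterances, theta))
        if score > best:
            best_theta, best = theta, score
    return best_theta
```

Each trial requires a full pre-train/fine-tune cycle, which is exactly why this tuning is costly in large-scale settings.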
Performance on a small-scale benchmark data set While we are mainly interested in large-scale SLU, we also strive for a further understanding of CLTL by conducting a similar experiment on a small-scale, widely used SLU benchmark data set, i.e. ATIS (Tür et al., 2010). It contains audio recordings and corresponding annotated transcriptions, in English, of people making flight reservations.
To compare with the state-of-the-art systems, we apply our monolingual model on ATIS. The model reaches 95.6% F1 for slot filling and 96.8% accuracy for intent classification which are comparable to the state-of-the-art results reported on the same data set (Do and Gaspers, 2019).
We then perform a cross-lingual transfer learning experiment from English to German on ATIS. To construct the training sets of the target language, we select two random subsets of 200 and 400 English utterances from the training part of ATIS and translate them into German. The development set of the target language is formed by the German translations of a random subset of 144 English utterances selected from the validation part of ATIS. The test set of the target language is the German translation of the ATIS test set, which includes 893 utterances. The annotated source data comprise 4015 English training utterances from ATIS.


Conclusions
We presented an efficient approach to selecting source data for cross-lingual transfer learning in large-scale SLU. Our results indicate that with data selection we can significantly reduce the amount of source data, and hence training time and computational requirements in large-scale systems, while improving rather than hurting model performance. This suggests an interesting future research direction toward data selection for cross-lingual transfer learning problems.