End-to-End Slot Alignment and Recognition for Cross-Lingual NLU

Natural language understanding (NLU) in the context of goal-oriented dialog systems typically includes intent classification and slot labeling tasks. An effective method for expanding an NLU system to new languages is machine translation (MT) with annotation projection to the target language. Previous work focused on using word alignment tools or complex heuristics for slot annotation projection. In this work, we propose a novel end-to-end model that learns to align and predict slots. Existing multilingual NLU data sets only support up to three languages, which limits the study of cross-lingual transfer. To this end, we construct a multilingual NLU corpus, MultiATIS++, by extending the Multilingual ATIS corpus to nine languages across various language families. We use the corpus to explore various cross-lingual transfer methods, focusing on the zero-shot setting and leveraging MT for language expansion. Results show that our soft-alignment method significantly improves slot F1 over strong baselines on most languages. In addition, our experiments show the strength of using multilingual BERT for both cross-lingual training and zero-shot transfer.


Introduction
As a crucial component of goal-oriented dialogue systems, natural language understanding (NLU) is responsible for parsing the user's utterance into a semantic frame that identifies the user's need. These semantic frames are structured by what the user intends to do (the intent) and the arguments of the intent (the slots) (Tur et al., 2010). Given the English example in Figure 1, we identify the intent of the utterance as "flight" and label the slots to extract the departure city and airline name. Intent detection can be modeled as a sentence classification task where each utterance is labeled with an intent y^I. Slot filling is typically modeled as a sequence labeling task where, given the utterance x_{1...n}, each word x_i is labeled with a slot y_i.

* Work performed while interning at Amazon AI.
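To make the two tasks concrete, a semantic frame pairs one intent label per utterance with one BIO slot tag per word. The utterance and label strings below are hypothetical ATIS-style examples chosen to mirror the Figure 1 description (departure city and airline name), not actual corpus entries:

```python
# A minimal sketch of a semantic frame: one intent per utterance (y^I),
# one BIO slot tag per word (y_i). Strings are illustrative assumptions.
utterance = ["show", "flights", "from", "atlanta", "on", "delta"]

frame = {
    "intent": "flight",              # sentence-level label y^I
    "slots": [                       # one tag per word x_i
        "O", "O", "O",
        "B-fromloc.city_name",       # departure city
        "O",
        "B-airline_name",            # airline name
    ],
}

# Slot labeling is a sequence task: one tag per input word.
assert len(frame["slots"]) == len(utterance)
```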
Despite the high accuracy achieved by neural models on intent detection and slot filling (Goo et al., 2018;Qin et al., 2019), training such models on a new language requires additional efforts to collect large amounts of training data. One would consider transfer learning from high-resource to low-resource languages to minimize the efforts of data collection and annotation. However, currently available multilingual NLU data sets (Upadhyay et al., 2018;Schuster et al., 2019) only support three languages distributed in two language families, which hinders the study of cross-lingual transfer across a broad spectrum of language distances.
In this paper, we introduce a multilingual NLU corpus by extending the Multilingual ATIS corpus (Upadhyay et al., 2018), an existing NLU corpus that includes training and test data for English, Hindi, and Turkish, with six new languages: Spanish, German, Chinese, Japanese, Portuguese, and French. The resulting corpus, MultiATIS++, consists of 37,084 training examples and 7,859 test examples in total, covering nine languages in four language families.
Using our corpus, we evaluate the recently proposed multilingual BERT encoder (Devlin et al., 2019) on the cross-lingual training and zero-shot transfer tasks. In addition, we identify a major drawback in the traditional transfer methods using machine translation (MT): they rely on slot label projections by external word alignment tools (Mayhew et al., 2017;Schuster et al., 2019) or complex heuristics (Ehrmann et al., 2011;Jain et al., 2019) which may not be generalizable to other tasks or lower-resource languages. To address the problem, we propose an end-to-end model that learns to jointly align and predict slots, so that the soft slot alignment is improved jointly with other model components and can potentially benefit from powerful cross-lingual language encoders like multilingual BERT.
Experimental results show that our soft-alignment method achieves significantly higher slot F1 scores than the traditional projection method on most languages, and leads to continuous improvements as the size of the target training data increases, while the traditional method quickly plateaus. In addition, our experiments show the strength of using multilingual BERT on both the cross-lingual training and zero-shot transfer tasks. When given a small amount of annotated data in the target language, multilingual BERT achieves comparable or even higher slot F1 scores than the translation methods.

Related Work
Cross-lingual transfer learning has been studied on a variety of sequence tagging tasks including part-of-speech tagging (Yarowsky et al., 2001;Täckström et al., 2013;Plank and Agić, 2018), named entity recognition (Zirikly and Hagiwara, 2015;Tsai et al., 2016;Xie et al., 2018) and natural language understanding (He et al., 2013;Upadhyay et al., 2018;Schuster et al., 2019). Existing methods can be roughly categorized into two categories: transfer through cross-lingual representations and transfer through machine translation.

Transfer via Cross-Lingual Representations
Recent advances in cross-lingual sequence encoders have enabled transfer between dissimilar languages. Representations learned by multilingual neural machine translation (NMT) encoders have been shown to be effective for cross-lingual text classification (Eriguchi et al., 2018; Yu et al., 2018; Singla et al., 2018). However, these methods still rely on high-quality NMT encoders trained on large amounts of parallel data. In this work, we explore using multilingual BERT (Devlin et al., 2019), a cross-lingual language model that is trained on monolingual texts from a wide range of languages and has been shown to provide powerful sentence representations that lead to state-of-the-art performance on zero-resource cross-lingual language understanding tasks (Lample and Conneau, 2019; Pires et al., 2019).
Transfer via Machine Translation
This approach requires translating the source language training data into the target language, or translating the target language test data into the source language. Despite its empirical success on cross-lingual text classification tasks (Wan, 2009), it faces a challenging problem on sequence tagging tasks: labels on the source language sentences need to be projected onto the translated sentences. Most prior work relies on unsupervised alignment from statistical MT (Yarowsky et al., 2001; Shah et al., 2010; Ni et al., 2017) or attention weights from NMT models (Schuster et al., 2019). Other heuristics have also been explored, such as matching tokens based on their surface forms (Feng et al., 2004; Samy et al., 2005; Ehrmann et al., 2011; Jain et al., 2019). A major problem with these methods is that the projections are produced independently of the sequence labels. By contrast, our method does not rely on heuristic projections, but models label projection through an attention model that can be jointly trained with other model components on the machine-translated data.

Data
One of the most popular data sets for multilingual NLU with human translation is the ATIS data set (Price, 1990) and its multilingual extension (Upadhyay et al., 2018). The ATIS data set was created by asking each participant to interact with an agent (who has access to a database) to solve a given air travel planning problem. Upadhyay et al. (2018) extended the English ATIS to Hindi and Turkish by manually translating and annotating a subset of the training and test data via crowdsourcing. To facilitate cross-lingual transfer across a broader spectrum of language distances, we create the MultiATIS++ corpus by extending both the training and test sets of the English ATIS corpus to six additional languages. The resulting corpus covers nine languages in four different language families: Indo-European (English, Spanish, German, French, Portuguese, and Hindi), Sino-Tibetan (Chinese), Japonic (Japanese), and Altaic (Turkish).
For each of these languages, we hire professional native translators to translate the utterances and annotate the slots at the same time. When translating, the translators are required to preserve the meaning and structure of the original English sentences as much as possible. For example, repetitions such as "a flight before before 6 pm" are mimicked in the target language if possible. Slots spanning multiple tokens are marked using the BIO tagging scheme. We show an English training example and its translated versions in the other eight languages in Figure 1. Note that sub-token slot values are tokenized apart from the surrounding text. For example, in French, d'Atlanta is annotated as d'{Atlanta|fromloc.city_name}, and the substring d' is not part of the slot value. Therefore, we tokenize d' and generate {d'|O} {Atlanta|fromloc.city_name}.
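The clitic-splitting convention above can be sketched as a small rule. The regular expression and label strings here are illustrative assumptions for French-style clitics only; the actual annotation was produced by human translators:

```python
import re

def tokenize_clitic(token, slot):
    """Split a French-style clitic (e.g. d', l') off a slot value so that
    only the content word carries the slot label.

    A simplified sketch of the convention described in the text; the
    regex covers only d'/l' and is an assumption, not the real pipeline.
    """
    m = re.match(r"^([dl]')(.+)$", token, flags=re.IGNORECASE)
    if m:
        clitic, rest = m.groups()
        return [(clitic, "O"), (rest, slot)]
    return [(token, slot)]

# "d'Atlanta" labeled as a from-location becomes {d'|O} {Atlanta|fromloc.city_name}
print(tokenize_clitic("d'Atlanta", "B-fromloc.city_name"))
# → [("d'", 'O'), ('Atlanta', 'B-fromloc.city_name')]
```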
We report the data statistics in Table 1. Note that the Hindi and Turkish portions of the data are smaller than the other languages, covering only a subset of the intent and slot types.

Joint Intent Detection and Slot Filling
Following Liu and Lane (2016), we model intent detection and slot filling jointly. Following Devlin et al. (2019), we add a special classification token x_0 in front of the input sequence x = (x_1, x_2, ..., x_T) of length T. Next, an encoder Θ_enc is used to produce a sequence of contextualized representations given the input sequence:

h_{0...T} = Θ_enc(x_{0...T})

For intent detection, we take the representation h_0 corresponding to x_0 as the sequence representation and apply a linear transformation and a softmax function to predict the intent probability:

p(y^I | x) = softmax(W^I h_0 + b^I)

For slot filling, we compute the probability of each slot using the representations h_{1...T}:

p(y_i | x) = softmax(W^S h_i + b^S),  i = 1, ..., T

We explore two different encoder models:

• biLSTM: We use the concatenation of the forward and backward hidden states of a bidirectional LSTM (Schuster and Paliwal, 1997) as the encoder representations. We initialize the encoder and embeddings randomly.
• Multilingual BERT: We use the multilingual BERT (Devlin et al., 2019) pre-trained in an unsupervised way on the concatenation of monolingual corpora in 104 languages. We take the hidden states of the top layer as the encoder representations and fine-tune the model on the NLU data.
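The two classification heads described above are simple linear-plus-softmax layers on top of the encoder states and can be sketched in a few lines of numpy. The weight names (W_I, W_S, ...) and shapes are illustrative assumptions, and the encoder producing h is omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_heads(h, W_I, b_I, W_S, b_S):
    """Intent and slot classification heads on top of encoder outputs.

    h: (T+1, d) encoder states, where h[0] corresponds to the
    classification token x_0. Returns intent probabilities (n_intents,)
    and per-word slot probabilities (T, n_slots). A minimal sketch;
    the encoder itself (biLSTM or multilingual BERT) is not shown.
    """
    p_intent = softmax(h[0] @ W_I + b_I)   # p(y^I | x) from h_0
    p_slots = softmax(h[1:] @ W_S + b_S)   # p(y_i | x) from h_1..h_T
    return p_intent, p_slots
```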

Problems in Label Projection for Cross-Lingual NLU
Past work has shown the effectiveness of using MT systems to boost the performance of cross-lingual NLU (Schuster et al., 2019). More specifically, one first translates the English training data into the target language using an MT system, and then projects the slot labels from English to the target language. Prior work projects the slot labels using word alignments from statistical MT models or attention weights from neural MT models (Yarowsky et al., 2001; Schuster et al., 2019). The slot projection is done as a preprocessing step to prepare the training data for the downstream task. Despite their empirical success, these label projection methods suffer from a major drawback: the projections are produced independently of the slot labels and the downstream task, and are potentially erroneous. Jain et al. (2019) show that improving the quality of projection leads to significant improvements in the final performance on cross-lingual named entity recognition. However, the improvements come at the cost of a much more complex and expensive projection process using engineered features. In addition, they incorrectly assume that each word in the target translation can be hard-aligned to a single word in the English sentence, disregarding the morphological differences among languages.
To address these issues, we propose to perform end-to-end slot alignment and recognition using an attention module (Figure 2), so that no external slot projection is needed. Furthermore, we show that our soft slot alignment can be strengthened by building it on top of strong encoder representations from multilingual BERT.

Soft-Alignment via Attention
Given a source (English) utterance s_{1...S} of length S and its translation t_{1...T} of length T in the target language, the model learns to predict the target slot labels and soft-align them with the source labels via attention. First, it embeds the source utterance into a sequence of word embeddings e^(src)_{1...S} and encodes the translated utterance into contextualized representations:

h^(tgt)_{0...T} = Θ_enc(t_{0...T})

where Θ_enc is the encoder. For intent classification, we assume that the translated utterance has the same intent as the source utterance. Thus we compute the intent probabilities using the representation h^(tgt)_0. For slot filling, we introduce an attention module to connect the source slot labels y^(src)_{1...S} with the target sequence t_{1...T}. First, we compute the hidden state at each source position as a weighted average of the target representations:

z_i = Σ_{j=1}^{T} a_ij h^(tgt)_j

where z_i is the hidden state at position i, and a_ij is the attention weight between the source word s_i and the translated word t_j. To compute the weights a_ij, we first linearly project the query vector e^(src)_i and the key vector h^(tgt)_j with learnable parameters to d dimensions. We then perform scaled dot-product attention on the projected query and key vectors:

a_ij = softmax_j( (W_Q e^(src)_i) · (W_K h^(tgt)_j) / (τ √d) )

where the projections W_Q and W_K are parameter matrices, and τ is a hyperparameter that controls the temperature of the softmax function.
Next, we compute the slot probabilities at source position i using the hidden state z_i:

p(y^(src)_i | s, t) = softmax(W^S z_i + b^S)

and the slot filling loss given the slot labels y^(src)_{1...S}:

L_slot = − Σ_{i=1}^{S} log p(y^(src)_i | s, t)

In addition, to encourage the attention module to better align the source and target utterances, we add a reconstruction module consisting of a position-wise feed-forward layer and a linear output layer to recover the source utterance from the attention outputs. We compute the probability distribution over the source vocabulary at position i as

p(s_i | t) = softmax(W^R FFN(z_i) + b^R)

and the reconstruction loss as

L_rec = − Σ_{i=1}^{S} log p(s_i | t)

The final training loss is L = L_intent + L_slot + L_rec, where L_intent = −log p(y^I | t) is the intent classification loss. Empirically, we find it beneficial to train the model jointly on the machine-translated data with the objective L and on the source language data with the supervised objective.
Note that the attention and reconstruction modules are only used during training. During inference, we directly feed the encoder representations h^(tgt)_{0...T} of the target language utterance to the intent and slot classification layers.
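For concreteness, the attention step above can be sketched in numpy. The placement of the temperature τ alongside the √d scaling follows the equations in this section, while the tensor shapes and matrix names are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_align(e_src, h_tgt, W_Q, W_K, tau=0.1):
    """Temperature-scaled dot-product attention from source to target.

    e_src: (S, d_e) source word embeddings (queries).
    h_tgt: (T, d_h) target encoder representations (keys and values).
    Returns z: (S, d_h) per-source hidden states z_i, and a: (S, T)
    attention weights a_ij. A sketch of the equations in this section;
    the trained parameters W_Q, W_K are passed in as plain arrays.
    """
    q = e_src @ W_Q                        # project queries to d dims
    k = h_tgt @ W_K                        # project keys to d dims
    d = q.shape[-1]
    scores = q @ k.T / (tau * np.sqrt(d))  # scaled by temperature tau and sqrt(d)
    a = softmax(scores, axis=-1)           # a_ij: distribution over target positions j
    z = a @ h_tgt                          # z_i = sum_j a_ij * h_j^(tgt)
    return z, a
```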

Cross-Lingual Transfer
In our first set of experiments, we explore using the pretrained multilingual BERT encoder with different training strategies to leverage the full training data for cross-lingual transfer:

• Target only: Using only the training data in the target language.
• Cross-lingual: Joint training on the concatenation of training data in all languages.
Setup  We train the models using the Adam optimizer (Kingma and Ba, 2015) for 20 epochs and select the model that performs best on the development set. We set the initial learning rate to 1e-3 for the LSTM model and 1e-5 for the BERT model.

Results  Table 2 shows the results using the full supervised data, averaged over 5 runs. The multilingual BERT encoder brings significant improvements of 1-6% on intent accuracy and 1-11% on slot F1. The largest improvements are on the two low-resource languages, Hindi and Turkish. In addition, cross-lingual training with biLSTM on all the languages brings significant improvements on Hindi and Turkish over the target-only models: it improves intent accuracy by 2-5% and slot F1 by 4-11%. Further, BERT boosts the performance of cross-lingual training by 1-7% on intent accuracy and 1-9% on slot F1.
Comparison with SOTA  On English ATIS, Qin et al. (2019) report 97.5% intent accuracy and 96.1% slot F1 using BERT with their proposed stack-propagation architecture. This is comparable to our target-only BERT scores in Table 2.

Zero-Shot Transfer Learning and Learning Curves
In this section, we compare the following methods for cross-lingual transfer where only a small amount of annotated data, or even no data (zero-shot), is available for the target language. All mentions of statistical significance are based on a paired Student's t-test with p < 0.05.
• No MT: Training the models only on the English training data without machine translating them to the target language.
• MT+project: Using AWS Translate to translate the English data into the target language and word alignments from fast-align for label projection; joint training on the English and machine-translated data.
• MT+soft-align: Using AWS Translate to translate the English data into the target language and our soft-alignment method described in the Soft-Alignment via Attention section; joint training on the English and machine-translated data.
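The label projection step of the MT+project baseline can be sketched as follows. This is a deliberate simplification: it assumes each aligned target word simply inherits the label of its source word (real pipelines may also repair broken BIO spans), and the function name and alignment format are illustrative assumptions modeled on fast-align's i-j output pairs:

```python
def project_slots(src_labels, alignment, tgt_len):
    """Project BIO slot labels from source to target via word alignments.

    src_labels: BIO labels for the source (English) words.
    alignment: list of (src_idx, tgt_idx) pairs, as produced by a word
    aligner such as fast-align. Unaligned target words receive "O".
    A simplified sketch of the MT+project preprocessing step.
    """
    tgt_labels = ["O"] * tgt_len
    for i, j in alignment:
        tgt_labels[j] = src_labels[i]
    return tgt_labels

# "flights to Boston" -> "vols vers Boston" with alignment 0-0 1-1 2-2
print(project_slots(["O", "O", "B-toloc.city_name"], [(0, 0), (1, 1), (2, 2)], 3))
# → ['O', 'O', 'B-toloc.city_name']
```

Because the projection is fixed before training, any alignment error becomes label noise in the downstream data, which is exactly the drawback the soft-alignment method avoids.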
We adopt the same setup as the previous section, except that we select the model at the last epoch, as we assume no access to a development set in the target language in this setting. We set the temperature τ of the attention module to 0.1.

Table 3: Zero-shot results on MultiATIS++ averaged over 5 runs. The No MT rows are models trained only on the English data. The MT+project rows correspond to models trained on the English and machine translated data with automatically projected slot labels using fast-align. The MT+soft-align row is the model trained on the English and machine translated data using our soft-alignment method.

As shown in Table 3, when training without MT data, BERT boosts the performance over LSTM by 14-32% on intent accuracy and 29-61% on slot F1 on all languages but Turkish, a dissimilar language to English. For both LSTM and BERT models, adding MT data with projected slot labels brings significant improvements on intent accuracy: by 13-33% for LSTM and 1-24% for BERT. However, we observe different trends on slot F1 for different languages. For example, when using BERT, adding MT data improves slot F1 by 20-44% on Chinese, Japanese, and Hindi, but hurts by around 6% on French. This may be due to its mixed effect: training directly on the target language data is beneficial, especially for languages dissimilar to English, but the noisy projection of the slot labels can also harm the model. Next, we compare our soft-alignment method with the others. First, our method is more robust across languages than the projection method: it achieves consistent improvements over the BERT models trained without translated data in terms of both metrics, while the projection method leads to a degradation on French and Turkish. Next, we compare our method with the projection method in terms of slot F1. Although we observe a small degradation on Spanish and Portuguese, our method achieves significant improvements on five of the eight languages, especially on French (+9.5%), Hindi (+9.1%), and Turkish (+38.1%).
To further analyze the slot F1 performance of MT+project in comparison to MT+soft-align, we measure the slot projection accuracy, calculated as the matching rate between token-slot pairs in the human-translated data and in the MT+project data. We find that the projection accuracy is 20% for Turkish and 57% for Hindi, while for the other languages it is around 70%. The low projection accuracy for Turkish can be attributed to the rich morphology of the language. For example, for the Turkish phrase san francisco'ya, in MT+project san is not tagged as a city name, and the whole token francisco'ya is tagged as a city name rather than the substring francisco. Even though both MT+project and MT+soft-align use WordPiece internally, MT+soft-align benefits from consistent end-to-end training to handle rich morphology.

[Figure: intent accuracy, averaged over 8 languages]
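The projection accuracy measurement above can be sketched as a pair-matching rate. The function name, the (token, slot) pair representation, and the Turkish example labels are illustrative assumptions that mirror the san francisco'ya error described in the text:

```python
def projection_accuracy(gold_pairs, projected_pairs):
    """Matching rate between (token, slot) pairs in the human-translated
    data (gold) and in the MT+project data (projected).

    A sketch of the measurement described in the text; it assumes the
    two tokenizations are directly comparable pair-by-pair.
    """
    gold_set = set(gold_pairs)
    matches = sum(1 for pair in projected_pairs if pair in gold_set)
    return matches / len(gold_pairs)

# Hypothetical gold vs. projected labels for "san francisco'ya":
# the projection misses "san" and mis-tags the whole clitic token.
gold = [("san", "B-toloc.city_name"), ("francisco'ya", "I-toloc.city_name")]
proj = [("san", "O"), ("francisco'ya", "B-toloc.city_name")]
print(projection_accuracy(gold, proj))  # → 0.0, no pair matches exactly
```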

Learning Curves
Results for cross-lingual transfer using different sizes of the target language training data are shown in Figure 3. We select French as a language similar to English, and Chinese as a dissimilar one. We find that: 1) BERT achieves remarkably good performance at all data scales when trained both with and without machine-translated data. 2) BERT brings promising results given several hundred training examples in the target language: BERT trained without MT achieves a slot F1 score comparable to the best translation method on Chinese and a higher score on French. 3) Our soft-alignment method is more robust than the projection method given a small amount of target training data. For example, on Chinese, the slot F1 score of the projection method quickly plateaus as the target data size increases, while our method leads to continuous improvements.

Conclusion
We introduce MultiATIS++, a multilingual NLU corpus that extends the Multilingual ATIS corpus to nine languages across four language families. We use the corpus to explore three different cross-lingual transfer methods: a) using the multilingual BERT encoder, b) the traditional projection method based on machine translation and slot label projection using external word alignment tools, and c) our proposed soft-alignment method, which requires no label projection as it performs soft label alignment dynamically during training via an attention mechanism. While the traditional projection method obtains comparable results on intent detection, it relies heavily on the quality of the slot label projection. Experimental results show that our method improves slot F1 over the traditional method on five of the eight languages, especially on Turkish, which is known to differ from the other languages in terms of morphology. In addition, we find that using multilingual BERT brings substantial improvements in both the cross-lingual and zero-shot setups, and that given a small amount of annotated data in the target language, multilingual BERT achieves comparable or even higher slot F1 scores than the translation methods.
We hope that our work will encourage more research on learning cross-lingual sentence representations that match the performance of translation approaches, as well as more advanced cross-lingual transfer methods that are robust to translation and label projection errors.