XL-NBT: A Cross-lingual Neural Belief Tracking Framework

Task-oriented dialog systems are becoming pervasive, and many companies rely heavily on them to complement human agents for customer service in call centers. With globalization, the need to provide cross-lingual customer support is more urgent than ever. However, cross-lingual support poses great challenges, since it requires a large amount of additional annotated data from native speakers. To bypass expensive human annotation and take a first step towards the ultimate goal of building a universal dialog system, we set out to build a cross-lingual state tracking framework. Specifically, we assume that there exists a source language with dialog belief tracking annotations, while the target languages have no annotated dialog data of any form. We pre-train a state tracker for the source language as a teacher, which is able to exploit easy-to-access parallel data. We then distill and transfer its knowledge to the student state tracker in the target languages. We discuss two common types of parallel resources, bilingual corpus and bilingual dictionary, and design different transfer learning strategies accordingly. Experimentally, we successfully use an English state tracker as the teacher to transfer its knowledge to both Italian and German trackers and achieve promising results.


Introduction
Over the past few years, we have witnessed the burgeoning of real-world applications of dialog systems, with many academic, industrial, and startup efforts racing to lead what is widely believed to be the next generation of human-machine interfaces. As a result, numerous task-oriented dialog systems such as virtual assistants and customer conversation services have been developed (Rojas-Barahona et al., 2017; Bordes and Weston, 2017; Williams et al., 2017; Li et al., 2017), with Google Duplex being the most recent example.
With the rapid process of globalization, more countries have observed growing populations of immigrants, and more companies have moved to develop their overseas business sectors. To provide better customer service and bring down the cost of labor at call centers, the development of universal dialog systems has become a practical issue. A straightforward strategy is to collect training data and train a dialog system separately for each language. However, this is both tedious and expensive. Two settings naturally arise for more efficient usage of the training data: (1) Multi-lingual setting: we annotate data for multiple languages and train a single model, with possible innovations on joint training. (2) Cross-lingual setting: we annotate data and train a model for only one (popular) language, and transfer the learned knowledge to other languages. Here we are interested in the second case, and the research question we ask is: How can we build cross-lingual dialog systems that can support less popular, low- or even zero-resource languages?
As an initial step towards cross-lingual dialog systems, we focus on the cornerstone of dialog systems: dialog state tracking (DST), or belief tracking, a key component for understanding user inputs and updating the belief state, i.e., a system's internal representation of the state of the conversation (Young et al., 2010). Based on the perceived belief state, the dialog manager can decide which action to take and what verbal response to generate (Precup and Teh, 2017; Bordes and Weston, 2017).
DST models require a considerable amount of annotated data for training (Henderson et al., 2014b). For a common dialog such as the one shown in Figure 1, a typical data acquisition process (Rojas-Barahona et al., 2017) not only requires two human users to converse for multiple turns but also requires annotators to identify the user's intention in each turn. Such two-step annotation is very expensive, especially for rare languages.
We study the novel problem of cross-lingual DST, where one leverages the annotated data of a source language to train DST for a target language with zero annotated data (Figure 1); neither conversation dialogs nor dialog state annotations are available for the target language. To deal with this challenging zero-resource scenario, we first decouple the state-of-the-art neural belief tracker framework into sub-modules, namely the utterance encoder, context gate, and slot-value decoder. By introducing a teacher-student framework, we are able to transfer knowledge across languages module by module, following the divide-and-conquer philosophy. Requiring no target-side dialog data, our method relies on other easy-to-access parallel resources to understand the connection between languages. Depending on the popularity and availability of target language resources, we study two kinds of parallel data, bilingual corpus and bilingual dictionary, and design a transfer learning strategy for each.
We use the popular Wizard-of-Oz (Rojas-Barahona et al., 2017) dataset as our DST benchmark to evaluate the effectiveness of our cross-lingual transfer learning. We specify English as the source (primary) language and two other European languages (German and Italian) as our zero-annotation target languages. Compared with an array of alternative transfer learning strategies, our cross-lingual DST models consistently achieve promising results in both scenarios for both zero-annotation languages. To ensure reproducibility, we release our code, training data, and parallel resources on GitHub. Our main contributions are three-fold:
• Towards building cross-lingual dialog systems, we are the first to study the cross-lingual dialog state tracking problem.
• We systematically study different scenarios for this problem based on the availability of parallel data and propose novel transfer learning methods to tackle the problem.

Broadly speaking, dialog belief tracking algorithms can be divided into three families: (1) hand-crafted rules, (2) generative models, and (3) maximum-entropy models (Metallinou et al., 2013). Later on, many deep learning based discriminative models emerged to replace the traditional strategies (Henderson et al., 2014a; Williams et al., 2016) and achieved state-of-the-art results on various datasets. Though the discriminative models are reported to achieve fairly high accuracy, their applications are heavily restricted by domain, ontology, and language. Recently, a pointer network based algorithm (Xu and Hu, 2018) and a multi-domain algorithm (Rastogi et al., 2017) have been proposed to break the ontology and domain boundaries. In addition, Mrkšić et al. (2017) proposed an algorithm to train a unified framework that deals with multiple languages given annotated datasets. In contrast, our paper focuses on breaking the language boundary and transferring DST knowledge from one language into other zero-annotation languages.

Cross-Lingual Transfer Learning
Cross-lingual transfer learning has been a very popular topic in recent years and can be seen as a transductive process. In such a process, the input domains of the source and target are different (Pan and Yang, 2010), since each language has its own distinct lexicon. By discovering the underlying connections between the source and target domains, we can design transfer algorithms for different tasks. Recently, algorithms have been successfully designed for POS tagging (Zhang et al., 2016), NER (Pan et al., 2017; Ni et al., 2017), and image captioning (Miyazaki and Shimizu, 2016). These methods first aim at discovering the relatedness between two languages and separating language-common modules from language-specific modules, and then resort to external resources to transfer the knowledge across the language boundary. Our method addresses transfer learning with a teacher-student framework and uses the teacher to gradually guide the student towards better decisions.

Figure 2: Cross-lingual DST structure; the ontology and database are shared across languages.

The dialog states are defined as a set of search constraints (i.e., informable slots or goals) that the user specified through the dialog and a set of attribute questions regarding the search results (i.e., requestable slots or requests). The objective of dialog state tracking (DST) is to predict and track the user intention (i.e., the values of the aforementioned slots) at each time step based on the current user utterance and the entire dialog history. As shown in Figure 2, for each slot, the DST computes an output distribution over the candidate values using three inputs: (i) the system response $a_t$, which is the sentence generated by the system, (ii) the utterance $u_t$, which is the sentence from the user, and (iii) the previous state, which denotes the selected slot-value pairs. We define the ontology of the dialog system to be the set of all the possible words the dialog slots and values can take. In this paper, we are interested in learning a cross-lingual DST. Specifically, we assume that the DST for the source language has access to a human-annotated training dataset $D$, while the DSTs for the target languages have no access to annotated data in those languages, except for testing data. We mainly consider two different types of parallel resources to assist the transfer learning:
(1) Bilingual Corpus, where abundant bilingual corpora exist between the source and the target languages. This is often the case for common languages such as German, Italian, and French.
(2) Bilingual Dictionary, where public bilingual dictionaries exist between the source and the target languages, but large-scale parallel corpora are harder to obtain. This can be the case for rarer languages such as Finnish and Bulgarian.
Furthermore, we assume that all the languages share a common multi-lingual database, whose column/row names and entry values are stored in multiple languages (see the database in Figure 1). That is, the ontology of the dialog is known across languages, with a one-to-one mapping between them (e.g., greek=griechisch=greco, food=essen=cibo). Based on that, we can construct a mapping function $M$ that associates the ontology terms from different languages with pre-designed language-agnostic concepts.
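The ontology mapping can be illustrated with a minimal Python sketch. The surface forms (greek/griechisch/greco, food/essen/cibo) come from the text; the concept identifiers `SLOT_FOOD` and `VALUE_GREEK`, the `ONTOLOGY` table, and the helper `translate_term` are hypothetical names introduced only for illustration.

```python
# Hypothetical language-agnostic concept IDs; only the surface forms
# (greek/griechisch/greco, food/essen/cibo) are taken from the text.
ONTOLOGY = {
    "en": {"food": "SLOT_FOOD", "greek": "VALUE_GREEK"},
    "de": {"essen": "SLOT_FOOD", "griechisch": "VALUE_GREEK"},
    "it": {"cibo": "SLOT_FOOD", "greco": "VALUE_GREEK"},
}


def M(term, lang):
    """Map an ontology term of language `lang` to its shared concept."""
    return ONTOLOGY[lang][term.lower()]


def translate_term(term, src, tgt):
    """Return the target-language term that shares the source term's concept."""
    concept = M(term, src)
    for candidate, candidate_concept in ONTOLOGY[tgt].items():
        if candidate_concept == concept:
            return candidate
    raise KeyError(f"no {tgt} term for concept {concept}")
```

Because the mapping is one-to-one, a lookup of this kind is enough to carry slot-value pairs and system acts from the source language to a target language without any translation model.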

Decoupled Neural Belief Tracker
We design our cross-lingual DST on top of the state-of-the-art Neural Belief Tracker (NBT), which demonstrates many advantages (no hand-crafted lexicons, no linguistic knowledge required, etc.). These properties are essential for our cross-lingual DST design because we are pursuing a general and simple framework regardless of the language properties. In short, NBT consists of a neural network that computes the matching score for every candidate slot-value pair $(c_s, c_v)$ based on the following three inputs: (i) the system dialog acts $a_t = (t_q, t_s, t_v)$, (ii) the user utterance $u_t$, and (iii) the candidate slot-value pair. It identifies the user intents by evaluating the scores for all the slot-value pairs (see Figure 3). With a slight abuse of notation, we still use $c_s, c_v, t_s, t_v, t_q \in \mathbb{R}^H$ to denote the vector representations of themselves, where $H$ is the embedding dimension. We use pre-trained embedding vectors in our cross-lingual NBT, just like the original NBT, and they are fixed during training. To enable cross-lingual transfer learning, we first re-interpret the architecture of the original NBT by decomposing it into three components:

Utterance Encoding. The first component is an utterance encoder, which maps the utterance $u_t = \{w_1, w_2, \cdots, w_N\}$ of a particular language into a semantic representation vector $r(u_t) \in \mathbb{R}^H$, where $w_i \in \mathbb{R}^H$ is the word vector for the $i$-th token and $N$ is the length of the utterance. Note that the dimension of the semantic vector $r(u_t)$ is the same as that of the word vector. We implement the encoder using the same convolutional neural network (CNN) as the original NBT, with the slight modification of adding a batch normalization layer on top. We will explain this change in Section 5.
Context Gate. The second part is the context gate, which takes the system acts $a_t = (t_q, t_s, t_v)$ and the candidate slot-value pair $(c_s, c_v)$ as its inputs and filters out the desired information from the encoded utterance. The context gate $g$ is a sum of three separate gates.

Slot-Value Decoding. The final component is a slot-value decoder, which predicts the score $y$ of a given slot-value pair using the filtered information from the utterance representation $r$, as given in (3), where $W_y \in \mathbb{R}^{H \times 1}$ is the weight vector. This expression computes the score for the slot-value pair based on the information from the current turn. We combine it with the information from previous turns to get the final score:

$$y(c_v \mid u_t, a_t, c_s) = \lambda\, y(c_s, c_v, u_t, a_t) + (1 - \lambda)\, \hat{y}(c_s, c_v, u_{t-1}, a_{t-1})$$

where λ is a combination weight. For each given slot $c_s$, NBT selects the single highest-scoring value for informable slots and selects all values above a certain threshold for request slots. Here we replace the multi-layer perceptron in the original NBT with a linear output layer (to be explained in Section 5).
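The score combination and the selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the NBT implementation: the function names, the default values of `lam` and `threshold`, and the choice to treat a missing previous score as 0 are all assumptions.

```python
def combine_scores(current, previous, lam=0.5):
    """Blend current-turn scores with the previous turn's final scores.

    `current` and `previous` map candidate values to scores; a value with
    no previous score is treated as 0 (an assumption of this sketch).
    """
    return {
        value: lam * score + (1.0 - lam) * previous.get(value, 0.0)
        for value, score in current.items()
    }


def select_informable(scores):
    """Informable slot: keep the single highest-scoring value."""
    return max(scores, key=scores.get)


def select_requests(scores, threshold=0.5):
    """Request slots: keep every value whose score exceeds the threshold."""
    return sorted(value for value, score in scores.items() if score > threshold)
```

For example, `combine_scores({"greek": 0.9, "thai": 0.2}, {"greek": 0.1, "thai": 0.8}, lam=0.75)` weights the current turn three times as heavily as the history, so `select_informable` picks `"greek"`.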

Cross-lingual Neural Belief Tracker
In this section, we develop a cross-lingual Neural Belief Tracker (XL-NBT) that distills knowledge from one NBT to another using a teacher-student framework. We assume the ontology mapping $M$ is known a priori (see Figure 3). XL-NBT uses a language-specific utterance encoder and context gate for each input language while sharing a common (language-agnostic) slot-value decoder across different languages (see Figure 3).
The key idea is to optimize the language-specific components of the student network (NBT of the target language) so that their outputs are language-agnostic. This is achieved by making these outputs close to those of the teacher network (NBT of the source language), as we detail below.

Teacher-Student Framework
We are given a well-trained NBT for a source language $e$, and we want to learn an NBT for a target language $f$ without any annotated training data. Therefore, we cannot learn the target-side NBT with standard supervised learning. Instead, we use a teacher-student framework to distill the knowledge from the source-side NBT (teacher network) into the target-side NBT (student network) (see Figure 4). Let $x^e = (c_s^e, c_v^e, u_t^e, a_t^e)$ be the input to the teacher network and let $x^f = (c_s^f, c_v^f, u_t^f, a_t^f)$ be the associated input to the student network. The standard teacher-student framework trains the student network by minimizing the cost (5), where $y(c_s^e, c_v^e, u_t^e, a_t^e)$ and $y(c_s^f, c_v^f, u_t^f, a_t^f)$ denote the scores of the teacher and the student networks, respectively, and the slot-value pairs satisfy $M(c_v^f) = M(c_v^e)$ and $M(c_s^f) = M(c_s^e)$. However, the target-side inputs $(c_s^f, c_v^f, u_t^f, a_t^f)$ parallel to $(c_s^e, c_v^e, u_t^e, a_t^e)$ are usually not available in cross-lingual DST and, even worse, the target-side utterance $u_t^f$ is not available. We therefore have to generate synthetic input data for the student network or leverage external data sources. It is relatively easy to use the mapping $M(\cdot)$ to generate $(c_s^f, c_v^f, a_t^f)$ (i.e., the inputs of the target-side context gate) from $(c_s^e, c_v^e, a_t^e)$. It is more challenging to obtain the parallel utterance $u_t^f$ from $u_t^e$, so we have to leverage external bilingual data sources to alleviate the problem. However, external bilingual data are usually not in the same domain as the utterances, and hence they are not aligned with the slot-value pairs and system acts (i.e., $(c_s^e, c_v^e, a_t^e)$ or $(c_s^f, c_v^f, a_t^f)$). For this reason, we cannot perform the knowledge transfer by minimizing the cost (5). Instead, we need to develop a new cost function in which the utterance is not required to be aligned with the slot-value pair and the system acts.
To this end, let $g^e = g^e(c_s^e, c_v^e, a_t^e)$ and $g^f = g^f(c_s^f, c_v^f, a_t^f)$. Substituting (3) into (5) yields an expression in which $r^e = r^e(u_t^e)$ and $r^f = r^f(u_t^f)$. As mentioned earlier, the weight $W_y$ in the slot-value decoder is shared between the student and the teacher networks and is not updated. The teacher-student optimization only adjusts the weights of the language-specific parts in Figure 3 (i.e., utterance encoding and context gating). Therefore, the norm $\|W_y\|$ of the shared weight is treated as a constant. Furthermore, the term in $\|g^e\|^2$ taken over $(c_v^f, c_v^e)$ can be treated as a constant, since the teacher gate is fixed. Since we use a batch normalization layer to normalize the encoder output (described in Figure 3), $\|r^f(u_t^f)\|^2$ can also be treated as a constant $C_2$. Therefore, we take the upper bound of $J_1$ as our surrogate cost function $J$ (6). The surrogate cost decouples the utterance encoder from the context gate, and we use $J_r$ and $J_g$ to measure the encoder matching cost and the gate matching cost, respectively.
The encoder cost $J_r$ is optimized to distill the knowledge from the teacher encoder to the student encoder, while the gate cost $J_g$ is optimized to distill the knowledge from the teacher gate to the student gate. Because this objective decouples the optimization of the encoder and the gate, we are able to optimize $J_r$ and $J_g$ separately, using different data sources. Recall that we can easily simulate the target-side system acts and slot-value pairs $(c_s^f, c_v^f, a_t^f)$ by using the ontology mapping $M$. Therefore, optimizing $J_g$ is relatively easy, and the gate matching cost is computed on these mapped inputs. However, exact optimization of $J_r$ is difficult, and we have to approximate it using external parallel data. We consider two kinds of external resources (bilingual corpus and bilingual dictionary) in Sections 5.2-5.3 (see Figure 5 for the main idea).

Bilingual Corpus (XL-NBT-C)
In our first scenario, we assume there exists a parallel corpus $D_p$ consisting of sentence pairs from the source language and the target language. In this case, the cost function (6) is approximated by a cost (9) in which α is the balancing factor and $J_g$ is defined in (6). The cost function (9) is minimized by stochastic gradient descent. At test time, we switch the encoder to receive target language inputs.
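The structure of this objective, an encoder matching term on parallel sentence pairs plus an α-weighted gate matching term on mapped slot-value pairs and system acts, can be sketched as follows. The exact form of the two terms is not reproduced in this text, so the sketch assumes a mean squared Euclidean distance for both; `sq_dist`, `mean`, and `surrogate_cost` are hypothetical names, and the inputs are plain lists of already computed encoder and gate outputs.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def surrogate_cost(teacher_enc, student_enc, teacher_gate, student_gate, alpha=1.0):
    """J = J_r + alpha * J_g under the squared-distance assumption.

    teacher_enc / student_enc: encoder outputs for parallel sentence pairs.
    teacher_gate / student_gate: gate outputs for ontology-mapped inputs.
    The two data sources need not be aligned with each other.
    """
    j_r = mean(sq_dist(t, s) for t, s in zip(teacher_enc, student_enc))
    j_g = mean(sq_dist(t, s) for t, s in zip(teacher_gate, student_gate))
    return j_r + alpha * j_g
```

In an actual training loop the teacher outputs would be held fixed and only the student encoder and gate would receive gradients; the sketch only shows how the two terms are combined.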

Bilingual Dictionary (XL-NBT-D)
In the second scenario, we assume there exists no parallel corpus but a bilingual dictionary $D_B$ that defines the correspondence between source words and target words (a one-to-many mapping $\{w : M_D(w)\}$). As before, it is infeasible to optimize the exact encoder cost $J_r$ due to the lack of target-side utterances. We propose a word replacement strategy (described below) to generate a synthetic parallel sentence $\hat{u}_t^f$ in a "mixed" language. We then use the generated sentences to approximate the cost (6) by a cost in which α is again the balancing factor. For word replacement, we first decide the number of words $N_w$ to be replaced, then draw $N_w$ positions randomly from the source utterance and substitute each corresponding word $w_i$ with a target word synonym from $M_D(w_i)$, chosen based on the context. Here $h_{\hat{w}} = \sum_{k=-2, k \neq 0}^{2} w_{i+k}$ represents the context vector and $N$ denotes the utterance length. The similarity between the context and each target-side synonym helps us choose the most appropriate candidate from the list. In the following experiments, we adjust the temperature τ to control the aggressiveness of replacement.
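The replacement procedure can be sketched as below. The selection formula itself is not reproduced in this text, so the sketch assumes a softmax over dot-product similarities between the context vector and each candidate embedding, scaled by the temperature; it also assumes that positions are drawn only among words that have a dictionary entry and that `embed` holds source and target words in one shared space. All function and parameter names are hypothetical.

```python
import math
import random


def context_vector(embs, i, window=2):
    """Sum the embeddings of up to `window` neighbours on each side of position i."""
    ctx = [0.0] * len(embs[i])
    for k in range(-window, window + 1):
        j = i + k
        if k != 0 and 0 <= j < len(embs):
            ctx = [c + e for c, e in zip(ctx, embs[j])]
    return ctx


def softmax(scores, tau):
    """Temperature-scaled softmax, shifted by the maximum for stability."""
    scaled = [s / tau for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def replace_words(tokens, embed, dictionary, n_replace, tau=0.1, rng=None):
    """Build a "mixed" utterance by swapping some source words for target words."""
    rng = rng or random.Random(0)
    embs = [embed[w] for w in tokens]
    # Assumption: only words with at least one dictionary entry can be replaced.
    replaceable = [i for i, w in enumerate(tokens) if dictionary.get(w)]
    positions = rng.sample(replaceable, min(n_replace, len(replaceable)))
    mixed = list(tokens)
    for i in positions:
        ctx = context_vector(embs, i)
        candidates = dictionary[tokens[i]]
        sims = [sum(c * e for c, e in zip(ctx, embed[cand])) for cand in candidates]
        probs = softmax(sims, tau)
        mixed[i] = rng.choices(candidates, weights=probs, k=1)[0]
    return mixed
```

With a low temperature the candidate most similar to the context is chosen almost deterministically, while a higher temperature spreads the choice across the candidate list.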

Dataset
The Wizard of Oz (WOZ) (Rojas-Barahona et al., 2017) dataset is used for training and evaluation. It consists of user conversations with task-oriented dialog systems designed to help users find suitable restaurants around Cambridge, UK. The corpus contains three informable (i.e., goal-tracking) slots: FOOD, AREA, and PRICE. The users can specify values for these slots in order to find the restaurants that best meet their criteria. Once the system suggests a restaurant, the users can ask about the values of up to eight requestable slots (PHONE NUMBER, ADDRESS, etc.). Multilingual WOZ 2.0 has expanded this dataset to include more dialogs and more languages. The train, validation, and test datasets for three different languages (English, German, Italian) are available online. We use English as the source language, where 600 dialogs are used for training, 200 for validation, and 400 for testing. We use German and Italian as the target languages to which we transfer knowledge from the English DST system. In the experiments, we do not have access to any training or validation dataset for German and Italian; we only have access to their testing datasets, each of which is composed of 400 dialogs.
For external resources, we use the IWSLT2014 TED Talk parallel corpus (Mauro et al., 2012) from the official website for the bilingual corpus scenario. In the IWSLT2014 parallel corpus, we only keep the sentences between 4 and 40 words, which reduces the corpus to around 150K sentence pairs. For the bilingual dictionary scenario, we use Panlex (Kamholz et al., 2014) as our data source and crawl translations for all the words appearing in the dialog datasets to build our bilingual dictionary. We investigate two kinds of pre-trained embeddings: we use GloVe (Pennington et al., 2014) as the monolingual embedding and MUSE (Conneau et al., 2017) as the bilingual embedding to see their impact on DST performance.
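The length filter on the parallel corpus can be sketched as below. Whether the 4 to 40 word limit applies to one side or to both sides of a pair is not stated, so applying it to both is an assumption here, as are the function names and whitespace tokenization.

```python
def within_length(sentence, lo=4, hi=40):
    """True if the whitespace-tokenized sentence has between lo and hi words."""
    return lo <= len(sentence.split()) <= hi


def filter_parallel(pairs, lo=4, hi=40):
    """Keep only the sentence pairs whose two sides both pass the length check."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if within_length(src, lo, hi) and within_length(tgt, lo, hi)
    ]
```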
We split the raw DST corpus into turn-level examples. During training, we use the ground truth previous state $V_{t-1}$ as input. At test time, we use the states predicted by the model as the previous state to continue tracking the intention until the end of the dialog. When the dialog terminates, we use the two evaluation metrics introduced in Henderson et al. (2014a) to evaluate the DST performance: (1) Goals: the proportion of dialog turns where all of the user's search goal constraints were correctly identified. (2) Requests: similarly, the proportion of dialog turns where the user's requests for information were identified correctly. Our implementation is based on the NBT; the details of our system setting are described in the appendix.
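Both metrics are turn-level exact-match accuracies, which a short Python sketch makes concrete. The data layout (one dict of slot-value constraints per turn for goals, one set of requested slots per turn for requests) and the function name are assumptions made for illustration.

```python
def turn_accuracy(predicted, gold):
    """Proportion of turns whose predicted state exactly matches the gold state.

    For Goals, each turn is a dict of slot -> value constraints; for Requests,
    each turn is a set of requested slots. A turn counts as correct only if
    the whole structure matches.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

A turn with one wrong constraint out of three therefore counts as fully incorrect for Goals, which makes this metric considerably stricter than per-slot accuracy.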

Results
We compare our cross-lingual algorithm against the following baselines:
(1) Supervised: this baseline assumes the existence of annotated dialog belief tracking datasets in the target language, and it determines the upper bound of the DST model.
(2) w/o Transfer: this algorithm trains an English NBT and then, at test time, directly feeds the target language into the embedding layer as input to evaluate the performance.
(3) Ontology-match: this algorithm uses exact string matching against the utterance to discover the perceived slot-value pairs, and directly assigns a high score to the candidates that appear.
(4) Translation-based: this system pre-trains a translator on the external bilingual corpus and then translates the English dialog and ontology into the target language as "annotated" data, which is used to train the NBT in the target language domain (more details about the implementation, performance, and examples are listed in the appendix).
(5) Word-By-Word (WBW): this system transforms the English dialog corpus into the target language word by word using the bilingual dictionary, and the result is used to train the NBT on the target side.

We report the results for our proposed algorithms and the competing algorithms in Table 2, from which we can conclude that (i) our Decoupled NBT does not affect the performance, and (ii) our cross-lingual NBT framework achieves significantly better accuracy for both languages in both parallel-resource scenarios.
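The Ontology-match baseline (3) is simple enough to sketch directly. The version below is one possible reading, assuming lowercase word-level matching and an ontology given as a dict from slot names to candidate value strings; the function name and this data layout are hypothetical.

```python
import re


def ontology_match(utterance, ontology):
    """Return, per slot, the candidate values that occur verbatim in the utterance."""
    # Lowercase and strip punctuation, then pad with spaces so that only
    # whole words (or whole multi-word values) can match.
    text = " " + " ".join(re.findall(r"\w+", utterance.lower())) + " "
    found = {}
    for slot, values in ontology.items():
        hits = [v for v in values if f" {v.lower()} " in text]
        if hits:
            found[slot] = hits
    return found
```

Exact matching finds `greco` in "Vorrei cibo greco, grazie." but misses an inflected form such as `griechisches` when the ontology only lists `griechisch`, which illustrates why such a baseline is brittle for morphologically rich languages.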
Comparison with Translator/WBW. With a bilingual corpus, XL-NBT-C with pre-trained bilingual embeddings significantly outperforms our Translator baseline (Klein et al., 2017). This is intuitive because the translation model requires both source-side encoding and target-side word-by-word decoding, while our XL-NBT only needs a bilingual source encoding to align the two vector spaces, which avoids compounded decoding errors. With the bilingual dictionary, the word-by-word translator is very weak and produces many broken target sentences, which poses challenges for DST training. In comparison, our XL-NBT-D can control the replacement by adjusting its temperature to maintain the stability of the utterance representation. Furthermore, in both cases, our teacher-student framework can make use of the knowledge learned in the source-side NBT to assist its decision making, while translator-based methods learn from scratch.

Bilingual Corpus vs. Bilingual Dictionary. From the table, we can observe that the bilingual corpus is a more informative parallel resource for cross-lingual transfer learning: the accuracy of XL-NBT-D is lower than that of XL-NBT-C. We conjecture that our replacement strategy for generating "mixed" language utterances can sometimes break semantic coherence and introduce additional noise during the transfer process, which noticeably degrades the DST performance.
Monolingual vs. Bilingual embedding. From the table, we can observe that bilingual and monolingual embeddings do not make much difference in supervised training. However, the gap in the bilingual corpus case is quite obvious, and the monolingual embedding even causes the transfer to fail in the bilingual dictionary case. We conjecture that the bilingual word embedding already contains much alignment information between the two languages, which largely eases the training of the encoder matching objective.
German vs. Italian. As can be seen, the transfer learning results for Italian are remarkably higher than those for German, especially for the "Goal" accuracy. We conjecture that this is due to German declension, which can produce many word forms. These very diverse word forms make it hard for the DST to understand the underlying intention. For the bilingual dictionary in particular, German tends to have much longer replacement candidate lists than Italian, which introduces more noise into the replacement procedure.
Error Analysis. Here we showcase the most frequent error types in subsection 6.1. From our observation, these three types of errors are evenly distributed in the test dialogs. The errors mainly come from the unaligned utterance space, which leads to a failure to understand the intention of the human utterance in the target language. This can cause the system to fail to modify the dialog state or to maintain the previous dialog states.

Discussion
Here we further highlight the comparison between our transfer learning algorithm and the MT-based approach. Though our approach outperforms the standard Translator trained on IWSLT-2014, this does not imply that our transfer algorithm outperforms any translation method on any parallel corpus. In further ablation studies, we found that using Google Translator can actually achieve a better score than our transfer algorithm, which is understandable considering the complexity of Google Translator and the much larger parallel corpus it leverages. By leveraging a more close-to-domain corpus and a comprehensive entity recognition/replacement strategy, a translator model is able to achieve a higher score. There is thus a trade-off between efficiency and accuracy. For the DST problem, introducing a more complex translation algorithm is overkill; what we pursue is a simple yet efficient algorithm that achieves promising scores. It is also worth mentioning that our XL-NBT algorithm takes only several hours to reach the reported score, while the translator model takes much more time and memory to train, depending on its complexity. Thus, its simplicity and efficiency make our model a better fit for rare-language and limited-budget scenarios.

Ablation Test
Here we investigate the effect of the hyper-parameters α and τ on the evaluation results. α balances the optimization of the encoder constraint and the gate constraint, where a larger α puts more weight on the gate constraint. The temperature τ controls the aggressiveness of the replacement in XL-NBT-D, where a smaller τ means more source words are replaced by target synonyms. From the table, the experimental results are not very sensitive to α: even a dramatic change of α does not harm the final results much, so we simply choose α = 1. In contrast, the system is more sensitive to the temperature. Too conservative a replacement leads to weak transfer, while too aggressive a replacement destroys the utterance representation. Therefore, we choose a moderate temperature of τ = 0.1 throughout our experiments. We also plot the learning curves (Precision vs. Iteration) in the Appendix for both XL-NBT-C and XL-NBT-D. The learning curves show that our algorithm is stable and converges quickly, and the reported results are highly reproducible.

Conclusion
In this paper, we propose a novel teacher-student framework to perform cross-lingual transfer learning for DST. The key idea of our model is to decouple the current DST neural network into two separate modules and transfer them separately. We believe our method can be further extended into a general-purpose multi-lingual transfer framework to resolve other NLP matching or classification problems.