To What Degree Can Language Borders Be Blurred In BERT-based Multilingual Spoken Language Understanding?

This paper addresses the question of to what degree a BERT-based multilingual Spoken Language Understanding (SLU) model can transfer knowledge across languages. Through experiments we show that, although cross-language transfer works remarkably well even across distant language groups, there is still a gap to the ideal multilingual performance. In addition, we propose a novel BERT-based adversarial model architecture that learns language-shared and language-specific representations for multilingual SLU. Our experimental results show that the proposed model is capable of narrowing the gap to the ideal multilingual performance.


Introduction
Recently, modern voice-controlled devices with virtual assistants like Alexa, Siri, and Google Assistant have been working their way into every activity of our daily life, from playing a song to driving a car. Spoken Language Understanding (SLU) models, which interpret the semantic meaning conveyed by a user's spoken utterance, are the AI brains of these devices. The task of SLU is usually divided into two sub-tasks, namely intent classification (IC), which identifies the intent of a user's utterance, and slot filling (SF), which extracts semantic constituents from the utterance. Consider the annotated utterance "where is MCO" from the ATIS dataset (Price, 1990): the SF sub-task should classify "where" and "is" as O (i.e. Other) and "MCO" as B-airport_code, while the IC sub-task should identify city as the intent. Current state-of-the-art SLU models are mostly DNN-based joint models of the two sub-tasks (Liu and Lane, 2016; Do and Gaspers, 2019b; Chen et al., 2019a). These models usually contain two individual decoders to detect slot and intent labels on top of a shared encoder. As in many other NLP fields, BERT-based architectures have shown a strong ability to capture contextual information and improve SLU performance (Chen et al., 2019a). For common SLU use cases, the BERT encoder is pre-trained on a huge amount of unlabeled text and then fine-tuned on SLU training data.
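As an illustration, the two sub-tasks on the ATIS example above can be expressed as a tiny annotated data structure. This is a plain-Python sketch; the `annotate` helper is hypothetical and not part of any SLU toolkit:

```python
# Hypothetical helper pairing each token with its BIO slot tag (SF output)
# and attaching an utterance-level intent label (IC output).
def annotate(tokens, slot_tags, intent):
    assert len(tokens) == len(slot_tags)
    return {"slots": list(zip(tokens, slot_tags)), "intent": intent}

# The ATIS example from the text: "where" and "is" are O, "MCO" is the
# beginning of an airport-code slot, and the intent is "city".
example = annotate(
    tokens=["where", "is", "MCO"],
    slot_tags=["O", "O", "B-airport_code"],
    intent="city",
)
```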
Since the voice-controlled device market has been expanding at an incredible rate all over the world, there is a rising need for fast language expansion of SLU models. Multilingual technology, which allows the development of a single model for multiple languages and the transfer of knowledge from a data resource to languages other than its own, is currently one of the best solutions to meet this need. Traditionally, an SLU model is trained on supervised monolingual data, with the consequence that a virtual assistant can often use only one language within a working session. In contrast, a multilingual SLU model trained on supervised multilingual data can provide the virtual assistant with the ability to talk in multiple languages within a working session. Moreover, from a development perspective, multilingual modeling helps not only to reduce the number of models to build, but also to reduce supervised data needs by transferring knowledge across languages. However, as languages differ from each other in various aspects, from lexicon to syntax, the cross-language knowledge transfer of current multilingual techniques may still be limited. This raises the questions of: i) what an ideal cross-language knowledge transfer would be; ii) what a naive cross-language knowledge transfer would be; and iii) to what degree a multilingual model can transfer knowledge across languages.
As one of the most successful multilingual techniques recently, multilingual BERT (mBERT) (Devlin et al., 2018) is being used more and more commonly in natural language processing models in both zero-shot 1 and multilingual 2 scenarios (Wu and Dredze, 2019). Since mBERT is trained without any cross-lingual objectives and does not leverage aligned data, its strong cross-lingual abilities are surprising and have, in turn, spurred research aiming to understand why it is able to achieve them (Pires et al., 2019). In this paper, however, we focus on addressing the question of to what degree it can blur the language borders in the multilingual SLU setting, by comparing its ability to transfer knowledge across languages with the ideal and naive cases. In addition, as mBERT is pre-trained for general purposes, we are also interested in how to improve its multilinguality for the particular use case of multilingual SLU.
Our contribution in this paper is two-fold: First, by performing a wide range of experiments with bilingual and trilingual SLU models, we show that although mBERT is remarkably effective at blurring language borders even on distant language groups, there is still a gap to the ideal multilingual performance. Second, we propose a novel adversarial model architecture to learn language-shared and language-specific representations on top of mBERT representations when fine-tuning on SLU data. The experimental results show that our proposed approach can narrow the gap to the ideal multilingual performance.

Related Work
As one of the most successful model architectures in natural language understanding recently, BERT models have been explored in various studies for SLU (Chen et al., 2019a; Gaspers et al., 2020). In these works, SLU is considered a downstream task of a pre-trained BERT model. In particular, after being pre-trained on a large amount of unlabelled data, the BERT embeddings and encoders are fine-tuned on supervised SLU data together with two SLU-adapted decoders for the IC and SF sub-tasks.
Due to the fast expansion of the voice-controlled device market, there has been a rising interest in cross-lingual and multilingual SLU modeling. In the former direction, supervised data from one or multiple source languages is leveraged to improve SLU performance on a target language (Johnson et al., 2019; Do and Gaspers, 2019a). In the latter direction, SLU models are trained on supervised multilingual data for multiple target languages. To obtain a multilingual model, prior work simply initialized the embeddings and encoder of a BERT-based SLU model with mBERT's parameters, and then trained the full model on a mix of supervised datasets from multiple languages. Noticeably, the SLU model used in that work was designed towards cross-lingual scenarios, using a soft-alignment method to improve slot projection between source and target languages. In this paper, we focus only on multilingual SLU modeling. Unlike that work, our adversarial model is designed specifically for multilingual model building. The impressive abilities of mBERT in cross-lingual and multilingual natural language understanding applications have recently led to an increasing body of research analyzing how it achieves them. For instance, Pires et al. (2019) probed the cross-linguality of mBERT using zero-shot transfer learning on morphological and syntactic tasks and found that mBERT is able to create multilingual representations. However, to the best of our knowledge, probing multilingual representation learning for downstream tasks like SLU, and addressing the question of how far it is from the ideal expectation, have not yet been explored in the literature.
Adversarial approaches aiming to account for linguistic differences across languages by dividing the model into language-shared and language-specific representations have been explored for the SLU sub-tasks. Recently, He et al. (2020) investigated the sub-tasks in isolation using BiLSTMs and focused on improving SLU for low-resource languages. Meanwhile, Chen et al. (2019b) explored BiLSTMs to improve named entity recognition, which is close to the SF sub-task of SLU. In this work, instead of isolating the two sub-tasks, we propose a joint model based on mBERT for multilingual use cases.

BERT-based multilingual SLU models
In this section, we describe the two multilingual SLU models evaluated in this paper.

SLU as a downstream task of BERT

Following Gaspers et al. (2020), our model consists of a BERT encoder receiving sub-word and position embedding layers as inputs, an IC decoder to identify intent labels, and an SF decoder to predict slot labels. In particular, given an utterance u, the final hidden states of the tokens h_1 . . . h_T produced by the encoder are used as the token representations. The sentence representation, denoted by s_u, is computed from h_1 . . . h_T by max-pooling.

s_u is passed through the IC decoder, which is composed of a feed-forward network of 2 dense layers (FFN_I) and a linear output layer, to predict the intent. h_1 . . . h_T are passed through the SF decoder, which is composed of a position-wise feed-forward network of 2 dense layers (FFN_S) and a CRF layer on top (Zhou and Xu, 2015), to predict slot labels. The CRF layer takes the FFN_S outputs ĥ_t as inputs, and estimates a transition matrix modeling the dependence among adjacent labels. The slot labels are predicted by traditional Viterbi decoding. The model is trained by optimizing the joint loss L = L_i + L_s, where L_i and L_s are the cross-entropy loss for intent identification and the CRF loss for slot classification, respectively.
As a downstream task of BERT, the embedding and encoder layers are initialized by a pre-trained BERT language model. The full model is trained on a supervised SLU dataset.
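Two pieces of this model are simple enough to sketch directly: the max-pooled sentence representation over the token states, and the joint loss L = L_i + L_s. The following NumPy sketch uses toy dimensions and is not the authors' implementation:

```python
import numpy as np

def sentence_representation(token_states):
    """Max-pooling over the time axis: token states of shape (T, d)
    are reduced to a single sentence vector s_u of shape (d,)."""
    return token_states.max(axis=0)

def joint_loss(intent_loss, slot_loss):
    """The model is trained on the sum of the IC loss L_i and SF loss L_s."""
    return intent_loss + slot_loss

T, d = 5, 8                     # 5 tokens, hidden size 8 (illustrative only)
h = np.arange(T * d, dtype=float).reshape(T, d)
s_u = sentence_representation(h)
assert s_u.shape == (d,)
```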

A simple approach for multilingual modeling
Thanks to the strong multilingual abilities of mBERT, we simply use mBERT for the model described in Sec. 3.1 to obtain a simple multilingual SLU model. In particular, the embedding and encoder layers are initialized with the parameters of a pre-trained mBERT. The full model is then trained on a mix of supervised data from multiple languages.

An adversarial approach for multilingual modeling
The simple multilingual model, as described in Sec. 3.2, learns language-shared token and sentence representations from multilingual training data via its BERT encoder. However, as languages differ from each other in various aspects, using only language-shared representations may not be enough to reach optimal performance in all of the target languages. Our hypothesis is that encouraging a model to learn both language-shared and language-specific representations will narrow the gap to the ideal multilingual SLU model, in which knowledge can be transferred freely across language borders.
Motivated by the success of adversarial approaches in natural language processing, we propose a novel BERT-based adversarial SLU model using a single BERT encoder to learn language-shared representations across all target languages, and multiple CNN encoders to learn language-specific representations for each of the target languages. The language-shared and language-specific representations are concatenated before being fed to the IC and SF decoders. The full model is trained via an adversarial training strategy. Fig. 2 shows our model architecture for a trilingual use case.
In a general use case targeting N languages l_1, . . . , l_N, the model has N + 3 encoders: First, it contains a BERT encoder, denoted by enc_bert, which can be initialized with pre-trained BERT parameters. Second, we use N 1-layer CNN encoders to learn language-specific representations for l_1, . . . , l_N, denoted by enc_{l_1}, . . . , enc_{l_N}, respectively. Third, a 1-layer CNN encoder denoted by enc_lang is used to learn features for predicting the language of the input utterance. Finally, another 1-layer CNN encoder denoted by enc_shared is used to learn shared representations across languages.
Given utterance u as input, the final hidden states of the tokens h_1 . . . h_T produced by enc_bert are used as the inputs of the CNN encoders. Each CNN encoder enc generates its own token and sentence representations, denoted h_t^enc and s_u^enc. In addition to the standard IC and SF decoders, we add two more decoders: a language predictor and a language discriminator. Each of the additional decoders consists of a feed-forward network of 1 dense layer (FFN) and a linear output layer to predict the language of u.
The language predictor receives s_u^{enc_lang}, the sentence representation computed by enc_lang, as input, and outputs a language distribution p_P. The language-specific token and sentence representations are computed from the outputs of enc_{l_1}, . . . , enc_{l_N} and combined with the shared representations from enc_shared. The combined token and sentence representations are used as inputs to the IC and SF decoders as described in Sec. 3.1.
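The exact combination equations did not survive into the text above, so the following is only a plausible sketch, not the authors' formulation: it assumes the language-specific sentence representations are mixed according to the predictor's distribution p_P and then concatenated with the shared representation, matching the concatenation described earlier:

```python
import numpy as np

def combine(specific_reps, p_lang, shared_rep):
    """specific_reps: (N, d) array, one s_u^enc per language-specific encoder.
    p_lang: (N,) language distribution p_P from the language predictor.
    shared_rep: (d,) sentence representation from enc_shared.
    Returns the concatenation fed to the IC/SF decoders (an assumption)."""
    mixed = p_lang @ specific_reps      # expectation over language encoders
    return np.concatenate([shared_rep, mixed])

N, d = 3, 4
reps = np.ones((N, d))                  # toy language-specific representations
p = np.array([0.7, 0.2, 0.1])           # toy predictor output
combined = combine(reps, p, np.zeros(d))
```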
The language discriminator receives the sentence representation from enc_shared as input and, like the language predictor, predicts the language distribution of u. However, the discriminator is trained adversarially to confuse the system about the distinction between languages, which encourages the model to learn language-shared representations. We use the CRF loss for the SF task, denoted by L_s. Meanwhile, for the IC decoder, language predictor, and language discriminator, we use cross-entropy losses denoted by L_i, L_p and L_d, respectively. The full model is trained by the adversarial training strategy shown in Alg. 1.
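The alternating schedule of Alg. 1 can be sketched in plain Python with mock loss values. The random task picking, the loss weightings, and the stopping condition follow the algorithm; the actual mini-batch sampling and gradient updates are elided:

```python
import random

def adversarial_training(losses, K, alpha_d=1.0, alpha_i=1.0,
                         alpha_s=1.0, alpha_p=1.0, beta_d=0.2, seed=0):
    """losses: dict with keys 'd', 'i', 's', 'p' (mock current loss values).
    Runs until each task has been picked for at least K epochs."""
    rng = random.Random(seed)
    epochs = {1: 0, 2: 0}
    history = []
    while epochs[1] < K or epochs[2] < K:
        task = rng.choice([1, 2])               # randomly pick a task
        if task == 1:                           # Task 1: L_1 = a_d * L_d
            objective = alpha_d * losses["d"]
        else:                                   # Task 2: L_2 = a_i*L_i + a_s*L_s + a_p*L_p - b_d*L_d
            objective = (alpha_i * losses["i"] + alpha_s * losses["s"]
                         + alpha_p * losses["p"] - beta_d * losses["d"])
        # In the real model, a mini-batch for this task would be generated
        # here and the weights updated to optimize `objective`.
        epochs[task] += 1
        history.append((task, objective))
    return history
```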

Algorithm 1: Adversarial training strategy
Input: SLU training data in multiple languages with intent, slot and language labels. α_d, α_i, α_s, α_p, β_d are model hyper-parameters.
Task 1: optimize L_1 = α_d L_d
Task 2: optimize L_2 = α_i L_i + α_s L_s + α_p L_p − β_d L_d
while number of epochs(Task 1) < K or number of epochs(Task 2) < K do
  Randomly pick a task j (j ∈ {1, 2})
  Generate a mini-batch for Task j
  Update model weights to optimize L_j
end


An ideal multilingual SLU performance vs. a naive multilingual SLU performance

To address the question of to what degree knowledge can be transferred across languages in multilingual SLU, in this section we define an ideal and a naive multilingual performance. Let us consider a multilingual SLU model M for n languages l_1, l_2, . . . , l_n. We use the following notations:

• D: a training set, in which each annotated utterance can be in one of the n target languages.
• D_l: the monolingual version of D in language l. That means D_l and D contain the same utterances, but all of the utterances in D_l are in l only.

• d_l: the subset of D containing only the utterances that are in language l.

Ideal case
Ideally, the language borders are completely blurred and knowledge can be transferred freely across different languages. Given D, for each target language l, the multilingual performance eval(M(D), l) should reach or outperform the performance of a monolingual model trained on a similar amount of knowledge, eval(M(D_l), l). For example, consider three models trained on the same amount of knowledge (represented as training data): i) a multilingual model xy which has 50% of its training data in language X and the other 50% in language Y; ii) a monolingual model x which has 100% of its training data in language X; iii) a monolingual model y which has 100% of its training data in language Y. In an ideal case, model xy should perform on languages X and Y similarly to the respective monolingual models x and y.
For the rest of this paper, we will refer to eval(M(D_l), l) as the Ideal baseline on language l given D. We also define an Ideal baseline for the average performance over the target languages as avg_ideal = (1/n) Σ_l eval(M(D_l), l) (Eq. 9). Comparing avg_M with avg_ideal will indicate how close M is to the ideal performance.
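Assuming Eq. 9 is the plain mean of the per-language Ideal baselines (the original equation is not reproduced in the text), the average can be computed as below; `eval_fn` stands in for eval(M(D_l), l):

```python
def avg_ideal(languages, eval_fn):
    """Mean of the per-language Ideal baselines eval(M(D_l), l) (Eq. 9,
    reconstructed from the surrounding definitions)."""
    return sum(eval_fn(l) for l in languages) / len(languages)

# Toy per-language Ideal baseline scores (illustrative numbers only).
ideal_scores = {"EN": 0.95, "DE": 0.93, "JA": 0.91}
```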

Naive case
Given D, in a naive multilingual case, for each target language l, the multilingual performance eval(M(D), l) cannot outperform a monolingual model trained on the subset of its training data containing the utterances which are in l, eval(M(d_l), l). That means the model performance on a language cannot be improved by adding training utterances from other languages. Let us reconsider the previous example. In a naive case, in which the language border between X and Y is completely closed, the performance of model xy on languages X and Y should be worse than that of the respective monolingual models x and y.
For the rest of this paper, we will refer to eval(M(d_l), l) as the Naive baseline on language l given D. A Naive baseline for the average performance over the target languages is defined analogously as avg_naive = (1/n) Σ_l eval(M(d_l), l) (Eq. 10). Comparing avg_M with avg_naive will indicate how far M is from being naive.
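Mirroring the Ideal case, and again assuming Eq. 10 is the plain mean of the per-language Naive baselines, a model's position between the two baselines can then be read off directly. The `gap_closed` helper below is our own illustrative addition, not a metric from the paper:

```python
def avg_naive(languages, eval_fn):
    """Mean of the per-language Naive baselines eval(M(d_l), l) (Eq. 10,
    reconstructed from the surrounding definitions)."""
    return sum(eval_fn(l) for l in languages) / len(languages)

def gap_closed(avg_model, naive, ideal):
    """Hypothetical helper: fraction of the Naive-to-Ideal gap closed
    by a multilingual model (1.0 means ideal, 0.0 means naive)."""
    return (avg_model - naive) / (ideal - naive)
```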

Datasets
We evaluate our model on a resampled version of the publicly available MultiATIS++ dataset, which contains parallel SLU data from several languages and thus allows evaluating performance in relation to the ideal and naive baselines. In addition, we evaluate our model on real-world SLU data to explore the impact of our proposed adversarial model architecture in a real-world scenario.

Publicly available data
The ATIS ("Air Travel Information Service") dataset (Price, 1990) is one of the best-known datasets for evaluating SLU models. It was created by having participants solve given air travel planning scenarios; the resulting queries were then manually transcribed and annotated. MultiATIS (Upadhyay et al., 2018) extended the original dataset with two additional languages (Turkish and Hindi), and recently six further languages (Spanish, German, Chinese, Japanese, Portuguese, French) were added, yielding MultiATIS++. For our experiments, we focus on the four languages for which we have both real-world and MultiATIS++ data available, i.e. English (EN), German (DE), Spanish (ES) and Japanese (JA). In MultiATIS++, the data for the additional languages were created by manually translating the original English utterances. During translation, several of the slot values, such as city names, were simply kept, yielding e.g. German requests about American airports. We consider keeping a large number of slot values identical across languages unrealistic, as in real-world applications slot value usage typically differs across locales. In addition, it makes the evaluation of multilingual models unrealistically easy, as a model may show strong performance simply by remembering slot values from the source language. To make the data more realistic, we resampled the slot values for the city name slot, which is the most frequent one in the data. In particular, we replaced the values of the city name slot using the cities with the biggest airports in each region, i.e. the top 30 European airport cities for DE, the top 24 Latin-American airport cities for ES and the top 28 Japanese airport cities for JA.
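The resampling step can be sketched as follows. The city lists here are short hypothetical stand-ins for the top-airport-city lists described above, and the sketch handles only single-token city slots for brevity:

```python
import random

# Hypothetical locale-specific city pools (stand-ins for the real lists
# of top European / Latin-American / Japanese airport cities).
CITY_POOLS = {
    "DE": ["Frankfurt", "Munich", "Berlin"],
    "ES": ["Mexico City", "Bogota", "Lima"],
    "JA": ["Tokyo", "Osaka", "Sapporo"],
}

def resample_city_slots(tokens, slot_tags, lang, rng):
    """Replace each token tagged B-city_name with a city drawn from the
    locale-appropriate pool, leaving all other tokens unchanged."""
    return [rng.choice(CITY_POOLS[lang]) if tag == "B-city_name" else tok
            for tok, tag in zip(tokens, slot_tags)]

rng = random.Random(0)
resampled = resample_city_slots(["flights", "to", "Boston"],
                                ["O", "O", "B-city_name"], "DE", rng)
```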
Across the four languages considered for our experiments, MultiATIS++ (resampled) comprises 4488, 490 and 893 utterances for train, dev and test respectively. The data cover 18 intents and 84 slots.
Balanced and Imbalanced multilingual datasets We created two variants of the resampled data to allow detailed evaluation of our proposed approach in relation to different multilingual SLU scenarios, i.e. a balanced and an imbalanced variant. The imbalanced version reflects a real-world scenario, where usually an application is rolled out to different locales over time, yielding different data amounts, with the English version usually being the first and having highest data amounts.
To create trilingual balanced versions, for each of train and dev, we split the utterance ids 3 into three equal parts, and we construct mixed datasets by taking disjoint subsets across languages. To create imbalanced versions, the train and dev data were split into three parts across languages: 50%, 33%, and 17%. The largest subset corresponds to EN data, the second largest to DE data, and the smallest part is either ES or JA data for the two language triples, respectively. The goal was to select the data such that the multilingual dataset contains all available utterances, but spread across the three languages. For the bilingual datasets, the balanced split was 50/50, and the imbalanced split was 70/30, with 70% for EN and 30% for DE or JA.
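The partitioning of utterance ids into disjoint per-language parts can be sketched as below (a generic helper of our own, not the authors' preprocessing script):

```python
def split_ids(ids, fractions):
    """Partition a list of utterance ids into disjoint parts according to
    the given fractions, so every id appears in exactly one part."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    parts, start = [], 0
    for f in fractions[:-1]:
        end = start + round(f * len(ids))
        parts.append(ids[start:end])
        start = end
    parts.append(ids[start:])        # last part absorbs any rounding remainder
    return parts

ids = list(range(100))
balanced = split_ids(ids, [1/3, 1/3, 1/3])       # trilingual balanced
imbalanced = split_ids(ids, [0.50, 0.33, 0.17])  # trilingual imbalanced
```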
Each of the target languages has a test set taken from MultiATIS++ (resampled).

Real-world data
We created a dataset comprising real-world data by extracting random samples from a commercial large-scale SLU system. The data are representative of user requests to voice-controlled devices, and they were manually annotated with slot and intent labels. Aiming for a diverse dataset, we included data from three domains 4 (Music, Books, Video). To reflect the real-world use case, where user frequency and hence data size differ across domains, we extracted samples of different sizes. In particular, we use 20k, 10k and 5k user utterances for Music, Video and Books, respectively, yielding a total of 35k utterances for each language. Each domain sample was split into 80% training, 10% dev and 10% test data. Note that unlike the MultiATIS++ data, the real-world data are not parallel, as they were not artificially ported from one language to another.

Experiments
In this section, we first describe the multilingual performance evaluation and experimental settings. Afterwards, we present experiments on the resampled MultiATIS++ dataset and subsequently experiments with real-world-data.

Multilingual performance evaluation
To address the question of to what degree language borders can be blurred in BERT-based multilingual SLU, we compare two multilingual SLU models as discussed in Sec. 3 with the Ideal and Naive baselines on the Balanced and Imbalanced multilingual datasets (see Sec. 5.2). We denote the standard and adversarial multilingual SLU models as Multi-lang. and Lang.-adv., respectively. The Ideal and Naive baselines on each of the target languages are computed as in Sec. 4 by using the BERT-based model described in Sec. 3.1 as the monolingual model M .

Settings
In all of our experiments, we use pre-trained multilingual BERT (Devlin et al., 2018). Notably, we compared pre-trained multilingual BERT and monolingual BERT in monolingual models, and found that they have similar performance on our datasets. The IC and SF decoders each have two dense layers of size 768 with gelu activation. The dropout values used in the IC and SF decoders are 0.5 and 0.2, respectively. The language decoders each have 1 dense layer of size 768 with gelu activation and a dropout value of 0.5. The CNN encoders each have one 2D convolutional layer with kernel size and hidden dimension set to 3 and 512, respectively. All encoders use max-pooling for computing sentence representations. For optimization, we use the Adam optimizer with learning rate 0.1 and a Noam learning rate scheduler, and we trained our models with a mini-batch size of 32 on a single GPU. For adversarial models, the hyper-parameters (α_d, α_i, α_s, α_p, β_d) are set to (1.0, 1.0, 1.0, 1.0, 0.2) and (1.0, 1.0, 0.5, 0.5, 0.2) in our trilingual and bilingual experiments, respectively. The models are trained for 180 epochs with early stopping. We choose the best epoch based on the validation score of Task 2 in Alg. 1. To evaluate our models, we use the standard SLU metrics, i.e. F1 for slot filling (computed using the CoNLL-2002 script) and accuracy for intent classification. In addition, following Gaspers et al. (2018), we use a semantic error rate, which measures IC and SF jointly and is defined as:

SemER = #(slot + intent errors) / (#slots in reference + 1)   (11)

Table 2 and Table 1 show our experimental results on the bilingual and trilingual datasets, respectively. As can be seen, both multilingual SLU models consistently outperform the Naive baseline, meaning that multilingual techniques can effectively transfer knowledge across different languages. The relative improvements in avg. SemER range from 19.98% in the Balanced (EN-JA) scenario to 48.58% in the Imbalanced (EN-DE-ES) scenario.
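The semantic error rate of Eq. 11 translates directly into code; a minimal sketch:

```python
def semer(slot_errors, intent_errors, num_reference_slots):
    """Semantic error rate (Eq. 11): slot and intent errors are counted
    jointly and normalized by the number of reference slots plus one,
    the +1 accounting for the single intent decision per utterance."""
    return (slot_errors + intent_errors) / (num_reference_slots + 1)
```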
Interestingly, this holds even on the groups of (EN, JA) and (EN, DE, JA) in which JA is linguistically distant to the other languages. Moreover, the multilingual models surpass Naive performance not only on the average performance over the target languages but also on every individual language.

Results
Overall, there is still a gap between the Ideal performance and the two multilingual SLU models. The largest difference in avg. SemER is observed on Imbalanced (EN-DE-JA), where Multi-lang. is relatively 28.33% worse than Ideal. However, the fact that the absolute gap of 2.7% here is rather small compared to the absolute improvement of 6.12% over the Naive baseline confirms the effectiveness of the multilingual techniques.
The experimental results also support our hypothesis that using both language-shared and language-specific features can narrow the gap to the Ideal performance. In particular, Lang.-adv. gives lower average semantic error rates (SemER) than Multi-lang. in all of our scenarios. The other two metrics, SF F1 and IC acc., are also improved by Lang.-adv. over Multi-lang. in 5 out of 6 datasets. Surprisingly, Lang.-adv. even exceeds or comes very close to the Ideal performance for ES and DE on the Balanced (EN-DE-ES) dataset, and for DE on the Balanced (EN-DE-JA) dataset. In addition, the results show that there is a significant difference in model performance on the balanced and imbalanced versions of the datasets. Across the different trilingual and bilingual settings, the imbalanced versions have on average lower performance than their balanced counterparts. This shows that while transferring knowledge across different languages works, there is still a drop in performance if the data available for one of the languages is scarce.
Another interesting finding can be seen in the comparison of the two bilingual settings: the language pair EN-DE consistently performs better than EN-JA. This is to be expected, as JA is more distant from EN than DE is, EN and DE both being Germanic languages. Fig. 3 visualizes the comparisons of the two multilingual SLU models against the Naive and Ideal performance.

Comparison to state-of-the-art and model complexity
Using the same training, development and testing data, our monolingual SLU model obtains 97.54 intent accuracy and 95.53 slot F1 on English ATIS (the Ideal baseline on English). These results are comparable to the previously reported BERT-based performance of 97.20 intent accuracy and 95.57 slot F1. This confirms the strength of our baselines.
A common concern with BERT-based models is their complexity. In fact, using CNN encoders prevents our adversarial model Lang.-adv. from a dramatic increase in the number of parameters. On the trilingual datasets, the number of parameters is about 190.4M, which is just 5.6% more than that of Multi-lang. We also found that replacing the CNN encoders with BERT encoders not only increases the number of model parameters to 260.9M but also does not bring significant gains in model performance.

Performance on real-world SLU data
We investigated the performance of our SLU models on real-world data to explore the impact in a real-world scenario. Since parallel data were not available for these experiments, we cannot provide performance in relation to an Ideal baseline. Instead, we use the Naive baseline and determine the relative gain obtained by the two multilingual SLU models. The results for the two language triples (EN, DE, ES) and (EN, DE, JA) are presented in Table 3.

Table 3: Relative change in semantic error rate (SemER), intent classification accuracy and slot F1 for multilingual and language-adversarial training compared to the Naive baseline. Negative numbers indicate better performance for SemER, while positive numbers indicate better performance for slot F1 and intent classification accuracy.
Overall, the multilingual models improve performance over the Naive baselines. In particular, the only drops in performance are observed for slot filling in JA with Multi-lang. in two out of the three domains. This drop may be attributed to JA being linguistically distant from the other languages, which all belong to the European language family. The highest relative gain is up to 27.29% in avg. SemER, obtained with Lang.-adv.
Between the two multilingual models, our proposed model Lang.-adv. outperforms Multi-lang. in all three domains. While the relative gains in avg. SemER of Multi-lang. fluctuate between 5.39% and 24.02%, those of Lang.-adv. range from 18.71% to 29.01%. Moreover, unlike the standard multilingual approach, Lang.-adv. models do not yield drops when the distant language JA is included in training. In fact, their relative SemER reductions for JA range from 13.67% to 31.9%. Thus, in particular when distant languages are included in multilingual training, dividing the model into language-shared and language-specific parts seems to be beneficial. Taken together, the results indicate that our proposed adversarial architecture improves performance over the standard BERT-based approach not just on academic benchmark datasets, but also on real-world SLU data.

Conclusion
We have addressed the question of to what degree language borders can be blurred in BERT-based multilingual SLU. Our experimental results on a wide range of multilingual SLU datasets showed that although mBERT is remarkably effective at blurring language borders even on distant language groups, there is still a gap to the ideal multilingual performance. To narrow this gap, we proposed an adversarial model architecture which uses BERT and CNN encoders to learn language-shared and language-specific representations for SLU.