Multi-Source Cross-Lingual Model Transfer: Learning What to Share

Modern NLP applications have enjoyed a great boost from neural network models. Such deep neural models, however, are not applicable to most human languages due to the lack of annotated training data for various NLP tasks. Cross-lingual transfer learning (CLTL) is a viable method for building NLP models for a low-resource target language by leveraging labeled data from other (source) languages. In this work, we focus on the multilingual transfer setting where training data in multiple source languages is leveraged to further boost target language performance. Unlike most existing methods that rely only on language-invariant features for CLTL, our approach coherently utilizes both language-invariant and language-specific features at the instance level. Our model leverages adversarial networks to learn language-invariant features, and mixture-of-experts models to dynamically exploit the similarity between the target language and each individual source language. This enables our model to learn effectively what to share between various languages in the multilingual setup. Moreover, when coupled with unsupervised multilingual embeddings, our model can operate in a zero-resource setting where neither target language training data nor cross-lingual resources are available. Our model achieves significant performance gains over prior art, as shown in an extensive set of experiments over multiple text classification and sequence tagging tasks including a large-scale industry dataset.


Introduction
Recent advances in deep learning enabled a wide variety of NLP models to achieve impressive performance, thanks in part to the availability of large-scale annotated datasets. However, such an advantage is not available to most of the world's languages, since many of them lack the labeled data necessary for training deep neural nets for a variety of NLP tasks. As it is prohibitive to obtain training data for all languages of interest, cross-lingual transfer learning (CLTL) offers the possibility of learning models for a target language using annotated data from other languages (source languages) (Yarowsky et al., 2001). In this paper, we concentrate on the more challenging unsupervised CLTL setting, where no target language labeled data is used for training. Traditionally, most research on CLTL has been devoted to the standard bilingual transfer (BLTL) case where training data comes from a single source language. In practice, however, it is often the case that we have labeled data in a few languages, and would like to be able to utilize all of the data when transferring to other languages. Previous work (McDonald et al., 2011) indeed showed that transferring from multiple source languages could result in significant performance improvement. Therefore, in this work, we focus on the multi-source CLTL scenario, also known as multilingual transfer learning (MLTL), to further boost the target language performance.
* Most work was done while the first author was an intern at Microsoft Research.
1 The code is available at https://github.com/microsoft/Multilingual-Model-Transfer.
One straightforward method employed in CLTL is weight sharing, namely directly applying the model trained on the source language to the target after mapping both languages to a common embedding space. As shown in previous work, however, the distributions of the hidden feature vectors of samples from different languages extracted by the same neural net remain divergent, and hence weight sharing is not sufficient for learning a language-invariant feature space that generalizes well across languages. As such, previous work has explored using language-adversarial training (Kim et al., 2017) to extract features that are invariant with respect to the shift in language, using only (non-parallel) unlabeled texts from each language.
On the other hand, in the MLTL setting, where multiple source languages exist, language-adversarial training will only use, for model transfer, the features that are common among all source languages and the target, which may be too restrictive in many cases. For example, when transferring from English, Spanish and Chinese to German, language-adversarial training will retain only features that are invariant across all four languages, which can be too sparse to be informative. Furthermore, the fact that German is more similar to English than to Chinese is neglected because the transferred model is unable to utilize features that are shared only between English and German.
To address these shortcomings, we propose a new MLTL model that not only exploits language-invariant features, but also allows the target language to dynamically and selectively leverage language-specific features through a probabilistic attention-style mixture of experts mechanism (see §3). This allows our model to learn effectively what to share between various languages. Another contribution of this paper is that, when combined with the recent unsupervised cross-lingual word embeddings (Lample et al., 2018; Chen and Cardie, 2018b), our model is able to operate in a zero-resource setting where neither task-specific target language annotations nor general-purpose cross-lingual resources (e.g. parallel corpora or machine translation (MT) systems) are available. This is an advantage over many existing CLTL works, making our model more widely applicable to many lower-resource languages. We evaluate our model on multiple MLTL tasks ranging from text classification to named entity recognition and semantic slot filling, including a real-world industry dataset. Our model beats all baseline models trained, like ours, without cross-lingual resources. More strikingly, in many cases, it can match or outperform state-of-the-art models that have access to strong cross-lingual supervision (e.g. commercial MT systems).

Related Work
The diversity of human languages is a critical challenge for natural language processing. In order to alleviate the need for obtaining annotated data for each task in each language, cross-lingual transfer learning (CLTL) has long been studied (Yarowsky et al., 2001; Bel et al., 2003, inter alia).
For unsupervised CLTL in particular, where no target language training data is available, most prior research investigates the bilingual transfer setting. Traditionally, research focuses on resource-based methods, where general-purpose cross-lingual resources such as MT systems or parallel corpora are utilized to replace task-specific annotated data (Wan, 2009; Prettenhofer and Stein, 2010). With the advent of deep learning, especially adversarial neural networks (Goodfellow et al., 2014; Ganin et al., 2016), progress has been made towards model-based CLTL methods. Chen et al. (2016) propose language-adversarial training that does not directly depend on parallel corpora, but instead only requires a set of bilingual word embeddings (BWEs).
On the other hand, the multilingual transfer setting, although less explored, has also been studied (McDonald et al., 2011; Naseem et al., 2012; Täckström et al., 2013; Hajmohammadi et al., 2014; Zhang and Barzilay, 2015; Guo et al., 2016), showing improved performance compared to using labeled data from one source language as in bilingual transfer.
Another important direction for CLTL is to learn cross-lingual word representations (Klementiev et al., 2012; Zou et al., 2013; Mikolov et al., 2013). Recently, there have been several notable works on learning fully unsupervised cross-lingual word embeddings, both for the bilingual (Zhang et al., 2017; Lample et al., 2018; Artetxe et al., 2018) and multilingual case (Chen and Cardie, 2018b). These efforts pave the road for performing CLTL without cross-lingual resources.
Finally, a related field to MLTL is multi-source domain adaptation (Mansour et al., 2009), where most prior work relies on the learning of domain-invariant features (Zhao et al., 2018; Chen and Cardie, 2018a). Ruder et al. (2019) propose a general framework for selective sharing between domains, but their method learns static weights at the task level, while our model can dynamically select what to share at the instance level. A very recent work (Guo et al., 2018)

Model
One commonly adopted paradigm for neural cross-lingual transfer is the shared-private model (Bousmalis et al., 2016), where the features are divided into two parts: shared (language-invariant) features and private (language-specific) features. As mentioned before, the shared features are enforced to be language-invariant via language-adversarial training, by attempting to fool a language discriminator. Furthermore, Chen and Cardie (2018a) propose a generalized shared-private model for the multi-source setting, where a multinomial adversarial network (MAN) is adopted to extract common features shared by all source languages as well as the target. On the other hand, the private features are learned by separate feature extractors, one for each source language, capturing the remaining features outside the shared ones. During training, the labeled samples from a certain source language go through the corresponding private feature extractor for that particular language. At test time, there is no private feature extractor for the target language; only the shared features are used for cross-lingual transfer.
As mentioned in §1, using only the shared features for MLTL imposes an overly strong constraint, and many useful features may be wiped out by adversarial training if they are shared only between the target language and a subset of source languages. Therefore, we propose to use a mixture-of-experts (MoE) model (Shazeer et al., 2017; Gu et al., 2018) to learn the private features. The idea is to have a set of language expert networks, one per source language, each responsible for learning language-specific features for that source language during training. However, instead of hard-switching between the experts, each sample uses a convex combination of all experts, dictated by an expert gate. Thus, at test time, the trained expert gate can decide the optimal expert weights for the unseen target language based on its similarity to the source languages.
, the MoE Private Feature Extractor F_p, and finally the MoE Predictor C. Based on the actual task (e.g. sequence tagging, text classification, sequence to sequence, etc.), different architectures may be adopted, as explained below.
Multilingual Word Representation embeds words from all languages into a single semantic space so that words with similar meanings are close to each other regardless of language. In this work, we mainly rely on the MUSE embeddings (Lample et al., 2018), which are trained in a fully unsupervised manner. We map all other languages into English to obtain a multilingual embedding space. However, in certain experiments, MUSE yields 0 accuracy on one or more language pairs (Søgaard et al., 2018), in which case the VecMap embeddings (Artetxe et al., 2017) are used. VecMap uses identical strings as supervision, which does not require a parallel corpus or human annotations. We further experiment with the recent unsupervised multilingual word embeddings (Chen and Cardie, 2018b), which give improved performance (§4.2).
In addition, for tasks where morphological features are important, one can add character-level word embeddings (Dos Santos and Zadrozny, 2014) that capture sub-word information. When character embeddings are used, we add a single CharCNN that is shared across all languages, and the final word representation is the concatenation of the word embedding and the char-level embedding. The CharCNN can then be trained end to end with the rest of the model.

MAN Shared Feature Extractor F_s is a multinomial adversarial network (Chen and Cardie, 2018a), which is an adversarial pair of a feature extractor (e.g. LSTM or CNN) and a language discriminator D. D is a text classifier (Kim, 2014) that takes the shared features (extracted by F_s) of an input sequence and predicts which language it comes from. On the other hand, F_s strives to fool D so that it cannot identify the language of a sample. The hypothesis is that if D cannot recognize the language of the input, the shared features then do not contain language information and are hence language-invariant. Note that D is trained only using unlabeled texts, and can therefore be trained on all languages including the target language.
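As a rough illustration of this adversarial pair, the sketch below computes the discriminator's NLL loss on toy logits, together with the negated term that the shared extractor would add to its own loss. The function names, the logits, and the value of λ1 are hypothetical; only the sign convention (the extractor receives the negated, scaled discriminator loss) follows Algorithm 1.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_nll(logits, lang_id):
    """NLL of the true language under D's predicted distribution."""
    return -math.log(softmax(logits)[lang_id])

def extractor_adversarial_term(logits, lang_id, lam1):
    """Term added to the loss of F_s: the negated, scaled D loss."""
    return -lam1 * discriminator_nll(logits, lang_id)

# Hypothetical D outputs over four languages; the true language has index 0.
confident = [4.0, 0.0, 0.0, 0.0]  # D easily identifies the language
confused = [0.0, 0.0, 0.0, 0.0]   # D cannot tell the languages apart

loss_confident = discriminator_nll(confident, 0)
loss_confused = discriminator_nll(confused, 0)
```

Under these toy numbers, D's loss is lowest when it is confident and correct, while the extractor's term is lowest when D is confused, which is the opposing pressure the paragraph describes.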
MoE Private Feature Extractor F_p is a key difference from previous work, shown in Figure 2. The figure shows the Mixture-of-Experts (Shazeer et al., 2017) model with three source languages, English, Spanish, and Chinese. F_p has a shared BiLSTM at the bottom that extracts contextualized word representations for each token w in the input sentence. The LSTM hidden representation h_w is then fed into the MoE module, where each source language has a separate expert network (an MLP). In addition, the expert gate G is a linear transformation that takes h_w as input and outputs a softmax score α_i for each expert. The final private feature vector is a mixture of all expert outputs, dictated by the expert gate weights α.
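The mixing step can be sketched for a single token as follows. This is a minimal illustration, not the released implementation: the experts are stand-in callables rather than trained MLPs, and all names, dimensions, and weights are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(vec, weights, bias):
    """Compute W * vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def moe_private_features(h_w, gate_w, gate_b, experts):
    """Mix the expert outputs for one token representation h_w.

    experts holds one callable per source language (stand-ins for MLPs).
    Returns the mixed feature vector and the gate weights alpha.
    """
    alpha = softmax(linear(h_w, gate_w, gate_b))
    outputs = [expert(h_w) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(a * out[d] for a, out in zip(alpha, outputs))
             for d in range(dim)]
    return mixed, alpha

# Toy setup: three experts with fixed outputs and a 2-dimensional h_w.
experts = [lambda h: [1.0, 0.0], lambda h: [0.0, 1.0], lambda h: [1.0, 1.0]]
gate_w = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
gate_b = [0.0, 0.0, 0.0]
mixed, alpha = moe_private_features([1.0, 0.0], gate_w, gate_b, experts)
```

Because α is a softmax output, the result is a convex combination of the expert outputs, which is what lets an unseen target language draw on several source experts at once.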
During training, the expert gate is trained to predict the language of a sample using the gate loss J_g, where the expert gate output α is treated as the softmax probability of the predicted languages. In other words, the more accurate the language prediction is, the more the correct expert gets used. Therefore, J_g is used to encourage samples from a certain source language to use the correct expert, and each expert is hence learning language-specific features for that language. As the BiLSTM is exposed to all source languages during training, the trained expert gate will be able to examine the hidden representation of a token to predict the optimal expert weights α, even for unseen target languages at test time. For instance, if a German test sample is similar to the English training samples, the trained expert gate will predict a higher α for the English expert, resulting in a heavier use of it in the final feature vector. Therefore, even for the unforeseen target language (e.g. German), F_p is able to dynamically determine what knowledge to use from each individual source language at a token level.
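A minimal sketch of the gate loss for one sample, assuming it is the sum over tokens of the negative log gate weight assigned to the sample's true source language. The gate weights below are made-up values for a two-token sentence, and `gate_loss` is a hypothetical helper name.

```python
import math

def gate_loss(alphas, lang_id):
    """Token-level NLL: sum over tokens of -log alpha_w[lang_id].

    alphas is a list with one gate weight vector per token.
    """
    return -sum(math.log(alpha[lang_id]) for alpha in alphas)

# Hypothetical gate outputs for a two-token sample over three experts.
alphas = [[0.7, 0.2, 0.1],
          [0.5, 0.3, 0.2]]

loss_if_lang0 = gate_loss(alphas, 0)  # gate mostly agrees with label 0
loss_if_lang2 = gate_loss(alphas, 2)  # gate mostly disagrees with label 2
```

The loss is smaller when the gate already puts most of its weight on the labeled language's expert, so minimizing it pushes each source sample towards its own expert.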
MoE Task-Specific Predictor C is the final module that makes predictions for the end task, and may take different forms depending on the task. For instance, for sequence tagging tasks, the shared and private features are first concatenated for each token, and then passed through a MoE module similar to F_p (as shown in Figure 6 in the Appendix). It is straightforward to adapt C to work for other tasks. For example, for text classification, a pooling layer such as dot-product attention (Luong et al., 2015) is added at the bottom to fuse token-level features into a single sentence feature vector.
C first concatenates the shared and private features to form a single feature vector for each token. It then has another MoE module that outputs a softmax probability over all labels for each token. The idea is that it may be favorable to put different weights between the language-invariant and language-specific features for different target languages. Again consider the example of English, German, Spanish and Chinese. When transferring to Chinese from the other three, the source languages are similar to each other while all being rather distant from Chinese. Therefore, the adversarially learned shared features might be more important in this case. On the other hand, when transferring to German, which is much more similar to English than to Chinese, we might want to pay more attention to the MoE private features. Therefore, we adopt a MoE module in C, which provides more flexibility than using a single MLP.

Algorithm 1 MAN-MoE Training
Require: labeled corpus X; unlabeled corpus U; hyperparameters λ1, λ2 > 0, k ∈ N
repeat
    ▷ D iterations
    for diter = 1 to k do
        l_D = 0
        for all l ∈ ∆ do                          ▷ For all languages
            Sample a mini-batch x ∼ U_l
            f_s = F_s(x)                          ▷ Shared features
            l_D += L_D(D(f_s); l)                 ▷ D loss
        Update D parameters using ∇l_D
    ▷ Main iteration
    loss = 0
    for all l ∈ S do                              ▷ For all source languages
        Sample a mini-batch (x, y) ∼ X_l
        f_s = F_s(x)                              ▷ Shared features
        f_p, g_1 = F_p(x)                         ▷ Private feat. & gate outputs
        ŷ, g_2 = C(f_s, f_p)
        loss += L_C(ŷ; y) + λ2 (L_g(g_1; l) + L_g(g_2; l))
    for all l ∈ ∆ do                              ▷ For all languages
        Sample a mini-batch x ∼ U_l
        f_s = F_s(x)                              ▷ Shared features
        loss += −λ1 · L_D(D(f_s); l)              ▷ Confuse D
    Update F_s, F_p, C parameters using ∇loss
until convergence
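The control flow of Algorithm 1 can be sketched in Python as below. This is only a schematic of one outer iteration: the loss callables are hypothetical stubs returning constants, whereas a real implementation would compute them from sampled mini-batches and step two separate optimizers at the marked points.

```python
def train_iteration(languages, sources, k, lam1, lam2,
                    d_loss_fn, task_loss_fn, gate_loss_fn):
    """One outer iteration of the alternating scheme.

    Returns the list of accumulated D losses (one per D iteration)
    and the main loss used to update F_s, F_p and C.
    """
    d_losses = []
    for _ in range(k):  # D iterations
        l_d = sum(d_loss_fn(lang) for lang in languages)
        d_losses.append(l_d)  # the D optimizer would step here

    loss = 0.0
    for lang in sources:  # labeled source-language batches
        gates = gate_loss_fn(lang, 1) + gate_loss_fn(lang, 2)
        loss += task_loss_fn(lang) + lam2 * gates
    for lang in languages:  # unlabeled batches from all languages
        loss -= lam1 * d_loss_fn(lang)  # confuse D
    return d_losses, loss  # the main optimizer would step here

# Toy run with constant stub losses; German is the held-out target.
languages = ["en", "de", "es", "zh"]
sources = ["en", "es", "zh"]
d_losses, loss = train_iteration(
    languages, sources, k=5, lam1=0.01, lam2=1.0,
    d_loss_fn=lambda lang: 1.0,
    task_loss_fn=lang_task if False else (lambda lang: 2.0),
    gate_loss_fn=lambda lang, which: 0.5,
)
```

Note that the discriminator sees all languages, including the target, while the task and gate losses are accumulated over source languages only.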

Model Training
Denote the set of all N source languages as S, where |S| = N. Denote the target language as T, and let ∆ = S ∪ {T} be the set of all languages. Denote the annotated corpus for a source language l ∈ S as X_l, where (x, y) ∼ X_l is a sample drawn from X_l. In addition, unlabeled data is required for all languages to facilitate the MAN training. We hence denote as U_l the unlabeled texts from a language l ∈ ∆.
The overall training flow of the various components is illustrated in Figure 1, while the training algorithm is depicted in Algorithm 1. Similar to MAN, there are two separate optimizers to train MAN-MoE, one updating the parameters of D (red arrows), the other updating the parameters of all other modules (green arrows). In Algorithm 1, L_C, L_D and L_g are the loss functions for the predictor C, the language discriminator D, and the expert gates in F_p and C, respectively.
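In practice, we adopt the NLL loss for L_C for text classification, and token-level NLL loss for sequence tagging:

$$\mathcal{L}_{NLL}(\hat{y}; y) = -\log P(\hat{y} = y) \tag{1}$$

$$\mathcal{L}_{T\text{-}NLL}(\hat{\mathbf{y}}; \mathbf{y}) = -\sum_{i} \log P(\hat{y}_i = y_i) \tag{2}$$

where y is a scalar class label, and $\mathbf{y}$ is a vector of token labels. L_C is hence interpreted as the negative log-likelihood of predicting the correct task label. Similarly, D adopts the NLL loss in (1) for predicting the correct language of a sample. Finally, the expert gates G use token-level NLL loss in (2), which translates to the negative log-likelihood of using the correct language expert for each token in a sample. Therefore, the objectives that C, D and G minimize are, respectively:

$$J_C = \sum_{l \in S} \mathbb{E}_{(x, y) \sim X_l} \left[ L_C\big(C(F_s(x), F_p(x)); y\big) \right] \tag{3}$$

$$J_D = \sum_{l \in \Delta} \mathbb{E}_{x \sim U_l} \left[ L_D\big(D(F_s(x)); l\big) \right] \tag{4}$$

$$J_G = \sum_{l \in S} \mathbb{E}_{x \sim X_l} \left[ \sum_{w \in x} L_g\big(G(h_w); l\big) \right] \tag{5}$$

where h_w in (5) is the BiLSTM hidden representation in F_p as shown in Figure 2. In addition, note that D is trained using unlabeled corpora over all languages (∆), while the training of F_p and C (and hence G) only takes place on source languages (S). Finally, the overall objective function is:

$$J = J_C + \lambda_2 \left( J_G^{(1)} + J_G^{(2)} \right) - \lambda_1 J_D \tag{6}$$

where $J_G^{(1)}$ and $J_G^{(2)}$ are the objectives of the two expert gates in F_p and C, respectively. More implementation details can be found in Appendix B.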

Experiments
In this section, we present an extensive set of experiments across three datasets. The first experiment is on a real-world multilingual slot filling (sequence tagging) dataset, where the data is used in a commercial personal virtual assistant. In addition, we conduct experiments on two public academic datasets, namely the CoNLL multilingual named entity recognition (sequence tagging) dataset (Sang, 2002; Sang and Meulder, 2003), and the multilingual Amazon reviews (text classification) dataset (Prettenhofer and Stein, 2010).

Cross-Lingual Semantic Slot Filling
As shown in Table 1, we collect data for four languages: English, German, Spanish, and Chinese, over three domains: Navigation, Calendar, and Files. Each domain has a set of pre-determined slots (the slots are the same across languages), and the user utterances in each language and domain are annotated by crowd workers with the correct slots (see the examples in Table 1). We employ the standard BIO tagging scheme to formulate the slot filling problem as a sequence tagging task. For each domain and language, the data is divided into a training, a validation, and a test set, with the number of samples in each split shown in Table 1. In our experiments, we treat each domain as a separate experiment, and consider each of German, Spanish and Chinese as the target language with the remaining three serving as source languages, which results in a total of 9 experiments.
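To make the setup concrete, the sketch below converts slot spans into BIO tags and enumerates the 9 domain/target combinations. The utterance, the slot names (`date`, `time`), and the helper `spans_to_bio` are illustrative assumptions and are not taken from the dataset.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, slot) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, slot in spans:
        tags[start] = "B-" + slot          # first token of the slot
        for i in range(start + 1, end):
            tags[i] = "I-" + slot          # remaining tokens of the slot
    return tags

# Hypothetical utterance with two annotated slots.
tokens = ["set", "a", "meeting", "tomorrow", "at", "3", "pm"]
spans = [(3, 4, "date"), (5, 7, "time")]
tags = spans_to_bio(tokens, spans)

# Each domain is run once per non-English target language,
# with the remaining three languages serving as sources.
languages = ["English", "German", "Spanish", "Chinese"]
domains = ["Navigation", "Calendar", "Files"]
experiments = [
    (domain, target, [lang for lang in languages if lang != target])
    for domain in domains
    for target in languages
    if target != "English"
]
```

Here `tags` marks "tomorrow" as `B-date` and "3 pm" as `B-time I-time`, and `experiments` contains 3 domains × 3 targets = 9 entries.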

Results
In Table 2, we report the performance of MAN-MoE compared to a number of baseline systems. All systems adopt the same base architecture, which is a multi-layer BiLSTM sequence tagger (İrsoy and Cardie, 2014) with a token-level MLP on top (no CRFs were used).
MT baselines employ machine translation (MT) for cross-lingual transfer. In particular, the train-on-trans(lation) method translates the entire English training set into each target language, and the translated data is in turn used to train a supervised system on the target language. On the other hand, the test-on-trans(lation) method trains an English sequence tagger, and utilizes MT to translate the test set of each target language into English in order to make predictions. In this work, we adopt the Microsoft Translator, a strong commercial MT system. Note that for an MT system to work for sequence tagging tasks, word alignment information must be available, in order to project word-level annotations across languages. This rules out many MT systems such as Google Translate since they do not provide word alignment information through their APIs.
BWE baselines rely on Bilingual Word Embeddings (BWEs) and weight sharing for CLTL. Namely, the sequence tagger trained on the source language(s) is directly applied to the target language, in hopes that the BWEs could bridge the language gap. This simple method has been shown to yield strong results in recent work (Upadhyay et al., 2018). The MUSE (Lample et al., 2018) BWEs are used by all systems in this experiment. 1-to-1 indicates that we are only transferring from English, while 3-to-1 means the training data from all other three languages are leveraged.
The final baseline is the MAN model (Chen and Cardie, 2018a), presented before our MAN-MoE approach. As shown in Table 2, MAN-MoE substantially outperforms all baseline systems that do not employ cross-lingual supervision on almost all domains and languages.
Another interesting observation is that MAN performs strongly on Chinese while being much worse on German and Spanish compared to the BWE baseline. This corroborates our hypothesis that MAN only leverages features that are invariant across all languages for CLTL, and it learns such features better than weight sharing. Therefore, when transferring to German or Spanish, which are similar to a subset of source languages, the performance of MAN degrades significantly. On the other hand, when Chinese serves as the target language, where all source languages are rather distant from it, MAN has its merit in extracting language-invariant features that could generalize to Chinese. With MAN-MoE, however, this trade-off between close and distant language pairs is well addressed by the combination of MAN and MoE. By utilizing both language-invariant and language-specific features for transfer, MAN-MoE outperforms all cross-lingually unsupervised baselines on all languages. Furthermore, even when compared with the MT baseline, which has access to hundreds of millions of parallel sentences, MAN-MoE performs competitively on German and Spanish. It even significantly beats both MT systems on German as MT sometimes fails to provide accurate word alignment for German. On Chinese, where the unsupervised BWEs are much less accurate (BWE baselines only achieve 20% F1), MAN-MoE is able to greatly improve over the BWE and MAN baselines and shows promising results for zero-resource CLTL even between distant language pairs.

Feature Ablation
In this section, we take a closer look at the various modules of MAN-MoE and their impacts on performance (Table 3). When the MoE in C is removed, moderate decrease is observed on all languages. The performance degrades the most on Chinese, suggesting that using a single MLP in C is not ideal when the target language is not similar to the sources. When removing the private MoE, the MoE in C no longer makes much sense as C only has access to the shared features, and the performance is even slightly worse than removing both MoEs. With both MoE modules removed, it reduces to the MAN model, and we see a significant drop on German and Spanish. Finally, when removing MAN while keeping MoE, where the shared features are simply learned via weight-sharing, we see a slight drop on German and Spanish, but a rather great one on Chinese. The ablation results support our hypotheses and validate the merit of MAN-MoE.

Cross-Lingual Named Entity Recognition
In this section, we present experiments on the CoNLL 2002 & 2003 multilingual named entity recognition (NER) dataset (Sang, 2002;Sang and Meulder, 2003), with four languages: English, German, Spanish and Dutch. The task is also formulated as a sequence tagging problem, with four types of tags: PER, LOC, ORG, and MISC.
The results are summarized in Table 4. We observe that using only word embeddings does not yield satisfactory results, since the out-of-vocabulary problem is rather severe, and morphological features such as capitalization are crucial for NER. We hence add character-level word embeddings for this task (§3.1) to capture subword features.
Table 4 also shows the performance of several state-of-the-art models in the literature (see footnote 6). Note that most of these systems are specifically designed for the NER task, and exploit many task-specific resources, such as multilingual gazetteers, or metadata in Freebase or Wikipedia (such as entity categories). Among these, Täckström et al. (2012) rely on parallel corpora to learn cross-lingual word clusters that serve as features. Nothman et al. (2013) and Tsai et al. (2016) both leverage information in external knowledge bases such as Wikipedia to learn useful features for cross-lingual NER. Ni et al. (2017) employ noisy parallel corpora (aligned sentence pairs, but not always translations) and bilingual dictionaries (5k words for each language pair) for model transfer. They further add external features such as entity types learned from Wikipedia for improved performance. Finally, Mayhew et al. (2017) propose a multi-source framework that utilizes large cross-lingual lexica. Despite using none of these resources, general or task-specific, MAN-MoE nonetheless outperforms all these methods. The only exception is German, where task-specific resources remain helpful due to its unique capitalization rules and high OOV rate.
In contemporaneous work, Xie et al. (2018) propose a cross-lingual NER model using Bi-LSTM-CRF that achieves similar performance compared to MAN-MoE+CharCNN. However, our architecture is not specialized to the NER task, and we did not add task-specific modules such as a CRF decoding layer.
6 We also experimented with the MT baselines, but they often failed to produce word alignment, resulting in many empty predictions. The MT baselines attain only an F1 score of ∼30%, and were thus excluded from the comparison.
Last but not least, we replace the MUSE embeddings with the recently proposed unsupervised multilingual word embeddings (Chen and Cardie, 2018b), which further boosts the performance, achieving a new state-of-the-art performance as shown in Table 4 (last row).

Cross-Lingual Text Classification on Amazon Reviews
Finally, we report results on a multilingual text classification dataset (Prettenhofer and Stein, 2010). The dataset is a binary classification dataset where each review is classified into positive or negative sentiment. It has four languages: English, German, French and Japanese. As shown in Table 5, MT-BOW uses machine translation to translate the bag of words of a target sentence into the source language, while CL-SCL learns a cross-lingual feature space via structural correspondence learning (Prettenhofer and Stein, 2010). CR-RL (Xiao and Guo, 2013) learns bilingual word representations where part of the word vector is shared among languages. Bi-PV (Pham et al., 2015) extracts bilingual paragraph vectors by sharing the representation between parallel documents. UMM (Xu and Wan, 2017) is a multilingual framework that could utilize parallel corpora between multiple language pairs, and pivot as needed when direct bitexts are not available for a specific source-target pair. Finally, CLDFA (Xu and Yang, 2017) proposes cross-lingual distillation on parallel corpora for CLTL. Unlike other works listed, however, they adopt a task-specific parallel corpus (translated Amazon reviews) that is difficult to obtain in practice, making the numbers not directly comparable to others. Among these methods, UMM is the only one that does not require a direct parallel corpus between all source-target pairs. It can instead utilize pivot languages (e.g. English) to connect multiple languages. MAN-MoE, however, goes a step further and completely removes the necessity of parallel corpora while achieving similar results on German and French compared to UMM. On Japanese, the performance of MAN-MoE is again limited by the quality of BWEs. (BWE baselines are barely better than random.) Nevertheless, MAN-MoE remains highly effective and the performance is only a few points below most SoTA methods with cross-lingual supervision.

[Table 5 column headers: German, French, Japanese; for each target language, Domain: books, dvd, music, avg]
For a better understanding of the model behavior, Figure 3 visualizes the expert weights when transferring to different languages, which corroborates our model hypothesis and the findings in §4.1.2 (see Appendix A for more details).

Conclusion
In this paper, we propose MAN-MoE, a multilingual model transfer approach that exploits both language-invariant (shared) features and language-specific (private) features, which departs from most previous models that can only make use of shared features. Following earlier work, the shared features are learned via language-adversarial training. On the other hand, the private features are extracted by a mixture-of-experts (MoE) module, which is able to dynamically capture the relation between the target language and each source language on a token level. This is extremely helpful when the target language is similar to a subset of source languages, in which case traditional models that solely rely on shared features would perform poorly. Furthermore, MAN-MoE is a purely model-based transfer method, which does not require parallel data for training, enabling fully zero-resource MLTL when combined with unsupervised cross-lingual word embeddings. This makes MAN-MoE more widely applicable to lower-resourced languages. Our claim is supported by a wide range of experiments over multiple text classification and sequence tagging tasks, including a large-scale industry dataset. MAN-MoE significantly outperforms all cross-lingually unsupervised baselines regardless of task or language. Furthermore, even considering methods with strong cross-lingual supervision, MAN-MoE is able to match or outperform these models on closer language pairs. When transferring to distant languages such as Chinese or Japanese (from European languages), where the quality of cross-lingual word embeddings is unsatisfactory, MAN-MoE remains highly effective and substantially mitigates the performance gap introduced by cross-lingual supervision.
For future work, we plan to apply MAN-MoE to more challenging languages for tasks such as syntactic parsing, where multilingual data exists (Nivre et al., 2017). Furthermore, we would like to experiment with multilingual contextualized embeddings such as the Multilingual BERT.

Appendix A Visualization of Expert Gate Weights
In Figures 4 and 5, we visualize the average expert gate weights for each of the three target languages in the Amazon and CoNLL datasets, respectively. For each sample, we first compute a sentence-level aggregation by averaging over the expert gate weights of all its tokens. These sentence-level expert gate weights are then further averaged across all samples in the validation set, which forms a final language-level average expert gate weight for each target language. For the Amazon dataset, we take the combination of all three domains (books, dvd, music). The visualization further corroborates our hypothesis that our model makes informed decisions when selecting what features to share with the target language. On the Amazon dataset, it can be seen that when transferring to German or French (from the remaining three), the Japanese expert is less utilized compared to the European languages. On the other hand, it is interesting that when transferring to Japanese, the French and English experts are used more than the German one, and the exact reason remains to be investigated. However, this phenomenon might be of less significance since the private features may not play a very important role when transferring to Japanese as the model is probably focusing more on the shared features, according to the ablation study in Section 4.1.2.
In addition, on the CoNLL dataset, we observe that when transferring to German, the experts from the two more similar languages, English and Dutch, are favored over the Spanish one. Similarly, when transferring to Dutch, the highly relevant German expert is heavily used, and the Spanish expert is barely used at all. Interestingly, when transferring to Spanish, the model also shows a skewed pattern in terms of expert usage, and prefers the German expert over the other two.

Appendix B Implementation Details
In all experiments, Adam (Kingma and Ba, 2015) is used for both optimizers (main optimizer and D optimizer), with learning rate 0.001 and weight decay 10^{-8}. Batch size is 64 for the slot filling experiment and 16 for the NER and Amazon Reviews experiments, chosen mainly because of memory constraints. CharCNN increases GPU memory usage, and NER hence could only use a batch size of 16 to fit in 12GB of GPU memory. The Amazon experiment does not employ character embeddings, but its documents are much longer, so it also uses a smaller batch size. All embeddings are fixed during training. Dropout (Srivastava et al., 2014) with p = 0.5 is applied in all components. Unless otherwise mentioned, ReLU is used as the non-linear activation. Bidirectional LSTMs are used in the feature extractors for all experiments. In particular, F_s is a two-layer BiLSTM of hidden size 128 (64 for each direction), and F_p is a two-layer BiLSTM of hidden size 128 stacked with an MoE module (see Figure 2). Each expert network in the MoE module of F_p is a two-layer MLP, again of hidden size 128. The final layer in the MLP has a tanh activation instead of ReLU to match the LSTM-extracted shared features (which have tanh activations). The expert gate is a linear transformation (matrix) of size 128 × N, where N is the number of source languages.

Table 6: Values of λ_1, λ_2, and k for each experiment.

                λ_1      λ_2     k
Slot Filling    0.01     1       5
CoNLL NER       0.0001   0.01    1
Amazon          0.002    0.1     1
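To make the MoE module of F_p more concrete, the following is a minimal plain-Python sketch of the forward pass for a single token. It rests on several assumptions that are not stated in this appendix: the gate scores are normalized with a softmax, biases and dropout are omitted, the weights are randomly initialized, and the hidden size is shrunk from 128 to 4 for readability. All function and variable names are hypothetical.

```python
import math
import random


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, n_out, n_in):
    """Hypothetical initializer: small uniform weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def expert_forward(x, expert):
    """Two-layer MLP (biases omitted): ReLU, then tanh on the final layer."""
    w1, w2 = expert
    hidden = [max(0.0, v) for v in matvec(w1, x)]
    return [math.tanh(v) for v in matvec(w2, hidden)]


def moe_forward(x, experts, gate):
    """Return the gate-weighted mixture of expert outputs for one token."""
    alpha = softmax(matvec(gate, x))  # one weight per source language
    outputs = [expert_forward(x, e) for e in experts]
    dim = len(outputs[0])
    mixed = [sum(a * out[i] for a, out in zip(alpha, outputs)) for i in range(dim)]
    return mixed, alpha


HIDDEN, N_SOURCES = 4, 3  # the text uses hidden size 128; 4 keeps the toy readable
rng = random.Random(0)
experts = [(random_matrix(rng, HIDDEN, HIDDEN), random_matrix(rng, HIDDEN, HIDDEN))
           for _ in range(N_SOURCES)]
gate = random_matrix(rng, N_SOURCES, HIDDEN)  # N rows, one per source language
token_state = [0.5, -0.2, 0.1, 0.3]  # stands in for a BiLSTM output vector
mixed, alpha = moe_forward(token_state, experts, gate)
```

Since the mixture is a convex combination of tanh outputs, every component of `mixed` stays within [-1, 1], the same range as the LSTM-extracted shared features.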
On the other hand, the architecture of the task-specific predictor C depends on the task. For sequence tagging experiments, the structure of C is shown in Figure 6, where each expert in the MoE module is a token-level two-layer MLP with a softmax layer on top for making token label predictions. For text classification tasks, a dot-product attention mechanism (Luong et al., 2015) is added after the shared and private features are concatenated. It has a weight vector of length 256 that attends to the feature vectors of each token and computes a softmax mixture that pools the token-level feature vectors into a single sentence-level feature vector. The rest of C remains the same for text classification.
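The attention pooling used for text classification can be illustrated with a short plain-Python sketch. It assumes that each token's score is the dot product between the weight vector and the concatenated shared and private features, and it uses 2-dimensional toy features (so the weight vector has length 4 rather than 256). The names and numbers are hypothetical.

```python
import math


def attention_pool(token_features, weight):
    """Pool token vectors into one vector with dot-product attention."""
    # One scalar score per token: dot product with the weight vector.
    scores = [sum(w * f for w, f in zip(weight, feats)) for feats in token_features]
    # Softmax over tokens (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Attention-weighted sum of the token vectors.
    dim = len(weight)
    pooled = [sum(a * feats[i] for a, feats in zip(attn, token_features))
              for i in range(dim)]
    return pooled, attn


# Toy shared and private features for a three-token sentence.
shared = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
private = [[0.5, 0.5], [0.0, 0.0], [1.0, 0.0]]
tokens = [s + p for s, p in zip(shared, private)]  # concatenate per token
weight = [0.3, -0.1, 0.2, 0.4]  # would be a learned parameter in practice
sentence_vector, attention = attention_pool(tokens, weight)
```

With an all-zero weight vector the attention is uniform and the pooling reduces to a plain average over tokens, which is a convenient sanity check.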
For the language discriminator D, a CNN text classifier (Kim, 2014) is adopted in all experiments. It takes as input the shared feature vectors of each token, and employs a CNN with max-pooling to pool them into a single fixed-length feature vector, which is then fed into an MLP for classifying the language of the input sequence. The number of kernels is 200 in the CNN, while the kernel sizes are 3, 4, and 5. The MLP has one hidden layer of size 128.
The MUSE, VecMap, and UMWE embeddings are trained with the monolingual 300d fastText Wikipedia embeddings (Bojanowski et al., 2017). When character-level word embeddings are used, a CharCNN is added that takes randomly initialized character embeddings of each character in a word, and passes them through a CNN with 200 kernels and kernel sizes 3, 4, and 5. Finally, the character embeddings are max-pooled and fed into a single fully-connected layer to form a 128-dimensional character-level word embedding, which is concatenated with the pre-trained cross-lingual word embedding to form the final word representation of that word.
The remaining hyperparameters such as λ_1, λ_2, and k (see Algorithm 1) are tuned for each individual experiment, as shown in Table 6.
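The convolution and max-pooling step of the discriminator can be sketched in plain Python as below. This is only an illustration under stated assumptions: it uses one hand-written filter per kernel size instead of 200 learned kernels, applies ReLU to the convolution outputs, zero-pads sequences shorter than a kernel, and leaves out the MLP that follows. All names and values are hypothetical.

```python
def conv_max_pool(sequence, filters):
    """sequence: list of token vectors. filters: list of filters, where each
    filter is a list of k weight vectors (one per window position).
    Returns one max-over-time activation per filter."""
    dim = len(sequence[0])
    pooled = []
    for kernel in filters:
        k = len(kernel)
        # Zero-pad on the right if the sequence is shorter than the kernel.
        padded = sequence + [[0.0] * dim for _ in range(max(0, k - len(sequence)))]
        activations = []
        for start in range(len(padded) - k + 1):
            window = padded[start:start + k]
            value = sum(w * v
                        for w_vec, t_vec in zip(kernel, window)
                        for w, v in zip(w_vec, t_vec))
            activations.append(max(0.0, value))  # ReLU
        pooled.append(max(activations))  # max-pooling over positions
    return pooled


# Five tokens with 2-dimensional shared features (toy values).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
# One all-ones filter for each kernel size 3, 4, and 5.
filters = [[[1.0, 1.0] for _ in range(k)] for k in (3, 4, 5)]
features = conv_max_pool(sequence, filters)
```

The result has a fixed length (one entry per filter) regardless of the sequence length, which is what allows it to be fed into the MLP classifier.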