How Multilingual is Multilingual BERT?

In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2018) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, we present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer works best between typologically similar languages, that monolingual corpora can train models for code-switching, and that the model can find translation pairs. From these results, we can conclude that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs.


Introduction
Deep, contextualized language models provide powerful, general-purpose linguistic representations that have enabled significant advances across a wide range of natural language processing tasks (Peters et al., 2018b; Devlin et al., 2019). These models can be pre-trained on large corpora of readily available unannotated text and then fine-tuned for specific tasks on smaller amounts of supervised data, relying on the induced language model structure to facilitate generalization beyond the annotations. Previous work on model probing has shown that these representations are able to encode, among other things, syntactic and named entity information, but it has heretofore focused on what models trained on English capture about English (Peters et al., 2018a; Tenney et al., 2019a,b).
In this paper, we empirically investigate the degree to which these representations generalize across languages. We explore this question using Multilingual BERT (henceforth, M-BERT), released by Devlin et al. (2019) as a single language model pre-trained on the concatenation of monolingual Wikipedia corpora from 104 languages. M-BERT is particularly well suited to this probing study because it enables a very straightforward approach to zero-shot cross-lingual model transfer: we fine-tune the model using task-specific supervised training data from one language, and evaluate that task in a different language, thus allowing us to observe the ways in which the model generalizes information across languages.
Our results show that M-BERT is able to perform cross-lingual generalization surprisingly well. More importantly, we present the results of a number of probing experiments designed to test various hypotheses about how the model is able to perform this transfer. Our experiments show that while high lexical overlap between languages improves transfer, M-BERT is also able to transfer between languages written in different scripts, and thus with zero lexical overlap, indicating that it captures multilingual representations. We further show that transfer works best for typologically similar languages, suggesting that while M-BERT's multilingual representation is able to map learned structures onto new vocabularies, it does not seem to learn systematic transformations of those structures to accommodate a target language with different word order.

Models and Data
Like the original English BERT model (henceforth, EN-BERT), M-BERT is a 12 layer transformer (Devlin et al., 2019), but instead of being trained only on monolingual English data with an English-derived vocabulary, it is trained on the Wikipedia pages of 104 languages with a shared word piece vocabulary. It does not use any marker denoting the input language, and does not have any explicit mechanism to encourage translation-equivalent pairs to have similar representations.
For NER and POS, we use the same sequence tagging architecture as Devlin et al. (2019). We tokenize the input sentence, feed it to BERT, get the last layer's activations, and pass them through a final layer to make the tag predictions. The whole model is then fine-tuned to minimize the cross entropy loss for the task. When tokenization splits words into multiple pieces, we take the prediction for the first piece as the prediction for the word.
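The first-piece alignment described above can be sketched as follows. This is a minimal illustration, not the authors' code: the word pieces and tag predictions are simulated, using the WordPiece convention that continuation pieces start with "##".

```python
# Illustrative sketch of first-piece alignment for sequence tagging:
# when a word is split into several pieces, keep the prediction for
# the first piece as the prediction for the whole word.

def word_predictions(pieces, piece_tags):
    """Return one tag per word, taken from each word's first word piece.

    `pieces` follows the WordPiece convention: continuation pieces
    begin with '##', so a piece NOT starting with '##' opens a new word.
    """
    tags = []
    for piece, tag in zip(pieces, piece_tags):
        if not piece.startswith("##"):  # first piece of a new word
            tags.append(tag)
    return tags

# Example: "Johanson lives in Oslo" tokenized as Johan ##son lives in Oslo
pieces = ["Johan", "##son", "lives", "in", "Oslo"]
piece_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(word_predictions(pieces, piece_tags))  # ['B-PER', 'O', 'O', 'B-LOC']
```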

Named entity recognition experiments
We perform NER experiments on two datasets: the publicly available CoNLL-2002 and -2003 sets, containing Dutch, Spanish, English, and German (Tjong Kim Sang, 2002; Sang and Meulder, 2003); and an in-house dataset with 16 languages, using the same CoNLL categories. Table 1 shows M-BERT zero-shot performance on all language pairs in the CoNLL data.

Part of speech tagging experiments
We perform POS experiments using Universal Dependencies (UD) (Nivre et al., 2016) data for 41 languages. We use the evaluation sets from Zeman et al. (2017).

Table 2: POS accuracy on a subset of UD languages.

Figure 1: Zero-shot NER F1 score versus entity word piece overlap among 16 languages. While performance using EN-BERT depends directly on word piece overlap, M-BERT's performance is largely independent of overlap, indicating that it learns multilingual representations deeper than simple vocabulary memorization.

Vocabulary Memorization
Because M-BERT uses a single, multilingual vocabulary, one form of cross-lingual transfer occurs when word pieces present during fine-tuning also appear in the evaluation languages. In this section, we present experiments probing M-BERT's dependence on this superficial form of generalization: How much does transferability depend on lexical overlap? And is transfer possible to languages written in different scripts (no overlap)?

Effect of vocabulary overlap
If M-BERT's ability to generalize were mostly due to vocabulary memorization, we would expect zero-shot performance on NER to be highly dependent on word piece overlap, since entities are often similar across languages. To measure this effect, we compute E_train and E_eval, the sets of word pieces used in entities in the training and evaluation datasets, respectively, and define overlap as the fraction of common word pieces used in the entities: overlap = |E_train ∩ E_eval| / |E_train ∪ E_eval|. Figure 1 plots NER F1 score versus entity overlap for zero-shot transfer between every language pair in an in-house dataset of 16 languages, for both M-BERT and EN-BERT. We can see that performance using EN-BERT depends directly on word piece overlap: the ability to transfer deteriorates as word piece overlap diminishes, and F1 scores are near zero for languages written in different scripts. M-BERT's performance, on the other hand, is flat for a wide range of overlaps, and even for language pairs with almost no lexical overlap, scores vary between 40% and 70%, showing that M-BERT's pretraining on multiple languages has enabled a representational capacity deeper than simple vocabulary memorization.

To further verify that EN-BERT's inability to generalize is due to its lack of a multilingual representation, and not an inability of its English-specific word piece vocabulary to represent data in other languages, we evaluate on non-cross-lingual NER and see that it performs comparably to a previous state-of-the-art model (see Table 3).
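The overlap measure defined above can be computed directly from the two word-piece sets. The sketch below uses toy word-piece sets for illustration, not the paper's actual entity data.

```python
# Sketch of the entity word-piece overlap measure:
# overlap = |E_train ∩ E_eval| / |E_train ∪ E_eval|
# (Jaccard similarity between the two entity word-piece sets).

def entity_overlap(train_pieces, eval_pieces):
    """Fraction of entity word pieces shared by training and evaluation sets."""
    train, evaluation = set(train_pieces), set(eval_pieces)
    union = train | evaluation
    return len(train & evaluation) / len(union) if union else 0.0

# Toy example: two shared pieces ("Ber", "##lin") out of six total.
e_train = {"Ber", "##lin", "Oba", "##ma"}
e_eval = {"Ber", "##lin", "Mos", "##cow"}
print(entity_overlap(e_train, e_eval))  # 2 / 6 ≈ 0.333
```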

Generalization across scripts
M-BERT's ability to transfer between languages that are written in different scripts, and thus have effectively zero lexical overlap, is surprising given that it was trained on separate monolingual corpora and not with a multilingual objective. To probe deeper into how the model is able to perform this generalization, Table 4 shows a sample of POS results for transfer across scripts.
Among the most surprising results, an M-BERT model that has been fine-tuned using only POS-labeled Urdu (written in Arabic script) achieves 91% accuracy on Hindi (written in Devanagari script), even though it has never seen a single POS-tagged Devanagari word. This provides clear evidence of M-BERT's multilingual representation ability, mapping structures onto new vocabularies based on a shared representation induced solely from monolingual language model training data.
However, cross-script transfer is less accurate for other pairs, such as English and Japanese, indicating that M-BERT's multilingual representation is not able to generalize equally well in all cases. A possible explanation for this, as we will see in section 4.2, is typological similarity: English and Japanese have a different order of subject, verb, and object. (Individual language trends are similar to the aggregate plots.)

Encoding Linguistic Structure
In the previous section, we showed that M-BERT's ability to generalize cannot be attributed solely to vocabulary memorization, and that it must be learning a deeper multilingual representation. In this section, we present probing experiments that investigate the nature of that representation: How does typological similarity affect M-BERT's ability to generalize? Can M-BERT generalize from monolingual inputs to code-switching text? Can the model generalize to transliterated text without transliterated language model pretraining?

Effect of language similarity
Following Naseem et al. (2012), we compare languages on a subset of the WALS features (Dryer and Haspelmath, 2013) relevant to grammatical ordering. Figure 2 plots POS zero-shot accuracy against the number of common WALS features. As expected, performance improves with similarity, showing that it is easier for M-BERT to map linguistic structures when they are more similar, although it still does a decent job for low-similarity languages when compared to EN-BERT.

Table 5 shows macro-averaged POS accuracies for transfer between languages grouped according to two typological features: subject/object/verb order and adjective/noun order (Dryer and Haspelmath, 2013). The results reported include only zero-shot transfer, i.e., they do not include cases of training and testing on the same language. We can see that performance is best when transferring between languages that share word order features, suggesting that while M-BERT's multilingual representation is able to map learned structures onto new vocabularies, it does not seem to learn systematic transformations of those structures to accommodate a target language with different word order.
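Counting shared WALS features between two languages amounts to comparing feature values pairwise. The sketch below is a toy illustration with two ordering features; the feature labels and values are stated here for the example and are not drawn from the paper's actual feature subset.

```python
# Toy sketch of the language-similarity measure: the number of WALS
# ordering features on which two languages take the same value.
# Feature names/values below are illustrative placeholders.

def common_features(lang_a, lang_b):
    """Count features present in both languages with identical values."""
    return sum(1 for f in lang_a if f in lang_b and lang_a[f] == lang_b[f])

# WALS 81A: order of subject, object and verb; WALS 87A: adjective/noun order.
english = {"81A": "SVO", "87A": "Adj-Noun"}
japanese = {"81A": "SOV", "87A": "Adj-Noun"}
print(common_features(english, japanese))  # 1 (they agree only on 87A)
```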

Code switching and transliteration
Code-switching (CS), the mixing of multiple languages within a single utterance, and transliteration, writing that is not in the language's standard script, present unique test cases for M-BERT, which is pre-trained on monolingual, standard-script corpora. Generalizing to code-switching is similar to other cross-lingual transfer scenarios, but would benefit to an even larger degree from a shared multilingual representation. Likewise, generalizing to transliterated text is similar to other cross-script transfer experiments, but has the additional caveat that M-BERT was not pre-trained on text that looks like the target.
We test M-BERT on the CS Hindi/English UD corpus from Bhat et al. (2018), which provides texts in two formats: transliterated, where Hindi words are written in Latin script, and corrected, where annotators have converted them back to Devanagari script. Table 6 shows the results for models fine-tuned using a combination of monolingual Hindi and English, and using the CS training set (fine-tuning both on the script-corrected version of the corpus and on the transliterated version). For script-corrected inputs, i.e., when Hindi is written in Devanagari, M-BERT's performance when trained only on monolingual corpora is comparable to its performance when trained on code-switched data, and it is likely that some of the remaining difference is due to domain mismatch. This provides further evidence that M-BERT uses a representation that is able to incorporate information from multiple languages.

Table 6: M-BERT's POS accuracy on the code-switched Hindi/English dataset from Bhat et al. (2018), on script-corrected and original (transliterated) tokens, and comparisons to existing work on code-switched POS.
However, M-BERT is not able to effectively transfer to a transliterated target, suggesting that it is the language model pre-training on a particular language that allows transfer to that language. M-BERT is outperformed by previous work in both the monolingual-only and code-switched supervision scenarios. Neither Ball and Garrette (2018) nor Bhat et al. (2018) use contextualized word embeddings, but both incorporate explicit transliteration signals into their approaches.

Multilingual characterization of the feature space
In this section, we study the structure of M-BERT's feature space. If it is multilingual, then the transformation mapping between the same sentence in two languages should not depend on the sentence itself, but only on the language pair.

Experimental Setup
We sample 5,000 pairs of sentences from WMT16 (Bojar et al., 2016) and feed each sentence (separately) to M-BERT with no fine-tuning. We then extract the hidden feature activations at each layer, and average the representations of the input tokens, excluding [CLS] and [SEP], to get a vector for each sentence at each layer l. For each pair of translated sentences, we compute the vector pointing from the English sentence vector to its German counterpart, and average these differences over all pairs to obtain v_EN→DE. To evaluate, we translate each English sentence vector by adding v_EN→DE, find the closest German sentence vector, and measure the fraction of times the nearest neighbour is the correct pair, which we call the "nearest neighbor accuracy".
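The nearest-neighbour accuracy computation can be sketched as follows, on synthetic sentence vectors rather than actual M-BERT activations (the offset-plus-noise construction of the "German" vectors is purely for illustration).

```python
import numpy as np

# Sketch of the nearest-neighbour accuracy described above: translate each
# source sentence vector by the mean source-to-target offset, then check
# whether its nearest target vector is its actual translation.

def nearest_neighbor_accuracy(src_vecs, tgt_vecs):
    """src_vecs[i] and tgt_vecs[i] are assumed to be translations of each other."""
    # Average offset from the source space to the target space (v_EN->DE).
    v_mean = (tgt_vecs - src_vecs).mean(axis=0)
    translated = src_vecs + v_mean
    # Pairwise Euclidean distances from each translated vector to all targets.
    dists = np.linalg.norm(translated[:, None, :] - tgt_vecs[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    return (nearest == np.arange(len(src_vecs))).mean()

# Synthetic data: "German" vectors are the "English" ones shifted by a
# fixed offset plus small per-sentence noise.
rng = np.random.default_rng(0)
en = rng.normal(size=(100, 8))
de = en + rng.normal(size=8) + 0.05 * rng.normal(size=(100, 8))
print(nearest_neighbor_accuracy(en, de))
```

With noise that is small relative to the spacing between sentence vectors, the accuracy is near 1; as the noise grows, or as the mapping becomes less uniform across sentences, the accuracy degrades, which is what makes this a probe of how sentence-independent the cross-lingual transformation is.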

Results
In Figure 3, we plot the nearest neighbor accuracy for EN-DE (solid line). It achieves over 50% accuracy for all but the bottom layers, which seems to imply that the hidden representations, although separated in space, share a common subspace that represents useful linguistic information, in a language-agnostic way. Similar curves are obtained for EN-RU and UR-HI (in-house dataset), showing this works for multiple languages.
As to the reason why the accuracy goes down in the last few layers, one possible explanation is that since the model was pre-trained for language modeling, it might need more language-specific information to correctly predict the missing word.

Conclusion
In this work, we showed that M-BERT's robust, often surprising, ability to generalize cross-lingually is underpinned by a multilingual representation, without the model having been explicitly trained for one. The model handles transfer across scripts and to code-switching fairly well, but effective transfer to typologically divergent and transliterated targets will likely require the model to incorporate an explicit multilingual training objective, such as that used by Lample and Conneau (2019) or Artetxe and Schwenk (2018).
As to why M-BERT generalizes across languages, we hypothesize that word pieces shared by all languages (numbers, URLs, etc.), which must be mapped to a shared space, force the word pieces that co-occur with them to be mapped to that shared space as well, spreading the effect outward until the different languages are close in a shared space.
It is our hope that these kinds of probing experiments will help steer researchers toward the most promising lines of inquiry by encouraging them to focus on the places where current contextualized word representation approaches fall short.

A Model Parameters
All models were fine-tuned with a batch size of 32 and a maximum sequence length of 128, for 3 epochs. We used a learning rate of 3e-5 with learning rate warmup during the first 10% of steps and linear decay afterwards. We also applied 10% dropout on the last layer. No parameter tuning was performed. We used the BERT-Base, Multilingual Cased checkpoint from https://github.com/google-research/bert.

Table 7: NER results on the CoNLL test sets for EN-BERT. The row is the fine-tuning language, the column the evaluation language. There is a big gap between this model's zero-shot performance and M-BERT's, showing that the pre-training is helping in cross-lingual transfer.

Table 8: POS accuracy on the UD test sets for a subset of European languages using EN-BERT. The row specifies a fine-tuning language, the column the evaluation language. There is a big gap between this model's zero-shot performance and M-BERT's, showing the pre-training is helping learn a useful cross-lingual representation for grammar.