Identifying Elements Essential for BERT’s Multilinguality

It has been shown that multilingual BERT (mBERT) yields high quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any crosslingual signal during training. While recent literature has studied this phenomenon, the reasons for the multilinguality are still somewhat obscure. We aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual. To allow for fast experimentation we propose an efficient setup with small BERT models trained on a mix of synthetic and natural data. Overall, we identify four architectural and two linguistic elements that influence multilinguality. Based on our insights, we experiment with a multilingual pretraining setup that modifies the masking strategy using VecMap, i.e., unsupervised embedding alignment. Experiments on XNLI with three languages indicate that our findings transfer from our small setup to larger scale settings.


Introduction
Multilingual models, i.e., models capable of processing more than one language with comparable performance, are central to natural language processing. They are useful as fewer models need to be maintained to serve many languages, resource requirements are reduced, and low- and mid-resource languages can benefit from crosslingual transfer. Further, multilingual models are useful in machine translation, zero-shot task transfer and typological research. There is a clear need for multilingual models for the world's 7000+ languages.
The exact reason for mBERT's multilinguality is, to the best of our knowledge, still debated. K et al. (2020) provide an extensive study and conclude that a shared vocabulary is not necessary, but that the model needs to be deep and languages need to share a similar "structure". Artetxe et al. (2020) show that neither a shared vocabulary nor joint pretraining is required for BERT to be multilingual. Conneau et al. (2020b) find that BERT models across languages can be easily aligned and that a necessary requirement for achieving multilinguality is shared parameters in the top layers. This work continues this line of research. We find indications that six elements influence the multilinguality of BERT. Figure 1 summarizes our main findings.

Contributions
• Training BERT models consumes tremendous resources. We propose an experimental setup that allows for fast experimentation.
• We hypothesize that BERT is multilingual because of a limited number of parameters. By forcing the model to use its parameters efficiently, it exploits common structures by aligning representations across languages. We provide experimental evidence that the number of parameters and training duration are interlinked with multilinguality, and an indication that generalization and multilinguality might be conflicting goals.
• We show that shared special tokens, shared position embeddings and the common masking strategy of replacing masked tokens with random words contribute to multilinguality. This is in line with findings from Conneau et al. (2020b).
• We show that identical structure across languages, but an inverted word order in one language, destroys multilinguality. Similarly, shared position embeddings contribute to multilinguality. We thus hypothesize that similar word order across languages is an important ingredient for multilingual models.
• Using these insights we perform initial experiments to create a model with a higher degree of multilinguality.
• We conduct experiments on Wikipedia and evaluate on XNLI to show that our findings transfer to larger scale settings.
Our code is publicly available at https://github.com/pdufter/minimult.

Setup and Hypotheses

Setup
We aim at having a setup that allows for gaining insights quickly when investigating multilinguality.
Figure 2: Tokenization and index shifting for the example 'He ate wild honey.': the wordpieces [He, ate, wild, hon, ##e, ##y, .] with indices [195, 1291, 1750, 853, 76, 80, 8] become the Fake-English wordpieces [::He, ::ate, ::wild, ::hon, ::##e, ::##y, ::.] with indices [2243, 3339, 3798, 2901, 2124, 2128, 2056].

Our assumption is that these insights are transferable to a larger scale real world setup. We verify this assumption in §5.

Languages. K et al. (2020) propose to consider English and Fake-English, a language that is created by shifting unicode code points by a large constant. Fake-English in their case has the exact same linguistic properties as English, but is represented by different unicode code points. We follow a similar approach, but instead of shifting unicode code points we simply shift token indices after tokenization by a constant; shifted tokens are prefixed by "::" and added to the vocabulary. See Figure 2 for an example. While shifting indices and shifting unicode code points have similar effects, we chose shifting indices as we find it somewhat cleaner.

Data. For our setup, aimed at supporting fast experimentation, a small corpus with limited vocabulary is desirable. As training data we use the English Easy-to-Read version of the Parallel Bible Corpus (Mayer and Cysouw, 2014), which contains the New Testament. The corpus is structured into verses and is word-tokenized. We sentence-split verses using NLTK (Loper and Bird, 2002). The final corpus has 17k sentences, 228k words, a vocabulary size of 4449 and 71 distinct characters. The median sentence length is 12 words. By creating a Fake-English version of this corpus we obtain a shifted replica and thus a sentence-parallel corpus.
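The index-shifting construction is simple enough to state in code. The sketch below (the function name is ours) reproduces the example from Figure 2, assuming a shift equal to the size of the English vocabulary:

```python
SHIFT = 2048  # size of the English vocabulary; Fake-English indices start here

def make_fake_english(token_ids, tokens, shift=SHIFT):
    """Return the Fake-English version of a tokenized sentence:
    every index is shifted by a constant, every token is prefixed with '::'."""
    fake_ids = [i + shift for i in token_ids]
    fake_tokens = ["::" + t for t in tokens]
    return fake_ids, fake_tokens

# The first two wordpieces of Figure 2:
ids, toks = make_fake_english([195, 1291], ["He", "ate"])
# ids == [2243, 3339], toks == ["::He", "::ate"]
```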

As development data we apply the same procedure to the first 10k sentences of the Old Testament of the English King James Bible. All our evaluations are performed on development data, except for word translation and when indicated explicitly.

BERT Model
Vocabulary. We create a vocabulary of size 2048 from the Easy-to-Read Bible with the wordpiece tokenizer (Schuster and Nakajima, 2012). Using the same vocabulary size for English and Fake-English yields a final vocabulary size of 4096.
Model. We use the BERT-Base architecture (Devlin et al., 2019), modified to achieve a smaller model: we divide hidden size, intermediate size of the feed forward layer and number of attention heads by 12; thus, hidden size is 64 and intermediate size 256. While this leaves us with a single attention head, K et al. (2020) found that the number of attention heads is important neither for overall performance nor for multilinguality. We call this smaller model BERT-small.
As a consistency check for our experiments we consider random embeddings in the form of a randomly initialized but untrained BERT model, referred to as "untrained".
Training Parameters. We mostly use the original training parameters as given in Devlin et al. (2019). Learning rate and number of epochs were chosen to achieve reasonable perplexity on the training corpus (see supplementary for details). Unless indicated otherwise we use a batch size of 256, train for 100 epochs with AdamW (Loshchilov and Hutter, 2019) (learning rate 2e-3, weight decay 0.01, epsilon 1e-6), and use 50 warmup steps. We only use the masked-language-modeling objective, without next-sequence-prediction. With this setup we can train a single model in under 40 minutes on a single GPU (GeForce GTX 1080Ti). We run each experiment with five different seeds and report mean and standard deviation.

Evaluation
We evaluate two properties of our trained language models: the degree of multilinguality and, as a consistency check, the overall model fit (i.e., whether the trained language model is of reasonable quality).

Multilinguality
We evaluate the degree of multilinguality with three tasks. Representations from different layers of BERT can be considered. We use layer 0 (uncontextualized) and layer 8 (contextualized). Several papers have found layer 8 to work well for monolingual and multilingual tasks (Tenney et al., 2019; Hewitt and Manning, 2019; Sabet et al., 2020). Note that representations from layer 0 include position and segment embeddings besides the token embeddings, as well as layer normalization.
Word Alignment. Sabet et al. (2020) find that mBERT performs well on word alignment. By construction, we have a sentence-aligned corpus with English and Fake-English. The gold word alignment between two sentences is the identity alignment. We use this automatically created gold alignment for evaluation.

To extract word alignments from BERT we use the Argmax method of Sabet et al. (2020). Consider the parallel sentences s^(eng), s^(fake), with length n. We extract d-dimensional wordpiece embeddings from the l-th layer of BERT to obtain embeddings E(s^(k)) ∈ R^{n×d} for k ∈ {eng, fake}. The similarity matrix S ∈ [0, 1]^{n×n} is computed by S_ij := cosine-sim(E(s^(eng))_i, E(s^(fake))_j). Two wordpieces i and j are aligned if they are mutual argmaxes, i.e., if j = argmax_l S_il and i = argmax_l S_lj.

The alignments are evaluated using precision, recall and F1 as follows: precision = |P ∩ G| / |P|, recall = |P ∩ G| / |G|, F1 = 2 · precision · recall / (precision + recall), where P is the set of predicted alignments and G the set of true alignment edges. We report F1.

Sentence Retrieval is popular for evaluating crosslingual representations (e.g., Artetxe and Schwenk, 2019; Libovickỳ et al., 2019). We obtain the embeddings E(s^(k)) as before and compute a sentence embedding e_s^(k) simply by averaging vectors across all tokens in a sentence (ignoring CLS and SEP tokens). Computing cosine similarities between English and Fake-English sentences yields the similarity matrix R ∈ R^{m×m} where R_ij = cosine-sim(e_{s_i}^(eng), e_{s_j}^(fake)). Given an English query sentence s_i^(eng), we obtain the retrieved sentences in Fake-English by ranking them according to similarity. Since we can do the same with Fake-English as query language, we report the mean precision over both directions, computed as ρ = 1/(2m) Σ_{i=1}^m (1[argmax_l R_il = i] + 1[argmax_l R_li = i]).

We also evaluate word translation. Again, by construction we have a ground-truth bilingual dictionary of size 2048. We obtain word vectors by feeding each word in the vocabulary individually to BERT, in the form "[CLS] {token} [SEP]". We then evaluate word translation like sentence retrieval and denote the measure with τ.
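As a concrete illustration, the Argmax alignment criterion and the F1 evaluation can be sketched with NumPy (toy example; variable and function names are ours):

```python
import numpy as np

def argmax_alignments(S):
    """Argmax method: wordpieces i and j are aligned iff they are mutual
    argmaxes of the similarity matrix S, i.e. j = argmax_l S[i, l]
    and i = argmax_l S[l, j]."""
    row_best = S.argmax(axis=1)  # best Fake-English match per English piece
    col_best = S.argmax(axis=0)  # best English match per Fake-English piece
    return {(i, int(j)) for i, j in enumerate(row_best) if col_best[j] == i}

def f1_score(pred, gold):
    """Alignment F1 from predicted edges P and gold edges G."""
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 2 * precision * recall / (precision + recall)

S = np.array([[0.9, 0.1],
              [0.2, 0.8]])
aligned = argmax_alignments(S)  # {(0, 0), (1, 1)} for this toy matrix
```

On our parallel English/Fake-English corpus the gold edge set G is simply the identity alignment {(i, i)}.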
Multilinguality Score. For an easier overview we compute a multilinguality score by averaging retrieval and translation results across both layers: µ = (τ_0 + τ_8 + ρ_0 + ρ_8) / 4, where τ_k, ρ_k denote that representations from layer k have been used. We omit word alignment here as it is not a suitable measure for comparing all models: with shared position embeddings, the task is almost trivial given that the gold alignment is the identity alignment.

Model Fit
MLM Perplexity. To verify that BERT was successfully trained we evaluate the models on perplexity (with base e) for training and development data. Perplexity is computed on 15% of randomly selected tokens that are replaced by "[MASK]". Given those randomly selected tokens w_1, ..., w_n and probabilities p_{w_1}, ..., p_{w_n} that the correct token is predicted by the model, perplexity is calculated as exp(−(1/n) Σ_{k=1}^n log(p_{w_k})).
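This formula is a one-liner in practice; a minimal sketch:

```python
import math

def mlm_perplexity(probs):
    """Base-e perplexity from the probabilities the model assigns to the
    correct token at each masked position: exp(-(1/n) * sum(log p))."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

mlm_perplexity([1.0, 1.0])  # -> 1.0 (perfect predictions)
mlm_perplexity([0.5, 0.5])  # -> 2.0
```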

Architectural Properties
Here we formulate hypotheses as to which architectural components contribute to multilinguality.

Overparameterization: overparam. If BERT is severely overparameterized, the model should have enough capacity to model each language separately without creating a multilingual space. Conversely, if the number of parameters is small, the model needs to use its parameters efficiently. It is then likely to identify common structures among languages and model them together, thus creating a multilingual space.
To test this, we train a larger BERT model that has the same configuration as BERT-base (i.e., hidden size: 768, intermediate size: 3072, attention heads: 12) and is thus much larger than our standard configuration, BERT-small. Given our small training corpus and the small number of languages, we argue that BERT-base is overparameterized. For the overparameterized model we use learning rate 1e-4 (following Devlin et al., 2019).
Shared Special Tokens: shift-special. It has been found that a shared vocabulary is not essential for multilinguality (K et al., 2020; Artetxe et al., 2020; Conneau et al., 2020b). Similar to prior studies, in our setting each language has its own vocabulary, as we aim at breaking the multilinguality of BERT. However, in prior studies, special tokens ([UNK], [CLS], [SEP], [MASK], [PAD]) are usually shared across languages. Shared special tokens may contribute to multilinguality because they are very frequent and could serve as "anchor points". To investigate this, we shift the special tokens with the same shift as applied to token indices.

Figure 3: lang-pos: input indices to BERT with language-specific position and segment embeddings.

Shared Position Embeddings: lang-pos. Position and segment embeddings are usually shared across languages. We investigate their contribution to multilinguality by using language-specific position (lang-pos) and segment embeddings. For an example see Figure 3.
Random Word Replacement: no-random. The MLM task as proposed by Devlin et al. (2019) masks 15% of tokens in a sentence. These tokens are replaced with "[MASK]" in p_[mask] = 80% of cases, remain unchanged in p_[id] = 10%, and are replaced with a random token of the vocabulary in p_[rand] = 10% of cases. The randomly sampled token can come from any language, resulting in Fake-English tokens appearing in English sentences and vice versa. We hypothesize that this random replacement could contribute to multilinguality. We experiment with the setting p = (0.8, 0.2, 0.0) where p denotes the triple (p_[mask], p_[id], p_[rand]).
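A minimal sketch of this replacement rule for a single token that has already been selected for masking. The function name and the [MASK] id are illustrative (our toy vocabulary assigns its own special-token ids); no-random corresponds to p = (0.8, 0.2, 0.0):

```python
import random

def corrupt(token_id, vocab_size, p=(0.8, 0.1, 0.1), mask_id=103):
    """Apply BERT's replacement rule to one selected token: with p[0]
    substitute [MASK], with p[1] keep the token unchanged, with p[2] draw
    a random token from the joint (English + Fake-English) vocabulary."""
    r = random.random()
    if r < p[0]:
        return mask_id
    if r < p[0] + p[1]:
        return token_id
    return random.randrange(vocab_size)
```

Because the random draw ranges over the joint vocabulary, this is where tokens of one language leak into sentences of the other.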

Linguistic Properties
Inverted Word Order: inv-order. K et al. (2020) shuffled word order in sentences randomly and found that word order has some, but not a severe, effect on multilinguality. They conclude that "structural similarity" across languages is important, without further specifying this term. We investigate an extreme case: inversion. We invert each sentence in the Fake-English corpus: [w_1, w_2, ..., w_n] → [w_n, w_{n−1}, ..., w_1]. Note that, apart from the reading order, all properties of the languages are preserved, including ngram statistics. Thus, the structural similarity of English and inverted Fake-English is arguably very high.
Comparability of Corpora: no-parallel. We hypothesize that the similarity of training corpora contributes to "structural similarity": if we train on a parallel corpus we expect the language structures to be more similar than when we train on two independent corpora, potentially from different domains. For mBERT, Wikipedias across languages are in the same domain, share some articles and thus are comparable, yet not parallel. To test our hypothesis, we train on a non-parallel corpus. We create it by splitting the Bible into two halves, using one half each for English and Fake-English, thus avoiding any parallel sentences during training.

Figure 4: PCA of token embeddings. Corresponding tokens in English and Fake-English are nearest neighbors of each other or nearly so. This is quantitatively confirmed in Table 1.

Results

Table 1 shows results. Each model has an associated ID that is consistent with the code. The original model (ID 0) shows a high degree of multilinguality. As mentioned, alignment is an easy task with shared position embeddings, yielding F1 = 1.00. Retrieval works better with contextualized representations on layer 8 (.97 vs. .16) whereas word translation works better on layer 0 (.88 vs. .79), as expected. Overall the embeddings capture the similarity of English and Fake-English exceptionally well (see Figure 4 for a PCA of token embeddings). The untrained BERT models perform poorly (IDs 18, 19), except for word alignment with shared position embeddings.

Architectural Properties
When applying our architectural modifications (lang-pos, shift-special, no-random) individually, we see slight to medium decreases in multilinguality (IDs 1, 2, 4); lang-pos has the largest negative impact. Apparently, a single modification can be compensated by the model. Indeed, when using two modifications at a time (5-7), multilinguality goes down more; only 7 still shows a high degree of multilinguality. With all three modifications (8), the degree of multilinguality is drastically lowered (µ = .12 vs. .70).
We see that language model quality (see columns MLM-Perpl.) is stable on train and dev across models (IDs 1-8) and does not deviate much from original BERT (ID 0). Thus, we can conclude that each of the models has fit the training data well and that poor results on µ are not due to the architectural changes having hobbled BERT's language modeling performance.
The overparameterized model (ID 15) exhibits lower scores for word translation, but higher ones for retrieval, and overall a lower multilinguality score (.58 vs. .70). However, when we add lang-pos (16) or apply all three architectural modifications (17), multilinguality drops to .01 and .00. This indicates that by decoupling languages with the proposed modifications (lang-pos, shift-special, no-random) and greatly increasing the number of parameters (overparam), it is possible to get a well-performing language model (low perplexity) that is not multilingual. Conversely, we can conclude that the four architectural properties together are necessary for BERT to be multilingual.

Linguistic Properties
Inverting Fake-English (IDs 3, 9) breaks multilinguality almost completely, independently of any architectural modifications. Having a language with the exact same structure (same ngram statistics, vocabulary size etc.), only with inverted order, seems to block BERT from creating a multilingual space. Note that perplexity is almost the same. We conclude that a similar word order structure is necessary for BERT to create a multilingual space. The fact that shared position embeddings are important for multilinguality supports this finding. Our hypothesis is that the drop in multilinguality with inverted word order comes from an incompatibility between word and position encodings: BERT needs to learn that the word at position 0 in English is similar to the word at position n in Fake-English. However, n (the sentence length) varies from sentence to sentence. This suggests that relative position embeddings, rather than absolute position embeddings, might be beneficial for multilinguality across languages with high distortion.
To investigate this effect further, Figure 8 shows cosine similarities between position embeddings for models 1 and 9. Position IDs 0-127 are for English, 128-255 for Fake-English. Despite language-specific position embeddings, the embeddings exhibit a similar structure: in the top panel there is a clear yellow diagonal at the beginning, which weakens towards the end. The bottom panel shows that for a model with inverted Fake-English the position embeddings live in different spaces: no diagonal is visible.
In the range 90-128 (a rare sentence length) the similarities look random. This indicates that smaller position embeddings are trained more than larger ones (which occur less frequently). We suspect that embedding similarity correlates with the number of gradient updates a single position embedding receives. Positions 0, 1 and 128, 129 receive a gradient update in every step and can thus be considered an average of all gradient updates (up to random initialization). This is potentially one reason for the diagonal pattern in the top panel.

Corpus Comparability
So far we have trained on a parallel corpus. Now we show what happens with a merely comparable corpus. The first half of the training corpus is used for English and the other half for Fake-English. To mitigate the reduced amount of training data we train for twice as many epochs. Table 2 shows that multilinguality indeed decreases as the training corpus becomes non-parallel. This suggests that the more comparable a training corpus is across languages, the higher the multilinguality. Note, however, that the models fit the training data worse and do not generalize as well as the original model.

Multilinguality During Training
One central hypothesis is that BERT becomes multilingual at the point at which it is forced to use its parameters efficiently. We argue that this point depends on several factors including the number of parameters, training duration, "complexity" of the data distribution and how easily common structures across language spaces can be aligned. The latter two are difficult to control for. We provided insights that two languages with identical structure but inverted word order are harder to align. Figure 6 analyzes the former two factors and shows model fit and multilinguality for the small and large model settings over training steps.
Generally, multilinguality rises very late, at a stage where model fit improvements are flat. In fact, most of the multilinguality in the overparameterized setting (15) arises once the model starts to overfit and perplexity on the development set goes up. The original setting (0) has far fewer parameters. We hypothesize that it is forced to use its parameters efficiently and thus multilinguality scores rise much earlier, when both training and development perplexity are still going down.
Although this is a very restricted experimental setup, it indicates that obtaining multilingual models involves a trade-off between good generalization and a high degree of multilinguality. By overfitting a model one could achieve high multilinguality. Conneau et al. (2020a) introduced the concept of the "curse of multilinguality" and found that the number of parameters should be increased with the number of languages. Our results indicate that too many parameters can also harm multilinguality. However, in practice it is difficult to create a model with so many parameters that it is overparameterized when being trained on 104 Wikipedias. Rönnqvist et al. (2019) found that current multilingual BERT models may be undertrained. This is consistent with our finding that multilinguality arises late during training.

Improving Multilinguality
So far we have tried to break BERT's multilinguality. Now we turn to exploiting our insights to improve it. mBERT already has shared position embeddings and shared special tokens, and we cannot change the linguistic properties of languages. Our results on overparameterization suggest that smaller models become multilingual faster. However, mBERT may already be considered underparameterized given that it is trained on 104 large Wikipedias.
One insight we can leverage concerns the masking procedure (no-random: replacing masked words with random tokens). We propose to introduce a fourth masking option: replacing masked tokens with semantically similar words from other languages. To this end we train static fastText embeddings (Bojanowski et al., 2017) on the training set and then project them into a common space using VecMap (Artetxe et al., 2018). We use this crosslingual space to replace masked tokens with nearest neighbors from the other language. Each masked token is then handled according to the probabilities (p_[mask], p_[id], p_[rand], p_[knn]) = (0.5, 0.1, 0.1, 0.3), i.e., in 30% of cases masked words are replaced with the nearest neighbor from the multilingual static embedding space. Note that this procedure (including VecMap) is fully unsupervised (i.e., no parallel data or dictionary is required). We call this method knn-replace. Conneau et al. (2020b) performed similar experiments by creating code-switched data and adding it to the training data; however, we only replace masked words.

Figure 7 shows the multilinguality score and model fit over training time. Compared to the original model in Figure 6, retrieval and translation reach higher scores earlier. Towards the end, multilinguality scores become similar, with knn-replace outperforming the original model (see Table 1). This finding is particularly important for training BERT on large amounts of data. Given how expensive training is, it may not be possible to train a model long enough to obtain a high degree of multilinguality, and longer training incurs the risk of overfitting. Thus achieving multilinguality early in the training process is valuable. Our new masking strategy has this property.
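The nearest-neighbor lookup underlying knn-replace can be sketched as follows, assuming the two static embedding matrices have already been mapped into a common space with VecMap (names are ours; the authors' implementation may differ):

```python
import numpy as np

def knn_replace_table(emb_eng, emb_fake):
    """Precompute, for every English token, the index of its nearest
    Fake-English neighbor by cosine similarity in the shared (e.g.
    VecMap-aligned) space. emb_* are (vocab, dim) matrices with
    nonzero rows."""
    eng = emb_eng / np.linalg.norm(emb_eng, axis=1, keepdims=True)
    fake = emb_fake / np.linalg.norm(emb_fake, axis=1, keepdims=True)
    return (eng @ fake.T).argmax(axis=1)

# During masking, a token selected for the knn option (p_knn = 0.3) is
# replaced with table[token_id] from the other language.
```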

XNLI
We have presented experiments on a small corpus with English and Fake-English. Now we provide results on real data. Our setup is similar to that of K et al. (2020): we train a multilingual BERT model on English, German and Hindi. As training corpora we sample 1GB of data from Wikipedia (except for Hindi, whose Wikipedia is smaller than 1GB) and pretrain the model for 2 epochs/140k steps with batch size 256 and learning rate 1e-4. In this section, we use BERT-base, not BERT-small, because we found that BERT-small with fewer than 1M parameters performs poorly in a larger scale setup. The remaining model and training parameters are the same as before. Each language has its own vocabulary of size 20k. We then evaluate the pretrained models on XNLI (Conneau et al., 2018). We finetune the pretrained models on English XNLI (3 epochs, batch size 32, learning rate 2e-5, following Devlin et al. (2019)). Then the model is evaluated on English. In addition, we do a zero-shot evaluation on German and Hindi.

Table 3: Accuracy on XNLI test for different model settings. Shown is the mean and standard deviation (subscript) across three random seeds. All models have the same architecture as BERT-base, are pretrained on Wikipedia data and finetuned on English XNLI training data. mBERT was pretrained longer and on much more data and thus has higher performance. Best non-mBERT performance in bold. (Surviving row: mBERT, results by Hu et al. (2020): ENG .81, DEU .70, HIN .59.)

Table 3 presents accuracy on XNLI test. Compared to mBERT, accuracy is significantly lower but reasonable on English (.75 vs. .81); we pretrain on far less data. ID 0 shows high multilinguality with zero-shot accuracies of .57 and .45. Inverting the word order of German has little effect on HIN, but DEU drops significantly (the majority baseline is .33). Our architectural modifications (8) harm both HIN and DEU. The proposed knn-replace model exhibits the strongest degree of multilinguality, boosting the zero-shot accuracy on DEU/HIN by 4%/9%. Note that to accommodate noise in the real world data, we randomly replace with one of the five nearest neighbors (not the top nearest neighbor). This indicates that knn-replace is useful for real world data and that our prior findings transfer to larger scale settings.

Related Work
There is a range of prior work analyzing the reason for BERT's multilinguality. Singh et al. (2019) show that BERT stores language representations in different subspaces and investigate how subword tokenization influences multilinguality. Artetxe et al. (2020) show that neither a shared vocabulary nor joint pretraining is essential for multilinguality. K et al. (2020) extensively study reasons for multilinguality (e.g., researching depth, number of parameters and attention heads). They conclude that depth is essential. They also investigate language properties and conclude that structural similarity across languages is important, without further defining this term. Last, Conneau et al. (2020b) find that a shared vocabulary is not required. They find that shared parameters in the top layers are required for multilinguality. Further they show that different monolingual BERT models exhibit a similar structure and thus conclude that mBERT somehow aligns those isomorphic spaces. They investigate having separate embedding look-ups per language (including position embeddings and special tokens) and a variant of avoiding cross-language replacements. Their method "extra anchors" yields a higher degree of multilinguality. In contrast to this prior work, we investigate multilinguality in a clean laboratory setting, investigate the interaction of architectural aspects and research new aspects such as overparameterization or inv-order.
Other work focuses on creating better multilingual models. Mulcaire et al. (2019) proposed a method to learn multilingual contextual representations. Conneau and Lample (2019) introduce the translation modeling objective. Conneau et al. (2020a) propose XLM-R. They introduce the term "curse of multilinguality" and show that multilingual model quality degrades with an increased number of languages given a fixed number of parameters. This can be interpreted as the minimum number of parameters required whereas we find indications that models that are too large can be harmful for multilinguality as well. Cao et al. (2020) improve the multilinguality of mBERT by introducing a regularization term in the objective, similar to the creation of static multilingual embedding spaces. Huang et al. (2019) extend mBERT pretraining with three additional tasks and show an improved overall performance. More recently, better multilinguality is achieved by Pfeiffer et al. (2020) (adapters) and Chi et al. (2020) (parallel data). We propose a simple extension to make mBERT more multilingual; it does not require additional supervision, parallel data or a more complex loss function -in contrast to this prior work.
Finally, many papers find that mBERT yields competitive zero-shot performance across a range of languages and tasks such as parsing and NER (Pires et al., 2019; Wu and Dredze, 2019), word alignment and sentence retrieval (Libovickỳ et al., 2019) and language generation (Rönnqvist et al., 2019); Hu et al. (2020) show this for 40 languages and 9 tasks. Wu and Dredze (2020) consider the performance on up to 99 languages for NER. In contrast, Lauscher et al. (2020) show limitations of the zero-shot setting and Zhao et al. (2020) observe poor performance of mBERT in reference-free machine translation evaluation. Prior work here focuses on investigating the degree of multilinguality, not the reasons for it.

Conclusion
We investigated which architectural and linguistic properties are essential for BERT to yield crosslingual representations. The main takeaways are: i) Shared position embeddings, shared special tokens, replacing masked tokens with random tokens and a limited amount of parameters are necessary elements for multilinguality. ii) Word order is relevant: BERT is not multilingual when one language has an inverted word order. iii) The comparability of training corpora contributes to multilinguality. We show that our findings transfer to larger scale settings. We experimented with a simple modification to obtain stronger multilinguality in BERT models and demonstrated its effectiveness on XNLI. We considered a fully unsupervised setting without any crosslingual signals. In future work we plan to incorporate crosslingual signals, as Artetxe et al. (2020) argue that a fully unsupervised setting is hard to motivate.

A.1 Word Translation Evaluation
Word translation is evaluated in the same way as sentence retrieval. This section provides additional details.
For each token w^(k) in the vocabulary we feed the "sentence" "[CLS] {w^(k)} [SEP]" to the BERT model to obtain the embeddings E(w^(k)) ∈ R^{3×d} from the l-th layer of BERT for k ∈ {eng, fake}. Now, we extract the word embedding by taking the second vector (the one corresponding to w^(k)) and denote it by e_w^(k). Computing cosine similarities between English and Fake-English tokens yields the similarity matrix R ∈ R^{m×m} where R_ij = cosine-sim(e_{w_i}^(eng), e_{w_j}^(fake)). Given an English query token w_i^(eng), we obtain the retrieved tokens in Fake-English by ranking them according to similarity. Note that we can do the same with Fake-English as query language. We report the mean precision over both directions, computed as τ = 1/(2m) Σ_{i=1}^m (1[argmax_l R_il = i] + 1[argmax_l R_li = i]).

A.3 knn-replace
We use the training data to train static word embeddings for each language using the tool fastText. Subsequently we use VecMap (Artetxe et al., 2018) to map the embedding spaces from each language into the English embedding space, thus creating a multilingual static embedding space. We use VecMap without any supervision. During MLM pretraining of our BERT model, 15% of the tokens are randomly selected and "masked". They are then either replaced by "[MASK]" (50% of cases), left unchanged (10% of cases), replaced by a random other token (10% of cases), or replaced with one of the five nearest neighbors from another language (30% of cases); in the Fake-English setup we use only the single nearest neighbor. Among those five nearest neighbors we pick one at random. In case more than one other language is available, we pick one at random.

B.1 Model 17
One might argue that model 17 in Table 1 of the main paper is simply not trained enough and thus not multilingual. However, Table 10 shows that even when continuing to train this model for a long time, no multilinguality arises. Thus, in this configuration the model has enough capacity to model the languages independently of each other, and due to the modifications apparently no incentive to align the language representations.

B.2 Word Order in XNLI
To verify whether similar word order across languages influences multilinguality, we compute a word reordering metric and correlate it with the zero-shot transfer performance of mBERT. To this end we consider the performance of mBERT on XNLI. We follow Birch and Osborne (2011) in computing word reordering metrics between parallel sentences (XNLI is a parallel corpus). More specifically, we compute the Kendall's tau metric. To this end, we compute word alignments between two sentences using the Match algorithm by Sabet et al. (2020), which directly yields a permutation between sentences as required by the distance metric. We compute the metric on 2500 sentences from the development data of XNLI and average it across sentences to get a single score per language. The scores and XNLI accuracies are in Table 4. The Pearson correlation between the Kendall's tau metric and the XNLI classification accuracy in a zero-shot scenario (mBERT finetuned only on English and tested on all other languages) is 46% when disregarding English and 64% when including it. Thus some correlation is observable. This indicates that the zero-shot performance of mBERT might also rely on similar word order across languages. We plan to extend this experiment to more zero-shot results and examine this effect more closely in future work.
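The reordering metric on a single sentence pair can be sketched as follows (a simplified O(n²) version over the alignment-induced permutation; the exact normalization in Birch and Osborne (2011) may differ):

```python
def kendalls_tau_distance(perm):
    """Fraction of discordant position pairs in a permutation of target
    positions: 0.0 for identical word order, 1.0 for fully inverted order."""
    n = len(perm)
    discordant = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if perm[i] > perm[j]
    )
    return discordant / (n * (n - 1) / 2)

kendalls_tau_distance([0, 1, 2, 3])  # -> 0.0 (monotone alignment)
kendalls_tau_distance([3, 2, 1, 0])  # -> 1.0 (inverted word order)
```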

B.3 Larger Position Similarity Plots
We provide larger versions of our position similarity plots in Figure 8.
C Reproducibility Information

C.1 Data

Table 7 provides download links to the data.

C.2 Technical Details
The number of parameters for each model are in Table 6.
We did all computations on a server with up to 40 Intel(R) Xeon(R) CPU E5-2630 v4 CPUs and 8 GeForce GTX 1080Ti GPU with 11GB memory. No multi-GPU training was performed. Typical runtimes are reported in Table 5.
Used third party systems are shown in Table 8.

C.3 Hyperparameters
We show an overview of hyperparameters in Table 9. If not shown, we fall back to the default values of the respective systems.

Table 7 (fragment): Bible (Mayer and Cysouw, 2014), English. We use the editions Easy-to-Read and King-James-Version: all 17178 sentences of Easy-to-Read (New Testament) and the first 10000 sentences of King-James in the Old Testament.