It’s not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT

Recent works have demonstrated that multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages. We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning. The results suggest that most of this information is encoded in a non-linear way, while some of it can also be recovered with purely linear tools. As part of our analysis, we test the hypothesis that mBERT learns representations which contain both a language-encoding component and an abstract, cross-lingual component, and explicitly identify an empirical language-identity subspace within mBERT representations.


Introduction
Multilingual-BERT (mBERT) is a version of BERT (Devlin et al., 2019), trained on the concatenation of Wikipedia in 104 different languages. Recent works show that it excels in zero-shot transfer between languages, for a variety of tasks (Pires et al., 2019; Muller et al., 2020), despite being trained with no parallel supervision.
Previous work has mainly focused on what is needed for zero-shot transfer to work well (Muller et al., 2020; Karthikeyan et al., 2020; Wu and Dredze, 2019), and on characterizing the representations of mBERT (Singh et al., 2019). However, we still lack a proper understanding of this model.
In this work we study (1) how much word-level translation information is recoverable from mBERT; and (2) how this information is stored. We focus on the representations of the last layer, and on the embedding matrix that is shared between the input and output layers, which are together responsible for token prediction.
For our first goal, we start by presenting a simple and strong method to extract word-level translation information. Our method is based on explicit querying of mBERT: given a source word and a target language, we feed mBERT with a template such as "The word 'SOURCE' in LANGUAGE is: [MASK]." where LANGUAGE is the target language, and SOURCE is an English word to translate. Getting the correct translation as the prediction of the masked token exposes mBERT's ability to provide word-level translation. This template-based method is surprisingly successful, especially considering that no parallel supervision was provided to the model during training, and that word translation is not part of the training objective.
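As a rough illustration, the querying procedure can be sketched as follows. This is our own minimal sketch, assuming the HuggingFace transformers fill-mask pipeline with the public bert-base-multilingual-cased checkpoint; the helper names (build_query, translation_candidates) are ours, not the paper's:

```python
# Sketch of the template-based query. The helper names are ours; the
# model call assumes the HuggingFace `transformers` fill-mask pipeline
# with the public `bert-base-multilingual-cased` checkpoint.
TEMPLATE = "The word '{source}' in {language} is: {mask}."

def build_query(source: str, language: str, mask_token: str = "[MASK]") -> str:
    """Instantiate the best-performing template for one (word, language) pair."""
    return TEMPLATE.format(source=source, language=language, mask=mask_token)

def translation_candidates(source: str, language: str, top_k: int = 100):
    """Top-k [MASK] predictions, used as the ranked translation candidates."""
    from transformers import pipeline  # lazy import: only needed for the model call
    fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
    return [p["token_str"] for p in fill(build_query(source, language), top_k=top_k)]
```

For example, translation_candidates("dog", "French") returns a ranked candidate list; per the results below, the correct translation frequently appears near the top.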
This raises the possibility of easy disentanglement between language identity and lexical semantics in mBERT representations. We test this hypothesis by trying to explicitly disentangle language identity from lexical semantics under linearity assumptions. We propose a method for disentangling a language-encoding component and a language-neutral component from both the embedding representations and word-in-context representations. Furthermore, we learn the empirical "language subspace" in mBERT, a linear subspace spanned by all directions that are linearly predictive of language identity. We demonstrate that the representations are well-separated by language on that subspace.
We leverage these insights and empirical results to show that it is possible to perform analogies-based translation by taking advantage of this disentanglement: we can alter the language-encoding component while keeping the lexical component intact. We compare the template-based method and the analogies-based method and discuss their similarities and differences, as well as their limitations.
The two methods together show that mBERT acquired, to a large degree, the ability to perform word-level translation, despite the fact that it is not trained on any parallel data explicitly. The results suggest that most of the information is stored in a non-linear way, but with some linearly-recoverable components.
Our contribution in this work is two-fold: (a) we present two simple methods for word-level translation using mBERT that require no training or fine-tuning of the model, and that demonstrate that mBERT stores parallel information across languages; (b) we show that mBERT representations are composed of language-encoding and language-neutral components and present a method for extracting those components. Our code is available at https://github.com/gonenhila/mbert.


Previous Work

Pires et al. (2019) begin a line of work that studies mBERT representations and capabilities. They inspect the model's zero-shot transfer abilities using different probing experiments, and propose a way to map sentence representations in different languages, with some success. Karthikeyan et al. (2020) further analyze the properties that affect zero-shot transfer by experimenting with bilingual BERTs on RTE (recognizing textual entailment) and NER. They analyze performance with respect to linguistic properties and similarities of the source and target languages, as well as some parameters of the model itself (e.g. network architecture and learning objective). In a closely related work, Wu and Dredze (2019) perform transfer learning from English to 38 languages on 5 tasks (POS tagging, parsing, NLI, NER, document classification), and report good results. Additionally, they show that language-specific information is preserved in all layers. Wang et al. (2019) learn an alignment between contextualized representations and use it for zero-shot transfer.

Beyond zero-shot transfer abilities, an additional line of work studies the representations of mBERT and the information they store. Using hierarchical clustering based on CCA similarity scores between languages, Singh et al. (2019) are able to construct a tree structure that faithfully describes relations between languages. Chi et al. (2020) learn a linear syntax subspace in mBERT, and point out syntactic regularities in the representations that transfer across languages. In the recent work of Cao et al. (2020), the authors define the notion of contextual word alignment. They design a fine-tuning loss for improving alignments and show that they are able to improve zero-shot transfer after this alignment-based fine-tuning. One main difference from our work is that they fine-tune the model according to their new definition of contextual alignment, while we analyze and use the information already stored in the model. One of the closest works to ours is that of Libovickỳ et al. (2019), who assume that mBERT's representations have a language-neutral component and a language-specific component. They remove the language-specific component by subtracting the centroid of the language from the representations, and attempt to validate the assumption by using probing tasks on the original vs. new representations. They show that the new representations are more language-neutral to some extent, but lack experiments that show a complementary component. While those works demonstrate that mBERT representations in different languages can be aligned successfully with appropriate supervision, we propose an explicit decomposition of the representations into language-encoding and language-neutral components, and also demonstrate that implicit word-level translations can easily be distilled from the model when exposed to the proper stimuli.

Word-level Translation using Pre-defined Templates

We study the extent to which it is possible to extract word-level translation directly from mBERT.

Word-level Translation: You Just Have to Ask
We present a simple and overwhelmingly successful method for word-level translation with mBERT. This method is based on the idea of explicitly querying mBERT for a translation, similar to what has been done with LMs for other tasks (Petroni et al., 2019; Talmor et al., 2019). We experimented with seven different templates and found the following to work best: "The word 'SOURCE' in LANGUAGE is: [MASK]." 1 The predictions for the [MASK] token induce a distribution over the vocabulary, and we take the most probable word as the translation.

Table 1: Word-level translation results with the template-based method and the analogies-based method (introduced in Section 5). @1-@100 stand for accuracy@k (higher is better), "rank" stands for the average rank of the correct translation, "log" stands for the log of the average rank, and "win" stands for the percentage of cases in which the tested method is strictly better than the baseline.

Evaluation
To evaluate lexical translation quality, we use NorthEuraLex 2 (Dellert et al., 2019), a lexical database providing translations of 1016 words into 107 languages. We use these parallel translations to evaluate our translation method when translating from English to other target languages. 3 We restrict our evaluation to a set of common languages from diverse language families: Russian, French, Italian, Dutch, Spanish, Hebrew, Turkish, Romanian, Korean, Arabic and Japanese. We omit cases in which the source word or the target word is tokenized by mBERT into more than a single token. 4 The words in the dataset belong to different POS, with nouns, adjectives and verbs being the most common. 5 For all our experiments with mBERT, we use the transformers library of HuggingFace (Wolf et al., 2019).

Results
We report accuracy@k in translating the English source word into different languages (for k ∈ {1, 10, 100}): for each word pair, we check whether the target word is included in the first k retrieved words. Note that we remove the source word itself from the ranking. 6 We report three additional metrics: (a) avg-rank: the average rank of the target word (its position in the ranking of predictions); (b) avg-log-rank: the average of the log of the rank, to limit the effect of cases in which the rank is extremely poor and skews the average; (c) hard-win: the percentage of cases in which the method results in a strictly better rank for the translated word compared to the baseline. We take the predictions we get for the masked token as the method's candidates for translation. As a baseline, we take the embedding representation of the source word and look for the closest words to it. Table 1 shows the results of the template-based method and the baseline. This method significantly improves over the baseline in all metrics and achieves impressive accuracy results: acc@1 of 0.449 and acc@10 of 0.703, beating the baseline in 91.6% of the cases.
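The metrics above can be sketched as follows, assuming each test pair has already been reduced to the (1-based) rank that a method assigns to the correct translation; the function names are ours:

```python
# Sketch of the evaluation metrics in Table 1, given per-pair ranks of
# the correct translation (1-based; smaller is better).
import math

def accuracy_at_k(ranks, k):
    """Fraction of pairs whose correct translation appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def avg_rank(ranks):
    """Average rank of the correct translation."""
    return sum(ranks) / len(ranks)

def avg_log_rank(ranks):
    """Average log-rank, limiting the effect of extreme outlier ranks."""
    return sum(math.log(r) for r in ranks) / len(ranks)

def hard_win(method_ranks, baseline_ranks):
    """Percent of pairs where the method ranks the translation strictly better."""
    wins = sum(m < b for m, b in zip(method_ranks, baseline_ranks))
    return 100.0 * wins / len(method_ranks)
```

For example, with ranks [1, 3, 20, 200], accuracy_at_k(ranks, 10) is 0.5 and avg_rank(ranks) is 56.0, while the single outlier dominates avg_rank far more than avg_log_rank.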
Accuracy per POS To get a finer analysis of this method, we also evaluate the translations per POS. We report results on the 3 most common POS: nouns, adjectives and verbs. 7 As one might expect, nouns are the easiest to translate (both for the baseline and for our method), followed by adjectives, then verbs. See Table 2 for full results. Note that the results for these common POS tags are lower than the average over the full dataset. We hypothesize that words belonging to closed-class POS tags, such as pronouns, are easier to translate.

Visualization of the Representation Space
To further understand the mechanism of the method, we turn to inspect the resulting representations. For each word pair, we feed mBERT with the full template and extract the last-layer representation of the masked token, right before the multiplication with the output embeddings. These representations cluster according to the target language, rather than according to semantics. The ability of these representations to encode the target language may explain how this method successfully produces the translation into the correct language.

Predicting the Language
Since the representations cluster by target language (rather than by semantics), we hypothesize that mBERT is also capable of predicting the target language given the source word and its translation.
To verify this, we take the same template as before, this time masking the name of the language instead of the target word. 8 We then compute acc@1,5,10 for all languages and report the 20 languages with the most accurate results in Table 3 (the full results can be found in Table 8 in the Appendix). The results are impressive, suggesting that mBERT indeed encodes the target language identity in this setting. The languages on which mBERT is most accurate are either widely-spoken languages (e.g. German, French) or languages with a unique script (e.g. Greek, Russian, Arabic). Indeed, we get a Spearman correlation of 0.53 between acc@1 and the amount of training data in each language. 9

We also compute a confusion matrix for the 20 most accurate languages, shown in Figure 2. In order to better identify the nuances, we use the square root of the values instead of the values themselves, and remove English, which is frequently predicted as the target language, probably since the template is in English. The confusion matrix reveals the expected behavior: mBERT confuses mainly typologically related languages, specifically those of the same language family: Germanic languages (German, Dutch, Swedish, Danish), Romance languages (French, Latin, Italian, Spanish, Portuguese), and Semitic languages (Arabic, Hebrew). In addition, we can also identify some confusion between Germanic and Romance languages (which share much of the alphabet), as well as over-prediction of languages with a lot of training data (e.g. German, French).

8 We use all languages in NorthEuraLex whose name is a single token according to mBERT tokenization; there are 47 such languages besides English.
9 We considered the number of articles per language from Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics, recorded in May 2020.

Dissecting mBERT Representations
In the previous section, we saw that mBERT contains abundant word-level translation knowledge. How is this knowledge represented? We turn to analyze both the representations of words in context and the output embeddings. Previous work has assumed that the representations are composed of a language-encoding component and a language-neutral component (Libovickỳ et al., 2019). In what follows, we explicitly try to find such a decomposition: we decompose v = v_lang + v_lex, where v_lang and v_lex are orthogonal vectors, v_lang is the part of the representation that is indicative of language identity, and v_lex maintains lexical information but is invariant to language identity. Specifically, we test the hypothesis using the following interventions:
• Measuring the degree to which removing v_lang results in language-neutral word representations.
• Measuring the degree to which removing v_lex results in word representations which are clustered by language identity (regardless of lexical semantics).
• Removing the v_lang component from word-in-context representations and from the output embeddings, to induce MLM prediction in other languages.
Splitting the representations into components is done using INLP (Ravfogel et al., 2020), an algorithm for removing information from vector representations.

mBERT Decomposition by Nullspace Projections
We formalize the decomposition objective defined earlier as finding two linear subspaces within the representation space, which contain language-independent and language-identity features. The recently proposed Iterative Null-space Projection (INLP) method (Ravfogel et al., 2020) allows removing linearly-decodable information from vector representations. Given a dataset of representations X (in our case, mBERT word-in-context representations and output embeddings) and annotations Z for the information to be removed (language identity), the method renders Z linearly unpredictable from X. It does so by iteratively training linear predictors w_1, ..., w_n of Z, calculating the projection matrix onto their nullspace P_N := P_N(w_1) · ... · P_N(w_n), and transforming X ← P_N X. By the nullspace definition, this guarantees w_i P_N X = 0 for all w_i, i.e., the features each w_i uses for language prediction are neutralized. While the nullspace N(w_1, ..., w_n) is a subspace in which Z is not linearly predictable, the complement rowspace R(w_1, ..., w_n) is a subspace of the representation space that corresponds to the property Z. In our case, this subspace is mBERT's language-identity subspace. In the following sections we utilize INLP in two complementary ways: (1) we use the nullspace projection matrix P_N to zero out the language-identity subspace, in order to render the representations invariant to language identity 10; and (2) we use the rowspace projection matrix P_R = I − P_N to project mBERT representations onto the language-identity subspace, keeping only the parts that are useful for language-identity prediction. We hypothesize that the first operation renders the representations more language-neutral, while the latter discards the components that are shared across languages.
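As a toy sketch of the projection step, the example below uses a single mean-difference direction as a stand-in for INLP's iteratively trained linear classifiers; the real method repeats this with freshly trained predictors over many iterations, and all names here are ours:

```python
# Single-iteration sketch of INLP's nullspace/rowspace projections,
# using a mean-difference direction as a stand-in for a trained
# linear language classifier.
import numpy as np

def nullspace_projection(w):
    """Projection matrix onto the nullspace of a single direction w."""
    w = w / np.linalg.norm(w)
    return np.eye(len(w)) - np.outer(w, w)

rng = np.random.default_rng(0)
d = 16
lang_dir = rng.normal(size=d)                     # toy language-identity direction
X_en = rng.normal(size=(50, d)) + 2.0 * lang_dir  # "English" representations
X_fr = rng.normal(size=(50, d)) - 2.0 * lang_dir  # "French" representations
X = np.vstack([X_en, X_fr])

w = X_en.mean(0) - X_fr.mean(0)   # a linear predictor of language identity
P_N = nullspace_projection(w)     # zeroes out the language direction
P_R = np.eye(d) - P_N             # rowspace: the language-identity subspace

X_neutral = X @ P_N               # language-neutral component
# w is no longer predictive: projected points have zero component along w
assert np.allclose(X_neutral @ w, 0.0)
```

Projecting with P_R instead keeps only the component along w, mirroring how the paper isolates the language-identity subspace.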
Setup We start by applying INLP and obtaining the two mentioned projection matrices: onto the nullspace and onto the rowspace. We repeat this process twice: first for representations in context, and second for output embeddings. For each of these two cases, we sample random tokens from 5000 sentences 11 in 15 different languages, extract their respective representations (in context, or simply output embeddings), and run INLP on those representations with the objective of identifying the language, for 20 iterations. We end up with 4 matrices: projection matrices onto the nullspace and onto the rowspace for representations in context, and the same for output embeddings.

Language-Neutral and Language-Encoding Representations
We aim to use the INLP nullspace and rowspace projection matrices as interventions designed to test the hypothesis on the existence of two independent subspaces in mBERT. Concretely, we perform two experiments: (a) a cluster analysis, using t-SNE (Maaten and Hinton, 2008) and a cluster-coherence measure, of representations from different languages projected on the nullspace and the rowspace. We expect to see decreased and increased separation by language identity, respectively; (b) a nullspace projection intervention on both the last hidden state of mBERT and on the output embeddings, after which we predict a distribution over all tokens. We expect that neutralizing the language-identity information this way will encourage mBERT to perform semantically-adequate word prediction, while decreasing its ability to choose the correct language in the context of the input sentence.

t-SNE and Clustering
To test the hypothesis on the existence of a "language-identity" subspace in mBERT, we project the representations of a random subset of words from the TED dataset, 12 from the embedding layer and the last layer, onto the subspace spanned by all language classifiers, using the INLP rowspace projection matrix. Figures 3 and 4 present the results for the embedding layer and the last layer, respectively. In both cases, we witness a significant improvement in clustering according to language identity. At the same time, some trends differ across layers: the separability is better in the last layer. Romance languages, which share much of the script and some vocabulary, are well separated in the last layer, but less so in the embedding layer. Taiwanese and mainland Chinese (zh-tw and zh-cn, respectively) are well separable in the last layer, but not in the embedding layer. These findings suggest that the way mBERT encodes language identity differs across layers: while lower layers focus on lexical dimensions (and thus cluster the two Chinese variants, and the Romance languages, together), higher layers separate them, possibly by subtler cues, such as topical differences or syntactic alternations. This aligns with Singh et al. (2019), who demonstrated that mBERT representations become more language-specific along the layers.

12 https://github.com/neulab/word-embeddings-for-nmt
To quantify the influence of the projection onto the language rowspace, we calculate V-measure (Rosenberg and Hirschberg, 2007), which assesses the degree of clustering according to language identity. Specifically, we perform K-means clustering with the number of languages as K, and then calculate V-measure to quantify the alignment between the clusters and the language identities. On the embedding layer, this measure increases from 35.5% in the original space to 61.8% on the language-identity subspace; on the last layer, it increases from 80.5% to 90.35%, both showing improved clustering by language identity.
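The cluster-coherence evaluation can be sketched as follows, assuming scikit-learn; the toy two-language data stands in for projected mBERT representations:

```python
# Sketch of the cluster-coherence evaluation: K-means with K = number
# of languages, then V-measure between clusters and language labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)
# Toy "projected representations": two well-separated language clusters.
X = np.vstack([rng.normal(0.0, 0.1, size=(40, 8)),
               rng.normal(5.0, 0.1, size=(40, 8))])
languages = np.array([0] * 40 + [1] * 40)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = v_measure_score(languages, clusters)  # 1.0 means perfect alignment
```

A higher score after projecting onto the rowspace, as in the paper, indicates that the subspace indeed concentrates language-identity information.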
When projecting the representations on the nullspace, we get the opposite trend: less separation by language identity. The full results of this complementary projection can be found in Section C in the Appendix.

Inducing Language-Neutral Token Predictions
By the disentanglement hypothesis, removing the language-encoding part of the representations should render the prediction language-agnostic. To test that, we take contextualized representations of random tokens in English sentences and look at the original masked language model (MLM) predictions over those representations. We then compare these predictions with three variations: (a) projecting the representations themselves on the nullspace of the language-identity subspace, (b) projecting the output embedding matrix on that nullspace, (c) projecting both the representations and the output embedding matrix on the nullspace. In order to inspect the differences in the resulting predictions, we train a classifier 13 that, given the embedding of a word, predicts whether it is in English or not. Then, we compute the percentage of English/non-English words in the top-k predictions for each of the variants. The results are depicted in Table 4, for k ∈ {1, 5, 10, 20, 50}. As expected, when projecting both the representations and the embeddings, most of the predictions are not in English (the results are averaged over 6000 instances).
The decrease in English predictions could be the result of noise introduced by the projection operation. To verify that the influence of the projection is focused on the language identity, and not on the lexical-semantic content of the vectors, we employ a second evaluation that focuses on the semantic coherence of the predictions. We look at the top-10 predictions in each case and compute the cosine similarity between the original word in the sentence and each prediction. We expect the average cosine similarity to drop significantly if the new predictions are mostly noise. However, if the predictions are reasonably related to the original words, we expect a similar average. Since some of the predictions are not in English, we use MUSE cross-lingual embeddings for this evaluation (Conneau et al., 2017). The results are shown in Table 5. As expected, the average cosine similarity is almost the same in all cases (the average is taken across the same 6000 instances). To get a sense of the resulting predictions, we show four examples (of different POS) in Table 6. In all cases, most words that were removed from the top-10 predictions are English words, while most new words are translations of the original word into other languages.

13 SCIKIT-LEARN implementation (Pedregosa et al., 2011) with default parameters.
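This coherence check can be sketched as follows; here emb stands in for the MUSE cross-lingual embedding lookup, and the function names are ours:

```python
# Sketch of the semantic-coherence check: average cosine similarity
# between the original word's (cross-lingual) embedding and each of
# its top-k predictions. `emb` is a word -> vector lookup standing in
# for MUSE embeddings.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_similarity(emb, original, predictions):
    """Mean cosine similarity of predictions to the original word,
    skipping predictions missing from the embedding vocabulary."""
    sims = [cosine(emb[original], emb[p]) for p in predictions if p in emb]
    return sum(sims) / len(sims)
```

If the projection merely added noise, this average would drop sharply; the paper finds it stays almost unchanged.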

Analogies-based Translation
In the previous section we established that mBERT representations are composed of a language-neutral and a language-encoding component. In this section, we present another mechanism for word-level translation with mBERT, which is based on manipulating the language-encoding component of the representation, similarly to how analogies in word embeddings work (Mikolov et al., 2013b). This new method has a clear mechanism behind it, and it serves as additional validation for our assumption about the two independent components. The idea is simple: we create a single vector representation for each language, as explained below. Then, in order to change the embedding of a word in language SOURCE to language TARGET, we simply subtract from it the vector representation of language SOURCE and add the vector representation of language TARGET. Finally, in order to get the translation of the source word into the target language, we multiply the resulting representation by the output embedding matrix to find the closest words to it in the full vocabulary. Below is a detailed explanation of the implementation.

Table 6: Examples of resulting top-10 MLM predictions before and after performing INLP on both the output embeddings and representations in context. Words in red (italic) appear only in the "before" list, while words in blue (underlined) appear only in the "after" list.

Creating language-representation vectors
We start by extracting sentences in each language. From each sentence, we choose a random token and extract its representation from the output embedding matrix. Then, for each language, we average all the obtained representations to create a single vector representing that language. For this we use the same representations extracted for training INLP, as described in Section 4.1. Note that no hyper-parameter tuning was done when calculating these language vectors. The assumption here is that when averaging this way, the lexical differences between the representations cancel out, while the language component shared by all of them persists.
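A minimal sketch of this averaging step, with E standing in for mBERT's output embedding matrix and token_ids for the ids of tokens sampled from sentences in one language (all names are ours):

```python
# Sketch of language-vector construction: average the output-embedding
# rows of tokens randomly sampled from sentences in one language.
import numpy as np

def language_vector(E: np.ndarray, token_ids: list) -> np.ndarray:
    """Average embedding of the sampled tokens. Lexical content is
    expected to cancel out, leaving the shared language component."""
    return E[token_ids].mean(axis=0)
```

One such vector is computed per language and reused for every analogy below.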

Performing Translation with Analogies
We are interested in translating words from a SOURCE language to a TARGET language. To do so, we simply take the word embedding of the SOURCE word, subtract from it the representation of the SOURCE language, and add the representation of the TARGET language. We multiply this new representation by the output embedding matrix to get a ranking over the full vocabulary, from the closest word to the farthest.
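The full analogy step can be sketched on toy embeddings that are built, by construction, as lexical-plus-language components, mirroring the decomposition assumed in the text; all names here are ours:

```python
# Sketch of analogy-based translation: shift a word's embedding by
# (target-language vector - source-language vector), then rank the
# vocabulary by similarity to the shifted vector.
import numpy as np

def translate_by_analogy(E, word_id, v_src, v_tgt):
    """Return vocabulary ids ranked by dot product with the shifted vector."""
    shifted = E[word_id] - v_src + v_tgt
    scores = E @ shifted
    return np.argsort(-scores)

# Toy vocabulary: rows 0-1 are "English", rows 2-3 are "French";
# rows 0/2 share one lexical vector, rows 1/3 another.
lex = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
lang_en = np.array([0.0, 0.0, 2.0])
lang_fr = np.array([0.0, 0.0, -2.0])
E = np.vstack([lex + lang_en, lex + lang_fr])

ranking = translate_by_analogy(E, word_id=0, v_src=lang_en, v_tgt=lang_fr)
# The French counterpart of word 0 (row 2) ranks first.
```

As in the evaluation of Section 3, the source word itself would be removed from the ranking before scoring.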

Results
In Table 1 we report the results of translation using analogies (second row). The success of this method supports the reasoning behind it: changing the language component of the representation indeed enables satisfactory results in word-level translation. While the template-based method, which is non-linear, puts a competitive lower bound on the amount of parallel information embedded in mBERT, this strictly linear method is able to recover a large portion of it.

Visualization of the Representation Space
In contrast to the template-based method, t-SNE visualization of the analogies-based translation vectors reveals low clustering by language (see Figure 8 in Section D in the Appendix).

Translation between every Language Pair
The analogies-based translation method can be easily applied to all language pairs, by subtracting the representation vector of the source language and adding that of the target language. Figure 5 presents a heatmap of the acc@10 for every language pair, with source languages on the left and target languages at the bottom. We note the high translation scores between related languages, for example, Arabic and Hebrew (both ways), and French, Spanish and Italian (all pairs).

Discussion
The template-based method we presented is non-linear and puts a high lower bound on the amount of parallel information found in mBERT, with surprisingly good results on word-level translation. The analogies-based method also achieves impressive results, but to a lesser extent than the template-based one. In addition, the resulting representations in the analogies-based method are much less structured. Together, these findings suggest that most of the parallel information is not linearly decodable from mBERT. The reasoning behind the analogies-based method is very clear: under a linearity assumption, we explicitly characterize and compute the decomposition into language-encoding and language-neutral components, and derive a word-level translation method based on this decomposition.
The mechanism behind the template-based method and the source of its success, however, are much harder to understand and interpret. While it is possible that some parallel data, in one form or another, is present in the training corpora, this is still an implicit signal: there is no explicit supervision for learning translation. The fact that MLM training is sufficient, at least to some degree, to induce learning of the algorithmic function of translation (without further supervised fine-tuning) is nontrivial. We believe that the success of this method is far from obvious, and we leave further investigation of its sources to future work.

Conclusion
We aim to shed light on a basic question regarding multilingual BERT: how much word-level translation information does it embed, and what are the ways to extract it? Answering this question can help explain the empirical findings on its impressive transfer ability across languages.
We show that the knowledge needed for word-level translation is implicitly encoded in the model, and is easy to extract with simple methods, without fine-tuning. This information is likely stored in a non-linear way. However, some parts of these representations can be recovered linearly: we identify an empirical language-identity subspace in mBERT, and show that, under linearity assumptions, the representations in different languages are easily separable in that subspace; neutralizing the language-identity subspace encourages the model to perform word predictions that are less sensitive to language identity but are nonetheless semantically meaningful. We argue that the results of those interventions support the hypothesis on the existence of identifiable language components in mBERT.

A Templates
The different templates are listed in Table 7, from best performing to least performing. We report results throughout the paper using the best template (first). Templates 5-7 fail completely, while templates 2-4 result in reasonable accuracy.

B Language Prediction

Table 8 depicts the results of language prediction from the template. We report acc@1,5,10 for all languages.

C Nullspace Projection

As expected, the nullspace does not encode language identity: when projecting the representations on the nullspace, V-measure drops to 11.5% and 11.4% in the embedding layer and in the last layer, respectively.

D Visualization of the Representation Space
We plot the t-SNE projection of the representations of the analogies-based method (after subtraction and addition of the language vectors), colored by target language. While the representations of the template-based method are clearly clustered according to the target language, the representations of this method are completely mixed; see Figure 8.