Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT

Pretrained contextual representation models (Peters et al., 2018; Devlin et al., 2018) have pushed forward the state of the art on many NLP tasks. A new release of BERT (Devlin, 2018) includes a model simultaneously pretrained on 104 languages with impressive performance on zero-shot cross-lingual transfer for a natural language inference task. This paper explores the broader cross-lingual potential of multilingual BERT (mBERT) as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing. We compare mBERT with the best published methods for zero-shot cross-lingual transfer and find mBERT competitive on each task. Additionally, we investigate the most effective strategy for utilizing mBERT in this manner, determine to what extent mBERT generalizes away from language-specific features, and measure factors that influence cross-lingual transfer.

Code is available at https://github.com/shijie-wu/crosslingual-nlp

At the same time, cross-lingual embedding models have reduced the amount of cross-lingual supervision required to produce reasonable models; Conneau et al. (2017) use identical strings between languages as a pseudo-bilingual dictionary to learn a mapping between monolingually trained embeddings. Can jointly training contextual embedding models over multiple languages without explicit mappings produce an effective cross-lingual representation? Surprisingly, the answer is (partially) yes. BERT, a recently introduced pretrained model (Devlin et al., 2019), offers a multilingual version (mBERT) pretrained on concatenated Wikipedia data for 104 languages without any cross-lingual alignment (Devlin, 2018). mBERT does surprisingly well compared to cross-lingual word embeddings on zero-shot cross-lingual transfer in XNLI (Conneau et al., 2018), a natural language inference dataset. Zero-shot cross-lingual transfer, also known as single-source transfer, refers to training and selecting a model in a source language, often a high-resource language, then transferring it directly to a target language.
While the XNLI results are promising, the question remains: does mBERT learn a cross-lingual space that supports zero-shot transfer? We evaluate mBERT as a zero-shot cross-lingual transfer model on five different NLP tasks: natural language inference, document classification, named entity recognition, part-of-speech tagging, and dependency parsing. We show that it achieves competitive or even state-of-the-art performance with the recommended fine-tune-all-parameters scheme (Devlin et al., 2019). Additionally, we explore different fine-tuning and feature-extraction schemes and demonstrate that, with parameter freezing, we can outperform the suggested fine-tune-all approach. Furthermore, we explore the extent to which mBERT generalizes away from a specific language by measuring language identification accuracy using each layer of mBERT. Finally, we show how subword tokenization influences transfer by measuring subword overlap between languages.
Background

(Zero-shot) Cross-lingual Transfer. Cross-lingual transfer learning is a type of transductive transfer learning with different source and target domains (Pan and Yang, 2010). It assumes a cross-lingual representation space in which a model trained on the source language can be applied directly to the target language. Before the widespread use of cross-lingual word embeddings, task-specific models relied on coarse-grained representations like part-of-speech tags, e.g. in support of a delexicalized parser (Zeman and Resnik, 2008). More recently, cross-lingual word embeddings have been used in conjunction with task-specific neural architectures for tasks like named entity recognition (Xie et al., 2018), part-of-speech tagging (Kim et al., 2017), and dependency parsing (Ahmad et al., 2019).
Cross-lingual Word Embeddings. The quality of the cross-lingual space is essential for zero-shot cross-lingual transfer. Ruder et al. (2017) survey methods for learning cross-lingual word embeddings by either joint training or post-training mappings of monolingual embeddings. Conneau et al. (2017) first show that two monolingual embedding spaces can be aligned by learning an orthogonal mapping with only identical strings as an initial heuristic bilingual dictionary.
Contextual Word Embeddings. ELMo (Peters et al., 2018), a deep LSTM (Hochreiter and Schmidhuber, 1997) pretrained with a language modeling objective, learns contextual word embeddings. These contextualized representations outperform standalone word embeddings, e.g. word2vec (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014), given the same task-specific architecture on various downstream tasks. Instead of taking representations from a pretrained model as features, GPT (Radford et al., 2018) and Howard and Ruder (2018) fine-tune all the parameters of the pretrained model for a specific task. GPT additionally uses a transformer encoder (Vaswani et al., 2017) instead of an LSTM and jointly fine-tunes with the language modeling objective, while Howard and Ruder (2018) propose a fine-tuning strategy that uses a different learning rate for each layer, with learning rate warmup and gradual unfreezing.
Concurrent work by Lample and Conneau (2019) incorporates bitext into BERT by training on pairs of parallel sentences. Schuster et al. (2019) align pretrained ELMo models of different languages by learning an orthogonal mapping and show strong zero-shot and few-shot cross-lingual transfer performance on dependency parsing with 5 Indo-European languages. Similar to multilingual BERT, Mulcaire et al. (2019) train a single ELMo on distantly related languages and show mixed results as to the benefit of pretraining. Parallel to our work, Pires et al. (2019) show that mBERT has good zero-shot cross-lingual transfer performance on NER and POS tagging; they show how subword overlap and word ordering affect mBERT's transfer performance, and that mBERT can find translation pairs and handles code-switched POS tagging. In comparison, our work looks at a larger set of NLP tasks, including dependency parsing, and grounds mBERT's performance against the previous state of the art in zero-shot cross-lingual transfer. We also probe mBERT in different ways to show a more complete picture of its cross-lingual effectiveness.

Multilingual BERT
BERT (Devlin et al., 2019) is a deep contextual representation based on a series of transformers trained with a self-supervised objective. One of the main differences between BERT and related work like ELMo and GPT is that BERT is trained with the Cloze task (Taylor, 1953), also referred to as masked language modeling, instead of right-to-left or left-to-right language modeling. This allows the model to freely encode information from both directions in each layer. Additionally, BERT optimizes a next sentence classification objective: at training time, 50% of the paired sentences are consecutive sentences while the rest are paired randomly. Instead of operating on words, BERT uses a subword vocabulary built with WordPiece (Wu et al., 2016), a data-driven approach to breaking words into subwords.
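To make the masking procedure concrete, the sketch below applies BERT's published corruption rates (15% of positions selected; of those, 80% become [MASK], 10% a random token, 10% left unchanged). This is a minimal illustration under those stated rates, not the original implementation; `mask_id`, `vocab_size`, and the handling of special tokens are assumptions.

```python
# Minimal sketch of BERT's masked language modeling (Cloze) corruption.
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=frozenset()):
    """Return (corrupted inputs, labels); labels are None where no loss applies."""
    inputs, labels = [], []
    for tok in token_ids:
        if tok not in special_ids and random.random() < 0.15:
            labels.append(tok)                # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs.append(mask_id)        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(vocab_size))  # 10%: random token
            else:
                inputs.append(tok)            # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)               # no prediction at this position
    return inputs, labels
```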
Fine-tuning BERT BERT shows strong performance by fine-tuning the transformer encoder followed by a softmax classification layer on various sentence classification tasks. A sequence of shared softmax classifications produces sequence tagging models for tasks like NER. Fine-tuning usually takes 3 to 4 epochs with a relatively small learning rate, for example, 3e-5.
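As a rough sketch of this fine-tuning setup, the module below puts a linear softmax classification layer on the final-layer [CLS] representation of mBERT. It assumes a recent version of the Hugging Face transformers library (the model name `bert-base-multilingual-cased` is the released checkpoint); the training loop and hyperparameters are omitted, and the softmax is folded into the loss as usual.

```python
# Sketch: mBERT encoder + linear classification head over [CLS].
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # final-layer [CLS] representation
        return self.classifier(cls)        # logits; softmax applied in the loss
```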
Multilingual BERT mBERT (Devlin, 2018) follows the same model architecture and training procedure as BERT, except that it is trained on Wikipedia data in 104 languages. Training makes no use of explicit cross-lingual signal, e.g. pairs of words, sentences, or documents linked across languages. In mBERT, the WordPiece modeling strategy allows the model to share embeddings across languages. For example, "DNA" has a similar meaning even in distantly related languages like English and Chinese. To account for the varying sizes of Wikipedia in different languages, training uses a heuristic to subsample or oversample each language's data, both when running WordPiece and when sampling training instances (training batches, random words for the cloze task, and random sentences for next sentence classification).
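The sketch below shows one plausible form of such a heuristic: exponential smoothing of the per-language data distribution, as described in the mBERT release notes (each language's probability is raised to a power s < 1 and renormalized, boosting low-resource languages; the reported exponent is 0.7). The corpus sizes here are made up for illustration.

```python
# Sketch: exponentially smoothed language sampling probabilities.
def smoothed_sampling_probs(sizes, s=0.7):
    total = sum(sizes.values())
    probs = {lang: n / total for lang, n in sizes.items()}      # raw frequencies
    smoothed = {lang: p ** s for lang, p in probs.items()}      # dampen with s < 1
    z = sum(smoothed.values())
    return {lang: p / z for lang, p in smoothed.items()}        # renormalize

# English Wikipedia dwarfs Icelandic, but smoothing shrinks the gap:
print(smoothed_sampling_probs({"en": 2_500_000, "is": 50_000}))
```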
Transformer For completeness, we describe the Transformer used by BERT. Let x, y be a sequence of subwords from a sentence pair. A special token [CLS] is prepended to x and [SEP] is appended to both x and y. The embedding of position $i$ is obtained by

$$h_i^0 = \mathrm{LN}\big(E_{\mathrm{tok}}(x_i) + E_{\mathrm{pos}}(i) + E_{\mathrm{seg}}(s_i)\big)$$

where $E$ denotes the embedding functions (token, position, and segment) and LN is layer normalization (Ba et al., 2016). $M$ transformer blocks follow the embeddings. Each transformer block computes

$$\hat{h}_i^m = \mathrm{LN}\big(h_i^{m-1} + \mathrm{MHSA}(h_i^{m-1}; h^{m-1})\big)$$
$$h_i^m = \mathrm{LN}\big(\hat{h}_i^m + W_2\,\mathrm{GELU}(W_1 \hat{h}_i^m + b_1) + b_2\big)$$

where GELU is an element-wise activation function (Hendrycks and Gimpel, 2016) and MHSA is the multi-head self-attention function, the concatenation of $N$ attention heads followed by a linear projection. Each attention head computes scaled dot-product attention over the sequence:

$$\mathrm{head}_j(h_i; h) = \mathrm{softmax}\!\left(\frac{(h_i W_j^Q)(h W_j^K)^\top}{\sqrt{d_k}}\right) h W_j^V$$

where $W_j^Q, W_j^K, W_j^V \in \mathbb{R}^{d_h \times d_k}$ are per-head projection matrices and $d_k = d_h / N$.
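For concreteness, a single attention head (scaled dot-product attention, Vaswani et al., 2017) can be written in a few lines. This is a generic sketch, not mBERT's actual implementation; the projection matrices are assumed inputs.

```python
# Sketch: one attention head (scaled dot-product attention).
import torch
import torch.nn.functional as F

def attention_head(h, w_q, w_k, w_v):
    """h: (seq_len, d_h); w_q/w_k/w_v: (d_h, d_k) projection matrices."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)
    return F.softmax(scores, dim=-1) @ v           # weighted sum of values
```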

Tasks
Does mBERT learn a cross-lingual representation, or does it produce a representation for each language in its own embedding space? We consider five tasks in the zero-shot transfer setting. We assume labeled training data for each task in English, and transfer the trained model to a target language. We select a range of different tasks: document classification, natural language inference, named entity recognition, part-of-speech tagging, and dependency parsing. We cover zero-shot transfer from English to 38 languages in the 5 different tasks as shown in Tab. 1. In this section, we describe the tasks as well as task-specific layers.

Document Classification
We use MLDoc (Schwenk and Li, 2018), a balanced subset of the Reuters corpus covering 8 languages, for document classification. The 4-way topic classification task decides between CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). We only use the first two sentences of a document for classification due to memory constraints. The sentence pair is provided to the mBERT encoder. The task-specific classification layer is a linear function mapping $h_0^{12} \in \mathbb{R}^{d_h}$, the final-layer representation of [CLS], into $\mathbb{R}^4$; a softmax produces the class distribution. We evaluate by classification accuracy.

Natural Language Inference
We use XNLI (Conneau et al., 2018), which covers 15 languages, for natural language inference. The 3-way classification task includes entailment, neutral, and contradiction given a pair of sentences. We feed the sentence pair directly into mBERT; the task-specific classification layer is the same as in §4.1. We evaluate by classification accuracy.

Table 1: Language coverage of the five tasks (MLDoc, NLI, NER, POS tagging, parsing) over ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, he, hi, hr, hu, id, it, ja, ko, la, lv, nl, no, pl, pt, ro, ru, sk, sl, sv, sw, th, tr, uk, ur, vi, and zh.

Named Entity Recognition
We use the CoNLL 2002 and 2003 NER shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) (4 languages) and a Chinese NER dataset (Levow, 2006). The labeling scheme is BIO with 4 types of named entities. We add a linear classification layer with softmax to obtain word-level predictions. Since mBERT operates at the subword level while the labeling is word-level, if a word is broken into multiple subwords we mask the predictions of non-first subwords. NER is evaluated by entity-level F1. Note we use a simple post-processing heuristic to ensure predicted spans are valid.
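The following sketch illustrates this alignment: the label goes on the first subword of each word and all remaining subwords are masked out of the loss. The ignore index -100 is a common convention rather than a detail from the paper, and the `tokenize` function standing in for WordPiece is an assumed interface.

```python
# Sketch: align word-level labels to subwords, masking non-first subwords.
def align_labels(words, labels, tokenize, ignore_index=-100):
    subword_tokens, subword_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)            # e.g. "playing" -> ["play", "##ing"]
        subword_tokens.extend(pieces)
        subword_labels.append(label)       # label on the first subword only
        subword_labels.extend([ignore_index] * (len(pieces) - 1))
    return subword_tokens, subword_labels
```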

Part-of-Speech Tagging
We use a subset of the Universal Dependencies (UD) treebanks (v1.4) (Nivre et al., 2016), covering 15 languages, following the setup of Kim et al. (2017). The task-specific labeling layer is the same as in §4.3. POS tagging is evaluated by the accuracy of predicted POS tags (ACC).

Dependency Parsing
Following the setup of Ahmad et al. (2019), we use a subset of the Universal Dependencies (UD) treebanks (v2.2) (Nivre et al., 2018), which includes 31 languages. Dependency parsing is evaluated by unlabeled attachment score (UAS) and labeled attachment score (LAS). Following Ahmad et al., we only predict coarse-grained dependency labels. As the task-specific layer we use the graph-based parser of Dozat and Manning (2016), with its LSTM encoder replaced by mBERT. Similar to §4.3, we only take the representation of the first subword of each word, using masking to prevent the parser from operating on non-first subwords.

Experiments
We use the base cased multilingual BERT model, which has N = 12 attention heads and M = 12 transformer blocks. The dropout probability is 0.1 and $d_h$ is 768. The model has 179M parameters and a vocabulary of about 120k WordPieces.
Training For each task, no preprocessing is performed except tokenization of words into subwords with WordPiece. We use Adam (Kingma and Ba, 2014) for fine-tuning with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and L2 weight decay of 0.01. We warm up the learning rate over the first 10% of batches and then linearly decay it.
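A minimal sketch of this schedule, usable as a multiplier on the base learning rate (e.g. via `torch.optim.lr_scheduler.LambdaLR`); the 10% warmup fraction follows the text, everything else is generic.

```python
# Sketch: linear warmup over the first 10% of steps, then linear decay to 0.
def warmup_linear(step, total_steps, warmup_frac=0.1):
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return step / max(1, warmup_steps)                  # ramp up
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```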
Maximum Subword Sequence Length At training time, we limit the length of the subword sequence to 128 to fit in a single GPU for all tasks. For NER and POS tagging, we additionally use a sliding-window approach: after the first window, we keep the last 64 subwords of the previous window as context, so each non-first window adds (up to) 64 new subwords for prediction. At evaluation time, we follow the same approach as at training time, except for parsing, where we instead threshold sentence length at 140 words (including punctuation), following Ahmad et al. (2019). In practice, the maximum subword sequence length is the number of subwords of the first 140 words or 512, whichever is smaller.
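A sketch of this sliding-window scheme under the sizes stated above (128-subword windows, 64 carried-over context subwords); the exact boundary handling in the actual code may differ.

```python
# Sketch: split a long subword sequence into overlapping prediction windows.
def sliding_windows(subwords, window=128, context=64):
    windows = [subwords[:window]]          # first window: up to 128 subwords
    start = window
    while start < len(subwords):
        ctx = subwords[start - context:start]       # last 64 subwords as context
        new = subwords[start:start + window - context]  # up to 64 new subwords
        windows.append(ctx + new)          # predictions taken only for `new`
        start += len(new)
    return windows
```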

Hyperparameter Search and Model Selection
We select the best hyperparameters by searching over combinations of batch size, learning rate, and the number of fine-tuning epochs. Note the best hyperparameters and model are selected by development performance in English.

Question #1: Is mBERT Multilingual?
MLDoc We include two strong baselines. Schwenk and Li (2018) use MultiCCA, multilingual word embeddings trained with a bilingual dictionary (Ammar et al., 2016), and convolutional neural networks. Concurrent to our work, Artetxe and Schwenk (2018) use bitext between English/Spanish and the rest of the languages to pretrain a multilingual sentence representation with a sequence-to-sequence model whose decoder only has access to a max-pooling of the encoder hidden states. mBERT outperforms multilingual word embeddings and performs comparably with the multilingual sentence representation (Tab. 2), even though mBERT does not have access to bitext. Interestingly, mBERT outperforms Artetxe and Schwenk (2018) on distantly related languages like Chinese and Russian and under-performs on closely related Indo-European languages.

Table 2: MLDoc experiments. ♠ denotes the model is pretrained with bitext, and † denotes concurrent work. Bold and underline denote best and second best.
XNLI We include three strong baselines; Artetxe and Schwenk (2018) and Lample and Conneau (2019) are concurrent to our work. Lample and Conneau (2019) with MLM is similar to mBERT; the main differences are that it trains on only the 15 languages of XNLI and has 249M parameters (around 40% more than mBERT), and the MLM+TLM variant also uses bitext as training data (they additionally use language embeddings as input and exclude the next sentence classification objective). Conneau et al. (2018) use supervised multilingual word embeddings with an LSTM encoder and max-pooling. After an English encoder and classifier are trained, the target encoder is trained to mimic the English encoder with a ranking loss and bitext.
In Tab. 3, mBERT outperforms one model with bitext training but (as expected) falls short of models with more cross-lingual training information. Interestingly, mBERT and MLM are mostly the same except for the training languages, yet we observe that mBERT under-performs MLM by a large margin. We hypothesize that limiting pretraining to only those languages needed for the downstream task is beneficial. The gap between Artetxe and Schwenk (2018) and mBERT is larger in XNLI than in MLDoc, likely because XNLI is harder.

NER We use Xie et al. (2018) as a zero-shot cross-lingual transfer baseline; it is state-of-the-art on CoNLL 2002 and 2003. It uses unsupervised bilingual word embeddings (Conneau et al., 2017) with a hybrid of a character-level/word-level LSTM, self-attention, and a CRF. Pseudo training data is built by word-to-word translation with a dictionary induced from bilingual word embeddings.
mBERT outperforms this strong baseline by an average of 6.9 points absolute F1, including an 11.8 point absolute improvement in German, with a simple one-layer 0th-order CRF as the prediction function (Tab. 4). A large gap remains when transferring to distantly related languages (e.g. Chinese) compared to a supervised baseline, so further effort should focus on transfer between distantly related languages. In §5.4 we show that sharing subwords across languages helps transfer.

Parsing We compare against the best-performing model of Ahmad et al. (2019): a Transformer encoder with a graph-based parser (Dozat and Manning, 2016) and dictionary-supervised cross-lingual embeddings (Smith et al., 2017). Dependency parsers, including that of Ahmad et al., assume access to gold POS tags, themselves a cross-lingual representation. We consider two versions of mBERT: with and without gold POS tags. When tags are available, a tag embedding is concatenated with the final output of mBERT.
Tab. 6 shows that mBERT outperforms the baseline.

Summary Across all five tasks, mBERT demonstrates strong (sometimes state-of-the-art) zero-shot cross-lingual performance without any cross-lingual signal. It outperforms cross-lingual embeddings in four tasks. With a small amount of target-language supervision and cross-lingual signal, mBERT may improve further; we leave this as future work. In short, mBERT is a surprisingly effective cross-lingual model for many NLP tasks.

Question #2: Does mBERT vary layer-wise?
The goal of a deep neural network is to abstract to higher-order representations at higher layers (Yosinski et al., 2014). Peters et al. (2018) show empirically that, for ELMo in English, lower layers are better at syntax while upper layers are better at semantics. However, it is unclear how different layers affect the quality of the cross-lingual representation. For mBERT, we hypothesize a similar specialization across its 13 layers, as well as an abstraction away from any specific language in the higher layers. Does zero-shot transfer performance vary across layers?
We consider two schemes. First, we follow the feature-based approach of ELMo: a learned weighted combination of all 13 layers of mBERT feeds a two-layer bidirectional LSTM with hidden size $d_h$ (Feat). Note the LSTM is trained from scratch and mBERT is fixed. For sentence and document classification, an additional max-pooling is used to extract a fixed-dimension vector. We train the feature-based approach with Adam, learning rate 1e-3, and batch size 32; the learning rate is halved whenever the development evaluation does not improve, and training stops early when the learning rate drops below 1e-5. Second, when fine-tuning mBERT, we fix the bottom n layers (layer n included), where layer 0 is the input embedding. We consider n ∈ {0, 3, 6, 9}.
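A sketch of the freezing scheme, assuming the Hugging Face `BertModel` layout (`embeddings` plus `encoder.layer[0..11]`). In our numbering, layer 0 is the input embedding, so even n = 0 freezes the embeddings; blocks 1..n map to `encoder.layer[0..n-1]`.

```python
# Sketch: freeze the bottom n layers of mBERT before fine-tuning.
def freeze_bottom(model, n):
    for p in model.embeddings.parameters():   # layer 0: input embedding
        p.requires_grad = False
    for block in model.encoder.layer[:n]:     # transformer blocks 1..n
        for p in block.parameters():
            p.requires_grad = False
```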
Freezing the bottom layers of mBERT generally improves its performance on all five tasks (Fig. 1). For sentence-level tasks like document classification and natural language inference, we observe the largest improvement with n = 6. For word-level tasks like NER, POS tagging, and parsing, we observe the largest improvement with n = 3. Improvements are larger for languages that otherwise under-perform.
In each task, the feature-based approach with an LSTM under-performs the fine-tuning approach. We hypothesize that initialization from pretraining on many languages provides a very good starting point that is hard to beat. Additionally, the LSTM itself may be part of the problem: in Ahmad et al. (2019), for dependency parsing, an LSTM encoder was worse than a Transformer when transferring to languages with a large word-order distance to English.

Table 6: Dependency parsing results by language (UAS/LAS). * denotes delexicalized parsing in the baseline. S and Z denote supervised learning and zero-shot transfer. Bold and underline denote best and second best. We order the languages by word-order distance to English.

Question #3: Does mBERT retain language-specific information?
mBERT may learn a cross-lingual representation by abstracting away from language-specific information, thus losing the ability to distinguish between languages. We test this with language identification: does mBERT retain language-specific information? We use WiLI-2018 (Thoma, 2018), which includes over 200 languages from Wikipedia; we keep only those languages included in mBERT, leaving 99 languages. We take a bag-of-words representation from various layers of mBERT over the first two sentences of each test paragraph and add a linear classifier with softmax. We fix mBERT and train only the classifier, in the same way as the feature-based approach in §5.2. All tested layers achieve around 96% accuracy (Fig. 2), with no clear difference between layers. This suggests that each layer contains language-specific information, which is surprising given mBERT's zero-shot cross-lingual abilities: even as mBERT generalizes to cross-lingual representations, it maintains language-specific details. This may be encouraged during pretraining, since mBERT needs to retain enough language-specific information to perform the cloze task in each language.
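A sketch of this probing setup: a frozen mBERT layer is pooled over tokens and only a linear classifier is trained on top. Mean pooling is our reading of "bag-of-words", and the Hugging Face `output_hidden_states` mechanism for extracting intermediate layers is an assumed implementation detail.

```python
# Sketch: language ID probe on a fixed mBERT layer.
import torch
import torch.nn as nn

class LangIDProbe(nn.Module):
    def __init__(self, encoder, layer, num_langs):
        super().__init__()
        self.encoder, self.layer = encoder, layer
        self.classifier = nn.Linear(encoder.config.hidden_size, num_langs)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # mBERT stays fixed; only the classifier trains
            out = self.encoder(input_ids, attention_mask=attention_mask,
                               output_hidden_states=True)
        h = out.hidden_states[self.layer]          # (batch, seq, d_h)
        mask = attention_mask.unsqueeze(-1)
        pooled = (h * mask).sum(1) / mask.sum(1)   # masked mean pooling
        return self.classifier(pooled)
```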
As discussed in §3, mBERT shares subwords across closely related languages, and sometimes even across distantly related ones. At training time, the representation of a shared subword is explicitly trained to contain enough information for the cloze task in all languages in which it appears. During fine-tuning for zero-shot cross-lingual transfer, if a subword in the target-language test set also appears in the source-language training data, supervision can leak to the target language explicitly. However, all subwords interact in a non-interpretable way inside a deep network, and subword representations could also overfit to the source language and hurt transfer. In these experiments, we investigate how sharing subwords across languages affects cross-lingual transfer.
Figure 2: Language identification accuracy for different layers of mBERT. Layer 0 is the embedding layer and layer i > 0 is the output of the i-th transformer block.

To quantify how many subwords are shared across languages in a task, we let $V^{en}_{train}$ be the set of all subwords in the English training set, $V^{\ell}_{test}$ the set of all subwords in the test set of language $\ell$, and $c^{\ell}_w$ the count of subword $w$ in the test set of language $\ell$. We then calculate the percentage of observed subwords at the type level, $p^{\ell}_{type}$, and at the token level, $p^{\ell}_{token}$, for each target language $\ell$:
$$p^{\ell}_{type} = \frac{|V^{\ell}_{obs}|}{|V^{\ell}_{test}|} \times 100\%, \qquad p^{\ell}_{token} = \frac{\sum_{w \in V^{\ell}_{obs}} c^{\ell}_w}{\sum_{w \in V^{\ell}_{test}} c^{\ell}_w} \times 100\%$$

where $V^{\ell}_{obs} = V^{en}_{train} \cap V^{\ell}_{test}$. In Fig. 3, we show the relation between the cross-lingual zero-shot transfer performance of mBERT and $p^{\ell}_{type}$ or $p^{\ell}_{token}$ for all five tasks, with Pearson correlations. In four out of five tasks (all but XNLI) we observe a strong positive correlation (p < 0.05) with a correlation coefficient larger than 0.5. For Indo-European languages, we observe that $p^{\ell}_{token}$ is usually around 50% to 75% while $p^{\ell}_{type}$ is usually less than 50%, indicating that the subwords shared across languages are usually high-frequency. We hypothesize that subword overlap could serve as a simple indicator for selecting the source language in cross-lingual transfer with mBERT; we leave this for future work.
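These statistics are straightforward to compute; a minimal sketch, with `Counter` standing in for the test-set subword counts $c^{\ell}_w$ and plain token lists as inputs:

```python
# Sketch: type-level and token-level subword overlap between an English
# training set and a target-language test set.
from collections import Counter

def subword_overlap(en_train_subwords, tgt_test_subwords):
    v_train = set(en_train_subwords)            # V^en_train
    c_test = Counter(tgt_test_subwords)         # c^l_w over V^l_test
    v_obs = v_train & set(c_test)               # V^l_obs = intersection
    p_type = 100 * len(v_obs) / len(c_test)
    p_token = 100 * sum(c_test[w] for w in v_obs) / sum(c_test.values())
    return p_type, p_token
```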

Discussion
We show that mBERT does well in a cross-lingual zero-shot transfer setting on five different tasks covering a large number of languages, outperforming cross-lingual embeddings that typically have more cross-lingual supervision. By fixing the bottom layers of mBERT during fine-tuning, we observe further performance gains. Language-specific information is preserved in all layers. Sharing subwords helps cross-lingual transfer: a strong correlation is observed between the percentage of overlapping subwords and transfer performance. In short, mBERT effectively learns a good multilingual representation with strong cross-lingual zero-shot transfer performance on various tasks. We recommend building future multilingual NLP models on top of mBERT or models pretrained similarly; even without explicit cross-lingual supervision, these models do very well. As we show with XNLI in §5.1, while bitext is hard to obtain in low-resource settings, a variant of mBERT pretrained with bitext (Lample and Conneau, 2019) shows even stronger performance. Future work could investigate how to use weak supervision to produce a better cross-lingual mBERT, or adapt an already-trained model for cross-lingual use. With POS tagging in §5.1, we show that mBERT generally under-performs models trained with a small amount of target-language supervision, while Devlin et al. (2019) show that for English NLP tasks, fine-tuning BERT needs only a small amount of data. Future work could investigate when cross-lingual transfer is helpful for NLP tasks in low-resource languages. Given such strong cross-lingual NLP performance, it would also be interesting to probe mBERT from a linguistic perspective.