Extending Multilingual BERT to Low-Resource Languages

Multilingual BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning. However, this success has focused only on the top 104 Wikipedia languages on which it was trained. In this paper, we propose a simple but effective approach to extend M-BERT (E-MBERT) so that it can benefit any new language, and show that our approach also benefits languages that are already in M-BERT. We perform an extensive set of experiments with Named Entity Recognition (NER) on 27 languages, only 16 of which are in M-BERT, and show an average increase of about 6% F1 on languages that are already in M-BERT and a 23% F1 increase on new languages.


Introduction
Recent works (Wu and Dredze, 2019; Karthikeyan et al., 2020) have shown the zero-shot cross-lingual ability of M-BERT (Devlin et al., 2018) on various semantic and syntactic tasks: just fine-tuning on English data allows the model to perform well on other languages. Cross-lingual learning is imperative for low-resource languages (LRLs), such as Somali and Uyghur, since obtaining supervised training data in these languages is particularly hard. However, M-BERT is not pre-trained on these languages, which limits its performance on them. Languages like Oromo, Hausa, Amharic, and Akan are each spoken by more than 20 million people, yet M-BERT does not cover them. Indeed, there are about 4,000 written human languages (https://www.ethnologue.com/enterprise-faq/how-many-languages-world-are-unwritten-0), of which M-BERT covers only the top 104 (less than 3%). One way to use the idea of M-BERT for languages that are not already present is to train a new M-BERT from scratch. However, this is extremely time-consuming and expensive: training BERT-base alone takes about four days with four cloud TPUs (Devlin et al., 2019), so training M-BERT takes even longer. Alternatively, we can train a Bilingual BERT (B-BERT) (Karthikeyan et al., 2020), which is more efficient than training an M-BERT. However, one major disadvantage of B-BERT is that it cannot use supervised data from multiple languages, even when such data is available.
To accommodate a language that is not in M-BERT, we propose an efficient approach, Extend, that adapts M-BERT to the language. Extend works by enlarging the vocabulary of M-BERT to accommodate the new language and then continuing pre-training on this language. Our approach requires less than 7 hours of training with a single cloud TPU.
We performed comprehensive experiments on the NER task with 27 languages, of which 11 are not present in M-BERT. From Figure 1, we can see that our approach performs significantly better than M-BERT when the target language is outside the 104 languages in M-BERT. Even for high-resource languages that are already in M-BERT, our approach is still superior.
The key contributions of this work are: (i) we propose a simple yet novel approach to add a new language to M-BERT; (ii) we show that our approach improves over M-BERT both for languages that are in M-BERT and for languages that are not; (iii) we show that, in most cases, our approach is superior to training B-BERT from scratch. Our results are reproducible, and we will release both the models and the code.

Related works
Cross-lingual learning has been of rising interest in NLP; examples include BiCCA (Faruqui and Dyer, 2014), LASER (Artetxe and Schwenk, 2019), and XLM (Conneau and Lample, 2019). Although these models have been successful, they need some form of cross-lingual supervision, such as a bilingual dictionary or a parallel corpus, which is particularly challenging to obtain for low-resource languages. Our work differs from the above in that we do not require such supervision. While other approaches like MUSE (Lample et al., 2018) and VecMap (Artetxe et al., 2018) can work without any cross-lingual supervision, M-BERT already often outperforms these approaches (Karthikeyan et al., 2020). Schuster et al. (2019) have a continued-training setting similar to ours. However, their approach focuses more on analyzing whether B-BERT (JointPair) learns cross-lingual features from overlapping word-pieces, while ours focuses on improving M-BERT on target languages and addresses the problem of missing word-pieces. We show that our Extend method works well on M-BERT and is better than B-BERT in several languages, whereas their method (MonoTrans) performs similarly to B-BERT. Together, this implies that our Extend method benefits from the multilinguality of the base model (M-BERT vs. BERT).

Multilingual BERT (M-BERT)
M-BERT is a bidirectional transformer language model pre-trained on the Wikipedia text of the top 104 languages, i.e., the languages with the most Wikipedia articles. M-BERT uses the same pre-training objectives as BERT: masked language modeling and next sentence prediction (Devlin et al., 2019). Despite not being trained with any specific cross-lingual objective or aligned data, M-BERT is surprisingly cross-lingual. For cross-lingual transfer, M-BERT is fine-tuned on supervised data in a high-resource language such as English and tested on the target language.
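The masked language modeling objective can be illustrated with a small sketch. The following is a minimal, hypothetical implementation of BERT-style masking (15% of positions are selected; of those, 80% are replaced with [MASK], 10% with a random token, and 10% kept unchanged); the token ids and vocabulary size are illustrative, not from the paper.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=None):
    """BERT-style masking for the masked language model objective.

    Returns (inputs, labels): labels hold the original token at selected
    positions and -100 (a conventional "ignore" value) elsewhere.
    """
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok            # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id    # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the token unchanged
    return inputs, labels
```

During pre-training, the model is trained to recover the original tokens at the selected positions from the corrupted input.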

Our Method: Extend
In this section, we describe our training protocol Extend, which works by extending the vocabulary, encoder, and decoder to accommodate the target language and then continuing pre-training on this language.
Let the size of M-BERT's vocabulary be V_mbert and the embedding dimension be d. We first create a vocabulary from the monolingual data in the target language, following the same procedure as BERT, and filter out all word-pieces that already appear in M-BERT's vocabulary. Let the size of this new vocabulary be V_new; throughout the paper, we set V_new = 30000. We then append this new vocabulary to M-BERT's vocabulary and extend the encoder and decoder weights of the M-BERT model so that it can encode and decode the new vocabulary. That is, we extend M-BERT's encoder matrix of size V_mbert × d with a matrix of size V_new × d, initialized following M-BERT's procedure, to create an extended encoder of size (V_mbert + V_new) × d; we perform a similar extension for the decoder. Note that M-BERT uses weight tying, so the decoder is the same as the encoder, except that it has an additional bias.
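The encoder extension can be sketched as follows. This is a minimal illustration with toy sizes (M-BERT's real vocabulary has roughly 119k word-pieces with d = 768), not the authors' actual code; the truncated-normal initialization with standard deviation 0.02 follows BERT's published setup, approximated here with clipping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; real M-BERT uses a ~119k word-piece
# vocabulary with embedding dimension d = 768.
V_mbert, V_new, d = 1000, 300, 64

# Stand-in for the pre-trained encoder (input embedding) matrix of M-BERT.
enc_mbert = rng.normal(0.0, 0.02, size=(V_mbert, d))

# New rows for the target-language word-pieces, initialized like BERT's
# embeddings: truncated normal with std 0.02 (approximated by clipping
# a normal draw to two standard deviations).
new_rows = np.clip(rng.normal(0.0, 0.02, size=(V_new, d)), -0.04, 0.04)

# Extended encoder: old rows are kept verbatim, new rows are appended.
enc_extended = np.concatenate([enc_mbert, new_rows], axis=0)

# With weight tying, the decoder reuses the extended encoder matrix and
# only needs its output bias extended to length V_mbert + V_new.
dec_bias = np.zeros(V_mbert + V_new)
```

The key point is that the pre-trained rows are copied unchanged, so the model's existing knowledge is preserved while the new word-pieces start from a fresh initialization.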
We then continue pre-training with the monolingual data of the target language. Note that, except for the newly appended parts of the encoder and decoder, we initialize all weights with M-BERT's pre-trained weights. We call the resulting model E-MBERT.

Experimental Settings
Dataset. Our text corpus and NER dataset are from LORELEI (Strassel and Tracey, 2016). We use the tokenization method from BERT to preprocess the text corpora. For zero-shot cross-lingual NER, we evaluate performance on the whole annotated set; for supervised learning, since we only want an estimate of an upper bound, we apply cross-validation: each fold is evaluated by a model trained on the other folds, and the average F1 is reported.

NER Model. We use a standard Bi-LSTM-CRF framework (Ma and Hovy, 2016; Lample et al., 2016) with AllenNLP (Gardner et al., 2018) as our toolkit. The reported NER scores are F1, averaged across five runs with different random seeds.

BERT training. While extending, we use a batch size of 32 and a learning rate of 2e-5, which BERT suggests for fine-tuning, and we train for 500k iterations. For B-BERT, we use a batch size of 32 and a learning rate of 1e-4, and train for 2M iterations. We follow the BERT settings for all other hyperparameters.
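The cross-validation protocol used to estimate the supervised upper bound can be sketched as below. The fold count (5 here) and the train/evaluate callables are placeholders that the paper does not specify; this is an illustration of the procedure, not the authors' pipeline.

```python
def kfold_splits(n_examples, k=5):
    """Yield (train_indices, eval_indices) pairs for k contiguous folds."""
    fold_size = (n_examples + k - 1) // k  # ceiling division
    indices = list(range(n_examples))
    for i in range(k):
        eval_idx = indices[i * fold_size:(i + 1) * fold_size]
        train_idx = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train_idx, eval_idx

def cross_validated_f1(examples, train_fn, eval_fn, k=5):
    """Average F1 over k folds: each fold is scored by a model
    trained on the remaining folds."""
    scores = []
    for train_idx, eval_idx in kfold_splits(len(examples), k):
        model = train_fn([examples[i] for i in train_idx])
        scores.append(eval_fn(model, [examples[i] for i in eval_idx]))
    return sum(scores) / len(scores)
```

Every example is evaluated exactly once, by a model that never saw it during training, which is what makes the averaged F1 a reasonable upper-bound estimate.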

Comparing between E-MBERT and M-BERT
We compare the zero-shot cross-lingual NER performance of M-BERT and E-MBERT, training only with the supervised LORELEI English NER data. We also report the performance of M-BERT with supervision on the target language, which gives us a reasonable "upper bound" on the dataset. From Figure 2, we can see that in almost all languages, E-MBERT outperforms M-BERT, irrespective of whether the language is present in M-BERT.
It is clear that E-MBERT performs better than M-BERT when the language is not present; however, it is intriguing that E-MBERT improves over M-BERT even when the language is already present in M-BERT. We attribute this improvement to three reasons:
• Increased vocabulary size for the target language: since most languages have significantly less Wikipedia data than English, they have fewer word-pieces in M-BERT's vocabulary; our approach eliminates this issue. Note that it may not be a good idea to train a single M-BERT with a larger vocabulary for every language, as this would create a vast vocabulary (a few million word-pieces).
• E-MBERT is more focused on the target language, as during the last 500k steps it is optimized to perform well on it.
• Extra monolingual data: more monolingual data in the target language can be beneficial.

Comparing between E-MBERT and B-BERT
Another way of handling languages unseen by M-BERT is to train a new M-BERT entirely from scratch. Restricted by computing resources, it is often only feasible to train on the source and target languages alone, hence a bilingual BERT (B-BERT). Both E-MBERT and B-BERT use the same text corpus in the target language; for B-BERT, we subsample English Wikipedia data. We focus only on languages that are not in M-BERT, so that E-MBERT does not gain an advantage on the target language from Wikipedia data. Although the English corpus of B-BERT differs from that of E-MBERT, the difference is marginal considering its size. Indeed, we show that B-BERT and E-MBERT have similar performance on English NER; see Appendix A.1 and Appendix A.3.
From Table 2, we can see that E-MBERT often outperforms B-BERT. Moreover, B-BERT is trained for 2M steps to reach convergence, while E-MBERT requires only 500k steps. We believe this advantage comes from the following: E-MBERT builds on the stronger multilingual model M-BERT, which potentially contains languages that help transfer knowledge from English to the target, while B-BERT can only leverage English data. For example, in the case of Sinhala and Uyghur, comparatively high-resource related languages in M-BERT, such as Tamil and Turkish, can help E-MBERT learn Sinhala and Uyghur better.

Rate of Convergence
In this subsection, we study the convergence rate of E-MBERT and B-BERT. We evaluate the two models on two languages, Hindi (in M-BERT) and Sinhala (not in M-BERT), and report the results in Figure 3. We can see that E-MBERT converges within just 100k steps, while B-BERT takes more than 1M steps to converge. This shows that E-MBERT is much more efficient than B-BERT.

Performance on non-target languages
Our Extend method causes the base model (M-BERT) to focus on the target language, and naturally this degrades performance on non-target languages. We report the performance of the Hindi and Sinhala E-MBERT models evaluated on the other languages in Appendix A.2.

Conclusion

In this work, we propose Extend, which deals with languages not in M-BERT. Our method shows strong performance across several languages compared to M-BERT and B-BERT.
While Extend deals with one language at a time, it would be interesting future work to extend to multiple languages at the same time. Furthermore, instead of randomly initializing the embeddings of the new vocabulary, we could use alignment models like MUSE or VecMap with bilingual dictionaries for initialization. We could also apply our approach to better models, such as RoBERTa (Liu et al., 2019), in the multilingual case.

A Appendices
A.1 Performance of E-MBERT on English: E-MBERT's knowledge of English (the source language) is not much affected. From Table 3, we can see that, except for a few languages, the English performance of E-MBERT is almost as good as M-BERT's.

A.2 Detailed data on all languages
In Table 4, we report the full result on comparing M-BERT and E-MBERT.
We can also see that Extend is not only useful for cross-lingual performance but also for supervised performance (in almost all cases).
We also notice that extending on one language hurts the transferability to other languages.

A.3 Comparison between B-BERT and E-MBERT:
In Table 5, we compare the performance of E-MBERT with B-BERT on both English and the target language. As a reference, the performance of M-BERT on English is 79.37. This shows that neither B-BERT nor E-MBERT gets an unfair advantage from the English part of the model.