Load What You Need: Smaller Versions of Multilingual BERT

Pre-trained Transformer-based models are achieving state-of-the-art results on a variety of Natural Language Processing data sets. However, the size of these models is often a drawback for their deployment in real production applications. In the case of multilingual models, most of the parameters are located in the embeddings layer. Therefore, reducing the vocabulary size should have an important impact on the total number of parameters. In this paper, we propose to generate smaller models that handle a smaller number of languages, chosen according to the targeted corpora. We present an evaluation of smaller versions of multilingual BERT on the XNLI data set, but we believe that this method may be applied to other multilingual transformers. The obtained results confirm that we can generate smaller models that achieve comparable results while reducing the total number of parameters by up to 45%. We compared our models with DistilmBERT (a distilled version of multilingual BERT) and showed that, unlike language reduction, distillation induced a 1.7% to 6% drop in overall accuracy on the XNLI data set. The presented models and code are publicly available.


Introduction
While transformer-based models are getting larger, it is increasingly difficult to meet production requirements when deploying them. Reducing the size of these models is therefore an important step towards the democratization of transformers in real industrial environments. In addition to the model architecture, the vocabulary size may have a huge impact on the total number of parameters. However, in the case of multilingual transformers, the model vocabulary must grow as more languages are included. For instance, the cased version of multilingual BERT (mBERT) has a vocabulary of 119k entries, while the English version (BERT-base) has a vocabulary of only 30k tokens (Devlin et al., 2018). Therefore, even though both models share the same architecture, mBERT has 178 million parameters, while BERT-base has only 110 million. As a matter of fact, mBERT allocates more than 51% of its parameters to the embeddings layer.
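The embedding share quoted above can be checked with simple arithmetic. The exact figures used below (a 119,547-entry vocabulary and the standard BERT-base hidden size of 768) are taken from the released mBERT-cased checkpoint; they are our assumptions and are not stated as such in this paper:

```python
# Back-of-the-envelope check of mBERT's embedding parameter share.
# Vocabulary sizes below come from the released cased checkpoints;
# 768 is the standard BERT-base hidden dimension.
VOCAB_MBERT = 119_547   # cased mBERT vocabulary entries
HIDDEN = 768            # BERT-base hidden size
TOTAL_MBERT = 178_000_000  # total mBERT parameters, as reported in the text

def embedding_params(vocab_size, hidden=HIDDEN):
    """Parameters in the token-embedding matrix alone."""
    return vocab_size * hidden

mbert_emb = embedding_params(VOCAB_MBERT)   # ~91.8M parameters
share = mbert_emb / TOTAL_MBERT             # fraction of the full model
print(f"mBERT embeddings: {mbert_emb / 1e6:.1f}M parameters ({share:.0%} of the total)")
```

The result, roughly 92 million parameters or about 52% of the model, is consistent with the "more than 51%" figure above.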
Still, reducing the vocabulary size may cause a significant drop in model performance on downstream tasks. Indeed, Conneau et al. (2020) obtained more than a 3% improvement in overall accuracy on XNLI by increasing the vocabulary size from 128k to 512k. Here, we propose to reduce the vocabulary size by reducing the number of languages the model handles. Indeed, most multilingual transformers have been trained on more than 100 languages, yet several real-world applications only need to handle a few of them. In this paper, we suggest extracting smaller multilingual transformers that handle fewer languages. We evaluate smaller versions of mBERT on the XNLI data set (Conneau et al., 2018), but we believe that this method may be applied to other multilingual transformers and other NLP tasks.
We compared our models with the original mBERT and the Hugging Face multilingual DistilBERT (DistilmBERT). To our knowledge, this is the first detailed evaluation of DistilmBERT on the XNLI data set. The obtained results confirm that, unlike DistilmBERT, our strategy reduces the model size without decreasing the average accuracy. The aim of this work is to draw the community's attention to this simple yet efficient way of reducing the size of multilingual transformers.

Related work
Several methods have recently emerged to compress transformer models.
A family of approaches focuses on quantizing model weights, i.e. reducing the memory footprint of a model by representing its weights with lower-precision values. This method, which is especially effective on dedicated hardware, was recently applied by Shen et al. (2020) to the transformer architecture.
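To make the idea of weight quantization concrete, the following is a minimal sketch of symmetric linear quantization of a float32 tensor to int8. It is an illustration of the general technique, not the method of Shen et al. (2020), and the weight values are made up for the example:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization: map floats to int8 with one scale factor."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.round(w / scale).astype(np.int8)  # 8-bit integer codes
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

# Toy weight vector; int8 storage is 4x smaller than float32.
w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(q, f"max reconstruction error: {np.abs(w - w_hat).max():.4f}")
```

The reconstruction error is bounded by half the scale factor, which is why quantization preserves accuracy well for weights with a limited dynamic range.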
Knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015) consists in transferring the knowledge learned by a large teacher network to a smaller student. Works in this area aim at building models with simpler architectures than the original ones while mimicking their behaviour. Knowledge distillation has been applied to reduce the number of layers of BERT models (Sun et al., 2019; Tang et al., 2019). Not limited to architecture simplification, Zhao et al. (2019) perform model distillation by simultaneously training the teacher and student models in order to reduce both the vocabulary size and the embeddings size.
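The core of the distillation objective from Hinton et al. (2015) can be sketched in a few lines: the student is trained to match the teacher's temperature-softened output distribution. The logits below are invented for illustration; this is a sketch of the generic loss, not of any specific BERT distillation recipe:

```python
import numpy as np

def softmax(x, T=1.0):
    """Temperature-scaled softmax (numerically stable)."""
    z = np.exp((x - x.max()) / T)
    return z / z.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student against the teacher's softened outputs.

    The T^2 factor rescales gradients, following Hinton et al. (2015).
    """
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -(p_teacher * log_p_student).sum() * T * T

# A student that mimics the teacher gets a lower loss than one that disagrees.
teacher = np.array([3.0, 1.0, 0.2])
student_good = np.array([2.8, 1.1, 0.3])
student_bad = np.array([0.2, 1.0, 3.0])
print(distillation_loss(student_good, teacher) < distillation_loss(student_bad, teacher))
```

In practice this soft-target loss is usually combined with the ordinary hard-label cross-entropy on the training set.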
Regarding multilingual transformers, Tsai et al. (2019) evaluated distilled versions of BERT and mBERT on POS tagging and morphology tasks. Their version of mBERT is 6 times smaller and 27 times faster but suffers an average F1 drop of 1.6% and 5.4% on the two evaluated tasks. Furthermore, the learnt model is not publicly available. In this paper, we compare our models with the widely used DistilmBERT, a distilled version of mBERT that reduces its size by 21%.
Finally, fewer methods have tried to reduce the number of parameters located in the embeddings. Unlike Mehta et al. (2020), our method does not require training the model from scratch.

Methods
In order to generate smaller versions of mBERT, we have (i) identified the vocabulary of each language, and then (ii) rebuilt the embedding layer to generate the corresponding models.

Selecting Language Vocabularies
As for the original mBERT, we started from the entire Wikipedia dump of each language 1 . In our case, we selected the 15 languages covered by the XNLI data set (Conneau et al., 2018). The recommended mBERT cased tokenizer was used to tokenize the data, and the frequency of each entry of the original mBERT vocabulary was computed for each language. A manual evaluation of the token distributions over the different data sets allowed us to choose an appropriate frequency threshold: for each language, tokens appearing in at least 0.05% of its paragraphs (lines) were selected for its vocabulary. Table 1 presents the number of selected tokens for each language and their proportions in the original mBERT vocabulary. As expected, the number of tokens selected for the 15 languages (union) is not equal to the sum of the tokens selected for each language, since several languages share a certain number of tokens (proper nouns, punctuation signs, numbers, etc.).
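The selection step above can be sketched as follows. This is a minimal illustration of the paragraph-frequency threshold, with a plain whitespace tokenizer and a four-line toy corpus standing in for the mBERT cased tokenizer and the Wikipedia dumps; the 50% threshold used here is only for the toy example (the paper uses 0.05%):

```python
from collections import Counter

def select_vocabulary(paragraphs, tokenize, threshold=0.0005):
    """Keep tokens appearing in at least `threshold` of the paragraphs.

    `tokenize` stands in for the mBERT cased tokenizer; any callable
    returning a list of tokens will do for this sketch.
    """
    doc_freq = Counter()
    for p in paragraphs:
        doc_freq.update(set(tokenize(p)))  # count each token once per paragraph
    min_count = threshold * len(paragraphs)
    return {tok for tok, count in doc_freq.items() if count >= min_count}

# Toy "corpus": each string plays the role of one Wikipedia paragraph.
corpus = ["the cat sat", "the dog ran", "the cat ran", "a bird flew"]
vocab = select_vocabulary(corpus, str.split, threshold=0.5)
print(sorted(vocab))  # tokens present in at least half the paragraphs
```

Running this per language and taking the union of the resulting sets yields the multilingual vocabulary, which explains why the union is smaller than the sum of the per-language counts.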

Generating Smaller Models
Once the tokens were selected for each targeted language, we extracted their corresponding embeddings to generate smaller models. Apart from selecting and re-arranging the embeddings, no other modification was applied to the model parameters.
We generated 30 models covering the 15 XNLI languages in different ways:
• One multilingual model covering all 15 languages (mBERT 15langs );
• 14 bilingual models combining English with one of the remaining languages (mBERT en-xx );
• 15 monolingual models (mBERT xx ).
All these models have been uploaded to the transformers hub to facilitate their use by the NLP community 2 . They can easily be fine-tuned on downstream tasks, as done in the following section. The data and code are also available on github so that users can generate other configurations of multilingual transformers 3 .
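The embedding extraction itself reduces to selecting rows of the embedding matrix and remapping token ids. The sketch below uses a numpy array and an invented five-token vocabulary as stand-ins for the real checkpoint; it illustrates the operation rather than reproducing the authors' released scripts:

```python
import numpy as np

def shrink_embeddings(embeddings, old_vocab, keep_tokens):
    """Slice an embedding matrix down to a selected vocabulary.

    embeddings  : (V, H) array, one row per token of `old_vocab`
    old_vocab   : dict mapping token -> row index in `embeddings`
    keep_tokens : tokens to retain
    Returns the reduced matrix and the token -> new index mapping.
    """
    kept = sorted(keep_tokens, key=old_vocab.__getitem__)  # keep original order
    rows = [old_vocab[t] for t in kept]
    new_vocab = {t: i for i, t in enumerate(kept)}
    return embeddings[rows], new_vocab

# Toy 5-token vocabulary with 4-dimensional embeddings.
old_vocab = {"[PAD]": 0, "the": 1, "chat": 2, "cat": 3, "##ly": 4}
emb = np.arange(20, dtype=np.float32).reshape(5, 4)
small_emb, new_vocab = shrink_embeddings(emb, old_vocab, {"[PAD]", "the", "cat"})
print(small_emb.shape, new_vocab)
```

Since only the embedding rows are selected and renumbered, every other weight matrix of the model is copied unchanged, which is why no retraining is needed.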

Results and Discussion
In this section, we present the results of the original mBERT, its available distilled version (DistilmBERT) and our extracted mBERT versions on the XNLI data set. We also discuss the obtained results and show a few limitations of the proposed method.

Results
The above-cited models have been evaluated on Cross-lingual Natural Language Inference. We used the XNLI data set, which extends the MultiNLI corpus to 15 languages: the original English development and test sets have been manually translated into the remaining 14 languages. Furthermore, the XNLI data set comes with other configurations in which the items have been automatically translated. In this paper, we used all the proposed configurations:
• Cross-lingual Transfer: cross-lingual transfer from an English training set;
• Translate Train: translate the English training set in order to train the models on the same language as the test data;
• Translate Train-all: translate the English training set and train the models on all languages;
• Translate Test: translate each test set into English and train on the English training data.
Table 2 presents the obtained accuracies as well as the number of parameters of each model. All the results presented here may be reproduced using the shared models and the evaluation scripts available in the transformers library.

Discussion
Overall, our extracted versions of mBERT give results similar to those of the original model while being between 21% and 45% smaller. Regarding DistilmBERT, the results show an average drop of 1.7% to 6.1% in accuracy while the model is 25% smaller than the original mBERT. The drop in DistilmBERT's accuracy is much higher in the case of Cross-lingual Transfer from English to other languages (6.1%). In contrast, our extracted versions seem to be resilient across all configurations.
The average accuracy of mBERT 15langs is always very close to the one obtained by the original mBERT, except in the Translate Train-all configuration, where the accuracy of mBERT 15langs is higher by 1.1%. Conneau et al. (2020) reported similar observations, showing that the average accuracy decreases when the number of languages goes from 15 to 100. However, the authors kept the same vocabulary size in both experiments (150k tokens); therefore, the per-language vocabulary should be smaller in the 100-language model than in the one handling only 15 languages. In our experiments, the vocabulary of mBERT 15langs is 40% smaller than that of mBERT, since we were trying to select only tokens that are frequent in these languages. Another important difference is that we start from models that were already trained on more than 100 languages and simply fine-tune them on fewer languages. The obtained results may suggest that, beyond a certain point, keeping tokens that are very rare (or non-existent) in some languages may harm the fine-tuning of multilingual transformers on these specific languages, even if the vocabulary capacity is increased.
Bilingual models (mBERT en-xx ) have been evaluated for Cross-lingual Transfer from the original English training set to the human-translated test sets. They obtained an average accuracy similar to the one obtained by mBERT. All the presented models show better accuracies when evaluated on languages that are somewhat similar to English, such as French (fr), Spanish (es) and German (de).
Finally, monolingual models (mBERT xx ) have been evaluated when the training and test sets are in the same language (Translate Train and Translate Test). Their average accuracies are lower than, but very close to, the ones obtained by mBERT. The difference is so small (less than 0.2%) that we refrain from interpreting it here.
Table 2: Results on the XNLI data set of mBERT, its distilled version (DistilmBERT) and our extracted smaller versions mBERT 15langs , mBERT en-xx and mBERT xx . We also report the number of parameters of each model.

Limitations
In this work, we were interested in reducing the number of parameters of multilingual transformer models, which leads to smaller models that require less memory. We believe that memory limits are crucial, especially when deploying transformers on public cloud platforms. Moreover, smaller models load faster than larger ones, which may also improve the startup speed of deployed applications. However, the proposed method does not improve the inference speed, since the model architecture is unchanged. Distillation, on the other hand, builds smaller models that usually run faster: for example, DistilmBERT reduces the number of layers by a factor of 2 (from 12 to 6), which also reduces the number of operations executed during both training and inference. Table 3 presents the model size, the allocated memory, the loading time and the inference time for all the evaluated versions of mBERT. All these measurements were computed on a Google Cloud n1-standard-1 4 machine (1 vCPU, 3.75 GB). As expected, our extracted models reduce the first three measurements without changing the inference time, whereas DistilmBERT, in addition to reducing the size, the memory and the loading time, also improves the inference speed by a factor of 2. Our experiments confirm that reducing the number of embeddings, and therefore the cost of the lookup operation, has almost no impact on the inference time.
Table 3: The model size, the allocated memory, the loading time and the inference time for all the evaluated versions of mBERT. We present the average measurements for bilingual and monolingual models. The loading times were computed 10 times for each model, while the inference times were averaged over 100 items from the XNLI data set (batch size = 1).
That being said, we can still apply language reduction to distilled transformers to take advantage of both methods and make our models even smaller.
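The size and loading-time effect discussed in this section can be illustrated without the actual checkpoints. The sketch below serializes two zero-filled arrays whose row counts mimic the full and reduced mBERT vocabularies (with a toy hidden size of 64 instead of 768 to keep it light); it is a stand-alone illustration of why a smaller embedding matrix shrinks the model file, not a reproduction of the Table 3 measurements:

```python
import os
import pickle
import tempfile
import time

import numpy as np

def save_and_time_load(matrix, path):
    """Serialize `matrix` to `path`, then measure file size and reload time."""
    with open(path, "wb") as f:
        pickle.dump(matrix, f)
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        pickle.load(f)
    return size, time.perf_counter() - start

hidden = 64  # toy hidden size; mBERT actually uses 768
with tempfile.TemporaryDirectory() as d:
    full = np.zeros((119_547, hidden), dtype=np.float32)   # full-vocabulary stand-in
    small = np.zeros((71_000, hidden), dtype=np.float32)   # ~40% smaller vocabulary
    size_full, t_full = save_and_time_load(full, os.path.join(d, "full.pkl"))
    size_small, t_small = save_and_time_load(small, os.path.join(d, "small.pkl"))
print(f"full: {size_full / 1e6:.1f} MB, small: {size_small / 1e6:.1f} MB")
```

The smaller file both occupies less disk space and deserializes faster, while the per-token inference cost is unaffected by how many unused rows were removed.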

Conclusion
Multilingual transformers have several advantages, such as their capacity for zero-shot cross-lingual transfer. Therefore, most multilingual transformers have been trained to handle a large number of languages (around 100). However, handling more languages requires increasing the vocabulary capacity and therefore the model size.
In this paper, we evaluated a simple method to break multilingual transformers into smaller models according to the targeted languages. We evaluated smaller versions of mBERT on the XNLI data set and showed that they reduced the number of parameters without decreasing the average accuracy.
As future work, it would be interesting to evaluate this method on more models and tasks. Indeed, we are planning to reduce more recent multilingual transformers that have shown better results than mBERT, such as XLM-R (Conneau et al., 2020). We hope that these models will facilitate the deployment of multilingual transformers in real-world applications.