MultiFiT: Efficient Multi-lingual Language Model Fine-tuning

Pretrained language models are promising particularly for low-resource languages as they only require unlabelled data. However, training existing models requires huge amounts of compute, while pretrained cross-lingual models often underperform on low-resource languages. We propose Multi-lingual language model Fine-Tuning (MultiFiT) to enable practitioners to train and fine-tune language models efficiently in their own language. In addition, we propose a zero-shot method using an existing pretrained cross-lingual model. We evaluate our methods on two widely used cross-lingual classification datasets where they outperform models pretrained on orders of magnitude more data and compute. We release all models and code.


Introduction
Pretrained language models (LMs) have shown striking improvements on a range of natural language processing (NLP) tasks (Peters et al., 2018a; Howard and Ruder, 2018; Devlin et al., 2018). These models only require unlabelled data for training and are thus particularly useful in scenarios where labelled data is scarce. As much of NLP research has focused on the English language, the larger promise of these models is to bridge the digital language divide and enable the application of NLP methods to many of the world's other 6,000 languages where labelled data is less plentiful.
Recently, cross-lingual extensions of these LMs have been proposed that train on multiple languages jointly (Artetxe and Schwenk, 2018; Lample and Conneau, 2019). These models are able to perform zero-shot learning, only requiring labelled data in the source language. However, source data in another language may often not be available, whereas obtaining a small number of labels is typically straightforward.
Furthermore, such models have several downsides: a) some variants rely on large amounts of parallel data, which may not be available for truly low-resource languages; b) they require a huge amount of compute for training; and c) cross-lingual models underperform on low-resource languages, precisely the setting where they would be most useful. We are aware of two possible reasons for this: 1) Languages that are less frequently seen during training are underrepresented in the embedding space. 2) Infrequent scripts are over-segmented in the shared word piece vocabulary (Wang et al., 2019).
In this work, we show that small monolingual LMs are able to outperform expensive cross-lingual models both in the zero-shot and the supervised setting. We propose Multi-lingual language model Fine-tuning (MultiFiT) to enable practitioners to train and fine-tune language models efficiently. Our model combines universal language model fine-tuning (ULMFiT; Howard and Ruder, 2018) with the quasi-recurrent neural network (QRNN; Bradbury et al., 2017) and subword tokenization (Kudo, 2018) and can be pretrained on a single Tesla V100 GPU in a few hours. In addition, we propose to use a pretrained cross-lingual model's predictions as pseudo labels to adapt the monolingual language model to the zero-shot setting. We evaluate our models on two widely used cross-lingual classification datasets, MLDoc (Schwenk and Li, 2018) and CLS (Prettenhofer and Stein, 2010), where we outperform the state-of-the-art zero-shot model LASER (Artetxe and Schwenk, 2018) and multi-lingual BERT (Devlin et al., 2018) in the supervised setting, even without any pretraining. In the zero-shot setting, we outperform both models using pseudo labels, and report significantly higher performance with as few as 100 examples. We finally show that information from monolingual and cross-lingual language models is complementary and that pretraining makes models robust to noise.

Related work
Pretrained language models Pretrained language models based on an LSTM (Peters et al., 2018a; Howard and Ruder, 2018) and a Transformer (Radford et al., 2018; Devlin et al., 2018) have been proposed. Recent work (Peters et al., 2018b) suggests that, all else being equal, an LSTM outperforms the Transformer in terms of downstream performance. For this reason, we use a variant of the LSTM as our language model.

Cross-lingual pretrained language models
The multi-lingual BERT model is pretrained on the Wikipedias of 104 languages using a shared word piece vocabulary. LASER (Artetxe and Schwenk, 2018) is trained on parallel data of 93 languages with a shared BPE vocabulary. XLM (Lample and Conneau, 2019) additionally pretrains BERT with parallel data. These models enable zero-shot transfer, but achieve lower results than monolingual models. In contrast, we focus on making the training of monolingual language models more efficient in a multi-lingual context. Concurrent work (Mulcaire et al., 2019) pretrains on English and another language, but shows that cross-lingual pretraining only helps sometimes.
Multi-lingual language modeling Training language models in non-English languages has only recently received some attention. Kawakami et al. (2017) evaluate on seven languages. Cotterell et al. (2018) study 21 languages. Gerz et al. (2018) create datasets for 50 languages. All of these studies, however, only create small datasets, which are inadequate for pretraining language models. In contrast, we are among the first to report the performance of monolingual language models on downstream tasks in multiple languages.

Our method
Multi-lingual Fine-Tuning
We propose Multi-lingual Fine-tuning (MultiFiT). Our method uses the ULMFiT model (Howard and Ruder, 2018) with discriminative fine-tuning as its foundation. ULMFiT is based on a 3-layer AWD-LSTM (Merity et al., 2017) language model. The AWD-LSTM is a regular LSTM (Hochreiter and Schmidhuber, 1997) with tuned dropout hyperparameters. To enable faster training and fine-tuning of the model, we replace it with a QRNN (Bradbury et al., 2017). The QRNN alternates convolutional layers, which are parallel across timesteps, and a recurrent pooling function, which is parallel across channels. It has been shown to outperform LSTMs, while being up to 16× faster at train and test time. In addition, ULMFiT is restricted to words as input. To make our model more robust across languages, we use subword tokenization based on a unigram language model (Kudo, 2018), which is more flexible than byte-pair encoding (Sennrich et al., 2016). We additionally employ label smoothing (Szegedy et al., 2016) and a novel cosine variant of the one-cycle policy (Smith, 2018), which we found to outperform ULMFiT's slanted triangular learning rate schedule and gradual unfreezing. The full model can be seen in Figure 1.
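To make the QRNN's recurrent pooling step concrete, the following is a minimal pure-Python sketch of the fo-pooling variant described by Bradbury et al. (2017). It is an illustration under stated assumptions, not the implementation used in our experiments: the function name `fo_pool` and the list-of-lists input layout are hypothetical, and the gate pre-activations are assumed to have already been produced by the convolutional layer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fo_pool(z_pre, f_pre, o_pre):
    """Apply fo-pooling over a sequence and return the hidden states.

    Each argument is a list of timesteps; each timestep is a list of
    per-channel pre-activations (assumed to come from a convolution).
    """
    n_channels = len(z_pre[0])
    c = [0.0] * n_channels  # cell state, initialised to zero (an assumption)
    hidden = []
    for z_t, f_t, o_t in zip(z_pre, f_pre, o_pre):
        z = [math.tanh(v) for v in z_t]   # candidate values
        f = [sigmoid(v) for v in f_t]     # forget gate
        o = [sigmoid(v) for v in o_t]     # output gate
        # The only sequential dependency: c_t = f_t * c_{t-1} + (1 - f_t) * z_t.
        # Every channel is updated independently of the others.
        c = [fi * ci + (1.0 - fi) * zi for fi, ci, zi in zip(f, c, z)]
        # h_t = o_t * c_t
        hidden.append([oi * ci for oi, ci in zip(o, c)])
    return hidden
```

Because the loop body contains only element-wise operations, the recurrence is cheap compared to the matrix multiplications inside an LSTM step, which is one way to read the speed advantage mentioned above.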

Cross-lingual Bootstrapping
Prior methods have employed cross-lingual training strategies relying on parallel data and a shared BPE vocabulary. These can be combined with our language model, but increase its training complexity. For the case where an existing pretrained cross-lingual model and source language data are available, we propose a bootstrapping method (Ruder and Plank, 2018) that uses the pretrained model's zero-shot predictions as pseudo labels to fine-tune the monolingual model on target language data. The steps of the method can be seen in Figure 2. Specifically, we first fine-tune a linear classification layer on top of pretrained cross-lingual representations on source language training data. We then apply this cross-lingual classifier to the target language data and store its predicted label for every example. We now fine-tune our pretrained LM on the target language data and these pseudo labels. Importantly, this method enables our monolingual LM to significantly outperform its cross-lingual teacher in the zero-shot setting (§5).
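The three bootstrapping steps can be sketched as a small pipeline. This is a schematic illustration only: `bootstrap`, `make_pseudo_labels`, and the two toy stand-ins are hypothetical names, and the toy teacher (which matches shared words) merely stands in for a classifier over real cross-lingual representations such as LASER.

```python
from typing import Callable, Dict, List, Tuple

Predictor = Callable[[str], str]
Labelled = List[Tuple[str, str]]


def make_pseudo_labels(teacher_predict: Predictor, target_texts: List[str]) -> Labelled:
    """Step 2: store the teacher's predicted label for every target example."""
    return [(text, teacher_predict(text)) for text in target_texts]


def bootstrap(train_teacher, fine_tune_student, source_data: Labelled, target_texts: List[str]):
    # Step 1: fit the cross-lingual classifier on labelled source-language data.
    teacher_predict = train_teacher(source_data)
    # Step 2: label the unlabelled target-language data with its predictions.
    pseudo_labelled = make_pseudo_labels(teacher_predict, target_texts)
    # Step 3: fine-tune the monolingual model on texts plus pseudo labels.
    return fine_tune_student(pseudo_labelled)


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_train_teacher(source_data: Labelled) -> Predictor:
    positive_words = {w for text, label in source_data if label == "pos" for w in text.split()}
    return lambda text: "pos" if positive_words & set(text.split()) else "neg"


def toy_fine_tune_student(pseudo_labelled: Labelled) -> Dict[str, str]:
    return dict(pseudo_labelled)  # a real student would update LM weights here


student = bootstrap(
    toy_train_teacher,
    toy_fine_tune_student,
    source_data=[("great film", "pos"), ("dull plot", "neg")],
    target_texts=["great buch", "langweilig"],
)
```

Note that no target-language gold labels enter the pipeline at any point, which is what keeps the setting zero-shot from the student's perspective.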

Experimental setup
This section provides an overview of our experimental setup; see the appendix for full details.
Data We evaluate our models on the Multilingual Document Classification Corpus (MLDoc; Schwenk and Li, 2018) and on CLS (Prettenhofer and Stein, 2010), which contains product reviews in four languages. We provide an overview of the datasets in Table 1.
Pretraining We pretrain our models on 100M tokens extracted from the Wikipedia of the corresponding language for 10 epochs. As fewer tokens might be available for some languages, we also compare against a version (no wiki) that uses no pretraining. For all models, we fine-tune the LMs on the target data of the same language for 20 epochs. We perform subword tokenization with the unigram language model (Kudo, 2018).
Evaluation settings We compare two settings based on the availability of source and target language data: supervised and zero-shot. In the supervised setting, every model is fine-tuned and evaluated on examples from the target language.
In the zero-shot setting, every model is fine-tuned on source language examples and evaluated on target language examples. In all cases, we use English as the source language.
Baselines We compare against the state-of-the-art cross-lingual embedding model LASER (Artetxe and Schwenk, 2018), multilingual BERT (Devlin et al., 2018), and monolingual BERT. We also compare against the best models on each dataset, MultiCCA (Ammar et al., 2016), a cross-lingual word embedding model, and BiDRL (Zhou et al., 2016), which translates source and target data.
Our methods We evaluate our monolingual LMs in the supervised setting (MultiFiT) and our LMs fine-tuned with pseudo labels from LASER in the zero-shot setting (pseudo).
forms the comparison methods as the shared embedding space between many languages is overly restrictive. Our monolingual LMs outperform their cross-lingual teacher LASER in almost every setting. When fine-tuned with only 100 target language examples, they are able to outperform all zero-shot approaches except MultiFiT on DE and FR. This calls into question the need for zero-shot approaches, as fine-tuning with even a small number of target examples is able to yield superior performance. When fine-tuning with 1,000 target examples, MultiFiT, even without pretraining, outperforms all comparison methods, including monolingual BERT. Table 3. MultiFiT is able to outperform its zero-shot teacher LASER across all domains. Importantly, the bootstrapped monolingual model also outperforms more sophisticated models that are trained on translations across almost all domains. In the supervised setting, MultiFiT similarly outperforms multilingual BERT. For both datasets, our methods that have been pretrained on 100 million tokens outperform both multilingual BERT and LASER, models that have been trained with orders of magnitude more data and compute.

Analysis
Speed We compare the LSTM and QRNN cell in MultiFiT based on the speed for processing a single batch for pretraining and fine-tuning in Table 4. MultiFiT with a QRNN pretrains and fine-tunes about 2× and 3× faster respectively.
MultiFiT vs. ULMFiT We compare MultiFiT pretrained on 100M Wikipedia tokens against ULMFiT pretrained on the same data using a 3-layer LSTM and spaCy tokenization, as well as MultiFiT pretrained on 2M Wikipedia tokens and MultiFiT with no pretraining, in Table 5. Pretraining on more data generally helps. MultiFiT outperforms ULMFiT significantly; the performance improvement is particularly pronounced in Chinese, where ULMFiT's word-based tokenization underperformed.
Robustness to noise We suspect that MultiFiT is able to outperform its teacher as the information from pretraining makes it robust to label noise. To test this hypothesis, we train MultiFiT and a randomly initialized model with the same architecture on 1k and 10k examples of the Spanish MLDoc. We randomly perturb labels with a probability ranging from 0 to 0.75 and show results in Figure 3. The pretrained MultiFiT is able to partially ignore the noise, up to 65% of noisy training examples. Without pretraining, the model does not exceed the theoretical baseline (the percentage of correct examples). In addition, we compare MultiFiT with and without pretraining in Table 6. Pretraining enables MultiFiT to achieve much better performance compared to a randomly initialised model. Both results together suggest a) that pretraining increases robustness to noise and b) that information from monolingual and cross-lingual language models is complementary.
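The label perturbation used in the noise experiment can be illustrated with a short sketch. The exact perturbation procedure is not spelled out above, so the version below makes an explicit assumption: a perturbed label is replaced by a different class drawn uniformly at random. The names `perturb_labels` and `fraction_correct` are hypothetical.

```python
import random


def perturb_labels(labels, classes, p, rng):
    """Return a copy of `labels` in which each label is flipped with probability p.

    Assumption: a flipped label is drawn uniformly from the *other* classes,
    so a perturbed example is always wrong.
    """
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy


def fraction_correct(clean, noisy):
    """Share of training labels left intact, i.e. the theoretical baseline."""
    return sum(a == b for a, b in zip(clean, noisy)) / len(clean)


rng = random.Random(0)
classes = ["a", "b", "c", "d"]          # placeholder class names
clean = classes * 2500                  # 10k toy labels
noisy = perturb_labels(clean, classes, p=0.5, rng=rng)
baseline = fraction_correct(clean, noisy)  # close to 1 - p under this assumption
```

Under this assumption the theoretical baseline is approximately 1 - p; a model that exceeds it on clean test data is, in effect, ignoring part of the injected noise.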
Tokenization Subword tokenization has been found useful for language modeling with morphologically rich languages (Czapla et al., 2018; Mielke and Eisner, 2019) and has been used in recent pretrained LMs (Devlin et al., 2018), but its concrete impact on downstream performance has not been observed. We train models with the best performing vocabulary sizes for subword (15k) and regular word-based tokenization (60k) with the Moses tokenizer (Koehn et al., 2007) on German and Russian MLDoc and show results in Table 7. Subword tokenization outperforms word-based tokenization on most languages, while being faster to train due to the smaller vocabulary size.

Conclusion
We have proposed novel methods for multilingual fine-tuning of language models that outperform models trained with far more data and compute on two widely studied cross-lingual text classification datasets in eight languages, in both the zero-shot and supervised settings.