Towards Continual Learning for Multilingual Machine Translation via Vocabulary Substitution

We propose a straightforward vocabulary adaptation scheme to extend the language capacity of multilingual machine translation models, paving the way towards efficient continual learning for multilingual machine translation. Our approach is suitable for large-scale datasets, applies to distant languages with unseen scripts, incurs only minor degradation on the translation performance for the original language pairs and provides competitive performance even in the case where we only possess monolingual data for the new languages.


Introduction
The longstanding goal of multilingual machine translation (Firat et al., 2016; Johnson et al., 2017; Aharoni et al., 2019; Gu et al., 2018) has been to develop a universal translation model, capable of providing high-quality translations between any pair of languages. Due to limitations on the data available, however, current approaches rely on first selecting a set of languages for which we have data and training an initial translation model on this data jointly for all languages in a multi-task setup. In an ideal setting, one would continually update the model once data for new language pairs arrives. This setting, known in the literature as continual learning (Ring, 1994; Rebuffi et al., 2017; Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017), introduces new challenges not found in the traditional multi-task setup, most famously catastrophic forgetting (McCloskey and Cohen, 1989), in which the model may lose its previously learned knowledge as it learns new language pairs. This situation is further complicated by the training procedures of standard tokenizers, such as Byte-Pair Encoding (BPE) (Sennrich et al., 2015b) or Sentencepiece (Kudo and Richardson, 2018), which require access to monolingual data for all the languages considered before producing the vocabulary. If these requirements are not met, one risks suboptimal segmentation rules, which in the worst case could result in strings consisting entirely of <UNK> tokens for text in a previously unseen alphabet.
In this work, we investigate how vocabularies derived from BPE transform if they are rebuilt with the same settings but with additional data from a new language. We show in Section 3.1 that there is a large token overlap between the original and updated vocabularies. This large overlap allows us to retain the performance of a translation model after replacing its vocabulary with the updated vocabulary that additionally supports a new language.
Past works have explored adapting translation models to new languages, typically focusing on related languages which share similar scripts (Gu et al., 2018; Neubig and Hu, 2018; Lakew et al., 2019; Chronopoulou et al., 2020). These works usually focus solely on learning the new language pair, with no consideration for catastrophic forgetting. Moreover, these works only examine the setting where the new language pair comes with parallel data, despite the reality that for a variety of low-resource languages, we may only possess high-quality monolingual data with no access to parallel data. Finally, unlike our approach, these approaches do not recover the vocabulary one would have built if one had access to the data for the new language from the very beginning.
Having alleviated the vocabulary issues, we study whether we are able to learn the new language pair rapidly and accurately, matching the performance of a model which had access to this data at the beginning of training. We propose a simple adaptation scheme that allows our translation model to attain competitive performance with strong bilingual and multilingual baselines in a small number of additional gradient steps. Moreover, our model retains most of the translation quality on the original language pairs it was trained on, exhibiting no signs of catastrophic forgetting.

Continual learning via vocabulary substitution
Related works Adapting translation models to new languages has been studied in the past. Neubig and Hu (2018) showed that a large multilingual translation model trained on a subset of languages of the TED dataset (Qi et al., 2018) could perform translation on the remaining (related) languages. Tang et al. (2020) extended the multilingual translation model mBART (Liu et al., 2020) from 25 to 50 languages by exploiting the fact that mBART's vocabulary already supported the additional 25 languages. Escolano et al. (2021) added new languages to machine translation models by training language-specific encoders and decoders. Other works (Zoph et al., 2016; Lakew et al., 2018, 2019; Escolano et al., 2019) have studied repurposing translation models as initializations for bilingual models for a target low-resource language pair. Most recently, Chronopoulou et al. (2020) examined reusing language models for high-resource languages as initializations for unsupervised translation models for a related low-resource language through the following recipe: build a vocabulary $V_X$ and a language model for the high-resource language $X$; once data for the low-resource language $Y$ arrives, build a joint vocabulary $V_{X,Y}$ and let $V_{Y|X}$ be the tokens from $Y$ that appear in $V_{X,Y}$; substitute the vocabulary of the language model with $V_X \cup V_{Y|X}$ and use the language model as the initialization for the translation model.
Our approach In this work, we are not only interested in the performance of our multilingual translation models on new language pairs; we also require that our models retain their performance on the multiple language pairs that they were initially trained on. We are also interested in how the performance of these models compares with that obtained in the oracle setup where we have all the data available from the start. The approaches discussed above generate vocabularies that are likely different (both in selection and number of tokens) from the vocabulary one would obtain if one had a priori access to the missing data, due to the special attention given to the new language. This architectural divergence will only grow as we continually add new languages, which inhibits comparisons to the oracle setup. We eliminate this mismatch by first building a vocabulary $V_N$ on the $N$ languages available; then, once the new language arrives, we build a new vocabulary $V_{N+1}$ as we would have if we had possessed the data from the beginning and replace $V_N$ with $V_{N+1}$. We then reuse the embeddings for tokens in the intersection¹ and continue training. The success of our approach relies on the fact that for large $N$ (i.e. the multilingual setting), $V_N$ and $V_{N+1}$ are mostly equivalent, which allows the model to retain its performance after we substitute vocabularies. We verify this in the following section.
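The substitution step can be illustrated with a minimal sketch. It assumes that both vocabularies have the same size (they are built with the same settings) and contain no duplicates, and it follows the footnoted convention that shared tokens keep their indices while the remaining slots are overwritten and inherit the stale embedding rows. The function name `substitute_vocabulary` and the order in which new tokens fill vacated slots are assumptions made for illustration, not details taken from the paper.

```python
from typing import List


def substitute_vocabulary(old_vocab: List[str], new_vocab: List[str]) -> List[str]:
    """Reorder new_vocab so that tokens shared with old_vocab keep their old index.

    Assumes equal vocabulary sizes and no duplicate tokens (hypothetical setup).
    The embedding matrix itself is left untouched: row i now represents aligned[i].
    """
    if len(old_vocab) != len(new_vocab):
        raise ValueError("this sketch assumes equal vocabulary sizes")
    old_set, new_set = set(old_vocab), set(new_vocab)
    # Tokens that exist only in the updated vocabulary, in their original order.
    incoming = iter(tok for tok in new_vocab if tok not in old_set)
    aligned = []
    for tok in old_vocab:
        if tok in new_set:
            # Shared token: same index, so its trained embedding row is reused.
            aligned.append(tok)
        else:
            # Dropped token: its slot (and stale embedding row) goes to a new token.
            aligned.append(next(incoming))
    return aligned


old = ["<unk>", "a", "b", "the", "zz"]
new = ["<unk>", "a", "the", "kk_tok", "b"]
print(substitute_vocabulary(old, new))
```

In this toy case only the slot of the dropped token `zz` changes meaning; every other embedding row still corresponds to the token it was trained for, which is why a large overlap should preserve most of the model's behaviour.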

Experiments
In this section, we outline the set of experiments we conducted in this work. We first discuss the languages and data sources we use for our experiments. We then provide the training details for our initial translation models. Next, we compute the token overlap between various vocabularies derived from BPE before and after we include data for a new language and empirically verify that this overlap is large if the vocabulary already supports a large number of languages. We then examine the amount of knowledge retained after vocabulary substitution by measuring the degradation of the translation performance on the original language pairs from replacing the original vocabulary with an updated one. Finally, we examine the speed and quality of the adaptation to new languages under various settings.
Languages considered Our initial model has access to data from 24 languages². Our monolingual data comes primarily from the newscrawl datasets³ and Wikipedia, while the parallel data comes from the WMT training sets and Paracrawl. We adapt our model to the following four languages: Kazakh, which is not related linguistically to any of the original 24 languages, but does share scripts with Russian and Bulgarian; Bengali, which is related to the other Indo-Aryan languages but possesses a distinct script; Polish, which is related to (and shares scripts with) many of the Slavic languages in our original set; and Pashto, which is not closely related⁴ to any of the languages in our original set and has a distinct script. We provide an in-depth account of the data available for each language in the appendix.
¹ Tokens shared between the two vocabularies are also forced to share the same indices. The remaining tokens are rewritten but we still reuse the outdated embeddings.
³ http://data.statmt.org/news-crawl/

Model configurations
We perform our experiments in JAX (Bradbury et al., 2018), using the neural network library FLAX⁵. We use Transformers (Vaswani et al., 2017) as the basis of our translation models. We use the Transformer Big configuration and a shared BPE model of 64k tokens with byte-level fallback using the Sentencepiece⁶ library. We use a maximum sequence length of 100 and discard all longer sequences during training.
Sampling scheme We train our models leveraging both monolingual and parallel datasets, following previous work (Siddhant et al., 2020; Garcia et al., 2020) [...]
Table 2: BLEU scores on the new languages. The "monolingual" models have access to exclusively monolingual data for the new language(s), while "monolingual & parallel" models add parallel data as well. Models with "xx" add a single language, while "4xx" models add four languages together.
[...] report the results in Table 1. In the multilingual setting, we attain large token overlap, more than 90%, even for languages with distinct scripts or when we add multiple languages at once. We extend this analysis to different vocabulary sizes and examine which tokens are "lost" in Appendix A.3.
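As an illustration of the token-overlap measurement, the following sketch computes the fraction of the original vocabulary that survives in the updated one, together with the tokens that are lost. The exact definition used for Table 1 is not spelled out here, so normalising by the size of the original vocabulary is an assumption, as are the function names.

```python
from typing import Iterable, Set


def token_overlap(old_vocab: Iterable[str], new_vocab: Iterable[str]) -> float:
    """Fraction of tokens in the original vocabulary that also appear in the updated one."""
    old_set, new_set = set(old_vocab), set(new_vocab)
    if not old_set:
        raise ValueError("original vocabulary is empty")
    return len(old_set & new_set) / len(old_set)


def lost_tokens(old_vocab: Iterable[str], new_vocab: Iterable[str]) -> Set[str]:
    """Tokens of the original vocabulary that are absent from the updated one."""
    return set(old_vocab) - set(new_vocab)


old = ["<unk>", "the", "ing", "tion"]
new = ["<unk>", "the", "ing", "new_script_tok"]
print(token_overlap(old, new), lost_tokens(old, new))
```

With real Sentencepiece models, the two token lists would be read from the vocabulary files produced before and after adding the new language.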

Evaluating translation quality and catastrophic forgetting
Measuring the deterioration from swapping vocabularies at inference To measure the amount of knowledge transferred through the vocabulary substitution, we compute the translation performance of our initial translation model with the adapted vocabularies without any additional updates. For each new language, we compute the change in BLEU between the model with its original vocabulary and the one using the adapted vocabulary and plot the results in Figure 1. Notably, we incur only minor degradation in performance from the vocabulary substitution.
We now study the effect of introducing a new language into our translation model. We require an adaptation recipe with the following properties: fast, in terms of the number of additional gradient steps; performant, in terms of BLEU scores on the new language pair; and retentive, in terms of minimal regression in the translation performance of the model on the original language pairs. Our solution is to re-compute the probabilities for the temperature-based sampling scheme using the new data, upscale the probabilities of sampling the new datasets by a factor, and then rescale the remaining probabilities so that the probabilities again sum to one. We limit ourselves to either 15k or 30k additional steps (3% and 6% respectively of the training time for the original model), depending on the data available⁹, to ensure fast adaptation. We reset the Adam optimizer's stored accumulators, reset the learning rate to 5e-5 and keep it fixed. We provide more details in Appendix A.2. Aside from these modifications, we continue training with the same objectives as before unless noted otherwise. We include the results for oracle models trained in the same way as the original model but with access to both the adapted vocabulary and the missing data. We compute the BLEU scores and report them in Table 2.
⁹ We use 15k steps if we leverage both monolingual and parallel data for a single language pair. We use 30k steps if we only use monolingual data or if we are adapting to all four languages at once.
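The resampling step of the adaptation recipe can be sketched as follows. The sketch assumes the common form of temperature-based sampling, in which the probability of a dataset is proportional to its share of the data raised to the power 1/T, and it reads "rescale the remaining probabilities" as shrinking the original datasets' probabilities so that the total is one. The temperature, the upscaling factor, the dataset names and sizes, and both function names are hypothetical placeholders; the values actually used are not given in this section.

```python
from typing import Dict, Set


def temperature_probs(sizes: Dict[str, int], temperature: float) -> Dict[str, float]:
    """Temperature-based sampling: p_i proportional to (n_i / sum_j n_j) ** (1 / T)."""
    total = sum(sizes.values())
    weights = {name: (n / total) ** (1.0 / temperature) for name, n in sizes.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


def upscale_new(probs: Dict[str, float], new_names: Set[str], factor: float) -> Dict[str, float]:
    """Multiply the probabilities of the new datasets by `factor`, then shrink the
    original datasets' probabilities so that everything sums to one again."""
    new_mass = factor * sum(probs[name] for name in new_names)
    old_mass = sum(p for name, p in probs.items() if name not in new_names)
    if new_mass >= 1.0 or old_mass <= 0.0:
        raise ValueError("factor too large or no original datasets left to rescale")
    scale = (1.0 - new_mass) / old_mass
    return {name: (factor * p if name in new_names else scale * p)
            for name, p in probs.items()}


# Hypothetical dataset sizes (number of examples) after the new language arrives.
sizes = {"en-de": 4000, "en-fr": 4000, "en-kk": 100}
base = temperature_probs(sizes, temperature=5.0)      # assumed temperature
adapted = upscale_new(base, {"en-kk"}, factor=3.0)    # assumed upscaling factor
print(adapted)
```

Upscaling only the new datasets lets the model see the new language often within the limited step budget, while the renormalisation keeps the relative proportions among the original language pairs unchanged.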
Our models adapted with parallel data are competitive with the oracle models, even when we add all four languages at once and despite the restrictions we imposed on our adaptation scheme. For languages that share scripts with the original ones (Kazakh and Polish), we can also attain strong performance leveraging monolingual data alone, although we need to introduce back-translation (Sennrich et al., 2015a) for optimal performance. We can also adapt the translation model using the original vocabulary, but the quality lags behind the models using the adapted vocabularies. This gap is larger for Bengali and Pashto, where the model is forced to rely on byte-level fallback, further reaffirming the value of using the adapted vocabularies.
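To see why relying on byte-level fallback is costly for unseen scripts, consider the simplified sketch below. It is a character-level stand-in for a real subword tokenizer, not Sentencepiece itself: any character missing from the vocabulary is replaced by one token per UTF-8 byte, written in the `<0xNN>` style that Sentencepiece uses for its byte pieces. The function name and the toy vocabulary are assumptions for illustration.

```python
from typing import List, Set


def encode_with_byte_fallback(text: str, vocab: Set[str]) -> List[str]:
    """Character-level toy tokenizer with byte-level fallback for unknown characters."""
    pieces: List[str] = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # Unknown character: emit one token per UTF-8 byte instead of <UNK>.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


# U+0995 is the Bengali letter KA; it occupies three bytes in UTF-8.
print(encode_with_byte_fallback("a\u0995", {"a"}))
# → ['a', '<0xE0>', '<0xA6>', '<0x95>']
```

A single Bengali character thus expands into three tokens under the original vocabulary, so sentences become much longer and the model must compose characters from bytes, whereas the adapted vocabulary can represent the same text with dedicated tokens.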
To examine whether catastrophic forgetting has occurred, we proceed as in Section 3.1 and compare the performance on the original language pairs after adaptation on the new data against the oracle model, which had access to this data at the beginning of training. We present the results for the models adapted to Kazakh in Figure 2. The performance of all the models on the original language pairs deviates only slightly from that of the oracle model, mitigating some of the degradation from the vocabulary substitution (compare the kk and bn+pl+kk+ps curves in Figure 1 to the curves in Figure 2).
Lastly, we compare our models with external baselines for Kazakh. We consider the multilingual model mBART (Liu et al., 2020) as well as all the WMT submissions that reported results on English ↔ Kazakh. Of these baselines, only mBART and Kocmi et al. (2018) use sacreBLEU, which inhibits proper comparison with the rest of the models. We include them for completeness. We report the scores in Table 3. Our adapted models are able to outperform mBART in both directions, as well as some of the weaker WMT submissions, despite those models specifically optimizing for that language pair and task.

Conclusion
We present an approach for adding new languages to multilingual translation models. Our approach allows for rapid adaptation to new languages with distinct scripts with only a minor degradation in performance on the original language pairs.
[...] vocabulary which are not in the adapted vocabulary for that language in Figure 3. Critically, we observe that most of the tokens lost are towards the end of the spectrum, suggesting that the model is mostly discarding infrequent tokens. Notably, it cannot discard the tail due to our requirement of full character coverage, which introduces a variety of rare Unicode characters as tokens that reside in the tail.