Language Adapters for Zero Shot Neural Machine Translation

We propose a novel adapter layer formalism for adapting multilingual models. These layers are more parameter-efficient than existing adapter layers while performing as well or better. Each layer is specific to one language (as opposed to bilingual adapters), which allows them to be composed and to generalize to unseen language pairs. In this zero-shot setting, they obtain a median improvement of +2.77 BLEU points over a strong 20-language multilingual Transformer baseline trained on TED talks.


Introduction
Of the many virtues of multilingual neural machine translation (MNMT), arguably the most attractive is the promise of improving performance in the low-resource setting (Johnson et al., 2017; Arivazhagan et al., 2019; Dabre et al., 2020). These models even allow for the extreme case, namely translating in language directions that are unseen at training time (the zero-shot setting in this paper). Unfortunately, while performance in the low-resource setting does increase significantly, zero-shot performance remains very low (Johnson et al., 2017). In this paper, we propose a neural architecture that can translate from any of the source languages into any of the target languages seen in the training data, regardless of whether that specific language direction was present during training. To do so, we build upon the recently proposed adapter layers for NMT (Bapna and Firat, 2019), using monolingual (language-specific) adapter layers instead of bilingual (language-pair-specific) ones. This design difference improves their compositionality, making it possible to combine any encoder adapter with any decoder adapter. Monolingual adapter layers perform as well as bilingual adapter layers in the non-zero-shot setting, are effective in the zero-shot setting, and have the additional advantage of requiring fewer parameters.
* Work done during an internship at NAVER LABS Europe.

Related Work
Zero-shot translation is direct translation in a language pair unseen during training. Aharoni et al. (2019) analyze the zero-shot performance of MNMT models as a function of the number of language pairs and observe that having more languages results in better zero-shot performance. However, several issues arise, as described by Aharoni et al. (2019), Arivazhagan et al. (2019), Dabre et al. (2020) and Zhang et al. (2020), such as off-target translation and insufficient modeling capacity of the MNMT models. Zhang et al. (2020) use language-aware layer normalization and linear transformation to address some drawbacks of MNMT; they also rely heavily on back-translation to improve zero-shot translation.
[Figure 1 caption, partially recovered] We use languages as the tasks in the encoder and in the decoder. xx and yy denote source and target languages respectively.
Adaptation to a new language pair may be addressed by training a multilingual model and then fine-tuning it with parallel data in the language pair of interest (Neubig and Hu, 2018; Variš and Bojar, 2019; Stickland et al., 2020). Escolano et al. (2020) propose plug-and-play encoders and decoders per language, which take advantage of a single representation in each language but at the cost of larger model sizes. In order to add only a few trainable parameters per task, adapter modules, initially introduced for computer vision (Rebuffi et al., 2017, 2018), were proposed for language modeling by Houlsby et al. (2019). Bapna and Firat (2019) used them for parameter-efficient adaptation in MNMT. The parameters of the original MNMT network (the parent model) remain fixed, which permits a high degree of parameter sharing. The final multilingual model (the adapted model) is just slightly larger than the original one. Bapna and Firat (2019) show that adapters mitigate one major problem of MNMT models: performance drop in high-resource languages.
The motivation of that work was not zero-shot translation, and it is not obvious how to use these adapters in such a scenario, as the adapter layers are language-pair specific. While in Section 5 we propose a way of using those adapters through pivoting adapter layers, the main contribution of this paper is monolingual adapters, which allow combining any encoder adapter with any decoder adapter.

Monolingual adapters
Adapter modules (Rebuffi et al., 2017; Houlsby et al., 2019) were formulated for NMT by Bapna and Firat (2019): lightweight adapter layers are transplanted between the layers of a pre-trained network and fine-tuned on the adaptation corpus. As shown in Figure 1 (left), an adapter layer is a down projection to a bottleneck dimension followed by an up projection to the initial dimension. The bottleneck limits the number of parameters of the adapter module. The residual connection, coupled with a near-identity initialization, enables a pass-through and helps preserve at least the performance of the parent model. In their initial formulation, Bapna and Firat (2019) proposed adapters for each language pair (bilingual adapters), while we propose monolingual adapters.
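The adapter computation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the ReLU nonlinearity, the absence of bias terms and layer normalization, the small Gaussian initialization, and the names `make_adapter` and `adapter_forward` are ours and are not taken from the paper.

```python
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def make_adapter(d_model, bottleneck, init_std=1e-3, seed=0):
    """Create down/up projection weights with small random values.

    Small weights keep the adapter's contribution close to zero, so the
    layer starts out as a near-identity function (assumed init scheme).
    """
    rng = random.Random(seed)
    down = [[rng.gauss(0.0, init_std) for _ in range(d_model)]
            for _ in range(bottleneck)]
    up = [[rng.gauss(0.0, init_std) for _ in range(bottleneck)]
          for _ in range(d_model)]
    return {"down": down, "up": up}

def adapter_forward(adapter, hidden):
    """Down projection, ReLU (assumed), up projection, then residual sum."""
    bottleneck_act = [max(0.0, z) for z in matvec(adapter["down"], hidden)]
    update = matvec(adapter["up"], bottleneck_act)
    return [h + u for h, u in zip(hidden, update)]

# Toy hidden state of dimension 8 with a bottleneck of 2.
hidden = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -2.0]
adapter = make_adapter(d_model=8, bottleneck=2)
output = adapter_forward(adapter, hidden)
max_change = max(abs(o - h) for o, h in zip(output, hidden))
```

With this initialization the output stays very close to the input, which is the pass-through behaviour that lets the adapted model start from the parent model's predictions.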
We illustrate the mechanism in Figure 1: our monolingual adapter layers are inserted into each of the Transformer encoder and decoder layers. When translating from language xx to language yy, we only activate the encoder adapter layers for xx, denoted by $\theta^{E}_{xx}$, and the decoder adapter layers for yy, denoted by $\theta^{D}_{yy}$.
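As a rough illustration of this selection rule, the sketch below keeps one encoder adapter set and one decoder adapter set per language and picks the pair to activate for a given direction. The language codes, the string placeholders standing in for parameters, and the function names are hypothetical.

```python
def build_monolingual_bank(languages):
    """One encoder adapter set and one decoder adapter set per language.

    Strings stand in for the actual parameters theta^E_xx and theta^D_yy.
    """
    return {
        "encoder": {lang: f"theta_E_{lang}" for lang in languages},
        "decoder": {lang: f"theta_D_{lang}" for lang in languages},
    }

def select_adapters(bank, src_lang, tgt_lang):
    """Activate the source-language encoder set and target-language decoder set."""
    return bank["encoder"][src_lang], bank["decoder"][tgt_lang]

languages = ["en", "fr", "de", "ja"]  # hypothetical subset of languages
bank = build_monolingual_bank(languages)

# fr->de needs no fr->de specific parameters: it reuses the fr encoder
# set and the de decoder set.
active = select_adapters(bank, "fr", "de")
n_sets = len(bank["encoder"]) + len(bank["decoder"])
```

Because the lookup is keyed by language rather than by language pair, any source language can be combined with any target language.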
[Table 1, column headers recovered: adapt, #tasks, + params/task, zero-shot]
Our formulation is different from Bapna and Firat (2019), who propose adapter layers for each language direction ($\theta_{xx \to yy}$). In a multiparallel setting (i.e., where parallel data is available for all language pairs), this requires training n(n − 1) sets of layers, where n is the number of languages. Our monolingual (language-specific) adapters only require 2n sets of layers (one encoder set and one decoder set per language). Table 1 summarizes the number of parameters needed for adaptation with regular fine-tuning (FT), bilingual adapters (Bapna and Firat, 2019) and our proposed monolingual adapters. In our setting of 20 languages, fine-tuning would multiply the number of parameters by 380 (20 × 19). As the bottleneck dimension determines the increase in parameters, we experiment with both 64 (used in past work) and 1024, which matches the total number of parameters for bilingual adapters (see Table 2).
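The counting argument can be checked with a short back-of-the-envelope script. It assumes one adapter per Transformer layer, each made of a down and an up projection with bias terms and nothing else; the paper's exact accounting in Table 2 may differ, so the totals below are indicative only.

```python
def adapter_params(d_model, bottleneck, with_bias=True):
    """Parameters of one adapter: down projection plus up projection."""
    params = 2 * d_model * bottleneck
    if with_bias:
        params += bottleneck + d_model
    return params

def total_adapter_params(scheme, bottleneck, n_langs=20, d_model=512,
                         enc_layers=6, dec_layers=6):
    """Total added parameters under the assumptions stated above."""
    per_layer = adapter_params(d_model, bottleneck)
    if scheme == "bilingual":
        # One encoder stack and one decoder stack per language direction.
        return n_langs * (n_langs - 1) * (enc_layers + dec_layers) * per_layer
    if scheme == "monolingual":
        # One encoder stack and one decoder stack per language.
        return n_langs * (enc_layers + dec_layers) * per_layer
    raise ValueError(f"unknown scheme: {scheme}")

n = 20
bilingual_sets = n * (n - 1)   # 380 language directions
monolingual_sets = 2 * n       # 40 encoder/decoder sets

biling_64 = total_adapter_params("bilingual", 64)
mono_64 = total_adapter_params("monolingual", 64)
mono_1024 = total_adapter_params("monolingual", 1024)
```

Under these assumptions, monolingual adapters with a bottleneck of 1024 stay in the same range as, and slightly below, bilingual adapters with a bottleneck of 64, while monolingual adapters with a bottleneck of 64 are far smaller.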

Datasets
We use the TED talks dataset (Qi et al., 2018) [...] over the test set. The TED talks dataset is multiparallel, i.e., each English sentence has translations in multiple languages. Here, we restrict ourselves to the top 20 languages, resulting in training corpora ranging between 108k and 214k parallel sentences. We use the dataset both as a full multiparallel corpus (data aligned in all directions) and, to simulate an English-centric setting, restricted to the parallel corpora with English as one of the languages.
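The split between the multiparallel and the English-centric setting can be made concrete by enumerating language directions. The language codes other than `en` are placeholders, since the list of 20 languages is not reproduced here.

```python
from itertools import permutations

# "en" plus 19 placeholder codes standing in for the other languages.
languages = ["en"] + [f"lang{i:02d}" for i in range(1, 20)]

# Every ordered (source, target) pair of distinct languages.
all_directions = list(permutations(languages, 2))

# English-centric setting: English is the source or the target.
english_centric = [(s, t) for s, t in all_directions if "en" in (s, t)]

# Directions never seen when training on English-centric data only.
zero_shot = [(s, t) for s, t in all_directions if "en" not in (s, t)]

counts = (len(all_directions), len(english_centric), len(zero_shot))
```

The three counts correspond to the 380 directions of the full multiparallel corpus, the 38 English-centric directions, and the 342 remaining directions.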

Training
Architecture
We use the Transformer architecture (Vaswani et al., 2017), implemented in fairseq (Ott et al., 2019), which we modify to include monolingual and bilingual adapters. We train a joint BPE model (Sennrich et al., 2016) on all languages, with inline casing (Berard et al., 2019) and 64k merge operations (resulting in a 70k vocabulary size). The Transformer architecture used in this work has 4 attention heads, 6 encoder layers, 6 decoder layers, an embedding size of 512 and a feed-forward dimension of 1024.

MNMT Training
We train a standard MNMT model following settings similar to those of Johnson et al. (2017). A single many-to-many model is trained on all the English-centric data, using a source-side control token to indicate the target language. This model, which we call "parent", serves as an initialization for our adapter-enabled models. We use Adam (Kingma and Ba, 2015) with an inverse square root schedule, with 4000 warmup updates and a maximum learning rate of 0.0005. We set the maximum batch size per GPU to 4000 tokens, and train on 4 GPUs with mixed precision (Ott et al., 2018). We apply dropout with a rate of 0.3, and label smoothing with a rate of 0.1. Like Arivazhagan et al. (2019), we mitigate the training size imbalance between language pairs by following a temperature-based sampling strategy with T = 5. To ensure all languages are represented adequately in the vocabulary, we use the same temperature-based sampling strategy for training the BPE model. This MNMT model is trained for 120 epochs over all the English-centric training data (38 language pairs). As shown in Table 3, it is a strong MNMT baseline.
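Two ingredients of this recipe are easy to sketch: temperature-based sampling and the inverse square root schedule. The sampling formula below, with probabilities proportional to the data share raised to the power 1/T, is a common formulation and is assumed here; the linear warmup from zero is likewise an assumption, and the corpus names are placeholders.

```python
def temperature_sampling_probs(sizes, temperature=5.0):
    """Sampling probability per corpus, proportional to (n_i / N) ** (1 / T)."""
    total = sum(sizes.values())
    scaled = {name: (n / total) ** (1.0 / temperature)
              for name, n in sizes.items()}
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000):
    """Linear warmup to peak_lr (assumed), then decay proportional to step ** -0.5."""
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    return peak_lr * (warmup_updates / step) ** 0.5

# Smallest and largest corpus sizes reported for the training data.
sizes = {"pair_small": 108_000, "pair_large": 214_000}
probs_t5 = temperature_sampling_probs(sizes, temperature=5.0)
probs_t1 = temperature_sampling_probs(sizes, temperature=1.0)

lr_at_peak = inverse_sqrt_lr(4000)
lr_later = inverse_sqrt_lr(16000)
```

With T = 1 the sampling is proportional to corpus size, whereas T = 5 flattens the distribution so that smaller corpora are sampled more often than their raw share would suggest.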

Adapter Variations
With monolingual adapters enabled, we optimize the adapter parameters for an additional 60 epochs with the same English-centric data. This setting lets us study the zero-shot capabilities of monolingual adapters. We also consider an "adaptation" setting where the monolingual adapters see data in all language pairs (380).
In this setting, we only optimize adapter parameters for 10 epochs, because the larger amount of data increases training time. We use a bottleneck dimension of 64 for bilingual adapters, and try two values for the monolingual adapters: 64 and 1024. Table 2 shows how many extra parameters are added in each setting.
To train the adapters, we use the same settings as the parent MNMT model but reset the learning rate schedule and freeze all model parameters except the new adapter parameters. We train the 380 sets of bilingual adapters sequentially, as they are independent of each other. However, the monolingual adapters are trained all at once. To do so, we aggregate the training data for all language directions, using the same temperature-based sampling strategy as the parent model. For ease of implementation, we build homogeneous batches (i.e., only containing sentences for one language direction) and only activate the corresponding adapters. An epoch consists of a pass over the training data in all language directions (≈ 160k line pairs × 380 lang dirs ≈ 62M examples in the adaptation case, and ≈ 7.1M examples in the zero-shot case).
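Homogeneous batching can be sketched as grouping examples by language direction before cutting them into batches under a token budget. The example representation (a direction plus a token count) and the function name are simplifications of ours, not the fairseq implementation used in the paper.

```python
from collections import defaultdict

def homogeneous_batches(examples, max_tokens=4000):
    """Yield (direction, batch) pairs where a batch holds one direction only.

    `examples` is an iterable of (src_lang, tgt_lang, n_tokens) triples.
    A single example longer than max_tokens still forms its own batch.
    """
    by_direction = defaultdict(list)
    for src, tgt, n_tokens in examples:
        by_direction[(src, tgt)].append(n_tokens)

    for direction, lengths in by_direction.items():
        batch, used = [], 0
        for n_tokens in lengths:
            if batch and used + n_tokens > max_tokens:
                yield direction, batch
                batch, used = [], 0
            batch.append(n_tokens)
            used += n_tokens
        if batch:
            yield direction, batch

examples = [
    ("en", "fr", 1500), ("en", "de", 2500), ("en", "fr", 1500),
    ("en", "fr", 1500), ("en", "de", 2500),
]
batches = list(homogeneous_batches(examples, max_tokens=4000))
```

Each yielded batch carries its direction, so a training loop could use it to activate only the encoder adapters of the source language and the decoder adapters of the target language.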

Results and Discussion
We evaluate the effectiveness of monolingual adapters in two settings: adaptation, where multiparallel training data is available for the adapters; and in the zero-shot setting where translation is done on unseen language pairs.
[Table caption, partially recovered] (2) Many-to-many MNMT models. The best model for each case is highlighted in italics and the best overall is in bold. Note that "(2) Aharoni et al. (2019)" is a 58-language model. "Parent" is our MNMT model trained on English-centric data. "Parent adaptation" is the same model fine-tuned for 10 epochs on the full multiparallel corpus (similar setting as "Mono-1024 adaptation", but without adapters). "Bilingual baselines" are models trained on one language direction only, with the same architecture as "Parent". Pivot-translating through English with "Parent" gives an average zero-shot performance (xx→yy) of 14.39 BLEU.
[Figure caption, partially recovered] The y-axis shows the absolute BLEU differences with the parent model (trained on English-centric data only). The trendlines are obtained by interpolating a polynomial of degree 7 over the individual points.

Adaptation
In this setting, the adapter layers are trained on multiparallel data in 380 language pairs. Figure 2 shows the absolute difference in BLEU with the parent model, trained on English-centric data only, on each language pair. We compare bilingual adapters of dimension 64 with monolingual adapters of dimension 1024 or 64. As can be seen from the trendlines, while mono-64 performs slightly (but consistently) worse than biling-64, mono-1024 (which has a lower parameter budget than biling-64) obtains even better results, ranging from an absolute difference of -0.22 to +14.43, with a median of +5.59.
Because multilingual models are known to degrade performance on high-resource language directions, we study specifically translation to and from English. For en→yy, mono-1024 (median +1.65) consistently outperforms biling-64 (+1.24) and mono-64 (+0.48) over the 19 language pairs. For xx→en however, biling-64 adapters are slightly superior to both mono-64 and mono-1024 (+1.08 vs +0.09 and +0.50 respectively).

Zero-shot
Monolingual adapters can be naturally used for zero-shot translation, where a new language pair is provided at inference time. For this, we simply use the encoder adapters of the source language and the decoder adapters of the target language. To evaluate zero-shot translation, we use the adapter-enabled models trained on English-centric data and translate the test sets in the 342 language pairs not involving English (19 × 18).
Absolute improvements in BLEU scores of the adapter-enabled models over the MNMT parent model are shown in Figure 3. A median improvement of +1.26 is observed in the mono-64 setting, while the mono-1024 setting brings a median improvement of +2.77. The smallest difference (over the parent model) observed in each case was -0.14 and +0.30 respectively, indicating near-systematic improvement by using monolingual adapter layers. These results demonstrate the compositionality property of our monolingual adapters. Because of the English-centric nature of TEDx, we also apply bilingual adapters to the zero-shot setting. We do this by composing the encoder and decoder adapter layers through a pivot language. That is, to translate xx→yy, we choose the bilingual adapter corresponding to xx→en in the encoder and en→yy in the decoder. As can be seen in Figure 3, this slightly outperforms mono-64 but not mono-1024.
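The pivot composition of bilingual adapters can be written as a small lookup rule. The function name and the representation of an adapter set as a (source, target) key are illustrative assumptions.

```python
def pivot_adapter_keys(src_lang, tgt_lang, pivot="en"):
    """Pick bilingual adapter sets for a zero-shot direction via a pivot.

    The encoder uses the adapters trained for src->pivot and the decoder
    uses the adapters trained for pivot->tgt.
    """
    if pivot in (src_lang, tgt_lang):
        raise ValueError("pivoting only applies to directions without the pivot")
    return {"encoder": (src_lang, pivot), "decoder": (pivot, tgt_lang)}

# Bilingual adapter sets available after English-centric training,
# for a hypothetical subset of languages.
languages = ["fr", "de", "ja"]
trained = {(lang, "en") for lang in languages} | {("en", lang) for lang in languages}

keys = pivot_adapter_keys("fr", "de")
available = keys["encoder"] in trained and keys["decoder"] in trained
```

Both adapter sets needed for a zero-shot direction are available after English-centric training, which is what makes this composition possible without any xx→yy data.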

Ablation Study
We investigate the individual contribution of the encoder and decoder adapter layers at inference time. We compare the full model using mono-1024 adapters against two options: activating (1) only the encoder adapters or (2) only the decoder adapters. The interpolated curves for all language pairs are in Figure 4 for the adaptation setting and in Figure 5 for the zero-shot setting. In the adaptation setting, enabling only the decoder adapters brings a median improvement of +1.03 over the parent model, while enabling only the encoder adapters gives -7.00 BLEU (versus +5.59 when both encoder and decoder adapters are enabled). In the zero-shot setting, the contribution of the encoder is larger (+1.69) than that of the decoder (+0.63), compared to +2.77 when both encoder and decoder adapters are enabled.
We have seen that in the adaptation case, using the encoder adapter layers alone leads to a severe drop in performance. This might indicate that, at least during the adaptation, important information is captured in the encoder's adapter layer (in line with previous reports by Kudugunta et al., 2019) or that the decoder adaptation grows dependent on the encoder adapters, to the point where dropping the latter degrades the system. However, further analysis would be needed to confirm either of these hypotheses.

Conclusion
This work investigated adapter modules and their compositionality for MNMT, in particular in the zero-shot setting. We introduced monolingual adapters and compared them to bilingual adapters, which we also applied to zero-shot translation. Our adaptation experiments show the potential of the proposed monolingual adapters, which outperform bilingual adapters while having fewer parameters. In a zero-shot setting, we naturally compose our monolingual adapters and obtain a median improvement of +2.77 BLEU points over a strong MNMT model. Future work will investigate the compositional capability of these adapters, and combine domain and monolingual adapters for NMT.
More generally, this work adds to the growing evidence of the flexibility of adapter layers (Pfeiffer et al., 2020a), and their potential for lightweight fine-tuning, including in zero-shot scenarios (Pfeiffer et al., 2020b) and in a variety of tasks (Üstün et al., 2020).