Evaluating the Supervised and Zero-shot Performance of Multi-lingual Translation Models

We study several methods for full or partial sharing of the decoder parameters of multi-lingual NMT models. Using only the WMT 2019 shared task parallel datasets for training, we evaluate both fully supervised and zero-shot translation performance in 110 unique translation directions. We use additional test sets and re-purpose evaluation methods recently used for unsupervised MT in order to evaluate zero-shot translation performance for language pairs where no gold-standard parallel data is available. To our knowledge, this is the largest evaluation of multi-lingual translation yet conducted in terms of the total size of the training data we use, and in terms of the number of zero-shot translation pairs we evaluate. We conduct an in-depth evaluation of the translation performance of different models, highlighting the trade-offs between methods of sharing decoder parameters. We find that models which have task-specific decoder parameters outperform models where decoder parameters are fully shared across all tasks.


Introduction
Multi-lingual translation models, which can map from multiple source languages into multiple target languages, have recently received significant attention because of the potential for positive transfer between high- and low-resource language pairs, and because of the potential efficiency gains enabled by translation models which share parameters across many languages (Dong et al., 2015; Ha et al., 2016; Firat et al., 2016; Blackwood et al., 2018). Multi-lingual models which share parameters across tasks can also perform zero-shot translation, translating between language pairs for which no parallel training data is available (Ha et al., 2016). Although multi-task models have recently been shown to achieve positive transfer for some combinations of NLP tasks, in the context of MT, multi-lingual models do not universally outperform models trained to translate in a single direction when sufficient training data is available. However, the ability to do zero-shot translation may be of practical importance in many cases, as parallel training data is not available for most language pairs. Therefore, small decreases in the performance of supervised pairs may be admissible if the corresponding gain in zero-shot performance is large. In addition, zero-shot translation can be used to generate synthetic training data for low- or zero-resource language pairs, making it a practical alternative to the bootstrapping-by-back-translation approach that has recently been used to build completely unsupervised MT systems (Firat et al., 2016; Artetxe et al., 2018; Lample et al., 2018a,b). Therefore, understanding the trade-offs between different methods of constructing multi-lingual MT systems remains an important line of research.
Deep sequence-to-sequence models have become the established state of the art for machine translation. The dominant paradigm continues to divide models into roughly three high-level components: embeddings, which map discrete tokens into real-valued vectors; encoders, which map sequences of vectors into an intermediate representation; and decoders, which use the representation from an encoder, combined with a dynamic representation of the current state, to output a sequence of tokens in the target language conditioned upon the encoder's representation of the input. For multi-lingual systems, any combination of encoder and/or decoder parameters can potentially be shared by groups of tasks, or duplicated and kept private for each task.

Figure 1: Decoder-parameter sharing options in the transformer (Vaswani et al., 2017). All parameters may be shared across all target tasks, or a unique set of decoder parameters may be created for each task (outer dashed line). Alternatively, unique attention parameters may be created for each task, while the final feed-forward layers are shared (inner dotted lines). The possibility of including an embedding for the target task is visualized at the bottom of the diagram.

Our work builds upon recent research on many-to-one, one-to-many, and many-to-many translation models. We are interested in evaluating many-to-many models under realistic conditions, including:
1. A highly imbalanced amount of training data available for different language pairs.
2. A very diverse set of source and target languages.
3. Training and evaluation data from many domains.
We focus on multi-layer transformer models (Vaswani et al., 2017), which achieve state-of-the-art performance on large-scale MT and NLP tasks (Devlin et al., 2018; Bojar et al., 2018). We study four ways of building multi-lingual translation models. Importantly, all of the models we study can do zero-shot translation: translating between language pairs for which no parallel data was seen at training time. The models use training data from 11 distinct languages, with supervised data available from the WMT19 news-translation task for 22 of the 110 unique translation directions. This leaves 88 translation directions for which no parallel data is available; we evaluate zero-shot translation performance on all of these additional directions.
Target Language Specification

Although the embedding and encoder parameters of a multi-lingual system may be shared across all languages without any special modification to the model, decoding from a multi-lingual model requires a means of specifying the desired output language. Previous work has accomplished this in different ways:
• pre-pending a special target-language token to the input
• using an additional embedding vector for the target language (Lample and Conneau, 2019)
• using unique decoders for each target language (Luong et al., 2016; Firat et al., 2016)
• partially sharing some of the decoder parameters while keeping others unique to each target language (Blackwood et al., 2018)
However, to the best of our knowledge, no side-by-side comparison of these approaches has been conducted. We therefore train models which are identical except for the way that decoding into different target languages is handled, and conduct a large-scale evaluation. We use only the language pairs and official parallel data released by the WMT task organisers, meaning that all of our systems correspond to the constrained setting of the WMT shared task, and our experimental settings should thus be straightforward to replicate.

Multi-Task Translation Models
This section discusses the key components of the transformer-based NMT model, focusing on the various ways to enable translation into many target languages. We use the terms source/target task and language interchangeably, to emphasize our view that multi-lingual NMT is one instantiation of the more general case of multi-task sequence to sequence learning.

Shared Encoders and Embeddings
In this work, we are only interested in ways of providing target task information to the model - information about the source task is never given explicitly, and the encoder is always completely shared across all tasks. The segmentation model and embedding parameters are also shared between all source and target tasks (see below for more details).

Fully Shared Decoders (PREPEND)

Prior work showed that, as long as a mechanism exists for specifying the target task, it is possible to share the decoder module's parameters across all tasks. In the case where all parameters are shared, the decoder must therefore learn to operate in a number of distinct modes which are triggered by some variation in the input. A simple way to achieve this variation is by pre-pending a special "task token" to each input. We refer to this method as PREPEND.
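As a minimal sketch, the PREPEND mechanism amounts to adding a single token to each source sequence; the `<2xx>` token format below is an illustrative assumption, not necessarily the exact format used in our systems:

```python
# Sketch of PREPEND: mark the desired output language by prepending a
# special task token to the source token sequence. The "<2xx>" naming
# convention is an assumption for illustration.
def prepend_task_token(source_tokens, target_lang):
    """Return the source sequence with a target-task token prepended."""
    return [f"<2{target_lang}>"] + source_tokens

print(prepend_task_token(["Guten", "Morgen"], "en"))
# → ['<2en>', 'Guten', 'Morgen']
```

The decoder then has to learn to associate this extra input token with the desired output mode.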

Task Embeddings (EMB)
An alternative to the use of a special task token is to treat the target task as an additional input feature, training a unique embedding for each target task (Lample and Conneau, 2019) which is combined with the source input. This technique has the advantage of explicitly decoupling target task information from source task input, while introducing only a relatively small number of additional parameters. This approach can be seen as adding a token-level feature which is the same for all tokens in a sequence (Sennrich and Haddow, 2016). We refer to this setting as EMB.
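A minimal sketch of the EMB variant follows, with toy dimensions and random initialization chosen purely for illustration (the real models use the transformer-base configuration):

```python
import numpy as np

# Sketch of EMB: a learned per-target-language embedding is added to every
# token embedding in the source sequence. Dimensions and init are toy values.
rng = np.random.default_rng(0)
d_model, n_langs, vocab_size = 8, 11, 100
token_emb = rng.normal(size=(vocab_size, d_model))  # ordinary token embeddings
lang_emb = rng.normal(size=(n_langs, d_model))      # one vector per target task

def embed_with_task(token_ids, target_lang_id):
    # The same language vector is broadcast across all positions, acting as
    # a token-level feature that is constant over the whole sequence.
    return token_emb[token_ids] + lang_emb[target_lang_id]

out = embed_with_task([3, 7, 1], target_lang_id=4)
print(out.shape)  # (3, 8)
```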

Task-specific Decoders (DEC)
In general, any subset of decoder parameters may be replicated for each target language, resulting in parameter sets which are specific to each target task. At one extreme, the entire decoder module may be replicated for each target language, a setting which we label DEC (Dong et al., 2015).
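The run-time routing in the DEC setting can be sketched as follows; the decoder internals are stubbed out, since only the per-language parameter selection is of interest here:

```python
# Sketch of DEC: the entire decoder module is replicated per target language
# and selected at run time. Real decoder internals are replaced by a stub.
class StubDecoder:
    def __init__(self, lang):
        self.lang = lang  # stands in for this decoder's private parameters

    def decode(self, encoder_states):
        return f"<{self.lang}>" + " ".join(encoder_states)

# one full (stub) decoder per target language
decoders = {lang: StubDecoder(lang) for lang in ["en", "de", "cs", "gu"]}

def translate(encoder_states, target_lang):
    return decoders[target_lang].decode(encoder_states)

print(translate(["h1", "h2"], "de"))  # → <de>h1 h2
```

A practical consequence, visible in our low-resource results, is that each decoder's parameters are only updated by mini-batches for its own target language.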

Task-specific Attention (ATTN)
An approach somewhere in between EMB and DEC is to partially share some of the decoder parameters, while keeping others unique to each task. Recent work proposed creating unique attention modules for every target task, while sharing the other decoder parameters (Blackwood et al., 2018). The implementations of these approaches differ significantly; we propose to create completely unique attention parameters for each task. This means that for each of our 11 languages, we have unique context- and self-attention parameters in each layer of the transformer decoder. We refer to this setting as ATTN.
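The resulting parameter layout for a single decoder layer can be sketched as follows, with toy dimensions; the point is which parameter groups are duplicated per language and which are shared:

```python
import numpy as np

# Sketch of ATTN for one decoder layer: self- and context-attention weights
# are duplicated per target language, the feed-forward weights are shared.
# Shapes and initialization are toy values, not the paper's configuration.
rng = np.random.default_rng(0)
d = 8
langs = ["de", "cs", "fr"]

decoder_layer = {
    "self_attn":    {l: rng.normal(size=(d, d)) for l in langs},  # per-task
    "context_attn": {l: rng.normal(size=(d, d)) for l in langs},  # per-task
    "feed_forward": rng.normal(size=(d, d)),                      # shared
}

def params_for(target_lang):
    return (decoder_layer["self_attn"][target_lang],
            decoder_layer["context_attn"][target_lang],
            decoder_layer["feed_forward"])

sa_de, ca_de, ff_de = params_for("de")
sa_cs, ca_cs, ff_cs = params_for("cs")
print(ff_de is ff_cs, sa_de is sa_cs)  # True False
```

The shared feed-forward block is updated by every mini-batch, while each attention block is only updated by mini-batches targeting its language.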

Experiments
All experiments are conducted using the transformer-base configuration of Vaswani et al. (2017), with the relevant modifications for each system discussed in the previous section. We use a shared sentencepiece segmentation model with 32,000 pieces. We use all available parallel data from the WMT19 news-translation task for training, with the exception of commoncrawl, which we found to be very noisy after manually checking a sample of the data, and paracrawl, which we use only for . We train each model on two P100 GPUs with an individual batch size of up to 2048 tokens. Gradients are accumulated over 8 mini-batches (4 per GPU) and parameters are updated synchronously, meaning that our effective batch size is 2 * 2048 * 4 = 16384 tokens per iteration. Because the task pair for each mini-batch is sampled according to our policy weights and (fixed) random seed, and each iteration consists of 8 unique mini-batches, a single parameter update can potentially contain information from up to 8 unique task pairs. We train each model for 100,000 iterations without early stopping, which takes about 40 hours per model. When evaluating, we always use the final model checkpoint (i.e. the model parameters saved after 100,000 iterations). We use our in-house research NMT system, which is heavily based upon OpenNMT-py (Klein et al., 2017).

The sampling policy weights were specified manually by looking at the amount of available data for each pair and estimating the difficulty of each translation direction. The result of the sampling policy is that lower-resource language pairs are upsampled significantly. Table 1 summarizes the statistics for each language pair. Note that the data in each row represents a pair of tasks, i.e. the total number of segments seen for EN-CS is split evenly between EN→CS and CS→EN. Because we train for only 100,000 iterations, we do not see all of the available training data for some high-resource language pairs.

Table 1: Training dataset statistics for our multi-lingual NMT experiments. # seen is the total number of segments seen during training. # available is the number of unique segments available in the parallel training datasets. # epochs is the number of passes made over the available training data; when this is < 1, the available training data was only partially seen. % budget is the percentage of the training budget allocated to this pair of tasks.
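The per-iteration sampling procedure can be sketched as follows; the policy weights below are invented for illustration, since the actual weights were set manually for each language pair:

```python
import random

# Sketch of the mini-batch sampling policy: each of the 8 mini-batches in an
# iteration draws its task pair according to fixed policy weights, so
# low-resource pairs can be upsampled. These weights are illustrative only.
policy = {"EN-DE": 0.30, "EN-CS": 0.25, "EN-FI": 0.20, "EN-GU": 0.15, "EN-KK": 0.10}

random.seed(1234)  # a fixed seed makes the sampling sequence reproducible

def sample_iteration(n_minibatches=8):
    pairs = list(policy)
    weights = [policy[p] for p in pairs]
    return random.choices(pairs, weights=weights, k=n_minibatches)

tasks = sample_iteration()
print(tasks)  # up to 8 distinct task pairs contribute to one parameter update
```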
With the exception of the system which prepends a target task token to each input, the input to each model is identical. Each experimental setting is mutually exclusive, i.e. in the EMB setting we do not prepend task tokens, and in the ATTN setting we do not use task embeddings. Figure 2 plots the validation performance during training on one of our validation datasets. The language embeddings from the EMB system are visualized in figure 3.

We evaluate the performance of our models in four ways. First, we check performance on the supervised pairs using dev and test sets from the WMT shared task. We then evaluate zero-shot translation performance in several ways. We use the TED talks multi-parallel dataset (Ye et al., 2018) to create gold sets for all zero-shot pairs that occur in the TED talks corpus, and evaluate on those pairs. We also try two ways of evaluating zero-shot translation without gold data. In the first, we do round-trip translation SRC → PIVOT → SRC, and measure performance on the (SRC, SRC) pair; this method is labeled ZERO-SHOT PIVOT. In the second, we use parallel evaluation datasets from the WMT shared tasks (consisting of (SRC, REF) pairs), and translate SRC → PIVOT → TRG, then measure performance on the resulting (TRG, REF) pairs (see below for more details), where the pivot-target language pair is a zero-shot translation task; this method is labeled ZERO-SHOT PARALLEL PIVOT.

Table 2 lists the WMT evaluation dataset that we use for each language pair. In the ZERO-SHOT PIVOT setting, the reference side of the dataset is used as input. Table 3 shows global results for all parallel tasks and all zero-shot tasks, by system. Global scores are obtained by concatenating the segmented outputs for each translation direction, and computing the BLEU score against the corresponding concatenated, segmented reference translations. The results in table 3 are thus tokenized BLEU scores.
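The data flow of the two pivot-based evaluations can be sketched as follows; `translate` is a stand-in for the actual multi-lingual model and merely tags its input so the pipeline is visible:

```python
# Sketch of the two pivot-based zero-shot evaluations. The real system would
# call the trained NMT model; here translation is stubbed so only the data
# flow is shown.
def translate(text, src, trg):
    return f"{text} [{src}->{trg}]"

def zero_shot_pivot(src_segments, src, pivot):
    # round trip SRC -> PIVOT -> SRC; hypotheses are scored against the
    # original source segments
    pivots = [translate(s, src, pivot) for s in src_segments]
    return [translate(p, pivot, src) for p in pivots]

def zero_shot_parallel_pivot(src_segments, src, pivot, trg):
    # SRC -> PIVOT -> TRG, where PIVOT -> TRG is the zero-shot direction;
    # outputs are scored against the references paired with SRC
    pivots = [translate(s, src, pivot) for s in src_segments]
    return [translate(p, pivot, trg) for p in pivots]

hyp = zero_shot_pivot(["hello"], "en", "de")
print(hyp)
```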

Parallel Tasks
In the following results, we report BLEU scores on de-tokenized output, computed with sacrebleu. Therefore, we expect our BLEU scores to be equivalent to those used in the WMT automatic evaluation. We note that across all but the lowest-resource tasks, the model with a unique decoder for each language outperforms all others. However, for EN→GU and EN→KK, the lowest-resource translation directions, the unique-decoder model fails completely, probably because the unique parameters for KK and GU were not updated by a sufficient number of mini-batches (approximately 15,600 for EN→GU and 14,800 for EN→KK).

Zero-shot Translation Tasks
In order to test our models in the zero-shot setting, we adapt an evaluation technique that has recently been used for unsupervised MT: we translate from the source language into a pivot language, then back into the source language, and evaluate the score of the resulting source-language hypotheses against the original source (Lample et al., 2018a). This technique allows us to evaluate all possible translation directions in our multi-directional model. Aware of the risk that the model simply copies through the original source segment instead of translating, we assert that at least 95% of pivot translations' language codes are correctly detected by langid, and pairs which do not meet this criterion for any system are removed from the evaluation for all systems (not just for the system that failed). For all models except EMB, only RU→KK→RU, FI→LT→FI, and ZH→GU→ZH failed this test, but for the EMB model 31 of the 110 translation directions failed (see tables 6 and 7). This result indicates that models which use language embeddings may have a more "fuzzy" representation of the output task, and are much more prone to copying than other approaches to multi-lingual MT. However, even for the languages which passed the language identification filter, we suspect that some copying is occurring for the EMB system, because of the mismatch in results between the ZERO-SHOT PIVOT task and the SUPERVISED, ZERO-SHOT TED, and ZERO-SHOT PARALLEL PIVOT tasks (see table 3). Table 7 (in appendix) contains the results for all possible translation directions and all models in the ZERO-SHOT PIVOT evaluation setting.
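The copy-detection filter can be sketched as follows; a trivial stub stands in for the langid classifier (whose `classify` function returns a (language, score) pair), so that only the 95% criterion itself is shown:

```python
# Sketch of the round-trip language-identification filter. A real run would
# use the langid package; the stub below tags anything with German umlauts
# as German, purely for illustration.
def stub_classify(text):
    return ("de", 0.99) if any(c in text for c in "äöü") else ("en", 0.90)

def passes_filter(pivot_outputs, pivot_lang, classify=stub_classify, threshold=0.95):
    """Keep a direction only if >= 95% of its pivot translations are
    identified as the pivot language."""
    hits = sum(1 for t in pivot_outputs if classify(t)[0] == pivot_lang)
    return hits / len(pivot_outputs) >= threshold

outputs = ["Grüße aus Berlin", "Das ist schön", "hello there"]
print(passes_filter(outputs, "de"))  # only 2/3 detected as German -> False
```

Directions that fail this check for any system are dropped from the evaluation for all systems.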

Zero-Shot Evaluation on TED Talks Corpus
We conduct an additional evaluation on some of the language pairs from the TED Talks multi-parallel corpus (Ye et al., 2018), which has recently been used for the training and evaluation of multi-lingual models. We filter the dev and test sets of this corpus to find segments which have translations for all of EN, FR, RU, TR, DE, CS, LT, and FI, and are at least 20 characters long, resulting in 606 segments. Because this corpus is pre-processed, we first de-tokenize and de-escape punctuation using sacremoses. We then evaluate zero-shot translation for all possible pairs which do not occur in our parallel training data; aggregate results are shown in the third row of table 3.
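The test-set filtering step can be sketched as follows, on an invented toy corpus (the real data is the multi-parallel TED talks corpus):

```python
# Sketch of the TED talks test-set filter: keep only segments that have a
# translation in every language of interest and are at least 20 characters
# long. The toy corpus below is invented for illustration.
LANGS = ["EN", "FR", "RU", "TR", "DE", "CS", "LT", "FI"]

def filter_segments(corpus, langs=LANGS, min_chars=20):
    kept = []
    for seg in corpus:  # seg maps language -> translation (possibly missing)
        if all(lang in seg and len(seg[lang]) >= min_chars for lang in langs):
            kept.append(seg)
    return kept

toy = [
    {l: "x" * 25 for l in LANGS},      # complete, long enough -> kept
    {l: "too short" for l in LANGS},   # complete but < 20 chars -> dropped
    {"EN": "x" * 25, "FR": "y" * 25},  # missing languages -> dropped
]
print(len(filter_segments(toy)))  # 1
```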

Discussion
Our results show that models with either (1) a completely unique decoder for each target language or (2) unique decoder attention parameters for each target language clearly outperform models with fully shared decoder parameters in our setting. It is plausible that the language-independence of encoder output is correlated with the amount of sharing in the decoder module. Because most non-English target tasks only have parallel training data in English, a unique decoder for those tasks only needs to learn to decode from English, not from every possible source task. However, our results show that the ATTN model, which partially shares parameters across target languages, only slightly outperforms the DEC model globally, because of the improved performance of the ATTN model on the lowest-resource tasks (Table 4, Table 7 (in appendix)).

Related Work

Previous work has shown that multi-way NMT systems can be created with minimal modification to the approach used for single-language-pair systems, and that simply prepending a target-task token to source inputs is enough to enable zero-shot translation between language pairs for which no parallel training data is available.
Our work is most similar to a recent study in which many different strategies for sharing decoder parameters are investigated for one-to-many translation models. However, that evaluation setting is constrained to one-to-many models which translate from English into two target languages, whereas our setting is more ambitious, performing multi-way translation between 11 languages. Blackwood et al. (2018) showed that using separate attention parameters for each task can improve the performance of multi-task MT models; this work was the inspiration for the ATTN setting in our experiments.
Concurrently with this work, Aharoni et al. (2019) evaluated a multi-way MT system on a large number of language pairs using the TED talks corpus. However, they focus upon EN-* and *-EN directions, and do not test different model variants.

Conclusions and Future Work
We have presented results which are consistent with recent smaller-scale evaluations of multi-lingual MT systems, showing that assigning unique attention parameters to each target language in a multi-lingual NMT system is optimal when evaluating such a system globally. However, when evaluated at the individual task level, models which have unique decoder parameters for every target task tend to outperform other configurations, except when the amount of available training data is extremely small. We have also introduced two methods of evaluating zero-shot translation performance when parallel data is not available, and we have conducted a large-scale evaluation of translation performance across all possible translation directions in the constrained setting of the WMT19 news-translation task.
In future work, we hope to continue studying how multi-lingual translation systems scale to realistic volumes of training data and large numbers of source and target tasks.

Table 7: Pivot-based translation results in all directions, for all models. Rows indicate source language, columns indicate pivot language. For example, cell (1, 2) contains the results for CS→DE→CS. Runs which did not pass the language identification filter are struck through. The MT-matrix (http://matrix.statmt.org/matrix) was the inspiration for this rendering.