Emergent Communication Pretraining for Few-Shot Machine Translation

While state-of-the-art models that rely upon massively multilingual pretrained encoders achieve sample efficiency in downstream applications, they still require abundant amounts of unlabelled text. Nevertheless, most of the world’s languages lack such resources. Hence, we investigate a more radical form of unsupervised knowledge transfer in the absence of linguistic data. In particular, for the first time we pretrain neural networks via emergent communication from referential games. Our key assumption is that grounding communication on images—as a crude approximation of real-world environments—inductively biases the model towards learning natural languages. On the one hand, we show that this substantially benefits machine translation in few-shot settings. On the other hand, this also provides an extrinsic evaluation protocol to probe the properties of emergent languages ex vitro. Intuitively, the closer they are to natural languages, the higher the gains from pretraining on them should be. For instance, in this work we measure the influence of communication success and maximum sequence length on downstream performance. Finally, we introduce a customised adapter layer and annealing strategies for the regulariser of maximum-a-posteriori inference during fine-tuning. These turn out to be crucial to facilitate knowledge transfer and prevent catastrophic forgetting. Compared to a recurrent baseline, our method yields gains of 59.0%∼147.6% in BLEU score with only 500 NMT training instances and 65.1%∼196.7% with 1,000 NMT training instances across four language pairs. These proof-of-concept results reveal the potential of emergent communication pretraining for both natural language processing tasks in resource-poor settings and extrinsic evaluation of artificial languages.


Introduction
Zero-shot and few-shot learning are notoriously challenging for neural networks (Bottou and Bousquet, 2008; Vinyals et al., 2016; Ravi and Larochelle, 2017). However, they are a prerequisite for natural language processing in most languages, which suffer from the paucity of annotated data (Ponti et al., 2019a). State-of-the-art models rely on knowledge transfer, whereby an encoder is pretrained via language modeling on texts from multiple languages, and subsequently 'fine-tuned' on labelled examples of resource-rich languages (Wu and Dredze, 2019; Conneau et al., 2020) or few examples in a target resource-poor language (Lauscher et al., 2020). However, even the raw texts required for pretraining are scant (Kornai, 2013): for instance, Wikipedia dumps cover 278 languages out of the 7,097 spoken world-wide (Eberhard et al., 2020).
For this reason, we push the idea of cross-lingual knowledge transfer even further, exploring and profiling a setting where not even raw natural language data for a target language are available for unsupervised pretraining. In their stead, we exploit artificial languages emerging from a referential game on raw images (Kazemzadeh et al., 2014; Lazaridou et al., 2017). In particular, we encourage agents to cooperate in identifying images among distractors by communicating over vocabularies whose meanings are unknown. The key intuition is that, whereas lexicalisation is mostly arbitrary (Saussure, 1916), communication grounded in a real-world environment does constrain what languages are likely or possible (Haspelmath, 1999; Croft, 2000). Hence, we hypothesise that communication over raw images offers a favourable inductive bias for natural language tasks.

Figure 1: An overview of the model architecture. Dashed lines denote parameter transfer from the EC pretraining task to the MT fine-tuning task. We stress that during EC pretraining, we do not leverage any image-caption pairs; instead, only unlabelled images are used. During MT fine-tuning, standard seq2seq NMT models are trained on SRC and TRG sentence pairs without any visual information available.
In particular, we experiment with initialising an encoder-decoder model for few-shot neural machine translation with parameters pretrained on emergent communication. In the past, emergent communication has mostly attracted theoretical interest as a tool to shed light on cooperative behaviours, the compositional properties of emergent communication protocols (Lazaridou et al., 2017; Havrylov and Titov, 2017; Cao et al., 2018; Li and Bowling, 2019; Kajić et al., 2020), and natural language evolution (Kottur et al., 2017; Graesser et al., 2019). To our knowledge, this is the first preliminary study on deploying artificial languages from emergent communication in natural language applications.
Conversely, our method also constitutes an extrinsic evaluation protocol to probe the properties of different emergent languages. The underlying assumption is that they should facilitate downstream tasks only to the extent that they share common characteristics with natural languages. In particular, we run in-depth analyses on the impact that the rate of communication success and maximum sequence length have on NMT performance.
For the sake of fully leveraging the pretrained parameters and mitigating overfitting, we also explore several new strategies to perform knowledge transfer. In particular, we customise the adapter layer (Houlsby et al., 2019) and propose annealing strategies for the regularisation term of MAP inference during fine-tuning. We run experiments in NMT between English and four languages (German, Czech, Romanian, and French) in both directions. By virtue of emergent communication pretraining and the proposed transfer strategies, we report gains in BLEU scores when simulating few-shot MT setups for the four target languages: 59.0%∼147.6% over a standard encoder-decoder baseline when 500 training instances are available, and 65.1%∼196.7% when 1,000 training instances are available. Our code is available online at https://github.com/cambridgeltl/ECNMT.

Related Work
Our work lies at the intersection of several prominent research areas: pretraining for transfer learning, emergent communication, few-shot machine translation, and inductive biases for language. Given space constraints, we cannot do full justice to all of them.
Pretraining for Transfer Learning. Unsupervised pretraining on large collections of unlabelled text yields general-purpose contextualized word representations (Peters et al., 2018; Howard and Ruder, 2018) that are beneficial across a range of downstream NLP tasks. The current dominant paradigm is training a Transformer-based deep model (Vaswani et al., 2017) relying on masked language modeling or a similar objective, as proposed in the omnipresent BERT model (Devlin et al., 2019) and its extensions (Liu et al., 2019; Conneau and Lample, 2019; Song et al., 2019; Joshi et al., 2020), and then fine-tuning the model further on a downstream task (Wang et al., 2019).
Often this approach exploits large textual data and deep models spanning even billions of parameters (Conneau et al., 2020; Raffel et al., 2019; Brown et al., 2020). In this work, we refrain from chasing task leaderboards (Linzen, 2020) and instead posit a fundamental question about language learning.
Emergent Communication. The functional aspect of language (Clark, 1996) can be captured by artificial multi-agent games (Kirby, 2002; Mordatch and Abbeel, 2018), in which agents have to communicate about some shared input space (e.g., images). A common emergent communication protocol has been adopted in a large body of recent research: a speaker encodes a piece of information into a sequence of discrete symbols (emergent language), and a listener then aims to decipher the sequence and recover the original piece of information (Lazaridou et al., 2017; Havrylov and Titov, 2017; Lazaridou et al., 2018; Bouchacourt and Baroni, 2018; Chaabouni et al., 2019; Li and Bowling, 2019; Chaabouni et al., 2020; Luna et al., 2020; Kharitonov and Baroni, 2020, inter alia).
The present work is partly inspired by the work of Lee et al. (2018), who train agents to communicate about images with their natural language captions and use their parameters as encoder-decoders for machine translation. However, this framework relies on the availability of natural language captions (whereas we use only artificial languages emerging from raw images). Moreover, it does not cast EC as pretraining followed by few-shot NMT fine-tuning; rather, it learns a model in a single stage. These differences make our approach not only applicable to truly resource-lean languages but also substantially superior in performance on the same dataset, English-German Multi30k (see § 5).
Another strand of recent research (Lowe et al., 2019; Lowe et al., 2020; Lazaridou et al., 2020) aims at enhancing emergent communication success by encouraging agents to imitate natural language data supplied at the beginning of training. Our work goes in the opposite direction and investigates whether an emergent communication protocol pretrained without any human language data can benefit downstream NLP applications such as machine translation.
On the contrary, we ground our neural model in visual knowledge acquired from agent interactions without any observation of human language, and then fine-tune our model on translation tasks even with as few as 500 to 1,000 training instances. We rely on few-shot MT as a standard, well-known, and sound testbed to empirically validate the crucial question of this work, that is, whether emergent communication pretraining without any natural language data can inform models of language.
Inductive Biases for Language. Finally, a series of recent works has investigated how to construct neural models that are inductively biased towards learning new natural languages. This endeavour is motivated both by the need for sample efficiency and by concerns of cognitive realism, as children can acquire language from limited stimuli (Chomsky, 1978). In particular, neural weights reflecting linguistic universals in phonotactics can be learned via approximate Bayesian inference (Ponti et al., 2019b) or meta-learning (McCoy et al., 2020). Papadimitriou and Jurafsky (2020) found that recurrent models pretrained on non-linguistic data with latent structure (such as music or code) facilitate natural language tasks.
To our knowledge, we are the first to propose grounded communication as a non-linguistic source for pretraining, based on the hypothesis that modal and functional knowledge is a crucial inductive bias for fast and effective language acquisition.

The proposed method comprises the standard two stages of transfer learning. First, as detailed in § 3.1, we pretrain two speaker-listener agents via emergent communication on image referential games. We then recombine the pretrained EC agents to construct NMT encoder-decoder networks (see Figure 1), and fine-tune the networks on a small number of parallel sentence pairs, as we describe in § 3.2. At the fine-tuning stage, we also add an Adapter layer between the translation encoder and decoder, and further leverage two variants of regularisation with annealing, which are outlined in § 3.3. An illustrative overview of the proposed method is provided in Figure 1.

Emergent Communication Pretraining
EC pretraining consists in the following referential game: an image is seen only by a speaker, while a listener must guess the correct image among a set of distractors based on a message generated by the speaker. Cooperation and communication therefore arise due to information asymmetry between the two players. This setup follows previous work (Havrylov and Titov, 2017) with one core difference: like Lee et al. (2018), we train two agents, each consisting of a speaker and a listener, one for the source language and another for the target language. Contrary to Lee et al. (2018), who rely on image-caption pairs for the supervised training of the speaker agents, we employ only unlabelled images to train communication protocols in an unsupervised way. The artificial language developed by the agents is not explicitly constrained to resemble any natural language. We denote the two agents as Agent_s = {Speaker_s, Listener_s} and Agent_t = {Speaker_t, Listener_t}. In our implementation, following recent work (Graesser et al., 2019; Resnick et al., 2020; Chaabouni et al., 2020; Kharitonov and Baroni, 2020; Lowe et al., 2020), both speakers and listeners are instantiated as single-layer Gated Recurrent Units (GRUs) (Chung et al., 2014). The pretraining process of Agent_s follows these steps:

Image Set Preparation. Let us denote the set of N images as D_I. At each training step, an input image is randomly chosen from the entire set D_I. Images are represented as 2,048-dimensional feature vectors extracted from a ResNet-50 CNN (He et al., 2016).
Message Generation. Speaker_s takes the input image $I_i$ and outputs a message $m$ describing the image, a sequence of discrete symbols of variable length. The generation comes to a halt when the special end-of-sentence symbol is emitted or the maximum message length $L_{max}$ is reached. Since $m$ comprises discrete symbols, in order to make the model end-to-end differentiable, we adopt the Gumbel-Softmax distribution (Jang et al., 2017; Maddison et al., 2017) to draw samples from categorical distributions of emergent tokens while letting the gradient flow. The generation of the discrete symbol $m_t$ at each time step $t$ can be described as follows:

$h^s_0 = \mathrm{MLP}_1(I_i), \quad m_0 = \texttt{<bos>}, \quad h^s_t = \mathrm{GRU}(m_{t-1}, h^s_{t-1}), \quad m_t \sim \text{Gumbel-Softmax}(\mathrm{MLP}_2(h^s_t))$ (1)

Here, $h^s_t$ represents the hidden state at time step $t$, <bos> stands for the special beginning-of-sentence symbol, and MLP stands for multilayer perceptron. The parameters of $\mathrm{MLP}_1$ are shared by both Speaker and Listener and map image features into input vectors for the GRU layer. A second $\mathrm{MLP}_2$ is used by the Speaker to project each GRU hidden state—one for each time step—into vectors with dimensionality equal to the predefined vocabulary size of the emergent language.
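To make the sampling step concrete, the Gumbel-Softmax relaxation that lets gradients flow through the speaker's discrete token choices can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the authors' implementation; the function names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Draw a relaxed (differentiable) one-hot sample from categorical logits.

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-scaled
    softmax, following Jang et al. (2017) and Maddison et al. (2017).
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-10, 1.0, size=np.shape(logits))
    gumbel = -np.log(-np.log(u))  # Gumbel(0, 1) noise via inverse transform
    return softmax((np.asarray(logits) + gumbel) / tau)
```

As the temperature tau approaches 0, the samples approach one-hot vectors, so the model emits (approximately) discrete symbols while the forward pass remains differentiable.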
Image Inference. Given the input image, the generated message describing the image, and $K$ confounding images, Listener_s must now guess the correct input image among the distractors. To do so, a second GRU layer decodes the message generated by the Speaker as follows:

$h^l_t = \mathrm{GRU}(m_t, h^l_{t-1})$ (2)

Here, $h^l_t$ denotes the Listener's hidden state at time step $t$. The hidden state at the last time step, $h^l_{|m|}$, is used to reason over the correct image $I_i$ and the $K$ distractors $C_i$, and to guess which image is the one described by the Speaker. Given the message $m$ and any image $I$, we define a compatibility score based on the inverse squared error (Lee et al., 2018):

$\mathrm{score}(m, I) = \frac{1}{\lVert h^l_{|m|} - \mathrm{MLP}_1(I) \rVert^2_2}$ (3)

We then minimise the cross-entropy loss, treating the set of compatibility scores as logits, to optimise the agent parameters for the image referential game:

$\mathcal{L}_{\mathrm{EC}} = -\log \frac{\exp(\mathrm{score}(m, I_i))}{\sum_{I \in \{I_i\} \cup C_i} \exp(\mathrm{score}(m, I))}$ (4)

In a nutshell, Speaker_s takes an input from the image domain, then encodes it into a message in the emergent language domain. The message conveys information that has to be transferred back to the image domain by Listener_s in order to solve the cooperative game. The same process is repeated alternating between Agent_s and Agent_t.
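The scoring and game loss described above can be sketched as follows. This is a simplified NumPy illustration under our own naming (including a small epsilon we add for numerical stability); the paper's exact implementation may differ.

```python
import numpy as np

def compatibility_scores(h_msg, image_feats, eps=1e-8):
    """Score each candidate image against the listener's final hidden state.

    Inverse squared error in the spirit of Lee et al. (2018): the score is
    higher when the message representation is close to the image embedding.
    image_feats has one row per candidate (target image plus distractors).
    """
    sq_err = ((image_feats - h_msg) ** 2).sum(axis=-1)
    return 1.0 / (sq_err + eps)  # eps is our addition, to avoid division by zero

def referential_game_loss(h_msg, image_feats, target_idx):
    """Cross-entropy over candidates, treating the compatibility scores as logits."""
    logits = compatibility_scores(h_msg, image_feats)
    logits = logits - logits.max()  # stabilise the log-sum-exp
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]
```

Minimising this loss pushes the listener's message representation towards the target image's embedding and away from the distractors'.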

NMT Fine-Tuning and Adapters
After EC pretraining of Agent_s and Agent_t, we recombine their speaker and listener modules into a standard sequence-to-sequence encoder-decoder neural machine translation (NMT) architecture, as shown in Figure 1. Let us denote a training set of $n$ parallel sentences in the source and the target language as $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$. The model then predicts the output sequence of the $i$-th parallel sentence:

$p(y^{(i)} \mid x^{(i)}) = \prod_{t} p(y^{(i)}_t \mid y^{(i)}_{<t}, x^{(i)})$ (5)

In Eq. (5), $y^{(i)}_{<t}$ represents the first $t-1$ tokens of the target language sentence $y^{(i)}$, and the input sentence $x^{(i)}$ is encoded as a fixed-length hidden vector by the encoder, following the standard sequence-to-sequence procedure (Sutskever et al., 2014). The sequence loss is defined as follows:

$\mathcal{L}_{\mathrm{sequence}} = -\sum_{i=1}^{n} \sum_{t} \log p(y^{(i)}_t \mid y^{(i)}_{<t}, x^{(i)})$ (6)

The source-to-target translation model consists of Listener_s (input: emergent language domain, output: image domain) and Speaker_t (input: image domain, output: emergent language domain); we denote their RNN parameters as $w'$, and these are transferred to MT fine-tuning. After fine-tuning on a small set of source-to-target sentence pairs, the model can perform the translation task. In an analogous manner, the target-to-source model is composed of Listener_t and Speaker_s.
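The sequence loss above is the standard token-level cross-entropy. For a single sentence it can be sketched as follows (an illustrative NumPy snippet with our own naming, not the authors' code):

```python
import numpy as np

def sequence_nll(log_probs, target_ids):
    """Negative log-likelihood of one target sentence under the decoder.

    log_probs: (T, V) array of per-step log-distributions over the target
    vocabulary; target_ids: (T,) gold token indices. Summing the gold tokens'
    log-probabilities and negating gives the per-sentence sequence loss.
    """
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked.sum()
```

The full training loss is simply this quantity summed over all parallel sentence pairs in the batch.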
To compensate for the lack of an intermediate image domain in MT, at the fine-tuning stage we add an Adapter module between the encoder and the decoder. Adapters are small neural modules that contain additional trainable parameters which facilitate quicker and more effective domain adaptation in computer vision (Rebuffi et al., 2017; Rebuffi et al., 2018) and, more recently, in NLP tasks (Houlsby et al., 2019; Stickland and Murray, 2019; Pfeiffer et al., 2020b; Bapna and Firat, 2019; Pfeiffer et al., 2020a). A notable difference compared to prior work is that during the fine-tuning stage we train jointly both the Adapter and the model parameters (which are transferred from EC). Our Adapter modules follow a simple architecture from prior work (Houlsby et al., 2019), and comprise linear layers with residual connections and dropout, as illustrated in Figure 1.
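A bottleneck adapter in the style of Houlsby et al. (2019) can be sketched as follows. This is a generic illustration: the ReLU nonlinearity, the bottleneck width, and the function names are our assumptions, since the paper specifies only linear layers with residual connections and dropout.

```python
import numpy as np

def adapter(h, W_down, b_down, W_up, b_up, drop_mask=None):
    """Bottleneck adapter with a residual connection.

    h: (d,) hidden vector passed from the encoder to the decoder. The input
    is down-projected, passed through a ReLU, up-projected back to d
    dimensions, optionally masked (dropout), and added back to the input, so
    a zero-initialised adapter starts out as the identity function.
    """
    z = np.maximum(0.0, h @ W_down + b_down)  # down-projection + ReLU
    z = z @ W_up + b_up                       # up-projection back to d dims
    if drop_mask is not None:                 # dropout mask during training
        z = z * drop_mask
    return h + z                              # residual connection
```

The residual design is what makes joint training safe here: at initialisation the adapter leaves the EC-pretrained representations untouched, and it only gradually learns the drift away from the image domain.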

Regularisation with Annealing
During fine-tuning, we also add to the objective an annealed regulariser for the encoder-decoder parameters (which, on the other hand, does not apply to the Adapter module). These parameters are initialised using the parameters $w'$ transferred from the EC agents. We can then define a regularisation term that prevents the parameters $w$ from drifting away from their initialisation $w'$ during fine-tuning (Duong et al., 2015):

$R(w) = \frac{\alpha}{2} \lVert w - w' \rVert^2_2$ (7)

where $\alpha$ is a positive real-valued tunable hyper-parameter denoting the strength of the regularisation penalty. Note that this amounts to placing a prior $\mathcal{N}(w', I\alpha^{-1})$ on the encoder-decoder parameters. However, the contribution of the log-prior in Eq. (7) to the posterior probability of the parameters should stay fixed, whereas the contribution of the negative log-likelihood in Eq. (6) should grow linearly with the number of examples. In other words, the likelihood should be able to overwhelm the prior in the limit of infinite data. For this reason, the importance of the regulariser should be gradually decayed over fine-tuning steps. Therefore, we propose two variants of regularisation with annealing, labelled REG-A (exponential decay) and REG-B (inverse multiplicative decay). At fine-tuning step $k$, the penalty strength $\alpha$ in Eq. (7) is replaced by an annealed value $\alpha_k$: REG-A sets $\alpha_k = \lambda^k \alpha$, while REG-B sets $\alpha_k = \alpha / (1 + (1-\lambda)k)$. Here, $\lambda$ is a real-valued hyper-parameter from the interval $[0, 1)$ that controls the decay steepness. The final objective function is then:

$\mathcal{L} = \mathcal{L}_{\mathrm{sequence}} + R(w)$

where $\mathcal{L}_{\mathrm{sequence}}$ is provided by Eq. (6), and $R$ is one of REG-A or REG-B.
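The two annealing schedules can be sketched as follows. The exact decay formulas below are our hedged reading of "exponential decay" and "inverse multiplicative decay" (the paper's precise schedules may differ); alpha = 5 and lam = 0.998 match the hyper-parameters reported in § 4.

```python
def reg_a_strength(alpha, lam, k):
    """REG-A: exponentially decayed regulariser strength at step k.

    Our reading of 'exponential decay': alpha_k = lam**k * alpha,
    with lam in [0, 1) controlling the steepness.
    """
    return (lam ** k) * alpha

def reg_b_strength(alpha, lam, k):
    """REG-B: inverse multiplicative decay, alpha_k = alpha / (1 + (1 - lam) * k).

    Also a hedged reconstruction; for lam near 1 this decays more slowly
    than the exponential schedule at large k.
    """
    return alpha / (1.0 + (1.0 - lam) * k)

def annealed_l2_penalty(w, w_init, strength):
    """Annealed squared-L2 term keeping parameters w near their EC
    initialisation w_init; added to the sequence loss during fine-tuning."""
    return 0.5 * strength * sum((a - b) ** 2 for a, b in zip(w, w_init))
```

In both schedules the penalty starts at alpha and decays towards 0, so the likelihood term dominates as fine-tuning progresses, matching the MAP argument above.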

Experimental Setup
EC Pretraining is based on the MS COCO data set (Lin et al., 2014). We randomly select 50,000 images for training and 5,000 for validation. For each image, a 2,048-dimensional feature vector is extracted from ResNet-50 (He et al., 2016). The input vocabulary size for EC is equal to the BPE vocabulary size used during MT fine-tuning. However, since human language data are excluded, note that there is no alignment between the EC and MT BPE vocabularies. The maximum message length, $L_{max}$, is set to an integer around the average length, in terms of BPE tokens, of the MT training sets: 15-18 in our experiments. We do not impose additional constraints on the length of the generated messages. We later profile the impact of $L_{max}$ on MT performance in § 5. The layer size is 256 for the input embeddings and 512 for the hidden layers. We use Adam (Kingma and Ba, 2015) with a learning rate of 0.001. The dropout rate is set to 0.1 and the Gumbel-Softmax temperature to 1. The number of distracting images is K = 255 during training and K = 127 during evaluation. On the validation set, all EC experimental runs achieve prediction accuracy of >99%, i.e., the listener is almost always able to guess the single correct image from a set of 128 images. We analyse the impact of EC prediction accuracy on few-shot MT performance in § 5.
Machine Translation experiments are conducted on two standard datasets: Multi30k and Europarl. The Multi30k data set (Elliott et al., 2016; Barrault et al., 2018), originally devised for multi-modal MT, contains multilingual captions for ≈ 30k images. We discard the images and run text-only fine-tuning and evaluation on English-German (EN-DE) and English-Czech (EN-CS) in both directions. We rely on the default training set of 29,000 pairs of parallel sentences, which we also subsample to simulate true few-shot scenarios: we randomly select 500, 1,000, and 10,000 sentence pairs for the lower-resource setups. In all experimental runs, we use the original validation set spanning 1,014 sentence pairs and the default test set spanning 1,000 pairs. We also run experiments on Europarl data (Koehn, 2005) from OPUS (Tiedemann, 2009) for two language pairs: English-Romanian (EN-RO) and English-French (EN-FR), again in both directions. We retain only sentences with a length between 5 and 15 words to construct data sets whose average sentence length is similar to that of Multi30k. We then randomly sample 10,000 parallel sentences as our (largest) training set, while two other disjoint random samples of 1,500 sentence pairs each are used for validation and test, respectively. As with Multi30k, we again sample 500 and 1,000 training instances from the full set of 10k examples to simulate few-shot settings.
For each language pair, we lowercase and tokenise the data using byte-pair encoding (BPE) (Sennrich et al., 2016). Our BPE vocabularies are derived from all 29,000 training pairs (for the Multi30k language pairs) and 10,000 training pairs (for the Europarl language pairs). We again use Adam in the same configuration as in EC pretraining, except for setting the dropout rate to 0.2. The hyper-parameters of the annealed regulariser are set to α = 5 and λ = 0.998 based on the scores on the EN-DE validation set (in the 1k training setup) and fixed to those values in all other experiments and for all other language pairs. For a fair comparison, the other hyper-parameters for fine-tuning are set identically to those of the NMT baseline introduced in the next paragraph.
NMT Baseline and Evaluation Details. The baseline NMT model is a standard seq2seq model whose architecture is exactly the same as that of our proposed model, but with randomly initialised parameters (rather than parameters transferred from EC). We extensively search the hyper-parameter space of the baseline model (Sennrich and Zhang, 2019) and adopt the Adam optimiser with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-08, a dropout rate of 0.2, a batch size of 128, a hidden-state size of 512, an embedding size of 256, and a maximum sequence length of 80. For all models, we rely on beam search with beam size 12 for decoding. The evaluation metric is BLEU-4 (Post, 2018).

Results and Analysis
In what follows, we report the NMT results of our proposed model on all language pairs. We then perform an ablation study highlighting the individual contributions of the customised adapter layer, the strategies for annealing the regulariser, and emergent communication pretraining to the final results. Finally, we assess the impact of the rate of communication success and maximum sequence length on downstream NMT performance.
Main Results. The BLEU scores of the model leveraging both EC pretraining and adapters are shown in Table 1 for the Multi30k dataset, and in Table 2 for Europarl. The results reveal sweeping gains on all language pairs and in both translation directions. These are especially accentuated in Czech and Romanian (e.g., +196.7% in EN-CS and +115.1% in RO-EN for 1k samples), which are arguably more distant from English than German and French. This suggests that our method might be particularly suited for the languages that are most challenging in real-world scenarios. Moreover, we note that the gains do not fade away as more training examples become available. For instance, while the relative improvements over the baseline decrease from +132.7% in the 500-shot setting to +29.0% in the 29k-shot setting (DE-EN), the absolute improvements remain consistent (+5.75 BLEU and +6.41 BLEU, respectively). Most importantly, the results clearly suggest the usefulness of EC pretraining on a downstream natural language task.

Ablation Study. In order to disentangle the contribution of each separate component of the full model, in Table 3 we report the results on two language pairs (EN-DE and RO-EN) for different combinations of the recurrent baseline, EC pretraining, adapters, and regulariser annealing strategies. We find that all the components improve the translation quality regardless of the amount of training data. Taking the case of EN-DE 0.5k as an example, the baseline achieves a BLEU of 4.28. On top of this, EC pretraining alone boosts the result to 6.48, and adding the adapter layer alone to 5.21. When EC and adapters are combined, they yield a BLEU of 7.52.
Interestingly, the only finding that runs counter to this pattern is that combining emergent communication pretraining with regulariser annealing decreases performance. Lastly, by comparing the two strategies to anneal the regulariser during maximum-a-posteriori inference, we find no evidence favouring one over the other. Table 1, Table 2, and Table 3 show that while REG-A (exponential decay) achieves equal or better performance in the 0.5k and 1k settings, REG-B (inverse multiplicative decay) shows its strength in the 10k and 29k settings. A comparison on EN-DE between these two and a regulariser without annealing is provided in the Appendix, where our regularisers gain an edge in all few-shot settings.

Influence of EC Properties on MT Fine-Tuning. Finally, we investigate how the properties of the artificial languages developed through EC affect downstream NMT performance. Most significantly, this can also be interpreted as a tool to evaluate whether emergent languages display affinities with natural languages. If this is the case, in fact, they should provide the correct inductive bias for NLP tasks and improve the sample efficiency of neural models.
First, we focus on the rate of communication success. During the EC pretraining stage, we save and evaluate models every 50 training steps to pick models with the desired level of accuracy. We run experiments on EN-DE and RO-EN (1k samples) and select five Agent_s and Agent_t checkpoints whose prediction accuracy is near 50%, 80%, 95%, 98.5%, and 99.5%, respectively. During fine-tuning, we try all 25 possible combinations of them. As shown in Figure 2, the prediction accuracy of Agent_s and Agent_t does positively correlate with MT performance. However, this trend is not strict and absolute, and sometimes sub-optimal EC models may fare better in the downstream task.
Second, we focus on the influence of the maximum sequence length. In our main results, we set $L_{max}$ around the average sentence length (in BPE tokens) of the NMT training sets. However, we now show that this is not strictly necessary. In EN-DE and DE-EN, the average length for both languages and all splits (training, validation, and test sets) is almost the same, 15. We vary $L_{max}$ in steps of 5 from 5 to 60. In all these settings, we control for accuracy, only selecting models with a rate of communication success of 99.45 ± 0.16%. The results are illustrated in Figure 3. They show that, with the exception of $L_{max}$ = 5, MT performance is not particularly affected by $L_{max}$ at any of the higher values.
Further Discussion. One interesting question concerns what kind of knowledge exactly has been learned and transferred to the fine-tuning task. Of course, the pretrained EC model does not contain any information about either the SRC or the TRG language. In fact, if the adapter is trained in isolation at the MT fine-tuning stage (freezing the encoder and decoder to their initialisation values), MT performance turns out to be 0 in terms of BLEU score. What is more, it remains an open question whether the real-world grounding represented by image features plays a role in MT fine-tuning. If this were the case, one would expect higher gains on Multi30k than on Europarl, as the former consists of image captions. However, this does not occur in practice. As possible alternative hypotheses, EC pretraining might teach functional aspects of language (Wagner et al., 2003; Wittgenstein, 2009; Lazaridou et al., 2017; Lazaridou et al., 2020), i.e., the capability of agents to communicate and interact, or some language-universal structural properties, similar to Papadimitriou and Jurafsky (2020). We hope that future work will shed light on this matter.
We also note that without the adapter and the regulariser, the gains on MT are relatively limited. Hence, we must additionally stress the importance and synergy of both of these modules in bridging the pretraining and fine-tuning tasks. On the one hand, initialisation and regularisation avoid catastrophic forgetting of old knowledge and drifts to parameter regions unfit for communication in referential games. On the other hand, the adapter module allows a drift away from the image domain and thus results in fast adaptation to the new knowledge.

Conclusion and Future Work
We have demonstrated that an extreme pretraining paradigm without any human language data, but rather based on emergent communication (EC) in referential games, provides an inductive bias for learning natural languages. In theory, this makes the paradigm applicable to any of the world's languages, most of which suffer from the paucity of annotated data. In particular, we focused on neural machine translation (NMT) with limited parallel data as a downstream task. Our results across several language pairs and in different few-shot setups indicate that parameter initialisations informed by EC pretraining, in combination with adapter modules and annealed regularisation, yield higher accuracy and sample efficiency than baselines without any EC pretraining. Vice versa, we argued that NMT performance can also be interpreted as an extrinsic evaluation protocol for emergent communication models: it can assess to what extent emergent languages reflect properties found in natural languages. In particular, we discovered that the communication success rate correlates well with BLEU scores, whereas maximum sequence length has little impact. In the future, we plan to experiment with other state-of-the-art NMT architectures, apply our method to extremely low-resource languages, and extend the scope of our work to other tasks beyond NMT.

A.1 Regulariser without Annealing
In Table 4, we compare the BLEU scores of our proposed annealed regularisers (REG-A and REG-B) and a regulariser without annealing (NA) on EN-DE. ↓ indicates when the newly added module reduces the BLEU score by at least 0.4 BLEU points, and ↑ represents the highest gain compared with the baseline.

Figure 2 :
Figure 2: Impact of EC prediction accuracy on NMT BLEU scores for EN-DE (left) and RO-EN (right). All BLEU scores are obtained in the '1k Samples' setup with the full model variant EC Transferred + Adapter + REG-A.

Figure 3 :
Figure 3: Impact of maximum EC message length ($L_{max}$) on NMT performance. All BLEU scores are obtained in the '1k Samples' setup with the full model variant EC Transferred + Adapter + REG-A.

Table 1 :
BLEU scores of the full model from § 3 in the few-shot translation task on Multi30k with a varying number of parallel sentences (N Samples). ↑ represents the highest score, associated with its relative gain over the baseline.

Table 2 :
BLEU scores of the full model from § 3 in the few-shot translation task on Europarl with a varying number of parallel sentences (N Samples). ↑ represents the highest score, associated with its relative gain over the baseline.

Table 3 :
Ablation experiments. ↓ indicates the case when an added component reduces BLEU by at least 0.4 BLEU points; ↑ represents the highest score, associated with its relative gain over the main baseline.

Table 4 :
Regularisers with and without annealing.