Target Conditioning for One-to-Many Generation

Neural Machine Translation (NMT) models often lack diversity in their generated translations, even when paired with a search algorithm such as beam search. A challenge is that the diversity in translations is caused by the variability in the target language and cannot be inferred from the source sentence alone. In this paper, we propose to explicitly model this one-to-many mapping by conditioning the decoder of an NMT model on a latent variable that represents the domain of target sentences. The domain is a discrete variable generated by a target encoder that is jointly trained with the NMT model. The predicted domain of the target sentence is given as input to the decoder during training. At inference, we can generate diverse translations by decoding with different domains. Unlike our strongest baseline (Shen et al., 2019), our method can scale to any number of domains without affecting the performance or the training time. We assess the quality and diversity of translations generated by our model with several metrics, on three different datasets.


Introduction
Neural Machine Translation (NMT) models are trained to translate a sentence from a source language into a target language. There are many translations of the same sentence that are both grammatically correct and faithful to the source, but these translations may differ greatly in their vocabulary, style or grammar. Inferring the best translation among them requires exploring a vast output space to cover this variability. This is typically handled as a post-processing step using a search algorithm such as beam search. This procedure is known to produce translations that lack diversity, often differing only by a punctuation mark or a word (Kumar and Byrne, 2004; Li et al., 2016). While the search algorithm can certainly be improved, part of the problem also lies in the training of NMT models: they are trained on 1-to-1 translation datasets without any objective to encourage diverse translations.
There are many ways to model the diversity of translations from data that contain only one translation, such as mixtures of experts (Shen et al., 2019) or variational autoencoders (Zhang et al., 2016). A particularity of machine translation is that it is a one-to-many mapping problem. This means that the variability should be encoded by the target sentence, and the question is how to combine an NMT system with a target sentence encoder without posterior collapse.
In this work, we propose to combine the encoder of the NMT model with a discrete target encoder. Similar to other discrete autoencoders (Kaiser et al., 2018; van den Oord et al., 2017), each target sentence is assigned to a discrete variable, or domain, and each domain is associated with an embedding. The embeddings from both encoders are then fed to the decoder of the NMT model to form a translation. The discrete latent representation follows a categorical distribution that is constrained to be uniform over the dataset to avoid mode collapse. Since each domain has its own embedding, changing the domain embedding changes the translation. At test time, we can thus condition the generation on each domain embedding to produce multiple translations with high diversity.
Our approach is general and can be applied on top of any model with little computational overhead. An advantage of our approach is that the number of domains can be arbitrarily large without affecting the performance or the running time. Our approach can replace or work with beam search during inference. We assess the quality and diversity of translations generated by our model with several metrics, on three different datasets.
At training time, a target sentence is encoded with the target transformer encoder to get a latent representation $z_y$. The latent representation is linearly mapped to a vector of size $K$, on which we apply a softmax to obtain domain probabilities. Each domain is associated with an embedding. The decoder is fed with both the source encoding and the sum of the domain embeddings reweighted by their probabilities. During inference, we can generate $K$ different hypotheses by switching the domain embedding that is fed to the decoder. To prevent a train-test discrepancy, during training we apply an argmax operator on the domain probabilities with probability $p_{hard}$.

Related Work
Several studies have proposed to sample diverse sequences by changing the value of a latent variable. For example, one possibility is to add noise to the latent space of a Variational Auto-Encoder (Kingma and Welling, 2013) to diversify samples in machine translation (Zhang et al., 2016), language modeling (Bowman et al., 2015) or question generation (Jain et al., 2017). In particular, Zhang et al. (2016) also condition the decoder of an NMT model on a target encoder. As opposed to our work, the output of their encoder is continuous and sampling diverse generations requires injecting random noise, while we obtain diversity by switching between discrete domains. Similar noise injection mechanisms have been investigated to improve the diversity of responses in dialogue (Serban et al., 2017; Cao and Clark, 2017; Wen et al., 2017) and image captioning (Wang et al., 2017; Dai et al., 2017). Closer to our work, Shen et al. (2019), Shu et al. (2019) and Xu et al. (2018) use domain embeddings to condition their generations. Unlike us, they do not condition the domain on the target, but select the domain that minimizes the reconstruction loss, which becomes expensive as the number of domains increases. Another relevant work is the fast decoder of Kaiser et al. (2018), who also combine a discrete encoder applied to the target sentence with the NMT encoder. Their goal is to accelerate the decoding process of a machine translation system, while we are interested in efficiently sampling diverse translations. Another line of work focuses on improving the generation by changing the decoding scheme during inference (Li et al., 2016; Gu et al., 2017) or by matching the training of the model to the decoding scheme (Wiseman and Rush, 2016; Collobert et al., 2019). This is done either by training through a beam search decoder (Wiseman and Rush, 2016; Collobert et al., 2019) or by reframing generation as a reinforcement learning problem (Ranzato et al., 2015).
These works focus on the decoding scheme to improve generation, but do not address the problem of diversifying the outputs generated from the same input.

Model
In this section, we describe our target encoder and how to train it along with a translation model. The target encoder learns to map target sentences to discrete domains, and we show how to use these domains to efficiently sample diverse translations.

Target encoding
A Neural Machine Translation (NMT) model is composed of a source encoder $E_{src}$ and a decoder $D$. Given a dataset $\mathcal{D}$ of pairs $(x, y)$ of source sentences and their target translations, a standard encoder-decoder model is trained to minimize:
$$-\sum_{(x, y) \in \mathcal{D}} \log p_D\big(y \mid E_{src}(x)\big),$$
where $p_D\big(y \mid E_{src}(x)\big)$ represents the probability assigned by the decoder $D$ to the target sentence $y$ being the translation of the source sentence $x$. In our case, we consider that we also have a target encoder $E_{tgt}$, and we feed the decoder not only with an encoding of the source sentence, but also with an encoding of the target sentence. As a result, the model is trained to minimize:
$$-\sum_{(x, y) \in \mathcal{D}} \log p_D\big(y \mid E_{src}(x), E_{tgt}(y)\big).$$
Without architectural constraints, the decoder $D$ could trivially learn the identity mapping between the encoding of the target sentence $E_{tgt}(y)$ and the sentence to generate $y$. Instead, we propose to use a key-value structure for this embedding, where the target encoder provides a probability for a key, or domain, and we feed the associated value to the decoder of the machine translation system. In practice, we constrain the output of the target encoder to represent the domain probability distribution of the target sentence. The output of the target encoder is thus a $K$-dimensional vector of probabilities $p = E_{tgt}(y)$. Since the output of the target encoder is not directly fed to the decoder, we bound the amount of information provided by the target encoder, preventing the model from learning a trivial mapping. At test time, we cannot estimate $E_{tgt}(y)$ since the target sentence $y$ is not available. Instead, we feed the decoder $D$ with each one-hot vector of $\mathbb{R}^K$ in turn to generate $K$ different translations.
An illustration of our model is provided in Figure 1.

Implementation
Our NMT model is the transformer network of Vaswani et al. (2017) with dimension $d$, with a transformer encoder $E_{src}$ and a transformer decoder $D$. The target encoder $E_{tgt}$ that we introduce in this paper is composed of a transformer encoder with the same architecture as the source encoder $E_{src}$ and other components detailed below. We refer the reader to Vaswani et al. (2017) for the details of the architecture and describe below the specifics of our target encoder $E_{tgt}$.
The output $E_{tgt}(y)$ of the target encoder is a probability vector of size $K$. To obtain these probabilities, we encode the target sentence with a transformer encoder. We take the first hidden state $h \in \mathbb{R}^d$ of the last layer of the target encoder, corresponding to the start token. We linearly map $h$ to a score vector $s$ of dimension $K$. Finally, we apply a softmax operator to obtain a vector of domain probabilities:
$$p = \mathrm{softmax}(s).$$
In that setting, the decoder is trained with arbitrary probability vectors, which becomes problematic at test time when $p$ is set to a one-hot vector on which the decoder may never have been trained. To prevent this train-test discrepancy, we apply a temperature $T$ to the domain scores $s$, i.e. $p = \mathrm{softmax}(s / T)$, where $T$ decreases linearly from 1 to 0 over training. When the temperature reaches 0, we have $p = \mathbb{I}(\mathrm{argmax}(s))$ (i.e. the domain with the highest score has probability 1, the others have probability 0) and the target encoder remains frozen during the remaining training time.
Moreover, at each training step, we randomly replace the softmax by an argmax operator with a probability $p_{hard}$. In practice, we set $p_{hard} = 0.25$, which means that 75% of the time the target encoder is trained along with the source encoder and decoder, and 25% of the time the target encoder is only used to predict the domain with the highest probability. Overall, we have:
$$E_{tgt}(y) = \begin{cases} \mathbb{I}(\mathrm{argmax}(s)) & \text{if } X < p_{hard} \text{ or } T = 0, \\ \mathrm{softmax}(s / T) & \text{otherwise,} \end{cases}$$
where $X$ is a random variable drawn from a uniform distribution, i.e., $X \sim \mathcal{U}(0, 1)$.
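The soft-discretization step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the `total_steps` schedule length, and the convention that the hard branch is taken when the uniform draw falls below `p_hard` are ours, not taken from the authors' implementation.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores / temperature (temperature > 0)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [v / total for v in exps]


def one_hot_argmax(scores):
    """One-hot vector that puts probability 1 on the highest score."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return [1.0 if k == best else 0.0 for k in range(len(scores))]


def soft_discretize(scores, temperature, p_hard=0.25, rng=random):
    """Return the domain probabilities for one target sentence.

    The hard (argmax) branch is used with probability p_hard, or always
    once the temperature has reached 0; otherwise a tempered softmax.
    """
    if temperature <= 0.0 or rng.random() < p_hard:
        return one_hot_argmax(scores)
    return softmax(scores, temperature)


def linear_temperature(step, total_steps):
    """Temperature going linearly from 1 to 0 (total_steps is an assumption)."""
    return max(0.0, 1.0 - step / total_steps)
```

Only the softmax branch lets gradients reach the target encoder; in the argmax branch the encoder merely selects a domain, which is the situation the decoder faces at test time.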
Optimization. When T > 0, the model is fully differentiable and the target encoder can be trained in an end-to-end fashion with the rest of the model. We found that it is also possible to use discrete operators like the Gumbel-Softmax (Jang et al., 2016). This way, E tgt (y) is always a one-hot vector and there is no train-test discrepancy. However, learning the target encoder through a discrete encoding makes optimization more difficult, and we obtained better results with a regular softmax.
Domain input. To feed the target encoder output $E_{tgt}(y)$ as input to the decoder $D$, the decoder learns a matrix of embeddings $E = [e_0, \dots, e_{K-1}] \in \mathbb{R}^{d \times K}$ where each $e_i$ represents a different domain. Traditionally, the first input of a decoder is an embedding that corresponds to a start symbol $\langle S \rangle$. Instead, we feed as first embedding a vector $e$, where:
$$e = E \, E_{tgt}(y) = \sum_{k=0}^{K-1} p_k \, e_k.$$
The domain embeddings $E$ are learned during training. This process is illustrated in Figure 2.
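As a small illustration of the domain input, the sketch below computes the first decoder embedding as the probability-weighted sum of the domain embeddings. The function name and the list-of-lists layout (one embedding per domain) are assumptions made for readability.

```python
def domain_input(embeddings, probs):
    """Weighted sum e = sum_k p_k * e_k.

    embeddings: list of K vectors of size d (one per domain).
    probs: list of K domain probabilities.
    """
    if len(embeddings) != len(probs):
        raise ValueError("need one probability per domain embedding")
    d = len(embeddings[0])
    return [sum(p * e[j] for p, e in zip(probs, embeddings)) for j in range(d)]
```

With a one-hot `probs`, as used at inference, the result is exactly the embedding of the selected domain.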

Training objective
We denote by $\theta$ the parameters of $E_{src}$, $E_{tgt}$, and $D$. Given a mini-batch of source and target sentences $\{(x_i, y_i)\}_{1 \le i \le N}$, the model is trained to minimize:
$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_D\big(y_i \mid E_{src}(x_i), E_{tgt}(y_i)\big).$$
In practice, we want the decoder to properly leverage $E_{tgt}(y)$, i.e., the domain information coming from the target encoder. Without additional constraints, nothing prevents the model from collapsing to a mode where the target encoder constantly predicts the same domain, regardless of its input. The model is then perfectly predicting its domain, which means that it receives no gradient to escape this trivial solution.
To address this issue, we add a regularization term to the training objective to encourage the model to make uniform use of the available domains. In particular, we define the entropy of the distribution of selected domains in the mini-batch:
$$L_{XE}(\theta) = -\sum_{k=0}^{K-1} \bar{p}_k \log \bar{p}_k, \qquad \bar{p} = \frac{1}{N} \sum_{i=1}^{N} p_i,$$
where $p_i = E_{tgt}(y_i)$ is the probability distribution of domains for the target sentence $y_i$. Finally, the model is trained to minimize $L(\theta) - \lambda L_{XE}(\theta)$, where $\lambda$ is a hyper-parameter.
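The effect of the regularizer can be illustrated with a short sketch. It assumes that the entropy is computed on the mini-batch average of the per-sentence domain distributions, which is our reading of the text; the function names and the default `lam` value are illustrative.

```python
import math


def batch_domain_entropy(batch_probs):
    """Entropy of the mini-batch average of per-sentence domain distributions."""
    n = len(batch_probs)
    k = len(batch_probs[0])
    mean = [sum(p[j] for p in batch_probs) / n for j in range(k)]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(q * math.log(q) for q in mean if q > 0.0)


def regularized_loss(translation_loss, batch_probs, lam=0.1):
    """Total objective L - lambda * L_XE, to be minimized."""
    return translation_loss - lam * batch_domain_entropy(batch_probs)
```

A collapsed batch, where every sentence picks the same domain, has zero entropy and receives no bonus, while a batch that spreads evenly over the K domains reaches the maximum entropy log K and therefore a lower total loss.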

Inference
At inference, we generate one hypothesis per domain, i.e. $K$ hypotheses. To generate the $k$-th hypothesis, we perform decoding by feeding $e_k$ as the embedding of the start symbol. We generate translations with greedy decoding, except in Figure 5, where we combine our model with beam search decoding, which leads to a different quality vs. diversity trade-off.

Experiments
In this section, we describe an evaluation protocol similar to Shen et al. (2019), and compare our approach to several baselines on 3 MT datasets. Then, we show the importance of different components in our model in an ablation study.

Figure 2: Detailed illustration of our model. $Z_y$ is the first hidden state of the output of the target transformer encoder. To obtain $E_{tgt}(y)$, we linearly map $Z_y$ to a $K$-dimensional vector and perform a "soft-discretization" by applying either a softmax or an argmax operator. We then compute the target domain vector $e$ as the sum of the domain embeddings $E$ reweighted by their probabilities contained in $E_{tgt}(y)$. The vector $e$ is fed to the decoder as the embedding of the first token, along with the source encoding $Z_x = E_{src}(x)$.

Evaluation Metrics
To measure both the quality and diversity of our generations, we use an evaluation protocol similar to Shen et al. (2019), with two metrics. mBLEU measures the quality of the hypotheses against the human reference translations. Pairwise computes the BLEU score between pairs of distinct hypotheses $(j, k) \in K^2$, $j \neq k$, generated for the same source sentence. A low pairwise score indicates diverse translations, while a pairwise score of 100 means that for a given source sentence, the decoder always generates the same translation. Overall, we want the model to have a low pairwise score while preserving a high mBLEU score.
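To make the pairwise metric concrete, here is a minimal sketch that averages a sentence-level score over all ordered pairs of distinct hypotheses. Two points are assumptions: the averaging itself (the exact aggregation used in the paper is not recoverable from the text), and the scorer `unigram_overlap`, which is a toy stand-in for BLEU and not a BLEU implementation. Any real BLEU function with the same `(hypothesis, reference)` signature could be passed as `score_fn`.

```python
from itertools import permutations


def unigram_overlap(hyp, ref):
    """Toy stand-in for BLEU: percentage of hyp tokens that appear in ref."""
    hyp_tokens, ref_tokens = hyp.split(), set(ref.split())
    if not hyp_tokens:
        return 0.0
    return 100.0 * sum(t in ref_tokens for t in hyp_tokens) / len(hyp_tokens)


def pairwise_score(hypotheses, score_fn=unigram_overlap):
    """Average score_fn(h_j, h_k) over all ordered pairs with j != k."""
    pairs = list(permutations(hypotheses, 2))
    if not pairs:
        raise ValueError("need at least two hypotheses")
    return sum(score_fn(a, b) for a, b in pairs) / len(pairs)
```

Identical hypotheses give the maximum score of 100 (no diversity), and hypotheses that share no tokens give 0.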

Dataset
We train and test our model on three different datasets, following Shen et al. (2019). Each dataset comes with a test set with multiple human reference translations.
WMT'17 English-German (En-De). We follow the same pre-processing protocol as Shen et al. (2019), where we filter all training sentences with more than 80 source or target words, which results in 4.5M sentence pairs. We apply the Moses tokenizer (Koehn et al., 2007) and learn a joint BPE vocabulary with 32k codes (Koehn et al., 2007). We take newstest2013 as a validation set, and test on a subset of 500 sentences of newstest2014 with 10 reference translations.
WMT'14 English-French (En-Fr). We follow the setup of Gehring et al. (2017), which results in 36M training sentence pairs. We use a joint vocabulary of 40k BPE codes. We use newstest2012 and newstest2013 as a validation set, and test on a subset of 500 sentences from newstest2014 with 10 reference translations.
WMT'17 Chinese-English (Zh-En). We follow the pre-processing setup of Hassan et al. (2018). The training set is composed of 20M sentence pairs, with 48k and 32k source and target BPE vocabularies respectively. We develop on devtest2017 and evaluate on a subset of 2000 sentences of newstest2017 that comes with 3 reference translations.

Experimental details
In all our experiments, we consider transformers with 6 layers and 8 attention heads, and we set the model dimension to $d = 512$. We optimize our model with the Adam optimizer (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and a learning rate of $3 \times 10^{-4}$. We use the same learning rate schedule as Vaswani et al. (2017). We use a dropout (Srivastava et al., 2014) of 0.1 in the source encoder and the decoder. Following Shen et al. (2019), we do not use any dropout in the target encoder. With stochasticity in the target encoder, the same target sentence tends to be mapped to different domains at different iterations, which prevents the decoder from learning the specificity of each domain and results in identical generations with no diversity.
We use 128 GPUs for the En-Fr experiments, and 16 GPUs for the En-De and Zh-En experiments. For the En-Fr experiments, we train with mini-batches of around 450k tokens, and with mini-batches of around 55k tokens for En-De and Zh-En. We use float16 operations to speed up training and to reduce the memory usage of our models. We implement our model within the fairseq framework.

Baselines
Sampling and Beam. We report results with a sampling and a beam baseline, as well as the diverse beam method (Vijayakumar et al., 2018). We consider a standard NMT system (i.e. an encoder-decoder model, without target encoder or latent variable). At test time, for sampling, we sample $K$ translations to generate $K$ hypotheses. For beam search, we use a beam size of $K$ and return all hypotheses in the beam.

Mixture of Experts. We also compare against the state-of-the-art Mixture of Experts (MoE) model of Shen et al. (2019), with online responsibility update, uniform prior, shared parameters and hard assignment (hMup in their paper), which is their overall best setup. The MoE model is composed of a source encoder $E_{src}$ and a decoder $D$. Like our model, the decoder learns a matrix of embeddings $E = [e_0, \dots, e_{K-1}] \in \mathbb{R}^{d \times K}$ where each $e_i$ represents a different domain and is fed as the first input of the decoder. Unlike us, they do not use a separate target encoder to select the domain, but consider an EM algorithm where the selected domain is the one that minimizes the reconstruction loss of the target sentence. In particular, for a mini-batch of $N$ source and target sentences $\{(x_i, y_i)\}_{1 \le i \le N}$, the E-step computes:
$$k_i = \operatorname*{argmin}_{0 \le k < K} \; -\log p_D\big(y_i \mid E_{src}(x_i), e_k\big), \qquad 1 \le i \le N.$$
Then, the M-step minimizes the negative log-probability of target sentences, given their source encodings and the selected domains:
$$-\frac{1}{N} \sum_{i=1}^{N} \log p_D\big(y_i \mid E_{src}(x_i), e_{k_i}\big).$$
We run all of these baselines with the same transformer architecture as the one used in our model. For a fair comparison, we use the same optimizer, learning rate and batch size in all experiments.
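The cost of the hard E-step of the MoE baseline can be illustrated with the following sketch. The names `moe_e_step`, `loss_fn` and `toy_loss` are hypothetical: `loss_fn(x, y, k)` stands for one decoder forward pass conditioned on domain `k`, and the toy loss only serves to count calls.

```python
def moe_e_step(loss_fn, x, y, num_domains):
    """Hard E-step: try every domain and keep the one with the lowest loss.

    loss_fn(x, y, k) stands for one decoder forward pass conditioned on
    domain k; it is called num_domains times per sentence pair.
    """
    losses = [loss_fn(x, y, k) for k in range(num_domains)]
    best = min(range(num_domains), key=lambda k: losses[k])
    return best, losses[best]


# Toy usage: a fake loss that is lowest for domain 2 and records each call.
calls = []


def toy_loss(x, y, k):
    calls.append(k)
    return abs(k - 2) + 0.5


best_domain, best_loss = moe_e_step(toy_loss, "source", "target", num_domains=5)
```

The number of loss evaluations grows linearly with the number of domains, whereas a target encoder selects the domain with a single encoder pass regardless of K.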

Main results
Table 1 presents mBLEU and pairwise scores for different models on the three considered datasets. We observe that a high mBLEU score is often combined with a high pairwise score. For instance, the beam search and sampling baselines fail at generating translations that are both diverse and of high quality. Beam search and diverse beam search hypotheses are accurate but lack diversity, resulting in a very similar set of hypotheses. On WMT En-De, with $K = 10$, beam search gives an mBLEU score of 66.3 but a pairwise score of 74. On the other hand, the sampling baseline generates very diverse but inaccurate hypotheses, with a pairwise score of 11.8 but an mBLEU of 28.2.
The Mixture of Experts and Target Encoder models have a better trade-off between diversity and quality, as shown in Figure 4. Overall, our method provides more diversity than the MoE method, i.e. it obtains a lower pairwise score, but at the cost of a lower mBLEU score. In Table 1, we observe that for En-De and En-Fr, our model obtains a lower mBLEU score than beam search decoding and the Mixture of Experts, but provides more diversity, with a pairwise score of 57.3 instead of 64.4 on En-Fr. While both methods perform similarly, our approach is simpler to implement and can easily scale to an arbitrary number of domains, as shown in the following section.

Training speed
The training speed of our method is independent of the number of domains. In contrast, the training speed of the MoE model of Shen et al. (2019) decreases drastically when the number of domains increases. Indeed, the MoE model requires $K$ forward passes to determine the best domain. In Figure 3, we compare the training speed of both models for $K$ = 3, 5, 10, 20, 50 and 200. Unlike the MoE model, using a target encoder allows generalization to an arbitrary number of domains.

Ablation study
Beam search. In Figure 5, we study the impact of decoding with beam search instead of greedy decoding. Using beam search improves the quality of translations, but deteriorates the diversity.
Combining a target encoder model with beam search pushes towards the same quality-diversity trade-off as the greedy MoE model.
In each case, we report results for $K$ = 5, 10 and 20 domains. MoE and Target Encoder provide the best trade-off between quality and diversity. Compared to MoE, Target Encoder provides a lower mBLEU score, but also a lower pairwise score (i.e., more diversity).
When $\lambda$ is too low, we sometimes encounter the "collapse" scenario where at training time all target sentences are mapped to the same domain. As a result, only the embedding associated with that domain is trained, and at test time, every sentence generated from another (and untrained) domain embedding will be invalid. This means that only one of the $K$ generated hypotheses will be valid, leading to a very poor mBLEU. Conversely, when $\lambda$ is too high, the regularization term becomes predominant and the target encoder primarily focuses on maximizing the domain usage entropy, rather than on minimizing the decoder reconstruction loss. As a result, the target encoder uniformly maps target sentences to all available domains, but the domains do not contain any information about target sentences. This way, the decoder learns to ignore the domain and will always output the same translation, independently of the input domain, which results in a pairwise score close to 100 (i.e. there is no diversity). In practice, we found that setting $\lambda = 0.1$ or $\lambda = 1$ leads to similar results and is enough to prevent the collapse scenario.
Source versus target encoding. In this experiment, we change the input of our target encoder to probe where the diversity in our model comes from. In particular, it is possible that the diversity captured by our model comes indirectly from the source sentences through the target sentences. We test this hypothesis by replacing the input of the target encoder with the source sentence. This model is identical to ours apart from the change in the input of the target encoder. In that setting, on WMT'17 En-De, when using 10 domains, we obtain an mBLEU score of 66.5 and a pairwise BLEU of 97.2, which means that the model was not able to learn anything specific about each domain, and the decoder simply ignores the domain information. The fact that learning the domain from the input sentence does not work well is expected, as this information is already encoded in the source encoding $z_x$. This validates that learning the diversity from the target domains is important. It also suggests that the diversity that our model learns is inherent to the target domain, and does not come from the source domain indirectly. Finally, both models have the same number of parameters, suggesting that the gain in performance is not only caused by the additional parameters.

Source: 参与投票的成员中,58%反对该合同交易。
Human references:
  It was rejected by 58% of its members who voted in the ballot.
  Of the members who voted, 58% opposed the contract transaction.
  Of the members who participated in the vote, 58% opposed the contract.
Beam 3, Top 3:
  Of those voting, 58 per cent opposed the contract deal.
  Fifty-eight per cent of the members voting opposed the contract deal.
  Fifty-eight per cent of the members voting opposed the contract.
Mixture of Experts (Shen et al., 2019):
  Of the members who voted, 58% opposed the deal.
  Fifty-eight per cent of the members who voted opposed the contract deal.
  Fifty-eight per cent of the voting members opposed the contract deal.
Our Model:
  Of the members voting, 58 per cent opposed the contract deal.
  Fifty-eight per cent of the members who voted opposed the contract deal.
  Fifty-eight per cent of those voting had opposed this contract deal.

Source: 自11月份开始，俄罗斯民意也有所扭转。
Human references:
  Russian public opinion has also turned since November.
  Russian public opinion has started to change since November.
  The polls in Russian show a twist turn since the beginning of November.
Beam 3, Top 3:
  Since November, Russian public opinion has also turned around.
  Since November, Russian public opinion has also changed.
  Russian public opinion has also changed since November.
Mixture of Experts (Shen et al., 2019):
  Since November, the mood in Russia has also reversed.
  Since November, opinion in Russia has also reversed.
  Opinion in Russia has also shifted since November.
Our Model:
  Since November, Russian public opinion has also reversed.
  The mood in Russia has also reversed since November.
  There has also been a reversal in Russian public opinion since November.

This operation will require moving the northern elevated bridge to a further 700 mm south.
The operation was completed by moving the northern elevated bridge to the south by 700 mm.

Source: 尽管该桥的规模和重量都很巨大，但是完成高架桥和桥面之间的合龙却需要精细作业，需要非常精准的装配。
Reference: Although the bridge is very large and heavy, joining the viaduct and roadway is a meticulous operation, requiring precise assembly.
Our model:
  Despite the scale and weight of the bridge, the completion of the Alloy bridge and the deck requires fine work and very precise assembly.
  Although it is huge in size and weight, it requires fine work and very precise assembly to complete the Alloy between the viaduct and the deck.
  The bridge, though of great scale and weight, requires fine work and a very precise assembly to complete the Alloy between the viaduct and the deck.

Source: 据报道，雷杰普·塔伊普·埃尔多安则五次提到"好朋友"普京。
Reference: According to the report, Recep Tayyip Erdogan mentioned his "good friend" Putin five times.
Our model:
  It was reported that Recep Tayyip Erdogan made five references to "good friend" Putin.
  Recep Tayyip Erdogan reportedly mentioned "good friend" Putin five times.
  The five-time mentions of "good friend" Putin were reported by Recep Tayyip Erdogan.

Source: 叙利亚是关键因素之一。莫斯科近日在叙利亚扮演和事佬的角色，而俄罗斯与土耳其却支持相反派别。
Reference: Syria is a key factor. Moscow has recently played the role of a mediator in Syria. However, Russia and Turkey supported opposing factions.
Our model:
  One of the key factors is Syria, where Moscow has played a conciliatory role, while Russia and Turkey have supported the opposite.
  Syria is one key factor, with Moscow playing a conciliatory role in Syria, while Russia and Turkey have supported the opposite.
  The role of Moscow as a peacemaker in Syria recently was one of the key factors, while Russia and Turkey supported the opposition.

Source: 双方就如何在叙利亚问题上求同存异未达成明确共识。
Reference: There is no clear consensus on where they can seek common ground on Syria.
Our model:
  There is no clear consensus on how to seek common ground on Syria.
  No clear consensus has been reached on how to find common ground on Syria.
  The two sides did not reach a clear consensus on how to seek common ground on Syria.

Source: 莫斯科当时的反应很激烈
Reference: Moscow's reaction was intense.
Our model:
  Moscow was reacting violently.
  Moscow's reaction was very strong.
  The reaction in Moscow was intense.

Source: 令人欣慰的是，一切都进展得很顺利。
Reference: It is gratifying that everything is going well.
Our model:
  It is gratifying to note that everything is going well.
  Thankfully, everything has gone well.
  To the relief of all, everything was going well.

... (Ott et al., 2018), our model is able to generate diverse translations with very different prefixes, even for long sentences.

Conclusion
In this paper, we presented an efficient way to sample diverse translations by adding a discrete target encoder to an NMT model. The discrete representation makes it possible to change the domain of the translation and can be trained without supervision. The advantage of using a discrete encoder is that it is both general and scales with the number of domains at no additional computational cost. In the future, we plan to test our discrete target encoder to diversify generations in other domains, such as language modeling, image captioning or image inpainting.