Sequence to Sequence Mixture Model for Diverse Machine Translation

Sequence to sequence (SEQ2SEQ) models lack diversity in their generated translations. This can be attributed to their limited ability to capture lexical and syntactic variations in parallel corpora, arising from different styles, genres, topics, and the inherent ambiguity of the human translation process. In this paper, we develop a novel sequence to sequence mixture (S2SMIX) model that improves both translation diversity and quality by adopting a committee of specialized translation models rather than a single translation model. Each mixture component selects its own training dataset via optimization of the marginal log-likelihood, which leads to a soft clustering of the parallel corpus. Experiments on four language pairs demonstrate the superiority of our mixture model over the SEQ2SEQ model with both standard and diversity-encouraging beam search. Our mixture model incurs negligible additional parameters and no extra computation at decoding time.


Introduction
Neural sequence to sequence (SEQ2SEQ) models have been remarkably effective at machine translation (MT) (Sutskever et al., 2014; Bahdanau et al., 2015). They have revolutionized MT by providing a unified end-to-end framework, as opposed to traditional approaches requiring several submodels and long pipelines. The neural approach is superior to or on par with statistical MT in terms of translation quality on various MT tasks and domains, e.g., (Wu et al., 2016; Hassan et al., 2018).
A well-recognized issue with SEQ2SEQ models is the lack of diversity in the generated translations. This issue is mostly attributed to the decoding algorithm (Li et al., 2016) and, more recently, to the model itself (Zhang et al., 2016; Schulz et al., 2018a). The former direction has attempted to design diversity-encouraging decoding algorithms, particularly for beam search, as it generates translations sharing the majority of their tokens except for a few trailing ones. The latter direction has investigated modeling enhancements, particularly the introduction of continuous latent variables, to capture lexical and syntactic variations in training corpora resulting from the inherent ambiguity of the human translation process. However, improving translation diversity and quality with SEQ2SEQ models remains an open problem, as the results of the aforementioned previous work are not fully satisfactory.
In this paper, we develop a novel sequence to sequence mixture (S2SMIX) model that improves both translation quality and diversity by adopting a committee of specialized translation models rather than a single translation model. Each mixture component selects its own training dataset via optimization of the marginal log-likelihood, which leads to a soft clustering of the parallel corpus. As such, our mixture model introduces a conditioning global discrete latent variable for each sentence, which groups similar sentence pairs together and captures variations in the training corpus. We design the architecture of S2SMIX such that the mixture components share almost all of their parameters and computation.
We provide experiments on four translation tasks, translating from English to German, French, Vietnamese, and Spanish. The experiments show that our S2SMIX model consistently outperforms strong baselines, including the SEQ2SEQ model with standard and diversity-encouraging beam search, in terms of both translation diversity and quality. The benefits of our mixture model come with negligible additional parameters and no extra computation at inference time, compared to the vanilla SEQ2SEQ model.

Attentional Sequence to Sequence
An attentional sequence to sequence (SEQ2SEQ) model (Sutskever et al., 2014; Bahdanau et al., 2015) aims to directly model the conditional distribution of an output sequence y ≡ (y_1, . . . , y_T) given an input x, denoted P(y | x). This family of autoregressive probabilistic models decomposes the output distribution into a product of distributions over individual tokens, often ordered from left to right as

  P_θ(y | x) = ∏_{t=1}^{T} P_θ(y_t | y_{<t}, x),   (1)

where y_{<t} ≡ (y_1, . . . , y_{t−1}) denotes a prefix of the sequence y, and θ denotes the tunable parameters of the model. Given a training dataset of input-output pairs, denoted D ≡ {(x, y*)_d}_{d=1}^{|D|}, the conditional log-likelihood objective, predominantly used to train SEQ2SEQ models, is expressed as

  CLL(θ; D) = ∑_{(x, y*) ∈ D} log P_θ(y* | x).   (2)

A standard implementation of the SEQ2SEQ model is composed of an encoder followed by a decoder. The encoder transforms a sequence of source tokens, denoted (x_1, . . . , x_N), into a sequence of hidden states, denoted (h_1, . . . , h_N), via a recurrent neural network (RNN). Attention provides an effective mechanism to represent a soft alignment between the tokens of the input and output sequences (Bahdanau et al., 2015), and more recently to model the dependency among the output variables (Vaswani et al., 2017).
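The autoregressive factorization in (1) means the log-probability of a full sequence is just the sum of the per-step token log-probabilities. A minimal NumPy sketch of this identity (the function name and toy numbers are illustrative, not from the paper):

```python
import numpy as np

def sequence_log_prob(token_log_probs):
    """Log-probability of a full sequence under the autoregressive
    factorization: log P(y | x) = sum_t log P(y_t | y_<t, x).
    `token_log_probs[t]` holds log P(y_t | y_<t, x) for step t."""
    return float(np.sum(token_log_probs))

# Toy example: a 3-token sequence whose per-step probabilities are
# 0.5, 0.25, and 0.5 under some model; the joint probability is 0.0625.
steps = np.log([0.5, 0.25, 0.5])
print(sequence_log_prob(steps))
```

Summing the per-example values over the dataset D then gives the conditional log-likelihood objective in (2).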
In our model, we adopt a bidirectional RNN with LSTM units (Hochreiter and Schmidhuber, 1997). Each hidden state h_n is the concatenation of the states produced by the forward and backward RNNs, h_n = [→h_n ; ←h_n]. Then, we use a two-layer RNN decoder to iteratively emit individual distributions over target tokens (y_1, . . . , y_T). At time step t, we compute the hidden representations of an output prefix y_{≤t}, denoted s^1_t and s^2_t, based on an embedding of y_t, denoted M[y_t], the previous representations s^1_{t−1} and s^2_{t−1}, and a context vector c_t as

  s^1_t = LSTM(s^1_{t−1}, [M[y_t]; c_t]; W^1),   (3)
  s^2_t = LSTM(s^2_{t−1}, s^1_t; W^2),   (4)

where M is the embedding table, and W^1 and W^2 are learnable parameters. The distribution over the next output token is computed from the top-layer state via a softmax over logits,

  o_t = W_o s^2_t + b_o,   P_θ(y_{t+1} | y_{≤t}, x) = softmax(o_t),   (5)

where W_o and b_o are learnable parameters. The context vector c_t is computed based on the input and attention as

  c_t = ∑_{n=1}^{N} a_{t,n} h_n,   (6)
  a_t = softmax_n( v⊤ tanh(W_h h_n + W_s s^1_t + b_a) ),   (7)

where W_h, W_s, b_a, and v are learnable parameters, and a_t is the attention distribution over the input tokens at time step t. The decoder utilizes the attention information to decide which input tokens should influence the next output token y_{t+1}.

Sequence to Sequence Mixture Model
We develop a novel sequence to sequence mixture (S2SMIX) model that improves both translation quality and diversity by adopting a committee of specialized translation models rather than a single translation model. Each mixture component selects its own training dataset via optimization of the marginal log-likelihood, which leads to a soft clustering of the parallel corpus. We design the architecture of S2SMIX such that the mixture components share almost all of their parameters, except a few conditioning parameters. This enables a direct comparison against a SEQ2SEQ baseline with the same number of parameters. Improving translation diversity within SEQ2SEQ models has received considerable recent attention (e.g., Vijayakumar et al. (2016); Li et al. (2016)). Given a source sentence, human translators are able to produce a set of diverse and reasonable translations. However, although beam search for SEQ2SEQ models is able to generate various candidates, the final candidates often share the majority of their tokens, except for a few trailing ones. The lack of diversity within beam search raises an issue for possible re-ranking systems and for scenarios where one is willing to show multiple translation candidates to the user. Prior work attempts to improve translation diversity by incorporating a diversity penalty during beam search (Vijayakumar et al., 2016; Li et al., 2016). By contrast, our S2SMIX model naturally incorporates diversity both during training and at inference.
The key difference between the SEQ2SEQ and S2SMIX models lies in the formulation of the conditional probability of an output sequence y given an input x. The S2SMIX model represents P_θ(y | x) by marginalizing out a discrete latent variable z ∈ {1, . . . , K}, which indicates the selection of the mixture component, i.e.,

  P_θ(y | x) = ∑_{z=1}^{K} P(z | x) P_θ(y | x, z),   (9)

where K is the number of mixture components. For simplicity and to promote diversity, we assume that the mixing coefficients follow a uniform distribution, such that

  P(z | x) = 1/K   for all z ∈ {1, . . . , K}.   (10)

For the family of S2SMIX models with uniform mixing coefficients (10), the conditional log-likelihood objective (2) can be re-expressed as

  CLL(θ; D) = ∑_{(x, y*) ∈ D} log ∑_{z=1}^{K} exp( ∑_t log P_θ(y*_t | y*_{<t}, x, z) ),   (11)

where the log(1/K) terms are excluded because they offset the objective by a constant. Such a constant has no impact on learning the parameters θ. One can easily implement the objective in (11) using automatic differentiation software such as TensorFlow (Abadi et al., 2016) by adopting a LogSumExp operator to aggregate the losses of the individual mixture components. When the number of components K is large, computing the terms P_θ(y*_t | y*_{<t}, x, z) for all values of z ∈ {1, . . . , K} can require a lot of GPU memory. To mitigate this issue, we propose a memory-efficient formulation in Section 3.3, inspired by the EM algorithm.
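The per-example mixture objective reduces to a LogSumExp over the per-component sequence log-probabilities once the constant log(1/K) is dropped. A minimal NumPy sketch of this aggregation (a stand-alone illustration, not the paper's TensorFlow implementation):

```python
import numpy as np

def mixture_log_likelihood(component_log_probs):
    """Marginal log-likelihood of one sentence pair under K mixture
    components with uniform mixing coefficients, dropping the constant
    log(1/K): log sum_z exp( sum_t log P(y*_t | y*_<t, x, z) ).
    `component_log_probs[z]` is the sequence log-probability under z."""
    a = np.asarray(component_log_probs, dtype=float)
    m = a.max()  # subtract the max for numerical stability (LogSumExp trick)
    return float(m + np.log(np.sum(np.exp(a - m))))
```

With two components that each assign probability 0.5 to the target sequence, the objective evaluates to log(0.5 + 0.5) = 0, illustrating how the LogSumExp aggregates the component losses.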

S2SMIX Architecture
We design the architecture of the S2SMIX model such that individual mixture components can share as many parameters and as much computation as possible. Accordingly, all of the mixture components share the same encoder, which requires processing the input sentence only once. As depicted in Figure 1, we consider different ways of injecting the conditioning on z into our two-layer decoder. These variants require additional lookup tables, denoted M_1, M_2, and M_b. When we incorporate the conditioning on z into the LSTM layers, each lookup table (e.g., M_1 and M_2) has K rows and D_lstm columns, where D_lstm denotes the dimensionality of the LSTM states (512 in our case). We combine the state of the LSTM with the conditioning signal via simple addition, so the LSTM update equations take the form

  s^i_t = LSTM(s^i_{t−1} + M_i[z], ·; W^i)   for i ∈ {1, 2},   (12)

where the second argument is the layer input from (3) and (4). We refer to the addition of the conditioning signal to the bottom and top LSTM layers of the decoder as bt and tp, respectively. Note that in the bt configuration, the attention mask depends on the indicator variable z, whereas in the tp configuration the attention mask is shared across different mixture components. We also consider incorporating the conditioning signal into the softmax layer to bias the selection of individual words in each mixture component. Accordingly, the embedding table M_b has K rows and D_vocab columns, and the logits from (5) are added to the corresponding row of M_b as

  P_θ(y_{t+1} | y_{≤t}, x, z) = softmax(o_t + M_b[z]),   (13)

where o_t denotes the logit vector from (5). We refer to this configuration as sf, and to the configuration that includes all of the conditioning signals as all.

Separate Beam Search per Component
At the inference stage, we conduct a separate beam search per mixture component. Performing beam search independently for each component encourages diversity among the translation candidates, as different mixture components often prefer certain phrases and linguistic structures over others. Let ŷ_z denote the result of the beam search for a mixture component z. The final output of our model, denoted ŷ, is computed by selecting the translation candidate with the highest probability under the corresponding mixture component, i.e.,

  ŷ = ŷ_ẑ,   where ẑ = argmax_{z ∈ {1, . . . , K}} P_θ(ŷ_z | x, z).   (14)

In order to accurately estimate the conditional probability of each translation candidate based on (9), one would need to evaluate each candidate under all of the mixture components. However, this process considerably increases the inference time and latency. Instead, we approximate the probability of each candidate by considering only the mixture component from which the candidate was decoded, as outlined in (14). This approximation also encourages the diversity we emphasize in this work.
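The selection rule in (14) is a simple argmax over the per-component candidates and their own scores. A minimal sketch (function and variable names are illustrative):

```python
def select_translation(candidates, scores):
    """Pick the final output among per-component beam-search results.
    `candidates[z]` is the translation decoded from component z and
    `scores[z]` its log-probability log P(y_hat_z | x, z) under that
    same component -- the approximation described in the text."""
    best = max(range(len(candidates)), key=lambda z: scores[z])
    return candidates[best]

# Three components, each proposing one candidate with its own score.
print(select_translation(["cand_a", "cand_b", "cand_c"], [-3.0, -1.0, -2.0]))
```

Note that each candidate is scored only by the component that produced it, so no extra forward passes are needed at inference time.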
Note that we have K mixture components and a beam size of b per component. Overall, this requires processing K × b candidates. Accordingly, we compare our model against a SEQ2SEQ model using the same beam size of K × b.

Memory Efficient Formulation
In this paper, we adopt a relatively small number of mixture components (up to 16), but to encompass various clusters of linguistic content and style, one may benefit from a larger number of components. Based on our experiments, the memory footprint of S2SMIX with K components increases by about a factor of K, partly because the softmax layers take up a large fraction of the memory. To reduce the memory requirements for training our model, inspired by prior work on the EM algorithm (Neal and Hinton, 1998), we re-express the gradient of the conditional log-likelihood objective in (11) exactly as

  ∇_θ log P_θ(y* | x) = ∑_{z=1}^{K} P(z | x, y*) ∇_θ log P_θ(y* | x, z),   (15)

where, with uniform mixing coefficients, the posterior distribution P(z | x, y*) takes the form

  P(z | x, y*) = exp(ℓ_z(y* | x)) / ∑_{z'=1}^{K} exp(ℓ_{z'}(y* | x)),   (16)

where ℓ_z(y | x) ≡ log P_θ(y | x, z). Based on this formulation, one can compute the posterior distribution in a few forward passes, which require much less memory. Then, one can draw one or a few Monte Carlo (MC) samples from the posterior to obtain an unbiased estimate of the gradient in (15). As shown in Algorithm 1, the training procedure is divided into two stages. For each minibatch, we first compute the component-specific log-loss for each mixture component. Then, we exponentiate and normalize the losses as in (16) to obtain the posterior distribution. Finally, we draw one sample from the posterior distribution per input-output example and optimize the parameters according to the loss of the sampled component. These two stages alternate until the model converges. We note that this algorithm follows an unbiased stochastic gradient of the marginal log-likelihood.
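The two stages of this online-EM procedure, computing the posterior over components and drawing one MC sample, can be sketched in NumPy as follows (a stand-alone illustration of the math, not the paper's training code):

```python
import numpy as np

def component_posterior(component_log_probs):
    """Posterior P(z | x, y*) over mixture components under uniform
    mixing coefficients: a softmax over the per-component sequence
    log-probabilities, as in (16)."""
    a = np.asarray(component_log_probs, dtype=float)
    a = a - a.max()           # stabilize the exponentials
    p = np.exp(a)
    return p / p.sum()

def sample_component(component_log_probs, rng):
    """One Monte Carlo sample from the posterior; the gradient step is
    then taken only through the sampled component's loss, giving an
    unbiased estimate of (15)."""
    p = component_posterior(component_log_probs)
    return int(rng.choice(len(p), p=p))
```

Only the forward passes needed for the posterior touch all K components; back-propagation runs through the single sampled component, which is where the memory savings come from.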

Experiments
Dataset. To assess the effectiveness of the S2SMIX model, we conduct a set of translation experiments on TED talks for four language pairs: English→French (en-fr), English→German (en-de), English→Vietnamese (en-vi), and English→Spanish (en-es).
We use the IWSLT14 dataset 2 for en-es and the IWSLT15 dataset for the remaining language pairs.

Implementation details. All of the models use a one-layer bidirectional LSTM encoder and a two-layer LSTM decoder. Each LSTM layer in the encoder and decoder has a 512-dimensional hidden state, and each input word embedding is 512-dimensional as well. We adopt the Adam optimizer (Kingma and Ba, 2014) and dropout with a rate of 0.2. The minibatch size is 64 sentence pairs. We train each model for 15 epochs and select the best model in terms of perplexity on the dev set.
Diversity metrics. Having more diversity in the candidate translations is one of the major advantages of the S2SMIX model. To quantify diversity within a set {ŷ_m}_{m=1}^{M} of translation candidates, we propose to evaluate the average pairwise BLEU between pairs of candidates,

  div_bleu = 100 − (1 / (M(M−1))) ∑_{m ≠ m'} BLEU(ŷ_m, ŷ_{m'}),

so that higher values indicate more diverse candidates. As an alternative metric of diversity within a set {ŷ_m}_{m=1}^{M} of translations, we propose a metric based on the fraction of the n-grams that are not shared among the candidates,

  div_ngram = |∪_m ngram(ŷ_m)| / ∑_m |ngram(ŷ_m)|,

where ngram(y) returns the set of unique n-grams in a sequence y. We report average div_bleu and average div_ngram across the test set for the translation candidates found by beam search. We measure and report bigram diversity in the paper and report unigram diversity in the supplementary material.

2 https://sites.google.com/site/iwsltevaluation2016
3 https://github.com/moses-smt/mosesdecoder
4 https://nlp.stanford.edu/projects/nmt
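The n-gram diversity metric can be sketched directly from its definition: the size of the union of the candidates' n-gram sets divided by the sum of the individual set sizes (our reading of the definition; the paper's exact normalization may differ slightly). A value of 1.0 means no n-gram is shared between candidates:

```python
def div_ngram(candidates, n=2):
    """Fraction of n-grams that are unique across a set of candidate
    translations. Each candidate is a list of tokens; ngram(y) is the
    set of unique n-grams of y, computed inline below."""
    sets = [{tuple(y[i:i + n]) for i in range(len(y) - n + 1)}
            for y in candidates]
    union = set().union(*sets)
    return len(union) / sum(len(s) for s in sets)

# Two identical candidates share every bigram, so diversity is 0.5;
# two candidates with no common tokens reach the maximum of 1.0.
print(div_ngram([["a", "b", "c"], ["a", "b", "c"]]))
print(div_ngram([["a", "b", "c"], ["d", "e", "f"]]))
```

The BLEU-based metric div_bleu follows the same pattern but requires a sentence-level BLEU implementation (e.g., from an MT toolkit), so it is omitted here.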

S2SMIX configuration.
We start by investigating which way of injecting the conditioning signal into the S2SMIX model is most effective. As described in Section 3, the mixture components can be built by adding component-specific vectors to the logits (sf), the top LSTM layer (tp), or the bottom LSTM layer (bt) of the decoder, or all of them (all). Figure 2 shows the BLEU scores of these variants on the translation tasks across the four language pairs. We observe that adding a component-specific vector to the recurrent cells in the bottom layer of the decoder is the most effective, resulting in BLEU scores superior to or on par with the other variants across the four language pairs. Therefore, we use this model variant in all experiments for the rest of the paper.
Furthermore, Table 2 shows the number of parameters in each of the variants as well as the base SEQ2SEQ model. We confirm that the mixture model variants introduce a negligible number of additional parameters.

S2SMIX vs. SEQ2SEQ. We compare our mixture model against a vanilla SEQ2SEQ model in terms of both translation quality and diversity. For a fair comparison, we match the total number of beams at inference time; e.g., we compare the vanilla SEQ2SEQ model using a beam size of 8 against S2SMIX-4, with 4 components and a beam size of 2 per component.
As an effective regularization strategy, we adopt label smoothing to strengthen generalization performance (Szegedy et al., 2016; Pereyra et al., 2017; Edunov et al., 2018). Unlike the conventional cross-entropy loss, where the probability mass for the ground-truth word y is q(y) = 1 and q(y') = 0 for y' ≠ y, we smooth this distribution as

  q(y') = (1 − ε) 𝟙[y' = y] + ε / V,

where ε is a smoothing parameter and V is the vocabulary size. In our experiments, ε is set to 0.1.

Table 4: S2SMIX with 4 components vs. SEQ2SEQ endowed with the beam-diverse decoder (Li et al., 2016) with a beam size of 4.
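A minimal NumPy sketch of the smoothed target distribution, assuming the common variant that spreads ε uniformly over the whole vocabulary (the paper's exact allocation may differ slightly; the function name is ours):

```python
import numpy as np

def smoothed_targets(y, vocab_size, eps=0.1):
    """Label-smoothed target distribution q over the vocabulary:
    every word gets eps / V, and the ground-truth index y gets an
    additional 1 - eps, so the distribution still sums to one."""
    q = np.full(vocab_size, eps / vocab_size)
    q[y] += 1.0 - eps
    return q

# With eps = 0.1 and a 5-word vocabulary, the true word's probability
# drops from 1.0 to 0.9 + 0.1/5 = 0.92.
print(smoothed_targets(2, 5, eps=0.1))
```

Training then minimizes the cross-entropy between q and the model's predictive distribution instead of the one-hot target.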
Each row in the top part of Table 3 should be compared with the corresponding row in the bottom part for a fair comparison in terms of the effective beam size. First, we observe that increasing the beam size deteriorates the BLEU score of the SEQ2SEQ model; similar observations have been made in previous work (Tu et al., 2017). This behavior is in contrast to our S2SMIX model, where increasing the beam size improves the BLEU score, except for en-es, which shows a decreasing trend as the beam size increases from 2 to 4. Second, our S2SMIX models outperform their SEQ2SEQ counterparts in all settings with the same effective beam size. Figure 3 shows the diversity comparison between the S2SMIX model and the vanilla SEQ2SEQ model when the number of decoding beams is 8. The diversity metrics are bigram and BLEU diversity, as defined earlier in this section. Our S2SMIX models significantly dominate the SEQ2SEQ model across language pairs in terms of the diversity metrics, while keeping translation quality high (cf. the BLEU scores in Table 3).
We further compare against the SEQ2SEQ model endowed with the beam-diverse decoder (Li et al., 2016). This decoder penalizes sibling hypotheses generated from the same parent in the search tree according to their ranks at each decoding step. It thus tends to rank hypotheses from different parents higher, encouraging diversity within the beam. Table 4 presents the BLEU scores as well as the diversity measures. As seen, the mixture model significantly outperforms the SEQ2SEQ model endowed with the beam-diverse decoder in terms of the diversity of the generated translations. Furthermore, the mixture model achieves up to 1.7 BLEU score improvement across three language pairs.

Table 5: BLEU scores using greedy decoding and training time based on the original log-likelihood objective and online EM coupled with gradient estimation based on a single MC sample. The training time is reported as the average running time of one minibatch update across a full epoch.
Large mixture models. Memory limitations of the GPU may make it difficult to increase the number of mixture components beyond a certain point. One approach is to decrease the number of sentence pairs per minibatch; however, this results in a substantial increase in training time.
Another approach is to resort to MC gradient estimation, as discussed in Section 3.3. The top part of Table 5 compares the models trained by online EM against those trained on the original log-likelihood objective, in terms of BLEU score and training time. As seen, the BLEU scores of the EM-trained models are on par with those trained on the log-likelihood objective. However, online EM leads to up to a 35% increase in training time for S2SMIX-4 across the four language pairs, as we first need to perform a forward pass on the minibatch in order to form the lower bound on the log-likelihood training objective.
The bottom part of Table 5 shows the effect of online EM coupled with sampling only one mixture component to form a stochastic approximation of the log-likelihood lower bound. For each minibatch, we first run a forward pass to compute the probability of each mixture component given each sentence pair in the minibatch. We then sample a mixture component for each sentence pair to form the approximation of the log-likelihood lower bound for the minibatch, which is then optimized using back-propagation. As we increase the number of mixture components from 4 to 8, we see an increase of about 0.7 BLEU for en-de, while there is no significant change in the BLEU score for en-fr, en-vi, and en-es.
Increasing the number of mixture components further to 16 does not produce gains on these datasets. Time-wise, training with online EM coupled with one-candidate sampling should in principle be significantly faster than vanilla online EM and the original likelihood objective, as we need to perform back-propagation only for the selected mixture component (as opposed to all mixture components). Nonetheless, the additional computation due to increasing the number of mixtures from 4 to 8 is about 26%, which rises to about 55% when going from 8 to 16 mixture components.

Qualitative Analysis
Finally, we demonstrate that our S2SMIX model does indeed encourage diversity and improve translation quality. As shown in Table 6, compared with SEQ2SEQ, which mistranslates the second clause, our S2SMIX model is not only capable of generating a group of correct translations, but also emits synonyms across different mixture components. We provide more examples in the supplementary material.

Related Work
Different domains target different readers and thus exhibit distinctive genres. A well-tuned MT system cannot be directly applied to new domains without a degradation in translation quality. For this reason, out-of-domain adaptation has been widely studied for MT, ranging from data selection (Li et al., 2010; Wang et al., 2017) and fine-tuning (Luong and Manning, 2015; Farajian et al., 2017) to domain tags (Chu et al., 2017). Similarly, in-domain adaptation is a compelling direction. Normally, to train a universal MT system, the training data consists of gigantic corpora covering numerous and varied domains. This training data is naturally so diverse that Mima et al. (1997) incorporated extra-linguistic information to enhance translation quality. Michel and Neubig (2018) argue that even without explicit signals (gender, politeness, etc.), domain-specific information can be handled via annotation of speakers, easily gaining quality improvements from a larger number of domains. Our approach is considerably different from this previous work: we remove the need for any extra annotation and treat domain-related information as latent variables learned from the corpus.
Prior to our work, diverse generation has been studied in image captioning, as some training sets are comprised of images paired with multiple reference captions. Some work focuses on the decoding stage, forming groups of beam search to encourage diversity (Vijayakumar et al., 2016), while other work pays more attention to adversarial training (Shetty et al., 2017). Within translation, our method is similar to Schulz et al. (2018b), who propose an MT system armed with variational inference to account for translation variations. As in our work, their diversified generation is driven by latent variables. Despite its simplicity, our model is effective and able to accommodate variation and diversity. In addition, we propose several diversity metrics to enable quantitative analysis.
Finally, Yang et al. (2018) propose a mixture of softmaxes to enhance the expressiveness of language models, which supports the effectiveness of our S2SMIX model when viewed under a matrix factorization framework.

Conclusions and Future Work
In this paper, we propose a sequence to sequence mixture (S2SMIX) model to improve translation diversity within neural machine translation by incorporating a set of discrete latent variables. We propose a model architecture that requires negligible additional parameters and no extra computation at inference time. To address the prohibitive memory requirements associated with large mixture models, we augment the training procedure by computing the posterior distribution followed by Monte Carlo sampling to estimate the gradients. We observe significant gains in terms of both BLEU scores and translation diversity with a mixture of 4 components. In the future, we intend to replace the uniform mixing coefficients with learnable parameters, since different components should not necessarily make an equal contribution to a given sentence pair. Moreover, we will consider applying our S2SMIX model to other NLP problems in which diversity plays an important role.