Generating Diverse Translation from Model Distribution with Dropout

Despite the improvement in translation quality, neural machine translation (NMT) often suffers from a lack of diversity in its generation. In this paper, we propose to generate diverse translations by deriving a large number of possible models with Bayesian modelling and sampling models from them for inference. The possible models are obtained by applying concrete dropout to the NMT model, and each of them has a specific confidence for its prediction, which corresponds to a posterior model distribution under the specific training data in the principle of Bayesian modelling. With variational inference, the posterior model distribution can be approximated by a variational distribution, from which the final models for inference are sampled. We conducted experiments on Chinese-English and English-German translation tasks, and the results show that our method achieves a better trade-off between diversity and accuracy.


Introduction
In the past several years, neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Gehring et al., 2017; Vaswani et al., 2017; Zhang et al., 2019) based on the end-to-end model has achieved impressive progress in improving translation accuracy. Despite its remarkable success, NMT still faces problems with diversity. In natural language, due to lexical, syntactic and synonymous factors, there are usually multiple proper translations for a sentence. However, existing NMT models mostly implement a one-to-one mapping between natural languages, that is, one source language sentence corresponds to one target language sentence. Although beam search, a widely used decoding algorithm, can generate a group of translations, its search space is too narrow to extract diverse translations.

* Corresponding author: Yang Feng

There has been some research on enhancing translation diversity in recent years. Li et al. (2016) and Vijayakumar et al. (2016) proposed adding regularization terms to the beam search algorithm so that it possesses greater diversity. He et al. (2018) and Shen et al. (2019) introduced latent variables into the NMT model, so that the model can generate diverse outputs using different latent variables. Moreover, Sun et al. (2019) proposed to exploit the structural characteristics of Transformer, using the different weights of the heads in the multi-head attention mechanism to obtain diverse results. Despite their improvements in balancing accuracy and diversity, these methods do not represent the diversity in the NMT model directly.
In this paper, we take a different approach to generating diverse translations by explicitly maintaining different models based on the principle of Bayesian neural networks (BNNs). These models are derived by applying concrete dropout (Gal et al., 2017) to the original NMT model, and each of them is given a probability to show its confidence in generation. According to Bayes' theorem, the probabilities over all the possible models under a specific training dataset form a posterior model distribution which should be involved at inference. To make the posterior model distribution obtainable at inference, we further employ variational inference (Hinton and Van Camp, 1993; Neal, 1995; Graves, 2011; Blundell et al., 2015) to infer a variational distribution to approximate it; at inference we can then sample a specific model from the variational distribution for generation.
We conducted experiments on the NIST Chinese-English and the WMT'14 English-German translation tasks and compared our method with several strong baseline approaches. The experimental results show that our method achieves a good trade-off between translation diversity and accuracy with little training cost.
Our contributions in this paper are as follows:
• We introduce Bayesian neural networks with variational inference to NMT tasks to explicitly maintain different models for diverse generation.
• We apply concrete dropout to the NMT model to derive the possible models, which only demands a small cost in computation.

Background
Assume a source sentence with n words x = x_1, x_2, ..., x_n and its corresponding target sentence with m words y = y_1, y_2, ..., y_m. NMT models the probability of generating y with x as the input. Based on the encoder-decoder framework, the NMT model Θ encodes the source sentence into hidden states with its encoder, and uses its decoder to compute the probability of the t-th word y_t, which depends on the hidden states and the first t − 1 words of the target sentence y. The translation probability from sentence x to y can be expressed as:

P(y|x; Θ) = ∏_{t=1}^{m} P(y_t | y_{<t}, x; Θ)    (1)

Given a training dataset with source-target sentence pairs D = {(x_1, y*_1), ..., (x_D, y*_D)}, the loss function we want to minimize in training is the sum of the negative log-likelihoods of Equation 1:

L(Θ) = − Σ_{i=1}^{D} log P(y*_i | x_i; Θ)    (2)

In practice, by properly designing neural network structures and training strategies, we can obtain the specific model parameters that minimize Equation 2 and produce translations from the model with beam search.
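To make the factorization in Equations 1 and 2 concrete, here is a minimal numerical sketch of the per-sentence negative log-likelihood; the per-step distributions are hypothetical toy values, not the output of an actual NMT model.

```python
import numpy as np

def sentence_nll(step_probs, target_ids):
    """Negative log-likelihood of a target sentence under an
    autoregressive model: -sum_t log P(y_t | y_<t, x)."""
    return -sum(np.log(p[t]) for p, t in zip(step_probs, target_ids))

# toy example: 3 target words over a 4-word vocabulary,
# one (already conditioned) distribution per decoding step
probs = [np.array([0.7, 0.1, 0.1, 0.1]),
         np.array([0.2, 0.6, 0.1, 0.1]),
         np.array([0.1, 0.1, 0.1, 0.7])]
target = [0, 1, 3]
nll = sentence_nll(probs, target)
```

Summing this quantity over the training pairs gives the loss of Equation 2.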
One of the most popular models in NMT is Transformer, which was proposed by Vaswani et al. (2017). Without recurrent or convolutional networks, Transformer constructs its encoder and decoder by stacking self-attention and fully-connected network layers. Self-attention operates on three inputs, query (Q), key (K) and value (V), as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (3)

where d_k is the dimension of the key.
Note that Transformer implements the multi-head attention mechanism, projecting the inputs into h groups of inputs to generate h different outputs with Equation 3:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (4)

and these outputs are concatenated and projected into the final output:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (5)

where W_i^Q, W_i^K, W_i^V and W^O are the projection matrices. The output of Equation 5 is then fed into the fully-connected layer named the feed-forward network, which uses two linear networks and a ReLU activation function:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2    (6)

We only give a brief description of Transformer above; please refer to Vaswani et al. (2017) for more details.
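Equations 3 and 6 can be sketched numerically as follows; the shapes and random inputs are illustrative only, not the actual model dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 query positions, d_k = 4
K = rng.normal(size=(3, 4))   # 3 key/value positions
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)
ffn_out = feed_forward(out, rng.normal(size=(4, 8)), np.zeros(8),
                       rng.normal(size=(8, 4)), np.zeros(4))
```

Each row of the attention weight matrix sums to one, so the output is a convex combination of the value vectors.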

Bayesian Neural Networks with Variational Inference
For most machine learning tasks based on neural networks, a model with specific parameters is trained to explain the observed training data. However, there are usually a large number of possible models that can fit the training data well, which leads to model uncertainty. Model uncertainty may result from noisy data, uncertainty in model parameters or structure uncertainty, and it is reflected in the confidence with which a model is chosen for prediction. In order to express model uncertainty, we consider all possible models with parameters ω and define a prior distribution P(ω) over the models (i.e., over the space of the parameters) to denote model uncertainty. Then, given a training dataset (X, Y), the predictive distribution for Y can be denoted as P(Y|X, ω). Following Gal et al. (2017), we employ Bayesian neural networks (BNNs) to represent P(ω|X, Y), the posterior distribution of the models under (X, Y). BNNs offer a probabilistic interpretation of deep learning models by inferring distributions over the models' parameters with Bayesian inference. The posterior distribution can be obtained by invoking Bayes' theorem:

P(ω|X, Y) = P(Y|X, ω) P(ω) / P(Y|X)    (7)

Then, according to BNNs, given a new test input x′, the predictive distribution of the output y′ is:

P(y′|x′, X, Y) = ∫ P(y′|x′, ω) P(ω|X, Y) dω    (8)

Equations 7 and 8 require integrating over the model space ω (e.g., the evidence P(Y|X) = ∫ P(Y|X, ω) P(ω) dω), and the huge space of ω makes the results intractable to obtain. Therefore, inspired by Hinton and Van Camp (1993), Graves (2011) proposed a variational approximation method, using a variational distribution Q(ω|θ) with parameters θ to approximate the posterior distribution P(ω|X, Y). To this end, the training objective is to minimize the Kullback-Leibler (KL) divergence between the variational distribution and the posterior distribution, KL(Q(ω|θ) || P(ω|X, Y)).
With variational inference, this objective is equivalent to maximizing the evidence lower bound (ELBO), so we get

log P(Y|X) ≥ E_{Q(ω|θ)}[log P(Y|X, ω)] − KL(Q(ω|θ) || P(ω))    (9)

As we can see in Equation 9, the first term on the right side is the expectation of the predicted probability over the model distribution on the training set, which can be estimated without bias by the Monte-Carlo method. The second term is the KL divergence between the approximate model distribution and the prior distribution. From the perspective of Hinton and Van Camp (1993) and Graves (2011), with the above objective we can express model uncertainty under the training data and, at the same time, regularize the model parameters and avoid over-fitting. Therefore, at inference, we can use the distribution Q(ω|θ) instead of P(ω|X, Y) to evaluate model confidence (i.e., model uncertainty).
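The first term of the ELBO can be estimated with Monte-Carlo samples from Q(ω|θ). The sketch below uses a hypothetical scalar "model" and a quadratic log-likelihood purely for illustration; it is not the NMT objective itself.

```python
import numpy as np

def elbo_estimate(log_lik_fn, sample_model_fn, kl_term, n_samples=8):
    """Monte-Carlo estimate of the ELBO:
    E_{omega ~ Q}[log P(Y | X, omega)] - KL(Q || P)."""
    mc = np.mean([log_lik_fn(sample_model_fn()) for _ in range(n_samples)])
    return mc - kl_term

# toy setting: "models" are scalar weights sampled from Q = N(0, 1),
# and the log-likelihood is a quadratic in the sampled weight
rng = np.random.default_rng(0)
est = elbo_estimate(lambda w: -0.5 * (w - 1.0) ** 2,   # toy log-likelihood
                    lambda: rng.normal(),              # one sample from Q
                    kl_term=0.1)
```

More samples reduce the variance of the first term; the KL term is computed in closed form in the next section.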

Model Distribution with Dropout
To derive the BNN, we first need to decide how to explore the possible models, and then decide the prior distribution and the variational distribution for the models. As in Gal et al. (2017), we can define a simple model with parameters ω_W (W ∈ R^{m×n}) and then drop out some columns of ω_W to get the possible models. We use a matrix Gaussian distribution as the prior model distribution and a Bernoulli distribution as the variational posterior model distribution.
Using W_{.j} to denote the j-th column of W, we draw a matrix Gaussian distribution as the prior probability distribution over the j-th column:

P(W_{.j}) = N(0, I / l²)    (10)

where l is the hyper-parameter. The above matrix Gaussian distribution is used as the prior distribution of the models obtained by dropping out the j-th column. Then we introduce p (p ∈ R^{1×n}) as the probability vector of dropping out the columns of ω_W, which means dropping out the j-th column with probability p_j and keeping the j-th column unchanged with probability 1 − p_j. Therefore the posterior model distribution over the j-th column is defined by

ω_{W,.j} = z_j W_{.j},  z_j ∼ Bernoulli(1 − p_j)    (11)

that is,

Q(ω_{W,.j} | θ) = p_j δ(ω_{W,.j} − 0) + (1 − p_j) δ(ω_{W,.j} − W_{.j})    (12)

where W ∈ θ and p ∈ θ are trainable parameters. With Equation 10 as the prior, the KL divergence for the j-th column of the matrix can be represented as:

KL(Q(ω_{W,.j} | θ) || P(W_{.j})) ≈ C + (l² (1 − p_j) / 2) ||W_{.j}||² − H(p_j)    (13)

where C is a constant and

H(p_j) = −p_j log p_j − (1 − p_j) log(1 − p_j)    (14)

is the entropy of the Bernoulli dropout distribution. Since the probability distributions of different neural networks, and of different columns within a neural network, are independent, for a complex multi-layer neural network θ the KL divergence between the model distribution Q(ω|θ) and the prior distribution P(ω) is

KL(Q(ω|θ) || P(ω)) = Σ_{W∈θ} Σ_j KL(Q(ω_{W,.j} | θ) || P(W_{.j}))    (15)

The above shows how to use concrete dropout to realize a variational approximation of the posterior model distribution. In the following sections, we introduce the implementation of this representation of the model distribution in Transformer.
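As a rough numerical sketch of the per-column KL term above, assuming for simplicity a single dropout probability p shared by all columns and dropping the additive constant:

```python
import numpy as np

def dropout_kl(W, p, l=1.0):
    """Approximate KL between the dropout-induced variational distribution
    and a Gaussian prior with length-scale hyper-parameter l: a weight
    term scaled by the keep probability minus the Bernoulli entropy,
    accumulated over the columns of W (constant terms dropped)."""
    weight_term = (l ** 2) * (1.0 - p) / 2.0 * np.sum(W ** 2)
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    n_cols = W.shape[1]  # one dropout decision per column
    return weight_term - n_cols * entropy

W = np.ones((4, 3))
kl = dropout_kl(W, p=0.1, l=1.0)
```

Note how a larger dropout probability shrinks the weight term, which is what pushes the trained probabilities upward when the prior is tight.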

Dropout in Transformer
As detailed in Vaswani et al. (2017), dropout in Transformer is commonly applied to the outputs of modules, including the outputs of the embedding, attention and feed-forward layers. From Equation 13, it is important to identify the network W corresponding to each dropout module. The correspondences in Transformer are as follows:

Embedding module The embedding module maps words into embedding vectors. It contains a matrix W_E ∈ R^{d×l_d}, where l_d is the size of the dictionary and d is the dimension of the embedding vector. For the i-th word in the dictionary, its embedding vector is the i-th column of W_E. Since dropping out the j-th dimension of a word embedding is equivalent to dropping out the j-th row of W_E, we utilize W_E^T and its corresponding dropout in Equation 13.
Attention module For attention modules, as shown in Equation 5, the outputs are generated by concatenating the outputs of the different heads and projecting with the matrix W^O. Since dropout is applied to the output of the attention module, we take W^O and its corresponding dropout in calculating Equation 13.
Feed-forward module As shown in Equation 6, the output is generated through W_2 with bias b_2. For a network y = xW + b, dropout applied to the output can be regarded as dropping out the corresponding columns of W and entries of b together. So, during training, we use Concat(W_2^T, b_2^T)^T to calculate Equation 13.

Training and Inference
Although dropout is frequently utilized in Transformer, there are some networks in Transformer whose output is not masked by dropout. So, in our implementation, we obtain the model distribution by fine-tuning the pre-trained model, freezing its parameters and only updating the dropout probabilities. Moreover, we choose different trained modules to train their output dropout probabilities, and in calculating Equation 15, we only take those trained dropout probabilities into consideration. By allowing the dropout probabilities to change, our method can represent the model distribution under the training dataset better than fixed dropout probabilities. The mini-batch training algorithm is expressed in Algorithm 1: while training has not converged, for each mini-batch we compute the objective L and update θ ← θ + η ∂L/∂θ. It is worth mentioning that, since we train the model distribution with batches of data, we scale the KL divergence with the proportion of the batch in the entire training dataset.

When updating the dropout probability, due to the discrete nature of the Bernoulli distribution, we cannot directly calculate the gradient of the first term in Equation 9 with respect to the dropout probability. So, we adopt concrete dropout, as used in Gal et al. (2017). As a continuous relaxation of dropout, for an input y the output can be expressed as ỹ = y ⊙ z, where the vector z satisfies:

z = 1 − σ((log p − log(1 − p) + log u − log(1 − u)) / t)

where u ∼ U(0, 1), p is the dropout probability, σ is the sigmoid function and t is the temperature.
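The concrete relaxation can be sketched as below; the temperature value is an assumption (Gal et al. (2017) use a small fixed temperature), and the mask here is generated standalone rather than inside a network.

```python
import numpy as np

def concrete_dropout_mask(p, shape, temperature=0.1, rng=None):
    """Continuous relaxation of a Bernoulli dropout mask (concrete
    dropout). As temperature -> 0 the mask approaches hard 0/1 keeps,
    dropping each unit with probability p; the relaxation keeps the
    mask differentiable with respect to p."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-7, 1 - 1e-7, size=shape)
    logits = (np.log(p) - np.log(1 - p) + np.log(u) - np.log(1 - u)) / temperature
    drop = 1.0 / (1.0 + np.exp(-logits))  # sigmoid; ~1 means "dropped"
    return 1.0 - drop                     # soft keep mask in [0, 1]

rng = np.random.default_rng(0)
mask = concrete_dropout_mask(p=0.3, shape=(10000,), temperature=0.1, rng=rng)
```

With a low temperature the average keep rate is close to 1 − p, matching the Bernoulli dropout it relaxes.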
In the inference stage, we simply mask the model parameters randomly with the trained dropout probabilities: with different random seeds, NMT models with different parameters are sampled. Since diverse translations are demanded, we perform several forward passes through different sampled NMT models, and different translations are generated from the different model outputs with beam search.
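The sampling procedure at inference can be sketched as follows; the toy weight matrices and their names are hypothetical, and a real implementation would mask the Transformer matrices identified in the previous section before decoding with beam search.

```python
import numpy as np

def sample_models(weights, dropout_p, n_models, seed=0):
    """Draw several concrete models from the dropout-based model
    distribution: each sample independently masks the columns of every
    weight matrix with that matrix's trained drop probability."""
    models = []
    for i in range(n_models):
        rng = np.random.default_rng(seed + i)  # one seed per sampled model
        model = {name: W * (rng.random(W.shape[1]) >= dropout_p[name])[None, :]
                 for name, W in weights.items()}
        models.append(model)
    return models

weights = {"W_O": np.ones((4, 6)), "W_2": np.ones((8, 4))}
p = {"W_O": 0.3, "W_2": 0.1}
models = sample_models(weights, p, n_models=5)
```

Running beam search once per sampled model then yields one translation group per sample.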

Experiment Setup
Dataset In our experiments, we select datasets from the following translation tasks:
• NIST Chinese-to-English (NIST Zh-En). The dataset is based on the LDC news corpus and contains about 1.34 million sentence pairs. It also includes 6 relatively small datasets: MT02, MT03, MT04, MT05, MT06, and MT08. In our experiments, we use MT02 as the development set, and the rest serve as test sets. Unless otherwise specified, we use the average result over the test sets as the final result.
• WMT'14 English-to-German (WMT'14 En-De). The dataset comes from the WMT'14 news translation task and contains about 4.5 million sentence pairs. In our experiment, we use newstest2013 as the development set and newstest2014 as the test set.
For the above two datasets, we adopt the Moses tokenizer (Koehn et al., 2007) for the English and German corpora. We also use the byte pair encoding (BPE) algorithm (Sennrich et al., 2015) and limit the vocabulary size to K = 32000, training a joint dictionary for WMT'14 En-De. For NIST, we use the THULAC toolkit to segment Chinese sentences into words. In addition, we remove examples from the above two tasks where the length of the source or target sentence exceeds 100 words.
Model Architecture In all our experiments, we adopt the Transformer Base model of Vaswani et al. (2017). The Transformer Base model has 6 layers in the encoder and decoder, and 512-dimensional hidden units, except for the feed-forward network, whose inner-layer output dimension is 2048. The number of heads is 8 and the default dropout probability is 0.1. Our model is implemented in Python 3 with the Fairseq-py toolkit.
Experimental Setting During training, in order to improve accuracy, we use label smoothing (Szegedy et al., 2016) with ε = 0.1. For optimization, we adopt the Adam optimizer (Kingma and Ba, 2014) with β_1 = 0.9, β_2 = 0.98, and ε = 10⁻⁹. For the learning rate, we adopt the dynamic learning rate schedule of Vaswani et al. (2017) with warmup steps = 4000. We also use mini-batch training with max tokens = 4096.
Metrics In terms of evaluation metrics, following Shen et al. (2019), we adopt BLEU and Pairwise-BLEU to evaluate translation quality and diversity. Both metrics are calculated with the case-insensitive BLEU algorithm of Papineni et al. (2002). In our experiments, BLEU measures the average similarity between the output translations and the reference translations: the higher the BLEU value, the better the translation accuracy. Pairwise-BLEU reflects the average similarity between the output translations of different groups: the lower the Pairwise-BLEU value, the lower the similarity, and the more diverse the translations. In our experiments, we use the NLTK toolkit to calculate both metrics.
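Pairwise-BLEU averages a similarity score over ordered pairs of output groups. The sketch below substitutes a toy unigram-precision score for BLEU so that it stays self-contained; in the experiments the actual BLEU implementation from NLTK is used instead.

```python
from itertools import permutations

def unigram_precision(hyp, ref):
    """Toy stand-in for BLEU: fraction of hypothesis tokens that
    appear in the reference."""
    ref_tokens = set(ref.split())
    hyp_tokens = hyp.split()
    return sum(t in ref_tokens for t in hyp_tokens) / len(hyp_tokens)

def pairwise_score(groups, metric=unigram_precision):
    """Average the metric over all ordered pairs of distinct output
    groups; lower values indicate more diverse translation groups."""
    pairs = list(permutations(groups, 2))
    return sum(metric(h, r) for h, r in pairs) / len(pairs)

groups = ["the cat sat on the mat",
          "a cat is sitting on the mat",
          "on the mat a cat sat down"]
score = pairwise_score(groups)
```

Identical groups give the maximum score, so diverse outputs push the pairwise value down.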
Experiment Results

Analysis of Training Modules and Hyper-parameter
In this experiment, we train models with different training modules and hyper-parameters l on the NIST dataset to evaluate their effects; some results are shown in Table 1.
As we can see in Table 1, when l is small, with the same hyper-parameter l, choosing a smaller part of the training modules leads to higher BLEU and higher Pairwise-BLEU, showing that the accuracy of the generated translations increases while diversity decreases. We also find that with the same training modules, as l increases, Pairwise-BLEU decreases steadily and then increases when BLEU is close to zero; BLEU has a similar trend to Pairwise-BLEU, although when l is relatively low, BLEU tends to stabilize.
For the above-mentioned experimental results, we offer the following interpretation. Regarding the training modules: since Equation 15 is the sum of the training modules' KL divergences, as the number of training modules increases, the KL divergence accordingly increases, pushing the dropout probabilities higher and making translations more diverse. Regarding the hyper-parameter l: as we can see in Equation 10, when l increases, the prior distribution is squeezed toward the zero matrix; thus during training, the dropout probabilities get higher to keep the model distribution close to the prior distribution. However, when l is so high that most of the dropout probabilities are close to 1, the uncertainty of the model parameters decreases, making Pairwise-BLEU increase.

Table 1: BLEU and Pairwise-BLEU under different training modules with l² ranging from 10¹ to 10⁸. BLEU decreases with the increase of l, and Pairwise-BLEU decreases steadily and then increases when the BLEU value is close to its minimum.

Results in Diverse Translation
From the previous section, we can see that by selecting different training modules and hyper-parameters, translations with different accuracy and diversity can be obtained. We then conduct experiments to generate 5 groups of different translations on the NIST Zh-En and WMT'14 En-De datasets, and compare the diversity and accuracy of the translations generated by our method and the following baseline approaches:
• Beam Search: we choose the 5 best results directly generated by beam search.
• Diverse Beam Search (DBS) (Vijayakumar et al., 2016): it groups the outputs and adds regularization terms in beam search to encourage diversity. In our experiment, the number of output translations per group is 5.
• HardMoE (Shen et al., 2019): it trains the model with different hidden states and obtains different translations by controlling the hidden state. In our experiment, we set the number of hidden states to 5.
• Head Sampling (Sun et al., 2019): it generates different translations by sampling different encoder-decoder attention heads according to their attention weights, and copying the sampled head to other heads under some conditions. Here, we set the parameter K = 3.
The results are shown in Figure 1, where we plot BLEU versus Pairwise-BLEU: the scattered points show the results of the baseline approaches, and the points on the curves are results with the same training modules under different hyper-parameters l.

Figure 1: Experiment results on NIST Zh-En (upper) and WMT'14 En-De (lower). The X axis and Y axis represent the BLEU and Pairwise-BLEU values. The first three groups in the legend (connected with curves) are results of our method under specific training modules with different prior parameters l, and the latter four groups (scattered points) are results from the baseline methods.

Table 2: Translation examples in the NIST Zh-En task. Our results are generated by training dropout in the decoder with l² = 1000. The results show that by adjusting model parameters, our method can generate translations with higher diversity while maintaining accuracy.

From Figure 1, firstly, we can verify that choosing different training modules leads to different balances between translation diversity and accuracy; for both NIST Zh-En and WMT'14 En-De, training the dropout probabilities of the full model yields better translations.
Also, the results suggest that in the NIST Zh-En task, by adjusting the training modules and the hyper-parameter l, our method achieves higher BLEU and lower Pairwise-BLEU than all baselines except HardMoE; even compared with HardMoE itself, our method is comparable with a proper l when training the whole model. On WMT'14 En-De, we also find that our method exceeds the baseline approaches except HardMoE. As for the gap in performance with HardMoE, we interpret it as follows: since our models are randomly sampled from the model distribution, it may be hard for them to represent characteristics as distinguishable as those of HardMoE, which trains multiple different latent variables.
Also, to display the improvement in the diversity of our translations intuitively, we choose a case from the NIST Zh-En task; the results are shown in Table 2. The case shows that compared with beam search, which only varies in a few words, our method can obtain more diverse translations while ensuring translation accuracy, and the diversity is reflected not only in word choice but also in lexical characteristics.

Analyzing Module Importance with Dropout Probability
Some studies (Voita et al., 2019; Michel et al., 2019) found that a well-trained Transformer model is over-parameterized: useful information gathers in some parameters, and some modules and layers can be pruned to improve efficiency at test time. Since dropout can play the role of regularization and there are differences in the trained dropout probabilities of different neurons, we conjecture that the trained dropout probability and the importance of each module are correlated. To investigate this, we choose the model in which the dropout probabilities of the full model are trained with l² = 400 in the NIST Zh-En task, and separately calculate the average dropout probability p̄_dropout of the different attention modules. We also manually prune the corresponding module of the model, obtain translations and calculate the BLEU: the more the BLEU drops, the more important the module is to translation. To quantify their relevance, we calculate the Pearson correlation coefficient (PCC) ρ for the different kinds of training modules, and highlight the highest and lowest results.
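The correlation measure used here can be computed as below; the per-module statistics in the example are illustrative numbers only, not the measured values from Table 3.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

# hypothetical numbers: average trained dropout probability per module and
# the BLEU obtained after pruning that module (values are illustrative)
p_dropout = [0.12, 0.35, 0.08, 0.27, 0.19]
bleu_pruned = [30.1, 38.2, 27.5, 36.0, 33.4]
rho = pearson_corr(p_dropout, bleu_pruned)
```

A high positive ρ would indicate that modules with high trained dropout probability hurt BLEU little when pruned, i.e., that they are less important.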
Results of our experiment are shown in Table 3.

Table 3: Average dropout probabilities of each module and the BLEU of translations generated by the model with that module pruned. From the maximum and minimum of p̄_dropout and BLEU, and the correlation coefficient ρ for the different modules, it is clear that the dropout probability of a module and its importance are correlated.

Firstly, we can see that the average dropout probabilities and BLEU are not perfectly positively correlated, which might be explained by the randomness of sampling from the model distribution during training. But from the maximum and minimum of p̄_dropout and BLEU, we find that the dropout probabilities p̄_dropout and the BLEU of translations convey similar information about module importance. Also, when we quantify the correlation between p̄_dropout and BLEU, we find that they are highly correlated for the self-attention modules in the encoder and the encoder-decoder attention in the decoder, where the correlation coefficient ρ > 0.8; p̄_dropout and BLEU are also correlated for self-attention in the decoder.

Related Work
Research on Bayesian neural networks has a long history. Hinton and Van Camp (1993) first proposed a variational inference approximation for BNNs to minimize the minimum description length (MDL); then Neal (1995) approximated BNNs with Hamiltonian Monte Carlo methods. More recently, Graves (2011) introduced the concept of variational inference: by approximating the posterior distribution with a model distribution, the model minimizes its MDL and reduces the model weights. Blundell et al. (2015) proposed an algorithm similar to that of Graves (2011), but using a mixture of Gaussian densities as the prior, and achieved performance comparable with dropout in regularization. Introduced by Hinton et al. (2012), dropout, which is easy to implement, works as a stochastic regularizer to avoid over-fitting, and there are several theoretical explanations for it, such as training a large number of model combinations (Hinton et al., 2012; Srivastava et al., 2014) and augmenting the training data (Bouthillier et al., 2015). Gal and Ghahramani (2016) proposed that dropout can be understood as a Bayesian inference algorithm, and Gal et al. (2017) used concrete dropout to update the dropout probabilities. The dropout method is also applied to represent uncertainty in different kinds of deep learning tasks in Gal (2016).
In the neural machine translation task, lack of diversity is a widely acknowledged problem. Some works, such as Ott et al. (2018), investigate the cause of uncertainty in NMT, and others provide metrics to evaluate translation uncertainty, such as Galley et al. (2015) and Dreyer and Marcu (2012). There are also works that put forward methods to obtain diverse translations. Shao et al. (2018) propose a new probabilistic n-gram-based loss to conduct sequence-level training for generating diverse translations. Feng et al. (2020) propose to employ future information to evaluate fluency and faithfulness to encourage diverse translation.
There are also a few papers interpreting the Transformer model. Voita et al. (2019) suggest that some heads play a consistent role in machine translation and that their roles can be interpreted linguistically; they also apply an L_0 penalty to prune heads. Michel et al. (2019) show that a large number of heads in Transformer can be pruned, and that head importance carries across domains. Further work shows that the layers in Transformer can also be pruned: similar to our work, during training they drop whole layers with dropout and train the drop probabilities; however, the variational inference strategy is not used in that work, and they adopt different inference strategies to balance performance and efficiency rather than sampling.

Conclusion
In this paper, we propose to utilize variational inference for diverse machine translation. We represent the Transformer model distribution with dropout, and train the model distribution to minimize its distance to the posterior distribution under the specific training dataset. We then generate diverse translations with models sampled from the trained model distribution. We further analyze the correlation between module importance and the trained dropout probabilities. Experimental results on Chinese-English and English-German translation tasks suggest that by properly adjusting the trained modules and the prior parameters, we can generate translations that balance accuracy and diversity well.
In future work, firstly, since our models are randomly sampled from the model distribution to generate diverse translations, it would be worthwhile to explore better algorithms and training strategies to represent the model distribution and to search for the most distinguishable models within it. Also, we will try to extend our method to a wider range of NLP tasks.