Uncertainty-Aware Semantic Augmentation for Neural Machine Translation

As a sequence-to-sequence generation task, neural machine translation (NMT) naturally contains intrinsic uncertainty, where a single sentence in one language has multiple valid counterparts in the other. However, the dominant methods for NMT only observe one of them from the parallel corpora for the model training but have to deal with adequate variations under the same meaning at inference. This leads to a discrepancy of the data distribution between the training and the inference phases. To address this problem, we propose uncertainty-aware semantic augmentation, which explicitly captures the universal semantic information among multiple semantically-equivalent source sentences and enhances the hidden representations with this information for better translations. Extensive experiments on various translation tasks reveal that our approach significantly outperforms the strong baselines and the existing methods.


Introduction
In recent years, neural machine translation (NMT) has demonstrated state-of-the-art performance on many language pairs with advanced architectures and large-scale data (Bahdanau et al., 2015; Vaswani et al., 2017). At training time, the parallel data contains only one source sentence as the input for each target sentence, and the other reasonable source sentences are ignored, while at inference the resulting model has to deal with adequate variations under the same meaning. This discrepancy of the data distribution makes the inherent uncertainty of machine translation a formidable learning challenge: typically there are several semantically-equivalent source sentences that can be translated to the same target sentence, but the model observes only one of them at training time. It is therefore natural to enable an NMT model trained with the token-level cross-entropy (CE) to capture such a rich distribution, which is exactly what motivates our work.
Intuitively, the NMT model should be trained under the guidance of the same latent semantics that it will access at inference time. In their seminal work, the variational models (Blunsom et al., 2008; Zhang et al., 2016; Shah and Barber, 2018) introduce a continuous latent variable to serve as a global semantic signal to guide the generation of target translations. Wei et al. (2019) consider a universal topic representation for each sentence pair as global semantics for enhancing the representations learnt by NMT models. Other work minimizes the difference between the representations of source and target sentences. Although these methods yield notable results, they are still limited to one-to-one parallel sentence pairs.
To address these problems, we present a novel uncertainty-aware semantic augmentation method, which takes account of the intrinsic uncertainty arising from the one-to-many nature of machine translation. Specifically, we first synthesize multiple reasonable source sentences to play the role of inherent uncertainty for each target sentence. To achieve this, we introduce a controllable sampling strategy to cover adequate variations of the inputs: it quantifies the sharpness of the word distribution at each decoding step and outputs the proper word, namely the one with the maximum probability if the distribution is sharp enough, or one determined by multinomial sampling otherwise. Then a semantic constrained network (SCN) is developed to summarize multiple source sentences that share the same meaning into a closed semantic region, and the model generates translations augmented by this region. By integrating such soft correspondences into the translation process, the model can intuitively work well when fed with an unfamiliar literal expression that is supported by its underlying semantics. In addition, given the effectiveness of leveraging monolingual data in improving translation quality (Sennrich et al., 2016a), we further propose to combine the strengths of semantic augmentation and massive monolingual data in the target language.
We conduct extensive experiments in both a supervised setup with bilingual data only, and a semi-supervised setup where both bilingual and target monolingual data are available. We evaluate the proposed approach on the widely used WMT14 English→French, WMT16 English→German, NIST Chinese→English and WMT18 Chinese→English benchmarks. Experimental results show that the proposed approach consistently improves translation performance on multiple language pairs. As another bonus, by adding monolingual data in German, our approach yields an additional gain of +1.5∼+3.3 BLEU points on the WMT16 English→German task. Extensive analyses reveal that:
• Our approach demonstrates a strong capability for learning semantic representations.
• The proposed controllable sampling strategy introduces reasonable uncertainties into the training data and generates sentences of both high diversity and high quality.
• Our approach motivates the models to be consistent when processing equivalent source inputs with various literal expressions.

Preliminary
Neural Machine Translation (Bahdanau et al., 2015) directly models the translation probability of a target sentence $y = y_1, ..., y_{T_y}$ given its corresponding source sentence $x = x_1, ..., x_{T_x}$:

$$P(y \mid x; \theta) = \prod_{i=1}^{T_y} P(y_i \mid y_{<i}, x; \theta), \qquad (1)$$

where $\theta$ is a set of model parameters and $y_{<i}$ is a partial translation. The word-level translation probability is formulated as $P(y_i \mid y_{<i}, x; \theta) \propto \exp\{g(y_{i-1}, s_i, c_i; \theta)\}$, in which $g(\cdot)$ denotes a non-linear function that predicts the $i$-th target word $y_i$ from the decoder state $s_i$ and the context vector $c_i$, which is summarized from the sequence of encoder representations by an attention module. For training, given a parallel corpus $\{(x^n, y^n)\}_{n=1}^{N}$, the objective is to maximize $\log P(y^n \mid x^n; \theta)$ over the entire training set.
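The factorization in Eq. (1) means the sentence-level log-likelihood is a sum of token-level log-probabilities. The snippet below is a minimal sketch of that arithmetic only; the function name and the assumption that per-token probabilities are already available are illustrative and not part of the original method description.

```python
import math


def sequence_log_prob(token_probs):
    """Return log P(y|x) given the per-token probabilities P(y_i | y_<i, x).

    `token_probs` is a hypothetical list of probabilities in (0, 1], one per
    target token, as a trained model would assign them.
    """
    return sum(math.log(p) for p in token_probs)


# Two tokens, each with probability 0.5, give a sentence probability of 0.25.
example = sequence_log_prob([0.5, 0.5])
```

Maximizing this quantity over the training corpus is equivalent to minimizing the token-level cross-entropy.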
Related Work on Data Augmentation. Data augmentation (DA) has been used to improve the diversity of training signals for NMT models, for example by randomly shuffling (swapping) or dropping some words in a sentence (Iyyer et al., 2015; Artetxe et al., 2018; Lample et al., 2018), randomly replacing one word in the original sentence with another word (Fadaee et al., 2017; Xie et al., 2017; Kobayashi, 2018; Cheng et al., 2018), using syntax-aware methods (Duan et al., 2020), or using target monolingual data (Sennrich et al., 2016a; Cheng et al., 2016; Zhang et al., 2018; Wu et al., 2018; Hoang et al., 2018; Niu et al., 2018; Edunov et al., 2018; Imamura et al., 2018). More recently, Fadaee and Monz (2018) introduce several variations of sampling strategies targeting difficult-to-predict words. Other studies examine what benefits data augmentation brings across different methods and tasks, or propose to improve the robustness of NMT models towards perturbations and minor errors by introducing adversarial inputs into the training process. In contrast, we aim at bridging the discrepancy of the data distribution between the training and the inference phases by augmenting each training instance with multiple semantically-equivalent source inputs.
Related Work on Uncertainty in NMT. Recently, an increasing number of studies have investigated the effects of quantifying uncertainties in different applications (Kendall and Gal, 2017; Xiao and Wang, 2018; Zhang et al., 2019b,a; Shen et al., 2019). However, most work in NMT has focused on improving accuracy without much consideration for the intrinsic uncertainty of the translation task itself. In their seminal work, the latent variable models (Blunsom et al., 2008; Zhang et al., 2016) introduce a (set of) continuous latent variable(s) to model the underlying semantics of source sentences and to guide the generation of target translations. Zaremoodi and Haffari (2018) propose a forest-to-sequence NMT model to make use of exponentially many parse trees of the source sentence. Other work analyzes the uncertainty in NMT, demonstrating how uncertainty is captured by the model distribution and how it affects search strategies, or proposes to quantify the confidence of NMT model predictions based on model uncertainty. Our work differs significantly from these. We model the inherent uncertainty by representing multiple source sentences in a closed semantic region, and use this semantic information to enhance NMT models so that diverse literal expressions can intuitively be supported by their underlying semantics.

Figure 1: Uncertainty-Aware Semantic Augmentation for NMT. $\mathcal{X}(y)$ indicates a set of semantically-equivalent source sentences for $y$. The blue-solid and red-dashed lines represent the forward-pass information flow for $x$ and $\bar{x}$, respectively. Note that our method involves a shared encoder as well as a shared decoder for both $x$ and $\bar{x}$.

Uncertainty-Aware Semantic Augmentation for NMT
Here, we present the uncertainty-aware semantic augmentation (as shown in Figure 1), which takes account of the intrinsic uncertainty of machine translation and enhances the latent representation semantically. For each sentence pair $(x, y)$, suppose $\mathcal{X}(y)$ is a set of correct source sentences for $y$, in which each sentence $\bar{x}$ is assumed to have the same meaning as $x$. Given a training corpus $\mathcal{D}$, we introduce an objective function (Eq. (2)) built from the following terms:
• $\ell_{sem}(\bar{x}, x)$ encourages the SCN to extract the core semantics ($\bar{z}$ and $z$) for $\bar{x}$ and $x$ respectively, while constraining them into a closed semantic region. It is formulated as the negative Kullback-Leibler (KL) divergence between the semantic distributions $P_\phi(\bar{z} \mid \bar{x})$ and $P_\phi(z \mid x)$, where $\phi$ denotes the combined parameters of the encoder and the SCN.
• $\ell_{mle}(x, y; z)$ and $\ell_{mle}(\bar{x}, y; \bar{z})$ guide the decoder to generate the output $y$ with the assistance of input-invariant semantics given the diverse inputs $x$ and $\bar{x}$.
• $\lambda_1$ and $\lambda_2$ control the balance between the original source sentence $x$ and its reasonable counterparts $\mathcal{X}(y)$. In experiments, we set $\lambda_1 + \lambda_2 = 1.0$, which means a target sentence occurs once in total. $\gamma$ controls the impact of the semantic agreement training to be described in Section 3.3.
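One way to write an objective consistent with the terms listed above is sketched below. It is a reconstruction under assumptions rather than a verbatim formula: the symbol $\mathcal{J}$ and the averaging over $\mathcal{X}(y)$ are assumed here, chosen so that with $\lambda_1 + \lambda_2 = 1$ each target sentence is counted once in total.

```latex
\mathcal{J}(\theta, \phi) = \sum_{(x, y) \in \mathcal{D}} \Big[
    \lambda_1 \, \ell_{mle}(x, y; z)
  + \frac{1}{|\mathcal{X}(y)|} \sum_{\bar{x} \in \mathcal{X}(y)} \big(
      \lambda_2 \, \ell_{mle}(\bar{x}, y; \bar{z})
    + \gamma \, \ell_{sem}(\bar{x}, x) \big) \Big]
```

Under this reading the objective is maximized: the $\ell_{mle}$ terms are log-likelihoods and $\ell_{sem}$ is a negative KL divergence, so larger values mean better translations and closer semantic distributions.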
Intuitively, our new objective is exactly a regularized version of the widely used maximum likelihood estimation (MLE) in conventional NMT. The models are trained to optimize both the translation loss and the semantic agreement between $x$ and $\bar{x}$. In the following sections, we first describe how to summarize multiple source sentences into a closed semantic region by developing a semantic constrained network (SCN) in Section 3.1, and then introduce the proposed controllable sampling strategy in Section 3.2 to construct adequate and reasonable variations of the source inputs.

Semantic Constrained Network
Network Architecture. One core component of our approach is the proposed SCN, which aims to learn the global semantics and make them indistinguishable across multiple source sentences ($x$ and $\bar{x}$). We adopt a CNN to address the variable-length problem of the sequence of hidden representations $H^x$ (the output of the top encoder layer given $x$). Formally, given an encoded representation $H^x = H^x_1, H^x_2, ..., H^x_{T_x}$, the SCN first represents it as:

$$\xi_{1:T_x} = H^x_1 \oplus H^x_2 \oplus \cdots \oplus H^x_{T_x},$$

where $\oplus$ is the concatenation operator used to build the matrix $\xi_{1:T_x}$. Then a convolution operation with a kernel $W_c$ is applied to each window of $l$ words, $\xi_{i:i+l-1}$, to produce a new feature $c_i$ from $W_c \otimes \xi_{i:i+l-1} + b$, where the $\otimes$ operator is the sum of element-wise products and $b$ is a bias term. Finally, we apply a max-over-time pooling operation over the feature map $c_1, c_2, ..., c_{T_x-l+1}$, taking $\max\{c_1, c_2, ..., c_{T_x-l+1}\}$ to capture the most important feature, that is, the one with the highest value. We can use various numbers of kernels with different window sizes to repeat the above process and extract different features to form the semantic representation, denoted as $H^c$ for $x$ (and $H^{\bar{c}}$ for $\bar{x}$ in a symmetric way).
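The convolution and max-over-time pooling described above can be sketched in plain Python. This is an illustrative sketch under assumptions: the function names are hypothetical, any non-linearity applied to each feature is omitted, and each kernel is given as a pair of an `l x d` weight matrix and a scalar bias.

```python
def conv_feature_map(H, W, b):
    """Slide kernel W (l x d) over H (T x d) and return c_1 .. c_{T-l+1}.

    Each feature is the sum of element-wise products between W and the
    window H[i : i + l], plus the bias b. Requires len(H) >= len(W).
    """
    l = len(W)
    features = []
    for i in range(len(H) - l + 1):
        s = b
        for a in range(l):
            s += sum(w * h for w, h in zip(W[a], H[i + a]))
        features.append(s)
    return features


def scn_features(H, kernels):
    """Max-over-time pooling: keep the largest feature produced by each kernel."""
    return [max(conv_feature_map(H, W, b)) for W, b in kernels]


# Toy encoder output with T = 3 positions and d = 2 dimensions.
H = [[1, 0], [0, 1], [2, 2]]
kernels = [
    ([[1, 1], [1, 1]], 0),   # window size l = 2
    ([[1, 0]], 0.5),         # window size l = 1
]
H_c = scn_features(H, kernels)
```

Because pooling keeps one value per kernel, the resulting vector has a fixed length regardless of the sentence length, which is what lets sentences of different lengths be compared in the same semantic space.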
Semantic Agreement Training. Given the semantic distributions $P_\phi(z \mid x)$ of $x$ and $P_\phi(\bar{z} \mid \bar{x})$ of $\bar{x}$, we formulate $\ell_{sem}(\bar{x}, x)$ as the negative KL divergence between them. We assume $P_\phi(z \mid x)$ and $P_\phi(\bar{z} \mid \bar{x})$ have the following Gaussian forms:

$$P_\phi(z \mid x) = \mathcal{N}(z; \mu, \sigma^2 I), \qquad P_\phi(\bar{z} \mid \bar{x}) = \mathcal{N}(\bar{z}; \bar{\mu}, \bar{\sigma}^2 I).$$

The mean $\mu$ ($\bar{\mu}$) and s.d. $\sigma$ ($\bar{\sigma}$) are the outputs of neural networks based on the observation $H^c$ (or $H^{\bar{c}}$), with trainable parameters $W_\mu$, $b_\mu$, $W_\sigma$ and $b_\sigma$. To obtain a representation for the latent semantic distributions, we employ the reparameterization technique as in (Kingma et al., 2014; Zhang et al., 2016). Formally,

$$z = \mu + \sigma \odot \epsilon,$$

where $\epsilon \sim \mathcal{N}(0, I)$ plays the role of introducing noise, and $\odot$ denotes an element-wise product. There can be other proper strategies to unify the semantics of diverse inputs; we present just one example. The Gaussian form adopted here has several advantages, such as analytical evaluation of the KL divergence and ease of reparameterization for efficient gradient computation.

Table 1: Relation between the sharpness threshold τ and the sampling method.
Threshold          Method
τ = 0              Multinomial sampling
τ = +∞             Greedy search
τ ∈ (0, +∞)        Controllable sampling
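The semantic agreement term and the reparameterization step described in this subsection can be illustrated with the closed-form KL divergence between two diagonal Gaussians. The sketch below makes two assumptions that the text does not settle: the KL direction is taken as KL(P(z̄|x̄) || P(z|x)), and the function names are hypothetical.

```python
import math
import random


def kl_diag_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, sigma_p^2 I) || N(mu_q, sigma_q^2 I) ) in closed form."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total


def l_sem(mu_bar, sigma_bar, mu, sigma):
    """Negative KL between the semantic distributions (assumed direction)."""
    return -kl_diag_gauss(mu_bar, sigma_bar, mu, sigma)


def reparameterize(mu, sigma, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element by element."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Identical distributions agree perfectly, so the term reaches its maximum of 0.
agreement = l_sem([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
```

Since the KL divergence is non-negative, `l_sem` is at most zero, and maximizing it pulls the two semantic distributions together.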
Augment Semantically. Given the encoder output $H^x$ of $x$, we augment it semantically with the captured semantics $z$ by combining them at each position $t$ with a gate $g = \mathrm{sigmoid}(z \cdot W_{gz} + H^x_t \cdot W_{gx})$, which yields the augmented representation $H^o$. Identically, $H^{\bar{o}}$ can be formulated given $\bar{z}$ and $H^{\bar{x}}$. Finally, the augmented source representation $H^o$ (or $H^{\bar{o}}$) is fed to the decoder to generate the final translation $y$ conditioned on $x$ (or $\bar{x}$). In this way, our model can intuitively work well when meeting infrequent literal expressions, as they can be pivoted by their corresponding semantic regions.
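A gated combination of this kind can be sketched as follows. The gate formula follows the text, but the way the gate mixes the two inputs is an assumption made for illustration: here the output at each position is `g * H_t + (1 - g) * z`, and `z` is assumed to have the same dimensionality as the encoder states. All function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def vec_mat(v, M):
    """Multiply a row vector v (length n) by a matrix M (n x m)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def augment(H, z, W_gz, W_gx):
    """Gate each encoder state H_t against the sentence-level semantics z."""
    z_proj = vec_mat(z, W_gz)
    augmented = []
    for h_t in H:
        h_proj = vec_mat(h_t, W_gx)
        g = [sigmoid(a + b) for a, b in zip(z_proj, h_proj)]
        # Assumed mixing rule: convex combination controlled by the gate.
        augmented.append([gi * hi + (1.0 - gi) * zi for gi, hi, zi in zip(g, h_t, z)])
    return augmented


# With all-zero gate weights the gate is 0.5, so the output is the midpoint.
zeros = [[0.0, 0.0], [0.0, 0.0]]
H_o = augment([[2.0, 4.0]], [0.0, 0.0], zeros, zeros)
```

The same function would be applied to the encoder states of a synthetic source sentence together with its own semantic vector, since the encoder and decoder are shared.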

Controllable Sampling
For each target sentence $y$, we need a set of reasonable source sentences $\mathcal{X}(y)$ to play the role of the inherent uncertainty. Unfortunately, it is extremely costly to annotate multiple source sentences manually for tens of millions of target sentences. To this end, we automatically construct $\mathcal{X}(y)$ using a well-trained target-to-source model $\overleftarrow{\theta}$ by sampling from the predicted word distributions:

$$x_t \sim P(\cdot \mid x_{<t}, y; \overleftarrow{\theta}). \qquad (11)$$

However, it is problematic to force the generation of a certain number of source sentences indiscriminately for each target sentence using beam search or multinomial sampling, because the sentences they synthesize are either less diverse or of lower quality. Therefore, we propose a controllable sampling strategy to generate reasonable source sentences: at each decoding step, if the word distribution is sharp we take the word with the maximum probability; otherwise the sampling method formulated in Eq. (11) is applied. Formally,

$$x_t = \begin{cases} \arg\max_{x_j} P(x_j \mid x_{<t}, y; \overleftarrow{\theta}), & \text{if } \varepsilon \le \tau, \\ \text{sampled as in Eq. (11)}, & \text{otherwise,} \end{cases}$$

where $\varepsilon$ is exactly the information entropy with respect to $P(\cdot \mid x_{<t}, y; \overleftarrow{\theta})$:

$$\varepsilon = -\sum_{j} P(x_j \mid x_{<t}, y; \overleftarrow{\theta}) \log P(x_j \mid x_{<t}, y; \overleftarrow{\theta}),$$

where $P(x_j \mid x_{<t}, y; \overleftarrow{\theta})$ denotes the conditional probability of the $j$-th word in the vocabulary appearing after the sequence $x_1, x_2, ..., x_{t-1}$. The widely used multinomial sampling and greedy search strategies can be regarded as special cases of controllable sampling. $\tau$ is a hyperparameter that indicates the sharpness threshold of the predicted word distributions and relates our method to the special cases as shown in Table 1. In practice, we repeat the above process $N$ times to generate multiple source sentences to form $\mathcal{X}(y)$.
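A single decoding step of the controllable sampling strategy can be sketched as below. This is a minimal illustration under assumptions: the entropy uses the natural logarithm, a distribution counts as sharp when its entropy is at most the threshold, and the function names and the `tau` argument are hypothetical. The reverse model itself is not shown; `probs` stands in for its predicted word distribution.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def controllable_step(probs, tau, rng=random):
    """Pick one word index from `probs` given the sharpness threshold `tau`.

    If the entropy is at most `tau`, the distribution is treated as sharp and
    the most probable index is returned; otherwise an index is drawn by
    multinomial sampling.
    """
    if entropy(probs) <= tau:
        return max(range(len(probs)), key=probs.__getitem__)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


sharp = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
greedy_pick = controllable_step(sharp, tau=2.5)
```

Setting `tau` to infinity always takes the most probable word (greedy search), while setting it to zero samples whenever the distribution has any spread (multinomial sampling), which matches the two special cases named in the text.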

Training
Our framework initializes the model with the parameters trained by standard maximum likelihood estimation (MLE) (Eq. (1)). As shown in Eq. (2), the training objective of our approach is differentiable and can be optimized using standard mini-batch stochastic gradient ascent techniques. To avoid KL collapse (Bowman et al., 2016; Zhao et al., 2017), we use a simple scheduling strategy that sets γ = 0 at the beginning of training and gradually increases γ until γ = 1 is reached.
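The scheduling strategy for γ can be sketched as a function of the training step. The text only states that γ starts at 0 and gradually increases to 1; the linear shape and the `anneal_steps` parameter below are assumptions chosen for illustration.

```python
def gamma_schedule(step, anneal_steps):
    """Linearly increase gamma from 0 to 1 over `anneal_steps` updates.

    `anneal_steps` is a hypothetical hyperparameter; after that many updates
    gamma stays at 1.
    """
    if anneal_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / anneal_steps))


# Gamma at the start, halfway through, and after the annealing period.
gammas = [gamma_schedule(s, 1000) for s in (0, 500, 2000)]
```

Starting from γ = 0 lets the model first rely on the translation terms before the semantic agreement term is given its full weight.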

Experiments
We examine our method on the advanced TRANSFORMER (Vaswani et al., 2017) and conduct experiments on four widely used translation tasks, including WMT14 English→French (En→Fr), WMT16 English→German (En→De), NIST Chinese→English (Zh→En) and WMT18 Chinese→English.

Experimental Setting
Dataset. For En→De, we used the WMT16 corpus containing 4.5M sentence pairs with 118M English words and 111M German words. The validation set is the concatenation of newstest2012 and newstest2013, and the results are reported on newstest2014 (test14), newstest2015 (test15) as well as newstest2016 (test16). For En→Fr, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences. The validation set is the concatenation of newstest2012 and newstest2013, and the results are reported on newstest2014 (test14). We used the Stanford segmenter (Tseng et al., 2005) for Chinese word segmentation and applied the script tokenizer.pl of Moses (Koehn et al., 2007) for English, French and German tokenization. For En→De and En→Fr, all data had been jointly byte pair encoded (BPE) (Sennrich et al., 2016b) with 32K merge operations, which results in a shared source-target vocabulary. For NIST Zh→En, we created shared BPE codes with 60K operations that induce two vocabularies with 47K Chinese sub-words and 30K English sub-words. For WMT18 Zh→En, we used byte pair encoding to preprocess the source and target sentences, forming source- and target-side dictionaries with 32K types each.
Model. We adopt the transformer base setting for Zh→En translations, while both base and big settings are adopted for En→De and En→Fr translations. For the SCN, the filter windows are set to 2, 3, 4, 5 with 128 feature maps each. We set τ = 2.5 and N = 3 to balance translation performance and computational complexity. During training, we set λ1 = λ2 = 0.5, and roughly 4,096 source and target tokens are paired in one mini-batch. We employ the Adam optimizer with β1 = 0.9, β2 = 0.998, and ε = 10^−9. Additionally, the same warmup and decay strategy for the learning rate as Vaswani et al. (2017) is used, with 8,000 warmup steps. For evaluation, we use beam search with a beam size of 4/5 and a length penalty of 0.6/1.0 for the En→{De,Fr}/Zh→En tasks respectively. We measure case-sensitive/insensitive tokenized BLEU by multi-bleu.pl/mteval-v11b.pl for En→De/NIST Zh→En, while case-sensitive detokenized BLEU is reported by the official evaluation script mteval-v13a.pl for WMT18 Zh→En. Unless noted otherwise, we run each experiment on up to four Tesla M40 GPUs and accumulate the gradients for 4 updates. For En→De/NIST Zh→En, each model was run 4 times and we report the average BLEU, while each model was trained only once on the larger WMT18 Zh→En dataset. For a strictly consistent comparison, we involve two strong baselines:
• TRANSFORMER, which is trained on the real parallel data only.
• TRANSFORMER_syn, which is trained on the same data as ours, consisting of the real parallel data and the back-translated corpora. The latter contains N semantically-equivalent source sentences for each target sentence. These synthetic corpora are generated by a well-trained reverse NMT model using the proposed controllable sampling (see Section 3.2).

... +2.36 BLEU points on average. In addition, our best model also achieves superior results across test sets compared with existing systems. For a more challenging task, we also report the results on the WMT18 Zh→En task in Table 2. Compared with strong baseline systems, we observe that our method consistently improves translation performance on both newstest2017 and newstest2018. These results indicate that the effectiveness of our approach holds across dataset sizes. Table 3 shows the results on WMT16 En→De and WMT14 En→Fr translations. For En→De, when integrating semantic augmentation into NMT models, significant improvements over the two baselines (up to +0.91 and +0.62 BLEU points on average, respectively) can be observed. We also compare against existing NMT systems that use almost the same English-German corpus. Our best system outperforms the standard Transformer (Vaswani et al., 2017) by +1.27 BLEU on newstest2014. It is worth mentioning that our method outperforms the advanced robust NMT systems (Cheng et al., 2018), which aim to construct anti-noise NMT models, with at least +0.23 BLEU and up to +0.48 BLEU improvements. On En→Fr, our method outperforms both the previous models and the in-house baselines. To further verify our approach, we study it with big models and compare it with two related methods. We can observe that the proposed approach achieves the best results among all methods for the same number of hidden units.

Analysis
Effect of N. To determine the number of synthetic source sentences N in our system beforehand, we conduct experiments on the Zh→En and En→De translation tasks to test how it affects translation performance. We vary N from 1 to 9 with a step size of 2 and report the results on the validation sets (Table 4). Translation performance improves substantially as N increases from 1 to 3, but with N larger than 3 we get little further improvement. To make a trade-off between translation performance and computational complexity, we set N to 3 in our experiments.
Effect of τ. The introduction of the hyperparameter τ aims at acquiring a proper quantity of synthetic data. To investigate its effect, we quantify (1) the diversity, using the edit distance among the synthetic source sentences, and (2) the quality, using BLEU scores of the synthetic source sentences, with respect to various values of τ. For each target sentence in the validation sets, we generate N = 3 synthetic source sentences using controllable sampling. Table 5 shows the results. The BLEU scores were computed by regarding the multiple synthetic sentences as a document. As in Imamura et al. (2018), the edit distances are computed for two cases: (1) SYN vs. REAL, the average distance between a synthetic source sentence (SYN) and the real source sentence (REAL); (2) SYN vs. SYN, the average distance among the synthetic source sentences of a target sentence ($C_3^2 = 3$ combinations per target sentence). We can find that ...

... 0.42 BLEU points when removing $\ell_{mle}(x, y)$, while that increases to 0.73 BLEU points when $\ell_{sem}$ is excluded. In addition, only adding $\ell_{sem}$ is able to achieve an improvement of +1.09 BLEU points.

Effect of Controllable Sampling
The widely used multinomial sampling and beam (greedy) search can be viewed as two special cases of the newly introduced controllable sampling. As in Table 7, our controllable sampling method achieves the best result among them on the validation set. We think that reasonable uncertainties can be mined via our controllable sampling strategy.
Visualization of Latent Space. We would like to verify whether our approach can capture semantics. Fortunately, there are such cases in the training set: a target sentence appears several times with different source sentences. We take some of them as examples, each of which has at least 17 unique source sentences for its target sentence. We visualize the semantic representations captured by the SCN for these examples in Figure 2. We observe that the representations are clearly clustered into 6 groups as expected, albeit with some noise, which reveals the strong capability of our approach to capture semantic representations.
Case Study. Compared to TRANSFORMER_syn, our approach motivates the models to be consistent when processing equivalent source inputs with various literal expressions.

Semi-supervised Setting
Given the effectiveness of leveraging monolingual data in improving translation quality (Sennrich et al., 2016a), we further propose to improve our model using target monolingual data on WMT16 En→De translation. Specifically, we augment the original parallel data of the WMT16 corpus, containing 4.5M sentence pairs, with 24M unique sentences randomly extracted from German monolingual newscrawl data. All of them are no longer than 100 words after tokenization and BPE processing. We synthesize multiple source sentences for each monolingual sentence via controllable sampling (Section 3.2), and the one with the highest probability serves as the real source sentence (i.e., x). We upsample the parallel data with a rate of 5 so that we observe every bitext sentence 5 times more often than each monolingual sentence. The resulting data is finally used to re-train our models for 300K updates on 8 P100 GPUs. Due to resource constraints, we adopt the smaller transformer base setting here.

Conclusion and Future Work
We present an uncertainty-aware semantic augmentation method to bridge the discrepancy of the data distribution between the training and the inference phases for dominant NMT models. In particular, we first synthesize a proper number of source sentences to play the role of intrinsic uncertainties via controllable sampling for each target sentence. Then, we develop a semantic constrained network to summarize multiple source inputs into a closed semantic region, which is then utilized to augment latent representations. Experiments on WMT14 English→French, WMT16 English→German, NIST Chinese→English and WMT18 Chinese→English translation tasks show that the proposed method achieves consistent improvements across different language pairs. While we showed that uncertainty-aware semantic augmentation with Gaussian priors is effective, more work is required to investigate whether such an approach will also be successful with more sophisticated priors. In addition, learning universal representations among semantically-equivalent source and target sentences (Wei et al., 2020) can complement the proposed method.