Improving Back-Translation with Uncertainty-based Confidence Estimation

While back-translation is simple and effective in exploiting abundant monolingual corpora to improve low-resource neural machine translation (NMT), the synthetic bilingual corpora generated by NMT models trained on limited authentic bilingual data are inevitably noisy. In this work, we propose to quantify the confidence of NMT model predictions based on model uncertainty. With word- and sentence-level confidence measures based on uncertainty, it is possible for back-translation to better cope with noise in synthetic bilingual corpora. Experiments on Chinese-English and English-German translation tasks show that uncertainty-based confidence estimation significantly improves the performance of back-translation.


Introduction
The past several years have witnessed the rapid development of end-to-end neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), which leverages neural networks to map between natural languages. Capable of learning representations from data, NMT has significantly outperformed conventional statistical machine translation (SMT) (Koehn et al., 2003) and has been widely deployed in large-scale industrial MT systems (Wu et al., 2016; Hassan et al., 2018).
Despite this remarkable success, NMT suffers from the data scarcity problem. For most language pairs, large-scale, high-quality, and wide-coverage bilingual corpora do not exist. Even for the top handful of resource-rich languages, the major sources of available parallel corpora are often restricted to government documents or news articles.
* Yang Liu is the corresponding author: liuyang2011@tsinghua.edu.cn.
¹ The source code is available at https://github.com/THUNLP-MT/UCE4BT
Therefore, improving NMT under small-data training conditions has attracted extensive attention in recent years (Sennrich et al., 2016a; Cheng et al., 2016; Zoph et al., 2016; Chen et al., 2017; Fadaee et al., 2017; Ren et al., 2018; Lample et al., 2018). Among these approaches, back-translation (Sennrich et al., 2016a) is an important direction. Its basic idea is to use an NMT model trained on limited authentic bilingual corpora to generate synthetic bilingual corpora from abundant monolingual data. The authentic and synthetic bilingual corpora are then combined to re-train NMT models. Due to its simplicity and effectiveness, back-translation has been widely used in low-resource language translation. However, as the synthetic corpora generated by the NMT model are inevitably noisy, translation errors can propagate to subsequent steps and hinder performance (Fadaee and Monz, 2018; Poncelas et al., 2018).
In this work, we propose a method to quantify the confidence of NMT model predictions to enable back-translation to better cope with translation errors. The central idea is to use model uncertainty (Buntine and Weigend, 1991; Gal and Ghahramani, 2016; Dong et al., 2018; Xiao and Wang, 2019) to measure whether the model parameters can best describe the data distribution. Based on the expectation and variance of word- and sentence-level translation probabilities calculated by Monte Carlo Dropout (Gal and Ghahramani, 2016), we introduce various confidence measures.
Different from most previous quality estimation studies that require feature extraction (Blatz et al., 2004; Specia et al., 2009; Salehi et al., 2014) or post-edited data (Kim et al., 2017; Wang et al., 2018; Ive et al., 2018) to train external confidence estimators, all our approach needs is the NMT model itself. Hence, it is easy to apply our approach to arbitrary NMT models trained for arbitrary language pairs. Experiments on Chinese-English and English-German translation tasks show that our approach significantly improves the performance of back-translation.

Background
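Let $x = x_1 \dots x_I$ be a source-language sentence and $y = y_1 \dots y_J$ be a target-language sentence. We use $P(y|x, \theta_{x \to y})$ to denote a source-to-target NMT model (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) parameterized by $\theta_{x \to y}$. Similarly, the target-to-source NMT model is denoted by $P(x|y, \theta_{y \to x})$.
Let $D_b = \{\langle x^{(m)}, y^{(m)} \rangle\}_{m=1}^{M}$ be an authentic bilingual corpus that contains $M$ sentence pairs and $D_m = \{y^{(n)}\}_{n=1}^{N}$ be a monolingual corpus that contains $N$ target sentences. The first step of back-translation (Sennrich et al., 2016a) is to train a target-to-source model on the authentic bilingual corpus $D_b$ using maximum likelihood estimation:
$$\hat{\theta}_{y \to x} = \operatorname*{argmax}_{\theta_{y \to x}} L(\theta_{y \to x}), \tag{1}$$
where the log-likelihood is defined as
$$L(\theta_{y \to x}) = \sum_{m=1}^{M} \log P(x^{(m)} | y^{(m)}, \theta_{y \to x}). \tag{2}$$
The second step is to use the trained model $\hat{\theta}_{y \to x}$ to translate the monolingual corpus $D_m$:
$$\hat{x}^{(n)} = \operatorname*{argmax}_{x} P(x | y^{(n)}, \hat{\theta}_{y \to x}), \quad n = 1, \dots, N, \tag{3}$$
where $\hat{x}^{(n)} = \hat{x}^{(n)}_1 \dots \hat{x}^{(n)}_I$. The word-level decision rule is given by
$$\hat{x}_i = \operatorname*{argmax}_{x} P(x | y, \hat{x}_{<i}, \hat{\theta}_{y \to x}). \tag{4}$$
The resulting translations $\{\hat{x}^{(n)}\}_{n=1}^{N}$ can be combined with $D_m$ to generate a synthetic bilingual corpus $\hat{D}_b = \{\langle \hat{x}^{(n)}, y^{(n)} \rangle\}_{n=1}^{N}$. The third step is to train a source-to-target model $P(y|x, \theta_{x \to y})$ on the combination of authentic and synthetic bilingual corpora:
$$\hat{\theta}_{x \to y} = \operatorname*{argmax}_{\theta_{x \to y}} \Big\{ \sum_{\langle x, y \rangle \in D_b} \log P(y | x, \theta_{x \to y}) + \sum_{\langle \hat{x}, y \rangle \in \hat{D}_b} \log P(y | \hat{x}, \theta_{x \to y}) \Big\}. \tag{5}$$
This three-step process can iterate until convergence (Hoang et al., 2018; Cotterell and Kreutzer, 2018).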
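The three-step procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: `train` is a hypothetical helper that takes (input, output) sentence pairs and returns a translation function, and `toy_train` is a dictionary-lookup stand-in for an NMT model that exists only to make the sketch runnable.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (source sentence x, target sentence y)
Translator = Callable[[str], str]


def back_translate(
    authentic: List[Pair],
    monolingual_targets: List[str],
    train: Callable[[List[Pair]], Translator],
) -> Translator:
    """Run one round of back-translation.

    `train` is a hypothetical helper: it receives (input, output) pairs and
    returns a function that maps an input sentence to an output sentence.
    """
    # Step 1: train the target-to-source model on authentic data (y -> x).
    target_to_source = train([(y, x) for x, y in authentic])

    # Step 2: translate every monolingual target sentence into a synthetic source.
    synthetic = [(target_to_source(y), y) for y in monolingual_targets]

    # Step 3: train the source-to-target model on authentic + synthetic pairs.
    return train(authentic + synthetic)


def toy_train(pairs: List[Pair]) -> Translator:
    """Toy stand-in for NMT training: memorise the pairs in a lookup table."""
    table = dict(pairs)
    return lambda sentence: table.get(sentence, "<unk>")


model = back_translate([("hello", "ni hao")], ["ni hao", "zai jian"], toy_train)
```

In this toy run the reverse model cannot translate "zai jian", so the synthetic pair ("<unk>", "zai jian") enters the training data of the final model. This is a small-scale picture of how noisy synthetic sources propagate into the third step.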
A problem with back-translation is that model predictions are inevitably erroneous. Translation errors can be propagated to subsequent steps and impair the performance of back-translation, especially when $\hat{D}_b$ is much larger than $D_b$ (Pinnis et al., 2017; Fadaee and Monz, 2018; Poncelas et al., 2018). Therefore, it is crucial to develop principled solutions to enable back-translation to better deal with the error propagation problem.

Approach
This work aims to find solutions to the following two problems:
1. How to quantify the confidence of model predictions at both word and sentence levels?
2. How to leverage confidence to improve back-translation?
Section 3.1 introduces how to calculate model uncertainty, which lays a foundation for designing uncertainty-based word- and sentence-level confidence measures in Section 3.2. Section 3.3 describes confidence-aware training for NMT models on noisy bilingual corpora.
Figure 2: Illustration of uncertainty calculation. Given a target sentence $y$ and the model prediction $\hat{x}$, our approach treats word- and sentence-level translation probabilities as random variables and uses Monte Carlo Dropout to draw samples. These samples are used to calculate the expectations and variances of translation probabilities.

Calculating Uncertainty
Uncertainty quantification, which quantifies how confident a certain mapping is with respect to different inputs, has made significant progress due to the recent advances in Bayesian deep learning (Kendall et al., 2015;Gal and Ghahramani, 2016;Kendall and Gal, 2017;Xiao and Wang, 2019;Oh et al., 2019;Geifman et al., 2019;Lee et al., 2019).
In this work, we aim to calculate model uncertainty (Kendall and Gal, 2017;Dong et al., 2018;Xiao and Wang, 2019), which measures whether a model can best describe the data distribution, for NMT using approximate inference methods widely used in Bayesian neural networks.
Given the authentic bilingual corpus $D_b$, Bayesian neural networks aim at finding the posterior distribution over model parameters $P(\theta_{y \to x} | D_b)$. With a target sentence $y$ in the monolingual corpus $D_m$ and its translation $\hat{x}$, the translation probability is given by
$$P(\hat{x} | y, D_b) = \mathbb{E}_{\theta_{y \to x} \sim P(\theta_{y \to x} | D_b)} \big[ P(\hat{x} | y, \theta_{y \to x}) \big]. \tag{6}$$
In particular, we are interested in calculating the variance of the distribution $P(\hat{x} | y, \theta_{y \to x})$ that reflects our ignorance over model parameters, which is referred to as model uncertainty. As exact inference is intractable, a number of variational inference methods (Graves, 2011; Blundell et al., 2015; Gal and Ghahramani, 2016) have been proposed to find an approximation to $P(\theta_{y \to x} | D_b)$.
In this work, we leverage the widely used Monte Carlo Dropout (Gal and Ghahramani, 2016) to obtain samples of word- and sentence-level translation probabilities.
Figure 2 illustrates the key idea of our approach. Given an authentic target sentence $y$, an NMT model makes its prediction $\hat{x}$ via a standard decoding process (see Eq. (3) and Eq. (4)). To quantify how confident the model was when making the prediction, our approach treats word- and sentence-level translation probabilities as random variables.² Samples are drawn by randomly deactivating a subset of the neurons of the NMT model and re-calculating translation probabilities while keeping $y$ and $\hat{x}$ fixed. This stochastic feedforward pass is repeated $K$ times, generating $K$ samples of both word- and sentence-level translation probabilities. We use $\hat{\theta}^{(k)}_{y \to x}$ to denote the model parameters derived from $\hat{\theta}_{y \to x}$ by deactivation in the $k$-th pass.
Intuitively, if the variance of the translation probability is low, it is highly likely that the model was confident in making the prediction. Given $K$ samples $\{P(\hat{x} | y, \hat{\theta}^{(k)}_{y \to x})\}_{k=1}^{K}$, the expectation of the sentence-level translation probability can be approximated by
$$\mathbb{E}\big[P(\hat{x} | y, \hat{\theta}_{y \to x})\big] \approx \frac{1}{K} \sum_{k=1}^{K} P(\hat{x} | y, \hat{\theta}^{(k)}_{y \to x}). \tag{7}$$
The variance of the sentence-level translation probability can be approximated by
$$\mathrm{Var}\big[P(\hat{x} | y, \hat{\theta}_{y \to x})\big] \approx \frac{1}{K} \sum_{k=1}^{K} P(\hat{x} | y, \hat{\theta}^{(k)}_{y \to x})^2 - \Big( \frac{1}{K} \sum_{k=1}^{K} P(\hat{x} | y, \hat{\theta}^{(k)}_{y \to x}) \Big)^2, \tag{8}$$
which is also referred to as model uncertainty.
The expectation and variance of word-level translation probabilities can be calculated similarly using $K$ samples.
² Unlike prior studies that calculate model uncertainty during inference (Xiao and Wang, 2019), our approach computes uncertainty after the NMT model has made the prediction, for two reasons. First, our goal is to quantify the confidence of the model prediction rather than to use uncertainty to improve the prediction. Second, using Monte Carlo Dropout during decoding is very slow because of the autoregressive property of standard NMT models.
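The sampling procedure of this subsection can be sketched as follows. The sketch assumes a hypothetical callable `prob_fn` that performs one stochastic forward pass with dropout active and returns the probability the model assigns to the fixed prediction given the fixed target sentence; the function names and the seeding scheme are illustrative, and K = 20 matches the value used in the experiments.

```python
import random
from typing import Callable, List, Tuple


def mc_dropout_samples(
    prob_fn: Callable[[random.Random], float], k: int = 20, seed: int = 0
) -> List[float]:
    """Collect K translation-probability samples from K stochastic passes.

    `prob_fn` is a hypothetical stand-in for one forward pass with dropout
    switched on; the target sentence and the prediction stay fixed inside it.
    """
    rng = random.Random(seed)
    return [prob_fn(rng) for _ in range(k)]


def mc_moments(samples: List[float]) -> Tuple[float, float]:
    """Monte Carlo estimates of the expectation and variance of the samples."""
    k = len(samples)
    mean = sum(samples) / k
    # Var = E[p^2] - (E[p])^2, clamped at zero to absorb rounding error.
    var = max(sum(p * p for p in samples) / k - mean * mean, 0.0)
    return mean, var


# Toy forward pass: a probability near 0.5 perturbed by dropout noise.
samples = mc_dropout_samples(lambda rng: 0.5 + 0.1 * (rng.random() - 0.5))
mean, var = mc_moments(samples)
```

The same two functions apply unchanged at the word level: one list of K samples per predicted word instead of one list per sentence.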

Confidence Measures
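We use $C(y, \hat{x}_{<i}, \hat{x}_i, \hat{\theta}_{y \to x})$ to denote the word-level confidence for the model to generate $\hat{x}_i$ and $C(y, \hat{x}, \hat{\theta}_{y \to x})$ to denote the sentence-level confidence for the model to generate $\hat{x}$.
Intuitively, the more confident an NMT model is when making a prediction, the higher the expectation and the lower the variance of the translation probability. For comparison, we used the following four types of confidence measures at the sentence level in our experiments:
$$C_{\mathrm{PTP}}(y, \hat{x}, \hat{\theta}_{y \to x}) = P(\hat{x} | y, \hat{\theta}_{y \to x}), \tag{9}$$
$$C_{\mathrm{EXP}}(y, \hat{x}, \hat{\theta}_{y \to x}) = \mathbb{E}\big[P(\hat{x} | y, \hat{\theta}_{y \to x})\big], \tag{10}$$
$$C_{\mathrm{VAR}}(y, \hat{x}, \hat{\theta}_{y \to x}) = \Big(1 - \mathrm{Var}\big[P(\hat{x} | y, \hat{\theta}_{y \to x})\big]\Big)^{\alpha}, \tag{11}$$
$$C_{\mathrm{CEV}}(y, \hat{x}, \hat{\theta}_{y \to x}) = \Big(1 - \frac{\mathrm{Var}\big[P(\hat{x} | y, \hat{\theta}_{y \to x})\big]}{\mathbb{E}\big[P(\hat{x} | y, \hat{\theta}_{y \to x})\big]}\Big)^{\beta}, \tag{12}$$
where $\alpha$ and $\beta$ are hyper-parameters that control the gap between the confidence values of sentences of different quality. Larger values of $\alpha$ and $\beta$ lead to bigger gaps.³ In Eq. (12), our approach combines the merits of expectation and variance by dividing the variance by the expectation, because a smaller variance and a bigger expectation are expected to result in higher confidence. There may exist more sophisticated ways to estimate prediction confidence using model uncertainty (Dong et al., 2018). As the measures above are easy to implement and prove effective in our experiments, we leave the investigation of more complex confidence measures for future work.
The word-level confidence measures can be defined similarly.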
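The four measures can be written as small functions of the estimated expectation and variance. This sketch assumes that the variance-based measure has the form (1 - Var)^α and the combined measure has the form (1 - Var/E)^β, which is consistent with the description above (variance divided by expectation, values between 0 and 1, larger exponents widening the gap). The function names are illustrative, and the default α = β = 2 is the setting the experiments report as working best.

```python
def clamp01(value: float) -> float:
    """Keep a value inside [0, 1] to absorb estimation noise."""
    return min(1.0, max(0.0, value))


def confidence_ptp(prob: float) -> float:
    """PTP: the translation probability output by the model itself."""
    return clamp01(prob)


def confidence_exp(mean: float) -> float:
    """EXP: the expectation of the translation probability."""
    return clamp01(mean)


def confidence_var(var: float, alpha: float = 2.0) -> float:
    """VAR (assumed form): (1 - variance) ** alpha."""
    return clamp01(1.0 - var) ** alpha


def confidence_cev(mean: float, var: float, beta: float = 2.0) -> float:
    """CEV (assumed form): (1 - variance / expectation) ** beta."""
    if mean <= 0.0:
        return 0.0  # no probability mass on the prediction: no confidence
    return clamp01(1.0 - var / mean) ** beta
```

With an expectation of 0.3, lowering the variance from 0.05 to 0.01 raises the combined measure from about 0.69 to about 0.93, so predictions whose probability is stable across dropout passes receive the larger weight.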

Confidence-aware Training for NMT
We propose confidence-aware training for NMT to enable NMT to make better use of noisy data. Word- and sentence-level confidence measures are complementary: while word-level confidence can provide more fine-grained information than its sentence-level counterpart, it is unable to cope with word omission errors, which can only be captured at the sentence level. As a result, our approach incorporates both word- and sentence-level confidence measures into the training process.⁴

Using Sentence-level Confidence
It is easy to integrate sentence-level confidence into back-translation by modifying the likelihood function in Eq. (5):
$$\hat{\theta}_{x \to y} = \operatorname*{argmax}_{\theta_{x \to y}} \Big\{ \sum_{\langle x, y \rangle \in D_b} \log P(y | x, \theta_{x \to y}) + \sum_{\langle \hat{x}, y \rangle \in \hat{D}_b} C(y, \hat{x}, \hat{\theta}_{y \to x}) \log P(y | \hat{x}, \theta_{x \to y}) \Big\}. \tag{13}$$
Serving as a weight assigned to each synthetic sentence pair, sentence-level confidence is expected to reduce the negative effect of estimating parameters on sentences with lower confidence. Note that the confidence of an authentic sentence pair in $D_b$ is 1.
³ Note that all confidence measures are between 0 and 1. Clearly, both the expectation and variance of a probability are between 0 and 1. It can be proved that the variance of a probability is no greater than the corresponding expectation. As a result, $C_{\mathrm{CEV}}(\cdot)$ is also between 0 and 1.
⁴ Instead of applying confidence estimation to the second pass of decoding (Luong et al., 2017), we directly integrate confidence scores into the training process. These two kinds of methods are complementary.
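The sentence-level weighting can be illustrated with a minimal sketch. It assumes, as the text describes, that each synthetic pair's log-likelihood is multiplied by its sentence-level confidence while authentic pairs keep weight 1. The function name and the use of precomputed log-probabilities are illustrative simplifications; in actual training the same weights would scale per-sentence losses inside the optimizer loop.

```python
from typing import List, Tuple


def confidence_weighted_log_likelihood(
    authentic_logps: List[float],
    synthetic: List[Tuple[float, float]],
) -> float:
    """Training objective with sentence-level confidence weights.

    authentic_logps: log P(y | x) for each authentic pair (weight 1).
    synthetic: (confidence, log P(y | x_hat)) for each synthetic pair.
    """
    authentic_term = sum(authentic_logps)
    synthetic_term = sum(conf * logp for conf, logp in synthetic)
    return authentic_term + synthetic_term


objective = confidence_weighted_log_likelihood(
    [-1.0, -2.0], [(0.5, -4.0), (1.0, -1.0)]
)
```

A synthetic pair with confidence 0.5 contributes half of its log-likelihood, so its gradient is halved relative to an authentic pair with the same loss.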

Using Word-level Confidence
As the source side rather than the target side of the synthetic bilingual corpus is noisy, word-level confidence cannot be integrated into back-translation in the same way as sentence-level confidence. This is because the word-level confidence associated with each source word does not get involved in backpropagation during training.
Instead, we build a real-valued word-level confidence vector:
$$c = \big[ C(y, \hat{x}_{<1}, \hat{x}_1, \hat{\theta}_{y \to x}), \dots, C(y, \hat{x}_{<I}, \hat{x}_I, \hat{\theta}_{y \to x}) \big]. \tag{14}$$
Due to the wide use of attention (Bahdanau et al., 2015; Vaswani et al., 2017) in NMT, we use the confidence vector $c \in \mathbb{R}^{1 \times I}$ to modify attention weights and enable the model to focus more on words with high confidence. Figure 3 shows an example. Figure 3(a) gives a source sentence in the synthetic bilingual corpus, in which the erroneous words "hold", "talks", and "and" receive high attention weights, deteriorating the parameter estimation on this sentence pair. By multiplying with word-level confidence (Figure 3(b)), the weights are modified to pay less attention to erroneous words (Figure 3(c)).
More formally, the modified attention function is given by
$$\mathrm{Attention}(Q, K, V, c) = \Big( \mathrm{softmax}\Big( \frac{Q K^{\top}}{\sqrt{D}} \Big) \odot c \Big) V, \tag{15}$$
where $Q \in \mathbb{R}^{I \times D}$, $K \in \mathbb{R}^{I \times D}$, and $V \in \mathbb{R}^{I \times D}$ are query, key, and value matrices and $D$ is the hidden size. $\odot$ is a broadcast product.
Since the integration of sentence-level confidence and the integration of word-level confidence are independent of each other, it is easy to use both of them in back-translation.
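A dependency-free sketch of the confidence-modified attention is given below. It assumes that the softmax attention weights are multiplied element-wise by the confidence vector, broadcast over query positions, and that the result is not renormalised; whether the original implementation renormalises is not stated here. The function names and the plain-list matrix representation are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_attention(q: Matrix, k: Matrix, v: Matrix, c: List[float]) -> Matrix:
    """Scaled dot-product attention whose weights are scaled by confidence.

    c[i] is the word-level confidence of source position i; it multiplies the
    attention weight that every query assigns to that position.
    """
    d = len(q[0])
    outputs = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        weights = softmax(scores)
        # Broadcast product: the same confidence vector is applied to each query row.
        weights = [w * conf for w, conf in zip(weights, c)]
        outputs.append(
            [sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))]
        )
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out = confidence_attention([[0.0, 0.0]], keys, values, [1.0, 0.0])
```

In the toy call both positions start with attention weight 0.5; the second position has confidence 0, so its value vector no longer contributes to the output.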

Setup
We evaluated our approach on Chinese-English and English-German translation tasks. The evaluation metric is BLEU (Papineni et al., 2001) as calculated by the multi-bleu.perl script. We use paired bootstrap resampling (Koehn, 2004) for significance testing.
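For readers unfamiliar with paired bootstrap resampling, the idea can be sketched as follows. This is a simplified illustration, not the evaluation script used in the experiments: it compares two systems through hypothetical per-sentence quality scores that are summed over each resample, whereas BLEU is a corpus-level metric that would be recomputed on every resampled test set. The function name and defaults are illustrative.

```python
import random
from typing import Sequence


def paired_bootstrap_win_rate(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A outscores system B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Sample sentence indices with replacement; using the same indices
        # for both systems is what makes the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples
```

A win rate of at least 0.95 corresponds to the p < 0.05 level, and at least 0.99 to the p < 0.01 level.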
For the Chinese-English task, the training set contains 1.25M sentence pairs from LDC with 27.8M Chinese words and 34.5M English words. To build the monolingual corpus for back-translation, we extracted the English side of the training set of the WMT 2017 Chinese-English news translation task. After removing sentences longer than 256 words, we randomly selected 10M English sentences as the monolingual corpus. NIST06 is used as the development set and the NIST02, 03, 04, 05, and 08 datasets as test sets.
For the English-German task, we used the dataset of the WMT 2014 English-German translation task. The training set consists of 4.47M sentence pairs with 116M English words and 110M German words. We randomly selected 4.5M German sentences from the 2012 News Crawl corpus of WMT 2014 to construct the monolingual corpus for back-translation. We use newstest 2013 as [...] which is an open-source software package officially recommended by the QE shared task of WMT. Following the guide of OpenKiwi, we used a German-English parallel corpus containing 2.09M sentence pairs to train the predictor and a post-edited corpus containing 25k sentence triples to train the estimator. All the data used to train QE models are provided by WMT. As there are no post-edited corpora for the Chinese-English task, NEURALQE can only be used in the English-German task in our experiments. For NEURALQE, both word- and sentence-level quality scores were considered.
We implemented our method on top of THUMT (Zhang et al., 2017). The NMT model we use is Transformer (Vaswani et al., 2017). We used the base model for the Chinese-English task and the big model for the English-German task. We used the Adam optimizer (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$ to optimize model parameters. We used the same learning rate warmup strategy as Vaswani et al. (2017) with 4,000 warmup steps. During training, the label smoothing hyper-parameter was set to $\epsilon_{ls} = 0.1$ (Szegedy et al., 2016; Pereyra et al., 2017). During training and the Monte Carlo Dropout process, the dropout rate was set to 0.1 for the Transformer base model and 0.3 for the big model. $K$ was set to 20. Through experiments, we find that our method works best when $\alpha$ and $\beta$ are set to 2. All experiments were conducted on 8 NVIDIA GTX 1080Ti GPUs.
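The warmup strategy of Vaswani et al. (2017) can be sketched as a short function. The formula is the published Transformer schedule; the function name and the default hidden size of 512 (the Transformer base configuration) are illustrative, and any extra scaling applied inside THUMT is not covered here.

```python
def transformer_learning_rate(
    step: int, d_model: int = 512, warmup_steps: int = 4000
) -> float:
    """Learning rate schedule of Vaswani et al. (2017).

    The rate grows linearly for `warmup_steps` updates and then decays with
    the inverse square root of the step number.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


peak = transformer_learning_rate(4000)
```

With 4,000 warmup steps the rate peaks at step 4,000; it is half the peak value at step 2,000 (linear warmup) and again at step 16,000 (inverse square-root decay).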

Comparison of Confidence Measures
Table 1 shows the comparison of confidence measures on the Chinese-English development set. We find that using either the translation probabilities output by the model (i.e., "PTP") or the expectation of translation probabilities (i.e., "EXP") deteriorates the translation quality, which suggests that translation probabilities themselves cannot help NMT models better cope with synthetic data. In contrast, using the variance or model uncertainty (i.e., "VAR") increases translation quality. Combining variance and expectation (i.e., "CEV") leads to a further improvement. In the following experiments, we use CEV as the default setting.

Table 3: BLEU scores on the NIST Chinese-English translation task. The ratio of authentic data to synthetic data is 1:1. NONE: only the authentic bilingual corpus is used. SEARCH: the translations of the monolingual corpus are generated by beam search (Sennrich et al., 2016a). SAMPLE: the translations of the monolingual corpus are generated by sampling (Edunov et al., 2018). "CE": confidence estimation method. "U": the proposed uncertainty-based confidence estimation. "All": the combination of all test sets. "+": significantly better than SEARCH without CE (p < 0.05). "++": significantly better than SEARCH without CE (p < 0.01). "‡‡": significantly better than SAMPLE without CE (p < 0.01).

Comparison between Word-and Sentence-level Confidence Measures
Table 2 shows the comparison between word- and sentence-level CEV (i.e., combination of expectation and variance) confidence measures on the Chinese-English development set. Using either sentence-level or word-level confidence measures improves the translation performance. Thanks to its more fine-grained quantification of uncertainty, word-level confidence achieves a higher BLEU score than sentence-level confidence. Combining sentence- and word-level confidence leads to a further improvement, suggesting that they are complementary to each other. In the following experiments, we use the combination of word- and sentence-level confidence as the default setting.

Main Results
The Chinese-English Task. Table 3 shows the results of the Chinese-English task. Back-translation, whether it generates translations using beam search (i.e., SEARCH) or sampling (i.e., SAMPLE), leads to significant improvements over using only the authentic bilingual corpus (i.e., NONE). We find that SAMPLE is more effective than SEARCH, which confirms the finding of Edunov et al. (2018). Using uncertainty-based confidence (i.e., "U") significantly improves over both SEARCH and SAMPLE on the combination of all test sets (p < 0.01). As there is no Chinese-English labeled data to train neural quality estimation models, we did not report the result of NEURALQE in this experiment.
The English-German Task. Table 4 shows the results of the English-German task. We find that using quality estimation, either NEURALQE (i.e., "N") or UNCERTAINTY (i.e., "U"), improves over SEARCH and SAMPLE. UNCERTAINTY even achieves better performance than NEURALQE, although NEURALQE uses additional labeled training data. As NEURALQE relies heavily on post-edited corpora and labeled data to train QE models, it can only be used for a handful of language pairs. In contrast, it is easier to apply our approach to arbitrary language pairs since it does not need any labeled data to estimate confidence.

Effect of Training Corpus Size
Figure 4 shows the effect of training corpus size. The X-axis is the size of the total training data (i.e., $D_b \cup \hat{D}_b$ in Eq. (5)). The BLEU scores were calculated on the Chinese-English development set. We find that the translation performance of SEARCH initially rises as the monolingual corpus grows. However, further enlarging the monolingual corpus hurts the translation performance. In contrast, our approach can still obtain further improvements when adding more synthetic bilingual sentence pairs. Similar findings are also observed for SAMPLE.

Effect of Data Selection
Instead of randomly selecting monolingual sentences to generate synthetic data, we also used the method proposed by Fadaee and Monz (2018) to select monolingual data by targeting difficult words. In this series of experiments, we used the same amount of monolingual data, derived from a larger monolingual corpus using different data selection methods.
Results on NIST06 show that targeting difficult words improves over randomly selecting monolingual data (46.23 → 46.60 BLEU), confirming the finding of Fadaee and Monz (2018). Using uncertainty-based confidence can further improve the translation performance (46.60 → 47.18 BLEU), indicating that our approach can be combined with advanced data selection methods.

Case Study
Figure 5 shows an example of a model prediction and its corresponding word- and sentence-level confidence measures for the English-German task. We observe that the PTP and EXP measures are unable to give low confidence to erroneous words. In contrast, variance-based measures such as VAR and CEV can better quantify how confident the model is in making its prediction.

Related work
Our work is closely related to three lines of research: (1) back-translation, (2) confidence estimation, and (3) uncertainty quantification.

Back-translation
Back-translation is a simple and effective approach to leveraging monolingual data for NMT (Sennrich et al., 2016a). There has been a growing body of literature that analyzes and extends back-translation recently. Currey et al. (2017) show that low-resource NMT can benefit from the synthetic data generated by simply copying target monolingual data to the source side. Imamura et al. (2018) and Edunov et al. (2018) demonstrate that it is more effective to generate source sentences via sampling rather than beam search. Cotterell and Kreutzer (2018) and Hoang et al. (2018) find that iterative back-translation can further improve the performance of NMT. Fadaee and Monz (2018) show that words with high predicted loss during training benefit most. Our work differs from existing methods in that we propose to use confidence estimation to enable back-translation to better cope with noisy synthetic data, which can be easily combined with previous works. Our experiments show that both neural and uncertainty-based confidence estimation methods benefit back-translation.

Figure 5 (example sentences). Reference: A person who is dying will accept being helped to drink brandy or Pepsi, whatever is their tipple. Prediction: The dying person is given oral care with brandy or Pepsi as desired.

Confidence Estimation
Estimating the confidence or quality of the output of MT systems (Ueffing and Ney, 2007;Specia et al., 2009;Bach et al., 2011;Salehi et al., 2014;Rikters and Fishel, 2017;Kepler et al., 2019) is important for enabling downstream applications such as post-editing and interactive MT to better cope with translation mistakes.While existing methods rely on external models to estimate confidence, our approach leverages model uncertainty to derive confidence measures.The major benefit is that our approach does not need labeled data.

Uncertainty Quantification
Reliable uncertainty quantification is key to building a robust artificial intelligence system. It has been successfully applied to many fields, including computer vision (Kendall et al., 2015; Kendall and Gal, 2017), time series prediction (Zhu and Laptev, 2017), and natural language processing (Dong et al., 2018; Xiao and Wang, 2019). Our work differs from previous work in that we are interested in calculating uncertainty after the model has made the prediction rather than during inference. Ott et al. (2018) also analyze the inherent uncertainty of machine translation. The difference is that they focus on the existence of multiple correct translations for a single sentence, while we aim to quantify the uncertainty of NMT models.

Conclusions
We have presented a method for quantifying model uncertainty for neural machine translation and used uncertainty-based confidence measures to improve back-translation. The key idea is to use Monte Carlo Dropout to sample translation probabilities to calculate model uncertainty, without the need for manually labeled data. As our approach is transparent to model architectures, we plan to further verify its effectiveness on other downstream applications of NMT, such as post-editing and interactive MT, in the future.

Figure 1: Confidence estimation for back-translation. Back-translation generates a source (e.g., English) sentence for a ground-truth target (e.g., Chinese) sentence. Such synthetic sentence pairs are used to train NMT models. As the model prediction (i.e., $\hat{x}$) is often noisy, our work aims to quantify the prediction confidence using model uncertainty to alleviate error propagation.

Figure 3: Using word-level confidence in confidence-aware training. The basic idea is to use confidence to modify attention weights so that less attention is paid to erroneous words (underlined). (a) The original attention weights of the NMT model; (b) the word-level confidence of the noisy source sentence; (c) the attention weights modified by the word-level confidence, which focus more on words with high confidence. $\odot$ is a broadcast product. See Eq. (15) for details.

Figure 4: Effect of training corpus size.

Table 2: Comparison between word- and sentence-level CEV confidence measures.