Reducing Word Omission Errors in Neural Machine Translation: A Contrastive Learning Approach

While neural machine translation (NMT) has achieved remarkable success, NMT systems are prone to word omission errors. In this work, we propose a contrastive learning approach to reducing word omission errors in NMT. The basic idea is to enable the NMT model to assign a higher probability to a ground-truth translation and a lower probability to an erroneous translation, which is automatically constructed from the ground-truth translation by omitting words. We design different types of negative examples depending on the number of omitted words, word frequency, and part of speech. Experiments on Chinese-to-English, German-to-English, and Russian-to-English translation tasks show that our approach is effective in reducing word omission errors and achieves better translation performance than three baseline methods.


Introduction
While neural machine translation (NMT) has achieved remarkable success (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), a severe challenge remains: NMT systems are prone to omitting essential words on the source side, which severely deteriorates the adequacy of machine translation. Due to the lack of interpretability of neural networks, it is hard to explain how these omission errors occur and to design methods to eliminate them.
Existing methods for reducing word omission errors in NMT have focused on modeling coverage (Mi et al., 2016; Wu et al., 2016; Tu et al., 2017). The central idea is to model the fertility (i.e., the number of corresponding target words) of a source word based on attention weights to avoid word omission. Although these methods prove to be effective in modeling coverage for NMT, they heavily rely on the attention weights provided by the RNNsearch model (Bahdanau et al., 2015). Since the attention weights between input and output are not readily available in the state-of-the-art Transformer model (Vaswani et al., 2017), it is hard to apply existing methods directly. As a result, it is important to develop model-agnostic methods for addressing the word omission problem in NMT.
* Corresponding author: Yang Liu
In this paper, we propose a simple and effective contrastive learning approach to reducing word omission errors in NMT. The basic idea is to maximize the margin between the probability of a ground-truth translation and that of an erroneous translation for a given source sentence. The erroneous translations are automatically constructed by omitting words from the ground-truth translations. We design several types of erroneous translations with respect to omission counts, word frequency, and part of speech. Our approach has the following advantages:
• Model agnostic. Our approach is applicable to all existing NMT models. Only the training objective and training data need to be changed.
• Language independent. Our approach is independent of languages and can be applied to arbitrary languages.
• Fast to train. Contrastive learning starts with a pre-trained NMT model and usually converges in only hundreds of steps.
We evaluate our approach on German-to-English, Chinese-to-English, and Russian-to-English translation tasks. Experiments show that contrastive learning can not only effectively reduce word omission errors but also achieve better translation performance than existing methods in both automatic and human evaluations.

A Contrastive Learning Approach
Let x be a source sentence and y be a target sentence. We use P(y|x; θ) to denote an NMT model parameterized by θ. Given trained parameters θ̂, the translation of a source sentence is given by
\[ \hat{y} = \operatorname*{argmax}_{y} P(y \mid x; \hat{\theta}) \tag{1} \]
During decoding, the NMT model chooses the candidate sentence with the highest probability as the output translation. When a word omission error occurs, an erroneous translation has mistakenly been assigned a higher probability than the ground-truth translation and is therefore more likely to be chosen. Therefore, to reduce word omission errors, the probability that the NMT model assigns to an erroneous translation must be lower than that of a ground-truth translation.
Our proposed contrastive learning method is shown in Algorithm 1, which consists of three steps. In the first step, the model is trained using maximum likelihood estimation (MLE) on a parallel corpus (lines 1-2). In the second step, negative examples are automatically constructed by omitting words in ground-truth translations (line 3). In the third step, the model is fine-tuned using contrastive learning with the estimates of MLE as a starting point.
More formally, given a parallel training set D = {⟨x^(s), y^(s)⟩}_{s=1}^{S}, the first step is to find a set of model parameters that maximizes the log-likelihood of the training set:
\[ \hat{\theta}_{\mathrm{MLE}} = \operatorname*{argmax}_{\theta} L(\theta) \tag{2} \]
where the log-likelihood is defined as
\[ L(\theta) = \sum_{s=1}^{S} \log P\big(y^{(s)} \mid x^{(s)}; \theta\big) \tag{3} \]
The second step is to construct negative examples based on the ground-truth parallel corpus. Given a ground-truth sentence pair ⟨x, y⟩ from the parallel training set D, an erroneous sentence pair ⟨x, ỹ⟩ can be automatically constructed by omitting words from the translation y in the ground-truth sentence pair. In this work, we distinguish between three methods for omitting words:
• Random omission. One or more words are omitted according to a random uniform distribution.
• Omission by word frequency. One or more words are omitted according to word frequencies.
• Omission by part of speech. One or more words are omitted according to parts of speech.
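The three omission strategies can be illustrated with a minimal Python sketch. The function names, the toy corpus, the hand-written part-of-speech tags, and the tie-breaking rule (the first token with the extreme frequency is dropped) are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
import random
from collections import Counter


def omit_random(tokens, k=1, rng=random):
    """Random omission: drop k tokens chosen uniformly at random."""
    k = max(0, min(k, len(tokens) - 1))  # assumption: keep at least one token
    drop = set(rng.sample(range(len(tokens)), k))
    return [t for i, t in enumerate(tokens) if i not in drop]


def omit_by_frequency(tokens, freq, highest=True):
    """Omission by frequency: drop the token with the highest/lowest count."""
    pick = max if highest else min
    idx = pick(range(len(tokens)), key=lambda i: freq[tokens[i]])
    return tokens[:idx] + tokens[idx + 1:]


def omit_by_pos(tokens, tags, target_tag, rng=random):
    """Omission by part of speech: drop one random token tagged target_tag."""
    candidates = [i for i, tag in enumerate(tags) if tag == target_tag]
    if not candidates:
        return None  # no negative example can be built for this sentence
    idx = rng.choice(candidates)
    return tokens[:idx] + tokens[idx + 1:]


# Toy target-side corpus and hypothetical POS tags for the first sentence.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "ran"]]
freq = Counter(tok for sent in corpus for tok in sent)
y = corpus[0]
tags = ["DT", "NN", "VBD", "IN", "DT", "NN"]

print(omit_random(y, k=2, rng=random.Random(0)))  # four of the six tokens
print(omit_by_frequency(y, freq, highest=True))   # drops the first "the"
print(omit_by_pos(y, tags, "IN"))                 # drops "on"
```

Each function returns a new token list, so the ground-truth translation y is left untouched and can be paired with its erroneous counterpart ỹ.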
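Contrastive learning starts with the model parameters trained by MLE. Our contrastive learning approach is equipped with a max-margin loss, which ensures that the margin between the log-likelihood of each ground-truth pair and that of its contrastive examples is at least η:
\[ \hat{\theta}_{\mathrm{CL}} = \operatorname*{argmin}_{\theta} J(\theta) \tag{4} \]
where the max-margin loss is defined as
\[ J(\theta) = \sum_{\langle x, y \rangle \in D} \; \sum_{\langle x, \tilde{y} \rangle \in \tilde{D}} \max\big\{ 0,\; \eta + \log P(\tilde{y} \mid x; \theta) - \log P(y \mid x; \theta) \big\} \tag{5} \]
Here D̃ denotes the set of erroneous sentence pairs, and the inner sum runs over the erroneous pairs constructed from ⟨x, y⟩.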
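As a minimal sketch of the max-margin loss, the snippet below assumes a hinge of the form max(0, η + log P(ỹ|x) − log P(y|x)), summed over pairs of ground-truth and erroneous translations. The function name and the log-probabilities are made up for illustration; in practice the log-probabilities would come from the NMT model being fine-tuned.

```python
def max_margin_loss(logp_pos, logp_neg, eta):
    """Sum of hinge terms max(0, eta + log P(neg) - log P(pos)).

    logp_pos[i] and logp_neg[i] are sentence-level log-probabilities of a
    ground-truth translation and of its erroneous counterpart.
    """
    return sum(max(0.0, eta + ln - lp) for lp, ln in zip(logp_pos, logp_neg))


# Toy values (assumed, not from the paper).
logp_pos = [-10.0, -12.0, -8.0]  # ground-truth translations
logp_neg = [-15.0, -11.5, -8.5]  # translations with omitted words

loss = max_margin_loss(logp_pos, logp_neg, eta=1.0)
print(loss)  # 2.0
```

The first pair already has a margin of 5 and contributes nothing; the second and third pairs fall short of the margin η = 1 and contribute 1.5 and 0.5, respectively.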

Experiments
We evaluated the proposed method on Chinese-to-English, German-to-English, and Russian-to-English translation tasks.

Setup
For the Chinese-to-English translation task, we use the WMT 2017 dataset as the training set, which is composed of the News Commentary v12, UN Parallel Corpus v1.0, and CWMT corpora. The training set contains 25M sentence pairs. The newsdev2017 and newstest2017 datasets are used as the development set and test set, respectively. For the German-to-English translation task, we use the WMT 2017 dataset as the training set, which consists of 6M preprocessed sentence pairs. The newstest2014 and newstest2017 datasets are used as the development set and test set, respectively. For the Russian-to-English translation task, we use the WMT 2017 preprocessed dataset as the training set, which consists of 25M preprocessed sentence pairs. The newstest2015 and newstest2016 datasets are used as the development set and test set, respectively.
Following Sennrich et al. (2016b), we split words into sub-word units. The number of merge operations in byte pair encoding (BPE) is set to 32K for all language pairs. After performing BPE, the training set of the Chinese-to-English task contains 550M Chinese sub-word units and 615M English sub-word units, the training set of the German-to-English task consists of 157M German sub-word units and 153M English sub-word units, and the training set of the Russian-to-English task consists of 653M Russian sub-word units and 629M English sub-word units.
We used three baselines in our experiments:
• MLE: Maximum likelihood estimation. The hyper-parameter settings are the same as in Vaswani et al. (2017);
• MLE + CP: Imposing the coverage penalty (Wu et al., 2016) constraint on the decoding process of MLE. We treat the softmax weight matrix in the uppermost "encoder-decoder attention" layer of the Transformer as the attention weight matrix to calculate the coverage penalty;
• WordDropout: Implementing the word dropout technique proposed by Sennrich et al. (2016a) during MLE training.
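For readers unfamiliar with the MLE + CP baseline, the following sketch computes a coverage penalty of the form cp = β Σ_i log(min(Σ_j p_ij, 1)) given by Wu et al. (2016), where p_ij is the attention weight from target position j to source position i. The single attention matrix (for example, one averaged over heads), the value of β, and the small epsilon guard against log(0) are assumptions for illustration; the setup above does not specify them.

```python
import math


def coverage_penalty(attn, beta=0.2, eps=1e-9):
    """Coverage penalty over an attention matrix.

    attn[j][i] is the attention weight from target position j to source
    position i. beta and eps are illustrative values, not from the paper.
    """
    n_src = len(attn[0])
    total = 0.0
    for i in range(n_src):
        mass = sum(row[i] for row in attn)  # attention received by source i
        total += math.log(max(min(mass, 1.0), eps))
    return beta * total


# Two target words attending over three source words (toy weights).
attn = [[0.9, 0.1, 0.0],
        [0.1, 0.8, 0.1]]
cp = coverage_penalty(attn)
print(cp)  # negative: the third source word receives little attention
```

A hypothesis that spreads attention over every source word receives a penalty close to zero, while one that ignores a source word is penalized, which is how the constraint discourages omissions during decoding.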
For our contrastive learning method, we compare different settings of the erroneous training set D̃:
• CL one/two/three: D̃ is constructed by omitting one/two/three words randomly from each ground-truth translation in D;
• CL low/high: D̃ is constructed by omitting the word with the lowest/highest frequency from each ground-truth translation in D;
• CL V/IN: D̃ is constructed by omitting one verb or preposition randomly from each ground-truth translation in D. The part-of-speech information is given by the Stanford Parser (Manning et al., 2014).
Figure 1: Visualization of margin differences between CL one and MLE on 500 sampled sentence pairs. Red points denote sentence pairs on which CL one achieves a larger margin than MLE; blue points denote sentence pairs on which MLE achieves a larger margin.

Comparison of Margins
To find out whether CL increases the margin compared with MLE, we calculate the following margin difference for a ground-truth sentence pair ⟨x, y⟩ and an erroneous sentence pair ⟨x, ỹ⟩:
\[ \Delta M = \log P(y \mid x; \hat{\theta}_{\mathrm{CL}}) - \log P(\tilde{y} \mid x; \hat{\theta}_{\mathrm{CL}}) - \log P(y \mid x; \hat{\theta}_{\mathrm{MLE}}) + \log P(\tilde{y} \mid x; \hat{\theta}_{\mathrm{MLE}}) \tag{6} \]
Figure 1 shows the margin difference between CL one and MLE on 500 sampled sentence pairs from the training set for the Chinese-to-English task. "Sentence length" denotes the sum of the lengths of the source and target sentences (i.e., |x| + |y|). Red points denote sentence pairs on which CL one has a larger margin than MLE (i.e., ∆M > 0), while the blue ones denote the ∆M < 0 case. We find that CL one has a larger margin than MLE on 95% of the 500 sampled sentence pairs, with an average margin difference of 1.4.
Table 1 shows the results of automatic evaluation on Chinese-to-English, German-to-English, and Russian-to-English translation tasks. The evaluation metric is case-insensitive BLEU score (Papineni et al., 2002). Contrastive learning starts with the model parameters trained by MLE and converges in only 150 steps. For fair comparison, all the models of MLE, MLE + CP, and MLE + data are trained for another 150 steps as well, but this yields no further improvement.
Table 1 (legend): "+": significantly better than MLE (p < 0.05). "++": significantly better than MLE (p < 0.01). "*": significantly better than MLE + CP (p < 0.05). "**": significantly better than MLE + CP (p < 0.01). "†": significantly better than WordDropout (p < 0.05). "††": significantly better than WordDropout (p < 0.01).
Table 2: Human evaluation results on the Chinese-to-English task. "Flu." denotes fluency and "Ade." denotes adequacy. Two human evaluators who can read both Chinese and English were asked to assess the fluency and adequacy of the translations. The scores of fluency and adequacy range from 1 to 5.
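The margin difference ∆M of Eq. (6) reduces to simple arithmetic on four sentence-level log-probabilities. The sketch below uses made-up values and hypothetical helper names purely to show the computation and the two summary statistics reported for Figure 1 (the share of pairs with ∆M > 0 and the average ∆M).

```python
def margin(logp_pos, logp_neg):
    """Margin between a ground-truth and an erroneous translation."""
    return logp_pos - logp_neg


def margin_difference(cl, mle):
    """Delta M: margin under the CL model minus margin under the MLE model.

    cl and mle are (log P(y|x), log P(y~|x)) tuples under each model.
    """
    return margin(*cl) - margin(*mle)


# Toy log-probabilities for two sentence pairs (assumed values).
pairs = [
    {"cl": (-10.0, -14.0), "mle": (-10.5, -12.0)},
    {"cl": (-20.0, -21.0), "mle": (-19.0, -21.5)},
]

deltas = [margin_difference(p["cl"], p["mle"]) for p in pairs]
share_positive = sum(d > 0 for d in deltas) / len(deltas)
mean_delta = sum(deltas) / len(deltas)

print(deltas)          # [2.5, -1.5]
print(share_positive)  # 0.5
print(mean_delta)      # 0.5
```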
We observe that with properly synthesized negative examples, our contrastive learning method significantly outperforms MLE, MLE + CP, and WordDropout on all three language pairs. An interesting finding is that omitting high-frequency words (i.e., CL high) achieves significantly better results than omitting low-frequency words (i.e., CL low) for all three language pairs, which suggests that standard NMT models tend to omit high-frequency source words rather than low-frequency words.

Table 3 (note): CL denotes the contrastive learning method with the highest BLEU score, which is CL one for the Chinese-to-English and German-to-English tasks and CL three for the Russian-to-English task.
The experiment on omission by part of speech further confirms this finding, as omitting high-frequency prepositions (i.e., CL IN) leads to better results than omitting low-frequency verbs (i.e., CL V).

Human Evaluation Results
Table 2 shows the results of human evaluation on the Chinese-to-English task. We asked two human evaluators who can read both Chinese and English to evaluate the fluency and adequacy of the translations generated by MLE, MLE + CP, MLE + data, and CL one. The scores of fluency and adequacy range from 1 to 5. The translations were shuffled randomly, and the method names were hidden from the evaluators.
We find that CL one significantly improves adequacy over all baselines. This is because omitting important information in source sentences decreases the adequacy of translation. CL one is capable of alleviating this problem by assigning lower probabilities to translations with word omission errors.
To further quantify the extent to which our approach reduces word omission errors, we asked human evaluators to manually count word omission errors on the test sets of all the translation tasks. Table 3 shows the error counts. We find that CL one achieves significant error reduction compared with MLE, MLE + CP, and WordDropout for all three language pairs.

Related Work
Our work is related to two lines of research: modeling coverage for NMT and contrastive learning in NLP.

Modeling Coverage for NMT
The notion of coverage dates back to conventional phrase-based statistical machine translation (Koehn et al., 2003). A coverage vector, which is used to indicate whether a source phrase is translated or not during the decoding process, ensures that each source phrase is translated exactly once. As there are no latent variables defined on language structures in neural networks, it is hard to directly introduce coverage into NMT. As a result, there are two strategies. The first strategy is to modify the model architectures to incorporate coverage (Mi et al., 2016), which requires considerable expertise. The second strategy is to impose constraints on the decoding process (Wu et al., 2016).
Our work differs from prior studies in that contrastive learning is model agnostic. All previous coverage-based methods heavily rely on attention weights between source and target words to derive coverage for source words. Such attention weights are not readily available for all NMT models. In contrast, our method can be used to fine-tune arbitrary NMT models to reduce word omission errors in only hundreds of steps.

Contrastive Learning in NLP
Contrastive learning has been widely used in natural language processing. For instance, word embeddings are usually learned by the noise contrastive estimation method (Gutmann and Hyvärinen, 2012): a negative example is synthesized by randomly selecting a word from the vocabulary to replace a word in a ground-truth example (Vaswani et al., 2013; Mnih and Kavukcuoglu, 2013; Bose et al., 2018).
The closest work to ours is that of Wiseman and Rush (2016), which leverages contrastive learning during beam search with the golden reference sentences as positive examples and the current output sentences as contrastive examples. While they focus on improving the capability of Seq2Seq models to capture global dependencies, we focus on reducing the word omission errors of the Transformer model.

Conclusion
We have presented contrastive learning for reducing word omission errors in neural machine translation. Contrastive examples are automatically constructed by omitting words from the ground-truth translations. Our approach is model-agnostic and can be applied to arbitrary NMT models. Experiments show that our approach significantly reduces omission errors and improves translation performance on three language pairs.