Addressing Troublesome Words in Neural Machine Translation

One of the weaknesses of Neural Machine Translation (NMT) is in handling low-frequency and ambiguous words, which we refer to as troublesome words. To address this problem, we propose a novel memory-enhanced NMT method. First, we investigate different strategies to define and detect the troublesome words. Then, a contextual memory is constructed to memorize which target words should be produced in what situations. Finally, we design a hybrid model to dynamically access the contextual memory so as to correctly translate the troublesome words. Extensive experiments on Chinese-to-English and English-to-German translation tasks demonstrate that our method significantly outperforms the strong baseline models in translation quality, especially in handling troublesome words.

However, current NMT is a global model that maximizes performance on the overall data and has problems in handling low-frequency words and ambiguous words.¹ We refer to these words as troublesome words and define them in Section 3.1.
¹ In this work, we consider a source word ambiguous if it has multiple translations whose probability distribution has high entropy.
Some previous work attempts to tackle the translation problem of low-frequency words. Sennrich et al. (2016) propose to decompose words into subwords, which are used as translation units so that low-frequency words can be represented by frequent subword sequences. Arthur et al. (2016) and Feng et al. (2017) try to incorporate a translation lexicon into NMT in order to obtain the correct translation of low-frequency words. However, the former method still faces the low-frequency problem of subwords, and the latter has the drawback that the lexicons are used without considering specific contexts. Fig. 1 shows an example, in which "aerkate" is an infrequent word and the baseline NMT incorrectly translates it into the pronoun "he". Incorporating a bilingual lexicon rectifies this mistake but wrongly converts "chengzhang" into the incorrect target word "growth", since the entry "(chengzhang, growth)" in the bilingual lexicon is applied without taking the context into account. Furthermore, these two kinds of methods mainly focus on low-frequency words, which are just a part of the troublesome words.
Source: 阿尔卡特 宣称 去年 第四 季 销售 成长 近 百分之三十
Pinyin: aerkate cheng qunian disi ji xiaoshou chengzhang jin baifenzhisanshi
Reference: alcatel says sales in fourth quarter last year grew nearly 30 %
NMT: he said sales grew nearly 30 percent in fourth quarter of last year
NMT+LexiconTable: alcatel said sales growth nearly 30 percent in fourth quarter of last year
Figure 1: The NMT model produces a wrong translation for the low-frequency word "aerkate". When an external lexicon table is introduced without contextual information, the model incorrectly translates the ambiguous word "chengzhang" into "growth".
In this paper, we categorize the words (including infrequent words and ambiguous words) which are difficult to translate as troublesome words and propose a novel memory-augmented framework to address them. Our method first investigates different strategies to define the troublesome words. Then, these words and their contexts in the training data are memorized with a contextual memory which is finally accessed dynamically during decoding to solve the translation problem of the troublesome words.
Specifically, we first decode all the source sentences of the bilingual training data with baseline NMT and define the troublesome source words according to the distance between the predicted words and the gold words. The troublesome words associated with their hidden contextual representations are stored in a memory which memorizes the correct translation and the corresponding contextual information. During decoding, we activate the contextual memory when we encounter the troublesome words and employ the contextual similarity between the test sentence and the memory to determine appropriate target words. We test our methods on Chinese-to-English and English-to-German translation tasks. The experimental results demonstrate that the translation performance can be significantly improved and a large portion of troublesome words can be correctly translated. The contributions are listed as follows:
1) We are the first to define and handle the troublesome words in neural machine translation.
2) We propose to memorize not only the bilingual lexicons but also their contexts with a contextual memory.
3) We design a dynamic approach to correctly translate the troublesome words by combining the contextual memory and the NMT model.

Neural Machine Translation
NMT contains an encoder and a decoder. The encoder transforms a source sentence $X = \{x_1, x_2, \ldots, x_{T_x}\}$ into a set of context vectors $C = (h^m_1, h^m_2, \ldots, h^m_{T_x})$ by using $m$ stacked Long Short-Term Memory (LSTM) layers (Hochreiter and Schmidhuber, 1997), where $h^m_j$ is the hidden state of the top encoder layer. The bottom encoder layer is a bidirectional LSTM layer that collects context from both the left and the right side.
The decoder generates one target word at a time by computing $p^N_i(y_i \mid y_{<i}, C)$ (Eq. (1)) from the attention output $z_i$. The attention output is in turn computed from the context vector $c_i$, which depends on the attention weights $a_{i,j}$, and from $z^m_i$, the hidden state of the top decoder layer. A more detailed introduction can be found in Luong et al. (2015).
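The decoding step described above can be illustrated with a minimal NumPy sketch. It follows the general global-attention formulation of Luong et al. (2015) rather than equations taken from this paper: the dot-product attention score, the parameter names `W_c` and `W_out`, and the toy dimensions are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(C, z_top, W_c, W_out):
    """One attention-based decoding step (illustrative only).

    C     : (T_x, d) top-layer encoder states h^m_j
    z_top : (d,)     top-layer decoder state z^m_i
    W_c   : (d, 2d)  hypothetical projection producing the attention output
    W_out : (V, d)   hypothetical projection onto the target vocabulary
    """
    a = softmax(C @ z_top)                             # attention weights a_{i,j} (dot score assumed)
    c = a @ C                                          # context vector c_i
    z_att = np.tanh(W_c @ np.concatenate([c, z_top]))  # attention output
    p = softmax(W_out @ z_att)                         # p^N_i over the target vocabulary
    return p, a

# Toy usage with random parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
T_x, d, V = 5, 4, 7
p, a = decode_step(rng.normal(size=(T_x, d)), rng.normal(size=d),
                   rng.normal(size=(d, 2 * d)), rng.normal(size=(V, d)))
```

Both `p` and `a` are normalized distributions; the attention weights `a` are the same quantities that the memory-based components later reuse.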
Notation. In this paper, we denote each word of the source vocabulary by $s_m$ and each word of the target vocabulary $V_T$ by $t_n$. We denote a source sentence by $X$ and a target sentence by $Y$. Each source word in $X$ is denoted by $x_j$, and each target word in $Y$ is denoted by $y_i$. Accordingly, a target word can be denoted both by $t_n$ and by $y_i$. The two notations are not contradictory: $t_n$ means that the target word is the $n$-th word in the vocabulary $V_T$, and $y_i$ means that it is the $i$-th word in sentence $Y$. Similarly, we denote a source word by both $s_m$ and $x_j$.

Method Description
Our method contains three parts: 1) definition and detection of the troublesome words (Section 3.1); 2) contextual memory construction (Section 3.2); and 3) hybrid approach combining contextual memory and baseline NMT model (Section 3.3).

Troublesome Word Definition
Generally speaking, troublesome words are those that are difficult to translate for the baseline NMT system $B_{NMT}$. Fig. 2 shows the main process to detect the troublesome words. Given each training sentence pair $(X, Y)$, $B_{NMT}$ decodes the source sentence $X$ and outputs the predicted probability of each gold target word, $p^N_i(y_i)$. We call $y_i$ an exception if $p^N_i(y_i)$ satisfies the predefined exception criteria introduced below. The source word $x_j$ is then also an exception (a candidate troublesome word) if $y_i$ is an exception and $(x_j, y_i)$ is an entry in the word alignment $A$. Suppose $x_j$ appears $N$ times in the training data and there are $M$ exceptions among all its aligned gold target words. Then, the exception rate is $r(x_j) = M/N$.
Definition: $x_j$ is a troublesome word if $r(x_j)$ exceeds a predefined threshold.
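The exception rate and the resulting definition can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the input is taken to be one `(source_word, is_exception)` pair per occurrence of a source word in the training data, with `is_exception` already decided by some exception criterion, and the function names are hypothetical.

```python
from collections import Counter

def exception_rates(occurrences):
    """Compute r(x) = M / N for every source word.

    occurrences: iterable of (source_word, is_exception) pairs, one per
    occurrence of the word in the training data. Occurrences without an
    exceptional aligned target word simply carry is_exception=False.
    """
    total = Counter()       # N: number of occurrences per word
    exceptions = Counter()  # M: number of exceptions per word
    for word, is_exception in occurrences:
        total[word] += 1
        if is_exception:
            exceptions[word] += 1
    return {word: exceptions[word] / total[word] for word in total}

def troublesome_words(occurrences, threshold):
    """Return the words whose exception rate exceeds the threshold."""
    rates = exception_rates(occurrences)
    return {word for word, rate in rates.items() if rate > threshold}

# Toy data: "a" is an exception in 1 of 4 occurrences, "b" never.
toy = [("a", True), ("a", False), ("a", False), ("a", False), ("b", False)]
rates = exception_rates(toy)
```

With a threshold of 0.05, only the word "a" (exception rate 0.25) would be kept as troublesome in this toy data.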
Exception Criteria. As discussed before, we need an exception criterion to decide whether a gold target word is an exception or not. In this paper, we investigate three exception criteria. Here, we introduce each of them through a toy example shown in Fig. 3, in which the source sentence is $X = \{x_1, x_2, x_3\}$ and the gold target sentence is $Y = \{y_1, y_2, y_3\}$. The left part shows the probability distribution over the target vocabulary, $p^N_i(V_T)$, at each decoding step $i$, where the probability of the gold target word is highlighted in yellow. The right part shows the word alignments between $X$ and $Y$.
1) Absolute Criterion. A gold target word $y_i$ is an exception if its predicted probability $p^N_i(y_i)$ is lower than a predefined threshold, namely $p^N_i(y_i) < p_0$. In Fig. 3, $p^N_i(y_i)$ at the three decoding steps is 0.8, 0.31 and 0.2, respectively. If we set $p_0 = 0.5$, $p^N_2(y_2)$ and $p^N_3(y_3)$ are lower than the threshold $p_0$, so $x_1$ and $x_3$ are both exceptions according to the alignments.
2) Gap Criterion. For this criterion, we utilize the predicted probability gap between the gold target word and the top-ranked word. Specifically, the gap is calculated by:
$$g(y_i) = \max(p^N_i(V_T)) - p^N_i(y_i)$$
where $\max(p^N_i(V_T))$ is the largest value in the probability distribution at the $i$-th decoding step. $y_i$ is an exception if $g(y_i) > g_0$. In Fig. 3, the largest predicted probabilities $\max(p^N_i(V_T))$ at the three decoding steps are 0.8, 0.35 and 0.75, respectively. Thus, the gaps are 0.0, 0.04 and 0.55. If $g_0 = 0.1$, $x_3$ is an exception since $g(y_3) > g_0$ and $x_3$ aligns to $y_3$.
3) Ranking Criterion. This criterion is based on the rank of $p^N_i(y_i)$ in $p^N_i(V_T)$, denoted by $rank(y_i)$. If $rank(y_i) > rank_0$, then $y_i$ is an exception. In Fig. 3, the ranks of the gold target words are 1, 3 and 2, respectively. If we set $rank_0 = 2$, then $rank(y_2) = 3 > rank_0$ and $x_1$ is an exception due to the alignment between $x_1$ and $y_2$.
Figure 3: A toy example to show the process: if $p^N_i(y_i)$ (left) satisfies the predefined exception criteria and $x_j$ aligns to $y_i$, then $x_j$ is an exception.
Using the above exception criteria and the definition of troublesome words, we can detect all the source-side troublesome words in the bilingual training data.
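The three criteria can be checked against the toy example with a short Python sketch. The full distributions below are invented so that the gold probabilities, maxima and ranks match the numbers quoted for Fig. 3; the alignment of $x_2$ to $y_1$ is likewise an assumption (the text only states that $x_1$ aligns to $y_2$ and $x_3$ to $y_3$), and all function names are hypothetical.

```python
def absolute_exception(probs, gold, p0):
    """Absolute criterion: the gold word's probability is below p0."""
    return probs[gold] < p0

def gap(probs, gold):
    """Gap between the top probability and the gold word's probability."""
    return max(probs) - probs[gold]

def gap_exception(probs, gold, g0):
    """Gap criterion: the gap exceeds g0."""
    return gap(probs, gold) > g0

def rank(probs, gold):
    """1-based rank of the gold word in the distribution."""
    return 1 + sum(p > probs[gold] for p in probs)

def rank_exception(probs, gold, rank0):
    """Ranking criterion: the gold word ranks below position rank0."""
    return rank(probs, gold) > rank0

# (distribution, index of gold word) for decoding steps 1..3; toy values.
steps = [
    ([0.80, 0.10, 0.05, 0.05], 0),  # gold prob 0.80, max 0.80, rank 1
    ([0.35, 0.33, 0.31, 0.01], 2),  # gold prob 0.31, max 0.35, rank 3
    ([0.75, 0.20, 0.03, 0.02], 1),  # gold prob 0.20, max 0.75, rank 2
]
# Target step index -> aligned source word (x2 -> y1 is assumed).
alignment = {0: "x2", 1: "x1", 2: "x3"}

def exceptional_sources(criterion, threshold):
    """Source words aligned to gold words flagged by the given criterion."""
    return sorted(alignment[i]
                  for i, (probs, gold) in enumerate(steps)
                  if criterion(probs, gold, threshold))
```

Calling `exceptional_sources(absolute_exception, 0.5)`, `exceptional_sources(gap_exception, 0.1)` and `exceptional_sources(rank_exception, 2)` reproduces the outcomes described in the text: $x_1$ and $x_3$, only $x_3$, and only $x_1$, respectively.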

Contextual Memory Construction
For a troublesome word, we now introduce how to build a contextual memory $\mathcal{M}$ to store its translation knowledge. Specifically, each entry of the contextual memory contains five elements, $(s_m, t_n, c(s_m, t_n), p^L(s_m, t_n), r(s_m))$, each of which is described as follows:
• $s_m$ is a troublesome source word.
• $t_n$ is a gold target word for $s_m$.
• $c(s_m, t_n)$ is the context of the lexicon pair $(s_m, t_n)$. Here, we use the encoder hidden state $h_j$ to represent the context, since it contains the information from the left ($\overrightarrow{h}_j$) and from the right ($\overleftarrow{h}_j$). Note that when we traverse the training data and memorize the contexts of all troublesome words, there are many cases in which the same pair $(s_m, t_n)$ appears in different contexts. In order to reduce the memory size and fuse the different contexts of the same lexicon pair, we merge these memories by averaging the contexts. Assume there are $K$ different contexts for $(s_m, t_n)$, denoted by $h_k(s_m, t_n)$. The average context of $(s_m, t_n)$ is calculated by:
$$c(s_m, t_n) = \frac{1}{K} \sum_{k=1}^{K} h_k(s_m, t_n)$$
Note that the context here is defined on the source side.
• $p^L(s_m, t_n)$ is the lexicon translation probability. It is the average of the source-to-target and target-to-source probabilities calculated through maximum likelihood estimation on word alignments.
• $r(s_m)$ is the exception rate of $s_m$ introduced in Section 3.1, and it indicates the translation difficulty of a source word. We will use $r(s_m)$ to determine the dynamic weights of contextual memories in Section 3.3.
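The construction described by the five elements above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary layout, the function name `build_memory` and the toy values are assumptions, and the lexicon probabilities and exception rates are taken as given inputs.

```python
import numpy as np

def build_memory(samples, lex_prob, exc_rate):
    """Build a contextual memory keyed by lexicon pair (s, t).

    samples  : iterable of (s, t, h), where h is the encoder hidden state
               of troublesome source word s in one training sentence and
               t is its aligned gold target word
    lex_prob : dict mapping (s, t) to the lexicon translation probability
    exc_rate : dict mapping s to its exception rate r(s)
    """
    sums, counts = {}, {}
    for s, t, h in samples:
        key = (s, t)
        h = np.asarray(h, dtype=float)
        sums[key] = sums[key] + h if key in sums else h.copy()
        counts[key] = counts.get(key, 0) + 1
    memory = {}
    for (s, t), total in sums.items():
        memory[(s, t)] = {
            "context": total / counts[(s, t)],  # average of the K contexts
            "lex_prob": lex_prob[(s, t)],
            "exc_rate": exc_rate[s],
        }
    return memory

# Toy usage; all numbers are made up for illustration.
samples = [
    ("chengzhang", "grew", [1.0, 3.0]),
    ("chengzhang", "grew", [3.0, 5.0]),
    ("chengzhang", "growth", [0.0, 0.0]),
]
memory = build_memory(
    samples,
    lex_prob={("chengzhang", "grew"): 0.6, ("chengzhang", "growth"): 0.4},
    exc_rate={"chengzhang": 0.5},
)
```

Averaging keeps one vector per lexicon pair regardless of how often the pair occurs, which is what bounds the memory size.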
Noise Reduction. The training data and word alignments are not perfect and may introduce noise into the contextual memory. To reduce the noise, we employ two strategies.
1) To improve the quality of the alignments $A$, we derive the alignment results in the source-to-target and target-to-source directions, respectively, and only keep the alignments which exist in both directions.
2) We eliminate the lexicon pairs whose translation probabilities are too small. For a lexicon pair $(s_m, t_n)$, if its lexicon translation probability is smaller than 0.01, we treat this lexicon pair as a noisy sample and eliminate it from our memory.
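Both noise-reduction strategies amount to simple filters, sketched below. The representation of alignments as sets of index pairs and the function names are assumptions for illustration; only the 0.01 cutoff comes from the text.

```python
def intersect_alignments(src2tgt, tgt2src):
    """Keep only the links found in both alignment directions.

    src2tgt: set of (j, i) links from source position j to target position i
    tgt2src: set of (i, j) links from target position i to source position j
    """
    return {(j, i) for (j, i) in src2tgt if (i, j) in tgt2src}

def prune_lexicon(lex_prob, min_prob=0.01):
    """Drop lexicon pairs whose translation probability is below min_prob."""
    return {pair: p for pair, p in lex_prob.items() if p >= min_prob}

# Toy usage.
links = intersect_alignments({(0, 0), (1, 2)}, {(0, 0), (1, 1)})
lexicon = prune_lexicon({("a", "b"): 0.5, ("a", "c"): 0.005, ("a", "d"): 0.01})
```

Here only the link `(0, 0)` survives the intersection, and the pair with probability 0.005 is removed from the lexicon.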

Integrating Contextual Memory into NMT
In this section, we integrate the contextual memory into NMT to handle troublesome words. The overall framework is depicted in Fig. 4 and the integration process can be divided into four steps:
Step 1. Given a test sentence $X$, the first step is to find the troublesome words in $X$ and collect the corresponding local memories from the global contextual memory $\mathcal{M}$. For each source word $x_j$, we check whether it is a troublesome word and, if so, retrieve from $\mathcal{M}$ its local memory, i.e., the entries of $\mathcal{M}$ whose source word is $x_j$.
Step 2. The next step is to measure the contextual similarity between the context in the test sentence $X$ and the context in $\mathcal{M}$. For the troublesome word $x_j \in X$, we still use the encoder hidden state $h_j$ to represent the context in $X$. The corresponding context in $\mathcal{M}$ is $c(x_j, t_n)$ in Eq. (8). Here, we use a feed-forward network to measure the similarity $d_j(t_n)$ (Eq. (9)), where $v_d$, $W_h$ and $W_c$ are learnable parameters. A sigmoid function guarantees that the similarity score is in the range (0, 1). This similarity $d_j(t_n)$ determines whether or not to adopt the target translation word $t_n$ in $\mathcal{M}$.
Step 3. The next task is to calculate the probability $p^M_i(t_n)$ of $t_n$ at each decoding step $i$. $p^M_i(t_n)$ is the probability predicted by the contextual memory $\mathcal{M}$ (Eq. (10)) and is calculated from the attention weight $a_{i,j}$, the context similarity $d_j(t_n)$ in Eq. (9), and the lexicon translation probability $p^L(x_j, t_n)$.
Step 4. The final task is to combine the memory-predicted probability ($p^M_i$ in Eq. (10)) and the NMT-predicted one ($p^N_i$ in Eq. (1)). Here, we propose a dynamic strategy to balance these two probabilities (Eq. (11)), where $p^F_i(t_n)$ is the final probability of the target word $t_n$ and $\lambda_i$ is the dynamic weight that adjusts the contributions from the memory and from NMT. The reason for the dynamic weighting is as follows. Recall that for each troublesome source word $s_m$, we calculate its exception rate (similar to an error rate). A lower exception rate indicates that the source word is easier for the neural model to translate, in which case $p^N_i$ is more reliable. Thus we design the dynamic weight $\lambda_i$ according to the exception rate $r(x_j)$ (Eq. (12)), where $\beta_\gamma$ is a learnable parameter. From Eq. (12), the dynamic weight $\lambda_i$ is determined by both the attention weight $a_{i,j}$ and the exception rate $r(x_j)$.
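The four steps can be put together in a small NumPy sketch. Since the exact equations are not reproduced here, every functional form below is an assumption chosen to be consistent with the description: the similarity is a sigmoid over a one-hidden-layer network with a tanh nonlinearity, the memory probability is an attention-weighted sum of similarity times lexicon probability, the final probability is a linear interpolation, and the dynamic weight is a tanh of the attention-weighted exception rate scaled by `beta`. The memory layout and all names are hypothetical, and no renormalization is applied.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def similarity(h_j, c, v_d, W_h, W_c):
    """Step 2 (assumed form): feed-forward similarity in (0, 1)."""
    return float(sigmoid(v_d @ np.tanh(W_h @ h_j + W_c @ c)))

def memory_probs(words, H, a_i, memory, vocab, params):
    """Steps 1-3 (assumed form): memory-predicted scores over the vocabulary.

    words : source words x_1..x_Tx     H    : (T_x, d) encoder states
    a_i   : (T_x,) attention weights   vocab: dict target word -> index
    memory: dict (s, t) -> {"context", "lex_prob", ...}
    """
    p_mem = np.zeros(len(vocab))
    for j, x in enumerate(words):
        for (s, t), entry in memory.items():
            if s != x:  # Step 1: keep only the local memory of x_j
                continue
            d = similarity(H[j], entry["context"], *params)
            p_mem[vocab[t]] += a_i[j] * d * entry["lex_prob"]
    return p_mem

def dynamic_weight(words, a_i, exc_rate, beta):
    """Step 4 (assumed form): weight grows with attended exception rate."""
    score = sum(a * exc_rate.get(x, 0.0) for a, x in zip(a_i, words))
    return float(np.tanh(beta * score))

def combine(p_nmt, p_mem, lam):
    """Step 4 (assumed form): interpolate memory and NMT predictions."""
    return lam * p_mem + (1.0 - lam) * p_nmt

# Toy usage; all values are made up.
rng = np.random.default_rng(0)
params = (rng.normal(size=3), rng.normal(size=(3, 2)), rng.normal(size=(3, 2)))
memory = {("chengzhang", "grew"): {"context": np.array([0.2, -0.1]), "lex_prob": 0.6}}
vocab = {"grew": 0, "growth": 1, "he": 2}
words = ["aerkate", "chengzhang"]
H = rng.normal(size=(2, 2))
a_i = np.array([0.1, 0.9])
p_nmt = np.array([0.2, 0.7, 0.1])
p_mem = memory_probs(words, H, a_i, memory, vocab, params)
lam = dynamic_weight(words, a_i, {"chengzhang": 0.5}, beta=1.0)
p_final = combine(p_nmt, p_mem, lam)
```

In this sketch a sentence without troublesome words yields a weight of zero, so the final prediction falls back to the NMT distribution, which matches the intent that the memory is only activated for troublesome words.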
Training the parameters. As discussed above, our method contains some parameters ($v_d$, $W_h$, $W_c$ and $\beta_\gamma$) to be learned. We denote the parameters introduced by our method by $\theta_M$ and the parameters in NMT by $\theta_N$. To make training efficient, given the aligned training data $D = \{(X^{(d)}, Y^{(d)})\}_{d=1}^{|D|}$, we keep $\theta_N$ unchanged and optimize only $\theta_M$ by maximizing the training objective, in which $p^F_i$ is calculated by Eq. (11).

Experimental Settings
We test the proposed methods on Chinese-to-English (CH-EN) and English-to-German (EN-DE) translation tasks. We use the Zoph RNN toolkit to implement all described methods. In all experiments, the encoder and decoder each include two stacked LSTM layers. The word embedding dimension and the size of the hidden layers are both set to 1,000. The minibatch size is set to 128. We discard the training sentence pairs whose length exceeds 100. We run a total of 20 iterations for all translation tasks. We test all methods at two granularities: words and sub-words. For word granularity, we limit the vocabulary to 30K (CH-EN) and 50K (EN-DE) for both the source and target languages. For sub-word granularity, we use the BPE method (Sennrich et al., 2016) with 30K (CH-EN) and 32K (EN-DE) merge steps. The beam size is set to 12. We use case-insensitive 4-gram BLEU (Papineni et al., 2002) for translation quality evaluation.
We compare our method with other relevant methods as follows:
1) Baseline: It is the baseline NMT system with global attention (Luong et al., 2015; Zoph and Knight, 2016; Jean et al., 2015).
2) Arthur: It is the state-of-the-art method which incorporates discrete translation lexicons into NMT (Arthur et al., 2016). We implement Arthur et al. (2016)'s method in two different ways. In the first way, we keep the Baseline fixed and apply Arthur et al. (2016)'s method only in the test phase. We denote this system by Arthur(test). In the second way, we allow the Baseline to be retrained with Arthur et al. (2016)'s method, and denote the system by Arthur(train+test). We replicate Arthur et al. (2016)'s work using the bias method with the hyperparameter set to 0.001, as reported in their paper.
3) X+MEM: It is our proposed memory-augmented method applied to any neural model X, in which we define the troublesome words by using the gap criterion with threshold $g_0 = 0.1$. We set the exception-rate threshold to 0.05, which is tuned on the validation set. That is, if the exception rate of a source word exceeds 0.05, we treat this word as a troublesome word.
Table 1 reports the main translation results of CH-EN translation. We first compare Baseline+MEM with Baseline. As shown in row 1 and row 5 of Table 1, Baseline+MEM improves over Baseline on all test datasets, and the average improvement is 1.37 BLEU points. The results show that our method significantly outperforms the baseline model.
Table 1 (caption fragment): ... shows the BLEU improvement of system "X+MEM" over system X. "*" indicates that system "X+MEM" is statistically significantly better (p < 0.05) than system X and "†" indicates p < 0.01.

Results on Sub-words
We also test the proposed method when the translation unit is the sub-word. The baseline and our method using sub-words as translation units are denoted by Baseline(sub-word) and Baseline(sub-word)+MEM, respectively. The results are shown in row 4 and row 7. Baseline(sub-word)+MEM outperforms Baseline(sub-word) by 1.01 BLEU points, indicating that adopting sub-words as translation units still faces the problem of troublesome tokens, and that our method can alleviate this problem.

Our Method vs. Method Using Translation Lexicon
We also compare our method with Arthur et al. (2016)'s method, which incorporates a translation lexicon into NMT. Here, the comparison is conducted in two ways, depending on whether the baseline neural model is fixed or retrained.
Fixed Baseline. Comparing Arthur(test) (row 2 in Table 1) and Baseline+MEM (row 5 in Table 1), we can see that our proposed method surpasses Arthur(test) by 1.05 BLEU points. As there are three differences between our method and Arthur(test), we conduct the following experiments to evaluate the effect of each difference.
The first difference is that our memory only stores the lexicon pairs for troublesome words, while Arthur(test) utilizes all the available lexicon pairs. We implement another system which is similar to Arthur(test), except that it only utilizes the troublesome lexicon pairs. We denote the system by Tword. The results are reported in Table 2.
From the results, we can see that Tword obtains better translation results than Arthur(test) while using far fewer lexicon pairs (125K vs. 938K).
The second difference is that we take the context into consideration. When we add the context on the basis of Tword (denoted by +Context), it further improves the baseline system by 1.03 BLEU points, indicating the importance of the context. Fig. 5 shows the example mentioned in Section 1, in which Arthur(test) translates chengzhang into the wrong target word growth, while Baseline+MEM avoids this mistake with the help of the context modeling.
We also implement another system, in which we build the contextual memory for all source words. We denote the system by All+Context and the results are reported in Table 2. As shown in Table 2, All+Context surpasses Arthur(test) by 0.75 BLEU points, but at the cost of a 6.4G memory footprint and 1.829s of decoding time. However, if we build the contextual memory only for the troublesome words, the BLEU score declines only slightly compared with All+Context (40.19 vs. 40.23), while the memory size drops sharply to 893M and the decoding time to 0.511s, showing that our strategy of building the contextual memory only for troublesome words is effective.
Figure 6: The comparison of different criteria. The gap criterion outperforms the others as the memory size increases.

The final difference is that we employ the dynamic strategy to balance between NMT and the contextual memory. When we employ this dynamic strategy (denoted by +Dynamic), the improvement can further reach 1.37 BLEU points.
Retrained Baseline. In the second comparison, we allow the baseline model to be retrained with Arthur et al. (2016)'s method (Arthur(train+test)). We then implement our method using Arthur(train+test) as the baseline (denoted by Arthur(train+test)+MEM). Comparing the results of these two methods in Table 1 (rows 3 and 6), our method is still effective on the retrained model, with an average gain of 0.92 BLEU points.

Effects of Different Exception Criteria
In our method, we investigate three exception criteria to define the troublesome words. The following experiment is conducted to compare their performance. For fairness, the three criteria are compared under the same contextual memory size, which is achieved by adjusting the respective thresholds ($p_0$, $g_0$ and $rank_0$). The results are reported in Fig. 6, in which the x axis represents the size of the contextual memory, the y axis denotes the BLEU score, and the numbers in brackets, from left to right, are the respective thresholds of the gap, absolute and ranking criteria. As shown in Fig. 6, all three criteria improve the translation quality. When the memory size is relatively small, the absolute criterion performs best. As the size increases, the gap criterion achieves higher performance than the others. Note that our current criteria only consider one single factor. A combination of different criteria may be more beneficial, and we leave this as future work.

Results on Low-Frequency Words and Ambiguous Words
We further analyze our method on specific troublesome words, namely low-frequency words and ambiguous words. Here, we use the following definitions in our analysis.
Low-frequency words: The words whose frequency is lower than 100.
Ambiguous words: Assume a word $s_m$ has $K$ candidate translations with probabilities $p^L_k$. If the entropy of the probability distribution satisfies $-\sum_{k=1}^{K} p^L_k \log p^L_k > E_0$ ($E_0 = 1.5$ in this paper), we treat this word as an ambiguous word.
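The two definitions can be expressed as simple predicates. The sketch below assumes the natural logarithm, since the base of the logarithm is not stated, and the function names are hypothetical.

```python
import math

def is_low_frequency(count, max_count=100):
    """A word is low-frequency if it occurs fewer than max_count times."""
    return count < max_count

def translation_entropy(probs):
    """Entropy of a word's translation distribution (natural log assumed)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def is_ambiguous(probs, e0=1.5):
    """A word is ambiguous if its translation entropy exceeds e0."""
    return translation_entropy(probs) > e0
```

Under the natural-log assumption, a word with eight equally likely translations has entropy $\ln 8 \approx 2.08$ and counts as ambiguous, whereas a word with four equally likely translations has entropy $\ln 4 \approx 1.39$ and does not.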
Therefore, the sentences containing troublesome words can be divided into four different parts: 1) sentences which contain both low-frequency and ambiguous words (Low+Amb), [...] Table 3. From this table, we observe that our proposed method improves the translation quality on all kinds of sentences. Low+Amb shows the largest improvement (Low the second), indicating that our method is most effective in dealing with low-frequency words. The improvement on Amb is 0.81 BLEU points, showing that our method can also handle ambiguous words well. We also conduct a manual analysis to find out how many troublesome words can be rectified by our method. We randomly select 200 test sentences and count the following four numbers: 1) the number of troublesome words in the sentences (Tword), 2) the number of mistakes produced by Baseline (Error), 3) the number (ratio) of rectifications achieved by our method (Rectify), and 4) the number of deteriorations caused by our method (Deterio). The statistics are reported in Table 4. From the results, we can draw a similar conclusion: our method is most effective on low-frequency and ambiguous words, with rectification rates of 50.8% and 41.7%, respectively.
Note that the proposed method also produces 11 deterioration cases (Deterio) when rectifying the troublesome words. As a comparison, we also count the total rectification and deterioration numbers of Arthur(test). The results are reported in Table 5. These results show that our method rectifies more words (70 vs. 51) with less deterioration (11 vs. 17) than Arthur(test).

Results on EN-DE Translation
We also test our method on EN-DE translation and the results are reported in Table 6. We can see that our method is still effective on EN-DE translation. Specifically, when the translation unit is the word, the proposed method improves the baseline by 1.13 BLEU points. The improvement is 0.76 BLEU points when the translation unit is the sub-word.
Table 6: The results on EN-DE translation. "*" indicates that it is statistically significantly better (p < 0.05) than system X and "†" indicates p < 0.01.

Related Work
The related work can be divided into three categories and we describe each of them as follows:
Neural Turing Machine for NMT. Our idea is first inspired by the Neural Turing Machine (NTM) (Graves et al., 2014, 2016) and the memory network (Weston et al., 2014). Wang et al. (2017a) used a special NTM memory to extend the decoder in attention-based NMT. In their method, the memory is used to provide temporary information from the source to assist the decoding process. In contrast, our work uses memory to store contextual knowledge from the training data.
Smaller translation granularity. Our work is also inspired by other studies that deal with low-frequency and ambiguous words (Vickrey et al., 2005; Zhai et al., 2013; Rios et al., 2017; Carpuat and Wu, 2007). Among them, the most relevant is the work that decomposes low-frequency words into smaller granularities, e.g., the hybrid word-character model (Luong and Manning, 2016), the sub-word model (Sennrich et al., 2016) or the word piece model. These methods mainly focus on low-frequency words, which are just a subset of the troublesome words. Furthermore, our experimental results show that even when using a smaller translation unit, the NMT model still faces the problem of troublesome tokens, and our method can alleviate this problem.
Combining SMT and NMT. Our ideas are also inspired by the work which combines SMT and NMT. Earlier studies were mostly based on the SMT framework, and have been discussed in depth in the review paper of Zhang and Zong (2015). Later, researchers moved to the NMT framework, e.g., Wang et al. (2017b); Zhou et al. (2017); Tu et al. (2016); Mi et al. (2016); He et al. (2016); Dahlmann et al. (2017); Wang et al. (2017c,d); Gu et al. (2018); Zhao et al. (2018). The most relevant studies are Arthur et al. (2016) and Feng et al. (2017), which incorporate lexicon pairs into NMT to improve the translation quality. There are three differences between our method and theirs. First, we only utilize the lexicon pairs for the troublesome words, rather than using all lexicon pairs. Second, we take contextual information into consideration for memory construction. Third, we design a dynamic strategy to balance the memory and NMT. The experiments show the superiority of our proposed method.

Conclusions
To address troublesome words in NMT, we have proposed a novel memory-enhanced framework. We first define and detect the troublesome words, then construct a contextual memory to store the translation knowledge and finally access the contextual memory dynamically to correctly translate the troublesome words. The extensive experiments on Chinese-to-English and English-to-German translation tasks demonstrate that our method significantly outperforms the strong baseline models in translation quality, especially in handling the troublesome words.