Addressing the Rare Word Problem in Neural Machine Translation

Neural Machine Translation (NMT) has recently attracted a lot of attention due to the very high performance achieved by deep neural networks in other domains. An inherent weakness in existing NMT systems is their inability to correctly translate rare words: end-to-end NMTs tend to have relatively small vocabularies with a single "unknown-word" symbol representing every possible out-of-vocabulary (OOV) word. In this paper, we propose and implement a simple technique to address this problem. We train an NMT system on data that is augmented by the output of a word alignment algorithm, allowing the NMT system to output, for each OOV word in the target sentence, its corresponding word in the source sentence. This information is later utilized in a post-processing step that translates every OOV word using a dictionary. Our experiments on the WMT'14 English to French translation task show that this simple method provides a substantial improvement over an equivalent NMT system that does not use this technique. The performance of our system achieves a BLEU score of 37.5, which improves upon the previous best end-to-end NMT by 2.7 points. Our NMT system is the first to surpass the existing state-of-the-art performance on a WMT'14 contest task.


Introduction
Deep Neural Networks (DNNs) have achieved excellent results on speech recognition [11], visual object recognition [15], and other challenging tasks, so there has been much interest in applying them to natural language processing (NLP) problems as well. Among the important NLP tasks, machine translation (MT) is one where DNNs are likely to achieve strong results because of the availability of large parallel corpora.
Machine translation is challenging because the space of possible translations for a given source sentence is vast. For more than a decade, the standard MT approach [14] has been subject to intensive research and resulted in better systems over time [13,4,3,8]. However, this comes at the cost of having a complex pipeline with many subcomponents that need to be tuned jointly, making it difficult to improve upon existing systems. In recent years, MT researchers have been seeking to incorporate neural network models into the standard pipeline as an additional subcomponent [20,21,23,6]. At heart, these systems use phrase tables and thus rely primarily on small contexts during the translation process.
Lately, there have been a number of attempts to develop a purely neural machine translation system (NMT) [12,5,2,22]. NMT systems should eventually outperform the best standard systems because neural networks scale well with larger models and generalize to word sequences that do not appear in the training set. In addition, NMT systems are easy to train with backpropagation and their decoder is easy to implement, unlike the highly intricate decoders used by the phrase-based systems [13]. NMT systems also use minimal domain knowledge, which makes them applicable to any other problem that can be formulated as mapping a sequence to another sequence [22].
A major limitation of existing NMTs is their use of a fixed modest-sized vocabulary. NMT systems cannot translate rare words, because they use a single <unk> symbol to represent all out-of-vocabulary (OOV) words, as illustrated in Figure 1. Empirically, both Sutskever et al. [22] and Bahdanau et al. [2] have observed that sentences with many rare words tend to be translated much more poorly than sentences containing mainly frequent words. Standard phrase-based systems, on the other hand, suffer less from the rare word problem because they can afford a much larger vocabulary, and because their use of explicit alignments and phrase counts allows them to memorize the translations of even extremely rare words.
Motivated by the strengths of the standard phrase-based system, we propose and implement a simple approach to address the rare word problem of NMTs. Our approach augments the training data with alignment information that allows the NMT system to emit, for each OOV word, a "pointer" to its corresponding word in the source sentence. This information is later utilized in a post-processing step that translates the OOV words using a dictionary or with the identity translation (if no translation is found).
Our experiments confirm that this approach is effective. On the English to French WMT'14 translation task, this approach provides an improvement of more than 2 BLEU points over an equivalent NMT system that does not use this technique. Moreover, our system achieves 36.9 BLEU points, matching the performance of the state-of-the-art system [7] while using only a third of the training data. Our system improves upon the previous best NMT system by 2.1 BLEU points.
Figure 1: Example of the rare word problem - an English source sentence (en), a human translation to French (fr), and a translation produced by one of our neural network systems (nn) before handling OOV words. We highlight words that are unknown to our model. The token <unk> indicates an OOV word. We also show a few important alignments between the pair of sentences.

Neural Machine Translation
A neural machine translation system is any neural network that maps a source sentence, s_1, ..., s_n, to a target sentence, t_1, ..., t_m, where all sentences are assumed to terminate with a special "end-of-sentence" token <eos>. More concretely, an NMT system uses a neural network to parameterize the conditional distributions p(t_j | t_<j, s_≤n) for 1 ≤ j ≤ m. By doing so, it becomes possible to compute and therefore maximize the log probability of the target sentence given the source sentence:
\log p(t \mid s) = \sum_{j=1}^{m} \log p(t_j \mid t_{<j}, s_{\le n})
There are many ways to parameterize these conditional distributions. For example, Kalchbrenner et al. [12] used a combination of a convolutional neural network and a recurrent neural network, Sutskever et al. [22] used a large and deep Long Short-Term Memory (LSTM) model, Cho et al. [5] used an architecture similar to the LSTM, and Bahdanau et al. [2] used a more elaborate neural network architecture that uses an attentional mechanism over the input sequence, similarly to Graves [9] and Graves et al. [10].
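The chain-rule factorization of the sentence log probability described in this section can be illustrated with a minimal sketch. The function name and the per-token probabilities below are hypothetical; in a real system each probability would come from the decoder's softmax output for the reference token.

```python
import math

def sentence_log_prob(step_probs):
    """Sum log p(t_j | t_<j, s) over target positions j = 1..m.

    step_probs[j] is the probability the model assigned to the j-th
    reference target token (the last token being <eos>).
    """
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-token probabilities for a 3-token target ending in <eos>.
probs = [0.5, 0.25, 0.8]
log_p = sentence_log_prob(probs)
# exp(log_p) equals the product of the per-token probabilities.
```

Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow on long sentences, which is why the training objective is stated in log space.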
In this work, we use the exact model of Sutskever et al. [22], which has a large deep LSTM to encode the input sequence and a separate deep LSTM to produce a translation from the input sequence. The encoder reads the source sentence, one word at a time, and produces a large hidden state that represents the entire source sentence. The decoder is initialized from that final hidden state and generates a target translation, one word at a time, until the end-of-sentence symbol <eos> is emitted.
Despite the relatively large amount of work done on pure neural machine translation systems, there has been no work addressing the OOV problem in NMT systems.

Rare Word Models
To address the rare word problem discussed in Section 1, we train our neural machine translation system to track the source of the unknown words in the target sentences. If we knew the source word that is responsible for each unknown target word, we could introduce a post-processing step that would replace each <unk> in the system's output with a translation of its source word, using either a dictionary or the identity translation. For example, in Figure 1, if the model knows that the second unknown token in the NMT (line nn) originates from the source word ecotax, it can perform a word dictionary lookup to replace that unknown token by écotaxe. Similarly, an identity translation of the source word Pont-de-Buis can be applied to the third unknown token.
We present three annotation strategies that can easily be applied to any NMT system. We treat the NMT system [12,22,5] as a black box and train it on a dataset annotated with alignment information specified by one of the models below. Such alignment data can be obtained from a parallel corpus using an unsupervised aligner. From the alignment links, we construct a word dictionary that will be used for the word translations in the post-processing step. If a word does not appear in our dictionary, then we apply the identity translation.
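The dictionary construction and identity fallback described above might look like the following sketch. The text does not say how competing translations of the same source word are resolved, so keeping the most frequently aligned target word is an assumption, and the function names and the sample links are hypothetical.

```python
from collections import Counter, defaultdict

def build_dictionary(aligned_pairs):
    """Build a source->target word dictionary from alignment links.

    aligned_pairs: iterable of (source_word, target_word) links.
    Assumption: keep the target word most frequently aligned to each
    source word; the paper does not specify the selection rule.
    """
    counts = defaultdict(Counter)
    for src_word, tgt_word in aligned_pairs:
        counts[src_word][tgt_word] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

def translate_word(word, dictionary):
    """Dictionary lookup with the identity translation as fallback."""
    return dictionary.get(word, word)

# Hypothetical alignment links harvested from a parallel corpus.
links = [("ecotax", "écotaxe"), ("ecotax", "écotaxe"), ("ecotax", "taxe")]
dictionary = build_dictionary(links)
```

With this dictionary, `translate_word("ecotax", dictionary)` yields the dictionary entry, while a name such as Pont-de-Buis that has no entry is copied unchanged.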
The first part of the sentence pair in Figure 1 (lines en and fr) is used to illustrate our models.

Copyable Model
In this approach, we introduce multiple tokens to represent the unknown words in the source and in the target language, instead of just one <unk> token. We annotate the OOV words in the source sentence with unk_1, unk_2, unk_3, ..., in that order, where repeating unknown words are given identical tokens. The annotation of the unknown words in the target language is slightly more elaborate: (a) each unknown target word that is aligned to an unknown source word is assigned the same unknown token (hence, the "copy" model) and (b) an unknown target word that has no alignment or that is aligned with a known word uses the special null token unk_n. See Figure 2 for an example. This annotation enables us to translate every non-null token.
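The copyable annotation rules can be sketched as follows. The alignment format (a mapping from target index to source index), the token spellings `unk_1`, `unk_n`, and the toy vocabularies are illustrative assumptions, not the authors' data format.

```python
def annotate_copyable(src, tgt, align, src_vocab, tgt_vocab):
    """Annotate a sentence pair under the copyable scheme.

    align maps a target index to its aligned source index
    (hypothetical format; unaligned target words are simply absent).
    """
    unk_ids = {}  # unknown source word -> its numbered unk token
    src_out = []
    for w in src:
        if w in src_vocab:
            src_out.append(w)
        else:
            # Repeated unknown words reuse the same token.
            if w not in unk_ids:
                unk_ids[w] = f"unk_{len(unk_ids) + 1}"
            src_out.append(unk_ids[w])

    tgt_out = []
    for j, w in enumerate(tgt):
        if w in tgt_vocab:
            tgt_out.append(w)
            continue
        i = align.get(j)
        if i is not None and src[i] in unk_ids:
            tgt_out.append(unk_ids[src[i]])  # copy the source token
        else:
            tgt_out.append("unk_n")          # unaligned or aligned to a known word
    return src_out, tgt_out

src = ["The", "ecotax", "portico", "in", "Pont-de-Buis"]
tgt = ["Le", "portique", "écotaxe", "de", "Pont-de-Buis"]
align = {0: 0, 1: 2, 2: 1, 3: 3, 4: 4}
src_ann, tgt_ann = annotate_copyable(
    src, tgt, align, {"The", "portico", "in"}, {"Le", "de"})
```

In this toy pair, portique is unknown on the target side but aligned to the known source word portico, so it receives the null token and cannot be recovered in post-processing.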

Positional All Model (PosAll)
The copyable model is limited by its inability to translate unknown target words that are aligned to known words in the source sentence, such as the pair of words portique-portico in our running example. This happens because source vocabularies tend to be much larger than target vocabularies, whose size is constrained by the cost of the softmax (although much faster alternatives to the softmax exist which could potentially alleviate this problem). This limitation motivated us to develop a model that predicts the complete alignments between the source and the target sentence, which is straightforward since the complete alignments are available during training time.
Specifically, we return to using only a single universal <unk> token. However, on the target side, we insert a positional token pos_d after every word. Here, d indicates a relative position (d = −7, ..., −1, 0, 1, ..., 7) to denote that a target word at position j is aligned to a source word at position i = j − d. Aligned words that are too far apart are not annotated. In addition, we have a null token pos_n to mark unaligned words. Our annotation is illustrated in Figure 3.
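A minimal sketch of the PosAll target-side annotation follows. The alignment format and token spellings are illustrative, and emitting the null token for pairs further apart than 7 positions is an assumption, since the text only says such pairs are not annotated.

```python
MAX_DIST = 7  # relative positions are limited to -7..7

def annotate_posall(tgt, align, tgt_vocab):
    """Insert a positional token after every target word.

    align maps a target index j to its aligned source index i
    (hypothetical format). The emitted offset is d = j - i, so that
    i = j - d can be recovered later.
    """
    out = []
    for j, w in enumerate(tgt):
        out.append(w if w in tgt_vocab else "<unk>")
        i = align.get(j)
        if i is None or abs(j - i) > MAX_DIST:
            out.append("pos_n")  # assumption: far-apart pairs treated as unaligned
        else:
            out.append(f"pos_{j - i}")
    return out

tgt = ["Le", "portique", "écotaxe", "de", "Pont-de-Buis"]
align = {0: 0, 1: 2, 2: 1, 3: 3, 4: 4}
annotated = annotate_posall(tgt, align, {"Le", "de"})
```

The annotated sequence is exactly twice as long as the original target sentence, which is the cost that motivates the next model.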

Positional Unknown Model (PosUnk)
A major weakness of the PosAll model is that it doubles the length of the target sentence, which makes learning more difficult and nearly 2 times slower per parameter update. However, our post-processing step is concerned only with the alignments of the unknown words, so it is more sensible to annotate only the alignments of the unknown words. This motivates our positional unknown model which uses the unkpos_d tokens (for d in −7, ..., 7 or n) to simultaneously denote (a) the fact that a word is unknown and (b) its relative position d with respect to its aligned source word, similarly to the positional all model (where d is set to the null symbol n whenever the word does not have an alignment). We use the universal <unk> for all other unknown tokens in the source language.
It is possible that despite its slower speed, the PosAll model will learn better alignments as it is trained on many more examples of words and their alignments. We answer this question in the experimental section.
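The PosUnk annotation and the matching post-processing step can be sketched together. Token spellings, the alignment format, and the choice to leave null-pointer tokens as `<unk>` are assumptions made for illustration.

```python
MAX_DIST = 7

def annotate_posunk(tgt, align, tgt_vocab):
    """Replace each unknown target word by unkpos_d, with d = j - i."""
    out = []
    for j, w in enumerate(tgt):
        if w in tgt_vocab:
            out.append(w)
            continue
        i = align.get(j)  # align: target index -> source index (hypothetical)
        if i is None or abs(j - i) > MAX_DIST:
            out.append("unkpos_n")
        else:
            out.append(f"unkpos_{j - i}")
    return out

def replace_unknowns(output, src, dictionary):
    """Post-process model output: follow each pointer to source position
    i = j - d, then apply the dictionary or the identity translation."""
    result = []
    for j, tok in enumerate(output):
        if not tok.startswith("unkpos_"):
            result.append(tok)
            continue
        d = tok[len("unkpos_"):]
        i = None if d == "n" else j - int(d)
        if i is None or not 0 <= i < len(src):
            result.append("<unk>")  # assumption: no usable pointer, leave as is
        else:
            result.append(dictionary.get(src[i], src[i]))
    return result

src = ["The", "ecotax", "portico", "in", "Pont-de-Buis"]
tgt = ["Le", "portique", "écotaxe", "de", "Pont-de-Buis"]
align = {0: 0, 1: 2, 2: 1, 3: 3, 4: 4}
annotated = annotate_posunk(tgt, align, {"Le", "de"})
restored = replace_unknowns(
    annotated, src, {"portico": "portique", "ecotax": "écotaxe"})
```

Unlike the copyable scheme, the pointer can land on a known source word such as portico, so portique is recovered here.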

Experiments
We evaluate the effectiveness of our OOV models on the WMT'14 English-to-French translation task. Translation quality is measured with the BLEU metric [17] on the newstest2014 test set (which has 3003 sentences).

Training Data
To be comparable with the results reported by previous work on neural machine translation systems [22,5,2], we train our models on the same training data of 12M parallel sentences (348M French and 304M English words). The 12M subset was selected from the full WMT'14 parallel corpora using the method proposed in [1]. Due to the computationally intensive nature of the naive softmax in the target language, we limit the French vocabulary to the 40K most frequent French words (note that [22] used a vocabulary of 80K French words). On the source side, however, we can afford a much larger vocabulary, so we use the 200K most frequent English words. The model treats all other words as unknowns. When the French (target) vocabulary has 40K words, there are on average 1.33 unknown words per sentence on the target side of the test set.
We annotate our training data using the three schemes described in the previous section. The alignment is computed with the Berkeley aligner [16] using its default settings. We discard sentence pairs in which either the source or the target sentence exceeds 100 tokens.
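The vocabulary truncation and length filtering described in this section can be sketched as below. The helper names and the toy corpus are hypothetical; the thresholds mirror the ones stated in the text (most frequent words kept, pairs longer than 100 tokens discarded).

```python
from collections import Counter

def build_vocab(sentences, size):
    """Keep the `size` most frequent words of a tokenized corpus."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, _ in counts.most_common(size)}

def map_unknowns(sentence, vocab):
    """Replace every out-of-vocabulary word by <unk>."""
    return [w if w in vocab else "<unk>" for w in sentence]

def keep_pair(src, tgt, max_len=100):
    """Discard pairs where either side exceeds max_len tokens."""
    return len(src) <= max_len and len(tgt) <= max_len

# Toy corpus; the paper uses 200K source and 40K target words.
corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(corpus, 2)
```

Here `map_unknowns(["a", "c"], vocab)` maps the rare word c to `<unk>`, which is the plain baseline treatment that the annotation schemes refine.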

Training Details
Our training procedure and hyperparameter choices are similar to those used by Sutskever et al. [22].
In more detail, we train multi-layer deep LSTMs, each of which has 1000 cells, with 1000 dimensional embeddings. Like Sutskever et al. [22], we reverse the words in the source sentences, which has been shown to improve LSTM memory utilization and results in better translations of long sentences. Our hyperparameters can be summarized as follows: (a) the parameters are initialized uniformly in [-0.08, 0.08], (b) SGD has a fixed learning rate of 0.7, (c) we train for 8 epochs (after 5 epochs, we begin to halve the learning rate every 0.5 epoch), (d) the size of the mini-batch is 128, and (e) we rescale the normalized gradient to ensure that its norm does not exceed 5 [18].
We also follow the GPU parallelization scheme proposed in [22], allowing us to reach a training speed of 9.0K words per second ([22] achieved 6.3K words per second with a larger vocabulary of 80K; our target vocabulary has 40K words). Training takes about 7-10 days on an 8-GPU machine.
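The learning-rate schedule and gradient rescaling listed among the hyperparameters can be sketched in plain Python. Whether the first halving happens at epoch 5.0 or 5.5 is not stated; this sketch assumes 5.5. The function names are hypothetical, and the gradient is represented as a flat list for simplicity.

```python
import math

def learning_rate(epoch, base_lr=0.7, start=5.0, interval=0.5):
    """Fixed rate for the first 5 epochs, then halved every 0.5 epoch.

    Assumption: the first halving takes effect at epoch 5.5.
    """
    if epoch < start:
        return base_lr
    halvings = int((epoch - start) // interval)
    return base_lr * 0.5 ** halvings

def clip_gradient(grad, max_norm=5.0):
    """Rescale the (batch-normalized) gradient if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_gradient([6.0, 8.0])  # norm 10 is rescaled to norm 5
```

Rescaling preserves the gradient direction and only shrinks its magnitude, which is the usual remedy for exploding gradients in recurrent networks.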

A note on BLEU scores
The website http://matrix.statmt.org/matrix states that the state-of-the-art (SOTA) system [7] achieves a BLEU score of 35.8 on the English to French language pair on WMT'14. This numerical score is based on detokenized translations. However, all other systems that we compared against have been evaluated on the tokenized translations using the multi-bleu.pl script, which is consistent with previous work [5,2,19,22]. Thus, to make it possible to compare our system against the system of Durrani et al. [7], we evaluated its tokenized predictions (which can be downloaded from statmt.org [7]) on the test set (newstest2014) and arrived at the BLEU score of 37.0 points [22].

Main Results
We compare our system to other systems trained on the same training data of 12M sentence pairs, which include several recent end-to-end neural systems, as well as phrase-based baselines with neural components. We also compare to the performance of the state-of-the-art MT system [7] from the WMT'14 competition, which is trained on 36M sentence pairs.
The results shown in Table 1 demonstrate that our unknown word translation technique (in particular, the PosUnk model) significantly improves the translation quality for both the individual (non-ensemble) LSTM models (+2.3 BLEU) and the ensemble model (+2.8 BLEU). For the non-ensemble models, we report the performance of models with 4 and 6 layers (see Section 5.2 for more analysis). For the ensemble setting, we use a combination of 5 depth-4 models and 3 depth-6 models. Our best result (36.9 BLEU) outperforms all other NMT systems by a large margin, and in particular, it outperforms the current best NMT system [22] by 2.1 BLEU points (we even outperform Sutskever et al. [22] when they rerank the n-best list of a phrase-based baseline [22]). We compare the other rare word translation schemes in the next section.
It is notable that the more accurate NMT systems obtain greater improvements from our post-processing step. This is the case because the usefulness of the PosUnk model depends directly on the NMT's ability to correctly locate, for a given OOV target word, the word in the source sentence that is responsible for it. An ensemble of large models identifies these source words with greater accuracy, so it is not surprising that the PosUnk model provides the greatest improvement in performance for the best models.

Table 1: Translation results on newstest2014 - BLEU scores of the following systems: (a) the state-of-the-art system (trained on the full WMT'14 corpus of 36M sentence pairs) and (b) other neural-based systems (trained on the same subset of WMT'14 data with 12M sentence pairs). We highlight the performance of our best system in bolded text and state the improvements obtained by our technique of handling rare words (namely, with the PosUnk model). Notice that the more accurate systems achieve a greater improvement from the post-processing step. This is the case because the larger, more accurate models are also more accurate in their output of the alignment information of the unknown word, which makes the post-processing more useful.

Analysis
We analyze and quantify the improvement obtained by our rare word translation approach and provide a detailed comparison of the different rare word techniques proposed in Section 3. We examine the effect of depth on the LSTM architectures and demonstrate a strong correlation between perplexities and BLEU scores. We also highlight a few translation examples where our models succeed in correctly translating OOV words, as well as several failures.

Rare Word Analysis
To analyze the effect of rare words on translation quality, we follow Sutskever et al. [22] and sort the sentences in newstest2014 by the average frequency rank of their words. We split the test sentences into groups where the sentences within each group have a comparable number of rare words and evaluate each group independently. We evaluate our systems before and after translating the OOV words and compare with the standard MT systems (we use the state-of-the-art (SOTA) system from WMT'14 [7]) and neural MT systems (we use the ensemble system described in [22]; see Section 4).
Rare word translation is challenging for neural machine translation systems as shown in Figure 5.
The translation quality of our model before applying the unknown word translations is shown by the green star line, and the current best NMT system [22] is the purple diamond line. While [22] produces excellent translations of sentences with frequent words (the left part of the graph), it is worse than the SOTA system (red triangle line) on sentences with many rare words (the right side of the graph). When applying our unknown word translation technique (blue square line), we significantly improve the translation quality of our NMT: for the last group of 500 sentences, which have the greatest proportion of OOV words in the test set, we increase the BLEU score of our system by 6.5 BLEU points. Overall, our rare word translation model interpolates between the SOTA system and the system of Sutskever et al. [22], which allows us to outperform SOTA on sentences that consist predominantly of frequent words and approach its performance on sentences with many OOV words.
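The grouping procedure used for this analysis (sorting test sentences by the average frequency rank of their words, then splitting them into groups) can be sketched as follows. The fixed group size, the rank assigned to unseen words, and the toy rank table are assumptions; sentences are assumed to be non-empty.

```python
def average_rank(sentence, rank, unseen_rank):
    """Mean frequency rank of the words in a non-empty sentence.

    rank maps a word to its frequency rank (1 = most frequent);
    words missing from the table get unseen_rank (an assumption).
    """
    return sum(rank.get(w, unseen_rank) for w in sentence) / len(sentence)

def group_by_rarity(sentences, rank, group_size, unseen_rank):
    """Sort by average rank and cut into consecutive groups."""
    ordered = sorted(
        sentences, key=lambda s: average_rank(s, rank, unseen_rank))
    return [ordered[k:k + group_size]
            for k in range(0, len(ordered), group_size)]

rank = {"the": 1, "cat": 50, "ecotax": 9000}
sentences = [["the", "ecotax"], ["the", "the"], ["the", "cat"]]
groups = group_by_rarity(sentences, rank, group_size=2, unseen_rank=10000)
```

Each resulting group would then be scored with BLEU independently, so that the rightmost groups isolate the sentences dominated by rare words.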

Other Effects
In this section, all models are trained on the unreversed sentences, and we use the following hyperparameters: we initialize the parameters uniformly in [-0.1, 0.1], the learning rate is 1, the maximal gradient norm is 1, with a source vocabulary of 90K words, and a target vocabulary of 40K (see Section 4.2 for more details). While these LSTMs do not achieve the best possible performance, it is still useful to analyze them.
Rare Word Models - We examine the effect of the different rare word models presented in Section 3, namely: (a) Copyable, which aligns the unknown words on both the input and the target side by learning to copy indices, (b) Positional All (PosAll), which predicts the aligned source positions for every target word, and (c) Positional Unknown (PosUnk), which predicts the aligned source positions for only the unknown target words. It is also interesting to measure the improvement obtained when no alignment information is used during training. As such, we include a baseline model with no alignment knowledge (NoAlign) in which we simply assume that the i-th unknown word on the target sentence is aligned to the i-th unknown word in the source sentence.
From the results in Figure 6, a simple monotone alignment assumption for the NoAlign model yields a modest gain of 0.8 BLEU points. If we train the model to predict the alignment, then the Copyable model offers a slightly better gain of 1.0 BLEU. Note, however, that English and French have similar word order structure, so it would be interesting to experiment with other language pairs, such as English and Chinese, in which the word order is not as monotonic. These harder language pairs potentially imply a smaller gain for the NoAlign model and a larger gain for the Copyable model. We leave this for future work.
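The monotone assumption behind the NoAlign baseline can be sketched as a post-processing routine that pairs the i-th `<unk>` in the output with the i-th unknown source word. The function name and the toy data are hypothetical, and leaving surplus `<unk>` tokens untouched is an assumption.

```python
def noalign_replace(output, src, src_vocab, dictionary):
    """Pair the i-th <unk> in the output with the i-th unknown source word."""
    unknown_src = [w for w in src if w not in src_vocab]
    result, k = [], 0
    for tok in output:
        if tok != "<unk>":
            result.append(tok)
            continue
        if k < len(unknown_src):
            w = unknown_src[k]
            # Dictionary lookup with identity fallback.
            result.append(dictionary.get(w, w))
        else:
            result.append(tok)  # assumption: no source word left to pair with
        k += 1
    return result

src = ["The", "ecotax", "portico", "in", "Pont-de-Buis"]
output = ["Le", "portique", "<unk>", "de", "<unk>"]
fixed = noalign_replace(
    output, src, {"The", "portico", "in"}, {"ecotax": "écotaxe"})
```

This works in the toy example because the unknown words keep their relative order across the two languages; a reordering between them would produce swapped translations, which is why a smaller gain is expected for less monotonic language pairs.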
The positional models (PosAll and PosUnk) improve translation performance by more than 2 BLEU points. This proves that the limitation of the copyable model, which forces it to align each unknown output word with an unknown input word, is considerable. In contrast, the positional models can align the unknown target words with any source word, and as a result, post-processing has a much stronger effect. The PosUnk model achieves better translation results than the PosAll model, which suggests that it is easier to train the LSTM on shorter sequences.
Deep LSTM architecture - We compare a number of PosUnk models trained with different numbers of layers (3, 4, and 6). We observe that the gain obtained by the PosUnk model increases in tandem with the overall accuracy of the model, which is consistent with the idea that larger models can point to the appropriate source word more accurately. Additionally, we observe that on average, each extra LSTM layer provides roughly 1.0 BLEU point improvement as demonstrated in Figure 7.
Perplexity and BLEU - Lastly, we find it interesting to observe a strong correlation between perplexity (the objective we are optimizing for) and the translation quality as measured by the BLEU score. Figure 8 shows the performance of a 4-layer LSTM, in which we compute both perplexity and BLEU scores at different points during training. We find that on average, a reduction of 0.5 perplexity gives us roughly 1.0 BLEU point improvement.
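For readers less familiar with the metric, perplexity is the exponential of the average negative log probability per token. The sketch below uses that standard definition; the function name and sample values are hypothetical, and the exact tokens over which the paper averages are not specified here.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every reference token
# is as uncertain as a uniform choice among 4 options.
ppl = perplexity([math.log(0.25)] * 4)
```

Because perplexity is a monotone function of the training objective, tracking it during training is cheap compared with decoding the test set to compute BLEU.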

Sample Translations
We present three sample translations of our best system (with 36.9 BLEU) in Table 2. In our first example, the model translates all the unknown words correctly: 2600, orthopédiques, and cataracte. It is interesting to observe that the model can accurately predict an alignment of distances of 5 and 6 words. The second example highlights the fact that our model can translate long sentences reasonably well and that it was able to correctly translate the unknown word for JPMorgan at the very far end of the source sentence. Lastly, our examples also reveal several penalties incurred by our model: (a) incorrect entries in the word dictionary, as with négociateur vs. trader in the second example, and (b) incorrect alignment prediction, as in the third sentence, where an unkpos_3 token is aligned with the source word was rather than with abandoning, which resulted in an incorrect translation.

Conclusion
We have shown that a simple alignment-based technique can mitigate and even overcome one of the main weaknesses of current NMT systems, which is their inability to translate words that are not in their vocabulary. A key advantage of our technique is the fact that it is applicable to any NMT system and not only to the deep LSTM model of Sutskever et al. [22]. A technique like ours is likely necessary if an NMT system is to achieve state-of-the-art performance on machine translation.
We have demonstrated empirically that on the WMT'14 English-French translation task, our technique yields a consistent and substantial improvement of 2-3 BLEU points over various NMT systems of different architectures. Our system outperforms the current best end-to-end neural machine translation system by the large margin of 2.1 BLEU points. Most importantly, our models match the performance of the state-of-the-art system, which uses three times more data than our model.

src: But concerns have grown after Mr Mazanga was quoted as saying Renamo was abandoning the 1992 peace accord .
trans: Mais les inquiétudes se sont accrues après que M. unkpos_3 a déclaré que la unkpos_3 unkpos_3 l' accord de paix de 1992 .
+unk: Mais les inquiétudes se sont accrues après que M. Mazanga a déclaré que la Renamo était l' accord de paix de 1992 .
tgt: Mais l' inquiétude a grandi après que M. Mazanga a déclaré que la Renamo abandonnait l' accord de paix de 1992 .
Table 2: Sample translations - the table shows the source (src) and the translations of our best model before (trans) and after (+unk) unknown word translations. We also show the human translations (tgt) and italicize words that are involved in the unknown word translation process.

Figure 3: Positional All Model - the annotation of the PosAll model, where each word is followed by the relative positional token pos_d or the null token pos_n.

Figure 4: Positional Unknown Model - the annotations under the PosUnk model, where we annotate only the aligned unknown words with the unkpos_d tokens.

Figure 5: Rare word translation - the figure shows the translation performances of several systems. On the x-axis, we order newstest2014 sentences by their average frequency rank and divide the sentences into groups, where each group consists of sentences with a comparable prevalence of rare words. We compute the BLEU score of each group independently.

Figure 6: Rare word model comparison - the plot shows the translation performance of four 6-layer LSTMs: a model that uses no alignment information (NoAlign) and the other three rare word models (Copyable, PosAll, PosUnk). For each model, we show results before (left) and after (right) the rare word translation. We also highlight the perplexity of each model (for PosAll, we report the perplexities of predicting the words and the positions separately).

Figure 7: Effect of depth - BLEU scores achieved by LSTM models of various depths (3, 4, and 6 layers). These experiments use the PosUnk model. Similarly to Figure 6, we show the performance before and after we translate the OOV words, as well as the perplexities. Notice that the PosUnk model is more useful on more accurate models.

Figure 8: Perplexity and BLEU correlation -BLEU scores and perplexities obtained by evaluating an LSTM model with 4 layers at various stages of training.