Modeling Coverage for Neural Machine Translation

The attention mechanism has enhanced state-of-the-art Neural Machine Translation (NMT) by jointly learning to align and translate. It tends to ignore past alignment information, however, which often leads to over-translation and under-translation. To address this problem, we propose coverage-based NMT in this paper. We maintain a coverage vector to keep track of the attention history. The coverage vector is fed to the attention model to help adjust future attention, which lets the NMT system pay more attention to untranslated source words. Experiments show that the proposed approach significantly improves both translation quality and alignment quality over standard attention-based NMT.


Introduction
The past several years have witnessed the rapid progress of end-to-end Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015). Unlike conventional Statistical Machine Translation (SMT) (Koehn et al., 2003; Chiang, 2007), NMT uses a single, large neural network to model the entire translation process. It enjoys the following advantages. First, the use of distributed representations of words can alleviate the curse of dimensionality (Bengio et al., 2003). Second, there is no need to explicitly design features to capture translation regularities, which is quite difficult in SMT. Instead, NMT is capable of learning representations directly from the training data. Third, Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) enables NMT to capture long-distance reordering, which is a significant challenge in SMT.
NMT has a serious problem, however, namely lack of coverage. In phrase-based SMT (Koehn et al., 2003), a decoder maintains a coverage vector to indicate whether a source word is translated or not. This is important for ensuring that each source word is translated in decoding. The decoding process is completed when all source words are "covered" or translated. In NMT, there is no such coverage vector and the decoding process ends only when the end-of-sentence mark is produced. We believe that lacking coverage might result in the following problems in conventional NMT:
1. Over-translation: some words are unnecessarily translated multiple times;
2. Under-translation: some words are mistakenly left untranslated.
Specifically, in the state-of-the-art attention-based NMT model (Bahdanau et al., 2015), generating a target word heavily depends on the relevant parts of the source sentence, and a source word is involved in the generation of all target words. As a result, over-translation and under-translation inevitably happen because the "coverage" of source words (i.e., the number of times a source word is translated to a target word) is ignored. Figure 1(a) shows an example: the Chinese word "guānbì" is over-translated to "close(d)" twice, while "bèipò" (meaning "be forced to") is mistakenly left untranslated.
In this work, we propose a coverage mechanism to NMT (NMT-COVERAGE) to alleviate the over-translation and under-translation problems. Basically, we append a coverage vector to the intermediate representations of an NMT model, which are sequentially updated after each attentive read during the decoding process, to keep track of the attention history. The coverage vector, when entering into attention model, can help adjust the future attention and significantly improve the overall alignment between the source and target sentences. This design contains many particular cases for coverage modeling with contrasting characteristics, which all share a clear linguistic intuition and yet can be trained in a data driven fashion. Notably, we achieve significant improvement even by simply using the sum of previous alignment probabilities as coverage for each word, as a successful example of incorporating linguistic knowledge into neural network based NLP models.

Figure 1: Example translations of (a) NMT without coverage, and (b) NMT with coverage. Panel (a): over-translation and under-translation generated by NMT. Panel (b): coverage model alleviates the problems of over-translation and under-translation. In conventional NMT without coverage, the Chinese word "guānbì" is over translated to "close(d)" twice, while "bèipò" (means "be forced to") is mistakenly untranslated. Coverage model alleviates these problems by tracking the "coverage" of source words.
Experiments show that NMT-COVERAGE significantly outperforms conventional attention-based NMT on both translation and alignment tasks. Figure 1(b) shows an example, in which NMT-COVERAGE alleviates the over-translation and under-translation problems that NMT without coverage suffers from.

Background
Our work is built on attention-based NMT (Bahdanau et al., 2015), which simultaneously conducts dynamic alignment and generation of the target sentence, as illustrated in Figure 2. It produces the translation by generating one target word $y_i$ at each time step. Given an input sentence $x = \{x_1, \ldots, x_J\}$ and previously generated words $\{y_1, \ldots, y_{i-1}\}$, the probability of generating the next word $y_i$ is

$$P(y_i \mid y_{<i}, x) = \mathrm{softmax}\big(g(y_{i-1}, t_i, s_i)\big) \tag{1}$$

where $g$ is a non-linear function, and $t_i$ is a decoding state for time step $i$, computed by

$$t_i = f(t_{i-1}, y_{i-1}, s_i) \tag{2}$$

Here the activation function $f(\cdot)$ is a Gated Recurrent Unit (GRU) (Cho et al., 2014b), and $s_i$ is a distinct source representation for time $i$, calculated as a weighted sum of the source annotations:

$$s_i = \sum_{j=1}^{J} \alpha_{i,j} \cdot h_j \tag{3}$$

where $h_j = [\overrightarrow{h}_j^{\top}; \overleftarrow{h}_j^{\top}]^{\top}$ is the annotation of $x_j$ from a bi-directional Recurrent Neural Network (RNN) (Schuster and Paliwal, 1997), and its weight $\alpha_{i,j}$ is computed by

$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{J} \exp(e_{i,k})} \tag{4}$$

and

$$e_{i,j} = a(t_{i-1}, h_j) = v_a^{\top} \tanh(W_a t_{i-1} + U_a h_j) \tag{5}$$

is an attention model that scores how well $y_i$ and $h_j$ match. The attention model avoids the need to represent the entire source sentence with a single vector. Instead, the decoder selects parts of the source sentence to pay attention to, and thus exploits an expected annotation $s_i$ over possible alignments $\alpha_{i,j}$ for each time step $i$.

However, the attention model fails to take advantage of past alignment information, which is found useful to avoid over-translation and under-translation problems in conventional SMT (Koehn et al., 2003). For example, if a source word is translated in the past, it is less likely to be translated again and should be assigned a lower alignment probability.

Figure 2: Architecture of attention-based NMT. Whenever possible, we omit the source index j to make the illustration less cluttered.
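As a rough illustration of the attentive read described above (normalizing the scores into alignment weights and taking the weighted sum of annotations), the following pure-Python sketch may help. The function names and the toy numbers are assumptions made for illustration; the scores are given directly rather than produced by a trained attention model.

```python
import math


def softmax(scores):
    """Normalize raw alignment scores e_{i,j} into weights alpha_{i,j}."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_read(scores, annotations):
    """Return (alpha, s_i): alignment weights and the weighted sum of annotations."""
    alpha = softmax(scores)
    dim = len(annotations[0])
    context = [sum(a * h[k] for a, h in zip(alpha, annotations)) for k in range(dim)]
    return alpha, context


# Hypothetical toy input: three source positions, 2-dimensional annotations h_j.
scores = [2.0, 1.0, 0.1]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, context = attentive_read(scores, annotations)
print(alpha, context)
```

Nothing in this computation depends on the weights assigned at earlier time steps, which is the gap the coverage model is meant to fill.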

Coverage Model for NMT
In SMT, a coverage set is maintained to keep track of which source words have been translated ("covered") in the past. Let us take $x = \{x_1, x_2, x_3, x_4\}$ as an example of input sentence. The initial coverage set is $C = \{0, 0, 0, 0\}$, which denotes that no source word is yet translated. When a translation rule $bp = (x_2 x_3, y_m y_{m+1})$ is applied, we produce one hypothesis labelled with coverage $C = \{0, 1, 1, 0\}$. It means that the second and third source words are translated. The goal is to generate a translation with full coverage $C = \{1, 1, 1, 1\}$. A source word is translated when it is covered by one translation rule, and it is not allowed to be translated again in the future (i.e., hard coverage). In this way, each source word is guaranteed to be translated, and translated only once. As shown, coverage is essential for SMT since it avoids gaps and overlaps in translation of source words.
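The hard coverage bookkeeping in this example can be sketched in a few lines of Python. The helper names `apply_rule` and `is_complete` are hypothetical and only mirror the four-word example above; they are not taken from any SMT decoder.

```python
def apply_rule(coverage, start, end):
    """Mark source positions start..end (0-based, inclusive) as translated.

    Raises ValueError if any position is already covered, since hard
    coverage forbids translating a source word twice.
    """
    if any(coverage[start:end + 1]):
        raise ValueError("source word already translated")
    return coverage[:start] + [1] * (end - start + 1) + coverage[end + 1:]


def is_complete(coverage):
    """Decoding may stop only when every source word is covered."""
    return all(c == 1 for c in coverage)


coverage = [0, 0, 0, 0]                 # no source word translated yet
coverage = apply_rule(coverage, 1, 2)   # a rule covering x_2 x_3
print(coverage, is_complete(coverage))
```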
Modeling coverage is also important for attention-based NMT models, since they generally lack a mechanism to indicate whether a certain source word has been translated, and therefore are prone to "coverage" mistakes: some parts of the source sentence are translated more than once, and others are not translated at all. For NMT models, directly modeling coverage is less straightforward, but the problem can be significantly alleviated by keeping track of the attention signal during the decoding process. The most natural way of doing that would be to append a coverage vector to the annotation of each source word (i.e., $h_j$), which is initialized as a zero vector but updated after every attentive read of the corresponding annotation. The coverage vector is fed to the attention model to help adjust future attention, which lets the NMT system pay more attention to untranslated source words, as illustrated in Figure 3.

Coverage Model
Since the coverage vector summarizes the attention record for $h_j$ (and therefore for a small neighborhood centered on the $j$-th source word), it will discourage further attention to it if it has been heavily attended, and implicitly push the attention to the less attended segments of the source sentence, since the attention weights are normalized to one. This can potentially solve both coverage mistakes mentioned above.
Formally, the coverage model is given by

$$C_{i,j} = g_{update}\big(C_{i-1,j}, \alpha_{i,j}, \Phi(h_j), \Psi\big) \tag{6}$$

where
• $g_{update}(\cdot)$ is the function that updates $C_{i,j}$ after the new attention $\alpha_{i,j}$ at time step $i$ in the decoding process;
• $C_{i,j}$ is a $d$-dimensional coverage vector summarizing the history of attention till time step $i$ on $h_j$;
• $\Phi(h_j)$ is a word-specific feature with its own parameters;
• $\Psi$ are auxiliary inputs exploited in different sorts of coverage models.
Equation 6 gives a rather general model, which could take different functional forms for $g_{update}(\cdot)$ and $\Phi(\cdot)$, and different auxiliary inputs $\Psi$ (e.g., previous decoding state $t_{i-1}$). In the rest of this section, we will give a number of representative implementations of the coverage model, which either leverage more linguistic information (Section 3.1.1) or resort to the flexibility of neural network approximation (Section 3.1.2).

Linguistic Coverage Model
We first consider a linguistically inspired model, which has a small number of parameters as well as a clear interpretation. While the linguistically inspired coverage in NMT is similar to that in SMT, there is one key difference: it indicates what percentage of source words have been translated (i.e., soft coverage). In NMT, each target word $y_i$ is generated from all source words with probability $\alpha_{i,j}$ for source word $x_j$. In other words, the source word $x_j$ is involved in generating all target words and the probability of generating target word $y_i$ at time step $i$ is $\alpha_{i,j}$. Note that unlike in SMT, in which each source word is fully translated at one decoding step, the source word $x_j$ is partially translated at each decoding step in NMT. Therefore, the coverage at time step $i$ denotes the ratio to which each source word has been translated.
We use a scalar ($d = 1$) to represent linguistic coverage for each source word and employ an accumulate operation for $g_{update}$. The initial value of linguistic coverage is zero, which denotes that the corresponding source word is not translated yet. We iteratively construct linguistic coverages through accumulation of alignment probabilities generated by the attention model, each of which is normalized by a distinct context-dependent weight. The coverage of source word $x_j$ at time step $i$ is computed by

$$C_{i,j} = C_{i-1,j} + \frac{1}{\Phi_j} \alpha_{i,j} = \frac{1}{\Phi_j} \sum_{k=1}^{i} \alpha_{k,j}$$

where $\Phi_j$ is a pre-defined weight which indicates the number of target words $x_j$ is expected to generate. The simplest way is to follow Xu et al. (2015) in image-to-caption translation and fix $\Phi = 1$ for all source words, which means that we directly use the sum of previous alignment probabilities without normalization as coverage for each word, as done in (Cohn et al., 2016). However, in machine translation, different types of source words may contribute differently to the generation of the target sentence. Let us take the sentence pairs in Figure 1 as an example. The noun in the source sentence "jīchǎng" is translated into one target word "airports", while the adjective "bèipò" is translated into three words "were forced to". Therefore, we need to assign a distinct $\Phi_j$ for each source word. Ideally, we expect $\Phi_j = \sum_{i=1}^{I} \alpha_{i,j}$ with $I$ being the total number of time steps in decoding. However, such a desired value is not available before decoding, and thus is not suitable in this scenario.
Fertility. To predict $\Phi_j$, we introduce the concept of fertility, which was first proposed in word-level SMT (Brown et al., 1993). The fertility of source word $x_j$ tells how many target words $x_j$ produces. In SMT, the fertility is a random variable $\Phi_j$, whose distribution $p(\Phi_j = \phi)$ is determined by the parameters of word alignment models (e.g., IBM models). In this work, we simplify and adapt fertility from the original model and compute the fertility $\Phi_j$ by

$$\Phi_j = \mathcal{N}(x_j \mid x) = N \cdot \sigma(U_f h_j)$$

where $N \in \mathbb{R}$ is a predefined constant denoting the maximum number of target words one source word can produce, $\sigma(\cdot)$ is a logistic sigmoid function, and $U_f \in \mathbb{R}^{1 \times 2n}$ is the weight matrix. Here we use $h_j$ to denote $(x_j \mid x)$, since $h_j$ contains information about the whole input sentence with a strong focus on the parts surrounding $x_j$ (Bahdanau et al., 2015). Since $\Phi_j$ does not depend on $i$, we can pre-compute it before decoding to minimize the computational cost.
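To make the accumulation concrete, here is a minimal pure-Python sketch of linguistic coverage with a sigmoid-bounded fertility. It assumes the fertility is N times a logistic sigmoid of a linear function of the annotation, as the description above suggests; the function names, the weight vector `u_f`, and all numbers are hypothetical.

```python
import math


def fertility(h_j, u_f, n_max):
    """Phi_j = N * sigmoid(U_f . h_j), a value in (0, N)."""
    z = sum(u * h for u, h in zip(u_f, h_j))
    return n_max / (1.0 + math.exp(-z))


def update_coverage(coverage, alpha_i, phi):
    """Accumulate one decoding step: C_{i,j} = C_{i-1,j} + alpha_{i,j} / Phi_j."""
    return [c + a / p for c, a, p in zip(coverage, alpha_i, phi)]


# Hypothetical toy values: two source words, annotations of size 2, N = 2.
annotations = [[0.0, 0.0], [1.0, 1.0]]
u_f = [0.5, 0.5]
phi = [fertility(h, u_f, n_max=2.0) for h in annotations]  # computed once, before decoding

coverage = [0.0, 0.0]                       # nothing translated yet
for alpha_i in [[0.7, 0.3], [0.2, 0.8]]:    # attention weights at steps 1 and 2
    coverage = update_coverage(coverage, alpha_i, phi)
print(phi, coverage)
```

With the same attention mass, a word with higher predicted fertility ends up with a lower coverage ratio, so it remains eligible for further attention.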

Neural Network Based Coverage Model
We next consider the Neural Network (NN) based coverage model. When $C_{i,j}$ is a vector ($d > 1$) and $g_{update}(\cdot)$ is a neural network, we actually have an RNN model for coverage, as illustrated in Figure 4. In this work, we take the following form:

$$C_{i,j} = f(C_{i-1,j}, \alpha_{i,j}, h_j, t_{i-1})$$

where $f(\cdot)$ is a nonlinear activation function and $t_{i-1}$ is the auxiliary input that encodes past translation information. Note that we leave out the word-specific feature function $\Phi(\cdot)$ and only take the input annotation $h_j$ as the input to the coverage RNN. It is important to emphasize that the NN-based coverage model can be fed with arbitrary inputs, such as the previous attentional context $s_{i-1}$. Here we only employ $C_{i-1,j}$ for past alignment information, $t_{i-1}$ for past translation information, and $h_j$ for word-specific bias (see footnote 3).

Gating. The neural function $f(\cdot)$ can be either a simple activation function tanh or a gating function that proves useful to capture long-distance dependencies. In this work, we adopt GRU for the gating activation since it is simple yet powerful (Chung et al., 2014). Please refer to (Cho et al., 2014b) for more details about GRU.

Footnote 3: In our preliminary experiments, considering more inputs (e.g., current and previous attentional contexts, unnormalized attention weights $e_{i,j}$) does not always lead to better translation quality. Possible reasons include: 1) the inputs contain duplicate information, and 2) more inputs introduce more back-propagation paths and therefore make it difficult to train. In our experience, one principle is to only feed the coverage model inputs that contain distinct information, which are complementary to each other.
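One step of the simpler tanh variant might look like the sketch below. The particular linear parameterization, the parameter names (`W_c`, `w_a`, `W_h`, `W_t`), and the numbers are assumptions made for illustration; the text only states which inputs the update takes, and the GRU variant used in the experiments is not shown.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def nn_coverage_update(c_prev, alpha, h_j, t_prev, params):
    """C_{i,j} = tanh(W_c C_{i-1,j} + w_a * alpha_{i,j} + W_h h_j + W_t t_{i-1})."""
    parts = zip(
        matvec(params["W_c"], c_prev),   # past coverage of this source word
        params["w_a"],                   # weights on the scalar attention alpha_{i,j}
        matvec(params["W_h"], h_j),      # word-specific annotation
        matvec(params["W_t"], t_prev),   # previous decoding state
    )
    return [math.tanh(c + alpha * a + h + t) for c, a, h, t in parts]


# Hypothetical parameters for a coverage vector of dimension d = 2.
params = {
    "W_c": [[0.5, 0.0], [0.0, 0.5]],
    "w_a": [1.0, -1.0],
    "W_h": [[0.1, 0.0], [0.0, 0.1]],
    "W_t": [[0.0, 0.0], [0.0, 0.0]],
}
c_1 = nn_coverage_update([0.0, 0.0], 0.5, [1.0, 2.0], [0.3, 0.3], params)
print(c_1)
```

Because the coverage vector starts at zero and is updated once per attentive read, the same function can be applied to every source position at every decoding step.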
Discussion. Intuitively, the two types of models summarize coverage information in "different languages". Linguistic models summarize coverage information in human language, which has a clear interpretation to humans. Neural models encode coverage information in "neural language", which can be "understood" by neural networks and lets them decide how to make use of the encoded coverage information.

Integrating Coverage into NMT
Although the attention-based model is capable of jointly performing alignment and translation, it does not take translation history into consideration. Specifically, a source word that has significantly contributed to the generation of target words in the past should be assigned lower alignment probabilities, which may not be the case in attention-based NMT. To address this problem, we propose to calculate the alignment probabilities by incorporating past alignment information embedded in the coverage model.
Intuitively, at each time step $i$ in the decoding phase, coverage from time step $(i-1)$ serves as an additional input to the attention model, which provides complementary information about how likely the source words are to have been translated in the past. We expect the coverage information to guide the attention model to focus more on untranslated source words (i.e., assign them higher alignment probabilities). In practice, we find that the coverage model does fulfill the expectation (see Section 5). The translated ratios of source words from linguistic coverages correlate negatively with the corresponding alignment probabilities.
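More formally, we rewrite the attention model in Equation 5 as

$$e_{i,j} = a(t_{i-1}, h_j, C_{i-1,j}) = v_a^{\top} \tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j})$$

where $C_{i-1,j}$ is the coverage of source word $x_j$ before time $i$. $V_a \in \mathbb{R}^{n \times d}$ is the weight matrix for coverage, with $n$ and $d$ being the numbers of hidden units and coverage units, respectively.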
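The effect of the extra coverage term can be seen in a small sketch, assuming the usual additive form of the attention score with one more linear term for coverage. All weights below are hand-picked toy values chosen so that coverage lowers the score; in the model itself these weights are learned, and nothing forces them to be negative.

```python
import math


def attention_score(t_prev, h_j, c_prev_j, W_a, U_a, V_a, v_a):
    """e_{i,j} = v_a . tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j})."""
    def row_dot(row, vec):
        return sum(w * x for w, x in zip(row, vec))

    hidden = [
        math.tanh(row_dot(W_a[r], t_prev) + row_dot(U_a[r], h_j) + row_dot(V_a[r], c_prev_j))
        for r in range(len(v_a))
    ]
    return sum(v * x for v, x in zip(v_a, hidden))


# Hypothetical setup: n = 2 hidden units, scalar coverage (d = 1).
identity = [[1.0, 0.0], [0.0, 1.0]]
V_a = [[-1.0], [-1.0]]            # hand-picked so that coverage reduces the score
v_a = [1.0, 1.0]
t_prev, h_j = [0.1, 0.2], [0.3, 0.1]

score_fresh = attention_score(t_prev, h_j, [0.0], identity, identity, V_a, v_a)
score_covered = attention_score(t_prev, h_j, [1.0], identity, identity, V_a, v_a)
print(score_fresh, score_covered)
```

Since the scores are normalized over all source positions afterwards, lowering the score of a covered word shifts attention mass toward the words that have received less attention.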
We take end-to-end learning for the NMT-COVERAGE model, which learns not only the parameters for the "original" NMT (i.e., $\theta$ for encoding RNN, decoding RNN, and attention model) but also the parameters for coverage modeling (i.e., $\eta$ for annotation and guidance of attention). More specifically, we choose to maximize the likelihood of reference sentences as most other NMT models (see, however ):

$$(\theta^*, \eta^*) = \arg\max_{\theta, \eta} \sum_{n=1}^{N} \log P(y_n \mid x_n; \theta, \eta) \tag{9}$$

No auxiliary objective. For the coverage model with a clearer linguistic interpretation (Section 3.1.1), it is possible to inject an auxiliary objective function on some intermediate representation.
More specifically, we may add to the likelihood a term that penalizes the discrepancy between the sum of alignment probabilities and the expected fertility for linguistic coverage. This is similar to the more explicit training for fertility as in Xu et al. (2015), which encourages the model to pay equal attention to every part of the image (i.e., $\Phi_j = 1$). However, our empirical study shows that the combined objective consistently worsens the translation quality while slightly improving the alignment quality.
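For illustration only, such a discrepancy could be measured as a squared difference per source word, as in the sketch below. The squared form and the function name are assumptions; the exact auxiliary objective is not recoverable from the text here, and this term is not used in the final training strategy.

```python
def fertility_penalty(alphas, phi):
    """Sum over source words j of (Phi_j - sum_i alpha_{i,j})^2.

    alphas[i][j] is the attention weight on source word j at decoding step i.
    """
    totals = [sum(column) for column in zip(*alphas)]  # total attention per source word
    return sum((p - t) ** 2 for p, t in zip(phi, totals))


# Hypothetical toy values: two decoding steps, two source words.
alphas = [[0.7, 0.3], [0.2, 0.8]]
penalty = fertility_penalty(alphas, phi=[1.0, 1.0])
print(penalty)
```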
Our training strategy imposes fewer constraints on the dependency between $\Phi_j$ and the attention than the more explicit strategy taken in (Xu et al., 2015). We let the objective associated with the translation quality (i.e., the likelihood) drive the training, as in Equation 9. This strategy is arguably advantageous, since the attention weight on a hidden state $h_j$ cannot be interpreted as the proportion of the corresponding word being translated in the target sentence. For one thing, the hidden state $h_j$, after the transformation from the encoding RNN, bears the contextual information from other parts of the source sentence, and thus loses the rigid correspondence with the corresponding word. Therefore, the assumption behind penalizing the discrepancy between the sum of alignment probabilities and the expected fertility does not hold in this scenario.

Setup
We carry out experiments on a Chinese-English translation task. Our training data for the translation task consists of 1.25M sentence pairs extracted from LDC corpora, with 27.9M Chinese words and 34.5M English words respectively. We choose the NIST 2002 dataset as our development set, and the NIST 2005, 2006 and 2008 datasets as our test sets. We carry out experiments of the alignment task on the evaluation dataset from (Liu and Sun, 2015), which contains 900 manually aligned Chinese-English sentence pairs. We use the case-insensitive 4-gram NIST BLEU score (Papineni et al., 2002) for the translation task, and the alignment error rate (AER) (Och and Ney, 2003) for the alignment task. To better estimate the quality of the soft alignment probabilities generated by NMT, we propose a variant of AER, named SAER:

$$\mathrm{SAER} = 1 - \frac{|M_A \times M_S| + |M_A \times M_P|}{|M_A| + |M_S|}$$

where $A$ is a candidate alignment, and $S$ and $P$ are the sets of sure and possible links in the reference alignment respectively ($S \subseteq P$). $M$ denotes an alignment matrix, and for both $M_S$ and $M_P$ we assign probability 1 to the elements that correspond to the existing links in $S$ and $P$, and probability 0 to the other elements. In this way, we are able to better evaluate the quality of the soft alignments produced by attention-based NMT. We use the sign-test (Collins et al., 2005) for statistical significance testing.

For efficient training of the neural networks, we limit the source and target vocabularies to the most frequent 30K words in Chinese and English, covering approximately 97.7% and 99.3% of the two corpora respectively. All the out-of-vocabulary words are mapped to a special token UNK. We set $N = 2$ for the fertility model in the linguistic coverages. We train each model with the sentences of length up to 80 words in the training data. The word embedding dimension is 620 and the size of a hidden layer is 1000. All the other settings are the same as in (Bahdanau et al., 2015).

Table 1: Evaluation of translation quality. d denotes the dimension of NN-based coverages, and † and ‡ indicate statistically significant difference (p < 0.01) from GroundHog and Moses, respectively. "+" is on top of the baseline system GroundHog.
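A small sketch of how a soft alignment error rate of this kind could be computed is given below. It assumes that SAER mirrors AER, with set intersections replaced by element-wise products of alignment matrices and set sizes replaced by sums over matrix entries; the function name and the toy matrices are hypothetical.

```python
def saer(m_a, m_s, m_p):
    """Soft AER: 1 - (|M_A x M_S| + |M_A x M_P|) / (|M_A| + |M_S|).

    Assumption: "x" is the element-wise product and |M| is the sum of entries.
    """
    def overlap(x, y):
        return sum(a * b for row_x, row_y in zip(x, y) for a, b in zip(row_x, row_y))

    def mass(x):
        return sum(sum(row) for row in x)

    return 1.0 - (overlap(m_a, m_s) + overlap(m_a, m_p)) / (mass(m_a) + mass(m_s))


# Hypothetical 2x2 example: rows are target words, columns are source words.
m_s = [[1, 0], [0, 1]]               # sure links
m_p = [[1, 1], [0, 1]]               # possible links (a superset of the sure links)
m_soft = [[0.8, 0.2], [0.4, 0.6]]    # soft alignment from an attention model

print(saer(m_soft, m_s, m_p))
print(saer([[1.0, 0.0], [0.0, 1.0]], m_s, m_p))
```

In this toy case, a hard alignment that matches the sure links exactly scores zero, while probability mass placed outside the reference links raises the score.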
We compare our method with two state-of-the-art models of SMT and NMT:
• Moses (Koehn et al., 2007): an open source phrase-based translation system with default configuration and a 4-gram language model trained on the target portion of training data.
• GroundHog (Bahdanau et al., 2015): an attention-based NMT system.

Translation Quality
Table 1 shows the translation performances measured in BLEU score. Clearly the proposed NMT-COVERAGE significantly improves the translation quality in all cases, although there are still considerable differences among different variants.
Parameters. The coverage model introduces few parameters. The baseline model (i.e., GroundHog) has 84.3M parameters. The linguistic coverage using fertility introduces 3K parameters (2K for the fertility model), and the NN-based coverage with gating introduces 10K×d parameters (6K×d for gating), where d is the dimension of the coverage vector. In this work, the most complex coverage model only introduces 0.1M additional parameters, which is quite small compared to the number of parameters in the existing model (i.e., 84.3M).
Speed. Introducing the coverage model slows down training, but not significantly. When running on a single Tesla K80 GPU, the speed of the baseline model is 960 target words per second. System 4 ("+Linguistic coverage with fertility") has a speed of 870 words per second, while System 7 ("+NN-based coverage (d=10)") achieves a speed of 800 words per second.

NN-based Coverages (Rows 5-7): (1) Gating (Rows 5 and 6): Both variants of NN-based coverages outperform GroundHog with averaged gains of 0.8 and 1.3 BLEU points, respectively. Introducing the gating activation function improves the performance of coverage models, which is consistent with the results in other tasks (Chung et al., 2014).
(2) Coverage dimensions (Rows 6 and 7): Increasing the dimension of coverage models further improves the translation performance by 0.6 BLEU points, at the cost of introducing more parameters (e.g., from 10K to 100K).

Alignment Quality
Table 2 lists the alignment performances. We find that coverage information improves the attention model as expected by maintaining an annotation summarizing attention history on each source word. More specifically, linguistic coverage with fertility significantly reduces alignment errors under both metrics, in which fertility plays an important role. NN-based coverages, however, do not significantly reduce alignment errors until the coverage dimension is increased from 1 to 10. This indicates that NN-based models need slightly more dimensions to encode the coverage information. Figure 5 shows an example. The coverage mechanism does meet the expectation: the alignments are more concentrated and, most importantly, translated source words are less likely to get involved in the generation of the following target words. For example, the first four Chinese words are assigned lower alignment probabilities (i.e., darker color) after the corresponding translation "romania reinforces old buildings" is produced.

Figure 5: Example alignments for (a) Groundhog and (b) + NN cov. w/ gating (d = 10). Using the coverage mechanism, translated source words are less likely to contribute to the generation of the following target words (e.g., top-right corner for the first four Chinese words).

Table 2: Evaluation of alignment quality. The lower the score, the better the alignment quality.

Effects on Long Sentences
Following Bahdanau et al. (2015), we group sentences of similar lengths together and compute the BLEU score and averaged length of translation for each group, as shown in Figure 6. Cho et al. (2014a) show that the performance of Groundhog drops rapidly when the length of the input sentence increases. Our results confirm these findings. One main reason is that Groundhog produces much shorter translations on longer sentences (e.g., > 40, see right panel in Figure 6), and thus faces a serious under-translation problem. NMT-COVERAGE alleviates this problem by incorporating coverage information into the attention model, which in general pushes the attention to untranslated parts of the source sentence and implicitly discourages early stopping of decoding. It is worth emphasizing that both NN-based coverages (with gating, d = 10) and linguistic coverages (with fertility) achieve similar performances on long sentences, reconfirming our claim that the two variants improve the attention model in their own ways.

Groundhog translates this sentence into:
jordan achieved an average score of eight weeks ahead with a surgical operation three weeks ago .
in which the sub-sentence ", qiúduì zài cǐ qījiān 4 shèng 8 fù" is under-translated. With the (NN-based) coverage mechanism, NMT-COVERAGE translates it into:
jordan 's average score points to UNK this year . he received surgery before three weeks , with a team in the period of 4 to 8 .
The quantitative and qualitative results show that the coverage models indeed help to alleviate under-translation, especially for long sentences consisting of several sub-sentences.

Related Work
Our work is inspired by recent works on improving attention-based NMT with techniques that have been successfully applied to SMT. Following the success of Minimum Risk Training (MRT) in SMT, MRT has been proposed for end-to-end NMT to optimize model parameters directly with respect to evaluation metrics. Based on the observation that attention-based NMT only captures partial aspects of attentional regularities, agreement-based learning (Liang et al., 2006) has been proposed to encourage bidirectional attention models to agree on parameterized alignment matrices. Along the same direction, inspired by the coverage mechanism in SMT, we propose a coverage-based approach to NMT to alleviate the over-translation and under-translation problems.
Independent from our work, Cohn et al. (2016) and Feng et al. (2016) made use of the concept of "fertility" for the attention model, which is similar in spirit to our method for building the linguistically inspired coverage with fertility. Cohn et al. (2016) introduced a feature-based fertility that includes the total alignment scores for the surrounding source words. In contrast, we make prediction of fertility before decoding, which works as a normalizer to better estimate the coverage ratio of each source word. Feng et al. (2016) used the previous attentional context to represent implicit fertility and passed it to the attention model, which is in essence similar to the input-feed method proposed in (Luong et al., 2015). Comparatively, we predict explicit fertility for each source word based on its encoding annotation, and incorporate it into the linguistically inspired coverage for the attention model.

Conclusion
We have presented an approach for enhancing NMT, which maintains and utilizes a coverage vector to indicate whether each source word is translated or not. By encouraging NMT to pay less attention to translated words and more attention to untranslated words, our approach alleviates the serious over-translation and under-translation problems that traditional attention-based NMT suffers from. We propose two variants of coverage models: linguistic coverage that leverages more linguistic information and NN-based coverage that resorts to the flexibility of neural network approximation. Experimental results show that both variants achieve significant improvements in terms of translation quality and alignment quality over NMT without coverage.