On the Word Alignment from Neural Machine Translation

Prior research suggests that neural machine translation (NMT) captures word alignment through its attention mechanism. However, this paper finds that attention may almost fail to capture word alignment for some NMT models. This paper therefore proposes two methods to induce word alignment that are general and agnostic to specific NMT models. Experiments show that both methods induce much better word alignment than attention. This paper further visualizes the translation through the word alignment induced by NMT. In particular, it analyzes the effect of alignment errors on translation errors at the word level, and its quantitative analysis over many test examples consistently demonstrates that alignment errors are likely to lead to translation errors as measured by different metrics.


Introduction
Machine translation aims at modeling the semantic equivalence between a pair of source and target sentences (Koehn, 2009), and word alignment tries to model the semantic equivalence between a pair of source and target words (Och and Ney, 2003). As a sentence consists of words, word alignment is conceptually related to machine translation, and this relation can be traced back to the birth of statistical machine translation (SMT) (Brown et al., 1993), where word alignment is the basis of SMT models and its accuracy is generally helpful to improve translation quality (Koehn et al., 2003; Liu et al., 2005).
In neural machine translation (NMT), it is also important to study word alignment, because word alignment provides natural ways to understand black-box NMT models and analyze their translation errors (Ding et al., 2017). Prior research observed that word alignment is captured by NMT through attention for recurrent neural network based NMT with a single attention layer (Bahdanau et al., 2014; Mi et al., 2016; Li et al., 2018). Unfortunately, we find, surprisingly, that attention may almost fail to capture word alignment for NMT models with multiple attentional layers such as TRANSFORMER (Vaswani et al., 2017), as demonstrated in our experiments.
* Work done while X. Li was interning at Tencent AI Lab. L. Liu is the corresponding author.
In this paper, we propose two methods to induce word alignment from general NMT models and answer the fundamental question of how much word alignment NMT models can learn (§ 3). The first method explicitly builds a word alignment model between a pair of source and target word representations encoded by NMT models, and then learns additional parameters for this word alignment model with supervision from an external aligner, similar to Mi et al. (2016) and . The second method is more intuitive and flexible: it is parameter-free and thus needs neither retraining nor an external aligner. Its key idea is to measure the prediction difference of a target word if a source word is removed, inspired by Arras et al. (2016) and Zintgraf et al. (2017). Experiments on an advanced NMT model show that both methods achieve much better word alignment than attention (§ 4.1). In addition, our experiments demonstrate that NMT captures good word alignment for words mostly contributed from source (CFS), while its word alignment is much worse for words mostly contributed from target (CFT). This finding offers a reason why advanced NMT models that deliver excellent translation capture worse word alignment than statistical aligners in SMT, which was observed in prior research yet without deep explanation (Tu et al., 2016;.
Furthermore, we understand and interpret NMT from the viewpoint of word alignment induced from NMT (§ 4.2). Unlike existing research that interprets NMT by examining a few examples as case studies (Ding et al., 2017; Alvarez-Melis and Jaakkola, 2017), we aim to provide quantitative analysis for interpreting NMT over many test examples, which makes our findings more general. To this end, we first compare the effects of both approaches on interpreting NMT and find that prediction difference is better for understanding NMT.
Consequently, we propose to quantitatively analyze translation errors by using the alignment from prediction difference. Since it is far from trivial to measure translation errors at the word level, we design experiments that use two metrics to detect translation errors. Our empirical results consistently show that wrong alignment is more likely to induce translation errors, whereas right alignment tends to favor correct translation. Our analysis further suggests that word alignment errors for CFS words are responsible for translation errors to some extent.
This paper makes two contributions:
• It systematically studies word alignment from NMT and proposes two approaches to induce word alignment that are agnostic to specific NMT models.
• It understands NMT from the viewpoint of word alignment and investigates the effect of alignment errors on translation errors via quantitative analysis over many test examples.

Neural Machine Translation
Given a source sentence $\mathbf{x} = x_1, \cdots, x_{|\mathbf{x}|}$ and a target sentence $\mathbf{y} = y_1, \cdots, y_{|\mathbf{y}|}$, NMT aims at maximizing the following conditional probability:
$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{|\mathbf{y}|} P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}) = \prod_{i=1}^{|\mathbf{y}|} P(y_i \mid s_i^L) \tag{1}$$
where $\mathbf{y}_{<i} = y_1, \ldots, y_{i-1}$ denotes a prefix of $\mathbf{y}$ with length $i - 1$, and $s_i^L$ is the final decoding state of $y_i$. Generally, the conditional distribution $P(y_i \mid s_i^L)$ is modeled within an encoder-decoder framework. In the encoding stage, the source sentence $\mathbf{x}$ is encoded as a sequence of hidden vectors $\mathbf{h}$ by an encoder according to the specific NMT model, such as a multi-layer encoder consisting of recurrent neural network (RNN), convolutional neural network (CNN), or self-attention layers. In the decoding stage, each decoding state $s_i^l$ in the $l$-th layer, with $l \in \{1, \ldots, L\}$, is computed by a general function $f$ that depends on the specific NMT model. The computation involves the word embedding $\mathbf{y}_i$ of word $y_i$ and a context vector $c_i^l$ in the $l$-th layer, which is computed from $\mathbf{h}$ and $s_{<i}^l$ according to the specific NMT model. As the dominant models, attentional NMT models define the context vector $c_i^l$ as a weighted sum of $\mathbf{h}$, where the weight $\alpha_i^l = g(s_i^{l-1}, s_{<i}^l, \mathbf{h})$ is defined by a similarity function. Due to space limitations, we refer readers to Bahdanau et al. (2014), Gehring et al. (2017) and Vaswani et al. (2017) for details on the definitions of $f$ and $g$.

Alignment by Attention
Since the attention weight $\alpha_{i,j}^l$ measures the similarity between $s_i^{l-1}$ and $h_j$, it has been widely used to evaluate the word alignment between $y_i$ and $x_j$ (Bahdanau et al., 2014; Ghader and Monz, 2017). Once an attentional NMT model has been trained, one can easily extract the word alignment $A$ from the attention weight $\alpha$ with a maximum a posteriori (MAP) strategy as follows:
$$A_{i,j}(\alpha) = \begin{cases} 1 & \text{if } j = \arg\max_{j'} \alpha_{i,j'} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
where $A_{i,j} = 1$ indicates that $y_i$ aligns to $x_j$. For NMT models with multiple attention heads and attentional layers, as in Vaswani et al. (2017), we sum the attention weights over all heads into a single $\alpha$ before applying MAP in equation 3.
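The MAP extraction in equation 3 can be sketched as follows. This is a minimal illustration, assuming the attention weights of one layer are available as a NumPy array of shape (heads, |y|, |x|); the function names `map_alignment` and `attention_alignment` and the toy weights are hypothetical and not taken from any NMT toolkit.

```python
import numpy as np

def map_alignment(alpha):
    """Return a 0/1 matrix A with A[i, j] = 1 iff j maximizes alpha[i, :]."""
    alpha = np.asarray(alpha, dtype=float)
    A = np.zeros(alpha.shape, dtype=int)
    A[np.arange(alpha.shape[0]), alpha.argmax(axis=1)] = 1
    return A

def attention_alignment(head_weights):
    """Sum the attention weights over all heads, then apply MAP.

    head_weights: array of shape (num_heads, |y|, |x|) for a single layer.
    """
    alpha = np.asarray(head_weights, dtype=float).sum(axis=0)
    return map_alignment(alpha)

# Toy weights: two heads, two target words, three source words (made-up values).
heads = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]],
])
A = attention_alignment(heads)
```

Each target word receives exactly one link under this strategy, so the sketch cannot express unaligned or multiply aligned target words.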
Alignment by attention can be extracted only from attentional NMT models, whereas it is useless for general NMT models. In this section, in order to induce word alignment from general NMT models, we propose two different methods, which are agnostic to specific NMT models.

Alignment by Explicit Alignment Model
Given a source sentence $\mathbf{x}$ and a target sentence $\mathbf{y}$, following Liu et al. (2005) and Taskar et al. (2005), we explicitly define a word alignment model as follows:
$$P(x_j \mid y_i; W) = \frac{\exp\left(\delta(x_j, y_i; W)\right)}{\sum_{m=1}^{|\mathbf{x}|} \exp\left(\delta(x_m, y_i; W)\right)} \tag{4}$$
where $\delta(x_j, y_i; W)$ is a distance function parametrized by $W$. Ideally, $\delta$ is able to include arbitrary features, such as IBM model 1, similar to Liu et al. (2005). However, as our goal is not to achieve the best word alignment but to focus on that captured by an NMT model, we only consider features completely learned in NMT. Hence, we define
$$\delta(x_j, y_i; W) = (\mathbf{x}_j \,\|\, h_j)^\top W \, (\mathbf{y}_i \,\|\, s_i^L) \tag{5}$$
where $\mathbf{x}_j$ and $\mathbf{y}_i$ are the word embeddings of $x_j$ and $y_i$ learned in NMT, $h_j$ is the hidden unit of $x_j$ in the encoding network, $s_i^L$ is the hidden unit of $y_i$ in the decoding network, $\|$ denotes the concatenation of a pair of column vectors of dimension $d$, and $W$ is a matrix of dimension $2d \times 2d$.
The explicit word alignment model is trained by maximizing the objective function with respect to the parameter matrix $W$:
$$\max_W \sum_{\langle \mathbf{x}, \mathbf{y} \rangle} \sum_{i,j} A^{ref}_{ij} \log P(x_j \mid y_i; W) \tag{6}$$
where $A^{ref}_{ij}$ is the reference alignment between $x_j$ and $y_i$ for a sentence pair $\mathbf{x}$ and $\mathbf{y}$. As the number of elements in $W$ is up to one million (i.e., $(2 \times 512)^2$), it is not feasible to train it using a small dataset with gold alignment. Therefore, following Mi et al. (2016) and , we run a statistical word aligner such as FAST ALIGN (Dyer et al., 2013) on a large corpus and then employ the resulting word alignment as the silver alignment $A^{ref}$ for training. Note that our goal is to quantify the word alignment learned by an NMT model, and thus we only treat $W$ as the parameter to be learned, which differs from jointly training all parameters, including those of the NMT model, as in Mi et al. (2016) and .
After training, one obtains the optimized $W$ and then easily infers the word alignment for a test sentence pair $\langle \mathbf{x}, \mathbf{y} \rangle$ via the MAP strategy defined in equation 3 by setting $\alpha_{i,j} = P(x_j \mid y_i; W)$. Note that if the word embeddings and hidden units learned by NMT models capture enough information for word alignment, the above method can obtain excellent word alignment. However, because the dataset used for supervision in training inevitably includes some data-intrinsic word alignment information, it is unclear how much of the word alignment comes only from the NMT model. Therefore, we propose another method, which is parameter-free and depends only on the NMT model itself.
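The explicit alignment model can be illustrated with a small NumPy sketch. It assumes a bilinear score between concatenated source features (embedding and encoder state) and concatenated target features (embedding and final decoder state), normalized by a softmax over source positions; the function names and the toy features are hypothetical, and a real setup would train $W$ with gradient-based optimization while keeping the NMT parameters fixed.

```python
import numpy as np

def eam_probs(src_feats, tgt_feats, W):
    """P[i, j] = P(x_j | y_i; W): bilinear score with a softmax over source positions.

    src_feats: (|x|, 2d), row j concatenates the embedding of x_j and h_j.
    tgt_feats: (|y|, 2d), row i concatenates the embedding of y_i and s_i^L.
    W:         (2d, 2d) matrix, the only trainable parameter.
    """
    scores = tgt_feats @ W.T @ src_feats.T                # scores[i, j] = src_j^T W tgt_i
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    expd = np.exp(scores)
    return expd / expd.sum(axis=1, keepdims=True)

def silver_log_likelihood(P, A_ref):
    """Sum of log P[i, j] over the links of a 0/1 silver alignment matrix A_ref."""
    return float((A_ref * np.log(P)).sum())

# Toy features: three source words, two target words, 2d = 4 (made-up values).
src_feats = 3.0 * np.eye(4)[:3]
tgt_feats = src_feats[[2, 0]]
W = np.eye(4)

P = eam_probs(src_feats, tgt_feats, W)
alignment = P.argmax(axis=1)                 # MAP over source positions
A_ref = np.array([[0, 0, 1], [1, 0, 0]])     # hypothetical silver links
ll = silver_log_likelihood(P, A_ref)
```

Because only `W` is learned, any alignment quality beyond what the silver data alone provides has to come from the fixed NMT representations passed in as features.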

Alignment by Prediction Difference
The intuition behind this method is that if $y_i$ aligns to $x_j$, the relevance between $y_i$ and $x_j$ should be much higher than that between $y_i$ and any other $x_k$ with $k \neq j$. Therefore, the key to our method is how to measure the relevance between $y_i$ and $x_j$.
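Sampling method Zintgraf et al. (2017) propose a principled method to measure the relevance between a pair of tokens in the input and output. It is estimated by measuring how the prediction of $y_i$ in the output changes if the input token $x_j$ is unknown. Formally, the relevance between $y_i$ and $x_j$ for a given sentence pair $\langle \mathbf{x}, \mathbf{y} \rangle$ is defined as follows:
$$R(y_i, x_j) = P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}) - P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}_{(j,\varnothing)}) \tag{7}$$
$$P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}_{(j,\varnothing)}) = \sum_{x} P(x \mid \mathbf{y}_{<i}, \mathbf{x}_{(j,\varnothing)}) \, P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}_{(j,x)}) \approx \mathbb{E}_{x \sim P(x)} \left[ P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}_{(j,x)}) \right] \tag{8}$$
where $\mathbf{x}_{(j,x)}$ denotes the sequence obtained by replacing $x_j$ with $x$ in $\mathbf{x}$; in particular, $\mathbf{x}_{(j,\varnothing)}$ denotes the sequence obtained by removing $x_j$ from $\mathbf{x}$. $P(y_i \mid \mathbf{y}_{<i}, \mathbf{x})$ is defined in equation 1, and $P(x \mid \mathbf{y}_{<i}, \mathbf{x}_{(j,\varnothing)})$ is approximated by the empirical distribution $P(x)$, which can be considered as the 1-gram language model for the source side of the training corpus. Unlike the computer vision task in Zintgraf et al. (2017), the size of the source vocabulary in NMT is up to 30000, and thus summation over this large vocabulary is computationally expensive. As a result, we only sample multiple words to approximate the expectation in equation 8 by a Monte Carlo (MC) approach.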
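The Monte Carlo estimate can be sketched as follows. The NMT model is replaced here by a hypothetical toy scorer `toy_target_prob` that ignores the target history, and the vocabulary sizes, embeddings, and unigram distribution are made up for illustration; only the sampling logic mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
V_SRC, V_TGT, D = 6, 5, 4
E = rng.normal(size=(V_SRC, D))   # toy source embeddings (hypothetical)
O = rng.normal(size=(D, V_TGT))   # toy output projection (hypothetical)

def toy_target_prob(src_ids, tgt_id):
    """Stand-in for P(y_i | y_<i, x): softmax over a mean-pooled source encoding."""
    logits = E[src_ids].mean(axis=0) @ O
    p = np.exp(logits - logits.max())
    return float((p / p.sum())[tgt_id])

def relevance_mc(src_ids, tgt_id, j, unigram, num_samples, rng):
    """Estimate R(y_i, x_j) by replacing x_j with words sampled from the unigram P(x)."""
    base = toy_target_prob(src_ids, tgt_id)
    samples = rng.choice(len(unigram), size=num_samples, p=unigram)
    marginal = 0.0
    for w in samples:
        replaced = list(src_ids)
        replaced[j] = int(w)
        marginal += toy_target_prob(replaced, tgt_id)
    return base - marginal / num_samples

src_ids = [1, 4, 2]
unigram = np.full(V_SRC, 1.0 / V_SRC)   # uniform stand-in for the empirical P(x)
R = relevance_mc(src_ids, tgt_id=3, j=1, unigram=unigram, num_samples=20, rng=rng)
```

The cost grows linearly with `num_samples`, since every sample requires one additional forward pass of the model.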
Deterministic method Inspired by the idea of dropout (Srivastava et al., 2014), we measure the relevance by disabling the connection between $x_j$ and the encoder network in a deterministic way. Formally, $R(y_i, x_j)$ is directly defined via the dropout effect on $x_j$ as follows:
$$R(y_i, x_j) = P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}) - P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}_{(j,0)}) \tag{9}$$
where $\mathbf{x}_{(j,0)}$ denotes the sequence obtained by replacing $x_j$ with a word whose embedding is a zero vector. In this way, the computation in equation 9 is much faster than the Monte Carlo sampling approach, which involves multiple samples. It is worth mentioning that equation 9 resembles the Monte Carlo sampling approach with a single sample in terms of computation, but it is significantly better than MC with a single sample in alignment quality and is very close to the MC approach with enough samples, as shown in our experiments.
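A minimal sketch of the deterministic variant follows. The NMT model is again replaced by a hypothetical toy scorer (`toy_probs`, sum-pooling over source embeddings and ignoring the target history), so the numbers are purely illustrative; what the sketch shows is the zero-embedding replacement followed by the MAP step.

```python
import numpy as np

def toy_probs(src_embs, O):
    """Stand-in for the NMT output distribution: softmax over sum-pooled source embeddings."""
    logits = src_embs.sum(axis=0) @ O
    p = np.exp(logits - logits.max())
    return p / p.sum()

def pd_relevance(src_embs, O, tgt_ids):
    """R[i, j] = P(y_i | x) - P(y_i | x with the embedding of x_j set to zero)."""
    base = toy_probs(src_embs, O)
    R = np.zeros((len(tgt_ids), src_embs.shape[0]))
    for j in range(src_embs.shape[0]):
        dropped = src_embs.copy()
        dropped[j] = 0.0                      # zero embedding for x_j
        diff = base - toy_probs(dropped, O)
        R[:, j] = diff[tgt_ids]
    return R

# Toy setup: source word j is the only evidence for target word j (made-up values).
src_embs = 4.0 * np.eye(3)
O = np.eye(3)
tgt_ids = [2, 0, 1]

R = pd_relevance(src_embs, O, tgt_ids)
alignment = R.argmax(axis=1)                  # MAP over source positions
```

Only one extra forward pass per source position is needed, which is what makes this variant as cheap as sampling with a single sample.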
Note that the relevance $R(y_i, x_j) \in [-1, 1]$, where $R(y_i, x_j) = 1$ means that the $i$-th target word is totally determined by the $j$-th source word; $R(y_i, x_j) = -1$ means that the $i$-th target word and the $j$-th source word are mutually exclusive; and $R(y_i, x_j) = 0$ means that the $j$-th source word does not affect the generation of the $i$-th target word. To obtain the word alignment for a given sentence pair $\langle \mathbf{x}, \mathbf{y} \rangle$, after collecting $R(y_i, x_j)$ one can easily infer the word alignment via the MAP strategy defined in equation 3 by setting $\alpha_{i,j} = R(y_i, x_j)$.
Remark The above $R(y_i, x_j)$ in equation 7 quantifies the relevance between a target word $y_i$ and a source word $x_j$. Similarly, one can quantify the relevance between $y_i$ and a word $y_k$ in its history as follows:
$$R_o(y_i, y_k) = P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}) - P(y_i \mid \mathbf{y}_{<i(k,0)}, \mathbf{x}) \tag{10}$$
where $R_o$ indicates the relevance between two target words $y_i$ and $y_k$ with $k < i$, and $P(y_i \mid \mathbf{y}_{<i(k,0)}, \mathbf{x})$ is obtained by disabling the connection between $y_k$ and the decoder network, similarly to $P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}_{(j,0)})$. Unlike $R(y_i, x_j)$, which captures word alignment information, $R_o(y_i, y_k)$ is able to capture word allocation in a target sentence, and it will be used in the experiments section to answer the fundamental question of why NMT models yield better translation yet worse word alignment compared with SMT.

Experiments
In this section, we conduct extensive experiments on ZH⇒EN and DE⇒EN translation tasks to evaluate different methods for word alignment induced from the NMT model and compare them with a statistical alignment model FAST ALIGN (Dyer et al., 2013). Then, we use the induced word alignment to understand translation errors both qualitatively and quantitatively.
The proposed methods are implemented on top of TRANSFORMER (Vaswani et al., 2017), which is a state-of-the-art NMT system. We report AER on the NIST05 test set and RWTH data, whose reference alignments were manually annotated by experts (Ghader and Monz, 2017). More details on the data and on training these systems are described in Appendix A.
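For readers unfamiliar with the metric, the following sketch computes AER under the standard definition of Och and Ney (2003), which is assumed here to be the one used: with sure links S, possible links P (a superset of S), and hypothesized links A, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The link sets below are made up for illustration.

```python
def aer(hyp, sure, possible):
    """Alignment error rate; links are (target_index, source_index) pairs."""
    possible = possible | sure               # sure links are also possible links
    if not hyp and not sure:
        return 0.0
    matched = len(hyp & sure) + len(hyp & possible)
    return 1.0 - matched / (len(hyp) + len(sure))

# Hypothetical links for a three-word target sentence.
hyp = {(0, 0), (1, 2), (2, 1)}
sure = {(0, 0), (1, 1)}
possible = {(1, 2)}

score = aer(hyp, sure, possible)
perfect = aer(sure, sure, possible)
```

Lower is better: a hypothesis identical to the sure links scores 0, and a hypothesis sharing no link with the possible set scores 1.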

Inducing Word alignment from NMT
Attention Since the bilingual corpus intrinsically includes word alignment to some extent, word alignment by attention should be better than the data-intrinsic alignment if attention indeed captures alignment. To obtain the data-intrinsic word alignment, we calculate pointwise mutual information (PMI) from the bilingual corpus and then infer the word alignment for each bilingual sentence by using the MAP strategy as in equation 3. It is astonishing that word alignment by attention is inconsistent across different layers of TRANSFORMER, although attention in a single-layer TRANSFORMER obtains decent word alignment. Referring to Figure 1, for models with more than two layers, the alignment captured by attention on the middle layer(s) is reasonable, but that on low or high layers is obviously worse than PMI. The possible reasons can be explained as follows. The possible functionality of lower layers might be constructing gradually better contextual representations of the word at each position, as suggested in recent work on contextualized embeddings (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2019).
Explicit Alignment Model (EAM) As shown in Table 1, EAM outperforms alignment induced from attention by a large margin. However, since EAM employs silver alignment annotations from FAST ALIGN for training the additional parameters, its final AER includes contributions from both the aligned data and the model. To eliminate the contribution from the data, we investigate the AERs over different pre-trained translation models with their EAMs trained on the same FAST ALIGN annotated data. We find that a stronger (higher BLEU) translation model generally obtains better alignment (lower AER). As shown in Table 2, TRANSFORMER-L6 generates much better alignment than TRANSFORMER-L1, which is highly correlated with their translation performance.
This suggests that supervision alone is not enough to obtain good alignment, and that the hidden units learned by a translation model indeed implicitly capture alignment knowledge by learning translation. In addition, EAM can be thought of as a kind of agnostic probe (Belinkov et al., 2017; Hewitt and Manning, 2019) to investigate how much alignment is implicitly learned in the hidden representations.
Prediction Difference (PD) As shown in Table 1, PD also delivers better word alignment than attention. PD can be implemented by the sampling method or the deterministic method. As shown in Table 3, the alignment performance of the sampling method improves as the sample size grows, because the accuracy of the Monte Carlo approach depends on the number of samples.
Regardless of the sample size, the variance of AER is always negligible. The reason might be that the arg max operation in equation 3 eliminates the fluctuation of the probability matrix. Although a large sample size can achieve good alignment performance, it is costly in computation. Fortunately, the deterministic method, which employs a single zero embedding rather than the embeddings of random words, also achieves good alignment performance at the computational cost of a single sample. One possible reason is that using a zero embedding in inference works in exactly the same way as dropout in training, so the trained parameters perform well in inference. In the rest of this paper, we employ the deterministic version as the default for PD.
Alignment on CFT words It is well known that NMT substantially outperforms SMT in translation, and thus it is natural to ask why NMT yields worse alignment than the SMT aligner FAST ALIGN, as shown in Table 1. Because the probability of a target word typically draws on mixed contributions from both the source and target sides, NMT may capture good alignment for the target words mostly contributed from source (CFS, such as content words) but bad alignment for the target words mostly contributed from target (CFT, such as function words). To test this, we divide the target words into two categories: for a given sentence pair $\langle \mathbf{x}, \mathbf{y} \rangle$, CFS and CFT are formally defined as the two sets containing the target words $y_i$ that satisfy the following conditions, respectively:
$$\max_j R(y_i, x_j) - \max_k R_o(y_i, y_k) \geq \epsilon \quad \text{(CFS)}$$
$$\max_k R_o(y_i, y_k) - \max_j R(y_i, x_j) > \epsilon \quad \text{(CFT)} \tag{11}$$
where $\epsilon \in [0, 1)$ is a probability margin between CFS and CFT words.
After dividing the target words into the two categories of CFS and CFT words according to the criterion defined above, we evaluate the alignment performance for each category; the results are shown in Table 4. We find that NMT indeed captures better alignment for CFS words than for CFT words, and that FAST ALIGN generates much better alignment than NMT for CFT words. This indicates that CFT words are the reason why NMT generates worse alignment than FAST ALIGN.
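The split into CFS and CFT words can be sketched as below. The sketch assumes that the criterion compares the largest source relevance $\max_j R(y_i, x_j)$ with the largest target-history relevance $\max_k R_o(y_i, y_k)$ under the margin $\epsilon$, and that the first target word, which has no history, counts as CFS; both the function name `split_cfs_cft` and the relevance values are hypothetical.

```python
import numpy as np

def split_cfs_cft(R, R_o, eps):
    """Return (cfs, cft): lists of target positions.

    R:   (|y|, |x|) source relevance R(y_i, x_j).
    R_o: (|y|, |y|) target relevance R_o(y_i, y_k); only entries with k < i are used.
    """
    R = np.asarray(R, dtype=float)
    R_o = np.asarray(R_o, dtype=float)
    cfs, cft = [], []
    for i in range(R.shape[0]):
        from_source = R[i].max()
        # Assumption: a word without history has no target-side contribution.
        from_target = R_o[i, :i].max() if i > 0 else -np.inf
        if from_source - from_target >= eps:
            cfs.append(i)
        elif from_target - from_source > eps:
            cft.append(i)
    return cfs, cft

# Made-up relevance values for three target words and two source words.
R = [[0.6, 0.1], [0.05, 0.1], [0.3, 0.2]]
R_o = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.25, 0.1, 0.0]]

cfs, cft = split_cfs_cft(R, R_o, eps=0.1)
```

With a positive margin, words whose two contributions differ by less than the margin fall into neither set, as happens for the third word in this example.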

Confidence-binned AER Since confidence can reflect translation quality to some extent, we also use the confidence of each target word (the predictive probability) during forced decoding to group the target words into ten bins and report the AER of each bin in Figure 2. We find that the AER generally decreases as the probability increases. This also indicates that alignment analysis on real translations instead of the ground truth may lead to more reliable conclusions, since beam search always finds high-confidence candidates.
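The binning can be sketched as follows, assuming ten equal-width probability bins and the standard AER definition of Och and Ney (2003) restricted to the target words of each bin; the bin boundaries, function names, and example links are assumptions made for illustration.

```python
from collections import defaultdict

def aer(hyp, sure, possible):
    """Alignment error rate over sets of (target_index, source_index) links."""
    possible = possible | sure
    if not hyp and not sure:
        return 0.0
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))

def binned_aer(confidence, hyp, sure, possible, num_bins=10):
    """Group target positions by predictive probability and compute AER per bin."""
    bins = defaultdict(set)
    for i, p in enumerate(confidence):
        bins[min(int(p * num_bins), num_bins - 1)].add(i)
    result = {}
    for b in sorted(bins):
        keep = bins[b]
        pick = lambda links: {(i, j) for (i, j) in links if i in keep}
        result[b] = aer(pick(hyp), pick(sure), pick(possible))
    return result

# Made-up confidences and links for three target words.
confidence = [0.95, 0.15, 0.92]
hyp = {(0, 0), (1, 2), (2, 1)}
sure = {(0, 0), (1, 1), (2, 1)}

per_bin = binned_aer(confidence, hyp, sure, possible=set())
```

Bins that contain no target word are simply omitted from the result.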

Understanding NMT via PD Alignment
Which method is better for understanding? Previous experiments mainly consider the alignment for the reference, and show that EAM is better at aligning a reference word to source words than PD. However, in order to better understand the translation process of an NMT model, it is helpful to analyze the alignment of real translations derived from the NMT model itself. This is also in accordance with the earlier confidence-binned observation. The alignment of the real translation actually provides some insight into the causal relationship among source and target words. To obtain AER on real decoding, we manually annotate the word alignment of the real translations for 200 source sentences randomly selected from the ZH⇒EN test set. As shown in Table 5, PD yields better alignment for the real translation than EAM, and we even find, surprisingly, that its alignment performance is better than that of FAST ALIGN. This quantitative finding demonstrates that PD is better for understanding the real translation in general rather than only in some special cases.
Figure 3: Two examples of translation errors caused by word alignment errors, in (a) forced decoding and (b) real decoding on TRANSFORMER-L6. Red arrows mark wrong alignment and green arrows mark the gold alignment. Red words are translation errors. 'R' denotes the reference sentence and 'T' denotes the translation.
It is worth noting that EAM not only delivers worse word alignment for real translations than PD, but is also risky as a basis for understanding NMT. The main reason is that EAM relies on an external aligned dataset with supervision from the statistical word aligner FAST ALIGN, and thus the characteristics of its alignment results are similar to those of FAST ALIGN, which biases the understanding towards FAST ALIGN. In contrast, PD relies only on the predictions of a neural model to define the relevance, and it has been successfully used to understand and interpret neural models (Zintgraf et al., 2017). Therefore, in the rest of this subsection, we try to understand NMT by using PD both qualitatively and quantitatively.

Analyze translation errors in forced decoding
We consider the forced decoding translation error as follows. We fix the translation history as the prefix of the reference $\mathbf{y}_{<i}$ at each timestep $i$ and then check whether the 1-best word $\hat{y}_i = \arg\max_y P(y \mid \mathbf{y}_{<i}, \mathbf{x})$ is exactly $y_i$. If $\hat{y}_i \neq y_i$, we say the NMT model makes an erroneous decision at this timestep. We give a case of this kind of error in Figure 3(a). After visualizing the alignment of $y_i$ by PD, we find that its alignment, shown in red, is not correct compared to the ground-truth alignment, shown in green. As a result, the NMT model cannot capture sufficient context to accurately predict the reference word $y_i$ and thereby generates an incorrect word 'construction'.
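The forced decoding check can be sketched as follows, assuming the model has already been run with the reference prefix at every timestep and its output distributions are stored as rows of a matrix; the function name `forced_decoding_errors` and the probabilities are hypothetical.

```python
import numpy as np

def forced_decoding_errors(step_probs, ref_ids):
    """Flag timesteps whose 1-best word differs from the reference word.

    step_probs: (|y|, V) array, row i is P(. | y_<i, x) computed with the reference prefix.
    ref_ids:    reference word ids y_1 .. y_|y|.
    """
    best = np.asarray(step_probs).argmax(axis=1)
    return best != np.asarray(ref_ids)

# Made-up distributions over a three-word vocabulary for three timesteps.
step_probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.3, 0.5],
]
ref_ids = [1, 1, 2]

errors = forced_decoding_errors(step_probs, ref_ids)
error_rate = float(errors.mean())
```

Because the history is always the reference prefix, an error at one timestep does not propagate to later timesteps, which makes the per-timestep decisions directly comparable.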
Besides the case study, we try to quantitatively show that alignment errors may lead to translation errors. To this end, we divide all timesteps from the references of the test dataset into two categories, i.e., one with right alignment and the other with wrong alignment. Then we calculate the forced decoding translation error rate for each category, i.e., the ratio between the number of timesteps making erroneous decisions in one category and the total number of timesteps, as depicted in Table 6. From the table, it is clear that wrong alignment is more likely to cause a translation error, while correct alignment is likely to lead to a correct translation decision. In particular, compared with right alignment, when the alignment is wrong, the forced decoding translation error rate of CFS words increases much more than that of CFT words ($\Delta$). This observation indicates that word alignment errors of CFS words, rather than CFT words, are mainly responsible for translation errors.
Figure 4: An example of word alignment and translation produced by TRANSFORMER-L6. Red arrows mark wrong alignment and blue arrows mean the prediction is attributed to a target word. Words in light font do not align to any source word, while red words are wrong translations.
For the quantitative analysis of real decoding, as in forced decoding, we split all target words into two parts, i.e., right alignment and wrong alignment, and then evaluate the real decoding translation error rate for each of them via $\sum_i |\{y_i\} \setminus \{\hat{y}_i\}| / \sum_i |\{y_i\}|$. As shown in Table 7, there is an obvious gap between the real decoding translation error rates of right alignment and wrong alignment, which shows that alignment errors have an adverse effect on translation quality. For CFS and CFT words, Table 7 demonstrates that alignment errors decrease translation quality for both sets. As in forced decoding, the real decoding translation errors are also mainly attributed to CFS words.
This suggests that improving the ability to learn word alignment for CFS words has the potential to improve translation quality for neural machine translation.
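The per-category error rates behind this kind of analysis can be sketched as below. The sketch normalizes by the number of timesteps in the same category, which is one reading of the definition and is an assumption here; the function name and the flags are made up for illustration.

```python
def error_rates_by_alignment(is_error, alignment_right):
    """Translation error rate for timesteps with right and with wrong alignment.

    is_error:        per-timestep flags, True if the translation decision is wrong.
    alignment_right: per-timestep flags, True if the induced alignment is correct.
    """
    counts = {True: [0, 0], False: [0, 0]}   # category -> [errors, timesteps]
    for err, right in zip(is_error, alignment_right):
        counts[bool(right)][0] += int(err)
        counts[bool(right)][1] += 1

    def rate(c):
        return c[0] / c[1] if c[1] else float("nan")

    return {"right": rate(counts[True]), "wrong": rate(counts[False])}

# Made-up flags for six timesteps.
is_error = [False, True, False, True, True, False]
alignment_right = [True, True, True, False, False, True]

rates = error_rates_by_alignment(is_error, alignment_right)
delta = rates["wrong"] - rates["right"]
```

Running the same function separately on CFS and CFT timesteps gives the per-set gaps that the comparison between the two word types relies on.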
Interpret Translation via CFT Alignment Just as translation errors have been shown to be related to alignment errors, translation success can also be understood through word alignment. Previous research (Ding et al., 2017; Alvarez-Melis and Jaakkola, 2017) has attempted to interpret the decision-making of translation by aligning target words to source words. However, a non-negligible number of translated target words are mostly contributed from the target side instead of the source side.
As shown in Figure 4(a), as a function word, 'a' should not be aligned to any source word. However, in Figure 4(b) PD incorrectly aligns 'a' to 'háng hǎi jīa' by only considering the contributions from the source side, and this leads to a misunderstanding of why 'a' is translated. Fortunately, according to equation 11, PD is good at distinguishing whether the contributions come from the source side or the target side. As shown in Figure 4(c), when only words in CFS are aligned to the source, 'a' is correctly left unaligned to any source word because it belongs to CFT and is instead aligned to 'is', which explains why NMT correctly translates 'a'.
Although the ambiguous Chinese word 'hé' mostly means 'and', TRANSFORMER is able to translate it perfectly as a given name 'hé' as shown