Accurate Word Alignment Induction from Neural Machine Translation

Despite its original goal of jointly learning to align and translate, prior research suggests that the state-of-the-art neural machine translation model Transformer captures poor word alignment through its attention mechanism. In this paper, we show that attention weights do capture accurate word alignment, which can only be revealed if we choose the correct decoding step and layer to induce word alignment. We propose to induce alignment with the to-be-aligned target token as the decoder input and present two simple but effective interpretation methods for word alignment induction, either through the attention weights or the leave-one-out measures. In contrast to previous studies, we find that attention weights capture better word alignment than the leave-one-out measures under our setting. Using the proposed method with attention weights, we greatly improve over fast-align on word alignment induction. Finally, we present a multi-task learning framework to train the Transformer model and show that by incorporating GIZA++ alignments into our multi-task training, we can induce significantly better alignments than GIZA++.


Introduction
The task of word alignment is to find lexical translation equivalents from a parallel corpus. It is one of the fundamental tasks in natural language processing and is widely studied by the community (Dyer et al., 2013; Brown et al., 1993; Vogel et al., 1996; Och and Ney, 2003; Liu and Sun, 2015). Word alignments are useful in many scenarios, such as error analysis (Ding et al., 2017; Li et al., 2019), the introduction of coverage and fertility models (Tu et al., 2016), inserting external constraints in interactive machine translation (Hasler et al., 2018), and providing guidance for human translators in computer-aided translation (Dagan et al., 1993).
Word alignment is part of the pipeline in statistical machine translation (SMT) (Brown et al., 1993; Koehn et al., 2003; Chiang, 2005), but is not necessarily needed for neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015). The attention mechanism (Bahdanau et al., 2015) in NMT does not functionally play the role of word alignment between the source and the target, at least not in the same way as its analog in SMT. It is hard to interpret the attention activations and extract meaningful word alignments, especially from the Transformer, which uses multiple attention components in each of the stacked decoder layers. As a result, the most widely used word alignment tools are still external statistical models such as fast-align (Dyer et al., 2013) and GIZA++ (Brown et al., 1993; Vogel et al., 1996; Och and Ney, 2003).
Recently, there has been a resurgence of interest in the community in studying word alignment for the Transformer. One simple solution is to induce word alignments from the attention weights between the encoder and decoder. The attention weights are averaged across all heads from the penultimate decoder layer. Then the next target word is aligned with the source word that has the maximum attention weight. However, such a schedule only captures noisy word alignment (Alkhouli et al., 2018; Li et al., 2019; Ding et al., 2019; Garg et al., 2019). One of the major problems is that it induces alignment before observing the to-be-aligned target token (Peter et al., 2017; Ding et al., 2019). Suppose that for the same source sentence, there are two alternative translations that diverge at decoding step i, generating y_i and y'_i, which respectively correspond to different source words. Presumably, the source word that is aligned to y_i or y'_i should change correspondingly. However, this is not possible under the above method, because the alignment scores are computed before the prediction of y_i or y'_i.
To alleviate this problem, some researchers modify the Transformer architecture by adding an alignment module that predicts the to-be-aligned target token (Zenkel et al., 2019), or modify the training loss by adding an additional alignment loss computed with the full target sentence in a multi-task learning framework (Garg et al., 2019). Others argue that using only attention weights is insufficient for generating clean word alignment and propose to induce word alignment with feature importance measures, such as leave-one-out (LOO) measures (Li et al., 2019) and gradient-based measures (Ding et al., 2019; Garg et al., 2019). However, all previous work induces alignment at the decoding step when the to-be-aligned target token is the decoder output.
In this work, we propose to induce word alignment at the decoding step when the to-be-aligned target token is the decoder input instead of the output. In this way, we can easily incorporate the information of the to-be-aligned target token. Specifically, we present two novel methods for alignment induction, one with attention weights and one with LOO measures. These methods are pure interpretation methods and do not require any parameter update or architecture change. We demonstrate that if the correct decoding step and layer are chosen, attention weights in the vanilla Transformer are much more effective than previously known (Alkhouli et al., 2018; Li et al., 2019; Ding et al., 2019; Garg et al., 2019) and are sufficient for generating clean word alignment interpretations. Our method reduces the Alignment Error Rate (AER) by 6.9-9.8 points over the naive attention weights baseline and by 3.4-6.7 points over fast-align on three evaluation sets. Our method is also complementary to the multi-task learning framework proposed by Garg et al. (2019). We demonstrate that a combination of our method and the multi-task learning can further improve alignment performance, reducing AER by 0.4-2.7 points over GIZA++.

Neural Machine Translation
Let x = {x_1, ..., x_{|x|}} and y = {y_1, ..., y_{|y|}} be the source and target sentences. Neural machine translation models the target sentence given the source sentence as p(y|x; θ), where θ is a set of model parameters to be learned. The negative log-likelihood (NLL) training loss is defined as

L_NLL(θ) = − Σ_{i=1}^{|y|+1} log p(y_i | y_{0:i−1}, x; θ),

where y_0 = bos and y_{|y|+1} = eos represent the beginning and end of target sentences respectively.
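As a minimal illustration of this objective, the per-sentence loss can be computed from the model's probabilities of the reference tokens at each step. This is a toy sketch with hypothetical probability values; in a real system the probabilities come from the network's softmax output.

```python
import math

def nll_loss(token_probs):
    """NLL of one target sentence: token_probs[i] is the model probability
    p(y_i | y_0:i-1, x) assigned to the i-th reference token, including
    the final eos step."""
    return -sum(math.log(p) for p in token_probs)

# Toy example: three target tokens plus eos, each predicted with
# probability 0.5, gives a loss of 4 * log(2).
loss = nll_loss([0.5, 0.5, 0.5, 0.5])
```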
The NMT model can be implemented with different architectures. In this paper, we use the Transformer (Vaswani et al., 2017). The Transformer is an encoder-decoder model that relies only on attention. Each decoder layer attends to the encoder output with multi-head attention, which consists of N attention heads running in parallel.

Alignment by Attention
We denote the encoder output from the last encoder layer as h = {h_1, ..., h_{|x|}}, and the hidden states at decoder layer l as z^l = {z^l_1, ..., z^l_{|y|+1}}. For decoder layer l, we define the layer-averaged target-to-source attention weights W^l as

W^l = (1/N) Σ_{n=1}^{N} softmax( Q^l_n (U^l_n)^T / √d_k ),

where Q^l_n and U^l_n are the query and key matrices for head n of the target-to-source attention at decoder layer l, d_k is the per-head dimension, and the element W^l_{i,j} of W^l measures the relevance between decoder hidden state z^l_i and encoder output h_j. For simplicity, below we use the term "attention weights" to denote the layer-averaged target-to-source attention weights.
Given a trained Transformer model, word alignment can be extracted from the attention weights (Alkhouli et al., 2018; Li et al., 2019; Ding et al., 2019; Garg et al., 2019). More specifically, word alignment A is extracted from the attention weights W^l in the style of the maximum a posteriori (MAP) strategy:

A_{i,j} = 1 if j = argmax_{j'} W^l_{i,j'}, and A_{i,j} = 0 otherwise,

where A_{i,j} = 1 indicates that y_i is aligned to x_j. Garg et al. (2019) show that attention weights from the penultimate layer, i.e., l = L − 1, induce the best alignments.
Although simple to implement, this method fails to obtain satisfactory word alignment (Alkhouli et al., 2018; Li et al., 2019; Ding et al., 2019; Garg et al., 2019). First of all, W^l_{i,j} does not naturally measure the relevance between y_i and x_j; it measures the relevance between decoder hidden state z^l_i and encoder output h_j. However, at decoding step i, the decoder input is y_{i−1} and the output is y_i. The decoder representation z^l_i may better represent y_{i−1} instead of y_i, especially for the bottom layers. Second, since the attention weight W^l_{i,j} is computed before observing y_i, it is difficult for the model to learn the target token y_i's alignment to the source tokens, as discussed in Section 1.
As a result, it is necessary to develop novel methods for alignment induction. Such a method should (1) take into account the relationship among z^l_i, y_i and y_{i−1}, and (2) condition alignment induction on the to-be-aligned target token.

Alignment from Vanilla Transformer
We propose a novel framework to induce alignment from the vanilla Transformer model. We first discuss how to represent the to-be-aligned target token at each decoder layer, then present the alignment induction method, including: 1) alignment induction with attention weights or the LOO measure, and 2) a layer selection criterion that determines the layer from which to induce alignment. We denote the method that induces alignment with attention weights as align-att and with the LOO measure as align-loo.

Target Token Representation
As shown in Fig. 1, we denote the input to the first decoder layer at decoding step i as z^0_i, the decoder hidden state at layer l as z^l_i (1 ≤ l ≤ L), and the probability distribution p(y | y_{0:i−1}, x; θ) at the output layer as z^{L+1}_i. Regardless of the value of l, previous work (Li et al., 2019; Ding et al., 2019; Garg et al., 2019) computes the alignment scores for target token y_i using the attention weights or feature importance measures associated with the decoder representation z^l_i. This is not convincing, especially for small values of l.
At decoding step i, the input to the decoder is y_{i−1}, while at the output layer, it predicts the probability of y_i. Therefore, we assume that z^0_i is more relevant to y_{i−1}, while z^{L+1}_i is more relevant to y_i. As l increases from 0 to L + 1, the decoder gathers more information from the context, and may gradually change its representation from one more relevant to y_{i−1} to one more relevant to y_i. Therefore, for small values of l, z^l_i may better represent the input token y_{i−1} than the output token y_i.
In order to compute the alignment score of the target token after knowing its identity, we propose to induce alignment from decoder representations with the to-be-aligned target token as the input. Specifically, we represent y_i as z^l_{i+1} and induce word alignment from some layer l_b. Since knowing the target token y_i is important for accurate alignment induction (Section 1), we believe our method can induce better word alignment.

Alignment Induction
Given a source sentence x = {x_1, ..., x_{|x|}} and a target sentence y = {y_1, ..., y_{|y|}}, we first define the matrix of alignment scores S^l ∈ R^{|y|×|x|}, in which the element S^l_{i,j} is the alignment score of source word x_j to target word y_i, computed according to the relevance of x_j and z^l_{i+1}, the representation of y_i at the l-th layer:

S^l_{i,j} = r(z^l_{i+1}, x_j).

With this alignment scoring matrix S^l, we can generate the word alignment A following (Alkhouli et al., 2018; Li et al., 2019; Ding et al., 2019; Garg et al., 2019):

A_{i,j} = 1 if j = argmax_{j'} S^l_{i,j'}, and A_{i,j} = 0 otherwise,

where A_{i,j} = 1 indicates that y_i is aligned to x_j. The relevance score r in the equation above can be defined either with attention weights or with LOO measures.
Attention Weights In the Transformer, the attention weight W^l_{i+1,j} measures the relevance between h_j and z^l_{i+1}. Since h_j represents the source word x_j, we can use the attention weights to define the relevance score:

r(z^l_{i+1}, x_j) = W^l_{i+1,j},

where 1 ≤ l ≤ L because attention weights are computed at these layers.
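The only difference from the naive extraction is which attention row is read for a given target token. A minimal sketch with a hypothetical toy attention matrix, where row t (0-based) is the averaged attention at the decoding step whose input is y_t (with y_0 = bos):

```python
def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def naive_att(attn, i):
    """Align y_i (1-based) using the step that *outputs* y_i (row i-1),
    i.e. before the model has observed y_i."""
    return argmax(attn[i - 1])

def align_att(attn, i):
    """Align y_i using the step whose *input* is y_i (row i), so the
    alignment score is computed after observing the target token."""
    return argmax(attn[i])

# Toy matrix for |y| = 3 target tokens: 4 decoding steps, 4 source words.
attn = [
    [0.4, 0.3, 0.2, 0.1],  # step 1: input bos, output y_1
    [0.6, 0.2, 0.1, 0.1],  # step 2: input y_1, output y_2
    [0.1, 0.7, 0.1, 0.1],  # step 3: input y_2, output y_3
    [0.1, 0.1, 0.7, 0.1],  # step 4: input y_3, output eos
]
# For the same y_i, align-att reads one row later than the naive method.
```

For y_2, the naive method reads step 2's row while align-att reads step 3's row, which can yield different source positions.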

LOO Measure
The intuition behind this method is that when the decoder representation z^l_{i+1} is aligned to the source word x_j, the influence of masking x_j should be much higher for z^l_{i+1} than for other representations at the same layer. This approach is called leave-one-out (LOO). More specifically, we define the alignment score of source word x_j to decoder representation z^l_{i+1} as

r(z^l_{i+1}, x_j) = d( z^l_{i+1}(x), z^l_{i+1}(x^{(j,0)}) ),

where z^l_{i+1}(x) denotes the representation computed with the source sentence x, and x^{(j,0)} denotes the source sentence with x_j replaced by a word whose embedding is a zero vector. We use the Euclidean distance as d for layers 1 ≤ l ≤ L and the KL divergence for layer l = L + 1. In summary, we induce alignment from layer l with

A_{i,j} = 1 if j = argmax_{j'} r(z^l_{i+1}, x_{j'}), and A_{i,j} = 0 otherwise,

where r(z^l_{i+1}, x_j) is defined by the attention weights for the method align-att and by the LOO measures for the method align-loo.
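The LOO scoring for a single target token can be sketched as follows. The toy hidden-state vectors are hypothetical stand-ins; in the real model, each candidate requires one forward pass with the corresponding source word zeroed out.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def loo_align(z_full, z_masked):
    """z_full: decoder state z^l_{i+1} computed on the full source sentence.
    z_masked[j]: the same state when source word x_j is replaced by a
    zero-embedding word.  y_i is aligned to the source word whose removal
    perturbs the state the most."""
    scores = [euclidean(z_full, z) for z in z_masked]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 2-d states over 3 source words: masking word 1 moves the state most.
z_full = [1.0, 0.0]
z_masked = [[0.9, 0.0], [0.0, 1.0], [1.0, 0.1]]
```

Note that this costs |x| extra forward passes per sentence, which is why the attention-based method is much cheaper at induction time.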

Layer Selection Criterion
In this part, we propose a surrogate layer selection criterion to select the best layer for alignment induction without manually labelled word alignments. Experiments show that this criterion correlates well with the AER metric.
Given parallel sentence pairs (x, y), we train a source-to-target model θ_{x→y} and a target-to-source model θ_{y→x}. We assume that the word alignments induced from these two models should agree with each other (Cheng et al., 2016). Therefore, we evaluate the quality of an alignment by computing the AER score on the validation set, with the source-to-target alignment as the hypothesis and the target-to-source alignment as the reference. For each model, we can obtain K word alignments from K different layers (K = L for align-att and K = L + 1 for align-loo). In total, we obtain K × K AER scores. We select the pair with the lowest AER score, and the corresponding layers of the source-to-target and target-to-source models are the layers we use to induce word alignment at test time.


Alignment with Multi-task Learning

Inspired by Garg et al. (2019), we present a multi-task learning framework to train a novel Transformer model that can incorporate external alignments from GIZA++. With the guidance of GIZA++ alignments, this method can further improve word alignment induction. Note that this method is not an interpretation method for the vanilla Transformer, as it modifies the training process. Specifically, we add an alignment loss to the regular loss function, supervising one attention head, at the decoding step when the to-be-aligned target token is the decoder input and at layer l_b (the best layer selected by the layer selection criterion), to learn GIZA++ alignments. The alignment loss is defined as

L_align(θ) = − (1/|y|) Σ_{i=1}^{|y|} Σ_{j=1}^{|x|} G^p_{i,j} log W^{l_b}_{n_0, i+1, j},

where W^{l_b}_{n_0} is the attention matrix computed by an arbitrary alignment head n_0 at layer l_b, and G^p denotes the normalized reference alignment extracted with GIZA++. Combining this with the regular NLL loss, we train the Transformer with

L(θ) = L_NLL(θ) + λ L_align(θ),

where λ is a hyper-parameter that balances the two losses. At test time, we extract word alignment from the same head n_0 and the same layer l_b with

A_{i,j} = 1 if j = argmax_{j'} W^{l_b}_{n_0, i+1, j'}, and A_{i,j} = 0 otherwise.

Similar to Garg et al. (2019), the alignment loss can be computed in the same forward pass as the regular NLL loss, or with a second forward pass that removes the future mask on the decoder side (full-context). Full-context improves alignment extraction under their setting, because it considers the future context, especially the to-be-aligned target token, when computing alignment scores. However, it also brings two problems. First, training with two forward passes introduces additional computation cost. Second, with future context, alignment runs as a separate post-processing step after the full sentence translation is completed. Therefore, methods with full-context are not suitable for translation error analysis or for cases where alignment is computed during the decoding process, e.g., constrained decoding with attention (Hasler et al., 2018). Since our method already computes the alignment scores after observing the to-be-aligned target token, we believe future context is not as necessary under our setting. We compare our method without full-context (align-att-mtl) and with full-context (align-att-mtl-fullc) in the experiments.
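The supervised-attention objective described above can be sketched as follows. This is a toy illustration with hypothetical values, assuming the attention head's rows are already shifted so that row i corresponds to the step whose input is y_i, and that the GIZA++ reference G^p is row-normalized:

```python
import math

def alignment_loss(attn_head, giza_ref):
    """Cross-entropy between one attention head and normalized GIZA++
    alignments.  attn_head[i][j] and giza_ref[i][j] score target token i
    against source word j; each giza_ref row sums to 1 over aligned
    source words."""
    loss = 0.0
    for g_row, a_row in zip(giza_ref, attn_head):
        for g, a in zip(g_row, a_row):
            if g > 0:
                loss -= g * math.log(a)
    return loss / len(giza_ref)

def total_loss(nll, attn_head, giza_ref, lam=0.05):
    """Combined objective L = L_NLL + lambda * L_align; lambda (0.05 here)
    is an arbitrary illustrative value for the balancing hyper-parameter."""
    return nll + lam * alignment_loss(attn_head, giza_ref)

# Toy: two target tokens, two source words, uniform attention.
ref = [[1.0, 0.0], [0.0, 1.0]]
attn = [[0.5, 0.5], [0.5, 0.5]]
```

The loss pushes the supervised head's probability mass toward the GIZA++ links while the NLL term keeps translation quality intact.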

Settings
Dataset For a fair comparison, we follow Zenkel et al. (2019) in the data setup. We evaluate all approaches on the German-English, Romanian-English and French-English language pairs by measuring AER against the human alignment references. Since no validation set is provided in their setup, we follow Ding et al. (2019) and set the last 1,000 sentences of the training data before preprocessing as the validation set, which is used for training and best alignment layer selection. All models are trained in both translation directions and symmetrized with grow-diag-final (Koehn et al., 2005). See the appendix for details.
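The AER metric itself is computed from sure (S) and possible (P) reference links, with S ⊆ P. A minimal sketch over sets of (target, source) index pairs, using hypothetical toy links:

```python
def aer(hyp, sure, possible):
    """Alignment Error Rate: 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|),
    where sure links are also counted as possible (S ⊆ P).  Lower is
    better; a hypothesis covering all sure links exactly scores 0."""
    possible = possible | sure
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))

# Toy reference: two sure links and one extra possible link.
sure = {(1, 1), (2, 2)}
possible = {(2, 3)}
hyp = {(1, 1), (2, 3)}  # one sure link missed, one possible link hit
```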
NMT Systems We follow Ding et al. (2019) for training the NMT model. Specifically, we use fairseq-py to train the Transformer model, following the transformer_iwslt_de_en setting.
Baselines We have proposed methods to induce word alignment from the Transformer: two (align-att and align-loo) from the vanilla Transformer trained with the regular NLL loss, and two (align-att-mtl and align-att-mtl-fullc) from the Transformer trained with the multi-task learning framework. We compare our methods with 12 baselines: • fast-align offline: the fast-align method (Dyer et al., 2013) under the offline setting.
• fast-align online: the fast-align method (Dyer et al., 2013) under the online setting. Since all the neural methods including ours are run in an online setting at test time, we believe this is a slightly better baseline to compare with.
• Attention: the Attention method reported in Ding et al. (2019), extracting alignment from the last decoder layer.
• naive-loo: we follow the same process as align-loo, except that we use z^l_i to represent y_i. We find that l_b = L + 1 for all translation tasks.
• naive-att: we follow the same process as align-att, except that we use z^l_i to represent y_i. We find that l_b = L − 1 for all translation models.
• AddSGD: the method that explicitly adds an alignment module to Transformer reported in Zenkel et al. (2019).
For each sentence pair, naive-loo, align-loo and PD forward once with |x| + 1 masked sentence pairs as the input, while SmoothGrad and SD-SmoothGrad forward and backward once with m (m = 30 in Ding et al. (2019)) noisy sentence pairs as the input. All the other methods forward once with one sentence pair as the input. Therefore, the computation cost of the methods based on attention weights is much lower than that of the methods based on feature importance measures.

Results
Comparison with Baselines Table 1 compares our methods with all the baselines. It shows that our method align-att significantly outperforms all neural baselines that extract alignment from the vanilla Transformer (Modify: -). For example, it outperforms SD-SmoothGrad, the state-of-the-art method for extracting alignment from the vanilla Transformer, by 6.6-8.6 AER points across different language pairs, and the induction computation cost of align-att is also much lower than that of SD-SmoothGrad. Our method align-att also outperforms align-loo in terms of both alignment quality and induction computation cost. The success of align-att shows that the vanilla Transformer has captured alignment information in an implicit way, which can be revealed if the correct decoding step and layer are chosen to induce alignment. We also compare our methods with statistical alignment methods. align-att significantly outperforms fast-align under both the online and offline settings. It achieves slightly worse performance than GIZA++. Since align-att is better than align-loo across all datasets, we focus on experiments with align-att in the remaining sections.
We also compare our multi-task learning methods with Garg et al. (2019). Our methods with and without full-context obtain similar results, which differs from Garg et al. (2019): their method with full context (naive-attn-mtl-fullc) obtains significantly better results than that without full context (naive-attn-mtl). In naive-attn-mtl, the alignment scores for target token y_i are computed before observing y_i, so the remaining target tokens, especially y_i, are very important for better alignment. However, in our method align-att-mtl, the alignment scores are computed after observing y_i, de-emphasizing the remaining target tokens. Our method without full-context (align-att-mtl) obtains results comparable to the baseline method with full-context (naive-attn-mtl-fullc), with less training cost and less reliance on full-context. Finally, align-att-mtl also significantly outperforms all statistical methods, improving over GIZA++ by 0.4-2.7 AER points. In summary, our method align-att-mtl achieves the best alignment performance, with less computation cost and broader application scenarios. The results in Table 2 demonstrate that with align-att, the layers selected by the layer selection criterion are indeed the best layers from which to induce alignment. We also verify the layer selection criterion for the method align-loo; see the appendix for details.
Relevance Measure Verification To investigate the relevance between decoder hidden states and the corresponding input/output tokens, we design an experiment to probe whether the decoder hidden states contain the identity information of the input and output tokens, following Brunner et al. (2019). Formally, for decoder hidden state z^l_i, the input token is identifiable if there exists a function g such that y_{i−1} = g(z^l_i). We cannot prove the existence of g analytically. Instead, for each layer l we learn a projection function ĝ_l to project from the hidden state space to the input token embedding space, ŷ^l_i = ĝ_l(z^l_i), and then search for the nearest neighbour y_k within the same sentence. We say that the decoder hidden state can identify the input token if k = i − 1. Similarly, we follow the same process to project z^l_i into the output token embedding space and say that the decoder hidden state can identify the output token if k = i. We report the identifiability rate, defined as the percentage of correctly identified tokens.
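The nearest-neighbour step of this probe can be sketched as follows; the projection is abstracted as a callable (identity here, standing in for the learned linear or MLP projection), and the toy embeddings are hypothetical.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def identifiability_rate(hidden, embeddings, offset, project=lambda z: z):
    """hidden[t]: decoder state at step t; embeddings[k]: embedding of
    token y_k in the same sentence.  State t should identify token
    t + offset (offset -1 probes the input token y_{t-1}, offset 0 the
    output token y_t).  Returns the fraction of states whose projected
    nearest-neighbour embedding is the expected token."""
    hits, total = 0, 0
    for t, z in enumerate(hidden):
        k_true = t + offset
        if not 0 <= k_true < len(embeddings):
            continue  # e.g. the bos input of the first decoding step
        p = project(z)
        k_hat = min(range(len(embeddings)), key=lambda k: sq_dist(p, embeddings[k]))
        hits += int(k_hat == k_true)
        total += 1
    return hits / total

# Toy check: states identical to the output-token embeddings identify
# the output token (offset 0) perfectly.
emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
rate = identifiability_rate(emb, emb, 0)
```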
Fig. 2 presents the results on the validation set of the de→en translation. We try a naive baseline ĝ^naive_l(z^l_i) = z^l_i and two projection functions: a linear perceptron ĝ^lin_l and a non-linear multi-layer perceptron ĝ^mlp_l. The results show that with the trainable projection functions ĝ^lin_l and ĝ^mlp_l, all layers can identify the input tokens, although more hidden states can no longer be mapped back to their input tokens in higher layers. Besides, at the bottom layers, the input tokens remain identifiable while the output tokens are hard to identify, regardless of the projection function we use. This verifies that for the bottom layers, the hidden states are more relevant to the input token than to the output token. Finally, it is much easier to identify the input token. For example, when projecting with the MLP, all layers can identify more than 98% of the input tokens, whereas even for the best layer, we can only identify 83.5% of the output tokens. This observation verifies that computing the alignment scores of target word y_i according to hidden state z^l_{i+1} is better than with z^l_i, since z^l_i may not even be able to identify y_i.

AER v.s. BLEU During training, the vanilla Transformer gradually learns to align and translate. To analyze how the alignment behavior changes at different layers and for checkpoints with different translation quality, we plot AER on the test set v.s. BLEU on the validation set for different layers and checkpoints on the de→en translation. We compare the baseline naive-att and our method align-att, inducing alignments from all layers. naive-att aligns source tokens to the output token based on the current decoder hidden state (align output token), while align-att aligns source tokens to the input token (align input token).
The experimental results are shown in Fig. 3. We observe that at the beginning of training, layers 3 and 4 learn to align the input token, while layers 5 and 6 learn to align the output token. However, as the BLEU score increases, layer 4 tends to change from aligning the input token to aligning the output token, and layers 1 and 2 begin to align the input token. This indicates that the vanilla Transformer gradually learns to align the input token from the middle layers to the bottom layers. Besides, the ability of layer 6 to align the output token decreases. We hypothesize that layer 5 already has the ability to attend to the source tokens that are aligned to the output token, so the attention weights in layer 6 may capture other information needed for translation. Finally, for the models with the highest BLEU scores, layer 5 aligns the output token best and layer 3 aligns the input token best.

Alignment Example
In Fig. 4, we present an alignment example from the de-en alignment test set. Manual inspection of this example as well as others shows that our methods align-att and align-att-mtl tend to align more tokens than GIZA++, and align the beginning tokens better than naive-att. Besides, with multi-task learning to incorporate GIZA++ alignments, we find that align-att-mtl sometimes successfully induces aligned pairs that are captured by GIZA++ but not by align-att.

Related Work
Inducing word alignment from RNNSearch (Bahdanau et al., 2015) has been explored by a number of works. Bahdanau et al. (2015) are the first to show word alignment examples using attention in RNNSearch. Ghader and Monz (2017) further demonstrate that the RNN-based NMT system achieves alignment performance comparable to that of GIZA++. Alignment has also been used to improve NMT performance, especially in low-resource settings, by supervising the attention mechanisms of RNNSearch (Ghader and Monz, 2017; Chen et al., 2016; Liu et al., 2016; Alkhouli and Ney, 2017).
There are also a number of other studies that induce word alignment from the Transformer. Li et al. (2019) and Ding et al. (2019) claim that attention may not capture word alignment in the Transformer, and propose to induce word alignment with prediction difference (Li et al., 2019) or gradient-based measures (Ding et al., 2019). Zenkel et al. (2019) modify the Transformer architecture for better alignment induction by adding an extra alignment module that is restricted to attend solely to the encoder information to predict the next word. Garg et al. (2019) propose a multi-task learning framework to improve word alignment induction without decreasing translation quality, by supervising one attention head at the penultimate layer with GIZA++ alignments. Although these methods are reported to improve over layer-average baselines, they ignore that better alignment can be induced by computing alignment scores at the decoding step when the to-be-aligned target token is the input, and thus fail to fully induce the word alignment implicitly learned by the Transformer.

Conclusion
In this paper, we have presented a novel framework for alignment induction from a vanilla Transformer model. The basic idea is that it is better to induce alignment at the decoding step when the to-be-aligned target token is the input instead of the output. Our method with attention weights successfully induces satisfactory word alignments from the standard Transformer model, demonstrating that the Transformer indeed jointly learns to align and translate. We also combine the multi-task learning framework with our method and show that by incorporating GIZA++ alignments into the Transformer, we can achieve significantly better alignment results than GIZA++, even without the use of future context.


A.2 Layer Selection Criterion Verification for align-loo
We also check whether our proposed layer selection criterion can select the right layer to induce alignment with the method align-loo. We follow the same process as in Section 5.2. First, we determine the best layers l_b^{x→y} and l_b^{y→x} based on the layer selection criterion on the validation set, as shown in Table 4 (a). Then we evaluate the AER scores of the alignments induced from different layers on the test set (shown in Table 4 (b)), and find that the layers with the lowest AER scores are consistent with l_b^{x→y} and l_b^{y→x}. This experimental result further verifies the effectiveness of the unsupervised layer selection criterion.

A.3 Best Layer to Induce Alignment
We also list the best layer to induce word alignment with different methods, all under the transformer_iwslt_de_en configuration. It shows that compared with align-loo and align-att, naive-loo and naive-att tend to induce alignment from higher layers, which is consistent with our intuition in Section 3.1. Besides, we also observe that with attention weights as the relevance measure, the best layer from which to obtain alignment is relatively consistent for different translation models under the same configuration. With the method align-att, we have l_b = 3 for all translation models. With the method naive-att, we have l_b = 5 for all translation models.
A.4 Alignment with Transformer-Big
Note that Transformer-small is trained on the dataset discussed in Section 5.1, while Transformer-big is provided by fairseq-py and trained on WMT19 for de→en translation, WMT16 for en→de translation and WMT14 for en→fr translation. In this experiment, we test whether good word alignment can be extracted from the Transformer model under other configurations. Instead of training the model from scratch ourselves, we utilize the pretrained Transformer models provided by fairseq-py. All the models are based on the transformer_wmt_en_de_big configuration, so we denote them as Transformer-big. We use the pretrained Transformer models on the WMT14 en→fr, WMT16 en→de and WMT19 de→en translations. We denote the small Transformer we train as Transformer-small, and compare the two configurations in Table 6. It shows that although the BLEU scores obtained by Transformer-small and Transformer-big can differ, the AER scores of the two models with align-att are similar. We also list the layer from which align-att extracts alignment. Transformer-small and Transformer-big both have 6 decoder layers. However, the best layer for Transformer-small is always layer 3, while for Transformer-big, the best layer is layer 2 for the de→en and en→de translations. This indicates that with more data and a bigger model, the Transformer tends to acquire alignment information at lower layers.

A.5 More Alignment Examples
See next page.

Figure 1: Illustration of the forced decoding process with the Transformer at decoding steps i − 1 and i.
Table 2: Layer selection criterion verification with align-att on de-en alignment. (a) For each cell, we induce the hypothesis alignment from the de→en translation and the reference alignment from the en→de translation; l_b = 3 for both translation directions in this table. (b) Test AER when inducing word alignment from different layers; layer 3 induces the best alignment for both translation directions, which verifies the value of l_b selected in (a).

Layer Selection Criterion To test whether the layer selection criterion can select the right layer to induce alignment, we first determine the best layers l_b^{x→y} and l_b^{y→x} based on the layer selection criterion. Then we evaluate the AER scores of the alignments induced from different layers on the test set, and check whether the layers with the lowest AER scores are consistent with l_b^{x→y} and l_b^{y→x}.

Figure 2: Identifiability rate of the input and output tokens for decoder hidden states at different layers.

Figure 3: AER on the test set v.s. BLEU on the validation set for the de→en translation, evaluated with different checkpoints.

Figure 4: One example from the de-en alignment test set. Gold alignments are shown in (1); blue squares and light blue squares represent sure and possible alignments respectively.
(a) For each cell, we induce the hypothesis alignment from the de→en translation and the reference alignment from the en→de translation; l_b = 3 for the de→en translation and l_b = 2 for the en→de translation in this table. (b) Test AER when inducing word alignment from different layers; layer 3 induces the best alignment for the de→en translation and layer 2 the best for the reverse direction, which verifies the values of l_b selected in (a).

Figure 5: Three alignment examples from the de-en alignment test set. All are symmetrized alignment results.

Table 1: AER on the alignment test sets with different alignment methods. bidir are symmetrized alignment results. Since we use the same dataset and preprocessing as Ding et al. (2019), the results for Attention, SmoothGrad, SD-SmoothGrad and AddSGD are quoted from Ding et al. (2019). Since Garg et al. (2019) report that BPE is beneficial for statistical alignment models, which is consistent with our own experiments, we report the BPE-based fast-align and GIZA++ results. The column Modify indicates whether the method modifies the training loss (loss) or the Transformer architecture (arch). The column Fullc denotes whether full-context is used to induce alignment at test time. The best bidir neural results that induce alignment without modifying the loss or architecture are underlined, and the comparable best bidir results among all methods are in boldface.

Table 3: Number of sentences in each dataset.

Table 4: Layer selection criterion verification with align-loo on de-en alignment.

Table 5: The best layer to induce alignment with different methods, based on the layer selection criterion.

Table 6: Comparison of Transformer-small and Transformer-big.