Attention Is Not Only a Weight: Analyzing Transformers with Vector Norms



1 Introduction
Transformers (Vaswani et al., 2017; Devlin et al., 2019; Yang et al., 2019; Lan et al., 2020) have improved the state of the art in a wide range of natural language processing tasks. The success of these models has not yet been sufficiently explained; hence, substantial research has focused on assessing their linguistic capabilities (Rogers et al., 2020; Clark et al., 2019).
One of the main features of Transformers is that they utilize an attention mechanism without the use of recurrent or convolutional layers. The attention mechanism computes an output vector by accumulating relevant information from a sequence of input vectors. Specifically, it assigns attention weights (i.e., relevance) to each input and sums up the input vectors based on their weights. The analysis of correlations between attention weights and various linguistic phenomena (i.e., weight-based analysis) is a prominent research area (Clark et al., 2019; Kovaleva et al., 2019; Reif et al., 2019; Lin et al., 2019; Mareček and Rosa, 2019; Htut et al., 2019; Raganato and Tiedemann, 2018; Tang et al., 2018).
This paper first shows that weight-based analysis, the common approach of simply tracking attention weights, is insufficient for analyzing the attention mechanism. The attention mechanism can be expressed as a weighted sum of linearly transformed vectors (Section 2.2); however, weight-based analysis ignores the effect of the transformed vectors. We propose a norm-based analysis that considers these previously ignored factors (Section 3). In this analysis, we measure the norms (lengths) of the vectors that are summed to compute the output vector of the attention mechanism.
Using the norm-based analysis of BERT (Section 4), we interpret the internal workings of the model in more detail than weight-based analysis allows. For example, weight-based analyses (Clark et al., 2019; Kovaleva et al., 2019) report that specific tokens, such as periods, commas, and special tokens (e.g., the separator token [SEP]), tend to receive high attention weights. However, our norm-based analysis found that the information collected from the vectors corresponding to special tokens is considerably smaller than the weight-based analysis suggests, and that the large attention weights of these vectors are canceled by other factors. Additionally, we found that BERT controls the contribution of frequent, less informative words by controlling the norms of their vectors.
In the analysis of a Transformer-based NMT system (Section 5), we reinvestigated whether accurate word alignments can be extracted from the source-target attention. The weight-based results of Li et al. (2019), Ding et al. (2019), and Zenkel et al. (2019) have empirically shown that word alignments induced by the source-target attention of Transformer-based NMT systems are noisy. Our experiments show that considerably more accurate alignments can be extracted by focusing on the vector norms.
The contributions of this study are as follows:
• We propose a novel method of analyzing the attention mechanism based on vector norms (norm-based analysis). The method considers both the attention weights and the previously ignored factor, the norms of the transformed vectors.
• Our norm-based analysis of BERT reveals that (i) the attention mechanisms pay considerably less attention to special tokens than observations based solely on attention weights (weight-based analysis) suggest, and (ii) the attention mechanisms tend to discount frequent words.
• Our norm-based analysis of a Transformer-based NMT system reveals that reasonable word alignments can be extracted from the source-target attention, in contrast to previous weight-based results.
The code for our experiments is publicly available.

2 Background

2.1 Attention mechanism
Attention is a core component of Transformers, which consist of several layers, each containing multiple attention mechanisms ("heads"). We focus on analyzing the inner workings of these heads.
As illustrated in Figure 1, each attention head gathers relevant information from the input vectors. A vector is updated through vector transformations, attention weights, and a summation of vectors. Mathematically, attention computes each output vector y_i ∈ R^d from the corresponding pre-update vector ỹ_i ∈ R^d and a sequence of input vectors X = {x_1, ..., x_n} ⊆ R^d:

y_i = ( Σ_{j=1}^{n} α_{i,j} v(x_j) ) W_O ,    (1)

α_{i,j} := softmax_{x_j ∈ X} ( q(ỹ_i) k(x_j)^⊤ / √d ) ,    (2)

where α_{i,j} is the attention weight assigned to the token x_j for computing y_i, and q(·), k(·), and v(·) are the query, key, and value transformations, respectively.

[Figure 1: Overview of the attention mechanism in Transformers. The sizes of the colored circles illustrate the value of the scalar or the norm of the corresponding vector.]
Attention gathers the value vectors v(x_j) based on the attention weights and then applies the matrix multiplication with W_O ∈ R^{d×d} (Figure 1). Boldface letters such as x denote row (not column) vectors, following the notation of Vaswani et al. (2017).
In self-attention, the input vectors X and the pre-update vector ỹ_i are the previous layer's output representations. In source-target attention, X corresponds to the representations of the encoder, and the pre-update vector ỹ_i (and the updated vector y_i) corresponds to the vector of the i-th input token of the decoder.

2.2 Attention is a weighted sum of vectors
With a simple reformulation, one can observe that the attention mechanism computes a weighted sum of transformed input vectors. Owing to the linearity of the matrix product, we can rewrite Equation 1 as

y_i = Σ_{j=1}^{n} α_{i,j} f(x_j) ,    (3)

f(x) := (x W_V + b_V) W_O .    (4)

Equation 3 shows that the attention mechanism first transforms each input vector x_j to generate f(x_j), computes the attention weights α_{i,j}, and then computes the sum Σ_j α_{i,j} f(x_j) (see Figure 2).

[Figure 2: Attention as a weighted sum of transformed vectors: input vectors, transformed vectors, attention weights, and weighted vectors.]
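The reformulation in Equations 3 and 4 can be checked numerically. The following is a minimal NumPy sketch with toy dimensions and random parameters (it is not the authors' code); it verifies that summing the value vectors and then applying W_O gives the same output as the weighted sum of the transformed vectors f(x_j).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8  # toy sequence length and hidden size

# Row-vector convention, as in the paper: x W + b.
X = rng.normal(size=(n, d))                  # input vectors x_1..x_n
W_V = rng.normal(size=(d, d))
b_V = rng.normal(size=d)
W_O = rng.normal(size=(d, d))
alpha = rng.random(n)
alpha /= alpha.sum()                          # attention weights for one output position

# Equation 1: attention-weight the value vectors v(x) = x W_V + b_V, then apply W_O.
y_standard = (alpha @ (X @ W_V + b_V)) @ W_O

# Equation 3: weighted sum of transformed vectors f(x) = (x W_V + b_V) W_O.
f = (X @ W_V + b_V) @ W_O
y_decomposed = (alpha[:, None] * f).sum(axis=0)

assert np.allclose(y_standard, y_decomposed)
```

The equivalence holds for any weights and parameters because matrix multiplication distributes over the sum.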

2.3 Problems encountered in weight-based analysis
The attention mechanism is designed to update representations by gathering relevant information from the input vectors. Prior studies have analyzed attention by focusing on the attention weights to ascertain which input vectors contribute to the output (weight-based analysis) (Clark et al., 2019; Kovaleva et al., 2019; Reif et al., 2019; Lin et al., 2019; Mareček and Rosa, 2019; Htut et al., 2019; Raganato and Tiedemann, 2018; Tang et al., 2018). Analyses based solely on attention weights rest on the assumption that the larger the attention weight of an input vector, the higher its contribution to the output. However, this assumption disregards the magnitudes of the transformed vectors. Figure 2 illustrates the problem of neglecting the effect of f(x_j). Suppose the transformed vector f(x_1) for input x_1 is very small (‖f(x_1)‖ ≈ 0), while its attention weight α_{i,1} is large. The small vector α_{i,1} f(x_1) then contributes little to the output vector y_i, because y_i is the sum of the vectors α_{i,j} f(x_j), and a larger vector contributes more to the sum. Conversely, the large α_{i,3} f(x_3) dominates the output y_i. In this case, considering only the attention weights leads to the wrong interpretation that input x_1 contributes highly to output y_i, when in fact x_1 has hardly any effect on y_i.
Analyses based on attention weights have not provided clear results in some cases. For example, Clark et al. (2019) reported that input vectors for separator tokens [SEP] tend to receive remarkably large attention weights in BERT, while changing the magnitudes of these weights does not affect the masked-token prediction of BERT. Such results can be attributed to the aforementioned issue of focusing only on attention weights.
3 Proposal: norm as a degree of attention

As described in Section 2.3, analyzing the attention mechanism with attention weights alone neglects the effect of the transformed vector f(x_j), which, as we discuss later, has a significant impact.
Herein, we propose measuring the norm of the weighted transformed vector, ‖α f(x)‖, given by Equation 3, to analyze the behavior of the attention mechanism. Unlike previous studies, we analyze the behaviors of the norms ‖α f(x)‖ and ‖f(x)‖ as well as the weight α to gain more in-depth insights into the functioning of attention. We call the proposed method norm-based analysis, and the method that analyzes only the attention weights weight-based analysis.
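A sketch of the proposed measure, in NumPy with a hypothetical helper name (`norm_based_scores` is ours, not the paper's): since α_{i,j} is a non-negative scalar, ‖α_{i,j} f(x_j)‖ factorizes as α_{i,j} · ‖f(x_j)‖.

```python
import numpy as np

def norm_based_scores(alpha, f_x):
    """Degree of attention ||alpha_{i,j} f(x_j)|| for every (output i, input j) pair.

    alpha: (n_out, n_in) non-negative attention weights.
    f_x:   (n_in, d) transformed input vectors f(x_j).
    Because alpha_{i,j} >= 0 is a scalar, the norm equals alpha_{i,j} * ||f(x_j)||."""
    return alpha * np.linalg.norm(f_x, axis=-1)[None, :]

# Toy case: a large weight on a near-zero transformed vector contributes little.
alpha = np.array([[0.9, 0.1]])
f_x = np.array([[0.01, 0.0],   # tiny transformed vector
                [1.0, 1.0]])   # large transformed vector
scores = norm_based_scores(alpha, f_x)
assert scores[0, 1] > scores[0, 0]  # token 2 contributes more despite its small weight
```

The toy case reproduces the situation in Figure 2: the weight-based view would rank token 1 first, while the norm-based score ranks token 2 first.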
In Sections 4 and 5, we provide insights into the working of Transformers using norm-based analysis. Appendix A explains that our norm-based analysis can also be effectively applied to an entire multi-head attention mechanism.

4 Experiments: BERT
First, we show that the previously ignored transformed-vector norm affects the analysis of attention in BERT (Section 4.1). Applying our norm-based analysis, we re-examine the previous reports on BERT obtained by weight-based analysis (Section 4.2). Next, we demonstrate the previously overlooked properties of BERT (Section 4.3).

4.1 Does ‖f(x)‖ have an impact?
We analyzed the coefficient of variation (CV) of the previously ignored factor ‖f(x)‖ to first demonstrate the degree to which ‖α f(x)‖ differs from the weight α. We computed the CV of ‖f(x)‖ over all the data for each head. Table 1 shows that the average CV is 0.22; that is, the norm ‖f(x)‖ typically varies from 0.78 to 1.22 times its average value. Thus, there is a difference between the weight α and ‖α f(x)‖ due to the dispersion of ‖f(x)‖, which motivated us to consider ‖f(x)‖ in the attention analysis. Appendix B presents the detailed results.
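The CV computation can be sketched as follows; the toy norm values are illustrative only and are not taken from the paper's data.

```python
import numpy as np

def coefficient_of_variation(values):
    """CV := sigma / mu, a standardized (scale-invariant) measure of dispersion."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

# Toy norms ||f(x)|| for one head. A CV near 0.13 here means the norms
# typically range over roughly 0.87-1.13 times their mean; the paper's
# reported average CV of 0.22 implies a 0.78-1.22 range.
norms = np.array([0.8, 1.0, 1.2, 1.0, 0.9, 1.1])
cv = coefficient_of_variation(norms)
```

Because CV divides by the mean, it lets heads with different typical norm magnitudes be compared on the same scale.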

4.2 Re-examining previous observations
In this section, applying our norm-based analysis, we reinvestigate the previous observations of Clark et al. (2019), who analyzed BERT using weight-based analysis.
Settings: First, all the data were fed into BERT. Then, the weight α and the norm ‖α f(x)‖ were collected from each head. We used the PyTorch implementation of BERT-base (uncased) released at https://github.com/huggingface/transformers and the analysis data released at https://github.com/clarkkev/attention-analysis. Following Clark et al. (2019), we report the results for the following token categories: [CLS], [SEP], punctuation (periods and commas), and the other tokens. The coefficient of variation (CV) is a standardized (scale-invariant) measure of dispersion, defined as the ratio of the standard deviation σ to the mean µ; CV := σ/µ.

Results: Our weight-based analysis confirmed that the specific tokens ([CLS], [SEP], and punctuation) have remarkably large attention weights, which is consistent with the report of Clark et al. (2019). In contrast, our norm-based analysis demonstrated that the contributions of the vectors corresponding to these tokens are generally small (Figure 3b). This result demonstrates that the size of the transformed vector, ‖f(x)‖, plays a considerable role in controlling the amount of information obtained from these specific tokens. Clark et al. (2019) hypothesized that if the necessary information is not present in the input vectors, BERT assigns large weights to [SEP], which appears in every input sequence, to avoid incorporating any additional information via attention; they called this operation no-operation (no-op). However, it is unclear whether assigning large attention weights to [SEP] actually realizes the operation of collecting little information from the input sequence.
Our norm-based analysis demonstrates that the amount of information obtained from the vectors corresponding to [SEP] is small (Figure 3b). This result supports the interpretation that BERT conducts "no-op," in which attention to [SEP] serves as a signal to collect nothing. We hope that our norm-based analysis can likewise provide better interpretations of other existing findings.

Analysis of the relationship between α and ‖f(x)‖: It remains unclear how attention collects only a little information while assigning a high attention weight to a specific token such as [SEP].
Here, we demonstrate an interesting trend: α and ‖f(x)‖ cancel each other out on these tokens. Table 2 shows the Spearman rank correlation coefficient between α and ‖f(x)‖ for the vectors in each category. The weight α and the norm ‖f(x)‖ have a negative correlation for [CLS], [SEP], periods, and commas. This cancellation allows a head to collect only a little information from these tokens even when it assigns them large weights. Figure 4 illustrates the contrast between α and ‖f(x)‖ for [SEP] in each head; for most heads, α and ‖f(x)‖ clearly negate each other's magnitudes. A similar trend was observed for [CLS], periods, and commas, whereas no significant trend was observed for the other tokens (see Appendix D.3). Figure 5 shows 1% randomly selected pairs of α and ‖f(x)‖ in each word category. Even when the same weight α is assigned, ‖f(x)‖ can vary, suggesting that α and ‖f(x)‖ play different roles in attention.
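The rank correlation behind Table 2 can be sketched as follows. This is a generic, tie-free Spearman implementation (the helper name `spearman_rho` and the toy values are ours); the paper does not specify its implementation.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks.

    Minimal version assuming no ties (ties would need midranks)."""
    ranks = lambda v: np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Toy illustration of the cancellation on [SEP]-like tokens:
# the larger the weight, the smaller the transformed-vector norm.
alpha  = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
fnorms = np.array([0.2, 0.3, 0.5, 0.8, 1.0])
rho = spearman_rho(alpha, fnorms)
assert abs(rho + 1.0) < 1e-9  # perfectly anti-monotone
```

A strongly negative rho, as in the toy data, is exactly the cancellation pattern reported for [CLS], [SEP], and punctuation.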

4.3 Relation between frequency and ‖f(x)‖
In the previous section, we demonstrated that ‖f(x)‖ is small for specific tokens (e.g., [SEP]). Given the high frequencies of these word types, we hypothesized that BERT controls the contributions of highly frequent, less informative words by adjusting the norm ‖f(x)‖.
Settings: First, all the data were fed into the model. Then, for each input token t, we collected the weight α and the norm ‖f(x)‖ and averaged each over all the heads to analyze the trend of the entire model. Let r(·) be a function that returns the frequency rank of a given word. We analyzed the relationship of r(t) with α and ‖f(x)‖.

Results:
The Spearman rank correlation coefficient between the frequency rank r(t) and ‖f(x)‖ was 0.75, indicating a strong positive correlation. In contrast, there was no correlation (ρ = 0.06) between r(t) and α. Visualizations of these relationships are shown in Appendix D.4. These results demonstrate that the self-attentions in BERT reduce the information obtained from highly frequent words by adjusting ‖f(x)‖, not α. This frequency-based effect is consistent with the intuition that highly frequent words, such as stop words, are unlikely to play an important role in solving the pre-training tasks (masked-token prediction and next-sentence prediction).

5 Experiments: Transformer for NMT
Additionally, we analyzed the source-target attention in a Transformer-based NMT system. One major research topic in the NMT field is whether NMT systems internally capture word alignments between source and target texts, and if so, how these alignments can be extracted from black-box NMT systems. Li et al. (2019), Ding et al. (2019), and Zenkel et al. (2019) empirically showed, using the weight-based method, that word alignments induced by the attention of the Transformer are noisy. In this section, we analyze the source-target attention using the vector norms ‖α f(x)‖ and demonstrate that clean alignments can be extracted from it. Word alignments can provide rich information for the users of NMT systems (Ding et al., 2019).

Settings: Following Ding et al. (2019), we trained a Transformer-based NMT system for German-to-English translation on the Europarl v7 corpus. Next, we extracted word alignments from α and ‖α f(x)‖ under the forced decoding setup. Finally, we evaluated the derived alignments using the alignment error rate (AER) (Och and Ney, 2000); a low AER score indicates that the extracted word alignments are close to the reference. We used the gold alignment dataset provided by Vilar et al. (2006). Experiments were performed with five random seeds, and the average AER scores are reported. The experimental settings are detailed in Appendix E.
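For reference, the AER of Och and Ney (2000) can be sketched as below. This is a generic implementation of the published definition, not the paper's evaluation code; alignments are represented as sets of (source index, target index) pairs, with sure links assumed to be a subset of possible links.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """AER(A; S, P) = 1 - (|A & S| + |A & P|) / (|A| + |S|), per Och and Ney (2000)."""
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s  # sure links are also possible
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

sure = {(0, 0), (1, 1)}
possible = sure | {(2, 1)}
assert alignment_error_rate(sure, sure, possible) == 0.0      # perfect hypothesis
assert alignment_error_rate({(2, 0)}, sure, possible) == 1.0  # entirely wrong hypothesis
```

Predicting only sure links that are correct yields AER 0; links outside the possible set are penalized twice (once in each intersection term they miss).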

5.1 Alignment extraction from attention
Weights or norms: A typical alignment extraction method uses attention weights (Li et al., 2019; Ding et al., 2019; Zenkel et al., 2019). Specifically, given a source-target sentence pair {s_1, ..., s_J} and {t_1, ..., t_I}, word alignment is estimated by finding the source word s_j that has the highest weight when generating a target word t_i. We call this method weight-based alignment extraction. In contrast, we propose a norm-based alignment extraction method that extracts word alignments based on ‖α f(x)‖ instead of α. Formally, in these methods, the source word s_j with the highest attention weight or norm during the generation of target word t_i is extracted as the word aligned with t_i:

align(t_i) = argmax_{s_j} α_{i,j}    (weight-based),
align(t_i) = argmax_{s_j} ‖α_{i,j} f(x_j)‖    (norm-based).    (5)

In Section 5.2, following Li et al. (2019), we analyze the word alignments obtained from each layer by integrating the H heads within the same layer:

align(t_i) = argmax_{s_j} Σ_{h=1}^{H} α^h_{i,j}    (weight-based),
align(t_i) = argmax_{s_j} ‖Σ_{h=1}^{H} α^h_{i,j} f^h(x_j)‖    (norm-based),    (6)

where f^h(x_j) and α^h_{i,j} are the transformed vector and the attention weight at the h-th head, respectively.

[Figure 6: An example of the behavior of the source-target attentions in an NMT system (German-to-English). Attentions in the earlier layers focus on the source word "ein," aligned with the input word "a," while those in the later layers focus on the source word "Schüler," aligned with the output word "student."]
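The two head-level extraction rules can be sketched as a single function (the helper name `extract_alignment` and the toy matrices are ours, not the paper's code):

```python
import numpy as np

def extract_alignment(alpha, f_x=None):
    """Align each target position i to the source position with the highest score.

    alpha: (I, J) source-target attention weights for one head.
    f_x:   (J, d) transformed source vectors. If given, use the norm-based score
           ||alpha_{i,j} f(x_j)|| = alpha_{i,j} * ||f(x_j)||; otherwise use the
           weight-based score alpha_{i,j}."""
    scores = alpha if f_x is None else alpha * np.linalg.norm(f_x, axis=-1)[None, :]
    return scores.argmax(axis=1)

# Toy case: the largest weight points at source token 0, but its transformed
# vector is tiny, so the norm-based score points at source token 1 instead.
alpha = np.array([[0.6, 0.4]])
f_x = np.array([[0.1, 0.0],
                [1.0, 0.0]])
assert extract_alignment(alpha).tolist() == [0]        # weight-based
assert extract_alignment(alpha, f_x).tolist() == [1]   # norm-based
```

The toy case shows how the two rules can disagree: the norm-based score discounts a source token whose transformed vector has been shrunk.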
Alignment with input or output word: In our preliminary experiments (Appendix E.3), we observed that the behavior of the source-target attention of the decoder differs between the earlier and later layers. As shown in Figure 6, when decoding the word t_{i+1} with the input t_i, attention heads in the earlier layers assign large weights or norms to the s_j corresponding to the input word t_i ("a"), whereas those in the later layers assign large values to the s_j corresponding to the output word t_{i+1} ("student"). Based on this observation, we explored two settings for alignment extraction: alignment with output (AWO) and alignment with input (AWI). The AWO setting refers to the approach in Equation 5: alignments (s_j, t_i) are extracted by finding the source word s_j that gains the highest weight (norm) when outputting the target word t_i.
In the AWI setting, alignments (s_j, t_i) are extracted by finding the source word s_j that gains the highest weight (norm) when the word t_i is input (i.e., when predicting the word t_{i+1}). Formally, alignment in the AWI setting replaces the score at decoding step i in Equation 5 with the score at step i+1; for the norm-based method,

align(t_i) = argmax_{s_j} ‖α_{i+1,j} f(x_j)‖ .

5.2 Comparative experiments
We compared the quality of the alignments obtained by six methods: weight-based and norm-based extraction, each in the AWO and AWI settings, as well as two baselines, a gradient-based method and the word aligner fast_align. We report the best and averaged AER scores across the layers. In addition, we report the AER score at the head and at the layer with the highest average ‖α f(x)‖ for the norm-based extraction; the average ‖α f(x)‖ of a layer was determined by the sum of the average ‖α f(x)‖ at each head in the layer. The settings are detailed in Appendix E.2.

The AER scores of each method are listed in Table 3. The results show that the word alignments extracted using the proposed norm-based approach are more reasonable than those extracted using the weight-based approach. Additionally, better word alignments were extracted in the AWI setting than in the AWO setting. The alignment extracted using the layer with the highest average ‖α f(x)‖ in the AWI setting is better than that of the gradient-based method and competitive with that of the existing word aligner fast_align; this also holds at the head with the highest average ‖α f(x)‖, where the AER averaged over five seeds in the AWI setting was 35.5, with four of the five seeds achieving scores between 23.6 and 25.7 and the remaining seed scoring 77.5. These results show that much cleaner word alignments can be extracted from a Transformer-based NMT system than reported by existing research. The primary reason for the difference between the weight- and norm-based results is analogous to the finding discussed in Section 4.2: although some specific tokens, such as </s> (the special token for the end of the sentence), tended to receive large attention weights, their transformed vectors were adjusted to be small, as shown in Figure 7.

5.3 Relationship between norms and alignment quality
We further analyze the relationship between ‖α f(x)‖ and the AER scores at the head level. Figures 8a and 8b show the AER scores of the alignments obtained by the norm-based extraction at each head in the AWO and AWI settings. Figure 8c shows the average ‖α f(x)‖ at each head; a small ‖α f(x)‖ implies that α and ‖f(x)‖ tend to cancel out in that head.
Comparing Figures 8a and 8c, the average ‖α f(x)‖ and the AER scores in the AWI setting show a clear negative correlation. This is consistent with Table 3, where the head or the layer with the highest average ‖α f(x)‖ provides clean alignments in the AWI setting. This result suggests that Transformer-based NMT systems may rely on specific heads to align source and target tokens. It is also consistent with existing reports that pruning some attention heads in Transformers does not degrade performance and can even improve it (Michel et al., 2019; Kovaleva et al., 2019). In contrast, in the AWO setting (Figures 8b and 8c), no such negative correlation is observed; rather, a positive correlation is observed (Spearman's ρ = 0.56; Pearson's r = 0.55). Indeed, in the AWO setting, the alignments extracted from the head/layer with the highest ‖α f(x)‖ are considerably worse than those from the other settings in Table 3. Investigating the reason for these contrasting results is left for future work. In Appendix F, we also present the results of a model with a different number of heads.
6 Related work

6.1 Probing of Transformers

Transformers are used for many NLP tasks. Many studies have probed their inner workings to understand the mechanisms underlying their success (Rogers et al., 2020; Clark et al., 2019).
There are mainly two probing perspectives for investigating these models, differing in whether the target of the analysis is the per-token level or token-to-token interactions. The present study is closely related to the latter group; we have provided insights into the token-to-token attention in Transformer-based systems.

6.2 Analyzing the token-to-token interaction
Two types of methods are mainly used to analyze the token-to-token interactions in Transformers: one tracks the attention weights, and the other checks the gradient of the output with respect to the input of the attention mechanism.

Weight-based analysis: Brunner et al. (2020) introduced "effective attention," an upgraded form of weight-based analysis. Their proposal is similar to ours in that they exclude attention weights that do not affect the output owing to the transformation f and the input x. However, our proposal differs from theirs in several aspects. We aim to analyze the behavior of the whole attention mechanism more accurately, whereas they aim to make the attention weights more accurate. Furthermore, the effectiveness of their approach depends on the length of the input sequence, whereas ours has no such limitation (see Appendix G). Additionally, we incorporate the scaling effects of f and x, whereas Brunner et al. (2020) considered only a binary effect: either a weight is canceled or it is not.

Gradient-based analysis:
In gradient analysis, the contribution of the input to the output of the attention mechanism is calculated using the norm of the gradient matrix between the input and the output vector (Pascual et al., 2020). Intuitively, such gradient-based methods measure the change in the output vector with respect to perturbations of the input vector. Estimating the contribution of a to b = ka by computing the gradient ∂b/∂a (= k) is analogous to estimating the contribution of x to y = α f(x) by observing only the attention weight α. (For simplicity, we consider the linear example b = ka; we are aware that there is a gap between the two examples in terms of linearity, and further exploration of the connection to the gradient-based method is needed.) The two approaches share the same kind of problem: both ignore the magnitude of the input, a or f(x).

7 Conclusions and future work
This paper showed that the attention weight is only one of the two factors that determine the output of attention. We proposed incorporating the other factor, the transformed input vectors, into the analysis. Using our norm-based method, we provided a more detailed interpretation of the inner workings of Transformers than studies using weight-based analysis. We hope that this paper inspires researchers to take a broader view of the possible methodological choices for analyzing the behavior of Transformer-based models.
We believe that these findings can provide insights not only for interpreting the behaviors of black-box NLP systems but also for developing more sophisticated Transformer-based systems. One possible direction is to design an attention mechanism that can explicitly collect almost no information from an input sequence, which current systems achieve by exploiting the [SEP] token.
In future work, we plan to apply our norm-based analysis to attention in other models, such as fine-tuned BERT, RoBERTa, and ALBERT (Lan et al., 2020). Furthermore, we expect to extend the scope of the analysis from the attention mechanism to the entire Transformer architecture to better understand the inner workings and linguistic capabilities of the current powerful NLP systems.

A Multi-head attention and the norm-based analysis
Our norm-based analysis is applicable to the multi-head attention mechanism implemented in Transformers. The i-th output of the multi-head attention mechanism, y_i^integrated, is calculated as follows:

y_i^integrated = Σ_{h=1}^{H} ( Σ_{j=1}^{n} α^h_{i,j} (x_j W_{V,h} + b_{V,h}) ) W_{O,h} + b_O ,    (7)

where α^h_{i,j}, W_{V,h}, b_{V,h}, and W_{O,h} are the counterparts of α_{i,j}, W_V, b_V, and W_O in Equations 3 and 4 for each head h, and n is the number of input vectors. Owing to the linearity of the matrix product, Equation 7 can be rewritten as

y_i^integrated = Σ_{j=1}^{n} Σ_{h=1}^{H} α^h_{i,j} f^h(x_j) + b_O ,  where  f^h(x) := (x W_{V,h} + b_{V,h}) W_{O,h} .    (10)

As shown in Equation 10, the multi-head attention mechanism is also linearly decomposable, and one can analyze the flow of information from the j-th input vector to the i-th output vector by measuring ‖Σ_h α^h_{i,j} f^h(x_j)‖. In Section 5, we used ‖Σ_h α^h_{i,j} f^h(x_j)‖ to extract the alignments from each layer's multi-head attention.
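The multi-head decomposition can also be checked numerically. Below is a minimal NumPy sketch with toy dimensions and random per-head parameters (a simplification: each head keeps the full dimension d, whereas real models use d/H per head); it verifies that summing the head outputs (Equation 7) equals the token-wise decomposition (Equation 10).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, H = 4, 8, 2
d_h = d  # per-head dimension (toy choice; real models use d / H)

X = rng.normal(size=(n, d))   # input vectors x_1..x_n
b_O = rng.normal(size=d)      # output bias of the multi-head block
heads = [dict(W_V=rng.normal(size=(d, d_h)),
              b_V=rng.normal(size=d_h),
              W_O=rng.normal(size=(d_h, d)),
              alpha=(lambda a: a / a.sum())(rng.random(n)))
         for _ in range(H)]

# Equation 7: per head, attention-weight the value vectors, apply W_{O,h},
# then sum the head outputs and add the bias b_O.
y_standard = sum((h["alpha"] @ (X @ h["W_V"] + h["b_V"])) @ h["W_O"]
                 for h in heads) + b_O

# Equation 10: sum over tokens and heads of alpha^h_{i,j} f^h(x_j), plus b_O.
def f_h(h, x):
    return (x @ h["W_V"] + h["b_V"]) @ h["W_O"]

y_decomposed = sum(h["alpha"][j] * f_h(h, X[j])
                   for h in heads for j in range(n)) + b_O

assert np.allclose(y_standard, y_decomposed)
```

As in the single-head case, the equivalence follows from the linearity of the matrix product.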
The output of the multi-head attention mechanism is calculated via the sum of the outputs of all the heads and a bias b O ∈ R d . Because adding a fixed vector is irrelevant to the token-to-token interaction that we aim to investigate, we omitted b O in our analysis.

B The source of the dispersion of ‖f(x)‖
As described in Section 4.1, ‖f(x)‖ exhibits dispersion; however, it remains unclear whether this dispersion is attributable to x or to f. Hence, we checked the dispersion of ‖x‖ and the scaling effect of the transformation f.
Dispersion of ‖x‖: First, we checked the coefficient of variation (CV) of ‖x‖. Table 4 shows that the average CV is 0.12, which is smaller than that of ‖f(x)‖ (0.22); the value of ‖x‖ typically varies between 0.88 and 1.12 times its average. The layer normalization (Ba et al., 2016) applied at the end of the previous layer presumably has a large impact on keeping the variance of ‖x‖ small.
Scaling effect of f: Second, we investigated the scaling effect of the transformation f on the norm of its input. Because the affine transformation f: R^d → R^d can be viewed as a linear transformation R^{d+1} → R^{d+1} (Appendix C), the singular values of f can be regarded as its scaling effect. Figure 9 shows the singular values of f at randomly selected heads in BERT, displayed in descending order from left to right. In each head, there is a difference of at least a factor of 1.8 between the maximum and minimum singular values. This spread is larger than that of ‖x‖, which typically varies between 0.88 and 1.12 times its average. These results suggest that the dispersion of ‖f(x)‖ is primarily attributable to the scaling effect of f.
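The singular-value computation can be sketched as follows; the random matrices stand in for trained BERT parameters, so the resulting values are illustrative only. The linear part of f(x) = (x W_V + b_V) W_O is the matrix W_V W_O, and its singular values describe how strongly f can stretch or shrink input directions (the bias only shifts, it does not scale).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_V = rng.normal(size=(d, d))  # stand-in for a trained value matrix
W_O = rng.normal(size=(d, d))  # stand-in for a trained output matrix

# Singular values of the linear part of f, in descending order.
singular_values = np.linalg.svd(W_V @ W_O, compute_uv=False)
ratio = float(singular_values.max() / singular_values.min())
```

A large max/min ratio means f scales different input directions very differently, which is the source of dispersion the appendix argues for.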

C Affine transformation as linear transformation
The affine transformation f: R^d → R^d in Equation 4 can be viewed as a linear transformation f′: R^{d+1} → R^{d+1}. Given x′ := [x, 1] ∈ R^{d+1}, where a 1 is appended to the end of each input vector x ∈ R^d, the affine transformation f(x) = x (W_V W_O) + b_V W_O can be computed as

x′ M = [f(x), 1] ,  where  M := [ W_V W_O   0^⊤ ;  b_V W_O   1 ] ∈ R^{(d+1)×(d+1)} .
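The augmentation trick can be verified numerically. A minimal NumPy sketch with a generic linear part A and bias c (standing in for W_V W_O and b_V W_O):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))   # linear part, e.g. W_V W_O
c = rng.normal(size=d)        # bias part, e.g. b_V W_O
x = rng.normal(size=d)

# Affine map on row vectors: f(x) = x A + c.
fx = x @ A + c

# Augmented linear map on x' = [x, 1] using a (d+1) x (d+1) matrix:
# the extra column [0; 1] keeps the appended 1 intact.
M = np.block([[A, np.zeros((d, 1))],
              [c[None, :], np.ones((1, 1))]])
x_aug = np.append(x, 1.0)
assert np.allclose(x_aug @ M, np.append(fx, 1.0))
```

Because the augmented map is linear, its singular values capture the scaling behavior of f, as used in Appendix B.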

D Details on Sections 4.2 and 4.3
We describe the detailed experimental setup presented in Sections 4.2 and 4.3.

D.1 Notations
The dataset consists of several sequences: Data = (s_1, ..., s_|Data|). Each sequence consists of several tokens, s_p = (t^p_1, ..., t^p_{|s_p|}), where t^p_q is the q-th token in the p-th sequence. For simplicity, we define scoring functions over the dataset (Weight, WNorm, and their head- and layer-level averages HeadW, HeadN, LayerW, and LayerWN), where α^{ℓ,h}_{p,i,q} is the attention weight assigned from the i-th pre-update vector to the q-th input vector in the p-th sequence, and h and ℓ denote that the score is obtained from the h-th head of the ℓ-th layer. x_{p,q} denotes the input vector for token t^p_q in the ℓ-th layer, and f^{ℓ,h}(x_{p,q}) is the transformed vector for x_{p,q} in the h-th head of the ℓ-th layer.

[Table 4: Mean (µ), standard deviation (σ), coefficient of variation (CV), and maximum and minimum values of ‖x‖; the former three are averaged over all the layers.]

[Figure 9: Singular values of f at randomly selected heads in each layer. We use "layer-head number" to denote a particular attention head. The singular values are displayed in descending order.]
Next, the vocabulary V of BERT is divided into four categories, corresponding to [CLS], [SEP], periods and commas, and the other tokens. Let T(p, Z) be a function that returns all tokens t^p_q belonging to category Z in the p-th sequence. Note that we analyzed a model with 12 heads in each layer. The LayerW(·) and LayerWN(·) functions are used to analyze the average behavior of the heads in a layer.

D.2 Experimental setup for Section 4.2
In Figure 3, the results of each layer are reported for each category. In Figures 3a and 3b, the values for each category Z were calculated using LayerW(Z, ℓ) and LayerWN(Z, ℓ), respectively. In Figure 4, α and ‖f(x)‖ in the h-th head of the ℓ-th layer were calculated using HeadW(Z, ℓ, h) and HeadN(Z, ℓ, h), respectively.
The scores reported in Table 2 are the Spearman rank correlation coefficients between Weight(p, q, ℓ, h) and WNorm(p, q, ℓ, h), calculated using all pairs of Weight(p, q, ℓ, h) and WNorm(p, q, ℓ, h) over the possible values of p, q, ℓ, and h. In Figure 5, each plot corresponds to a pair of Weight(p, q, ℓ, h) and WNorm(p, q, ℓ, h), where the combination (p, q, ℓ, h) was randomly selected.

D.3 Results for the other token categories

Figures 10 and 11 show α and ‖f(x)‖ for the other token categories, respectively; the values in these figures were calculated as described in Appendix D.2. Figures 10 and 11 show that the trends for categories B and C were analogous to those for the [SEP] token: the large α was canceled by the small ‖f(x)‖. However, category D does not exhibit a negative correlation between α and ‖f(x)‖. In each heatmap of ‖f(x)‖, the color scale is determined by the maximum value of ‖f(x)‖ in that category.
We also reported the relationship between α and ‖f(x)‖ in Section 4.2 (Figure 5). Figure 13 shows the results for each word category separately to provide a clearer display.

D.4 Experimental setup and visualizations for Section 4.3

In Section 4.3, we analyzed the relationship between word frequency and ‖f(x)‖. To formally describe our experiments, we define the following averages over all 12 layers and 12 heads of the model:

AvgW(p, q) = (1 / (12 · 12)) Σ_{ℓ=1}^{12} Σ_{h=1}^{12} Weight(p, q, ℓ, h) ,
AvgN(p, q) = (1 / (12 · 12)) Σ_{ℓ=1}^{12} Σ_{h=1}^{12} WNorm(p, q, ℓ, h) .

[Figure 12: α and ‖f(x)‖ corresponding to the other tokens, averaged over all the input text.]

Let r(·) be a function that returns the frequency rank of a given word. We first calculated the Spearman rank correlation coefficient between r(t^p_q) and AvgW(p, q). The score was 0.06, which suggests that there is no relationship between α and the frequency rank of a word. We then calculated the Spearman rank correlation coefficient between r(t^p_q) and AvgN(p, q). The score was 0.75, which suggests a strong correlation between ‖f(x)‖ and the frequency rank of a word; Figure 14 shows these results.
In addition, the results for the word frequency, instead of the frequency rank, are shown in Figure 15. Here, c(·) denotes a function that returns the frequency of a given word in the training dataset of BERT; we reproduced this dataset because it is not publicly released.

E Details on Section 5

E.1 Hyperparameters and training settings
We used the Transformer (Vaswani et al., 2017) NMT model implemented in fairseq (Ott et al., 2019) for the experiments. Table 5 shows the hyperparameters of the model, which were the same as those used by Ding et al. (2019). We used the model with the highest BLEU score on the development set for our experiments.

[Figure 14: Relationship between frequency rank r(t_q^p) and AvgW(p, q), and that between r(t_q^p) and AvgN(p, q).]
We conducted the data preprocessing^18 following the method of Zenkel et al. (2019) and Ding et al. (2019). All the words in the training data of the NMT systems were split into subword units using byte-pair encoding (BPE; Sennrich et al., 2016) with 10k merge operations. Following Ding et al. (2019), the last 1,000 instances of the training data were used as the development data.

E.2 Settings of the word alignment extraction
First, we applied the BPE model that was used to split the training data of the NMT systems to the evaluation data used for calculating the AER scores. Next, we extracted the scores of α and ||αf(x)|| for each subword in the evaluation data under the forced decoding setup. The gold alignments are annotated at the word level, not the subword level. To calculate word-level alignment scores, the α and ||αf(x)|| of the subwords were first merged by averaging along the target tokens in the gold data, and then merged by summation along the source tokens in the gold data. These operations are the same as those of Li et al. (2019).

^18 https://github.com/lilt/alignment-scripts

[Figure 15: (a) Relationship between frequency count c(t_q^p) and AvgW(p, q), and (b) that between c(t_q^p) and AvgN(p, q).]
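The subword-to-word merging described above (average over target-side subwords, then sum over source-side subwords) can be sketched as follows; this is a toy implementation, and the index arrays mapping subwords to words are hypothetical:

```python
import numpy as np

def to_word_level(scores, src_word_ids, tgt_word_ids):
    # scores: (n_tgt_sub, n_src_sub) alignment scores (alpha or ||alpha f(x)||)
    # between target subwords (rows) and source subwords (columns).
    # src_word_ids / tgt_word_ids: word index of each subword.
    n_tgt = max(tgt_word_ids) + 1
    n_src = max(src_word_ids) + 1
    out = np.zeros((n_tgt, n_src))
    for tw in range(n_tgt):
        # average the rows belonging to the same target word
        rows = [i for i, w in enumerate(tgt_word_ids) if w == tw]
        merged = scores[rows].mean(axis=0)
        # sum the columns belonging to the same source word
        for j, sw in enumerate(src_word_ids):
            out[tw, sw] += merged[j]
    return out
```

If every subword is its own word, the matrix is returned unchanged, so the operation is a strict generalization of the subword-level scores.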
In existing studies, </s>, the special token for the end of the sentence, was presumably removed when calculating word alignments. We instead included </s> as an alignment target and regarded alignments to </s> as no alignment. In other words, if the model aligns a certain word with </s>, we assume that the model has decided that the word is not aligned to any word.
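Under this convention, alignment extraction from the word-level scores can be sketched as follows (token list and score layout are hypothetical):

```python
import numpy as np

EOS = "</s>"

def extract_alignments(scores, src_tokens):
    # scores: (n_tgt, n_src) word-level scores; src_tokens includes </s>.
    # For each target word, pick the source word with the highest score;
    # if that word is </s>, the target word is left unaligned.
    links = {}
    for t in range(scores.shape[0]):
        s = int(np.argmax(scores[t]))
        if src_tokens[s] != EOS:  # alignment to </s> counts as no alignment
            links[t] = s
    return links
```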

E.3 Layer-wise analysis
We preliminarily investigated how the source-target attention in a Transformer-based NMT system behaves depending on the layer. Tang et al. (2018) reported that it behaves differently across layers. The AER scores in the AWI and AWO settings were calculated for each layer (Figure 16). In the AWO setting, the AER scores tend to be better in the latter layers than in the earlier layers (Figure 16a). In contrast, in the AWI setting, the AER scores tend to be better in the earlier layers than in the latter layers (Figure 16b).
These results suggest that the earlier and latter layers focus on the source word aligned with the input and output target word, respectively (as shown in Figure 6). Furthermore, we find it convincing that cleaner word alignments are extracted in the AWI setting than in the AWO setting (Figure 16), because the AWI setting is more advantageous: while the decoder may fail to predict the correct output words, the input words are always correct owing to teacher forcing.

[Figure 16: Layer-wise AER scores. Each value is the average of five random seeds. The closer the extracted word alignment is to the reference, the lower the AER score and the lighter the color.]
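For reference, the AER used throughout these experiments (Och and Ney, 2003) is computed from predicted links A, sure gold links S, and possible gold links P as AER = 1 − (|A∩S| + |A∩P|)/(|A| + |S|). A minimal sketch:

```python
def aer(links, sure, possible):
    # links: predicted word pairs; sure/possible: gold annotations.
    # Sure links are by definition also possible links.
    a, s = set(links), set(sure)
    p = set(possible) | s
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```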

F Word alignment experiments on different settings
To verify whether the results obtained in Section 5 are reproducible in different settings, we conducted an additional experiment using a model with a different number of attention heads. Specifically, we used a model with eight attention heads in both the encoder and decoder. Table 6 shows the AER scores of the 8-head model. As with the results obtained by the 4-head model, word alignments extracted using the proposed norm-based approach were more reasonable than those extracted using the weight-based approach, and better word alignments were extracted in the AWI setting than in the AWO setting. Furthermore, the alignments extracted using the head or the layer with the highest average ||αf(x)|| in the AWI setting are competitive with one of the existing word aligners, fast_align. With respect to the weight-based extraction, the scores obtained using the 8-head model were worse than those obtained using the 4-head model. This may be owing to the increase in the number of heads that do not capture reasonable alignments.

[Figure 17: Examples of the reference alignment (a) and the patterns extracted by attention weights (b) and vector norms (ours) (c) in layer 1. Word pairs with a green frame show the word with the highest weight or norm. The vertical axis represents the input source word in the decoder, and the pairs with a green frame are extracted as alignments in the AWI setting. Note that pairs containing </s> are not extracted.]

Figures 23a and 23b show the AER scores of the alignments obtained by the norm-based extraction at each head on one out of the five seeds. Figure 23c shows the average of ||αf(x)|| at each head. As with the results obtained by the 4-head model, the heads with a low (i.e., better) AER score in the AWI setting tended to have a high ||αf(x)|| (the Spearman rank and Pearson correlation coefficients between the AER scores and the averaged ||αf(x)|| among the 6×8 heads are −0.26 and −0.50, respectively). In contrast, in the AWO setting, such a negative correlation is not observed; rather, a positive correlation is observed (Spearman's ρ = 0.40, Pearson's r = 0.40). Additionally, following Appendix E.3, the AER scores for both the AWI and AWO settings were calculated for each layer (Figure 24). As with the 4-head model (Appendix E.3), the latter layers correspond to the AWO setting and the earlier layers correspond to the AWI setting in the 8-head model.

[Figure 22: Examples of the reference alignment and the patterns extracted by attention weights (a) and vector norms (b) in layer 6.]

                                               AER   ±SD
Transformer - Attention-based Approach
- Alignment with output setting -
Weight-based
  layer mean                                   70.4  ±0.6
  best layer (layer 4 or 5)                    49.3  ±1.2
Norm-based (ours)
  layer mean                                   63.2  ±0.7
  best layer (layer 5)                         43.4  ±0.8
  head with the highest average ||αf(x)||      87.2  ±0.6
  layer with the highest average ||αf(x)||     83.7  ±2.2
- Alignment with input setting -
Weight-based
  layer mean                                   76.6  ±1.7
  best layer (layer 2 or 3)                    38.7  ±8.9
Norm-based (ours)
  layer mean                                   59.9  ±1.0
  best layer (layer 2 or 3)                    26.3  ±1.9
  head with the highest average ||αf(x)||      24.9  ±1.7
  layer with the highest average ||αf(x)||     26.5  ±1.9
Word Aligner
  fast_align from Zenkel et al. (2019)         28.4  -
  GIZA++ from Zenkel et al. (2019)             21.0  -

Table 6: Results on a model trained with the same settings as described in Appendix E.1, except that the number of attention heads in the encoder and decoder is 8. Each value is the average of five random seeds.
[Figure 23: (a) AER in the AWO setting, (b) AER in the AWI setting, and (c) averaged ||αf(x)||, for each head of the 8-head model.]

G Comparison with effective attention (Brunner et al., 2020)

In this section, we discuss the difference between our approach and "effective attention" (Brunner et al., 2020), which is an enhanced version of the weight-based analysis. Effective attention excludes from the attention weight matrix A the components that do not affect the output owing to the application of the transformation f and the input x. These output-irrelevant components are derived from the null space of the matrix T, which is the stack of f(x). Figure 25a shows the Pearson correlation coefficient between the raw attention weight and the effective attention. Since the dimension of the null space of T depends on the length of the input sequence, as shown in Figure 25a, the effective attention and the raw attention weight are identical for short input sequences. Figure 25b shows the Pearson correlation coefficient between the raw attention weight and our norm-based method.
Since our method incorporates the scaling effects of f and x, which include the canceling effect, the proposed measure ||αf(x)|| differs from the raw attention weight regardless of whether the input sequence is long or short.
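The contrast can be made concrete with a small sketch (our illustration, not the authors' code). Effective attention projects each row of A onto the column space of T, removing the left-null-space components that cannot change the output A·T; when the sequence is short (n ≤ d and T has full row rank), the projection is the identity and effective attention equals the raw A:

```python
import numpy as np

def effective_attention(A, T):
    # A: (n, n) attention weights; T: (n, d) stacked transformed vectors f(x).
    # Components of A's rows lying in the left null space of T do not change
    # the output A @ T, so project the rows onto the column space of T.
    U, s, _ = np.linalg.svd(T, full_matrices=False)
    r = int((s > 1e-10 * s[0]).sum())   # numerical rank (s is descending)
    Ur = U[:, :r]                       # orthonormal basis of col space of T
    return A @ Ur @ Ur.T                # project each row of A
```

By construction, `effective_attention(A, T) @ T` equals `A @ T`, which is exactly the property that makes the removed components output-irrelevant.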