Unsupervised Token-wise Alignment to Improve Interpretation of Encoder-Decoder Models

Developing a method for understanding the inner workings of black-box neural methods is an important research endeavor. Conventionally, many studies have used an attention matrix to interpret how Encoder-Decoder-based models translate a given source sentence to the corresponding target sentence. However, recent studies have empirically revealed that an attention matrix is not optimal for token-wise translation analyses. We propose a method that explicitly models the token-wise alignment between the source and target sequences to provide a better analysis. Experiments show that our method can acquire token-wise alignments that are superior to those of an attention mechanism.


Introduction
The Encoder-Decoder model with an attention mechanism (EncDec) (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015) has been an epoch-making development that has led to great progress in many natural language generation tasks, such as machine translation (Bahdanau et al., 2015), dialog generation (Shang et al., 2015), and headline generation (Rush et al., 2015). An enormous number of studies have attempted to enhance the ability of EncDec.
Furthermore, several intensive studies have also attempted to analyze and interpret the inside of black-box EncDec models, especially how they translate a given source sentence to the corresponding target sentence (Ding et al., 2017). One typical approach to this is to visualize an attention matrix, which is a collection of attention vectors (Bahdanau et al., 2015; Luong et al., 2015; Tu et al., 2016).
The assumption behind this interpretation is that the attention matrix has a "soft" token-wise alignment between the source and target sequences, and thus we can use EncDec models to skim which tokens in the source are converted to which tokens in the target.
However, recent studies have empirically revealed that an attention model can operate not only for token-wise alignment but also for other functionalities, such as reordering (Ghader and Monz, 2017). In addition, Luong et al. (2015) reported that the quality of attention matrix-based alignment is quantitatively inferior to that of the Berkeley aligner (Liang et al., 2006). Koehn and Knowles (2017) also reported that attention matrix-based alignment is significantly different from that acquired from an off-the-shelf aligner for English-German language pairs. Motivated by these recent findings, the goal of this paper is to provide a method that can offer a better interpretation of how EncDec models translate a given source sentence to the corresponding target sentence.
In this paper, we focus exclusively on the headline generation task, which is categorized as a lossy-compression generation (lossy-gen) task (Nallapati et al., 2016). Compared with a machine translation task, which is categorized as a loss-less generation (lossless-gen) task, the headline generation task additionally requires EncDec models to appropriately select salient ideas in given source sentences (Suzuki and Nagata, 2017). Therefore, the lossy-gen task seems to make modeling by EncDec much harder. In fact, our preliminary experiments revealed that the attention mechanism in EncDec models largely fails to capture token-wise alignments, e.g., less than 10 percent accuracy, even if we use one of the current state-of-the-art EncDec models (Table 3).
To obtain a better analysis of how EncDec models translate a given source sentence to the corresponding target sentence in the headline generation task, this paper introduces the Unsupervised token-wise Alignment Module (UAM), a novel component that can be plugged into EncDec models. Unlike a conventional attention model, the proposed UAM explicitly captures token-wise alignments between the source and target sequences on the final hidden layer. One can plug the UAM into an EncDec model during the training phase and easily understand the EncDec model's behavior by analyzing the UAM's token-wise alignments. Moreover, the UAM does not require any gold alignment data.
To demonstrate the effectiveness of the UAM, we evaluate EncDec models with the UAM in the headline generation task (Rush et al., 2015), a widely used benchmark for EncDec models. Our experiments show that (i) EncDec models with a UAM achieve comparable (or even superior) performance to the current state-of-the-art headline generation model, and (ii) the produced token-wise alignment is practical regardless of the absence of gold alignment during its training phase.

Headline Generation Task
We address the headline generation task introduced in Rush et al. (2015). The source (input) is the first sentence of a news article, and the target (output) is the article's headline. We say I and J represent the numbers of tokens in the source and target, respectively. An important assumption in headline generation is that the target must be shorter than the source (I > J).
Here, we denote a source sequence as a sequence $X$ of one-hot vectors. Let $x_i \in \{0, 1\}^{V_s}$ represent the one-hot vector of the $i$-th token in $X$, where $V_s$ represents the number of tokens in the source-side vocabulary $\mathcal{V}_s$. We use $x_{1:I}$ to represent $(x_1, \dots, x_I)$; namely, $X = x_{1:I}$. Similarly, let $y_j \in \{0, 1\}^{V_t}$ represent the one-hot vector of the $j$-th token in the target sequence $Y$, where $V_t$ is the number of tokens in the target-side vocabulary $\mathcal{V}_t$. Here, we define $Y$ as always containing two additional one-hot vectors of special tokens, ⟨bos⟩ for $y_0$ and ⟨eos⟩ for $y_{J+1}$; namely, $Y = y_{0:J+1}$.
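As a small illustration of this notation, the following sketch builds the one-hot matrix for a target sequence with the two special tokens attached. The toy vocabulary, the token spellings `[BOS]` and `[EOS]`, and the helper name `one_hot_sequence` are hypothetical and chosen only for this example.

```python
import numpy as np

def one_hot_sequence(token_ids, vocab_size):
    """Return an array of shape (len(token_ids), vocab_size) with a single 1 per row."""
    seq = np.zeros((len(token_ids), vocab_size), dtype=np.int64)
    seq[np.arange(len(token_ids)), token_ids] = 1
    return seq

# Hypothetical toy target-side vocabulary (ids are illustrative only).
vocab = {"[BOS]": 0, "[EOS]": 1, "london": 2, "share": 3, "prices": 4}

headline = ["london", "share", "prices"]            # J = 3 tokens
ids = [vocab["[BOS]"]] + [vocab[t] for t in headline] + [vocab["[EOS]"]]
Y = one_hot_sequence(ids, len(vocab))               # rows are y_0, ..., y_{J+1}
```

Each row of `Y` is one vector $y_j$, so the matrix has $J + 2$ rows and $V_t$ columns.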

Encoder-Decoder Model with Attention Mechanism (EncDec)
This section briefly describes EncDec as the base model of our method. Our EncDec configuration follows the model described in Luong et al. (2015).

Model Definition
EncDec models the following conditional probability:

$$p(Y \mid X) = \prod_{j=1}^{J+1} p(y_j \mid y_{0:j-1}, X). \tag{1}$$

Encoder. EncDec encodes the source one-hot vector sequence $x_{1:I}$ and generates the hidden state sequence $h_{1:I}$, where $h_i \in \mathbb{R}^{H}$ for all $i$ and $H$ is the size of the hidden state. We employ a bidirectional RNN (BiRNN) as the encoder of the base model. The BiRNN is composed of two separate RNNs for the forward ($\overrightarrow{\mathrm{RNN}}_{src}$) and backward ($\overleftarrow{\mathrm{RNN}}_{src}$) directions. The forward RNN reads the source sequence $X$ from left to right and constructs hidden states ($\overrightarrow{h}_{1:I}$). Similarly, the backward RNN reads the input in the reverse order to obtain another sequence of hidden states ($\overleftarrow{h}_{1:I}$). Finally, we take the summation of the hidden states in each direction to construct the final representation of the source sequence ($h_{1:I}$).
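Concretely, for a given time step $i$, the representation $h_i$ is constructed as follows:

$$\overrightarrow{h}_i = \overrightarrow{\mathrm{RNN}}_{src}(E_s x_i, \overrightarrow{h}_{i-1}),$$
$$\overleftarrow{h}_i = \overleftarrow{\mathrm{RNN}}_{src}(E_s x_i, \overleftarrow{h}_{i+1}),$$
$$h_i = \overrightarrow{h}_i + \overleftarrow{h}_i,$$

where $E_s \in \mathbb{R}^{D \times V_s}$ denotes the word embedding matrix of the source side and $D$ denotes the size of the word embedding.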
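The encoder computation can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: a plain tanh recurrence (`rnn_step`) stands in for the actual RNN unit, which this section does not specify, and all weights are random placeholders.

```python
import numpy as np

def rnn_step(W_in, W_rec, x_emb, h_prev):
    """Stand-in recurrent unit (assumption): a plain tanh RNN cell."""
    return np.tanh(W_in @ x_emb + W_rec @ h_prev)

def encode(X, E_s, fwd, bwd):
    """X: (I, V_s) one-hot rows; E_s: (D, V_s); fwd/bwd: (W_in, W_rec) pairs.

    Returns h of shape (I, H), the sum of forward and backward states.
    """
    I, H = X.shape[0], fwd[1].shape[0]
    h_fwd, h_bwd = np.zeros((I, H)), np.zeros((I, H))
    state = np.zeros(H)
    for i in range(I):                    # left to right
        state = rnn_step(*fwd, E_s @ X[i], state)
        h_fwd[i] = state
    state = np.zeros(H)
    for i in reversed(range(I)):          # right to left
        state = rnn_step(*bwd, E_s @ X[i], state)
        h_bwd[i] = state
    return h_fwd + h_bwd                  # summation, not concatenation

rng = np.random.default_rng(0)
V_s, D, H, I = 6, 4, 3, 5
E_s = rng.normal(size=(D, V_s))
fwd = (rng.normal(size=(H, D)), rng.normal(size=(H, H)))
bwd = (rng.normal(size=(H, D)), rng.normal(size=(H, H)))
X = np.eye(V_s)[[0, 3, 1, 5, 2]]          # a toy source sequence of length I
h = encode(X, E_s, fwd, bwd)
```

Summing the two directions keeps each $h_i$ in $\mathbb{R}^{H}$, which is what allows the bilinear attention score to use a square $H \times H$ matrix.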
Decoder. The decoder is a unidirectional RNN with the input-feeding approach (Luong et al., 2015). Concretely, the decoder RNN takes the output of the previous time step $y_{j-1}$, the decoder hidden state $z_{j-1}$ and the final hidden state $\hat{z}_{j-1}$ to derive the hidden state of the current time step $z_j$:

$$z_j = \mathrm{RNN}_{trg}(E_t y_{j-1}, z_{j-1}, \hat{z}_{j-1}),$$

where $E_t \in \mathbb{R}^{D \times V_t}$ denotes the word embedding matrix of the decoder. Here, $\hat{z}_0$ is defined as a zero vector.
Attention Mechanism. The attention architecture of the base model is the same as that of the Global Attention model proposed by Luong et al. (2015). Attention is responsible for constructing the final hidden state $\hat{z}_j$ from the decoder hidden state $z_j$ and the encoder hidden states ($h_{1:I}$). First, the model computes the attention vector $\alpha_j \in \mathbb{R}^{I}$ from the decoder hidden state $z_j$ and the encoder hidden states ($h_{1:I}$). From among the three attention scoring functions proposed in Luong et al. (2015), we employ the general function. This function calculates the attention score in bilinear form. Specifically, the attention score between the $i$-th source hidden state and the $j$-th decoder hidden state is computed by the following equation:

$$\alpha_j[i] = \frac{\exp(h_i^{\top} W_\alpha z_j)}{\sum_{i'=1}^{I} \exp(h_{i'}^{\top} W_\alpha z_j)}, \tag{7}$$

where $W_\alpha \in \mathbb{R}^{H \times H}$ is a parameter matrix, and $\alpha_j[i]$ denotes the $i$-th element of $\alpha_j$. $\alpha_j$ is then used for collecting the source-side information that is relevant for predicting the target token. This is done by taking the weighted sum of the encoder hidden states:

$$c_j = \sum_{i=1}^{I} \alpha_j[i] \, h_i.$$

Next, the source-side information is mixed with the decoder hidden state to derive the final hidden state $\hat{z}_j$. Concretely, the context vector $c_j$ is concatenated with $z_j$ to form the vector $u_j \in \mathbb{R}^{2H}$. $u_j$ is then fed into a single fully-connected layer with tanh nonlinearity:

$$\hat{z}_j = \tanh(W_s u_j),$$

where $W_s \in \mathbb{R}^{H \times 2H}$ is a parameter matrix. Finally, $\hat{z}_j$ is fed into the softmax layer. The model generates a target-side token based on the probability distribution $o_j \in \mathbb{R}^{V_t}$ as

$$o_j = \mathrm{softmax}(W_o \hat{z}_j + b_o),$$

where $W_o \in \mathbb{R}^{V_t \times H}$ is a parameter matrix and $b_o \in \mathbb{R}^{V_t}$ is a bias term.
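One decoding step of this attention pipeline can be sketched as follows. The sketch assumes the bilinear ("general") score normalized with a softmax over source positions, as in Luong et al. (2015); the function names and the random toy parameters are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))             # subtract max for numerical stability
    return e / e.sum()

def attention_step(h, z_j, W_alpha, W_s, W_o, b_o):
    """h: (I, H) encoder states; z_j: (H,) decoder state at step j."""
    scores = h @ W_alpha @ z_j            # (I,), entry i is h_i^T W_alpha z_j
    alpha_j = softmax(scores)             # attention vector over source positions
    c_j = alpha_j @ h                     # (H,), weighted sum of encoder states
    u_j = np.concatenate([c_j, z_j])      # (2H,)
    z_hat_j = np.tanh(W_s @ u_j)          # final hidden state
    o_j = softmax(W_o @ z_hat_j + b_o)    # distribution over the target vocabulary
    return alpha_j, z_hat_j, o_j

rng = np.random.default_rng(0)
I, H, V_t = 5, 3, 7
h = rng.normal(size=(I, H))
z_j = rng.normal(size=H)
alpha_j, z_hat_j, o_j = attention_step(
    h, z_j,
    rng.normal(size=(H, H)),              # W_alpha
    rng.normal(size=(H, 2 * H)),          # W_s
    rng.normal(size=(V_t, H)),            # W_o
    np.zeros(V_t),                        # b_o
)
```

Stacking `alpha_j` for all decoding steps gives the attention matrix that is conventionally visualized as a soft alignment.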

Training of EncDec
To train EncDec, let $D$ be the training data for headline generation, which consists of source-headline sentence pairs. Let $\theta$ represent all parameters in EncDec. Our goal is to find the optimal parameter set $\hat{\theta}$ that minimizes the following objective function $G_0(\theta)$ for the given training data $D$:

$$G_0(\theta) = \frac{1}{|D|} \sum_{(X, Y) \in D} \ell_{trg}(Y, X, \theta).$$

Since $o_j$ for each $j$ is a vector representation of the probabilities $p(\hat{y} \mid y_{0:j-1}, X, \theta)$ over the target vocabulary $\hat{y} \in \mathcal{V}_t$, we can calculate $\ell_{trg}$ as

$$\ell_{trg}(Y, X, \theta) = -\sum_{j=1}^{J+1} y_j^{\top} \log(o_j).$$
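The target-side loss is the usual token-level cross-entropy summed over decoding steps. A minimal sketch, assuming the per-step distributions $o_j$ are stacked as rows of a matrix `O`; the small constant `eps` is an addition for numerical safety and is not part of the model.

```python
import numpy as np

def target_loss(O, Y, eps=1e-12):
    """Negative log-likelihood of the gold target tokens.

    O: (J+1, V_t), row j is the predicted distribution o_j.
    Y: (J+1, V_t), row j is the gold one-hot vector y_j (y_1 ... y_{J+1}).
    """
    return float(-np.sum(Y * np.log(O + eps)))

# Toy check: a uniform prediction over 4 tokens for 3 steps costs 3 * log(4).
O = np.full((3, 4), 0.25)
Y = np.eye(4)[[2, 0, 1]]
loss = target_loss(O, Y)
```

Because each $y_j$ is one-hot, the product `Y * np.log(O)` simply picks out the log-probability assigned to the gold token at each step.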

Inference of EncDec
In the inference step, we use the trained parameters to search for the best target sequence. We use beam search to find the target sequence that maximizes the product of the conditional probabilities as described in Equation 1. From among several stopping criteria for beam search (Huang et al., 2017), we adopt the widely used "shrinking beam" implemented in RNNsearch (Bahdanau et al., 2015).
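The search procedure can be illustrated with a small, model-agnostic sketch. It assumes that "shrinking beam" means the beam width is reduced by one each time a hypothesis ends with the end-of-sequence token; `step_fn`, the token spellings and the toy distribution are hypothetical stand-ins for the trained decoder.

```python
import math

def beam_search(step_fn, bos, eos, beam_size, max_len):
    """step_fn(prefix) -> dict mapping next token to its probability."""
    beams = [(0.0, [bos])]                # (log-probability, token prefix)
    finished = []
    width = beam_size
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            for tok, p in step_fn(prefix).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), prefix + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:width]:
            if seq[-1] == eos:
                finished.append((logp, seq))    # completed hypothesis
            else:
                beams.append((logp, seq))
        width = beam_size - len(finished)       # shrink the beam
        if width <= 0 or not beams:
            break
    pool = finished if finished else beams
    if not pool:
        return [bos]
    return max(pool, key=lambda c: c[0])[1]

def toy_step(prefix):
    """Hypothetical next-token distribution used only for this example."""
    if prefix[-1] == "[BOS]":
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"[EOS]": 0.5, "a": 0.5}
    return {"[EOS]": 1.0}

best = beam_search(toy_step, "[BOS]", "[EOS]", beam_size=2, max_len=5)
```

In the toy distribution, the greedy first choice "a" leads to a sequence with probability 0.3, while "b" followed by the end token has probability 0.4, so a beam of width 2 recovers the better hypothesis.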

Proposed Method: Unsupervised Alignment Module (UAM)
Figure 1 illustrates the proposed method, the UAM. The UAM is a module attached on top of the decoder output layer of an EncDec model, and it explicitly represents a token-wise alignment by predicting source tokens simultaneously with target tokens. Specifically, during the training of EncDec, the decoder estimates the probability distribution over the source-side vocabulary, $q_j \in \mathbb{R}^{V_s}$, simultaneously with that of the target-side vocabulary, $o_j \in \mathbb{R}^{V_t}$, for each time step $j$. In Figure 1, when the decoder predicts the target-side token "share", we want to predict its corresponding source-side token "share." If we can correctly predict the corresponding source-side token for each target-side token, we can obtain a token-wise alignment. As shown below, we can train the model for this prediction without gold token-wise alignment signals.
This way of representing alignments can be extended to null alignments. We first expand a given target sequence with a sequence of ⟨null⟩ tokens so that its length is the same as that of the source sequence (in Figure 1, we extend "london share ... ⟨eos⟩" to "london share ... ⟨eos⟩ ⟨null⟩ ... ⟨null⟩"). We then train the model so that it can predict a discarded source-side token (e.g., "thursday") when it predicts ⟨null⟩ on the target side. Through the task of predicting source-side tokens corresponding to ⟨null⟩ tokens, we expect the model to effectively learn to identify unimportant information in the source sequence.
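The target expansion is a simple padding step. A minimal sketch, with hypothetical token spellings `[EOS]` and `[NULL]` and a hypothetical helper name:

```python
def expand_target(target_tokens, source_len, eos="[EOS]", null="[NULL]"):
    """Append the end token, then pad with null tokens up to the source length."""
    expanded = list(target_tokens) + [eos]
    n_null = source_len - len(expanded)
    if n_null < 0:
        raise ValueError("headline generation assumes the target is shorter than the source")
    return expanded + [null] * n_null

# Toy example: a 3-token headline for a 6-token source sentence.
expanded = expand_target(["london", "share", "prices"], source_len=6)
```

After expansion, the decoder runs for exactly as many steps as there are source tokens, so every source token can be accounted for by some decoding step.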

Model Definition
EncDec with the UAM jointly estimates the probability distributions over the source and target vocabularies. Specifically, the UAM estimates a probability distribution over the source vocabulary $q_j \in \mathbb{R}^{V_s}$ at each time step $j$ in the decoding process by:

$$q_j = \mathrm{softmax}(W_q \hat{z}_j + b_q),$$

where $W_q \in \mathbb{R}^{V_s \times H}$ is a parameter matrix, $\hat{z}_j \in \mathbb{R}^{H}$ is the decoder's final hidden state, and $b_q \in \mathbb{R}^{V_s}$ is a bias term. $H$ is the size of the hidden state. EncDec also calculates a probability distribution over the target vocabulary $o_j$ from the same vector $\hat{z}_j$. Next, we define $Y'$ as a concatenation of the one-hot vectors of the target-side sequence $Y$ and those of the special token ⟨null⟩ of length $I - (J + 1)$. Here, $y'_{J+1}$ is a one-hot vector of ⟨eos⟩, and $y'_j$ for each $j \in \{J + 2, \dots, I\}$ is a one-hot vector of ⟨null⟩. Note that the length of $Y'$ is always no shorter than that of $Y$, that is, $|Y'| \geq |Y|$, because headline generation always assumes $I > J$.
Based on this setting, we train the model in an unsupervised manner without gold alignment signals. To this end, we consider a sentence-wise loss function instead of a token-wise loss. Specifically, we define the sentence-wise loss as the degree of difference between the sum of all one-hot vectors in the source sequence and the sum of the UAM predictions. Namely, we take the difference of $\tilde{x} = \sum_{i=1}^{I} x_i$ and $\tilde{q} = \sum_{j=1}^{I} q_j$. Note that $\tilde{x}$ is a vector representation of the occurrences (or bag-of-words representation) of the source-side tokens. To minimize the sentence-wise loss, the model must predict the bag-of-words representation. Through this optimization, the model is expected to eventually find the token-wise alignment.
EncDec with the UAM models the following conditional probability:

$$p(\tilde{x}, Y' \mid X) = p(\tilde{x} \mid Y', X) \, p(Y' \mid X). \tag{14}$$

We define $p(Y' \mid X)$ as follows:

$$p(Y' \mid X) = \prod_{j=1}^{I} p(y'_j \mid y'_{0:j-1}, X), \tag{15}$$

which is identical to the conditional probability modeled by the base EncDec, except for considering ⟨null⟩ as part of the probability. Next, we define $p(\tilde{x} \mid Y', X)$ as follows:

$$p(\tilde{x} \mid Y', X) = \frac{1}{Z} \exp\left(-\frac{\lVert \tilde{q} - \tilde{x} \rVert_2^2}{C}\right), \tag{16}$$

where $Z$ is a normalization term and $C$ is a hyperparameter that controls the sensitivity of the distribution.
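The UAM prediction is an additional softmax layer over the source-side vocabulary, fed by the decoder's final hidden states. A minimal batched sketch, assuming the final hidden states for all $I$ decoding steps are stacked as rows of `Z_hat`; the weights are random placeholders.

```python
import numpy as np

def uam_predict(Z_hat, W_q, b_q):
    """Return Q of shape (I, V_s); row j is a distribution over the source vocabulary.

    Z_hat: (I, H) decoder final hidden states, one row per decoding step.
    W_q:   (V_s, H) parameter matrix; b_q: (V_s,) bias.
    """
    logits = Z_hat @ W_q.T + b_q
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
I, H, V_s = 6, 3, 8
Q = uam_predict(rng.normal(size=(I, H)), rng.normal(size=(V_s, H)), np.zeros(V_s))
```

The same hidden states would also feed the target-side softmax, so the module adds only one matrix and one bias vector to the model.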

Training of UAM
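We optimize the UAM combined with EncDec by minimizing the negative log-likelihood of Equation 14 as a loss function. Let $\gamma$ represent the parameter set of the UAM. Then, we define the UAM loss $\ell_{src}$ as follows:

$$\ell_{src}(\tilde{x}, X, Y', \gamma, \theta) = -\log p(\tilde{x} \mid Y', X, \gamma, \theta).$$

From Equation 16, we can derive $\ell_{src}$ as

$$\ell_{src}(\tilde{x}, X, Y', \gamma, \theta) = \frac{1}{C} \lVert \tilde{q} - \tilde{x} \rVert_2^2 + \log(Z).$$

We can discard $\log(Z)$ from the RHS, since it is independent of $\gamma$ and $\theta$.
Here, we regard the sum of $\ell_{src}$ and $\ell_{trg}$ as the objective loss function of multi-task training. Formally, we train EncDec with the UAM by minimizing the following objective function $G_1$:

$$G_1(\theta, \gamma) = \frac{1}{|D|} \sum_{(X, Y) \in D} \left( \ell_{trg}(Y', X, \theta) + \ell_{src}(\tilde{x}, X, Y', \gamma, \theta) \right),$$

where $D$ is the training data for headline generation, which consists of source-headline sentence pairs.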
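The sentence-wise loss can be sketched as follows, assuming it takes the form of a squared distance between the two summed vectors, scaled by $1/C$, with the constant $\log(Z)$ dropped. The default `C=1.0` is a placeholder, not a value reported in this paper.

```python
import numpy as np

def uam_loss(X, Q, C=1.0):
    """Squared distance between the source bag-of-words and the summed UAM predictions.

    X: (I, V_s) one-hot source tokens; Q: (I, V_s) UAM distributions.
    """
    x_tilde = X.sum(axis=0)               # token occurrence counts in the source
    q_tilde = Q.sum(axis=0)               # expected counts under the UAM
    return float(np.sum((q_tilde - x_tilde) ** 2) / C)

X = np.eye(4)[[2, 0, 2]]                  # toy source: token ids 2, 0, 2
Q_reordered = X[[1, 2, 0]]                # same tokens predicted in another order
Q_uniform = np.full((3, 4), 0.25)         # uninformative predictions

loss_reordered = uam_loss(X, Q_reordered)
loss_uniform = uam_loss(X, Q_uniform)
```

The reordered prediction incurs zero loss: the objective constrains only which source tokens are predicted overall, not the step at which each one is predicted, which is why no gold alignment is needed.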

Inference of UAM
At inference time, the goal is only to search for the best target sequence. Thus, we do not need to compute the UAM during inference. Similarly, it is unnecessary to produce ⟨null⟩ after generating ⟨eos⟩. Thus, we can utilize the beam search procedure described in Section 3.3, and as a result the actual computation cost of inference remains unchanged from that of the base EncDec.

Dataset
The origin of the headline generation dataset used in our experiments is identical to that used in Rush et al. (2015). The dataset consists of pairs of the first sentence of each article and its headline from the annotated English Gigaword corpus (Napoles et al., 2012). Rush et al. (2015) defined the training, validation and test splits, which contain approximately 3.8M, 200K and 400K source-headline pairs, respectively. We used the entire training split for training as in the previous studies. We randomly sampled 8K instances as validation data and 10K instances as test data from the validation split. Moreover, we experimented on the test data provided by Zhou et al. (2017) and Toutanova et al. (2016) for comparison with the reported state-of-the-art performance (Zhou et al., 2017). We refer to those test data sets as Test (Ours), Test (Zhou), and MSR-ATC, respectively. Among these test sets, MSR-ATC is the only dataset created by a human worker.

Implementation Details
In the experiment, we selected the hyper-parameter settings commonly used in previous studies, e.g., (Rush et al., 2015; Nallapati et al., 2016; Suzuki and Nagata, 2017). We constructed the vocabulary set using byte pair encoding (BPE) (Sennrich et al., 2016) to handle low-frequency words, since this is now a common practice in the field of neural machine translation. The BPE merge operations are jointly learned from the source and the target. We set the number of BPE merge operations at 5,000.
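For readers unfamiliar with BPE, the merge-learning loop can be sketched as below. This is a simplified illustration of the general idea in Sennrich et al. (2016), not the tool used in our experiments; the end-of-word marker `_` and the function name are choices made for this example. Learning merges "jointly" corresponds to passing the words of both the source and target sides in one list.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge operations from a list of words.

    Returns the merged symbol pairs in the order they were learned.
    """
    vocab = Counter(tuple(w) + ("_",) for w in words)     # "_" marks the word end
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["ab", "ab", "ab", "cb"], num_merges=2)
```

In the toy corpus, the pair ("b", "_") occurs four times and is merged first; the pair ("a", "b_") then occurs three times and is merged second.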

Evaluation Metric
We evaluated the performance by ROUGE-1 (RG-1), ROUGE-2 (RG-2) and ROUGE-L (RG-L), computed using the official ROUGE script (version 1.5.5). We report the F1 value as given in a previous study.
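To make the metric concrete, the following simplified sketch computes an n-gram overlap F1 score for a single candidate-reference pair. It is not the official ROUGE script and omits its options (such as stemming and multiple references), so its values should not be expected to match the reported scores.

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """F1 of clipped n-gram overlap between two token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())          # clipped matching n-gram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

candidate = "london shares close higher".split()
reference = "london shares end higher".split()
rg1 = rouge_n_f1(candidate, reference, n=1)       # 3 of 4 unigrams match
rg2 = rouge_n_f1(candidate, reference, n=2)       # 1 of 3 bigrams matches
```

Reporting F1 rather than recall alone avoids rewarding systems that simply generate longer headlines.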

Results
To investigate the effectiveness of the UAM quantitatively, we chose a very strong baseline, SEASS (Zhou et al., 2017), which is the current state-of-the-art model. We reimplemented SEASS, hereafter EncDec+sGate, and compared EncDec+sGate with the model incorporating UAM into EncDec+sGate. Table 2 summarizes ROUGE-F1 results for all test data.
The table shows that EncDec+sGate+UAM achieved a huge gain particularly for MSR-ATC and a performance comparable to EncDec+sGate in Test (Ours) and Test (Zhou). Considering that the MSR-ATC dataset was created by a human worker, we believe that the improvement in MSR-ATC is the most remarkable result among the three test sets, since it indicates that our model most closely fits the human-generated summary.

Discussion
We investigated whether the UAM improves token-wise alignment between the source and target sequences. We compared the alignments acquired by the UAM and the attention model, since attention has been implicitly considered as alignment (Bahdanau et al., 2015; Luong et al., 2015; Tu et al., 2016).

Visualizing UAM and Attention
We visualized the UAM predictions and the attention matrix to see the acquired token-wise alignment between the source and target. Specifically, we fed the source-target pair $(X, Y)$ to EncDec+sGate and EncDec+sGate+UAM, and then collected the UAM predictions ($q_{1:I}$) of EncDec+sGate+UAM and the attention vectors ($\alpha_{1:J}$) of EncDec+sGate. We computed the attention vectors using Equation 7. For the UAM predictions, we extracted the probability of each token $x_i \in X$ from $q_{1:I}$. Figures 2 and 3 show examples of the heat map. We used Test (Ours) as the input. The brackets in the y-axis represent the source-side tokens that are aligned with target-side tokens. We obtained the aligned tokens as follows: for attention (Figures 2a and 3a), we select the token with the largest attention value; for the UAM (Figures 2b and 3b), we select the token with the largest probability over the vocabulary $\mathcal{V}_s$. Figure 2a indicates that attention provides poor token-wise alignment. For example, the target-side token "kong" is incorrectly aligned with the source-side sentence period. In contrast, Figure 2b shows that the UAM captures reasonable alignments. Here, the source-side token "rose" is aligned with the target-side token "higher." The UAM also correctly aligned unimportant tokens such as "##.##" with ⟨null⟩.
In Figure 3a, the attention model repeatedly pays attention to the source-side token "positive." As a result, the attention model aligned the target-side token "egyptian" to "positive." On the other hand, in Figure 3b, the UAM correctly aligns "egyptian." In addition, the UAM aligned the source-side token "foreign" with the target-side token "fm."
Figure 2: Visualization of models. The x-axis and y-axis correspond to the source and the target sequence respectively. Tokens in the brackets are source-side tokens aligned at that time step.
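The two alignment read-outs and the UAM heat map can be sketched as follows. The array layouts, function names and toy numbers are assumptions made for illustration: attention vectors are stacked as rows of `alpha`, and UAM distributions as rows of `Q`.

```python
import numpy as np

def align_by_attention(alpha, source_tokens):
    """alpha: (J, I). For each target step, pick the most attended source position."""
    return [source_tokens[i] for i in np.argmax(alpha, axis=1)]

def align_by_uam(Q, id_to_token):
    """Q: (I, V_s). For each step, pick the most probable source-vocabulary token."""
    return [id_to_token[v] for v in np.argmax(Q, axis=1)]

def uam_heatmap(Q, source_ids):
    """Probability that each step assigns to each token actually in the source.

    Returns an array of shape (I, I): rows are decoding steps, columns are source positions.
    """
    return Q[:, source_ids]

# Toy inputs (illustrative values only).
source_tokens = ["hong", "kong", "."]
alpha = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1]])
id_to_token = ["hong", "kong", ".", "rose"]
Q = np.array([[0.1, 0.2, 0.1, 0.6],
              [0.7, 0.1, 0.1, 0.1],
              [0.2, 0.2, 0.5, 0.1]])

attn_alignment = align_by_attention(alpha, source_tokens)
uam_alignment = align_by_uam(Q, id_to_token)
heat = uam_heatmap(Q, source_ids=[0, 1, 2])
```

Note the difference in what the two argmax operations range over: attention can only select a source position, whereas the UAM selects a token from the whole source-side vocabulary.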

Alignment Accuracy
To quantitatively investigate the quality of the alignment, we evaluated the alignment accuracy of both EncDec+sGate and EncDec+sGate+UAM. We randomly sampled 40, 30 and 30 instances from Test (Ours), Test (Zhou) and MSR-ATC, respectively. We acquired the alignment of the data by applying the UAM and the attention model, in the same manner as described in Section 6.1. A single annotator then evaluated the accuracy of the alignment by hand. The alignment accuracy of the UAM is higher than that of the attention model (Table 3). This result is consistent with the empirical results reported by Koehn and Knowles (2017) and Luong et al. (2015). The attention alignment mistakes are mostly due to paying attention to either the sentence period or to the token decoded in the previous time step. It is noteworthy that the accuracy of the UAM alignment exceeds 50% even though we trained the model in an unsupervised manner. The fact that the UAM prediction acquires reasonable alignment suggests that the UAM has the potential to provide us with a better understanding of the model's behavior.

Conclusion
In this paper, we introduced the Unsupervised token-wise Alignment Module (UAM), which learns to predict the token-wise alignment of tokens in the source and the target. Experiments on the headline generation task showed that the UAM can achieve comparable performance to the current state-of-the-art sGate model. In addition, the UAM obtained token-wise alignment that is superior to that of the attention model. This finding suggests that we can use the UAM as an alternative to the attention matrix to attain a better understanding of the token-wise alignment of EncDec-based models.