Visualizing and Understanding Neural Machine Translation

While neural machine translation (NMT) has made remarkable progress in recent years, it is hard to interpret its internal workings due to the continuous representations and non-linearity of neural networks. In this work, we propose to use layer-wise relevance propagation (LRP) to compute the contribution of each contextual word to arbitrary hidden states in the attention-based encoder-decoder framework. We show that visualization with LRP helps to interpret the internal workings of NMT and analyze translation errors.


Introduction
End-to-end neural machine translation (NMT), which leverages neural networks to directly map between natural languages, has gained increasing popularity recently (Bahdanau et al., 2015). NMT has been shown to outperform conventional statistical machine translation (SMT) significantly across a variety of language pairs (Junczys-Dowmunt et al., 2016) and has become the new de facto method in practical MT systems.
However, a severe challenge remains: it is hard to interpret the internal workings of NMT. In SMT (Koehn et al., 2003; Chiang, 2005), the translation process can be denoted as a derivation that comprises a sequence of translation rules (e.g., phrase pairs and synchronous CFG rules). Defined on language structures with varying granularities, these translation rules are interpretable from a linguistic perspective. In contrast, NMT takes an end-to-end approach: all internal information is represented as real-valued vectors or matrices. It is challenging to associate hidden states in neural networks with interpretable language structures. As a result, the lack of interpretability makes it very difficult to understand the translation process and debug NMT systems.
Therefore, it is important to develop new methods for visualizing and understanding NMT. Visualizing and interpreting neural models has been extensively investigated in computer vision (Krizhevsky et al., 2012; Mahendran and Vedaldi, 2015; Szegedy et al., 2014; Simonyan et al., 2014; Nguyen et al., 2015; Girshick et al., 2014; Bach et al., 2015). Although visualizing and interpreting neural models for natural language processing has started to attract attention recently (Karpathy et al., 2016), to the best of our knowledge, there is no existing work on visualizing NMT models. Note that the attention mechanism (Bahdanau et al., 2015) is restricted to demonstrating the connection between words in the source and target languages and is unable to offer more insight into how target words are generated (see Section 4.5).
In this work, we propose to use layer-wise relevance propagation (LRP) (Bach et al., 2015) to visualize and interpret neural machine translation. Originally designed to compute the contributions of single pixels to predictions for image classifiers, LRP back-propagates relevance recursively from the output layer to the input layer. In contrast to visualization methods relying on derivatives, a major advantage of LRP is that it does not require neural activations to be differentiable or smooth (Bach et al., 2015). We adapt LRP to the attention-based encoder-decoder framework (Bahdanau et al., 2015) to calculate relevance that measures the association degree between two arbitrary neurons in neural networks. Case studies on Chinese-English translation show that visualization helps to interpret the internal workings of NMT and analyze translation errors.

Figure 1: The attention-based encoder-decoder architecture for neural machine translation (Bahdanau et al., 2015).

Background
Given a source sentence $\mathbf{x} = x_1, \ldots, x_i, \ldots, x_I$ with $I$ source words and a target sentence $\mathbf{y} = y_1, \ldots, y_j, \ldots, y_J$ with $J$ target words, neural machine translation (NMT) decomposes the sentence-level translation probability as a product of word-level translation probabilities:
$$P(\mathbf{y} \mid \mathbf{x}; \theta) = \prod_{j=1}^{J} P(y_j \mid \mathbf{x}, \mathbf{y}_{<j}; \theta)$$
where $\mathbf{y}_{<j} = y_1, \ldots, y_{j-1}$ is a partial translation.
In this work, we focus on the attention-based encoder-decoder framework (Bahdanau et al., 2015). As shown in Figure 1, given a source sentence $\mathbf{x}$, the encoder first uses source word embeddings to map each source word $x_i$ to a real-valued vector $\boldsymbol{x}_i$. [1] Then, a forward recurrent neural network (RNN) with GRU units (Cho et al., 2014) runs to calculate source forward hidden states:
$$\overrightarrow{\boldsymbol{h}}_i = f(\boldsymbol{x}_i, \overrightarrow{\boldsymbol{h}}_{i-1})$$
where $f(\cdot)$ is a non-linear function. Similarly, the source backward hidden states can be obtained using a backward RNN:
$$\overleftarrow{\boldsymbol{h}}_i = f(\boldsymbol{x}_i, \overleftarrow{\boldsymbol{h}}_{i+1})$$
[1] Note that we use $\mathbf{x}$ (upright bold) to denote a source sentence and $\boldsymbol{x}$ (bold italic) to denote the vector representation of a single source word.
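The factorization into word-level probabilities can be illustrated with a minimal Python sketch. The function name `sentence_log_prob` and the probability values are hypothetical and chosen only for illustration; in a real system the word-level probabilities would come from the decoder's softmax layer.

```python
import math


def sentence_log_prob(word_probs):
    """Sum the log word-level probabilities P(y_j | x, y_<j) to get log P(y | x)."""
    return sum(math.log(p) for p in word_probs)


# Hypothetical word-level probabilities for a three-word target sentence.
word_probs = [0.5, 0.4, 0.25]

log_p = sentence_log_prob(word_probs)
# Exponentiating recovers the product of the word-level probabilities.
sentence_prob = math.exp(log_p)
```

Working in log space is a common design choice because the product of many small probabilities underflows quickly for long sentences.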
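To capture global contexts, the forward and backward hidden states are concatenated as the hidden state for each source word:
$$\boldsymbol{h}_i = [\overrightarrow{\boldsymbol{h}}_i; \overleftarrow{\boldsymbol{h}}_i]$$
Bahdanau et al. (2015) propose an attention mechanism to dynamically determine the relevant source context $\boldsymbol{c}_j$ for each target word:
$$\boldsymbol{c}_j = \sum_{i=1}^{I+1} \alpha_{j,i} \boldsymbol{h}_i$$
where $\alpha_{j,i}$ is an attention weight that indicates how well the source word $x_i$ and the target word $y_j$ match. Note that an end-of-sentence token is appended to the source sentence, which is why the sum runs over $I+1$ positions.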
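As a rough illustration of the concatenation and the attention-weighted source context, the following sketch uses plain Python lists. The helper names `concat` and `context_vector`, the two-dimensional hidden states, and the attention weights are all assumptions made for this example and are not taken from the GROUNDHOG implementation.

```python
def concat(forward, backward):
    """h_i = [forward h_i; backward h_i]."""
    return forward + backward


def context_vector(attention, hidden_states):
    """c_j = sum_i alpha_{j,i} * h_i, computed dimension by dimension."""
    dim = len(hidden_states[0])
    return [
        sum(a * h[d] for a, h in zip(attention, hidden_states))
        for d in range(dim)
    ]


# Hypothetical forward and backward states for three source positions
# (two words plus the end-of-sentence token).
forward_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
backward_states = [[0.6, 0.5], [0.4, 0.3], [0.2, 0.1]]
hidden_states = [concat(f, b) for f, b in zip(forward_states, backward_states)]

# Hypothetical attention weights for one target position; they sum to 1.
alpha_j = [0.2, 0.7, 0.1]
c_j = context_vector(alpha_j, hidden_states)
```

The context vector has the same dimensionality as a concatenated source hidden state, here four.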
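In the decoder, a target hidden state for the j-th target word is calculated as
$$\boldsymbol{s}_j = g(\boldsymbol{s}_{j-1}, \boldsymbol{y}_{j-1}, \boldsymbol{c}_j)$$
where $g(\cdot)$ is a non-linear function and $\boldsymbol{y}_{j-1}$ denotes the vector representation of the (j − 1)-th target word. Finally, the word-level translation probability is given by
$$P(y_j \mid \mathbf{x}, \mathbf{y}_{<j}; \theta) = \rho(\boldsymbol{y}_{j-1}, \boldsymbol{s}_j, \boldsymbol{c}_j)$$
where $\rho(\cdot)$ is a non-linear function.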
Although NMT delivers state-of-the-art translation performance and can handle long-distance dependencies thanks to GRU and attention, it is hard to interpret internal information such as $\overrightarrow{\boldsymbol{h}}_i$, $\overleftarrow{\boldsymbol{h}}_i$, $\boldsymbol{h}_i$, $\boldsymbol{c}_j$, and $\boldsymbol{s}_j$ in the encoder-decoder framework. Although projecting the word embedding space into two dimensions (Faruqui and Dyer, 2014) and inspecting the attention matrix (Bahdanau et al., 2015) shed partial light on how NMT works, interpreting the entire network remains a challenge.
Therefore, it is important to develop new methods for understanding the translation process and analyzing translation errors for NMT.

Problem Statement
Recent efforts on interpreting and visualizing neural models have focused on calculating the contribution of a unit at the input layer to the final decision at the output layer (Simonyan et al., 2014; Mahendran and Vedaldi, 2015; Nguyen et al., 2015; Girshick et al., 2014; Bach et al., 2015). For example, in image classification, it is important to understand the contribution of a single pixel to the prediction of a classifier (Bach et al., 2015).

Figure 2: Visualizing the relevance between the vector representation of a target word "New York" and those of all source words and preceding target words.
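In this work, we are interested in calculating the contribution of source and target words to the following internal information in the attention-based encoder-decoder framework:
1. $\overrightarrow{\boldsymbol{h}}_i$: the i-th source forward hidden state,
2. $\overleftarrow{\boldsymbol{h}}_i$: the i-th source backward hidden state,
3. $\boldsymbol{h}_i$: the i-th source hidden state,
4. $\boldsymbol{c}_j$: the j-th source context vector,
5. $\boldsymbol{s}_j$: the j-th target hidden state,
6. $\boldsymbol{y}_j$: the j-th target word embedding.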
For example, as shown in Figure 2, the generation of the third target word "York" depends on both the source context (i.e., the source sentence "zai niuyue </s>") and the target context (i.e., the partial translation "in New"). Intuitively, the source word "niuyue" and the target word "New" are more relevant to "York" and should receive higher relevance than other words. The problem is how to quantify and visualize the relevance between hidden states and contextual word vectors.
More formally, we introduce a number of definitions to facilitate the presentation.
Definition 1 The contextual word set of a hidden state $\boldsymbol{v} \in \mathbb{R}^{M \times 1}$ is denoted as $C(\boldsymbol{v})$, which is a set of source and target contextual word vectors $\boldsymbol{u} \in \mathbb{R}^{N \times 1}$ that influence the generation of $\boldsymbol{v}$.

Figure 3: A simple feed-forward network for illustrating layer-wise relevance propagation (Bach et al., 2015).
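As both hidden states and contextual words are represented as real-valued vectors, we need to factorize vector-level relevance at the neuron level.
Definition 2 The neuron-level relevance between the m-th neuron in a hidden state $v_m \in \mathbb{R}$ and the n-th neuron in a contextual word vector $u_n \in \mathbb{R}$ is denoted as $r_{u_n \leftarrow v_m} \in \mathbb{R}$, which satisfies the following constraint:
$$v_m = \sum_{\boldsymbol{u} \in C(\boldsymbol{v})} \sum_{n=1}^{N} r_{u_n \leftarrow v_m}$$
Definition 3 The vector-level relevance between a hidden state $\boldsymbol{v}$ and one contextual word vector $\boldsymbol{u} \in C(\boldsymbol{v})$ is denoted as $R_{\boldsymbol{u} \leftarrow \boldsymbol{v}} \in \mathbb{R}$, which quantifies the contribution of $\boldsymbol{u}$ to the generation of $\boldsymbol{v}$. It is calculated as
$$R_{\boldsymbol{u} \leftarrow \boldsymbol{v}} = \sum_{m=1}^{M} \sum_{n=1}^{N} r_{u_n \leftarrow v_m}$$
Definition 4 The relevance vector of a hidden state $\boldsymbol{v}$ is a sequence of vector-level relevance of its contextual words:
$$R_{\boldsymbol{v}} = \{R_{\boldsymbol{u}_1 \leftarrow \boldsymbol{v}}, \ldots, R_{\boldsymbol{u}_{|C(\boldsymbol{v})|} \leftarrow \boldsymbol{v}}\}$$
Therefore, our goal is to compute relevance vectors for hidden states in a neural network, as shown in Figure 2. The key problem is how to compute neuron-level relevance.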
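The definitions above can be made concrete with a small sketch that aggregates neuron-level relevance into vector-level relevance and a relevance vector. The function names and all numbers are hypothetical; the neuron-level values are simply chosen so that the constraint in Definition 2 holds, and are not produced by a trained model.

```python
def vector_level_relevance(neuron_relevance):
    """R_{u<-v}: sum of r_{u_n<-v_m} over all m and n.

    neuron_relevance[m][n] holds r_{u_n<-v_m}.
    """
    return sum(sum(row) for row in neuron_relevance)


def relevance_vector(neuron_relevance_by_word):
    """R_v: one vector-level relevance per contextual word in C(v)."""
    return [vector_level_relevance(r) for r in neuron_relevance_by_word]


def satisfies_constraint(v, neuron_relevance_by_word, tol=1e-9):
    """Check that each v_m equals the relevance summed over all words and neurons."""
    for m, v_m in enumerate(v):
        total = sum(sum(r[m]) for r in neuron_relevance_by_word)
        if abs(total - v_m) > tol:
            return False
    return True


# Hypothetical hidden state with M = 2 neurons and two contextual words
# with N = 2 neurons each.
v = [1.0, -0.5]
r_u1 = [[0.6, 0.1], [-0.2, -0.1]]
r_u2 = [[0.2, 0.1], [-0.1, -0.1]]

R_v = relevance_vector([r_u1, r_u2])
constraint_ok = satisfies_constraint(v, [r_u1, r_u2])
```

A consequence of the constraint is that the entries of the relevance vector sum to the sum of the neurons of the hidden state, rather than to one.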

Layer-wise Relevance Propagation
We follow Bach et al. (2015) to use layer-wise relevance propagation (LRP) to compute neuron-level relevance. We use a simple feed-forward network shown in Figure 3 to illustrate the central idea of LRP.
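Algorithm 1: Layer-wise relevance propagation for neural machine translation.
Input: A neural network G for a sentence pair and a set of hidden states to be visualized V.
Output: Vector-level relevance set R.

LRP first propagates the relevance from the output layer to the intermediate layer:
$$r_{z_1 \leftarrow v_1} = \frac{W^{(2)}_{1,1} z_1}{W^{(2)}_{1,1} z_1 + W^{(2)}_{2,1} z_2} v_1$$
$$r_{z_2 \leftarrow v_1} = \frac{W^{(2)}_{2,1} z_2}{W^{(2)}_{1,1} z_1 + W^{(2)}_{2,1} z_2} v_1$$
where $W^{(1)}$ and $W^{(2)}$ denote the weights between the input and intermediate layers and between the intermediate and output layers, respectively. Note that we ignore the non-linear activation function because Bach et al. (2015) indicate that LRP is invariant against the choice of non-linear function. Then, the relevance is further propagated to the input layer:
$$r_{u_1 \leftarrow v_1} = \frac{W^{(1)}_{1,1} u_1}{W^{(1)}_{1,1} u_1 + W^{(1)}_{2,1} u_2} r_{z_1 \leftarrow v_1} + \frac{W^{(1)}_{1,2} u_1}{W^{(1)}_{1,2} u_1 + W^{(1)}_{2,2} u_2} r_{z_2 \leftarrow v_1}$$
$$r_{u_2 \leftarrow v_1} = \frac{W^{(1)}_{2,1} u_2}{W^{(1)}_{1,1} u_1 + W^{(1)}_{2,1} u_2} r_{z_1 \leftarrow v_1} + \frac{W^{(1)}_{2,2} u_2}{W^{(1)}_{1,2} u_1 + W^{(1)}_{2,2} u_2} r_{z_2 \leftarrow v_1}$$
Note that $r_{u_1 \leftarrow v_1} + r_{u_2 \leftarrow v_1} = v_1$.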
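The two propagation steps can be traced numerically on a toy network with two neurons per layer. This is only a sketch: the weights and inputs are hypothetical, the indexing convention `weights[i][k]` (from lower neuron i to upper neuron k) is an assumption, and the network is kept linear because the activation function is ignored during propagation.

```python
def propagate(lower, weights, upper_relevance):
    """Redistribute relevance from an upper layer to the layer below it.

    weights[i][k] connects lower neuron i to upper neuron k. The toy values
    are chosen so that no weighted sum is zero (no division by zero).
    """
    n_lower = len(lower)
    n_upper = len(upper_relevance)
    totals = [
        sum(weights[i][k] * lower[i] for i in range(n_lower))
        for k in range(n_upper)
    ]
    return [
        sum(
            weights[i][k] * lower[i] / totals[k] * upper_relevance[k]
            for k in range(n_upper)
        )
        for i in range(n_lower)
    ]


# Hypothetical inputs and weights.
u = [1.0, 2.0]
W1 = [[0.5, 1.0], [1.0, 0.5]]  # input -> intermediate
W2 = [[1.0, 0.5], [2.0, 1.0]]  # intermediate -> output

# Forward pass without non-linearities.
z = [sum(W1[i][k] * u[i] for i in range(2)) for k in range(2)]
v = [sum(W2[k][j] * z[k] for k in range(2)) for j in range(2)]

# Visualize v_1 only: its relevance is initialized to its own value.
r_z = propagate(z, W2, [v[0], 0.0])
r_u = propagate(u, W1, r_z)
```

With these values, the relevance assigned to the two inputs adds up to the value of the output neuron being visualized, which is the conservation property noted in the text.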
More formally, we introduce the following definitions to ease exposition.
Definition 5 Given a neuron u, its incoming neuron set IN(u) comprises all its direct connected preceding neurons in the network.
For example, in Figure 3, the incoming neuron set of z 1 is IN(z 1 ) = {u 1 , u 2 }.
Definition 6 Given a neuron u, its outcoming neuron set OUT(u) comprises all its direct connected descendant neurons in the network.
For example, in Figure 3, the outcoming neuron set of $z_1$ is $\mathrm{OUT}(z_1) = \{v_1, v_2\}$.
Definition 7 Given a neuron $v$ and its incoming neurons $u \in \mathrm{IN}(v)$, the weight ratio that measures the contribution of $u$ to $v$ is calculated as
$$w_{u \rightarrow v} = \frac{W_{u,v}\, u}{\sum_{u' \in \mathrm{IN}(v)} W_{u',v}\, u'} \quad (15)$$
Although the NMT model usually involves multiple operators such as matrix multiplication, element-wise multiplication, and maximization, they only influence the way to calculate weight ratios in Eq. (15).
For matrix multiplication such as $\boldsymbol{v} = W\boldsymbol{u}$, its basic form that is calculated at the neuron level is given by $v = \sum_{u \in \mathrm{IN}(v)} W_{u,v}\, u$. We follow Bach et al. (2015) to calculate the weight ratio using Eq. (15).

Figure 4: Visualizing source hidden states for a source content word "nian" (years).
For element-wise multiplication such as $\boldsymbol{v} = \boldsymbol{u}_1 \circ \boldsymbol{u}_2$, its basic form is given by $v = \prod_{u \in \mathrm{IN}(v)} u$. We use the following method to calculate its weight ratio:
$$w_{u \rightarrow v} = \frac{u}{\sum_{u' \in \mathrm{IN}(v)} u'}$$
For maximization such as $v = \max\{u_1, u_2\}$, we calculate its weight ratio as follows:
$$w_{u \rightarrow v} = \begin{cases} 1 & \text{if } u = \max_{u' \in \mathrm{IN}(v)} u' \\ 0 & \text{otherwise} \end{cases}$$
Therefore, the general local redistribution rule for LRP is given by
$$r_{u \leftarrow v} = \sum_{z \in \mathrm{OUT}(u)} w_{u \rightarrow z}\, r_{z \leftarrow v}$$
Algorithm 1 gives the layer-wise relevance propagation algorithm for neural machine translation. The input is an attention-based encoder-decoder neural network for a sentence pair after decoding G and a set of hidden states to be visualized V. The output is a set of vector-level relevance between intended hidden states and their contextual words R. The algorithm first computes weight ratios for each neuron in a forward pass (lines 1-4). Then, for each hidden state to be visualized (line 6), the algorithm initializes the neuron-level relevance for itself (lines 7-9). After initialization, the neuron-level relevance is back-propagated through the network (lines 10-12). Finally, vector-level relevance is calculated based on neuron-level relevance (lines 13-16). The time complexity of Algorithm 1 is $O(|G| \times |V| \times O_{\max})$, where $|G|$ is the number of neuron units in the neural network G, $|V|$ is the number of hidden states to be visualized, and $O_{\max}$ is the maximum out-degree of neurons in the network. Calculating relevance is more computationally expensive than computing attention as it involves all neurons in the network. Fortunately, it is possible to take advantage of parallel architectures of GPUs and relevance caching for speed-up.

Figure 5: Visualizing target hidden states for a target content word "visit".
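The overall procedure (a forward pass that stores weight ratios, followed by back-propagation of relevance from the neuron being visualized) can be sketched on a toy computation graph of scalar neurons. This is a simplified illustration under stated assumptions, not the authors' implementation: the graph encoding, the function names, and the values are hypothetical, and ties in the maximization operator are broken by giving all relevance to the first maximal input so that relevance is conserved.

```python
import math


def weight_ratios(op, inputs, weights=None):
    """Weight ratios w_{u->v} for one neuron v, given its incoming values."""
    if op == "linear":  # v = sum_u W_{u,v} * u
        terms = [w * u for w, u in zip(weights, inputs)]
        total = sum(terms)
        return [t / total for t in terms]
    if op == "product":  # v = prod_u u
        total = sum(inputs)
        return [u / total for u in inputs]
    if op == "max":  # v = max_u u (assumption: first maximal input wins ties)
        best = inputs.index(max(inputs))
        return [1.0 if i == best else 0.0 for i in range(len(inputs))]
    raise ValueError("unknown operator: " + op)


def forward(graph, inputs):
    """Compute neuron values and the weight ratio of every edge."""
    values = dict(inputs)
    ratios = {}
    for name, (op, incoming, weights) in graph.items():
        xs = [values[src] for src in incoming]
        for src, ratio in zip(incoming, weight_ratios(op, xs, weights)):
            ratios[(src, name)] = ratio
        if op == "linear":
            values[name] = sum(w * x for w, x in zip(weights, xs))
        elif op == "product":
            values[name] = math.prod(xs)
        else:
            values[name] = max(xs)
    return values, ratios


def backward(graph, values, ratios, target):
    """Apply r_{u<-v} = sum over z in OUT(u) of w_{u->z} * r_{z<-v}."""
    relevance = {target: values[target]}  # initialization: r_{v<-v} = v
    for name in reversed(list(graph)):  # reverse topological order
        if name not in relevance:
            continue
        for src in graph[name][1]:
            share = ratios[(src, name)] * relevance[name]
            relevance[src] = relevance.get(src, 0.0) + share
    return relevance


# Toy graph listed in topological order: name -> (operator, inputs, weights).
graph = {
    "z1": ("linear", ["u1", "u2"], [0.5, 1.0]),
    "z2": ("max", ["u1", "u2"], None),
    "v": ("linear", ["z1", "z2"], [1.0, 2.0]),
}
values, ratios = forward(graph, {"u1": 1.0, "u2": 2.0})
relevance = backward(graph, values, ratios, "v")
```

For a vector-valued hidden state, the same back-propagation would be run for each of its neurons and the resulting neuron-level relevance summed per contextual word to obtain vector-level relevance.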

Data Preparation
We evaluate our approach on Chinese-English translation. We use the open-source toolkit GROUNDHOG (Bahdanau et al., 2015), which implements the attention-based encoder-decoder framework. After model training and selection on the training and development sets, we use the resulting NMT model to translate the test set. Therefore, the visualization examples in the following subsections are taken from the test set.

Source Side
Figure 4 visualizes the source hidden states for a source content word "nian" (years). For each word in the source string "jin liang nian lai , meiguo" (in recent two years, USA), we attach a number to denote the position of the word in the sentence. For example, "nian" (years) is the third word. We are interested in visualizing the relevance between the third source forward hidden state $\overrightarrow{\boldsymbol{h}}_3$ and its contextual words "jin" (recent) and "liang" (two). We observe that the directly preceding word "liang" (two) contributes more to forming the forward hidden state of "nian" (years). For the third source backward hidden state $\overleftarrow{\boldsymbol{h}}_3$, the relevance of contextual words generally decreases as the distance to "nian" (years) increases. Clearly, the concatenation of forward and backward hidden states $\boldsymbol{h}_3$ captures contexts in both directions.
The situations for function words and punctuation marks are similar, but the relevance is usually more concentrated on the word itself. We omit the visualization due to space limitations.

Target Side
Figure 5 visualizes the target-side hidden states for the second target word "visit". For comparison, we also give the attention weights $\alpha_2$, which correctly identify that the second source word "canbai" ("visit") is most relevant to "visit".
The relevance vector of the source context $\boldsymbol{c}_2$ is generally consistent with the attention but reveals that the third word "shi" (is) also contributes to the generation of "visit".
For the target hidden state $\boldsymbol{s}_2$, the contextual word set includes the first target word "my". We find that most contextual words receive high values of relevance. This phenomenon has been frequently observed for most target words in other sentences. Note that the relevance vector is not normalized. This is an essential difference between attention and relevance. While attention is defined to be normalized, the only constraint on relevance is that the sum of relevance of contextual words is identical to the value of the intended hidden state neuron.
For the target word embedding $\boldsymbol{y}_2$, the relevance is generally consistent with the attention by identifying that the second source word contributes more to the generation of "visit". But $R_{\boldsymbol{y}_2}$ further indicates that the target word "my" is also very important for generating "visit".
Figure 6 shows the hidden states of a target UNK word, which is very common in NMT because of the limited vocabulary. It is interesting to investigate whether the attention mechanism could put a UNK in the right place in the translation. In this example, the 6-th source word "zhaiwuguo" is a UNK. We find that the model successfully predicts the correct position of the UNK by exploiting surrounding source and target contexts. But the ordering of UNK words usually becomes worse if multiple UNK words exist on the source side.

Translation Error Analysis
Given the visualization of hidden states, it is possible to offer useful information for analyzing translation errors commonly observed in NMT such as word omission, word repetition, unrelated words and negation reversion.

Word Omission
Given a source sentence "bajisitan zongtong muxialafu yingde can zhong liang yuan xinren toupiao" (pakistani president musharraf wins votes of confidence in senate and house), the NMT model produces a wrong translation "pakistani president win over democratic vote of confidence in the senate". One translation error is that the 6-th source word "zhong" (house) is incorrectly omitted from the translation.
As the end-of-sentence token "</s>" occurs earlier than expected, we choose to visualize its corresponding target hidden states. Although the attention correctly identifies the 6-th source word "zhong" (house) to be important for generating the next target word, the relevance of source context $R_{\boldsymbol{c}_{12}}$ attaches more importance to the end-of-sentence token.
Finally, the relevance of target word $R_{\boldsymbol{y}_{12}}$ reveals that the end-of-sentence token and the 11-th target word "senate" become dominant in the softmax layer for generating the target word.
This example demonstrates that only using attention matrices does not suffice to analyze the internal workings of NMT. The values of relevance of contextual words might vary significantly across different layers.

Word Repetition
Given a source sentence "meiguoren lishi shang you jiang chengxi de chuantong , you fancuo rencuo de chuantong" (in history , the people of america have the tradition of honesty and would not hesitate to admit their mistakes), the NMT model produces a wrong translation "in the history of the history of the history of the americans , there is a tradition of faith in the history of mistakes". The translation error is that "history" repeats four times in the translation. Figure 8 visualizes the target hidden states of the 6-th target word "history". According to the relevance of the target word embedding $R_{\boldsymbol{y}_6}$, the first source word "meiguoren" (american), the second source word "lishi" (history) and the 5-th target word "the" are most relevant to the generation of "history". Therefore, word repetition not only results from wrong attention but is also significantly influenced by target-side context. This finding confirms the importance of controlling source and target contexts to improve fluency and adequacy (Tu et al., 2017).

Figure 9: Analyzing translation error: unrelated words. The 9-th target word "forge" is totally unrelated to the source sentence.

Unrelated Words
Given a source sentence "ci ci huiyi de yi ge zhongyao yiti shi kuadaxiyang guanxi" (one of the top agendas of the meeting is to discuss the cross-atlantic relations), the model prediction is "a key topic of the meeting is to forge ahead". One translation error is that the 9-th English word "forge" is totally unrelated to the source sentence. Figure 9 visualizes the hidden states of the 9-th target word "forge". We find that while the attention identifies the 10-th source word "kuadaxiyang" (cross-atlantic) to be most relevant, the relevance vector of the target word $R_{\boldsymbol{y}_9}$ shows that multiple source and target words contribute to the generation of the next target word.
We observe that unrelated words are more likely to occur if multiple contextual words have high values in the relevance vector of the target word being generated.

Negation Reversion
Given a source sentence "bu jiejue shengcun wenti , jiu tan bu shang fa zhan , geng tan bu shang ke chixu fazhan" (without solution to the issue of subsistence , there will be no development to speak of , let alone sustainable development), the model prediction is "if we do not solve the problem of living , we will talk about development and still less can we talk about sustainable development". The translation error is that the 8-th source word, the negation "bu" (not), is left untranslated. The omission of negation is a severe translation error because it reverses the meaning of the source sentence.
As shown in Figure 10, while both attention and relevance correctly identify the 8-th negation word "bu" (not) to be most relevant, the model still generates "about" instead of a negation target word. One possible reason is that target context words "will talk" take the lead in determining the next target word.

Extra Words
Given a source sentence "bajisitan zongtong muxialafu yingde can zhong liang yuan xinren toupiao" (pakistani president musharraf wins votes of confidence in senate and house), the model prediction is "pakistani president win over democratic vote of confidence in the senate". The translation error is that the 5-th target word "democratic" is an extra word that has no counterpart in the source sentence. Figure 11 visualizes the hidden states of the 5-th target word "democratic". We find that while the attention identifies the 9-th source word "xinren" (confidence) to be most relevant, the relevance vector of the target word $R_{\boldsymbol{y}_5}$ indicates that the end-of-sentence token and target words contribute more to the generation of "democratic".

Figure 11: Analyzing translation error: extra word. The 5-th target word "democratic" is an extra word.

Summary of Findings
We summarize the findings of visualizing and analyzing the decoding process of NMT as follows:
1. Although attention is very useful for understanding the connection between source and target words, only using attention is not sufficient for deep interpretation of target word generation (Figure 9);
2. The relevance of contextual words might vary significantly across different layers of hidden states (Figure 9);
3. Target-side context also plays a critical role in determining the next target word being generated. It is important to control both source and target contexts to produce correct translations (Figure 10).

Related Work
Our work is closely related to previous visualization approaches that compute the contribution of a unit at the input layer to the final decision at the output layer (Simonyan et al., 2014; Mahendran and Vedaldi, 2015; Nguyen et al., 2015; Girshick et al., 2014; Bach et al., 2015). Among them, our approach bears most resemblance to Bach et al. (2015) since we adapt layer-wise relevance propagation to neural machine translation. The major difference is that word vectors rather than single pixels are the basic units in NMT. Therefore, we propose vector-level relevance based on neuron-level relevance for NMT. The calculation of weight ratios has also been carefully designed for the operators in NMT. The proposed approach also differs from derivative-based approaches in that we use relevance rather than partial derivatives to quantify the contributions of contextual words. A major advantage of using relevance is that it does not require neural activations to be differentiable or smooth (Bach et al., 2015).
The relevance vector we used is significantly different from the attention matrix (Bahdanau et al., 2015). While attention only demonstrates the association degree between source and target words, relevance can be used to calculate the association degree between two arbitrary neurons in neural networks. In addition, relevance is effective in analyzing the effect of source and target contexts on generating target words.

Conclusion
In this work, we propose to use layer-wise relevance propagation to visualize and interpret neural machine translation. Our approach is capable of calculating the relevance between arbitrary hidden states and contextual words by back-propagating relevance along the network recursively. Analyses of the state-of-the-art attention-based encoder-decoder framework on Chinese-English translation show that our approach is able to offer more insights than the attention mechanism for interpreting neural machine translation.
In the future, we plan to apply our approach to more NMT approaches (Shen et al., 2016; Tu et al., 2016) on more language pairs to further verify its effectiveness. It is also interesting to develop relevance-based neural translation models to explicitly control relevance to produce better translations.