Interactive Visualization and Manipulation of Attention-based Neural Machine Translation

While neural machine translation (NMT) provides high-quality translation, its behavior is still hard to interpret and analyze. We present an interactive interface for visualizing and intervening in the behavior of NMT, concentrating on the beam search mechanism and the attention component. The tool (1) visualizes the search tree and attention and (2) provides an interface to adjust the search tree and attention weights (manually or automatically) in real time. We show that the tool offers various ways to understand NMT.


Introduction
Recent advances in neural machine translation (NMT) (Sutskever et al., 2014) have changed the direction of the machine translation community. Compared to traditional phrase-based statistical machine translation (SMT) (Koehn, 2010), NMT provides more accurate and fluent translation results. Companies have also started to adopt NMT for their machine translation services.
However, it is still challenging to analyze the translation behavior of NMT. While SMT provides interpretable features (such as a phrase table), NMT directly learns complex features that are obscure to humans. This is especially problematic when a translation is wrong, since it is hard even to understand why the system generated such a sentence.
To help the analysis, we propose a tool for visualizing and intervening in NMT behavior, concentrating on the beam search decoder and attention. The features fall into two categories:
• Visualizing the decoder result, including how the decoder assigns probability to each token
• Intervening in decoder behavior, including manually expanding hypotheses discarded during search and adjusting attention weights
This helps in understanding how the components affect translation quality.
We describe the mechanisms of visualization (Sections 3.1 and 3.2) and manipulation (Sections 3.3 and 3.4) and show their usefulness with examples.

Related Work
There have been various methods proposed for visualizing and intervening neural models for NLP.  provides a concise literature review.
Visualization and manipulation of NMT can be grouped into three parts: the RNN (of the encoder and decoder), attention (of the decoder), and beam search (of the decoder). The RNN plays a central role in recognizing source sentences and generating target sentences. Although we treat the RNN as a black box here, there exist various methods for understanding RNNs, e.g. by observing intermediate values (Strobelt et al., 2016; Karpathy et al., 2015) or by removing some parts of them (Goh, 2016; Li et al., 2016).
Attention (Bahdanau et al., 2014; Luong et al., 2015) is an important component for improving NMT quality. Since the component behaves like alignment in traditional SMT, it has been proposed to utilize attention during training (Cheng et al., 2015; Tu et al., 2016b) or during decoding (Wu et al., 2016). In this work, we propose a way to manipulate attention and to understand its behavior.
Beam search is known to improve the quality of NMT translation output. However, it is also known that a larger beam size does not always help and can even hurt the quality (Tu et al., 2016a). Therefore, it is important to understand how beam search affects quality. Wu et al. (2016) and Freitag and Al-Onaizan (2017) proposed several penalty functions and pruning methods for beam search. We directly visualize the beam search result as a tree and manually explore hypotheses discarded by the decoder.

Interactive Beam Search
We propose an interactive tool for visualizing and manipulating NMT decoder behavior. The system consists of two parts: a back-end NMT server and a front-end web interface. The NMT server is responsible for NMT computation. The web interface is responsible for requesting computation from the NMT server and showing the results in real time.
For the back-end implementation, we use two NMT models. For English-Korean (en-ko), we use a model used in the Naver Papago (Lee et al., 2016) service, ported to TensorFlow. For German-English (de-en), we adopted Nematus and the pretrained models provided by Sennrich et al. (2016). For the front-end, we implemented a JavaScript-based web page with d3.js.
[Figure caption, partially recovered: "... Table. Weight is represented as number and green color."]

Search Tree Visualization
To understand how the beam search decoder selects and discards intermediate hypotheses, we first plot all hypotheses as a tree (Figures 2 and 3). For each input token (word or sub-word) and decoder (RNN) state vector, the decoder computes the output probability of every possible output token, and the beam search routine then selects tokens based on their probability values (Figure 4). We plot each input and output token as a tree node, and each input-output relation as an edge. When the mouse hovers over a node, the node shows its most probable next tokens, including pruned ones (Figure 1). We also visualize the output probability of a node using edge thickness: a thicker edge means a higher probability.
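The tree construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: `toy_step` is a hypothetical stand-in for the NMT decoder, `Node`, `beam_search_tree` and `edge_width` are names introduced here, the linear mapping from probability to edge width is our own choice, and finished hypotheses are simply not expanded further. The point is that every top-ranked candidate is stored in the tree, and a `kept` flag records whether it survived pruning.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(eq=False)
class Node:
    token: str
    prob: float                      # probability of this token given its prefix
    score: float                     # cumulative log-probability of the hypothesis
    parent: Optional["Node"] = None
    kept: bool = False               # True if the node survived beam pruning
    children: List["Node"] = field(default_factory=list)

    def prefix(self) -> List[str]:
        """Tokens from the root down to this node."""
        tokens, node = [], self
        while node is not None:
            tokens.append(node.token)
            node = node.parent
        return tokens[::-1]


def toy_step(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical decoder: next-token distribution for a prefix."""
    if prefix[-1] == "<s>":
        return {"he": 0.5, "bass": 0.4, "the": 0.1}
    if prefix[-1] == "he":
        return {"plays": 0.7, "is": 0.2, "</s>": 0.1}
    return {"</s>": 0.8, "player": 0.15, "is": 0.05}


def beam_search_tree(step_fn: Callable[[List[str]], Dict[str, float]],
                     beam_size: int = 2, top_k: int = 3,
                     max_len: int = 3, eos: str = "</s>") -> Node:
    """Run beam search and return the full tree, pruned candidates included."""
    root = Node("<s>", 1.0, 0.0, kept=True)
    beam = [root]
    for _ in range(max_len):
        candidates = []
        for node in beam:
            if node.token == eos:    # finished hypotheses are not expanded
                continue
            dist = step_fn(node.prefix())
            ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
            for token, p in ranked[:top_k]:
                child = Node(token, p, node.score + math.log(p), parent=node)
                node.children.append(child)   # stored even if pruned below
                candidates.append(child)
        if not candidates:
            break
        candidates.sort(key=lambda n: n.score, reverse=True)
        beam = candidates[:beam_size]
        for node in beam:
            node.kept = True
    return root


def edge_width(node: Node, min_width: float = 1.0, max_width: float = 6.0) -> float:
    """Thicker edge for higher probability (linear mapping, an assumption)."""
    return min_width + (max_width - min_width) * node.prob


root = beam_search_tree(toy_step)
for child in root.children:
    print(child.token, "kept" if child.kept else "pruned", edge_width(child))
```

Keeping pruned candidates in the same structure as surviving ones is what allows a front-end to show them on hover and, later, to expand them on request.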

Attention Visualization
We show the attention weights of a (partially) generated sentence as a table (Figure 5) and as a graph (Figure 6). The table interface provides detailed information, while the graph interface provides a more concise view and is therefore better suited to long sentences.

Search Tree Manipulation
We implemented an interface to manually expand nodes that are discarded during beam search. In the search tree visualization (Figure 7) or the attention manipulation dialog (Figure 9), a user can click one of the output candidates (green nodes), and the system then computes its next outputs and extends the tree. This enables exploration of hypotheses that are not covered by the decoder but are worth analyzing.
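Manual expansion amounts to running one more decoder step from a node that beam search discarded. The sketch below illustrates the idea under stated assumptions: nodes are plain dictionaries, `toy_step` is a hypothetical stand-in for the decoder, and `make_node` and `expand_node` are names introduced here rather than functions of the tool.

```python
import math
from typing import Callable, Dict, List


def toy_step(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical decoder: next-token distribution for a prefix."""
    if prefix[-1] == "the":
        return {"bass": 0.6, "player": 0.3, "</s>": 0.1}
    return {"</s>": 1.0}


def make_node(prefix: List[str], score: float, kept: bool) -> dict:
    """A tree node: its token prefix, cumulative log-probability and children."""
    return {"prefix": prefix, "score": score, "kept": kept, "children": []}


def expand_node(node: dict,
                step_fn: Callable[[List[str]], Dict[str, float]],
                top_k: int = 3) -> List[dict]:
    """Compute the next outputs of `node` and attach them as children.

    Calling it again on an already expanded node does nothing new.
    """
    if not node["children"]:
        dist = step_fn(node["prefix"])
        ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
        for token, p in ranked[:top_k]:
            child = make_node(node["prefix"] + [token],
                              node["score"] + math.log(p), kept=False)
            node["children"].append(child)
    return node["children"]


# A hypothesis that beam search pruned after the first step.
pruned = make_node(["<s>", "the"], math.log(0.1), kept=False)
children = expand_node(pruned, toy_step)
for child in children:
    print(child["prefix"], child["score"])
```

In an interactive setting, a click on a candidate node would trigger a call like `expand_node`, and the returned children would be added to the displayed tree.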

Attention Manipulation
We are interested in understanding the attention layer of (Bahdanau et al., 2014; Luong et al., 2015), especially the role and effect of attention weights. To this end, we modified the NMT decoder to accept arbitrary attention weights in place of those the decoder computes (Figure 8).

Manual Adjustment of Attention Weight
For given memory cells (encoder outputs) $(m_1, \cdots, m_n)$ and decoder internal state $h$, the attention layer first computes the relevance score of each memory cell, $s_i = f(m_i, h)$, and the attention weight $w_i = \mathrm{softmax}(s_1, \cdots, s_n)_i$. Then the memory cells are summarized into one fixed vector $\bar{m}$ via a weighted sum: $\bar{m} = \sum_i w_i m_i$. The summarized vector is fed to the next layer to compute output token probabilities: $p(y_j) = g(\bar{m}, h)_j$.
[Figure caption, partially recovered: Here "attention weight" is replaced by "custom attention weight" which is given by user or computed to maximize probability of output token.]
Figure 9: Result of attention manipulation for two output tokens "어조" and "음색".
We modified the decoder to accept a custom weight $w' = (w'_1, \cdots, w'_n)$ in place of the original weight $w$ whenever $w'$ is provided by the user. We also implemented a front-end interface to adjust the custom weight (Figure 9). When the user drags a circle on the bar, the weights are adjusted and the system computes new output probabilities using them. This helps in understanding what is encoded in each memory cell and how the decoder utilizes the attended memory $\bar{m}$. For example, the user may increase or decrease the weight of a specific memory cell and observe the effect. Figure 9 shows an illustrative example of how adjusting attention weights can change the output probability distribution. When the weights of "highly" and "tone" are high, NMT assigns high probability to "어조" ("tone of voice"). When the weight of "distinctive" is high, NMT recognizes "tone" in the current context (musical instrument) and assigns high probability to "음색" ("timbre").
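The effect of overriding the attention weight can be illustrated with a small, self-contained sketch. Everything here is a toy assumption: the two-dimensional memory cells, the output matrix, and the functions `attend` and `output_probs` are introduced for illustration only; the toy output layer ignores the decoder state $h$ for brevity, and the custom weight is used exactly as given, without renormalization.

```python
import math
from typing import List, Optional, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(memory: Sequence[Sequence[float]],
           scores: Sequence[float],
           custom_weight: Optional[Sequence[float]] = None
           ) -> Tuple[List[float], List[float]]:
    """Return (weights, summary), where summary = sum_i w_i * m_i.

    If `custom_weight` is given, it replaces softmax(scores).
    """
    weights = softmax(scores) if custom_weight is None else list(custom_weight)
    if len(weights) != len(memory):
        raise ValueError("need one weight per memory cell")
    dim = len(memory[0])
    summary = [sum(w * cell[d] for w, cell in zip(weights, memory))
               for d in range(dim)]
    return weights, summary


def output_probs(summary: Sequence[float],
                 out_matrix: Sequence[Sequence[float]]) -> List[float]:
    """Toy output layer g: softmax of a linear projection of the summary."""
    logits = [sum(r * x for r, x in zip(row, summary)) for row in out_matrix]
    return softmax(logits)


# Toy setup: two memory cells, two candidate output tokens.
memory = [[1.0, 0.0], [0.0, 1.0]]
out_matrix = [[2.0, 0.0], [0.0, 2.0]]   # row j scores output token j
scores = [0.0, 0.0]                      # relevance scores s_i

_, original_summary = attend(memory, scores)
p_original = output_probs(original_summary, out_matrix)

# The user raises the weight of the first memory cell.
_, custom_summary = attend(memory, scores, custom_weight=[0.9, 0.1])
p_custom = output_probs(custom_summary, out_matrix)

print(p_original, p_custom)
```

With equal relevance scores, the two toy output tokens are equally likely; shifting weight to the first memory cell shifts probability to the first token, which mirrors the kind of change a user would observe when dragging a weight in the interface.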

Automatic Adjustment of Attention Weight
We also implemented a method to find the attention weight that maximizes the output probability of a specific token. For attention weight $w$ and token $y$, we view this problem as a constrained optimization: maximize $\log p(y \mid w, \cdots)$ subject to $w_i \ge 0$ and $\sum_i w_i = 1$. Since the toolkits we use (TensorFlow and Theano) provide unconstrained gradient descent optimizers, we cast the original problem as an unconstrained optimization: instead of the weight $w$, we optimize the unnormalized score $s$ before the softmax, initialized as $s_i = \log w_i$ so that the softmax of the initial scores reproduces the original weights. The method can be used to optimize the weights for a specific time step (Figure 9) or for a whole sentence (Figure 10).
For English-Korean, this technique is particularly useful because the original attention weight is sometimes hard to interpret. Due to ordering differences between the two languages, en-ko NMT models tend to generate diverse sentences whose word orders differ greatly from one another. In Figure 3, the input sentence is "As a bass player, he is . . . ". NMT assigns high probability to output sentences starting with either "베이스" ("bass") or "그" ("he"), since both are valid. Therefore, the corresponding source words have high attention weights (0.14 for "bass" and 0.07 for "he"). Since the output token is chosen after attention, the attention weights do not necessarily look like an alignment between the source and output sentences, but rather like a mixture of the alignments of possible output sentences.
Figure 10: Two attention graphs of en-ko NMT. The first one shows attention weights from NMT. The second one shows attention weights adjusted to maximize the probability of the target sentence. It reveals a clearer and more interpretable relation than the original attention.
Once the output token is chosen, we can find a new attention weight that increases the probability of that token and that is likely to be more interpretable than the original weight. An example of such an adjustment is shown in Figure 10.
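The reparameterized optimization can be sketched as follows. This is a toy illustration, not the method as implemented in TensorFlow or Theano: the output layer, memory cells and function names are assumptions introduced here, and a finite-difference gradient stands in for the automatic differentiation those toolkits provide. The sketch only demonstrates the key step, namely optimizing the scores $s$ with $s_i = \log w_i$ and reading the weights back through a softmax, which keeps them nonnegative and summing to one without any explicit constraint.

```python
import math
from typing import Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob_of_token(weights: Sequence[float],
                      memory: Sequence[Sequence[float]],
                      out_matrix: Sequence[Sequence[float]],
                      target: int) -> float:
    """Toy log p(y | w): attention summary followed by a linear softmax layer."""
    dim = len(memory[0])
    summary = [sum(w * cell[d] for w, cell in zip(weights, memory))
               for d in range(dim)]
    logits = [sum(r * x for r, x in zip(row, summary)) for row in out_matrix]
    return math.log(softmax(logits)[target])


def optimize_attention(init_weights: Sequence[float],
                       objective: Callable[[List[float]], float],
                       steps: int = 200, lr: float = 0.5,
                       eps: float = 1e-5) -> List[float]:
    """Gradient ascent on unnormalized scores s, with weights = softmax(s)."""
    # s_i = log w_i, so softmax(s) starts at the original weights.
    scores = [math.log(max(w, 1e-12)) for w in init_weights]
    for _ in range(steps):
        base = objective(softmax(scores))
        grad = []
        for i in range(len(scores)):
            bumped = list(scores)
            bumped[i] += eps
            # Forward finite difference in place of autodiff (an assumption).
            grad.append((objective(softmax(bumped)) - base) / eps)
        scores = [s + lr * g for s, g in zip(scores, grad)]
    return softmax(scores)


# Toy setup: two memory cells, two output tokens, target token 0.
memory = [[1.0, 0.0], [0.0, 1.0]]
out_matrix = [[2.0, 0.0], [0.0, 2.0]]
original = [0.5, 0.5]


def objective(w: List[float]) -> float:
    return log_prob_of_token(w, memory, out_matrix, target=0)


adjusted = optimize_attention(original, objective)
print(adjusted, objective(original), objective(adjusted))
```

In this toy case the adjusted weight concentrates on the memory cell that supports the target token. To optimize a whole sentence, the objective would sum the log-probabilities of all chosen tokens over their respective time steps.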

Conclusion
We propose a web-based interface for visualizing, investigating and understanding neural machine translation (NMT). The tool provides several interactive methods for understanding beam search and the attention mechanism: visualizing the search tree and attention, expanding the search tree manually, and changing attention weights either manually or automatically. We show that this visualization and manipulation helps in understanding NMT behavior.