Dynamic Past and Future for Neural Machine Translation

Previous studies have shown that neural machine translation (NMT) models can benefit from explicitly modeling translated (PAST) and untranslated (FUTURE) source contents as recurrent states (Zheng et al., 2018). However, this less interpretable recurrent process hinders its power to model the dynamic updating of PAST and FUTURE contents during decoding. In this paper, we propose to model these dynamics by explicitly separating source words into groups of translated and untranslated contents through a parts-to-wholes assignment. The assignment is learned through a novel variant of the routing-by-agreement mechanism (Sabour et al., 2017), namely Guided Dynamic Routing, in which the translating status at each decoding step guides the routing process to assign each source word to its associated group (i.e., translated or untranslated content), represented by a capsule, enabling translation to be made from a holistic context. Experiments show that our approach achieves substantial improvements over both RNMT and Transformer by producing more adequate translations. Extensive analysis demonstrates that our method is highly interpretable, recognizing the translated and untranslated contents as expected.


Introduction
Neural machine translation (NMT) generally adopts an attentive encoder-decoder framework (Sutskever et al., 2014; Vaswani et al., 2017), where the encoder maps a source sentence into a sequence of contextual representations (source contents), and the decoder generates a target sentence word by word based on the part of the source content assigned by an attention model (Bahdanau et al., 2015). Like human translators, NMT systems should have the ability to know the relevant source-side context for the current word (PRESENT), as well as recognize what parts of the source contents have been translated (PAST) and what parts have not (FUTURE), at each decoding step. Accordingly, the PAST, PRESENT and FUTURE are three dynamically changing states during the whole translation process.
Previous studies have shown that NMT models are likely to face the problem of inadequate translation (Kong et al., 2019), which usually manifests as over- and under-translation (Tu et al., 2016, 2017). This issue may be attributed to NMT's poor ability to recognize the dynamically changing translated and untranslated contents. To remedy this, Zheng et al. (2018) first demonstrate that explicitly tracking PAST and FUTURE contents helps NMT models alleviate this issue and generate better translations. In their work, the running PAST and FUTURE contents are modeled as recurrent states. However, with such a recurrent process it remains non-trivial to determine which parts of the source words are the PAST and which are the FUTURE, and to what extent the recurrent states represent them. This less interpretable nature is probably not the best way to model and exploit the dynamic PAST and FUTURE.
We argue that an explicit separation of the source words into two groups, representing PAST and FUTURE respectively (Figure 1), could be more beneficial, not only for easy and direct recognition of the translated and untranslated source contents, but also for better interpretation of the model's behavior during this recognition. We formulate the explicit separation as a parts-to-wholes assignment: the representation of each source word (a part) should be assigned to its associated group of either PAST or FUTURE (the wholes).
Figure 1: An example of the separation of PAST and FUTURE in machine translation. When generating the current translation "his", the source tokens "<BOS>", "布什 (Bush)" and the phrase "为...辩护 (defend)" are the translated contents (PAST), while the remaining tokens are untranslated contents (FUTURE).

In this paper, we implement this idea using Capsule Networks (Hinton et al., 2011) with the routing-by-agreement mechanism (Sabour et al., 2017), which has demonstrated appealing strength in solving the parts-to-wholes assignment problem (Hinton et al., 2018; Gong et al., 2018), to model the separation of the PAST and FUTURE:
1. We first cast the PAST and FUTURE source contents as two groups of capsules.
2. We then design a novel variant of the routing-by-agreement mechanism, called Guided Dynamic Routing (GDR), in which the translating status at each decoding step guides the routing process to assign each source word to its associated capsules via assignment probabilities over several routing iterations.
3. Finally, the PAST and FUTURE capsules accumulate their expected contents from the source representations and are fed into the decoder to provide a time-dependent holistic view of the context for each prediction.
In addition, two auxiliary learning signals encourage GDR to acquire the expected functionality, beyond the implicit learning that happens within the training of the NMT model.
We conducted extensive experiments and analysis to verify the effectiveness of our proposed model.
Experiments on Chinese-to-English, English-to-German, and English-to-Romanian translation show consistent and substantial improvements over both the Transformer (Vaswani et al., 2017) and RNMT (Bahdanau et al., 2015). Visualized evidence shows that our approach does acquire the expected ability to separate the source words into PAST and FUTURE, which is highly interpretable. We also observe that our model does alleviate the inadequate translation problem: a human subjective evaluation reveals that our model produces more adequate and higher-quality translations than Transformer, and a length analysis over source sentences shows that our model generates not only longer but also better translations.

Neural Machine Translation
Neural models for sequence-to-sequence tasks such as machine translation often adopt an encoder-decoder framework. Given a source sentence x = x_1, ..., x_I, an NMT model learns to predict a target sentence y = y_1, ..., y_T by maximizing the conditional probability p(y|x) = ∏_{t=1}^{T} p(y_t | y_{<t}, x). Specifically, an encoder first maps the source sentence into a sequence of encoded representations:

h = h_1, ..., h_I = f_e(x_1, ..., x_I),

where f_e is the encoder's transformation function.
Given the encoded representations of the source words, a decoder generates the sequence of target words y autoregressively:

z_t = f_d(E(y_{t-1}), z_{t-1}, a_t),

where E(y_{t-1}) is the embedding of the previously generated word y_{t-1}. The current word is predicted based on the decoder state z_t. f_d is the transformation function of the decoder, which determines z_t based on the target translation trajectory y_{<t} and the lexical-level source content a_t that is most relevant to the PRESENT translation, computed by an attention model (Bahdanau et al., 2015). Ideally, with all the encoded source representations available from the encoder, NMT models should be able to update the translated and untranslated source contents and keep them in mind. However, most existing NMT models lack an explicit functionality to maintain the translated and untranslated contents, failing to distinguish whether a source word belongs to the PAST or the FUTURE (Zheng et al., 2018), and are thus likely to suffer from severe inadequate translation problems (Tu et al., 2016; Kong et al., 2019).
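The autoregressive factorization above can be sketched as a small scoring loop. This is an illustrative fragment, not the paper's implementation: the `step` callable stands in for the decoder's f_d plus output softmax.

```python
import math

def sequence_log_prob(y, step):
    """log p(y | x) = sum_t log p(y_t | y_<t, x).

    `step(prefix)` is any callable returning a probability distribution
    over the vocabulary (a list indexed by word id) given the target prefix.
    """
    total = 0.0
    for t in range(len(y)):
        probs = step(y[:t])          # p(. | y_<t, x)
        total += math.log(probs[y[t]])
    return total
```

For instance, a uniform 4-word vocabulary gives each 3-word sentence a log probability of 3·log(1/4).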

Approach
Motivation Our intuition is straightforward: if we could tell the translated and untranslated source contents apart by directly separating the source words into PAST and FUTURE categories at each decoding step, the PRESENT translation could benefit from a dynamically holistic context (i.e., PAST + PRESENT + FUTURE). To this end, we need a mechanism by which each source word can be recognized and assigned to a distinct category, i.e., PAST or FUTURE, subject to the translation status at present. This procedure can be seen as a parts-to-wholes assignment, in which the encoder hidden states of the source words (parts) are assigned to either PAST or FUTURE (wholes). The capsule network (Hinton et al., 2011) has shown its capability of solving the problem of assigning parts to wholes (Sabour et al., 2017). A capsule is a vector of neurons that represents different properties of the same entity in the input (Sabour et al., 2017). Its functionality relies on a fast iterative process called routing-by-agreement, whose basic idea is to iteratively refine the proportion of how much a part should be assigned to a whole, based on the agreement between the part and the whole. It is therefore appealing to investigate whether this mechanism can be employed for our purpose.

Guided Dynamic Routing (GDR)
Dynamic routing (Sabour et al., 2017) is an implementation of routing-by-agreement that runs intrinsically, without any external guidance. However, what we need is a mechanism driven by the decoding status at present. We therefore propose a variant of the dynamic routing mechanism called Guided Dynamic Routing (GDR), in which the routing process is guided by the translating information at each decoding step (Figure 2).
Formally, we cast the encoded representations h of the I source words as input capsules, and denote by Ω the output capsules, which consist of J entries. Initially, we assume that J/2 of them (Ω^P) represent the PAST contents, and the remaining J/2 capsules (Ω^F) represent the FUTURE:

Ω^P = {Ω_1, ..., Ω_{J/2}},   Ω^F = {Ω_{J/2+1}, ..., Ω_J},

where each capsule is represented by a d_c-dimensional vector. We assemble the PAST and FUTURE capsules together, expecting them to compete for source information, i.e., we now have Ω = Ω^P ∪ Ω^F. We describe how to teach these capsules to retrieve their relevant parts from the source contents in Section 3.3. Note that we employ GDR at every decoding step t to obtain the time-dependent PAST and FUTURE, and omit the subscript t for simplicity.
In the dynamic routing process, the vector output of capsule j is calculated with a non-linear squashing function (Sabour et al., 2017):

Ω_j = squash(s_j) = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||),   s_j = Σ_i c_ij v_ij,   v_ij = W_j h_i,   (4)

where W_j ∈ R^{d×d_c} is a trainable matrix for the j-th output capsule, and v_ij is the vote vector from input capsule i to output capsule j. c_ij is the assignment probability (i.e., the agreement) that is determined by the iterative dynamic routing. The assignment probabilities c_i· associated with each input capsule h_i sum to 1, i.e., Σ_j c_ij = 1, and are computed by:

c_ij = exp(b_ij) / Σ_{j'} exp(b_{ij'}),   (6)

where the routing logit b_ij, initialized to 0, measures the degree to which h_i should be sent to Ω_j. The routing logits are then iteratively updated by measuring the agreement between the vote vector v_ij and capsule Ω_j with an MLP that conditions on the current decoding state z_t:

b_ij ← b_ij + w^T tanh(W_b [v_ij; Ω_j; z_t]),   (7)

where W_b ∈ R^{(d+2d_c)×d_c} and w ∈ R^{d_c} are learnable parameters. Instead of using the simple scalar product b_ij ← b_ij + v_ij · Ω_j (Sabour et al., 2017), which cannot consider the current decoding state as a conditioning signal, we resort to the MLP to take z_t into account, inspired by the MLP-based attention mechanism (Bahdanau et al., 2015; Luong et al., 2015). This is why we call it "guided" dynamic routing.

Algorithm 1 Guided Dynamic Routing (GDR)
Input: encoder hidden states h, current decoding hidden state z_t, and number of routing iterations r.
Output: PAST, FUTURE, and redundant capsules.
procedure GDR(h, z_t, r)
1: ∀i ∈ h, j ∈ Ω: b_ij ← 0, v_ij ← W_j h_i   ▷ initialize routing logits and vote vectors
2: for r iterations do
3:   ∀i ∈ h, j ∈ Ω: compute assignment probabilities c_ij by Eq. 6
4:   ∀j ∈ Ω: compute capsules Ω_j by Eq. 4
5:   ∀i ∈ h, j ∈ Ω: update routing logits b_ij by Eq. 7
6: end for
7: [Ω^P; Ω^F; Ω^R] = Ω   ▷ split into past, future, and redundant capsules
8: return Ω^P, Ω^F, Ω^R

With the awareness of the current decoding status, the hidden state (input capsule) of a source word prefers to send its representation to the output capsules that have large routing agreements with it. After a few iterations, the output capsules are able to ignore all but the most relevant information from the source hidden states, each representing a distinct aspect of either PAST or FUTURE.

Redundant Capsules
In some cases, parts of the source sentence belong to neither the past contents nor the future contents. For example, function words in English (e.g., "the") cannot find their counterpart translations in Chinese. Therefore, we add additional Redundant Capsules Ω^R (also known as "orphan capsules" in Sabour et al. (2017)), which are expected to receive higher routing assignment probabilities when a source word belongs to neither PAST nor FUTURE.
We show the algorithm of our guided dynamic routing in Algorithm 1.
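To make the routing concrete, the following is a minimal NumPy sketch of Algorithm 1. It is an illustrative re-implementation, not the paper's code: parameter shapes follow Eqs. 4-7, and splitting the returned capsules into the PAST, FUTURE and redundant groups is left to the caller.

```python
import numpy as np

def squash(s, eps=1e-8):
    # squash(s) = ||s||^2 / (1 + ||s||^2) * s / ||s||   (Sabour et al., 2017)
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_dynamic_routing(h, z_t, W, W_b, w, r=3):
    """h: (I, d) encoder states; z_t: (d,) current decoder state;
    W: (J, d, d_c) vote projections; W_b: (d + 2*d_c, d_c); w: (d_c,)."""
    I, d = h.shape
    J, _, d_c = W.shape
    v = np.einsum('jdc,id->ijc', W, h)                  # vote vectors v_ij: (I, J, d_c)
    b = np.zeros((I, J))                                # routing logits, init 0 (line 1)
    for _ in range(r):
        c = softmax(b, axis=1)                          # Eq. 6: sum_j c_ij = 1
        omega = squash(np.einsum('ij,ijc->jc', c, v))   # Eq. 4: capsules (J, d_c)
        feats = np.concatenate([v,
                                np.broadcast_to(omega, (I, J, d_c)),
                                np.broadcast_to(z_t, (I, J, d))], axis=-1)
        b = b + np.tanh(feats @ W_b) @ w                # Eq. 7: guided agreement update
    return omega, c
```

Splitting `omega` into Ω^P, Ω^F and Ω^R then amounts to slicing the J capsules into the three groups (line 7 of Algorithm 1).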

Integrating into NMT
The proposed GDR can be applied on top of any sequence-to-sequence architecture without requiring any specific modification. Let us take a Transformer-style architecture as an example (Figure 3). Given a sentence x = x_1, ..., x_I, the encoder leverages N stacked identical layers to map the sentence into contextual representations:

h^(l) = f_enc^(l)(h^(l-1)),

where the superscript l indicates the layer depth. Based on the encoded source representations h^(N), a decoder generates the translation word by word. The decoder also has N stacked identical layers:

z^(l) = f_dec^(l)(z^(l-1), a^(l)),

where a^(l) is the lexical-level source context assigned by an attention mechanism between the current decoder layer and the last encoder layer. Given the hidden states z^(N) of the last decoder layer, we perform the proposed guided dynamic routing (GDR) mechanism to compute the PAST and FUTURE contents from the source side and obtain the holistic context at each decoding step:

Ω^P_t, Ω^F_t, Ω^R_t = GDR(h^(N), z^(N)_t, r),   o_t = [z^(N)_t; Ω^P_t; Ω^F_t],

where o = o_1, ..., o_T is the sequence of holistic contexts over decoding steps. Based on the holistic context, the output probabilities are computed as:

p(y_t | y_<t, x) = softmax(W_o o_t).

The NMT model is now able to employ the dynamic holistic context for better generation.
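The output layer over the holistic context can be sketched as below. This is a simplified, hypothetical version: it assumes the holistic context is a plain concatenation of the decoder state with the flattened PAST and FUTURE capsules, followed by a single vocabulary projection `W_o`.

```python
import numpy as np

def output_distribution(z_t, past_caps, future_caps, W_o):
    """Concatenate the decoder state with the flattened PAST and FUTURE
    capsules (the holistic context o_t) and project to the vocabulary."""
    o_t = np.concatenate([z_t, past_caps.ravel(), future_caps.ravel()])
    logits = W_o @ o_t
    e = np.exp(logits - logits.max())
    return e / e.sum()              # p(y_t | y_<t, x)
```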

Auxiliary Guided Losses
To ensure that the dynamic routing process runs as expected, we introduce the following auxiliary guided signals to assist the learning process.
Bag-of-Words Constraint Weng et al. (2017) propose a multitask scheme to boost NMT by predicting the bag-of-words of the target sentence using the Word Predictions approach. Inspired by this work, we introduce a BOW constraint to encourage the PAST and FUTURE capsules to be predictive of the preceding and subsequent bags-of-words at each decoding step, respectively:

L_BOW = -Σ_t [ log p_pre(y_≤t | Ω^P_t) + log p_sub(y_≥t | Ω^F_t) ],

where p_pre(y_≤t | Ω^P_t) and p_sub(y_≥t | Ω^F_t) are the predicted probabilities of the preceding and subsequent bags-of-words at decoding step t, respectively. For instance, the probability of the preceding bag-of-words is computed as a product of per-word probabilities under a softmax over the vocabulary conditioned on Ω^P_t:

p_pre(y_≤t | Ω^P_t) = ∏_{y_k ∈ y_≤t} softmax(W_p Ω^P_t)[y_k].

The computation of p_sub(y_≥t | Ω^F_t) is similar. By applying the BOW constraint, the PAST and FUTURE capsules can learn to reflect the target-side past and future bag-of-words information.
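Assuming each bag-of-words probability factorizes into per-word probabilities as described above, the BOW loss for a single decoding step can be sketched as follows (illustrative only; `p_pre` and `p_sub` stand for the vocabulary distributions predicted from the PAST and FUTURE capsules):

```python
import math

def bow_step_loss(p_pre, p_sub, y, t):
    """-log p_pre(y_<=t | PAST caps) - log p_sub(y_>=t | FUTURE caps).

    p_pre / p_sub: vocabulary distributions (lists indexed by word id).
    Index bounds follow the paper's y_<=t / y_>=t notation.
    """
    loss = -sum(math.log(p_pre[w]) for w in y[:t + 1])   # preceding words y_<=t
    loss -= sum(math.log(p_sub[w]) for w in y[t:])       # subsequent words y_>=t
    return loss
```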
Bilingual Content Agreement Intuitively, the translated source contents should be semantically equivalent to the translated target contents, and likewise for the untranslated contents. A natural idea is therefore to encourage the source PAST contents, modeled by the PAST capsules, to be close to the target-side PAST representation at each decoding step, and the same for the FUTURE. Hence, we introduce a Bilingual Content Agreement (BCA), which requires the bilingual semantically equivalent contents to be predictive of each other via a mean squared error (MSE) loss:

L_BCA = Σ_t [ ||Ω^P_t - mean(z_{<t})||^2 + ||Ω^F_t - mean(z_{≥t})||^2 ],

where the target-side past information is represented by the average of the decoder hidden states of all preceding words, and the average of the subsequent decoder hidden states represents the target-side future information.
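A single step of this agreement loss can be sketched as below, under the simplifying assumption that each capsule group is pooled into one vector matching the decoder-state dimension (whatever pooling or projection the full model uses is abstracted away):

```python
import numpy as np

def bca_step_loss(past_vec, future_vec, dec_states, t):
    """MSE between the pooled PAST/FUTURE capsule vectors and the mean of
    the preceding / subsequent decoder hidden states at step t."""
    d = dec_states.shape[1]
    tgt_past = dec_states[:t].mean(axis=0) if t > 0 else np.zeros(d)
    tgt_future = dec_states[t:].mean(axis=0) if t < len(dec_states) else np.zeros(d)
    return float(((past_vec - tgt_past) ** 2).mean()
                 + ((future_vec - tgt_future) ** 2).mean())
```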

Experiment
We mainly evaluated our approach on the widely used NIST Chinese-to-English (Zh-En) translation task. We also conducted translation experiments on WMT14 English-to-German (En-De) and WMT16 English-to-Romanian (En-Ro): 1. NIST Zh-En. The training data consists of 1.09 million sentence pairs extracted from LDC corpora. We used NIST MT03 as the development set (Dev), and MT04, MT05 and MT06 as the test sets.
2. WMT14 En-De. The training data consists of 4.5 million sentence pairs from WMT14 news translation task. We used newstest2013 as the development set and newstest2014 as the test set.
3. WMT16 En-Ro. The training data consists of 0.6 million sentence pairs from WMT16 news translation task. We used newstest2015 as the development set and newstest2016 as the test set.
We used the transformer base configuration (Vaswani et al., 2017) for all models. We ran the dynamic routing for r = 3 iterations. The dimension d_c of a single capsule is 256. Each of the PAST and FUTURE contents was represented by J/2 = 2 capsules. Our proposed models were trained on top of pre-trained baseline models. λ_1 and λ_2 in the training objective were set to 1. Details of the training settings are provided in the Appendix.

NIST Zh-En Translation
We list the results of our experiments on the NIST Zh-En task in Table 1 for two different architectures, i.e., Transformer and RNMT. All of our models substantially outperform the baselines in terms of the averaged BLEU score over all test sets. Among them, our best model achieves 45.65 BLEU with the Transformer architecture. We also find that the redundant capsules are helpful: discarding them leads to a -0.35 BLEU degradation (45.65 vs. 45.30).

Efficiency To examine the efficiency of the proposed approach, we also list the relative speed of both training and testing. Our approach runs at 0.67× the training speed of the Transformer baseline, but barely affects the testing speed (0.94×). Most of the extra computation in the training phase comes from the softmax operations of the BOW losses, which are not needed at test time, so the degradation of testing efficiency is moderate.
Comparison to Other Work For the RNMT architecture, we compare with two related works. Zheng et al. (2018) use extra PAST and FUTURE RNNs to recurrently capture translated and untranslated contents (PFRNN), while Kong et al. (2019) directly leverage translation adequacy as a learning reward in their proposed Adequacy-oriented Learning (AOL). Compared to them, our model also enjoys competitive improvements due to the explicit separation of source contents. In addition, PFRNN is non-trivial to adapt to Transformer, because it requires a recurrent process that prevents the Transformer decoder from being parallelized during training.

WMT En-De and En-Ro Translation
We evaluated our approach on the WMT14 En-De and WMT16 En-Ro tasks. As shown in Table 2, our reproduced Transformer baselines are close to the state-of-the-art results in previous work, which guarantees the comparability of our experiments. The results show a trend of improvements consistent with the NIST Zh-En task on the WMT14 En-De (+0.96 BLEU) and WMT16 En-Ro (+0.86 BLEU) benchmarks. We also list the results of other published research for comparison; our model outperforms the previous results in both language pairs. Note that our approach also surpasses Kong et al. (2019) on the WMT14 En-De task. These experiments demonstrate the effectiveness of our approach across different language pairs.

Analysis and Discussion
Our model learns PAST and FUTURE. We visualize the assignment probabilities in the last routing iteration (Figure 4). Interestingly, there is a clear trend that the assignment probabilities to the PAST capsules gradually rise, while those to the FUTURE capsules decrease to around zero. This phenomenon is consistent with the intuition that the translated contents should accumulate while the untranslated contents should decline (Zheng et al., 2018). Moreover, the assignment weights of a specific word change from FUTURE to PAST after it is generated. These pieces of evidence strongly verify that our GDR mechanism has indeed learned to distinguish the PAST and FUTURE contents on the source side.

Figure 4: Visualization of the assignment probabilities of iterative routing. Each sub-heatmap is associated with a target word: the left column shows the probabilities of each source word routing to the PAST capsules, and the right column to the FUTURE. Examples in the red frame indicate the changes before and after the generation of the central word. Assignment probabilities associated with the redundant capsules are omitted for simplicity. For instance, after the target word "defended" was generated, the assignment probabilities of its source translation "辩护" changed from FUTURE to PAST. Results for "Bush", "his", "revive" and "economy" are similar, with one adverse case ("plan").

Moreover, we measure how well the capsules accumulate the expected contents by comparing the BOW predictions against the ground-truth target words. We define top-5 overlap rates for predicting the preceding and subsequent words, respectively:

r^P_OL = |Top_{5t}(p_pre(Ω^P_t)) ∩ y_≤t| / |y_≤t|,   r^F_OL = |Top_{5(T-t)}(p_sub(Ω^F_t)) ∩ y_≥t| / |y_≥t|.

The PAST capsules achieve an r^P_OL of 0.72, while the FUTURE capsules achieve an r^F_OL of 0.70. These results indicate that the capsules can predict the corresponding words to a certain extent, implying that they contain the expected PAST or FUTURE information.
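The overlap rate described above can be computed as in the following sketch (illustrative; `probs` is a predicted vocabulary distribution and `gold` the set of reference word ids):

```python
import numpy as np

def topk_overlap_rate(probs, gold, k):
    """|Top_k(probs) ∩ gold| / |gold|: the fraction of gold bag-of-words
    items recovered among the k highest-probability vocabulary entries."""
    top = set(np.argsort(probs)[::-1][:k].tolist())
    gold = set(gold)
    return len(top & gold) / len(gold)
```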
Translations become better and more adequate. To validate the translation adequacy of our model, we use the Coverage Difference Ratio (CDR) proposed by Kong et al. (2019), which measures adequacy via the coverage difference between reference and translation. As shown in Table 3, our approach achieves a better CDR than the Transformer baseline, indicating superior translation adequacy. Following Zheng et al. (2018), we also conduct subjective evaluations to validate the benefit of modeling PAST and FUTURE (the last three rows of Table 3). Surprisingly, we find that the modern NMT model, i.e., Transformer, rarely produces over-translation but still suffers from under-translation. Our model obtains the highest human rating on translation quality while substantially alleviating the under-translation problem compared to Transformer.

Table 3: Evaluation of translation quality and adequacy. For the HUMAN evaluation, we asked three evaluators to score translations of 100 source sentences, randomly sampled from the test sets and produced by anonymized systems: QUALITY from 1 to 5 (higher is better), and the proportions of source words affected by OVER- and UNDER-translation, respectively.
Longer sentences benefit much more. We report the comparison across sentence lengths (Figure 5). In all length intervals, our model generates both better (Figure 5b) and longer (Figure 5a) translations. Interestingly, our approach obtains a larger improvement as the input sentences become longer, which are commonly considered hard to translate. We attribute this to fewer under-translation cases in our model, meaning that our model achieves better translation quality and adequacy, especially for long sentences.
Does guided dynamic routing really matter? Despite the promising results of GDR and the auxiliary guided losses, a straightforward question arises: would simpler models also work if they were just equipped with the guided losses to recognize the PAST and FUTURE contents? In other words, does the proposed guided dynamic routing really matter?
To answer this question, we integrate the proposed auxiliary losses into two simple baselines to guide the recognition of past and future: an MLP classifier model (CLF) that determines whether a source word is a past word or a future word (CLF is a 3-way classifier that computes probabilities p_P(x_i), p_F(x_i) and p_R(x_i), summing to 1, as past, future and redundant weights, similar to Eq. 6; the PAST and FUTURE representations are then computed by weighted summation, similar to Eq. 4); and an attention-based model (ATTN) that uses two individual attention modules to retrieve the past and future parts from the source words. As shown in Table 6, the simple baselines surprisingly do obtain improvements, emphasizing the effect of the proposed guided losses, but there remains a considerable gap between our model and them. In fact, CLF is essentially a one-iteration variant of GDR, and iterative refinement over multiple iterations proves necessary and effective. The attention mechanism, in turn, is designed for feature pooling and is not well suited for parts-to-wholes assignment. These experiments reveal that guided dynamic routing is a better choice for modeling and exploiting the dynamic PAST and FUTURE.

Related Work
The inadequate translation problem is a widely known weakness of NMT models, especially when translating long sentences (Tu et al., 2016; Kong et al., 2019). One direction to alleviate this problem is to recognize the translated and untranslated contents and pay more attention to the untranslated parts. Tu et al. (2016), Mi et al. (2016) and Li et al. (2018) employ coverage vectors or coverage ratios to indicate the lexical-level coverage of source words. Meng et al. (2018) influence the attentive vectors with translated/untranslated information. Our work mainly follows the path of Zheng et al. (2018), who introduce two extra recurrent layers in the decoder to maintain the representations of the past and future translation contents. However, it is not easy to show a direct correspondence between the source contents and the representations learned in the past/future RNN layers, nor is the approach compatible with the state-of-the-art Transformer, since the additional recurrences prevent the Transformer decoder from being parallelized.
Another direction is to introduce global representations. Lin et al. (2018) model a global source representation with deconvolution networks. Several studies, including Geng et al. (2018), propose to provide a holistic view of the target sentence via multi-pass decoding, and follow-up work improves this into a synchronous bidirectional decoding fashion. Similarly, Weng et al. (2019) deploy bidirectional decoding in an interactive translation setting. Different from these works, which aim at providing static global information throughout the translation process, our approach models a dynamically global (holistic) context by using capsule networks to separate the source contents at every decoding step.
Other efforts explore exploiting future hints. Serdyuk et al. (2018) design a Twin Regularization to encourage the hidden states of the forward decoder RNN to estimate the representations of a backward RNN. Weng et al. (2017) require the decoder states not only to generate the current word but also to predict the remaining untranslated words. Actor-critic algorithms have been employed to predict future properties (Bahdanau et al., 2017; He et al., 2017) by estimating future rewards for decision making. Kong et al. (2019) propose a policy-gradient-based adequacy-oriented approach to improve translation adequacy. These methods use future information only at the training stage, while our model can also exploit past and future information at inference, providing accessible clues about the translated and untranslated contents.
Capsule networks (Hinton et al., 2011) and their associated assignment policies, dynamic routing (Sabour et al., 2017) and EM-routing (Hinton et al., 2018), aim at addressing the limited expressive ability of parts-to-wholes assignment in computer vision. In the natural language processing community, however, capsule networks have not yet been widely investigated. Zhao et al. (2018) test capsule networks on text classification, and Gong et al. (2018) propose to aggregate a sequence of vectors via dynamic routing for sequence encoding. Capsule networks with the routing-by-agreement mechanism have also been employed in NMT for layer representation aggregation, and Wang (2019) develops a constant-time NMT model using capsule networks. These studies mainly use capsule networks for information aggregation, where the capsules may carry a less interpretable meaning. In contrast, our model learns what we expect with the aid of auxiliary learning signals, which endows it with better interpretability.

Conclusion
In this paper, we propose to recognize the translated PAST and untranslated FUTURE contents via parts-to-wholes assignment in neural machine translation. We propose Guided Dynamic Routing, a novel mechanism that explicitly separates source words into PAST and FUTURE, guided by the PRESENT target decoding status at each decoding step. We empirically demonstrate that such an explicit separation of source contents benefits neural machine translation, with considerable and consistent improvements on three language pairs. Extensive analysis shows that our approach learns to model the PAST and FUTURE as expected, and alleviates the inadequate translation problem. It would be interesting to apply our approach to other sequence-to-sequence tasks, e.g., text summarization (as discussed in the Appendix).