A Stable and Effective Learning Strategy for Trainable Greedy Decoding

Beam search is a widely used approximate search strategy for neural network decoders, and it generally outperforms simple greedy decoding on tasks like machine translation. However, this improvement comes at substantial computational cost. In this paper, we propose a flexible new method that allows us to reap nearly the full benefits of beam search with nearly no additional computational cost. The method revolves around a small neural network actor that is trained to observe and manipulate the hidden state of a previously-trained decoder. To train this actor network, we introduce the use of a pseudo-parallel corpus built using the output of beam search on a base model, ranked by a target quality metric like BLEU. Our method is inspired by earlier work on this problem, but requires no reinforcement learning, and can be trained reliably on a range of models. Experiments on three parallel corpora and three architectures show that the method yields substantial improvements in translation quality and speed over each base system.


Introduction
Neural network sequence decoders yield state-of-the-art results for many text generation tasks, including machine translation (Bahdanau et al., 2015; Luong et al., 2015; Gehring et al., 2017; Vaswani et al., 2017; Dehghani et al., 2018), text summarization (Rush et al., 2015; Ranzato et al., 2015; See et al., 2017; Paulus et al., 2017) and image captioning (Xu et al., 2015). These decoders generate tokens from left to right, at each step giving a distribution over possible next tokens, conditioned on both the input and all the tokens generated so far. However, since the space of all possible output sequences is infinite and grows exponentially with sequence length, heuristic search methods such as greedy decoding or beam search (Graves, 2012; Boulanger-Lewandowski et al., 2013) must be used at decoding time to select high-probability output sequences. Unlike greedy decoding, which selects the token of the highest probability at each step, beam search expands all possible next tokens at each step, and maintains the k most likely prefixes, where k is the beam size. Greedy decoding is very fast, requiring only a single run of the underlying decoder, while beam search requires the equivalent of k such runs, as well as substantial additional overhead for data management. However, beam search often leads to substantial improvement over greedy decoding. For example, Ranzato et al. (2015) report that beam search (with k = 10) gives a 2.2 BLEU improvement in translation and a 3.5 ROUGE-2 improvement in summarization over greedy decoding.
Various approaches have been explored recently to improve beam search by improving the method by which candidate sequences are scored (Li et al., 2016; Shu and Nakayama, 2017), the termination criterion (Huang et al., 2017), or the search function itself. In contrast, Gu et al. (2017) have tried to directly improve greedy decoding so that it optimizes an arbitrary decoding objective. They add a small actor network to the decoder and train it with a version of policy gradient to optimize sequence objectives like BLEU. However, they report that they are seriously limited by the instability of this approach to training.
In this paper, we propose a procedure to modify a trained decoder to allow it to generate text greedily with the level of quality (according to metrics like BLEU) that would otherwise require the relatively expensive use of beam search. To do so, we follow Cho (2016) and Gu et al. (2017) in our use of an actor network which manipulates the decoder's hidden state, but introduce a stable and effective procedure to train this actor. In our training procedure, the actor is trained with ordinary backpropagation on a model-specific artificial parallel corpus. This corpus is generated by running the un-augmented model on the training set with large-beam beam search, and selecting outputs from the resulting k-best list which score highly on our target metric.
Our method can be trained quickly and reliably, is effective, and can be straightforwardly employed with a variety of decoders. We demonstrate this for neural machine translation on three state-of-the-art architectures: RNN-based (Luong et al., 2015), ConvS2S (Gehring et al., 2017) and Transformer (Vaswani et al., 2017), and three corpora: IWSLT16 German-English,1 WMT15 Finnish-English,2 and WMT14 German-English.3

Background

Neural Machine Translation
In sequence-to-sequence learning, we are given a set of source-target sentence pairs and tasked with learning to generate each target sentence (as a sequence of words or word-parts) from its source sentence. We first use an encoding model such as a recurrent neural network to transform a source sequence into an encoded representation, then generate the target sequence using a neural decoder.
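Given a source sentence $x = \{x_1, ..., x_{T_s}\}$, a neural machine translation system models the distribution over possible output sentences $y = \{y_1, ..., y_T\}$ as:

$$P(y \mid x; \theta) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x; \theta), \tag{1}$$

where $\theta$ is the set of model parameters. Given a parallel corpus $D_{x,y}$ of source-target sentence pairs, the neural machine translation model can be trained by maximizing the log-likelihood:

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{\langle x, y \rangle \in D_{x,y}} \log P(y \mid x; \theta). \tag{2}$$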
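As a minimal sketch of the left-to-right factorization described above, the following Python snippet sums per-step log-probabilities to obtain the log-probability of a whole target sequence. The function name `sequence_log_prob` and the toy per-step distributions are illustrative assumptions; a real system would obtain these distributions from the decoder's softmax output.

```python
import math


def sequence_log_prob(step_distributions, target):
    """Return log P(y | x) as the sum over t of log P(y_t | y_<t, x).

    step_distributions[t] is assumed to already be conditioned on the
    source sentence and on the target prefix y_<t.
    """
    total = 0.0
    for dist, token in zip(step_distributions, target):
        total += math.log(dist[token])
    return total


# Toy distributions for a two-token target (hypothetical values).
steps = [
    {"a": 0.5, "b": 0.5},
    {"a": 0.25, "b": 0.75},
]
log_p = sequence_log_prob(steps, ["a", "b"])
prob = math.exp(log_p)  # product of the per-step probabilities: 0.5 * 0.75
```

Working in log space, as here, is the usual choice because products of many small probabilities underflow quickly for long sequences.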

Decoding
Given estimated model parameters $\hat{\theta}$, the decision rule for finding the translation with the highest probability for a source sentence $x$ is given by

$$\hat{y} = \operatorname*{argmax}_{y} P(y \mid x; \hat{\theta}). \tag{3}$$

However, since such exact inference requires the intractable enumeration of a large and potentially infinite set of candidate sequences, we resort to approximate decoding algorithms such as greedy decoding, beam search, noisy parallel approximate decoding (NPAD; Cho, 2016), or trainable greedy decoding (Gu et al., 2017).

1 https://wit3.fbk.eu/
2 http://www.statmt.org/wmt15/translation-task.html
3 http://www.statmt.org/wmt14/translation-task
Greedy Decoding. In this algorithm, we generate a single sequence from left to right, by choosing the token that is most likely at each step. The output $\hat{y} = \{\hat{y}_1, ..., \hat{y}_T\}$ can be represented as

$$\hat{y}_t = \operatorname*{argmax}_{y} P(y \mid \hat{y}_{<t}, x; \hat{\theta}). \tag{4}$$

Despite its low computational complexity of $O(|V| \times T)$, the translations selected by this method may be far from optimal under the overall distribution given by the model.
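The greedy procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the experiments: `toy_next_token_probs` is a hypothetical stand-in for a trained decoder, and its probabilities are chosen so that the greedy choice at the first step leads to a sequence that is not the most probable one overall.

```python
EOS = "</s>"


def toy_next_token_probs(prefix):
    """Hypothetical decoder: next-token distribution given a target prefix."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    # Any prefix not listed above ends the sentence with probability 1.
    return table.get(tuple(prefix), {EOS: 1.0})


def greedy_decode(next_token_probs, max_len=10):
    """Pick the single most likely token at every step."""
    prefix, prob = [], 1.0
    for _ in range(max_len):
        dist = next_token_probs(prefix)
        token = max(dist, key=dist.get)
        prob *= dist[token]
        if token == EOS:
            break
        prefix.append(token)
    return prefix, prob


tokens, prob = greedy_decode(toy_next_token_probs)
# Greedy commits to "a" (0.6) and ends with probability 0.6 * 0.55,
# although the sequence "b x" has the higher probability 0.4 * 0.9.
```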
Beam Search. Beam search decodes from left to right, and maintains k > 1 hypotheses at each step. At each step t, beam search considers all possible next tokens conditioned on the current hypotheses, and picks the k with the overall highest scores $\prod_{t'=1}^{t} P(y_{t'} \mid y_{<t'}, x; \hat{\theta})$. When all the hypotheses are complete (they end in an end-of-sentence symbol or reach a predetermined length limit), it returns the hypothesis with the highest likelihood. Tuning to find a roughly optimal beam size k can yield improvements in performance with sizes as high as 30 (Koehn and Knowles, 2017; Britz et al., 2017). However, the complexity of beam search grows linearly in beam size, with high constant terms, making it undesirable in some applications where latency is important, such as in on-device real-time translation.
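A compact sketch of beam search, again over a hypothetical toy decoder (`toy_next_token_probs` and its probabilities are assumptions made for illustration). It keeps the k highest-scoring hypotheses at each step and returns them as a k-best list sorted by probability; production implementations typically add batching and length normalization, which are omitted here.

```python
import math

EOS = "</s>"


def toy_next_token_probs(prefix):
    """Hypothetical decoder: next-token distribution given a target prefix."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return table.get(tuple(prefix), {EOS: 1.0})


def beam_search(next_token_probs, k, max_len=10):
    """Return up to k hypotheses as (tokens, probability), best first."""
    # Each hypothesis is (log_prob, tokens, finished).
    beams = [(0.0, [], False)]
    for _ in range(max_len):
        if all(done for _, _, done in beams):
            break
        candidates = []
        for log_p, tokens, done in beams:
            if done:
                # Finished hypotheses are carried over unchanged.
                candidates.append((log_p, tokens, True))
                continue
            for token, p in next_token_probs(tokens).items():
                score = log_p + math.log(p)  # assumes p > 0
                if token == EOS:
                    candidates.append((score, tokens, True))
                else:
                    candidates.append((score, tokens + [token], False))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:k]
    return [(tokens, math.exp(log_p)) for log_p, tokens, _ in beams]


k_best = beam_search(toy_next_token_probs, k=2)
one_best = beam_search(toy_next_token_probs, k=1)
```

With k = 2 the search recovers the sequence "b x", which greedy decoding (equivalent to k = 1 here) misses on this toy model.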
NPAD. Noisy parallel approximate decoding (NPAD; Cho, 2016) is a parallel decoding algorithm that can be used to improve greedy decoding or beam search. The main idea is that a better translation with a higher probability may be found by injecting unstructured random noise into the hidden state of the decoder network. Positive results with NPAD suggest that small manipulations to the decoder hidden state can correspond to substantial but still reasonable changes to the output sequence.

Trainable Greedy Decoding. Approximate decoding algorithms generally approximate the maximum-a-posteriori inference described in Equation 3. This is not necessarily the optimal basis on which to generate text, since (i) the conditional log-probability assigned by a trained NMT model does not necessarily correspond well to translation quality (Tu et al., 2017), and (ii) different application scenarios may demand different decoding objectives (Gu et al., 2017). To solve this, Gu et al. (2017) extend NPAD by replacing the unstructured noise with a small feedforward actor neural network. This network is trained using a variant of policy gradient reinforcement learning to optimize for a target quality metric like BLEU under greedy decoding, and is then used to guide greedy decoding at test time by modifying the decoder's hidden states. Despite showing gains over the equivalent actorless model, their attempt to directly optimize the quality metric makes training unstable, and makes the model nearly impossible to optimize fully. This paper offers a stable and effective alternative approach to training such an actor, and further develops the architecture of the actor network.

Methods
We propose a method for training a small actor neural network, following the trainable greedy decoding approach of Gu et al. (2017). This actor takes as input the current decoder state $h_t$, an attentional context vector $e_t$ for the source sentence, and optionally the previous hidden state $s_{t-1}$ of the actor, and produces a vector-valued action $a_t$ which is used to update the decoder hidden state. The actor function can take on a variety of forms, and we explore four: a feedforward network with one hidden layer (ff), a feedforward network with two hidden layers (ff2), a GRU recurrent network (rnn; Cho et al., 2014), and a gated feedforward network (gate).
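To make the actor's interface concrete, here is a hedged Python sketch of one plausible gated feedforward actor. The exact parameterization is an assumption made for illustration: a sigmoid gate multiplied elementwise with a tanh candidate, both computed from the concatenation of the decoder state and the context vector, with the resulting action added to the decoder state. The names `gate_actor_step`, `init_params`, `W_gate`, and `W_cand` are hypothetical.

```python
import math
import random


def affine(weights, bias, vec):
    """Return weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gate_actor_step(h, e, params):
    """Compute an action from (h, e) and add it to the decoder state h."""
    inp = h + e  # list concatenation [h; e]
    gate = [1.0 / (1.0 + math.exp(-u))
            for u in affine(params["W_gate"], params["b_gate"], inp)]
    cand = [math.tanh(u)
            for u in affine(params["W_cand"], params["b_cand"], inp)]
    action = [g * c for g, c in zip(gate, cand)]       # a_t
    h_new = [hi + ai for hi, ai in zip(h, action)]     # updated state
    return h_new, action


def init_params(dim_h, dim_e, seed=0):
    """Small random weights; the actor is the only trainable part."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(dim_h + dim_e)]
                for _ in range(dim_h)]

    return {"W_gate": matrix(), "b_gate": [0.0] * dim_h,
            "W_cand": matrix(), "b_cand": [0.0] * dim_h}


h = [0.5, -1.0, 2.0]   # toy decoder state
e = [0.1, 0.2, 0.3]    # toy attentional context vector
h_new, action = gate_actor_step(h, e, init_params(3, 3))
```

Under this assumed form each component of the action lies strictly between -1 and 1, so the actor can only nudge the decoder state rather than replace it.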
The defining equations for the ff, ff2, rnn, and gate actors are not recoverable from this copy of the text. Once the action $a_t$ has been computed, the hidden state $h_t$ is simply replaced with the updated state $\tilde{h}_t$:

$$\tilde{h}_t = h_t + a_t.$$

Figure 1 shows a single step of the actor interacting with the underlying neural decoder of each of the three NMT architectures we use: the RNN-based model of Luong et al. (2015), ConvS2S (Gehring et al., 2017), and Transformer (Vaswani et al., 2017). We add the actor at the decoder layer immediately after the computation of the attentional context vector. For the RNN-based NMT, we add the actor network only to the last decoder layer, the only place attention is used. Here, it takes as input the hidden state of the last decoder layer $h^L_t$ and the source context vector $e_t$, and outputs the action $a_t$, which is added back to the attention vector $\tilde{h}^L_t$. For ConvS2S and Transformer, we add an actor network to each decoder layer. This actor is added to the sublayer which performs multi-head or multi-step attention over the output of the encoder stack. It takes as input the decoder state $h^l_t$ and the source context vector $e^l_t$, and outputs an action $a^l_t$ which is added back to get $\tilde{h}^l_t$.

Training. To overcome the severe instability reported by Gu et al. (2017), we introduce the use of a pseudo-parallel corpus generated from the underlying NMT model (Gao and He, 2013; Auli and Gao, 2014; Kim and Rush, 2016; Freitag et al., 2017; Zhang et al., 2017) for actor training. This corpus includes pairs that both (i) have a high model likelihood, so that we can coerce the model to generate them without much additional training or many new parameters, and (ii) represent high-quality translations, measured according to a target metric like BLEU. We do this by generating sentences from the original unaugmented model with large-beam beam search and selecting the best sentence from the resulting k-best list according to the decoding objective.
More specifically, let $\langle x, y \rangle$ be a sentence pair in the training data and $Z = \{z_1, ..., z_k\}$ be the k-best list from beam search on the pretrained NMT model, where k is the beam size. We define the objective score of a translation $z$ w.r.t. the gold-standard translation $y$ according to a target metric such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), negative TER (Snover et al., 2006), or METEOR (Lavie and Denkowski, 2009) as $O(z, y)$. Then we choose the sentence $\tilde{z}$ that has the highest score to become our new target sentence:

$$\tilde{z} = \operatorname*{argmax}_{z \in Z} O(z, y).$$

Once we obtain the pseudo-corpus $D_{x,\tilde{z}} = \{\langle x, \tilde{z} \rangle\}$, we keep the underlying model fixed and train the actor by maximizing the log-likelihood of the actor parameters, written here as $\theta_a$, with these pairs:

$$\hat{\theta}_a = \operatorname*{argmax}_{\theta_a} \sum_{\langle x, \tilde{z} \rangle \in D_{x,\tilde{z}}} \log P(\tilde{z} \mid x; \hat{\theta}, \theta_a).$$

In this way, the actor network is trained to manipulate the neural decoder's hidden state at decoding time to induce it to produce better-scoring outputs under greedy or small-beam decoding.
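The corpus construction step can be sketched as follows. This is an illustration rather than the pipeline used in the experiments: `unigram_f1` is a deliberately simple stand-in for the target metric O(z, y) (a real setup would use sentence-level BLEU or another metric named above), and the function names and toy sentences are hypothetical.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Stand-in for O(z, y): unigram-overlap F1 between two token lists."""
    hyp, ref = Counter(hypothesis), Counter(reference)
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def build_pseudo_corpus(sources, references, k_best_lists, metric=unigram_f1):
    """Pair each source with the k-best candidate that scores highest."""
    corpus = []
    for x, y, candidates in zip(sources, references, k_best_lists):
        best = max(candidates, key=lambda z: metric(z, y))
        corpus.append((x, best))
    return corpus


sources = [["guten", "morgen"]]
references = [["good", "morning"]]
k_best_lists = [[["good", "day"], ["good", "morning"], ["morning"]]]
pseudo_corpus = build_pseudo_corpus(sources, references, k_best_lists)
```

The gold reference is used only to rank candidates; the selected target is always a sentence the base model itself produced, which is what keeps its likelihood under the model high.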

Setting
We evaluate our approach on IWSLT16 German-English, WMT15 Finnish-English, and WMT14 German-English translation in both directions with three strong translation model architectures. For IWSLT16, we use tst2013 and tst2014 for validation and testing, respectively. For WMT15, we use newstest2013 and newstest2015 for validation and testing, respectively. For WMT14, we use newstest2013 and newstest2014 for validation and testing, respectively. All the data are tokenized and segmented into subword symbols using byte-pair encoding (BPE; Sennrich et al., 2016) to restrict the size of the vocabulary. Our primary evaluations use tokenized and cased BLEU. For METEOR and TER evaluations, we use multeval with tokenized and case-insensitive scoring. All the underlying models are trained from scratch, except for ConvS2S WMT14 English-German translation, for which we use the trained model (as well as training data) provided by Gehring et al. (2017).

Table 2: Generation quality (BLEU↑) using the proposed trainable greedy decoder without and with beam search (k = 4). Results without beam search (tg) also appear in Table 1.

ConvS2S. We implement our model based on fairseq-py. We follow the settings in fconv_iwslt_de_en and fconv_wmt_en_de for IWSLT16 and WMT, respectively.
Transformer. We implement our model based on the code from Gu et al. (2018). We follow their hyperparameter settings for all experiments.
In the results below, we focus on the gate actor and pseudo-parallel corpora constructed by choosing the sentence with the best BLEU score from the k-best list produced by beam search with k = 35. Experiments motivating these choices are shown later in this section.

Results and Analysis
The results (Table 1) show that the use of the actor makes it practical to replace beam search with greedy decoding in most cases: We lose little or no performance, and doing so yields an increase in decoding efficiency, even accounting for the small overhead added by the actor. Among the three architectures, ConvS2S, the one with the most and largest layers, performs best. We conjecture that this gives the decoder more flexibility with which to guide decoding. In cases where model throughput is less important, our method can also be combined with beam search at test time to yield results somewhat better than either could achieve alone. Table 2 shows the results when combining our method with beam search.

Examples. Table 3 shows a few selected translations from the WMT14 German-English test set. In manual inspection of these examples and others, we find that the actor encourages models to recover missing tokens, optimize word order, and correct prepositions.

Table 3:
src: Am Vormittag wollte auch die Arbeitsgruppe Migration und Integration ihre Beratungen fortsetzen .
ref: During the morning , the Migration and Integration working group also sought to continue its discussions .
greedy: The morning also wanted to continue its discussions on migration and integration .
beam4: In the morning , the working group on migration and integration also wanted to continue its discussions .
beam35: In the morning , the migration and integration working group also wanted to continue its discussions .
tg: The morning , the Migration and Integration Working Group wanted to continue its discussions .
tg+beam4: In the morning , the Migration and Integration Working Group wanted to continue its discussions .
Likelihood. We also compare word-level likelihood for different decoding results assigned by the base model and the actor-augmented model. Word-level likelihood is computed for each sentence pair $\langle x, y \rangle$; its defining equation is not recoverable from this copy of the text. Table 4 shows the word-level likelihood averaged over the test set for IWSLT16 and WMT14 German to English translation with Transformer.
Our trainable greedy decoder learns a much more peaked distribution and assigns a much higher probability mass to its greedy decoding result than the base model. When evaluated under the base model, the translations from trainable greedy decoding have smaller likelihood than the translations from greedy decoding using the base model for both datasets. This indicates that the trainable greedy decoder is able to find a sequence that is not highly scored by the underlying model, but that corresponds to a high value of the target metric.

Magnitude of Action Vector
We also record the $L_2$ norm of the action, decoder hidden state, and attentional source context vectors on the validation set. Figure 2 shows these values over the course of training on the IWSLT16 De-En validation set with Transformer. The norm of the action starts small, increases rapidly early in training, and converges to a value well below that of the decoder hidden state. This suggests that the action adjusts the decoder's hidden state only slightly, rather than overwriting it.
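The comparison of norms can be illustrated with a short Python sketch. The vectors below are made-up values chosen for easy arithmetic, not measurements from the experiments, and `l2_norm` is a hypothetical helper.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


hidden = [3.0, 4.0]    # toy decoder hidden state, norm 5
action = [0.3, -0.4]   # toy action vector, norm 0.5
ratio = l2_norm(action) / l2_norm(hidden)
```

A ratio well below 1, as in this toy case, corresponds to an action that perturbs the hidden state rather than overwriting it.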

Effects of Model Settings
Actor Architecture. Figure 3 shows the trainable greedy decoding results on the IWSLT16 De-En validation set with different actor architectures. We observe that our approach is stable across different actor architectures and is relatively insensitive to the hyperparameters of the actor. For the same type of actor, the performance increases gradually with the hidden layer size. The use of a recurrent connection within the actor does not meaningfully improve performance, possibly since all actors can use the recurrent connections of the underlying decoder. Since the gate actor contains no additional hyperparameters and was observed to learn quickly and reliably, we use it in all other experiments.

Here, we also explore a simple alternative to the use of the actor: creating a pseudo-parallel corpus with each model, and then training each model, unmodified and in its entirety, directly on this new corpus. This experiment (cont. in Figure 3) yields results that are comparable to, but not better than, the results seen with the actors. However, this comes with substantially greater computational complexity at training time, and, if the same trained model is to be optimized for multiple target metrics, greater storage costs as well.

Beam Size. Figure 4a shows the effect of the beam size used to generate the pseudo-parallel corpus on the IWSLT16 De-En validation set with Transformer. Trainable greedy decoding improves over greedy decoding even when we set k = 1, namely, running greedy decoding on the unaugmented model to construct the new training corpus. With increased beam size k, the BLEU score consistently increases, but we observe diminishing returns beyond roughly k = 35, and we use that value elsewhere.

Training Corpus Construction
There are a variety of ways one might use the output of beam search to construct a pseudo-parallel corpus: We could use the single highest-scoring output (by BLEU, or our target metric) for each input (top1), use all 35 beam search outputs (full), use all outputs that score higher than a threshold given by the score of the base model's greedy decoding output (thd), or combine the top1 results with the gold-standard translations (comb.). We show the effect of training corpus construction in Figure 4b. para denotes the baseline approach of training the actor with the original parallel corpus used to train the underlying NMT model. Among the four novel approaches, full obtains the worst performance, since the beam search outputs contain translations that are far from the gold-standard translation. We choose the best-performing top1 strategy.
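The four construction strategies can be sketched for a single training example as below. This is a hedged illustration: `jaccard` is a simple stand-in for the target metric, `select_targets` and its argument names are hypothetical, and the toy sentences are invented.

```python
def jaccard(hypothesis, reference):
    """Stand-in metric: Jaccard overlap between the two token sets."""
    hyp, ref = set(hypothesis), set(reference)
    union = hyp | ref
    return len(hyp & ref) / len(union) if union else 0.0


def select_targets(strategy, candidates, reference, greedy_output, metric):
    """Return the pseudo-targets that one source sentence contributes."""
    scored = sorted(candidates, key=lambda z: metric(z, reference), reverse=True)
    if strategy == "top1":   # single best candidate under the metric
        return scored[:1]
    if strategy == "full":   # every beam search output
        return list(candidates)
    if strategy == "thd":    # candidates that beat the greedy output's score
        threshold = metric(greedy_output, reference)
        return [z for z in scored if metric(z, reference) > threshold]
    if strategy == "comb":   # best candidate plus the gold translation
        return scored[:1] + [reference]
    raise ValueError(f"unknown strategy: {strategy}")


reference = ["the", "cat", "sat"]
greedy_output = ["the", "dog"]
candidates = [["the", "cat", "sat"], ["the", "cat"], ["a", "dog"]]

top1 = select_targets("top1", candidates, reference, greedy_output, jaccard)
full = select_targets("full", candidates, reference, greedy_output, jaccard)
thd = select_targets("thd", candidates, reference, greedy_output, jaccard)
comb = select_targets("comb", candidates, reference, greedy_output, jaccard)
```

In this toy case full also admits the candidate "a dog", which shares no tokens with the reference, mirroring the observation that the full strategy includes translations far from the gold standard.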

Related Work
Data Distillation. Our work is directly inspired by work on knowledge distillation, which uses a similar pseudo-parallel corpus strategy, but aims at training a compact model to approximate the function learned by a larger model or an ensemble of models (Hinton et al., 2015). Kim and Rush (2016) introduce knowledge distillation in the context of NMT, and show that a smaller student network can be trained to achieve similar performance to a teacher model by learning from a pseudo-corpus generated by the teacher model. Zhang et al. (2017) propose a new strategy to generate a pseudo-corpus, namely, fast sequence-interpolation based on the greedy output of the teacher model and the parallel corpus. Freitag et al. (2017) extend knowledge distillation to an ensemble and an oracle BLEU teacher model. However, all these approaches require the expensive procedure of retraining the full student network.

Pseudo-Parallel Corpora in Statistical MT
Pseudo-parallel corpora generated from beam search have been previously used in statistical machine translation (SMT) (Chiang, 2012; Gao and He, 2013; Auli and Gao, 2014; Dakwale and Monz, 2016). Gao and He (2013) integrate a recurrent neural network language model as an additional feature into a trained phrase-based SMT system and train it by maximizing the expected BLEU on the k-best list from the underlying model. Our work revisits a similar idea in the context of trainable greedy decoding for neural MT.
Decoding for Multiple Objectives. Several works have proposed to incorporate different decoding objectives into training. Ranzato et al. (2015) and Bahdanau et al. (2016) use reinforcement learning to achieve this goal. Shen et al. (2016) and Norouzi et al. (2016) train the model by defining an objective-dependent loss function. Wiseman and Rush (2016) propose a learning algorithm tailored for beam search. Unlike these works, which optimize the entire model, another line of work introduces an additional network that predicts an arbitrary decoding objective given a source sentence and a prefix of the translation. This prediction is used as an auxiliary score in beam search. All of these methods focus primarily on improving beam search results, rather than those with greedy decoding.

Conclusion
This paper introduces a novel method, based on an automatically-generated pseudo-parallel corpus, for training an actor-augmented decoder to optimize for greedy decoding. Experiments on three models and three datasets show that the training strategy makes it possible to substantially improve the performance of an arbitrary neural sequence decoder on any reasonable translation metric in either greedy or beam-search decoding, all with only a few trained parameters and minimal additional training time. As our model is agnostic to both the model architecture and the target metric, we see the exploration of more diverse and ambitious model-target metric pairs as a clear avenue for future work.