Towards Decoding as Continuous Optimisation in Neural Machine Translation

We propose a novel decoding approach for neural machine translation (NMT) based on continuous optimisation. We reformulate decoding, a discrete optimisation problem, as a continuous problem, such that decoding can make use of efficient gradient-based optimisation techniques. Our decoding framework allows for more accurate decoding of standard neural machine translation models, as well as enabling decoding in intractable models such as the intersection of several different NMT models. Our empirical results show that our decoding framework is effective, and can lead to substantial improvements in translations, especially in situations where greedy search and beam search are not feasible. Finally, we show how the technique is highly competitive with, and complementary to, reranking.


Introduction
Sequence-to-sequence learning with neural networks (Graves, 2013; Sutskever et al., 2014; Lipton et al., 2015) is typically associated with two phases: training and decoding (a.k.a. inference). Model parameters are learned by optimising the training objective, so that the model can produce good translations when decoding unseen sentences. The majority of research has focused on the training paradigm or the network architecture; effective means of decoding, however, have been under-investigated. Conventional heuristic-based approaches for approximate inference include greedy, beam, and stochastic search. Greedy and beam search have been empirically shown to be adequate for many sequence-to-sequence tasks, and are the standard methods for NMT decoding.
However, these inference approaches have several drawbacks. Firstly, although NMT models use left-to-right generation, which would appear to facilitate efficient search, the models themselves use a recurrent architecture, and accordingly are non-Markovian. This prevents exact dynamic programming solutions and, moreover, limits the potential to incorporate additional global features or constraints, which can be highly useful in producing better and more diverse translations. Secondly, due to the sequential decoding of symbols in the target sequence, the inter-dependencies among the target symbols are not fully exploited. For example, when decoding the words of the target sentence in a left-to-right manner, the right context is not exploited, leading potentially to inferior performance (see Watanabe and Sumita (2002a), who apply this idea in traditional statistical MT). A natural way to capture this is to intersect left-to-right and right-to-left models; however, the resulting model has no natural generation order, and thus standard decoding methods are unsuitable.
We introduce a novel decoding framework (§3) that relaxes this discrete optimisation problem into a continuous optimisation problem. This is akin to the linear programming relaxation approach for approximate inference in graphical models with discrete random variables, where exact inference is NP-hard (Sontag, 2010; Belanger and McCallum, 2016). The resulting continuous optimisation problem is challenging due to the non-linearity and non-convexity of the relaxed decoding objective. We make use of stochastic gradient descent (SGD) and exponentiated gradient (EG) algorithms for decoding based on our relaxation approach; both methods have mainly been used for training in prior work. Our decoding framework is powerful and flexible, as it enables us to decode with global constraints involving the intersection of multiple NMT models (§4). We present experimental results on Chinese-English and German-English translation tasks, confirming the effectiveness of our relaxed optimisation method for decoding (§5).

Neural Machine Translation
We briefly review the attentional neural translation model proposed by Bahdanau et al. (2015) as a sequence-to-sequence neural model onto which we apply our decoding framework.
In neural machine translation (NMT), the probability of the target sentence y given a source sentence x is written as:

p_Θ(y | x) = ∏_{i=1}^{|y|} p_Θ(y_i | y_{<i}, x) = ∏_{i=1}^{|y|} f(y_i, y_{<i}, x; Θ)    (1)

where f is a non-linear function of the previously generated sequence of words y_{<i}, the source sentence x, and the model parameters Θ. In this paper, we realise f as follows:

f(y_i, y_{<i}, x; Θ) = softmax(W_o · MLP(g_i, E_T[y_{i-1}], c_i) + b_o)[y_i]    (2)

where MLP is a single hidden layer neural network with tanh activation function, and E_T[y_{i-1}] is the embedding of the target word y_{i-1} in the embedding matrix E_T ∈ R^{n_e × |V_T|} of the target language vocabulary V_T, with n_e the embedding dimension. The state g_i of the decoder RNN is a function of y_{i-1}, its previous state g_{i-1}, and the context c_i = Σ_{j=1}^{|x|} α_{ij} h_j, which summarises the parts of the source sentence that are attended to. Here h_j = [→h_j; ←h_j] concatenates the states of the left-to-right and right-to-left RNNs encoding the source sentence, α_{ij} are the attention weights, and E_S[x_j] is the embedding of the source word x_j in the embedding matrix E_S ∈ R^{n_e × |V_S|} of the source language vocabulary V_S.
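To make one step of this computation concrete, the following sketch implements a single decoder step with toy numpy parameters. All dimensions, weight matrices, the word id, and the simplified additive attention scorer `v_att` are invented stand-ins for illustration, not the trained model described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_e, n_h, V_T, src_len = 8, 16, 20, 6          # toy sizes (assumptions)
H = rng.normal(size=(src_len, n_h))            # encoder states h_j ([fwd; bwd] concat)
E_T = rng.normal(size=(n_e, V_T))              # target embedding matrix, n_e x |V_T|
g_i = rng.normal(size=n_h)                     # current decoder RNN state
y_prev = 3                                     # id of the previous target word
W_mlp = rng.normal(size=(n_h, n_h + n_e + n_h)) * 0.1   # single-hidden-layer MLP
W_o = rng.normal(size=(V_T, n_h)) * 0.1        # output projection
b_o = np.zeros(V_T)
v_att = rng.normal(size=n_h) * 0.1             # toy attention scorer (simplified)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

alpha = softmax(H @ v_att)                     # attention weights alpha_ij over source
c_i = alpha @ H                                # context c_i = sum_j alpha_ij h_j
hidden = np.tanh(W_mlp @ np.concatenate([g_i, E_T[:, y_prev], c_i]))
p_next = softmax(W_o @ hidden + b_o)           # p(y_i | y_<i, x): distribution over V_T
```

The output `p_next` is a proper distribution over the toy target vocabulary; in the real model the attention scores also condition on the decoder state.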
Given a bilingual corpus D, the model parameters are learned by maximising the conditional log-likelihood:

Θ* = argmax_Θ Σ_{(x,y) ∈ D} log p_Θ(y | x)    (3)

The model parameters Θ include the weight matrix W_o ∈ R^{|V_T| × n_h} and the bias b_o ∈ R^{|V_T|}, with n_h denoting the hidden dimension size, as well as the RNN encoder (biRNN_{θ_enc}) and decoder (RNN_{φ_dec}) parameters, the word embedding matrices, and the parameters of the attention mechanism. The model is trained end-to-end by optimising the training objective using stochastic gradient descent (SGD) or its variants. In this paper, we focus on the decoding problem, which we turn to in the next section.

Decoding as Continuous Optimisation
In decoding, we are interested in finding the highest probability translation for a given source sentence:

y* = argmax_{y ∈ Y_x} p_Θ(y | x)

where Y_x is the space of possible translations for the source sentence x. In general, searching Y_x for the highest probability translation is intractable due to the recurrent nature of eqn (1), which prevents dynamic programming for efficient search. This is problematic, as the space of translations is exponentially large in the output length |y|.
We now formulate this discrete optimisation problem as a continuous one, and then use standard algorithms for continuous optimisation for decoding. Let us assume that the maximum length of a possible translation for a source sentence is known, and denote it ℓ. The best translation for a given source sentence then solves the following optimisation problem:

ŷ = argmax_{y_1, ..., y_ℓ} Σ_{i=1}^{ℓ} log p_Θ(y_i | y_{<i}, x)    (4)

where we allow the translation to be padded to length ℓ with sentinel symbols on the right, which are ignored in computing the model probability. Equivalently, we can rewrite the above discrete optimisation problem as:

argmin_{ỹ_1, ..., ỹ_ℓ} − Σ_{i=1}^{ℓ} Σ_{w ∈ V_T} ỹ_i[w] log p_Θ(w | ỹ_{<i}, x)  s.t.  ỹ_i ∈ I_{|V_T|}    (5)

where the ỹ_i are vectors using the one-hot representation of the target words, I_{|V_T|} denoting the set of such vectors.
We now convert the optimisation problem (5) to a continuous one by dropping the integrality constraints ỹ_i ∈ I_{|V_T|} and requiring the variables to take values from the probability simplex Δ_{|V_T|}:

argmin_{ŷ_1, ..., ŷ_ℓ ∈ Δ_{|V_T|}} Q(ŷ_1, ..., ŷ_ℓ),  where  Q(ŷ_1, ..., ŷ_ℓ) = − Σ_{i=1}^{ℓ} Σ_{w ∈ V_T} ŷ_i[w] log p_Θ(w | ŷ_{<i}, x)    (6)

so each position contributes an expectation under the distribution ŷ_i. After solving the above constrained continuous optimisation problem, there is no guarantee that the resulting solution {ŷ*_i}_{i=1}^{ℓ} will comprise one-hot vectors, i.e., target language words. Instead it can find fractional solutions, which require 'rounding' in order to resolve them to lexical items. To solve this problem, we take the argmax, i.e., the highest scoring word for each position ŷ*_i (ties are broken arbitrarily). We leave exploration of more elaborate projection techniques to future work.
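As a toy illustration of the relaxation (all numbers invented): an integral solution is a set of one-hot vectors, the relaxed problem allows any point on the probability simplex, and rounding resolves each position back to a word by arg max:

```python
import numpy as np

V_T, ell = 5, 3   # toy vocabulary size and padded output length
# an integral (discrete) solution: each position is a one-hot vector
y_discrete = np.eye(V_T)[[0, 2, 1]]
# a fractional solution the continuous optimiser might return instead
y_relaxed = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],
                      [0.20, 0.20, 0.50, 0.05, 0.05],
                      [0.10, 0.60, 0.10, 0.10, 0.10]])
# both lie on the simplex: rows are non-negative and sum to one
assert np.allclose(y_relaxed.sum(axis=1), 1.0) and (y_relaxed >= 0).all()
# 'rounding': resolve each position to its highest-scoring word
words = y_relaxed.argmax(axis=1)
print(words)   # -> [0 2 1]
```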
In the context of graphical models, the above relaxation technique gives rise to linear programming for approximate inference (Sontag, 2010; Belanger and McCallum, 2016). However, our decoding problem is much harder, due to the non-linearity and non-convexity of the objective function, which operates on a high-dimensional space for deep models. We now turn our attention to optimisation algorithms that solve the decoding optimisation problem effectively.

Exponentiated Gradient (EG)
Exponentiated gradient (Kivinen and Warmuth, 1997) is an elegant algorithm for solving optimisation problems involving simplex constraints. Recall our constrained optimisation problem:

argmin_{ŷ_1, ..., ŷ_ℓ ∈ Δ_{|V_T|}} Q(ŷ_1, ..., ŷ_ℓ)

EG is an iterative algorithm, which updates each distribution ŷ_i at time-step t+1 based on the distributions of the previous time-step:

ŷ_i^{t+1}[w] = (1 / Z_i^t) · ŷ_i^t[w] · exp(−η ∇_{i,w}^t)

where η is the step size, ∇_{i,w}^t = ∂Q/∂ŷ_i[w] evaluated at {ŷ_i^t}, and Z_i^t = Σ_{w ∈ V_T} ŷ_i^t[w] exp(−η ∇_{i,w}^t) is the normalisation constant. The partial derivatives ∇_{i,w} are calculated using the back-propagation algorithm, treating {ŷ_i}_{i=1}^{ℓ} as parameters and the original parameters of the model Θ as constants. Adapting EG to our decoding problem leads to Algorithm 1. It can be shown that the EG algorithm is a gradient descent algorithm for minimising the following objective function subject to the simplex constraints:

Q(ŷ_1, ..., ŷ_ℓ) − Σ_{i=1}^{ℓ} H(ŷ_i)

where H denotes the Shannon entropy. In other words, the algorithm looks for the maximum entropy solution which also maximises the log-likelihood under the model. There are intriguing parallels with the maximum entropy formulation of log-linear models (Berger et al., 1996). In our setting, the entropy term acts as a prior which discourages overly-confident estimates in the absence of sufficient evidence.
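The multiplicative EG update can be sketched on a toy relaxed objective, with random unary and adjacent-word log-scores standing in for the NMT cost Q (all numbers and the toy scoring form are invented; the real gradients come from back-propagation through the network):

```python
import numpy as np

rng = np.random.default_rng(0)
V, L = 5, 4                        # toy vocabulary size and output length
s = rng.normal(size=(L, V))        # toy unary log-scores per position
A = rng.normal(size=(V, V))        # toy interaction scores between adjacent words

def Q(y_hat):
    """Relaxed negative score of a soft sequence: lower is better."""
    unary = -np.sum(y_hat * s)
    pair = -sum(y_hat[i - 1] @ A @ y_hat[i] for i in range(1, L))
    return unary + pair

def grad_Q(y_hat):
    """Analytic gradient of Q with respect to each relaxed position."""
    g = -s.copy()
    for i in range(1, L):
        g[i - 1] -= A @ y_hat[i]       # d/dy_{i-1} of -y_{i-1}^T A y_i
        g[i] -= A.T @ y_hat[i - 1]     # d/dy_i of the same term
    return g

# EG: multiplicative update followed by renormalisation onto the simplex
eta = 0.2                              # step size
y_hat = np.full((L, V), 1.0 / V)       # uniform initialisation
for t in range(300):
    y_hat *= np.exp(-eta * grad_Q(y_hat))
    y_hat /= y_hat.sum(axis=1, keepdims=True)

print(y_hat.argmax(axis=1))            # rounded discrete output
```

Because the update is multiplicative and renormalised, every iterate stays strictly inside the simplex, which is the appeal of EG here.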

Stochastic Gradient Descent (SGD)
To be able to apply SGD to our optimisation problem, we need to make sure that the simplex constraints are enforced. One way to achieve this is by reparameterising using the softmax transformation, i.e. ŷ_i = softmax(r_i). The resulting unconstrained optimisation problem, now over the new variables r_i ∈ R^{|V_T|}, becomes:

argmin_{r_1, ..., r_ℓ} Q(softmax(r_1), ..., softmax(r_ℓ))

To apply SGD updates, we need the gradient of the objective function with respect to the new variables r_i, which can be derived with the back-propagation algorithm based on the chain rule:

∂Q/∂r_i = (∂ŷ_i/∂r_i)^T (∂Q/∂ŷ_i),  with ŷ_i = softmax(r_i)

The resulting SGD algorithm is summarised in Algorithm 2.
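The softmax reparameterisation and its chain-rule gradient can be sketched on a toy relaxed objective (invented scores; the closed-form softmax Jacobian-vector product stands in for back-propagation):

```python
import numpy as np

rng = np.random.default_rng(1)
V, L = 5, 4                        # toy vocabulary size and output length
s = rng.normal(size=(L, V))        # toy unary log-scores
A = rng.normal(size=(V, V))        # toy adjacent-word interaction scores

def softmax(r):
    e = np.exp(r - r.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def Q(y_hat):
    return -np.sum(y_hat * s) - sum(y_hat[i - 1] @ A @ y_hat[i] for i in range(1, L))

def grad_Q(y_hat):                 # dQ/dy_hat, analytic for this toy objective
    g = -s.copy()
    for i in range(1, L):
        g[i - 1] -= A @ y_hat[i]
        g[i] -= A.T @ y_hat[i - 1]
    return g

r = np.zeros((L, V))               # softmax(r) starts uniform on the simplex
eta = 0.5
for t in range(500):
    y_hat = softmax(r)
    g = grad_Q(y_hat)
    # chain rule through the softmax: (J_softmax)^T g, where
    # J[k, j] = p_k (delta_kj - p_j), so the product is p*g - p*(p.g)
    r -= eta * (y_hat * g - y_hat * np.sum(y_hat * g, axis=1, keepdims=True))

y_hat = softmax(r)
print(y_hat.argmax(axis=1))        # rounded discrete output
```

The updates are unconstrained in r, yet every decoded ŷ_i remains a valid distribution by construction.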

Decoding in Extended NMT
Our decoding framework allows us to effectively and flexibly add additional global factors over the output symbols during inference. This enables decoding for richer global models, for which there is no effective means of greedy decoding or beam search. We outline several such models, and their corresponding relaxed objective functions for optimisation-based decoding.
Bidirectional Ensemble. Standard NMT generates the translation in a left-to-right manner, conditioning each target word on its left context. However, the joint probability of the translation can be decomposed in a myriad of different orders; one compelling alternative would be to condition each target word on its right context, i.e., generating the target sentence from right-to-left. We would not expect a right-to-left model to outperform a left-to-right one, as the left-to-right ordering reflects the natural temporal order of spoken language. However, the right-to-left model is likely to provide a complementary signal in translation, as it brings different biases and makes largely independent prediction errors to those of the left-to-right model. For this reason, we propose to use both models, and seek translations that have high probability according to both models (this mirrors work on bidirectional decoding in classical statistical machine translation by Watanabe and Sumita (2002b)). Decoding under the ensemble of these models leads to an intractable search problem, not well suited to traditional greedy or beam search algorithms, which require a fixed generation order of the target words. This ensemble decoding problem can be formulated simply in our linear relaxation approach, using the following objective function:

Q_bidir(ŷ_1, ..., ŷ_ℓ) = −α Σ_{i=1}^{ℓ} Σ_{w ∈ V_T} ŷ_i[w] log p_{Θ→}(w | ŷ_{<i}, x) − (1−α) Σ_{i=1}^{ℓ} Σ_{w ∈ V_T} ŷ_i[w] log p_{Θ←}(w | ŷ_{>i}, x)

where α is an interpolation hyper-parameter, which we set to 0.5, and Θ→ and Θ← are the pre-trained left-to-right and right-to-left models, respectively. This bidirectional agreement may also lead to improvements in translation diversity, as has been shown in a re-ranking evaluation.
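A minimal sketch of this interpolated objective, with two toy score models standing in for the left-to-right and right-to-left NMT costs (all scores invented; the right-to-left model simply scores the reversed soft sequence), optimised with the EG update:

```python
import numpy as np

rng = np.random.default_rng(2)
V, L, alpha = 5, 4, 0.5            # toy sizes; alpha = 0.5 as in the text
s_fwd, s_bwd = rng.normal(size=(2, L, V))   # unary log-scores of the two models
A_fwd, A_bwd = rng.normal(size=(2, V, V))   # their adjacent-word scores

def Q(y_hat, s, A):                # relaxed cost of one directional model
    return -np.sum(y_hat * s) - sum(y_hat[i - 1] @ A @ y_hat[i] for i in range(1, L))

def grad_Q(y_hat, s, A):
    g = -s.copy()
    for i in range(1, L):
        g[i - 1] -= A @ y_hat[i]
        g[i] -= A.T @ y_hat[i - 1]
    return g

def Q_ens(y_hat):                  # interpolated bidirectional objective
    return (alpha * Q(y_hat, s_fwd, A_fwd)
            + (1 - alpha) * Q(y_hat[::-1], s_bwd, A_bwd))

def grad_ens(y_hat):               # r2l gradient is computed on the reversed
    return (alpha * grad_Q(y_hat, s_fwd, A_fwd)   # sequence, then un-reversed
            + (1 - alpha) * grad_Q(y_hat[::-1], s_bwd, A_bwd)[::-1])

y_hat = np.full((L, V), 1.0 / V)   # uniform initialisation
for t in range(300):
    y_hat *= np.exp(-0.2 * grad_ens(y_hat))
    y_hat /= y_hat.sum(axis=1, keepdims=True)

print(y_hat.argmax(axis=1))        # rounded joint solution
```

Neither model needs a generation order here: both simply contribute gradients to the same relaxed variables.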
Bilingual Ensemble. Another source of complementary information is the translation direction, that is, forward translation from the source to the target language, and reverse translation from the target to the source. Decoding must find a translation which scores well under both the forward and reverse translation models. This is inspired by the direct and reverse feature functions commonly used in classical discriminative SMT (Och and Ney, 2002), which have been shown to offer some complementary benefits (although see Lopez and Resnik (2006)). More specifically, we decode for the best translation in the intersection of the source-to-target and target-to-source models by minimising the following objective function:

Q_biling(ŷ_1, ..., ŷ_ℓ) = −α Σ_{i=1}^{ℓ} Σ_{w ∈ V_T} ŷ_i[w] log p_{Θ_{s→t}}(w | ŷ_{<i}, x) − (1−α) Σ_{j=1}^{|x|} log p_{Θ_{s←t}}(x_j | x_{<j}, ŷ)

where α is an interpolation hyper-parameter to be fine-tuned, and Θ_{s→t} and Θ_{s←t} are the pre-trained source-to-target and target-to-source models, respectively. Decoding for the best translation under the above objective function leads to an intractable search problem, as the reverse model is global over the target language, meaning there is no obvious means of search with a greedy algorithm or similar.
Discussion. There are two important considerations for the above settings: how best to initialise the relaxed optimisation, and how best to choose the step size. As the relaxed optimisation problem is, in general, non-convex, finding a plausible initialisation is likely to be important for avoiding local optima. Furthermore, a proper step size is key to the success of the EG-based and SGD-based optimisation algorithms, and there is no obvious method for choosing its value a priori; we may also adaptively change the step size using (scheduled) annealing or line search. We return to these considerations in the experimental evaluation.

Setup
Datasets. We conducted our experiments on datasets of different scales, translating Chinese→English using the BTEC corpus, and German→English using the IWSLT 2015 TED Talks (Cettolo et al., 2014) and WMT 2016 corpora. The statistics of the datasets can be found in Table 1.
NMT Models. We implemented our continuous-optimisation based decoding method on top of the Mantidae toolkit (Cohn et al., 2016), using the dynet deep learning library (Neubig et al., 2017). All neural network models were configured with 512 input embedding and hidden layer dimensions, and 256 alignment dimensions, with 1 and 2 hidden layers in the source and target, respectively. We used an LSTM recurrent structure (Hochreiter and Schmidhuber, 1997) for both source and target RNN sequences. For the vocabulary, we used a word frequency cut-off of 5, and words rarer than this were mapped to a sentinel. For the large-scale WMT dataset, we applied the byte-pair encoding (BPE) method (Sennrich et al., 2016) to better handle unknown words. For training our neural models, we used early stopping based on development perplexity, which usually occurred after 5-8 epochs.
Evaluation Metrics. We evaluated in terms of search error, measured using the model score of the inferred solution (either continuous or discrete), as well as measuring the end translation quality with case-insensitive BLEU (Papineni et al., 2002). The continuous cost measures −(1/|ŷ|) log p_Θ(ŷ | x) under the model Θ; the discrete model score has the same formulation, albeit using the discrete rounded solution y (see §3). Note the cost can be used as a tool for selecting the best inference solution, as well as for assessing convergence, as we illustrate below.

[Figure 1: continuous cost (CCost) and discrete cost (DCost) against the number of decoding iterations.]

Results and Analysis
Initialisation and Step Size. As our relaxed optimisation problems are non-convex, local optima are likely to be a problem. We test this empirically, focusing on the effect that initialisation and step size, η, have on inference quality. For initialisation, we evaluate different strategies: uniform, in which the relaxed variables ŷ are initialised to 1/|V_T|; and greedy or beam, whereby ŷ are initialised from an already good solution produced by a baseline decoder with greedy (gdec) or beam (bdec) search. Instead of using the Viterbi outputs as a one-hot representation, we initialise to the probability prediction vectors (for EG the softmax-normalised vectors, and for SGD the pre-softmax vectors), which serves to limit the attraction of the initialisation condition, which is likely to be a local (but not global) optimum. Figure 1 illustrates the effect of initialisation on the EG algorithm, in terms of search error (left and middle) and translation quality (right), as we vary the number of iterations of inference. There is clear evidence of non-convexity: all initialisation methods can be seen to converge under all three measures; however, they arrive at markedly different solutions. Uniform initialisation is clearly not a viable approach, while greedy and beam initialisation both yield much better results. The best initialisation, beam, outperforms both greedy and beam decoding in terms of BLEU.
Note that the EG algorithm has fairly slow convergence, requiring at least 100 iterations, irrespective of the initialisation. To overcome this,
we use momentum (Qian, 1999) to accelerate convergence, modifying the term ∇_{i,w}^t in Algorithm 1 to be a weighted moving average of past gradients:

∇̃_{i,w}^t = γ ∇̃_{i,w}^{t-1} + ∇_{i,w}^t

where we set the momentum term γ = 0.9. EG with momentum (EG-MOM) converges after far fewer iterations (about 35), and results in marginally better BLEU scores. The momentum technique is usually used with SGD, which involves additive updates; it is interesting to see that it also works in EG with multiplicative updates. The step size, η, is another important hyper-parameter for gradient-based search. We tune the step size using line search over [10, 400] on the development set. Figure 1 illustrates the effect of changing the step size from 50 to 400 (compare EG and EG-400 with uniform initialisation), which results in a marked difference of about 10 BLEU points, underlining the importance of tuning this value. We found that EG with momentum had less reliance on the step size, with optimal values in [10, 50]; we use this setting hereafter.
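On the same kind of toy relaxed objective as before (invented scores standing in for the NMT gradients), the momentum variant only changes the quantity fed into the multiplicative update:

```python
import numpy as np

rng = np.random.default_rng(0)
V, L = 5, 4                        # toy vocabulary size and output length
s = rng.normal(size=(L, V))        # toy unary log-scores
A = rng.normal(size=(V, V))        # toy adjacent-word interaction scores

def grad_Q(y_hat):                 # gradient of the toy relaxed cost
    g = -s.copy()
    for i in range(1, L):
        g[i - 1] -= A @ y_hat[i]
        g[i] -= A.T @ y_hat[i - 1]
    return g

eta, gamma = 0.1, 0.9              # step size and momentum term
y_hat = np.full((L, V), 1.0 / V)
m = np.zeros_like(y_hat)           # weighted moving average of past gradients
for t in range(50):
    m = gamma * m + grad_Q(y_hat)  # momentum-smoothed gradient
    y_hat *= np.exp(-eta * m)      # EG multiplicative update, as before
    y_hat /= y_hat.sum(axis=1, keepdims=True)

print(y_hat.argmax(axis=1))        # rounded output after 50 iterations
```

The simplex constraint is untouched by the change: only the gradient estimate is smoothed.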
Continuous vs Discrete Costs. Another important question is whether the assumption behind the continuous relaxation is valid, i.e., if we optimise a continuous cost to solve a discrete problem, do we improve the discrete output? Although the continuous cost diminishes with inference iterations (Figure 1, left), and appears to converge, it is not immediately clear whether this corresponds to a better discrete output (note that the discrete cost and BLEU scores do show improvements; Figure 1, centre and right). Figure 2 illustrates the relation between the two cost measures, showing that in most cases the discrete and continuous costs are identical. The linear relaxation fails only for a handful of cases, where the nearest discrete solution is significantly worse than it would appear from the continuous cost.
EG vs SGD. Both the EG and SGD algorithms are iterative methods for solving the relaxed optimisation problem with simplex constraints. We measure empirically their difference in terms of inference quality and speed of convergence, as illustrated in Figure 3. Observe that SGD requires around 150 iterations for convergence, whereas EG requires many fewer (around 50). This concurs with previous work on learning structured prediction models with EG (Globerson et al., 2007). Further, the EG algorithm consistently produces better results in terms of both model cost and BLEU.
EG vs Reranking. Reranking is an alternative method for integrating global factors into existing NMT systems. We compare our EG decoding algorithm against a reranking approach with a bidirectional factor, where the N-best outputs of a left-to-right decoder are re-scored with the forced decoder operating in a right-to-left fashion. The results are shown in Table 2. Our EG algorithm initialised with the reranked output achieves the best BLEU score. We also compare reranking with the EG algorithm initialised with the beam decoder, where for direct comparison we filter out sentences with length greater than that of the beam output in the k-best lists. These results show that the EG algorithm is capable of effectively exploiting the search space. Beyond achieving similar or better translations than re-ranking, note that EG is simpler in implementation, as it does not require k-best lists, weight tuning, and so forth; instead, this is replaced with iterative gradient descent. The run-time of the two methods is comparable when reranking uses a modest k; however, EG can be considerably faster when k is large, as is typically done to extract the full benefit from re-ranking. This performance difference is a consequence of GPU acceleration of the dense vector operations in EG inference.
Computational Efficiency. We also quantify the computational efficiency of the proposed decoding approach. Benchmarking on a Titan X GPU for decoding BTEC zh→en, the average time per sentence is 0.02s for greedy decoding, 0.07s for beam size 5, 0.11s for beam size 10, and 3.1s for relaxed EG decoding, which uses an average of 35 EG iterations. The majority of time in the EG algorithm is spent in the forward and backward passes, taking 30% and 67% of the time, respectively. Our implementation was not optimised thoroughly, and it is likely that it could be made significantly faster, which we defer to future research.
Main Results. Table 3 shows our experimental results across all datasets, evaluating the EG algorithm and its variants. For the EG algorithm with greedy initialisation (top), we see small but consistent improvements in terms of BLEU. Beam initialisation led to overall higher BLEU scores, again demonstrating a similar pattern of improvements over the initialisation values, albeit of a lower magnitude. Next, we evaluate the capability of our inference method with extended NMT models, where approximate algorithms such as greedy or beam search are infeasible. With the bidirectional ensemble, we obtained statistically significant BLEU improvements compared to the unidirectional models, for either greedy or beam initialisation. This is interesting in the sense that the unidirectional right-to-left model always performs worse than the left-to-right model; yet our method with the bidirectional ensemble is capable of combining their strengths in a unified setting. For the bilingual ensemble, we see similar effects, with BLEU improvements in most cases, albeit of a lower magnitude than for the bidirectional ensemble. This is likely to be due to a disparity with the training condition for the models.

Related Work
Decoding (inference) for neural models is an important task; however, there has been limited research in this space, perhaps due to its challenging nature, with only a few works exploring extensions to the standard approaches. The most widely-used inference methods include sampling (Cho, 2016), greedy and beam search (Sutskever et al., 2014; Bahdanau et al., 2015, inter alia), and reranking (Birch, 2016). Cho (2016) proposed to perturb the neural model by injecting noise into the hidden transition function of the conditional recurrent language model during greedy or beam search, and to execute multiple parallel decoding runs. This strategy can improve over greedy and beam search; however, it is not clear how, when, and where noise should be injected to be beneficial. Recently, Wiseman and Rush (2016) proposed beam search optimisation while training neural models, where the model parameters are updated when the gold standard falls outside the beam. This exposes the model to its past incorrect predictions, hence making training more robust. This is orthogonal to our approach, where we focus on the decoding problem with a pre-trained model.
Reranking has also been proposed as a means of global model combination, whereby translations decoded left-to-right are re-ranked based on the scores of a right-to-left model (Birch, 2016), leading to more diverse translations. Relatedly, the beam diversity can be learned and adjusted with reinforcement learning.
Perhaps most relevant is Snelleman (2016), performed concurrently to this work, who also proposed an inference method for NMT using linear relaxation. Snelleman's method was similar to our SGD approach; however, he did not manage to outperform beam search baselines with an encoder-decoder. In contrast, we go much further, proposing the EG algorithm, which we show works much more effectively than SGD, and demonstrating how this can be applied to inference in an attentional encoder-decoder. Moreover, we demonstrate the utility of relaxed optimisation for inference over global ensembles of models, resulting in consistent improvements in search error and end translation quality.

Example translations:

BTEC zh→en
  Reference: i am sure that i called the hotel yesterday and made a reservation .
  beam dec (l2r): i 'm sure i called the hotel reservation and i made a reservation .
  beam dec (r2l): i 'm sure i made this hotel reservation and made a reservation .
  rerank +bidir.: i 'm sure i called the hotel reservation and i made a reservation .
  rerank +biling.: i 'm sure i called the hotel reservation and i made a reservation .
  EGdec: i 'm sure i called the hotel yesterday and i made a reservation .
  EGdec +bidir.: i 'm sure i called the hotel yesterday and i made a reservation .
  EGdec +biling.: i 'm sure i called the hotel yesterday and i made a reservation .

TED Talks de→en
  Source: wir sind doch alle gute bürger der sozialen medien , bei denen die währung neid ist . stimmt ' s ?
  Reference: i mean , we 're all good citizens of social media , are n't we , where the currency is envy ?
  beam dec (l2r): we 're all great UNK of social media , where the currency is envy . right ?
  beam dec (r2l): we 're all good citizens in social media , which is where that is envy . right ?
  rerank +bidir.: we 're all good citizens of social media , where the currency is envy . right ?
  rerank +biling.: we 're all good citizens of social media , where the currency is envy . right ?
  EGdec: we 're all great UNK of social media , where the currency is envy . right ?
  EGdec +bidir.: we 're all good UNK of social media , where the currency is envy . right ?
  EGdec +biling.: we 're all good citizens of social media , where the currency is envy . right ?

Recently, relaxation techniques have been applied to deep models for training and inference in text classification (Belanger and McCallum, 2016; Belanger et al., 2017), and to fully differentiable training of sequence-to-sequence models with scheduled sampling (Goyal et al., 2017). Our work applies the relaxation technique specifically to decoding in NMT models.

Conclusions
This work presents the first attempt at formulating decoding in NMT as a continuous optimisation problem. The core idea is to drop the integrality (i.e. one-hot vector) constraint on the prediction variables and allow them to have soft assignments within the probability simplex, while minimising the loss function produced by the neural model. We have provided two optimisation algorithms, exponentiated gradient (EG) and stochastic gradient descent (SGD), for solving the resulting constrained optimisation problem, and our findings show the effectiveness of EG compared to SGD. Thanks to our framework, we have been able to decode when intersecting left-to-right and right-to-left models, as well as source-to-target and target-to-source NMT models. Our results show that our decoding framework is effective and leads to substantial improvements in translations generated from the intersected models, where the typical greedy or beam search algorithms are not applicable.
This work raises several compelling possibilities which we intend to address in future work, such as improving decoding speed, integrating additional constraints such as word coverage and fertility into decoding, and applying our method to other intractable structured prediction problems.