CaLcs: Continuously Approximating Longest Common Subsequence for Sequence Level Optimization

Maximum-likelihood estimation (MLE) is one of the most widely used approaches for training structured prediction models for text-generation-based natural language processing applications. However, besides exposure bias, models trained with MLE suffer from the wrong-objective problem: they are trained to maximize word-level correct next-step prediction, but are evaluated with respect to sequence-level discrete metrics such as ROUGE and BLEU. Several variants of policy-gradient methods address some of these problems by optimizing for the final discrete evaluation metrics, and show improvements over MLE training for downstream tasks like text summarization and machine translation. However, policy-gradient methods suffer from high sample variance, making the training process very difficult and unstable. In this paper, we present an alternative direction towards mitigating this problem by introducing a new objective (CaLcs) based on a differentiable surrogate of the longest common subsequence (LCS) measure that captures sequence-level structure similarity. Experimental results on abstractive summarization and machine translation validate the effectiveness of the proposed approach.


Introduction
Recently, deep neural networks have achieved state-of-the-art results in various tasks including computer vision, natural language processing, and speech processing. Specifically, neural text generation models, the central focus of this work, have led to great progress in central downstream NLP tasks like text summarization, machine translation, and image captioning. For example, the abstractive summarization task, which was previously not the popular choice for text summarization due to the lack of appropriate text generation methods, has gained renewed attention with the success of neural sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2015). There have been several recent works with impressive progress on this task, including (Rush et al., 2015; Nallapati et al., 2016; Miao and Blunsom, 2016; See et al., 2017; Tan et al., 2017). Machine translation is another central field in NLP where the emergence of neural sequence-to-sequence models has enabled viable alternative approaches (Luong et al., 2015; Bahdanau et al., 2015; Cho et al., 2014; Sutskever et al., 2014) to challenge traditional phrase-based methods (Koehn et al., 2003).
* Work done while interning at Google Brain.
Most recent work on neural text generation is based on variants of sequence-to-sequence models with attention (Bahdanau et al., 2015) trained with maximum-likelihood estimation (MLE) with teacher forcing. As Ranzato et al. (2016) point out, these models have two major drawbacks. First, they are trained to maximize the probability of the correct next word given the entire sequence of previous ground-truth words, whereas at test time the model needs to generate the entire sequence by feeding in its own predictions from previous time steps. This discrepancy is called exposure bias and hurts performance because the model is never exposed to its own predictions during training. The second drawback, called wrong objective, is due to yet another discrepancy between training and testing. It refers to the critique (Ranzato et al., 2016) that MLE-trained models tend to have suboptimal performance because they are trained to maximize a convenient objective (i.e., maximum likelihood of word-level correct next-step prediction) rather than a desirable sequence-level objective that correlates better with the common discrete evaluation metrics, such as ROUGE (Lin and Och, 2004) for summarization, BLEU (Papineni et al., 2002) for translation, and word error rate for speech recognition. On the other hand, training models that directly optimize for such discrete metrics is hard due to the non-differentiable nature of the corresponding loss functions (Rosti et al., 2011). To address these issues, Ranzato et al. (2016) introduce an incremental learning recipe that uses a hybrid loss function combining REINFORCE (Williams, 1992) and cross-entropy. Recently, Paulus et al. (2018) also explored combining maximum-likelihood and policy-gradient training for text summarization.
Towards sequence-level optimization, previous works (Ranzato et al., 2016; Paulus et al., 2018) employ reinforcement learning (RL) with a policy-gradient approach, which works around the difficulty of differentiating the reward function by using it as a weight. However, REINFORCE is known to suffer from high sample variance and credit assignment problems, which make the training process difficult and unstable and result in models that are hard to reproduce (Henderson et al., 2018).
In this paper, we propose an alternative approach for sequence-level training with the longest common subsequence (LCS) metric, which measures the sequence-level structure similarity between two sequences. We essentially introduce a continuous approximation to the discrete LCS metric that can be directly optimized using standard gradient-based methods. Our proposed approach has the advantage of directly optimizing for a surrogate reward, as opposed to using the exact reward only as a weight as in RL-inspired works. Hence, it provides a viable alternative to policy-gradient methods for sidestepping the non-differentiability of the exact reward. In addition, it simultaneously combats the exposure bias problem by exposing the model to its own predictions while computing our approximation to the LCS metric.
To this end, we introduce a new learning recipe that incorporates the aforementioned continuous approximation to the LCS metric (CALCS) as an additional objective on top of the maximum-likelihood loss in existing neural text generation models. We evaluate the proposed approach on abstractive text summarization and machine translation tasks, using the recently introduced pointer-generator network (See et al., 2017) and transformer (Vaswani et al., 2017) as underlying baselines for summarization and machine translation, respectively. More precisely, we start from a baseline model pre-trained with cross-entropy loss, and continue training the model to optimize for the proposed differentiable objective based on CALCS. Using this recipe, we conduct various experiments on the CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016) summarization and WMT 2014 English-to-German machine translation tasks. Experimental results validate the effectiveness of the proposed approach on both tasks.

Continuously Approximating Longest Common Subsequence Metric
In this work, we explore the potential use of the longest common subsequence (LCS) metric from an algorithmic point of view to address the aforementioned wrong objective and exposure bias problems. The LCS metric measures sequence-level structure similarity between discrete sequences by identifying the longest co-occurring in-sequence n-grams, and it has been shown to correlate well with human judgments for downstream text generation tasks (Lin and Och, 2004). To this end, we propose a way to continuously approximate the LCS metric and use this differentiable approximation as the objective to train text generation models, rather than the exact LCS measure, which is hard to optimize for due to the non-differentiability of the corresponding loss function. Although such a differentiable approximation provides a unique advantage from a modeling and optimization perspective, the difficulty of controlling its tightness might be a potential drawback in terms of its applicability. In this section, we first introduce our proposed approximation to the LCS metric, and then provide a natural way to control its tightness.
Consider a sequence generation problem conditioned on an input sequence $x = (x_1, x_2, \dots, x_n)$ and let $y = (y_1, y_2, \dots, y_m)$ denote its corresponding ground-truth output sequence. Let $f(x, \Theta) = z = (z_1, z_2, \dots, z_k)$ denote the hypothesis sequence obtained by greedy decoding from a generic encoder-decoder architecture for input sequence $x$, where $\Theta$ represents the model parameters. Also, let $p_1, p_2, \dots, p_k$ be the probability distributions over the vocabulary $V$ at the decoding time steps from which $z_1, z_2, \dots, z_k$ are generated via the argmax operator, respectively.

CALCS
In this section, we define our approach to continuously approximate the longest common subsequence (LCS) measure, which is an unnormalized version of the ROUGE-L metric (Lin and Och, 2004; see Appendix B) that is commonly used for performance evaluation of text summarization models. The main intuition behind our approach is to relax the need for hard inferences when computing discrete metrics by instead comparing discrete tokens in a soft way. Towards this end, we start by defining the LCS metric.
Definition 1. Given two sequences y and z of tokens, longest common subsequence LCS(y, z) is defined as the longest sequence of tokens that appear left-to-right (but not necessarily in a contiguous block) in both sequences.
The most common and intuitive solution for computing the longest common subsequence is via dynamic programming. We briefly revisit it here, both as a reminder and for notational convenience when describing our surrogate LCS measure. Let $r_{i,j}$ denote the length of the longest common subsequence between the prefix sequences $y_{[:i]} = (y_1, y_2, \dots, y_i)$ and $z_{[:j]} = (z_1, z_2, \dots, z_j)$ of $y$ and $z$, respectively. A dynamic programming solution is given by
\[ r_{i,j} = \begin{cases} r_{i-1,j-1} + 1 & \text{if } y_i = z_j, \\ \max(r_{i-1,j},\, r_{i,j-1}) & \text{otherwise,} \end{cases} \tag{1} \]
for all $i = 1, 2, \dots, m$ and $j = 1, 2, \dots, k$, with the boundary values $r_{i,0} = r_{0,j} = 0$. It can be computed in $mk$ iterations using the formula in Eqn 1. After computing the 2D dynamic programming matrix $r$, we obtain the length of the longest common subsequence as $\mathrm{LCS}(y, z) = r_{m,k}$.
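The dynamic program above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `lcs_length` and the toy sentences are assumptions made for the example.

```python
def lcs_length(y, z):
    """Length of the longest common subsequence of sequences y and z."""
    m, k = len(y), len(z)
    # r[i][j] is the LCS length of the prefixes y[:i] and z[:j];
    # row 0 and column 0 correspond to empty prefixes and stay 0.
    r = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            if y[i - 1] == z[j - 1]:
                # Hard match: extend the best subsequence of the shorter prefixes.
                r[i][j] = r[i - 1][j - 1] + 1
            else:
                # No match: drop the last token of either prefix.
                r[i][j] = max(r[i - 1][j], r[i][j - 1])
    return r[m][k]


# Hypothetical toy example: the common subsequence is "the cat on mat".
reference = "the cat sat on the mat".split()
hypothesis = "the cat lay on a mat".split()
print(lcs_length(reference, hypothesis))
```

The table has (m + 1) × (k + 1) entries, so the cost is the mk iterations mentioned in the text.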
Towards removing the dependence on hard inference for computing LCS, we now define our approximation to the longest common subsequence, which we call CALCS. At a high level, the idea is to continuously relax the original LCS measure. To this end, we leverage the output probability distributions $p_1, p_2, \dots, p_k$ as soft predictions to refine the dynamic programming formulation for the original LCS. More precisely, we recursively define the soft longest common subsequence $s_{i,j}$ between the prefixes $y_{[:i]}$ and $z_{[:j]}$, in analogy to $r_{i,j}$, as follows:
\[ s_{i,j} = p_j^{(y_i)} \left( s_{i-1,j-1} + 1 \right) + \left( 1 - p_j^{(y_i)} \right) \max(s_{i-1,j},\, s_{i,j-1}), \]
where $p_j^{(y_i)}$ denotes the probability of generating $y_i$ at the $j$-th decoding step. Intuitively, CALCS replaces the hard token comparison with a soft one: taking $p_j^{(y_i)}$ as a continuous relaxation of the discrete comparison operator $\mathbb{1}[y_i = z_j]$, $s_{i,j}$ establishes a natural continuous approximation to $r_{i,j}$. Similar to LCS, after iteratively filling up the $s_{i,j}$ matrix, we define
\[ \mathrm{CALCS}(y, z) = s_{m,k}. \]
Although the proposed approximation is a natural way of relaxing the hard binary comparison of discrete tokens, it is not yet clear how tight the approximation is; this is established in the next section.
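As a rough sketch of the soft recursion, the code below assumes it takes the form s[i][j] = p * (s[i-1][j-1] + 1) + (1 - p) * max(s[i-1][j], s[i][j-1]), where p is the probability assigned to reference token y_i at decoding step j. The function name `calcs`, the integer token ids, and the toy distributions are hypothetical choices for illustration only.

```python
def calcs(y, probs):
    """Soft LCS between reference token ids y and per-step output distributions.

    probs[j][v] is the probability of vocabulary id v at decoding step j (0-based).
    """
    m, k = len(y), len(probs)
    # s[i][j] is the soft LCS of the reference prefix y[:i] and the first j steps.
    s = [[0.0] * (k + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            p = probs[j - 1][y[i - 1]]  # soft version of the indicator [y_i == z_j]
            s[i][j] = p * (s[i - 1][j - 1] + 1.0) + (1.0 - p) * max(s[i - 1][j], s[i][j - 1])
    return s[m][k]


# Hypothetical example with a vocabulary of 3 ids and 3 decoding steps.
y = [0, 2]
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # greedy output z = [0, 1, 2]
soft = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]     # same argmax, less peaked

print(calcs(y, one_hot))  # one-hot distributions recover the exact LCS length
print(calcs(y, soft))     # softer distributions give a value between 0 and min(m, k)
```

With one-hot distributions the probability p equals the hard indicator, so the recursion reduces to the exact LCS dynamic program.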

On the Tightness of Approximation
In this section, we first discuss the tightness of the proposed approximation, and then provide a natural way of controlling it.

Bounding the Approximation Error
We now present a bound on the approximation error of the proposed CALCS compared to the original LCS measure. Characterizing this bound enables us to argue theoretically about the feasibility of using the proposed surrogate reward function for our objective, as well as about controlling its tightness.
The LCS measure is intrinsically monotonic by definition. We start with a lemma that establishes a similar monotonicity property for CALCS.
Lemma 1 [Monotonicity]. The following two inequalities hold for all $i$ and $j$:
\[ s_{i,j} \ge s_{i-1,j}, \qquad s_{i,j} \ge s_{i,j-1}. \]
Proof. See Appendix A for the proof.
Having established a monotonicity property for CALCS, we now discuss its approximation error with respect to the original LCS measure. Let $\delta_{i,j}$ denote the approximation error of CALCS to the LCS measure, that is, the difference between $s_{i,j}$ and $r_{i,j}$, for the ground-truth prefix sequence $y_{[:i]}$ and the generated prefix $z_{[:j]}$.
denote the path of dynamic programming algorithm for LCS ending at (i, j) = (i q , j q ) cell of m × k grid. Then, Proof. We will establish the proof by investigating two cases and combining them.
CASE 1: $z_j = y_i$. In this case, we have and by 1. Using Eq. 7, we get Using the definition of δ and triangle inequality, we get where inequality 9 follows from the monotonicity established by Lemma 1. Moreover, $z_j = y_i$ implies $p_j^{(y_i)} = \max(p_j)$ because $z$ is generated by greedy decoding. Plugging this in Eq. 9 and using Eq. 8, we can immediately conclude that
CASE 2: $z_j \neq y_i$. By definition 1, we have Using this identity, we obtain Applying triangle inequality on the last equation above, we get where inequality 12 follows from again the monotonicity of s[·, ·], and inequality 11 follows from the following identity that holds true for all real numbers a, b, c, d ≥ 0 Combining 11 and 13 completes the proof for this case. Finally, two cases investigated above together establish the proof of Lemma 2.
Lemma 2 leads to the following important corollary. j 1 ), . . . , (i q , j q )} be the path of dynamic programming algorithm for LCS ending at (i, j) = (i q , j q ) cell of m × k grid. Then, Proof. Applying Lemma 2 iteratively and using δ 0,0 = 0, we get Summing (q + 1)-many inequalities above side by side and cancelling out the same terms appearing on both sides of the resulting inequality establishes the proof of corollary.

Controlling the Tightness of Approximation
Corollary 1 hints at a natural way of controlling the tightness of the CALCS approximation by exploiting the peakedness of the model's softmax output probability distributions. More precisely, the upper bound on the approximation error is represented as a sum of $1 - \max(p_j)$ terms; hence, the more peaked the model's output probability distributions are on average, the smaller the approximation error guaranteed by the established bounds. We exploit this property to control the tightness of the approximation by modifying the computation of the proposed CALCS measure. Formally, let $l_1, l_2, \dots, l_k$ denote the unnormalized logits of the model output before applying softmax to obtain the probabilities $p_1, p_2, \dots, p_k$ at the decoding time steps, respectively. Hence,
\[ p_j = \mathrm{softmax}(l_j). \]
Recall that CALCS is computed using the $p_j$'s. Using a peaked softmax, we can obtain more peaked probability distributions without causing any change in the sequence $z$ actually generated via greedy decoding. This is simply because the order of the probabilities of the vocabulary words does not change; only the probability distribution $p_j$ becomes more peaked. So, we define the peaked softmax operator with hyperparameter $\alpha$ as
\[ p_j(\alpha) = \mathrm{softmax}(l_j / \alpha). \tag{16} \]
By Corollary 1, $|\delta_{i,j}| \to 0$ as $\alpha \to 0$ for the CALCS measure computed with $p_j(\alpha)$. One can further attempt to use Corollary 1 as a guide to pinpoint a range of $\alpha$ values that forces the approximation error within certain desired limits. We use $\alpha$ as a hyperparameter in this work.
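The effect of the peakedness hyperparameter can be illustrated with a small sketch. It assumes the peaked softmax divides the logits by α before normalizing, which matches the statement that the error vanishes as α → 0; the function name `peaked_softmax` and the logit values are hypothetical.

```python
import math


def peaked_softmax(logits, alpha=1.0):
    """Softmax of logits / alpha; smaller alpha gives a more peaked distribution."""
    scaled = [l / alpha for l in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical logits at one decoding step
for alpha in (1.0, 0.5, 0.1):
    p = peaked_softmax(logits, alpha)
    greedy = p.index(max(p))
    # The greedy choice never changes, while 1 - max(p) shrinks as alpha decreases.
    print(alpha, greedy, 1.0 - max(p))
```

Because dividing by a positive α preserves the ordering of the logits, the greedy token is unchanged, and the quantity 1 − max(p_j) that appears in the bound shrinks as α decreases.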
Corollary 1 is also useful for alternative ways of controlling the tightness of approximation such as incurring penalty for high-entropy output probability distributions or simply penalizing the maximum output probability values less than a desired threshold (that explicitly controls the tightness of the approximation). We leave such options of controlling the approximation error for future work.
With the guidance of Corollary 1 and peaked softmax in Eq. 16, we conclude that CALCS establishes a promising approximation for LCS measure. In the next section, we introduce a new objective function using CALCS as a continuously differentiable reward to be directly maximized.

Sequence Level Optimization via CALCS
In this section, we describe how to leverage CALCS to define a loss function for sequence-level optimization. For notational consistency, we use $f(x, \Theta)$ to denote an encoder-decoder architecture that takes an input sequence $x$ and outputs a sequence of tokens $z = (z_1, z_2, \dots, z_m)$ via greedy decoding from the corresponding probability distributions $p_1, p_2, \dots, p_m$ at each step.
For a pair consisting of an input sequence $x$ and its corresponding ground-truth output sequence $y$, we define
\[ J_{\mathrm{CALCS}}(x, y; \Theta) = -\log \frac{\mathrm{CALCS}(y, f(x, \Theta))}{|y|} \tag{17} \]
as the loss function for a sample $(x, y)$ based on CALCS, where $|y|$ denotes the length of the sequence $y$. It is important to note here that while computing the probability distribution $p_t$ at decoding step $t$, we feed the model's own prediction $z_{t-1}$ from the previous time step to fight exposure bias.
It is important to observe here that $J_{\mathrm{CALCS}}(x, y; \Theta)$ is differentiable in $p_1, p_2, \dots, p_k$ by definition, and each $p_i$ is differentiable in the model parameters $\Theta$. Hence, $J_{\mathrm{CALCS}}(x, y; \Theta)$ is differentiable in the model parameters $\Theta$, which allows us to directly optimize the network parameters with respect to the LCS metric. The bound we established on the approximation error and our proposed strategy to control it theoretically support the feasibility of using the introduced loss function $J_{\mathrm{CALCS}}$ to optimize for the LCS metric.
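A per-example loss of this kind can be sketched as follows. The sketch assumes the loss has the form −log(CALCS(y, z) / |y|) and that the soft recursion is the probability-weighted version of the LCS dynamic program; both forms, the names `soft_lcs` and `calcs_loss`, and the `eps` guard against log(0) are assumptions for illustration. In practice the same arithmetic would be written in an automatic differentiation framework so that gradients flow back to the model parameters.

```python
import math


def soft_lcs(y, probs):
    """Soft LCS value; probs[j][v] is the probability of token id v at step j."""
    m, k = len(y), len(probs)
    s = [[0.0] * (k + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            p = probs[j - 1][y[i - 1]]
            s[i][j] = p * (s[i - 1][j - 1] + 1.0) + (1.0 - p) * max(s[i - 1][j], s[i][j - 1])
    return s[m][k]


def calcs_loss(y, probs, eps=1e-12):
    """Negative log of the length-normalized soft LCS (assumed loss form)."""
    ratio = soft_lcs(y, probs) / len(y)
    return -math.log(max(ratio, eps))  # eps is a hypothetical guard against log(0)


# Hypothetical two-token reference with a two-word vocabulary.
y = [0, 1]
confident = [[0.9, 0.1], [0.1, 0.9]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
print(calcs_loss(y, confident), calcs_loss(y, uncertain))
```

The value is built only from sums, products, a maximum, and a logarithm of the output probabilities, which is why it can be minimized with standard gradient-based methods.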

Model
In this section, we first briefly revisit the pointer-generator (See et al., 2017) and transformer (Vaswani et al., 2017) networks that are used as the underlying baselines in our experiments. Subsequently, we describe how the proposed objective function and its variants are used to train new summarization and machine translation models.

Baseline Models
Pointer-Generator Network. We use the pointer-generator network (See et al., 2017) as our baseline sequence-to-sequence model for text summarization. It is essentially a hybrid between a sequence-to-sequence model with attention (Bahdanau et al., 2015) and a pointer network that supports two decoding modes, copying and generating, via a soft switch mechanism. This enables the model to copy a word from the input sequence based on the attention distribution. At each decoding time step $t$, the decoder LSTM is fed the word embedding of the previous word, and computes a decoder state $s_t$, an attention distribution $a^t$ over the words of the input article, and a probability $P_{\mathrm{vocab}}(w)$ of generating word $w$ for the summary from the output vocabulary $V$, which is then softly combined with the copy mode's probability distribution $P_{\mathrm{copy}}(w)$ via the soft switch probability $p_{\mathrm{gen}} \in [0, 1]$ by
\[ p_t^{(w)} = p_{\mathrm{gen}} P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) P_{\mathrm{copy}}(w) \quad \text{and} \quad P_{\mathrm{copy}}(w) = \sum_{i \,:\, x_i = w} a_i^t, \]
where $a_i^t$ indicates the attention probability on the $i$-th word of the input article. The whole network is then trained end-to-end with the negative log-likelihood loss function $J_{\mathrm{PG}}(x, y; \Theta)$ for a sample article-summary pair $(x, y)$, where $\Theta$ denotes the learnable model parameters. It is important to note here that we do not use the coverage mechanism introduced by the original work (See et al., 2017) to prevent the potential repetition problem in the summaries generated by the model.
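The soft switch between generating and copying can be sketched as below. This is a minimal illustration of the mixture described above, with the copy probability of a word taken as the sum of attention weights over its occurrences in the source; the function name `final_distribution`, the dictionary representation, and all numbers are hypothetical.

```python
def final_distribution(p_gen, p_vocab, attention, source_words):
    """Mix the generation and copy distributions over an extended vocabulary.

    p_vocab maps vocabulary words to generation probabilities;
    attention[i] is the attention weight on source_words[i].
    """
    # Generation mode contributes p_gen * P_vocab(w).
    mixed = {w: p_gen * p for w, p in p_vocab.items()}
    # Copy mode contributes (1 - p_gen) times the attention mass on each source word,
    # which also covers source words that are outside the output vocabulary.
    for weight, word in zip(attention, source_words):
        mixed[word] = mixed.get(word, 0.0) + (1.0 - p_gen) * weight
    return mixed


# Hypothetical decoding step: "zebra" is out-of-vocabulary but can still be copied.
p_vocab = {"the": 0.5, "cat": 0.3, "sat": 0.2}
dist = final_distribution(0.75, p_vocab, [0.6, 0.4], ["cat", "zebra"])
print(dist)
```

When both the generation distribution and the attention weights sum to one, the mixed distribution also sums to one.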
Transformer Network. For machine translation, we use the transformer network (Vaswani et al., 2017), a recently published model that achieved state-of-the-art results on the WMT 2014 English-to-German MT task with less computational time owing to its highly parallelizable architecture. The core idea behind this model is to use stacked self-attention mechanisms along with point-wise, fully connected layers for both the encoder and decoder to represent its input and output. For the sake of brevity, we refer the reader to Vaswani et al. (2017) for further details regarding the architecture. Similar to the previously defined loss functions, let $J_{\mathrm{TF}}(x, y; \Theta)$ denote the per-example loss function of the transformer network for an input-output translation pair $(x, y)$, where $\Theta$ again denotes the learnable model parameters.

Model Variants and Training
Let $\{(x^{(l)}, y^{(l)})\}_{l=1}^{N}$ denote the set of training examples, where the $x^{(l)}$'s are input sequences and the $y^{(l)}$'s are their corresponding ground-truth output sequences. Before optimizing for the introduced objective $J_{\mathrm{CALCS}}$, we first train the corresponding baseline network by minimizing the per-example loss $J_{\{\mathrm{PG},\mathrm{TF}\}}(x^{(l)}, y^{(l)}; \Theta)$ over the training examples. Unlike $J_{\mathrm{CALCS}}$, the loss functions $J_{\{\mathrm{PG},\mathrm{TF}\}}$ for the baseline models are computed with teacher forcing, feeding the previous ground-truth word at each decoding step. We denote the baseline models by POINTGEN for the pointer-generator network and TRANSFORMER for the transformer network.
To optimize for the proposed objective $J_{\mathrm{CALCS}}$, we initialize the model parameters $\Theta$ from the pre-trained baseline network and continue training the model by minimizing a joint loss (Eq. 18) that combines $J_{\mathrm{CALCS}}$ with the baseline loss $J_{\{\mathrm{PG},\mathrm{TF}\}}$, where $\lambda$ is a hyperparameter controlling the balance between the two losses. During training with the joint loss, we compute $J_{\mathrm{CALCS}}(x, y; \Theta)$, defined in Eq. 17, by performing $|y|$-many decoding steps as a simple strategy to prevent the model from gaming the training objective by generating longer and longer hypotheses, instead of incurring an additional length penalty. We will refer to the resulting model trained with the loss function in Eq. 18 as {POINTGEN, TRANSFORMER}+CALCS depending on the baseline model.

Table 1 (partial): ROUGE scores on the summarization task.
Model                                  ROUGE-1   ROUGE-L
(See et al., 2017)                     36.44     33.42
w/ coverage (See et al., 2017)         39.53     36.38
LEAD-3 baseline (See et al., 2017)     40.34     36.57
RL (Paulus et al., 2018)               41.16     39.08
ML + RL (Paulus et al., 2018)          39…
The ** sign near the ROUGE-L results reported for our models indicates a difference in our ROUGE-L evaluation, as explained below.

Experiments
We numerically evaluate the proposed method on two sequence generation benchmarks: abstractive document summarization and machine translation. We compare the results of the proposed method against recently proposed strong baseline models: See et al. (2017) for summarization and Vaswani et al. (2017) for machine translation.

Abstractive Summarization
We use a modified version of the CNN/Daily Mail dataset (Hermann et al., 2015) that was first used for summarization by Nallapati et al. (2016). However, we follow the processing script provided by See et al. (2017) to obtain the non-anonymized version of the data, which contains 287,226 training pairs, 13,368 validation pairs, and 11,490 test pairs of news articles (781 tokens on average) and their corresponding ground-truth summaries (56 tokens on average). We refer the reader to See et al. (2017) for further details on how their version differs from that of Nallapati et al. (2016).
For training our baseline model, we use a single-layer bi-directional LSTM encoder and a single-layer LSTM decoder with hidden dimensions of 512 and 1024, respectively. We use a vocabulary of 50k words for both source and target. Following the original paper, we do not pre-train word embeddings, which are learned with the rest of the model parameters during training. We use the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 0.00001 for training. We pre-train the baseline model for 20k steps by applying greedy scheduled sampling with a fixed ground-truth feeding probability of 75%. Once the baseline model training is complete, we start optimizing for the CALCS objective as described in the previous section. We set $\lambda = 1.0$ and $\alpha = 1.0$, which are tuned on the development set.
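The scheduled-sampling choice made during pre-training can be sketched as a single decision per decoding step: with a fixed probability the decoder is fed the ground-truth previous token, and otherwise its own greedy prediction. The helper name `next_decoder_input`, its arguments, and the simulation below are hypothetical and only illustrate the 75% feeding probability.

```python
import random


def next_decoder_input(gold_prev, predicted_prev, gold_prob=0.75, rng=random):
    """Pick the token fed to the decoder at the next step.

    With probability gold_prob the ground-truth token is used (teacher forcing);
    otherwise the model's own greedy prediction is fed back.
    """
    return gold_prev if rng.random() < gold_prob else predicted_prev


# Simulate many decoding steps with a seeded generator for repeatability.
rng = random.Random(0)
choices = [next_decoder_input("gold", "pred", 0.75, rng) for _ in range(10000)]
print(choices.count("gold") / len(choices))  # fraction of ground-truth inputs, close to 0.75
```

Setting the probability to 1.0 recovers plain teacher forcing, while 0.0 corresponds to always feeding the model's own predictions.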
In Table 1, we report our main results on the summarization task. POINTGEN+SS refers to the baseline model trained with scheduled sampling. POINTGEN+SS+CALCS corresponds to our model trained with CALCS starting from the POINTGEN+SS model. Experimental results demonstrate that training with our proposed objective provides an improvement of 2.2 points in ROUGE-L score. This also provides empirical evidence that our approximate CALCS effectively captures what the original LCS metric is supposed to measure, recalling that ROUGE-L is a normalized LCS. The ROUGE-L scores of our models are lower than previously reported ones because we evaluate ROUGE-L by taking the entire summary as a single sequence instead of splitting it into sentences, which is also the way we compute the CALCS objective during model training. The main motivation behind this approach is to encourage the model to preserve the sentence order within a summary, and to evaluate its performance in the same way. We consider the capability of preserving the order across produced sentences an important attribute that a multi-sentence summarization model should have in terms of the readability and fluency of its generated summaries as a whole. When POINTGEN*+SS and POINTGEN*+SS+CALCS are evaluated by splitting the generated summaries into sentences, their corresponding ROUGE-L scores become 35.38 and 35.12, respectively. We also observe a side improvement of 1.0 point in ROUGE-1 score over the baseline, which brings performance to a level comparable with the long-standing LEAD-3 baseline score. It might also be comparable to the recently reported state-of-the-art ROUGE-1 result on the CNN/Daily Mail dataset by Paulus et al. (2018), although they used a different dataset processing pipeline, which makes a direct comparison with ours difficult.
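To make the relation between LCS and ROUGE-L concrete, the sketch below computes an LCS-based F-measure over the whole summary treated as a single token sequence, in the spirit of the evaluation described above. It uses the standard recall, precision, and F-measure form of ROUGE-L; the function names, the default `beta=1.0`, and the toy sentences are assumptions for illustration, and this is not the official ROUGE script.

```python
def lcs_length(y, z):
    """Length of the longest common subsequence of sequences y and z."""
    r = [[0] * (len(z) + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        for j in range(1, len(z) + 1):
            if y[i - 1] == z[j - 1]:
                r[i][j] = r[i - 1][j - 1] + 1
            else:
                r[i][j] = max(r[i - 1][j], r[i][j - 1])
    return r[len(y)][len(z)]


def rouge_l(reference, hypothesis, beta=1.0):
    """LCS-based F-measure with the whole summary treated as one sequence."""
    if not reference or not hypothesis:
        return 0.0
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)      # LCS normalized by the reference length
    precision = lcs / len(hypothesis)  # LCS normalized by the hypothesis length
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)


reference = "the cat sat on the mat".split()
hypothesis = "the cat lay on a mat".split()
print(rouge_l(reference, hypothesis))
```

Because the two summaries are compared as single sequences, a hypothesis that reorders its sentences shares a shorter common subsequence with the reference and scores lower.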

Machine Translation
We also evaluate our sequence-level training approach on the WMT 2014 English-to-German machine translation task, which contains 4.5M pairs of sentences.
To train our baseline transformer model, we closely follow the small model in the original transformer paper (Vaswani et al., 2017). We use a vocabulary of size 32k. Our encoder and decoder consist of $N = 6$ identical layers each. Following the notation in the original paper, we set the other parameters as $d_{\mathrm{model}} = 512$, $d_{\mathrm{ff}} = 2048$, $h = 8$, and $P_{\mathrm{drop}} = 0.1$. We set $\lambda = 0.3$ and $\alpha = 1.0$, which are tuned on the development set.
In Table 2, we show our empirical results on the machine translation task. Our first observation is that our trained baseline transformer network performs better than the one reported in the original paper (Vaswani et al., 2017) by 0.3 BLEU, which might be solely due to hyperparameter tuning. More importantly, we observe that training with our proposed CALCS objective leads to a noticeable improvement of 0.2 BLEU points over the baseline, which further reinforces our confidence in the effectiveness of our proposed sequence-level training approach and its applicability to other sequence prediction tasks. It is also interesting to note that optimizing for the LCS metric via its continuous approximation leads to improvements in evaluation with another discrete metric, BLEU. On the other hand, optimizing for the exact discrete metric BLEU via a reinforcement learning strategy may not improve evaluation performance in BLEU, as reported in prior work. As a final remark, we would like to note that our proposed approach is orthogonal to advancements in more expressive and powerful architecture designs. Hence, it has the potential to provide further improvements over recently proposed models such as the WEIGHTED TRANSFORMER (Ahmed et al., 2018).

Related Work
Text Summarization. Before the successful application of neural generative models, most of the existing works on text summarization (Dorr et al., 2003; Durrett et al., 2016) focused on extractive methods. While some of the early approaches used a rich set of heuristic rules or sparse features to select textual units to include in the summary, more recent works (Cheng and Lapata, 2016; Nallapati et al., 2017) leverage neural models to select words and sentences from the original text. With the emergence of sequence-to-sequence models (Sutskever et al., 2014) and large-scale datasets like CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016) and NYT (Paulus et al., 2018), abstractive summarization of longer text has become a more feasible and popular task. Several recent approaches have been proposed to tackle the abstractive summarization problem: Nallapati et al. (2016) exploit hierarchical encoders, See et al. (2017) propose the pointer-generator network and a coverage mechanism to overcome OOV and repetition problems, Tan et al. (2017) introduce a graph-based attention mechanism and a hierarchical beam search strategy, and Paulus et al. (2018) propose to optimize for the ROUGE metric via reinforcement learning. Although impressive progress has been achieved for sentence-level summarization, attempts at the abstractive document summarization task are still in their early stages, where the performance of the simple LEAD-3 baseline has only very recently been matched (Paulus et al., 2018).
Neural Machine Translation. With the recent success of encoder-decoder architectures (Sutskever et al., 2014; Bahdanau et al., 2015), neural machine translation systems have gained a lot of attention both from academia (Cho et al., 2014; Luong et al., 2015; Luong and Manning, 2016) and industry (Vaswani et al., 2017; Ahmed et al., 2018) over statistical machine translation, which had been the dominant translation paradigm for years. Most of these works have focused on enhancing the architecture design to tackle various challenges, for example with different attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015), a character-level decoder (Chung et al., 2016), a translation coverage mechanism (Tu et al., 2016), and so on. However, only very recently have a few works (Ranzato et al., 2016; Norouzi et al., 2016; Shen et al., 2016; Bahdanau et al., 2017; Zhukov and Kretov, 2017; Casas et al., 2018) investigated sequence-level optimization by training to maximize BLEU score.
Neural Sequence Generation with RL. Most neural sequence generation models are trained with the objective of maximizing the probability of the next correct word. However, this results in a major discrepancy between the training and test settings of these models, because they are trained with cross-entropy loss at the word level but evaluated based on sequence-level discrete metrics such as ROUGE (Lin and Och, 2004) or BLEU (Papineni et al., 2002). On the other hand, directly optimizing for such evaluation metrics is hard due to the non-differentiable nature of the exact objective (Rosti et al., 2011). Recent works (Ranzato et al., 2016; Bahdanau et al., 2017; Paulus et al., 2018) address the difficulty of differentiating with respect to rewards based on such discrete metrics using variants of reinforcement learning. These methods essentially propose to mitigate the problem by optimizing the reward-weighted log-likelihood of the hypothesis sequences generated by the model distribution. In this paper, we propose an alternative solution to this problem by introducing a differentiable approximation to the exact LCS metric that can be directly optimized by standard gradient-based methods without RL, while still addressing the exposure bias problem.

Conclusion and Future Work
In this work, we explored an alternative approach for training text generation models with sequence-level optimization to combat the wrong objective and exposure bias problems. We introduced a new objective function based on a continuous approximation of the LCS metric, which measures sequence-level structure similarity between sentences. We applied our proposed approach to the CNN/Daily Mail dataset for long document summarization and to the WMT 2014 English-to-German machine translation task. By extending the objectives of strong neural baseline models with our proposed objective, we empirically demonstrated its effectiveness on these two tasks. Our proposed approach suggests a promising alternative to policy-gradient methods for sidestepping the difficulty of differentiating with respect to the reward function while directly optimizing for sequence-level discrete metrics.