Reinforced Video Captioning with Entailment Rewards

Sequence-to-sequence models have shown promising improvements on the temporal task of video captioning, but they optimize word-level cross-entropy loss during training. First, using policy gradient and mixed-loss methods for reinforcement learning, we directly optimize sentence-level task-based metrics (as rewards), achieving significant improvements over the baseline, based on both automatic metrics and human evaluation on multiple datasets. Next, we propose a novel entailment-enhanced reward (CIDEnt) that corrects phrase-matching based metrics (such as CIDEr) to only allow for logically-implied partial matches and avoid contradictions, achieving further significant improvements over the CIDEr-reward model. Overall, our CIDEnt-reward model achieves the new state-of-the-art on the MSR-VTT dataset.


Introduction
The task of video captioning (Fig. 1) is an important next step to image captioning, with additional modeling of temporal knowledge and action sequences, and has several applications in online content search, assisting the visually-impaired, etc. Advancements in neural sequence-to-sequence learning have shown promising improvements on this task, based on encoder-decoder, attention, and hierarchical models (Venugopalan et al., 2015a; Pan et al., 2016a). However, these models are still trained using a word-level cross-entropy loss, which does not correlate well with the sentence-level metrics that the task is finally evaluated on (e.g., CIDEr, BLEU). Moreover, these models suffer from exposure bias (Ranzato et al., 2016), which occurs when a model is only exposed to the training data distribution, instead of its own predictions. First, using a sequence-level training, policy gradient approach (Ranzato et al., 2016), we allow video captioning models to directly optimize these non-differentiable metrics, as rewards in a reinforcement learning paradigm. We also address the exposure bias issue by using a mixed loss (Paulus et al., 2017; Wu et al., 2016), i.e., combining the cross-entropy and reward-based losses, which also helps maintain output fluency.

Figure 1: A correctly-predicted video caption generated by our CIDEnt-reward model.
Next, we introduce a novel entailment-corrected reward that checks for logically-directed partial matches. Current reinforcement-based text generation works use traditional phrase-matching metrics (e.g., CIDEr, BLEU) as their reward function. However, these metrics use undirected n-gram matching of the machine-generated caption with the ground-truth caption, and hence fail to capture its directed logical correctness. Therefore, they still give high scores even to generated captions that contain a single but critical wrong word (e.g., a negation, or an unrelated action or object), because all the other words still match the ground truth. We introduce CIDEnt, which penalizes the phrase-matching metric (CIDEr) based reward when the entailment score is low. This ensures that a generated caption gets a high reward only when it is a directed match with (i.e., it is logically implied by) the ground-truth caption, hence avoiding contradictory or unrelated information (e.g., see Fig. 1). Empirically, we show that first the CIDEr-reward model achieves significant improvements over the cross-entropy baseline (on multiple datasets, and on automatic and human evaluation); next, the CIDEnt-reward model achieves further significant improvements over the CIDEr-based reward. Overall, we achieve the new state-of-the-art on the MSR-VTT dataset.

Figure 2: Reinforced (mixed-loss) video captioning using entailment-corrected CIDEr score as reward.

Related Work
Past work has presented several sequence-to-sequence models for video captioning, using attention, hierarchical RNNs, 3D-CNN video features, joint embedding spaces, language fusion, etc., but using word-level cross-entropy loss training (Venugopalan et al., 2015a; Yao et al., 2015; Pan et al., 2016a,b; Venugopalan et al., 2016). Policy gradient for image captioning was recently presented by Ranzato et al. (2016), using a mixed sequence-level training paradigm to use non-differentiable evaluation metrics as rewards. Liu et al. (2016b) and Rennie et al. (2016) improve upon this using Monte Carlo roll-outs and a test inference baseline, respectively. Paulus et al. (2017) presented summarization results with ROUGE rewards, in a mixed-loss setup.

Models
Attention Baseline (Cross-Entropy) Our attention-based sequence-to-sequence baseline model is similar to the Bahdanau et al. (2015) architecture, where we encode input frame-level video features $f_{1:n}$ via a bi-directional LSTM-RNN and then generate the caption $w_{1:m}$ using an LSTM-RNN with an attention mechanism. Let $\theta$ be the model parameters and $w^*_{1:m}$ be the ground-truth caption; then the cross-entropy loss function is:

$L_{XE}(\theta) = -\sum_{t=1}^{m} \log p_\theta(w^*_t | w^*_{1:t-1}, f_{1:n})$

where $p_\theta(w^*_t | w^*_{1:t-1}, f_{1:n}) = \mathrm{softmax}(W^T h^d_t)$, $W^T$ is the projection matrix, and $w_t$ and $h^d_t$ are the generated word and the RNN decoder hidden state at time step $t$, computed using the standard RNN recursion and the attention-based context vector $c_t$. Details of the attention model are in the supplementary (due to space constraints).
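As a concrete sketch (not the paper's actual implementation), the word-level cross-entropy loss above can be computed from per-step ground-truth token probabilities; `token_probs` is a hypothetical list holding the model's probability $p_\theta(w^*_t | w^*_{1:t-1}, f_{1:n})$ for each ground-truth word:

```python
import math

def xent_loss(token_probs):
    # Word-level cross-entropy: negative log-likelihood of the
    # ground-truth caption tokens under the model's softmax outputs.
    # token_probs[t] stands in for p_theta(w*_t | w*_{1:t-1}, f_{1:n}).
    return -sum(math.log(p) for p in token_probs)
```

A model that is confident in the ground-truth words (probabilities near 1) incurs near-zero loss, while low-probability ground-truth words are penalized heavily.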

Reinforcement Learning (Policy Gradient)
In order to directly optimize the sentence-level test metrics (as opposed to the cross-entropy loss above), we use a policy gradient approach, where the decoder acts as a policy $p_\theta$ and $\theta$ represents the model parameters. Here, our baseline model acts as an agent and interacts with its environment (video and caption). At each time step, the agent generates a word (action), and the generation of the end-of-sequence token results in a reward $r$ to the agent. Our training objective is to minimize the negative expected reward function:

$L_{RL}(\theta) = -\mathbb{E}_{w^s \sim p_\theta}[r(w^s)]$

where $w^s$ is the word sequence sampled from the model. Based on the REINFORCE algorithm (Williams, 1992), the gradients of this non-differentiable, reward-based loss function are:

$\nabla_\theta L_{RL}(\theta) = -\mathbb{E}_{w^s \sim p_\theta}[r(w^s) \nabla_\theta \log p_\theta(w^s)]$

We follow Ranzato et al. (2016), approximating the above gradients via a single sampled word sequence. We also use a variance-reducing bias (baseline) estimator in the reward function. Their details and the partial derivatives using the chain rule are described in the supplementary.

Table 1: Ground-truth vs. generated (sampled) captions with their CIDEr and Ent scores.

| Ground-truth caption | Generated (sampled) caption | CIDEr | Ent |
| a man is spreading some butter in a pan | puppies is melting butter on the pan | 140.5 | 0.07 |
| a panda is eating some bamboo | a panda is eating some fried | 256.8 | 0.14 |
| a monkey pulls a dogs tail | a monkey pulls a woman | 116.4 | 0.04 |
| a man is cutting the meat | a man is cutting meat into potato | 114.3 | 0.08 |
| the dog is jumping in the snow | a dog is jumping in cucumbers | 126.2 | 0.03 |
| a man and a woman is swimming in the pool | a man and a whale are swimming in a pool | 192.5 | 0.02 |
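The single-sample REINFORCE estimate with a variance-reducing baseline can be sketched as follows (function and argument names are illustrative; the actual estimator details are in the paper's supplementary):

```python
def reinforce_surrogate(sample_logprob_sum, reward, baseline):
    # Surrogate loss -(r - b) * log p_theta(w^s): differentiating it
    # w.r.t. theta yields the single-sample REINFORCE gradient estimate
    # -(r - b) * grad log p_theta(w^s), where b is the baseline.
    return -(reward - baseline) * sample_logprob_sum
```

Subtracting the baseline b leaves the gradient estimate unbiased (since the expected score function is zero) while reducing its variance.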
Mixed Loss During reinforcement learning, optimizing only for the reinforcement loss (with automatic metrics as rewards) does not ensure the readability and fluency of the generated caption, and there is also a chance of gaming the metrics without actually improving the quality of the output (Liu et al., 2016a). Hence, for training our reinforcement-based policy gradients, we use a mixed loss function, which is a weighted combination of the cross-entropy loss (XE) and the reinforcement learning loss (RL), similar to previous work (Paulus et al., 2017; Wu et al., 2016). This mixed loss improves results on the metric used as reward through the reinforcement loss (and improves relevance via our entailment-enhanced rewards), but also ensures better readability and fluency due to the cross-entropy loss (in which the training objective is a conditional language model, learning to produce fluent captions).
Our mixed loss is defined as:

$L_{Mixed} = \gamma L_{RL} + (1 - \gamma) L_{XE}$

where $\gamma$ is a tuning parameter used to balance the two losses. For annealing and faster convergence, we start with the optimized cross-entropy loss baseline model, and then move to optimizing the above mixed loss function. 2
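The mixed objective is then a one-line combination (a sketch, with γ the tuning parameter described above):

```python
def mixed_loss(loss_xe, loss_rl, gamma):
    # L_Mixed = gamma * L_RL + (1 - gamma) * L_XE.
    # gamma = 0 recovers pure cross-entropy training;
    # gamma = 1 recovers pure reinforcement training.
    return gamma * loss_rl + (1.0 - gamma) * loss_xe
```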

Reward Functions
Caption Metric Reward Previous image captioning papers have used traditional captioning metrics such as CIDEr, BLEU, or METEOR as reward functions, based on the match between the generated caption sample and the ground-truth reference(s). First, it has been shown by Vedantam et al. (2015) that CIDEr, based on a consensus measure across several human reference captions, has a higher correlation with human evaluation than other metrics such as METEOR, ROUGE, and BLEU. They further showed that CIDEr gets better with a larger number of human references (and this is a good fit for our video captioning datasets, which have 20-40 human references per video).

2 We also experimented with the curriculum learning 'MIXER' strategy of Ranzato et al. (2016), where the XE+RL annealing is based on the decoder time-steps; however, the mixed loss function strategy (described above) performed better in terms of maintaining output caption fluency.
More recently, Rennie et al. (2016) further showed that CIDEr as a reward in image captioning outperforms all other metrics used as rewards, not just in terms of improvements on the CIDEr metric, but on all other metrics as well. In line with these previous works, we also found that CIDEr as a reward ('CIDEr-RL' model) achieves the best metric improvements in our video captioning task, and also has the best human evaluation improvements (see Sec. 6.3 for result details, including those about other rewards based on BLEU and SPICE).

Entailment Corrected Reward
Although CIDEr performs better than other metrics as a reward, all these metrics (including CIDEr) are still based on an undirected n-gram matching score between the generated and ground-truth captions. For example, the wrong caption "a man is playing football" w.r.t. the correct caption "a man is playing basketball" still gets a high score, even though these two captions describe two completely different events. Similar issues hold in the case of a negation or a wrong action/object in the generated caption (see examples in Table 1).
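A toy unigram-overlap score (a deliberately simplified stand-in for the real n-gram metrics, which are more involved) makes the failure concrete:

```python
def unigram_f1(candidate, reference):
    # Toy undirected unigram-overlap F1, illustrating how phrase-matching
    # metrics (BLEU, CIDEr, etc.) score captions symmetrically, with no
    # notion of directed logical correctness. Illustration only.
    c, r = candidate.split(), reference.split()
    overlap = sum(min(c.count(w), r.count(w)) for w in set(c))
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)
```

Here the wrong caption "a man is playing football" scores 0.8 against the reference "a man is playing basketball", despite describing a different event, because four of its five words match.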
We address the above issue by using an entailment score to correct the phrase-matching metric (CIDEr or others) when used as a reward, ensuring that the generated caption is logically implied by (i.e., is a paraphrase or directed partial match with) the ground-truth caption. To achieve an accurate entailment score, we adapt the state-of-the-art decomposable-attention model of Parikh et al. (2016), trained on the SNLI corpus (image caption domain). This model gives us a probability for whether the sampled video caption (generated by our model) is entailed by the ground-truth caption as premise (as opposed to a contradiction or neutral case). Similar to the traditional metrics, the overall 'Ent' score is the maximum over the entailment scores for a generated caption w.r.t. each reference human caption (around 20/40 per MSR-VTT/YouTube2Text video). CIDEnt is defined as:

CIDEnt = CIDEr − λ, if Ent < β; CIDEnt = CIDEr, otherwise

which means that if the entailment score is very low, we penalize the metric reward score by decreasing it by a penalty λ. This agreement-based formulation ensures that we only trust the CIDEr-based reward in cases when the entailment score is also high. Using CIDEr − λ also ensures the smoothness of the reward w.r.t. the original CIDEr function (as opposed to clipping the reward to a constant). Here, λ and β are hyperparameters that can be tuned on the dev-set; on light tuning, we found the best values to be intuitive: λ = roughly the baseline (cross-entropy) model's score on that metric (e.g., 0.45 for CIDEr on the MSR-VTT dataset); and β = 0.33 (i.e., the 3-class entailment classifier chose the contradiction or neutral label for this pair). Table 1 shows some examples of captions sampled during our model training, where CIDEr was misleadingly high for incorrect captions, but the low entailment score (probability) helps us successfully identify these cases and penalize the reward.
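The CIDEnt reward described above can be sketched directly in code (the λ and β defaults below are the tuned MSR-VTT values mentioned in the text; the entailment probabilities would come from the decomposable-attention classifier):

```python
def ent_score(entail_probs):
    # Overall Ent score: maximum entailment probability of the generated
    # caption over all reference captions for the video.
    return max(entail_probs)

def cident(cider, ent, lam=0.45, beta=0.33):
    # Penalize the CIDEr reward by lambda whenever the entailment
    # probability falls below the threshold beta; otherwise trust CIDEr.
    return cider - lam if ent < beta else cider
```

The subtraction (rather than clipping to a constant) keeps the penalized reward smooth w.r.t. the original CIDEr value.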

Experimental Setup
Datasets We use 2 datasets: MSR-VTT (Xu et al., 2016) has 10,000 videos with 20 references/video; and YouTube2Text/MSVD (Chen and Dolan, 2011) has 1970 videos with 40 references/video. Standard splits and other details are in the supplementary.
Automatic Evaluation We use several standard automated evaluation metrics: METEOR, BLEU-4, CIDEr-D, and ROUGE-L (from the MS-COCO evaluation server (Chen et al., 2015)).
Human Evaluation We also present human evaluation for comparison of the baseline-XE, CIDEr-RL, and CIDEnt-RL models, especially because the automatic metrics cannot be solely trusted. Relevance measures how related the generated caption is w.r.t. the video content, whereas coherence measures the readability of the generated caption.
Training Details All the hyperparameters are tuned on the validation set. All our results (including baseline) are based on a 5-avg-ensemble. See supplementary for extra training details, e.g., about the optimizer, learning rate, RNN size, Mixed-loss, and CIDEnt hyperparameters.
Results

Primary Results
Table 2 shows our primary results on the popular MSR-VTT dataset. First, our baseline attention model trained on cross-entropy loss ('Baseline-XE') achieves strong results w.r.t. the previous state-of-the-art methods. 4 Next, our policy gradient based mixed-loss RL model with CIDEr as reward ('CIDEr-RL') improves significantly 5 over the baseline on all metrics, and not just the CIDEr metric. It also achieves statistically significant improvements in terms of human relevance evaluation (see below). Finally, the last row in Table 2 shows results for our novel CIDEnt-reward RL model ('CIDEnt-RL'). This model achieves statistically significant 6 improvements on top of the strong CIDEr-RL model, on all automatic metrics (as well as human evaluation). Note that in Table 2, we also report the CIDEnt reward scores, and the CIDEnt-RL model strongly outperforms the CIDEr and baseline models on this entailment-corrected measure. Overall, we are also the new Rank 1 on the MSR-VTT leaderboard, based on their ranking criteria.

Human Evaluation
We also perform small human evaluation studies (250 samples from the MSR-VTT test set output) to compare our 3 models pairwise. 7 As shown in Table 3 and Table 4, in terms of relevance, our CIDEr-RL model first statistically significantly outperforms the baseline XE model (p < 0.02); next, our CIDEnt-RL model significantly outperforms the CIDEr-RL model (p < 0.03). The models are statistically equal on coherence in both comparisons.

4 We list previous works' results as reported by the MSR-VTT dataset paper itself, as well as their 3 leaderboard winners (http://ms-multimedia-challenge.com/leaderboard), plus the 10-ensemble video+entailment generation multi-task model of Pasunuru and Bansal (2017).

5 Statistical significance of p < 0.01 for CIDEr, METEOR, and ROUGE, and p < 0.05 for BLEU, based on the bootstrap test (Noreen, 1989; Efron and Tibshirani, 1994).

6 Statistical significance of p < 0.01 for CIDEr, BLEU, ROUGE, and CIDEnt, and p < 0.05 for METEOR.

7 We randomly shuffle pairs to anonymize model identity, and the human evaluator then chooses the better caption based on relevance and coherence (see Sec. 5). 'Not Distinguishable' are cases where the annotator found both captions to be equally good or equally bad.

Other Datasets
We also tried our CIDEr and CIDEnt reward models on the YouTube2Text dataset. In Table 5, we first see strong improvements from our CIDEr-RL model on top of the cross-entropy baseline. Next, the CIDEnt-RL model also shows some improvements over the CIDEr-RL model, e.g., on BLEU and the new entailment-corrected CIDEnt score. It also achieves significant improvements on human relevance evaluation (250 samples).

Other Metrics as Reward
As discussed in Sec. 4, CIDEr is the most promising metric to use as a reward for captioning, based on both previous works' findings as well as ours. We did investigate the use of other metrics as the reward. When using BLEU as a reward (on MSR-VTT), we found that this BLEU-RL model achieves BLEU-metric improvements, but was worse than the cross-entropy baseline on human evaluation. Similarly, a BLEUEnt-RL model achieves BLEU and BLEUEnt metric improvements, but is again worse on human evaluation. We also experimented with the new SPICE metric (Anderson et al., 2016) as a reward, but this produced long repetitive phrases (as also discussed in Liu et al. (2016b)). Fig. 1 shows an example where our CIDEnt-reward model correctly generates a ground-truth style caption, whereas the CIDEr-reward model produces a non-entailed caption, because that caption still gets a high phrase-matching score. Several more such examples are in the supplementary.

Conclusion
We first presented a mixed-loss policy gradient approach for video captioning, allowing for metric-based optimization. We next presented an entailment-corrected CIDEnt reward that further improves results, achieving the new state-of-the-art on MSR-VTT. In future work, we are applying our entailment-corrected rewards to other directed generation tasks such as image captioning and document summarization (using the new multi-domain NLI corpus (Williams et al., 2017)).