Improving Reinforcement Learning Based Image Captioning with Natural Language Prior

Recently, Reinforcement Learning (RL) approaches have demonstrated advanced performance in image captioning by directly optimizing the metric used for testing. However, this shaped reward introduces learning biases, which reduce the readability of generated text. In addition, the large sample space makes training unstable and slow. To alleviate these issues, we propose a simple, coherent solution that constrains the action space using an n-gram language prior. Quantitative and qualitative evaluations on benchmarks show that RL with this simple add-on module performs favorably against its counterpart in terms of both readability and speed of convergence. Human evaluation results show that captions from our model are more readable and graceful. The implementation will become publicly available upon the acceptance of the paper.


Introduction
Image captioning (Farhadi et al., 2010; Kulkarni et al., 2011; Yao et al., 2017; Lu et al., 2016; Dai et al., 2017) aims at generating natural language descriptions of images. Advanced by recent developments in deep learning, many captioning models rely on an encoder-decoder paradigm (Vinyals et al., 2015), where the input image is encoded into hidden representations using a Convolutional Neural Network (CNN), followed by a Recurrent Neural Network (RNN) decoder that generates a word sequence as the caption. Further, the decoder RNN can be equipped with spatial attention mechanisms (Xu et al., 2015) to incorporate precise visual contexts, which often yields performance improvements empirically.
Although the encoder-decoder framework can be effectively trained with maximum likelihood estimation (MLE) (Salakhutdinov, 2010), recent research (Ranzato et al., 2015) has pointed out that MLE-based approaches suffer from the so-called exposure bias problem. To address this problem, Ranzato et al. (2015) proposed a Reinforcement Learning (RL) based training framework. The method, developed on top of the REINFORCE algorithm (Williams, 1992), directly optimizes the non-differentiable test metric (e.g. BLEU (Papineni et al., 2002), CIDEr, METEOR (Banerjee and Lavie, 2005), etc.) and achieves promising improvements. However, learning with RL is a notoriously difficult task due to the high variance of gradient estimation. Actor-critic methods (Sutton and Barto, 1998) are often adopted, which involve training an additional value network to predict the expected reward. Alternatively, Rennie et al. (2017) designed a self-critical method that uses the output of its own test-time inference algorithm as the baseline to normalize the rewards, which leads to further performance gains.
1 https://github.com/tgGuo15/PriorImageCaption
Besides the high-variance problem, we notice two other drawbacks of RL-based captioning methods that are often overlooked in the literature. First, while these methods can directly optimize the non-differentiable rewards and achieve high test scores, the generated captions contain many repeated trivial patterns, especially at the end of the sequence. Table 1 shows examples of bad endings generated by a self-critical RL algorithm (see Section 4 for model details). Specifically, 46.44% of generated captions end with phrases such as "with a", "on a", "of a", etc. (for detailed statistics see Appendix A) on the MSCOCO (Chen et al., 2015) validation set with the standard data split of Karpathy and Li (2015). The reason is that the shaped reward function biases the learning. As Figure 1 shows, these additive patterns at the end of captions yield a higher reward although they make no sense to humans. Empirically, removing these endings results in a huge performance drop of around 6%. Paulus et al. (2017) have also reported that, in abstractive summarization, using RL alone achieves a high ROUGE (Lin, 2004) score, yet the human readability is very poor.
The second drawback is that RL-based text generation is sample-inefficient due to the large action space. Specifically, the search space is of size $O(|V|^T)$, where $V$ is the set of words, $T$ is the sentence length, and $|\cdot|$ denotes the cardinality of a set. This often makes training unstable and slow to converge.
In this work, to tackle these two issues, we propose a simple yet effective solution that introduces coherent language constraints on local action selection in RL. Specifically, we first obtain a word-level n-gram (Kneser and Ney, 1995) model from the training set and then use it as an effective prior. During the action sampling step in RL, we reduce the search space of actions based on the previous word context and our n-gram model. To further promote samples with high rewards, we sample multiple sentences during training and update the policy based on the best-rewarded one. These simple treatments prevent the appearance of bad endings and expedite convergence while maintaining performance comparable to the pure RL counterpart. In addition, the proposed framework is generic and can be applied to many different kinds of neural structures and applications.

Model Architecture
Encoder-Decoder Model: We adopt a structure similar to GNIC (Vinyals et al., 2015), which first encodes an image $I$ into a dense vector $h_I$ with a CNN. The vector $h_I$ is then fed as the input to an LSTM-based (Hochreiter and Schmidhuber, 1997) language model decoder. At each step $t$, the LSTM receives the previous output $w_{t-1}$ as the input, computes the hidden state $h_t$, and predicts the next word $w_t$ as in Equation (1), where $w_0 = h_I$ and $h_0$ and $c_0$ are initialized to zero. Generation ends when the special token *end* is predicted.
Attention Model: Instead of using a static representation of the image, the attention mechanism dynamically reweights the spatial features from the CNN to focus on a different region of the image at each word generation step. We specifically consider the standard architecture used in Xu et al. (2015), where $A = \{a_1, a_2, \ldots, a_L\}$ is the spatial feature set and each $a_i \in \mathbb{R}^D$ corresponds to features extracted at a different image location. The hidden state of the LSTM is then computed as in Equation (2), where $f_{att}$ is an attention model, for which we use a single fully connected layer conditioned on the previous hidden state. Once $h_t$ is obtained, word generation is the same as in Equation (1).
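The two decoder updates described above can be written out as follows. This is only a sketch consistent with the surrounding definitions, not a transcription of Equations (1) and (2): the output projection $W_o$, the attention weights $\beta_{t,i}$, the context vector $z_t$, and the concatenation $[\,\cdot\,;\,\cdot\,]$ are assumptions introduced for illustration, following the usual soft-attention form of Xu et al. (2015).

```latex
% Plain encoder-decoder step (assumed form)
h_t, c_t = \mathrm{LSTM}(w_{t-1}, h_{t-1}, c_{t-1}), \qquad
w_t \sim \mathrm{softmax}(W_o h_t)

% Attention step (assumed form): weight the L spatial features, then update the LSTM
\beta_{t,i} = \frac{\exp\big(f_{att}(a_i, h_{t-1})\big)}{\sum_{j=1}^{L} \exp\big(f_{att}(a_j, h_{t-1})\big)}, \qquad
z_t = \sum_{i=1}^{L} \beta_{t,i}\, a_i

h_t, c_t = \mathrm{LSTM}([w_{t-1}; z_t], h_{t-1}, c_{t-1})
```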
Sequence Generation with RL: We follow the training procedure of Rennie et al. (2017). The decoder LSTM can be viewed as a "policy" denoted by $p_\theta$, where $\theta$ is the set of parameters of the network. At each time step $t$, the policy chooses an action by generating a word $w_t$ and obtains a new "state" (i.e. hidden states of the LSTM, attention weights, etc.). Once the end token is generated, a "reward" $r$ is given based on the score (e.g. CIDEr or BLEU) of the predicted sentence. The goal is to maximize the expected reward $\mathbb{E}_{w^s \sim p_\theta}[r(w^s)]$, where $w^s = \{w^s_1, w^s_2, \ldots, w^s_T\}$ are the words sampled at every time step. The REINFORCE algorithm (Williams, 1992) provides an unbiased estimate of the gradient with respect to $\theta$ using a single sequence:
$$\nabla_\theta \mathbb{E}_{w^s \sim p_\theta}[r(w^s)] \approx r(w^s)\, \nabla_\theta \log p_\theta(w^s).$$
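Variance Reduction with Self-Critical: We reduce the variance of the gradient estimator by using the self-critical approach:
$$\nabla_\theta \mathbb{E}_{w^s \sim p_\theta}[r(w^s)] \approx \big(r(w^s) - r(\hat{w})\big)\, \nabla_\theta \log p_\theta(w^s),$$
where $r(\hat{w})$ is the baseline reward obtained by the current model under the inference algorithm used at test time, defined as
$$\hat{w}_t = \arg\max_{w_t} p_\theta(w_t \mid h_t).$$
Sequences whose rewards are higher than $r(\hat{w})$ are then increased in probability, while samples that result in a lower reward are suppressed.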
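As a minimal sketch of the self-critical update described above, the snippet below computes a scalar surrogate loss whose gradient with respect to the log-probabilities matches the estimator, up to sign. The function name and its arguments are hypothetical. In a real training loop the log-probabilities would come from an automatic-differentiation framework and the advantage would be treated as a constant.

```python
import math

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """Surrogate loss for one sampled caption (hypothetical helper).

    sample_logprobs: log-probabilities of the sampled words w^s, one per step.
    sample_reward:   score of the sampled caption, e.g. CIDEr.
    greedy_reward:   score of the greedy (test-time) caption, the baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing -advantage * log p(w^s) raises the probability of samples
    # that beat the greedy baseline and lowers it for samples that do not.
    return -advantage * sum(sample_logprobs)

# Toy example: a two-word sample that scores 0.2 above the greedy caption.
logprobs = [math.log(0.5), math.log(0.25)]
loss = self_critical_loss(logprobs, sample_reward=1.2, greedy_reward=1.0)
```

When the sampled caption scores below the greedy one, the advantage is negative, so minimizing this loss lowers the sample's log-probability.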

Prior Language Constraint with N-Gram Model
Method: We collect all n-grams (n = 3 or 4 in our experiments) from a corpus of captions. We use the training set of MSCOCO to avoid relying on any additional resource, which guarantees a fair comparison with previous methods. We then filter out the n-grams with frequencies lower than five and denote the set of remaining ones by $F$. During training, given the previous tokens predicted by the decoder, we constrain the sample space of the current prediction with an indicator vector $\alpha$ whose length is the vocabulary size $|V|$: its $i$-th element $\alpha_i$ is non-zero only if the corresponding word and the previous $(n-1)$-gram constitute a valid n-gram in $F$.
Figure 2: Training time of models with (right) and without (left) spatial attention.
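The procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released implementation: the helper names are hypothetical, padding the caption start with *start* tokens is an assumption, and so is the fallback to the unconstrained distribution when no word is valid. The mask is applied by zeroing invalid words and renormalizing, which is one plausible reading of the constraint.

```python
from collections import Counter

START, END = "*start*", "*end*"  # padding with START is an assumption

def build_ngram_set(captions, n=3, min_count=5):
    """Return the set F of n-grams seen at least min_count times (n >= 2)."""
    counts = Counter()
    for tokens in captions:
        padded = [START] * (n - 1) + list(tokens) + [END]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return {gram for gram, c in counts.items() if c >= min_count}

def ngram_mask(context, vocab, ngrams, n=3):
    """Indicator vector: 1.0 where the word completes an n-gram in F."""
    prefix = tuple(([START] * (n - 1) + list(context))[-(n - 1):])
    return [1.0 if prefix + (w,) in ngrams else 0.0 for w in vocab]

def constrained_probs(probs, alpha):
    """Zero out invalid words and renormalize over the valid ones."""
    masked = [p * a for p, a in zip(probs, alpha)]
    total = sum(masked)
    if total == 0:
        # Assumed fallback: keep the original distribution if nothing is valid.
        return list(probs)
    return [m / total for m in masked]

# Toy corpus: "a dog runs" is frequent, "a cat sits" is rare and is filtered.
captions = [["a", "dog", "runs"]] * 5 + [["a", "cat", "sits"]] * 2
F = build_ngram_set(captions, n=3, min_count=5)
vocab = ["a", "dog", "cat", "runs", "sits", END]
alpha = ngram_mask(["a"], vocab, F, n=3)
probs = [0.1, 0.3, 0.4, 0.1, 0.05, 0.05]
print(constrained_probs(probs, alpha))
```

In this toy corpus only "dog" remains a valid action after the context "a", even though the unconstrained policy prefers "cat".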

Discussion: The key motivation for applying the above constraint is two-fold: (1) it ensures that generated captions are always formed from valid n-grams, which gives us a direct way of eliminating the repeated common phrases and bad endings like the ones in Table 1; and (2) it shrinks the size of the action space, which makes training converge much faster. For MSCOCO, the action space at each step shrinks from more than 9,000 to 56 on average.
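To get a rough feel for what this reduction means for the $O(|V|^T)$ search space, the snippet below compares the two sizes on a log scale. The caption length T = 16 is an assumption chosen only for illustration, and 56 is an average, so the constrained figure is an order-of-magnitude estimate rather than an exact count.

```python
import math

def log10_search_space(branching, length):
    """log10 of the number of sequences with `branching` choices per step."""
    return length * math.log10(branching)

T = 16  # assumed caption length, for illustration only
full = log10_search_space(9000, T)       # unconstrained vocabulary
constrained = log10_search_space(56, T)  # average valid actions per step
print(f"unconstrained: ~10^{full:.0f}, constrained: ~10^{constrained:.0f}")
```

Under these assumptions the constrained space is smaller by roughly 35 orders of magnitude.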

Experiments
Dataset: We perform both quantitative and qualitative evaluations on the MSCOCO dataset. The dataset contains 123,287 images, and each image has at least five human captions. For a fair comparison with other work, we use the publicly available splits, which contain 82,783 training, 5,000 validation and 5,000 testing images.
Implementation Details: Our implementations are based on a publicly available project. We use an ImageNet pre-trained 101-layer ResNet (He et al., 2016) to extract visual features. We consider two types of architecture (see Section 2) for training with RL: (1) the plain encoder-decoder, and (2) the encoder-decoder with attention. For the former, we represent each image by a 2,048-dimensional vector obtained by average pooling the features of the last convolutional layer. For the attention model, we apply spatial adaptive max pooling, and the output feature map has size 14 × 14 × 2,048. At each time step, the attention model produces weights over the 196 spatial locations. The size of the word embeddings and the hidden dimension of the LSTM are set to 512 for all experiments. More details are in Appendix B.
Compared Methods: We report our results in four settings, which combine with/without attention and tri-/four-gram priors. We directly compare with counterparts that have the same structures but no n-gram module: the encoder-decoder based self-critical model (ED-SC) and the one with attention (Att-SC). In addition, since our experimental setup is almost identical to that of many existing works, we also include their reported results (Karpathy and Li, 2015; Xu et al., 2015; Ranzato et al., 2015; Ren et al., 2017). Finally, we include the performance of our warm-start models, i.e. the models trained by MLE (Vinyals et al., 2015) using cross entropy (ED-XE and Att-XE), as a reference.
Evaluation Metric and Performance Adjustment: We report performance on five metrics: BLEU4, METEOR, ROUGE-L, CIDEr and Bad Ending Rate. For the self-critical baselines, we report two sets of results: (1) on the captions directly generated by the model; and (2) on the captions obtained after removing bad endings, based on the distribution in Appendix A.
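The adjustment and the Bad Ending Rate can be sketched as below. The set of bad endings here contains only the three phrases quoted in the introduction and is a placeholder for the full list in Appendix A. The function names are hypothetical, and stripping a single trailing phrase is an assumption about how the adjustment is applied.

```python
# Illustrative subset only; the full list is given in Appendix A.
BAD_ENDINGS = {("with", "a"), ("on", "a"), ("of", "a")}

def strip_bad_ending(tokens, bad_endings=BAD_ENDINGS):
    """Remove one trailing bad ending, if present; otherwise copy the tokens."""
    for ending in bad_endings:
        k = len(ending)
        if k <= len(tokens) and tuple(tokens[-k:]) == ending:
            return list(tokens[:-k])
    return list(tokens)

def bad_ending_rate(captions, bad_endings=BAD_ENDINGS):
    """Fraction of captions that end with one of the bad endings."""
    if not captions:
        return 0.0
    hits = sum(strip_bad_ending(c, bad_endings) != list(c) for c in captions)
    return hits / len(captions)

captions = [
    ["a", "man", "riding", "a", "wave", "on", "a"],
    ["a", "dog", "runs", "on", "the", "beach"],
]
adjusted = [strip_bad_ending(c) for c in captions]
rate = bad_ending_rate(captions)
```

The adjusted captions would then be scored with the standard metrics in place of the raw model output.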
Results: Table 2 summarizes the performance of our models compared with the baselines. Without performance adjustment, the self-critical RL model with attention performs best. However, since its captions contain many bad endings, our method achieves the best results once these repeated patterns are removed. We also provide a qualitative comparison between our attention model and the self-critical model in Appendix C.
Efficient Training: Figure 2 shows that constraining the action space leads to more efficient RL training. The CIDEr score is calculated after removing bad endings. We plot three curves for the architectures with and without attention. The green curve is the self-critical model, the blue one adds prioritized sampling, and the red one is our final model with the 4-gram constraint. We observe that our model converges almost twice as fast as its counterpart.
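Assuming that "prioritized sampling" refers to the best-of-several selection described in the introduction (sample multiple sentences and update the policy on the best-rewarded one), the selection step can be sketched as follows. The function name and the toy reward are hypothetical; in training, the reward would be a caption metric such as CIDEr.

```python
def pick_best_sample(samples, reward_fn):
    """Return (sample, reward) for the highest-reward candidate.

    Ties are resolved in favor of the earliest sample. Raises ValueError
    if `samples` is empty.
    """
    rewards = [reward_fn(s) for s in samples]
    best = max(range(len(samples)), key=rewards.__getitem__)
    return samples[best], rewards[best]

# Toy reward for illustration: number of distinct words in the caption.
samples = [["a", "dog", "on", "a"], ["a", "dog", "runs", "fast"], ["a", "a"]]
best_caption, best_reward = pick_best_sample(samples, lambda s: len(set(s)))
```

Only the selected caption would then contribute to the policy gradient update for that image.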
Online Evaluation: We also evaluate our attention model on the COCO online server, and the results are reported in Table 3. Att-SC scores higher than our model in the online test; however, its captions contain many bad endings, with a bad ending ratio of 72.7%.
Human Evaluation: We also conduct a human evaluation of the captions generated by our Att-4gram model compared with those of Att-SC. We randomly select 200 images from the test set. Each time, one image with two captions generated by the two models is shown to a volunteer, and three choices are provided: (1) the first one is better; (2) both are at the same level; (3) the second one is better. See Appendix D for more details. As shown in Table 5, our model wins 400 times and is closer to human performance than Att-SC.
Evaluating Caption Diversity: To further evaluate the quality of the caption model, we follow Shetty et al. (2017) and measure the diversity of the generated captions. We compute the novelty score of our 4-gram model, which is based on whether a particular caption has been observed in the training set. When two models have the same level of predictive performance (e.g. CIDEr), a higher novelty score usually indicates more diverse generations. We run the experiment five times and report the averaged novelty scores of our 4-gram model and Att-SC, which are 77.83% and 59.28%, respectively. For reference, the METEOR and novelty scores reported in Shetty et al. (2017) are 23.6 and 79.84%, respectively.
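A minimal sketch of such a novelty score is given below, assuming it is computed as the fraction of generated captions that do not appear verbatim in the training set. The function name is hypothetical, and exact token-level matching is an assumption.

```python
def novelty_score(generated, training):
    """Fraction of generated captions never seen verbatim in training."""
    if not generated:
        return 0.0
    seen = {tuple(c) for c in training}
    novel = sum(tuple(c) not in seen for c in generated)
    return novel / len(generated)

training = [["a", "dog", "runs"], ["a", "cat", "sits"]]
generated = [["a", "dog", "runs"], ["a", "dog", "sits"], ["a", "cat", "runs"]]
score = novelty_score(generated, training)
```

Here two of the three generated captions are new, so the score is two thirds.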

Neural Language Models Extension
Inspired by the paper reviews, we extend our model by adopting another language prior to evaluate the effectiveness of constraining the action space during REINFORCE training. We train a neural language model with an LSTM unit on the MSCOCO caption corpus.
LSTM Language Model: Given a word sequence $\{w_0, w_1, \ldots, w_T\}$, the objective of a neural language model is to maximize the log-likelihood: $\max_\theta \log p_\theta(w_0, w_1, \ldots, w_T)$.
We model $p_\theta(w_0, w_1, \ldots, w_T)$ with an LSTM unit, where $w_0$ is set to a *start* token for all sentences and $h_0$ and $c_0$ are initialized to zero. After obtaining the optimized $\theta^*_{LM}$, we can use it to constrain the action space in the same way as the N-gram language model. Specifically, given the previous $t-1$ words sampled from the current caption model, we compute $p_{\theta^*_{LM}}(w_t \mid w_0, w_1, \ldots, w_{t-1})$, the probability of the next word over the entire vocabulary. We then apply a simple thresholding rule to form a subset of valid words for the captioning model.
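The thresholding step can be sketched as follows. The exact rule is not spelled out, so this assumes a word is valid when the language model assigns it a probability of at least η, with η starting at 0.00005 for the first word and doubling at every time step, as in the experimental settings. The function names are hypothetical.

```python
def threshold_at(t, eta0=5e-5, growth=2.0):
    """Threshold for time step t, where t = 0 is the first word."""
    return eta0 * growth ** t

def lm_valid_mask(lm_probs, t, eta0=5e-5, growth=2.0):
    """1.0 for words whose language-model probability reaches the threshold."""
    eta = threshold_at(t, eta0, growth)
    return [1.0 if p >= eta else 0.0 for p in lm_probs]

# Toy next-word distribution from the language model (remaining mass omitted).
lm_probs = [1e-5, 1e-3, 0.5]
mask_first = lm_valid_mask(lm_probs, t=0)  # threshold 5e-5
mask_later = lm_valid_mask(lm_probs, t=5)  # threshold 1.6e-3
```

Because the threshold doubles at each step, it can exceed 1 for long captions and leave no valid word, so a practical implementation would need a cap or a fallback; the text does not say which is used.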

Additional Experiments
The word embedding size and hidden dimension of $\theta_{LM}$ are set to 256 for this experiment. We use the Adam optimizer to train the language model, with the learning rate set to 0.001. The batch sizes for language model training and REINFORCE training are both set to 20. The threshold η is set to 0.00005 for the first word and increases by a factor of two at every time step. We report our results in two settings, with and without attention for the caption model (termed Att-LSTM-LM and ED-LSTM-LM, respectively). We use the same warm-start models as in the N-gram experiments. The results are summarized in Table 4 and Table 3. The neural language model provides further performance gains over the N-gram model without introducing any bad endings. This is because the LSTM language model covers a larger context than the N-gram model, which helps to generate more accurate captions.

Conclusion
In this paper, we present a simple but efficient approach to RL-based image captioning that uses an n-gram language prior to constrain the action space. Our method converges faster and achieves better results than the self-critical setting after bad endings are removed from the generated captions. In addition, captions generated by our models are more readable and graceful. We further extend our idea to a neural language model. The results demonstrate that the captioning models benefit more from the neural language model than from the N-gram model.