Training for Diversity in Image Paragraph Captioning

Image paragraph captioning models aim to produce detailed descriptions of a source image. These models use similar techniques as standard image captioning models, but they have encountered issues in text generation, notably a lack of diversity between sentences, that have limited their effectiveness. In this work, we consider applying sequence-level training for this task. We find that standard self-critical training produces poor results, but when combined with an integrated penalty on trigram repetition produces much more diverse paragraphs. This simple training approach improves on the best result on the Visual Genome paragraph captioning dataset from 16.9 to 30.6 CIDEr, with gains on METEOR and BLEU as well, without requiring any architectural changes.


Introduction
Image captioning aims to describe the objects, actions, and details present in an image using natural language. Most image captioning research has focused on single-sentence captions, but the descriptive capacity of this form is limited; a single sentence can only describe in detail a small aspect of an image. Recent work has argued instead for image paragraph captioning with the aim of generating a (usually 5-8 sentence) paragraph describing an image.
Compared with single-sentence captioning, paragraph captioning is a relatively new task. The main paragraph captioning dataset is the Visual Genome corpus, introduced by Krause et al. (2016). When strong single-sentence captioning models are trained on this dataset, they produce repetitive paragraphs that are unable to describe diverse aspects of images. The generated paragraphs repeat a slight variant of the same sentence multiple times, even when beam search is used. Prior work, discussed in the following section, tried to address this repetition with architectural changes, such as hierarchical LSTMs, which separate the generation of sentence topics and words.
In this work, we consider an approach for training paragraph captioning models that focuses on increasing the diversity of the output paragraph. In particular, we note that self-critical sequence training (SCST) (Ranzato et al., 2015; Rennie et al., 2016), a technique which uses policy gradient methods to directly optimize a target metric, has been successfully employed in standard captioning, but not in paragraph captioning. We observe that during SCST training the intermediate results of the system lack diversity, which makes it difficult for the model to improve. We address this issue with a simple repetition penalty which downweights trigram overlap.
Experiments show that this technique greatly improves the baseline model. A simple baseline, non-hierarchical model trained with repetition-penalized SCST outperforms complex hierarchical models trained with both cross-entropy and customized adversarial losses. We demonstrate that this strong performance gain comes from the combination of repetition-penalized search and SCST, rather than from either individually, and discuss how this impacts the output paragraphs.

Background and Related Work
Nearly all modern image captioning models employ variants of an encoder-decoder architecture. As introduced by Vinyals et al. (2014), the encoder is a CNN pre-trained for classification and the decoder is an LSTM or GRU. Following work in machine translation, Xu et al. (2015) added an attention mechanism over the encoder features. Recently, Anderson et al. (2017) further improved single-sentence captioning performance by incorporating object detection in the encoder (bottom-up attention) and adding an LSTM layer before attending to spatial features in the decoder (top-down attention).
Single-sentence and paragraph captioning models are evaluated with a number of metrics, including some designed specifically for captioning (CIDEr) and some adopted from machine translation (BLEU, METEOR). CIDEr and BLEU measure accuracy with n-gram overlaps, with CIDEr weighting n-grams by TF-IDF (term frequency-inverse document frequency), while METEOR uses unigram overlap, incorporating synonym and paraphrase matches. We discuss these metrics in greater detail when analyzing our experiments.

Krause et al. (2016) introduced the first large-scale paragraph captioning dataset, a subset of the Visual Genome dataset, along with a number of models for paragraph captioning. Empirically, they showed that paragraphs contain significantly more pronouns, verbs, and coreferences, and greater overall "diversity" than single-sentence captions. Whereas most single-sentence captions in the MSCOCO dataset describe only the most important object or action in an image, paragraph captions usually touch on multiple objects and actions.
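The clipped n-gram overlap at the core of these metrics can be sketched in a few lines (a simplified illustration with whitespace tokenization; real BLEU combines several n-gram orders with a brevity penalty, and CIDEr additionally weights each n-gram by its TF-IDF score across the reference corpus):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision: the fraction of candidate n-grams that
    also appear in the reference, with counts clipped at the reference's
    count for each n-gram."""
    def ngrams(tokens):
        return Counter(zip(*(tokens[i:] for i in range(n))))
    cand = ngrams(candidate.split())
    ref = ngrams(reference.split())
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(1, sum(cand.values()))

# Two of the four candidate bigrams appear in the reference.
score = ngram_precision("a man on a skateboard", "a man rides a skateboard")
```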

Related Models
The paragraph captioning models proposed by Krause et al. (2016) included template-based (non-neural) approaches and two encoder-decoder models. In both neural models, the encoder is an object detector pre-trained for dense captioning. In the first model, called the flat model, the decoder is a single LSTM which outputs an entire paragraph word-by-word. In the second model, called the hierarchical model, the decoder is composed of two LSTMs, where the output of one sentence-level LSTM is used as input to the other word-level LSTM.
Recently, Liang et al. (2017) extended this model with a third (paragraph-level) LSTM and added adversarial training. In total, their model (RTT-GAN) incorporates three LSTMs, two attention mechanisms, a phrase copy mechanism, and two adversarial discriminators. To the best of our knowledge, this model achieves state-of-the-art performance of 16.9 CIDEr on the Visual Genome dataset (without external data).
For our experiments, we use the top-down single-sentence captioning model in Anderson et al. (2017). This model is similar to the "flat" model in Krause et al. (2016), except that it incorporates attention with a top-down mechanism.

Approach
The primary issue in current paragraph captioning models, especially non-hierarchical ones, is a lack of topic diversity in the output paragraph. For example, for the image of a skateboarder in Figure 1, the flat model outputs "The man is wearing a black shirt and black pants" seven times. This example is not anomalous: it is a typical failure case of the model. Empirically, on the validation set, ground truth paragraphs contain 0.62 repeated trigrams on average, whereas paragraphs produced by the flat cross-entropy model contain 25.9 repeated trigrams on average.
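The repeated-trigram statistic above can be computed directly (a sketch; the exact tokenization used for the paper's statistic is an assumption here):

```python
from collections import Counter

def count_repeated_trigrams(paragraph):
    """Count trigram occurrences beyond the first, as a simple
    measure of repetitiveness (lower is more diverse)."""
    tokens = paragraph.lower().split()
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # A trigram seen n times contributes n - 1 repeats.
    return sum(n - 1 for n in trigrams.values())

# A paragraph that repeats the same sentence yields a large count.
repetitive = "the man is wearing a black shirt . " * 3
repeats = count_repeated_trigrams(repetitive)
```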

Self-Critical Sequence Training
Self-critical sequence training (SCST) is a sequence-level optimization procedure proposed by Rennie et al. (2016), which has been widely adopted in single-sentence captioning but has not yet been applied to paragraph captioning. This method provides an alternative to word-level cross-entropy training which can incorporate a task-specific metric.
Sequence-level training employs a policy gradient method to optimize directly for a non-differentiable metric, such as CIDEr or BLEU. This idea was first applied to machine translation by Ranzato et al. (2015) in a procedure called MIXER, which incrementally transitions from cross-entropy to policy gradient training. To normalize the policy gradient reward and reduce variance during training, MIXER subtracts a baseline estimate of the reward as calculated by a linear regressor.
SCST replaces this baseline reward estimate with the reward obtained by the test-time inference algorithm, namely the CIDEr score of greedy search. This weights the gradient by the difference in reward given to a sampled paragraph compared to the current greedy output (see Eq. 3-9 in (Rennie et al., 2016)). Additionally, SCST uses a hard transition from cross-entropy to policy gradient training. The final gradient is

∇_θ L(θ) ≈ −(r(w^s) − r(w^g)) ∇_θ log p_θ(w^s),

where w^s is a sampled paragraph, w^g is the greedily decoded paragraph, r is the reward (e.g., CIDEr), and p_θ is the captioning model.
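The per-sample loss whose gradient matches the SCST update can be sketched as follows, assuming the per-word log-probabilities of the sampled paragraph are already available (the function name and interface are illustrative, not the authors' code):

```python
import math

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical loss for one sampled paragraph.  The greedy decode's
    reward serves as the baseline, so no learned value function is needed.
    Differentiating this w.r.t. the model parameters gives
    -(r(w^s) - r(w^g)) * grad log p(w^s), as in Rennie et al. (2016)."""
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)

# When the sample beats the greedy baseline (positive advantage),
# minimizing this loss raises the sample's log-probability.
loss = scst_loss([math.log(0.5), math.log(0.25)],
                 sample_reward=0.8, greedy_reward=0.5)
```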

Repetition Penalty
In preliminary experiments, we find that directly applying SCST is not effective for paragraph captioning models. Table 1 shows that when training with SCST, the model performs only marginally better than cross-entropy. In further analysis, we see that the greedy baseline in SCST training has very non-diverse output, which leads to poor policy gradients. Unlike in standard image captioning, the cross-entropy model is too weak for SCST to be effective.
To address this problem, we take inspiration from recent work in abstractive text summarization, which encounters the same challenge when producing paragraph-length summaries of documents (Paulus et al., 2017). These models target the repetition problem by simply preventing the model from producing the same trigram more than once during inference. We therefore introduce an inference constraint that penalizes the log-probabilities of words that would result in repeated trigrams. The penalty is proportional to the number of times the trigram has already been generated.
Formally, denote the (pre-softmax) output of the LSTM by o, where the length of o is the size of the target vocabulary and o_w is the log-probability of word w. We modify o_w by o_w → o_w − k_w · α, where k_w is the number of times the trigram completed by word w has previously been generated in the paragraph, and α is a hyperparameter which controls the degree of blocking. When α = 0, there is no penalty, so we have standard greedy search. When α → ∞, or in practice when α exceeds about 5, we have full trigram blocking.
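As an illustration, the penalty can be applied to the pre-softmax scores at each decoding step as follows (a sketch; the vocabulary and scoring interface here are hypothetical):

```python
from collections import Counter

def penalize_repeats(logits, generated, vocab, alpha=2.0):
    """Subtract alpha * k_w from the score of each word w, where k_w is
    the number of times the trigram that w would complete has already
    been generated in the paragraph so far."""
    if len(generated) < 2:
        return logits
    counts = Counter(zip(generated, generated[1:], generated[2:]))
    prefix = tuple(generated[-2:])  # last two generated words
    return [score - alpha * counts[prefix + (w,)]
            for score, w in zip(logits, vocab)]

vocab = ["black", "shirt", "pants"]
generated = ["a", "black", "shirt", "and", "a", "black"]
# Completing "a black shirt" a second time is penalized by alpha * 1.
penalized = penalize_repeats([1.0, 1.0, 1.0], generated, vocab, alpha=2.0)
```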
We incorporate this penalty into the greedy baseline used to compute policy gradients in SCST. During inference, we employ the same repetition-penalized greedy search.

Methods and Results
For our paragraph captioning model we use the top-down model from Anderson et al. (2017). Our encoder is a convolutional network pre-trained for object detection (as opposed to dense captioning, as in Krause et al. (2016) and Liang et al. (2017)). The encoder extracts between 10 and 100 objects per image and applies spatial max-pooling to yield a single feature vector of dimension 2048 per object. The decoder is a 1-layer LSTM with hidden dimension 512 and top-down attention.
Evaluation is done on the Visual Genome dataset with the splits provided by Krause et al. (2016). We first train for 25 epochs with cross-entropy (XE) loss, using Adam with learning rate 5 · 10^−4. We then train an additional 25 epochs with repetition-penalized SCST targeting a CIDEr-based reward, using Adam with learning rate 5 · 10^−5.
Results

Table 1 shows the main experimental results. Our baseline cross-entropy captioning model obtains scores similar to the original flat model. When the repetition penalty is applied to a model trained with cross-entropy, we see a large improvement on CIDEr and a minor improvement on the other metrics. When combining the repetition penalty with SCST, we see a dramatic improvement across all metrics, particularly on CIDEr. Interestingly, SCST only works when its baseline reward model is strong; for this reason, the combination of the repetition penalty and SCST is particularly effective.

Finally, Table 3 analyzes the different model outputs (α = 2.0 for models with the penalty), showing quantitative changes in trigram repetition and ground truth matches. The cross-entropy model fails to generate enough unique phrases. Blocking repeats entirely gives some benefit, but the SCST model is able to raise the total number of matched trigrams while reintroducing few repeats.

Conclusion
This work targets increased diversity in image paragraph captioning. We show that training with SCST combined with a repetition penalty leads to a substantial improvement in the state-of-the-art for this task, without requiring architectural changes or adversarial training. In future work, we hope to further address the language issues of paragraph generation as well as extend this simple approach to other tasks requiring long-form text or paragraph generation.