Improving Image Captioning with Better Use of Captions

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision communities. In this paper, we present a novel image captioning architecture to better explore the semantics available in captions and leverage them to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. The representation is then enhanced with neighbouring and contextual nodes together with their textual and visual features. During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences. We perform extensive experiments on the MSCOCO dataset, showing that the proposed framework significantly outperforms the baselines, achieving state-of-the-art performance under a wide range of evaluation metrics.


Introduction
Automatically generating a short description for a given image, a problem known as image captioning (Chen et al., 2015), has drawn extensive attention in both the natural language processing and computer vision communities. Inspired by the success of encoder-decoder frameworks with the attention mechanism, previous efforts on image captioning adopt variants of pre-trained convolutional neural networks (CNNs) as the image encoder and recurrent neural networks (RNNs) with visual attention as the decoder (Lu et al., 2017; Anderson et al., 2018; Xu et al., 2015; Lu et al., 2018).
Many previous methods translate image representation into natural language sentences without explicitly investigating semantic cues from texts and images. To remedy that, some research has explored detecting high-level semantic concepts present in images to improve caption generation (Wu et al., 2016; Gan et al., 2017; You et al., 2016; Yao et al., 2017). The inductive bias introduced by structured combinations of concepts and visual relationships is widely believed to be important and has led to better captioning models (Yao et al., 2018; Guo et al., 2019; Yang et al., 2019). These approaches obtain visual relationship graphs using models pre-trained on visual relationship detection (VRD) datasets, e.g., Visual Genome (Krishna et al., 2017), where the visual relationships capture semantics between pairs of localized objects connected by predicates, covering both spatial (e.g., cake-on-desk) and non-spatial semantic relationships (e.g., man-eat-food) (Lu et al., 2016).

Footnote 1: https://github.com/Gitsamshi/WeakVRD-Captioning

Figure 1: Visual relationship graphs from a pre-trained detection model (Yao et al., 2018) (upper) and from the ground-truth caption (bottom).
As in many other joint text-image modeling problems, it is crucial in image captioning to obtain a good semantic representation that bridges semantics in language and images. The existing approaches, however, have not yet adequately leveraged the semantics available in captions to construct image representations and generate captions. As shown in Figure 1, although VRD detection models show a strong capacity for predicting salient objects and the most common predicates, they often miss predicates vital for captioning (e.g., "grab" in this example). Exploring better models is therefore highly desirable.
A major challenge for establishing a structural connection between captions and images is that the links between predicates and the corresponding object regions are often ambiguous: within the "image-level" label (obj_1, pred, obj_2) extracted from captions, there may exist multiple object regions corresponding to obj_1 and obj_2. In this paper, we propose to use weakly supervised multi-instance learning to detect whether a bag of object (region) pairs in an image contains certain predicates, e.g., predicates appearing in ground-truth captions here (or, in other applications, any given predicates under concern). Based on that, we can construct caption-guided visual relationship graphs.
Once the visual relationship graphs (VRG) are built, we propose to adapt graph convolution operations (Marcheggiani and Titov, 2017) to obtain representation for object nodes and predicate nodes. These nodes can be viewed as image representation units used for generation.
During generation, we further incorporate visual relationships: we propose multi-task learning for jointly predicting word and tag sequences, where each word in a caption can be assigned a tag, i.e., object, predicate, or none. The tag-prediction module takes as input the graph node features from the above visual relationship graphs. The motivation for predicting a tag at each step is to regularize which types of information should be weighted more heavily when generating words: predicate node features, object node features, or the current state of the language decoder. We study different types of multi-task blocks in our models.
As a result, our models consist of three major components: constructing caption-guided visual relationship graphs (CGVRG) with weakly supervised multi-instance learning, building context-aware CGVRG, and performing multi-task generation to regularize the network to take into account explicit object/predicate constraints. We perform extensive experiments on the MSCOCO (Lin et al., 2014) image captioning dataset with both supervised and reinforcement learning strategies (Rennie et al., 2017). The experimental results show that the proposed models significantly outperform the baselines and achieve state-of-the-art performance under a wide range of evaluation metrics. The main contributions of our work are summarized as follows: • We propose to construct caption-guided visual relationship graphs that introduce beneficial inductive bias by better bridging captions and images. The representation is further enhanced with neighbouring and contextual nodes together with their textual and visual features.
• Unlike existing models, we propose multi-task learning to regularize the network to take into account explicit object/predicate constraints in the process of generation.
• The proposed framework achieves the state-ofthe-art performance on the MSCOCO image captioning dataset. We provide detailed analyses on how this is attained.

Related Work
Image Captioning A prevalent paradigm of existing image captioning methods is the encoder-decoder framework, which often utilizes a CNN-plus-RNN architecture for image encoding and text generation (Donahue et al., 2015; Vinyals et al., 2015; Karpathy and Fei-Fei, 2015). Soft or hard visual attention mechanisms (Xu et al., 2015) have been incorporated to focus on the most relevant regions at each generation step. Furthermore, adaptive attention (Lu et al., 2017) has been developed to decide whether to rely on visual features or language model states at each decoding step. Recently, bottom-up attention techniques (Anderson et al., 2018; Lu et al., 2018) have also been proposed to find the most relevant regions based on bounding boxes. There has been increasing work on filling the gap between image representation and caption generation. Semantic concepts and attributes detected from images have been demonstrated to be effective in boosting image captioning when used in encoder-decoder frameworks (Wu et al., 2016; You et al., 2016; Gan et al., 2017; Yao et al., 2017). Visual relationships (Lu et al., 2016) and scene graphs (Johnson et al., 2015) have further been employed for image encoding in a unimodal (Yao et al., 2018) or multi-modal (Yang et al., 2019; Guo et al., 2019) manner to improve overall performance via the graph convolutional mechanism (Marcheggiani and Titov, 2017). Besides, Kim et al. (2019) propose a relationship-based captioning task that leads to a better understanding of images based on relationships. As discussed in the introduction, we further explore the relational semantics available in captions for both constructing image representations and generating captions.
Visual Relationship Detection Visual relations between objects in an image have attracted increasing attention recently. Conventional visual relationship detection deals with subject-predicate-object triples, covering both spatial and other semantic relations. Lu et al. (2016) detect the triples by performing subject, object, and predicate classification separately. Subsequent work attempts to encode more distinguishable visual features for visual relationship detection. Probabilistic outputs of object detection (Dai et al., 2017) have also been used to reason about visual relationships.
Given an image I, the goal of image captioning is to generate a visually grounded natural language sentence. We learn our model by minimizing the cross-entropy loss with regard to the ground-truth caption S^* = {w^*_1, w^*_2, ..., w^*_T}:

L_{XE} = -\sum_{t=1}^{T} \log p(w^*_t \mid w^*_{<t}, I).

The model is further tuned with a Reinforcement Learning (RL) objective (Rennie et al., 2017) to maximize the expected reward of the generated sentence S:

L_{RL} = -\mathbb{E}_{S \sim p(S \mid I)}[d(S)],

where d is a sentence-level scoring metric.
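As a rough sketch (not the authors' released code), the two objectives can be written out as follows; the function names and the scalar-probability interface are our own simplifications:

```python
import numpy as np

def cross_entropy_loss(step_probs):
    """Negative log-likelihood of the ground-truth caption, given the
    per-step probabilities p(w*_t | w*_<t, I) assigned by the decoder."""
    return -sum(np.log(p) for p in step_probs)

def scst_weight(reward_sampled, reward_greedy):
    """Self-critical sequence training uses the reward of the greedy
    decode as the baseline b, so each sampled sentence contributes a
    policy-gradient weight of d(S) - b."""
    return reward_sampled - reward_greedy
```

In SCST, a positive weight (the sample beats the greedy baseline) increases the probability of the sampled sentence, and a negative weight decreases it.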
An overview of our image captioning framework is depicted in Figure 2, with the detail of the components described in the following sections.

Caption-Guided Visual Relationship Graph (CGVRG) with Weakly Supervised Learning
A general challenge of modeling p(S|I) is obtaining a better semantic representation in the multimodal setting to bridge captions and images. Our framework first focuses on constructing captionguided visual relationship graphs (CGVRG).

Extracting Visual Relationship Triples and Detecting Objects
The process of constructing the CGVRG first extracts relationship triples from captions using the textual scene graph parser described in (Schuster et al., 2015). Our framework employs Faster R-CNN (Ren et al., 2015) to recognize instances of objects and return a set of image regions V = {v_1, v_2, ..., v_n}.

Constructing CGVRG
The main focus of the CGVRG is constructing visual relationship graphs. As discussed in the introduction, the existing approaches use pre-trained VRD (visual relationship detection) models, which often ignore key relationships needed for captioning. This gap can be even more pronounced when the image-captioning domain is far from the one on which the VRD model was pre-trained. A major challenge in using predicate triples from captions to construct the CGVRG is that the links between predicates and the corresponding object regions are often ambiguous, as discussed in the introduction. To solve this problem, we use weakly supervised multi-instance learning.
Obtaining Representation for Object Region Pairs For an image I with a list of salient object regions {v_1, v_2, ..., v_n} obtained in object detection, we have a set of region pairs U = {u_1, u_2, ..., u_N}, where N = n(n − 1). As shown in Figure 3(b), the visual features of any two object regions and their union box are collected to compute p^{r_j}_{u_n}, the probability that a region pair u_n is associated with the predicate r_j, where r_j ∈ R and R = {r_1, r_2, ..., r_M} comprises frequent predicates obtained from the captions in the training data. The feed-forward network of Figure 3(b) is trained with weakly supervised training.
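A minimal sketch of the pair scorer, assuming a single linear layer with a per-predicate sigmoid (the paper uses a feed-forward network; the exact depth and feature layout are assumptions on our part). A sigmoid rather than a softmax is used because one region pair may hold several predicates at once:

```python
import numpy as np

def pair_predicate_probs(feat_pair, W, b):
    """Score one region pair u_n against all M predicates.
    feat_pair: concatenated features of the two regions and their union
    box (flattened to one vector here for simplicity); W, b: weights of
    a one-layer scorer standing in for the paper's feed-forward network.
    Returns p^{r_j}_{u_n} for j = 1..M."""
    logits = feat_pair @ W + b
    return 1.0 / (1.0 + np.exp(-logits))  # independent sigmoid per predicate
```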
Weakly Supervised Multi-Instance Training As shown in Figure 3(c), during training, one object pair t = (o_1, o_2), e.g., (woman, hat), can correspond to multiple pairs of object regions: the four woman-hat combinations between the two women and two hats. To make the description clearer, we refer to t = (o_1, o_2) as an object pair and to the four woman-hat pairs in the image as object region pairs. Accordingly, for an extracted triple t = (o_1, r, o_2) with r ∈ R, e.g., (woman, in, hat), the predicate r (i.e., in) can be associated with multiple object region pairs (here, (w0, h0), (w0, h1), (w1, h0), and (w1, h1)).
To predict predicates over object region pairs, we propose to use Multi-Instance Learning as our weakly supervised learning approach. Multi-Instance Learning receives a set of labeled bags, each containing a set of instances. A bag is labeled negative if all the instances in it are negative; it is labeled positive if at least one instance in the bag is positive.
In our problem, an instance is a region pair. Therefore, for a candidate predicate r ∈ R (e.g., in), we use N_r to denote the object region pairs corresponding to predicate r. If r appears in the caption S, N_r is a positive bag, and N \ N_r serves as the negative bag for r. When r is not contained in the caption, the entire N is the negative bag (the last row of Figure 3(c)). The probability of a bag b having the predicate r_j is measured with "noisy-OR":

P(b, r_j) = 1 - \prod_{u_n \in b} (1 - p^{r_j}_{u_n}),

where p^{r_j}_{u_n} has been introduced above. We adopt the cross-entropy loss on the basis of all predicate probabilities over bags, given an image I and caption S:

L_{MIL} = -\sum_{r_j \in R} \big[ \mathbb{1}_{r_j \in S} \log P(b^+_{r_j}, r_j) + \log\big(1 - P(b^-_{r_j}, r_j)\big) \big],

where b^+_{r_j} and b^-_{r_j} denote the positive and negative bags for r_j, and the indicator function \mathbb{1}_{condition} = 1 if the condition is true and 0 otherwise.

Figure 3(c): multi-instance learning for predicates "in" and "feed", respectively. Here, w, h, and g denote woman, hat, and giraffe, respectively.
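The noisy-OR aggregation and the per-bag loss can be sketched as follows (a simplified scalar version; the paper sums this over all predicates and bags jointly):

```python
import numpy as np

def noisy_or(pair_probs):
    """P(bag b has predicate r_j) = 1 - prod_n (1 - p^{r_j}_{u_n}):
    the bag fires unless every instance in it fails to fire."""
    return 1.0 - np.prod(1.0 - np.asarray(pair_probs))

def bag_loss(bag_prob, is_positive):
    """Cross-entropy for one bag: -log p for a positive bag,
    -log(1 - p) for a negative bag."""
    return -np.log(bag_prob) if is_positive else -np.log(1.0 - bag_prob)
```

Note that a single confident instance is enough to push the bag probability high, which is exactly the weak-supervision behaviour wanted here: the loss never forces a particular region pair to carry the predicate, only that at least one does.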
Constructing the Graphs Once the module is trained, we can build a CGVRG G = (V, E) for a given image I, where the node set V includes two types of nodes: object nodes and predicate nodes. We denote o_i as the i-th object node and r_{ij} as a predicate node that connects o_i and o_j (refer to Figure 1 or the middle part of Figure 2). The edges in E are added based on triples; i.e., (o_i, r_{ij}, o_j) assigns two directed edges, from node o_i to r_{ij} and from r_{ij} to o_j, respectively. Note that, thanks to the proposed weakly supervised models, the acquired graphs can now contain predicates that exist in captions but not in the VRD models used in previous work, which does not explicitly consider predicates in captions. We will show in our experiments that this improves captioning quality.
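A toy illustration of the graph-building step, with string node ids standing in for the paper's object and predicate nodes (the `"#i"` suffix is our own device to keep one predicate node per triple):

```python
def build_cgvrg(triples):
    """Build a directed graph G = (V, E) from (o_i, r, o_j) triples:
    each triple adds a predicate node r_ij plus the two directed edges
    o_i -> r_ij and r_ij -> o_j."""
    nodes, edges = set(), []
    for i, (o1, r, o2) in enumerate(triples):
        r_node = f"{r}#{i}"  # distinct predicate node per triple
        nodes.update([o1, r_node, o2])
        edges.append((o1, r_node))
        edges.append((r_node, o2))
    return nodes, edges
```

For example, `build_cgvrg([("woman", "in", "hat")])` yields the three nodes woman, in, hat and the two edges woman→in and in→hat.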

Context-Aware CGVRG
We further enhance the CGVRG with the context of both modalities, images and text, using graph convolutional networks. We first integrate visual and textual features: the textual features for each node come from a word embedding, and the visual features are regional visual representations extracted via RoI pooling from Faster R-CNN. The specific features g_{o_i} and g_{r_{ij}} for object o_i and predicate r_{ij} are computed as follows:

g_{o_i} = φ_o([g^t_{o_i}; g^v_{o_i}]),   g_{r_{ij}} = φ_r(g^t_{r_{ij}}),

where φ_r and φ_o are feed-forward networks with ReLU activation, and g^t_{o_i}, g^t_{r_{ij}}, and g^v_{o_i} denote the textual features of o_i and r_{ij} and the visual features of o_i, respectively.
We next encode G to produce a new set of context-aware representations X. The representations of predicate r_{ij} and object o_i are computed as follows:

x_{r_{ij}} = f_r(g_{r_{ij}}),
x_{o_i} = \frac{1}{N_i} \Big( \sum_{o_j \in N_{in}} f_{in}(g_{o_j}, g_{r_{ji}}, g_{o_i}) + \sum_{o_j \in N_{out}} f_{out}(g_{o_j}, g_{r_{ij}}, g_{o_i}) \Big),

where f_r, f_in, f_out are feed-forward networks with ReLU activation, N_in and N_out denote the adjacent nodes with o_i as head and tail, respectively, and N_i is the total number of adjacent nodes.
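A simplified numerical sketch of one object-node update, where each f is collapsed to a single ReLU-activated linear map over the adjacent predicate-node features and messages are averaged (the true f_in/f_out are feed-forward networks whose exact inputs we only approximate here):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def object_node_update(g_oi, in_preds, out_preds, W_in, W_out):
    """Average ReLU-transformed messages from predicate nodes pointing
    into o_i (in_preds) and out of o_i (out_preds): a simplified
    stand-in for the paper's f_in/f_out aggregation."""
    msgs = [relu(W_in @ g) for g in in_preds] + [relu(W_out @ g) for g in out_preds]
    if not msgs:  # isolated node: keep its own features unchanged
        return g_oi
    return np.mean(msgs, axis=0)
```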

Multi-task Caption Generation
Unlike existing image-captioning models, we further incorporate visual relationships into generation: we propose multi-task learning for jointly predicting word and tag sequences, where each word in a caption is assigned a tag, i.e., object, predicate, or none. The module takes as input the graph node features from the context-aware CGVRG. The output of the generation module is hence the sequence of words y = {y_1, ..., y_T} as well as the tags z = {z_1, ..., z_T}. Two different approaches are leveraged to train the two tasks jointly. The bottom LSTM is used to align a textual state to graph node representations:

h^1_t = LSTM([h^2_{t-1}; \bar{x}; e_{y_{t-1}}], h^1_{t-1}),

where LSTM denotes one step of recurrent unit computation via LSTM; \bar{x} is the mean-pooled representation of all nodes in the graph; h^1_{t-1} and h^2_{t-1} denote the hidden states of the bottom and top LSTM at time step t−1, respectively; and e is the word embedding table.
The state h^1_t is then used as a query to attend over the graph node features {x_o} and {x_r} separately, yielding attended features \hat{x}^r_t and \hat{x}^o_t:

\hat{x}^r_t = ATT(h^1_t, {x_r}),   \hat{x}^o_t = ATT(h^1_t, {x_o}),

where ATT is a soft-attention operation between a query and graph node features. The top LSTM works as a language-model decoder, whose hidden state h^2_0 is initialized with the mean-pooled semantic representation of all detected predicates {r}. At time step t, its input consists of the output of the bottom LSTM layer h^1_t and the attended graph features \hat{x}^r_t and \hat{x}^o_t:

h^2_t = LSTM([h^1_t; \hat{x}^r_t; \hat{x}^o_t], h^2_{t-1}).
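The ATT operation is not fully specified in the text; a common dot-product instantiation would look like this (the scoring function is an assumption, not the paper's exact choice):

```python
import numpy as np

def soft_attention(query, node_feats):
    """Soft attention: dot-product scores between the query h^1_t and
    each graph node feature, softmax-normalised, then a weighted sum.
    query: (d,); node_feats: (n, d); returns an attended (d,) vector."""
    scores = node_feats @ query
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    return weights @ node_feats
```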

Multi-task Learning
We propose two different blocks to perform the two tasks jointly, as shown in Figure 4. At each step, a multi-task learning block handles task s_1, predicting a tag z_t, and task s_2, predicting a word y_t. Specifically, MT-I treats the two tasks independently of each other:

p(z_t | y_{<t}, I) = softmax(f_z(h^2_t)),   p(y_t | y_{<t}, I) = softmax(f_y(h^2_t)),

where f_z and f_y are feed-forward networks with ReLU activation. Inspired by the adaptive attention mechanism (Lu et al., 2017), MT-II further exploits the probabilities from p(z_t | y_{<t}, I) to integrate the representation of the current hidden state h^2_t and the attended graph features \hat{x}^r_t, \hat{x}^o_t:

\hat{h}^2_t = p_{na} h^2_t + p_r \hat{x}^r_t + p_o \hat{x}^o_t,   p(y_t | y_{<t}, I) = softmax(f_y(\hat{h}^2_t)),

where p_na, p_r, and p_o denote the probabilities of tag z_t being "none", "predicate", and "object", respectively. The multi-task loss function is:

L_{MT} = -\sum_{t=1}^{T} \big[ \log p(y_t | y_{<t}, I) + γ \log p(z_t | y_{<t}, I) \big],

where γ is a hyper-parameter balancing the two tasks.
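One plausible reading of the MT-II gating and the joint loss can be sketched as follows (the exact fusion in the paper may differ; p_na, p_r, and p_o are the tag probabilities for "none", "predicate", and "object"):

```python
import numpy as np

def mt2_fuse(h2, x_r, x_o, p_na, p_r, p_o):
    """MT-II: mix the decoder state with the attended graph features,
    using the tag probabilities p(z_t) as convex gates."""
    return p_na * h2 + p_r * x_r + p_o * x_o

def multitask_loss(logp_word, logp_tag, gamma=0.15):
    """Joint loss: -sum_t [log p(y_t|.) + gamma * log p(z_t|.)]."""
    return -(np.sum(logp_word) + gamma * np.sum(logp_tag))
```

When the tag distribution puts all its mass on "none", the fused state reduces to the plain decoder state, recovering MT-I-style word prediction.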

Training and Inference
The overall training process can be broken down into two parts: training the CGVRG detection module and training the caption generator; the latter includes cross-entropy optimization and CIDEr-D optimization. The detection module is optimized with the multi-instance learning loss described above. The caption generator is first optimized with the cross-entropy loss, and then we directly optimize the expected sentence-level reward (CIDEr-D in this work) via self-critical sequence learning (Rennie et al., 2017). In the inference stage, given an image, the CGVRG detection module builds a graph over the detected object regions. The graph convolutional network encodes the graph to obtain context-aware multi-modal representations. The graph object/predicate node features are then fed to the multi-task caption generation module to generate sequences with beam search.

Datasets and Experiment Setup
MSCOCO We perform extensive experiments on the MSCOCO benchmark (Lin et al., 2014). The Karpathy split (Karpathy and Fei-Fei, 2015) is adopted for our model selection and offline testing; it contains 113K training images, 5K validation images, and 5K testing images. For the online test server, the submitted model is trained on the entire training and validation set (123K images). To evaluate the generated captions, we employ standard evaluation metrics: SPICE (Anderson et al., 2016), CIDEr-D (Vedantam et al., 2015), METEOR (Denkowski and Lavie, 2014), ROUGE-L (Lin, 2004), and BLEU (Papineni et al., 2002).
Visual Genome We use the Visual Genome (Krishna et al., 2017) dataset to pre-train our object detection model. The dataset includes 108K images. To pre-train the object detection model with Faster R-CNN, we strictly follow the setting in (Anderson et al., 2018), taking 98K/5K/5K for training, validation, and testing, respectively. The split is carefully selected to avoid contamination of the MSCOCO validation and testing sets, since nearly 51K Visual Genome images are also included in the MSCOCO dataset.

Implementation Details
We use Faster R-CNN (Ren et al., 2015) to identify and localize instances of objects. The object detection phase consists of two modules. The first module proposes object regions using a deep CNN, i.e., ResNet-101 (He et al., 2016). The second module extracts feature maps using region-of-interest pooling for each box proposal. In practice, we take the final output of the ResNet-101 and perform non-maximum suppression for each object class with an IoU threshold. As a result, we obtain a set of image regions V = {v_1, v_2, ..., v_n}, where n ∈ [10, 100] varies with input images and confidence thresholds. Each region is represented as a 2,048-dimensional vector obtained from the pool5 layer after RoI pooling. We then apply a feed-forward network with a 1,000-dimensional output layer for predicate classification. Networks of the same size are also used for feature projection (φ_o, φ_r) and the GCN (f_r, f_in, f_out). In the decoder LSTM, the word embedding dimension is set to 1,000, and the hidden unit dimensions of the top-layer and bottom-layer LSTM are set to 1,000 and 512, respectively. The trade-off parameter γ in multi-task learning is 0.15. The whole system is trained with the Adam optimizer. We set the initial learning rate to 0.0005 and the mini-batch size to 100. The maximum number of training epochs is 30 for cross-entropy and CIDEr-D optimization, respectively. For sequence generation in the inference stage, we adopt the beam search strategy with a beam size of 3.
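The per-class NMS step mentioned above can be sketched as follows (a generic implementation; the `(x1, y1, x2, y2)` box format and threshold value are our assumptions):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.7):
    """Per-class non-maximum suppression: keep the highest-scoring box,
    then drop any remaining box whose IoU with a kept box exceeds thresh."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```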
We construct object and predicate categories for VRD training. Similar to (Lu et al., 2018), we manually expand the original 80 object categories to 413 fine-grained categories by utilizing a list of caption tokens. For example, the object category "person" is expanded to a list of fine-grained categories ["boy", "man", ...]. Then, for all extracted triples whose two objects both appear in the 413-category list, we select the 200 most frequent predicates as our predicate categories.
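The predicate-vocabulary construction reduces to a simple frequency count; the helper below is our own illustration, not the authors' code:

```python
from collections import Counter

def predicate_vocab(triples, object_vocab, k=200):
    """Keep triples whose two objects are both in the fine-grained
    category list, then return the k most frequent predicates."""
    counts = Counter(r for o1, r, o2 in triples
                     if o1 in object_vocab and o2 in object_vocab)
    return [r for r, _ in counts.most_common(k)]
```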

Quantitative Analysis
Model Comparison We compare our models with the following state-of-the-art models: (1) SCST (Rennie et al., 2017) employs an improved policy gradient algorithm that uses its own inference output to normalize the rewards; (2) LSTM-A (Yao et al., 2017) integrates detected image attributes into the CNN-plus-RNN image captioning framework; (3) Up-Down (Anderson et al., 2018) uses both bottom-up and top-down attention mechanisms to focus more on salient object regions; (4) GCN-LSTM (Yao et al., 2018) leverages graph convolutional networks over the detected objects and relations; (5) CAVP (Liu et al., 2018) proposes a context-aware policy network that accounts for visual attention as context for generation; (6) VSUA (Guo et al., 2019) exploits the alignment between words and different categories of graph nodes; (7) SGAE (Yang et al., 2019) utilizes an additional graph encoder to incorporate language inductive bias into the encoder-decoder framework.
Our baseline is built on Up-Down (Anderson et al., 2018). We propose two variants of the final model using different multi-task blocks, namely MT-I and MT-II, shown in Figure 4. We conduct extensive comparisons with the above state-of-the-art techniques on the dataset and perform detailed analyses to demonstrate the impact of different components of our framework. Table 1 lists the results of various single models on the MSCOCO Karpathy split. Our model outperforms the baseline significantly, with CIDEr-D scores improving from 113.5 to 119.0 and from 120.1 to 129.6 in the cross-entropy and CIDEr-D optimization periods, respectively. In addition, the model with MT-II shows an advantage over that with MT-I on SPICE, which suggests that the proposed adaptive visual attention mechanism in multi-task block II is effective.
Table 2 compares our model with three models that also incorporate VRG, plus the baseline model, on the MSCOCO online test server. Our model improves significantly over the baseline (from 120.5 to 126.7 in CIDEr-D) and achieves the best results across all evaluation metrics on c40 (40 reference captions). Figure 5 shows the effect of different weights γ in the multi-task loss.
The results indicate that a weight around 0.15 yields the best performance for both multi-task blocks. Meanwhile, Figure 6 shows an ablation analysis that removes the multi-task caption generation and the graph convolution operation, respectively, to check the effect of these components. The results show that both the graph convolution operation and multi-task learning help improve the quality of the generated captions.
Note that the code of our paper has been made publicly available in the webpage provided in the abstract.
Human evaluation We performed human evaluation with three non-author human subjects, using a five-level Likert scale. For each image and each pair of systems in comparison (MT-I vs. Up-Down, MT-I vs. GCN-LSTM, and MT-I vs. SGAE), we showed the captions generated by the two systems to the human subjects. We asked each subject whether the first caption is: significantly better (2), better (1), equal (0), worse (−1), or significantly worse (−2), compared to the second.
Following (Zhao et al., 2019), we obtain the subjects' ratings for fidelity (does the first caption make fewer mistakes?), informativeness (does the first caption provide a more informative and detailed description?), and fluency (is the first caption more fluent?). For each question asked about an image, we calculate the average of the three subjects' scores. For each pair of models in comparison, we randomly sampled 50 images from the Karpathy test set.
• MT-I vs. Up-Down: For fidelity, MT-I is better or significantly better on 44% of images (where the average of the three subjects' scores is larger than 0.5), equal to Up-Down on 46% of images (the average is in the range [−0.5, 0.5]), and worse or significantly worse on 10% of images (the average is less than −0.5).
For informativeness, MT-I is better or significantly better on 60% of images, equal on 34%, and worse or significantly worse on 6%. For fluency, the numbers are 18%, 72%, and 10%.
• MT-I vs. SGAE: For fidelity, MT-I is better or significantly better on 36% of images, equal to SGAE on 56%, and worse or significantly worse on 8%. For informativeness, the numbers are 30%, 48%, and 22%, respectively. For fluency, the numbers are 6%, 90%, and 4%.

Figure 7 shows several examples, each including an image, a detected caption-guided visual relationship graph, a ground-truth sentence, a generated word sequence, and a learned visual relationship composition. We can see that the proposed model generates more accurate captions, coherent with the visual relationships detected in the image. Consider the upper-middle example: our model extracts a visual relationship graph covering the critical predicates "filled with" and "in front of" for understanding the image, thus producing a comprehensive description. In addition, we observe that the model generates the triple (table, filled with, food), a new composition that does not appear in the training set.

Figure 8 visualizes the effect of our tag sequence generation process. Specifically, we visualize the tag probabilities of the "object", "predicate", and "none" categories at each generation step. Our model successfully learns to identify the correct category at each time step, consistent with the tag of the predicted word. For example, for the generated words "flying over", the probability of the "predicate" category is the highest; likewise, the correct category receives the highest probability for words like "bird" and "water".

Conclusions
This paper presents a novel image captioning architecture that constructs caption-guided visual relationship graphs to introduce beneficial inductive bias and better utilize captions. The representation is further enhanced with the textual and visual features of neighbouring nodes. During generation, the network is regularized to take into account explicit object/predicate constraints with multi-task learning. Extensive experiments on the MSCOCO dataset show that the proposed framework significantly outperforms the baselines, achieving state-of-the-art performance under various evaluation metrics. In the near future, we plan to extend the proposed approach to several other language-vision modeling tasks.