On the Role of Scene Graphs in Image Captioning

Scene graphs represent the semantic information in images, and can help image captioning systems produce more descriptive outputs than using the image alone as context. Recent captioning approaches rely on ad-hoc methods to obtain scene graphs for images. However, such graphs introduce noise, and the effect of parser errors on captioning accuracy is unclear. In this work, we investigate to what extent scene graphs can help image captioning. Our results show that a state-of-the-art scene graph parser can boost performance almost as much as ground truth graphs, indicating that the bottleneck currently lies more in the captioning models than in the scene graph parser.


Introduction
The task of automatically recognizing and describing visual scenes in the real world, commonly referred to as image captioning, is a long-standing problem in computer vision and computational linguistics. Previously proposed methods based on deep neural networks have demonstrated convincing results in this task (Xu et al., 2015; Lu et al., 2018; Anderson et al., 2018; Lu et al., 2017; Fu et al., 2017; Ren et al., 2017), yet they often produce dry and simplistic captions that lack descriptive depth and omit key relations between objects in the scene. Incorporating knowledge of complex visual relations between objects, in the form of scene graphs, has the potential to improve captioning systems beyond current limitations.
Scene graphs, such as those in the Visual Genome dataset (Krishna et al., 2017), can be used to incorporate external knowledge about images. Because of their structured abstraction and greater semantic representation capacity compared to raw image features, they have the potential to improve image captioning, as well as other downstream tasks that rely on visual components. This has led to the development of many scene graph parsing algorithms (Xu et al., 2017; Dai et al., 2017; Yu et al., 2017). Simultaneously, recent work has also aimed at incorporating scene graphs into captioning systems, with promising results (Yao et al., 2018; Xu et al., 2019). However, these previous works still rely on ad-hoc scene graph parsers, raising the question of how captioning systems behave under potential parsing errors.
In this work, we aim to answer the following question: to what degree do scene graphs contribute to the performance of image captioning systems? To answer it, we provide two contributions: 1) we investigate the performance of incorporating scene graphs generated by a state-of-the-art scene graph parser into a well-established image captioning framework (Anderson et al., 2018); and 2) we provide an upper bound on performance through comparative experiments with ground truth graphs. Our results show that scene graphs can be used to boost the performance of image captioning, and that scene graphs generated by a state-of-the-art parser, though still limited in the number of object and relation categories, are not far below the ground-truth graphs in terms of standard image captioning metrics.

Methods
Our architecture, inspired by Anderson et al. (2018) and shown in Figure 1, assumes an off-the-shelf scene graph parser. To improve performance, we also incorporate information from the original image through a set of region features obtained with an object detection model. Note that we experiment with each set of features in isolation in Section 3.1. Given these inputs, our model consists of a scene graph encoder, an LSTM-based attention module, and another LSTM as the decoder.
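As a rough illustration of the pipeline, the sketch below (not the authors' released code; the injected submodules and their interfaces are hypothetical) shows how the three components could be wired for a single decoding step:

import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    """Wiring sketch only: each injected submodule is assumed to expose the interface used below."""

    def __init__(self, graph_encoder, attention_lstm, decoder_lstm):
        super().__init__()
        self.graph_encoder = graph_encoder    # GCN over scene graph nodes
        self.attention_lstm = attention_lstm  # tracks context, produces the attention query
        self.decoder_lstm = decoder_lstm      # attends over features and emits word logits

    def step(self, node_feats, adj, region_feats, prev_word_emb, states):
        # Contextual node vectors from the scene graph encoder.
        graph_feats = self.graph_encoder(node_feats, adj)
        # The Attention LSTM sees mean-pooled features plus the previous word embedding.
        h_att, states = self.attention_lstm(
            region_feats.mean(dim=1), graph_feats.mean(dim=1), prev_word_emb, states)
        # Cascaded attention and word prediction happen inside the decoder module.
        word_logits, states = self.decoder_lstm(h_att, graph_feats, region_feats, states)
        return word_logits, states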

Scene Graph Encoder
The scene graph is represented as a set of node embeddings which are then updated into contextual hidden vectors using a Graph Convolutional Network (GCN; Kipf and Welling, 2017). In particular, we employ the GCN variant proposed by Marcheggiani and Titov (2017), which incorporates edge directions and labels. We treat each relation and each object in the scene graph as a node; nodes are then connected with five different types of edges. Since we assume scene graphs are obtained by a parser, they may contain noise in the form of faulty or nugatory connections. To mitigate the influence of parsing errors, we use edge-wise gating so the network learns to prune such connections. We refer to Marcheggiani and Titov (2017) for details of their GCN architecture.
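To make the edge-wise gating concrete, the following is a simplified sketch of one gated, edge-typed GCN layer in the spirit of Marcheggiani and Titov (2017). It loops over edges for clarity and uses a single scalar gate per edge; it is an illustrative re-implementation under those simplifications, not the authors' code:

import torch
import torch.nn as nn

class GatedGCNLayer(nn.Module):
    def __init__(self, hidden_dim, num_edge_types):
        super().__init__()
        # One message transformation per edge type (direction/label combination).
        self.msg = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_edge_types))
        # Scalar gate per edge, so noisy parser edges can be down-weighted.
        self.gate = nn.ModuleList(nn.Linear(hidden_dim, 1) for _ in range(num_edge_types))

    def forward(self, h, edges):
        # h: (num_nodes, hidden_dim); edges: list of (src, dst, edge_type) index triples.
        out = torch.zeros_like(h)
        for src, dst, etype in edges:
            message = self.msg[etype](h[src])
            g = torch.sigmoid(self.gate[etype](h[src]))  # edge-wise gate in [0, 1]
            out[dst] = out[dst] + g * message
        return torch.relu(out)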

Attention LSTM
The Attention LSTM keeps track of contextual information from the inputs and incorporates information from the decoder. At each time step t, the Attention LSTM takes in contextual information by concatenating the previous hidden state of the Decoder LSTM, the mean-pooled region-level image features, the mean-pooled scene graph node features from the GCN, and the representation of the previously generated word:

x^1_t = [h^2_{t-1} ; \bar{v} ; \bar{f} ; W_e u_t],

where W_e is the word embedding matrix for vocabulary Σ, u_t is the one-hot encoding of the word at time step t, and \bar{v} and \bar{f} are the mean-pooled image and scene graph features. Given the hidden state of the Attention LSTM h^1_t, we generate cascaded attention features: we first attend over the scene graph features, and then concatenate the attention-weighted scene graph features with the hidden state of the Attention LSTM to attend over the region-level image features. Here, we only show the second attention step over region-level image features, as the two steps are identical procedures except for the input:

a_{i,t} = w_a^\top \tanh(W_v v_i + W_h [h^1_t ; \hat{f}_t]),   \alpha_t = \mathrm{softmax}(a_t),   \hat{v}_t = \sum_{i=1}^{K} \alpha_{i,t} v_i,

where W_v, W_h and w_a are learnable weights; \hat{v}_t and \hat{f}_t are the attention-weighted image features and scene graph features, respectively.
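The cascaded attention can be sketched in code as follows, assuming for simplicity that graph node features and region features share the same dimensionality; the additive attention form mirrors the equations above, but the exact parameterization is an assumption:

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, feat_dim, query_dim, attn_dim):
        super().__init__()
        self.w_f = nn.Linear(feat_dim, attn_dim, bias=False)
        self.w_q = nn.Linear(query_dim, attn_dim, bias=False)
        self.w_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, feats, query):
        # feats: (K, feat_dim); query: (query_dim,)
        scores = self.w_a(torch.tanh(self.w_f(feats) + self.w_q(query)))  # (K, 1)
        alpha = torch.softmax(scores, dim=0)                              # attention weights
        return (alpha * feats).sum(dim=0)                                 # attention-weighted feature

class CascadedAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.graph_attn = AdditiveAttention(feat_dim, hidden_dim, attn_dim)
        # The second attention is queried by the concatenation [h_att ; attended graph features].
        self.image_attn = AdditiveAttention(feat_dim, hidden_dim + feat_dim, attn_dim)

    def forward(self, h_att, graph_feats, region_feats):
        f_hat = self.graph_attn(graph_feats, h_att)
        v_hat = self.image_attn(region_feats, torch.cat([h_att, f_hat], dim=-1))
        return f_hat, v_hat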

Decoder LSTM
The inputs to the Decoder LSTM consist of the previous hidden state from the Attention LSTM layer, the attention-weighted scene graph node features, and the attention-weighted image features.
Using the notation y_{1:T} to refer to a sequence of words (y_1, ..., y_T), at each time step t the conditional distribution over possible output words is given by:

p(y_t | y_{1:t-1}) = \mathrm{softmax}(W_p h^2_t + b_p),

where W_p ∈ R^{|Σ|×H} and b_p ∈ R^{|Σ|} are learned weights and biases, and h^2_t is the hidden state of the Decoder LSTM.
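A corresponding sketch of the decoder step; the use of an LSTMCell and the dimensions are illustrative assumptions:

import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, hidden_dim, feat_dim, vocab_size):
        super().__init__()
        self.lstm = nn.LSTMCell(hidden_dim + 2 * feat_dim, hidden_dim)
        self.w_p = nn.Linear(hidden_dim, vocab_size)  # W_p and b_p from the equation above

    def forward(self, h_att, f_hat, v_hat, state):
        # h_att: (B, hidden_dim); f_hat, v_hat: (B, feat_dim); state: (h, c) of the Decoder LSTM.
        x = torch.cat([h_att, f_hat, v_hat], dim=-1)
        h_dec, c_dec = self.lstm(x, state)
        log_probs = torch.log_softmax(self.w_p(h_dec), dim=-1)  # log p(y_t | y_{1:t-1})
        return log_probs, (h_dec, c_dec)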

Training and Inference
Given a target ground truth sequence y * 1:T and a captioning model with parameters θ, we minimize the standard cross entropy loss. At inference time, we use beam search with a beam size of 5 and apply length normalization (Wu et al., 2016).
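For concreteness, a minimal sketch of the objective and of one simple length-normalization variant; the penalty of Wu et al. (2016) is tunable, and the per-token average used below is an illustrative assumption rather than the paper's exact choice:

import torch.nn.functional as F

def caption_loss(logits, targets, pad_idx=0):
    # Standard cross entropy over the target caption, ignoring padding.
    # logits: (batch, T, vocab); targets: (batch, T) word indices.
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_idx)

def length_normalized_score(token_log_probs):
    # Re-rank a finished beam hypothesis by its per-token log-probability.
    return sum(token_log_probs) / len(token_log_probs)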

Experiments
Datasets MS-COCO (Lin et al., 2014) is the most popular benchmark for image captioning; it contains 82,783 training images and 40,504 validation images, with five human-annotated descriptions per image. As the annotations of the official test set are not publicly available, we follow the widely used Karpathy split (Karpathy and Fei-Fei, 2017) and take 113,287 images for training, 5K for validation, and 5K for testing. We convert all descriptions in the training set to lower case and discard rare words which occur fewer than five times, resulting in a vocabulary of 10,201 unique words. For the oracle experiments, we take the subset of MS-COCO that intersects with Visual Genome (Krishna et al., 2017) to obtain ground truth scene graphs. The resulting dataset (henceforth MS-COCO-GT) contains 33,569 training, 2,108 validation, and 2,116 test images.
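A minimal sketch of this vocabulary construction; the special tokens are our assumption, as the paper only specifies lower-casing and the frequency threshold:

from collections import Counter

def build_vocab(captions, min_count=5):
    # captions: iterable of caption strings from the training split.
    counts = Counter(w for c in captions for w in c.lower().split())
    vocab = ["<pad>", "<bos>", "<eos>", "<unk>"]
    vocab += sorted(w for w, n in counts.items() if n >= min_count)
    return {w: i for i, w in enumerate(vocab)}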
Preprocessing The scene graphs are obtained with a state-of-the-art parser: a pre-trained Factorizable-Net trained on the MSDN split, a cleaner version of Visual Genome that consists of 150 object categories and 50 relationship categories (note that the MSDN split might contain training instances that overlap with the Karpathy split). The number of object and relationship categories is much smaller than the actual number of objects and relationships in the Visual Genome dataset. All predicted objects are associated with a set of bounding box coordinates. The region-level image features are obtained from Faster-RCNN (Ren et al., 2017), which is also trained on Visual Genome, using 1,600 object classes and 400 attribute classes; these regions are different from those in the scene graph, so to help the model learn to match regions, the inputs to attention include bounding box coordinates.
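For illustration, the parser output can be thought of as the following container; the field names are our assumptions and do not reflect Factorizable-Net's actual output format:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneObject:
    category: str                            # one of the 150 object categories
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) bounding box

@dataclass
class SceneRelation:
    subject: int     # index into objects
    predicate: str   # one of the 50 relationship categories
    object: int      # index into objects

@dataclass
class SceneGraph:
    objects: List[SceneObject]
    relations: List[SceneRelation]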
Implementation Our models are trained with the AdaMax optimizer (Kingma and Ba, 2015). We set the initial learning rate to 0.001 and the minibatch size to 256. We train for at most 100 epochs with early stopping: we stop training if the CIDEr score does not improve for 10 epochs, and we reduce the learning rate by 20 percent if the CIDEr score does not improve for 5 epochs. During inference, we set the beam width to 5. Each word in the sentence is represented as a one-hot vector, and each word embedding is a 1,024-dimensional vector. For each image, we have K = 36 region features with bounding box coordinates from Faster-RCNN. Each region-level image feature is a 2,048-dimensional vector, and we concatenate the bounding box coordinates to each region-level image feature. The dimension of the hidden layer in each LSTM and GCN layer is set to 1,024. We use two GCN layers in all our experiments.

Table 1: Results on the full MS-COCO dataset. "I", "G" and "I+G" correspond to models using image features only, scene graphs only and both, respectively. "B", "M", "R", "C" and "S" correspond to BLEU, METEOR, ROUGE, CIDEr and SPICE (higher is better).
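A small sketch of the region-feature preparation described in the implementation details above; normalizing the box coordinates by the image size is our assumption, as the paper only states that coordinates are concatenated:

import torch

def prepare_region_features(features, boxes, image_w, image_h):
    # features: (36, 2048) Faster-RCNN region features; boxes: (36, 4) as (x1, y1, x2, y2) in pixels.
    scale = torch.tensor([image_w, image_h, image_w, image_h], dtype=features.dtype)
    norm_boxes = boxes.to(features.dtype) / scale        # coordinates in [0, 1]
    return torch.cat([features, norm_boxes], dim=1)      # (36, 2052) inputs to attention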
Evaluation We employ standard automatic evaluation metrics, including BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016), and use the coco-caption tool to obtain the scores.

Results Table 1 shows the performance of our models against baseline models whose architecture is based on the Bottom-up Top-down Attention model (Anderson et al., 2018). Overall, our proposed model incorporating scene graph features achieves better results across all evaluation metrics, compared to image features only or graph features only. The results show that our model can learn to exploit the relational information in scene graphs and effectively integrate it with image features. Moreover, the results also demonstrate the effectiveness of edge-wise gating in pruning noisy scene graph features.
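The metrics above can be computed with the coco-caption toolkit; a hedged sketch of the commonly used pycocoevalcap interface is shown below (consult the tool's documentation for the exact tokenization it expects):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

def score_captions(references, hypotheses):
    # references: {image_id: [ref caption, ...]}; hypotheses: {image_id: [generated caption]}.
    bleu, _ = Bleu(4).compute_score(references, hypotheses)   # list of BLEU-1..4 scores
    cider, _ = Cider().compute_score(references, hypotheses)
    return {"BLEU-4": bleu[3], "CIDEr": cider}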

Quantitative Results and Analysis
We also conduct experiments comparing Factorizable-Net generated scene graphs with ground-truth scene graphs, as shown in Table 2. As expected, the results show that the performance is better with the ground-truth scene graphs. Notably, the SPICE score, which measures the semantic correlation between generated captions and ground truth captions, improved by 2.1%, since considerably more types of objects, relations and attributes are present in the ground-truth scene graphs. Overall, the results show the potential of incorporating automatically generated scene graph features into the captioning system, and we argue that with a better scene graph parser trained on more object, relation and attribute categories, the captioning system should see additional improvements.

Figure 2: Captions generated by the different approaches. First example — GT: "a cop riding a motorcycle next to a white van"; Image: "a police officer riding a motorcycle on a city street"; Graph: "a man riding on the back of a motorcycle down a street"; I + G: "a man riding a motorcycle down a city street in front of a white bus". Second example — GT: "the baby is playing with the phone in the park"; Image: "a little girl is holding a cell phone"; Graph: "a woman sitting on a bench with a cell phone"; I + G: "a little girl is holding a cell phone in a field of grass in a park".

Table 2: Results on the MS-COCO-GT dataset. "G (pred)" refers to the parsed scene graphs from Factorizable-Net while "G (truth)" corresponds to the ground truth graphs obtained from Visual Genome.
Compared to a recent image captioning approach using scene-graph features (Li and Jiang, 2019), our results are superior, demonstrating the effectiveness of our model. Compared to a state-of-the-art image captioning system (Yu et al., 2019), our scores are inferior, as we do not apply scheduled sampling, reinforcement learning, transformer cells or ensemble predictions, which have all been shown to improve scores significantly. However, our method of incorporating scene-graph features is orthogonal to these state-of-the-art methods.

Qualitative Results and Analysis
Figure 2 shows captions generated by the different approaches trained on the full Karpathy split of the MS-COCO dataset. All approaches produce sensible captions describing the image content. However, our approach of combining scene graph features and image features generates more descriptive captions that more closely narrate the underlying relations in the image. In the first example, our model correctly predicts that the motorcycle is in front of the white van, while the image-only model misses this relational detail. On the other hand, graph features alone sometimes introduce noise: in the second example, the graph-only model mistakes the little girl in a park for a woman on a bench, whereas the image features in our model help disambiguate the faulty graph features.

Conclusion
We have presented a novel image captioning framework that incorporates scene graph features extracted by the state-of-the-art scene graph parser Factorizable-Net. In particular, we investigate the problem of integrating relation-aware scene graph features, encoded by graph convolutions, with region-level image features to boost image captioning performance. Extensive experiments conducted on the MS-COCO image captioning dataset have shown the effectiveness of our method. In the future, we plan to experiment with building an end-to-end multi-task framework that jointly predicts visual relations and captions.