Bridge the Gap: High-level Semantic Planning for Image Captioning

Recent image captioning models have made much progress in exploring multi-modal interaction, for example through attention mechanisms. Though these mechanisms boost the interaction, two gaps remain between the visual and language domains: (1) the gap between the visual features and textual semantics, and (2) the gap between the unordered visual features and the ordered text. To bridge these gaps, we propose a high-level semantic planning (HSP) mechanism that incorporates both semantic reconstruction and explicit order planning. We integrate the planning mechanism into an attention based caption model and propose the High-level Semantic PLanning based Attention Network (HS-PLAN). First, an attention based reconstruction module reconstructs the visual features with high-level semantic information. Then a pointer network serializes the features and produces an explicit order plan to guide the generation. Experiments on MS COCO show that our model outperforms previous methods and achieves state-of-the-art performance with a CIDEr-D score of 133.4%.


Introduction
Image captioning, which aims to generate textual descriptions of images, is a significant task in both computer vision and natural language processing. It requires not only recognizing and understanding the objects and attributes in the given image but also verbalizing them in natural language in a proper order.
Previous neural approaches follow the encoder-decoder paradigm, which uses a Convolutional Neural Network (CNN) to encode the input image and a Recurrent Neural Network (RNN) as the decoder to generate the textual description (as shown in Figure 1(a)) (Vinyals et al., 2015; Gan et al., 2017a; Chen et al., 2018; Gan et al., 2017b; Lu et al., 2017; Yang et al., 2016). To explore the multi-modal interaction between visual content and textual description, some recent methods (Xu et al., 2015; Anderson et al., 2018) apply a visual attention mechanism, which learns to selectively attend to the image features extracted by the encoder when generating each word. For better interaction, a large number of works focus on boosting the performance of neural models with improved attention mechanisms (Huang et al., 2019a; Huang et al., 2019b; Pan et al., 2020). However, there are still two gaps between the visual and language domains that visual attention does not address: (1) the gap between the visual features and textual semantics, and (2) the gap between the unordered visual features and the ordered text. First, it is hard for the decoder to associate each word in the caption with the features without a high-level semantic understanding. Second, with visual attention these neural models implicitly select which features to focus on at each decoding step without any explicit guidance or external supervision, which makes the generation process hard to control and hard to explain.
Several studies have focused on alleviating these two problems. Some of them apply semantic attention to leverage high-level semantic information and narrow the first gap (Fang et al., 2015; You et al., 2016). Other work explicitly represents high-level semantic concepts and incorporates them into the CNN-RNN approach. Li et al. (2019) propose the Entangled Attention based on the Transformer architecture (Vaswani et al., 2017) to explore visual and semantic information simultaneously. These methods focus on leveraging high-level semantic information to enhance neural models but pay little attention to the relation between the detected semantic information and the extracted visual features. For the second gap, Cornia et al. (2019) propose a controllable framework that generates captions grounded on a sequence of image regions ordered by a sorting network. Though it achieves controllability to some extent, the method suffers from inflexibility and from error propagation between sorting and generation.
Figure 1: Comparison between the caption model without planning (a) and the planning based model (b), where the visual features are first reconstructed with high-level concepts and then serialized with order planning. Thus the visual features can be grounded to the words in the caption (the dotted lines).
To address the issues mentioned above, we propose a High-level Semantic PLanning based Attention Network (HS-PLAN) that incorporates both high-level semantic reconstruction and explicit order planning, as shown in Figure 1.
(1) To narrow the gap between the visual features and textual semantics, an attention based reconstruction module re-represents the visual feature of each image region with the corresponding high-level concepts predicted by the object detector and attribute classifier.
(2) To bridge the gap between the unordered visual features and the ordered textual sentences, we implement an attention based pointer network that makes an explicit order plan to guide the caption generation.
After the planning stage, the planned features are fed to an attention based caption model. The caption model first applies an order-sensitive encoder to further encode the planned features and to learn the absolute and relative order information of the features with position encoding. A visual attention based decoder then generates the textual description of the input image, guided by the determined plan.
We conduct experiments on the large benchmark dataset MS COCO (Lin et al., 2014) to evaluate our proposed model. The results show that our model outperforms all the baselines and achieves state-of-the-art performance: a CIDEr-D score of 133.4% with a single model and 134.8% with an ensemble of four models on the "Karpathy" test split. The qualitative human evaluation also shows that our model generates more fluent, faithful and coherent captions.

Related Work
Beginning with Show and Tell (Vinyals et al., 2015), a number of neural encoder-decoder models have been proposed for image captioning. They use CNN-RNN based frameworks that encode images into features and then translate the image features into sentences, and they achieve significant improvements on captioning. Recently, attention mechanisms have been widely used in image captioning to provide guidance for choosing the most relevant image region when generating each word of the sentence (Xu et al., 2015; Anderson et al., 2018; Huang et al., 2019a; Huang et al., 2019b; Pan et al., 2020). Specifically, Huang et al. (2019a) propose an enhanced attention mechanism that determines the relevance of attention results for better multi-modal interaction. Moreover, Rennie et al. (2017) apply reinforcement learning with a self-critical reward for a more efficient training process. However, these methods are limited to generating the words of a sentence directly from image features, and it is still hard for them to bridge the gaps between the visual and language domains.
Previous captioning approaches address these problems along two different dimensions. Some focus on a better understanding of images, or a better representation of image features, with high-level semantic information (Fang et al., 2015; You et al., 2016). Specifically, one approach explicitly represents high-level semantic concepts and incorporates them into the CNN-RNN approach. Another leverages scene graphs for a more meaningful semantic representation, transferring the inductive bias from the pure language domain to the vision-language domain. Li et al. (2019) propose the Entangled Attention based on the Transformer architecture (Vaswani et al., 2017) to explore visual and semantic information simultaneously. Others concentrate on the controllability of the generation stage. Cornia et al. (2019) propose a controllable framework that generates captions grounded on a sequence of image regions ordered by a sorting network. In contrast, our method is more flexible: it makes an explicit order plan that guides the generation, instead of generating words step by step according to a control signal in a fixed order.

Methodology
In this section, we devise HS-PLAN, which models explicit high-level planning to guide image captioning. The goal of a captioning model is to generate a textual sentence $Y = \{y_1, y_2, \dots, y_T\}$ for the given image $I$. Traditional encoder-decoder models formulate the problem as a two-stage process: feature extraction and caption generation. Our model further decomposes the problem into a three-stage process $I \rightarrow V \rightarrow Z \rightarrow Y$, where $V = \{V_1, V_2, \dots, V_n\}$ represents the visual features captured by the CNN-based encoder, $Z$ is the explicit plan and $V \rightarrow Z$ represents the planning stage. The architecture of HS-PLAN is shown in Figure 2. After extracting visual features from a given image, our model first applies a semantic reconstruction module that integrates textual semantic information into the visual features to re-represent them. An attention-based pointer network then makes an explicit order plan to guide the caption generation. After the planning stage, an order-sensitive encoder further encodes the features, and the decoder uses them to generate the textual description, guided by the determined plan.

Semantic Reconstruction
Given the image, the visual features $v \in \mathbb{R}^{n \times d_v}$ are first extracted by a pre-trained Faster-RCNN (Ren et al., 2015), which also serves as the object detector that determines the object in each image region. We further use an attribute classifier to detect the possible attributes of each object. An attention-based reconstruction module (Figure 2(A)) then integrates textual semantic information into the visual features to narrow the gap between the textual and visual semantics. The textual description of each feature $v_i$ is represented as a bag-of-words $w = \{o_1, o_2, o_3, \dots, a_1, a_2, a_3, \dots\}$ containing the possible objects and attributes, which are first embedded as word vectors $w \in \mathbb{R}^{m \times d_w}$, where $m$ is the size of the bag-of-words and $d_w$ is the dimension of the word embedding. We then design a vision-to-language (V2L) attention that estimates the similarity between the visual feature and the word embeddings in order to reconstruct the feature. In this attention, $w_{i,j}$ is the $j$-th word in the bag-of-words, and $W_v \in \mathbb{R}^{d_w \times d_v}$ and $b_v \in \mathbb{R}^{d_w}$ are parameters. After a linear layer, with parameters $W_w \in \mathbb{R}^{d_w \times d_w}$ and $b_w \in \mathbb{R}^{d_w}$, and layer normalization, the feature is re-represented by integrating the information of the bag-of-words into the visual feature. Finally, each visual feature is reconstructed into a more informative one, which we refer to as a semantic feature.
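The reconstruction step can be illustrated with a minimal sketch in plain Python. It assumes dot-product scoring between the projected visual feature and each word embedding, and a residual combination before layer normalization; neither choice is confirmed by the text, and the function names (`linear`, `v2l_reconstruct`) are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, x, b):
    # y = W x + b, with W given as a list of rows
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def v2l_reconstruct(v_i, words, W_v, b_v, W_w, b_w):
    """Re-represent one visual feature using its bag-of-words embeddings."""
    q = linear(W_v, v_i, b_v)                      # project d_v -> d_w
    # Assumption: similarity is a plain dot product with each word embedding.
    scores = [sum(a * b for a, b in zip(q, w)) for w in words]
    alpha = softmax(scores)                        # V2L attention weights
    ctx = [sum(a * w[k] for a, w in zip(alpha, words)) for k in range(len(q))]
    fused = linear(W_w, ctx, b_w)                  # the linear layer
    # Assumption: residual sum with the projected feature, then LayerNorm.
    return layer_norm([f + qi for f, qi in zip(fused, q)]), alpha

# Toy example: d_v = 3, d_w = 2, bag-of-words with m = 2 embeddings.
semantic, weights = v2l_reconstruct(
    [1.0, 0.0, 2.0],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
)
```

In this toy case the second word receives the larger weight because the projected feature is closer to its embedding.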

Order Planning
After reconstructing the features, an attention based pointer network makes an explicit order plan to guide the captioning. As shown in Figure 2, the pointer network is a multi-head attention based encoder-decoder architecture with a dedicated pointer attention module that serializes the semantic features. First, an order-insensitive multi-head attention based encoder encodes the features, using a multi-head self-attention layer to capture the dependency between different image regions. Here Multihead denotes multi-head attention, which takes queries, keys and values as inputs and consists of $h$ parallel scaled dot-product attentions operating in different sub-spaces separately. A multi-head attention based decoder is then used to decode and predict the order of the features. The decoder is order-sensitive: it implements a masked multi-head self-attention layer that captures the dependency on the previously predicted features in order to predict the next feature. The output of the decoder is then fed into the pointer attention, which points to the input features one by one to serialize them.
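For reference, the standard multi-head attention of Vaswani et al. (2017), which matches the description above, can be written as follows. The symbol names ($W_j^{Q}$, $W_j^{K}$, $W_j^{V}$, $W^{O}$, $d_k$) are taken from the Transformer formulation and are assumed, not confirmed, to match the notation of Eq. (6); $V$ here denotes the values, not the visual features.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,\\
\mathrm{head}_j &= \mathrm{Attention}\!\left(QW_j^{Q},\, KW_j^{K},\, VW_j^{V}\right), \quad j = 1, \dots, h,\\
\mathrm{Multihead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}.
\end{aligned}
```

For self-attention in the encoder, the queries, keys and values are all the semantic features.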
A plan is a sequence of features $Z = \{z_1, z_2, \dots, z_n\}$ in a certain order, where $n$ is the number of features extracted from the input image. The probability $P(z_t = V_i \mid z_{<t}, V)$ is modeled as an attention over the input features, where $W_p \in \mathbb{R}^{d_w \times d_w}$ are the parameters of the pointer attention and $h_t$ is the hidden state at the $t$-th decoding step of the pointer decoder. Following the obtained probabilities, the unordered features are finally serialized into a sequence $v_p \in \mathbb{R}^{n \times d_w}$ for caption generation.
The detected image regions can be grounded to the words in the caption according to the detected objects, as illustrated in Figure 1. Based on this correspondence, we assume that the order in which the words appear in the caption is the order of the corresponding features. Following this rule, we obtain the oracle order of the features by aligning the detected objects of the different image regions to the words in the caption. With the oracle order plan, we can train the order-planning stage in a supervised manner.
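Both ideas in this subsection can be sketched in plain Python. The sketch rests on assumptions that the text does not confirm: the pointer score is taken to be the bilinear form $h_t^{\top} W_p v_i$, already selected features are masked out, the alignment is exact string matching, and regions whose object does not occur in the caption are placed at the end. The names `oracle_order` and `pointer_step` are hypothetical.

```python
import math

def oracle_order(region_objects, caption_tokens):
    """Order region indices by the first caption position of their object word."""
    first_pos = {}
    for pos, tok in enumerate(caption_tokens):
        first_pos.setdefault(tok, pos)
    unmatched = len(caption_tokens)  # assumption: unmatched regions go last
    keyed = [(first_pos.get(obj, unmatched), idx)
             for idx, obj in enumerate(region_objects)]
    return [idx for _, idx in sorted(keyed)]

def pointer_step(h_t, features, W_p, selected):
    """One pointer distribution over the features not yet in the plan."""
    # proj = h_t^T W_p (W_p given as a list of rows)
    proj = [sum(h * w for h, w in zip(h_t, col)) for col in zip(*W_p)]
    scores = [
        -math.inf if i in selected else sum(p * x for p, x in zip(proj, v))
        for i, v in enumerate(features)
    ]
    m = max(scores)  # requires at least one unselected feature
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

order = oracle_order(["dog", "frisbee", "grass", "man"],
                     "a man throws a frisbee to a dog".split())
probs = pointer_step([1.0, 0.0], [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]],
                     [[1.0, 0.0], [0.0, 1.0]], selected={1})
```

In the toy caption the regions are ordered man, frisbee, dog, and then the unmatched grass region. The oracle order produced this way is what the pointer distribution would be trained to predict step by step.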

Caption Model
After the high-level planning, a caption model generates the textual description guided by the determined plan. An order-sensitive encoder further encodes the planned features and captures their order information, and a decoder then generates the caption of the given image.

Order-sensitive Encoder
Since the semantic features have been serialized by the order planning, a position encoding module is first used to model the relative and absolute order information of the sequence of features. Inspired by the Transformer, we add a position embedding to each semantic feature, calculated as $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$, where $pos$ is the position of the feature in the sequence and $i$ indexes the dimension. A multi-head attention based encoder then further encodes and represents the semantic features. The encoder is a stack of $N$ identical blocks, each of which consists of a multi-head self-attention layer and a position-wise feedforward layer. Multihead is calculated in the same way as in Eq. (6), FFN is the position-wise feedforward layer, which consists of two linear transformations with a GeLU activation (Hendrycks and Gimpel, 2016) in between, and LayerNorm represents layer normalization.
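The position encoding can be sketched in a few lines of plain Python. The function name `position_encoding` is illustrative, and the sine term used for the even dimensions follows the standard Transformer formulation that the cosine term above belongs to.

```python
import math

def position_encoding(n_positions, d_model):
    """Sinusoidal position encoding: sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(n_positions)]
    for pos in range(n_positions):
        for k in range(0, d_model, 2):          # k plays the role of 2i
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)
    return pe

# One row per planned feature; each row is added to the semantic feature
# at that position of the plan.
pe = position_encoding(n_positions=3, d_model=4)
```

Because the encoding depends only on the position in the plan, it carries useful information only when the features have been put into a meaningful order first.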

Decoder
The caption decoder of HS-PLAN follows the design of the Transformer decoder and generates the target caption $Y$ from the encoded semantic features $v_e$. Inspired by AoANet (Huang et al., 2019a), we further add the attention-on-attention (AoA) module to the Transformer decoder; it determines the relevance between the attention result and the query, which improves the attention module. At each decoding step $t$, a masked multi-head self-attention first captures the dependency among the decoder inputs, that is, the embeddings of the previously predicted outputs $y_{<t}$, and produces the hidden state $h_t$. A multi-head attention layer modified by the AoA module then takes the hidden state $h_t$ and the encoder output $v_e$ and computes the context vector, where $W_g, W_i \in \mathbb{R}^{d_w \times d_w}$ and $b_g, b_i \in \mathbb{R}^{d_w}$ are parameters and Multihead is calculated in the same way as in Eq. (6). With the context vector, the conditional probability of the output word $y_t$ is calculated, where $W_d \in \mathbb{R}^{d_w \times D}$ are parameters and $D$ is the vocabulary size.
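As an illustration of the attention-on-attention idea, the sketch below follows the formulation of AoANet (Huang et al., 2019a): an information vector and a sigmoid gate are both computed from the query and the attention result, and the output is their element-wise product. The separate query and value weight matrices are AoANet's parameterization; since only $W_g$, $W_i$, $b_g$ and $b_i$ are listed here, the exact form used in HS-PLAN may differ. All function names are hypothetical.

```python
import math

def linear2(W_q, W_v, b, q, v_hat):
    # W_q q + W_v v_hat + b, with matrices given as lists of rows
    return [
        sum(wq * x for wq, x in zip(rq, q))
        + sum(wv * y for wv, y in zip(rv, v_hat))
        + bk
        for rq, rv, bk in zip(W_q, W_v, b)
    ]

def attention_on_attention(q, v_hat, info_params, gate_params):
    """Gate the attention result v_hat by its relevance to the query q."""
    info = linear2(*info_params, q, v_hat)                     # information vector
    gate = [1.0 / (1.0 + math.exp(-x))
            for x in linear2(*gate_params, q, v_hat)]          # sigmoid gate
    return [g * i for g, i in zip(gate, info)]

identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, identity, [0.0, 0.0])
context = attention_on_attention([1.0, -1.0], [1.0, 1.0], params, params)
```

Here `q` stands in for the hidden state $h_t$ and `v_hat` for the multi-head attention result over $v_e$.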

Pretraining
We pretrain the pointer network on MS-COCO with the oracle plan described in Section 3.1.2 by minimizing the negative log-likelihood of the oracle order plan over $\mathcal{D}$, where $\mathcal{D}$ represents all the training samples, each consisting of the input features $V$, the oracle plan $Z$ and the target caption $Y$. We then pretrain the caption model with the oracle plan by optimizing the cross entropy (XE) loss, summed over the $T$ words of the ground truth caption.
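Both pretraining objectives are negative log-likelihoods of a target sequence, so a single helper covers them in a minimal sketch. The name `sequence_nll` and the list-of-distributions input format are assumptions made for illustration.

```python
import math

def sequence_nll(step_probs, targets):
    """Negative log-likelihood of a target sequence.

    step_probs[t] is the predicted distribution at step t (over features
    for the plan, over the vocabulary for the caption); targets[t] is the
    index of the oracle feature or ground-truth word at that step.
    """
    return -sum(math.log(p[t]) for p, t in zip(step_probs, targets))

# Two decoding steps, with the target index at each step.
loss = sequence_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

For the pointer network the targets are the oracle order of the features; for the caption model they are the word ids of the ground truth caption.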

Training
After the pretraining, we train our model end-to-end, jointly learning planning and captioning by aggregating the losses of the two stages, with a hyperparameter $\lambda$ that balances them. We then follow previous work and directly optimize the non-differentiable metric with Self-Critical Sequence Training (SCST) (Rennie et al., 2017), using the CIDEr (Vedantam et al., 2015) score function $r$ as the reward.
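The self-critical objective of Rennie et al. (2017) can be sketched as a surrogate loss for one sampled caption, with the reward of the greedily decoded caption as the baseline. The function name `scst_loss` and the scalar inputs are illustrative; in practice the rewards would come from a CIDEr scorer and the log-probabilities from the decoder.

```python
def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Surrogate loss whose gradient is the self-critical policy gradient.

    sample_logprobs: log-probabilities of the words in the sampled caption.
    sample_reward:   reward r of the sampled caption (for example CIDEr).
    greedy_reward:   reward r of the greedy caption, used as the baseline.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)

loss = scst_loss([-0.5, -1.0], sample_reward=1.2, greedy_reward=1.0)
```

A sampled caption that beats the greedy one has a positive advantage, so minimizing the loss raises the probability of its words; a worse sample is pushed down.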

Dataset and Metrics
We evaluate our proposed model on the popular benchmark dataset MS-COCO (Lin et al., 2014), which contains 123,287 images, each labeled with 5 captions. We use the offline "Karpathy" data split (Karpathy and Li, 2015) for the performance comparisons, where 5,000 images are used for validation, 5,000 images for testing and the rest for training. Following previous work, we use five standard automatic evaluation metrics: CIDEr-D (Vedantam et al., 2015), BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin, 2004) and SPICE (Anderson et al., 2016). We also conduct qualitative human evaluations to further assess the quality of the generated captions.

Implementation Details
We use Faster-RCNN in conjunction with ResNet-101, pretrained on ImageNet (Deng et al., 2009), to extract visual features from images, similarly to Anderson et al. (2018). We further use the Faster-RCNN as the object detector to detect the objects in different image regions and obtain the textual description of each object, and an attribute classifier to obtain the attributes of the objects, which are used to reconstruct the features. The dimension of the original vectors is $d_v = 2048$, and we project them to a new space with dimension $d_w = 1024$, which is also the embedding size and the hidden size of the pointer network and the caption model. The pointer network contains a 2-layer Transformer encoder and a 2-layer Transformer decoder, and the number of heads is $h = 8$. During the pretraining stage, we train the pointer network for 20 epochs and the caption model with the oracle plan for 20 epochs. During the training stage, we train the whole model jointly with the joint loss $L$ for 20 epochs with a mini-batch size of 10 and $\lambda = 0.3$. The learning rate is 2e-4 with the Adam optimizer (Kingma and Ba, 2015). We then optimize the CIDEr-D score with SCST for another 15 epochs.

Baselines
We compare our model with the following strong baselines: LSTM (Vinyals et al., 2015), which uses a CNN to encode the image and an LSTM-based decoder to generate the caption; SCST (Rennie et al., 2017), which is the first to use SCST to directly optimize the evaluation metrics; Up-Down (Anderson et al., 2018), which proposes the Bottom-Up and Top-Down attention mechanism to identify selective spatial regions; GCN-LSTM (Yao et al., 2018), which encodes the relationships between the objects in the image into feature vectors; ETA (Li et al., 2019), which proposes the Entangled Attention to explore visual and semantic information simultaneously; SGAE, which introduces auto-encoding scene graphs into the caption model; AAT (Huang et al., 2019b), which proposes an Adaptive Attention Time to align the source and the target adaptively; and AoANet (Huang et al., 2019a), which proposes an Attention on Attention module to improve the multi-head attention based caption model.

Overall Results
The performance of the baselines and our proposed model on the COCO Karpathy test split is shown in Table 1. For a fair comparison, we report the results of each model optimized with both the cross entropy loss and the CIDEr-D score, and we show the results for single models and for ensemble/fused models separately. Our proposed model outperforms all the baselines on all the automatic evaluation metrics with both XE loss training and CIDEr-D score optimization, achieving state-of-the-art performance. Specifically, our model achieves a CIDEr-D score of 121.8% with XE loss training and 133.4% with CIDEr-D optimization, a significant improvement of 3.6% over the previous best model, AoANet. With an ensemble of four models, HS-PLAN further achieves a CIDEr-D score of 134.8%. The results demonstrate that the proposed high-level semantic planning improves the performance of the image captioning model.

Human Evaluation
To further evaluate the quality of the captions generated by our model, we conduct a qualitative human evaluation on three aspects: Fluency, which measures whether the caption is fluent and free of grammatical errors; Faithfulness, which measures whether the caption is faithful to the given image and mentions an appropriate number of objects (too many or too few is penalized); and Coherence, which measures whether the generated caption is logically coherent and describes the image in a proper order. For the pair-wise comparison, we randomly select 100 images with captions generated by our model and three strong baselines. We invite ten knowledgeable annotators to give a preference (win, lose or tie) for each pair of texts (ours vs. a baseline, 600 pairs in total). The results reported in Table 2 show that HS-PLAN outperforms the baselines on all three aspects, which further demonstrates the effectiveness of the proposed high-level semantic planning method. The improvement is especially significant on Faithfulness and Coherence, illustrating that the high-level semantic plan improves the quality of the generated captions.

Ablation Study
To further evaluate the effectiveness of the proposed high-level semantic planning method, we conduct an ablation study comparing different settings of HS-PLAN. The results are reported in Table 3. We find that:
(1) The comparison between the models with and without semantic reconstruction shows that the semantic reconstruction module substantially improves the performance of the caption model, which supports the claim that semantic reconstruction narrows the gap between the visual features and textual words.
(2) Without position encoding, order planning does not lead to obvious improvements, since the encoder is still order-insensitive and cannot learn the order information. Without order planning, the position encoding causes the performance to decrease. This might be because position encoding introduces and propagates errors from the unordered features.
(3) However, order planning combined with position encoding leads to better performance, indicating that order planning can bridge the gap between the unordered visual features and the ordered textual sentence. Since the order planning serializes the features, position encoding can learn correct order information to guide the generation.
(4) Surprisingly, we also find that the semantic reconstruction module improves the performance of the pointer network, which results in a better order plan to guide the generation.
Figure 3: Examples of the captions generated by our model (Ours) and AoANet (Base) as well as the ground truth (GT). The plans are also visualized for each image (the background color represents the V2L attention weight and arrows represent the order plan).
Figure 3 shows six examples, randomly selected from the Karpathy test split, of the captions generated by our model and a baseline. We also visualize the plans to further show the effectiveness of the proposed planning method. We show the high-level concepts of each image region and use background colors to represent the V2L attention weights (darker is higher). The arrows illustrate the order of the features after the order planning. The baseline still suffers from missing information, as in (b), and from misunderstanding the objects in images, as in (a) and (e). Our model, in contrast, understands the objects better thanks to the semantic reconstruction, as with "surfer" in (b) and "two women" in (f). Further, the captions generated by our model basically follow the order plans, demonstrating that explicit order planning can guide the neural model to generate more informative and well-ordered captions. In general, the captions generated by our model are more informative, faithful and coherent.

Conclusion
In this paper, we integrate a planning strategy into attention based neural models and propose a novel high-level semantic planning method to bridge the gaps between the visual and language domains. We design a high-level semantic planning based attention network (HS-PLAN) that incorporates both semantic reconstruction and explicit order planning to guide the caption generation. Experiments on the large benchmark dataset MS COCO show that our model outperforms the baselines on both automatic and human evaluation and generates more fluent, faithful and coherent captions.