Leveraging Partial Dependency Trees to Control Image Captions

Controlling the generation of image captions has attracted much attention recently. In this paper, we propose a framework that leverages partial syntactic dependency trees as control signals, so that generated image captions include specified words and their syntactic structures. To this end, we propose a Syntactic Dependency Structure Aware Model (SDSAM), which explicitly learns to generate the syntactic structures of image captions so as to include given partial dependency trees. In addition, we introduce a metric that evaluates how many of the specified words and their syntactic dependencies are included in generated captions. We carry out experiments on two standard datasets: Microsoft COCO and Flickr30k. Empirical results show that image captions generated by our model are effectively controlled in terms of specified words and their syntactic structures. The code is available on GitHub.


Introduction
Controllable image captioning has emerged as a popular research topic in recent years. Existing works attempt to enhance models' controllability and captions' diversity by controlling attributes of image captions such as style (Mathews et al., 2016), sentiment (Gan et al., 2017), content (Dai et al., 2018; Cornia et al., 2019; Zhong et al., 2020), and part-of-speech (Deshpande et al., 2019). However, some important attributes of image captions, namely words and syntactic structures, are ignored in previous works. For example, for the image in Figure 2, Cornia et al. (2019) specify a set of objects such as 'dog, man, frisbee' as a control signal, but there still exist many possibilities for composing them into different captions, such as 'a dog and a man play frisbee on grass' and 'a dog playing with a man catches frisbee', since neither the words nor the syntactic structure is determined yet.

Figure 1: An example of a syntactic dependency tree (left) and a partial dependency tree (right).

To address this issue, we propose a framework that employs partial dependency trees as control signals. As shown in Figure 1, a partial dependency tree, a sub-tree of a syntactic dependency tree, contains words and their syntactic structures; we can therefore use it to specify control information about both words and syntactic structures.
In addition, we develop a pipeline model called the Syntactic Dependency Structure Aware Model (SDSAM), which first derives a full syntactic dependency tree and then flattens it into a caption. The motivation behind this pipeline is twofold: we assume that explicitly generating syntactic dependency trees as intermediate representations helps the model learn how to apply the specified syntactic information to captions, and that the intermediate representations give users an intuitive view of which part of a caption's syntactic structure is controlled.
Finally, we propose a syntactic dependency-based evaluation metric that evaluates whether generated captions have been controlled in terms of syntactic structures. Our metric is computed from the overlap of syntactic dependencies, which differs from existing metrics such as BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016), which rely on the overlap of n-grams or semantic graphs. Empirical results show that image captions generated by our model are effectively controlled in terms of specified words and their syntactic structures.

Figure 2: Overview of SDSAM: (1) generating a syntactic dependency tree using the syntactic dependency tree generator; (2) flattening it into a caption using the caption generator.

Framework Definition
The task presented in this paper is defined as generating a caption sentence (i.e., a word sequence) y = w_1, ..., w_{|y|} given an image I and a partial dependency tree P as input, such that the dependency tree T_y of y includes P as far as possible. The syntactic dependency tree of a sentence, as shown in Figure 1, is a tree structure representing the syntactic relations between words. A syntactic dependency tree T_x of a sentence x = w_1, ..., w_{|x|} is defined as a set of dependencies {D_1, D_2, ..., D_{|T_x|}}, where |T_x| denotes the number of dependencies in T_x. Each dependency D_k is written as w_i --e_{i,j}--> w_j, where w_i and w_j are the head word and the dependent word of D_k, and e_{i,j} is the dependency label. We denote the child nodes of w_i as C(w_i); i.e., C(w_i) = {w_j | w_i --e_{i,j}--> w_j ∈ T_x}. A partial dependency tree P here refers to a sub-tree of the syntactic dependency tree of some sentence; that is, P ⊆ T_x for some sentence x.
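Under these definitions, a dependency tree can be represented simply as a set of (head word, label, dependent word) triples, and "T_y includes P" is plain set inclusion. A minimal illustration of this representation (hand-written triples for clarity, not parser output; the `includes` helper is ours, not the paper's code):

```python
def includes(partial, full):
    """Return True if every dependency in `partial` appears in `full`."""
    return set(partial) <= set(full)

# Dependency tree of "a dog catches a frisbee", written as
# (head, label, dependent) triples.
full = {
    ("catches", "nsubj", "dog"),
    ("catches", "dobj", "frisbee"),
    ("dog", "det", "a"),
    ("frisbee", "det", "a"),
}
# A partial dependency tree: a sub-tree of the full tree.
part = {("catches", "nsubj", "dog"), ("catches", "dobj", "frisbee")}

assert includes(part, full)
assert not includes({("catches", "nsubj", "man")}, full)
```

In practice the triples would come from a dependency parser; the set view is what makes the inclusion criterion (and the metric in Section 5) easy to compute.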

Syntactic Dependency Structure Aware Model
The Syntactic Dependency Structure Aware Model (SDSAM), shown in Figure 2, generates image captions in two steps: (1) the syntactic dependency tree generator (left part) derives a full syntactic dependency tree from the image and the partial dependency tree; (2) the caption generator (right part) flattens the syntactic dependency tree into a caption.

The Syntactic Dependency Tree Generator
The syntactic dependency tree generator encodes the input image into image features with a CNN implemented with ResNet-152, and encodes the partial dependency tree into partial dependency tree features with a syntactic dependency tree encoder implemented with a Tree-LSTM (Tai et al., 2015).
After combining the image features and the partial dependency tree features into combined features s, the syntactic dependency tree generator derives the full syntactic dependency tree with the syntactic dependency tree decoder. The decoder consists of two attention modules, Attn_in and Attn_out, and two interweaved GRU networks (Cho et al., 2014), GRU_v and GRU_h. Decoding proceeds from the root node to the leaf nodes in a top-down manner. For a node w_i, its child nodes are decoded one by one from left to right: each child node is predicted from the information of its parent node and its left sibling node generated in previous steps. Meanwhile, the attention modules highlight the words to be generated for the current child node. Assume we decode the child w_j of node w_i; the hidden states of w_i and w_j are denoted h_i and h_j respectively, and the left sibling of w_j is denoted w_{j-1}, with hidden state h_{j-1}. For each input image, we detect a set of keywords c = {r_1, ..., r_{|c|}} following the method proposed in (You et al., 2016), and encode c into a matrix C ∈ R^{E_w×|c|}, where E_w is the size of the word embedding. In the above formulas, the parameters in R^{V_e×H} and V^(α) ∈ R^{E_a×E_w} reshape features. Here E_s, E_a and E_q are the sizes of the input feature s, the attention feature A and the query q respectively; V_w and V_e are the vocabulary sizes for nodes and edges respectively, and H is the size of the hidden states. In equation (10), v ∈ R^{E_a×1} is a parameter and 1 ∈ R^{|c|×1} is a vector with all elements being one.
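The top-down, left-to-right decoding order described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `decoding_steps` helper and the node names are hypothetical, and the neural components (GRU_v, GRU_h, the attention modules) are abstracted away; the sketch only shows which (parent, left sibling) pair conditions each predicted node.

```python
from collections import deque

def decoding_steps(children, root):
    """Yield (parent, left_sibling, node) triples in the decoder's order:
    nodes are expanded top-down, and within each node its children are
    predicted left to right, each conditioned on its parent (via GRU_v)
    and on its left sibling (via GRU_h)."""
    steps = []
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        left = None  # the first child has no left sibling
        for child in children.get(parent, []):
            steps.append((parent, left, child))
            left = child
            queue.append(child)
    return steps

# Tree for "a dog catches a frisbee" (a1/a2 distinguish the two articles).
children = {"catches": ["dog", "frisbee"], "dog": ["a1"], "frisbee": ["a2"]}
order = decoding_steps(children, "catches")
# The root's children come first, then each subtree's children in turn.
```

In the full model, each step of this traversal would run the two GRUs on (h_parent, h_left_sibling) and apply the attention modules before emitting the node and edge labels.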
The Caption Generator

The caption generator takes the syntactic dependency tree generated in the first step as input and encodes it into syntactic dependency tree features with the syntactic dependency tree encoder. It then combines these with the image features extracted in the first step and uses the combined features to initialize an LSTM decoder (Hochreiter and Schmidhuber, 1997) that generates the caption.

Experiment
Preparing Datasets with Partial Dependency Trees

For evaluation, we apply two methods to create partial dependency trees for Microsoft COCO (Chen et al., 2015) and Flickr30k (Young et al., 2014). The first method extracts partial dependency trees from reference captions: we parse the reference captions into syntactic dependency trees using spaCy (https://spacy.io) and then randomly sample sub-trees from each syntactic dependency tree. The sampled partial dependency trees are then paired with the corresponding reference captions. The dataset created by this procedure is denoted test_gold in Section 5.
The other method creates partial dependency trees from images in two steps: (1) we train a syntactic dependency classifier to predict syntactic dependencies for an input image; (2) the predicted syntactic dependencies are combined into a syntactic dependency graph for the image, from which partial dependency trees are sampled. The dataset created by this procedure is denoted test_pred in Section 5. For training, following the first method, we directly sample a partial dependency tree from one of the reference captions for each image, and the paired reference caption is used as the training target.
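One simple way to sample a sub-tree from a parsed caption is to walk the tree top-down, keep each dependency with some probability, and only descend below kept dependencies, so the result is always a connected sub-tree. The sketch below illustrates this idea; the paper does not specify its sampling scheme, so `sample_partial_tree` and the keep probability are our assumptions.

```python
import random

def sample_partial_tree(deps, keep=0.5, rng=random):
    """Sample a connected sub-tree from a dependency tree.

    `deps` is a list of (head, label, dependent) triples. A dependency
    becomes eligible only after the dependency attaching its head has been
    kept (or its head is the root), so the kept triples form a sub-tree.
    """
    dependents = {d for (_, _, d) in deps}
    # Dependencies whose head is the root word (never a dependent itself).
    frontier = [t for t in deps if t[0] not in dependents]
    sampled = []
    while frontier:
        nxt = []
        for (h, lbl, d) in frontier:
            if rng.random() < keep:
                sampled.append((h, lbl, d))
                nxt.extend(t for t in deps if t[0] == d)  # descend below d
        frontier = nxt
    return sampled

# Dependency tree of "a dog catches a frisbee" (hand-built for illustration).
tree = [("catches", "nsubj", "dog"), ("catches", "dobj", "frisbee"),
        ("dog", "det", "a"), ("frisbee", "det", "a")]
random.seed(0)
partial = sample_partial_tree(tree, keep=0.9)
assert set(partial) <= set(tree)  # sampled dependencies come from the tree
```

Varying the keep probability (or the number of sampled nodes) yields partial trees of different sizes, which is useful both for building the training pairs and for the diversity analysis in the case study.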

Evaluation Metric
The evaluation metrics for image captioning fall into two categories: (1) Quality: evaluating the relevance to human annotations with metrics including BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016). (2) Controllability: evaluating whether generated image captions are successfully controlled by partial dependency trees. We devise a new metric called the Dependency-Based Evaluation Metric (DBEM) for this purpose. Given an input partial dependency tree P = {D_1, ..., D_{|P|}}, DBEM calculates how many of the syntactic dependencies specified in P are included in the dependency tree T_y of the generated caption y, i.e., |P ∩ T_y| / |P|. The DBEM score for an evaluation dataset is the average of this score over all inputs.

Experiment Setting

The training of our model is split into two stages: training the syntactic dependency tree generator and training the caption generator. We set the hidden state size to 512, the word embedding size to 512, and the dependency label embedding size to 300. We train our model with the Adam optimizer (Kingma and Ba, 2015), with a learning rate of 5e-4 for the first stage and 1e-4 for the second stage. Two models, our SDSAM and the NIC model (Vinyals et al., 2015) with its encoder replaced by ResNet-152, are compared under three different control inputs.
(1) None control: input is an image.
(2) Half control: input is an image and the words of a partial dependency tree.
(3) Full control: input is an image and a partial dependency tree.
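The DBEM computation defined in the Evaluation Metric section can be sketched as follows (illustrative only: the dependency triples are hand-written and `dbem` is a hypothetical helper name, not the released evaluation code):

```python
def dbem(partial_trees, generated_trees):
    """Average, over the dataset, of the fraction of specified dependencies
    that appear in the generated caption's dependency tree.

    partial_trees[i] and generated_trees[i] are sets of
    (head, label, dependent) triples for the i-th example."""
    scores = [
        len(P & T_y) / len(P)
        for P, T_y in zip(partial_trees, generated_trees)
        if P  # skip degenerate empty control inputs
    ]
    return sum(scores) / len(scores)

# One example: both specified dependencies appear in the generated tree.
spec = [{("plays", "nsubj", "dog"), ("plays", "dobj", "frisbee")}]
gen = [{("plays", "nsubj", "dog"), ("plays", "dobj", "frisbee"),
        ("dog", "det", "a")}]
assert dbem(spec, gen) == 1.0
```

A score of 1.0 means every specified dependency was realized in the caption; missing dependencies reduce the score proportionally, which is what Table 3 reports.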

Results and Analysis
Quality

(1) Results on test_gold: Evaluation results on test_gold, whose partial dependency trees are sampled from reference captions, are shown in Table 1. The table shows that both NIC and SDSAM achieve significant improvements in evaluation scores when more control signals are input, indicating that generated captions become closer to reference captions. These improvements are expected, since the control signals contain information from the reference captions. This result attests that partial dependency trees carry information useful for generating specific sentences. When both models are given the same control signals, SDSAM performs comparably to NIC on n-gram based metrics (i.e., BLEU-4, METEOR, ROUGE and CIDEr), while achieving significantly better performance on SPICE, a semantic relation based metric. This reveals an interesting phenomenon: explicitly learning the syntactic structures of captions can improve performance on a semantic relation based metric.
(2) Results on test_pred: Table 2 shows the evaluation results on test_pred, whose partial dependency trees are generated from images. For NIC and SDSAM, the evaluation scores mostly remain at the same level, but slight improvements are observed on SPICE. This result suggests that partial dependency trees generated from images do not have a significant impact on the quality of image captions; at the same time, giving partial dependency trees as control signals does not harm caption quality. For the same control signals, SDSAM performs better on SPICE in most cases, consistent with the results on test_gold.
Controllability

DBEM scores on test_gold and test_pred are shown in Table 3. The table shows that the DBEM scores of both models are very low when no control is given. This reveals that only a small proportion of the syntactic dependencies in partial dependency trees appear in reference captions by chance, indicating that additional input to control syntactic structures is meaningful. When the models are given words as control signals, the DBEM scores increase significantly, meaning that both models can infer syntactic structures from words even without explicit syntactic structure information. However, nearly half of the specified dependencies are still missing from the generated captions. These observations suggest that words provide useful information as control signals but are insufficient to fully specify syntactic structures. When partial dependency trees are input, the DBEM scores improve significantly further, meaning that most syntactic dependencies specified in partial dependency trees are included in the generated captions. This demonstrates that syntactic structure information plays an important role in precisely controlling image captions.
When the models are given no control signals, SDSAM has better DBEM scores than NIC. This is possibly because SDSAM explicitly learns to generate syntactic dependency trees, and can better capture the syntactic dependencies that commonly appear in captions.

Case Study
In Figure 3, we show an example of our model's output on test_pred. Our syntactic dependency classifier first predicts a syntactic dependency graph from the input image. Once the syntactic dependency graph is constructed, we sample three partial dependency trees with different numbers of nodes, as shown in the figure. Finally, our SDSAM model infers captions from the input image and the partial dependency trees. In this example, all words and syntactic structures specified in the partial dependency trees appear in the generated captions. Furthermore, the three generated captions differ considerably from each other, demonstrating that giving partial dependency trees as control signals can improve caption diversity.

Conclusion
We presented a framework for controlling image captions in terms of words and syntactic structures by giving partial dependency trees as control signals. We developed a syntactic dependency structure aware model that explicitly learns the syntactic structures in the control signals. Empirical results show that image captions generated by our model are effectively controlled in terms of specified words and their syntactic structures. Furthermore, the results indicate that explicitly learning to generate the syntactic dependency trees of captions enhances the model's controllability.