Scene Graph Parsing as Dependency Parsing

In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation, which captures objects together with their attributes and relations; this representation has proven useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connects them to dependency parses. Together with a careful redesign of the label and action space, this allows us to merge the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into a single stage, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing the best previous approach by 5%. We further demonstrate the effectiveness of our learned parser on image retrieval applications.


Introduction
Recent years have witnessed a rise of interest in many tasks at the intersection of computer vision and natural language processing, including semantic image retrieval (Johnson et al., 2015; Vendrov et al., 2015), image captioning (Mao et al., 2014; Karpathy and Li, 2015; Donahue et al., 2015; Liu et al., 2017b), visual question answering (Antol et al., 2015; Zhu et al., 2016; Andreas et al., 2016), and referring expressions (Hu et al., 2016; Mao et al., 2016; Liu et al., 2017a). The pursuit of these tasks is in line with people's desire for a high-level understanding of visual content, in particular, using textual descriptions or questions to help understand or express images and scenes.
What is shared among all these tasks is the need for a common representation to establish a connection between the two modalities. The majority of recent works handle the vision side with convolutional neural networks, and the language side with recurrent neural networks (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) or word embeddings (Pennington et al., 2014). In either case, neural networks map the original sources into a semantically meaningful vector representation (Donahue et al., 2014) that can be aligned through end-to-end training (Frome et al., 2013). This suggests that the vector embedding space is an appropriate choice as the common representation connecting different modalities (see e.g. Kaiser et al. (2017)).
While dense vector representations yield impressive performance, they have the unfortunate limitation of being less intuitive and hard to interpret. Scene graphs (Johnson et al., 2015), on the other hand, are a type of directed graph proposed to encode information in terms of objects, attributes of objects, and relationships between objects (see Figure 1 for a visualization). They are a more structured and explainable way of expressing knowledge from either modality, and can serve as an alternative form of common representation. In fact, the value of the scene graph representation has already been proven in a wide range of visual tasks, including semantic image retrieval (Johnson et al., 2015) and caption quality evaluation (Anderson et al., 2016). In this paper, we focus on scene graph generation from textual descriptions.
Figure 1: The Visual Genome (Krishna et al., 2017) dataset contains tens of region descriptions per image and the region scene graphs associated with them. In this paper, we study how to generate high quality scene graphs (two such examples are shown in the figure) from textual descriptions, without using image information.

Previous attempts at this problem (Schuster et al., 2015; Anderson et al., 2016) follow the same spirit. They first use a dependency parser to obtain the dependency relationships among all words in a sentence, and then use either a rule-based or a learned classifier as post-processing to generate the scene graph. However, the rule-based classifier cannot
learn from data, and the learned classifier is rather simple, with hand-engineered features. In addition, the dependency parser was trained on linguistics data to produce complete dependency trees, some parts of which may be redundant and hence confuse the scene graph generation process. Therefore, our model abandons the two-stage pipeline and uses a single, customized dependency parser instead. The customization is necessary for two reasons. The first is the difference in label space. Standard dependency parsing has tens of edge labels to represent rich relationships between words in a sentence, but in scene graphs we are only interested in three types, namely objects, attributes, and relations. The second is whether every word needs a head. In some sense, the scene graph represents the "skeleton" of the sentence, which suggests that empty words are unlikely to be included in the scene graph. We argue that in scene graph generation, it is unnecessary to require a parent word for every single word. We build our model on top of a neural dependency parser implementation (Kiperwasser and Goldberg, 2016) that is among the state of the art. We show that our carefully customized dependency parser is able to generate high quality scene graphs by learning from data. Specifically, we use the Visual Genome dataset (Krishna et al., 2017), which provides a rich amount of region description and region graph pairs. We first align nodes in region graphs with words in the region descriptions using simple rules, and then use this alignment to train our customized dependency parser. We evaluate our parser by computing the F-score between the parsed scene graphs and ground truth scene graphs. We also apply our approach to image retrieval to show its effectiveness.
Related Work

Scene Graphs
The scene graph representation was proposed in Johnson et al. (2015) as a way to represent the rich, structured knowledge within an image. The nodes in a scene graph represent either an object, an attribute of an object, or a relationship between two objects. The edges depict the connection and association between two nodes. This representation was later adopted in the Visual Genome dataset (Krishna et al., 2017), where a large number of scene graphs are annotated through crowd-sourcing. The scene graph representation has proven useful in various problems including semantic image retrieval (Johnson et al., 2015), visual question answering (Teney et al., 2016), 3D scene synthesis (Chang et al., 2014), and visual relationship detection (Lu et al., 2016). Excluding Johnson et al. (2015), which used ground truth, scene graphs are obtained either from images (Dai et al., 2017; Xu et al., 2017) or from textual descriptions (Schuster et al., 2015; Anderson et al., 2016). In this paper we focus on the latter.
In particular, parsed scene graphs are used in Schuster et al. (2015) for image retrieval. We show that with our more accurate scene graph parser, performance on this task can be further improved.

Parsing to Graph Representations
The goal of dependency parsing (Kübler et al., 2009) is to assign a parent word to every word in a sentence, where every such connection is associated with a label. Dependency parsing is particularly suitable for scene graph generation because it directly models the relationships between individual words without introducing extra nonterminals. In fact, all previous work (Schuster et al., 2015; Anderson et al., 2016) on scene graph generation runs dependency parsing on the textual description as a first step, followed by either heuristic rules or simple classifiers. Instead of running two separate stages, our work proposes a single dependency parser that is trained end-to-end. In other words, our customized dependency parser generates the scene graph in an online fashion as it reads the textual description once from left to right.
In recent years, dependency parsing with neural network features (Chen and Manning, 2014;Dyer et al., 2015;Cross and Huang, 2016;Kiperwasser and Goldberg, 2016;Dozat and Manning, 2016;Shi et al., 2017) has shown impressive performance. In particular, Kiperwasser and Goldberg (2016) used bidirectional LSTMs to generate features for individual words, which are then used to predict parsing actions. We base our model on Kiperwasser and Goldberg (2016) for both its simplicity and good performance.
Apart from dependency parsing, Abstract Meaning Representation (AMR) parsing (Flanigan et al., 2014; Werling et al., 2015; Konstas et al., 2017) may also benefit scene graph generation. However, as first pointed out in Anderson et al. (2016), the use of dependency trees still appears to be a common theme in the literature, and we leave the exploration of AMR parsing for scene graph generation as future work. More broadly, our task also relates to entity and relation extraction, e.g. Katiyar and Cardie (2017), though object attributes are not handled there. Neural module networks (Andreas et al., 2016) also use dependency parses, but they translate questions into a series of actions, whereas we parse descriptions into their graph form. Finally, Krishnamurthy and Kollar (2013) connected parsing and grounding by training the parser in a weakly supervised fashion.

Task Description
In this section, we begin by reviewing the scene graph representation, and show how its nodes and edges relate to the words and arcs in dependency parsing. We then describe simple yet reliable rules to align nodes in scene graphs with words in textual descriptions, such that customized dependency parsing, described in the next section, may be trained and applied.

Scene Graph Definition
There are three types of nodes in a scene graph: object, attribute, and relation. Let O be the set of object classes, A be the set of attribute types, and R be the set of relation types. Given a sentence s, our goal in this paper is to parse s into a scene graph G(s) = ⟨O(s), A(s), R(s)⟩, where O(s) ⊆ O is the set of object instances mentioned in s, A(s) ⊆ O(s) × A is the set of attributes associated with object instances, and R(s) ⊆ O(s) × R × O(s) is the set of relations between object instances. G(s) is a graph because we can first create an object node for every element in O(s); then for every (o, a) pair in A(s), we create an attribute node and add an unlabeled edge o → a; finally, for every (o_1, r, o_2) triplet in R(s), we create a relation node and add two unlabeled edges o_1 → r and r → o_2. The resulting directed graph exactly encodes the information in G(s). We call this the node-centric graph representation of a scene graph.
We realize that a scene graph can be equivalently represented by no longer distinguishing between the three types of nodes, and instead assigning labels to the edges. Concretely, this means there is now only one type of node, but we assign an ATTR label to every o → a edge, a SUBJ label to every o_1 → r edge, and an OBJT label to every r → o_2 edge. We call this the edge-centric graph representation of a scene graph.
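As a concrete sketch of this re-encoding (a minimal illustration of our own; objects are plain strings and each relation string is assumed to occur once), the node-centric tuples map to labeled edges as follows:

```python
def to_edge_centric(attributes, relations):
    """Re-encode a scene graph from its node-centric tuples A(s) and R(s)
    into a list of labeled directed edges (the edge-centric view).
    Object nodes O(s) need no edges of their own."""
    edges = [(o, "ATTR", a) for (o, a) in attributes]  # o -ATTR-> a
    for (o1, r, o2) in relations:
        edges.append((o1, "SUBJ", r))                  # o1 -SUBJ-> r
        edges.append((r, "OBJT", o2))                  # r  -OBJT-> o2
    return edges

# "young boy wearing black uniform"
edges = to_edge_centric(
    attributes=[("boy", "young"), ("uniform", "black")],
    relations=[("boy", "wearing", "uniform")],
)
```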
We can now establish a connection between scene graphs and dependency trees. Here we only consider scene graphs that are acyclic. The edge-centric view of a scene graph is very similar to a dependency tree: both are directed acyclic graphs whose edges/arcs have labels. The difference is that in a scene graph, the nodes are the objects/attributes/relations and the edges have the label space {ATTR, SUBJ, OBJT}, whereas in a dependency tree, the nodes are the individual words of a sentence and the edges have a much larger label space.

Sentence-Graph Alignment
We have shown the connection between nodes in scene graphs and words in dependency parsing. With alignment between nodes in scene graphs and words in the textual description, scene graph generation and dependency parsing becomes equivalent: we can construct the generated scene graph from the set of labeled edges returned by the dependency parser. Unfortunately, such alignment is not provided between the region graphs and region descriptions in the Visual Genome (Krishna et al., 2017) dataset. Here we describe how we use simple yet reliable rules to do sentence-graph (word-node) alignment.
There are two strategies that we could use in deciding whether to align a scene graph node d (whose label space is O ∪ A ∪ R) with a word/phrase w in the sentence:

• Word-by-word match (WBW): d ↔ w only when d's label and w match word-for-word.
• Synonym match (SYN): d ↔ w when the WordNet synonyms of d's label contain w.
Obviously WBW is a more conservative strategy than SYN. We propose to use two cycles, each of which consists of three steps, in which we try to align objects, attributes, and relations, in that order. The pseudocode for the first cycle is in Algorithm 1. The second cycle repeats lines 4-15 immediately afterwards, except that in line 6 we also allow SYN. Intuitively, in the first cycle we use a conservative strategy to find "safe" objects, and then scan for their attributes and relations. In the second cycle we relax and allow synonyms in aligning object nodes, again followed by the alignment of attribute and relation nodes.
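The two-cycle procedure can be sketched as follows. This is a simplification of Algorithm 1 under our own naming: the real procedure also constrains attribute and relation nodes to attach near already-aligned objects, which we omit here, and `synonyms` stands in for a WordNet lookup.

```python
def align(nodes, words, synonyms=None):
    """Two-cycle sentence-graph alignment sketch.
    nodes: list of (type, label) pairs with type in
           {"object", "attribute", "relation"}.
    Returns a map from node label to the index of its aligned word."""
    synonyms = synonyms or {}
    aligned = {}

    def matches(label, word, allow_syn):
        if label == word:                                     # WBW
            return True
        return allow_syn and word in synonyms.get(label, ())  # SYN

    for cycle in (1, 2):
        for kind in ("object", "attribute", "relation"):
            # SYN is allowed in every step except cycle-1 object alignment
            allow_syn = not (cycle == 1 and kind == "object")
            for ntype, label in nodes:
                if ntype != kind or label in aligned:
                    continue
                for i, word in enumerate(words):
                    if matches(label, word, allow_syn):
                        aligned[label] = i
                        break
    return aligned
```

For example, with `synonyms={"kid": ("child",)}`, the object "kid" is left unaligned by the conservative first cycle and only picked up by the synonym-enabled second cycle.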
The ablation study of the alignment procedure is reported in the experimental section.

Customized Dependency Parsing
In the previous section, we have established the connection between scene graph generation and dependency parsing, which assigns a parent word for every word in a sentence, as well as a label for this directed arc. We start by describing our base dependency parsing model, which is neural network based and performs among the state-of-theart. We then show why and how we do customization, such that scene graph generation is achieved with a single, end-to-end model.

Neural Dependency Parsing Base Model
We base our model on the transition-based parser of Kiperwasser and Goldberg (2016). Here we describe its key components: the arc-hybrid system that defines the transition actions, the neural architecture for feature extraction and scoring, and the loss function.
The Arc-Hybrid System In the arc-hybrid system, a configuration consists of a stack σ, a buffer β, and a set T of dependency arcs. Given a sentence s = w 1 , . . . , w n , the system is initialized with an empty stack σ, an empty arc set T , and β = 1, . . . , n, ROOT, where ROOT is a special index. The system terminates when σ is empty and β contains only ROOT. The dependency tree is given by the arc set T upon termination.
The arc-hybrid system allows three transition actions, SHIFT, LEFT(l), and RIGHT(l), described in Table 1. The SHIFT transition moves the first element of the buffer to the stack. The LEFT(l) transition yields an arc from the first element of the buffer to the top element of the stack, and then removes the top element from the stack. The RIGHT(l) transition yields an arc from the second-top element of the stack to the top element of the stack, and then also removes the top element from the stack.

Table 1: Transitions of the arc-hybrid system: each action maps the configuration (σ_t, β_t, T_t) to (σ_{t+1}, β_{t+1}, T_{t+1}).
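As a minimal sketch of the three actions (our own naming; word indices stand in for words, and a configuration is just a stack, a buffer, and an arc set, with arcs stored as (head, label, dependent) triples):

```python
def shift(stack, buffer, arcs):
    """SHIFT: move the first element of the buffer onto the stack."""
    stack.append(buffer.pop(0))

def left(stack, buffer, arcs, label):
    """LEFT(l): arc from the front of the buffer (head) to the top
    of the stack (dependent), then pop the stack."""
    arcs.append((buffer[0], label, stack.pop()))

def right(stack, buffer, arcs, label):
    """RIGHT(l): arc from the second-top stack element (head) to the
    top of the stack (dependent), then pop the stack."""
    arcs.append((stack[-2], label, stack.pop()))

# Parsing "young boy" (word indices 0, 1) with the special ROOT index
# at the end of the buffer, using generic dependency labels:
stack, buffer, arcs = [], [0, 1, "ROOT"], []
shift(stack, buffer, arcs)          # stack=[0], buffer=[1, ROOT]
left(stack, buffer, arcs, "amod")   # arc 1 -> 0: "boy" heads "young"
shift(stack, buffer, arcs)          # stack=[1], buffer=[ROOT]
left(stack, buffer, arcs, "root")   # arc ROOT -> 1
# Terminal configuration: empty stack, buffer containing only ROOT.
```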
The following paragraphs describe how to select the correct transition action (and label l) in each step in order to generate a correct dependency tree.
BiLSTM Feature Extractor. Let the word embeddings of a sentence s be w_1, . . . , w_n. An LSTM cell is a parameterized function that takes as input w_t and updates its hidden and memory states:

h_t, c_t = LSTM_cell(w_t, h_{t-1}, c_{t-1}).

As a result, an LSTM network, which simply applies the LSTM cell t times, is a parameterized function mapping a sequence of input vectors w_{1:t} to a sequence of output vectors h_{1:t}. In our notation, we drop the intermediate vectors h_{1:t-1} and let LSTM(w_{1:t}) represent h_t. A bidirectional LSTM, or BiLSTM for short, consists of two LSTMs: LSTM_F, which reads the input sequence in the original order, and LSTM_B, which reads it in reverse. Then

BILSTM(w_{1:n}, i) = LSTM_F(w_{1:i}) • LSTM_B(w_{n:i}),

where • denotes concatenation. Intuitively, the forward LSTM encodes information from the left side of the i-th word and the backward LSTM encodes information to its right, such that the vector v_i = BILSTM(w_{1:n}, i) has the full sentence as context.
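The composition of the two directions can be sketched with a toy recurrent cell (any function h_t = step(h_{t-1}, x_t) stands in for the LSTM cell here; real features would of course come from learned LSTMs over embedding vectors):

```python
def bilstm(vectors, step, h0=0.0):
    """Toy BiLSTM: for every position i, return the pair
    (LSTM_F(w_1..i), LSTM_B(w_n..i)), whose concatenation is v_i."""
    n = len(vectors)
    fwd, h = [], h0
    for x in vectors:                  # left-to-right pass
        h = step(h, x)
        fwd.append(h)
    bwd, h = [None] * n, h0
    for i in range(n - 1, -1, -1):     # right-to-left pass
        h = step(h, vectors[i])
        bwd[i] = h
    return list(zip(fwd, bwd))

# With a running-sum "cell", position i sees its full left and right context:
feats = bilstm([1, 2, 3], step=lambda h, x: h + x)
```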
When predicting the transition action, the feature function φ(c) that summarizes the current configuration c = (σ, β, T ) is simply the concatenation of the BiLSTM vectors of the top three elements of the stack and the first element of the buffer:

φ(c) = v_{s_2} • v_{s_1} • v_{s_0} • v_{b_0}.

MLP Scoring Function. The score of transition action y under the current configuration c is determined by a multi-layer perceptron with one hidden layer:

f(c, y) = MLP(φ(c))[y], where MLP(x) = W_2 · tanh(W_1 · x + b_1) + b_2.

Hinge Loss Function. The training objective is to raise the scores of correct transitions above the scores of incorrect ones. Therefore, at each step we use a hinge loss:

loss(c) = max(0, 1 − max_{y ∈ Y⁺} f(c, y) + max_{y′ ∈ Y \ Y⁺} f(c, y′)),

where Y is the set of possible transitions and Y⁺ is the set of correct transitions at the current step.
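The scorer and loss can be sketched in a few lines (our own minimal NumPy version; dimensions and initialization are placeholders, and in practice gradients flow through the BiLSTM features as well):

```python
import numpy as np

def mlp_scores(phi, W1, b1, W2, b2):
    """Score all transitions at once: MLP(x) = W2·tanh(W1·x + b1) + b2,
    with one output per (action, label) pair."""
    return W2 @ np.tanh(W1 @ phi + b1) + b2

def hinge_loss(scores, correct):
    """Hinge loss: the best correct transition should beat the best
    incorrect one by a margin of 1. `correct` indexes the set Y+."""
    incorrect = [i for i in range(len(scores)) if i not in correct]
    return max(0.0, 1.0
               - max(scores[i] for i in correct)
               + max(scores[i] for i in incorrect))
```

For instance, with scores [2.0, 0.5, 0.0], a correct set {0} incurs zero loss (the margin is already satisfied), while a correct set {1} incurs a positive loss.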
In each training step, the parser scores all possible transitions with the MLP, incurs the hinge loss, selects a following transition, and updates the configuration. Losses at individual steps are summed throughout the parsing of a sentence, and the parameters are then updated using backpropagation.
At test time, we simply choose the transition action that yields the highest score at each step.

Customization
In order to generate scene graphs with dependency parsing, modification is necessary for at least two reasons. First, we need to redefine the label space of arcs so as to reflect the edge-centric representation of a scene graph. Second, not every word in the sentence will be (part of) a node in the scene graph (see Figure 2 for an example). In other words, some words in the sentence may not have a parent word, which violates the dependency parsing setting. We tackle these two challenges by redesigning the edge labels and expanding the set of transition actions.

Redesigning Edge Labels
We define a total of five edge labels, so as to faithfully bridge the edge-centric view of scene graphs with dependency parsing models:

• CONT: This label is created for nodes whose label is a phrase. For example, the phrase "in front of" is a single relation node in the scene graph. By introducing the CONT label, we expect the words within such a phrase to be chained together by CONT arcs, where the direction of the arcs (left or right) is predefined by hand.
The leftmost word under the right-arc rule, or the rightmost word under the left-arc rule, is called the head of the phrase. A single-word node does not need the CONT label, and its head is the word itself.
• ATTR: The arc label from the head of an object node to the head of an attribute node.
• SUBJ: The arc label from the head of an object node (subject) to the head of a relation node.
• OBJT: The arc label from the head of a relation node to the head of an object node (object).
• BEGN: The arc label from the ROOT index to all heads of object nodes without a parent.
Expanding Transition Actions. With the three transition actions SHIFT, LEFT(l), and RIGHT(l), we only drop an element (from the top of the stack) after it has been associated with an arc. This design ensures that every word is associated with an arc. However, in our setting for scene graph generation, there may be no arc for some of the words, especially empty words. Our solution is to augment the action set with a REDUCE action that pops the stack without adding to the arc set (see Table 1). This action is often used in other transition-based dependency parsing systems (e.g. arc-eager (Nivre, 2004)). More recently, Hershcovich et al. (2017) and Buys and Blunsom (2017) also included this action when parsing sentences into graph structures.

Table 2: The F-scores (i.e. SPICE metric) between scene graphs parsed from region descriptions and ground truth region graphs on the intersection of the Visual Genome (Krishna et al., 2017) and MS COCO (Lin et al., 2014) validation sets.
We still minimize the hinge loss defined above, except that |Y| now increases from 3 to 4. During training, we impose the oracle to select the REDUCE action when it is in Y⁺. In terms of the loss function, we increment by 1 the loss incurred by the other three transition actions when REDUCE incurs zero loss.
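In the same style as the transition sketch earlier (our own naming, with arcs stored as (head, label, dependent) triples), the new action is simply a pop with no arc:

```python
def reduce_(stack, buffer, arcs):
    """REDUCE: pop the stack WITHOUT adding an arc, so that words with
    no scene-graph counterpart (e.g. determiners) never receive a head.
    Compare SHIFT/LEFT(l)/RIGHT(l), where dropping an element from the
    stack always creates an arc for it."""
    stack.pop()

# e.g. the word "a" in "a young boy" has no node in the scene graph:
stack, buffer, arcs = ["a"], ["young", "boy"], []
reduce_(stack, buffer, arcs)
```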

Implementation Details
We train and evaluate our scene graph parsing model on (a subset of) the Visual Genome (Krishna et al., 2017) dataset. Each image in Visual Genome contains a number of regions, and each region is annotated with both a region description and a region scene graph. Our training set is the intersection of Visual Genome and the MS COCO (Lin et al., 2014) train2014 set, which contains a total of 34,027 images / 1,070,145 regions. We evaluate on the intersection of Visual Genome and the MS COCO val2014 set, which contains a total of 17,471 images / 547,795 regions.
In our experiments, the number of hidden units in the BiLSTM is 256; the number of BiLSTM layers is 2; the word embedding dimension is 200; and the number of hidden units in the MLP is 100. We use a fixed learning rate of 0.001 and the Adam optimizer (Kingma and Ba, 2014) with epsilon 0.01. Training usually converges within 4 epochs.
We will release our code and trained model upon acceptance.

Quality of Parsed Scene Graphs
We use a slightly modified version of the SPICE score (Anderson et al., 2016) to evaluate the quality of scene graph parsing. Specifically, for every region, we parse its description using a parser (e.g. the one used in SPICE, or our customized dependency parser), and then calculate the F-score between the parsed graph and the ground truth region graph (see Section 3.2 of Anderson et al. (2016) for more details). Note that when SPICE calculates the F-score, a node in one graph may be matched to several nodes in the other, which is problematic. We fix this and enforce one-to-one matching when calculating the F-score. Finally, we report the average F-score across all regions. Table 2 summarizes our results. We see that our customized dependency parsing model achieves an average F-score of 49.67%, significantly outperforming the parser used in SPICE by 5 percentage points. This result shows that our customized dependency parser is very effective at learning from data, and generates more accurate scene graphs than the best previous approach.
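For the exact-match case, the F-score with enforced one-to-one matching can be sketched as follows (our own simplification: each graph is a multiset of tuples, and multiset intersection realizes one-to-one matching; SPICE additionally credits synonym matches, which we omit):

```python
from collections import Counter

def f_score(parsed, truth):
    """F-score between two scene graphs, each given as a list of tuples
    (object singletons, (object, attribute) pairs, and
    (subject, relation, object) triplets)."""
    matched = sum((Counter(parsed) & Counter(truth)).values())
    if matched == 0:
        return 0.0
    precision = matched / len(parsed)
    recall = matched / len(truth)
    return 2 * precision * recall / (precision + recall)

parsed = [("boy",), ("boy", "young"), ("ball",)]
truth  = [("boy",), ("ball",), ("ball", "white")]
```

Here two of the three tuples match in each direction, so precision and recall are both 2/3.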
Table 3: Image retrieval results. We follow the same experimental setup as Schuster et al. (2015), except for using a different scoring function when ranking images. Our parser consistently outperforms the Stanford Scene Graph Parser across evaluation metrics.

Ablation Studies. First, we study how the sentence-graph alignment procedure affects the final performance. Recall that our procedure involves two cycles, each with three steps. Of the six steps, synonym match (SYN) is not used only in the first step. We tried two more settings, in which SYN is used in either all six steps or none of them. We can see from Table 2 that the
F-score drops in both cases, hence supporting the procedure that we chose. Second, we study whether changing the direction of CONT arcs from pointing left to pointing right makes much difference. Table 2 shows that the two choices give very similar performance, suggesting that our dependency parser is robust to this design choice.
Finally, we report the oracle score, which is the similarity between the aligned graphs that we use during training and the ground truth graphs. The F-score is relatively high at 69.85%. This shows that improving the parser (about 20% margin) and improving the sentence-graph alignment (about 30% margin) are both promising directions for future research.
Qualitative Examples We provide one parsing example in Figure 2 and Figure 3. This is a sentence that is relatively simple, and the underlying scene graph includes two object nodes, one attribute node, and one compound word relation node. In parsing this sentence, all four actions listed in Table 1 are used (see Figure 3) to produce the edge-centric scene graph (bottom left of Figure 2), which is then trivially converted to the node-centric scene graph (bottom right of Figure 2).

Application in Image Retrieval
We test if the advantage of our parser can be propagated to computer vision tasks, such as image retrieval. We directly compare our parser with the Stanford Scene Graph Parser (Schuster et al., 2015) on the development set and test set of the image retrieval dataset used in Schuster et al. (2015) (not Visual Genome).
For every region in an image, there is a human-annotated region description and a region scene graph. The queries are the region descriptions. If the region graph corresponding to the query is a subgraph of the complete graph of another image, then that image is added to the ground truth set for this query. All of this strictly follows Schuster et al. (2015). However, since we did not obtain or reproduce the CRF model used in Johnson et al. (2015) and Schuster et al. (2015), we used F-score similarity instead of the likelihood of the maximum a posteriori CRF solution when ranking images based on the region descriptions. Therefore, the numbers we report in Table 3 are not directly comparable with those reported in Schuster et al. (2015).
Our parser delivers better retrieval performance across all three evaluation metrics: recall@5, recall@10, and median rank. We also notice that the numbers in our retrieval setting are higher than those (even with the oracle) in Schuster et al. (2015)'s retrieval setting. This strongly suggests that generating accurate scene graphs from images is a very promising research direction in image retrieval, and that grounding parsed scene graphs to bounding box proposals without considering visual attributes/relationships (Johnson et al., 2015) is suboptimal.

Conclusion
In this paper, we offer a new perspective and solution to the task of parsing scene graphs from textual descriptions. We begin by moving the labels/types from the nodes to the edges, introducing the edge-centric view of scene graphs. We further show that the gap between edge-centric scene graphs and dependency parses can be filled with a careful redesign of the label and action space. This motivates us to train a single, customized, end-to-end neural dependency parser for this task, as opposed to prior approaches that used generic dependency parsing followed by heuristics or simple classifiers. We directly train our parser on a subset of Visual Genome (Krishna et al., 2017), without transferring any knowledge from the Penn Treebank (Marcus et al., 1993) as previous works did.
The quality of our trained parser is validated in terms of both SPICE similarity to the ground truth graphs and recall rate/median rank when performing image retrieval.
We hope our paper can lead to more thoughts on the creative uses and extensions of existing NLP tools to tasks and datasets in other domains. In the future, we plan to tackle more computer vision tasks with this improved scene graph parsing technique in hand, such as image region grounding. We also plan to investigate parsing scene graphs with cyclic structures, as well as whether and how image information can help boost parsing quality.