CLEVR Parser: A Graph Parser Library for Geometric Learning on Language Grounded Image Scenes

The CLEVR dataset has been used extensively in language-grounded visual reasoning in Machine Learning (ML) and Natural Language Processing (NLP). We present a graph parser library for CLEVR that provides functionality for extracting object-centric attributes and relationships and for constructing structural graph representations for both modalities. Structural order-invariant representations enable geometric learning and can aid in downstream tasks like language grounding to vision, robotics, compositionality, interpretability, and computational grammar construction. We provide three extensible main components, the parser, embedder, and visualizer, that can be tailored to suit specific learning setups. We also provide out-of-the-box functionality for seamless integration with popular deep graph neural network (GNN) libraries. Additionally, we discuss downstream usage and applications of the library, and how it can accelerate research for the NLP community.


Introduction
The CLEVR dataset (Johnson et al., 2017a) is a modern 3D incarnation of historically significant shapes-based datasets like SHRDLU (Winograd, 1970), used for demonstrating AI efficacy in language understanding (Ontanon, 2018; Winograd, 1980; Hudson and Manning, 2018). Although originally aimed at the visual question answering (VQA) problem (Santoro et al., 2017; Hu et al., 2018), its versatility has seen it used in diverse ML domains, including extensions to physics simulation engines for language-augmented hierarchical reinforcement learning (Jiang et al., 2019) and causal reasoning (Yi et al., 2019).

Figure 1: A question about a CLEVR image visualized as multimodal parsed graphs

In parallel, research interest in geometric learning and GNN-based techniques (Kipf and Welling, 2016; Schlichtkrull et al., 2018; Hamilton et al., 2017) has surged in the recent deep learning zeitgeist. In this focused paper, we present a library that allows easy integration and application of geometric representation learning on CLEVR dataset tasks, enabling the NLP research community to apply GNN-based techniques to their research (see Section 4).
The library has three main (extensible) components:
1. Parser: extracts graph-structured relationships among objects of the environment, both for textual questions and for semantic image scene graphs.
2. Embedder: generates latent embeddings using any model or desired backend of choice (e.g., PyTorch).
3. Visualizer: provides tools for visualizing structural graphs and latent embeddings.
Background
CLEVR Environment The dataset consists of images with rendered 3D objects of various shapes, colors, materials, and sizes, along with corresponding image scene graphs containing visual semantic information. Templated question generation on the images allows the creation of complex questions that test various aspects of scene understanding. The original dataset contains ≈1M questions generated from ≈100k images with 90 question template families that can be broadly categorized into five question types: count, exist, numerical comparison, attribute comparison, and query. The dataset also comes with a defined domain-specific language (DSL) function library F, containing primitive functions that can be composed to answer questions about CLEVR images (Johnson et al., 2017b). We delegate further details of this dataset to (Johnson et al., 2017a) and Appendix A.
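To make the DSL idea concrete, the following is a minimal toy sketch of how primitive functions compose into a program that answers a question. The scene, function names, and attribute values here are illustrative stand-ins, not the actual CLEVR function catalog:

```python
# Toy sketch of the CLEVR DSL idea: primitive functions composed into a program.
# The scene and function names are illustrative, not the official catalog.
SCENE = [
    {"shape": "cube", "color": "red", "size": "large", "material": "rubber"},
    {"shape": "sphere", "color": "red", "size": "small", "material": "metal"},
    {"shape": "cylinder", "color": "blue", "size": "large", "material": "metal"},
]

def scene():                  # root primitive: returns all objects
    return list(SCENE)

def filter_color(objs, c):    # keep objects matching a color
    return [o for o in objs if o["color"] == c]

def count(objs):              # terminal primitive: answer a count question
    return len(objs)

# "How many red things are there?" as a composed program:
answer = count(filter_color(scene(), "red"))
print(answer)  # → 2
```

Real CLEVR programs chain many more primitives (relate, unique, same_size, etc.), but the composition pattern is the same.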

Parser
Text The parser takes a language utterance that is valid in the CLEVR environment (a question, caption, or command) and outputs a structural graph representation G_s, capturing object attributes, spatial relationships (spatial re), and attribute-similarity-based matching predicates (matching re) in the textual input. This is implemented by adding a CLEVR object entity recognizer (NER) to the NLP parse pipeline, as depicted in Figure 3. Note that the NER is permutation-equivariant with respect to object attributes, i.e., a 'large red rubber ball' will be detected as an object from any of these spans: 'red large rubber ball', 'large ball', 'ball', etc.

Images The parser takes image scene graphs as input and outputs a structural graph G_t. The synthesized image scenes accompanying the original dataset can be used as input. Alternatively, parsed image scenes generated using any modern semantic image segmentation method (e.g., Mask R-CNN (He et al., 2017)) can also be used as input (Yi et al., 2018). A visualized example of a parsed image is shown in Figure 4a. For ease of reproducibility, we also include a curated dataset '1obj' with image scenes parsed using Mask R-CNN semantic segmentation (Appendix A).
While we provide a concrete implementation using the SpaCy NLP library, any other library, such as the Stanford Parser or NLTK, could be used in its place. The output of the parser for a question and image is depicted in Figure 1.
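The attribute-order-invariant object detection described above can be illustrated with a small dependency-free sketch. This is not the library's API (the actual implementation plugs a custom entity recognizer into a SpaCy pipeline); it only demonstrates the matching logic, where attribute tokens accumulate in any order until a shape head closes the object span:

```python
# Minimal sketch of CLEVR object-span detection, order-invariant over attributes.
# Vocabulary and function name are illustrative, not the library's API.
SIZES = {"small", "large", "big", "tiny"}
COLORS = {"gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"}
MATERIALS = {"rubber", "metal", "metallic", "matte", "shiny"}
SHAPES = {"cube", "sphere", "cylinder", "ball", "block"}
ATTRS = SIZES | COLORS | MATERIALS

def extract_objects(text):
    tokens = text.lower().replace("?", "").split()
    objects, buf = [], []
    for tok in tokens:
        if tok in ATTRS:
            buf.append(tok)          # accumulate attribute tokens in any order
        elif tok in SHAPES:
            objects.append(" ".join(buf + [tok]))  # shape head closes the span
            buf = []
        else:
            buf = []                 # any other token breaks the span
    return objects

print(extract_objects("Is there a large red rubber ball left of the cube?"))
# → ['large red rubber ball', 'cube']
print(extract_objects("the rubber large red ball"))
# → ['rubber large red ball']
```

Because attributes are matched as a set rather than a fixed sequence, 'red large rubber ball' and 'large red rubber ball' yield the same object node.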

Embedder
The embedder provides word-embedding-based (Mikolov et al., 2017) representations of input text utterances and image scenes using a pre-trained language model (LM). The end-user can instantiate the embedder with a preferred LM, which could be a simple one-hot representation of the CLEVR environment vocabulary, or a large transformer-based SotA LM like BERT, GPT-2, or XLNet (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019). The embedder takes the parser-generated graphs G_s, G_t (see Section 3.1), where G_s and G_t are defined as a generic graph G = (V, E, A), with V the set of nodes {1, 2, ...}, E the set of edges, and A the adjacency matrix, and returns X, E, the feature matrices of the nodes and edges respectively. The output signature of the embedder is a tuple (X, A, E), which matches the fundamental data structure of popular geometric learning libraries like PyTorch Geometric (Fey and Lenssen, 2019), thus allowing seamless integration. We show a concrete implementation of this use case using PyTorch Geometric (Fey and Lenssen, 2019) and PyTorch in 3.3.2.
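The (X, A, E) tuple can be illustrated with a tiny worked example using one-hot features. The two-object scene and five-word vocabulary below are invented for illustration; they are not the library's embedder API:

```python
import numpy as np

# Hypothetical tiny graph: two CLEVR objects connected by one spatial relation.
# Vocabulary and one-hot features are illustrative stand-ins for a real LM.
VOCAB = ["ball", "cube", "red", "cyan", "left_of"]
one_hot = {w: np.eye(len(VOCAB))[i] for i, w in enumerate(VOCAB)}

nodes = ["ball", "cube"]            # V = {1, 2}
edges = [(0, 1, "left_of")]         # E, with a relation label per edge

X = np.stack([one_hot[n] for n in nodes])           # node feature matrix X
E = np.stack([one_hot[r] for _, _, r in edges])     # edge feature matrix E
A = np.zeros((len(nodes), len(nodes)))
for i, j, _ in edges:
    A[i, j] = 1                                     # adjacency matrix A

print(X.shape, A.shape, E.shape)  # (2, 5) (2, 2) (1, 5)
```

This tuple maps directly onto PyTorch Geometric's `Data` object (`x` for X, `edge_index` derived from A, `edge_attr` for E), which is what enables the seamless integration noted above.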

Visualizer
We provide multiple visualization tools for analyzing images, text, and latent embeddings.

Visualizing Structural Graphs
This visualizer sub-component enables visualization of the multimodal structural graph outputs, G_s and G_t, produced by the parser (see 3.1), using Graphviz and matplotlib.
Visualizing Images Image graphs (G_t) can have a large number of objects and attributes. For ease of viewing, attributes like size, shape (e.g., cylinder), color (e.g., yellow), and material (e.g., metallic) are displayed as nodes of the graph (Figure 4a). We explain the elements of Figure 4a to describe the legend in greater detail. Double circles represent objects, and the adjacent nodes are their attributes. The shape is depicted using the actual shape (e.g., the cyan cylinder, obj2), and the other attributes are depicted as diamonds. The size of one of the diamonds indicates whether the object is small or large; e.g., the large cyan diamond attached to obj2 means that it is large. The color of all the attribute nodes depicts the color of the object (e.g., the cyan color of obj2). A gradient in the remaining diamond indicates the object's material: the gradient in the diamond attached to obj4 means that it is metallic, while the solid fill for obj2 means that it is rubber. While this legend is a little lengthy, we found that it makes visualization easier; the user can also revert to the simpler setting of using text to depict the attributes.
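The object-plus-attribute-node layout described above can be sketched as a small Graphviz DOT generator. The function name and the plain diamond styling are illustrative simplifications; the library's own visualizer uses the richer shape, size, color, and gradient encodings described in the legend:

```python
def scene_to_dot(objects):
    """Render parsed scene objects as a Graphviz DOT string (simplified sketch:
    objects become doublecircle nodes, attributes become labeled diamonds)."""
    lines = ["graph scene {"]
    for i, obj in enumerate(objects):
        oid = f"obj{i}"
        lines.append(f"  {oid} [shape=doublecircle];")
        for attr, val in obj.items():
            node = f"{oid}_{attr}"
            lines.append(f'  {node} [label="{val}", shape=diamond];')
            lines.append(f"  {oid} -- {node};")  # attach attribute to object
    lines.append("}")
    return "\n".join(lines)

dot = scene_to_dot([{"color": "cyan", "shape": "cylinder", "size": "large"}])
print(dot)
```

The resulting DOT string can be rendered with the `graphviz` package or the `dot` CLI.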
Visualizing Text The text corresponding to an image describes a partially observable subset of the objects, their relationships, and their attributes. The dependency graph of the text is visualized just like the images, with only the observable information depicted (Figure 4b).
Composing image and text We also provide an option to view an image and the text in the same graph. By connecting corresponding object nodes from the image and text, we create a bipartite graph that allows us to visualize all the information that an image-text pair contains (Figure 4c). Additional examples from the visualizer are presented in appendix A.4.
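One simple way to realize the bipartite connection between corresponding object nodes is attribute-subset matching, reflecting the partial observability noted above: a text object grounds to any image object whose attributes contain the text object's attributes. This is a hypothetical sketch of the matching rule, not the library's grounding implementation:

```python
def ground(text_objs, img_objs):
    """Connect text objects to image objects whose attribute dicts are a
    superset of the text object's (text is a partial view of the scene).
    Sketch only; node naming and matching rule are illustrative."""
    edges = []
    for t, tattrs in enumerate(text_objs):
        for i, iattrs in enumerate(img_objs):
            if set(tattrs.items()) <= set(iattrs.items()):
                edges.append((f"t{t}", f"obj{i}"))  # bipartite grounding edge
    return edges

text_side = [{"color": "red", "shape": "cube"}]          # "the red cube"
image_side = [
    {"color": "red", "shape": "cube", "size": "large"},  # obj0: matches
    {"color": "blue", "shape": "ball", "size": "small"}, # obj1: does not
]
print(ground(text_side, image_side))  # → [('t0', 'obj0')]
```

Ambiguous utterances (e.g., just "the cube" in a two-cube scene) would yield multiple grounding edges, which the joint graph representation can retain.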

Visualizer -Embeddings
We also provide a visualizer to analyze the embeddings produced by the methods in section 3.2. We use t-SNE (Maaten and Hinton, 2008), a method for visualizing high-dimensional data in 2 or 3 dimensions. We also offer clustering support to group similar embeddings together. Both image embeddings (Frome et al., 2013) and word embeddings from learned models have the nice property of capturing semantic information, and our visualizers capture this semantic similarity in the form of clusters. Figure 5 plots the embeddings for questions drawn from two different distributions, train and test, which represent semantically different sequences; they separate out into distinct clusters.
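The projection step can be reproduced with scikit-learn's t-SNE. The synthetic Gaussian "train" and "test" embeddings below are stand-ins for real question embeddings, chosen only to show the two-distribution setup:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in embeddings: two synthetic "distributions" of question vectors,
# offset so they should land in separable clusters (illustrative data only).
train = rng.normal(0.0, 1.0, size=(20, 16))
test = rng.normal(5.0, 1.0, size=(20, 16))
emb = np.vstack([train, test])

# Project 16-d embeddings down to 2-d for plotting.
proj = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(emb)
print(proj.shape)  # → (40, 2)
```

The 2-d `proj` array can then be scatter-plotted with matplotlib, coloring points by distribution, to reproduce a Figure 5-style cluster plot.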

Related Work and Applications
Some lines of work attempt to generate scene graphs for images. The Visual Genome library (Krishna et al., 2017), in a real-world image setting, is a collection of annotated images (from Flickr, COCO) and corresponding knowledge-graph associations. The work of Schuster et al. (2015), and the corresponding library that is part of the Stanford NLP library, allows scene graph generation from text (image captions) as input.
Our work is orthogonal to these in that our target dataset is synthetic, which allows full control over the generation of images, questions, and ground-truth semantic program chains. Thus, coalesced with our library's functionalities, it allows end-to-end (e2e) control over experimenting with every modular aspect of a research hypothesis (see 4.1). Further, our work is premised on providing multimodal representations, including a ground-truth paired graph (joint graph G_u ← (G_s, G_t)), which has interesting downstream research applications.

Usage and Applications
Applications of language grounding in ML/NLP research are quite broad. To avoid sounding overly grandiose, we exemplify possible applications by citing work that pertains to the CLEVR dataset.
Recent work by Bahdanau et al. (2019) has shown a lack of distributional robustness and compositional generalization (Fodor et al., 1988) in NLP. Permutation equivariance within local linguistic component groups has been shown to help with language compositionality (Gordon et al., 2020). Graph-based representations are intrinsically order-invariant, and thus may help with language compositionality research. Language-augmented reward mechanisms are a dense topic in concurrent (human-in-the-loop) reinforcement learning (Knox and Stone, 2012; Griffith et al., 2013), robotics (Knox et al., 2013; Kuhlmann et al., 2004), and long-horizon, hierarchical POMDP problems in general (Kaplan et al., 2017), like command completion in physics simulators (Jiang et al., 2019). Other applications could be in program synthesis and interpretability (Mascharka et al., 2018), causal reasoning (Yao, 2010), and general visually grounded language understanding (Yu et al., 2016).
In general, we expect and hope that any existing line or domain of work in NLP using the CLEVR dataset (hundreds of works, based on citations) will benefit from graph-based representation learning aided by our proposed library.