Object Ordering with Bidirectional Matchings for Visual Reasoning

Visual reasoning with compositional natural language instructions, e.g., based on the newly-released Cornell Natural Language Visual Reasoning (NLVR) dataset, is a challenging task, where the model must create an accurate mapping between the diverse phrases and the many objects placed in complex arrangements in the image. Further, this mapping needs to be processed to verify the statement given the ordering and relationships of the objects across three similar images. In this paper, we propose a novel end-to-end neural model for the NLVR task, where we first use joint bidirectional attention to build a two-way conditioning between the visual information and the language phrases. Next, we use an RL-based pointer network to sort and process the varying number of unordered objects (so as to match the order of the statement phrases) in each of the three images, and then pool over the three decisions. Our model achieves strong improvements (of 4-6% absolute) over the state-of-the-art on both the structured-representation and raw-image versions of the dataset.


Introduction
Visual Reasoning (Antol et al., 2015; Andreas et al., 2016; Bisk et al., 2016; Johnson et al., 2017) requires a sophisticated understanding of the compositional language instruction and its relationship with the corresponding image. Suhr et al. (2017) recently proposed a challenging new NLVR task and dataset in this direction, with natural and complex language statements that have to be classified as true or false given a multi-image set (shown in Fig. 1). Specifically, each task instance consists of an image with three sub-images and a statement which describes the image. The model must decide whether the given statement is consistent with the image or not.
To solve the task, a model needs to fuse information from two different domains, the visual objects and the language, and learn accurate relationships between the two. Another difficulty is that the objects in the image do not have a fixed order, and the number of objects also varies. Moreover, each statement is evaluated for truth over three sub-images (instead of the usual single-image setup), which breaks most existing models. In this paper, we introduce a novel end-to-end model to address these three problems, leading to strong gains over the previous best model. Our pointer-network-based LSTM-RNN sorts and learns recurrent representations of the objects in each sub-image, so as to better match the order of the phrases in the language statement. For this, it employs an RL-based policy gradient method with a reward extracted from the subsequent comprehension model. With these strong representations of the visual objects and the statement units, a joint bidirectional attention flow model builds consistent, two-way matchings between the representations in the two domains. Finally, since the scores computed by the bidirectional attention pertain to the three sub-images, a pooling combination layer over the three sub-image representations is required to give the final score for the whole image.
On the structured-object-representation version of the dataset, our pointer-based, end-to-end bidirectional attention model achieves an accuracy of 73.9%, outperforming the previous (end-to-end) state-of-the-art method by 6.2% absolute, where both the pointer network and the bidirectional attention modules contribute significantly. We also contribute several other strong baselines for this new NLVR task based on Relation Networks (Santoro et al., 2017) and BiDAF (Seo et al., 2016). Furthermore, we also show the result of our joint bidirectional attention model on the raw-image version (with pixel-level, spatial-filter CNNs) of the NLVR dataset, where our model achieves an accuracy of 69.7% and outperforms the previous best result by 3.6%. On the unreleased leaderboard test set, our model achieves accuracies of 71.8% and 66.1% on the structured and raw-image versions, respectively, leading to 4% absolute improvements on both tasks. Finally, we present analysis of the pointer network's learned object order as well as success and failure examples of the overall model.

Related work
Besides the NLVR corpus with its focus on complex and natural compositional language (Suhr et al., 2017), other useful visual reasoning datasets have been proposed for navigation and assembly tasks (MacMahon et al., 2006; Bisk et al., 2016), as well as for visual Q&A tasks that focus more on complex real-world images (Antol et al., 2015; Johnson et al., 2017). Specifically for the NLVR dataset, previous models have incorporated property- and count-based features of the objects and the language (Suhr et al., 2017), or extra semantic parsing (logical form) annotations (Goldman et al., 2017); we instead focus on end-to-end models for this visual reasoning task.
Attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015; Xu et al., 2015) have been widely used for conditioned language generation tasks. They have further been used to learn alignments between different modalities (Lu et al., 2016; Wang and Jiang, 2016; Seo et al., 2016; Andreas et al., 2016; Chaplot et al., 2017). In our work, a bidirectional attention mechanism is used to learn a joint representation of the visual objects and the words by building matchings between them.
Pointer networks (Vinyals et al., 2015) were introduced to learn the conditional probability of an output sequence. Bello et al. (2016) extended this to near-optimal combinatorial optimization via reinforcement learning. In our work, a policy-gradient-based pointer network is used to "sort" the objects conditioned on the statement, such that the sequence of ordered objects is sent to the subsequent comprehension model for a reward.

Model
Each training datum for this task consists of the statement s, the structured-representation objects o in the image I, and the ground-truth label y (which is 1 for true and 0 for false). Our BiATT-Pointer model (shown in Fig. 2) for the structured-representation task uses the pointer network to sort the object sequence (optimized by policy gradient), and then uses the comprehension model to calculate the probability P(s, o) of the statement s being consistent with the image. Our CNN-BiATT model for the raw-image version of the dataset is similar but learns the structure directly via pixel-level, spatial-filter CNNs; details are in Sec. 5 and the appendix. In the remainder of this section, we first describe our BiATT comprehension model and then the pointer network.

Comprehension Model with Joint Bidirectional Attention

We use one bidirectional LSTM-RNN (Hochreiter and Schmidhuber, 1997) (denoted LANG-LSTM) to read the statement s = w_1, w_2, ..., w_T and output the hidden state representations {h_i}.
A word-embedding layer before the LSTM projects the words to high-dimensional vectors {w_i}.
The raw features of the objects in the j-th sub-image are {o^j_k} (the NLVR dataset has 3 sub-images per task). A fully-connected (FC) layer without nonlinearity projects the raw features to object embeddings {e^j_k}. We then go through all the objects in random order (or some learnable order, e.g., via our pointer network, see Sec. 3.2) with another bidirectional LSTM-RNN (denoted OBJ-LSTM), whose output is a sequence of vectors {g^j_k} used as the (left plus right memory) representations of the objects, where N_j is the number of objects in the j-th sub-image; the objects in different sub-images are handled separately. We now have two vector sequences representing the words and the objects, over which the bidirectional attention calculates a score measuring the correspondence between the statement and the image's object structure. To simplify notation, we ignore the sub-image index j below. We first merge the LANG-LSTM hidden outputs {h_i} and the object-aware context vectors {c_i} together to get the joint representation {ĥ_i}, where the merging uses element-wise multiplication (denoted by ∘). The object-aware context vector c_i for a particular word w_i is calculated via bilinear attention between the word representation h_i and the object representations {g_k}.
Improvement over BiDAF The BiDAF model of Seo et al. (2016) does not use a full object-to-words attention mechanism. The query-to-document attention module in BiDAF added the attended-context vector to the document representation instead of the query representation. However, the inverse attention from the objects to the words is important in our task because the representation of an object depends on its corresponding words. Therefore, different from the BiDAF model, we create an additional 'symmetric' attention to merge the OBJ-LSTM hidden outputs {g_k} and the statement-aware context vectors {d_k} together to get the joint representation {ĝ_k}. The improvement (6.1%) of our BiATT model over the BiDAF model is shown in Table 1.
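As an illustrative sketch, both attention directions can be computed from one shared bilinear score matrix. The weight matrix W, the softmax normalization axes, and the concatenation-with-product fusion below are assumptions for illustration, not necessarily the paper's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biatt(H, G, W):
    """Joint bidirectional bilinear attention (a sketch).

    H: (T, d) word representations h_i from LANG-LSTM
    G: (K, d) object representations g_k from OBJ-LSTM
    W: (d, d) bilinear weight matrix (assumed parameterization)
    """
    # Bilinear score s_ik between every word i and every object k.
    S = H @ W @ G.T                             # (T, K)
    # Word -> object attention: object-aware context c_i per word.
    C = softmax(S, axis=1) @ G                  # (T, d)
    # Object -> word attention (the symmetric direction): d_k per object.
    D = softmax(S.T, axis=1) @ H                # (K, d)
    # Merge each representation with its context; concatenation with the
    # element-wise product is one common fusion choice.
    H_hat = np.concatenate([H, C, H * C], axis=1)   # (T, 3d)
    G_hat = np.concatenate([G, D, G * D], axis=1)   # (K, 3d)
    return H_hat, G_hat
```

The symmetric direction is what distinguishes this module from BiDAF's query-to-document attention: both the words and the objects receive a context vector from the other modality.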
The joint word representations {ĥ_i} and object representations {ĝ_k} are each reduced to single vectors h̄ and ḡ via the operator ele-max, which denotes the element-wise maximum over a set of vectors. The final scalar score for the sub-image is then given by a 2-layer MLP over the concatenation of h̄ and ḡ.

Max-Pooling over Sub-Images In order to address the 3 sub-images present in each NLVR task, a max-pooling layer is used to combine the above-defined scores of the sub-images. Given that the sub-images do not have any specific ordering among them (based on the data collection procedure (Suhr et al., 2017)), a pooling layer is suitable because it is permutation invariant. Moreover, many of the statements are about the existence of a particular object or relationship in one sub-image (see Fig. 1), and hence the max-pooling layer effectively captures the meaning of these statements. We also tried other combination methods (mean-pooling, concatenation, LSTM, early pooling on the features/vectors, etc.); the max-pooling (on scores) approach was the simplest and most effective among these (based on the dev set).
The overall probability that the statement correctly describes the full image (with three sub-images) is the sigmoid of the final max-pooled score. The loss of the comprehension model is the negative log probability (i.e., the cross entropy): L(s, o, y) = -y log P(s, o) - (1 - y) log(1 - P(s, o)), where y is the ground-truth label.
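A minimal sketch of the score pooling and the loss, assuming the three sub-image scores are already scalars (the function name is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def image_loss(sub_scores, y):
    """Max-pool the three sub-image scores, then cross-entropy (a sketch).

    sub_scores: list of 3 scalar scores, one per sub-image
    y: ground-truth label, 1 (true) or 0 (false)
    """
    score = max(sub_scores)          # permutation-invariant pooling
    p = sigmoid(score)               # P(statement consistent with image)
    # Negative log probability of the correct label (binary cross-entropy).
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Because `max` ignores the order of its inputs, shuffling the sub-images leaves the loss unchanged, which is exactly the permutation invariance argued for above.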

Pointer Network
Instead of ordering the objects randomly, humans look at the objects in an appropriate order w.r.t. their reading of the given statement, after a first glance at the image. Following this idea, we use an additional pointer network (Vinyals et al., 2015) to find the best object ordering for the subsequent language comprehension model. The pointer network contains two RNNs, the encoder and the decoder. The encoder reads all the objects in a random order. The decoder then learns a permutation π of the objects' indices by recurrently outputting a distribution over the objects based on the attention over the encoder hidden outputs. At each time step, an object is sampled without replacement following this distribution. Thus, the pointer network models a distribution p(π | s, o) over all permutations. Furthermore, the appropriate order of the objects depends on the language statement, and hence the decoder importantly attends to the hidden outputs of the LANG-LSTM (see Eqn. 1). The pointer network is trained via reinforcement-learning (RL) based policy gradient optimization. The RL loss L_RL(s, o, y) is defined as the expected comprehension loss (the expectation being over the distribution of permutations): L_RL(s, o, y) = E_{π∼p(π|s,o)}[L(s, o[π], y)], where o[π] denotes the permuted input objects for permutation π, and L is the loss function defined in Eqn. 16. Suppose we sample a permutation π* from the distribution p(π|s, o); the RL loss can then be optimized via policy gradient methods (Williams, 1992). The reward R is the negative loss of the subsequent comprehension model, -L(s, o[π*], y). A baseline b is subtracted from the reward to reduce variance (we use the self-critical baseline of Rennie et al. (2016)). The gradient of the loss can then be approximated as ∇L_RL ≈ (L(s, o[π*], y) - b) ∇ log p(π* | s, o). This overall BiATT-Pointer model (for the structured-representation task) is shown in Fig. 2.
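The sampling-without-replacement decoding and the policy-gradient weighting can be sketched as follows. For brevity, a precomputed logits matrix stands in for the decoder's step-wise attention over the encoder outputs (an assumption; in the actual model the logits at step t depend on the decoder state):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_permutation(logits, rng):
    """Sample a permutation without replacement, pointer-network style.

    logits: (K, K) array; row t holds the decoder's attention logits
    over the K objects at step t (assumed precomputed here).
    Returns the sampled permutation pi and log p(pi | s, o).
    """
    K = logits.shape[0]
    chosen, logp = [], 0.0
    mask = np.zeros(K, dtype=bool)
    for t in range(K):
        l = np.where(mask, -np.inf, logits[t])  # mask already-picked objects
        p = softmax(l)
        i = rng.choice(K, p=p)                  # sample the next object index
        chosen.append(int(i))
        logp += np.log(p[i])
        mask[i] = True
    return chosen, logp

def reinforce_weight(loss, baseline):
    """Scalar multiplying grad log p(pi*|s,o) in the policy gradient:
    grad L_RL ~ (L - b) * grad log p(pi* | s, o)."""
    return loss - baseline
```

Masking with -inf before the softmax enforces sampling without replacement, and accumulating log p at each step yields the log-probability needed by the REINFORCE estimator.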

Experimental Setup
We evaluate our model on the NLVR dataset (Suhr et al., 2017), for both the structured and raw-image versions. All model tuning was performed on the dev set. Given that the dataset is balanced (the numbers of true and false labels are roughly the same), accuracy over the whole corpus is used as the metric. We only use the raw features of the statement and the objects, with minimal standard preprocessing (e.g., tokenization and UNK replacement; see the appendix for reproducibility and training details).

Results and Analysis
Results on Structured Representations Dataset: Table 1 shows our primary model results. In terms of previous work, the state-of-the-art result for end-to-end models is 'MAXENT', shown in Suhr et al. (2017). Our proposed BiATT-Pointer model (Fig. 2) achieves a 6.2% improvement on the public test set and a 4.0% improvement on the unreleased test set over this SotA model. To show the individual effectiveness of our BiATT and Pointer components, we also provide two ablation results: (1) the bidirectional attention BiATT model without the pointer network; and (2) our BiENC baseline model without any attention or pointer mechanisms. The BiENC model uses the similarity between the last hidden outputs of the LANG-LSTM and the OBJ-LSTM as the score (Eqn. 14).
Finally, we also reproduce some recent popular frameworks, i.e., the Relation Network (Santoro et al., 2017) and the BiDAF model (Seo et al., 2016), which have proven successful in related reasoning tasks, as additional baselines for comparison.

[Example statements with their correct answers: "There are 2 boxes with at least 2 blue items." (True); "There is a blue object touching the base." (True); "There are at least three yellow objects touching any edge." (False); "There is exactly one tower which has a blue block over a black block." (True). Negative examples: "There is a box with 3 items and a black item on top." (False); "There are two grey boxes with at least two black objects touching the edge." (False); "There is a box with a blue triangle, a yellow square and a yellow circle." (True).]
Next, in Fig. 3, we show some negative examples on which our model fails to predict the correct answer. The top two examples involve complex high-level phrases, e.g., "touching any edge" or "touching the base", which are hard for an end-to-end model to capture, given that such statements are rare in the training data. Based on the results on the validation set, the max-pooling layer was selected as the combination method in our model. The max-pooling layer chooses the highest score among the sub-images as the final score, so it can easily handle statements about single-sub-image existence-based reasoning (e.g., the 4 positively-classified examples in Fig. 1). However, the bottom two negatively-classified examples in Fig. 3 could not be resolved because of the limitation of the max-pooling layer on scenarios that consider multiple-sub-image existence. We did try multiple other pooling and combination methods, as mentioned in Sec. 3.1; among these, the concatenation, early-pooling, and LSTM-fusion approaches might have the ability to solve these particular bottom-two failed statements. In future work, we will address multiple types of pooling methods jointly.
Finally, we show the effectiveness of the pointer network in learning the object order in Fig. 4. The red arrows indicate the sorted order of the objects as learned by our pointer network, conditioned on the language instruction. In the top two examples, the model learns to sort the objects along a path that accords with the spatial relationships in the statement (e.g., "blue block over a black block" or "item on top"). In the bottom two examples, the model also learns an order of the objects that aligns well with the occurrences of the words in the statement.

Conclusion
We presented a novel end-to-end model with joint bidirectional attention and object-ordering pointer networks for visual reasoning. We evaluated our model on both the structured-representation and raw-image versions of the NLVR dataset and achieved substantial improvements over the previous end-to-end state-of-the-art results.

Figure 1: NLVR task: given an image with 3 sub-images and a statement, the model needs to predict whether the statement correctly describes the image or not. We show 4 such examples which our final BiATT-Pointer model correctly classifies but the strong baseline models do not (see Sec. 5).

Figure 2: Our BiATT-Pointer model with a pointer network and a joint bidirectional attention module.

Figure 4: Examples of our learned object ordering. The red arrows indicate the order of the objects learned by the pointer network.

Table 1: Dev, Test-P (public), and Test-U (unreleased) results of our model on the structured-representation and raw-image datasets, compared to the previous SotA results and other reimplemented baselines.