Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes

We present an approach to simultaneously perform semantic segmentation and prepositional phrase attachment resolution for captioned images. The motivation for this work comes from the fact that some ambiguities in language simply cannot be resolved without simultaneously reasoning about an associated image. If we consider the sentence "I shot an elephant in my pajamas", looking at the language alone (and not reasoning about common sense), it is unclear whether it is the person, the elephant, or both that are wearing the pajamas. Our approach involves producing a diverse set of plausible hypotheses for both semantic segmentation and prepositional phrase attachment resolution that are then jointly re-ranked to select the most consistent pair. We show that our semantic segmentation and prepositional phrase attachment resolution modules have complementary strengths, and that joint reasoning produces more accurate results than any module operating in isolation. We also show that multiple hypotheses are crucial to improved multiple-module reasoning. Our vision and language approach significantly outperforms a state-of-the-art NLP system (Stanford Parser [16,27]) by 17.91% (28.69% relative) in one experiment, and by 12.83% (25.28% relative) in another. We also make small improvements over a state-of-the-art vision system (DeepLab-CRF [13]).


Introduction
Perception and intelligence problems are hard. Whether we are interested in understanding an image or a sentence, our algorithms must operate under tremendous levels of ambiguity. When a human reads the sentence "I eat sushi with tuna", it is clear that the prepositional phrase "with tuna" modifies "sushi" and not the act of eating, but this may be ambiguous to a machine. This problem of determining whether a prepositional phrase ("with tuna") modifies a noun phrase ("sushi") or verb phrase ("eating") is formally known as Prepositional Phrase Attachment Resolution (PPAR) (Ratnaparkhi et al., 1994). Consider the captioned scene shown in Figure 1. The caption "A dog is standing next to a woman on a couch" exhibits a PP attachment ambiguity: "(dog next to woman) on couch" vs "dog next to (woman on couch)". It is clear that having access to image segmentations can help resolve this ambiguity, and having access to the correct PP attachment can help image segmentation. There are two main roadblocks that keep us from writing a single unified model (say a graphical model) to perform both tasks: (1) Inaccurate Models: empirical studies (Meltzer et al., 2005, Szeliski et al., 2008, Kappes et al., 2013) have repeatedly found that models are often inaccurate and miscalibrated, i.e., their "most-likely" beliefs are placed on solutions far from the ground-truth. (2) Search Space Explosion: jointly reasoning about multiple modalities is difficult due to the combinatorial explosion of the search space ({exponentially-many segmentations} × {exponentially-many sentence-parses}).

Figure 1: Overview of our approach. We propose a model for simultaneous 2D semantic segmentation and prepositional phrase attachment resolution by reasoning about sentence parses. The language and vision modules each produce M diverse hypotheses, and the goal is to select a pair of consistent hypotheses. In this example the ambiguity to be resolved from the image caption is whether the dog is standing on or next to the couch ("(dog next to woman) on couch" vs "dog next to (woman on couch)"). Both modules benefit by selecting a pair of compatible hypotheses.
Proposed Approach and Contributions. In this paper, we address the problem of simultaneous object segmentation (also called semantic segmentation) and PPAR in captioned scenes. To the best of our knowledge this is the first paper to do so.
Our main thesis is that a set of diverse plausible hypotheses can serve as a concise interpretable summary of uncertainty in vision and language 'modules' (What does the semantic segmentation module see in the world? What does the PPAR module describe?) and form the basis for tractable joint reasoning (How do we reconcile what the semantic segmentation module sees in the world with how the PPAR module describes it?).
Given our two modules with M hypotheses each, how can we integrate beliefs across the segmentation and sentence parse modules to pick the best pair of hypotheses? Our key focus is consistency: correct hypotheses from different modules will be correct in a consistent way, but incorrect hypotheses will be incorrect in incompatible ways. Specifically, we develop a MEDIATOR model that scores pairs for consistency and searches over all $M^2$ pairs to pick the highest scoring one. We demonstrate our approach on three datasets: ABSTRACT-50S (Vedantam et al., 2014), PASCAL-50S, and PASCAL-Context-50S (Mottaghi et al., 2014). We show that our vision+language approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 20.66% (36.42% relative) for ABSTRACT-50S, 17.91% (28.69% relative) for PASCAL-50S, and by 12.83% (25.28% relative) for PASCAL-Context-50S. We also make small but consistent improvements over DeepLab-CRF (Chen et al., 2015).

Related Work
Most works at the intersection of vision and NLP tend to be 'pipeline' systems, where vision tasks take 1-best inputs from NLP (e.g., sentence parsings) without trying to improve NLP performance and vice-versa. For instance, Fidler et al. (2013) use prepositions to improve object segmentation and scene classification, but only consider the most likely parse of the sentence and do not resolve ambiguities in text. Analogously, Yatskar et al. (2014) investigate the role of object, attribute, and action classification annotations for generating human-like descriptions. While they achieve impressive results at generating descriptions, they assume perfect vision modules to generate sentences. Our work uses current (still imperfect) vision and NLP modules to reason about images and provided captions, and simultaneously improve both vision and language modules. Similar to our philosophy, an earlier work by Barnard and Johnson (2005) used images to help disambiguate word senses (e.g. piggy banks vs snow banks). In a more recent work, Gella et al. (2016) studied the problem of reasoning about an image and a verb, where they attempt to pick the correct sense of the verb that describes the action depicted in the image. Berzak et al. (2015) resolve linguistic ambiguities in sentences coupled with videos that represent different interpretations of the sentences. Perhaps the work closest to ours is Kong et al. (2014), who leverage information from an RGBD image and its sentential description to improve 3D semantic parsing and resolve ambiguities related to coreference resolution in the sentences (e.g., what "it" refers to). We focus on a different kind of ambiguity: Prepositional Phrase (PP) attachment resolution.
In the classification of parsing ambiguities, coreference resolution is considered a discourse ambiguity (Poesio and Artstein, 2005) (arising out of two different words across sentences for the same object), while PP attachment is considered a syntactic ambiguity (arising out of multiple valid sentence structures) and is typically considered much more difficult to resolve (Bach, 2016, Davis, 2016).
A number of recent works have studied problems at the intersection of vision and language, such as Visual Question Answering (Antol et al., 2015, Geman et al., 2014, Malinowski et al., 2015), Visual Madlibs (Yu et al., 2015), and image captioning (Vinyals et al., 2015, Fang et al., 2015). Our work falls in this domain with a key difference that we produce both vision and NLP outputs. Our work also has similarities with works on 'spatial relation learning' (Fritz, 2014, Lan et al., 2012), i.e. learning a visual representation for noun-preposition-noun triplets ("car on road"). While our approach can certainly utilize such spatial relation classifiers if available, the focus of our work is different. Our goal is to improve semantic segmentation and PPAR by jointly reranking segmentation-parsing solution pairs. Our approach implicitly learns spatial relationships for prepositions ("on", "above") but these are simply emergent latent representations that help our reranker pick out the most consistent pair of solutions.

Approach
In order to emphasize the generality of our approach, and to show that our approach is compatible with a wide class of implementations of semantic segmentation and PPAR modules, we present our approach with the modules abstracted as "black boxes" that satisfy a few general requirements and minimal assumptions. In Section 4, we describe each of the modules in detail, making concrete their respective features, and other details.

What is a Module?
The goal of a module is to take input variables x ∈ X (images or sentences), and predict output variables y ∈ Y (semantic segmentation) and z ∈ Z (prepositional attachment expressed in sentence parse). The two requirements on a module are that it needs to be able to produce scores S(y|x) for potential solutions and a list of plausible hypotheses $Y = \{y_1, y_2, \ldots, y_M\}$.
Multiple Hypotheses. In order to be useful, the set Y of hypotheses must provide an accurate summary of the score landscape. Thus, the hypotheses should be plausible (i.e., high-scoring) and mutually non-redundant (i.e., diverse). Our approach (described next) is applicable to any choice of diverse hypothesis generators. In our experiments, we use the k-best algorithm of Huang and Chiang (2005) for the sentence parsing module and the DivMBest algorithm for the semantic segmentation module. Once we instantiate the modules in Section 4, we describe the diverse solution generation in more detail.

Joint Reasoning Across Multiple Modules
We now show how to integrate information from both segmentation and PPAR modules. Recall that our key focus is consistency: correct hypotheses from different modules will be correct in a consistent way, but incorrect hypotheses will be incorrect in incompatible ways. Thus, our goal is to search for a pair (semantic segmentation, sentence parsing) that is mutually consistent.
Let $Y = \{y_1, \ldots, y_M\}$ denote the M semantic segmentation hypotheses and $Z = \{z_1, \ldots, z_M\}$ denote the M PPAR hypotheses.

MEDIATOR Model. We develop a "mediator" model that identifies high-scoring hypotheses across modules in agreement with each other. Concretely, we can express the MEDIATOR model as a factor graph where each node corresponds to a module (semantic segmentation and PPAR). Working with such a factor graph is typically completely intractable because each node y, z has exponentially-many states (image segmentations, sentence parsing). As illustrated in Figure 2, in this factor-graph view, the hypothesis sets Y, Z can be considered 'delta-approximations' for reducing the size of the output spaces.
Unary factors S(·) capture the score/likelihood of each hypothesis provided by the corresponding module for the image/sentence at hand. Pairwise factors C(·, ·) represent consistency factors. Importantly, since we have restricted each module's variables to just M states, we are free to capture arbitrary domain-specific high-order relationships for consistency, without any optimization concerns. In fact, as we describe in our experiments, these consistency factors may be designed to exploit domain knowledge in fairly sophisticated ways.

Figure 2: Illustrative inter-module factor graph. Each node takes exponentially-many or infinitely-many states and we use a 'delta approximation' to limit support.
Consistency Inference. We perform exhaustive inference over all possible tuples:

$$\operatorname*{argmax}_{i, j \in \{1, \ldots, M\}} \; M(y_i, z_j) = w^\top \phi(x, y_i, z_j). \tag{1}$$

Notice that the search space with M hypotheses each is $M^2$. In our experiments, we allow each module to take a different value for M, and typically use around 10 solutions for each module, leading to a mere 100 pairs, which is easily enumerable. We found that even with such a small set, at least one of the solutions in the set tends to be highly accurate, meaning that the hypothesis sets have relatively high recall. This shows the power of using a small set of diverse hypotheses. For a large M, we can exploit a number of standard ideas from the graphical models literature (e.g. dual decomposition or belief propagation). In fact, this is one reason we show the factor graph in Figure 2; there is a natural decomposition of the problem into modules.
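The exhaustive search over hypothesis pairs can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `features` is a hypothetical callable that returns the joint feature vector for a pair, and the toy hypotheses, their scores, and the weight vector are invented for the example.

```python
from itertools import product


def dot(w, phi):
    """Inner product of two equal-length sequences of floats."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def mediator_argmax(w, seg_hyps, parse_hyps, features):
    """Score every (segmentation, parse) pair and return the best one.

    `features(y, z)` is a hypothetical callable returning the joint feature
    vector for the pair; the image and caption are assumed to be captured
    inside it.
    """
    best = None
    for (i, y), (j, z) in product(enumerate(seg_hyps), enumerate(parse_hyps)):
        score = dot(w, features(y, z))
        if best is None or score > best[0]:
            best = (score, i, j)
    return best  # (score, index of segmentation, index of parse)


# Toy hypotheses (assumed data): a module score plus the attachment each implies.
segs = [{"score": 0.9, "dog_on_couch": False}, {"score": 0.5, "dog_on_couch": True}]
parses = [{"score": 0.8, "dog_on_couch": True}, {"score": 0.6, "dog_on_couch": False}]


def toy_features(y, z):
    agree = 1.0 if y["dog_on_couch"] == z["dog_on_couch"] else 0.0
    return [y["score"], z["score"], agree]


best_score, best_i, best_j = mediator_argmax([1.0, 1.0, 2.0], segs, parses, toy_features)
```

In this toy setting the independently best pair (index 0 for both modules) disagrees about the attachment, so the search selects the second parse instead. With roughly 10 hypotheses per module the loop visits only about 100 pairs.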
Training MEDIATOR. We can express the MEDIATOR score as $M(y_i, z_j) = w^\top \phi(x, y_i, z_j)$, a linear function of score and consistency features $\phi(x, y_i, z_j) = [\phi_S(y_i); \phi_S(z_j); \phi_C(y_i, z_j)]$, where $\phi_S(\cdot)$ are the single-module (semantic segmentation and PPAR module) score features, and $\phi_C(\cdot, \cdot)$ are the inter-module consistency features. We describe these features in detail in the experiments. We learn these consistency weights w from a dataset annotated with ground-truth for the two modules y, z. Let $\{y^*, z^*\}$ denote the oracle pair, composed of the most accurate solutions in the hypothesis sets. We learn the MEDIATOR parameters in a discriminative learning fashion by solving the following Structured SVM problem:

$$\begin{aligned}
\min_{w,\, \xi_{ij}} \quad & \frac{1}{2} w^\top w + C \sum_{ij} \xi_{ij} && \text{(2a)} \\
\text{s.t.} \quad & w^\top \phi(x, y^*, z^*) - w^\top \phi(x, y_i, z_j) \ge 1 - \frac{\xi_{ij}}{L(y_i, z_j)} && \text{(2b)} \\
& \xi_{ij} \ge 0 \quad \forall i, j. && \text{(2c)}
\end{aligned}$$

Intuitively, we can see that the constraint (2b) tries to maximize the (soft) margin between the score of the oracle pair and all other pairs in the hypothesis sets. Importantly, the slack (or violation in the margin) is scaled by the loss of the tuple. Thus, if there are other good pairs not too much worse than the oracle, the margin for such tuples will not be tightly enforced. On the other hand, the margin between the oracle and bad tuples will be very strictly enforced.
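To make the role of the loss-scaled slack concrete, the sketch below evaluates a slack-rescaled structured hinge objective for a given weight vector. It assumes the margin constraint has the form "oracle score minus pair score ≥ 1 − slack / loss", so the smallest feasible slack for a pair is max(0, loss · (1 − margin)); this form, the function names, and the toy numbers are assumptions made for illustration. An actual solver would minimize this quantity over w.

```python
def dot(w, phi):
    """Inner product of two equal-length sequences of floats."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def slack_rescaled_objective(w, phi_oracle, pairs, C=1.0):
    """Evaluate 0.5 * ||w||^2 + C * (sum of slacks) for one training example.

    `pairs` lists (feature_vector, loss) for every non-oracle pair. Under the
    assumed constraint, the minimal slack of a pair is
    max(0, loss * (1 - margin)), where margin = oracle score - pair score.
    """
    regularizer = 0.5 * dot(w, w)
    oracle_score = dot(w, phi_oracle)
    total_slack = 0.0
    for phi, loss in pairs:
        margin = oracle_score - dot(w, phi)
        total_slack += max(0.0, loss * (1.0 - margin))
    return regularizer + C * total_slack


# Toy example (assumed numbers): one pair is well separated from the oracle,
# the other is close to it but also has a small loss.
w = [1.0, 0.0]
phi_oracle = [2.0, 0.0]
pairs = [([0.0, 0.0], 0.5), ([1.5, 0.0], 0.1)]
objective = slack_rescaled_objective(w, phi_oracle, pairs, C=1.0)
```

Here the second pair violates the unit margin by 0.5, but because its loss is only 0.1 it contributes a slack of just 0.05, which matches the intuition that near-oracle pairs are penalized lightly.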
This learning procedure requires us to define the loss function $L(y_i, z_j)$, i.e., the cost of predicting a tuple (semantic segmentation, sentence parsing). We use a weighted average of individual losses:

$$L(y_i, z_j) = \alpha\, \ell(y^{gt}, y_i) + (1 - \alpha)\, \ell(z^{gt}, z_j). \tag{3}$$

The standard measure for evaluating semantic segmentation is average Jaccard Index (or Intersection-over-Union) (Everingham et al., 2010), while for evaluating sentence parses w.r.t. their prepositional phrase attachment, we use the fraction of prepositions correctly attached. In our experiments, we report results with such a convex combination of module loss functions (for different values of α).
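The two evaluation measures and their convex combination can be sketched as follows. This is a minimal illustration, assuming that each module's loss is one minus its accuracy and that α weights the segmentation term; both are assumptions, as are the function names and the toy label maps.

```python
def jaccard(pred, gt, label):
    """Intersection-over-Union of one class between two equally sized label maps.

    Label maps are lists of rows of integer class ids. Returns None when the
    class appears in neither map.
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += (p == label) and (g == label)
            union += (p == label) or (g == label)
    return inter / union if union else None


def class_averaged_jaccard(pred, gt, labels):
    """Mean IoU over the classes that occur in the prediction or ground truth."""
    ious = [iou for iou in (jaccard(pred, gt, c) for c in labels) if iou is not None]
    return sum(ious) / len(ious) if ious else 0.0


def combined_loss(seg_accuracy, ppar_accuracy, alpha=0.5):
    """Convex combination of module losses, each assumed to be 1 - accuracy."""
    return alpha * (1.0 - seg_accuracy) + (1.0 - alpha) * (1.0 - ppar_accuracy)


# Toy 2x2 label maps (assumed data): class 1 is over-segmented by one pixel.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
seg_acc = class_averaged_jaccard(pred, gt, labels=[0, 1])
loss = combined_loss(seg_acc, ppar_accuracy=0.5, alpha=0.5)
```

Setting `alpha=0.5` weights the two modules equally; values closer to 1 make the tuple loss follow segmentation quality more closely.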

Experiments
We now describe the setup of our experiments, provide implementation details of the modules, and describe the consistency features.
Datasets. Access to rich annotated image + caption datasets is crucial for performing quantitative evaluations. Since this is the first paper to study the problem of joint segmentation and PPAR, no standard datasets for this task exist, so we had to curate our own annotations for PPAR on three image caption datasets: ABSTRACT-50S (Vedantam et al., 2014), PASCAL-50S, and PASCAL-Context-50S (Mottaghi et al., 2014).

To curate the PASCAL-Context-50S PPAR annotations, we first selected all sentences that have prepositional phrase attachment ambiguities. We then plotted the distribution of prepositions in these sentences. The top 7 prepositions are used, as there is a large drop in the frequencies beyond these. The 7 prepositions are: "on", "with", "next to", "in front of", "by", "near", and "down". We then further sampled sentences to ensure a uniform distribution across prepositions. We perform a similar filtering for PASCAL-50S and ABSTRACT-50S (using the top-6 prepositions for ABSTRACT-50S). Details are in the supplement. We consider a preposition ambiguous if there are at least two parsings where one of the two objects in the preposition dependency is the same across the two parsings while the other object is different (e.g. (dog on couch) and (woman on couch)).

To summarize the statistics of all three datasets:
1. ABSTRACT-50S (Vedantam et al., 2014): 25,000 sentences (50 per image) with 500 images from abstract scenes made from clipart. Filtering for captions containing the top-6 prepositions resulted in 399 sentences describing 201 unique images. These 6 prepositions are: "with", "next to", "on top of", "in front of", "behind", and "under".
[...] This makes the vision task more challenging. Filtering this dataset for the top-7 prepositions resulted in a total of 966 unique images and 1,822 image-caption pairs.

Ground truth annotations for the PPAR were collected using Amazon Mechanical Turk. Workers were shown an image and a prepositional attachment (extracted from the corresponding parsing of the caption) as a phrase ("woman on couch"), and asked if it was correct. A screenshot of our interface is available in the supplement. Overall, there are 2,540 total prepositions, of which 2,147 are ambiguous (an 84.53% ambiguity rate), and 283 sentences with multiple ambiguous prepositions.

Setup. Single Module: We first show that visual features help PPAR by using the ABSTRACT-50S dataset, which contains clipart scenes where the extent and position of all the objects in the scene is known. This allows us to consider a scenario with a perfect vision system.
Multiple Modules: In this experiment we use imperfect language and vision modules, and show improvements on the PASCAL-50S and PASCAL-Context-50S datasets.
Module 1: Semantic Segmentation (SS) y. We use DeepLab-CRF (Chen et al., 2015) and DivMBest to produce M diverse segmentations of the images. To evaluate, we use image-level class-averaged Jaccard Index.
Module 2: PP Attachment Resolution (PPAR) z. We use a recent version (v3.3.1; released 2014) of the PCFG Stanford parser module (De Marneffe et al., 2006, Huang and Chiang, 2005) to produce M parsings of the sentence. In addition to the parse trees, the module can also output dependencies, which make syntactical relationships more explicit. Dependencies come in the form dependency_type(word_1, word_2), such as the preposition dependency prep_on(woman-8, couch-11) (the number indicates the word position in the sentence). To evaluate, we count the percentage of preposition attachments that the parse gets correct.

Baselines:
• INDEP. In our experiments, we compare our proposed approach (MEDIATOR) to the highest scoring solution predicted independently from each module. For semantic segmentation this is the output of DeepLab-CRF (Chen et al., 2015) and for the PPAR module this is the 1-best output of the Stanford Parser (De Marneffe et al., 2006, Huang and Chiang, 2005). Since our hypothesis lists are generated by greedy M-Best algorithms, this corresponds to predicting the $(y_1, z_1)$ tuple. This comparison establishes the importance of joint reasoning. To the best of our knowledge, there is no existing (or even natural) joint model to compare to.
• DOMAIN ADAPTATION. We learn a reranker on the parses. Note that domain adaptation is only needed for PPAR since the Stanford parser is trained on Penn Treebank (Wall Street Journal text) and not on text about images (such as image captions). Such domain adaptation is not necessary for semantic segmentation. This is a competitive single-module baseline. Specifically, we use the same parse-based features as our approach, and learn a reranker over the $M_z$ parse trees ($M_z = 10$).

Our approach (MEDIATOR) significantly outperforms both baselines. The improvements over INDEP show that joint reasoning produces more accurate results than any module (vision or language) operating in isolation. The improvements over DOMAIN ADAPTATION establish that the source of improvements is indeed vision, and not the reranking step. Simply adapting the parse from its original training domain (Wall Street Journal) to our domain (image captions) is not enough.
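The preposition dependencies described for Module 2 can be turned into (object 1, preposition, object 2) triples with a small amount of string handling, after which PPAR accuracy is a set comparison. The sketch below is illustrative only: it assumes one dependency per line in the collapsed `prep_` form quoted in the text, and the multi-word spelling `prep_next_to`, the helper names, and the example lines are assumptions rather than the authors' tooling.

```python
import re

# Matches collapsed preposition dependencies such as "prep_on(woman-8, couch-11)".
# Multi-word prepositions are assumed to be joined by underscores.
PREP_DEP = re.compile(r"prep_(\w+)\((\S+)-(\d+),\s*(\S+)-(\d+)\)")


def parse_prep_dependencies(lines):
    """Extract (object_1, preposition, object_2) triples from dependency lines."""
    triples = []
    for line in lines:
        match = PREP_DEP.fullmatch(line.strip())
        if match:
            prep, head, _, dependent, _ = match.groups()
            triples.append((head, prep.replace("_", " "), dependent))
    return triples


def ppar_accuracy(predicted, gold):
    """Fraction of gold preposition attachments that the parse reproduces."""
    gold = set(gold)
    return len(gold & set(predicted)) / len(gold)


# Assumed parser output for "A dog is standing next to a woman on a couch".
dependency_lines = [
    "nsubj(standing-4, dog-2)",
    "prep_next_to(standing-4, woman-8)",
    "prep_on(woman-8, couch-11)",
]
predicted = parse_prep_dependencies(dependency_lines)
gold = [("dog", "next to", "woman"), ("woman", "on", "couch")]
accuracy = ppar_accuracy(predicted, gold)
```

In this example the parse attaches "next to" to the verb rather than to "dog", so only one of the two gold attachments is reproduced.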
Ablative Study. Ours-CASCADE: This ablation studies the importance of multiple hypotheses. For each module (say y), we feed the single-best output of the other module $z_1$ as input. Each module learns its own weight w using exactly the same consistency features and learning algorithm as MEDIATOR and predicts one of the plausible hypotheses $\hat{y}^{\text{CASCADE}} = \operatorname{argmax}_{y \in Y} w^\top \phi(x, y, z_1)$. This ablation of our system is similar to Heitz et al. (2008) and helps us in disentangling the benefits of multiple hypotheses and joint reasoning.
Finally, we note that INDEP and Ours-CASCADE can be viewed as special cases of MEDIATOR. Let MEDIATOR-($M_y$, $M_z$) denote our approach run with $M_y$ hypotheses for the first module and $M_z$ for the second. Then INDEP corresponds to MEDIATOR-(1, 1) and CASCADE corresponds to predicting the y solution from MEDIATOR-($M_y$, 1) and the z solution from MEDIATOR-(1, $M_z$). To get an upper-bound on our approach, we report oracle, the accuracy of the most accurate tuple among the 10 × 10 tuples.
In the main paper, our results are presented where MEDIATOR was trained with equally weighted loss (α = 0.5), but we provide additional results for varying values of α in the supplement.
MEDIATOR and Consistency Features. Recall that we have two types of features: (1) score features $\phi_S(y_i)$ and $\phi_S(z_j)$, which try to capture how likely solutions $y_i$ and $z_j$ are respectively, and (2) consistency features $\phi_C(y_i, z_j)$, which capture how consistent the PP attachments in $z_j$ are with the segmentation in $y_i$. For each (object_1, preposition, object_2) in $z_j$, we compute 6 features between the object_1 and object_2 segmentations in $y_i$. Since the humans writing the captions may use multiple synonymous words (e.g. dog, puppy) for the same visual entity, we use word2vec (Mikolov et al., 2013) similarities to map the nouns in the sentences to the corresponding dataset categories.
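The noun-to-category mapping can be illustrated with cosine similarity over word embeddings. The sketch below uses hypothetical 3-dimensional toy vectors in place of real word2vec embeddings, and the category names and helper names are assumptions chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def nearest_category(noun, categories, embeddings):
    """Return the dataset category most similar to a caption noun, with its similarity."""
    sims = {c: cosine(embeddings[noun], embeddings[c]) for c in categories}
    best = max(sims, key=sims.get)
    return best, sims[best]


# Hypothetical toy vectors standing in for real word2vec embeddings.
toy_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.7, 0.7, 0.0],
    "sofa": [0.0, 0.0, 1.0],
    "puppy": [0.9, 0.2, 0.1],
    "couch": [0.1, 0.0, 0.95],
}
categories = ["dog", "cat", "sofa"]
puppy_category, puppy_similarity = nearest_category("puppy", categories, toy_embeddings)
couch_category, _ = nearest_category("couch", categories, toy_embeddings)
```

The returned similarity can double as a soft feature value, so a weak match to a category counts for less than an exact one.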
Figure 3: Example on PASCAL-50S ("A dog is standing next to a woman on a couch."). The ambiguity in this sentence is "(dog next to woman) on couch" vs "dog next to (woman on couch)". We calculate the horizontal and vertical distances between the segmentation centers of "person" and "couch" and between the segmentation centers of "dog" and "couch". We see that the "dog" is much further below the couch (53.91) than the woman (2.65). So, if the MEDIATOR model learned that "on" means the first object is above the second object, we would expect it to choose the "person on couch" preposition parsing.

• Semantic Segmentation Score Features ($\phi_S(y_i)$) (2-dim): We use ranks and solution scores from DeepLab-CRF (Chen et al., 2015).
• PPAR Score Features ($\phi_S(z_j)$) (9-dim): We use ranks and the log probability of parses from (De Marneffe et al., 2006), and 7 binary indicators for PASCAL (6 for ABSTRACT-50S) denoting which prepositions are present in the parse.
• Inter-Module Consistency Features (56-dim): For each of the 7 prepositions, 8 features are calculated:
  - One feature is the Euclidean distance between the centers of the segmentation masks of the two objects connected by the preposition. These two objects in the segmentation correspond to the categories with which the soft similarity of the two objects in the sentence is highest among all PASCAL categories.
  - Four features capture max{0, normalized-directional-distance}, where directional-distance measures above/below/left/right displacements between the two objects in the segmentation, and normalization involves dividing by height/width.
  - One feature is the ratio of sizes between object_1 and object_2 in the segmentation.
  - Two features capture the word2vec similarity between the two objects in PPAR (say 'puppy' and 'kitty') with their most similar PASCAL category (say 'dog' and 'cat'), where these features are 0 if the categories are not present in the segmentation.

A visual illustration for some of these features for PASCAL can be seen in Figure 3. In the case where an object parsed from $z_j$ is not present in the segmentation $y_i$, the distance features are set to 0. The ratio of areas features (area of smaller object / area of larger object) are also set to 0, assuming that the smaller object is missing. In the case where an object has two or more connected components in the segmentation, the distances are computed w.r.t. the centroid of the segmentation and the area is computed as the number of pixels in the union of the instance segmentation masks.

We also calculate 20 features for PASCAL-50S and 59 features for PASCAL-Context-50S that capture the consistency between $y_i$ and $z_j$, in terms of presence/absence of PASCAL categories. For each noun in PPAR we compute its word2vec similarity with all PASCAL categories. For each of the PASCAL categories, the feature is the sum of similarities (with the PASCAL category) over all nouns if the category is present in the segmentation, and is -1 times the sum of similarities over all nouns otherwise. This feature set was not used for ABSTRACT-50S, since these features were intended to help improve the accuracy of the semantic segmentation module. For ABSTRACT-50S, we only use the 5 distance features, resulting in a 30-dim feature vector.
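The six geometric features for one (object 1, preposition, object 2) triple can be sketched as below. This is a hedged illustration, not the authors' code: masks are represented as sets of (row, column) pixels, the directional distances are normalized by image height and width (the text does not say which height and width are meant, so this is an assumption), and the Euclidean distance is left in pixels.

```python
import math


def centroid(pixels):
    """Mean (row, col) of a non-empty collection of pixel coordinates."""
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)


def geometric_features(mask1, mask2, height, width):
    """Return [euclidean, above, below, left, right, area_ratio] for two masks.

    All features are 0 when either object is missing from the segmentation.
    Rows are assumed to grow downward, so a smaller row index means "above".
    """
    if not mask1 or not mask2:
        return [0.0] * 6
    (r1, c1), (r2, c2) = centroid(mask1), centroid(mask2)
    euclidean = math.hypot(r1 - r2, c1 - c2)
    dy = (r2 - r1) / height  # > 0 when object 1 is above object 2
    dx = (c2 - c1) / width   # > 0 when object 1 is left of object 2
    above, below = max(0.0, dy), max(0.0, -dy)
    left, right = max(0.0, dx), max(0.0, -dx)
    area1, area2 = len(mask1), len(mask2)
    area_ratio = min(area1, area2) / max(area1, area2)
    return [euclidean, above, below, left, right, area_ratio]


# Toy masks (assumed data) in a 10x10 image: a small object directly above a larger one.
woman = {(0, 0), (0, 1)}
couch = {(4, 0), (4, 1), (5, 0), (5, 1)}
features = geometric_features(woman, couch, height=10, width=10)
```

Because a multi-component object is stored as a single pixel set, its centroid and area are computed over the union of its components, as the text describes.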

Single-Module Results
We performed a 10-fold cross-validation on the ABSTRACT-50S dataset to pick M (=10) and the weight on the hinge-loss for MEDIATOR (C). The results are presented in Table 1. Our approach significantly outperforms 1-best outputs of the Stanford Parser (De Marneffe et al., 2006) by 20.66% (36.42% relative). This shows a need for diverse hypotheses and reasoning about visual features when picking a sentence parse. oracle denotes the best achievable performance using these 10 hypotheses.

Multiple-Module Results
We performed 10-fold cross-validation for our results on PASCAL-50S and PASCAL-Context-50S, with 8 train folds, 1 val fold, and 1 test fold, where the val fold was used to pick $M_y$, $M_z$, and C. Figure 4 shows the average combined accuracy on val, which was found to be maximal at $M_y = 5$, $M_z = 3$ for PASCAL-50S, and $M_y = 1$, $M_z = 10$ for PASCAL-Context-50S, which are used at test time. We present our results in Table 2. Our approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 17.91% (28.69% relative) for PASCAL-50S, and 12.83% (25.28% relative) for PASCAL-Context-50S. We also make small improvements over DeepLab-CRF (Chen et al., 2015) in the case of PASCAL-50S. To measure statistical significance of our results, we performed paired t-tests between MEDIATOR and INDEP. For both modules (and average), the null hypothesis (that the accuracies of our approach and baseline come from the same distribution) can be successfully rejected at p-value 0.05. For the sake of completeness, we also compared MEDIATOR with our ablated system (CASCADE) and found statistically significant differences only in PPAR.
Figure 5: Visualizations for 3 different prepositions, "on", "by", and "with" (red = high scores, blue = low scores). We can see that our model has implicitly learned spatial arrangements unlike other spatial relation learning (SRL) works.

These results demonstrate a need for each module to produce a diverse set of plausible hypotheses for our MEDIATOR model to reason about. In the case of PASCAL-Context-50S, MEDIATOR performs identically to CASCADE since $M_y$ is chosen as 1 (which is the CASCADE setting) in cross-validation. Recall that MEDIATOR is a larger model class than CASCADE (in fact, CASCADE is a special case of MEDIATOR with $M_y = 1$). It is interesting to see that the large model class does not hurt, and MEDIATOR gracefully reduces to a smaller capacity model (CASCADE) if the amount of data is not enough to warrant the extra capacity. We hypothesize that in the presence of more training data, cross-validation may pick a different setting of $M_y$ and $M_z$, resulting in full utilization of the model capacity. Also note that our domain adaptation baseline achieved an accuracy higher than MAP/Stanford-Parser, but significantly lower than our approach for both PASCAL-50S and PASCAL-Context-50S. We also performed this for our single-module experiment and picked $M_z$ (=10) with cross-validation, which resulted in an accuracy of 57.23%. Again, this is higher than MAP/Stanford-Parser (56.73%), but significantly lower than our approach (77.39%). Clearly, domain adaptation alone is not sufficient.
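The absolute and relative improvements quoted throughout follow directly from pairs of accuracies such as the single-module figures above (56.73% for the Stanford Parser and 77.39% for MEDIATOR). A small helper, with a name chosen here purely for illustration, makes the arithmetic explicit:

```python
def improvements(baseline_acc, new_acc):
    """Return (absolute, relative) improvement in percent, rounded to 2 decimals."""
    absolute = new_acc - baseline_acc
    relative = 100.0 * absolute / baseline_acc
    return round(absolute, 2), round(relative, 2)


# Single-module accuracies on ABSTRACT-50S as reported in the text.
abstract_gain = improvements(56.73, 77.39)
print(abstract_gain)  # (20.66, 36.42)
```

The relative figure divides the absolute gain by the baseline accuracy, which is why it is always the larger of the two numbers here.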
We also see that oracle performance is fairly high, suggesting that when there is ambiguity and room for improvement, MEDIATOR is able to rerank effectively.
Ablation Study for Features. Table 3 displays results of an ablation study on PASCAL-50S and PASCAL-Context-50S to show the importance of the different features. In each row, we retain the module score features and drop a single set of consistency features. We can see all consistency features contribute to the performance of MEDIATOR.
Visualizing Prepositions. Figure 5 shows a visualization for what our MEDIATOR model has implicitly learned about 3 prepositions ("on", "by", "with"). These visualizations show the score obtained by taking the dot product of distance features (Euclidean and directional) between object_1 and object_2 connected by the preposition with the corresponding learned weights of the model, considering object_2 to be at the center of the visualization. Notice that these were learned without explicit training for spatial learning as in spatial relation learning (SRL) works (Fritz, 2014, Lan et al., 2012). These were simply recovered as an intermediate step towards reranking SS + PPAR hypotheses. Also note that SRL cannot handle multiple segmentation hypotheses, which our work shows are important (Table 2, CASCADE). In addition, our approach is more general.
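A score map of this kind can be produced by sweeping the position of object 1 over a grid, computing the five distance features relative to object 2 at the center, and taking the dot product with the weights for one preposition. The sketch below is only an illustration: the weight vector is hypothetical (not learned values from the paper), and normalizing all distances by the grid size is an assumption.

```python
import math


def preposition_score_map(weights, size=21):
    """Score each object-1 position on a size x size grid, object 2 at the center.

    `weights` is a hypothetical list [w_euclid, w_above, w_below, w_left, w_right]
    standing in for the learned weights of one preposition.
    """
    center = size // 2
    grid = []
    for row in range(size):
        scores = []
        for col in range(size):
            dy = (center - row) / size  # > 0 when object 1 is above object 2
            dx = (center - col) / size  # > 0 when object 1 is left of object 2
            feats = [
                math.hypot(dy, dx),
                max(0.0, dy), max(0.0, -dy),
                max(0.0, dx), max(0.0, -dx),
            ]
            scores.append(sum(w * f for w, f in zip(weights, feats)))
        grid.append(scores)
    return grid


# Hypothetical weights that reward object 1 being above object 2, as one
# might expect for "on".
on_map = preposition_score_map([0.0, 1.0, 0.0, 0.0, 0.0], size=5)
```

Rendering such a grid as a heat map gives a picture comparable in spirit to the red and blue panels described for Figure 5.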

Discussions and Conclusion
We presented an approach for simultaneously reasoning about prepositional phrase attachment resolution of captions and semantic segmentation in images, which integrates beliefs across the modules to pick the best pair from a diverse set of hypotheses. Our full model (MEDIATOR) significantly improves the accuracy of PPAR over the Stanford Parser by 17.91% for PASCAL-50S and by 12.83% for PASCAL-Context-50S, and achieves a small improvement on semantic segmentation over DeepLab-CRF for PASCAL-50S. These results demonstrate a need for information exchange between the modules, as well as a need for a diverse set of hypotheses to concisely capture the uncertainties of each module. Large gains in PPAR validate our intuition that vision is very helpful for dealing with ambiguity in language. Furthermore, the oracle accuracies show that even larger gains are possible.
While we have demonstrated our approach on a task involving simultaneous reasoning about language and vision, our approach is general and can be used for other applications. Overall, we hope our approach will be useful in a number of settings.