Anaphora Resolution for Improving Spatial Relation Extraction from Text

Spatial relation extraction from generic text is a challenging problem due to the ambiguity of the spatial meaning of prepositions as well as the nested structure of spatial descriptions. In this work, we highlight the difficulties that anaphora cause in the extraction of spatial relations. We use external multi-modal (here visual) resources to find the most probable candidates for resolving the anaphora that refer to the landmarks of spatial relations. We then use global inference to decide jointly on anaphora resolution and spatial relation extraction. Our preliminary results show that resolving anaphora improves the state-of-the-art results on spatial relation extraction.


Introduction
Spatial relation extraction is the task of determining the relations that can exist among the spatial roles extracted from text (D'Souza and Ng, 2015). In recent years, significant progress has been made in spatial language understanding, i.e. mapping natural language text to a formal spatial meaning representation (Kordjamshidi et al., 2017a; Kordjamshidi and Moens, 2015a). As a basic example, consider the sentence "A car is parked in front of a house." In this sentence, "car" is a trajector, "house" is a landmark and "in front of" is a spatial indicator. Spatial indicators signal the existence of spatial information in a sentence. A trajector is an entity whose location is described, and a landmark is a reference object for describing the location of a trajector.
Extracting spatial relations with good accuracy is still challenging (Pustejovsky et al., 2015). In particular, our investigation of the errors of previous models shows that the extraction of spatial relations becomes more difficult when the landmark in a sentence is expressed as a pronoun (such as "it", "them" or "him").
For example, consider the sentence "A narrow, rising street with colourful houses on both sides, among them a green house with balconies and a white car parked in front of it, and a blue-and-white church on the right". Some of the spatial relations in this sentence contain a landmark that is a pronoun, such as $R_1 \leftarrow [\text{a green house}]_{tr}, [\text{among}], [\text{them}]_{lm}$. This issue is an instance of the well-known anaphora resolution problem, which also hinders our goal of spatial relation extraction.
Anaphora resolution, which mostly appears as pronoun resolution, addresses the linguistic phenomenon by which a given pronoun is interpreted with the help of earlier or later items in the discourse (Mitkov, 2005). The pronoun word or phrase is referred to as the anaphor, whereas the word or phrase to which it refers is called the antecedent. As both anaphor and antecedent refer to the same object in the real world, they are termed coreferential (Mitkov et al., 2007). The antecedent of an anaphor may not be mentioned in the same sentence. For example, in the sentence "there are a couple of trees in front of it", "it" refers to an object that is not mentioned in the sentence, although that object may have been mentioned in another sentence of the document. Anaphora resolution is generally recognized as a difficult problem in natural language processing (Lee et al., 2017a; Marasovic et al., 2017).
The main research questions that we aim to address in this paper are 1) whether external knowledge from multimodal resources can help anaphora resolution in text, and 2) whether anaphora resolution can help spatial relation extraction from text (especially relations in the form of triplets: trajector, spatial indicator, landmark). To answer these questions, we incorporated anaphora resolution for the pronouns in the sentence and proposed a global machine learning model to exploit the resolved pronouns. In the first step, we find the list of possible landmarks that can replace a pronoun in a relation under consideration with a specific candidate trajector and candidate spatial indicator. We used the Visual Genome dataset (Krishna et al., 2017), an external resource, for this purpose.

Figure 1: Image textual description: "A narrow, rising street with colourful houses on both sides, among them a green house with balconies and a white car parked in front of it, and a blue-and-white church on the right"
The Visual Genome dataset provides a list of possible landmarks, which we filter based on their similarity to the candidate landmarks that appear in the sentence in order to resolve the anaphora. This information is used in the global inference model for joint prediction. We improve spatial relation extraction from text by incorporating anaphora resolution to recognize landmarks in spatial relations, which distinguishes our work from other work on anaphora resolution. The contributions of this paper include a) exploiting external visual relation datasets to inject external knowledge into our models; b) forming a joint model that imposes consistency between the decisions made by separate relation classifiers, one deciding on a candidate spatial relation with a pronoun landmark and the other on candidate spatial relations with that pronoun replaced by candidate noun resolvants; and c) obtaining state-of-the-art results on spatial information extraction by exploiting anaphora resolution. This paper presents preliminary efforts in the sense that we have not applied existing work on anaphora resolution. We do not aim at improving the current techniques in that area but only show that such resolution using visual resources can help spatial relation extraction.
The rest of this paper is organized as follows. We first describe the problem setting in Section 2; our proposed model for this problem is described in Section 3. The dataset used in the tests and the evaluation results are presented in Section 4. In Section 5, we briefly point to the related work in this area. Finally, Section 6 summarizes the conclusions and outlines directions for future work.

Problem Definition
The goal is to improve the extraction of spatial information from text by incorporating anaphora resolution for landmark candidates. We briefly define the spatial role labeling (SpRL) task, which is based on a previous formalization (Kordjamshidi et al., 2017b, 2011; Kordjamshidi and Moens, 2015b). Given a sentence $S$ segmented into phrases $P = [P_1, P_2, P_3, \ldots, P_n]$, where $P_i$ is the identifier of the $i$-th phrase in the sentence, the goal of spatial role labeling is to find the phrases that carry spatial roles (i.e. trajector (Tr), spatial indicator (Sp), landmark (Lm)), as introduced in Section 1, and to identify the links between them to form a spatial relation, $R = [Tr, Sp, Lm]$. Moreover, each spatial relation is further classified into a coarse-grained type (region, direction, distance) and into fine-grained types based on its coarse-grained type (e.g. (region, EC), (region, DC), (direction, left), (direction, right)). Figure 2 shows an example of spatial roles, spatial relations and spatial relation types extracted from a given text. In this example, the location of the statue (trajector) is described with respect to the hill (landmark) using the preposition "on" (spatial indicator). In Figure 1, the caption shows the textual description of an image, featuring multiple spatial relations ($R_1$ ← [a green ... not mentioned in the given sentence). $R_1$→landmark ($[\text{them}]_{lm}$) and $R_2$→landmark ($[\text{it}]_{lm}$) refer to [colorful houses] and [a green house] respectively. $R_1$ and $R_2$ belong to the well-known anaphora resolution problem, where the given pronoun is interpreted with the help of earlier or later items in the discourse, whereas $R_3$ and $R_4$ belong to the coreference resolution problem (Lee et al., 2017b; Ng, 2010; Martschat and Strube, 2015), which aims at finding all expressions in the document that refer to the same entity.
The hypothesis of this paper is that anaphora resolution for landmark candidates can help the inference for the extraction of roles as well as relations from sentences. In this work, we propose a model that addresses anaphora resolution for landmark candidates with the aim of improving spatial relation extraction. We assume that the antecedent (if any) of the anaphor (here, the landmark) is mentioned within the same sentence; therefore, cross-sentence anaphora resolution is not performed in this work.
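To make the problem setting concrete, the following is a minimal Python sketch of a spatial relation triplet with optional coarse-grained and fine-grained types. The class name, field names and the pronoun list are illustrative assumptions (the pronouns are the ones used as examples in this paper), not part of the SpRL formalization itself.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Assumption: pronouns treated as anaphoric landmarks, taken from the
# examples mentioned in the text ("it", "them", "him", "her").
PRONOUNS = {"it", "them", "him", "her"}


@dataclass(frozen=True)
class SpatialRelation:
    trajector: str                      # phrase whose location is described
    indicator: str                      # spatial indicator, e.g. a preposition
    landmark: str                       # reference phrase
    coarse_type: Optional[str] = None   # region, direction, or distance
    fine_type: Optional[str] = None     # e.g. EC, DC, left, right

    def has_pronoun_landmark(self) -> bool:
        """True when the landmark phrase is one of the listed pronouns."""
        return self.landmark.lower() in PRONOUNS

    def with_landmark(self, new_landmark: str) -> "SpatialRelation":
        """Return a copy of the relation with the landmark replaced."""
        return replace(self, landmark=new_landmark)


# Relation taken from the example caption: "a white car parked in front of it".
r2 = SpatialRelation("a white car", "in front of", "it")
resolved = r2.with_landmark("a green house")
```

Here `r2.has_pronoun_landmark()` is true, while the resolved copy carries a noun-phrase landmark and leaves the original relation unchanged.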

Architecture
Depending on the description in the sentence, spatial relations can contain pronoun landmarks (such as "it", "them", "him", "her"). Consider the aforementioned spatial relations $R_1$ and $R_2$ extracted from sentence $T$: $R_1$→landmark ($[\text{them}]_{lm}$) and $R_2$→landmark ($[\text{it}]_{lm}$) refer to the phrases [colorful houses] and [a green house] of $T$ respectively. The components for computing the anaphora resolution for spatial relations with pronoun landmarks are described in the following subsections.

Exploiting External Knowledge
Given a candidate spatial relation $R$ with a pronoun landmark, we are interested in finding the possible landmark objects that can occur with the given trajector and spatial indicator. For this purpose, we use an external resource, the Visual Genome relationship dataset (VG). This dataset contains the relation (preposition) between various subjects and objects; for details see Section 4.1. Given $R$, similar relations are extracted from the Visual Genome dataset $V$ by matching the preposition and the subject with $R$→spatialIndicator and $R$→trajector-headword respectively, that is, with the candidate words for the sp and tr roles.
In this way, we obtain the list of possible landmark objects and their frequencies in the VG dataset. We compute the frequency ratio per object, and this ratio is interpreted as the possibility score of a relation containing that landmark. In other words, the score $R_S$ is computed as $R_S = f_i / T_{VR}$, where $f_i$ is the frequency of having object $i$ with the given trajector-spatial indicator pair, and $T_{VR}$ is the total number of VG relations retrieved for that pair. This yields the set of possible triplets given the trajector-indicator pair, with a score assigned to each triplet. We denote this set as $U_R = \{(U_{R_i}, S_{U_{R_i}})\}$, where $U_{R_i}$ and $S_{U_{R_i}}$ are the $i$-th unique relation and its score respectively.
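The retrieval and frequency-ratio scoring step can be sketched as follows. This is a minimal illustration, assuming that the VG relationships are available as (subject, predicate, object) string triples and that the ratio is normalized by the number of relations matching the trajector-indicator pair. The function name and the toy triples are hypothetical and are not taken from Visual Genome.

```python
from collections import Counter


def landmark_scores(vg_relations, trajector_head, indicator):
    """Score each object seen with (trajector_head, indicator).

    vg_relations: iterable of (subject, predicate, object) triples.
    Returns {object: frequency / total matching relations}.
    """
    counts = Counter(
        obj
        for subj, pred, obj in vg_relations
        if subj == trajector_head and pred == indicator
    )
    total = sum(counts.values())
    if total == 0:
        # No matching relation: no landmark can be proposed.
        return {}
    return {obj: n / total for obj, n in counts.items()}


# Toy triples invented for illustration only.
toy_vg = [
    ("car", "in front of", "house"),
    ("car", "in front of", "house"),
    ("car", "in front of", "building"),
    ("car", "on", "road"),
    ("dog", "in front of", "house"),
]
scores = landmark_scores(toy_vg, "car", "in front of")
```

With these toy triples, "house" receives two thirds of the mass and "building" one third, while "road" is excluded because its predicate does not match the spatial indicator.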

Scoring Landmark Candidate Resolvants
For each sentence, we perform a pre-processing step based on previous work and obtain a set of noun phrases that serve as the landmark candidates, denoted by $S_L$. The triplets retrieved from Visual Genome, $U_R$, can contain many landmarks that do not exist in our landmark candidate set. Therefore, in this step, we compute the similarity score (using Google Word2Vec) between each landmark in $S_L$ and all $U_R$ landmarks. For each pair, we average the similarity score and the occurrence score of the $U_R$ landmark, and the final score of a candidate landmark in the sentence is the maximum of these averages over all $U_R$ candidates. In this way, we obtain a score for each candidate landmark in $S_L$.
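The candidate scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions: the hand-made two-dimensional vectors stand in for Word2Vec embeddings, cosine similarity is assumed as the similarity measure, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_candidates(candidates, vg_scores, vectors):
    """For each sentence candidate, take the maximum over VG landmarks of
    the average of (similarity to the VG landmark, its occurrence score)."""
    return {
        cand: max(
            ((cosine(vectors[cand], vectors[vg_lm]) + occ) / 2
             for vg_lm, occ in vg_scores.items()),
            default=0.0,  # no VG landmark retrieved
        )
        for cand in candidates
    }


# Toy embeddings invented for illustration (not real Word2Vec vectors).
vectors = {
    "house": (1.0, 0.0),
    "building": (0.8, 0.6),
    "street": (0.0, 1.0),
}
vg_scores = {"house": 2 / 3, "building": 1 / 3}
scored = score_candidates(["house", "street"], vg_scores, vectors)
```

In this toy setting "house" outranks "street", since it matches a frequent VG landmark exactly, whereas "street" is only moderately similar to "building".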

Learning Model
We formulate this problem as a structured output prediction problem where, given a set of input-output pairs as training examples, $E = \{(x_i, y_i) \in X \times Y : i = 1..N\}$, an objective function $g(x, y; W) = \langle W, f(x, y) \rangle$ is learned. This function is a linear discriminant function defined over the combined feature representation of inputs and outputs, denoted by $f(x, y)$. However, in this work, independent classifiers are trained per role and relation, and only the prediction is performed with global inference, as in (Kordjamshidi et al., 2017a; Rahgooy et al., 2018).
We construct a graph using the phrases $\{p_1, \ldots, p_n\}$ (i.e. each phrase is a node in the graph) and link these nodes to make composed concepts such as relations. A classifier is associated with each concept in the graph, and the domain knowledge is encoded over these concepts by global constraints. Global reasoning is imposed over these classifiers to produce the final outputs by using these constraints. Furthermore, we use binary classifiers to classify the spatial roles and relations, where trajector, spatial indicator and landmark are denoted by tr, sp, lm respectively, and sp.tr.lm, sp.tr.lm.γ, sp.tr.lm.λ denote spatial relations, coarse-grained relations, and fine-grained relations. Additionally, we denote the new-relation-classifier described in Section 3.5 by sp.tr.lm.NRC.
Each phrase in the sentence is described by a vector of linguistic features denoted by $\psi_{phrase}(p_i)$ (e.g. word form, POS tag, headword POS tag, dependency relation, subcategorization, etc.); these features are used by the spatial role classifiers. A spatial relation is composed of three phrases $(p_i, p_j, p_k)$; therefore, the combination of these phrases along with their descriptive vectors is used in the spatial relation feature set, referred to as $\phi^{text}_{triplet}(p_i, p_j, p_k)$ (e.g. distance between trajector and spatial indicator, concatenation of trajector, spatial indicator, and landmark). These features were proposed by Roberts and Harabagiu (2012) and Kordjamshidi et al. (2017a).
1. $\sum_{i} \sum_{k} sp_i tr_j lm_k \geq tr_j$: Each tr candidate should appear in at least one relation.
2. $\sum_{i} \sum_{j} sp_i tr_j lm_k \geq lm_k$: Each lm candidate should appear in at least one relation.
3. $\sum_{j} \sum_{k} sp_i tr_j lm_k = sp_i$: Each sp candidate should appear in one relation.
4. $\sum_{j} tr_j \geq sp_i$: For each sp we should have at least one tr.
5. $\sum_{k} lm_k \geq sp_i$: For each sp we should have at least one lm.
6. $sp_i tr_j lm_k \gamma \leq sp_i tr_j lm_k$: is-a constraints between relations and coarse-grained types.
7. $sp_i tr_j lm_k \lambda \leq sp_i tr_j lm_k \gamma$, $\lambda \in \Lambda_\gamma$: is-a constraints between coarse-grained and corresponding fine-grained types, where $\Lambda_\gamma$ denotes the candidate fine-grained types related to coarse-grained type $\gamma$.
8. $sp_i tr_j lm_k NRC \leq sp_i tr_j lm_k$: A spatial relation with a pronoun candidate should be classified as true if any one of the top $N$ anaphora-resolved triplets is classified as true.

Constraints
The global constraints used in our proposed model are a combination of previously proposed constraints (1-7) (Rahgooy et al., 2018) and a new one (constraint 8), described in Table 3.3. The global inference is performed using integer linear programming techniques subject to these constraints.
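To illustrate how such constraints restrict the joint assignment, the following Python sketch performs inference by exhaustive enumeration on a toy problem with a single spatial indicator candidate. It is not the integer linear programming solver used in this work: brute force is swapped in only because it is self-contained, and it covers constraints 1-5 only. The local classifier scores are invented for the example, and all function names are hypothetical.

```python
from itertools import product


def satisfies(sp, tr, lm, rel):
    """Check constraints 1-5 for one spatial indicator candidate.
    rel[j][k] is 1 when (sp, tr_j, lm_k) is predicted as a relation."""
    n_tr, n_lm = len(tr), len(lm)
    # 1: every active trajector appears in at least one relation
    if any(sum(rel[j]) < tr[j] for j in range(n_tr)):
        return False
    # 2: every active landmark appears in at least one relation
    if any(sum(rel[j][k] for j in range(n_tr)) < lm[k] for k in range(n_lm)):
        return False
    # 3: the relations of this indicator sum to the indicator variable
    if sum(sum(row) for row in rel) != sp:
        return False
    # 4 and 5: an active indicator needs a trajector and a landmark
    return sum(tr) >= sp and sum(lm) >= sp


def global_inference(s_sp, s_tr, s_lm, s_rel):
    """Maximize the sum of (score * binary variable) over feasible assignments."""
    n_tr, n_lm = len(s_tr), len(s_lm)
    best, best_value = None, float("-inf")
    for bits in product((0, 1), repeat=1 + n_tr + n_lm + n_tr * n_lm):
        sp = bits[0]
        tr = bits[1:1 + n_tr]
        lm = bits[1 + n_tr:1 + n_tr + n_lm]
        flat = bits[1 + n_tr + n_lm:]
        rel = tuple(flat[j * n_lm:(j + 1) * n_lm] for j in range(n_tr))
        if not satisfies(sp, tr, lm, rel):
            continue
        value = (
            s_sp * sp
            + sum(s * v for s, v in zip(s_tr, tr))
            + sum(s * v for s, v in zip(s_lm, lm))
            + sum(s_rel[j][k] * rel[j][k]
                  for j in range(n_tr) for k in range(n_lm))
        )
        if value > best_value:
            best, best_value = (sp, tr, lm, rel), value
    return best, best_value


# Invented local scores: one indicator, two trajector and two landmark candidates.
best, value = global_inference(
    1.0, [0.5, -0.2], [-0.1, 0.4], [[0.3, 0.9], [0.1, -0.5]]
)
```

In this toy instance the landmark classifier alone would reject the first landmark and accept the second, and the joint assignment agrees: it activates the indicator, the first trajector and the second landmark, linked by a single relation. Enumeration grows exponentially with the number of candidates, which is why an integer linear programming solver is the practical choice.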

Global Prediction Model
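We obtain the output of each classifier in the model holistically by global reasoning, that is, by considering global correlations among classifiers when calculating outputs. This goal is achieved by optimizing an objective function that is the summation of the classifiers' discriminant functions. The global objective function for our model, built on our previous work, is as follows:

$$
\begin{aligned}
& \sum_{i \in C_{sp}} \langle W_{sp}, \phi_{sp_i} \rangle \cdot sp_i + \sum_{i \in C_{tr}} \langle W_{tr}, \phi_{tr_i} \rangle \cdot tr_i + \sum_{i \in C_{lm}} \langle W_{lm}, \phi_{lm_i} \rangle \cdot lm_i \\
& + \sum_{i \in C_{sp}} \sum_{j \in C_{tr}} \sum_{k \in C_{lm}} \langle W_{sptrlm}, \phi_{sp_i tr_j lm_k} \rangle \cdot sp_i tr_j lm_k \\
& + \sum_{\gamma \in \Gamma} \sum_{i \in C_{sp}} \sum_{j \in C_{tr}} \sum_{k \in C_{lm}} \langle W_{sptrlm}, \phi_{sp_i tr_j lm_k \gamma} \rangle \cdot sp_i tr_j lm_k \gamma \\
& + \sum_{\lambda \in \Lambda} \sum_{i \in C_{sp}} \sum_{j \in C_{tr}} \sum_{k \in C_{lm}} \langle W_{sptrlm}, \phi_{sp_i tr_j lm_k \lambda} \rangle \cdot sp_i tr_j lm_k \lambda \\
& + \sum_{\tau \in \Upsilon} \sum_{i \in C_{sp}} \sum_{j \in C_{tr}} \sum_{k \in C_{lm}} \langle W_{sptrlm}, \phi_{sp_i tr_j lm_k NRC \gamma} \rangle \cdot sp_i tr_j lm_k NRC .
\end{aligned}
$$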
Each classifier is represented as a binary variable, and Λ, Γ, Υ are the candidates for fine-grained relations, coarse-grained relations, and pronoun-landmark spatial relations respectively. The following model variations are designed to evaluate the performance of the proposed model. In all model variations, the CLEF 2017 mSpRL dataset described in Section 4.1 is used for training and evaluating the classifiers.
• Anaphora-Replacement (A-Replacement): In this model, for each spatial relation whose landmark is a pronoun, we replace the landmark phrase text with the highest-scored probable landmark (see 3.2). This approach is used for both training and testing. Furthermore, we train independent classifiers for spatial role and relation classification. This is a learning-only model in which each classifier makes independent predictions. This model does not use any constraints, and it is compared with the similar baseline model of Rahgooy et al. (2018) in Section 4.
• Anaphora-Inference (A-Inference): In this model, 1) we create an additional triplet classifier for classifying the relations that contain pronoun landmarks, which we name the new-relation-classifier (NRC) and use at inference time, and 2) joint prediction is performed using the constraints described in Section 3.4 to optimize the global objective function explained in Section 3.5, which includes the new-relation-classifier. This implies that the relation classifier and the new-relation-classifier are assigned values jointly and should agree. For training the new-relation-classifier, we generate additional examples by replacing the pronoun landmarks in the ground truth with the highest-scored landmark from our candidate set, $S_L$. The original spatial relations with pronoun landmarks are also retained in the training. The training mechanism of the remaining classifiers is unchanged (i.e. they are trained on the original spatial relations). In the testing phase, we take the top $N$ candidates from the scored landmarks generated in 3.2 for spatial relations with pronoun landmarks. In this way, we generate a set of candidate triplets by replacing the pronoun with the top probable landmarks. Our global inference decides jointly with the original triplet classifier so that the following constraint is satisfied: if any one of these triplets is predicted as true, the spatial relation classifier is forced at inference time to predict the spatial relation with the anaphor as true. See constraint 8 in Section 3.4. The experiments show that this simple idea can improve relation extraction when anaphora occur in the triplet candidates.
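The test-time behaviour of A-Inference can be sketched as follows. This is a simplified illustration that applies constraint 8 to hard decisions, whereas the model described above enforces it inside the joint integer linear program. The function names, the landmark scores and the stand-in for the new-relation-classifier are hypothetical.

```python
def top_n_resolutions(trajector, indicator, scored_landmarks, n):
    """Build candidate triplets by replacing the pronoun landmark with the
    n highest-scored landmark candidates."""
    ranked = sorted(scored_landmarks.items(), key=lambda kv: kv[1], reverse=True)
    return [(trajector, indicator, lm) for lm, _ in ranked[:n]]


def apply_constraint_8(original_pred, resolved_preds):
    """Constraint 8 on hard decisions: if any anaphora-resolved triplet is
    predicted true, the pronoun-landmark relation is forced to be true."""
    return bool(original_pred) or any(resolved_preds)


# Invented landmark scores for the pronoun in "a white car parked in front of it".
scored = {"a green house": 0.83, "street": 0.47, "a blue-and-white church": 0.20}
triplets = top_n_resolutions("a white car", "in front of", scored, n=2)


def toy_nrc(triplet):
    """Hypothetical stand-in for the new-relation-classifier: it accepts only
    the triplet whose landmark is 'a green house'."""
    return triplet[2] == "a green house"


# The original relation classifier rejected the pronoun relation (False),
# but one resolved triplet is accepted, so the final decision is True.
final = apply_constraint_8(False, [toy_nrc(t) for t in triplets])
```

The choice of $N$ trades recall against noise: a larger $N$ gives the pronoun relation more chances of being accepted through a resolved triplet, at the cost of admitting less plausible landmarks.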

Datasets
CLEF 2017 mSpRL dataset: Our model is evaluated on this dataset, which is a subset of the IAPR TC-12 Benchmark annotated specifically for the SpRL task. The training set contains 761 spatial relations, whereas the test set contains 939 (Kordjamshidi et al., 2017b). The total number of spatial relations containing a pronoun landmark in the training and test sets is 44 and 129 respectively.
Visual Genome dataset (VG): The Visual Genome dataset has seven main components (Krishna et al., 2017), one of which is 'relationships', which contains the relationships between pairs of objects in the images. Each relation has two arguments: the first is referred to as the subject and the second as the object. These relationships can be actions, spatial relations, prepositions, verbs, comparatives or prepositional phrases. The Visual Genome dataset contains 108077 images, and its relationships part contains 2316104 relation instances. This dataset is used to obtain the possible landmarks that can occur in a relationship with a given subject.

Experimental Results
In this section, we experimentally show the effectiveness of our proposed model in improving spatial role and relation extraction. We use Saul (Kordjamshidi et al., 2016) to implement the models and solve the global inference of Section 3.5. The code is publicly available. We compare our approach with the state of the art (Rahgooy et al., 2018). In that paper, however, the authors use visual data from the accompanying images to improve the models. In this paper, we use their best model that is trained on text only (referred to here as M0 for the baseline and M0+C for the baseline plus constraints), and we ignore the visual information that is aligned with the text.

The experimental results in Table 4 show that our baseline model (A-Replacement) is significantly better than the state-of-the-art baseline model (M0). This shows that replacing the pronoun landmark candidates with the probable landmark proposed by our model has a positive impact on the extraction of spatial roles (as shown in Table 2) and relations. The results improve because the prediction of spatial roles improves, which gives the model more confidence to classify triplets as spatial relations; this leads to more positive predictions and higher recall of the relations. Furthermore, our second model (A-Inference), in which we train an additional new-relation-classifier on generated examples and perform joint inference, further improves the results over the state-of-the-art model with constraints (M0+C). The experimental results in Table 3 show that adding constraints to our second model (A-Inference) significantly improves the classification of spatial roles (i.e. trajectors and landmarks), whereas the classification of spatial indicators improves only slightly. These constraints also help improve the coarse-grained spatial relations, as shown in Table 6, although they have no impact on the distance category because the number of examples in the test set is very small (i.e. three instances only).
Our results improve on the state-of-the-art models for spatial relation extraction. Both proposed models significantly improve the extraction of spatial roles and relations (when compared with independent learning and with constrained models). However, the results for some categories of the fine-grained relations, which are not reported here, drop. It will be interesting to investigate what caused this drop in fine-grained relation types. These results are preliminary, and we will further analyze our models. In particular, we will use existing anaphora resolution models to see how they could help and to provide a more reasonable baseline. This baseline will help us evaluate the advantage of the external visual knowledge more clearly. In addition to such further analysis, this work can be extended in two possible directions: 1) incorporating cross-sentence anaphora resolution for landmark candidates, and 2) incorporating co-reference resolution in general for all spatial relations.

Related Work
Our proposed model is a joint model that considers anaphora resolution to help spatial information extraction. Anaphora resolution is a fundamental problem in natural language processing, and existing techniques can broadly be categorized into two types: 1) rule-based models, which apply rules to reduce candidate antecedents and resolve anaphora, and 2) statistical models, which use probabilistic models for the resolution of anaphora (Lee et al., 2017a). Early work (Hobbs, 1978; Asher and Wada, 1988; Lappin and Leass, 1994; Morton, 2000) focused on designing rule-based systems for anaphora resolution (the target was finding antecedents of pronouns only); however, these systems relied heavily on handcrafted rules and weights. In the early 2000s, Soon et al. (2001), Yang et al. (2003) and Ng and Cardie (2002) used statistical machine learning methods to resolve co-reference. These methods share a common strategy, that is, training a statistical model to measure how likely a pair of mentions is to corefer. However, each candidate is resolved independently of the others, which means that how good a candidate antecedent is relative to the others is not considered. To address this problem, Denis and Baldridge (2009) proposed a model that combines machine learning with global inference to perform the resolution jointly. Recently, Park et al. (2016) proposed a mention pair model using deep learning and a system that combines rule-based and deep learning-based systems using a guided MP model for co-reference resolution.
According to Lee et al. (2017a), machine learning based models for anaphora resolution are relatively easy to build compared to rule-based models; however, a large amount of handcrafted feature design is required to build a successful anaphora resolution model. Furthermore, the authors highlighted four key features of an ideal anaphora resolution system, one of which is that antecedent features should be learned automatically (i.e. minimal human design effort should be required). The proposed model does not require any handcrafted features or rules to implement the anaphora resolvers.
Joint models have been proposed for resolving co-references with mention head detection using underlying integer linear programming, as we do here (Peng et al., 2015). The main difference between our work and the research mentioned above is that we do not directly solve the anaphora resolution problem; instead, we use a kind of indirect supervision from an external multi-modal resource to help anaphora resolution, and by means of that we solve our specific target problem. Our target problem of spatial information extraction has not previously been performed jointly with either anaphora or co-reference resolution. However, resolving co-references in the multi-modal setting has been investigated recently (Huang et al., 2017), where text and video refer to the same scene and help each other in the resolution. As pointed out above, this is different from using the vision modality as a source of distant supervision, which is our aim in this work.

Conclusion
In this paper, we investigated the challenging issues in the extraction of spatial relations, that is, the triplets of (spatial indicator, trajector, landmark), from generic text. In particular, we highlighted one important problem: anaphora occurring in the text make recognizing landmarks, and consequently recognizing the spatial relations, difficult. In the presence of anaphora, recognizing the right link between the objects described in the text and extracting the relations correctly for any arbitrary pair of objects becomes more challenging. Our proposed solution is to use external visual resources that help to find the most probable landmarks for a specific object and to obtain the possible resolutions with a score. Using the scored resolutions, we perform global inference to decide on both the anaphora resolution and the spatial relation extraction jointly. Our best model improves the state-of-the-art results in all of the precision, recall and F1 metrics, with a more positive (about +2%) influence on the recall of spatial relation extraction. While our preliminary experimental results show the advantage of anaphora resolution in spatial relation extraction, we will investigate more sophisticated baselines in the future to evaluate the advantage of the external knowledge resources used in this work versus using existing approaches for anaphora resolution in our models.