SemEval-2015 Task 8: SpaceEval

Human languages exhibit a variety of strategies for communicating spatial information, including toponyms, spatial nominals, locations that are described in relation to other locations, and movements along paths. SpaceEval is a combined information extraction and classification task with the goal of identifying and categorizing such spatial information. In this paper, we describe the SpaceEval task, annotation schema, and corpora, and evaluate the performance of several supervised and semi-supervised machine learning systems developed with the goal of automating this task.


Introduction
SpaceEval builds on the Spatial Role Labeling (SpRL) task introduced in SemEval 2012 (Kordjamshidi et al., 2012) and used in SemEval 2013 (Kolomiyets et al., 2013). The base annotation scheme of the previous tasks was introduced in (Kordjamshidi et al., 2010), with empirical practices in (Kordjamshidi et al., 2011; Kordjamshidi and Moens, 2015). While those previous tasks are similar in their goal, SpaceEval adopts the annotation specification from ISOspace (Pustejovsky et al., 2011a; Moszkowicz and Pustejovsky, 2010; ISO/TC 37/SC 4/WG 2, 2014), a new standard for capturing spatial information. The SpRL task in SemEval 2012 focused on the main roles of trajectors, landmarks, and spatial indicators, and on the links between these roles that form spatial relations. The formal semantics of the relations were considered at a coarse-grained level, consisting of three types: directional, regional (topological), and distal. The related annotated data, the CLEF IAPR TC-12 Image Benchmark (Grubinger et al., 2006), contained mostly static spatial relations. In SemEval 2013, the SpRL task was extended to the recognition of motion indicators and paths, which apply to more dynamic spatial relations. Accordingly, the data set was expanded, and text from the Degree Confluence Project (Jarrett, 2013) webpages was annotated.
SpaceEval extends the task in several dimensions, first by enriching the granularity of the semantics in both static and dynamic spatial configurations, and secondly by broadening the variety of annotated data and the domains considered. In SpaceEval the concept of place is distinguished from the concept of spatial entity as a fundamental typing distinction. That is, the roles of trajector (figure) and landmark (ground) are roles that are assigned to spatial entities and places when occurring in spatial relations. Places, however, are inherently typed as such, and remain places, regardless of what spatial roles they may occupy. Obviously, an individual may assume multiple role assignments, and in both ISOspace and SpRL this is assumed to be the case. However, because SpRL focuses on role assignment, it does not introduce the general concept of spatial entity.
There are other differences in the relational schemas of SpRL and SpaceEval, but these can be easily mapped to each other. For example, in SpRL the general concept of spatial relation is defined and the semantics of the relationship (e.g., directional, regional) is added as an attribute of the relation, while in SpaceEval these semantics introduce new types of relations (e.g., QSLINK and OLINK). In addition to the variations in relational schemas, there are some further extensions in the SpaceEval annotation. These include augmenting the main elements with more fine-grained attributes, which in turn affect the way the spatial semantics are interpreted. For example, spatial entities are described with their dimensionality, form, etc. SpaceEval also strongly highlights the concepts involved in dynamic spatial relations by introducing movelink relations and motion tags for annotating motion verbs or nominal motion events and their category from the perspective of spatial semantics. These fine-grained annotations of all the relevant concepts that contribute to grasping spatial semantics make this scheme and the accompanying corpus unique. The details of the task, including the annotation schema, evaluation configurations, breakdown of the sub-tasks, data set, participant systems, and evaluation results, are described in the rest of the paper.

The Task
The goals of SpaceEval include identifying and classifying items from an inventory of spatial concepts. Participants were offered three test configurations for this task:
Configuration 1 Only unannotated test data was provided.
Configuration 2 Manually annotated spatial elements, without attributes, were provided.
Configuration 3 Manually annotated spatial elements, with attributes, were provided.
The SpaceEval task is broken down into the following sub-tasks:

Spatial Elements (SE)
a. Identify spans of spatial elements including locations, paths, events and other spatial entities. b. Classify spatial elements according to type: PATH (road, river, highway), PLACE (mountain, village), MOTION (walk, fly), NONMOTION EVENT (sit, read), SPATIAL ENTITY (any entity in a spatial relation). c. Identify their attributes according to type.

Spatial Signal Identification (SS)
a. Identify spans of spatial signals (in, on, above). b. Identify their attributes.

Motion Signal Identification (MI)
a. Identify spans of path-of-motion and manner-of-motion signals (arrive, leave, drive, walk). b. Identify their attributes.

Spatial Configuration Identification (QSLink)
a. Identify qualitative spatial relations between spatial signals and spatial elements (connected, unconnected, part-of, etc.). b. Identify their attributes.

Spatial Orientation Identification (OLink)
a. Identify orientational relations between spatial signals and spatial elements (above, under, in front of, etc.). b. Identify their attributes.

The SpaceBank Corpus
The data for this task consist of annotated textual descriptions of spatial entities, places, paths, motions, localized non-motion events, and spatial relations. The data set selected for this task, a subset of the SpaceBank corpus first described in (Pustejovsky and Yocum, 2013), consists of submissions retrieved from the Degree Confluence Project (DCP) (Jarrett, 2013), Berlitz Travel Guides retrieved from the American National Corpus (ANC) (Reppen et al., 2005), and entries retrieved from a travel weblog, Ride for Climate (RFC) (Kroosma, 2012). The DCP documents are the same set as those annotated with Spatial Role Labeling (SpRL) for SemEval-2013 Task 3 (Kolomiyets et al., 2013); for this task, however, the DCP texts were re-annotated according to ISO-Space.

Annotation Schema
The annotation of spatial information in text involves at least the following: a PLACE tag (for locations and regions participating in spatial relations); a PATH tag (for paths and boundaries between regions); a SPATIAL ENTITY tag (for spatial objects whose location changes over time); link tags (for topological relations, direction and orientation, frames of reference, and motion event participants); and signal tags (for spatial prepositions) 1 . ISO-Space has been designed to capture both spatial and spatio-temporal information as expressed in natural language texts (Pustejovsky et al., 2012). We have followed a strict methodology of specification development, as adopted by ISO TC37/SC4 and outlined in (Bunt, 2010) and (Ide and Romary, 2004), and as implemented with the development of ISO-TimeML (Pustejovsky et al., 2005) and others in the family of SemAF standards.
SpaceEval's three link tags are as follows:
1. MOVELINK - movement relations;
2. OLINK - orientation relations;
3. QSLINK - qualitative spatial relations.
QSLINKs are used in ISO-Space to capture topological relationships between tagged elements. The relType attribute values come from an extension to the RCC8 set of relations that was first used by SpatialML (Mani et al., 2010). The possible RCC8+ values include the RCC8 values (Randell et al., 1992), in addition to IN, a disjunction of TPP and NTPP.
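To make the RCC8+ extension concrete, the following minimal sketch treats IN as shorthand for the set {TPP, NTPP}. The label spellings (e.g., TPPi for the inverse relations) and the helper names expand and compatible are assumptions made for illustration; they are not taken from the SpaceEval distribution or its scorer.

```python
# RCC8 base relations (Randell et al., 1992). The exact label spellings
# used in the corpus are an assumption in this sketch.
RCC8 = frozenset({"DC", "EC", "PO", "EQ", "TPP", "TPPi", "NTPP", "NTPPi"})

# RCC8+ adds IN, described in the text as a disjunction of TPP and NTPP.
RCC8_PLUS = RCC8 | {"IN"}
DISJUNCTIONS = {"IN": frozenset({"TPP", "NTPP"})}


def expand(rel_type: str) -> frozenset:
    """Return the set of RCC8 base relations that a relType value admits."""
    if rel_type not in RCC8_PLUS:
        raise ValueError(f"not an RCC8+ value: {rel_type!r}")
    return DISJUNCTIONS.get(rel_type, frozenset({rel_type}))


def compatible(a: str, b: str) -> bool:
    """Hypothetical helper: True if two relType values share a base relation."""
    return bool(expand(a) & expand(b))
```

Under this reading, an IN label is compatible with either TPP or NTPP but not with, say, DC.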
Orientation links describe non-topological relationships. A SPATIAL SIGNAL with a DIRECTIONAL semantic type triggers such a link. In contrast to topological spatial relations, OLINK relations are built around a specific frame of reference type and a reference point. The referencePt value depends on the frame type of the link. The ABSOLUTE frame type stipulates that the referencePt is a cardinal direction. For INTRINSIC OLINKs, the referencePt is the same identifier that is given in the landmark attribute. For OLINKs with a RELATIVE frame of reference, the identifier for the viewer should be provided as the referencePt.
1 For more information, cf. (Pustejovsky et al., 2012).
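The dependency of referencePt on the frame type can be sketched as a simple consistency check. This is an illustration only: the function name, the identifier strings, and the inventory of cardinal direction values are assumptions, not part of the ISO-Space specification or the task tooling.

```python
from typing import Optional

# Illustrative inventory of cardinal directions (an assumption; the corpus
# may use a different or larger set of values).
CARDINAL_DIRECTIONS = frozenset({
    "NORTH", "SOUTH", "EAST", "WEST",
    "NORTHEAST", "NORTHWEST", "SOUTHEAST", "SOUTHWEST",
})


def reference_pt_ok(frame_type: str, reference_pt: str, landmark: str,
                    viewer: Optional[str] = None) -> bool:
    """Check an OLINK's referencePt against the rule for its frame type."""
    if frame_type == "ABSOLUTE":
        # ABSOLUTE: the reference point is a cardinal direction.
        return reference_pt.upper() in CARDINAL_DIRECTIONS
    if frame_type == "INTRINSIC":
        # INTRINSIC: the reference point is the landmark's own identifier.
        return reference_pt == landmark
    if frame_type == "RELATIVE":
        # RELATIVE: the reference point is the identifier of the viewer.
        return viewer is not None and reference_pt == viewer
    raise ValueError(f"unknown frame type: {frame_type!r}")
```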
The following samples from the RFC and ANC sub-corpora have been annotated with a subset of ISO-Space for the SpaceEval task:
Since SpaceEval builds on the SpRL shared tasks, we opted to retain the trajector and landmark attributes for labeling the participants in QSLINK and OLINK relations. This is a deviation from the ISO-Space (Pustejovsky et al., 2011b) standard, which specifies figure and ground labels based on cognitive-semantic categories explored in the semantics of motion and location by Leonard Talmy (Talmy, 1978; Talmy, 2000) and others. ISO-Space adopted the figure/ground terminology to identify the potentially asymmetric roles played by participants within spatial relations. For MOVELINKs, however, we distinguish the notion of a figure/trajector with the ISO-Space mover attribute label.
Table 1 includes corpus statistics broken down into the ANC, DCP, and RFC sub-corpora in addition to the train:test partition (∼3:1). Counts of documents, sentences, and lexical tokens are tabulated, as well as counts of each annotation tag type.

Annotation and Adjudication
All annotations for this task were of English language texts and all annotations were created and adjudicated by native English speakers. Due to dependencies of link tag elements on extent tag elements, the annotation and adjudication tasks were broken down into the following phases:
Phase 1 Extent tag span and attribute annotation.
Phase 2 Extent tag adjudication.
Phase 3 Link tag argument and attribute annotation.
Phase 4 Link tag adjudication.
Phases 2 and 4 produced gold standards from annotations in the preceding annotation phases. This annotation strategy ensured that the intermediate gold standard extent tag set was adjudicated before any link tag annotations were performed.
The annotation and adjudication effort was conducted at Brandeis University using Multidocument Annotation Environment (MAE) and Multi-annotator Adjudication Interface (MAI) (Stubbs, 2011). We used MAE to perform each phase of the annotation procedure and MAI to adjudicate and produce gold standard standoff annotations in XML format. In addition to the ISO-Space annotation tags and attributes, as a post-process, we also provided sentence and lexical tokenization as a separate standoff annotation layer in the XML data for the training and test sets.
Each document was covered by a minimum of three annotators for each annotation phase (though not necessarily the same annotators per phase). As such, we report inter-annotator agreement (IAA) in Table 2 as a mean Fleiss's κ coefficient for all extent tag types annotated in Phase 1, and as individual κ scores for each of the three link tag types annotated in Phase 3. The scores for extent tags and MOVELINK indicate high agreement; annotation of the remaining link tags, however, was less consistent. Though the OLINK and QSLINK tag agreement is better than chance, it is not high. We believe the lower agreement for these link tags reflects the complexity of the annotation task.
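For reference, Fleiss's κ can be computed as in the following sketch. It assumes that every item received the same number of labels (at least two), which is a simplification of the "minimum of three annotators" setting described above; the function name and input layout are illustrative and do not reflect the scripts used to produce Table 2.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss's kappa; ratings[i] holds one label per annotator for item i."""
    if not ratings:
        raise ValueError("no items to score")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each item needs the same number (>= 2) of labels")

    category_totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        category_totals.update(counts)
        # Proportion of agreeing annotator pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))

    p_observed = agreement_sum / len(ratings)
    total = len(ratings) * n_raters
    # Chance agreement from the overall category proportions.
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        # Kappa is undefined when only one category is ever used;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)
```

A value of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate agreement below chance.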

[Table 2: inter-annotator agreement for Extent Tags and Link Tags]

Submissions and Results
In this section we evaluate results from runs of five systems. Three systems were submitted by outside groups: Honda Research Institute Japan (HRIJP-CRF-VW), Ixa Group in the University of the Basque Country (IXA), and University of Texas, Dallas (UTD). We also present results for two systems developed internally at Brandeis University: a suite of logistic regression classifiers with minimal feature engineering, intended as a performance baseline covering all sub-tasks, and a CRF system with more advanced features but limited to sub-tasks 1a and 1b for Configuration 1.
UTD A suite of 13 classifiers for classifying spatial roles and relations including classifiers for stationary spatial relations and their participants in addition to classification of participants of motion events and their attributes.

Baseline
Our baseline classification system (BASELINE) consists of a suite of 47 classifiers built from the logistic regression implementation in Scikit-learn's (Pedregosa et al., 2011) sklearn.linear_model package. The system builds a collection of extent objects from the annotation and lexical tokenizations provided in the SpaceEval XML distribution data. Each extent instance has attributes for further feature and label extraction: the target chunk used to form the extent instance; any annotation tag associated with the chunk; lists of all surrounding tokens in the sentence, split between tokens preceding the target and those following; and a pointer to the original annotation XML for the purposes of global feature extraction and generating new XML tags based on the eventual model predictions. Some extent attributes are optional, depending on the sub-task. For example, in sub-task 1a, no attributes are required since this sub-task is a simple classification task. For link tags, extent objects are instantiated using the text chunks associated with the extent tags that serve as the link trigger. After pre-processing, the system has a complete collection of extent instances for the corpus.
Subsequent to pre-processing, the extent data are further processed for label and feature extraction. The label and feature extractors were hand-tweaked for each sub-task:
• For extent tag identification, the label extractor checks if a given token occurs at the end of a chunk, and the feature extractors include capitalization and POS tags.
• For classifying extent tag types, the feature extractors include the target chunk string, POS tag, and a seven-token context window (bounded by the sentence) centered on the target token.
• For extent tag attribute classification, the only feature extracted was the text of the chunk associated with the target tag.
• For link tag identification, a heuristic system was developed to select candidate extent tags for the trigger argument. The remaining arguments in the relation were identified by their distance and direction from the trigger. Feature extractors for this process included the text of the trigger chunk, a count of the tags in local context (the same sentence) before and after the trigger, and the types of the extent tags that occur in the context.
• For open-class link tag attributes, feature extractors included the count of extent tags before and after the trigger tag in the sentence. For closed-class link tag attributes, feature extractors were limited to the text of the trigger chunk and the trigger tag type.
• For link tag arguments that take an IDREF as a value, a unique label function was created that extracts the offsets of the candidate extent tags in the same sentence as the trigger.
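As an illustration of the second extractor above, the sketch below builds a feature dictionary from the target token, its POS tag, capitalization, and a seven-token window bounded by the sentence. It is a simplified reconstruction, not the BASELINE code: the function name and feature keys are hypothetical, and the target is treated as a single token rather than a chunk.

```python
from typing import Dict, List


def type_features(tokens: List[str], pos_tags: List[str], i: int,
                  window: int = 7) -> Dict[str, object]:
    """Features for the token at index i; names are illustrative only."""
    half = window // 2  # three tokens on each side of the target
    feats: Dict[str, object] = {
        "token": tokens[i],
        "pos": pos_tags[i],
        "is_capitalized": tokens[i][:1].isupper(),
    }
    for offset in range(-half, half + 1):
        j = i + offset
        # Skip the target itself and positions outside the sentence,
        # so the window never crosses a sentence boundary.
        if offset != 0 and 0 <= j < len(tokens):
            feats[f"ctx[{offset:+d}]"] = tokens[j].lower()
    return feats
```

Dictionaries of this shape, mapping feature names to string or numeric values, are the kind of input that a dictionary-based vectorizer expects.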
The label and feature vectors were maintained using the DictVectorizer from Scikit-learn's feature extraction module. To train the system, the vectors were used to fit the model to the training data. For decoding, the tag labels and attributes from the test data were discarded and the remaining feature vectors were transformed into a hypothesis index based on the model, which was translated to a final value using a codebook. The hypotheses were then written out to XML in accordance with the task DTD.

Brandeis CRF
In addition to the BASELINE system, we also developed a more advanced pipeline (BRANDEIS-CRF) to automate the SpaceEval sub-tasks 1a and 1b with a linear-chain conditional random field model over lexical, part-of-speech (POS), named-entity-recognition (NER), and semantic labels. We report overall F1 measures of 0.83 and 0.77 for tasks 1a and 1b, respectively, which are comparable to other top results (cf. Section 5.3). Our implementation used the CRFSuite (Okazaki, 2007) open source package, which facilitated rapid training and model inspection. The hypotheses were written out to XML in accordance with the task DTD.
We used a small set of 9 core features, augmented with bigram contexts, resulting in a total of 27 features. These features consist of lexical, syntactic, and semantic information, many of which have been applied successfully in a variety of information extraction tasks (Fei Huang et al., 2014), such as named entity recognition (Vilain et al., 2009b) or coreference resolution (Fernandes et al., 2014). The complete set of features is outlined in Table 3. For part-of-speech (POS) and named entity (NE) tags, we used the Stanford Log-linear Part-of-Speech Tagger (Toutanova et al., 2003) and the Stanford Named Entity Recognizer (Finkel et al., 2005). Additionally, we made use of Sparser (McDonald, 1996), a rule-based natural language parser, in order to provide rich semantic features. Sparser parses unstructured text in cycles, where a variety of handwritten rules apply given the applications of previous rules or the current parse of the text. After parsing, Sparser provides a set of edges, which provide both semantic and syntactic information. For our purposes, we used the CATEGORY and FORM attributes of the resulting edges. Table 4 shows that the Sparser features can be informative for this task, as five of the top ten positive weights are from Sparser. As a disclaimer, we acknowledge that model weights are not always sufficient for determining the most informative features (Vilain et al., 2009a).
However, there were several problems using Sparser. One issue is that Sparser performs its own internal tokenization and chunking, as it expects unstructured text (i.e., a string) as input. To align the already tokenized sentences with a Sparser parse, we used a matching algorithm that aligned each token with its corresponding Sparser edge. A second problem was that Sparser frequently fails on inputs, and the points of failure can be difficult to identify due to the interaction of its various phases and context-based rules. Thus, we were not able to get CATEGORY and FORM for all tokens. As a remedy, we included local forms of these Sparser features (prefixed with L), which were collected by inputting tokens by themselves to Sparser. This suggests that word lists could be very informative for this task.
7 Token character length is ≤ 5, (5..10], or > 10.
Table 5 shows mean precision (P), recall (R), F1, and accuracy (ACC) scores for each group for each evaluation configuration and sub-task that was attempted. The overall precision and recall measures we report are the arithmetic means of the precision and recall for each tag label or attribute in the corresponding sub-task. The overall, macro-average F1 measures we report are the harmonic mean of the overall P and R. Accuracy is computed as the number of correctly classified labels or attributes divided by the total number of labels or attributes in the gold standard. Overall accuracy and F1 are plotted in Appendix A.
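The overall measures described above can be sketched as follows. Note that the macro-average F1 here is the harmonic mean of the averaged P and R, as stated in the text, rather than the mean of per-label F1 scores. The function names, the (tp, fp, fn) input layout, and the choice to score a label as 0.0 when its denominator is zero are assumptions of this sketch, not details of the official scorer.

```python
from typing import Dict, Tuple


def macro_scores(per_label: Dict[str, Tuple[int, int, int]]
                 ) -> Tuple[float, float, float]:
    """per_label maps each label to (true pos., false pos., false neg.)."""
    precisions, recalls = [], []
    for tp, fp, fn in per_label.values():
        # Assumption: a label with an empty denominator scores 0.0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / len(precisions)  # arithmetic mean over labels
    r = sum(recalls) / len(recalls)
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1


def accuracy(n_correct: int, n_gold: int) -> float:
    """Correctly classified items divided by the gold-standard total."""
    return n_correct / n_gold
```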

Evaluation Results
Not all groups attempted all of the evaluation configurations. The HRIJP-CRF-VW system was evaluated only for Configuration 1 sub-tasks 1a, 1b, 1d, and 1e (not 1c) and Configuration 3 sub-tasks 3a and 3b; it was not evaluated for Configuration 2 since those sub-tasks were not attempted. The UTD submission covered only Configuration 3 and was therefore evaluated only for sub-tasks 3a and 3b.

Conclusion
It is clear from the participating system results that recognizing spatial entities as a sub-task is a fairly well-understood area, with reasonable performance. All systems using CRF models for recognizing places, paths, motion and non-motion events, and spatial entities performed well. Furthermore, MOVELINK recognition results were extremely promising, due to the general tendency for movement to be accompanied by recognizable clues. The overall poor performance for recognition of spatial relations between entities (QSLINKs and OLINKs), on the other hand, indicates that these are difficult relational identification tasks, a difficulty also reflected in the lower IAA scores for these relations.
For the next SpaceEval evaluation, we believe that a more focused task, possibly embedded within an application, would lower the barrier to entry in the competition. It would also permit us to use an extrinsic evaluation for performance of the systems. We also hope to release the SpaceBank corpus through LDC later this year. This would enable the community to become more familiar with the dataset and specification.