SpRL-CWW: Spatial Relation Classification with Independent Multi-class Models

In this paper we describe the SpRL-CWW entry into SemEval 2015: Task 8 SpaceEval. It detects spatial and motion relations as defined by the ISO-Space specifications in two phases: (1) it detects spatial elements and spatial/motion signals with a Conditional Random Field model that uses a combination of distributed word representations and lexico-syntactic features; (2) given relation candidate tuples, it simultaneously detects relation types and labels the spatial roles of participating elements by using a combination of syntactic and semantic features in independent multi-class classification models for each relation type. In evaluation on the shared task data, our system performed particularly well on detection of elements and relations in unannotated data.


Introduction
Understanding human language about location and motion is important for many applications including robotics, navigation systems, and wearable computing. Shared tasks dedicated to the problem of representing and detecting spatial and motion relations have been organized for SemEval 2012 (Kordjamshidi et al., 2012), 2013 (Kolomiyets et al., 2013), and 2015. In this paper we present SpRL-CWW, our entry to SemEval 2015 Task 8: SpaceEval, and present extended evaluation of our system to investigate the impact of the task annotations and system configurations on task performance. Kordjamshidi et al. (2011) proposed the task of Spatial Role Labeling (SpRL) to detect spatial and motion relations in text. SpRL was modeled after semantic role labeling (see Fillmore et al., 2003; Màrquez et al., 2008), with spatial indicators instead of predicates signaling the presence of relations, and spatial roles instead of semantic roles.

SpaceEval Task Definition
In a canonical example of a spatial relation from Kordjamshidi et al. (2011), the spatial indicator (SP) on indicates that there is a spatial relation between the trajector (TR; primary object of spatial focus) and the landmark (LM; secondary object of spatial focus). SpRL was formalized as a task of classifying tuples of <w_SP, w_TR, w_LM> as spatial relations or not. The SpRL task was reformulated and reintroduced in SpaceEval using the ISO-Space annotation specifications (Pustejovsky et al., 2012). The biggest change was the decoupling of the semantic type and role of spatial relation arguments. A taxonomy of Spatial Element (SE) types was introduced to describe the meaning of arguments independent of their participation in relations, and spatial roles were treated as instance-specific annotations on spatial and motion relations.
The SE types introduced are: SPATIAL_ENTITY, PATH, PLACE, MOTION, NON_MOTION_EVENT, and MEASURE. Two types were also introduced to represent expressions that indicated the presence of relations: SPATIAL_SIGNAL and MOTION.
Spatial and motion relations were redefined as:
• MOVELINK: motion relation
• QSLINK: qualitative spatial relation
• OLINK: spatial orientation relation
Examples of SpaceEval annotations are given in Figure 1. The training data for SpaceEval consists of portions of the corpora from past SemEval SpRL tasks as well as a new dataset consisting of passages from guidebooks. Following the schema described in this section, a total of 6,782 spatial elements and signals comprising 2,186 relations were annotated. We participated in the task configurations given in Figure 2, as defined by the official SpaceEval task description.

Related Research
KUL-SKIP-CHAIN-CRF (Kordjamshidi et al., 2011) was a skip-chain CRF-based sequential labeling model. It used a combination of lexico-syntactic information and semantic role information and used preposition templates to represent long distance dependencies. It was used as a baseline system in the SemEval 2012 and 2013 SpRL tasks.
UTD-SpRL (Roberts and Harabagiu, 2012) was an entry into the SemEval 2012 SpRL task that adopted a joint relation detection and role labeling approach with the motivation that roles in spatial relations were dependent on each other. The approach used heuristics to gather spatial relation candidate tuples. A hand-crafted dictionary was used to detect SPATIAL_INDICATOR candidates, and noun phrase heads were treated as TRAJECTOR and LANDMARK candidates. A model for relation classification and role labeling was then trained with LIBLINEAR using POS, lemma, and dependency-path-based features, with feature selection used to prune away ineffective features.
UNITOR-HMM-TK (Bastianelli et al., 2013) detected spatial indicators and roles with sequential labeling using SVM^hmm, with detected indicators used as features for spatial role labeling. In addition, shallow grammatical features in the form of POS n-grams were used in place of richer syntactic information in order to avoid overfitting. The model also used PMI-score based word space representations as described in (Sahlgren, 2006).
UNITOR-HMM-TK's approach to spatial relation identification avoided feature engineering by employing an SVM model with a smoothed partial tree kernel over modified dependency trees to capture syntactic information. More recent work on spatial relation identification includes (Kordjamshidi and Moens, 2014).
Figure 3: The SpRL-CWW system architecture. Spatial elements and signals are detected, from which relation candidate tuples are generated, and then relations with their arguments labeled are identified by a separate classifier for each relation type. The red arrow indicates special trigger dictionary processing that is only carried out for SpaceEval tasks 1d and 1e, and for Setting F of the relation classification task extended evaluation in Table 3.

Approach
SpRL-CWW uses a feature-rich CRF model to jointly label spatial elements and spatial/motion signals. Previous approaches (Kordjamshidi et al., 2011; Bastianelli et al., 2013) proposed a two-step sequential labeling method for this task. In the first step, they label spatial signals, since these indicate the presence of a relation on which spatial roles depend. In the second step, they label all the other spatial roles in the sentence using the extracted signals as features. However, any errors made in the first step degrade the performance of the second. Furthermore, for SpaceEval 2015 the spatial element annotations are less likely to depend on the presence of a relation and can be detected independently. Thus, our system avoids the performance degradation associated with pipeline approaches by combining the two steps.
SpRL-CWW's CRF model labels each word in a sentence with one of the labels described in Section 2, or with NONE. In line with UNITOR-HMM-TK (Bastianelli et al., 2013), shallow lexico-syntactic features are applied instead of the full syntax of the sentence to avoid over-fitting the training data. We use word vectors trained on Web-scale corpora for a fine-grained lexical representation.
An example of our feature representation for the sentence "Saitama is northwest of Tokyo." is given in Figure 4.
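As a rough illustration of how such a per-token representation could be assembled, the following Python sketch builds one feature dictionary per word. The feature names, the hand-assigned POS tags, and the two-dimensional toy vectors are all assumptions made for this example; they are not the actual feature set shown in Figure 4.

```python
from typing import Dict, List, Sequence


def token_features(
    words: Sequence[str],
    pos_tags: Sequence[str],
    vectors: Dict[str, List[float]],
    i: int,
) -> Dict[str, float]:
    """Build a feature dict for the i-th token (hypothetical feature names)."""
    feats: Dict[str, float] = {
        "word=" + words[i].lower(): 1.0,
        "pos=" + pos_tags[i]: 1.0,
    }
    # Shallow context: POS tags of the neighbouring tokens, if any.
    if i > 0:
        feats["prev_pos=" + pos_tags[i - 1]] = 1.0
    if i < len(words) - 1:
        feats["next_pos=" + pos_tags[i + 1]] = 1.0
    # Distributed representation: one real-valued feature per vector dimension.
    for d, value in enumerate(vectors.get(words[i].lower(), [])):
        feats[f"vec_{d}"] = value
    return feats


words = ["Saitama", "is", "northwest", "of", "Tokyo", "."]
pos_tags = ["NNP", "VBZ", "RB", "IN", "NNP", "."]  # illustrative tags
toy_vectors = {"northwest": [0.1, -0.3], "tokyo": [0.7, 0.2]}  # made-up values

sentence_features = [
    token_features(words, pos_tags, toy_vectors, i) for i in range(len(words))
]
```

A list of dictionaries in this shape is what common CRF toolkits accept as the observation sequence for one sentence, with the SE/signal labels (or NONE) supplied as the parallel label sequence.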

Datasets
We evaluated our system on the SpaceEval training data as described in Section 2, and additionally on the SpaceEval Task 3 test data, which was distributed with gold labeled Spatial Elements, Indicators, and Motions. The test data consisted of 16 files with 317 sentences and 1,609 spatial roles.

Results
Official task results for spatial element/signal identification (Task 1a) and classification (Task 1b) are shown in Table 1.
We performed a more detailed evaluation using 5-fold cross validation on the training data and on the released gold test data. Our results are presented in Table 2. These results have an f1-score that is slightly lower than the officially reported result.5 Evaluation over the test data produced a slightly higher f1-score than on the training data. We theorize that this is because each cross-validation model is trained on a smaller dataset.
4 http://www-nlp.stanford.edu/projects/glove/
5 As the official evaluation data and scripts have not been fully released at the time of writing, it is not possible to determine the cause of the discrepancy in f1-scores. Comparison between strict and "relaxed" matching as used in prior SemEval SpRL tasks did not account for the difference.

Approach
To identify spatial relations, the SpRL-CWW system determines which spatial elements and signals can be combined to form valid spatial relations. Since the type of a relation (MOVELINK, QSLINK, or OLINK) is dependent upon its arguments, our method, inspired by UTD-SpRL (Roberts and Harabagiu, 2012), jointly classifies spatial relations and labels participating arguments in one classification step. We aim to simplify our model and improve learning by only considering relations that contain a trigger and by labeling only the following attributes, which correspond to primary spatial and motion roles:
• MOVELINK: trigger, mover, goal
• QSLINK and OLINK: trigger, trajector, landmark
Table 3: Settings for extended relation detection evaluation over the SpaceEval 2015 training data. All evaluation is conducted with 5-fold cross validation, the full RE feature set from Figure 5, gold standard SEs, and gold standard triggers. The overall precision, recall, and f1-scores are reported for each setting with the highest performing in bold. Setting A was used for our official submission. Where indicated, L2 regularization was performed with λ_2 = 1 × 10^-14.

Candidate Trigger Extraction
First, candidate triggers are extracted from each sentence. The model we presented for detecting signals in Section 4.1 has a high f1-score but low recall. Because we want to prioritize recall when generating candidate tuples, dictionaries of triggers automatically compiled from the training data are used to extract potential triggers when classifying relations on unannotated text. These dictionaries are used in Tasks 1d and 1e in Figure 2. In Task 3, where gold spatial roles are provided, MOTIONs are used as potential MOVELINK triggers and SPATIAL_SIGNALs are used as potential QSLINK and OLINK triggers. Evaluation of the trigger dictionaries shows that they have much higher recall than the CRF models.6 Additional relation classification evaluation in Table 3 shows that the dictionaries (Setting F) achieve an f1-score improvement of 0.055 over the CRF models (Setting G).
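The dictionary-based extraction can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the lowercased exact-match lookup, and the three-token maximum phrase length are hypothetical choices, since the text does not specify how matching was carried out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def build_trigger_dictionary(
    annotated: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Map each trigger type to the lowercased phrases annotated with it."""
    dictionary: Dict[str, Set[str]] = defaultdict(set)
    for phrase, trigger_type in annotated:
        dictionary[trigger_type].add(phrase.lower())
    return dictionary


def extract_candidate_triggers(
    tokens: Sequence[str],
    dictionary: Dict[str, Set[str]],
    max_len: int = 3,  # assumed upper bound on trigger length in tokens
) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans whose text occurs in the dictionary."""
    lowered = [t.lower() for t in tokens]
    spans: List[Tuple[int, int, str]] = []
    for start in range(len(lowered)):
        for end in range(start + 1, min(start + max_len, len(lowered)) + 1):
            phrase = " ".join(lowered[start:end])
            for trigger_type, phrases in dictionary.items():
                if phrase in phrases:
                    spans.append((start, end, trigger_type))
    return spans


# Toy "training" annotations, invented for this example.
training_annotations = [
    ("northwest of", "SPATIAL_SIGNAL"),
    ("in", "SPATIAL_SIGNAL"),
    ("walked", "MOTION"),
]
trigger_dict = build_trigger_dictionary(training_annotations)
candidates = extract_candidate_triggers(
    ["Saitama", "is", "northwest", "of", "Tokyo", "."], trigger_dict
)
```

Matching every span against every known trigger phrase deliberately over-generates, which fits the stated goal of favoring recall at this stage and leaving the filtering to the relation classifiers.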

Candidate Tuple Generation
All possible candidate relations in a sentence are then generated using the extracted triggers and the spatial elements in the sentence. A candidate tuple consists of an extracted trigger and two other spatial elements: arg1 and arg2. Since some relations, such as the one represented in Figure 1 Example 6, can have undefined arguments, tuples with undefined arguments are also generated. For Example 4
6 In particular, recall for SPATIAL_SIGNALs increases from 0.603 to 0.936 and MOTION recall increases from 0.700 to 0.812 on the SpaceEval test data.
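A minimal sketch of the enumeration step is given below. It assumes that elements are identified by unique strings, that argument order matters (so ordered pairs are generated), and that at most one argument is left undefined; none of these details are confirmed by the description above.

```python
from itertools import permutations
from typing import List, Optional, Sequence, Tuple

# (trigger, arg1, arg2); None marks an undefined argument.
Candidate = Tuple[str, Optional[str], Optional[str]]


def generate_candidates(
    triggers: Sequence[str], elements: Sequence[str]
) -> List[Candidate]:
    """Enumerate candidate relation tuples for one sentence."""
    candidates: List[Candidate] = []
    for trigger in triggers:
        # The trigger itself is not reused as an argument (assumption).
        others = [e for e in elements if e != trigger]
        # Both arguments defined: every ordered pair of distinct elements.
        for arg1, arg2 in permutations(others, 2):
            candidates.append((trigger, arg1, arg2))
        # Exactly one argument undefined.
        for element in others:
            candidates.append((trigger, element, None))
            candidates.append((trigger, None, element))
    return candidates


tuples = generate_candidates(["northwest of"], ["Saitama", "Tokyo"])
```

With n non-trigger elements, each trigger yields n(n-1) fully specified tuples plus 2n partially specified ones, so the candidate set grows quadratically with sentence length.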

Setup
Once again, Stanford CoreNLP was used for POS tagging, lemmatization, and dependency parsing. The classification models were trained with Vowpal Wabbit's one-against-all multi-class classifier using its online stochastic gradient descent implementation with all the default settings.
Table 4: SpRL-CWW's relation classification results for the highest-performing Setting D.
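To make the training setup more concrete, the sketch below serializes one candidate tuple into Vowpal Wabbit's plain-text input format, in which an integer class label precedes one or more |namespace feature groups. The class inventory, namespace letters, and feature names are hypothetical; the paper does not list the exact label set used by each per-relation classifier.

```python
from typing import Dict, List


def sanitize(name: str) -> str:
    """Replace characters that VW's text format treats as delimiters."""
    return name.replace(" ", "_").replace(":", "_").replace("|", "_")


def to_vw_line(label: int, namespaces: Dict[str, Dict[str, float]]) -> str:
    """Format one example as '<label> |ns feat:value ...'."""
    parts: List[str] = [str(label)]
    for ns, feats in namespaces.items():
        rendered = " ".join(f"{sanitize(k)}:{v:g}" for k, v in feats.items())
        parts.append(f"|{sanitize(ns)} {rendered}")
    return " ".join(parts)


# Hypothetical classes for a QSLINK classifier; one-against-all training
# expects labels numbered from 1 to the number of classes.
QSLINK_CLASSES = {
    "NO_RELATION": 1,
    "ARG1_TRAJECTOR_ARG2_LANDMARK": 2,
    "ARG1_LANDMARK_ARG2_TRAJECTOR": 3,
}

line = to_vw_line(
    QSLINK_CLASSES["ARG1_TRAJECTOR_ARG2_LANDMARK"],
    {
        "t": {"lemma=northwest of": 1.0},  # trigger features
        "a": {"arg1_type=PLACE": 1.0, "arg2_type=PLACE": 1.0},  # argument features
    },
)
```

A file of such lines could then be passed to VW with its one-against-all option (--oaa followed by the number of classes), training one such model per relation type.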

Datasets
We evaluated our system on the trial and training data that were released for SpaceEval, with the exception of 9 files that did not have spatial relations annotated. Since our system focuses on relations with a trigger, we filtered out the relations that contained no trigger. The resulting dataset of 1,801 relations was used for training and evaluation.

Results
Official task results for relation classification are shown in Table 1. Task 1d results use the SEs that were detected in the previous step (Task 1b). Task 3a results are for relation classification using gold spatial elements and signals.

Discussion
Participation in SpaceEval raised several questions which we attempt to answer by conducting extended evaluation of our system on the SpaceEval training data using 5-fold cross validation.7 The settings and results are summarized in Table 3.
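The footnote on partitioning describes a stratified split of the document set ordered by decreasing size. One plausible reading, sketched here as an assumption rather than as the procedure actually used, is to sort documents by size and deal them out to the folds in round-robin fashion so that each fold receives a similar mix of large and small documents. The file names and sizes are invented.

```python
from typing import Dict, List


def size_stratified_folds(doc_sizes: Dict[str, int], k: int = 5) -> List[List[str]]:
    """Assign documents to k folds by dealing them out in order of decreasing size."""
    # Ties are broken by document name so the split is deterministic.
    ordered = sorted(doc_sizes, key=lambda doc: (-doc_sizes[doc], doc))
    folds: List[List[str]] = [[] for _ in range(k)]
    for rank, doc in enumerate(ordered):
        folds[rank % k].append(doc)
    return folds


sizes = {
    "a.xml": 120, "b.xml": 95, "c.xml": 80, "d.xml": 60,
    "e.xml": 40, "f.xml": 30, "g.xml": 10,
}
folds = size_stratified_folds(sizes)
```

Splitting at the document level rather than the sentence level keeps all sentences of a document in the same fold, which avoids leaking document-specific vocabulary between training and evaluation data.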

Which features were effective?
The feature ablation results in Table 5 show the three features with the largest contribution to SE and SI classification. They verify the contribution of word vectors trained on Web-scale data and support the claim of Bastianelli et al. (2013) that shallow grammatical information is essential.

Does the fine-grained SpaceEval annotation scheme help or hinder?
In order to explore this, we compare the top performing setting with SE type-related features (Setting B) to a setting with them removed (Setting C). Absence of these features decreases the f1-score by 0.044, providing evidence that fine-grained SE types help relation classification, though the relation and spatial role taxonomy requires consideration. Furthermore, each gold Spatial Signal that was provided for Task 3 had one of three possible semantic types: DIRECTIONAL, TOPOLOGICAL, or DIR_TOP (both). Instead of using all Spatial Signals as candidate triggers for QSLINKs and OLINKs, we only considered TOPOLOGICAL Spatial Signals as candidate triggers for QSLINK and DIRECTIONAL Spatial Signals as candidate triggers for OLINK. This setting (Setting D) achieved the highest f1-score and recall, demonstrating the importance of Spatial Signal semantic types in relation classification. Full relation classification results for Setting D are summarized in Table 4.
7 Partitions were made by taking a stratified split of the document set when ordered by decreasing size.
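The semantic-type restriction of Setting D amounts to a simple filter over the gold signals, sketched below. The treatment of DIR_TOP signals is an assumption: since DIR_TOP denotes both types, this sketch lets such signals trigger either relation, but the text does not state how they were handled. The dictionary keys are likewise hypothetical.

```python
from typing import Dict, List, Set

# Semantic types allowed to trigger each relation type.
# Including DIR_TOP in both sets is an assumption of this sketch.
ALLOWED_TYPES: Dict[str, Set[str]] = {
    "QSLINK": {"TOPOLOGICAL", "DIR_TOP"},
    "OLINK": {"DIRECTIONAL", "DIR_TOP"},
}


def triggers_for(relation: str, signals: List[Dict[str, str]]) -> List[str]:
    """Keep only the signals whose semantic type may trigger the relation."""
    allowed = ALLOWED_TYPES[relation]
    return [s["text"] for s in signals if s["semantic_type"] in allowed]


gold_signals = [
    {"text": "in", "semantic_type": "TOPOLOGICAL"},
    {"text": "north of", "semantic_type": "DIRECTIONAL"},
    {"text": "on top of", "semantic_type": "DIR_TOP"},
]
qslink_triggers = triggers_for("QSLINK", gold_signals)
olink_triggers = triggers_for("OLINK", gold_signals)
```

Filtering before tuple generation shrinks the candidate set each classifier sees, which is one way such a restriction could improve results without any change to the classifiers themselves.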

Is less (or no) feature engineering feasible?
We attempt this by automatically generating features using Vowpal Wabbit's quadratic feature generation. We disable all features underlined in Figure 5 and instruct VW to automatically construct features by generating all possible feature combinations. Settings E and F compare the base feature set before and after quadratic features are added. While quadratic features achieve a lower f1-score, they have the highest precision of all settings, suggesting feature generation may be useful for increasing precision of relation classification, but the low f1-score of Setting F indicates care is needed in selecting the base feature set. We are exploring feature engineering reduction further with a phrase vector-based model inspired by (Hermann et al., 2014).
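Conceptually, quadratic feature generation crosses every feature in one group with every feature in another and multiplies their values. The pure-Python sketch below only illustrates that idea; VW performs the expansion internally on hashed features, and the feature names here are invented.

```python
from itertools import product
from typing import Dict


def quadratic_features(
    group_a: Dict[str, float], group_b: Dict[str, float]
) -> Dict[str, float]:
    """Cross every feature in group_a with every feature in group_b."""
    return {
        f"{name_a}^{name_b}": value_a * value_b
        for (name_a, value_a), (name_b, value_b) in product(
            group_a.items(), group_b.items()
        )
    }


trigger_feats = {"lemma=on": 1.0}
argument_feats = {"arg1_type=PLACE": 1.0, "arg2_type=PATH": 0.5}
crossed = quadratic_features(trigger_feats, argument_feats)
```

Because the number of generated features is the product of the two group sizes, a noisy base feature set is amplified as well, which is consistent with the observation that the base features need to be selected with care.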

Conclusion
In this paper we presented the SpRL-CWW entry to SpaceEval 2015: Task 8. Official evaluation showed that it performed especially well on unannotated data. Extended evaluation verified the contribution of Web-scale word vectors, trigger dictionaries, and SE type information; and automatic feature generation showed promise. For future work, we plan to explore phrase vector-based approaches to SpRL.