BERT-based Spatial Information Extraction

Spatial information extraction is essential to understand geographical information in text. This task is largely divided into two subtasks: spatial element extraction and spatial relation extraction. In this paper, we utilize BERT (Devlin et al., 2018), which is very effective for many natural language processing applications. We propose a BERT-based spatial information extraction model, which uses BERT for spatial element extraction and R-BERT (Wu and He, 2019) for spatial relation extraction. The model was evaluated with the SemEval 2015 dataset. The results showed a 15.4 percentage point increase in spatial element extraction and an 8.2 percentage point increase in spatial relation extraction in comparison to the baseline model (Nichols and Botros, 2015).


Introduction
Extracting spatial relations from text is a type of relation extraction that focuses on the static and dynamic spatial relations in the text. It is essential for natural language understanding systems, such as robot navigation systems and question-answering systems, to understand geographical relations or to track moving objects.
For example, in the sentence "Tom is on the box," we find a static relation in which Tom is the trajector, box is the landmark, and on denotes their static spatial relation. In the following sentence, "He steps down from the box to the ground," we also find a dynamic spatial relation, in which He (Tom) is the mover, steps down is the trigger, box is the source, and ground is the destination. Using simple inference based on the extracted relations, we can infer a new relation: "Tom is on the ground now." The task is largely divided into two subtasks¹: spatial element extraction and spatial relation extraction. Spatial element extraction finds candidate elements for spatial relation roles, such as the trajector, landmark, and trigger. Spatial relation extraction finds or verifies relations among the role candidates.
¹ The subtasks are defined in more detail for evaluation in SemEval-2015 Task 8 (Pustejovsky et al., 2015). Spatial element extraction and spatial relation extraction correspond to tasks 1.b and 1.d, respectively, in that definition.
Many natural language processing techniques and machine learning methods have been applied to spatial information extraction. For example, a conditional random field (CRF) model (Lafferty et al., 2001) has been used for spatial element extraction, and support vector machine (SVM) (Suykens and Vandewalle, 1999; Roberts and Harabagiu, 2012) and convolutional neural network (CNN) (Mazalov et al., 2015) models have been used for spatial relation extraction. Various language resources, such as GloVe (Pennington et al., 2014), WordNet (Salaberri et al., 2015), and PropBank (Salaberri et al., 2015), have also been used for spatial information extraction.
In this paper, we propose a BERT-based spatial information extraction model that utilizes BERT (Devlin et al., 2018) extensively for both spatial element extraction and spatial relation extraction. Recently, many context-aware language models have been developed, including not only BERT, but also ELMo (Peters et al., 2018), XLNet (Yang et al., 2019), and GPT (Radford et al., 2018), among others. We chose BERT simply because many downstream applications of the BERT system have been developed, for example for named entity recognition. In this paper, the spatial element extraction task is extended to extract not only spatial elements, such as paths, places, motions, and spatial entities, but also spatial signals and motion signals.
In Section 2, we briefly summarize related works. In Section 3, we describe our proposed model, which consists of three modules: a spatial element extraction model, a triple candidate generator, and a spatial relation extraction model. Section 4 presents the experimental results of our model. Finally, Section 5 concludes the paper.

Related Works
An early method of spatial information extraction was introduced as spatial role labeling (SpRL) by Kordjamshidi et al. (2011). SemEval-2012 introduced a spatial role labeling task mainly focusing on static spatial relations. SemEval-2013 expanded static spatial relations to capture fine-grained semantics and to include dynamic spatial relations.
SemEval-2015 was the first shared task to evaluate systems implementing the SpaceEval annotation scheme, which is the current spatial information annotation scheme (Pustejovsky et al., 2015). Many spatial information extraction systems have been developed based on the SpaceEval annotation scheme. Nichols and Botros (2015) proposed the SpRL-CWW model, which uses a CRF model (Lafferty et al., 2001) for spatial element extraction and an SVM model (Suykens and Vandewalle, 1999) for spatial relation extraction. It uses many input features for element extraction, such as GloVe word embeddings (Pennington et al., 2014), named entities, part-of-speech tags, and dependency parse labels. The SVM is used to select the correct triples from all possible combinations of triples. D'Souza and Ng (2015) proposed the UTD-SpRL model based on SVM, which includes more than 100 different features generated by a greedy feature selection technique and uses the joint detection of a relation's arguments. The X-Space model proposed by Salaberri et al. (2015) uses node information included in WordNet, such as the place, position, and location, for spatial element extraction. It also uses argument information in PropBank for spatial relation classification.
A multimodal approach that uses image and text information simultaneously was also presented in the multimodal spatial role labeling (mSpRL) shared task at CLEF 2017 (Kordjamshidi et al., 2017), but the result was not satisfactory (Zablocki et al., 2017). Mazalov et al. (2015) extracted spatial roles and their relations by adapting a convolutional neural network based system developed for semantic role labeling. The pre-existing system was successfully adapted to spatial information extraction. Dan et al. (2020) proposed spatial BERT to predict the spatial relation between two entities given an image involving them. The spatial BERT was composed of a spatial model, implemented with a feed-forward network, and a language model, which was implemented with BERT. The language model provides complementary features to predict unseen (untrained) relations in images. Although BERT is used as the language model in this approach, spatial relation extraction is limited to relation detection for the given subject and object entities in the image. Our approach also uses BERT but deals instead with the entire process of relation extraction from raw text: we extract spatial elements from raw text, determine their corresponding spatial roles, and find spatial relations from the spatial roles.

Spatial Information Extraction Model
We divide the spatial information extraction task into two subtasks, spatial element extraction and spatial relation extraction, according to the ISOspace annotation scheme (ISO, 2014; Pustejovsky et al., 2015). For the integrated system, we pipelined the two subtasks via a triple candidate generator. Figure 1 shows the overall architecture of our system. A sentence is input to the element extractor, which is joined with the role module of one of the link types. The element extractor outputs the spatial elements and spatial roles jointly. The spatial roles are combined into triples as spatial relation candidates by the triple candidate generator. The triple candidates are classified as either valid relations or invalid relations by the relation extractor. Each module is described in the following sections.
For the general architecture of spatial relation extraction, two restrictions are imposed in this work. First, only three arguments are allowed. According to ISOspace, each relation has certain arguments, as shown in Table 1. To maintain a static architecture for relation extraction, we set the number of arguments of each relation to three. Therefore, we keep all arguments for QSLink (Qualitative Spatial Link), OLink (Orientation information Link), and MeLink (Measurement Link), but for MvLink (Movement Link), we choose three arguments out of seven: mover, goal, and motion.
Second, only one primary spatial role is determined for each entity in the element extraction stage. When a sentence contains multiple relations, an entity may participate in several of them with different roles. In such cases, only one role can be assigned to the entity. For example, in the sentence in Figure 2, 'vase' has a different role in each of two relations: trajector and landmark. Because this combination of two roles is the most frequent one, we decided to represent it with a single role label, traLand. The triple candidate generator interprets this role label as two separate roles, trajector and landmark, for triple candidate generation.

Spatial Element Extraction
Spatial element extraction is a sequence labeling problem, which can be handled readily with BERT. The structure of the model is shown in Figure 3. A sentence is segmented into word pieces, which are input to BERT to extract the spatial elements and spatial roles jointly. In previous methods, many features are extracted through preprocessing for learning by CRFs (Nichols and Botros, 2015). However, the BERT-based spatial element extraction module does not require any preprocessing for feature extraction; rather, it requires only raw text as input for fine-tuning.
A multi-layer perceptron (MLP), used as a classifier on top of BERT, performs fully connected layer computation and produces IOB-based tags for annotation. Because BERT is based on word pieces (Wu et al., 2016), the outputs are also word pieces. For sequence labeling, we labeled only the first word piece of each word. For example, if 'flower' is divided into the two word pieces 'flow' and '##er', only 'flow' is annotated with a normal tag, such as a Spatial Entity tag, whereas '##er' is annotated with an Other tag.
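The first-word-piece labeling rule described above can be illustrated with a minimal sketch. The toy tokenizer, the tag strings (`B-SpatialEntity`, `B-SpatialSignal`, `O`), and the function names are assumptions made for illustration only; they are not taken from the paper's implementation, and a real system would use the WordPiece tokenizer that ships with the chosen BERT checkpoint.

```python
from typing import Callable, List, Tuple

# Hypothetical toy split table standing in for a real WordPiece tokenizer.
TOY_PIECES = {"flower": ["flow", "##er"]}


def toy_wordpiece(word: str) -> List[str]:
    """Return the word pieces of a word (toy lookup; unknown words stay whole)."""
    return TOY_PIECES.get(word, [word])


def align_labels(
    words: List[str],
    labels: List[str],
    tokenize: Callable[[str], List[str]] = toy_wordpiece,
    other_tag: str = "O",
) -> Tuple[List[str], List[str]]:
    """Give the word-level tag to the first word piece and other_tag to the rest."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    pieces: List[str] = []
    piece_labels: List[str] = []
    for word, label in zip(words, labels):
        sub = tokenize(word)
        if not sub:
            continue  # skip words that produce no pieces
        pieces.extend(sub)
        piece_labels.extend([label] + [other_tag] * (len(sub) - 1))
    return pieces, piece_labels


words = ["the", "flower", "is", "in", "the", "vase"]
labels = ["O", "B-SpatialEntity", "O", "B-SpatialSignal", "O", "B-SpatialEntity"]
pieces, piece_labels = align_labels(words, labels)
for piece, tag in zip(pieces, piece_labels):
    print(piece, tag)
```

In this sketch, 'flow' carries the Spatial Entity tag and '##er' receives the Other tag, mirroring the example in the text.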
We use a joint model for spatial element extraction and spatial role extraction. Two classifiers are located on top of the BERT system and share the same parameters for BERT fine tuning. We noted an improvement in the joint model over the single model during a preliminary test on Korean data (Kim and Lee, 2016).

Triple Candidate Generator
Because the spatial role extractor produces only entity tags, we do not know which entity is related to which, especially when there are multiple relations. Moreover, we do not know the relation type to which they belong. The triple candidate generator produces all possible combinations of the given spatial roles so that the spatial relation extractor can determine which combination and type should be chosen. For example, in the sentence shown in Figure 4, we have two trajectors, 'bike' and 'puppy'; two landmarks, 'warehouse' and 'gate'; and two triggers, 'by' and 'in front of'. The triple candidate generator produces all combinations of trajector, landmark, and trigger. In this case, it produces 8 (2*2*2) triple candidates. Generally, for a set of trajectors T, a set of landmarks L, and a set of triggers G, the triple candidates form the Cartesian product of the three sets, so their number is |T|*|L|*|G|.
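As a minimal sketch of the Cartesian-product generation described above, the following code builds triple candidates from role-tagged entities. The function names and the role strings are hypothetical, and the handling of the dual-role label traLand (counted once as a trajector and once as a landmark) follows the description in the model overview; whether the actual system filters out any candidates is not stated in the text, so no filtering is applied here.

```python
from itertools import product
from typing import Dict, List, Tuple


def expand_roles(tagged: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group entity texts by role, splitting the dual-role label 'traLand'."""
    roles: Dict[str, List[str]] = {"trajector": [], "landmark": [], "trigger": []}
    for text, role in tagged:
        if role == "traLand":
            # A traLand entity is treated as both a trajector and a landmark.
            roles["trajector"].append(text)
            roles["landmark"].append(text)
        elif role in roles:
            roles[role].append(text)
    return roles


def generate_triples(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return every (trajector, landmark, trigger) combination."""
    roles = expand_roles(tagged)
    return list(product(roles["trajector"], roles["landmark"], roles["trigger"]))


# Roles from the example in the text: two trajectors, two landmarks, two triggers.
tagged = [
    ("bike", "trajector"),
    ("puppy", "trajector"),
    ("warehouse", "landmark"),
    ("gate", "landmark"),
    ("by", "trigger"),
    ("in front of", "trigger"),
]
triples = generate_triples(tagged)
print(len(triples))  # |T| * |L| * |G| = 2 * 2 * 2
```

Each candidate is then passed to the relation extractor, which decides whether it is a valid relation.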

Spatial Relation Extraction
Spatial relation extraction is the task of identifying the relation between given entities, in our case, the three entities of a triple. A similar task has been addressed using BERT in semantic role labeling (SRL), in which a relationship is classified for two given semantic role arguments (Wu and He, 2019). This model showed the best performance in SRL, and we refer to this model as R-BERT in this paper. We adopted R-BERT for spatial relation extraction, but we modified two aspects of the model. We extended two arguments to three arguments, and we included a null argument for the case of a movement link. Figure 5 shows the structure of the modified model. A sentence, marked with a triple candidate, is input to BERT. The BERT outputs for each of the roles in the triple and for the [CLS] token are averaged and pass through a fully connected network. The four outputs are concatenated and then pass through a fully connected network again. The softmax of the output is the final result, which determines the validity of the triple relation.
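As a hedged reconstruction, the computation described above can be written in the notation of Wu and He (2019), on which R-BERT is based. Here H_t denotes the final BERT hidden state of word piece t, H_0 is that of the [CLS] token, and the spans i to j, k to m, and q to r delimit the first argument, the second argument, and the trigger, respectively. The equation numbers follow the numbering of Wu and He (2019) for the two-argument case. The separate trigger parameters W_4 and b_4, and the widened W_3 in the last line, are assumptions and are not confirmed by the source.

```latex
\begin{align}
H'_1 &= W_1\left[\tanh\left(\frac{1}{j-i+1}\sum_{t=i}^{j} H_t\right)\right] + b_1 \tag{1}\\
H'_2 &= W_2\left[\tanh\left(\frac{1}{m-k+1}\sum_{t=k}^{m} H_t\right)\right] + b_2 \tag{2}\\
H'_0 &= W_0\left(\tanh(H_0)\right) + b_0 \tag{3}\\
h'' &= W_3\left[\operatorname{concat}(H'_0, H'_1, H'_2)\right] + b_3 \tag{4}\\
p &= \operatorname{softmax}(h'') \tag{5}\\
H'_3 &= W_4\left[\tanh\left(\frac{1}{r-q+1}\sum_{t=q}^{r} H_t\right)\right] + b_4 \tag{6}\\
h'' &= W_3\left[\operatorname{concat}(H'_0, H'_1, H'_2, H'_3)\right] + b_3 \tag{7}
\end{align}
```

Under this reading, the three-argument model replaces (4) with (7) and feeds the result to the softmax in (5), which yields the probability that the triple is a valid relation.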
For the argument span, the input format is changed to include the start index and end index along with the words: [words, start index, end index]. In the case of two arguments, we utilize formulae (1) to (5) for the spans from i to j and from k to m. We extended the model to operate on the three arguments of spatial information extraction. We added one additional tanh output for a trigger with the span from q to r, as shown in formula (6). We also modified formula (4) to formula (7) to include the trigger.
For the null argument in the case of a movement link, we utilize the last character in the sentence. For example, the sentence "John leaves from school." contains a mover, "John," and a motion, "leaves," but it does not have a goal. In this case, we represent the goal as the null argument. Therefore, we have a three-argument span: ['John', 0, 0], ['.', 4, 4], and ['leaves', 1, 1].
For both the spatial element model and the relation role extraction model, the hyper-parameters in Table 2 are used.
An experiment was conducted with the dataset of SemEval-2015 Task 8: SpaceEval. Table 3 shows the statistics of the dataset.
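The [words, start index, end index] argument format and the null-argument convention for movement links, described earlier in this section, can be sketched as follows. The function names are hypothetical, token-level indices are assumed (consistent with the "John leaves from school." example), and the null argument is mapped to the final token of the sentence.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def make_span(tokens: List[str], span: Optional[Span]) -> list:
    """Return [words, start, end]; a missing argument maps to the last token."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    if span is None:
        last = len(tokens) - 1
        return [tokens[last], last, last]  # null argument
    start, end = span
    return [" ".join(tokens[start:end + 1]), start, end]


def build_movement_triple(
    tokens: List[str],
    mover: Optional[Span],
    goal: Optional[Span],
    motion: Optional[Span],
) -> List[list]:
    """Build the three argument spans (mover, goal, motion) of a movement link."""
    return [make_span(tokens, mover), make_span(tokens, goal), make_span(tokens, motion)]


tokens = ["John", "leaves", "from", "school", "."]
triple = build_movement_triple(tokens, mover=(0, 0), goal=None, motion=(1, 1))
print(triple)
```

For this sentence, the sketch reproduces the three-argument span given in the text, with the sentence-final period standing in for the missing goal.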
Because non-motion events and MeLink are usually not evaluated in spatial information extraction tasks, they were excluded from our experiment. We also added invalid triples generated by the triple candidate generator for training. These triples accounted for approximately 40% of all data and were used as negative examples in the training data.

Results
Because the SpRL-CWW model (Nichols and Botros, 2015) performed best in SpaceEval, it was used as the baseline model in this evaluation. In our model, spatial elements and spatial roles are jointly trained and extracted. Because the spatial roles depend on the link relation type, we have four types of joint models here: QSLink, OLink, MvLink, and MeLink. The evaluation results for these models are shown in Table 4. Overall, the performance for element extraction was better than that for role extraction. Moreover, the Joint-with-QSLink model showed the worst performance, whereas the Joint-with-MeLink model showed the best performance.

Ablation study
In order to observe the effects of the proposed features, namely the traLand tag and the joint model of the spatial elements and roles, we conducted an ablation test.
Table 7. Ablation test of models without using the joint training feature and a dual-role tag (traLand)

Conclusion
Spatial information extraction is necessary for many applications, such as robot navigation and question-answering systems, to understand geographical information in text. This task is largely divided into two subtasks: spatial element extraction and spatial relation extraction.
In this paper, we proposed a BERT-based spatial information extraction model that uses BERT (Devlin et al., 2018) for spatial element extraction and R-BERT (Wu and He, 2019) for spatial relation extraction. The two modules are connected with a pipeline through a triple candidate generator.
Spatial elements are extracted jointly with spatial roles, which are the input for spatial relation extraction. The joint model helps to increase the performance of spatial role extraction in some cases, which is more useful for relation extraction. R-BERT, which was originally used for semantic role labeling, was modified here to handle three arguments and a null argument for spatial relation extraction.
Our model was evaluated with the SemEval 2015 dataset. The results showed a 15.4 percentage point improvement in spatial element extraction and an 8.2 percentage point improvement in spatial relation extraction in comparison to the baseline model (Nichols and Botros, 2015). This indicates that our BERT-based model is very effective for spatial information extraction.