A Walk-based Model on Entity Graphs for Relation Extraction

We present a novel graph-based neural network model for relation extraction. Our model treats multiple pairs in a sentence simultaneously and considers interactions among them. All the entities in a sentence are placed as nodes in a fully-connected graph structure. The edges are represented with position-aware contexts around the entity pairs. In order to consider different relation paths between two entities, we construct up to l-length walks between each pair. The resulting walks are merged and iteratively used to update the edge representations into longer walk representations. We show that the model achieves performance comparable to the state-of-the-art systems on the ACE 2005 dataset without using any external tools.


Introduction
Relation extraction (RE) is the task of identifying typed relations between known entity mentions in a sentence. Most existing RE models treat each relation in a sentence individually (Miwa and Bansal, 2016; Nguyen and Grishman, 2015). However, a sentence typically contains multiple relations between entity mentions. RE models need to consider these pairs simultaneously to model the dependencies among them. The relation between a pair of interest (namely the "target" pair) can be influenced by other pairs in the same sentence. The example illustrated in Figure 1 explains this phenomenon. The relation between the pair of interest, Toefting and capital, can be extracted directly from the target entities or indirectly by incorporating information from other related pairs in the sentence. The person entity (PER) Toefting is directly related with teammates through the preposition with. Similarly, teammates is directly related with the geopolitical entity (GPE) capital through the preposition in. Toefting and capital can be directly related through in or indirectly related through teammates. Essentially, the path from Toefting to teammates to capital provides additional support for the relation between Toefting and capital. (Source code is available at https://github.com/fenchri/walk-based-re.)
Multiple relations in a sentence between entity mentions can be represented as a graph. Neural graph-based models have shown significant improvement in modelling graphs over traditional feature-based approaches in several tasks. They are most commonly applied on knowledge graphs (KG) for knowledge graph completion (Jiang et al., 2017) and the creation of knowledge graph embeddings (Shi and Weninger, 2017). These models rely on paths between existing relations in order to infer new associations between entities in KGs. However, for relation extraction from a sentence, related pairs are not predefined and consequently all entity pairs need to be considered to extract relations. In addition, state-of-the-art RE models sometimes depend on external syntactic tools to build the shortest dependency path (SDP) between two entities in a sentence (Xu et al., 2015; Miwa and Bansal, 2016). This dependence on external tools leads to domain-dependent models.
In this study, we propose a neural relation extraction model based on an entity graph, where entity mentions constitute the nodes and directed edges correspond to ordered pairs of entity mentions. The overview of the model is shown in Figure 2. We initialize the representation of an edge (an ordered pair of entity mentions) from the representations of the entity mentions and their context. The context representation is achieved by employing an attention mechanism on context words. We then use an iterative process to aggregate up-to l-length walk representations between two entities into a single representation, which corresponds to the final representation of the edge.
The contributions of our model can be summarized as follows:
• We propose a graph walk based neural model that considers multiple entity pairs in relation extraction from a sentence.
• We propose an iterative algorithm to form a single representation for up-to l-length walks between the entities of a pair.
• We show that our model performs comparably to the state-of-the-art without the use of external syntactic tools.

Proposed Walk-based Model
Given a sentence, its entity mentions and their semantic types, the goal of the RE task is to extract and classify all related entity pairs (target pairs) in the sentence. The proposed model consists of five stacked layers: an embedding layer, a BLSTM layer, an edge representation layer, a walk aggregation layer and finally a classification layer. As shown in Figure 2, the model receives word representations and simultaneously produces a representation for each pair in the sentence. These representations combine the target pair, its context words, their relative positions to the pair entities and walks between them. During classification they are used to predict the relation type of each pair.

Embedding Layer
The embedding layer creates $n_w$-, $n_t$- and $n_p$-dimensional vectors, which are assigned to words, semantic entity types and relative positions to the target pairs, respectively. We map all words and semantic types into real-valued vectors $w$ and $t$, respectively. Relative positions to target entities are created based on the position of words in the sentence. In the example of Figure 1, the relative position of teammates to capital is −3 and the relative position of teammates to Toefting is +16. We embed these positions into real-valued vectors $p$.
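The relative positions can be illustrated with a minimal Python sketch. It assumes that a position is the word index minus the entity index, which matches the signs in the example above. The token indices (Toefting at 0, teammates at 16, capital at 19), the helper names and the embedding-table layout are hypothetical, since the full sentence of Figure 1 is not reproduced here.

```python
def relative_positions(num_words, entity_index):
    """Signed offset of every word from one entity (word index - entity index)."""
    return [word_index - entity_index for word_index in range(num_words)]


def to_embedding_index(offset, max_sentence_length):
    """Shift a signed offset into a non-negative row index of a position
    embedding table with 2 * max_sentence_length - 1 rows (an assumed layout)."""
    return offset + max_sentence_length - 1


# Hypothetical token indices chosen to reproduce the offsets in the text.
TOEFTING, TEAMMATES, CAPITAL = 0, 16, 19
to_capital = relative_positions(20, CAPITAL)
to_toefting = relative_positions(20, TOEFTING)
print(to_capital[TEAMMATES], to_toefting[TEAMMATES])  # -3 16
```

Multi-word entities would need a convention for the reference index (for example the first token), which the text does not specify.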

Bidirectional LSTM Layer
The word representations of each sentence are fed into a Bidirectional Long Short-Term Memory (BLSTM) layer (Hochreiter and Schmidhuber, 1997), which encodes the context representation for every word. The BLSTM outputs new word-level representations $h$ that consider the sequence of words.
We avoid encoding target pair-dependent information in this BLSTM layer. This has two advantages: (i) the computational cost is reduced as this computation is repeated based on the number of sentences instead of the number of pairs, (ii) we can share the sequence layer among the pairs of a sentence. The second advantage is particularly important as it enables the model to indirectly learn hidden dependencies between the related pairs in the same sentence.
For each word $t$ in the sentence, we concatenate the two representations from the left-to-right and right-to-left passes of the LSTM into an $n_e$-dimensional vector, $e_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$.

Edge Representation Layer
The output word representations of the BLSTM are further divided into two parts: (i) target pair representations and (ii) target pair-specific context representations. The context of a target pair can be expressed as all words in the sentence that are not part of the entity mentions. We represent a related pair as described below. A target pair contains two entities $e_i$ and $e_j$. If an entity consists of $N$ words, we create its BLSTM representation as the average of the BLSTM representations of the corresponding words, $e = \frac{1}{|I|} \sum_{i \in I} e_i$, where $I$ is the set of word indices inside entity $e$.
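The averaging of word-level states over an entity mention can be sketched in a few lines of NumPy. The function name and the toy array are illustrative assumptions; the original implementation uses Chainer rather than NumPy.

```python
import numpy as np


def entity_representation(word_states, word_indices):
    """Average the BLSTM states of the words that belong to one entity.

    word_states:  array of shape (num_words, n_e), one row per word.
    word_indices: indices of the words inside the entity mention (the set I).
    """
    return word_states[list(word_indices)].mean(axis=0)


# Toy sentence with 4 words and n_e = 3; the entity spans words 1 and 2.
states = np.arange(12, dtype=float).reshape(4, 3)
entity = entity_representation(states, [1, 2])
```

A single-word entity is simply represented by its own BLSTM state.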
We first create a representation for each pair entity and then we construct the representation for the context of the pair. The representation of an entity $e_i$ is the concatenation of its BLSTM representation $e_i$, the representation of its entity type $t_i$ and the representation of its relative position to entity $e_j$, $p_{ij}$. Similarly, for entity $e_j$ we use its relative position to entity $e_i$, $p_{ji}$. Finally, the representations of the pair entities are $v_i = [e_i ; t_i ; p_{ij}]$ and $v_j = [e_j ; t_j ; p_{ji}]$.
The next step involves the construction of the representation of the context for this pair. For each context word $w_z$ of the target pair $e_i$, $e_j$, we concatenate its BLSTM representation $e_z$, its semantic type representation $t_z$ and two relative position representations: to target entity $e_i$, $p_{zi}$, and to target entity $e_j$, $p_{zj}$. The final representation for a context word $w_z$ of a target pair is $v_{ijz} = [e_z ; t_z ; p_{zi} ; p_{zj}]$. For a sentence, the context representations for all entity pairs can be expressed as a three-dimensional matrix $C$, where rows and columns correspond to entities and the depth corresponds to the context words.
The context word representations of each target pair are then compiled into a single representation with an attention mechanism. Following the method proposed in Zhou et al. (2016), we calculate weights for the context words of the target pair and compute their weighted average,
$$u = q^\top \tanh(C_{ij}), \quad \alpha = \mathrm{softmax}(u), \quad c_{ij} = C_{ij}\, \alpha^\top, \qquad (1)$$
where $q \in \mathbb{R}^{n_d}$, $n_d = n_e + n_t + 2n_p$, denotes a trainable attention vector, $\alpha$ is the vector of attention weights and $c_{ij} \in \mathbb{R}^{n_d}$ is the context representation of the pair resulting from the weighted average. This attention mechanism is independent of the relation type. We leave relation-dependent attention as future work.
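The attention step over the context words can be sketched as follows. This is a minimal NumPy illustration that assumes the scoring form of Zhou et al. (2016), a score q^T tanh(v) per context word followed by a softmax, and a context matrix stored with one column per context word; the function name and array layout are assumptions, not the authors' code.

```python
import numpy as np


def attend_context(context, q):
    """Weighted average of the context word vectors of one target pair.

    context: array of shape (n_d, num_context_words), one column per word.
    q:       trainable attention vector of shape (n_d,).
    Returns the context representation (n_d,) and the weights (num_context_words,).
    """
    scores = q @ np.tanh(context)            # one score per context word
    scores = scores - scores.max()           # stabilise the softmax numerically
    alpha = np.exp(scores) / np.exp(scores).sum()
    return context @ alpha, alpha


rng = np.random.default_rng(0)
context = rng.normal(size=(5, 7))            # n_d = 5, seven context words
c_ij, alpha = attend_context(context, rng.normal(size=5))
```

Because the attention vector q is shared by all pairs, the weights depend only on the context words and their positions relative to the pair, not on the relation type.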
Finally, we concatenate the representations of the target entities and their context ($\in \mathbb{R}^{n_m}$). We use a fully connected linear layer, $W_s \in \mathbb{R}^{n_m \times n_s}$ with $n_s < n_m$, to reduce the dimensionality of the resulting vector. This corresponds to the representation of an edge or a one-length walk between nodes $i$ and $j$: $v^{(1)}_{ij} = W_s [v_i ; v_j ; c_{ij}]$.
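As a rough illustration of the edge construction, the sketch below concatenates the two entity vectors and the context vector and projects the result to a lower dimension. The weight matrix is stored with shape (n_m, n_s) as stated in the text, so the projection is written as a vector-matrix product; the function name and the toy dimensions are assumptions.

```python
import numpy as np


def edge_representation(v_i, v_j, c_ij, W_s):
    """One-length walk between nodes i and j.

    v_i, v_j: entity representations of the pair.
    c_ij:     attended context representation of the pair.
    W_s:      projection matrix of shape (n_m, n_s), n_m = len(v_i) + len(v_j) + len(c_ij).
    """
    concatenated = np.concatenate([v_i, v_j, c_ij])   # shape (n_m,)
    return concatenated @ W_s                          # shape (n_s,)


# Toy dimensions: n_m = 3 + 3 + 2 = 8, reduced to n_s = 4.
edge = edge_representation(np.ones(3), np.zeros(3), np.ones(2), np.ones((8, 4)))
```

Since edges are directed, the pair (j, i) gets its own representation built from the swapped entity order.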

Walk Aggregation Layer
Our main aim is to support the relation between an entity pair by using chains of intermediate relations between the pair entities. Thus, the goal of this layer is to generate a single representation for a finite number of walks of different lengths between two target entities. To achieve this, we represent a sentence as a directed graph, where the entities constitute the graph nodes and edges correspond to the representation of the relation between the two nodes. The representation of the one-length walk between a target pair, $v^{(1)}_{ij}$, serves as a building block in order to create and aggregate representations for one-to-l-length walks between the pair. The walk-based algorithm can be seen as a two-step process: walk construction and walk aggregation. During the first step, two consecutive edges in the graph are combined using a modified bilinear transformation,
$$f(v^{(\lambda)}_{ik}, v^{(\lambda)}_{kj}) = \sigma\big(v^{(\lambda)}_{ik} \odot (W_b\, v^{(\lambda)}_{kj})\big), \qquad (2)$$
where $v^{(\lambda)}_{ij} \in \mathbb{R}^{n_b}$ corresponds to the representation of walks of lengths one-to-λ between entities $e_i$ and $e_j$, $\odot$ represents element-wise multiplication, $\sigma$ is the sigmoid non-linear function and $W_b \in \mathbb{R}^{n_b \times n_b}$ is a trainable weight matrix. This equation results in walks of lengths two-to-2λ.
In the walk aggregation step, we linearly combine the initial walks (length one-to-λ) and the extended walks (length two-to-2λ),
$$v^{(2\lambda)}_{ij} = \beta\, v^{(\lambda)}_{ij} + (1 - \beta) \sum_{k \neq i,j} f(v^{(\lambda)}_{ik}, v^{(\lambda)}_{kj}), \qquad (3)$$
where $\beta$ is a weight that indicates the importance of the shorter walks. Overall, we create a representation for walks of length one-to-two using Equation (3) and λ = 1. We then create a representation for walks of length one-to-four by re-applying the equation with λ = 2. We repeat this process until the desired maximum walk length is reached, which is equivalent to 2λ = l.
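The two-step procedure can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' Chainer code: all edge vectors of a sentence are stored in one array of shape (n, n, n_b), two consecutive edges are combined as sigmoid(v_ik * (W_b v_kj)), intermediate nodes k equal to i or j are excluded, and the maximum walk length is a power of two. All function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def walk_step(V, W_b, beta):
    """Double the maximum walk length encoded in every edge.

    V:    array (n, n, n_b); V[i, j] encodes walks of length 1..lambda from i to j.
    W_b:  trainable matrix (n_b, n_b).
    beta: weight of the shorter walks, between 0 and 1.
    """
    n = V.shape[0]
    projected = V @ W_b.T                      # projected[k, j] = W_b @ V[k, j]
    # combined[i, k, j] = sigmoid(V[i, k] * (W_b @ V[k, j]))
    combined = sigmoid(V[:, :, None, :] * projected[None, :, :, :])
    idx = np.arange(n)
    mask = np.ones((n, n, n), dtype=bool)      # axes are (i, k, j)
    mask[idx, idx, :] = False                  # drop intermediate node k == i
    mask[:, idx, idx] = False                  # drop intermediate node k == j
    extended = (combined * mask[..., None]).sum(axis=1)
    return beta * V + (1.0 - beta) * extended


def aggregate_walks(V, W_b, beta, max_length):
    """Apply walk_step until walks of length up to max_length are covered."""
    length = 1
    while length < max_length:
        V = walk_step(V, W_b, beta)
        length *= 2
    return V
```

With max_length set to 4 the step is applied twice (lengths 1 to 2, then 1 to 4). The diagonal entries V[i, i] are carried along in the array but never contribute to a pair of distinct entities, because both k == i and k == j are masked out.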

Classification Layer
For the final layer of the network, we pass the resulting pair representation into a fully connected layer with a softmax function,
$$y = \mathrm{softmax}(W_r\, v^{(l)}_{ij} + b_r), \qquad (4)$$
where $W_r \in \mathbb{R}^{n_b \times n_r}$ is the weight matrix, $n_r$ is the total number of relation types and $b_r$ is the bias vector.
We use in total 2r+1 classes in order to consider both directions for every pair, i.e., left-to-right and right-to-left. The first argument appears first in a sentence in a left-to-right relation, while the second argument appears first in a right-to-left relation. The additional class corresponds to non-related pairs, namely the "no relation" class. We choose the most confident prediction for each direction and, when the predictions contradict each other, choose the positive and most confident prediction.
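One possible reading of this resolution rule is sketched below. It assumes that each of the two directed edges of a pair yields a (label, confidence) tuple, that any label other than "no relation" counts as positive, and that positive predictions take priority over "no relation" before confidence is compared. The label strings and the function name are hypothetical.

```python
NO_RELATION = "no_relation"


def resolve_pair(pred_forward, pred_backward):
    """Pick one prediction for an entity pair from its two directed edges.

    Each argument is a (label, confidence) tuple holding the most confident
    class of one direction. Positive predictions are preferred; among the
    remaining candidates the most confident one is returned.
    """
    candidates = [pred_forward, pred_backward]
    positives = [p for p in candidates if p[0] != NO_RELATION]
    pool = positives if positives else candidates
    return max(pool, key=lambda p: p[1])


# A positive prediction wins over a more confident "no relation".
resolved = resolve_pair(("REL_A:L2R", 0.6), (NO_RELATION, 0.9))
```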

Dataset
We evaluate the performance of our model on ACE 2005 for the task of relation extraction. ACE 2005 includes 7 entity types and 6 relation types between named entities. We follow the preprocessing described in Miwa and Bansal (2016).

Experimental Settings
We implemented our model using the Chainer library (Tokui et al., 2015). The model was trained with the Adam optimizer (Kingma and Ba, 2015). We initialized the word representations with existing pre-trained embeddings with dimensionality of 200. Our model did not use any external tools except these embeddings.
The forget bias of the LSTM layer was initialized with a value equal to one following the work of Jozefowicz et al. (2015). We use a batch size of 10 sentences and fix the pair representation dimensionality to 100. To avoid overfitting, we use gradient clipping, dropout on the embedding and output layers, and L2 regularization without regularizing the biases. We also incorporate early stopping with patience equal to five to choose the number of training epochs, and parameter averaging. We tune the model hyper-parameters on the respective development set using the RoBO Toolkit (Klein et al., 2017). Please refer to the supplementary material for the values.
We extract all possible pairs in a sentence based on the number of entities it contains. If a pair is not found in the corpus, it is assigned the "no relation" class. We report the micro precision, recall and F1 score following Miwa and Bansal (2016) and Nguyen and Grishman (2015). Table 1 illustrates the performance of our proposed model in comparison with the SPTree system (Miwa and Bansal, 2016) on ACE 2005. We use the same data split as SPTree to compare with their model. We retrained their model with gold entities in order to compare the performances on the relation extraction task. The Baseline corresponds to a model that classifies relations by using only the representations of entities in a target pair.
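The pair extraction and the micro-averaged scores can be sketched in plain Python. The sketch assumes that pairs are enumerated as ordered pairs, that the annotation is a dictionary from pairs to labels, and that the "no relation" class is excluded when counting predictions and gold relations; the names are hypothetical and the official evaluation scripts may differ in detail.

```python
from itertools import permutations

NO_RELATION = "no_relation"


def label_all_pairs(entities, annotated):
    """Enumerate ordered entity pairs of a sentence; pairs that are absent
    from the annotation receive the "no relation" class."""
    return {pair: annotated.get(pair, NO_RELATION)
            for pair in permutations(entities, 2)}


def micro_prf(gold, predicted):
    """Micro precision, recall and F1 over aligned label sequences."""
    correct = sum(1 for g, p in zip(gold, predicted)
                  if p != NO_RELATION and p == g)
    n_predicted = sum(1 for p in predicted if p != NO_RELATION)
    n_gold = sum(1 for g in gold if g != NO_RELATION)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


pairs = label_all_pairs(["e1", "e2", "e3"], {("e1", "e2"): "REL_A"})
```

A sentence with n entities yields n(n-1) ordered pairs under this assumption, most of which are labelled "no relation".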

Results
As can be observed from the table, the Baseline model achieves the lowest F1 score among the proposed models. Incorporating attention improves the performance by 1.3 percentage points (pp). The addition of 2-length walks further improves performance (0.9 pp). The best results among the proposed models are achieved for maximum 4-length walks. By using up-to 8-length walks the performance drops by almost 2 pp. We also compared our performance with Nguyen and Grishman (2015) (CNN) using their data split, which the authors kindly provided to us. For the comparison, we applied our best performing model (l = 4). The obtained performance is 65.8 / 58.4 / 61.9 in terms of P / R / F1 (%), respectively. In comparison with the performance of the CNN model, 71.5 / 53.9 / 61.3, we observe a large improvement in recall, which results in a 0.6 pp F1 increase. We performed the Approximate Randomization test (Noreen, 1989) on the results. The best walks model shows no statistically significant difference from the state-of-the-art SPTree model in Table 1. This indicates that the proposed model can achieve comparable performance without any external syntactic tools.
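For readers unfamiliar with the Approximate Randomization test, a generic sketch is given below. It follows the usual textbook procedure (randomly swap the outputs of the two systems per instance and count how often the shuffled score difference is at least as large as the observed one). The function name, the number of trials and the metric interface are assumptions; the exact configuration used for the reported results is not stated in the text.

```python
import random


def approximate_randomization(gold, pred_a, pred_b, metric, trials=1000, seed=0):
    """Estimate a p-value for the score difference between two systems.

    metric(gold, predictions) must return a single number, e.g. micro F1.
    """
    rng = random.Random(seed)
    observed = abs(metric(gold, pred_a) - metric(gold, pred_b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:          # swap the two outputs for this instance
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        difference = abs(metric(gold, shuffled_a) - metric(gold, shuffled_b))
        if difference >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_large + 1) / (trials + 1)
```

A large p-value means the observed difference could easily arise from random reassignment of the outputs, which is the sense in which two systems show no statistically significant difference.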
Finally, we show the performance of the proposed model as a function of the number of entities in a sentence. Results in Table 2 reveal that for multi-pair sentences the model performs significantly better compared to the no-walks models, which supports the effectiveness of the method. Additionally, it is observed that for more entity pairs, longer walks seem to be required. However, very long walks result in reduced performance (l = 8).
[Table 2. Columns: # Entities, l = 1, l = 2, l = 4, l = 8.]

Related Work
Traditionally, relation extraction approaches have incorporated a large variety of hand-crafted features to represent related entity pairs (Hermann and Blunsom, 2013; Miwa and Sasaki, 2014; Nguyen and Grishman, 2014; Gormley et al., 2015). Recent models instead employ neural network architectures and achieve state-of-the-art results without heavy feature engineering. Neural network techniques can be categorized into recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The former are able to encode linguistic and syntactic properties of long word sequences, making them preferable for sequence-related tasks, e.g. natural language generation (Goyal et al., 2016) and machine translation (Sutskever et al., 2014).
State-of-the-art systems have achieved good performance on relation extraction using RNNs (Cai et al., 2016; Miwa and Bansal, 2016; Xu et al., 2016; Liu et al., 2015). Nevertheless, most approaches do not take into consideration the dependencies between relations in a single sentence (dos Santos et al., 2015; Nguyen and Grishman, 2015) and treat each pair separately. Current graph-based models are applied on knowledge graphs for distantly supervised relation extraction (Zeng et al., 2017). Graphs are defined on semantic types in their method, whereas we build entity-based graphs in sentences. Other approaches also treat multiple relations in a sentence (Gupta et al., 2016; Miwa and Sasaki, 2014; Li and Ji, 2014), but they fail to model long walks between entity mentions.

Conclusions
We proposed a novel neural network model for simultaneous sentence-level extraction of related pairs. Our model exploits target and context pair-specific representations and creates pair representations that encode up-to l-length walks between the entities of the pair. We compared our model with the state-of-the-art models and observed comparable performance on the ACE 2005 dataset without any external syntactic tools. The characteristics of the proposed approach are summarized in three factors: the encoding of dependencies between relations, the ability to represent multiple walks in the form of vectors and the independence from external tools. Future work will aim at the construction of an end-to-end relation extraction system as well as application to different types of datasets.

A Hyper-parameter Settings
We tuned our proposed model using the RoBO toolkit (https://github.com/automl/RoBO). Table 3 provides the selected options we used for tuning the model.

Optimization Options
Option                  Value
Optimization method     Bohamiann
Maximizer               scipy
Acquisition function    log_ei
Number of iterations    50
Initial points          3