Relation Extraction among Multiple Entities Using a Dual Pointer Network with a Multi-Head Attention Mechanism

Many previous studies on relation extraction have focused on finding only one relation between two entities in a single sentence. However, a single sentence often contains multiple entities, and these entities form multiple relations. To resolve this problem, we propose a relation extraction model based on a dual pointer network with a multi-head attention mechanism. The proposed model finds n-to-1 subject-object relations by using a forward decoder called an object decoder. Then, it finds 1-to-n subject-object relations by using a backward decoder called a subject decoder. In experiments with the ACE-05 dataset and the NYT dataset, the proposed model achieved state-of-the-art performance (an F1-score of 80.5% on the ACE-05 dataset and an F1-score of 78.3% on the NYT dataset).


Introduction
Relation extraction is the task of recognizing semantic relations (i.e., tuple structures; subject-relation-object triples) among entities in a sentence. Figure 1 shows three triples that can be extracted from the given sentence. With the significant success of neural networks in the field of natural language processing, various relation extraction models based on convolutional neural networks (CNNs) have been suggested (Kumar, 2017): the CNN model with max-pooling (Zeng et al., 2014), the CNN model with multi-sized window kernels, the combined CNN model (Yu and Jiang, 2016), and the contextualized graph convolutional network (C-GCN) model (Zhang et al., 2018).
Relation extraction models based on recurrent neural networks (RNNs) have been the other popular choice: the long short-term memory (LSTM) model with a dependency tree (Miwa and Bansal, 2016), the LSTM model with a position-aware attention mechanism (Zhang et al., 2017), and the walk-based model on entity graphs (Christopoulou et al., 2019). Most of these previous models have focused on extracting only one relation between two entities from a single sentence. However, multiple entities exist in a single sentence, and these entities can form multiple relations. To address this issue, we propose a relation extraction model that finds all possible relations among multiple entities in a sentence at once.
The proposed model is based on the pointer network (Vinyals et al., 2015). The pointer network is a sequence-to-sequence (Seq2Seq) model in which an attention mechanism (Bahdanau et al., 2015) is modified to learn the conditional probability of an output whose values correspond to positions in a given input sequence. We modify the pointer network to have dual decoders: an object decoder (a forward decoder) and a subject decoder (a backward decoder). The object decoder extracts n-to-1 relations, as in the following example: (James-BirthPlace-South Korea) and (Tom-BirthPlace-South Korea) extracted from 'James and Tom were born in South Korea'. The subject decoder extracts 1-to-n relations, as in the following example: (James-Position-student) and (James-Affiliation-Stanford university) extracted from 'James is a student at Stanford university'.

Seongsik Park, Harksoo Kim, Kangwon National University, South Korea, {a163912, nlpdrkim}@kangwon.ac.kr
Figure 2 illustrates the overall architecture of the proposed model. As shown in Figure 2, the proposed model consists of two parts: a context and entity encoder, and a dual pointer network decoder.
The context and entity encoder (the left part of Figure 2) computes the degrees of association between words and entities in a given sentence. Its inputs are a sequence of word embedding vectors and a sequence of entity embedding vectors. The word embedding vectors are concatenations of two types of embeddings: word-level GloVe embeddings for representing the meanings of words (Pennington et al., 2014) and character-level CNN embeddings for alleviating out-of-vocabulary problems (Park et al., 2018). The entity embedding vectors are similar to the word embedding vectors except that entity type embeddings are additionally concatenated. The entity type embeddings are vector representations associated with each entity type and are initialized with random values. For the ACE-2005 dataset, we use seven entity types: person, location, organization, facility, geo-political, vehicle, and weapon. The word embedding vectors are input to a bidirectional LSTM network in order to obtain contextual information. The entity embedding vectors are input to a forward LSTM network because entities are listed in the order in which they appear in a sentence. The output vectors of the bidirectional LSTM network and the forward LSTM network are input to the context-to-entity attention layer ('Context2Entity Attention' in Figure 2), which computes the relative degrees of association between words and entities in the same manner as the Context2Query attention proposed by Seo et al. (2017).
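The attention step described above can be illustrated with a minimal sketch. It assumes a plain dot-product similarity and assumes that each entity attends over the words of the sentence; the trainable similarity function of Seo et al. (2017) and the exact attention direction of the proposed model are not reproduced here. The function name `context_to_entity_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def context_to_entity_attention(word_vecs, entity_vecs):
    """For each entity, return (weights over words, weighted sum of word vectors).

    Assumption: similarity is a dot product, not a learned function.
    """
    dim = len(word_vecs[0])
    results = []
    for entity in entity_vecs:
        weights = softmax([dot(entity, word) for word in word_vecs])
        summary = [
            sum(wt * word[d] for wt, word in zip(weights, word_vecs))
            for d in range(dim)
        ]
        results.append((weights, summary))
    return results


# Toy encoder outputs: three words and two entities, two dimensions each.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
entities = [[2.0, 0.0], [0.0, 2.0]]
for weights, summary in context_to_entity_attention(words, entities):
    print([round(x, 3) for x in weights], [round(x, 3) for x in summary])
```

In this toy input, the first entity is most strongly associated with the first and third words, because those are the words whose vectors overlap with it.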
In a pointer network, attention weights form a position distribution over the encoding layer. Since attention is concentrated on only one position, the pointer network has a structural limitation when one entity forms relations with several entities (for instance, 'James' in Figure 1). The proposed model adopts a dual pointer network decoder (the right part of Figure 2) to overcome this limitation. The first decoder, called the object decoder, learns the position distribution from subjects to objects. Conversely, the second decoder, called the subject decoder, learns the position distribution from objects to subjects. In Figure 1, 'James' should point to both 'South Korea' and 'Stanford university'. A conventional forward decoder (the object decoder) cannot solve this problem because it cannot point to multiple targets. However, the subject decoder (a backward decoder) can resolve it because 'South Korea' and 'Stanford university' can each point to 'James'.
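The following sketch shows how the outputs of the two decoders could be combined into triples. It is only an illustration under stated assumptions: each decoder is assumed to emit one (relation label, pointed position) pair per entity, `None` is a hypothetical marker for an entity that has no relation, and taking the union of both decoders' triples is an assumed merging rule that the text does not spell out.

```python
def merge_decoder_outputs(entities, object_out, subject_out):
    """Combine per-entity pointers from both decoders into a set of triples.

    object_out[i]  = (label, index of the object) with entity i as subject.
    subject_out[i] = (label, index of the subject) with entity i as object.
    A target of None means "no relation" (hypothetical convention).
    """
    triples = set()
    for i, (label, target) in enumerate(object_out):
        if target is None:
            continue
        triples.add((entities[i], label, entities[target]))
    for i, (label, target) in enumerate(subject_out):
        if target is None:
            continue
        triples.add((entities[target], label, entities[i]))
    return triples


entities = ["James", "student", "Stanford university"]
# The object decoder can give 'James' only one target.
object_out = [("Position", 1), (None, None), (None, None)]
# The subject decoder lets both objects point back to 'James'.
subject_out = [(None, None), ("Position", 0), ("Affiliation", 0)]

for triple in sorted(merge_decoder_outputs(entities, object_out, subject_out)):
    print(triple)
```

The object decoder alone recovers only (James, Position, student). The second triple, (James, Affiliation, Stanford university), is recovered because the subject decoder points from the object back to 'James'.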
Additionally, we adopt a multi-head attention mechanism in order to improve the performance of the dual pointer network. The multi-head attention mechanism splits the input value into multiple heads and computes the attention of each head. The inputs {h_1, h_2, …} of the multi-head attention layer are vectors that concatenate the entity embedding vectors and the corresponding output vectors of the context-to-entity attention layer. An additional randomly initialized vector h is used for handling entities that do not have any relations with other entities. In other words, entities without any relations point to this vector. As shown in Figure 2, the dual pointer network decoder returns two kinds of value sequences: a sequence of relation labels and a sequence of pointed positions. Regarding entity types, we use three in the NYT dataset: person, location, and organization.
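A minimal sketch of pointing with multi-head attention is given below. It assumes scaled dot-product scoring within each head and assumes that the per-head distributions are averaged into one position distribution; neither choice is stated in the text. Position 0 plays the role of the randomly initialized no-relation vector. The function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vec, num_heads):
    """Split one vector into num_heads equally sized chunks."""
    if len(vec) % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    size = len(vec) // num_heads
    return [vec[k * size:(k + 1) * size] for k in range(num_heads)]


def multi_head_pointer(query, keys, num_heads):
    """Return a distribution over key positions, averaged over heads (assumed)."""
    head_size = len(query) // num_heads
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(key, num_heads) for key in keys]
    dist = [0.0] * len(keys)
    for h in range(num_heads):
        scores = [
            sum(a * b for a, b in zip(q_heads[h], key[h])) / math.sqrt(head_size)
            for key in k_heads
        ]
        for pos, prob in enumerate(softmax(scores)):
            dist[pos] += prob / num_heads
    return dist


# Position 0 stands for the no-relation vector; positions 1 and 2 are entities.
keys = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
query = [1.0, 0.0, 0.0, 1.0]
dist = multi_head_pointer(query, keys, num_heads=2)
pointed = max(range(len(dist)), key=dist.__getitem__)
print([round(p, 3) for p in dist], "pointed position:", pointed)
```

Because the result is a single distribution with a single maximum, each decoding step still selects one position, which is why two decoders are needed to cover both n-to-1 and 1-to-n relations.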

Datasets and Experimental Settings
We evaluated the proposed model by using the following benchmark datasets.
ACE-05 corpus: The Automatic Content Extraction (ACE) dataset includes seven major entity types and six major relation types. The ACE-05 corpus is not directly suitable for evaluating models that extract multiple triples from a sentence. Therefore, if some triples in the ACE-05 corpus share a sentence (i.e., the triples occur in the same sentence), we merged the triples. As a result, we obtained a dataset annotated with multiple triples. Then, we divided the new dataset into a training set (5,023 sentences), a development set (629 sentences), and a test set (627 sentences) at a ratio of 8:1:1.
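The preprocessing described above can be sketched as follows. This is an illustration, not the authors' script: the record format (one sentence paired with one triple), the function names, and the rounding rule in the split are assumptions. No shuffling is applied here, and whether the original split was shuffled is not stated.

```python
from collections import defaultdict


def merge_triples_by_sentence(records):
    """Group (sentence, triple) records so each sentence carries all its triples."""
    merged = defaultdict(list)
    for sentence, triple in records:
        if triple not in merged[sentence]:
            merged[sentence].append(triple)
    return list(merged.items())


def split_8_1_1(items):
    """Split into training, development, and test parts at a ratio of 8:1:1.

    Assumed rounding: floor for training and test, remainder to development.
    """
    n = len(items)
    n_train = n * 8 // 10
    n_test = n // 10
    n_dev = n - n_train - n_test
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


records = [
    ("James is a student at Stanford university",
     ("James", "Position", "student")),
    ("James is a student at Stanford university",
     ("James", "Affiliation", "Stanford university")),
]
print(merge_triples_by_sentence(records))

# With 6,279 items, this rounding gives part sizes of 5,023, 629, and 627.
train, dev, test = split_8_1_1(list(range(6279)))
print(len(train), len(dev), len(test))
```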
New York Times (NYT) corpus (Riedel et al., 2010): The NYT corpus is a news corpus sampled from New York Times news articles. It was produced by a distant supervision method. Zheng et al. (2017) and Zeng et al. (2018) used this dataset as supervised data. We excluded sentences without relation facts from Zheng's corpus. Finally, we obtained 66,202 sentences in total. We used 59,581 sentences for training and 6,621 for evaluation.
The proposed model was optimized with the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001, 128 encoder units, 256 decoder units, and a dropout rate of 0.1.
Table 1 shows the performance of the proposed model and the comparison models when the ACE-05 corpus is used as the evaluation dataset. In Table 1, SPTree LSTM (Miwa and Bansal, 2016) is a model that applies the dependency information between the entities. FCM (Gormley et al., 2015) is a model in which handcrafted features are combined with word embeddings. CNN+RNN (Nguyen and Grishman, 2015) is a hybrid model of CNN and RNN. HRCNN (Kim and Choi, 2018) is another comparison model.
Table 2 shows the performance of the proposed model and the comparison models when the NYT corpus is used as the evaluation dataset. In Table 2, NovelTag (Zheng et al., 2017) and MultiDecoder (Zeng et al., 2018) are models that jointly extract entities and relations. It is not reasonable to directly compare the proposed model with NovelTag and MultiDecoder because the proposed model needs gold-labeled entities, while NovelTag and MultiDecoder automatically extract entities from sentences. Although the direct comparison is unfair, the proposed model showed much higher performance than expected.
Table 3 shows how performance changes with the number of entities per sentence in the ACE-05 corpus. As shown in Table 3, the more entities a sentence contained, the lower the performance of the proposed model was. We think that this decrease in performance is due to the increase in complexity. The performance for sentences with more than five entities was slightly better than the performance for sentences with four entities. We attribute this to the fact that many of these entities do not have any relations with the other entities.
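The F1-scores discussed in this section can be computed along the following lines. The sketch assumes micro-averaging over exactly matching triples, which is a common convention but is not specified in the text; the function name `micro_prf` and the toy triples are hypothetical.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, and F1 over per-sentence triple sets.

    gold and predicted are parallel lists of sets of (subject, relation, object).
    A predicted triple counts as correct only on an exact match (assumption).
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [
    {("James", "Position", "student"),
     ("James", "Affiliation", "Stanford university")},
    {("Tom", "BirthPlace", "South Korea")},
]
predicted = [
    {("James", "Position", "student")},
    {("Tom", "BirthPlace", "South Korea"),
     ("James", "BirthPlace", "South Korea")},
]
precision, recall, f1 = micro_prf(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case, two of the three predicted triples are correct and two of the three gold triples are found, so precision, recall, and F1 are all 2/3.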

Conclusion
We proposed a relation extraction model that finds all possible relations among multiple entities in a sentence at once. The proposed model is based on a pointer network with a multi-head attention mechanism. To extract all possible relations from a sentence, we replaced the single decoder of the pointer network with a dual decoder. In the dual decoder, the object decoder extracts n-to-1 subject-object relations, and the subject decoder extracts 1-to-n subject-object relations. In the experiments with the ACE-05 corpus and the NYT corpus, the proposed model achieved F1-scores of 80.5% and 78.3%, respectively.