Extracting Relational Facts by an End-to-End Neural Model with Copy Mechanism

The relational facts in sentences are often complicated: different relational triplets may overlap in a sentence. We divide sentences into three types according to triplet overlap degree: Normal, EntityPairOverlap and SingleEntityOverlap. Existing methods mainly focus on the Normal class and fail to extract relational triplets precisely for the other two. In this paper, we propose an end-to-end model based on sequence-to-sequence learning with a copy mechanism, which can jointly extract relational facts from sentences of any of these classes. We adopt two different strategies in the decoding process: employing one unified decoder or applying multiple separate decoders. We test our models on two public datasets, and they significantly outperform the baseline method.


Introduction
Recently, great efforts have been made on extracting relational facts from natural language texts to build large structured knowledge bases (KBs). A relational fact is often represented as a triplet consisting of two entities (an entity pair) and a semantic relation between them, such as <Chicago, country, UnitedStates>.
So far, most previous methods have mainly focused on relation extraction or classification, which identifies the semantic relation between two pre-assigned entities. Although great progress has been made (Hendrickx et al., 2010; Zeng et al., 2014; Xu et al., 2015a,b), these methods all assume that the entities are identified beforehand and neglect entity extraction. To extract both entities and relations, early works (Zelenko et al., 2003; Chan and Roth, 2011) adopted a pipeline manner, where they first conduct entity recognition and then predict relations between the extracted entities. However, the pipeline framework ignores the relevance of entity identification and relation prediction (Li and Ji, 2014). Recent works attempted to extract entities and relations jointly. Yu and Lam (2010); Li and Ji (2014); Miwa and Sasaki (2014) designed elaborate features to build a bridge between these two subtasks. As in other natural language processing (NLP) tasks, these methods need complicated feature engineering and rely heavily on pre-existing NLP tools for feature extraction. Recently, with the success of deep learning on many NLP tasks, it has also been applied to relational fact extraction. Zeng et al. (2014); Xu et al. (2015a,b) employed CNNs or RNNs for relation classification. Miwa and Bansal (2016); Gupta et al. (2016); Zhang et al. (2017) treated relation extraction as an end-to-end (end2end) table-filling problem. Zheng et al. (2017) proposed a novel tagging schema and employed a Recurrent Neural Network (RNN) based sequence labeling model to jointly extract entities and relations.

Figure 1: Examples of the Normal, EntityPairOverlap (EPO) and SingleEntityOverlap (SEO) classes. Overlapped entities are marked in yellow. S1 belongs to the Normal class because none of its triplets have overlapped entities; S2 belongs to the EntityPairOverlap class since the entity pair <Sudan, Khartoum> of its two triplets is overlapped; and S3 belongs to the SingleEntityOverlap class because the entity Aarhus of its two triplets is overlapped while these two triplets share no entity pair.
Nevertheless, the relational facts in sentences are often complicated: different relational triplets may overlap in a sentence. This phenomenon makes the aforementioned methods, whether deep learning based models or traditional feature engineering based joint models, fail to extract relational triplets precisely. Based on our observation, we divide sentences into three types according to triplet overlap degree: Normal, EntityPairOverlap (EPO) and SingleEntityOverlap (SEO). As shown in Figure 1, a sentence belongs to the Normal class if none of its triplets have overlapped entities. A sentence belongs to the EntityPairOverlap class if some of its triplets share an overlapped entity pair. And a sentence belongs to the SingleEntityOverlap class if some of its triplets share an overlapped entity but no overlapped entity pair. To our knowledge, most previous methods focus on the Normal type and seldom consider the other types. Even the neural joint model (Zheng et al., 2017) assigns only a single tag to each word, which means one word can participate in at most one triplet. As a result, the triplet overlap issue is not actually addressed.
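As a concrete illustration of these definitions, a sentence's class membership can be computed from its triplet set alone. The following sketch (the helper name and representation are ours, not from the paper) returns every class a sentence belongs to, since a sentence can belong to both EPO and SEO:

```python
def overlap_classes(triplets):
    """Classify a sentence by its triplets, given as (head, relation, tail) tuples.
    Returns the set of overlap classes: {"Normal"}, or any of {"EPO", "SEO"}."""
    classes = set()
    # Unordered entity pairs of each triplet.
    pairs = [frozenset((h, t)) for h, r, t in triplets]
    # EPO: at least two triplets share the same entity pair.
    if len(pairs) != len(set(pairs)):
        classes.add("EPO")
    # SEO: two triplets share one entity but not the whole entity pair.
    for i in range(len(triplets)):
        for j in range(i + 1, len(triplets)):
            if pairs[i] != pairs[j] and pairs[i] & pairs[j]:
                classes.add("SEO")
    return classes or {"Normal"}
```

For instance, the S2 example of Figure 1 (two triplets over the pair <Sudan, Khartoum>) is classified EPO, and the S3 example (two triplets sharing only Aarhus) is classified SEO.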
To address this challenge, we aim to design a model that can extract triplets, including entities and relations, from sentences of the Normal, EntityPairOverlap and SingleEntityOverlap classes. To handle triplet overlap, one entity must be allowed to participate freely in multiple triplets. Different from previous neural methods, we propose an end2end model based on sequence-to-sequence (Seq2Seq) learning with a copy mechanism, which can jointly extract relational facts from sentences of any of these classes. Specifically, the model consists of two main components: an encoder and a decoder. The encoder converts a natural language sentence (the source sentence) into a fixed-length semantic vector. The decoder then reads this vector and generates triplets directly. To generate a triplet, the decoder first generates the relation, then copies the first entity (head entity) from the source sentence via the copy mechanism, and finally copies the second entity (tail entity) from the source sentence. In this way, multiple triplets can be extracted. Concretely, we adopt two different strategies in the decoding process: employing one unified decoder (OneDecoder) to generate all triplets, or applying multiple separate decoders (MultiDecoder), each generating one triplet. In our model, an entity is allowed to be copied several times when it participates in different triplets. Therefore, our model can handle the triplet overlap issue in both the EntityPairOverlap and SingleEntityOverlap sentence types. Moreover, since entities and relations are extracted in a single end2end neural network, our model extracts them jointly.
The main contributions of our work are as follows:
• We propose an end2end neural model based on sequence-to-sequence learning with a copy mechanism to extract relational facts from sentences, where the entities and relations are jointly extracted.
• Our model addresses the relational triplet overlap problem through the copy mechanism. To our knowledge, this problem has never been addressed before.
• We conduct experiments on two public datasets. Experimental results show that our models outperform the state-of-the-art method with 39.8% and 31.1% improvements, respectively.

Related Work
Given a sentence with annotated entities, Hendrickx et al. (2010); Zeng et al. (2014); Xu et al. (2015a,b) treated the extraction of relational facts as a relation classification task between the pre-assigned entities.

Figure 2: The overall structure of the OneDecoder model. A bi-directional RNN is used to encode the source sentence and then a decoder is used to generate triplets directly. The relation is predicted and the entities are copied from the source sentence.
Given a sentence without any annotated entities, researchers proposed several methods to extract both entities and relations. Pipeline based methods, like Zelenko et al. (2003) and Chan and Roth (2011), neglect the relevance of entity extraction and relation prediction. To resolve this problem, several joint models have been proposed. Early works (Yu and Lam, 2010; Li and Ji, 2014; Miwa and Sasaki, 2014) rely on complicated feature engineering, while recent works (Zheng et al., 2017) jointly extract entities and relations with neural networks. These neural models are based on a tagging framework, which assigns a relational tag to a word or a word pair. Despite their success, none of these models can fully handle the triplet overlap problem mentioned in the first section. The reason lies in their assumption that a word (or a word pair) can be assigned only one relational tag.
This work is based on sequence-to-sequence learning with a copy mechanism, which has been adopted for several NLP tasks. Dong and Lapata (2016) presented a method based on an attention-enhanced encoder-decoder model, which encodes input utterances and generates their logical forms. Gu et al. (2016); He et al. (2017) applied the copy mechanism to sentence generation, copying a segment from the source sequence to the target sequence.

Our Model
In this section, we introduce a differentiable neural model based on Seq2Seq learning with copy mechanism, which is able to extract multiple relational facts in an end2end fashion.
Our neural model first encodes a variable-length sentence into a fixed-length vector representation and then decodes this vector into the corresponding relational facts (triplets). When decoding, we can either decode all triplets with one unified decoder or decode every triplet with a separate decoder. We denote these as the OneDecoder model and the MultiDecoder model, respectively.

OneDecoder Model
The overall structure of the OneDecoder model is shown in Figure 2.

Encoder
To encode a sentence $s = [w_1, ..., w_n]$, where $w_t$ represents the $t$-th word and $n$ is the source sentence length, we first turn it into a matrix $X = [x_1, \cdots, x_n]$, where $x_t$ is the embedding of the $t$-th word.
The canonical RNN encoder reads this matrix $X$ sequentially and generates output $o^E_t$ and hidden state $h^E_t$ in time step $t$:

$$o^E_t, h^E_t = f(x_t, h^E_{t-1})$$

where $f(\cdot)$ represents the encoder function.
Following Gu et al. (2016), our encoder uses a bi-directional RNN (Chung et al., 2014) to encode the input sentence. The forward and backward RNNs read the sentence from left to right and from right to left, respectively. We use $o^E_t = [\overrightarrow{o}^E_t; \overleftarrow{o}^E_{n-t+1}]$, the concatenation of the forward and backward outputs, to represent the $t$-th word. Similarly, the concatenation of the forward and backward RNN hidden states is used as the representation $s$ of the sentence.
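The bi-directional encoding can be sketched as follows. This is a minimal numpy illustration with randomly initialized weights, substituting a plain tanh cell for the GRU used in the paper; function and parameter names are ours:

```python
import numpy as np

def rnn_cell(x, h, Wx, Wh, b):
    # One vanilla-RNN step (the paper uses a GRU; a tanh cell keeps the sketch short).
    return np.tanh(Wx @ x + Wh @ h + b)

def bi_rnn_encode(X, d_h, seed=0):
    """X: (n, d_x) word embeddings. Returns encoder outputs O of shape
    (n, 2*d_h) and the sentence representation s of shape (2*d_h,)."""
    rng = np.random.default_rng(seed)
    n, d_x = X.shape
    # Separate parameters for the forward and backward cells.
    params = [(rng.standard_normal((d_h, d_x)) * 0.1,
               rng.standard_normal((d_h, d_h)) * 0.1,
               np.zeros(d_h)) for _ in range(2)]
    h_f, h_b = np.zeros(d_h), np.zeros(d_h)
    fwd, bwd = [], []
    for t in range(n):                       # left-to-right pass
        h_f = rnn_cell(X[t], h_f, *params[0])
        fwd.append(h_f)
    for t in reversed(range(n)):             # right-to-left pass
        h_b = rnn_cell(X[t], h_b, *params[1])
        bwd.append(h_b)
    bwd.reverse()
    # o^E_t is the concatenation of forward and backward outputs at position t.
    O = np.concatenate([np.stack(fwd), np.stack(bwd)], axis=1)
    # Sentence representation from the final hidden states of both directions.
    s = np.concatenate([fwd[-1], bwd[0]])
    return O, s
```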

Decoder
The decoder is used to generate triplets directly. First, the decoder generates a relation for the triplet. Second, the decoder copies an entity from the source sentence as the first entity of the triplet. Last, the decoder copies the second entity from the source sentence. By repeating this process, the decoder can generate multiple triplets. Once all valid triplets are generated, the decoder generates NA triplets, which mean "stopping" and are similar to the "eos" symbol in neural sentence generation. Note that an NA triplet is composed of an NA-relation and an NA-entity pair.

As shown in Figure 3 (a), in time step $t$ ($1 \le t$), we calculate the decoder output $o^D_t$ and hidden state $h^D_t$ as follows:

$$o^D_t, h^D_t = g(u_t, h^D_{t-1})$$

where $g(\cdot)$ is the decoder function and $h^D_{t-1}$ is the hidden state of time step $t-1$. We initialize $h^D_0$ with the representation $s$ of the source sentence. $u_t$ is the decoder input in time step $t$, calculated as:

$$u_t = [v_t; c_t] \cdot W^u$$

where $c_t$ is the attention vector, $v_t$ is the embedding of the copied entity or predicted relation in time step $t-1$, and $W^u$ is a weight matrix.

Attention Vector. The attention vector $c_t$ is calculated as follows:

$$c_t = \sum_{i=1}^{n} \alpha_i \times o^E_i, \quad \alpha = \mathrm{softmax}(\beta), \quad \beta_i = \mathrm{selu}([h^D_{t-1}; o^E_i] \cdot w^c)$$

where $o^E_i$ is the output of the encoder in time step $i$, $\alpha = [\alpha_1, ..., \alpha_n]$ and $\beta = [\beta_1, ..., \beta_n]$ are vectors, $w^c$ is a weight vector, and $\mathrm{selu}(\cdot)$ is the activation function of Klambauer et al. (2017).
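The attention computation above can be sketched directly in numpy. This follows the reconstructed equations (selu-scored attention over encoder outputs); function names are illustrative:

```python
import numpy as np

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # Scaled exponential linear unit (Klambauer et al., 2017).
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_vector(h_prev, O, w_c):
    """h_prev: decoder hidden state (d,); O: encoder outputs (n, d_o);
    w_c: weight vector of size d + d_o. Returns c_t = sum_i alpha_i * o^E_i."""
    beta = np.array([selu(np.concatenate([h_prev, o]) @ w_c) for o in O])
    alpha = softmax(beta)          # attention weights over source positions
    return alpha @ O               # convex combination of encoder outputs
```

Since the weights $\alpha$ sum to one, $c_t$ is a convex combination of the encoder outputs and so lies within their coordinate-wise range.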
After we get the decoder output $o^D_t$ in time step $t$ ($1 \le t$), if $t\%3 = 1$ (that is, $t = 1, 4, 7, ...$), we use $o^D_t$ to predict a relation, which means we are decoding a new triplet. If $t\%3 = 2$ (that is, $t = 2, 5, 8, ...$), we use $o^D_t$ to copy the first entity from the source sentence, and if $t\%3 = 0$ (that is, $t = 3, 6, 9, ...$), we copy the second entity.
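The three-step cycle above means a flat decoded sequence can be grouped back into triplets until an NA-relation is produced. A small sketch (helper names are ours):

```python
def step_role(t):
    """Role of decoder time step t (1-indexed) under the t % 3 schedule."""
    return {1: "relation", 2: "head-entity", 0: "tail-entity"}[t % 3]

def group_triplets(decoded, na_relation="NA"):
    """decoded: flat list [rel1, head1, tail1, rel2, head2, tail2, ...].
    Stops at the first NA-relation, which signals the end of valid triplets."""
    triplets = []
    for i in range(0, len(decoded) - 2, 3):
        rel, head, tail = decoded[i:i + 3]
        if rel == na_relation:
            break
        triplets.append((head, rel, tail))
    return triplets
```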
Predict Relation. Suppose there are $m$ valid relations in total. We use a fully connected layer to calculate the confidence vector $q^r = [q^r_1, ..., q^r_m]$ of all valid relations:

$$q^r = W^r \cdot o^D_t + b^r$$

where $W^r$ is a weight matrix and $b^r$ is a bias. When predicting the relation, the model may need to generate an NA-triplet, so we also calculate the confidence value of the NA-relation as:

$$q^{NA} = W^{NA} \cdot o^D_t + b^{NA}$$

where $W^{NA}$ is a weight matrix and $b^{NA}$ is a bias. We then concatenate $q^r$ and $q^{NA}$ to form the confidence vector of all relations (including the NA-relation) and apply softmax to obtain the probability distribution $p^r = [p^r_1, ..., p^r_{m+1}]$:

$$p^r = \mathrm{softmax}([q^r; q^{NA}])$$

We select the relation with the highest probability as the predicted relation and use its embedding as the input $v_{t+1}$ of the next time step.

Copy the First Entity. To copy the first entity, we calculate the confidence vector $q^e = [q^e_1, ..., q^e_n]$ of all words in the source sentence as:

$$q^e_i = \mathrm{selu}([o^D_t; o^E_i] \cdot w^e)$$

where $w^e$ is a weight vector. As in relation prediction, we concatenate $q^e$ and $q^{NA}$ to form the confidence vector and apply softmax to obtain the probability distribution $p^e = [p^e_1, ..., p^e_{n+1}]$:

$$p^e = \mathrm{softmax}([q^e; q^{NA}])$$

We select the word with the highest probability as the copied word and use its embedding as the input $v_{t+1}$ of the next time step.

Copy the Second Entity. Copying the second entity is almost the same as copying the first, except that the first entity cannot be copied again: in a valid triplet, the two entities must be different. Suppose the first copied entity is the $k$-th word in the source sentence. We introduce a mask vector $M$ with $n$ elements ($n$ is the length of the source sentence), where:

$$M_i = \begin{cases} 1, & i \neq k \\ 0, & i = k \end{cases}$$

Then we calculate the probability distribution $p^e$ as:

$$p^e = \mathrm{softmax}([M \otimes q^e; q^{NA}])$$

where $\otimes$ is element-wise multiplication. As with the first entity, we select the word with the highest probability as the copied word and use its embedding as the input $v_{t+1}$ of the next time step.
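The masked copy step can be sketched as follows. The reconstructed equation masks the confidence $q^e_k$ multiplicatively; the sketch below instead sets the masked score to $-\infty$ before the softmax, a common variant that guarantees the first entity receives exactly zero probability (a multiplicative zero on the score does not). Names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def copy_second_entity(q_e, q_na, k):
    """q_e: length-n confidences over source words; q_na: NA confidence
    (scalar); k: index of the first copied entity, which must not be
    copied again. Returns a distribution over n words plus the NA slot."""
    scores = np.concatenate([np.asarray(q_e, dtype=float), [q_na]])
    scores[k] = -np.inf  # forbid re-copying the k-th source word
    return softmax(scores)
```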

MultiDecoder Model
The MultiDecoder model is an extension of the OneDecoder model. The main difference is that, when decoding, the MultiDecoder model generates triplets with several separate decoders. Figure 3 (b) shows the inputs and outputs of the decoders of the MultiDecoder model with two decoders (the green and blue rectangles with shadows). The decoders work in sequential order: the first decoder generates the first triplet and then the second decoder generates the second triplet. Similar to Eq. 2, we calculate the hidden state $h^{D_i}_t$ and output $o^{D_i}_t$ of the $i$-th decoder ($1 \le i$) in time step $t$ as follows:

$$o^{D_i}_t, h^{D_i}_t = g^{D_i}(u_t, \hat{h}^{D_i}_{t-1})$$

where $g^{D_i}(\cdot)$ is the decoder function of decoder $i$, and $u_t$ is the decoder input in time step $t$, calculated as in Eq. 3. Within a triplet, $\hat{h}^{D_i}_{t-1}$ is the hidden state $h^{D_i}_{t-1}$ of the $i$-th decoder in time step $t-1$; at the first step of the $i$-th decoder, it is the initial hidden state, calculated as follows:

$$\hat{h}^{D_i}_{t-1} = \begin{cases} s, & i = 1 \\ \frac{1}{2}\left(s + h^{D_{i-1}}\right), & i > 1 \end{cases}$$

where $s$ is the representation of the source sentence and $h^{D_{i-1}}$ is the last hidden state of the $(i-1)$-th decoder.
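As a sketch under the reconstructed initialization rule above (which is an assumption recovered from the surrounding text, not verbatim from the source), the hand-off between decoders can be written as:

```python
import numpy as np

def init_decoder_state(i, s, h_prev_last=None):
    """Initial hidden state of the i-th decoder (1-indexed), under the
    reconstructed rule: decoder 1 starts from the sentence vector s; each
    later decoder starts from the average of s and the last hidden state
    of the previous decoder (h_prev_last)."""
    if i == 1:
        return s
    return 0.5 * (s + h_prev_last)
```

Averaging with $s$ keeps every decoder anchored to the sentence representation while still passing along what the previous decoder has already generated.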

Training
Both the OneDecoder and MultiDecoder models are trained with the negative log-likelihood loss function. Given a batch of $B$ sentences $S = \{s_1, ..., s_B\}$ with target results $Y = \{y_1, ..., y_B\}$, where $y_i = [y^1_i, ..., y^T_i]$ is the target result of $s_i$, the loss function is defined as follows:

$$\mathcal{L} = \frac{1}{B \times T} \sum_{i=1}^{B} \sum_{t=1}^{T} -\log p\left(y^t_i \mid y^{<t}_i, s_i, \theta\right)$$

where $T$ is the maximum time step of the decoder, $p(y^t_i \mid y^{<t}_i, s_i, \theta)$ is the conditional probability of the target symbol $y^t_i$ given the previously generated symbols $y^{<t}_i$, the sentence $s_i$ and the parameters, and $\theta$ denotes the parameters of the entire model.
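The loss above averages the per-step negative log-probabilities over the batch and over decoder time steps. A direct numpy sketch (illustrative names; probabilities are taken as given rather than produced by the model):

```python
import numpy as np

def nll_loss(probs, targets):
    """probs[i][t]: predicted distribution at step t for sentence i;
    targets[i][t]: index of the gold symbol at that step. Implements
    L = 1/(B*T) * sum_i sum_t -log p(y_i^t)."""
    B = len(targets)
    T = len(targets[0])
    total = 0.0
    for i in range(B):
        for t in range(T):
            total += -np.log(probs[i][t][targets[i][t]])
    return total / (B * T)
```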

Dataset
To evaluate the performance of our methods, we conduct experiments on two widely used datasets. The first is the New York Times (NYT) dataset, which was produced by the distant supervision method (Riedel et al., 2010). It consists of 1.18M sentences sampled from 294k New York Times news articles from 1987-2007, with 24 valid relations in total. In this paper, we treat this dataset as supervised data, following Zheng et al. (2017). We filter out sentences with more than 100 words and sentences containing no positive triplets, leaving 66195 sentences. We randomly select 5000 sentences as the test set, 5000 sentences as the validation set, and use the remaining 56195 sentences as the training set.
The second is the WebNLG dataset (Gardent et al., 2017), originally created for the Natural Language Generation (NLG) task. This dataset contains 246 valid relations. In this dataset, an instance includes a group of triplets and several standard sentences (written by humans). Every standard sentence contains all triplets of the instance. We only use the first standard sentence in our experiments, and we filter out instances whose triplet entities are not all found in this standard sentence. The original WebNLG dataset contains a train set and a development set. In our experiments, we treat the original development set as the test set and randomly split the original train set into a validation set and a train set. After filtering and splitting, the train set contains 5019 instances, the test set contains 703 instances and the validation set contains 500 instances.
The numbers of sentences of every class in the NYT and WebNLG datasets are shown in Table 1. It is worth noting that a sentence can belong to both the EntityPairOverlap class and the SingleEntityOverlap class.

Settings
In our experiments, for both datasets, we use LSTM (Hochreiter and Schmidhuber, 1997) as the model cell; the cell unit number is set to 1000; the embedding dimension is set to 100; the batch size is 100 and the learning rate is 0.001; the maximum number of time steps T is 15, which means we predict at most 5 triplets for each sentence (therefore, there are 5 decoders in the MultiDecoder model). These hyperparameters are tuned on the validation set. We use Adam (Kingma and Ba, 2015) to optimize the parameters, and we stop training when we find the best result on the validation set.

Baseline and Evaluation Metrics
We compare our models with the NovelTagging model (Zheng et al., 2017), which achieves the best performance on relational fact extraction. We directly run the code released by Zheng et al. (2017) to acquire the results.
Following Zheng et al. (2017), we use the standard micro precision, recall and F1 score to evaluate the results. A triplet is regarded as correct when its relation and entities are both correct. When copying an entity, we only copy its last word. A triplet is regarded as an NA-triplet if and only if its relation is the NA-relation and it has an NA-entity pair. Predicted NA-triplets are excluded from evaluation. As we can see, on the NYT dataset, our MultiDecoder model achieves the best F1 score, 0.587, a 39.8% improvement over the NovelTagging model's 0.420. Our OneDecoder model also outperforms the NovelTagging model. On the WebNLG dataset, the MultiDecoder model achieves the highest F1 score (0.371); the MultiDecoder and OneDecoder models outperform the NovelTagging model with 31.1% and 7.8% improvements, respectively. These observations verify the effectiveness of our models.
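The evaluation protocol above amounts to a set comparison over extracted triplets. A minimal sketch (names are ours; NA-triplets are assumed to be filtered out beforehand):

```python
def micro_prf(predicted, gold):
    """predicted/gold: per-sentence lists of sets of (head, relation, tail)
    triplets. Returns micro precision, recall and F1 over all sentences."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))  # exact matches
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```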

Results
We can also observe that, on both the NYT and WebNLG datasets, the NovelTagging model achieves the highest precision and the lowest recall, while our models are much more balanced. We think the reason lies in the structure of the models. The NovelTagging method finds triplets by tagging words, but it assumes that only one tag can be assigned to a word. As a result, one word can participate in at most one triplet, so the NovelTagging model can only recall a small number of triplets, which harms its recall. Different from the NovelTagging model, our models apply the copy mechanism to find entities for a triplet, and a word can be copied many times when it needs to participate in multiple different triplets. Not surprisingly, our models recall more triplets and achieve higher recall values. Further experiments verify this.

Detailed Results on Different Sentence Types
To verify the ability of our models to handle the overlapping problem, we conduct further experiments on the NYT dataset. Figure 4 shows the results of the NovelTagging, OneDecoder and MultiDecoder models on the Normal, EntityPairOverlap and SingleEntityOverlap classes. As we can see, our proposed models perform much better than the NovelTagging model on the EntityPairOverlap and SingleEntityOverlap classes; specifically, they achieve much higher performance on all metrics. Another observation is that the NovelTagging model achieves the best performance on the Normal class, since its design is more suitable for the Normal class, while our models are more suitable for the triplet overlap cases. Furthermore, it is still difficult for our models to judge how many triplets are needed for an input sentence, which causes some loss on the Normal class. Nevertheless, as reported above, the overall performance of our models is still better than that of the NovelTagging model.

We also compare the models' ability to extract relations from sentences containing different numbers of triplets. We divide the sentences in the NYT test set into 5 subclasses, containing sentences with 1, 2, 3, 4 or >= 5 triplets, respectively. The results are shown in Figure 5. When extracting relations from sentences containing one triplet, the NovelTagging model achieves the best performance. However, as the number of triplets increases, the performance of the NovelTagging model decreases significantly; its recall in particular drops sharply. These experimental results demonstrate the ability of our models to handle multiple-relation extraction.

We also compare the OneDecoder and MultiDecoder models on relation generation and entity generation; the results are shown in Table 3 and Table 4. We can observe that on both the NYT and WebNLG datasets, these two models have comparable abilities on relation generation. However, the MultiDecoder model performs better than the OneDecoder model when generating entities.
We think this is because the MultiDecoder model uses different decoders to generate different triplets, so the entity generation results can be more diverse.

Conclusions and Future Work
In this paper, we propose an end2end neural model based on the Seq2Seq learning framework with a copy mechanism for relational fact extraction. Our model can jointly extract relations and entities from sentences, especially when the triplets in a sentence overlap. Moreover, we analyze the different overlap types and adopt two decoding strategies for this issue: one unified decoder and multiple separate decoders. We conduct experiments on two public datasets to evaluate the effectiveness of our models. The experimental results show that our models outperform the baseline method significantly and can extract relational facts from sentences of all three classes. This challenging task is far from solved, and our future work will concentrate on further improving performance. Another direction is to test our model on other NLP tasks such as event extraction.