Learning the Extraction Order of Multiple Relational Facts in a Sentence with Reinforcement Learning

The multiple relation extraction task aims to extract all relational facts from a sentence. Existing works did not consider the extraction order of relational facts in a sentence. In this paper we argue that the extraction order is important in this task. To take the extraction order into consideration, we apply reinforcement learning to a sequence-to-sequence model. The proposed model can generate relational facts freely. Extensive experiments on two public datasets demonstrate the efficacy of the proposed method.


Introduction
Relation extraction (RE) is a core task in natural language processing (NLP). RE can be used in information extraction (Wu and Weld, 2010), question answering (Yih et al., 2015; Dai et al., 2016) and other NLP tasks. Most existing works assumed that a sentence contains only one relational fact (a relational fact, or a triplet, consists of a relation and two entities). But in fact, a sentence often contains multiple relational facts (Zeng et al., 2018b). The multiple relation extraction task aims to extract all relational facts from a sentence.
Existing works on the multiple relation extraction task can be divided into five genres. 1) PseudoPipeline genre, including Miwa and Bansal (2016); Sun et al. (2018). They first recognized all the entities in the sentence, then extracted features for each entity pair and predicted their relation. They trained the entity recognition model and the relation prediction model together instead of separately; therefore, we call them PseudoPipeline methods. 2) TableFilling genre, including Miwa and Sasaki (2014) and Gupta et al. (2016). They maintained an entity-relation table and predicted a semantic tag (either an entity tag or a relation tag) for each cell in the table.
According to the predicted tags, they can recognize the entities and the relation between each entity pair. 3) NovelTagging genre, including Zheng et al. (2017). This method can be seen as a development of the TableFilling methods. They assigned a pre-defined semantic tag to each word of the sentence and collected triplets based on the tags. Since their tags include both entity and relation information, they do not need to maintain an entity-relation table. 4) MultiHeadSelection genre, including Bekoulis et al. (2018a) and Bekoulis et al. (2018b). They first recognized the entities, then formulated relation extraction as a multi-head selection problem: for each entity, they calculated a score between it and every other entity for a given relation. A combination of an entity pair and a relation whose score exceeds a threshold is kept as a triplet. 5) Generative genre, including Zeng et al. (2018b). They directly generated triplets one by one with a sequence-to-sequence model with copy mechanism (Gu et al., 2016; Vinyals et al., 2015). To generate a triplet, they first generated the relation, then copied the first entity and the second entity from the source sentence.
However, none of them have considered the extraction order of multiple triplets in a sentence. Given a sentence, the PseudoPipeline methods extract the relations of different entity pairs separately. Although they jointly train the entity model and the relation model, they actually ignore the influence between triplets. The TableFilling, NovelTagging and MultiHeadSelection methods extract the triplets in the word order of the sentence: they first deal with the first word, then the second one, and so on. The generative method could generate triplets in any order, but Zeng et al. (2018b) randomly chose an extraction order for each sentence. Sorting the triplets with global rules (e.g., alphabetical order) is straightforward, but one global sorting rule may not fit every sentence.
In this paper, we argue that the extraction order of triplets in a sentence is important. Take Figure 1 as an example. It is difficult to extract F_1 first because we do not know what "Arros negre" is in the first place. Extracting F_2 is more straightforward, as the key words "dish" and "region" in the sentence are helpful. F_2 can then help us to extract F_1, because now we are confident that "Arros negre" is some kind of food, so that "ingredient" is a suitable relation between "Arros negre" and "Cubanelle". From this intuitive example, we can see that the extracted triplets could influence the extraction of the remaining triplets.
To automatically learn the extraction order of multiple relational facts in a sentence, we propose a sequence-to-sequence model and apply reinforcement learning (RL) to it. We follow the Generative genre because such a model can extract triplets in various orders, which is convenient for exploring the influence of the triplet extraction order. Our model reads in a raw sentence and generates triplets one by one; thus, all triplets in a sentence can be extracted. To take the triplet extraction order into consideration, we cast the triplet generation process as an RL process. The sequence-to-sequence model is regarded as the RL policy, and the action is what we generate in each time step. We assume that a better generation order leads to more valid generated triplets, so the RL reward depends on the generated triplets: in general, the more triplets are correctly generated, the higher the reward. Unlike supervised learning with the negative log likelihood (NLL) loss, which forces the model to generate triplets in the order of the ground truth, reinforcement learning allows the model to generate triplets freely to achieve a higher reward.
The main contributions of this work are: • We discuss the triplet extraction order problem in the multiple relation extraction task. To our knowledge, this problem has never been addressed before.
• We apply reinforcement learning method on a sequence-to-sequence model to handle this problem.
• We conduct extensive experiments on two public datasets. Experimental results show that the proposed method outperforms strong baselines with 3.4% and 5.5% improvements, respectively.

Related Work
Given a sentence with two annotated entities (an entity pair), the relation classification task aims to identify the predefined relation between these two entities. Zeng et al. (2014) was among the first to apply neural networks to the relation classification task. They adopted a Convolutional Neural Network (CNN) to learn the sentence representation automatically. Subsequently, dos Santos et al. (2015); Xu et al. (2015a) also applied CNNs to extract relations. Xu et al. (2015b) utilized the shortest dependency path between two entities with an LSTM (Hochreiter and Schmidhuber, 1997) based recurrent neural network. Zhou et al. (2016) applied an attention mechanism to learn different weights for each word and used an LSTM to represent the sentence. These methods all assumed that the entity pair is given beforehand and that a sentence contains only two entities. To extract both entities and relations from a sentence, early works like Zelenko et al. (2003); Chan and Roth (2011) adopted pipeline methods. However, such pipeline methods neglect the relevance between entities and relations. Later works focused on joint models that extract entities and relations jointly. Yu and Lam (2010); Li and Ji (2014); Miwa and Bansal (2016) relied on NLP tools for feature engineering, which suffered from the error propagation problem. Miwa and Sasaki (2014); Gupta et al. (2016) applied neural networks to jointly extract entities and relations. They converted the relation extraction task into a table filling task. Zheng et al. (2017) took a step further and converted this task into a tagging task. They assigned a semantic tag to each word in the sentence and collected triplets according to the tag information. Bekoulis et al. (2018b,a) modeled the relation extraction task as a multi-head selection problem. However, these models cannot take the triplet extraction order into consideration. Sun et al. (2018) proposed a joint learning paradigm based on minimum risk training.
Their method ignores the influence between relational facts. Zeng et al. (2018b) proposed a sequence-to-sequence model with copy mechanism to handle the overlapping problem in multiple relation extraction. They randomly chose an extraction order for each sentence.
RL has attracted a lot of attention recently. It has been successfully applied to many games (Mnih et al., 2015; Silver et al., 2016). Narasimhan et al. (2015); He et al. (2016) applied RL to text-based games. Narasimhan et al. (2016) employed a deep Q-network to optimize a reward function that reflects extraction accuracy while penalizing extra effort. The policy gradient method has also been applied to model future reward in chatbot dialogue, with a reward designed to promote three conversational properties: informativity, coherence and ease of answering. Su et al. (2016) used on-line active reward learning for policy optimization in spoken dialogue systems, because user feedback is often unreliable and costly to collect. Yu et al. (2017) applied RL to overcome the limitations of Generative Adversarial Nets (GAN) in generating sequences of discrete tokens. Our work is related to these approaches, especially Yu et al. (2017), since we also apply RL to generate better sequences.
There are several works related to both relation extraction and RL, which are also related to our work. Zeng et al. (2018a); Feng et al. (2018); Qin et al. (2018) applied RL to the distantly supervised relation extraction task. Zeng et al. (2018a) turned bag relation prediction into an RL process. They assumed that the relation of a bag is determined by the relations of the sentences in the bag, and set the final reward to +1 or -1 by comparing the predicted bag relation with the gold relation. Feng et al. (2018) adopted the policy gradient method to select high-quality sentences from the bag. The selected sentences are fed to the relation classifier, and the relation classifier provides rewards to the instance selector. Similarly, Qin et al. (2018) explored a deep RL strategy to generate a false-positive indicator. Our work is different from them since we focus on the supervised relation extraction task.

Method
We first introduce our basic model and then describe how to apply RL to it. Similar to Zeng et al. (2018b), our neural model is a sequence-to-sequence model with copy mechanism. It reads in a raw sentence and generates triplets one by one. Instead of training the model with the NLL loss, we regard the triplet generation process as an RL process and optimize the model with the REINFORCE (Williams, 1992) algorithm. Therefore, we do not have to determine the triplet order of each sentence beforehand; we let the model generate triplets freely. We show the RL process in Figure 2.

Sequence-to-Sequence Model with Copy Mechanism
The sequence-to-sequence model with copy mechanism is a kind of CopyNet (Gu et al., 2016) or PointerNetwork (Vinyals et al., 2015). This model contains two components: an encoder and a decoder. The encoder is a bi-directional recurrent neural network, which encodes a variable-length sentence into a fixed-length vector. We denote the outputs of the encoder as O^E = [o^E_1, ..., o^E_n], where o^E_i denotes the encoder output for the i-th word and n is the sentence length.
The decoder is another recurrent neural network, which generates triplets one by one. NA-triplets are generated if the number of valid triplets is less than the maximum triplet number. 1 It takes three time steps to generate one triplet. That is, in time step t (t = 1, 2, 3, ..., T), if t%3 = 1, we predict the relation; if t%3 = 2, we copy the first entity; and if t%3 = 0, we copy the second entity. T is the maximum decoding time step. Note that T is always divisible by 3.
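As a minimal sketch (our illustration, not the authors' code), the decoded output sequence can be grouped into triplets every three steps:

```python
def group_into_triplets(outputs):
    """Group a decoded output sequence into (relation, entity1, entity2) triplets.

    The decoder emits a relation, then copies two entities, so every
    three consecutive time steps form one triplet and the maximum
    decoding step T must be divisible by 3.
    """
    assert len(outputs) % 3 == 0, "T must be divisible by 3"
    return [tuple(outputs[i:i + 3]) for i in range(0, len(outputs), 3)]

# With T = 6, the decoder emits two triplets (placeholder tokens):
print(group_into_triplets(["r1", "e1", "e2", "r2", "e3", "e4"]))
# [('r1', 'e1', 'e2'), ('r2', 'e3', 'e4')]
```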
Suppose there are m predefined valid relations. In time step t (t = 1, 4, 7, ...), we calculate a confidence score q^r_t for each valid relation. To take NA-triplets into consideration, we also calculate a confidence score q^NA_t for the NA relation, where W^NA_t and b^NA_t are the parameters in time step t. Then we concatenate q^r_t and q^NA_t and perform softmax to obtain the probability distribution over relations. To copy the first entity in time step t (t = 2, 5, 8, ...), we calculate a confidence score for each word in the source sentence, where q^e_ti is the confidence score of the i-th word and w^e_t is the weight vector in time step t. Similarly, we also calculate a confidence score for the NA entity as in Eq. 2. We concatenate them and perform softmax to obtain the probability distribution. Copying the second entity in time step t (t = 3, 6, 9, ...) is almost the same as copying the first entity; the only difference is that we also apply a mask (Zeng et al., 2018b) to prevent the two copied entities from being the same.

Figure 2: The RL process. The model reads in a raw sentence and generates triplets. Then, a reward (e.g., [1, 1, 1, 0, 0, 0]) is assigned to each time step based on the generated triplets. Lastly, the rewards are used to optimize the model.
Our model is similar to the OneDecoder and MultiDecoder models in Zeng et al. (2018b). Compared with the OneDecoder model, our model uses different linear transformation parameters in different decoding time steps. Compared with the MultiDecoder model, our model uses only one decoder cell to decode all triplets. Our model does not use the attention mechanism, because we found that the attention mechanism makes no difference to the results.

Reinforcement Learning Process
We regard the triplet generation process as an RL process. The loop in Figure 2 represents an RL episode. In each episode, the model reads in the raw sentence and generates an output sequence. Then we obtain triplets from the output sequence and calculate rewards based on them. Finally, we optimize the model with the REINFORCE algorithm.

State
We use s_t to denote the state of sentence x in decoding time step t. The state s_t contains the already generated tokens ŷ_<t, the information of the source sentence x and the model parameters θ.

Action
The action is what we predict (or copy) in each time step. In time step t where t%3 = 1, the model (policy) is required to determine the relation of the triplet; in time step t where t%3 = 2 or t%3 = 0, the model is required to determine the first or second entity, which is copied from the source sentence. Therefore, the action space A varies with the time step t:

A = R if t%3 = 1; A = P if t%3 = 2 or t%3 = 0,

where R is the set of predefined relations and P is the set of positions in the source sentence. We denote the action sequence of the source sentence as a = [a_1, ..., a_T].
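The time-step-dependent action space can be sketched as follows (the function name and data shapes are our assumptions):

```python
def action_space(t, relations, positions):
    """Return the action space A at decoding time step t.

    A is the set R of predefined relations when t % 3 == 1 (relation
    prediction), and the set P of source-sentence positions otherwise
    (entity copying).
    """
    return relations if t % 3 == 1 else positions

R = ["president", "ingredient"]    # predefined relations (examples)
P = list(range(8))                 # positions in an 8-word sentence
assert action_space(1, R, P) == R  # step 1: predict the relation
assert action_space(2, R, P) == P  # step 2: copy the first entity
assert action_space(3, R, P) == P  # step 3: copy the second entity
```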
Algorithm 1: The reward assignment procedure.

Reward
The reward guides the training and is thus critical to RL. However, we cannot directly assign a reward to each step during generation, since we do not know whether each chosen action is good before the generation is finished. Recall that we obtain one triplet every three steps. Once we obtain a triplet, we can compare it with the gold triplets and determine whether it is good. A well-generated triplet is one that is identical to one of the gold triplets and different from every already generated triplet.
When we obtain a good triplet after three steps, we assign reward 1 to each of these three steps; otherwise, we assign reward 0. After generating the valid triplets, we may need to generate NA-triplets. We assign reward 0.5 to each of the three steps if we correctly generate an NA-triplet, and reward 0 otherwise. We show the details of the reward assignment in Algorithm 1. 2
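The reward assignment can be sketched as follows. This is our reading of the scheme above, not the authors' code: in particular, we interpret a "correctly generated" NA-triplet as one emitted after all gold triplets are covered, and we use a placeholder tuple for NA-triplets.

```python
NA = ("NA", "NA", "NA")  # placeholder NA-triplet (representation is ours)

def assign_rewards(generated, gold):
    """Assign a per-time-step reward to a generated triplet sequence.

    `generated` is the list of (relation, entity1, entity2) triplets
    decoded in order; `gold` is the set of ground-truth triplets.
    Each triplet spans three decoding steps, so its reward is repeated
    three times: 1 for a gold triplet not generated before, 0.5 for a
    correctly emitted NA-triplet, and 0 otherwise.
    """
    rewards, seen = [], set()
    for trip in generated:
        if trip == NA:
            r = 0.5 if seen == set(gold) else 0.0
        elif trip in gold and trip not in seen:
            r = 1.0
            seen.add(trip)
        else:
            r = 0.0
        rewards.extend([r, r, r])  # one reward per decoding step
    return rewards

gold = {("ingredient", "Arros negre", "Cubanelle")}
generated = [("ingredient", "Arros negre", "Cubanelle"), NA]
print(assign_rewards(generated, gold))
# [1.0, 1.0, 1.0, 0.5, 0.5, 0.5]
```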

Training
The model can be trained with either a supervised learning loss or a reinforcement learning loss. However, the supervised learning loss forces the model to generate triplets in the order of the ground truth, while reinforcement learning allows the model to generate triplets freely.

2 Determining the reward in RL is difficult. We tried several different reward assignments, but only this one worked.

NLL Loss
Training the model with the NLL loss requires a predefined ground truth sequence for each sentence. Suppose T is the maximum time step of the decoder; we denote the ground truth sequence as [y_1, ..., y_t, ..., y_T]. Then the NLL loss for sentence x can be defined as:

L_NLL(θ) = - Σ_{t=1}^{T} log p(y_t | ŷ_<t, x; θ),

where ŷ_<t is the already generated tokens, p(·|·) is the conditional probability, and θ is the parameters of the entire model.
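The NLL objective can be sketched numerically (a minimal illustration; in practice the per-step probabilities come from the decoder softmax):

```python
import math

def nll_loss(step_probs):
    """Negative log-likelihood over a decoded sequence.

    `step_probs` holds p(y_t | y_<t, x; theta) for each time step t,
    i.e. the model probability assigned to the ground-truth token.
    A sketch of the loss, not the authors' implementation.
    """
    return -sum(math.log(p) for p in step_probs)

# A confident model (high probabilities on the gold tokens) has lower loss:
assert nll_loss([0.9, 0.9, 0.9]) < nll_loss([0.5, 0.5, 0.5])
```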

RL Loss
Training the model with reinforcement learning only requires the ground truth triplets for each sentence. The RL loss for sentence x is:

L_RL(θ) = - Σ_{t=1}^{T} r_t log p(ŷ_t | ŷ_<t, x; θ),

where ŷ_t is the sampled action and r_t is the reward in time step t.

Datasets

For the NYT dataset, we use the pre-processed version from Zeng et al. (2018b), which contains 56195 sentences in the train set, 5000 sentences in the validation set and 5000 sentences in the test set. In the train set, 36868 sentences contain one triplet and 19327 sentences contain multiple triplets. In the test set, the numbers are 3244 and 1756, respectively. There are 24 relations in total.
The WebNLG dataset is proposed by Gardent et al. (2017). This dataset was originally created for the Natural Language Generation (NLG) task: given a group of triplets, annotators are asked to write a sentence that contains the information of all triplets in the group. We use the dataset pre-processed by Zeng et al. (2018b); the train set contains 5019 sentences, the test set contains 703 sentences and the validation set contains 500 sentences. In the train set, 1596 sentences contain one triplet and 3423 sentences contain multiple triplets. In the test set, the numbers are 266 and 437, respectively. There are 246 different relations.

Settings
Zeng et al. (2018b) only used an LSTM as the model cell. In this paper, we report the results of both LSTM and GRU (Cho et al., 2014) cells. We follow most of the settings from Zeng et al. (2018b). The cell unit number is set to 1000; the embedding dimension is set to 100; the batch size is 100; the maximum time step T is 15, that is, we extract 5 triplets for each sentence. We use Adam (Kingma and Ba, 2015) to optimize the parameters and stop training when we find the best result on the validation set. For NLL training, the learning rate on both datasets is 0.001. For RL training, we first pretrain the model with NLL training (the pretrained model achieves 80%-90% of the best NLL training performance), then train the model with RL. The RL learning rate is 0.0005.

Evaluation Metrics
We follow the evaluation metrics in Zeng et al. (2018b). Our model can only copy one word for each entity, so we use the last word of each entity to represent it. A triplet is regarded as correct when its relation, first entity and second entity are all correct. For example, suppose the gold triplet is <Barack Obama, president, USA>; then <Obama, president, USA> is regarded as correct, while <Obama, locate, USA> and <Barack, president, USA> are not. A triplet is regarded as an NA-triplet if and only if its relation is the NA relation and it has an NA entity pair. Predicted NA-triplets are excluded. We use the standard micro Precision, Recall and F1 score to evaluate the results.
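The evaluation described above can be sketched as follows (a simplified set-based version; the helper names are ours):

```python
def last_word(entity):
    """Represent an entity by its last word, as in the evaluation above."""
    return entity.split()[-1]

def micro_prf(pred, gold, na_relation="NA"):
    """Micro Precision/Recall/F1 over (entity1, relation, entity2) triplets.

    Entities are compared by their last word only, and predicted
    NA-triplets are excluded. Treating the triplets as sets is our
    simplification.
    """
    norm = lambda trips: {(last_word(e1), r, last_word(e2))
                          for e1, r, e2 in trips if r != na_relation}
    p, g = norm(pred), norm(gold)
    correct = len(p & g)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = [("Barack Obama", "president", "USA")]
pred = [("Obama", "president", "USA"), ("Obama", "locate", "USA")]
# <Obama, president, USA> matches by last word; <Obama, locate, USA> does not,
# so precision is 0.5 and recall is 1.0 here.
```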

Results of Different Extraction Order
To find out whether the triplet extraction order of a sentence makes a difference in the multiple relation extraction task, we conduct extensive experiments on both the NYT and WebNLG datasets. We show the results of different extraction orders for different models with the LSTM cell in Table 1. The results of the models with the GRU cell are shown in Appendix B. We box the best results of each model, and the bold values are the best results on each dataset.
CNN denotes the baseline with a CNN classifier. We use the NLTK toolkit 3 to recognize the entities first. Then we combine every two entities as an entity pair; every two entities lead to two different entity pairs. For each entity pair, we apply a CNN classifier (Zeng et al., 2014) to determine the relation. We leave the details of this model to Appendix A. ONE and MULTI denote the OneDecoder and MultiDecoder models in Zeng et al. (2018b). 4 NLL means the model is trained with the NLL loss, which requires a predefined ground truth sequence for each sentence. For a sentence with N triplets, there are N! (the factorial of N) possible extraction orders, which lead to N! valid sequences. Shuffle means we randomly select one valid sequence as the ground truth sequence in every training epoch. FixUnsort means we randomly select one valid sequence before training and use it as the ground truth sequence during training; this strategy is used in Zeng et al. (2018b). Alphabetical means we sort the triplets of a sentence in alphabetical order and build the ground truth sequence based on the sorted triplets. Frequency means we sort the triplets of a sentence based on the relation frequency, counted from the training set. RL means the model is trained with reinforcement learning. On the NYT dataset, we use the Alphabetical strategy to pretrain the model, and on the WebNLG dataset, we pretrain the model with the Frequency strategy.
From Table 1, we can observe that: (a) The CNN baseline does not perform well because this model neglects the influence between triplets. (b) Compared with the FixUnsort strategy, simply changing the ground truth sequence in different training epochs (the Shuffle strategy) is also not good. The performance of OneDecoder drops from 0.566 to 0.552 on the NYT dataset and from 0.305 to 0.283 on the WebNLG dataset. (c) In both datasets and for all models trained with the NLL loss, sorting the triplets in some order (Alphabetical or Frequency) leads to better results than the FixUnsort strategy.

Order Comparison

To analyze the extraction orders the models learn, we compare the generated triplet sequences on the WebNLG dataset. The generated triplet sequence (excluding NA-triplets) of our model that is pretrained with the FixUnsort strategy and then trained with RL is denoted as FURL. Similarly, the triplet sequence of our model that is pretrained with the Frequency strategy and then trained with RL is denoted as FreqRL. And the triplet sequence of our model that is trained with the FixUnsort strategy is denoted as FUNLL.
Suppose A = [F_a, F_b, F_c] is the generated triplet sequence of FURL for sentence x, and B = [F_a, F_c] is the generated triplet sequence of FUNLL for the same sentence x. F_a, as well as F_b and F_c, is a triplet. The first triplet of sequence A is F_a, which is the same as the first triplet of sequence B, but the second triplet of A is different from that of B. Therefore, only 1 triplet is in the same position for A and B. The triplet number is the maximum triplet count of A and B, which is 3 in this example. We calculate the order comparison of sentence x as 1/3 = 0.333. The order comparison of FUNLL and FURL (denoted as FUNLL-FURL) is the mean value over all sentences.
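The order comparison metric can be sketched as follows (our implementation of the description above, with NA-triplets assumed to be excluded beforehand):

```python
def order_comparison(seq_a, seq_b):
    """Fraction of positions at which two triplet sequences agree.

    The denominator is the length of the longer sequence; positions
    beyond the shorter sequence count as disagreements.
    """
    if not seq_a and not seq_b:
        return 1.0
    same = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return same / max(len(seq_a), len(seq_b))

# The example above: A = [Fa, Fb, Fc], B = [Fa, Fc] agree only at position 1.
assert round(order_comparison(["Fa", "Fb", "Fc"], ["Fa", "Fc"]), 3) == 0.333
```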
We show the order comparison results of our model with the LSTM and GRU cells in Table 2. As we can see, although the FURL model is pretrained by FUNLL, FURL is more similar to FreqRL (0.446 and 0.435) than to FUNLL (0.326 and 0.390).
This experiment verifies that after RL training, the models tend to generate triplets in a similar order, regardless of how they were pretrained.

Multiple Relation Extraction
To verify the ability to extract multiple relational facts, we conduct an experiment on the NYT dataset with our LSTM-cell model. We show the results in Figure 3. The left part of the figure shows the performance on sentences with one triplet; the right part shows the performance on sentences with multiple triplets.
As we can see, when a sentence contains only one triplet, our model trained with RL achieves comparable performance with the strong baselines. When a sentence contains multiple triplets, our model trained with RL outperforms all baselines significantly. By training with RL, our model can extract triplets more precisely. Although the recall value is slightly lower than that of NLL training with the Frequency strategy, it exceeds the other baselines significantly. These observations demonstrate that RL training is effective for the multiple relation extraction task.

Weakness
Although we outperform all strong baselines by training the model with RL, there are still some weaknesses in our method.
The first weakness is the decrease in recall. Table 1 shows that NLL training with the Alphabetical or Frequency strategy achieves the highest recall in most cases, while training the model with RL achieves the highest precision but relatively low recall. This phenomenon indicates that the model trained with RL generates relatively fewer triplets. Although we can extract triplets more accurately, this is still a weakness of our method, since we aim to extract all triplets from a sentence.
The second weakness is that our model can only copy one word for each entity. Following Zeng et al. (2018b), we only copy the last word of an entity. But in reality, most entities contain more than one word. In the future, we will consider how to extract the complete entity. For example, we could add BIO tag prediction in the encoder and train the BIO loss together with the current loss, so that we can recognize the complete entity with the help of the BIO tags. Alternatively, we could take two steps to generate one entity: one step for the head word and the other for the tail word.

Conclusions
In this paper, we discuss the triplet extraction order problem in the multiple relation extraction task. We propose a sequence-to-sequence model with reinforcement learning to take the extraction order into consideration. Extensive experiments on the NYT and WebNLG datasets verify that the proposed method is effective in handling this problem.

Acknowledgments
This work is supported by the National Natural Science Foundation of China (No.61533018, No.61702512) and the independent research project of National Laboratory of Pattern Recognition.

A The Details of the CNN Baseline
In this section, we describe the details of the CNN baseline. This baseline is a pipeline method. For a sentence, we use the NLTK toolkit to recognize the entities first. Then, we combine every two entities as an entity pair and use a CNN relation classifier to predict their relation.
For example, suppose we recognize 3 entities in sentence s, denoted as e_1, e_2, e_3. There are 6 different entity pairs (recall that <e_1, e_2> and <e_2, e_1> are different).
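This pairing can be sketched with Python's itertools (a minimal illustration; ordered pairs, so direction matters):

```python
from itertools import permutations

entities = ["e1", "e2", "e3"]  # entities recognized in sentence s
# Ordered pairs: <e1, e2> and <e2, e1> count as different entity pairs.
pairs = list(permutations(entities, 2))
print(len(pairs))  # 6, i.e. n * (n - 1) ordered pairs for n entities
```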
The CNN classifier is basically the same as that of Zeng et al. (2014). Each word is turned into an embedding which includes its word embedding and position embedding. After the convolution layer, we apply a max-pooling layer. Then we apply a two-layer softmax classifier to obtain the final results. We train the model with the NLL loss.
Specifically, the word embedding dimension is 100, the position embedding dimension is 5, and we use 128 filters with filter size 3. The hidden layer size of the softmax classifier is 100, and we use tanh as the activation function. We optimize the model with the Adam optimizer (Kingma and Ba, 2015).
During evaluation, if an entity pair is classified into the NA relation, we exclude the triplet; otherwise, the triplet is regarded as a predicted triplet. If a predicted triplet is the same as one of the gold triplets, it is regarded as correct. To be fair, when comparing the entities in the triplets, we only compare the last word of each entity: as long as the last word of the extracted entity is the same as that of the gold one, we regard it as correct.

B Results of GRU Cell
We show the results of different extraction order of models with GRU cell in Table 3.