Extracting Entities and Relations with Joint Minimum Risk Training

We investigate the task of joint entity and relation extraction. Unlike prior efforts, we propose a new lightweight joint learning paradigm based on minimum risk training (MRT). Specifically, our algorithm optimizes a global loss function, which is a flexible and effective way to explore interactions between the entity model and the relation model. We implement a strong and simple neural network on which the MRT is executed. Experimental results on the benchmark ACE05 and NYT datasets show that our model achieves state-of-the-art joint extraction performance.


Introduction
Detecting entities and relations is usually the first step towards extracting structured knowledge from plain text. The goal is to identify text spans representing typed objects (entities) and semantic relations among those text spans (relations). For example, in the following sentence, "Associated Press" is an organization entity (ORG), "writer" is a person entity (PER), and the two entities have an affiliation relation (ORG-AFF).
Two types of models have been applied to the extraction task: the pipeline model and the joint model. In the pipeline setting, the task is broken down into independently learned components (an entity model and a relation model). Despite its flexibility, the pipeline ignores interactions between the two models. For example, the entity model does not look at relation annotations, which are useful for identifying entities (e.g., if an ORG-AFF relation exists, the entity model can narrow the types of its arguments down to ORG and PER). The joint setting, on the other hand, extracts entities and relations in a unified model, which can exploit shared information and alleviate error propagation between models. Here we focus on joint models.
One simple joint learning paradigm is parameter sharing (Miwa and Bansal, 2016; Katiyar and Cardie, 2017). Typically, instead of training two independent models, the entity and relation models share some input features or internal hidden states. The advantage is that no additional constraint is imposed on the two sub-models. But the connections among sub-models are still not fully explored, because the sub-model decoders remain independent. For example, to get signals from relation annotations, the entity model must wait for the relation model to update the shared parameters. To further exploit the interaction between decoders, more complex joint decoding algorithms (e.g., simultaneously decoding entities and relations in beam search) have been carefully studied (Li and Ji, 2014; Katiyar and Cardie, 2016; Zhang et al., 2017; Zheng et al., 2017). In this paradigm, it is important (and hard) to strike a good balance between the exactness of the joint decoding algorithm and the capacities of the individual sub-models.
In this work, we propose a joint minimum risk training (MRT) (Och, 2003; Smith and Eisner, 2006) method for the entity and relation extraction task. It provides a lightweight way to strengthen connections between the entity model and the relation model while keeping their capacities unaffected. Given an input x and a loss function ∆(ŷ, y) (measuring the difference between a model output ŷ and the ground truth y), MRT seeks a posterior P(ŷ|x) that minimizes the expected loss E_{ŷ∼P(ŷ|x)} ∆(ŷ, y). Compared with prior joint decoding algorithms, the MRT-based algorithm is simple and can be applied to a broad range of entity relation extraction models without changing the original sub-models and decoders (Figure 1).
One advantage of the MRT-based method is that it can explicitly optimize a global sentence-level loss (e.g., F1 score) rather than local token-level losses. Therefore, it may capture more sentence-level information at training time and better match the evaluation metrics at test time. Furthermore, besides handcrafted losses, we also try to learn a loss function directly from data during the joint MRT process. The automatically learned loss helps MRT calibrate its risk estimation with knowledge from the data distribution. On the other hand, compared with previous single-task MRT, the joint MRT algorithm integrates messages from different sub-models, which is the key step for enhancing decoder interactions in joint learning. As a result, the training of the entity model can directly acknowledge the loss of the relation model (without waiting for shared parameters), and vice versa.
We couple the proposed joint MRT with a strong neural-network-based model which uses recurrent neural networks (RNN) in the entity model and convolutional neural networks (CNN) in the relation model. On the benchmark ACE05 and NYT datasets, we show that the new RNN+CNN structure outperforms previous neural-network-based models. After adding the joint MRT, our model achieves state-of-the-art performance.
To summarize, our main contributions are:
1. proposing a new joint learning paradigm based on minimum risk training for the joint entity relation extraction task;
2. implementing a strong and simple neural-network-based entity relation extraction model which carries the proposed MRT algorithm.

Related Work
In many pipelined entity relation extraction systems, one first learns an entity model, then learns a relation model based on entities generated by the entity model (Miwa et al., 2009; Chan and Roth, 2011; Lin et al., 2016). Such systems are flexible in incorporating different data sources and learning algorithms. However, they may also suffer from error propagation and data inefficiency. To tackle these problems, many recent studies try to develop joint extraction algorithms. Parameter sharing is a basic strategy in joint learning paradigms. For example, in (Miwa and Bansal, 2016), the entity model is a sentence-level RNN, and the relation model is a dependency tree path RNN which takes hidden states of the entity model as features (i.e., the shared parameters). Our basic extraction model is similar to theirs but with a CNN-based relation model. Similarly, Katiyar and Cardie (2017) build a simplified relation model on the entity RNN using the attention mechanism.
To further explore interactions between the entity decoder and the relation decoder, some joint decoding algorithms have been studied. For example, Katiyar and Cardie (2016) propose a CRF-based model which conducts joint decoding with augmented transition matrices. Zheng et al. (2017) propose to directly encode relations in the sequential labelling tag set. Both are exact decoding algorithms, but they require adding constraints to the relation model (e.g., Zheng et al. (2017) cannot handle entities which appear in multiple relations). On the other hand, Li and Ji (2014) develop a joint decoding algorithm based on beam search, and Zhang et al. (2017) study a globally normalized joint model. These retain the capacities of the sub-models, but their decoding algorithms are inexact. Here, we introduce MRT to the task, which is a more lightweight form of joint learning.
Minimum risk training is a learning framework which handles models with arbitrary discrepancy metrics (i.e., losses of a model output w.r.t. the true answer) (Och, 2003; Smith and Eisner, 2006; Gimpel and Smith, 2010). It has been successfully applied to many NLP tasks. Recent examples include (He and Deng, 2012) and follow-up work applying MRT to (neural) machine translation, (Xu et al., 2016), which develops a shift-reduce CCG parser that directly optimizes F1, and (Ayana et al., 2016), which uses an MRT-based model for summarization. We note that most previous applications of MRT focus on a single task, while joint entity relation extraction consists of two sub-tasks. Investigating MRT in joint learning scenarios is the main topic of this work.
Finally, the sampling algorithm of solving MRT is similar to the policy gradient algorithm in reinforcement learning (RL) (Sutton and Barto, 1998). Some recent NLP applications which share the key idea of MRT but are described with RL language also show promising results (e.g., dialog systems (Li et al., 2016), machine translation (Nguyen et al., 2017)). The idea of learning loss functions from data is similar to inverse reinforcement learning (Abbeel and Ng, 2004;Ratliff et al., 2006).

The Approach
We define the joint entity and relation extraction task following the setting of (Miwa and Bansal, 2016). Given an input sentence s = w_1, . . . , w_{|s|} (where w_i is a word), the task is to extract a set of entities E and a set of relations R. An entity e ∈ E is a sequence of words labelled with an entity type (e.g., person (PER), organization (ORG)). Let T_e be the set of possible entity types. A relation r is a triple (e_1, e_2, l), where e_1 and e_2 are two entities and l is a relation type describing the semantic relation between e_1 and e_2 (e.g., the organization affiliation relation (ORG-AFF)). Let T_r be the set of possible relation types.
In our joint extraction method (Figure 2), we treat entity detection as a sequence labelling task (Section 3.1) and relation detection as a classification task (Section 3.2). The models for the two tasks share parameters and are trained jointly. Departing from previous joint learning algorithms (Miwa and Bansal, 2016; Katiyar and Cardie, 2017; Zhang et al., 2017), we introduce minimum risk training to the joint extraction model. It optimizes a global loss function and bridges the discrepancy between training and testing (Section 3.3).

Entity Detection
To represent entities in s, we assign a tag t_i to each word w_i following the BILOU tagging scheme: t_i takes a value in {B-*, I-*, L-*, O, U-*}, where B, I, L and O denote the begin, inside, end and outside of an entity, U denotes a single-word entity, and * ∈ T_e ranges over the entity types. For example, for a person (PER) entity "Patrick McDowell", we assign B-PER to "Patrick" and L-PER to "McDowell". Given an input sentence s, the entity model predicts the tag sequence t̂ = t̂_1, t̂_2, . . . , t̂_{|s|} by learning from the true tags t = t_1, t_2, . . . , t_{|s|}.
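As a concrete illustration of the tagging scheme, the following sketch (the function name and span format are our own, not from the paper) converts typed entity spans into BILOU tags:

```python
# Sketch: encode typed entity spans with the BILOU scheme described above.
# Spans are (start, end_inclusive, type) tuples over word positions.

def bilou_encode(num_words, entities):
    """Return one BILOU tag per word for non-overlapping entity spans."""
    tags = ["O"] * num_words
    for start, end, etype in entities:
        if start == end:                      # single-word entity
            tags[start] = "U-" + etype
        else:
            tags[start] = "B-" + etype        # begin
            for i in range(start + 1, end):   # inside
                tags[i] = "I-" + etype
            tags[end] = "L-" + etype          # last word of the entity
    return tags
```

For the "Patrick McDowell" example, `bilou_encode(2, [(0, 1, "PER")])` yields `["B-PER", "L-PER"]`.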
We use a bidirectional long short-term memory (bi-LSTM) network (Hochreiter and Schmidhuber, 1997) to solve the sequence labelling task. At each sentence position i, a forward LSTM chain computes a hidden state vector h→_i by recursively collecting information from the beginning of s up to position i. Similarly, a backward LSTM chain computes h←_i by collecting information from the end of s down to position i.
The word representation x_i of w_i has two parts: a word embedding of w_i, and a character-based representation c_i obtained by running a convolutional neural network over the character sequence of w_i. To predict the tag t̂_i, we combine the forward and backward hidden vectors into h_i = h→_i ⊕ h←_i, and apply a softmax function on h_i to get the posterior of t̂_i:

P_ent(t̂_i | s; θ_E) = softmax(W_E h_i + b_E),    (1)

where θ_E denotes the parameters of the entity model. Given an input sentence s and its ground-truth tag sequence t, the training objective is to minimize the negative log-likelihood

L_ent = − Σ_{i=1}^{|s|} log P_ent(t_i | s; θ_E).
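To make the tag posterior concrete, here is a minimal sketch in plain Python, with a single linear layer standing in for the model's output layer (this simplification is ours; the paper's layer sizes and parameters are not specified here):

```python
import math

def tag_posterior(h, W, b):
    """Posterior over tags: softmax of the scores W h + b.

    h: combined hidden vector (list of floats),
    W: one weight row per tag, b: one bias per tag.
    """
    scores = [sum(w_kj * h_j for w_kj, h_j in zip(W_k, h)) + b_k
              for W_k, b_k in zip(W, b)]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```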

Relation Detection
Given a set of detected entities Ê (obtained from the entity tag sequence t̂), we consider all entity pairs in Ê as candidate relations. The task of relation detection is to predict a relation type l ∈ T_r for each pair and output a relation set R̂. To build the relation model, we extract two types of features: features on the words in e_1 and e_2, and features on the contexts of the entity pair (e_1, e_2).
To extract features on the words in e_1 and e_2, we use two convolutional neural networks. Taking e_1 as an example, for each word w_i in e_1, we first collect w_i's bi-LSTM hidden vector h_i from the entity model. Then, we concatenate h_i with a one-hot entity tag representation v_i of t̂_i. We build a feature vector f_{e_1} for e_1 by running a CNN (a single convolution layer with a max-pooling layer) over the vectors h_i ⊕ v_i. Similarly, we build f_{e_2} for e_2 with another CNN.
For context features of the entity pair (e_1, e_2), we build three feature vectors by looking at the words between e_1 and e_2 (f_middle), the words on the left of the pair (f_left), and the words on the right of the pair (f_right). For f_middle, we run a CNN on the words between e_1 and e_2, as in the case of f_{e_1} and f_{e_2}. For f_left and f_right, we use the "LSTM-Minus" method of (Wang and Chang, 2016; Zhang et al., 2017), which represents a text span by the difference of the LSTM hidden states at its boundaries. Assume that the left context of (e_1, e_2) spans sentence positions 0 to i; then f_left = h→_i − h→_{−1} (where h→_{−1} is a zero vector). Similarly, if the right context of (e_1, e_2) spans positions j to |s| − 1, then f_right = h←_j − h←_{|s|}. We also use a one-hot feature f_dist to describe the distance between e_1 and e_2 in the sentence.
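The span-difference idea behind LSTM-Minus can be sketched as follows (pure Python, with lists standing in for hidden-state vectors; the exact indexing convention is our assumption):

```python
# Sketch of the "LSTM-Minus" span feature: a span is represented by the
# difference of the running LSTM hidden states at its two ends.

def lstm_minus(hidden_states, start, end):
    """Represent positions start..end as h[end] - h[start-1] (zero before 0)."""
    dim = len(hidden_states[0])
    h_end = hidden_states[end]
    h_before = hidden_states[start - 1] if start > 0 else [0.0] * dim
    return [a - b for a, b in zip(h_end, h_before)]
```

Since each LSTM state accumulates information over a prefix, the subtraction isolates the contribution of the span itself without re-running the LSTM per span.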
Finally, f_{e_1}, f_{e_2}, f_middle, f_left, f_right and f_dist are concatenated into a single vector f_{e_1,e_2}. To get the posterior of the relation type l̂, we apply a multilayer perceptron (MLP) with one hidden layer on f_{e_1,e_2}:

P_rel(l̂ | s, e_1, e_2; θ_R) = softmax(MLP(f_{e_1,e_2})),    (2)

where θ_R denotes the parameters of the relation model (shared parameters with the entity model are omitted).
Given an input sentence s, the training objective is to minimize the negative log-likelihood

L_rel = − Σ_{(e_1,e_2)} log P_rel(l | s, e_1, e_2; θ_R),

where the true label l of a candidate entity pair (e_1, e_2) can be read from the gold annotations.

Joint Minimum Risk Training
To jointly learn the entity model and the relation model, one common strategy is to optimize the combined objective function L = L_ent + L_rel, where joint learning is accomplished via the shared parameters. However, we argue that L optimizes a "local" loss, for two reasons: a) in both L_ent and L_rel, the loss functions only look at local parts; for example, the loss in L_ent is based on the correctness of local entity tags t_i rather than a global measurement (e.g., the F1 score of extracted entities); b) the entity model and the relation model are unaware of each other's losses; for example, the entity model must wait for the relation model to update the shared parameters rather than receiving direct supervision from the relation model's loss. Here we introduce the minimum risk training framework to the joint model. Compared with optimizing the local loss L, joint MRT optimizes a global loss and provides a tighter connection between the entity decoder and the relation decoder. To describe the algorithm, we first set up some notation.
Let y ≜ (E, R) contain the ground-truth entity tag sequence and relations, let ŷ ≜ (Ê, R̂) contain the outputs of the joint extraction model, and let Y(s) be the set of all possible outputs for the input sentence s (y, ŷ ∈ Y(s)). We define the joint probability

P(ŷ | s; θ) = P_ent(t̂ | s; θ_E) · P_rel(R̂ | s, Ê; θ_R),

where θ = θ_E ∪ θ_R is the joint model parameter set, and P_ent and P_rel are given in Equations 1 and 2.
The objective of MRT is to minimize the following expected loss (i.e., risk):

L_mrt = Σ_s E_{ŷ∼P(ŷ|s;θ)} [∆(ŷ, y)] = Σ_s Σ_{ŷ∈Y(s)} P(ŷ | s; θ) ∆(ŷ, y),    (3)

where ∆(ŷ, y) is an (arbitrary) loss function describing the difference between ŷ and y.
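In code, the risk for one sentence is just a probability-weighted sum of losses over candidate outputs. The sketch below makes the (unrealistic) assumption that the output space is small enough to enumerate; Section 3.3 later replaces enumeration with sampling:

```python
def expected_risk(candidates, gold, loss):
    """Expected loss E_{y_hat ~ P}[loss(y_hat, gold)].

    candidates: list of (output, probability) pairs covering the output space;
    loss: a function loss(output, gold) -> float.
    """
    return sum(p * loss(y_hat, gold) for y_hat, p in candidates)
```

With a zero-one loss, candidates [("a", 0.7), ("b", 0.3)] and gold "a", the risk is 0.3; minimizing it pushes probability mass toward low-loss outputs.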
In our model, the loss function ∆(ŷ, y) is the key factor for enhancing joint extraction performance. First, in ∆(ŷ, y), we consider sentence-level F1 scores of the entity and relation extraction results (denoted F_ent(Ê, E) and F_rel(R̂, R)). Specifically, we use 1 − F_ent(Ê, E) and 1 − F_rel(R̂, R) as the entity loss and the relation loss, respectively. On the one hand, F1 scores characterize the overall quality of the outputs and make the training objective consistent with the test-time evaluation metric. On the other hand, F1 scores cannot be decomposed into local predictions over Ê and R̂ like the log losses in L_ent and L_rel, so we need a different training algorithm.
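A minimal sketch of the sentence-level loss 1 − F1, treating outputs and gold annotations as sets of exact-match items (e.g., typed entity spans or relation triples; the set encoding is our illustration):

```python
# Sketch: sentence-level F1 over sets of exact-match items, and the
# corresponding MRT loss 1 - F1 described in the text.

def f1_score(predicted, gold):
    """F1 of two sets; returns 0.0 when either set is empty."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)               # exact matches
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def f1_loss(predicted, gold):
    """The sentence-level loss 1 - F1."""
    return 1.0 - f1_score(predicted, gold)
```

Note that `f1_loss` depends on the whole predicted set at once, which is exactly why it cannot be decomposed into per-token log losses.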
Second, unlike previous applications of MRT to single tasks (Xu et al., 2016), we have two sources of loss in joint extraction. By integrating the losses of the individual tasks in the learning algorithm, the entity model can forecast how plausible a candidate entity is according to the relation model, and the relation model can likewise know the confidence of the entity extraction results. Here, we define a global loss by adding the losses of the two models:

∆_{E+R}(ŷ, y) = (1 − F_ent(Ê, E)) + (1 − F_rel(R̂, R)).

To compare with ∆_{E+R}, we also try two alternatives of ∆(ŷ, y) in experiments, namely ∆_E(ŷ, y) = 1 − F_ent(Ê, E) and ∆_R(ŷ, y) = 1 − F_rel(R̂, R), each of which looks at only one model's loss.
Third, in addition to handcrafted loss functions, we further ask whether the joint MRT model can benefit from automatic "loss engineering". Specifically, let Γ(ŷ) be a loss function learned from the training set. We augment the ∆(ŷ, y) in the MRT objective with Γ(ŷ), and require the learning process to assign a smaller Γ value (with a margin m) to the ground-truth output y than to any other ŷ ∈ Y \ {y}:

min_{θ,Γ} L_mrt + Σ_{ŷ∈Y(s)\{y}} max(0, m + Γ(y) − Γ(ŷ)),    (5)

where L_mrt now uses the augmented loss ∆(ŷ, y) + Γ(ŷ).
Optimizing the expected loss is hard since the size of Y(s) is exponential. In practice, we approximate the expectation in Equation 3 by sampling a tractable subset Y′(s) of Y(s). Specifically, we first obtain an entity set E′ by sampling (without replacement) an entity tag sequence t′ from P_ent. [5] Then, based on the sampled entities, we obtain a relation set R′ by sampling l′ from P_rel for each entity pair. Algorithm 1 lists the pseudocode. [6] In experiments, we also try a variant of Algorithm 1 which only samples from the entity model and selects relation labels with the maximum posterior (i.e., does not sample relations).
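A rough Python rendering of the tag-sampling step (this is our own sketch, not the paper's Algorithm 1 verbatim; the ε-greedy mix follows footnote 5, and relation sampling would proceed analogously from P_rel):

```python
import random

def sample_from(dist, rng):
    """Draw one label from a dict label -> probability."""
    r, acc = rng.random(), 0.0
    for label, p in dist.items():
        acc += p
        if r <= acc:
            return label
    return label                               # guard against rounding error

def sample_tags(tag_posteriors, tagset, epsilon=0.1, rng=random):
    """Epsilon-greedy sampling of one tag sequence.

    tag_posteriors: per position, a dict tag -> probability from the entity
    model; with probability epsilon a tag is drawn uniformly instead.
    """
    tags = []
    for dist in tag_posteriors:
        if rng.random() < epsilon:             # explore uniformly
            tags.append(rng.choice(tagset))
        else:                                  # follow the entity model
            tags.append(sample_from(dist, rng))
    return tags
```

Repeating this K times yields the subset Y′(s) over which the expectation is approximated.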
With the sampled subset Y′(s), we consider a revised version of the original MRT objective:

L̃_mrt = Σ_s Σ_{ŷ∈Y′(s)} Q(ŷ | s; θ, α, µ) ∆(ŷ, y),  Q(ŷ | s; θ, α, µ) ∝ (P_ent(t̂ | s; θ_E)^µ · P_rel(R̂ | s, Ê; θ_R)^{1−µ})^α,    (6)

where Q is renormalized over Y′(s). [7] The hyper-parameter α controls the sharpness of the Q distribution (Och, 2003), and µ weights the importance of the entity model and the relation model in Q. Similarly, we can rewrite the objective in Equation 5 with Y′(s) and Q.
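The renormalized Q can be sketched as follows. The exact way µ mixes the two sub-model probabilities is our assumption (a geometric interpolation), chosen only to illustrate the roles of α and µ:

```python
def q_distribution(samples, alpha, mu):
    """Renormalized Q over a sampled subset.

    samples: list of (p_ent, p_rel) probabilities for each sampled output.
    Q is proportional to (p_ent**mu * p_rel**(1 - mu))**alpha, renormalized
    over the sampled subset (an assumed form consistent with the text).
    """
    weights = [(pe ** mu * pr ** (1.0 - mu)) ** alpha for pe, pr in samples]
    z = sum(weights)
    return [w / z for w in weights]
```

Raising α sharpens Q towards the most probable sample, while µ at its boundary values reduces Q to one sub-model's probability.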
Finally, we remark that if we view MRT as a fine-tuning step, it can be applied to any joint learning model built on a joint distribution P(ŷ | s; θ) (e.g., the globally normalized P in (Zhang et al., 2017)). We therefore consider MRT a flexible and lightweight framework for joint learning.

Training
To train the joint extraction model, we first pre-train the model with the objective L (i.e., minimize the local loss), then optimize the local loss and the global loss simultaneously with the objective L + L_mrt. This setting is slightly different from previous work, which only optimizes L_mrt in the second step. We find that keeping L makes training more stable in our experiments.

[5] To accelerate sampling, we borrow the idea of ε-greedy from reinforcement learning: with probability 0.9, we sample t′_i from P_ent, and with probability 0.1, we sample it uniformly.
[6] The time complexity is O(K|s|), the same as the beam search algorithm with beam size K (Zhang et al., 2017).
[7] Here we follow the MRT literature in applying the renormalization on Y′(s). An alternative formulation is the policy gradient framework, which sticks to the original probability.
When training with L in the pre-training step, we apply the scheduled sampling strategy (Bengio et al., 2015) in the entity model, as in (Miwa and Bansal, 2016). Models are regularized with dropout and trained using Adadelta (Zeiler, 2012). We give the full derivation of the gradient of Equation 6 in the supplementary material. [8] We select models using the development sets: within a fixed number of epochs, the model with the best relation extraction performance on the development set is picked for testing. [9]

Experiments
We evaluate the proposed model on two datasets. ACE05 is a standard corpus for the entity relation extraction task; it is labelled with 7 entity types and 6 relation types. We use the same split of ACE05 documents as previous work (351 training, 80 development, and 80 testing). [10] NYT (Riedel et al., 2010) is a larger corpus labelled with 3 entity types and 24 relation types. [11] Its training set has 353k relation triples generated by distant supervision, and it also provides 3880 manually labelled relation triples. Following (Ren et al., 2017; Zheng et al., 2017), we exclude the None relation label and randomly select 10% of the labelled data as the development set. We mainly discuss the results on ACE05, where many previous joint learning models are available for comparison.
We list detailed hyper-parameter settings in the supplementary material. Note that, except for µ, α and K, which are introduced by the joint MRT and selected on the development set, [12] we do not tune hyper-parameters extensively. For example, we use the same setting on both ACE05 and NYT rather than tuning parameters separately on each.
As in previous work, we evaluate performance using precision (P), recall (R) and F1 scores. Specifically, an output entity e is correct if its type and the region of its head are correct, and an output relation r is correct if its e_1, e_2 and l are all correct (i.e., "exact match").

[8] We remark that the MRT objective (Equation 6) is differentiable with respect to the model parameters; the non-decomposability of the F1 score does not make the model non-differentiable. In our implementation, the gradient is computed automatically using autograd tools. Please see the supplementary material for more details.
[9] We focus on the performance of end-to-end relation extraction, so we select models by their relation extraction results. It is also possible to consider the performance of both the entity model and the relation model; we leave the study of advanced model selection algorithms for future work.
[10] We use the dataset in https://github.com/tticoin/LSTM-ER, which is from (Miwa and Bansal, 2016).
[11] https://github.com/shanzhenren/CoType.
[12] The default setting is α = 10^{-4}, µ = 1.0, K = 3 in systems without the self-learned Γ loss, and α = 1, µ = 1.0, K = 2 in systems with the Γ loss.

Table 1: Results on the ACE05 test data. (Miwa and Bansal, 2016) and (Katiyar and Cardie, 2017) are joint training systems without joint decoding. (Li and Ji, 2014) and (Zhang et al., 2017) are joint decoding algorithms. NN is our neural network model without minimum risk training. MRT is minimum risk training with loss Γ (Equation 5). We omit pipeline methods, which underperform the joint models (see (Li and Ji, 2014) for details).

Results on ACE05
We first compare the proposed models with previous work (Table 1). In general, our plain neural network model (NN) is competitive, and after coupling it with MRT, it achieves non-negligible improvements over existing state-of-the-art systems (on both entity and relation extraction). [13] We make two detailed comparisons. Among systems which rely only on shared parameters ((Miwa and Bansal, 2016; Katiyar and Cardie, 2017) and NN), NN gives the best result (we give detailed results on different relation types in the supplementary material). One possible reason is that the "RNN+CNN" network structure was not fully explored in previous joint learning models. More importantly, it suggests that building powerful sub-models and utilizing shared parameters remain key problems of the task.
Compared with the best joint decoding system, which adopts global normalization in training (Zhang et al., 2017), MRT mainly improves the relation extraction results. We think the improvement may come from the sentence-level loss applied in MRT: both systems consider interactions between decoders, and both objectives are approximated by sampling, but MRT optimizes the F1 score while Zhang et al. (2017) optimize label accuracy. As for the joint decoding system of (Li and Ji, 2014), although it cannot beat recent neural-network-based models, it would be interesting to compare MRT with a feature-enriched version of their model in future work. Next, we evaluate the joint MRT with different loss functions and sampling methods.

[13] It is worth noting that our models do not access additional linguistic resources such as POS tags and dependency trees. We tried adding the syntactic features of (Zhang et al., 2017), but did not observe improvements.
As mentioned in Section 3.3, we have three options (∆_{E+R}, ∆_E, ∆_R) for ∆(ŷ, y) and a self-learned loss function Γ. The first five rows of Table 2 show their performance on the test data. We make three observations.
1. ∆_R and ∆_{E+R} have higher relation F1 scores than ∆_E and NN. Thus, adding the relation loss to ∆(ŷ, y) is helpful for relation extraction. We think that knowing the relation loss biases the entity model towards highlighting the entities appearing in relations, which provides a better candidate relation set for the relation extraction model.
2. ∆_E has the best entity extraction results, which implies that the sentence-level entity loss alone can benefit entity extraction. After adding the relation loss (∆_{E+R}), the entity performance slightly decreases. One reason might be that our model selection strategy only focuses on the relation part (footnote 9), so models with improved entity performance may not be selected.
3. The learned loss Γ helps to improve performance, but using Γ alone is not as effective as the handcrafted ∆ functions (which are tailored to the evaluation metrics). By combining both prior knowledge and information from the dataset, Γ + ∆_{E+R} achieves the best results.
Regarding the sampling method, we test a variant of Algorithm 1 which samples entities but not relations (the last five rows of Table 2). Compared with the default sampling algorithm, it has similar entity extraction performance, but its behaviour on relation extraction is different. Specifically, adding the entity loss to ∆(ŷ, y) (i.e., ∆_E, ∆_{E+R}) now affects relation results negatively. This may suggest that when only the output of entity extraction is explored, the entity loss can dominate the relation loss and trap the joint model into exploiting the entity model only. On the other hand, the performance of the self-learned loss Γ is less sensitive to the sampling method. We do not yet have a clear understanding of the relationship between sampling algorithms and loss functions, but the above results show that adding a data-related loss function can improve the robustness of MRT in practice.
Third, we present the influence of the MRT hyper-parameters with ∆_{E+R} on the development set in Figures 3 and 4 (other settings show similar results). We find that, for the parameters examined here, it is hard for the entity model and the relation model to agree with each other: parameter values achieving high relation performance usually get low entity performance, and vice versa. Thus, if we perform model selection by only looking at relation extraction results, the joint model may sacrifice entity extraction performance. For α and µ (Figure 3), we observe that on the ACE05 dataset the model prefers a small α (which means a sharper Q) and a µ at the boundary (i.e., Q is close to either the entity model or the relation model). Regarding the sample size K (Figure 4), we do not observe convergence of performance within a small range of K. Since the computation cost increases rapidly with the sample size (K = 5 is about 2x slower than K = 3 in our implementation), we stick to a small K.
Finally, due to lack of space, we provide more discussion of model configurations (including results for different entity pair distances, additional experiments on tuning hyper-parameters, etc.) and detailed error analyses on concrete examples in the supplementary material.

Results on NYT
We briefly list results on the NYT dataset in Table 3. The baseline methods are (Ren et al., 2017), which is based on a joint embedding of entities and relations, and (Zheng et al., 2017), which encodes relations directly in the tag set. Our model improves over the baseline results. In particular, compared with the joint tagging scheme of (Zheng et al., 2017), MRT adds no constraint to the relation extraction model and can explore the large NYT training set more effectively. At the same time, since the training set is automatically generated, the global losses observed by MRT are also noisy. As in recent work on bandit structured prediction (Kreutzer et al., 2017; Nguyen et al., 2017), the results here suggest that MRT can be a reasonable choice when the supervision of the joint learning is partial and noisy.

Conclusion
We introduced minimum risk training to the task of joint entity and relation extraction. We showed that, with a global loss function, MRT enhances the connection between the sub-models. Extensive experiments on benchmark datasets confirm the effectiveness of the joint MRT.

Table 3: Results on the NYT test data. To compare with (Ren et al., 2017), we give results under the "exact match" criterion as for ACE05. To compare with (Zheng et al., 2017), we give results which ignore the entity type in the justification of relations. We use α = 1, µ = 1, K = 2 and ∆_{E+R} + Γ.