End-to-End Neural Relation Extraction with Global Optimization

Neural networks have shown promising results for relation extraction. State-of-the-art models cast the task as an end-to-end problem, solved incrementally using a local classifier. Yet previous work using statistical models has demonstrated that global optimization can achieve better performance than local classification. We build a globally optimized neural model for end-to-end relation extraction, proposing novel LSTM features to better learn context representations. In addition, we present a novel method for integrating syntactic information that facilitates global learning, yet requires little background in syntactic grammars and is thus easy to extend. Experimental results show that our proposed model is highly effective, achieving the best performance on two standard benchmarks.

In recent years, there has been a surge of interest in end-to-end relation extraction, which jointly recognizes entities and relations from free-text inputs (Li and Ji, 2014; Miwa and Sasaki, 2014; Miwa and Bansal, 2016). End-to-end learning prevents the error propagation of pipeline approaches, and allows cross-task dependencies to be modeled explicitly for entity recognition. As a result, it gives better relation extraction accuracy than pipelines. Miwa and Bansal (2016) were among the first to use neural networks for end-to-end relation extraction, showing highly promising results. In particular, they used bidirectional LSTMs (Graves et al., 2013) to learn hidden word representations in sentential context, and further leveraged tree-structured LSTMs (Tai et al., 2015) to encode syntactic information, given the output of a parser. The resulting representations are then used to make local decisions for entity and relation extraction incrementally, leading to much improved results compared with the best statistical model (Li and Ji, 2014). This demonstrates the strength of neural representation learning for end-to-end relation extraction.
On the other hand, Miwa and Bansal (2016)'s model is trained locally, without considering structural correspondences between incremental decisions. This is unlike existing statistical methods, which utilize well-studied structured prediction techniques to address the problem (Li and Ji, 2014; Miwa and Sasaki, 2014). As is commonly understood, learning local decisions for structured prediction can lead to label bias (Lafferty et al., 2001), which prevents globally optimal structures from receiving optimal scores from the model. We address this potential issue by building a structural neural model for end-to-end relation extraction, following a recent line of work on globally optimized models for neural structured prediction (Zhou et al., 2015; Watanabe and Sumita, 2015; Andor et al., 2016; Wiseman and Rush, 2016).
In particular, we follow Miwa and Sasaki (2014), casting the task as an end-to-end table-filling problem. This is different from the action-based method of Li and Ji (2014), yet has been shown to be more flexible and accurate (Miwa and Sasaki, 2014). We take a different approach to representation learning, addressing two potential limitations of Miwa and Bansal (2016).
First, Miwa and Bansal (2016) rely on external syntactic parsers for obtaining syntactic information, which is crucial for relation extraction (Culotta and Sorensen, 2004; Zhou et al., 2005; Bunescu and Mooney, 2005; Qian et al., 2008). However, parsing errors can lead to encoding inaccuracies in the tree-LSTMs, thereby potentially hurting relation extraction. We take an alternative approach to integrating syntactic information, using the hidden LSTM layers of a biaffine attention parser (Dozat and Manning, 2016) to augment input representations. Pretrained for parsing, these hidden layers contain rich syntactic information about each word, yet do not explicitly represent parsing decisions, thereby avoiding issues caused by incorrect parses.
Our method is also free of any particular syntactic formalism, such as dependency grammar, constituent grammar or combinatory categorial grammar, requiring only hidden word representations that contain syntactic information. In contrast, the method of Miwa and Bansal (2016) must adopt tree-LSTM formulations that are specific to grammar formalisms, which can be structurally different (Tai et al., 2015).
Second, Miwa and Bansal (2016) did not explicitly learn representations of segments when predicting entity boundaries or making relation classification decisions, which is intuitively highly useful and has been investigated in several studies (Wang and Chang, 2016). We adopt the LSTM-Minus method of Wang and Chang (2016), modelling a segment as the difference between its last and first LSTM hidden vectors. This method is highly efficient, yet gives results as accurate as more complex neural network structures for modelling a span of words (Cross and Huang, 2016).
Evaluation on two benchmark datasets shows that our method outperforms the previous methods of Miwa and Bansal (2016), Li and Ji (2014) and Miwa and Sasaki (2014), giving the best reported results on both benchmarks. Detailed analysis shows that our integration of syntactic features is as effective as traditional approaches based on discrete parser outputs. We make our code publicly available.

Task

As shown in Figure 1, the goal of relation extraction is to mine relations from raw text. It consists of two sub-tasks, namely entity detection, which recognizes valid entities, and relation classification, which determines the relation categories over entity pairs. We follow recent studies and recognize entities and relations as one single task.

Method
We follow Miwa and Sasaki (2014), treating relation extraction as a table-filling problem, performing entity detection and relation classification with a single incremental model, which is similar in spirit to Miwa and Bansal (2016) in performing the task end-to-end. Formally, given a sentence w_1 w_2 · · · w_n, we maintain a table T of size n × n, where T(i, j) denotes the relation between w_i and w_j. When i = j, T(i, i) denotes an entity boundary label. We map entity words to labels under the BILOU (Begin, Inside, Last, Outside, Unit) scheme, assuming that there are no overlapping entities in one sentence (Li and Ji, 2014; Miwa and Sasaki, 2014; Miwa and Bansal, 2016). Only the upper triangular part of the table is needed to represent the relations.
We adopt the close-first, left-to-right order (Miwa and Sasaki, 2014) to map the two-dimensional table into a sequence, in order to fill the table incrementally. As shown in Figure 2, first the cells {T(i, i)} are filled with growing i, then the sequence {T(i, i+1)} is filled, and then {T(i, i+2)}, · · · , {T(i, i+n−1)}, until the table is fully annotated.
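The filling order above can be sketched in a few lines; `close_first_order` is a hypothetical helper name, not code from the paper:

```python
def close_first_order(n):
    # Enumerate the upper-triangular cells of an n x n table in the
    # close-first, left-to-right order: first the diagonal T(i, i)
    # (entity labels), then T(i, i+1), T(i, i+2), ... (relation labels).
    order = []
    for offset in range(n):
        for i in range(n - offset):
            order.append((i, i + offset))
    return order
```

For n = 3 this yields (0,0), (1,1), (2,2), (0,1), (1,2), (0,2): entity cells first, then relation cells by growing distance, and the sequence length is n(n+1)/2.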
During the table-filling process, we use two label sets, one for entity detection (i = j) and one for relation classification (i < j), where a relation label denotes a relation category and ⊥ denotes the NULL relation. (Illegal table-filling labels are removed during decoding, in both training and testing.) At each step, given a partially-filled table T, we determine the most suitable label l for the next cell using a scoring function:

score(T, l) = W_l · h_T,   (1)

where W_l is a model parameter vector and h_T is the vector representation of T. Based on this function, we aim to find the best label sequence l_1 l_2 · · · l_m, where m = n(n+1)/2, and the resulting sequence of partially-filled tables is T_0 T_1 · · · T_m, where T_i = FILL(T_{i−1}, l_i) and T_0 is an empty table. Different from previous work, we investigate a structural model that is optimized for the label sequence l_1 · · · l_m globally, rather than for each l_i locally.

Representation Learning
At the ith step, we determine the label l_i of the next table slot based on the current hypothesis T_{i−1}. Following Miwa and Bansal (2016), we use a neural network to learn the vector representation of T_{i−1}, and then use Equation 1 to score candidate next labels. There are two types of input features: the word sequence w_1 w_2 · · · w_n, and the already-filled label sequence l_1 l_2 · · · l_{i−1}. We build a neural network to represent T_{i−1}.

Word Representation

As shown in Figure 3, we represent each word w_i by a vector h^w_i using its word form, POS tag and characters. Two embeddings are derived from the word form: one from a randomly initialized look-up table E_w, tuned during training and denoted e_w, and the other a pre-trained external word embedding, which is kept fixed and denoted ē_w. For a POS tag t, its embedding e_t is obtained from a look-up table E_t similar to E_w. These two components are also used by Miwa and Bansal (2016). We further enhance the word representation with its character sequence (Lample et al., 2016), using a convolutional neural network (CNN) to derive a character-based word representation h_char, which has been shown effective for several NLP tasks (dos Santos and Gatti, 2014). We obtain the final h^w_i by applying a non-linear feed-forward layer to e_w ⊕ ē_w ⊕ e_t ⊕ h_char, where ⊕ denotes concatenation.
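As a rough illustration, the final word representation can be sketched as follows (NumPy, with hypothetical dimensions and names; the real model uses trained parameters):

```python
import numpy as np

def word_repr(e_w, e_w_bar, e_t, h_char, W, b):
    # Final word vector: a non-linear feed-forward layer over the
    # concatenation e_w (+) e_w_bar (+) e_t (+) h_char, where e_w is
    # the tuned embedding, e_w_bar the fixed pre-trained embedding,
    # e_t the POS-tag embedding and h_char the CNN character vector.
    x = np.concatenate([e_w, e_w_bar, e_t, h_char])
    return np.tanh(W @ x + b)
```

The tanh layer bounds every component of the word vector to (−1, 1), and the output dimension is set by the number of rows of W.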

Label Representation
In addition to the word sequence, the history label sequence l_1 l_2 · · · l_{i−1}, and especially the labels representing detected entities, are also useful for disambiguation. For example, the previous entity boundary label can be helpful for deciding the boundary label of the current word. During relation classification, the types of the entities involved can indicate the relation category between them. We exploit the diagonal label sequence of the partial table T, which denotes entity boundaries, to enhance representation learning. A word's entity boundary label embedding e_l is obtained from a randomly initialized look-up table E_l.

LSTM Features
We follow Miwa and Bansal (2016), learning global context representations using LSTMs. Three basic LSTM structures are used: a left-to-right word LSTM, a right-to-left word LSTM, and a left-to-right entity label LSTM. Each LSTM derives a sequence of hidden vectors over its inputs. Different from Miwa and Bansal (2016), who use only the output hidden vectors {h_i} of the LSTMs to represent words, we exploit segment representations as well. In particular, for a segment of text [i, j], the representation is computed using LSTM-Minus (Wang and Chang, 2016), shown in Figure 4, as the difference h_j − h_{i−1} between the hidden vectors at the segment boundaries. The segment representations can reflect entities in a sentence, and thus can be useful for both entity detection and relation extraction.
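A minimal sketch of LSTM-Minus, assuming `hiddens[t]` holds the hidden state of a unidirectional LSTM after reading word t (the function name is ours):

```python
import numpy as np

def lstm_minus(hiddens, i, j):
    # Segment representation for words [i, j] (inclusive) as the
    # difference of LSTM hidden states: h_j - h_{i-1}
    # (Wang and Chang, 2016).  A zero vector stands in for h_{-1}
    # at the sentence start.
    h_j = hiddens[j]
    h_before = hiddens[i - 1] if i > 0 else np.zeros_like(h_j)
    return h_j - h_before
```

Since h_t summarizes the prefix w_0 · · · w_t, the subtraction cancels the contribution of the words before the segment, leaving a cheap fixed-size vector for any span without extra network structure.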

Feature Representation
We use separate feature representations for entity detection and relation classification, both of which are extracted from the above three LSTM structures. In particular, we first extract a set of base neural features, and then concatenate them and feed them into a non-linear neural layer for entity detection and relation classification, respectively. Figure 5 shows the overall representation.
For the entity label LSTM, we only use the segment features of entity_i and entity_j.

Syntactic Features
Previous work has shown that syntactic features are useful for relation extraction (Zhou et al., 2005). For example, the shortest dependency path has been used by several relation extraction models (Bunescu and Mooney, 2005; Miwa and Bansal, 2016). Here we propose a novel method to integrate syntax without the need for prior knowledge of concrete syntactic structures.
In particular, we consider state-of-the-art syntactic parsers that use encoder-decoder neural models (Buys and Blunsom, 2015; Kiperwasser and Goldberg, 2016), where the encoder represents the syntactic features of the input sentence. For example, LSTM hidden states over the input word/tag sequences have frequently been used as syntactic features (Kiperwasser and Goldberg, 2016). Such features represent input words with syntactic information. The parser decoder also leverages partially-parsed results, such as features from partial syntactic trees, although we do not use such explicit output features. Table 1 shows the encoder structures of three state-of-the-art dependency parsers.
Our method is to leverage trained syntactic parsers, dumping the encoder feature representations given our inputs and using them directly as part of the input embeddings of our proposed model. Denoting the dumped syntactic features of each word as h_syn_1 h_syn_2 · · · h_syn_n, we feed them into a non-linear neural layer, and then run a forward LSTM and a backward LSTM over the outputs, augmenting the original three LSTMs into five. Features are extracted from the two new LSTMs in the same way as from the basic bi-directional word LSTMs.
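The freeze-and-dump pattern can be sketched as follows; `FrozenParserEncoder` is a toy stand-in for a real pretrained encoder (e.g. the BiLSTM of Dozat and Manning (2016)), and all names and dimensions here are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenParserEncoder:
    # Stand-in for the encoder of a pretrained parser.  In the real
    # system the hidden states are dumped from the trained parser;
    # here a fixed random projection plays that role.
    def __init__(self, dim_in, dim_out):
        self.W = rng.normal(size=(dim_in, dim_out))

    def encode(self, words):            # words: (n, dim_in)
        return np.tanh(words @ self.W)  # (n, dim_out), kept frozen

def syntactic_inputs(words, encoder, W_ff, b_ff):
    # Dump the encoder states and pass them through a trainable
    # non-linear feed-forward layer; the outputs would feed the two
    # extra forward/backward syntactic LSTMs.
    h_syn = encoder.encode(words)  # treated as constants (no gradient)
    return np.tanh(h_syn @ W_ff + b_ff)
```

Only W_ff and b_ff are updated during relation extraction training; the parser encoder is a fixed feature extractor, which is what makes the method agnostic to the parser's grammar formalism.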
In this paper, we exploit the parser of Dozat and Manning (2016), since it achieves the current best performance for dependency parsing. Our method can be easily generalized to other parsers, which are potentially useful for our task as well. For example, we can use a constituent parser in the same way by dumping the implicit encoder features.
Our exploration of syntactic features has two main advantages over the method of Miwa and Bansal (2016), where dependency-path LSTMs are used for relation classification. On the one hand, incorrect dependency paths between entity pairs can propagate to relation classification in Miwa and Bansal (2016), because these paths rely on explicit discrete outputs from a syntactic parser. Our method avoids this problem since we never compute parser outputs. On the other hand, the computational complexity is largely reduced, since our sequential LSTMs depend on the inputs only, while dependency-path LSTMs must be computed over the dynamic entity detection outputs. When beam search is used during decoding, an increasing number of dependency paths must be processed, due to the surge of entity pairs in the beam outputs.
Our method can be extended to neural stacking (Wang et al., 2017) by back-propagating into the parser parameters during model training, which we leave for future work.

Local Optimization
Previous work (Miwa and Bansal, 2016) trains model parameters by modeling each labeling step of one input sentence separately. Given a partial table T, its neural representation h_T is first obtained, and the scores of the candidate next labels {l_1, l_2, · · · , l_s} are then computed using Equation 1. The output scores are normalized into a probability distribution {p_l1, p_l2, · · · , p_ls} using a softmax layer. The training objective is to minimize the cross-entropy loss between this output distribution and the gold-standard distribution:

Loss(Θ) = − log p_{l^g_i},

where l^g_i is the gold-standard next label for T, and Θ is the set of all model parameters. We refer to this training method as local optimization, because it maximizes the score of the gold-standard label at each step locally. During decoding, a greedy search strategy is applied, consistent with training: at each step, we find the highest-scored label based on the current partial table before going on to the next step.

Global Optimization
We exploit the global optimization strategy of Zhou et al. (2015) and Andor et al. (2016), maximizing the cumulative score of the gold-standard label sequence of one sentence as a unit. Global optimization has achieved success on several NLP tasks under neural settings (Zhou et al., 2015; Watanabe and Sumita, 2015). For relation extraction, global learning gives the best performance under the discrete setting (Li and Ji, 2014; Miwa and Sasaki, 2014). We study such a strategy here for neural models.
Given a label sequence l_1 l_2 · · · l_i, the score of T_i is defined as follows:

score(T_i) = Σ_{k=1..i} score(T_{k−1}, l_k) = score(T_{i−1}) + score(T_{i−1}, l_i),

where score(T_0) = 0 and score(T_{k−1}, l_k) is computed by Equation 1. By this definition, we maximize the scores of all gold-standard partial tables.
Again a cross-entropy loss is used to perform model updates. At each step i, the objective is:

Loss(Θ) = − log p(T^g_i | x),   where   p(T^g_i | x) = exp(score(T^g_i)) / Σ_{T_i} exp(score(T_i)),

x denotes the input sentence, T^g_i denotes the gold-standard state at step i, and the sum ranges over all partial tables T_i that can be reached at step i.
The major challenge is to compute p(T^g_i | x), because we cannot traverse all partial tables that are valid at step i: their number grows exponentially with the step number. We follow Andor et al. (2016), approximating the probability using beam search and early update.
As shown in Algorithm 1, we use standard beam search, maintaining the B highest-scored partially-filled tables in an agenda at each step. When each table-filling action is taken, all hypotheses in the agenda are expanded by enumerating the candidate next labels, and the B highest-scored resulting tables replace the agenda for the next step. Search begins with the agenda containing an empty table, and finishes when all cells of the tables in the agenda have been filled. When the beam size is 1, the algorithm is the same as greedy decoding. When the beam size is larger than 1, error propagation is alleviated. For training, the same beam search algorithm is applied to the training examples, and early update (Collins and Roark, 2004) is used to address search errors.
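A compact sketch of beam search with early update over label sequences; `step_score` stands in for the model's Equation 1 scores, and all names here are ours:

```python
def beam_search(n_steps, labels, step_score, beam_size, gold=None):
    # Beam search over label sequences.  Each hypothesis is a
    # (sequence, cumulative score) pair, with scores adding up as in
    # the global model.  If the gold sequence is given and falls off
    # the beam, search stops early so the caller can perform an
    # early update (Collins and Roark, 2004).
    beam = [((), 0.0)]
    for t in range(n_steps):
        expanded = [(seq + (l,), s + step_score(seq, l))
                    for seq, s in beam for l in labels]
        expanded.sort(key=lambda h: h[1], reverse=True)
        beam = expanded[:beam_size]
        if gold is not None and all(seq != tuple(gold[:t + 1])
                                    for seq, _ in beam):
            return beam, t  # gold pruned: early update at step t
    return beam, n_steps
```

With `beam_size=1` and `gold=None` this reduces to greedy decoding; during training, the returned step index marks the prefix over which the parameters are updated.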

Data and Evaluation
We evaluate the proposed model on two datasets, namely the ACE05 dataset and the corpus of Roth and Yih (2004) (CONLL04). The ACE05 dataset defines seven coarse-grained entity types and six coarse-grained relation categories, while the CONLL04 dataset defines four entity types and five relation categories.
For the ACE05 dataset, we follow Li and Ji (2014) and Miwa and Bansal (2016) in splitting and preprocessing the dataset into training, development and test sets. For the CONLL04 dataset, we follow Miwa and Sasaki (2014) in splitting the data into training and test corpora, and then set aside 10% of the training corpus for development.
We use the micro F1-measure as the main metric to evaluate model performance, treating an entity as correct when both its head region and its type are correct, and a relation as correct when the argument entities and the relation category are all correct. We use a pairwise t-test to measure significance.
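A minimal sketch of the micro F1 computation over sets of predicted items (entities or relations encoded as hashable tuples); this is the standard definition, not code from the paper:

```python
def micro_f1(gold, predicted):
    # Micro-averaged F1: precision and recall are computed over the
    # pooled sets of gold and predicted items, then combined as the
    # harmonic mean.
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    if tp == 0:
        return 0.0
    p = tp / len(predicted)  # precision
    r = tp / len(gold)       # recall
    return 2 * p * r / (p + r)
```

For entities an item would encode the head region and type; for relations, the two argument entities and the relation category.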

Parameter Tuning
We update all model parameters by back-propagation using Adam (Kingma and Ba, 2014) with a learning rate of 10^-3, gradient clipping with a max norm of 10, and l2 regularization with coefficient 10^-5. The dimension sizes of the various vectors in the neural network structure are shown in Table 2. All hyper-parameters are tuned in development experiments. All experiments are conducted using gcc version 4.9.4 (Ubuntu 4.9.4-2ubuntu1 14.04.1), on an Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz.
Online training is used to learn the parameters, traversing the entire set of training examples for 300 iterations; we select the best iteration number according to the development results. In addition, we exploit pre-training techniques (Wiseman and Rush, 2016) to learn better model parameters. For the local model, we follow Miwa and Bansal (2016), training parameters only for entity detection during the first 20 iterations. For the global model, we pretrain the model with local optimization for 40 iterations before conducting beam-search global optimization.

Development Experiments
We conduct several development experiments on the ACE05 development dataset.

Feature Ablation Tests
We consider a baseline system with no syntactic features, using local training. Compared with Miwa and Bansal (2016), we introduce character-level features and additionally exploit segmental features for entity detection. Feature ablation experiments are conducted for these two types of features. Table 3 shows the experimental results, which demonstrate that both the character-level features and the segment features are useful for relation extraction.

Local vs. Global Training
We study the influence of training strategies for relation extraction without using syntactic features. For the local model, we apply scheduled sampling (Bengio et al., 2015), which has been shown to improve relation extraction performance by Miwa and Bansal (2016). Table 4 shows the results. Scheduled sampling improves the F-measure of the local model. With the same greedy search strategy, the globally normalized model gives slightly better results than the local model with scheduled sampling. The performance of the global model increases with a larger beam size. With a beam size of 5, we obtain a further gain of 1.2% in relation F-measure, which is significantly better than the baseline local model with scheduled sampling (p ≈ 10^-4). However, decoding becomes intolerably slow as the beam size increases beyond 5. We therefore use a beam size of 5 for global training, balancing performance and efficiency.

Syntactic Features
We examine the effectiveness of the proposed implicit syntactic features. Table 5 shows the development results using both local and global optimization. The proposed features improve relation performance significantly under both settings (p < 10^-4), demonstrating that our use of syntactic features is highly effective.
We also compare our feature integration method with the traditional method based on explicit syntactic outputs, which Miwa and Bansal (2016) and all previous methods use. We use the same parser of Dozat and Manning (2016), building features on its dependency outputs. We exploit a bidirectional tree LSTM to extract neural features, and then use a non-linear feed-forward network to combine the two kinds of features. Similarly, we extract segment features, but using max pooling over the sequential outputs of the feed-forward layer, since vector subtraction is not meaningful here. The final relation results are 53.1% and 53.9% for the local and global models, respectively, which are not significantly different from those of our models. On the other hand, our method is relatively more efficient, and is flexible with respect to the grammar formalism.

Final Results

We compare our final model with the best previous methods. M&B (2016) refers to Miwa and Bansal (2016), who exploit end-to-end LSTM neural networks with local optimization, and L&J (2014) and M&S (2014) refer to Li and Ji (2014) and Miwa and Sasaki (2014), respectively, which are both globally optimized models using discrete features, giving the top F-scores among statistical models. Overall, neural models give better performance than statistical models, and global optimization gives improved performance as well. Our final model achieves the best performance on both datasets. Compared with the best reported results, our model gives improvements of 1.9% on ACE05 and 6.8% on CONLL04.

Analysis
We conduct analysis on the ACE05 test set in order to better understand our model's two main contributions, first examining the influence of global optimization, and then studying the gains from the proposed syntactic features. Intuitively, global optimization should give better accuracy at the sentence level. We verify this by examining sentence-level accuracies, where a sentence is regarded as correct when all the labels in the resulting table are correct. Figure 6 shows the results, which are consistent with our intuition: the sentence-level accuracies of the globally normalized model are consistently better than those of the local model. In addition, accuracy decreases sharply as sentence length increases, with the local model suffering more severely on longer sentences.
To understand the effectiveness of the proposed syntactic features, we examine the relation F-scores with respect to entity distances. Miwa and Bansal (2016) exploit the shortest dependency path, which can make the distance between two entities shorter than their sequential distance, thus facilitating relation extraction. We verify whether the proposed syntactic features benefit our model similarly. As shown in Figure 7, the F-scores of entity pairs with large distances see apparent improvements, demonstrating that our use of syntactic features has an effect similar to that of the shortest dependency path.
Related Work

Several studies find that extracting entities and relations jointly can benefit both tasks. Early work conducts joint inference over separate models (Ji and Grishman, 2005; Roth and Yih, 2004, 2007). Recent work shows that joint learning and decoding with a single model brings more benefits for the two tasks (Li and Ji, 2014; Miwa and Sasaki, 2014; Miwa and Bansal, 2016), and we follow this line of work. LSTM features have been extensively exploited for NLP tasks, including tagging (Lample et al., 2016), parsing (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016), relation classification (Vu et al., 2016; Miwa and Bansal, 2016) and sentiment analysis. Based on the outputs of LSTM structures, Wang and Chang (2016) introduce segment features and apply them to dependency parsing. The same method is applied to constituent parsing by Cross and Huang (2016). We exploit this segmental representation for relation extraction.
Global optimization and normalization have been successfully applied to many NLP tasks that involve structured prediction (Lafferty et al., 2001; Collins, 2002; McDonald et al., 2010; Zhang and Clark, 2011), using traditional discrete features. For neural models, they have recently received increasing interest (Zhou et al., 2015; Andor et al., 2016; Xu, 2016; Wiseman and Rush, 2016), and improved performance can be achieved with global optimization accompanied by beam search. Our work is in line with these efforts. To our knowledge, we are the first to apply globally optimized neural models to end-to-end relation extraction, achieving the best results on standard benchmarks.

Conclusion
We investigated a globally normalized end-to-end relation extraction model using neural networks, based on the table-filling framework of Miwa and Sasaki (2014). Feature representations are learned with several LSTM structures over the inputs, and a novel yet simple method is used to integrate syntactic information. Experiments show the effectiveness of both global normalization and syntactic features. Our final model achieved the best performance on two benchmark datasets.