End-to-end Relation Extraction using Neural Networks and Markov Logic Networks

End-to-end relation extraction refers to identifying the boundaries of entity mentions, the entity types of these mentions and the appropriate semantic relation for each pair of mentions. Traditionally, separate predictive models were trained for each of these tasks and used in a "pipeline" fashion, where the output of one model is fed as input to another. It was observed, however, that addressing some of these tasks jointly results in better performance. We propose a single, joint neural network based model that carries out all three tasks of boundary identification, entity type classification and relation type classification. This model is referred to as the "All Word Pairs" model (AWP-NN), as it assigns an appropriate label to each word pair in a given sentence in order to perform end-to-end relation extraction. We also propose to refine the output of the AWP-NN model by inference in Markov Logic Networks (MLN), so that additional domain knowledge can be incorporated effectively. We demonstrate the effectiveness of our approach by achieving better end-to-end relation extraction performance than four previous joint modelling approaches on the standard ACE 2004 dataset.


Introduction
The task of relation extraction (RE) deals with identifying whether any pre-defined semantic relation holds between a pair of entity mentions in a given sentence. Pure relation extraction techniques (Zhou et al., 2005; Jiang and Zhai, 2007; Bunescu and Mooney, 2005; Qian et al., 2008) assume that the gold-standard entity mentions in a sentence (i.e. their boundaries as well as types) are known. In contrast, end-to-end relation extraction deals with plain sentences, without assuming any knowledge of the entity mentions in them. The task of end-to-end relation extraction consists of three sub-tasks: i) identifying the boundaries of entity mentions, ii) identifying the entity types of these mentions and iii) identifying the appropriate semantic relation for each pair of mentions. The first two sub-tasks correspond to the Entity Detection and Tracking task defined by the Automatic Content Extraction (ACE) program (Doddington et al., 2004), and the third sub-task corresponds to the Relation Detection and Characterization (RDC) task. The ACE standard defined seven entity types: PER (person), ORG (organization), LOC (location), GPE (geopolitical entity), FAC (facility), VEH (vehicle) and WEA (weapon). It also defined seven coarse-level relation types: EMP-ORG (employment), PER-SOC (personal/social), PHYS (physical), GPE-AFF (GPE affiliation), OTHER-AFF (PER/ORG affiliation), ART (agent-artifact) and DISC (discourse).
Traditionally, the three sub-tasks of end-to-end relation extraction are carried out serially in a "pipeline" fashion. In this case, errors in any sub-task affect the subsequent sub-tasks. Another disadvantage of the "pipeline" approach is that it allows only one-way information flow: the knowledge about entities is used for identifying relations, but not vice versa. To overcome these problems, several approaches (Roth and Yih, 2002; Roth and Yih, 2004; Singh et al., 2013) were proposed which carry out these sub-tasks jointly rather than in a "pipeline" manner. We propose a new approach which combines the powers of Neural Networks and Markov Logic Networks to jointly address all three sub-tasks of end-to-end relation extraction. We design the "All Word Pairs" neural network model (AWP-NN), which reduces the solution of these three sub-tasks to predicting an appropriate label for each word pair in a given sentence. The end-to-end relation extraction output can then be constructed easily from these word pair labels. Moreover, as a separate prediction is made for each word pair, there may be some inconsistencies among the labels. We address this problem by refining the predictions of AWP-NN using inference in Markov Logic Networks, so that some of the inconsistencies in word pair labels can be removed at the sentence level. The specific contributions of this work are: i) modelling the boundary detection problem by introducing a special relation type WEM, and ii) a single, joint neural network model for all three sub-tasks of end-to-end relation extraction. The paper is organized as follows. Section 2 provides a detailed problem definition. Section 3 describes our AWP-NN model in detail, followed by Section 4, which describes how the predictions of the AWP-NN model are revised using inference in MLNs. Section 5 provides experimental results and analysis.

[Table 1: Expected output of end-to-end relation extraction system for entity mentions]
Finally, we conclude in Section 6 with a short note on future work.

Problem Definition
Given a sentence as an input, an end-to-end relation extraction system should produce a list of entity mentions within it. For each entity mention, its boundaries and entity type should be identified. Also, for each pair of valid entity mentions, it should decide whether any pre-defined semantic relation holds between them.
Consider the sentence: His(0) sister(1) Mary(2) Jones(3) went(4) to(5) the(6) United(7) Kingdom(8) .(9) Here, end-to-end relation extraction should produce the output shown in tables 1 and 2.


The All Word Pairs Model (AWP-NN)

The labels predicted by the AWP-NN model for each word pair can be used to construct the end-to-end relation extraction output as described in tables 1 and 2.
Consider the example sentence from Section 2. Table 3 shows the true annotations of all word pairs in this sentence, as required for training the AWP-NN model. The labels used for these annotations can be grouped into five logical clusters. One of these, WEM, indicates that the words in the word pair belong to the same entity mention and one of the words is the head word of that mention.

Features for the AWP-NN model
Previous work (Zhou et al., 2005;Jiang and Zhai, 2007;Bunescu and Mooney, 2005;Qian et al., 2008) in relation extraction establishes the importance of both lexical and syntactic features. Hence, we designed features to capture information about word sequences, POS tags and dependency structure. As each word pair constitutes a separate instance for classification, features are of three types: i) features characterizing individual word in a word pair, ii) features characterizing properties of both the words at a time and iii) features based on feedback, i.e. predictions of preceding instances.

Individual word features
These features are generated separately for both the words in a word pair.

1. Word itself and its POS tag
2. Previous word and its POS tag
3. Next word and its POS tag

Word pair features

These features are generated for a word pair (say W_i, W_j) as a whole.

1. Words distance (WD): the number of words in the sentence between W_i and W_j
2. Tree distance (TD): the number of words on the path leading from W_i to W_j in the sentence's dependency tree
3. Common Ancestor (CA): the lowest common ancestor of the two words in the dependency tree
4. Ancestor Position (AP): the position of the common ancestor with respect to the two words of the pair; the possible positions are left of W_i, W_i itself, between W_i and W_j, W_j itself, and right of W_j
5. Dependency Path (DP_1, DP_2, ..., DP_K): the sequence of dependency relation types (ignoring directions) on the dependency path leading from W_i to W_j in the sentence's dependency tree
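The word-pair features above can be sketched as follows over a toy dependency parse. The parse encoding (a head index per token, -1 for the root, plus a relation label per token) and the relation labels themselves are illustrative, not taken from the paper's parser.

```python
# Word-pair features computed over a toy dependency parse.

def path_to_root(i, heads):
    """Token indices from i up to the root, inclusive."""
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def word_pair_features(i, j, heads, deprels):
    up_i, up_j = path_to_root(i, heads), path_to_root(j, heads)
    ca = next(n for n in up_i if n in set(up_j))   # lowest common ancestor
    # Dependency path i -> ca -> j as relation labels, directions ignored.
    dp = [deprels[n] for n in up_i[:up_i.index(ca)]] + \
         [deprels[n] for n in reversed(up_j[:up_j.index(ca)])]
    td = up_i.index(ca) + up_j.index(ca) + 1       # words on the i..j path
    if ca < i:
        ap = "left of W_i"
    elif ca == i:
        ap = "W_i itself"
    elif ca < j:
        ap = "between"
    elif ca == j:
        ap = "W_j itself"
    else:
        ap = "right of W_j"
    return {"WD": j - i - 1, "TD": td, "CA": ca, "AP": ap, "DP": dp}

# "His sister Mary Jones went to the United Kingdom" (tokens 0..8)
heads   = [1, 3, 3, 4, -1, 4, 8, 8, 5]
deprels = ["poss", "nn", "nn", "nsubj", "root", "prep", "det", "nn", "pobj"]
print(word_pair_features(3, 8, heads, deprels))
```

For the pair (Jones, Kingdom), the dependency path runs through the verb "went", which is also the common ancestor sitting between the two words.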

Feedback features
These features are based on the predictions of the preceding instances. Unlike other sequence labelling problems such as Named Entity Recognition, where each word gets a label and there is a natural order of instances (i.e. words), there is no natural order of instances (i.e. word pairs) for the AWP-NN model. Hence, for each instance we identify its two preceding instances and define two corresponding feedback features (FB_1 and FB_2). Let <W_i, W_j> be an instance representing a word pair in a sentence of N words, such that 1 <= i <= j <= N. There are two cases for identifying the two preceding instances of <W_i, W_j>:

- If i = j, then both the preceding instances are the same, i.e. <W_{i-1}, W_{i-1}>.
- If i < j, then the preceding instances are <W_i, W_i> and <W_j, W_j>.

The label predictions of the preceding instances are represented using one-hot encoding and used as features. During training, the true labels of the preceding instances are used, but while decoding, their predicted labels are used. Hence, during decoding, predictions for word pairs of the form <W_i, W_i> (the diagonal word pairs in table 3) are obtained first, for i = 1 to N. Predictions for the other word pairs can be obtained later, as the predictions of their preceding instances are then available.
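The decoding order this implies can be sketched as below: diagonal pairs first, left to right, then off-diagonal pairs, so each pair's preceding instances are already decoded. The `predict` callable stands in for the AWP-NN model and is purely illustrative.

```python
# Decoding order implied by the feedback features (1-indexed word positions).

def preceding_instances(i, j):
    """The two preceding instances of pair <W_i, W_j> (i <= j)."""
    if i == j:
        return (i - 1, i - 1), (i - 1, i - 1)
    return (i, i), (j, j)

def decode(n, predict):
    labels = {}
    # Pass 1: diagonal pairs, i = 1 .. n. For i = 1 there is no preceding
    # pair, so the feedback entries are None (a padding label in practice).
    for i in range(1, n + 1):
        fb = [labels.get(p) for p in preceding_instances(i, i)]
        labels[(i, i)] = predict(i, i, fb)
    # Pass 2: remaining pairs; their (diagonal) preceding labels now exist.
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            fb = [labels.get(p) for p in preceding_instances(i, j)]
            labels[(i, j)] = predict(i, j, fb)
    return labels

print(len(decode(4, lambda i, j, fb: "NULL")))  # 10 pairs for a 4-word sentence
```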
Architecture of the AWP-NN model

Figure 1 shows the major components of the architecture of the AWP-NN model.

Embedding Layers
Most of the features used by the model are discrete in nature, such as words, POS tags, dependency relation types and ancestor position. These discrete features have to be mapped to some numerical representation, and embedding layers are used for this purpose. We employ the following embedding layers to represent the various types of features:

Word embedding layer: maps each word to a real-valued vector of some fixed dimension. We initialize this layer with the pre-trained 100-dimensional GloVe word vectors (http://nlp.stanford.edu/projects/glove/) learned on the Wikipedia corpus. All the features expressed in the form of words (W_1, W_2, NW_1, PW_1, NW_2, PW_2, Pa_1, Pa_2 and CA in figure 1) share the same word embedding layer. During training, the initial embeddings get fine-tuned for our supervised classification task.

POS embedding layer: maps each distinct POS tag to a real-valued vector representation. All the features expressed in the form of POS tags (T_1, T_2, NT_1, PT_1, NT_2, PT_2, PaT_1 and PaT_2 in figure 1) share the same embedding layer.

Dependency relation type embedding layer: maps each distinct dependency relation type to a real-valued vector representation. All the features based on dependency types (DR_1, DR_2, DP_1, ..., DP_K in figure 1) share the same embedding layer.

AP embedding layer: maps each distinct ancestor position to a real-valued vector representation.

WD/TD embedding layer: even though word distance (WD) and tree distance (TD) are numerical features, we use embeddings to represent each of their distinct values, as the range of values of these features is large. This was observed to work better than providing the values directly as inputs to the neural network.
In our experiments, we used 20 dimensions for the POS embeddings, 40 for the dependency relation type embeddings and 5 for the AP, WD and TD embeddings. Unlike the word embeddings, these were initialized randomly.
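A minimal NumPy sketch of this sharing: one lookup table per feature *type* (words, POS tags, ...), reused by every feature slot of that type. The vocabularies and example feature values below are illustrative; in the real model the word table is initialized from pre-trained GloVe vectors and fine-tuned.

```python
import numpy as np

# One shared embedding table per feature type, reused across feature slots.
rng = np.random.default_rng(0)

def make_table(vocab, dim):
    return {sym: rng.standard_normal(dim) for sym in vocab}

word_emb = make_table(["his", "sister", "mary", "went", "<PAD>"], 100)
pos_emb  = make_table(["PRP$", "NN", "NNP", "VBD", "<PAD>"], 20)

def embed(table, features):
    """Look up each feature in the shared table and concatenate the vectors."""
    return np.concatenate([table[f] for f in features])

# Word-typed and POS-typed feature slots for one word pair (each word, its
# previous word and its next word):
word_feats = ["sister", "his", "mary", "went", "mary", "<PAD>"]
pos_feats  = ["NN", "PRP$", "NNP", "VBD", "NNP", "<PAD>"]
x = np.concatenate([embed(word_emb, word_feats), embed(pos_emb, pos_feats)])
print(x.shape)  # 6 * 100 + 6 * 20 = (720,)
```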

Hidden Layers
The first hidden layer is divided into three parts. The first two parts, of 60 nodes each, are connected only to the features describing the first and second word, respectively. These nodes are expected to capture higher-level abstract features of the two words separately. In order to force these two parts to learn similar abstract features, the weight matrix is shared between them. The third part of the first hidden layer, consisting of 500 nodes, is connected to all the input features except the dependency path, i.e. the individual word features of the two words, the word pair features and the feedback features. The output of this part is given as input to the second hidden layer of 250 units, whose output is fed to the final softmax layer. The outputs of the first two parts of the first hidden layer are also directly connected to the final softmax layer. As the dependency path is represented as a sequence of dependency relation types, it is fed to a separate LSTM layer, whose output is directly connected to the final softmax layer. The softmax layer consists of 19 nodes, each representing one of the possible prediction labels described earlier.
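The wiring described above can be sketched as a toy forward pass in NumPy: two 60-unit parts with a shared weight matrix (one per word), a 500-unit part over the remaining features, a 250-unit second layer, and a 19-way softmax over the concatenated outputs. The input dimensionalities are invented and the LSTM over the dependency path is omitted; this is a sketch, not the paper's Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_word, d_all = 130, 900                              # assumed input sizes
W_shared = rng.standard_normal((60, d_word)) * 0.05   # shared by both word parts
W_all    = rng.standard_normal((500, d_all)) * 0.05
W_h2     = rng.standard_normal((250, 500)) * 0.05
W_out    = rng.standard_normal((19, 60 + 60 + 250)) * 0.05

def forward(x_w1, x_w2, x_all):
    h1 = relu(W_shared @ x_w1)               # part 1: first word
    h2 = relu(W_shared @ x_w2)               # part 2: second word, same weights
    h3 = relu(W_h2 @ relu(W_all @ x_all))    # part 3 -> second hidden layer
    # All three outputs feed the 19-way softmax layer directly.
    return softmax(W_out @ np.concatenate([h1, h2, h3]))

p = forward(rng.standard_normal(d_word), rng.standard_normal(d_word),
            rng.standard_normal(d_all))
print(p.shape)  # (19,) -- one probability per word-pair label
```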
Inference using Markov Logic Networks

[Figure 1: AWP-NN model architecture for predicting the appropriate label for a given word pair. W_1, W_2: words in the word pair; NW_1, PW_1, NW_2, PW_2, NT_1, PT_1, NT_2, PT_2: next and previous words/POS tags of W_1 and W_2; Pa_1, DR_1, Pa_2, DR_2: parents and corresponding dependency relation types of W_1 and W_2 in the dependency tree; PaT_1, PaT_2: POS tags of the parents of W_1 and W_2; FB_1, FB_2: predictions of the preceding instances; CA: lowest common ancestor of W_1 and W_2 in the dependency tree; TD: tree distance; WD: words distance; AP: ancestor position; DP_1, DP_2, ..., DP_K: sequence of dependency relation types on the dependency path leading from W_1 to W_2. Embedding layers for words, POS and dependency relations are shown separately for clarity, but are shared throughout the network.]

Pawar et al. (2016) presented an approach for end-to-end relation extraction which uses Markov Logic Networks (MLN) (Richardson and Domingos, 2006) to obtain a globally consistent output by combining the local outputs of individual classifiers. They developed separate classifiers for identifying mention boundaries, predicting entity types and
predicting relation types. The outputs of these classifiers may be inconsistent. For example, if the PER-SOC relation is predicted by the local relation classifier for an entity pair, and the local entity classifier predicts the entity type ORG for one of the entity mentions, then there is an inconsistency, because PER-SOC can only hold between two PER entity mentions. Such domain knowledge can be easily incorporated in the form of first-order logic rules in an MLN. For each sentence, the predictions of the individual classifiers are represented in an MLN as first-order logic rules whose weights are proportional to the prediction probabilities. The consistency constraints among relation types and entity types are represented as first-order logic rules with infinite weights. Inference in such an MLN then generates a globally consistent output with maximum weighted satisfiability of the rules.
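The consistency-repair idea can be illustrated with a tiny brute-force search: choose the joint entity-type assignment with maximum summed log-probability, subject to the hard rule that PER-SOC may only hold between two PER mentions. The real system does this with MLN inference (Alchemy); the mentions and probabilities below are invented.

```python
from itertools import product
from math import log

# Local entity-classifier outputs; m2 locally prefers ORG.
ent_probs = {"m1": {"PER": 0.9, "ORG": 0.1},
             "m2": {"PER": 0.4, "ORG": 0.6}}
persoc_pairs = [("m1", "m2")]   # PER-SOC predicted between m1 and m2

def consistent(assign):
    """Hard constraint: PER-SOC requires both arguments to be PER."""
    return all(assign[x] == "PER" and assign[y] == "PER"
               for x, y in persoc_pairs)

mentions = sorted(ent_probs)
best = max(
    (dict(zip(mentions, types))
     for types in product(["PER", "ORG"], repeat=len(mentions))),
    key=lambda a: sum(log(ent_probs[m][a[m]]) for m in mentions)
                  if consistent(a) else float("-inf"),
)
print(best)  # {'m1': 'PER', 'm2': 'PER'} -- m2 flipped to satisfy the rule
```

Brute force is exponential in the number of mentions; the MLN formulation scales this to real sentences and many interacting rules.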
AWP-NN is a single joint model which captures the boundaries of mentions, their types and the relations among them. As the same parameters are shared across all entity and relation type predictions, we expect the model to learn the dependencies among relation and entity types. However, as it makes a separate prediction for each word pair, there might be some inconsistencies among the labels, as described above. We adopt the MLN-based approach of Pawar et al. (2016) for handling these inconsistencies and generating a globally consistent output. For this adaptation, we consider the AWP-NN predictions for the word pairs where the AWP-NN model assigns a probability of more than some threshold (say 0.01) to any non-NULL relation type. All the generic rules (with infinite weights) described in (Pawar et al., 2016) are used for imposing constraints among the relation and entity types, for example:

RTFinal(x, y, PER-SOC) ⇒ (ETFinal(x, PER) ∧ ETFinal(y, PER))

Also, we add the following additional generic rules specifying constraints for our WEM relation type, which captures information about mention boundaries.

RTFinal(x, y, WEM) ⇒ (ETFinal(x, OTH) ∨ ETFinal(y, OTH))
RTFinal(x, y, WEM) ⇒ (!ETFinal(x, OTH) ∨ !ETFinal(y, OTH))
By definition, the WEM relation holds between the head word of an entity mention and the other words of that entity mention. Additionally, the head word of an entity mention is labelled with the appropriate entity label, and the other words are labelled with the entity type OTH. The above rules state that if there is a WEM relation between two words x and y, then at least one of them should have the label OTH, and at least one of them should have an entity type label, i.e. a label from the domain etype other than OTH. Similarly, all the sentence-specific rules (with finite weights proportional to the AWP-NN prediction probabilities) described in (Pawar et al., 2016) are generated for representing the predictions of the AWP-NN model. We use Constant Multiplier (CM) as the weights assignment strategy. The following rule is generated for each entity type E (from etypes) for each word pair <W_i, W_i>, with weight 10 · Pr_AWP-NN(E | <W_i, W_i>), where E_max is the predicted entity type with the highest probability:

ET(W_i, E_max) ⇒ ETFinal(W_i, E)

Similarly, the following rule is generated for each relation type R (from rtypes) for each word pair <W_i, W_j>, with weight 10 · Pr_AWP-NN(R | <W_i, W_j>), where R_max is the predicted relation type with the highest probability:

RT(W_i, W_j, R_max) ⇒ RTFinal(W_i, W_j, R)

Using these generic and sentence-specific rules, an MLN is constructed for each sentence. The best values of ETFinal and RTFinal (the query predicates) for each word pair are obtained by inference in this MLN, with ET and RT as evidence predicates based on AWP-NN's predictions.


Experimental Results and Analysis

The ACE 2004 dataset (Doddington et al., 2004) is the most widely used dataset for reporting relation extraction performance (we have not yet acquired the more recent ACE 2005 dataset). We use this dataset to demonstrate the effectiveness of our approach for end-to-end relation extraction using the AWP-NN model and MLN inference. We perform 5-fold cross-validation on this dataset, where the folds are formed at the document level.
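The CM-weighted sentence-specific rule generation described in Section 4 can be sketched as below. The predicate names (ET, RT, ETFinal, RTFinal) follow the text, but the exact Alchemy rule syntax and constant quoting are assumptions, and the per-rule threshold check is a simplification of the word-pair gating described earlier.

```python
# Generate Alchemy-style sentence-specific rules under the Constant
# Multiplier (CM) strategy: one weighted rule per candidate label,
# with weight 10 * Pr(label).

def entity_rules(word, probs):
    """probs: entity type -> AWP-NN probability for the pair <word, word>."""
    e_max = max(probs, key=probs.get)
    return [f'{10 * p:.2f} ET("{word}",{e_max}) => ETFinal("{word}",{e})'
            for e, p in sorted(probs.items())]

def relation_rules(w1, w2, probs, threshold=0.01):
    """One rule per relation type whose probability exceeds the threshold."""
    r_max = max(probs, key=probs.get)
    return [f'{10 * p:.2f} RT("{w1}","{w2}",{r_max}) => RTFinal("{w1}","{w2}",{r})'
            for r, p in sorted(probs.items()) if p > threshold]

for rule in entity_rules("team", {"PER": 0.6, "ORG": 0.3, "OTH": 0.1}):
    print(rule)
```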
We follow the same assumptions made by (Chan and Roth, 2011; Pawar et al., 2016): we ignore the DISC relation, do not count implicit relations (resulting from intra-sentence co-references) as false positives, and use coarse-level entity and relation types.

Direction of relations: Out of the 6 coarse-level relation types that we consider, we need not model direction for PER-SOC, GPE-AFF and ART, because for these relations, given the entity types of their arguments, the direction is either unnecessary or implicit. As PER-SOC is a social relation between two PER entity mentions, the direction is not necessary. For GPE-AFF, as the entity type of one of the arguments is always GPE, the direction becomes implicit. Also, the relation type ART always holds between an agent (PER, ORG or GPE) and an artifact (FAC, WEA or VEH), hence the direction is implicit. For a relation like EMP-ORG, however, which also represents a subsidiary relationship between two ORG entity mentions, it is important to model the relation direction explicitly. Hence, we consider 9 distinct relation types: EMP-ORG, EMP-ORG-R, PHYS, PHYS-R, OTHER-AFF, OTHER-AFF-R, PER-SOC, GPE-AFF and ART. The overall dataset contains 4074 instances of valid relation types.
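The resulting 9-label relation set can be enumerated as a quick check: a reversed "-R" variant is added only for the three relation types whose direction must be modelled explicitly.

```python
# Enumerate the 9 distinct relation labels used by the model.
DIRECTED   = ["EMP-ORG", "PHYS", "OTHER-AFF"]   # need an explicit direction
UNDIRECTED = ["PER-SOC", "GPE-AFF", "ART"]      # direction unnecessary/implicit

labels = [v for r in DIRECTED for v in (r, r + "-R")] + UNDIRECTED
print(len(labels), labels)
```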

Implementation details
We used Keras (Chollet, 2015) to implement our AWP-NN model. The model was trained for 40 epochs with a batch size of 64 instances. We used Dropout (Srivastava et al., 2014) for regularization, with probability 0.5 for the hidden layers and 0.1 for the embedding layers. We used the Alchemy tool for MLN inference. The value of K (the maximum length of the dependency path, see Figure 1) was set to 4; all word pairs with a dependency path longer than 4 were assumed to have the NULL label.

Table 5 shows the comparative performance (in terms of micro-F1) of the various approaches. The results are divided into three sections:

1. Only entity extraction: includes boundary identification as well as entity type classification.
2. Only relation extraction: includes relation type classification for each pair of predicted entity mentions. This is a relaxed version of the end-to-end relation extraction problem, where a correct relation label for an entity mention pair is counted as a true positive even if the entity types of one or both mentions are identified incorrectly.
3. Entity+relation extraction: end-to-end relation extraction, which includes boundary identification, entity type classification and relation type classification. Here, a correct relation label for an entity mention pair is counted as a true positive only if the boundaries and entity types of both mentions are identified correctly.

It can be observed in table 5 that the end-to-end relation extraction performance of our AWP-NN model is better than all 4 previous approaches (Chan and Roth, 2011; Miwa and Sasaki, 2014; Pawar et al., 2016; Miwa and Bansal, 2016) on the ACE 2004 dataset. However, the AWP-NN+MLN approach, which uses MLN inference to revise the AWP-NN predictions during decoding, achieves the best performance.

[Table 5: Comparative performance (precision, recall and micro-F1) of (Chan and Roth, 2011) and the other approaches for entity extraction, relation extraction and entity+relation extraction.]

To verify the statistical significance of the improvement of AWP-NN over the previous best approach (Miwa and Bansal, 2016), we conduct a one-tailed one-sample t-test. The mean and standard deviation of a sample of 30 F1 scores obtained by AWP-NN are 49.3 and 0.44, respectively. This leads to a p-value of 1.23 × 10^-12, establishing the statistical significance of AWP-NN's performance.
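The significance computation can be sketched as below. The previous-best F1 value `mu0` is a placeholder, not the number from Table 5, and for simplicity the t tail (29 degrees of freedom) is approximated by a normal tail.

```python
from math import sqrt
from statistics import NormalDist

# One-tailed one-sample t-test on the 30 AWP-NN F1 scores.
n, mean, sd = 30, 49.3, 0.44
mu0 = 48.7                          # hypothetical previous-best F1 (placeholder)
t = (mean - mu0) / (sd / sqrt(n))   # t-statistic, n - 1 = 29 degrees of freedom
p = 1.0 - NormalDist().cdf(t)       # one-tailed p-value (normal approximation)
print(round(t, 2))
```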

Effect of using MLN
We analyzed the effect of using MLN by examining individual sentences in which errors of AWP-NN were corrected by MLN. As an example, consider the following sentence: Lemieux(0) rescued(1) his(2) team(3) from(4) bankruptcy(5) last(6) season(7) by(8) exchanging(9) deferred(10) salary(11) for(12) an(13) ownership(14) stake(15) .(16) The end-to-end relation extraction output produced by the AWP-NN model for this sentence is shown in tables 6 and 7. The only error in this output is that the entity type of the mention team should be ORG instead of PER, as it refers to a professional team. After MLN inference, the entity type of team is corrected to ORG. This happens because of the high-confidence EMP-ORG relations between Lemieux and team and between his and team. As both Lemieux and his are of type PER with high confidence, global inference using the MLN forces the type of team to be ORG, to ensure compatibility of the relation and entity types.

[Table 7: End-to-end relation extraction output (relations) produced by the AWP-NN model]

[Table 6: End-to-end relation extraction output (entity mention boundaries and entity types) produced by the AWP-NN model]
The AWP-NN model was able to outperform all 4 previous approaches (see table 5) without the help of MLN. One reason for this may be that the AWP-NN model itself is sufficient to learn most of the dependencies among the entity and relation types. Nevertheless, MLN helped improve the performance of AWP-NN by 0.6 F1. While a considerable improvement was observed in precision, the recall improvement was not significant. In other words, MLN was more effective at reducing false positives than false negatives.

Difficult to identify entities
We observed that for some entity mentions it is very difficult to identify the entity type, as the key information required for identification lies outside the current sentence. Currently, our approach does not use any information from outside the sentence, such as document-level co-reference information. Usually these difficult-to-classify entity mentions are pronoun mentions. Some examples are as follows:

1. Though, I think that if they could stifle the entire peace process at the moment, then that is what they'd like to do.

2. It is a partially victory for both sides.
Here, in the first sentence, it is difficult (even for humans) to identify whether the entity type of they is PER (e.g. a set of leaders) or GPE (e.g. countries). Also, in the second sentence, the entity type of sides can be any of PER, ORG or GPE, depending on the context. In future, we plan to capture document-level information to correctly predict the types of such mentions.

Related Work
There have been multiple lines of research on jointly modelling and extracting entities and relations. Integer Linear Programming (ILP) based approaches (Roth and Yih, 2004; Roth and Yih, 2007) were the earliest. Here, various local decisions are associated with suitable "cost" values and represented as an integer linear program; the optimal solution to this program provides the best global output. Other significant lines of research are Probabilistic Graphical Models (Roth and Yih, 2002; Singh et al., 2013), Card-pyramid parsing (Kate and Mooney, 2010) and Structured Prediction (Miwa and Sasaki, 2014).
Four previous approaches (Chan and Roth, 2011; Miwa and Sasaki, 2014; Pawar et al., 2016; Miwa and Bansal, 2016) are the most similar to ours, in the sense that they all address end-to-end relation extraction without assuming gold-standard entity mention boundaries, unlike the earlier approaches. Our idea of labelling "all word pairs" is similar to the table representation idea of Miwa and Sasaki (2014). The major difference is that they identify mention boundaries through a BIO encoding of labels, whereas we capture boundaries by treating them as an additional relation type, WEM. Also, they perform structured prediction with beam search to find the optimal label assignment to the table, whereas we opt for neural network based classification. The idea of using MLNs to incorporate domain knowledge and perform joint inference to obtain a globally consistent output was proposed by Pawar et al. (2016). The current state-of-the-art approach for end-to-end relation extraction is by Miwa and Bansal (2016), who employ an LSTM-RNN based model.

Conclusion and Future Work
We proposed a novel approach for end-to-end relation extraction which carries out all three of its subtasks (identifying entity mention boundaries, their entity types and the relations among them) jointly, using a neural network based model. We proposed an "All Word Pairs" neural network model (AWP-NN) which reduces the solution of these three subtasks to predicting an appropriate label for each word pair in a given sentence. The end-to-end relation extraction output is then constructed from these word pair labels. We further improved the output of the AWP-NN model by using inference in Markov Logic Networks, so that some of the inconsistencies in word pair labels can be removed at the sentence level.
We demonstrated the effectiveness of our approaches (AWP-NN and AWP-NN+MLN) on the standard ACE 2004 dataset. They outperformed all 4 previously reported joint modelling approaches (Chan and Roth, 2011; Miwa and Sasaki, 2014; Pawar et al., 2016; Miwa and Bansal, 2016) for end-to-end relation extraction. Since all three subtasks share the same AWP-NN model parameters, many inter-task dependencies are captured effectively by the AWP-NN itself (without MLN); this is supported by the fact that AWP-NN alone performs better than all the other joint models. However, MLN certainly helps to further improve end-to-end relation extraction performance by correcting some errors in the AWP-NN model's predictions.
In future, we plan to incorporate additional features (e.g. document-level co-reference information) in the AWP-NN model to further improve its performance. A deeper analysis of the errors is also required to better understand which characteristics are captured better by the AWP-NN model and which by the MLN, so that the two can complement each other more effectively.