Easy First Relation Extraction with Information Redundancy

Many existing relation extraction (RE) models make decisions globally using integer linear programming (ILP). However, it is nontrivial to use integer linear programming as a blackbox solver for RE: its time and memory costs may become unacceptable as the data scale grows, and redundant information needs to be encoded cautiously. In this paper, we propose an easy first approach for relation extraction that exploits the information redundancies embedded in the results produced by local sentence level extractors, and resolves conflicting decisions with domain and uniqueness constraints. Information redundancies are leveraged to support both easy first collective inference for easy decisions in the first stage and ILP for hard decisions in a subsequent stage. An experimental study shows that our approach improves the efficiency and accuracy of RE, and outperforms both ILP and neural network-based methods.


Introduction
Relation extraction (RE) has been extensively studied due to its crucial role in many knowledge based applications (Zhang et al., 2012; Chen et al., 2014a), such as question answering and knowledge graphs. There are two types of relation extractors: local and global. The former identify relationships between entity pairs individually according to the local features of sentences, e.g., lexical and syntactic features (Bunescu and Mooney, 2007; Mintz et al., 2009; Surdeanu et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016). The latter make decisions by further considering joint features of the entire corpus (Yao et al., 2010; Li et al., 2011, 2013; de Lacalle and Lapata, 2013; Chen et al., 2014a). Global relation extractors are able to resolve conflicting decisions and to utilize the dependencies among extracted facts to improve performance (Li et al., 2011; Chen et al., 2014a), commonly by formalizing RE as a constrained optimization problem and solving it with integer linear programming (ILP) (Roth and Yih, 2004; Choi et al., 2006; Roth and Yih, 2007; Li et al., 2011; Chen et al., 2014a,b). However, global optimization remains dominated by the ILP solvers, which suffer from heavy time costs.
Using integer linear programming as a blackbox solver for RE is challenging. First, as the numbers of entity pairs and candidate relations grow, the variables and constraints encoded for ILP increase dramatically, which in turn may consume too much computing time and memory. Second, redundant information needs to be encoded cautiously. Consider <United States, capital : 0.6, New York>, <United States, capital : 0.7, New York> and <United States, capital : 0.96, Washington D.C.> in Table 1. Simple statistical methods, such as confidence summation, may easily lead to a wrong decision in this case.
To address the above issues, we propose to make easy (most confident) decisions first, and then to make hard decisions with ILP. The rationale lies in that easy decisions should be made early, and eliminating conflicts with constraints aids hard decision making. We leverage information redundancies, embedded in the results produced by sentence level extractors, to compute the confidences of candidate relations for entity pairs. Redundancy commonly exists in various corpora, e.g., subject "United States" and object "New York" appear multiple times in Table 1. Even if a specific relation holds between a subject and an object, the relation may not be reflected in certain mentions due to the lack of embedded evidence in single mentions, and information redundancies essentially alleviate such insufficiency for decision making. When making easy decisions, we keep the consistency among candidate relations by using constraints (i.e., domain and uniqueness constraints) to eliminate conflicts directly, instead of implicitly encoding the constraints in ILP (Yao et al., 2010; de Lacalle and Lapata, 2013; Chen et al., 2014a; Koch et al., 2014; Chen et al., 2014b). As a result, the number of variables and constraints to be encoded in ILP is significantly reduced, which speeds up the entire decision making process.
In short, our approach employs an easy-first strategy with information redundancies to make the most confident decisions first, during which conflicting candidate relations are resolved directly by domain and uniqueness constraints, and it only makes the remaining hard decisions with ILP. Our approach is an important improvement over ILP models: it is not a new RE model, but an efficient and effective approach to fully exploiting the results of (any) local RE extractor. As shown in the experiments, our approach improves the performance, on average (4.58%, 17.90%) more accurate and (21, 37) times faster than existing ILP and neural network-based methods, respectively.
In the rest of the paper, we first introduce constraints and redundancies in Section 2, then present our detailed approach in Section 3, followed by the experimental study in Section 4, related work in Section 5, and conclusions in Section 6.

Constraints and Redundancies
Consider a set M of mentions and a set R of predefined relations {r 1 , . . . , r k }. For each mention m, a sentence level extractor produces an entity pair (subject s and object o) and a confidence score c i for each relation r i (i ∈ [1, k]), which represents the probability that s and o have relation r i (Berger et al., 1996;Hoffmann et al., 2011). Here an NA (unknown) relation is typically included in R.
A mention m preprocessed by a sentence level extractor is essentially a k + 2 tuple (s, o, r 1 : c 1 , . . . , r k : c k ), as illustrated in Table 1. Observe that an entity pair (subject s and object o) with a relation r ∈ R can be treated as an SRO triple (s, r, o), and that a mention m contains k SRO triples. For convenience, given a mention m and a relation r ∈ R, we also denote the score of relation r in m as m[r].c.
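Concretely, such a preprocessed mention can be modeled as a small record. The sketch below is ours, not the paper's: the class and field names are illustrative, with m.scores[r] playing the role of the paper's m[r].c.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    """A mention preprocessed by a sentence level extractor: a subject,
    an object, and one confidence score per predefined relation
    (including NA).  Names here are illustrative, not from the paper."""
    subject: str
    object: str
    scores: dict  # relation name -> confidence c_i

    def top_one(self):
        """The top-one relation and its score, as used in Section 3.1."""
        return max(self.scores.items(), key=lambda kv: kv[1])

m = Mention("United States", "Washington D.C.",
            {"capital": 0.96, "largestCity": 0.03, "NA": 0.01})
```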
Given a set M of mentions preprocessed by a sentence level extractor, our task for relation extraction is to determine the set of correct relations for each entity pair in M , and our approach adopts an easy-first strategy to make fast and accurate decisions, by leveraging constraints and redundancies that are to be introduced below.
In the following, a mention refers to one preprocessed by a sentence level extractor.

Constraints
We consider two types of constraints: domain and uniqueness constraints, which are commonly used to enforce agreements on decisions for RE (Yao et al., 2010; de Lacalle and Lapata, 2013; Chen et al., 2014a; Koch et al., 2014; Chen et al., 2014b).
Domain constraints. This type of constraint restricts the subject and object domains of relations. Given relations r i and r j (1 ≤ i, j ≤ k), (1) an S-S domain constraint ensures that r i and r j share no common entities between their subjects, i.e., r i .subject ∩ r j .subject = ∅, (2) an O-O domain constraint ensures that r i and r j share no common entities between their objects, i.e., r i .object ∩ r j .object = ∅, and (3) an S-O domain constraint ensures that r i and r j share no common entities between the subject of r i and the object of r j , i.e., r i .subject ∩ r j .object = ∅.
For example, (1) relations largestCity and locationCity have their subjects as countries (e.g., "Australia") and organizations (e.g., "University of Sydney"), respectively. They hold an S-S domain constraint. (2) Relations locationCity and locationCountry have their objects as cities and countries, respectively. They hold an O-O domain constraint. (3) locationCity has its subjects as organizations, which share no entities with, e.g., the objects (cities) of largestCity; such a pair holds an S-O domain constraint.
Uniqueness constraints. An S (resp. O) uniqueness constraint on a relation r requires that a subject (resp. an object) be associated with at most one object (resp. subject) under r, e.g., a country has a unique capital.
We refer to the total set of relations with S-S (resp. O-O and S-O) domain constraints as DC ss (resp. DC oo and DC so ). We also refer to the total set of relations with S (resp. O) uniqueness constraints as U C s (resp. U C o ).
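As a rough sketch, the constraint sets can be represented as plain lookup tables and checked directly. The relation pairs below are illustrative examples only, not the constraints actually learned from the knowledge base.

```python
# Constraint sets in the spirit of Section 2.1 (illustrative content).
DC_ss = {("largestCity", "locationCity")}  # S-S domain constraints
UC_s = {"capital"}                         # relations with S uniqueness

def violates_ss(r_i, r_j):
    """True if r_i and r_j may not share any subject entity."""
    return (r_i, r_j) in DC_ss or (r_j, r_i) in DC_ss

def violates_s_uniqueness(r, o_i, o_j):
    """True if relation r would map one subject to two distinct objects,
    which an S uniqueness constraint forbids."""
    return r in UC_s and o_i != o_j
```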

Information Redundancies
Redundancies are used to pick up hidden information (Downey et al., 2005; Banko et al., 2007; Li et al., 2011), and are very common in corpora, as revealed by the statistics in Table 2. They are essentially the statistical characteristics (knowledge) of the results produced by local sentence level extractors, and are very important for global predictions. In this study, we introduce and leverage four classes of information redundancies: S-O, S-R, O-R and R redundancies.
S-O redundancies are introduced to aid the decision making of the top-one relations of mentions with the same subjects and objects. For a mention m = (s, o, r 1 : c 1 , . . . , r k : c k ) preprocessed by a sentence level extractor, its certainty degree ent(m) is defined based on the information entropy of its confidence scores.
The redundancy score RC(s, r 1 , o) of the top-one relation r 1 for subject s and object o in m is computed, based on its S-O redundancies, over M s,o,r 1 , the set of mentions in M whose subjects are s, objects are o, and top-one relations are r 1 , where α is a small positive number in (0, 1) (set to 0.05 by default) that enforces α^ent(m') to fall into (0, 1). Informally, RC(s, r 1 , o) is a collective score based on the S-O redundancies, which makes use of information entropy to judge the confidence, and considers both the relative confidence scores and the number of repetitions.
S-R redundancies are introduced to aid the decision on whether a subject meets the domain requirement of a relation. The likelihood score LC(s, r) for subject s and relation r is computed, based on its S-R redundancies, over M s , the set of mentions with the same subject s. Informally, LC(s, r) measures the probability that relation r has subject s among all relations except NA.
O-R redundancies are introduced to aid the decision on whether an object meets the domain requirement of a relation. The likelihood score LC(o, r) for object o and relation r is computed, based on its O-R redundancies, over M o , the set of mentions with the same object o. Informally, LC(o, r) measures the probability that relation r has object o among all relations except NA.
R redundancies are introduced to aid the decision on whether a subject and an object have a non-NA relation. The likelihood score LC(s, o) for subject s and object o is computed, based on its R redundancies, over M s,o , the set of mentions with the same subject s and object o. Informally, LC(s, o) selects prominent information from local decisions to measure the probability of having at least one non-NA relation between s and o.
As will be seen in the next section, the redundancy scores RC(s, r 1 , o) are used for making easy decisions, while the likelihood scores LC(s, r), LC(o, r) and LC(s, o) are used for hard decisions.
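Since the display equations for ent(m) and RC(s, r 1 , o) are not reproduced in this text, the sketch below instantiates them with one plausible choice; both functions are assumptions of ours, kept only term-consistent with the surrounding description (entropy-based certainty, α^ent(m') in (0, 1), and a score that grows with both confidence and repetition).

```python
import math

ALPHA = 0.05  # default value of alpha from the paper

def certainty(scores):
    """Certainty degree ent(m) of a mention's score distribution.
    ASSUMPTION: instantiated as one minus the normalized Shannon
    entropy, so a peaked (confident) distribution is close to 1 and a
    uniform one is close to 0.  The paper's exact definition differs."""
    if len(scores) <= 1:
        return 1.0
    total = sum(scores.values())
    probs = [c / total for c in scores.values() if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return 1.0 - h / math.log(len(scores))

def redundancy_score(mentions_sro):
    """RC(s, r1, o) over M_{s,o,r1}, the mentions sharing subject s,
    object o and top-one relation r1.  ASSUMPTION: the aggregate form
    1 - prod(ALPHA ** ent(m')), so the score grows both with the
    confidence of individual mentions and with how often they repeat."""
    prod = 1.0
    for scores in mentions_sro:
        prod *= ALPHA ** certainty(scores)
    return 1.0 - prod
```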

Easy First Relation Extraction
We propose a novel easy FIrst approach for Relation Extraction with information redundancies, referred to as eFIRE. As shown by the framework in Figure 1, our approach obtains S-O redundancies, and makes easy decisions with easy first collective inference in the first stage. Then it derives S-R, O-R and R redundancies, and makes hard decisions with ILP in the second stage. We next introduce our approach in detail.

Easy First Collective Inference
In the first stage, decisions must be highly accurate to avoid error propagation. As revealed by Table 2, the decisions produced by local extractors are only reliable for top-one relations. Hence, eFIRE first makes decisions for entity pairs using their top-one relations only. The confidences are the redundancy scores obtained with the S-O redundancies (Section 2.2). Once a decision is made, disagreements are resolved immediately and directly with the domain and uniqueness constraints (Section 2.1).
Confidence computing and ordering. eFIRE first computes the redundancy scores of all entity pairs with their top-one relations. As updating operations happen very often during the process of decision making, we introduce a max-heap MH to maintain all the entity pairs with their top-one relations and redundancy scores for the sake of efficiency.
Decision making and conflict resolving. eFIRE makes a decision for the entity pair (s, o) by choosing its top-one relation r 1 such that the redundancy score RC(s, r 1 , o) is the highest in MH.
Then it resolves conflicts accordingly.
(1) For any relation r ∈ R having an S-S domain constraint with r 1 , all entity pairs with subject s and top-one relation r are deleted from MH, and for entity pairs with subject s in M , their scores of relation r are set to zero. O-O and S-O domain constraints are handled similarly. (2) If relation r 1 has an S uniqueness constraint, all entity pairs with subject s and top-one relation r 1 are deleted from MH, and for entity pairs with subject s and object o' ≠ o in M , their scores of relation r 1 are set to zero. O uniqueness constraints are handled similarly.
The above process is repeated until the highest redundancy score in the max-heap MH is less than a predefined threshold ε, which ensures the correctness of decisions. For the benefit of efficiency, we also index mentions by subjects, by objects and by entity pairs, separately.
The intention of threshold ε is to distinguish easy decisions from hard ones based on S-O redundancies. This is indeed reflected in the redundancy scores. Consider an extreme case where there is only one mention m in M with subject s, object o and top-one relation r 1 whose score is 1.0, i.e., there are no S-O redundancies for mention m. In this case, we have RC(s, r 1 , o) = 1.0. However, when there are multiple mentions with the same subject s, object o and top-one relation r 1 , the redundancy score is likely less than 1.0. Hence, threshold ε is typically set to a value a little less than 1.0.
Our approach makes better use of the information redundancies in the corpus to aid the decision making process. Recall the example on determining whether the capital of "United States" is "New York" or "Washington D.C." in Section 1. With the S-O redundancies, "Washington D.C." is chosen as the capital of "United States", as RC(United States, capital, New York) = 0.18 and RC(United States, capital, Washington D.C.) = 0.57, which justifies the rationale of introducing the certainty degree ent(m) for mentions m.
Complete algorithm for easy first inference. The complete algorithm is presented in Figure 2.
It first initializes the set E of easy decisions to be empty (line 1). It then computes the redundancy scores of all entity pairs with their top-one relations in the mentions M , using S-O redundancies, and these entity pairs with their top-one relations are sorted in descending order of their redundancy scores with a max-heap MH (line 2). It repeatedly deals with the entity pairs and their top-one relations in MH one by one until the highest redundancy score in MH is no more than ε (lines 3-6). Once a decision is made (lines 4, 5), conflicts are resolved immediately by updating MH and M (line 6). Finally, the modified mentions M and the set E of easy decisions are returned (line 7).
Time and space complexity analyses. The algorithm runs in O(|M|^2 (|R| + log |M|)) time, where |M| and |R| are the numbers of mentions and predefined relations, respectively. Observe the following. For a mention, (1) it takes O(|R| + log |M|) time to compute the redundancy scores for all entity pairs with their top-one relations, (2) maintaining MH can be done in O(log |M|) time, and (3) decision making and conflict resolving take O(|M|(|R| + log |M|)) time. As there are in total |M| mentions, we have the conclusion.
It is easy to see that the algorithm takes space linear in the size of the set M of mentions.
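The first-stage loop can be sketched with Python's heapq (a min-heap, so scores are negated). Conflict resolution against the constraints is elided, and all data and names are illustrative, not the paper's implementation.

```python
import heapq

def easy_first(candidates, epsilon=0.9):
    """Easy first collective inference, heavily simplified.

    candidates maps (subject, relation, object) -> redundancy score
    RC(s, r, o).  Decisions are popped in descending score order from a
    max-heap until the best remaining score drops below the threshold
    epsilon; whatever remains is left as hard decisions for the ILP
    stage.  Conflict resolution (Section 3.1) is elided here."""
    heap = [(-rc, sro) for sro, rc in candidates.items()]
    heapq.heapify(heap)  # heapq is a min-heap, hence the negation
    easy = []
    while heap:
        neg_rc, sro = heapq.heappop(heap)
        if -neg_rc < epsilon:
            break  # everything left is a hard decision
        easy.append(sro)
        # ... here: delete / zero out candidates conflicting with sro
        # under the domain and uniqueness constraints ...
    return easy
```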

Integer Linear Programming
In the second stage, our goal is to find an optimal configuration for the remaining mentions, making use of the S-R, O-R and R redundancies, resolving disagreements with the domain and uniqueness constraints, and maximizing the overall confidence of the decisions made. This is an NP-hard optimization problem, and many optimization models can be used to obtain approximate solutions. The tricky part is the design of the objective function. Here, we adopt the ILP tool IBM ILOG CPLEX.
More specifically, for each mention m and each of its candidate relations r in the set M of remaining mentions returned by the first stage, we define a binary decision variable v_m^r indicating whether relation r is chosen by the solver for the entity pair (s, o) in m. For each mention m in M , we choose its top three relations with scores no less than 0.1 as the candidate relations. As revealed by Table 2, candidates beyond the top-3 are very unreliable.
Our objective is to maximize the total confidence of all the selected candidates based on the S-R, O-R and R redundancies. Let s and o be the subject and object in m, R_m the set of candidate relations for s and o in m, and M_m the set of mentions in M having the same subject and object as m. The objective function has two components: the first is the sum of the S-R, O-R and R redundancies of the selected candidates, and the second is the sum of the original confidence scores of the selected candidates. The former is designed to encourage the model to select candidates meeting the domain requirements of relations, and the latter gives consideration to the decisions produced by sentence level extractors. That is, although the sentence level extractors may make wrong decisions, the global statistics of their decisions are reliable and should be preserved. The domain and uniqueness constraints are encoded to avoid conflicting decisions as follows.
For each pair of variables v_{m_i}^{r_{m_i}} and v_{m_j}^{r_{m_j}} whose relations conflict under a domain constraint, at most one of the two may be selected, i.e., v_{m_i}^{r_{m_i}} + v_{m_j}^{r_{m_j}} ≤ 1. This applies (1) when m_i and m_j have the same subject and r_{m_i} and r_{m_j} have an S-S domain constraint in DC_ss, (2) when m_i and m_j have the same object and r_{m_i} and r_{m_j} have an O-O domain constraint in DC_oo, and (3) when the subject of m_i is the object of m_j and r_{m_i} and r_{m_j} have an S-O domain constraint in DC_so, where r_{m_i} ∈ R_{m_i} and r_{m_j} ∈ R_{m_j} in all cases.
An S uniqueness constraint on relation r is encoded as Σ_{m ∈ M_{r,s}} v_m^r ≤ 1, where M_{r,s} is the set of mentions with candidate relation r, subject s and pairwise distinct objects, and r has an S uniqueness constraint in UC_s. Similarly, an O uniqueness constraint on r is encoded as Σ_{m ∈ M_{r,o}} v_m^r ≤ 1, where M_{r,o} is the set of mentions with candidate relation r, object o and pairwise distinct subjects, and r has an O uniqueness constraint in UC_o.
By adopting ILP, eFIRE combines the scores refined in the first stage with the constraints to make hard decisions. After the optimization problem is solved, together with the easy decisions obtained in the first stage, eFIRE finally produces a list of selected candidate relations for each entity pair.
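To make the encoding concrete, the sketch below enumerates assignments by brute force in place of a real ILP solver such as CPLEX. The data, the simplified objective (local confidences only, without the S-R, O-R and R redundancy terms), and all names are illustrative assumptions.

```python
from itertools import product

# Hypothetical hard mentions left for stage two:
# mention id -> (subject, object, {candidate relation: confidence}).
mentions = {
    "m1": ("United States", "New York",
           {"largestCity": 0.55, "capital": 0.30}),
    "m2": ("United States", "Washington D.C.",
           {"capital": 0.60, "largestCity": 0.25}),
}
UC_s = {"capital"}  # capital admits at most one object per subject

variables = [(mid, r) for mid, (_, _, cands) in mentions.items()
             for r in cands]

def feasible(assign):
    """S uniqueness: for r in UC_s, at most one selected variable per
    subject (the ILP constraint  sum_{m in M_{r,s}} v_m^r <= 1)."""
    count = {}
    for (mid, r), bit in assign.items():
        if bit and r in UC_s:
            s = mentions[mid][0]
            count[s, r] = count.get((s, r), 0) + 1
    return all(n <= 1 for n in count.values())

best, best_score = None, float("-inf")
for bits in product([0, 1], repeat=len(variables)):
    assign = dict(zip(variables, bits))
    if not feasible(assign):
        continue
    # Simplified objective: sum of local confidences of selections.
    score = sum(mentions[mid][2][r]
                for (mid, r), bit in assign.items() if bit)
    if score > best_score:
        best, best_score = assign, score

chosen = {vr for vr, bit in best.items() if bit}
```

Note how the uniqueness constraint forces the solver to keep only one capital for "United States", preferring the higher-scoring Washington D.C. candidate while still selecting non-conflicting relations elsewhere.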

Experimental Study
In this section, we present an extensive experimental study of our easy first approach eFIRE.

Experimental Settings
We first present our experimental settings.
Datasets. The two datasets, DB_me and DB_nn, stem from DBpedia (Bizer et al., 2009), mapping the triples in DBpedia to sentences in the New York Times corpus. We map 51 different relations to the dataset. We use a maximum entropy model MaxEnt (Chen et al., 2014a) and a neural network model NN (Lin et al., 2016) as the sentence level extractors to output confidence scores, yielding DB_me and DB_nn, respectively. The features of these two datasets are reported in Table 2. There are 53162 mentions in each dataset, including 38654 mentions with NA relations. We learn domain and uniqueness constraints from the Freebase knowledge base (KB).
Algorithms for comparison. To evaluate our approach, we compare with three methods: the ILP based global method for RE in (Chen et al., 2014a), as the baseline, which uses global clues to help resolve disagreements, and CNN+ATT and PCNN+ATT in (Lin et al., 2016), which are neural network-based methods with an attention mechanism that exploits all informative sentences.
Implementation. All algorithms were implemented in C++. All experiments were run on a PC with 2 Intel(R) Xeon(R) E5-2640 2.6GHz CPUs and 64 GB of memory. When running time was measured, each test was repeated 5 times and the average is reported.

Experimental Results
We next present our findings on the effectiveness and efficiency of our easy first approach eFIRE. Following previous work, we use the precision in the low-recall portion of the P-R curve as the effectiveness criterion.
Exp-1: Overall performance. In the first set of tests, we evaluated the effectiveness and efficiency; the results are reported in Figures 3(a), 3(b) and Table 3, respectively.
Our approach eFIRE outperforms the methods under comparison. eFIRE is on average (4.80%, 4.36%), (17.99%, 28.10%) and (7.69%, 17.82%) more accurate than baseline, CNN+ATT and PCNN+ATT on (DB_me, DB_nn) in the low-recall portion [0, 0.25] of the P-R curves, respectively. eFIRE consistently outperforms CNN+ATT and PCNN+ATT over the entire range of recall. While baseline tends to produce results with a higher recall, it suffers from low precision, which is alleviated by eFIRE, as it obtains more correct decisions. It is difficult to guarantee high precision and recall at the same time, and for most KB population applications, only the high-precision part is considered in the effectiveness evaluation. It is worth pointing out that we only compare the testing time of CNN+ATT and PCNN+ATT here.
Our method eFIRE also reduces the running time on all datasets. eFIRE is on average (28, 14), (37, 63) and (18, 31) times faster than baseline, CNN+ATT and PCNN+ATT on (DB_me, DB_nn), respectively. This is because the easy-first strategy of eFIRE significantly reduces the number of variables and constraints encoded in the ILP solver, as shown in Table 3.
Note that the running time has no obvious linear relationship with the number of variables and constraints across different datasets. In addition to the number of variables and constraints, the objective function has an impact on the running time of ILP too. Further, CPLEX is used as a black box, which makes a precise analysis hard.
These results tell us that the easy-first strategy for RE, by making use of the redundancy information embedded in the local results of sentence level extractors, is an effective and efficient complement for RE using ILP solvers.
Exp-2: Performance of easy first collective inference with S-O redundancies. In the second set of tests, in order to evaluate the impact of S-O redundancies, we implemented a variant of our approach, referred to as eFIRE-1S, that makes easy decisions with the easy first collective inference, and then adopts the same ILP method as baseline for the remaining decisions. The results are reported in Figures 3(c), 3(d) and Table 3. Method eFIRE-1S outperforms baseline in the low-recall portions of the P-R curves on both datasets. These results tell us that the easy first collective inference using S-O redundancies not only improves the effectiveness of decision making for RE, but also improves the efficiency, as it significantly reduces the constraints and variables encoded in the ILP solver.
Exp-3: Performance of ILP with S-R, O-R and R redundancies. In the third set of tests, to evaluate the impacts of the S-R, O-R and R redundancies, we implemented a variant of eFIRE, referred to as eFIRE-2S, that only consists of the second stage of eFIRE. That is, all decisions of eFIRE-2S are made by the ILP solver. The results are reported in Figures 3(c), 3(d) and Table 3. Method eFIRE-2S outperforms baseline on both datasets. It essentially improves the precision without sacrificing the recall. For ILP based methods, the key differences lie in their objective functions. eFIRE-2S incorporates more reliable statistics of the sentence level extractors, i.e., the S-R, O-R and R redundancies, while baseline only uses maximal scores to encourage choosing the candidates with higher individual sentence level confidence scores. Hence, the S-R, O-R and R redundancies benefit the decision making of ILP solvers. The efficiency of eFIRE-2S and baseline is comparable, which implies that the efficiency benefit of eFIRE comes from its first-stage easy first collective inference.
The results also show that our approach eFIRE is effective and efficient when ε falls into [0.5, 0.9], within which eFIRE outperforms baseline in the low-recall portion.
This justifies our setting for threshold ε, which is typically a little less than 1.0 to distinguish easy decisions from hard ones (Section 3.1). Threshold ε obviously has an impact on the running time: the smaller ε is, the more time eFIRE spends in the first stage and the less in the second stage.
Summary. From these tests, we find the following.
(1) Our approach eFIRE outperforms both ILP and neural network-based methods in terms of accuracy and efficiency. (2) The use of the easy-first strategy and S-O redundancies both improves the accuracy of RE and reduces the running time, and the use of the S-R, O-R and R redundancies improves the accuracy of RE. (3) The setting of threshold ε is flexible in the range [0.5, 0.9] for our approach eFIRE.

Related Work
Relation extraction has been studied extensively in recent years. Approaches can be divided into local relation extractors (Mintz et al., 2009; Surdeanu et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Søgaard et al., 2015), which use the lexical, syntactic and other local features of sentences, and global relation extractors, which utilize corpus features and the relationships among local decisions (Kate and Mooney, 2010; Yao et al., 2010; Li et al., 2011; Singh et al., 2013; Li and Ji, 2014; Nguyen et al., 2017; Su et al., 2017). Global relation extractors leverage more information to resolve conflicting decisions, which typically leads to more accurate decisions than local ones.
Recently, neural network-based models (Zeng et al., 2014; Ji et al., 2017) have been proposed for RE. Lin et al. (2016) propose an attention mechanism that calculates weights for all sentences of one entity pair and selects plausible sentences from noisy ones. Different from these supervised methods, which need labeled data for training, the method we propose in this study is unsupervised.
Disagreements among decisions can be resolved by constraints. Yao et al. (2010) propose a relation extraction model that captures selectional preferences and functionality constraints to integrate information across documents. Fader et al. (2011) implement syntactic and lexical constraints on binary relations expressed by verbs in Open IE systems. Koch et al. (2014) impose type (or domain) constraints to only allow relations over appropriately typed mentions for relation extraction. Similar to (Chen et al., 2014a), our approach utilizes both domain and uniqueness constraints to resolve disagreements.
Many global relation extractors use integer linear programming as a blackbox solver (Roth and Yih, 2004; Choi et al., 2006; Roth and Yih, 2007; Li et al., 2011; Chen et al., 2014a,b; Wang et al., 2015). The ILP solver derives decisions through a well designed objective function, and resolves conflicting decisions by encoding constraints into ILP. Our easy first approach and these methods are complementary to each other: these methods can take the easy first collective inference of our approach as a first step for making easy decisions, and our approach can use any of these methods as its solution for making hard decisions in its second stage.
Redundancies in the corpus have been used to pick up hidden information. Downey et al. (2005) consider redundant extractions when judging the correctness of extractions. Li et al. (2011) take advantage of redundant information to conduct reasoning across documents based on the information network structure. We introduce four classes of information redundancies: S-O, S-R, O-R and R redundancies, and we leverage S-O redundancies for making easy decisions and the others to aid hard decision making.
The easy-first strategy relies on the intuition that "easy decisions should be made early, while harder decisions should be left for later when more information is available" (Stoyanov and Eisner, 2012). Their method makes easy decisions first for coreference resolution, and further makes use of the information from coreference clusters in the form of features to make later decisions. In this study, we propose to make easy (most confident) decisions first for relation extraction, and then to make hard decisions with ILP, where easy decisions are distinguished from hard ones with redundancy scores based on S-O redundancies.
Data dependencies have been well studied for improving data quality (Ma, 2011; Ma et al., 2015; Fan and Geerts, 2012), and essentially make use of data redundancies and dependencies. Our approach is partially inspired by these studies. Indeed, the uniqueness constraints defined in this study can be treated as functional dependencies (Abiteboul et al., 1995).

Conclusions
In this paper, we have proposed a fast easy first approach for relation extraction that makes use of the information redundancies embedded in the results produced by local sentence level extractors, under domain and uniqueness constraints. We have introduced four classes of information redundancies to aid both easy first collective inference for easy decisions in the first stage and ILP for hard decisions in the second stage. Finally, using datasets processed by sentence level extractors with different models, we have experimentally verified that our easy first approach is both effective and efficient compared with state-of-the-art ILP and neural network-based methods.