Probabilistic Inference for Cold Start Knowledge Base Population with Prior World Knowledge

Building knowledge bases (KBs) automatically from text corpora is crucial for many applications such as question answering and web search. The problem is very challenging and has been divided into sub-problems such as mention and named entity recognition, entity linking, and relation extraction. However, combining these components has been shown to be under-constrained and often produces KBs with oversized entities and common-sense errors in relations (e.g., a person with multiple birthdates). These errors are difficult to resolve solely with IE tools but become obvious with world knowledge at the corpus level. By analyzing Freebase and a large text collection, we found that per-relation cardinality and entity popularity both follow a power-law distribution with flat long tails of low-frequency instances. We present a probabilistic joint inference algorithm that incorporates this world knowledge during KB construction. Our approach yields state-of-the-art performance on the TAC Cold Start task, with 42% and 19.4% relative improvements in F1 over our baseline on Cold Start hop-1 and all-hop queries, respectively.


Introduction
Automatically transforming a large corpus into a structured knowledge base (KB) has long been a goal of information extraction (IE) research. KB population incorporates many IE tasks including named entity recognition, entity linking, and relation extraction, each of which relies on deeper linguistic analysis, e.g., syntactic parsing or anaphora resolution. Since 2012, NIST has run an open shared task in KB population (KBP) under TAC. Most participating systems (Min et al., 2015; Angeli et al., 2014; Nguyen et al., 2014; Monahan and Carpenter, 2012) combine many independent components to perform the full task.
As will be familiar to most IE researchers, the individual components are not perfect. When combined into a pipeline, errors compound. We found that a KB produced with a simple combination of state-of-the-art IE components (Ramshaw et al., 2011) is very sensitive to component-level errors (Grishman, 2013). Table 1 illustrates a real entity coreference mistake: Barack Obama and Ehud Barak were incorrectly linked because of ambiguous context and high lexical overlap. The mistake leads to erroneous facts about employment, familial relations, etc. We see additional mistakes when we review the names in the entities with the most mentions: the U.S. entity contains more than 20,000 mentions. 85% are correct (e.g., United States, U.S.), but there is a long tail of incorrect yet infrequent mentions (each accounting for < 1%) linked to the entity, e.g., North America, Latin American. We also see counter-intuitive errors in relation extraction: 5% of person entities have multiple birthdates, and the KB asserts 8 spouses for an infrequently mentioned entity. Similar errors have been reported by McNamee et al. (2013) and Singh et al. (2013b). Analyzing these errors suggests a limitation of performing KB population solely with IE tools. These mistakes only become obvious when external world knowledge is applied to the full set of facts extracted from many documents, e.g., when applying our expectations about the cardinality of a relation. With just a single document, resolving these mistakes requires challenging inference (Ji et al., 2005).
In this paper, we present a probabilistic framework to incorporate real-world knowledge into Cold Start KB population. Our contributions include:
• Identifying from real world datasets that entity popularity and each relation's cardinality follow the power-law distribution with long tails of low-frequency instances.
• Defining a corpus-level joint objective for KBP that incorporates multiple IE components and prior world knowledge on entity popularity and per-relation cardinalities, and showing the prior knowledge helps to reduce errors.
• Outperforming the top-ranked entry in Cold Start 2015.
The paper is organized as follows: we first introduce the Cold Start KBP task, then present the joint probabilistic framework, followed by analysis of the world knowledge and how to incorporate it. We then describe our inference algorithm. Lastly, we present experimental results, related work, and conclude with suggestions for future research.

Cold Start KBP
The schema consists of 3 entity types (person, organization, and GPE) and 42 slots (relation classes; we use slot and relation interchangeably in this paper). Systems start with an empty KB (cold start) and populate it according to the schema with information extracted from the corpus. All facts in the KB must be grounded with justifying text from the corpus.
A KB entity is defined as a cluster of mentions that refer to the same real-world entity, e.g., Smith, John Smith, and John H Smith are three mentions of the entity John H Smith. Every named mention of an entity is recorded. A relation is a triplet (subject, slot, object), where subject and object are entities and slot is the relation between them. For example, (Bart Simpson, per:siblings, Lisa Simpson) is the relation "Bart Simpson is a sibling of Lisa Simpson". A relation's provenance points to up to 4 snippets in the corpus that justify the relation. The evaluation process is described in the Experiments Section.
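The definitions above can be illustrated with a small data model. This is only a sketch: the class and field names (`Mention`, `Entity`, `Relation`, `doc_id`, character offsets) are assumptions made for illustration and are not the TAC submission format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Mention:
    """A named mention grounded in the corpus (field names are hypothetical)."""
    doc_id: str
    start: int
    end: int
    text: str


@dataclass
class Entity:
    """A KB entity: a cluster of mentions of one real-world entity."""
    entity_id: int
    entity_type: str  # "PER", "ORG", or "GPE"
    mentions: List[Mention] = field(default_factory=list)


@dataclass
class Relation:
    """A (subject, slot, object) triple with its justifying snippets."""
    subject: int  # entity_id of the subject
    slot: str     # e.g. "per:siblings"
    obj: int      # entity_id of the object
    provenance: List[Tuple[str, int, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The task allows a relation to point to up to 4 snippets.
        if len(self.provenance) > 4:
            raise ValueError("a relation may cite at most 4 snippets")
```

Each provenance entry is assumed here to be a (document, start offset, end offset) tuple.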

A Probabilistic Framework
Following most Cold Start KBP systems (Min et al., 2015; Angeli et al., 2014; Nguyen et al., 2014; Monahan and Carpenter, 2012), our baseline uses a cascade of NLP components from document-level analysis to corpus-level aggregation. We run BBN's SERIF (Ramshaw et al., 2011) for mention, value and name tagging, coreference resolution, and sentence-level relation extraction, alongside other analyses such as syntactic parsing. We then aggregate entities with entity discovery and linking, and relations with relation extraction.
Given a set of pre-trained NLP components, the process is essentially an inference task. We introduce the following notation:
• $M$ is the list of mentions.
• $E$ is the set of entities to populate the KB.
• $R$ is the set of relation types.
• $x_i$ is the observed text for mention $i$, $x_i \in M$.
• $u_i$ is the entity ID from the KB assigned to mention $i$, $i \in \{1, 2, ..., |M|\}$, $u_i \in \{1, 2, ..., |E|\}$.
• $z_{i,j} = r$ indicates the relation $r$ between a pair of mentions $x_i, x_j \in M$, with $r \in R \cup \{Other\}$.
• $y^r_{i,j}$ is an indicator variable: $y^r_{i,j} = 1$ if a relation $r \in R$ exists between the entity pair $\langle e_i, e_j \rangle$, $i, j \in \{1, 2, ..., |E|\}$, and 0 otherwise.
The key steps in the pipeline are the following:
Mention extraction: We use a structured perceptron model (Ramshaw et al., 2011) to extract named mentions $M$.
Entity Discovery and Linking (EDL): We define potential functions $\Psi^{EDL}_i(u_i)$ over the variables $\{u_i\}$ for each pair of mention $x_i$ and the $j$-th entity $e_j$, where $\{\theta_k\}$ and $\{\phi_k\}$ are the weights and feature functions respectively. (Decisions made for named mentions are applied to the corresponding document-level entities.) The baseline system solves the EDL problem by inferring $u^*_i = \arg\max_{u_i} \Psi^{EDL}_i(u_i)$. It uses a name database collected from Freebase (Bollacker et al., 2008) and GeoNames (www.geonames.org). First, it clusters novel names to create new candidate entities in addition to the entries in the name database. A novel name is defined as a name that could not be resolved to a database entry. It then rescans the corpus and links each document-level entity to a corpus-level entity (an entry in the name database or a novel name). The EDL model $\Psi^{EDL}$ uses features such as string edit distance and indicators of whether two names appear in the same name-variant set.
Mention-level Relation Extraction (MRE): This step infers which relation $z_{i,j} = r$ ($r \in R \cup \{Other\}$) exists between each pair of mentions $\langle x_i, x_j \rangle$, using potential functions $\Psi^{MRE}$ defined over these variables. We run several relation finding algorithms (Min et al., 2015), including statistical models trained from ACE relation annotation and from distant supervision (Mintz et al., 2009), in which we align Freebase pairs to Gigaword (Parker et al., 2011) to generate training examples, and a pattern matcher with a few hand-written patterns that capture local contexts.
To train a log-linear model $\Psi^{MRE}$ for combining these algorithms and to tune the confidences of their extractions, we follow Viswanathan et al. (2015) and train a stacked classifier using the outputs and confidences of the extractors. We use assessment datasets from TAC Cold Start KBP 2013 and 2014, and from the Slot Filling evaluations in 2013 and 2014. The features are: source algorithm name, slot, confidence score (if available), argument mention level (pronoun, name, or nominal), the lexical sequence between the pair of arguments, and the propositional path between the pair of arguments. $\{\theta_k\}$ and $\{\phi_k\}$ are the weights and feature functions respectively.
Relation Extraction (RE): This aggregation step infers which relations exist between each pair of entities at the KB level, i.e., it assigns values to $\{y^r_{i,j}\}$. We define potential functions $\Psi^{RE}$ over the indicator variables $\{y^r_{i,j}\}$ by considering all pairs of mentions $x_m, x_n \in M$, their potential to hold relation $r$, and their likelihood of being associated with entities $e_i, e_j \in E$; the potential includes a parameter learned from previously seen data. The aggregation from a set of $z_{m,n}$ to a set of $y_{i,j}$ is similar to noisy-or relation aggregation (Hoffmann et al., 2011; Surdeanu et al., 2012) and supports overlapping relations (Hoffmann et al., 2011; Surdeanu et al., 2012).
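To make the aggregation idea concrete, the following sketch shows plain noisy-or aggregation of mention-level relation scores into entity-level scores. It is not the RE potential used in this work, whose exact form is not reproduced here. It assumes that each mention-level extraction comes with a probability in [0, 1] and that the entity assignment u is already fixed; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Mapping, Tuple


def noisy_or(probs: Iterable[float]) -> float:
    """Probability that at least one extraction is correct,
    treating the extractions as independent."""
    p_none = 1.0
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        p_none *= 1.0 - p
    return 1.0 - p_none


def aggregate(
    mention_relations: Iterable[Tuple[Hashable, Hashable, str, float]],
    u: Mapping[Hashable, int],
) -> Dict[Tuple[int, str, int], float]:
    """Map mention-level tuples (m, n, r, prob) to entity-level scores
    keyed by (u[m], r, u[n]). Pairs labelled "Other" are skipped."""
    grouped = defaultdict(list)
    for m, n, r, p in mention_relations:
        if r == "Other":
            continue
        grouped[(u[m], r, u[n])].append(p)
    return {key: noisy_or(ps) for key, ps in grouped.items()}
```

Because scores are grouped by relation type as well as by entity pair, the same entity pair can hold several relations at once, which is the sense in which such aggregation supports overlapping relations.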
The joint distribution $\Pr(u, y \mid x)$ is defined over the full set of variables $u, y$ from the potentials above. The Cold Start KBP problem can be seen as finding the maximum a posteriori (MAP) configuration:
$(u^*, y^*) = \arg\max_{(u,y)} \Pr(u, y \mid x)$
The baseline system approximates the solution in two separate stages: it first solves EDL to fix $u^*$ and then infers $y^*$ given $u^*$. In this relaxed form, the optimization required for estimating the potential of $y^{*r}_{i,j}$ searches only over the pairs of mentions $x_m, x_n$ that are now associated with the entities $e_i, e_j$, i.e., $u_m = i$ and $u_n = j$, instead of enumerating all pairs of mentions. This can be done efficiently.

Incorporating World Knowledge
We incorporate world knowledge about entity popularity and a set of per-relation cardinalities as additional factors in the objective. To learn the form of these factors, we analyze real-world datasets and find that both follow a power-law distribution with long tails of low-frequency instances.

Entity Popularity
We define entity popularity (EP) as the number of mentions of an entity in a corpus. Entities vary in popularity: famous people (e.g., politicians, athletes), countries, and large organizations are mentioned frequently in news, while other entities (a small city, the local valedictorian) may only be mentioned a few times. Ideally, we would model the EP distribution with counts from a large corpus annotated for names and cross-document entity coreference. As we are not aware of any such resource, we look at two approximations:
Name variants (NV): We collect name variants (e.g., UN and United Nations) for PER and ORG from Freebase and for GPE from GeoNames.
Name Mentions (NM): From a 50K-document sample of Gigaword, for each entity, we search for exact matches to its name variants and count these matches to estimate the number of entity mentions. Figure 2 (left) plots the per-entity NM and NV counts against rank. Both follow a power-law distribution (i.e., the plots are close to straight lines on a log-log scale). In other words, most entities have only a small number of variants and are mentioned only a few times, while a handful of popular entities are mentioned frequently and have many variants. The size of entities in the Kripke system also follows a power-law distribution, further supporting our findings.
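The "straight line on a log-log scale" check can be sketched as a least-squares fit of ln(count) against ln(rank). This is an illustrative diagnostic under simple assumptions, not the analysis procedure used for Figure 2; `loglog_fit` is a hypothetical helper, and a high R² on a rank plot is only suggestive of power-law behavior.

```python
import math
from typing import Sequence, Tuple


def loglog_fit(counts: Sequence[float]) -> Tuple[float, float]:
    """Fit a line to (ln rank, ln count) after sorting counts in
    descending order. Returns (slope, r_squared)."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive counts")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    # If all counts are equal the line is flat and fits exactly.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared
```

A slope near a negative constant with R² close to 1 corresponds to the near-linear log-log plots described above.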
Formally, for the $i$-th entity we define the popularity variable $q_i(u) = \sum_m \mathbb{I}(u_m = i)$ and the potential for the EP factor $\Psi^{EP}_i(u)$ in terms of $\theta^{EP}(q_i(u)) = \alpha \ln q_i(u)$.
The parameter $\alpha > 0$ is initially fit from Freebase entities and then fine-tuned with the TAC 2014 dataset to reflect real-world distributions. The EP term favors EDL solutions $u$ whose popularity distribution follows a power law with a long tail of low-frequency entities.
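As a small illustration of the quantities just defined, the sketch below computes the popularity $q_i(u)$ from a mention-to-entity assignment and evaluates $\theta^{EP}(q) = \alpha \ln q$. Only these two quantities are taken from the text; how $\theta^{EP}$ enters the potential $\Psi^{EP}_i$ is not shown, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Mapping


def popularity(u: Mapping[Hashable, int]) -> Counter:
    """q_i(u): the number of mentions m with u[m] == i, for every entity i."""
    return Counter(u.values())


def theta_ep(q: int, alpha: float) -> float:
    """theta^EP(q) = alpha * ln(q), defined for q >= 1."""
    if q < 1:
        raise ValueError("an entity must have at least one mention")
    return alpha * math.log(q)
```

For example, with `u = {"m1": 1, "m2": 1, "m3": 2}`, entity 1 has popularity 2 and entity 2 has popularity 1, and `theta_ep(1, alpha)` is 0 for any `alpha`.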

Relation Cardinality
We define the cardinality of relation $r$ regarding $e_i$ as the number of entities or values associated with $e_i$ through $r$. For example, if John Smith has 3 children, the cardinality of $r$ = per:children regarding $e_i$ = John Smith is 3. Formally, we denote the set of variables $\{y^r_{ij}\}$ as $y^r_i$, and the cardinality of a relation $r$ of entity $e_i$ as $d^r_i = \sum_j y^r_{ij}$. Per-relation cardinalities (RC) often reflect real-world constraints: people have at most one birthdate and typically no more than 5 siblings. To understand the cardinality constraints for the Cold Start relations, we use Freebase, a large, manually curated KB. We align Freebase to the TAC schema following Chen et al. (2010) and then compute the cardinality of each relation for all entities. The relationship between RC and RC rank for the 6 most frequent relations is plotted in Figure 2 (right). For these relations, RC closely resembles a power-law distribution. To favor both a power law and a soft size limit on cardinality, we define the potentials of the RC factors for each relation-entity pair in terms of $\theta^{RC}(d^r_i) = \beta \ln(d^r_i) - \gamma (\max(d^r_i - \mu_r, 0))^2$, with parameters $\beta, \gamma > 0$ and $\mu_r$ the mean cardinality of relation $r$ (estimated from Freebase). The first term encodes the power-law assumption, while the second term penalizes cardinalities that exceed the mean $\mu_r$.
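The cardinality $d^r_i$ and the term $\theta^{RC}$ can be computed directly from a set of triples. The sketch below implements only the formula stated above; it makes no claim about how $\theta^{RC}$ is combined into the RC potential, and the helper names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple

Triple = Tuple[Hashable, str, Hashable]  # (subject, slot, object)


def cardinalities(triples: Iterable[Triple]) -> Counter:
    """d^r_i: number of distinct fillers per (subject, slot) pair."""
    return Counter((s, r) for s, r, _ in set(triples))


def theta_rc(d: int, beta: float, gamma: float, mu_r: float) -> float:
    """theta^RC(d) = beta * ln(d) - gamma * max(d - mu_r, 0)^2, for d >= 1."""
    if d < 1:
        raise ValueError("cardinality must be at least 1")
    excess = max(d - mu_r, 0.0)
    return beta * math.log(d) - gamma * excess ** 2
```

The quadratic term is zero for cardinalities at or below the mean and grows quickly beyond it, which is what makes the size limit soft rather than a hard cutoff.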

Incorporating Prior World Knowledge
Incorporating the EP and per-relation RC terms into the joint distribution, we obtain the joint objective $\Pr^*(u, y \mid x)$, which combines the baseline objective $\Pr(u, y \mid x)$ with the EP and RC factors. A simplified plate diagram is shown in Figure 1.
Learning constraints for real-world corpora: As we are not aware of any large corpus annotated exhaustively with entities and relations, we fit the parameters of the constraints initially from Freebase entities and relations, and then fine-tune them using empirical utility maximization (Jansche, 2005; Ye et al., 2012) for TAC Cold Start all-hop F1 with grid search in the parameter space, using previous years' TAC assessments. Freebase is used for initialization because of its scale, while fine-tuning with TAC assessments ensures that the factors more appropriately represent the underlying distributions of entity popularity and relation cardinality in a real-world corpus.
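The grid-search step can be sketched as an exhaustive search over a parameter grid that keeps the setting with the highest utility. The scoring callable is an assumption: in the setting described above it would build a KB with the given parameters and return all-hop F1 on held-out TAC assessments, which is not reproduced here.

```python
import itertools
from typing import Callable, Dict, Mapping, Sequence, Tuple


def grid_search(
    score_fn: Callable[..., float],
    grid: Mapping[str, Sequence[float]],
) -> Tuple[Dict[str, float], float]:
    """Evaluate score_fn on every combination in the grid and return
    (best_params, best_score). score_fn takes the parameters as keywords."""
    names = sorted(grid)
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call would look like `grid_search(evaluate_all_hop_f1, {"alpha": [...], "beta": [...], "gamma": [...]})`, where `evaluate_all_hop_f1` is a hypothetical stand-in for the evaluation pipeline.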

Jointly Inferring Entities and Relations
The problem of Cold Start KBP becomes finding a MAP assignment of $u$ and $y$ for $\Pr^*(u, y \mid x)$. Finding the exact solution is hard, as many terms in the objective involve large groups of variables. We propose Algorithm 1 as an approximate heuristic. Line 1 generates an initial KB by approximating a solution for the baseline objective $\Pr(u, y \mid x)$ (Section 3); this KB tends to over-link entities and over-aggregate relations. Lines 2-8 iteratively refine the KB by searching over the $(u, y)$ space using an operation $o \in \{SplitE, PruneR\}$. At the $t$-th iteration, the algorithm performs the operation $o$ with the highest potential gain $\Delta \ln \Pr^*(o(u^t, y^t) \mid x)$. The process is repeated until the gain falls below a very small threshold.
Algorithm 1: The MAP inference algorithm
Input: $x$, $\alpha$, $\beta$, $\gamma$, $\mu_r$ for $r \in R$
Output: $u$, $y$
1: $(u^0, y^0) = \arg\max_{(u,y)} \Pr(u, y \mid x)$

SplitE: splits an entity $e_i$ into two entities. Since there is an exponential number of possible SplitE actions, we use the following two heuristics: 1) cluster name mentions by their string forms and find an "outlier" cluster of mentions; 2) rank $e_i$'s mentions $\{x_m : u^*_m = i\}$ by their local EDL potential $\Psi^{EDL}_m(u^*_m \mid x_m)$ and take the lowest-ranked mention as an "outlier". After finding an "outlier" mention cluster of $e_i$, we divide $e_i$ into two entities: $e_g$ with the "core" mentions and $e_h$ with the "outlier" mention cluster. We repeat the process to find all outlier entities and separate them from the entity. Relation arguments are reattached to the new entities accordingly. We only consider a short list of the most popular entities and split each using the heuristics described above.
PruneR: removes a batch of relations $Y = \{y^r_{i,j}\}$ by setting $y^r_{i,j} \leftarrow 0$ for each $y^r_{i,j} \in Y$. The batch is generated as follows: first select a set of entity-relation pairs $(e_i, r)$ with the highest cardinality $d^r_i = \sum_j y^r_{i,j}$, then repeatedly select the associated relation with the lowest potential, $j^* = \arg\min_{j : y^r_{i,j} = 1} \Psi^{RE}(y^r_{i,j} \mid x)$. Relations are added to the batch until its size reaches 50.
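One possible reading of the PruneR batch construction is sketched below. Several details are assumptions made for illustration and are not specified in the text: the number of high-cardinality pairs considered (`num_pairs`), pooling the fillers of all selected pairs before ranking them by potential, and the input format (a mapping from asserted triples to their RE potential).

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Triple = Tuple[Hashable, str, Hashable]  # (subject i, relation r, object j)


def prune_batch(
    potentials: Dict[Triple, float],
    num_pairs: int = 10,   # assumed size of the high-cardinality short list
    batch_size: int = 50,  # batch size stated in the text
) -> List[Triple]:
    """Select up to batch_size asserted relations to set to 0."""
    fillers = defaultdict(list)
    for (i, r, j), psi in potentials.items():
        fillers[(i, r)].append((psi, j))
    # Entity-relation pairs with the highest cardinality d^r_i come first.
    selected = sorted(fillers, key=lambda pair: len(fillers[pair]), reverse=True)
    selected = selected[:num_pairs]
    # Pool their fillers and take those with the lowest RE potential.
    candidates = [(psi, i, r, j) for (i, r) in selected for psi, j in fillers[(i, r)]]
    candidates.sort(key=lambda c: c[0])
    return [(i, r, j) for _, i, r, j in candidates[:batch_size]]
```

Whether a batch is actually applied would still be decided by its gain under the joint objective, as described in the surrounding text.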
We define the gain for SplitE and PruneR as the change in the log of the joint objective, $\Delta \ln \Pr^*(o(u, y) \mid x)$, where $m$ ranges over the IDs of mentions in $e_i$, and $\Psi^{RE}_i$ is the sum of the RE factors related to entity $e_i$. $y'$, $u'$ are the assignments to $y$, $u$ after a SplitE or a PruneR operation is executed. We also use the short form $\Psi^{RC}_r(y^r)$ for the sum of the RC factors that have changed because of a SplitE or a PruneR operation.
Since the gain is only computed for the shortlisted entities and relations, and we only calculate the subset of factors (EDL, RE, RC, and EP) related to the operation, $\Delta \ln \Pr^*(SplitE \mid x)$ and $\Delta \ln \Pr^*(PruneR \mid x)$ can be calculated efficiently.
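The iterative refinement described in this section has the shape of a greedy hill-climbing loop. The sketch below is a generic version under stated assumptions: it does not reproduce lines 2-8 of Algorithm 1, and `propose_ops`, `gain`, and `apply_op` are hypothetical callables standing in for candidate generation (SplitE and PruneR), the gain in log-objective, and the KB update.

```python
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Op = TypeVar("Op")


def greedy_refine(
    state: State,
    propose_ops: Callable[[State], Iterable[Op]],
    gain: Callable[[State, Op], float],
    apply_op: Callable[[State, Op], State],
    eps: float = 1e-6,
    max_iters: int = 1000,
) -> State:
    """Repeatedly apply the candidate operation with the highest gain,
    stopping when the best gain falls below eps."""
    for _ in range(max_iters):
        scored = [(gain(state, op), op) for op in propose_ops(state)]
        if not scored:
            break
        best_gain, best_op = max(scored, key=lambda pair: pair[0])
        if best_gain < eps:
            break
        state = apply_op(state, best_op)
    return state
```

The `max_iters` cap is an added safeguard for the sketch and is not part of the description above.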

Experiments
We evaluate our system with resources provided to TAC 2015 participants, including 1) a source corpus of 50,000 documents from newswire and discussion forums, 2) a query set consisting of 317 hop-0 entities (expanded to 1,148 hop-0 entry-point mentions and 8,191 hop-1 queries), 3) LDC assessments of participant responses from automatic submissions and a manually created submission, and 4) software that retrieves answers from a KB and measures performance against the assessments. Additionally, we use the TAC 2013 and 2014 datasets for tuning parameters and training stacked classifiers. The parameters α = 10, β = 5, γ = 0.1 are set empirically following Section 4.3. We run each experiment 20 times and average the scores.

Queries, Assessment, and Scoring
We briefly describe the evaluation process and scoring metrics. More details appear in (Mayfield, 2014). The Cold Start evaluation measures KB quality by probing the KB with two types of queries. The queries are either at hop-0 (e.g., which organization(s) were founded by Bill Gates?) or hop-1 (e.g., in which city or cities are the organization(s) founded by Bill Gates headquartered?). More formally, the evaluation software tries to find an entity $e_0$ in the submitted KB that covers the entry-point mention of a hop-0 query $q_0$, then finds all relational triples matching $(e_0, r_1, ?)$. $X$, the set of entities matching the open variable, is reviewed by annotators for (a) assessment of correctness and (b) identification of a non-redundant subset $X'$. The software generates a hop-1 query $q_1 = x$ for each $x \in X'$, finds the entity $e_1$ that aligns with $q_1$, and then finds triples matching $(e_1, r_2, ?)$. This results in the response set $Y$, the set of entities matching the second open variable. Set $Y$ is assessed by LDC in the same manner as set $X$. The process is performed over all submitted KBs. The answers in X (hop-0)
NIST reports two metrics, CS-SF and CS-LDC-MAX, which differ in their treatment of multiple entry-point mentions for a single real-world entity. CS-SF treats each distinct mention as an independent query. CS-LDC-MAX takes only the entry-point mention that maximizes system performance for a given query entity (e.g., either the responses for Bill Gates or those for William Gates). For both metrics, NIST calculates micro-averaged precision, recall, and F1 over all queries. As mentioned above, the official evaluation is a human post-hoc assessment of KB output. A system developed outside of the evaluation window, e.g., our proposed algorithm, will likely include responses for which the truth is not known, and these are ignored by the scoring software. Table 2 compares the TAC top-ranked system to our full configuration using three post-hoc scoring strategies: strict offset-based match, string match, and assess, in which we apply the offset-based metric using additional internally performed assessments. For the ablation study in Table 3, we use the official scorer's string-match mode. A small number of responses are ignored (Ign) even in string-match mode. We further account for these responses by re-estimating precision for hop-0 and hop-1, assuming that the precision of the ignored responses at hop-1 is the same as the hop-0 precision. When this optimistic estimate differs from the reported precision, we report it in parentheses.
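The evaluation procedure can be illustrated with a simplified sketch: retrieval of hop-0 and hop-1 answers from a set of triples, micro-averaged precision/recall/F1, and an optimistic precision estimate for ignored responses. This is not the official scorer. The sketch skips entry-point matching and human assessment (every hop-0 answer spawns a hop-1 query), the adjusted-precision formula is one plausible reading of the re-estimation described above, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Triple = Tuple[Hashable, str, Hashable]


def two_hop_answers(
    triples: Iterable[Triple], entry_entity: Hashable, r1: str, r2: str
) -> Tuple[Set[Hashable], Dict[Hashable, Set[Hashable]]]:
    """Return X = fillers of (entry_entity, r1, ?) and, for each x in X,
    Y_x = fillers of (x, r2, ?)."""
    index = defaultdict(set)
    for s, r, o in triples:
        index[(s, r)].add(o)
    hop0 = set(index.get((entry_entity, r1), set()))
    hop1 = {x: set(index.get((x, r2), set())) for x in hop0}
    return hop0, hop1


def micro_prf(per_query: Iterable[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """per_query holds (num_correct, num_responses, num_gold) per query;
    counts are pooled over all queries before computing P, R, and F1."""
    rows = list(per_query)
    correct = sum(row[0] for row in rows)
    responses = sum(row[1] for row in rows)
    gold = sum(row[2] for row in rows)
    p = correct / responses if responses else 0.0
    r = correct / gold if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def adjusted_precision(
    correct: int, assessed: int, ignored: int, assumed_precision: float
) -> float:
    """Precision if ignored responses were correct at assumed_precision."""
    total = assessed + ignored
    return (correct + assumed_precision * ignored) / total if total else 0.0
```

In the usage implied by the text, `assumed_precision` would be the system's hop-0 precision and `ignored` the number of unassessed hop-1 responses.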

Results and Discussion
Table 2 compares our full system (KBC+E+R) to the top-performing system in TAC Cold Start 2015 using three different approaches to post-hoc scoring. Without manual effort, our joint modeling approach exceeds the performance of the top-ranked system, which uses a cascade of manually specified rules (Min et al., 2015). Our system obtains 5.4% and 4.8% relative improvements in hop-0 and hop-1 CS-SF F1 over the top-ranked system. Improvement is observed at both hop-0 and hop-1 and with both CS-SF and CS-LDC-MAX, showing that the improvement is robust. A sign test shows that the improvements are significant with p < 0.01.

Table 3 ablates each type of world knowledge to show the impact of the entity-based and relation-based factors independently, compared to a version of our system without world knowledge. As expected, the impact of world knowledge is seen in improvements in precision at minor costs to recall. Both types of world knowledge have a higher impact on hop-1 than on hop-0, as hop-1 measures the formation of the KB across multiple hops of relations. Adding the relation factors has a larger impact than adding the entity factors because our splitting of entities is conservative (it affects < 0.1% of entities), while the relation factors remove 7.3% of relations. The two classes of factors appear to have largely independent impacts, and combining them yields a large improvement. In total, adding prior world knowledge yields relative improvements of 9% in hop-0 precision, 131% in hop-1 precision, 42% in hop-1 F1, and 19.4% in all-hop F1 over the baseline. A sign test shows that the improvements are significant with p < 0.01.
Reduction of errors: With relation factors added (KBC+R and KBC+E+R), 7.3% of relations (out of 243K) are removed by PruneR with minimal recall loss. The median number of fillers per relation for the top 1% of entities drops, e.g., per:title from 7 to 5, per:employee_or_member_of from 5 to 2, and per:city_of_birth from 3 to 1. Inspection shows that our approach addresses many obvious mistakes: U.K. is removed as a response to (Securities and Exchange Commission (SEC), org:country_of_headquarters, ?) while U.S. remains. The error, caused by the phrase "UK's SEC" (meaning the UK's analog of the U.S. SEC), is very hard to resolve without world knowledge. Because the cardinality constraints favor a single country of headquarters for an ORG and U.K. has a lower confidence than U.S. as a filler, the model identifies U.K. as an incorrect answer.
With entity factors added (KBC+E+R and KBC+E), the model favors a larger number of smaller but more precise entities. It generates 4% new entities (out of 212K) by splitting the largest < 0.1% of entities with the SplitE heuristics described in Section 5. For example, the entity Australia is split into 3 entities: Australia and two outliers, West Aussie and Australian Capital Territory. It also singles out entities such as South America, Idaho, and Colorado from the giant U.S. entity with > 20,000 mentions. When querying the KB for facts related to U.S., erroneous answers that would otherwise be reported through relations associated with South America or the U.S. states are removed.

Related Work
KELVIN and the BBN system (Min et al., 2015) both use hand-crafted rules to limit the number of fillers, e.g., removing less precise relations if a person has more than 8 (current and ex-) spouses. Wolfe et al. (2015) and He and Grishman (2015) proposed interactive tools for KB construction with human guidance.
Knowledge Base Completion: With the recent popularity of structured KBs such as Freebase (Bollacker et al., 2008) and YAGO (Suchanek et al., 2007) and of the above-mentioned KBP techniques, there is growing interest in completing a partially complete KB with tensor decomposition (Chang et al., 2014), matrix factorization (Riedel et al., 2013), graph random walks (Lao et al., 2011; Lao et al., 2012; Gardner et al., 2014), neural networks (Socher et al., 2013; Neelakantan et al., 2015; Dong et al., 2014) and other methods (Guu et al., 2015; Gardner et al., 2013; Das et al., 2016). Knowledge Vault (Dong et al., 2014) pushes this further by combining many extraction components while estimating the confidence of their extractions, and scales the approach to the Web. Model combination (Viswanathan et al., 2015) and confidence estimation (Wick et al., 2013; Li and Grishman, 2013) are related to our model for combining extraction components. The work described here differs from KB completion tasks in its requirement that the initial KB be empty and that all information in the KB be grounded in a text corpus.
Joint Modeling and Inference for IE: To address the problem of compounding errors across multiple NLP components for IE, several papers (Finkel and Manning, 2009; McCallum and Jensen, 2003; Finkel et al., 2006; Singh et al., 2009; Poon and Domingos, 2007; Wellner et al., 2004; Poon and Vanderwende, 2010; Riedel and McCallum, 2011; Chen et al., 2014; Kate and Mooney, 2010; Miwa and Sasaki, 2014) propose joint modeling and inference for IE. Roth and Yih (2007) use the ILP framework to enforce manually specified constraints between entity and relation identification, while Yu and Lam (2010) model these two tasks in encyclopedia articles using a discriminative probabilistic model. Li and Ji (2014) jointly extract entity mentions and relations with a structured perceptron with beam search. Singh et al. (2013a) perform joint inference for entities, relations, and coreference with an extension of the belief propagation algorithm. The work described here differs in its use of world knowledge. Joint modeling and inference for IE is not directly comparable to our method but is complementary to it, and could therefore be incorporated into our system for further gain.

Conclusion and Future Work
We present a joint probabilistic framework for end-to-end Cold Start KBP with prior world knowledge. Experiments show that it surpasses the best-performing system at the NIST TAC 2015 Cold Start evaluation. We plan to investigate additional world knowledge in the near future.