Unsupervised Person Slot Filling based on Graph Mining

Slot filling aims to extract the values (slot fillers) of specific attributes (slot types) for a given entity (query) from a large-scale corpus. Slot filling has remained very challenging over the past seven years. We propose a simple yet effective unsupervised approach to extract slot fillers based on the following two observations: (1) a trigger is usually a salient node relative to the query and filler nodes in the dependency graph of a context sentence; (2) a relation is likely to exist if the query and candidate filler nodes are strongly connected by a relation-specific trigger. Thus we design a graph-based algorithm to automatically identify triggers based on personalized PageRank and Affinity Propagation for a given (query, filler) pair and then label the slot type based on the identified triggers. Our approach achieves 11.6%-25% higher F-score than state-of-the-art English slot filling methods. Our experiments also demonstrate that as long as a few trigger seeds, name tagging and dependency parsing capabilities exist, this approach can be quickly adapted to any language and new slot types. Our promising results on Chinese slot filling can serve as a new benchmark.


Introduction
The goal of the Text Analysis Conference Knowledge Base Population (TAC-KBP) Slot Filling (SF) task (McNamee and Dang, 2009; Ji et al., 2010; Ji et al., 2011; Surdeanu and Ji, 2014) is to extract the values (fillers) of specific attributes (slot types) for a given entity (query) from a large-scale corpus and provide justification sentences to support these slot fillers. KBP defines 25 slot types for persons (e.g., spouse) and 16 slots for organizations (e.g., founder). For example, given a person query "Dominick Dunne" and slot type spouse, a SF system may extract a slot filler "Ellen Griffin" and its justification sentence E1 as shown in Figure 1. Slot filling remains a very challenging task. The two most successful state-of-the-art techniques are as follows.
(1) Supervised classification. Considering any pair of query and candidate slot filler as an instance, these approaches train a classifier from manually labeled data through active learning (Angeli et al., 2014b) or noisy labeled data through distant supervision (Angeli et al., 2014a; Surdeanu et al., 2010) to predict the existence of a specific relation between them.
(2) Pattern matching. These approaches extract and generalize lexical and syntactic patterns automatically or semi-automatically (Sun et al., 2011; Li et al., 2012; Yu et al., 2013; Hong et al., 2014). They usually suffer from low recall because a given relation type can be expressed in numerous different ways (Surdeanu and Ji, 2014). For example, none of the top-ranked patterns (Li et al., 2012) based on dependency paths in Table 1 can capture the spouse slot in E1.
Both of the previous methods have poor portability to a new language or a new slot type. Furthermore, both methods focus on the flat relation representation between the query and the candidate slot filler, while ignoring the global graph structure among them and other facts in the context.
When multiple facts about a person entity are presented in a sentence, the author (e.g., a news reporter or a discussion forum poster) often uses explicit trigger words or phrases to indicate their relations with the entity. As a result, these interdependent facts and query entities are strongly connected via syntactic or semantic relations.
Many slot types, especially when the queries are person entities, are indicated by such triggers. We call these slots trigger-driven slots. In this paper, we define a trigger as the smallest extent of a text which most clearly indicates a slot type. For example, in E1, "divorced" is a trigger for spouse while "died" is a trigger for death-related slots.
Considering the limitations of previous flat representations for the relations between a query (Q) and a candidate slot filler (F ), we focus on analyzing the whole dependency tree structure that connects Q, F and other semantically related words or phrases in each context sentence. Our main observation is that there often exists a trigger word (T ) which plays an important role in connecting Q and F in the dependency tree for trigger-driven slots. From the extended dependency tree shown in Figure 1, we can clearly see that "divorced" is most strongly connected to the query mention ("he") and the slot filler ("Ellen Griffin Dunne"). Therefore we can consider it as a trigger word which explicitly indicates a particular slot type.
Based on these observations, we propose a novel and effective unsupervised graph mining approach for person slot filling by deeply exploring the structures of dependency trees. It consists of the following three steps:
• Step 1 - Candidate Relation Identification: Construct an extended dependency tree for each sentence including any mention referring to the query entity. Identify candidate slot fillers based on slot type constraints (e.g., the spouse fillers are limited to person entities) (Section 2).
• Step 2 - Trigger Identification: Measure the importance of each node in the extended dependency tree relative to Q and F, rank them and select the most important ones as the trigger set (Section 3).
• Step 3 - Slot Typing: For any given new slot type, automatically expand a few trigger seeds using the Paraphrase Database (Ganitkevitch et al., 2013). Then we use the expanded trigger set to label the slot types of identified triggers (Section 4).
This framework only requires name tagging and dependency parsing as pre-processing, and a few trigger seeds as input, and thus it can be easily adapted to a new language or a new slot type. Experiments on English and Chinese demonstrate that our approach dramatically advances state-of-the-art results for both pre-defined KBP slot types and new slot types.

Candidate Relation Identification
We first present how to build an extended dependency graph for each evidence sentence (Section 2.1) and generate query and filler candidate mentions (Section 2.2).

Extended Dependency Tree Construction
Given a sentence containing N words, we construct an undirected graph $G = (V, E)$, where $V = \{v_1, \ldots, v_N\}$ represents the words in the sentence and $E$ is the edge set, in which each edge $e_{ij}$ represents a dependency relation between $v_i$ and $v_j$. We first apply a dependency parser to generate basic uncollapsed dependencies and ignore the direction of the edges. Figure 1 shows the dependency tree built from the example sentence. In addition, we annotate each entity, time or value mention node with its type. For example, in Figure 1, "Ellen Griffin Dunne" is annotated as a person, and "1997" is annotated as a year. Finally we perform coreference resolution, which introduces implicit links between nodes that refer to the same entity. We replace any nominal or pronominal entity mention with its coreferential name mention. For example, "he" is replaced by "Dominick Dunne" in Figure 1. Formally, an extended dependency tree is an annotated tree of entity mentions, phrases and their links.
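The construction above can be illustrated with a minimal Python sketch. It assumes that tokenization, dependency arcs, mention types and coreference links have already been produced by external tools; the function name `build_extended_graph`, the input formats and the toy five-token parse are hypothetical and are not the actual parse of E1.

```python
from collections import defaultdict


def build_extended_graph(tokens, dependencies, mention_types=None, coref=None):
    """Build an undirected, annotated graph from a dependency parse.

    tokens        -- list of words (or pre-merged mention strings)
    dependencies  -- list of (head_index, dependent_index, label) arcs
    mention_types -- optional {token_index: type}, e.g. {2: "PER"}
    coref         -- optional {token_index: canonical name mention}
    All four input formats are assumptions made for this sketch.
    """
    mention_types = mention_types or {}
    coref = coref or {}

    # Node annotation: replace nominal/pronominal mentions by their
    # coreferential name mention and attach the mention type, if any.
    nodes = {
        i: {"text": coref.get(i, word), "type": mention_types.get(i)}
        for i, word in enumerate(tokens)
    }

    # Edge direction is ignored, so every arc is stored both ways.
    neighbors = defaultdict(set)
    for head, dependent, _label in dependencies:
        neighbors[head].add(dependent)
        neighbors[dependent].add(head)

    adjacency = {i: sorted(neighbors[i]) for i in nodes}
    return nodes, adjacency


# Toy example (hypothetical parse, for illustration only).
tokens = ["he", "divorced", "Ellen Griffin Dunne", "in", "1997"]
arcs = [(1, 0, "nsubj"), (1, 2, "dobj"), (1, 3, "prep"), (3, 4, "pobj")]
nodes, adjacency = build_extended_graph(
    tokens,
    arcs,
    mention_types={0: "PER", 2: "PER", 4: "YEAR"},
    coref={0: "Dominick Dunne"},
)
```

Storing the graph as an adjacency dictionary keeps the later random-walk computation simple, since each node only needs to enumerate its neighbors.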

Query Mention and Filler Candidate Identification
Given a query q and a set of relevant documents, we construct a dependency tree for each sentence. We identify a person entity e as a query mention if e matches the last name of q or e shares two or more tokens with q. For example, "he/Dominick Dunne" in Figure 1 is identified as a mention referring to the query Dominick Dunne. For each sentence which contains at least one query mention, we regard all other entities, values and time expressions as candidate fillers and generate a set of entity pairs (q, f ), where q is a query mention, and f is a candidate filler. In Example E1, we can extract three entity pairs (i.e., {Dominick Dunne} × {Ellen Griffin Dunne, 1997. For each entity pair, we represent the query mention and the filler candidate as two sets of nodes Q and F respectively, where Q, F ⊆ V .
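The query mention rule and the pairing step can be sketched as follows. This is a minimal illustration under stated assumptions: "matches the last name" is read as exact equality with the last token of the query, matching is case-insensitive, and mentions are given as (text, type) tuples. The function names and the "PER" type label are hypothetical.

```python
def is_query_mention(mention, query):
    """Return True if a person mention is taken to refer to the query.

    Rule from the text: the mention matches the last name of the query,
    or shares two or more tokens with it. Exact, case-insensitive token
    matching is an assumption of this sketch.
    """
    mention_tokens = mention.lower().split()
    query_tokens = query.lower().split()
    if not mention_tokens or not query_tokens:
        return False
    if mention_tokens == [query_tokens[-1]]:
        return True
    return len(set(mention_tokens) & set(query_tokens)) >= 2


def candidate_pairs(mentions, query):
    """Pair each query mention with every other mention in the sentence.

    mentions -- list of (text, type) tuples for one sentence
    """
    query_mentions = [
        text for text, kind in mentions
        if kind == "PER" and is_query_mention(text, query)
    ]
    fillers = [text for text, _kind in mentions if text not in query_mentions]
    return [(q, f) for q in query_mentions for f in fillers]


pairs = candidate_pairs(
    [("Dominick Dunne", "PER"), ("Ellen Griffin Dunne", "PER"), ("1997", "DATE")],
    "Dominick Dunne",
)
```

Reading the last-name rule as exact equality keeps "Ellen Griffin Dunne" from being mistaken for the query, since it shares only one token with "Dominick Dunne".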

Trigger Identification
In this section, we proceed to introduce an unsupervised graph-based method to identify triggers for each query and candidate filler pair. We rank all trigger candidates (Section 3.1) and then keep the top ones as the trigger set (Section 3.2).

Trigger Candidate Ranking
As discussed in Section 1, we can treat trigger identification as the problem of finding the important nodes relative to Q and F in G. Algorithms such as PageRank (Page et al., 1999) are designed to compute the global importance of each node relative to all other nodes in a graph. By redefining the importance according to our preference toward F and Q, we can extend PageRank to generate relative importance scores. We use the random surfer model (Page et al., 1999) to explain our motivation. Suppose a random surfer keeps visiting adjacent nodes in G at random. The expected percentage of surfers visiting each node converges to the PageRank score. We extend PageRank by introducing a "back probability" β to determine how often surfers jump back to the preferred nodes (i.e., Q or F), so that the converged score can be used to estimate the relative probability of visiting each node starting from these preferred nodes.
Given G and a set of preferred nodes R where R ⊆ V , we denote the relative importance for all v ∈ V with respect to R as I(v | R), following the work of White and Smyth (2003).
For a node $v_k$, we denote by $N(k)$ the set of neighbors of $v_k$. We use $\pi(k)$, the $k$-th component of the vector $\pi$, to denote the stationary probability of $v_k$, where $1 \le k \le |V|$. We define a preference vector $p_R = (p_1, \ldots, p_{|V|})$ such that the probabilities sum to 1, where $p_k$ denotes the relative importance attached to $v_k$: $p_k$ is set to $1/|R|$ for $v_k \in R$, and to 0 otherwise. Let $A$ be the matrix corresponding to the graph G, where $A_{jk} = 1/|N(k)|$ if $v_j \in N(k)$ and $A_{jk} = 0$ otherwise.
For a given $p_R$, we can obtain the personalized PageRank equation (Jeh and Widom, 2003):
$$\pi = (1 - \beta) A \pi + \beta p_R \tag{1}$$
where $\beta \in [0, 1]$ determines how often surfers jump back to the nodes in R. We set $\beta = 0.3$ in our experiment. The solution $\pi$ to Equation 1 is a steady-state importance distribution induced by $p_R$. Based on a theorem of Markov Theory, a solution $\pi$ with $\sum_{k=1}^{|V|} \pi(k) = 1$ always exists and is unique (Motwani and Raghavan, 1996).
We define relative importance scores based on the personalized ranks described above, i.e., I(v | R) = π(v) after convergence, and we compute the importance scores for all the nodes in V relative to Q and F respectively.
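As an illustration, the relative importance scores can be approximated by power iteration on the update π ← (1 − β)Aπ + βp_R. The sketch below is a minimal pure-Python version, assuming an undirected graph stored as an adjacency dictionary in which every node has at least one neighbor; the function name, the convergence tolerance and the toy graph are hypothetical.

```python
def personalized_pagerank(adjacency, preferred, beta=0.3, tol=1e-10, max_iter=1000):
    """Approximate I(v | R) for every node v by power iteration.

    adjacency -- {node: list of neighbor nodes}, symmetric (undirected)
    preferred -- the set R of preferred nodes (e.g. the nodes of Q or F)
    beta      -- back probability of jumping to a preferred node
    """
    nodes = list(adjacency)
    # Preference vector p_R: uniform over R, zero elsewhere.
    p = {v: (1.0 / len(preferred) if v in preferred else 0.0) for v in nodes}
    pi = dict(p)  # any probability distribution works as a starting point

    for _ in range(max_iter):
        updated = {}
        for j in nodes:
            # (A pi)_j: each neighbor k passes on pi(k) / |N(k)|.
            inflow = sum(pi[k] / len(adjacency[k]) for k in adjacency[j])
            updated[j] = (1.0 - beta) * inflow + beta * p[j]
        change = sum(abs(updated[v] - pi[v]) for v in nodes)
        pi = updated
        if change < tol:
            break
    return pi


# Toy graph (hypothetical): node 1 plays the role of a trigger that
# connects a query node 0 and a filler node 2.
graph = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
importance_to_filler = personalized_pagerank(graph, {2})
importance_to_query = personalized_pagerank(graph, {0})
```

Because each step moves only a (1 − β) share of the probability mass along edges, the iteration contracts and settles quickly on sentence-sized graphs.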
A query mention in a sentence is more likely to be involved in multiple relations while a filler is usually associated with only one slot type. Therefore we combine two relative importance scores by assigning a higher priority to I(v | F ) as follows.
We discard a trigger candidate if it is (or is part of) an entity which can only act as a query or a slot filler. We assume a trigger can only be a noun, verb, adjective, adverb or preposition. In addition, verbs, nouns and adjectives are more informative as triggers. Thus, we remove any trigger candidate v if it has a higher I(v | {Q, F}) than the first top-ranked verb/noun/adjective trigger candidate.
For example, we rank the candidate triggers based on the query and slot filler pair ("Dominick Dunne", "Ellen Griffin Dunne") as shown in Figure 2.
Figure 2: Importance scores of trigger candidates relative to query and filler in E1.

Trigger Candidate Selection
Given Q and F , we can obtain a relative importance score I(v | {Q, F }) for each candidate trigger node v in V as shown in Section 3.1. We denote the set of trigger candidates as T = {t 1 , · · · , t n } where n ≤ |V |.
Since a relation can be indicated by a single trigger word, a trigger phrase or even multiple non-adjacent trigger words, it is difficult to set a single threshold even for one slot type. Instead, we aim to automatically classify top ranked candidates into one group (i.e., a trigger set) so that they all have similar higher scores compared to other candidates.
Therefore, we define this problem as a clustering task. We mainly consider clustering algorithms which do not require a pre-specified number of clusters.
We apply affinity propagation, which takes as input a collection of real-valued similarity scores between pairs of candidate triggers. Real-valued messages are exchanged between candidate triggers until a high-quality set of exemplars (centers of clusters) and corresponding clusters gradually emerges (Frey and Dueck, 2007).
There are two kinds of messages exchanged between candidate triggers: one is called responsibility γ(i, j), sent from t i to a candidate exemplar t j ; the other is availability α(i, j), sent from the candidate exemplar t j to t i .
The two message updates are iterated until convergence. To begin with, the availabilities are initialized to zero: $\alpha(i, j) = 0$. Then the responsibilities are computed using the following rule (Frey and Dueck, 2007):
$$\gamma(i, j) \leftarrow s(i, j) - \max_{j' \neq j} \{\alpha(i, j') + s(i, j')\} \tag{3}$$
where the similarity score $s(i, j)$ indicates how well $t_j$ is suited to be the exemplar for $t_i$. Whereas the above responsibility update lets all candidate exemplars compete for the ownership of a trigger candidate $t_i$, the following availability update gathers evidence from trigger candidates as to whether each candidate exemplar would make a good exemplar:
$$\alpha(i, j) \leftarrow \min\Big\{0,\ \gamma(j, j) + \sum_{i' \notin \{i, j\}} \max\{0, \gamma(i', j)\}\Big\} \tag{4}$$
Given T, we can generate an $n \times n$ affinity matrix $M$ which serves as the input of the affinity propagation, i.e., $s(i, j) = M_{ij}$. $M_{ij}$ represents the negative squared difference in relative importance score between $t_i$ and $t_j$ (Equation 5):
$$M_{ij} = -\big(I(t_i \mid \{Q, F\}) - I(t_j \mid \{Q, F\})\big)^2 \tag{5}$$
We compute the average importance score for all the clusters after convergence and keep the one with the highest average score as the trigger set. For example, given the query and slot filler pair in Figure 3, we obtain trigger candidates T = {died, divorced, from, in, in} and their corresponding relative importance scores. After the above clustering, we obtain three clusters and choose the cluster {divorced} with the highest average relative importance score (0.128) as the trigger set.
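The selection step can be sketched in plain Python using the standard message-passing updates of Frey and Dueck (2007). Several details are assumptions of this sketch rather than settings reported here: the damping factor of 0.5, the fixed number of iterations, the use of the median similarity as the self-similarity (preference) of every candidate, and the fallback to the top-scored candidate when no exemplar emerges. The function names and the example scores (apart from 0.128) are hypothetical.

```python
def affinity_propagation(S, damping=0.5, iterations=200):
    """Run affinity propagation on an n x n similarity matrix S (n >= 2).

    Returns the indices of the exemplars found after a fixed number of
    damped message-passing iterations.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # Responsibility: how well k suits i compared with its best rival.
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availability: accumulated support for k acting as an exemplar.
        for k in range(n):
            positive = [max(0.0, R[j][k]) for j in range(n)]
            support = sum(positive) - positive[k]
            for i in range(n):
                if i == k:
                    new = support
                else:
                    new = min(0.0, R[k][k] + support - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    return [k for k in range(n) if R[k][k] + A[k][k] > 0]


def select_trigger_set(scores):
    """Return the indices of the cluster with the highest average score."""
    n = len(scores)
    if n < 2:
        return list(range(n))
    # Similarity: negative squared difference of importance scores.
    S = [[-(scores[i] - scores[j]) ** 2 for j in range(n)] for i in range(n)]
    off_diagonal = sorted(S[i][j] for i in range(n) for j in range(n) if i != j)
    preference = off_diagonal[len(off_diagonal) // 2]  # median (assumption)
    for k in range(n):
        S[k][k] = preference
    exemplars = affinity_propagation(S)
    if not exemplars:  # fallback (assumption): keep the top-scored candidate
        exemplars = [max(range(n), key=lambda i: scores[i])]
    clusters = {}
    for i in range(n):
        owner = i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
        clusters.setdefault(owner, []).append(i)
    return max(clusters.values(), key=lambda c: sum(scores[i] for i in c) / len(c))


# Hypothetical scores for five candidates; index 1 has the score 0.128.
trigger_indices = select_trigger_set([0.020, 0.128, 0.015, 0.010, 0.010])
```

Since every candidate is assigned to its most similar exemplar, the clusters are contiguous score ranges, so the cluster with the highest average always contains the top-scored candidate.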

Slot Type Labeling
In this section, we introduce how to label the slot type for an identified relation tuple (Q, T, F). The simplest solution is to match T against existing trigger gazetteers for certain types of slots. For example, Figure 4 shows how we label the relation as a spouse slot type. In fact, some trigger gazetteers have already been constructed in previous work such as Yu et al. (2015). However, manual construction of these triggers relies heavily upon labeled training data and high-quality patterns, which would be unavailable for a new language or a new slot type.
Inspired by the trigger-based event extraction work (Bronstein et al., 2015), we propose to extract trigger seeds from the slot filling annotation guideline and then expand them by paraphrasing techniques. For each slot type we manually select two trigger seeds from the guideline and then use the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) to expand these seeds. Specifically, we select the top-20 lexical paraphrases based on similarity scores as our new triggers for each slot type. Some examples are shown in Table 2.
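Seed expansion and gazetteer matching can be illustrated with the following sketch. It does not use the real PPDB file format or API: the paraphrase table is a hypothetical in-memory dictionary, the seeds and scores are invented for illustration, and pooling the paraphrases of all seeds before taking the top k per slot type is one possible reading of the selection rule.

```python
def expand_seeds(seeds, paraphrases, top_k=20):
    """Expand trigger seeds with their top-k scored lexical paraphrases.

    paraphrases -- hypothetical table {seed: [(paraphrase, similarity), ...]}
    """
    pooled = {}
    for seed in seeds:
        for word, score in paraphrases.get(seed, []):
            pooled[word] = max(score, pooled.get(word, score))
    ranked = sorted(pooled, key=pooled.get, reverse=True)[:top_k]
    return set(seeds) | set(ranked)


def label_slot(trigger_set, gazetteers):
    """Return every slot type whose gazetteer contains an identified trigger."""
    found = set(trigger_set)
    return sorted(slot for slot, triggers in gazetteers.items() if found & triggers)


# Toy paraphrase table with invented similarity scores.
toy_paraphrases = {
    "divorce": [("divorced", 0.90), ("separation", 0.60)],
    "marry": [("married", 0.95), ("wed", 0.70)],
}
gazetteers = {
    "spouse": expand_seeds(["divorce", "marry"], toy_paraphrases, top_k=2),
    "death": {"died"},
}
slot_types = label_slot(["divorced"], gazetteers)
```

Returning a list of slot types rather than a single label reflects that a trigger may appear in more than one gazetteer.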

Filler Validation
After we label each relation tuple, we perform the following validation steps to filter noise and remove redundancy. For many slot types, there are some specific constraints on entity types of slot fillers defined in the task specification. For example, employee or member of fillers should be either organizations or geopolitical entities, while family slots (e.g., spouse and children) expect person entities. We apply these constraints to further validate all relation tuples. Moreover, single-value slots can only have a single filler (e.g., date of birth), while list-value slots can take multiple fillers (e.g., cities of residence). However, we might extract conflicting relation tuples from multiple sentences and sources. The same relation tuple can also be extracted from multiple sentences, and thus it may receive multiple relative importance scores. We aim to keep the most reliable relation tuple for a single-value slot.
For a single-value slot, suppose we have a collection of relation tuples R which share the same query. Given $r \in R$ with a set of relative importance scores $I = \{i_1, i_2, \cdots, i_n\}$, we can regard the average score of $I$ as the credibility score of $r$. The reason is that the higher the relative importance score, the more likely the tuple is to be correct. In our experiments, we use the weighted arithmetic mean as follows so that higher scores can contribute more to the final average:
$$\bar{i} = \frac{\sum_{k=1}^{n} w_k i_k}{\sum_{k=1}^{n} w_k} \tag{6}$$
where $w_k$ denotes the non-negative weight of $i_k$. When we set the weight $w_k$ equal to the score $i_k$, Equation 6 can be simplified as:
$$\bar{i} = \frac{\sum_{k=1}^{n} i_k^2}{\sum_{k=1}^{n} i_k} \tag{7}$$
We calculate the weighted mean $\bar{i}$ for each $r \in R$ and keep the relation tuple with the highest $\bar{i}$.
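A short sketch of this conflict resolution step follows, assuming the self-weighted mean (sum of squared scores divided by the sum of scores) as the credibility of a tuple. The function names, the dictionary input format and the example fillers and scores are hypothetical.

```python
def credibility(scores):
    """Self-weighted mean: each score i_k is weighted by itself."""
    total = sum(scores)
    if total == 0:
        return 0.0
    return sum(s * s for s in scores) / total


def resolve_single_value(candidates):
    """Keep the filler with the highest credibility for a single-value slot.

    candidates -- {filler: [relative importance scores from each sentence]}
    """
    return max(candidates, key=lambda filler: credibility(candidates[filler]))


# Hypothetical conflicting fillers for one single-value slot.
best = resolve_single_value({"1925": [0.2, 0.1], "1926": [0.05]})
```

Weighting each score by itself pulls the mean toward the strongest evidence, so a tuple supported by one highly ranked trigger is not diluted by several weak extractions.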

Data and Scoring Metric
In order to evaluate the quality of our proposed framework and its portability to a new language, we use TAC-KBP2013 English Slot Filling (ESF), TAC-KBP2015 English Cold Start Slot Filling (CSSF) and TAC-KBP2015 Chinese Slot Filling (CSF) data sets for which we can compare with the ground truth and state-of-the-art results reported in previous work. The source collection includes news documents, web blogs and discussion forum posts. In ESF there are 50 person queries and on average 20 relevant documents per query, while in CSF there are 51 person queries and on average 5 relevant documents per query.
We only test our method on 18 trigger-driven person slot types shown in Table 3. Some other slot types (e.g., age, origin, religion and title) do not rely on lexical triggers in most cases; instead the query mention and the filler are usually adjacent or separated by a comma. In addition, we do not deal with the two remaining trigger-driven person slot types (i.e., cause of death and charges) since these slots often expect other types of concepts (e.g., a disease or a crime phrase).
We use the official TAC-KBP slot filling evaluation scoring metrics: Precision (P ), Recall (R) and F-score (F 1 ) (Ji et al., 2010) to evaluate our results.
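For reference, the three metrics can be computed from raw counts as in the sketch below. It is a simplified illustration only: the official scorer also handles details such as redundant fillers and justification checking, and the function name and count arguments are hypothetical.

```python
def precision_recall_f1(num_correct, num_returned, num_gold):
    """Compute P, R and F1 from counts of correct, returned and gold fillers."""
    precision = num_correct / num_returned if num_returned else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 5 correct fillers out of 10 returned, 20 in the gold set.
p, r, f1 = precision_recall_f1(5, 10, 20)
```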

English Slot Filling
We apply Stanford CoreNLP for English part-of-speech (POS) tagging, name tagging, time expression extraction, dependency parsing and coreference resolution. In Table 3 we compare our approach with two state-of-the-art English slot filling methods: a distant supervision method and a hybrid method that combines distant and partial supervision (Angeli et al., 2014b). Our method outperforms both methods dramatically.
KBP2015 English cold start slot filling is a task which combines entity mention extraction and slot filling (Surdeanu and Ji, 2014). Based on the released evaluation queries from KBP2015 Cold Start Slot Filling, our approach achieves 39.2% overall F-score on 18 person trigger-driven slot types, which is significantly better than state-of-the-art (Angeli et al., 2015) on the same set of news documents (Table 4).
Compared to the previous work, our method discards a trigger-driven relation tuple if it is not supported by triggers. For example, "Poland" is mistakenly extracted as the country of residence of "Mandelbrot" by distant supervision from the following sentence:
A professor emeritus at Yale University, Mandelbrot was born in Poland but as a child moved with his family to France where he was educated.
This is possibly because the relation tuple (Mandelbrot, live in, Poland) indeed exists in external knowledge bases. Given the same entity pair, our method identifies "born" as the trigger word and labels the slot type as country of birth.
When there are several triggers indicating different slot types in a sentence, our approach performs better in associating each trigger with the filler it dominates by analyzing the whole dependency tree. For example, given a sentence: Haig is survived by his wife of 60 years, Patricia; his children Alexander, Brian and Barbara; eight grandchildren; and his brother, the Rev. Francis R. Haig.
(Haig, sibling, Barbara) is the only relation tuple extracted from the above sentence by the previous method. Given the entity pair (Haig, Barbara), the relative importance score of "children" (0.1) is higher than the score of "brother" (0.003), and "children" is kept as the only trigger candidate after clustering. Therefore, we extract the tuple (Haig, children, Barbara) instead. In addition, we successfully identify the missing fillers for other slot types: spouse (Patricia), children (Alexander, Brian and Barbara) and siblings (Francis R. Haig) by identifying their corresponding triggers.
In addition, flat relation representations fail to extract the correct relation (i.e., alternate names) between "Dandy Don" and "Meredith" since "brother" is close to both of them in the following sentence:
In high school and at Southern Methodist University, where, already known as Dandy Don (a nickname bestowed on him by his brother), Meredith became an all-American.

Adapting to New Slot Types
Our framework can also be easily adapted to new slot types. We evaluate it on three new person list-value slot types: friends, colleagues and collaborators.
We use "friend" as the slot-specific trigger for the slot friends and "colleague" for the slot colleagues. "collaborate", "cooperate" and "partner" are used to type the slot collaborators.
We manually annotate ground truth for evaluation. It is difficult to find all the correct fillers for a given query from millions of documents. Therefore, we only calculate precision. Experiments show that we achieve a precision of 56.3% for friends, 100% for colleagues and 60% for collaborators (examples shown in Table 5).

Impact of Trigger Mining
In Section 3.2, we keep top-ranked trigger candidates based on clustering rather than threshold tuning. We explore a range of thresholds for comparison, as shown in Figure 5. Our approach achieves 57.4% F-score, which is comparable to the highest F-score 58.1% obtained by threshold tuning.
We also measure the impact of the size of the trigger gazetteer. We already outperform state-of-the-art by using PPDB to expand triggers mined from guidelines as shown in Table 6. As the size of the trigger gazetteer increases, our method (marked with a ) achieves better performance.

Chinese Slot Filling
As long as we have the following resources: (1) a POS tagger, (2) a name tagger, (3) a dependency parser and (4) slot-specific trigger gazetteers, we can apply the framework to a new language. Coreference resolution is optional. We demonstrate the portability of our framework to Chinese since all the resources mentioned above are available. We apply Stanford CoreNLP for Chinese POS tagging, name tagging (Wang et al., 2013) and dependency parsing (Levy and Manning, 2003). To explore the impact of the quality of annotation resources, we also use a Chinese language analysis tool: Language Technology Platform (LTP) (Che et al., 2010). We use the full set of Chinese trigger gazetteers published by Yu et al. (2015). Experimental results (Table 7) demonstrate that our approach can serve as a new and promising benchmark. As far as we know, there are no results available for comparison.
However, the performance of Chinese SF is heavily influenced by the relatively low performance of name tagging, since our method returns an empty result if it fails to find any query mention. About 20% and 16% of the queries cannot be recognized by CoreNLP and LTP respectively. One reason is that many Chinese names are also common words. For example, a Buddhist monk's name "觉醒" (wake) is identified as a verb rather than a person entity.
Table 5 (recovered rows; columns: slot type, query, filler, evidence sentence):
colleagues | Lucille Clifton | Michael Glaser |
collaborators | Merce Cunningham | Jacqueline Lesschaeve | Cunningham has collaborated on two books: "Changes: Notes on Choreography," with Frances Starr, and "The Dancer and the Dance," with Jacqueline Lesschaeve.
A dependency parser is indispensable to produce reliable rankings of trigger candidates. Unfortunately, a high-quality parser for a new language is often not available because of language-specific features. For example, in Chinese a single sentence about a person's biography often contains more than five coordinated clauses, each of which includes a trigger. Therefore a dependency parser adapted from English often mistakenly identifies one of the triggers as a main predicate of the sentence.
In addition, Chinese is a very concise language. Finally, compared to English, Chinese tends to have more variants for some types of triggers (e.g., there are at least 31 different titles for "wife" in Chinese). Some of them are implicit and require shallow inference. For example, "投奔" (to seek shelter or asylum) indicates a residence relation in most cases.

Related Work
Besides the methods based on distant supervision (e.g., Surdeanu et al., 2010; Angeli et al., 2014b) discussed in Section 6.2, pattern-based methods have also been proven to be effective in SF in the past years (Sun et al., 2011; Li et al., 2012; Yu et al., 2013). Dependency-based patterns achieve better performance since they can capture long-distance relations. Most of these approaches assume that a relation exists between Q and F if there is a dependency path connecting Q and F, and all the words on the path are equally regarded as trigger candidates. We explore the complete graph structure of a sentence rather than chains/subgraphs as in previous work.
Our previous research focused on identifying the relation between F and T by extracting filler candidates from the identified scope of a trigger (e.g., Yu et al., 2015). We found that each slot-specific trigger has its own scope, and corresponding fillers seldom appear outside its scope. We did not compare with results from this previous approach, which did not consider the redundancy removal required in the official evaluations.
Soderland et al. (2013) built their SF system based on Open Information Extraction (IE) technology. Our method achieves much higher recall since dependency trees can capture the relations among query, slot filler and trigger in more complicated long sentences. In addition, our triggers are automatically labeled so that we do not need to design manual rules to classify relation phrases as in Open IE.

Conclusions and Future Work
In this paper, we demonstrate the importance of deep mining of dependency structures for slot filling. Our approach outperforms state-of-the-art and can be rapidly ported to a new language or a new slot type, as long as name tagging, POS tagging and dependency parsing capabilities and trigger gazetteers are available.
In the future we aim to label slot types based on contextual information as well as sentence structures instead of trigger gazetteers only. There are two primary reasons. First, a trigger can serve multiple slot types. For example, the slot children and its inverse slot parents share a subset of triggers. Second, a trigger word can have multiple different meanings. For example, a sibling trigger word "sister" can also represent a female member of a religious community. We plan to incorporate multi-prototype approaches (e.g., Reisinger and Mooney, 2010) to better disambiguate senses of trigger words.
Besides considering the cross-sentence conflicts, we also want to investigate the within-sentence conflicts caused by the competition of triggers. A trigger identified by our approach is the most important node in the dependency tree relative to the given entity pair. However, this trigger might be more important to another entity pair, which shares the same filler, in the same sentence. A promising solution is to rank all the entities in the sentence based on their importance relative to the identified trigger and the filler candidate.