Entity Linking for Queries by Searching Wikipedia Sentences

We present a simple yet effective approach for linking entities in queries. The key idea is to search sentences similar to a query from Wikipedia articles and directly use the human-annotated entities in the similar sentences as candidate entities for the query. Then, we employ a rich set of features, such as link-probability, context-matching, word embeddings, and relatedness among candidate entities as well as their related entities, to rank the candidates under a regression based framework. The advantages of our approach lie in two aspects, which contribute to the ranking process and final linking result. First, it can greatly reduce the number of candidate entities by filtering out irrelevant entities with the words in the query. Second, we can obtain the query sensitive prior probability in addition to the static link-probability derived from all Wikipedia articles. We conduct experiments on two benchmark datasets on entity linking for queries, namely the ERD14 dataset and the GERDAQ dataset. Experimental results show that our method outperforms state-of-the-art systems and yields 75.0% in F1 on the ERD14 dataset and 56.9% on the GERDAQ dataset.


Introduction
Query understanding has been an important research area in information retrieval and natural language processing (Croft et al., 2010). A key part of this problem is entity linking, which aims to annotate the entities in the query and link them to a knowledge base such as Freebase and Wikipedia. This problem has been extensively studied over the recent years (Carmel et al., 2014; Usbeck et al., 2015; Cornolti et al., 2016).
The mainstream methods of entity linking for queries can be summed up in three steps: mention detection, candidate generation, and entity disambiguation. The first step is to recognize candidate mentions in the query. The most common method to detect mentions is to search a dictionary built from entity aliases in a knowledge base and from human-maintained information in Wikipedia (such as anchors, titles and redirects) (Laclavik et al., 2014). The second step is to generate candidates by mapping mentions to entities. It usually uses all possible senses of detected mentions as candidates. Hereafter, we refer to these two steps of generating candidate entities as entity search. The third step is to disambiguate and prune candidate entities, which is usually implemented with a ranking framework.
arXiv:1704.02788v1 [cs.CL] 10 Apr 2017
There are two main issues in entity search. First, a mention may be linked to many entities. The methods using entity search usually leverage little context information in the query. Therefore they may generate many completely irrelevant entities for the query, which brings challenges to the ranking phase. For example, the mention "Austin" usually represents the capital of Texas in the United States. However, it can also be linked to "Austin, Western Australia", "Austin, Quebec", "Austin (name)", "Austin College", "Austin (song)" and 31 other entities in the Wikipedia page of "Austin (disambiguation)". For the query "blake shelton austin lyrics", Blake Shelton is a singer and made his debut with the song "Austin". The entity search method detects the mention "austin" using the dictionary. However, while "Austin (song)" is most related to the context "blake shelton" and "lyrics", the mention "austin" may be linked to all the above entities as candidates. Therefore candidate generation with entity search generates too many candidates, especially for a common anchor text with a large number of corresponding entities. Second, it is hard to recognize entities with common surface names. The common methods usually define a feature called "link-probability" as the probability that a mention is annotated in all documents.
There is an issue with this probability being static whatever the query is. We show an example with the query "her film". "Her (film)" is a film while its surface name is usually used as a possessive pronoun. Since the static link-probability of "her" from all Wikipedia articles is very low, "her" is usually not treated as a mention linked to the entity "Her (film)".
In this paper, we propose a novel approach to generating candidates by searching sentences from Wikipedia articles and directly using the human-annotated entities as the candidates. Our approach can greatly reduce the number of candidate entities and obtain the query sensitive prior probability. We take the query "blake shelton austin lyrics" as an example. Below we show a sentence in the Wikipedia page of "Austin (song)".
[[Austin (song)|Austin]] is the title of a debut song written by David Kent and Kirsti Manna, and performed by American country music artist [[Blake Shelton]].
Table 1: A sentence in the page "Austin (song)".
In the above sentence, the mentions "Austin" and "Blake Shelton" in square brackets are annotated to the entities "Austin (song)" and "Blake Shelton", respectively. We generate candidates by searching sentences and thus obtain "Blake Shelton" as well as "Austin (song)" from this example. We reduce the number of candidates because many irrelevant entities linked by "austin" do not occur in returned sentences. In addition, as previous methods generate candidates by searching entities without the query information, "austin" can be linked to "Austin, Texas" with much higher static link-probability than all other senses of "austin". However, the number of returned sentences that contain "Austin, Texas" is close to the number of sentences that contain "Austin (song)" in our system. We show another example with the query "her film" in Table 2. In this sentence, "Her", "romantic", "science fiction", "comedy-drama" and "Spike Jonze" are annotated to corresponding entities. As "Her" is annotated to "Her (film)" by humans in this example, we have strong evidence to annotate it even if it is usually used as a possessive pronoun with very low static link-probability.
Table 2: A sentence in the page "Her (film)".
We obtain the anchors as well as corresponding entities and map them to the query after searching similar sentences. Then we build a regression based framework to rank the candidates. We use a rich set of features, such as link-probability, context-matching, word embeddings, and relatedness among candidate entities as well as their related entities. We evaluate our method on the ERD14 and GERDAQ datasets. Experimental results show that our method outperforms state-of-the-art systems and yields 75.0% and 56.9% in terms of F1 metric on the ERD14 dataset and the GERDAQ dataset respectively.

Related Work
Recognizing entity mentions in text and linking them to the corresponding entries helps to understand documents and queries. Most work uses knowledge bases such as Freebase (Chiu et al., 2014), YAGO (Yosef et al., 2011) and DBpedia (Olieman et al., 2014). Wikify (Mihalcea and Csomai, 2007) is a very early work on linking anchor texts to Wikipedia pages. It extracts all n-grams that match Wikipedia concepts such as anchors and titles as candidates. They implement a voting scheme based on the knowledge-based and data-driven method to disambiguate candidates. Cucerzan (2007) uses four resources to generate candidates, namely entity pages, redirecting pages, disambiguation pages, and list pages. Then they disambiguate candidates by calculating the similarity between the contextual information and the document as well as category tags on Wikipedia pages. Milne and Witten (2008) generate candidates by gathering all n-grams in the document, and retaining those whose probability exceeds a low threshold. Then they define commonness and relatedness on the hyper-link structure of Wikipedia to disambiguate candidates.
Linking entities in queries has been extensively studied in recent years. TagME (Ferragina and Scaiella, 2010) is a very early work on entity linking in queries. It generates candidates by searching Wikipedia page titles, anchors and redirects. Then disambiguation exploits the structure of the Wikipedia graph, according to a voting scheme based on a relatedness measure inspired by Milne and Witten (2008). The improved version of TagME, named WAT (Piccinno and Ferragina, 2014), uses Jaccard-similarity between two pages' in-links as a measure of relatedness and uses PageRank to rank the candidate entities.
Unlike the work which revolves around ranking entities for query spans, the Entity Recognition and Disambiguation (ERD) Challenge (Carmel et al., 2014) views entity linking in queries as the problem of finding multiple query interpretations. The SMAPH system (Cornolti et al., 2014), which won the short-text track, works in three phases: fetching, candidate-entity generation and pruning. First, they fetch the snippets returned by a commercial search engine. Next, snippets are parsed to identify candidate entities by looking at the boldfaced parts of the search snippets. Finally, they implement a binary classifier using a set of features such as the coherence and robustness of the annotation process and the ranking as well as composition of snippets. They further extend SMAPH-1 to SMAPH-2 (Cornolti et al., 2016). They use the annotator WAT to annotate the snippets of search results to generate candidates, and they combine the additional link-back step with the pruning step in the ranking phase, which gets the state-of-the-art results on the ERD14 dataset and their released dataset GERDAQ.
Our work is different from using search engines to generate candidates. We are the first to propose searching Wikipedia sentences and taking advantage of human annotations to generate candidates. The previous work, such as SMAPH, employs a search engine for candidate generation. However, it uses WAT, an entity search based tool, to pre-annotate the snippets for candidate generation, which falls back into the issues of entity search.

Our Approach
As shown in Figure 1, we introduce our approach with the query "blake shelton austin lyrics". Our approach consists of three main phases: sentence search, candidate generation, and candidate ranking. First, we search the query in all Wikipedia articles to obtain the similar sentences. Second, we extract human-annotated entities from these sentences. We keep the entities whose corresponding anchor texts occur in the query as candidates, and treat others as related entities. Specifically, we obtain three candidates in this example, namely "Blake Shelton", "Austin, Texas", and "Austin (song)". Finally, we use a regression based model to rank the candidate entities. We get the final annotations of "Blake Shelton" and "Austin (song)" whose scores are higher than the threshold selected on the development set. In the following sections, we describe these three phases in detail.

Sentence Search
Sentences in Wikipedia articles usually contain anchors linking to entities. We are therefore motivated to generate the candidate entities based on sentence search instead of the common method using entity search. There are some issues in the original annotations because of the annotation regulation. First, entities in their own pages are usually not annotated. Thus we annotate these entities by matching the text against the page title. Second, entities are usually annotated only in their first appearance. We annotate these entities if they are annotated in previous sentences in the page. Moreover, pronouns are widely used in Wikipedia sentences and are usually not annotated. We use the Stanford CoreNLP toolkit (Manning et al., 2014) to do the coreference resolution. In addition, we use the content in the disambiguation page and the infobox. Although these two kinds of information may have incomplete grammatical structure, they contain enough context information for the sentence search in our task.
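As a rough illustration of how human-annotated anchors can be read out of a Wikipedia sentence, the sketch below parses the `[[Entity|anchor]]` link markup shown in Table 1. It is a minimal sketch and not the authors' extraction pipeline: the regular expression and the helper name `extract_anchors` are assumptions introduced here, and nested links, templates, and file links are ignored.

```python
import re

# Matches [[Target]] and [[Target|surface text]] wiki links (assumed markup subset).
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_anchors(wikitext):
    """Return (plain_sentence, [(anchor_text, entity), ...]) for one sentence."""
    pairs = []

    def _replace(match):
        entity = match.group(1).strip()
        # Without a pipe, the anchor text is the entity title itself.
        anchor = (match.group(2) or entity).strip()
        pairs.append((anchor, entity))
        return anchor

    plain = WIKI_LINK.sub(_replace, wikitext)
    return plain, pairs


sentence = ("[[Austin (song)|Austin]] is the title of a debut song written by "
            "David Kent and Kirsti Manna, and performed by American country "
            "music artist [[Blake Shelton]].")
plain, pairs = extract_anchors(sentence)
print(pairs)
```

For the sentence in Table 1 this yields the two (anchor, entity) pairs discussed in the introduction, and `plain` holds the sentence text that would go into the searchable field.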
We use the Wikipedia snapshot of May 1, 2016, which contains 4.45 million pages and 120 million sentences. We extract sentences that contain at least one anchor in the Wikipedia articles, and extract human-annotated anchors as well as corresponding entities in the sentences. The original annotation contains 82.6 million anchors. We obtain 110 million annotated anchors in 48.4 million sentences after the incremental annotation. All of the above annotations are indexed by Lucene by building documents consisting of two fields: the first one contains the sentence and the second one contains all anchors with their corresponding entities. For each query, we search it with Lucene using its default ranker, based on the vector space model and tf-idf, to obtain the top K sentences (K is selected on the development set). We extract all entities as the related entities and use these sentences as their support sentences.
Figure 1: Example of the linking process of the query "blake shelton austin lyrics".
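To make the retrieval step concrete, the following sketch ranks a handful of sentences against a query with a plain tf-idf vector space model. It is a toy stand-in under stated assumptions, not Lucene and not its exact scoring formula: the idf variant, the whitespace tokenizer, and the function names `build_index` and `search` are all choices made for this illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(sentences):
    """Compute idf weights and tf-idf vectors for a list of sentences."""
    docs = [Counter(tokenize(s)) for s in sentences]
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    # Simplified idf; Lucene's default similarity uses a different formula.
    idf = {t: math.log(1 + n / df[t]) for t in df}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def search(query, idf, vectors, k):
    """Return up to k (score, sentence_index) pairs, best first."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


sentences = [
    "Austin is the title of a debut song performed by Blake Shelton",
    "Austin is the capital of Texas",
    "Her is a film directed by Spike Jonze",
]
idf, vectors = build_index(sentences)
hits = search("blake shelton austin lyrics", idf, vectors, k=2)
print(hits)
```

In this toy corpus the sentence about the song ranks above the sentence about the city because it also matches "blake" and "shelton", which mirrors how query context steers which annotated entities are returned.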

Candidate Generation
We back-map anchors and corresponding entities extracted in sentences to generate candidates. We use (a, e) to denote the pair of the anchor text and corresponding entity and use w(a, e) to denote the number of sentences containing the pair (a, e). Then, we prune the candidate pairs according to the following rules. First, we only keep the pair whose corresponding anchor text a occurs in the query as a candidate, which has been used in previous work (Ferragina and Scaiella, 2010). Second, we follow the long-string match strategy. If we have two pairs $(a_1, e_1)$ and $(a_2, e_2)$ while $a_1$ is a substring of $a_2$, we drop $(a_1, e_1)$ if $w(a_1, e_1) < w(a_2, e_2)$. This is because $a_2$ is typically less ambiguous than $a_1$. For example, for the query "mesa community college football", we can obtain the anchors "mesa", "college", "community college", and "mesa community college". We only keep "mesa community college" because it is the longest and occurs most often in returned sentences. However, if $w(a_1, e_1) > w(a_2, e_2)$, we keep both candidate pairs because $a_1$ is more common in the query.
In addition, we keep the entity whose surface form is the same as the anchor text and prune others. If we have two pairs $(a, e_1)$ and $(a, e_2)$ with the same anchor, and only $e_2$ occurs in the query, we drop the pair $(a, e_1)$ if $w(a, e_1) < w(a, e_2)$. For example, for the query "business day south africa", the anchor "south africa" can be linked to "south africa", "union of south africa", and "south africa cricket team". We only keep the entity "south africa".
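The first two pruning rules above can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes token-level phrase matching for "occurs in the query" and "is a substring of", it does not implement the surface-form rule, and the entity names and sentence counts in the example are hypothetical.

```python
from collections import Counter


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens appears as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return n > 0 and any(
        tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1)
    )


def generate_candidates(query, sentence_pairs):
    """sentence_pairs holds one (anchor, entity) entry per supporting sentence."""
    q_tokens = query.lower().split()
    w = Counter((a.lower(), e) for a, e in sentence_pairs)  # w(a, e)

    # Rule 1: keep only pairs whose anchor text occurs in the query.
    kept = {p: c for p, c in w.items() if contains_phrase(q_tokens, p[0].split())}

    # Rule 2 (long-string match): drop (a1, e1) when a1 lies inside a longer
    # anchor a2 and w(a1, e1) < w(a2, e2).
    result = {}
    for (a1, e1), c1 in kept.items():
        dominated = any(
            a1 != a2 and contains_phrase(a2.split(), a1.split()) and c1 < c2
            for (a2, _), c2 in kept.items()
        )
        if not dominated:
            result[(a1, e1)] = c1
    return result


# Hypothetical counts for the query used in the text.
query = "mesa community college football"
pairs = (
    [("Mesa Community College", "Mesa Community College")] * 5
    + [("college", "College")] * 4
    + [("community college", "Community college")] * 3
    + [("Mesa", "Mesa, Arizona")] * 2
    + [("Arizona", "Arizona")] * 2
)
candidates = generate_candidates(query, pairs)
print(candidates)
```

With these counts only the pair for "mesa community college" survives, matching the outcome described for this query; "Arizona" is removed by the first rule because it does not occur in the query.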

Candidate Ranking
We build a regression based framework to rank the candidate entities. In the training phase, we treat the candidates that are equal to the ground truth as positive samples and the others as negative samples. The regression target of a positive sample is set to 1.0. The regression target of a negative sample is set to the maximum token overlap ratio between its text and each gold answer.
The regression target of a negative sample is not simply set to 0, so that a candidate that is very close to the ground truth still receives a small score. We find that this benefits the final results. We use LIBLINEAR (Fan et al., 2008) with L2-regularized L2-loss support vector regression to train the regression model. The objective function is to minimize

$$\frac{1}{2} w^T w + C \sum_{i} \max\left(0, \left|w^T x_i - y_i\right| - \epsilon\right)^2$$

where $x_i$ is the feature set, $y_i$ is the target score and $w$ is the parameter to be learned. We follow the default setting in which C is set to 1 and ε (eps) is set to 0.1. In the test phase, each candidate gets a score of $w^T x_i$, and we only output the candidates whose score is higher than the threshold selected on the development set.
We employ four different feature sets to capture the quality of a candidate from different aspects. All features are shown in Table 3.
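The training targets and the loss can be sketched in a few lines. The objective below follows the form documented for LIBLINEAR's L2-regularized L2-loss support vector regression, (1/2) w·w + C Σ max(0, |w·x_i − y_i| − ε)². The text does not define the overlap ratio precisely, so the Jaccard ratio over word tokens used here is an assumption, as are the function names.

```python
import re


def token_overlap(a, b):
    """Assumed overlap ratio: shared tokens divided by tokens in the union."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def regression_target(candidate, gold_entities):
    """1.0 for a gold entity, otherwise the best overlap with any gold answer."""
    if candidate in gold_entities:
        return 1.0
    return max((token_overlap(candidate, g) for g in gold_entities), default=0.0)


def svr_objective(w, xs, ys, C=1.0, eps=0.1):
    """Value of the L2-regularized L2-loss SVR objective for weights w."""
    reg = 0.5 * sum(wi * wi for wi in w)
    loss = 0.0
    for x, y in zip(xs, ys):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        # Epsilon-insensitive residual, squared (L2 loss).
        loss += max(0.0, abs(pred - y) - eps) ** 2
    return reg + C * loss


gold = ["Austin (song)"]
print(regression_target("Austin (song)", gold))
print(regression_target("Austin, Texas", gold))
print(svr_objective([0.5], [[1.0]], [1.0]))
```

In practice the minimization itself would be delegated to LIBLINEAR; the function above only evaluates the quantity being minimized so that the roles of C and ε are visible.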
Context-Independent Features
This feature set measures each annotation pair (a, e) without context information. Features 1-4 capture the syntactic properties of the candidate. Feature 5 is the number of returned sentences that contain (a, e). Feature 6 is the maximum search score (returned by Lucene) in its support sentences. Moreover, inspired by TagME (Ferragina and Scaiella, 2010), we denote freq(a) as the number of times the text a occurs in Wikipedia. We use link(a) to denote the number of times the text a occurs as an anchor. We use lp(a) = link(a)/freq(a) to denote the static link-probability that an occurrence of a has been set as an anchor. We use freq(a, e) to denote the number of times that the anchor text a links to the entity e, and use pr(e|a) = freq(a, e)/link(a) to denote the static prior-probability that the anchor text a links to e. Features 7 and 8 are these two probabilities.
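The two static probabilities are simple ratios of corpus counts, as the sketch below shows. The class name `AnchorStats` and the counts for "her" are hypothetical and only serve to illustrate why a common word receives a very low link-probability.

```python
from collections import Counter


class AnchorStats:
    """Corpus counts needed for the two static probabilities."""

    def __init__(self):
        self.freq = Counter()  # freq(a): occurrences of the text a
        self.link = Counter()  # link(a): occurrences of a as an anchor
        self.pair = Counter()  # freq(a, e): times anchor a links to entity e

    def lp(self, a):
        """Static link-probability lp(a) = link(a) / freq(a)."""
        return self.link[a] / self.freq[a] if self.freq[a] else 0.0

    def pr(self, e, a):
        """Static prior-probability pr(e|a) = freq(a, e) / link(a)."""
        return self.pair[(a, e)] / self.link[a] if self.link[a] else 0.0


# Hypothetical counts, for illustration only.
stats = AnchorStats()
stats.freq["her"] = 1_000_000
stats.link["her"] = 50
stats.pair[("her", "Her (film)")] = 20
print(stats.lp("her"), stats.pr("Her (film)", "her"))
```

Both values depend only on Wikipedia-wide counts and never on the query, which is the limitation that the sentence-count feature (feature 5) is meant to offset.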

Context-Matching Features
We treat the words in the query other than the anchor text as the context. This feature set measures how well the context matches the query. Feature 9 is the context matching score calculated by tokens. We denote c as the set of context words. For each $c_i$ in c, $\mathrm{cm\_sc}(c_i)$ is the ratio of times that $c_i$ occurs in the support sentences, and $\mathrm{cm\_sc}(c) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{cm\_sc}(c_i)$, where N is the number of context words. Features 10 and 11 are the ratio of context words occurring in the first sentence in the entity page and the description of the entity's disambiguation page (if it exists), respectively. Moreover, we train 300-dimensional word embeddings on all Wikipedia articles by word2vec (Mikolov et al., 2013) and use the average of the word embeddings as the sentence representation. Feature 12 is the maximum cosine score between the query and each support sentence. Features 13 and 14 are calculated with the first sentence in the entity's page and the description in the disambiguation page.
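A small sketch of the token-based context matching score and of the averaged-embedding cosine follows. It rests on two assumptions that the text leaves open: cm_sc(c_i) is read as the fraction of support sentences containing c_i, and the embeddings are given as a plain dictionary (the two-dimensional vectors in the example are toy values, not word2vec output).

```python
import math


def context_match_score(query, anchor, support_sentences):
    """Mean over context words of the fraction of support sentences containing them."""
    anchor_tokens = set(anchor.lower().split())
    context = [t for t in query.lower().split() if t not in anchor_tokens]
    if not context or not support_sentences:
        return 0.0
    tokenized = [set(s.lower().split()) for s in support_sentences]
    per_word = [
        sum(t in sent for sent in tokenized) / len(tokenized) for t in context
    ]
    return sum(per_word) / len(per_word)


def average_embedding(tokens, embeddings):
    """Average the vectors of known tokens; None if no token is known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


support = ["austin is a song by blake shelton", "the lyrics of austin"]
score = context_match_score("blake shelton austin lyrics", "austin", support)

toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # toy vectors
sim = cosine(average_embedding(["a", "b"], toy_embeddings), toy_embeddings["a"])
print(score, sim)
```

Feature 12 would then be the maximum of such cosine scores over all support sentences of a candidate.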

Relatedness Features of Candidate Entities
This set of features measures how much an entity is supported by other candidates. Feature 15 is the number of other candidate entities occurring in the support sentences. Feature 16 is the number of candidate entities occurring in the same Wikipedia page with the current entity.
Relatedness Features to Related Entities
This set of features measures the relatedness between candidates and related entities outside of queries. Related entities can provide useful signals for disambiguating the candidates. Features 17 and 18 are analogous to features 15 and 16, but are calculated with the related entities.

Experiment
We conduct experiments on the ERD14 and GERDAQ datasets. We compare with several baseline annotators, and experimental results show that our method outperforms the baselines on these two datasets. We also report the parameter selection on each dataset and analyze the quality of the candidates using different methods.

Dataset
ERD14 is a benchmark dataset in the ERD Challenge (Carmel et al., 2014), which contains both a long-text track and a short-text track. In this paper we only focus on the short-text track. It contains 500 queries as the development set and 500 queries as the test set. Due to the lack of a training set, we use the development set for model training and tuning. This dataset can be evaluated by both Freebase and Wikipedia, as the ERD Challenge organizers provide the Freebase Wikipedia Mapping with one-to-one correspondence of entities between the two knowledge bases. We use Wikipedia to evaluate our results.
GERDAQ is a benchmark dataset for annotating entities to Wikipedia built by Cornolti et al. (2016). It contains 500 queries for training, 250 for development, and 250 for test. The queries in this dataset are sampled from the KDD-Cup 2005 and then annotated manually. Both named entities and common concepts are annotated in this dataset.

Evaluation Metric
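We use the average F1 designed by the ERD Challenge (Carmel et al., 2014) as the evaluation metric. Specifically, given a query q with labeled entities $\hat{A} = \{\hat{E}_1, \ldots, \hat{E}_n\}$, we define the F-measure of a set of hypothesized interpretations $A = \{E_1, \ldots, E_m\}$ as follows:

$$\mathrm{Precision} = \frac{|\hat{A} \cap A|}{|A|}, \quad \mathrm{Recall} = \frac{|\hat{A} \cap A|}{|\hat{A}|}, \quad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

The average F1 of the evaluation set is the average of the F1 for each query:

$$\mathrm{Average}\ F_1 = \frac{1}{N} \sum_{i=1}^{N} F_1(q_i)$$

where N is the number of queries in the evaluation set. Following the evaluation guideline in ERD14 and GERDAQ, we define recall to be 1.0 if the gold binding of a query is empty and define precision to be 1.0 if the hypothesized interpretation is empty.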
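The metric, including the two conventions for empty sets, can be sketched as below. This is a minimal sketch assuming the standard precision, recall, and F1 definitions over sets; each gold or hypothesized item is treated as a hashable label, and returning 0 when precision and recall are both 0 is an assumption made to avoid division by zero.

```python
def query_f1(gold, predicted):
    """F1 for one query, with the empty-set conventions described above."""
    gold, predicted = set(gold), set(predicted)
    overlap = len(gold & predicted)
    precision = 1.0 if not predicted else overlap / len(predicted)
    recall = 1.0 if not gold else overlap / len(gold)
    if precision + recall == 0:
        return 0.0  # assumed convention when nothing matches
    return 2 * precision * recall / (precision + recall)


def average_f1(gold_sets, predicted_sets):
    """Mean of the per-query F1 scores over the evaluation set."""
    scores = [query_f1(g, p) for g, p in zip(gold_sets, predicted_sets)]
    return sum(scores) / len(scores) if scores else 0.0


gold_sets = [{"Blake Shelton", "Austin (song)"}, set(), {"Her (film)"}]
predicted_sets = [{"Blake Shelton"}, set(), set()]
print(average_f1(gold_sets, predicted_sets))
```

Note the effect of the conventions: a query with no gold entities scores 1.0 when the system outputs nothing, while a missed entity on a non-empty gold set scores 0 even though precision is defined as 1.0.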

Baseline Methods
We compare with several baselines and use the results reported by the ERD organizer and Cornolti et al. (2016).
AIDA (Hoffart et al., 2011) searches the mention using the Stanford NER Tagger based on YAGO2. We select AIDA as a representative system aimed at entity linking for documents, following the work in Cornolti et al. (2016). WAT (Piccinno and Ferragina, 2014) is the improved version of TagME (Ferragina and Scaiella, 2010). Magnetic IISAS (Laclavik et al., 2014) retrieves the index extracted from Wikipedia, Freebase and DBpedia. Then it exploits the Wikipedia link graph to assess the similarity of candidate entities for disambiguation and filtering. Seznam (Eckhardt et al., 2014) uses Wikipedia and DBpedia to generate candidates. The disambiguation step is based on PageRank over the graph. NTUNLP (Chiu et al., 2014) searches the query to match Freebase surface forms. The disambiguation step is built on top of TagME and Wikipedia. SMAPH-1 (Cornolti et al., 2014) is the winner of the short-text track in the ERD14 Challenge. SMAPH-2 (Cornolti et al., 2016) is the improved version of SMAPH-1. It generates candidates from the snippets of search results returned by the Bing search engine.

Result
We report results on the ERD dataset and the GERDAQ dataset in Table 4 and Table 5, respectively. On the ERD14 dataset, WAT is superior to AIDA but still trails SMAPH-1, the winner of the ERD Challenge, by up to 10%. SMAPH-2 improves on SMAPH-1 by 2%. Our system significantly outperforms the state-of-the-art annotator SMAPH-2 by 4.2%. On the GERDAQ dataset, our system outperforms the state-of-the-art annotator SMAPH-2 by 2.5%. The F1 score on this dataset is much lower than on the ERD dataset because common concepts such as "Week" and "Game", which are not annotated in the ERD dataset, are annotated in the GERDAQ dataset.
Spell checking has been widely used in the baseline annotators, as spelling mistakes are not uncommon in queries (Laclavik et al., 2014). The SMAPH system, which generates candidates from search results, implicitly leverages the spell checking embedded in search engines. In our experiments, spell checking improves the result by 1.0% on the ERD dataset and 7.6% on the GERDAQ dataset. Furthermore, only 6.9% of queries in the ERD14 dataset have spelling mistakes, whereas the number in the GERDAQ dataset is 23.0%. Thus spell checking is more important on the GERDAQ dataset.
Without the additional annotation, the F1 score decreases by 0.6% on the ERD dataset and 1.1% on the GERDAQ dataset. Furthermore, while the F1 score decreases by 2.4% on the ERD dataset and 1.4% on the GERDAQ dataset without the context features, it only decreases by 0.5% on the ERD dataset and 0.2% on the GERDAQ dataset without the relatedness features. Unlike the work on entity linking for documents (Eckhardt et al., 2014; Witten and Milne, 2008), in which features derived from entity relations get promising results,

Parameter Selection
There are two parameters in our framework, namely the number of search sentences and the threshold for final output. We select these two parameters on the development set. We show the F1 score with different numbers of search sentences and thresholds in Figure 2 and Figure 3. On the ERD development set, better results occur with a search number between 600 and 800 and a threshold between 0.55 and 0.6. On the GERDAQ development set, better results occur with a search number between 700 and 1000 and a threshold between 0.45 and 0.5. In our experiment, we set the number of sentences to 700 and the threshold to 0.56 on the ERD dataset, and to 800 and 0.48 on the GERDAQ dataset, according to the F1 scores on the development set.

Model Analysis
The main difference between our method and most previous work is that we generate candidates by searching Wikipedia sentences instead of searching entities. For generating candidates with entity search, we build a dictionary containing all anchors, titles, and redirects in Wikipedia. Then we query the dictionary to get the mention and obtain corresponding entities as candidates. We use the same pruning rules and ranking framework in our experiments, but exclude the features from support sentences because the entity search method does not contain the information. The F1 score is shown in Table 6. Our implementation of the method using entity search achieves results on the ERD dataset similar to those of Magnetic IISAS (Laclavik et al., 2014), which uses a similar method and ranks 4th with an F1 of 65.57 in the ERD14 Challenge.
We compare the two candidate generation methods in several aspects. First, we show the overall results in Table 6. The average number of candidates from our method is much smaller. Note that the anchors from sentence search can also be found in entity search. However, we only extract the entities in the returned sentences, while the methods based on entity search use all entities linked by the anchors. In addition, features from sentence search that provide a query sensitive prior probability, such as the number of sentences containing the entity, contribute to the ranking process. They improve the F1 score from 73.81 to 75.01 for sentence search and from 66.46 to 69.00 for entity search. More importantly, the result of "ES+RF" is still significantly worse than the result obtained with both the small candidate set and the Wikipedia related features, where irrelevant candidates are pruned at the beginning. This shows that a high-quality candidate set is very important, since a larger candidate set brings a lot of noise into the training of the ranking model. Moreover, there are 102 queries (20.4%) without labeled entities in the ERD dataset. We only give 7 incorrect annotations in these queries, while the number is 13 for entity search. Furthermore, as shown in Table 7, the coverage of our method is lower in queries with at least one entity, but we obtain better results on precision, recall and F1 in the final stage.
Figure 4 illustrates the F1 score grouped by the number of candidates using entity search. In almost all columns the F1 score of our method is better than the baseline. In the left columns (the number of candidates is less than 10), both methods generate few candidates. The F1 score of our method is higher, which suggests that we train a better ranking model because of our small but high-quality candidate set. Moreover, the right columns (the number of candidates is more than 10) show that the F1 score using entity search gradually decreases as the number of candidates grows. However, our method based on sentence search takes advantage of context information to keep a small set of candidates, which keeps the result consistent and outperforms the baseline.

Conclusion
In this paper we address the problem of entity linking for open-domain queries. We introduce a novel approach to generating candidate entities by searching Wikipedia for sentences similar to the query and extracting the human-annotated entities as the candidates. We implement a regression model to rank these candidates for the final output. Experiments on the ERD dataset and the GERDAQ dataset show that our approach outperforms the baseline systems. In this work we directly use the default ranker in Lucene for similar sentences, which can be improved in future work.

Figure 2: F1 scores with different search numbers and thresholds on the ERD development set.

Figure 4: F1 scores with number of candidates using different methods on the ERD dataset. The number of queries is shown in the parentheses.

Table 3: Feature Set for Candidate Ranking.

Table 4: Results on the ERD dataset. Results of the baseline systems are taken from Table 8 in Cornolti et al. (2016) and reported by the ERD organizer (Carmel et al., 2014). We only report the F1 score as precision and recall are not reported in previous work. *Significant improvement over state-of-the-art baselines (t-test, p < 0.05).

Table 5: Results on the GERDAQ dataset. Results of the baseline systems are taken from Table 10 in Cornolti et al. (2016).

Table 7: Results for the 398 queries which have at least one labeled entity on the ERD dataset using different candidate generation methods. C_avg is the average recall of candidates per query. P_avg and R_avg are calculated on the final results.