Combining Mention Context and Hyperlinks from Wikipedia for Named Entity Disambiguation

Named entity disambiguation is the task of linking entity mentions to their intended referent, as represented in a Knowledge Base, usually derived from Wikipedia. In this paper, we combine local mention context and global hyperlink structure from Wikipedia in a probabilistic framework. We test our method on eight datasets, improving the state-of-the-art results on five. Our results show that the two models of context, namely, words in the context and hyperlink pathways to other entities in the context, are complementary. Our method is not tuned to any of the datasets, which shows that it is robust to out-of-domain scenarios and that further improvements are possible.


Introduction
Linking mentions occurring in documents to a knowledge base is the main goal of Entity Linking or Named Entity Disambiguation (NED). This problem has attracted a great number of papers in the NLP and IR communities, which have proposed a wide range of techniques, including local context and global inference (Ratinov et al., 2011). We propose a probabilistic framework that combines entity popularity, name popularity, local mention context and global hyperlink structure, relying on information in Wikipedia alone. Entity and name popularity are useful disambiguation clues in the absence of any context. The local mention context provides direct clues (in the form of words in context) to disambiguate each mention separately. The hyperlink structure of Wikipedia provides a global coherence measure for all entities mentioned in the same context.
The advantages of our method with respect to other alternatives are as follows: (1) It does not require combining a large number of methods or classifiers. (2) The method learns its parameters directly from Wikipedia, so no additional hand-labeled data or training is needed. (3) We combine the global hyperlink structure of Wikipedia with a local bag-of-words probabilistic model in an intuitive and complementary way. (4) The absence of training allows for robust results in out-of-domain scenarios.
The evaluation of NED is fragmented, with several popular shared tasks, such as TAC-KBP, ERD or NEEL. Other evaluation datasets include AIDA and KORE50, which are very common in NED evaluation. Note that each dataset poses different problems. For instance, AIDA is composed of news, and systems need to disambiguate all occurring mentions. TAC includes news and discussion forums, and focuses on a large number of mentions for a handful of challenging strings. KORE includes short sentences with very ambiguous mentions. Unfortunately, there is no standard dataset, and many contributions in this area report results on just one or two datasets. We report our results on eight datasets, improving the state-of-the-art results on five.

Resources
The knowledge used by our Bayesian network comes from Wikipedia. We extract three information resources to perform the disambiguation: a dictionary, textual contexts and a graph.
The dictionary is an association between strings and Wikipedia articles. We construct the dictionary using article titles, redirections, disambiguation pages, and anchor text. If the mention links to a disambiguation page, it is associated with all possible articles the disambiguation page points to. Each association between a string and article is scored with the prior probability, estimated as the number of times that the mention occurs in the anchor text of an article divided by the total number of occurrences of the mention. We choose candidate entities for disambiguation by just assigning all entities linked to the mention in the dictionary.
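The prior described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it counts anchor texts alone (titles, redirections and disambiguation pages are left out), and the function name `build_dictionary` and the toy anchor list are hypothetical.

```python
from collections import Counter, defaultdict


def build_dictionary(anchors):
    """Map each mention string to {article: prior probability}.

    `anchors` is an iterable of (mention_string, target_article) pairs,
    one pair per anchor link found in Wikipedia (assumed input format).
    """
    pair_counts = Counter(anchors)

    # Total number of times each mention string occurs as anchor text.
    mention_totals = Counter()
    for (mention, _article), n in pair_counts.items():
        mention_totals[mention] += n

    # Prior = count(mention -> article) / count(mention).
    dictionary = defaultdict(dict)
    for (mention, article), n in pair_counts.items():
        dictionary[mention][article] = n / mention_totals[mention]
    return dictionary


# Toy data, not taken from Wikipedia.
anchors = [
    ("Paris", "Paris"),
    ("Paris", "Paris"),
    ("Paris", "Paris"),
    ("Paris", "Paris_Hilton"),
]
dictionary = build_dictionary(anchors)
candidates = sorted(dictionary["Paris"])  # all entities linked to the mention
```

In this sketch the candidate set for a mention is simply the set of keys stored under that string, which mirrors the candidate selection described in the text.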
In addition, we build a graph using the Wikipedia link structure, where entities are nodes and edges are anchor links among entities from Wikipedia. We used the third-party dictionary and graph described in (Agirre et al., 2015), which are publicly available.
Finally, we extract textual contexts for all the possible candidate entities from a Wikipedia dump. We collect all the anchors including a link to each entity in Wikipedia, and extract a context of 50 words around the anchor link.
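A minimal sketch of the context extraction step follows. The text does not say whether the 50 words are split evenly around the anchor or whether the anchor words themselves are kept, so the even split and the exclusion of the anchor are assumptions, as are the function name `context_window` and the token-index interface.

```python
def context_window(tokens, start, end, size=50):
    """Return up to `size` words around the anchor tokens[start:end].

    Assumption: the window is split evenly, size // 2 words on each side,
    and the anchor words themselves are not part of the context.
    """
    half = size // 2
    left = tokens[max(0, start - half):start]
    right = tokens[end:end + half]
    return left + right


# Toy document of 100 tokens with a two-word anchor at positions 40-41.
tokens = [f"w{i}" for i in range(100)]
context = context_window(tokens, 40, 42)
```

Near the start or end of a document the window is simply shorter, since the slices are clipped to the available tokens.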

A Generative Bayesian Network
Given a mention s occurring in context c, our system ranks each of the candidate entities e. Figure 1 shows the dependencies among the different variables. Note that context probability is given by two different resources.
Candidate entities are ranked by combining evidence from four probability distributions, which we call entity knowledge $P(e)$, name knowledge $P(s|e)$, context knowledge $P(c_{bow}|e)$ and graph knowledge $P(c_{grf}|e)$, respectively.
Entity knowledge $P(e)$ represents the probability of generating entity $e$, and is estimated as follows:
$$P(e) = \frac{Count(e) + 1}{|M| + N}$$
where $Count(e)$ describes the entity popularity, i.e., the number of times the entity $e$ is referenced within Wikipedia, $|M|$ is the number of entity mentions and $N$ is the total number of entities in Wikipedia. As can be seen, the estimation is smoothed using the add-one method.
Name knowledge $P(s|e)$ represents the probability of generating a particular string $s$ given the entity $e$, and is estimated as follows:
$$P(s|e) = \frac{Count(e, s) + 1}{Count(e) + S}$$
where $Count(e, s)$ is the number of times mention $s$ is used to refer to entity $e$ and $S$ is the number of different possible names used to refer to $e$.
The context knowledge is modeled in two different ways. In the bag-of-words model, $P(c_{bow}|e)$ represents the probability of generating context $c = \{w_1, w_2, \ldots, w_n\}$ given the entity $e$, and is estimated as follows:
$$P(c_{bow}|e) = P_e(w_1) P_e(w_2) \cdots P_e(w_n)$$
where $P_e(w)$ is estimated as:
$$P_e(w) = \lambda \hat{P}_e(w) + (1 - \lambda) P_w(w)$$
Here $\hat{P}_e(w)$ is the maximum likelihood estimate of each word $w$ in the contexts of entity $e$. Context words are smoothed by $P_w(w)$, the likelihood of words in the whole of Wikipedia. The parameter $\lambda$ is set to 0.9 according to development experiments on the Aida development set (also known as Aida test-a).
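The three estimates can be illustrated with a short Python sketch. It assumes the add-one forms (Count(e) + 1) / (|M| + N) and (Count(e, s) + 1) / (Count(e) + S), and a linear interpolation between the entity-specific and Wikipedia-wide word distributions with weight λ on the entity-specific part. The function names, the `floor` value for words unseen in the background model, and the use of log-probabilities are assumptions made for the example.

```python
import math
from collections import Counter


def entity_knowledge(count_e, num_mentions, num_entities):
    """P(e) with add-one smoothing: (Count(e) + 1) / (|M| + N)."""
    return (count_e + 1) / (num_mentions + num_entities)


def name_knowledge(count_e_s, count_e, num_names):
    """P(s|e) with add-one smoothing: (Count(e, s) + 1) / (Count(e) + S)."""
    return (count_e_s + 1) / (count_e + num_names)


def log_context_knowledge(context, entity_word_counts, background,
                          lam=0.9, floor=1e-12):
    """log P(c_bow|e) as a sum of log P_e(w) over the context words.

    entity_word_counts: Counter of words seen in the contexts of e.
    background: dict word -> P_w(w), the Wikipedia-wide likelihood.
    floor: hypothetical fallback for words missing from `background`,
           so that the logarithm is always defined.
    """
    total = sum(entity_word_counts.values())
    log_p = 0.0
    for w in context:
        p_ml = entity_word_counts[w] / total if total else 0.0
        p = lam * p_ml + (1 - lam) * background.get(w, floor)
        log_p += math.log(p)
    return log_p


# Toy numbers chosen for easy arithmetic, not real Wikipedia counts.
p_e = entity_knowledge(count_e=9, num_mentions=90, num_entities=10)
p_s_e = name_knowledge(count_e_s=3, count_e=6, num_names=2)
log_p_bow = log_context_knowledge(
    ["a"], Counter({"a": 1, "b": 1}), {"a": 0.5, "b": 0.5})
```

Working in log space avoids numerical underflow when the product runs over dozens of context words; it does not change the ranking of candidates.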
The graph knowledge is estimated using personalized PageRank. We used the probabilities returned by UKB (Agirre et al., 2015). This software returns $P(e|c_{grf})$, the probability of visiting a candidate entity when performing a random walk on the Wikipedia graph starting from the entity mentions in the context. In order to introduce it in the generative model, we must first convert it to $P(c_{grf}|e)$. We use Bayes' formula to estimate the probability:
$$P(c_{grf}|e) = \frac{P(e|c_{grf}) P(c_{grf})}{P(e)}$$
Finally, the Full Model combines all the evidence to find the entity that maximizes the following formula:
$$e = \arg\max_e P(s, c_{bow}, c_{grf}, e) = \arg\max_e P(e) P(s|e) P(c_{bow}|e) P(c_{grf}|e)$$
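The Bayes conversion and the final ranking can be sketched as follows. The sketch assumes that the Full Model is the product of the four probabilities, that P(c_grf) can be dropped because it is the same for every candidate of a given mention, and that all inputs are strictly positive. The function names and the dictionary layout of the candidate features are hypothetical.

```python
import math


def graph_knowledge(p_e_given_graph, p_e):
    """P(c_grf|e) up to a constant: P(e|c_grf) / P(e).

    P(c_grf) is omitted because it does not depend on the candidate
    and therefore cannot change the arg max.
    """
    return p_e_given_graph / p_e


def full_model(candidates):
    """Return the candidate with the highest combined log score.

    candidates: dict entity -> dict with keys
        p_e       P(e)
        p_s_e     P(s|e)
        log_p_bow log P(c_bow|e)
        p_e_grf   P(e|c_grf), e.g. a personalized PageRank value
    All probabilities are assumed to be strictly positive.
    """
    def score(f):
        return (math.log(f["p_e"])
                + math.log(f["p_s_e"])
                + f["log_p_bow"]
                + math.log(graph_knowledge(f["p_e_grf"], f["p_e"])))

    return max(candidates, key=lambda e: score(candidates[e]))


# Toy candidates: A is more popular, B fits the context and graph better.
candidates = {
    "A": {"p_e": 0.6, "p_s_e": 0.5, "log_p_bow": -10.0, "p_e_grf": 0.2},
    "B": {"p_e": 0.1, "p_s_e": 0.5, "log_p_bow": -8.0, "p_e_grf": 0.7},
}
best = full_model(candidates)
```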

Experiments
We tested our algorithms on a wide range of datasets: AIDA CoNLL-YAGO test-b (Hoffart et al., 2011), KORE50 (Hoffart et al., 2012) and six TAC-KBP datasets corresponding to six years of the competition (Aida, Kore and Tac hereafter). No corpus was used for training the parameters of the system, apart from Wikipedia, as explained in the previous sections.
We used gold-standard mentions and evaluated only those mentions linked to a Wikipedia entity (ignoring so-called NIL cases). Depending on the dataset, we used the customary evaluation measure: micro-accuracy (Aida, Kore, Tac09 and Tac10) or Bcubed+ (Tac11, Tac12, Tac13 and Tac14).
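As an illustration of this evaluation setting, the following sketch computes micro-accuracy over non-NIL gold mentions only. The `"NIL"` label, the dictionary-based input format and the function name are assumptions for the example; Bcubed+ is not covered here.

```python
def micro_accuracy(gold, predicted, nil="NIL"):
    """Fraction of non-NIL gold mentions whose predicted entity is correct.

    gold, predicted: dict mention_id -> entity (assumed format).
    Mentions whose gold label equals `nil` are ignored entirely.
    """
    scored = [(m, e) for m, e in gold.items() if e != nil]
    if not scored:
        return 0.0
    correct = sum(1 for m, e in scored if predicted.get(m) == e)
    return correct / len(scored)


# Toy example: m3 is a NIL mention and does not count.
gold = {"m1": "A", "m2": "B", "m3": "NIL", "m4": "C"}
predicted = {"m1": "A", "m2": "X", "m3": "A", "m4": "C"}
accuracy = micro_accuracy(gold, predicted)
```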
Each gold standard uses a different Wikipedia version: 2010 for Aida and Kore, 2008 for Tac. We use the Wikipedia dump from 25-5-2011 to build our resources, as this is close to the versions used at the time. We mapped gold-standard entities to 2011 Wikipedia automatically, using redirects in the 2011 Wikipedia. This mapping could cause a small degradation of our results.
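One possible way to implement such a mapping is to follow redirect chains until a title that exists in the target dump is reached. The sketch below is an assumption about how this could be done, not a description of the actual procedure; the function name and the toy redirect table are hypothetical.

```python
def map_to_target_dump(entity, redirects, titles):
    """Follow redirects until a non-redirect title is reached.

    redirects: dict old_title -> new_title from the target dump.
    titles: set of article titles in the target dump.
    Returns None when the entity cannot be mapped.
    """
    seen = set()
    while entity in redirects and entity not in seen:
        seen.add(entity)  # guards against redirect cycles
        entity = redirects[entity]
    return entity if entity in titles else None


# Toy redirect table, not taken from a real dump.
redirects = {"Old_Name": "Interim_Name", "Interim_Name": "New_Name"}
titles = {"New_Name", "Paris"}
mapped = map_to_target_dump("Old_Name", redirects, titles)
```

Gold entities that return None in such a scheme would have no counterpart in the newer dump, which is one way the small degradation mentioned above could arise.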

Results
The top four rows in Table 1 show the performance of the different combinations of probabilities. The last row shows the best results reported to date on those datasets (see caption for details).
The results suggest that each probability contributes to the final score of the Full Model, shown in row 4, and that the two context models are complementary to each other 10. The only exception is Tac13, where the bow model is best.
Our system obtains very good results on all datasets, excelling on Tac09 through Tac13, where it beats the state of the art. The figures obtained by the Full Model on Aida, Kore and Tac14 are close to the best results. Note that, for each dataset, the table shows the result of the best-performing system, that is, our system is compared not to a single system but to the best of all of them. For example, (Hoffart et al., 2012) reported lower figures for Kore, 64.58. Regarding the results for TAC-KBP, the full task includes linking to the Knowledge Base and detecting and clustering NIL mentions. In order to make results comparable to those for Aida and Kore, the table reports the results for mentions which are linked to the Knowledge Base, that is, results where NIL mentions are discarded.

Adjusting the model to the data
We experimented with weighting the probabilities to adapt the Full Model described above to a specific scenario. For the Weighted Full Model, we introduce the parameters α, β, γ and δ 11 as follows:
$$e = \arg\max_e P(s, c_{bow}, c_{grf}, e) = \arg\max_e P(e)^{\alpha} P(s|e)^{\beta} P(c_{bow}|e)^{\gamma} P(c_{grf}|e)^{\delta}$$
Since weighting may change the optimal value of λ, we optimized all parameters on the development set of Aida with an exhaustive grid search using a step size of 0.1, which yielded λ = 0.8, α = 0.2, β = 0.1, γ = 0.6 and δ = 0.1. These parameters gave high results on development, up to 82.88. Table 2 summarizes the results of the Weighted Full Model for Aida, showing that the model reaches 83.61 points, close to the best micro-accuracy reported by (Houlsby and Ciaramita, 2014) and above those reported by (Hoffart et al., 2011; Moro et al., 2014). The values of (Hoffart et al., 2011) and (Moro et al., 2014) for Aida are, respectively,
10 The results of our combination involving the UKB software are not comparable to those reported by (Agirre et al., 2015), due to the different formulation of the probability distribution, which involves the prior.
11 α + β + γ + δ = 1
Table 1: Bold marks the best value among probability combinations, and * marks those results that exceed the best value reported in the state of the art: (Houlsby and Ciaramita, 2014) for Aida, (Moro et al., 2014) for Kore, (Han and Sun, 2011) for Tac09 and see TAC-KBP proceedings for the rest.
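The weighted combination and the grid search over weights can be sketched as below. The sketch enumerates every (α, β, γ, δ) on a 0.1 grid that sums to 1 and scores candidates with the weighted sum of log-probabilities, which is the logarithm of the weighted product. Whether weights of exactly 0 were part of the search is not stated, so allowing them is an assumption; λ would be one more dimension of the same search and is left out for brevity. The `evaluate` callback stands in for a development-set accuracy function and is hypothetical.

```python
from itertools import product


def weight_grid(steps=10):
    """Yield (alpha, beta, gamma, delta) on a 1/steps grid summing to 1."""
    for a, b, g in product(range(steps + 1), repeat=3):
        d = steps - a - b - g
        if d >= 0:
            yield (a / steps, b / steps, g / steps, d / steps)


def weighted_score(log_probs, weights):
    """Weighted log score of one candidate.

    log_probs: (log P(e), log P(s|e), log P(c_bow|e), log P(c_grf|e)).
    weights:   (alpha, beta, gamma, delta).
    """
    return sum(w * lp for w, lp in zip(weights, log_probs))


def grid_search(evaluate, steps=10):
    """Return the weight tuple with the highest value of evaluate(weights)."""
    return max(weight_grid(steps), key=evaluate)


grid = list(weight_grid())
# Stand-in objective for illustration only: prefers a large gamma.
best_weights = grid_search(lambda w: w[2])
```

Integer grid indices are used so that each weight tuple sums to exactly `steps` before division, which avoids accumulating floating-point error while enumerating the grid.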

Related Work
The use of Wikipedia for named entity disambiguation is a common approach in this area. In the related field of Wikification, (Ratinov et al., 2011) introduced the supervised combination of a large number of global and local similarity measures. They learn weights for each of those measures by training a supervised classifier on Wikipedia. Our approach is different in that we just combine four intuitive methods, without having to learn weights for them. Unfortunately, they do not report results for NED. (Moro et al., 2014) present a complex graph-based approach for NED and Word Sense Disambiguation which works on BabelNet, a complex combination of several resources including, among others, Wikipedia, WordNet and Wiktionary. Our results are stronger on Aida, but not on the smaller Kore. (Hoffart et al., 2011) present a robust method based on entity popularity and similarity measures, which are used to build a mention/entity graph. They include external knowledge from Yago, and train a classifier on the training part of Aida, obtaining results comparable to ours. Given that we do not train on in-domain training corpora, we think our system is more robust.
12 Note that values by (Hoffart et al., 2011) were reported on a subset of Aida. The micro-accuracy results reported in our table correspond to the latest best model from the Aida web site: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/.
The use of probabilistic models based on Wikipedia for NED was introduced in (Han and Sun, 2011). In this paper, we extend the model with a global model which takes the hyperlink structure of Wikipedia into account. (Houlsby and Ciaramita, 2014) present a probabilistic method using topic models, where topics are associated with Wikipedia articles. They present strong results, but they need to initialize the sampler on another NED system, Tagme (Ferragina and Scaiella, 2012). In some sense they also combine the knowledge in the graph with that of a local algorithm (Tagme), so their work is complementary to ours. They only provide results on AIDA, and it is thus not possible to see whether their method is as robust as our algorithm.

Conclusions and future work
Bayesian networks provide a principled method to combine knowledge sources. In this paper we combine popularity, name knowledge and two methods to model context: bag-of-words context and hyperlink graph. The combination outperforms the state of the art in five out of eight datasets, showing the robustness of the system across different domains and dataset types. Our results also show that in all but one dataset the combination outperforms individual models, indicating that bag-of-words context and graph context are complementary. We show that results can be further improved by tuning the weights on in-domain development corpora.
Given that Bayesian networks can be further extended, we are exploring the introduction of additional models of context into a Markov Random Field algorithm. Our current model assumes that the two models of context (bag of words and graph) are independent given e, and we would like to explore alternatives that relax this assumption. We would also like to explore whether more sophisticated smoothing techniques could improve our probability estimates.