ELDEN: Improved Entity Linking Using Densified Knowledge Graphs

Entity Linking (EL) systems aim to automatically map mentions of an entity in text to the corresponding entity in a Knowledge Graph (KG). The degree of connectivity of an entity in the KG directly affects an EL system's ability to correctly link mentions in text to the entity in the KG. This causes many EL systems to perform well only for entities well connected to other entities in the KG, bringing into focus the role of KG density in EL. In this paper, we propose Entity Linking using Densified Knowledge Graphs (ELDEN). ELDEN is an EL system which first densifies the KG with co-occurrence statistics from a large text corpus, and then uses the densified KG to train entity embeddings. Entity similarity measured using these trained entity embeddings results in improved EL. ELDEN outperforms state-of-the-art EL systems on benchmark datasets. Due to this densification, ELDEN also performs well for sparsely connected entities in the KG. ELDEN's approach is simple yet effective. We have made ELDEN's code and data publicly available.


Introduction
Entity Linking (EL) is the task of mapping mentions of an entity in text to the corresponding entity in a Knowledge Graph (KG) (Hoffart et al., 2011; Dong et al., 2014; Chisholm and Hachey, 2015). EL systems primarily exploit two types of information: (1) similarity of the mention to the candidate entity string, and (2) coherence between the candidate entity and other entities mentioned in the vicinity of the mention in text. Coherence essentially measures how well the candidate entity is connected, either directly or indirectly, with other KG entities mentioned in the vicinity (Milne and Witten, 2008; Globerson et al., 2016).

Figure 1: Improving entity disambiguation performance using KG densification: the number of edges to WWW conference, a sparsely connected entity in the KG, is increased by adding edges from pseudo entity Program Committee, whose mention co-occurs with it in a web corpus. ELDEN, the system proposed in this paper, uses this densified KG to correctly link the ambiguous mention WWW to WWW conference, instead of to the more popular entity World Wide Web.

In the state-of-the-art EL system by Yamada et al. (2016), coherence is measured as the distance between embeddings of entities. This system performs well on entities which are densely connected in the KG, but not so well on sparsely connected entities.
We demonstrate this problem using the example sentence in Figure 1. This sentence has two mentions: Andrei Broder and WWW. The figure also shows mention-entity linkages, i.e., mentions and their candidate entities in the KG. Using a conventional EL system, the first mention Andrei Broder can be easily linked to Andrei Broder using string similarity between the mention and candidate entity strings. String similarity works well in this case as the mention is unambiguous in the given setting. However, the second mention WWW has two candidates, World Wide Web and WWW conference, and hence is ambiguous. In such cases, a coherence measure between the candidate entity and other unambiguously linked entities is used for disambiguation.
State-of-the-art EL systems measure coherence as similarity between embeddings of entities. The entity embeddings are trained based on the number of common edges in the KG. In our example, the common edges are the edges World Wide Web shares with Andrei Broder and the edges WWW conference shares with Andrei Broder. But WWW conference has fewer edges (it is a sparsely connected entity) than World Wide Web. This leads to poor performance, whereby WWW is erroneously linked to World Wide Web instead of WWW conference.
In this paper, we propose ELDEN, an EL system which increases the nodes and edges of the KG using information available on the web about entities and pseudo entities. Pseudo entities are words and phrases that occur frequently in Wikipedia and co-occur with mentions of KG entities in a web corpus. ELDEN uses a web corpus to find pseudo entities and refines the co-occurrences with the Pointwise Mutual Information (PMI) (Church and Hanks, 1989) measure. ELDEN then adds edges to the entity from pseudo entities. In Figure 1, the pseudo entity Program Committee co-occurs with mentions of Andrei Broder and WWW conference in the web corpus and has a positive PMI value with both. So ELDEN adds edges from Program Committee to Andrei Broder and WWW conference, densifying the neighborhood of these entities. Coherence, now measured as similarity between entity embeddings trained on the densified KG, leads to improved EL performance.
Density (the number of KG edges) of a candidate entity affects EL performance. In our analysis of density and the number of entities having that density in the Wikipedia KG, we find that entities with 500 edges or fewer make up more than 90% of all entities. Thus, creating an EL system that performs well on densely as well as sparsely connected entities is a challenging, yet unavoidable, problem.
We make the following contributions:
• ELDEN presents a simple yet effective graph densification method which may be applied to improve EL involving any KG.
• By using pseudo entities and unambiguous mentions of entities in a corpus, we demonstrate how a non-entity-linked corpus can be used to improve EL performance.
• We have made ELDEN's code and data publicly available.

Related Work
Entity linking: Most EL systems use coherence among entities (Cheng and Roth, 2013) to link mentions. We studied the coherence measures and datasets used in six recent EL systems (He et al., 2013; Huang et al., 2015; Sun et al., 2015; Yamada et al., 2016; Globerson et al., 2016; Barrena et al., 2016). The two popular datasets used for evaluating EL (Chisholm and Hachey, 2015) on documents are CoNLL (Hoffart et al., 2011) and TAC2010 (Ji et al., 2010), hereafter TAC. The popular coherence measures used are (1) WLM, (2) entity embedding similarity and (3) Jaccard similarity (Chisholm and Hachey, 2015; Guo et al., 2013). WLM is widely acknowledged as the most popular (Hoffart et al., 2012), with almost all of the six approaches analyzed above using WLM or its variants. Entity embedding similarity (Yamada et al., 2016) is reported to give the highest EL performance and is the baseline of ELDEN.

Enhancing entity disambiguation: Among the methods proposed in the literature to enhance entity disambiguation using a KG, Bhattacharya and Getoor use additional relational information between database references; Han and Zhao (2010) use semantic relatedness between entities in other KGs; and Shen et al. (2018) use paths consisting of defined relations between entities in the KG (IMDB and DBLP). All these methods utilize structured information, while our method shows how unstructured data (a web corpus about the entity to be linked) can be effectively used for entity disambiguation.

Entity embeddings: ELDEN presents a method to enhance embeddings of entities and words in a common vector space. Word embedding methods like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) have been extended to entities in EL (Yamada et al., 2016; Fang et al., 2016; Zwicklbauer et al., 2016; Huang et al., 2015). These methods use data about entity-entity co-occurrences to improve the entity embeddings.
In ELDEN, we improve the entity embeddings with web corpus co-occurrence statistics. Ganea and Hofmann (2017) present an interesting neural model for jointly learning entity embeddings along with mentions and contexts.

KG densification with pseudo entities: KG densification using an external corpus has been studied by Kotnis et al. (2015) and Hegde and Talukdar (2015); Kotnis et al. augment the KG with edges inferred from an external corpus. Densifying the edge graph is also studied as 'link prediction' in the literature (Martínez et al., 2016). Damani (2013) shows that, when corpus-level significant co-occurrences are considered, PMI is better than other measures. Budiu et al. (2007) compare Latent Semantic Analysis (LSA), PMI and Generalized Latent Semantic Analysis (GLSA) and conclude that for large corpora like a web corpus, PMI works best on word similarity tests. Hence, we chose PMI to refine co-occurring mentions of entities in the web corpus.

Definitions and Problem Formulation
In this section, we present a few definitions and formulate the EL problem.

Knowledge Graph (KG): A Knowledge Graph is defined as G = (E, F) with entities E as nodes and F as edges. In allegiance to EL literature and baselines (Milne and Witten, 2008; Globerson et al., 2016), we use the Wikipedia hyperlink graph as the KG in this paper, where nodes correspond to Wikipedia articles and edges are incoming links from one Wikipedia article to another. ELDEN ultimately uses a densified version of this Wikipedia KG, as described in Section 4.

Sparsely connected entities: Following Hoffart et al. (2012), we define an entity to be sparsely connected if the number of edges incident on the entity node in the KG is less than a threshold η. Otherwise, the entity is called densely connected.

Entity Linking (EL): Given a set of mentions M_D = {m_1, ..., m_n} in a document D, and a knowledge graph G = (E, F), the problem of entity linking is to find an assignment Λ of mentions to entities. For mention m_i ∈ M_D, let the set of possible entities it can link to (candidate entities) be C_i. Then, the solution to the EL problem is an assignment Λ where

    Λ(m_i) = argmax_{e ∈ C_i} [ φ(m_i, e) + β · ψ(e, E_D) ]    (1)

Here, φ(m_i, e) ∈ [0, 1] measures the contextual compatibility of mention m_i and entity e, and is obtained by combining prior probability and context similarity. ψ(e, E_D) measures the coherence of e with the set of entities E_D. β is a variable controlling the inclusion of ψ in the assignment Λ.

Problem Formulation: Yamada et al. (2016) is a recently proposed state-of-the-art EL system. We consider it a representative EL system and use it as the main baseline for the experiments in this paper. Below, we briefly describe Yamada et al. (2016)'s two-step approach that solves the EL problem presented above.
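The assignment in Equation 1 amounts to an argmax over each mention's candidate set. The snippet below is a minimal sketch, not ELDEN's implementation; the φ and ψ scores are hypothetical values chosen to mirror the running example from Figure 1.

```python
from typing import Callable, Dict, List

def link_mentions(
    mentions: List[str],
    candidates: Dict[str, List[str]],          # m -> candidate entities C_i
    phi: Callable[[str, str], float],          # contextual compatibility phi(m, e)
    psi: Callable[[str, List[str]], float],    # coherence psi(e, E_D)
    linked: List[str],                         # already-linked entities E_D
    beta: float = 1.0,
) -> Dict[str, str]:
    """Assign each mention the candidate maximizing phi(m, e) + beta * psi(e, E_D)."""
    assignment = {}
    for m in mentions:
        assignment[m] = max(candidates[m],
                            key=lambda e: phi(m, e) + beta * psi(e, linked))
    return assignment

# Toy run with hypothetical scores for the paper's running example:
# the popular entity wins on phi, but coherence with Andrei_Broder flips the result.
phi = lambda m, e: {("WWW", "World_Wide_Web"): 0.6,
                    ("WWW", "WWW_conference"): 0.4}[(m, e)]
psi = lambda e, E_D: {"World_Wide_Web": 0.1, "WWW_conference": 0.5}[e]
result = link_mentions(["WWW"], {"WWW": ["World_Wide_Web", "WWW_conference"]},
                       phi, psi, linked=["Andrei_Broder"])
# With beta = 1: 0.6 + 0.1 = 0.7 vs 0.4 + 0.5 = 0.9, so WWW_conference is chosen.
```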
Step 1: Assigning entities first to unambiguous mentions has been found to be helpful in prior research (Milne and Witten, 2008; Guo and Barbosa, 2014). Let A_D be the set of entities linked to in this step. In Figure 1, mention Andrei Broder is unambiguous. Step 2: In this step, Yamada et al. link all ambiguous mentions by solving Equation 2.
    Λ(m_i) = argmax_{e ∈ C_i} [ φ(m_i, e) + ψ(e, A_D) ],  where  ψ(e, A_D) = (1/|A_D|) Σ_{e_j ∈ A_D} cos(v_e, v_{e_j})    (2)

Here v_e, v_{e_j} ∈ R^d are d-dimensional embeddings of entities e and e_j respectively. Note that the equation above is a reformulation of Equation 1 with β = 1 and ψ computed over A_D, the set of entities linked in Step 1. As coherence ψ(e, A_D) is applied only in the disambiguation of ambiguous mentions, we apply densification only to selected nodes of the KG.
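As a sketch, this coherence term can be read as the mean cosine similarity between a candidate's embedding and the embeddings of the entities linked in Step 1. The function below is an illustrative reading of Equation 2 assuming embeddings are stored as NumPy arrays; it is not the baseline's actual code.

```python
import numpy as np

def coherence(v_e: np.ndarray, A_D: np.ndarray) -> float:
    """psi(e, A_D): mean cosine similarity between candidate embedding v_e
    and the embeddings of entities linked in Step 1 (the rows of A_D)."""
    v = v_e / np.linalg.norm(v_e)                       # normalize candidate
    A = A_D / np.linalg.norm(A_D, axis=1, keepdims=True)  # normalize each row
    return float(np.mean(A @ v))                        # average cosine over A_D
```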
Embeddings of entities are generated using the word2vec model and trained using WLM (details in Section 5). Given a graph G = (E, F), the WLM coherence measure ψ_wlm(e_i, e_j) between two entities e_i and e_j is defined in Equation 3, where C_e is the set of entities with an edge to entity e:

    ψ_wlm(e_i, e_j) = 1 − [ log(max(|C_{e_i}|, |C_{e_j}|)) − log(|C_{e_i} ∩ C_{e_j}|) ] / [ log(|E|) − log(min(|C_{e_i}|, |C_{e_j}|)) ]    (3)
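For reference, Equation 3 can be implemented directly from the neighbor-set sizes. The sketch below follows the standard Milne and Witten (2008) formulation, with a conventional fallback of 0 when the neighbor sets are disjoint.

```python
import math

def wlm(c_i: set, c_j: set, num_entities: int) -> float:
    """Wikipedia Link-based Measure (Milne and Witten, 2008).
    c_i, c_j: sets of entities sharing an edge with e_i and e_j (C_e in the text);
    num_entities: |E|, the total number of entities in the KG."""
    common = c_i & c_j
    if not common:
        return 0.0  # no shared neighbors: treat relatedness as zero
    num = math.log(max(len(c_i), len(c_j))) - math.log(len(common))
    den = math.log(num_entities) - math.log(min(len(c_i), len(c_j)))
    return 1.0 - num / den

# Entities whose neighborhoods fully overlap get the maximum score of 1.0.
print(wlm({"a", "b"}, {"a", "b"}, num_entities=1000))  # 1.0
```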

Our Approach: ELDEN
In this section, we present ELDEN, our proposed approach. ELDEN extends Yamada et al. (2016) with one important difference: rather than working with the input KG directly, ELDEN works with a densified KG, created by selectively densifying the KG with co-occurrence statistics extracted from a large corpus. Though this is a simple change, it results in improved EL performance, and the method performs well even for sparsely connected entities.
Overview: An overview of the ELDEN system is shown in Figure 2. ELDEN starts with densification of the input KG using statistics from a web corpus. Embeddings of entities are then learned using the densified KG. Embedding similarity estimated from the learned entity embeddings is used to calculate the coherence measure in subsequent EL. The notation used is summarized in Table 1.
(i) KG Densification

Figure 2 depicts densification of the KG ('Input KG' and 'Densified KG'). It shows two Wikipedia titles, Andrei Broder and WWW conference, from our running example (Figure 1). There are no edges common between Andrei Broder and WWW conference. In a web corpus, mentions of Andrei Broder and WWW conference co-occur with Program committee, and it has a positive PMI value with both entities. So ELDEN adds an edge from Program committee to both entities. Here Program committee is a pseudo entity. Thus, ELDEN densifies the KG by adding edges from pseudo entities when the mentions of a Wikipedia entity and a pseudo entity co-occur in a web corpus and the pseudo entity has a positive PMI value with the given entity.

Taking a closer look, the KG densification process starts from the 'input KG', which is the Wikipedia hyperlink graph G = (E, F), where the nodes are Wikipedia titles (E) and the edges are hyperlinks (F). ELDEN processes the Wikipedia text corpus and identifies phrases (unigrams and bigrams) that occur frequently, i.e., more than 10 times in it. We denote these phrases as pseudo entities (S) and add them as nodes to the KG. Let E+ = E ∪ S be the resulting set of nodes.
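The pseudo-entity extraction step described above (unigrams and bigrams occurring more than 10 times) can be sketched as below. The function name and whitespace tokenization are illustrative assumptions, not ELDEN's released code.

```python
from collections import Counter

def pseudo_entities(corpus_lines, min_count=10):
    """Collect unigrams and bigrams occurring more than min_count times;
    these become the pseudo-entity nodes S added to the KG."""
    counts = Counter()
    for line in corpus_lines:
        tokens = line.lower().split()
        counts.update(tokens)                    # unigram counts
        counts.update(zip(tokens, tokens[1:]))   # bigram counts
    return {" ".join(p) if isinstance(p, tuple) else p
            for p, c in counts.items() if c > min_count}
```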
ELDEN then adds edges connecting entities in E+ to entities in E. This is done by processing a web text corpus looking for mentions of entities in E+, and linking the mentions to entities in the KG G = (E+, F). ELDEN uses Equation 1 with β = 0 for this entity linking, i.e., only mention-entity similarity φ(m, e) is used during this linking. Based on this entity-linked corpus, a co-occurrence matrix M of size |E+| × |E+| is constructed. Each cell M_{i,j} is set to the PMI between the entities e_i and e_j. In other words, we augment the set of initial edges F with additional edges connecting entities in E+ to entities in E such that the PMI between the entities is positive.

Figure 2: ELDEN consists of KG densification, training entity embeddings on the densified KG, and building an EL system that uses similarity between the trained embeddings as coherence measure. ELDEN's difference from the baseline method is that while the baseline uses the input KG for training v_e, ELDEN uses the densified KG. Hence, the improved performance of ELDEN is attributed solely to densification.
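A minimal sketch of this PMI-based edge selection: co-occurrences are counted over windows, PMI is computed from window-level probability estimates, and only positive-PMI pairs are kept as new edges. The window-based probability estimates and the function interface are simplifying assumptions for illustration, not the exact construction of M.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(cooc_windows, min_pmi=0.0):
    """cooc_windows: list of windows, each a list of entity mentions.
    Computes PMI(e_i, e_j) = log[ p(e_i, e_j) / (p(e_i) * p(e_j)) ] and
    returns pairs with PMI above min_pmi as candidate densification edges."""
    single, pair = Counter(), Counter()
    n = len(cooc_windows)
    for window in cooc_windows:
        ents = set(window)
        single.update(ents)
        pair.update(frozenset(p) for p in combinations(sorted(ents), 2))
    edges = {}
    for p, c in pair.items():
        e_i, e_j = sorted(p)
        pmi = math.log((c / n) / ((single[e_i] / n) * (single[e_j] / n)))
        if pmi > min_pmi:
            edges[(e_i, e_j)] = pmi
    return edges
```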
ELDEN now constructs the KG G dense = (E + , F + ), which is a densified version of the input KG G = (E, F ). ELDEN uses this densified KG G dense for subsequent processing and entity linking.
(ii) Learning Embeddings of Densified KG Entities

ELDEN derives entity embeddings using the same setup, corpus and word2vec skip-gram with negative sampling model as Yamada et al. (2016). However, instead of training embeddings over the input KG, ELDEN trains embeddings of entities in the densified KG G_dense. Let V be the word2vec matrix containing embeddings of entities in E+, where V ∈ R^{k×d}; v_e is the embedding of entity e in E+, of dimension 1 × d.
In the word2vec model, entities in context are used to predict the target entity. ELDEN maximizes the objective function (Goldberg and Levy, 2014) of the word2vec skip-gram model with negative sampling,

    L = Σ_{(t,c) ∈ P} L_{t,c},  where  L_{t,c} = log θ(v_t · v_c) + Σ_{n ∈ N_{(t,c)}} log θ(−v_t · v_n)

Here v_t and v_c are the entity embeddings of target entity t and context entity c. P is the set of target-context entity pairs considered by the model. N_{(t,c)} is a set of randomly sampled entities used as negative samples with pair (t, c). This objective is maximized with respect to the variables v_t and v_c, where θ(x) = 1/(1 + e^{−x}). P and N are derived using G_dense: t and c are entities in E+ such that c shares a common edge with t, and v_n is randomly sampled from V over entities that do not share a common edge with t. Entity embedding similarity measured using V trained this way on G_dense is ψ_ELDEN. Embedding similarity is measured as the cosine similarity between the v_e's. Embeddings of S are trained using positive and negative word contexts derived using context length.
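The per-pair term L_{t,c} of this objective can be written out directly. The snippet below is a sketch of the objective value only (no gradient updates), with θ as the logistic sigmoid; it is meant to make the formula concrete, not to replace a word2vec implementation.

```python
import numpy as np

def sigmoid(x):
    """theta(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def sgns_term(v_t, v_c, neg_vs):
    """L_{t,c}: log theta(v_t . v_c) plus the sum of log theta(-v_t . v_n)
    over the negative samples n in N_{(t,c)}."""
    loss = np.log(sigmoid(v_t @ v_c))          # positive (target, context) pair
    for v_n in neg_vs:                         # randomly sampled negatives
        loss += np.log(sigmoid(-(v_t @ v_n)))
    return float(loss)
```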

(iii) Bringing it All Together: ELDEN
ELDEN is a supervised EL system which uses two sets of features: (1) contextual compatibility φ(m, e); and (2) coherence ψ(e_i, e_j). These features are summarized in Table 2. Similarity between entity embeddings is measured as the cosine similarity between the v_e's.

Experiments
In this section, we evaluate the following:
• Is ELDEN's corpus co-occurrence statistics-based densification helpful in disambiguating entities better? (Sec. 6.1)
• Where does ELDEN's selective densification of KG nodes link entities better? (Sec. 6.3)

Setup: ELDEN is implemented using a Random Forest ensemble (Breiman, 1998) from scikit-learn (http://scikit-learn.org). Parameter values were set using the CoNLL development set; a feature limit of 3 with 100 estimators yielded the best performance.

Knowledge Graph: Wikipedia. Following prior EL literature, we use the Wikipedia hyperlink graph as our KG (Milne and Witten, 2008; Globerson et al., 2016), cleaned by removing disambiguation, navigation, maintenance and discussion pages. This KG is enhanced with pseudo entities as explained in Section 4.

Web corpus: We collect web content about each entity (via https://www.google.com/). Even for sparsely connected entities, an average corpus of 670 lines or more was collected. We note that though some of the entities mentioned in this dataset are ten or more years old, we are able to collect, on average, more than 670 lines of web content. (A detailed analysis of knowledge gained from crawling for common versus less common entities is presented in Figure 1 of the supplementary material.) Thus the corpus proves to be a good source of additional links for densification, for both common and rare entities. As Taneva and Weikum (2013) also note, it is not hard to find content about sparsely connected entities on the web. The web corpus is analyzed for mentions and pseudo entities. The co-occurrence matrix M (downloadable with our source code) is created for mentions and pseudo entities occurring within a window of size 10 for PMI calculation; we experimented with window sizes 10, 25 and 50, and chose 10 as it gave the best results. Edges are added from pseudo entities with positive PMI to the given entity. In experiments, we add edges from the top 10 pseudo entities ordered by PMI value.

Evaluation Dataset: In line with prior work on EL, we test the performance of ELDEN on the CoNLL and TAC datasets. As this paper focuses on entity disambiguation, we tested ELDEN against datasets and baseline methods for disambiguation. We note that the entity disambiguation evaluation part of other recent datasets like ERD 2014 and TAC 2015 is exactly the same as the TAC 2010 evaluation (Ellis et al., 2014).

Training: ELDEN's parameters were tuned using the training (development) sets of the CoNLL and TAC datasets. The CoNLL and TAC datasets consist of documents where mentions are marked and the entity each mention links to is specified. We use only mentions that link to a valid Wikipedia title (non-NIL entities) and report performance on the test set. Some aspects of these datasets relevant to our experiments are provided below.

CoNLL: On the CoNLL test set (5,267 mentions), we report precision of the topmost candidate entity, aggregated over all mentions (P-micro) and aggregated over all documents (P-macro). That is, if tp_i, fp_i and p_i are the true positives, false positives and precision for document i in a dataset of δ documents, then

    P-micro = Σ_i tp_i / (Σ_i tp_i + Σ_i fp_i),    P-macro = (1/δ) Σ_i p_i

For CoNLL candidate entities, we use the Pershina et al. (2015) dataset.

TAC: On the TAC dataset, we report P-micro of the top-ranked candidate entity on 1,020 mentions. P-macro is not applicable to TAC as most documents have only one mention as the query mention (the 'mention to be linked'). For TAC candidate entities, we index the Wikipedia word tokens and titles using Solr. We index terms in (1) the title of the entity, (2) the title of another entity redirecting to the entity, and (3) the names of anchors that point to the entity, in line with baselines. We are making this TAC candidate set publicly available.

Baseline: Yamada16. Our baseline is the Yamada et al. (2016) system explained in Section 3.
Entity embedding distance measured using v_e trained on the input KG G is ψ_Yamada.
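The P-micro and P-macro scores used in the following sections can be computed from per-document counts as sketched below; the (tp, fp) input format is an assumption for illustration.

```python
def precision_scores(per_doc):
    """per_doc: list of (tp, fp) pairs, one per document.
    P-micro pools counts over all mentions; P-macro averages the
    per-document precisions."""
    total_tp = sum(tp for tp, fp in per_doc)
    total_fp = sum(fp for tp, fp in per_doc)
    p_micro = total_tp / (total_tp + total_fp)
    p_macro = sum(tp / (tp + fp) for tp, fp in per_doc) / len(per_doc)
    return p_micro, p_macro

# Two documents: 3/4 and 1/1 correct -> micro 4/5 = 0.8, macro (0.75 + 1.0)/2 = 0.875
print(precision_scores([(3, 1), (1, 0)]))
```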

6.1 Does ELDEN's selective densification help in disambiguation in EL?
In Table 4, we compare ELDEN's EL performance with results of other recently proposed state-of-the-art EL methods that use coherence models. We see that ELDEN matches the best results on CoNLL and outperforms the state-of-the-art on the TAC dataset. In the table, the last four rows use the Pershina et al. (2015) candidate set, and hence we compare their disambiguation performance. The improved results of ELDEN over the baseline are attributed to improved disambiguation due to KG densification.

6.2 Why does ELDEN's selective densification work?
We conduct an ablation analysis using various features and feature combinations, and present the performance of ELDEN and the baseline in Table 5. Starting with the base features, we add features to ELDEN incrementally and report their impact on performance. The results when using the base feature group alone, and the base and string similarity groups together (φ), are presented in the first and second rows for each dataset. We compare ψ_ELDEN to three coherence measures: ψ_wlm, ψ_Yamada and ψ_dense (see Table 2 for definitions of these measures). In Table 5, statistically significant improvements over φ are marked with an asterisk, and P-macro is not reported for TAC as most documents have only one mention marked as query (see Section 6.2).

On the CoNLL dataset, ψ_ELDEN gave an improvement of 2.0 and 1.9 (P-micro and P-macro) over the Yamada16 results. We note that the Yamada16 results are from our re-implementation of the Yamada et al. (2016) system, and we are able to almost reproduce the baseline results. We also present the results combining the baseline's ψ_Yamada and ψ_wlm versus ELDEN's ψ_ELDEN and ψ_dense in the next two rows; we find that ELDEN's KG densification features perform better than the baseline's. On the TAC dataset too, combined with φ, ψ_dense does better than ψ_wlm, and ψ_ELDEN gives a significant P-micro improvement of 4.2 over ψ_Yamada. The ψ_ELDEN++ P-micro improvement on TAC is statistically significant. In short, we find the KG densification features, ψ_dense and ψ_ELDEN, to be the features driving the better performance of ELDEN on both datasets, with the combined measure ψ_ELDEN++ achieving the best overall performance.

6.3 Where does ELDEN's selective densification work better?
While most EL systems give higher precision on the CoNLL dataset than on the TAC dataset, ELDEN performs with high precision on the TAC dataset too (Table 4). (We re-implemented the Yamada et al. system using the hyper-parameters specified in the paper, and these are our best-effort results; statistical significance is assessed with a two-tailed t-test, 2-tail 95% value of 1.96.) This is explained by analyzing the distribution of densely connected and sparsely connected entities in the TAC and CoNLL datasets, as presented in Table 6. We see that the CoNLL test set has roughly half densely connected and half sparsely connected entities, whereas in the TAC test set, 63.6% are sparsely connected entities. This higher proportion of sparsely connected entities in TAC explains ELDEN's better results on TAC relative to CoNLL. As the number of sparsely connected entities is greater than the number of densely connected entities in most KGs (Reinanda et al., 2016), our method is expected to be of significance for most KGs.
6.4 What type of EL errors are best fixed with ELDEN's selective densification?
We analyzed the errors fixed by ELDEN on the TAC dataset, categorizing them into four classes in line with the error classes of Ling et al. (2015). We manually analyzed 240 wrong predictions of Yamada16 and compared them with those of ELDEN; the results are presented in Figure 3. We found that errors reduce with the use of KG densification features, and most of the errors eliminated were in the "Specific label" class. Errors in this class call for better modeling of the mention's context and link-based similarity (Ling et al., 2015). (More details of this analysis are in the supplementary document.)

Conclusion
We started this study by analyzing the performance of a state-of-the-art Entity Linking (EL) system and found that its performance was low when linking entities sparsely connected in the KG. We saw that this can be addressed by densifying the KG with respect to the given entity. We proposed ELDEN, which densifies the edge graph of entities using pseudo entities and mentions of entities in a large web corpus. Through our experiments, we find that ELDEN outperforms the state-of-the-art baseline on benchmark datasets. We believe that ELDEN's combination of KG densification and entity embeddings is novel. Poor performance of EL systems on sparsely connected entities has been recognized as an open challenge by prior research. ELDEN performs well on sparsely connected entities too, validating our approach of KG densification followed by embedding. Our approach may be applied to any KG, as the densification is performed with the help of unstructured data rather than any specific KG. We hope the simple graph densification method utilized in ELDEN will be of interest to the research community.
Pseudo entities can be viewed as entity candidates for KG expansion, as also noted by Farid et al. (2016). In future work, we plan to enhance ELDEN by entity-linking pseudo entities to estimate the entity prior of entities not present in the KG. We also plan to explore entity embeddings obtained using other graph-densifying methods.