Boosting Entity Linking Performance by Leveraging Unlabeled Documents

Modern entity linking systems rely on large collections of documents specifically annotated for the task (e.g., AIDA CoNLL). In contrast, we propose an approach which exploits only naturally occurring information: unlabeled documents and Wikipedia. Our approach consists of two stages. First, we construct a high-recall list of candidate entities for each mention in an unlabeled document. Second, we use the candidate lists as weak supervision to constrain our document-level entity linking model. The model treats entities as latent variables and, when estimated on a collection of unlabeled texts, learns to choose entities relying both on the local context of each mention and on coherence with other entities in the document. The resulting approach rivals fully-supervised state-of-the-art systems on standard test sets. It also approaches their performance in a very challenging setting: when tested on a test set sampled from the data used to estimate the supervised systems. By comparing to Wikipedia-only training of our model, we demonstrate that modeling unlabeled documents is beneficial.


Introduction
Named entity linking is the task of linking a mention to the corresponding entity in a knowledge base (e.g., Wikipedia). For instance, in Figure 1 we link the mention "Trump" to the Wikipedia entity Donald Trump. Entity linking enables aggregation of information across multiple mentions of the same entity, which is crucial in many natural language processing applications such as question answering (Hoffmann et al., 2011; Welbl et al., 2018), information extraction (Hoffmann et al., 2011) or multi-document summarization (Nenkova, 2008).
While traditionally entity linkers relied mostly on Wikipedia and heuristics (Milne and Witten, 2008; Ratinov et al., 2011a; Cheng and Roth, 2013), the recent generation of methods (Globerson et al., 2016; Guo and Barbosa, 2016; Yamada et al., 2016; Ganea and Hofmann, 2017; Le and Titov, 2018) approached the task as supervised learning on a collection of documents specifically annotated for the entity linking problem (e.g., relying on AIDA CoNLL (Hoffart et al., 2011)). While they substantially outperform the traditional methods, such human-annotated resources are scarce (e.g., available mostly for English) and expensive to create. Moreover, the resulting models end up being domain-specific: their performance drops substantially when they are used in a new domain. We will refer to these systems as fully-supervised.
Our goal is to show that an accurate entity linker can be created relying solely on naturally occurring data. Specifically, our approach relies only on Wikipedia and a collection of unlabeled texts. Though links in Wikipedia have been created by humans, no extra annotation is necessary to build our linker. Wikipedia is also available in many languages and covers many domains. Though Wikipedia information is often used within entity linking pipelines, previous systems relying on Wikipedia are substantially less accurate than modern fully-supervised systems (e.g., Cheng and Roth (2013), Ratinov et al. (2011a)). This is also true of the only other method which, like ours, uses a combination of Wikipedia data and unlabeled texts (Lazic et al., 2015). We will refer to approaches using this form of supervision, including our approach, as Wikipedia-based linkers.
Wikipedia articles have a specific rigid structure (Chen et al., 2009), often dictated by the corresponding templates, and mentions in them are only linked once (when first mentioned). For these reasons, Wikipedia pages were not regarded as suitable for training document-level models (Globerson et al., 2016; Ganea and Hofmann, 2017), whereas state-of-the-art fully-supervised methods rely on document-level modeling. We will show that, by exploiting unlabeled documents and estimating document-level neural coherence models on these documents, we can bring Wikipedia-based linkers on par with fully-supervised linkers or, in certain cases, make them more accurate.
Our Wikipedia-based approach uses two stages: candidate generation and document-level disambiguation. First, we take an unlabeled document collection and use link statistics in Wikipedia to construct a high recall list of candidates for each mention in each document. To create these lists, we use the Wikipedia link graph, restrict vertices to the ones potentially appearing in the document (i.e. use the 'vertex-induced subgraph' corresponding to the document) and perform message passing with a simple probabilistic model which does not have any trainable parameters. After this step, for the example in Figure 1, we would be left with Theresa May and a Queen of England Mary of Teck as two potential candidates for mention "May," whereas we would rule out many other possibilities (e.g., a former settlement in California). Second, we train a document-level statistical disambiguation model which treats entities as latent variables and uses the candidate lists as weak supervision. Intuitively, the disambiguation model is trained to score at least one assignment compatible with the candidate lists higher than all the assignments incompatible with the lists (e.g., one which links "Trump" to Ivanka Trump).
Though the constraints do not prevent linking "May" to the Queen in Figure 1, given enough data, the model should rule out this assignment as not fitting in with other entities in the document (i.e., Donald Trump and Brexit) and/or not compatible with its local context (i.e., "Mrs."). We evaluate our model against previous methods on six standard test sets, covering multiple domains. Our model achieves the best results on four of these sets and on average. Interestingly, our system performs well on test data from AIDA CoNLL, the dataset used to train fully-supervised systems, even though we have not used the annotations.
Our approach also substantially outperforms both previous Wikipedia-based approaches and a version of our system which is simply trained to predict Wikipedia links. This result demonstrates that unlabeled data was genuinely beneficial. We perform ablations confirming that the disambiguation model benefits from capturing both coherence with other entities (e.g., Theresa May is more likely than Mary of Teck to appear in a document mentioning Donald Trump) and from exploiting local context of mentions (e.g., "Mrs." can be used to address a prime minister but not a queen). This experiment confirms an intuition that global modeling of unlabeled documents is preferable to training local models to predict individual Wikipedia links.
Our contributions can be summarized as follows:
• we show how Wikipedia and unlabeled data can be used to construct an accurate linker which rivals linkers constructed using expensive human supervision;
• we introduce a novel constraint-driven approach to learning a document-level ('global') co-reference model without using any document-level annotation;
• we provide evidence that fully-annotated documents may not be as beneficial as previously believed.
Constraint-Driven Learning for Linking

Setting
We assume that for each mention $m_i$, we are provided with a set of candidates $E^+_i$. In a subsequent section we will clarify how these candidates are produced.
For example, for $m_1$ = "Trump" in Figure 1, the set would be $E^+_1$ = {Donald Trump, Melania Trump}. When learning our model we will assume that one entity candidate in this set is correct ($e^*_i$). Besides the 'positive examples' $E^+_i$, we assume that we are given a set of wrong entities $E^-_i$ (including, in our example, Ivanka Trump and Donald Trump Jr).
In practice our candidate selection procedure is not perfect and the correct entity $e^*_i$ will occasionally be missed from $E^+_i$ and even misplaced into $E^-_i$. This is different from the standard supervised setting where $E^+_i$ contains a single entity, and the annotation is not noisy. Moreover, unlike the supervised scenario, we do not aim to learn to mimic the teacher but rather want to improve on it relying on other learning signals (i.e., document context).
Some mentions do not refer to any entity in a knowledge base and should, in principle, be left unlinked. In this work, we link mentions whenever there are any candidates for linking them. More sophisticated ways of dealing with NIL-linking are left for future work.

Model
Our goal is to model not only the fit between an entity and its local context but also the interactions between entities in a document (i.e., coherence between them). As in previous global entity-linking models (Ratinov et al., 2011a), we can define the scoring function for $n$ entities $e_1, \dots, e_n$ in a document $D$ as a conditional random field with two kinds of terms: the first scores how well an entity fits the context and the second judges coherence. Exact MAP (or max-marginal) inference, needed both at training and testing time, is NP-hard (Wainwright et al., 2008), and even approximate methods (e.g., loopy belief propagation, LBP) are relatively expensive and do not provide convergence guarantees. Instead, we score entities independently relying on the candidate lists (expression (1)). Informally, we score $e_i$ based on its coherence with the 'most compatible' candidate for each mention in the document. This scoring strategy is computationally efficient and has been shown effective in the supervised setting by Globerson et al. (2016). They referred to this approach as a 'star model', as it can be regarded as exact inference in a modified graphical model.
We instantiate the general model of expression (1) with a local score $\phi$ and attention-weighted pairwise scores, where $m_i$ denotes an entity mention, $c_i$ is its context (a text window around the mention), $\xi(e_i, e_j)$ is a pair-wise compatibility score and $\alpha_{ij}$ are attention weights, measuring the relevance of an entity at position $j$ to predicting entity $e_i$ (i.e., $\sum_{j=1}^{n} \alpha_{ij} = 1$). The local score $\phi$ is identical to the one used in Ganea and Hofmann (2017). As the pair-wise compatibility score we use $\xi(e_i, e_j) = x_{e_i}^\top R x_{e_j}$, where $x_{e_i}, x_{e_j} \in \mathbb{R}^{d_e}$ are external entity embeddings, which are not fine-tuned in training, and $R \in \mathbb{R}^{d_e \times d_e}$ is a diagonal matrix. The attention is computed from the representations $h(m_i, c_i)$, where the function $h$, mapping a mention and its context to $\mathbb{R}^{d_c}$, is given in Figure 2 and $A \in \mathbb{R}^{d_c \times d_c}$ is a diagonal matrix.
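The scoring functions described in this section can be written out as follows. This is a sketch reconstructed from the definitions in the text rather than a verbatim statement: the names $g$ and $\psi$, the use of $E^+_j$ as the set over which the 'most compatible' candidate is chosen, and the softmax form of the attention are assumptions.

```latex
% Joint CRF-style score: local fit plus pairwise coherence (names g, \psi assumed)
g(e_1, \dots, e_n \mid D) = \sum_{i=1}^{n} \phi(e_i \mid D) + \sum_{i \neq j} \psi(e_i, e_j \mid D)

% Independent ('star') scoring against the most compatible candidate of each other mention
s(e_i \mid D) = \phi(e_i \mid D) + \sum_{j \neq i} \max_{e_j \in E^+_j} \psi(e_i, e_j \mid D) \tag{1}

% One instantiation consistent with the definitions of \phi, \xi and \alpha_{ij}
\phi(e_i \mid D) = \phi(e_i \mid c_i, m_i), \qquad
\psi(e_i, e_j \mid D) = \alpha_{ij} \, \xi(e_i, e_j), \qquad
\xi(e_i, e_j) = x_{e_i}^{\top} R \, x_{e_j}

% Attention weights normalised over positions j (softmax form assumed)
\alpha_{ij} \propto \exp\{ h(m_i, c_i)^{\top} A \, h(m_j, c_j) \}, \qquad
\sum_{j=1}^{n} \alpha_{ij} = 1
```

Under this reading, dropping the second term of expression (1) leaves a purely local model, and dropping $\alpha_{ij}$ weights all other mentions equally.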
A similar attention model was used in the supervised linkers of Le and Titov (2018) and Globerson et al. (2016).
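As a minimal sketch of the star-model scoring, the code below scores one candidate against the most compatible candidate of every other mention. The function names, the toy two-dimensional embeddings, the softmax attention over the other mentions and the local score passed in as a plain number are all illustrative assumptions; this is not the released implementation.

```python
import math


def softmax(values):
    """Normalise raw scores into weights that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def diag_bilinear(x, diag, y):
    """Compute x^T diag(diag) y for a diagonal matrix stored as a vector."""
    return sum(a * d * b for a, d, b in zip(x, diag, y))


def star_score(i, entity, local_score, candidates, emb, r_diag, h, a_diag):
    """Local score plus attention-weighted coherence with the most
    compatible candidate of every other mention (assumed form)."""
    others = [j for j in range(len(candidates)) if j != i]
    if not others:
        return local_score
    # Attention of mention i over the other mentions (softmax is an assumption).
    alpha = softmax([diag_bilinear(h[i], a_diag, h[j]) for j in others])
    coherence = 0.0
    for weight, j in zip(alpha, others):
        best = max(diag_bilinear(emb[entity], r_diag, emb[c]) for c in candidates[j])
        coherence += weight * best
    return local_score + coherence


# Toy data (hypothetical embeddings, not taken from the paper).
emb = {
    "Donald_Trump": [1.0, 0.2],
    "Theresa_May": [1.0, 0.0],
    "Mary_of_Teck": [0.0, 1.0],
    "Brexit": [0.9, 0.1],
}
candidates = [["Donald_Trump"], ["Theresa_May", "Mary_of_Teck"], ["Brexit"]]
h = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # mention/context representations
r_diag = [1.0, 1.0]
a_diag = [1.0, 1.0]
scores = {e: star_score(1, e, 0.0, candidates, emb, r_diag, h, a_diag)
          for e in candidates[1]}
```

With these toy vectors the candidate Theresa May scores higher than Mary of Teck for the mention "May", because it is more compatible with the candidates of the two other mentions.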
Previous supervised methods such as Ganea and Hofmann (2017) additionally exploited a simple extra feature $p_{wiki}(e_i|m_i)$: the normalized frequency of mention $m_i$ being used as an anchor text for entity $e_i$ in Wikipedia articles and YAGO. We combine this score with the model score $s(e_i|D)$ using a one-layer neural network to yield $\hat{s}(e_i|D)$. At test time, we use our model to select entities from the candidate list. As is standard in reranking (Collins and Koo, 2005), we linearly combine $\hat{s}(e_i|D)$ with the score $s_c(e_i|D)$ from the candidate generator, defined below (Section 3.3). The hyper-parameters are chosen using a development set. Additional details are provided in the appendix.

Training
As we do not know which candidate in $E^+_i$ is correct, we train the model to score at least one candidate in $E^+_i$ higher than any negative example from $E^-_i$. This approach is reminiscent of constraint-driven learning (Chang et al., 2007), as well as of multi-instance learning methods common in relation extraction (Riedel et al., 2010; Surdeanu et al., 2012). Specifically, we minimize a margin-based ranking loss, where $\Theta$ is the set of model parameters, $\delta$ is a margin, and $[x]_+ = \max\{0, x\}$.
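The training constraint can be illustrated with a small hinge loss for a single mention. This is a sketch under stated assumptions: the function name is hypothetical, and summing the hinge terms over all negative candidates is one plausible aggregation (taking only the highest-scoring negative would be another).

```python
def weak_supervision_loss(pos_scores, neg_scores, margin=0.1):
    """Hinge loss for one mention: the best-scoring positive candidate
    should beat every negative candidate by at least `margin`."""
    best_pos = max(pos_scores)
    # [x]_+ = max(0, x), summed over negatives (aggregation is an assumption).
    return sum(max(0.0, margin + s_neg - best_pos) for s_neg in neg_scores)


# Toy scores: one negative is within the margin of the best positive.
loss_violated = weak_supervision_loss([0.9, 0.2], [0.5, 0.85])
loss_satisfied = weak_supervision_loss([0.9, 0.2], [0.1, 0.3])
```

Only the best positive candidate enters the loss, so the model is free to push the other members of $E^+_i$ down; this is what allows it to improve on the noisy candidate lists rather than mimic them.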

Producing Weak Supervision
We rely primarily on Wikipedia to produce weak supervision. We start with a set of candidates for a mention $m$ containing all entities referred to with anchor text $m$ in Wikipedia. We then filter this set in two steps. The first step is the preprocessing technique of Ganea and Hofmann (2017). After this step, the list has to remain fairly large in order to maintain high recall. Large lists are not effective as weak supervision as they do not sufficiently constrain the space of potential assignments to drive learning of the entity disambiguation model. In order to further reduce the list, we apply the second filtering step. In this stage, which we introduce in this work, we use Wikipedia to create a link graph whose vertices are entities. The graph defines the structure of a probabilistic graphical model which we use to rerank the candidate list. We select only the top candidates for each mention (2 in our experiments) and still maintain high recall. The two steps are described below.

Initial filtering
For completeness, we re-describe the filtering technique of Ganea and Hofmann (2017).
Figure 3 (text of the example document): Brexit is a portmanteau of "British" and "exit". It was derived by analogy from Grexit.
... $x_w\}$, $x_e$ and $x_w \in \mathbb{R}^{d_e}$ are external embeddings for entity $e$ and word $w$, respectively. Note that the word and entity embeddings are not fine-tuned, so the model does not have any free parameters. They then extract the $N_p = 4$ top candidates according to $p_{wiki}(e|m)$ and the $N_q = 3$ top candidates according to $q_{wiki}(e|m, c)$ to get the candidate list. For details, we refer to the original paper. On the development set, this step yields recall of 97.2%.
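The selection of the initial candidate list can be sketched as below. The function name and the toy scores are hypothetical, and both score tables are treated as given inputs; merging the two top lists without duplicates is an assumption consistent with obtaining at most $N_p + N_q$ candidates.

```python
def initial_candidates(p_wiki, q_wiki, n_p=4, n_q=3):
    """Union of the top-n_p entities by prior score and the top-n_q
    entities by context score, without duplicates."""
    by_prior = sorted(p_wiki, key=p_wiki.get, reverse=True)[:n_p]
    by_context = sorted(q_wiki, key=q_wiki.get, reverse=True)[:n_q]
    selected = list(by_prior)
    for entity in by_context:
        if entity not in selected:
            selected.append(entity)
    return selected


# Toy scores for one mention (hypothetical values).
p_wiki = {"a": 0.5, "b": 0.2, "c": 0.15, "d": 0.1, "e": 0.05}
q_wiki = {"e": 0.6, "a": 0.3, "f": 0.07, "b": 0.03}
shortlist = initial_candidates(p_wiki, q_wiki)
```

Here entity "a" is selected by both rankings, so the list contains six rather than seven candidates.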

Message passing on link graph
We now describe how we use Wikipedia link statistics to further reduce the candidate list.

Link graph
We construct an undirected graph from Wikipedia; vertices of this graph are Wikipedia entities. We link vertex $e_u$ with vertex $e_v$ if there is a document $D_{wiki}$ in Wikipedia such that either
• $D_{wiki}$ is a Wikipedia article describing $e_u$, and $e_v$ appears in it, or
• $D_{wiki}$ contains $e_u$, $e_v$ and there are fewer than $l$ entities between them.
For instance, in Figure 3, for document "Brexit", we link entity Brexit to all other entities. However, we do not link United Kingdom to Greek withdrawal from the eurozone as they are more than $l$ entities apart.
Figure 4: Recall as a function of the candidate number.
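The two linking rules can be sketched as follows. The function name, the representation of an article as a pair (described entity, linked entities in text order) and the toy article are assumptions made for illustration.

```python
from itertools import combinations


def build_link_graph(articles, window):
    """Build undirected edges from (subject_entity, linked_entities) pairs,
    where linked_entities are listed in the order they appear in the text."""
    edges = set()
    for subject, linked in articles:
        # Rule 1: the described entity is linked to every entity in its article.
        for entity in linked:
            if entity != subject:
                edges.add(frozenset((subject, entity)))
        # Rule 2: two entities are linked when fewer than `window`
        # other entities occur between them.
        for (i, u), (j, v) in combinations(enumerate(linked), 2):
            if u != v and (j - i - 1) < window:
                edges.add(frozenset((u, v)))
    return edges


# Toy article (entity order is hypothetical).
articles = [(
    "Brexit",
    ["United_Kingdom", "European_Union", "Greek_withdrawal_from_the_eurozone"],
)]
edges = build_link_graph(articles, window=1)
```

With `window=1`, adjacent entities are linked, while United Kingdom and Greek withdrawal from the eurozone, which have one entity between them, are not.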

Model and inference
Now we consider unlabeled (non-Wikipedia) documents. We use this step both to preprocess training documents and to process new unlabeled documents at test time. First, we produce at most $N_q + N_p$ candidates for each mention in a document $D$ as described above. Then we define a probabilistic model $r_{wiki}(e_1, \dots, e_n | D)$ over entities in $D$, built from pairwise terms $\varphi_{wiki}(e_i, e_j)$, where $\varphi_{wiki}(e_i, e_j)$ is 0 if $e_i$ is linked with $e_j$ in the link graph and $-\Delta$ otherwise ($\Delta \in \mathbb{R}^+$). Intuitively, the model scores an assignment $e_1, \dots, e_n$ according to the number of unlinked pairs in the assignment. We use the max-product version of LBP to produce approximate marginals:
$$r_{wiki}(e_i | D) \approx \max_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_n} r_{wiki}(e_1, \dots, e_n | D)$$
For example, in Figure 1, Donald Trump, Brexit and Theresa May are all linked to one another in the Wikipedia link graph. The assignment Donald Trump, Brexit, Theresa May therefore does not contain unlinked pairs and will receive the highest score.
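The effect of this model can be illustrated on a tiny document. The sketch below computes exact max-marginals by exhaustive enumeration, which is a stand-in for the max-product LBP used in the paper and is feasible only for a handful of mentions. The function names, the unnormalised log-score that subtracts $\Delta$ once per unordered unlinked pair, and the toy graph are assumptions.

```python
from itertools import combinations, product


def log_score(assignment, edges, delta=1.0):
    """Unnormalised log-score: -delta for every unlinked pair of distinct entities."""
    penalty = 0.0
    for u, v in combinations(assignment, 2):
        if u != v and frozenset((u, v)) not in edges:
            penalty -= delta
    return penalty


def max_marginals(candidates, edges, delta=1.0):
    """For each mention and candidate, the best log-score of any joint
    assignment that selects that candidate (exact, by enumeration)."""
    best = [{c: float("-inf") for c in cands} for cands in candidates]
    for assignment in product(*candidates):
        score = log_score(assignment, edges, delta)
        for i, entity in enumerate(assignment):
            best[i][entity] = max(best[i][entity], score)
    return best


# Toy link graph and candidate lists (hypothetical).
edges = {
    frozenset(("Donald_Trump", "Brexit")),
    frozenset(("Donald_Trump", "Theresa_May")),
    frozenset(("Brexit", "Theresa_May")),
}
candidates = [
    ["Donald_Trump", "Ivanka_Trump"],
    ["Theresa_May", "Mary_of_Teck"],
    ["Brexit"],
]
marginals = max_marginals(candidates, edges)
```

Candidates that appear in a fully linked assignment keep a score of 0, while candidates such as Mary of Teck are penalised for every entity they are not linked to.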
In Figure 4, we plot recall on the AIDA CoNLL development set as a function of the candidate number (ranking is according to $r_{wiki}(e_i|D)$). We can see that we can reduce $N_p + N_q = 7$ candidates down to $N_w = 2$ and still maintain recall of 93.9%. The remaining ($N_p + N_q - N_w$) entities are kept as 'negative examples' $E^-_i$ for training the disambiguation model (see Figure 1).
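Splitting a ranked candidate list into weak positives and negatives can be sketched as below. The function name and the toy marginal scores are hypothetical; ties are broken by the order in which candidates were inserted, which is an arbitrary choice.

```python
def split_candidates(marginals, n_w=2):
    """Keep the n_w best-ranked candidates as positives (E+);
    the remaining candidates become negatives (E-)."""
    ranked = sorted(marginals, key=marginals.get, reverse=True)
    return ranked[:n_w], ranked[n_w:]


# Toy marginal scores for the mention "May" (hypothetical values).
may_scores = {
    "Theresa_May": 0.0,
    "Mary_of_Teck": -1.0,
    "May,_California": -2.0,
    "Brian_May": -3.0,
}
positives, negatives = split_candidates(may_scores)
```

The positive list still contains a wrong entity here, which is exactly the ambiguity the disambiguation model is expected to resolve.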

Aggregate scoring function
As we can see from Figure 4, keeping only the top candidate from the list would yield recall of 83.5%, which is about 10% below state of the art. In order to test how far we can go without using the disambiguation model, we combine the signals we relied on in the previous section. Specifically, rather than using $r_{wiki}$ alone, we linearly combine the Levenshtein edit distance (Levenshtein, 1966) with the scores $p_{wiki}$ and $r_{wiki}$. Parameters are described in the appendix. The coefficients are chosen on the development set. We refer to this score as $s_c(e_i|D)$.
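A minimal sketch of such an aggregate score is given below. The weights are placeholders rather than the tuned coefficients, and computing the edit distance between the mention string and the entity title is an assumption about which strings are compared.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def aggregate_score(mention, title, p_wiki, r_wiki, weights=(-0.1, 1.0, 1.0)):
    """Linear combination of edit distance, prior and link-graph score.
    The weights are hypothetical, not the values chosen on the development set."""
    w_edit, w_prior, w_graph = weights
    return w_edit * levenshtein(mention, title) + w_prior * p_wiki + w_graph * r_wiki


example = aggregate_score("May", "Theresa May", p_wiki=0.3, r_wiki=0.0)
```

A negative weight on the edit distance rewards entities whose titles are close to the mention string.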

Parameters and Resources
We used DeepEd from Ganea and Hofmann (2017) to obtain entity embeddings. We also used Word2vec word embeddings to compute the local score function and GloVe embeddings within the attention model in Figure 2. Hyper-parameter selection was performed on the AIDA CoNLL development set. The margin parameter $\delta$ and the learning rate were set to 0.1 and $10^{-4}$, respectively. We use early stopping by halting training when the F1 score on the development set does not increase after 50,000 updates. We report the mean and 95% confidence interval of the F1 scores using five runs of our system. See additional details in the appendix.
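One common way to report a mean with a 95% confidence interval over five runs is a Student-t interval, sketched below. Whether this is the interval used for the reported numbers is an assumption, and the scores in the example are placeholders, not results from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five runs).
T_CRIT_95_DF4 = 2.776


def mean_and_ci95(f1_scores):
    """Return (mean, half-width of the 95% t-interval) for exactly five runs."""
    if len(f1_scores) != 5:
        raise ValueError("the hard-coded critical value assumes five runs")
    mean = statistics.mean(f1_scores)
    half_width = T_CRIT_95_DF4 * statistics.stdev(f1_scores) / math.sqrt(len(f1_scores))
    return mean, half_width


# Placeholder scores for illustration only.
mean_f1, half_width = mean_and_ci95([80.0, 81.0, 82.0, 83.0, 84.0])
```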
The source code and data are publicly available at https://github.com/lephong/wnel.
In our experiments, we randomly selected 30,000 unlabeled documents from RCV1. Since we focus on the inductive setting, we do not include any documents used to create the AIDA CoNLL development and test sets in our training set. In addition, we did not use any articles appearing in WIKI to compute $r_{wiki}$. We rely on SpaCy (https://spacy.io/) to extract named entity mentions.
We compare our model to those systems which were trained on Wikipedia or on Wikipedia plus unlabeled documents. They are: Milne and Witten (2008), Ratinov et al. (2011a), Hoffart et al. (2011), Cheng and Roth (2013), Chisholm and Hachey (2015), Lazic et al. (2015). Note that we are aware of only one system, Lazic et al. (2015), which relied on learning from a combination of Wikipedia and unlabeled documents. They use semi-supervised learning and exploit only local context (i.e., coherence with other entities is not modeled).
We also compare to recent state-of-the-art systems trained with supervision, either on Wikipedia and extra supervision or on AIDA CoNLL: Chisholm and Hachey (2015), Guo and Barbosa (2016), Globerson et al. (2016), Yamada et al. (2016), Ganea and Hofmann (2017), Le and Titov (2018). Chisholm and Hachey (2015) used supervision in the form of links to Wikipedia from non-Wikipedia pages (Wikilinks; Singh et al., 2012). This annotation can also be regarded as weak or incidental supervision, as it was not created with the entity linking problem in mind. The others exploited the AIDA CoNLL training set. F1 scores of these systems are taken from Guo and Barbosa (2016), Ganea and Hofmann (2017) and Le and Titov (2018).
We use the standard metric: 'in-knowledge-base' micro F-score, in other words, F1 of those mentions which can be linked to the knowledge base.
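The metric can be sketched as follows. The function name and data layout are hypothetical, and ignoring predictions made for mentions without a knowledge-base entry is an assumption about how such mentions are treated.

```python
def in_kb_micro_f1(gold, predicted):
    """Micro F1 over mentions whose gold entity is in the knowledge base.

    gold: dict mention_id -> gold entity, or None when there is no KB entry.
    predicted: dict mention_id -> predicted entity (unlinked mentions absent).
    """
    in_kb = {m: e for m, e in gold.items() if e is not None}
    linked = {m: e for m, e in predicted.items() if m in in_kb}
    correct = sum(1 for m, e in linked.items() if in_kb[m] == e)
    precision = correct / len(linked) if linked else 0.0
    recall = correct / len(in_kb) if in_kb else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy annotations: mention 3 has no KB entry, mention 4 is left unlinked.
gold = {1: "A", 2: "B", 3: None, 4: "C"}
predicted = {1: "A", 2: "X", 3: "Y"}
f1 = in_kb_micro_f1(gold, predicted)
```

In this toy case precision is 1/2 and recall is 1/3, giving an F1 of 0.4.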

Results
The results are shown in Table 1.
First, we compare to systems which relied on Wikipedia and those which used Wikipedia along with unlabeled data ('Wikipedia + unlab'), i.e., the top half of Table 1. These methods are comparable to ours, as they use the same type of information as supervision. Our model outperformed all of them on all test sets. One may hypothesize that this is only due to using more powerful feature representations rather than our estimation method or document-level disambiguation. We will address this hypothesis in the ablation studies below. The approach of Chisholm and Hachey (2015) does not quite fall in this category as, besides information from Wikipedia, they use a large collection of web pages (34 million web links). When evaluated on AIDA-B, their scores are still lower than ours, though significantly higher than those of the previous systems, suggesting that web links are indeed valuable. Though we do not exploit web links in our model, in principle, they can be used in exactly the same way as Wikipedia links. We leave this for future work.
Second, we compare to fully-supervised systems, which were estimated on AIDA-CoNLL documents. Recall that every mention in these documents has been manually annotated or validated by a human expert. We distinguish results on a test set taken from AIDA-CoNLL (AIDA-B) and the other standard test sets not directly corresponding to the AIDA-CoNLL domain. When tested on the latter, our approach is very effective, on average outperforming fully-supervised techniques. We would argue that this is the most important set-up and fair to our approach: it is not feasible to obtain labels for every domain of interest and hence, in practice, supervised systems are rarely (if ever) used in-domain. As expected, on the in-domain test set (AIDA-B), the majority of recent fully-supervised methods are more accurate than our model. However, even on this test set our model is not far behind, for example, outperforming the system of Guo and Barbosa (2016).

Analysis and ablations
We perform ablations to see contributions of individual modeling decisions, as well as to assess importance of using unlabeled data.
Is constraint-driven learning effective? In this work we advocated for learning our model on unlabeled non-Wikipedia documents and using Wikipedia to constrain the space of potential entity assignments. A simpler alternative would be to learn to directly predict links within Wikipedia documents and ignore unlabeled documents. To show that our learning approach and the use of unlabeled documents are indeed preferable, we also estimate our model on Wikipedia articles. Instead of using the candidate selection step to generate the list $E^+_i$, we used the gold entity as a singleton $E^+_i$ in training. The results are shown in Table 2 ('Wikipedia'). The resulting model is significantly less accurate than the one which used unlabeled documents. The score difference is larger for the AIDA-CoNLL test set than for the other 5 test sets. This is not surprising as our unlabeled documents originate from the same domain as AIDA-CoNLL. This suggests that the scores on the 5 test sets could in principle be further improved by incorporating unlabeled documents from the corresponding domains. Additionally we train our model on AIDA-CoNLL, producing its fully-supervised version ('AIDA CoNLL' row in Table 2). Though, as expected, this version is more accurate on the AIDA test set, similarly to other fully-supervised methods, it overfits and does not perform that well on the 5 out-of-domain test sets.
Table 1 (fragment): (Chisholm and Hachey, 2015) 84.9 - - - - - -; Wiki + unlab (Lazic et al., 2015) 86
As we do not want to test multiple systems on the final test set, we report the remaining ablations on the development set (AIDA-A), Table 3.
Is the document-level disambiguation model beneficial? As described in Section 3.3 ('Aggregate scoring function'), we constructed a baseline which only relies on link statistics in Wikipedia as well as string similarity (we referred to its scoring function as $s_c$). It appears surprisingly strong; however, we still outperform it by 1.6% (see Table 3).
Are both local and global disambiguation beneficial? When we use only global coherence (i.e., only the second term in expression (1)) and drop any modeling of local context at the disambiguation stage, the performance drops very substantially (to 82.4% F1, see Table 3). This suggests that the local scores are crucial in our model: an entity should fit its context (e.g., in our running example, 'Mrs' is not used to address a Queen). Without using local scores the disambiguation model appears to be even less accurate than our 'no-statistical-disambiguation' baseline. It is also important to have an accurate global model: not using global attention results in a 1.2% drop in performance.
... training. As expected, the score increases with the number of raw documents, but changes very slowly after 10,000 documents.
Which entities are easier to link? Figure 4 shows the accuracy of two systems for different NER (named entity recognition) types. We consider four types: location (LOC), organization (ORG), person (PER), and miscellany (MISC). These types are given in the CoNLL 2003 dataset, which was used as a basis for AIDA CoNLL. Our model is accurate for PER, achieving accuracy of about 97%, only 0.53% lower than the supervised model. However, annotated data appears beneficial for other named-entity types. One of the harder cases for our model is distinguishing nationalities from languages (e.g., "English peacemaker" vs "English is spoken in the UK"). Both linking options typically appear in the positive sets simultaneously, so the learning objective does not encourage the model to distinguish the two. This is one of the most frequent mistakes for the tag 'MISC'.

Related work
Using Wikipedia pages to learn linkers ('wikifiers') has been a popular line of research both for named entity linking (Cheng and Roth, 2013; Milne and Witten, 2008) and, more generally, for entity disambiguation tasks (Ratinov et al., 2011b). However, since the introduction of the AIDA CoNLL dataset, fully-supervised learning on this dataset became standard for named entity linking, with supervised systems (Globerson et al., 2016; Guo and Barbosa, 2016; Yamada et al., 2016) outperforming alternatives even on out-of-domain datasets such as MSNBC and ACE2004. Note though that supervised systems also rely on Wikipedia-derived features. As an alternative to using Wikipedia pages, links to Wikipedia pages from the general Web were used as supervision (Singh et al., 2012). As far as we are aware, the system of Chisholm and Hachey (2015) is the only such system evaluated on standard named-entity linking benchmarks, and we compare to them in our experiments. This line of work is potentially complementary to what we propose, as we could use the Web links to construct weak supervision.
The weakly- or semi-supervised set-up, which we use, is not common for entity linking. The only other approach which uses a combination of Wikipedia and unlabeled data, as far as we are aware, is by Lazic et al. (2015). We discussed it and compared to it in previous sections. Our set-up is inspired by distantly-supervised learning in relation extraction (Mintz et al., 2009). In distant learning, the annotation is automatically (and noisily) induced relying on a knowledge base instead of annotating the data by hand. Fan, Zhou, and Zheng (2015) learned a Freebase linker using distant supervision. Their evaluation is non-standard. They also do not attempt to learn a disambiguation model but directly train their system to replicate noisy projected annotations. Wang et al. (2015) refer to their approach as unsupervised, as they do not use labeled data. However, their method does not involve any learning and relies on matching heuristics. Some aspects of their approach (e.g., using Wikipedia link statistics) resemble our candidate generation stage. So, in principle, their approach could be compared to the 'no-disambiguation' baselines ($s_c$) in Table 3. Their evaluation set-up is not standard.
Our model (but not the estimation method) bears similarities to the approaches of Le and Titov (2018) and Globerson et al. (2016). Both these supervised approaches are global and use attention.

Conclusions
In this paper we proposed a weakly-supervised model for entity linking. The model was trained on unlabeled documents which were automatically annotated using Wikipedia. Our model substantially outperforms previous methods, which used the same form of supervision, and rivals fully-supervised models trained on data specifically annotated for the entity-linking problem. This result may be interpreted as suggesting that human-annotated data is not beneficial for entity linking, given that we have Wikipedia and web links. However, we believe that the two sources of information are likely to be complementary.
In future work we would like to consider set-ups where human-annotated data is combined with naturally occurring data (i.e., distantly-supervised data). It would also be interesting to see if mistakes made by fully-supervised systems differ from the ones made by our system and other Wikipedia-based linkers.