Distant Learning for Entity Linking with Automatic Noise Detection

Accurate entity linkers have been produced for domains and languages where annotated data (i.e., texts linked to a knowledge base) is available. However, little progress has been made for the settings where no or very limited amounts of labeled data are present (e.g., legal or most scientific domains). In this work, we show how we can learn to link mentions without having any labeled examples, only a knowledge base and a collection of unannotated texts from the corresponding domain. In order to achieve this, we frame the task as a multi-instance learning problem and rely on surface matching to create initial noisy labels. As the learning signal is weak and our surrogate labels are noisy, we introduce a noise detection component in our model: it lets the model detect and disregard examples which are likely to be noisy. Our method, jointly learning to detect noise and link entities, greatly outperforms the surface matching baseline. For a subset of entity categories, it even approaches the performance of supervised learning.


Introduction
Entity linking (EL) is the task of linking potentially ambiguous textual mentions to the corresponding entities in a knowledge base. Accurate entity linking is crucial in many natural language processing tasks, including information extraction and question answering (Yih et al., 2015). Though there has been significant progress in entity linking recently (Ratinov et al., 2011; Chisholm and Hachey, 2015; Yamada et al., 2017; Ganea and Hofmann, 2017; Le and Titov, 2018), previous work has focused on supervised learning. The annotated data necessary for supervised learning is available for certain knowledge bases and domains. For example, one can directly use web pages linking to Wikipedia to learn a Wikipedia linker. Similarly, there exist domain-specific sets of manually annotated documents (e.g., the AIDA-CoNLL news dataset for YAGO). However, for many ontologies and domains, annotation is unavailable or limited (e.g., law). Our goal is to develop a method which does not rely on any training data besides unlabeled texts and a knowledge base.
In order to construct such a method, we use an insight from simple surface matching heuristics (e.g., Riedel et al. (2010)). Such heuristics choose entities from a knowledge base by measuring the overlap between the sets of content words in the mention and in the entity name. For example, in Figure 1, the entities BILL CLINTON (PRESIDENT) and PRESIDENCY OF BILL CLINTON both have two matching words with the mention Bill Clinton. Although our experiments show that this method alone is not particularly accurate at selecting the best entity, the candidate lists it provides often include the correct entity. This implies both that we can focus on learning to select candidates from these lists and, less obviously, that we can leverage the lists as weak or distant supervision.
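The overlap heuristic described above can be sketched in a few lines of Python. This is only an illustration: the stopword list, the tokenizer, and the function names (`content_words`, `overlap`, `rank_candidates`) are assumptions made for the example, not details taken from the paper.

```python
import re

# Hypothetical stopword list; the paper does not specify one.
STOPWORDS = frozenset({"of", "the", "a", "an", "and"})

def content_words(text):
    """Lower-cased set of words in `text`, minus the assumed stopwords."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}

def overlap(mention, entity_name):
    """Number of content words shared by the mention and the entity name."""
    return len(content_words(mention) & content_words(entity_name))

def rank_candidates(mention, entity_names, min_overlap=1):
    """Entity names with at least `min_overlap` shared words, best first.

    Ties keep the input order, because Python's sort is stable.
    """
    scored = [(overlap(mention, name), name) for name in entity_names]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [name for score, name in ranked if score >= min_overlap]
```

With the two entity names from Figure 1, both candidates share two words with the mention "Bill Clinton", so the heuristic alone cannot choose between them.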
We frame this distant learning (DL) task as a multi-instance learning (MIL) problem (Dietterich et al., 1997). In MIL, each bag of examples is marked with a class label: the label indicates that the bag contains at least one example corresponding to that class. Relying on such labeled bags, MIL methods aim at learning classifiers for individual examples.
Our DL problem can be regarded as a binary version of MIL. For a list of entities (and, importantly, given the corresponding mention and its document context), we assume that we know whether the list contains a correct entity or not. The 'positive lists' are essentially top candidates from the matching heuristic. For example, the four candidate entities for the mention 'Bill Clinton' in Figure 1 could be marked as a positive set. The 'negative lists' are randomly sampled sets of entities from the knowledge base. As with other MIL approaches, while relying on labeled lists, we learn to classify individual entities, i.e. to predict if an entity should be linked to the mention.
Figure 1: We annotate raw sentences using entity names and knowledge base triples. In training, we keep only red entities as positive candidates. In testing, we consider |E+| = 100 name-matched candidates.
One important detail is that the classifier must not have access to information about which and how many words match between the mention and the entity name. If it knew this, it could easily tell, based solely on this information, which entity set is a candidate list and which one consists of randomly sampled entities. Instead, by hiding it from the classifier, we force the classifier to extract features of the mention and its context that are predictive of the entity properties (e.g., an entity type), and hence ensure generalization.
Unfortunately, our supervision is noisy. The positive lists will often miss the correct entity for the given mention. This confuses the MIL model. In order to address this issue, we learn, jointly with the MIL model, a classifier which detects potentially problematic candidate lists. In other words, the classifier predicts how likely a given list is to be noisy (i.e., how much we should trust it). The probability is then used to weight the corresponding term in the objective function of the MIL model. By jointly training the MIL model and the noise detection classifier, we effectively let the MIL model choose which examples to use for training. As we will see in our experimental analysis, this joint learning method leads to a substantial improvement in performance. By comparing its predictions to the gold standard, we also confirm that the noise detection model is generally able to identify and exclude wrong candidate lists.
DL is the mainstream approach to learning relation extractors (RE) (Mintz et al., 2009; Riedel et al., 2010), a problem related to entity linking. However, the two instantiations of the DL framework are very different. For RE, a bag of sentences is assigned to a categorical label (a relation). For EL, we assign a bag of entities, conditioned on the mention, to a positive class (correct) or a negative class (incorrect).
We evaluate our approach on the news domain for English because, with gold standard annotation available (AIDA CoNLL), we can both assess performance and compute the upper bound given by supervised learning. Nevertheless, we expect that our methodology is applicable to a wider range of knowledge bases, as long as unlabeled texts can be obtained for the corresponding domain. We plan to verify this claim in future work. In addition, we restrict ourselves to sentence-level modeling and, unlike state-of-the-art supervised methods (Yamada et al., 2017; Ganea and Hofmann, 2017; Le and Titov, 2018), ignore interaction between linking decisions in the document. Again, it would be interesting to see if such global modeling would be beneficial in the distant learning setting.
Our contributions can be summarized as follows:
• we show how the entity linking problem can be framed as a distant learning problem, namely as a binary MIL task;
• we construct a model for this task;
• we introduce a method for detecting noise in the automatic annotation;
• we demonstrate the effectiveness of our approach on a standard benchmark.
For each mention-context pair (m, c), we are given sets of positive and negative candidates: E+ should have a high chance of containing the correct entity e, while E− should include only incorrect entities. As is standard in MIL, this will be the only supervision the model receives at training time. When using this supervision, the model will need to learn to decide which entity e in E+ is most likely to correspond to the mention-context pair (m, c). At test time, the model will be provided with the list E+ and will need to select an entity from this list.
Performing entity linking in two stages, candidate selection (generating candidate lists) and entity disambiguation (choosing an entity from the list), is standard in EL, with the first stage usually handled with heuristics and the second one approached with statistical modeling (Ratinov et al., 2011). However, in our DL setting both stages change substantially. The candidate selection stage relies primarily on a surface matching heuristic, as described in Section 4. Supervised learning for the disambiguation stage is replaced with MIL, as described below in Section 3 (see footnote 1).
To make the following sections clear, we introduce the following terms.
Definition 1. A data point is a tuple ⟨m, c, E+, E−⟩ of mention m, context c, positive set E+, and negative set E−. In testing, E− = ∅.
Definition 2. A data point ⟨m, c, E+, E−⟩ is noisy if E+ does not contain the correct entity for mention m. If a data point is not noisy, we will refer to it as valid.
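As an illustration of Definitions 1 and 2, a data point can be held in a small immutable record. The class and field names below are hypothetical, and entities are represented by plain string identifiers for simplicity. Note that noisiness can only be checked when a gold entity is known, which is not the case for the training data.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class DataPoint:
    """A tuple (m, c, E+, E-) as in Definition 1 (field names are assumptions)."""
    mention: str
    context: str
    positives: FrozenSet[str]
    negatives: FrozenSet[str] = frozenset()  # empty at test time

def is_noisy(data_point, gold_entity):
    """Definition 2: noisy if E+ does not contain the correct entity."""
    return gold_entity not in data_point.positives

def is_valid(data_point, gold_entity):
    """A data point that is not noisy is valid."""
    return not is_noisy(data_point, gold_entity)
```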

Models
We introduce two approaches. The first one directly applies MIL, disregarding the fact that many data points are noisy. The second one addresses this shortcoming by integrating a noise detection component.

Model 1: MIL
Encoding context Context c is the entire l-word sentence w_1, ..., w_l, which also includes the mention m = (w_h, ..., w_k), 1 ≤ h ≤ k ≤ l. We use a BiLSTM to encode sentences. The input to the BiLSTM is a concatenation w*_i = [w_i, p_i], where p_i ∈ R^{d_p} is a position embedding and w_i ∈ R^{d_w} is a word embedding from GloVe (Pennington et al., 2014). The forward states f_i and backward states b_i of the BiLSTM are fed into the classifier described below.
Footnote 1: Supervised learning is equivalent to assuming that E+ are singletons containing only the gold-standard entity.
Entity embeddings In this work, we use a simple and scalable approach which involves computing entity embeddings on the fly using associated types. For instance, the TV episode BILL CLINTON is associated with several types including BASE.TYPE_ONTOLOGY.NON_AGENT and TV.TV_SERIES_EPISODE. Specifically, in order to produce an entity embedding, each type t is assigned a vector t ∈ R^{d_t}. We then compute a vector e ∈ R^{d_e} for entity e from the vectors of the types in T_e, where T_e is the set of e's types, using a weight matrix W_e ∈ R^{d_e×d_t} and a bias vector b ∈ R^{d_e}.
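One plausible way to turn type vectors into an entity embedding is sketched below. The specific form is an assumption made for illustration: the type vectors are mean-pooled, passed through the affine map given by W_e and b, and followed by a ReLU. The text above fixes only the ingredients (type vectors, W_e, b), not the pooling or the non-linearity, so treat both as hypothetical choices.

```python
def entity_embedding(type_ids, type_vectors, W, b):
    """Embed an entity from its types (assumed form: ReLU(W * mean(t) + b)).

    type_ids     -- the entity's types, T_e
    type_vectors -- dict mapping each type to a list of d_t floats
    W            -- d_e x d_t weight matrix as a list of rows
    b            -- bias vector of d_e floats
    """
    d_t = len(W[0])
    # Mean-pool the type vectors (assumption; a sum would also be possible).
    pooled = [
        sum(type_vectors[t][j] for t in type_ids) / len(type_ids)
        for j in range(d_t)
    ]
    # Affine map followed by ReLU (the non-linearity is an assumption).
    return [
        max(0.0, sum(W[i][j] * pooled[j] for j in range(d_t)) + b[i])
        for i in range(len(W))
    ]
```

Because the embedding depends only on types, two entities with identical type sets receive identical embeddings under this sketch.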
More sophisticated approaches to producing entity embeddings (e.g., using relational graph convolutional networks (Schlichtkrull et al., 2018)) are likely to yield further improvements.
Scoring a candidate We use a feed-forward NN with one hidden layer to compute a compatibility score g(e, m, c) between a mention-context pair (m, c) and an entity e. If e* is the correct entity, we want g(e*, m, c) > g(e, m, c) for any entity e ≠ e*.
Training Recall that for each mention-context pair (m, c), we have a positive set E+ and a negative set E−. We want to train the model to score at least one candidate in E+ higher than any candidate in E−. We use the max-margin loss to achieve this. Let
$$L_1 = \sum_{(m, c, E^+, E^-) \in D} \Big[ \delta + \max_{e' \in E^-} g(e', m, c) - \max_{e \in E^+} g(e, m, c) \Big]_+$$
where δ is a margin and [x]_+ = x if x > 0 else 0; D is the training set. We want to minimize L_1 with respect to the model parameters. We rely on the Adam optimizer and employ early stopping.
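The per-data-point term of a max-margin MIL loss can be sketched as follows. This is a minimal illustration, assuming the scores g(e, m, c) have already been computed for every candidate; the function names and the default margin value are hypothetical.

```python
def hinge(x):
    """[x]_+ = x if x > 0 else 0."""
    return x if x > 0 else 0.0

def mil_margin_loss(pos_scores, neg_scores, delta=0.1):
    """Loss for one data point.

    It is zero once the best-scoring candidate in E+ beats the
    best-scoring candidate in E- by at least the margin `delta`.
    """
    return hinge(delta + max(neg_scores) - max(pos_scores))

def total_loss(batch, delta=0.1):
    """Sum of per-data-point losses over (pos_scores, neg_scores) pairs."""
    return sum(mil_margin_loss(pos, neg, delta) for pos, neg in batch)
```

Taking the maximum over E+ is what makes this a multi-instance objective: only one positive candidate needs to score highly, and the model is free to choose which one.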

Model 2: MIL with Noise Detection (MIL-ND)
Model 1 ignores the fact that many data points are noisy, i.e. E+ may not contain the correct entity. We address this by integrating a binary noise detection (ND) classifier which predicts if a data point is noisy. Intuitively, data points classified as noisy need to be discarded from the training of the EL model. In practice, we weight them with the confidence of the ND classifier. As discussed below, we train the ND classifier jointly with the EL model.
Representation for E + The ND classifier needs to decide if there is at least one entity in the list E + corresponding to the mention-context pair (m, c).
The question is now how to represent E+ to make classification as easy as possible. One option is to use mean pooling, but this would result in uninformative representations, especially for longer candidate lists. Another option is max pooling, but it would not take into account which mention-context pair (m, c) is currently considered, so it is also unlikely to yield informative features of E+. Instead we use attention: E+ is represented by a weighted sum of the embeddings of its entities, with the attention weights computed as a function of (m, c):
$$\alpha_e = \frac{\exp\big(g'(e, m, c) / T\big)}{\sum_{e' \in E^+} \exp\big(g'(e', m, c) / T\big)}$$
where α_e are attention weights and g' is a score function. Instead of learning a separate attention function for the ND classifier, we reuse the one from the EL model, i.e. g' = g. This reduces the number of parameters and makes the method less prone to overfitting. Perhaps more importantly, we expect that the better the entity disambiguation score function is, the better the ND classifier is, so tying the two together may provide an appropriate inductive bias.
T is temperature, controlling how sharp α e should be. We found that a small T = 1/3 stabilizes the learning.
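The effect of the temperature can be illustrated with a small sketch, assuming the attention weights are a softmax over the candidate scores divided by T and that E+ is represented by the resulting weighted sum of entity embeddings. The function names are hypothetical.

```python
import math

def attention_weights(scores, T=1 / 3):
    """Softmax over scores / T; a smaller T gives sharper weights."""
    scaled = [s / T for s in scores]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pooled_representation(entity_vectors, scores, T=1 / 3):
    """Attention-weighted sum of the candidate embeddings in E+."""
    alphas = attention_weights(scores, T)
    dim = len(entity_vectors[0])
    return [
        sum(a * vec[j] for a, vec in zip(alphas, entity_vectors))
        for j in range(dim)
    ]
```

For two candidates whose scores differ by 1, the higher-scoring one gets a weight of about 0.73 at T = 1 and about 0.95 at T = 1/3, so the pooled vector is dominated by the candidate the EL model currently prefers.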
Noise detection We use a binary classifier to detect noisy data points. The probability p_N(1 | m, c, E+) that a data point is noisy is computed with the logistic sigmoid function σ and a temperature T. For simplicity, we use the same T as above.
Training Our goal is to down-weight potentially noisy data points. Our new loss has two terms. The first term is the max-margin loss of model 1, with each data point weighted by the probability, according to the ND classifier, that the data point is not noisy. The second term, scaled by a hyper-parameter η, penalizes the ND classifier for deviating from p*_N, a prior distribution indicating our beliefs about the proportion of noisy data points. We optimize the objective with respect to the parameters of both the ND and EL models. The second term is necessary, as without it the loss can be trivially minimized by the ND classifier predicting that all data points are noisy with probability 1. This would set the first term to exactly zero.
Intuitively, when using the second term, the model can disregard certain data points, but disregarding too many of them incurs a penalty. Which data points are likely to be disregarded? Presumably the ones less consistent with the predictions of the EL model. In other words, joint training of the EL and ND models encourages learning an entity-linking scoring function consistent with a large proportion of the data set but not necessarily with the entire data set. As we will see in the experimental section, the ND classifier indeed detects noisy data points rather than choosing some random subset of the data. We use the same optimization procedure as for model 1. The second term is estimated at the mini-batch level.
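The trade-off between the two terms can be made concrete with a small sketch. Everything specific here is an assumption for illustration: each margin loss is weighted by 1 − p_N, and the penalty is taken to be a KL divergence between the mini-batch average of p_N and the prior p*_N(1). The function names and default values are hypothetical.

```python
import math

def bernoulli_kl(q, p, eps=1e-8):
    """KL(Bernoulli(q) || Bernoulli(p)), with clipping to avoid log(0)."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def mil_nd_loss(margin_losses, p_noisy, prior_noisy=0.9, eta=1.0):
    """Noise-weighted loss for one mini-batch (assumed form).

    margin_losses -- per-data-point max-margin losses from the EL model
    p_noisy       -- ND classifier probabilities that each point is noisy
    """
    # First term: points judged noisy contribute little to the EL loss.
    weighted = sum((1 - p) * loss for p, loss in zip(p_noisy, margin_losses))
    # Second term: keep the batch-level noise rate close to the prior.
    batch_rate = sum(p_noisy) / len(p_noisy)
    return weighted + eta * bernoulli_kl(batch_rate, prior_noisy)
```

If the ND classifier marks every point as noisy, the first term vanishes but the penalty stays positive whenever the prior is below 1, which is why the degenerate solution is no longer free.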
Testing Unlike model 1, model 2 gives us two options at test time:
• ignoring the ND classifier, thus doing entity disambiguation the same way as for model 1, or
• using the ND classifier as a mechanism to decide if the test data point should be classified as 'undecidable' or not. Specifically, if p_N(1 | m, c, E+) > τ, model 2 will not output an entity for this data point. This should increase precision, as at test time E+ also may not contain the correct entity.
We call the two versions MIL-ND and τ MIL-ND, respectively.
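The two test-time options can be sketched as a single function. The function name and argument layout are hypothetical; `scores` are the EL scores for the candidates in E+, and `p_noisy` is the ND classifier's output for the data point.

```python
def predict(candidates, scores, p_noisy=None, tau=0.75):
    """Return the highest-scoring candidate, or None to abstain.

    With p_noisy=None the ND classifier is ignored (MIL-ND).
    Otherwise the data point is 'undecidable' when p_noisy > tau
    (the thresholded variant).
    """
    if p_noisy is not None and p_noisy > tau:
        return None
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```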

Dataset
We describe how we create our dataset. We use Freebase, though our approach should be applicable to many other knowledge bases. Brief statistics of the dataset are shown in Table 1.

Training set
We took raw texts from the New York Times corpus and tagged them with the CoreNLP named entity recognizer. We then selected only sentences that contain at least two entity mentions. We did this because, on the one hand, in most applications of EL we care about relations between entities (e.g., relation extraction) and, on the other hand, it provides us with an opportunity to prune the candidate list effectively, as discussed below. Note that we do this only for training. For each mention m we carried out candidate selection as follows. First, we listed all entities whose names contain all words of m. For instance, "America" (Figure 1) can be both the nation UNITED STATES OF AMERICA and Simon & Garfunkel's song AMERICA. We ranked these entities by their order in the knowledge base (i.e., an entity that appears first in the knowledge base is ranked first); for Freebase, this order is correlated with prominence.
Second, for each mention (e.g., "Bill Clinton"), we kept only entities which participate in a relation with one of the candidate entities for another mention in the sentence. For training, we sampled |E−| = 10 candidates from the rest of the knowledge base as negative candidates.
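The two candidate selection steps and the negative sampling can be sketched as below. This is a toy illustration under stated assumptions: the knowledge base is a dict from hypothetical entity ids to names, whose insertion order stands in for the knowledge base order; relations are given as a set of entity-id pairs; and all function names are invented for the example.

```python
import random

def name_match(mention, kb_names):
    """Step 1: entities whose names contain all words of the mention."""
    words = set(mention.lower().split())
    return [e for e, name in kb_names.items()
            if words <= set(name.lower().split())]

def relation_filter(cands_by_mention, related_pairs):
    """Step 2: keep candidates related to a candidate of another mention."""
    kept = {}
    for mention, cands in cands_by_mention.items():
        others = {e for m2, c2 in cands_by_mention.items()
                  if m2 != mention for e in c2}
        kept[mention] = [
            e for e in cands
            if any((e, o) in related_pairs or (o, e) in related_pairs
                   for o in others)
        ]
    return kept

def sample_negatives(all_entities, positives, k=10, seed=0):
    """Sample up to k negative candidates from the rest of the KB."""
    pool = [e for e in all_entities if e not in positives]
    return random.Random(seed).sample(pool, min(k, len(pool)))
```

In the sentence from Figure 1, step 2 is what removes the song AMERICA: it has no relation to any candidate for "Bill Clinton", whereas the nation does.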

Development and test sets
We took the manually annotated AIDA-A and AIDA-B as development and test sets. We converted the ground truth Wikipedia links in these sets to Freebase entities using the mapping available in Freebase. Candidate selection was done in the same way as for training, except that we did not filter out sentences with only one entity (i.e. no step 2 from Section 4.1). The oracle recall for surface name matching (i.e. step 1 from Section 4.1) is 77%. It goes down to 50% if we restrict |E+| = 100 (see Figure 2). We believe that there are straightforward ways to improve the selection heuristic (e.g., modifying the string matching heuristic or using word embeddings to match words in entity names and words in the mention) but we leave this for future work.
It is worth noting that because the AIDA CoNLL dataset is based on Reuters newswire articles, these development and test sets do not overlap with the training set.

Experiments
We evaluated the models above using the data from Section 4.
The source code and the data are available at https://github.com/lephong/dl4el
We ran each model five times and report the mean and 95% confidence interval of three metrics: (micro) precision, (micro) recall, and (micro) F1 (Cornolti et al., 2013) under two settings:
• 'All': all mentions are taken into account,
• 'In E+': only mentions with E+ containing the correct entity are considered.
The latter, though not realistic, is interesting as it lets us concentrate on the contribution of the disambiguation model and ignore cases which are hopeless with the considered candidate selection method. Note that, for a system that outputs exactly one entity for each mention (e.g., MIL model 1), precision and recall are equal.
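The relation between precision and recall noted above can be checked with a short sketch of the micro-averaged metrics. The function name is hypothetical, and `None` is used here to represent a mention for which the system outputs no entity.

```python
def micro_prf(predictions, gold):
    """Micro precision, recall, and F1 over mentions.

    predictions -- predicted entity per mention, or None for no output
    gold        -- gold entity per mention
    """
    n_predicted = sum(p is not None for p in predictions)
    n_correct = sum(p is not None and p == g
                    for p, g in zip(predictions, gold))
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

When no prediction is `None`, the two denominators coincide and precision equals recall. Abstaining can only shrink the precision denominator, so it can raise precision but never recall.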

Systems
We compared our models against 'Name matching'. It was proposed by Riedel et al. (2010) for RE: a mention is linked to an entity if it matches the entity's name. For tie cases, we chose the first matched entity appearing in Freebase. For instance, "America" is linked to the song instead of the nation. To our knowledge, name matching is the only method tried in previous work for our setting (i.e. with no annotated texts).
We also compared with a supervised version of model 1. We used the same method as in Section 4.2 to convert the AIDA CoNLL training set, with E+ being singletons consisting of the correct entity provided by human annotators. This system can be considered an upper bound for our two models because: (i) it is trained in a supervised rather than MIL setting, with gold standard labels rather than weak supervision, and (ii) the training set is in the same domain (i.e. Reuters) as the test set. Although it uses only entity types and no other entity-related information for entity disambiguation, in Appendix B we show that this system performs on par with prior work when evaluated in its setting.
Note that comparison with supervised linkers proposed in previous work is not possible as they require Wikipedia (see Section 6) for candidate selection, as a source of supervision, and often for learning entity embeddings.
We tuned hyper-parameters on the development set. Details are in Appendix A. Note that, in model 2 (both MIL-ND and τ MIL-ND), we set the prior p*_N(1) to 0.9, i.e. requiring that 90% of training data points be ignored (see footnote 7). We experimented with |E+| = 100 for both training and testing. For training, we set |E−| = 10.

Results
Table 2 shows results on the test set. 'Name matching' is far behind the two models. Many entities in the knowledge base have similar or even identical names, so relying only on the surface form does not result in an effective method. MIL-ND achieves higher precision, recall, and F1 than MIL, which suggests that the ND classifier helped to eliminate bad data points during training. Using its confidence at test time (τ MIL-ND, 'All' setting) was also beneficial in terms of precision and F1 (it cannot possibly increase recall). Because all the test data points are valid for the 'In E+' setting, using the ND classifier had a slight negative effect on F1.
MIL-ND significantly outperforms MIL: the 95% confidence intervals for them do not overlap. However, this is not the case for MIL-ND and τ MIL-ND. We therefore conclude that the ND classifier is clearly helpful for training and potentially for testing.
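The overlap check on confidence intervals can be sketched as follows. The paper does not state how the intervals were computed, so this example assumes a Student's t interval over the five runs (critical value 2.776 for 4 degrees of freedom). The function names are hypothetical, and any scores passed in are for the reader to supply.

```python
import math
import statistics

def mean_ci(runs, t_crit=2.776):
    """Mean and 95% interval bounds; t_crit=2.776 assumes five runs."""
    mean = statistics.mean(runs)
    half = t_crit * statistics.stdev(runs) / math.sqrt(len(runs))
    return mean, mean - half, mean + half

def intervals_overlap(a, b):
    """True if intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Non-overlapping intervals are a conservative indication of a difference: two systems can differ significantly under a direct test even when their intervals overlap slightly.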

Analysis
Error types In Table 3 we classified errors according to named entity types, using the annotation from Tjong Kim Sang and De Meulder (2003). PER is the easiest type for all systems. Even name matching, without any learning, predicts the correct entity in half of the cases.
For LOC, it turns out that candidate selection is a bottleneck: when candidate selection was flawless, the models made only about 12% errors, down from about 57%. For MISC a similar conclusion can be drawn.
Can the ND classifier detect noise? From the training set, we collected 100 data points and manually checked if a data point is valid (i.e., E+ contains the correct entity). We then checked how the accuracy changes depending on the threshold τ (Figure 3). The accuracy is defined as
$$\text{accuracy}(\tau) = \frac{\#\,\text{valid data points with } p_N < \tau}{\#\,\text{all data points with } p_N < \tau}$$
Figure 3: Accuracy vs τ. There are large plateaus for τ ∈ (0.7, 0.95) and for τ < 0.6 because the ND classifier hardly used these ranges. We hence set τ = 0.75 in τ MIL-ND.
Footnote 7: Section 5.3 shows that 90% is too high, but it helps the model to rely only on those entity disambiguation decisions that are very certain.
As expected, the smaller τ is, the higher the chance is that the chosen data point is valid (i.e., not noise). Hence, we can use the ND classifier to select high quality data points by adjusting τ . For a further examination, from the training set, we collected all 47,213 data points (i.e. 27.8%) with p N (1|m, c, E + ) > τ = 0.75, and randomly chose 100 data points. We found that 89% are indeed noisy. This further confirms that the ND classifier is sufficiently accurate. Some examples are given in Table 4.
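The accuracy used in this analysis is simple to compute from ND probabilities and manual validity labels. The sketch below uses a hypothetical function name and returns `None` when no data point falls below the threshold, since the ratio is undefined in that case.

```python
def accuracy_below_threshold(p_noisy, is_valid, tau):
    """Fraction of valid data points among those with p_N < tau.

    p_noisy  -- ND probabilities that each data point is noisy
    is_valid -- manual labels, True if E+ contains the correct entity
    """
    selected = [valid for p, valid in zip(p_noisy, is_valid) if p < tau]
    if not selected:
        return None  # no data point falls below the threshold
    return sum(selected) / len(selected)
```

Sweeping `tau` over a grid and plotting the result gives a curve of the kind summarized in Figure 3.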

Number of positive candidates
We also experimented with different values of |E+| (10, 50, 100) on the development set (Figure 4).
First, MIL-ND and τ MIL-ND are always better than MIL. This is more apparent in the 'In E+' settings: with this evaluation regime, we zoom in on cases where our models can predict correct entities (of course, all models equally fail for examples outside E+). Using the ND classifier at test time to decide whether to predict an entity or skip (τ MIL-ND) is helpful in the more realistic 'All' setting. The difference between τ MIL-ND and MIL-ND is less pronounced for larger E+. This is expected as the proportion of valid data points is higher, and hence the ND classifier is less necessary at test time. For the 'In E+' setting, τ MIL-ND performs worse than MIL-ND, as we expected, because there are no noisy data points at test time.
Table 4: Examples of 100 randomly chosen sentences from the training set whose p_N(1 | m, c, E+) > τ = 0.75. The first two examples are correctly detected as noise by our ND classifier. The last one is incorrectly detected.
Correctly detected as noise:
* Small-market teams, like Milwaukee and San Diego, even big-market clubs, like Boston, Atlanta and the two [Chicago] franchises, trumpet them.
Candidates: CHICAGO (music single), BOSTON TO CHICAGO (music track)
* The politically powerful [Green] movement in Germany has led public opposition to genetic technology research and production.
Candidates: THE GREEN PRINCE (movie), THE GREEN ARCHER (movie), GREEN GOLD (movie)
Incorrectly detected as noise:
* Everything Forrest remains unaffected by, [Jenny] self-indulgently, self-destructively drowns in: radical politics, drug abuse, promiscuity.
Candidates: JENNY CURRAN (book/film character)
What is wrong with the candidate selector? The above results show that candidate selection is a bottleneck and that the selector we use is far from perfect. We found two cases where the selector is problematic: (i) the mention or the entity name is in an abbreviated form, such as 'U.N.' rather than 'United Nations', (ii) the mention and the entity's name only fuzzily match, such as '[English] county' and ENGLAND (country). We can overcome these problems by extending our surface matching as in Charton et al. (2014) or by using word embeddings.
Even in some cases when the selector does not have any problems with surface matching, the number of candidates may be too large. For instance, consider '[Simpson] killed his wife...': there are more than 1,500 entities in the knowledge base containing the word 'Simpson'. It is unlikely that our entity disambiguation model can deal with such large lists. We may need a stronger mechanism for reducing the number of candidates. For example, we could use document-level information to discard highly unlikely entities.

Related work
[...] (2017), are two-stage methods: candidate generation is followed by selecting an entity from the candidate lists. We follow the same paradigm but with some important differences discussed below.
Most approaches use alias-entity maps, i.e. weighted sets of (mention, entity) pairs created from anchors in Wikipedia. For example, one can count how many times the phrase "the president" refers to BILL CLINTON to assign a weight to the corresponding pair. However, the method requires large annotated datasets, and it cannot deal with less prominent entities. As we do not have access to links, we use surface matching instead.
To choose an entity from a candidate list, two main disambiguation frameworks (Ratinov et al., 2011) have been introduced: local, which resolves mentions independently, and global, which makes use of coherence modeling at the document level. Though we experimented with local models, the local-global distinction is largely orthogonal, as we can directly integrate coherence modeling components in our DL approach.
Different types of supervision have been considered in previous work: full supervision (Yamada et al., 2017; Ganea and Hofmann, 2017; Le and Titov, 2018), combinations of labeled and unlabeled data (Lazic et al., 2015), and even distant supervision (Fan et al., 2015). The approach of Fan et al. (2015) is heavily Wikipedia-based: they rely on a heuristic mapping from Freebase entities to Wikipedia entities, and learn features from Wikipedia articles. Unlike ours, their approach cannot be generalized to set-ups where no documents are available for entities.

Conclusion
We introduced the first approach to entity linking which neither uses annotated texts, nor assumes that entities are associated with textual documents (e.g., Wikipedia articles). We learn the model using the MIL paradigm, and introduce a novel component, a noise detecting classifier, estimated jointly with the EL model. The classifier lets us disregard noisy labels, resulting in a more accurate entity linking model. Experimental results showed that our models substantially outperform the heuristic baseline, and, for certain categories, they approach the model estimated with supervised learning. In future work we will aim to improve candidate selection. We will also use extra document information and jointly predict entities for different mentions in the document.