End-to-End Neural Entity Linking

Entity Linking (EL) is an essential task for semantic text understanding and information extraction. Popular methods separately address the Mention Detection (MD) and Entity Disambiguation (ED) stages of EL, without leveraging their mutual dependency. We here propose the first neural end-to-end EL system that jointly discovers and links entities in a text document. The main idea is to consider all possible spans as potential mentions and learn contextual similarity scores over their entity candidates that are useful for both MD and ED decisions. Key components are context-aware mention embeddings, entity embeddings and a probabilistic mention-entity map, without demanding other engineered features. Empirically, we show that our end-to-end method significantly outperforms popular systems on the Gerbil platform when enough training data is available. Conversely, if testing datasets follow different annotation conventions compared to the training set (e.g. queries/tweets vs news documents), our ED model coupled with a traditional NER system offers the best or second best EL accuracy.


Introduction and Motivation
Towards the goal of automatic text understanding, machine learning models are expected to accurately extract potentially ambiguous mentions of entities from a textual document and link them to a knowledge base (KB), e.g. Wikipedia or Freebase. Known as entity linking, this problem is an essential building block for various Natural Language Processing tasks, e.g. automatic KB construction, question-answering, text summarization, or relation extraction.
An EL system typically performs two tasks: i) Mention Detection (MD), or Named Entity Recognition (NER) when restricted to named entities, extracts entity references in a raw textual input, and ii) Entity Disambiguation (ED) links these spans to their corresponding entities in a KB. Until recently, the common approach of popular systems Ceccarelli et al. (2013); van Erp et al. (2013); Piccinno and Ferragina (2014); Daiber et al. (2013); Hoffart et al. (2011); Steinmetz and Sack (2013) was to solve these two sub-problems independently. However, this ignores the important dependency between the two steps, and errors caused by MD/NER propagate to ED without possibility of recovery Sil and Yates (2013); Luo et al. (2015). We here advocate for models that address the end-to-end EL task, informally arguing that humans understand and generate text in a similar joint manner, discussing entities that are gradually introduced, referenced under multiple names and evolving over time Ji et al. (2017). Further, we emphasize the importance of the mutual dependency between MD and ED.
Table 1: MD errors (underlined) can be avoided by proper context understanding. The correct spans are shown in blue.
1) MD may split a larger span into two mentions of less informative entities:
   B. Obama's wife gave a speech [...]
   Federer's coach [...]
2) MD may split a larger span into two mentions of incorrect entities:
   Obama Castle was built in 1601 in Japan.
   The Kennel Club is UK's official kennel club.
   A bird dog is a type of gun dog or hunting dog.
   Romeo and Juliet by Shakespeare [...]
   Natural killer cells are a type of lymphocyte
   Mary and Max, the 2009 movie [...]
3) MD may choose a shorter span, referring to an incorrect entity:
   The Apple is played again in cinemas.
   The New York Times is a popular newspaper.
4) MD may choose a longer span, referring to an incorrect entity:
   Babies Romeo and Juliet were born hours apart.
* Equal contribution.
First, numerous and more informative linkable spans found by MD offer more contextual cues for ED. Second, finding the true entities appearing in a specific context encourages better mention boundaries, especially for multi-word mentions. For example, in the first sentence of Table 1, understanding the presence of the entity Michelle Obama helps detect its true mention "B. Obama's wife", as opposed to separately linking B. Obama and wife to less informative concepts.
We propose a simple, yet competitive, model for end-to-end EL. Taking inspiration from the recent works of Lee et al. (2017) and Ganea and Hofmann (2017), our model first generates all possible spans (mentions) that have at least one possible entity candidate. Then, each mention-candidate pair receives a context-aware compatibility score based on word and entity embeddings coupled with a neural attention mechanism and a global voting mechanism. During training, we enforce the scores of gold entity-mention pairs to be higher than all possible scores of incorrect candidates or invalid mentions, thus jointly taking the ED and MD decisions.
Our contributions are:
• We address the end-to-end EL task using a simple model that conditions the "linkable" quality of a mention on the strongest context support of its best entity candidate. We do not require expensive manually annotated negative examples of non-linkable mentions. Moreover, we are able to train competitive models using few and only partially annotated documents (with named entities only, such as the CoNLL-AIDA dataset).
• We are among the first to show that, with one single exception, engineered features can be fully replaced by neural embeddings automatically learned for the joint MD & ED task.
• On the Gerbil benchmarking platform (http://gerbil.aksw.org/gerbil/), we empirically show significant gains for the end-to-end EL task when test and training data come from the same domain. Moreover, when testing datasets follow different annotation schemes or exhibit different statistics, our method is still effective in achieving state-of-the-art or close performance, but needs to be coupled with a popular NER system.

Related Work
With few exceptions, MD/NER and ED are treated separately in the vast EL literature. Traditional NER models usually view the problem as a word sequence labeling task that is modeled using conditional random fields on top of engineered features Finkel et al. (2005) or, more recently, using bi-LSTM architectures Lample et al. (2016); Chiu and Nichols (2016); Liu et al. (2017) capable of learning complex lexical and syntactic features.
End-to-end EL is the realistic task and ultimate goal, but challenges in joint NER/MD and ED modeling arise from their different nature. Few previous methods tackle the joint task, where errors in one stage can be recovered by the next stage. In one of the first attempts, Sil and Yates (2013) use a popular NER model to over-generate mentions and let the linking step take the final decisions. However, their method is limited by the dependence on a good mention spotter and by the usage of hand-engineered features. It is also unclear how linking can improve their MD phase. Later, Luo et al. (2015) presented one of the most competitive joint MD and ED models leveraging semi-Conditional Random Fields (semi-CRF). However, there are several weaknesses in this work. First, the mutual task dependency is weak, being captured only by type-category correlation features. The other engineered features used in their model are either NER or ED specific. Second, while their probabilistic graphical model allows for tractable learning and inference, it suffers from high computational complexity caused by the usage of the cartesian product of all possible document span segmentations, NER categories and entity assignments. Another approach is J-NERD Nguyen et al. (2016), which addresses the end-to-end task using only engineered features and a probabilistic graphical model on top of sentence parse trees.

Neural Joint Mention Detection and Entity Disambiguation
We formally introduce the tasks of interest. For EL, the input is a text document (or a query or tweet) given as a sequence $D = \{w_1, \ldots, w_n\}$ of words from a dictionary, $w_k \in \mathcal{W}$. The output of an EL model is a list of mention-entity pairs $\{(m_i, e_i)\}_{i=1}^{T}$, where each mention is a word subsequence of the input document, $m = w_q, \ldots, w_r$, and each entity is an entry in a knowledge base KB (e.g. Wikipedia), $e \in \mathcal{E}$. For the ED task, the list of entity mentions $\{m_i\}_{i=1}^{T}$ that need to be disambiguated is additionally provided as input. The expected output is a list of corresponding entities $\{e_i\}_{i=1}^{T}$.
Note that, in this work, we only link mentions that have a valid gold KB entity, a setting referred to in Röder et al. (2017) as InKB evaluation. Thus, we treat mentions referring to entities outside of the KB as "non-linkable". This is in line with a few previous models, e.g. Luo et al. (2015); Ganea and Hofmann (2017); Yamada et al. (2016). We leave the interesting setting of discovering out-of-KB entities as future work.
We now describe the components of our neural end-to-end EL model, depicted in Figure 1. We aim for simplicity, but competitive accuracy.
Word and Char Embeddings. We use pre-trained Word2Vec vectors Mikolov et al. (2013). In addition, we train character embeddings that capture important word lexical information. Following Lample et al. (2016), for each word independently, we use bidirectional LSTMs Hochreiter and Schmidhuber (1997) on top of learnable char embeddings. These character LSTMs do not extend beyond single word boundaries, but they share the same parameters. Formally, let $\{z_1, \ldots, z_L\}$ be the character vectors of word $w$. We use the forward and backward LSTM formulations defined recursively as in Lample et al. (2016):
$$h^f_t = \text{FWD-LSTM}(h^f_{t-1}, z_t), \qquad h^b_t = \text{BKWD-LSTM}(h^b_{t+1}, z_t)$$
Then, we form the character embedding of $w$ as $[h^f_L; h^b_1]$, i.e. the hidden state of the forward LSTM corresponding to the last character concatenated with the hidden state of the backward LSTM corresponding to the first character. This is then concatenated with the pre-trained word embedding, forming the context-independent word-character embedding of $w$. We denote the sequence of these vectors as $\{v_k\}_{k=1}^{n}$ and depict it as the first neural layer in Figure 1.
Mention Representation. We find it crucial to make word embeddings aware of their local context, thus being informative for both mention boundary detection and entity disambiguation (leveraging contextual cues, e.g. "newspaper"). We thus encode context information into words using a bi-LSTM layer on top of the word-character embeddings $\{v_k\}_{k=1}^{n}$. The hidden states of the forward and backward LSTMs corresponding to each word are then concatenated into context-aware word embeddings, whose sequence is denoted as $\{x_k\}_{k=1}^{n}$. Next, for each possible mention, we produce a fixed size representation inspired by Lee et al. (2017). Given a mention $m = w_q, \ldots, w_r$, we first concatenate the embeddings of the first, last and the "soft head" words of the mention:
$$g^m = [x_q; x_r; \hat{x}^m]$$
The soft head embedding $\hat{x}^m$ is built using an attention mechanism on top of the mention's word embeddings, similar to Lee et al. (2017). However, we found the soft head embedding to only marginally improve results, probably due to the fact that most mentions are at most 2 words long. To learn non-linear interactions between the component word vectors, we project $g^m$ to a final mention representation with the same size as entity embeddings (see below) using a shallow feed-forward neural network FFNN (a simple projection layer):
$$x^m = \text{FFNN}_1(g^m)$$
Entity Embeddings. We use fixed continuous entity representations, namely the pre-trained entity embeddings of Ganea and Hofmann (2017), due to their simplicity and compatibility with the pre-trained word vectors of Mikolov et al. (2013). Briefly, these vectors are computed for each entity in isolation using the following exponential model that approximates the empirical conditional word-entity distribution $\hat{p}(w|e)$ obtained from co-occurrence counts:
$$\frac{\exp(\langle x_w, y_e \rangle)}{\sum_{w' \in \mathcal{W}} \exp(\langle x_{w'}, y_e \rangle)} \approx \hat{p}(w|e)$$
Here, $x_w$ are fixed pre-trained word vectors and $y_e$ is the entity embedding to be trained. In practice, Ganea and Hofmann (2017) [...]
Final Local Score. For each span $m$ that can possibly refer to an entity (i.e. $|C(m)| \geq 1$, where $C(m)$ is the set of entity candidates of $m$) and for each of its entity candidates $e_j \in C(m)$, we compute a similarity score using the embedding dot product $\langle x^m, y_j \rangle$, which should capture useful information for both MD and ED decisions. We then combine it with the log-prior probability $\log p(e_j|m)$ using a shallow FFNN, giving the context-aware entity-mention score:
$$\Psi(e_j, m) = \text{FFNN}_2([\log p(e_j|m); \langle x^m, y_j \rangle]) \qquad (6)$$
Long Range Context Attention. In some cases, our model might be improved by explicitly capturing long context dependencies. To test this, we experimented with the attention model of Ganea and Hofmann (2017). This gives one context embedding per mention based on informative context words that are related to at least one of the candidate entities. We use this additional context embedding for computing the dot-product similarity with any of the candidate entity embeddings. This value is fed as an additional input to FFNN$_2$ in Eq. 6. We refer to this model as long range context attention.
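To illustrate how the mention representation and the final local score fit together, the following minimal Python sketch uses plain lists instead of a neural network library. All function and parameter names (`mention_representation`, `local_score`, `w_alpha`, `W1`, `b1`, `w2`, `b2`) are hypothetical, the FFNNs are taken to be plain linear projections, and averaging the context-aware vectors inside the soft head is an assumption of this sketch rather than a detail confirmed by the text.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Projection without hidden layers: one output per weight row."""
    return [dot(row, x) + b for row, b in zip(weights, bias)]


def mention_representation(x, q, r, w_alpha, W1, b1):
    """Fixed-size vector for the span w_q..w_r (inclusive indices).

    x holds the context-aware word vectors. Averaging x inside the
    soft head is an assumption of this sketch.
    """
    span = x[q:r + 1]
    attention = softmax([dot(w_alpha, vec) for vec in span])
    dim = len(span[0])
    soft_head = [sum(a * vec[d] for a, vec in zip(attention, span))
                 for d in range(dim)]
    g = x[q] + x[r] + soft_head  # concatenation: first, last, soft head
    return linear(W1, b1, g)


def local_score(x_m, y_e, log_prior, w2, b2):
    """Linear combination of the log-prior and the dot-product similarity."""
    return w2[0] * log_prior + w2[1] * dot(x_m, y_e) + b2
```

In this sketch a span is linkable only if it has at least one candidate, and `local_score` would be evaluated once per candidate entity of that span.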
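Training. We assume that a corpus with documents and gold entity-mention pairs $G = \{(m_i, e^*_i)\}_{i=1}^{K}$ is available. At training time, for each input document we collect the set $M$ of all (potentially overlapping) token spans $m$ for which $|C(m)| \geq 1$. We then train the parameters $\theta$ of our model using the following minimization procedure:
$$\theta^* = \arg\min_{\theta} \sum_{D} \sum_{m \in M} \sum_{e \in C(m)} V(\Psi_\theta(e, m))$$
where the outer sum runs over the training documents $D$ and the violation term $V$ enforces the scores of gold pairs to be linearly separable from scores of negative pairs, i.e.
$$V(\Psi(e, m)) = \begin{cases} \max(0, \gamma - \Psi(e, m)), & \text{if } (m, e) \in G \\ \max(0, \Psi(e, m)), & \text{otherwise} \end{cases}$$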
Note that, in the absence of annotated negative examples of "non-linkable" mentions, we assume that all spans in $M$ and their candidates that do not appear in $G$ should not be linked. The model is thus forced to only output negative scores for all entity candidates of such mentions. We call this setting all spans training. Our method can also be used to perform ED only, in which case we train only on gold mentions, i.e. $M = \{m \mid m \in G\}$. This is referred to as gold spans training.
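The all spans training setup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate map is a hypothetical dictionary keyed by surface string, `max_len` is a span-length cap that the text does not mention, and the hinge-style `violation` is one form consistent with the description (gold pairs pushed above a margin `gamma`, all other pairs pushed to non-positive scores).

```python
def candidate_spans(tokens, candidates, max_len=4):
    """All (possibly overlapping) spans (q, r) with at least one candidate.

    candidates maps a surface string to a list of entity names
    (a hypothetical stand-in for the candidate set C(m)).
    """
    spans = []
    n = len(tokens)
    for q in range(n):
        for r in range(q, min(n, q + max_len)):
            surface = " ".join(tokens[q:r + 1])
            if candidates.get(surface):
                spans.append((q, r))
    return spans


def violation(score, is_gold, gamma=0.2):
    """Penalty for a single (span, entity) score."""
    if is_gold:
        return max(0.0, gamma - score)  # gold pairs should reach the margin
    return max(0.0, score)              # every other pair should stay <= 0


def document_loss(scores, gold, gamma=0.2):
    """Sum of violations over all scored (span, entity) pairs of a document.

    scores: {(span, entity): score}; gold: set of (span, entity) pairs.
    """
    return sum(violation(s, key in gold, gamma) for key, s in scores.items())
```

For gold spans training, the same `document_loss` would simply be applied to scores computed for the gold mentions only, instead of for every span returned by `candidate_spans`.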
Inference. At test time, our method can be applied for both EL and ED only as follows. First, for each document in our validation or test sets, we select all possibly linkable token spans, i.e. $M = \{m \mid |C(m)| \geq 1\}$ for EL, or the input set of mentions $M = \{m \mid m \in G\}$ for ED, respectively. Second, the best linking threshold $\delta$ is found on the validation set such that the micro F1 metric is maximized when only linking mention-entity pairs with a $\Psi$ score greater than $\delta$. At test time, only entity-mention pairs with a score higher than $\delta$ are kept and sorted according to their $\Psi$ scores; the final annotations are greedily produced based on this set such that only spans not overlapping with previously selected spans (of higher scores) are chosen.
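The inference procedure above can be sketched as follows. The function names, the tuple layout `(span, entity, score)`, and the grid search over candidate thresholds are assumptions made for illustration; the text only states that the threshold maximizing micro F1 on the validation set is used.

```python
def overlaps(a, b):
    """True if two inclusive token spans (start, end) share a token."""
    return a[0] <= b[1] and b[0] <= a[1]


def greedy_annotate(scored, delta):
    """Keep pairs scoring above delta, best first, skipping overlapping spans.

    scored: list of (span, entity, score) triples for one document.
    """
    kept = sorted((t for t in scored if t[2] > delta),
                  key=lambda t: t[2], reverse=True)
    selected = []
    for span, entity, _ in kept:
        if all(not overlaps(span, chosen) for chosen, _ in selected):
            selected.append((span, entity))
    return selected


def micro_f1(pred_docs, gold_docs):
    """Micro F1 with exact (span, entity) matching, pooled over documents."""
    tp = sum(len(set(p) & set(g)) for p, g in zip(pred_docs, gold_docs))
    if tp == 0:
        return 0.0
    precision = tp / sum(len(p) for p in pred_docs)
    recall = tp / sum(len(g) for g in gold_docs)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_docs, gold_docs, grid):
    """Pick the threshold from grid that maximizes validation micro F1."""
    return max(grid, key=lambda d: micro_f1(
        [greedy_annotate(s, d) for s in scored_docs], gold_docs))
```

Sorting before the overlap check is what makes the selection greedy: a lower-scored span is dropped whenever it shares a token with an already selected, higher-scored span.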
Global Disambiguation. Our current model is "local", i.e. it performs disambiguation of each candidate span independently. To enhance it, we add an extra layer to our neural network that promotes coherence among linked and disambiguated entities inside the same document, i.e. the global disambiguation layer. Specifically, we compute a "global" mention-entity score based on which we produce the final annotations. We first define the set of mention-entity pairs that are allowed to participate in the global disambiguation voting, namely those that already have a high local score:
$$V_G = \{(m, e) \mid m \in M,\ e \in C(m),\ \Psi(e, m) \geq \gamma'\}$$
Since we initially consider all possible spans and for each span up to $s$ candidate entities, this filtering step is important to avoid both undesired noise and exponential complexity for the EL task, for which $M$ is typically much bigger than for ED. The final "global" score $G(e_j, m)$ for entity candidate $e_j$ of mention $m$ is given by the cosine similarity between the entity embedding and the normalized average of all other voting entities' embeddings (of the other mentions $m'$):
$$G(e_j, m) = \cos(y_{e_j}, y_G^m)$$
where $y_G^m$ denotes the normalized average of the embeddings $y_e$ over all pairs $(m', e) \in V_G$ with $m' \neq m$. This is combined with the local score, yielding
$$\Phi(e_j, m) = \text{FFNN}_3([\Psi(e_j, m); G(e_j, m)])$$
The final loss function is now slightly modified. Specifically, we enforce the linear separability in two places: in $\Psi(e, m)$ (exactly as before), but also in $\Phi(e, m)$, as follows:
$$\theta^* = \arg\min_{\theta} \sum_{D} \sum_{m \in M} \sum_{e \in C(m)} V(\Psi_\theta(e, m)) + V(\Phi_\theta(e, m))$$
The inference procedure remains unchanged in this case, with the exception that it only uses the $\Phi(e, m)$ global score.
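A minimal Python sketch of the voting step is given below. The names `global_scores` and `gamma_prime`, the dictionary layout, and the fallback value of 0.0 when no other mention has a voting entity are assumptions of this sketch. Because cosine similarity is invariant to positive scaling, summing the voters' embeddings gives the same result as using their normalized average.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def global_scores(local, embeddings, gamma_prime=0.0):
    """Global score for every (mention, entity) pair of one document.

    local: {(mention, entity): local score}
    embeddings: {entity: vector}
    Only pairs whose local score reaches gamma_prime are allowed to vote,
    and a mention never votes for its own candidates.
    """
    voters = [(m, e) for (m, e), score in local.items() if score >= gamma_prime]
    result = {}
    for m, e in local:
        others = [embeddings[e2] for m2, e2 in voters if m2 != m]
        if not others:
            result[(m, e)] = 0.0  # assumption: no voters means no global signal
            continue
        dim = len(embeddings[e])
        summed = [sum(vec[d] for vec in others) for d in range(dim)]
        result[(m, e)] = cosine(embeddings[e], summed)
    return result
```

Filtering the voters first keeps the cost linear in the number of high-scoring pairs, which matters for EL where the set of candidate spans is large.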
Coreference Resolution Heuristic. In a few cases we found it important to be able to solve simple coreference resolution cases (e.g. "Alan" referring to "Alan Shearer"). These cases are difficult to handle with our candidate selection strategy. We thus adopt the simple heuristic described in Ganea and Hofmann (2017) and observed between 0.5% and 1% improvement on all datasets.

Experiments
Datasets and Metrics. We used Wikipedia 2014 as our KB. We conducted experiments on the most important public EL datasets using the Gerbil platform Röder et al. (2017). Dataset statistics are provided in Tables 5 and 6 (Appendix). This benchmarking framework offers reliable evaluation and comparison with state-of-the-art EL/ED methods on most of the public datasets for this task. It also shows how well different systems generalize to datasets from very different domains and annotation schemes compared to their training sets. Moreover, it offers evaluation metrics for the end-to-end EL task, as opposed to some works that only evaluate NER and ED separately, e.g. Luo et al. (2015). As previously explained, we do not use the NIL mentions (without a valid KB entity) and only compare against other systems using the InKB metrics.
For training we used the biggest publicly available EL dataset, AIDA/CoNLL Hoffart et al. (2011). We report micro and macro InKB F1 scores for both EL and ED. For EL, these metrics are computed both in the strong matching and weak matching settings. The former requires exactly predicting the gold mention boundaries and their entity annotations, whereas the latter gives a perfect score to spans that just overlap with the gold mentions and are linked to the correct gold entities.
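The difference between the two matching settings can be made concrete with a small Python sketch. It is a simplification for a single document and is not the Gerbil implementation; the function names and the `(span, entity)` tuple layout are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive token spans (start, end) share a token."""
    return a[0] <= b[1] and b[0] <= a[1]


def strong_match(pred, gold):
    """Exact span boundaries and the same entity."""
    return pred == gold


def weak_match(pred, gold):
    """Overlapping spans and the same entity."""
    return pred[1] == gold[1] and overlaps(pred[0], gold[0])


def f1_score(preds, golds, match):
    """F1 of predicted vs. gold (span, entity) pairs under a match function."""
    if not preds or not golds:
        return 0.0
    hit_pred = sum(1 for p in preds if any(match(p, g) for g in golds))
    hit_gold = sum(1 for g in golds if any(match(p, g) for p in preds))
    precision = hit_pred / len(preds)
    recall = hit_gold / len(golds)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A prediction that covers only part of a gold mention but links it to the right entity counts under `weak_match` and is rejected by `strong_match`, which is why a small gap between the two scores indicates accurate mention boundaries.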
Baselines. We compare with popular and state-of-the-art EL and ED public systems, e.g. WAT Piccinno and Ferragina (2014), [...] Moro et al. (2014). These are integrated within the Gerbil platform. However, we did not compare with Luo et al. (2015) and other models outside Gerbil that do not use end-to-end EL metrics.
Table 3: AIDA-A dataset: Gold mentions are split by the position at which they appear in the p(e|m) dictionary. In each cell, the upper value is the percentage of the gold mentions that were annotated with the correct entity (recall), whereas the lower value is the percentage of gold mentions for which our system's highest scored entity is the ground truth entity, but that might not be annotated in the end because its score is below the threshold δ.
The pre-trained word vectors are 300 dimensional, while 50 dimensional trainable character vectors were used. The char LSTMs also have a hidden dimension of 50. Thus, word-character embeddings are 400 dimensional. The contextual LSTMs have a hidden size of 150, resulting in 300 dimensional context-aware word vectors. We apply dropout on the concatenated word-character embeddings, on the output of the bidirectional context LSTM and on the entity embeddings used in Eq. 6. The three FFNNs in our model are simple projections without hidden layers (no improvements were obtained with deeper layers). For the long range context attention we used a word window size of K = 200 and keep the top R = 10 words after the hard attention layer (notations from Ganea and Hofmann (2017)). We use at most s = 30 entity candidates per mention both at train and test time. γ is set to 0.2 without further investigation. γ' is set to 0, but a value of 0.1 gave similar results.
For the loss optimization we use Adam Kingma and Ba (2014) with a learning rate of 0.001. We perform early stopping by evaluating the model on the AIDA validation set every 10 minutes and stopping after 6 consecutive evaluations with no significant improvement in the macro F1 score.
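The early stopping rule can be sketched as a small helper over the sequence of validation scores. The function name and the `min_delta` parameter, which stands in for the unspecified notion of a "significant" improvement, are assumptions of this sketch.

```python
def early_stop_index(f1_history, patience=6, min_delta=0.0):
    """Index of the evaluation at which training would stop, or None.

    f1_history: validation macro F1 values in evaluation order.
    Training stops after `patience` consecutive evaluations that fail
    to beat the best score so far by more than `min_delta`.
    """
    best = float("-inf")
    stale = 0
    for i, f1 in enumerate(f1_history):
        if f1 > best + min_delta:
            best = f1
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None
```

With one evaluation every 10 minutes and a patience of 6, such a rule would stop a run roughly one hour after its last improvement on the validation set.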
Results and Discussion. The following models are used.
i) Base model: only uses the mention local score and the log-prior. It uses neither long range attention, nor global disambiguation, nor the head attention mechanism.
ii) Base model + att: the Base model plus Long Range Context Attention.
iii) Base model + att + global: our Global model (depicted in Figure 1).
iv) ED base model + att + global Stanford NER: our ED Global model that runs on top of the detected mentions of the Stanford NER system Finkel et al. (2005).
EL strong and weak matching results are presented in Tables 2 and 7 (Appendix).
We first note that our system outperforms all baselines on the end-to-end EL task on both AIDA-A (dev) and AIDA-B (test) datasets, which are the biggest EL datasets publicly available. Moreover, we surpass all competitors on both EL and ED by a large margin, at least 9%, showcasing the effectiveness of our method. We also outperform systems that optimize MD and ED separately, including our ED base model + att + global Stanford NER. This demonstrates the merit of joint MD + ED optimization.
In addition, one can observe that weak matching EL results are comparable with the strong matching results, showcasing that our method is very good at detecting mention boundaries.
At this point, our main goal was achieved: if enough training data is available with the same characteristics or annotation schemes as the test data, then our joint EL offers the best model. This is true not only when training on AIDA, but also for other types of datasets such as queries (Table 11) or tweets (Table 12). However, when testing data has different statistics or follows different conventions than the training data, our method is shown to work best in conjunction with a state-of-the-art NER system, as can be seen from the results of our ED base model + att + global Stanford NER for different datasets in Table 2. It is expected that such a NER system designed with a broader generalization scheme in mind would help in this case, which is confirmed by our results on different datasets.
Table 4: Error analysis on a sample document. Green corresponds to true positive (correctly discovered and annotated mention), red to false negative (ground truth mention or entity that was not annotated) and orange to false positive (incorrect mention or entity annotation).
[3, and [5 cases illustrate the main source of errors. These are false negatives in which our model has the correct ground truth entity pair as the highest scored one for that mention, but since it is not confident enough (score < γ) it decides not to annotate that mention. In this specific document these errors could probably be avoided easily with a better coreference resolution mechanism.
[3 and [4 cases illustrate that the gold standard can be problematic. Specifically, instead of annotating the whole span Korean War and linking it to the war of 1950, the gold annotation only includes Korean and links it to the general entity of Korea_(country).
[2 is correctly annotated by our system but it is not included in the gold standard.
While the main focus of this paper is the end-to-end EL and not the ED-only task, we do show ED results in Tables 8 and 9. We observe that our models are slightly behind recent top performing systems, but our unified EL-ED architecture has to deal with other challenges, e.g. being able to exchange global information between many more mentions at the EL stage, and is thus not suitable for expensive global ED strategies. We leave bridging this gap for ED as future work.
Additional results and insights are shown in the Appendix. Table 3 shows an ablation study of our method. One can see that the log p(e|m) prior is very helpful for correctly linking unambiguous mentions, but introduces noise when gold entities are not frequent. For this category of rare entities, removing this prior completely results in a significant improvement, but this is not a practical choice since the gold entity is unknown at test time.

Ablation study
Error Analysis. We conducted a qualitative experiment shown in Table 4. We showcase correct annotations, as well as errors made by our system on the AIDA datasets. Inspecting the output annotations of our EL model, we observed that it neither over-generates incorrect mentions nor under-generates (misses) gold spans. We also observed that additional mentions generated by our model do correspond, most of the time, to actual KB entities that are missing from the gold annotations.

Conclusion
We presented the first neural end-to-end entity linking model and showed the benefit of jointly optimizing entity recognition and linking. Leveraging key components, namely word, entity and mention embeddings, we showed that engineered features can be almost completely replaced by modern neural networks. Empirically, on the established Gerbil benchmarking platform, we exhibit state-of-the-art performance for EL on the biggest public dataset, AIDA/CoNLL, also showing good generalization ability on other datasets with very different characteristics when combining our model with the popular Stanford NER system.
Our code is publicly available.