Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking

Extraction from raw text to a knowledge base of entities and fine-grained types is often cast as prediction into a flat set of entity and type labels, neglecting the rich hierarchies over types and entities contained in curated ontologies. Previous attempts to incorporate hierarchical structure have yielded little benefit and are restricted to shallow ontologies. This paper presents new methods using real and complex bilinear mappings for integrating hierarchical information, yielding substantial improvement over flat predictions in entity linking and fine-grained entity typing, and achieving new state-of-the-art results for end-to-end models on the benchmark FIGER dataset. We also present two new human-annotated datasets containing wide and deep hierarchies which we will release to the community to encourage further research in this direction: MedMentions, a collection of PubMed abstracts in which 246k mentions have been mapped to the massive UMLS ontology; and TypeNet, which aligns Freebase types with the WordNet hierarchy to obtain nearly 2k entity types. In experiments on all three datasets we show substantial gains from hierarchy-aware training.


Introduction
Identifying and understanding entities is a central component in knowledge base construction (Roth et al., 2015) and essential for enhancing downstream tasks such as relation extraction (Yaghoobzadeh et al., 2017b), question answering (Das et al., 2017; Welbl et al., 2017) and search (Dalton et al., 2014). This has led to considerable research in automatically identifying entities in text, predicting their types, and linking them to existing structured knowledge sources. Data and code for experiments: https://github.com/MurtyShikhar/Hierarchical-Typing
Current state-of-the-art models encode a textual mention with a neural network and classify the mention as being an instance of a fine-grained type or entity in a knowledge base. Although in many cases the types and their entities are arranged in a hierarchical ontology, most approaches ignore this structure, and previous attempts to incorporate hierarchical information yielded little improvement in performance (Shimaoka et al., 2017). Additionally, existing benchmark entity typing datasets only consider small label sets arranged in very shallow hierarchies. For example, FIGER (Ling and Weld, 2012), the de facto standard fine-grained entity type dataset, contains only 113 types in a hierarchy only two levels deep.
In this paper we investigate models that explicitly integrate hierarchical information into the embedding space of entities and types, using a hierarchy-aware loss on top of a deep neural network classifier over textual mentions. By using this additional information, we learn a richer, more robust representation, gaining statistical efficiency when predicting similar concepts and aiding the classification of rarer types. We first validate our methods on the narrow, shallow type system of FIGER, out-performing state-of-the-art methods not incorporating hand-crafted features and matching those that do.
To evaluate on richer datasets and stimulate further research into hierarchical entity typing and linking with larger and deeper ontologies, we introduce two new human-annotated datasets. The first is MedMentions, a collection of PubMed abstracts in which 246k concept mentions have been annotated with links to the Unified Medical Language System (UMLS) ontology (Bodenreider, 2004), an order of magnitude more annotations than comparable datasets. UMLS contains over 3.5 million concepts in a hierarchy having average depth 14.4. Interestingly, UMLS does not distinguish between types and entities (an approach we heartily endorse), and the technical details of linking to such a massive ontology lead us to refer to our MedMentions experiments as entity linking. Second, we present TypeNet, a curated mapping from the Freebase type system into the WordNet hierarchy. TypeNet contains over 1900 types with an average depth of 7.8.
In experimental results, we show improvements with a hierarchically-aware training loss on each of the three datasets. In entity-linking MedMentions to UMLS, we observe a 6% relative increase in accuracy over the base model. In experiments on entity-typing from Wikipedia into TypeNet, we show that incorporating the hierarchy of types and including a hierarchical loss provides a dramatic 29% relative increase in MAP. Our models even provide benefits for shallow hierarchies allowing us to match the state-of-art results of Shimaoka et al. (2017) on the FIGER (GOLD) dataset without requiring hand-crafted features.
We will publicly release the TypeNet and MedMentions datasets to the community to encourage further research in truly fine-grained, hierarchical entity typing and linking.

MedMentions
Over the years researchers have constructed many large knowledge bases in the biomedical domain (Apweiler et al., 2004;Davis et al., 2008;Chatraryamontri et al., 2017). Many of these knowledge bases are specific to a particular sub-domain encompassing a few particular types such as genes and diseases (Piñero et al., 2017).
UMLS (Bodenreider, 2004) is particularly comprehensive, containing over 3.5 million concepts (UMLS does not distinguish between entities and types) along with their relationships and a curated hierarchical ontology. For example, LETM1 Protein IS-A Calcium Binding Protein IS-A Binding Protein IS-A Protein IS-A Genome Encoded Entity. This makes UMLS particularly well suited for methods explicitly exploiting hierarchical structure.
Accurately linking textual biological entity mentions to an existing knowledge base is extremely important but few richly annotated resources are available. Even when resources do exist, they often contain no more than a few thousand annotated entity mentions which is insufficient for training state-of-the-art neural network entity linkers. State-of-the-art methods must instead rely on string matching between entity mentions and canonical entity names (Leaman et al., 2013;Wei et al., 2015;Leaman and Lu, 2016). To address this, we constructed MedMentions, a new, large dataset identifying and linking entity mentions in PubMed abstracts to specific UMLS concepts. Professional annotators exhaustively annotated UMLS entity mentions from 3704 PubMed abstracts, resulting in 246,000 linked mention spans. The average depth in the hierarchy of a concept from our annotated set is 14.4 and the maximum depth is 43.
MedMentions contains an order of magnitude more annotations than similar biological entity linking PubMed datasets (Dogan et al., 2014; Wei et al., 2015; Li et al., 2016). Additionally, these datasets contain annotations for only one or two entity types (e.g., genes, or chemicals and diseases). MedMentions instead contains annotations for a wide diversity of entities linking to UMLS. Statistics for several other datasets are in Table 1.
To construct TypeNet, we first consider all Freebase types that were linked to more than 20 entities. This is done to eliminate types that are either very specific or very rare. We also remove all Freebase API types, e.g. the [/freebase, /dataworld, /schema, /atom, /scheme, and /topics] domains.
For each remaining Freebase type, we generate a list of candidate WordNet synsets through a substring match. An expert annotator then attempted to map the Freebase type to one or more synsets in the candidate list with a parent-of, child-of or equivalence link by comparing the definitions of each synset with example entities of the Freebase type. If no match was found, the annotator manually formulated queries for the online WordNet API until an appropriate synset was found. See Table 9 for an example annotation.
Two expert annotators independently aligned each Freebase type before meeting to resolve any conflicts. The annotators were conservative with assigning equivalence links, resulting in a greater number of child-of links. The final dataset contained 13 parent-of, 727 child-of, and 380 equivalence links. Note that some Freebase types have multiple child-of links to WordNet, making TypeNet, like WordNet, a directed acyclic graph. We then took the union of each of our annotated Freebase types, the synsets that they linked to, and any ancestors of those synsets.
We also added an additional set of 614 FB → FB links. This was done by computing conditional probabilities of Freebase types given other Freebase types from a collection of 5 million randomly chosen Freebase entities. The conditional probability $P(t_2 \mid t_1)$ of a Freebase type $t_2$ given another Freebase type $t_1$ was calculated as $P(t_2 \mid t_1) = \#(t_1, t_2) / \#t_1$, where $\#(t_1, t_2)$ counts entities carrying both types and $\#t_1$ counts entities carrying $t_1$. Links with a conditional probability less than or equal to 0.7 were discarded. The remaining links were manually verified by an expert annotator and valid links were added to the final dataset, preserving acyclicity.
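The filtering step for FB → FB links can be sketched as follows. This is a minimal illustration on toy data, not the authors' pipeline: the function name `candidate_links` and the four example entities are assumptions, and the manual verification and acyclicity check described above are not modeled.

```python
from collections import Counter
from itertools import permutations

def candidate_links(entity_types, threshold=0.7):
    """Return {(t1, t2): P(t2 | t1)} for ordered type pairs above the threshold."""
    type_count = Counter()   # #t1: entities carrying type t1
    pair_count = Counter()   # #(t1, t2): entities carrying both types
    for types in entity_types:
        unique = set(types)
        type_count.update(unique)
        pair_count.update(permutations(unique, 2))  # ordered pairs, t1 != t2
    links = {}
    for (t1, t2), joint in pair_count.items():
        p = joint / type_count[t1]
        if p > threshold:    # probabilities <= 0.7 are discarded
            links[(t1, t2)] = p
    return links

# Toy data (hypothetical): every artist is a person, not every person is an artist.
entities = [
    {"/people/person", "/music/artist"},
    {"/people/person", "/music/artist"},
    {"/people/person"},
    {"/people/person"},
]
links = candidate_links(entities)
print(links)
```

In this toy case only the link from `/music/artist` to `/people/person` survives, since the reverse direction has probability 0.5.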

Background: Entity Typing and Linking
We define a textual mention m as a sentence with an identified entity. The goal is then to classify m with one or more labels. For example, we could take the sentence m = "Barack Obama is the President of the United States." with the identified entity string Barack Obama. In the task of entity linking, we want to map m to a specific entity in a knowledge base such as "m/02mjmr" in Freebase. In mention-level typing, we label m with one or more types from our type system T such as $t^m$ = {president, leader, politician} (Ling and Weld, 2012; Gillick et al., 2014; Shimaoka et al., 2017). In entity-level typing, we instead consider a bag of mentions $B_e$ which are all linked to the same entity. We label $B_e$ with $t^e$, the set of all types expressed in all $m \in B_e$ (Yao et al., 2013; Neelakantan and Chang, 2015; Verga et al., 2017; Yaghoobzadeh et al., 2017a).

Mention Encoder
Our model converts each mention m to a d dimensional vector. This vector is used to classify the type or entity of the mention. The basic model depicted in Figure 1 concatenates the averaged word embeddings of the mention string with the output of a convolutional neural network (CNN). The word embeddings of the mention string capture global, context independent semantics while the CNN encodes a context dependent representation.

Token Representation
Each sentence is made up of s tokens which are mapped to $d_w$-dimensional word embeddings. Because sentences may contain mentions of more than one entity, we explicitly encode a distinguished mention in the text using position embeddings, which have been shown to be useful in state-of-the-art relation extraction models (dos Santos et al., 2015; Lin et al., 2016) and machine translation (Vaswani et al., 2017). Each word embedding is concatenated with a $d_p$-dimensional learned position embedding encoding the token's relative distance to the target entity. Each token within the distinguished mention span has position 0, tokens to the left have a negative distance in $[-s, 0)$, and tokens to the right of the mention span have a positive distance in $(0, s]$. We denote the final sequence of token representations as M.
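The relative position scheme can be illustrated with a short sketch. It assumes that distance is measured to the nearest boundary of the mention span and that the span is given as a half-open token range; the function name `relative_positions` is hypothetical.

```python
def relative_positions(num_tokens, span_start, span_end):
    """Relative distance of each token to the mention span [span_start, span_end)."""
    positions = []
    for i in range(num_tokens):
        if i < span_start:
            positions.append(i - span_start)       # negative: left of the mention
        elif i >= span_end:
            positions.append(i - span_end + 1)     # positive: right of the mention
        else:
            positions.append(0)                    # inside the mention span
    return positions

tokens = "Barack Obama is the President of the United States .".split()
positions = relative_positions(len(tokens), 0, 2)
print(list(zip(tokens, positions)))
```

Before looking up a learned position embedding, an implementation would typically shift these values by s so that every index is non-negative.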

Sentence Representation
The embedded sequence M is then fed into our context encoder, a single-layer CNN followed by a tanh non-linearity, to produce C. The outputs are max-pooled across time to get a final context embedding, $m_{CNN}$. Each element of M is a token representation in $\mathbb{R}^d$, w denotes the width of the CNN filter, and the max is taken pointwise. In all of our experiments we set w = 5.
In addition to the contextually encoded mention, we create a global mention encoding, $m_G$, by averaging the word embeddings of the tokens within the mention span.
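The final mention representation $m_F$ is constructed by concatenating $m_{CNN}$ and $m_G$ and applying a two-layer feed-forward network with tanh non-linearity (see Figure 1):
$$m_F = W_2 \tanh(W_1 [m_{CNN}; m_G] + b_1) + b_2$$
where $[\cdot\,;\cdot]$ denotes concatenation and $W_1$, $W_2$, $b_1$, $b_2$ are learned parameters.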
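A toy version of the mention encoder can make the data flow concrete. This is a sketch under stated assumptions rather than the authors' implementation: the convolution uses no padding, all parameters are random, the second feed-forward layer is linear, and the helper names (`cnn_context`, `encode_mention`, and so on) are hypothetical.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def cnn_context(M, filters, bias):
    """Width-w convolution with tanh, then a pointwise max over time."""
    w = len(filters)
    C = []
    for i in range(len(M) - w + 1):            # no padding (assumption)
        pre = list(bias)
        for j in range(w):
            pre = [p + q for p, q in zip(pre, matvec(filters[j], M[i + j]))]
        C.append([math.tanh(p) for p in pre])
    return [max(col) for col in zip(*C)]

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def encode_mention(M, word_vecs, span, filters, bias, W1, b1, W2, b2):
    m_cnn = cnn_context(M, filters, bias)                 # context encoding
    m_g = mean_vector(word_vecs[span[0]:span[1]])         # averaged mention words
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, m_cnn + m_g), b1)]
    return [a + b for a, b in zip(matvec(W2, hidden), b2)]

rng = random.Random(0)
s, d_w, d_p, d, w = 7, 4, 2, 3, 5              # toy sizes; w = 5 as in the text
word_vecs = [[rng.uniform(-1, 1) for _ in range(d_w)] for _ in range(s)]
pos_vecs = [[rng.uniform(-1, 1) for _ in range(d_p)] for _ in range(s)]
M = [wv + pv for wv, pv in zip(word_vecs, pos_vecs)]      # token representations
filters = [rand_matrix(d, d_w + d_p, rng) for _ in range(w)]
m_cnn = cnn_context(M, filters, [0.0] * d)
m_F = encode_mention(M, word_vecs, (0, 2), filters, [0.0] * d,
                     rand_matrix(d, d + d_w, rng), [0.0] * d,
                     rand_matrix(d, d, rng), [0.0] * d)
print(len(m_cnn), len(m_F))
```

A real implementation would use a tensor library and learn the position embeddings; the pure-Python loops are only meant to show which vectors are concatenated and where the pooling happens.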

Mention-Level Typing
Mention-level entity typing is treated as multi-label prediction. Given the sentence vector $m_F$, we compute a score for each type in typeset T as:
$$y_j = t_j^\top m_F$$
where $t_j$ is the embedding for the j-th type in T and $y_j$ is its corresponding score. The mention is labeled with $t^m$, a binary vector of all types where $t^m_j = 1$ if the j-th type is in the set of gold types for m and 0 otherwise. We optimize a multi-label binary cross entropy objective:
$$L_{type}(m) = -\sum_j \left[ t^m_j \log \sigma(y_j) + (1 - t^m_j) \log(1 - \sigma(y_j)) \right]$$
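The scoring and loss for mention-level typing can be sketched in a few lines. The example assumes a plain dot product between the mention vector and each type embedding and uses made-up two-dimensional vectors; `type_scores` and `multilabel_bce` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def type_scores(m_F, type_embeddings):
    """Dot-product score between the mention vector and each type embedding."""
    return [sum(t * m for t, m in zip(t_j, m_F)) for t_j in type_embeddings]

def multilabel_bce(scores, gold):
    """Sum of per-type binary cross entropies; gold is a 0/1 vector."""
    loss = 0.0
    for y, t in zip(scores, gold):
        p = sigmoid(y)
        loss -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return loss

m_F = [1.0, -1.0]
types = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]   # toy embeddings for three types
y = type_scores(m_F, types)
loss = multilabel_bce(y, [1, 0, 0])
print(y, loss)
```

For large-magnitude logits a numerically stable formulation (working with log-sigmoid directly) would be preferable to computing the sigmoid first.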

Entity-Level Typing
In the absence of mention-level annotations, we instead must rely on distant supervision (Mintz et al., 2009) to noisily label all mentions of entity e with all types belonging to e. This procedure inevitably leads to noise as not all mentions of an entity express each of its known types. To alleviate this noise, we use multi-instance multi-label learning (MIML) (Surdeanu et al., 2012) which operates over bags rather than mentions. A bag of mentions $B_e = \{m_1, m_2, \ldots, m_n\}$ is the set of all mentions belonging to entity e. The bag is labeled with $t^e$, a binary vector of all types where $t^e_j = 1$ if the j-th type is in the set of gold types for e and 0 otherwise. For every entity, we subsample k mentions from its bag of mentions. Each mention is then encoded independently using the model described in Section 3.2, resulting in a bag of vectors. Each of the k sentence vectors $m^i_F$ is used to compute a score for each type in $t^e$:
$$y_i^j = t_j^\top m^i_F$$
where $t_j$ is the embedding for the j-th type in $t^e$ and $y_i$ is a vector of logits corresponding to the i-th mention. The final bag predictions are obtained using element-wise LogSumExp pooling across the k logit vectors in the bag to produce entity-level logits y:
$$y = \log \sum_{i=1}^{k} \exp(y_i)$$
We use these final bag-level predictions to optimize a multi-label binary cross entropy objective:
$$L_{type}(B_e) = -\sum_j \left[ t^e_j \log \sigma(y_j) + (1 - t^e_j) \log(1 - \sigma(y_j)) \right]$$
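Element-wise LogSumExp pooling over a bag can be illustrated as below. The logits are invented for the example and `pool_bag` is a hypothetical name; the max-shift inside `logsumexp` is a standard numerical-stability trick, not something specified in the text.

```python
import math

def logsumexp(values):
    m = max(values)                      # shift by the max for stability
    return m + math.log(sum(math.exp(v - m) for v in values))

def pool_bag(mention_logits):
    """Element-wise LogSumExp over k logit vectors, one output logit per type."""
    return [logsumexp(column) for column in zip(*mention_logits)]

# Three mentions, two types: only the first mention expresses type 0.
bag = [[4.0, -3.0], [-2.0, -3.0], [-1.0, -3.0]]
y = pool_bag(bag)
print(y)
```

LogSumExp acts as a smooth maximum: a single confident mention is enough to produce a high entity-level logit, which suits the noisy distant-supervision setting where most mentions do not express every type.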

Entity Linking
Entity linking is similar to mention-level entity typing with a single correct class per mention. Because the set of possible entities is in the millions, linking models typically integrate an alias table mapping entity mentions to a set of possible candidate entities. Given a large corpus of entity-linked data, one can compute conditional probabilities from mention strings to entities (Spitkovsky and Chang, 2012). In many scenarios this data is unavailable. However, knowledge bases such as UMLS contain a canonical string name for each of their curated entities. State-of-the-art biological entity linking systems tend to operate on various string edit metrics between the entity mention string and the set of canonical entity strings in the existing structured knowledge base (Leaman et al., 2013; Wei et al., 2015). For each mention in our dataset, we generate 100 candidate entities $e_c = (e_1, e_2, \ldots, e_{100})$, each with an associated string similarity score csim. See Appendix A.5.1 for more details on candidate generation. We generate the sentence representation $m_F$ using our encoder and compute a similarity score between $m_F$ and the learned embedding e of each of the candidate entities. This score and the string cosine similarity csim are combined via a learned linear combination to generate our final score. The final prediction at test time, $\hat{e}$, is the candidate entity with the highest final score for the mention.
We optimize this model by multinomial cross entropy over the set of candidate entities and correct entity e.
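A small sketch of candidate scoring for linking follows. It assumes the embedding similarity is a dot product and that the learned linear combination has two scalar weights, here called `alpha` and `beta`; those names, the toy vectors, and the three-candidate set are all hypothetical.

```python
import math

def link_scores(m_F, cand_embeddings, csims, alpha, beta):
    """Linear combination of embedding similarity and string similarity."""
    return [alpha * sum(e * m for e, m in zip(e_i, m_F)) + beta * c
            for e_i, c in zip(cand_embeddings, csims)]

def candidate_cross_entropy(scores, gold_index):
    """Multinomial cross entropy over the candidate set."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold_index]

m_F = [1.0, 0.0]
cands = [[0.5, 0.0], [2.0, 0.0], [-1.0, 0.0]]   # toy candidate embeddings
csims = [0.9, 0.4, 0.8]                         # toy string similarities
scores = link_scores(m_F, cands, csims, alpha=1.0, beta=1.0)
pred = max(range(len(scores)), key=scores.__getitem__)
loss = candidate_cross_entropy(scores, gold_index=1)
print(pred, loss)
```

In this toy case string similarity alone would pick candidate 0, while the combined score picks candidate 1, which is the kind of correction the learned embedding similarity is meant to provide.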

Encoding Hierarchies
Both entity typing and entity linking treat the label space as prediction into a flat set. To explicitly incorporate the structure between types/entities into our training, we add an additional loss. We consider two methods for modeling the hierarchy of the embedding space: real and complex bilinear maps, which are two of the state-of-the-art knowledge graph embedding models.

Hierarchical Structure Models
Bilinear: Our standard bilinear model scores a hypernym link between $(c_1, c_2)$ as:
$$s(c_1, c_2) = \mathbf{c}_1^\top A \mathbf{c}_2$$
where $A \in \mathbb{R}^{d \times d}$ is a learned real-valued non-diagonal matrix and $c_1$ is the child of $c_2$ in the hierarchy. This model is equivalent to RESCAL (Nickel et al., 2011) with a single IS-A relation type. The type embeddings are the same whether used on the left or right side of the relation. We merge this with the base model by using the parameter A as an additional map before type/entity scoring.
Complex Bilinear: We also experiment with a complex bilinear map based on the ComplEx model (Trouillon et al., 2016), which was shown to have strong performance predicting the hypernym relation in WordNet, suggesting suitability for asymmetric, transitive relations such as those in our type hierarchy. ComplEx uses complex-valued vectors for types, and diagonal complex matrices for relations, using Hermitian inner products (taking the complex conjugate of the second argument, equivalent to treating the right-hand-side type embedding to be the complex conjugate of the left-hand side), and finally taking the real part of the score (footnote 3). The score of a hypernym link between $(c_1, c_2)$ in the ComplEx model is defined as:
$$s(c_1, c_2) = \mathrm{Re}(\langle \mathbf{c}_1, \mathbf{r}_{IS\text{-}A}, \bar{\mathbf{c}}_2 \rangle) = \langle \mathrm{Re}(\mathbf{c}_1), \mathrm{Re}(\mathbf{r}_{IS\text{-}A}), \mathrm{Re}(\mathbf{c}_2) \rangle + \langle \mathrm{Re}(\mathbf{c}_1), \mathrm{Im}(\mathbf{r}_{IS\text{-}A}), \mathrm{Im}(\mathbf{c}_2) \rangle + \langle \mathrm{Im}(\mathbf{c}_1), \mathrm{Re}(\mathbf{r}_{IS\text{-}A}), \mathrm{Im}(\mathbf{c}_2) \rangle - \langle \mathrm{Im}(\mathbf{c}_1), \mathrm{Im}(\mathbf{r}_{IS\text{-}A}), \mathrm{Re}(\mathbf{c}_2) \rangle$$
where $\mathbf{c}_1$, $\mathbf{c}_2$ and $\mathbf{r}_{IS\text{-}A}$ are complex-valued vectors representing $c_1$, $c_2$ and the IS-A relation respectively. Re(z) represents the real component of z and Im(z) is the imaginary component. As noted in Trouillon et al. (2016), the above function is antisymmetric when $\mathbf{r}_{IS\text{-}A}$ is purely imaginary.
Since entity/type embeddings are complex vectors, in order to combine them with our base model, we also need to represent mentions with complex vectors for scoring. To do this, we pass the output of the mention encoder through two different affine transformations to generate a real and an imaginary component:
$$\mathrm{Re}(m_F) = W_{real}\, m_F + b_{real}, \qquad \mathrm{Im}(m_F) = W_{img}\, m_F + b_{img}$$
where $m_F$ is the output of the mention encoder, $W_{real}, W_{img} \in \mathbb{R}^{d \times d}$ and $b_{real}, b_{img} \in \mathbb{R}^d$.
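The two hierarchy scoring functions can be written down directly using Python's built-in complex numbers. The vectors below are arbitrary toy values and the function names are hypothetical; the point is to show the asymmetry of the real bilinear map and the antisymmetry of the ComplEx score when the relation vector is purely imaginary.

```python
def bilinear_score(c1, A, c2):
    """Real bilinear score c1^T A c2 (c1 is the child, c2 the parent)."""
    A_c2 = [sum(a * x for a, x in zip(row, c2)) for row in A]
    return sum(x * y for x, y in zip(c1, A_c2))

def complex_score(c1, r, c2):
    """ComplEx score: real part of sum_k c1[k] * r[k] * conj(c2[k])."""
    return sum(a * b * c.conjugate() for a, b, c in zip(c1, r, c2)).real

# A non-symmetric A gives different scores for the two directions.
A = [[0.0, 1.0], [0.0, 0.0]]
up = bilinear_score([1.0, 0.0], A, [0.0, 1.0])
down = bilinear_score([0.0, 1.0], A, [1.0, 0.0])

# With a purely imaginary relation vector the ComplEx score is antisymmetric.
child = [1 + 2j, 0.5 - 1j]
parent = [2 - 1j, 1 + 1j]
r_isa = [0 + 1j, 0 - 2j]
forward = complex_score(child, r_isa, parent)
backward = complex_score(parent, r_isa, child)
print(up, down, forward, backward)
```

Antisymmetry is a natural fit for IS-A links: evidence that a concept is a child of another counts directly against the reverse link.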

Training with Hierarchies
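Learning a hierarchy is analogous to learning embeddings for nodes of a knowledge graph with a single hypernym/IS-A relation. To train these embeddings, we sample $(c_1, c_2)$ pairs, where each pair is a positive link in our hierarchy. For each positive link, we sample a set N of n negative links. We encourage the model to output high scores for positive links, and low scores for negative links, via a binary cross entropy (BCE) loss:
$$L_{struct} = -\sum_i \left[ z_i \log \sigma(s(c_{1i}, c_{2i})) + (1 - z_i) \log(1 - \sigma(s(c_{1i}, c_{2i}))) \right]$$
where i ranges over the positive link and its sampled negatives, with $z_i = 1$ for the positive link and $z_i = 0$ for a negative link. The full training objective combines this with the typing or linking loss, weighted by γ:
$$L = L_{type/link} + \gamma L_{struct}$$
Footnote 3: This step makes the scoring function technically not bilinear, as it commutes with addition but not complex multiplication, but we term it bilinear for ease of exposition.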
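The structure loss with sampled negatives can be sketched as follows. The sampling strategy here (keeping the child fixed and drawing random non-parents) is an assumption, as are the toy node names, the stand-in scoring function, and the values of `gamma` and `l_type`.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_negatives(child, nodes, edges, n, rng):
    """Pair the child with n random nodes that are not one of its parents."""
    pool = [p for p in nodes if p != child and (child, p) not in edges]
    return [(child, rng.choice(pool)) for _ in range(n)]

def struct_loss(score, positive, negatives):
    """BCE with label 1 for the positive link and label 0 for each negative."""
    loss = -math.log(sigmoid(score(*positive)))
    for c1, c2 in negatives:
        loss -= math.log(1.0 - sigmoid(score(c1, c2)))
    return loss

nodes = ["protein", "binding protein", "gene", "disease"]
edges = {("binding protein", "protein")}               # (child, parent) links
score = lambda c1, c2: 3.0 if (c1, c2) in edges else -3.0   # stand-in scorer

rng = random.Random(0)
negatives = sample_negatives("binding protein", nodes, edges, n=4, rng=rng)
l_struct = struct_loss(score, ("binding protein", "protein"), negatives)

gamma, l_type = 0.5, 1.2       # hypothetical loss weight and typing loss value
total = l_type + gamma * l_struct
print(l_struct, total)
```

In practice the scorer would be the bilinear or ComplEx function over learned embeddings, and gradients of `total` would update both the mention encoder and the type or entity embeddings.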

Experiments
We perform three sets of experiments: mention-level entity typing on the benchmark dataset FIGER, entity-level typing using Wikipedia and TypeNet, and entity linking using MedMentions.

Models
CNN: Each mention is encoded using the model described in Section 3.2. The resulting embedding is used for classification into a flat set of labels. Specific implementation details can be found in Appendix A.2.
CNN+Complex: The CNN+Complex model is equivalent to the CNN model but uses complex embeddings and Hermitian dot products.
Transitive: This model does not add an additional hierarchical loss to the training objective (unless otherwise stated). We add additional labels to each entity corresponding to the transitive closure, or the union of all ancestors of its known types. This provides a rich additional learning signal that greatly improves classification of specific types.
Hierarchy: These models add an explicit hierarchical loss to the training objective, as described in Section 5, using either complex or real-valued bilinear mappings, and the associated parameter sharing.
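The label augmentation used by the Transitive models amounts to taking the union of all ancestors of the gold types in the hierarchy. A minimal sketch, assuming the hierarchy is stored as a child-to-parents mapping (the toy hierarchy and function names are hypothetical):

```python
def ancestors(type_name, parents):
    """All ancestors of a type in a DAG given as {child: [parent, ...]}."""
    seen = set()
    stack = list(parents.get(type_name, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def transitive_labels(gold_types, parents):
    """Gold types plus the union of all their ancestors."""
    closed = set(gold_types)
    for t in gold_types:
        closed |= ancestors(t, parents)
    return closed

parents = {
    "president": ["leader", "politician"],   # multiple parents: a DAG, not a tree
    "leader": ["person"],
    "politician": ["person"],
    "person": ["entity"],
}
labels = transitive_labels({"president"}, parents)
print(sorted(labels))
```

The `seen` set keeps the traversal linear even when a node is reachable through several parents, as happens in TypeNet and WordNet.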

Mention-Level Typing in FIGER
To evaluate the efficacy of our methods we first compare against the current state-of-the-art models of Shimaoka et al. (2017). The most widely used type system for fine-grained entity typing is FIGER, which consists of 113 types organized in a two-level hierarchy. For training, we use the publicly available W2M data (Ren et al., 2016) and optimize the mention typing loss function defined in Section 4.1 with the additional hierarchical loss where specified. For evaluation, we use the manually annotated FIGER (GOLD) data by Ling and Weld (2012). See Appendix A.2 and A.3 for specific implementation details.

Results
In previous state-of-the-art for models without handcrafted features. When incorporating structure into our models, we gain 2.5 points of accuracy in our CNN+Complex model, matching the overall state-of-the-art attentive LSTM that relied on handcrafted features from syntactic parses, topic models, and character n-grams. Structure can help our model predict lower-frequency types, a role similar to that played by hand-crafted features.

Entity-Level Typing in TypeNet
Next we evaluate our models on entity-level typing in TypeNet using Wikipedia. For each entity, we follow the procedure outlined in Section 4.2. We predict labels for each instance in the entity's bag and aggregate them into entity-level predictions using LogSumExp pooling. Each type is assigned a predicted score by the model. We then rank these scores and calculate average precision for each of the types in the test set, and use these scores to calculate mean average precision (MAP). We evaluate using MAP instead of accuracy, as is standard in large knowledge base link prediction tasks (Verga et al., 2017; Trouillon et al., 2016). These scores are calculated only over Freebase types, which tend to be lower in the hierarchy. This is to avoid artificial score inflation caused by trivial predictions such as 'entity.' See Appendix A.4 for more implementation details. Table 6 shows the results for entity-level typing on our Wikipedia TypeNet dataset. We see that both the basic CNN and the CNN+Complex models perform similarly, with the CNN+Complex model doing slightly better in the full data regime.
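The MAP computation can be sketched as below, under the assumption that for each type the test entities are ranked by predicted score and average precision is computed over that ranking. The scores and labels are invented, and the function names are hypothetical.

```python
def average_precision(scores, labels):
    """AP of one ranking: items sorted by score, labels are 0/1 relevance."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank      # precision at each relevant item
    return total / hits if hits else 0.0

def mean_average_precision(score_rows, label_rows):
    """Mean of per-type APs; row j holds the scores of all entities for type j."""
    aps = [average_precision(s, l) for s, l in zip(score_rows, label_rows)]
    return sum(aps) / len(aps)

score_rows = [[0.9, 0.8, 0.7, 0.1], [0.2, 0.9, 0.4, 0.1]]   # 2 types, 4 entities
label_rows = [[1, 0, 1, 0], [0, 1, 0, 0]]
ap_first = average_precision(score_rows[0], label_rows[0])
map_score = mean_average_precision(score_rows, label_rows)
print(ap_first, map_score)
```

Restricting the rows to Freebase types, as described above, would simply mean leaving the WordNet-only ancestor types out of `score_rows`.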

Results
We also see that both models get an improvement when adding an explicit hierarchy loss, even before adding in the transitive closure. The transitive closure itself gives an additional increase in performance to both models. In both of these cases, the basic CNN model improves by a greater amount than CNN+Complex. This could be a result of the complex embeddings being more difficult to optimize and therefore more susceptible to variations in hyperparameters. When adding in both the transitive closure and the explicit hierarchy loss, the performance improves further. We observe similar trends when training our models in a lower data regime with ~150,000 examples, or about 5% of the total data. In all cases, we note that the baseline models that do not incorporate any hierarchical information (neither the transitive closure nor the hierarchy loss) perform ~9 MAP worse, demonstrating the benefits of incorporating structure information.

MedMentions Entity Linking with UMLS
In addition to entity typing, we evaluate our model's performance on an entity linking task using MedMentions, our new PubMed / UMLS dataset described in Section 2.1. Table 7 shows results for baselines and our proposed variant with additional hierarchical loss. None of these models incorporate transitive closure information, due to difficulty incorporating it in our candidate generation, which we leave to future work. The Normalized metric considers performance only on mentions with an alias table hit; all models have 0 accuracy for mentions otherwise. We also report the overall score for comparison in future work with improved candidate generation. We see that incorporating structure information results in a 1.1% reduction in absolute error, corresponding to a ~6% reduction in relative error on this large-scale dataset. Table 8 shows qualitative predictions for models with and without hierarchy information incorporated. Each example contains the sentence (with target entity in bold), predictions for the baseline and hierarchy-aware models, and the ancestors of the predicted entity. In the first and second example, the baseline model becomes extremely dependent on TFIDF string similarities when the gold candidate is rare (≤ 10 occurrences). This shows that modeling the structure of the entity hierarchy helps the model disambiguate rare entities. In the third example, structure helps the model understand the hierarchical nature of the labels and prevents it from predicting an entity that is overly specific (e.g., predicting Interleukin-27 rather than the correct and more general entity IL2 Gene).
Table 8: Example predictions from MedMentions. Each example shows the sentence with entity mention span in bold. Baseline shows the predicted entity and its ancestors for a model not incorporating structure. +hierarchy shows the prediction and ancestors for a model which explicitly incorporates the hierarchical structure information.

Results
Note that, in contrast with the previous tasks, the complex hierarchical loss provides a significant boost, while the real-valued bilinear model does not. A possible explanation is that UMLS is a far larger/deeper ontology than even TypeNet, and the additional ability of complex embeddings to model intricate graph structure is key to realizing gains from hierarchical modeling.

Related Work
By directly linking a large set of mentions and typing a large set of entities with respect to a new ontology and corpus, and our incorporation of structural learning between the many entities and types in our ontologies of interest, our work draws on many different but complementary threads of research in information extraction, knowledge base population, and completion.
Our structural, hierarchy-aware loss between types and entities draws on research in Knowledge Base Inference such as Jain et al. Linking mentions to a flat set of entities, often in Freebase or Wikipedia, is a long-standing task in NLP (Bunescu and Pasca, 2006; Cucerzan, 2007; Durrett and Klein, 2014; Francis-Landau et al., 2016). Typing of mentions at varying levels of granularity, from CoNLL-style named entity recognition (Tjong Kim Sang and De Meulder, 2003), to the more fine-grained recent approaches (Ling and Weld, 2012; Gillick et al., 2014; Shimaoka et al., 2017), is also related to our task. A few prior attempts to incorporate a very shallow hierarchy into fine-grained entity typing have not led to significant or consistent improvements (Gillick et al., 2014; Shimaoka et al., 2017).
The knowledge base Yago (Suchanek et al., 2007) includes integration with WordNet, and type hierarchies have been derived from its type system (Yosef et al., 2012). Del Corro et al. (2015) use manually crafted rules and patterns (Hearst patterns (Hearst, 1992), appositives, etc.) to automatically match entity types to WordNet synsets.
Recent work has moved towards unifying these two highly related tasks by improving entity linking by simultaneously learning a fine-grained entity type predictor (Gupta et al., 2017). Learning hierarchical structures or transitive relations between concepts has been the subject of much recent work (Vilnis and McCallum, 2015; Vendrov et al., 2016; Nickel and Kiela, 2017). We draw inspiration from all of this prior work, and contribute datasets and models to address previous challenges in jointly modeling the structure of large-scale hierarchical ontologies and mapping textual mentions into an extremely fine-grained space of entities and types.

Conclusion
We demonstrate that explicitly incorporating and modeling hierarchical information leads to increased performance in experiments on entity typing and linking across three challenging datasets. Additionally, we introduce two new human-annotated datasets: MedMentions, a corpus of 246k mentions from PubMed abstracts linked to the UMLS knowledge base, and TypeNet, a new hierarchical fine-grained entity typeset an order of magnitude larger and deeper than previous datasets.
While this work already demonstrates considerable improvement over non-hierarchical modeling, future work will explore techniques such as Box embeddings (Vilnis et al., 2018) and Poincaré embeddings (Nickel and Kiela, 2017) to represent the hierarchical embedding space, as well as methods to improve recall in the candidate generation process for entity linking. Most of all, we are excited to see new techniques from the NLP community using the resources we have presented.
Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on wikipedia data. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL).

A.2 Model Implementation Details
For all of our experiments, we use pretrained 300 dimensional word vectors from Pennington et al. (2014). These embeddings are fixed during training. The type vectors and entity vectors are all 300 dimensional vectors initialized using Glorot initialization (Glorot and Bengio, 2010). The number of negative links for hierarchical training n ∈ {16, 32, 64, 128, 256}. For regularization, we use dropout (Srivastava et al., 2014) with p ∈ {0.5, 0.75, 0.8} on the sentence encoder output and L2 regularize all learned parameters with λ ∈ {1e-5, 5e-5, 1e-4}. All our parameters are optimized using Adam (Kingma and Ba, 2014) with a learning rate of 0.001. We tune our hyper-parameters via grid search and early stopping on the development set.

A.3 FIGER Implementation Details
To train our models, we use the mention typing loss function defined in Section 4.1. For models with structure training, we additionally add in the hierarchical loss, along with a weight that is obtained by tuning on the dev set. We follow the same inference-time procedure as Shimaoka et al. (2017). For each mention, we first assign the type with the largest probability according to the logits, and then assign additional types based on the condition that their corresponding probability be greater than 0.5.
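The inference rule described above (always keep the top-scoring type, then add every type whose probability exceeds 0.5) can be sketched as follows. The type names and logits are illustrative and `predict_types` is a hypothetical name.

```python
import math

def predict_types(logits, type_names, threshold=0.5):
    """Top-1 type plus every type with sigmoid probability above the threshold."""
    probs = [1.0 / (1.0 + math.exp(-y)) for y in logits]
    best = max(range(len(probs)), key=probs.__getitem__)
    chosen = {best} | {j for j, p in enumerate(probs) if p > threshold}
    return [type_names[j] for j in sorted(chosen)]

names = ["/person", "/person/politician", "/location"]
confident = predict_types([2.0, 0.5, -3.0], names)
fallback = predict_types([-1.0, -2.0, -3.0], names)
print(confident, fallback)
```

The second call shows why the top-1 rule matters: when no probability clears the threshold, the mention still receives its single most likely type.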

A.4 Wikipedia Data and Implementation Details
At train time, each training example randomly samples an entity bag of 10 mentions. At test time we classify bags of 20 mentions of an entity. The dataset contains a total of 344,246 entities mapped to the 1081 Freebase types from TypeNet. We consider all sentences in Wikipedia between 10 and 50 tokens long. Tokenization and sentence splitting were performed using NLTK (Loper and Bird, 2002). From these sentences, we considered all entities annotated with a cross-link in Wikipedia that we could link to Freebase and assign types in TypeNet. We then split the data by entities into a 90-5-5 train, dev, test split.

A.5 UMLS Implementation details

A.5.1 Candidate Generation Details
Each mention and each canonical entity string in UMLS are mapped to TFIDF character ngram vectors. We pre-process each string by lowercasing and removing stop words. We consider ngrams from size 1 to 5 and keep the top 100,000 features and the final vectors are L2 normalized. For each mention, we calculate the cosine similarity, csim, between the mention string and each canonical entity string. In our experiments we consider the top 100 most similar entities as the candidate set.
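Candidate generation with TFIDF character n-grams can be sketched using only the standard library. Several details are assumptions made for illustration: the IDF formula (`log(N / df) + 1`), the tiny stop-word list, and the four example entity names. The 100,000-feature cutoff is omitted because the toy vocabulary is far smaller.

```python
import math
from collections import Counter

STOP_WORDS = {"of", "the", "and", "in"}      # tiny illustrative list

def char_ngrams(text, n_min=1, n_max=5):
    """Counts of character n-grams after lowercasing and stop-word removal."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    s = " ".join(words)
    return Counter(s[i:i + n] for n in range(n_min, n_max + 1)
                   for i in range(len(s) - n + 1))

def fit_idf(strings):
    df = Counter()
    for s in strings:
        df.update(char_ngrams(s).keys())
    return {g: math.log(len(strings) / c) + 1.0 for g, c in df.items()}

def tfidf_vector(text, idf):
    """Sparse TFIDF vector, L2 normalized."""
    vec = {g: c * idf[g] for g, c in char_ngrams(text).items() if g in idf}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}

def top_candidates(mention, names, k=100):
    """The k entity names with the highest cosine similarity to the mention."""
    idf = fit_idf(names)
    m = tfidf_vector(mention, idf)
    scored = []
    for name in names:
        e = tfidf_vector(name, idf)
        csim = sum(v * e.get(g, 0.0) for g, v in m.items())
        scored.append((csim, name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

names = ["Interleukin-2", "Interleukin-27", "Calcium Binding Protein", "Protein"]
top = top_candidates("calcium binding protein", names, k=2)
print(top)
```

At the scale of UMLS, a sparse matrix library and an approximate nearest-neighbor index would replace the explicit loop over entity names.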