Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs

This paper addresses the problem of mapping natural language text to knowledge base entities. The mapping process is approached as a composition of a phrase or a sentence into a point in a multi-dimensional entity space obtained from a knowledge graph. The compositional model is an LSTM equipped with a dynamic disambiguation mechanism on the input word embeddings (a Multi-Sense LSTM), addressing polysemy issues. Further, the knowledge base space is prepared by collecting random walks from a graph enhanced with textual features, which act as a set of semantic bridges between text and knowledge base entities. The ideas of this work are demonstrated on large-scale text-to-entity mapping and entity classification tasks, with state of the art results.


Introduction
The task of associating a well-defined action, concept or piece of knowledge to a natural language utterance or text is a common problem in natural language processing and generic artificial intelligence (Tellex et al., 2011), and can emerge in many different forms. In NLP, the ability to code text into an entity of a knowledge graph finds applications in tasks such as question answering and information retrieval, or any task that involves some form of mapping a definition to a term (Hill et al., 2016; Rimell et al., 2016). Further, it can be invaluable in providing solutions to domain-specific challenges, for example medical concept normalisation (Limsopatham and Collier, 2016) and identification of adverse drug reactions (O'Connor et al., 2014).
This paper details a model for efficiently mapping unrestricted text at the level of phrases and sentences to the entities of a knowledge base (KB), a task also referred to as text grounding or normalisation. The model aims at characterising short focused texts, such as definitions or tweets. Given a medical KB, for example, a tweet of the form "Can't sleep, too tired to think straight" would be mapped to the entity Insomnia, while in the context of a lexical ontology the definition "Device that detects planets" would be associated with the entity Telescope.
Note that such a task cannot be approached as standard classification, since the "classes" (entities) are usually in one-to-one correspondence with the available inputs. To address this we propose the use of a continuous vector space for embedding the entities of the KB graph, where text is projected by a neural network. We rely on the notion of distributional semantics, where a word is represented as a multi-dimensional vector obtained either by collecting co-occurrence statistics with a selected set of contexts or by directly optimising an objective function in a neural network-based architecture (Collobert and Weston, 2008). Interestingly, similar techniques can be used for the multi-dimensional representation of nodes in a KB graph; for example, by collecting random walks following the edges of a graph it is possible for one to construct an artificial "corpus", to which a distributional model applies in the usual way (Perozzi et al., 2014).
By exploiting this representational compatibility, we treat the process of text-to-entity mapping as a transformation from a textual vector space where words live, to a KB vector space created from a graph and populated by vectors representing entities. A sentence is coded as a sequence of word vectors, composed by a modified Long Short-Term Memory network (LSTM; Hochreiter and Schmidhuber, 1997) into a multi-dimensional point in the entity space. One of our aims is to specifically deal with lexical ambiguity and polysemy, which can be an important factor for the task at hand. To this end, each word is associated with a number of sense embeddings, and the LSTM is extended with an attentional disambiguation mechanism that dynamically selects and updates the right sense vector for each word given its context during training. We dub this formulation Multi-Sense LSTM (MS-LSTM).
An important issue is the provision of a set of reliable anchors; that is, points in one-to-one correspondence between the two representations that would enforce some degree of structural similarity between pieces of text and KB entities and thus make the mapping more efficient. We deal with this problem by extending the original KB graph with nodes corresponding to textual features, i.e. to words strongly associated to a specific entity and collected from various resources. A novel sampling strategy is detailed for incorporating these nodes to random walks, which are then fed to the skipgram model for producing an entity space. The results indicate that the textual nodes, being words and KB entities at the same time, do an extremely effective job in transforming the geometry of the entity space to the benefit of mapping the textual modality.
The proposed model is evaluated in three tasks: text-to-entity mapping on a dataset extracted from SNOMED CT¹, a medical knowledge base of 327,000 concepts; a reverse dictionary task based on WordNet (Miller, 1998), where the goal is to associate a multi-word definition to the correct lemma (Hill et al., 2016); and document classification on the Cora dataset (McCallum et al., 2000). The results demonstrate the effectiveness of our methods by improving the current state of the art.

Background
Aligning meaning between text and entities in a knowledge graph is a task traditionally based on heuristic methods exploiting text features such as string matching, word weighting, syntactic relations, or dictionary lookups (McCallum et al., 2005; Lu et al., 2011; O'Connor et al., 2014). Machine learning techniques have also been exploited in various forms; for example, Leaman et al. (2013) use a pairwise learning-to-rank technique to learn the similarity between different terms, while Limsopatham and Collier (2015) apply statistical machine translation to "translate" social media text to domain-specific terminology. There is little work based on neural networks; the most relevant to us is a study by Hill et al. (2016), who tested a number of compositional neural architectures trained to approximate word embeddings on a reverse dictionary task. Compared to their work, this paper proposes the use of a distinct target space for representing ontological knowledge, where every entity in the graph lives.
¹ https://www.snomed.org/snomed-ct
The goal of a graph embedding method is to embed components of a knowledge graph into a low-dimensional space. One research direction focuses on the relations, i.e. the edges of the graph (Bordes et al., 2013; Socher et al., 2013; Xiao et al., 2016) and aims at tasks such as link prediction and KB completion. Such work is outside the scope of the current paper, the subject of which is the efficient low-dimensional representation of entities (nodes). In this line of research, a prevalent method involves the collection of a set of random walks, starting from each node in the graph (Perozzi et al., 2014; Grover and Leskovec, 2016). There is a direct analogy between such a set of random walks and a text corpus: each node corresponds to a word and the sequence of nodes visited during a random walk is analogous to a sentence. Thus, any distributional model that takes as input this artificial "corpus" can generate multi-dimensional representations of the nodes in the graph. Random walks have also been used for KB inference (Lao et al., 2011) with success.
While random walk-based methods are not the only way to construct graph spaces (alternatives include factorisation (Ahmed et al., 2013) and deep autoencoders), they have been found very effective in capturing multiple aspects of the graph structure (Wang et al., 2017; Goyal and Ferrara, 2017). The current paper proposes a random walk generation strategy that improves and complements existing approaches.
The idea of using textual features to improve the entity vectors is not well explored, and most of the existing work focuses again on the representation of relations (Xie et al., 2016; Wang et al., 2014; Wang and Li, 2016) as opposed to entities. Closer to us is the work of Yamada et al. (2017) and Yang et al. (2015), the latter of whom incorporate text features in the concept embeddings by exploiting matrix factorisation properties.
Representing the meaning of words using a number of sense vectors is an old and well-established idea in NLP; see for example (Schütze, 1998; Reisinger and Mooney, 2010; Neelakantan et al., 2014). However, most of the relevant research is evaluated on intrinsic tasks such as word similarity, while in the few works based on real end tasks, disambiguation is usually treated as a prior stand-alone step (Li and Jurafsky, 2015; Pilehvar et al., 2017). The crucial difference of this work is that the ambiguity resolution mechanism is part of the compositional model itself, and the sense embeddings are trained simultaneously with the rest of the parameters. A close work is by Cheng and Kartsaklis (2015), who used a siamese network with an integrated disambiguation mechanism for paraphrase detection. For more information on multi-sense embeddings see (Camacho-Collados and Pilehvar, 2018).

Figure 1: The text-to-entity mapping system in a nutshell. The red nodes indicate textual features, while "MSE" stands for mean squared error.

Methodology
Fig. 1 provides a high-level illustration of our methodology, consisting of two stages: (1) the KB graph is extended with weighted textual features, and an artificial "corpus" of random walks is created and used as input to the skipgram model (Mikolov et al., 2013) for generating an enhanced KB space (this part is covered in §3.1); (2) the transformation from text to entities is performed by a supervised multi-sense compositional model, which generates a point in the KB space for every input text. This is achieved with an LSTM recurrent network, equipped with an attentional mechanism that provides a finer level of granularity to the different ways a word is used in the data; we detail this part in §3.2.

Textual features for entity vectors
For our KB space, we follow the generic recipe proposed by Perozzi et al. (2014) and we assemble an artificial corpus of random walks from the KB graph, which is then used as input to the skipgram model (Mikolov et al., 2013). For a random walk of nodes $n_1 n_2 \ldots n_T$ and a context window size $c$, skipgram maximises the following quantity:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(n_{t+j} \mid n_t) \qquad (1)$$

i.e. for a target node $n_t$, the objective is to predict all other nodes in the same context. As a consequence, two vectors of the resulting space will be close if their corresponding nodes occur in topological proximity within the graph. However, while such a topology allows perhaps for meaningful comparisons between points in this space, it is not directly compatible with the task of mapping text to entities. The reason is that the communities formed in a KB graph (and thus the topology of the resulting vector space) mostly reflect domain-specific hierarchies and ontological relationships that are not necessarily evident in the textual representations referring to the entities. An important question therefore with regard to the proposed methodology is how to provide meaningful links between the two representations that would allow for the efficient translation of one form (text) to another (entities).
In this work, we address this problem by associating every node in the graph with a set of textual features, each one of which is weighted according to its importance with respect to the node. Our methodology is as follows: For each entity, we collect all available textual descriptions found in the knowledge base itself and the English portion of BabelNet (Navigli and Ponzetto, 2012), which is a very large dictionary integrating numerous resources, such as WordNet, Wikipedia, FrameNet and many others. The textual descriptions are treated as short documents, and each word in them is assigned a specific TF-IDF value, forming the set of textual features for the specific entity.
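As a rough illustration of this feature extraction step, the sketch below computes TF-IDF weights for a handful of invented descriptions. The paper does not state which TF-IDF variant is used, so the weighting here (term count over document length, times log(N / document frequency)) is an assumption, as are the function name `tfidf_features` and the toy data.

```python
import math
from collections import Counter

def tfidf_features(descriptions):
    """Map each entity to {word: tf-idf weight}.

    `descriptions` maps an entity id to one string holding all of its
    textual descriptions, treated as a single short document.
    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    docs = {entity: text.lower().split() for entity, text in descriptions.items()}
    n_docs = len(docs)
    # Document frequency: number of entities whose description contains the word.
    df = Counter(word for words in docs.values() for word in set(words))
    features = {}
    for entity, words in docs.items():
        counts = Counter(words)
        features[entity] = {
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return features

# Toy, made-up descriptions (not taken from SNOMED CT or BabelNet).
toy_descriptions = {
    "insomnia": "inability to sleep",
    "alexia": "inability to read",
    "telescope": "device to observe distant objects",
}
features = tfidf_features(toy_descriptions)
```

With this weighting, a word that occurs in every description (here "to") receives weight zero, while words specific to one entity (here "sleep") receive the highest weights; a real pipeline would probably also filter stop words and low-weight terms.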
The KB graph is extended in the following way: Let $T_c$ be the set of textual features for an entity $c$; then, for each $t \in T_c$, we add an edge $(c, t)$ with weight $\text{tf-idf}_c(t)$, where $\text{tf-idf}_c(t)$ is the TF-IDF value of $t$ with respect to $c$. In contrast to Perozzi et al. (2014), who utilise a uniform node sampling strategy, we define the random walk generation process as follows: Given a randomly selected node $n$, let $C_n = \{c_1, c_2, \ldots, c_N\}$ be the set of all entity nodes in its immediate vicinity, and $T_n = \{t_1, t_2, \ldots, t_M\}$ the set of all textual features of $n$; the next node $x$ in the path is drawn from a categorical distribution defined as below:

$$p(X = x) = \begin{cases} (1 - \lambda)\, \dfrac{1}{|C_n|} & \text{if } x \in C_n \\[1ex] \lambda\, \dfrac{\text{tf-idf}_n(x)}{\sum_{t \in T_n} \text{tf-idf}_n(t)} & \text{if } x \in T_n \end{cases} \qquad (2)$$

for $X$ a discrete random variable with range $C_n \cup T_n$. In the above, λ defines the proportion of the probability mass allocated to textual features, when both $C_n$ and $T_n$ are non-empty; if one of the sets is empty, all of the probability mass is allocated to the other set, and λ becomes irrelevant. Further, in contrast to what is the case for the textual nodes, the probabilities of the entity nodes in Equation 2 are defined uniformly, since we lack any mechanism for fine-tuning them in a way that objectively reflects the importance of the nodes.

Figure 2: Effect of λ parameter (panels: λ = 0, λ = 0.5, λ = 1). Blue nodes indicate entities, red nodes textual features, and red paths refer to random walks. As λ increases, the probability of "hops" between originally unlinked nodes increases accordingly.

It is instructive to examine how the above sampling strategy works. As expected, setting λ = 0 will result in a sampling process that ignores the textual features and produces a path comprised solely of entity nodes; this is equivalent to the original model by Perozzi et al. (2014), known as DeepWalk. On the other hand, the effect of setting λ = 1 is less intuitive: Recall that, by construction, each textual node is connected only to entity nodes; that is, when the current node is textual, the next node will always be sampled from $C_n$. Therefore, setting λ = 1 creates paths following an alternating pattern, where each entity node is followed by a textual node, which in turn is followed by an entity node. Values of λ between 0 and 1 scale this behaviour accordingly (Fig. 2).
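The sampling behaviour described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two-dictionary graph representation, the "w:" prefix for textual nodes, and the toy graph with its weights are all hypothetical.

```python
import random

def next_node(node, entity_nbrs, text_feats, lam, rng=random):
    """Draw the next node of a walk.

    entity_nbrs: dict node -> list of neighbouring entity nodes (C_n)
    text_feats:  dict node -> {textual node: tf-idf weight} (T_n)
    lam:         probability mass given to textual features
    """
    c_n = entity_nbrs.get(node, [])
    t_n = text_feats.get(node, {})
    if not c_n and not t_n:
        return None  # isolated node: the walk cannot continue
    # If one of the two sets is empty, all the mass goes to the other one.
    use_text = bool(t_n) and (not c_n or rng.random() < lam)
    if use_text:
        # Textual features are drawn in proportion to their (positive) weights.
        words = list(t_n)
        weights = [t_n[w] for w in words]
        return rng.choices(words, weights=weights, k=1)[0]
    # Entity neighbours are drawn uniformly.
    return rng.choice(c_n)

def random_walk(start, length, entity_nbrs, text_feats, lam, rng=random):
    """Generate one walk of at most `length` nodes starting from `start`."""
    walk = [start]
    while len(walk) < length:
        nxt = next_node(walk[-1], entity_nbrs, text_feats, lam, rng)
        if nxt is None:
            break
        walk.append(nxt)
    return walk

# Toy graph with invented nodes and weights; "w:" marks textual nodes,
# which by construction are linked only to entity nodes.
entity_nbrs = {
    "insomnia": ["sleep_disorder"],
    "sleep_disorder": ["insomnia"],
    "alexia": ["reading_disorder"],
    "reading_disorder": ["alexia"],
    "w:inability": ["insomnia", "alexia"],
    "w:sleep": ["insomnia"],
    "w:read": ["alexia"],
}
text_feats = {
    "insomnia": {"w:inability": 0.2, "w:sleep": 0.5},
    "alexia": {"w:inability": 0.2, "w:read": 0.5},
}

rng = random.Random(0)
walk_entities_only = random_walk("insomnia", 6, entity_nbrs, text_feats, 0.0, rng)
walk_alternating = random_walk("insomnia", 6, entity_nbrs, text_feats, 1.0, rng)
```

With λ = 0 the walk never leaves the original entity graph, while with λ = 1 entity and textual nodes alternate, so a shared feature such as "w:inability" can carry the walk from one entity to another that is otherwise unconnected.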
Advantages. The introduction of textual features in the graph achieves two goals. Firstly, the textual nodes serve as links between entities which, although perhaps related to each other in some way, lie in different parts of the KB graph (e.g. being parts of different hierarchies). As a result, points that would normally be unjustifiably far apart from each other in the vector space are now brought closer, providing additional coherence. This behaviour is controlled by the λ parameter, as Fig. 2 shows. Fig. 3 presents an illustrative example, taken from a real random walk on SNOMED CT.

Figure 3: Linking of distant concepts with textual features (red boxes) for λ = 1. The textual feature understanding correctly links a related medical finding (lying on a different branch) to the condition known as alexia. Further, due to the presence of inability in their contexts, the vectors of alexia and insomnia (concepts originally quite far apart in the graph) will now have a common part reflecting that they are both conditions related to forms of incompetence.
The second advantage of introducing textual features in the graph is a consequence of the dual nature of these features in the context of learning: they essentially represent words, but since they are also nodes of the graph, they get vector representations exactly as every other normal entity in the knowledge base. The textual features, therefore, paired with their assigned vectors, form a set of anchors that links pieces of text with the KB space, and can be used to support the training process of the mapping system. In §4.1 we will see that this approach leads to substantial improvements in the accuracy of the model.

A multi-sense LSTM
We now proceed to present our neural architecture for text-to-entity mapping. The goal of the model is, given a certain piece of text, to produce a point in the KB space corresponding to an appropriate entity or concept. The model is trained on pairs of texts and entity vectors created from a graph extended with textual features, as discussed in §3.1.
Our architecture needs to explicitly take into account the fact that the task at hand is very sensitive to lexical ambiguity. Specifically, while it is true that the level of homonymy (words having more than one disjoint meaning) is substantially decreased when we move from the generic domain to more specialised domains, on the other hand the increase in polysemy (words with many slightly different meanings) is exponential. As an example, while the lemma for the word "fever" in a dictionary usually contains two or three definitions, the term occurs in many dozens of different forms and contexts in SNOMED. Note that most of the different uses of the term correspond to distinct KB nodes, a fact that makes the job of a text-to-entity mapping system especially hard.² This motivates the employment of a dedicated mechanism that would handle the extra complexity imposed by the polysemous words. The compositional setting of this paper, equipped with such a mechanism, is shown in Fig. 4. It consists of a generic word embedding layer, a word sense disambiguation layer, and two consecutive LSTM networks responsible for encoding the embeddings into a vector in the KB space. The objective is to minimise the mean squared error between the predicted vectors and the target vectors (prepared as in §3.1):

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \lVert f(x_i) - y_i \rVert^2 \qquad (3)$$

where $N$ is the number of training examples, $x$ the input text, $y$ the target entity vector, and $f$ the neural network.
² See also §4.4 for some concrete examples.

To address the polysemy issues discussed above, every word is associated with a single generic embedding and $k$ sense embeddings, where $k$ is a fixed number. These sense embeddings can be seen as centroids of clusters denoting different uses of the word in the training set, and are dynamically updated during training. Specifically, for each word $w_i$ in a training example, a context vector $c_i$ is computed as the average of the generic vectors of all other words in the sentence. The probability of each sense vector $s_{ij}$ given this context is then calculated via an attentional mechanism equipped with a softmax layer, as follows:

$$p(s_{ij} \mid c_i) = \frac{\exp(W' \tilde{s}_{ij})}{\sum_{l=1}^{k} \exp(W' \tilde{s}_{il})} \qquad (4)$$

where $\tilde{s}_{ij} = \tanh(W s_{ij} + U c_i)$, and $W'$, $U$ and $W$ are the parameters of the attentional network. Each sense vector is subsequently updated by addition of the context vector weighted by its similarity with the specific sense:

$$s_{ij} \leftarrow s_{ij} + p(s_{ij} \mid c_i)\, c_i \qquad (5)$$

The output of the attention is a weighted sum of the sense vectors given their probabilities (i.e. we apply soft attention), which is used as input to the compositional network, a 2-layer LSTM. The overall model is optimised on the MSE of the LSTM's output vector and the target entity vector. At inference time, a predicted vector $\hat{y}$ can be classified to the entity with the closest vectorial representation according to some metric.
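The sense selection step described above can be illustrated with a small, dependency-free sketch. It is a toy under stated assumptions rather than the authors' code: the parameter names (`W`, `U`, `w_out`), the helper functions and the two-dimensional vectors are hypothetical, the parameters are fixed by hand instead of being learned, and the update of the sense vectors during training is omitted.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(generic, i):
    """Average of the generic vectors of all words except word i."""
    others = [v for j, v in enumerate(generic) if j != i]
    if not others:
        return [0.0] * len(generic[i])  # one-word input: empty context
    return [sum(col) / len(others) for col in zip(*others)]

def select_sense(senses, context, W, U, w_out):
    """Soft attention over the k sense vectors of one word.

    Each sense is scored by a small MLP, w_out . tanh(W s + U c), and the
    scores are normalised with a softmax. Returns the sense probabilities
    and their weighted sum, which would be fed to the LSTM.
    """
    uc = matvec(U, context)
    scores = []
    for s in senses:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W, s), uc)]
        scores.append(sum(a * b for a, b in zip(w_out, hidden)))
    probs = softmax(scores)
    dim = len(senses[0])
    output = [sum(p * s[d] for p, s in zip(probs, senses)) for d in range(dim)]
    return probs, output

# Toy example: a 3-word sentence, word 0 has two senses pointing in
# opposite directions; the context agrees with the first one.
identity = [[1.0, 0.0], [0.0, 1.0]]
generic = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
senses_word0 = [[1.0, 0.0], [-1.0, 0.0]]
ctx = context_vector(generic, 0)
probs, mixed = select_sense(senses_word0, ctx, identity, identity, [1.0, 1.0])
```

In this toy setting the sense aligned with the context receives the larger probability, so the mixed vector leans towards it; in the full model the same computation would be run for every word of the input before composition.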

Experiments
The ideas presented in the previous sections are evaluated on three tasks, two of which are related to text-to-entity mapping, and one to classification of KB entities. The purpose of the classification task (§4.3) is to provide a direct comparison of the textually enhanced vectors against vectors produced by the original graph, but independently of the compositional part. On the other hand, the text mapping experiments (§§4.1, 4.2) evaluate the overall architecture of Fig. 1 (including the compositional model and the dynamic disambiguation mechanism) on appropriate end tasks. Comparisons are provided with the most relevant previous work. Specifically, in all tasks, no inclusion of textual features corresponds to the standard DeepWalk model of Perozzi et al. (2014); in §4.2 our compositional architecture is compared to the work of Hill et al. (2016) in their reverse dictionary task; and §4.3 compares our method for textually enhancing the entity space with that of Yang et al. (2015), and other state-of-the-art deep models. The last subsection, §4.4, examines a few selected cases from a qualitative perspective.

Text-to-entity mapping
We begin with a large scale text-to-entity mapping experiment. We construct a dataset of 21,000 medical concepts extracted from SNOMED CT, each of which is associated with a multi-word textual description, taken from the knowledge base or BabelNet. The criterion for including a concept in the dataset was the availability of at least one textual description with 4 or more words. The objective of the task is to associate each one of these descriptions to the correct concept. Given a predicted vector $v$, we assemble a list of all candidate concept vectors ranked by their cosine similarity with $v$. We compute strict accuracy (based on how many times the vector of the correct concept is at the top of the list) and accuracy on the first 20 elements of the list. Further, we also present results based on the mean reciprocal rank (MRR).
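For concreteness, the three measures just described can be sketched as follows. The function names and the toy vectors are assumptions made for illustration; a practical implementation over tens of thousands of concepts would use vectorised operations rather than plain Python lists.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_of_gold(predicted, gold, concept_vectors):
    """1-based rank of the gold concept when all concepts are sorted
    by decreasing cosine similarity with the predicted vector."""
    ranked = sorted(concept_vectors,
                    key=lambda c: cosine(predicted, concept_vectors[c]),
                    reverse=True)
    return ranked.index(gold) + 1

def evaluate(predictions, concept_vectors, k=20):
    """predictions: list of (predicted vector, gold concept id) pairs."""
    ranks = [rank_of_gold(vec, gold, concept_vectors) for vec, gold in predictions]
    n = len(ranks)
    return {
        "accuracy": sum(r == 1 for r in ranks) / n,        # strict accuracy
        f"accuracy@{k}": sum(r <= k for r in ranks) / n,   # gold within top k
        "mrr": sum(1.0 / r for r in ranks) / n,            # mean reciprocal rank
    }

# Toy concept space and predictions (invented values).
concepts = {"insomnia": [1.0, 0.0], "alexia": [0.0, 1.0], "fever": [-1.0, 0.0]}
predictions = [
    ([0.9, 0.1], "insomnia"),  # gold concept ranked first
    ([0.6, 0.8], "insomnia"),  # gold concept ranked second, after "alexia"
]
scores = evaluate(predictions, concepts, k=2)
```

Here the first prediction ranks its gold concept first and the second ranks it second, giving a strict accuracy of 0.5, a top-2 accuracy of 1.0 and an MRR of 0.75.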
In all experiments, we create KB vectors of 150 dimensions by applying the skipgram objective on a set of random walks of length 20, and with window size of 5. The graph is extended with 102,500 textual nodes weighted by their TF-IDF values with regard to the corresponding entities and selected as described in §3.1 (textual features that occur in the testing set are not taken into account). Each node in the graph serves as the starting point of 10 random walks. For the compositional model, we use embeddings of 150 dimensions, and 200-dimensional hidden states. The attentional mechanism is implemented as a 2-layer MLP, with 50 units allocated to the hidden layer for each sense. The overall model contains two dropout layers for regularisation purposes, and is optimised with Adam (Kingma and Ba, 2015) ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$).³ Following usual practice, we split our dataset into three parts: a training set (14,754 instances), a testing set (4,187 instances), and a development set (2,000 instances). We use the dev set to optimise the two main hyper-parameters of our model, namely the probability mass given to textual features (λ) and the number of senses for each word (k). The experiments on the dev set showed that increasing the probability mass for the inclusion of textual features in the random walks leads to consistently better performance for all tested models, so for the main experiment we set λ to its highest possible value, 1.00.⁴ Further, a number of senses equal to 3 achieved the highest performance.
We compare our MS-LSTM with a number of baselines: In Baselines 1 and 2 a vector for each textual description is computed as the average of pre-computed word vectors, and compared to concept vectors prepared in a similar way, i.e. by averaging pre-computed vectors for all words in the qualified name of the entities. We used two different word spaces, a standard Word2Vec space created from Google News⁵ and a custom Word2Vec model trained on a corpus of 4B tokens from medical articles indexed in PubMed⁶. In Least squares and CCA, an averaged vector for each textual description is again computed as before, and a linear mapping is learned between the textual space and the KB space, using least squares and canonical correlation analysis (Hardoon et al., 2004).
In Standard LSTM, we use a configuration similar to that of Fig. 4, but without the multi-sense aspect; here, the word embeddings are just parameters of the model randomly initialised before training. Further, we also test a standard LSTM where the length of the single embeddings is k times bigger (k is the number of senses in the MS-LSTM), so that the overall dimensionality of embeddings in LSTM and MS-LSTM is the same.
The results are presented in Table 1. Each model is tested against two target KB spaces, one consisting of simple DeepWalk vectors⁷ and one of textually enhanced vectors (TF vectors, λ = 1) according to the procedure of §3.1. There are three observations: (1) Using the enhanced vectors as a target space improves the performance of all tested models by a large margin; (2) the MS-LSTM configuration of Fig. 4 achieves the highest overall performance, showing that explicitly handling polysemy during the composition is beneficial for the task at hand; and (3) despite the equal dimensionality between the two models, the standard LSTM with the long embeddings presents performance inferior to that of the MS-LSTM.
The last row of the table presents results after extending the training dataset with the textual anchors, that is, all the textual features paired with their learned KB vectors, as described in the Advantages section in §3.1. Specifically, recall that each textual feature (a word or a two-word compound), being also a node in the graph, is associated with a vector according to the process of §3.1. It is possible for one then to use these (textual feature, vector) pairs as additional examples during the training of the MS-LSTM. The last row of Table 1 shows the results after extending the training set with the 102,500 textual features. This setting achieves the highest performance, increasing further the strict accuracy by 6%, to 0.90.

Reverse dictionary
We proceed to the reverse dictionary task of Hill et al. (2016), the goal of which is to return a candidate term given a definition. Many forms of this task have been proposed in the past, see for example (Kartsaklis et al., 2012;Turney, 2014;Rimell et al., 2016). In (Hill et al., 2016), the authors test a number of supervised models under two evaluation modes: (1) "seen", in which the testing instances are also included in the training set; and (2) "unseen", where the evaluation is done on a held-out set. In both cases the datasets consisted of 500 term-definition pairs from WordNet.
We treat WordNet as a graph, the edges of which are defined by the various relationships between the synsets. This graph is further extended with 96,734 textual nodes extracted from the synset descriptions. We compute synset vectors of 150 dimensions, on random walks of length 20 and with window size of 5. For the seen evaluation, we train the compositional model on the totality of WordNet 3.0 synsets (117,659) and their descriptions. For the unseen evaluation, we remove from the graph any textual features occurring in the testing part, and create a new set of synset vectors; further, any testing instance is removed from the training set of the compositional model. The evaluation is done by comparing the predicted vector with the vectors of all WordNet synsets (a search space of 117,659 points) and creating a ranked list as before, by cosine similarity. Following Hill et al. (2016), we compute accuracy on top-10 and top-100. λ and k are tuned on a dev set of 2,000 synsets, showing a behaviour very similar to that of the SNOMED task. Table 2 shows the results, based on a MS-LSTM setup similar to that of §4.1. Note that the MS-LSTM achieves 0.95-0.96 top-10 accuracy for the seen evaluation, significantly higher not only than the best model of Hill et al. (2016), but also higher than OneLook, a commercial system with access to more than 1000 dictionaries. It also presents considerably higher performance in the unseen evaluation. We are not aware of any other models with higher performance on the specific task.

Table 2: Results for the reverse dictionary task, compared with the highest numbers reported by Hill et al. (2016). TF vectors refers to textually enhanced vectors with λ = 1. For the MS-LSTM, k is set to 3.

Model                                 Acc-10   Acc-100
Seen (500 WordNet definitions)
OneLook (Hill et al., 2016)           0.89     0.91
RNN cosine (Hill et al., 2016)        0…
…
… (Hill et al., 2016)                 0.44     0.69
BOW w2v cosine (Hill et al., 2016)    0…
…

Document classification
Our last experiment is a document classification task, performed on Cora (McCallum et al., 2000), a dataset containing 2708 machine learning papers linked by citation relationships into a graph. Each document is a short text extracted from the title or the abstract of the paper. The task is to predict the category of a document (a total of 7 classes), given its vector-so here we only evaluate the textually enhanced vectors as inputs to a classifier, independently of the compositional part.
In Table 3 we report results for two evaluation settings. In Evaluation 1, we provide a comparison with the method of Yang et al. (2015), who include textual features in graph embeddings based on matrix factorisation, and two topic models used as baselines in their paper. Using the same classification algorithm (a linear SVM) and training ratio (0.50) as them, we present state-of-the-art results for vectors of 150 dimensions, prepared from a graph extended with 1422 textual features. We set λ = 0.5 by tuning on a dev set of 677 randomly selected entries from the training data.⁸ In Evaluation 2, using the same linear SVM classifier and λ as before, we reduce the training ratio to 0.05 in order to make our task comparable to the experiments reported by Veličković et al. (2018) for a number of deep learning models: specifically, the graph attention network (GAT) of Veličković et al. (2018), the graph convolutional network (GCN) of Kipf and Welling (2017), and the Planetoid model of Yang et al. (2016). Again, our simple setting presents results within the state of the art range, comparable to (or better than) those of much more sophisticated models that have been specifically designed for the task of node classification. We consider this as a strong indication for the effectiveness of the textually enhanced vectors as representations of KB entities. Fig. 5 provides a visualisation of the Cora classes based on node vectors created with λ = 0 and λ = 0.5, respectively, demonstrating the impact of textual features in terms of cluster coherence and separation.

Table 3

Model                                 Accuracy
Evaluation 1 (training ratio=0.50)
PLSA (Hofmann, 1999)                  0.68
NetPLSA (Mei et al., 2008)            0.85
TADW (Yang et al., 2015)              0.87
Linear SVM + DeepWalk vectors         0.85
Linear SVM + TF vectors               0.88
Evaluation 2 (training ratio=0.05)
Planetoid (Yang et al., 2016)         0.76
GCN (Kipf and Welling, 2017)          0.81
GAT (Veličković et al., 2018)         0.83
Linear SVM + DeepWalk vectors         0.72
Linear SVM + TF vectors               0.82

⁸ We also attempted a second classification experiment on a dataset of 200k concepts extracted from SNOMED, observing a similar behaviour of λ (details are not reported due to space). This difference in the behaviour of λ between text-to-entity mapping and classification tasks is discussed in §5.

Qualitative evaluation
Table 4 compares the performance of the multi-sense approach with that of the single-sense model for a number of selected cases of text mapping. The predictions in the top part (for definitions taken from the unseen evaluation of the reverse dictionary task) show that, in contrast to the single-sense model, the multi-sense approach was able to capture subtle variations of meaning between different synsets due to polysemy, as motivated in §3.2. The lower part of the table contains short phrases with ambiguous words, specifically selected to demonstrate the effect of the multi-sense approach. In all these cases, the multi-sense model was able to effectively disambiguate the ambiguous parts of the phrase by using the available context, and predict a very relevant synset; in contrast, the predictions of the single-sense model were based on choosing a wrong sense.
Finally, Table 5 presents the derived senses for the word table, expressed as lists of nearest neighbouring words in the space. The model was able to effectively distinguish between a table as kitchen furniture (sense 2), and a table as a structured way of presenting data (senses 1 and 3).

Discussion
The experimental work shows that using a graph embedding space as a target for mapping text to entities is an effective approach. This was mostly evident in the reverse dictionary task of §4.2, where the model was found to perform substantially better than previous approaches by Hill et al. (2016), who used a compositional architecture similar to ours but optimised on the word embeddings of the target terms. Note that this is suboptimal in the sense that, unless specific measures are taken, a word embedding reflects ambiguous meaning; therefore, trying to associate a definition like "keyboard musical instrument with pipes" to the vector for word "organ" introduces a certain amount of noise in the model, since the definition will be partly associated with features related to Definition from the unseen dataset of the reverse dictionary task k = 3 (correct pred.) k = 1 (wrong pred.) the branch of engineering that deals with things smaller than 100 nm nanotechnology microelectronics floor consisting of open space at the top of a house just below roof loft balcony a board game for two players; pieces move according to dice throws backgammon checkers an address of a religious nature sermon rogation Example short phrase with ambiguous words k = 3 prediction k = 1 prediction a rechargeable cell nickel-cadmium battery karyolysis (biological process) a state capital Curitiba (Brazilian state capital) assert (verb) the lap of a person upper side of thighs lapper (garment) a band named Queen band leader neckband (garment)  the "body part" sense of the word. In our model, homonymy issues are resolved by design: each point in the target space corresponds to a welldefined unambiguous concept or synset. Further, the attentional mechanism of Fig. 4 handles subtle variations of each distinct sense due to polysemy. The effectiveness of the textual feature mechanism was demonstrated in every task we attempted, but to different extents. 
As our tuning on the dev sets showed, for tasks closer to text-to-entity mapping (§§4.1-4.2), the more textual features in the random walks, the better the results. However, the best performance on the classification task came with λ values between 0.50 and 0.75, i.e. with walks visiting more entity nodes than textual nodes. The reason is that entity classification is very sensitive to the topology of the KB graph, since entities belonging to a specific class are very likely to be located in the same sub-hierarchy, hence in topological proximity. On the other hand, one of the motivations for introducing textual features was exactly to broaden the context of a node by connecting distant parts of the graph (see Figures 2-3). So, while small amounts of textual features can still be useful for classification purposes, excessive use introduces unwanted noise in the model.
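To make the role of λ concrete, the following is a minimal sketch of a λ-biased random walk over a graph with entity nodes and textual-feature nodes. It is an illustration only, not the implementation used in the experiments. It assumes that λ is the probability of stepping to an entity neighbour rather than a textual neighbour, which is consistent with the observation above that higher λ values yield walks visiting more entity nodes. The function name `biased_walk`, the toy graph, and the `w:` prefix marking textual nodes are all hypothetical.

```python
import random

# Toy graph (hypothetical): entity neighbours and textual-feature neighbours
# are stored separately. Textual nodes carry a "w:" prefix and link back
# to the entities they describe.
ENTITY_NBRS = {
    "insomnia": ["sleep_disorder"],
    "narcolepsy": ["sleep_disorder"],
    "sleep_disorder": ["insomnia", "narcolepsy"],
    "w:sleep": ["insomnia", "narcolepsy"],
    "w:tired": ["insomnia"],
}
TEXT_NBRS = {
    "insomnia": ["w:sleep", "w:tired"],
    "narcolepsy": ["w:sleep"],
}


def biased_walk(entity_nbrs, text_nbrs, start, length, lam, rng):
    """Collect one walk of at most `length` nodes starting at `start`.

    Assumption: at each step the walk moves to an entity neighbour with
    probability `lam` and to a textual neighbour otherwise. If only one
    kind of neighbour exists, that kind is used.
    """
    walk = [start]
    node = start
    for _ in range(length - 1):
        ents = entity_nbrs.get(node, [])
        texts = text_nbrs.get(node, [])
        if not ents and not texts:
            break  # dead end: no neighbours of either kind
        if ents and (not texts or rng.random() < lam):
            node = rng.choice(ents)
        else:
            node = rng.choice(texts)
        walk.append(node)
    return walk


rng = random.Random(0)
topology_walk = biased_walk(ENTITY_NBRS, TEXT_NBRS, "insomnia", 6, 1.0, rng)
textual_walk = biased_walk(ENTITY_NBRS, TEXT_NBRS, "insomnia", 6, 0.0, rng)
```

Under these assumptions, `lam = 1.0` produces walks that follow only the KB topology, while `lam = 0.0` uses a textual node whenever one is available, so that the words shared between `insomnia` and `narcolepsy` act as a bridge between them. Intermediate values trade off the two behaviours.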
The dynamic disambiguation mechanism integrated in the compositional architecture further improved the performance of the model. This finding is consistent with previous work on simpler tensor-based models, which showed that applying some form of word sense disambiguation when composing word vectors can provide consistent improvements on end tasks such as sentence similarity and paraphrase detection.

Conclusion and future work
We presented and evaluated a text-to-entity mapping system based on a continuous KB space enhanced with textual features and capable of handling polysemy. A reasonable next step will be to extend our methods for modelling the relations (edges) of a KB graph, which will allow applications in tasks such as link prediction and KB completion. Furthermore, having a mechanism that translates arbitrary text to points in a continuous space creates many opportunities for interesting research. For example, while the size of a knowledge base is finite, the space itself consists of an infinite number of points, each of which corresponds to a valid entity of the same domain that is not explicitly stated in the KB. The exciting question of how we can exploit this extra information, for instance in order to enrich the knowledge base with new data, constitutes one of our future directions.