LSTMEmbed: Learning Word and Sense Representations from a Large Semantically Annotated Corpus with Long Short-Term Memories

While word embeddings are now a de facto standard representation of words in most NLP tasks, attention has recently been shifting towards vector representations which capture the different meanings, i.e., senses, of words. In this paper we explore the capabilities of a bidirectional LSTM model to learn representations of word senses from semantically annotated corpora. We show that using an architecture that is aware of word order, such as an LSTM, enables us to create better representations. We assess our proposed model on various standard benchmarks for evaluating semantic representations, reaching state-of-the-art performance on the SemEval-2014 word-to-sense similarity task. We release the code and the resulting word and sense embeddings at http://lcl.uniroma1.it/LSTMEmbed.


Introduction
Natural Language is inherently ambiguous, for reasons of communicative efficiency (Piantadosi et al., 2012). For us humans, ambiguity is not a problem, since we use common knowledge to fill in the gaps and understand each other. Therefore, a computational model suited to understanding natural language and working side by side with humans should be capable of dealing with ambiguity to a certain extent (Navigli, 2018). A necessary step towards creating such computer systems is to build formal representations of words and their meanings, either in the form of large repositories of knowledge, e.g., semantic networks, or as vectors in a geometric space (Navigli and Martelli, 2019).
In fact, Representation Learning has been a major research area in NLP over the years, and latent vector-based representations, called embeddings, seem to be a good candidate for coping with ambiguity. Embeddings encode lexical and semantic items in a low-dimensional continuous space. These vector representations capture useful syntactic and semantic information about words and senses, such as regularities in natural language and relationships between items, in the form of relation-specific vector offsets. Recent approaches, such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), are capable of learning efficient word embeddings from large unannotated corpora. But while word embeddings have paved the way for improvements in numerous NLP tasks (Goldberg, 2017), they still conflate the various meanings of each word and let its predominant sense prevail over all the others in the resulting representation. Instead, when these embedding learning approaches are applied to sense-annotated data, they are able to produce embeddings for word senses (Iacobacci et al., 2015).
A strand of work aimed at tackling the lexical polysemy issue has proposed the creation of sense embeddings, i.e., embeddings which separate the various senses of each word in the vocabulary (Huang et al., 2012; Iacobacci et al., 2015; Flekova and Gurevych, 2016; Pilehvar and Collier, 2016; Mancini et al., 2017, among others). One of the weaknesses of these approaches, however, is that they do not take word ordering into account during the learning process. On the other hand, word-based approaches based on RNNs that consider sequence information have been presented, but they are not competitive in terms of speed or quality of the embeddings (Mikolov et al., 2010; Mikolov and Zweig, 2012; Mesnil et al., 2013).
For example, in Figure 1 we show an excerpt of a t-SNE (Maaten and Hinton, 2008) projection of word and sense embeddings in the literature: as can be seen, first, the ambiguous word bank is located close to words which co-occur with it (squares in the Figure), and, second, the closest senses of bank (dots for the financial institution meaning and crosses for its geographical meaning) appear clustered in two separate regions, without a clear correlation with the (potentially ambiguous) words which are relevant to them. A more accurate representation would have word vectors distributed across the whole space, with well-defined clusters for each set of vectors related to each sense of a target word (Figure 2).
Recently, the much celebrated Long Short-Term Memory (LSTM) neural network has emerged as a successful model for learning representations of sequences, thus providing an ideal solution for many Natural Language Processing tasks whose input is sequence-based, e.g., sentences and phrases (Hill et al., 2016; Melamud et al., 2016; Peters et al., 2018). However, to date LSTMs have not been applied to the effective creation of sense embeddings linked to an explicit inventory.
In this paper, we explore the capabilities of the LSTM architecture, using sense-labeled corpora, for learning semantic representations of words and senses. We present four main contributions:
• We introduce LSTMEmbed, an RNN model based on a bidirectional LSTM for learning word and sense embeddings in the same semantic space, which - in contrast to the most popular approaches to the task - takes word ordering into account.
• We present an innovative idea for taking advantage of pretrained embeddings by using them as an objective during training.
• We show that LSTM-based models are suitable for learning not only contextual information, as is usually done, but also representations of individual words and senses.
• By linking our representations to a knowledge resource, we take advantage of the preexisting semantic information.

Embeddings for words and senses
Machine-interpretable representations of the meanings of words are key for a number of NLP tasks, and therefore obtaining good representations is an important research goal in the field, as shown by the surge of recent work on this topic.

Word Embeddings
In recent years, we have seen an exponential growth in the popularity of word embeddings. Models for learning embeddings, typically based on neural networks, represent individual words as low-dimensional vectors. Mikolov et al. (2013, word2vec) showed that word representations learned with a neural network trained on raw text geometrically encode latent relationships between words. The canonical example is that the vector resulting from king − man + woman is found to be very close to the vector of queen. GloVe (Pennington et al., 2014), an alternative approach trained on aggregated global word-word co-occurrences, obtained similar results. While these embeddings are surprisingly good for monosemous words, they fail to represent the non-dominant senses of words properly. For instance, the representations of bar and pub should be similar, as should those of bar and stick, but having similar representations for pub and stick is undesirable. Several approaches have been proposed to mitigate this issue: Yu and Dredze (2014) presented an alternative way to train word embeddings by using, in addition to standard contextual features, words related in a semantic resource such as PPDB (Ganitkevitch et al., 2013) or WordNet (Miller, 1995). Faruqui et al. (2015) presented a technique applicable to pretrained embeddings, in which vectors are updated ("retrofitted") in order to make them more similar to the vectors of related words and less similar to those of unrelated ones, with the relations extracted from diverse semantic resources such as PPDB, WordNet and FrameNet (Baker et al., 1998). Melamud et al. (2016) introduced context2vec, a model based on a bidirectional LSTM for learning sentence and word embeddings. This model uses large raw text corpora to train a neural model that embeds entire sentential contexts and target words in the same low-dimensional space. Finally, Press and Wolf (2017) introduced a model, based on word2vec, where the embeddings are extracted from the topmost (output) weight matrix instead of the input one, showing that those representations are also valid word embeddings.

Sense Embeddings
In contrast to the above approaches, each of which aims to learn representations of lexical items, sense embeddings represent individual word senses as separate vectors. One of the main approaches for learning sense embeddings is the so-called knowledge-based approach, which relies on a predefined sense inventory such as WordNet, BabelNet (Navigli and Ponzetto, 2012) or Freebase. SensEmbed (Iacobacci et al., 2015) uses Babelfy, a state-of-the-art tool for Word Sense Disambiguation and Entity Linking, to build a sense-annotated corpus which, in turn, is used to train a vector space model for word senses with word2vec. SensEmbed exploits the structured knowledge of BabelNet's sense inventory along with the distributional information gathered from text corpora. Since this approach is based on word2vec, the model suffers from the lack of word ordering while learning embeddings. An alternative way of learning sense embeddings is to start from a set of pretrained word embeddings and split the vectors into their respective senses. This idea was implemented by Rothe and Schütze (2015) in AutoExtend, a system which learns embeddings for lexemes, senses and synsets from WordNet in a shared space. The synset/lexeme embeddings live in the same vector space as the word embeddings, given the constraints that words are sums of their lexemes and synsets are sums of their lexemes. AutoExtend is based on an autoencoder, a neural network trained to reproduce its input at the output. However, Mancini et al. (2017) pointed out that, by constraining the representations of senses, we cannot learn much about the relation between words and senses. They introduced SW2V, a model which extends word2vec to learn embeddings for both words and senses in the same vector space as an emerging feature, rather than via constraints on both representations. The model was built by exploiting large corpora and knowledge obtained from WordNet and BabelNet. Their basic idea was to extend the CBOW architecture of word2vec to represent both words and senses as different inputs and train the model to predict the word and its sense in the middle. Nevertheless, being based on word2vec, SW2V also lacks a notion of word ordering. Other approaches in the literature avoid the use of a predefined sense inventory. The vectors learned by such approaches are referred to as multi-prototype embeddings rather than sense embeddings, since they are only distinguished from one another, without a clear identification of the sense they capture. Several approaches have used this idea: Huang et al. (2012) introduced a model which learns multiple vectors per word by clustering word context representations. Neelakantan et al. (2014) extended word2vec with a module which induces new sense vectors when the context in which a word occurs is too different from the previously seen contexts for the same word. A similar approach was introduced by Li and Jurafsky (2015), who used a Chinese Restaurant Process to induce new senses. Finally, Peters et al. (2018) presented ELMo, a word-in-context representation model based on a deep bidirectional language model. In contrast to the other related approaches, ELMo does not have a token dictionary; rather, each token is represented by three vectors, two of which are contextual. These models are, in general, difficult to evaluate, due to their lack of linkage to a lexical-semantic resource.
In marked contrast, LSTMEmbed, the neural architecture we present in this paper, aims to learn individual representations for word senses, linked to a multilingual lexical-semantic resource like BabelNet, while at the same time handling word ordering and using pretrained embeddings as the objective.

LSTMEmbed
Many approaches for learning embeddings are based on feed-forward neural networks (Section 2). However, LSTMs have recently gained popularity in the NLP community as a new de facto standard model for processing natural language, by virtue of their context and word-order awareness. In this section we introduce LSTMEmbed, a novel method, based on the LSTM architecture, for learning word and sense embeddings jointly.

Model Overview
At the core of LSTMEmbed is a bidirectional Long Short-Term Memory (BiLSTM) network, a kind of recurrent neural network (RNN) which uses a set of gates especially designed for handling long-range dependencies. The BiLSTM is a variant of the original LSTM (Hochreiter and Schmidhuber, 1997) that is particularly suited to temporal problems where access to the complete context is needed. In our case, we use an architecture similar to Kawakami and Dyer (2015), Kågebäck and Salomonsson (2016) and Melamud et al. (2016), where the state at each time step of the BiLSTM consists of the states of two LSTMs centered on that time step, one receiving the input from the previous time steps and the other from the future time steps. This is particularly suitable when the output corresponds to the analyzed time step and not to the whole context. The input of the model is a sequence of tokens $s_{i-W}, \dots, s_{i-1}, s_{i+1}, \dots, s_{i+W}$ surrounding a target token $s_i$, where each token is either a word or a sense tag from an existing inventory (see Section 4.1 for details). Each token $s_j$ is represented by its corresponding embedding vector $v(s_j) \in \mathbb{R}^n$, given by a shared look-up table, which enables representations to be learned taking into account the contextual information on both sides. Next, the BiLSTM reads both sequences, i.e., the preceding context from left to right and the posterior context from right to left. The model has one extra layer: the concatenation of the outputs of the two LSTMs is projected linearly via a dense layer,
$$out_{LSTMEmbed} = W_o \big( lstm_f(s_{i-W}, \dots, s_{i-1}) \oplus lstm_b(s_{i+W}, \dots, s_{i+1}) \big),$$
where $\oplus$ denotes concatenation and $W_o \in \mathbb{R}^{2m \times m}$ is the weight matrix of the dense layer, with $m$ being the dimension of the LSTM. Then, the model compares $out_{LSTMEmbed}$ with $emb(s_i)$, where $emb(s_i)$ is a pretrained embedding vector of the target token (see Section 4.1 for an illustration of the pretrained embeddings that we use in our experiments); depending on the annotation and the pretrained set of embeddings used, this can be either a word or a sense. At training time, the weights of the network are modified in order to maximize the similarity between $out_{LSTMEmbed}$ and $emb(s_i)$. The loss function is therefore calculated in terms of cosine similarity:
$$loss(s_i) = 1 - \cos\big(out_{LSTMEmbed},\, emb(s_i)\big).$$
Once the training is over, we obtain latent semantic representations of words and senses jointly in the same vector space from the look-up table, i.e., the embedding matrix between the input and the LSTM, with the embedding vector of an item $s$ given by $v(s)$.
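To make the architecture concrete, the following is a minimal sketch of the model in Keras (modern tf.keras API; the original implementation used Keras with a Theano backend, see Section 4.1). The vocabulary size, window size, LSTM dimension and all variable names below are illustrative assumptions, not the released code.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB = 100_000  # shared word+sense vocabulary (placeholder; the real one is in the millions)
DIM = 200        # dimensionality of the shared look-up table (as in Section 4.1)
M = 400          # m, the LSTM dimension, chosen here to match the 400-dim objective embeddings
W = 5            # context window on each side of the target token (assumption)

# Preceding context s_{i-W}..s_{i-1} and posterior context s_{i+1}..s_{i+W}
left_in = layers.Input(shape=(W,), dtype="int32", name="left_context")
right_in = layers.Input(shape=(W,), dtype="int32", name="right_context")

# Single shared look-up table: after training, word and sense embeddings are read from here.
lookup = layers.Embedding(VOCAB, DIM, name="shared_lookup")

# Forward LSTM over the left context; the right context is processed right-to-left.
lstm_f = layers.LSTM(M)(lookup(left_in))
lstm_b = layers.LSTM(M, go_backwards=True)(lookup(right_in))

# Dense layer W_o in R^{2m x m} projecting the concatenated states onto the
# space of the pretrained objective embeddings emb(s_i).
out = layers.Dense(M, use_bias=False, name="W_o")(layers.Concatenate()([lstm_f, lstm_b]))

model = Model([left_in, right_in], out)
# CosineSimilarity returns the negative cosine, so minimizing it maximizes the
# similarity between the output and the pretrained target vector emb(s_i).
model.compile(optimizer="adam", loss=tf.keras.losses.CosineSimilarity(axis=-1))
```

Training targets for model.fit would then be the pretrained emb(s_i) vectors of the target tokens in the annotated corpus.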
In comparison to a standard BiLSTM, the novelties of LSTMEmbed can be summarized as follows:
• Using a sense-annotated corpus which includes both words and senses for learning the embeddings.
• Learning representations of both words and senses, extracted from a single look-up table, shared between both left and right LSTMs.
• A new learning method which uses a set of pretrained embeddings as the objective, enabling us to learn embeddings for a large vocabulary.

Evaluation
We now present an experimental evaluation of the representations learned with LSTMEmbed. We first provide implementation details (Section 4.1), and then, to show the effectiveness of our model on a broad range of tasks, report on two sets of experiments: those involving sense-level tasks (Section 4.2) and those concerned with the word level (Section 4.3).

Implementation Details
Training data. We chose BabelNet (Navigli and Ponzetto, 2012) as our sense inventory. BabelNet is a large multilingual encyclopedic dictionary and semantic network, comprising approximately 16 million entries for concepts and named entities linked by semantic relations. As training corpus we used the English portion of BabelWiki, a multilingual corpus built from Wikipedia (Scozzafava et al., 2015). The corpus was automatically annotated with named entities and concepts using Babelfy (Moro et al., 2014), a state-of-the-art disambiguation and entity linking system based on the BabelNet semantic network. The English section of BabelWiki contains 3 billion tokens and around 3 million unique tokens.
Learning embeddings. LSTMEmbed was built with the Keras library using Theano as backend. We trained our models on an Nvidia Titan X Pascal GPU. We set the dimensionality of the look-up table to 200 due to memory constraints. We discarded the 1,000 most frequent tokens and set the batch size to 2048. The training was performed for one epoch. As optimizer we used Adaptive Moment Estimation (Adam) (Kingma and Ba, 2014). As regards the objective embeddings $emb(s_i)$ used for training, we chose 400-dimensional sense embeddings trained using word2vec's SkipGram architecture with negative sampling on the BabelWiki corpus, with the recommended parameters for the SkipGram architecture: window size of 10, negative sampling set to 10, and sub-sampling of frequent words set to $10^{-3}$.
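As an illustration, objective embeddings with the parameters reported above could be trained with gensim's word2vec implementation; the corpus path and the gensim (>= 4.0) API usage below are assumptions, not the original training script.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One sense-annotated sentence per line (hypothetical path).
sentences = LineSentence("babelwiki_en_annotated.txt")

objective = Word2Vec(sentences,
                     vector_size=400,  # 400-dimensional objective embeddings
                     sg=1,             # SkipGram architecture
                     window=10,        # window size of 10
                     negative=10,      # negative sampling set to 10
                     sample=1e-3)      # sub-sampling of frequent words set to 10^-3

objective.wv.save("objective_sense_vectors.kv")
```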

Sense-based Evaluation
Our first set of experiments was aimed at showing the impact of our joint word and sense model in tasks where semantic, and not just lexical, relatedness is needed. We analyzed two tasks, namely Cross-Level Semantic Similarity and Most Frequent Sense Induction.
Comparison systems. We compared the performance of LSTMEmbed against alternative approaches to sense embeddings: SensEmbed (Iacobacci et al., 2015), which obtained semantic representations by applying word2vec to the English Wikipedia disambiguated with Babelfy; Nasari (Camacho-Collados et al., 2015), a technique for rich semantic representation of arbitrary concepts present in WordNet and Wikipedia pages; AutoExtend (Rothe and Schütze, 2015) which, starting from the word2vec word embeddings learned from GoogleNews, infers the representation of senses and synsets from WordNet; and DeConf, an approach introduced by Pilehvar and Collier (2016) that decomposes a given word representation into its constituent sense representations by exploiting WordNet.
Experiment 1: Cross-Level Semantic Similarity. To best evaluate the ability of embeddings to discriminate between the various senses of a word, we opted for the SemEval-2014 task on Cross-Level Semantic Similarity (Jurgens et al., 2014, CLSS), which includes word-to-sense similarity as one of its sub-tasks. The CLSS word-to-sense similarity dataset comprises 500 instances of words, each paired with a short list of candidate senses from WordNet with human ratings for their word-sense relatedness. To compute the word-to-sense similarity we used our shared vector space of words and senses, calculating the similarity as the cosine between the word and sense vectors.
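As a minimal illustration of this computation: since words and senses live in the same space, the score for each candidate sense reduces to the cosine between two vectors. Here `vectors` is a hypothetical mapping from words and sense keys to numpy arrays, not part of the released code.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_to_sense_similarity(word, sense, vectors):
    # Word and sense vectors share the same space, so their cosine is the score.
    return cosine(vectors[word], vectors[sense])
```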
We included not only alternative sense-based representations but also the best performing approaches on this task: MeerkatMafia (Kashyap et al., 2014), which uses Latent Semantic Analysis (Deerwester et al., 1990) and WordNet glosses to get word-sense similarity measurements; SemantiKLUE (Proisl et al., 2014), an approach based on a distributional semantic model trained on a large Web corpus from different sources; and SimCompass (Banea et al., 2014), which combines word2vec with information from WordNet.
The results are given as Pearson and Spearman correlation scores in Table 1. LSTMEmbed achieves the state of the art by surpassing, in terms of Spearman correlation, alternative sense embedding approaches, as well as the best systems built specifically for the CLSS word-to-sense similarity task. In terms of Pearson, LSTMEmbed is on a par with the current state of the art, i.e., MeerkatMafia.
Experiment 2: Most Frequent Sense Induction. In a second experiment, we employed our representations to induce the most frequent sense (MFS) of the input words, which is known to be a hard-to-beat baseline for Word Sense Disambiguation systems (Navigli, 2009). The MFS is typically computed by counting the word-sense pairs in an annotated corpus such as SemCor (Miller et al., 1993).
To induce an MFS using sense embeddings, we identified - among all the sense embeddings of an ambiguous word - the sense which was closest to the word in terms of cosine similarity in the vector space. We evaluated all the sense embedding approaches on this task by comparing the induced most frequent senses against the MFS computed for all those words in SemCor which have a minimum of 5 sense annotations (3,731 words in total, which we release with the paper), so as to exclude words with insufficient gold-standard data for the estimates. We carried out our evaluation by calculating precision@K (K ∈ {1, 3, 5}). Table 2 shows that, across all the models, SW2V performs best, with LSTMEmbed as the runner-up approach.
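A sketch of the induction and scoring procedure, under the assumption of a hypothetical mapping `senses_of` from each word to its candidate sense keys and a `vectors` store of word and sense embeddings:

```python
import numpy as np

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def induce_mfs(word, senses_of, vectors, k=1):
    # Rank the candidate senses of `word` by cosine similarity to the word vector.
    ranked = sorted(senses_of[word],
                    key=lambda s: _cos(vectors[word], vectors[s]),
                    reverse=True)
    return ranked[:k]

def precision_at_k(words, gold_mfs, senses_of, vectors, k=1):
    # Fraction of words whose SemCor-derived MFS appears among the k closest senses.
    hits = sum(gold_mfs[w] in induce_mfs(w, senses_of, vectors, k) for w in words)
    return hits / len(words)
```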

Word-based Evaluation
While our primary goal was to show the effectiveness of LSTMEmbed on tasks in need of sense information, we also carried out a second set of experiments focused on word-based evaluations, with the objective of demonstrating the ability of our joint word and sense embedding model to tackle tasks traditionally approached with word-based models.
Experiment 3: Synonym Recognition. We first experimented with synonym recognition: given a target word and a set of alternative words, the objective of this task is to select the member of the set which is most similar in meaning to the target word. The most likely synonym for a word $w$ given the set of candidates $A_w$ is calculated as
$$\hat{a} = \operatorname*{argmax}_{a \in A_w} Sim(w, a),$$
where $Sim$ is the pairwise word similarity:
$$Sim(w_1, w_2) = \max_{x \in S_{w_1},\, y \in S_{w_2}} \cos\big(v(x), v(y)\big),$$
where $S_{w_i}$ is the set of words and senses associated with the word $w_i$. We consider all the inflected forms of every word, with and without all its possible senses. In order to evaluate the performance of LSTMEmbed on this task, we carried out experiments on two datasets. The first one, introduced by Landauer and Dumais (1997), is extracted directly from the synonym questions of the TOEFL (Test of English as a Foreign Language) questionnaire. The test comprises 80 multiple-choice synonym questions with four choices per question. The second one, introduced by Turney (2001), provides a set of questions extracted from the synonym questions of the ESL (English as a Second Language) test. Similarly to TOEFL, it comprises 50 multiple-choice synonym questions with four choices per question.
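A minimal sketch of this scoring scheme follows; `associated_items` is a hypothetical helper returning the set $S_w$ of look-up keys (inflected forms, with and without sense annotations) for a word, and `vectors` is the embedding store.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim(w1, w2, vectors, associated_items):
    # Pairwise similarity Sim: maximum cosine over all items associated with each word.
    return max(cosine(vectors[a], vectors[b])
               for a in associated_items(w1)
               for b in associated_items(w2))

def best_synonym(target, candidates, vectors, associated_items):
    # The chosen synonym is the candidate maximizing Sim with the target word.
    return max(candidates, key=lambda c: sim(target, c, vectors, associated_items))
```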
Several related efforts used this kind of metric to evaluate their representations. We compare our approach with the following:
• Multi-Sense Skip-gram (Neelakantan et al., 2014, MSSG), an extension of the Skip-gram model of word2vec capable of learning multiple embeddings for a single word. The model makes no assumption about the number of prototypes.
• Li and Jurafsky (2015), a multi-sense embeddings model based on the Chinese Restaurant Process.
• Jauhar et al. (2015), a multi-sense approach based on expectation-maximization style algorithms for inferring word sense choices in the training corpus and learning sense embeddings while incorporating ontological sources of information.
• Modularizing Unsupervised Sense Embeddings (Lee and Chen, 2017, MUSE), an unsupervised approach that introduces a modularized framework to create sense-level representations learned with linear-time sense selection.
In addition, we included in the comparison two popular off-the-shelf word embedding models: GoogleNews, a set of word embeddings trained with word2vec on a corpus of newspaper articles, and GloVe.6B, a set of word embeddings trained on a combination of the 2014 English Wikipedia dump and Gigaword 5, for a total of 6 billion tokens.
In Table 3 we report the performance of LSTMEmbed together with the alternative approaches (the latter figures are taken from the respective publications). We can see that, on the TOEFL task, LSTMEmbed outperforms all other approaches, including the word-based models. On the ESL task, LSTMEmbed is the runner-up across systems, and only by a small margin; the performance of the remaining models is considerably below ours.
Experiment 4: Outlier detection. Our second word-based evaluation focused on outlier detection, a task intended to test the capability of the learned embeddings to create semantic clusters, that is, to test the assumption that the representations of related words should be closer than the representations of unrelated ones. We tested our model on the 8-8-8 dataset introduced by Camacho-Collados and Navigli (2016), containing eight clusters, each with eight words and eight possible outliers. In our case, we extended the similarity function used in the evaluation to consider both the words in the dataset and their senses, similarly to what we had done in the synonym recognition task (cf. Equation 5). We can see from Table 4 that LSTMEmbed ranks second below SensEmbed in terms of both measures defined in the task (accuracy, and outlier position percentage, which considers the position of the outlier according to its proximity to the semantic cluster), with both approaches outperforming all other word-based and sense-based approaches.
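For illustration, a hedged sketch of the outlier prediction under this extension: given the nine items (the eight cluster words plus one candidate outlier) and the sense-aware pairwise similarity of Equation 5 passed in as a callable, the item with the lowest average similarity to the rest is predicted as the outlier.

```python
def predict_outlier(items, sim_fn):
    # `items` is the 8+1 word set; `sim_fn` is the sense-aware pairwise similarity.
    def avg_sim(x):
        others = [y for y in items if y != x]
        return sum(sim_fn(x, y) for y in others) / len(others)
    return min(items, key=avg_sim)
```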

Analysis
The objective embeddings $emb(s_i)$ used in our work are pretrained sense embeddings obtained with word2vec trained on BabelWiki, as explained in Section 4.1. Our assumption was that training with richer and more meaningful objective embeddings would enhance the representations delivered by our model in comparison to using word-based models. We put this hypothesis to the test by comparing the performance of LSTMEmbed equipped with five sets of pretrained embeddings on a word similarity task. We used the WordSim-353 (Finkelstein et al., 2002) dataset, which comprises 353 word pairs annotated by human subjects with a pairwise relatedness score. We computed the performance of LSTMEmbed with the different pretrained embeddings in terms of Spearman correlation between the cosine similarities of the LSTMEmbed word vectors and the WordSim-353 scores.
The first set of pretrained embeddings is a 50-dimensional word space model, trained with word2vec Skip-gram with the default configuration. The second set consists of the same vectors, retrofitted with PPDB using the default configuration. The third is the GoogleNews set of pretrained embeddings. The fourth is the GloVe.6B word space model. Finally, we tested our model with the pretrained embeddings of SensEmbed. As a baseline we included a set of normalized random vectors. As shown in Table 5, using richer pretrained embeddings improves the resulting representations given by our model. All the representations obtain better results than word2vec and GloVe trained on the same corpus, with the sense embeddings from SensEmbed, a priori the richest set of pretrained embeddings, attaining the best performance.
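The evaluation protocol sketched below (with the data loading left out as an assumption) computes the Spearman correlation between the cosine similarities of the learned word vectors and the WordSim-353 human scores:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_wordsim353(pairs, gold_scores, vectors):
    # `pairs` is a list of (word1, word2); `gold_scores` are the human relatedness ratings.
    predicted = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    rho, _ = spearmanr(predicted, gold_scores)
    return rho
```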

Conclusions
We presented LSTMEmbed, a new model based on a bidirectional LSTM for learning embeddings of words and senses jointly, which is able to learn semantic representations on a par with, or better than, state-of-the-art approaches. We draw three main conclusions. Firstly, we have shown that our semantic representations are capable of properly reflecting the similarity between word and sense representations, showing state-of-the-art performance in the sense-aware tasks of word-to-sense similarity and most frequent sense induction. Secondly, our approach is also able to attain high performance in standard word-based semantic evaluations, namely synonym recognition and outlier detection. Finally, the introduction of an output layer which predicts pretrained embeddings enables us to handle larger vocabularies, avoiding the slower softmax. We release the word and sense embeddings at the following URL: http://lcl.uniroma1.it/LSTMEmbed.
Our model shows potential for further applications. We did, in fact, explore alternative configurations, for instance, using several layers or replacing the LSTMs with Gated Recurrent Units (Cho et al., 2014) or the Transformer architecture (Vaswani et al., 2017). Trying more complex networks is also within our scope and is left as future work.