Auto-Encoding Dictionary Definitions into Consistent Word Embeddings

Monolingual dictionaries are widespread and semantically rich resources. This paper presents a simple model that learns to compute word embeddings by processing dictionary definitions and trying to reconstruct them. It exploits the inherent recursivity of dictionaries by encouraging consistency between the representations it uses as inputs and the representations it produces as outputs. The resulting embeddings are shown to capture semantic similarity better than regular distributional methods and other dictionary-based methods. In addition, our method shows strong performance when trained exclusively on dictionary data and generalizes to unseen definitions in one shot.


Introduction
Dense, low-dimensional, real-valued vector representations of words known as word embeddings have proven very useful for NLP tasks (Turian et al., 2010). They can be learned as a by-product of solving a particular task (Collobert et al., 2011). Alternatively, one can pretrain generic embeddings based on co-occurrence counts or using an unsupervised criterion such as predicting nearby words (Bengio et al., 2003; Mikolov et al., 2013). These methods implicitly rely on the distributional hypothesis (Harris, 1954; Sahlgren, 2008), which states that words that occur in similar contexts tend to have similar meanings.
It is common to study the relationships captured by word representations in terms of either similarity or relatedness. "Coffee" is related to "cup", as coffee is a beverage often drunk in a cup, but "coffee" is not similar to "cup", in that coffee is a beverage and a cup is a container. Methods relying on the distributional hypothesis often capture relatedness very well, reaching human performance, but fare worse in capturing similarity and especially in distinguishing it from relatedness.
It is useful to specialize word embeddings to focus on either relation in order to improve performance on specific downstream tasks. For instance, Kiela et al. (2015) report that improvements on relatedness benchmarks also yield improvements on document classification. In the other direction, embeddings learned by neural machine translation models capture similarity better than distributional unsupervised objectives (Hill et al., 2014).
There is a wealth of methods that postprocess embeddings to improve or specialize them, such as retrofitting (Faruqui et al., 2014). On similarity benchmarks, they are able to reach correlation coefficients close to inter-annotator agreements. But these methods rely on additional resources such as paraphrase databases (Wieting et al., 2016) or graphs of lexical relations such as synonymy, hypernymy, and their converses.
Rather than relying on such curated lexical resources that are not readily available for the majority of languages, we propose a method capable of improving embeddings by leveraging the more common resource of monolingual dictionaries. 1 Lexical databases such as WordNet (Fellbaum, 1998) are often built from dictionary definitions, as was proposed earlier by Amsler (1980). We propose to bypass the process of explicitly building a lexical database, during which information is structured but some information is also lost, and instead directly use its detailed source: dictionary definitions. The goal is to obtain better representations for more languages with less effort.
The ability to process new definitions is also desirable for future natural language understanding systems. In a dialogue, a human might want to introduce a new term by explaining it in their own words, and the chatbot should understand it. Similarly, question-answering systems should be able to grasp definitions of technical terms that often occur in scientific writing.
We expect the embedding of a word to represent its meaning compactly. For interpretability purposes, it would be desirable to be able to generate a definition from that embedding, as a way to verify what information it captured. Case in point: to analyse word embeddings, Noraset et al. (2017) used RNNs to produce definitions from pretrained embeddings, manually annotated the errors in the generated definitions, and found that more than half of the wrong definitions fit either the antonyms of the defined words, their hypernyms, or related but different words. This points in the same direction as the results of intrinsic evaluations of word embeddings: lexical relationships such as lexical entailment, similarity, and relatedness are conflated in these embeddings. It also suggests a new criterion for evaluating word representations, or even learning them: they should contain the necessary information to reproduce their definition (to some degree).

In this work, we propose a simple model that exploits this criterion. The model consists of a definition autoencoder: an LSTM processes the definition of a word to yield its corresponding word embedding. Given this embedding, the decoder attempts to reconstruct the bag-of-words representation of the definition. Importantly, to address and leverage the recursive nature of dictionaries, namely the fact that the words used inside a definition have their own associated definitions, we train this model with a consistency penalty that ensures proximity between the embeddings produced by the LSTM and those that are used by the LSTM.
Our approach is self-contained: it yields good representations when trained on nothing but dictionary data. Alternatively, it can also leverage existing word embeddings and is then especially apt at specializing them for the similarity relation. Finally, it is extremely data-efficient, as it can create representations of new words in one shot from a short definition.

Setting and motivation
We suppose that we have access to a dictionary that maps words to one or several definitions. Definitions themselves are sequences of words. Our training criterion is built on the following principle: we want the model to be able to recover the definition from which the representation was built. This objective should produce similar embeddings for words which have similar definitions. Our hypothesis is that this will help capture semantic similarity, as opposed to relatedness. Reusing the previous example, "coffee" and "cup" should get different representations by virtue of having very different definitions, while "coffee" and "tea" should get similar representations, as they are both defined as beverages and plants.
We chose to compute a single embedding per word in order to avoid having to disambiguate word senses. Indeed, word sense disambiguation remains a challenging open problem with mixed success on downstream task applications (Navigli, 2009). Also, recent papers have shown that a single word vector can capture polysemy and that having several vectors per word is not strictly necessary (Li and Jurafsky, 2015; Yaghoobzadeh and Schütze, 2016). Thus, when a word has several definitions, we concatenate them to produce a single embedding.

Autoencoder model
Let V_D be the set of all words that are used in definitions and V_K the set of all words that are defined. We let w ∈ V_K be a word and D_w = (D_w,1, …, D_w,T) be its definition, where D_w,t is the index of a word in vocabulary V_D. We encode such a definition D_w by processing it with an LSTM (Hochreiter and Schmidhuber, 1997).
The LSTM is parameterized by Ω and a matrix E of size |V_D| × m, whose i-th row E_i contains an m-dimensional input embedding for the i-th word of V_D. These input embeddings can either be learned by the model or be fixed to pretrained embeddings. The last hidden state computed by this LSTM is then transformed linearly to yield an m-dimensional definition embedding h. Thus the encoder, whose parameters are θ = {E, Ω, W, b}, computes this embedding as

h = f_θ(D_w) = W · LSTM_{E,Ω}(D_w) + b.

The subsequent decoder can be seen as a conditional language model trained by maximum likelihood to regenerate definition D_w given the definition embedding h = f_θ(D_w). We use a simple conditional unigram model with a linear parametrization θ′ = {E′, b′}, where E′ is a |V_D| × m matrix and b′ is a bias vector.2 We maximize the log-probability of definition D_w under that model:

log p(D_w | h) = Σ_{t=1..T} [ ⟨E′_{D_w,t}, h⟩ + b′_{D_w,t} − log Σ_i exp(⟨E′_i, h⟩ + b′_i) ],

where ⟨·, ·⟩ denotes an ordinary dot product. We call E′ the output embedding matrix. The basic autoencoder training objective to minimize over the dictionary can then be formulated as

J_AE(θ, θ′) = − Σ_{w ∈ V_K} log p(D_w | f_θ(D_w)).
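To make the encoder-decoder pair concrete, here is a toy sketch in plain Python. For brevity it substitutes a simple tanh RNN for the LSTM, and all names, dimensions, and the miniature vocabulary are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

random.seed(0)
m = 4                               # embedding dimension (toy size)
vocab = ["a", "hot", "beverage", "container", "drink"]
V = len(vocab)
idx = {w: i for i, w in enumerate(vocab)}

def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

E = rand_mat(V, m)                  # input embeddings
Wh = rand_mat(m, m)                 # recurrent weights (stand-in for the LSTM's Omega)
W, b = rand_mat(m, m), [0.0] * m    # final linear map to the definition embedding
E_out = rand_mat(V, m)              # output embeddings E'
b_out = [0.0] * V                   # output bias b'

def encode(definition):
    """f_theta: run a plain RNN over the definition, then map the last state to h."""
    h = [0.0] * m
    for word in definition:
        x = E[idx[word]]            # input embedding is added directly (no input matrix)
        h = [math.tanh(x[j] + sum(Wh[j][k] * h[k] for k in range(m)))
             for j in range(m)]
    return [sum(W[j][k] * h[k] for k in range(m)) + b[j] for j in range(m)]

def log_prob(definition, h):
    """Unigram decoder: sum_t log p(D_t | h), with p(i | h) ∝ exp(<E'_i, h> + b'_i)."""
    logits = [sum(E_out[i][j] * h[j] for j in range(m)) + b_out[i]
              for i in range(V)]
    log_z = math.log(sum(math.exp(l) for l in logits))
    return sum(logits[idx[word]] - log_z for word in definition)

definition = ["a", "hot", "beverage"]
h = encode(definition)
print(len(h), log_prob(definition, h))
```

Minimizing the negative of this log-probability over all dictionary entries corresponds to the autoencoder objective above.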

Consistency penalty
We introduced three different embeddings: a) definition embeddings h, produced by the definition encoder, are the embeddings we are ultimately interested in computing; b) input embeddings E are used by the encoder as inputs; c) output embeddings E′ are compared to definition embeddings to yield a probability distribution over the words in the definition. We propose a soft weight-tying scheme that brings the input embeddings closer to the definition embeddings. We call this term a consistency penalty because its goal is to ensure that the embeddings used by the encoder (input embeddings) and the embeddings produced by the encoder (definition embeddings) are consistent with each other. It is implemented as

J_c(θ) = Σ_{w ∈ V_D ∩ V_K} d(E_w, f_θ(D_w)),

where d is a distance. In our experiments, we choose d to be the Euclidean distance. The penalty is only applied to some words because V_D ≠ V_K: some words are defined but not used in definitions, and some words are used in definitions but not defined. In particular, inflected words are not defined. To balance the two terms, we introduce two hyperparameters λ, α ≥ 0, and the complete objective is

J(θ, θ′) = α J_AE(θ, θ′) + λ J_c(θ).

We call the model CPAE, for Consistency Penalized AutoEncoder, when α > 0 and λ > 0 (see Figure 1).3 The consistency penalty is a cheap proxy for dealing with the circularity found in dictionary definitions: we want the embeddings of the words in definitions to be compositionally built from their own definitions as well, but the recursive process of fetching definitions of words in definitions does not terminate, because all words are defined using other words. To counter that, our model uses input embeddings that are brought closer to definition embeddings, and vice versa, in an asynchronous manner.

2 We tried using an LSTM decoder, but it did not yield good representations. It might overfit because the set of dictionary definitions is small. Also, with teacher forcing, we condition on ground-truth words, making it easier to predict the next words. More work is needed to address these issues.
Moreover, if λ is chosen large enough, then E_w ≈ f_θ(D_w) after optimisation. This means that the definition embedding for w is close enough to the corresponding input embedding to be reused by the encoder when producing definition embeddings for other words. In that case, the model could enrich its vocabulary by computing embeddings for new words and consistently reusing them as inputs for defining other words.
Finally, the consistency penalty can be used to leverage pretrained embeddings and bootstrap the learning process. For that purpose, the encoder's input embeddings E can be fixed to pretrained embeddings. These provide targets to the encoder but also help the encoder produce better definition embeddings, by virtue of using input embeddings that already contain meaningful information.
To summarize, the consistency penalty has several motivations. Firstly, it deals with the fact that the recursive process of building representation of words out of definitions does not terminate. Secondly, it is a way to enrich the vocabulary with new words dynamically. Finally, it is a way to integrate prior knowledge in the form of pretrained embeddings.
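The penalty and the combined objective can be sketched as follows. The embeddings and the hyperparameter values below are made-up toy data, and `total_objective` stands in for the full training loss, with the reconstruction term passed in as a precomputed number.

```python
import math

# Toy input embeddings E (for words in V_D) and definition embeddings
# f_theta(D_w) produced by the encoder; all values are illustrative only.
input_emb = {"coffee": [0.9, 0.1], "cup": [0.1, 0.8], "tea": [0.8, 0.2]}
def_emb   = {"coffee": [0.8, 0.2], "tea": [0.7, 0.3], "mug": [0.2, 0.9]}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def consistency_penalty(input_emb, def_emb):
    """J_c: sum of distances d(E_w, f_theta(D_w)) over words in V_D ∩ V_K."""
    shared = set(input_emb) & set(def_emb)   # penalty only where both embeddings exist
    return sum(euclidean(input_emb[w], def_emb[w]) for w in shared)

def total_objective(recon_loss, input_emb, def_emb, alpha=1.0, lam=8.0):
    """Complete CPAE objective: alpha * reconstruction + lambda * consistency penalty."""
    return alpha * recon_loss + lam * consistency_penalty(input_emb, def_emb)

print(round(consistency_penalty(input_emb, def_emb), 4))
```

Note that "cup" and "mug" contribute nothing here: each is missing from one of the two embedding tables, mirroring the fact that V_D ≠ V_K.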
In order to study the two terms of the objective in isolation, we introduce two special cases. When λ = 0 and α > 0, the model reduces to AE, for Autoencoder. When α = 0 and λ > 0, we retrieve Hill's model (Hill et al., 2016). 4 Hill's model is simply a recurrent encoder that uses pretrained embeddings as targets, so it only makes sense when we use fixed pretrained embeddings.

Related work

Extracting lexical knowledge from dictionaries
There is a long history of attempts to extract and structure the knowledge contained in dictionaries. Amsler (1980) studies the possibility of automatically building taxonomies out of dictionaries, relying on the syntactic and lexical regularities that definitions display. One relation is particularly straightforward to identify: the "is a" relation, which translates to hypernymy. Dictionary definitions often contain a genus, which is the hypernym of the defined word, as well as a differentia, which differentiates the defined word from its hypernym. For example, the word "hostage" is defined as "a prisoner who is held by one party to insure that another party will meet specified terms", where "prisoner" is the genus and the rest is the differentia.
To extract such relations, early works by Chodorow et al. (1985) and Calzolari (1984) use string matching heuristics. Binot and Jensen (1987) operate at the level of syntactic parses to detect these relations. Whether based on the string representation or the parse tree of a definition, these rule-based systems have helped to create large lexical databases. We aim to reduce the manual labor involved in designing the rules and to obtain representations directly from raw definitions.

Improving word embeddings using lexical resources
Postprocessing methods for word embeddings use lexical resources to improve already trained word embeddings irrespective of how they were obtained. When it is used with fixed pretrained embeddings, our method can be seen as a postprocessing method.
Postprocessing methods typically have two terms trading off the conservation of the distributional information brought by the original vectors against the new information from lexical resources. There are two main ways to preserve distributional information: Attract-Repel (Mrkšić et al., 2017), retrofitting (Faruqui et al., 2014), and our method control the distance between the original vector and the postprocessed vector so that the new vector does not drift too far away from the original; Counter-Fitting (Mrkšić et al., 2016) and dict2vec (Tissier et al., 2017) ensure that the neighbourhood of a vector in the original space is roughly the same as the neighbourhood in the new space.
Finally, methods differ by the nature of the lexical resources they use. To our knowledge, dict2vec is the only technique that uses dictionaries. Other postprocessing methods use various data from WordNet: sets of synonyms and sometimes antonyms, hypernyms, and hyponyms. For instance, Lexical Entailment Attract-Repel (LEAR) uses all of these. Other methods rely on paraphrase databases (Wieting et al., 2016).

Dictionaries and word embeddings
We now turn to the most relevant works that involve dictionaries and word embeddings.
Dict2vec (Tissier et al., 2017) combines the word2vec skip-gram objective (predicting all the words that appear in the context of a target word) with a cost for predicting related words. These related words form either strong pairs or weak pairs with the target word, and strong pairs have a greater influence on the cost. Strong pairs are pairs of words that are neighbours in the original embedding space, as well as pairs of words whose definitions refer to each other. Weak pairs are pairs of words where only one word appears in the definition of the other. Unlike dict2vec, our method can be used either standalone or as a postprocessing method (when used with pretrained embeddings). It also focuses on handling and leveraging the recursivity found in dictionary definitions through the consistency penalty, whereas dict2vec ignores this aspect of the structure of dictionaries.
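The definition-based part of dict2vec's pair extraction can be sketched as below (the neighbourhood-based strong pairs are omitted). The miniature dictionary is invented for illustration.

```python
# Hypothetical mini-dictionary: word -> set of words appearing in its definition.
defs = {
    "coffee": {"beverage", "bean", "cup"},
    "cup":    {"container", "coffee"},
    "tea":    {"beverage", "leaf"},
    "bean":   {"seed"},
}

strong, weak = set(), set()
words = list(defs)
for i, a in enumerate(words):
    for b in words[i + 1:]:
        a_in_b = a in defs[b]          # a appears in b's definition
        b_in_a = b in defs[a]          # b appears in a's definition
        if a_in_b and b_in_a:          # mutual reference -> strong pair
            strong.add((a, b))
        elif a_in_b or b_in_a:         # one-directional reference -> weak pair
            weak.add((a, b))

print(sorted(strong), sorted(weak))
```

Here "coffee" and "cup" mention each other and form a strong pair, while "bean" only appears in the definition of "coffee", yielding a weak pair.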
Besides dict2vec, Hill et al. (2016) train neural language models to predict a pretrained word embedding given a definition. Their goal was to learn a general-purpose sentence encoder useful for downstream tasks. Noraset et al. (2017) propose the task of generating definitions from word embeddings for interpretability purposes. Our model unifies these two approaches into an autoencoder. However, we have a different goal: creating or improving word representations. Their methods assume that pretrained embeddings are available to provide either targets or inputs, whereas our model is unsupervised, and the use of pretrained embeddings is optional. Bahdanau et al. (2017) present a related model that produces embeddings from definitions so as to improve performance on a downstream task. By contrast, our approach is used either stand-alone or as a postprocessing step, to produce general-purpose embeddings at a lesser computational cost. The core novelty is the way we leverage the recursive structure of dictionaries.
Finally, Herbelot and Baroni (2017) also aim at learning word representations from only a few examples. Their method consists of fine-tuning word2vec hyperparameters and can learn in one or several passes, but it is not specifically designed to handle dictionary definitions.

Setup
We experiment on English to benefit from the many evaluation benchmarks available. The dictionary we use is that of WordNet (Fellbaum, 1998). WordNet contains graphs of linguistic relations such as synonymy, antonymy, hyponymy, etc. but also definitions. We emphasize that our method trains exclusively on the definitions and is thus applicable to any electronic dictionary.
However, in order to evaluate the quality of embeddings on unseen definitions, WordNet relations come in handy: we use the sets of synonyms to split the dictionary into a train set and a test set, as explained in Section 7. Moreover, WordNet has wide coverage and high quality, so we do not need to aggregate several dictionaries as done by Tissier et al. (2017). Finally, WordNet is explicitly made available for research purposes, so we avoid the technical and legal difficulties associated with crawling proprietary online dictionaries. We do not include the part-of-speech tags that accompany definitions. WordNet does not contain function words, but it does contain homonyms of function words; we filter these out.

Similarity and relatedness benchmarks
Evaluating the learned representations is a complex issue (Faruqui et al., 2016). Indeed, different evaluation methods yield different rankings of embeddings: there is no single embedding that outperforms others on all tasks (Schnabel et al., 2015) and thus no single best evaluation method.
We focus on intrinsic evaluation methods. In particular, we study how different models trade off similarity and relatedness. We use benchmarks which consist of pairs of words scored according to some criteria. They vary in terms of annotation guidelines, number of annotators, selection of the words, etc. To evaluate our embeddings, we score each pair by computing the cosine similarity between the corresponding word vectors. Then the predicted scores and the ground truth are ranked and the correlation between the ranks is measured by Spearman's ρ. We leave aside analogy prediction benchmarks as they suffer from many problems (Linzen, 2016; Rogers et al., 2017).
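This evaluation protocol is easy to reproduce. The sketch below scores toy word pairs by cosine similarity and computes Spearman's ρ by hand (ignoring tie handling); the embeddings and human scores are fabricated for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the ranks (no tie handling)."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy embeddings and human pair scores (illustrative values only).
emb = {"coffee": [1.0, 0.2], "tea": [0.9, 0.3], "cup": [0.1, 1.0]}
pairs = [("coffee", "tea", 9.0), ("coffee", "cup", 3.0), ("tea", "cup", 2.5)]
predicted = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
gold = [s for _, _, s in pairs]
print(round(spearman(predicted, gold), 3))
```

In practice, benchmark implementations additionally handle ties and report a p-value alongside ρ.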
We adopt one of the methods proposed by Faruqui et al. (2016) and use separate datasets for model selection. We choose as development data the development sets of SimVerb3500 (Gerz et al., 2016) and MEN (Bruni et al., 2014), the only benchmarks with a standard train/test split.
We justified our emphasis on the similarity relation in Section 1: capturing this relation remains a challenge, and we hypothesize that dictionary data should improve representations in that respect. The model selection procedure reflects that we want embeddings specialized in similarity. To that end, we set the validation loss to a weighted mean that weights SimVerb twice as much as MEN.

Baselines
The objective function presented in Section 2 gives us three different models: CPAE, AE, and Hill's model. The objective of CPAE is the sum of the objectives of Hill's model and of AE. We compare the CPAE model to both of these to evaluate the individual contribution of the two terms to the performance. In addition, when we use external corpora to pretrain embeddings, we compare these models to dict2vec and retrofitting. The hyperparameter search is described in Appendix C.
The test benchmarks for the similarity relation include SimLex999 and, more particularly, SimLex333, a challenging subset of SimLex999 which contains only highly related pairs but in which similarity scores vary a lot. For relatedness, we use MEN (Bruni et al., 2014), RG (Rubenstein and Goodenough, 1965), WS353 (Finkelstein et al., 2001), SCWS (Huang et al., 2012), and MTurk (Radinsky et al., 2011; Halawi et al., 2012). The evaluation is carried out by a modified version of the Word Embeddings Benchmarks project. 5 Conveniently, all these benchmarks contain mostly lemmas, so we do not suffer too much from the problem of missing words. 6

Results in the dictionary-only setting
In the first evaluation round, we train models only using a single monolingual dictionary. This allows us to check our hypothesis that dictionaries contain information for capturing the similarity relation between words.
Our baselines are regular distributional models: GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013). They are trained on the concatenation of defined words with their definitions.
Such formatting introduces spurious co-occurrences that do not otherwise appear in free text. These baselines are not designed for dictionaries and cannot exploit their particular structure.
We compare these models to the autoencoder model without (AE) and with (CPAE) the consistency penalty. In this setting, we cannot use Hill's model as it requires pretrained embeddings as targets. We also trained an additional CPAE model with pretrained word2vec embeddings trained on the concatenated definitions. The results are presented in Table 1. GloVe is outperformed by word2vec by a large margin so we ignore this model in later experiments. Word2vec captures more relatedness than CPAE (+10.7 on MEN-t, +13.5 on MT, +13.2 on WS353) but less similarity than CPAE. The difference in the nature of the relations captured is exemplified by the scores on SimLex333. This subset of SimLex999 focuses on pairs of words that are very related but that can be either similar or dissimilar. On this subset, CPAE fares better than word2vec (+13.1).
The consistency penalty improves performance on every dataset. This penalty provides targets to the encoder, but these targets are themselves learned and change during the learning process. The exact dynamics of the system are unknown. It can be seen as a regularizer because it puts strong weight-sharing constraints on both types of embeddings. It also resembles bootstrapping in reinforcement learning, which consists of building estimates of value functions on top of other estimates (Sutton and Barto, 1998).
The last model is the CPAE model that uses word2vec embeddings pretrained on the dictionary data. This combination not only matches the other models on some benchmarks but outperforms them, sometimes by a large margin (+6.3 on SimLex999 and +7.5 on SimVerb3500 compared to CPAE; +6.1 on SCWS and +5.4 on MT compared to word2vec). Thus, the two kinds of algorithms are complementary through the different relationships that each captures best. The pretraining helps in two ways: it provides quality input embeddings and targets to the encoder. The pretrained word2vec targets are already remarkably good, which is why the selected consistency penalty coefficient is very high (λ = 64): the model can pay a small cost and deviate from the targets in order to encode information about the definitions.
To sum up, dictionary data contains much information relevant to modeling the similarity relationship. Autoencoder-based models learn different relationships than regular distributional methods. The consistency penalty is a very helpful prior and regularizer for dictionary data, as it always helps, regardless of which relationship we focus on. Finally, our model can drastically improve embeddings that were trained on the same data but with a different algorithm.

Table 1: Positive effect of the consistency penalty and word2vec pretraining. Spearman's correlation coefficient ρ × 100 on benchmarks. Without pretraining, autoencoders (AE and CPAE) improve on similarity benchmarks while capturing less relatedness than distributional methods. The consistency penalty (CPAE) helps even without pretrained targets. Our method, combined with embeddings pretrained on the same dictionary data (CPAE-P), significantly improves on every benchmark.

Improving pretrained embeddings
We have seen that CPAE with pretraining is very efficient. But does this result generalize to other kinds of pretraining data? To answer this question, we experiment with embeddings pretrained on the first 50 million tokens of a Wikipedia dump, as well as on the entire Wikipedia dump. We compare our method to existing postprocessing methods such as dict2vec and retrofitting, which also aim at improving embeddings with external lexical resources.
Retrofitting, which operates on graphs, is not tailored to dictionary data, which consists of pairs of words along with their definitions. We build a graph where nodes are words and an edge between two nodes indicates that one of the words appears in the definition of the other. Obviously, we lose word order in the process.
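A minimal sketch of this graph construction, on an invented three-entry dictionary:

```python
# Hypothetical dictionary: defined word -> list of words in its definition.
dictionary = {
    "coffee": ["a", "hot", "beverage"],
    "tea": ["a", "beverage", "made", "from", "leaves"],
    "beverage": ["a", "liquid", "for", "drinking"],
}

# Build an undirected graph: an edge links a defined word to each word of its
# definition (word order is lost, as noted above). Edges are stored as sorted
# tuples so that (a, b) and (b, a) are the same edge.
edges = set()
for word, definition in dictionary.items():
    for d in definition:
        if d != word:               # avoid self-loops
            edges.add(tuple(sorted((word, d))))

print(len(edges), ("beverage", "coffee") in edges)
```

The resulting edge set is what a retrofitting implementation would consume in place of its usual synonymy graph.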
The results for the small corpus are presented in Table 2. By comparing Table 2 with Table 1, we see that word2vec does worse on similarity than when trained on dictionary data, but better on relatedness. Both dict2vec and retrofitting improve over word2vec on similarity benchmarks and seem roughly on par. However, dict2vec fails to improve on relatedness benchmarks, whereas retrofitting sometimes improves (RG, MEN, MT), sometimes equals (SCWS), and sometimes does worse (WS353).
We do an ablation study by comparing Hill's model and AE with CPAE. Recall that Hill's model lacks the reconstruction cost while AE lacks the consistency penalty. Firstly, CPAE always improves over AE. Thus, we confirm the results of the previous section on the importance of the consistency penalty. In this setting, it is more obvious why the penalty helps, as it now provides pretrained targets to the encoder. Secondly, CPAE improves over Hill's model on all similarity benchmarks by a large margin (+12.2 on SL999, +13.7 on SL333, +16.1 on SV3500). It is sometimes slightly worse on relatedness benchmarks (−3.3 on MEN-t, −5.6 on MT), other times better or equal. We conclude that both terms of the CPAE objective matter.
We see identical trends when using the full Wikipedia dump. As expected, CPAE can still improve over even higher quality embeddings by roughly the same margins. The results are presented in Appendix D.
Remarkably, the best model among all our experiments is the CPAE of Table 1, which uses only dictionary data. This supports our hypothesis that dictionaries contain similarity-specific information.

Generalisation on unseen definitions
A model that uses definitions to produce word representations is appealing because it could be extremely data-efficient. Unlike regular distributional methods, which iteratively refine their representations as occurrences accumulate, such a model could output a representation in one shot. We now evaluate CPAE in a setting where some definitions are not seen during training.
The dictionary is split into train, validation (for early stopping), and test sets. The algorithm for splitting the dictionary groups words into batches. It ensures two things: firstly, that words which share at least one definition end up in the same batch; secondly, that each word in a batch is associated with all of its definitions. We can then group batches to build the training and test sets such that the test set contains no synonyms of words from the other sets. We sort the batches by the number of distinct definitions they contain and use the largest batch returned by the algorithm as the training set: it contains mostly frequent and polysemous words. The validation and test sets, on the contrary, contain many multiword expressions, proper nouns, and rarer words. More details are given in Appendix B.1.

We train CPAE only on the train split of the dictionary, with randomly initialized input embeddings. Table 3 presents the same correlation coefficients as the previous tables but also distinguishes two subsets of the pairs: the pairs for which all the definitions were seen during training (train) and the pairs for which at least one word was defined in the test set (test). Unfortunately, there are not enough pairs in which both words appear in the test set to compute significant correlations. On small benchmarks, correlation coefficients are sometimes not significant, so we do not report them (when the p-value is greater than 0.01).
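One way to implement such a grouping is with a union-find structure, merging words that share a definition. This is only a guess at the algorithm's mechanics, on invented data; the actual procedure is described in Appendix B.1.

```python
# Hypothetical sense inventory: word -> set of definition ids. Words that
# share a definition id are synonyms and must land in the same batch.
word_defs = {
    "car": {1, 2}, "automobile": {1}, "railcar": {2},
    "tea": {3}, "cup": {4},
}

parent = {w: w for w in word_defs}

def find(w):
    while parent[w] != w:
        parent[w] = parent[parent[w]]   # path compression
        w = parent[w]
    return w

def union(a, b):
    parent[find(a)] = find(b)

# Union all words that share at least one definition id.
by_def = {}
for w, ids in word_defs.items():
    for i in ids:
        by_def.setdefault(i, []).append(w)
for members in by_def.values():
    for w in members[1:]:
        union(members[0], w)

# Collect the resulting batches (connected components).
batches = {}
for w in word_defs:
    batches.setdefault(find(w), set()).add(w)

print(sorted(map(sorted, batches.values())))
```

Here "car", "automobile", and "railcar" end up in one batch because "car" shares a definition with each of the other two, so no synonym pair can straddle the train/test boundary.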
The scores of CPAE on the test pairs are well correlated with the ground truth: except on SimLex999 and SCWS, there is no drop in correlation coefficients between the two sets. The scores of Hill's model follow similar trends but are lower on every benchmark, so we do not report them. This shows that recurrent encoders are able to generalize and produce coherent embeddings as a function of other embeddings in one pass.

Conclusion and future work
We have focused on capturing the similarity relation. It is a challenging task which we have proposed to solve using dictionaries, as definitions seem to encode the relevant kind of information.
We have presented an alternative for learning word embeddings that uses dictionary definitions. As a definition autoencoder, our approach is self-contained, but it can alternatively be used to improve pretrained embeddings, and it includes Hill's model (Hill et al., 2016) as a special case.
In addition, our model leverages the inherent recursivity of dictionaries via a consistency penalty, which yields significant improvements over the vanilla autoencoder.
Our method outperforms dict2vec and retrofitting on similarity benchmarks by a quite large margin. Unlike dict2vec, our method can be used as a postprocessing method that does not require going through the original pretraining corpus; it has fewer hyperparameters, and it generalizes to new words.
We see several directions for future work. Firstly, more work is needed to evaluate the representations on other languages and tasks.
Secondly, solving downstream tasks requires representations for the inflected words as well. We have set aside this issue by focusing on benchmarks involving lemmas. To address it in future work, we might want to split word representations into a lexical and a morphological part. With such a split representation, we could postprocess only the lexical component, and all the words, whether inflected or not, would benefit from this. This seems desirable for postprocessing methods in general and would make them more suitable for synthetic languages.
Thirdly, dictionaries define every sense of a word, so we could produce one embedding per sense (Chen et al., 2014; Iacobacci et al., 2015). This would require potentially complicated modifications to our model, as we would need to disambiguate senses inside each definition. However, some classes of words might benefit a lot from such representations, for example words that can be used as different parts of speech.
Lastly, a more speculative direction could be to study iterative constructions of the set of embeddings. As our algorithm can generalize in one shot, we could start the training with a small set of words and their definitions and iteratively broaden the vocabulary and refine the representations without retraining the model. This could be useful in discovering a set of semantic primes from which one can define all the other words (Wierzbicka, 1996).