Embedding Structured Dictionary Entries

Previous work has shown how to effectively use external resources such as dictionaries to improve English-language word embeddings, either by manipulating the training process or by applying post-hoc adjustments to the embedding space. We experiment with a multi-task learning approach for explicitly incorporating the structured elements of dictionary entries, such as user-assigned tags and usage examples, when learning embeddings for dictionary headwords. Our work generalizes several existing models for learning word embeddings from dictionaries. However, we find that the most effective representations overall are learned by simply training with a skip-gram objective over the concatenated text of all entries in the dictionary, giving no particular focus to the structure of the entries.


Introduction
While word embedding models are typically trained using large text corpora with objectives based on distributional semantics, recent work has shown how to take advantage of external resources like WordNet (Miller, 1995) and other manually created dictionaries in order to better capture word-level semantic relationships of interest. For example, previous work has used the graph structure of external resources to post-process pre-trained word embeddings, enforcing that the similarity between embeddings reflects the similarity inferred from the graph structure of lexicons like WordNet (Faruqui et al., 2015). Following a similar principle, others use known synonymy and antonymy relationships between words to adjust the distance between word embeddings (Mrkšić et al., 2016). Other work uses traditional dictionaries to improve the overall coverage of word embedding models by creating embeddings for rare words, leveraging information from their definitions (Bahdanau et al., 2017).
While dictionaries have been shown to be useful, most previous work has focused only on using the text of the definitions in order to learn word representations. However, many dictionaries include additional structural elements such as usage examples, quotations containing the headword, tags, labels, and more. For some online crowd-built dictionaries, information such as the contributing users and even upvotes and downvotes are available.
We conjecture that such meta information may prove useful and, therefore, we seek to leverage all of this additional information to build improved representations of the words defined in a given dictionary. To do this, we generalize the Consistency-Penalized Autoencoder (CPAE) (Bosc and Vincent, 2018) to allow for not only the reconstruction of dictionary definitions, but also for making predictions about the other structural elements available, such as usage examples and user-assigned tags.
We make the following contributions in this paper: (1) we propose a flexible, multi-task learning extension to the CPAE model that can be used to produce embeddings from structured dictionary entries, (2) we evaluate the applicability of this extended model to three English-language dictionary datasets, each with its own unique characteristics and sets of structural elements, and (3) we demonstrate that a simple baseline approach for learning word embeddings, based on the popular skip-gram with negative sampling framework, can often lead to representations that better capture word-level semantic similarity according to a range of commonly used evaluation tasks.

Structured Dictionary Data
We consider three manually constructed, machine-readable, English-language dictionaries: English WordNet 1 (Miller, 1995), English Wiktionary 2 , and Urban Dictionary (UD) 3 , each containing definitions for each word in addition to one or more structural elements such as usage examples, tags, or votes (Table 1). We find that many of the terms defined in Urban Dictionary are not commonly used in everyday language, so we further filter the set of Urban Dictionary headwords to those used at least 10,000 times in a sample of tweets collected over a five-year period, as identified in (Wilson et al., 2020b).
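The headword filtering described above can be sketched as follows. This is a minimal illustration; the entry fields, the `tweet_counts` mapping, and the example data are hypothetical stand-ins for the Twitter sample of (Wilson et al., 2020b).

```python
# Keep only Urban Dictionary headwords observed at least MIN_TWEET_COUNT
# times in a sample of tweets. Field names and counts are illustrative.
MIN_TWEET_COUNT = 10_000

def filter_headwords(entries, tweet_counts, min_count=MIN_TWEET_COUNT):
    """Keep entries whose headword is common enough in the tweet sample."""
    return [e for e in entries if tweet_counts.get(e["headword"], 0) >= min_count]

entries = [
    {"headword": "lol", "definition": "laughing out loud"},
    {"headword": "zqx", "definition": "a made-up rare term"},
]
tweet_counts = {"lol": 2_500_000, "zqx": 3}
print([e["headword"] for e in filter_headwords(entries, tweet_counts)])
```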

Baseline Approach
To provide a simple baseline for later evaluation, we train word embeddings using the entire text of each dictionary, including all structured elements, by treating each structural element as a short document and prepending the entry headword to each. We use a standard skip-gram model with negative sampling (SGNS), trained using the FastText library (Mikolov et al., 2018).
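The corpus construction for this baseline can be sketched as below: each structural element of an entry becomes a short document with the headword prepended. The field names are hypothetical; the resulting lines can be written to a plain-text file and passed to any SGNS implementation (the paper uses the FastText library).

```python
# Treat each structural element (definition, usage example, tags, ...) as a
# short document and prepend the entry headword to it. Field names here are
# assumptions for illustration, not the paper's actual data schema.
def entry_to_documents(entry):
    docs = []
    for field in ("definition", "example", "tags"):
        text = entry.get(field)
        if text:
            docs.append(f"{entry['headword']} {text}")
    return docs

entry = {
    "headword": "stan",
    "definition": "an overly enthusiastic fan",
    "example": "she stans that band",
    "tags": "fan music slang",
}
for doc in entry_to_documents(entry):
    print(doc)
```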

Auto-encoding Structured Entries with Multi-task Learning

Next, we present an approach for learning word embeddings that implicitly encode a wide range of the elements that are present in a dictionary entry. Given a word defined in a dictionary, the objective of the model is to accurately recover as much structural information as possible, including the word's definition, usage examples, tags, and authors. We also leverage user-provided votes as a means of sorting and filtering the dictionary entries. The model takes a word's definition as input, and learns a transformation from the words in the definition to an embedding that contains features describing the structural elements of the dictionary entry for the word. We treat the prediction of each type of structural element as a separate task within a multi-task learning framework.

Model Architecture
Our model (Figure 1; a more formal, detailed description of the model is given in Appendix A) can be seen as a generalization of several others: a simple auto-encoder, Hill's model (Hill et al., 2016), and the consistency-penalized auto-encoder (CPAE). In each case, the input for the model is a definition 4 for the target headword, w_h. The input tokens are converted into a sequence of embeddings using a learnable word embedding layer, and these embeddings are passed to the definition encoder, which produces a single embedding, h, which is used as the representation for w_h. This embedding is then fed to any number of decoders, each with its own specific objective and loss function (details in the subsections of Appendix A). The goal of each decoder's loss is to influence the weights of the encoder to produce an embedding h that is most useful for capturing a specific structural element of the dictionary entry for w_h, or to retain some other important property of the embedding h. The decoders that we use and their associated losses become components in the overall loss function for our model: L = λ_0 L_0 + λ_1 L_1 + ... + λ_m L_m for up to m objectives, each with its own associated weight term. These weights can be used to control the overall influence of each objective in the final loss computation.

Figure 1: Model architecture for the multi-task learning autoencoder for embedding words from their structured dictionary entries. Input tokens are embedded using the Input Embeddings layer, and the n tokens in the definition of headword w_h are passed to the Definition Encoder to produce the definition embedding h. This embedding should be consistent (low distance) with the embedding of the definition headword, e_h. M possible output tasks can be used, each with its own decoder which must reconstruct its Target.

Table 1: Structural elements present in three machine-readable dictionaries, and the number of headwords, definitions, and total tokens present in each. UD (Filtered) is the filtered version of Urban Dictionary, which excludes words that are not commonly used as well as definitions for which the difference between the number of upvotes and downvotes is negative. This is the version of Urban Dictionary used when training our proposed model.

1 To make our results directly comparable with (Bosc and Vincent, 2018), we use the filtered version of WordNet included at: https://github.com/tombosc/cpae
2 https://en.wiktionary.org/
3 https://www.urbandictionary.com/
The target of each decoder is dependent on the structural element that it is meant to encode. For the definitions, the goal of the decoder is to reproduce the definition itself (making the use of this task alone equivalent to a simple autoencoder). For the usage examples and tags, the target task is to predict the context in which the headword appears using a skip-gram learning objective. We also experiment with using the user-provided votes to filter and sort the data, as well as to provide weights for the input definitions.
An additional loss term can be used in order to enforce consistency between the learned embedding h and the input embedding for the headword, e_h. This is similar to the main objective of Hill's model (Hill et al., 2016) and is the consistency penalty used in the CPAE model (Bosc and Vincent, 2018). This forces the model to produce embeddings for headwords that are consistent with the embeddings produced for the same words when they appear in the definitions of other headwords.

Evaluation and Results
We evaluate all produced embeddings 5 across a range of intrinsic evaluation tasks as used in (Jastrzebski et al., 2017). 6 For these word-level semantic similarity tasks, the machine-generated scores (cosine similarities between the produced word embeddings) are compared against human-labeled similarity scores by computing the correlation between the two sets of scores.
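This evaluation protocol can be sketched as follows: score each word pair by the cosine similarity of its embeddings, then correlate the model's scores with the human judgments. The toy embeddings and human scores below are invented for illustration, and a simple rank correlation (Spearman, without tie correction) stands in for whichever correlation measure a given benchmark uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction; fine for illustration)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy embeddings and hypothetical human scores for three word pairs.
emb = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
human = [9.0, 1.0, 1.5]
model = [cosine(emb[a], emb[b]) for a, b in pairs]
print(spearman(model, human))  # rankings agree perfectly here, so 1.0
```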
The tasks include the Marco, Elia and Nam (MEN) annotated word pairs, based on image captioning data (Bruni et al., 2014), and the SimVerb (SV) verb similarity dataset (Gerz et al., 2016), both of which have standardized development and testing splits. We use the development splits of these datasets to tune our models. The WordSim-353 (WS) dataset contains both similarity (WS-S) and relatedness (WS-R) annotations for the same sets of words, allowing us to examine the ability of our models to capture each of these semantic relations. We also evaluate using the SimLex-999 dataset and a subset of that data, SimLex-333 (SL999 and SL333).

For models that use our proposed architecture, we initialize the input embeddings using the baseline pre-trained skip-gram embeddings. We train these embeddings ourselves in the case of WordNet and Wiktionary, and use the ud-basic embeddings released by (Wilson et al., 2020a) for Urban Dictionary. 7

Table 2 shows the similarity and relatedness scores achieved when using various combinations of objectives in our model. 8 We observe that for WordNet, the simple SGNS embeddings are always outperformed by the other approaches, which is in line with the results reported in (Bosc and Vincent, 2018), where the CPAE-P model was found to achieve the best results when using WordNet. We can see that adding structure, which for WordNet includes only usage examples, leads to an improvement over the base CPAE-P model in many cases. The overall trend is similar for the Wiktionary data, yet we see a stronger performance from the SGNS baseline. In fact, SGNS achieves the best results for two of the test datasets and achieves competitive results across the board, making it a viable alternative to the more complex dictionary auto-encoding approaches.
Finally, for the Urban Dictionary data, we see the baseline SGNS approach overtaking the other methods on almost every evaluation set, also leading to many of the best overall scores found in this study. This shift in performance may be related to the overall size of each dataset: the Urban Dictionary dataset contains approximately 200 million total tokens, compared to 1.7 million in WordNet and 4.6 million in English Wiktionary. Further, as Urban Dictionary's definitions contain a mixture of noisy submissions, jokes, and opinions, they are likely to be less closely tied to the true meanings of the headwords (Nguyen et al., 2018). This could make the auto-encoding objective less useful overall in comparison to learning representations of the words simply based on their usage contexts.

Conclusions
We show that extending the CPAE model to include additional structural elements can provide some gains on word-level semantic similarity tasks; however, the extra complexity of this approach is unnecessary for learning useful word embeddings and, in many cases, leads to degraded scores across a range of standard word embedding evaluation metrics in comparison to simpler approaches. To build general-purpose word embeddings from a sufficiently large dictionary (i.e., containing at least several hundred million tokens of text), our recommendation is to simply concatenate all of the structural elements together as a single text, inserting the entry headword between each element, and apply the widely popular skip-gram architecture to this text to learn traditional distributional embeddings. This approach requires only a single learning objective, trains in much less time, and achieves competitive results in many cases, making it an easier alternative to explicitly leveraging structural information from dictionary entries while still creating useful embeddings.
Future work should explore how these approaches would work when applied to more English dictionaries such as the Oxford English Dictionary 9 in order to better understand the effects of using a more standardized dictionary to learn embeddings. Further, dictionaries in other languages, particularly lower-resource languages, should be considered, since our results suggest that the approaches described in this paper outperform the baseline approach mostly in settings where the total amount of text in the dictionary is small.

Appendix A Detailed Model Description
Formally, let D = {w_in^0, w_in^1, ..., w_in^n} be the sequence of tokens in the definition for w_h. The elements of D belong to the vocabulary of all words that appear in definitions, V_in, and w_h belongs to the vocabulary of all headwords, V_h. In the case of polysemous words, which have more than one meaning, we concatenate the tokens from all definitions together into a single sequence, separated by a special SEP token.
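The handling of polysemous headwords described above can be sketched as a small preprocessing step. The `<SEP>` string and whitespace tokenization are illustrative choices, not the paper's exact implementation.

```python
# Concatenate the token sequences of all definitions for one headword,
# inserting a special SEP token between consecutive definitions.
SEP = "<SEP>"

def build_input_sequence(definitions):
    """Return the single token sequence D for a (possibly polysemous) headword."""
    tokens = []
    for i, definition in enumerate(definitions):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(definition.split())
    return tokens

defs = ["a financial institution", "the side of a river"]
print(build_input_sequence(defs))
```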
Given the full sequence of input words, E_in(D) = {e_in^0, e_in^1, ..., e_in^n} is the set of d_in-dimensional embeddings representing the words in the definition. These embeddings can be learned during training, or pre-initialized and frozen, as discussed later in this section. The embeddings are passed into an encoder layer in order to produce a single d_h-dimensional embedding h = enc(E_in(D)). The encoder can be any type of model that takes a variable-length sequence of embeddings as input and produces a single, fixed-length embedding as output.
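A minimal sketch of this encoder interface is given below, using mean pooling as a stand-in for the recurrent encoder the paper actually uses; any function with the same signature would satisfy the interface.

```python
# Map a variable-length sequence of d_in-dimensional embeddings to one
# fixed-length vector h. Mean pooling is only an illustrative stand-in for
# the paper's recurrent encoder.
def mean_pool_encoder(embeddings):
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool_encoder(seq))  # [3.0, 4.0]
```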
This embedding is then fed to any number of decoders, each with its own specific objective and loss function. The goal of each decoder's loss is to influence the weights of the encoder to produce an embedding h that is most useful for capturing a specific structural element of the dictionary entry for w_h, or to retain some other important property of the embedding h. In the following subsections, we describe the decoders that we use and their associated losses, which become components in the overall loss function for our model: L = λ_0 L_0 + λ_1 L_1 + ... + λ_n L_n for up to n objectives, each with its own associated weight term. These weights can be used to control the overall influence of each objective in the final loss computation.
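The weighted combination above can be sketched directly; the task names and numeric values below are hypothetical, and a weight of zero (or an absent weight) simply disables a task.

```python
# Overall multi-task loss: L = sum_i lambda_i * L_i, where each lambda_i
# controls the influence of one decoder's loss. Task names are illustrative.
def total_loss(task_losses, weights):
    return sum(weights.get(task, 0.0) * loss for task, loss in task_losses.items())

losses = {"reconstruction": 2.0, "usage": 1.5, "tags": 0.5, "consistency": 4.0}
weights = {"reconstruction": 1.0, "usage": 0.5, "consistency": 8.0}  # tags disabled
print(total_loss(losses, weights))  # 2.0 + 0.75 + 0.0 + 32.0 = 34.75
```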

A.1 Definitions as reconstruction targets
The words in a well-formed definition should provide a precise encapsulation of one of the meanings of the headword being defined. So, we expect that a combination of the meanings of the words in the definition should provide a reasonable approximation for the meaning of the word itself. Since the input to our encoder is the set of embeddings of the definition words, a decoder objective based on the intermediate representation, h, will lead to a simple auto-encoder for the definition itself.
The definition decoder with learned parameters θ produces a set of predictions of the words belonging to the original definition, D̂ = dec_θ(h), and this decoder is used to compute the definition reconstruction loss L_R. We use a simple conditional unigram language modeling loss as our reconstruction loss, where p(w|θ) is determined by the decoder dec_θ. For the decoder, the auto-encoder model uses a single linear layer with input size d_h and output size |V_def|, followed by a softmax operation, providing a probability p(w) for all words in the output vocabulary V_def. The output vocabulary V_def is equal to V_in in the traditional auto-encoder setting, since the objective is to reproduce the set of input words. However, in practice, we can speed up computation with minimal impact on performance by reducing V_def to contain only the m_def most common words, treating all others as out-of-vocabulary. The out-of-vocabulary words are represented by a single token, UNK, which is ignored for the purposes of the loss computation. Including only this objective (which can be achieved by setting λ_t = 0 for every other task t) is equivalent to a simple definition auto-encoder: given the words in the headword's definition, produce an intermediate embedding h which can then be used to reconstruct the original set of words from the definition.
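The loss computation above can be sketched as follows, assuming the decoder's softmax output is already available as a probability table. Averaging over the in-vocabulary target words is an illustrative choice; the key details from the text are the negative log-probability per word and the exclusion of UNK.

```python
import math

UNK = "<UNK>"

def reconstruction_loss(pred_probs, target_words, vocab):
    """Conditional unigram reconstruction loss: mean negative log-probability
    of each definition word under the decoder's distribution. Words outside
    the truncated output vocabulary map to UNK and are ignored."""
    losses = []
    for w in target_words:
        if w not in vocab:
            continue  # UNK words are excluded from the loss
        losses.append(-math.log(pred_probs[w]))
    return sum(losses) / len(losses)

vocab = {"a", "financial", "institution"}
pred = {"a": 0.5, "financial": 0.25, "institution": 0.25}
loss = reconstruction_loss(pred, ["a", "financial", "institution", "rareword"], vocab)
print(round(loss, 4))
```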

A.2 Usage examples and tags as context
While widely used distributional word embeddings rely on examples of words in context in order to learn representations of those words, hundreds of usage examples per word are usually required in order to build stable representations (Burdick et al., 2018). We experiment with using only the few prototypical examples provided in the dictionary entries themselves as training samples for the term. This has several advantages: first, no data outside of the dictionary itself is needed to train the embeddings, and second, usage examples should, by nature, be written in a way that emphasizes a specific meaning of the term, providing a potentially stronger semantic signal than randomly sampled occurrences of a term in a text corpus. Usage contexts may help to capture aspects of meaning that correspond to general semantic relatedness between words. Similarly, tags provide high-level category information related to words, and we expect that words with similar sets of tags will be related in meaning.
To incorporate this information into our model, we use a skip-gram language modeling objective similar to the one used by the word2vec model for learning word embeddings from word-in-context samples. That is, given the embedding for a word, h, we train a new feedforward output layer to predict the set of words that appear in the usage example context around the target word, or, in the case of tags, the output layer should predict all of the word's tags. In the case of the usage examples, we replace the word and its morphological variations with a special MASK token so that the model does not learn to simply predict the word itself. Then, we define new vocabularies V_use and V_tag for all words that appear in usage examples in the dictionary and all tags, respectively, and we train linear layers to predict the set of usage words and tags given h. The loss L_use is then the cross-entropy between the predicted distribution over V_use and the equally sized vector of counts representing the number of times each word actually appeared in a usage example, and the same is done with the tag distribution to compute the tag prediction loss, L_tag. As with the definition decoder, we allow for the size of the output vocabulary to be restricted to the most common m_use/m_tag words.

A.3 Consistency between embeddings
The consistency-penalized auto-encoder model (CPAE) adds an additional constraint, based on Hill's model (Hill et al., 2016), to minimize the distance between the input embedding e_h = E_in(w_h) and the learned encoder embedding h. To achieve this, the squared Euclidean distance between the two embeddings is minimized as an additional component of the loss, the consistency penalty: L_C = ||h − e_h||^2, which can only be computed for the set of words that are both defined (headwords) and used within definitions of other words, i.e., V_h ∩ V_in. When setting λ_t = 0 for all other tasks t, we approximately recover Hill's model (Hill et al., 2016). It was previously shown (Bosc and Vincent, 2018) that initializing the weights of the input embeddings E_in with pre-trained word embeddings, paired with this type of consistency constraint, can lead to improved performance on a number of word relatedness tasks (we label this setting CPAE-P).
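The consistency penalty itself is a one-liner; the vectors below are toy values for illustration.

```python
# Consistency penalty: squared Euclidean distance between the encoder output
# h and the headword's input embedding e_h. Only computable for headwords
# that also occur inside the definitions of other words.
def consistency_penalty(h, e_h):
    return sum((a - b) ** 2 for a, b in zip(h, e_h))

h = [1.0, 2.0, 3.0]
e_h = [1.0, 0.0, 2.0]
print(consistency_penalty(h, e_h))  # 0^2 + 2^2 + 1^2 = 5.0
```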

A.4 Votes as signals of importance
User-provided information can be used in several ways in our method. In our current setup, there may often be too many entries for a given headword to adequately focus on all of them at once using our models, which rely on a recurrent encoder over the concatenation of all tokens in all definitions. In Urban Dictionary, we can rely on the signal of user-provided votes, which are applied at the entry level. This information can help sort the set of entries by importance: when training on our concatenated lists of definitions, entries, and tags, we try sorting 10 them by their net number of votes (upvotes − downvotes) so that the top-scoring entries are processed by the model first, giving them priority over the other entries. We also remove any entries that received negative net votes from the concatenated list of entries. Empirically, we found that using the voting information in this way resulted in either a minor improvement or no change in the results, and so all results presented reflect the use of votes as signals of importance where votes are available.

Table 3 shows the full set of results across all three dictionaries using the same evaluation tasks as before. AE/Autoencoder is the simple autoencoder model in which the loss consists only of the definition reconstruction penalty. CPAE is the Consistency-Penalized Autoencoder (Bosc and Vincent, 2018), which is the same as the AE model with the addition of the consistency penalty. Model names ending with "-P" use pre-trained embeddings (the same used for the SGNS baseline) to initialize the input embedding layer of the model. Hill's model (Hill et al., 2016) uses only the consistency penalty and always uses pre-trained embeddings to initialize the input embedding layer. SGNS is the skip-gram with negative sampling baseline, and "+Structure" is the same as the previous row, but using our multi-task learning framework to train the model to use the structural elements available in the dictionary.
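The vote-based ordering described earlier in this section can be sketched as follows; the entry fields and vote counts are hypothetical.

```python
# Drop entries with negative net votes, then sort the remainder by net votes
# (descending) so the highest-scoring entries are processed by the model first.
def order_by_votes(entries):
    kept = [e for e in entries if e["up"] - e["down"] >= 0]
    return sorted(kept, key=lambda e: e["up"] - e["down"], reverse=True)

entries = [
    {"text": "def A", "up": 10, "down": 2},   # net +8
    {"text": "def B", "up": 1, "down": 5},    # net -4, removed
    {"text": "def C", "up": 50, "down": 10},  # net +40
]
print([e["text"] for e in order_by_votes(entries)])  # ['def C', 'def A']
```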
For the Urban Dictionary data, for models that use pre-trained embeddings to initialize the input layer, "Full" indicates that those embeddings