Conditional Generators of Words Definitions

We explore the recently introduced definition modeling technique, which provides a tool for evaluating different distributed vector representations of words by modeling their dictionary definitions. In this work, we study the problem of word ambiguity in definition modeling and propose a possible solution that employs latent variable modeling and soft attention mechanisms. Our quantitative and qualitative evaluation and analysis of the model show that taking into account words' ambiguity and polysemy leads to improved performance.


Introduction
Continuous representations of words are used in many natural language processing (NLP) applications. Using pre-trained high-quality word embeddings is most effective when millions of training examples are not available, which is the case for most NLP tasks (Kumar et al., 2016; Karpathy and Fei-Fei, 2015). Recently, several unsupervised methods were introduced to learn word vectors from large text corpora (Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2016). Learned vector representations have been shown to have useful and interesting properties. For example, Mikolov et al. (2013) showed that vector operations such as subtraction or addition reflect semantic relations between words. Despite these properties, it is hard to evaluate embeddings precisely, because analogy and word similarity tasks measure the learned information only indirectly.
Quite recently, Noraset et al. (2017) introduced a more direct way to evaluate word embeddings. The authors suggested using definition modeling as the evaluation task. In definition modeling, vector representations of words are used for conditional generation of the corresponding word definitions. The primary motivation is that a high-quality word embedding should contain all the information needed to reconstruct the definition. An important drawback of the definition models of Noraset et al. (2017) is that they cannot take into account words with several different meanings. This problem is related to word sense disambiguation, a common task in natural language processing. Examples are polysemantic words such as "bank" or "spring", whose meanings can only be disambiguated using their contexts. In such cases, the proposed models tend to generate definitions based on the most frequent meaning of the corresponding word. Therefore, building models that incorporate word sense disambiguation is an important research direction in natural language processing.
In this work, we study the problem of word ambiguity in the definition modeling task. We propose several models as possible solutions to it. The first is based on the recently proposed Adaptive Skip-gram model (Bartunov et al., 2016), a generalized version of the original Skip-gram Word2Vec that can distinguish word meanings using word context. The second is an attention-based model that uses the context of the word being defined to determine the components of the embedding that refer to the relevant word meaning. Our contributions are as follows: (1) we introduce two models based on recurrent neural network (RNN) language models, (2) we collect a new dataset of definitions, which is larger in number of unique words than the one proposed in Noraset et al. (2017).
Several works utilize word definitions to learn embeddings. For example, Hill et al. (2016) use definitions to construct sentence embeddings. The authors propose to train a recurrent neural network that produces an embedding of the dictionary definition close to the embedding of the corresponding word. The model is evaluated on the reverse dictionary task. Bahdanau et al. (2017) suggest using definitions to compute embeddings for out-of-vocabulary words. In contrast to the work of Hill et al. (2016), the dictionary reader network is trained end-to-end for a specific task.

Definition Modeling
Definition modeling was introduced by Noraset et al. (2017). The goal of the definition model $p(D \mid w^*)$ is to predict the probability of the words in the definition $D = \{w_1, \dots, w_T\}$ given the word being defined $w^*$. The joint probability is decomposed into separate conditional probabilities, each of which is modeled by a recurrent neural network with a soft-max activation applied to its logits:
$$p(D \mid w^*) = \prod_{t=1}^{T} p(w_t \mid w_{i<t}, w^*) \tag{1}$$
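As a minimal sketch of this decomposition, the snippet below sums per-step conditional log-probabilities to obtain the log-probability of a whole definition and converts it into a per-word perplexity. The function names and the example probabilities are illustrative assumptions and do not come from the original model.

```python
import math

def definition_log_prob(step_probs):
    """Log-probability of a definition from its per-step conditionals.

    step_probs[t] is assumed to be p(w_t | w_{i<t}, w*), for example the
    soft-max output of an RNN for the word that actually occurs at step t.
    """
    return sum(math.log(p) for p in step_probs)

def perplexity(step_probs):
    """Per-word perplexity: exp of the negative mean log-probability."""
    return math.exp(-definition_log_prob(step_probs) / len(step_probs))

# Hypothetical conditionals for a three-word definition.
probs = [0.5, 0.25, 0.5]
joint = math.exp(definition_log_prob(probs))  # equals the product 0.5 * 0.25 * 0.5
```

Working in log space avoids numerical underflow for long definitions, which is why the product is never formed directly.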
The authors of definition modeling consider the following conditional models and their combinations: Seed (S), which provides the word being defined at the first step of the RNN; Input (I), which concatenates the embedding of the word being defined with the embedding of the word at the corresponding time step of the RNN; and Gated (G), which is a modification of the GRU cell. The authors also use a character-level convolutional neural network (CNN) to provide character-level information about the word being defined; this feature vector is denoted (CH). One more type of conditioning, referred to as (HE), is hypernym relations between words, extracted using Hearst-like patterns.

Word Embeddings
Many natural language processing applications treat words as atomic units and represent them as continuous vectors for further use in machine learning models. Therefore, learning high-quality vector representations is an important task.

Skip-gram
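One of the most popular and frequently used vector representations is the Skip-gram model. The original Skip-gram model consists of grouped word prediction tasks. Each task is formulated as the prediction of a word $v$ given a word $w$ using their input and output representations:
$$p(v \mid w, \theta) = \frac{\exp(\mathrm{in}_w^\top \mathrm{out}_v)}{\sum_{v'=1}^{V} \exp(\mathrm{in}_w^\top \mathrm{out}_{v'})} \tag{2}$$
where $\theta$ and $V$ stand for the set of input and output word representations and the dictionary size, respectively. These individual prediction tasks are grouped so as to independently predict all adjacent words (within some sliding window) $y = \{y_1, \dots, y_C\}$ given the central word $x$:
$$p(y \mid x, \theta) = \prod_{j=1}^{C} p(y_j \mid x, \theta) \tag{3}$$
The joint probability of the model is written as follows:
$$p(Y \mid X, \theta) = \prod_{i=1}^{N} p(y_i \mid x_i, \theta) \tag{4}$$
where $(X, Y) = \{x_i, y_i\}_{i=1}^{N}$ are training pairs of words and corresponding contexts and $\theta$ stands for the trainable parameters.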
The optimization of the original Skip-gram objective can also be replaced by a negative sampling procedure, as described in the original paper, or a hierarchical soft-max prediction model (Mnih and Hinton, 2009) can be used instead of (2) to avoid the computational cost of the denominator. After training, the input representations are treated as word vectors.
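The following sketch illustrates the full soft-max prediction of Skip-gram and the independent prediction of all words in a window. It assumes that the score of a pair of words is the dot product of the input vector of the central word and the output vector of the predicted word; the toy vectors and function names are made up for illustration.

```python
import math

def skipgram_softmax(in_w, out_vectors):
    """Return p(v | w) for every vocabulary word v.

    in_w        -- input representation of the central word w
    out_vectors -- output representations, one vector per vocabulary word
    """
    scores = [sum(a * b for a, b in zip(in_w, out_v)) for out_v in out_vectors]
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)  # the denominator has one term per vocabulary word
    return [e / total for e in exps]

def window_log_prob(in_x, out_vectors, context_ids):
    """Log-probability of all context words, predicted independently."""
    probs = skipgram_softmax(in_x, out_vectors)
    return sum(math.log(probs[j]) for j in context_ids)

# Toy vocabulary of three words with 2-dimensional vectors (made-up numbers).
out_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = skipgram_softmax([0.5, -0.5], out_vectors)
```

The sum in the denominator grows linearly with the vocabulary size, which is the cost that negative sampling and hierarchical soft-max are meant to avoid.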

Adaptive Skip-gram
The Skip-gram model maintains only one vector representation per word, which leads to mixing of meanings for polysemantic words. Bartunov et al. (2016) propose a solution to this problem using latent variable modeling. They extend Skip-gram to Adaptive Skip-gram (AdaGram) so that it automatically learns the required number of vector representations for each word using a Bayesian nonparametric approach. In contrast to Skip-gram, AdaGram assumes several meanings for each word and therefore keeps several vector representations for each word. They introduce a latent variable $z$ that encodes the index of the meaning and extend (2) to $p(v \mid z, w, \theta)$. They use the hierarchical soft-max approach rather than negative sampling to avoid computing the denominator:
$$p(v \mid z = k, w, \theta) = \prod_{n \in \mathrm{path}(v)} \sigma\left(\mathrm{ch}(n)\, \mathrm{in}_{wk}^\top \mathrm{out}_n\right) \tag{5}$$
Here $\sigma$ is the logistic sigmoid function, $\mathrm{in}_{wk}$ stands for the input representation of word $w$ with meaning index $k$, and the output representations are associated with the nodes of a binary tree whose leaves are all the words in the model vocabulary, each with a unique path from the root to the corresponding leaf. $\mathrm{ch}(n)$ is a function that assigns 1 or -1 to each node in $\mathrm{path}(\cdot)$ depending on whether $n$ is a left or a right child of the previous node in the path. A Huffman tree is often used for computational efficiency.
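A hierarchical soft-max prediction of this kind can be sketched as a product of sigmoids along the path to a leaf. The sketch assumes that each step on the path contributes the sigmoid of the signed dot product between the meaning-specific input vector and the output vector of the node where the left/right decision is made; the one-node tree and its numbers are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(in_wk, path):
    """Probability of reaching one leaf of the binary tree.

    in_wk -- input representation of word w with meaning index k
    path  -- list of (ch, out_n) pairs along the way to the leaf, where ch
             is +1 or -1 (left or right turn) and out_n is the output
             vector of the node at which the turn is taken
    """
    prob = 1.0
    for ch, out_n in path:
        score = sum(a * b for a, b in zip(in_wk, out_n))
        prob *= sigmoid(ch * score)
    return prob

# Toy tree with a single inner node (the root) and two leaves.
root = [0.3, -0.2]
in_wk = [1.0, 2.0]
p_left = path_probability(in_wk, [(+1, root)])
p_right = path_probability(in_wk, [(-1, root)])
```

Because sigmoid(x) + sigmoid(-x) = 1 at every node, the leaf probabilities sum to one without any explicit normalization, and each prediction costs only as many dot products as the path is long.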
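To automatically determine the number of meanings for each word, the authors use the constructive definition of the Dirichlet process via the stick-breaking representation ($p(z = k \mid w, \beta)$), which is a commonly used prior distribution on discrete latent variables when the number of possible values is unknown (e.g. infinite mixtures):
$$p(z = k \mid w, \beta) = \beta_{wk} \prod_{r=1}^{k-1} (1 - \beta_{wr}), \qquad p(\beta_{wk} \mid \alpha) = \mathrm{Beta}(\beta_{wk} \mid 1, \alpha) \tag{6}$$
where $\beta_{wk}$ are the stick-breaking proportions for word $w$ and $\alpha$ is the concentration parameter.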
This model assumes that an infinite number of meanings may exist for each word. Given that we have a finite amount of data, it can be shown that only several meanings for each word will have non-zero prior probabilities.
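The stick-breaking construction can be illustrated with a short sketch. It assumes the standard form in which meaning k receives a fraction of whatever part of a unit-length stick is left after the first k - 1 breaks, with fractions drawn from a Beta(1, alpha) distribution; the function names, the truncation level and the value of alpha are illustrative choices.

```python
import random

def stick_breaking_probs(betas):
    """Prior probabilities of meanings 1..len(betas) and the leftover mass.

    betas[k-1] is the fraction of the remaining stick given to meaning k.
    """
    probs = []
    remaining = 1.0  # length of the stick not yet assigned
    for beta in betas:
        probs.append(beta * remaining)
        remaining *= 1.0 - beta
    return probs, remaining

def sample_betas(alpha, n, rng):
    """Draw n stick fractions from Beta(1, alpha), truncating the process."""
    return [rng.betavariate(1.0, alpha) for _ in range(n)]

# Deterministic example: each break takes half of what is left.
probs, remaining = stick_breaking_probs([0.5, 0.5, 0.5])

# Random example with a hypothetical concentration parameter.
rng = random.Random(0)
sampled_probs, sampled_rest = stick_breaking_probs(sample_betas(0.1, 5, rng))
```

A small alpha makes the first fractions close to one, so most of the prior mass falls on a few meanings; a larger alpha spreads the mass over more of them.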
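Finally, the joint probability of all variables in the AdaGram model has the following form:
$$p(Y, Z, \beta \mid X, \alpha, \theta) = \prod_{w=1}^{V} \prod_{k=1}^{\infty} p(\beta_{wk} \mid \alpha) \prod_{i=1}^{N} \left[ p(z_i \mid x_i, \beta) \prod_{j=1}^{C} p(y_{ij} \mid z_i, x_i, \theta) \right] \tag{7}$$
where $z_i$ is the meaning index of the $i$-th training word $x_i$ and $y_{ij}$ is its $j$-th context word. The model is trained by optimizing the Evidence Lower Bound using stochastic variational inference (Hoffman et al., 2013) with a fully factorized variational approximation of the posterior distribution $p(Z, \beta \mid X, Y, \alpha, \theta) \approx q(Z) q(\beta)$.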
One important property of the model is its ability to disambiguate words using context. More formally, after training on the data we may compute the posterior probability of the word meaning given the context and take the word vector of the meaning with the highest probability:
$$p(z = k \mid x, y, \theta) \propto p(y \mid x, z = k, \theta) \int p(z = k \mid \beta, x)\, q(\beta)\, d\beta \tag{8}$$
This knowledge about word meaning will be further utilized in one of our models as disambiguation(x|y).
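The disambiguation step can be sketched as a small Bayes-rule computation. The sketch assumes that a prior over meanings and the log-likelihood of the observed context under each meaning are already available from a trained model; both are passed in as plain lists, and the numbers are invented.

```python
import math

def disambiguate(prior, context_log_likelihood):
    """Return (index of the most probable meaning, posterior over meanings).

    prior                  -- prior[k] approximates p(z = k | x); every
                              entry is assumed to be strictly positive
    context_log_likelihood -- entry k is log p(y | x, z = k), the
                              log-probability of the context under meaning k
    """
    log_post = [math.log(p) + ll for p, ll in zip(prior, context_log_likelihood)]
    shift = max(log_post)  # normalize in a numerically stable way
    unnorm = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnorm)
    posterior = [u / total for u in unnorm]
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior

# Hypothetical word with two meanings: the first is more frequent a priori,
# but the observed context is far more likely under the second.
best, posterior = disambiguate([0.7, 0.3], [math.log(0.1), math.log(0.6)])
```

In this toy case the context overrides the prior: the second meaning is selected even though the first is more frequent.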

Models
In this section, we describe our extension of the original definition model. The goal of the extended definition model is to predict the probability of a definition $D = \{w_1, \dots, w_T\}$ given the word being defined $w^*$ and its context $C = \{c_1, \dots, c_m\}$ (e.g. an example of the use of this word). As motivated earlier, the context provides the information needed to determine the word meaning. The joint probability is again decomposed into conditional probabilities, each of which is provided with information about the context:
$$p(D \mid w^*, C) = \prod_{t=1}^{T} p(w_t \mid w_{i<t}, w^*, C) \tag{9}$$

AdaGram-based
Our first model is based on the original Input (I) conditioning, applied to Adaptive Skip-gram vector representations. To determine which word embedding to provide as Input (I), we disambiguate the word being defined using its context words $C$. More formally, our Input (I) conditioning becomes:
$$h_t = g([v^*; v_t], h_{t-1}), \qquad v^* = \mathrm{disambiguation}(w^* \mid C) \tag{10}$$
where $h_t$ is the hidden state at step $t$, $g$ is the recurrent cell, $[a; b]$ denotes vector concatenation, and $v^*$ and $v_t$ are the embeddings of the word being defined $w^*$ and of the word $w_t$, respectively. We refer to this model as Input Adaptive (I-Adaptive).
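The Input conditioning can be sketched as follows: at every time step, the embedding of the word being defined is concatenated with the embedding of the current definition word before it enters the recurrent cell. For brevity the sketch uses a plain tanh recurrent cell as a stand-in for g (the models in this paper are LSTMs), and all weights and vectors are made-up values.

```python
import math

def rnn_cell(x, h_prev, W_x, W_h):
    """A plain tanh recurrent cell, used here only as a stand-in for g."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_x[j], x))
            + sum(w * hi for w, hi in zip(W_h[j], h_prev))
        )
        for j in range(len(h_prev))
    ]

def run_input_conditioned(v_star, word_vectors, h0, W_x, W_h):
    """Apply h_t = g([v*; v_t], h_{t-1}) over a sequence of word embeddings."""
    h = h0
    states = []
    for v_t in word_vectors:
        x = v_star + v_t  # list concatenation implements [v*; v_t]
        h = rnn_cell(x, h, W_x, W_h)
        states.append(h)
    return states

# Hypothetical sizes: 2-dimensional embeddings and a 1-dimensional state.
W_x = [[0.1, 0.2, 0.3, 0.4]]
W_h = [[0.5]]
states = run_input_conditioned([1.0, 0.0], [[0.0, 1.0]], [0.0], W_x, W_h)
```

Here `v_star` would be the vector of the meaning selected by disambiguation; replacing it with a zero vector turns the same network into an unconditional language model.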

Attention-based
The Adaptive Skip-gram model is very sensitive to the choice of the concentration parameter of the Dirichlet process. An improper setting will produce many similar vector representations with smoothed meanings, due to theoretical guarantees on the number of learned components. To overcome this problem and avoid careful tuning of this hyper-parameter, we introduce the following model:
$$h_t = g([a^*; v_t], h_{t-1}), \qquad a^* = v^* \odot \mathrm{mask}, \qquad \mathrm{mask} = \sigma\left(W \frac{\sum_{i=1}^{m} \mathrm{ANN}(c_i)}{m} + b\right) \tag{11}$$
where $\odot$ is the element-wise product, $\sigma$ is the logistic sigmoid function, $W$ and $b$ are the weights of a linear layer and ANN is the attention neural network, which is a feed-forward neural network. We motivate these updates by the fact that, after a Skip-gram model is trained on a large corpus, the vector representation of each word absorbs information about every meaning of the word. Using a soft binary mask that depends on the word context, we extract the components of the word embedding relevant to the corresponding meaning. We refer to this model as Input Attention (I-Attention).
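The masking idea can be sketched as below. The sketch assumes that the mask is obtained by averaging the attention network outputs over the context words, applying a linear layer and squashing the result with a sigmoid; the attention network `toy_ann`, the weights `W` and `b`, and all vectors are hypothetical placeholders rather than trained components.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_mask(context_vectors, ann, W, b):
    """Soft binary mask computed from the averaged ANN outputs of the context."""
    outputs = [ann(c) for c in context_vectors]
    m = len(outputs)
    mean = [sum(o[d] for o in outputs) / m for d in range(len(outputs[0]))]
    return [
        sigmoid(sum(w * x for w, x in zip(row, mean)) + bias)
        for row, bias in zip(W, b)
    ]

def masked_embedding(v_star, mask):
    """Element-wise product of the word embedding and the mask."""
    return [v * g for v, g in zip(v_star, mask)]

def toy_ann(c):
    """Hypothetical one-layer attention network: element-wise tanh."""
    return [math.tanh(x) for x in c]

identity = [[1.0, 0.0], [0.0, 1.0]]
context = [[0.0, 0.0]]
neutral = attention_mask(context, toy_ann, identity, [0.0, 0.0])
selective = attention_mask(context, toy_ann, identity, [10.0, -10.0])
a_star = masked_embedding([2.0, -4.0], neutral)
```

Each mask entry lies strictly between 0 and 1, so a component of the embedding is kept almost unchanged when its entry is close to 1 and suppressed when the entry is close to 0.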

Attention Skip-gram
For the attention-based model, we use separate embeddings for context words. For this reason, we pre-train the attention block, which contains the embeddings, the attention neural network and the linear layer weights, by optimizing a negative sampling loss function in the same manner as the original Skip-gram model:
$$\mathrm{loss} = -\log \sigma\left(v_{w_O}^\top v_{w_I}\right) - \sum_{i=1}^{k} \log \sigma\left(-v_{w_i}^\top v_{w_I}\right) \tag{12}$$
where $v_{w_O}$, $v_{w_I}$ and $v_{w_i}$ are the vector representations of the "positive" example, the anchor word and a negative example, respectively, and $k$ is the number of negative examples. The vector $v_{w_I}$ is computed using the embedding of $w_I$ and the attention mechanism proposed in the previous section.
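A negative sampling loss of the Skip-gram kind can be sketched for a single training example as follows. The sketch assumes the usual form: the anchor vector is pulled towards the positive example and pushed away from each sampled negative example. The function names and vectors are illustrative.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(v_anchor, v_positive, v_negatives):
    """Loss for one anchor word, one positive and several negative examples."""
    loss = -log_sigmoid(dot(v_positive, v_anchor))
    for v_neg in v_negatives:
        loss -= log_sigmoid(-dot(v_neg, v_anchor))
    return loss

# With a zero anchor every dot product is 0, so each term equals log 2.
uninformative = negative_sampling_loss([0.0, 0.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])

# An anchor aligned with the positive and opposed to the negatives scores better.
good = negative_sampling_loss([2.0, 2.0], [1.0, 1.0], [[-1.0, 0.0], [0.0, -1.0]])
```

In the attention-based model the anchor vector would be the masked embedding produced by the attention block, so gradients of this loss reach the embeddings, the attention network and the linear layer together.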

Data
We collected a new dataset of definitions using the OxfordDictionaries.com (2018) API. Each entry is a triplet containing a word, its definition and an example of the use of this word in the given meaning. It is important to note that in our dataset words can have one or more meanings, depending on the corresponding entry in the Oxford Dictionary. Table 1 shows basic statistics of the new dataset.
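The triplet structure of an entry can be sketched as a small record type. The class name, the field names and the three example entries below are invented for illustration and are not taken from the collected dataset.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    word: str        # the word being defined
    definition: str  # the definition of one meaning of the word
    example: str     # an example of use of the word in that meaning

def words_with_multiple_meanings(entries):
    """Return the set of words that occur with more than one definition."""
    definitions = defaultdict(set)
    for entry in entries:
        definitions[entry.word].add(entry.definition)
    return {word for word, defs in definitions.items() if len(defs) > 1}

# Invented entries, for illustration only.
entries = [
    Entry("bank", "the land alongside a river", "we sat on the bank of the river"),
    Entry("bank", "a financial institution", "she deposited the money at the bank"),
    Entry("cat", "a small domesticated mammal", "the cat slept on the sofa"),
]
polysemous = words_with_multiple_meanings(entries)
```

Grouping entries by word in this way is one simple means of selecting the subset of words that have several meanings.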

Pre-training
It is well known that a good language model can often improve metrics such as BLEU for a particular NLP task (Jozefowicz et al., 2016). For this reason, we decided to pre-train our models on the WikiText-103 dataset (Merity et al., 2016). During pre-training we set $v^*$ (eq. 10) to the zero vector to make our models purely unconditional. The embeddings for these language models were initialized with Google Word2Vec vectors and were fine-tuned. Figure 1 shows that this procedure helps to decrease perplexity and prevents over-fitting. Attention Skip-gram vectors were also trained on WikiText-103.

Results
Both our models are LSTM networks (Hochreiter and Schmidhuber, 1997) with an embedding layer. The attention-based model has its own embedding layer, which maps context words to vector representations. First, we pre-train our models using the procedure described above. Then, we train them on the collected dataset, maximizing the log-likelihood objective using Adam (Kingma and Ba, 2014). We also anneal the learning rate by a factor of 10 whenever the validation loss does not decrease after an epoch. We use the original Adaptive Skip-gram vectors, obtained from the official repository (https://github.com/sbos/AdaGram.jl), as inputs to S+I-Adaptive. We compare different models using perplexity and BLEU score on the test set. The BLEU score is computed only for the models with the lowest perplexity and only on the test words that have multiple meanings. The results are presented in Table 3. We see that both models that utilize knowledge about the meaning of the word perform better than the competing one. We generated definitions using the S+I-Attention model with a simple temperature sampling algorithm (τ = 0.1). Table 2 shows the examples. The source code and dataset will be freely available.
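Temperature sampling can be sketched as follows, assuming the common formulation in which the logits are divided by τ before the soft-max. The function names and logits are illustrative; only the value τ = 0.1 comes from the text above.

```python
import math
import random

def temperature_softmax(logits, tau):
    """Soft-max over logits divided by the temperature tau (tau > 0)."""
    scaled = [logit / tau for logit in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, tau, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for a three-word vocabulary
sharp = temperature_softmax(logits, 0.1)  # nearly all mass on the best token
plain = temperature_softmax(logits, 1.0)  # the ordinary soft-max
token = sample_token(logits, 0.1, random.Random(0))
```

With a temperature as low as 0.1 the distribution is sharply peaked, so sampling behaves almost like greedy decoding while still allowing occasional variation.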

Conclusion
In this paper, we proposed two definition models that can work with polysemantic words. We evaluated them using perplexity and measured the definition generation accuracy with BLEU score. The results show that incorporating information about word senses leads to improved metrics. Moreover, the generated definitions show that even implicit word context can help to distinguish word meanings. In future work, we plan to explore the individual components of word embeddings and the mask produced by our attention-based model to gain a deeper understanding of vector representations of words.