Bridging the Defined and the Defining: Exploiting Implicit Lexical Semantic Relations in Definition Modeling

Definition modeling includes acquiring word embeddings from dictionary definitions and generating definitions of words. While the meanings of defining words are important in dictionary definitions, it is also crucial to capture the lexical semantic relations between defined words and defining words. However, the use of such relations has not yet been explored for definition modeling. In this paper, we propose definition modeling methods that use lexical semantic relations. To utilize implicit semantic relations in definitions, we use pattern-based word-pair embeddings, learned without supervision, that represent semantic relations of word pairs. Experimental results indicate that our methods improve performance both in learning embeddings from definitions and in definition generation.


Introduction
Dictionary definitions are rich resources of semantic information for both humans and machines. Recent studies on definition modeling are primarily divided into two categories. One is definition generation, in which a definition is generated for a target word from its word representation. Definition generation involves analyzing word embeddings using generated definitions (Noraset et al., 2017) and machine explanations of word meanings for human readers (Ni and Wang, 2017; Ishiwatari et al., 2019). The other is learning word embeddings from definitions to obtain semantic-oriented word representations (Tissier et al., 2017; Bosc and Vincent, 2018).
Although previous methods for encoding or decoding definitions using recurrent neural networks (RNNs) yielded promising results in definition modeling (Noraset et al., 2017; Bosc and Vincent, 2018), they did not explicitly utilize lexical semantic relations between defined words and defining words. Various lexical relations exist in definitions (Amsler, 1981), as displayed in Figure 1, where the defined word knife exhibits an Is-a relation with the defining words tool and instrument, a Has-a relation with edge, and a Used-for relation with cutting. Exploiting how definitions are structured around such lexical semantic relations facilitates the understanding and generation of definitions.
Figure 1: Definition of knife from WordNet (Fellbaum, 1998) and lexical semantic relations between the defined word and the defining words.
Based on this observation, we propose definition modeling methods that exploit lexical semantic relations between defined and defining words. However, lexical semantic relations in definitions are not explicit. To solve this problem, we use word-pair embeddings, learned without supervision, that represent semantic relations of word pairs based on co-occurring relational patterns in a corpus (Turney, 2005; Washio and Kato, 2018b). Experimental results show that our definition modeling methods improve on previous models, with respect to both definition generation and the acquisition of word embeddings from definitions.

Definition Embedding
Relationships between words captured in embeddings are studied in terms of similarity or relatedness (Hill et al., 2015). For example, coffee and cup have high relatedness because coffee is often contained in a cup; meanwhile, these words exhibit low similarity as coffee is a beverage and cup is a container.
Definition embeddings that are learned from word definitions are useful for capturing similarity, while word embeddings based on the distributional hypothesis (Mikolov et al., 2013) tend to capture relatedness (Bosc and Vincent, 2018). Bosc and Vincent (2018) proposed a method that trains a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to encode a word definition into an embedding. Given a defining word sequence D = {w_1, ..., w_T}, an LSTM definition encoder processes D, as follows: where LSTM computes the hidden state given the previous hidden state h_{t-1} and the input word embedding w_t, following the LSTM architecture. h_T is the final hidden state. W_* and b_* are weight matrices and bias terms, respectively. The definition encoder is trained to reconstruct a bag-of-words representation of the definition from h with a consistency penalty that renders h closer to the corresponding word embedding used in the LSTM. This method is referred to as the consistency penalized autoencoder (CPAE).
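Written out, the encoder described above takes roughly the following form. This is a sketch reconstructed from the prose, not the original typeset equations: the subscripts on W and b, and the assumption that h is a single affine projection of the final hidden state h_T, are illustrative.

```latex
\[
\begin{aligned}
h_t &= \mathrm{LSTM}(h_{t-1}, w_t), \qquad t = 1, \dots, T, \\
h   &= W_h h_T + b_h .
\end{aligned}
\]
```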

Definition Generation
Definition generation was introduced by Noraset et al. (2017). The goal of definition generation is to predict the probability of the defining word sequence D given the defined word w_trg. In the aforementioned study, the LSTM conditional language model was used as a definition decoder to model this probability as follows: where h_t is a hidden state from the LSTM definition decoder. Softmax is the softmax function. They conditioned the decoder by providing the embedding of the defined word at the first step of the LSTM. They referred to this model as the Seed (S). Moreover, they extended this model to update the output of the recurrent unit with a gate function depending on the embedding of the defined word. This gate function, Gate (G), controls the amount of information from the defined word that contributes to the definition generation at each step. As additional features, they used morphological information from a character-level convolutional neural network (CNN) to process a character sequence of the defined word and the embeddings of the hypernyms from the WebIsA database (Seitner et al., 2016). These features are called CH and HE, respectively.
Gadetsky et al. (2018) introduced context-aware definition generation to disambiguate polysemous words with their context and generate the corresponding definitions. They extended Equation 3 to consider the context word sequence of the defined word C = {c_1, ..., c_m} as follows: To model this probability, they used an attention mechanism to extract meaningful information from C and chose relevant dimensions of the embedding of the defined word. They referred to this model as Input Attention (I-Attention).
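The probabilities described in this section can be written out as below. This is a sketch reconstructed from the prose rather than the original typeset equations: the factorization is the usual left-to-right one for a conditional language model, and the parameter names W_d and b_d are assumed for illustration. The last line is the context-aware variant, which additionally conditions on C.

```latex
\[
\begin{aligned}
p(D \mid w_{trg}) &= \prod_{t=1}^{T} p(w_t \mid w_{<t}, w_{trg}), \\
p(\,\cdot \mid w_{<t}, w_{trg}) &= \mathrm{Softmax}(W_d h_t + b_d), \\
p(D \mid w_{trg}, C) &= \prod_{t=1}^{T} p(w_t \mid w_{<t}, w_{trg}, C).
\end{aligned}
\]
```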

Word-Pair Embedding
Word-pair embeddings represent relations of word pairs. Although representation of word pairs as the vector offsets of their pretrained word embeddings is a simple and powerful method (Mikolov et al., 2013), recent studies have shown that neural pattern-based word-pair embeddings are more effective than vector offsets in various tasks such as calculating relational similarity (Washio and Kato, 2018b), natural language inference, and question answering (Joshi et al., 2019).
Neural pattern-based word-pair embedding models (Washio and Kato, 2018a,b; Joshi et al., 2019) learn two neural networks without supervision: a word-pair encoder and a pattern encoder, which encode a word pair and a lexico-syntactic pattern, respectively, into the same embedding space. These networks are trained by predicting co-occurrences between word pairs and patterns in a corpus with the negative sampling objective. After the unsupervised learning, the word-pair encoder provides word-pair embeddings for any word pair given their word embeddings.
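To make the training signal concrete, the following is a minimal sketch of a plain negative sampling loss for one observed (word pair, pattern) co-occurrence. It assumes the score is a dot product between the two embeddings and that only patterns are sampled as negatives; the function names are hypothetical, and the multivariate variant of Joshi et al. (2019) is not reproduced here.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_loss(pair_vec, pattern_vec, negative_pattern_vecs):
    """Loss for one observed pair/pattern and a list of sampled negative patterns.

    The observed pattern is pushed toward a high score with the pair,
    and each sampled pattern toward a low score (assumed dot-product scoring).
    """
    loss = -log_sigmoid(dot(pair_vec, pattern_vec))
    for neg in negative_pattern_vecs:
        loss -= log_sigmoid(-dot(pair_vec, neg))
    return loss


# Toy vectors: an aligned positive pattern gives a lower loss than a misaligned one.
aligned = negative_sampling_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
misaligned = negative_sampling_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

Minimizing this loss over all extracted co-occurrences places word pairs near the patterns they occur with, which is what lets the pair encoder stand in for implicit relations later.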
defining words in the definition encoder for definition embeddings (Section 3.1) and the definition decoder for definition generation (Section 3.2). To utilize the implicit semantic relations in definitions, we use word-pair embeddings that represent semantic relations of word pairs. We describe how word-pair embeddings are obtained in Section 4.1.

For Definition Encoder
To consider lexical semantic relations in acquiring embeddings from definitions, we feed the word-pair embeddings to the definition encoder. Assuming that the pair embedding v_(w_trg, w_t) represents the relation between the defined word w_trg and the t-th defining word w_t, we calculate h_t as follows: where ; denotes vector concatenation. To exclude meaningless relations between the defined word and function words, we replace v_(w_trg, w_t) with the zero vector if w_t is a stopword. With word-pair embeddings as inputs, the definition encoder can recognize the role of the information provided by the current word w_t, for example, a type of w_trg (Is-a), a goal of w_trg (Used-for), or a component of w_trg (Has-a).
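The input construction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes each LSTM step receives the concatenation of the word embedding w_t and the pair embedding v_(w_trg, w_t), and the names `encoder_inputs`, `pair_embedding`, and `pair_dim` as well as the toy vectors are hypothetical.

```python
def encoder_inputs(target, defining_words, word_vecs, pair_embedding, stopwords, pair_dim):
    """Build one input vector per defining word: word embedding followed by pair embedding.

    For stopwords, the pair embedding is replaced with a zero vector so that
    function words contribute no relation information.
    """
    inputs = []
    for w in defining_words:
        if w in stopwords:
            pair_vec = [0.0] * pair_dim
        else:
            pair_vec = list(pair_embedding(target, w))
        inputs.append(list(word_vecs[w]) + pair_vec)
    return inputs


# Toy example: 2-dimensional word vectors and a constant 3-dimensional pair embedding.
toy_word_vecs = {"a": [0.5, 0.5], "tool": [1.0, 2.0]}
toy_pair_embedding = lambda target, word: [9.0, 9.0, 9.0]
toy_inputs = encoder_inputs(
    "knife", ["a", "tool"], toy_word_vecs, toy_pair_embedding, {"a"}, pair_dim=3
)
```

Zeroing rather than dropping the pair slot keeps every input the same width, so the LSTM input size does not depend on whether a word is a stopword.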

For Definition Decoder
To provide the definition decoder with information regarding lexical semantic relations, we use an additional loss function with word-pair embeddings as follows: where S is a set of stopwords. As in Section 3.1, we ignore the loss when w_t is a stopword. This additional loss allows the definition decoder to learn which semantic relations occur in definitions and how they occur. For example, if w_trg denotes a type of tool, a defining word that has the Is-a relation to w_trg tends to be followed by a Used-for word.

Experiments and Results
The experiments conducted to evaluate our methods for definition encoding and decoding are presented in this section. In the Appendix, we describe the details of the hyperparameter settings and optimization methods used in the experiments.

Obtaining Word-Pair Embedding
To obtain pattern-based word-pair embeddings, we extracted triples (w_1, w_2, p) ∈ T from the Wikipedia corpus, where (w_1, w_2) is a word pair composed of nouns, verbs, or adjectives among the 100K most frequent words of GloVe (Pennington et al., 2014), and p is the co-occurring shortest dependency path. We discarded the triples if p occurred fewer than five times and subsampled the triples based on word-pair probability with a threshold of 5 · 10^-7, following Joshi et al. (2019).
For the word-pair encoder, we used the neural networks as follows: where MLP is a four-layer perceptron with the ReLU activation, v_w is a word embedding of w, and ⊙ is the element-wise product. The size of each hidden state of the MLP was 300.
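The triple filtering described at the start of this section can be sketched as follows. The frequency cutoff follows the text; the subsampling formula is an assumption (a word2vec-style keep probability), since the text only gives the threshold, and the function names are hypothetical.

```python
import math
from collections import Counter


def filter_rare_patterns(triples, min_count=5):
    """Drop triples (w1, w2, p) whose pattern p occurs fewer than min_count times."""
    counts = Counter(p for _, _, p in triples)
    return [t for t in triples if counts[t[2]] >= min_count]


def keep_probability(pair_prob, threshold=5e-7):
    """Assumed word2vec-style subsampling: frequent word pairs are kept less often."""
    if pair_prob <= 0.0:
        return 1.0
    return min(1.0, math.sqrt(threshold / pair_prob))


# Toy data: pattern "p1" occurs five times and is kept; "p2" occurs four times and is dropped.
toy_triples = [("knife", "tool", "p1")] * 5 + [("knife", "edge", "p2")] * 4
kept = filter_rare_patterns(toy_triples)
```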
The dependency path p is a sequence composed of one to three lemmas and dependency relations e_1, ..., e_n. The sequence of the corresponding embeddings e_1, ..., e_n was encoded using a bidirectional LSTM with a 300-dimensional hidden state as the pattern encoder. Then, the 300-dimensional pattern representation v_p was calculated with the final output vectors h_f and h_b from the forward and backward LSTM as follows: The word embeddings in the models were initialized with the 300-dimensional GloVe vectors. We used the multivariate negative sampling objective (Joshi et al., 2019) to train the parameters with the data T for 10 epochs. Adagrad with a learning rate of 0.01 was used as the optimizer (Duchi et al., 2011). After the training was completed, word-pair embeddings were calculated as follows:

Definition Embedding
For the evaluation of definition embeddings, we used the modified Word Embedding Benchmarks project, following Bosc and Vincent (2018). These benchmarks include SimLex999 (SL999) (Hill et al., 2015), SimVerb (SV) (Gerz et al., 2016), MEN (Bruni et al., 2014), RG (Rubenstein and Goodenough, 1965), WS353 (Finkelstein et al., 2002), SCWS (Huang et al., 2012), and MTurk (Radinsky et al., 2011; Halawi et al., 2012). To evaluate the definition embeddings, we scored word pairs in the benchmarks using the cosine similarity between the corresponding definition embeddings and calculated Spearman's correlation with the ground truth. The definitions in WordNet were used to train the definition encoder. The development sets of SimVerb and MEN were used for hyperparameter tuning. We implemented CPAE (Section 2.1) as a baseline and compared it to CPAE with the word-pair embeddings (Section 3.1), which is our proposed method. The word embeddings in the definition encoder were initialized with the Google Word2Vec vectors (footnote 4). Google Word2Vec vectors and GloVe were used as the other baselines.
Table 1 shows the results of the similarity and relatedness benchmarks. While the baseline CPAE outperformed our model on two of the five datasets pertaining to word relatedness, our model consistently outperformed the baseline on the similarity benchmarks. These results indicate that the word-pair embeddings provide the definition encoder with useful semantic information about the target word.
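The scoring procedure used in this section (cosine similarity between embeddings, then Spearman's correlation with human ratings) can be sketched in plain Python as follows. The function names and the toy embeddings are hypothetical; ties are handled with average ranks, which is the common convention but is an assumption about the benchmark code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(word_pairs, gold_scores, embeddings):
    """Score each word pair by cosine similarity and correlate with gold ratings."""
    predicted = [cosine(embeddings[a], embeddings[b]) for a, b in word_pairs]
    return spearman(predicted, gold_scores)


# Toy example with made-up 2-dimensional embeddings and ratings.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
toy_pairs = [("a", "b"), ("a", "c"), ("b", "c")]
agreement = evaluate(toy_pairs, [9.0, 1.0, 3.0], toy_embeddings)
disagreement = evaluate(toy_pairs, [1.0, 9.0, 3.0], toy_embeddings)
```

Because Spearman's correlation depends only on rank order, the benchmark rewards embeddings that order word pairs like the human raters do, regardless of the absolute cosine values.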

Hidden States of Definition Encoder
To analyze the effect of injecting semantic relations between the defined word and the defining words into the definition encoder, we investigated the similarities between hidden states of the definition encoders.
We encoded the definition of kettle (a metal pot for stewing or boiling) and the definition of knife (edge tool used as a cutting instrument) with the definition encoders. These two definitions are similar in style: both are composed of defining words with Is-a relations (pot and tool) and Used-for relations (stewing, boiling, and cutting). Figure 2 displays the cosine similarities between the last hidden state of the encoded kettle definition and each hidden state of the encoded knife definition. The figure shows that when tool, which has an Is-a relation to knife, and cutting, which has a Used-for relation to knife, were input to the encoder, our method increased the similarities more than the baseline did. This indicates that our method helps the model capture the similarities of definitions in terms of lexical semantic relations.
Figure 2: Cosine similarities between the last hidden states of the definition encoders encoding kettle's definition and each hidden state representing knife's definition.
4 https://code.google.com/archive/p/word2vec/

Definition Generation
We evaluated our method for the definition decoder (Section 3.2) on a context-agnostic dataset (Noraset et al., 2017) and a context-aware dataset (Gadetsky et al., 2018) for definition generation. For evaluation, we calculated the perplexity (PPL) and BLEU score (Papineni et al., 2002), following Noraset et al. (2017). We implemented S+G+CH+HE for the context-agnostic dataset and S+I-Attention for the context-aware dataset as baselines, and compared these models to the ones trained with L_rel (Section 3.2). The word embeddings in the models were initialized with Google Word2Vec vectors, as in Section 4.2.
The results in Table 2 show that training with our L_rel improves the performance in both the context-agnostic and context-aware settings.

Table 3: Definitions generated by the baseline S+G+CH+HE and by the model with L_rel.
Word | S+G+CH+HE | w/ L_rel
academician | a person who specializes in a particular profession | one who is versed in a scholarly or scientific field
artist | one who is a person who is a person or thing is made | one who creates a picture or representation of a creative work
adolescence | the state of being pregnant | the state of being mature
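As a reminder of how the perplexity reported in this section is computed, here is a minimal sketch. It assumes natural-log token probabilities from the decoder and a corpus-level average over all tokens; whether an end-of-sequence token is counted is left to the caller, and the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Corpus-level perplexity from per-token natural-log probabilities.

    token_log_probs is a list of definitions, each a list of log p(w_t | ...).
    """
    n_tokens = sum(len(definition) for definition in token_log_probs)
    total_log_prob = sum(sum(definition) for definition in token_log_probs)
    return math.exp(-total_log_prob / n_tokens)


# Toy check: if every token has probability 0.25, the perplexity is 4.
toy_ppl = perplexity([[math.log(0.25)] * 3, [math.log(0.25)] * 2])
```

Lower perplexity means the decoder assigns higher probability to the reference definitions, which complements BLEU's comparison of generated and reference word sequences.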

Effect on Definition Generation
To analyze the effect of L_rel, we plotted the BLEU scores against the average number of content words in the reference definitions from the development set of Noraset et al. (2017), as shown in Figure 3. The figure shows that when the reference definitions contain many content words, the model with L_rel outperforms the model without it. This indicates that our method allows the model to select correct words when the defined word requires a detailed description.

Generated Definitions
Table 3 displays examples of definitions generated by the baseline S+G+CH+HE and by the model with L_rel. The model with the proposed method successfully generated the definitions for academician and artist, while the baseline did not. Although the baseline generated the generalized class of the target words, for example, person and one, it failed to generate details when the target word is distinct from others in the same class. In contrast, the model with the proposed method chose the correct words for both the generalized class and the details.
For adolescence, neither model generated a correct definition. Even with our method, the model produced a definition whose meaning is opposite to that of the target word. The generation of opposite definitions is a significant problem in definition generation from word embeddings (Noraset et al., 2017). Although our method helps the model generate details about the target words, this problem requires other approaches that consider antonym relations.

Conclusion
In this paper, we proposed definition modeling methods that utilize lexical semantic relations between defined words and defining words. To utilize implicit semantic relations in dictionary definitions, we used pattern-based word-pair embeddings. The experimental results demonstrated that applying our methods to the definition encoder and the definition decoder improved their performance. In future work, we will extend our methods to phrase-level definition generation (Ishiwatari et al., 2019).