SJTU-NLP at SemEval-2018 Task 9: Neural Hypernym Discovery with Term Embeddings

This paper describes a hypernym discovery system for our participation in the SemEval-2018 Task 9, which aims to discover the best (set of) candidate hypernyms for input concepts or entities, given the search space of a pre-defined vocabulary. We introduce a neural network architecture for the concerned task and empirically study various neural network models to build the representations in latent space for words and phrases. The evaluated models include convolutional neural network, long-short term memory network, gated recurrent unit and recurrent convolutional neural network. We also explore different embedding methods, including word embedding and sense embedding for better performance.


Introduction
Hypernym-hyponym relationship is an is-a semantic relation between terms as shown in Table 1.
Various natural language processing (NLP) tasks, especially those semantically intensive ones aiming for inference and reasoning with generalization capability, such as question answering (Harabagiu and Hickl, 2006;Yahya et al., 2013) and textual entailment (Dagan et al., 2013;Roller and Erk, 2016), can benefit from identifying semantic relations between words beyond synonymy.
The hypernym discovery task (Camacho-Collados et al., 2018) aims to discover the most appropriate hypernym(s) for input concepts or entities from a pre-defined corpus.
A relevant well-known scenario is hypernym detection, which is a binary task to decide whether a hypernymic relationship holds between a pair of words or not. A hypernym detection system should be capable of learning taxonomy and lexical semantics, including pattern-based methods (Boella and Caro, 2013;Espinosa-Anke et al., 2016b) and graphbased approaches (Fountain and Lapata, 2012;Velardi et al., 2013;Kang et al., 2016). However, our concerned task, hypernym discovery, is rather more challenging since it requires the systems to explore the semantic connection with all the exhausted candidates in the latent space and rank a candidate set instead of a binary classification in previous work. Recently, neural network (NN) models have shown competitive or even better results than traditional linear models with handcrafted sparse fea-tures (Qin et al., 2016b;?,a;Wang et al., 2016c;Zhao et al., 2017a;Wang et al., 2017;Qin et al., 2017;Zhao et al., 2017b;Li et al., 2018). In this work, we introduce a neural network architecture for the concerned task and empirically study various neural networks to model the distributed representations for words and phrases.
In our system, we leverage an unambiguous vector representation via term embedding, and we take advantage of deep neural networks to discover the hypernym relationships between terms.
The rest of the paper is organized as follows: Section 2 briefly describes our system, Section 3 shows our experiments on the hyperym discovery task including the general-purpose and domainspecific one. Section 4 concludes this paper.

System Overview
Our hypernym discovery system can be roughly split into two parts, Term Embedding and Hypernym Relationship Learning. We first train term embeddings, either using word embedding or sense embedding to represent each word. Then, neural networks are used to discover and rank the hypernym candidates for given terms.

Embedding
To use deep neural networks, symbolic data needs to be transformed into distributed representations (Wang et al., 2016a;Qin et al., 2016b;Cai and Zhao, 2016;Zhang et al., 2016;Wang et al., 2016bWang et al., , 2015. We use Glove toolkit to train the word embeddings using UMBC corpus (Han et al., 2013). Moreover, in order to perform word sense induction and disambiguation, the word embedding could be transformed to sense embedding, which is induced from exhisting word embeddings via clustering of ego-networks (Pelevina et al., 2016) of related words. Thus, each input word or phrase is embedded into vector sequence, w = {x 1 , x 2 , . . . , x l } where l denotes the sequence length. If the input term is a word, then l = 1 while for phrases, l means the number of words.

Hypernym Learning
Previous work like TAXOEMBED (Espinosa-Anke et al., 2016a) uses transformation matrix for hypernm relationship learning, which might be not optimal due to the lack of deeper nonlinear feature extraction. Thus, we empirically survey various neural networks to represent terms in latent space. After obtaining the representation for input term and all the candidate hypernyms, to give the ranked hypernym list, the cosine similarity between the term and the candidate hypernym is computed by, where x i and y i denote the two concerned vectors. Our candidate neural networks include Convolutional Neural Network (CNN), Long-short Term Memory network (LSTM), Gated Recurrent Unit (GRU) and Recurrent Convolutional Neural Network (RCNN).
GRU The structure of GRU (Cho et al., 2014) used in this paper are described as follows.
where ⊙ denotes the element-wise multiplication. r t and z t are the reset and update gates respectively, andh t the hidden states.
where σ stands for the sigmoid function, ⊙ represents element-wise multiplication and h t are the input gates, forget gates, memory cells, output gates and the current state, respectively.
CNN Convolutional neural networks have also been successfully applied to various NLP tasks, in which the temporal convolution operation and associated filters map local chunks (windows) of the input into a feature representation. Concretely, let n denote the filter width, filter matrices [W 1 , W 2 , . . . , W k ] with several variable sizes [l 1 , l 2 , . . . , l k ] are utilized to perform the convolution operations for input embeddings. For the sake of simplicity, we will explain the procedure for only one embedding sequence. The embedding will be transformed to sequences c j (j ∈ [1, k]) : where [i : i + l j − 1] indexes the convolution window. Additionally, we apply wide convolution operation between embedding layer and filter matrices, because it ensures that all weights in the filters reach the entire sentence, including the words at the margins.
A one-max-pooling operation is adopted after convolution and the output vector s is obtained through concatenating all the mappings for those k filters.
In this way, the model can capture the critical features in the sentence with different filters.
RCNN Since some input terms are phrases, whose member words share different weights. In RCNN, an adaptive gated decay mechanism is used to weight the words in the convolution layer. Following (Lei et al., 2016), we introduce neural gates similar λ to LSTMs to specify when and how to average the observed signals. The resulting architecture integrates recurrent networks with nonconsecutive convolutions: where c 1 t , c 2 t , · · · , c n t are accumulator vectors that store weighted averages of 1-gram to n-gram features.
For discriminative training, we use a maxmargin framework for learning (or fine-tuning) parameters θ. Specifically, a scoring function ϕ(·, ·; θ) is defined to measure the semantic similarity between the corresponding representations of input term and hypernym. Let p = {p 1 , ...p n } denote the hypernym corpus and h ∈ p is the ground-truth hypernym to the term t i , the optimal parameters θ are learned by minimizing the maxmargin loss: where δ(., .) denotes a non-negative margin and δ(p i , a) is a small constant when a = p i and 0 otherwise.

Experiment
In the following experiments, besides the neural networks, we also simply average the embeddings of an input phrase as our baseline to discover the relationship of terms and their corresponding hypernyms for comparison (denoted as term embedding averaging, TEA).

Setting
Our hypernym discovery experiments include general-purpose substask for English and domainspecific ones for medical and music. Our evaluation is based on the following information retrieval metrics: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), Precision at 1 (P@1), Precision at 3 (P@3), Precision at 5 (P@5), Precision at 15 (P@15).
For the sake of computational efficiency, we simply average the sense embedding if a word has more than one sense embedding (among various domains). Our model was implemented using the Theano 1 . The diagonal variant of Ada-Grad (Duchi et al., 2011) is used for neural network training. We tune the hyper-parameters with the following range of values: learning rate ∈ {1e − 3, 1e − 2}, dropout probability ∈ {0.1, 0.2}, CNN filter width ∈ {2, 3, 4}. The hidden dimension of all neural models are 200. The batch size is set to 20 and the word embedding and sense embedding sizes are set to 300. All of our models are trained on a single GPU (NVIDIA GTX 980Ti), with roughly 1.5h for general-purpose subtask for English and 0.5h domain-specific domain-specific ones for medical and music. We run all our models up to 50 epoch and select the best result in validation. Table 2 shows the result on general-domain subtask for English. All the neural models outperform term embedding averaging in terms of   Table 3: Gold standard evaluation on domain-specific subtask. "Embed" is short for "Embedding".

Result and analysis
all the metrics. This result indicates simply averaging the embedding of words in a phrase is not an appropriate solution to represent a phrase. Convolution or recurrent gated mechanisms in either CNN-based (CNN, RCNN) or RNN (GRU, LSTM) based neural networks could essentially be helpful of modeling the semantic connections between words in a phrase, and guide the networks to discover the hypernym relationships. We also observe CNN-based network performance is better than RNN-based, which indicates local features between words could be more important than long-term dependency in this task where the term length is up to trigrams.
To investigate the performance of neural models on specific domains, we conduct experiments on medical and medicine subtask. Table 3 shows the result. All the neural models outperform term embedding averaging in terms of all the metrics and CNN-based network also performs better than RNN-based ones in most of the metrics using word embedding, which verifies our hypothesis in the general-purpose task. Compared with word embedding, the sense embedding shows a much poorer result though they work closely in generalpurpose subtask. The reason might be the simple averaging of sense embedding of various domains for a word, which may introduce too much noise and bias the overall sense representation. This also discloses that modeling the sense embedding of specific domains could be quite important for further improvement.

Conclusion
In this paper, we introduce a neural network architecture for the hypernym discovery task and empirically study various neural network models to model the representations in latent space for words and phrases. Experiments on three subtasks show the neural models can yield satisfying results. We also evaluate the performance of word embedding and sense embedding, showing that in domainspecific tasks, sense embedding could be much more volatile.