NLP_HZ at SemEval-2018 Task 9: a Nearest Neighbor Approach

Hypernym discovery aims to discover the set of hypernyms for a given hyponym word in a given corpus. This paper proposes a simple but effective method for hypernym discovery based on word embeddings, which are used to measure the contextual similarities between words. Given a test hyponym, we compute its similarity to each hyponym in the training data, and fill the test word's hypernym list with the hypernym lists of its nearest training neighbors. In SemEval-2018 Task 9, our results rank 1st on Spanish, 2nd on Italian, and 6th on English in the MAP metric.


Introduction
The hypernymy relationship plays a critical role in language understanding because it enables generalization, which lies at the core of human cognition (Yu et al. (2015)). It has been widely used in various NLP applications (Espinosa Anke et al. (2016)), from word sense disambiguation (Agirre et al. (2014)) to information retrieval (Varelas et al. (2005)), question answering (Prager (2006)) and textual entailment (Glickman et al. (2005)). The hypernymy relation also plays an important role in knowledge base construction.
In past SemEval contests (SemEval-2015 Task 17, SemEval-2016 Task 13), the "Hypernym Detection" task was treated as a classification task, i.e., given a (hyponym, hypernym) pair, deciding whether the pair stands in a true hypernymic relation or not. This has led to criticisms regarding its oversimplification (Levy et al., 2015). In SemEval-2018 Task 9 (Camacho-Collados et al., 2018), the task has shifted to "Hypernym Discovery", i.e., given the search space of a domain's vocabulary and an input hyponym, discover its best (set of) candidate hypernyms.
The paper is organized as follows: Section 2 gives an introduction to related work; Section 3 describes our methods for this task, including word embedding projection learning as the baseline and the nearest-neighbor-based method used for the submitted results; the experimental results are presented in Section 4; and we conclude the paper in Section 5.

Related Work
Work on identifying the hypernymy relationship can be categorized along different axes according to the learning method and the task formulation. Early work (Hearst (1992)) formalized the task as unsupervised hypernym discovery, i.e., no hyponym-hypernym pairs (x, y) are given as training data. Hearst (1992) handcrafted a set of lexico-syntactic paths connecting joint occurrences of x and y that indicate hypernymy in a large corpus. Snow et al. (2004) trained a logistic regression classifier using all dependency paths which connect a small number of known hyponym-hypernym pairs. Paths that were assigned high weights by the classifier were then used to extract unseen hypernym pairs from a new corpus. Variations of Snow et al. (2004) were later used in tasks such as taxonomy construction (Snow et al. (2006); Kozareva and Hovy (2010); Carlson et al. (2010)), analogy identification (Turney (2006)), and definition extraction (Borg et al. (2009); Navigli and Velardi (2010)).
A major limitation of relying on lexico-syntactic paths is the requirement that the hypernym pair co-occur. Distributional methods were developed to overcome this limitation. Lin (1998) developed symmetric similarity measures to detect hypernyms in an unsupervised manner. Weeds and Weir (2003) and Kotlerman et al. (2010) employed directional measures based on the distributional inclusion hypothesis. More recent work (Santus et al. (2014); Rimell (2014)) introduces new measures based on the distributional informativeness hypothesis. Yu et al. (2015), Tuan and Ng (2016) and Nguyen et al. (2017) directly learn word embeddings that are optimized for capturing the hypernymy relationship.

Hyponym-Hypernym Discovery Method

Preprocessing
For the corpus and the train/gold/test data, we have two preprocessing steps: 1) lowercase all the words; 2) concatenate with underscores the phrases (hyponyms or hypernyms composed of more than one word) which occur in the training set or the test set, e.g., "executive president" is replaced by "executive_president". This is quite useful for training word embedding models because we want to treat phrases as single words.
If there are multiple phrases in one sentence, we generate multiple sentences, one per phrase. For example, "executive president" and "vice executive president" both exist in the corpus sentence "Hoang Van Dung , vice executive president of the Vietnam Chamber of Commerce and Industry .". After preprocessing, two more sentences are generated and included in the training corpus for word embeddings: • Hoang Van Dung , vice executive_president of the Vietnam Chamber of Commerce and Industry .
• Hoang Van Dung , vice_executive_president of the Vietnam Chamber of Commerce and Industry .
The size of the original corpus increases after this preprocessing step, e.g., the English corpus grows from ∼18G to ∼32G.
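As an illustration, here is a minimal sketch of this phrase-expansion step. The function name and the use of plain substring replacement are our own simplification; the actual pipeline operates on a tokenized corpus.

```python
def expand_sentence(sentence, phrases):
    """For each multi-word train/test phrase that occurs in the
    sentence, emit one copy with that phrase joined by underscores."""
    expanded = []
    for phrase in phrases:
        if phrase in sentence:
            expanded.append(sentence.replace(phrase, phrase.replace(" ", "_")))
    return expanded

sent = ("hoang van dung , vice executive president of the "
        "vietnam chamber of commerce and industry .")
for s in expand_sentence(sent, ["executive president",
                                "vice executive president"]):
    print(s)  # one extra training sentence per matched phrase
```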

Word Embedding
We train our word embedding models using the Google word2vec tool (Mikolov et al. (2013a,b)) on the preprocessed corpus. We employ the skip-gram model since it has been shown to perform best at identifying semantic relations among words. The trained word embeddings are used in both the projection learning method and the nearest-neighbor-based method.

Method based on Projection Learning
The intuition of this method is to assume that there is a linear transformation in the embedding space which maps hyponyms to their corresponding hypernyms. We first learn a projection matrix from the training data, then apply the matrix to the test data. Our method is similar to that described in Fu et al. (2014); the main idea can be summarized as follows:

1. Given a word x and its hypernym y, assume there exists a linear projection matrix Φ such that y = Φx. We learn an approximate Φ by minimizing the MSE loss:

$$\Phi^* = \arg\min_{\Phi} \frac{1}{N} \sum_{(x,y)} \| \Phi x - y \|^2$$

2. Learn a piecewise linear projection by clustering the training data into different groups according to the vector offsets. The motivation for the clustering is two-fold: firstly, the hypernym-hyponym relation is diverse, e.g., the offset from "carpenter" to "laborer" is distant from the one from "goldfish" to "fish"; secondly, if a hyponym x has many hypernyms (or hierarchical hypernyms), we cannot use a single transition matrix Φ to project x to the different hypernyms y. A piecewise projection is therefore learned within each individual group, and the optimization goal can be formalized as:

$$\Phi_k^* = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y) \in C_k} \| \Phi_k x - y \|^2$$

where $N_k$ is the number of word pairs in the $k$-th cluster $C_k$.
3. Learn a threshold δ_k for each cluster, assuming that positive (hyponym-hypernym) pairs fall within radius δ_k while negative pairs do not:

$$d(\Phi_k x, y) < \delta_k$$

where d stands for the Euclidean distance.
4. Once the piecewise projections and thresholds are learned, given a new hyponym x, every hypernym candidate y from the vocabulary is paired with x. Each pair is assigned to the proper cluster by its vector offset (y − x), and the threshold δ_k of that cluster decides whether (x, y) is a plausible hyponym-hypernym pair.
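For concreteness, here is a minimal sketch of steps 1-4 using k-means on the offsets and per-cluster least squares. The function names are hypothetical, and the threshold choice shown here (the largest residual among positive pairs) is a simplification; Fu et al. (2014) tune δ_k against negative pairs.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_piecewise_projection(X, Y, n_clusters=2):
    """Fit a piecewise linear projection in the spirit of Fu et al. (2014).
    X, Y: (N, d) arrays of hyponym / hypernym embeddings (positive pairs)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(Y - X)  # cluster by offset
    projections, deltas = [], []
    for k in range(n_clusters):
        Xk, Yk = X[km.labels_ == k], Y[km.labels_ == k]
        # Least-squares solution of min (1/N_k) sum ||Phi_k x - y||^2 over C_k.
        A, *_ = np.linalg.lstsq(Xk, Yk, rcond=None)
        Phi = A.T  # Phi @ x approximates y
        # Simplified threshold: largest residual distance among positives.
        deltas.append(np.linalg.norm(Xk @ Phi.T - Yk, axis=1).max())
        projections.append(Phi)
    return km, projections, deltas

def score_pair(x, y, km, projections, deltas):
    """Assign (x, y) to a cluster by its offset y - x and return the
    normalized distance d(Phi_k x, y) / delta_k; values below 1 mean
    the pair is accepted as hyponym-hypernym."""
    k = int(km.predict((y - x).reshape(1, -1))[0])
    return float(np.linalg.norm(projections[k] @ x - y) / deltas[k])
```

The normalized score returned by `score_pair` is also what we later use to rank and truncate candidate hypernyms.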

Method Based on Nearest Neighbors
We noticed that hypernyms are often quite distant from their corresponding hyponyms in the embedding space, while hyponyms which are close to each other often share the same hypernyms. We propose a simple yet effective approach based on this observation.

Formally, for a test hyponym x,

$$\mathrm{Hyper}(x) = [\, \mathrm{Hyper}(w) \mid w \in \mathrm{HypoN} \,]$$

where HypoN is the list of hyponyms from the training set sorted according to their distance to x, and the distance function measures the similarity between each training hyponym and x. Cosine similarity in the embedding space is used as the distance function in our setup. According to the requirements of Task 9, only the top 15 candidates in Hyper(x) are submitted for evaluation.
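A minimal sketch of this procedure follows; names such as `discover_hypernyms` and `hyper_dict` are illustrative, and we assume the embeddings and the training hypernym lists are already loaded.

```python
import numpy as np

def discover_hypernyms(x_vec, train_hypos, train_vecs, hyper_dict, top_k=15):
    """Nearest-neighbor hypernym discovery: rank training hyponyms by
    cosine similarity to the query and collect their gold hypernyms.
    train_vecs: (N, d) matrix of embeddings for the training hyponyms;
    hyper_dict: maps each training hyponym to its gold hypernym list."""
    # Cosine similarity between the query and every training hyponym.
    sims = train_vecs @ x_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(x_vec))
    candidates = []
    for i in np.argsort(-sims):               # nearest neighbors first
        for h in hyper_dict[train_hypos[i]]:  # that neighbor's hypernyms
            if h not in candidates:
                candidates.append(h)
        if len(candidates) >= top_k:
            break
    return candidates[:top_k]                 # Task 9 scores the top 15
```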

Experimental Setup
Word2vec is used to produce the word embeddings. The skip-gram model (-cbow 0) is used with the embedding dimension set to 300 (-size 300); the other options are left at their defaults. We use 10-fold cross validation to evaluate both methods on the provided training data. The results are shown in Table 1.
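For reproducibility, a rough gensim equivalent of these settings is sketched below; the corpus path is a placeholder, and gensim's remaining defaults only approximate those of the original word2vec tool.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical path: one preprocessed (lowercased, phrase-joined)
# sentence per line.
sentences = LineSentence("corpus.preprocessed.txt")

# Skip-gram (sg=1) with 300-dimensional vectors mirrors -cbow 0 -size 300.
model = Word2Vec(sentences, vector_size=300, sg=1)
model.wv.save_word2vec_format("embeddings.vec")
```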

Results Based on Projection Learning
For the projection learning method, we followed the experimental settings described in Fu et al. (2014). The negative (hyponym, hypernym) pairs are randomly sampled from the vocabulary; the training set consists of negative and positive pairs in a 3:1 ratio.
Using the same PRF (precision/recall/F-value) evaluation metrics as the cited paper, our best F-value on the validation set is 0.68 (the paper reports 0.73), obtained with 2 clusters and thresholds (17.7, 17.3). We apply the learned projection matrices and thresholds to the validation data, extract the candidate hypernyms from the given vocabulary, and keep the top 15 candidates sorted by the d(Φ_k x, y)/δ_k scores. The generated results are not very promising; see Table 1 for details.
The projection learning method does not perform well on Task 9. We think the most probable reason is that Fu et al. (2014) formalize the problem as a classification problem in which the (hyponym, hypernym) pairs are given, whereas our task is a hypernym discovery problem given only hyponyms, which might be inherently much harder than the classification task. A second reason might be the relatively small amount of training data, i.e., ∼7500 training pairs in total.

Results Based on NN
The results are shown in Table 1, rows 2 to 5. Table 2 shows the results evaluated on the test data. The performance under either cross validation or test-data evaluation is much worse than that of a typical hypernym prediction task as reported by Weeds and Weir (2003), which illustrates that hypernym discovery is indeed a much harder task than hypernym prediction.
Although our method is quite simple, our submissions rank 1st on Spanish, 2nd on Italian and 6th on English by the MAP metric. Compared with the results obtained by cross validation, the performance evaluated on the test data (Table 2) dropped significantly on English (MAP dropped by 4%) and Italian (MAP dropped by 8%), but increased by a small margin on Spanish (MAP increased by 3.6%). We attribute this to the properties of the provided data, i.e., the hypernyms in the test set are similar to those in the training set for Spanish, but dissimilar for English and Italian.
The performance drop on English and Italian exposes one of the main drawbacks of our method: it cannot discover hypernyms that never occur in the training set. To overcome this shortcoming, syntactic patterns could be used to extract high-confidence hyponym-hypernym pairs and thereby enlarge the training set. We leave this to future work.

Conclusion
In this paper we describe two methods we tried for the hypernym discovery task in SemEval-2018. We extended the method originally proposed for hypernym prediction by Fu et al. (2014) as a baseline system; however, its performance is poor. The nearest-neighbor-based method is relatively simple, yet quite effective. We analyzed the experimental results, revealed some shortcomings, and proposed a potential extension for future improvement.