EXPR at SemEval-2018 Task 9: A Combined Approach for Hypernym Discovery

In this paper, we present our system (EXPR) for the hypernym discovery task of SemEval 2018. The task addresses the challenge of discovering hypernym relations from a text corpus. Our proposal combines a path-based technique with a distributional technique. We run a dependency parser on a corpus to extract candidate hypernyms and represent their dependency paths as a feature vector. This feature vector is concatenated with a feature vector obtained from a term embedding model pre-trained on Wikipedia. The concatenated feature vector is used to train a supervised classifier, which then classifies new candidate hypernyms as hypernym or not. Our system performs well at discovering new hypernyms that are not defined in the gold hypernyms.


Introduction
Hypernymy is an important lexical-semantic relation that is useful for many applications, such as question answering, machine translation, and information retrieval. In addition, hypernym relations are the backbone for building ontologies.
Various methods have been proposed to detect hypernym relations from text corpora. Most of these techniques are either path-based or distributional. In path-based methods, the detection of hypernym relations is based on the lexico-syntactic paths connecting a pair of terms in a corpus. Distributional methods, in contrast, are based on the distribution of the contexts in which the two terms occur. Most of these methods were unsupervised, but the focus has recently shifted towards supervised methods.
This task is inherently complex and is far from being solved. The SemEval organizers address the same task but with a novel formulation (Camacho-Collados et al., 2018): they reformulate the task from hypernym detection into hypernym discovery. This novel formulation makes the task more realistic in terms of actual downstream applications, while also enabling the use of information retrieval evaluation metrics. Hypernym detection focuses on deciding whether a hypernymic relation holds between a given pair of terms. Hypernym discovery focuses on discovering a set containing the best hypernyms for a given term from a given vocabulary search space. The task is divided into two subtasks: General-Purpose Hypernym Discovery and Domain-Specific Hypernym Discovery. The first consists of discovering hypernyms in a general-purpose corpus, for which the SemEval organizers provide data in three languages: English, Italian, and Spanish. The second consists of discovering hypernyms in a domain-specific corpus, for which they provide data in two specific domains: Medical and Music. The data contains a list of training terms along with their gold hypernyms, a list of testing terms, and a vocabulary search space. Each term is either a concept or an entity.
To tackle this task, we propose an approach that combines a path-based technique and a distributional technique by concatenating two feature vectors: a feature vector constructed from dependency parser output and a feature vector obtained from term embeddings. Using the concatenated vector, we then train a binary supervised classifier based on the support vector machine (SVM) algorithm. The model predicts whether a term and its candidate hypernym are hypernym related or not.

Related Work
Most of the previous approaches to hypernymy detection are either path-based (pattern-based) or distributional. Recently, some approaches have taken advantage of the combination of path-based and distributional techniques.

Path-Based
Path-based approaches are heuristic methods that predict hypernymy between a pair of terms if they match a particular pattern in a sentence of the corpus. These patterns are either manually identified (Hearst, 1992) or automatically extracted (Snow et al., 2005; Navigli and Velardi, 2010; Sheena et al., 2016). Approaches based on handcrafted patterns yield good precision, but their recall is very low (Buitelaar et al., 2005). Approaches based on automatically learned patterns perform better, with a small improvement in precision and a considerable improvement in recall, but their main limitation is the sparsity of the feature space (Shwartz et al., 2016).

Distributional
Distributional approaches predict hypernym relations between terms based on their distributional representations, using either unsupervised or supervised models. The early unsupervised distributional models are based on symmetric measures (Lin, 1998). Later, asymmetric measures were introduced based on the Distributional Inclusion Hypothesis (DIH) (Weeds and Weir, 2003; Kotlerman et al., 2010). More recently, Santus et al. (2014) and Rimell (2014) introduced new measures based on the assumption that the DIH does not hold in all cases. Most of the supervised models, in turn, rely on term embeddings (Mikolov et al., 2013; Pennington et al., 2014) to build the feature vector for a pair of terms x and y. Various vector representations have been used, such as concatenation x ⊕ y (Baroni et al., 2012) and difference y − x (Roller et al., 2014; Weeds et al., 2014). More recently, Yu et al. (2015) and Luu et al. (2016) suggested that models relying on general term embeddings are useful for indicating similarity between words, but not for indicating hypernymy relations. Consequently, they learn their own term embedding models, which are better suited to indicating hypernym relations.

Combined Approaches
Approaches that combine distributional information with lexico-syntactic paths are based on the assumption that distributional approaches and path-based approaches have certain complementary properties. To the best of our knowledge, few works have integrated them (Mirkin et al., 2006; Kaji and Kitsuregawa, 2008). A recent work that integrates them is that of Shwartz et al. (2016). They use a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to encode dependency paths into a feature vector, and then concatenate this feature vector with the term embedding vectors of term x and term y.

System Description
As a preliminary step, we split each corpus into a training corpus and a testing corpus. The training corpus consists of all sentences that contain training terms (Concept/Entity), while the testing corpus consists of all sentences that contain testing terms (Concept/Entity). Sentences that contain both training and testing terms appear in both the training and the testing corpus.
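The split described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `split_corpus` and the whitespace-based term matching are assumptions made for the example, and the toy sentences are invented.

```python
def split_corpus(sentences, train_terms, test_terms):
    """Return (training_corpus, testing_corpus) as lists of sentences.

    A sentence goes into a corpus if it mentions at least one term of the
    corresponding term list, so a sentence may end up in both corpora.
    """
    def mentions_any(sentence, terms):
        # Assumption: naive, case-insensitive matching on space-delimited text.
        text = f" {sentence.lower()} "
        return any(f" {term.lower()} " in text for term in terms)

    training = [s for s in sentences if mentions_any(s, train_terms)]
    testing = [s for s in sentences if mentions_any(s, test_terms)]
    return training, testing


# Toy data (hypothetical, for illustration only).
sentences = [
    "a dog is an animal",
    "jazz is a music genre",
    "a dog listens to jazz",
    "nothing relevant here",
]
train_corpus, test_corpus = split_corpus(sentences, {"dog"}, {"jazz"})
```

In this toy run, the third sentence mentions both a training term and a testing term, so it appears in both corpora, while the last sentence appears in neither.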

Candidate Hypernyms
The first step in the system is to extract candidate hypernyms for the given training and testing data terms from a training corpus and a testing corpus respectively. We consider a term as a candidate hypernym if:
1. The term and its candidate occur in the same sentence.
2. The candidate exists in the vocabulary list.
3. The term and its candidate are noun phrases.
4. The term and its candidate are linked by a short dependency path.
We consider a dependency path short if it does not exceed two grammatical dependency relations. Short dependency paths allow us to represent paths similar to Hearst patterns as well as other patterns. For example, the dependency path between X and Y in the sentence S1 "X such as Y" is {nmod:such_as(X, Y)}, and in the sentence S2 "X includes Y" it is {nsubj(includes, X), dobj(includes, Y)}. We use the Stanford dependency parser (Marneffe et al., 2006) to extract dependency paths.
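The short-path condition can be illustrated with a small sketch. It assumes that the parser output for a sentence is already available as a list of (relation, head, dependent) triples; the function names and the use of plain words as node identifiers are assumptions made for the example (a real parse would use token indices to distinguish repeated words).

```python
from collections import deque

MAX_RELATIONS = 2  # a path is "short" if it uses at most two relations


def dependency_path(edges, source, target):
    """Return the shortest list of (relation, head, dependent) edges that
    links source and target, ignoring edge direction, or None if the two
    nodes are not connected."""
    neighbours = {}
    for edge in edges:
        _, head, dependent = edge
        neighbours.setdefault(head, []).append((dependent, edge))
        neighbours.setdefault(dependent, []).append((head, edge))

    # Breadth-first search guarantees the first path found is the shortest.
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for next_node, edge in neighbours.get(node, []):
            if next_node not in seen:
                seen.add(next_node)
                queue.append((next_node, path + [edge]))
    return None


def is_short_path(edges, source, target):
    """True if source and target are linked by one or two relations."""
    path = dependency_path(edges, source, target)
    return path is not None and 0 < len(path) <= MAX_RELATIONS


# The two example sentences from the text, plus a longer (hypothetical) path.
s1_edges = [("nmod:such_as", "X", "Y")]
s2_edges = [("nsubj", "includes", "X"), ("dobj", "includes", "Y")]
long_edges = [("nsubj", "v", "X"), ("dobj", "v", "a"), ("nmod", "a", "Y")]
```

With these inputs, `is_short_path` accepts the paths of S1 (one relation) and S2 (two relations) and rejects the three-relation path in `long_edges`.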

Feature Vector
The feature vector used to learn a model capable of predicting hypernym relations between a term and a candidate hypernym consists of the concatenation of two vectors: the first one is a vector extracted using a path-based technique while the second is extracted using a distributional technique.
The path-based vector consists of a set of features representing the short dependency path between a term y and its candidate hypernym x.
The feature set is: [Tag(x), GRel(x), HR, Freq, Tag(y), GRel(y)]. Tag(x) and Tag(y) are the POS tags of x and y, and GRel(x) and GRel(y) are the grammatical dependency relations of x and y. HR is the hypernym ratio of a dependency path (DP): the number of occurrences of the dependency path in which it indicates a hypernym relation, divided by the total number of occurrences of the same dependency path. Freq is the relative frequency of a dependency path: the number of occurrences of the dependency path divided by the total number of occurrences of all dependency paths.

$$
HR = \frac{\text{hypernym DP occurrences}}{\text{DP occurrences}}
\qquad
Freq = \frac{\text{DP occurrences}}{\text{Total DPs occurrences}}
$$

For the distributional vector, we use pre-trained 300-dimensional Word2Vec term embeddings trained on Wikipedia (Mikolov et al., 2013). We apply the difference between the embedding vector of term y and the embedding vector of term x (y − x) (Roller et al., 2014; Weeds et al., 2014). The term is either a single word or a multi-word expression.
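As a rough sketch of how HR, Freq, and the concatenated vector could be computed, consider the code below. The function names are hypothetical, and the example assumes that dependency paths are given as strings and that the categorical features (Tag, GRel) have already been encoded numerically, since the encoding is not specified here.

```python
from collections import Counter


def path_statistics(observations):
    """Compute (HR, Freq) for every dependency path.

    observations: iterable of (dependency_path, is_hypernym) pairs, one per
    extracted (term, candidate) occurrence in the training corpus.
    """
    total = Counter()
    positive = Counter()
    for path, is_hypernym in observations:
        total[path] += 1
        if is_hypernym:
            positive[path] += 1
    all_occurrences = sum(total.values())
    return {
        path: (positive[path] / total[path], total[path] / all_occurrences)
        for path in total
    }


def embedding_difference(y_vec, x_vec):
    """Distributional part of the feature vector: y - x, element-wise."""
    return [yi - xi for yi, xi in zip(y_vec, x_vec)]


def feature_vector(path_features, y_vec, x_vec):
    """Concatenate the (already numeric) path features with y - x."""
    return list(path_features) + embedding_difference(y_vec, x_vec)


# Toy observations (hypothetical): three "such as" paths, two of them
# labelled as hypernym relations, and one unrelated path.
observations = [
    ("nmod:such_as", True),
    ("nmod:such_as", True),
    ("nmod:such_as", False),
    ("nsubj+dobj", False),
]
stats = path_statistics(observations)
hr, freq = stats["nmod:such_as"]
vector = feature_vector([0.5, 0.25], [1.0, 2.0], [0.5, 0.5])
```

In this toy data the "such as" path occurs three times out of four path occurrences, twice as a hypernym relation, so its HR is 2/3 and its Freq is 3/4. With the 300-dimensional embeddings used in the paper, the distributional part alone would contribute 300 dimensions to each concatenated vector.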

Model Learning and Hypernym Discovery
In each training corpus, we extract a set of candidate hypernyms for each training term and label each candidate as hypernym or not hypernym using the gold hypernym data. Next, we represent each term and its candidate hypernym by a concatenated feature vector. These concatenated vectors are used to train the model. The classification method we use is an SVM with an RBF kernel (C = 1.0, gamma = 1/FeatureSize). The training dataset was unbalanced: the ratio of hypernym instances to not hypernym instances was less than 0.05. To better represent the two categories (hypernym and not hypernym) in the training set, we raised this ratio to 0.2 by randomly eliminating not hypernym instances (20% hypernym instances and 80% not hypernym instances). The classifier is then used to discover hypernyms: for each testing term, it predicts whether the term and each candidate hypernym extracted from the testing corpus are hypernym related or not. Each predicted hypernym is associated with a probability value. These values are used to rank the predictions and select the best fifteen hypernyms for each term (from higher to lower probability).
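The rebalancing and ranking steps can be sketched as below. This is an illustration under stated assumptions rather than the system's code: the function names are hypothetical, the classifier itself is left out, and the target is read as "hypernym instances make up 20% of the retained training set", following the 20%/80% split given in the text.

```python
import random


def undersample(instances, positive_share=0.2, seed=0):
    """Randomly drop not-hypernym instances until hypernym instances make
    up roughly positive_share of the retained set.

    instances: list of (features, label) pairs, label True for hypernym.
    """
    positives = [inst for inst in instances if inst[1]]
    negatives = [inst for inst in instances if not inst[1]]
    # Number of negatives to keep so that positives are positive_share of all.
    keep = round(len(positives) * (1 - positive_share) / positive_share)
    rng = random.Random(seed)
    kept_negatives = rng.sample(negatives, min(keep, len(negatives)))
    return positives + kept_negatives


def top_hypernyms(scored_candidates, k=15):
    """Rank candidates by predicted probability and keep the best k.

    scored_candidates: list of (candidate, probability) pairs for one term.
    """
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:k]]


# Toy data (hypothetical): 5 hypernym and 200 not-hypernym instances.
data = [([0.0], True)] * 5 + [([0.0], False)] * 200
balanced = undersample(data)

scored = [("animal", 0.9), ("thing", 0.4), ("mammal", 0.7)]
best_two = top_hypernyms(scored, k=2)
```

Here `balanced` keeps all 5 hypernym instances and 20 randomly chosen not-hypernym instances, and `best_two` is the two highest-scoring candidates in descending order of probability. As a side note, in scikit-learn's `SVC` the setting `gamma='auto'` corresponds to 1/n_features and `probability=True` enables probability estimates; whether the system used that library is not stated here.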

Results and Analysis
We submitted our system's predictions for three corpora: English, Medical, and Music. Table 1 (a, b, and c) shows the results of our system and other supervised systems in discovering hypernyms for Concept terms only. For the three corpora, our system performs better than the STJU system, and it performs better than the MFH system on the English corpus. In addition, the results show that our system performs well in discovering new hypernyms not defined in the gold hypernyms: it yields good False Positive values in the three corpora, and we achieve the best False Positive value in the Medical corpus (40).
Table 1: The evaluation results of our system and other supervised systems.
Our system's results were below expectations. A brief look at the output files reveals many empty lines, meaning that our system was unable to discover any hypernym for many terms; unexpectedly, these terms correspond to all the entity terms. In other words, our system lacks the ability to discover hypernyms for entity terms.
Table 2 (a, b, and c) shows the coverage of the Wikipedia pre-trained term embedding model (TEM) and the coverage of candidate hypernym extraction (CHE) for the training and testing terms of the three corpora (English, Medical, and Music). The table shows that our system is unable to discover hypernyms for a considerable number of terms, for two main reasons. First, the Wikipedia pre-trained term embedding model is limited in coverage: many terms (Concepts/Entities) are not covered by the pre-trained embeddings, so the system fails to discover hypernyms for them. For example, the TEM coverage of the Medical testing terms is 249 (50%), which means the system is unable to discover hypernyms for the 251 (50%) terms not covered by the pre-trained embeddings. Second, some of the conditions used to extract candidate hypernyms restrict the number of candidates. For instance, requiring a short dependency path between the term and its candidate causes the system to miss every candidate that is linked to the term only by a longer path. In addition, the term and its candidate hypernym must occur as noun phrases in the sentence. This condition prevents the extraction of candidate hypernyms for some entity terms that cannot be identified as noun phrases in the corpus, such as "Up All Night", "Someday Came Suddenly", and "Now What". As shown in Table 2, the CHE coverage for the English testing terms is 950 (63%), which means our system is unable to extract any candidate hypernym for 550 (37%) terms (398 entities and 152 concepts). Furthermore, our system suffers from a major computational issue when applied to a large corpus: parsing the corpus took too long and failed to complete before the submission deadline. We processed approximately 50% of the sentences of the English corpus and 80% of the sentences of the Music corpus, while we processed all sentences of the Medical corpus. This explains why our system performs better on the Medical corpus than on the other two corpora.

Conclusion
In this paper, we presented our system (EXPR), a combination of a path-based technique and a distributional technique, with which we participated in the Hypernym Discovery task of SemEval 2018. In this work, two feature vectors were extracted and concatenated: the first is obtained by running a dependency parser on sentences, and the second is obtained from a pre-trained term embedding model. A supervised classifier based on SVM is trained on a dataset composed of the concatenated vectors. This model is used to discover hypernyms for new terms. The results were good but fell short of our ambitions because of several issues, which our future work aims to address. We believe that relying on a term embedding model learned from the corpus provided in this task may be a good choice. In addition, we will work on defining new dependency links beyond those defined in this paper. We will also work on an unsupervised approach that uses a sequential pattern mining technique to automatically extract frequent sequential patterns between hyponym terms and their given hypernyms from the corpus.