USAAR-WLV: Hypernym Generation with Deep Neural Nets

This paper describes the USAAR-WLV taxonomy induction system that participated in the Taxonomy Extraction Evaluation task of SemEval-2015. We extend prior work on using vector space word embedding models for hypernym-hyponym extraction by simplifying the means of extracting a projection that transforms any hyponym to its hypernym. This is done by making use of function words, which are usually overlooked in vector space approaches to NLP. Our system performs best in the chemical domain and achieved competitive results in the overall evaluation.


Introduction
Traditionally, broad-coverage semantic taxonomies such as CYC (Lenat, 1995) and the WordNet ontology (Miller, 1995) have been created manually with much effort, and yet they suffer from sparse coverage. This motivated the move towards unsupervised approaches that extract structured relational knowledge from text (Lin and Pantel, 2001; Snow et al., 2006; Velardi et al., 2013). Previous work in taxonomy extraction focused on rule-based, clustering and graph-based approaches. Although vector space approaches are popular in current NLP research, ontology induction studies have yet to catch on. Fu et al. (2014) proposed a vector space approach to hypernym-hyponym identification using word embeddings that trains a projection matrix to convert a hyponym vector to its hypernym vector. However, their approach requires existing hypernym-hyponym pairs for training before new pairs can be discovered.
Our system submitted to the SemEval-2015 taxonomy building task is most similar to the approach of Fu et al. (2014) in using word embedding projections to identify hypernym-hyponym pairs. In contrast to that method, ours does not require prior taxonomic knowledge.
Instead of training a projection matrix, we capitalize on the fact that a hypernym-hyponym pair often occurs in a sentence with an 'is a' phrase, e.g. "The goldfish (Carassius auratus auratus) is a freshwater fish". Intuitively, if we single-tokenize the 'is a' phrase prior to training a vector space, we can make use of the vector that represents the phrase to capture a hypernym-hyponym pair, such that the product of v(goldfish) and v(is-a) will be similar to v(fish), i.e. v(goldfish)×v(is-a) ≈ v(fish).
There is little or no previous work that manipulates non-content word vectors in vector space models for natural language processing. Often, non-content words were implicitly incorporated into vector space models by means of syntactic frames (Sarmento et al., 2009) or syntactic parses (Thater et al., 2010).
Our main contributions to ontological induction using vector space models are (i) the use of non-content word vectors and (ii) the simplification of a previously complex process of learning a hypernym-hyponym transition matrix. The implementation of our ontological induction approach is open-sourced and available on our GitHub repository.

Task Definition
Similar to Fountain and Lapata (2012), the SemEval-2015 Taxonomy Extraction Evaluation (TaxEval) task addresses taxonomy learning without the term discovery step, i.e. the terms for which to create the taxonomy are given (Bordea et al., 2015). The focus is on creating the hypernym-hyponym relations.
In the TaxEval task, taxonomies are evaluated through comparison with gold standard taxonomies. No training corpus is provided by the organisers of the task, and the participating systems are to generate hypernym-hyponym pairs from a list of terms in four different domains, viz. chemicals, equipment, food and science.
The gold standards used in evaluation are the ChEBI ontology for the chemical domain (Degtyarenko et al., 2008), the Material Handling Equipment taxonomy for the equipment domain, the Google product taxonomy for the food domain and the Taxonomy of Fields and their Different Sub-fields for the science domain. In addition, all four domains are also evaluated against the sub-hierarchies from the WordNet ontology that subsumes the Suggested Upper Merged Ontology (Pease et al., 2002).

Related Work
There are a variety of methods used in taxonomy induction. They can be broadly categorized as (i) pattern/rule based, (ii) clustering based, (iii) graph based and (iv) vector space approaches.

Pattern/Rule Based Approaches
Hearst (1992) first introduced ontology learning by exploiting lexico-syntactic patterns that explicitly link a hypernym to its hyponym, e.g. "X and other Ys" and "Ys such as X". These patterns can be manually constructed (Berland and Charniak, 1999; Kozareva et al., 2008) or automatically bootstrapped (Girju, 2003).
These methods rely on surface-level patterns, and incorrect items are frequently extracted because of parsing errors, polysemy, idiomatic expressions, etc.

Clustering Approaches
Clustering based approaches are mostly used to discover hypernym (is-a) and synonym (is-like) relations. For instance, to induce synonyms, Lin (1998) clustered words based on the amount of information needed to state the commonality between two words. Contrary to most bottom-up clustering approaches for taxonomy induction (Caraballo, 2001; Lin, 1998), Pantel and Ravichandran (2004) introduced a top-down approach, assigning hypernyms to clusters using co-occurrence statistics and then pruning each cluster by recalculating the pairwise similarity between every hyponym pair within the cluster.

Graph-based Approaches
In graph theory (Biggs et al., 1976), similar ideas are conceived with a different jargon. In graph notation, nodes/vertices form the atomic units of the graph, and nodes are connected by directed edges. A graph, unlike an ontology, regards the hierarchical structure of a taxonomy as a by-product of the individual pairs of nodes connected by directed edges. In this regard, neither a single root node nor a tree-like structure is guaranteed.
Disregarding the overall hierarchical structure, the crux of graph induction focuses on the different techniques of edge weighting between individual node pairs and graph pruning or edge collapsing (Kozareva and Hovy, 2010;Navigli et al., 2011;Fountain and Lapata, 2012;Tuan et al., 2014).

Vector Space Approaches
Semantic knowledge can be thought of as a multi-dimensional vector space where each word is represented as a point and semantic association is indicated by word proximity. The vector space representation for each word is constructed from the distribution of words across contexts, such that words with similar meanings are found close to each other in the space (Mitchell and Lapata, 2010; Tan, 2013).
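As a minimal illustration of proximity in such a space, the following sketch uses hand-picked toy vectors rather than trained embeddings (real models use hundreds of dimensions learnt from corpus co-occurrence):

```python
import numpy as np

# Toy 3-dimensional "embeddings"; purely illustrative values, not
# learnt from any corpus.
v_cat = np.array([0.9, 0.8, 0.1])
v_dog = np.array([0.8, 0.9, 0.2])
v_car = np.array([0.1, 0.2, 0.9])

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words lie closer together in the space.
assert cosine(v_cat, v_dog) > cosine(v_cat, v_car)
```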
Although vector space models have been widely used in other NLP tasks, ontology/taxonomy induction using vector space models has not been popular. It is only with the recent advances in neural nets and word embeddings that vector space models are gaining ground for ontology induction and relation extraction (Saxe et al., 2013; Khashabi, 2013).

Methodology
This section provides a brief overview of our system's approach to taxonomy induction. The full system is released as open-source and contains documentation with additional implementation details. Intuitively, the assumption is that all words can be projected to their hypernyms based on a transition matrix. That is, given a word x and its hypernym y, a transition matrix Φ exists such that y = Φx, e.g. v(fish) = Φ×v(goldfish).

Projecting a Hyponym to its Hypernym with Transition Matrix
Fu et al. (2014) proposed two projection approaches to identify hypernym-hyponym pairs: (i) uniform linear projection, where Φ is the same for all words and is learnt by minimizing the mean squared error of Φx − y across all word pairs (i.e. a domain-independent Φ), and (ii) piecewise linear projection, which learns a separate projection for different word clusters (i.e. a domain-dependent Φ, where a taxonomy's domain is bounded by its terms' cluster(s)). In both cases, hypernym-hyponym pairs are required to train the transition matrix Φ.
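For concreteness, the uniform linear projection can be recovered with ordinary least squares. The sketch below uses synthetic vector pairs in place of trained embeddings; dimensions, sample count and noise level are illustrative assumptions, not values from Fu et al.:

```python
import numpy as np

# Synthetic hyponym/hypernym vector pairs stand in for trained word
# embeddings (dimensions, sample count and noise are illustrative).
rng = np.random.default_rng(42)
dim, n_pairs = 20, 200
X = rng.standard_normal((dim, n_pairs))      # hyponym vectors (columns)
Phi_true = rng.standard_normal((dim, dim))   # "true" transition matrix
Y = Phi_true @ X + 0.01 * rng.standard_normal((dim, n_pairs))  # hypernyms

# Uniform linear projection: choose Phi minimizing ||Phi X - Y||^2 over
# all pairs.  lstsq solves X.T @ A = Y.T for A = Phi.T in least squares.
A, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
Phi_hat = A.T

assert np.allclose(Phi_hat, Phi_true, atol=0.1)  # Phi recovered from pairs
```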

Inducing a Hypernym with is-a Vector
Instead of learning a supervised transition matrix Φ, we propose a simpler unsupervised approach in which we learn a vector for the phrase "is-a". We single-tokenize the adjacent "is" and "a" tokens and learn the word embeddings with is-a forming part of the vocabulary in the input matrix.
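A minimal sketch of the single-tokenization step, assuming whitespace-tokenized lowercase input (the actual preprocessing pipeline may differ):

```python
import re

def single_tokenize_isa(sentence):
    # Merge the adjacent tokens "is" and "a" into the single token "is-a"
    # so that the phrase receives its own embedding during training.
    return re.sub(r"\bis a\b", "is-a", sentence)

print(single_tokenize_isa("the goldfish is a freshwater fish"))
# the goldfish is-a freshwater fish
```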
Effectively, we hypothesize that Φ can be replaced by the "is-a" vector. To achieve the piecewise projection effects of Φ, we trained a different deep neural net model for each TaxEval domain and assume that the "is-a" vector scales automatically across domains. For instance, the multiplication of the v(tiramisu) and v(is-a_food) vectors yields a proxy vector, and we consider the top ten word vectors that are most similar to this proxy vector as the possible hypernyms, i.e. v(tiramisu)×v(is-a_food) ≈ v(cake).
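The ranking step can be sketched as follows, with random toy vectors standing in for the trained domain embeddings. The element-wise product is one plausible reading of the vector "multiplication" above; the vocabulary, dimensionality and candidate count here are illustrative assumptions:

```python
import numpy as np

# Toy vocabulary of random embeddings standing in for a trained
# domain-specific word2vec space (illustrative only).
rng = np.random.default_rng(1)
vocab = ["tiramisu", "cake", "espresso", "dessert", "is-a"]
emb = {w: rng.standard_normal(8) for w in vocab}

def top_hypernym_candidates(hyponym, isa="is-a", k=10):
    # Element-wise product of the hyponym and "is-a" vectors gives a
    # proxy vector; its nearest neighbours are hypernym candidates.
    proxy = emb[hyponym] * emb[isa]

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    ranked = sorted((w for w in emb if w not in (hyponym, isa)),
                    key=lambda w: cos(proxy, emb[w]), reverse=True)
    return ranked[:k]

print(top_hypernym_candidates("tiramisu"))  # candidate hypernyms, best first
```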

Training Data
No training corpus is specified for the SemEval-2015 TaxEval task. To produce a domain-specific corpus for each of the given domains, we used a Wikipedia dump, preprocessed it using WikiExtractor, and then extracted the documents that contain the terms for each domain individually.
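The document selection step can be sketched as a simple term filter. The one-document-per-string format and bag-of-tokens matching are simplifying assumptions; the actual system may match multi-word terms differently:

```python
def filter_documents(documents, domain_terms):
    """Keep only documents that mention at least one domain term."""
    terms = {t.lower() for t in domain_terms}
    for doc in documents:
        # Crude tokenization: lowercase and strip basic punctuation.
        tokens = {tok.strip(".,;()") for tok in doc.lower().split()}
        if terms & tokens:
            yield doc

docs = ["Benzene is an organic chemical compound.",
        "Goldfish live in freshwater ponds."]
print(list(filter_documents(docs, ["benzene", "ethanol"])))
# ['Benzene is an organic chemical compound.']
```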
We trained a phrasal skip-gram word2vec model (Mikolov et al., 2013a) using gensim (Řehůřek and Sojka, 2010). The neural nets were trained for 100 epochs with a window size of 5 for all words in the corpus.

Evaluation Metrics
For the TaxEval task, the multi-faceted evaluation scheme presented in Navigli (2013) was adopted to compare the overall structure of the induced taxonomy against a gold standard, using an approach designed for comparing hierarchical clusters. The scheme evaluates (i) the structural measures of the induced taxonomy (left columns of Table 1), (ii) the comparison against the gold standard taxonomy (right columns of Table 1 and leftmost column of Table 2) and (iii) manual evaluation of the precision of novel edges (last row of Table 2). Regarding the two types of automatic evaluation measures, the structural measures provide a gauge of the system's coverage and the structural integrity, i.e. "tree-likeness", of the ontology produced by the hypernym-hyponym pairs, while the comparison against the gold standards gives an objective measure of the "human-likeness" of the system in producing a taxonomy similar to the manually-crafted one. Table 1 presents the evaluation scores for our system in the TaxEval task; the %VC and %EC scores summarize the performance of the system in replicating the gold standard taxonomies.

Results
In terms of vertex coverage, our system performs best in the chemical and WordNet chemical domains. Regarding edge coverage, our system achieves the highest coverage for the science domain and the WordNet chemical domain. Having high edge and vertex coverage significantly lowers the false positive rate when evaluating hypernym-hyponym pairs with precision, recall and F-score.
We also note that the extracted Wikipedia corpus that we used to induce the vectors lacks coverage for the food domain. In the other domains, we discovered all terms in the Wikipedia corpus plus the domains' root hypernym (i.e. |V| = #VC + 1). Table 2 presents the comparative results between the participating teams in the TaxEval task, averaged over all domains. We performed reasonably well compared to the other systems in all measures. While our system's F&M measure is low, it is only representative of the clusters we have induced as compared to the gold standard. To improve our F&M measure, we could reduce the number of redundant novel edges by pruning our system outputs, achieving results comparable to the other teams given our relatively high precision of novel edges.
A detailed evaluation of the results for the individual domains is presented in Bordea et al. (2015).

Conclusion
In this paper, we have described our submission to the Taxonomy Extraction Evaluation task of SemEval-2015. We have simplified a previously complex process of inducing a hypernym-hyponym ontology from a neural net by using the word vector for the non-content word text pattern "is a".
Our system achieved modest results when compared against the other participating teams. Given the simplicity of our approach to hypernym-hyponym relations, future research could apply the method to other non-content word vectors to induce other relations between entities. The implementation of our system is released as open-source.