INRIASAC: Simple Hypernym Extraction Methods

Given a set of terms from a given domain, how can we structure them into a taxonomy without manual intervention? This is Task 17 of SemEval 2015. Here we present our simple taxonomy structuring techniques which, despite their simplicity, ranked first in this 2015 benchmark. We use large quantities of text (English Wikipedia) and simple heuristics such as term overlap and document and sentence co-occurrence to produce hypernym lists. We describe these techniques and present an initial evaluation of results.


Introduction
This paper describes the simple hypernym extraction methods implemented in this first participation of Inria in the SemEval campaigns. We participated in Task 17 of the 2015 SemEval campaign (Bordea et al., 2015). This task consists of structuring a list of pre-identified domain terms into a list of hypernym pairs. Lists of terms automatically identified for four domains (equipment, food, chemical, science) were provided by the task organizers. For each domain, two lists were provided, one extracted from WordNet and one from another source, making eight lists in all. Using any resources, the participants were invited to return eight lists of pairs of terms, in which the first term was a hyponym of the second term. For example, if the words airship and blimp were included in the list of terms for a domain, the system was expected to return lines such as: (where the first number is a meaningless identifier).
Given the domain term lists supplied by the task organizers, we used Wikipedia (downloaded from http://dumps.wikimedia.org on August 13, 2014) as our only resource for discovering these relations. From the download source, we only extracted the text of articles, leaving out any categories, infoboxes, or other typed information.
The campaign organizers provided training data from the domains of Artificial Intelligence, vehicles and plants, which differ from the test domains.
The training data consisted of term lists (for plants), and term lists and lists of hypernyms (for AI and for vehicles). We examined these files to get an understanding of the task but did not process them in any way.

Domain Lists
We were provided with the following lists of domain terms, with no explanation of how they were created (though WN stands for WordNet): was absent from the domain list WN_chemical.terms. Participants were allowed to "add additional nodes, i.e. terms, in the hierarchy as they consider appropriate." We did not add any new terms, except for chemical in the WN_chemical list.

Preprocessing the resource
Our only resource for discovering hypernym relations was the English Wikipedia. Starting from the wiki-latest-pages-articles.xml, we extracted all the text between <text> markers, and marked off document boundaries using <title> markers. The text was then tokenized (Grefenstette, 1999) and output as one sentence per line, using our own programs. The first English Wikipedia sentence extracted looked like this: ' Anarchism ' is a political philosophy that advocates stateless societies often … based on nonhierarchical free associations . As mentioned, no other information (infoboxes, categories, etc.) was kept. We further applied Porter stemming (Willet, 2006) and stopword removal (Buckley et al., 1995), with each stopword replaced by an underscore. The lowercased first sentence, then, looked like: We also applied the same Porter stemming and stopword removal to the task-supplied domain terms. Thus science.terms, for example, becomes
0 electro-mechan system
1 biolog _ physic
2 histori _ religion _ eastern origin
3 linguist anthropolog
4 metaphys
We retained both the Porter-stemmed versions of the Wikipedia sentences and domain terms and the original unstemmed versions for the treatment described below.
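The normalization step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name preprocess, the whitespace tokenization, and the pluggable stem argument are hypothetical, and the default stemmer is the identity function rather than a real Porter stemmer.

```python
def preprocess(text, stopwords, stem=lambda word: word):
    """Lowercase a tokenized line, replace stopwords by '_', stem the rest.

    `stem` is a placeholder (identity by default); a real Porter stemmer
    would have to be supplied by the caller.
    """
    tokens = text.lower().split()
    return " ".join("_" if tok in stopwords else stem(tok) for tok in tokens)


print(preprocess("History of Religion", {"of"}))
```

Applying the same function to both the Wikipedia sentences and the domain terms keeps the two in a comparable form, which is what the later matching steps rely on.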

Extracting Hypernyms
In order to extract hypernyms, we used the following features: (i) presence of terms in the same sentence, (ii) presence in the same document, (iii) term frequency, (iv) document frequency, and (v) subsequences.

Subterms
In addition to the domain lists supplied for the SemEval task, we were supplied with training data. One file in this training data, ontolearn_AX.taxo, gives ground truth for the training file ontolearn_AX.terms, and contains: From these validated examples, we concluded that an "easy" way to find hypernyms is to check whether one term is a suffix of the other (e.g., communications satellite as a type of satellite), or whether one term B is the prefix of another term B A C where A is any two-letter word (e.g., helmet of coţofeneşti as a type of helmet; caterpillar d9 as a type of caterpillar). This heuristic was unexpectedly productive in the chemical domain, where many hypernym pairs were similar to ginsenoside mc as a type of ginsenoside.
We did not attempt to generalize the prefix matching to second words of length different from two, and so we missed hypernyms such as fortimicin b as a type of fortimicin or ginsenoside c-y as a type of ginsenoside.
These heuristics also produce false positives, such as licorice as a type of rice or surface to air missile system as a type of surface. They are often correct, however, so all term pairs they produced were kept as hypernym pairs without any filtering.
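A minimal sketch of the suffix and prefix heuristics might look like the following. The function name subterm_hypernyms is hypothetical, and the sketch assumes that the suffix test is a plain string suffix test (which is consistent with the licorice/rice false positive) and that the prefix test requires the word immediately following the prefix to be exactly two characters long.

```python
def subterm_hypernyms(terms):
    """Return (hyponym, hypernym) pairs found by the subterm heuristics."""
    terms = sorted(set(terms))
    pairs = set()
    for hypo in terms:
        for hyper in terms:
            if hypo == hyper:
                continue
            if hypo.endswith(hyper):
                # Suffix heuristic: "communications satellite" -> "satellite"
                pairs.add((hypo, hyper))
            elif hypo.startswith(hyper + " "):
                # Prefix heuristic: the next word must have two characters,
                # as in "ginsenoside mc" -> "ginsenoside"
                rest = hypo[len(hyper) + 1:].split()
                if rest and len(rest[0]) == 2:
                    pairs.add((hypo, hyper))
    return pairs


example_terms = ["satellite", "communications satellite", "ginsenoside",
                 "ginsenoside mc", "fortimicin", "fortimicin b",
                 "rice", "licorice"]
for pair in sorted(subterm_hypernyms(example_terms)):
    print(pair)
```

On this toy list the sketch reproduces both behaviours reported above: it accepts licorice as a type of rice, and it misses fortimicin b because the second word has only one character.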

Sentence and Document Co-occurrence Statistics
Any domain terms produced as possible hyponyms by the prefix or the suffix heuristic were no longer considered. For the remaining terms (which could, of course, include the hypernyms found by the suffix and prefix heuristics), we decided, after trying a number of alternatives described below, to use document frequency and the co-occurrence of terms in sentences to predict hypernym relations. Let $D_{porter}(term)$ be the document frequency of a Porter-stemmed term in the stemmed version of Wikipedia. Since Wikipedia article boundaries were stored, we considered each Wikipedia article as a separate document.
Let $SentCooc_{porter}(term_i, term_j)$ be the number of times that the Porter-stemmed versions of $term_i$ and $term_j$ appear in the same sentence in the stemmed English Wikipedia.
Given two terms, $term_i$ and $term_j$, that co-occur in at least one sentence, we decided that if $term_j$ appears in more documents than $term_i$, then $term_j$ is a candidate hypernym for $term_i$:
$$CandHypernym(term_i) = \{\, term_j : SentCooc_{porter}(term_i, term_j) > 0 \;\wedge\; D_{porter}(term_j) > D_{porter}(term_i) \,\}$$
This heuristically derived set is meant to capture the intuition that general terms are more widely distributed than more specific terms (e.g., dog appears in more Wikipedia articles than poodle).
Next, define the best candidate for $term_i$ as the term $term_k$ that appears in the most documents (the most articles in Wikipedia, here):
$$term_k = \arg\max_{term_j \in CandHypernym(term_i)} D_{porter}(term_j)$$
We then removed $term_k$ from $CandHypernym(term_i)$ and repeated the choice twice, thus retaining the three candidate hypernyms appearing in the most documents for each term not found by the prefix or suffix heuristics.
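The selection procedure above can be sketched in a few lines. This is an illustration under assumptions, not the code used for the submission: the function names count_statistics and top_hypernyms, the representation of a document as a list of preprocessed sentence strings, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_statistics(documents, terms):
    """Compute document frequency and sentence co-occurrence counts.

    `documents` is a list of documents, each a list of preprocessed
    (stemmed, space-separated) sentence strings.
    """
    doc_freq = Counter()
    sent_cooc = Counter()
    for sentences in documents:
        seen_in_doc = set()
        for sentence in sentences:
            padded = f" {sentence} "
            present = sorted(t for t in terms if f" {t} " in padded)
            seen_in_doc.update(present)
            for a, b in combinations(present, 2):
                sent_cooc[frozenset((a, b))] += 1
        doc_freq.update(seen_in_doc)
    return doc_freq, sent_cooc


def top_hypernyms(term, terms, doc_freq, sent_cooc, k=3):
    """Keep the k co-occurring terms with the highest document frequency."""
    candidates = [
        other for other in terms
        if other != term
        and sent_cooc.get(frozenset((term, other)), 0) > 0
        and doc_freq.get(other, 0) > doc_freq.get(term, 0)
    ]
    # Sort by decreasing document frequency; ties broken alphabetically.
    candidates.sort(key=lambda t: (-doc_freq[t], t))
    return candidates[:k]


# Toy corpus (invented for illustration): five one- or two-sentence documents.
docs = [["the poodl is a dog", "a dog is an anim"],
        ["the dog bark"], ["an anim eat"], ["anim and dog"], ["anim live"]]
vocab = ["poodl", "dog", "anim"]
df, cooc = count_statistics(docs, vocab)
print(top_hypernyms("poodl", vocab, df, cooc))
```

Sorting once and taking the first k entries is equivalent to choosing the most frequent candidate, removing it, and repeating the choice.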

Co-occurrence Example
Consider the following example. In the domain file science.terms there is the term biblical studies. The Porter-stemmed version of this term, biblic studi, appears in 887 sentences. Considering all the other terms in science.terms, we find that biblic studi appears 215 times in the same sentence as the stemmed version of theology (theologi), 111 times in the same sentence as stemmed history (histori), 50 times with religion, 43 times with music, and 42 times with science (scienc). We decided to keep the top three for simplicity, so this term contributed three lines to our submitted science.taxo file:
121 biblical studies history
122 biblical studies religion
123 biblical studies theology

Other Attempts at Finding Relations
We tried a number of other methods to find hypernyms, none of which gave satisfactory results on inspection. We implemented a method to recognize sentences containing Hearst patterns (list from (Cimiano et al., 2005)) involving the domain terms. For example, tape is in equipment.terms, and we were able to find stemmed sentences of the form A, B and other C … such as todai , sticki note , 3m #tape# @, and other@ #tape# ar exampl of psa ( pressuresensit adhes ) from which we should have been able to extract relations such as 3m tape is a type of tape, and sticky note is a type of tape. But we would have had to parse the sentence and been willing to add new terms to the derived hypernym lists (which was permitted by the organizers), and in our first participation in SemEval we did not want to make that processing investment yet.
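To make the "A, B and other C" pattern concrete, here is a naive sketch that matches it with a regular expression. It is not the recognizer described above: the function name and_other_pairs and the regular expression are hypothetical, and the input is an unstemmed, hand-reconstructed paraphrase of the example sentence.

```python
import re

# Everything before "and other" is treated as a comma-separated list of
# hyponyms; the single word after "and other" is treated as the hypernym.
AND_OTHER = re.compile(r"^(?P<items>.+?)\s*,?\s+and other\s+(?P<hyper>\w+)")


def and_other_pairs(sentence):
    """Return (hyponym, hypernym) pairs for the 'A, B and other C' pattern."""
    match = AND_OTHER.search(sentence)
    if not match:
        return []
    hyper = match.group("hyper")
    items = [i.strip() for i in match.group("items").split(",") if i.strip()]
    return [(item, hyper) for item in items]


sentence = "today , sticky note , 3m tape , and other tape are examples of psa"
print(and_other_pairs(sentence))
```

Without parsing, the naive split also proposes today as a type of tape, which illustrates why extracting reliable relations from such sentences requires more than pattern matching.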
We tried to discover the basic vocabulary (Kit, 2002) of each domain by counting the number of times that each term appeared in Wikipedia in the set phrase A, such as. For example, using all the terms from equipment.terms, we found
225 instances of equipment, such as
24 instances of internet, such as
4 instances of telescop, such as
2 instances of manual, such as
2 instances of manipul, such as
But this did not seem very useful or productive.
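Counting the set phrase can be sketched as below. The function name such_as_counts is hypothetical, and the sketch assumes tokenized sentences in which the comma is a separate token, as in the preprocessing described earlier; the three input sentences are invented.

```python
from collections import Counter


def such_as_counts(sentences, terms):
    """Count how often each term is directly followed by ', such as'."""
    counts = Counter()
    for sentence in sentences:
        padded = f" {sentence} "
        for term in terms:
            n = padded.count(f" {term} , such as ")
            if n:
                counts[term] += n
    return counts


toy_sentences = ["equipment , such as telescop , is costli",
                 "internet , such as it is",
                 "the equipment , such as tape"]
print(such_as_counts(toy_sentences, ["equipment", "internet", "telescop"]))
```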

Evaluation
Each participant in Task 17 of SemEval 2015 was allowed to submit one run for each of the 8 domains (see Table 1 for the names of the domains, and the number of hypernym pairs we submitted). The task organizers evaluated the submissions of the six participating teams, using automated and manual methods, and published their evaluation three weeks after the submission deadline. Our team placed first in the official ranking of the six teams.
The evaluation criteria, which were not published before the submission, combined the presence of cycles in the hypernyms submitted, the Fowlkes & Mallows measure of the overlap between the submitted taxonomy and the gold standard, the F-score ranking, the number of domains submitted (not all teams returned results for all domains), and a manual precision ranking (for hypernyms not present in the gold standard). The gold standards used by the task organizers came from published taxonomies, or from subtrees of WordNet (prefixed as WN_ above). Table 2 gives a quick evaluation of how well our simple hypernym extraction techniques fared on each gold standard. As Table 3 shows, most of the correct answers found come from the sentence and document cooccurrence method described in section 4.2.

Conclusion
Even though training data was provided for this taxonomy creation task, we did not exploit it in this, our first participation in SemEval. We implemented some simple frequency-based co-occurrence statistics and substring inclusion heuristics to propose a set of hypernyms. We did not implement any graph algorithms (cycle detection, branch deletion) that would be useful for building a true hierarchy. We hope to learn from interaction with the other participants what paths to explore in the future to improve recall.

Table 1.
Number of prefix and suffix hypernyms produced, compared to the total number of hypernyms returned for each domain. Suffix and prefix subterms account for 10% to 36% of the hypernyms we produced. The cooccurrence technique produced the most hypernym candidates.

Table 2.
Number of gold standard relations to find in the last column. Columns 2, 3 and 4 are the number of gold standard relations found by each technique. "union" is the union of columns 2, 3 and 4, since the cooccurrence technique can find relations that have also been found by the suffix and prefix techniques.

Table 3.
Percentage of correct answers found by each method.