TAXI at SemEval-2016 Task 13: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling

We present a system for taxonomy construction that reached the first place in all sub-tasks of the SemEval 2016 challenge on Taxonomy Extraction Evaluation. Our simple yet effective approach harvests hypernyms with substring inclusion and Hearst-style lexico-syntactic patterns from domain-specific texts obtained via language model based focused crawling. Extracted taxonomies are evaluated on English, Dutch, French and Italian for three domains each (Food, Environment and Science). Evaluations against a gold standard and by human judgment show that our method outperforms more complex and knowledge-rich approaches on most domains and languages. Furthermore, to adapt the method to a new domain or language, only a small amount of manual labour is needed.


Introduction
In this paper, we describe TAXI, a taxonomy induction method first presented at the SemEval 2016 challenge on Taxonomy Extraction Evaluation (Bordea et al., 2016). We consider taxonomy induction as a process that should, as much as possible, be driven solely by raw text processing. While some labeled examples might be utilized to tune the extraction and induction process, we avoid relying on structured lexical resources such as WordNet (Miller, 1995) or BabelNet (Navigli and Ponzetto, 2010). We rather envision a situation where a taxonomy shall be induced for a new domain or a new language for which such resources do not exist. In this paper, we demonstrate our methodology based on hyponym extraction from substrings and from general-domain and domain-specific corpora for four languages and three domains.

Related Work
The extraction of taxonomic relationships from text is a long-standing challenge in ontology learning, see e.g. Biemann (2005) for a survey. The literature on hypernym extraction offers a high variability of methods, from simple lexical patterns (Hearst, 1992; Oakes, 2005), similar to those used in our method, to complex statistical techniques (Agirre et al., 2000; Ritter et al., 2009). Snow et al. (2004) use sentences that contain two terms which are known to be hypernyms. They parse the sentences and extract patterns from the parse trees. Finally, they train a hypernym classifier based on these features and apply it to text corpora. Yang and Callan (2009) presented a semi-supervised taxonomy induction framework that integrates co-occurrence, syntactic dependencies, lexico-syntactic patterns and other features to learn an ontology metric, calculated in terms of the semantic distance for each pair of terms in a taxonomy. Terms are incrementally clustered on the basis of their ontology metric scores. Snow et al. (2006) perform incremental construction of taxonomies using a probabilistic model. They combine evidence from multiple supervised classifiers trained on large training datasets of hyponymy and co-hyponymy relations. The taxonomy learning task is defined as the problem of finding the taxonomy that maximizes the probability of individual relations extracted by the classifiers. Kozareva and Hovy (2010) start from a set of root terms and use Hearst-like lexico-syntactic patterns to harvest hypernyms from the Web. The extracted hypernym relation graph is subsequently pruned. Velardi et al. (2013) proposed a graph-based algorithm to learn a taxonomy from textual definitions, extracted from a corpus and the Web. An optimal branching algorithm is used to induce a taxonomy.
Finally, Bordea et al. (2015) introduced the first shared task on Taxonomy Extraction Evaluation to provide a common ground for evaluation. Six systems participated in the competition. The top system in this challenge used features based on substrings and co-occurrence statistics (Grefenstette, 2015). Lefever et al. (2015), who reached the second place, gathered hypernyms from patterns, substrings and WordNet. Tan et al. (2015) used word embeddings, reaching the third place.

Taxonomy Induction Method
Our approach is characterized by scalability and simplicity, assuming that the ability to process larger input data is more important than sophisticated extraction machinery. It takes as input a set of domain terms and general-domain text corpora and outputs a taxonomy. It consists of four steps. Firstly, we crawl domain-specific corpora based on terminology of the target domain (see Section 3.1). These complement general-purpose corpora, like texts of Wikipedia articles. Secondly, candidate hypernyms are extracted based on substrings and lexico-syntactic patterns (see Section 3.2). Thirdly, the candidates are pruned so that each term keeps only a few of its most salient hypernyms (see Section 3.3). The last step optimizes the overall taxonomy structure, removing cycles and linking disconnected components to the root (see Section 3.4).

Corpora for Taxonomy Induction
To build domain-specific taxonomies we use both general and domain-specific corpora.
General Domain Corpora. We use three general-purpose corpora in our approach. The second corpus is a concatenation of the English Wikipedia, Gigaword (Parker et al., 2009), ukWaC (Ferraresi et al., 2008) and news corpora from the Leipzig Collection (Goldhahn et al., 2012).
Domain-Specific Corpora. Lefever (2015) showed the usefulness for taxonomy extraction of domain-dependent corpora crawled from the Web using BootCat (Baroni and Bernardini, 2004). This method takes terms as input, which are randomly combined into sequences of a pre-defined length and sent to a Web search engine. The search results, i.e. the returned URLs, compose a domain-dependent corpus. The number of input terms, the number of queries and the amount of desired URLs impact the size of the corpus. With 1,000 web queries and 10 URLs per query, the expected size of the resulting corpus is around 300 MB. While Lefever (2015) shows that such small in-domain corpora can already be useful for taxonomy extraction, we assumed that better results can be obtained if bigger domain-specific corpora are used.
We therefore follow a different approach based on focused crawling, where BootCat is used only for initialization of seed URLs. We use the provided taxonomy terms as input for the BootCat method, generate 1,000 random triples, and use the retrieved URLs as a starting point for further crawling. Focused crawling is an extension to standard web crawling where URLs expected to point to relevant web documents are prioritized for download (Chakrabarti et al., 1999). Remus and Biemann (2016) introduced a focused crawling approach based on language modeling. The idea is that relevant web documents refer to other relevant web documents, where the relevance of a web document is computed with a statistical n-gram language model of a small, initially provided, domain-defining corpus. We provide a domain-defining corpus for each category by using the Wikipedia articles that are directly contained in the matching Wikipedia category. For example, for the Food domain we used the Wikipedia articles of Category:Foods to build a language model of the Food domain. The language model for each domain was created as a 5-gram model with Kneser-Ney (1995) smoothing.
Using this technique, we are able to iteratively follow promising URLs and download web pages until a specified stopping criterion is met (no more pages with the desired perplexity, or a timeout). Each domain and language was crawled for about one week on a single server machine with 24 cores and 32GB RAM, harvesting between 130 and 800 GB of raw content, which results in 2 to 23 GB of unique plaintext sentences (cf. Table 1). Note that these sentences might contain cross-domain content.
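As an illustration, the perplexity-driven URL prioritization behind the focused crawler can be sketched as follows. This is a minimal sketch, not the actual implementation: a toy add-alpha-smoothed unigram model stands in for the 5-gram Kneser-Ney model, and the helper names `unigram_lm` and `crawl_order` are hypothetical.

```python
import heapq
import math
from collections import Counter

def unigram_lm(corpus_tokens, alpha=1.0):
    """Toy stand-in for the 5-gram Kneser-Ney model: an add-alpha
    smoothed unigram model built from the domain-defining corpus."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen tokens
    def logprob(token):
        return math.log((counts[token] + alpha) / (total + alpha * vocab))
    return logprob

def perplexity(logprob, tokens):
    """Per-token perplexity of a document under the language model."""
    if not tokens:
        return float("inf")
    return math.exp(-sum(logprob(t) for t in tokens) / len(tokens))

def crawl_order(frontier, logprob):
    """Return (perplexity, url) pairs in the order a focused crawler
    would dequeue them: lowest perplexity (most in-domain) first."""
    heap = [(perplexity(logprob, tokens), url) for url, tokens in frontier]
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]
```

Lower perplexity under the domain language model means a page looks more in-domain, so its outgoing links are explored earlier.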

Candidate Hypernyms via Substrings
A simple yet precise method for hypernym extraction is based on substring matching, cf. the baseline system in Table 3 and (Lefever, 2015). For instance, "biomedical science" is a "science", "microbiology" is a "biology", and so on. We calculate a substring-based hypernymy score σ(t_i, t_j) between a pair of candidate terms t_i, t_j, which is positive if and only if the match function m(t_i, t_j) returns true, i.e. if the term t_i matches inside the term t_j. Such a match requires length(t_i) to be greater than 3. For English and Dutch, the hypernym t_i should match at the end of the hyponym t_j, e.g. "natural science" is a "science". For French and Italian, the hypernym should match at the beginning of the hyponym, e.g. "algèbre linéaire" is an "algèbre", not a "linéaire". The same holds for English and Dutch if the hyponym contains a preposition, e.g. "toast with bacon" is a "toast", not a "bacon", and "brood van gekiemd graan" is a "brood", not a "graan". Finally, if no match is found, we lemmatize the terms t_i and t_j and retry the matching. The precision-recall curve of the substring score calculated on the trial dataset is presented in Figure 1. As one can observe, the precision of the substring score is consistently high, reaching 0.91 at a recall level of 0.29, with an AUC of 0.61. Therefore, this score retrieves a significant number of high-quality hypernyms. Yet, only hypernyms of compound words can be retrieved via substrings.
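The matching rules described above can be sketched as follows. This is a simplified illustration, not the exact implementation: the preposition lists are shortened and the lemmatization fallback is omitted.

```python
def substring_match(hyper, hypo, lang="en"):
    """Sketch of the match function m(t_i, t_j) described above."""
    hyper, hypo = hyper.lower().strip(), hypo.lower().strip()
    if len(hyper) <= 3 or hyper == hypo:
        return False
    if lang in ("en", "nl"):
        # Head noun at the end: "natural science" is a "science" ...
        # unless a preposition is present: "toast with bacon" is a "toast".
        preps = {"en": (" with ", " of "), "nl": (" met ", " van ")}[lang]
        for p in preps:
            if p in hypo:
                return hypo.split(p, 1)[0].endswith(hyper)
        return hypo.endswith(hyper)
    # French/Italian: head noun first, "algèbre linéaire" is an "algèbre".
    return hypo.startswith(hyper + " ")
```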

Candidate Hypernyms via Patterns
To extract candidate hypernym relations from texts, we used the three systems listed below. All of them rely on lexico-syntactic patterns in the fashion of (Hearst, 1992; Klaussner and Zhekova, 2011). We used several systems to filter noise via complementary signals. Besides, not all of the systems support all four languages of the SemEval task. Porting Hearst patterns to a new European language is a straightforward and relatively quick procedure. Yet, due to the dense SemEval schedule, we decided to implement new rules only for the two languages not supported by any available system, namely Italian and Dutch, and to reuse extraction rules for the other languages.

PattaMaika. This system was used to process English, Italian and Dutch corpora. It implements patterns using UIMA Ruta (Kluegl et al., 2014). First, part-of-speech information is used to assign noun phrase (NP) chunk annotations to nominal phrases. Next, we use patterns to identify hypernym relations between NP chunks. We adapted the 9 English rules to the target languages, resulting in 9 patterns for Italian and 8 patterns for Dutch.
PatternSim. This system was used to process English and French corpora. It encodes patterns in the form of finite state transducers implemented with the Unitex corpus processor.2 PatternSim relies on 10 English patterns, yielding an average precision of 0.69 over the top 5 extracted semantic relations per word (Panchenko et al., 2012). For French, 9 hypernym extraction patterns are used, providing a precision at top 5 of 0.63 (Panchenko et al., 2013).
WebISA. In addition to PattaMaika and PatternSim, we used a publicly available database of English hypernym relations extracted from the CommonCrawl corpus (Seitner et al., 2016). We used 108 million hypernym relations with frequency above one. This collection of relations was harvested using a regexp-based implementation of 59 patterns collected from the literature.
Combination of hypernyms. The result of the extraction are 18 collections of hypernym relations listed in Table 2. Even the huge WebISA collection, extracted from tens of terabytes of text, does not provide hypernyms for all rare taxonomic terms, such as "ground and whole bean coffee" and "black sesame rice cake". On the other hand, most of the collections contain many noisy relations. For instance, frequent relations for hypernyms often go in both directions, e.g. "history" is a "science", but also "science" is a "history". Therefore, we introduced an asymmetric pattern-based hypernymy score π(t_i, t_j) between terms t_i and t_j. It combines information from the different hypernym collections to filter noisy extractions. To compute the score, we first normalize extraction counts on a per-word basis: π_k(t_i, t_j) = f_k(t_i, t_j) / Σ_t f_k(t_i, t), where f_k(t_i, t_j) is the number of relations extracted between terms t_i and t_j by the k-th extractor. These normalized scores are averaged across all extractors per language-domain pair: π̄(t_i, t_j) = (1/|LD|) Σ_{k∈LD} π_k(t_i, t_j), where LD is the set of hypernym collections relevant for a given language-domain pair. For instance, for the language-domain pair "English-Food", LD contains four collections: general relations extracted by PatternSim, PattaMaika and WebISA, plus domain-specific relations extracted by PatternSim (see Table 2). Finally, to obtain the pattern-based score, we subtract the averaged scores of the two terms in both directions: π(t_i, t_j) = π̄(t_i, t_j) − π̄(t_j, t_i). This way, we down-rank symmetric relations like synonyms and co-hyponyms. The precision-recall curve of the pattern-based score on the trial data is presented in Figure 1. This plot is calculated on general corpora, as we did not crawl domain-specific corpora for the trial dataset domains. As one can see, a precision of 0.80 is achieved at a recall of 0.15 or less, and drops to 0.36 at a recall of 0.19. The AUC of 0.28 is less than half of the substring-based score's 0.61.
Thus, patterns are a less reliable source of hypernyms than the substrings. Yet, they can capture relations between words with different spelling like "apple" and "fruit", while the substring-based score needs a character overlap, like in "grapefruit" and "fruit".
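A minimal sketch of the combined pattern-based score, under the assumption that the per-word normalization divides each extraction count by the total count of relations the extractor found for that hyponym; `pattern_score` is a hypothetical helper name.

```python
from collections import defaultdict

def pattern_score(extractions):
    """Combine several hypernym collections into one asymmetric score.
    `extractions` maps extractor name -> {(hypo, hyper): count}."""
    normalized = []
    for counts in extractions.values():
        # per-word normalization: divide by the hyponym's total count
        totals = defaultdict(float)
        for (hypo, _), c in counts.items():
            totals[hypo] += c
        normalized.append({pair: c / totals[pair[0]] for pair, c in counts.items()})

    def avg(pair):
        # average the normalized scores across all extractors
        return sum(n.get(pair, 0.0) for n in normalized) / len(normalized)

    def score(hypo, hyper):
        # subtract the reverse direction to down-rank symmetric relations
        return avg((hypo, hyper)) - avg((hyper, hypo))

    return score
```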

Pruning of Hypernyms
Patterns and substrings together yield up to several hundreds of hypernym candidates per term. This step prunes the hypernym candidates, ranking them with unsupervised and supervised combinations of the σ(t_i, t_j) and π(t_i, t_j) scores.
Unsupervised Pruning. In this pruning strategy, used for French, Dutch and Italian, a term t_i is a hypernym of term t_j if their substring score σ(t_i, t_j) is greater than zero, or if the rank of the term t_i according to the pattern-based score π(t_i, t_j) is one or two. Thus a term t_j obtains all hypernyms extracted by substrings and up to two hypernyms extracted by patterns.
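The unsupervised pruning rule can be sketched as follows; this is an illustration under the assumption that σ and π are supplied as callables and that taxonomy edges are stored as (hyponym, hypernym) pairs.

```python
def unsupervised_prune(terms, sigma, pi, max_pattern_hypernyms=2):
    """Keep every substring hypernym (sigma > 0) plus the top-ranked
    pattern-based hypernyms (up to two) for each term.
    sigma(hyper, hypo) and pi(hypo, hyper) are scoring callables."""
    taxonomy = set()
    for t in terms:
        candidates = [h for h in terms if h != t]
        # all hypernyms found by substring inclusion
        taxonomy.update((t, h) for h in candidates if sigma(h, t) > 0)
        # up to two hypernyms with the highest positive pattern score
        ranked = sorted(candidates, key=lambda h: pi(t, h), reverse=True)
        taxonomy.update((t, h) for h in ranked[:max_pattern_hypernyms]
                        if pi(t, h) > 0)
    return taxonomy
```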
Supervised Pruning. This pruning strategy, used for English, relies on a supervised classifier trained on the trial dataset (Bordea et al., 2016).3 This pruning approach uses 3,249 hypernymy relations from the trial taxonomies as positive training samples, e.g. the hypernym relation ("biology", "science"), and 128,183 automatically generated relations as negative samples coming from two sources: 3,249 inverted hypernyms, such as ("science", "biology"), plus 124,934 co-hyponyms from the trial taxonomy, for instance ("biology", "mathematics").
The classifier used in the competition had two features that characterize a word pair (t_i, t_j), namely the substring- and pattern-based scores σ(t_i, t_j) and π(t_i, t_j). Note that the same features were used in the unsupervised approach. We applied an SVM classifier with RBF kernel (Vert et al., 2004), tuning kernel meta-parameters within an internal cross-validation loop.
We tested multiple alternative configurations with extra features, including term frequency, out/in degree of terms in the hypernym graph, term length, expansions of hypernyms based on term clustering, and shortest paths in the graph of candidate hypernyms, as well as other classifiers including Logistic Regression, Gradient Boosted Trees and Random Forest. However, none of the above-mentioned configurations yielded consistently better results on the trial data than the two-feature SVM.
To identify hypernyms among a set of terms T, we classify, using a model trained on the trial data, all possible ordered word pairs except identical ones, i.e. {(t_i, t_j) : t_i, t_j ∈ T, t_i ≠ t_j}. The pairs assigned to the positive class are added to the taxonomy.
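The supervised pruning step can be sketched as follows; as an assumption of this illustration, the trained RBF-kernel SVM is abstracted as a callable `classifier` over the two-feature vector.

```python
from itertools import permutations

def classify_pairs(terms, sigma, pi, classifier):
    """Describe every ordered pair of distinct terms by the two
    features (sigma, pi) and keep the pairs the classifier labels
    positive. `classifier` stands in for the trained SVM."""
    edges = []
    for hyper, hypo in permutations(terms, 2):
        features = (sigma(hyper, hypo), pi(hypo, hyper))
        if classifier(features):
            edges.append((hypo, hyper))  # store as (hyponym, hypernym)
    return edges
```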

Taxonomy Construction
At this point of the taxonomy construction, we have obtained a noisy graph, which may contain cycles and disconnected components. To remove cycles and to obtain a directed acyclic graph taxonomy, we used an unsupervised graph pruning approach, which searches for a cycle C using the topological sorting of Tarjan (1972) and then removes a random edge of C, until no cycles are detected in the graph.
In addition, to improve the connectivity of the taxonomy, we connect all nodes with zero out-degree in each disconnected component to the taxonomy root.
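A minimal sketch of the cycle-removal loop; as a simplification, plain DFS is used for cycle detection in place of Tarjan's topological sorting.

```python
import random
from collections import defaultdict

def break_cycles(edges, seed=0):
    """While the graph contains a cycle, delete one random edge of
    that cycle; edges are (hyponym, hypernym) pairs."""
    rng = random.Random(seed)
    edges = set(edges)

    def find_cycle():
        graph = defaultdict(list)
        for a, b in edges:
            graph[a].append(b)
        color, stack = {}, []  # 0 = unseen, 1 = on stack, 2 = done

        def dfs(v):
            color[v] = 1
            stack.append(v)
            for w in graph.get(v, []):
                if color.get(w, 0) == 1:      # back edge: cycle found
                    return stack[stack.index(w):] + [w]
                if color.get(w, 0) == 0:
                    cyc = dfs(w)
                    if cyc:
                        return cyc
            color[v] = 2
            stack.pop()
            return None

        for v in list(graph):
            if color.get(v, 0) == 0:
                cyc = dfs(v)
                if cyc:
                    return cyc
        return None

    while True:
        cyc = find_cycle()
        if cyc is None:
            return edges
        # remove a random edge on the detected cycle
        edges.remove(rng.choice(list(zip(cyc, cyc[1:]))))
```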

Evaluation
To assess the quality of the taxonomies, several complementary measures were used. The first type are structural measures, such as the number of connected components (c.c.), the number of intermediate nodes (i.n.), i.e. nodes that are neither the root nor leaves, and the presence of cycles. Second, system outputs were compared against the corresponding domain gold standards, and performance is evaluated in terms of F-score. Here, precision and recall are based on the number of edges in common with the gold standard taxonomy over the number of system edges and over the number of gold standard edges, respectively. To better compare against gold standard taxonomies, the task included the evaluation of a cumulative measure (Velardi et al., 2013), namely the Cumulative Fowlkes & Mallows Measure (F&M), where the similarity between the system and the reference taxonomies is measured as a combination of hierarchical cluster similarities. Finally, the organizers performed a manual quality assessment to estimate the precision of the hypernyms. To compute this measure, annotators labeled a sample of 100 hypernym relations as correct or wrong. The taxonomy extraction was evaluated on four languages, namely English, Dutch, French and Italian, and three domains (Food, Science and Environment). A detailed description of the evaluation settings and metrics can be found in (Bordea et al., 2016). Table 3 presents a summary of the evaluation of our method on the SemEval 2016 Task 13 dataset. Overall, 5 systems participated in the challenge: JUNLP, TAXI, NUIG-UNLP, USAAR and QASSIT. We report the respective best scores across our four competitors in the BestComp column.
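The edge-level gold-standard comparison described above can be sketched as:

```python
def edge_f1(system_edges, gold_edges):
    """Edge-level precision, recall and F-score of a system taxonomy
    against a gold-standard taxonomy (edges as (hypo, hyper) pairs)."""
    system, gold = set(system_edges), set(gold_edges)
    common = len(system & gold)
    p = common / len(system) if system else 0.0
    r = common / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```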

Results
Gold Standard Comparison. The organizer-provided Baseline system implemented a string inclusion approach that covers relations between compound terms. A similar mechanism was used by the USAAR system (Tan et al., 2016), which improved over the baseline in terms of precision at the cost of recall. USAAR achieved the highest precision scores for English, as they used substring-based methods that yield high precision (cf. Figure 1). Yet substrings cannot retrieve hypernyms of non-compound terms.
The main mechanisms we added in TAXI, compared to the substring-based methods, are statistics over pattern-based extractions from large domain-specific corpora and our taxonomy construction step that improves the structure of the resource. These combined mechanisms are not used in other submissions to the challenge. The NUIG-UNLP team (Pocostales, 2016) relies on vector directionality in dense word embedding spaces. Such an approximation of patterns based on distributional similarity provided good recall, but attained low precision. The QASSIT team (Cleuziou and Moreno, 2016), who ranked second in the competition, uses patterns to extract hypernym candidates, but relies solely on Wikipedia. Subsequently, an optimization technique based on genetic algorithms is used to learn the parametrization of a so-called pretopological space, which leads to desired structural properties of the resulting taxonomy. While we use a simpler optimization procedure based on supervised learning, TAXI outperforms QASSIT in terms of comparisons with the gold standard. Possible reasons why our method performs better are that (1) QASSIT uses no substring features and (2) it relies on smaller general-purpose corpora, while we use larger domain-specific corpora.
Finally, JUNLP relies on substrings and relations extracted from BabelNet. We find the latter to be undesirable for taxonomy extraction. Indeed, a rich lexical resource such as BabelNet can be considered a taxonomy in itself. Interestingly, even with the BabelNet-based features, the system did not always reach the top precision and recall.
Manual Evaluation. Our system was ranked first in terms of manual judgments for Dutch, Italian and French, reaching an average precision of 0.625 across languages and domains. Precision for different language-domain pairs ranged from 0.90 for the Italian-Science pair to 0.23 for the French-Environment pair. For English, our system was ranked second with an average score of 0.20, while the substring-based USAAR system obtained a score of 0.49 and the third-ranked system a score of 0.09. We attribute the lower precision in the English run to the absence, in the supervised ranking scheme, of a limit on the number of extracted hypernyms per word.
Further detailed comparisons with other systems, with breakdowns with regard to different languages, domains and evaluation schemes, are presented by Bordea et al. (2016) and on the SemEval website.4

Discussion. One shortcoming of our method is its coverage: for instance, 774 of the 1,555 terms of the English Food domain are still not attached to any node. This is a typical issue with pattern-based approaches, since not all taxonomic relationships are spelled out explicitly in corpora. To tackle this shortcoming, we plan to use hypernym expansion based on distributional semantics (Biemann and Riedl, 2013).

Conclusion
We presented a technique for taxonomy induction from a domain vocabulary. It extracts hypernyms from substrings and from large domain-specific corpora bootstrapped from the input vocabulary. Multiple evaluations based on the SemEval taxonomy extraction datasets of four languages and three domains show state-of-the-art performance of our approach. An implementation of our method, featuring all language resources, is available for download.5