LT3: A Multi-modular Approach to Automatic Taxonomy Construction

This paper describes our contribution to the SemEval-2015 task 17 on “Taxonomy Extraction Evaluation”. We propose a hypernym detection system combining three modules: a lexico-syntactic pattern matcher, a morpho-syntactic analyzer and a module retrieving hypernym relations from structured lexical resources. Our system ranked first in the competition when considering the gold standard and manual evaluation, and second in the overall ranking. In addition, the experimental results show that all modules contribute to finding hypernym relations between terms.


Introduction
Because of globalization and rapid technological evolution, it is no longer feasible to manually create and manage taxonomies for the large variety of scientific and technological (sub)domains. Beyond such domain-specific terminology, companies also wish to build their own mono- or bilingual taxonomies containing the relevant sector- and company-specific terminology. This clear need for automation has encouraged researchers to investigate how terminological and semantically structured resources such as taxonomies or ontologies can be automatically constructed from text (Biemann, 2005).
The SemEval-2015 "Taxonomy Extraction Evaluation" Task (Bordea et al., 2015) is concerned with automatically finding relations between pairs of terms and organizing them in a hierarchical structure. The task thus assumes that a list of domain-specific terms is already available, so that participants can focus on detecting the relations between these terms.
To tackle this SemEval taxonomy learning task, we propose a multi-modular approach that combines lexico-syntactic, morphological and external structured lexical information. We will describe our hypernym detection system in Section 2. The results of the evaluation are presented in Section 3, while Section 4 concludes this paper.

System Description
Our hypernym detection system contains three main components: a lexico-syntactic pattern-based approach, a morpho-syntactic analyzer and a module retrieving hypernym relations from a structured lexical resource. Each module takes as input a domain specific term list.

Pattern-based Approach
The first module that automatically detects hypernym relations is a lexico-syntactic pattern-based approach, based on Hearst (1992). The patterns are implemented as a list of regular expressions containing lexicalized expressions (e.g. like), as well as isolated Part-of-Speech tags (e.g. noun) and chunk tags, which represent different Part-of-Speech sequences (e.g. noun phrase (NP) = determiner + adjective + noun, adjective + noun, etc.). An example of these manually defined patterns is "NP {, NP}* {,} or/and other NP", as in "green beans, carrots, peas and other vegetables", which results in three hypernym pairs: (vegetables, green beans), (vegetables, carrots) and (vegetables, peas).
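As an illustration only, the "and/or other" pattern can be approximated on raw text with a single regular expression. The sketch below is not the system's implementation: it assumes that a noun phrase is simply a run of one to three lowercase words, whereas the module described above matches Part-of-Speech and chunk tags. The names NP, AND_OTHER and extract_and_other are hypothetical.

```python
import re

# Crude stand-in for an NP chunk: one to three lowercase words.
# (Assumption: the real module matches chunk tags, not raw words.)
NP = r"[a-z]+(?: [a-z]+){0,2}"

# NP {, NP}* {,} and/or other NP
AND_OTHER = re.compile(
    rf"(?P<hyponyms>{NP}(?:, {NP})*),? (?:and|or) other (?P<hypernym>{NP})"
)


def extract_and_other(sentence):
    """Return (hypernym, hyponym) pairs found by the 'and/or other' pattern."""
    pairs = []
    for match in AND_OTHER.finditer(sentence.lower()):
        hypernym = match.group("hypernym")
        # Every enumerated NP becomes a hyponym of the NP after "other".
        for hyponym in match.group("hyponyms").split(", "):
            pairs.append((hypernym, hyponym))
    return pairs


print(extract_and_other("green beans, carrots, peas and other vegetables"))
```

Because the word-based NP approximation is greedy, it will absorb trailing words in longer sentences; this is one reason why chunk tags are the more reliable unit to match on.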
Domain-specific corpus. As this module aims to find hypernym relations by detecting terms occurring in specific lexico-syntactic constructions, we first needed to compile a domain-specific corpus containing these terms. To compile the corpus, we used the BootCaT toolkit (Baroni and Bernardini, 2004), which can be used to build a specialized web-based corpus starting from a list of seed terms. We considered the different term lists for task 17 as the "seed terms" to build the domain-specific corpora, allowing 10 queries per seed term. For technical reasons (the Bing search engine that is used by BootCaT only allows 5000 queries per user account), we only compiled corpora for three domains, namely equipment, food and science. As a post-processing step, we removed all sentences containing (1) only URL links or (2) no domain-specific term, resulting in three corpora containing about 12 million tokens for the food domain, 6 million tokens for the equipment domain and 27 million tokens for the science domain.
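The two post-processing filters can be sketched as follows. This is a minimal illustration under our own assumptions: the URL expression, the whole-word matching of terms and the function name keep_sentence are hypothetical and not taken from the original system.

```python
import re

# Assumed URL shape: anything starting with http(s):// or www.
URL = re.compile(r"(?:https?://|www\.)\S+")


def keep_sentence(sentence, terms):
    """Keep a sentence unless it holds only URLs or no domain-specific term."""
    without_urls = URL.sub(" ", sentence).strip()
    if not without_urls:
        return False  # filter (1): nothing but URL links
    lowered = without_urls.lower()
    # filter (2): at least one term must occur as a whole word or phrase
    return any(
        re.search(rf"\b{re.escape(term.lower())}\b", lowered) for term in terms
    )


terms = ["fish sauce", "soup"]
sentences = [
    "http://example.com www.example.org",
    "Fish sauce is common in Vietnamese cooking.",
    "The weather was nice.",
]
print([s for s in sentences if keep_sentence(s, terms)])
```

Scanning every term for every sentence is slow on corpora of several million tokens; a production version would more likely index the terms once, but the filtering logic would stay the same.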
Linguistic preprocessing. We performed a number of linguistic preprocessing steps in order to enrich the original web-based corpus: (1) tokenisation, (2) Part-of-Speech tagging, (3) lemmatisation and (4) chunking. All linguistic preprocessing was performed by means of the LeTs Preprocess toolkit (Van de Kauter et al., 2013).
Lexico-syntactic pattern matching. The resulting linguistically preprocessed corpus is the input for the pattern-based module. Example 1 shows a sentence matching the pattern "{other}* NP such as NP {, NP}* {(and-or) NP}*" (curly brackets indicate optional parts of the pattern), resulting in the two hypernym pairs (cranberry products, tablets) and (cranberry products, capsules). Each line of the example lists a token followed by its lemma, Part-of-Speech tag and chunk tag. As can be seen in example 1, the lexicalised parts of the patterns (other and such as in this case) are not considered for the generation of the hyponym-hypernym tuples.
(1) other      other      JJ   B-NP
    cranberry  cranberry  NN   I-NP
    products   product    NNS  I-NP
    such       such       JJ   B-AP
    as         as         IN   I-AP
    tablets    tablet     NNS  B-NP
    and        and        CC   O
    capsules   capsule    NNS  B-NP
We optimized the pattern-based model presented by Lefever et al. (2014) in different ways. The efficiency of the module was improved by only considering noun phrases containing a maximum of 6 consecutive nouns and by ignoring named entities. This appeared to be necessary as the web-based corpus contains a lot of lists and enumerations, causing problems for the recursive way the regular expressions are built. Precision, on the other hand, was improved by ignoring tuples in which the same two terms were extracted in both directions (e.g. hand truck - truck and truck - hand truck) and by only considering patterns that proved to obtain high precision in previous research (Lefever et al., 2014).
Finally, the output of the pattern-based module is filtered by only considering tuples where both terms (either lemma or full form) occur in the term list of the considered domain.
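The two filtering steps on the extracted tuples (discarding pairs found in both directions, and keeping only pairs whose terms occur in the term list as lemma or full form) could look roughly like the sketch below. The function filter_tuples and the lemma_of dictionary are hypothetical; in particular, the order in which the two filters are applied is our assumption.

```python
def filter_tuples(tuples, term_list, lemma_of):
    """Filter (hypernym, hyponym) tuples produced by the pattern matcher.

    lemma_of is a hypothetical full-form -> lemma mapping; in practice the
    lemmas would come from the preprocessed corpus.
    """
    terms = {term.lower() for term in term_list}

    def in_term_list(term):
        # Accept the full form first, then fall back to the lemma.
        term = term.lower()
        if term in terms:
            return term
        lemma = lemma_of.get(term, term)
        return lemma if lemma in terms else None

    candidates = set()
    for hypernym, hyponym in tuples:
        hyper, hypo = in_term_list(hypernym), in_term_list(hyponym)
        if hyper and hypo:
            candidates.add((hyper, hypo))
    # Drop every pair that was also extracted in the opposite direction.
    return {(a, b) for (a, b) in candidates if (b, a) not in candidates}


extracted = [
    ("cranberry products", "tablets"),
    ("truck", "hand truck"),
    ("hand truck", "truck"),
    ("vegetables", "unicorns"),
]
term_list = ["cranberry product", "tablet", "truck", "hand truck", "vegetable"]
lemma_of = {"cranberry products": "cranberry product", "tablets": "tablet",
            "vegetables": "vegetable"}
print(filter_tuples(extracted, term_list, lemma_of))
```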

Morpho-syntactic Analyzer
Our second hypernym detection module applies a morpho-syntactic approach where the morphological structure of compound terms is used to extract a hypernym-hyponym relation from this term. This approach is inspired by the head-modifier principle (Sparck Jones, 1979) stating that in a compound noun, the linear arrangement of the compound parts expresses the kind of information being conveyed, the head referring to the more general semantic category, whereas the modifiers restrict the sense of the compound term. This way, the complete compound term can be considered as a hyponym of the head term. We implemented rules for three different syntactic hypernym-hyponym relations in compounds:
1. Single-word terms: If term T0 is a suffix string of term T1, T0 is considered to be a hypernym of T1. Examples of hypernym pairs detected within single-word terms are (sachertorte, torte), (candlepin, pin) and (psycholinguistics, linguistics).

2. Multiword terms: If term T0 is the head term of term T1, T0 is considered to be a hypernym of T1. As the head of a nominal phrase appears at the right edge of a multiword NP in English, the last constituent of the NP is regarded as the head of the compound, and thus as the hypernym of the complete term, as is the case in the generated hypernym pair (béarnaise sauce, sauce). It is important to mention that we also allow multiple possible hypernyms in case different terms occur as suffixes of the compound term, e.g. phu quoc fish sauce is the hyponym of both sauce and fish sauce.

3. Complex prepositional phrases: If term T0 is the first part of a term T1 containing a noun phrase + preposition + noun phrase, T0 is considered to be a hypernym of T1. In the case of a prepositional compound phrase, the head is situated at the left edge of the compound term. Examples of such hypernym pairs are (sociology of culture, sociology) and (soup all'imperatrice, soup).
In addition, we added some restrictions to these general rules in order to improve the precision of the module. First, we set a threshold of a minimum of three characters for the detection of valid hypernyms. An example of an invalid hypernym filtered out this way is tu, which could otherwise be detected as a hypernym of pesarattu, as both terms occur in the food term list. Second, we noticed that food terms (e.g. dishes) are often loan words from other languages. Therefore we added a list of foreign adjectival affixes (e.g. the French affix al/ale) that should not be considered as a hypernym of the compound term. This way we prevent, for instance, ale from being detected as the hypernym of chicken provencale or café royale.
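The three rules and the two restrictions can be summarised in a short sketch. It is an approximation under stated assumptions rather than the actual implementation: suffix matching is done on the raw term string for both single-word and multiword terms, the preposition list is a small hypothetical sample (it does not cover, for example, all'), and the affix blocklist only holds the al/ale example given above.

```python
MIN_HYPERNYM_LENGTH = 3
# Hypothetical blocklist; only the French affix al/ale is named in the text.
FOREIGN_AFFIXES = {"al", "ale"}
# Hypothetical, incomplete preposition list.
PREPOSITIONS = {"of", "in", "for", "with", "on"}


def morpho_hypernyms(term, term_list):
    """Return the terms from term_list proposed as hypernyms of term."""
    words = term.split()
    prep_index = next(
        (i for i, word in enumerate(words) if word in PREPOSITIONS), None
    )
    found = set()
    for candidate in term_list:
        if candidate == term or len(candidate) < MIN_HYPERNYM_LENGTH:
            continue  # restriction 1: at least three characters
        if candidate in FOREIGN_AFFIXES:
            continue  # restriction 2: foreign adjectival affixes
        if prep_index is not None and prep_index > 0:
            # Rule 3: the head is the part before the preposition.
            if " ".join(words[:prep_index]) == candidate:
                found.add(candidate)
        elif term.endswith(candidate):
            # Rules 1 and 2: the candidate is a suffix of the term.
            found.add(candidate)
    return found


terms = ["sachertorte", "torte", "phu quoc fish sauce", "fish sauce", "sauce",
         "sociology of culture", "sociology", "culture", "pesarattu", "tu",
         "chicken provencale", "ale", "pineapple juice", "apple juice"]
for term in ["sachertorte", "phu quoc fish sauce", "sociology of culture",
             "pesarattu", "chicken provencale", "pineapple juice"]:
    print(term, "->", sorted(morpho_hypernyms(term, terms)))
```

Plain string suffix matching is deliberately permissive: it finds both sauce and fish sauce for phu quoc fish sauce, but it also proposes apple juice as a hypernym of pineapple juice.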

Structured Lexical Resources: WordNet
The third hypernym detection module retrieves information from an external lexical resource, in this case WordNet. This module looks up the synsets in WordNet for all domain-specific terms and retrieves all hypernyms appearing in the full hierarchical path of these synsets. Hypernym tuples containing identical terms were removed. Examples of hyponym-hypernym pairs retrieved from WordNet are (semantics, science) and (semantics, linguistics).
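Retrieving all hypernyms on the full hierarchical path amounts to a transitive walk up the hierarchy. The sketch below illustrates this on a toy dictionary that stands in for WordNet; its entries are invented for the example and are not copied from WordNet. With a real interface such as NLTK's WordNet reader, the lookup would presumably rely on synsets and their hypernym_paths instead.

```python
# Toy stand-in for WordNet: each term maps to its direct hypernyms.
# These entries are illustrative only.
DIRECT_HYPERNYMS = {
    "semantics": ["linguistics"],
    "linguistics": ["science"],
    "science": ["discipline"],
}


def inherited_hypernyms(term, direct=DIRECT_HYPERNYMS):
    """Collect every hypernym on any path from term up to the root."""
    seen = set()
    stack = list(direct.get(term, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(direct.get(current, []))
    return seen


def lexical_resource_module(term_list, direct=DIRECT_HYPERNYMS):
    """Return (hyponym, hypernym) pairs, skipping tuples with identical terms."""
    return {
        (term, hypernym)
        for term in term_list
        for hypernym in inherited_hypernyms(term, direct)
        if hypernym != term
    }


print(sorted(lexical_resource_module(["semantics"])))
```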

Combined System
To generate the final list of hypernym relations, we combined the output of all three modules and removed all duplicates from the hyponym-hypernym pair list.
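In code, this combination is a plain set union over the module outputs. The function name combine and the sample pairs are hypothetical; sorting is added only to make the output deterministic.

```python
def combine(*module_outputs):
    """Merge the (hyponym, hypernym) pairs of all modules without duplicates."""
    combined = set()
    for pairs in module_outputs:
        combined.update(pairs)
    return sorted(combined)


pattern_pairs = [("tablets", "cranberry products")]
morpho_pairs = [("fish sauce", "sauce"), ("sachertorte", "torte")]
resource_pairs = [("fish sauce", "sauce"), ("semantics", "science")]
print(combine(pattern_pairs, morpho_pairs, resource_pairs))
```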

Results
The resulting taxonomies are evaluated through comparison with gold standard relations collected from existing domain-specific ontologies and WordNet. In addition, expert evaluation has been performed on the hypernym relations submitted by the participants. The task organizers calculated precision, recall and F-score as well as a cumulative Fowlkes & Mallows measure, which is inspired by clustering evaluation and takes into account the hierarchical structure of the gold standard taxonomy and the taxonomy that is produced by the system.
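For reference, precision, recall and F-score over relation pairs follow the standard set-based definitions sketched below. This is not the organizers' scoring script, whose details are given in Bordea et al. (2015), and it does not cover the cumulative Fowlkes & Mallows measure; the sample pairs are made up.

```python
def precision_recall_f1(system_pairs, gold_pairs):
    """Standard set-based precision, recall and F1 over relation pairs."""
    system, gold = set(system_pairs), set(gold_pairs)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


system = [("a", "x"), ("b", "x"), ("c", "y"), ("d", "y")]
gold = [("a", "x"), ("b", "x"), ("e", "x"), ("f", "y"), ("g", "y")]
print(precision_recall_f1(system, gold))
```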
In addition, a number of structural measures were calculated, such as the number of distinct vertices and edges and the number of connected components and intermediate nodes, to evaluate whether the taxonomy connects all nodes with the root. This evaluation showed that our taxonomy contains cycles, which conflicts with a correct hierarchical structure. For more detailed information about the gold standards and evaluation metrics, we refer to Bordea et al. (2015). Table 1 lists the averaged Precision, Recall, F-measure and Fowlkes & Mallows scores for all participating systems, while Table 2 lists the individual scores for the three domains in which we participated. The very high recall scores for the WN data sets can be explained by the fact that our system also contains a module that retrieves hypernym relations from WordNet.
As we wanted to gain more insight into the contribution of the different modules, we also calculated precision and recall per module for the different term lists. The results per module and for the combined system are shown in Table 3. We notice that the WordNet module indeed achieves very high recall for the WN test sets, but that the recall for the more technical term lists is much lower (with only 0.021 for the Equipment data set). The recall achieved by the pattern-based module is also very modest, with scores ranging from 0.005 (Equipment) to 0.063 (WN Science). The morpho-syntactic module, on the other hand, contributes in a consistent way to the recall for all term lists. Finally, Table 3 also shows that the recall obtained by the system that combines all modules consistently beats the recall of the individual modules for all test sets. With regard to precision, we observe that the morpho-syntactic approach obtains very good results for all the different test domains, resulting in system precision scores that outperform all participating systems.
A qualitative analysis of the output revealed shortcomings of the different hypernym detection modules. As discussed above, the morpho-syntactic module achieves good recall. The downside is that the module clearly overgenerates. Examples of invalid hypernym pairs are (pineapple juice, apple juice), (hot and sour soup, sour soup) and (ice cream, cream). Although WordNet is a manually verified taxonomy, we also discovered invalid hypernym pairs in the output of the WordNet module. For the food domain, for instance, we discovered that all beverages have "food" as an inherited hypernym, resulting in hypernym pairs such as (pineapple juice, food) and (absinth, food).

Conclusion
To tackle the SemEval "Taxonomy Extraction Evaluation" task, we proposed a hypernym detection system combining a lexico-syntactic pattern matcher, a morpho-syntactic analyzer and a module retrieving hypernym relations from WordNet, and showed promising results for the different test domains. Analyzing the recall per hypernym detection module revealed that all modules contribute to the final hyponym-hypernym list generated by the combined system. In future work, we would like to improve the recall of the system by adding further hypernym detection modules (e.g. a distributional model built on the basis of the domain corpora). We will also add a dedicated module that constructs the taxonomy from the detected hierarchical relations in order to remove the cycles from the resulting taxonomy.