SemEval-2018 Task 9: Hypernym Discovery

This paper describes the SemEval 2018 Shared Task on Hypernym Discovery. We put forward this task as a complementary benchmark for modeling hypernymy, a problem which has traditionally been cast as a binary classification task, taking a pair of candidate words as input. Instead, our reformulated task is defined as follows: given an input term, retrieve (or discover) its suitable hypernyms from a target corpus. We proposed five different subtasks covering three languages (English, Spanish, and Italian), and two specific domains of knowledge in English (Medical and Music). Participants were allowed to compete in any or all of the subtasks. Overall, a total of 11 teams participated, with a total of 39 different systems submitted through all subtasks. Data, results and further information about the task can be found at https://competitions.codalab.org/competitions/17119.


Introduction
Hypernymy, i.e. the capability to relate generic terms or classes to their specific instances, lies at the core of human cognition. It is not surprising, therefore, that identifying hypernymic (is-a) relations has been pursued in NLP for more than two decades (Shwartz et al., 2016): indeed, successfully identifying this lexical relation substantially improves Question Answering applications (Prager et al., 2008; Yahya et al., 2013), Textual Entailment and Semantic Search systems (Hoffart et al., 2014; Roller et al., 2014; Roller and Erk, 2016). In addition, hypernymic relations are the backbone of almost every ontology, semantic network and taxonomy (Yu et al., 2015), which are in turn useful resources for downstream tasks such as web retrieval, website navigation or records management (Bordea et al., 2015).
Generally, evaluation benchmarks for modeling hypernymy have been designed such that in most cases they are reduced to binary classification (Baroni and Lenci, 2011; Snow et al., 2004; Boleda et al., 2017; Vyas and Carpuat, 2017), where a system has to decide whether a hypernymic relation holds between a given candidate pair of terms. Criticisms of this experimental setting point out that supervised systems tend to benefit from the inherent modeling of the datasets in the hypernym detection task, leading to lexical memorization phenomena (Levy et al., 2015; Santus et al., 2016a; Shwartz et al., 2017). Recent work has attempted to alleviate this issue by including a graded scale for evaluating the degree of hypernymy on a given pair (Vulić et al., 2017).
Crucially, earlier work proposed to frame the problem as Hypernym Discovery, i.e. given the search space of a domain's vocabulary, and given an input term, discover its best (list of) candidate hypernyms. This formulation addresses one of the main drawbacks of the evaluation criterion described above, and better frames the evaluated systems within downstream real-world applications (Camacho-Collados, 2017). In fact, lessons learned from these studies have motivated the construction of a full-fledged benchmarking dataset for the shared task we present here, which covers multiple languages and knowledge domains. The main goal of this task is to complement current research in hypernymy modeling with this novel discovery setting.

Table 1: Some example terms and hypernyms extracted from different sources (see Section 4.1.4), for each of the subtasks and languages considered in the task.

Related Work
Traditionally, identifying hypernymic relations from text corpora has been addressed with two main approaches: pattern-based and distributional (Wang et al., 2017). Pattern-based (path-based) methods, which provide higher precision at the price of lower coverage, exploit the co-occurrence of a hyponym and its hypernym in a textual corpus (Hearst, 1992; Navigli and Velardi, 2010; Boella and Di Caro, 2013; Flati et al., 2016; Gupta et al., 2016; Pavlick and Pasca, 2017). Conversely, distributional models rely on a distributional representation for each observed word, and are capable of identifying hypernymic relations between concepts even when they do not co-occur explicitly in text. Earlier work on hypernym modeling was unsupervised, and leveraged various interpretations of the distributional hypothesis. Most of the recent work on the subject is, however, supervised, and mostly based on using word embeddings as input for classification or prediction (e.g. Baroni et al., 2012; Santus et al., 2014; Fu et al., 2014; Weeds et al., 2014; Sanchez Carmona and Riedel, 2017; Nguyen et al., 2017). As shown by Shwartz et al. (2016), pattern-based and distributional evidence can be effectively combined within a neural architecture. In this shared task we received systems of both kinds, including a combination of pattern-based and distributional cues, similar to the one mentioned above, which also proved to be highly effective (see Section 5).

Task Description
We define Hypernym Discovery operationally as the task of finding and extracting the appropriate hypernym(s) for a target input term. As input for the task, together with the target term, a large textual corpus (source corpus henceforth) is provided, and participating systems are intended to exploit this large source of textual data to retrieve (i.e. "discover") as many suitable hypernyms as possible for the target term. A different source corpus, as well as the corresponding vocabulary, is specified for each subtask and language (cf. Section 4) in order to set a level playing field for competing systems, and constrain their search space. For each input term (or hyponym) the expected output is a ranked list of candidate hypernyms (up to 15) drawn from the provided vocabulary. Some example input-output pairs (i.e. terms and corresponding hypernym lists) are shown in Table 1 for each subtask and language. Table 1 also reports the sources of hypernymy information beside each pair, which vary depending on the subtask and language, as detailed in Section 4.1.4.
The structure of our Hypernym Discovery task consists of five independent but related subtasks, split into two larger groups: general-purpose hypernym discovery and domain-specific hypernym discovery. Participants were allowed to submit systems for any individual subtask. Along with a specific source corpus and vocabulary, each subtask features its specific training and testing data, consisting of input terms and corresponding gold hypernym lists, obtained as described throughout Section 4.
General-Purpose Hypernym Discovery consists in discovering hypernyms in a large corpus of general-purpose textual data, gathered from different and heterogeneous sources. A system operating in this setting requires the flexibility to provide hypernyms for terms in a wide range of domains. In this shared task we consider three different languages for general-purpose hypernym discovery:
• English (subtask 1A), with a gold standard of 3,000 labeled terms;
• Italian (subtask 1B) and Spanish (subtask 1C), each with a gold standard of 2,000 labeled terms.
All the gold standards provide a balanced set of input terms, with different degrees of frequency and for different domains. The corresponding gold hypernyms have been extracted from multiple resources and manually validated (cf. Sections 4.1.4-4.1.5). Training and testing data are split evenly (50% training, 50% testing).
Domain-Specific Hypernym Discovery deals with the same problem, but constrains it to a specific domain of knowledge. As a consequence, in this case participants test their systems (which might be general or specifically tailored to the target domain) in a much more focused and reduced environment. In this shared task we focus on English and consider two different domains of knowledge:
• Medical (subtask 2A), with a gold standard of 1,000 labeled terms;
• Music (subtask 2B), also with a gold standard of 1,000 labeled terms.
As in the previous subtask, we provide a balanced set of terms and gold hypernyms, with different degrees of frequency and for different subdomains. Again, training and testing data are split evenly (50% training, 50% testing).
Subclass vs. Instance. Although many hypernym detection approaches tend to overlook this distinction, it is customary to consider two different varieties of the "is-a" relation: a subclass-of variety (e.g. a dog is a mammal), and an instance-of variety (e.g. Rome is a city). From a practical standpoint, the former occurs between two concepts, while the latter connects a named entity with a concept. We make this distinction explicit in our shared task by hand-labeling each input term as either a concept or a named entity. This strategy serves a double purpose: on the one hand, it helps reduce lexical ambiguity and narrow the search space of potential hypernyms even further; on the other hand, it enables participants to study and develop models specifically tailored to one of the two varieties, and possibly submit them separately. In this respect, Boleda et al. (2017) have indeed shown how systems tend to perform differently on these two kinds of hypernymy relation.

Task Data
In this section we present the data collection process carried out for each source corpus and gold standard featured in the task (Section 4.1). We then summarize and provide some global statistics on all these datasets (Section 4.2).

Data Collection Process
The process of collecting data for each subtask and language comprised five successive steps: compilation of the source corpus (Section 4.1.1), creation of the vocabulary (Section 4.1.2), collection and selection of the input terms (Section 4.1.3), extraction of the gold hypernyms (Section 4.1.4), and final filtering and validation of such hypernyms (Section 4.1.5).

Corpus Compilation
First, we selected and compiled a source corpus for each dataset, which was also considered in the vocabulary creation step (Section 4.1.2). Naturally, we considered three corpora as general and as large as possible for the general-purpose track, whereas for the domain-specific datasets we opted for more targeted and specific text collections.
General-purpose corpora. As source corpus for the English subtask (1A) we used the 3-billion-word UMBC corpus (Han et al., 2013), which is a resource composed of paragraphs extracted from the web as part of the Stanford WebBase Project (Hirai et al., 2000). The UMBC corpus is considerably large and contains information from many diverse domains. This corpus presents additional challenges and different sources of information with respect to the corpora used in previous tasks, such as Wikipedia in the SemEval 2016 task on taxonomy extraction (Bordea et al., 2016). In fact, the encyclopedic nature of Wikipedia has been exploited in a wide variety of works (Ponzetto and Strube, 2007; Flati et al., 2016; Gupta et al., 2016), and differs substantially from the web-based corpus we put forward here. As source corpus for the Italian subtask (1B) we instead used the 1.3-billion-word itWac corpus (Baroni et al., 2009), extracted from different sources of the web within the .it domain. Finally, as source corpus for the Spanish subtask (1C) we considered the 1.8-billion-word Spanish corpus (Cardellino, 2016), which also contains heterogeneous documents from different sources.
Domain-specific corpora. As source corpus for the medical domain (subtask 2A) we provided a combination of texts drawn from the MEDLINE (Medical Literature Analysis and Retrieval System) repository, which contains academic documents such as scientific publications and paper abstracts. This corpus contains 130 million words. For the music domain (subtask 2B), the source corpus we compiled is instead a concatenation of several music-specific corpora, i.e. music biographies from Last.fm contained in ELMD 2.0 (Oramas et al., 2016), articles from the music branch of Wikipedia, and a corpus of album customer reviews from Amazon (Oramas et al., 2017). The resulting corpus reaches 100 million words in total.

Vocabulary Creation
With the aim of simplifying the task for participants by providing a unified hypernym search space, we built a series of vocabulary files including all the possible hypernyms for each dataset. Each vocabulary was constructed by considering all the words occurring at least N times across the source corpus of the corresponding subtask. We set N to five and three in the general-purpose and domain-specific subtasks, respectively. We also included bigrams and trigrams, by considering all the instances present in any of the resources that we leveraged as part of the hypernym extraction process (see Section 4.1.4), provided that they also met the corresponding frequency thresholds. In order to reduce the high granularity of some hypernymy relations (for example, dog is an entity) we created an additional blacklist of very general terms not considered in the vocabulary files. This list was obtained semi-automatically. We first extracted the most common hypernyms from the lexical sources we used for creating the datasets. Then, we filtered the resulting blacklist by manually removing a number of suitable hypernyms that, despite being general, provided useful information worth taking into account (e.g. animal).
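The frequency-threshold step can be sketched as follows. This is a minimal illustration on a toy token list, not the organizers' actual pipeline: the function name `build_vocabulary`, the toy tokens and the toy blacklist are all assumptions, and n-gram handling is omitted.

```python
from collections import Counter

def build_vocabulary(tokens, min_count, blacklist=frozenset()):
    """Keep every word seen at least `min_count` times, minus blacklisted terms."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count and w not in blacklist}

# Hypothetical toy corpus; the task used N = 5 (general-purpose) and N = 3 (domain-specific).
toy_tokens = "dog animal dog entity animal dog cat entity entity animal".split()

# "entity" is frequent enough but blacklisted as too general; "cat" is too rare.
vocab = build_vocabulary(toy_tokens, min_count=3, blacklist={"entity"})
print(sorted(vocab))
```

In this toy run only dog and animal survive, which mirrors the intended effect of the threshold and the blacklist.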

Term Collection
After compiling a source corpus and a corresponding vocabulary, we selected a suitable collection of input terms (i.e. hyponyms) to construct the gold standard for each subtask. Term selection was based on three key constraints. First, as in the vocabulary creation step (Section 4.1.2), input terms were required to occur at least five and three times in the general-purpose and domain-specific datasets, respectively. Second, only terms up to trigrams were considered. Finally, we only allowed terms with at least one extracted hypernym (see Section 4.1.4) present in the corresponding vocabulary file.
We carried out the term collection process with a semi-automatic two-pass procedure, which we applied to the source corpus of each subtask. First, candidate terms were extracted automatically from the source corpus, taking into account frequency, type (i.e. concept or entity) and knowledge domain in order to produce a list as balanced and representative as possible. After a preliminary list of input terms was obtained, we carried out an extensive validation and refinement step by manually normalizing each item (e.g. changing plurals to singulars, capitalizing named entities and lowercasing concepts), and by pruning all the terms that appeared too vague or general, as well as terms with mis-attributed domains.

Automatic Hypernym Extraction
Once the terms were collected we proceeded to extract a set of candidate hypernyms from a number of heterogeneous taxonomies. We drew taxonomic information from the following lexical resources: WordNet (Miller, 1995), Wikidata (Vrandečić and Krötzsch, 2014), MultiWiBi (Flati et al., 2016), and Yago (Suchanek et al., 2007). In order to be able to use all hypernymy information seamlessly for languages other than English, we exploited the inter-resource mappings provided by BabelNet (Navigli and Ponzetto, 2012). For the domain-specific datasets we additionally used SnomedCT (Spackman et al., 1997) and MusicBrainz (Swartz, 2002) for the medical and music datasets, respectively.
The hypernym extraction process was carried out as follows: given a term (hyponym), we first retrieved all the BabelNet synsets which included the given term as a lexicalization; then, starting from each of those synsets, we iteratively visited the parent nodes across all the reference taxonomies up to five levels, and selected all the lexicalizations of the traversed synsets (i.e. concepts) as given by BabelNet, provided that they appeared in the corresponding vocabulary files (see Section 4.1.2).
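The bounded upward traversal can be illustrated with a small breadth-first sketch. It rests on simplifying assumptions: the taxonomy is a hypothetical dictionary mapping each node to its parents, each node stands for a single lexicalization, and the names `collect_hypernyms`, `toy_parents` and `toy_vocab` are invented for the example. No BabelNet API is used.

```python
from collections import deque

def collect_hypernyms(start, parents, vocabulary, max_depth=5):
    """Breadth-first walk over parent links, at most `max_depth` levels up."""
    found, seen = [], {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not climb beyond the depth limit
        for parent in parents.get(node, ()):
            if parent in seen:
                continue
            seen.add(parent)
            if parent in vocabulary:  # keep only in-vocabulary hypernyms
                found.append(parent)
            queue.append((parent, depth + 1))
    return found

# Hypothetical toy taxonomy (child -> parents).
toy_parents = {
    "poodle": ["dog"],
    "dog": ["canine", "pet"],
    "canine": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": ["animal"],
    "animal": ["organism"],
}
toy_vocab = {"dog", "canine", "mammal", "animal", "organism"}

hypernyms = collect_hypernyms("poodle", toy_parents, toy_vocab)
print(hypernyms)
```

Here organism sits six levels above poodle and is therefore never reached, while pet and carnivore are traversed but discarded because they are not in the toy vocabulary.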

Hypernym Validation
Starting from the candidate gold hypernyms extracted in the previous step, we carried out a validation step using human annotators. We leveraged crowdsourcing for the English data in subtask 1A (which featured the largest dataset), and then expert verification in all subtasks (including English).
Crowdsourcing. We validated the English gold standard (both training and test sets) using crowdsourcing workers from Amazon Mechanical Turk. To ensure quality, we required workers to have answered at least 500 prior HITs with an approval rate of at least 95%, and applied a qualification test. For each target term, we showed the workers multiple candidate hypernyms, extracted in the previous step (Section 4.1.4), and asked them to select all the correct hypernyms. We also added 20% of random false candidates to prevent bias towards a positive answer. Finally, we assigned each HIT to 3 workers and determined the gold label by majority voting. The resulting annotations yielded an inter-annotator agreement of 73%.
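The majority-voting rule over three workers can be sketched as below. The data layout (one list of boolean votes per candidate hypernym) and the names `majority_label` and `toy_hit` are assumptions made for illustration only.

```python
def majority_label(votes):
    """Return True when more than half of the workers accepted the candidate."""
    return sum(votes) * 2 > len(votes)

# Hypothetical HIT for the term "dog": candidate hypernym -> votes from 3 workers.
toy_hit = {
    "mammal": [True, True, False],
    "vehicle": [False, False, True],  # a random false candidate
    "animal": [True, True, True],
}

gold = sorted(h for h, votes in toy_hit.items() if majority_label(votes))
print(gold)
```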

Expert verification. Expert verification comprised two steps. First, all the extracted data was verified by an expert human annotator. In this first step, the annotator focused on removing the incorrect hypernyms, or normalizing them if required (e.g. plural to singular). This first verification was performed on all datasets except English, which underwent the crowdsourcing validation explained earlier. Then, all datasets (including the English one) were verified again by other experts. In this case, however, the annotators were given different guidelines: in particular, they were asked to fix clear hypernym errors (which may have been missed in the previous step) and to add obvious hypernyms which they found to be missing.

Table 2: Number of input terms in each dataset.
            1A      1B      1C      2A    2B
Trial       50      25      25      15    15
Training    1,500   1,000   1,000   500   500
Test        1,500   1,000   1,000   500   500

Table 2 shows the number of input terms in each dataset. Each dataset was split equally into training and testing, while the trial data provided fewer examples and could also be used as a development set. English (subtask 1A) was the largest dataset, with 1,500 terms (hyponyms) for training and another 1,500 for testing. For the Italian (subtask 1B) and Spanish (subtask 1C) datasets, 2,000 terms were given overall between training and testing. Finally, both domain-specific datasets (i.e. medical, subtask 2A, and music, subtask 2B) contained half of this quantity, with 1,000 terms each. Note that each term may be associated with one or (in most cases) more than one hypernym. Therefore, counting all the term-hypernym pairs per dataset, as is done in hypernymy detection datasets, would provide much larger figures. As an example, the number of term-hypernym pairs in the test gold standard is 7,048 for English, 4,770 for Italian, 6,070 for Spanish, 4,116 for the medical dataset, and 5,233 for the music dataset.

Evaluation
Parting ways from the classic precision-recall-F1 metrics used so far in hypernym detection/extraction, we decided to evaluate this shared task as a soft ranking problem. Systems were evaluated over the top 15 (at most) hypernyms retrieved for each input term, which let us assess their performance through Information Retrieval metrics. Let us briefly introduce each of them.

Mean Average Precision (MAP).
We use MAP as the main evaluation metric of this task. Intuitively, this metric should give a good estimate of the capability of a system to retrieve a sizable number of hypernyms from textual data, while also considering the precision of each of them. Formally:
$$\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q)$$
where Q is a sample of experiment runs and AP(·) refers to average precision, i.e. an average of the correctness of each individual obtained hypernym from the search space.
Mean Reciprocal Rank (MRR). MRR rewards the position of the first correct result in a ranked list of outcomes, and is defined as:
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$
where $\mathrm{rank}_i$ refers to the rank position of the first relevant outcome for the ith run. While its main field of application is Information Retrieval, it has also been used in NLP tasks such as collocation recognition (Wu et al., 2010; Rodríguez-Fernández et al., 2016).
In addition to the above, we also provide results according to P@k, i.e. the number of correctly retrieved hypernyms at different cut-off thresholds, specifically k ∈ {1, 3, 5, 15}. 13
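As an illustration, the three metrics can be computed over toy ranked lists as sketched below. The code follows common Information Retrieval definitions and is not the official scorer: in particular, normalizing average precision by min(|gold|, k) and dividing P@k by k are assumptions, and the official evaluation script may normalize differently. All function names and the toy runs are hypothetical.

```python
def average_precision(ranked, gold, k=15):
    """AP over the top-k list: precision at each correct rank, summed and
    divided by min(len(gold), k). The normalization is an assumption."""
    hits, total = 0, 0.0
    for i, candidate in enumerate(ranked[:k], start=1):
        if candidate in gold:
            hits += 1
            total += hits / i
    return total / min(len(gold), k) if gold else 0.0

def reciprocal_rank(ranked, gold, k=15):
    """1 / rank of the first correct hypernym in the top k, or 0 if none."""
    for i, candidate in enumerate(ranked[:k], start=1):
        if candidate in gold:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked, gold, k):
    """Fraction of the top-k slots filled with gold hypernyms (assumed form)."""
    return sum(c in gold for c in ranked[:k]) / k

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical system output and gold hypernyms for two input terms.
runs = [
    (["animal", "car", "mammal"], {"animal", "mammal"}),
    (["city", "capital"], {"capital"}),
]

map_score = mean(average_precision(r, g) for r, g in runs)
mrr_score = mean(reciprocal_rank(r, g) for r, g in runs)
p_at_1 = mean(precision_at_k(r, g, 1) for r, g in runs)
print(map_score, mrr_score, p_at_1)
```

For the first toy run the correct hypernyms sit at ranks 1 and 3, so its average precision is (1/1 + 2/3) / 2; for the second, the only gold hypernym is at rank 2, giving 1/2 for both average precision and reciprocal rank.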

Baselines
We compared the participating systems with both supervised and unsupervised baselines for each subtask, inspired by recent work on hypernym detection and discovery. In this section we briefly describe each of them.

Supervised Baselines
We first used a naïve most frequent hypernym (MFH) baseline, which simply returns, for each input term, the 15 most frequent hypernyms found in the training data. As a less naïve baseline, we also trained a transformation matrix (Mikolov et al., 2013; Fu et al., 2014), using the same optimization as the original Taxoembed implementation. This baseline retrieves the hypernyms in the vocabulary whose vectors are among the fifteen closest to the input term's vector after applying the transformation matrix. However, unlike in the original implementation, in this case we did not perform any a priori domain clustering of the embedding space, and thus used the same matrix for all input terms. 14 This second supervised baseline is referred to as vTE (vanilla Taxoembed).
13 Although only P@5 is displayed in the tables due to lack of space, the other thresholds were used in the official evaluation as well.
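The MFH baseline is simple enough to sketch in a few lines. The training-data layout (a list of term and hypernym-list pairs) and the names `train_mfh` and `toy_training` are assumptions for illustration; the transformation-matrix baseline is not covered here.

```python
from collections import Counter

def train_mfh(training_pairs, k=15):
    """Return the k hypernyms that occur most often in the training gold lists."""
    counts = Counter(h for _, hypernyms in training_pairs for h in hypernyms)
    return [h for h, _ in counts.most_common(k)]

# Hypothetical training data: (input term, gold hypernyms).
toy_training = [
    ("dog", ["animal", "mammal"]),
    ("cat", ["animal", "mammal"]),
    ("sparrow", ["animal", "bird"]),
]

mfh = train_mfh(toy_training, k=2)

def predict(term):
    """MFH ignores the input term and always returns the same ranked list."""
    return mfh

print(predict("oak"))
```

Because the prediction does not depend on the input, the baseline can only do well when a few hypernyms cover many terms, which is the lexical memorization effect discussed in the introduction.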

Unsupervised Baselines
We developed an unsupervised baseline by reducing hypernymy discovery to hypernymy detection. We generated a list of candidate hypernyms for each target word, and then employed unsupervised hypernymy detection measures to decide whether a hypernymy relation holds. We used the open-source code by Shwartz et al. (2017). 15 Our baseline starts by creating a distributional semantic model (DSM) for each domain/language (English, Spanish, Italian, Music and Medical). We used a non-directional window of size 5 as context type, and PPMI as feature weighting. Similarly to the hyponym selection step (Section 4.1.3), all the terms with a frequency of at least 3 occurrences in the source corpus are considered valid targets. For the context words, instead, we required a minimum of 100 occurrences, as in Shwartz et al. (2017). To generate candidates, we took the 50 most similar terms for each target word via cosine similarity in the DSM.
We chose the hypernym detection measures as representative algorithms from each "family" of unsupervised measures: APSyn (Santus et al., 2016b) as similarity measure, balAPInc (Kotlerman et al., 2010) as measure based on the distributional inclusion hypothesis, and SLQS (Santus et al., 2014) as measure based on informativeness. 16 Finally, we tuned the thresholds for the above measures by maximizing the average of the performance metrics on the training set, separately for each subtask and measure.
14 We used the open-source code available at https://bitbucket.org/luisespinosa/taxoembed
15 https://github.com/vered1986/UnsupervisedHypernymy
16 Following the conclusions from Shwartz et al. (2017), we set the hyper-parameters to: SLQS: median, PLMI, N = 100 and APSyn: N = 500.

Table 3 shows a summary of all participant systems, displaying their main features with respect to supervision and external resources used, if any.
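The candidate-generation step (PPMI-weighted window counts followed by a cosine nearest-neighbor search) can be sketched in plain Python as follows. This is a simplified illustration rather than the code by Shwartz et al. (2017): the frequency filters for targets and contexts are omitted, and the function names and the toy sentences are assumptions.

```python
import math
from collections import Counter, defaultdict

def ppmi_vectors(sentences, window=5):
    """Sparse PPMI vectors from symmetric co-occurrence counts within `window` tokens."""
    pair_counts = Counter()
    for sent in sentences:
        for i, target in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(target, sent[j])] += 1
    total = sum(pair_counts.values())
    row, col = Counter(), Counter()
    for (t, c), n in pair_counts.items():
        row[t] += n
        col[c] += n
    vectors = defaultdict(dict)
    for (t, c), n in pair_counts.items():
        pmi = math.log2(n * total / (row[t] * col[c]))
        if pmi > 0:  # positive PMI only
            vectors[t][c] = pmi
    return dict(vectors)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def candidates(target, vectors, top_n=50):
    """The `top_n` terms most similar to `target` by cosine similarity."""
    target_vec = vectors.get(target, {})
    scores = [(cosine(target_vec, vec), other)
              for other, vec in vectors.items() if other != target]
    return [other for _, other in sorted(scores, reverse=True)[:top_n]]

# Hypothetical toy corpus: "dog" and "puppy" share the context "barks".
toy_sentences = [["dog", "barks"], ["puppy", "barks"], ["car", "drives"]]
vectors = ppmi_vectors(toy_sentences)
print(candidates("dog", vectors, top_n=1))
```

The nearest neighbor of dog in this toy model is puppy, a co-hyponym rather than a hypernym, which is why a separate detection measure is still needed to filter the candidates.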

Results
A summary of the results is provided in Tables 4 to 8, which report results for English, Italian, Spanish, and the Music and Medical domains, respectively. Almost all systems performed better than the unsupervised baselines, while the supervised baselines proved more challenging, with few systems outperforming them. For English and for the Music and Medical domains, CRIM (Bernier-Colborne and Barriere, 2018) obtained the best results, by a large margin over the other systems and baselines. This system is based on learning a projection between hyponym-hypernym pairs in terms of their corresponding embeddings, and combines this module with an unsupervised system which uses Hearst-style patterns. For Italian, the best system was 300-sparsans r1 (Berend et al., 2018), a logistic regression model informed mostly by information coming from word embeddings; whereas for Spanish, the best performing team was NLP HZ (Qiu et al., 2018), who approached the task with a nearest neighbors algorithm trained on the provided training data.
From the summary tables we can also appreciate the difference in performance of the systems on concepts and entities. This difference is due to several factors, including the quantity and type of hypernyms that needed to be identified for each of the two types of term. Except for the Music domain, systems tended to perform better with entities than with concepts. This is probably due to the fact that entities share many frequently occurring hypernyms (e.g. person, company), which in principle favors the inherent lexical memorization (Levy et al., 2015) of supervised systems. As expected, systems also performed better in the specialized domains (i.e. medical and music) than in the general-domain dataset (34.05% and 40.97% MAP by the best systems in the medical and music domains, respectively, compared to 19.78% for the best system in the English dataset).
Finally, the results also show the clear superiority of supervised systems over unsupervised approaches in all languages and domains. Fully unsupervised systems achieved varying degrees of success. While in general they were outperformed by supervised systems, in some cases their performance came close, especially for concepts. For instance, the ADAPT (Maldonado and Klubika, 2018) system, which is based on a simple similarity measure applied to word embeddings, achieved a respectable MAP of 8.13% on the medical dataset, using neither supervision nor external resources. Supervised systems produced a larger gap for entities, probably due, as mentioned above, to the lower diversity of possible hypernyms.
Cross-evaluation. In addition to the normal setting, in which supervised systems were trained on the training data of the same dataset, we asked participants to train systems on the English general-purpose data and test them on the domain-specific datasets. This experiment enables us to test how a system could perform on a particular dataset when training data is not available. A few teams provided results in this setting, and the results showed that, even though trained on general data, these systems are still competitive with respect to other approaches. In fact, they also tended to outperform unsupervised systems and, in the medical dataset, for example, CRIM trained on the general English data outperformed all remaining participant systems trained on the medical training data.

Analysis
Inspired by previous tasks in taxonomy learning (Bordea et al., 2015), we sampled for each system 50 incorrect hypernyms (25 entities, 25 concepts) which were retrieved as first choice, and manually assessed their correctness. This evaluation of false positives is intended to account for the inevitable scenario in which not all possible correct hypernyms according to human judgement were included in the gold standard. The results in false positives were measured by accuracy (i.e. percentage of correct false positives on the given sample) and are displayed in Tables 4-8 under FPs.
In general, we observe that the systems' performances in this false positives experiment are correlated with the figures they obtained with the other automatic evaluation measures. Nonetheless, according to this false positives evaluation, most systems (both supervised and unsupervised) were able to retrieve some hypernyms which were not present in the gold standard. This result is encouraging, as hypernym discovery systems can not only be used to speed up the hypernym discovery process, but can also provide new hypernyms not considered beforehand. Unsupervised distributional methods (e.g. the unsupervised baselines) seemed to perform poorly overall, as these systems tended to retrieve similar words which are not necessarily hypernyms. For example, false positives for APSyn and balAPInc are characterized by a large number of co-hyponyms (e.g. Exodus and Genesis) and syntagmatically related words (e.g. orange and juice).
As regards the top performing systems, it is worth noting that they often tended to retrieve correct or near-correct hypernyms. The correct hypernyms that were counted as false positives were of several kinds: first, some hypernyms were present in the gold standard but normalized differently (for example, for About.com the gold standard contained website but not web site, retrieved by CRIM r1); second, systems retrieved hypernyms which were either more or less fine-grained than the gold standard hypernyms (e.g. the list of gold hypernyms for downfall includes natural phenomenon but not storm, discovered by some supervised systems); third, some systems were able to retrieve hypernyms which correspond to another sense of the hyponym not captured in the gold standard (e.g. facultad in Spanish can be either an educational institution or a virtue/ability, the latter not being captured by the gold standard but retrieved by the 300-sparsans r2 system). Perhaps surprisingly, this latter case also extends to baselines such as MFH: in fact, many named entities have very skewed sense distributions, with less popular senses corresponding to people, cities, or companies often unknown to most human annotators. In addition to these three common patterns, there are also other correct false positives which do not clearly correspond to any of them.

Conclusion
In this paper we have presented the SemEval 2018 task on Hypernym Discovery. We provided a large, reliable framework to evaluate hypernym discovery systems in various languages (English, Italian, and Spanish) and domains (medical and music). This evaluation framework aims at going beyond the common practice of seeing hypernymy detection as a binary classification task, and provides a more challenging setting, inherently closer to how the task should be modeled within downstream applications. We hope this framework will contribute to the development of hypernym discovery systems in several languages and, more generally, to a wider understanding of hypernymy from a computational perspective.
As far as the results are concerned, this newly proposed task proved to be challenging for all participating systems, leaving considerable room for improvement. It is clear from the figures that supervised systems perform considerably better than unsupervised systems. This might suggest that, given a well-defined downstream task, it could be more valuable to annotate hypernyms manually or semi-automatically (whenever possible) and then train a supervised system, than to propose unsupervised solutions with suboptimal performance. On the other hand, it is also noteworthy that the best system across three of the subtasks (i.e. CRIM) combined a supervised neural network architecture with the output of an unsupervised system using Hearst-style patterns (Hearst, 1992).