CRIM at SemEval-2018 Task 9: A Hybrid Approach to Hypernym Discovery

This report describes the system developed by the CRIM team for the hypernym discovery task at SemEval 2018. This system exploits a combination of supervised projection learning and unsupervised pattern-based hypernym discovery. It was ranked first on the 3 sub-tasks for which we submitted results.


Introduction
The goal of the hypernym discovery task at Sem-Eval 2018 is to predict the hypernyms of a query given a large vocabulary of candidate hypernyms. A query can be either a concept (e.g. cocktail or epistemology) or a named entity (e.g. Craig Anderson or City of Whitehorse). Two types of data were provided to train the systems: a large unlabeled text corpus and a small training set of examples comprising a query and its hypernyms. More details on this task may be found in the task description paper (Camacho-Collados et al., 2018).
The system developed by the CRIM team for the task of hypernym discovery exploits a combination of two approaches: an unsupervised, pattern-based approach and a supervised, projection learning approach. These two approaches are described in Sections 2 and 3, then Section 4 describes our hybrid system and Section 5 presents our results.

Pattern-Based Hypernym Discovery
Pattern-based approaches to relation extraction have been discussed in the literature for quite some time (see surveys by Auger and Barrière (2008) and Nastase et al. (2013)). They can be used to discover various relations, including domain-specific ones (Halskov and Barrière, 2008) and more general ones, such as hypernymy. The patternbased approach to hypernym discovery was pioneered by Hearst (1992), who defined specific textual patterns (e.g. Y such as X) to mine hyponym/hypernym pairs from corpora.
This approach is known to suffer from low recall because it assumes that hyponym/hypernym pairs will occur together in one of these patterns, which is often not the case. For instance, using the training data of sub-task 1A, we found that the majority of training pairs never co-occur within the same paragraph in corpus 1A, let alone within a pattern that suggests hypernymy.
To increase recall, we extend the basic patternbased approach to hypernym discovery in two ways. First, we identify co-hyponyms for each query and add the hypernyms discovered for these terms to those found for the query. These cohyponyms are identified using patterns, and filtered based on distributional similarity using the embeddings described in Section 3.3. Furthermore, we discover additional hypernyms using a method based on the following assumptions: most multi-word expressions are compositional, and the prevailing head-modifier relation is hypernymy.
The co-hyponym patterns we use are limited to enumeration patterns (e.g. X 1 , X 2 and X 3 ). For hypernyms, we use an extended set of Hearst-like patterns which we selected empirically (e.g. Y such as X, Y other than X, not all Y are X, Y including X, Y especially X, Y like X, Y for example X, Y which includes X, X are also Y, X are all Y, not Y so much as X).
Our pattern-based hypernym discovery algorithm can be defined as follows: given a query q, 1. Create the empty set Q, which will contain an extended set of queries.
2. Search for the co-hyponym patterns in the corpus to discover co-hyponyms of q. Add these to Q and store their frequency (number of times a given co-hyponym was found using these patterns).
3. Score each co-hyponym q ∈ Q by multiplying the frequency of q by the cosine similarity of the embeddings of q and q . Rank the co-hyponyms in Q according to this score, keep the top n, 1 and discard the rest.
4. Add the original query q to Q.
5. Create the empty set of hypernyms H q .
6. For each query q ∈ Q, search for the hypernym patterns in the corpus to discover hypernyms of q . Add these to H q .
7. Add the head of each term in H q to this set, as well as the head of the original query q.
8. Score each candidate c ∈ H q by multiplying its normalized frequency 2 by the cosine similarity between the embeddings of c and q, and rank the candidates according to this score.
Although the pattern-based search for both cohyponyms and hypernyms can find terms not included in the provided vocabulary (which could also be useful), we discarded out-of-vocabulary terms because we had not learned embeddings for them.

Learning Projections for Hypernym Discovery
Several supervised learning approaches based on word embeddings have recently been developed for the task of hypernym detection and the related task of hypernym discovery (Camacho-Collados, 2017). The general idea is to learn a function that takes as input the word embeddings of a query q and a candidate hypernym h and outputs the likelihood that there is a hypernymy relationship between q and h. To discover hypernyms for a given query q (rather than classify a given pair of words), we apply this decision function to all candidate hypernyms, and select the most likely candidates (or all those classified as hypernyms). This decision function can be learned in a supervised fashion using examples of pairs of words that are related by hypernymy and pairs that are not. The supervised model can take as input a combination of the embeddings of q and h, and different ways of combining the embeddings for this purpose have been used (Baroni et al., 2012;Roller et al., 2014;Weeds et al., 2014). In related work, models have been proposed that learn to project the embedding of q such that its projection is close to that of its hypernym h (Fu et al., 2014;Yamane et al., 2016;Espinosa-Anke et al., 2016). This has been termed projection learning (Ustalov et al., 2017). The decision function is then based on how close the projection of q is to a given candidate h. Fu et al. (2014) introduced a model that learns multiple projection matrices representing different kinds of hypernymy relationships. In this model, each (q, h) pair is first assigned to a cluster based on their vector offsets, then projection matrices are learned for each cluster. Based on this work, Yamane et al. (2016) proposed a model that jointly learns the clusters and projection matrices.
We use a similar method to learn projections for hypernym discovery, but our approach differs from that of Yamane et al. (2016) in several ways: our model performs a soft clustering of query-hypernym pairs rather than a hard clustering, and we modified the training algorithm in several ways.

The Model
Given a query q and a candidate hypernym h, the model retrieves their embeddings e q , e h ∈ R d×1 using a lookup table. These embeddings were learned beforehand on a large unlabeled text corpus (i.e. the corpora provided for this task). The embedding e q is then multiplied by a 3-D tensor containing k square projection matrices φ i ∈ R d×d for i ∈ {1, . . . , k}, producing a matrix P ∈ R k×d containing the projections of e q : The model then checks how close each of the k projections of e q are to e h by taking the dot product: The column vector s ∈ R k×1 is then fed to an affine transformation and a sigmoid activation function (in other words, a logistic regression classifier) to obtain an estimate of the likelihood that q and h are related by hypernymy: To discover the hypernyms of a given query, we compute the likelihood y for all candidates and select the top-ranked ones.

The Training Algorithm
We train the model using negative sampling: for each positive example of a (query, hypernym) pair in the training data, we generate a fixed number m of negative examples by replacing the hypernym with a word randomly drawn from the vocabulary. 3 We then train the model to output a likelihood (y) close to 1 for positive examples and close to 0 for negative examples. This is accomplished by minimizing the binary cross-entropy of the positive and negative training examples. For a particular example, this is computed as follows: where q is a query, h is a candidate hypernym, t is the target (1 for positive examples, 0 for negative), and y is the likelihood predicted by the model. If we sum H for every example in the training set D (containing both the positive and negative examples), we obtain the cost function J = (q,h,t)∈D H(q, h, t). This function is minimized by gradient descent, using the Adam optimizer 4 (Kingma and Ba, 2014). A few details of the setup we use for training are worth mentioning: • We use a fixed number of projections (k) rather than the dynamic clustering algorithm of Yamane et al. (2016). For our official runs, we used k = 24.
• The word embeddings are normalized to unitlength before training.
• For the initialization of the projection matrices, we add random noise to an identity matrix, which means that at first, the projections of a query are simply k randomly corrupted copies of the query's embedding. • Dropout is applied to the embeddings e q and e h and the query projections P . For regularization, we also use gradient clipping, as well as early stopping.
3 Different ways of selecting the negative examples for this purpose have been proposed. See Ustalov et al. (2017). 4 We use β1 = β2 = 0.9.
• We sample positive examples using a function based on the frequency of the hypernyms in the training data, such that we subsample (q, h) pairs where h occurs often in the training data. The probability of sampling (q, h) is given by: where freq(h) returns the frequency of h in the training data. 5 • The word embeddings are optimized (or "fine-tuned") during training.
• We use a multi-task learning setup whereby we train two separate logistic regression classifiers, each with their own parameters W and b, and use one for queries that are named entities, and the other for queries that are concepts. 6 The rest of the parameters (i.e. the projection matrices φ) are shared.
The various hyperparameters mentioned above were tuned on the trial set (i.e. development set) provided for sub-task 1A.

The Word Embeddings
We learned term embeddings for all queries and candidates using the pre-tokenized corpora provided for sub-tasks 1A, 2A, and 2B. We preprocessed the corpora by converting all characters to lower case and replacing multi-word terms found in the vocabulary (candidates and lowercased queries) with a single token, starting with trigrams, then bigrams. 7 We then used the skipgram algorithm with negative sampling (Mikolov et al., 2013) to learn the embeddings.

Data Augmentation
For one of our 2 runs, we experimented with a method to add synthetic examples to the positive examples in the training set provided (D). This was meant to provide additional training data and avoid overfitting the embeddings of the words in the training set. We use 2 heuristics to generate these synthetic examples: 1. Given a positive example (q, h) ∈ D, add (q , h) to the positive examples, where q is the nearest neighbour of q, based on the cosine similarity of the embeddings of all the words in the vocabulary. This was motivated by the observation that nearest neighbours were often co-hyponyms.
2. Given a query q and the set H q containing the hypernyms of q according to the training data, compute the α nearest neighbours of each hypernym in H q , and for each neighbour that is shared by at least 2 of the hypernyms in H q , add that neighbour to H q . 8 Negative examples are generated for each of the synthetic examples, as with the actual positive examples in the training set.

Hybrid Hypernym Discovery
Our hybrid approach to hypernym discovery combines supervised projection learning and unsupervised pattern-based hypernym discovery (see Sections 2 and 3). To combine the outputs of the 2 systems, we take the top 100 candidates according to each, 9 normalize their scores and sum them, then rerank the candidates according to this new score. This reranking function favours candidates found by both systems, but also gives a chance to strong candidates found by a single system.

Experiments and Results
We submitted 2 runs on 3 of the 5 sub-tasks: 1A (general), 2A (medical), and 2B (music). The system outputs its top 15 predictions in all cases. The difference between the 2 runs is that for run 1, we used data augmentation (see Section 3.4) to train the supervised system -the same unsupervised output was used for both runs. We also submitted one run for cross-evaluation (training on 1A, but testing on 2A or 2B). First, we added the queries and candidates of 2A or 2B to those of 1A be-fore training embeddings on the corpus of 1A. 10 These embeddings were used to train the supervised model on the 1A training data. We then combined the predictions of the supervised and unsupervised models on test set 2A/2B. 11 A summary of our system's results is shown in Table 1. This table shows the mean average precision (MAP), mean reciprocal rank (MRR) and precision at rank 1 (P@1) of our system and those of the 2 strongest baselines which were computed by the task organizers. The first is a supervised baseline 12 and the second is based on the most frequent hypernyms in the training data. For more details, see (Camacho-Collados et al., 2018).
Our hybrid system was ranked 1st on all three sub-tasks for which we submitted runs. As shown in Table 1, the scores obtained using this system are much higher than the strongest baselines for this task. Furthermore, it is likely that we could improve our scores on 2A and 2B, since we only tuned the system on 1A.
If we compare runs 1 and 2 of our hybrid system, we see that data augmentation improved our scores slightly on 1A and 2B, and increased them by several points on 2A.
Our cross-evaluation results are better than the supervised baseline computed using the normal evaluation setup, so training our system on general-purpose data produced better results on a domain-specific test set than a strong, supervised baseline trained on the domain-specific data. Table 1 also shows the scores we would have obtained on the test set if we had used only the unsupervised (pattern-based) or supervised (projection learning) parts of our system. Note that the unsupervised system outperformed all other unsupervised systems evaluated on this task, and even outperformed the supervised baseline on 2A.
Combining the outputs of the 2 systems improves the best score of either system on all test sets, sometimes by as much as 10 points.
Notice also that the results obtained using only the supervised system indicate that data augmentation had a positive effect on our 2A scores only (compare runs 1 and 2), although our tests on the . The runs we submitted are the hybrid runs. The supervised and unsupervised runs were produced by using our 2 sub-systems separately. BaselineSUP is a strong, supervised baseline and BaselineMFH is the most-frequent-hypernym baseline. Cross-evaluation results were obtained by training the supervised system on 1A and evaluating on 2A or 2B.
trial set suggested it would also have a positive effect on our 1A scores. Given this observation, we find it somewhat surprising that run 1 is the best on all 3 test sets when we use the hybrid system. One possible explanation is that adding the synthetic examples makes the errors of the supervised system more different from those of the unsupervised system, and that this in turn makes the ensemble method more beneficial, but we haven't looked into this.

Ablation Tests
To assess the influence of different aspects of the supervised system and its training algorithm, we carried out a few simple ablation tests on subtask 1A. The baseline for these tests is our supervised projection learning system -we did not apply pattern-based hypernym discovery for any of these tests. We used the setup of run 1 (with data augmentation) and used the trial set for early stopping. We conducted the following tests (one by one, without combining any of the ablations): 1. No subsampling: we sample positive examples uniformly from the training set.
2. No MTL: instead of multi-task learning (MTL), we use a single classifier for both named entities and concepts.
3. Random init: the weights of φ are initialized randomly, instead of adding random noise to an identity matrix.
6. Frozen embeddings: the word embeddings are not fine-tuned during training.
The results obtained on test set 1A are shown in Table 2. 13 These results show that 2 of the techniques we used, namely subsampling and multitask learning, actually harmed our system's performance on test set 1A, although our experiments on the trial set suggested that they would be beneficial. This may be due to the small size of the trial set (i.e. 50 queries) or some difference in the underlying distributions of the trial and test sets.  On the other hand, fine-tuning the word embeddings during training seems to be one of the keys to the success of this approach, as are the use of multiple projection matrices, and the sampling of  multiple negative examples for each positive example. The way we initialize φ also seems to have helped quite a bit.
It is important to remember that a more thorough exploration of hyperparameter space would produce results very different from those of simple ablation tests such as these.
It is worth noting that our supervised model outperforms the supervised baseline provided for this task (see Table 1) even when it exploits a single projection matrix, however the difference in scores between these 2 systems is only 2 or 3 points, depending on the evaluation metric.
We should also note that the supervised model is prone to overfitting, and we found early stopping to be particularly important.

Qualitative Analysis
We manually inspected the results of run 1 on some of the 1A test queries to get a better idea of the ability of the model to discover hypernyms, and to identify potential sources of errors. Table 3 shows a few of these test cases. Below we will outline a few of our observations, and will refer repeatedly to examples in Table 3. The model is indeed able to discover valid hypernyms for both concepts and named entities. It seems that it can even handle very low-frequency queries in some cases (Suzy Favor Hamilton occurs only 5 times in the corpus), but we have not had the chance to investigate how sensitive the model is to term frequency.
Lexical memorization (Levy et al., 2015) can sometimes be observed. For example, person is the most frequent hypernym in the 1A training data, and the model often predicts this candidate incorrectly, even when its other top predictions are completely unrelated (e.g. tenpence).
The model can discover hypernyms of different senses of the same query (e.g. aquamarine, for which the top 15 predictions contain the valid hypernyms spectral color and primary color), and it sometimes discovers hypernyms for senses that are not represented in the gold standard (e.g. there is a college named Swarthmore, and hypostasis has senses related to linguistics and philosophy). It is likely that the senses that dominate the model's top predictions for a given query are its most frequent senses in the corpus. The case of vegetarian suggests that syntactic ambiguity is a source of errors: the predicted hypernyms include some that might be considered valid for the query vegetarian food, where vegetarian is an adjective, but not for the noun vegetarian.
Lastly, the model sometimes confuses concepts and named entities (e.g. Local Group, which refers to a group of galaxies). Preserving the true case of the characters in the corpus would mitigate this issue.

Concluding Remarks
Our approach to hypernym discovery combines a novel supervised projection learning algorithm and an unsupervised pattern-based algorithm which exploits co-hyponyms in its search for hypernyms. This hybrid approach produced very good results on the hypernym discovery task, and was ranked first on all 3 sub-tasks for which we submitted results.