Detecting Asymmetric Semantic Relations in Context: A Case-Study on Hypernymy Detection

We introduce WHiC, a challenging testbed for detecting hypernymy, an asymmetric relation between words. While previous work has focused on detecting hypernymy between word types, we ground the meaning of words in specific contexts drawn from WordNet examples, and require predictions to be sensitive to changes in contexts. WHiC lets us analyze complementary properties of two approaches to inducing vector representations of word meaning in context. We show that such contextualized word representations also improve detection of a wider range of semantic relations in context.


Introduction
Language understanding applications like question answering (Harabagiu and Hickl, 2006) and textual entailment (Dagan et al., 2013) benefit from identifying semantic relations between words beyond synonymy and paraphrasing. For instance, given "Anand plays chess.", and the question "Which game does Anand play?", successfully answering the question requires knowing that chess is a kind of game, i.e. chess entails game. Such lexical entailment relations are asymmetric (chess ⇒ game, but game ⇏ chess), and detecting their direction accurately is a challenge.
While prior work has defined lexical entailment as a relation between word types, we argue that it is better defined between word meanings illustrated by examples of usage in context. Ignoring context is problematic since entailment might hold between some senses of the words, but not others. Consider the word game in the following contexts:
1. The championship game was played in NYC.
2. The hunters were interested in the big game.
Given the sentence "Anand is the world chess champion", chess ⇒ game in the first context, while chess ⇏ game in the second context.
Footnote 1: https://github.com/yogarshi/whic
Lexical entailment encompasses several semantic relations, with one important relation being hypernymy (Roller et al., 2014). In this work, we focus on hypernymy detection in context, and show that existing resources can be leveraged to automatically create test beds for evaluation. We introduce "Wordnet Hypernyms in Context" (WHIC, pronounced which), a large dataset automatically extracted from WordNet (Fellbaum, 1998) using examples provided with synsets. Crucially, WHIC includes challenging negative examples that assess the ability of models to detect the direction of hypernymy.
We use WHIC to determine the effectiveness of existing supervised models for hypernymy detection (Roller and Erk, 2016) applied to representations, not only of word types, but of words in context. Such contextualized representations are induced in two ways: the first is based on Context2Vec, a BiLSTM model that embeds contexts and words in the same space (Melamud et al., 2016); the second aims to capture geometric properties of the context in a standard word embedding space built using GloVe (Pennington et al., 2014).
We show that the two contextualized representations improve performance over context-agnostic baselines. The structure of WHIC lets us show that they have complementary properties: Context2Vec-based models have higher recall and tend to identify directionality much better than GloVe-based models. We also show that the context-aware representations improve performance on identifying a broader range of semantic relations.
Table 1
Words ($w_l$, $w_r$) | Exemplars ($c_l$, $c_r$) | Does $w_l \Rightarrow w_r$?
staff, stick | $c_l$ = He walked with the help of a wooden staff. $c_r$ = The kid had a candied apple on a stick. | Yes
staff, body | $c_l$ = The hospital has an excellent nursing staff. $c_r$ = The whole body filed out of the auditorium. | Yes
staff, stick | $c_l$ = The hospital has an excellent nursing staff. $c_r$ = The kid had a candied apple on a stick. | No

We frame hypernymy detection in context as a binary classification task. Each example consists of a 4-tuple ($w_l$, $w_r$, $c_l$, $c_r$), where $w_l$ and $w_r$ are word types, and $c_l$ and $c_r$ are sentences which illustrate each word usage. The example is treated as positive if $w_l \Rightarrow w_r$, given the meaning of each word exemplified by the contexts, and negative otherwise, as can be seen in Table 1.
As mentioned in Section 1, hypernymy is only one specific case of lexical entailment. The nature of entailment relations captured out-of-context can be broader depending on the test beds considered. These relations can include synonymy, hypernymy, some meronymy relations, and also cause-effect relations.

Motivation
Studying hypernymy detection in context is important for several reasons. First, many downstream tasks which might benefit from detecting hypernyms will have words appearing in specific contexts. Second, existing definitions (and, by extension, annotations) of lexical entailment do not explicitly or consistently address polysemy. For instance, the substitutional definition for entailment by Zhitomirsky-Geffet and Dagan (2009) asks the reader to think of a natural sentence that provides the missing context to the two words being considered, thus constraining the possible senses of the two words. On the other hand, Turney and Mohammad (2013) propose a relational definition, inviting the reader to imagine a semantic relation that connects the two words and constrains their possible senses. In contrast, we propose to detect hypernymy between word meanings described by specific contexts.
Lexical entailment or hypernymy in context is also different from recognizing textual entailment (RTE). RTE (Dagan et al., 2006, 2013) involves detecting entailment relations between sentences, while hypernymy is a relation between words. Additionally, the two contexts $c_l$ and $c_r$ in our task can be very different, unlike in textual entailment, where the premise and hypothesis are usually related. For instance, the first example in Table 1 illustrates a scenario where the hypernymy relation holds between staff and stick, but there is no entailment relationship between the two sentences. On the other hand, the sentence "Children smile and wave at the camera." entails "There are children present.", but there is no meaningful hypernymy relationship between words in the two sentences.
Finally, the proposed task is also related to, but different from, word sense disambiguation (WSD). Unlike WSD, this task eschews an explicit sense inventory, instead relying on the provided contexts to decide the specific relation between the words. This might provide a more natural way to think about word senses for (untrained) human annotators (Erk et al., 2013). WSD can in principle be used as a preprocessing step to address hypernymy detection in context, but it is not required. Also, WSD remains a challenging task (Moro and Navigli, 2015) and it might introduce errors early in the preprocessing pipeline.

WHIC : A Dataset for Lexical Entailment in Context
We require a dataset for studying hypernymy detection in context to satisfy the following desiderata: (1) the dataset should make it possible to assess the sensitivity of context-aware models to contexts that signal different word senses, and (2) the dataset should help quantify the extent to which models detect the asymmetric direction of hypernymy, rather than symmetric semantic similarity.
Existing datasets for lexical entailment (Baroni and Lenci, 2011; Baroni et al., 2012; Kotlerman et al., 2010) have driven progress on the out-of-context task only, and are therefore insensitive to context changes. In addition, they include a variety of negative examples without controlling for entailment direction. For instance, Baroni and Lenci (2011) use cohyponyms and random words as negative examples. Since cohyponyms are words that share a common hypernym (for example, salsa and tango are cohyponyms with respect to dance), hypernymy does not hold between them in any direction. On the other hand, random examples (also used by Baroni et al. (2012)) are likely to be detected using symmetric semantic similarity rather than asymmetric hypernymy detection.
CONTEXT-PPDB, a dataset for fine-grained lexical inference in context, was recently introduced. This dataset consists of word pairs along with a pair of sentential contexts, with a label indicating the semantic relation between the two words in the given contexts. However, since CONTEXT-PPDB only consists of 3700 sentence pairs, it provides only a small number of annotated examples per relation, making it difficult to train large supervised models on (we return to this dataset in Section 5).
We address these gaps by introducing WHIC, a large dataset automatically derived from WordNet (Fellbaum, 1998). WordNet groups synonyms into synsets and defines semantic relations such as hypernymy and meronymy between these synsets. Most synsets are further accompanied by one or more short sentences illustrating the use of the members of the synset. WHIC uses these example sentences as context for the words, and the hypernymy relations to draw candidate word pairs. The process starts from a seed list of words $W$ and proceeds as follows (see Figure 1):
1. For all word types $w \in W$, obtain synsets $S_w$.
2. For each synset $i \in S_w$, pick a hypernym synset $s_i^h$, with a corresponding word form $w_i^h$. Also obtain $c_i$ and $c_i^h$, which are example sentences corresponding to $w_i$ and $w_i^h$ respectively; ($w_i$, $w_i^h$, $c_i$, $c_i^h$) serves as a positive example. Repeat this process for all hypernyms (solid/green arrows in Figure 1).
3. Permute the positive examples to get negative examples, in which the contexts no longer signal the word senses for which hypernymy holds.
4. Flip the positive examples to get negative examples, in which the direction of hypernymy is reversed.

We run this process using the 9000 most frequent words from Wikipedia as $W$. WHIC satisfies the desiderata outlined above. The dataset has a well-defined focus, since we only pick hypernym-hyponym pairs. The negative examples generated in Steps 3 and 4 require discriminating between different word senses and entailment directions. Finally, with over 22000 examples distributed over 6000 word pairs, the dataset is large enough to train large supervised models. We define a 70/5/25 train/dev/test split, and ensure that each set contains different word pairs, to avoid memorization and overfitting.
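The construction process can be illustrated with a minimal sketch. It does not query WordNet: `TOY_SYNSETS` is a hypothetical in-memory stand-in built from the sentences in Table 1, and the function names are illustrative. The exact permutation scheme of Step 3 is an assumption here (a word pair keeps its right-hand context and borrows the left-hand context of another positive example with the same left word).

```python
from itertools import permutations

# Hypothetical toy stand-in for WordNet: word form, example sentence,
# and ids of hypernym synsets. Not the real WordNet data or API.
TOY_SYNSETS = {
    "staff.1": {"word": "staff",
                "example": "He walked with the help of a wooden staff.",
                "hypernyms": ["stick.1"]},
    "staff.2": {"word": "staff",
                "example": "The hospital has an excellent nursing staff.",
                "hypernyms": ["body.1"]},
    "stick.1": {"word": "stick",
                "example": "The kid had a candied apple on a stick.",
                "hypernyms": []},
    "body.1": {"word": "body",
               "example": "The whole body filed out of the auditorium.",
               "hypernyms": []},
}

def positive_examples(synsets, seed_words):
    """Steps 1-2: one (w, w_h, c, c_h) tuple per synset/hypernym link."""
    out = []
    for syn in synsets.values():
        if syn["word"] not in seed_words:
            continue
        for hyper_id in syn["hypernyms"]:
            hyper = synsets[hyper_id]
            out.append((syn["word"], hyper["word"],
                        syn["example"], hyper["example"]))
    return out

def permuted_negatives(positives):
    """Assumed Step 3: same word pair, left context from another sense."""
    known_positive = set(positives)
    negatives = []
    for a, b in permutations(positives, 2):
        if a[0] != b[0]:
            continue
        candidate = (a[0], a[1], b[2], a[3])
        if candidate not in known_positive and candidate not in negatives:
            negatives.append(candidate)
    return negatives

def flipped_negatives(positives):
    """Step 4: swap the two sides of each positive example."""
    return [(w_r, w_l, c_r, c_l) for (w_l, w_r, c_l, c_r) in positives]

positives = positive_examples(TOY_SYNSETS, {"staff"})
negatives = permuted_negatives(positives) + flipped_negatives(positives)
```

On this toy input, the permuted negatives include the third row of Table 1 (staff, stick, with the "nursing staff" context).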

Representing Words and their Contexts for Entailment
How can we construct representations of the meaning of target words $w_l$ and $w_r$, and their respective exemplar contexts $c_l$ and $c_r$?
We will construct representations for $c_l$ and $c_r$, and create context-aware representations for $w_l$ and $w_r$ by "masking" their word embeddings with the embeddings for $c_l$ and $c_r$ (Section 3.3). We compare two approaches to representing $c_l$ and $c_r$. The first (Section 3.1) builds on standard representations for word types, which have proven useful for detecting lexical entailment and other semantic relations out of context (Baroni et al., 2012; Kruszewski and Baroni, 2015; Vylomova et al., 2016; Turney and Mohammad, 2013). The second approach (Section 3.2) uses a recurrent neural model to embed words and contexts in the same space, allowing direct comparisons between them.
[Figure: Constructing word-in-context representations for "bank" in the context "the river bank", using element-wise multiplication.]

Creating Context Representations from Word Type Representations
Given an example ($w_l$, $w_r$, $c_l$, $c_r$), let $\vec{w}_l$ and $\vec{w}_r$ refer to the context-agnostic representations of $w_l$ and $w_r$, and let $C_l$ and $C_r$ represent the matrices obtained by row-wise stacking of the context-agnostic representations of words in $c_l$ and $c_r$ respectively. Following Thater et al. (2011) and Erk and Padó (2008), we apply a filter to word type representations to highlight the salient dimensions of the exemplar context, emphasizing relevant dimensions and downplaying unimportant ones. However, while prior work represents context by averaging word vectors, we propose richer representations that better capture the salient geometrical properties of the exemplar context that might get lost by averaging.
We construct fixed-length representations for the contexts $c_l$ and $c_r$ by running convolutional filters over $C_l$ and $C_r$. Specifically, we calculate the column-wise maximum, minimum and mean over the matrices $C_l$ and $C_r$, as done by Tang et al. (2014) for supervised sentiment classification. This yields three $d$-dimensional vectors for $c_l$ ($\vec{c}_{l,max}$, $\vec{c}_{l,min}$, $\vec{c}_{l,mean}$), and three $d$-dimensional vectors for $c_r$ ($\vec{c}_{r,max}$, $\vec{c}_{r,min}$, $\vec{c}_{r,mean}$). Computing the maximum and minimum across all vector dimensions captures the exterior surface of the "instance manifold" (the volume in embedding space within which all words in the instance reside), while the mean summarizes the density per dimension within the manifold (Hovy, 2015).
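As a minimal sketch of the pooling step, the following computes the column-wise maximum, minimum and mean of a context matrix. The function name `pool_context` and the toy 3-dimensional vectors are illustrative assumptions, not values from the paper; plain Python lists stand in for real embeddings.

```python
def pool_context(context_vectors):
    """Column-wise max, min and mean over the stacked word vectors."""
    columns = list(zip(*context_vectors))  # one tuple per dimension
    c_max = [max(col) for col in columns]
    c_min = [min(col) for col in columns]
    c_mean = [sum(col) / len(col) for col in columns]
    return c_max, c_min, c_mean

# Toy context matrix C_l: two context words, d = 3 (made-up values).
C_l = [[1.0, -2.0, 0.5],
       [3.0, 0.0, -0.5]]
c_max, c_min, c_mean = pool_context(C_l)
```

Each of the three outputs has the same dimensionality $d$ as the word vectors, regardless of the number of words in the context.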

LSTM-based Context Representations: Context2Vec
An alternative approach to contextualizing word representations is to directly compare the representations of words with representations of contexts. This can be done using Context2Vec (Melamud et al., 2016), a neural model that, given a target word and its sentential context, embeds both the word and the context in the same low-dimensional space using a BiLSTM, with the objective of having the context predict the target word via a log-linear model. This model approaches the state-of-the-art on lexical substitution, sentence completion, and supervised word sense disambiguation. For each example ($w_l$, $w_r$, $c_l$, $c_r$), we extract the word type representations $\vec{w}_{l,c2v}$ and $\vec{w}_{r,c2v}$ from Context2Vec, as well as the context representations $\vec{c}_{l,c2v}$ and $\vec{c}_{r,c2v}$.

Context-aware Masked Representations
Given these two methods to learn representations for words and their contexts, we also learn context-aware word representations for the target words. We transform initial context-agnostic representations for target word types by taking an element-wise product of the word type vectors with vectors representing the context. Specifically, for the context representations learned in Section 3.1, we take an element-wise product of the word type vectors ($\vec{w}_*$) with ($\vec{c}_{*,max}$, $\vec{c}_{*,min}$, $\vec{c}_{*,mean}$), where $* \in \{l, r\}$. This yields three $d$-dimensional vectors for $w_l$ ($\vec{w}_{l,max}$, $\vec{w}_{l,min}$, $\vec{w}_{l,mean}$), and three for $w_r$ ($\vec{w}_{r,max}$, $\vec{w}_{r,min}$, $\vec{w}_{r,mean}$). We refer to our final word-in-context representations for $w_l$ and $w_r$ as $\vec{w}_{l,mask}$ and $\vec{w}_{r,mask}$ respectively, where $\vec{w}_{l,mask}$ is the concatenation of $\vec{w}_{l,max}$, $\vec{w}_{l,min}$ and $\vec{w}_{l,mean}$, and $\vec{w}_{r,mask}$ is constructed similarly.
For the word and context representations obtained from Context2Vec (Section 3.2), we create the context-aware representation $\vec{w}_{l,c2v,mask}$ by element-wise multiplication of $\vec{w}_{l,c2v}$ and $\vec{c}_{l,c2v}$. We obtain $\vec{w}_{r,c2v,mask}$ similarly.
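The masking operation can be sketched as follows. The helper names and the toy vectors are illustrative assumptions; the context vectors stand in for the pooled max, min and mean vectors of an exemplar context.

```python
def elementwise(u, v):
    """Element-wise product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]

def masked_representation(word_vec, c_max, c_min, c_mean):
    """Concatenate the word vector masked by each context vector."""
    return (elementwise(word_vec, c_max)
            + elementwise(word_vec, c_min)
            + elementwise(word_vec, c_mean))

# Toy 3-dimensional vectors (made-up values).
w_l = [0.5, 2.0, -1.0]
c_max = [3.0, 0.0, 0.5]
c_min = [1.0, -2.0, -0.5]
c_mean = [2.0, -1.0, 1.0]

w_l_mask = masked_representation(w_l, c_max, c_min, c_mean)
```

The result has $3d$ dimensions. Under the same assumptions, the Context2Vec variant reduces to a single call, `elementwise(w_c2v, c_c2v)`, which keeps the original dimensionality.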

Comparing Words and Contexts for Entailment
Given the word, context, and word-in-context representations described above, we predict entailment via supervised classification. Our classifier is the Hypernymy-Feature detector (Roller and Erk, 2016), which is the current state-of-the-art supervised model for detecting hypernymy on several datasets. This model aims to overcome the shortcomings of previous supervised hypernymy detection models, which used linear classifiers on top of concatenation of the two vectors representing the target words. These models only captured notions of prototypicality without modeling the interactions between the two words; that is, they guessed that (animal, sofa) is a positive example because animal looks like a hypernym.
Instead, the H-Feature detector model trains a linear classifier using concatenation, as described above, and then removes this prototypical information from the word vectors by projecting them on a hyperplane orthogonal to the separating hyperplane learned by the linear classifier. By repeating this process, one can learn multiple classifiers, each of which increases the model's representational power. In each iteration $i$, four features are extracted to represent the word pair, based on the current representations of the word pair ($\vec{x}$, $\vec{y}$) and the hyperplane $\vec{p}_i$ learned in the current iteration:
1. The similarity between $\vec{x}$ and the hyperplane, $\vec{x} \cdot \vec{p}_i$
2. The similarity between $\vec{y}$ and the hyperplane, $\vec{y} \cdot \vec{p}_i$
3. The similarity between the two words, $\vec{x} \cdot \vec{y}$
4. The similarity between the difference of the two words and the hyperplane, $(\vec{y} - \vec{x}) \cdot \vec{p}_i$
Features 1 and 2 capture similarities like the one included in the concatenation classifier. The third feature aims to overcome the shortcomings of the concatenation model by directly modeling the similarities between the two target words. Finally, the fourth feature captures the distributional inclusion hypothesis (Geffet and Dagan, 2005), which states that if word $v$ is a hypernym of $u$, then the set of features of $u$ is included in the set of features of $v$, by intuitively capturing whether $\vec{y}$ includes $\vec{x}$ (Roller et al., 2014).
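One iteration of the feature extraction can be sketched as below. This is a simplified illustration rather than the H-Feature detector implementation: the hyperplane `p` is given by hand instead of being learned by a linear classifier, similarity is taken to be the plain dot product, and the projection step is assumed to be a standard vector rejection.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def h_features(x, y, p):
    """Four features for the pair (x, y) given hyperplane normal p."""
    diff = [b - a for a, b in zip(x, y)]
    return [dot(x, p), dot(y, p), dot(x, y), dot(diff, p)]

def project_out(v, p):
    """Remove from v its component along p (vector rejection)."""
    scale = dot(v, p) / dot(p, p)
    return [a - scale * b for a, b in zip(v, p)]

# Toy 2-dimensional vectors; p stands in for a learned hyperplane.
x = [1.0, 2.0]
y = [3.0, 1.0]
p = [0.0, 1.0]

features = h_features(x, y, p)
# Representations passed to the next iteration carry no signal along p.
x_next, y_next = project_out(x, p), project_out(y, p)
```

Repeating the two steps with a newly learned hyperplane in each iteration yields four additional features per iteration.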

Experimental Set-up
Tasks In addition to WHIC, we evaluate our context-aware representations on CONTEXT-PPDB. As mentioned in Section 2.3, CONTEXT-PPDB is a dataset for fine-grained lexical inference in context that captures other semantic relations beyond hypernymy. It has been created using 375 word pairs from a subset of the English Paraphrase Database (Ganitkevitch et al., 2013; Pavlick et al., 2015). These word pairs are semi-automatically labeled with semantic relations out-of-context. They were then augmented with examples of word usage in context, and the word pairs were re-annotated given the extra contextual information. The final dataset consists of 3750 words/contexts tuples with a corresponding semantic label, one of which is entailment.
All our experiments are with the default train/dev/test splits on both datasets.

Contextualized Word Representations
To obtain the Context2Vec representations, we use an existing 600-dimensional model trained on ukWaC (Ferraresi et al., 2006). We use 600-dimensional GloVe embeddings trained on the same corpus to create $\vec{w}_l$, $\vec{w}_r$, $C_l$, and $C_r$, allowing for a controlled comparison with Context2Vec. Context2Vec representations are significantly more expensive to train: Melamud et al. (2016) indicate that training requires ~30 hours on a Tesla K80 GPU, while the GloVe embeddings can be trained on the exact same amount of data in less than 7 hours on a CPU.

Supervised Lexical Entailment Classifier
We use an SVM with an RBF kernel for WHIC and Logistic Regression for CONTEXT-PPDB as our classifiers, both as implemented in Scikit-Learn, to allow for exact comparisons with past work on CONTEXT-PPDB. We use default parameters, except for adding class weights in the WHIC experiments to account for the unbalanced data. For WHIC we use features derived from the H-Feature model described in Section 4. For CONTEXT-PPDB we simply concatenate the representations and use them directly as the features. We evaluate the predictions using F1 score.

Experiments on WHIC
In our first set of experiments, we evaluate the two models described in Section 3 on WHIC under a variety of combinations.

Overall Results
Results are summarized in Table 2. Supervised models (footnote 4) outperform the baseline that always predicts that hypernymy holds ("All True Baseline") by up to 16 F-score points. Context-aware models outperform context-agnostic models by up to 3 points (footnote 5). GloVe and Context2Vec models yield similar F1, both when used as word type representations alone, and when combined with masked representations. However, GloVe and Context2Vec representations capture complementary information: GloVe yields slightly better precision while Context2Vec models yield significantly better recall.
The best performance overall is obtained by a hybrid model that uses word-type representations from Context2Vec and masked context-aware representations derived from GloVe.
Additionally, using Context2Vec vectors directly ($\vec{c}_{l,c2v}$, $\vec{c}_{r,c2v}$) performs much worse than using them as masks ($\vec{w}_{l,c2v,mask}$, $\vec{w}_{r,c2v,mask}$). This highlights the benefit of using context to influence the word type representation rather than to directly compare word and context representations.
Finally, there is no benefit in using the context-aware masked representations without the word type representations: using just the masked representations by themselves does worse than using them in combination with the word type representations.
Overall, the scores in Table 2 highlight the challenging nature of WHIC, and leave scope for improvement with potentially better models for context-aware representations.
Footnote 4: We also tried two unsupervised context-agnostic baselines using cosine similarity and balAPinc (Kotlerman et al., 2010), but they trivially predicted all pairs as entailing.
Footnote 5: A statistically significant difference with p < 0.01 under McNemar's test (McNemar, 1947).

Sensitivity to context
To determine the sensitivity of our models to context changes, we evaluate on the balanced subset of WHIC comprised of positive examples and negative examples created by permuting contexts in Step 3 of the dataset creation process. We analyze the predictions using a modified version of precision, recall and F-score, defined as the precision, recall, and F1-score calculated over each ($w_l$, $w_r$) word pair, and then averaged over all word pairs. We call these measures the Macro-P/R/F1. Table 3 shows that context-aware representations generally improve performance on all three metrics, but the gain is larger on recall. Again we observe that models using Context2Vec word types and masks have a better Macro-R than the corresponding GloVe models. Overall, the masked representations obtained from Context2Vec perform the best on these metrics, closely followed by the overall best model that uses the Context2Vec word type representations and the masked representations from GloVe.
Finally, note that the all-true baseline surprisingly does as well as the best context-aware model on this metric. However, it cannot detect the direction of hypernymy (Section 6.3), and the structure of WHIC allows us to distinguish these two factors.
Table 3: Macro-P/R/F1 and pairwise accuracy are intended to capture context-awareness (Section 6.2) and directionality-discrimination abilities (Section 6.3) of the models, respectively.
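The Macro-P/R/F1 measures described in this section can be sketched as follows. The function names are illustrative, and scoring an undefined precision, recall or F1 (zero denominator) as 0.0 is an assumption made here, not a convention stated in the paper.

```python
from collections import defaultdict

def prf(gold, pred):
    """Precision, recall and F1 for one list of boolean labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def macro_prf(examples):
    """examples: iterable of (w_l, w_r, gold_label, predicted_label)."""
    by_pair = defaultdict(lambda: ([], []))
    for w_l, w_r, gold, pred in examples:
        by_pair[(w_l, w_r)][0].append(gold)
        by_pair[(w_l, w_r)][1].append(pred)
    scores = [prf(g, p) for g, p in by_pair.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

# Toy predictions for two word pairs (made-up labels).
examples = [
    ("staff", "stick", True, True),
    ("staff", "stick", False, True),
    ("staff", "body", True, True),
    ("staff", "body", False, False),
]
macro_p, macro_r, macro_f1 = macro_prf(examples)
```

Averaging per word pair prevents word pairs with many contexts from dominating the score.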

Sensitivity to Entailment Direction
Next, we evaluate to what extent the models capture the direction of hypernymy using the balanced subset of WHIC that consists of all positive examples and flipped negative examples generated in Step 4 in the dataset creation process. We measure directionality by looking at the fraction of pairs (($w_l$, $w_r$, $c_l$, $c_r$), ($w_r$, $w_l$, $c_r$, $c_l$)) where both examples are correctly labeled, i.e. the former is labeled as $\Rightarrow$ and the latter as $\nRightarrow$. We call this metric the pairwise accuracy. As seen in Table 3, the best pairwise accuracy is again obtained by the hybrid model using word type representations from Context2Vec and the masked representations from GloVe. Overall, Context2Vec models do a better job at capturing directionality than GloVe.
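Pairwise accuracy as defined above can be sketched as follows. The `predict` callables are hypothetical stand-ins for a trained classifier, and the two toy predictors exist only to illustrate how the metric behaves.

```python
def pairwise_accuracy(positives, predict):
    """Fraction of positive examples labeled correctly in both directions."""
    correct = 0
    for w_l, w_r, c_l, c_r in positives:
        forward = predict(w_l, w_r, c_l, c_r)    # should be True
        backward = predict(w_r, w_l, c_r, c_l)   # should be False
        if forward and not backward:
            correct += 1
    return correct / len(positives)

positives = [
    ("staff", "stick",
     "He walked with the help of a wooden staff.",
     "The kid had a candied apple on a stick."),
    ("staff", "body",
     "The hospital has an excellent nursing staff.",
     "The whole body filed out of the auditorium."),
]

def always_true(w_l, w_r, c_l, c_r):
    return True

def toy_directional(w_l, w_r, c_l, c_r):
    # Hypothetical predictor: fires when the right word looks like a hypernym.
    return w_r in {"stick", "body"}

acc_all_true = pairwise_accuracy(positives, always_true)
acc_directional = pairwise_accuracy(positives, toy_directional)
```

A predictor that always answers "entails" scores zero on this metric, which is why it separates directionality from the Macro-P/R/F1 results.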

Nature of Contextualized Masks
We also hypothesized that masked contextualized representations based on the full volume of the context using min and max operations (Section 3.1) better capture salient context dimensions than the more usual vector averaging approach. We test this hypothesis empirically by replacing masked word-in-context representations $\vec{w}_{l,mask}$ and $\vec{w}_{r,mask}$ by two other ways to capture context. In the first method, we use the mean of the contexts ($\vec{c}_{l,mean}$, $\vec{c}_{r,mean}$). In the second method, we use ($\vec{w}_{l,mean}$, $\vec{w}_{r,mean}$), i.e. the masked representations calculated by using only the mean of the context, and not the max and min. Table 4 shows that our preferred method outperforms the two alternatives on WHIC, with our proposed representations outperforming the other methods by 3 F1 points. Additionally, this increase in performance also comes with significant improvement in detection of asymmetric relations.

Summary
Overall, both Context2Vec and GloVe representations improve performance over context-agnostic baselines. Using masking to contextualize word type representations works better than just using the context representations as is. The best performing model is a hybrid model that uses word type representations from Context2Vec and masked representations from GloVe. Analysis enabled by the structure of the dataset shows that all masked representations are sensitive to changes in meaning indicated by glosses from distinct WordNet synsets. However, the more expensive Context2Vec representations do a better job at recall and direction of hypernymy.

CONTEXT-PPDB
We now experiment on CONTEXT-PPDB to test the ability of contextualized representations to capture semantic relations beyond hypernymy, to aid future work on recognizing other contextualized relationships.

[...] as well as similarity scores between words and contexts. The PPDB features notably include scores for likelihood of context-agnostic entailment labels, distributional similarities, and probabilities of the word pair being paraphrases, among other scores. Additionally, word representation features are used: given two word/context pairs ($w_x$, $c_x$, $w_y$, $c_y$), GloVe vectors are used to represent $w_x$ and $w_y$, as well as words in $c_x$ and $c_y$, and are used to extract the following features, which capture the most salient word/context similarities between the two pairs:

$$\left\{ \max_{w \in c_y} \vec{w}_x \cdot \vec{w},\quad \max_{w \in c_x} \vec{w}_y \cdot \vec{w},\quad \max_{w \in c_x,\, w' \in c_y} \vec{w} \cdot \vec{w}' \right\}$$

We augment this system with contextualized word representations. We use the GloVe-based masked representations, as they can be obtained with a negligible computation cost in addition to the features already included in the baseline, and as the labels denote a mix of directional and non-directional relations. This yields an improvement of ~5 F1 points compared to the previous state-of-the-art (Table 5). Breaking down results per label (Table 6) shows an increase of 8 F1 points for the entailment class. This improvement again stems from a large increase in recall, mirroring the behavior observed on WHIC. The diverse "other-related" category also benefits from context-aware representations.
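The three word/context similarity features used by the baseline can be sketched as below. The function name and the toy 2-dimensional vectors are illustrative assumptions; in the described system the vectors would be GloVe embeddings.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def context_similarity_features(w_x, c_x, w_y, c_y):
    """w_x, w_y: word vectors; c_x, c_y: lists of context word vectors."""
    return [
        max(dot(w_x, w) for w in c_y),              # w_x vs. words of c_y
        max(dot(w_y, w) for w in c_x),              # w_y vs. words of c_x
        max(dot(w, v) for w in c_x for v in c_y),   # c_x words vs. c_y words
    ]

# Toy vectors (made-up values).
w_x = [1.0, 0.0]
w_y = [0.0, 1.0]
c_x = [[1.0, 1.0], [2.0, 0.0]]
c_y = [[0.0, 3.0], [1.0, -1.0]]

sim_features = context_similarity_features(w_x, c_x, w_y, c_y)
```

Taking the maximum keeps only the single most similar word in each comparison, which matches the stated goal of capturing the most salient similarities.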

Related Work
WordNet and lexical entailment The "is-a" hierarchy of WordNet (Fellbaum, 1998) is a prominent source of information for unsupervised detection of hypernymy and entailment (Harabagiu and Moldovan, 1998;Shwartz et al., 2015), as well as a source of various datasets (Baroni and Lenci, 2011;Baroni et al., 2012). WHIC is inspired by the latter line of work, except that we extract exemplar contexts from WordNet in addition to relations between words.
Modeling word meaning in context Prior models for the meaning of a word in a given context aimed to capture semantic equivalence in tasks such as lexical substitution, word sense disambiguation or paraphrase ranking, rather than asymmetric relations such as entailment. One line of work (Dinu and Lapata, 2010; Reisinger and Mooney, 2010) views each word as a set of latent word senses. These models rely on token representations for individual occurrences of a word and then choose a set of token vectors based on the current context. An alternate set of models (Erk and Padó, 2008; Thater et al., 2011; Dinu et al., 2012) avoids defining a fixed set of word senses, and instead contextualizes word type vectors as we do here. These models share the idea of using an element-wise multiplication to apply a context mask to word type representations. The nature of the context representation varies: Erk and Padó (2008) use inverse selectional preferences; Thater et al. (2010) combine a first order co-occurrence based representation for the context with a second order representation for the target; Thater et al. (2011) rely on syntactic dependencies to define context. Apidianaki (2016) shows that a bag-of-words context representation within a small context window works as well as syntactic definitions of context for ranking paraphrases in context. Our use of convolution is motivated by the success of similar models on sentence classification tasks. Tang et al. (2014) use convolution over embedding matrices for unigrams, bigrams, and trigrams, while Hovy (2015) uses just unigrams. However, all these works use the resulting representations to predict properties of the sentence (e.g., sentiment), rather than to contextualize target word representations.
In-context lexical semantic tasks Besides entailment, other lexical semantic tasks studied in context include lexical substitution (McCarthy and Navigli, 2007) and cross-lingual lexical substitution (Mihalcea et al., 2010). The focus of these tasks and their related datasets is on synonymy and translation equivalence, since they require one to predict substitutes for a target word instance that preserve its meaning in a given sentential context. On the other hand, the focus of this work and WHIC is on detecting more fine-grained relations via lexical entailment. Another related task is that of paraphrase ranking (Apidianaki, 2016). The work by Apidianaki (2016) is also notable because of its successful use of models of word meaning in context from Thater et al. (2011), which is closely related to our work.

Conclusion
We introduced WHIC, a dataset to evaluate lexical entailment in context, providing exemplar sentences to ground the meaning of words being considered for entailment, and challenging examples designed to capture entailment direction accurately.
We showed that supervised models developed for context-agnostic lexical entailment can address the context-aware task to some extent, when replacing word representations with a contextualized version. We compared two contextualized representations: (1) a simple context-aware representation based on the geometry of word embeddings, and (2) Context2Vec, a more expensive BiLSTM-based model that yields representations of words and their context in the same space. Both improve performance over context-agnostic models, and have complementary properties: models using Context2Vec are more accurate at discriminating the direction of entailment. They also have a better recall when measured using metrics designed to test sensitivity to context. Finally, we also showed that contextualized representations can improve detection of other semantic relations in context.
While encouraging, the performance of the models considered leaves substantial room for improvement. For instance, it remains to be seen whether richer features for the supervised models and richer context representations can improve sensitivity to context, and whether the nuances of the task can be better captured with annotations on a graded scale, following previous work on word meaning in context (Erk et al., 2013).