Exploiting Image Generality for Lexical Entailment Detection

We exploit the visual properties of concepts for lexical entailment detection by examining a concept’s generality. We introduce three unsupervised methods for determining a concept’s generality, based on its related images, and obtain state-of-the-art performance on two standard semantic evaluation datasets. We also introduce a novel task that combines hypernym detection and directionality, significantly outperforming a competitive frequency-based baseline.


Introduction
Automatic detection of lexical entailment is useful for a number of NLP tasks including search query expansion (Shekarpour et al., 2013), recognising textual entailment (Garrette et al., 2011), metaphor detection (Mohler et al., 2013), and text generation (Biran and McKeown, 2013). Given two semantically related words, a key aspect of detecting lexical entailment, or the hyponym-hypernym relation, is the generality of the hypernym compared to the hyponym. For example, bird is more general than eagle, having a broader intension and a larger extension. This property has led to the introduction of lexical entailment measures that compare the entropy of distributional word representations, under the assumption that a more general term has a higher-entropy distribution (Herbelot and Ganesalingam, 2013; Santus et al., 2014).
A strand of distributional semantics has recently emerged that exploits the fact that meaning is often grounded in the perceptual system, known as multi-modal distributional semantics (Bruni et al., 2014). Such models enhance purely linguistic models with extra-linguistic perceptual information, and outperform language-only models on a range of tasks, including modelling semantic similarity and conceptual relatedness (Silberer and Lapata, 2014). In fact, under some conditions uni-modal visual representations outperform traditional linguistic representations on semantic tasks (Kiela and Bottou, 2014).
We hypothesize that visual representations can be particularly useful for lexical entailment detection. Deselaers and Ferrari (2011) have shown that sets of images corresponding to terms at higher levels in the WordNet hierarchy have greater visual variability than those at lower levels. We exploit this tendency using sets of images returned by Google's image search. The intuition is that the set of images returned for animal will consist of pictures of different kinds of animals, the set of images for bird will consist of pictures of different birds, while the set for owl will mostly consist only of images of owls, as can be seen in Figure 1.
Here we evaluate three different vision-based methods for measuring term generality on the semantic tasks of hypernym detection and hypernym directionality. Using this simple yet effective unsupervised approach, we obtain state-of-the-art results compared with supervised algorithms which use linguistic data.

Related Work
In the linguistic modality, the most closely related work is by Herbelot and Ganesalingam (2013) and Santus et al. (2014), who use unsupervised distributional generality measures to identify the hypernym in a hyponym-hypernym pair. Herbelot and Ganesalingam (2013) use KL divergence to compare the probability distribution of context words, given a term, to the background probability distribution of context words. Santus et al. (2014) use the median entropy of the probability distributions associated with a term's top-weighted context words as a measure of information content.
In the visual modality, the intuition that visual representations may be useful for detecting lexical entailment is inspired by Deselaers and Ferrari (2011). Using manually annotated images from ImageNet (Deng et al., 2009), they find that concepts and categories with narrower intensions and smaller extensions tend to have less visual variability. We extend this intuition to the unsupervised setting of Google image search results and apply it to the lexical entailment task.

Approach
We use two standard evaluations for lexical entailment: hypernym directionality, where the task is to predict which of two words is the hypernym; and hypernym detection, where the task is to predict whether two words are in a hypernym-hyponym relation (Weeds et al., 2014; Santus et al., 2014). We also introduce a third, more challenging, evaluation that combines detection and directionality.
For the directionality experiment, we evaluate on the hypernym subset of the well-known BLESS dataset (Baroni and Lenci, 2011), which consists of 1337 hyponym-hypernym pairs. In this case, it is known that the words are in an entailment relation and the task is to predict the directionality of the relation. BLESS data is always presented with the hyponym first, so we report how often our measures predict that the second term in the pair is more general than the first.
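The directionality evaluation described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for the reported results: the function name `directionality_accuracy`, the toy pairs, and the generality values are all hypothetical.

```python
def directionality_accuracy(pairs, generality):
    """Fraction of (hyponym, hypernym) pairs whose second word scores as more general.

    `pairs` lists (hyponym, hypernym) tuples, hyponym first, as in BLESS.
    `generality` maps each word to a generality score from any measure f.
    """
    correct = sum(1 for hypo, hyper in pairs if generality[hyper] > generality[hypo])
    return correct / len(pairs)


# Hypothetical toy data: the scores are made up for illustration only.
toy_pairs = [("owl", "bird"), ("bird", "animal"), ("eagle", "bird")]
toy_generality = {"owl": 0.30, "eagle": 0.50, "bird": 0.45, "animal": 0.70}

accuracy = directionality_accuracy(toy_pairs, toy_generality)
print(f"directionality accuracy: {accuracy:.2f}")
```

In this toy setup the pair (eagle, bird) is counted as an error because the made-up score for eagle exceeds the one for bird.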
For the detection experiment, we evaluate on the BLESS-based dataset of Weeds et al. (2014), which consists of 1168 word pairs and which we call WBLESS. Its negative examples include pairs in the reversed hypernym-hyponym order, as well as holonym-meronym pairs, cohyponyms, and randomly matched nouns. Accuracy on WBLESS reflects the ability to distinguish hypernymy from other relations, but does not require detection of directionality, since reversed pairs are grouped with the other negatives. For the combined experiment, we assign reversed hyponym-hypernym pairs a value of -1 instead of 0. We call this more challenging dataset BIBLESS. Examples of pairs in the respective datasets can be found in Table 1.
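The difference between the WBLESS and BIBLESS gold labels can be made concrete with a small sketch. The relation names (`"hyper"`, `"reversed_hyper"`, and so on) and the example pairs are hypothetical placeholders chosen for illustration; they are not taken from the datasets themselves.

```python
def wbless_label(relation):
    """Detection label: 1 for hyponym-hypernym pairs, 0 for everything else."""
    return 1 if relation == "hyper" else 0


def bibless_label(relation):
    """Combined label: like WBLESS, but reversed pairs get -1 instead of 0."""
    if relation == "hyper":
        return 1
    if relation == "reversed_hyper":
        return -1
    return 0


# Hypothetical examples, one per relation type mentioned in the text.
toy_examples = [
    ("owl", "bird", "hyper"),           # hyponym first, hypernym second
    ("bird", "owl", "reversed_hyper"),  # hypernym first, hyponym second
    ("owl", "wing", "holo_mero"),       # holonym-meronym
    ("owl", "eagle", "cohyponym"),      # cohyponyms
    ("owl", "chair", "random"),         # randomly matched nouns
]

for first, second, relation in toy_examples:
    print(first, second, wbless_label(relation), bibless_label(relation))
```

Only the reversed pair changes label between the two schemes, which is why BIBLESS additionally tests directionality.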

Image representations
Following previous work in multi-modal semantics (Bergsma and Goebel, 2011), we obtain images from Google Images for the words in the evaluation datasets. It has been shown that images from Google yield higher-quality representations than comparable resources such as Flickr and are competitive with "hand prepared datasets" (Bergsma and Goebel, 2011; Fergus et al., 2005).
For each image, we extract the pre-softmax layer from a forward pass in a convolutional neural network (CNN) that has been trained on the ImageNet classification task using Caffe (Jia et al., 2014). As such, this work is an instance of deep transfer learning; that is, a deep learning representation trained on one task (image classification) is used to make predictions on a different task (image generality). We chose to use CNN-derived image representations because they have been found to be of higher quality than the traditional bag of visual words models (Sivic and Zisserman, 2003) that have previously been used in multi-modal distributional semantics (Bruni et al., 2014; Kiela and Bottou, 2014).

Generality measures
We propose three measures that can be used to calculate the generality of a set of images. The image dispersion $d$ of a concept word $w$ is defined as the average pairwise cosine distance between all image representations $\{\vec{w}_1 \ldots \vec{w}_n\}$ of the set of images returned for $w$:
$$d(w) = \frac{2}{n(n-1)} \sum_{i<j\leq n} \left(1 - \frac{\vec{w}_i \cdot \vec{w}_j}{\|\vec{w}_i\|\,\|\vec{w}_j\|}\right) \tag{1}$$
This measure was originally introduced to account for the fact that perceptual information is more relevant for e.g. elephant than it is for happiness. It acts as a substitute for the concreteness of a word and can be used to regulate how much perceptual information should be included in a multi-modal model.
Our second measure follows Deselaers and Ferrari (2011), who take a similar approach but, instead of calculating the pairwise distance, calculate the distance to the centroid $\vec{\mu}$ of $\{\vec{w}_1 \ldots \vec{w}_n\}$:
$$c(w) = \frac{1}{n} \sum_{1\leq i\leq n} \left(1 - \frac{\vec{w}_i \cdot \vec{\mu}}{\|\vec{w}_i\|\,\|\vec{\mu}\|}\right) \tag{2}$$
For our third measure we follow Lazaridou et al. (2015), who try different ways of modulating the inclusion of perceptual input in their multi-modal skip-gram model, and find that the entropy of the centroid vector $\vec{\mu}$ works well (where $p(\mu_j) = \frac{\mu_j}{\|\vec{\mu}\|}$ and $m$ is the vector length):
$$H(w) = -\sum_{j=1}^{m} p(\mu_j) \log p(\mu_j) \tag{3}$$
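The three measures can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation behind the reported results: the function names and the three-dimensional toy vectors are hypothetical, and for the entropy measure we assume L1 normalization over the positive components of the centroid and a base-2 logarithm, neither of which is fixed by the text.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def image_dispersion(vectors):
    """Average cosine distance over all unordered pairs of image vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


def centroid_distance(vectors):
    """Average cosine distance from each image vector to the centroid."""
    mu = centroid(vectors)
    return sum(1.0 - cosine(v, mu) for v in vectors) / len(vectors)


def centroid_entropy(vectors):
    """Entropy of the centroid vector treated as a distribution.

    Assumptions: only positive components are kept, they are normalized by
    their sum (L1 norm) so that they add up to one, and the log base is 2.
    """
    positive = [x for x in centroid(vectors) if x > 0]
    total = sum(positive)
    return -sum((x / total) * math.log2(x / total) for x in positive)


# Hypothetical toy "image vectors": near-identical images for a specific
# concept, mutually orthogonal images for a general one.
owl_images = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.05, 0.0]]
animal_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

for name, images in [("owl", owl_images), ("animal", animal_images)]:
    print(name, image_dispersion(images), centroid_distance(images), centroid_entropy(images))
```

On this toy data all three measures assign a higher value to the more varied set, which is the behaviour the generality hypothesis relies on.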

Hypernym Detection and Directionality
We calculate the directionality of a hyponym-hypernym pair with a measure $f$ using the following formula for a word pair $(p, q)$. Since even cohyponyms will not have identical values for $f$, we introduce a threshold $\alpha$ which sets a minimum difference in generality for hypernym identification:
$$s(p, q) = 1 - \frac{f(p) + \alpha}{f(q)} \tag{4}$$
In other words, $s(p, q) > 0$ iff $f(q) > f(p) + \alpha$, i.e. if the second word ($q$) is (sufficiently) more general. To avoid false positives where one word is more general but the pair is not semantically related, we introduce a second threshold $\theta$ which sets the score to zero if the two concepts have low cosine similarity. This leads to the following formula:
$$s_\theta(p, q) = \begin{cases} 1 - \dfrac{f(p) + \alpha}{f(q)} & \text{if } \cos(\vec{\mu}_p, \vec{\mu}_q) \geq \theta \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
We experimented with different methods for obtaining the mean vector representations for cosine (hereafter $\mu_c$) in Equation (5), and found that multi-modal representations worked best. We concatenate an L2-normalized linguistic vector with the L2-normalized centroid of image vectors to obtain a multi-modal representation, following Kiela and Bottou (2014). For a word $p$ with image representations $\{\vec{p}^{\,img}_1 \ldots \vec{p}^{\,img}_n\}$, we thus set $\mu_c = \vec{p}^{\,ling} \,\|\, \frac{1}{n}\sum_{i=1}^{n} \vec{p}^{\,img}_i$, after normalizing both representations. For comparison, we also report results for a visual-only $\mu_c$.
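The thresholded score and the multi-modal representation used for the similarity check can be sketched as below. This is an illustrative sketch under assumptions: the score takes the form 1 - (f(p) + α) / f(q), which is positive exactly when f(q) > f(p) + α (assuming f(q) > 0); the function names, default thresholds, and toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multimodal(linguistic, image_vectors):
    """Concatenate the L2-normalized linguistic vector with the
    L2-normalized centroid of the image vectors."""
    n = len(image_vectors)
    image_centroid = [sum(column) / n for column in zip(*image_vectors)]
    return l2_normalize(linguistic) + l2_normalize(image_centroid)


def score(f_p, f_q, mu_p, mu_q, alpha=0.0, theta=0.0):
    """Entailment score for the ordered pair (p, q).

    Returns 0 when the pair is not similar enough (cosine below theta);
    otherwise a value that is positive iff f_q > f_p + alpha (for f_q > 0).
    """
    if cosine(mu_p, mu_q) < theta:
        return 0.0
    return 1.0 - (f_p + alpha) / f_q


# Hypothetical generality values and representations for illustration.
related = score(0.2, 0.5, [1.0, 0.0], [1.0, 0.0], alpha=0.1, theta=0.5)
unrelated = score(0.2, 0.5, [1.0, 0.0], [0.0, 1.0], alpha=0.1, theta=0.5)
reversed_pair = score(0.5, 0.2, [1.0, 0.0], [1.0, 0.0], alpha=0.1, theta=0.5)
print(related, unrelated, reversed_pair)
```

In practice both α and θ would need to be tuned, and any generality measure f with positive values can be plugged in for `f_p` and `f_q`.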

Results
The results can be found in Table 2. We compare our methods with a frequency baseline, setting f(p) = freq(p) in Equation 4 and using the frequency scores from Turney et al. (2011). Frequency has proven to be a surprisingly challenging baseline for hypernym directionality (Herbelot and Ganesalingam, 2013; Weeds et al., 2014). In addition, we compare to the reported results of Santus et al. (2014) for WeedsPrec (Weeds et al., 2004), an early lexical entailment measure, and SLQS, the entropy-based method of Santus et al. (2014). Note, however, that these are on a subsampled corpus of 1277 word pairs from BLESS, so the results are indicative but not directly comparable. On WBLESS we compare to the reported results of Weeds et al. (2014): we include results for the highest-performing supervised method (WeedsSVM) and the highest-performing unsupervised method (WeedsUnSup).
For BLESS, both dispersion and centroid distance reach or outperform the best other measure (SLQS). They beat the frequency baseline by a large margin (+30% and +29%). Taking the entropy of the mean image representations does not appear to do as well as the other two methods but still outperforms the baseline and WeedsPrec (+25% and +20% respectively).
In the case of WBLESS and BIBLESS, we see a similar pattern in that dispersion and centroid distance perform best. For WBLESS, these methods outperform the other unsupervised approach, WeedsUnSup, by +17% and match the best-performing support vector machine (SVM) approach in Weeds et al. (2014). In fact, Weeds et al. (2014) report results for a total of 6 supervised methods (based on SVM and k-nearest neighbor (k-NN) classifiers): our unsupervised image dispersion method outperforms all of these except for the highest-performing one, reported here.
We can see that the task becomes increasingly difficult as we go from directionality to detection to the combination: the dispersion-based method goes from 0.88 to 0.75 to 0.57, for example. BIBLESS is the most difficult, as shown by the frequency baseline obtaining only 0.39. Our methods do much better than this baseline (+18%). Image dispersion appears to be the most robust measure.
To examine our results further, we divided the test data into buckets by the shortest WordNet path connecting word pairs (Miller, 1995). We expect our method to be less accurate on word pairs with short paths, since the difference in generality may be difficult to discern. It has also been suggested that very abstract hypernyms such as object and entity are difficult to detect because their linguistic distributions are not supersets of their hyponyms' distributions (Rimell, 2014), a factor that should not affect the visual modality. We find that concept comparisons with a very short path (bucket 1) are indeed the least accurate. We also find some drop in accuracy on the longest paths (bucket 5), especially for WBLESS and BIBLESS, perhaps because semantic similarity is difficult to detect in these cases. For a histogram of the accuracy scores according to WordNet similarity, see Figure 2.

Conclusions
We have evaluated three unsupervised methods for determining the generality of a concept based on its visual properties. Our best-performing method, image dispersion, reaches the state of the art on two standard semantic evaluation datasets. We introduced a novel, more difficult task combining hypernym detection and directionality, and showed that our methods outperform a frequency baseline by a large margin.
We believe that image generality may be particularly suited to entailment detection because it does not suffer from the same issues as linguistic distributional generality. Herbelot and Ganesalingam (2013) found that general terms like liquid do not always have higher entropy distributions than their hyponyms, since speakers use them in very specific contexts, e.g. liquid is often coordinated with gas.
We also acknowledge that our method depends to some degree on Google's search result diversification, but do not feel this detracts from the utility of the method, since the fact that general concepts achieve greater maximum image dispersion than specific concepts is not dependent on any particular diversification algorithm. In future work, we plan to explore more sophisticated visual generality measures, other semantic relations and different ways of fusing visual representations with linguistic knowledge.