Integrating Multiplicative Features into Supervised Distributional Methods for Lexical Entailment

Supervised distributional methods are applied successfully in lexical entailment, but recent work questioned whether these methods actually learn a relation between two words. Specifically, Levy et al. (2015) claimed that linear classifiers learn only separate properties of each word. We suggest a cheap and easy way to boost the performance of these methods by integrating multiplicative features into commonly used representations. We provide an extensive evaluation with different classifiers and evaluation setups, and suggest a suitable evaluation setup for the task, eliminating biases existing in previous ones.


Introduction
Lexical entailment is concerned with identifying the semantic relation, if any, holding between two words, as in (pigeon, hyponym, animal). The popularity of the task stems from its potential relevance to various NLP applications, such as question answering and recognizing textual entailment (Dagan et al., 2013), that often rely on lexical semantic resources with limited coverage like WordNet (Miller, 1995). Relation classifiers can be used either within applications or as an intermediate step in the construction of lexical resources, which is often expensive and time-consuming.
Most methods for lexical entailment are distributional, i.e., the semantic relation holding between x and y is recognized based on their distributional vector representations. While the first methods were unsupervised and used high-dimensional sparse vectors (Weeds and Weir, 2003; Kotlerman et al., 2010; Santus et al., 2014), in recent years, supervised methods have become popular (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014). These methods are mostly based on word embeddings (Mikolov et al., 2013b; Pennington et al., 2014a), utilizing various vector combinations that are designed to capture relational information between two words.
While most previous work reported success using supervised methods, some questions remain unanswered. First, several works suggested that supervised distributional methods are incapable of inferring the relationship between two words, but rather rely on independent properties of each word (Levy et al., 2015; Roller and Erk, 2016; Shwartz et al., 2016), making them sensitive to training data. Second, it remains unclear which representation and classifier are most appropriate; previous studies reported inconsistent results with Concat v_x ⊕ v_y (Baroni et al., 2012) and Diff v_y − v_x (Roller et al., 2014; Weeds et al., 2014; Fu et al., 2014), using various classifiers.
In this paper, we investigate the effectiveness of multiplicative features, namely, the element-wise multiplication Mult v_x ⊙ v_y, and the squared difference (v_y − v_x) ⊙ (v_y − v_x). These features, similar to the cosine similarity and the Euclidean distance, might capture a different notion of interaction information about the relationship holding between two words. We directly integrate them into some commonly used representations. For instance, we consider concatenations that might capture both the typicality of each word in the relation (e.g., if y is a typical hypernym) and the similarity between the words.
We experiment with multiple supervised distributional methods and analyze which representations perform well in various evaluation setups. Our analysis confirms that integrating multiplicative features into standard representations can substantially boost the performance of linear classifiers. While the contribution to non-linear classifiers is sometimes marginal, such classifiers are expensive to train, and linear classifiers can achieve the same effect "cheaply" by integrating multiplicative features. The contribution of multiplicative features is most prominent in strict evaluation settings, i.e., lexical split (Levy et al., 2015) and out-of-domain evaluation, which disable the models' ability to achieve good performance by memorizing words seen during training. We find that Concat ⊕ Mult performs consistently well, and suggest it as a strong baseline for future research.

Related Work
Available Representations In supervised distributional methods, a pair of words (x, y) is represented as some combination of the word embeddings of x and y, most commonly Concat v_x ⊕ v_y (Baroni et al., 2012) or Diff v_y − v_x (Weeds et al., 2014; Fu et al., 2014).
Limitations Recent work questioned whether supervised distributional methods actually learn the relation between x and y or only separate properties of each word. Levy et al. (2015) claimed that they tend to perform "lexical memorization", i.e., memorizing that some words are prototypical of certain relations (e.g., that y = animal is a hypernym, regardless of x). Roller and Erk (2016) found that under certain conditions, these methods actively learn to infer hypernyms based on separate occurrences of x and y in Hearst patterns (Hearst, 1992). In either case, they only learn whether x and y independently match their corresponding slots in the relation, a limitation which makes them sensitive to the training data (Shwartz et al., 2017; Sanchez and Riedel, 2017). Levy et al. (2015) claimed that the linear nature of most supervised methods limits their ability to capture the relation between words. They suggested that using a support vector machine (SVM) with non-linear kernels slightly mitigates this issue, and proposed KSIM, a custom kernel with multiplicative integration.

Multiplicative Features
The element-wise multiplication has been studied by Weeds et al. (2014), but models that operate exclusively on it were not competitive with Concat and Diff on most tasks. Roller et al. (2014) found that the squared difference, in combination with Diff, is useful for hypernymy detection. Nevertheless, little to no work has focused on investigating combinations of representations obtained by concatenating various base representations for the more general task of lexical entailment.

Table 1: Base representations and combinations.

Methodology
We assign each word pair (x, y) to the semantic relation that holds between its words, chosen from a set of pre-defined relations (i.e., multiclass classification), based on the distributional representations of x and y.

Word Pair Representations
Given a word pair (x, y) and their embeddings v_x, v_y, we consider various compositions as feature vectors for classifiers. Table 1 displays base representations and combination representations, achieved by concatenating two base representations.
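As a minimal sketch of these compositions, assuming NumPy arrays for the embeddings (the function names, including `sqdiff` and `combine`, are hypothetical and not taken from the paper), the base representations and a combination such as Concat ⊕ Mult could be built as follows. The last lines illustrate why the multiplicative features relate to similarity measures: summing Mult gives the dot product, and summing the squared difference gives the squared Euclidean distance.

```python
import numpy as np

def concat(vx, vy):
    """Concat: v_x ⊕ v_y."""
    return np.concatenate([vx, vy])

def diff(vx, vy):
    """Diff: v_y − v_x."""
    return vy - vx

def mult(vx, vy):
    """Mult: element-wise product v_x ⊙ v_y."""
    return vx * vy

def sqdiff(vx, vy):
    """Squared difference: (v_y − v_x) ⊙ (v_y − v_x). Name is hypothetical."""
    d = vy - vx
    return d * d

def combine(vx, vy, *parts):
    """Concatenate several base representations into one feature vector."""
    return np.concatenate([part(vx, vy) for part in parts])

# Toy 300-dimensional vectors standing in for real embeddings.
rng = np.random.default_rng(0)
vx, vy = rng.normal(size=300), rng.normal(size=300)

concat_mult = combine(vx, vy, concat, mult)   # Concat ⊕ Mult: 600 + 300 dims
dot_from_mult = mult(vx, vy).sum()            # equals the dot product of v_x and v_y
sqdist_from_sqdiff = sqdiff(vx, vy).sum()     # equals the squared Euclidean distance
```

A linear classifier over Mult therefore computes a weighted dot product, which is one way to see how it can pick up on the similarity between x and y.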

Word Vectors
We used 300-dimensional pre-trained word embeddings, namely, GloVe (Pennington et al., 2014b), containing 1.9M word vectors trained on a corpus of web data from Common Crawl (42B tokens), and Word2vec (Mikolov et al., 2013a,c), containing 3M word vectors trained on a part of the Google News dataset (100B tokens). Out-of-vocabulary words were initialized randomly.
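The handling of out-of-vocabulary words can be sketched as below. This is an illustration only: the text does not specify the random distribution, so the Gaussian initialization and its scale are assumptions, as are the helper name `make_lookup` and the toy `pretrained` dictionary.

```python
import numpy as np

def make_lookup(pretrained, dim=300, seed=0):
    """Return a lookup function that falls back to a random vector for OOV words.

    The normal distribution with scale 0.1 is an assumption, not the paper's setting.
    """
    rng = np.random.default_rng(seed)
    oov_cache = {}

    def lookup(word):
        if word in pretrained:
            return pretrained[word]
        if word not in oov_cache:
            # Draw the random vector once so the same word always maps to the same vector.
            oov_cache[word] = rng.normal(scale=0.1, size=dim)
        return oov_cache[word]

    return lookup

# Toy stand-in for a pre-trained embedding table.
pretrained = {"pigeon": np.ones(300), "animal": np.zeros(300)}
lookup = make_lookup(pretrained)
v_known = lookup("pigeon")
v_unknown = lookup("some-unseen-word")
```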

Classifiers
Following previous work (Levy et al., 2015; Roller and Erk, 2016), we trained different types of classifiers for each word-pair representation outlined in Section 3.1, namely, logistic regression with L2 regularization (LR), SVM with a linear kernel (LIN), and SVM with a Gaussian kernel (RBF). In addition, we trained multi-layer perceptrons with a single hidden layer (MLP). We compare our models against the KSIM model found to be successful in previous work (Levy et al., 2015; Kruszewski et al., 2015). We do not include the model of Roller and Erk (2016). Hyper-parameters are tuned using grid search, and we report the test performance of the hyper-parameters that performed best on the validation set. Below are more details about the training procedure:
• For LR, the inverse of regularization strength is selected from {2^-1, 2^1, 2^3, 2^5}.
• For MLP, the hidden layer size is either 50 or 100, and the learning rate is fixed at 10^-3. We use early stopping based on the performance on the validation set. The maximum number of training epochs is 100.
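The selection procedure above can be sketched roughly as follows. The use of scikit-learn, the weighted F1 selection metric, the helper `select_on_validation`, and the toy data are all assumptions made for illustration; the text does not name a library or the exact selection metric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier

def select_on_validation(candidates, X_tr, y_tr, X_val, y_val):
    """Fit each candidate and keep the one scoring best on the validation set."""
    best_model, best_score = None, -1.0
    for model in candidates:
        model.fit(X_tr, y_tr)
        score = f1_score(y_val, model.predict(X_val), average="weighted")
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score

# LR grid: inverse regularization strength C in {2^-1, 2^1, 2^3, 2^5} (L2 is the default penalty).
lr_grid = [LogisticRegression(C=2.0 ** k, max_iter=1000) for k in (-1, 1, 3, 5)]

# MLP grid: one hidden layer of 50 or 100 units, learning rate 1e-3, at most 100 epochs.
# Note: scikit-learn's early_stopping holds out part of the training data internally,
# which only approximates early stopping on a separate validation set.
mlp_grid = [
    MLPClassifier(hidden_layer_sizes=(h,), learning_rate_init=1e-3,
                  early_stopping=True, max_iter=100, random_state=0)
    for h in (50, 100)
]

# Toy data standing in for word-pair feature vectors and relation labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

best_lr, lr_score = select_on_validation(lr_grid, X_tr, y_tr, X_val, y_val)
```

The same helper can be called with `mlp_grid`; the selected model is then evaluated once on the test set.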

Evaluation Setup
We consider the following evaluation setups:
Random (RAND) We randomly split each dataset into 70% train, 5% validation and 25% test. We discarded two relations in EVALution with too few instances and did not include its domain information since each word pair can belong to multiple domains at once.
Lexical Split (LEX) In line with recent work (Shwartz et al., 2016), we split each dataset into train, validation and test sets so that each contains a distinct vocabulary. This differs from Levy et al. (2015), who dedicated a subset of the train set for evaluation, allowing the model to memorize when tuning hyper-parameters. We tried to keep the same 70:5:25 ratio as in the random setup.
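One simple way to obtain splits with distinct vocabularies is sketched below. It is an assumption about the procedure rather than the exact one used: words are partitioned at the 70:5:25 ratio and a pair is kept only if both of its words fall in the same partition, so the resulting pair counts only approximate that ratio. The function name and toy pairs are hypothetical.

```python
import random

def lexical_split(pairs, ratios=(0.70, 0.05, 0.25), seed=0):
    """Split (x, y, relation) triples so that train/val/test share no words."""
    vocab = sorted({w for x, y, _ in pairs for w in (x, y)})
    random.Random(seed).shuffle(vocab)

    n_train = int(ratios[0] * len(vocab))
    n_val = int(ratios[1] * len(vocab))
    part = {}
    for i, word in enumerate(vocab):
        if i < n_train:
            part[word] = "train"
        elif i < n_train + n_val:
            part[word] = "val"
        else:
            part[word] = "test"

    splits = {"train": [], "val": [], "test": []}
    for x, y, rel in pairs:
        # Pairs whose words land in different partitions are discarded.
        if part[x] == part[y]:
            splits[part[x]].append((x, y, rel))
    return splits

# Toy pairs; the words and the relation label are placeholders.
pairs = [(f"x{i % 40}", f"y{i % 25}", "hyper") for i in range(200)]
splits = lexical_split(pairs)
```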
Out-of-domain (OOD) To test whether the methods capture a generic notion of each semantic relation, we test them on a domain that the classifiers have not seen during training. This setup is more realistic than the random and lexical split setups, in which the classifiers can benefit from memorizing verbatim words (random) or regions in the vector space (lexical split) that fit a specific slot of each relation.
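A rough sketch of such an out-of-domain protocol is given below, assuming each instance carries a domain label: every domain serves once as the test set, and a randomly chosen remaining domain is used for validation. The generator name, the tuple layout, and the toy domain names are hypothetical.

```python
import random

def out_of_domain_folds(instances, seed=0):
    """Yield (test_domain, train, val, test) for each held-out domain.

    `instances` is a list of (x, y, relation, domain) tuples.
    """
    rng = random.Random(seed)
    domains = sorted({d for *_, d in instances})
    for test_dom in domains:
        remaining = [d for d in domains if d != test_dom]
        val_dom = rng.choice(remaining)  # random validation domain
        train = [i for i in instances if i[3] not in (test_dom, val_dom)]
        val = [i for i in instances if i[3] == val_dom]
        test = [i for i in instances if i[3] == test_dom]
        yield test_dom, train, val, test

# Toy instances with placeholder words and domains.
toy_domains = ["animals", "tools", "vehicles", "plants"] * 3
instances = [(f"x{i}", f"y{i}", "hyper", d) for i, d in enumerate(toy_domains)]
folds = list(out_of_domain_folds(instances))
```

Scores computed on each fold's test set would then be averaged over the folds.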
Specifically, on BLESS and K&H+N, one domain is held out for testing whilst the classifiers are trained and validated on the remaining domains. This process is repeated using each domain as the test set, and each time, a randomly selected domain among the remaining domains is left out for validation. The average results are reported.

Experiments
Table 3 summarizes the best performing base representations and combinations on the test sets across the various datasets and evaluation setups. The results across the datasets vary substantially in some cases due to the differences between the datasets' relations, class balance, and the source from which they were created. For instance, K&H+N is imbalanced in the number of instances across relations and domains. ROOT09 was designed to mitigate the lexical memorization issue by adding negative switched hyponym-hypernym pairs to the dataset, making it an inherently more difficult dataset. EVALution contains a richer set of semantic relations. Overall, the addition of multiplicative features improves upon the performance of the base representations.
Classifiers Multiplicative features substantially boost the performance of linear classifiers. However, the gain from adding multiplicative features is smaller when non-linear classifiers are used, since they partially capture such a notion of interaction (Levy et al., 2015). Within the same representation, there is a clear preference for non-linear classifiers over linear classifiers.

Evaluation Setup The Only-y representation indicates how well a model can perform without considering the relation between x and y (Levy et al., 2015). Indeed, in RAND, this method performs similarly to the others, except on ROOT09, which by design disables lexical memorization. As expected, a general decrease in performance is observed in LEX and OOD, stemming from the methods' inability to benefit from lexical memorization. In these setups, there is a more significant gain from using multiplicative features when non-linear classifiers are used.
Word Pair Representations Among the base representations, Concat often performed best, while Mult seemed to be the preferred multiplicative addition. Concat ⊕ Mult performed consistently well, intuitively because Concat captures the typicality of each word in the relation (e.g., if y is a typical hypernym) and Mult captures the similarity between the words (where Concat alone may suggest that animal is a hypernym of apple).
To take a closer look at the gain from adding Mult, Table 4 shows the performance of the various base representations and combinations with Mult using different classifiers on BLESS.

Analysis of Multiplicative Features
We focus the rest of the discussion on the OOD setup, as we believe it is the most challenging setup, forcing methods to consider the relation between x and y. We found that in this setup, all methods performed poorly on K&H+N, likely due to its imbalanced domain and relation distribution.
Examining the per-relation F1 scores, we see that many methods assign all pairs to a single relation. Even KSIM, the best performing method in this setup, classifies pairs as either hyper or random, effectively only determining if they are related or not. We therefore focus our analysis on BLESS.
To get a better intuition of the contribution of multiplicative features, Table 5 exemplifies pairs that were incorrectly classified by Concat (RBF) while correctly classified by Concat ⊕ Mult (RBF), along with their cosine similarity scores. It seems that Mult indeed captures the similarity between x and y. While Concat sometimes relies on properties of a single word, e.g. classifying an adjective y to the attribute relation and a verb y to the event relation, adding Mult changes the classification of such pairs with low similarity scores to random. Conversely, pairs with high similarity scores which were falsely classified as random by Concat are assigned specific relations by Concat ⊕ Mult.
Interestingly, we found that across domains, there is an almost consistent order of relations with respect to mean intra-pair cosine similarity. Since the difference between random (0.141) and other relations (0.279-0.426) was the most significant, it seems that multiplicative features help distinguish between related and unrelated pairs. This similarity is possibly also used to distinguish between other relations.
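The per-relation statistic discussed here can be computed along the following lines. This is a sketch with hypothetical helper names and toy two-dimensional vectors; it does not reproduce the reported values.

```python
from collections import defaultdict

import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_cosine_by_relation(pairs, vectors):
    """Average the cosine similarity of (x, y) over all pairs of each relation."""
    sims = defaultdict(list)
    for x, y, rel in pairs:
        sims[rel].append(cosine(vectors[x], vectors[y]))
    return {rel: float(np.mean(values)) for rel, values in sims.items()}

# Toy vectors and labels chosen so the expected means are easy to verify by hand.
vectors = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([2.0, 0.0]),
    "c": np.array([0.0, 1.0]),
}
pairs = [("a", "b", "related"), ("a", "c", "random")]
means = mean_cosine_by_relation(pairs, vectors)
```

Sorting the resulting dictionary by value gives the order of relations by mean intra-pair similarity.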

Conclusion
We have suggested a cheap way to boost the performance of supervised distributional methods for lexical entailment by integrating multiplicative features into standard word-pair representations. Our results confirm that the multiplicative features boost the performance of linear classifiers, and in strict evaluation setups, also of non-linear classifiers. We performed an extensive evaluation with different classifiers and evaluation setups, and suggest the out-of-domain evaluation as the most suitable for the task. Directions for future work include investigating other compositions, and designing a neural model that can automatically learn such features.