Learning to Negate Adjectives with Bilinear Models

We learn a mapping that negates adjectives by predicting an adjective’s antonym in an arbitrary word embedding model. We show that both linear models and neural networks improve on this task when they have access to a vector representing the semantic domain of the input word, e.g. a centroid of temperature words when predicting the antonym of ‘cold’. We introduce a continuous class-conditional bilinear neural network which is able to negate adjectives with high precision.


Introduction
Identifying antonym pairs such as hot and cold in a vector space model is a challenging task, because synonyms and antonyms are both distributionally similar (Grefenstette, 1992;Mohammad et al., 2008). Recent work on antonymy has learned specialized word embeddings using a lexical contrast objective to push antonyms further apart in the space (Pham et al., 2015;Ono et al., 2015;Nguyen et al., 2016;Mrkšić et al., 2016), which has been shown to improve both antonym detection and the overall quality of the vectors for downstream tasks. In this paper we are interested in a related scenario: given an arbitrary word embedding model, with no assumptions about pretraining for lexical contrast, we address the task of negation, which we define as the prediction of a one-best antonym for an input word. For example, given the word talkative, the negation mapping should return a word from the set quiet, taciturn, uncommunicative, etc.
We focus on the negation of adjectives. The intuition behind our approach is to exploit a word's semantic neighborhood to help find its antonyms. Antonym pairs share a domain, or topic-e.g. tem-perature; but differ in their value, or polarity-e.g. coldness (Turney, 2012;Hermann et al., 2013). Negation must alter the polarity while retaining the domain information in the word embedding. We hypothesize that a successful mapping must be conditioned on the domain, since the relevant features for negating, say, a temperature adjective, differ from those for an emotion adjective. Inspired by Kruszewski et al. (2016), who find that nearest neighbors in a vector space are a good approximation for human judgements about negation, we represent an adjective's domain by the centroid of nearest neighbors in the embedding space or cohyponyms in WordNet.
We introduce a novel variant of a bilinear relational neural network architecture which has proven successful in identifying image transformations in computer vision (Memisevic, 2012;Rudy and Taylor, 2015), and which learns a negation mapping conditioned on a gate vector representing the semantic domain of an adjective. Our model outperforms several baselines on a multiple choice antonym selection task, and learns to predict a one-best antonym with high precision. In addition to the negation task, this model may be of interest for other NLP applications involving lexical or discourse relations.

Relational Encoders
Our task is to map a word embedding vector x, e.g. hot, to an antonym vector y in the same space, e.g. cold, conditioned on the semantic domain, which is represented by a vector z (see Sec 3.2 for how this vector is obtained). We learn this mapping using a relational neural network, which we introduce in the following sections.
computer vision to learn representations of transformations between images, such as rotation or translation (Memisevic and Hinton, 2007;Memisevic, 2012Memisevic, , 2013. RAEs are a type of gated network, which contains multiplicative connections between two related inputs. The "gating" of one image vector by another allows feature detectors to concentrate on the correspondences between the related images, rather than being distracted by the differences between untransformed images. See Figure 1(a). Multiplicative connections involve a weight for every pair of units in the input vector and gate vector. For an overview of RAEs see Memisevic (2013) and Sigaud et al. (2015). RAE gates perform a somewhat different function than LSTM gates (Hochreiter and Schmidhuber, 1997). Both architectures use a nonlinearity to modulate the contents of a product; in an RAE this is an outer (bilinear) product while in an LSTM it is a Hadamard (element-wise) product. However, LSTM memory gates represent an internal hidden state of the network, while RAE gates are part of the network input.
An Autoencoder (AE) can be defined as in Eq 1 (we omit bias terms for simplicity), where W e are the encoder weights and W d are the decoder weights. In autoencoders, weights are typically tied so that For an RAE, we have two inputs x and z. Instead of a weight matrix W we have a weight tensor W ∈ R n H ×n X ×n Z . The RAE is defined in Eq 2.
Rudy and Taylor (2015) introduce a classconditional gated autoencoder in which the gate is a one-hot class label, rather than a transformed version of the input image. For example, in the MNIST task the label represents the digit. Effectively, an autoencoder is trained per class, but with weight sharing across classes. See Figure 1(b).

Continuous Class-Conditional Relational Encoders
Our bilinear model is a continuous classconditional relational encoder (CCCRE). The model architecture is the same as an RAE with untied encoder and decoder weights (Eq 2). However, the training signal differs from a classic RAE in two ways. First, it is not an autoencoder, but simply an encoder, because it is not trained to reproduce the input but rather to transform the input to its antonym. Second, the encoder is class-conditional in the sense of Rudy and Taylor (2015), since the gate represents the class. Unlike the one-hot gates of Rudy and Taylor (2015), our gates are real-valued, representing the semantic domain of the input vector. See Figure 1(c).
Analogous to the case of image transformation detection, we want the model to learn the changes relevant to negation without being distracted by cross-domain differences.
We approximate the semantic domain as the centroid of a set of related vectors (see Sec 3.2). This approach is inspired by Kruszewski et al. (2016), who investigate negation of nouns, which typically involves a set of alternatives rather than an antonym. It is natural to finish the statement That's not a table, it's a ... with desk or chair, but not pickle. Kruszewski et al. (2016) find that near-est neighbors in a vector space are a good approximation for human judgements about alternatives. We hypothesize that a set of alternatives can stand in for the semantic domain. Note that each word has its own domain, based on its WordNet or distributional neighbors; however, similar words will generally have similar gates.

Models
We compare the CCCRE with several baselines. The simplest is Cosine similarity in the original vector space. We train a linear model (Linear) which maps the input word to its antonym (Eq 3), y = Wx (3) an Untied Encoder (UE) with a bottleneck hidden layer, and a shallow feed-forward model (FF) with a wide hidden layer rather than a bottleneck (both as in Eq 1 with different hidden layer sizes). To test whether the semantic domain is helpful in learning negation, each of these models has a Concat version in which the input consists of the concatenated input word and gate vectors x||z, rather than x.

Experimental Settings
We use publicly-available 1 300-dimensional embeddings trained on part of the Google News dataset using skip-gram with negative sampling (SGNS) (Mikolov et al., 2013). Antonym training data was obtained from WordNet (Miller, 1995) (hereafter WN), resulting in approximately 20K training pairs. Training data always excludes antonym pairs where the input word is an input word the test set. Exclusion of pairs where the target word is a target in the test set depends on the training condition.
Gate vectors were obtained under two conditions. In the standard condition we begin with all WN cohyponyms of an input word. If there are fewer than ten, we make up the difference with nearest neighbors from the vector space. The gate vector is the vector centroid of the resulting word list. In the standard training condition, we do not exclude antonym pairs with the target word in the test set, since we hypothesize it is important for the model to see other words with a similar semantic domain in order to learn the subtle changes necessary for negation. For example, if the pair (hot, cold) is in the test set, we exclude (hot, cold), (hot, freezing), etc. from training; but we do not exclude (icy, hot) or (burning, cold) from training.
In the unsupervised gate condition we do not use WN, but rather the ten nearest neighbors from the vector space. Note that it is only the gates which are unsupervised, not the word pairs: the training targets are still supervised.
We In all conditions, the input word vector is never part of the gate centroid, and we use the same gate type at training and test time.
Hyperparameters were tuned on the GRE development set (Sec 3.3). All models were optimized using AdaDelta (ρ = 0.95) to minimize Mean Squared Error loss. The FF and CCCRE networks have hidden layers of 600 units, while UE has 150 and UE-Concat has 300. Minibatch size was 48 for CCCRE and 16 for all other networks. The linear models were trained for 100 epochs, FF networks for 400, UE for 300, and CCCRE for 200.

Evaluation
Experiment 1 uses the Graduate Record Examination (GRE) questions of Mohammad et al. (2013). The task, given an input word, is to pick the best antonym from five options. An example is shown in (4), where the input word is piquant and the correct answer is bland. We use only those questions where both input and target are adjectives. piquant: (a) shocking (b) jovial (c) rigorous (d) merry (e) bland We evaluate a model by predicting an antonym vector for the input word, and choosing the multiple choice option with the smallest cosine distance to the predicted vector. We report accuracy, i.e. percentage of questions answered correctly.
Experiment 2 evaluates the precision of the models. A natural criterion for the success of a negation mapping is whether the model returns a good antonym at rank 1, or several good antonyms at rank 5, rather than returning any particular antonym as required by the GRE task. We use two datasets: the GRE test set (GRE), and a set of 99 adjectives and their antonyms from a crowdsourced dataset collected by Lenci and Benotto acccording to the guidelines of Schulte im Walde and Köper (2013) (LB). For each input word we retrieve the five nearest neighbors of the model prediction and check them against a gold standard. Gold standard antonyms for a word include its antonyms from the test sets and WN. Following Gorman and Curran (2005), to minimize false negatives we improve the coverage of the gold standard by expanding it with antonyms from Roget's 21st Century Thesaurus, Third Edition. 2 Table 1 shows the results of Experiment 1. A random baseline results in 0.20 accuracy. The cosine similarity baseline is already fairly strong at 0.50, suggesting that in general about two out of the five options are closely related to the input word.

Results and Discussion
Information about the semantic domain clearly provides useful information for this task, because the Concat versions of the Linear, UE, and FF models achieve several points higher than the models using only the input word. The Linear-Concat model achieves a surprisingly high 0.66 accuracy under standard training conditions. CCCRE achieves the highest accuracy across all training conditions, and is the only model that beats the linear baseline, suggesting that bilinear connections are useful for antonym prediction.
All the models show a notable loss of accuracy in the unsupervised condition, suggesting that the alternatives found in the vector neighborhood are 2 http://thesaurus.com less useful than supervised gates. Even in this setting, however, CCCRE achieves a respectable 0.60. In the restricted condition, all non-Concat models perform near the cosine baseline, suggesting that in the standard setting they were memorizing antonyms of semantically similar words. The Concat models and CCCRE retain a higher level of accuracy, indicating that they can generalize across different semantic classes.
We are unable to compare directly with previous results on the GRE dataset, since our evaluation is restricted to adjectives. As an indicative comparison, Mohammad et al. (2013) report an F-score of 0.69 on the full test dataset with a thesaurusbased method, while Zhang et al. (2014) report an F-score of 0.62 using a vector space induced from WN and distributional vectors, and 0.82 with a larger thesaurus. (Previous work reported Fscore rather than accuracy due to out-of-coverage terms.) Although CCCRE achieves the highest accuracy in Experiment 1, the GRE task does not reflect our primary goal, namely to negate adjectives by generating a one-best antonym. CCCRE sometimes fails to choose the target GRE antonym, but still makes a good overall prediction. For input word doleful, the model fails to choose the GRE target word merry, preferring instead sociable. However, the top three nearest neighbors for the predicted antonym of doleful are joyful, joyous, and happy, all very acceptable antonyms. Table 2 shows the results of Experiment 2. On the GRE dataset, under standard training conditions, CCCRE achieves an impressive P@1 of 0.66, i.e. two thirds of the time it is able to produce an antonym of the input word as the nearest neighbor of the prediction. All of the other models score less than 0.40. In the unsupervised and restricted training conditions CCCRE still predicts a one-best antonym about half the time.
The LB dataset is more challenging, because it contains a number of words which lack obvious antonyms, e.g. taxonomic, quarterly, psychiatric, and biblical. However, CCCRE still achieves the highest precision on this dataset. Interestingly, precision does not suffer as much in the less supervised training conditions, and P@1 even improves with the unsupervised nearest neighbor gates. We speculate that nearest distributional neighbors correspond better than the WN ontology to the crowdsourced antonyms in this dataset. LB antonyms for   psychiatric include normal, well, sane, and balanced. The unsupervised model predicts sane as the top neighbor, while standard predicts psychiatrists. The sense in which sane is an antonym of psychiatric is an extended sense, of a form unlikely to be found in WN training data. Table 3 shows sample predictions for the CCCRE and FF-Concat models. It can be seen that CCCRE has more antonyms at the highest ranks.

Related Work
Previous work on negation has focused on patternbased extraction of antonym pairs (Lin et al., 2003;Lobanova, 2012). Such bootstrapped lexical resources are useful for the negation task when the input words are covered. Turney (2008); Schulte im Walde and Köper (2013); Santus et al. (2014Santus et al. ( , 2015 use pattern-based and distributional features to distinguish synonym and antonym pairs. Schwartz et al. (2015) build a vector space using pattern-based word co-occurrence, which can be tuned to reduce the cosine similarity of antonyms.  (Rudy and Taylor, 2015). Wang et al. (2015) use RAEs for tag recommendation, but to our knowledge RAEs have not been previously used in NLP.

Conclusion
We have shown that a representation of the semantic domain improves antonym prediction in linear and non-linear models, and that the multiplicative connections in a bilinear model are effective at learning to negate adjectives with high precision.
One direction for future improvement is to make the model more efficient to train, by reducing the number of parameters to be learned in the relational network (Alain and Olivier, 2013). Future work will address negation of nouns and verbs, especially the cases requiring prediction of a set of alternatives rather than a true antonym (e.g. desk, chair, etc. for table). Bilinear models may also be useful for NLP tasks involving other lexical and discourse relations that would benefit from being conditioned on a domain or topic.