Unraveling Antonym’s Word Vectors through a Siamese-like Network

Discriminating antonyms and synonyms is an important NLP task with the inherent difficulty that both antonyms and synonyms carry similar distributional information. Consequently, pairs of antonyms and synonyms may have similar word vectors. We present an approach to unravel antonymy and synonymy from word vectors, inspired by siamese networks. The model consists of a two-phase training of the same base network: a pre-training phase under a siamese scheme supervised by synonyms, and a training phase on antonyms through a siamese-like model that captures the antitransitivity of antonymy. The approach relies on the claim that two antonyms of the same word tend to be synonyms. We show that our approach outperforms distributional and pattern-based approaches, relying on a simple feed-forward network as the base network of both training phases.


Introduction
Antonymy and synonymy are lexical relations that are crucial in language semantics. Antonymy is the relation between opposite words (e.g. big-small) and synonymy refers to words with similar meaning (e.g. bug-insect). Detecting them automatically is a challenging NLP task that can benefit many others, such as textual entailment (Haghighi et al., 2005; Snow et al., 2006), machine translation (Bar and Dershowitz, 2010) and abstractive summarization (Khatri et al., 2018).
Hand-crafted lexical databases, such as WordNet (Miller, 1995), have been built and maintained for use in NLP and other fields, containing antonyms, synonyms and other lexical semantic relations. However, their construction and maintenance take considerable human effort, and broad coverage is difficult to achieve. Detecting antonyms automatically, relying on existing resources such as text, dictionaries and lexical databases, is an active NLP research area.
In the last decade, the use of and research on word vectors have increased rapidly. Word vectors rely on word co-occurrence information in a large corpus. The key idea behind them is the distributional hypothesis, which can be expressed as "words that are similar in meaning tend to occur in similar contexts" (Sahlgren, 2008; Rubenstein and Goodenough, 1965). A variety of methods have been developed to train word vectors, such as skip-gram (Mikolov et al., 2013), GloVe (Pennington et al., 2014), FastText (Joulin et al., 2016) and ElMo (Peters et al., 2018). Word vectors are widely used in NLP; a well-known use is in supervised learning, where word relatedness effectively expands the coverage of the training data.
A main difficulty in discriminating antonymy automatically in an unsupervised distributional setting is that oppositeness is not easily distinguishable in terms of context distributions. In fact, pairs of antonyms are very similar in meaning: antonyms are usable in the same contexts, but lead to opposite meanings. Antonymy is said to present the paradox of simultaneous similarity and difference (Cruse, 1986), because antonyms are similar in almost every dimension of meaning except the one where they are opposite.
The paradox of simultaneous similarity and difference is notorious in word space models. The contexts of a word and the contexts of its antonyms are usually similar, and therefore they have close vector representations; indeed, the vector-space neighborhood of a word includes other semantic relations besides synonymy, such as antonymy, hypernymy, co-hyponymy and specific relations (e.g. dog-bone).
Due to this paradox, word space models seem unsuitable for antonymy detection. A commonly used alternative resource is the path of words connecting the joint occurrences of two candidate words (Nguyen et al., 2017). Path-based approaches exploit the fact that antonyms co-occur in the same context more often than expected by chance (Scheible et al., 2013; Miller and Charles, 1991), so it is possible to obtain a significant amount of patterns.
In this paper, we claim that vector space models, despite giving close representations for synonyms and antonyms, contain subtle differences that make it possible to discriminate antonymy. In order to bring out those differences, we propose a method based on a neural network model that takes into account algebraic properties of synonymy and antonymy. The model formulation is based on the transitivity of synonymy and the antitransitivity of antonymy, on the symmetry of both relations, and on the reflexivity and irreflexivity of synonymy and antonymy, respectively. Moreover, the model exploits the property that two antonyms of the same word tend to be synonyms (Edmundson, 1967) (Figure 1). We use these properties to define a model based on siamese networks and a training strategy through antonyms and synonyms. We show that the presented approach gives surprisingly good results, even in comparison to models that use external information, such as dependency parsing, part-of-speech tagging or path patterns from a corpus. The introduced model is a way to learn any kind of antitransitive relation between distributed vectors. Antitransitivity may be suitable, for instance, to represent the relation of being adversary (Bonato et al., 2017). A different application of the presented approach could be in social networks, in order to find possible unknown enemies relying on a given set of known enmity and friendship links.
The rest of the paper is structured as follows. In Section 2 we present previous work on antonymy detection. In Section 3 we describe the proposed approach. We start with some algebraic principles of synonymy and antonymy on which our approach relies. Then we describe siamese networks and how the learned transformation tends to induce an equivalence relation, suitable for synonyms. In Section 3.3 we discuss the unsuitability of siamese networks for dealing with an antitransitive relation like antonymy, and we propose a variation of the original siamese network to do so. We refer to this network as a parasiamese network. Then, we argue that the base network of a parasiamese model for antonymy can be pre-trained by minimizing a siamese scheme on synonyms. Section 4 details the dataset, the word vectors and the random search strategy carried out to find an adequate hyperparameter configuration. In Section 5 we present the results and the behavior of the model. Finally, Section 6 concludes the paper.

Related Work
Antonymy detection, and antonymy-synonymy discrimination, have been treated principally by two families of approaches: distributional and pattern-based. Distributional approaches refer to the use of word vectors or a word's distributional information. Pattern-based approaches are those that rely on patterns of joint occurrences of pairs of words (such as "from X to Y") to detect antonymy. Given the direction of this work, we do not expand on path-based approaches, and we devote most of this section to distributional ones.
As we commented before, at first glance word vectors seem unsuitable for discriminating antonymy from synonymy, because pairs of antonyms correspond to similar vectors. Many studies and experiments have focused on the construction of vector representations that account for antonymy. Scheible et al. (2013) showed that the context distributions of adjectives make it possible to discriminate antonyms and synonyms if only words from certain classes are considered as contexts in the vector space model. Hill et al. (2014) found that word vectors from machine translation models outperform those learned from monolingual models on word similarity. They suggest that vectors from machine translation models should be used on tasks that require word similarity information, while vectors from monolingual models are more suitable for word relatedness. Santus et al. (2014) proposed APAnt, an unsupervised method based on the average precision of the context intersections of two words, to discriminate antonymy from synonymy.
Symmetric patterns in corpora (e.g. "X and Y") were used by Schwartz et al. (2015) to build word vectors, and they showed that the patterns can be chosen so that the resulting vectors treat antonyms as dissimilar. Ono et al. (2015) proposed an approach to train word vectors to detect antonymy, using antonymy and synonymy information from thesauri as supervised data. A main difference between their approach and ours is that they did not rely on pre-trained vectors: they used distributional information jointly with the supervised information to train vectors through a model based on skip-gram. Similarly, Nguyen et al. (2016) integrated synonymy and antonymy information into the skip-gram model to predict word similarity and distinguish synonyms and antonyms.
More recently, Nguyen et al. (2017) distinguished antonyms and synonyms using lexico-syntactic patterns jointly with the supervised word vectors from Nguyen et al. (2016). Finally, Vulić (2018) obtains strong performance by injecting lexical contrast into word embeddings by means of the ATTRACT-REPEL strategy.

Method
In this section we describe the proposed approach to discriminate antonymy and synonymy. It consists of an approach inspired by siamese networks that magnifies the subtle differences in antonym word vectors that distinguish them from synonyms.

Algebra of synonymy and antonymy
In order to define and substantiate our approach, we introduce an axiomatic characterization of antonymy and synonymy based on the work of Edmundson (1967). Specifically, synonymy and antonymy are modeled as relations and a set of axioms is proposed. These axioms, as we are going to show, are essential to formulate our approach.
At first glance, synonymy and antonymy can be seen as binary relations between words. However, based on empirical results, Edmundson defined synonymy and antonymy as ternary relations in order to consider the multiple senses of words, as follows:

x S_i y ≡ x is a synonym of y according to sense i
x A_i y ≡ x is an antonym of y according to sense i

Note that the senses of the words are represented in the relationship rather than in the words themselves. Each i (and therefore S_i and A_i) reflects a particular configuration of the senses of the words in the vocabulary, considering a unique sense for each word.
Firstly, synonymy is considered a reflexive, symmetric and transitive relationship. This is expressed by the following axioms:

Axiom 1 (reflexivity): x S_i x
Axiom 2 (symmetry): x S_i y → y S_i x
Axiom 3 (transitivity): x S_i y ∧ y S_i z → x S_i z

S_i is an equivalence relation for each fixed i, and therefore it splits the set of words into equivalence classes. In the next section we show that this is suitable for siamese networks.
Antonymy is also a symmetric relation, but it is irreflexive and antitransitive:

Axiom 4 (symmetry): x A_i y → y A_i x
Axiom 5 (irreflexivity): ¬(x A_i x)
Axiom 6 (antitransitivity): x A_i y ∧ y A_i z → ¬(x A_i z)

So far, synonymy and antonymy have been described separately. The following two axioms involve both relationships:

Axiom 7: x A_i y ∧ y A_i z → x S_i z
Axiom 8 (right-identity): x A_i y ∧ y S_i z → x A_i z

Axiom 7 is a refined version of the antitransitive property (axiom 6). Assuming that two words cannot be synonyms and antonyms simultaneously, it is direct to prove that axiom 7 implies axiom 6. We include axiom 6 for clarification purposes.
The right-identity, axiom 8, says that the synonyms of an antonym of a word are also antonyms of that word. Consequently, the antonymy relation can be extended to operate between synonymy equivalence classes.
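To make the joint axioms concrete, the following toy sketch checks axioms 7 and 8 on a small hand-built relation. The words and pairs are illustrative (not taken from any dataset), and the sense index i is dropped for brevity:

```python
# Toy single-sense relations; pairs are stored in both directions,
# since synonymy and antonymy are symmetric.
A = {("hot", "cold"), ("cold", "hot"), ("warm", "cold"), ("cold", "warm")}
S = {("hot", "warm"), ("warm", "hot")}

def axiom7_holds(A, S):
    # Axiom 7: two antonyms of the same word are synonyms (xAy, yAz => xSz).
    return all((x, z) in S
               for x, y in A for y2, z in A
               if y == y2 and x != z)

def axiom8_holds(A, S):
    # Axiom 8 (right-identity): synonyms of an antonym are antonyms
    # (xAy, ySz => xAz).
    return all((x, z) in A
               for x, y in A for y2, z in S
               if y == y2)

print(axiom7_holds(A, S), axiom8_holds(A, S))  # True True
```

Dropping a pair from S (or A) makes the corresponding check fail, which is how such a script can flag inconsistent supervision data.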
To introduce our model and the considered task setting, we simplify this definition by enforcing a binary relation. We consider:

x R y ≡ ∃i : x R_i y

where R and R_i are S or A and S_i or A_i, respectively. This simplification encapsulates the multiple senses of the words and is therefore suitable for word embeddings. However, the presented axioms may not be completely fulfilled under this simplification.

Synonymy and Siamese Networks
A siamese network is a model that receives two inputs and returns an output. A base neural network is applied to each input and the two outputs are compared using a vector distance function (see Figure 2). Usually, siamese networks are trained using a contrastive loss function. The complete model can be interpreted as a trainable distance function on complex data, such as images, sound or text. Siamese networks have been used in a variety of tasks, such as sentence similarity (Chi and Zhang, 2018), palmprint recognition (Zhong et al., 2018) and object tracking (Bertinetto et al., 2016), among many others. Consider a vocabulary V of words where we want to discriminate synonyms and a given word vector set of dimension n for that vocabulary. Then consider a neural network F_θ : IR^n → IR^n with weights θ and the following contrastive loss function:

L(θ) = Σ_{(x,y)∈P} d(F_θ(x), F_θ(y)) + Σ_{(x,y)∈N} max(0, α − d(F_θ(x), F_θ(y)))

where d : IR^n × IR^n → IR^+ is a vector distance function (e.g. d = ||x − y||_2), α is the threshold for the negative examples, and P and N are positive and negative example pairs, respectively. Thus P is a set of pairs of synonyms and N a set of pairs of words that are not synonyms. We consider that each pair to be already composed of the word vectors of its words, which simplifies the notation. This model can be trained using a backpropagation-based technique, and pairs whose output vectors are closer than a given threshold are classified as related.
It can be proved that the relation induced by a siamese network is reflexive and symmetric. Transitivity is a little trickier: it is assured when the sum of the distances of the antecedent related pairs is below the threshold and, in every case, the distance of a transitive pair is below twice the threshold. Therefore, a siamese network is a reasonable approach for supervised synonymy detection.
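As a sketch, the contrastive loss above can be written directly from its definition. The Euclidean distance and the identity base network below are placeholders for the trained F_θ:

```python
import math

def euclidean(x, y):
    # d(x, y) = ||x - y||_2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def contrastive_loss(F, P, N, alpha, d=euclidean):
    # Positive pairs are pulled together; negative pairs are pushed
    # apart up to the margin alpha.
    pos = sum(d(F(x), F(y)) for x, y in P)
    neg = sum(max(0.0, alpha - d(F(x), F(y))) for x, y in N)
    return pos + neg

# Toy check with the identity as base network:
identity = lambda v: v
P = [([1.0, 1.0], [1.0, 1.0])]   # a positive pair at distance 0
N = [([0.0, 0.0], [3.0, 4.0])]   # a negative pair at distance 5
print(contrastive_loss(identity, P, N, alpha=2.0))  # 0.0
```

With alpha = 2.0 the negative pair already lies beyond the margin, so the total loss is zero; raising alpha above 5 would start penalizing it.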

Antonymy and Antitransitivity
While a siamese network seems a reasonable choice for supervised synonym detection, antonymy presents a very different scenario. Consider F_θ* as the base neural network in a siamese scheme and suppose that it is trained and working perfectly to discriminate pairs of antonyms. Consider also three words w_1, w_2, w_3 such that w_1 is an antonym of w_2 and w_2 is an antonym of w_3. Then

d(F_θ*(w_1), F_θ*(w_2)) = 0 and d(F_θ*(w_2), F_θ*(w_3)) = 0,

hence F_θ*(w_1) = F_θ*(w_3) and therefore w_1 and w_3 would be recognized as antonyms, violating axiom 6.
A siamese network induces a transitive relationship, but antonymy is antitransitive. To model an antitransitive relation, we propose the following variation of the siamese network.
Let us consider F_θ and the model diagrammed in Figure 3. It consumes two vectors of the same dimension and applies a base neural network once to one input and twice to the other. The idea behind this scheme is that if two words are antonyms, then the base network applied once to one word vector and twice to the other will return close vectors. One application of the base network can be interpreted as mapping a word to a representation of its synonymy equivalence class, and the second application as mapping that class to a representation of its opposite class in terms of antonymy. Assume that F_θ* is trained and behaves perfectly on data according to the following loss function:

L(θ) = Σ_{(x,y)∈P} d(F_θ(F_θ(x)), F_θ(y)) + Σ_{(x,y)∈N} max(0, α − d(F_θ(F_θ(x)), F_θ(y)))

where P and N are positive and negative example pairs, respectively; α is the threshold for the negative examples, and d is a distance function as in the siamese network. Then, it can be seen that the induced relation fulfills the antitransitivity property if F_θ*(w) ≠ F_θ*(F_θ*(w)), which is expected since antonymy is an irreflexive relation.
Symmetry is not forced by definition but can be included in the loss function or by data, adding the reversed version of each pair in the dataset. The latter is the alternative chosen in this work.
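The parasiamese distance can be sketched as follows. The sign-flipping base network is only a toy stand-in chosen to make the antitransitivity argument concrete; vector negation is not the learned transformation:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def parasiamese_distance(F, x, y):
    # Base network applied once to x and twice to y: d(F(x), F(F(y))).
    return euclidean(F(x), F(F(y)))

negate = lambda v: [-a for a in v]   # toy "opposition" base network
w1, w2, w3 = [1.0, 2.0], [-1.0, -2.0], [1.0, 2.0]

# w1-w2 and w2-w3 behave as antonym pairs (distance 0)...
print(parasiamese_distance(negate, w1, w2))  # 0.0
# ...but the transitive pair w1-w3 gets a large distance.
print(parasiamese_distance(negate, w1, w3) > 0)  # True
```

Note that for this toy network F(w) ≠ F(F(w)) for any nonzero w, which is exactly the condition under which the induced relation is antitransitive.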

Relaxed Loss Function
In order to classify a pair of words we rely on a threshold ρ: if the candidate pair obtains a distance (between its transformed vectors) below ρ, it is classified as positive, otherwise as negative. It is therefore not necessary to minimize the distance to 0 to classify a pair correctly. We propose to replace the positive part of the contrastive loss function by

max(0, d(F_θ(F_θ(x)), F_θ(y)) − νρ)

where ν is a factor in [0, 1] that states the importance given to ρ; the rest of the terms remain the same as in the previous section. If ν = 0, the original loss function is recovered. We use ν = 1/2 and experimentally observe an improvement in results when this relaxed loss function is used.
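A minimal sketch of the relaxed positive term follows, assuming the hinge form max(0, d − νρ) suggested by the description above (this exact form is our reading of the relaxation, not a verbatim formula from the text):

```python
def relaxed_positive_term(dist, rho, nu):
    # Positive pairs only need to fall below the classification
    # threshold rho; distances within nu * rho of zero cost nothing.
    return max(0.0, dist - nu * rho)

# nu = 0 recovers the original positive term:
assert relaxed_positive_term(0.8, rho=1.0, nu=0.0) == 0.8
# With nu = 1/2, a pair already close enough incurs no loss:
assert relaxed_positive_term(0.3, rho=1.0, nu=0.5) == 0.0
```

The effect is that gradient updates stop pulling together positive pairs that are already safely below the decision threshold, freeing capacity for harder pairs.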

Pre-training using synonyms
Consider F_θ* trained and working perfectly to detect pairs of antonyms using the parasiamese scheme presented in the previous section. Now, let us consider the word vectors w_1, w_2 and w_3 such that w_1 is an antonym of w_2 and w_2 is an antonym of w_3. According to the parasiamese loss function, and using the symmetry of the training pairs, we have that

d(F_θ*(F_θ*(w_2)), F_θ*(w_1)) = 0 and d(F_θ*(F_θ*(w_2)), F_θ*(w_3)) = 0.

This implies that F_θ*(w_1) = F_θ*(w_3), suggesting for F the role of a siamese network. On the other hand, using axiom 7 we have that w_1 and w_3 tend to be synonyms, which, as we previously showed, fits well with siamese networks.
Using this result, we propose to pre-train F_θ by minimizing a siamese network on synonymy data, as in Section 3.2, and then perform the parasiamese training to detect antonyms as described in Section 3.3.
We use the same antonymy/synonymy dataset to pre-train and train the parasiamese network and we experimentally observe that this pre-training phase improves the performance of the parasiamese model.

Experiments
In this section we describe the setup of the experiments performed with the presented approach, giving the information needed to reproduce them. We describe the dataset, the word vector sets used and the random search strategy used for the hyperparameter configuration.

Antonymy Dataset
To perform our experiments we use the dataset created by Nguyen et al. (2017). This dataset contains a large number of pairs of antonyms and synonyms grouped according to their word class (noun, adjective and verb). It was built using pairs extracted by Nguyen et al. (2016) from WordNet and Wordnik to induce patterns over a corpus. The induced patterns were then used to extract new pairs, filtering out those that matched fewer than five patterns. Finally, the dataset was balanced to contain the same number of antonyms and synonyms, and split into train, validation and test partitions. The number of pairs contained in each partition of each word class is shown in the accompanying table.

Pre-trained word vectors
For the experimental setting we consider pre-trained general-purpose word vectors. We avoid out-of-vocabulary terms by using character-based approaches. The following publicly available resources were considered:

• FastText (Joulin et al., 2016) vectors trained on an English Wikipedia dump. We use the default hyperparameters and the vector dimension is 300.
• ElMo (Peters et al., 2018) vectors for English from Che et al. (2018). We use the first layer of ElMo, which gives representations for decontextualized words.
In the case of FastText, we compute 300-dimensional vectors for each word in the dataset. In the case of ElMo embeddings, the pre-trained model was already defined to generate representations of 1024 dimensions.

Base Network Structure
The base network transforms each word vector into a representation that reflects synonymy and antonymy. Any differentiable function whose inputs and outputs are vectors of the same dimension as the word embedding space can be used as the base network. In this work we consider layered fully connected networks with ReLU as the activation function.
The presented model involves many hyperparameters, some of them with many options. We use random search to find a good hyperparameter configuration, since it may lead to a better and more efficient solution than grid or manual search (Bergstra and Bengio, 2012). This improvement comes from the fact that some hyperparameters do not really matter, and grid or manual search would spend time exploring each combination of them (for each combination of the rest), while random search does not exhaustively explore irrelevant parts of the hyperparameter space.
We perform random search, sampling models according to the following considerations:

• 2, 3, 4 or 5 layers, uniformly chosen;
• for each hidden layer (if any), we sample its size from a Gaussian distribution with µ = d/2 and σ = d/5, where d is the dimension of the word vectors;
• dropout is activated with probability 1/2, and the dropout probability is sampled from a Gaussian distribution (µ = 0.25, σ = 0.1);
• batch size uniformly chosen from {32, 64, 128};
• SGD or Adam chosen with equal probability, with a learning rate chosen from {0.01, 0.001, 0.0001};
• the patience for early stopping sampled uniformly from {3, 4, 7, 9}.

We initialize the weights of the network using the Glorot uniform function (Glorot and Bengio, 2010). We stop the training using early stopping and keep the best model of the whole run against the validation set. For the implementation we use Keras (Chollet et al., 2015).
After analyzing the results of 200 sampled hyperparameter configurations using the FastText vectors, we found that an adequate setting is a four-layered network with layer sizes [300, 227, 109, 300] from input to output, without dropout, and ReLU activation for every neuron. For training, the batch size is 64, with an acceptance threshold of 2.0 and a threshold of 3.0 for the negative part of the contrastive loss. The optimizer is SGD with a learning rate of 0.01 and a patience of 5 for early stopping. This training setup was used in both phases: pre-training and training. For the experiments with ElMo embeddings we only adjust the hidden layer sizes; ElMo results would likely improve with a dedicated hyperparameter search.
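The selected base network can be sketched as follows, using random, untrained Glorot-uniform weights in plain NumPy (a real run would train these parameters with the siamese and parasiamese objectives, e.g. in Keras as the paper does):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [300, 227, 109, 300]   # layer dimensions from input to output

# Glorot-uniform initialization: limit = sqrt(6 / (fan_in + fan_out)).
weights = [rng.uniform(-np.sqrt(6.0 / (m + n)), np.sqrt(6.0 / (m + n)),
                       size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def base_network(v):
    # Fully connected layers with ReLU on every neuron.
    for W, b in zip(weights, biases):
        v = np.maximum(v @ W + b, 0.0)
    return v

out = base_network(rng.normal(size=300))
print(out.shape)  # (300,)
```

Because input and output dimensions match, the same function can be applied once or twice to a word vector, as the parasiamese scheme requires.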

Results
In this section we discuss the results obtained with the presented approach and analyze the model behavior through the outputs of the base network in the siamese and parasiamese schemes. We include two baselines with different motivations for comparison purposes. We analyze the model outputs for related and unrelated pairs (i.e. pairs that are neither synonyms nor antonyms). At the end of this section, we analyze the output of the base network.

Baselines
We consider two baselines to compare our experiments against. The first baseline is a feed-forward network classifier that consumes the concatenation of the embeddings of each word in the candidate pair. This baseline compares the proposed model against a conventional supervised classification scheme using neural networks. For this baseline we feed the FastText vectors to a four-layered network with layer dimensions [600, 400, 200, 1] from input to output and ReLU as the activation function. This model was trained with the binary cross-entropy loss and SGD with a learning rate of 0.01.
The second baseline we consider for comparison is AntSynNet (Nguyen et al., 2017), a pattern-based approach that encodes the paths connecting the joint occurrences of each candidate pair using an LSTM. It relies on additional information, such as part-of-speech tags, lemmas, syntax and dependency trees.

Antonymy and Synonymy discrimination
We evaluate our model on the antonymy-synonymy discrimination task proposed by Nguyen et al. (2017). However, we face the task from a different point of view: we are interested in showing that word vectors contain what is needed to distinguish between antonyms and synonyms, rather than in resolving the general task using any available resource. For that reason we do not try to improve performance by adding more information to the model, such as paths. Ours is a supervised approach that discriminates antonymy and synonymy using only word vectors as features.
The obtained results are reported in Table 1. The first baseline is included to compare the performance of our model with a classifier over concatenated word vectors. We also report results with and without pre-training, to show the performance gain that pre-training contributes. Notice that, in contrast to AntSynNet, no path-based information is considered in our approach.

Siamese and parasiamese outputs
In this section we show the outputs of the siamese and parasiamese networks on word pairs chosen from the validation set (see Table 3). The obtained results show, in general, a suitable behavior of the model. We also include the cosine distance to show that it is unable to distinguish between antonyms and synonyms. It is interesting to notice, for instance, in the upper part of the table, which corresponds to antonyms, the difference in outputs between the pairs cold-warm and cold-hot. It may be interpreted as cold-hot being more strongly antonymous than cold-warm, which seems adequate. Below the dashed line of each part we include some failure cases.

Non-related pairs
The task setting considered in this work only uses synonyms and antonyms for training. It is interesting to notice that the behavior of the model on unrelated pairs is learned from related pairs and word embeddings, without considering any unrelated pairs during training. We show the model outputs for a selection of unrelated pairs in Table 4. The obtained results show that the model is not capable of detecting unrelated pairs correctly. In fact, the model seems to learn a broader relation. For example, the words safe and adverse are predicted as antonyms; although they are not antonyms, they have some oppositeness. Similarly, the combinations of cold and warm with day and night also seem coherent, since the day tends to be warmer than the night and the night tends to be colder than the day. In the upper part of the table we include unrelated pairs that were correctly predicted as unrelated, and below the dashed line we include failure cases on unrelated pairs.

Base Network Output
In this section we analyze the learned base network. In Figure 4 we show a 2D visualization of the original and the transformed word embeddings. The sample of words was chosen from the validation set and t-SNE (Maaten and Hinton, 2008) was used for dimensionality reduction. It can be observed that in the original space antonyms tend to be close, and when the base network is applied the space seems to split into two parts, corresponding to the two poles of antonymy.
We also consider the space resulting from applying the transformation twice to the original word vector space, which is similar to the result of applying it only once. This behavior is coherent with the parasiamese network definition.
To conclude this section, we show the closest words (in the vocabulary) to the words natural and unnatural, in the original and the transformed spaces, sorted by distance (Table 5). Note how some opposite words appear close in the original space, while in the transformed space the nearest words do not seem to be opposite to the word in question.

Conclusion
We presented a supervised approach to distinguish antonyms and synonyms using pre-trained word embeddings. The proposed method is based on algebraic properties of synonymy and antonymy, principally the transitivity of synonymy and the antitransitivity of antonymy. We proposed a new siamese-inspired model to deal with antitransitivity, the parasiamese network. In addition, we proposed to pre-train this network through a siamese scheme, relying on the claim that two antonyms of the same word tend to be synonyms, and we introduced a relaxed version of the contrastive loss function. We evaluated our approach using a publicly available dataset and word vectors, obtaining encouraging results.