Learning Antonyms with Paraphrases and a Morphology-Aware Neural Network

Recognizing and distinguishing antonyms from other types of semantic relations is an essential part of language understanding systems. In this paper, we present a novel method for deriving antonym pairs using paraphrase pairs containing negation markers. We further propose a neural network model, AntNET, that integrates morphological features indicative of antonymy into a path-based relation detection algorithm. We demonstrate that our model outperforms state-of-the-art models in distinguishing antonyms from other semantic relations and is capable of efficiently handling multi-word expressions.


Introduction
Semantics is a branch of linguistics that studies meaning in language, or more specifically the "meaning relationships" between words. These relationships include, but are not limited to, synonymy, antonymy, hypernymy, and hyponymy. Synonymy refers to words that are pronounced and spelled differently but share the same meaning; for example, happy and joyful are synonyms of each other. Hypernymy and hyponymy refer to the relationship between a general term and the more specific terms that fall under its category. For example, the birds pigeon, sparrow, and crow are hyponyms; they fall under the general term bird, which is their hypernym.
Antonymy can be defined as the oppositeness of meaning between two expressions, or as a relation between expressions with contrasting meanings. Over the years, linguists, cognitive scientists, psycholinguists, and lexicographers have tried to better understand and define antonymy. Palmer (1982) classified antonymy into the following three types.
• Gradable antonymy refers to a pair of words with opposite meanings that lie on a continuous spectrum, so that the members of a pair differ in degree. If something is not A, it is not necessarily B; it can be any of C, D, or E in between A and B. For instance, "today is not hot" does not necessarily mean "today is cold": since a scale exists between hot and cold, it may mean "today is warm". Other examples are wet-dry, young-old, early-late.

• Relational antonymy refers to a pair of words with opposite meanings, where the opposition makes sense only in the context of the relationship between the two meanings. This is a special type of antonymy in which the members of a pair do not constitute a positive-negative opposition; rather, they express the reversal of a relationship between two entities. X buys something from Y means the same as Y sells something to X: it is the same relationship seen from two different angles. Other examples are parent-child, doctor-patient, give-receive.

• Complementary antonymy refers to a pair of words whose meanings exhaustively divide a conceptual domain, so that negating one member entails the other, as in alive-dead or true-false.

In its strictest sense, antonymy applies to gradable adjectives, such as hot-cold and tall-short, where the two words represent the two ends of a semantic dimension. In a broader sense, it includes other adjectives, nouns, and verbs as well, such as life-death, ascend-descend, shout-whisper. In its broadest sense, it applies to any two words that represent contrasting meanings.

The task of identifying antonymous expressions is valuable for NLP systems that go beyond recognizing semantic relatedness and need to identify specific semantic relations such as synonymy and hypernymy. While manually created semantic taxonomies, like WordNet (Fellbaum, 1998), define antonymy relations between some word pairs that native speakers consider antonyms, they have limited coverage. Further, since each term of an antonymous pair can have many semantically close terms, the contrasting word pairs far outnumber those that are commonly considered antonym pairs, and most of them remain unrecorded. Therefore, automated methods have been proposed to determine, for a given term-pair (x, y), whether x and y are antonyms of each other, based on their occurrences in a large corpus.

Charles and Miller (1989) proposed that antonyms occur together in a sentence more often than chance; this is known as the co-occurrence hypothesis. However, non-antonymous semantically related words such as hypernyms, holonyms, meronyms, and near-synonyms also tend to occur together more often than chance, so separating antonyms from them has proven to be difficult. Approaches to antonym detection have exploited distributional vector representations, relying on the distributional hypothesis of semantic similarity (Harris, 1954; Firth, 1957), which holds that words that occur in similar contexts tend to be semantically close.

Two main information sources are used to recognize semantic relations: path-based and distributional. Path-based methods consider the joint occurrences of the two terms in a given sentence and use the dependency paths that connect the terms as features (Hearst, 1992; Roth and Schulte im Walde, 2014; Schwartz et al., 2015). For distinguishing antonyms from other relations, Lin et al. (2003) proposed to use antonym patterns (such as either X or Y and from X to Y). Distributional methods are based on the disjoint occurrences of each term and have recently become popular using word embeddings (Mikolov et al., 2013; Pennington et al., 2014), which provide a distributional representation for each term. Recently, combined path-based and distributional methods for relation detection have also been proposed (Shwartz et al., 2016), showing that a good path representation can provide substantial complementary information to the distributional signal for distinguishing between different semantic relations.
While antonymy applies to expressions that represent contrasting meanings, paraphrases are phrases expressing the same meaning, which usually occur in similar textual contexts (Barzilay and McKeown, 2001) or have common translations in other languages (Bannard and Callison-Burch, 2005). Specifically, if two words or phrases are paraphrases, they are unlikely to be antonyms of each other. Our first approach to antonym detection exploits this fact, using paraphrases for detecting and generating antonyms (e.g., The dementors caught Sirius Black / Black could not escape the dementors).
We start by focusing on phrase pairs that are most salient for deriving antonyms. Our assumption is that phrases (or words) containing negating words (or prefixes) are more helpful for identifying opposing relationships between term-pairs. For example, from the paraphrase pair (caught/not escape), we can derive the antonym pair (caught/escape) by just removing the negating word 'not'.
Our second method is inspired by the recent success of deep learning models for relation detection. Shwartz et al. (2016) proposed an integrated path-based and distributional model to improve hypernymy detection between term-pairs, and later extended it to classify multiple semantic relations (Shwartz and Dagan, 2016) (LexNET).
Although LexNET was the best performing system in the semantic relation classification task of the CogALex 2016 shared task, the model performed poorly on synonyms and antonyms compared to other relations. The path-based component is weak in recognizing synonyms, which do not tend to co-occur, and the distributional information causes confusion between synonyms and antonyms, since both tend to occur in the same contexts. We propose AntNET, a novel extension of LexNET that integrates information about negating prefixes as a new morphological pattern feature and is able to distinguish antonyms from other semantic relations. In addition, we optimize the vector representations of dependency paths between the given term-pair, encoded using a neural network, by replacing the embeddings of words with negating prefixes by the embeddings of the base, non-negated, forms of the words. For example, for the term pair unhappy/joyful, we record the negating prefix of unhappy using a new path feature and replace the word embedding of unhappy with that of happy in the vector representation of the dependency path between unhappy and joyful. The proposed model improves the path embeddings to better distinguish antonyms from other semantic relations and achieves higher performance than prior path-based methods on this task.
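To make the replacement step concrete, the following is a minimal sketch of the negation normalization, assuming a hand-listed prefix set in place of the MORSEL morphological analyzer used in this work (introduced in Chapter 3); VOCAB is a hypothetical toy vocabulary.

```python
# Minimal sketch of the negation normalization described above, assuming
# a hand-listed prefix set in place of the MORSEL morphological analyzer
# used in this work. VOCAB is a hypothetical toy vocabulary; a real
# analyzer is also needed to reject words that merely look negated
# (e.g. invaluable), which this naive check cannot do on its own.

NEGATING_PREFIXES = ("un", "in", "im", "il", "ir", "dis", "non")
VOCAB = {"happy", "identical", "sufficient", "honest"}

def strip_negation(word: str):
    """Return (base_form, negated_flag) for a possibly negated word."""
    for prefix in NEGATING_PREFIXES:
        if word.startswith(prefix) and word[len(prefix):] in VOCAB:
            return word[len(prefix):], True
    return word, False

print(strip_negation("unhappy"))    # ('happy', True)
print(strip_negation("identical"))  # ('identical', False)
```

The path encoder then uses the returned flag as the new negation feature and looks up the embedding of the base form instead of the negated word.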

Contributions
The main contributions of this thesis are:

• We present a novel technique of using paraphrases for antonym detection and successfully derive antonym pairs from paraphrases in the Paraphrase Database (PPDB; Ganitkevitch et al., 2013; Pavlick et al., 2015b), the largest paraphrase resource currently available.
• We demonstrate improvements to an integrated path-based and distributional model, showing that our morphology-aware neural network model, AntNET, performs better than state-of-the-art methods for antonym detection.

Document Structure
The rest of this thesis is structured as follows. Chapter 2 comprises the literature review, covering related work on antonym detection as well as work on identifying other semantic relations. In Chapter 3, we describe our novel technique of deriving antonym pairs from paraphrases in PPDB, and analyse and evaluate the derived pairs.
In Chapter 4, we discuss AntNET, our morphology-aware LSTM-based neural network model for antonym detection.

Degrees of Antonymy

Prior work has classified word pairs into three degrees of antonymy: strongly antonymous, semantically contrasting, and not antonymous. The higher the degree of antonymy between a target word pair, the higher its tendency to be considered an antonym pair by native speakers.
Automatically determining the degree of antonymy between words can be helpful in detecting and generating paraphrases, detecting contradictions, detecting humor (satire and jokes tend to have contradictions and oxymorons) and in finding words which are semantically contrasting to a target word (probably to filter them out).
Antonymy, synonymy, hyponymy, and similar lexical-semantic relations apply to two lexical units, each a combination of surface form and word sense. This line of work also explores the paradoxes of antonymy: why are some pairs better antonyms than others (e.g., large-small vs. large-little)? Are semantic closeness and antonymy opposites? If two words are associated via synonymy, hyponymy-hypernymy, or troponymy relations, they are considered semantically close or semantically related. Words that are semantically similar are also semantically related (e.g., plane-glider, doctor-surgeon) but not the other way round (e.g., plane-sky, surgeon-scalpel). Antonymous concepts are semantically related but not semantically similar. The co-occurrence hypothesis states that antonyms occur together in a sentence more often than chance (Charles and Miller, 1989), but this is also true for hypernyms, holonyms, meronyms, and near-synonyms, so separating antonyms from them has proven difficult. Strong co-occurrence is thus not a sufficient condition for detecting antonyms, but it is useful. The distributional hypothesis states that words that occur in similar contexts tend to be semantically close.
This work notes that manually created lexicons have limited coverage and do not include most semantically contrasting word pairs, and presents an automatic, empirical measure of antonymy that combines corpus statistics with the structure of a published thesaurus. The approach rests on the adjacency heuristic: adjacent categories in most published thesauri are considered to be contrasting categories.
Given a target word pair, the algorithm determines whether the words are antonymous and, if they are, whether they have a high, medium, or low degree of antonymy. If the target words belong to the same thesaurus paragraphs as any of the seed antonyms linking the two contrasting categories, then the words have a high degree of antonymy.
If the target words do not belong to the same thesaurus paragraphs as a seed antonym pair but occur in contrasting categories, they have a medium degree of antonymy if they have a higher tendency to co-occur, and a low degree of antonymy otherwise. When evaluated on a set of closest-opposite questions, this algorithm obtained a precision of over 80%.
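The decision procedure just described can be summarized in a short sketch. The thesaurus structures (category, contrasting, seed_paragraph) and the co-occurrence scores below are hypothetical toy stand-ins for the published thesaurus and corpus statistics used in the original work.

```python
# Sketch of the degree-of-antonymy decision procedure described above.
# category, contrasting, seed_paragraph and cooccur are hypothetical
# toy stand-ins for the thesaurus structure and corpus counts.

category = {"hot": "HEAT", "warm": "HEAT", "cold": "COLD", "cool": "COLD"}
contrasting = {frozenset({"HEAT", "COLD"})}           # adjacency heuristic output
seed_paragraph = {"HEAT": {"hot"}, "COLD": {"cold"}}  # paragraphs of seed antonyms
cooccur = {("warm", "cool"): 2.0, ("warm", "cold"): 0.3}

def degree_of_antonymy(w1, w2, threshold=1.0):
    c1, c2 = category[w1], category[w2]
    if frozenset({c1, c2}) not in contrasting:
        return "not antonymous"
    # High degree: both words share thesaurus paragraphs with a seed
    # antonym pair linking the two contrasting categories.
    if w1 in seed_paragraph[c1] and w2 in seed_paragraph[c2]:
        return "high"
    # Otherwise the degree depends on the tendency to co-occur.
    return "medium" if cooccur.get((w1, w2), 0.0) > threshold else "low"

print(degree_of_antonymy("hot", "cold"))   # high
print(degree_of_antonymy("warm", "cool"))  # medium
print(degree_of_antonymy("warm", "cold"))  # low
```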

Paraphrase Extraction Methods
Paraphrases are words or phrases expressing the same meaning. Paraphrase extraction methods that exploit distributional or translation similarity may, however, propose paraphrase pairs that are not meaning-equivalent but linked by other types of relations.
These methods often extract pairs having a related but not equivalent meaning, such as contradictory pairs. For instance, Lin and Pantel (2001) extracted 12 million "inference rules" from monolingual text by exploiting shared dependency contexts. Their method learns paraphrases that are truly meaning-equivalent, but it just as readily learns contradictory pairs such as (X rises, X falls). Similarly, the bilingual pivoting method of Ganitkevitch et al. (2013), used to build PPDB, can propose contrasting pairs as paraphrases.

Snow et al. (2006) proposed a novel algorithm for inducing semantic taxonomies. Previous algorithms for taxonomy induction focused on independent classifiers for discovering single relationships based on hand-constructed or automatically generated textual patterns, whereas their algorithm incorporates evidence from multiple classifiers over heterogeneous relationships to optimize the entire structure of the taxonomy. Though a wide variety of relationship-specific classifiers, like the pattern-based classifiers, have achieved some degree of success, they frequently lack the global knowledge necessary to integrate their predictions into a complex taxonomy with multiple relations.

Semantic Taxonomy Induction
The paper notes that previous algorithms focused only on inferring small taxonomies over a single relation, or used evidence for multiple relations independently of one another. Another major shortfall was the inability to handle lexical ambiguity: previous approaches sidestepped the issue of polysemy by assuming a single sense per word and inferring taxonomies over words rather than senses. Their approach simultaneously addresses the problems of jointly considering evidence about multiple relationships and of lexical ambiguity within a single probabilistic framework. Within their model, the goal of taxonomy induction is to find the taxonomy that maximizes the conditional probability of the observations given the relationships of the taxonomy.
They also extended their model to handle lexical ambiguity: if the objects in the taxonomy are word senses, the model allows a many-to-many (e.g., word-to-sense) mapping between the sets of objects. Their algorithm for inducing semantic taxonomies attempts to globally optimize the entire structure of the taxonomy, and the model's ability to integrate heterogeneous evidence from different classifiers offers a solution to the key problem of choosing the correct word sense to which to attach a new relation (hypernym, hyponym, antonym, etc.).

Natural Logic
The task of textual inference involves automatically determining whether a natural-language hypothesis can be inferred from a given premise. The NatLog system (MacCartney and Manning, 2007), which popularized natural logic for Recognizing Textual Entailment (RTE) tasks, presented the first computational model of natural logic, a system of logical inference operating directly over natural language. Most RTE systems achieve robustness by sacrificing semantic precision, while systems that rely on first-order logic and theorem proving are precise but excessively brittle. NatLog finds a low-cost edit sequence that transforms the premise into the hypothesis and learns to classify entailment relations across atomic edits; it uses natural language itself as the representation and performs inference with a structured algebra. However, important kinds of inference, such as temporal reasoning, causal reasoning, paraphrasing, and relation extraction, are not addressed by natural logic.

Pattern-based Methods
Pattern-based methods for inducing semantic relations between a pair of terms (x, y) consider the lexico-syntactic paths that connect the joint occurrences of x and y in a large corpus. A variety of approaches rely on such patterns to distinguish antonyms from other relations. Lin et al. (2003) used bilingual dependency triples and patterns to extract distributionally similar words, and then filtered out words that appeared with the patterns 'from X to Y' or 'either X or Y' significantly often; the intuition is that if two words X and Y appear in one of these patterns, they are unlikely to represent a synonymous pair. Roth and Schulte im Walde (2014) combined lexico-syntactic patterns with discourse markers as indicators for paradigmatic relations, including antonyms and synonyms. Nguyen et al. (2016), in addition to lexical and syntactic information, proposed the distance between the related words along the syntactic path as a new pattern feature.

RNNs for Relation Classification
Relation classification is a related task whose goal is to classify the relation expressed between two target terms in a given sentence into one of a set of predefined relation classes. For example, a sentence from the SemEval-2010 relation classification task dataset (Hendrickx et al., 2010) may express the relation Content-Container(e1, e2) between its two target entities.
The shortest dependency paths between the target entities were shown to be informative for this task (Fundel et al., 2007).

Integrated Pattern-based and Distributional Methods
In the past couple of years, deep learning models have been proposed for relation classification tasks. Shwartz et al. (2016) first proposed their model to improve hypernymy detection between term-pairs, and later extended it to classify multiple semantic relations (Shwartz and Dagan, 2016), including antonyms. They suggested an improved path-based algorithm, in which the dependency paths are encoded using a recurrent neural network, that achieves results comparable to distributional methods.
They then extended the approach to integrate both path-based and distributional signals into the network, resulting in improved performance on the semantic relation classification task. While their model is very good at identifying relations like meronyms and hypernyms (state-of-the-art for hypernym detection), it does not perform well in distinguishing between related and unrelated words, or between synonyms and antonyms. The morphology-aware neural network model that we propose handles these cases and better distinguishes antonyms from other semantic relations.

Nguyen et al. (2016) proposed two methods for distinguishing antonyms from synonyms. In the first method, the authors improved the quality of weighted feature vectors by strengthening those features that are most salient in the vectors and putting less emphasis on those of minor importance when distinguishing degrees of similarity between words. In the second method, the lexical contrast information was integrated into the skip-gram model (Mikolov et al., 2013) to learn word embeddings. This model successfully predicted degrees of similarity and identified antonyms and synonyms.

Modeling Multi-relational Data
Bordes et al. (2013) considered the problem of embedding entities and relationships of multi-relational data in low-dimensional vector spaces. They proposed TransE, a scalable, easy-to-train, canonical model with few parameters that models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities.
Multi-relational data refers to directed graphs whose nodes correspond to entities and edges of the form (head, label, tail) (denoted (h, l, t)), each of which indicates that there exists a relationship of name label between the entities head and tail. Their work focused on modeling multi-relational data from Knowledge Bases (KBs), with the goal of providing an efficient tool to complete them by automatically adding new facts, without requiring extra knowledge. In contrast to single-relational data where ad-hoc but simple modeling assumptions can be made after some descriptive analysis of the data, the difficulty of relational data is that the notion of locality may involve relationships and entities of different types at the same time, so modeling multi-relational data requires more generic approaches that can choose the appropriate patterns considering all heterogeneous relationships at the same time.
In TransE, relationships are represented as translations in the embedding space: if a relationship (h, l, t) holds, then the embedding of the tail entity t should be close to the embedding of the head entity h plus a vector that depends on the relationship l, i.e., h + l ≈ t. Their experiments demonstrate that this model, despite its simplicity and an architecture primarily designed for modeling hierarchies, ends up being powerful on most kinds of relationships, and can significantly outperform state-of-the-art methods in link prediction on real-world KBs.
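As a worked illustration of the translation idea, here is a minimal numpy sketch of the TransE energy function with toy random embeddings; the margin-based training procedure of the original paper is omitted, and the entity and relation names are hypothetical.

```python
import numpy as np

# Minimal sketch of the TransE scoring idea: a triple (h, l, t) is
# plausible when the tail embedding is close to head + relation,
# i.e. h + l ≈ t. Embeddings here are untrained toy vectors.

rng = np.random.default_rng(0)
dim = 50
entity = {e: rng.normal(size=dim) for e in ["paris", "france", "tokyo", "japan"]}
relation = {"capital_of": rng.normal(size=dim)}

def energy(h, l, t):
    """L2 distance ||h + l - t||; lower means more plausible."""
    return np.linalg.norm(entity[h] + relation[l] - entity[t])

# After training, energy("paris", "capital_of", "france") should be
# lower than energy("paris", "capital_of", "japan"); with random
# vectors the value below only illustrates the computation.
print(energy("paris", "capital_of", "france"))
```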

Paraphrase-based Antonym Derivation
In this chapter, we describe a novel automatic method of deriving antonym pairs from paraphrase pairs. Existing semantic resources like WordNet (Fellbaum, 1998) and EVALution (Santus et al., 2015) contain a much smaller set of antonyms compared to other semantic relations (e.g., synonyms, hypernyms, and meronyms). Our aim is to create a large resource of high-quality antonym pairs using paraphrases.

WordNet
WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
Adjectives are organized in terms of antonymy. WordNet encodes antonymy as a lexical relationship: a relation between two words rather than between concepts (Gross et al., 1989).
WordNet antonym pairs comprise "direct antonyms" (wet/dry, young/old) and "indirect antonyms".

PPDB

We have used the English PPDB, version 2.0 and XXXL size, for all the experiments described later.

Creation of a seed set of antonym pairs
We first created a seed set of antonyms generated using WordNet. As mentioned in Section 3.1.1, antonyms derived from WordNet are direct antonyms. We extended this list to include indirect antonym word pairs, derived by taking the cross product of the synonyms of each word in an antonym pair with its direct antonym.
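A minimal sketch of this seed-set construction, with toy dictionaries standing in for WordNet's direct antonyms and synonym sets:

```python
from itertools import product

# Sketch of the seed-set construction described above. The toy
# dictionaries stand in for WordNet's direct antonyms and synsets.

direct_antonyms = {("wet", "dry"), ("young", "old")}
synonyms = {"wet": {"damp", "moist"}, "dry": {"arid"},
            "young": {"youthful"}, "old": {"aged", "elderly"}}

seed = set(direct_antonyms)
for a, b in direct_antonyms:
    # Indirect antonyms: cross product of each word's synonyms with
    # its direct antonym (and with the antonym's synonyms).
    seed |= set(product(synonyms.get(a, set()) | {a},
                        synonyms.get(b, set()) | {b}))

print(sorted(seed))  # includes (damp, dry), (wet, arid), (damp, arid), ...
```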

Selection of Paraphrases
We consider all phrase pairs (p1, p2) from PPDB up to two words in length such that one of the two phrases either begins with a negating word like not or contains a negating prefix. We chose these two types of paraphrase pairs since we believe them to be the most indicative of an antonymy relationship between the target words.

Paraphrase Transformation
For paraphrases containing a negating prefix, we perform morphological analysis to identify and remove the negating prefixes. For a phrase pair like unhappy/sad, an antonymy relation is derived between the base form of the negated word, without the negation prefix, and its paraphrase (happy/sad). We use MORSEL (Lignos, 2010) to perform the morphological analysis and identify negation markers. For multi-word phrases beginning with a negating word, the negating word is simply dropped to obtain an antonym pair (e.g., different/not identical → different/identical). Equation 3.1 defines ant(w, (p1, p2)), the antonym of a target word w given a paraphrase pair (p1, p2) in the set of all selected paraphrase pairs P.
∀ (p1, p2) ∈ P ∧ w = base(p1): ant(w, (p1, p2)) = p2   (3.1)

Here p1 is either a lexical phrase with a negating prefix or a multi-word phrase beginning with not, and the target word w is the base, non-negated form of p1, whose antonym is simply p2, the paraphrase of p1. For a PPDB paraphrase pair (unhappy/sad) ∈ P, ant(happy) = sad; similarly, for the pair (not identical/different) ∈ P, ant(identical) = different. Given a paraphrase pair (p1, p2) ∈ P, we thus derive the antonym pair (p1, p2) or (w, p2) using Equation 3.1.
We also enrich the set of antonyms obtained with this technique by considering all synonyms (and lexical paraphrases) of p2, u ∈ S(p2), as antonyms of p1 (or w), and all synonyms of p1, v ∈ S(p1), as antonyms of p2. Equation 3.2 describes this procedure for a pair derived via Equation 3.1:

∀ u ∈ S(p2): ant(w) = u,  ∀ v ∈ S(p1): ant(p2) = v   (3.2)

In the above example, in addition to sad, we also retain its PPDB paraphrases and its WordNet synonyms as antonyms for happy.
To expand the antonym list, synonyms were obtained from WordNet and lexical paraphrases from PPDB. While expanding each phrase in a derived pair by its paraphrases, we filter out paraphrase pairs with a PPDB score (Pavlick et al., 2015a) of less than 2.5. For the example unhappy/sad, we first derive happy/sad as an antonym pair and expand it by considering all synonyms of happy as antonyms of sad (e.g., joyful/sad), and all synonyms of sad as antonyms of happy (e.g., happy/gloomy). Some examples of PPDB paraphrase pairs and the antonym pairs derived from them are shown in Table 3.3. Table 3.4 shows the number of pairs derived at each step using PPDB: in total, we were able to derive around 213K unique pairs, a much larger dataset than existing resources like WordNet and EVALution, as shown in Table 3.5. Figure 3.1 displays the number of antonym pairs derived from each method explained above.
Paraphrase Pair              Antonym Pair
not sufficient/insufficient  sufficient/insufficient
insignificant/negligible     significant/negligible
dishonest/lying              honest/lying
unusual/pretty strange       usual/pretty strange

Table 3.3: Examples of antonyms derived from PPDB paraphrases. The antonym pairs in column 2 were derived from the corresponding paraphrase pairs in column 1.
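The derivation and expansion steps of Equations 3.1 and 3.2 can be sketched as follows; the paraphrase pairs, expansion lists, PPDB scores, and prefix check are simplified toy stand-ins for PPDB 2.0, WordNet, and MORSEL.

```python
# Sketch of the antonym derivation of Equations 3.1 and 3.2, with the
# PPDB 2.0 score threshold applied during expansion. All data below is
# toy data standing in for PPDB and WordNet.

NEG_PREFIXES = ("un", "in", "dis")
VOCAB = {"happy", "identical", "sufficient"}

def base_form(phrase):
    """Return the non-negated form of a phrase, or None if not negated."""
    if phrase.startswith("not "):
        return phrase[len("not "):]
    for p in NEG_PREFIXES:
        if phrase.startswith(p) and phrase[len(p):] in VOCAB:
            return phrase[len(p):]
    return None

paraphrases = [("unhappy", "sad"), ("not identical", "different")]  # (p1, p2)
expansions = {"sad": [("gloomy", 3.0), ("blue", 2.0)],   # synonyms with scores
              "happy": [("joyful", 3.4)]}

antonyms = set()
for p1, p2 in paraphrases:
    w = base_form(p1)
    if w is None:
        continue
    antonyms.add((w, p2))                                                # Eq. 3.1
    antonyms |= {(w, u) for u, s in expansions.get(p2, []) if s >= 2.5}  # Eq. 3.2
    antonyms |= {(v, p2) for v, s in expansions.get(w, []) if s >= 2.5}

print(sorted(antonyms))
# [('happy', 'gloomy'), ('happy', 'sad'), ('identical', 'different'),
#  ('joyful', 'sad')]
```

Note that blue is dropped by the 2.5 score threshold, mirroring the filtering step described above.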

Analysis
We performed a manual evaluation of the quality of the extracted antonyms by randomly selecting 1000 pairs classified as 'antonym' and observed that the dataset contained about 63% antonyms. A combined list of randomly selected antonyms from all of the methods listed above had about 53% accuracy. We also evaluated the percentage of antonyms yielded by each method; Figure 3.2 illustrates these percentages. Errors mostly consisted of phrases and words that do not have an opposing meaning after the removal of the negation pattern. For example, the equivalent pair till/until was derived from the PPDB paraphrase rule not till/until.
Other non-antonyms derived from the above methods can be classified into unrelated pairs (background/figure), paraphrases or pairs that have an equivalent meaning (admissible/permissible), words that belong to a common category (Africa/Asia), pairs that have an entailment relation (valid/equally valid), and pairs that are related but not by an antonymy relationship (habitants/general public).

Hearst/Snow Patterns for Antonyms
In Section 2.5 we described pattern-based techniques for deriving lexical relations from unrestricted text, such as those of Snow et al. (2006) and Hearst (1992). The table below lists the antonym patterns we considered, together with example sentences.

Pattern             Example sentence
compared with X, Y  compared with the older generation, the new generation
to X or to Y        to fight or to surrender
X rather than Y     maximizing it rather than minimizing it
either X or Y       either low or high doses
neither X nor Y     neither women nor men

Annotation
Since the pairs derived from PPDB seemed to contain a variety of relations in addition to antonyms, we crowdsourced the task of labelling a subset of these pairs in order to obtain the true labels. We asked workers to choose between the labels antonym, synonym (or paraphrase for multi-word expressions), unrelated, other, entailment, and category. We showed each pair to 7 workers, taking the majority label as the truth.

AntNET: A Morphology-Aware Neural Network

In this chapter we describe AntNET, an LSTM-based, morphology-aware neural network model for antonymy detection. We first focus on improving the neural embeddings of the path representation (Section 4.1), and then integrate distributional signals into this network, resulting in a combined method (Section 4.2).

Path-based Network
Similarly to prior work, we represent each dependency path as a sequence of edges that leads from x to y in the dependency tree. We use the same path-based features proposed by Shwartz et al. (2016) for recognizing hypernymy relations: the lemma and part-of-speech (POS) tag of the source node, the dependency label, and the edge direction between two subsequent nodes. Additionally, we add a new feature that indicates whether the source node is negated.
Rather than treating an entire dependency path as a single feature, we encode the sequence of edges using a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997). The vectors obtained for the different paths of a given (x, y) pair are pooled, and the resulting vector is used for classification. The overall network structure is depicted in Figure 4.1, and Figure 4.2 illustrates the differences in the path-based architecture between LexNET and AntNET.

Replacing negated words by their base forms in the path is useful for several reasons. First, Neither happy nor sad is probably a more common phrase than Neither happy nor unhappy, so this technique helps our model identify an opposing relationship between both types of pairs, happy/unhappy and happy/sad. Second, a common practice for creating word embeddings for multi-word expressions (MWEs) is averaging over the embeddings of each word in the expression. This is not a good representation for phrases like not identical, since we lose the negating information carried by not. Indicating the presence of not using a negation feature and replacing the embedding of not identical by that of identical increases the classifier's probability of identifying not identical/different as paraphrases and identical/different as antonyms. Finally, this method helps us distinguish terms that are seemingly negated but are not in reality (e.g., invaluable).
The vector representation of each edge is the concatenation of its feature vectors:

v_edge = [v_lemma ; v_pos ; v_dep ; v_dir ; v_neg]

where v_lemma, v_pos, v_dep, v_dir, and v_neg represent the vector embeddings of the lemma, POS tag, dependency label, dependency direction, and negation marker, respectively.

Path Representation
The representation for a path p composed of a sequence of edges edge_1, edge_2, ..., edge_k is the final hidden state of the LSTM applied to the corresponding sequence of edge vectors:

o_p = LSTM(v_edge_1, v_edge_2, ..., v_edge_k)
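A minimal PyTorch sketch of the edge encoding and path LSTM described above, with toy vocabulary sizes and embedding dimensions (the actual model initializes the lemma embeddings with pre-trained GloVe vectors):

```python
import torch
import torch.nn as nn

# Sketch of the edge encoding and path LSTM described above. Vocabulary
# sizes and embedding dimensions are toy values.

N_LEMMA, N_POS, N_DEP, N_DIR, N_NEG = 1000, 40, 50, 4, 2
D_LEMMA, D_POS, D_DEP, D_DIR, D_NEG = 50, 4, 5, 1, 1
EDGE_DIM = D_LEMMA + D_POS + D_DEP + D_DIR + D_NEG
HIDDEN = 60

emb = nn.ModuleDict({
    "lemma": nn.Embedding(N_LEMMA, D_LEMMA),
    "pos":   nn.Embedding(N_POS, D_POS),
    "dep":   nn.Embedding(N_DEP, D_DEP),
    "dir":   nn.Embedding(N_DIR, D_DIR),
    "neg":   nn.Embedding(N_NEG, D_NEG),  # the new negation-marker feature
})
lstm = nn.LSTM(EDGE_DIM, HIDDEN, batch_first=True)

def encode_path(edges):
    """edges: list of (lemma, pos, dep, dir, neg) index tuples for one path."""
    vecs = [torch.cat([emb["lemma"](torch.tensor(l)),
                       emb["pos"](torch.tensor(p)),
                       emb["dep"](torch.tensor(d)),
                       emb["dir"](torch.tensor(r)),
                       emb["neg"](torch.tensor(n))])
            for l, p, d, r, n in edges]
    seq = torch.stack(vecs).unsqueeze(0)   # (1, path_length, EDGE_DIM)
    _, (h_n, _) = lstm(seq)
    return h_n.squeeze(0).squeeze(0)       # final hidden state o_p

o_p = encode_path([(3, 1, 2, 0, 1), (7, 2, 4, 1, 0)])
print(o_p.shape)  # torch.Size([60])
```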

Classification Task
Given a lexical or phrasal pair (x, y), we induce patterns from a corpus, where each pattern represents a lexico-syntactic path connecting x and y. The vector representation for each term pair (x, y) is computed as the weighted average of its path vectors, by applying average pooling as follows:

v_p(x,y) = Σ_{p ∈ P(x,y)} f_p · o_p / Σ_{p ∈ P(x,y)} f_p   (4.1)

where v_p(x,y) refers to the vector of the pair (x, y), P(x, y) is the multi-set of paths connecting x and y in the corpus, and f_p is the frequency of p in P(x, y). The vector v_p(x,y) is then fed into a neural network that outputs the class distribution c over relation types, and the pair is assigned to the relation with the highest score:

c = MLP(v_p(x,y)),  r = argmax_i c_i   (4.2)

MLP stands for Multi-Layer Perceptron and can be computed with or without a hidden layer (Equations 4.3 and 4.4, respectively):

MLP(v) = softmax(W2 · h + b2), where h = tanh(W1 · v + b1)   (4.3)
MLP(v) = softmax(W · v + b)   (4.4)
W refers to a matrix of weights that projects information between two layers; b is a layer-specific vector of bias terms; and h is the hidden layer.
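Continuing the sketch from the previous section, the pooling of Equation 4.1 and the hidden-layer-free classifier of Equation 4.4 might look as follows; sizes are toy values and the path vectors would come from the encoder sketched above.

```python
import torch
import torch.nn as nn

# Sketch of the average pooling of Equation 4.1 and the classifier of
# Equation 4.4 (no hidden layer). Sizes are toy values.

HIDDEN, N_CLASSES = 60, 3
classifier = nn.Linear(HIDDEN, N_CLASSES)

def classify_pair(path_vecs, freqs):
    """path_vecs: (n_paths, HIDDEN); freqs: (n_paths,) path frequencies f_p."""
    weights = freqs / freqs.sum()
    v_pair = (weights.unsqueeze(1) * path_vecs).sum(dim=0)  # Eq. 4.1
    c = torch.softmax(classifier(v_pair), dim=0)            # Eq. 4.4
    return torch.argmax(c).item(), c                        # r and distribution

path_vecs = torch.randn(4, HIDDEN)      # vectors for 4 paths connecting (x, y)
freqs = torch.tensor([5.0, 2.0, 1.0, 1.0])
r, c = classify_pair(path_vecs, freqs)
print(r, c.shape)  # predicted relation index, torch.Size([3])
```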

Combined Path-based and Distributional Network
The combined network additionally concatenates the distributional word embeddings of x and y with the pooled path vector, feeding [v_x ; v_p(x,y) ; v_y] into the classifier, so that the distributional signal complements the path-based one.

Chapter 5

Experiments
For identifying antonymy, we experiment with the path-based and combined models of AntNET.

Binary Classification
We first experimented with binary classification, with two labels: True (for antonym pairs) and False (for non-antonym pairs). The dataset was split into 70% train, 25% test, and 5% validation sets. Hyper-parameters were tuned on the validation set to choose the best dropout rate, learning rate, GloVe embedding dimensions, and number of hidden layers.

Multiclass Classification
Six Classes

The first experiments involved six labels: Antonym, Category, Paraphrase, Unrelated, Entailment, and Other. The dataset was split into 70% train, 25% test, and 5% validation sets, and hyper-parameters were tuned on the validation set as in the binary setting.

Three Classes
Given the skewed distribution of labels in the dataset, we expected that combining some of the classes would help the model perform better. Category, Paraphrase, Entailment, and Other were merged into a single class, Other, leaving three final classes: Antonym, Unrelated, and Other. The dataset was split into 70% train, 25% test, and 5% validation sets, and hyper-parameters were tuned on the validation set as in the binary setting.

Dataset
Neural networks require a large amount of training data. We use the labelled portion of the dataset that we created using PPDB (Chapter 3). In order to induce paths for the pairs in the dataset, we identify sentences in the corpus that contain each pair and extract all patterns for the given pair. Pairs with an antonym relationship are considered positive instances in both classification experiments. In the binary classification experiment, we consider all pairs related by other relations (entailment, other, synonymy, category, unrelated) as negative instances. We also perform a variant of multiclass classification with three classes (antonym, other, unrelated): due to the skewed nature of the dataset, we combined the category, entailment, synonym/paraphrase, and other-related pairs into a single class. Table 5.1 displays the number of relations in this dataset. Wikipedia was used as the underlying corpus for all methods, and we perform model selection on the validation set to tune the hyper-parameters of each method, applying grid search over a range of values and picking those that yield the highest F1 score on the validation set.
The best hyper-parameters are reported in the appendix.
In order to show how our model performs on the notoriously difficult task of distinguishing antonyms from synonyms, we also use a large-scale antonym and synonym dataset; the results of this experiment are discussed in Chapter 7.

Corpus
The English Wikipedia dump from May 2015 was used as the corpus to train our integrated neural network model; it is used to extract the connecting dependency paths between target words. Paths were computed between the most frequent unigrams, bigrams, and trigrams in Wikipedia. The vocabulary for the model consisted of the PPDB words contained in the 400K most common words in Wikipedia (based on the GloVe vocabulary), together with the 100K most common bigrams and trigrams in Wikipedia.

Word Embeddings
GloVe Embeddings

GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. We use pre-trained GloVe embeddings of different dimensions to train our model.
dLCE Embeddings

Nguyen et al. (2016) proposed a novel extension of the skip-gram model with negative sampling (Mikolov et al., 2013) that integrates lexical contrast information into the objective function. The proposed model optimizes the semantic vectors to predict degrees of word similarity and to distinguish antonyms from synonyms. The improved word embeddings outperform state-of-the-art embeddings on antonym-synonym distinction and a word similarity task.

Majority Baseline
The majority baseline is obtained by labelling all instances with the most frequent class occurring in the dataset, i.e., FALSE (binary) or UNRELATED (multiclass).

Distributional Baseline
The SP method proposed by Schwartz et al. (2015) uses symmetric patterns for generating word embeddings. The authors automatically acquire symmetric patterns (defined as sequences of 3-5 tokens consisting of exactly 2 wildcards and 1-3 words) from a large plain-text corpus, and generate vectors in which each coordinate represents the co-occurrence, in symmetric patterns, of the represented word with another word of the vocabulary. For antonym representation, the authors relied on the patterns suggested by Lin et al. (2003) to construct word embeddings containing an antonym parameter that can be turned on to represent antonyms as dissimilar, and turned off to represent antonyms as similar. To evaluate the SP method on our data, we used the pre-trained SP embeddings with 500 dimensions and an SVM classifier with an RBF kernel for the classification of word pairs.
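As an illustration of this baseline's classification step, here is a minimal sketch using scikit-learn, with random vectors standing in for the pre-trained SP embeddings; the word pairs and labels are toy data.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of the SP baseline classification step: each word pair is
# represented by concatenating the (here random, hypothetical) SP
# embeddings of its two words, then classified with an RBF-kernel SVM.

rng = np.random.default_rng(0)
dim = 500                              # the pre-trained SP vectors are 500-d
emb = {w: rng.normal(size=dim) for w in ["hot", "cold", "big", "large"]}

pairs = [("hot", "cold"), ("big", "large")]
labels = [1, 0]                        # 1 = antonym, 0 = not antonym

X = np.array([np.concatenate([emb[a], emb[b]]) for a, b in pairs])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X))
```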

Path-based and Combined Baseline
Since AntNET is an extension of the path-based and combined models proposed by Shwartz and Dagan (2016) for classifying multiple semantic relations, we use their models as additional baselines. Because their models were evaluated on a different dataset that contained very few antonym instances, we replicated the baseline (SD) with the dataset and corpus described in Section 5.2.1 rather than comparing to the reported results.

Chapter 6

Results and Analysis

Preliminary Results

Effect of the dataset
Since LexNET was evaluated on a dataset that contained very few antonyms, our preliminary experiments involved running LexNET on our dataset and making improvements to work towards better performance for the classification of antonyms. The first set of experiments was conducted on a small subset of the dataset (722 pairs) that was manually labelled. In order to increase the size of the dataset, we crowdsourced the labelling task on CrowdFlower. This increased the size of the dataset to 5885 pairs, but the labels were skewed, with an uneven distribution of classes. To fix this, we added antonyms from the EVALution dataset to the dataset generated using PPDB. The next set of experiments included preprocessing the data to handle punctuation within words, analyzing false positives and false negatives, correcting incorrectly labelled pairs in the dataset, and handling multi-word pairs. From the results of these experiments, we can see that, in general, the models without a hidden layer performed better than those with a hidden layer across all experiments.
It is possible that the contributions of the hidden layer and the path-based source over the distributional signal are redundant.

Effect of the number of dimensions of the word embeddings
In order to evaluate the effect of the dimensionality of the word embeddings and to choose the best setting, we re-ran the first experiment, with 722 self-labelled pairs (displayed in Tables 6.1 and 6.2), using GloVe embeddings of 100 and 200 dimensions.

Effect of the Negation-marking Feature
Based on these preliminary results, for the further experiments leading to our final models (AntNET-path and AntNET-combined), we use the dataset containing 7318 pairs for training, an MLP with no hidden layer, and GloVe embeddings of 50 dimensions. We also reduce the classification settings to binary and multiclass with 3 classes.
In our final model (AntNET), the novel negation-marking feature is integrated along the syntactic path used to represent the paths between x and y. In order to evaluate the effect of this feature for antonym detection, we compare it to variations of the AntNET model with slightly different features.

LexNET
Since AntNET is an extension of LexNET, we compare our novel negation feature with the path features of LexNET, which include the POS tag, lemma, dependency label, and edge direction.

Negation Feature
In order to allow LexNET to better identify antonyms and better distinguish them from other semantic relations, we implemented AntNET-neg that adds a new morphological path-based feature to the existing features in LexNET. This new negation feature is used for marking whether the term pairs are negated.

Replacement of Word Embeddings
AntNET-morph (or simply AntNET) improves on AntNET-neg: it not only records whether either term in the pair is negated but additionally replaces the word (lemma) embeddings in the path by the word embedding of the base, non-negated form.

Distance Feature
Nguyen et al. (2016) have previously shown that replacing the direction feature in HypeNET by a distance feature improves performance on the task of distinguishing between antonyms and synonyms. In their approach, they integrate the distance between the related words in a lexico-syntactic path as a new pattern feature, along with the lemma, POS, and dependency labels. We re-implemented this model, named AntNET-distance, using the same dataset and patterns as in Section 5.2.1 and replacing the direction feature in LexNET by the distance feature. The results, shown in Table 6.4, indicate that the negation-marking feature and the method of replacing the embeddings of negated words by their base forms enhance the performance of our proposed models more effectively than the distance feature does, across both binary and multiclass classification. Although the distance feature has previously been shown to perform well for distinguishing antonyms from synonyms, it is not very effective in the multiclass setting. Figure 6.1 compares the performance of the negation-marking feature with the other features described above.

Figure 6.1: Illustration of the effect of the novel negation-marking feature.

Effect of Word Embeddings
Our methods rely on GloVe word embeddings, state-of-the-art embeddings for relation detection. In order to evaluate the effect of these word embeddings on the performance of our models, we replace them by the pre-trained dLCE embeddings with 100 dimensions, and compare the effect of each on the performance of AntNET; Nguyen et al. (2016) showed that the dLCE embeddings outperform state-of-the-art word embeddings for antonym-synonym distinction.

Comparing the path-based methods, the AntNET model achieves higher precision than the path-based SD baseline for binary classification, and outperforms the SD model in precision, recall, and F1 in the multiclass classification experiment.
The low precision of the SD model stems from its inability to distinguish between antonyms and synonyms, and between related and unrelated pairs, which are common in our dataset, causing many false positive pairs such as difficult/harsh, bad/cunning, and finish/far to be classified as antonyms.
Comparing the combined models, the AntNET model outperforms the SD model in precision, recall, and F1, achieving state-of-the-art results for antonym detection. In all the experiments, the performance of the model in the binary classification task was better than in multiclass classification. Multiclass classification seems to be inherently harder for all methods, due to the larger number of relations and the smaller number of instances for each relation. We also observed that as we increased the size of the training dataset used in our experiments, the results improved for both path-based and combined models, confirming the need for large-scale datasets to train neural models. Figure ?? compares the performance of AntNET with the other baselines.

False Negatives
We sampled about 20% of the misclassified pairs from both the binary and multiclass experiments and analyzed the major types of errors. Most of these pairs had only a few co-occurrences in the corpus, often due to infrequent terms (e.g., cisc/risc, which denote computer architectures). While our model effectively handled negative prefixes, it failed to handle negative suffixes, causing incorrect classification of pairs like spiritless/spirited. A possible direction for future work is to extend the model to handle negative suffixes as well.

Antonym-Synonym Distinction

Table 7.2: Performance of the AntNET models compared to the baseline models for antonym-synonym distinction.

Our models achieve improvements of 0.34 and 0.04 in F1, respectively, over the baselines. Regarding nouns, we do not outperform the more advanced RS baseline, but in comparison to the SP baseline our model still shows a clear F1 improvement of 0.33. Distinguishing between antonyms and synonyms is challenging because they often occur in similar contexts, but with the help of the negation-marking feature we were able to effectively distinguish between these pairs. It is also possible that antonymous word pairs co-occur within a sentence more often than synonymous word pairs.

Chapter 8

Conclusion and Future Work
In this thesis, we presented an original technique for deriving antonyms using paraphrases from PPDB. We also presented a novel morphology-aware neural network model, AntNET, which improves antonymy prediction for path-based and combined models.
In addition to lexical and syntactic information, we suggested a novel morphological negation-marking feature.
Our proposed models outperform the baselines in two relation classification tasks.
We also demonstrated that the negation-marking feature outperforms previously suggested path-based features for this task. Since our proposed techniques for antonymy detection are corpus-based, they can be applied to different languages and relations.
For future work, we plan to annotate the rest of the dataset derived from PPDB by crowdsourcing the labelling task. We also plan to filter the derived pairs to keep only those in which both members share the same part-of-speech tag. Another filtering technique could be to test different PPDB 2.0 score thresholds and keep only those pairs that score above the best threshold.

To compute the evaluation metrics, we used scikit-learn with the "averaged" setup, which computes the metrics for each relation and reports their average, weighted by support (the number of true instances for each relation). Note that this can result in an F1 score that is not the harmonic mean of precision and recall.
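For concreteness, the following snippet shows the weighted-average setup with scikit-learn on toy labels; it illustrates that the reported F1 need not be the harmonic mean of the reported precision and recall.

```python
from sklearn.metrics import precision_recall_fscore_support

# Illustration of the "averaged" setup: per-relation metrics are
# weighted by support, so the weighted F1 need not be the harmonic
# mean of the weighted precision and recall. Toy labels only.

y_true = ["antonym", "antonym", "other", "unrelated", "unrelated", "unrelated"]
y_pred = ["antonym", "other",   "other", "unrelated", "antonym",   "unrelated"]

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
# P=0.750 R=0.667 F1=0.678, while the harmonic mean of P and R is 0.706
```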

Appendix
During preprocessing we removed punctuation. Since our dataset also contains short phrases, we removed stop words occurring at the beginning of a phrase (e.g., a man → man) and removed plurals. The best hyperparameters for all models mentioned in this thesis are shown in the tables that follow.