SUNNYNLP at SemEval-2018 Task 10: A Support-Vector-Machine-Based Method for Detecting Semantic Difference using Taxonomy and Word Embedding Features

We present SUNNYNLP, our system for solving SemEval-2018 Task 10: “Capturing Discriminative Attributes”. Our Support-Vector-Machine (SVM)-based system combines features extracted from pre-trained embeddings with statistical information from an Is-A taxonomy to detect the semantic difference of concept pairs. Our system proves effective in detecting semantic difference and is ranked 1st in the competition in terms of F1 measure. Our code is open-sourced under the name SUNNYNLP.


Introduction
Measuring semantic similarity between words has been a fundamental issue in Natural Language Processing (NLP). Semantic similarity measurements are used to improve downstream applications including paraphrase detection (Xu et al., 2014), question answering (Lin, 2007), taxonomy enrichment (Jurgens and Pilehvar, 2016) and dialogue state tracking (Mrksic et al., 2016).
Despite the current success in using semantic models to measure semantic similarity, less attention has been paid to teaching machines to make reference (Searle, 1969; Abbott, 2010) to the real world when detecting semantic difference. The semantic difference detection problem can be formalized as a binary classification task: given a triplet (concept1, concept2, attribute) which comprises two concepts (e.g. apple, banana) and one attribute (e.g. red), determine whether the attribute characterizes the former concept but not the latter. Compared to pairwise semantic similarity detection, this problem is more complex because of its underlying asymmetric property and the extra attribute involved. SemEval-2018 Task 10 (Krebs et al., 2018) was therefore posed to attract attention to solving this problem. Our code is available at https://github.com/Yermouth/sunnynlp.
Although the task of semantic difference detection is novel, similar tasks such as referring expression generation (REG) have been studied in the literature. Resources such as ontologies, knowledge bases (Krahmer and Van Deemter, 2012) and images (Kazemzadeh et al., 2014; Lazaridou et al., 2016) are used to learn referring expressions. The major difference between the present task and REG is that REG systems can choose salient attributes to make a successful reference to an object, while our system is required to decide whether a given attribute can be used to differentiate two similar objects.
The rest of the paper is organized as follows: Section 2 explains our motivation and approach. Section 3 describes the official and external data used. Section 4 details our system implementation. We analyze and discuss the result in Section 5 and conclude our work in Section 6.

General Approach
Our approach to this problem is to divide the ternary concept-concept-attribute relationship (concept1, concept2, attribute) into two concept-attribute relationships (concept, attribute). The ternary relationship holds only when the first concept-attribute relation is true and the second is false. This approach allows us to use well-developed pairwise similarity measurements to extract semantic information from the two concept-attribute pairs, and to aggregate the features to train a support vector machine (Cortes and Vapnik, 1995) to detect the semantic difference of the triplet. Identifying whether a concept has a specific attribute is therefore the key task of our system.
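The decomposition above can be illustrated with a minimal sketch. The function names and the toy feature functions below are illustrative stand-ins, not the authors' actual code: the point is only that each (concept, attribute) pair yields a feature vector, and the two vectors are concatenated into one training example for the SVM.

```python
def pair_features(concept, attribute, feature_fns):
    """Apply each pairwise feature function to one (concept, attribute) pair."""
    return [fn(concept, attribute) for fn in feature_fns]

def triplet_features(concept1, concept2, attribute, feature_fns):
    """Concatenate the features of both concept-attribute pairs."""
    return (pair_features(concept1, attribute, feature_fns)
            + pair_features(concept2, attribute, feature_fns))

# Toy feature functions standing in for the embedding/taxonomy features:
shared_letters = lambda c, a: len(set(c) & set(a))
length_diff = lambda c, a: abs(len(c) - len(a))

features = triplet_features("lemon", "cranberry", "yellow",
                            [shared_letters, length_diff])
print(features)  # [3, 1, 2, 3]: two features per concept-attribute pair
```

In the real system the per-pair features are the taxonomy statistics and embedding similarities described in later sections, but the aggregation pattern is the same.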

Useful concept-attribute (Has-A) pair: (lemon, yellow)
Semantic difference triplet in official test cases: (lemon, cranberry, yellow)

By observation, we draw similarities between the concept-attribute relationship and meronymy (Has-A). They are similar in the sense that both describe subtype relationships. Although linguistic resources constructed by human subjects, including norms and priming effect data, could help us detect and verify these relationships effectively, they are not allowed to be used in this SemEval Task.
This SemEval Task also limits the scope of concepts and attributes to concrete concepts and visual attributes only. As instances of the same concept are intuitively likely to share common attributes [3], we experiment with extracting meronymy (Has-A) information from hypernymy (Is-A) pairs. Taxonomies and ontologies, which contain rich Is-A information in the form of concept-instance pairs, are therefore the key external linguistic resources we rely on to extract concept-attribute relationships.
Another intuition that guides our research direction is that modifiers such as adjectives, adverbs and noun modifiers are useful for capturing the salient attributes of a specific class of objects [4]. As modifiers are used to describe the scope of concepts or specify the context of instances, we can leverage the co-occurrence probability of modifiers to analyze their dependence/independence relationships with different concepts, and hence determine whether a concept-attribute relationship holds.

[3] Both apple and banana are hyponyms of fruit, i.e. apple Is-A fruit and banana Is-A fruit. If we know apple is "edible", then banana intuitively has a higher chance of being "edible", because "edible" can be a common attribute of most fruits.

[4] When we want to differentiate one object from another, we usually use a salient and outstanding attribute to describe the object instead of a common or shared attribute. A similar viewpoint was previously raised in (Pechmann, 1989; Dale and Haddock, 1991), which state that human beings prefer efficient and sufficiently distinguishing descriptions when constructing referring expressions.
As the SemEval Task limits the word length of both concept and attribute to 1, we can enumerate all possible pairs of modifiers and concepts from a large-scale taxonomy and ontology and use them as features to train our system. Table 1 shows an entry which we find instructive for learning the semantic difference relationship. For instance, verifying whether the semantic difference relationship holds for the triplet (lemon, cranberry, yellow) requires knowing whether lemon has the attribute yellow and whether cranberry lacks it. With the Is-A pair (yellow food, lemon) from Probase, we can extract possible concept-attribute pairs and their frequencies to train our system, so that it learns with high probability that lemon has the attribute yellow while cranberry does not.
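The extraction step described above can be sketched as follows. This is a simplified illustration, not the authors' code: the entry format mirrors Probase's (super-concept, sub-concept, co-occurrence count) triplets, but the rows and counts are made up, and the rule of treating the first word of a two-word super-concept as a modifier is an assumption.

```python
# (super-concept, sub-concept, co-occurrence count), as in Probase entries.
is_a_entries = [
    ("yellow food", "lemon", 120),
    ("yellow food", "banana", 95),
    ("red fruit", "cranberry", 60),
]

def extract_has_a(entries):
    """Split an 'N to 1' super-concept into (modifier, head) and pair the
    modifier with the instance as a candidate Has-A attribute."""
    has_a = {}
    for super_concept, instance, count in entries:
        words = super_concept.split()
        if len(words) == 2:  # e.g. "yellow food" -> modifier "yellow"
            key = (instance, words[0])
            has_a[key] = has_a.get(key, 0) + count
    return has_a

pairs = extract_has_a(is_a_entries)
print(pairs.get(("lemon", "yellow"), 0))      # 120: evidence that lemon Has-A yellow
print(pairs.get(("cranberry", "yellow"), 0))  # 0: no evidence for cranberry
```

The resulting counts feed the statistical features described in Section 4.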

Data
We use the official dataset together with two external linguistic resources, GloVe (Pennington et al., 2014) and Probase (Wu et al., 2012;Cheng et al., 2015), to train our system.

Official Dataset
The official datasets are split into three parts -- training, validation and testing -- where the testing partition holds an attribute set disjoint from those of training and validation. This further increases the difficulty of the task, as it prevents lexical memorization (Roller et al., 2014; Levy et al., 2015; Weeds et al., 2014) and tests for generalization.

Probase
Probase is a web-scale open-domain taxonomy which uses Hearst patterns (Hearst, 1992) to extract Is-A relationships from web documents. Each Is-A entry in Probase is represented as a triplet: super-concept, sub-concept and number of co-occurrences. We choose Probase for two main reasons:

1. Large number of concepts covered: The number of concepts covered in Probase (Wu et al., 2012) exceeds that of other publicly available taxonomies.

2. Rich in semantic features: Probase provides Is-A relationship pairs with concepts of different senses and abstraction levels, which allows our system to extract rich statistical information for training. For instance, the Is-A pairs in Table 2 are extracted from Probase.

GloVe
Pre-trained embeddings such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2016) encode syntactic and semantic relationships of words in a low-dimensional space, which is crucial for capturing semantic difference. We use GloVe embeddings pre-trained on both the Gigaword corpus and the 2014 Wikipedia dump in our final submission system.

System Description
Our system pipeline (Figure 1) comprises data preprocessing, feature extraction and classifier selection.

Data Preprocessing
The following preprocessing is applied to Probase:

• Lemmatization: As Probase is crawled using a rule-based system, we lemmatize the data using Stanford CoreNLP to collapse words of different forms and allow better matching between Is-A entries in the taxonomy and the official dataset.
• Data partitioning: To give our system additional information regarding adjectives, adverbs and modifiers of both concepts and instances, we partition Probase into 4 sub-datasets according to the word length of the concept and instance pair. For instance, partition "1 to 1" indicates that both concept and instance are of word length 1, while partition "N to 1" indicates concepts of word length greater than 1 and instances of word length 1. Examples of each partition are given in Table 2.
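The partitioning rule can be sketched in a few lines. This is a minimal illustration with made-up entries, assuming the partition name is determined solely by whether each side of the pair is one word or more:

```python
def partition_name(concept, instance):
    """Name the partition by the word length of each side: '1' or 'N'."""
    c = "1" if len(concept.split()) == 1 else "N"
    i = "1" if len(instance.split()) == 1 else "N"
    return f"{c} to {i}"

# Illustrative (concept, instance) pairs, not real Probase rows.
entries = [
    ("fruit", "lemon"),         # 1 to 1
    ("yellow food", "lemon"),   # N to 1
    ("fruit", "yellow lemon"),  # 1 to N
]

partitions = {}
for concept, instance in entries:
    partitions.setdefault(partition_name(concept, instance), []).append((concept, instance))

print(sorted(partitions))  # ['1 to 1', '1 to N', 'N to 1']
```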

Statistical Features
We consider statistical features of the individual words, i.e. concept1, concept2 and attribute, and of the two concept-attribute pairs, i.e. (concept1, attribute) and (concept2, attribute), using individual or co-occurrence frequencies in Probase. Word frequency is extracted for individual words, and the following features are extracted from the Is-A pairs:

• Co-occurrence frequency
• Pointwise Mutual Information (PMI) (Fano, 1961; Church and Hanks, 1990)
• Asymmetric Pointwise Mutual Information (APMI)

There are three types of pairwise word co-occurrence frequencies: Concept-Concept, Instance-Instance and Concept-Instance. All types of frequencies are calculated for all partitions as distinct features. Table 3 gives an example of how occurrence and co-occurrence are counted. We apply a logarithm to the statistical features to reduce the scale of frequently occurring words.

Table 4: Results (F1-score) obtained by our system. The underlined value represents the score of our official submission. Best scores for each partition are denoted in boldface.
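The PMI feature can be sketched from raw counts as below. The toy counts are illustrative, and the asymmetric variant shown (normalizing by only one marginal, i.e. log p(y | x)) is one common formulation offered as an assumption, not necessarily the exact APMI used in the system.

```python
import math

def pmi(cooc, count_x, count_y, total):
    """PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )."""
    p_xy = cooc / total
    return math.log(p_xy / ((count_x / total) * (count_y / total)))

def apmi(cooc, count_x, total):
    """Asymmetric variant: log( p(x, y) / p(x) ), i.e. log p(y | x)."""
    return math.log((cooc / total) / (count_x / total))

# Toy counts: x and y co-occur 50 times in a corpus of 1000 pair observations.
total = 1000
print(round(pmi(50, 100, 200, total), 3))   # 0.916 = log(2.5)
print(round(apmi(50, 100, total), 3))       # -0.693 = log(0.5)
```

Note that unlike PMI, APMI is direction-sensitive, which matches the asymmetric nature of the task.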

Word Embedding Features
We use the Python package Gensim (Rehurek and Sojka, 2010) to match each word in the official-dataset triplet (concept1, concept2, attribute) with its corresponding pre-trained vector v_con1, v_con2, v_attr, each of 300 dimensions. We then divide the triplet into three pairwise relationships, i.e. (v_con1, v_con2), (v_con1, v_attr) and (v_con2, v_attr), and calculate the cosine similarity and the L1-norm of the vector difference of these pairs as features. The dot product was considered initially but removed, as it adversely affected the performance of our system.
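The pairwise embedding features can be sketched as follows; random vectors stand in for the 300-dimensional GloVe vectors that would be looked up via Gensim in the real system.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the GloVe vectors of concept1, concept2 and attribute.
v_con1, v_con2, v_attr = rng.standard_normal((3, 300))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def l1_diff(u, v):
    """L1-norm of the vector difference."""
    return float(np.abs(u - v).sum())

pairs = [(v_con1, v_con2), (v_con1, v_attr), (v_con2, v_attr)]
features = [f(u, v) for u, v in pairs for f in (cosine, l1_diff)]
print(len(features))  # 6 embedding features per triplet
```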

Classifiers
Using the same set of word embedding and statistical features, we compared the performance of four off-the-shelf classifiers: SVM (Cortes and Vapnik, 1995), Logistic Regression, Decision Tree and Random Forest. An SVM classifier with an RBF kernel (Vert et al., 2004) is used in our system, as it outperforms the other classifiers in terms of precision and F1-score.
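A comparison of this kind can be set up with scikit-learn as below. The synthetic data stands in for the real feature vectors, so the scores are purely illustrative; the classifier choices mirror the four listed above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic feature matrix with a simple learnable signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

classifiers = {
    "SVM (RBF)": SVC(kernel="rbf"),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Mean F1 over 5-fold cross-validation for each classifier.
scores = {name: cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
          for name, clf in classifiers.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```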

Results
We provide the results of our system with different combinations of features and datasets in Table 4. Column Train+Valid/Test shows the F1-score obtained by training our system on both the training and validation partitions, while columns Train/Test and Valid/Test show the F1-scores obtained by training on the training partition and the validation partition individually. Training our SVM system with Probase and GloVe (or FastText) gives the best result in terms of F1-score for the official evaluation (column Valid/Test). Our system achieves an F1-score of 0.754 and outperforms those of the other teams.

Discussion
During the competition phase, we noticed that our system performed better when we did not use the training partition together with the validation partition. As the entries in the training partition are automatically generated, they may contain false entries or noise which adversely affect our system. Since the validation partition comprises manually curated examples, we evaluated our models using 5-fold cross-validation on the clean validation partition only (column Valid(cv=5)).

Conclusion
In this paper, we have discussed how our simple yet effective SVM system leverages hypernymy (Is-A) relationships and word embeddings to detect the single-word semantic difference relationship. SVM has been shown to be useful in semantic relationship detection tasks (Filice et al., 2016; Panchenko et al., 2016). We would like to extend our system to detect multi-word semantic difference relationships, and to broaden the scope of concepts and attributes from visual-only to sound and taxonomic attributes.
As our system separates a concept-concept-instance relationship into two concept-instance relationships, it is relatively weak at capturing attributes that are comparative or fuzzy, for instance, young and tall. It would be interesting to explore how the semantic difference relationship can be embedded into taxonomies, ontologies and vector representations, so that comparative attributes can be comprehensively and directly captured.