ECNU at SemEval-2018 Task 10: Evaluating Simple but Effective Features on Machine Learning Methods for Semantic Difference Detection

This paper describes the system we submitted to Task 10 (Capturing Discriminative Attributes) in SemEval 2018. Given a triple (word1, word2, attribute), the task is to predict whether the triple exemplifies a semantic difference. We design and investigate several word embedding features, PMI features and WordNet features, together with supervised machine learning methods, to address this task. Officially released results show that our system ranks above average.


Introduction
The Capturing Discriminative Attributes task (Paperno et al., 2018) in SemEval 2018 is to provide a standard testbed for semantic difference detection, which will benefit many other applications in Natural Language Processing (NLP), such as automatized lexicography and machine translation (Krebs and Paperno, 2016). The goal of this task is to predict whether a word is a discriminative attribute between two concepts. Specifically, given two concepts and an attribute, the task is to predict whether the first concept has this attribute but the second concept does not. For example, given the concepts apple and pineapple, participants are required to predict whether the attribute seeds characterizes the first concept but not the other. In other words, semantic difference detection is a binary classification task: given a triple (apple, pineapple, seeds), the task is to determine whether it exemplifies a semantic difference or not, i.e., positive or negative. Table 1 shows more data examples.
word1    word2       attribute  label
apple    pineapple   seeds      positive
candle   chandelier  melts      positive
apple    coconut     brine      negative
apple    cucumber    seeds      negative

Table 1: Data examples.

If word1 has a specific attribute but word2 does not, then the correlation between the attribute and word1 should be higher than that between the attribute and word2; the same holds for their semantic similarity. In view of these considerations, to address this task we explore supervised machine learning methods that use PMI features and WordNet features. In recent years, more and more studies have focused on word embeddings as an alternative to traditional hand-crafted features (Pennington et al., 2014; Tang et al., 2014). Therefore we also use word embeddings to obtain semantic similarity as word embedding features. In addition, we perform a series of experiments to explore the effectiveness of the feature types and of the supervised machine learning algorithms.

System Description
To perform semantic difference detection on the given triples, we adopt supervised learning algorithms with several features that represent semantic similarity and correlation. In the following subsections, we introduce the feature engineering and the learning algorithms.

Feature Engineering
In this task, we design three types of features: WordNet features, PMI features and word embedding features. WordNet (Miller et al., 1990) is an on-line lexical reference system, which is organized by semantic properties of words. Therefore, the WordNet features are designed to utilize WordNet to obtain the semantic information.

WordNet Features
Each word may have several meanings, corresponding to different senses in WordNet, and WordNet provides a definition for each sense. If a word is an attribute of a target word, the attribute word may appear in a sense definition of the target word. For example, one sense definition of "snow" is "white crystals of frozen water", and "white", an attribute of "snow", appears in this definition. Therefore, we design the following features to capture this semantic information. Given the triple (word1, word2, attribute), we first load all sense definitions of word1, word2 and attribute. Then we implement four binary features: (1) whether attribute appears in the sense definitions of word1, (2) whether attribute appears in the sense definitions of word2, (3) whether word1 appears in the sense definitions of attribute, and (4) whether word2 appears in the sense definitions of attribute. As a result, we get four features.
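The four binary features can be sketched as follows. The `SENSE_DEFS` table here is a hypothetical stand-in for real WordNet glosses, which the system would obtain from WordNet itself (e.g. via NLTK); the function names are ours, not the paper's.

```python
# Sketch of the four binary WordNet features. SENSE_DEFS is a toy
# stand-in for WordNet sense definitions (glosses).
SENSE_DEFS = {
    "snow": ["white crystals of frozen water"],
    "white": ["the colour of snow"],
    "apple": ["fruit with red or yellow skin and sweet flesh"],
}

def in_defs(target, word):
    """True if `word` appears in any sense definition of `target`."""
    return any(word in d.split() for d in SENSE_DEFS.get(target, []))

def wordnet_features(word1, word2, attribute):
    return [
        int(in_defs(word1, attribute)),   # attribute in defs of word1
        int(in_defs(word2, attribute)),   # attribute in defs of word2
        int(in_defs(attribute, word1)),   # word1 in defs of attribute
        int(in_defs(attribute, word2)),   # word2 in defs of attribute
    ]
```

For the triple (snow, apple, white) this yields [1, 0, 1, 0]: "white" appears in a definition of "snow" but not of "apple".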

PMI Features
Pointwise mutual information (PMI) (Church and Hanks, 1990) is a measure of association between two events used in information theory and statistics. In NLP, it can be used to measure the correlation between two words: the higher the PMI, the stronger the correlation. We record the PMI value of word1 and attribute as well as the PMI value of word2 and attribute as PMI features. The PMI values we use are calculated from Wikimedia dumps 1 and obtained directly from SEMILAR (Rus et al., 2013). As a result, we get four PMI features.
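As an illustration of the metric (the paper takes precomputed values from SEMILAR rather than computing them), PMI can be sketched over a toy corpus of co-occurring word pairs:

```python
import math
from collections import Counter

# Toy corpus of co-occurring word pairs; purely illustrative.
corpus = [
    ["apple", "seeds"], ["apple", "seeds"], ["apple", "fruit"],
    ["pineapple", "fruit"], ["pineapple", "juice"],
]

word_counts = Counter(w for pair in corpus for w in pair)
pair_counts = Counter(tuple(sorted(pair)) for pair in corpus)
n_pairs = len(corpus)
n_words = sum(word_counts.values())

def pmi(w1, w2):
    """log P(w1, w2) / (P(w1) P(w2)); -inf if the pair never co-occurs."""
    p_pair = pair_counts[tuple(sorted((w1, w2)))] / n_pairs
    if p_pair == 0:
        return float("-inf")
    p1 = word_counts[w1] / n_words
    p2 = word_counts[w2] / n_words
    return math.log(p_pair / (p1 * p2))
```

On this toy corpus, pmi("apple", "seeds") is positive while pmi("pineapple", "seeds") is undefined (never co-occurs), which is exactly the asymmetry the features are meant to capture.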

Word Embedding Features
Word embeddings are continuous-valued vector representations of words that usually carry syntactic and semantic information. In this work, we employ two types of pre-trained 300-dimensional word vectors downloaded from the Internet: GoogleW2V (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). The former is pre-trained on news text and available from Google 2 ; the latter is pre-trained on tweets and available from the GloVe project 3 .
• WE similarity: Given the triple (word1, word2, attribute), if attribute characterizes word1 rather than word2, the semantic similarity score of attribute and word1 should be higher than that of attribute and word2. After acquiring the vectors of the three words in the triple, we calculate the similarity scores of attribute and word1 as well as attribute and word2 using cosine similarity and the Pearson correlation coefficient. Finally, we get four word embedding similarity features.

1 https://dumps.wikimedia.org/enwiki/20170724/
2 https://code.google.com/archive/p/word2vec
3 http://nlp.stanford.edu/projects/glove
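The similarity features can be sketched as follows, a minimal pure-Python version of the two measures applied to the (attribute, word) pairs; the function names are ours:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pearson(u, v):
    """Pearson correlation = cosine of the mean-centered vectors."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

def similarity_features(v1, v2, va):
    """Two measures x two (attribute, word) pairs = four features."""
    return [cosine(va, v1), cosine(va, v2), pearson(va, v1), pearson(va, v2)]
```

If the attribute vector is close to word1's vector but orthogonal to word2's, the first two features separate cleanly, which is the signal the classifier exploits.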
• WE operation: In addition to the above similarity features, we explore two ways of letting the word vectors interact, in order to capture as much semantic information as possible: concatenation and subtraction. Specifically, given the vectors V1, V2 and Va of the triple (word1, word2, attribute), the concatenation operation concatenates the three vectors as [V1 ⊕ V2 ⊕ Va], and the subtraction operation takes the element-wise difference of attribute from word1 and word2 respectively, i.e., [(V1 − Va) ⊕ (V2 − Va)]. Since we employ two types of word embeddings, we finally get 3,000-dimensional word embedding operation features.
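The operation features reduce to a few list manipulations; a minimal sketch (with our own function name):

```python
def operation_features(v1, v2, va):
    """Concatenation [V1 + V2 + Va] plus the element-wise differences
    (V1 - Va) and (V2 - Va): 5 * dim values per embedding type, so
    2 embeddings x 5 x 300 dims = 3,000 features in total."""
    concat = list(v1) + list(v2) + list(va)
    sub = [a - b for a, b in zip(v1, va)] + [a - b for a, b in zip(v2, va)]
    return concat + sub
```

With 2-dimensional toy vectors, operation_features([1, 2], [3, 4], [5, 6]) returns the 10 values [1, 2, 3, 4, 5, 6, -4, -4, -2, -2], and with 300-dimensional vectors from both embedding types the count works out to the 3,000 dimensions stated above.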

Learning Algorithm
We treat this task as a binary classification task and explore six supervised machine learning algorithms: Logistic Regression (LR) and Support Vector Machine (SVM), both implemented in the Liblinear toolkit (Fan et al., 2008); Stochastic Gradient Descent (SGD), RandomForest and AdaBoost, all implemented in scikit-learn (Pedregosa et al., 2011); and XGBoost 4 (Chen and Guestrin, 2016). All algorithms are used with default parameters. Table 2 shows the statistics and class distributions of the training, development and test sets provided by the task organizers.
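With default parameters, the training loop is little more than fit/predict; a sketch using scikit-learn on toy stand-in features (the real system feeds in the WordNet, PMI and embedding features described above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the triple features (e.g. the four WordNet binaries).
X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1]]
y = [1, 0, 1, 0]  # 1 = discriminative (positive), 0 = negative

# Default parameters throughout, as in the paper.
models = {
    "LR": LogisticRegression(),
    "SVM": LinearSVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
}
for name, clf in models.items():
    clf.fit(X, y)

preds = {name: int(clf.predict([[1, 0, 1, 0]])[0])
         for name, clf in models.items()}
```

The paper's LR and SVM actually come from the Liblinear toolkit; scikit-learn's `LogisticRegression` and `LinearSVC` wrap comparable linear models and stand in here for brevity.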

Evaluation Metric
To evaluate system performance, the official evaluation criterion is the macro-averaged F1-score, which is calculated over the two classes (positive and negative) as follows: for each class, F1 = 2PR / (P + R), where P and R are the precision and recall of that class, and the macro-averaged F1 is the unweighted mean of the two per-class F1 scores.
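The metric can be computed directly from the definition above; a minimal sketch:

```python
def f1(tp, fp, fn):
    """Per-class F1 = 2PR / (P + R) from raw counts."""
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

def macro_f1(gold, pred):
    """Unweighted mean of the F1 of the positive (1) and negative (0) class."""
    scores = []
    for cls in (1, 0):
        tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)
```

This matches scikit-learn's `f1_score(..., average='macro')` for the binary case, so either can be used in practice.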

Experiments on Training Data
Firstly, in order to explore the effectiveness of each feature type, we perform a series of experiments. Table 3 lists the comparison of different contributions made by different features on development data with Logistic Regression algorithm. We observe the following findings.
(1) The simple PMI features and word embedding similarity features are effective for semantic difference detection, which shows the usefulness of word correlation and semantic similarity for this task.
(2) The combination of the first three feature types achieves the best performance, both overall and for each class; all three contribute to the semantic difference detection task. Therefore we use these features in the following experiments.
(3) Adding the WE operation features does not perform as well as we expected. A possible reason is that their dimensionality is much higher than that of the other three feature types (3,000 vs. 16), so they dominate the classification at the expense of the low-dimensional features. Moreover, the vector operations may be too simple to capture the semantic difference.
(4) The WordNet features are not as effective as expected. The reason may be that in many cases the attribute words do not appear in the sense definitions of the concepts, so these features are all zero.
Secondly, we explore the performance of different supervised machine learning algorithms. Table 4 compares the learning algorithms using the WordNet, PMI and WE similarity features. We find: (1) LR and SVM achieve better results than the other supervised machine learning algorithms, and Logistic Regression achieves the best performance among single classifiers.
(2) The ensemble of the top three algorithms (LR + SVM + XGBoost) achieves higher performance than any single learning algorithm, i.e., 0.663. Based on these results, our final submission is an ensemble of the LR, SVM and XGBoost algorithms with WordNet, PMI and WE similarity features, trained on both the training and development sets. Table 5 shows the results of our system and the top-ranked systems provided by the organizers for this semantic difference detection task. Compared with the top-ranked systems, there is much room for improvement in our work. There are several possible reasons for this gap. First, the features we use are simple: we only record semantic similarity and word-correlation information, and more complex interactions of word vectors could be tried. Second, we only extract features from the three words to be classified and do not use extended resources, such as sentences returned by a search engine when querying these words.
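The ensemble can be sketched as a hard majority vote over the three models' label predictions; the paper does not spell out its combination rule, so this is an assumption, and the function name is ours:

```python
from collections import Counter

def majority_vote(predictions):
    """Hard-voting ensemble over per-model label lists.

    `predictions` is a list of rows, one per model (e.g. LR, SVM,
    XGBoost), each holding that model's predicted labels for every
    test triple. Ties cannot occur with an odd number of models."""
    n = len(predictions[0])
    return [Counter(model[i] for model in predictions).most_common(1)[0][0]
            for i in range(n)]
```

For example, if the three models predict [1, 0, 1], [1, 1, 0] and [0, 0, 1] on three triples, the ensemble outputs [1, 0, 1].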

Conclusion
In this paper, we extract WordNet features, PMI features and word embedding features from triples and adopt supervised machine learning algorithms to perform semantic difference detection. Our system ranks above average. In future work, we plan to explore more complex interactions of word vectors and to use more web resources to capture semantic information.