Igevorse at SemEval-2018 Task 10: Exploring an Impact of Word Embeddings Concatenation for Capturing Discriminative Attributes

This paper presents a comparison of several approaches for capturing discriminative attributes and examines the impact of concatenating word embeddings of different nature on classification performance. A similarity-based method is proposed and compared with classical machine learning approaches. It is shown that this method outperforms the others on all the considered word vector models, and that performance increases when concatenated embeddings are used.


Introduction
State-of-the-art models detect semantic similarity well. However, a model that is only good at similarity detection has limited practical use (Krebs et al., 2018), since understanding the semantics of words is impossible without also capturing semantic differences.
Semantic difference is a ternary relation between two concepts (apple, banana) and a discriminative feature (red) that characterizes the first concept but not the second. Semantic difference detection is a binary classification task: given a triple (apple, banana, red), determine whether it exemplifies a semantic difference (Krebs et al., 2018). In this paper two concepts and a discriminative attribute attr are represented as (word1, word2, attr).
This research was done during participation in the "Capturing Discriminative Attributes" task of the SemEval 2018 competition.
The paper is organized as follows. Section 2 describes the methods used. Section 3 shows the results and analyzes them. Section 4 mentions future directions. Section 5 concludes the paper.

Methods
Several approaches were considered: classical machine learning algorithms and a similarity-based model.

Data Preparation
The dataset was provided by the SemEval 2018 challenge organizers.
• The train set consists of 17501 automatically generated samples of the form (word1, word2, attr, y), where y is a binary target variable indicating whether attr is a discriminative attribute for word1 and word2. Classes are imbalanced: 63.83% of samples belong to class 0 (not a discriminative attribute) and 36.16% to class 1 (a discriminative attribute).
• The validation set contains 2722 manually curated samples of the same form. Classes are almost balanced: 50.1% of samples belong to class 0 and 49.9% to class 1.
• The test set contains 2340 samples: 55.3% for class 0 and 44.7% for class 1.
Each triple (word1, word2, attr) was converted to a numeric vector using pre-trained word embeddings. These vectors form a vector space in which words that share common contexts in the corpus are located in close proximity to one another.
Three word embedding configurations were used: pre-trained Google News vectors (Mihltz, 2017), pre-trained Wikipedia vectors, and their concatenation. The problem of missing words was solved by replacing them with an alternative spelling or a synonym. For each triple (word1, word2, attr) the corresponding word vectors were concatenated, so each triple is converted into a 900-dimensional vector when using Google News or Wikipedia word vectors, and a 1800-dimensional vector for concatenated word embeddings.
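The conversion step can be sketched as follows. The toy dictionary below stands in for a real pre-trained model; in practice the 300-dimensional Google News or Wikipedia vectors would be loaded, e.g. with gensim's KeyedVectors.

```python
import numpy as np

# Toy stand-in for a pre-trained embedding model: random 300-dimensional
# vectors keyed by word (real vectors would come from Google News / Wikipedia).
rng = np.random.default_rng(0)
DIM = 300
vocab = {w: rng.standard_normal(DIM) for w in ["apple", "banana", "red"]}

def triple_to_vector(word1, word2, attr, emb):
    """Concatenate the three word vectors into one feature vector."""
    return np.concatenate([emb[word1], emb[word2], emb[attr]])

x = triple_to_vector("apple", "banana", "red", vocab)
print(x.shape)  # (900,): three 300-dimensional vectors concatenated
```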

Machine Learning Approaches
There were several machine learning classification approaches chosen for comparison: logistic regression, stochastic gradient descent (SGD) classifier, k-nearest neighbors classifier and artificial neural network.
The best parameters that maximize the F1 score on the validation set were:
• Logistic regression with L2 regularization and regularization strength set to 10.
• Stochastic gradient descent classifier with perceptron loss and regularization term set to 1e-05.
• K-nearest neighbors with k = 1 using Manhattan distance metric and weighting points by the inverse of their distance.
• A multilayer perceptron neural network built using Keras with TensorFlow as a backend. The structure of this network is described in Table 1.
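A minimal scikit-learn sketch of the first three configurations is given below (the Keras network is omitted). Note that scikit-learn's C parameter is the inverse regularization strength, so mapping the paper's "strength 10" to C = 0.1 is an assumption; the feature matrix is synthetic, standing in for the 900-dimensional triple vectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 900-dimensional concatenated triple vectors.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 900))
y = rng.integers(0, 2, size=200)

models = {
    # C is the *inverse* regularization strength in scikit-learn; the paper's
    # "regularization strength 10" is assumed here to correspond to C = 0.1.
    "logreg": LogisticRegression(penalty="l2", C=0.1, max_iter=1000),
    "sgd": SGDClassifier(loss="perceptron", alpha=1e-5, random_state=0),
    # k = 1, Manhattan distance (p = 1), inverse-distance weighting.
    "knn": KNeighborsClassifier(n_neighbors=1, p=1, weights="distance"),
}
for name, model in models.items():
    model.fit(X, y)
```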

Similarity-based Approach
Another approach is to derive an interpretable algorithm based on knowledge about word semantics. The intention is to use word similarities for distinguishing discriminative attributes while keeping the model as simple as possible. Cosine similarity between words a and b is defined as the cosine of the angle between the corresponding word vectors A and B.
For each given triple (word1, word2, attr), the similarities of attr with word1 and of attr with word2 are computed. The obtained similarities are then compared using a threshold t: if the gap between them is large enough, i.e. sim(word1, attr) > sim(word2, attr) + t, attr is treated as a discriminative attribute of word1. This means that the attr word vector is much closer to word1 than to word2 in the vector space. Thus, the model has only one tunable hyperparameter, t. The dependency of the F1 score on the threshold is shown in Figure 1.
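The decision rule can be sketched as follows, with toy 2-dimensional vectors in place of real embeddings:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_discriminative(v1, v2, v_attr, t=0.06):
    """attr discriminates word1 from word2 if it is closer to word1 by margin t."""
    return cos_sim(v1, v_attr) > cos_sim(v2, v_attr) + t

# Toy 2-dimensional vectors: attr is almost aligned with word1.
w1 = np.array([1.0, 0.0])
w2 = np.array([0.0, 1.0])
attr = np.array([0.9, 0.1])
print(is_discriminative(w1, w2, attr))   # True
print(is_discriminative(w2, w1, attr))   # False: attr is not closer to word2
```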

Experimental Results
Models are evaluated using the F1 measure. Figure 1 shows that the behavior of the similarity-based model depends strongly on the selected word vector model. The thresholds learned from the train set for the Google News, Wikipedia and concatenated word vectors are 0.053030, 0.066667 and 0.060606, respectively.
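Learning the threshold amounts to a one-dimensional grid search maximizing F1 on the train set. The sketch below uses synthetic similarity scores in place of the real data:

```python
import numpy as np
from sklearn.metrics import f1_score

def predict(sims1, sims2, t):
    # Class 1 iff attr is closer to word1 than to word2 by margin t.
    return (sims1 > sims2 + t).astype(int)

# Synthetic similarity scores standing in for sim(word_i, attr) on the train set,
# with gold labels defined by a 0.05 margin.
rng = np.random.default_rng(0)
sims1 = rng.random(500)
sims2 = rng.random(500)
y = (sims1 > sims2 + 0.05).astype(int)

grid = np.linspace(0.0, 0.2, 100)
scores = [f1_score(y, predict(sims1, sims2, t)) for t in grid]
best_t = grid[int(np.argmax(scores))]
print(best_t)
```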
Experimental results on the validation set are presented in Table 2. K-nearest neighbors showed the worst F1 score, while the neural network is the best among the machine learning methods. The proposed similarity-based method outperforms all other models on all considered word embeddings.
The word embeddings with the highest F1 score on the validation set were chosen for the final comparison: Google News for the SGD classifier; Wikipedia for k-nearest neighbors and the neural network; concatenated embeddings for logistic regression and the similarity-based model. The results on the test set are presented in Table 3. K-nearest neighbors has the lowest F1 score. In contrast to its performance on the validation set, the result of the neural network is noticeably worse, which indicates overfitting. Logistic regression performed better than the SGD classifier, while the similarity-based method showed the highest score.

Table 3: Results on the test set.

Model                Best F1 score
K-nearest neighbors  0.502
SGD classifier       0.515
Logistic regression  0.527
Neural network       0.503
Similarity-based     0.646

Error Analysis
This section provides an error analysis of the similarity-based model. During evaluation on the test set, 65.5% of samples were classified correctly and 34.5% were not. The predicted classes are imbalanced: 1436 (61.4%) samples were classified as class 0 and 904 (38.6%) as class 1.
Among the misclassified samples, 332 were assigned class 1 when the true label was 0, whereas 475 were assigned class 0 when the true label was 1. As we can see, the model is more likely to consider attributes non-discriminative.
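These counts can be checked for internal consistency; all numbers below are taken from the text above.

```python
# Sanity-check the reported error-analysis counts.
total = 2340          # test-set size
fp, fn = 332, 475     # predicted 1 / true 0, and predicted 0 / true 1
errors = fp + fn      # 807 misclassified samples
print(round(100 * errors / total, 1))          # 34.5% misclassification rate

pred_pos, pred_neg = 904, 1436                 # predicted class counts
assert pred_pos + pred_neg == total
tp = pred_pos - fp                             # 572 correctly predicted positives
true_pos_total = tp + fn                       # 1047 true class-1 samples
print(round(100 * true_pos_total / total, 1))  # 44.7, matching the class balance
```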
Attributes can be divided into several categories. For example, there are attributes representing colors: 'black', 'brown', 'red', 'blue' and 'yellow'. It is worth mentioning that 43.36% of the samples with color attributes were misclassified, which is higher than the 34.5% misclassification rate over the whole test set.

Future Work
It was shown that the similarity-based model performs better when concatenated word vectors are used. Training a Word2Vec model specifically for this task, instead of using pre-trained models, could solve the mentioned problem of missing words and of multiple forms of the same word in the word embeddings. According to the SemEval Task 10 organizers, the training set contains noisy data that was not verified by humans. Another potential improvement is therefore training models only on the validation dataset, since it was created manually and should be free of noise.
It was discovered that samples with color attributes have higher misclassification rate than other samples. There are proposed solutions for learning discriminative properties of images (Lazaridou et al., 2016), which could be combined with a text-based approach to derive a multimodal classifier.
It is also worth analyzing other categories of attributes and their misclassification rates.

Conclusion
This paper presented several approaches for capturing discriminative attributes. The main contribution is the proposed similarity-based method, which is interpretable and takes into account the semantic similarity of words. The method was compared with machine learning methods such as logistic regression, an SGD classifier, k-nearest neighbors and a multilayer perceptron neural network. Experiments on three pre-trained word vector models show that the similarity-based method outperforms the others. It was also found that concatenating word embeddings of different nature improves the quality of several methods.