UWB at SemEval-2018 Task 10: Capturing Discriminative Attributes from Word Distributions

We present our UWB system for the task of capturing discriminative attributes at SemEval 2018. Given two words and an attribute, the system decides whether the attribute is discriminative between the words. Assuming the Distributional Hypothesis, i.e., that the meaning of a word is related to its distribution across contexts, we introduce several approaches to comparing the contextual information of words. We experiment with state-of-the-art semantic spaces and with simple co-occurrence statistics. We show that the distribution of words in a corpus has potential for detecting discriminative attributes. Our system achieves an F1 score of 72.1% and is ranked #4 among 26 submitted systems.


Introduction
In this paper, we describe our UWB system participating in the pilot shared task on capturing discriminative attributes held at SemEval 2018. Given two words and an attribute, the goal of this task is to decide whether the attribute is discriminative between them. For example, we can distinguish between the words car and boat by the discriminative feature (attribute) wheels. On the other hand, both tennis and basketball use a ball, so ball is not discriminative between them. By its nature, capturing discriminative attributes is a binary classification task. In general, no assumption is made about the input words and their attributes (e.g., their part of speech).
While most related work focuses on extracting discriminative features from images (Guo et al., 2015; Huang et al., 2016; Lazaridou et al., 2016), this shared task operates purely on the textual level. The first experiments were performed by Krebs and Paperno (2016) and showed the promising potential of this task.
The fundamental assumption of our work is the Distributional Hypothesis, i.e., two words are expected to be semantically similar if they occur in similar contexts (they are similarly distributed across the text). This hypothesis was formulated by Harris (1954) several decades ago. Today it is the basis of state-of-the-art distributional semantic models (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017). We present several approaches that rely on the Distributional Hypothesis and use word contexts for a statistical comparison of word meanings.

Proposed Approach
Let $w_1 \in V$ and $w_2 \in V$ be two words and $a \in V$ an attribute, where $V$ is the word vocabulary. The task is to predict whether the attribute $a$ is discriminative between the words $w_1$ and $w_2$, which leads to a binary classification task.
We propose several metrics that estimate the degree to which the attribute $a$ is important for the word $w$. We denote this importance as $\phi(w, a) \in \mathbb{R}$. Clearly, if the attribute is important for one word and not for the other, it is likely to be discriminative. In general, we place no assumption on the importance metric $\phi(w, a)$. We transform this score into a binary vector $\mathbf{b}_{w,a}$ containing exactly one non-zero element (a one-hot vector). Let $T: \mathbb{R} \to \{0, 1\}^b$ be the transformation function such that $\mathbf{b}_{w,a} = T(\phi(w, a))$. In our case, we split the scores $\phi(w, a)$ for all pairs $(w, a)$ from the training data into $b$ bins according to the $\frac{100}{b}\%$ quantiles. The bin into which the importance score falls is marked by the value 1 in the vector $\mathbf{b}_{w,a}$.
Having the one-hot vectors $\mathbf{b}_{w_1,a}$ and $\mathbf{b}_{w_2,a}$ for the pair of words $w_1$ and $w_2$, we represent the discriminativeness of the attribute $a$ as a conjunction matrix $\mathbf{C}_{w_1,w_2,a} = \mathbf{b}_{w_1,a} \mathbf{b}_{w_2,a}^{\top}$ (note that $\mathbf{b}_{w_1,a}$ is a column vector). The matrix $\mathbf{C}_{w_1,w_2,a} \in \{0, 1\}^{b \times b}$ has exactly one non-zero element, at the coordinates given by the bins onto which the scores $\phi(w_1, a)$ and $\phi(w_2, a)$ are mapped. The values in the matrix are used as binary features for the classifier. The main motivation behind this binarization is to allow combining importance metrics that live on different scales.
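The binarization described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the function names, the toy training scores, and the nearest-rank rule for placing the quantile edges are all assumptions made for the example.

```python
from bisect import bisect_right


def quantile_edges(scores, b):
    """Return the b - 1 inner bin edges at the 100/b % quantiles of scores."""
    ordered = sorted(scores)
    n = len(ordered)
    # Nearest-rank quantiles; the exact interpolation rule is an assumption.
    return [ordered[min(n - 1, (j * n) // b)] for j in range(1, b)]


def one_hot(score, edges):
    """Map a score to a one-hot list with len(edges) + 1 entries."""
    vec = [0] * (len(edges) + 1)
    vec[bisect_right(edges, score)] = 1
    return vec


def conjunction_features(score_w1, score_w2, edges):
    """Flatten the outer product of the two one-hot vectors (row-major)."""
    b1 = one_hot(score_w1, edges)
    b2 = one_hot(score_w2, edges)
    return [x * y for x in b1 for y in b2]


# Toy training scores, made up for illustration only.
train_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
edges = quantile_edges(train_scores, b=5)

# Attribute scores high for w1 and low for w2: bins (4, 0) of a 5 x 5 matrix.
features = conjunction_features(0.85, 0.07, edges)
```

With $b = 5$ the flattened matrix has 25 entries, exactly one of which is set; features from several importance metrics can simply be concatenated, because none of them carries the original scale of its score.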
In the following subsections, we introduce several approaches to estimate the importance score ϕ(w, a).

Semantic Spaces
Let $S: V \to \mathbb{R}^n$ be a semantic space, i.e., a function that projects a word $w$ into a Euclidean space of dimension $n$. The meaning of the word $w$ is represented as a real-valued vector $S(w)$.
We assume that the more similar the attribute $a$ is to the word $w$ in meaning, the more likely $a$ represents some feature of $w$. We estimate this similarity as the cosine of the angle between the corresponding vectors:

$$\phi(w, a) = \cos\big(S(w), S(a)\big) = \frac{S(w) \cdot S(a)}{\lVert S(w) \rVert \, \lVert S(a) \rVert} \qquad (1)$$
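As a small illustration of the cosine-based importance score, the following sketch computes it for hand-made three-dimensional vectors. The vectors and the dictionary `space` are hypothetical stand-ins for a real pre-trained semantic space, and returning 0 for a zero vector is a convention chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for the sketch: no direction, no similarity
    return dot / (norm_u * norm_v)


# Hypothetical toy semantic space (real spaces have hundreds of dimensions).
space = {
    "car": [0.9, 0.1, 0.3],
    "boat": [0.1, 0.9, 0.3],
    "wheels": [0.8, 0.0, 0.4],
}

phi_car = cosine(space["car"], space["wheels"])
phi_boat = cosine(space["boat"], space["wheels"])
```

In this toy space the attribute wheels is much closer to car than to boat, which is the kind of asymmetry the classifier can exploit once both scores are binned.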

Word Co-occurrences
We follow the intuition behind the Global Vectors (GloVe) model (Pennington et al., 2014), i.e., that co-occurrence probabilities are able to encode the meaning of words. We are given the corpus $c = \{c_i\}_{i=1}^{k}$, i.e., a sequence of words $c_i \in V$, where the subscript $i$ denotes the position in the corpus. Let $N(w, a)$ denote the weighted frequency of the word $w$ in the context of the word $a$:

$$N(w, a) = \sum_{i=1}^{k} \sum_{m=1}^{d} \lambda(m) \, \big( [c_i = w \wedge c_{i-m} = a] + [c_i = w \wedge c_{i+m} = a] \big)$$

where $d$ is the size of the context window, $[\cdot]$ equals 1 if the condition holds and 0 otherwise, and $\lambda$ is a weighting function. We experiment with two types of weighting: a) uniform weighting, where $\lambda(m) = 1$ independently of the distance between the words, and b) hyperbolic weighting, where $\lambda(m) = \frac{1}{m}$. For uniform weighting, the equation expresses the number of times the word $w$ occurs in the context of the word $a$. Hyperbolic weighting incorporates the assumption that closer words are more important for each other (the weight decreases with increasing distance).
Let $N(w) = \sum_{a \in V} N(w, a)$ be the number of times any word occurs in the context of $w$. We estimate the conditional probability of an attribute $a$ given the word $w$ and use it as an importance metric:

$$\phi(w, a)^{[a|w]} = P(a \mid w) = \frac{N(w, a)}{N(w)}$$

The core idea is that if $a$ often occurs in the context of $w_1$ and not in the context of $w_2$, then $a$ is likely to be a discriminative attribute between $w_1$ and $w_2$. A similar idea can also be expressed the other way around, i.e., by using the probability of the word $w$ given the attribute $a$:

$$\phi(w, a)^{[w|a]} = P(w \mid a) = \frac{N(w, a)}{N(a)}$$
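The counting scheme can be illustrated with a short, self-contained sketch. It assumes a symmetric context window of `d` positions on each side of a word, which the text does not state explicitly; the function names and the nine-word toy corpus are likewise made up for the example.

```python
from collections import defaultdict


def cooccurrence_counts(corpus, d, weighting="hyperbolic"):
    """Weighted counts N(w, a) over a symmetric window of d words (assumed)."""
    counts = defaultdict(float)
    for i, w in enumerate(corpus):
        for m in range(1, d + 1):
            # Uniform: every neighbour counts 1. Hyperbolic: weight 1 / m.
            lam = 1.0 if weighting == "uniform" else 1.0 / m
            for j in (i - m, i + m):
                if 0 <= j < len(corpus):
                    counts[(w, corpus[j])] += lam
    return counts


def marginals(counts):
    """N(w): sum of N(w, a) over all context words a."""
    totals = defaultdict(float)
    for (w, _), n in counts.items():
        totals[w] += n
    return totals


def conditional(counts, totals, given, other):
    """P(other | given) = N(given, other) / N(given); 0 for unseen words."""
    total = totals.get(given, 0.0)
    return counts.get((given, other), 0.0) / total if total else 0.0


corpus = "the car has wheels while the boat has sails".split()

uniform = cooccurrence_counts(corpus, d=2, weighting="uniform")
n_uniform = marginals(uniform)
p_wheels_car = conditional(uniform, n_uniform, "car", "wheels")    # P(a | w1)
p_wheels_boat = conditional(uniform, n_uniform, "boat", "wheels")  # P(a | w2)
p_car_wheels = conditional(uniform, n_uniform, "wheels", "car")    # P(w1 | a)

hyperbolic = cooccurrence_counts(corpus, d=2, weighting="hyperbolic")
p_wheels_car_hyp = conditional(hyperbolic, marginals(hyperbolic), "car", "wheels")
```

In this toy corpus wheels falls inside the window of car but not of boat, so the estimate of $P(a \mid w)$ is non-zero for car and zero for boat. Hyperbolic weighting lowers the estimate for car because wheels sits at distance 2.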

ConceptNet
ConceptNet (Speer and Havasi, 2012) is a large semantic graph, which connects words and phrases with labeled edges. It is based on knowledge collected from many sources, including Wiktionary, WordNet, DBPedia, etc. When ConceptNet is combined with state-of-the-art semantic spaces (e.g., GloVe (Pennington et al., 2014) or Skip-Gram (Mikolov et al., 2013)) it provides exceptional performance in intrinsic tasks (Speer and Lowry-Duda, 2017).
In this paper, we use the ConceptNet API, which makes it possible to measure the relatedness between words. It is built using an ensemble that combines data from ConceptNet, SkipGram, GloVe, and OpenSubtitles 2016, using a variation on retrofitting (Speer et al., 2016). We use the relatedness weight as an importance metric $\phi(w, a)^{[CN]}$.

Experiments
In all our experiments, we employ a Maximum Entropy classifier (Berger et al., 1996) implemented in the Brainy machine learning library (Konkol, 2014). For every importance metric, we use a mapping onto $b = 5$ bins. This leads to 5 × 5 = 25 binary features describing the discriminativeness of an attribute for a single importance metric.
We train the classifier on the validation dataset provided by the organizers of this task, containing 2722 manually annotated examples (1364 positive and 1358 negative) with a total of 576 distinct attributes. We do not use the automatically generated data train.txt. To select the optimal feature set, we perform 10-fold cross-validation. The official test data consists of 2340 examples (1047 positive and 1293 negative). F1 score is the official evaluation measure of this task. Note that the majority class system achieves F1 scores of 50.1% and 55.3% on the validation and test datasets, respectively.

Settings
We estimate word co-occurrence probabilities (Section 2.2) using the English Wikipedia corpus. We experiment with several semantic spaces:
SkipGram is a neural-network-based model (Mikolov et al., 2013). Levy and Goldberg (2014) provide SkipGram models pre-trained on English Wikipedia with two sizes of the context window (2 and 5), as well as their own model with dependency-based context.
GloVe is a log-bilinear model for word representations, which encodes global word co-occurrences (Pennington et al., 2014). We use vectors provided by the authors of the model, pre-trained on corpora of various sizes (6, 42, and 840 billion words).

FastText (Bojanowski et al., 2017) is a character-n-gram-based SkipGram model. We use word vectors pre-trained on English Wikipedia.
LexVec is based on the factorization of the positive point-wise mutual information matrix using proven strategies from GloVe, SkipGram, and methods based on singular value decomposition (Salle et al., 2016). We use pre-trained word vectors provided by the authors of the model.
Latent Semantic Analysis (LSA) (Landauer et al., 1998) first creates a word-document co-occurrence matrix and then reduces its dimension by singular value decomposition. We trained the model on English Wikipedia.
Latent Dirichlet Allocation (LDA) (Blei et al., 2003) represents the text as a topic distribution. In our case, each value in the word vector corresponds to the probability of the word conditioned on a particular topic.

Results
Table 1 shows the results for the individual approaches, namely, semantic spaces (Section 2.1), word co-occurrences (Section 2.2), and ConceptNet (Section 2.3). The last two columns in the table contain F1 scores for 10-fold cross-validation on the validation dataset and F1 scores on the official test data. All three approaches provide comparable F1 scores on both datasets.
Detailed experiments with different context window sizes $1 \le d \le 10$ for estimating co-occurrence probabilities are shown in Figure 1. We show F1 scores achieved by 10-fold cross-validation on the validation dataset. Hyperbolic weighting performs better than uniform weighting in both cases, $w|a$ and $a|w$, independently of the size of the context window. A bigger context window seems to be more suitable for capturing discriminativeness. We can see that the two metrics complement each other and that their combination leads to significantly better results than either metric alone. Based on this graph, we chose the context window size $d = 9$ and use it in all further experiments.
Based on the cross-validation F1 scores, we combine different importance metrics to boost the performance (see Table 2). LexVec performed best among the semantic spaces. We found that LDA enriches LexVec and improves the performance by approximately 1%. We believe this is because of the different context type (we used Wikipedia articles as documents for LDA). Significant improvements are achieved when we combine co-occurrence probabilities with semantic spaces or with ConceptNet (both cases give approximately 70% F1 score on both validation and test data). Combining all three approaches together yields additional improvements (72.0% and 71.3% on validation and test data, respectively). Our final UWB system combines all three approaches with one extra trick. We create additional binary features represented as the product of each pair of features ($x_a \times x_b$ for $a \neq b$) and add them to the classifier. We do this to better model the dependencies between individual features. In the table, we denote this trick as a conjunction. Although this setting increases the sparseness of the feature set, it boosts the F1 score on validation data by 1.9% and on test data by 0.8%.
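The conjunction trick can be sketched as follows. The function name is hypothetical, and the sketch assumes that each unordered pair of distinct features contributes one product, since the product is commutative; the text does not say whether ordered pairs were used.

```python
from itertools import combinations


def add_pairwise_conjunctions(features):
    """Append the product x_a * x_b of every unordered pair of distinct features."""
    products = [x * y for x, y in combinations(features, 2)]
    return list(features) + products


# Toy binary feature vector, made up for illustration.
base = [1, 0, 1]
extended = add_pairwise_conjunctions(base)
```

For binary inputs each product is itself binary and fires only when both of its source features fire, so for $n$ input features the vector grows by $n(n-1)/2$ mostly zero entries. This is the increase in sparseness noted above.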

Conclusion
In this paper, we described our UWB system participating in the SemEval 2018 shared task on capturing discriminative attributes. We explored three approaches based on the distribution of words in a corpus: various semantic spaces, co-occurrence probabilities, and ConceptNet. Our best results were achieved by a Maximum Entropy classifier combining all three approaches with careful feature engineering. Our system is ranked #4 among 26 participating systems.