ABDN at SemEval-2018 Task 10: Recognising Discriminative Attributes using Context Embeddings and WordNet

This paper describes the system that we submitted to SemEval-2018 Task 10: Capturing Discriminative Attributes. Our system is built on the simple idea of measuring the attribute word's similarity to each of the two semantically similar words, based on an extended word embedding method and WordNet. Instead of computing the similarities between the attribute word and the semantically similar words with standard word embeddings, we propose a novel method that combines word and context embeddings, which better measures these similarities. Our model is simple and effective, achieving an average F1 score of 0.62 on the test set.


Introduction
Capturing discriminative attributes is a novel task, very different from classical semantic tasks that model similarity in semantics. The task aims to recognise semantic differences between words. Traditional semantic similarity evaluation tasks were designed to evaluate the quality of word representations, based on the fact that words with similar semantics are close to each other in vector space. Recent state-of-the-art distributed semantic models (Ling et al., 2015; Bojanowski et al., 2017), inspired by the success of word2vec (Mikolov et al., 2013), perform well on these similarity tasks. Nevertheless, capturing discriminative attributes between semantically similar words remains a challenge for traditional word embedding methods, because these methods are designed to capture similar semantics.
We make two observations about the nature of the task and the provided data: (1) only limited data is available for model training; (2) the inputs to the model are merely isolated words, which lack the context information needed for applying complex models. Therefore, we propose a novel framework that differentiates two semantically similar words with respect to an attribute word by using their word and context embeddings. We experimented with both Continuous Bag of Words (CBOW) and Skip-gram, demonstrating that using the combination of word and context embeddings outperforms using word embeddings alone.
The contribution of this work can be summarised as follows. We examine word and context embeddings in CBOW and Skip-gram, showing that using both word and context embeddings can better measure the co-occurrence of two words in sentences than simply using word embeddings. Hence our similarity measure can recognise the discriminative attributes of two semantically similar words more accurately.

System Description
Our system is trained on word and context embedding features as well as WordNet features (Fellbaum, 1998). Before introducing our framework in detail, we first introduce its two key technical components: context embeddings and WordNet.

Context embeddings
In contrast to simply using traditional word embeddings, which model semantic similarities based on contextual similarities, we use both word and context embeddings. Word and context embeddings are the vectors of target words and context words in CBOW and Skip-gram. Using them together can model the co-occurrence of the attribute word and the distinguished words in a sentence, which is useful for predicting whether the attribute word can distinguish two semantically similar words.
Take the Skip-gram model as an example. Skip-gram uses a neural network with a single hidden layer. Given a target word, the objective is to maximise the probability of predicting each of its context words (several words before and after the target word in a training sentence) (Rong, 2014):

    maximise  (1/T) Σ_{t=1}^{T} Σ_{w_c ∈ C(w_t)} log p(w_c | w_t),   (1)

where w_t is a target word in a training sentence, C(w_t) is the set of its context words, and w_c is a context word of w_t appearing within the same sentence.
During training, for each (target, context) word pair, the weight matrices before and after the hidden layer are updated. A row vector in the matrix before the hidden layer is a traditional word embedding v_t, as that vector is updated when the corresponding word w_t is in the target position. A column vector in the matrix after the hidden layer is defined as the context embedding v_c, as that vector is updated when the corresponding word w_c is in the context position. Each word thus has two vectors, v_t and v_c, because each word can be a target word or a context word of other target words. Some popular toolkits, e.g., gensim word2vec (Řehůřek and Sojka, 2010), discard Skip-gram's context embeddings v_c after training, as experimental research (Nalisnick et al., 2016) shows that, for two functionally or typically similar words, measuring similarity with v_t alone or v_c alone (IN-IN or OUT-OUT in the original paper) yields higher scores.
The conditional probability in the objective function in Eq. 1 can be expanded as

    p(w_c | w_t) = exp(v_t · v_c) / Σ_{w ∈ V} exp(v_t · v_c^w),   (2)

where V is the vocabulary of the training set and v_c^w is the context embedding of word w. The dot product v_t · v_c in the numerator computes the similarity between the target word vector and the context word vector; the denominator normalises this similarity into a probability. Thus, given a target word, training the whole model updates the matrices before and after the hidden layer to maximise the probability of predicting the context word, which amounts to maximising the similarity between the target word embedding v_t and the context word embedding v_c. If we reuse this trained similarity measure, e.g., to compute cosine similarity, we obtain a much better result. In other words, using both the word and context embeddings of two words that frequently appear within each other's contexts yields a better similarity measure, which incorporates the co-occurrence information of the two words. CBOW can be considered the reverse of Skip-gram: given a context, the objective is to maximise the probability of predicting the target word appearing in that context. Later, we examine both CBOW's and Skip-gram's word and context embeddings in our model.
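As a sketch, the softmax probability p(w_c | w_t) described above can be reproduced with a few lines of NumPy; the matrices here are random stand-ins for trained target and context embeddings, not the embeddings used in our system:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                          # toy vocabulary size and embedding dimension
W_target = rng.normal(size=(V, d))   # target embeddings v_t, one row per word
W_context = rng.normal(size=(V, d))  # context embeddings v_c, one row per word

def p_context_given_target(t):
    """Softmax over the dot products v_t . v_c for every candidate context word."""
    scores = W_context @ W_target[t]   # v_t . v_c^w for each word w in V
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

probs = p_context_given_target(2)      # distribution over all candidate context words
```

The two matrices play exactly the roles of the weight matrices before and after the hidden layer: a well-trained pair assigns high probability to true context words of the target.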

Word definitions in WordNet
We also introduce features based on word sense definitions in WordNet (Fellbaum, 1998), considering the differences between the definitions of two semantically similar words. The two words may be similar in semantics but different in definitions. An eligible discriminative attribute word is likely to appear within the definition of one of the two words, rather than both. For example, ears can distinguish corn and broccoli: in WordNet, ears occurs in the definition of corn, "tall annual cereal grass bearing kernels on large ears: widely cultivated in America in many varieties; the principal cereal in Mexico and Central and South America since pre-Columbian times", but not in broccoli's definition, "plant with dense clusters of tight green flower buds". We capture such characteristics to distinguish the two words.
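This definition check can be sketched as a pair of binary features; the glosses below are hard-coded from the example above, whereas in practice they could be retrieved through a WordNet interface such as NLTK's:

```python
import re

# Sense definitions quoted from WordNet in the example above.
definitions = {
    "corn": ["tall annual cereal grass bearing kernels on large ears: "
             "widely cultivated in America in many varieties; the principal "
             "cereal in Mexico and Central and South America since "
             "pre-Columbian times"],
    "broccoli": ["plant with dense clusters of tight green flower buds"],
}

def in_definitions(attribute, word):
    """Binary feature: 1 if the attribute occurs in any sense definition of word."""
    pattern = r"\b" + re.escape(attribute) + r"\b"
    return int(any(re.search(pattern, gloss) for gloss in definitions[word]))

feat_corn = in_definitions("ears", "corn")          # ears occurs in corn's gloss
feat_broccoli = in_definitions("ears", "broccoli")  # but not in broccoli's
```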

Hypothesis and framework
Our first hypothesis is that an attribute word w_A can distinguish two semantically similar words w_1 and w_2 if the attribute word co-occurs much more frequently with one word than with the other in the corpora. In vector space, the attribute word is then closer to the distinguished word than to the other one. Our second hypothesis is that if w_A can distinguish w_1 and w_2, w_A may appear within the WordNet definitions of one of w_1 and w_2.
The framework of our model can be summarised as follows: (1) we first train word embeddings v_t and context embeddings v_c on a Wikipedia dump.
(2) Given a triple (w_1, w_2, w_A), we then compute w_A's cosine similarities with w_1 and w_2, and the difference between the two similarities, which are used as three input features in the following classification. For example, given the context embeddings of w_1 and w_2 and the word embedding of w_A, we compute Feature 1: cosine(v_c^{w_1}, v_t^{w_A}); Feature 2: cosine(v_c^{w_2}, v_t^{w_A}); and Feature 3: |Feature 1 − Feature 2| (see Table 1).

We iteratively train 300-dimensional word and context embeddings based on CBOW and Skip-gram on a Wikipedia dump for 3 epochs each, setting a context window of 5 words before and after the target word. Words with frequency less than 5 in the Wikipedia corpus are ignored. The down-sampling rate is 10^{-4}.
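Under the assumption that trained embedding vectors are available, Features 1-3 can be computed as below; the vectors here are random stand-ins for the trained context embeddings of w_1 and w_2 and the word embedding of w_A:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 300
v_c_w1 = rng.normal(size=d)  # context embedding of w1 (stand-in)
v_c_w2 = rng.normal(size=d)  # context embedding of w2 (stand-in)
v_t_wA = rng.normal(size=d)  # word embedding of the attribute word wA (stand-in)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

f1 = cosine(v_c_w1, v_t_wA)  # Feature 1
f2 = cosine(v_c_w2, v_t_wA)  # Feature 2
f3 = abs(f1 - f2)            # Feature 3
```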
Based on CBOW and Skip-gram, we test all possible combinations of word and context embeddings for computing the cosine similarities. The first combination uses the context embeddings of the two semantically similar words w_1 and w_2 and the word embedding of the attribute word w_A.

Results
We cast the challenge task as a supervised classification problem. We first examine which combination of word and context embeddings and which training method (CBOW or Skip-gram) is optimal for this task. In this step, we only use Features 1-3 (see Table 1) to classify the triple (w_1, w_2, w_A).
As can be seen in Table 2, both CBOW-based methods that use word and context embeddings yield the highest average F1 of 0.55 on the validation set. Skip-gram-based models generally perform worse than CBOW-based models, but using Skip-gram word embeddings of w_1 and w_2 with context embeddings of w_A also outperforms the word-embedding-only model on the validation set. The experiments on the test set show similar trends: models based on word and context embeddings outperform models based on word embeddings alone. These results demonstrate that using word and context embeddings together can better distinguish two semantically similar words with an attribute word than simply using standard word embeddings. The results also support our first hypothesis: if the attribute word appears more frequently in one word's context than in the other's, it can distinguish the two words.
We also examine the WordNet definition features individually. As shown in Table 3, simply using Features 4-5 cannot classify the triples accurately. The F1 score with the positive label set to 1 is very low on the validation set (F1 = 36%). This is because an eligible discriminative attribute word does not always appear in the definitions of one of the two semantically similar words, so such features alone cannot identify discriminative words precisely.
Finally, we combine the similarity and WordNet features to address this challenge. There is no significant difference in average F1 on the validation set between the CBOW-based combinations (v_c^{w_1}, v_c^{w_2}, v_t^{w_A}) and (v_t^{w_1}, v_t^{w_2}, v_c^{w_A}). We select (v_c^{w_1}, v_c^{w_2}, v_t^{w_A}) as the winning combination of word and context embeddings, because this approach yields closer F1 scores when setting different labels (1 or 0) as the positive label. Thus, in the final submission, we use CBOW-trained context embeddings of w_1 and w_2 and word embeddings of w_A to compute the similarity features. We identify whether an attribute word w_A can distinguish w_1 and w_2 by using these similarity features and the WordNet definition features together. Although the word and context embedding based similarity features are much more effective than the WordNet features, introducing the WordNet features further improves the model, achieving 62% F1 on the test set (Table 4). WordNet definitions are thus also supportive features for this task.
Error analysis. We found that a significant portion of failures occur in examples where the textual associations between the attribute word and the semantically similar words are not discriminative. For example, given the triple (sons, father, young), our model failed to identify young as a discriminative attribute, because young is widely used to describe both sons and father in text (e.g., young sons and a young father). In such cases, our word co-occurrence based method is suboptimal.

Conclusion
In this paper, we extended traditional word embedding methods (CBOW and Skip-gram) to distinguish two semantically similar words using an attribute word. In contrast with simply using traditional word representations, using both context and word embeddings can better model the co-occurrence between the two similar words and their discriminative attribute word. If the attribute word co-occurs with one of the similar words more frequently than with the other within the same sentences, then the two semantically similar words can be distinguished by the attribute word. Using CBOW word and context embedding based similarity features and simple WordNet based word sense definition features, our model achieves an average F1 of 62% on the test set.

Table 1: Feature descriptions.
Feature 1: cosine(w_1, w_A)
Feature 2: cosine(w_2, w_A)
Feature 3: |cosine(w_1, w_A) − cosine(w_2, w_A)|
Feature 4: binary variable, indicating whether w_A appears in the WordNet definitions of w_1
Feature 5: binary variable, indicating whether w_A appears in the WordNet definitions of w_2

(3) Next, we introduce two binary features indicating whether w_A appears in any sense definition of w_1 or w_2 in WordNet (Features 4 and 5), respectively. (4) We train a random forest classifier with the above five features (see Table 1) to classify whether the attribute word w_A can distinguish the two semantically similar words w_1 and w_2.
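Step (4) can be sketched with scikit-learn's RandomForestClassifier; the five-dimensional feature vectors and labels below are synthetic stand-ins for the real similarity and WordNet features and gold labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X_train = rng.random((200, 5))               # one row of Features 1-5 per triple
y_train = (X_train[:, 2] > 0.5).astype(int)  # toy rule standing in for gold labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

X_new = rng.random((4, 5))
pred = clf.predict(X_new)                    # 1 = wA distinguishes w1 and w2
```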
The dataset provides triples (w_1, w_2, w_A) in training, validation and testing sets. All the words in the triples are nouns. Note that the discriminative attribute words w_A in the given dataset were selected because they represent a visual attribute of one of the two semantically similar words. For example, red can differentiate apple and banana because, visually, an apple is red while a banana is yellow. The task does not consider other discriminative features, such as sound and taste. Thus, image features may have an advantage on this dataset; however, semantic features can also capture invisible discriminative attributes.

Word and context embeddings.

Table 2: Results of different combinations of word and context embeddings of w_1, w_2 and w_A.

Table 4: Final results using CBOW word and context embeddings and WordNet features (Features 1-5).