GHH at SemEval-2018 Task 10: Discovering Discriminative Attributes in Distributional Semantics

This paper describes our system submission to SemEval-2018 Task 10 on Capturing Discriminative Attributes. Given two concepts and an attribute, the task is to determine whether the attribute is semantically related to one concept and not the other. In this work we assume that discriminative attributes can be detected by discovering the association (or lack of association) between a pair of words. The hypothesis we test in this contribution is whether the semantic difference between two concepts can be captured by measuring the distance between words in a vector space, or can simply be obtained as a by-product of word co-occurrence counts.


Introduction
Equipped with their cognitive skills, encyclopedic knowledge and linguistic competence, humans can generally identify the lexical association or semantic relation between two words or concepts with relative ease. However, building a computational model for identifying fine-grained semantic relations (such as synonymy, antonymy, hyponymy, hypernymy, meronymy, holonymy, metonymy, containment or causality) or even detecting binary relatedness has proven to be a challenging task.
Efforts to model semantic representation computationally are generally classified into statistical and knowledge-driven semantics. This classification depends on whether the assumption is that human knowledge is encapsulated in language manifestation or that explicit manual encoding of this knowledge is needed. The statistical approach to the encoding of semantic relations is referred to as "distributional semantics" or "distributed word representations" (Speer et al., 2017), and its theoretical appeal stems from the fact that it gives practical application to the Firthian dictum "You shall know a word by the company it keeps" (Firth, 1957), which has become commonsense wisdom in lexical semantics. Features of the statistical model are extracted from unstructured data, such as word embeddings or n-gram counts, or directly from raw data.
The basic idea with word embeddings is to formulate semantic relations in arithmetic fashion by creating a vector space in which words with similar contexts have closer vectors (Elman, 1990; Bengio et al., 2003; Kann and Schütze, 2008; Mikolov et al., 2013c). The public availability of word embedding training programs such as word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) has allowed researchers to create models with different parameters and dimensionality sizes for different purposes, including capturing semantic relations (Gladkova et al., 2016; Attia et al., 2016).
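This arithmetic view can be made concrete with a toy sketch. The vectors below are hand-picked for illustration (real embeddings are learned from large corpora), but they show how a vector offset such as king − man + woman is resolved to the nearest remaining word by cosine similarity:

```python
import numpy as np

# Hand-picked toy vectors; real models learn these from corpora.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.0]),
    "queen": np.array([0.9, 0.0, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman should land closest to queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
# best == "queen"
```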
The Google n-gram corpus (Brants and Franz, 2006) is a collection of English word n-grams and their observed counts generated from 1 trillion words of text from web pages. This corpus has been used in many different applications, including estimating word relatedness (Islam et al., 2012), comparison of semantic similarity (Joubarne and Inkpen, 2011), information retrieval (Tandon and De Melo, 2010; Klein and Nelson, 2009), lexical disambiguation (Bergsma et al., 2009), improving general-purpose NLP classifiers, and improving parsing performance.
In this work we follow a statistics-based approach and show the strengths and weaknesses of the distributional semantics of word vectors and n-gram frequency counts in capturing the different types of discriminative attributes.

Task and Data Description
The goal of the shared task on Capturing Discriminative Attributes (Krebs et al., 2018) is to detect the semantic difference between pairs of concepts, or in other words, to determine whether a semantic property differentiates between two possibly related concepts. For example, both 'bear' and 'goat' are animals, but only a 'bear' has 'claws'. Therefore 'claws' is considered a discriminative feature.
The shared task data is formatted in triples that represent a ternary relation between two concepts (Word1, Word2) on one hand and an attribute (Word3) on the other. Word3 is considered a discriminative attribute if, and only if, it characterizes Word1 but not Word2. For example, in the triple (sailboat,yacht,mast), 'mast' is discriminative as it is found in Word1, 'sailboat', but not in Word2, 'yacht'. By contrast, in the triple (goose,duck,flies) the event 'flies' is not discriminative as it characterizes both entities. Similarly, in the triple (pickle,lemon,round), 'round' is not a discriminative feature, as it characterizes Word2, not Word1.
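As a sketch, such triples can be read with a few lines of Python. The comma-separated layout and field order below (word1, word2, attribute, label) are assumptions for illustration, not a specification of the official data files:

```python
import csv
from io import StringIO

def parse_triples(text):
    """Yield (word1, word2, attribute, label) tuples; label 1 means the
    attribute is discriminative for word1, 0 means it is not."""
    for row in csv.reader(StringIO(text)):
        w1, w2, attr, label = row
        yield w1, w2, attr, int(label)

# Example triples from the task description (labels per the definition above).
sample = "sailboat,yacht,mast,1\ngoose,duck,flies,0\npickle,lemon,round,0\n"
triples = list(parse_triples(sample))
```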
The size of the shared task data is described in Table 1. It is to be noted that there is no intersection between the discriminative attributes in any of the datasets. We think the purpose is to make sure that the participating systems are able to learn how to estimate the relations regardless of the lexical items involved.


System Description

In our system we use a deep neural network for the binary classification of discriminative attributes. The basic idea with deep learning is to use hidden layers of neural nets to automatically capture the underlying factors that lead from the input to the output, eliminating the need for feature engineering.
The system is trained on features extracted from two main publicly available resources that fall within the paradigm of unstructured data, as no manual lexical or encyclopedic knowledge is encoded. The two resources are the Google n-gram counts and the Google News Word2Vec model.

Google n-gram counts. We use the Google 5-gram counts as provided by Google Books ngrams (Michel et al., 2011; Lin et al., 2012).

Google News Word2Vec. This is a publicly available pre-trained word vector model, built with the word2vec architecture (Mikolov et al., 2013b) from a news corpus of 100B words (3M vocabulary entries), with 300 dimensions, negative sampling, continuous bag of words, and a window size of 5.

Features Used
We describe the features used to train our DNN binary classifier to detect discriminative attributes. In this section we use the abbreviations W1, W2, and W3 for Word1, Word2, and Word3, respectively.
We use pre-trained word vectors in order to obtain similarity scores between words. This leads to the following features.
• distW1W3: Cosine distance between W1 and W3
• distW2W3: Cosine distance between W2 and W3
• cosDiff: Difference between distW1W3 and distW2W3
• similarityCompare: We compute the cosine similarity between two sets of words using the Gensim 'n_similarity' function. This gives a single number comparing the similarity between W1 and W3 against that between W2 and W3.
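A minimal sketch of the three distance features, using hand-made 3-dimensional vectors in place of the pre-trained embeddings (with gensim one would query the model's `similarity` and `n_similarity` methods instead):

```python
import numpy as np

# Toy stand-ins for pre-trained embeddings; values are illustrative only.
vec = {
    "sailboat": np.array([0.8, 0.1, 0.2]),
    "yacht":    np.array([0.7, 0.2, 0.1]),
    "mast":     np.array([0.9, 0.0, 0.3]),
}

def cos_dist(a, b):
    """Cosine distance = 1 - cosine similarity."""
    u, v = vec[a], vec[b]
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

dist_w1_w3 = cos_dist("sailboat", "mast")   # distW1W3
dist_w2_w3 = cos_dist("yacht", "mast")      # distW2W3
cos_diff = dist_w1_w3 - dist_w2_w3          # cosDiff
```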
In order to capture all morphological variations of the words, we use word lemmas and then expand to all variants that share the same lemma.
• lemmaDistW1W3Ex: The average cosine distance between W1 and all lemma expansions of W3
• lemmaDistW2W3Ex: The average cosine distance between W2 and all lemma expansions of W3

We use the Google 5-gram counts to obtain the following features.
• cntW1W3: counts of W1 and W3 co-occurring
• cntW2W3: counts of W2 and W3 co-occurring
• cntW1W3Ex: counts of W1 and the lemma expansions of W3 co-occurring
• cntW2W3Ex: counts of W2 and the lemma expansions of W3 co-occurring
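The count features can be sketched as a scan over 5-gram records. The simplified tab-separated "5-gram\tcount" layout and the example counts below are illustrative assumptions, not the exact format of the distributed files:

```python
# Hypothetical 5-gram records in a simplified "<w1 w2 w3 w4 w5>\t<count>" form.
ngram_lines = [
    "the teacher teaches the class\t134656",
    "a teacher who teaches math\t1",
    "the pupil reads a book\t42",
]

def cooccurrence_count(lines, word_a, word_b):
    """Sum the counts of all 5-grams that contain both words."""
    total = 0
    for line in lines:
        gram, count = line.rsplit("\t", 1)
        tokens = gram.split()
        if word_a in tokens and word_b in tokens:
            total += int(count)
    return total

cnt_w1_w3 = cooccurrence_count(ngram_lines, "teacher", "teaches")  # 134657
cnt_w2_w3 = cooccurrence_count(ngram_lines, "pupil", "teaches")    # 0
```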

Machine Learning Models
We use a deep neural network model for the binary classification of attributes as either True or False (discriminative or non-discriminative) based on the set of features described above. We use a simple and straightforward architecture consisting of 5 feed-forward fully-connected (or dense) layers with a single dropout layer with a rate of 0.3. The network is narrow at the top and wide at the bottom. The function of the dropout layer (Hinton et al., 2012) is to mitigate overfitting and make sure that our model learns significant representations by randomly omitting a certain percentage of the neurons in the hidden layer for each presentation of the samples during training. This encourages each neuron to depend less on other neurons and to try to learn generalizations. Table 2 shows the layer configuration of the model.
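A rough NumPy sketch of such a network's forward pass; the layer widths and the dropout placement below are illustrative stand-ins for the actual configuration in Table 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative widths only: five fully-connected layers, narrow at the
# top and widening toward the bottom, ending in a single sigmoid output.
widths = [8, 16, 32, 64, 128, 1]   # 8 input features -> 1 probability

weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(widths, widths[1:])]
biases = [np.zeros(n) for n in widths[1:]]

def forward(x, train=False, drop_rate=0.3):
    """Forward pass; inverted dropout is applied after the first hidden
    layer during training only."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(z, 0.0)                    # ReLU hidden layer
            if train and i == 0:
                mask = rng.random(h.shape) >= drop_rate
                h = h * mask / (1.0 - drop_rate)      # inverted dropout
        else:
            h = 1.0 / (1.0 + np.exp(-z))              # sigmoid output
    return h

p = forward(np.ones(8))   # probability that the attribute is discriminative
```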

Experiments and Results
We test our system on various combinations of the features mentioned in subsection 3.1. We assume the baseline is 50%, as this is what a random system would generate given that the validation set has an almost equal number of True and False labels. Table 3 shows the system results on the dev set, with the last row showing results on the test set using our best model, "all features". Surprisingly, using the cosine distance between pairs of words gives a low score (56.17%), only slightly above the baseline, indicating the ineffectiveness of cosine distance in capturing this kind of relationship. Word counts alone were the most impactful of all the features.

Error Analysis
In order to analyze the performance of the system and identify where it is faring well and where it is failing, we first manually classify the relations between concepts and attributes in the validation set into 8 types.
1. Part-whole. This is when the attribute denotes an entity that can be a part or the whole of concept1, e.g. tractor, wheels; moose, legs; cat, eyes; iguana, tongue; condos, rooms.
2. Container-contained. This is when the entity attribute can be located/situated physically or temporally in concept1, e.g. oven, kitchen; fort, cannons; mouse, house; priest, parish; surfboard, water.
3. Made-of. This is when the entity attribute is a material of which concept1 can be made, e.g. cart, wood; wire, metal; rum, sugarcane; scarf, wool; wine, grape; roof, clay.
4. Agent-patient. This is when the attribute is a topic or theme on which concept1 can act, e.g. politician, politics; physiotherapist, muscles; dermatologist, skin; mammals, milk.

5. HasAttribute. This is when the attribute is an adjective that can be used to describe concept1, e.g. garlic, white; girl, virgin; alligator, long; tuna, large; honey, sweet; pumpkin, round.
6. Hyper-hypo. This is when the attribute is a hypernym or hyponym of concept1.
7. Event. This is when the attribute is a verb that is associated with the concept/entity, e.g. woman, talk; educator, teaches; knee, bend; tuna, swims; frog, jumps; shirt, wear; seabirds, fly; novelist, write.
8. Relates-to. This is when the relationship cannot be stated with any of the aforementioned types, e.g. bus, passengers; knee, pads; lung, transplant; widow, death; brother, sister; uncle, nephew.

Table 4 shows our manual classification of the discriminative attributes in the validation set. It is to be noted that the majority of relations (62.64%) are of three types: hasAttribute, part-whole and hyper-hypo.
The types of discriminative features in Table 4 are sorted by system performance, highlighting the strengths and weaknesses of the system. The deep learning algorithm assumes that the attribute is discriminative for concept1 if it has considerably higher n-gram counts with concept1 than with concept2. At the upper end, n-gram counts show strength in dealing with events and container-contained relationships, where co-occurrence statistics proved to be very helpful. The examples below show frequency counts that indicate a stronger relation between Word1 and Word3 than between Word2 and Word3. Gold answers are the numbers (0 or 1) following the triples.

(shoulder, cheek, carry, 1), cntW1W3: 104620, cntW2W3: 498
(teacher, pupil, teaches, 1), cntW1W3: 134656, cntW2W3: 0
(albums, music, picture, 1), cntW1W3: 3937564, cntW2W3: 374572

It is to be mentioned that in the validation set there were 246 (9%) instances where no frequency counts were found for either concept.
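The count signal described above can be caricatured as a simple ratio test. The threshold below is an arbitrary illustration; in our system the decision boundary is learned by the network, not hand-set:

```python
def count_heuristic(cnt_w1_w3, cnt_w2_w3, ratio=2.0):
    """Call the attribute discriminative for concept1 when it co-occurs
    with concept1 considerably more often than with concept2."""
    if cnt_w2_w3 == 0:
        return cnt_w1_w3 > 0
    return cnt_w1_w3 / cnt_w2_w3 >= ratio

# Counts from the examples above:
count_heuristic(104620, 498)       # (shoulder, cheek, carry) -> True
count_heuristic(134656, 0)         # (teacher, pupil, teaches) -> True
count_heuristic(3937564, 374572)   # (albums, music, picture) -> True
```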
At the lower end of our system's performance are the classes hasAttribute, part-whole and hyper-hypo. As these classes constitute the majority of the data, the overall system performance is compromised. We make a further detailed analysis of our top losses with hasAttribute and part-whole.

Analysis of Errors with hasAttribute
Most of the errors in this class can be attributed to one of three causes.
• N-gram counts are not aware of the qualification scope. For example, in the tuple below, 'large' has an equally high frequency with 'brick', not because a brick can be large, but because the two co-occur in phrases like "large brick house/ranch".
(garage, brick, large, 1), cntW1W3: 245802, cntW2W3: 193816
• Contrary to common-sense knowledge, the data can prove an association between a concept and an attribute that might not be readily perceived. The example below shows that "green tomato" is not a rarity. This could indicate an error in the manual annotation of the data.

Analysis of Errors with part-whole
Similarly, the errors in this class can be attributed to one of three causes.
• Disproportionate frequency counts, which could be tied to the disparity in the individual frequencies of the concepts themselves. This might be solved by taking the n-gram count as a function of the unigram counts of the concepts themselves.
(car, taxi, wheels, 0), cntW1W3: 504848, cntW2W3: 2734
• There could be an association of a different kind between concept2 and the attribute that yields higher frequency counts. For instance, in the example below, 'garlic' and 'wings' have a higher frequency, not because garlic has wings, but because they co-occur in phrases like "garlic chicken wings".
(pheasant, garlic, wings, 1), cntW1W3: 500, cntW2W3: 11136
• Either of the two concepts has no n-gram co-occurrence with the given attribute, leading to missing information.
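The normalization suggested in the first point can be sketched as dividing each co-occurrence count by the concept's unigram count, so that very frequent concepts like 'car' do not dominate rarer ones like 'taxi'. The pair counts below come from the (car, taxi, wheels) example; the unigram counts are invented for illustration:

```python
def normalized_cooccurrence(pair_count, unigram_count):
    """Pairwise co-occurrence count scaled by the concept's own
    frequency, as a crude relative-association score."""
    return pair_count / unigram_count if unigram_count else 0.0

# Pair counts from the example above; unigram counts are hypothetical.
car_wheels  = normalized_cooccurrence(504848, 50_000_000)
taxi_wheels = normalized_cooccurrence(2734, 1_500_000)
# The raw count ratio of roughly 185:1 shrinks considerably after scaling.
```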

Conclusion
In this paper we have presented our system for detecting discriminative features using distributional semantics. We have shown that, without resorting to human knowledge, a great deal of encyclopedic knowledge can be captured from unstructured data. We also conducted a detailed error analysis which shows the strengths and weaknesses of the system. In its quest to approximate the distance between words with similar contexts, the cosine distance becomes oblivious to the intrinsic relationship between words and their immediate neighbors, and this is why many relations that can be induced from co-occurrence counts are not captured by cosine distance.
While n-gram counts from raw data can be a rich resource for mining lexical information and inducing semantic knowledge, co-occurrence counts can suffer from considerable constraints when two or more adjacent words have a different scope of predication or qualification. For example, while "wood spoon" has a high frequency due to the semantic relation of 'made-of', "wood pepper" has an even higher frequency count, not due to any semantic relationship, but because 'wood' is scoped to a subsequent word, as in "wood pepper mill". If syntactic information about the heads of noun compounds and the scope of modification were available, more meaningful assumptions could be made.