AmritaNLP at SemEval-2018 Task 10: Capturing discriminative attributes using convolution neural network over global vector representation.

The “Capturing Discriminative Attributes” shared task is the tenth task of SemEval-2018. The task is to predict whether a word captures a distinguishing attribute of one word relative to another. We use GloVe word embeddings, pre-trained on an openly sourced corpus, for this task. A base representation is first established over several embedding dimensions. These representations are evaluated on validation scores from two models: an SVM-based classifier and a one-dimensional CNN. The scores guide further development of the representation through vector combinations and various distance measures. These measures and the corresponding offset vectors are concatenated as features, mainly to improve the F1 score while keeping the best accuracy. The features are then tuned further on the validation scores to achieve the highest F1 score. Our evaluation narrowed down to two representations, classified with CNN models, with total dimension lengths of 1204 and 1203 for the final submissions. Of the two, the latter feature representation delivered our best F1 score of 0.658024 (as per the published results).


Introduction
As famously stated by Firth, "You shall know a word by the company it keeps"; that is, the semantic information embedded in a representation can only be described by the words surrounding it. This only goes so far: it works when the company itself is unambiguous and the representation captures, hypothetically, every sense of the word over a corpus. The Capturing Discriminative Attributes shared task, conducted with SemEval-2018, was proposed by Alicia Krebs and Denis Paperno (2016). They describe how lexical similarity may not be enough to qualitatively assess semantic information for a multitude of tasks. They propose that, with this task, a system can be modelled to effectively extract certain semantic differences between words in order to understand the sense embedded within them. A proof-of-concept dataset is provided for this shared task, in which a given word is checked for whether it can distinguish between a pair of words. The dataset itself appears simple: the training set provides label information for two classes, positive or negative, making this a binary classification task.
Footnote 1: Results/Evaluation under the team name "AmritaNLP".
The three words in each instance are given in order: a pivot word, followed by a compare word, and ending with an attribute (or feature) word that may or may not be associated with the pivot word. The decision is whether the attribute word is a distinguishing feature that discriminates the pivot word from the compare word. For example, in (apple,banana,red), apple is the pivot word, banana is the compare word, and red is the word that decides whether this is a feature that can be associated with apple to distinguish it from banana. This example is rather oversimplified for a human: from a very young age we are taught to distinguish objects by visual cues, and we have subconsciously learned to differentiate fruits mainly by their color or size. Such information is seldom used to describe fruits in written form, so a machine lacks the visual information needed to make this judgment call, which makes an informed decision that much more difficult. Their work builds on a method presented by Lazaridou et al. (2016) for predicting distinguishing features, using images as a reference, in a visual discriminating attribute identification task; more prominently, it relates to capturing lexical information using offset vectors.

Dataset
The dataset for the 2018 shared task (Krebs and Paperno, 2018) is divided into three sets: train, test and validation. The training set contains automatically generated examples that are not manually curated, whereas the test and validation sets consist of manually verified examples, together just over 5000 instances. The test set is constructed so that the feature word overlap between train and test is minimal. The validation set is similar to the test set and is used for parameter tuning of the models.
In total there are 17782 instances in the training set, 2722 in the validation set and 2340 in the test set. Because it is generated automatically, the training set is noisier than the validation and test sets. In the dataset, positive examples are annotated with the label '1', signifying that the attribute/feature word is positively associated only with the pivot word in the order presented, and not vice versa. For example, in (airplane,helicopter,wings), 'wings' is an attribute associated only with 'airplane', whereas (helicopter,airplane,wings) is an invalid entry. The pair (helicopter,airplane) in this order is added only if the concept 'helicopter' has a feature that 'airplane' does not have in this set. Negative examples, on the other hand, are annotated with the label '0' at the end. These arise when the attribute/feature word either applies to both the pivot and the compare word or applies to neither, e.g. (Tractor,scooter,wheels), (Spider,elephant,legs), etc.
In the training dataset there are 508 unique concept (pivot) words in total, of which 375 have positive attributes and 505 have negative attributes. Given this contrast between the two labels, we can infer that concept words do not all have an equal proportion of labeled instances.
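The per-label pivot counts described above can be reproduced with a few lines of code. The following is a minimal sketch, assuming each instance is stored as one comma-separated row with the label at the end (pivot,compare,attribute,label); the function names and the three-row sample are hypothetical, with the sample built only from the examples quoted in this section.

```python
import csv
import io
from collections import defaultdict


def load_triples(handle):
    """Parse rows of the assumed form pivot,compare,attribute,label."""
    rows = []
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        pivot, compare, attribute, label = row
        rows.append((pivot, compare, attribute, int(label)))
    return rows


def pivots_by_label(rows):
    """Map each label (0 or 1) to the set of pivot words seen with it."""
    pivots = defaultdict(set)
    for pivot, _compare, _attribute, label in rows:
        pivots[label].add(pivot)
    return pivots


# Toy sample (not the real data): one positive and two negative instances.
SAMPLE = """airplane,helicopter,wings,1
Tractor,scooter,wheels,0
Spider,elephant,legs,0
"""

rows = load_triples(io.StringIO(SAMPLE))
pivots = pivots_by_label(rows)
print(len(pivots[1]), "pivot word(s) with positive attributes")
print(len(pivots[0]), "pivot word(s) with negative attributes")
```

Run over the real training file, the same two set sizes would correspond to the 375 and 505 figures reported above, provided the file follows the assumed layout.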

Methodology
This section discusses the methods considered in our implementation. After a cursory look at the dataset, we decided to use a pre-trained representation of the words rather than training a word embedding on the dataset itself. The rationale is that models of word-pair association built on this dataset alone would not help train the embedding. Using the pre-trained embeddings, the representations are evaluated on validation accuracy with machine learning techniques such as SVM, for which we use ten-fold cross-validation with a linear kernel. This algorithm was explored earlier for sense disambiguation in Tamil, with a rich feature representation, by Anand Kumar et al. (2014a), and is also implemented in Anand Kumar et al. (2014b). A simple one-dimensional convolutional neural network model is also used, based on the work of Vinayakumar et al. (2017). The CNN configuration was fixed empirically: the representation is convolved with twenty filters of size three with ReLU activation, trained with a batch size of sixty-four over ten epochs; the convolution output is flattened and reduced to thirty-two units and then to one unit at the final layer for evaluation. The architectures of the two models are shown in Figures 1 and 2 respectively.
Moving ahead, GloVe pre-trained word embeddings (Pennington et al., 2014) of various dimensions are considered; these are learned over public data and are available under the PDDL. Specifically, 100- and 300-dimensional word representations trained on the 6B- and 840B-token corpora (the latter from Common Crawl) are considered. The focus is on choosing one of these representations for our base method. On top of these embeddings, various distance, dissimilarity and similarity measures are considered, to provide a measure between vectors or, in our case, between words. Table 1 lists the abbreviations used throughout the following discussion of the methods and representations.
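The CNN configuration described above can be checked with a little shape arithmetic. The following is a minimal sketch, assuming a stride of one, no padding, a single input channel and no pooling layer (none of which the text states explicitly); the function names are hypothetical, and the input length of 1203 is the length of the final feature representation reported in the abstract.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1-D convolution with no padding."""
    return (input_length - kernel_size) // stride + 1


def cnn_shapes(input_length, filters=20, kernel_size=3, hidden_units=32):
    """Layer shapes and parameter counts for conv -> flatten -> 32 -> 1."""
    conv_len = conv1d_output_length(input_length, kernel_size)
    flattened = conv_len * filters
    params = {
        # one input channel assumed: kernel_size weights plus a bias per filter
        "conv": filters * (kernel_size + 1),
        # fully connected layer from the flattened features to 32 units
        "dense_hidden": flattened * hidden_units + hidden_units,
        # final single-unit layer
        "dense_out": hidden_units + 1,
    }
    return {
        "conv_output": (conv_len, filters),
        "flattened": flattened,
        "params": params,
    }


summary = cnn_shapes(input_length=1203)
print(summary["conv_output"], summary["flattened"])
print(summary["params"])
```

Under these assumptions almost all trainable parameters sit in the layer that reduces the flattened convolution output to thirty-two units, which is worth keeping in mind given the noisy training set.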
With the pre-trained vectors in place, we consider a few vector measurement techniques that can be used to gauge semantic similarity among them. These vectors carry a spatial correlation between words, as discussed by Pennington et al. (2014).

Conditional representation (CR)
SN | If | Else
1 | W_p, (W_p + W_a), W_c, W_a | W_p, (W_p - W_a), W_c, (W_c - W_a)
2 | W_p, (W_p + W_a), W_c, W_a, (Dis_c - Dis_p) | W_p, (W_p - W_a), W_c, (W_c - W_a), (Dis_c - Dis_p)
Initially, a simple concatenation of the three words, the pivot (W_p), compare (W_c) and attribute (W_a) words, is taken as an instance for the entire dataset. The same representation is built at the two dimension lengths mentioned earlier. After fitting the model on the training data, the validation accuracy and F1 score are measured, as shown in Table 2. These representations are likewise passed to a convolutional neural network, and the respective accuracy and F1 scores are shown in Table 3. Following an empirical approach, the representations are extended by appending W_a to W_p and to W_c in sequence and passing the result to the two models (representation two in Table 2). The SVM model showed no significant improvement in score over the representations, whereas the CNN model showed a slight improvement on the same representation. Word embeddings, being vector representations in a higher-dimensional space, have been shown (Pennington et al., 2014) to capture spatial information that can be employed as features for the representation. We exploit this by computing certain measures between the pairs (W_p, W_a) and (W_c, W_a). These measures are calculated using the SciPy (Jones et al., 2001) and scikit-learn (Pedregosa et al., 2011) libraries to find distance, similarity and dissimilarity measures between the two 1-D word arrays. The similarity of two words indicates how closely associated they are; it is calculated as cosine similarity, a scalar for which a larger value means the two words are more similar. Dissimilarity is the converse of this measure.
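The base representation and the similarity measure described above can be illustrated with a small example. This is a minimal sketch using plain Python lists in place of the SciPy and scikit-learn calls; the toy 3-dimensional vectors are made-up values standing in for 300-dimensional GloVe vectors, the helper names are hypothetical, and the ordering W_p, W_a, W_c, W_a is one reading of "appending W_a to W_p and W_c".

```python
import math


def concatenate(*vectors):
    """Join several word vectors end to end into one feature vector."""
    features = []
    for vector in vectors:
        features.extend(vector)
    return features


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; larger means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values, not real embeddings).
w_p = [1.0, 0.0, 0.0]  # pivot
w_c = [0.0, 1.0, 0.0]  # compare
w_a = [1.0, 1.0, 0.0]  # attribute

base = concatenate(w_p, w_c, w_a)           # three words side by side
extended = concatenate(w_p, w_a, w_c, w_a)  # attribute appended to each word
sim_pa = cosine_similarity(w_p, w_a)
sim_ca = cosine_similarity(w_c, w_a)
print(len(base), len(extended), sim_pa, sim_ca)
```

With 300-dimensional embeddings the two layouts would have 900 and 1200 entries respectively, which is consistent with the 1203 and 1204 totals once a few scalar measures are appended.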
Of the various measures explored, we considered euclidean, chebyshev, sqeuclidean and minkowski as distance measures, and jaccard, kulsinski and hamming as dissimilarity measures; these are implemented using the SciPy library (Jones et al., 2001). Among them, kulsinski dissimilarity gave the clearest separation between the comparisons (W_p, W_a) and (W_c, W_a), so we chose it as the threshold measure for differentiating the representations of positive and negative instances: if the dissimilarity of (W_c, W_a) is greater than that of (W_p, W_a), then W_a is added to W_p and the result is concatenated to form the representation. Otherwise, the second representation is used, where W_a is subtracted from both words. This is shown as the first conditional representation (CR) in Table 4.
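The conditional rule above can be sketched as a small function that takes the dissimilarity measure as a parameter. This is an illustration under stated assumptions, not the submitted system: the paper's chosen measure is kulsinski, which SciPy documents for boolean arrays, and the text does not say how the real-valued GloVe vectors were prepared for it, so squared Euclidean distance (one of the measures listed above) is used here purely as a stand-in. The helper names and toy vectors are hypothetical.

```python
def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def subtract(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]


def squared_euclidean(u, v):
    """Stand-in dissimilarity (assumption; the paper uses kulsinski)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def conditional_representation(w_p, w_c, w_a, dissimilarity):
    """First conditional representation (CR1) as described in the text."""
    dis_p = dissimilarity(w_p, w_a)
    dis_c = dissimilarity(w_c, w_a)
    if dis_c > dis_p:
        # attribute is closer to the pivot: add it to the pivot
        parts = (w_p, add(w_p, w_a), w_c, w_a)
    else:
        # otherwise subtract the attribute from both words
        parts = (w_p, subtract(w_p, w_a), w_c, subtract(w_c, w_a))
    return [value for part in parts for value in part]


# Toy 2-d vectors (assumed values): the attribute lies near the first word.
near = [1.0, 0.0]
far = [0.0, 1.0]
attr = [1.0, 0.5]

if_branch = conditional_representation(near, far, attr, squared_euclidean)
else_branch = conditional_representation(far, near, attr, squared_euclidean)
print(if_branch)
print(else_branch)
```

Passing the measure in as an argument keeps the rule itself separate from the choice of dissimilarity, so a different measure can be swapped in without touching the branching logic.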
The accuracy of the CR-based representation decreased for the SVM models, whereas the F1 score and accuracy increased for the CNN model over the initial representations shown in Table 3. The subsequent representations were therefore improved on the CNN model to achieve a better F1 score with good accuracy. Comparing the two 300-dimensional GloVe pre-trained vectors of differing corpus size shown in Table 3, the 840B 300d model achieved a better F1 score and accuracy than the other, so we proceeded with this word embedding.
In Table 5 we see that subsequent representations, built upon the simple CR1 representation and concatenated with the kulsinski distance (Dis_3) and cosine similarity (Cos_3), improved the F1 score. This is shown by the third representation, where the F1 score increased to 53.1% with considerable accuracy compared with the previous iteration. Further refinement of the CR1 representation with different features, such as the correlation coefficient, increased the F1 score to 56.8% but brought down the accuracy. Representation six is the next feature representation for which both the accuracy and the F1 score increase, to 54.6% and 58.9% respectively. After many iterations of adding features, representation eleven gave the highest F1 score with the best accuracy. The model based on this representation was submitted along with the one based on representation ten, which also had a good F1 score but a lower accuracy on the validation dataset.

Results & Conclusion
The tenth and eleventh representations of Table 5 are the two feature sets, based on CNN models, that were used to predict on the test set and submitted for the competition. The published results for our models showed that the first set scored 0.52 and the second set scored 0.66 in F1 score. Comparing the predicted labels of the two systems with the gold standard, the system fit on the tenth representation correctly predicted only 399 of 1293 negative examples and 855 of 1047 positive examples. On the eleventh representation it correctly predicted 857 of 1293 negative and 687 of 1047 positive examples. Comparing the outcomes of the sys-