CitiusNLP at SemEval-2018 Task 10: The Use of Transparent Distributional Models and Salient Contexts to Discriminate Word Attributes

This article describes the unsupervised strategy submitted by the CitiusNLP team to SemEval-2018 Task 10, a task which consists of predicting whether a word is a discriminative attribute between two other words. Our strategy relies on the correspondence between discriminative attributes and relevant contexts of a word. More precisely, the method uses transparent distributional models to extract salient contexts of words, which are identified as discriminative attributes. The system reaches about 70% accuracy on the development dataset, but its accuracy drops to 63% on the official test dataset.


Introduction
The goal of SemEval-2018 Task 10 (Paperno, Lenci and Krebs, To Appear) is to predict whether a word is a discriminative attribute between two other words. The key idea underlying this task is to capture semantic attributes of words in order to discriminate their senses. Distributional semantics is based on the assumption that two words have similar senses if they tend to appear with the same contextual words (Firth, 1957). As contextual words actually refer to the semantic attributes of a given word, I will focus on identifying the most salient word contexts. My method to identify discriminative attributes therefore relies on the identification of salient contexts, since they represent the main semantic attributes of a word.
For this purpose, I will make use of distributional models built with transparent, lexico-syntactic contexts. To capture discriminative attributes, I will rank the most relevant contexts of a word by using lexical association measures between the word and its contexts. My method is unsupervised and only requires pretrained distributional models. This paper is organized as follows. The method is described in Section 2. Experiments, results, and a discussion of them are presented in Section 3. Finally, conclusions are presented in Section 4.

The method
As mentioned in the previous section, discriminative attributes might be captured by searching for the most salient contexts of words. For this purpose, the distributional vector space I have adopted is a transparent count-based model with explicit and sparse dimensions. Sparseness reduction is performed by selecting the most salient contexts per word using a filtering strategy (Bordag, 2008; Gamallo and Bordag, 2011; Gamallo, 2017). This filtering strategy consists of selecting, for each word, the S (salient) contexts with the highest lexical association scores (e.g. loglikelihood or ppmi). The top S contexts are considered to be the most relevant and informative for each word. S is a global, arbitrarily defined constant whose usual values range from 10 to 1000 (Biemann et al., 2013; Padró et al., 2014). In short, I keep at most the S most relevant contexts for each target word. This is an explicit and transparent distributional representation giving rise to a non-zero matrix. By contrast, methods based on dimensionality reduction, such as LSA (Landauer and Dumais, 1997) or neural-based embeddings (Mikolov et al., 2013), make the vector space more compact with dimensions that are not transparent in linguistic terms (Gamallo, 2017).
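The filtering step can be illustrated with a minimal sketch. It assumes that association scores have already been computed and stored in a dictionary keyed by (word, context) pairs; the function name `top_s_contexts` and the toy scores are hypothetical and are not taken from the actual system.

```python
import heapq
from collections import defaultdict


def top_s_contexts(scores, s):
    """Keep, for each word, the s contexts with the highest association score.

    scores: dict mapping (word, context) -> association score (assumed precomputed).
    Returns a dict mapping word -> {context: score} with at most s entries per word.
    """
    by_word = defaultdict(list)
    for (word, context), score in scores.items():
        by_word[word].append((score, context))
    return {
        word: {context: score for score, context in heapq.nlargest(s, pairs)}
        for word, pairs in by_word.items()
    }


# Toy, invented scores used only to exercise the function.
toy_scores = {
    ("car", "wheels"): 12.5,
    ("car", "red"): 4.0,
    ("car", "the"): 0.3,
    ("table", "wooden"): 9.1,
    ("table", "the"): 0.2,
}
salient = top_s_contexts(toy_scores, 2)
print(salient["car"])
```

Because each row keeps at most S named contexts, every retained dimension can still be read as a word or a lexico-syntactic pattern, which is what makes the representation transparent.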
SemEval-2018 Task 10 consists of predicting whether a word is a discriminative attribute between two other words. For instance, given a triple <car, table, wheels>, the system must determine whether the last word of the triple, wheels, represents a semantic feature that characterizes the first word, car, but not the second one, table. This is a binary classification task. For this particular example, the classifier must return a positive answer, since cars have wheels but tables do not. Taking into account the objective of SemEval-2018 Task 10 and the concept of salient context introduced above, the classification method I propose is the following very simple rule: given the triple <w1, w2, att>, att is a discriminative attribute of w1 and not of w2 if att belongs to the most salient contexts of w1 and not to those of w2.
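The rule above can be written down as a short sketch. It assumes that the salient contexts of each word have already been reduced to a set of attribute words; the function name `is_discriminative` and the toy sets are hypothetical illustrations, not data from the actual models.

```python
def is_discriminative(w1, w2, att, salient_attributes):
    """Return True if att is among the salient attributes of w1 but not of w2.

    salient_attributes: dict mapping a word to the set of lexical words found
    in its salient contexts (assumed to be extracted beforehand).
    """
    in_w1 = att in salient_attributes.get(w1, set())
    in_w2 = att in salient_attributes.get(w2, set())
    return in_w1 and not in_w2


# Toy, invented attribute sets used only to exercise the rule.
toy_attributes = {
    "car": {"wheels", "run", "red"},
    "table": {"wooden", "red"},
}
print(is_discriminative("car", "table", "wheels", toy_attributes))
```

In this sketch, a word that is missing from the model has an empty attribute set, so any triple whose first word is unknown receives a negative answer. This is one possible design choice, not necessarily the one made in the submitted system.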
Concerning the type of context used to represent word distributions, there is a great number of previous studies that evaluate and compare syntactic contexts (usually dependencies) with bag-of-words techniques (Grefenstette, 1993; Seretan and Wehrli, 2006; Padó and Lapata, 2007; Peirsman et al., 2007; Gamallo, 2008, 2009; Levy and Goldberg, 2014; Gamallo, 2017). The cited papers state that syntax-based methods outperform bag-of-words techniques, in particular when the objective is to compute semantic similarity between functionally equivalent words, such as the detection of co-hyponym/hypernym word relations (i.e. near synonymy).
In my proposal, I use lexico-syntactic contexts to model word distributions. When contexts are defined as lexico-syntactic contexts, I consider that a word is an attribute of w1 if that word is the lexical element in at least one of the salient contexts of w1. For instance, consider three lexico-syntactic contexts whose lexical elements are wheels, run, and red, respectively. If they are salient contexts of the word car, then these three lexical words will be considered attributes of car.
The number of salient contexts considered per word will be determined experimentally.

Resources
The count-based, explicit and transparent distributional model used in the experiments was generated from the English Wikipedia (August 2013 dump), which contains almost 2 billion tokens. This model is described in Gamallo (2017), and a version with the 500 most salient contexts per word is freely available. 1 To process the corpus and create the transparent matrices, I used the multilingual PoS tagger of LinguaKit 2 (Garcia and Gamallo, 2015) and DepPattern, a rule-based and multilingual dependency parser (Gamallo, 2015) which is also part of LinguaKit. I also generated other models with different thresholds: from 10 to 2000 salient contexts per word.
As will be described in the next subsection, I will compare the transparent matrix with dense word embeddings, in particular with those reported in Levy and Goldberg (2014), which are publicly available. 3 These embeddings were generated from the same Wikipedia dump as the transparent model. Given that embeddings are opaque and, therefore, their dimensions are not easily associated with specific words, I use cosine similarity to find discriminative attributes. A word is a discriminative attribute of w1 and not of w2 if the similarity score between the attribute and w1 is higher than a given threshold, whereas the score between the attribute and w2 is lower than that threshold.
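The embedding-based baseline can be sketched as follows. This is an illustration under stated assumptions: the vectors are toy two-dimensional lists rather than real embeddings, the function names are hypothetical, the default threshold of 0.3 is the value the preliminary experiments report as best, and treating a similarity exactly equal to the threshold as "not an attribute" is my own assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_discriminative_emb(w1, w2, att, embeddings, threshold=0.3):
    """att discriminates w1 from w2 if it is similar enough to w1 but not to w2."""
    sim_w1 = cosine(embeddings[att], embeddings[w1])
    sim_w2 = cosine(embeddings[att], embeddings[w2])
    return sim_w1 > threshold and sim_w2 <= threshold


# Toy, invented vectors used only to exercise the rule.
toy_embeddings = {
    "car": [1.0, 0.0],
    "table": [0.0, 1.0],
    "wheels": [0.9, 0.1],
}
print(is_discriminative_emb("car", "table", "wheels", toy_embeddings))
```

Unlike the salient-context rule, this decision depends on a single global similarity cut-off rather than on an inspectable list of contexts per word.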

Preliminary Experiments
To find the best configuration of the proposed system, I carried out several experiments on the training and validation datasets (20,510 examples). As the system is unsupervised, I am not required to separate training from validation. First, I searched for the best lexical association measure by comparing loglikelihood (Dunning, 1993) and positive pointwise mutual information (ppmi) (Niwa and Nitta, 1994), using models with 400 and 500 salient contexts. As loglikelihood performed slightly better than ppmi, I chose the former measure for the next experiments. Second, I searched for the best number of salient contexts. For this purpose, several evaluations were made with models from 10 to 2000 salient contexts. Figure 1 shows that the peak is quickly reached with 500 contexts (more than 0.67 accuracy), while performance decreases slowly as more contexts are added. I therefore decided to use models with 500 salient contexts per word for the next experiments. Next, I merged the Wikipedia-based model with other models generated from two different corpora: the British National Corpus (BNC), 4 and a sample of 500 million words from the Reddit corpus. 5 Results are shown in Table 1. As expected, accuracy improves as the model grows.
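For reference, the two association measures compared above can be sketched from a 2x2 contingency table of a word and a context. This is a textbook formulation under my own assumptions about the table layout (rows: word / other words; columns: context / other contexts); it is not the code used to build the models, and the function names are hypothetical.

```python
import math


def _margins(table):
    """Return (grand total, row sums, column sums) of a 2x2 count table."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return sum(rows), rows, cols


def log_likelihood(table):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table, with 0 * log(0) taken as 0."""
    n, rows, cols = _margins(table)
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            k = table[i][j]
            if k > 0:
                g2 += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * g2


def ppmi(table):
    """Positive pointwise mutual information (base 2) of the top-left cell."""
    n, rows, cols = _margins(table)
    k = table[0][0]
    if k == 0:
        return 0.0
    return max(0.0, math.log2(k * n / (rows[0] * cols[0])))


# Toy, invented counts: the word and the context always occur together.
associated = [[10, 0], [0, 10]]
# Toy, invented counts: the word and the context are independent.
independent = [[10, 10], [10, 10]]
print(log_likelihood(associated), ppmi(associated))
print(log_likelihood(independent), ppmi(independent))
```

Note that the log-likelihood ratio measures the strength of any departure from independence, so it does not by itself distinguish attraction from repulsion, whereas ppmi discards negative associations by construction.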
Given these preliminary experiments, I submitted the two best configurations to the test evaluation (2,340 examples):

syst.   meas.     saliency    corpora
run1    loglike   500 ctxs    wiki+bnc
run2    loglike   500 ctxs    wiki+bnc+reddit

In my preliminary experiments, I also used the word embeddings described in the previous subsection to capture discriminative attributes. As mentioned above, a word is considered to be an attribute of a target word if their similarity is higher than a specific threshold; otherwise, it is not considered an attribute. Several similarity thresholds were tested to determine whether a word is an attribute or not. Figure 2 shows that the best similarity threshold is around 0.3 (cosine value). Accuracy drops dramatically with higher threshold values. The best accuracy reached by this strategy is about 20 points below the best models based on salient contexts. Therefore, for this particular task, transparent models consistently outperform word embeddings.

Official Test
The test dataset consists of 2,340 examples. My run1 (wiki+bnc) only reached 0.625 accuracy, while run2 (wiki+bnc+reddit) reached 0.634. These results are well below those obtained on the development dataset, which is nevertheless almost 10 times larger.

Discussion
With regard to the rest of the teams at the shared task, my run2 is just in the middle of the ranking (13 out of 26 runs). However, its performance on the development dataset (0.700 accuracy) is close to that of the third best system. I am not able to explain the difference between the development and the test datasets. A deep error analysis would be required to understand that significant difference. This disparity is not due to a difference in the corpus frequency of the words included in the test set: I have checked the frequency of all words (test and development) and there is no important contrast in this regard. The reason could simply be that the test dataset contains more difficult triples.
The best system at the shared task achieves 0.75, leading my run2 by 12 points. Even though the score of my system is lower, it is worth mentioning that my strategy is fully unsupervised and that no tuning or specific configuration has been carried out to adapt the system to the test dataset.

Conclusions and Future Work
I presented a very basic unsupervised strategy to predict whether a word is a discriminative attribute between two other words. The current strategy relies on the correspondence between discriminative attribute and context saliency, and it works on transparent distributional models to extract salient contexts of words.
As I observed that accuracy improves as the corpus grows, in future work I will compile specific text corpora for just the words of the test set. This should make it possible to select more salient contexts (and so more discriminative attributes) per word. In addition, I will run new experiments with relational lexical resources, such as WordNet, to compare them with distributional models in this particular task.