NTU NLP Lab System at SemEval-2018 Task 10: Verifying Semantic Differences by Integrating Distributional Information and Expert Knowledge

This paper presents the NTU NLP Lab system for the SemEval-2018 Capturing Discriminative Attributes task. Word embeddings, pointwise mutual information (PMI), ConceptNet edges and shortest path lengths are used as input features to build binary classifiers that tell whether an attribute is discriminative for a pair of concepts. Our neural network model reaches about 73% F1 score on the test set and ranks 3rd in the task. Though the attributes in this task are all visual, our models are not provided with any image data. The results indicate that visual information can be derived from textual data.


Introduction
Modern semantic models are good at capturing semantic similarity and relatedness. The widely-used distributional word representations, or word embeddings, have achieved promising performance on various semantic tasks. The word pair similarities calculated with these models are to some extent consistent with human judgments, and many downstream applications such as sentiment analysis and machine translation have benefited from word embeddings' ability to aggregate the information of lexical items with similar meaning but different surface forms.
However, the ability to distinguish one concept from another similar concept is also core to linguistic competence. Our knowledge about what a "subway" is, for example, may contain "it is a kind of train that runs underground". Also, discriminating things is an important mechanism for teaching and learning. For example, if we would like to explain how a "plate" is different from a "bowl", we may use expressions like "a plate is flatter" or "a bowl is deeper". All these examples show that one form of semantic difference is a discriminative attribute which applies to one of the two concepts being compared but does not apply to the other.
In the SemEval-2018 Capturing Discriminative Attributes task (Krebs et al., 2018), participants need to put forward semantic models that are aware of semantic differences. A data instance consists of a triple and a label. In this paper, we denote a triple with <w_1, w_2, a>, in which w_1 and w_2 are the two words (concepts) to be compared, and a is an attribute. The label is either positive (1) or negative (0). In a positive example, a is an attribute of w_1 but not an attribute of w_2. For negative examples, there are two cases: 1) both w_1 and w_2 have attribute a; 2) neither w_1 nor w_2 has attribute a. In this task, the attributes are limited to visual ones such as color and shape. The evaluation metric is the macro-averaged F1 score of the positive and the negative classes.
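To make the evaluation metric concrete, the following minimal sketch computes the macro-averaged F1 over the positive and negative classes. The function names and the toy labels are introduced here only for illustration; this is not the official scorer of the task.

```python
def f1_for_class(gold, pred, cls):
    """F1 score of a single class, treating `cls` as the target label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of the F1 scores of class 1 and class 0."""
    return (f1_for_class(gold, pred, 1) + f1_for_class(gold, pred, 0)) / 2


# Toy labels (hypothetical, not task data).
gold = [1, 1, 0, 0, 0]
pred = [1, 0, 0, 0, 1]
score = macro_f1(gold, pred)
```

Because the two classes are averaged without weighting, a system that does well on only one class is penalized, which is why the per-class F1 scores are discussed separately in the following sections.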
Visual attribute learning has been investigated by past researchers. Silberer et al. (2013) build a dataset of concept-level attribute annotations based on images in ImageNet (Deng et al., 2009). For each attribute, they train a classifier to predict its presence or absence in the input image. Lazaridou et al. (2016) propose a model that does not learn visual attributes explicitly, but learns discriminativeness. Their model predicts whether an attribute can be used to discriminate a referent from a context. Both the referent and the context are represented by visual instances sampled from ImageNet. This setting is similar to that of this SemEval task. However, one critical difference is that in this task, the set of attributes is open. The dataset is partitioned so that all the attributes in the test set are unseen in the training set, which makes this task more challenging.
The use of word embeddings for detecting semantic properties is studied by Rubinstein et al. (2015). They focus on a fixed set of properties and train a binary classifier for each property. Their results indicate that word embeddings capture taxonomic properties (e.g. "an animal") better than attributive properties (e.g. "is fast"), possibly because attributive signal is weak in text.
In this task, most visual attributes are attributive properties. The signal of "visual" attributes can be even weaker in text since they are not mainly communicated through language in human cognition. The word "red" in "I bought a red apple" sounds more like a linguistic redundancy than that in "I bought a red jacket" does, since "red" is a typical attribute of apples. However, these visual attributes may impose constraints on valid expressions. For instance, we can say "the bananas turned yellow", but it would be extremely difficult to find some context where "the bananas turned red" makes sense. Therefore, visual attributes can be signaled in some implicit and indirect ways. By utilizing several computational approaches, we reveal to what extent visual attributes can be acquired from text.
This paper aims at capturing semantic difference by incorporating information from both corpus statistics and expert-constructed knowledge bases. We build a rule-based system and a learning-based system for the binary classification problem, i.e., to tell whether an attribute is discriminative for two concepts. The learning-based system achieved an F1 score of 0.7294, which is the third best in the official evaluation period of SemEval-2018 Task 10. Our approach is purely based on textual data, without access to image instances, which indicates that it is possible to derive substantial visual information from text.

Distributional Information
We utilize two kinds of computational approaches to derive information from co-occurrence statistics in large corpora. The first one is word embedding, which has been shown to encode semantic information in low-dimensional vectors. The second one is pointwise mutual information (PMI), which is a commonly used measure of the strength of association between two words. We analyze the performance of rule-based and learning-based models with different sets of features to assess their effectiveness.

Concatenation of Word Embeddings
A very straightforward approach is to concatenate the embeddings of w_1, w_2 and a into a feature vector to train a binary classifier. We use the pre-trained 300-dimensional Word2vec embeddings (Mikolov et al., 2013) trained on Google News as input features. We construct a multilayer perceptron (MLP) model with two hidden layers of size 1,024 to conduct preliminary experiments. The activation function is ReLU and the dropout rate is 0.5. The model is implemented with Keras (Chollet, 2015). We train for 20 epochs and report the best validation scores. However, we find that there is a serious overfitting issue. As shown in Table 1, the gap between training and validation scores is large. We also experimented with simpler models such as Logistic Regression and Random Forest, and got similar results. A possible cause of overfitting is that the model does not learn to extract and compare attributes, but learns the "pattern" of some combination of words in the triples.
To verify the above speculation, we train similar MLP models which only take "partial" triples as input. Theoretically, the label cannot be determined correctly with an incomplete triple. However, according to the results shown in Table 1, the models that consider only a part of every triple can still "learn" some information from the training set (majority-class baseline accuracy on the training set: 0.6383). Some models with partial information even achieve better validation scores than the model with complete information. This indicates that the models overfit to the vocabulary of the training set. At test time, all the attributes are unseen, so the model cannot make effective predictions. In fact, these results are similar to the lexical memorization phenomenon reported by Levy et al. (2015) on the hypernym detection task.
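The complete and "partial" input representations discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: `emb` is a hypothetical word-to-vector lookup, the toy vectors have 3 dimensions instead of 300, and the helper name `triple_features` is ours rather than part of the system.

```python
def triple_features(emb, triple, keep=(0, 1, 2)):
    """Concatenate the vectors of the selected positions of <w1, w2, a>.

    keep=(0, 1, 2) uses the complete triple; a subset such as (0, 1)
    gives a "partial" triple that hides the attribute from the model.
    """
    features = []
    for position in keep:
        features.extend(emb[triple[position]])
    return features


# Hypothetical 3-dimensional toy embeddings (the paper uses 300 dimensions).
emb = {
    "apple": [0.9, 0.1, 0.0],
    "banana": [0.1, 0.9, 0.0],
    "red": [0.8, 0.0, 0.2],
}
triple = ("apple", "banana", "red")

full = triple_features(emb, triple)                  # w1 + w2 + a
only_attribute = triple_features(emb, triple, (2,))  # a only
```

With 300-dimensional embeddings the complete representation has 900 dimensions. A classifier that still fits the training labels from `only_attribute` alone can only be memorizing which attribute words tend to be positive, which is the behavior described above.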

Embeddings Similarity Difference
Because "raw" word embedding features do not work, we turn to more abstract features. Let sim_1 and sim_2 be the cosine similarities of the vector of a to the vectors of w_1 and w_2, respectively. We compare the values of sim_1 and sim_2. The rationale is that if a word w has an attribute a, then it tends to, though not necessarily, be more similar to a than other words without a.
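As an illustration, the similarity difference can be computed as in the sketch below. The two-dimensional toy vectors and the function names are assumptions made for this example only, and the vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def similarity_difference(emb, w1, w2, a):
    """sim_1 - sim_2, where sim_i = cos(emb[a], emb[w_i])."""
    sim1 = cosine(emb[a], emb[w1])
    sim2 = cosine(emb[a], emb[w2])
    return sim1 - sim2


# Hypothetical 2-dimensional toy vectors.
emb = {"apple": [1.0, 0.0], "banana": [0.0, 1.0], "red": [0.8, 0.6]}
diff = similarity_difference(emb, "apple", "banana", "red")
```

A positive difference means that the attribute vector is closer to w_1 than to w_2.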
The following six embedding models are experimented with. The embedding size is fixed to 300.

1. W2V(GNews): The standard Word2vec model as described in Section 2.1.

2. fastText: fastText (Bojanowski et al., 2017) is a modification of Word2vec that takes subword information into account. We adopt the pre-trained vectors trained on 6B tokens (https://fasttext.cc/docs/en/english-vectors.html).

3. Numberbatch: Numberbatch embeddings are built upon several corpus-based word embeddings and improved by retrofitting on ConceptNet, a large semantic network containing an abundance of general knowledge (Speer et al., 2017). We use the pre-trained embeddings of English concepts (https://github.com/commonsense/conceptnet-numberbatch).

4. GloVe(Common Crawl): The GloVe model (Pennington et al., 2014) obtains word representation according to global co-occurrence statistics. We use the pre-trained vectors trained on 840B tokens of Common Crawl (https://nlp.stanford.edu/projects/glove/).

5. Sense(enwiki)-c: Sense vectors may encode more fine-grained semantic information than word vectors do, so we also experimented with sense vectors. We perform word sense disambiguation (WSD) on the English Wikipedia corpus to get a sense-annotated corpus, using the Adapted Lesk algorithm implemented in pywsd (https://github.com/alvations/pywsd). The sense inventory is based on synsets in WordNet. We train a Word2vec Skip-gram (SG) model with this corpus to obtain sense vectors. To apply sense vectors to words and attributes in this SemEval task, we propose the following closest sense-selection method (denoted by -c) to choose a sense for each of w_1, w_2 and a. S(w) denotes the set of synsets that a word w belongs to and emb(s) denotes the vector of synset (sense) s.

$$(s_1^*, s_a^*) = \operatorname*{arg\,max}_{s_1 \in S(w_1),\ s_a \in S(a)} \cos(\mathrm{emb}(s_1), \mathrm{emb}(s_a))$$

$$s_2^* = \operatorname*{arg\,max}_{s_2 \in S(w_2)} \cos(\mathrm{emb}(s_1^*), \mathrm{emb}(s_2))$$

Since a might be an attribute of w_1, we choose the closest pair of senses for them. Then, we choose the sense of w_2 that is closest to s_1^*, the selected sense for w_1. The reason is that a semantic difference is more likely to be meaningful for two similar concepts. Finally, we use the vectors of the selected senses to compute similarities.

6. Sense(enwiki)-f: We use the same sense embeddings as described previously but directly select the first sense (predominant sense) in WordNet for w_1, w_2 and a respectively, without performing WSD. This method is denoted by -f.

We first use these similarities in a simple rule-based model: if sim_1 > sim_2 then output 1; otherwise output 0. The results are summarized in Table 2. In general, this similarity comparison rule performs better on the positive class than on the negative class. GloVe results in the highest negative F1, while Numberbatch results in the best macro-averaged F1. We show the confusion matrix for this rule with Numberbatch in Table 3. As can be seen, similarity differences are helpful for discriminating the positive examples, but they are not good indicators of negative examples.
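The closest sense-selection method (-c) described in this section can be sketched as follows. This is only an illustration: the sense identifiers such as "w1#1" and the two-dimensional vectors are hypothetical placeholders, whereas the actual system uses WordNet synsets and 300-dimensional sense vectors.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def select_senses(senses_w1, senses_w2, senses_a, emb):
    """Return (s1, s2, sa) chosen by the closest sense-selection method.

    senses_*: candidate sense ids of w1, w2 and a; emb: sense id -> vector.
    """
    # Step 1: the closest (w1 sense, attribute sense) pair.
    s1, sa = max(
        product(senses_w1, senses_a),
        key=lambda pair: cosine(emb[pair[0]], emb[pair[1]]),
    )
    # Step 2: the sense of w2 that is closest to the selected sense of w1.
    s2 = max(senses_w2, key=lambda s: cosine(emb[s1], emb[s]))
    return s1, s2, sa


# Hypothetical sense inventory and toy sense vectors.
emb = {
    "w1#1": [1.0, 0.0],
    "w1#2": [0.0, 1.0],
    "w2#1": [0.9, 0.1],
    "w2#2": [0.2, 0.8],
    "a#1": [0.1, 0.9],
}
selected = select_senses(["w1#1", "w1#2"], ["w2#1", "w2#2"], ["a#1"], emb)
```

In this toy setting the second sense of w_1 is selected because it is closest to the attribute, and the sense of w_2 is then chosen relative to that sense rather than relative to the attribute.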
We use sim_1 − sim_2 of different kinds of embeddings as features and train MLP models as described in the previous section. The results of different combinations of embeddings are shown in Table 4; the learning-based models show an F1 improvement over the rule-based models. On the other hand, though including the last three embedding models does not yield better results in this setting, we find them useful when combined with other kinds of features. Therefore, they are included in one of our submitted systems.

PMI Difference
Similar to word embedding, PMI reflects the co-occurrence tendencies of words. It has been shown that the Skip-gram with Negative Sampling (SGNS) algorithm in Word2vec corresponds to implicit factorization of the PMI matrix (Levy and Goldberg, 2014). Nevertheless, PMI should be interpreted differently from word vector similarity. Since PMI is calculated in an exact matching manner, there is no propagation of similarity as in the case of word vectors. That is, even if both PMI("red", "yellow") and PMI("apple", "banana") are high, this does not imply that PMI("red", "banana") will be high. Thus, PMI might be less prone to confusion of similar concepts. We calculate PMI on the English Wikipedia corpus.
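For reference, PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ]. The sketch below estimates it from windowed co-occurrence counts on a toy token sequence. The counting scheme (unordered pairs within a fixed number of following tokens), the probability estimates, and the convention of returning negative infinity for pairs that never co-occur are assumptions of this example; the paper does not specify these details.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window):
    """Count single words and unordered word pairs at most `window` tokens apart."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pair_counts[frozenset((left, right))] += 1
    return word_counts, pair_counts


def pmi(x, y, word_counts, pair_counts):
    """log p(x, y) / (p(x) p(y)); -inf if x and y never co-occur."""
    joint = pair_counts[frozenset((x, y))]
    if joint == 0:
        return float("-inf")
    p_xy = joint / sum(pair_counts.values())
    p_x = word_counts[x] / sum(word_counts.values())
    p_y = word_counts[y] / sum(word_counts.values())
    return math.log(p_xy / (p_x * p_y))


# Toy corpus (hypothetical); the paper uses the English Wikipedia corpus.
tokens = "the red apple and the yellow banana".split()
word_counts, pair_counts = cooccurrence_counts(tokens, window=2)

pmi_red_apple = pmi("red", "apple", word_counts, pair_counts)
pmi_red_banana = pmi("red", "banana", word_counts, pair_counts)
```

Here "red" and "banana" never fall into the same window, so their PMI stays at the floor value regardless of how strongly "apple" and "banana" or "red" and "yellow" are associated elsewhere. This is the absence of similarity propagation noted above.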
We first experimented with a PMI_1 > PMI_2 rule that is similar to the one for vector similarities, where PMI_i denotes the PMI between w_i and a. In Table 5, we show the results of PMI calculated with different sizes of the context window within which a pair of words is considered to co-occur. The 20-word context window yields the best performance, so we show its corresponding confusion matrix in Table 6. As can be seen, PMI performs slightly better in discriminating the negative class, compared to word similarities (Table 3). Based on the above observation, we propose a heuristic rule that combines vector similarity and PMI: if sim_1 > sim_2 and PMI_1 > PMI_2 then output 1; otherwise output 0. We use the Numberbatch embeddings and PMI with the 20-word context window. This majority-voting model is more reliable and achieves macro-F1 above 0.69. It is one of our submitted systems, so the result is shown in Table 14. According to the confusion matrix in Table 7, both the positive and the negative classes can be discriminated well with the combination of distributional vectors and PMI.
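The combined heuristic amounts to a conjunction of the two individual rules, as the following minimal sketch shows. The function names and the numeric inputs are illustrative only.

```python
def similarity_rule(sim1, sim2):
    """Output 1 if the attribute is more similar to w1 than to w2."""
    return 1 if sim1 > sim2 else 0


def pmi_rule(pmi1, pmi2):
    """Output 1 if the attribute is more strongly associated with w1."""
    return 1 if pmi1 > pmi2 else 0


def combined_rule(sim1, sim2, pmi1, pmi2):
    """Output 1 only when both individual rules output 1."""
    return 1 if similarity_rule(sim1, sim2) and pmi_rule(pmi1, pmi2) else 0


# Hypothetical feature values for two triples.
agree = combined_rule(sim1=0.45, sim2=0.20, pmi1=2.1, pmi2=0.3)
disagree = combined_rule(sim1=0.45, sim2=0.20, pmi1=0.1, pmi2=0.3)
```

Requiring agreement makes the positive prediction stricter, which is consistent with the better balance between the two classes reported for this rule.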
We also build learning-based models with combinations of PMI of different context window sizes. Since the range of PMI can be large, we only consider the sign of the difference. The sign of zero is defined to be negative. In addition, we also combine vector similarities to train the MLP model. The results are all shown in Table 8. However, none of the results show improvement over the corresponding rule-based models.
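A sketch of the sign features for several context window sizes is given below. Encoding the sign as +1 or -1 and the window sizes 5 and 10 are assumptions of this example (only the 20-word window is named in the text); a zero difference is mapped to the negative value, as stated above.

```python
def pmi_sign_features(pmi_w1, pmi_w2):
    """One sign feature per context window size.

    pmi_w1, pmi_w2: dicts mapping window size -> PMI(w_i, a).
    Returns +1 where PMI_1 > PMI_2 and -1 otherwise (ties count as negative).
    """
    return [1 if pmi_w1[size] > pmi_w2[size] else -1 for size in sorted(pmi_w1)]


# Hypothetical PMI values; window sizes 5 and 10 are placeholders.
pmi_w1 = {5: 1.2, 10: 0.4, 20: 0.9}
pmi_w2 = {5: 0.3, 10: 0.4, 20: 1.5}
signs = pmi_sign_features(pmi_w1, pmi_w2)
```

Comparing the two values directly, rather than subtracting them first, also keeps the feature well defined when a PMI value is negative infinity for a pair that never co-occurs.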

Edge Connection
ConceptNet can be regarded as a directed graph of concepts (vertices) connected by different relations (edges). There are 47 relation types in ConceptNet. Some of them, such as HasProperty and CapableOf, are directly related to attributes. Other relations such as RelatedTo can also reflect some kinds of attributes.
We experiment with a simple rule-based model that outputs 1 if there exists a relation from w_1 to a and there is no relation from w_2 to a. Additionally, we augment the ConceptNet graph with reverse edges and apply the rule again. The results of both versions are shown in Table 9. The version with reverse edges performs competitively with the vector similarity rule (macro F1 about 0.6), but the behavior is quite different. As can be seen, the ConceptNet features help achieve better negative F1. The relatively low performance on the positive class might be due to the sparseness of the knowledge graph: some w_1 might have attribute a without being directly connected to a on the graph.
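The edge rule and its reverse-edge variant can be sketched as below. The graph is represented as a set of (head, relation, tail) triples, and the two edges in the example are hypothetical toy data, not actual ConceptNet content.

```python
def has_edge(edges, source, target):
    """True if any relation links source -> target."""
    return any(h == source and t == target for h, _, t in edges)


def add_reverse_edges(edges):
    """Return the graph augmented with a reversed copy of every edge."""
    return set(edges) | {(t, rel, h) for h, rel, t in edges}


def edge_rule(edges, w1, w2, a):
    """Output 1 if w1 -> a exists and w2 -> a does not."""
    return 1 if has_edge(edges, w1, a) and not has_edge(edges, w2, a) else 0


# Hypothetical toy graph.
edges = {
    ("apple", "HasProperty", "red"),
    ("yellow", "RelatedTo", "banana"),
}

direct = edge_rule(edges, "banana", "apple", "yellow")
with_reverse = edge_rule(add_reverse_edges(edges), "banana", "apple", "yellow")
```

In the toy graph the only link between "banana" and "yellow" points from the attribute to the concept, so the rule fires only after reverse edges are added.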
To encode edge connection information for training learning-based models, we compute the following four binary features:
• Is there an edge from w_1 to a?
• Is there an edge from a to w_1?
• Is there an edge from w_2 to a?
• Is there an edge from a to w_2?
We also experimented with two versions. In the first version, each relation type is considered separately, so the total dimensionality is 4 * 47 = 188. In the second version, we set a binary feature to 1 if there is at least one edge that satisfies its condition, so the feature dimensionality is only 4. The results are shown in Table 10. Although different types of relations have different semantics and should be treated differently, the version considering relation types does not perform better. A possible reason is that it suffers from data sparseness, since some dimensions are zero for almost all the instances.
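Both versions of the edge features can be sketched with a single function, shown below. The ordering of the dimensions, the toy edge, and the placeholder relation names used to reach 47 types are assumptions made for illustration.

```python
def edge_features(edges, relation_types, w1, w2, a, by_relation=True):
    """Binary edge features for the four directed (source, target) questions.

    by_relation=True : one feature per (direction, relation type).
    by_relation=False: one feature per direction (any relation type).
    """
    directions = [(w1, a), (a, w1), (w2, a), (a, w2)]
    features = []
    for source, target in directions:
        present = [int((source, rel, target) in edges) for rel in relation_types]
        if by_relation:
            features.extend(present)
        else:
            features.append(int(any(present)))
    return features


# Three real relation names plus 44 placeholders to reach 47 types.
relation_types = ["HasProperty", "CapableOf", "RelatedTo"] + [
    f"Placeholder{i}" for i in range(44)
]
# Hypothetical toy edge.
edges = {("apple", "HasProperty", "red")}

typed = edge_features(edges, relation_types, "apple", "banana", "red")
untyped = edge_features(
    edges, relation_types, "apple", "banana", "red", by_relation=False
)
```

With a single edge in the graph, only one of the 188 typed dimensions is non-zero, which illustrates the sparseness mentioned above.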

Shortest Path Length
To include connections between words and attributes that take more than one step, we calculate the shortest path lengths. Let dis(w_i, a) be the shortest path length between w_i and a on the ConceptNet graph. We first experiment with a simple rule-based model that outputs 1 when dis(w_1, a) < dis(w_2, a), that is, when w_1 is closer to a. The results are reported in Table 11. Including reverse edges slightly improves the accuracy but does not improve the macro F1 score. A confusion matrix is presented in Table 12, showing that this rule is a strong indicator for the negative class. Compared to the ones with edge connection features, however, these rule-based classifiers achieve slightly lower negative F1 but higher positive F1.
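The shortest path rule can be sketched with a breadth-first search over an adjacency map, as below. The toy graph is hypothetical, and treating a missing path as an infinitely long one is an assumption of this example, since the handling of unreachable pairs in the rule is not spelled out in the text.

```python
from collections import deque


def shortest_path_length(adjacency, source, target):
    """Number of edges on the shortest directed path, or None if unreachable."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def path_rule(adjacency, w1, w2, a):
    """Output 1 when w1 is strictly closer to a than w2 is."""
    d1 = shortest_path_length(adjacency, w1, a)
    d2 = shortest_path_length(adjacency, w2, a)
    if d1 is None:
        return 0  # w1 cannot reach a at all
    return 1 if d2 is None or d1 < d2 else 0


# Hypothetical toy graph (node -> list of successors).
adjacency = {
    "subway": ["train", "underground"],
    "train": ["vehicle"],
    "bus": ["vehicle"],
    "vehicle": ["wheel"],
}
closer = path_rule(adjacency, "subway", "bus", "underground")
tie = path_rule(adjacency, "train", "bus", "wheel")
```

To reproduce the reverse-edge variant, each edge would simply be added to the adjacency map in both directions before running the search.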
Since the maximum shortest path distance between a word and an attribute in the training set is 5 (when reverse edges are included), we encode dis(w_i, a) into 6-dimensional discrete binary features as follows.
• No path from w_i to a
• dis(w_i, a) = 1
• dis(w_i, a) = 2
• dis(w_i, a) = 3
• dis(w_i, a) = 4
• dis(w_i, a) ≥ 5
We build similar MLP models that take these features as input. The features for w_1 and w_2 are computed separately and then concatenated. There are clear improvements of learning-based models (Table 13) over rule-based ones (Table 11). The improvements are mostly contributed by the higher positive F1 scores. On the other hand, in general it is helpful to include a separate set of features calculated on the graph with reverse edges.
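The encoding listed above can be written as a small one-hot function. In this sketch a missing path is represented by None, and a distance below 1 is rejected because the list does not define a slot for it; both conventions are assumptions of the example.

```python
def encode_distance(distance):
    """6-dim one-hot: [no path, dis=1, dis=2, dis=3, dis=4, dis>=5]."""
    features = [0] * 6
    if distance is None:
        features[0] = 1
    elif distance < 1:
        raise ValueError("distance must be at least 1")
    else:
        features[min(distance, 5)] = 1
    return features


def path_features(dis_w1, dis_w2):
    """Concatenate the encodings of dis(w_1, a) and dis(w_2, a)."""
    return encode_distance(dis_w1) + encode_distance(dis_w2)


# Example: w_1 is two steps from a, and w_2 has no path to a.
features = path_features(2, None)
```

Computing a second copy of these 12 features on the graph with reverse edges and appending it gives the separate feature set mentioned above.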

Submitted Systems
We submitted the predictions of a rule-based system and a learning-based system. The evaluation results are summarized in Table 14. Run 1 is a rule-based combination of the similarity difference of the Numberbatch embeddings and the sign of the PMI difference (window size 20). Run 2 is an MLP model with three hidden layers of size 2,048 that takes as input the similarity differences of the six kinds of embeddings, the signs of PMI differences for three different context window sizes, and the ConceptNet edge and shortest path length features. Our Run 2 system ranked third among all 26 participants with a macro-F1 of 0.7294, showing that the features we proposed are highly effective. On the other hand, our Run 1 system got an only slightly lower macro-F1 of 0.7044 and would rank between the 4th (0.72) and the 5th (0.69) systems if it were considered. This again supports the complementary effect of word vector similarity and PMI.

Error Analysis
Since even the top system in this task did not achieve macro-F1 above 75%, we think that there might be some cases that are very difficult to handle. Based on the officially released test ground truth, we analyze the errors of our best system. We find that the difficulties mainly arise from the following cases.

Ambiguous concept: Word ambiguity is not considered in this task. However, this may be problematic in some cases such as the positive example <mouse, squirrel, plastic>. According to the answer, we know that the word "mouse" is interpreted as a "computer device" instead of an "animal". Therefore, sometimes the answer depends on which sense is selected.

Vague or ambiguous attribute: Since the attribute is expressed only with a single word in this task, sometimes it is hard to tell what the attribute means, even from a human's perspective. For example, the triple <philanthropist, lawyer, active> is labeled 0 in the gold answer. Nevertheless, a positive interpretation also makes sense: philanthropists usually engage in philanthropy actively, while lawyers usually handle matters under the authorization of someone.

Relative attribute: In some positive examples, w_1 does not necessarily have a, but is only more likely to have it. In the positive example <father, brother, old>, "father" might be "old" when compared to "brother", but not necessarily so when considered in isolation. It is even more difficult to determine when to evaluate the absence of an attribute relatively, given that we also encounter cases such as <banker, lawyer, rich>, whose gold label is 0.

Conclusions
We propose several approaches to tackle the SemEval-2018 Capturing Discriminative Attributes task in this paper. We utilize information derived from both corpus distribution statistics and expert knowledge in ConceptNet to build our systems. According to the experimental results, word embedding and PMI, though both based on co-occurrence, can complement each other in a simple heuristic rule-based system. Moreover, the ConceptNet features, which are highly sensitive to the negative class, can complement the corpus-based features, which are more sensitive to the positive class. Our best learning-based system achieved an F1 score of 0.7294 and took 3rd place in the official run. We did not adopt image features, which suggests that it is possible to learn substantially about visual attributes solely from text.
Given the limited advancement of the learning-based model over the rule-based one, it is worth studying how to design mechanisms in machine learning models that can guide them to "compare" the features of the two concepts and determine the discriminativeness.