Commonsense mining as knowledge base completion? A study on the impact of novelty

Commonsense knowledge bases such as ConceptNet represent knowledge in the form of relational triples. Inspired by recent work by Li et al., we analyse whether knowledge base completion models can be used to mine commonsense knowledge from raw text. We propose the novelty of predicted triples with respect to the training set as an important factor in interpreting results. We critically analyse the difficulty of mining novel commonsense knowledge, and show that a simple baseline method outperforms the previous state of the art on predicting more novel triples.


Introduction
Many natural language understanding tasks require commonsense knowledge in order to resolve ambiguities involving implicit assumptions. Collecting such knowledge and representing it in a reusable way is thus an important challenge. There exist several commonsense knowledge bases maintained by experts (CyC) or acquired by crowdsourcing (ConceptNet) which represent commonsense knowledge as relational triples, e.g. ("pen", "UsedFor", "writing") (Liu and Singh, 2004). Automatic mining of commonsense knowledge, the focus of this work, aims to improve the coverage of such resources.
One common way of improving the coverage of knowledge bases is through knowledge base completion (KBC), which can be formalized as predicting the existence of edges between (usually) pre-existing nodes in the graph. Recent work by Li et al. (2016) approached commonsense mining as a KBC task. Their method mines candidate triples from Wikipedia and reranks them with a KBC model in order to extend ConceptNet. The goal of this paper is to investigate why recent systems such as the above achieve good performance, and to understand their potential for mining commonsense. We approach this by breaking down the previously reported aggregate results into the cases in which models perform well or poorly. We focus in particular on the issue of the novelty of model predictions with respect to the triples in the training set. For example, a triple predicted by a system could be correct because it is a slightly different wording or morphological inflection of a training triple, e.g. ("fish", "AtLocation", "water") from ("fish", "AtLocation", "in water"), or it could be correct because it exhibits some degree of semantic generalization, e.g. ("fish", "IsCapableOf", "swimming") from ("fish", "AtLocation", "in water"). Arguably, the former could be handled by better standardization of dataset formats or more comprehensive pre-processing, whereas the latter presents an example of genuine commonsense inference and novelty. This analysis is especially important for commonsense mining because of the diversity of the entities, relations, and linguistic expressions thereof in current datasets.
The contribution of this paper is two-fold. First, we test if the KBC task as it is set up in recent work can gauge a model's ability to mine novel commonsense (i.e. find novel commonsense facts based on some resource). We observe the contrary. We present a model that performs poorly on KBC but matches the best model on the task of mining novel commonsense (evaluated by re-ranking extracted candidate triples from Wikipedia). We then examine the cause of this discrepancy, and find that around 60% of triples in the KBC test set used by Li et al. (2016) are minor rewordings of existing triples in the training set. This suggests that controlling for the novelty of triples in both KBC and Wikipedia evaluation is needed.
Second, we present a reassessment of previous methods in which we control the dataset for novelty, extending the results of Li et al. (2016). We introduce a simple automated novelty metric and show that it correlates with human judgment. We then show that the performance of most models on both KBC and Wikipedia triple reranking drops drastically when we evaluate them on examples that are genuinely new according to our metric. Finally, we demonstrate that a simple baseline model that does not model all interactions between elements in a triple performs surprisingly well on both KBC and reranking when we focus on novel triples.

Related work
Knowledge extraction from text corpora is a vast research area (Banko et al., 2007; Mitchell et al., 2015), yet works that specifically target commonsense knowledge are comparatively rare (Gordon, 2014). Our focus is on the specific approach of mining commonsense knowledge by casting it as a KBC task, as in Li et al. (2016) and Forbes and Choi (2017).
Knowledge base completion (KBC) is a method to improve the coverage of a knowledge base by predicting missing edges between nodes (Nickel and Tresp, 2011; Socher et al., 2013). A common modeling approach to KBC is to embed both the nodes and the edge into a common representation space, followed by a simple prediction model (Socher et al., 2013).
Recently, Dettmers et al. (2017) observed that some KBC benchmarks have test set triples that are simply inversions of triples in their training sets. Our work draws attention to a related issue in commonsense KBC. Additionally, we find that simple baseline models achieve strong performance in our setting, in agreement with other studies of KBC (Joulin et al., 2017; Kadlec et al., 2017).
In Angeli and Manning (2013), triple retrieval based on distributional similarity is used to complete ConceptNet. Our procedure for determining the novelty of a triple is similar, but we apply it only in the context of evaluation.

Completion vs Mining
Our goal in this section is to analyse the relation between KBC and commonsense mining tasks following the setup of Li et al. (2016).

Models
All our models take (h, r, t) triples as inputs, where h and t are sequences of words representing concepts and r is a relation from the ConceptNet schema, and output the probability that the triple is true. Following Li et al. (2016), we embed h and t by computing the sums h and t of the respective word vectors. Levy et al. (2015) showed that, in the context of predicting the hypernymy relation, using only the head or only the tail can be a strong baseline. To better understand how much complex reasoning is needed for both the KBC and mining tasks, we similarly consider the two following models, which make strong simplifying assumptions about the dependencies between elements in a triple. The Factorized model uses only two-way interactions to compute the triple score:

score(h, r, t) = α⟨Aᵀh + b1, r⟩ + β⟨Aᵀh + b1, Bᵀt + b2⟩ + γ⟨Bᵀt + b2, r⟩,   (1)

where h and t are d1-dimensional embeddings of the head and tail, r is a d2-dimensional embedding of the relation, A, B are d1 × d2 matrices, b1, b2 are d2-dimensional biases, and α, β, γ are learned scalars. The Prototypical model is similar, but considers only the head-to-relation and tail-to-relation terms (the first and third terms in Eq. 1).
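The two baseline scoring functions can be sketched in a few lines of NumPy. This is a minimal illustration, assuming that A and B project the summed head and tail embeddings into the relation-embedding space; all function and variable names are ours:

```python
import numpy as np

def factorized_score(h, r, t, A, B, b1, b2, alpha, beta, gamma):
    """Factorized model (sketch): a weighted sum of the three two-way
    interaction terms (head-relation, head-tail, tail-relation)."""
    hp = A.T @ h + b1  # projected head
    tp = B.T @ t + b2  # projected tail
    return alpha * (hp @ r) + beta * (hp @ tp) + gamma * (tp @ r)

def prototypical_score(h, r, t, A, B, b1, b2, alpha, gamma):
    """Prototypical model: keeps only the head-relation and
    tail-relation terms."""
    hp = A.T @ h + b1
    tp = B.T @ t + b2
    return alpha * (hp @ r) + gamma * (tp @ r)

def predict_prob(score):
    """All models feed the raw score through a sigmoid."""
    return 1.0 / (1.0 + np.exp(-score))
```

By construction, the Prototypical score coincides with the Factorized score when β = 0.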
We compare the two new models with the best model from Li et al. (2016), a single-hidden-layer DNN. In that model, the triple score is computed as

score(h, r, t) = Wᵀφ(Aᵀh + Bᵀt + r + b1) + b2,   (2)

where φ is a nonlinearity, A, B are d1 × d2 matrices, b1 is a d2-dimensional bias, W is a d2-dimensional vector, and b2 is a scalar bias. Additionally, we compare against the Bilinear model of Li et al. (2016), which computes the triple score as

score(h, r, t) = hᵀ M_r t,   (3)

where M_r is a d1 × d1 matrix, separate for each relation in the dataset. All models' scores are fed into a sigmoid function in order to compute the final prediction.
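For comparison, the two reference scoring functions can be sketched similarly. The exact way the relation embedding enters the DNN's hidden layer is our assumption (the original wiring may differ); the Bilinear score uses one matrix per relation as described above:

```python
import numpy as np

def dnn_score(h, r, t, A, B, b1, W, b2, phi=np.tanh):
    """Single-hidden-layer DNN (sketch). We assume the relation
    embedding r is added to the projected head and tail before
    the nonlinearity."""
    hidden = phi(A.T @ h + B.T @ t + r + b1)
    return W @ hidden + b2

def bilinear_score(h, t, M_r):
    """Bilinear model: h^T M_r t, with one d1 x d1 matrix M_r
    per relation."""
    return h @ M_r @ t
```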

Setup
KBC models are trained using 100,000 triples from ConceptNet5 that were extracted from the Open Mind Common Sense (OMCS) corpus (Speer and Havasi, 2012). For evaluation, we consider two ways to split the dataset: a random split, as well as the confidence-based split proposed by Li et al. (2016), which uses the triples with the highest ConceptNet confidence scores as the test set. Following Li et al. (2016), negative examples are sampled by randomly swapping the head, tail, or relation component of each triple with a different head, tail, or relation from the dataset. The cross-entropy loss is used, and models are evaluated using the F1 score. All models are initialized using skip-gram embeddings pretrained on the OMCS corpus.
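The negative-sampling scheme can be sketched as follows. This is a minimal illustration; the function name and the uniform choice of which slot to corrupt are our assumptions:

```python
import random

def corrupt(triple, heads, relations, tails, rng=random):
    """Produce a negative example by swapping exactly one component
    of a true triple with a different component from the dataset."""
    h, r, t = triple
    slot = rng.choice(["head", "relation", "tail"])
    if slot == "head":
        h = rng.choice([x for x in heads if x != h])
    elif slot == "relation":
        r = rng.choice([x for x in relations if x != r])
    else:
        t = rng.choice([x for x in tails if x != t])
    return (h, r, t)
```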
The commonsense mining task is based on a set of 1.7M candidate triples extracted from Wikipedia by Li et al. (2016). The extracted triples are ranked using a KBC model, and the top of the ranking is manually evaluated. We refer to the experiments in which we rerank external candidate triples as mining experiments.
We found that similar hyperparameters and optimization methods work well across the models. We use 1,000 hidden units, and apply L2 regularization with a weight of 10^-6 to the word embeddings. All models are optimized using Adagrad (Duchi et al., 2010) with a learning rate of 0.01 and batch sizes of 200 (DNN) and 600 (Factorized and Prototypical). In Section 3.3, we compare against the scores of a Bilinear model provided by Li et al. (2016). Experiments are performed using Keras (Chollet et al., 2015) and TensorFlow (Abadi et al., 2015).

Comparison of KBC and Wikipedia evaluations
First, we directly test if the performance of a model on the KBC task is predictive of its performance on the mining task. We follow the mining evaluation protocol of Li et al. (2016): we rank triples by their assigned scores and manually evaluate the top 100 resulting triples on a scale from 0 (nonsensical) to 4 (true statement). We re-evaluate their model against our baselines and find that the knowledge base completion task is a poor indicator of performance on Wikipedia. Even though the Factorized and Prototypical models achieve a similar or worse score compared to the DNN on the KBC task (see the first row of Table 1), their mining performance on the top 100 triples is better than that of both the DNN and the Bilinear model (see Table 2). Triples were scored by two students and the scores were averaged, with a Pearson correlation of 0.81 and a kappa inter-annotator agreement of 0.48.
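The reported agreement statistics can be computed directly from the two annotators' score vectors. The helper names below are ours, and we assume an unweighted Cohen's kappa over the discrete 0-4 scale:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two annotators' scores."""
    return np.corrcoef(a, b)[0, 1]

def cohen_kappa(a, b, labels=range(5)):
    """Unweighted Cohen's kappa: (observed - chance) / (1 - chance),
    with chance agreement estimated from each annotator's marginals."""
    a, b = np.asarray(a), np.asarray(b)
    p_obs = np.mean(a == b)
    p_chance = sum(np.mean(a == l) * np.mean(b == l) for l in labels)
    return (p_obs - p_chance) / (1.0 - p_chance)
```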

Novelty of triples
We hypothesize that the discrepancy reported in Section 3.3 is due to a strong overlap between the training and test sets in the KBC setup of Li et al. (2016). We perform a human evaluation of the novelty of the triples in three test sets with respect to the 100,000 triples in the ConceptNet training set. The first is the confidence-based test set used in Li et al. (2016). We compare it with a random subset of ConceptNet. Finally, we consider a sample of 300 triples from the top 10,000 triples of the Wikipedia dataset as ranked by the Bilinear model. For each triple in the three datasets, we fetch the five closest neighbours using word embedding distance and divide them into five categories based on the closest triple found in the training set: "same relation and minor rewording" (1), "different relation and minor rewording" (2), "same relation and related word" (3), "different relation and related word" (4), "no directly related triple" (5). We ignore a small percentage of triples that do not describe commonsense knowledge, as well as false triples (some in the random subset, and a large percentage in the Wikipedia dataset).
To give a better intuition, we provide example triples for the confidence-based split of Li et al. (2016). In Category 1 (defined as "same relation and minor rewording"), we find ("egg", "IsA", "food"), which has a close analog in the training set: ("egg", "IsA", "type of food"). An example of a test triple in Category 3 (defined as "same relation and related word") is ("floor", "UsedFor", "walk on"), which has a corresponding triple in the training set: ("floor", "UsedFor", "stand on"). In the Appendix, we provide more examples of triples from each category.

As shown in Table 3, we observe that approximately 87% of examples in the confidence-based test set fall into the first or second category, while these categories constitute only 19% of the considered subset of the Wikipedia triples (even after filtering out false triples). We argue that not controlling for the novelty of triples might introduce hard-to-predict biases in the evaluation.
Finally, to understand the effects of using the confidence-based split, we also re-evaluate models on a random split. We observe that scores are consistently lower than on the confidence-based split (compare the first rows of Tables 1 and 4). Interestingly, the overall performance of the DNN model degrades the most (with an absolute difference in F1 score of 9%), compared to Prototypical (4%) and Factorized (7%).

Evaluation using novelty metric
Motivated by the described similarity of the train and test sets in the KBC task, we shift our attention to re-evaluating models on datasets controlled for novelty, extending the results of Li et al. (2016). We consider the same tasks as in Sec. 3: the ConceptNet5 completion task and the commonsense mining task based on Wikipedia triples.

Automatically measuring novelty
To approximate novelty, we use word embeddings (computed over the OMCS corpus) to calculate the distance d(a, b) = ‖head(a) − head(b)‖₂ + ‖tail(a) − tail(b)‖₂, where head and tail are represented by the average of their word embeddings. Such a formulation is related to the concept of paradigmatic similarity (Sahlgren, 2006), and word embedding-based distance can approximate paradigmatic similarity (Sun et al., 2015). Two words are paradigmatically similar if one can be replaced by the other while maintaining the syntactic correctness of the sentence (e.g. "The wolf/tiger is a fierce animal"). We observe that many trivial test triples are characterized by the existence of a triple in the training set that differs only by such substitutions.
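The metric can be sketched as follows, assuming a simple dict mapping words to their embedding vectors (the helper names are ours):

```python
import numpy as np

def phrase_vec(phrase, word_vectors):
    """Average of the word embeddings of a concept phrase."""
    return np.mean([word_vectors[w] for w in phrase.split()], axis=0)

def novelty_distance(a, b, word_vectors):
    """d(a, b) = ||head(a) - head(b)||_2 + ||tail(a) - tail(b)||_2."""
    (ha, _, ta), (hb, _, tb) = a, b
    return (np.linalg.norm(phrase_vec(ha, word_vectors) - phrase_vec(hb, word_vectors))
            + np.linalg.norm(phrase_vec(ta, word_vectors) - phrase_vec(tb, word_vectors)))

def novelty(candidate, train_triples, word_vectors):
    """Novelty of a candidate = distance to its nearest training triple."""
    return min(novelty_distance(candidate, t, word_vectors) for t in train_triples)
```

A triple that only rewords a training triple (e.g. "water" vs "in water") gets a small distance, while a semantically new triple sits far from every training neighbour.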
We observe that the proposed distance metric correlates with the human-assigned novelty scores from Section 3.4. On the considered datasets, the Pearson correlation between the automatic novelty score and the human-assigned novelty score ranges from 0.22 to 0.47, with p-values between 0.004 and 0.03. We acknowledge that the automated metric is simplistic; for instance, it underperforms on triples containing rare words or long phrases. Nevertheless, the metric enables the detection of a substantial portion of trivial triples (e.g. morphological variations), and we leave developing better measures of novelty for future work.
Using the introduced metric, we can partially explain the inconsistency in the performance of the Prototypical and Bilinear models between KBC and mining Wikipedia. We note that the top of the ranking on Wikipedia consists of mostly very far (novel) triples (Figure 1), while the KBC confidence-based test set is mostly composed of trivial triples (as argued in Section 3.4).

Novelty-binned evaluation of KBC
We now re-evaluate the KBC models using our proposed novelty metric. First, we examine performance on different subsets of the confidence-based split of ConceptNet5. Specifically, we split the confidence-based test set into 3 buckets, according to the 33% (distance 1.93) and 66% (distance 2.80) quantiles of the distance to the training set. Second, we run a similar experiment on the random split (bucket thresholds at 2.1 and 2.95). Results are reported in Tables 1 and 4. As expected, the performance of the models degrades quickly across buckets: the F1 score on the farthest bucket is 10 to 20% lower than on the closest bucket. We observe that the Factorized model achieves the strongest performance on the farthest bucket.
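The binning itself is straightforward; a sketch (bucket names and the quantile interface are illustrative):

```python
import numpy as np

def bin_by_novelty(triples, distances, q=(1 / 3, 2 / 3)):
    """Split triples into close / middle / far buckets according to
    quantiles of their distance to the training set."""
    distances = np.asarray(distances)
    lo, hi = np.quantile(distances, q)
    buckets = {"close": [], "middle": [], "far": []}
    for triple, d in zip(triples, distances):
        if d <= lo:
            buckets["close"].append(triple)
        elif d <= hi:
            buckets["middle"].append(triple)
        else:
            buckets["far"].append(triple)
    return buckets
```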

Novelty-binned evaluation on Wikipedia
Similarly to Section 4.2, we analyse the mining task with the candidate triples split by our novelty metric. We split the Wikipedia dataset into 3 buckets based on the 33% (distance 3.21) and 66% (distance 4.22) quantiles of the distance to the training set, and we manually score the top 100 triples in each bucket on the same scale from 0 (nonsensical) to 4 (true statement).
As in Section 4.2, we note a degradation of performance as the novelty of the triples increases.

Conclusions
Mining genuinely novel commonsense is a challenging task, and training successful models will require large training sets (e.g. ConceptNet) and principled evaluation. We critically assess the potential of KBC models for mining commonsense knowledge, and propose several first steps towards a more principled evaluation methodology. Future work could focus on developing better novelty metrics, and developing new regularization techniques to better generalize to novel triples.

A Example triples
In this Appendix we report randomly picked examples from the human-assigned novelty categories considered in the paper for each of the 3 datasets. Due to the large size of the training set, instead of showing all of the triples from the training set to the human annotators, we show only the 5 closest triples using the embedding-based distance. A triple is classified as belonging to the given category if at least one of the retrieved triples is sufficiently related. For example, if for ("egg", "IsA", "food") we find the triple ("egg", "IsA", "type of food") in the top 5 closest examples, we categorize it as belonging to the first category ("same rel, rephrase").

A.1 Confidence-based split
In this Section we report examples for each novelty category from the confidence-based split dataset. For each example we include the 5 triples that were shown to the human annotator, ordered by closeness according to our word embedding metric.