Commonsense Knowledge Mining from Pretrained Models

Inferring commonsense knowledge is a key challenge in natural language processing, but due to the sparsity of training data, previous work has shown that supervised methods for commonsense knowledge mining underperform when evaluated on novel data. In this work, we develop a method for generating commonsense knowledge using a large, pre-trained bidirectional language model. By transforming relational triples into masked sentences, we can use this model to rank a triple's validity by the estimated pointwise mutual information between the two entities. Since we do not update the weights of the bidirectional model, our approach is not biased by the coverage of any one commonsense knowledge base. Though this method performs worse on a test set than models explicitly trained on a corresponding training set, it outperforms these methods when mining commonsense knowledge from new sources, suggesting that unsupervised techniques may generalize better than current supervised approaches.


Introduction
Commonsense knowledge consists of facts about the world which are assumed to be widely known. For this reason, commonsense knowledge is rarely stated explicitly in natural language, making it challenging to infer this information without an enormous amount of data (Gordon and Van Durme, 2013). Some have even argued that machine learning models cannot learn common sense implicitly (Davis and Marcus, 2015).
One method for mitigating this issue is directly augmenting models with commonsense knowledge bases (Young et al., 2018), which typically contain high-quality information but with low coverage. These knowledge bases are represented as a graph, with nodes consisting of conceptual entities (e.g. dog, running away, excited) and the pre-defined edges representing the nature of the relations between concepts (IsA, UsedFor, CapableOf, etc.). Commonsense knowledge base completion (CKBC) is a machine learning task motivated by the need to improve the coverage of these resources. In this formulation of the problem, one is supplied with a list of candidate entity-relation-entity triples, and the task is to distinguish which of the triples express valid commonsense knowledge and which are fictitious (Li et al., 2016).
Several approaches have been proposed for training models for commonsense knowledge base completion (Li et al., 2016; Jastrzebski et al., 2018). Each of these approaches uses some form of supervised training on a particular knowledge base, evaluating the model's performance on a held-out test set from the same database. These works use relations from ConceptNet, a crowd-sourced database of structured commonsense knowledge, to train and validate their models (Liu and Singh, 2004). However, it has been shown that these methods generalize poorly to novel data (Li et al., 2016; Jastrzebski et al., 2018). Jastrzebski et al. (2018) demonstrated that much of the data in the ConceptNet test set were simply rephrased relations from the training set, and that this train-test leakage led to artificially inflated test performance metrics. This problem of train-test leakage is typical in knowledge base completion tasks (Toutanova et al., 2015; Dettmers et al., 2018).
Instead of training a predictive model on any specific database, we attempt to utilize the world knowledge of large language models to identify commonsense facts directly. By constructing a candidate piece of knowledge as a sentence, we can use a language model to approximate the likelihood of this text as a proxy for its truthfulness.
In particular, we use a masked language model to estimate point-wise mutual information between entities in a possible relation, an approach that differs significantly from fine-tuning approaches used for other language modeling tasks. Since the weights of the model are fixed, our approach is not biased by the coverage of any one dataset. As we might expect, our method underperforms when compared to previous benchmarks on the ConceptNet commonsense triples dataset (Li et al., 2016), but demonstrates a superior ability to generalize when mining novel commonsense knowledge from Wikipedia. Schwartz et al. (2017) and Trinh and Le (2018) demonstrate a similar approach to using language models for tasks requiring commonsense, such as the Story Cloze Task and the Winograd Schema Challenge, respectively (Mostafazadeh et al., 2016; Levesque et al., 2012). Bosselut et al. (2019) and Trinh and Le (2019) use unidirectional language models for CKBC, but their approach requires a supervised training step. Our approach differs in that we intentionally avoid training on any particular database, relying instead on the language model's general world knowledge. Additionally, we use a bidirectional masked model, which provides a more flexible framework for likelihood estimation and allows us to estimate point-wise mutual information. Although it is beyond the scope of this paper, it would be interesting to adapt the methods presented here for the related task of generating new commonsense knowledge (Saito et al., 2018).

Method
Given a commonsense head-relation-tail triple x = (h, r, t), we are interested in determining the validity of that tuple as a representation of a commonsense fact. Specifically, we would like to determine a numeric score y ∈ R reflecting our confidence that a given tuple represents true knowledge.
We assume that heads and tails are arbitrary-length sequences of words in a vocabulary V so that $h = \{h_1, h_2, \dots, h_n\}$ and $t = \{t_1, t_2, \dots, t_m\}$. We further assume that we have a known set of possible relations R so that r ∈ R.
The goal is to determine a function f that maps relational triples to validity scores. We propose decomposing f(x) = σ(τ(x)) into two subcomponents: a sentence generation function τ which maps a triple to a single sentence, and a scoring model σ which then determines a validity score y.
Our approach relies on two types of pretrained language models. Standard unidirectional models are typically represented as autoregressive probabilities: $p(w_1, \dots, w_m) = \prod_{i=1}^{m} p(w_i \mid w_1, \dots, w_{i-1})$. Masked bidirectional models such as BERT, proposed by Devlin et al. (2018), instead model in both directions, training word representations conditioned on both future and past words. The masking allows any number of words in the sequence to be hidden. This setup provides an intuitive framework to evaluate the probability of any word in a sequence conditioned on the rest of the sequence, $p(w_i \mid w_{1:i-1}, w_{i+1:m})$, where $w \in V \cup \{\kappa\}$ and $\kappa$ is a special token indicating a masked word.

Generating Sentences from Triples
We first consider methods for turning a triple such as (ferret, AtLocation, pet store) into a sentence such as "the ferret is in the pet store". Our approach is to generate a set of candidate sentences via hand-crafted templates and select the best proposal according to a language model. For each relation r ∈ R, we hand-craft a set of sentence templates. For example, one template in our experiments for the relation AtLocation is, "you are likely to find HEAD in TAIL". For the above example, this would yield the sentence, "You are likely to find ferret in pet store".
Because these sentences are not always grammatically correct, such as in the above example, we apply a simple set of transformations. These consist of inserting articles before nouns, converting verbs into gerunds, and pluralizing nouns which follow numbers. See the supplementary materials for details and Table 1 for an example. We then enumerate a set of alternative sentences $S = \{S_1, \dots, S_j\}$ resulting from each template and from all combinations of transformations. This yields a set of candidate sentences for each data point. We then select the candidate sentence with the highest log-likelihood according to a pre-trained unidirectional language model $P_{\text{coh}}$.

Candidate sentence                              Log-likelihood
"musician can playing musical instrument"       −5.7
"musician can be play musical instrument"       −4.9
"musician often play musical instrument"        −5.5
"a musician can play a musical instrument"      −2.9

Table 1: Example of generating candidate sentences. Several enumerated sentences for the triple (musician, CapableOf, play musical instrument). The sentence with the highest log-likelihood according to a pretrained language model is selected.
We refer to this method of generating a sentence from a triple as COHERENCY RANKING. Coherency Ranking operates under the assumption that natural, grammatical sentences will have a higher likelihood than ungrammatical or unnatural sentences. See an example subset of sentence candidates and their corresponding scores in Table 1. From a qualitative evaluation of the selected sentences, we find that this approach produces sentences of significantly higher quality than those generated by deterministic rules alone. We also perform an ablation study in our experiments demonstrating the effect of each component on CKBC performance.
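The enumerate-then-rank procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the two AtLocation templates, the `add_article` transformation, and all function names are hypothetical, and the language model is replaced by a lookup of the log-likelihoods reported in Table 1.

```python
from itertools import product

# Hypothetical templates for one relation; the actual template set is in the
# supplementary materials and is not reproduced here.
TEMPLATES = {
    "AtLocation": [
        "you are likely to find {head} in {tail}",
        "{head} is in {tail}",
    ],
}

def identity(phrase):
    """Leave the phrase unchanged."""
    return phrase

def add_article(phrase):
    """Toy grammatical transformation: prepend an indefinite article."""
    article = "an" if phrase[0].lower() in "aeiou" else "a"
    return f"{article} {phrase}"

def candidates(head, relation, tail, transforms=(identity, add_article)):
    """Enumerate every template with every combination of head/tail transforms."""
    return [
        template.format(head=f_head(head), tail=f_tail(tail))
        for template, f_head, f_tail in product(TEMPLATES[relation], transforms, transforms)
    ]

def coherency_rank(sentences, log_likelihood):
    """Return the candidate with the highest log-likelihood."""
    return max(sentences, key=log_likelihood)

# Stand-in for the unidirectional LM: the log-likelihoods listed in Table 1.
TABLE_1_SCORES = {
    "musician can playing musical instrument": -5.7,
    "musician can be play musical instrument": -4.9,
    "musician often play musical instrument": -5.5,
    "a musician can play a musical instrument": -2.9,
}

best = coherency_rank(TABLE_1_SCORES, TABLE_1_SCORES.get)
print(best)
print(candidates("ferret", "AtLocation", "pet store"))
```

In a real pipeline, `log_likelihood` would be a call into a pretrained unidirectional model rather than a dictionary lookup; the ranking logic itself would stay the same.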

Scoring Generated Triples
Assuming we have generated a proper sentence from a relational triple, we now need a way to score its validity with a pretrained model that considers the relationship between the relation entities. We therefore propose using the estimated point-wise mutual information (PMI) of the head h and tail t of a triple conditioned on the relation r, defined as $\mathrm{PMI}(t, h \mid r) = \log p(t \mid h, r) - \log p(t \mid r)$. We can estimate these scores by using a masked bidirectional language model, $P_{\text{cmp}}$. In the case where the tail is a single word, the model allows us to evaluate the conditional likelihood of a single triple component $p(t \mid h, r)$ by computing $P_{\text{cmp}}(w_i = t \mid w_{1:i-1}, w_{i+1:m})$ for the tail word.
In practice, the tail might be realized as a j-word phrase. To handle this complexity, we use a greedy approximation of its probability. We first mask all of the tail words and compute the probability of each. We then find the word with the highest probability $p_k$, substitute it back in, and repeat j times. Finally, we calculate the total conditional likelihood of the tail as the product of these terms, $p(t \mid h, r) = \prod_{k=1}^{j} p_k$. The marginal $p(t \mid r)$ is computed similarly, but in this case we mask the head throughout. For example, to compute the marginal tail probability for the sentence "You are likely to find a ferret in the pet store", we mask both the head and the tail and then sequentially unmask the tail words only: "You are likely to find a $\kappa_{h_1}$ in the $\kappa_{t_1}$ $\kappa_{t_2}$". If $\kappa_{t_2}$ = "store" has a higher probability than $\kappa_{t_1}$ = "pet", we unmask "store" and compute "You are likely to find a $\kappa_{h_1}$ in the $\kappa_{t_1}$ store". The marginal likelihood $p(t \mid r)$ is then the product of the two probabilities.
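The greedy unmasking procedure can be illustrated with a short Python sketch. It assumes a hypothetical interface in which a masked language model is a function from a token list and a position to a dictionary of word probabilities; `toy_lm` is an invented stand-in with made-up probabilities, not BERT, and serves only to make the example executable.

```python
import math

MASK = "[MASK]"

def greedy_log_prob(tokens, target_positions, masked_lm, always_masked=()):
    """Greedy estimate of the log-probability of the words at target_positions.

    All target positions start masked. At each step, the target word with the
    highest probability is unmasked and its log-probability is accumulated.
    Positions in always_masked stay hidden throughout (used for the marginal).
    """
    work = list(tokens)
    for i in list(target_positions) + list(always_masked):
        work[i] = MASK
    remaining = list(target_positions)
    total = 0.0
    while remaining:
        # Probability of the true word at each still-masked target position.
        probs = {i: masked_lm(work, i).get(tokens[i], 0.0) for i in remaining}
        best = max(remaining, key=probs.get)
        total += math.log(max(probs[best], 1e-12))  # floor avoids log(0)
        work[best] = tokens[best]
        remaining.remove(best)
    return total

def toy_lm(tokens, position):
    """Invented stand-in for a masked LM; the probabilities are made up."""
    boost = 2.0 if tokens[6] != MASK else 1.0  # a visible head helps
    if position == 10:
        return {"store": 0.2 * boost}
    if position == 9:
        base = 0.3 if tokens[10] == "store" else 0.05
        return {"pet": base * boost}
    return {}

sentence = "you are likely to find a ferret in the pet store".split()
head, tail = [6], [9, 10]

log_p_conditional = greedy_log_prob(sentence, tail, toy_lm)
log_p_marginal = greedy_log_prob(sentence, tail, toy_lm, always_masked=head)
print(log_p_conditional - log_p_marginal)  # unweighted PMI estimate
```

With these toy numbers, "store" is unmasked first in both passes, mirroring the example in the text.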
The final score combines the marginal and conditional likelihoods by employing a weighted form of the point-wise mutual information, $\mathrm{PMI}_\lambda(t, h \mid r) = \lambda \log p(t \mid h, r) - \log p(t \mid r)$, where $\lambda$ is treated as a hyperparameter. Although exact PMI is symmetrical, the approximate model itself is not. We therefore average $\mathrm{PMI}_\lambda(t, h \mid r)$ and $\mathrm{PMI}_\lambda(h, t \mid r)$ to reduce the variance of our estimates, computing the masked head values rather than the tail values in the latter.
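As a small worked example, the weighted and symmetrized score can be written as follows. The function names and the numeric log-probabilities are illustrative assumptions; only the formula itself comes from the definition above.

```python
import math

def weighted_pmi(log_p_conditional, log_p_marginal, lam):
    """Weighted PMI: lam * log p(a | b, r) - log p(a | r)."""
    return lam * log_p_conditional - log_p_marginal

def triple_score(tail_cond, tail_marg, head_cond, head_marg, lam):
    """Average the tail-given-head and head-given-tail weighted PMI estimates."""
    tail_side = weighted_pmi(tail_cond, tail_marg, lam)
    head_side = weighted_pmi(head_cond, head_marg, lam)
    return 0.5 * (tail_side + head_side)

# Made-up log-probabilities, for illustration only.
score = triple_score(
    tail_cond=math.log(0.24), tail_marg=math.log(0.06),
    head_cond=math.log(0.10), head_marg=math.log(0.02),
    lam=1.0,
)
print(score)
```

With λ = 1 this reduces to the average of the two unweighted PMI estimates; larger λ places more weight on the conditional likelihood.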

Experiments
To evaluate the Coherency Ranking approach we measure whether it can distinguish between valid and invalid triples. For our masked model, we use BERT-large (Devlin et al., 2018). For sentence ranking, we use the GPT-2 117M LM (Radford et al., 2019). The relation templates and grammar transformation rules which we use can be found in the supplementary materials.
We compare the proposed method to several baselines. Following Trinh and Le (2018), we evaluate a simple CONCATENATION method for generating sentences, splitting the relation r into separate words and concatenating it with the head and tail. For the triple (ferret, AtLocation, pet store), the Concatenation approach would yield, "ferret at location pet store".
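A minimal sketch of the Concatenation baseline is given below. It assumes that relation names are written in CamelCase, as in the examples in this paper; the function name is hypothetical.

```python
import re

def concatenation_sentence(head, relation, tail):
    """Join head, the relation split into lowercase words, and tail."""
    # Split a CamelCase relation name such as "AtLocation" into ["At", "Location"].
    relation_words = re.findall(r"[A-Z][a-z]*", relation)
    return " ".join([head, *(word.lower() for word in relation_words), tail])

print(concatenation_sentence("ferret", "AtLocation", "pet store"))
```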
We also evaluate CKBC performance when we construct sentences by applying a single hand-crafted template. Since each triple is mapped to a sentence with a single template without any grammatical transformations, we refer to this as the TEMPLATE method. Using the Template approach, (ferret, AtLocation, pet store) would become "You are likely to find ferret in pet store" using the template "you are likely to find HEAD in TAIL".
Next, we extend the Template method by applying deterministic grammatical transformations, which we refer to as the TEMPLATE + GRAMMAR approach. Like the full approach, these transformations involve adding articles before nouns, converting verbs into gerunds, and pluralizing nouns following numbers. The Template + Grammar approach differs from Coherency Ranking in that all transformations are applied to every sentence instead of applying combinations of transformations and templates, which are then ranked by a language model. Returning to our example, the Template + Grammar method produces "You are likely to find a ferret in a pet store". While this sentence is grammatical, applying this method to (star, AtLocation, outer space) yields "You are likely to find a star in an outer space", which is incorrect.
We compare our results to the supervised models from the work of Jastrzebski et al. (2018) and the best performing model from Li et al. (2016). Jastrzebski et al. (2018) introduce FACTORIZED and PROTOTYPICAL models. The Factorized model embeds the head, relation, and tail in a vector space and then produces a score by taking a linear combination of the inner products between each pair of embeddings. The Prototypical model is similar, but does not include the inner product between head and tail. Li et al. (2016) evaluate a deep neural network (DNN) for CKBC. They concatenate embeddings for the head, relation, and tail, which they then feed through a multilayer perceptron with one hidden layer. All three models are trained on 100,000 ConceptNet triples.
Task 1: Commonsense Knowledge Base Completion. Our experimental setup follows Li et al. (2016), evaluating our model with their test set (n = 2400) containing an equal number of valid and invalid triples. The valid triples are from the crowd-sourced Open Mind Common Sense (OMCS) entries in the ConceptNet 5 dataset (Speer and Havasi, 2012). Invalid triples are generated by replacing an element of a valid tuple with another randomly selected element.
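The construction of invalid triples can be sketched as below. This is an assumption-laden illustration rather than the procedure of Li et al. (2016): the text only states that one element is replaced at random, so drawing the replacement from the same slot of other triples is our assumption, and the function name and example pool are hypothetical.

```python
import random

def corrupt(triple, pool, rng):
    """Replace one randomly chosen slot of a valid triple.

    The replacement is drawn from the same slot (head, relation, or tail) of
    the other triples in the pool; this same-slot choice is an assumption.
    The result is not checked against the knowledge base, so it could
    occasionally coincide with another true fact.
    """
    slot = rng.randrange(3)
    choices = [other[slot] for other in pool if other[slot] != triple[slot]]
    corrupted = list(triple)
    corrupted[slot] = rng.choice(choices)
    return tuple(corrupted)

# Example triples taken from elsewhere in this paper.
VALID = [
    ("ferret", "AtLocation", "pet store"),
    ("musician", "CapableOf", "play musical instrument"),
    ("star", "AtLocation", "outer space"),
]

rng = random.Random(0)
for triple in VALID:
    print(triple, "->", corrupt(triple, VALID, rng))
```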
We use our scoring method to classify each tuple as valid or invalid. To this end, we use our method to assign a score to each tuple and then group the resulting scores into two clusters. Instances in the cluster with the higher mean PMI are labeled as valid, and the remainder are labeled as invalid. We use expectation-maximization with a mixture of Gaussians to cluster. We also tune the PMI weight via grid search over 90 points from λ ∈ [0.5, 5.0], using the Akaike information criterion of the Gaussian mixture model for evaluation (Akaike, 1974). Table 2 shows the full results. Our unsupervised approach achieves a test set F1 score of 78.8, comparable to the 79.4 F1 score found by the supervised Prototypical approach. The Factorized and DNN models significantly outperformed our approach with F1 scores of 89.2 and 89.0, respectively. Our grid search found an optimal λ value of 1.65 for the Concatenation sentence generation model and 1.55 for the Coherency Ranking model. For the Template and Template + Grammar methods, the optimal λ values were 1.20 and 0.95, respectively.
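The clustering step can be illustrated with a self-contained two-component Gaussian mixture fitted by expectation-maximization. This is a simplified sketch, not the code used for the reported results: the initialization, iteration count, variance floor, and example scores are all assumptions. The `aic` helper uses the standard definition 2k − 2 log L with k = 5 free parameters (two means, two variances, one mixing weight); how AIC values are compared across λ is not detailed here, although lower AIC is conventionally preferred.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def component_densities(x, weights, means, variances):
    """Weighted density of each mixture component at x."""
    return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

def fit_two_gaussians(scores, iterations=100, min_var=1e-6):
    """EM for a two-component 1-D mixture; returns parameters and log-likelihood."""
    lo, hi = min(scores), max(scores)
    weights, means = [0.5, 0.5], [lo, hi]
    variances = [max((hi - lo) ** 2 / 4, min_var)] * 2
    for _ in range(iterations):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in scores:
            dens = component_densities(x, weights, means, variances)
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, scores)) / nk
            variances[k] = max(var, min_var)
    log_lik = sum(
        math.log(max(sum(component_densities(x, weights, means, variances)), 1e-300))
        for x in scores
    )
    return weights, means, variances, log_lik

def aic(log_lik, n_params=5):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_lik

def classify_valid(scores):
    """Label a score True if the higher-mean component explains it better."""
    weights, means, variances, _ = fit_two_gaussians(scores)
    valid = 0 if means[0] > means[1] else 1
    labels = []
    for x in scores:
        dens = component_densities(x, weights, means, variances)
        labels.append(dens[valid] >= dens[1 - valid])
    return labels

# Made-up PMI scores: four low (invalid-looking) and four high (valid-looking).
EXAMPLE_SCORES = [-2.1, -1.9, -2.0, -1.8, 3.0, 3.2, 2.9, 3.1]
print(classify_valid(EXAMPLE_SCORES))
```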
Task 2: Mining Wikipedia. To assess the model's ability to generalize to unseen data, we evaluate our unsupervised model in comparison to previous supervised methods on the task of mining commonsense knowledge from Wikipedia. In their evaluations, Li et al. (2016) curate a set of 1.7M triples across 10 relations by applying part-of-speech patterns to Wikipedia articles. We sample 300 triples from each relation. We apply our method to evaluate these 3000 triples. Using the approach described by Speer and Havasi (2012), and followed by Li et al. (2016) and Jastrzebski et al. (2018), two human annotators manually rate the 100 triples with the highest predicted score on a 0 to 4 scale: 0 (Doesn't make sense), 1 (Not true), 2 (Opinion/Don't know), 3 (Sometimes true), and 4 (Generally true). We tuned λ by measuring the quality of the 100 triples with the highest predicted score across λ ∈ {1, 2, . . . , 9, 10}.
The top 100 triples selected by our model were assigned a mean rating of 3.00 (λ = 4) with a standard error of 0.11 under the Coherency Ranking approach, well exceeding the performance of current supervised methods (Table 2). Standard errors were calculated using 1000 bootstrap samples of the top 100 triples. The ratings assigned by the two human annotators had a 0.50 Pearson correlation and 0.23 kappa inter-annotator agreement. Rater disagreements occur most frequently when triples are ambiguous or difficult to interpret. Notably, if we bucket the five scores into just two categories of true and false, this disagreement rate drops by 50%. To give a sense of the types of commonsense knowledge our models struggle to capture, we report the top 100 most confident predictions that receive an average score below 3 in the supplementary material. Notably, some of the top 100 triples our model identified were indeed true, but would not be reasonably considered common sense (e.g. (vector bundle, HasProperty, manifold)). This suggests that our approach may be applicable to mining knowledge beyond common sense.
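For readers unfamiliar with the bootstrap, the standard error of a mean rating can be estimated as sketched below. The function name, the fixed seed, and the placeholder ratings are assumptions for illustration; the ratings are not the annotation data from this study.

```python
import random
import statistics

def bootstrap_standard_error(ratings, n_resamples=1000, seed=0):
    """Standard deviation of the mean over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(ratings)
    resampled_means = [
        statistics.fmean(rng.choices(ratings, k=n)) for _ in range(n_resamples)
    ]
    return statistics.stdev(resampled_means)

# Placeholder ratings on the 0 to 4 scale (made up, 100 items).
placeholder_ratings = [0, 1, 2, 3, 4] * 20
print(statistics.fmean(placeholder_ratings))
print(bootstrap_standard_error(placeholder_ratings))
```

The resulting estimate should be close to the analytic value of the sample standard deviation divided by the square root of the sample size.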
Analysis: Sentence Generation. In order to measure the impact of sentence generation on our model, we select a sample of 100 sentences and group the results by a) whether the sentence contained a grammatical error, and b) whether the sentence misrepresented the meaning of the triple. For example, the triple (golf, HasProperty, good) yields the sentence "golf is a good", which is grammatically correct but conveys the wrong meaning. On both Wikipedia mining and CKBC, we find that misrepresenting meaning has an adverse impact on model performance. In CKBC, we also find that grammar has a high impact on the resulting F1 scores (Table 3). Future work could therefore focus on designing templates that more reliably encode a relation's true meaning.
Table 3: Test results examining the effect of sentence meaning and grammaticality on task performance. Scores are shown for a sample of 100 triples split by whether the generated sentence is grammatical and whether it conveys the correct meaning of the triple.

Conclusion
We introduce a robust unsupervised method for commonsense knowledge base completion using the world knowledge of pre-trained language models. We develop a method for expressing knowledge triples as sentences. Using a bidirectional masked language model on these sentences, we can then estimate the weighted point-wise mutual information of a triple as a proxy for its validity. Though our approach performs worse on a held-out test set developed by Li et al. (2016), it does so without any previous exposure to the ConceptNet database, ensuring that this performance is not biased. In the future, we hope to explore whether this approach can be extended to mining facts that are not commonsense and to generating new commonsense knowledge outside of any given database of candidate triples. We also see potential benefit in the development of a more expansive set of evaluation methods for commonsense knowledge mining, which would strengthen the validity of our conclusions.