Do Nuclear Submarines Have Nuclear Captains? A Challenge Dataset for Commonsense Reasoning over Adjectives and Objects

How do adjectives project from a noun to its parts? If a motorcycle is red, are its wheels red? Is a nuclear submarine’s captain nuclear? These questions are easy for humans to judge using our commonsense understanding of the world, but are difficult for computers. To attack this challenge, we crowdsource a set of human judgments that answer the English-language question “Given a whole described by an adjective, does the adjective also describe a given part?” We build strong baselines for this task with a classification approach. Our findings indicate that, despite the recent successes of large language models on tasks designed to assess commonsense knowledge, these models do not greatly outperform simple word-level models based on pre-trained word embeddings. This provides evidence that the amount of commonsense knowledge encoded in these language models does not extend far beyond that already baked into the word embeddings. Our dataset will serve as a useful testbed for future research in commonsense reasoning, especially as it relates to adjectives and objects.


Introduction
We investigate the commonsense inference of the transitivity of an attribute of a whole object to its component parts. To illustrate this targeted reasoning by example: "Is a sharp knife's handle sharp?" The ability to perform commonsense inference of this type enables a more complete understanding of the physical world and therefore may find use in a variety of tasks in pragmatics and at the interface of vision and language. Consider generating a story in which a slow car goes to the shop to get a new part. If the new part is a windshield, the car remains slow, whereas if the new part is an engine, the car may now be fast. This knowledge may also help a visual agent reason about unseen objects: it knows a brick house does not have a brick door without needing to see the door.
*Research conducted while author was at USC/ISI.
The past few years have seen a raft of datasets intended to test our ability to construct models with an understanding of commonsense knowledge. Standout examples are the Stanford Natural Language Inference (SNLI) and related Multi-Genre Natural Language Inference (MNLI) corpora (Bowman et al., 2015; Williams et al., 2018), the SemEval-2018 commonsense shared task (Ostermann et al., 2018), the Rochester Story Completion (ROCStories) corpus (Mostafazadeh et al., 2016), and the Situations with Adversarial Generations (SWAG) grounded inference corpus (Zellers et al., 2018). After their release, very large language models (LMs) were able to reach or surpass human-level performance on SNLI (Peters et al., 2018) and SWAG (Devlin et al., 2018).
However, researchers have found inadequacies in these datasets and the models trained on them. Despite the strong performance of recent systems on SNLI, Glockner et al. (2018) show that trivial changes to the test set cause these methods to suffer. Further, Pavlick and Callison-Burch (2016) show that state-of-the-art models for natural language inference fail on a task requiring only reasoning over adjective-noun relations. Relatedly, in the shared task to predict sentence endings of ROCStories, Schwartz et al. (2017) show that by incorporating style features, with only the answer choices as input, it is possible to reach near state-of-the-art performance. These results point to implicit bias baked into the datasets. Other work demonstrates similar systematic and social bias in SNLI, attributing it to the fact that hypothesis sentences were written by crowd workers. The SWAG dataset was specifically constructed in an adversarial way with this in mind, but may be disadvantaged by the fact that continuation sentences are generated by computers. This may lead to patterns that are hard to detect but can nevertheless be picked up by other language models. We avoid the issue of elicitation bias by first collecting candidates grounded in natural sources of text and images, and then gathering only scaled judgments from crowd workers, as was done by Zhang et al. (2017).
To understand how to build truly intelligent agents, we should strive to create datasets with as little exploitable bias as possible, and to further investigate the landscape of current performance. We contribute a dataset which provides a focused evaluation, based on a specific task in commonsense reasoning. After gathering and validating data from crowd workers, we evaluate a number of approaches to performing these inferences, which we frame as a three-way lexical entailment problem. We find that simple word embedding-based models perform adequately on this task, though below human level, with recent large LM approaches (Devlin et al., 2018; Radford et al., 2018) providing only slight improvement over the purely lexical approach.

Related Work
Other researchers have constructed datasets investigating similar ideas in commonsense reasoning. Forbes and Choi (2017) develop a dataset and methods for inferring physical commonsense knowledge from verb usage, showing it is possible to learn the physical implications of unseen verbs from a small seed set. Zhang et al. (2017) create a large dataset for general commonsense inference in the form of premise-hypothesis pairs, equipped with ordinal labels ranging from "impossible" to "very likely". We adopt much of their methodology but for a targeted subset of commonsense reasoning. The SemEval 2018 Task 10 on Capturing Discriminative Attributes (Krebs et al., 2018) describes a similar lexical reasoning task involving triplets of words, though it focuses on finding attributes that distinguish two concepts, while in our work the adjective may well apply to both part and whole.
Past work has also evaluated commonsense capabilities in neural models. Pavlick and Callison-Burch (2016) investigate the related problem of entailment in adjective-nouns, and show surprising negative results for neural NLI models. Wang et al. (2018) showed that models based on distributional semantics without explicit external knowledge perform poorly at predicting physical plausibility of actions.
Lucy and Gauthier (2017) investigate perceptual properties of distributional embeddings and suggest that part-whole properties like has legs are well encoded by embeddings. This may help explain why the simple word-based MLP models perform well without other sources of context. Rei et al. (2018) introduce an effective neural architecture for learning word-embedding based models for graded lexical entailment. Prior work (Bulat et al., 2016; Fagarasan et al., 2015) utilizes embeddings to predict real-world perceptual properties, and we expect an approach that leverages this will help solve this task, but we leave it to future work.
Example premise: "In front of me about five feet distance, stood a wooden bench." Example hypothesis: "The bench's support is wooden."

Candidate collection
We seek to annotate examples of (whole, part, adjective) triples with answers to the question: "Does an [adjective] [whole] have an [adjective] [part]?" As a major part of our contribution, we provide an annotated dataset that is visually grounded, with relations mined from Visual Genome (Krishna et al., 2017) and Google Syntactic N-grams (Goldberg and Orwant, 2013). We provide an overview here, with details in Appendix A.

Part-whole relations
Visual Genome (VG) is a large dataset of images annotated with objects, their attributes, and the relations between them. We start by considering all relationships in the VG dataset where the predicate is an underspecified has relation. We count the number of images in which a pair of objects appear in a has relation, and keep only those pairs appearing in at least three distinct images.
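The counting step above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout (image_id, subject, predicate, object) and the function name frequent_has_pairs are assumptions made for the example, and real VG relationship records would first need to be mapped into this shape.

```python
from collections import defaultdict

def frequent_has_pairs(relations, min_images=3):
    """Return (whole, part) pairs linked by `has` in at least min_images distinct images."""
    images_per_pair = defaultdict(set)
    for image_id, subject, predicate, obj in relations:
        if predicate == "has":
            # A set keeps each image once, so repeated annotations
            # inside one image do not inflate the count.
            images_per_pair[(subject, obj)].add(image_id)
    return {pair for pair, images in images_per_pair.items()
            if len(images) >= min_images}

# Toy records in a hypothetical layout: (image_id, subject, predicate, object).
relations = [
    (1, "bench", "has", "support"),
    (2, "bench", "has", "support"),
    (2, "bench", "has", "support"),  # duplicate within image 2
    (3, "bench", "has", "support"),
    (1, "car", "has", "windshield"),
    (4, "car", "has", "windshield"),
    (5, "man", "wearing", "hat"),
]
pairs = frequent_has_pairs(relations)
```

Here only the bench/support pair survives, since car/windshield appears in just two images and the third record is not a has relation.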

Adjectives
We gather adjectives from both Google Syntactic N-grams and VG. From Syntactic N-grams, we count the occurrences of an adjective modifying a noun with the amod relation. We remove common non-attributive (e.g., awake) and non-descriptive (e.g., first) adjectives using manually constructed lexicons. Then, for each whole noun, we gather its five most common adjectival modifiers, as well as its five most common adjective attributes from Visual Genome. Through pilot studies we observed that without further filtering, annotations were highly skewed towards non-entailment, thus we achieve a more balanced dataset by filtering out adjectives that are never observed attached to the part.
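A rough sketch of the candidate selection just described, under stated assumptions: the two count tables are hypothetical dictionaries mapping a noun to a Counter of adjectives, lexicon filtering is presumed to have happened already, and "observed attached to the part" is taken to mean the adjective appears with the part noun in either source. The paper's Appendix A may define these details differently.

```python
from collections import Counter

def candidate_adjectives(whole, part, ngram_counts, vg_counts, k=5):
    """Top-k adjectives for the whole from each source, kept only if seen with the part."""
    top_ngram = [adj for adj, _ in ngram_counts.get(whole, Counter()).most_common(k)]
    top_vg = [adj for adj, _ in vg_counts.get(whole, Counter()).most_common(k)]
    # Assumption: an adjective counts as "observed with the part" if it
    # occurs with the part noun in either source.
    seen_with_part = set(ngram_counts.get(part, {})) | set(vg_counts.get(part, {}))
    merged = list(dict.fromkeys(top_ngram + top_vg))  # de-duplicate, keep order
    return [adj for adj in merged if adj in seen_with_part]

# Toy counts, invented for illustration only.
ngram_counts = {
    "knife": Counter({"sharp": 9, "old": 4, "nuclear": 1}),
    "handle": Counter({"wooden": 5, "old": 3}),
}
vg_counts = {
    "knife": Counter({"silver": 6, "sharp": 2}),
    "handle": Counter({"silver": 2}),
}
candidates = candidate_adjectives("knife", "handle", ngram_counts, vg_counts)
```

In this toy case "sharp" and "nuclear" are dropped because they never occur with "handle", which is the kind of filtering that shifts the label balance away from non-entailment.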

Collecting human annotations
We crowdsource annotations on Amazon Mechanical Turk (AMT) for each (whole, part, adjective) triple as follows:

Task overview
For each part-whole pair, we sample three random images from VG that contain the pair, and draw bounding boxes around both objects, provided by VG annotations. We present these to workers simply to provide context for the part-whole pair, since early tests showed that without visual cues workers often have trouble understanding the overall problem. Then, we ask a series of questions that each associates the pair with an adjective. To encourage the worker to imagine the prototypical version of the objects rather than the specific ones shown, we use the template "Consider any [whole], not the particular ones pictured". Specific questions have the form: "If the [whole] is [adjective], which of the following is true?" The answers describe whether it is "impossible", "unlikely", "unrelated", "likely", or "guaranteed" that the identified part is also described by the adjective. The answers use causal language to encourage "conditional plausibility" thinking, as described by Zhang et al. (2017). This also allows for the "unrelated" answer, which covers spurious examples, such as a black guitar's cord being black, where the cord is likely black, but not as a result of the guitar being black. We also give an option for the worker to mark that one of the pairwise relations is nonsensical.

Qualification task
After manually annotating some examples and conducting two AMT pilot studies, we found a non-trivial margin between our own agreement and that of workers, as measured by the quadratic-weighted Cohen's κ. To alleviate this, we followed Zhang et al. (2017) and conducted a pilot study to gather a pool of qualified workers. We launched a pilot task with two gold examples from each class on which our manual annotations agreed, and recruited 300 crowd workers to label them. Setting a κ threshold of 0.7 on agreement with the gold examples yielded 106 qualified workers, whom we asked to perform the rest of the annotations. We collected at least three annotations per triple. An example annotation interface is shown in Figure 1.
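For readers unfamiliar with the agreement measure, the following is a small from-scratch sketch of quadratic-weighted Cohen's κ and of the 0.7 qualification check. It assumes the five answers are encoded as ordinal integers 0 to 4 in the order they are listed to workers; that encoding, the function name, and the toy worker labels are illustrative assumptions rather than details taken from the paper.

```python
def quadratic_weighted_kappa(a, b, num_labels):
    """Cohen's kappa with quadratic disagreement weights (i - j)^2 / (k - 1)^2."""
    n = len(a)
    observed = [[0] * num_labels for _ in range(num_labels)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(num_labels)) for j in range(num_labels)]
    weighted_observed = weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            w = (i - j) ** 2 / (num_labels - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += w * row[i] * col[j] / n
    # Undefined (division by zero) if both raters use a single label throughout.
    return 1.0 - weighted_observed / weighted_expected

# Two gold items per class, five classes (labels 0..4 are an assumed encoding).
gold = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
worker = [0, 1, 1, 1, 2, 2, 3, 3, 4, 3]  # hypothetical worker, off by one twice
kappa = quadratic_weighted_kappa(gold, worker, num_labels=5)
qualified = kappa >= 0.7
```

Because the weights grow with the square of the distance between labels, confusing "likely" with "guaranteed" costs far less than confusing "impossible" with "guaranteed".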

Filtering and statistics
From the total of 20,284 triples annotated, we filter out 4,040 (19.9%) that were reported to contain an invalid triple. We further remove instances without a majority vote from the workers. This results in a set of 13,684 triples with an interannotator agreement (quadratic-weighted Cohen's κ) of 0.624. (For reference, Zhang et al. (2017) report κ = 0.54 for general commonsense inference.) The label distribution is shown in Table 2. The dataset has 728 unique part nouns, 873 unique whole nouns, and 553 unique adjectives.
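The majority-vote filter can be illustrated with a short sketch. It assumes "majority" means a label chosen by strictly more than half of a triple's annotators; the paper does not spell out the exact rule here, so treat that threshold as an assumption.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# Toy annotations keyed by (whole, part, adjective); values are worker answers.
annotations = {
    ("bench", "support", "wooden"): ["likely", "likely", "guaranteed"],
    ("guitar", "cord", "black"): ["unrelated", "likely", "unlikely"],
}
kept = {triple: majority_label(votes)
        for triple, votes in annotations.items()
        if majority_label(votes) is not None}
```

The second triple is dropped because its three annotators gave three different answers.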

Inference baselines
We now describe several basic approaches for solving these commonsense inference problems, which we intend as baselines to be built upon by future work. Formally, models answer the question: Given (1) a noun denoting a whole object that has (2) a component part, also denoted by a noun, does (3) an adjective that describes (1) also describe (2)? The data is first split into training, validation, and test sets consisting of 70%, 10%, and 20% of the data respectively. Model selection and tuning details are described in Appendix C.
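As a concrete illustration of the 70/10/20 split, here is a minimal sketch. It assumes a uniform random shuffle with a fixed seed; whether the original split was random or stratified in some way is not stated, so this is one plausible reading only.

```python
import random

def split_dataset(examples, seed=0):
    """Shuffle and split into 70% train, 10% validation, and the remainder as test."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(0.7 * len(items))
    n_val = int(0.1 * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Stand-in for the 13,684 filtered triples reported above.
train, val, test = split_dataset(range(13684))
sizes = (len(train), len(val), len(test))
```

Giving the test set the remainder after flooring the other two sizes keeps every example in exactly one partition.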

Word embedding models
We approach the problem as categorical classification and train a multi-layer perceptron (MLP) model to classify inputs consisting of word embeddings for the whole, part, and adjective words. The MLP takes as input the concatenation of these three word embeddings, obtained from GloVe (Pennington et al., 2014), and applies a single hidden layer with ReLU activation before the final softmax layer which predicts the class label.
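The forward pass of such a classifier can be sketched in plain Python. The embedding size, hidden size, random weights, and three-class output below are toy assumptions chosen so the example runs without external libraries; real inputs would be GloVe vectors and trained parameters.

```python
import math
import random

def mlp_forward(whole_vec, part_vec, adj_vec, w_hidden, b_hidden, w_out, b_out):
    """Concatenate three embeddings, apply one ReLU hidden layer, then softmax."""
    x = list(whole_vec) + list(part_vec) + list(adj_vec)
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim, hidden_size, num_classes = 4, 8, 3  # toy sizes, not the paper's settings
w_hidden = random_matrix(hidden_size, 3 * dim, rng)
b_hidden = [0.0] * hidden_size
w_out = random_matrix(num_classes, hidden_size, rng)
b_out = [0.0] * num_classes
whole, part, adj = ([rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3))
probs = mlp_forward(whole, part, adj, w_hidden, b_hidden, w_out, b_out)
```

The model sees no sentence context at all, only the three word vectors, which is what makes its competitive accuracy notable.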

Adjective projection as NLI
As we want to evaluate strong yet simple preexisting language understanding models on this task, we now describe a method for obtaining the direct prediction described above via conversion to a form suitable for inference in the style of the SNLI and MNLI datasets (Bowman et al., 2015; Williams et al., 2018), which consist of premise and hypothesis sentence pairs. We first form simple hypothesis sentences from the tuples using the fixed template "The [whole]'s [part] is [adjective]." We then retrieve premise sentences that mention the whole modified by the adjective. An example for (bicycle, old) is "He rode an old bicycle and brought fruits and vegetables home from Chinatown." We retrieve context sentences from five resources: Project Gutenberg books, the Gigaword news corpus (Parker et al., 2011), SNLI, MNLI, and MSCOCO image captions (Lin et al., 2014); premise sentence selection is described fully in Appendix D and examples are shown in Table 1.
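The conversion to premise and hypothesis pairs can be sketched as below. The hypothesis template follows the text; the premise lookup is a deliberately naive substring match on "adjective whole" and is only an assumption standing in for the selection procedure of Appendix D. Both function names are hypothetical.

```python
def make_hypothesis(whole, part, adjective):
    """Fill the fixed hypothesis template from a (whole, part, adjective) triple."""
    return f"The {whole}'s {part} is {adjective}."

def find_premises(sentences, whole, adjective):
    """Naive stand-in for premise retrieval: keep sentences containing 'adjective whole'."""
    phrase = f"{adjective} {whole}"
    return [s for s in sentences if phrase in s.lower()]

corpus = [
    "He rode an old bicycle and brought fruits and vegetables home from Chinatown.",
    "In front of me about five feet distance, stood a wooden bench.",
    "The bicycle was new.",
]
hypothesis = make_hypothesis("bench", "support", "wooden")
premises = find_premises(corpus, "bicycle", "old")
```

A substring match like this misses inflected or separated mentions (for example "the bicycle was old"), so a real retrieval step would likely rely on parsing.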

Fine-tuning language models
We apply transfer learning from two recently developed large contextualized LMs to this task. Both are state-of-the-art on NLI and commonsense tasks.
OpenAI GPT is a unidirectional model that predicts the next word, while BERT is bidirectional and predicts randomly missing words, as well as the next sentence. Both train on the BooksCorpus, with BERT additionally trained on English Wikipedia. Both models are fine-tuned to perform NLI by applying a linear layer to the model's final output at one position of the input. The models are then trained in a multi-task way for the inference task and the language modeling objective(s), updating the whole network.

Table 3:
Method                              Accuracy
Majority baseline                   0.430
Majority-per-part baseline          0.485
GloVe embeddings                    0.651
OpenAI GPT (Radford et al., 2018)   0.666
BERT (Devlin et al., 2018)          0.667
Human performance                   0.785

Test set results for these methods are in Table 3. We also provide performance by two simple baselines, the first of which always predicts the majority class ("guaranteed"). To choose the second baseline, we evaluated choosing the majority class for the given whole, part, or adjective. Of these, predicting the majority-per-part had the best validation set performance, so we report that result on test.
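The two simple baselines can be sketched as follows. Falling back to the overall majority class for parts unseen in training is an assumption made here for completeness; the toy training rows and function names are invented for illustration.

```python
from collections import Counter, defaultdict

def fit_majority_per_part(train):
    """Learn the overall majority label and the majority label for each part noun."""
    overall = Counter(label for _, _, _, label in train).most_common(1)[0][0]
    by_part = defaultdict(Counter)
    for _, part, _, label in train:
        by_part[part][label] += 1
    per_part = {part: counts.most_common(1)[0][0] for part, counts in by_part.items()}
    return overall, per_part

def predict(part, overall, per_part):
    # Assumed fallback: unseen parts receive the overall majority label.
    return per_part.get(part, overall)

# Toy rows: (whole, part, adjective, label).
train = [
    ("knife", "handle", "sharp", "impossible"),
    ("sword", "handle", "sharp", "impossible"),
    ("knife", "handle", "old", "guaranteed"),
    ("car", "door", "red", "guaranteed"),
    ("house", "door", "old", "guaranteed"),
    ("bike", "wheel", "old", "guaranteed"),
]
overall, per_part = fit_majority_per_part(train)
handle_prediction = predict("handle", overall, per_part)
unseen_prediction = predict("captain", overall, per_part)
```

Swapping the grouping key from the part to the whole or the adjective gives the other two variants that were compared on the validation set.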
We observe that the best model that operates on just word embeddings is within ≈ 0.02 of both language models in absolute accuracy, and the best-performing model still lags behind human accuracy by nearly 0.12 absolute points, suggesting work remains to be done on incorporating this variety of common sense into intelligent models.

Conclusion
Inspired by recent commonsense dataset construction efforts and the speed with which researchers develop highly performant models for them, we develop a dataset that evaluates a type of inference that is specific but that agents with commonsense should be able to solve. We show that state-of-the-art language models perform well, but that models using just pretrained word embeddings perform