SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

Given a partial description like “she opened the hood of the car,” humans can reason about the situation and anticipate what might come next (“then, she examined the engine”). In this paper, we introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning. We present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations. To address the recurring challenges of the annotation artifacts and human biases found in many existing datasets, we propose Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data. To account for the aggressive adversarial filtering, we use state-of-the-art language models to massively oversample a diverse set of potential counterfactuals. Empirical results demonstrate that while humans can solve the resulting inference problems with high accuracy (88%), various competitive models struggle on our task. We provide comprehensive analysis that indicates significant opportunities for future research.


Introduction
When we read a story, we bring to it a large body of implicit knowledge about the physical world. For instance, given the context "on stage, a woman takes a seat at the piano," shown in Table 1, we can easily infer what the situation might look like: a woman is giving a piano performance, with a crowd watching her. We can furthermore infer her likely next action: she will most likely set her fingers on the piano keys and start playing.
This type of natural language inference requires commonsense reasoning, substantially broadening the scope of prior work that focused primarily on linguistic entailment (Chierchia and McConnell-Ginet, 2000). Whereas the dominant entailment paradigm asks if two natural language sentences (the 'premise' and the 'hypothesis') describe the same set of possible worlds (Dagan et al., 2006; Bowman et al., 2015), here we focus on whether a (multiple-choice) ending describes a possible (future) world that can be anticipated from the situation described in the premise, even when it is not strictly entailed. Making such inferences necessitates a rich understanding of everyday physical situations, including object affordances (Gibson, 1979) and frame semantics (Baker et al., 1998).

On stage, a woman takes a seat at the piano. She
a) sits on a bench as her sister plays with the doll.
b) smiles with someone as the music plays.
c) is in the crowd, watching the dancers.
d) nervously sets her fingers on the keys.

A girl is going across a set of monkey bars. She
a) jumps up across the monkey bars.
b) struggles onto the monkey bars to grab her head.
c) gets to the end and stands on a wooden plank.
d) jumps up and does a back flip.

The woman is now blow drying the dog. The dog
a) is placed in the kennel next to a woman's feet.
b) washes her face with the shampoo.
c) walks into frame and walks towards the dog.
d) tried to cut her face, so she is trying to do something very close to her face.

Table 1: Examples from Swag; the correct answer is bolded. Adversarial Filtering ensures that stylistic models find all options equally appealing.

A first step toward grounded commonsense inference with today's deep learning machinery is to create a large-scale dataset. However, recent work has shown that human-written datasets are susceptible to annotation artifacts: unintended stylistic patterns that give away clues for the gold labels (Gururangan et al., 2018; Poliak et al., 2018). As a result, models trained on such datasets with human biases run the risk of over-estimating the actual performance on the underlying task, and are vulnerable to adversarial or out-of-domain examples (Wang et al., 2018; Glockner et al., 2018).
In this paper, we introduce Adversarial Filtering (AF), a new method to automatically detect and reduce stylistic artifacts. We use this method to construct Swag: an adversarial dataset with 113k multiple-choice questions. We start with pairs of temporally adjacent video captions, each with a context and a follow-up event that we know is physically possible. We then use a state-of-the-art language model fine-tuned on this data to massively oversample a diverse set of possible negative sentence endings (or counterfactuals). Next, we filter these candidate endings aggressively and adversarially using a committee of trained models to obtain a population of de-biased endings with similar stylistic features to the real ones. Finally, these filtered counterfactuals are validated by crowd workers to further ensure data quality.
Extensive empirical results demonstrate unique contributions of our dataset, complementing existing datasets for natural language inference (NLI) (Bowman et al., 2015; Williams et al., 2018) and commonsense reasoning (Roemmele et al., 2011; Mostafazadeh et al., 2016; Zhang et al., 2017). First, our dataset poses a new challenge of grounded commonsense inference that is easy for humans (88%) while hard for current state-of-the-art NLI models (<60%). Second, our proposed adversarial filtering methodology allows for cost-effective construction of a large-scale dataset while substantially reducing known annotation artifacts. The generality of adversarial filtering allows it to be applied to build future datasets, ensuring that they serve as reliable benchmarks.

Swag: Our new dataset
We introduce a new dataset for studying physically grounded commonsense inference, called Swag.¹ Our task is to predict which event is most likely to occur next in a video. More formally, a model is given a context $c = (s, n)$: a complete sentence $s$ and a noun phrase $n$ that begins a second sentence, as well as a list of possible verb phrase sentence endings $V = \{v_1, \dots, v_4\}$. See Figure 1 for an example triple $(s, n, v_i)$. The model must then select the most appropriate verb phrase $v_{\hat{i}} \in V$.
¹ Short for Situations With Adversarial Generations.
[Figure 1 panels: using video captions from t and t+1 (the videos are never used); oversample endings from context+NP; adversarially select generations; annotators filter endings to ensure agreement. Example context: "The mixer creams the butter. Sugar…", with candidate endings such as "is added to the mixing bowl." and "is placed in the oven."]
Figure 1: Overview of the data collection process. For a pair of sequential video captions, the second caption is split into noun and verb phrases. A language model generates many negative endings, of which a difficult subset are human-annotated.
Overview. Our corpus consists of 113k multiple choice questions (73k training, 20k validation, 20k test) and is derived from pairs of consecutive video captions from ActivityNet Captions (Krishna et al., 2017; Heilbron et al., 2015) and the Large Scale Movie Description Challenge (LSMDC; Rohrbach et al., 2017). The two datasets are slightly different in nature and allow us to achieve broader coverage: ActivityNet contains 20k YouTube clips, each containing one of 203 activity types (such as doing gymnastics or playing guitar); LSMDC consists of 128k movie captions (audio descriptions and scripts). For each pair of captions, we use a constituency parser (Stern et al., 2017) to split the second sentence into noun and verb phrases (Figure 1). Each question has a human-verified gold ending and 3 distractors.

A solution to annotation artifacts
In this section, we outline the construction of Swag. We seek dataset diversity while minimizing annotation artifacts: conditional stylistic patterns such as length and word-preference biases. For many NLI datasets, these biases have been shown to allow shallow models (e.g. bag-of-words) to obtain artificially high performance.
To avoid introducing easily "gamed" patterns, we present Adversarial Filtering (AF), a generally applicable treatment involving the iterative refinement of a set of assignments to increase the entropy under a chosen model family. We then discuss how we generate counterfactual endings, and finally, the models used for filtering.

Formal definition
In this section, we formalize what it means for a dataset to be adversarial. Intuitively, we say that an adversarial dataset for a model $f$ is one on which $f$ will not generalize, even if evaluated on test data from the same distribution. More formally, let our input space be $\mathcal{X}$ and the label space be $\mathcal{Y}$. Our trainable classifier $f$, taking parameters $\theta$, is defined as $f_\theta : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}$. Let our dataset of size $N$ be defined as $\mathcal{D} = \{(x_i, y_i)\}_{1 \le i \le N}$, and let the loss function over the dataset be $L(f_\theta, \mathcal{D})$. We say that a dataset is adversarial with respect to $f$ if we expect high empirical error $I$ over all leave-one-out train/test splits (Vapnik, 2000):
$$I(\mathcal{D}, f) = \frac{1}{N} \sum_{i=1}^{N} L\big(f_{\theta_i^\star}, \{(x_i, y_i)\}\big),$$
where
$$\theta_i^\star = \operatorname{argmin}_{\theta} L\big(f_\theta, \mathcal{D} \setminus \{(x_i, y_i)\}\big),$$
with regularization terms omitted for simplicity.
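As a rough illustration of the leave-one-out quantity described above, the following sketch computes a mean held-out error for a toy classifier. Everything here is an assumption made for illustration: the nearest-centroid "model", the 0/1 loss standing in for $L$, and the function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features x_i, label y_i)


def fit_centroids(data: List[Example]) -> Dict[int, List[float]]:
    """Toy 'training' step: one mean feature vector per label."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(theta: Dict[int, List[float]], x: Sequence[float]) -> int:
    """Return the label whose centroid is closest to x."""
    def sq_dist(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(theta, key=lambda y: sq_dist(theta[y]))


def leave_one_out_error(data: List[Example]) -> float:
    """Mean 0/1 loss on each held-out example after fitting on the rest."""
    errors = 0
    for i, (x, y) in enumerate(data):
        theta_i = fit_centroids(data[:i] + data[i + 1:])  # fit without example i
        errors += int(predict(theta_i, x) != y)
    return errors / len(data)
```

Under this toy setup, a cleanly separable dataset gives an error near zero, while a dataset whose labels cannot be predicted from the features gives a high error, which is the sense of "adversarial" used here.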

Adversarial filtering (AF) algorithm
In this section, we outline an approach for generating an adversarial dataset $\mathcal{D}$, effectively maximizing empirical error $I$ with respect to a family of trainable classifiers $f$. Without loss of generality, we consider the situation where we have $N$ contexts, each associated with a single positive example $(x_i^+, 1) \in \mathcal{X} \times \mathcal{Y}$, and a large population of context-specific negative examples $(x_{i,j}^-, 0) \in \mathcal{X} \times \mathcal{Y}$, where $1 \le j \le N^-$ for each $i$. For instance, the negative examples could be incorrect relations in knowledge-base completion (Socher et al., 2013), or all words in a dictionary for a single-word cloze task (Zweig and Burges, 2011).
Our goal will be to filter the population of negative examples for each instance $i$ to a size of $k \ll N^-$. This will be captured by returning a set of assignments $A$, where for each instance the assignment will be a $k$-subset $A_i \subseteq \{1, \dots, N^-\}$ with $|A_i| = k$. The filtered dataset will then be:
$$\mathcal{D}_{AF} = \{(x_i^+, 1)\}_{1 \le i \le N} \cup \{(x_{i,j}^-, 0)\}_{1 \le i \le N,\ j \in A_i}.$$
Unfortunately, optimizing $I(\mathcal{D}_{AF}, f)$ is difficult as $A$ is global and non-differentiable. To address this, we present Algorithm 1. On each iteration, we split the data into dummy 'train' and 'test' splits. We train a model $f$ on the training portion and obtain parameters $\theta$, then use the remaining test portion to reassign the indices of $A$. For each context, we replace some number of 'easy' negatives in $A$ that $f_\theta$ classifies correctly with 'adversarial' negatives outside of $A$ that $f_\theta$ misclassifies. This process can be thought of as increasing the overall entropy of the dataset: given a strong model $f_\theta$ that is compatible with a random subset of the data, we aim to ensure it cannot generalize to the held-out set. We repeat this for several iterations to reduce the generalization ability of the model family $f$ over arbitrary train/test splits.
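The iteration described in the paragraph above can be sketched in a few lines. This is only a minimal sketch under stated assumptions, not the authors' implementation: the `train` callback, the `centroid_trainer` toy model, the fixed iteration count, and the rule "a negative is adversarial if it scores at least as high as the positive" are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]  # higher score = "looks more like a real ending"
Trainer = Callable[[List[Vector], List[Vector]], Scorer]


def adversarial_filter(
    positives: List[Vector],        # x_i^+ for each of the N contexts
    negatives: List[List[Vector]],  # pool of N^- candidates x_{i,j}^- per context
    k: int,                         # negatives kept per context (k << N^-)
    train: Trainer,                 # hypothetical: fits a scorer on (positives, negatives)
    iterations: int = 10,
    test_frac: float = 0.2,
    seed: int = 0,
) -> List[List[int]]:
    """Return assignments A, where A[i] holds k indices into negatives[i]."""
    rng = random.Random(seed)
    n = len(positives)
    assignments = [list(range(k)) for _ in range(n)]  # arbitrary initial choice
    for _ in range(iterations):
        # Random dummy train/test split over contexts.
        order = list(range(n))
        rng.shuffle(order)
        n_test = max(1, int(test_frac * n))
        test_ids, train_ids = order[:n_test], order[n_test:]
        score = train(
            [positives[i] for i in train_ids],
            [negatives[i][j] for i in train_ids for j in assignments[i]],
        )
        for i in test_ids:
            pos_score = score(positives[i])
            current = set(assignments[i])
            # 'Adversarial' candidates: outside A_i, scored at least as high as the positive.
            hard = [
                j for j in range(len(negatives[i]))
                if j not in current and score(negatives[i][j]) >= pos_score
            ]
            for slot, j in enumerate(list(assignments[i])):
                if not hard:
                    break
                if score(negatives[i][j]) < pos_score:  # 'easy': the model gets it right
                    assignments[i][slot] = hard.pop()
    return assignments


def centroid_trainer(pos: List[Vector], neg: List[Vector]) -> Scorer:
    """Toy stand-in for a stylistic model: prefers inputs nearer the positive mean."""
    def mean(rows: List[Vector]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]
    mu_pos, mu_neg = mean(pos), mean(neg)

    def score(x: Vector) -> float:
        d_pos = sum((a - b) ** 2 for a, b in zip(x, mu_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, mu_neg))
        return d_neg - d_pos
    return score
```

In this sketch a single-feature vector could stand for something like ending length: negatives whose length differs sharply from the positives are swapped out for candidates the toy model cannot tell apart from the real ending.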

Generating candidate endings
To generate counterfactuals for Swag, we use an LSTM (Hochreiter and Schmidhuber, 1997) language model (LM), conditioned on contexts from video captions. We first pretrain on BookCorpus (Zhu et al., 2015), then finetune on the video caption datasets. The architecture uses standard best practices and was validated on held-out perplexity of the video caption datasets; details are in the appendix. We use the LM to sample $N^- = 1023$ unique endings for a partial caption. Importantly, we greedily sample the endings, since beam search decoding biases the generated endings to be of lower perplexity (and thus easily distinguishable from found endings). We find this process gives good counterfactuals: the generated endings tend to use topical words, but often make little sense physically, making them well suited to our task. Further, the generated endings are marked as "gibberish" by humans only 9.1% of the time (Sec 3.5); in that case the ending is filtered out.

Stylistic models for adversarial filtering
In creating Swag, we designed the model family $f$ to pick up on low-level stylistic features that we posit should not be predictive of whether an event happens next in a video. These stylistic features are an obvious case of annotation artifacts (Cai et al., 2017; Schwartz et al., 2017).⁴ Our final classifier is an ensemble of four stylistic models:
1. A multilayer perceptron (MLP) given LM perplexity features and context/ending lengths.
2. A bag-of-words model that averages the word embeddings of the second sentence as features.
3. A one-layer CNN, with filter sizes ranging from 2 to 5, over the second sentence.
4. A bidirectional LSTM over the 100 most common words in the second sentence; uncommon words are replaced by their POS tags.
We ensemble the models by concatenating their final representations and passing them through an MLP.
On every adversarial iteration, the ensemble is trained jointly to minimize cross-entropy. The accuracies of these models (at each iteration, evaluated on a 20% split of the test dataset before indices of $A$ get remapped) are shown in Figure 2. Performance decreases from 60% to close to random chance; moreover, confusing the perplexity-based MLP is not sufficient to lower performance of the ensemble. Only once the other stylistic models are added does the ensemble accuracy drop substantially, suggesting that our approach is effective at reducing stylistic artifacts.
⁴ A broad definition of annotation artifacts might include aspects besides lexical/stylistic features: for instance, certain events are less likely semantically regardless of the context (e.g. riding a horse using a hose). For this work, we erred more conservatively and only filtered based on style.
Imagine that you are watching a video clip. The clip has a caption, but it is missing the final phrase. Please choose the best 2 caption endings, and classify each as:
• likely, if it completes the caption in a reasonable way;
• unlikely, if it sounds ridiculous or impossible;
• gibberish, if it has such serious errors that it doesn't feel like a valid English sentence.
Example: Someone is shown sitting on a fence and talking to the camera while pointing out horses. Someone
• stands in front of a podium. (likely, second best)
• rides a horse using a hose. (unlikely)
• is shown riding a horse. (likely, best)
• , the horse in a plaza field. (gibberish)
Figure 3: Mechanical Turk instructions (abridged).

Human verification
The final data-collection step is to have humans verify the data. Workers on Amazon Mechanical Turk were given the caption context, as well as six candidate endings: one found ending and five adversarially-sampled endings. The task was twofold: Turkers ranked the endings independently as likely, unlikely, or gibberish, and selected the best and second best endings (Figure 3).
We obtained the correct answers to each context in two ways. If a Turker ranks the found ending as either best or second best (73.7% of the time), we add the found ending as a gold example, with negatives from the generations not labelled best or gibberish. Further, if a Turker ranks a generated ending as best, and the found ending as second best, then we have reason to believe that the generation is good. This lets us add an additional training example, consisting of the generated best ending as the gold, and remaining generations as negatives. Examples with ≤3 non-gibberish endings were filtered out. We found after 1000 examples that the annotators tended to have high agreement, also generally choosing found endings over generations (see Table 2). Thus, we collected the remaining 112k examples with one annotator each, periodically verifying that annotators preferred the found endings.

Experiments
In this section, we evaluate the performance of various NLI models on Swag. Recall that models for our dataset take the following form: given a sentence and a noun phrase as context $c = (s, n)$, as well as a list of possible verb phrase endings $V = \{v_1, \dots, v_4\}$, a model $f_\theta$ must select an ending $\hat{i}$ that hopefully matches $i_{gold}$:
$$\hat{i} = \operatorname{argmax}_i f_\theta(s, n, v_i).$$
To study the amount of bias in our dataset, we also consider models that take as input just the ending verb phrase $v_i$, or the entire second sentence $(n, v_i)$. For our learned models, we train $f$ by minimizing multi-class cross-entropy. We consider three different types of word representations: 300d GloVe vectors from Common Crawl (Pennington et al., 2014), 300d Numberbatch vectors retrofitted using ConceptNet relations (Speer et al., 2017), and 1024d ELMo contextual representations that show improvement on a variety of NLP tasks, including standard NLI (Peters et al., 2018). We follow the final dataset split (see Section 2) using two training approaches: training on the found data, and training on the found and highly-ranked generated data. See the appendix for more details.
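To make the selection rule and the training objective concrete, here is a small sketch. It assumes a model has already produced one real-valued score per candidate ending; the function names are hypothetical and the scores in any usage are made up.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over the candidate endings."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_ending(scores: List[float]) -> int:
    """Index of the highest-scoring ending (the argmax over candidates)."""
    return max(range(len(scores)), key=scores.__getitem__)


def cross_entropy(scores: List[float], gold: int) -> float:
    """Multi-class cross-entropy of the gold ending under a softmax over endings."""
    return -math.log(softmax(scores)[gold])
```

With four endings and uninformative (equal) scores, the loss equals $\log 4$, which corresponds to the 25% chance accuracy of the four-way task.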

Unary models
The following models predict labels from a single span of text as input; this could be the ending only, the second sentence only, or the full passage.
a. fastText (Joulin et al., 2017): This library models a single span of text as a bag of n-grams, and tries to predict the probability of an ending being correct or incorrect independently.
b. Pretrained sentence encoders: We consider two types of pretrained RNN sentence encoders, SkipThoughts (Kiros et al., 2015) and InferSent (Conneau et al., 2017). SkipThoughts was trained by predicting adjacent sentences in book data, whereas InferSent was trained on supervised NLI data. For each second sentence (or just the ending), we feed the encoding into an MLP.
c. LSTM sentence encoder: Given an arbitrary span of text, we run a two-layer BiLSTM over it. The final hidden states are then max-pooled to obtain a fixed-size representation, which is then used to predict the potential for that ending.

Binary models
The following models predict labels from two spans of text. We consider two possibilities for these models: using just the second sentence, where the two text spans are $n, v_i$, or using the context and the second sentence, in which case the spans are $s, (n, v_i)$. The latter case includes many models developed for the NLI task.
d. Dual Bag-of-Words: For this baseline, we treat each sentence as a bag-of-embeddings $(c, v_i)$. We model the probability of picking an ending $i$ using a bilinear model: $\mathrm{softmax}_i(c W v_i^{\top})$.
e. Dual pretrained sentence encoders: Here, we obtain representations from SkipThoughts or InferSent for each span, and compute their pairwise compatibility using either 1) a bilinear model or 2) an MLP from their concatenated representations.
f. SNLI inference: Here, we consider two models that do well on SNLI (Bowman et al., 2015): Decomposable Attention (Parikh et al., 2016) and ESIM (Chen et al., 2017). We use pretrained versions of these models (with ELMo embeddings) on SNLI to obtain 3-way entailment, neutral, and contradiction probabilities for each example. We then train a log-linear model using these 3-way probabilities as features.
g. SNLI models (retrained): Here, we train ESIM and Decomposable Attention on our dataset: we simply change the output layer size to 1 (the potential of an ending $v_i$) with a softmax over $i$.
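The bilinear scoring used by the Dual Bag-of-Words baseline (item d) can be sketched as follows. This is an illustrative sketch only: the embedding table, the skipping of unknown tokens, and the function names are assumptions, and in practice $W$ would be learned rather than supplied by hand.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def bag_of_embeddings(tokens: List[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Average the embeddings of known tokens; unknown tokens are skipped."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def bilinear_score(c: Vector, W: Matrix, v: Vector) -> float:
    """Compute c W v^T for row vectors c and v."""
    cW = [sum(c[r] * W[r][col] for r in range(len(c))) for col in range(len(v))]
    return sum(a * b for a, b in zip(cW, v))


def ending_probabilities(c: Vector, W: Matrix, endings: List[Vector]) -> List[float]:
    """Softmax over the bilinear scores of all candidate endings."""
    scores = [bilinear_score(c, W, v) for v in endings]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the context and each ending are reduced to averaged vectors, a model of this form has no access to word order, which is one plausible reason such a baseline is weak on this task.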

Other models
We also considered the following models:
h. Length: Although length was used by the adversarial classifier, we want to verify that human validation didn't reintroduce a length bias. For this baseline, we always choose the shortest ending.
i. ConceptNet: As our task requires world knowledge, we tried a rule-based system on top of the ConceptNet knowledge base (Speer et al., 2017). For an ending sentence, we use the spaCy dependency parser to extract the head verb and its dependent object. The ending score is given by the number of ConceptNet causal relations between synonyms of the verb and synonyms of the object.
j. Human performance: To benchmark human performance, five Mechanical Turk workers were asked to answer 100 dataset questions, as was an 'expert' annotator (the first author of this paper). Predictions were combined using a majority vote.

Table 3: Performance of all models in accuracy (%). All models substantially underperform humans, although performance increases as more context is provided (left to right). We optionally train on found endings only, or found and human-validated generated endings (found+gen).
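Two of the simpler procedures in this list, the Length baseline and the majority vote over annotators, can be sketched directly. The details below are assumptions: length is measured in whitespace-separated tokens, and vote ties are resolved in favour of the answer seen first, neither of which is specified in the text.

```python
from collections import Counter
from typing import List


def shortest_ending(endings: List[str]) -> int:
    """Length baseline: index of the ending with the fewest whitespace tokens."""
    lengths = [len(ending.split()) for ending in endings]
    return lengths.index(min(lengths))


def majority_vote(predictions: List[int]) -> int:
    """Combine several annotators' answers for one question by plurality."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, five annotator answers `[3, 1, 3, 3, 0]` for one question would be combined into the single answer `3`.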

Results
We present our results in Table 3. The best model that only uses the ending is the LSTM sequence model with ELMo embeddings, which obtains 43.6%. This model, as with most models studied, greatly improves with more context: by 3.1% when given the initial noun phrase, and by an additional 4% when also given the first sentence. Further improvement is gained from models that compute pairwise representations of the inputs. While the simplest such model, Dual-BoW, obtains only 35.1% accuracy, combining InferSent sentence representations gives 40.5% accuracy (InferSent-Bilinear). The best results come from pairwise NLI models: when fully trained on Swag, ESIM+ELMo obtains 59.2% accuracy.
When comparing machine results to human results, we see there exists a lot of headroom. Though there likely is some noise in the task, our results suggest that humans (even untrained) converge to a consensus. Our in-house "expert" annotator is outperformed by an ensemble of 5 Turk workers (with 88% accuracy); thus, the effective upper bound on our dataset is likely even higher.

Swag versus existing NLI datasets
The past few years have yielded great advances in NLI and representation learning, due to the availability of large datasets like SNLI and MultiNLI (Bowman et al., 2015; Williams et al., 2018). With the release of Swag, we hope to continue this trend, particularly as our dataset largely has the same input/output format as other NLI datasets.

Figure 4: Our dataset shows a greater variety of dynamic verbs, such as "move", as well as temporal verbs such as "start" and "come." "Continue" is cut off for SNLI (it has frequency $6 \cdot 10^{-5}$). Bottom: CDF for verbs in SNLI and Swag.

We observe three key differences between our dataset and others in this space:
First, as noted in Section 1, Swag requires a unique type of temporal reasoning. A state-of-the-art NLI model such as ESIM, when bottlenecked through the SNLI notion of entailment (SNLI-ESIM), only obtains 36.1% accuracy.¹⁰ This implies that these datasets necessitate different (and complementary) forms of reasoning.
Second, our use of videos results in wide coverage of dynamic and temporal situations. Compared with SNLI, with contexts from Flickr30K (Plummer et al., 2017) image captions, Swag has more active verbs like 'pull' and 'hit,' and fewer static verbs like 'sit' and 'wear' (Figure 4).¹¹
Third, our dataset suffers from few lexical biases. Whereas fastText, a bag of n-gram model, obtains 67.0% accuracy on SNLI versus a 34.3% baseline (Gururangan et al., 2018), fastText obtains only 29.0% accuracy on Swag.¹²

Error analysis
We sought to quantify how human judgments differ from the best studied model, ESIM+ELMo. We randomly sampled 100 validation questions that ESIM+ELMo answered incorrectly, for each extracting both the gold ending and the model's preferred ending. We asked 5 Amazon Mechanical Turk workers to pick the better ending (they preferred the gold endings 94% of the time) and to select one (or more) multiple choice reasons explaining why the chosen answer was better.

Table 4:
Reason | Explanation | Freq.
Situational | The good ending is better in context. | 53.7%
Plausibility | The bad ending is implausible regardless of context. | 14.4%
Novelty | The bad ending seems redundant; it is entailed by the context. | 1.8%
Weirdness | The bad ending is semantically or grammatically malformed, e.g. 'the man is getting out of the horse.' | 18.1%
Ambiguous | Both endings seem equally likely. | 12.0%

¹⁰ The weights of SNLI-ESIM pick up primarily on entailment probability (0.59), as with neutral (0.46), while contradiction is negatively correlated (-0.42).
¹¹ Video data has other language differences; notably, character names in LSMDC were replaced by 'someone'.
¹² The most predictive individual words on Swag are infrequent in number: 'dotted' with P(+|dotted) = 77% with 10.3 counts, and P(−|similar) = 81% with 16.3 counts. (Counts from negative endings were discounted 3x, as there are 3 times as many negative endings as positive endings.)
The options, and the frequencies, are outlined in Table 4. The most common reason for the Turkers preferring the correct answer is situational (52.3% of the time), followed by weirdness (17.5%) and plausibility (14.4%). This suggests that ESIM+ELMo already does a good job at filtering out weird and implausible answers, with the main bottleneck being grounded physical understanding. The ambiguous percentage is also relatively low (12.0%), implying significant headroom.

Qualitative examples
Last, we show several qualitative examples in Table 5. Though models can do decently well by identifying complex alignment patterns between the two sentences (e.g. being "up a tree" implies that "tree" is the end phrase), the incorrect model predictions suggest this strategy is insufficient. […] about what happens next in making an omelet.

Where to go next?
Our results suggest that Swag is a challenging testbed for NLI models. However, the adversarial models used to filter the dataset are purely stylistic and focus on the second sentence; thus, subtle artifacts still likely remain in our dataset. These patterns are ostensibly picked up by the NLI models (particularly when using ELMo features), but the large gap between machine and human performance suggests that more is required to solve the dataset. As models are developed for commonsense inference, and more broadly as the field of NLP advances, we note that AF can be used again to create a more adversarial version of Swag using better language models and AF models.
The NLI task requires a variety of commonsense knowledge (LoBue and Yates, 2011), which our work complements. However, previous datasets for NLI have been challenged by unwanted annotation artifacts (Gururangan et al., 2018; Poliak et al., 2018) or scale issues. Our work addresses these challenges by constructing a new NLI benchmark focused on grounded commonsense reasoning, and by introducing an adversarial filtering mechanism that substantially reduces known and easily detectable annotation artifacts.
Commonsense NLI. Several datasets have been introduced to study NLI beyond linguistic entailment: for inferring likely causes and endings given a sentence (COPA; Roemmele et al., 2011), for choosing the most sensible ending to a short story (RocStories; Mostafazadeh et al., 2016; Sharma et al., 2018), and for predicting likelihood of a hypothesis by regressing to an ordinal label (JOCI; Zhang et al., 2017). These datasets are relatively small: 1k examples for COPA and 10k cloze examples for RocStories. JOCI increases the scale by generating the hypotheses using a knowledge graph or a neural model. In contrast to JOCI, where the task was formulated as a regression task on the degree of plausibility of the hypothesis, we frame commonsense inference as a multiple choice question to reduce the potential ambiguity in the labels and to allow for direct comparison between machines and humans. In addition, Swag's use of adversarial filtering increases diversity of situations and counterfactual generation quality.
Last, another related task formulation is sentence completion or cloze, where the task is to predict a single word that is removed from a given context (Zweig and Burges, 2011; Paperno et al., 2016).¹⁴ Our work in contrast requires longer textual descriptions to reason about.
Vision datasets. Several resources have been introduced to study temporal inference in vision. The Visual Madlibs dataset has 20k image captions about hypothetical next/previous events (Yu et al., 2015); similar to our work, the test portion is multiple-choice, with counterfactual answers retrieved from similar images and verified by humans. The question of 'what will happen next?' has also been studied in photo albums (Huang et al., 2016), videos of team sports (Felsen et al., 2017), and egocentric dog videos (Ehsani et al., 2018). Last, annotation artifacts are also a recurring problem for vision datasets such as Visual Genome (Zellers et al., 2018) and Visual QA (Jabri et al., 2016); recent work was done to create a more challenging VQA dataset by annotating complementary image pairs (Goyal et al., 2016).
Reducing gender/racial bias. Prior work has sought to reduce demographic biases in word embeddings (Zhang et al., 2018) as well as in image recognition models (Zhao et al., 2017). Our work has focused on producing a dataset with minimal annotation artifacts, which in turn helps to avoid some gender and racial biases that stem from elicitation (Rudinger et al., 2017). However, it is not perfect in this regard, particularly due to biases in movies (Schofield and Mehr, 2016; Sap et al., 2017). Our methodology could potentially be extended to construct datasets free of (possibly intersectional) gender or racial bias.
Physical knowledge. Prior work has studied learning grounded knowledge about objects and verbs: from knowledge bases (Li et al., 2016), syntax parses (Forbes and Choi, 2017), word embeddings (Lucy and Gauthier, 2017), and images and dictionary definitions (Zellers and Choi, 2017).
An alternate thread of work has been to learn scripts: high-level representations of event chains (Schank and Abelson, 1975; Chambers and Jurafsky, 2009). Swag evaluates both of these strands.
¹⁴ Prior work on sentence completion filtered negatives with heuristics based on LM perplexities. We initially tried something similar, but found the result to still be gameable.

Conclusion
We propose a new challenge of physically situated commonsense inference that broadens the scope of natural language inference (NLI) with commonsense reasoning. To support research toward commonsense NLI, we create a large-scale dataset Swag with 113k multiple-choice questions. Our dataset is constructed using Adversarial Filtering (AF), a new paradigm for robust and cost-effective dataset construction that allows datasets to be constructed at scale while automatically reducing annotation artifacts that can be easily detected by a committee of strong baseline models. Our adversarial filtering paradigm is general, allowing potential applications to other datasets that require human composition of question-answer pairs.

[…]dricks et al., 2017). However, many of the referring expressions are themselves sentence fragments (e.g. "first time we see people"), so we ultimately did not use this dataset. Additionally, we considered the Visual Madlibs dataset (Yu et al., 2015), as it contains 10k hypothetical captions written by Mechanical Turk workers about what might happen next given an image. However, these captions are fundamentally different from the rest of the data (as they're about what might happen next); as a result, they use different types of language. They also have different tenses versus the other datasets that we considered (e.g. past tense), as a result of the "Mad-libs" style of data collection.

A.2 Details of the language model
Our language model follows standard best practices: the input and output embedding layers are tied (Inan et al., 2017; Press and Wolf, 2017), all embedding and hidden layers are set to 512, and we used recurrent dropout (Gal and Ghahramani, 2016) on the hidden states and embedding layer. We additionally train a backwards language model alongside the forward language model, and they share embedding parameters. This adds extra supervision to the embedding layer and gives us another way to score candidate generations. We first pretrain the language model for two epochs on pairs of two sentences in the Toronto Books dataset (Zhu et al., 2015), and then train on sentence pairs from ActivityNet Captions and LSMDC, validating on held-out perplexity. For optimization, we use Adam (Kingma and Ba, 2015) with a learning rate of 10⁻³ and clip gradients to norm 1.0.
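To illustrate the gradient clipping mentioned above, here is a minimal plain-Python sketch. The text does not say whether the norm is computed jointly over all parameters or per parameter; the sketch assumes a joint (global) L2 norm, and the function name `clip_by_global_norm` is hypothetical rather than taken from our implementation.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradient vectors so that their joint L2 norm is at most max_norm.

    `grads` is a list of gradient vectors (lists of floats), one per parameter.
    Assumption: the norm is taken over all parameters together.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(vec) for vec in grads]
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads]

# Two parameter gradients with joint norm 5.0 are scaled down by a factor of 0.2.
clipped = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]])
```

Clipping rescales all gradients by the same factor, so the update direction is preserved and only its magnitude is bounded.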
All of the above details were validated using perplexity on a held-out set of the video datasets during early experimentation. Our final development set forward perplexity was 31.2 and backward perplexity was 30.4. We tried more complicated language modeling architectures, such as those of Józefowicz et al. (2016), but did not see an improvement due to overfitting.

A.3 Language model features for the MLP, during adversarial filtering
We obtained LM perplexity features to be used during adversarial filtering in the following ways, using both directions of the bidirectional language model. We extract perplexities for the context by itself (going forward), the ending given the context (going forward), the context given the ending (going backward), and the ending by itself (going backward). We also extract the probability of the final generated token going forward, since sentences sometimes reach the length limit of 25 tokens and end unnaturally.
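The five features described above can be sketched as follows, assuming the language models have already produced per-token natural-log probabilities for each segment. The function names and the argument layout are hypothetical and only illustrate the computation.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a token sequence: exp of the negative mean log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def lm_features(fwd_context, fwd_ending, bwd_context, bwd_ending):
    """Build the feature vector from per-token log-probabilities.

    fwd_context: context alone, forward LM
    fwd_ending:  ending given the context, forward LM
    bwd_context: context given the ending, backward LM
    bwd_ending:  ending alone, backward LM
    """
    return [
        perplexity(fwd_context),
        perplexity(fwd_ending),
        perplexity(bwd_context),
        perplexity(bwd_ending),
        # Probability of the last generated token going forward; a low value
        # can signal an ending that was cut off at the length limit.
        math.exp(fwd_ending[-1]),
    ]

# Toy log-probabilities (made up for illustration only).
features = lm_features(
    fwd_context=[math.log(0.5)] * 4,
    fwd_ending=[math.log(0.25), math.log(0.25), math.log(0.1)],
    bwd_context=[math.log(0.5)] * 4,
    bwd_ending=[math.log(0.2)] * 3,
)
```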

A.4 Refining the generated answers to four distractors
In the main paper, we noted that we started with 1023 negatives per example, which the adversarial filtering process filtered down to 9. Five of these were passed to Mechanical Turk workers, and we were left with anywhere between 0 and 4 of these per example as "distractors." (Note that we always filtered out the second-best option selected by the turkers.) This means that for many of our examples (62%) we actually have a fourth distractor. In these cases, we sorted the distractors by their "unlikely/likely" score, so that the fourth distractor was the one deemed most likely. We still provide the fourth distractor in the training set so that it can be used in future work; however, for simplicity we did not train on it.
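The ordering step can be sketched as below. The `(text, score)` representation, the numeric scores, and the function name are assumptions made for illustration; higher scores stand for endings that annotators judged more likely.

```python
def order_distractors(distractors):
    """Sort (text, likelihood_score) pairs from least to most likely.

    With four distractors, the last element is the one deemed most likely,
    i.e. the fourth distractor that is set aside during training.
    """
    return sorted(distractors, key=lambda pair: pair[1])

# Hypothetical distractors with made-up likelihood scores.
example = [
    ("is in the crowd, watching the dancers.", 0.4),
    ("sits on a bench as her sister plays with the doll.", 0.1),
    ("smiles with someone as the music plays.", 0.7),
    ("walks off the stage.", 0.2),
]
ordered = order_distractors(example)
training_distractors = ordered[:3]
fourth_distractor = ordered[3]
```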

A.5 More information about Mechanical turk
We used several tricks to keep the interannotator agreement high (with a pairwise percent agreement of 79% at classifying whether an ending is in the Top 2). First, we had a screening HIT where turkers were given detailed instructions for the task, and only the best-scoring turk workers qualified for the remaining HITs. Second, we periodically dequalified turkers that had a low agreement with the gold endings: any turk worker with an accuracy of less than 55% at classifying the "gold" ending as the best or second best, over 10 or more HITs, had the qualification taken away. We also gave small bonuses to turkers with high accuracy.
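The dequalification rule above amounts to a simple threshold check. The sketch below assumes one gold judgement per HIT, recorded as a boolean; the function name and data layout are hypothetical.

```python
def should_dequalify(gold_in_top2, min_hits=10, min_accuracy=0.55):
    """Return True if a worker should lose the qualification.

    gold_in_top2: list of booleans, one per completed HIT, indicating whether
    the worker ranked the gold ending as best or second best.
    Assumption: each HIT contributes exactly one such judgement.
    """
    if len(gold_in_top2) < min_hits:
        # Too few HITs to judge the worker yet.
        return False
    accuracy = sum(gold_in_top2) / len(gold_in_top2)
    return accuracy < min_accuracy

low_accuracy_worker = should_dequalify([True] * 5 + [False] * 5)   # accuracy 0.50
new_worker = should_dequalify([False] * 9)                         # only 9 HITs
good_worker = should_dequalify([True] * 6 + [False] * 4)           # accuracy 0.60
```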
During our crowdsourcing, we tried to pay the Turkers a fair wage (median $8.57 per hour), and they left positive comments for us on TurkOpticon and TurkerView. The total dataset cost was $23,000, or an average of 20 cents per example.

A.6 Implementation details of the models considered
We implemented the neural models in PyTorch using the AllenNLP library (Gardner et al., 2018).

A.7 More info about dataset diversity
The final dataset has a vocabulary size of 21,000. We also visualize the coverage of the dataset with a topic model (see Table 7).

A.8 Comparing the distribution of verbs with MultiNLI
We also produced an extension to Figure 4 of the main paper that includes verbs from MultiNLI, shown in Figure 5. We ended up not including it in the paper because we wanted to focus our comparison on SNLI and Swag (as they are both grounded datasets). Interestingly, we find that Swag has a less skewed cumulative distribution of verbs up to around 120, after which MultiNLI has a slightly less skewed distribution. This is possibly due to the broader set of domains considered by MultiNLI, whereas we consider videos (which is also a broad domain, but one that still underrepresents words highly used in newswire text, for instance).
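A cumulative distribution over frequency-ranked verbs, of the kind compared here, can be computed as in the following sketch. It assumes a flat list of already extracted and lemmatized verbs; the function name is hypothetical, and the toy data is made up.

```python
from collections import Counter
from itertools import accumulate

def verb_cdf(verbs):
    """Cumulative share of verb occurrences covered by the top-k verbs.

    Returns a list whose k-th entry (1-indexed) is the fraction of all verb
    tokens accounted for by the k most frequent verbs.
    """
    counts = Counter(verbs)
    total = sum(counts.values())
    ranked_counts = [count for _, count in counts.most_common()]
    return [running / total for running in accumulate(ranked_counts)]

# Toy corpus: "sit" dominates, so the curve starts high (a skewed distribution).
cdf = verb_cdf(["sit", "sit", "move", "start"])
```

A less skewed dataset yields a curve that rises more slowly, since its most frequent verbs account for a smaller share of all verb occurrences.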

A.9 More examples
We have more qualitative examples in Table 8.

Algorithm 1: Adversarial filtering (AF) of negative samples. During our experiments, we set N^easy = 2 for refining a population of N^- = 1023 negative examples to k = 9, and used an 80%/20% train/test split.
while convergence not reached do
  • Split the dataset D randomly up into training and testing portions D^tr and D^te.
  • Optimize a model f_θ on D^tr.
  for index i in D^te do
    • Identify easy indices:
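The inner loop can be sketched in Python as follows. Because the listing above is cut off after "Identify easy indices", the definition of "easy" (a selected negative that the model scores below the gold ending) and the replacement rule (swap it for an unselected negative with a higher score) are assumptions, as are all names and the toy scores. The random split and model training are omitted; the sketch takes model scores as given.

```python
import random

def af_filter_step(pool_scores, gold_scores, assigned, test_ids, n_easy=2, rng=random):
    """One pass of the inner loop over held-out examples.

    pool_scores[i][j]: model score for negative j of example i (higher = more plausible)
    gold_scores[i]:    model score for the gold ending of example i
    assigned[i]:       indices of the negatives currently selected for example i
    Assumption: an "easy" negative is one scored below the gold ending, and it
    is replaced by a random unselected negative that the model scores higher.
    """
    for i in test_ids:
        easy = [j for j in assigned[i] if pool_scores[i][j] < gold_scores[i]]
        for j in easy[:n_easy]:
            harder = [
                k for k in range(len(pool_scores[i]))
                if k not in assigned[i] and pool_scores[i][k] > pool_scores[i][j]
            ]
            if harder:
                assigned[i][assigned[i].index(j)] = rng.choice(harder)
    return assigned

# Toy example: negatives 0 and 2 are easy and get swapped for negatives 1 and 3.
result = af_filter_step(
    pool_scores=[[0.1, 0.9, 0.2, 0.8]],
    gold_scores=[0.5],
    assigned=[[0, 2]],
    test_ids=[0],
)
```

Repeating such a step with a freshly trained model on each new split is what would drive held-out accuracy toward chance in this sketch.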

Figure 2: Test accuracy by AF iteration, under the negatives given by A. The accuracy drops from around 60% to close to random chance. For efficiency, the first 100 iterations only use the MLP.

Figure 4: Top: Distribution of the 40 top verbs in the union of SNLI and Swag. Our dataset shows a greater variety of dynamic verbs, such as "move", as well as temporal verbs such as "start" and "come." "Continue" is cut off for SNLI (it has frequency 6 · 10⁻⁵). Bottom: CDF for verbs in SNLI and Swag.

Figure 5: Bottom: CDF for verbs in SNLI, Swag, and MultiNLI.

The lady demonstrates wrapping gifts using her feet. The lady
a) shows us the different shapes of the ornaments. (99.67%)
b) continues playing when the lady talks to the camera. (0.26%)
c) takes the desserts from the box and continues talking to the camera. (0.07%)
d) cuts the paper with scissors. (0.01%)

Table 2: Annotators tend to label the found ending as likely and within the top 2 (column 2); in other cases, the example is filtered out. Both label groups have high inter-annotator agreement, in terms of Krippendorff's α and pairwise percent agreement.

Table 4: Justifications for ranking the gold answer over a wrong answer chosen by ESIM+ELMo.

Table 5: Example questions answered by the best model, ESIM+ELMo, sorted by model probability. Correct model predictions are in blue; incorrect model predictions are in red. The right answers are bolded.

cient. For instance, answering "An old man rides a small bumper car" requires knowledge about bumper cars and how they differ from regular cars: bumper cars are tiny, don't drive on roads, and don't work in parking lots, eliminating the alternatives. However, this knowledge is difficult to extract from existing corpora: for instance, the ConceptNet entry for Bumper Car has only a single relation: bumper cars are a type of vehicle. Other questions require intuitive physical reasoning: e.g., for "he pours the raw egg batter into the pan,"

Table 6: Statistics of Swag.

Table 8: More (incorrect) questions answered by the best model, ESIM+ELMo, sorted by model probability. The right answers are bolded.

Ehsani, Hessam Bagherinezhad, Joseph Redmon, Roozbeh Mottaghi, and Ali Farhadi. 2018. Who let the dogs out? modeling dog behavior from visual data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Panna Felsen, Pulkit Agrawal, and Jitendra Malik. 2017. What will happen next? forecasting player moves in sports videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3342-3351.
Maxwell Forbes and Yejin Choi. 2017. Verb physics: Relative physical knowledge of actions and objects. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 266-276.
Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent