Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning

Understanding narratives requires reading between the lines, which in turn requires interpreting the likely causes and effects of events, even when they are not mentioned explicitly. In this paper, we introduce Cosmos QA, a large-scale dataset of 35,600 problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. In stark contrast to most existing reading comprehension datasets, where the questions focus on factual and literal understanding of the context paragraph, our dataset focuses on reading between the lines over a diverse collection of people's everyday narratives, asking such questions as "what might be the possible reason of ...?" or "what would have happened if ..." that require reasoning beyond the exact text spans in the context. To establish baseline performances on Cosmos QA, we experiment with several state-of-the-art neural architectures for reading comprehension, and also propose a new architecture that improves over the competitive baselines. Experimental results demonstrate a significant gap between machine (68.4%) and human performance (94%), pointing to avenues for future research on commonsense machine comprehension. Dataset, code and leaderboard are publicly available at https://wilburone.github.io/cosmos.


Introduction
Reading comprehension requires not only understanding what is stated explicitly in text, but also reading between the lines, i.e., understanding what is not stated yet obviously true (Norvig, 1987).
For example, after reading the first paragraph in Figure 1, we can understand that the writer is not a child, yet needs someone to dress him or her every morning, and appears frustrated with the current situation. Combining these clues, we can infer that the plausible reason for the writer being dressed by other people is that he or she may have a physical disability.
As another example, in the second paragraph of Figure 1, we can infer that the woman was admitted to a psychiatric hospital, although this is not mentioned explicitly in the text, and also that the job of the hospital staff is to stop patients from committing suicide. Furthermore, what the staff should have done, in the specific situation described, was to stop the woman from taking the elevator.
There are two important characteristics of the problems presented in Figure 1. First, the correct answers are not explicitly mentioned anywhere in the context paragraphs, thus requiring reading between the lines through commonsense inference. Second, selecting the correct answer requires reading the context paragraphs. That is, if we were not provided with the context paragraph for the second problem, for example, the plausible correct answer could have been B or C instead.
In this paper, we focus on reading comprehension that requires contextual commonsense reasoning, as illustrated in the examples in Figure 1. Such reading comprehension is an important aspect of how people read and comprehend text, and yet is relatively less studied in the prior machine reading literature. To support research toward commonsense reading comprehension, we introduce COSMOS QA (Commonsense Machine Comprehension), a new dataset with 35,588 reading comprehension problems that require reasoning about the causes and effects of events, the likely facts about people and objects in the scene, and hypotheticals and counterfactuals. Our dataset covers a diverse range of everyday situations, with 21,886 distinct contexts taken from blogs of personal narratives.
The vast majority (93.8%) of our dataset requires contextual commonsense reasoning, in contrast with existing machine reading comprehension (MRC) datasets such as SQuAD (Rajpurkar et al., 2016), RACE (Lai et al., 2017), Narrative QA (Kočiskỳ et al., 2018), and MCScript (Ostermann et al., 2018), where only a relatively small portion of the questions (e.g., 27.4% in MCScript) require commonsense inference. In addition, the correct answer cannot be found in the context paragraph as a text span, thus we formulate the task as multiple-choice questions for easy and robust evaluation. However, our dataset can also be used for generative evaluation, as will be demonstrated in our empirical study.
To establish baseline performances on COSMOS QA, we explore several state-of-the-art neural models developed for reading comprehension. Furthermore, we propose a new architecture variant that is better suited for commonsense-driven reading comprehension. Still, experimental results demonstrate a significant gap between machine (68.4% accuracy) and human performance (94.0%). We provide detailed analysis to offer insights into potentially promising research directions.

Context Paragraphs
We gather a diverse collection of everyday situations from a corpus of personal narratives (Gordon and Swanson, 2009) from the Spinn3r Blog Dataset (Burton et al., 2009). Appendix A provides additional details on data pre-processing.

Question and Answer Collection
We use Amazon Mechanical Turk (AMT) to collect questions and answers. Specifically, for each paragraph, we ask a worker to craft at most two questions that are related to the context and require commonsense knowledge. We encourage the workers to craft questions from, but not limited to, the following four categories:
• Causes of events: What may (or may not) be the plausible reason for an event?
• Effects of events: What may (or may not) happen before (or after, or during) an event?
• Facts about entities: What may (or may not) be a plausible fact about someone or something?
• Counterfactuals: What may (or may not) happen if an event happens (or did not happen)?
These four categories of questions cover all 9 types of social commonsense of Sap et al. (2018). Moreover, the resulting commonsense also aligns with 19 ConceptNet relations, e.g., Causes, HasPrerequisite and MotivatedByGoal, covering about 67.8% of ConceptNet types. For each question, we also ask a worker to craft at most two correct answers and three incorrect answers. We paid workers $0.7 per paragraph, which is about $14.8 per hour. Appendix B provides additional details on AMT instructions.

Validation
We create multiple tasks to have humans verify the data. Given a paragraph, a question, a correct answer and three incorrect answers, we ask AMT workers to answer the following sequence of questions: (1) whether the paragraph is inappropriate or nonsensical, (2) whether the question is nonsensical or not related to the paragraph, (3) whether they can determine the most plausible correct answer, (4) if they can determine the correct answer, whether the answer requires commonsense knowledge, and (5) if they can determine the correct answer, whether the answer can be determined without looking at the paragraph.
We follow the same criterion as in Section 2.2 and ask 3 workers to work on each question set. Workers are paid $0.1 per question. We consider a question set valid if at least two workers correctly picked the intended answer and all of the workers judged the paragraph, question, and answers to be satisfactory. Finally, we obtain 33,219 valid question sets in total.
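The validity criterion above can be written as a small filter. The following is a minimal sketch, not the authors' code: the `Judgment` record and its field names are hypothetical, and only the two conditions stated in the text (at least two workers pick the intended answer, and all workers find the question set satisfactory) are taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    """One worker's validation result for a question set (hypothetical schema)."""
    picked_answer: int   # index of the answer the worker selected
    satisfactory: bool   # paragraph, question and answers all judged acceptable


def is_valid(judgments: List[Judgment], intended_answer: int,
             min_correct: int = 2) -> bool:
    """Keep a question set if enough workers picked the intended answer
    and every worker judged the paragraph/question/answers satisfactory."""
    n_correct = sum(j.picked_answer == intended_answer for j in judgments)
    return n_correct >= min_correct and all(j.satisfactory for j in judgments)
```

With three workers per question set, a set is rejected as soon as one worker flags it as unsatisfactory, even if all three pick the intended answer.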

Unanswerable Question Creation
With human validation, we also obtain a set of questions for which workers can easily determine the correct answer without looking at the context or using commonsense knowledge. To take advantage of such questions and encourage AI systems to be more consistent with human understanding, we create unanswerable questions for COSMOS QA. Specifically, from validation outputs, we collect 2,369 questions for which at least two workers correctly picked the answer and at least one worker determined that it is answerable without looking at the context or requires no common sense. We replace the correct choice of these questions with a "None of the above" choice.
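As an illustration of the selection and replacement rule just described, here is a hedged sketch. The function names and the dictionary layout are assumptions made for this example; in particular, keeping the replaced slot as the gold label is one plausible reading of the text rather than a documented detail.

```python
from typing import Dict, List

NONE_CHOICE = "None of the above"


def is_context_free(n_correct: int, n_flagged: int) -> bool:
    """At least two workers picked the answer, and at least one flagged the
    question as answerable without the context or without common sense."""
    return n_correct >= 2 and n_flagged >= 1


def make_unanswerable(choices: List[str], correct_idx: int) -> Dict[str, object]:
    """Replace the correct choice with the "None of the above" option.

    Assumption: the replaced slot stays the gold label, so the right
    prediction for the new question is "None of the above".
    """
    new_choices = list(choices)          # copy, leave the input untouched
    new_choices[correct_idx] = NONE_CHOICE
    return {"choices": new_choices, "label": correct_idx}
```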
To create false negative training instances, we randomly sample 70% of questions from the 33,219 good question sets and replace their least challenging negative answer with "None of the above". Specifically, we fine-tune three BERT next sentence prediction models on COSMOS: BERT(A|P, Q), BERT(A|P), BERT(A|Q), where P, Q, A denote the paragraph, question, and answer. BERT(A|·) denotes the likelihood of an answer A being the next sentence of the conditioning input. The least challenging negative answer is determined by

Train / Dev / Test Split
We finally obtain 35,588 question sets for our COSMOS dataset. To ensure that the development and test sets are of high quality, we identify a group of workers who excelled in the generation task for questions and answers, and randomly sample 7K question sets authored by these excellent workers as the test set, and 3K question sets as the development set. The remaining questions are all used as the training set. Table 1 shows dataset statistics.
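The split procedure can be sketched as follows. This is an illustrative reading of the paragraph above, not the released pipeline: the `worker_id` field, the `expert_workers` set, and the assumption that the development set is also drawn from the expert-authored pool are all hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

QuestionSet = Dict[str, object]


def split_dataset(question_sets: List[QuestionSet], expert_workers: Set[str],
                  n_test: int = 7000, n_dev: int = 3000, seed: int = 0
                  ) -> Tuple[List[QuestionSet], List[QuestionSet], List[QuestionSet]]:
    """Return (train, dev, test); dev and test are sampled only from
    question sets authored by the selected expert workers."""
    rng = random.Random(seed)
    expert_ids = [i for i, qs in enumerate(question_sets)
                  if qs["worker_id"] in expert_workers]
    if len(expert_ids) < n_test + n_dev:
        raise ValueError("not enough expert-authored question sets")
    held_out = rng.sample(expert_ids, n_test + n_dev)
    test_ids, dev_ids = set(held_out[:n_test]), set(held_out[n_test:])
    # Everything that was not held out goes to training.
    train = [qs for i, qs in enumerate(question_sets)
             if i not in test_ids and i not in dev_ids]
    dev = [question_sets[i] for i in sorted(dev_ids)]
    test = [question_sets[i] for i in sorted(test_ids)]
    return train, dev, test
```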

Data Analysis
Figure 2 compares frequent trigram prefixes in COSMOS and SQuAD 2.0 (Rajpurkar et al., 2018). Most of the frequent trigram prefixes in COSMOS, e.g., "why", "what may happen", and "what will happen", are almost absent from SQuAD 2.0, which demonstrates the unique challenge our dataset contributes. We randomly sample 500 answerable questions and manually categorize them according to their contextual commonsense reasoning types. Figure 3 shows representative examples. Table 2 shows the distribution of the question types.
• Pre-/post-conditions: causes/effects of an event.
• Reactions: possible reactions of people or objects to an event.
• Temporal events: what events might happen before or after the current event.
• Situational facts: facts that can be inferred from the description of a particular situation.
• Counterfactuals: what might happen given a counterfactual condition.
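The trigram-prefix comparison mentioned at the start of this section amounts to counting the first three tokens of every question. A minimal sketch, assuming lowercasing and whitespace tokenization (the paper does not state the tokenization used for Figure 2):

```python
from collections import Counter
from typing import Iterable


def trigram_prefix_counts(questions: Iterable[str]) -> Counter:
    """Count the first three whitespace tokens of each question, lowercased."""
    counts: Counter = Counter()
    for question in questions:
        tokens = question.lower().split()
        counts[" ".join(tokens[:3])] += 1
    return counts
```

Calling `trigram_prefix_counts(questions).most_common(20)` on two datasets gives the kind of side-by-side prefix distribution the figure reports.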

BERT with Multiway Attention
Multiway attention (Wang et al., 2018a; Zhu et al., 2018) has been shown to be effective in capturing the interactions between each pair of input paragraph, question and candidate answers, leading to better context interpretation, while BERT fine-tuning (Devlin et al., 2018) also shows its prominent ability in commonsense inference. To further enhance the context understanding ability of BERT fine-tuning, we perform multiway bidirectional attention over the BERT encoding output.
Figure 4 shows the overview of the architecture.
Encoding with Pre-trained BERT Given a paragraph, a question, and a set of candidate answers, the goal is to select the most plausible correct answer from the candidates. We formulate the input paragraph as $P = \{p_0, p_1, \ldots, p_n\}$, the question as $Q = \{q_0, q_1, \ldots, q_k\}$ and a candidate answer as $A = \{a_0, a_1, \ldots, a_s\}$, where $p_i$, $q_i$ and $a_i$ are the i-th words of the paragraph, question and candidate answer respectively. Following Devlin et al. (2018), given the input P, Q and A, we apply the same tokenizer and concatenate all tokens into a single sequence, in which [CLS] is a special token used for classification and [SEP] is a delimiter. Each token is initialized with a vector by summing the corresponding token, segment and position embeddings from pre-trained BERT, and then encoded into a hidden state. Finally, we get $[H_{cls}, H_P, H_Q, H_A]$ as the encoding output.
Multiway Attention To encourage better context interpretation, we perform multiway attention over the BERT encoding output. Taking the paragraph P as an example, we compute three types of attention weights to capture its correlation to the question, the answer, and both the question and answer, and get question-attentive, answer-attentive, and question-and-answer-attentive paragraph representations, where $W_t$ and $b_t$ are the learnable parameters of the attention. Next, we fuse these representations with the original encoding output of P.
Classification For each candidate answer $A_i$, we compute the loss $L(A_i \mid P, Q)$.
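To make the data flow of the multiway attention concrete, the sketch below computes a pooled paragraph representation from toy encodings. It is an illustration under stated assumptions, not the authors' implementation: the bilinear softmax attention in `attend` and the difference and element-wise product features in `fuse` are assumed forms, all function names are hypothetical, and plain Python lists are used only to keep the example dependency-free (a real model would use a tensor library). The concatenation, ReLU, and column-wise max pooling follow the paper's description of the fusion step.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(scores: Matrix) -> Matrix:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in scores:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attend(h_x: Matrix, h_y: Matrix, w_t: Matrix, b_t: List[float]) -> Matrix:
    """h_y-attentive representation of h_x (assumed bilinear softmax attention)."""
    projected = [[v + b for v, b in zip(row, b_t)] for row in matmul(h_x, w_t)]
    weights = softmax_rows(matmul(projected, transpose(h_y)))
    return matmul(weights, h_y)


def fuse(h_x: Matrix, m: Matrix, w_p: Matrix, b_p: List[float]) -> Matrix:
    """ReLU([h_x - m : h_x * m] W_P + b_P); the two features are assumptions."""
    feats = [[x - y for x, y in zip(hr, mr)] + [x * y for x, y in zip(hr, mr)]
             for hr, mr in zip(h_x, m)]
    return [[max(v + b, 0.0) for v, b in zip(row, b_p)]
            for row in matmul(feats, w_p)]


def paragraph_representation(h_p: Matrix, h_q: Matrix, h_a: Matrix,
                             w_t: Matrix, b_t: List[float],
                             w_p: Matrix, b_p: List[float]) -> List[float]:
    """Fuse the question-, answer- and question+answer-attentive views of the
    paragraph, concatenate them feature-wise, and max-pool over tokens."""
    views = [fuse(h_p, attend(h_p, h_y, w_t, b_t), w_p, b_p)
             for h_y in (h_q, h_a, h_q + h_a)]   # h_q + h_a stacks token rows
    rows = [r1 + r2 + r3 for r1, r2, r3 in zip(*views)]
    return [max(col) for col in zip(*rows)]      # column-wise max pooling
```

With hidden size d, the result has 3d entries: one pooled block per attentive view. The same routine could be applied with the roles of paragraph, question, and answer swapped.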

Baseline Methods
We explore two categories of baseline methods: reading comprehension approaches and pretrained language model based approaches.
Sliding Window (Richardson et al., 2013) measures the similarity of each candidate answer with each window of m words of the paragraph. Stanford Attentive Reader (Chen et al., 2016) performs a bilinear attention between the question and paragraph for answer prediction. Gated-Attention Reader (Dhingra et al., 2017) performs multi-hop attention between the question and a recurrent neural network based paragraph encoding states. Co-Matching (Wang et al., 2018b) captures the interactions between question and paragraph, as well as answer and paragraph with attention. Commonsense-RC (Wang et al., 2018a) applies three-way unidirectional attention to model interactions between paragraph, question, and answers. GPT-FT (Radford et al., 2018) is based on a generative pre-trained transformer language model, followed by a fine-tuning step on COSMOS.
Human Performance To get human performance on COSMOS QA, we randomly sample 200 question sets from the test set, and ask 3 workers from AMT to select the most plausible correct answer. Each worker is paid $0.1 per question set. We finally combine the predictions for each question with majority vote.
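The majority-vote aggregation used for the human performance estimate can be sketched as below. The handling of a three-way disagreement is not specified in the paper; this sketch assumes such a question simply counts as incorrect, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_vote(predictions: Sequence[int]) -> Optional[int]:
    """Return the choice picked by a strict majority of workers, else None."""
    choice, count = Counter(predictions).most_common(1)[0]
    return choice if count > len(predictions) / 2 else None


def human_accuracy(worker_predictions: List[Sequence[int]],
                   gold: Sequence[int]) -> float:
    """Accuracy of the majority-voted answers against the gold labels.

    Assumption: a question without a strict majority is scored as wrong.
    """
    voted = [majority_vote(preds) for preds in worker_predictions]
    return sum(v == g for v, g in zip(voted, gold)) / len(gold)
```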

Results and Analysis
Table 3 shows the characteristics and performance of varying approaches and human performance (Appendix C shows the implementation details). Most of the reading comprehension approaches apply attention to capture the correlation between paragraph, question and each candidate answer, and tend to select the answer that is semantically closest to the paragraph. For example, in Figure 5, the Commonsense-RC baseline mistakenly selected the choice that has the most overlapping words with the paragraph, without any commonsense reasoning. However, our analysis shows that more than 83% of correct answers in COSMOS QA are not stated in the given paragraphs, thus simply comparing the semantic relatedness doesn't work well. Pre-trained language models with fine-tuning achieve more than 20% improvement over reading comprehension approaches.
By performing attention over BERT-FT, the performance is further improved, which supports our assumption that incorporating interactive attentions can further enhance the context interpretation of BERT-FT. For example, in Figure 5, BERT-FT mistakenly selected choice A, which can possibly be entailed by the paragraph. However, by performing multiway attention to further enhance the interactive comprehension of context, question and answer, our approach successfully selected the correct answer.

Ablation Study
Many recent studies have suggested the importance of measuring the dataset bias by checking the model performance based on partial information of the problem (Gururangan et al., 2018; Cai et al., 2017). Therefore, we report a problem ablation study in Table 4 using BERT-FT. Ablating other components of the problems causes more significant drops in performance.

Knowledge Transfer Through Fine-tuning
Recent studies (Howard and Ruder, 2018; Min et al., 2017; Devlin et al., 2018) have shown the benefit of fine-tuning on similar tasks or datasets for knowledge transfer. Considering the unique challenge of COSMOS, we explore two related multiple-choice datasets for knowledge transfer: RACE (Lai et al., 2017), a large-scale reading comprehension dataset, and SWAG (Zellers et al., 2018), a large-scale commonsense inference dataset. Specifically, we first fine-tune BERT on RACE or SWAG or both, and directly test on COSMOS to show the impact of knowledge transfer. Furthermore, we sequentially fine-tune BERT on RACE or SWAG (or both) and then on COSMOS. As Table 5 shows, with direct knowledge transfer, RACE provides significantly more benefit than SWAG, since COSMOS requires more understanding of the interaction between paragraph, question and each candidate answer. With sequential fine-tuning, SWAG provides better performance, which indicates that with fine-tuning on SWAG, BERT can obtain better commonsense inference ability, which is also beneficial to COSMOS.

Error Analysis
We randomly select 100 errors made by our approach from the dev set, and identify 4 phenomena:
Complex Context Understanding: In 30% of the errors, the context requires complicated cross-sentence interpretation and reasoning. Taking P1 in Figure 6 as an example, to correctly predict the answer, we need to combine the context information that the woman attempted suicide before but failed, that she made the bed since she was determined to leave, and that she took the elevator and headed to the roof, and infer that the woman was attempting suicide again.
Inconsistent with Human Common Sense: In 33% of the errors, the model mistakenly selected the choice which is not consistent with human common sense. For example, in P2 of Figure 6, both choice A and choice C could be potentially correct answers. However, from human common sense, it's not safe to leave a baby alone at home.
Multi-turn Commonsense Inference: 19% of the errors are due to multi-turn commonsense inference. For example, in P3 of Figure 6, the model needs to first determine, using common sense, that the cause of the headache is that she chatted with friends and forgot to sleep. Further, with counterfactual reasoning, if she hadn't chatted with her friends, then she wouldn't have gotten up with a headache.
Unanswerable Questions: 14% of the errors are from unanswerable questions. The model cannot handle "None of the above" properly since it cannot be directly entailed by the given paragraph or the question. Instead, the model needs to compare the potential of all the other candidate choices.

Generative Evaluation
In the real world, humans are usually asked to perform contextual commonsense reasoning without being provided with any candidate answers.
To test machines for human-level intelligence, we leverage a state-of-the-art natural language generator, GPT2 (Radford et al., 2019), to automatically generate an answer by reading the given paragraph and question. Specifically, we fine-tune a pre-trained GPT2 language model on all the (Paragraph, Question, Correct Answer) triples of the COSMOS training set; then, given each Paragraph and Question from the test set, we use GPT2-FT to generate a plausible answer. We automatically evaluate the generated answers against human authored correct answers with varying metrics in Table 6. We also create an AMT task to have 3 workers select all plausible answers among 4 automatically generated answers and a "None of the above" choice for 200 question sets. We consider an answer as correct only if all 3 workers determined it as correct. Figure 7 shows examples of automatically generated answers by pre-trained GPT2 and GPT2-FT as well as human authored correct answers. We observe that by fine-tuning on COSMOS, GPT2-FT generates more accurate answers. Although intuitively there may be multiple correct answers to the questions in COSMOS QA, our analysis shows that more than 84% of generated correct answers identified by humans are semantically consistent with the gold answers in COSMOS, which demonstrates that COSMOS can also be used as a benchmark for generative commonsense reasoning. Appendix E shows more details and examples for generative evaluation.
Human 1: We were getting more food and needed to create room.
GPT2-FT: I had gone through everything before and it was no longer able to hold food.
GPT2: I had never cleaned cupboards before when I moved here.

P2: My head hurts. I had so much fun at a chat with some scrap friends last Saturday night that I forgot to sleep. I ended up crawling into bed around 7AM.
Q: What may have happened if she didn't chat to her scrap friends?
Human 1: She would go to bed and sleep better.
GPT2-FT: She would have gotten up early and spend the night in bed.
GPT2: She was so happy that I woke her up early, just in time to get her back to sleep.
Human 2: She would not have gotten up with a headache.

P3: Bertrand Berry has been announced as out for this Sunday's game with the New York Jets. Of course that comes as no surprise as he left the Washington game early and did not practice yesterday. His groin is now officially listed as partially torn.
Q: What might happen if his groin is not healed in good time?
Human 1: He will be benched for the rest of the season because of his injury.
GPT2-FT: He may miss the next few games.

Related Work
There have been many exciting new datasets developed for reading comprehension, such as SQuAD (Rajpurkar et al., 2016), NEWSQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), NarrativeQA (Kočiskỳ et al., 2018), Children's Book Test (Hill et al., 2015), and MCScript (Ostermann et al., 2018). Most of these datasets focus on relatively explicit understanding of the context paragraph, thus a relatively small or unknown fraction of the dataset requires commonsense reasoning, if at all. A notable exception is ReCoRD (Zhang et al., 2018), which is designed specifically for challenging reading comprehension with commonsense reasoning. COSMOS complements ReCoRD with three unique challenges: (1) Our context is from weblogs rather than news, thus requiring commonsense reasoning for everyday events rather than news-worthy events. (2) All the answers of ReCoRD are contained in the paragraphs and are assumed to be entities. In contrast, in COSMOS, more than 83% of answers are not stated in the paragraphs, creating unique modeling challenges. (3) COSMOS can be used for generative evaluation in addition to multiple-choice evaluation.
There have also been other datasets focusing specifically on question answering with commonsense, such as CommonsenseQA (Talmor et al., 2018) and Social IQa (Sap et al., 2019), and on various other types of commonsense inferences (Levesque et al., 2012; Rahman and Ng, 2012; Gordon, 2016; Rashkin et al., 2018; Roemmele et al., 2011; Mostafazadeh et al., 2017; Zellers et al., 2018). The unique contribution of COSMOS is combining reading comprehension with commonsense reasoning, requiring contextual commonsense reasoning over considerably more complex, diverse, and longer context. Table 7 shows a comprehensive comparison among the most relevant datasets.
There has been a wide range of attention mechanisms developed for reading comprehension datasets (Hermann et al., 2015; Kadlec et al., 2016; Chen et al., 2016; Dhingra et al., 2017; Seo et al., 2016; Wang et al., 2018b). Our work investigates various state-of-the-art approaches to reading comprehension, and provides empirical insights into the design choices that are the most effective for the contextual commonsense reasoning required for COSMOS.

Conclusion
We introduced COSMOS QA, a large-scale dataset for machine comprehension with contextual commonsense reasoning. We also presented extensive empirical results comparing various state-of-the-art neural architectures for reading comprehension, and demonstrated a new model variant that leads to the best result. The substantial headroom (25.6%) between the best model performance and human performance encourages future research on contextual commonsense reasoning.

A Context Paragraph Extraction
We gather a diverse collection of everyday situations from a corpus of personal narratives (Gordon and Swanson, 2009) from the ICWSM 2009 Spinn3r Blog Dataset (Burton et al., 2009). It contains over 1.6 million non-spam weblog entries describing everyday personal events. For each personal story, we use spaCy for sentence segmentation and tokenization. In order to get a short substory as context, we apply a pre-trained BERT (Devlin et al., 2018) model to predict a confidence score for each pair of consecutive sentences from the story, and then segment each story into multiple paragraphs so that each paragraph contains between 30 and 150 words. For each blog, we randomly sample one paragraph as context to create questions and answers.
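The paper does not spell out how the sentence-pair confidence scores are turned into paragraph boundaries, so the following is only one plausible sketch. It assumes a greedy left-to-right pass, a hypothetical `threshold` below which two consecutive sentences are treated as weakly linked, and word counts from whitespace splitting; only the 30 to 150 word constraint comes from the text.

```python
from typing import List


def segment_story(sentences: List[str], link_scores: List[float],
                  min_words: int = 30, max_words: int = 150,
                  threshold: float = 0.5) -> List[List[str]]:
    """Greedily cut a story into paragraphs of min_words..max_words words.

    link_scores[i] is the (hypothetical) confidence that sentences[i + 1]
    continues sentences[i]. Paragraphs outside the length range are dropped.
    """
    paragraphs: List[List[str]] = []
    current: List[str] = []
    n_words = 0
    for i, sentence in enumerate(sentences):
        length = len(sentence.split())
        weak_link = i > 0 and link_scores[i - 1] < threshold
        overflow = n_words + length > max_words
        # Close the paragraph if adding this sentence would overflow it, or
        # if it is already long enough and the link to this sentence is weak.
        if current and (overflow or (n_words >= min_words and weak_link)):
            if min_words <= n_words <= max_words:
                paragraphs.append(current)
            current, n_words = [], 0
        current.append(sentence)
        n_words += length
    if current and min_words <= n_words <= max_words:
        paragraphs.append(current)
    return paragraphs
```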

B Additional Details on AMT Instructions
To make the questions more challenging for an AI system, we recommend that the workers use fewer words from the paragraph for correct answers and make incorrect answers more appealing by using words from the paragraph as much as possible. We also encourage workers to provide all the candidate answers with similar length and style.
We restrict this task to workers in English-speaking countries (United States, Canada, and United Kingdom) with more than 5,000 HITs and at least a 99% acceptance rate. To ensure quality, we also create a qualification task.

C Implementation Details
For baseline methods, we use their released implementations from open source projects and retrain them on our dataset. All approaches follow the same pre-processing steps: segment each paragraph into multiple sentences, and tokenize each sentence as well as question and candidate answers with spaCy. For BERT-FT based approaches, we optimize the parameters with grid search: training epochs 10, learning rate l ∈ {2e-5, 3e-5, 5e-5}, gradient accumulation steps g ∈ {1, 4, 8}, training batch size b ∈ {2g, 3g, 4g, 5g}. We will make all the resources and implementations publicly available.
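The grid described above contains 3 × 3 × 4 = 36 configurations, because the batch size is tied to the number of gradient accumulation steps. A small sketch that enumerates them (the dictionary keys are illustrative names, not taken from the released code):

```python
from itertools import product
from typing import Dict, Iterator, Union

Config = Dict[str, Union[int, float]]


def bert_ft_grid() -> Iterator[Config]:
    """Enumerate the BERT-FT hyperparameter grid described in the text."""
    learning_rates = (2e-5, 3e-5, 5e-5)
    accumulation_steps = (1, 4, 8)
    batch_multipliers = (2, 3, 4, 5)   # batch size b is a multiple of g
    for lr, g, m in product(learning_rates, accumulation_steps, batch_multipliers):
        yield {
            "epochs": 10,
            "learning_rate": lr,
            "gradient_accumulation_steps": g,
            "train_batch_size": m * g,
        }
```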

D Impact of Training Data Size
To explore the impact of the amount of training data, we divide the whole training dataset into 10 folds and successively add another 10% into the training data. We use the BERT-FT approach for comparison. Figure 8 shows the learning curve. We can see that the performance goes up as we add more training data. However, we do not observe significant improvement when we further increase the training data after 15K questions.
For automatic evaluation, we compare each generated candidate answer against the original human authored correct choice in COSMOS QA, and average all metric scores with 10 sets of candidate answers. For AMT based human evaluation, we randomly sample 200 paragraphs and questions, and for each question we randomly sample 4 automatically generated answers from the outputs of GPT2 without fine-tuning and GPT2-FT. For each question set, we ask 3 workers to select all plausible correct answers from the 4 candidate choices or "None of the above". Workers are paid

Figure 1: Examples of COSMOS QA. ( indicates the correct answer.) Importantly, (1) the correct answer is not explicitly mentioned anywhere in the context paragraph, thus requiring reading between the lines through commonsense inference and (2) answering the question correctly requires reading the context paragraph, thus requiring reading comprehension and contextual commonsense reasoning.

Figure 2: Distribution of trigram prefixes of questions in COSMOS QA and SQuAD 2.0

Figure 3: Examples of each type of commonsense reasoning in COSMOS QA. ( indicates the correct answer.)

BERT-FT (Devlin et al., 2018) is a pre-trained bidirectional transformer language model followed by a fine-tuning step on COSMOS QA. DMCN (Zhang et al., 2019a) performs dual attention between paragraph and question/answer over BERT encoding output.

$$F^{Q}_{P} = \sigma([H_P - M^{Q}_{P} : H_P \cdot M^{Q}_{P}] W_P + b_P)$$
$$F^{A}_{P} = \sigma([H_P - M^{A}_{P} : H_P \cdot M^{A}_{P}] W_P + b_P)$$
$$F^{QA}_{P} = \sigma([H_P - M^{QA}_{P} : H_P \cdot M^{QA}_{P}] W_P + b_P)$$

where $M^{Q}_{P}$, $M^{A}_{P}$ and $M^{QA}_{P}$ are the question-attentive, answer-attentive, and question-and-answer-attentive paragraph representations, $[:]$ denotes the concatenation operation, $W_P$ and $b_P$ are learnable parameters for fusing paragraph representations, and $\sigma$ denotes the ReLU function. Finally, we apply column-wise max pooling on $[F^{Q}_{P} : F^{A}_{P} : F^{QA}_{P}]$ and obtain the new paragraph representation $F_P$. Similarly, we can also obtain new representations for the question and the candidate answer.

$$L(A_i \mid P, Q) = -\log \frac{\exp(W_f^{\top} F_i)}{\sum_{j=1}^{4} \exp(W_f^{\top} F_j)}$$

Figure 4: Architecture overview of BERT with multiway attention. Solid lines and blocks show the learning of the multiway attentive context paragraph representation.
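The last two steps of this architecture, column-wise max pooling over the concatenated fused representations and scoring the four candidates with a softmax, can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' code: each matrix is assumed to have one row per paragraph token and one column per feature, `scores` stands for the values $W_f^{\top} F_j$, and the helper names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = paragraph tokens, columns = features (assumed layout)


def concat_columns(blocks: List[Matrix]) -> Matrix:
    """Concatenate matrices with the same number of rows along the feature axis."""
    return [sum((block[i] for block in blocks), []) for i in range(len(blocks[0]))]


def column_max_pool(matrix: Matrix) -> List[float]:
    """Keep the maximum of every column, giving one fixed-size vector."""
    return [max(col) for col in zip(*matrix)]


def candidate_loss(scores: List[float], gold: int) -> float:
    """Negative log-softmax of the gold candidate's score."""
    top = max(scores)  # subtract the max before exponentiating for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[gold]
```

For example, `column_max_pool(concat_columns([f_q, f_a, f_qa]))` would play the role of the pooled paragraph vector, and `candidate_loss` is smallest when the gold candidate receives the largest score.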

P: I cleaned the two large bottom cupboards and threw a ton of old stuff away. Dustin's parents like to drop off boxes of food like we're refugees or something. It's always appreciated, and some of it is edible. Most of what I threw away was from last year when Dustin's great-aunt was moving into her new apartment home (retirement center) and they cleaned out her pantry.
Q: What is the most likely reason that I decided to clean the cupboards?
A: I was getting tired of having food in the house.
B: We were getting more food and needed to create room.
C: Dustin and I split up and I need to get rid of his old stuff.
D: None of the above choices.

P1: A woman had topped herself by jumping off the roof of the hospital she had just recently been admitted to. She was there because the first or perhaps latest suicide attempt was unsuccessful. She put her clothes on, folded the hospital gown and made the bed. She walked through the unit unimpeded and took the elevator to the top floor.
Q: What would have happened to the woman if the staff at the hospital were doing their job properly?
A: The woman would have been stopped before she left to take the elevator to the top floor and she would have lived.
B: She would have been ushered to the elevator with some company.
C: She would have managed to get to the elevator quicker with some assistance.
D: None of the above choices.

P2: Like me, she had no family or friends who could help with childcare. So like me, she found a daycare center that met her part-time needs. In sharp contrast to my job as a (gasp!) writer for the evil MSM, her nursing job was deemed by the other moms to be useful and worthwhile --in fact, worth putting her baby into daycare for "just a few hours, what harm could it do?"
Q: What would happened if she could not find a daycare?
A: She would try to find a babysitter.
B: She would take the baby to work.
C: She would leave the baby alone at home.
D: None of the above choices.

P3: My head hurts. I had so much fun at a chat with some scrap friends last Saturday night that I forgot to sleep. I ended up crawling into bed around 7AM.
Q: What may have happened if she didn't chat to her scrap friends?
A: She would have done some scrap thing at home.
B: She would not have gotten up with a headache.
C: She would have been lonely and stayed up all night.
D: None of the above choices.

Figure 6: Example errors of our approach, with correct answers and prediction errors marked.

P1: I cleaned the two large bottom cupboards and threw a ton of old stuff away. Dustin's parents like to drop off boxes of food like we're refugees or something. It's always appreciated, and some of it is edible. Most of what I threw away was from last year when Dustin's great-aunt was moving into her new apartment home (retirement center) and they cleaned out her pantry.
Q: What is the most likely reason that I decided to clean the cupboards?

Figure 7: Examples of human-authored correct answers and answers automatically generated by pre-trained GPT2 and GPT2-FT, with each answer marked as correct or incorrect.

Figure 8: Performance on COSMOS with varying amounts of training data.

Table 1: Statistics of training, dev and test sets of COSMOS QA.

Table 6: Generative performance of pre-trained GPT2 and GPT2-FT on COSMOS QA. All automatic metric scores are averaged over 10 sets of sample outputs.

Table 7: Comparison of COSMOS QA to other multiple-choice machine reading comprehension datasets. P: contextual paragraph, Q: question, A: answers, MC: multiple-choice; "-" means unknown.