What do we expect from Multiple-choice QA Systems?

The recent success of machine learning systems on various QA datasets could be interpreted as a significant improvement in models’ language understanding abilities. However, using various perturbations, multiple recent works have shown that good performance on a dataset might not correlate well with the expectations one has of models that “understand” language. In this work we consider a top-performing model on several Multiple Choice Question Answering (MCQA) datasets, and evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model’s inputs. Our results show that the model clearly falls short of our expectations, and motivate a modified training approach that forces the model to better attend to the inputs. We show that the new training paradigm leads to a model that performs on par with the original model while better satisfying our expectations.


Introduction
Question answering (QA) has been a prevalent format for gauging advances in language understanding. Recent advances in contextual language modelling have led to impressive results on multiple NLP tasks, including several multiple choice question answering (MCQA, depicted in Fig. 1) datasets, a particularly interesting QA task that provides a flexible space of candidate answers along with a simple evaluation.
However, recent work (Khashabi et al., 2016; Jia and Liang, 2017; Si et al., 2019; Gardner et al., 2019, inter alia) has questioned the interpretation of these QA successes as progress in natural language understanding. Indeed, these works exhibit, in various task settings, the brittleness of neural models to various perturbations. They also show (Kaushik and Lipton, 2018; Gururangan et al., 2018) how models can learn to latch on to spurious correlations in the data to achieve high performance on a given dataset. In this paper we continue this line of work with a careful analysis of the extent to which a top-performing MCQA model satisfies one's expectations from a model that "understands" language. (Resources for this work are available at: http://cogcomp.org/page/publication_view/913)
We formulate the following set of (non-exhaustive) expectation principles that an MCQA model should satisfy.
Monotonicity Expectation: Model performance should not drop if an incorrect option is changed to make it even less likely to be correct.
Sanity Expectation: The model should perform poorly given trivially insufficient input.
Reading Expectation: The model should only choose an answer that is supported by the provided context (and thus perform poorly in the absence of informative context).
While we view the first two expectation principles as necessary axioms, the third could depend on one's definition of the MCQA task. An alternate definition could expect the MCQA model to answer questions using the provided context or, in its absence, using its internal knowledge. In this work, however, we use the Reading Expectation as phrased above; we believe that requiring a model to rely on externally supplied context better gauges its language understanding abilities, and levels the playing field among models with varying levels of internal knowledge.
Guided by these expectation principles we formulate concrete input perturbations to evaluate whether a model satisfies these expectations. We show that the top MCQA model fails to meet any of the expectation principles described above. Our results point to the presence of dataset artifacts which the model uses to solve the datasets, rather than the underlying task.
With these goals and insights, we then propose (a) a different training objective, which encourages the model to score each candidate option on its own merit, and (b) an unsupervised data augmentation technique, which aims at "explaining" to the model the necessity of simultaneously attending to all inputs, to help the model solve the task. Our experiments on three popular MCQA datasets indicate that a model trained using our proposed approach better satisfies the expectation principles described above, while performing competitively with the baseline model.

Multi-choice Question Answering
In this section, we briefly describe the multiple-choice question answering (MCQA) task, and the model and datasets we use in this work.
MCQA Task In a k-way MCQA task, a model is provided with a question q, a set of candidate options O = {O_1, ..., O_k}, and a supporting context for each option C = {C_1, ..., C_k}. The model needs to predict the correct answer option, i.e., the one that is best supported by the given contexts. Figure 1 shows an example of a 4-way MCQA task.
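Concretely, a k-way MCQA instance and the scoring-based prediction it calls for can be sketched as follows. This is a minimal illustration: the class and function names are ours, and the `score` callable stands in for whatever model scores a (question, option, context) triplet.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCQAInstance:
    question: str
    options: List[str]   # O = {O_1, ..., O_k}
    contexts: List[str]  # C = {C_1, ..., C_k}, one supporting context per option
    answer_idx: int      # index of the correct option

def predict(instance: MCQAInstance,
            score: Callable[[str, str, str], float]) -> int:
    # Score each (question, option, context) triplet and return the
    # index of the best-supported option.
    scores = [score(instance.question, opt, ctx)
              for opt, ctx in zip(instance.options, instance.contexts)]
    return max(range(len(scores)), key=scores.__getitem__)
```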
Datasets We use the following MCQA datasets: 1. RACE (Lai et al., 2017): A reading comprehension dataset containing questions from English exams for Chinese middle and high school students. The context for all options is the same input paragraph.
2. QASC: An MCQA dataset containing elementary and middle school level science questions, which require composing facts using common-sense reasoning.
(All results are reported on the dev split of the datasets.)
Model We use a RoBERTa-based model, fine-tuned on the respective datasets. More details on the training procedure can be found in the appendix.

Model vs. Our Expectations
In this section, we define the perturbations we design to evaluate a model against our expectation principles (defined in §1). We then analyze how well the baseline model satisfies these expectations.

Monotonicity Expectation:
The following setting tests whether a model is fooled by an obviously incorrect option, one with high word overlap between its inputs.
• Perturbed Incorrect Option (PIO): The option description for an incorrect option is changed to the question itself, and its corresponding context is changed to 10 concatenations of the question.
Sanity Expectation: The following settings test how the model's performance changes when given an unreasonable input, for which it should not be possible to predict the correct answer.
• No option (NO): The option descriptions for all candidate options are changed to the empty string, "<s>".
• No question (NQ): The question (for all its candidate options O) is changed to the empty string, "<s>".
Reading Expectation: The following setting tests how crucial the context is for the model to correctly answer the questions.
• No context (NC): The contexts for all candidate options are changed to the empty string, "<s>".
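The four settings above amount to simple rewrites of the model's inputs. A minimal sketch, assuming plain-string inputs; the function names are ours:

```python
EMPTY = "<s>"  # placeholder token used for emptied fields

def perturb_pio(question, options, contexts, wrong_idx, n_copies=10):
    # PIO: overwrite one incorrect option with the question itself, and
    # its context with n_copies concatenations of the question.
    options, contexts = list(options), list(contexts)
    options[wrong_idx] = question
    contexts[wrong_idx] = " ".join([question] * n_copies)
    return question, options, contexts

def perturb_no(question, options, contexts):
    # NO: blank out every option description.
    return question, [EMPTY] * len(options), list(contexts)

def perturb_nq(question, options, contexts):
    # NQ: blank out the question for all candidate options.
    return EMPTY, list(options), list(contexts)

def perturb_nc(question, options, contexts):
    # NC: blank out every context.
    return question, list(options), [EMPTY] * len(contexts)
```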
Baseline model performance Table 1 shows that the model achieves impressive accuracy on all three datasets: RACE (84.8), QASC (85.2), and ARISTO (78.3), which suggests that the model should satisfy the expectations laid out for a good MCQA model.
Evaluating expectations When evaluating the model by modifying an incorrect option and its context (PIO), we find that its performance drops notably across all three datasets, for example, from 85.2 → 7.9 for QASC. This shows that the model is not able to cope with an incorrect option containing high word overlap with the question, even when it is trivially wrong and the correct option and its context are present. The baseline model thus fails to satisfy the Monotonicity Expectation.
Given an unreasonable input, where a pivotal component of the input is missing, we find that the baseline model still performs surprisingly well. For example, in ARISTO, removal of the question (NQ) only leads to a performance drop from 78.3 → 55.3, and removal of the options (NO), from 78.3 → 46.8. This suggests that the datasets contain unwanted biases that the model relies on to answer correctly. This shows that the baseline model fails to satisfy the Sanity Expectation.
The model achieves reasonable performance on the removal of the contexts; thus failing our Reading Expectation, e.g., performance only drops from 78.3 → 63.8 in ARISTO. To achieve this performance the model must rely on its inherent knowledge (Petroni et al., 2019) or, more likely, on dataset artifacts as suggested previously.

Proposed Training Approach
To address the aforementioned limitations, and reduce the tendency of the model to exploit dataset artifacts, we propose the following modifications to the training methodology.

MCQA as Binary Classification
Treating MCQA as a multi-class classification problem requires the model to only minimally differentiate the correct option from the incorrect options, thus making the training sensitive to the relative difficulty of the options. We propose to prevent this by training the model to predict the correctness of each candidate option separately, converting the k-way MCQA task into k binary classification tasks. The model is trained to predict a high probability for the correct option triplet (q, O_g, C_g), and a low probability for the other k − 1 option triplets.
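The conversion itself is mechanical. A sketch, assuming the gold option index g is known (the function name and label convention are ours):

```python
def to_binary_examples(question, options, contexts, gold_idx):
    # Split one k-way MCQA instance into k independently labelled
    # binary examples: label 1 for the gold triplet (q, O_g, C_g),
    # label 0 for each of the other k - 1 triplets.
    return [((question, opt, ctx), int(i == gold_idx))
            for i, (opt, ctx) in enumerate(zip(options, contexts))]
```

At inference time, each triplet is still scored independently and the highest-scoring option is returned, so the prediction interface is unchanged.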

Unsupervised data augmentation
We introduce an unsupervised data augmentation technique to discourage the model from exploiting spurious correlations between pairs of inputs and encourage it to read all the inputs. During training, given an MCQA instance (q, O, C), for each option triplet (q, O_i, C_i), we generate new examples (each with a negative label) by performing one of the following perturbations with a certain probability (details in the appendix):
Option: O_i is changed to one of (a) empty ("<s>") or (b) O_j ∈ O; j ≠ g.
Context: C_i is changed to one of (a) empty ("<s>") or (b) C_j ∈ C; j ≠ g.
Question: q, for all options O, is changed to one of (a) empty ("<s>") or (b) another question from the training set.
No change: The triplet is left as is.
This data augmentation is fully automatic and requires no manual annotation.
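The augmentation procedure can be sketched as follows. This is an illustrative simplification: `p_perturb` is a placeholder (the paper's exact sampling probabilities are in its appendix), and the question perturbation is applied per triplet here rather than across all of an instance's options at once.

```python
import random

EMPTY = "<s>"

def augment(question, options, contexts, gold_idx, train_questions,
            rng=None, p_perturb=0.5):
    # For each option triplet, with probability p_perturb emit one extra
    # negatively labelled example built by blanking or swapping a field;
    # otherwise ("no change") the original triplet is left as is.
    rng = rng or random.Random(0)
    k = len(options)
    non_gold = [j for j in range(k) if j != gold_idx]
    extra = []
    for i in range(k):
        if rng.random() >= p_perturb:
            continue  # no change for this triplet
        q, o, c = question, options[i], contexts[i]
        field = rng.choice(["option", "context", "question"])
        if field == "option":
            o = rng.choice([EMPTY] + [options[j] for j in non_gold])
        elif field == "context":
            c = rng.choice([EMPTY] + [contexts[j] for j in non_gold])
        else:
            q = rng.choice([EMPTY] + list(train_questions))
        extra.append(((q, o, c), 0))  # every augmented example is a negative
    return extra
```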

Results
The performance of our proposed training approach (+ Our Training) along with the baseline model is presented in Table 2. The resulting model performs competitively (within 2.6 points) with the baseline on all three datasets, suggesting that our proposed training approach has only a minor impact on overall model performance.
In our PIO setting, the new model outperforms the baseline on all three datasets by a large margin (55.5 compared to the baseline's 25.4 on ARISTO), indicating an improvement over the baseline with regard to our Monotonicity Expectation. Even though the data augmentation did not include examples with this perturbation, our training approach helps the model better read the inputs and avoid distractor options. When evaluating over unreasonable inputs in the NO and NQ settings, the resulting model performs poorly compared to the baseline (13.7 vs. 50.2 and 12.3 vs. 34.3 on QASC), showing that our training approach helps the model avoid relying on dataset biases and satisfy the Sanity Expectation.
Finally, the new model performs poorly when we remove the contexts (e.g., 20.6 on RACE), indicating that it meets our Reading Expectation. The results also show the resulting model's reliance on the context for the information required to correctly answer questions. Moreover, they imply that the resulting model achieves performance similar to the baseline by heavily relying on information from the contexts, as opposed to the baseline, which exploits dataset artifacts (as previously shown).
Results showing the performance of the model trained using binary classification loss, without the data augmentation, are attached in the appendix.

Related work
Our work builds on numerous recent works that challenge the robustness of neural language models (Jin et al., 2020; Si et al., 2019) or, more generally, neural models (Kaushik and Lipton, 2018; Jia and Liang, 2017; Khashabi et al., 2016). Our evaluation settings, hiding one of the three inputs to the MCQA models, are similar to the partial-input settings of Kaushik and Lipton (2018), which were designed to point out the existence of dataset artifacts in reading comprehension datasets. However, we argue that our results additionally point to a need for more robust training methodologies, and we propose an improved training approach. Our data augmentation approach builds on recent works (Khashabi et al., 2020; Kobayashi, 2018; Kaushik et al., 2020; Cheng et al., 2018; Andreas, 2020) that leverage augmented training data to improve the performance and/or robustness of models. However, most of these works are semi-automatic or require human annotation, while our augmentation approach requires no additional annotation.

Conclusion
We formulated three expectation principles that an MCQA model must satisfy, and devised appropriate settings to evaluate a model against these principles. Our evaluations on a RoBERTa-based model showed that the model fails to satisfy any of our expectations, exposing its brittleness and reliance on dataset artifacts. To improve learning, we proposed a modified training objective to reduce the model's sensitivity to the relative difficulty of candidate options, and an unsupervised data augmentation technique to encourage the model to rely on all the input components of an MCQA problem. The evaluation of our proposed training approach showed that the resulting model performs competitively with the original model while being robust to perturbations, and hence comes closer to satisfying our expectation principles.

Table 3: Results contrasting the performance of the baseline model trained using binary classification loss with the baseline model and the model trained using our proposed training approach on the RACE, ARISTO and QASC datasets. The evaluation settings used are described in the paper.