Finding Generalizable Evidence by Learning to Convince Q&A Models

We propose a system that finds the strongest supporting evidence for a given answer to a question, using passage-based question-answering (QA) as a testbed. We train evidence agents to select the passage sentences that most convince a pretrained QA model of a given answer, if the QA model received those sentences instead of the full passage. Rather than finding evidence that convinces one model alone, we find that agents select evidence that generalizes; agent-chosen evidence increases the plausibility of the supported answer, as judged by other QA models and humans. Given its general nature, this approach improves QA in a robust manner: using agent-selected evidence (i) humans can correctly answer questions with only ~20% of the full passage and (ii) QA models can generalize to longer passages and harder questions.


Introduction
There is great value in understanding the fundamental nature of a question (Chalmers, 2015). Distilling the core of an issue, however, is time-consuming. Finding the correct answer to a given question may require reading large volumes of text or understanding complex arguments. Here, we examine if we can automatically discover the underlying properties of problems such as question answering by examining how machine learning models learn to solve that task.
We examine this question in the context of passage-based question-answering (QA). Inspired by work in interpreting neural networks (Lei et al., 2016), we have agents find a subset of the passage (i.e., supporting evidence) that maximizes a QA model's probability of a particular answer. Each agent (one agent per answer) finds the sentences that a QA model regards as strong evidence for its answer, using either exhaustive search or learned prediction. Figure 1 shows an example. To examine to what extent evidence is general and independent of the model, we evaluate if humans and other models find selected evidence to be valid support for an answer too. We find that, when provided with evidence selected by a given agent, both humans and models favor that agent's answer over other answers. When human evaluators read an agent's selected evidence in lieu of the full passage, humans tend to select the agent-supported answer.
Given that this approach appears to capture some general, underlying properties of the problem, we examine if evidence agents can be used to assist human QA and to improve generalization of other QA models. We find that humans can accurately answer questions on QA benchmarks, based on evidence for each possible answer, using only 20% of the sentences in the full passage. We observe a similar trend with QA models: using only selected evidence, QA models trained on short passages can generalize more accurately to questions about longer passages, compared to when the models use the full passage. Furthermore, QA models trained on middle-school reading comprehension questions generalize better to high-school exam questions by answering only based on the most convincing evidence instead of the full passage. Overall, our results suggest that learning to select supporting evidence by having agents try to convince a judge model of their designated answer improves QA in a general and robust way.

Learning to Convince Q&A Models
Figure 1 shows an overview of the problem setup. We aim to find the passage sentences that provide the most convincing evidence for each answer option, with respect to a given QA model (the judge). To do so, we are given a sequence of passage sentences $S = [S(1), \dots, S(m)]$, a question $Q$, and a sequence of answer options $A = [A(1), \dots, A(n)]$. We train a judge model with parameters $\phi$ to predict the correct answer index $i^*$ by maximizing $p_\phi(\text{answer} = i^* \mid S, Q, A)$.
Next, we assign each answer A(i) to one evidence agent, AGENT(i).
AGENT(i) aims to find evidence E(i), a subsequence of passage sentences S that the judge finds to support A(i). For ease of notation, we use set notation to describe E(i) and S, though we emphasize these are ordered sequences. AGENT(i) aims to maximize the judge's probability on A(i) when conditioned on E(i) instead of S, i.e., $\arg\max_{E(i) \subseteq S} p_\phi(i \mid E(i), Q, A)$. We now describe three different settings of having agents select evidence, which we use in different experimental sections (§4-6).
Individual Sequential Decision-Making Since computing the optimal E(i) directly is intractable, a single AGENT(i) can instead find a reasonable E(i) by making T sequential, greedy choices about which sentence to add to E(i). In this setting, the agent ignores the actions of the other agents. At time t, AGENT(i) chooses index $e_{i,t}$ of the sentence in S such that:
$$e_{i,t} = \arg\max_{1 \le e' \le m} p_\phi(i \mid \{S(e')\} \cup E(i, t-1), Q, A) \qquad (1)$$
where E(i, t) is the subsequence of sentences in S that AGENT(i) has chosen until time step t, i.e., $E(i, t) = \{S(e_{i,t})\} \cup E(i, t-1)$ with $E(i, 0) = \emptyset$ and $E(i) = E(i, T)$. It is a no-op to add a sentence $S(e_{i,t})$ that is already in the selected evidence $E(i, t-1)$. The individual decision-making setting is useful for selecting evidence to support one particular answer.
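The greedy procedure in this setting can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: `greedy_evidence`, `toy_judge`, `ANSWERS`, and the example passage are hypothetical, and the toy judge only counts answer words and applies a softmax as a stand-in for a trained QA model.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical judge interface: (evidence sentences, answer index) -> probability.
Judge = Callable[[Sequence[str], int], float]

ANSWERS = ["red", "blue"]  # toy answer options, assumed for illustration


def toy_judge(evidence: Sequence[str], answer_idx: int) -> float:
    """Stand-in judge: softmax over counts of each answer word in the evidence."""
    words = " ".join(evidence).lower().split()
    exps = [math.exp(words.count(a)) for a in ANSWERS]
    return exps[answer_idx] / sum(exps)


def greedy_evidence(sentences: Sequence[str], judge: Judge,
                    answer_idx: int, num_steps: int) -> List[str]:
    """Make `num_steps` greedy choices of the sentence that most raises
    the judge's probability of `answer_idx`."""
    chosen: List[int] = []
    for _ in range(num_steps):
        def score(e: int) -> float:
            # Evidence stays in passage order; re-adding a chosen index is a no-op.
            idxs = sorted(set(chosen) | {e})
            return judge([sentences[k] for k in idxs], answer_idx)

        best = max(range(len(sentences)), key=score)
        if best not in chosen:
            chosen.append(best)
    return [sentences[k] for k in sorted(chosen)]


passage = ["The sky is blue", "The car is red", "It rained"]
print(greedy_evidence(passage, toy_judge, answer_idx=0, num_steps=1))
```

Each step queries the judge once per passage sentence, so T steps cost on the order of T times m judge evaluations, compared with the exponential number of subsets an exact search would need.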
Competing Agents: Free-for-All Alternatively, multiple evidence agents can compete at once to support unique answers, by each contributing part of the judge's total evidence. Agent competition is useful as agents collectively select a pool of question-relevant evidence that may serve as a summary to answer the question. Here, each of AGENT(1), . . . , AGENT(n) finds evidence that would convince the judge to select its respective answer, A(1), . . . , A(n). AGENT(i) chooses a sentence $S(e_{i,t})$ by conditioning on all agents' prior choices:
$$e_{i,t} = \arg\max_{1 \le e' \le m} p_\phi(i \mid \{S(e')\} \cup E(*, t-1), Q, A), \qquad E(*, t-1) = \bigcup_{j=1}^{n} E(j, t-1).$$
Agents simultaneously select a sentence each, doing so sequentially for t time steps, to jointly compose the final pool of evidence. We allow an agent to select a sentence previously chosen by another agent, but we do not keep duplicates in the pool of evidence. Conditioning on other agents' choices is a form of interaction that may enable competing agents to produce a more informative total pool of evidence. More informative evidence may enable a judge to answer questions more accurately without the full passage.
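A rough sketch of the free-for-all setting follows, assuming each agent greedily scores candidate sentences against the shared pool from earlier time steps. The names `free_for_all`, `toy_judge`, and `ANSWERS` are hypothetical, and the word-counting judge is only a stand-in for a trained QA model.

```python
import math
from typing import Callable, List, Sequence, Set

ANSWERS = ["red", "blue"]  # toy answer options, assumed for illustration


def toy_judge(evidence: Sequence[str], answer_idx: int) -> float:
    """Stand-in judge: softmax over counts of each answer word in the evidence."""
    words = " ".join(evidence).lower().split()
    exps = [math.exp(words.count(a)) for a in ANSWERS]
    return exps[answer_idx] / sum(exps)


def free_for_all(sentences: Sequence[str],
                 judge: Callable[[Sequence[str], int], float],
                 num_answers: int, num_steps: int) -> List[str]:
    """All agents pick one sentence per time step, conditioning on the
    shared pool built from every agent's earlier picks."""
    pool: Set[int] = set()
    for _ in range(num_steps):
        picks = []
        for i in range(num_answers):
            def score(e: int, i: int = i) -> float:
                idxs = sorted(pool | {e})  # passage order, duplicates dropped
                return judge([sentences[k] for k in idxs], i)

            picks.append(max(range(len(sentences)), key=score))
        # Picks are simultaneous: the pool is updated only after all agents chose.
        pool |= set(picks)
    return [sentences[k] for k in sorted(pool)]


passage = ["The sky is blue", "The car is red", "It rained"]
print(free_for_all(passage, toy_judge, num_answers=2, num_steps=1))
```

Storing the pool as a set of sentence indices is one simple way to let agents re-select a sentence without keeping duplicates in the evidence.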
Competing Agents: Round Robin Lastly, agents can compete round robin style, in which case we aggregate the outcomes of all $\binom{n}{2}$ pairs of answers {A(i), A(j)} competing. Any given AGENT(i) participates in $n - 1$ rounds, each time contributing half of the sentences given to the judge. In each one-on-one round, two agents select a sentence each at once. They do so iteratively multiple times, as in the free-for-all setup. To aggregate pairwise outcomes and compute an answer i's probability, we average its probability over all rounds involving AGENT(i):
$$\frac{1}{n-1} \sum_{j=1,\, j \ne i}^{n} p_\phi(i \mid E(i) \cup E(j), Q, A)$$
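The round-robin aggregation can be illustrated as below. This is a sketch only: `round_robin_pairs`, `round_robin_probability`, and the `pair_prob` callback (assumed to return the judge's probability of answer i after the round between agents i and j) are hypothetical names, and the probabilities in the table are made up.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def round_robin_pairs(n: int) -> List[Tuple[int, int]]:
    """All unordered pairs of answer indices that meet in a one-on-one round."""
    return list(combinations(range(n), 2))


def round_robin_probability(i: int, n: int,
                            pair_prob: Callable[[int, int], float]) -> float:
    """Average answer i's probability over the n - 1 rounds involving agent i."""
    return sum(pair_prob(i, j) for j in range(n) if j != i) / (n - 1)


# Made-up per-round probabilities: table[(i, j)] is the judge's probability
# of answer i after the round between agents i and j.
table: Dict[Tuple[int, int], float] = {
    (0, 1): 0.6, (0, 2): 0.8,
    (1, 0): 0.3, (1, 2): 0.5,
    (2, 0): 0.1, (2, 1): 0.4,
}
scores = [round_robin_probability(i, 3, lambda a, b: table[(a, b)]) for i in range(3)]
print(scores)
```

With four answer options, as in RACE, `round_robin_pairs(4)` yields six rounds and each agent takes part in three of them.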

Judge Models
The judge model is trained on QA, and it is the model that the evidence agents need to convince. We aim to select diverse model classes, in order to: (i) test the generality of the evidence produced by learning to convince different models; and (ii) have a broad suite of models to evaluate the agent-chosen evidence. Each model class assigns every answer A(i) a score, where the predicted answer is the one with the highest score. We use this score L(i) as a softmax logit to produce answer probabilities. Each model class computes L(i) in a different manner. In what follows, we describe the various judge models we examine.
TFIDF We define a function $\text{BoW}_{\text{TFIDF}}$ that embeds text into its corresponding TFIDF-weighted bag-of-words vector. We compute the cosine similarity of the embeddings for two texts X and Y:
$$\text{TFIDF}(X, Y) = \cos(\text{BoW}_{\text{TFIDF}}(X), \text{BoW}_{\text{TFIDF}}(Y))$$
We define two model classes that select the answer most similar to the input passage sentences: L(i) = TFIDF(S, [Q; A(i)]), and L(i) = TFIDF(S, A(i)).
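To make the TFIDF judge concrete, here is a small pure-Python sketch of the L(i) = TFIDF(S, A(i)) variant followed by the softmax over logits. The smoothed IDF formula, the whitespace tokenization, and all function names (`idf_table`, `bow_tfidf`, `cosine`, `softmax`) are assumptions made for illustration; the source does not specify these details.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency over a small corpus (assumed variant)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def bow_tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TFIDF-weighted bag-of-words vector."""
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


passage = "the cat sat on the mat".split()
options = ["a cat".split(), "a dog".split()]
idf = idf_table([passage] + options)

# L(i) = TFIDF(S, A(i)): similarity between the passage and each answer option.
logits = [cosine(bow_tfidf(passage, idf), bow_tfidf(a, idf)) for a in options]
probs = softmax(logits)
print(logits, probs)
```

Because cosine similarities of non-negative vectors lie in [0, 1], the resulting softmax distribution is fairly flat; the predicted answer is still the option with the highest score.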
fastText We define a function $\text{BoW}_{\text{FT}}$ that computes the average bag-of-words representation of some text using fastText embeddings (Joulin et al., 2017). We use 300-dimensional fastText word vectors pretrained on Common Crawl. We compute the cosine similarity between the embeddings for two texts X and Y using:
$$\text{fastText}(X, Y) = \cos(\text{BoW}_{\text{FT}}(X), \text{BoW}_{\text{FT}}(Y))$$
This method has proven to be a strong baseline for evaluating the similarity between two texts (Perone et al., 2018). Using this function, we define a model class that selects the answer most similar to the input passage context: L(i) = fastText(S, A(i)).
BERT L(i) is computed using the multiple-choice adaptation of BERT (Devlin et al., 2019; Radford et al., 2018; Si, 2019), a pre-trained transformer network (Vaswani et al., 2017). We fine-tune all BERT parameters during training. This model predicts L(i) using a trainable vector v and BERT's first token embedding:
$$L(i) = v^\top \cdot \text{BERT}([S; Q; A(i)])$$
We experiment with both the BERT BASE model (12 layers) and BERT LARGE (24 layers). For training details, see Appendix B.

Evidence Agents
In this section, we describe the specific models we use as evidence agents. The agents select sentences according to Equation 1, either exactly or via function approximation.
Search agent AGENT(i) at time t chooses the sentence $S(e_{i,t})$ that maximizes $p_\phi(i \mid E(i, t), Q, A)$, after exhaustively trying each possible $S(e_{i,t}) \in S$. Search agents that query TFIDF or fastText models maximize TFIDF or fastText scores directly (i.e., L(i), rather than $p_\phi(i \mid E(i, t), Q, A)$).
Learned agent We train a model to predict how a sentence would influence the judge's answer, instead of directly evaluating answer probabilities at test time. This approach may be less prone to selecting sentences that exploit hard-to-predict quirks in the judge; humans may be less likely to find such sentences to be valid evidence for an answer (discussed in §4.1). We define several loss functions and prediction targets, shown in Table 1. Each forward pass, agents predict one scalar per passage sentence via end-of-sentence token positions. We optimize these predictions using Adam (Kingma and Ba, 2015) on one loss from Table 1. For t > 1, we find it effective to simply predict the judge model at t = 1 and use this distribution for all time steps during inference. This trick speeds up training by enabling us to precompute prediction targets using the judge model, instead of querying it constantly during training.
We use BERT BASE for all learned agents. Learned agents predict the BERT BASE judge, as it is more efficient to compute than BERT LARGE. Each agent AGENT(i) is assigned the answer A(i) that it should support. We train one learned agent to find evidence for an arbitrary answer i. We condition AGENT(i) on i using a binary indicator when predicting L(i). We add the indicator to BERT's first token segment indicator and embed it into vectors $\gamma$ and $\beta$; for each timestep's features f from BERT, we scale and shift f element-wise: $(\gamma * f) + \beta$. See Appendix B for training details.
Notably, learning to convince a judge model does not require answer labels to a question. Even if the judge only learns from a few labeled examples, evidence agents can learn to model the judge's behavior on more data and out-of-distribution data without labels.
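The element-wise scale-and-shift conditioning can be sketched as follows. The embedding tables `GAMMA` and `BETA`, their values, the two-dimensional feature size, and the function name `scale_and_shift` are all hypothetical; in practice the vectors would be learned and would match BERT's hidden size.

```python
from typing import List, Sequence

# Hypothetical embedding tables indexed by the binary indicator (0 or 1).
# Real vectors would be learned; these values are for illustration only.
GAMMA = [[1.0, 1.0], [2.0, 0.5]]
BETA = [[0.0, 0.0], [1.0, -1.0]]


def scale_and_shift(features: Sequence[Sequence[float]],
                    indicator: int) -> List[List[float]]:
    """Apply (gamma * f) + beta element-wise to each timestep's feature vector f."""
    gamma, beta = GAMMA[indicator], BETA[indicator]
    return [[g * x + b for x, g, b in zip(f, gamma, beta)] for f in features]


# Two timesteps of two-dimensional features (stand-ins for BERT outputs).
feats = [[1.0, 2.0], [3.0, 4.0]]
print(scale_and_shift(feats, indicator=1))
```

Sharing one set of agent weights and modulating features by the indicator is what lets a single learned agent support an arbitrary answer index.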

Evaluating Evidence Agents
Evaluation Desiderata An ideal evidence agent should be able to find evidence for its answer w.r.t. a judge, regardless (to some extent) of the specific answer it defends. To appropriately evaluate evidence agents, we need to use questions with more than one defensible, passage-supported answer per question. In this way, an agent's performance will not depend disproportionately on the answer it is to defend, rather than its ability to find evidence.
Multiple-choice QA: RACE and DREAM For our experiments, we use RACE (Lai et al., 2017) and DREAM (Sun et al., 2019), two multiple-choice, passage-based QA datasets. Both consist of reading comprehension exams for Chinese students learning English; teachers explicitly designed answer options to be plausible (even if incorrect), in order to test language understanding. Each question has 4 total answer options in RACE and 3 in DREAM. Exactly one option is correct. DREAM consists of 10K informal, dialogue-based passages. RACE consists of 100K formal, written passages (i.e., news, fiction, or well-written articles). RACE also divides into easier, middle school questions (29%) and harder, high school questions (71%).
Other datasets we considered Multiple-choice passage-based QA tasks are well-suited for our purposes. Multiple-choice QA allows agents to support clear, dataset-curated possible answers. In contrast, Sugawara et al. (2018) show that 5-20% of questions in extractive, span-based QA datasets have only one valid candidate option. For example, some "when" questions are about passages with only one date. On SQuAD (Rajpurkar et al., 2016), we found that agents could only learn to convince the judge model when supporting the correct answer (one answer per question).

Training and Evaluating Models
Our setup is not directly comparable to standard QA setups, as we aim to evaluate evidence rather than raw QA accuracy. However, each judge model's accuracy is useful to know for analysis purposes. Table 2 shows model accuracies, which cover a broad range. BERT models significantly outperform word-based baselines (TFIDF and fastText), and BERT LARGE achieves the best overall accuracy. No model achieves the estimated human ceiling for either RACE (Lai et al., 2017) or DREAM (Sun et al., 2019). Our code is available at https://github.com/ethanjperez/convince. We build off AllenNLP (Gardner et al., 2018) using PyTorch (Paszke et al., 2017). For all human evaluations, we use Amazon Mechanical Turk via ParlAI (Miller et al., 2017). Appendix B describes preprocessing and training details.

Human Evaluation of Evidence
Would evidence that convinces a model also be valid evidence to humans? On one hand, there is ample work suggesting that neural networks can learn similar patterns as humans do. Convolutional networks trained on ImageNet share similarities with the human visual cortex (Cadieu et al., 2014). In machine translation, attention learns to align foreign words with their native counterparts (Bahdanau et al., 2015). On the other hand, neural networks often do not behave as humans do (Szegedy et al., 2014; Jia and Liang, 2017; Ribeiro et al., 2018; Alzantot et al., 2018). Convolutional networks rely heavily on texture (Geirhos et al., 2019), while humans rely on shape (Landau et al., 1988). Neural networks trained to recognize textual entailment can rely heavily on dataset biases (Gururangan et al., 2018).
Human evaluation setup We use human evaluation to assess how effectively agents select sentences that also make humans more likely to provide a given answer, when humans act as the judge. Humans answer based only on the question Q, answer options A, and a single passage sentence chosen by the agent as evidence for its answer option A(i) (i.e., using the "Individual Sequential Decision-Making" scheme from §2). Appendix C shows the interface and instructions used to collect evaluations. For each of RACE and DREAM, we use 100 test questions and collect 5 human answers for each (Q, A(i)) pair for each agent. We also evaluate a human baseline for this task, where 3 annotators select the strongest supporting passage sentence for each (Q, A(i)) pair. We report the average results across 3 annotators.
Humans favor answers supported by evidence agents when shown that agent's selected evidence, as shown in Table 3 (Appendix D shows results by question type). Without receiving any passage sentences, humans are at random chance at selecting the agent's answer (25% on RACE, 33% on DREAM), since agents are assigned an arbitrary answer. For all evidence agents, humans favor agent-supported answers more often than the baseline (33.5-42.0% on RACE and 41.7-50.5% on DREAM). For our best agents, the relative margin over the baseline is substantial. In fact, these agents select evidence that is comparable to human-selected evidence. For example, on RACE, humans select the target answer 41.6% of the time when provided with human-selected evidence, compared to 42.0% when provided with evidence selected by the learned agent that predicts p(i). All agents support right answers more easily than wrong answers. On RACE, the learned agent that predicts p(i) finds strong evidence more than twice as often for correct answers as for incorrect ones (74.6% vs. 31.1%). On both RACE and DREAM, BERT-based agents (search or learned agents) find stronger evidence than word-based agents do. Humans tend to find that BERT-based agents select valid evidence for an answer, right or wrong. On DREAM, word-based agents generally fail to find evidence for wrong answers compared to the no-sentence baseline (28.4% vs. 24.5% for a search-based fastText agent).
On RACE, learned agents that predict the BERT BASE judge outperform search agents that directly query the BERT BASE judge. This effect may occur if search agents find an adversarial sentence that unduly affects the judge's answer but that humans do not find to be valid evidence. Appendix A shows one such example. Learned agents may have difficulty predicting such sentences, without directly querying the judge. Appendix E provides some analysis on why learned agents may find more general evidence than search agents do. Learned agents are most accurate at predicting evidence sentences when the sentences have a large impact on the judge model's confidence in the target answer, and such sentences in turn are more likely to be found as strong evidence by humans. On DREAM, search agents and learned agents perform similarly, likely because DREAM has 14x less training data than RACE.
Figure 2: [...] (left), with human evidence selection in the leftmost column. All agents find evidence that convinces judge models more often than a no-evidence baseline (25%). Learned agents predicting p(i) or Δp(i) find the most broadly convincing evidence.

Model Evaluation of Evidence
Evaluating an agent's evidence across models Beyond human evaluation, we test how general agent-selected evidence is, by testing this evidence against various judge models. We expect evidence agents to most frequently convince the model they are optimized to convince, by nature of their direct training or search objective. The more similar models are, the more we expect evidence from one model to be evidence to another. To some extent, we expect different models to rely on similar patterns to answer questions. Thus, evidence agents should sometimes select evidence that transfers to any model. However, we would not expect agent evidence to transfer to other models if models only exploit method-specific patterns.
Experimental setup Each agent selects one evidence sentence for each (Q, A(i)) pair. We test how often the judge selects an agent's answer, when given this sentence, Q, and A. We evaluate on all (Q, A(i)) pairs in RACE's test set. Human evaluations are on a 100-question subset of the test set.
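The metric in this setup, how often a judge picks the agent's assigned answer when shown only the agent's sentence, can be computed as in the sketch below. The example records, the field names, and `overlap_judge` (a word-overlap stand-in for a trained judge) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def convince_rate(examples: List[Dict],
                  judge_predict: Callable[[str, str, Sequence[str]], int]) -> float:
    """Fraction of (Q, A(i)) pairs where the judge, given only the agent's
    evidence sentence, predicts the agent's assigned answer index."""
    hits = sum(
        judge_predict(ex["evidence"], ex["question"], ex["options"]) == ex["target"]
        for ex in examples
    )
    return hits / len(examples)


def overlap_judge(evidence: str, question: str, options: Sequence[str]) -> int:
    """Stand-in judge: choose the option sharing the most words with the evidence."""
    words = set(evidence.lower().split())
    return max(range(len(options)),
               key=lambda i: len(words & set(options[i].lower().split())))


question = "What color is the car?"
options = ["red", "blue"]
examples = [
    {"question": question, "options": options, "evidence": "The car is red", "target": 0},
    {"question": question, "options": options, "evidence": "The car is red", "target": 1},
    {"question": question, "options": options, "evidence": "The sky is blue", "target": 1},
]
rate = convince_rate(examples, overlap_judge)
print(rate)
```

Since each agent is assigned an arbitrary answer, a judge that ignores the evidence would score about 1/n on this metric, which is the 25% no-evidence baseline for RACE's four options.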
Results Figure 2 plots how often each judge selects an agent's answer. Without any evidence, judge models are at random at choosing an agent's assigned answer (25%). All agents find evidence that convinces judge models more often than the no-evidence baseline. Learned agents that predict p(i) or Δp(i) find the evidence most broadly considered convincing; other judge models select these agents' supported answers over 46% of the time. These findings support that evidence agents find general structure despite aiming to convince specific methods with their distinct properties.
Notably, evidence agents are not uniformly convincing across judge models. All evidence agents are most convincing to the judge model they aim to convince; across any given agent's row, an agent's target judge model is the model which most frequently selects the agent's answer. Search agents are particularly effective at finding convincing evidence w.r.t. their target judge model, given that they directly query this model. More broadly, similar models find similar evidence convincing. We find similar results for DREAM (Appendix F).

Evidence Agents Aid Generalization
We have shown that agents capture method-agnostic evidence representative of answering a question (the strongest evidence for various answers). We hypothesize that QA models can generalize better out of distribution to more challenging questions by exploiting evidence agents' capability to understand the problem.
Throughout this section, using various train/test splits of RACE, we train a BERT BASE judge on easier examples (involving shorter passages or middle-school exams) and test its generalization to harder examples (involving longer passages or high-school exams). Judge training follows §2.1. We compare QA accuracy when the judge answers using (i) the full passage and (ii) only evidence sentences chosen by competing evidence agents. We report results using the round robin competing agent setup described in §2, as it resulted in higher generalization accuracy than free-for-all competition in preliminary experiments. Each competing agent selects sentences up to a fixed, maximum turn limit; we experiment with 3-6 turns per agent (6-12 total sentences for the judge), and we report the best result. We train learned agents (as described in §2.2) on the full RACE dataset without labels, so these agents can model the judge using more data and on out-of-distribution data.
For reference, we evaluate judge accuracy on a subsequence of randomly sampled sentences; we vary the number of sentences sampled from 6-12 and report the best result. As a lower bound, we train an answer-only model to evaluate how effectively the QA model is using the passage sentences it is given. As an upper bound, we evaluate our BERT BASE judge trained on all of RACE, requiring no out-of-distribution generalization.

Generalizing to Longer Passages
We train a judge on RACE passages averaging 10 sentences long (all training passages each with at most 12 sentences); this data is roughly 1/10th of RACE. We test the judge on RACE passages averaging 30 sentences long.
Results Table 4 shows the results. Using the full passage, the judge outperforms an answer-only BERT baseline by 4% (44.1% vs. 40.2%). When answering using the smaller set of agent-chosen sentences, the judge outperforms the baseline by 10% (50.2% vs. 40.2%), more than doubling its relative use of the passage (a 10.0-point gain over the baseline versus a 3.9-point gain). Both search and learned agents aid the judge model in generalizing to longer passages. The improved generalization is not simply a result of the judge using a shorter passage, as shown by the random sentence selection baseline (44.7%).
Table 5: Generalizing to harder questions: We train a judge to answer questions with RACE's Middle School exam questions only. We test its generalization to High School exam questions. The judge is more accurate when using evidence agent sentences (last 5 rows) rather than the full passage.

Generalizing Across Domains
We examine if evidence agents aid generalization even in the face of domain shift. We test the judge trained on short RACE passages on long passages from DREAM. We use the same evidence agents from the previous subsection; the learned agent is trained on RACE only, and we do not fine-tune it on DREAM to test its generalization to finding evidence in a new domain. DREAM passages consist entirely of dialogues, use more informal language and shorter sentences, and emphasize general world knowledge and commonsense reasoning (Sun et al., 2019). RACE passages are more formal, written articles (e.g. news or fiction).
Results Table 4 shows that BERT-based evidence agents aid generalization even under domain shift. The model shows notable improvements for RACE → DREAM transfer when it predicts from BERT-based agent evidence rather than the full passage (65.0% vs. 68.9%). These results support that our best evidence agents capture something fundamental to the problem of QA, despite changes in e.g. content and writing style.

Generalizing to Harder Questions
Using RACE, we train a judge on middle-school questions and test it on high-school questions.
Results Table 5 shows that the judge generalizes to harder questions better by using evidence from either search-based BERT agents (53.0%) or learned BERT agents (51.9%) compared to using the full passage directly (50.7%) or to search-based TFIDF and fastText agents (50.4%-51.0%). Figure 3 shows that the improved generalization comes from questions the model originally generalizes worse on. Simplifying the passage by providing key sentences may aid generalization by e.g. removing extraneous or distracting sentences from passages with more uncommon words or complex sentence structure. Such improvements come at the cost of accuracy on easier, word-matching questions, where it may be simpler to answer with the full passage as seen in training.
Figure 3: Generalizing to harder questions by question type: We train a judge on RACE Middle School questions and test its generalization to RACE High School questions. To predict the answer, the judge uses either the full passage or evidence sentences chosen by a BERT-based search agent. The worse the judge does on a question category using the full passage, the better it does when using the agent-chosen sentences.

Evidence Agents Aid Human QA
As observed in §4.1, evidence agents more easily support right answers than wrong ones. Furthermore, evidence agents do aid QA models in generalizing systematically when all answer evidence sentences are presented at once. We hypothesize that when we combine all evidence sentences, humans prefer to choose the correct answer.
Human evaluation setup Evidence agents compete in a free-for-all setup ( §2), and the human acts as the judge. We evaluate how accurately humans can answer questions based only on agent sentences. Appendix C shows the annotation interface and instructions. We collect 5 human answers for each of the 100 test questions.
Humans can answer using evidence sentences alone As shown in Table 6, humans correctly answer questions using many fewer sentences [...] most similar to the question alone (via fastText), while achieving lower accuracy when using the BERT LARGE search agent's evidence (75.0%) and higher accuracy when using the BERT BASE search agent's evidence (83.8%). We explain the discrepancy by examining how effective agents are at supporting right vs. wrong answers (Table 3 from §4.1); BERT BASE is more effective than BERT LARGE at finding evidence for right answers (82.5% vs. 79.4%) and less effective at finding evidence for wrong answers (34.6% vs. 38.7%).
Table 6: Human accuracy using evidence agent sentences: Each agent selects a sentence supporting its own answer. Humans answer the question given these agent-selected passage sentences only. Humans still answer most questions correctly, while reading many fewer passage sentences.

Related Work
Here, we discuss further related work, beyond that discussed in §4.1 on (dis)similarities between patterns learned by humans and neural networks.
Evidence Extraction Various papers have explored the related problem of extracting evidence or summaries to aid downstream QA. Wang et al. (2018a) concurrently introduced a neural model that extracts evidence specifically for the correct answer, as an intermediate step in a QA pipeline. Prior work uses similar methods to explain what a specific model has learned (Lei et al., 2016; Li et al., 2016). Others extract evidence to improve downstream QA efficiency over large amounts of text (Choi et al., 2017; Kratzwald and Feuerriegel, 2019; Wang et al., 2018b). More broadly, extracting evidence can facilitate fact verification (Thorne et al., 2018) and debate.
Generic Summarization In contrast, various papers focus primarily on summarization rather than QA, using downstream QA accuracy only as a reward to optimize generic (question-agnostic) summarization models ([...] Liu, 2018, 2019; Eyal et al., 2019).
Debate Evidence extraction can be viewed as a form of debate, in which multiple agents support different stances (Irving et al., 2018;Irving and Askell, 2019). Chen et al. (2018) show that evidence-based debate improves the accuracy of crowdsourced labels, similar to our work which shows its utility in natural language QA.

Conclusion
We examined if it was possible to automatically distill general insights for passage-based question answering, by training evidence agents to convince a judge model of any given answer. Humans correctly answer questions while reading only 20% of the sentences in the full passage, showing the potential of our approach for assisting humans in question answering tasks. We examine how selected evidence affects the answers of humans as well as other QA models, and we find that agent-selected evidence is generalizable. We exploit these capabilities by employing evidence agents to facilitate QA models in generalizing to longer passages and out-of-distribution test sets of qualitatively harder questions.