A Self-Training Method for Machine Reading Comprehension with Soft Evidence Extraction

Neural models have achieved great success on machine reading comprehension (MRC), and many of them consist of two components: an evidence extractor and an answer predictor. The former seeks the most relevant information from a reference text, while the latter locates or generates answers from the extracted evidence. Despite the importance of evidence labels for training the evidence extractor, they are not cheaply accessible, particularly in many non-extractive MRC tasks such as YES/NO question answering and multi-choice MRC. To address this problem, we present a Self-Training method (STM), which supervises the evidence extractor with auto-generated evidence labels in an iterative process. At each iteration, a base MRC model is trained with golden answers and noisy evidence labels. The trained model then predicts pseudo evidence labels as extra supervision for the next iteration. We evaluate STM on seven datasets over three MRC tasks. Experimental results demonstrate that STM improves existing MRC models, and we also analyze how and why such a self-training method works in MRC.


Introduction
Machine reading comprehension (MRC) has received increasing attention recently. MRC tasks can be roughly divided into two categories: extractive and non-extractive. Extractive MRC requires a model to extract an answer span to a question from reference documents, such as the tasks in SQuAD (Rajpurkar et al., 2016) and CoQA (Reddy et al., 2019). In contrast, non-extractive MRC infers answers based on some evidence in reference documents, including Yes/No question answering (Clark et al., 2019), multiple-choice MRC (Lai et al., 2017; Khashabi et al., 2018; Sun et al., 2019), and open-domain question answering (Dhingra et al., 2017b). As shown in Table 1, evidence plays a vital role in MRC (Zhou et al., 2019; Ding et al., 2019; Min et al., 2018), and the coarse-to-fine paradigm has been widely adopted in multiple models (Choi et al., 2017; Li et al., 2018; Wang et al., 2018), where an evidence extractor first seeks the evidence from given documents and then an answer predictor infers the answer based on that evidence. However, it is challenging to learn a good evidence extractor because evidence labels for supervision are usually lacking.
Manually annotating golden evidence is expensive. Therefore, some recent efforts have been dedicated to improving MRC by leveraging noisy evidence labels when training the evidence extractor. Some works (Lin et al., 2018; Min et al., 2018) generate distant labels using hand-crafted rules and external resources. Other studies (Wang et al., 2018; Choi et al., 2017) adopt reinforcement learning (RL) to decide the labels of evidence; however, such RL methods suffer from unstable training. More distant supervision techniques have also been used to refine noisy labels, such as deep probabilistic logic (Wang et al., 2019), but they are hard to transfer to other tasks. Nevertheless, improving the evidence extractor remains challenging when golden evidence labels are not available.
In this paper, we present a general and effective method based on Self-Training (Scudder, 1965) to improve MRC with soft evidence extraction when golden evidence labels are not available. Following the Self-Training paradigm, a base MRC model is trained iteratively. At each iteration, the base model is trained with golden answers, as well as noisy evidence labels obtained at the preceding iteration. Then, the trained model generates noisy evidence labels, which will be used to supervise evidence extraction at the next iteration. The overview of our method is shown in Figure 1.

Table 1 (examples):
Q: Did a little boy write the note? D: ...This note is from a little girl. She wants to be your friend. If you want to be her friend, ... A: No
Q: Is she carrying something? D: ...On the step, I find the elderly Chinese lady, small and slight, holding the hand of a little boy. In her other hand, she holds a paper carrier bag. ... A: Yes

Through this iterative process, evidence is labelled automatically to guide the MRC model to find answers, and a better MRC model in turn benefits the evidence labelling process. Our method works without any manual effort or external information, and therefore can be applied to any MRC task. Besides, the Self-Training algorithm converges more stably than RL. The two main contributions of this paper are summarized as follows:

1. We propose a self-training method to improve machine reading comprehension by soft evidence labeling. Compared with other existing methods, our method is more effective and general.
2. We verify the generalization and effectiveness of STM on several MRC tasks, including Yes/No question answering (YNQA), multiple-choice machine reading comprehension (MMRC), and open-domain question answering (ODQA). Our method is applicable to different base models, including BERT and DSQA (Lin et al., 2018). Experimental results demonstrate that our proposed method remarkably improves base models on all three MRC tasks.

Related Work
Early MRC studies focus on modeling semantic matching between a question and a reference document (Seo et al., 2017; Huang et al., 2018; Zhu et al., 2018; Mihaylov and Frank, 2018). To mimic how humans read, hierarchical coarse-to-fine methods have been proposed (Choi et al., 2017; Li et al., 2018). Such models first read the full text to select relevant text spans, and then infer answers from these spans. Extracting such spans in MRC is drawing more and more attention, though it remains quite challenging (Wang et al., 2019).
In general, evidence extraction in MRC can be classified into four types according to the training method. First, unsupervised methods provide no guidance for evidence extraction (Seo et al., 2017; Huang et al., 2019). Second, supervised methods train evidence extraction with golden evidence labels, which sometimes can be generated automatically in extractive MRC settings (Lin et al., 2018; Yin and Roth, 2018; Hanselowski et al., 2018). Third, weakly supervised methods rely on noisy evidence labels, where the labels can be obtained by heuristic rules (Min et al., 2018); moreover, some data programming techniques, such as deep probabilistic logic, have been proposed to refine noisy labels (Wang et al., 2019). Last, if a weak extractor is obtained via unsupervised or weakly supervised pre-training, reinforcement learning can be utilized to learn a better policy for evidence extraction (Wang et al., 2018; Choi et al., 2017).
For non-extractive MRC tasks, such as YNQA and MMRC, it is cumbersome and inefficient to annotate evidence labels (Ma et al., 2019). Although various methods for evidence extraction have been proposed, training an effective extractor is still challenging when golden evidence labels are unavailable. Weakly supervised methods either suffer from low performance or rely on too many external resources, which makes them difficult to transfer to other tasks. RL methods can indeed train a better extractor without evidence labels; however, they are much more complicated and unstable to train, and highly dependent on model pretraining.
Our method is based on Self-Training, a widely used semi-supervised method. Most related studies follow the framework of traditional Self-Training (Scudder, 1965) and Co-Training (Blum and Mitchell, 1998), and focus on designing better policies for selecting confident samples. Co-Trade (Zhang and Zhou, 2011) evaluates the confidence of whether a sample has been correctly labeled via a statistics-based data editing technique (Zighed et al., 2002). Self-paced Co-Training (Ma et al., 2017) adjusts labeled data dynamically according to the consistency between two models trained on different views. A reinforcement learning based method (Wu et al., 2018) designs an additional Q-agent as a sample selector.

Task Definition and Model Overview
The task of machine reading comprehension can be formalized as follows: given a reference document composed of a number of sentences D = {S_1, S_2, ..., S_m} and a question Q, the model should extract or generate an answer Â to the question conditioned on the document, formally

Â = argmax_A P(A | D, Q).

The process can be decomposed into two components, i.e., an evidence extractor and an answer predictor. The golden answer A is given for training the entire model, including the evidence extractor and the answer predictor. Denote E_i as a binary evidence label in {0, 1} for the i-th sentence S_i, where 0/1 corresponds to a non-evidence/evidence sentence, respectively. An auxiliary loss on the evidence labels can help the training of the evidence extractor.
The overview of our method is shown in Figure 1, which depicts an iterative process. During training, two data pools are maintained, denoted U (unlabeled data) and L (labeled data). In addition to golden answers, examples in L are annotated with pseudo evidence labels; in contrast, only golden answers are provided in U. At each iteration, the base model is trained on both data pools (the two training arrows). After training, the model makes evidence predictions on unlabeled instances (the labeling arrow), and then Selector chooses the most confident instances from U to provide noisy evidence labels. The instances with newly generated evidence labels are moved from U to L (the moving arrow) and are used to supervise evidence extraction at the next iteration. This process iterates several times.
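The pool-management loop described above can be sketched in a few lines. Here the base model is abstracted into three callables (`train`, `predict`, `confidence`); these names are illustrative, not taken from the paper's code, and the confidence/threshold details are a simplification of the Selector described later.

```python
def stm_iteration(train, predict, confidence, U, L, top_n, threshold):
    """One STM iteration: train on both pools, then move the most
    confidently pseudo-labeled instances from U to L."""
    # 1. Train the base model: answers supervise U and L,
    #    pseudo evidence labels supervise L only.
    train(U, L)
    # 2. Predict pseudo evidence labels for every instance still in U.
    scored = []
    for x in U:
        pseudo = predict(x)              # argmax pseudo evidence label
        s = confidence(x, pseudo)        # confidence of the labeling
        if s >= threshold:               # filter out low-confidence labels
            scored.append((s, x, pseudo))
    # 3. Keep only the top-n most confident candidates and move them to L.
    scored.sort(key=lambda t: t[0], reverse=True)
    for _, x, pseudo in scored[:top_n]:
        U.remove(x)
        L.append((x, pseudo))
    return U, L
```

In iteration 0, L starts empty, so step 1 reduces to answer-only supervision; the loop then grows L across iterations.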

Base Model
As shown in Figure 2, the overall structure of a base model consists of an encoder layer, an evidence extractor, and an answer predictor.
The encoder layer takes document D and question Q as input and produces a contextual representation for each word. Denote h^D_{i,j} as the representation of the j-th word in S_i, and h^Q_i as the representation of the i-th word in question Q. Our framework is agnostic to the architecture of the encoder, and in the experiments we show improvements on two widely used encoding models, i.e., Transformer (with BERT; Devlin et al., 2019) and LSTM (with DSQA; Lin et al., 2018).
The evidence extractor employs hierarchical attention, including token- and sentence-level attention, to obtain the document representation h^D. Token-level attention obtains a sentence vector by self-attention (Vaswani et al., 2017) within the words in a sentence, as follows:

β_j = softmax_j (w_s^T h^Q_j + b_s),   h^Q = Σ_j β_j h^Q_j,
α_{i,j} = softmax_j F_S(h^Q, h^D_{i,j}),   h^D_i = Σ_j α_{i,j} h^D_{i,j},

where h^Q is the sentence representation of the question. α_{i,j} refers to the importance of word j in sentence i, and likewise β_j to the importance of word j in the question. w_s and b_s are learnable parameters. The attention function F_S follows the bilinear form (Kim et al., 2018).
Sentence-level attention identifies important sentences conditioned on the question in a soft way to obtain the summary vector h^D, as follows:

γ_i = softmax_i F_D(h^Q, h^D_i),   h^D = Σ_i γ_i h^D_i,

where F_D has the same bilinear form as F_S with different parameters, and γ_i refers to the importance of the corresponding sentence.
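Assuming the bilinear form F(q, h) = q^T W h for F_S and F_D, the soft attention and summary computation above can be sketched in plain Python (toy dimensions; an illustration, not the paper's implementation):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def bilinear(q, h, W):
    """F(q, h) = q^T W h -- the bilinear attention form assumed for F_S / F_D."""
    Wh = [sum(w * x for w, x in zip(row, h)) for row in W]
    return sum(a * b for a, b in zip(q, Wh))

def attend(query, items, W):
    """Soft attention over `items` conditioned on `query`,
    returning the weights and the weighted-sum summary vector."""
    weights = softmax([bilinear(query, h, W) for h in items])
    dim = len(items[0])
    summary = [sum(w * h[d] for w, h in zip(weights, items)) for d in range(dim)]
    return weights, summary
```

Sentence-level attention is then one call, e.g. `gamma, h_D = attend(h_Q, sentence_vectors, W_D)`, and token-level attention applies the same routine within each sentence.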
The answer predictor adopts different structures for different MRC tasks. For Yes/No question answering, we use a simple linear classifier to infer answers. For multiple-choice MRC, we use a multi-layer perceptron (MLP) with softmax to obtain the score of each choice. For open-domain question answering, one MLP predicts the answer start position and another MLP predicts the end position.

Loss Function
We adopt two loss functions: a task-specific loss and an evidence loss.
The task-specific loss is defined as the negative log-likelihood (NLL) of predicting the golden answer:

L_A(D, Q, A) = -log P(Â = A | D, Q),

where Â denotes the predicted answer and A is the golden answer.
When the evidence label E is provided, we can impose supervision on the evidence extractor. For the most general case, we assume that a variable number of evidence sentences exists in each sample (Q, A, D). Inspired by previous work (Nishida et al., 2019) that used multiple pieces of evidence, we calculate the evidence loss step by step. Suppose we will extract K evidence sentences. At the first step, we compute the loss of selecting the most plausible evidence sentence. At each subsequent step, we compute the loss over the remaining sentences, where the previously selected sentences are masked and not counted. The overall loss is the average of the step-wise losses until K evidence sentences are selected. In this manner, we devise a differentiable surrogate loss function for choosing the top K evidence sentences.
Formally, we have

L_E = (1/K) Σ_{k=1}^{K} L_E^k,

where K is the number of evidence sentences, a pre-specified hyperparameter. M^k_i ∈ {0, -∞} is a sentence mask, where 0 means sentence i has not been selected before step k, and -∞ means it has been selected.

At each step, the model computes an attention distribution over the unselected sentences:

λ^k_i = softmax_i (F_D(h^Q, h^D_i) + M^k_i).

As M^k_i = -∞ for the previously selected sentences, the attention weight on those sentences will be zero; in other words, they are masked out. Then, the step-wise loss can be computed as

L_E^k = -log Σ_i E_i λ^k_i,

where λ^k_i indicates the attention weight for sentence i, and E_i ∈ {0, 1} is the evidence label for sentence i. The sentence with the largest attention weight is chosen as the k-th evidence sentence.

For each sentence i, M^1_i is initialized to 0. At each step k (k > 1), the mask M^k_i is set to -∞ if sentence i was chosen as an evidence sentence at the preceding step k - 1, and remains unchanged otherwise:

M^k_i = -∞ if i = argmax_j λ^{k-1}_j, and M^k_i = M^{k-1}_i otherwise.

During training, the total loss L is the combination of the task-specific loss and the evidence loss:

L = Σ_{(D,Q,A) ∈ L ∪ U} L_A(D, Q, A) + η Σ_{(D,Q,E) ∈ L} L_E(D, Q, E),   (1)

where η is a factor balancing the two loss terms, and L and U denote the sets of instances with and without evidence labels, respectively. Note that the evidence labels in L are obtained automatically by our self-training method.
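The step-wise masked loss can be sketched in plain Python. The per-step loss used here, -log of the attention mass on gold evidence, is a surrogate assumed from the description above rather than the paper's exact formulation, and the raw relevance scores stand in for F_D(h^Q, h^D_i):

```python
import math

NEG_INF = float("-inf")

def masked_softmax(scores, mask):
    """Softmax over scores + mask; -inf masked entries get zero weight."""
    vals = [s + m for s, m in zip(scores, mask)]
    mx = max(v for v in vals if v != NEG_INF)
    es = [0.0 if v == NEG_INF else math.exp(v - mx) for v in vals]
    z = sum(es)
    return [e / z for e in es]

def evidence_loss(scores, labels, K):
    """Average step-wise loss for picking K evidence sentences.

    scores: unnormalized relevance score per sentence
    labels: 0/1 evidence label E_i per sentence
    """
    mask = [0.0] * len(scores)
    total = 0.0
    for _ in range(K):
        lam = masked_softmax(scores, mask)
        # attention mass falling on gold evidence sentences at this step
        gold_mass = sum(l * e for l, e in zip(lam, labels))
        total += -math.log(max(gold_mass, 1e-12))
        # mask out the sentence with the largest attention weight
        top = max(range(len(lam)), key=lambda i: lam[i])
        mask[top] = NEG_INF
    return total / K
```

A model that ranks the gold sentences highest incurs a near-zero loss, while one that favors non-evidence sentences is penalized at every step.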

Self-Training MRC (STM)
STM is designed to improve base MRC models via generating pseudo evidence labels for evidence extraction when golden labels are unavailable.STM works in an iterative manner, and each iteration consists of two stages.One is to learn a better base model for answer prediction and evidence labelling.
The other is to obtain more precise evidence labels for the next iteration using the updated model. At each iteration, STM first trains the base model with golden answers and pseudo evidence labels from the preceding iteration, using the total loss defined in Equation 1. The trained model then predicts a distribution over pseudo evidence labels for each unlabelled instance (D, Q, A) and decides Ê as

Ê = argmax_E p_θ(E | D, Q, A).   (2)

The confidence of a labelled instance is defined from the answer loss L_A(D, Q, A) and the evidence loss L_E(D, Q, Ê). Selector selects the instances with the largest confidence scores whose L_A(D, Q, A) and L_E(D, Q, Ê) are smaller than the pre-specified thresholds. These labelled instances are moved from U to L for the next iteration.
In the first iteration (iteration 0), the initial labeled set L is empty, so the base model is supervised only by golden answers. In this case, the evidence extractor is trained in a distantly supervised manner.
The procedure of one iteration of STM is illustrated in Algorithm 1. δ and ε are two thresholds (hyper-parameters). The sort operation ranks the candidate samples by their confidence scores s and returns the top-n samples. n varies across datasets, and details are presented in the appendix.

Analysis
To understand why STM can improve evidence extraction and the performance of MRC, we revisit the training process and present a theoretical explanation, as inspired by (Anonymous, 2020).
In Section 3.4, we introduced the simple labeling strategy used in STM. With no sample selection, the evidence loss can be formulated as

L_{θ_t} = E_{x ~ U} E_{E ~ p_{θ_{t-1}}(E|x)} [ -log p_{θ_t}(E | x) ],

where x represents (D, Q, A), and θ_t denotes the parameters at the t-th iteration. In this case, pseudo evidence labels E are randomly sampled from p_{θ_{t-1}}(E|x) to guide p_{θ_t}(E|x), and therefore minimizing L_{θ_t} will simply lead to θ_t = θ_{t-1}. In fact, the sample selection strategy in STM filters out the low-quality pseudo labels with two distribution mappings, f and g, and the optimization target becomes

L_{θ_t} = E_{x ~ f(U)} [ -log p_{θ_t}( g(p_{θ_{t-1}}(E|x)) | x ) ].

In STM, f is a filter function with two pre-specified thresholds, δ and ε, and g is the argmax operation in Equation 2. Compared with random sampling, this strategy tends to prevent θ_t from learning wrong knowledge from θ_{t-1}, and subsequent training may benefit from implicitly learning the strategy.
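The effect of g can be illustrated numerically: training toward labels sampled from p_{θ_{t-1}} is minimized by leaving the distribution unchanged, whereas the argmax target concentrates mass on the currently favored label. A toy illustration with a hypothetical binary label distribution:

```python
import math

def argmax_onehot(p):
    """g: map a predicted distribution to a hard (argmax) pseudo label."""
    i = max(range(len(p)), key=lambda j: p[j])
    return [1.0 if j == i else 0.0 for j in range(len(p))]

def cross_entropy(target, p):
    """H(target, p) = -sum_i target_i * log p_i."""
    return -sum(t * math.log(q) for t, q in zip(target, p) if t > 0)

p_prev = [0.6, 0.4]                 # p_{theta_{t-1}}(E|x) over two labels
soft_target = p_prev                # expected target under random sampling
hard_target = argmax_onehot(p_prev) # target after applying g

# Training toward soft_target is minimized by p = p_prev (no update),
# while hard_target pushes all probability mass to the favored label.
```

Here cross_entropy(hard_target, p_prev) < cross_entropy(soft_target, p_prev), reflecting that the argmax target has lower entropy and sharpens the next model's predictions.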
In general, the strategy of STM imposes simple prior knowledge on the base models via the two distribution mappings, which may partly explain the performance gains.

Baselines
We compared several methods in our experiments, including some strong base models without evidence supervision and some existing methods (*+Rule/RL/DPL/STM) that improve MRC with noisy evidence labels. Experimental details are given in the appendix. YNQA and MMRC: (1) BERT-MLP utilizes a BERT encoder and an MLP answer predictor.

The predictor makes classification based on the BERT representation at the position of [CLS].
The parameters of the BERT module were initialized from BERT-base. (2) BERT-HA refers to the base model introduced in Section 3.2, which applies hierarchical attention over words and sentences. (3) Based on BERT-HA, BERT-HA+Rule supervises the evidence extractor with noisy evidence labels derived from hand-crafted rules. We explored three types of rules based on Jaccard similarity, integer linear programming (ILP) (Boudin et al., 2015), and inverse term frequency (ITF) (Wang et al., 2019), among which ITF performed best in most cases; for simplicity, we only report results with the ITF rule. (4) Based on BERT-HA, BERT-HA+RL trains the evidence extractor via reinforcement learning, similar to (Choi et al., 2017). (5) GPT+DPL (Wang et al., 2019), a deep probabilistic logic (DPL) method, is complicated and its source code is not available, so we directly used the results from the original paper and did not evaluate it on BERT. ODQA: (1) For each question, DSQA (Lin et al., 2018) aggregates multiple relevant paragraphs from ClueWeb09 and then infers an answer from these paragraphs. (2) GA (Dhingra et al., 2017a) and BiDAF (Seo et al., 2017) perform semantic matching between questions and paragraphs with attention mechanisms. (3) R^3 (Wang et al., 2018) is a reinforcement learning method that explicitly selects the paragraph most relevant to a given question for the subsequent reading comprehension module.

Yes/No Question Answering
Table 2 shows the results on the three YNQA datasets. We report only the classification accuracy on the development sets since the test sets are unavailable.
BERT-HA achieved better performance on all three datasets, indicating that distant supervision of evidence extraction can benefit Yes/No question answering. However, compared with BERT-HA, BERT-HA+RL made no improvement on MARCO and BoolQ, possibly due to the high variance of RL training. Similarly, BERT-HA+Rule performed worse than BERT-HA on CoQA and MARCO, implying that it is more difficult for the rule-based method (inverse term frequency) to find correct evidence in these two datasets. In contrast, our method BERT-HA+STM is more general and performed best on all datasets. BERT-HA+STM achieved performance comparable with BERT-HA+Gold, which stands for the upper bound obtained by providing golden evidence labels, indicating the effectiveness of the noisy labels in our method.

Multiple-choice MRC
Table 3 shows the experimental results on the three MMRC datasets. We adopt the metrics from the papers that introduced each dataset. STM improved BERT-HA consistently on RACE-H, MultiRC, and DREAM in terms of all the metrics. However, the improvement on RACE-M is limited (a 1.0 gain on the test set). The reason may be that RACE-M is much simpler than RACE-H, and thus it is not challenging for the evidence extractor of BERT-HA to find the correct evidence on RACE-M.

[Table 5: Evidence extraction evaluation on the development sets of CoQA and MultiRC. P@k / R@k represent precision / recall of the generated evidence labels, respectively, for the top k predicted evidence sentences.]

Open-domain Question Answering
Table 4 shows the exact match and F1 scores on Quasar-T. Distant evidence supervision (DS) indicates whether a passage contains the answer text. Compared with the base models DSQA and DSQA+DS, DSQA+STM achieved better performance on both metrics, which verifies that DSQA can also benefit from Self-Training. Our method is general and can improve both lightweight and heavyweight models, i.e., LSTM-based and BERT-based models, on different tasks.

Performance of Evidence Extraction
To evaluate the performance of STM on evidence extraction, we validated the evidence labels generated by several methods on the development sets of CoQA and MultiRC. Considering that the evidence for each question in MultiRC is a set of sentences, we adopted precision@k and recall@k as the metrics for MultiRC, which represent the precision and recall of the generated evidence labels, respectively, when k sentences are predicted as evidence. We adopted only precision@1 as the metric for CoQA, as this dataset provides each question with one golden evidence sentence.
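Precision@k and recall@k as used here can be computed directly from a predicted ranking and the gold evidence set; a small index-based sketch (illustrative, not the paper's evaluation script):

```python
def precision_recall_at_k(ranked_sentences, gold, k):
    """P@k / R@k for evidence extraction.

    ranked_sentences: sentence indices sorted by predicted relevance
    gold: iterable of golden evidence sentence indices
    """
    top = set(ranked_sentences[:k])        # top-k predicted evidence
    hits = len(top & set(gold))            # correctly predicted sentences
    return hits / k, hits / len(set(gold))
```

For CoQA, where each question has exactly one gold evidence sentence, precision@1 and recall@1 coincide, which is why only P@1 is reported.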
Table 5 shows the performance of five methods for evidence labeling on the CoQA and MultiRC development sets. BERT-HA+STM outperformed the base model BERT-HA by a large margin in terms of all the metrics. As a result, the evidence extractor augmented with STM provided more evidential information for the answer predictor, which may explain the improvements of BERT-HA+STM on the two datasets.

Analysis on Error Propagation
To examine whether error propagation exists in STM and how severe it is, we visualized the evolution of evidence predictions on the development set of CoQA (Figure 3). From the inside to the outside, the four rings show the evidence predicted by BERT-HA (iteration 0) and BERT-HA+STM (iterations 1, 2, 3). Each ring comprises all the instances from the development set of CoQA, and each radius corresponds to one sample. If the evidence of an instance is predicted correctly, the corresponding radius is marked in green; otherwise, in purple. Two examples are shown in the appendix due to space limitations.
Self-correction. As the innermost ring shows, about 80% of the evidence predicted by BERT-HA (iteration 0) was incorrect. However, the proportion of wrong instances dropped to 60% after self-training (iteration 3). More concretely, 27% of the wrong predictions were gradually corrected with high confidence within three self-training iterations, as exemplified by instance A in Figure 3.
Error propagation. We observed that 4% of the evidence was mistakenly revised by STM, as exemplified by instance B in Figure 3. In such cases, the incorrect predictions are likely to be retained in the next iteration. However, almost 50% of such mistakes were eventually corrected during subsequent iterations, like instance C. This observation shows that STM can limit error propagation and avoid catastrophic failure.

Improvement Over Stronger Pretrained Models
To evaluate the improvement of STM over stronger pretrained models, we employed RoBERTa-large (Liu et al., 2019) as the encoder in the base model. Table 6 shows the results on CoQA. STM significantly improved the evidence extraction accuracy (Evi. Acc) of the base model. However, the improvement on answer prediction (Ans. Acc) is marginal. One reason is that RoBERTa-HA already achieves such high performance that little room for improvement remains. Another possible explanation is that evidence information is less important for such strong models to generate answers; in other words, they may be more adept at exploiting data bias to make answer predictions. In comparison, weaker pretrained models, such as BERT-base, can benefit from evidence information due to their weaker ability to exploit data bias.

Conclusion and Future Work
We present an iterative self-training method (STM) to improve MRC models with soft evidence extraction, when golden evidence labels are unavailable.
In this iterative method, we train the base model with golden answers and pseudo evidence labels. The updated model then generates new pseudo evidence labels, which serve as additional supervision in the next iteration. Experimental results show that our proposed method consistently improves the base models on seven datasets across three MRC tasks, and that better evidence extraction indeed enhances the final performance of MRC.
As future work, we plan to extend our method to other NLP tasks which rely on evidence finding, such as natural language inference.

(Table 7, Case 1) Passage: ...(3) "Why don't you tackle Indian River, Daylight?" (4) Harper advised, at parting. (5) "There's whole slathers of creeks and draws draining in up there, and somewhere gold just crying to be found. (6) That's my hunch. (7) There's a big strike coming, and Indian River ain't going to be a million miles away. (8) "And the place is swarming with moose," Joe Ladue added. (9) "Bob Henderson's up there somewhere, been there three years now, swearing something big is going to happen, living off'n straight moose and prospecting around like a crazy man." (10) Daylight decided to go Indian River a flutter, as he expressed it; but Elijah could not be persuaded into accompanying him. Elijah's soul had been seared by famine, and he was obsessed by fear of repeating the experience. (11) "I jest can't bear to separate from grub," he explained. (12) "I know it's downright foolishness, but I jest can't help it..." Question: Are there many bodies of water there?

(Table 7, Case 2, continued) (3) Does all that mean that if you are younger than 14, you can't make your own money? (4) Of course not! (5) Kids from 10-13 years of age can make money by doing lots of things. (6) Valerie, 11, told us that she made money by cleaning up other people's yards. ...(11) Kids can learn lots of things from making money. (12) By working to make your own money, you are learning the skills you will need in life. (13) These skills can include things like how to get along with others, how to use technology and how to use your time wisely. (14) Some people think that asking for money is a lot easier than making it; however, if you can make your own money, you don't have to depend on anyone else... Question: Can they learn time management? Answer: No

Figure 1 :
Figure 1: Overview of Self-Training MRC (STM). The base model is trained on both L and U. After training, the base model is used to generate evidence labels for the data from U, and then Selector chooses the most confident samples, which will be used to supervise the evidence extractor at the next iteration. The selected data is moved from U to L at each iteration.
Figure 2: Overall structure of a base model, which consists of an encoder layer, an evidence extractor, and an answer predictor. The encoders obtain h^Q for the question and h^D_i for each sentence in a document. The summary vector h^D is used to predict the answer.

Algorithm 1
One iteration of STM
Input: Training sets U, L; thresholds δ and ε; number of generated labels n; weight of evidence loss η;
Output: Trained MRC model M; updated training sets U, L;
1: Randomly initialize M;
2: Train M on U and L;
3: Initialize L' = ∅;
4: for each (D, Q, A) ∈ U do
5:     Predict Ê and the confidence score s (Equation 2);
6: end for
7: Move the top-n samples (ranked by s) satisfying L_A < δ and L_E < ε from U to L';
8: L ← L ∪ L'; U ← U \ L';

Figure 3 :
Figure 3: Evolution of evidence predictions on the development set of CoQA.From the inside to the outside, the four rings correspond to BERT-HA (iteration 0) and BERT-HA+STM (iteration 1, 2, 3), respectively.

Figure 4 :
Figure 4: Weight distribution of the two cases from the sentence-level attention.
(Table 7, Case 2, opening) (1) If you live in the United States, you can't have a full-time job until you are 16 years old. (2) At 14 or 15, you work part-time after school or on weekends, and during summer vacation you can work 40 hours each week.

Table 1 :
Examples of Yes/No question answering.Evidential sentences in bold in reference documents are crucial to answer the questions.
To balance the ratio of Yes and No questions, we randomly removed some questions whose answers are Yes.
MARCO (Nguyen et al., 2016) is a large MRC dataset. Each question is paired with a set of reference documents, and the answer may not exist in the documents. We extracted all Yes/No questions and randomly picked some reference documents containing evidence.

Table 3 :
Experimental results on the three multiple-choice reading comprehension datasets. (F1a: F1 score on all answer-options; F1m: macro-average F1 score of all questions; EM: exact match.) Note that there is no golden evidence label on RACE and DREAM. The results for DPL (deep probabilistic logic) are copied from (Wang et al., 2019). Significance tests were conducted between BERT-HA+STM and the best baseline in each column (t-test); ‡ means p-value < 0.01, and † means p-value < 0.05.

Table 6 :
Answer prediction accuracy (Ans.Acc) and evidence extraction accuracy (Evi.Acc) on the development set of CoQA.

Table 7 :
Examples from the development set of CoQA. Evidential sentences in red in the reference passages are crucial to answer the questions. Sentences in blue are distracting, as Figure 4 shows.

Table 8 :
Hyper-parameters marked with ♣/♠ are used in BERT-HA/BERT-HA+STM, respectively.Other unmarked hyper-parameters are shared by these two models.