Context-Aware Answer Extraction in Question Answering

Extractive QA models have shown very promising performance in predicting the correct answer to a question for a given passage. However, they sometimes predict the correct answer text but in a context irrelevant to the given question. This discrepancy becomes especially important as the number of occurrences of the answer text in a passage increases. To resolve this issue, we propose \textbf{BLANC} (\textbf{BL}ock \textbf{A}ttentio\textbf{N} for \textbf{C}ontext prediction) based on two main ideas: context prediction as an auxiliary task in a multi-task learning manner, and a block attention method that learns the context prediction task. With experiments on reading comprehension, we show that BLANC outperforms the state-of-the-art QA models, and the performance gap increases as the number of answer text occurrences increases. We also conduct an experiment of training the models using SQuAD and predicting the supporting facts on HotpotQA and show that BLANC outperforms all baseline models in this zero-shot setting.


Introduction
Question answering tasks require a high level of reading comprehension ability, which in turn requires a high level of general language understanding. This is why question answering (QA) tasks are often used to evaluate language models designed for various language understanding tasks. Recent advances in contextual language models brought on by attention (Hermann et al., 2015; Chen et al., 2016; Seo et al., 2017; Tay et al., 2018) and transformers (Vaswani et al., 2017) have led to significant improvements in QA, and these improvements show that better modeling of the contextual meanings of words plays a key role in QA.
While these models are designed to select answer-spans in the relevant contexts from given passages, they sometimes predict the correct answer text but in contexts that are irrelevant to the given questions.

Figure 1: Example passage, question, and answer triple. This passage has multiple spans that match the answer text. The first occurrence of "prefrontal cortex" is the only answer-span within the context of the question.

Figure 1 shows an example passage where the correct answer text appears multiple times. In this example, the only answer-span in the context relevant to the given question is the first occurrence of "prefrontal cortex" (in blue), and all remaining occurrences of the answer text (in red) are incorrect predictions. Figure 2 quantitatively shows the discrepancy between predicting the correct answer text and predicting the correct answer-span. Using BERT trained on the curated NaturalQuestions (Fisch et al., 2019), we show the results of the extractive QA task using exact match (EM) and Span-EM. EM only checks whether the text matches the ground truth answer, whereas Span-EM additionally requires the span to be the same as the ground truth answer-span. Figure 2 shows that BERT finds the correct answer text more often than it finds the correct answer-span, and this proportion of wrong predictions increases as the number of occurrences of the answer text in a passage increases.
Tackling this problem is very important in more realistic datasets such as NaturalQuestions (Kwiatkowski et al., 2019), where the majority of questions have more than one occurrence of the answer text in the passage. This is in contrast with the SQuAD dataset, where most questions have a single occurrence of the answer. These details of the SQuAD (Rajpurkar et al., 2016), NewsQA, and NaturalQuestions datasets (Fisch et al., 2019) are shown in Figure 3.
To address this issue, we define context prediction as an auxiliary task and propose a block attention method, which we call BLANC (BLock AttentioN for Context prediction), that explicitly forces the QA model to predict the context. We design the context prediction task to predict soft-labels that are generated from the given answer-spans. The block attention method effectively calculates the probability of each word in a passage being included in the context, with negligible extra parameters and inference time. We make the implementation of BLANC publicly available.^1
Adding context prediction and block attention enables BLANC to correctly identify context related to a given question. We conduct two types of experiments to verify the context differentiation performance of BLANC: the extractive QA task and zero-shot supporting facts prediction. In the extractive QA task, we show that BLANC significantly increases overall reading comprehension performance, and we verify that the performance gain increases as the number of answer texts in a passage increases. We verify BLANC's context-aware performance in terms of generalizability in the zero-shot supporting facts prediction task. We train BLANC and the baseline models on SQuAD1.1 and perform a zero-shot supporting facts (supporting sentences in passages) prediction experiment on the HotpotQA dataset (Yang et al., 2018). The results show that the context prediction performance the model has learned from one dataset is generalizable to predicting the context of an answer to a question in another dataset.
Contributions in this paper are as follows:
• We show the importance of correctly identifying the answer-span for improving model performance on extractive QA.
• We show that the context prediction task plays a key role in the QA domain.
• We propose a new model, BLANC, that resolves the discrepancy between answer text prediction and answer-span prediction.

Related Work
Evidence in the form of documents, paragraphs, and sentences has been shown to be necessary and effective in predicting the answers in open-domain QA (Chen et al., 2017; Wang et al., 2018; Das et al., 2018) and multi-hop QA (Yang et al., 2018; Min et al., 2019b; Asai et al., 2020). One problem with identifying evidence in answering questions is the expensive cost of labeling the evidence. Self-labeling with simple heuristics can be a solution to this problem, as shown in Choi et al. (2017) and Li et al. (2018). Self-training is another solution, as presented in Niu et al. (2020). In this paper, we propose a self-generated soft-labeling method to indicate supporting words of answer texts, and train BLANC with the soft-labels.
Related but different from our work, Swayamdipta et al. (2018) and Min et al. (2019a) predict the answer-span when only the answer texts are provided and the ground truth answer-spans are not. Swayamdipta et al. (2018) design a model that benefits from aggregating information from multiple mentions of the answer text when predicting the final answer. Min et al. (2019a) approach the lack of ground truth answer-spans with latent modeling of candidate spans. Both of these papers tackle the problem of identifying the correct answer among multiple mentions of the answer text in datasets without annotations of the correct answer-spans. Our work solves a different problem from the above-mentioned papers in that the gold answer-spans are provided.

Model
We propose BLANC based on two novel ideas: a soft-labeling method for context prediction and a block attention method that predicts the soft-labels. The two important functionalities of BLANC are 1) calculating the probability that a word in a passage belongs to the context, which is latent, and 2) enabling the probability to reflect the spatial locality between adjacent words. We provide an overall illustration of BLANC in Figure 4.

Notations
In this section, we define the notations and terms used in our study. We denote a word at index i in a passage by w_i. We define the context of a given question as a segment of words in a passage and denote it by C. In our setting, the context is latent. We denote the start and end indices of a context by s_c and e_c. Training a block attention model to predict the context requires a labeling process for the latent context, and we define two probabilities for it: p_soft(w_i ∈ C) and p(w_i ∈ C). p_soft(w_i ∈ C) represents the self-generated soft-label that we treat as the ground truth of the context, and p(w_i ∈ C) is the block attention model's prediction. We denote the start and end indices of a labeled answer-span by s_a and e_a.

Soft-labeling for latent context C
We assume words near an answer-span are likely to be included in the context of a given question. From this assumption, we define the probability that a word belongs to the context, p_soft(w_i ∈ C), which is used as a soft-label for the auxiliary context prediction task. To achieve this, we hypothesize that the words in an answer-span are included in the context and let the probability of adjacent words decrease by a fixed ratio as the distance between the answer-span and a word increases. The soft-label for the latent context is as follows:

\[
p_{\text{soft}}(w_i \in C) =
\begin{cases}
1 & s_a \le i \le e_a \\
q^{\,s_a - i} & s_w \le i < s_a \\
q^{\,i - e_a} & e_a < i \le e_w \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]

where 0 ≤ q ≤ 1, and q is a hyper-parameter controlling the decay ratio as the distance from a given answer-span increases, and s_w and e_w denote the boundaries of the window around the answer-span. For computational efficiency, we apply (1) only to words bounded by a certain window-size, which is a hyper-parameter, on both sides of an answer-span. This results in assigning p_soft(w_i ∈ C) = 0 to the words outside the segment bounded by the window-size.
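The soft-labeling scheme above can be sketched as follows. The function name and pure-Python style are our own, but the logic follows (1): answer-span words get probability 1, words within the window decay by a factor of q per token of distance, and all other words get 0.

```python
def soft_labels(passage_len, s_a, e_a, q=0.7, window=2):
    """Generate p_soft(w_i in C) for one passage, following Eq. (1).

    Words inside the answer-span [s_a, e_a] (inclusive) get probability 1;
    words within `window` tokens of the span decay by q per token of
    distance; all remaining words get 0.
    """
    labels = [0.0] * passage_len
    for i in range(passage_len):
        if s_a <= i <= e_a:
            labels[i] = 1.0
        elif s_a - window <= i < s_a:
            labels[i] = q ** (s_a - i)
        elif e_a < i <= e_a + window:
            labels[i] = q ** (i - e_a)
    return labels

# answer-span at token indices 4..5 in a 10-word passage
print(soft_labels(10, 4, 5, q=0.5, window=2))
# -> [0.0, 0.0, 0.25, 0.5, 1.0, 1.0, 0.5, 0.25, 0.0, 0.0]
```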

Block Attention
The block attention model calculates p(w_i ∈ C) to predict the soft-label p_soft(w_i ∈ C), and localizes the correct index of an answer-span with p(w_i ∈ C). We embed the spatial locality of p(w_i ∈ C) into the block attention model with the following steps: 1) predicting the start and end indices of the context, p(i = s_c) and p(i = e_c), and 2) calculating p(w_i ∈ C) from the cumulative distributions of p(i = s_c) and p(i = e_c). In the first step, any encoder model that produces vector representations of the words in a passage is compatible with the block attention model. In this paper, we apply the same structure as the answer-span classification layer used in the transformer model to our context word prediction layer.
Here, we denote the output vectors of the transformer encoder by \bm{H} and the output vector of w_j by \bm{H}_j. From \bm{H}, we predict the start and end indices of the context:

\[ p(i = s_c) = \mathrm{softmax}(\bm{H}\bm{w}^c_s + b^c_s)_i \tag{2} \]
\[ p(i = e_c) = \mathrm{softmax}(\bm{H}\bm{w}^c_e + b^c_e)_i \tag{3} \]

where \bm{w}^c_s, \bm{w}^c_e, b^c_s, and b^c_e represent the weight and bias parameters of the context prediction layer. We calculate p(w_i ∈ C) as the product of the probability that the word w_i appears after s_c and the probability that w_i appears before e_c.

Figure 4: Schematic visualization of BLANC. The block attention model takes contextual vector representations from the transformer encoder and predicts the context words of an answer, p(w_i ∈ C). We define the loss function for context words with the prediction p(w_i ∈ C) and the self-generated soft-label p_soft(w_i ∈ C) defined in (1). The answer-span predictor takes p(w_i ∈ C) and \bm{H} to predict an answer-span. We optimize our model via multi-task learning of two tasks: answer-span prediction and context word prediction.
Here, we assume independence between s_c and e_c for computational conciseness. The cumulative distributions p(i ≥ s_c) and p(i ≤ e_c) are calculated with the following equations:

\[ p(i \ge s_c) = \sum_{j=1}^{i} p(j = s_c), \qquad p(i \le e_c) = \sum_{j=i}^{l} p(j = e_c) \tag{4} \]
\[ p(w_i \in C) = p(i \ge s_c) \times p(i \le e_c) \tag{5} \]

We explicitly force the block attention model to learn the context words of a given question by minimizing the cross-entropy between the two probabilities, p(w_i ∈ C) and p_soft(w_i ∈ C). The loss function for the latent context is defined by the following equation:

\[ \mathcal{L}_{\text{context}} = -\frac{1}{l} \sum_{i=1}^{l} p_{\text{soft}}(w_i \in C) \log p(w_i \in C) \tag{6} \]

where l is the length of a passage. By averaging \mathcal{L}_{\text{context}} across all training examples, we get the final context loss function.
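A minimal sketch of the block attention computation, assuming the encoder has already produced context start/end logits (the helper names are illustrative): the start distribution is accumulated left-to-right into p(i ≥ s_c), the end distribution right-to-left into p(i ≤ e_c), and their elementwise product gives p(w_i ∈ C).

```python
import math

def block_attention(start_logits, end_logits):
    """Compute p(w_i in C) from context start/end logits.

    p(i >= s_c) is the running sum of p(j = s_c) for j <= i,
    p(i <= e_c) is the running sum of p(j = e_c) for j >= i,
    and p(w_i in C) is their product (independence assumption).
    """
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    p_s = softmax(start_logits)
    p_e = softmax(end_logits)
    cum_s, acc = [], 0.0
    for p in p_s:                      # left-to-right: p(i >= s_c)
        acc += p
        cum_s.append(acc)
    cum_e, acc = [0.0] * len(p_e), 0.0
    for i in range(len(p_e) - 1, -1, -1):   # right-to-left: p(i <= e_c)
        acc += p_e[i]
        cum_e[i] = acc
    return [s * e for s, e in zip(cum_s, cum_e)]

def context_loss(p_in_c, p_soft):
    """Cross-entropy between soft labels and predictions, averaged over the passage (Eq. 6-style)."""
    eps = 1e-12
    return -sum(t * math.log(p + eps) for t, p in zip(p_soft, p_in_c)) / len(p_in_c)
```

With logits sharply peaked at a start index of 2 and an end index of 5, the resulting p(w_i ∈ C) is close to 1 inside the block [2, 5] and close to 0 outside it.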

Answer-span Prediction
BLANC predicts the answer-span with the context probability p(w_i ∈ C). We use the same answer-span prediction layer as BERT, but we multiply p(w_i ∈ C) with the output of the encoder, \bm{H}, to give attention to the indices of the answer-span within the context C.
\[ p(i = s_a) = \mathrm{softmax}((\bm{A} \odot \bm{H})\bm{W}_a + b^a_s)_i, \qquad p(i = e_a) = \mathrm{softmax}((\bm{A} \odot \bm{H})\bm{V}_a + b^a_e)_i \tag{7} \]

where \bm{W}_a, \bm{V}_a, b^a_s, and b^a_e represent the weight and bias parameters of the answer-span prediction layer, and A_i = p(w_i ∈ C). The loss function for answer-span prediction is defined by the following equation:

\[ \mathcal{L}_{\text{answer}} = -\sum_{i=1}^{l} \left[ \mathbb{1}(i = s_a) \log p(i = s_a) + \mathbb{1}(i = e_a) \log p(i = e_a) \right] \tag{8} \]

\mathbb{1}(\text{condition}) represents an indicator function that returns 1 if the condition is true and 0 otherwise. By averaging \mathcal{L}_{\text{answer}} across all training examples, we get the final answer-span loss function. We define our final loss function as the weighted sum of the two loss functions:

\[ \mathcal{L}_{\text{total}} = \lambda \mathcal{L}_{\text{context}} + (1 - \lambda) \mathcal{L}_{\text{answer}} \tag{9} \]

where λ is a hyper-parameter moderating the ratio of the two loss functions. The soft-label in (1) can be represented by the probability distribution calculated by the block attention model, p(w_i ∈ C). We provide a detailed proof in Appendix A.1.

Experimental Setup
We validate the efficacy of BLANC on two types of tasks: extractive QA and zero-shot supporting facts prediction. In extractive QA, we evaluate the overall reading comprehension performance on three QA datasets, and we further analyze the ability of BLANC to discern relevant contexts in passages with multiple answer texts. In zero-shot supporting facts prediction, we train QA models on SQuAD (Rajpurkar et al., 2016) and predict the supporting facts (supporting sentences) of answers in HotpotQA (Yang et al., 2018). Due to limited computing resources, we compare BLANC to baseline models trained with slightly modified hyper-parameter settings rather than against the results from their original papers.

Datasets
SQuAD: SQuAD1.1 (Rajpurkar et al., 2016) is a large reading comprehension dataset for QA. Since the test set for SQuAD1.1 is not publicly available and its benchmark does not provide evaluation on span-based metrics, we split the train data (90%/10%) into new train/dev sets and use the original development set as the test set.
NewsQA & NaturalQ: NewsQA (Trischler et al., 2017) consists of answer-spans to questions generated in a way that reflects realistic information-seeking processes in the news domain. NaturalQuestions (Kwiatkowski et al., 2019) is a QA benchmark built from a real-world scenario, with Google search queries as naturally-occurring questions and passages from Wikipedia annotated with answer-spans. Due to computational limits, we use the curated versions of NewsQA and NaturalQ provided by Fisch et al. (2019). The curated datasets contain only train and development sets, so we use the development set as the test set and build new train and dev sets from the train set (90%/10%).
HotpotQA: HotpotQA (Yang et al., 2018) aims to measure complex reasoning performance of QA models and requires finding relevant sentences from the given passages. HotpotQA consists of passages, questions, answer, and corresponding supporting facts (sentences) for each answer. We use the development set in HotpotQA.

Evaluation Metrics
F1 and EM are evaluation metrics widely used for existing QA models (Rajpurkar et al., 2016). These two metrics measure the number of overlapping tokens between the predicted answers and the ground truth answers. Token-matching evaluation treats even answers in unrelated contexts as correct, and is thus insufficient to evaluate context prediction performance. As alternatives, we propose Span-EM and Span-F1. We modify the metric proposed in Kwiatkowski et al. (2019) to suit our experimental setting.
Span-F1 and Span-EM: Span-F1 and Span-EM are defined over the overlapping indices between the predicted span and the ground truth span:

\[ \text{Span-Precision} = \frac{|[s_p, e_p] \cap [s_g, e_g]|}{e_p - s_p + 1}, \qquad \text{Span-Recall} = \frac{|[s_p, e_p] \cap [s_g, e_g]|}{e_g - s_g + 1} \]
\[ \text{Span-F1} = \frac{2 \cdot \text{Span-Precision} \cdot \text{Span-Recall}}{\text{Span-Precision} + \text{Span-Recall}}, \qquad \text{Span-EM} = \mathbb{1}(s_p = s_g \wedge e_p = e_g) \]

Here, s_p / e_p represent the start/end indices of a predicted answer-span in a passage, and s_g / e_g denote the start/end indices of the ground truth answer-span in a passage. Span-EM measures exactly matched predicted spans, and Span-F1 quantifies the degree of overlap between the predicted answer-span and the ground truth span.

SpanBERT uses span-oriented pre-training for span representations. Since the block attention is stacked on SpanBERT, and to provide detailed results on the effectiveness of BLANC, we use both the 12-layer SpanBERT-base and the 24-layer SpanBERT-large.
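The two span metrics can be computed directly from the four indices. This sketch (with our own helper names, inclusive indices) treats each span as a set of token positions:

```python
def span_f1(s_p, e_p, s_g, e_g):
    """Span-F1: F1 over the overlapping token indices of the predicted
    span [s_p, e_p] and the gold span [s_g, e_g] (inclusive)."""
    overlap = max(0, min(e_p, e_g) - max(s_p, s_g) + 1)
    if overlap == 0:
        return 0.0
    precision = overlap / (e_p - s_p + 1)
    recall = overlap / (e_g - s_g + 1)
    return 2 * precision * recall / (precision + recall)

def span_em(s_p, e_p, s_g, e_g):
    """Span-EM: 1 if the predicted span exactly matches the gold span, else 0."""
    return int(s_p == s_g and e_p == e_g)

# two 4-token spans sharing 3 tokens: precision = recall = 3/4
print(span_f1(3, 6, 4, 7))
# -> 0.75
```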

Hyper-parameter Settings
We conduct experiments with limited hyper-parameter settings (e.g., max sequence length, batch size), as we were constrained by computational resources. We use the same hyper-parameter settings across all baseline models and BLANC. We set the training batch size to 8, the learning rate to 2 × 10^{-5}, the number of training epochs to 3, the max sequence length of the transformer encoder to 384, and the warm-up proportion to 10%, and we use the optimizers from the respective original papers. We set λ to 0.8, the optimal value as we show in Figure 6, for all experiments except the large-model experiment on SQuAD1.1, where we set λ = 0.2. We use a different q, the decay ratio in (1), and a different window-size for each dataset to reflect the average passage length of each QA dataset. We set q = 0.7 and the window-size to 2 on SQuAD, which contains relatively short passages, and q = 0.99 and the window-size to 3 on the other two QA datasets, where most passages are longer than in SQuAD. q and the window-size are optimized empirically.

Results & Discussion
We now present the results for the experiments described in the previous section. We describe the overall reading comprehension performance, highlighting the increased gain on passages with multiple mentions of the answer text.

Figure 5: We categorize the NaturalQ dataset into five groups by the number of answer texts appearing in a passage: n = 1, 2, 3, 4, and n ≥ 5. BLANC outperforms the baseline models in every group, and the performance gap increases as the number of answer texts in a passage increases.

We show that BLANC outperforms other models in zero-shot supporting fact prediction. We also demonstrate the importance of the context prediction loss and the negligible extra parameters and inference time.

Reading Comprehension
We verify the reading comprehension performance of BLANC with four evaluation metrics (F1, EM, Span-F1, and Span-EM) on three QA datasets: SQuAD, NaturalQ, and NewsQA. We show the results in Table 1, which shows that BLANC consistently outperforms all comparison models, including RoBERTa and SpanBERT. We focus on the evaluation metric Span-EM, which measures the exact match of the answer-span, and we further highlight the performance gain of BLANC over the most recent SpanBERT model, both base and large. On NaturalQ, BLANC outperforms SpanBERT by 1.86, whereas the performance difference between SpanBERT and RoBERTa is 0.12. On NewsQA, BLANC outperforms SpanBERT by 2.56, whereas the difference between SpanBERT and RoBERTa is 0.61. This pattern holds for the large models as well.
We now compare the performance gain across the datasets. Recall from Figure 3 that the proportion of multi-mentioned answers is smallest in SQuAD, medium in NaturalQ-MRQA, and largest in NewsQA-MRQA. The reading comprehension results show that the performance gap between BLANC and SpanBERT increases in the same order, verifying the effectiveness of BLANC on realistic multi-mention datasets.

Performance on Passages with Multi-mentioned Answers
In Section 5.1, we showed the Span-EM and EM of BLANC and the baselines on the entire datasets. However, the context-discerning performance is only observable on passages with multiple mentions of the answer text. We investigate the context-aware performance (distinguishing relevant from irrelevant context) of BLANC by categorizing the NaturalQ dataset by the number of occurrences of the answer text in a passage. We subdivide the dataset into five groups: n = 1, 2, 3, 4, and n ≥ 5, where n is the number of occurrences of the answer text in a passage. Figure 5 presents Span-F1 and Span-EM on these subsets of the data. BLANC outperforms SpanBERT and BERT across all subsets, and we show that the performance gain increases as n increases. In Table 2, we explicitly show the reading comprehension performance of BLANC on the question-answer pairs of passages with n ≥ 2 from NaturalQ, and we confirm that the block attention method increases the context-aware performance of SpanBERT by 3.6 Span-F1 and 3.8 Span-EM, which are larger improvements than the increments on the data including n = 1 shown in Table 1.

Supporting Facts Prediction
We present the results of the zero-shot supporting facts prediction task on the HotpotQA dataset (Yang et al., 2018) in Table 3. HotpotQA has ten passages and two supporting facts (sentences) for each question-answer pair. Since HotpotQA has a different data format than the extractive QA datasets, we curate HotpotQA with the following steps. We concatenate the ten passages to make one passage; two supporting facts exist in this passage. By removing each one of them, we build two passages, each containing one supporting fact. We repeat this process for all examples in HotpotQA. As a result, the curated dataset contains triples of one question, one supporting fact, and one passage. We report the accuracy of the models by checking whether the supporting fact includes the predicted span. We train the baseline models and BLANC on SQuAD1.1 and test on the curated development set of the HotpotQA dataset. Table 3 shows that BLANC captures the sentence relevant to the given question better than the other baseline models in the zero-shot setting. This result shows that BLANC is capable of applying what it has learned from one dataset to predicting the context of an answer to a question in another dataset.

Figure 6: Analysis of λ for context word prediction on NaturalQ. We adjust λ, the weight of L_context, from 0.0 to 0.99 and report Span-F1 and Span-EM. Increasing λ improves answer-span prediction until λ = 0.8, after which performance decreases. This decrease is expected as the weight for L_answer becomes too small.
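The curation steps can be sketched as follows. The field names are our own illustration, not the official HotpotQA schema (which stores supporting facts as title/sentence-id pairs); here the concatenated passage is simply a list of sentences with the two supporting sentences given by index.

```python
def curate_hotpotqa(example):
    """Turn one HotpotQA example into two (question, supporting fact, passage) triples.

    Assumes `example` has keys "question", "passage_sentences" (the ten
    concatenated passages as a flat list of sentences), and
    "supporting_facts" (indices of the two supporting sentences).
    These field names are illustrative, not the official schema.
    """
    q = example["question"]
    sents = example["passage_sentences"]
    sf1, sf2 = example["supporting_facts"]
    triples = []
    for keep, drop in ((sf1, sf2), (sf2, sf1)):
        # remove the other supporting sentence, so exactly one remains
        passage = " ".join(s for i, s in enumerate(sents) if i != drop)
        triples.append({"question": q,
                        "supporting_fact": sents[keep],
                        "passage": passage})
    return triples
```

Each triple can then be scored by checking whether the model's predicted span falls inside the remaining supporting sentence.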

Analysis on λ
We verify the relationship between reading comprehension performance and the context word prediction task by conducting reading comprehension experiments with λ ∈ {0.2, 0.4, 0.6, 0.8, 0.9, 0.99}. The hyper-parameter λ represents the weight of L_context in the total loss function L_total. Figure 6 shows that performance increases with λ until λ = 0.8 and decreases thereafter. Leveraging the context word prediction task increases reading comprehension performance, showing the efficacy of BLANC. As λ increases, the weight on L_answer decreases, so we expect performance to drop when λ becomes too large.

Space and Time Complexity
The additional parameters of the block attention model come from Eq. (3) in Section 3.3. The number of parameters is (768 + 1) × 2 = 1538 when the hidden dimension of the transformer encoder is 768; 1538 is negligible compared to the total number of parameters in BERT-base (108M). The exact parameter counts of the baseline models are presented in Table 1.

Conclusion
In this paper, we showed the importance of predicting an answer within the correct context of a given question. We proposed BLANC with two novel ideas: a context word prediction task and a block attention method that identifies an answer within the context of a given question. The context word prediction task labels latent context words using the labeled answer-span and is used in a multi-task learning manner. Block attention models the latent context words with negligible extra parameters and training/inference time. We showed that BLANC increases reading comprehension performance, and we verified that the performance gain increases for complex examples (i.e., when the answer occurs two or more times in the passage). We also showed the generalizability of BLANC and its context-aware performance with the zero-shot supporting fact prediction task on the HotpotQA dataset.

A Properties of Block Attention
A.1 Block Attention on a Soft-label

Theorem 1. There exist two probability distributions, p(i = s_c) and p(i = e_c), that make p(w_i ∈ C) equal to p_soft(w_i ∈ C), which is defined as follows:

\[
p_{\text{soft}}(w_i \in C) =
\begin{cases}
1 & s_a \le i \le e_a \\
q^{\,s_a - i} & s_w \le i < s_a \\
q^{\,i - e_a} & e_a < i \le e_w \\
0 & \text{otherwise}
\end{cases}
\tag{10}
\]

Here, q is the decay ratio, which satisfies q ≤ 1.0. s_a and e_a are the start and end indices of an answer-span and satisfy s_a ≤ e_a. s_w and e_w are the start and end indices of the segment bounded by the window-size and satisfy s_w ≤ s_a and e_a ≤ e_w.
Proof. Based on the independence assumption between s_c and e_c in Section 3.3, p(w_i ∈ C) is the product of two probability distributions:

\[ p(w_i \in C) = p(i \ge s_c) \times p(i \le e_c) \tag{11} \]

Then, the following two cumulative distributions, p(i ≥ s_c) and p(i ≤ e_c), make p(w_i ∈ C) equal to p_soft(w_i ∈ C):

\[
p(i \ge s_c) =
\begin{cases}
1 & i \ge s_a \\
q^{\,s_a - i} & s_w \le i < s_a \\
0 & \text{otherwise}
\end{cases}
\tag{12}
\]
\[
p(i \le e_c) =
\begin{cases}
1 & i \le e_a \\
q^{\,i - e_a} & e_a < i \le e_w \\
0 & \text{otherwise}
\end{cases}
\tag{13}
\]

Since the block attention method can predict any form of p(i = s_c) and p(i = e_c), any soft-label can be represented by the block attention method.
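The construction in the proof above can be checked numerically: multiplying the two cumulative distributions elementwise reproduces the piecewise soft-label of Theorem 1. A small sketch (the helper name is ours):

```python
def theorem1_distributions(l, s_a, e_a, s_w, e_w, q):
    """Build the cumulative distributions from Theorem 1's proof and return
    their elementwise product, which should equal p_soft(w_i in C).

    p(i >= s_c): q**(s_a - i) on [s_w, s_a), 1 for i >= s_a, 0 before s_w.
    p(i <= e_c): q**(i - e_a) on (e_a, e_w], 1 for i <= e_a, 0 after e_w.
    """
    cum_s = [0.0] * l
    cum_e = [0.0] * l
    for i in range(l):
        if i >= s_a:
            cum_s[i] = 1.0
        elif s_w <= i:
            cum_s[i] = q ** (s_a - i)
        if i <= e_a:
            cum_e[i] = 1.0
        elif i <= e_w:
            cum_e[i] = q ** (i - e_a)
    return [s * e for s, e in zip(cum_s, cum_e)]

# answer-span [4, 5], window bounds [2, 7], q = 0.5, passage length 10
print(theorem1_distributions(10, 4, 5, 2, 7, 0.5))
# -> [0.0, 0.0, 0.25, 0.5, 1.0, 1.0, 0.5, 0.25, 0.0, 0.0]
```

Inside the answer-span both factors are 1; to the left of it the start factor decays while the end factor is still 1, and symmetrically on the right, so the product matches (10) everywhere.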

A.2 Block Attention on Multiple Spans
The block attention model can be extended to predict multiple spans.

Theorem 2. Any p_multi-span(w_i ∈ C) with m blocks can be represented by the multiplication of a scaling factor, k, and the probability distribution calculated by the block attention model, p(w_i ∈ C).

Proof. The following two cumulative distributions and the scaling factor make k × p(i ≥ s_c) × p(i ≤ e_c) equal to p_multi-span(w_i ∈ C) for all i.

Since the block attention model can predict any form of p(i = s_c) and p(i = e_c), p_multi-span(w_i ∈ C) can be represented by the multiplication of a scaling factor and the probability distribution calculated by the block attention model.

B Semantic Similarity Between Context Words and Questions
The soft-labeling method assumes that words near an answer-span are likely to be included in the context of a given question. We provide the basis for this assumption with a question-word similarity experiment. The question-word similarity is calculated as the cosine similarity between word vectors and question vectors. We use word2vec vectors and calculate the question vectors by averaging the word vectors in the questions. Figure 7 shows that words adjacent to the answer-spans have the most similar meaning to the given questions. Also, the similarity decreases as the distance between the words and the answer-spans increases. From these results, we verify the assumption.

Figure 7: The semantic similarity between a given question and words in a passage. The x-axis represents the distance between a word and an answer-span. The y-axis represents the cosine similarity between the question and the word on a 100-point scale. Words near an answer-span are likely to have a similar meaning to a given question.

Table 5: The performance of BLANC on NaturalQuestions. We vary the window-size to find the optimal context size. AVG represents the average of the four performance metrics.

C Details about Hyperparameter Settings
We vary the window-size and λ to find the optimal hyper-parameters of BLANC.
C.1 Analysis on Window-size

Table 5 shows the performance of BLANC trained on NaturalQuestions with window-size ∈ {1, 2, 3, 4, 5, 7, 21}. AVG represents the average of the four performance metrics. BLANC shows the best AVG performance at window-size = 3, and we set the window-size to 3 for the NaturalQuestions and NewsQA experiments.

C.2 Varying λ on SQuAD1.1

Table 6 shows the performance of BLANC with two different λ settings on SQuAD1.1. The results show that BLANC performs better at λ = 0.2 than at λ = 0.8 (the optimal value for NaturalQuestions) on SQuAD1.1. We set λ to 0.2 in the SQuAD1.1 experiments.