Probabilistic Assumptions Matter: Improved Models for Distantly-Supervised Document-Level Question Answering

We address the problem of extractive question answering using document-level distant supervision, pairing questions and relevant documents with answer strings. We compare previously used probability space and distant supervision assumptions (assumptions on the correspondence between the weak answer string labels and possible answer mention spans). We show that these assumptions interact, and that different configurations provide complementary benefits. We demonstrate that a multi-objective model can efficiently combine the advantages of multiple assumptions and outperform the best individual formulation. Our approach outperforms previous state-of-the-art models by 4.3 points in F1 on TriviaQA-Wiki and 1.7 points in Rouge-L on NarrativeQA summaries.


Introduction
Distant supervision assumptions have enabled the creation of large-scale datasets that can be used to train fine-grained extractive short answer question answering (QA) systems. One example is TriviaQA (Joshi et al., 2017). There the authors utilized a pre-existing set of trivia question-answer string pairs and coupled them with relevant documents, such that, with high likelihood, the documents support answering the questions (see Fig. 1 for an illustration). Another example is the NarrativeQA dataset (Kočiský et al., 2018), where crowd-sourced abstractive answer strings were used to weakly supervise answer mentions in the text of movie scripts or their summaries. In this work, we focus on the setting of document-level extractive QA, where distant supervision is specified as a set A of answer strings for an input question-document pair. In the TriviaQA example, there are three occurrences of the original answer string "Joan Rivers" (blue), and one alternate but incorrect alias "Diary of a Mad Diva" (purple). Only two "Joan Rivers" mentions (shown in blue boxes) support answering the question. In the NarrativeQA example, there are two answer strings in A: "in the spring at mount helicon" (blue) and "mount helicon" (orange), with the latter being a substring of the former. Both mentions in P2 are correct answer spans.
Depending on the data generation process, the properties of the resulting supervision from the sets A may differ. For example, the provided answer sets in TriviaQA include aliases of original trivia question answers, aimed at capturing semantically equivalent answers but liable to introduce semantic drift. In Fig. 1, the possible answer string "Diary of a Mad Diva" is related to "Joan Rivers", but is not a valid answer for the given question.
On the other hand, the sets of answer strings in NarrativeQA are mostly valid since they have high overlap with human-generated answers for the given question/document pair. As shown in Fig. 1, "in the spring at mount helicon" and "mount helicon" are both valid answers with relevant mentions. In this case, the annotators chose answers that appear verbatim in the text but in the more general case, noise may come from partial phrases and irrelevant mentions.
While distant supervision reduces the annotation cost, increased coverage often comes with increased noise (e.g., expanding entity answer strings with aliases improves coverage but also increases noise). Even for fixed document-level distant supervision in the form of a set of answers A, different interpretations of the partial supervision lead to different points in the coverage/noise space and their relative performance is not well understood.
This work systematically studies methods for learning and inference with document-level distantly supervised extractive QA models. Using a BERT (Devlin et al., 2019) based architecture, we study the choice of probability space, the distant supervision assumption, and the optimization and inference method. We show that the choice of probability space puts constraints on the distant supervision assumptions that can be captured, and that all three choices interact, leading to large differences in performance. Specifically, we provide a framework for understanding different distant supervision assumptions and the corresponding trade-off among the coverage, quality, and strength of the distant supervision signal. The best configuration depends on the properties of the possible annotations A and is thus data-dependent. Compared with recent work also using BERT representations, our study shows that the model with the most suitable probabilistic treatment achieves large improvements of 4.6 F1 on TriviaQA and 1.7 Rouge-L on NarrativeQA. Additionally, we design an efficient multi-loss objective that can combine the benefits of different formulations, leading to significant improvements in accuracy, surpassing the best previously reported results on the two studied tasks. Results are further strengthened by transfer learning from fully labeled short-answer extraction data in SQuAD 2.0 (Rajpurkar et al., 2018), leading to a final state-of-the-art performance of 76.3 F1 on TriviaQA-Wiki and 62.9 Rouge-L on the NarrativeQA summaries task.

Figure 2: The document-level QA model as used for test-time inference. The lower part is a BERT-based paragraph-level answer scoring component, and the upper part illustrates the probability aggregation across answer spans sharing the same answer string. Ξ refers to either a sum or a max operator. In the given example, "Joan Rivers" is derived from two paragraphs.

Probability Space
Here, we first formalize both paragraph-level and document-level models, which have been previously used for document-level extractive QA. Typically, paragraph-level models consider each paragraph in the document independently, whereas document models integrate some dependencies among paragraphs.
To define the model, we need to specify the probability space, consisting of a set of possible outcomes and a way to assign probabilities to individual outcomes. For extractive QA, the probability space outcomes consist of token positions of answer mention spans.
The overall model architecture is shown in Fig. 2. We use BERT (Devlin et al., 2019) to derive representations of document tokens. As is standard in state-of-the-art extractive QA models (Devlin et al., 2019; Min et al., 2019), the BERT model encodes a given question paired with one paragraph from a given document into neural text representations. These representations are then used to define scores/probabilities of possible answer begin and end positions, which are in turn used to define probabilities over possible answer spans. The answer string probabilities can then be defined as the aggregation over all possible answer spans/mentions.
In the following, we show that paragraph-level and document-level models differ only in the space of possible outcomes and the way of computing answer span probabilities from answer position begin and end scores.
Scoring answer begin and end positions Given a question q and a document d consisting of K paragraphs p_1, ..., p_K, the BERT encoder produces contextualized representations for each question-paragraph pair (q, p_k). Specifically, for each token position i_k in p_k, the final hidden vector h_(i,k) ∈ R^d is used as the contextualized token embedding, where d is the vector dimension.
The span-begin score is computed as s_b(i_k) = w_b^T h_(i,k), where w_b ∈ R^d is a learned vector. The span-end score s_e(j_k) is defined in the same way. The probabilities for a start position i_k and an end position j_k are

P_b(i_k) = exp(s_b(i_k)) / Z_b,    P_e(j_k) = exp(s_e(j_k)) / Z_e,

where Z_b, Z_e are normalizing factors, depending on the probability space definition (detailed below). The probability of an answer span from i_k to j_k is defined as P_s(i_k, j_k) = P_b(i_k) P_e(j_k). The partition functions Z_b and Z_e depend on whether we use a paragraph-level or document-level probability space.
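The scoring scheme above can be sketched in a few lines. This is a toy illustration with random stand-ins for the BERT hidden states, not the authors' implementation; the variable names (`h`, `w_b`, `w_e`) mirror the notation in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for BERT outputs: one paragraph of 5 tokens, hidden size 8.
# h[i] is the contextualized embedding of token i; w_b, w_e are the learned
# begin/end scoring vectors (random values here, for illustration only).
h = rng.normal(size=(5, 8))
w_b = rng.normal(size=8)
w_e = rng.normal(size=8)

s_b = h @ w_b  # span-begin scores s_b(i)
s_e = h @ w_e  # span-end scores s_e(j)

# Normalize within this single paragraph (paragraph-level Z_b, Z_e).
P_b = np.exp(s_b) / np.exp(s_b).sum()
P_e = np.exp(s_e) / np.exp(s_e).sum()

# Independence assumption: P_s(i, j) = P_b(i) * P_e(j).
P_s = np.outer(P_b, P_e)
```

Because begin and end probabilities each normalize to one, the resulting span distribution `P_s` sums to one over all (i, j) pairs.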
Paragraph-level model In paragraph-level models, we assume that for a given question against a document d, each of its paragraphs p_1, ..., p_K independently selects a pair of answer positions (i_k, j_k), which are the begin and end of the answer from paragraph p_k. In the case that p_k does not support answering the question q, special NULL positions are selected (following the SQuAD 2.0 BERT implementation, https://github.com/google-research/bert). Thus, the set of possible outcomes Ω in the paragraph-level probability space is the set of lists of begin/end position pairs, one from each paragraph: {[(i_1, j_1), ..., (i_K, j_K)]}, where i_k and j_k range over positions in the respective paragraphs.
The answer positions in different paragraphs are independent, and the probability of each paragraph's answer begin and end is computed by normalizing over all possible positions in that paragraph, i.e.,

Z_b^k = Σ_{i ∈ I_k} exp(s_b(i)),    Z_e^k = Σ_{j ∈ I_k} exp(s_e(j)),

where I_k is the set of all positions in the paragraph p_k (including the special NULL position). The probability of an answer begin at i_k is P_b(i_k) = exp(s_b(i_k))/Z_b^k and the probability of an end at j_k is defined analogously. The probability of a possible answer position assignment for the document d is then defined as

P([(i_1, j_1), ..., (i_K, j_K)]) = Π_{k=1}^{K} P_b(i_k) P_e(j_k).

As we can see from the above definition, due to the independence assumption, models using paragraph-level normalization do not learn to directly calibrate candidate answers from different paragraphs against each other.

Document-level model In document-level models, we assume that for a given question against document d, a single answer span is selected (as opposed to one for each paragraph in the paragraph-level models). Here, the possible positions in all paragraphs are part of a joint probability space and directly compete against each other.
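The per-paragraph normalization with a NULL outcome can be sketched as follows. This is a minimal illustration (the NULL score of 0.0 and the function name are our choices, not the paper's):

```python
import numpy as np

def paragraph_level_probs(scores_per_paragraph):
    """Each paragraph normalizes its begin (or end) scores independently,
    i.e. Z_b^k = sum over positions in that paragraph only. A NULL entry
    is appended so a paragraph can abstain from answering."""
    probs = []
    for s in scores_per_paragraph:
        s = np.append(s, 0.0)  # score 0.0 for the special NULL position
        probs.append(np.exp(s) / np.exp(s).sum())
    return probs

# Two paragraphs with hypothetical begin scores.
p = paragraph_level_probs([np.array([2.0, 1.0]), np.array([0.5, 0.5, 3.0])])
```

Each paragraph's distribution (including NULL) sums to one on its own, so a confident answer in one paragraph has no effect on the probabilities in another, which is exactly the calibration limitation noted above.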
In this case, Ω is the set of token spans {(i, j)}, where i and j are the begin and end positions of the selected answer. The normalizing factors are therefore aggregated over all paragraphs, i.e.,

Z_b = Σ_{k=1}^{K} Σ_{i ∈ I_k} exp(s_b(i)),    Z_e = Σ_{k=1}^{K} Σ_{j ∈ I_k} exp(s_e(j)).

Compared with the paragraph-level normalizers, since there is always a valid answer in the document for the tasks studied here, NULL is not necessary for document-level models and thus can be excluded from the inner summations above. The probability of a possible outcome, i.e. an answer span, is

P_s(i, j) = P_b(i) P_e(j) = exp(s_b(i) + s_e(j)) / (Z_b Z_e).
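The document-level normalizer differs from the paragraph-level one only in the scope of the sum, which a short sketch makes concrete (illustrative names, same toy scores as above):

```python
import numpy as np

def document_level_probs(scores_per_paragraph):
    """All positions across all paragraphs share one partition function,
    Z_b = sum_k sum_{i in I_k} exp(s_b(i)); no NULL outcome is needed
    because the document is assumed to contain a valid answer."""
    all_scores = np.concatenate(scores_per_paragraph)
    return np.exp(all_scores) / np.exp(all_scores).sum()

p = document_level_probs([np.array([2.0, 1.0]), np.array([0.5, 0.5, 3.0])])
```

Here `p` is a single distribution over every position in the document, so a high-scoring position in one paragraph directly lowers the probability of positions in all other paragraphs.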

Distant Supervision Assumptions
There are multiple ways to interpret the distant supervision signal from A as possible outcomes in our paragraph-level and document-level probability spaces, leading to corresponding training loss functions. Although several different paragraph-level and document-level losses (Chen et al., 2017; Kadlec et al., 2016; Clark and Gardner, 2018; Lin et al., 2018; Min et al., 2019) have been studied in the literature, we want to point out that when interpreting the distant supervision signal, there is a tradeoff among multiple desiderata:

• Coverage: maximize the number of instances of relevant answer spans, which we can use to provide positive examples to our model.
• Quality: maximize the quality of annotations by minimizing noise from irrelevant answer strings or mentions.
• Strength: maximize the strength of the signal by reducing uncertainty and pointing the model more directly at correct answer mentions.

We introduce three assumptions (H1, H2, H3) for how the distant supervision signal should be interpreted, which lead to different tradeoffs among the desiderata above (see Table 1).
We begin by setting up additional useful notation. Given a document-question pair (d, q) and a set of answer strings A, we define the set of A-consistent token spans Y_A in d as follows: for each paragraph p_k, span (i_k, j_k) ∈ Y_A^k if and only if the string spanning these positions in the paragraph is in A. For paragraph-level models, if for paragraph p_k the set Y_A^k is empty, we redefine Y_A^k to be {NULL}. Similarly, we define the set of A-consistent begin positions Y_{b,A}^k as the start positions of consistent spans; the set Y_{e,A}^k of A-consistent end positions is defined analogously. In addition, we term an answer span (i, j) correct for question q if its corresponding answer string is a correct answer to q, and the context of the specific mention of that answer string from positions i to j entails this answer. Similarly, we term an answer begin/end position correct if there exists a correct answer span starting/ending at that position.

H1: All A-consistent answer spans are correct. While this assumption is evidently often incorrect (low on the quality dimension), especially for TriviaQA, as seen from Fig. 1, it provides a large number of positive examples and a strong supervision signal (high on coverage and strength). We include this in our study for completeness.
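The construction of the A-consistent span set Y_A can be sketched as a brute-force scan over token spans. This is a toy illustration of the definition only (whitespace joining and lowercasing are simplifications of the actual string matching):

```python
def consistent_spans(tokens, answers):
    """Return Y_A for one paragraph: all (i, j) token spans whose surface
    string is in the answer set A. Spans are inclusive on both ends."""
    spans = set()
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            if " ".join(tokens[i:j + 1]).lower() in answers:
                spans.add((i, j))
    return spans

para = "she wrote diary of a mad diva".split()
A = {"diary of a mad diva", "joan rivers"}
Y = consistent_spans(para, A)  # only the alias span matches here
```

In this toy paragraph only the noisy alias has a mention, which is exactly the situation H2 and H3 below treat differently.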
H1 translates differently into possible outcomes for corresponding models depending on the probability space (paragraph or document). Paragraph-level models select multiple answer spans, one for each paragraph, to form a possible outcome. Thus, multiple A-consistent answer spans can occur in a single outcome, as long as they are in different paragraphs. Multiple A-consistent answer spans in the same paragraph can be seen as mentions that can be selected with equal probability (e.g., by different annotators). Document-level models select a single answer span in the document, and therefore multiple A-consistent answer spans can be seen as occurring in separate annotation events. Row one of Table 2 shows the log-probability of outcomes consistent with H1.

H2: Every positive paragraph has a correct answer in its A-consistent set. Under this assumption, each paragraph with a non-empty set of A-consistent spans (termed a positive paragraph) has a correct answer. As we can see from the TriviaQA example in Fig. 1, this assumption is correct for the first and third paragraphs, but not the second one, as it only contains a mention of a noisy answer alias. This assumption has medium coverage, as it generates positive examples from multiple paragraphs but does not allow multiple positive mentions in the same paragraph. It also decreases noise (higher quality), e.g., it does not claim that all the mentions of "Joan Rivers" in the first paragraph support answering the question. The strength of the supervision signal is weakened relative to H1, as now the model needs to figure out which of the multiple A-consistent mentions in each paragraph is correct.
H2 has two variations: correct span, assuming that one of the answer spans (i_k, j_k) in Y_A^k is correct, and correct position, assuming that the paragraph has a correct answer begin position from Y_{b,A}^k and a correct answer end position from Y_{e,A}^k, but its selected answer span may not necessarily belong to Y_A^k. For example, if A contains {abcd, bc}, then abc would have a correct begin and end, but would not be a correct span. It does not make sense for modeling to assume the paragraph has correct begin and end positions instead of a correct answer span (i.e., we don't really want to get inconsistent answers like abc above), but given that our probabilistic model assumes independence of begin and end answer positions, it may not be able to learn well with span-level weak supervision. Some prior work (Clark and Gardner, 2018) uses an H2 position-based distant supervision assumption with a paragraph-pair model akin to our document-level ones. Lin et al. (2018) use an H2 span-based distant supervision assumption. The impact of position- vs. span-based modeling of the distant supervision is not well understood. As we will see in the experiments, for the majority of settings, position-based weak supervision is more effective than span-based for our model.
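The abcd/bc example above can be made concrete with a small sketch deriving the begin and end position sets from the span set (helper names are ours, for illustration):

```python
def position_sets(tokens, answers):
    """Derive the A-consistent begin positions Y_b and end positions Y_e
    from the A-consistent span set of one paragraph."""
    spans = set()
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            if " ".join(tokens[i:j + 1]) in answers:
                spans.add((i, j))
    begins = {i for i, _ in spans}
    ends = {j for _, j in spans}
    return spans, begins, ends

# The paper's example: A = {abcd, bc} over the tokens a b c d.
spans, Y_b, Y_e = position_sets(["a", "b", "c", "d"], {"a b c d", "b c"})
# "a b c" (span (0, 2)) has an A-consistent begin (0) and end (2) but is
# NOT an A-consistent span: position-based supervision admits it,
# span-based supervision does not.
```

This is precisely the gap between the two H2 variations: the position-based event is a superset of the span-based one.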
For paragraph-level and document-level models, H2 corresponds differently to possible outcomes. For paragraph models, one outcome can select answer spans in all positive paragraphs and NULL in negative ones. For document-level models, we view answers in different paragraphs as outcomes of multiple draws from the distribution. The identity of the particular correct span or begin/end position is unknown, but we can compute the probability of the event comprising the consistent outcomes. Row two of Table 2 shows the log-probability of the outcomes consistent with H2 (right for the span-based and left for the position-based interpretation), when plugging in Σ for Ξ.

H3: The document has a correct answer in its A-consistent set Y_A. This assumption posits that the document has a correct answer span (or begin/end positions), but not every positive paragraph needs to have one. It further improves supervision quality because, for example, it allows the model to filter out the noise in paragraph two in Fig. 1. Since the model is given a choice of any of the A-consistent mentions, it has the capability to assign zero probability mass to the supervision-consistent mentions in that paragraph.
On the other hand, H3 has lower coverage than H1 and H2, because it provides a single positive example for the whole document, rather than one for each positive paragraph. It also reduces the strength of the supervision signal, as the model now needs to figure out which mention to select from the larger document-level set Y_A.
Note that we can only use H3 coupled with a document-level model, because a paragraph-level model cannot directly trade off answers from different paragraphs against each other to select a single answer span from the document. As with the other distant supervision hypotheses, span-based and position-based definitions of the possible consistent outcomes can be formulated. The log-probabilities of these events are defined in row three of Table 2, when using Σ for Ξ. H3 was used by Kadlec et al. (2016) for cloze-style distantly supervised QA with recurrent neural network models.
The probability space (paragraph- vs. document-level) and the distant supervision assumption (H1, H2, and H3, each position- or span-based) together define our interpretation of the distant supervision signal, resulting in definitions of probability space outcomes consistent with the supervision. Next, we define corresponding optimization objectives to train a model based on this supervision and describe the inference methods used to make predictions with a trained model.

Optimization and Inference Methods
For each distant supervision hypothesis, we maximize either the marginal log-likelihood of A-consistent outcomes (MML) or the log-likelihood of the most likely outcome (HardEM). The latter was found effective for weakly supervised tasks including QA and semantic parsing by Min et al. (2019). Table 2 shows the objective functions for all distant supervision assumptions, each comprising a pairing of a distant supervision hypothesis (H1, H2, H3) and a position-based vs. span-based interpretation. The probabilities are defined according to the assumed probability space (paragraph or document). In the table, K denotes the set of all paragraphs in the document, and Y^k denotes the set of weakly labeled answer spans for the paragraph p_k (which can be {NULL} for paragraph-level models). Note that span-based and position-based objective functions are equivalent for H1 because of the independence assumption, i.e. P_s(i_k, j_k) = P_b(i_k)P_e(j_k).

Inference: Since the task is to predict an answer string rather than a particular mention for a given question, it is potentially beneficial to aggregate information across answer spans corresponding to the same string during inference. The score of a candidate answer string x can be obtained as P_a(x) = Ξ_{(i,j)∈X} P_s(i, j), where X is the set of spans corresponding to the answer string x, and Ξ can be either Σ or max. It is usually beneficial to match the training objective with the corresponding inference method, i.e. MML with marginal inference (Ξ = Σ), and HardEM with max (Viterbi) inference (Ξ = max). Min et al. (2019) showed HardEM optimization was useful when using an H2 span-level distant supervision assumption coupled with max inference, but it is unclear whether this trend holds when Sum inference is useful or when other distant supervision assumptions perform better. We therefore study exhaustive combinations of probability space, distant supervision assumption, and training and inference methods.
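The two training objectives and the string-level aggregation Ξ can be sketched together. This is a minimal illustration over hypothetical span probabilities, not the paper's training code:

```python
import numpy as np

def objective(log_probs, mode):
    """log_probs: log-probabilities of the A-consistent outcomes (disjoint
    events). MML marginalizes over them; HardEM keeps only the best one."""
    if mode == "mml":
        return np.log(np.exp(log_probs).sum())
    return log_probs.max()  # hardem

def answer_string_score(span_probs, agg):
    """Aggregate probabilities of spans sharing one answer string
    (Xi = Sum or Max inference)."""
    return span_probs.sum() if agg == "sum" else span_probs.max()

# Hypothetical probabilities of three A-consistent spans.
lp = np.log(np.array([0.1, 0.3, 0.05]))
mml, hard = objective(lp, "mml"), objective(lp, "hardem")
```

Since the marginal over disjoint outcomes is at least as large as the maximum, the MML objective always upper-bounds the HardEM objective for the same model; the two differ in which outcomes receive gradient.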

Data and Implementation
Two datasets are used in this paper: TriviaQA (Joshi et al., 2017) in its Wikipedia formulation, and NarrativeQA (summaries setting) (Kočiský et al., 2018). Using the same preprocessing as Clark and Gardner (2018) for TriviaQA-Wiki, we only keep the top 8 ranked paragraphs of up to 400 tokens for each document-question pair for both training and evaluation. Following Min et al. (2019), for NarrativeQA we define the possible answer string sets A using Rouge-L (Lin, 2004) similarity with crowdsourced abstractive answer strings. We use identical data preprocessing and the evaluation script provided by the authors.
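For readers unfamiliar with the metric, the Rouge-L matching used to build A can be sketched as an LCS-based F1 over tokens. This is our own minimal sketch; the thresholds, tokenization, and candidate generation in the actual preprocessing follow Min et al. (2019), not this code:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(cand, ref):
    """Rouge-L F1 between a candidate span and a reference answer."""
    l = lcs_len(cand, ref)
    if l == 0:
        return 0.0
    p, r = l / len(cand), l / len(ref)
    return 2 * p * r / (p + r)

# A short document span scores against the crowdsourced abstractive answer:
score = rouge_l_f1("mount helicon".split(),
                   "in the spring at mount helicon".split())
```

Candidate spans whose Rouge-L similarity to a reference answer is high enough are added to A, which is how the substring answers in Fig. 1 both end up in the set.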
In this work, we use the BERT-base model for text encoding and train our model with the default configuration as described in Devlin et al. (2019), fine-tuning all parameters. We fine-tune for 3 epochs on TriviaQA and 2 epochs on NarrativeQA.

Optimization and Inference for Latent Variable Models
Here we look at the cross product of optimization (HardEM vs. MML) and inference (Max vs. Sum) for all distant supervision assumptions that result in models with latent variables. We therefore exclude H1 and look at the other two hypotheses, H2 and H3, each coupled with a span-based (Span) or position-based (Pos) formulation and a paragraph-level (P) or document-level (D) probability space. The method used in Min et al. (2019) corresponds to span-based H2-P with HardEM training and Max inference. The results are shown in Fig. 3. First, we observe that inference with Sum leads to significantly better results on TriviaQA under H2-P and H2-D, and a slight improvement under H3-D. On NarrativeQA, inference with Max is better. We attribute this to the fact that correct answers often have multiple relevant mentions in TriviaQA (also see §5.6), whereas for NarrativeQA this is rarely the case. Thus, inference with Sum on NarrativeQA could potentially boost the probability of irrelevant frequent strings.
Consistent with Min et al. (2019), we observe that span-based HardEM works better than span-based MML under H2-P, with a larger advantage on NarrativeQA than on TriviaQA. However, under H2-D and H3-D, span-based MML performs consistently better than span-based HardEM. For position-based objectives, MML is consistently better than HardEM (potentially because HardEM may decide to place its probability mass on begin-end position combinations that do not contain mentions of strings in A). Finally, it can be observed that under each distant supervision hypothesis/probability space combination, position-based MML is always the best among the four objectives. Position-based objectives may perform better due to the independence assumptions for begin/end positions of the model we use, and future work may arrive at different conclusions if position dependencies are integrated. Based on this thorough exploration, we focus on position-based objectives with MML for the rest of this paper.

Probability Space and Distant Supervision Assumptions
In this subsection, we compare probability space and distant supervision assumptions. Table 3 shows the dev set results, where the upper section compares paragraph-level models (H1-P, H2-P), and the lower section compares document-level models (H1-D, H2-D, H3-D). The performance of models with both Max and Sum inference is shown. We report F1 and Exact Match (EM) scores for TriviaQA, and Rouge-L scores for NarrativeQA.

For TriviaQA, H3-D achieves significantly better results than the other formulations. Only H3-D is capable of "cleaning" noise from positive paragraphs that don't have a correct answer (e.g., paragraph two in Fig. 1), by deciding which A-consistent mention to trust. The paragraph-level models H1-P and H2-P outperform their corresponding document-level counterparts H1-D and H2-D. This may be due to the fact that without H3, and without predicting NULL, D models do not learn to detect irrelevant paragraphs.

Unlike for TriviaQA, H2-D models achieve the best performance for NarrativeQA. We hypothesize this is due to the fact that positive paragraphs that don't have a correct answer are very rare in NarrativeQA (as summaries are relatively short and answer strings are human-annotated for the specific documents). Therefore, H3 is not needed to clean noisy supervision, and it is not useful since it also leads to a reduction in the number of positive examples (coverage) for the model. Here, document-level models always improve over their paragraph counterparts, by learning to calibrate paragraphs directly against each other.

Multi-Objective Formulations and Clean Supervision
Here we study two methods to further improve weakly supervised QA models. First, we combine two distant supervision objectives in a multitask manner, i.e. H2-P and H3-D for TriviaQA, and H2-P and H2-D for NarrativeQA, chosen based on the results in §5.3. H2 objectives have higher coverage than H3 while being more susceptible to noise. Paragraph-level models have the advantage of learning to score irrelevant paragraphs (via NULL outcomes). Note that we use the same parameters for the two objectives; the multi-objective formulation does not have more parameters and is no less efficient than the individual models. Second, we use external clean supervision from SQuAD 2.0 (Rajpurkar et al., 2018) to train the BERT-based QA model for 2 epochs. This model matches the P probability space and is able to detect both NULL and extractive answer spans. The resulting network is used to initialize the models for TriviaQA and NarrativeQA. The results are shown in Table 4.
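Because both objectives score the same shared network, combining them amounts to summing two losses over identical parameters. A minimal sketch (the equal weighting is our assumption; the paper does not report per-objective weights):

```python
def multi_objective_loss(loss_h2p, loss_h3d, weight=1.0):
    """Combine two distant-supervision objectives computed from the SAME
    shared model parameters. No extra parameters are introduced; each
    batch simply contributes the (weighted) sum of both losses."""
    return loss_h2p + weight * loss_h3d

combined = multi_objective_loss(1.5, 0.5)
```

Gradients of the combined loss flow through the shared encoder from both interpretations of the supervision, which is why training cost stays essentially the same as for a single objective.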
It is not surprising that using external clean supervision improves model performance (e.g., Min et al., 2017). We note that, interestingly, this external supervision narrows the performance gap between paragraph-level and document-level models, and reduces the difference between the two inference methods.
Compared with their single-objective components, multi-objective formulations improve performance on both TriviaQA and NarrativeQA. Compared to the recent TriviaQA state of the art (Wang et al., 2018b), our best models achieve 4.9 F1 and 5.5 EM improvement on the full test set, and 6.8 F1 and 7.4 EM improvement on the verified subset. On the NarrativeQA test set, we improve Rouge-L by 3.0 over Nishida et al. (2019). The large improvement, even without additional fully labeled data, demonstrates the importance of selecting an appropriate probability space and interpreting the distant supervision in a way cognizant of the properties of the data, as well as selecting a strong optimization and inference method. With external fully labeled data to initialize the model, performance is further significantly improved.

Analysis
In this subsection, we carry out analyses to study the relative performance of paragraph-level and document-level models, depending on the size of the answer string set |A| and the number of A-consistent spans, which are hypothesized to correlate with label noise. We use the TriviaQA dev set and the best performing models, i.e. H2-P and H3-D with Sum inference.
We categorize examples based on the size of their answer string set, |A|, and the size of their corresponding set of A-consistent spans, |I|. Specifically, we divide the data into 4 subsets and report performance separately on each subset, as shown in Table 6. In general, we expect Q_sl and Q_ll to be noisier due to the larger |I|: Q_sl potentially includes many irrelevant mentions, while Q_ll likely contains more incorrect answer strings (false aliases). We can observe that the improvement is more significant for these noisier subsets, suggesting document-level modeling is crucial for handling both types of label noise.

Related Work
Distant supervision has been successfully used for decades for information extraction tasks such as entity tagging and relation extraction (Craven and Kumlien, 1999; Mintz et al., 2009). Several ways have been proposed to learn with DS, e.g., multi-label multi-instance learning (Surdeanu et al., 2012), assuming at least one supporting evidence (Hoffmann et al., 2011), integration of label-specific priors (Ritter et al., 2013), and adaptation to shifted label distributions (Ye et al., 2019). Recent work has started to explore distant supervision to scale up QA systems, particularly for open-domain QA where the evidence has to be retrieved rather than given as input. Reading comprehension (RC) with evidence retrieved from information retrieval systems establishes a weakly-supervised QA setting due to the noise in the heuristics-based span labels (Chen et al., 2017; Joshi et al., 2017; Dunn et al., 2017; Dhingra et al., 2017). One line of work jointly learns RC and evidence ranking using either a pipeline system (Wang et al., 2018a; Lee et al., 2018; Kratzwald and Feuerriegel, 2018) or an end-to-end model.
Another line of work focuses on improving distantly-supervised RC models by developing learning methods and model architectures that can better use noisy labels. Clark and Gardner (2018) propose a paragraph-pair ranking objective, which has components of both our H2-P and H3-D position-based formulations. They don't explore multiple inference methods or combinations of objectives and use less powerful representations. In (Lin et al., 2018), a coarse-to-fine model is proposed to handle label noise by aggregating information from relevant paragraphs and then extracting answers from selected ones. Min et al. (2019) propose a hard EM learning scheme which we included in our experimental evaluation.
Our work focuses on examining probabilistic assumptions for document-level extractive QA. We provide a unified view of multiple methods in terms of their probability space and distant supervision assumptions and evaluate the impact of their components in combination with optimization and inference methods. To the best of our knowledge, the three DS hypotheses along with position and span-based interpretations have not been formalized and experimentally compared on multiple datasets. In addition, the multi-objective formulation is new.

Conclusions
In this paper, we demonstrated that the choice of probability space and interpretation of the distant supervision signal for document-level QA have a large impact, and that they interact. Depending on the properties of the data, different configurations are best, and a combined multi-objective formulation can reap the benefits of its constituents.
A future direction is to extend this work to question answering tasks that require reasoning over multiple documents, e.g., open-domain QA. In addition, the findings may generalize to other tasks, e.g., corpus-level distantly-supervised relation extraction.