Learning a Cost-Effective Annotation Policy for Question Answering

State-of-the-art question answering (QA) relies upon large amounts of training data for which labeling is time consuming and thus expensive. For this reason, customizing QA systems is challenging. As a remedy, we propose a novel framework for annotating QA datasets that entails learning a cost-effective annotation policy and a semi-supervised annotation scheme. The latter reduces the human effort: it leverages the underlying QA system to suggest potential candidate annotations. Human annotators then simply provide binary feedback on these candidates. Our system is designed such that past annotations continuously improve future performance and thus reduce the overall annotation cost. To the best of our knowledge, this is the first paper to address the problem of annotating questions with minimal annotation cost. We compare our framework against traditional manual annotations in an extensive set of experiments. We find that our approach can reduce the annotation cost by up to 21.1%.


Introduction
Question answering (QA) based on textual content has attracted a great deal of attention in recent years (Xie et al., 2020). In order for state-of-the-art QA models to succeed in real applications (e.g., customer service), there is often a need for large amounts of training data. However, manually annotating such data can be extremely costly. For example, in many realistic scenarios, there exists a list of questions from real users (e.g., search logs, FAQs, service-desk interactions). Yet, annotating such questions is highly expensive (Nguyen et al., 2016; He et al., 2018; Kwiatkowski et al., 2019): it requires the screening of a text corpus to find the relevant document(s) and subsequently screening the document(s) to identify the answering text span(s). Motivated by the above scenarios, we study cost-effective annotation for question answering, whereby we aim to accurately annotate a given set of user questions with as little cost as possible.
Generally speaking, there has been extensive research on how to reduce effort in the process of data labeling (Haffari et al., 2009). For example, active learning for a variety of machine learning and NLP tasks (Siddhant and Lipton, 2018) aims to select a small, yet highly informative, subset of samples to be annotated. The selection of such samples is usually coupled with a particular model, and thus, the annotated samples may not necessarily help to improve a different model (Lowell et al., 2019). In contrast, we aim to annotate all given samples at low cost and in a manner that can subsequently be used to develop any advanced model. This is particularly relevant in the current era, where a dataset often outlives a particular model (Lowell et al., 2019). Moreover, there has also been some research into learning from distant supervision (Xie et al., 2020) or self-supervision. Despite being economical, such approaches often produce inaccurate or noisy annotations.
In this work, we seek to reduce annotation costs without compromising the resulting dataset quality.
Figure 1: High-level overview of our framework: We leverage a QA model to predict candidate annotations for a given resource (e.g., x stands for a question or question-document pair, while y is the document or answer span). A policy model decides upon whether to invoke a MAN or SEM scheme based on those predictions. In the event that the semi-supervised strategy fails, we switch back to a manual annotation scheme. Finally, we use the annotated sample to update both the QA model and the policy model.
We propose a novel annotation framework which learns a cost-effective policy for choosing between different annotation schemes, namely the conventional manual annotation scheme (MAN) and a semi-supervised annotation scheme (SEM). Unlike the manual scheme, SEM does not require humans to screen a text corpus or document(s) in order to retrieve annotations. Instead, it leverages an initialized QA system, which can predict top-n candidate annotations for documents or answer spans, and asks humans to provide binary feedback (e.g., correct or incorrect) to the candidates. While this annotation scheme comes at a low cost, it fails when human annotators mark all candidates as incorrect.
In such cases, the annotation cost has already been incurred and cannot be recouped. In order to produce an annotation, one must then draw upon the manual scheme (see Fig. 1), in which case the policy would have been more effective if it had chosen the manual annotation scheme instead. Therefore, how to choose the best annotation scheme for each question is the challenge we must address for this task.
To tackle the above challenge, we propose a novel approach for learning a cost-effective policy. Here the policy receives several candidates and decides on this basis which annotation scheme to invoke. We train the policy with a supervised objective and learn a cost-sensitive decision threshold. The inherent advantage of this method is that our policy immediately reacts to changing costs (without re-optimizing model parameters) and does not exceed the cost of conventional manual annotation. Our policy is updated iteratively as more annotations are obtained.
We compare our framework against conventional, manual annotations in an extensive set of experiments. We simulate the annotation of NaturalQuestions (Kwiatkowski et al., 2019), as it consists of real user questions from search logs. Models in our framework are initialized with an existing dataset (SQuAD; Rajpurkar et al., 2016) and, as more annotations on NaturalQuestions become available, the framework is continuously updated. We study the sensitivity of our framework to varying cost ratios between SEM and MAN. Our framework outperforms traditional manual annotation, even under conservative cost estimates for SEM, and in general reduces annotation costs in the range of 4.1% to 21.1%.
All source code is publicly available from github.com/bernhard2202/qa-annotation.

Related Work
Question answering: In this paper, we study cost-effective annotation for question answering over textual content. There have been extensive efforts to create large-scale datasets for text-based QA, which have facilitated the development of state-of-the-art neural network based models (e.g., Chen et al., 2017; Min et al., 2018; Kratzwald and Feuerriegel, 2018; Xie et al., 2020). Here we divide such datasets into two categories according to the way they were created: (1) Datasets whose questions were created by crowdsourcing during the annotation process. Prominent examples include the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016), HotPotQA (Yang et al., 2018), or NewsQA (Trischler et al., 2017). (2) "Natural" datasets in which real-world questions are given a priori. Here questions originate from, e.g., search logs or customer interactions. Prominent examples in this category include MS MARCO (Nguyen et al., 2016), DuReader (He et al., 2018), or NaturalQuestions (Kwiatkowski et al., 2019). This paper focuses on the latter category, that is, annotating "natural" datasets in a more cost-effective fashion where a set of questions is given.
Active Learning: In the fields of machine learning and NLP, extensive research has been conducted on ways to reduce labeling effort (e.g., Zhu et al., 2008). For example, the objective of active learning is to select only a small subset that is highly informative (e.g., Haffari et al., 2009) for annotation. To this end, researchers have developed various techniques based on, e.g., model uncertainty (cf. Siddhant and Lipton, 2018), expected model change (Cai et al., 2013), or functions learned directly from data (e.g., Fang et al., 2017). However, the success of active learning is often coupled with a particular model and domain (Lowell et al., 2019). For instance, a dataset actively acquired with the help of an SVM model might underperform when used to develop an LSTM model. These problems become even more salient when complex black-box models are used in NLP tasks (cf. Chang et al., 2019). To summarize, active learning reduces annotation costs by deciding which samples should be annotated. In our approach, we aim to annotate all samples and study how we should annotate them in order to reduce costs. Thus, the two approaches are orthogonal and can be combined.
Learning from weak supervision and user feedback: Another approach to reducing annotation costs is changing full supervision to some form of weak (but potentially noisier) supervision. This has been adopted for various tasks such as machine translation (Saluja, 2012; Petrushkov et al., 2018; Clark et al., 2018; Kreutzer and Riezler, 2019), semantic parsing (Clarke et al., 2010; Liang et al., 2017; Talmor and Berant, 2018), or interactive systems that learn from user interactions (Iyer et al., 2017; Gur et al., 2018; Yao et al., 2019, 2020). For instance, Iyer et al. (2017) asked users to flag incorrect SQL queries. In contrast, similar approaches for text-based question answering are scarce. Joshi et al. (2017) used noisy distant supervision to annotate the answer span and document for given trivia questions and their answers. Kratzwald and Feuerriegel (2019) designed a QA system that continuously learns from noisy user feedback after deployment. In contrast to these works, this paper studies the problem of reducing labeling cost while maintaining accurate annotations.
Quality estimation and answer triggering: In a broader sense, this work is related to the literature on translation quality estimation (e.g., Martins et al., 2017; Specia et al., 2013). The goal in such works is to estimate (and possibly improve) the quality of translated text. Similarly, in question answering, researchers use means of quality estimation for answer triggering (Zhao et al., 2017; Kamath et al., 2020). Here, QA systems are given the additional option to abstain from answering a question when the best prediction is believed to be wrong. In our work, we estimate the quality of a set of suggested label candidates and, on the basis of these estimates, we decide which annotation scheme to invoke.

Proposed Annotation Framework
We study the problem of reducing the overall cost for annotating every given question $[q_1, \ldots, q_m]$. Specifically, our objective is to obtain the corresponding question-document-answer triples $\langle q_i, d_i, s_i \rangle$. In this paper, the natural language question $q_i$ is given, while we want to obtain the following annotations: the document $d_i \in D$ from a text corpus $D$ that contains the answer and the correct answer span $s_i$ within the document $d_i$.

Framework Overview
Fig. 1 provides an overview of our framework for a cost-effective annotation of QA datasets. The framework comprises two main components: a QA model is used to suggest candidates for a resource to annotate while a policy model decides which annotation scheme to invoke (i.e., action). Our framework makes use of two annotation schemes: a traditional manual annotation scheme (MAN) and our semi-supervised annotation scheme (SEM). Both annotation schemes incur different costs and, hence, the learning task is to find and update a cost-effective policy π for making that decision.
QA model: We define Ω as an arbitrary QA model over a text corpus $D$ with the following properties. First, the model can be trained from annotated data samples, e.g., question-document-answer triples $\langle q_i, d_i, s_i \rangle$. Second, for a given question the model can predict a number of top-n documents likely to contain the answer, i.e., $\Omega^D : q \mapsto [d^{(1)}, \ldots, d^{(n)}]$ with $d^{(j)} \in D$. Third, for a given question-document pair the model can predict a number of top-n answer spans, i.e., $\Omega^S : \langle q, d \rangle \mapsto [s^{(1)}, \ldots, s^{(n)}]$. These properties are fulfilled by recent QA systems dealing with textual content (e.g., Chen et al., 2017).
Policy model: For every question, we distinguish two policy models: a policy model $\pi^D$ responsible for annotating documents and $\pi^S$ for answer spans. For brevity, we sometimes drop the superscripts S and D and simply refer to them as π. The policy models decide whether a manual annotation scheme or rather our proposed semi-supervised annotation scheme is used, each of which is associated with different costs.

Annotation Schemes
Manual annotation (MAN) scheme: This scheme represents the status quo in which all annotations are determined manually. In order to annotate a question $q_i$, a human annotator must first manually search through the text corpus $D$ in order to identify the document $d_i$ that answers the question. In a second step, a human annotator manually reads through the document $d_i$ and marks the answer span $s_i$.
We assume separate costs, which are fixed over time, for every annotation level. The price of annotating a document for a given question is defined as $c^D_0$ and the price of annotating an answer span for a given question-document tuple as $c^S_0$. We explicitly distinguish these costs as the tasks can be of differing difficulty.
Semi-supervised annotation (SEM) scheme: This scheme is supposed to reduce human effort by presenting candidates for annotation, so that only simple binary feedback is needed. In particular, human annotators no longer need to search through the entire document or corpus. Instead, we use the QA model Ω to generate a set of candidates (e.g., top-ranked documents or answer spans) and ask human annotators to give binary feedback in response (e.g., accept the candidate or reject it). This replaces the complex search task with a simpler form of interaction. As an example, to annotate the answer span for a question-document pair, the human annotator would not be required to read the entire document $d_i$, but only to determine which of the top-n answers provided by $\Omega^S(q_i, d_i)$ are correct. We assume SEM costs $c^D_1$ to annotate a document and $c^S_1$ to annotate an answer span. The SEM scheme should make annotations more straightforward, as providing binary feedback requires less time than reading through the texts.
Hence, we assume that $c^S_1 < c^S_0$ and $c^D_1 < c^D_0$ hold. However, semi-supervised annotations might fail when none of the candidates is correct (i.e., the human annotators reject all candidates). In this case, our framework must revert to the MAN procedure in order to obtain a valid annotation. As a consequence, the associated cost will increase to the accumulated cost for both the SEM and the MAN schemes. Note that, no matter which scheme is chosen in practice, all annotations are confirmed by human annotators and our resulting dataset will be equal in quality to those resulting from traditional annotation.

Annotation Costs
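Both annotation schemes, MAN and SEM, incur different costs that further vary depending on whether annotation is provided at document level ($c^D$) or at answer span level ($c^S$). For annotating documents, the cost amounts to

$$c^D(a, q_i) = \begin{cases} c^D_0 & \text{if } a = \text{MAN}, \\ c^D_1 & \text{if } a = \text{SEM and } d^*_i \in \Omega^D(q_i), \\ c^D_0 + c^D_1 & \text{if } a = \text{SEM and } d^*_i \notin \Omega^D(q_i), \end{cases} \quad (1)$$

where $a$ is the selected annotation scheme and $d^*_i$ is the ground-truth document annotation. Hence, $d^*_i \in \Omega^D(q_i)$ indicates that the candidate set contains the ground-truth annotation and SEM is successful. For annotating answer spans, the cost is given by

$$c^S(a, q_i, d_i) = \begin{cases} c^S_0 & \text{if } a = \text{MAN}, \\ c^S_1 & \text{if } a = \text{SEM and } s^*_i \in \Omega^S(q_i, d_i), \\ c^S_0 + c^S_1 & \text{if } a = \text{SEM and } s^*_i \notin \Omega^S(q_i, d_i). \end{cases} \quad (2)$$

Alternatively, we can write the cost function as a matrix of annotation costs (Tbl. 1). The diagonal entries reflect the costs paid for choosing the optimal scheme. The off-diagonals refer to the costs paid for a sub-optimal method (misclassification costs).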
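The cost structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `annotation_cost` and `cost_matrix` are hypothetical, and the same function is used for both levels by passing either the document-level or the span-level costs.

```python
MAN, SEM = "MAN", "SEM"

def annotation_cost(scheme, sem_success, c_man, c_sem):
    """Cost of annotating one item at one level (document or answer span).

    scheme:      MAN or SEM, the scheme that was selected.
    sem_success: True if the ground truth is among the top-n candidates.
    c_man/c_sem: per-item cost of the manual / semi-supervised scheme.
    """
    if scheme == MAN:
        return c_man
    if sem_success:
        return c_sem
    # SEM failed: its cost is already spent and MAN has to follow.
    return c_sem + c_man

def cost_matrix(c_man, c_sem):
    """Cost c(chosen, optimal) for every pair of schemes.

    SEM is treated as the optimal scheme exactly when it would succeed,
    which relies on the assumption c_sem < c_man made in the text.
    """
    return {
        (MAN, MAN): annotation_cost(MAN, False, c_man, c_sem),
        (MAN, SEM): annotation_cost(MAN, True, c_man, c_sem),
        (SEM, SEM): annotation_cost(SEM, True, c_man, c_sem),
        (SEM, MAN): annotation_cost(SEM, False, c_man, c_sem),
    }
```

For example, with a MAN cost of 1.0 and a SEM cost of 0.25, a failed SEM attempt costs 1.25, which is the only entry of the matrix that exceeds the manual cost.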

Objective
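We aim to minimize the overall annotation cost, i.e., the sum of the document-level and answer-span-level costs over all $m$ questions:

$$\min_{\pi^D, \pi^S} \sum_{i=1}^{m} \left[ c^D(a^D_i, q_i) + c^S(a^S_i, q_i, d_i) \right], \quad (3)$$

where $a^D_i$ and $a^S_i$ denote the annotation schemes selected by $\pi^D$ and $\pi^S$ for question $q_i$.
Table 1: Annotation costs for a document d.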

Annotation Procedure and Learning
Our framework proceeds according to these nine steps when annotating a question $q_i$ (see Alg. 1): First, we predict a number of top-n documents that would be shown to annotators in the case of SEM annotation (line 2). Next, we decide upon the annotation scheme conditional on the prediction from the QA model (line 3) and, based on the selected scheme, request the ground-truth document annotation (line 4). After receiving the ground-truth document and observing the annotation costs, we update our policy network in line 5 (see Sec. 4.3).
Next, we predict a number of top-n answer span candidates for the question-document pair (line 6) and then decide upon the annotation scheme in line 7. After receiving the answer span annotation and observing a cost (line 8), we again update our policy model (line 9). Finally, we update the QA model with the newly annotated training sample in line 10 (see Sec. 4.4). In practice, both policy updates and QA model updates (lines 5, 9, and 10) are invoked after a batch of questions is annotated. Furthermore, we initialize all models with an existing dataset (e.g., SQuAD).
Algorithm 1: High-Level Procedure of Annotation and Learning
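The procedure described above can be sketched as follows. This is a hedged illustration rather than the authors' Algorithm 1: the objects `qa`, `policy_doc`, `policy_span`, and `oracle` and all of their method names are hypothetical interfaces, and the comments map each statement to the line numbers mentioned in the text. The simulated annotator is represented by `oracle`, which returns the human-confirmed annotation.

```python
def annotate_with(scheme, candidates, truth, c_man, c_sem):
    """Return (annotation, cost); SEM falls back to MAN if no candidate is correct."""
    if scheme == "SEM":
        if truth in candidates:        # annotator accepts one of the candidates
            return truth, c_sem
        return truth, c_sem + c_man    # all rejected: both schemes are paid
    return truth, c_man

def annotate_question(q, qa, policy_doc, policy_span, oracle, costs):
    """One pass of the annotation procedure for a single question q.

    Hypothetical interfaces (not from the paper):
      qa.top_documents(q), qa.top_spans(q, d)  -> lists of candidates
      qa.add_training_sample(q, d, s)          -> stores a sample for retraining
      policy.choose(candidates)                -> "MAN" or "SEM"
      policy.record(candidates, sem_optimal)   -> stores a training label
      oracle.document(q), oracle.span(q, d)    -> human-confirmed annotations
    """
    total = 0.0
    doc_cands = qa.top_documents(q)                              # line 2
    scheme = policy_doc.choose(doc_cands)                        # line 3
    d, cost = annotate_with(scheme, doc_cands, oracle.document(q),
                            costs["doc_man"], costs["doc_sem"])  # line 4
    total += cost
    policy_doc.record(doc_cands, d in doc_cands)                 # line 5

    span_cands = qa.top_spans(q, d)                              # line 6
    scheme = policy_span.choose(span_cands)                      # line 7
    s, cost = annotate_with(scheme, span_cands, oracle.span(q, d),
                            costs["span_man"], costs["span_sem"])  # line 8
    total += cost
    policy_span.record(span_cands, s in span_cands)              # line 9

    qa.add_training_sample(q, d, s)                              # line 10
    return (q, d, s), total
```

In this sketch, `record` and `add_training_sample` only collect data; the actual parameter updates would then be run once per batch, as the text notes.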

Policy Updates
Updating our policy model proceeds in three steps.
(1) We calculate whether the chosen action for past annotations was cost-optimal (i.e., whether the policy should have chosen the other scheme for annotation or not). (2) We use this information to update the policy model with a supervised binary classification objective. This trains the policy to predict the probability of an annotation scheme given a new sample p(a|x) without taking costs into account. (3) We find a cost-sensitive decision threshold that chooses the optimal action with respect to the costs. All three steps are repeated after a full batch of samples has been annotated.
Separating the policy update and the cost-sensitive decision threshold has several benefits. First, we know from cost-sensitive classification that we can calculate an optimal threshold point for ground-truth probabilities $p(a \mid x)$ (cf. Elkan, 2001; Ting, 2000). Therefore, we can focus our effort on determining probabilities that are as accurate as possible. Second, the decision threshold is calculated only from the costs $c^S_a$ and $c^D_a$ and, hence, if costs change, we do not need to re-estimate parameters but can directly adjust our policy.
(1) Finding the cost-optimal action: In order to train the policy with a supervised update, we require labels for the cost-optimal annotation scheme for a given sample. If we choose the SEM annotation scheme, we immediately know whether the action was cost-optimal or not. This is due to the fact that, if the semi-supervised annotation fails, we have to switch to the MAN scheme to receive an annotation and pay both costs. On the other hand, if we choose the MAN scheme, we can observe the optimal action only after receiving the ground-truth annotation: We can then simply run the QA model and validate whether the annotation was contained within the top-n candidates. If so, the SEM action would have been the better choice; otherwise, choosing MAN would have been cost-optimal.
(2) Supervised model updates: Our policy model is a neural network with parameters θ that predicts the probability of an annotation scheme $a$ for a given sample $x$, i.e., $p(a \mid x; \theta)$. Note that we dropped the S and D indices here as both policies differ only in the neural network architecture used. We can then simply train the policy with a supervised binary cross-entropy loss given the cost-optimal action that we calculated beforehand. Since the SEM scheme is often suboptimal in the beginning, the training data is highly imbalanced. Therefore, we down-sample past annotations with a sampling ratio of α such that our training data is balanced.
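Steps (1) and (2) can be illustrated with a small sketch that derives the cost-optimal label for a past annotation and then balances the resulting training set. The function names and the `(features, label)` tuple layout are hypothetical, and α is taken here to be the fraction of majority-class samples that is kept, which is one plausible reading of the sampling ratio.

```python
import random

def cost_optimal_label(chosen, sem_succeeded=None, truth_in_top_n=None):
    """Return "SEM" or "MAN", the scheme that would have been cheaper.

    If SEM was chosen, success or failure was observed directly.
    If MAN was chosen, the QA model is run afterwards and we check
    whether the ground truth was among its top-n candidates.
    """
    if chosen == "SEM":
        return "SEM" if sem_succeeded else "MAN"
    return "SEM" if truth_in_top_n else "MAN"

def downsample(samples, seed=0):
    """Balance (features, label) pairs by down-sampling the majority label.

    Returns the balanced list and alpha, the fraction of majority-class
    samples that was kept.
    """
    rng = random.Random(seed)
    sem = [s for s in samples if s[1] == "SEM"]
    man = [s for s in samples if s[1] == "MAN"]
    minority, majority = (sem, man) if len(sem) <= len(man) else (man, sem)
    if not majority:
        return list(samples), 1.0
    kept = rng.sample(majority, len(minority))
    alpha = len(kept) / len(majority)
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced, alpha
```

The returned `alpha` is kept alongside the model because the predicted probabilities have to be corrected for the down-sampling before a cost-based threshold is applied.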
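(3) Cost-sensitive decision threshold: Choosing the annotation scheme with the highest probability does not take the actual costs into account. For instance, if SEM annotations are much cheaper than MAN annotations, we want to choose the semi-supervised scheme even if its probability of success is low. More formally, we want to choose the annotation scheme $a$ that has the lowest expected cost $R(a \mid x)$, i.e.,

$$R(a \mid x) = \sum_{a'} p(a' \mid x)\, c(a, a'), \quad (5)$$

where $c(a, a')$ is the annotation cost for choosing scheme $a$ when the optimal scheme was $a'$ (see Tab. 1). Since we used down-sampling in our training, we have to calibrate the probabilities $p(a \mid x)$ with the sampling ratio α (Pozzolo et al., 2015). Assuming the calibrated probabilities are accurate, there exists an optimal classification threshold β (Elkan, 2001) that minimizes Eq. 5 and Eq. 3. Therefore, we define our policy as follows:

$$\pi(x) = \begin{cases} \text{SEM} & \text{if } p(\text{SEM} \mid x) \geq \beta, \\ \text{MAN} & \text{otherwise.} \end{cases} \quad (6)$$

The optimal β can be derived from the classification cost matrix (Elkan, 2001),

$$\beta = \frac{c(\text{SEM}, \text{MAN}) - c(\text{MAN}, \text{MAN})}{c(\text{SEM}, \text{MAN}) - c(\text{MAN}, \text{MAN}) + c(\text{MAN}, \text{SEM}) - c(\text{SEM}, \text{SEM})}, \quad (7)$$

again omitting superscripts D and S for brevity. With $c(\text{MAN}, \text{MAN}) = c(\text{MAN}, \text{SEM}) = c_0$, $c(\text{SEM}, \text{SEM}) = c_1$, and $c(\text{SEM}, \text{MAN}) = c_0 + c_1$, Eq. 7 simplifies to the fraction of SEM annotation costs to MAN annotation costs. Therefore, we can derive β at document level and answer span level as

$$\beta^D = \frac{c^D_1}{c^D_0}, \qquad \beta^S = \frac{c^S_1}{c^S_0}.$$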
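A minimal sketch of this decision rule is given below. It rests on several assumptions: the calibration formula is the standard correction for undersampling described by Dal Pozzolo et al. (2015), applied under the assumption that the MAN-optimal class is the one being down-sampled and that `alpha` is the fraction of those samples kept; the threshold is taken to be the ratio of SEM to MAN cost; and all function names are hypothetical.

```python
def calibrate(p_sampled, alpha):
    """Correct p(SEM | x) for down-sampling of the MAN-optimal class.

    p_sampled: probability predicted by a model trained on balanced data.
    alpha:     fraction of MAN-optimal samples kept during down-sampling.
    """
    return alpha * p_sampled / (alpha * p_sampled - p_sampled + 1.0)

def threshold(c_man, c_sem):
    """Cost-sensitive threshold beta as the ratio of SEM to MAN cost."""
    return c_sem / c_man

def expected_cost(scheme, p_sem, c_man, c_sem):
    """Expected cost of a scheme when SEM succeeds with probability p_sem."""
    if scheme == "MAN":
        return c_man
    return p_sem * c_sem + (1.0 - p_sem) * (c_sem + c_man)

def choose_scheme(p_sampled, alpha, c_man, c_sem):
    """Pick SEM whenever the calibrated probability reaches the threshold."""
    p = calibrate(p_sampled, alpha)
    return "SEM" if p >= threshold(c_man, c_sem) else "MAN"
```

Because the threshold depends only on the two costs, changing `c_sem` or `c_man` changes the decision immediately without retraining the probability model. Under the stated assumptions, the chosen scheme is also the one with the lower `expected_cost`.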

Model Updates
Periodically updating the QA model Ω allows the framework to adapt to the question style and domain at hand during the annotation process. Therefore, we improve the top-n accuracy and the success rate of the SEM scheme over time. For practical reasons, we refrain from updating Ω after every annotation but periodically retrain the model after a batch of samples is annotated. In order to update the QA model, we can use the fully annotated QA samples in combination with a supervised objective.

Experimental Setup
In this section, we introduce our experimental setup and implementation details.

Datasets
We base our experiments on the NaturalQuestions dataset (Kwiatkowski et al., 2019). We choose this dataset as it is composed of about 300,000 real user questions posed to the Google search engine along with human-annotated documents and answer spans. Simulating the annotation of this dataset is similar to what would happen for domain customization of QA models in real practice (e.g., for search logs, FAQs, logs of past customer interactions). We focus on questions from the training split that possess an answer span annotation and leave the handling of questions that do not have an answer for future work. The corpus for annotations is fixed to the English Wikipedia, containing more than 5 million text documents.
Simulation of annotations: Annotations in our experiments are simulated from the original dataset. If the framework chooses MAN annotation, we simply use the original annotation from the dataset. If a SEM annotation is chosen, we simulate users that give positive feedback only to the ground-truth document and to the answer spans where the text matches the ground-truth annotation. We then construct the new annotation using the candidate with positive feedback. Since we simulate annotations, we conduct extensive experiments on how annotation costs influence the performance of our framework.
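The simulated feedback can be sketched as follows. The exact text-matching rule is not recoverable from this section, so the case- and whitespace-insensitive comparison in `normalize` is purely an assumption for illustration, as are the function names.

```python
def normalize(text):
    # Hypothetical matching rule: ignore case and repeated or surrounding whitespace.
    return " ".join(text.lower().split())

def simulate_sem(candidates, truth):
    """Simulated binary feedback: accept only candidates matching the ground truth.

    Returns the accepted candidate, or None if every candidate is rejected.
    """
    for cand in candidates:
        if normalize(cand) == normalize(truth):
            return cand
    return None

def simulate_annotation(scheme, candidates, truth):
    """Return (annotation, sem_failed) for one simulated annotation."""
    if scheme == "SEM":
        accepted = simulate_sem(candidates, truth)
        if accepted is not None:
            return accepted, False
        # All candidates rejected: fall back to the original dataset annotation.
        return truth, True
    return truth, False
```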

Baselines
To the best of our knowledge, there is no comparable prior work. Owing to this fact, we evaluate our framework against several customized baselines. First, we compare our approach against a manual annotation baseline in which we always invoke the full MAN method to annotate samples. This represents the traditional method of annotating QA datasets and thus our prime baseline. Second, we draw upon a clairvoyant oracle policy that always knows the optimal annotation method. We use this baseline to report an upper bound of the savings that our framework could theoretically achieve. Third, we use our framework without updates on the QA model Ω. This quantifies the cost-savings achieved by the interactive domain customization during annotation. Finally, we present a randomized baseline where the annotation scheme is decided by a randomized coin-toss.

Implementation Details
The QA model Ω is built as follows. We use a state-of-the-art BERT-based (Devlin et al., 2018) implementation of RankQA. This combines a simple tf-idf-based information retrieval module with BERT as a module for machine comprehension. Both policy models $\pi^D$ and $\pi^S$ are implemented as three-layer feed-forward networks with dropout, ReLU activation, and a single output unit with sigmoid activation in the last layer. For the policy $\pi^D$, we use the information retrieval scores as input. For $\pi^S$, we use the statistical features of answer-span candidates as calculated by RankQA as input. We also experimented with convolutional neural networks directly on top of the last layer of BERT, but without yielding improvements that justified the additional model complexity. We initialize all models with the SQuAD dataset (see our supplements).
Hyperparameter setting: We set the number of candidates that are shown to annotators during a SEM annotation to n = 5. The policy networks decide upon the annotation method based on features of the 2n highest-ranked candidates, i.e., the top-10. The batch size for updates in Alg. 1 is set to 1,000 annotated questions. Details on hyperparameters of our QA and policy models are provided in the supplements.

Experimental Results
We group our experiments into three parts. First, we focus only on annotating the answer span for given question-document pairs, as this is the more challenging task. Second, we carry out a sensitivity analysis in order to demonstrate how our framework adapts to different costs of SEM annotations and to show that we never exceed the cost of traditional annotation. Third, we evaluate our framework based on the annotation of a full dataset, including both answer span and document annotations, in order to quantify savings in practice by using our framework.

Performance on Answer Span Annotations
The annotation framework was used to annotate 45 batches of question-document pairs with the corresponding answer spans. The annotation costs are set to one price-unit for each MAN annotation and one third of the unit for each SEM annotation. (In the next section, we carry out an extensive sensitivity analysis where the ratio for annotation costs between MAN and SEM is varied.) In Fig. 2 (left), we plot the average annotation costs in every batch with a dashed line, together with a running mean depicted as a solid line. Compared to conventional, manual annotation, our framework successfully reduces annotation cost by around 15% after only 20 batches. We further compare it with an oracle policy that always picks the best annotation method. The latter provides a hypothetical upper bound according to which approximately 40-45% of annotation cost could be saved. Finally, we show the performance of our framework without updates of the QA model Ω.
Here we can see that its improvement over time is lower, as the framework is not capable of adapting to the question style and domain used during annotation. In sum, our framework is highly effective in reducing annotation cost. Fig. 2 (right) shows how many samples we could annotate (y-axis) with a restricted budget (x-axis). For instance, assume we have a budget of 40k price units available for annotation. Conventional, manual annotation would result in exactly 40k annotated samples as we fixed the cost for each MAN annotation to one unit. With the same budget, our annotation framework with semi-supervised annotations succeeds in annotating an additional ∼9,000 samples.

Cost-Performance Sensitivity Analysis
The advantage of our framework over manual annotations depends on the cost ratio between the SEM and MAN schemes. In order to determine this, we identify the cost range in which our framework is profitable as a function of SEM annotation costs. We study this via the following experiment: we repeatedly annotate 40k samples and keep the MAN annotation costs fixed to one price unit, while we increase the costs of SEM annotations from 0.05 to 0.95 in increments of 0.05. Finally, we measure the average annotation costs for a single sample; see Fig. 3 (left). Fig. 3 (left) demonstrates that our framework effectively lowers annotation costs when the price for SEM annotations drops below 0.6 as compared to manual annotations, which are fixed to one price unit. Most notably, even when SEM annotations become expensive and almost equal the costs of MAN annotations, the average annotation costs do not exceed those of strictly manual annotation. This can be attributed to our cost-sensitive decision threshold, which does not require exploration as in reinforcement learning, but directly sets the threshold in Eq. 6 sufficiently high.
In Fig. 3 (right), we again show the number of samples that were annotated with a restricted budget of 40k price units. We marked the absolute gain in number of samples over traditional annotation in the plot. The benefit of our framework becomes evident once again when the ratio of SEM annotation costs to MAN annotation costs falls below 0.6.
To summarize, our framework is highly cost-effective: it reduces overall annotation costs or, alternatively, increases the number of annotated samples under a restricted budget if annotation costs of SEM are approximately half those of MAN. If the costs are less than half those of MAN annotation, the benefits are especially pronounced. Even if this assumption does not hold, our framework never exceeds the costs of manual annotation and never results in fewer annotated samples.
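Why a threshold policy should not exceed the manual cost can be illustrated with a short expected-cost calculation. This is not the simulation from the experiments: it assumes perfectly calibrated SEM success probabilities, a threshold equal to the ratio of SEM to MAN cost, and a hypothetical list of probabilities supplied by the caller.

```python
def average_cost(success_probs, c_man, c_sem):
    """Expected per-sample cost of a threshold policy with beta = c_sem / c_man.

    success_probs: assumed (perfectly calibrated) probabilities that SEM
    succeeds, one per sample. These are hypothetical inputs.
    """
    beta = c_sem / c_man
    total = 0.0
    for p in success_probs:
        if p >= beta:
            # SEM is tried; with probability (1 - p) MAN has to follow.
            total += c_sem + (1.0 - p) * c_man
        else:
            total += c_man
    return total / len(success_probs)

def sweep(success_probs, c_man=1.0):
    """Average cost for SEM costs from 0.05 to 0.95 in steps of 0.05."""
    grid = [round(0.05 * k, 2) for k in range(1, 20)]
    return {c: average_cost(success_probs, c_man, c) for c in grid}
```

Whenever SEM is chosen, `p >= c_sem / c_man` implies `c_sem + (1 - p) * c_man <= c_man`, so under these assumptions every entry returned by `sweep` is at most the manual cost.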

Performance on Full Dataset Annotation
In the last experiment, we simulate a complete annotation of the NaturalQuestions dataset, including annotations at both document level and answer span level. By annotating a complete dataset, we want to quantify the savings of our framework in practice. We again set the cost of each MAN annotation to one price unit and repeated the experiment three times by setting the SEM annotation cost ($c_1$) to one quarter, one third, and one half of the price unit. The results are shown in Tbl. 2. Depending on the relative cost ratio $c_1$, we are able to save between 4.1% and 21.1% of the overall annotation cost. This amounts to a total of 8,000 to 40,000 price units.
Table 2: Overall cost (×10^3 price units) for annotating the NaturalQuestions dataset using our framework vs. conventional manual annotation for different SEM costs ($c_1$). Columns: document level, answer span level, and overall, each for $c_1$ = 1/4, 1/3, and 1/2. Improvements are shown in parentheses.

Discussion and Future Work
We assume for the purposes of this study that questions have an answer span contained in a single document and leave an extension to multi-hop questions and unanswerable questions to future research. The robustness of our framework is demonstrated in an extensive set of simulations and experiments. We deliberately choose to leave experiments including real human annotators to future research for the following reason. Outcomes of such an experiment would be sensitive to the design of the user interface as well as the study design itself. In this paper, we want to put the emphasis on the methodological innovation of our framework and the novel annotation scheme itself. On the other hand, experiments involving real users would provide valuable insights concerning the annotation costs and the quality of a dataset annotated with our method. Furthermore, it would be worth investigating how inter-annotator agreement or potential human biases manifest in traditional datasets as compared to those generated with our framework.
questions accurately while limiting costs as much as possible. We show that our framework never incurs higher costs than traditional manual annotation. On the contrary, it achieves substantial savings. For example, it reduces the overall costs by about 4.1% when SEM annotations cost about half of MAN annotations. When that ratio is lowered to one fourth, our framework can reduce the total costs by up to 21.1%. We think that our framework could contribute to more accessible annotation of datasets in the future and possibly even be extended to other fields and applications in natural language processing.

C Details on the Policy Model
The policy models $\pi^D$ and $\pi^S$ are implemented as feed-forward networks composed of a dense layer with k output units and ReLU activation, a dropout layer with dropout probability z, a second dense layer with k/2 outputs and ReLU activation, a dropout layer with dropout probability z, and a dense layer with a single output and sigmoid activation.
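The layer stack described above can be sketched as a forward pass in plain Python. This is only an illustration with assumptions: the weights are randomly initialized rather than trained, the uniform initialization is an arbitrary choice, and the class name `PolicyNet` as well as the values passed for `n_features`, `k`, and `z` are hypothetical.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(x):
    # Numerically stable form for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dropout(values, z, rng, training):
    """Inverted dropout with drop probability z; identity at inference time."""
    if not training or z == 0.0:
        return list(values)
    keep = 1.0 - z
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def init_layer(n_in, n_out, rng):
    # Assumed initialization: uniform in [-1/sqrt(n_in), 1/sqrt(n_in)], zero biases.
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

class PolicyNet:
    """Dense(k) + ReLU, dropout(z), Dense(k/2) + ReLU, dropout(z), Dense(1) + sigmoid."""

    def __init__(self, n_features, k, z, seed=0):
        self.rng = random.Random(seed)
        self.z = z
        self.l1 = init_layer(n_features, k, self.rng)
        self.l2 = init_layer(k, k // 2, self.rng)
        self.l3 = init_layer(k // 2, 1, self.rng)

    def forward(self, x, training=False):
        h = dropout(relu(dense(x, *self.l1)), self.z, self.rng, training)
        h = dropout(relu(dense(h, *self.l2)), self.z, self.rng, training)
        return sigmoid(dense(h, *self.l3)[0])
```

For instance, `PolicyNet(n_features=10, k=8, z=0.2).forward([0.1] * 10)` returns a single value in (0, 1) that can be read as the predicted probability of the SEM scheme.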
Initialization: for the first batch of annotations we initialize the policy models on SQuAD. After the first batch is annotated we only use the new data for policy updates.
Hyperparameter search: We tune hyperparameters on the SQuAD dataset using grid search with the values displayed in Tab. 3 and Tab. 4. Bold values mark final choices. We annotated the first 10 batches of SQuAD and chose the hyperparameters that had the lowest annotation cost. No hyperparameter tuning or architecture search was performed on the NaturalQuestions dataset, on which our experiments are based.

D Estimation of Real Annotation Costs
In order to provide additional insights on the actual annotation costs involving real users, we conducted a pre-test on Amazon Mechanical Turk (MTurk). For this, we showed a textual explanation of the MAN and SEM annotation schemes to workers and provided them with mockups for both inputs (answer-span annotation). Next, we asked 40 workers to report how much money they think would be a fair compensation for each of the tasks on a scale of one to ten. Workers reported on average a compensation of $5.9 for MAN annotations and $3.2 for SEM annotations. This ratio (3.2/5.9 ≈ 0.54) falls into the range where we make profits using our framework.

E System
All experiments were conducted with an NVIDIA Titan Xp GPU on a server with 192 GB DDR4 RAM and two 10-core Intel Xeon Silver 4210 2.2 GHz processors.