Controlled Crowdsourcing for High-Quality QA-SRL Annotation

Question-answer driven Semantic Role Labeling (QA-SRL) was proposed as an attractive open and natural flavour of SRL, potentially attainable from laymen. Recently, a large-scale crowdsourced QA-SRL corpus and a trained parser were released. Trying to replicate the QA-SRL annotation for new texts, we found that the resulting annotations were lacking in quality, particularly in coverage, making them insufficient for further research and evaluation. In this paper, we present an improved crowdsourcing protocol for complex semantic annotation, involving worker selection and training, and a data consolidation phase. Applying this protocol to QA-SRL yielded high-quality annotation with drastically higher coverage, producing a new gold evaluation dataset. We believe that our annotation protocol and gold standard will facilitate future replicable research of natural semantic annotations.


Introduction
Semantic Role Labeling (SRL) provides explicit annotation of predicate-argument relations. Common SRL schemes, particularly PropBank (Palmer et al., 2005) and FrameNet (Baker et al., 1998), rely on predefined role inventories and extensive predicate lexicons. Consequently, SRL annotation of new texts requires substantial efforts involving expert annotation, and possibly lexicon extension, limiting scalability.
Aiming to address these limitations, Question-Answer driven Semantic Role Labeling (QA-SRL) (He et al., 2015) labels each predicate-argument relationship with a question-answer pair, where natural language questions represent semantic roles, and answers correspond to arguments (see Table 1). This approach follows the colloquial perception of semantic roles as answering questions about the predicate ("Who did What to Whom, When, Where and How", with, e.g., "Who" corresponding to the agent role).
QA-SRL carries two attractive promises. First, using a question-answer format makes the annotation task intuitive and easily attainable by laymen, as it does not depend on linguistic resources (e.g. role lexicons), thus facilitating greater annotation scalability. Second, by relying on intuitive human comprehension, these annotations elicit a richer argument set, including valuable implicit semantic arguments not manifested in syntactic structure (highlighted in Table 1). The importance of implicit arguments has been recognized in the literature (Cheng and Erk, 2018; Do et al., 2017; Gerber and Chai, 2012), yet they are mostly overlooked by common SRL formalisms and tools.
Overall, QA-SRL largely subsumes predicate-argument information captured by traditional SRL schemes, which were shown to be beneficial for complex downstream tasks, such as dialog modeling (Chen et al., 2013), machine comprehension (Wang et al., 2015) and cross-document coreference (Barhom et al., 2019). At the same time, it contains richer information, and is easier to understand and collect. Similarly to SRL, one can utilize QA-SRL both as a source of semantic supervision, in order to achieve better implicit neural NLU models, as done recently by He et al. (2020), as well as an explicit semantic structure for downstream use, e.g. for producing Open Information Extraction propositions (Stanovsky and Dagan, 2016).
Table 1
Around 47 people could be arrested, including the councillor.
(1) Who might be arrested? 47 people | the councillor
Perry called for the DA's resignation, and when she did not resign, cut funding to a program she ran.
(2) Why was something cut by someone? she did not resign
(3) Who cut something? Perry
Previous attempts to annotate QA-SRL initially involved trained annotators (He et al., 2015) but later resorted to crowdsourcing (Fitzgerald et al., 2018) for scalability. Naturally, employing crowd workers is challenging when annotating fairly demanding structures like SRL. As Fitzgerald et al. (2018) acknowledge, the main shortcoming of their large-scale dataset is limited recall, which we estimate to be in the lower 70s (see §4). Unfortunately, such low recall in gold standard datasets hinders proper research and evaluation, undermining the current viability of the QA-SRL paradigm.
Aiming to enable future QA-SRL research, we present a generic controlled crowdsourcing annotation protocol and apply it to QA-SRL. Our process addresses worker quality by performing short yet efficient annotator screening and training. To boost coverage, we employ two independent workers per task, while an additional worker resolves inconsistencies, similar to conventional expert annotation. These steps combined yield 25% more roles than Fitzgerald et al. (2018), without sacrificing precision and at a comparable cost per verb. This gain is especially notable for implicit arguments, which we show in a comparison to PropBank (Palmer et al., 2005). Overall, we show that our annotation protocol and dataset are of high quality and coverage, enabling subsequent QA-SRL research.
To foster such research, including easy production of additional QA-SRL datasets, we release our annotation protocol, software and guidelines along with a high-quality dataset for QA-SRL evaluation (dev and test) at https://github.com/plroit/qasrl-gs. We also re-evaluate the existing parser (Fitzgerald et al., 2018) against our test set, setting the baseline for future developments. Finally, we propose that our systematic and replicable controlled crowdsourcing protocol could also be effective for other complex annotation tasks. A previous preprint version of this paper can be found at https://arxiv.org/abs/1911.03243.
Footnote 1: [...] similar embeddings for semantically similar questions. These embeddings may be leveraged downstream in the same way as embeddings of traditional categorical semantic roles.

Background: QA-SRL
Specifications In QA-SRL, a role question adheres to a 7-slot template, with slots corresponding to a WH-word, the verb, auxiliaries, argument placeholders (SUBJ, OBJ), and prepositions, where some slots are optional (He et al., 2015), as exemplified in Table 2. Such a question captures its corresponding semantic role with a natural, easily understood expression. All answers to the question are then considered as the set of arguments associated with that role, capturing both traditional explicit arguments and implicit ones.
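To make the slot template concrete, the following is a minimal sketch of a question as a seven-field record. The class name, the field names, the slot order and in particular the seventh `misc` slot are assumptions made for illustration; the text above only names the WH-word, verb, auxiliary, SUBJ, OBJ and preposition slots. The three questions used below are the examples that appear earlier in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QasrlQuestion:
    """Hypothetical seven-slot QA-SRL question; optional slots may be None."""
    wh: str                      # WH-word, e.g. "Who"
    aux: Optional[str]           # auxiliary, e.g. "might"
    subj: Optional[str]          # subject placeholder, e.g. "something"
    verb: str                    # verb form, e.g. "be arrested"
    obj: Optional[str]           # object placeholder, e.g. "something"
    prep: Optional[str]          # preposition, e.g. "by"
    misc: Optional[str] = None   # assumed name for the remaining slot

    def render(self) -> str:
        """Join the filled slots, in order, into a surface question."""
        slots = [self.wh, self.aux, self.subj, self.verb,
                 self.obj, self.prep, self.misc]
        return " ".join(s for s in slots if s) + "?"


q1 = QasrlQuestion("Who", "might", None, "be arrested", None, None)
q2 = QasrlQuestion("Why", "was", "something", "cut", None, "by", "someone")
q3 = QasrlQuestion("Who", None, None, "cut", "something", None)
```

Rendering `q1`, `q2` and `q3` gives back "Who might be arrested?", "Why was something cut by someone?" and "Who cut something?" respectively, which shows how optional slots are simply skipped.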
Corpora The original 2015 QA-SRL dataset (He et al., 2015) was annotated by hired non-expert workers after completing a short training procedure. They annotated 7.8K verbs, reporting an average of 2.4 QA pairs per verb. Even though multiple annotators were shown to produce greater coverage, their released dataset was produced by a single annotator per verb. In subsequent work, Fitzgerald et al. (2018) employed untrained crowd workers to construct a large-scale corpus (2018) and used it to train a parser. In their protocol, a single worker ("generator") annotated a set of questions along with their answers. Two additional workers ("validators") validated each question and, in the valid case, independently annotated their own answers. In total, 133K verbs were annotated with 2.0 QA pairs per verb on average.
In a subset of the corpus (10%) reserved for parser evaluation, verbs were densely validated by 5 workers (termed the Dense set). Yet, adding validators accounts only for precision errors in question annotation, while role coverage solely relies upon the output of the single generator. For this reason, both the 2015 and 2018 datasets struggle with coverage.
Also, while traditional SRL annotations contain a single authoritative and non-redundant annotation (i.e., a single role and span for each argument), the 2018 dataset provides raw annotations from all annotators. These include many redundant overlapping argument spans, without settling on consolidation procedures to provide a single gold reference, which complicates models' evaluation.
These limitations of the current QA-SRL datasets impede their utility for future research and evaluation. Next, we describe our method for creating a viable high quality QA-SRL dataset.

Controlled Crowdsourcing Methodology
Screening and Training We first release a preliminary crowd-wide annotation round, and then contact workers who exhibit reasonable performance. They are asked to review our short guidelines (publicly available in our repository), which highlight a few subtle aspects, and then annotate two qualification rounds, of 15 predicates each. Each round is followed by extensive feedback via email, pointing at errors and missed arguments, identified by automatic comparison to expert annotation. Total worker effort for the training phase is about 2 hours, and is fully compensated, while requiring about half an hour of an in-house trainer's time per participating worker. We trained 30 participants, eventually selecting 11 well-performing ones.
Annotation We reuse and extend the annotation machinery of Fitzgerald et al. (2018) over Amazon's Mechanical Turk. First, two workers independently generate questions about a verb, and highlight answer spans in the sentence. Then, a third worker reviews and consolidates their annotations based on targeted guidelines, producing the gold standard data. At this step, the worker validates questions, merges, splits or modifies answers for the same role, and removes redundant questions. Notice that while the validator from Fitzgerald et al. (2018) viewed only the questions of a single generator, our consolidator views two full QA sets, promoting higher coverage. Table 3 depicts examples from the consolidation task. We monitor the annotation process by sampling (1%) and reviewing.

Evaluation Metrics
Evaluation in QA-SRL involves, for each verb, aligning its predicted argument spans to a reference set of arguments, and evaluating question equivalence, i.e., whether predicted and gold questions for aligned spans correspond to the same semantic role. Since detecting question equivalence is still an open challenge, we propose both unlabeled and labeled evaluation metrics. The described procedure is used to evaluate both the crowd-workers' annotations (§4) and the QA-SRL parser (§5).
Unlabeled Argument Detection (UA) Inspired by the method presented in Fitzgerald et al. (2018), argument spans are matched using a token-based matching criterion of intersection over union (IOU) ≥ 0.5. To credit each argument only once, we employ maximal bipartite matching between the two sets of arguments, drawing an edge for each pair that passes the above-mentioned criterion. The resulting maximal matching determines the true-positive set, while remaining non-aligned arguments become false positives or false negatives.
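The UA procedure can be sketched as follows. This is an illustration under stated assumptions, not the released evaluation code: spans are assumed to be (start, end) token offsets with an exclusive end, the function names are hypothetical, and the matching uses a simple augmenting-path search (Kuhn's algorithm), which may differ from the implementation used for the reported numbers.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # assumed convention: token offsets, end exclusive


def iou(a: Span, b: Span) -> float:
    """Token-level intersection over union of two spans."""
    tokens_a, tokens_b = set(range(*a)), set(range(*b))
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def match_arguments(predicted: List[Span], gold: List[Span],
                    threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Maximum-cardinality matching over pairs with IOU >= threshold."""
    # One edge per (predicted, gold) pair that passes the IOU criterion.
    edges = [[j for j, g in enumerate(gold) if iou(p, g) >= threshold]
             for p in predicted]
    gold_to_pred: Dict[int, int] = {}

    def try_assign(i: int, visited: Set[int]) -> bool:
        # Look for an augmenting path starting at predicted span i.
        for j in edges[i]:
            if j in visited:
                continue
            visited.add(j)
            if j not in gold_to_pred or try_assign(gold_to_pred[j], visited):
                gold_to_pred[j] = i
                return True
        return False

    for i in range(len(predicted)):
        try_assign(i, set())
    return sorted((i, j) for j, i in gold_to_pred.items())


def unlabeled_scores(predicted: List[Span],
                     gold: List[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F1; each argument is credited at most once."""
    tp = len(match_arguments(predicted, gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For instance, with predicted spans `[(0, 2), (5, 8)]` and gold spans `[(0, 2), (5, 7), (10, 12)]`, both predictions are aligned (the second pair has IOU 2/3), while the third gold span remains unaligned and counts as a false negative.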
Labeled Argument Detection (LA) All aligned arguments from the previous step are inspected for label equivalence, similar to the joint evaluation reported in Fitzgerald et al. (2018). There may be many correct questions for a role. For example, What was given to someone? and What has been given by someone? both refer to the same semantic role but diverge in grammatical tense and argument placeholders. Aiming to avoid judging non-equivalent roles as equivalent, we propose STRICT-MATCH to be an equivalence on the following template slots: WH, SUBJ, OBJ, as well as on negation, voice, and modality extracted from the question. Final reported numbers on labeled argument detection rates are based on bipartite aligned arguments passing STRICT-MATCH. As this matching criterion significantly underestimates question equivalence, we later manually assess the actual rate of correct role equivalences.
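A minimal sketch of the STRICT-MATCH comparison follows. The record type, its field names and the way features are encoded are assumptions for illustration; how negation, voice and modality are actually extracted from a question is not specified here, so the features are supplied by hand. The example questions in the usage lines are constructed for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuestionFeatures:
    """Hypothetical per-question features; only some take part in the match."""
    wh: str
    subj: Optional[str]
    obj: Optional[str]
    is_negated: bool
    is_passive: bool
    modality: Optional[str]
    tense: Optional[str] = None  # deliberately ignored by strict_match


def strict_match(a: QuestionFeatures, b: QuestionFeatures) -> bool:
    """Equivalence on WH, SUBJ, OBJ, negation, voice and modality only."""
    def key(q: QuestionFeatures):
        return (q.wh.lower(), q.subj, q.obj,
                q.is_negated, q.is_passive, q.modality)
    return key(a) == key(b)


# Same compared features, different tense: accepted.
past = QuestionFeatures("What", None, None, False, True, None, "past")
perfect = QuestionFeatures("What", None, None, False, True, None, "perfect")

# "Who cut something?" vs "Who was something cut by?": arguably the same
# role, but voice and placeholders differ, so the criterion rejects the pair.
active = QuestionFeatures("Who", None, "something", False, False, None)
passive = QuestionFeatures("Who", "something", None, False, True, None)
```

The second pair illustrates why an exact comparison of this kind can underestimate question equivalence: two phrasings of one role are rejected merely because they differ in voice.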
Evaluating Redundant Annotations We extend our metric for evaluating manual or automatic redundant annotations, exhibited in the Dense dataset (§2) as well as the output of the Fitzgerald et al. (2018) parser (§5). To that end, we ignore redundant true-positives, and collapse false-positive errors (see Appendix for details).

Dataset Quality Analysis
Inter-Annotator Agreement (IAA) To estimate dataset consistency across different annotations, we measure F1 using our UA metric. Ten individual worker-vs-worker experiments yield 79.8 F1 agreement over 150 predicates, indicating high consistency across our annotators, in line with agreement rates in other structured semantic annotations, e.g. Abend and Rappoport (2013). Overall consistency of the dataset is assessed by measuring agreement between different consolidated annotations, obtained by disjoint triplets of workers, which achieves F1 of 84.1, averaged over 4 experiments, 35 predicates each. Notably, consolidation boosts agreement, indicating its necessity. For LA agreement, averaged F1 was 67.8; however, it is likely that the drop from UA is mainly due to falsely rejecting semantically equivalent questions under the STRICT-MATCH criterion, given that we found equal LA and UA scores in a manual evaluation of our dataset (see Table 4 below).

Dataset Assessment and Comparison
We assess our gold standard, as well as the recent Dense set, against an integrated expert set of 100 predicates. To construct the expert set, we first merged the annotations from the Dense set with our workers' annotations. Then, three of the authors blindly (i.e., without knowing the origin of each QA pair) selected, corrected and added annotations, resulting in a high-coverage unbiased expert set. We further manually corrected the evaluation decisions, accounting for some automatic evaluation mistakes introduced by the span-matching and question equivalence criteria. As seen in Table 4, our gold set yields comparable precision with drastically higher recall, in line with our 25% higher yield.
Table 4: Automatic and manually-corrected evaluation of our gold standard (This work) and Dense (Fitzgerald et al., 2018) against the integrated expert set.
Examining disagreements between our gold and Dense, we observe that our workers successfully produced more roles, both implicit and explicit. To a lesser extent, they split more arguments into independent answers, as emphasized by our guidelines, an issue that was left under-specified in previous annotation guidelines.
Agreement with PropBank Data It is illuminating to observe the agreement between QA-SRL and PropBank (CoNLL-2009) annotations (Hajič et al., 2009). In Table 5, we replicate the experiments in He et al. (2015, Section 3.4) for both our gold set and theirs, over a sample of 200 sentences from the Wall Street Journal (evaluation is automatic and the metric is similar to our UA). We report macro-averaged (over predicates) precision and recall for all roles, including core and adjuncts, while considering the PropBank data as the reference set. Our recall of PropBank roles is notably high, reconfirming the coverage obtained by our annotation protocol.
The measured precision with respect to PropBank is low for adjuncts, but this is due to the fact that QA-SRL captures many correct implicit arguments, which fall out of PropBank's scope (where arguments are directly syntactically linked to the predicate). To examine this, we analyzed 100 arguments in our dataset not found in PropBank ("false positives"). We found that only 32 were due to wrong or incomplete QA annotations, while most others were valid implicit arguments, stressing QA-SRL's advantage in capturing those inherently. Extrapolating from this analysis estimates our true precision (on all roles) to be about 91%, consistent with the 88% precision in Table 4, while yielding about 15% more valid arguments than PropBank (mostly implicit). Compared with the 2015 dataset, our QA-SRL gold yielded 1593 QA pairs (of which 604 are adjuncts), while theirs yielded 1315 QA pairs (336 adjuncts). Overall, the comparison to PropBank reinforces the quality of our gold dataset and shows its better coverage relative to the 2015 dataset.
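One plausible reading of the extrapolation above is that only the invalid share of the sampled "false positives" (32 of 100) should count against precision. The sketch below encodes that reading; the function name and the formula are assumptions rather than the authors' stated computation, and the measured precision against PropBank is left as a parameter because the Table 5 values are not reproduced here.

```python
def extrapolated_precision(measured_precision: float,
                           invalid: int, sampled: int) -> float:
    """Precision after crediting the valid share of apparent false positives.

    measured_precision: precision against the reference set (0..1).
    invalid / sampled:  fraction of inspected "false positives" that were
                        genuinely wrong or incomplete.
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    invalid_rate = invalid / sampled
    # Only the genuinely invalid part of the error mass remains an error.
    return 1.0 - (1.0 - measured_precision) * invalid_rate


# Hypothetical input, for illustration only: a measured precision of 0.5
# combined with the 32-of-100 invalid rate from the manual analysis.
example = extrapolated_precision(0.5, invalid=32, sampled=100)
```

With the hypothetical measured precision of 0.5, the corrected estimate is 0.84; substituting the actual measured precision would give the estimate discussed in the text.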

Baseline Parser Evaluation
We evaluate the parser from Fitzgerald et al. (2018) on our dataset, providing a baseline for future work. As mentioned previously, unlike typical SRL systems, the parser outputs overlapping arguments, often with redundant roles (Table 7). Hence, we employ our metric variant for evaluating redundant annotations. Results are reported in Table 6, demonstrating reasonable performance along with substantial room for improvement, especially with respect to coverage. As expected, the parser's recall against our gold is substantially lower than the 84.2 recall reported by Fitzgerald et al. (2018) against Dense, due to the limited recall of Dense relative to our gold set.
Error Analysis Through manual evaluation of 50 sampled predicates, we detect correctly predicted arguments and questions that were rejected by the IOU and STRICT-MATCH criteria. Based on this inspection, out of the 154 gold roles (128 explicit and 26 implicit), the parser misses 23%, covering 82% of the explicit roles but only half of the implicit ones.
Table 6: Automatic parser evaluation against our test set, complemented by automatic and manual evaluations on the Wikinews part of the dev set (manual evaluation is over 50 sampled predicates).
Table 7
What suggests something? Reports
What suggests something? Reports from Minnesota
Where was someone carried? to reclining chairs
What was someone carried to? reclining chairs

Conclusion
Applying our proposed controlled crowdsourcing protocol to QA-SRL successfully attains truly scalable high-quality annotation by laymen, facilitating future research of this paradigm. Exploiting the open nature of the QA-SRL schema, our non-expert annotators produce rich argument sets with many valuable implicit arguments. Indeed, thanks to effective and practical training over the crowdsourcing platform, our workers' annotation quality, and particularly its coverage, are on par with expert annotation. We release our data, software and protocol, enabling easy future dataset production and evaluation for QA-SRL, as well as possible extensions of the QA-based semantic annotation paradigm. Finally, we suggest that our simple yet rigorous controlled crowdsourcing protocol would be effective for other challenging annotation tasks, which often prove to be a hurdle for research projects.