CROWD-IN-THE-LOOP: A Hybrid Approach for Annotating Semantic Roles

Crowdsourcing has proven to be an effective method for generating labeled data for a range of NLP tasks. However, multiple recent attempts to use crowdsourcing to generate gold-labeled training data for semantic role labeling (SRL) reported only modest results, indicating that SRL is perhaps too difficult a task to be effectively crowdsourced. In this paper, we postulate that while producing SRL annotation does require expert involvement in general, a large subset of SRL labeling tasks is in fact appropriate for the crowd. We present a novel workflow in which we employ a classifier to identify difficult annotation tasks and route each task either to experts or to crowd workers according to its difficulty. Our experimental evaluation shows that the proposed approach reduces the workload for experts by roughly two-thirds, and thus significantly reduces the cost of producing SRL annotation at little loss in quality.


Introduction
Semantic role labeling (SRL) is the task of labeling the predicate-argument structures of sentences with semantic frames and their roles (Baker et al., 1998; Palmer et al., 2005). It has been found useful for a wide variety of NLP tasks such as question-answering (Shen and Lapata, 2007), information extraction (Fader et al., 2011) and machine translation (Lo et al., 2013). A major bottleneck impeding the wide adoption of SRL is the need for large amounts of labeled training data to capture broad-coverage semantics. Such data requires trained experts and is highly costly to produce (Hovy et al., 2006).
* The work was done while the author was at IBM Research - Almaden.
Crowdsourcing SRL. Crowdsourcing has proven effective for generating labeled data for a range of NLP tasks (Snow et al., 2008; Hong and Baker, 2011; Franklin et al., 2011). A core advantage of crowdsourcing is that it allows the annotation workload to be scaled out among large numbers of inexpensive crowd workers. Not surprisingly, a number of recent SRL works have also attempted to leverage crowdsourcing to generate labeled training data for SRL and investigated a variety of ways of formulating crowdsourcing tasks (Fossati et al., 2013; Pavlick et al., 2015). All have found that crowd feedback generally suffers from low inter-annotator agreement scores and often produces incorrect labels. These results seem to indicate that, regardless of the design of the task, SRL is simply too difficult to be effectively crowdsourced.
Proposed Approach. We observe that SRL annotation tasks differ significantly in difficulty, depending on factors such as the complexity of a specific sentence or the difficulty of a specific semantic role. We therefore postulate that a subset of annotation tasks is in fact suitable for crowd workers, while others require expert involvement. We also postulate that it is possible to use a classifier to predict whether a specific task is easy enough for crowd workers.
Based on these intuitions, we propose CROWD-IN-THE-LOOP, a hybrid annotation approach that involves both crowd workers and experts: All annotation tasks are passed through a decision function (referred to as TASKROUTER) that classifies them as either crowd-appropriate or expert-required, and are sent to crowd or expert annotators accordingly. Refer to Figure 1 for an illustration of this workflow. We conduct an experimental evaluation that shows (1) that we are able to design a classifier that can distinguish between crowd-appropriate and expert-required tasks at very high accuracy (96%), and (2) that our proposed workflow allows us to pass roughly two-thirds of the annotation workload to crowd workers, thereby significantly reducing the need for costly expert involvement.
Contributions. In detail, our contributions are:
• We propose CROWD-IN-THE-LOOP, a novel approach for creating annotated SRL data with both crowd workers and experts. It reduces overall labeling costs by leveraging the crowd whenever possible, and maintains annotation quality by involving experts whenever necessary.
• We propose TASKROUTER, an annotation task decision function (or classifier) that classifies each annotation task into one of two categories: expert-required or crowd-appropriate. We carefully define the classification task, discuss features and evaluate different classification models.
• We conduct a detailed experimental evaluation of the proposed workflow against several baselines including standard crowdsourcing and other hybrid annotation approaches. We analyze the strengths and weaknesses of each approach and illustrate how expert involvement is required to address errors made by crowd workers.
Outline. This paper is organized as follows: We first conduct a baseline study of crowdsourcing SRL annotation, and analyze the difficulties of relying solely on crowd workers (Section 2). Based on this analysis, we define the classification problem for CROWD-IN-THE-LOOP, discuss the design of our classifier, and evaluate its accuracy (Section 3). We then employ this classifier in the proposed CROWD-IN-THE-LOOP approach and comparatively evaluate it against a number of crowdsourcing and hybrid workflows (Section 4). We discuss related work (Section 5) and conclude the study in Section 6.

Crowdsourcing SRL
We first conduct a baseline study of crowdsourcing SRL. We illustrate how we design and create annotation tasks, how we gather and interpret crowd feedback, and analyze the results of the study to determine the applicability of crowdsourcing for producing SRL annotation.
SRL formalism. In this study, and throughout the paper, we use the PROPBANK formalism of SRL (Palmer et al., 2005), which defines verb-specific frames (BUY.01, BUY.02), frame-specific core roles (A0 to A5), and frame-independent non-core roles (for temporal, location and other contexts).

Annotation Task Design
To design the annotation task, we replicate a setup proposed in previous work in which crowd workers are employed to curate the output of a statistical SRL system. This setup generates annotation tasks as follows:

Sentence
And many fund managers have built up cash levels and say they will be buying stock this week.

Question
What is being bought in this sentence? Is it: "stock"? buy.01

Answer Options
Yes
No, what is being bought is not mentioned
No, what is being bought is mentioned here: copy and paste text

Figure 2: Example annotation task, consisting of a sentence with predicted role labels, a human-readable question regarding one label, and a set of answer options. By answering, crowd workers curate a prediction made by the SRL system.
Step 1. We use a statistical SRL system to predict SRL labels for a set of sentences (see Figure 1). While state-of-the-art SRL will predict many correct labels, some predicted labels will be incorrect, and some labels will be missing. Annotation tasks are therefore designed to detect and correct precision and recall errors.
Step 2. We generate two types of annotation tasks for the study, namely CONFIRMPREDICTION and ADDMISSING tasks: (1) The first, CONFIRMPREDICTION tasks, ask users to confirm, reject or correct each predicted frame or role. This type of task addresses precision issues in the SRL. We present to workers a human-readable question-answer pair (He et al., 2015) for each predicted label, an example of which is illustrated in Figure 2.
(2) The second, ADDMISSING tasks, address potentially missing annotation, i.e. recall issues in the SRL. We generate a question without a suggested answer and ask workers to either confirm that this role does not appear in the sentence, or supply the correct span. We identify potentially missing annotation using PropBank frame definitions; any unseen core role in a sentence is considered potentially missing. We use a manually created mapping of frame roles to questions to generate these tasks. See Table 1 for a mapping of the roles of the BUY.01 frame to questions.
Step 3. Each question is presented to crowd workers together with the sentence and a set of answer options. Example annotation tasks are illustrated in Figures 2 and 3. A task thus is defined as follows: Definition 1 Annotation Task: A task consists of a sentence, a human readable question regarding a predicted label, and a set of answer options.
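As a minimal sketch of how ADDMISSING tasks could be derived from frame definitions, the following Python assumes a hypothetical role-to-question mapping for BUY.01. Only the A1 question ("What is being bought in this sentence?") comes from the example task above; the other questions, the answer-option encoding and the function name `add_missing_tasks` are illustrative assumptions and do not reproduce the mapping in Table 1.

```python
# Hypothetical question mapping for one frame (assumption, not the paper's Table 1).
# Only the A1 question is taken from the example annotation task.
FRAME_QUESTIONS = {
    "buy.01": {
        "A0": "Who is buying something in this sentence?",
        "A1": "What is being bought in this sentence?",
        "A2": "From whom is something being bought in this sentence?",
    }
}

# Hypothetical encoding of the two answer options of an ADDMISSING task.
ADD_MISSING_OPTIONS = ("not mentioned", "mentioned here: <copy and paste text>")


def add_missing_tasks(sentence, frame, predicted_roles):
    """Create one ADDMISSING task per core role of `frame` that has no predicted label."""
    tasks = []
    for role, question in FRAME_QUESTIONS[frame].items():
        if role not in predicted_roles:  # unseen core role: potentially missing
            tasks.append({
                "sentence": sentence,
                "frame": frame,
                "role": role,
                "question": question,
                "options": ADD_MISSING_OPTIONS,
            })
    return tasks


sentence = ("And many fund managers have built up cash levels "
            "and say they will be buying stock this week.")
tasks = add_missing_tasks(sentence, "buy.01", {"A0": "they", "A1": "stock"})
```

In this example the SRL system already predicted A0 and A1, so a single task is generated for the unseen role A2.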
We collect worker responses to these tasks. If the majority of crowd workers agrees on a correction, we remove or correct incorrectly predicted labels for CONFIRMPREDICTION tasks and add new labels for ADDMISSING tasks.
Table 2: Tasks in our crowdsourcing study by ratio of how many workers agreed on an answer. If all five workers agree, the majority answer is correct in 93% of cases. If fewer workers agree, the precision of the majority answer decreases.
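The majority-based curation of a predicted label can be sketched as follows. This is an illustration under stated assumptions: the answer encoding ("yes", "not_mentioned", or a corrected span) and the function names are hypothetical, and keeping the prediction unchanged when no majority exists is our reading of the rule above rather than a documented detail.

```python
from collections import Counter


def majority_answer(answers, min_agree=3):
    """Return the most frequent answer if at least `min_agree` workers gave it, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count >= min_agree else None


def curate_label(predicted_span, answers, min_agree=3):
    """Apply crowd feedback to one predicted label of a CONFIRMPREDICTION task.

    Hypothetical answer encoding: "yes" confirms the prediction,
    "not_mentioned" rejects it, any other string is a corrected span.
    """
    decision = majority_answer(answers, min_agree)
    if decision is None or decision == "yes":
        return predicted_span  # no majority, or prediction confirmed: keep it
    if decision == "not_mentioned":
        return None            # majority rejects the label: remove it
    return decision            # majority supplies a corrected span
```

For instance, `curate_label("stock", ["shares", "shares", "shares", "yes", "yes"])` replaces the predicted span with the corrected one, since three of five workers agree on it.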

Crowdsourcing Study
We conduct a crowdsourcing study consisting of 2,549 annotation tasks, generated by running a state-of-the-art SRL system over 250 randomly selected gold-labeled sentences from the English training dataset in the CoNLL-2009 shared task (Hajič et al., 2009). We generated tasks using our question mappings from the predicted labels. This setup allows us to compare crowd feedback to gold labels and determine how often the crowd provides incorrect answers.
Human Annotators. For crowd annotators, we employ five native speakers of English from UPWORK¹, selected using the following procedure: We required workers to complete a short tutorial², followed by 20 annotation tasks, which we evaluated against the gold data. We used the results to select the best-scoring 5 of 7 applicants. We then asked them to complete the remaining labeling tasks. The study was conducted over a span of three weeks. Crowd workers were paid a fixed sum for the completion of the study, which resulted in a cost of 2 cents per worker per task. In total, workers estimated an average of 9 hours to complete the full task.

Analysis
We gather crowd feedback and compare the majority answer for each task with the gold label. Refer to Table 2 for an overview of results. We make several observations:
The more workers agree, the better the answer. Generally, we note that majority answers tend to be more often correct if more workers agree. Specifically, as Table 2 shows, all 5 annotators agreed in 1,801 out of all 2,549 tasks (71%). Of these tasks, the majority answer was correct in 1,679 cases, and incorrect in 122, yielding a precision of 93% for tasks in full agreement. If only 4 out of 5 agree (i.e. one annotator provided a different answer), the precision drops to 86%. If only three annotators agree on an answer, the precision is even lower, at 67%. Furthermore, we note 34 cases in which there was no majority answer (no agreement by at least 3 workers). We therefore see a direct correlation between agreement scores and the validity of majority answers.
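The figures for tasks in full agreement follow directly from the counts reported above, as this short check illustrates (the variable names are ours):

```python
total_tasks = 2549             # all annotation tasks in the study
full_agreement = 1801          # tasks on which all 5 workers agreed
full_agreement_correct = 1679  # of these, majority answer matches the gold label

share_full_agreement = full_agreement / total_tasks
precision_full_agreement = full_agreement_correct / full_agreement
incorrect_full_agreement = full_agreement - full_agreement_correct

print(f"full agreement: {share_full_agreement:.0%} of tasks")
print(f"precision under full agreement: {precision_full_agreement:.0%}")
print(f"unanimous but incorrect: {incorrect_full_agreement} tasks")
```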
Even if all workers agree, errors are made.
We also note that all 5 crowd workers sometimes unanimously agree on an incorrect annotation, in a total of 122 cases. To illustrate such a case, consider the example in Figure 3: In our study, all 5 workers incorrectly selected yes as answer. However, (perhaps somewhat counterintuitively to non-experts) under the PropBank paradigm it is the "phone representatives" that provide explicit help in this sentence, not "Vanguard."

Sentence
And Vanguard, among other groups, said it was adding more phone representatives today to help investors get through.

Question
Who is helping in this sentence? Is it: "Vanguard"? help.01

Answer Options
Yes
No, who is helping is not mentioned
No, who is helping is mentioned here: copy and paste text

Figure 3: Example annotation task.

Characteristics of difficult annotation tasks. As illustrated in Table 3, we break down annotation tasks by question types and semantic labels to gain a better understanding of which tasks are difficult for the crowd. The first row in the table lists results for CONFIRMPREDICTION tasks. We note that some tasks of this type require above-average expert involvement, such as confirmation questions that pertain to the frame label or higher-numbered roles (A3 and A4). The second row lists results for ADDMISSING tasks. Here, we note that higher-order roles again tend to require expert involvement more often than average³. However, while the breakdown in Table 3 indicates some general trends for the difficulty of annotation tasks, the question type itself does not suffice to determine whether an individual instance requires expert involvement or not.
Summary. Our crowdsourcing study supports the initial hypothesis that a portion of SRL tasks is in fact appropriate for crowd workers, but also shows that identifying such tasks is not straightforward, since neither crowd agreement scores nor the annotation task type is a sufficient indicator of difficult tasks. We investigate this problem further in the next section.

TASKROUTER: Annotation Task Classification
Our study shows that some annotation tasks are appropriate for crowd workers, while others are not. In this section, we define a classification problem in which we wish to classify each task into one of the two following classes:
Definition 2 Crowd-appropriate: A task for which (1) all crowd workers agree on the answer, and (2) the agreed-upon answer is correct.
Definition 3 Expert-required: A task that is not crowd-appropriate.
According to these definitions, our crowdsourcing study found that the task in Figure 2 is crowd-appropriate, i.e. easy enough for the crowd to provide correct and consistent answers, while the task in Figure 3 is considered expert-required.
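Definitions 2 and 3 translate into a simple labeling rule over the crowd answers and the gold answer of a task. The sketch below is illustrative; the function name and the string encoding of answers are assumptions.

```python
def label_task(crowd_answers, gold_answer):
    """Label a task as crowd-appropriate or expert-required (Definitions 2 and 3)."""
    unanimous = len(set(crowd_answers)) == 1  # (1) all crowd workers agree
    if unanimous and crowd_answers[0] == gold_answer:  # (2) agreed answer is correct
        return "crowd-appropriate"
    return "expert-required"  # everything else, including unanimous but wrong answers
```

Under this rule, a task such as the one in Figure 3, where all five workers answer "yes" but the gold answer differs, is labeled expert-required.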

Features
To solve the task classification problem, we note two groups of distinct features (see Table 4):
Task-level features $X_g$ capture the general difficulty of a labeling task, as defined by a frame or role type. The intuition here is that certain frames/roles are inherently difficult for non-experts, and that annotation tasks related to such frames/roles should be handled by experts. In the BUY.01 frame for instance, buyer (A0) is a simple crowd-appropriate semantic concept, while benefactive (A4) generally produces lower agreement scores. Task-level features therefore include the frame and role labels themselves, as well as the complexity of each question, measured in features such as the question word (what, how, when etc.), its length measured in number of tokens, and all tokens, lemmas and POS-tags in the question.
Sentence-level features $X_l$ capture complexity associated with the specific task instance. The intuition is that some sentences are more complex and more difficult to understand than others. In such sentences, even roles with generally crowd-appropriate definitions might be incorrectly answered by non-experts. We capture the complexity of a sentence with features such as its length (number of tokens in sentence), the numbers of frames, roles, verbs, and nouns in the sentence, as well as all tokens, lemmas and POS-tags.

Classification Model
In addition to task- and sentence-level features, we present a classifier that also models the interplay between multiple annotation tasks generated from the same sentence. The intuition here is that there is an interdependence between labeling decisions in the same sentence. For instance, the presence of a difficult role may alter the interpretation of a sentence and make other labeling decisions more complicated. We thus propose a fuzzy classification model with two layers (Ishibuchi et al., 1995) of SVM classifiers (Wang et al., 2016), which introduces the context of the task using fuzzy indicators to model the interplay between the two groups of features. Specifically, we train a local-layer SVM classifier $L_l$ using the sentence-level features $X_l$ (computed from sentences). We also train a global-layer SVM classifier $L_g$ using the task-level features $X_g$ (computed from tasks). We refer to the predictions of the local and global classifiers as fuzzy indicators, and we incorporate them as additional features to the fuzzy two-layer SVM classifier $L_f$ as follows. Given task $a_i$ among all tasks $a_1$ to $a_n$ for a sentence $s$, the first layer of the fuzzy classifier consists of applying the local-layer classifier using the sentence-level features of $s$. The second layer of the fuzzy classifier consists of applying the global-layer classifier $n$ times, each time using task-level features for task $a_j$, $1 \le j \le n$, resulting in $n + 1$ values: one local-layer indicator, and $n$ global-layer indicators. Our final fuzzy classifier model uses the $n + 1$ local and global indicators as features, in addition to the sentence- and task-level features of $a_i$.
Note that the classification of task $a_i$ considers features from other tasks $a_j$ from the same sentence, but more efficiently than placing all task-level features of all tasks into a single feature vector. Formally, in the objective function of the fuzzy two-layer SVM classification model $L_f$, $K(X_f^T X_f)$ is the fuzzy two-layer RBF kernel function, $X_f = [\ldots, (Y_g^1)^T, \cdots, (Y_g^j)^T, \cdots, (Y_g^n)^T, Y_l^T]$ is the fuzzy two-layer feature matrix, $n$ is the number of annotation tasks generated from a sentence, $Y_g^j$ represents the $j$-th fuzzy indicator generated by the $j$-th global classifier $L_g^j$, $Y_l$ is the fuzzy indicator generated by the local classifier $L_l$, $Y$ is the label matrix, $\mathbf{1}$ is a vector of all ones and $C$ is a positive trade-off parameter.
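To make the construction of the final feature vector concrete, the following sketch assembles the input of the fuzzy classifier for one task $a_i$. It is an illustration only: the local and global classifiers are passed in as plain callables standing in for trained SVMs, and the zero padding to a fixed number of tasks (`max_tasks`), the ordering of the blocks and the function name are assumptions made here so that every vector has the same length.

```python
def fuzzy_feature_vector(i, sentence_features, task_features,
                         local_clf, global_clf, max_tasks):
    """Build the feature vector of task a_i for the final (fuzzy) classifier.

    sentence_features: list of floats for sentence s (X_l)
    task_features:     list of n feature lists, one per task a_1..a_n (X_g)
    local_clf:         callable, feature list -> local fuzzy indicator
    global_clf:        callable, feature list -> global fuzzy indicator
    max_tasks:         padding length (assumption) for the n global indicators
    """
    n = len(task_features)
    if n > max_tasks:
        raise ValueError("sentence has more tasks than max_tasks")
    # First layer: one local indicator from the sentence-level features.
    local_indicator = local_clf(sentence_features)
    # Second layer: one global indicator per task generated from the sentence.
    global_indicators = [global_clf(f) for f in task_features]
    padding = [0.0] * (max_tasks - n)
    # Features of a_i itself, followed by the n + 1 fuzzy indicators.
    return (list(sentence_features) + list(task_features[i])
            + global_indicators + padding + [local_indicator])


# Toy stand-ins for trained classifiers (hypothetical, for illustration).
vec = fuzzy_feature_vector(
    0,
    [10.0, 2.0],
    [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]],
    local_clf=lambda f: 0.9,
    global_clf=lambda f: sum(f) / len(f),
    max_tasks=4,
)
```

The vector for $a_i$ grows by only one value per additional task in the sentence, rather than by a full block of task-level features.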

Evaluation
To evaluate the accuracy of TASKROUTER we use the standard measure of accuracy for binary classifiers. As Table 5 shows, we evaluate four setups in which we train an SVM with (1) task-level features, (2) sentence-level features, (3) all features, and (4) our proposed fuzzy two-layer classifier.
Data. We use the dataset created in our crowdsourcing study (see Section 2.2), which consists of 2,549 annotation tasks labeled as either expert-required or crowd-appropriate according to our definitions and the results of the study. We leverage five-fold cross validation to train the classifiers over a training split (80%).
Results. The cross validation results are listed in Table 5. Our proposed classifier outperforms all baselines and reaches a classification accuracy of 0.96. Interestingly, we also note that task-level features are significantly more important than sentence-level features, as the setup SVM_task outperforms SVM_sentence by 6 accuracy points. Furthermore, our proposed approach outperforms SVM_task+sentence, indicating a positive impact of modeling the global interplay of annotation tasks. These experiments confirm our initial postulation that it is possible to train a high-quality classifier to detect expert-required tasks. We refer to the best performing setup as TASKROUTER.

CROWD-IN-THE-LOOP Study
Having created TASKROUTER, we now execute our proposed CROWD-IN-THE-LOOP workflow and comparatively evaluate it against a number of crowdsourcing and hybrid approaches. We wish to determine (1) to what degree having the crowd in the loop reduces the workload of experts, and (2) how the quality of the produced annotated data compares to purely crowdsourced or expert annotations.

Approaches
We evaluate the following approaches:
1. Baseline without curation. The first is a simple baseline in which we use the output of SRL as-is, i.e. with no additional curation either by the crowd or experts. We list this method to show the quality of the starting point for the curation workload.
2. CROWD (Crowdsourcing). The second baseline is a standard crowdsourcing approach as described in Section 2, i.e. without experts. We send all annotation tasks (100%) to the crowd and gather crowd feedback to correct labels in three different settings. We correct all labels based on majority vote, i.e., if at least 3 (CROWD_min3), 4 (CROWD_min4) or all 5 (CROWD_all5) out of 5 annotators agree on an answer.
3. HYBRID (Crowdsourcing + Expert curation). In this setting, we replicate the approach proposed in previous work: After first executing crowdsourcing (i.e. sending 100% of the tasks to the crowd), we identify all tasks in which crowd workers provided conflicting answers. These tasks are sent to experts for additional curation (expert answers are used for curation instead of the crowd response). We use three definitions of what constitutes a conflicting answer: (1) We consider all tasks in which at least a majority (3 out of 5) agreed as crowd-appropriate and send the rest (2.2%) to experts. We refer to this setup as HYBRID_min3. (2) Only tasks where at least 4 out of 5 agreed are crowd-appropriate; the remaining 9.9% go to experts (HYBRID_min4). (3) Any task in which there is no unanimous agreement (27.3%) is deemed expert-required (HYBRID_all5).
4. CROWD-IN-THE-LOOP.
This setup is the proposed approach in which we use TASKROUTER trained over a holdout training set to split annotation tasks into crowd and expert groups. In our experiments, TASKROUTER determines the following partitions: 66.4% of tasks to the crowd, the remaining 33.6% to experts. To give an indication of the lower bound of the approach given these partitions, we list results for two settings: (1) CROWD-IN-THE-LOOP_Random, a lower bound setting in which we randomly split into these partitions.
(2) CROWD-IN-THE-LOOP_TaskRouter, the proposed setting in which we use TASKROUTER to perform this split.
Refer to Table 6 for an overview of these experiments. The WORKLOAD columns indicate what percentage of tasks is sent to crowd and experts.
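The difference between the compared workflows lies in who ends up answering each task. The sketch below contrasts the three resolution strategies under simplifying assumptions: the expert answer is available as a precomputed value (as in a simulation from gold labels), crowd-routed tasks in CROWD-IN-THE-LOOP are resolved by their most frequent answer, and all function names as well as the `router` callable are hypothetical.

```python
import random
from collections import Counter


def top_answer(crowd_answers):
    """Most frequent crowd answer and the number of workers who gave it."""
    return Counter(crowd_answers).most_common(1)[0]


def resolve_crowd(crowd_answers, min_agree):
    """CROWD_min-k: accept the crowd answer only with at least `min_agree` votes."""
    answer, count = top_answer(crowd_answers)
    return answer if count >= min_agree else None


def resolve_hybrid(crowd_answers, expert_answer, min_agree):
    """HYBRID_min-k: every task goes to the crowd; low-agreement tasks also go to an expert."""
    answer, count = top_answer(crowd_answers)
    return answer if count >= min_agree else expert_answer


def resolve_crowd_in_the_loop(task, crowd_answers, expert_answer, router):
    """CROWD-IN-THE-LOOP: the router decides up front who answers the task."""
    if router(task) == "expert-required":
        return expert_answer
    return top_answer(crowd_answers)[0]


def random_router(expert_share=0.336, seed=0):
    """Lower-bound router: sends a random `expert_share` of tasks to experts."""
    rng = random.Random(seed)
    return lambda task: ("expert-required" if rng.random() < expert_share
                         else "crowd-appropriate")
```

Note that in the HYBRID settings the routing decision depends on crowd answers that have already been collected, whereas a router such as TASKROUTER decides before any crowd work is spent on the task.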

Experimental Setup
Data. We use the dataset created in the crowdsourcing study in Section 2, consisting of 2,549 annotation tasks labeled either as expert-required or crowd-appropriate⁴. As shown in Section 3.3, we use 80% of the dataset to train TASKROUTER under cross validation, and conduct the comparative evaluation using the remaining 20%.

Human annotators & curation
We simulate an expert annotator using the CoNLL-2009 gold SRL labels and reuse the crowd answers from the study for crowd annotators. For each setting, we gather crowd and expert answers to the annotation tasks, and interpret the answers to curate the SRL labels that were produced by the statistical SRL system. After curation, we evaluate the resulting labeled sentences against gold-labeled data to determine the annotation quality in terms of precision, recall and F1-score.
Evaluation Metrics. In addition to the quality of the resulting annotations, we are interested in evaluating how effectively we integrate the crowd. We measure this with two metrics.
(1) One is the percentage of tasks that go to the crowd and to experts respectively. Note that in the HYBRID setup, some tasks go to both crowd workers and experts, so that the percentages can add up to over a hundred percent. This information is illustrated in column WORKLOAD in Table 6.
(2) The second is the overall validity of crowd feedback, referred to as correctness, measured as the ratio of correct answers among all answers retrieved from the crowd. We provide two values for correctness in Table 6, under column CORRECTNESS: The first is the correctness only over crowd feedback. Note that this value is the same for all CROWD and HYBRID setups since in these approaches 100% of annotation tasks are passed to the crowd. The second, named hybrid, is the overall correctness of the resolved answers with both expert and crowd feedback.
⁴ We will release the dataset soon.
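The evaluation measures described above can be sketched as follows. The data representations (a set of groups per task for the workload, parallel lists of answers for correctness, and sets of label tuples for precision, recall and F1) and the function names are assumptions chosen for illustration.

```python
def workload(routes):
    """Percentage of tasks sent to the crowd and to experts.

    routes: one set per task, a subset of {"crowd", "expert"}.
    A task may be in both groups (HYBRID), so the values can sum to over 100.
    """
    if not routes:
        raise ValueError("no tasks given")
    n = len(routes)
    return {group: 100.0 * sum(group in r for r in routes) / n
            for group in ("crowd", "expert")}


def correctness(answers, gold_answers):
    """Ratio of answers that match the gold answer of their task."""
    pairs = list(zip(answers, gold_answers))
    return sum(a == g for a, g in pairs) / len(pairs)


def precision_recall_f1(predicted, gold):
    """Annotation quality over sets of labels, e.g. (sentence, predicate, role, span) tuples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, `workload([{"crowd", "expert"}, {"crowd"}])` yields 100% for the crowd and 50% for experts, which illustrates how the HYBRID percentages can add up to more than a hundred percent.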

Experimental Results
The results of our experiments are summarized in Table 6. We make the following observations:
CROWD-IN-THE-LOOP significantly increases annotation quality. Our evaluation shows that CROWD-IN-THE-LOOP produces SRL annotation with significantly higher quality compared to crowdsourcing or hybrid scenarios. With a resulting F1-score of 0.94, it outperforms the best performing crowdsourcing setup (0.90) by 4 points. More importantly, our proposed approach also outperforms other hybrid approaches that partially leverage experts. It outperforms the best hybrid approach (0.91) by 3 points, indicating that TASKROUTER is better at selecting expert-required tasks than a method that relies only on crowd agreement.
Significantly less expert involvement required.
In our experiments, roughly two-thirds of all tasks (66.4%) were determined to be crowd-appropriate by TASKROUTER. This considerably reduces the need for expert involvement compared to expert labeling, while still maintaining relatively high annotation quality. In particular, our approach compares favorably to other hybrid setups in which a similar partition of tasks is completed by experts.
Since TASKROUTER is better at choosing expert-required tasks than previous approaches, we achieve higher overall quality at similar levels of expert involvement.
Crowd workers more effective. As the correctness column in Table 6 shows, the selection of tasks by TASKROUTER is more appropriate for the crowd in general. Their average correctness increases to 0.92, compared to 0.84 if the crowd completes 100% of the tasks.

Discussion and Outlook
The proposed approach far outperforms crowdsourcing and hybrid approaches in terms of annotation quality. In particular, even at similar levels of expert involvement, it outperforms the HYBRID_all5 approach. However, we also note that with an F1-score of 0.94, our approach does not yet reach the quality of gold annotated data.
Insights for further improving quality. To further improve the quality of generated SRL training data, future work may (1) investigate additional features (Wang et al., 2015) and classification models so that TASKROUTER can better distinguish between crowd-appropriate and expert-required tasks, and (2) experiment with other SRL crowdsourcing designs to make more tasks crowd-appropriate. Nevertheless, we suspect that a small decrease in quality cannot be fully avoided if large numbers of non-experts are involved in a labeling task such as SRL. Given such involvement of non-experts, we believe that our proposed approach is a compelling way of increasing crowdsourcing quality while keeping expert costs relatively low.
Flexible trade-off of costs vs quality. Another avenue for research is to experiment with classifier parameters that allow us to more flexibly control the trade-off between how many experts we wish to involve and what annotation quality we desire (e.g., active learning (Wang et al., 2017)). This may be helpful in scenarios in which costs are fixed, or where one aims to compute the costs for generating annotated data of specific quality.
Use for SRL domain adaptation. One intended avenue for study is to apply our approach to generate training data for a specific textual domain for which little or no SRL training data currently exists. We believe that due to its relatively lower costs, our approach may be an ideal candidate for practical domain adaptation of SRL.
Applicability to other NLP crowdsourcing tasks. Finally, while in this paper we focused on the task of generating labeled training data for SRL, we believe that our proposed approach may be applicable to other NLP tasks for which crowdsourcing has reported only moderate results to date. To study this applicability, one would first need to conduct a study similar to the one in Section 2 to identify crowd-appropriate and expert-required tasks, and then attempt to train a classifier.

Related Work
Crowdsourcing SRL Annotation. Different approaches have been adopted to formulate SRL tasks for non-expert crowd workers (Hong and Baker, 2011). Typical tasks include selecting answers from a set of candidates (Fossati et al., 2013), marking text passages that contain specific semantic roles (Feizabadi and Padó, 2014), and constructing question-answer pairs (He et al., 2015, 2016). However, a particular challenge is that SRL annotation tasks are often complex and crowdsourcing inevitably leads to low-quality annotations (Pavlick et al., 2015). Instead of attempting to design a better annotation task, our proposed approach addresses this problem by accepting that a certain portion of annotation tasks is too difficult for the crowd. We create a classifier to identify such tasks and involve experts whenever necessary.
Routing Tasks. Recent approaches have been developed to address the task routing problem in crowdsourcing (Bragg et al., 2014; Bozzon et al., 2013; Hassan and Curry, 2013). As workers vary in skill and tasks vary in difficulty, previously recommended approaches often consider the match between the task content and workers' profiles. However, these approaches are difficult to apply to the particular context of SRL annotation since we only distinguish between experts familiar with PropBank and non-expert crowd workers.
Rather than routing tasks to the most appropriate workers, our proposed approach determines which SRL tasks are appropriate for crowdsourcing, and sends the remaining ones to experts.
Human-in-the-loop Methods. Our method is similar in spirit to human-in-the-loop learning (Fung et al., 1992). Human-in-the-loop learning generally aims to leverage humans to complete easy commonsense tasks, such as the recognition of objects in images (Patterson et al., 2013). Recent work also proposed human-in-the-loop parsing (He et al., 2016) to include human feedback into parsing. However, unlike these approaches, we aim to combine both experts and non-experts to address the difficulty of some SRL annotation tasks, while leveraging the crowd for the majority of tasks.

Conclusion
In this paper, we proposed CROWD-IN-THE-LOOP, an approach for creating high-quality annotated data for SRL that leverages both crowd and expert workers. We conducted a crowdsourcing study and analyzed its results to design a classifier to distinguish between crowd-appropriate and expert-required tasks. Our experimental evaluation showed that our proposed approach significantly outperforms baseline crowdsourcing and hybrid approaches, and successfully limits the need for expert involvement while achieving high annotation quality.