A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature

We present a corpus of 5,000 richly annotated abstracts of medical articles describing clinical randomized controlled trials. Annotations include demarcations of text spans that describe the Patient population enrolled, the Interventions studied and to what they were Compared, and the Outcomes measured (the ‘PICO’ elements). These spans are further annotated at a more granular level, e.g., individual interventions within them are marked and mapped onto a structured medical vocabulary. We acquired annotations from a diverse set of workers with varying levels of expertise and cost. We describe our data collection process and the corpus itself in detail. We then outline a set of challenging NLP tasks that would aid searching of the medical literature and the practice of evidence-based medicine.


Introduction
In 2015 alone, about 100 manuscripts describing randomized controlled trials (RCTs) for medical interventions were published every day. It is thus practically impossible for physicians to know which is the best medical intervention for a given patient group and condition (Borah et al., 2017; Fraser and Dunstan, 2010; Bastian et al., 2010). This inability to easily search and organize the published literature impedes the aims of evidence-based medicine (EBM), which aspires to inform patient care using the totality of relevant evidence.
* Now at Google Inc.
Computational methods could expedite biomedical evidence synthesis (Tsafnat et al., 2013; Wallace et al., 2013), and natural language processing (NLP) in particular can play a key role in the task.
Prior work has explored the use of NLP methods to automate biomedical evidence extraction and synthesis (Boudin et al., 2010; Marshall et al., 2017; Ferracane et al., 2016; Verbeke et al., 2012). But the area has attracted less attention than it might from the NLP community, due primarily to a dearth of publicly available, annotated corpora with which to train and evaluate models.
Here we address this gap by introducing EBM-NLP, a new corpus to power NLP models in support of EBM. The corpus, accompanying documentation, baseline model implementations for the proposed tasks, and all code are publicly available. EBM-NLP comprises ∼5,000 medical abstracts describing clinical trials, multiply annotated in detail with respect to characteristics of the underlying trial Populations (e.g., diabetics), Interventions (insulin), Comparators (placebo) and Outcomes (blood glucose levels). Collectively, these key informational pieces are referred to as PICO elements; they form the basis for well-formed clinical questions (Huang et al., 2006).
We adopt a hybrid crowdsourced labeling strategy using heterogeneous annotators with varying expertise and cost, from laypersons to MDs. Annotators were first tasked with marking text spans that described the respective PICO elements. Identified spans were subsequently annotated in greater detail: this entailed finer-grained labeling of PICO elements, mapping these onto a normalized vocabulary, and indicating redundancy in the mentions of PICO elements.
In addition, we outline several NLP tasks that would directly support the practice of EBM and that may be explored using the introduced resource. We present baseline models and associated results for these tasks.

Related Work
We briefly review two lines of research relevant to the current effort: work on NLP to facilitate EBM, and research in crowdsourcing for NLP.

NLP for EBM
Prior work on NLP for EBM has been limited by the availability of only small corpora, which have typically provided on the order of a couple hundred annotated abstracts or articles for very complex information extraction tasks. For example, the ExaCT system (Kiritchenko et al., 2010) applies rules to extract 21 aspects of the reported trial. It was developed and validated on a dataset of 182 marked full-text articles. The ACRES system (Summerscales et al., 2011) produces summaries of several trial characteristics, and was trained on 263 annotated abstracts. Hinting at more challenging tasks that can build upon foundational information extraction, Alamri and Stevenson (2015) developed methods for detecting contradictory claims in biomedical papers. Their corpus of annotated claims contains 259 sentences (Alamri and Stevenson, 2016).
Larger corpora for EBM tasks have been derived using (noisy) automated annotation approaches. This approach has been used to build, e.g., datasets to facilitate work on Information Retrieval (IR) models for biomedical texts (Scells et al., 2017; Chung, 2009; Boudin et al., 2010). Similar approaches have been used to 'distantly supervise' annotation of full-text articles describing clinical trials (Wallace et al., 2016). In contrast to the corpora discussed above, these automatically derived datasets tend to be relatively large, but they include only shallow annotations.
Other work attempts to bypass basic extraction tasks and address more complex biomedical QA and (multi-document) summarization problems to support EBM (Demner-Fushman and Lin, 2007; Mollá and Santiago-Martinez, 2011; Abacha and Zweigenbaum, 2015). Such systems would directly benefit from more accurate extraction of the types codified in the corpus we present here.
Medical articles contain relatively technical content, which intuitively may be difficult for persons without domain expertise to annotate. However, recent promising preliminary work has found that crowdsourced approaches can yield surprisingly high-quality annotations in the domain of EBM specifically (Mortensen et al., 2017; Thomas et al., 2017; Wallace et al., 2017).

Data Collection
PubMed provides access to the MEDLINE database, which indexes titles, abstracts and metadata for articles from selected medical journals dating back to the 1970s. MEDLINE indexes over 24 million abstracts; the majority of these have been manually assigned metadata, which we used to retrieve a set of 5,000 articles describing RCTs with an emphasis on cardiovascular diseases, cancer, and autism. These particular topics were selected to cover a range of common conditions.
We decomposed the annotation process into two steps, performed in sequence. First, we acquired labels demarcating spans in the text describing the clinically salient abstract elements mentioned above: the trial Population, the Interventions and Comparators studied, and the Outcomes measured. We collapse Interventions and Comparators into a single category (I). In the second annotation step, we tasked workers with providing more granular (sub-span) annotations on these spans.
For each PIO element, all abstracts were annotated with four types of information: spans, detailed sub-span labels, repetition, and MeSH terms. We collected annotations for each P, I and O element individually to avoid the cognitive load imposed by switching between label sets, and to reduce the amount of instruction required to begin the task. All annotation was performed using a modified version of the Brat Rapid Annotation Tool (BRAT) (Stenetorp et al., 2012). We include all annotation instructions provided to workers for all tasks in the Appendix.

Non-Expert (Layperson) Workers
For large-scale crowdsourcing via recruitment of layperson annotators, we used Amazon Mechanical Turk (AMT). All workers were required to have an overall job approval rate of at least 90%. Each job presented to the workers required the annotation of three randomly selected abstracts from our pool of documents. As we received initial results, we blocked workers who were clearly not following instructions, and we actively recruited the best workers to continue working on our task at a higher pay rate.
We began by collecting the least technical annotations, moving on to more difficult tasks only after restricting our pool of workers to those with a demonstrated aptitude for the jobs. We obtained annotations from ≥ 3 different workers for each of the 5,000 abstracts to enable robust inference of reliable labels from noisy data. After performing filtering passes to remove non-RCT documents or those missing relevant data for the second annotation task, we are left with between 4,000 and 5,000 sets of annotations for each PIO element after the second phase of annotation.

Expert Workers
To supplement our larger-scale data collection via AMT, we collected annotations for 200 abstracts for each PIO element from workers with advanced medical training. The idea is for these to serve as reference annotations, i.e., a test set with which to evaluate developed NLP systems. We plan to enlarge this test set in the near future, at which point we will update the website accordingly.
For the initial span labeling task, two medical students from the University of Pennsylvania and Drexel University provided the reference labels. In addition, for both stages of annotation and for the detailed subspan annotation in Stage 2, we hired three medical professionals via Upwork, an online platform for hiring skilled freelancers. After reviewing several dozen suggested profiles, we selected three workers with the following characteristics: advanced medical training (the majority of hired workers were Medical Doctors, the one exception being a fourth-year medical student); strong technical reading and writing skills; and an interest in medical research. In addition to providing high-quality annotations, individuals hired via Upwork also provided feedback regarding the instructions to help make the task as clear as possible for the AMT workers.

The Corpus
We now present corpus details, paying special attention to worker performance and agreement. We discuss and present statistics for acquired annotations on spans, tokens, repetition and MeSH terms in Sections 4.1, 4.2, 4.3, and 4.4, respectively.

Spans
For each P, I and O element, workers were asked to read the abstract and highlight all spans of text including any pertinent information. Annotations for 5,000 articles were collected from a total of 579 AMT workers across the three annotation types, and expert annotations were collected for 200 articles from two medical students.
We first evaluate the quality of the annotations by calculating token-wise label agreement between the expert annotators; this is reported in Table 2. Due to the difficulty and technicality of the material, agreement between even well-trained domain experts is imperfect. The effect is magnified by the unreliability of AMT workers, motivating our strategy of collecting several noisy annotations and aggregating over them to produce a single cleaner annotation. We tested three different aggregation strategies: a simple majority vote; the Dawid-Skene model (Dawid and Skene, 1979), which estimates worker reliability; and HMMCrowd, a recent extension to Dawid-Skene that includes an HMM component, thus explicitly leveraging the sequential structure of contiguous spans of words (Nguyen et al., 2017).
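To make the simplest of these strategies concrete, the following is a minimal sketch of token-wise majority voting over binary span labels. The function name, the list-of-lists input format, and the tie-breaking rule (ties count as inside a span, which favors recall) are illustrative assumptions rather than the released implementation. Dawid-Skene and HMMCrowd differ in that they additionally weight each worker by an estimated reliability.

```python
from typing import List


def majority_vote(annotations: List[List[int]]) -> List[int]:
    """Aggregate binary token labels (1 = inside a span, 0 = outside).

    `annotations` holds one label sequence per worker for the same abstract.
    Ties are resolved as 1; this tie-breaking rule is an assumption.
    """
    if not annotations:
        return []
    n_tokens = len(annotations[0])
    if any(len(a) != n_tokens for a in annotations):
        raise ValueError("all workers must label the same tokens")
    n_workers = len(annotations)
    aggregated = []
    for i in range(n_tokens):
        votes = sum(a[i] for a in annotations)
        # A token is kept if at least half of the workers marked it.
        aggregated.append(1 if 2 * votes >= n_workers else 0)
    return aggregated


# Toy example: three workers labeling a five-token sentence.
workers = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
print(majority_vote(workers))
```

In this toy input the second and fourth tokens are each marked by two of three workers, so both survive aggregation even though no single worker's span boundaries are reproduced exactly.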
For each aggregation strategy, we compute the token-wise precision and recall of the output labels against the unioned expert labels. As shown in Table 3, the HMMCrowd model afforded modest improvement in F-1 scores over the standard Dawid-Skene model, and was thus used to generate the inputs for the second annotation phase.
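The evaluation just described can be sketched as follows. This is an illustrative reading of "token-wise precision and recall against the unioned expert labels", assuming binary per-token labels; the helper names and the toy label sequences are hypothetical and not taken from the released code.

```python
from typing import List, Tuple


def union_labels(expert_a: List[int], expert_b: List[int]) -> List[int]:
    """A token is positive in the reference if either expert marked it."""
    return [int(a or b) for a, b in zip(expert_a, expert_b)]


def token_prf(predicted: List[int], reference: List[int]) -> Tuple[float, float, float]:
    """Token-wise precision, recall and F-1 for binary labels."""
    tp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example with two experts and one aggregated crowd labeling.
expert_a = [0, 1, 1, 0, 0, 0]
expert_b = [0, 0, 1, 1, 0, 0]
crowd = [0, 1, 1, 0, 1, 0]

reference = union_labels(expert_a, expert_b)
precision, recall, f1 = token_prf(crowd, reference)
print(reference, round(precision, 3), round(recall, 3), round(f1, 3))
```

Taking the union of the expert labels makes the reference generous, so a crowd labeling is penalized on recall whenever it misses a token that at least one expert considered relevant.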
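The limited overlap in the document subsets annotated by any given pair of workers, and wide variation in the number of annotations per worker, make standard agreement statistics difficult to interpret. We therefore measure agreement by computing token-wise precision and recall of each individual annotation against the aggregated labels (Table 4).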
When comparing the average precision and recall for individual crowdworkers against the aggregated labels in Table 4, scores are poor, showing very low agreement between the workers. Despite this, the aggregated labels compare favorably against the expert labels. This further supports the intuition that it is feasible to collect multiple low-quality annotations for a document and synthesize them to extract the signal from the noise.
On the dataset website, we provide a variant of the corpus that includes all individual worker span annotations (e.g., for researchers interested in crowd annotation aggregation methods), and also a version with pre-aggregated annotations for convenience.

Hierarchical Labels
For each P, I, and O category we developed a hierarchy of labels intended to capture important subcategories within these. Our labels are aligned to (and thus compatible with) the concepts codified by the Medical Subject Headings (MeSH) vocabulary of medical terms maintained by the National Library of Medicine (NLM). In consultation with domain experts, we selected subsets of MeSH terms for each PIO category that captured relatively precise information without being overwhelming. For illustration, we show the outcomes label hierarchy we used in Figure 2. We reproduce the label hierarchies used for all PIO categories in the Appendix.
At this stage, workers were presented with abstracts in which relevant spans were highlighted, based on the annotations collected in the first annotation phase (and aggregated via the HMMCrowd model). This two-step approach served dual purposes: (i) increasing the rate at which workers could complete tasks, and (ii) improving recall by directing workers to all areas in abstracts where they might find the structured information of interest. Our choice of a high-recall aggregation strategy for the starting spans ensured that the large majority of relevant sections of the article were available as inputs to this task.
The three trained medical personnel hired via Upwork each annotated 200 documents and reported that spans sufficiently captured the target information. These domain experts received feedback and additional training after labeling an initial round of documents, and all annotations were reviewed for compliance. The average inter-annotator agreement is reported in Table 6.
With respect to crowdsourcing on AMT, the task for Participants was published first, allowing us to target higher quality workers for the more technical Interventions and Outcomes annotations. We retained labels from 118 workers for Participants, the top 67 of whom were invited to continue on to the following tasks. Of these, 37 continued to contribute to the project. Several workers provided ≥ 1,000 annotations and continued to work on the task over a period of several months.
To produce final per-token labels, we again turned to aggregation. The subspans annotated in this second pass were by construction shorter than the starting spans, and (perhaps as a result) informal experiments revealed little benefit from HMMCrowd's sequential modeling aspect. The introduction of many label types significantly increased the complexity of the task, resulting in both lower expert inter-annotator agreement (Table 6) and decreased performance when comparing the crowdsourced labels against those of the experts (Table 7). Most observed token-level disagreements (and errors, with respect to reference annotations) involve differences in the span lengths demarcated by individuals. For example, many abstracts contain an information-dense description of the patient population, focusing on their medical condition but also including information about their sex and/or age. Workers would also sometimes fail to capture repeated mentions of the same information, producing Type 2 errors more frequently than Type 1. This tendency can be seen in the overall token-level confusion matrix for AMT workers on the Participants task, shown in Figure 3.

In a similar though more benign category of error, workers differed in the amount of context they included surrounding each subspan. Although the instructions asked workers to highlight minimal subspans, there was variance in what workers considered relevant. For the same reasons mentioned above (little pairwise overlap in annotations, high variance with respect to annotations per worker), quantifying agreement between AMT workers is again difficult using traditional measures. We thus again take as a measure of agreement the precision, recall, and F-1 of the individual annotations against the aggregated labels and present the results in Table 8.
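By contrast, the expert annotators all labeled the same 200 reference documents, so a traditional measure applies directly there: agreement among the three medical experts is summarized as average pair-wise Cohen's κ. The sketch below shows how that statistic can be computed over multi-class token labels. The label names in the example (e.g., "AGE") are placeholders, and the function names are assumptions rather than part of the released code.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa between two annotators labeling the same tokens."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labels independently according to their own label distribution.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def average_pairwise_kappa(annotators: List[Sequence[str]]) -> float:
    """Mean of Cohen's kappa over all annotator pairs."""
    pairs = list(combinations(annotators, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy example: two annotators, four tokens, placeholder label names.
ann_1 = ["O", "AGE", "AGE", "O"]
ann_2 = ["O", "AGE", "O", "O"]
print(cohens_kappa(ann_1, ann_2))
```

Here the annotators agree on three of four tokens (0.75 observed agreement), but half of that agreement is expected by chance given how often each uses "O", which is why κ is lower than raw agreement.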

Repetition
Medical abstracts often mention the same information in multiple places. In particular, interventions and outcomes are typically described at the beginning of an abstract when introducing the purpose of the underlying study, and then again when discussing methods and results. It is important to distinguish such repeated mentions from new information.
Workers identified repeated information as follows. After completing detailed labeling of abstract spans, they were asked to group together subspans that were instances of the same information (for example, redundant mentions of a particular drug evaluated as one of the interventions in the trial). This process produces labels for repetition between short spans of tokens. Due to the differences in the lengths of annotated subspans discussed in the preceding section, the labels are not naturally comparable between workers without directly modeling the entities contained in each subspan. The labels assigned by workers produce repetition labels between sets of tokens, but a more sophisticated notion of co-reference is required to identify which tokens correctly represent the entity contained in the span, and which tokens are superfluous noise.
As a proxy for formally enumerating these entities, we observe that a large majority of starting spans only contain a single target relevant to the subspan labeling task, and so identifying repetition between the starting spans is sufficient. For example, consider the starting intervention span "underwent conventional total knee arthroplasty"; there is only one intervention in the span but some annotators assigned the SURGICAL label to all five tokens while others opted for only "total knee arthroplasty." By analyzing repetition at the level of the starting spans, we can compute agreement without concern for the confounds of slight misalignments or differences in length of the subspans.
Overall agreement between AMT workers for span-level repetition, measured by computing precision and recall against the majority vote for each pair of spans, is reported in Table 10.

MeSH Terms
The National Library of Medicine maintains an extensive hierarchical ontology of medical concepts called Medical Subject Headings (MeSH terms); this is part of the overarching Metathesaurus of the Unified Medical Language System (UMLS). Personnel at the NLM manually assign relevant MeSH terms to citations (article titles, abstracts and metadata) indexed in MEDLINE. These terms have been used extensively to evaluate the content of articles, and are frequently used to facilitate document retrieval (Lu et al., 2009; Lowe and Barnett, 1994).
In the case of randomized controlled trials, MeSH terms provide structured information regarding key aspects of the underlying studies, ranging from participant demographics to methodologies to co-morbidities. A drawback to these annotations, however, is that they are applied at the document (rather than snippet or token) level. To capture where MeSH terms are instantiated within a given abstract text, we provided a list of all terms associated with said article and instructed workers to select the subset of these that applied to each set of token labels that they annotated.
MeSH terms are domain specific. The technical specificity of the more obscure MeSH terms is also exacerbated by their sparsity. Of the 6,963 unique MeSH terms occurring in our set of abstracts, 87% of them are only found in 10 documents or fewer and only 2.0% occur in at least 1% of the total documents. The full distribution of document frequency for MeSH terms is shown in Figure 4.
To evaluate how often salient MeSH terms were instantiated in the text by annotators, we consider only the 135 MeSH terms that occur in at least 1% of abstracts (we list these in the supplementary material). For each term, we calculate its "instantiation frequency" as the percentage of abstracts containing the term in which at least one annotator assigned it to a span of text. The total numbers of MeSH terms with an instantiation rate above different thresholds for the respective PIO elements are shown in Table 11.
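The instantiation frequency defined above can be computed as in the following sketch. The input format (two dictionaries keyed by document ID), the function name, and the toy data are assumptions made for illustration; only the definition of the statistic and the 1% document-frequency cutoff come from the text.

```python
from collections import Counter
from typing import Dict, Set


def instantiation_frequency(
    doc_terms: Dict[str, Set[str]],
    assigned_terms: Dict[str, Set[str]],
    min_doc_fraction: float = 0.01,
) -> Dict[str, float]:
    """Percentage of abstracts carrying a MeSH term in which at least one
    annotator assigned that term to a span of text.

    doc_terms:      document ID -> MeSH terms indexed for that document
    assigned_terms: document ID -> terms some annotator attached to a span
    Terms occurring in fewer than `min_doc_fraction` of documents are skipped.
    """
    n_docs = len(doc_terms)
    doc_freq = Counter(term for terms in doc_terms.values() for term in terms)
    result = {}
    for term, df in doc_freq.items():
        if df / n_docs < min_doc_fraction:
            continue
        hits = sum(
            1
            for doc_id, terms in doc_terms.items()
            if term in terms and term in assigned_terms.get(doc_id, set())
        )
        result[term] = 100.0 * hits / df
    return result


# Toy data: three documents; a higher cutoff is used so that rare terms drop out.
doc_terms = {
    "d1": {"Humans", "Aspirin"},
    "d2": {"Humans"},
    "d3": {"Humans", "Autistic Disorder"},
}
assigned_terms = {"d1": {"Aspirin"}, "d2": {"Humans"}, "d3": set()}
rates = instantiation_frequency(doc_terms, assigned_terms, min_doc_fraction=0.5)
print(rates)
```

Note that the denominator is the number of documents indexed with the term, not the total number of documents, so a frequent but rarely highlighted term receives a low rate.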

Tasks & Baselines
We outline a few NLP tasks that are central to the aim of processing medical literature generally and to aiding practitioners of EBM specifically. First, we consider the task of identifying spans in abstracts that describe the respective PICO elements (Section 5.1). This would, e.g., improve medical literature search and retrieval systems. Next, we outline the problem of extracting structured information from abstracts (Section 5.2). Such models would further aid search, and might eventually facilitate automated knowledge-base construction for the clinical trials literature. Furthermore, automatic extraction of structured data would enable automation of the manual evidence synthesis process (Marshall et al., 2017).
Finally, we consider the challenging task of identifying redundant mentions of the same PICO element (Section 5.3).This happens, e.g., when an intervention is mentioned by the authors repeatedly in an abstract, potentially with different terms.Achieving such disambiguation is important for systems aiming to induce structured representations of trials and their results, as this would require recognizing and normalizing the unique interventions and outcomes studied in a trial.
For each of these tasks we present baseline models and corresponding results. Note that we have pre-defined train, development and test sets across PIO elements for this corpus, comprising 4,300, 500 and 200 abstracts, respectively. The latter set is annotated by domain experts (i.e., persons with medical training). These splits will, of course, be distributed along with the dataset to facilitate model comparisons.

Identifying P, I and O Spans
We consider two baseline models: a linear Conditional Random Field (CRF) (Lafferty et al., 2001) and a Long Short-Term Memory (LSTM) neural tagging model, an LSTM-CRF (Lample et al., 2016; Ma and Hovy, 2016). In both models, we treat tokens as being either Inside (I) or Outside (O) of spans.
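The Inside/Outside encoding can be illustrated with a short sketch. The representation of spans as (start, end) token offsets with an exclusive end, and the two helper names, are assumptions for illustration and do not describe the corpus file format.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed)


def spans_to_io(n_tokens: int, spans: List[Span]) -> List[str]:
    """Mark every token covered by some span as 'I', all others as 'O'."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        for i in range(max(start, 0), min(end, n_tokens)):
            tags[i] = "I"
    return tags


def io_to_spans(tags: List[str]) -> List[Span]:
    """Recover maximal runs of 'I' tags as spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "I" and start is None:
            start = i
        elif tag != "I" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


tags = spans_to_io(6, [(1, 3), (3, 5)])
print(tags)
print(io_to_spans(tags))
```

One consequence of a two-tag scheme, visible in the example, is that directly adjacent spans merge into a single span when decoded; a scheme with an explicit Begin tag would be needed to keep them apart.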
For the CRF, features include: indicators for the current, previous and next words; part of speech tags inferred using the Stanford CoreNLP tagger (Manning et al., 2014); and character information, e.g., whether a token contains digits, uppercase letters, symbols and so on.
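A feature function along these lines might look like the sketch below. It assumes part-of-speech tags have already been produced by an external tagger and are passed in as a list; the feature names, boundary markers, and the use of `string.punctuation` to define "symbols" are illustrative choices, not the exact feature set of the baseline.

```python
import string
from typing import Dict, List, Union

Feature = Union[str, bool]


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, Feature]:
    """Features for token i: word context, POS tag, and character indicators."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        "pos": pos_tags[i],  # assumed to come from an external POS tagger
        "has_digit": any(c.isdigit() for c in word),
        "has_upper": any(c.isupper() for c in word),
        "has_symbol": any(c in string.punctuation for c in word),
    }


tokens = ["Fourteen", "children", "aged", "5-13", "years"]
pos_tags = ["CD", "NNS", "VBN", "CD", "NNS"]
features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(features[3])
```

Dictionaries of this shape, one per token, are the input format accepted by common linear-chain CRF toolkits, which is why the sketch returns a mapping rather than a fixed-length vector.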
The neural model induces features via a bi-directional LSTM that consumes distributed vector representations of input tokens sequentially. The bi-LSTM yields a hidden vector at each token index, which is then passed to a CRF layer for prediction. We also exploit character-level information by passing a bi-LSTM over the characters comprising each word (Lample et al., 2016); these are appended to the word embedding representations before being passed through the bi-LSTM.

Table 13: Baseline models for the token-level, detailed labeling task.

Extracting Structured Information
Beyond identifying the spans of text containing information pertinent to each of the PIO elements, we consider the task of predicting which of the detailed labels occur in each span, and where they are located. Specifically, we begin with the starting spans and predict a single label from the corresponding PIO hierarchy for each token, evaluating against the test set of 200 documents. Initial experiments with neural models proved unfruitful but bear further investigation.
For the CRF model we include the same features as in the previous model, supplemented with additional features encoding whether the adjacent tokens include any parentheses or mathematical operators (specifically: %, +, −). For the logistic regression model, we use a one-vs-rest approach. Features include token n-grams, part of speech indicators, and the same character-level information as in the CRF model.

Detecting Repetition
To formalize repetition, we consider every pair of starting PIO spans from each abstract, and assign binary labels that indicate whether they share at least one instance of the same information. Although this makes prediction easier for long and information-dense spans, a large enough majority of the spans contain only a single instance of relevant information that the task serves as a reasonable baseline. Again, the model is trained on the aggregated labels collected from AMT and evaluated against the high-quality test set.
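The pairwise formulation can be sketched as follows. Representing each starting span by the set of information identifiers it contains is an assumption made to keep the example short (in the corpus these groupings come from worker annotations), and the function name is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def span_pair_labels(span_contents: List[Set[str]]) -> Dict[Tuple[int, int], int]:
    """Binary repetition label for every unordered pair of starting spans.

    span_contents[i] is the set of information identifiers found in span i.
    A pair is labeled 1 if the two spans share at least one identifier.
    """
    return {
        (i, j): int(bool(span_contents[i] & span_contents[j]))
        for i, j in combinations(range(len(span_contents)), 2)
    }


# Toy abstract with three intervention spans.
spans = [{"aspirin"}, {"aspirin", "placebo"}, {"placebo"}]
labels = span_pair_labels(spans)
print(labels)
```

An abstract with n starting spans yields n(n-1)/2 pairs, and a long span that mentions several items (like the second span here) is linked to every span sharing any one of them, which is the sense in which information-dense spans make the prediction easier.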
We train a logistic regression model that operates over standard features, including bag-of-words representations and sentence-level features such as length and position in the document. All baseline model implementations are available on the corpus website.

Conclusions
We have presented EBM-NLP: a new, publicly available corpus comprising 5,000 richly annotated abstracts of articles describing clinical randomized controlled trials. This dataset fills a need for larger scale corpora to facilitate research on NLP methods for processing the biomedical literature, which have the potential to aid the conduct of EBM. The need for such technologies will only become more pressing as the literature continues its torrential growth.
The EBM-NLP corpus, accompanying documentation, code for working with the data, and baseline models presented in this work are all publicly available at: http://www.ccs.neu.edu/home/bennye/EBM-NLP.

Figure 1: Annotation interface for assigning MeSH terms to snippets.

Figure 2: Outcome task label hierarchy.

Average pair-wise Cohen's κ between three medical experts for the 200 reference documents.

Figure 3: Confusion matrix for token-level labels provided by experts.

Figure 4: Histogram of the number of documents containing each MeSH term.

P
Fourteen children (12 infantile autism full syndrome present, 2 atypical pervasive developmental disorder) between 5 and 13 years of age

Table 1 :
Partial example annotation for Participants, Interventions, and Outcomes. The full annotation includes multiple top-level spans for each PIO element as well as labels for repetition.

Table 2 :
Cohen's κ between medical students for the 200 reference documents.

Table 3 :
Precision, recall and F-1 for aggregated AMT spans evaluated against the union of expert span labels, for all three P, I, and O elements.

Table 4 :
Token-wise statistics for individual AMT annotations evaluated against the aggregated versions.

Table 10 :
Comparison against the majority vote for span-level repetition labels.

Table 14 :
Baseline model for predicting whether pairs of spans contain redundant information.