Leveraging coreference to identify arms in medical abstracts: An experimental study

Performing systematic reviews is a critical yet manual, labor-intensive step in evidence-based medicine. Automating systematic reviews is an active area of research, requiring innovations in machine learning and computational linguistics. We examine how coreference resolution can aid in identifying the arms of a study, an often overlooked piece of information needed to synthesize the results in a systematic review. A classification model (code available at https://github.com/elisaF/extractGroups) that performs better with the coreference features supports the intuition that coreference is able to capture the discourse salience of arms. We note that control arms do not benefit as much from these features.


Introduction
Evidence-based medicine (EBM) is a paradigm that seeks to inform medical practitioners of the optimal treatment, based on the totality of the available evidence (i.e., the results of all relevant clinical trials). To this end, teams of medical experts often conduct systematic reviews, which synthesize all published medical literature pertaining to a specific clinical question. The first step in a systematic review is to formulate the research question to be investigated, and then find all of the relevant citations. Abstracts and then full texts are screened to exclude irrelevant trials. Once a set of trials pertinent to the research question is identified (typically 10-20 trials), key pieces of information are extracted from each trial. This information generally consists of the patient Population under study, the Intervention(s) being tested, the Comparison, and the Outcomes (abbreviated as PICO). Results from all identified trials are typically statistically combined via meta-analysis to produce an aggregated result.
Producing systematic reviews is a time-consuming, largely manual process. This is exacerbated by the rapidly growing evidence base: PubMed contains 800,000+ publications on clinical trials in humans (Wallace et al., 2013), and on average reports of 75 new trials are published daily. A single systematic review can take over a year to produce, at which point it risks becoming outdated. Automating evidence synthesis therefore poses an enormous yet enticing challenge.
A crucial step towards automating synthesis is identifying the arms, or groups, in trials. A clinical trial consists of one control arm and one or more intervention arms. For example, a study comparing the efficacy of aspirin versus a placebo would consist of two arms: those taking aspirin (the intervention group) and those taking the placebo (the control group). Previous work has mostly focused on identifying the PICO elements. However, the PICO elements alone are insufficient to convey the design of the study, a key piece of evidence necessary in the downstream task of data synthesis and analysis. Thus, the present study focuses on improving the automated identification of arms. We observed that arms are often salient in the discourse of the abstract, in that they corefer more often than other tokens. This study is exploratory work that focuses on investigating the effectiveness of using coreference features for identifying arms.

Randomised controlled trial with 12 month intervention. Change in body mass index (BMI) standard deviation score (SDS) over 12 months with assessment 18 months after the start of the intervention. Using the last available data on all participants (n=106), those in the Mandometer group [arm1] had significantly lower mean BMI SDS at 12 months compared with standard care [arm2]. The mean meal size in the Mandometer group [arm1] fell by 45 g. Those in the Mandometer group [chain3] also had greater improvement in concentration of high density lipoprotein cholesterol.

Table 2: Medical abstract with annotated arms and coreference chains. The chains were automatically determined as described in section 4.3. All phrases with the same chain label are judged to co-refer.
The remainder of this paper is organized as follows. We motivate the choice of coreference features for arm identification. We then examine prior work in identifying the arms in medical texts, and how coreference resolution has been applied to the medical field. Next, we present an experiment to classify whether tokens in annotated medical abstracts are part of an arm. We propose features that take advantage of the discourse salience of arms, and we discuss the results with and without the coreference features.

Motivation
Identifying the arms is not a simple information extraction task. The arms in a study consist of one control group and one or more intervention groups. Often, the control group is never explicitly mentioned in the abstract. In the following excerpt, only the intervention arm is mentioned: "To determine whether modifying eating behaviour with use of a feedback device facilitates weight loss in obese adolescents."
An arm in a study is typically a noun phrase (NP), where this NP is repeated, either verbatim or anaphorically, throughout the abstract. An example of the discourse salience of arms in a medical abstract is in Table 1. The intervention arm, Mandometer group, is repeated several times verbatim throughout the abstract.
Given this recurring linguistic pattern in medical abstracts, we investigated the use of coreference resolution to help identify arms. The goal of coreference resolution is to determine which mentions in a text refer to the same entity. A referring expression, or mention, is the natural language expression used by discourse participants to refer to entities. Two or more mentions that refer to the same entity are coreferent, and together form a coreference chain. An anaphor and its antecedent (or a cataphor and its postcedent) will form a coreference chain. Mentions can be indefinite noun phrases, definite noun phrases, proper names and pronouns, where clinical trial abstracts contain mostly NPs. Using an off-the-shelf coreference tool (to be discussed in more detail in section 4.3) yields the mentions and coreference chains illustrated in Table 2.
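Given such chains, per-token salience statistics (the maximum count of a token within a single chain, and the number of distinct chains containing it) can be computed directly. A minimal sketch, where the chains are hypothetical token lists standing in for the resolver's output:

```python
from collections import Counter

def salience_features(token, chains):
    """Two salience statistics for a token over the coreference chains of
    one abstract: the maximum number of times the token occurs within a
    single chain, and the number of distinct chains it occurs in."""
    per_chain = [Counter(chain)[token] for chain in chains]
    max_counts = max(per_chain, default=0)
    num_chains = sum(1 for count in per_chain if count > 0)
    return max_counts, num_chains

# Hypothetical chains mirroring the Mandometer example in Table 2:
chains = [
    ["mandometer", "mandometer", "mandometer"],
    ["mandometer", "group"],
    ["intervention", "intervention"],
]
```

With these chains, `salience_features("mandometer", chains)` yields (3, 2), while `salience_features("intervention", chains)` yields (2, 1).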
Note that the token intervention, which is not part of an arm, appears at most two times within a single coreference chain, whereas Mandometer, part of the experimental arm, appears three times. Further, intervention is found in only one chain, whereas Mandometer appears in two chains. More generally, we hypothesize that a token forming part of an arm is more salient in two ways: (i) an arm token appears more often within a single coreference chain, and (ii) an arm token appears more frequently across different chains (within the same abstract). These observations motivate the coreference features presented in section 4.3. In Table 2, standard care is not a member of any chain. More generally, we can expect salience to help more with intervention arms than with control arms.

Related work

Automated Identification of Arms
Previous work has identified PICO elements either at the word or sentence level. Most research has extracted information from medical abstracts, although some studies have used the full text of the articles (De Bruijn et al., 2008; Zhao et al., 2012; Wallace et al., 2016). One of the seminal studies in PICO extraction (Demner-Fushman and Lin, 2007) collapsed intervention and comparator, where interventions were short noun phrases identified largely through recognition of semantic types (mapped to UMLS concepts) and a few manually constructed rules. The intervention/comparator extractor returned a list of all the interventions under study, and the extractor was evaluated at the sentence level. However, it is important to distinguish between experimental and control treatments, as the bias for the experimental group must be accounted for in the data synthesis step (Lumley, 2002).
Beyond PICO, De Bruijn et al. (2008) extracted data from full-text articles based on the CONSORT Plus Guideline, a list of required, recommended and optional items to include in a systematic review, compiled by medical experts. The study found that one of the most difficult items to identify was the experimental treatment, which varied widely beyond just drug names. Elsewhere, Chung (2009) identified interventions as a coordinating structure in a single sentence, and found the major weakness in this approach to be parsing errors when identifying the boundaries of the conjuncts. Summerscales et al. (2011) focused on the downstream task of calculating the absolute risk reduction (ARR), identifying the number of bad outcomes for the control and experimental treatment groups, along with the sizes of both groups. This study found outcomes the hardest to detect because of their variability, but also had overall poor recall, partly because coreference was not taken into account.
Most recently, Trenta et al. (2015) proposed a novel approach for identifying the arms and PICO elements that does not rely on a first stage of sentence classification, but instead classifies each token directly, followed by an inference process that constrains the labels to more accurate results. As with previous studies, outcome results were the hardest because they are more variable. A significant limitation of this study is that the abstracts were limited to two-arm trials and to a specific domain.

Automated Coreference Resolution
Coreference resolution is a long-studied task that remains a challenging problem. Most recent work on coreference resolution builds mainly on one of four models.
• The first and most widely-used approach is the mention-pair model (Soon et al., 2001; Ng and Cardie, 2002b). A classifier first identifies all the pairs of mentions that are coreferent. These pairs are then grouped into coreference chains by clustering techniques such as closest-first (Soon et al., 2001) or best-first (Ng and Cardie, 2002b; Ng and Cardie, 2002a).
In closest-first clustering, a mention is linked to the closest preceding coreferent candidate, whereas in best-first clustering, it is linked to the most probable candidate. Common features in these models include the distance between the two mentions, syntactic features (e.g., POS tags), semantic features (e.g., named entity type), lexical features (e.g., the head word of the mention), and string matching.
• The mention-ranking model (Denis and Baldridge, 2008), reframes the task as a ranking function rather than a classification function, ranking all the candidate antecedents of a mention to determine which candidate antecedent is the most probable.
• The entity-centric model makes use of entity-level information, focusing on features of mention clusters, and not just pairs (Raghunathan et al., 2010). The coreference clusters are built up incrementally, using information from partially-completed coreference chains to guide later decisions. Features include whether a mention head word matches any of the head words in the antecedent cluster.
• The antecedent tree model (Yu and Joachims, 2009) builds a graph from a document, where the nodes are the mentions and arcs are the links between mention pairs that are coreferent candidates. The coreference chains are then modeled as latent trees in the graph.
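The difference between closest-first and best-first clustering in the mention-pair model can be illustrated with a small sketch. The `score` function here is a hypothetical stand-in for a trained mention-pair classifier returning scores in [0, 1]:

```python
def link_antecedents(num_mentions, score, strategy="closest"):
    """Attach each mention to at most one preceding antecedent, using
    pairwise coreference scores in [0, 1] with 0.5 as the threshold.
    'closest' picks the nearest coreferent candidate; 'best' picks the
    highest-scoring one."""
    links = {}
    for j in range(1, num_mentions):
        coreferent = [(i, score(i, j)) for i in range(j) if score(i, j) > 0.5]
        if not coreferent:
            continue  # this mention starts a new chain
        if strategy == "closest":
            links[j] = max(i for i, _ in coreferent)
        else:  # best-first
            links[j] = max(coreferent, key=lambda pair: pair[1])[0]
    return links

# Toy pairwise scores over three mentions: mention 2 is strongly
# coreferent with mention 0 and weakly (but above threshold) with 1.
toy_scores = {(0, 1): 0.2, (0, 2): 0.9, (1, 2): 0.6}
score = lambda i, j: toy_scores[(i, j)]
```

With these scores, closest-first links mention 2 to mention 1 (the nearest candidate above threshold), while best-first links it to mention 0 (the highest-scoring candidate).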
Constraints are imposed on these models for improved results, such as enforcing transitive closure to guarantee legal assignments (Finkel and Manning, 2008). For example, a pairwise classifier might link John Smith with Smith and Smith with Jane Smith, while judging John Smith and Jane Smith not coreferent; enforcing transitivity rules out such inconsistent assignments. Other work has shown that joint models improve performance. Denis et al. (2007) recognized that anaphoricity determination (whether an entity is the first mention) and coreference should be treated as a joint task, since one informs the other. Durrett and Klein (2014) model coreference together with named entity recognition and linking named entities to Wikipedia entities. Combinations of these models have also yielded improved results, such as Clark and Manning (2015) stacking mention-pair and entity-centric systems (which the current paper uses as its off-the-shelf coreference resolver). Many coreference resolvers exploit deeper linguistic knowledge, beyond the features mentioned above. Chowdhury and Zweigenbaum (2013) eliminated less-informative training instances prior to model training by creating a list of criteria based on semantic and syntactic intuitions, such as a mismatch in semantic types. Peng et al. (2015) created predicate schemas to constrain inference, such as two predicates with a semantically shared argument. Yang et al. (2015) used semantic role labeling to link the times and locations of event mentions, and for verbal mentions they linked their participants. More recently, Kilicoglu et al. (2016) focused on sortal anaphora, which they found to commonly occur in the biomedical literature, resolving anaphors that carry a specific semantic type, or sort, such as these drugs. Many of these studies take advantage of linguistic resources such as WordNet and FrameNet.
In the medical area, coreference resolution has been most closely studied for analyzing clinical narrative text, such as that found in Electronic Health Records (EHRs), and biomolecular studies. In fact, corpora (the i2b2/VA Corpus (Uzuner et al., 2012), the GENIA Event Corpus (Kim et al., 2008)) and shared tasks (the SemEval-2015 shared task on Analysis of Clinical Text (Task 14) (Elhadad et al., 2015), the BioNLP09 shared task (Kim et al., 2009), the ShARe/CLEF eHealth 2013 Evaluation Lab Task 1 (Pradhan et al., 2013)) have been created specifically to advance this area. Given that resources such as FrameNet and WordNet are based mostly on news (e.g., the British National Corpus, U.S. newswire), a large number of resources have been created to aid in natural language processing of medical texts. By far the largest and most complex is the Unified Medical Language System (UMLS), consisting of three main components: the Metathesaurus, with terms and codes from many vocabularies (including CPT, ICD-10-CM, MeSH, RxNorm, and SNOMED CT); the Semantic Network, with semantic types and semantic relations; and the SPECIALIST Lexicon, which contains syntactic, morphological and orthographic information on terms, along with NLP tools such as a POS tagger and word sense disambiguator. Other tools include MetaMap, a tool for recognizing UMLS concepts; DrugBank, a database of drug names; BANNER, a named entity recognizer for biomedical texts; BioText, for identifying entities and relations in bioscience texts; and BioFrameNet, an extension of FrameNet for molecular biology (BioWordNet (Poprat et al., 2008) was a failed attempt at similarly extending WordNet to the biomolecular field). However, when applied to clinical trial texts, these tools prove useful mainly for identifying medical terms and drug names, and thus more linguistically-motivated resources are still lacking for clinical trial texts.
In the area of clinical narratives, Raghavan et al. (2012) took advantage of the temporal features present in these texts to help determine whether two medical concepts corefer. Their 2014 paper (Raghavan et al., 2014) expanded on this idea to identify medical events spanning across narratives, such as admission notes, medical reports, and discharge notes. Yoshikawa et al. (2011) exploited coreference information for extracting event-argument relations from biomedical texts in the GENIA Event Corpus. Jindal and Roth (2013) used very specific domain knowledge to resolve coreference in clinical narratives, such as creating a specific discourse model (i.e., a single patient, several doctors and a few family members) to resolve entities of type "person". Despite the active interest in coreference resolution, there has been much less research investigating its application to clinical trial texts. Most of the literature that does exist is applied to the biomedical field, focusing more on full-text articles (Gasperin and Briscoe, 2008; Huang et al., 2010; Kilicoglu et al., 2016) than on abstracts (Castano et al., 2002; Yang et al., 2004). To the best of the authors' knowledge, there have been no papers using coreference features to identify arms in clinical trial abstracts.

Experiment
The goal of this experiment is to explore empirically whether incorporating coreference features improves the performance of a classifier for arm identification, compared to a baseline model without coreference features (note that we do not necessarily aim to achieve state-of-the-art results on this task). The task of the classifier is to label a token as either part of an arm or not.

The corpus
The corpus consists of 263 abstracts from the British Medical Journal (BMJ), annotated with the experimental and control groups (and other PICO elements) by Summerscales (2013). The BMJ requires structured abstracts, though the number of sections varies; some abstracts contain only a few sections, such as BACKGROUND, METHODS, FINDINGS and INTERPRETATION. These structured abstracts usually consist of short phrases and incomplete sentences.

Experimental setup
Sentences were tokenized and lower-cased, and stop words were removed. Each token was paired with its abstract to form an [abstract, token] pair, uniquely associating the token with the medical abstract where it appeared (e.g., [abstract 3, "intervention"], [abstract 129, "intervention"]). A binary classifier was implemented to label each token as belonging to an arm or not (the scikit-learn implementation of a Support Vector Machine; Pedregosa et al. (2011)). Due to the imbalance of classes (9% positive), the class weights in the model were adjusted to be inversely proportional to the class frequencies in the corpus. We performed five-fold cross-validation.
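A minimal sketch of this setup is below. The feature matrix and labels are random stand-in data (not the paper's corpus), and `LinearSVC` is used here as an illustrative SVM implementation; `class_weight="balanced"` reweights classes inversely to their frequencies, as described above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-in for the real per-token feature matrix (columns such as
# b-o-w, drugbank, tf-idf); labels mirror the ~9% positive imbalance.
X = rng.random((400, 3))
y = (rng.random(400) < 0.09).astype(int)

# class_weight='balanced' adjusts weights inversely proportional to
# class frequencies, as in the experimental setup.
clf = LinearSVC(class_weight="balanced")

# Five-fold cross-validation, scored with F1 on the positive class.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
```

With real features, `scores` would hold one F1 value per fold, which can then be averaged and compared across models.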

Features
The following features, summarized in Table 5, were used in the machine learning algorithm.

bag-of-words: The number of times the token occurs within its medical abstract (i.e., the count of [abstract, token] pairs for the given token and abstract). As evident in Table 5, abstracts can be quite repetitive in their vocabulary, but on average a token appears only a couple of times within the same abstract.

drugbank: Whether the token exists in the DrugBank database, version 4.3. Clinical trials often compare the efficacy of different drugs, such that intervention arms would contain drug names. However, note from Table 5 that most words are not drugs, keeping in mind that interventions also consist of therapies, behavior changes and other non-drug-related treatments.

tf-idf: Term frequency-inverse document frequency for term t in document d in corpus D:

tf-idf(t, d, D) = tf(t, d) × idf(t, D), where idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| ) + 1    (1)

One is added in equation (1) so that terms with zero idf (those that occur in all documents of a training set) are not entirely ignored. The goal of this metric is to capture how informative a word is.

coreference: The Coreference Resolution annotator packaged in Stanford CoreNLP 3.0 (a model that stacks mention-pair and entity-centric systems) is used to calculate the maximum number of times the token occurs in a single coreference chain within the same medical abstract (max counts) and the number of chains the token appears in within the same medical abstract (num chains). This tool was chosen because it is publicly available and yields state-of-the-art results on the 2012 CoNLL data set. The coreference features aim to capture the discourse salience of arms in medical abstracts. As mentioned before, the (max counts, num chains) values for mandometer are (3, 2), but for intervention they are (2, 1). Note from Table 5 that although a token can occur very frequently in a single chain (max counts) and across many chains (num chains), a token on average is not part of a chain at all. This observed statistic lends weight to the use of coreference features as a measure of salience. Previous work has employed other features, such as dependency trees and other predicate-argument structures, to capture this discourse salience. Summerscales (2013) implemented a form of post-hoc coreference resolution as a way to cluster labeled words into groups, for example into a control group versus an intervention group.
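A sketch of the tf-idf weighting, assuming raw counts for the term frequency and the natural logarithm (the toy corpus is invented for illustration):

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf with one added to the idf, so a term occurring in every
    document of the corpus gets weight tf * 1 instead of zero."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * (math.log(len(corpus) / df) + 1)

# Toy corpus of tokenized abstracts; "group" occurs in every document,
# so its idf is log(1) + 1 = 1 and its tf-idf reduces to its tf.
corpus = [["mandometer", "group", "care"], ["group", "care"], ["group"]]
```

For example, `tf_idf("group", corpus[0], corpus)` is 1.0, while the rarer `"mandometer"` receives the larger weight log(3) + 1.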
However, the present study uses the coreference features at the front end to detect the mentions, and is not presently concerned with differentiating among the different arms.

Table 4 summarizes the evaluation scores. The results of the classifier are evaluated against the spans of text that were annotated as arms, following Summerscales (2013). Because an arm consists of several contiguous words (e.g., mandometer group), we want to ensure the classifier is able to correctly label the more informative words in that span (mandometer vs. group). A labeled group of words is considered a match for an annotated group if they consist of the same set of words, ignoring had, group(s), and arm. For example, a labeled span of mandometer for the annotated span mandometer group is a true positive. On the other hand, a labeled span of only group is a false positive. Although the scores are relatively low for both models, we emphasize that the goal of this experiment is not to achieve state-of-the-art results but to investigate the viability of salience for arm identification. Further, our evaluation is stricter than in prior work (e.g., Summerscales (2013)).
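The matching criterion can be sketched as a small function; the ignored-word list follows the description above:

```python
IGNORED = {"had", "group", "groups", "arm"}

def is_match(labeled_span, annotated_span):
    """A labeled span matches an annotated arm span if, after dropping
    the ignored words, both reduce to the same non-empty word set."""
    labeled = {w for w in labeled_span if w not in IGNORED}
    annotated = {w for w in annotated_span if w not in IGNORED}
    return bool(labeled) and labeled == annotated
```

Thus `is_match(["mandometer"], ["mandometer", "group"])` is a true positive, while a labeled span of only `["group"]` fails to match and is counted as a false positive.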

Baseline
The baseline model includes the features for how many times a token appears in a single abstract (b-o-w), whether the token exists in DrugBank (drugbank), and the term frequency-inverse document frequency measure for the token (tf-idf).

With Coreference
The coref model additionally includes the maximum number of times the token appears in a single coreference chain for a given abstract (max counts), and the number of coreference chains the token appears in for a given abstract (num chains).

Error Analysis
The coref model performed better than the baseline model on most metrics: precision improved by 6.8 points and F1 by 9.3 points, although adding the coreference features lowers recall by 5.9 points. These improvements are consistent across all the cross-validation runs, as illustrated in Figure 1. To understand the results in more detail, we compare the confusion matrices of the two models. The raw counts in Figure 2 illustrate the class imbalance of the data, giving the impression that a false positive is more likely than a false negative. The normalized confusion matrices in Figure 3, however, show that false negatives make up a higher percentage of the errors than false positives, so the positive class is the harder one to label.
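The effect of normalization under class imbalance can be illustrated with invented counts (these numbers are made up for illustration and are not the paper's results):

```python
import numpy as np

# Rows are true classes (negative, positive); columns are predictions.
# With heavy imbalance, the raw false-positive count (600) exceeds the
# raw false-negative count (400)...
cm = np.array([[9000, 600],
               [400, 500]])

# ...but normalizing each row by the true-class size shows the positive
# class is misclassified at a much higher rate.
cm_norm = cm / cm.sum(axis=1, keepdims=True)
```

Here the raw counts suggest false positives dominate, while the row-normalized rates reveal that a positive token is far more likely to be mislabeled than a negative one.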
Given that false negatives are the most common errors across both models, we analyze their occurrences first. The control arm is the most susceptible to this type of error, as it is not as salient in the discourse as the experimental arms. Control words are typically drawn from a small, finite vocabulary (e.g., control, placebo, sham, standard), so their tf-idf scores are usually low. The false negative rate worsens in the coref model partly because it places more weight on discourse salience, and control arms are often not part of a coreference chain, compared with experimental arms. We refer back to the abstract presented in Table 1. A small ablation study showed that the b-o-w feature alone is able to correctly label standard (count=4) as part of an arm. With the coreference features, the word is no longer labeled as an arm, as it does not appear in any coreference chain.
Next, we analyze the false positives across both models. Given that all the features (except drugbank) in both models are aimed at extracting salient words, they also pick out other relevant PICO information. For example, both models incorrectly label knee as part of an arm in the following abstract, where each of these mentions is, in fact, annotated as part of an outcome: "...reduce the incidence of knee and ankle injuries in young people participating in sports. The rate of acute injuries to the knee or ankle. A structured programme of warm-up exercises can prevent knee and ankle injuries..."
Another issue with false positives is that the gold data is not comprehensively annotated. Note that in Table 2, the annotator failed to label the third occurrence of mandometer as an arm, although both models classify it as such. However, striving for an exhaustively annotated data set is not realistic, and so the models should be robust to these gaps and inconsistencies. The false positive rate improves in the coref model partly because the coreference features prove to be a better measure of discourse salience for the intervention arms. As noted earlier, repetition in medical abstracts is not limited to the words describing the arm. For example, in the abstract from Table 1, the baseline model incorrectly labels the high-frequency tokens eating, months and mean as parts of an arm. The coref model correctly labels these as negative, given that they do not occur in a coreference chain.
Finally, we note that the coreference features help in grouping together words with conflicting tf-idf measures. In the abstract from Table 1, the baseline model correctly labels mandometer (tf-idf=26.3), but misses group (tf-idf=4.2). However, the coref model correctly labels the entire span mandometer group as an arm, because both of these tokens appear together in a mention and have the same coreference features.

Conclusion
We introduced a new approach to identify the arms in a clinical trial abstract by creating coreference features aimed at capturing the discourse salience of arms. The coreference features were shown to help in classifying a word as part of an arm, confirming the intuition that mentions of arms throughout the abstract often corefer. However, we note this pattern holds more for the experimental arms than for the control arms. The error analysis also revealed that arms are not the only concepts that are coreferent: other PICO elements, such as the outcome, often have the same features. This observation could motivate a model that jointly labels these PICO elements along with the arms, since one would inform the other. There are several other recurring linguistic patterns yet to be explored that could further aid in arm identification, such as apposition ("A computerised device, Mandometer, providing real time feedback...") and paraphrasing ("...half were produced automatically with a larger volume of material... The larger booklets produced automatically were...").
Another avenue of research is to investigate how these linguistic features pattern across abstracts in the same review. For example, finding the paraphrases across all abstracts that study the same treatment (as defined in a systematic review) could yield finer-grained information on the language used to describe that intervention. To compensate for the inconsistent and small number of annotations, label propagation might be used to retrieve clusters of relations and find the structure in the data.
As noted earlier, the present study focused on the effect of salience on arm identification. In a future study, we plan to implement Summerscales (2013) as a strong baseline (which achieved an F-score of 0.69) to understand whether coreference can still yield improved results when compared to a model that nears state-of-the-art performance.