CRAFT Shared Tasks 2019 Overview — Integrated Structure, Semantics, and Coreference

As part of the BioNLP Open Shared Tasks 2019, the CRAFT Shared Tasks 2019 provides a platform to gauge the state of the art for three fundamental language processing tasks — dependency parse construction, coreference resolution, and ontology concept identification — over full-text biomedical articles. The structural annotation task requires the automatic generation of dependency parses for each sentence of an article given only the article text. The coreference resolution task focuses on linking coreferring base noun phrase mentions into chains using the symmetrical and transitive identity relation. The ontology concept annotation task involves the identification of concept mentions within text using the classes of ten distinct ontologies in the biomedical domain, both unmodified and augmented with extension classes. This paper provides an overview of each task, including descriptions of the data provided to participants and the evaluation metrics used, and discusses participant results relative to baseline performances for each of the three tasks.


Introduction
With its multiple layers of annotation, the Colorado Richly Annotated Full Text (CRAFT) corpus provides a unique foundation for integrating natural language processing (NLP) tasks involving structure, semantics, and coreference. As part of the BioNLP Open Shared Tasks 2019, the CRAFT corpus was used for the evaluation of three fundamental NLP tasks: dependency parse construction, coreference resolution, and ontology concept annotation. Each of these tasks is a foundational element of many NLP systems, and errors in their performance can propagate downstream and directly affect overall system accuracy. Dependency parses have been successfully employed for information extraction, e.g. from clinical records (Gupta et al., 2018), for relation extraction, e.g. identifying protein post-translational modifications (Sun et al., 2017), and as features for machine learning tasks, e.g. gene mention detection (Smith and Wilbur, 2009), among other uses. By linking noun phrases to a referent entity, coreference systems serve as annotation multipliers, amplifying the results of entity recognition systems (Cohen et al., 2017), and have been shown to improve information extraction in biomedical text (Choi et al., 2016). The concept annotation task, also known as named entity recognition (NER), is a prerequisite for many biomedical NLP applications. Its importance is buttressed by the many previous shared tasks that have included aspects of NER (Hirschman et al., 2005; Smith et al., 2008; Krallinger et al., 2013). Measuring the state of the art of these foundational tasks will inform the BioNLP community by resetting the performance benchmarks and demonstrating optimal methodologies.
The CRAFT Shared Tasks (CRAFT-ST) 2019 mark the inaugural use and subsequent release of thirty articles annotated in CRAFT that had previously been held in reserve. All 97 articles and accompanying annotations of the CRAFT corpus are now available in the public domain. To augment the results of the CRAFT-ST 2019, and to account for the relatively low participation rate, baseline systems for each task were evaluated in the same manner as the participant systems. The CRAFT-ST 2019 made use of the CRAFT v3.1.3 release 1 . Original task descriptions are available on the CRAFT-ST website 2 . An integrated scoring platform capable of supporting the evaluation of all three subtasks of the CRAFT-ST 2019 is also available as a standalone system 3 , and as a pre-built Docker container 4 .

The CRAFT Structural Annotation Task
For the structural annotation task (CRAFT-SA), participants were asked to automatically parse full-length biomedical journal articles of the CRAFT corpus into dependency structures for each sentence. The CRAFT-SA task targets dependency parses as opposed to constituency parses in order to emphasize differences that directly affect the meaning of a parsed sentence; differences in constituency parse conventions can result in parse differences that do not affect the resultant meaning of a parsed sentence (Clegg and Shepherd, 2007). There have been previous shared tasks in the general-domain NLP community to evaluate dependency parse construction using both the CoNLL-X (Buchholz and Marsi, 2006) and CoNLL-U (Zeman et al., 2018) file formats. Although the dependency parses initially distributed with the CRAFT corpus more closely resemble the older CoNLL-X format, the CRAFT dependency data was transformed into a quasi-CoNLL-U format so that the input provided to participants could be only the text of the documents. This makes for a more realistic scenario than the CoNLL-X shared tasks, which required participants to match the gold standard tokenization for evaluation purposes.

Data preparation -CoNLL-X
The dependency parses distributed as part of the CRAFT corpus are automatically derived (Choi and Palmer, 2012) from the manually annotated Penn Treebank style data, which identifies the syntactic structure of each sentence. During the course of data preparation and testing, several updates were made to the Treebank data. The constituency parses for two sentences that were missing from the Treebank data were added. Also, in cases where the automatically derived dependency parse contained multiple ROOT nodes, the corresponding syntactic parse was edited, usually by dividing into multiple sentences, to ensure each dependency parse contained only a single ROOT node. Once the errors were fixed and the CoNLL-X formatted data was finalized, the data was transformed into a quasi-CoNLL-U form.

Data preparation -CoNLL-U
The CoNLL-U format 5 is a revised version of the CoNLL-X format that adds a number of features such as universal part-of-speech tags, language-specific part-of-speech tags, and a standardized multi-language dependency format. It includes representations of the original raw text in addition to its segmented and tokenized form. This is required for training systems that address sentence boundary detection and tokenization as part of extracting syntactic dependencies from raw text.
The CoNLL-X representation of the CRAFT dependency parses was converted into CoNLL-U format using scripts that 1) introduce document, paragraph, and sentence boundary markers and include the original untokenized text of each sentence, 2) supplement the Penn Treebank part-of-speech tags with their corresponding universal tags following the mapping proposed by the Universal Dependencies (UD) project 6 , and 3) introduce morphological features based on the same part-of-speech migration guide. Spacing and paragraph information is added to the CRAFT CoNLL-U files by aligning the CoNLL-X files with the raw text for each article.
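As a rough sketch of step 2 above, the part-of-speech supplementation amounts to a table lookup. The mapping entries shown are a small subset of the UD project's Penn Treebank mapping, and the helper name and column handling are illustrative, not the conversion scripts actually used:

```python
# Sketch of mapping Penn Treebank POS tags (CoNLL-X POSTAG / CoNLL-U
# XPOS column) to universal POS tags (CoNLL-U UPOS column). Only a
# handful of the ~45 PTB tags are shown; a full converter covers all.
PTB_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET", "IN": "ADP",
    "CC": "CCONJ", "CD": "NUM", "PRP": "PRON", ".": "PUNCT",
}

def add_upos(row):
    """Given a 10-column CoNLL-style token row, fill the UPOS column
    (index 3 in CoNLL-U) from the PTB tag in the XPOS column (index 4)."""
    row = list(row)
    row[3] = PTB_TO_UPOS.get(row[4], "X")  # "X" for unmapped tags
    return row
```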
We note that while the resulting data is in the CoNLL-U format and includes UD part-of-speech tags and features, it retains the Stanford Dependency structure and labels from the CoNLL-X files and thus, does not fully conform to the UD representation in terms of its content.

Scoring
Scoring of the CRAFT-SA task made use of the scoring software provided for the CoNLL 2018 Shared Task (Zeman et al., 2018). Dependency parse performance is measured using three metrics, LAS, MLAS, and BLEX. We provide brief definitions of these metrics in the following and refer to Zeman et al. (2018) for details.

LAS
The Labeled Attachment Score (LAS) metric is the de facto standard for evaluating dependency parsing performance, and is commonly defined simply as the fraction of tokens for which the predicted head and dependency relation type (label) match the gold standard, i.e. #correct/#tokens. In the CoNLL 2018 setting applied in the CRAFT-SA task, this definition is generalized to account for cases where the predicted tokenization does not fully match the gold standard tokenization: LAS is defined over aligned predicted tokens (pred-tokens) and gold standard tokens (gold-tokens) as the harmonic mean (F1-score) of the precision #correct/#pred-tokens and the recall #correct/#gold-tokens.
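A minimal sketch of this generalized LAS computation, assuming the token alignment step (needed when tokenizations differ) has already been performed:

```python
def las_f1(aligned_pairs, n_pred_tokens, n_gold_tokens):
    """Generalized LAS as an F1-score over aligned tokens.

    aligned_pairs: list of (pred, gold) tuples for aligned tokens, where
    each side is a (head, deprel) pair. A token counts as correct when
    its predicted head and dependency label both match the gold values.
    """
    correct = sum(1 for pred, gold in aligned_pairs if pred == gold)
    precision = correct / n_pred_tokens if n_pred_tokens else 0.0
    recall = correct / n_gold_tokens if n_gold_tokens else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

When predicted and gold tokenizations are identical, precision and recall coincide and the score reduces to the classic #correct/#tokens definition.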

MLAS
The Morphology-aware Labeled Attachment Score (MLAS) is a modification of LAS that focuses on content words (ignoring, e.g., punctuation and determiners) while also taking into account the part of speech, aspects of morphology, and associated function words. For a predicted token to be considered correct according to the MLAS criteria, it must match the gold standard values for the head and dependency label (as in LAS), and also the universal POS tag, selected morphological features (e.g. Case, Number, and Tense), and function words attached with particular dependency relations (e.g. aux and case). Similarly to LAS, MLAS is defined for system-predicted tokenization in terms of precision, recall, and F1-score.

BLEX
Like MLAS, the Bilexical Dependency Score (BLEX) is a modification of LAS that focuses on content words, emphasizing lemmas instead of morphology. A predicted token is correct according to BLEX criteria if it matches the head, dependency relation, and lemma of the corresponding gold token. BLEX accounts for differences between the predicted and gold tokenization similarly to LAS and MLAS.

Baseline system
SyntaxNet (Andor et al., 2016), a transition-based neural network framework built using TensorFlow, was used as the baseline system for the structural annotation task. The system was composed of two models of similar architecture: a part-of-speech (POS) tagger and a dependency parser. The Python NLTK punkt tokenizer (Bird et al., 2009) was used to segment the articles into sentences, which were used as input to the POS tagger model to generate POS annotations. The dependency parser model uses the POS annotations as input and generates dependency parses for each sentence. Each of the models was trained using the CRAFT training data as a gold standard.

Results
Two teams submitted five runs in total for the CRAFT-SA task (Table 1). Team T013 used the spaCy dependency parser with (Run1) and without (Run2) the OGER NER system to test whether adding semantic information in the form of named entities can improve the resultant dependency parses. In this evaluation, the incorporation of an NER system caused a drop in performance; however, this decrease is confounded by tokenization differences resulting from their system's grouping of entities as single tokens. Using a neural approach and custom biomedical word embeddings, Team T014 demonstrated state-of-the-art performance in dependency parsing over biomedical text, achieving high marks for all submitted runs. Both submitted systems outperformed the baseline by a large margin.

The CRAFT Coreference Resolution Task
Coreference resolution, linking strings of text that have the same referent, is a challenging NLP task that offers potential benefit to downstream tasks if done successfully. The challenge arises in linking strings of text over long distances across a document, or possibly between documents. The benefit of doing so can be substantial, as coreference resolution has the ability to amplify the results of upstream tasks such as concept recognition, thereby potentially improving the performance of downstream tasks, e.g. information extraction, that require explicitly represented entities. It has been estimated that successful coreference resolution would inherently add over 106,000 additional concept annotations to the CRAFT corpus through referent linkages (Cohen et al., 2017). Coreference resolution is an active area in the NLP research community, and the most relevant previous shared task on coreference resolution is the CoNLL-2012 Shared Task (Pradhan et al., 2012), which evaluated identity chains curated in the OntoNotes project (Hovy et al., 2006). The OntoNotes corpus consists of text from conversational speech, broadcast conversations, broadcast news, magazine articles, newswire, and web data in three languages (English, Arabic, and Chinese), covering 1M words per language. The CRAFT corpus presents some unique challenges to the coreference resolution task. While smaller than the OntoNotes corpus in word count, CRAFT's 620k words are still substantial, and scientific text is a domain not explicitly covered in OntoNotes. Further, CRAFT's median token count per sentence (24.0) equals the highest among the OntoNotes genres (newswire), and its median sentence count per document (318) is second only to that of broadcast conversations (565). The combination of longer sentences and more sentences per document increases the potential distances between coreference mentions within the sentences themselves and within each document.
Adding further complexity to the task is CRAFT's use of discontinuous mentions, i.e. coreference mentions that have intervening text (see the example of a discontinuous mention in Figure 1). Discontinuous mentions comprise 5.7% of all identity chain mentions in the CRAFT corpus. As far as the task organizers are aware, this is the first coreference resolution task to allow discontinuous mentions.

Data
Annotation of the identity chains in the CRAFT corpus is described in Cohen et al. (2017). For the purposes of the CRAFT-CR task, the strings of text (referred to as mentions below) that are linked to form coreference chains must exist in the same document, but can be located any distance from one another. Identity chains link mentions of the same referent, and can span the entire document. Apposition relations link adjacent noun phrases that have the same referent and are not linked by a copula. The CRAFT-CR task focuses on reproducing the manually curated identity chains.

Data preparation
During the course of data preparation for the CRAFT-CR task, some errors in the coreference annotations were discovered, and subsequently fixed. The most common error involved two identity chains sharing a single base noun phrase mention. Each shared mention was manually reviewed, and the two identity chains were merged in cases where the chains were deemed to be about the same referent. In cases where the presence of a shared mention in one chain was clearly an error, it was removed and the identity chains remained distinct. The CRAFT-CR training and test data are summarized in Table 2.

Data format
The CRAFT-CR task makes use of the CoNLL-2011/2012 data format for representing identity chains 7 , with a modification to enable representation of discontinuous mentions. Discontinuous mentions are denoted by the addition of one or more non-digit characters after the integer chain identifier, as depicted in Figure 1.
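A small parser for this modified coreference column might look as follows. The suffix convention shown (e.g. "(12a" opening part "a" of chain 12) is our reading of the format description, and the function names are illustrative:

```python
import re

# One "|"-separated piece of a coreference cell: optional "(",
# integer chain id, optional non-digit discontinuity suffix,
# optional ")". "-" denotes an empty cell, as in CoNLL-2011/2012.
CELL = re.compile(r"(\()?(\d+)([^\d()|]*)(\))?")

def parse_cell(cell):
    """Return a list of (chain_id, part_suffix, opens, closes) tuples
    for one token's coreference column value."""
    if cell == "-":
        return []
    events = []
    for piece in cell.split("|"):
        m = CELL.fullmatch(piece)
        if not m:
            raise ValueError(f"unparseable coref field: {piece!r}")
        opens, chain, suffix, closes = m.groups()
        events.append((int(chain), suffix or "", bool(opens), bool(closes)))
    return events
```

A downstream scorer would then merge parts sharing the same chain id and suffix into a single discontinuous mention.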

Scoring
There is a wide range of coreference resolution scoring metrics available. For historical purposes, the five reference metrics (MUC, B3, CEAFm, CEAFe, and BLANC) are reported, and the more recent LEA metric of Moosavi and Strube (2016) is also used to measure coreference system performance. The LEA metric was designed specifically to address the shortcomings of the previously used metrics.
By taking into account all coreference links and evaluating resolved coreference relations instead of resolved mentions, the LEA metric accurately assesses recall and precision. The coreference scoring implementations were modified in two ways for the CRAFT-CR task. First, because the CRAFT-CR data allows for mentions with discontinuous spans, the implementations were augmented to take as input the modified CoNLL-Coref 2011/2012 file format. Second, the implementations were updated to allow overlapping mentions to match instead of enforcing strict mention boundary matching. This option was added to allow for a slightly more flexible, permissive evaluation. The augmented implementations of all metrics used in the CRAFT-CR task have been made publicly available 8 .
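For illustration, the core of the link-based LEA computation can be sketched as follows, following Moosavi and Strube (2016) but with a simplified treatment of singleton chains (which the LEA paper handles via self-links):

```python
def lea(key, response):
    """LEA precision, recall, F1 over coreference chains.

    key, response: lists of sets, each set holding the mentions of one
    chain. Each chain is weighted by its size; its resolution score is
    the fraction of its coreference links reproduced on the other side.
    Singleton handling here is a simplification, not the exact metric.
    """
    def links(n):
        return n * (n - 1) / 2

    def score(gold, pred):
        num = den = 0.0
        for chain in gold:
            size = len(chain)
            den += size
            if size == 1:
                # simplified: a singleton is resolved iff its mention
                # appears in some chain on the other side
                num += 1.0 if any(chain <= other for other in pred) else 0.0
            else:
                resolved = sum(links(len(chain & other)) for other in pred)
                num += size * resolved / links(size)
        return num / den if den else 0.0

    r = score(key, response)
    p = score(response, key)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```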

Baseline system
For comparison purposes, we evaluated the Berkeley coreference resolution system (Durrett and Klein, 2013) using the CRAFT-CR task test data. The Berkeley system is an English coreference system predicated on learning from simple but numerous lexicalized features. This baseline evaluation made use of the built-in preprocessing machinery for sentence splitting, tokenization, and parsing, and the pre-trained CoNLL 2012 model. Prior to evaluation, results from the Berkeley system were post-processed to adjust for some system idiosyncrasies, e.g. replacing "-LRB-" in the 'word' column with the "(" or "[" found in the actual text, and then the coreference information was mapped onto the gold standard tokenization provided with the test data.

Results
One team submitted three runs for evaluation in the CRAFT-CR task (Table 3). They augmented the state-of-the-art end-to-end neural coreference resolution system of Lee et al. (2017) by incorporating extra syntactic features, including grammatical number agreement between mentions, as well as semantic features obtained using MetaMap to identify entity mentions. They also investigated the use of PubMed word vectors (Chiu et al., 2016) (Run1) and SciBERT word vectors (Beltagy et al., 2019) (Run2, Run3) as inputs to their model. As implemented, the system of Team T010 performed admirably compared to the baseline. F-scores are in line with those of some previous coreference systems used on CRAFT (Cohen et al., 2017), emphasizing the challenge of coreference resolution in general, and of coreference resolution over biomedical text in particular. While the baseline system and Run1 of the participant system produced on average shorter chains than those in the evaluation set (p<0.01, Mann-Whitney U test), Run2 and Run3 of the participant system were both able to generate distributions of coreference chain lengths that were not significantly different from the evaluation set (Run2: p=0.94, Run3: p=0.79, Mann-Whitney U test), suggesting that inclusion of the SciBERT embeddings helps to achieve the proper chain length distribution.
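The chain-length comparison above uses the Mann-Whitney U test; in practice one would call scipy.stats.mannwhitneyu, but the statistic itself is simple enough to sketch with the standard library. The version below uses midranks for ties and a plain normal approximation (no tie correction on the variance), so its p-values are only approximate for heavily tied data such as integer chain lengths:

```python
import math

def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U1, p). Midranks are used for ties, but no tie correction
    is applied to the variance; use scipy.stats.mannwhitneyu for real
    analyses of discrete data like chain lengths.
    """
    pooled = sorted((v, 0 if i < len(xs) else 1)
                    for i, v in enumerate(list(xs) + list(ys)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid = (i + j + 1) / 2  # average of 1-based ranks i+1..j
        for k in range(i, j):
            ranks[k] = mid
        i = j
    r1 = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 0)
    n1, n2 = len(xs), len(ys)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u1, p
```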

The CRAFT concept annotation task
Concept annotation has been a mainstay in BioNLP shared tasks dating back to the very first BioCreative, which involved the detection of gene/protein mentions in abstracts and their subsequent normalization to gene identifiers from model organism databases (Hirschman et al., 2005). Detecting biomedical concepts is a foundational NLP task, and performance of this task impacts many potential downstream applications. Mapping textual mentions of ontology concepts presents its own set of challenges. Well-known among these are conceptual synonymy, by which a given represented concept may be indicated by multiple unique textual mentions, and textual polysemy, by which a given text string may refer to multiple represented concepts. Particularly prevalent in the biomedical literature are acronyms and other abbreviations of represented concepts. Additionally, some ontologies employ standard patterns for concept labels, but some of these may result in long, complex labels that are infrequently seen in the literature (Ogren et al., 2005; Funk et al., 2014).

Table 3: Results for the coreference resolution task. Runs achieving the highest coreference F-score are shown. The APM subscript indicates that partial mention matches were allowed. P_M: mention precision; R_M: mention recall; F_M: mention F-score; P_CR: coreference precision; R_CR: coreference recall; F_CR: coreference F-score.
The CRAFT corpus is uniquely positioned to gauge the state of the art in ontological concept recognition as it comprises over 159,000 concept annotations spanning ten ontologies from the Open Biomedical Ontologies (OBO) (Smith et al., 2007) collection. Participants in the CRAFT concept annotation (CRAFT-CA) task were provided the plain-text version of each article and a file containing each ontology in the OBO format 9 . The CRAFT-CA task was further subdivided into two subtasks. The first subtask involved recognition of concepts in the original OBO files. The second subtask involved the recognition of concepts in the original OBO files augmented with extension classes, which are classes created by CRAFT developers but defined in terms of proper OBO classes. These extension classes were created for various reasons 10 : Some were created to capture mentions of concepts different from, but corresponding to, concepts represented in the ontologies, e.g., functionally defined entities corresponding to represented molecular functionalities. Others are semantically broadened forms of the represented concepts, while others were created to unify classes from different ontologies that were semantically equivalent so that there would not be multiple concept annotations for the same text spans if disparate annotation sets are aggregated.

Data
Concept annotations in the CRAFT corpus span ten Open Biomedical Ontologies (Smith et al., 2007), including the Chemical Entities of Biomedical Interest (ChEBI) ontology (Degtyarenko et al., 2007), the Cell Ontology (CL) (Bard et al., 2005), the Biological Process (GO BP), Cellular Component (GO CC) and Molecular Function (GO MF) subontologies of the Gene Ontology (Ashburner et al., 2000), the Molecular Process Ontology (MOP) 11 , the NCBI Taxonomy (NCBITaxon) (Federhen, 2011), the Protein Ontology (PR) (Natale et al., 2010), the Sequence Ontology (SO) (Eilbeck et al., 2005), and the Uberon cross-species anatomy ontology (UBERON) (Mungall et al., 2012). Note that concept annotations in the CRAFT corpus are permitted to have discontinuous spans with intervening text; e.g., for the phrase somatic and germ cells, the combination of the two substrings somatic and cells is annotated with the concept for somatic cells (CL:0002371) even though somatic and cells are not adjacent to one another in the text. There are over 2,300 concept annotations with discontinuous spans in the CRAFT corpus. The ontologies provided for the CRAFT-CA task were the same versions used during the annotation of CRAFT. As with the other tasks, the data is divided into a training set consisting of 67 full-text articles from the PMC Open Access subset, and a test set of 30 full-text articles chosen using identical selection criteria. Concept annotation of the CRAFT articles is described in detail in Bada et al. (2012). Summary statistics showing total annotation counts for the ten ontologies used in the CRAFT corpus are shown in Table 4.

Data preparation
Some minor concept annotation errors were discovered and addressed during preparation for the CRAFT-CA task. These errors included an NCBITaxon concept that was found not to exist in the version of the NCBI Taxonomy used to annotate CRAFT, as well as some erroneous extension class prefixes used in the GO MF extended ontology file. Errors were addressed prior to the commencement of the shared tasks.

11 http://obofoundry.org/ontology/mop.html

Data format
The CRAFT corpus is distributed with a script that can convert its native annotation format to a variant of the BioNLP format 12 , which is used for both input and output for the CRAFT-CA task. This format captures span information, the concept identifier, and the covered text for each annotation (see Figure 2).
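To make the format concrete, a single annotation line can be parsed roughly as follows. The sample line, the use of ';' to separate discontinuous spans, and the column order are our illustrative reading of the BioNLP shared-task standoff style, not the exact CRAFT-CA specification:

```python
def parse_annotation(line):
    """Parse one standoff annotation line of the (assumed) form:
    'T1<TAB>CL:0002371 10 17;27 32<TAB>somatic cells'

    Returns (annotation id, concept id, list of (start, end) spans,
    covered text). Discontinuous spans are assumed to be ';'-separated.
    """
    ann_id, type_and_spans, covered_text = line.rstrip("\n").split("\t")
    concept_id, spans_str = type_and_spans.split(" ", 1)
    spans = [tuple(int(offset) for offset in span.split())
             for span in spans_str.split(";")]
    return ann_id, concept_id, spans, covered_text
```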

Scoring
The method of Bossy et al. (2013) was used to measure performance of the concept annotation systems with respect to the CRAFT corpus. This method employs a hybrid measure taking into account both the degree to which the predicted annotation boundaries match the reference and a similarity metric for scoring the concept match. The boundary match uses the modified Jaccard index scheme described in Bossy et al. (2012), which allows for flexible matching but prefers exact matches. The concept similarity metric of Wang et al. (2007) is used to score the predicted concepts. As suggested by Bossy et al. (2013), the weight factor, w, was set to 0.65, which ensures that ancestor/descendant predictions always have a greater value than sibling predictions, while root predictions never yield a similarity greater than 0.5. An implementation of the scoring algorithm has been made publicly available 13 .
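A compact sketch of the Wang et al. (2007) similarity over a toy is-a hierarchy illustrates the effect of w. The hierarchy and identifiers below are invented for illustration, and only a single relation type with a uniform weight is modeled (Wang et al. allow per-relation weights):

```python
def s_values(term, parents, w=0.65):
    """Semantic contribution (S-value) of each ancestor of `term`,
    following Wang et al. (2007): S(term) = 1, and each step up an
    is-a edge multiplies the contribution by the weight factor w
    (keeping the maximum over paths)."""
    sv = {term: 1.0}
    stack = [term]
    while stack:
        t = stack.pop()
        for p in parents.get(t, ()):
            v = w * sv[t]
            if v > sv.get(p, 0.0):
                sv[p] = v
                stack.append(p)
    return sv

def wang_similarity(a, b, parents, w=0.65):
    """Similarity = summed S-values of shared ancestors (from both
    sides), normalized by the two terms' total semantic values."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    shared = set(sa) & set(sb)
    return (sum(sa[t] + sb[t] for t in shared)
            / (sum(sa.values()) + sum(sb.values())))

# Toy hierarchy (invented ids): CELL is-a ROOT; SOMATIC and GERM are
# sibling classes under CELL.
PARENTS = {"CELL": ["ROOT"], "SOMATIC": ["CELL"], "GERM": ["CELL"]}
```

With w = 0.65, predicting the direct ancestor CELL for a gold SOMATIC annotation scores higher than predicting the sibling GERM, consistent with the property described above.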

Baseline system
We evaluated a baseline system on the CRAFT-CA data to use as a comparison for the participant-submitted runs. The baseline system is a two-stage machine learning system proposed in Hailu (2019) and trained only on the CRAFT corpus. The first stage makes use of NERSuite (Cho et al., 2010) to detect concept mention spans using a conditional random field (CRF) model. The CRF model was trained as described in Okazaki (2007), and uses as features words, parts of speech, and constituency parse information within a window of three tokens upstream and downstream of each concept mention. The second stage links each textual mention identified by the CRF to an ontology identifier using a stacked Bi-LSTM approach implemented by the OpenNMT system (Klein et al., 2018). By modeling concept normalization as sequence-to-sequence translation at the character level, the baseline system maps characters in the text spans identified in the first stage to characters in ontology identifiers to normalize concepts.

12 http://2013.bionlp-st.org/file-formats
13 https://github.com/UCDenver-ccp/craft-shared-tasks; doi:10.5281/zenodo.3460928

Results
One team submitted three runs to the CRAFT-CA task (Table 5). They used variants of two systems: one a modified ontology-specific BioBERT 14 model, with (Run3) and without (Run1) input from the OGER NER system (Furrer et al., 2019), with weights pretrained on PubMed using identifiers from the ontologies as the tag set; and the other a BiLSTM with ontology pretraining (Run2). With regard to overall system performances, marked improvement in recognition of concepts from ChEBI, GO BP, GO MF, and SO was observed compared to past evaluations using the CRAFT public dataset (Funk et al., 2014). However, it is important to note that past evaluations were performed on CRAFT v1/2 concept annotations, whereas the testing of this shared task was performed on v3 concept annotations, which constitute a major update of the concept annotations relative to those of v1/2 (including the first usage of extension classes); we therefore do not believe it is safe to directly compare evaluations performed on these substantially different versions of the concept annotations. The BioBERT approach augmented with the OGER NER system (Run3) generally outperforms the other approaches when normalizing to proper OBO concepts, whereas the Bi-LSTM approach is generally better when the extension classes are used. Neither the baseline system nor any of the submitted runs identified annotations with discontinuous spans. Though annotations with discontinuous spans make up only a small percentage (1.46%) of the overall annotations, their exclusion from system output could represent potential low-hanging fruit for improving overall system performance. Protein Ontology concept recognition remains a target for future work, as system performances did not surpass an F-score of 0.55.
Inclusion of the extension classes generally resulted in improvement of performance compared to runs using only the proper ontology concepts, possibly attributable to the labels and synonyms that were provided for the extension classes. One exception is GO MF EXT, where performance is expected to suffer with inclusion of the extension class annotations, as the proper ontology class count was limited to a very small subset of the original ontology. Overall, however, the CRAFT-CA task submissions demonstrated state-of-the-art performance for ontological concept recognition in biomedical text.

14 https://github.com/dmis-lab/biobert

Conclusion
The CRAFT-ST 2019 provides a platform to gauge performance on three fundamental NLP tasks (automated dependency parse construction, coreference resolution, and ontology concept annotation) against a high-quality, manually annotated corpus of full-text biomedical articles. Submitted runs from participating systems demonstrate promising results, particularly with respect to automated dependency parse construction and some aspects of ontological concept annotation. Clear needs for improved extraction of Protein Ontology concepts remain, while the neural approaches used have addressed long-standing deficiencies in the recognition of biological process concepts in text. Coreference resolution system performances highlight the existing challenges of coreference resolution in general, and of coreference resolution over biomedical text in particular.
The approaches taken by participants in the CRAFT-ST 2019 mirror the current themes in AI and NLP. Neural approaches are, unsurprisingly, the preferred methodology for addressing these NLP tasks. The CRAFT-ST 2019 has provided new benchmarks for these fundamental NLP tasks, setting the stage for the next evolution of system development.