WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols

This paper presents the results of the wet lab information extraction task at WNUT 2020. The task consisted of two subtasks: (1) a Named Entity Recognition (NER) task with 13 participating teams and (2) a Relation Extraction (RE) task with 2 participating teams. We outline the task and the data annotation process, report corpus statistics, and provide a high-level overview of the participating systems for each subtask.


Introduction
Wet lab protocols consist of natural language instructions for carrying out chemistry or biology experiments (for an example, see Figure 1). While there have been efforts to develop domain-specific formal languages to support robotic automation of experimental procedures (Bates et al., 2017), the vast majority of knowledge about how to carry out biological experiments or chemical synthesis procedures is documented only in natural language texts, including scientific papers, electronic lab notebooks, and so on.
Recent research has begun to apply human language technologies to extract structured representations of procedures from natural language protocols (Kuniyoshi et al., 2020; Vaucher et al., 2020; Kulkarni et al., 2018; Soldatova et al., 2014; Vasilev et al., 2011; Ananthanarayanan and Thies, 2010). Extraction of named entities and relations from these protocols is an important first step towards machine reading systems that can interpret the meaning of these noisy, human-generated instructions.
However, the performance of state-of-the-art tools for extracting named entities and relations from wet lab protocols still lags behind that on well-edited text genres (Jiang et al., 2020). This motivates the need for continued research, as well as for new datasets and tools adapted to this noisy text genre. In this overview paper, we describe the development and findings of a shared task on named entity and relation extraction from noisy wet lab protocols, which was held at the 6th Workshop on Noisy User-generated Text (WNUT 2020) and attracted 15 participating teams.
In the following sections, we describe the details of the task, including the training and development datasets as well as the newly annotated test data. We briefly summarize the systems developed by selected teams and conclude with the results.

Wet Lab Protocols
Wet lab protocols consist of guidelines for lab procedures that involve chemicals, drugs, or other materials in liquid solutions or volatile phases. Each protocol contains a sequence of steps that are followed to perform a desired task, and may also include general guidelines or warnings about the materials being used. The publicly available archive of protocols.io contains such guidelines for wet lab experiments, written by researchers and lab technicians around the world. The archive covers a large spectrum of experimental procedures, including neurology, epigenetics, metabolomics, stem cell biology, etc. Figure 1 shows a representative wet lab protocol.
The wet lab protocols, written by users from all over the world, contain domain-specific jargon as well as numerous nonstandard spellings, abbreviations, and unreliable capitalization.

             Train    Dev     Test-18  Test-20  Total
#protocols   370      122     123      111      726
#sentences   8444     2839    2813     3562     17658
#tokens      107038   36106   36597    51688    231429
#entities    48197    15972   16490    104654   185313
#relations   32158    10812   11242    70591    124803

Table 1: Statistics of the annotated wet lab protocol corpus.

Such a diverse and noisy style of user-created protocols poses crucial challenges for entity and relation extraction systems. Hence, off-the-shelf named entity recognition and relation extraction tools, tuned for well-edited texts, suffer a severe performance degradation when applied to noisy protocol texts (Kulkarni et al., 2018).
To address these challenges, there has been an increasing body of work on adapting entity and relation extraction tools to noisy wet lab texts (Jiang et al., 2020; Luan et al., 2019; Kulkarni et al., 2018). However, different research groups have used different evaluation setups (e.g., training/test splits), making it challenging to perform direct comparisons across systems. By organizing a shared evaluation, we hope to help establish a common evaluation methodology (for at least one dataset) and to promote research and development of NLP tools for user-generated wet lab text genres.

Annotated Corpus
Our annotated wet lab corpus includes 726 experimental protocols from the 8-year archive of protocols.io (April 2012 to March 2020). These protocols are manually annotated with 15 types of relations among 18 entity types. The fine-grained entity types can be broadly grouped into a small number of higher-level categories.

Train and Development data
The training and development data for our task were taken from the previously published wet lab corpus (Kulkarni et al., 2018), which consists of 623 protocols. We excluded the eight duplicate protocols from this dataset and then re-annotated the 615 unique protocols in BRAT (Stenetorp et al., 2012). This re-annotation process allowed us to add 20,613 previously missing entities and 10,824 previously missing relations, and to remove inconsistent annotations. The updated corpus statistics are provided in Table 1. This full dataset (Train, Dev, Test-18) was provided to the participants at the beginning of the task, and they were allowed to use any part of it to train their final models.

Test Data
For this shared task, we added 111 new protocols (Test-20), which were used to evaluate the submitted models. The Test-20 dataset consists of 100 randomly sampled general protocols and 11 manually selected COVID-related protocols from protocols.io (https://www.protocols.io/). These 111 protocols were double annotated by three annotators using a web-based annotation tool, BRAT (Stenetorp et al., 2012). Figure 1 presents a screenshot of our annotation interface. We also provided the annotators with a set of guidelines containing the entity and relation type definitions. The annotation task was split into multiple iterations; in each iteration, an annotator was given a set of 10 protocols. An adjudicator then went through all the entity and relation annotations in these protocols and resolved the disagreements. Before adjudication, the inter-annotator agreement was 0.75, measured by Cohen's Kappa (Cohen, 1960).
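The agreement statistic above can be reproduced with a few lines of code. The sketch below is a minimal stdlib-only implementation of Cohen's Kappa over two annotators' aligned label sequences; the entity-type labels in the usage comment are illustrative, not taken from any specific protocol.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators' aligned label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # observed agreement: fraction of positions where both annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement: expected overlap given each annotator's label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Example with hypothetical entity-type labels from two annotators:
a = ["Reagent", "Action", "Reagent", "Device"]
b = ["Reagent", "Action", "Amount", "Device"]
kappa = cohens_kappa(a, b)  # observed 0.75, chance 0.25 -> kappa 2/3
```

In practice the per-token labels would come from the BRAT standoff annotations before adjudication.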

Baseline Model
We provided the participants with a baseline model for each subtask. The baseline for the named entity recognition task was a feature-based CRF tagger developed using CRFsuite, with a standard set of contextual, lexical, and gazetteer features. The baseline relation extraction system employed a feature-based logistic regression model developed using scikit-learn, with the same standard set of contextual, lexical, and gazetteer features.
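To make "contextual, lexical, and gazetteer features" concrete, the sketch below shows a typical per-token feature function in the dictionary format that CRFsuite-style taggers consume. The specific feature names and window size are our illustrative choices, not the exact baseline configuration.

```python
def word2features(tokens, i, gazetteer=frozenset()):
    """Standard contextual, lexical, and gazetteer features for token i."""
    w = tokens[i]
    feats = {
        # lexical features of the current token
        "word.lower": w.lower(),
        "word.isupper": w.isupper(),
        "word.istitle": w.istitle(),
        "word.isdigit": w.isdigit(),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        # gazetteer lookup (e.g., a list of known reagent names)
        "in_gazetteer": w.lower() in gazetteer,
    }
    # contextual features from neighbouring tokens
    if i > 0:
        feats["-1:word.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["+1:word.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

# Example on a protocol-style sentence:
tokens = ["Add", "50", "ul", "of", "ethanol"]
feats = word2features(tokens, 1, gazetteer={"ethanol"})
```

Each sentence then becomes a list of such dictionaries, one per token, paired with its gold BIO tag sequence for CRF training.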

NER Systems
Thirteen teams (Table 3) participated in the named entity recognition subtask, taking a wide variety of approaches. Table 2 summarizes the word representations, features, and machine learning approaches used by each team. The majority of the teams (11 out of 13) utilized contextual word representations, and four teams combined contextual word representations with global word vectors. Only two teams did not use any type of word representation and relied entirely on hand-engineered features and a CRF tagger. The best performing teams utilized a combination of contextual word representations and ensemble learning. Below we provide a brief description of the approach taken by each team.
B-NLP (Lange et al., 2020) modeled NER as a parsing task and used a biaffine classifier. The second classifier in their system took the predictions from the first classifier and updated the labels of the predicted entities. Both classifiers utilized word2vec (Mikolov et al., 2013) and SciBERT (Beltagy et al., 2019) representations.

DSC-IITISM (Gupta et al., 2020) developed a BiLSTM-CRF model that utilized a concatenation of CamemBERT-base (Martin et al., 2020), Flair (PubMed) (Akbik et al., 2018), and GloVe (en) (Pennington et al., 2014) word representations.

RE Systems
Two teams (Table 3) participated in the relation extraction subtask. Both teams fine-tuned contextual word representations and did not use any hand-crafted features. Table 5 summarizes the word representations and the machine learning approaches followed by each team. Below we provide a brief description of the model developed by each team.
Big Green (Miller and Vosoughi, 2020) treated each protocol as a knowledge graph, in which relationships between entities are edges. They trained a BERT-based (Devlin et al., 2019) system to classify edge presence and type between two entities, given the entity text, label, and local context.

mgsohrab fine-tuned PubMedBERT (Gu et al., 2020) as input to a relation extraction model that enumerates all possible pairs of arguments using a deep exhaustive span representation approach.
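The pair-enumeration step common to both RE systems can be sketched as follows. This is our simplified illustration of generating candidate argument pairs from entity spans, not the teams' actual code; the optional distance cutoff is a typical pruning heuristic, not something the source describes.

```python
from itertools import permutations

def candidate_pairs(entities, max_distance=None):
    """Enumerate all ordered pairs of entity spans as relation candidates.

    entities: list of (start, end, type) token spans.
    max_distance: optional cutoff on the distance between span starts,
                  used only to prune implausibly distant pairs.
    """
    pairs = []
    for head, tail in permutations(entities, 2):
        if max_distance is not None and abs(head[0] - tail[0]) > max_distance:
            continue
        pairs.append((head, tail))
    return pairs

# Three entities yield 3 * 2 = 6 ordered candidate pairs:
spans = [(0, 1, "Action"), (2, 3, "Amount"), (4, 5, "Reagent")]
pairs = candidate_pairs(spans)
```

Each candidate pair would then be scored by the fine-tuned encoder, with a "no relation" label for pairs that are not connected.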

Evaluation
In this section, we present the performance of each participating system, along with a description of the errors made by each model type. Table 4: Results on extraction of the 18 named entity types from the Test-20 dataset. Exact Match reports the performance when the predicted entity type is the same as the gold entity type and the predicted entity boundary is exactly the same as the gold entity boundary. Partial Match reports the performance when the predicted entity type is the same as the gold entity type and the predicted entity boundary has some overlap with the gold entity boundary.

NER Error Analysis
An exact match requires that both the predicted entity type and boundary are exactly the same as the gold entity type and boundary, whereas a partial match requires that the predicted entity type is the same as the gold entity type and the predicted entity boundary has some overlap with the gold entity boundary. We observe that ensemble models with contextual word representations outperform all other approaches, achieving a 77.99 F1 score in exact match (Team: BITEM) and an 81.75 F1 score in partial match (Team: PublishInCovid19).
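The two matching criteria can be stated precisely in code. The sketch below is our minimal formalization of exact and partial match for a single predicted/gold entity pair, using end-exclusive token offsets (an assumption; the scorer's exact offset convention is not specified here).

```python
def match(pred, gold):
    """Classify a predicted entity against a gold entity.

    Each entity is (start, end, type) with an end-exclusive boundary.
    Returns (exact_match, partial_match).
    """
    same_type = pred[2] == gold[2]
    # exact: identical type and identical boundaries
    exact = same_type and pred[0] == gold[0] and pred[1] == gold[1]
    # partial: identical type and any boundary overlap
    overlap = max(pred[0], gold[0]) < min(pred[1], gold[1])
    return exact, same_type and overlap

# A shifted boundary of the right type counts as partial but not exact:
result = match((3, 6, "Reagent"), (4, 7, "Reagent"))  # -> (False, True)
```

Corpus-level precision, recall, and F1 then follow from counting these matches over all predicted and gold entities.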
In Figure 2, we present an error analysis of different NER systems.
Analysis of the errors in the predictions of these different NER models demonstrates that BERT-based models make fewer false positive and incorrect-type errors than traditional neural networks and feature-based models. We also observed that these BERT models suffer from more false negative errors than the other approaches.
To combine the advantages of these different approaches, we built a majority-voting ensemble classifier. Our ensemble NER tagger takes the predictions of all the submitted systems and assigns each word its most frequently predicted tag. This ensemble classifier performs better than all the single fine-tuned BERT models and outperforms the traditional neural and feature-based models, achieving 76.84 F1 (Table 4). However, our ensemble NER tagger performed 1.15 F1 below the neural ensemble models (Teams: BITEM, PublishInCovid19). We note that we did not have access to the participating models' predictions on the development and training sets; hence, it was not possible for us to tune our ensemble classifier on the entity recognition task.

Table 6 shows the comparison of precision (P), recall (R), and F1 score among the participating teams, evaluated on the Test-20 corpus. Both teams utilized the gold entities and predicted the relations among these entities by fine-tuning contextual word representations. We observed that fine-tuning the domain-related PubMedBERT provides significantly higher performance than fine-tuning the general-domain BERT. While examining the relation predictions from both systems, we found that the system with PubMedBERT fine-tuning (Team: mgsohrab) produced significantly fewer errors in every category (Figure 3). The error analysis over the different participants' predictions revealed that the general-domain BERT makes fewer false negative errors than the domain-related BERT. However, the domain-related PubMedBERT model makes significantly fewer false positive and incorrect-type errors than the general-domain BERT. To combine the advantages of these different approaches, we built an ensemble classifier from the predictions of the submitted systems, in which we assign each entity pair its most frequently predicted relation.
This ensemble classifier outperforms the winning system, achieving an 81.32 F1 score.
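The majority-voting scheme used for both ensembles can be sketched as follows. This is a minimal stdlib version of the per-token (or per-entity-pair) vote described above; tie-breaking by first-seen prediction is our simplifying assumption.

```python
from collections import Counter

def ensemble_vote(system_predictions):
    """Majority vote over aligned predictions from several systems.

    system_predictions: list of prediction sequences, one per system,
    aligned by position (tokens for NER, entity pairs for RE).
    Ties are broken in favour of the first system's prediction.
    """
    voted = []
    for position_preds in zip(*system_predictions):
        # Counter.most_common breaks ties by insertion order (Python 3.7+)
        voted.append(Counter(position_preds).most_common(1)[0][0])
    return voted

# Three NER systems voting on two tokens; two of three agree each time:
preds = [
    ["B-Action", "O"],
    ["B-Action", "B-Reagent"],
    ["O",        "B-Reagent"],
]
tags = ensemble_vote(preds)  # -> ["B-Action", "B-Reagent"]
```

For RE, the same function applies with one relation label per gold entity pair instead of one tag per token.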

Related Work
The task of information extraction from wet lab protocols is closely related to the event trigger extraction task. The event trigger task has been studied extensively, mostly using the ACE data (Doddington et al., 2004) and the BioNLP data (Nédellec et al., 2013). Broadly, there are two ways to classify event trigger detection models: (1) rule-based methods using pattern matching and regular expressions to identify triggers (Vlachos et al., 2009) and (2) machine learning based methods focusing on the generation of rich hand-crafted features for classification models such as SVMs or maxent classifiers. Kernel-based learning methods have also been utilized, with embedded features from syntactic and semantic contexts, to identify and extract biomedical event entities (Zhou et al., 2014). To counteract highly sparse representations, various neural models were proposed, utilizing dependency-based word embeddings with feed-forward neural networks (Wang et al., 2016b), CNNs (Wang et al., 2016a), and bidirectional RNNs (Rahul et al., 2017). Previous work has experimented with datasets of well-edited biomedical publications with a small number of entity types: for example, the JNLPBA corpus (Kim et al., 2004) with 5 entity types (CELL LINE, CELL TYPE, DNA, RNA, and PROTEIN) and the BC2GM corpus (Hirschman et al., 2005) with a single entity class for genes/proteins. In contrast, our dataset addresses the challenges of recognizing 18 fine-grained named entity types along with 15 types of relations in user-created wet lab protocols.

Summary
In this paper, we presented a shared task consisting of two subtasks: named entity recognition and relation extraction from wet lab protocols. We described the task setup and dataset details, and outlined the approaches taken by the participating systems. The shared task included a larger and improved dataset compared to the prior literature (Kulkarni et al., 2018). This improved dataset enables us to draw stronger conclusions about the true potential of different approaches. It also facilitates analyzing the results of the participating systems, which helps us suggest potential research directions for both future shared tasks and noisy text processing of user-generated lab protocols.