mgsohrab at WNUT 2020 Shared Task-1: Neural Exhaustive Approach for Entity and Relation Recognition Over Wet Lab Protocols

We present a neural exhaustive approach that addresses named entity recognition (NER) and relation recognition (RE) for the entity and relation recognition over wet-lab protocols shared task. To address NER, we introduce a BERT-based neural exhaustive approach that enumerates all possible spans as potential entity mentions and classifies them into entity types or non-entity with deep neural networks. To solve the relation extraction task, based on the NER predictions or given gold mentions, we create all possible trigger-argument pairs and classify them into relation types or no relation. On the NER task, we achieved an F-score of 76.60%, ranking third among the participating systems. On the relation extraction task, we achieved an F-score of 80.46% as the top system. Besides, we compare our model on the wet lab protocols corpus (WLPC) with the WLPC baseline and dynamic graph-based information extraction (DyGIE) systems.


Introduction
The entity and relation recognition over wet-lab protocols shared task (Tabassum et al., 2020) is an open challenge that allows participants to use any methodology and knowledge sources for wet lab protocols, which specify the steps in performing a lab procedure. The task comprises two sub-tasks in the wet lab protocols domain: named entity recognition (NER) and relation recognition or extraction (RE). In NER, the task is to detect mentions and classify them into entity types or non-entity. NER has drawn considerable attention as the first step towards many natural language processing (NLP) applications, including relation extraction (Miwa and Bansal, 2016), event extraction (Feng et al., 2016), and co-reference resolution (Fragkou, 2017). In contrast, relation extraction (RE) is the task of identifying relation types between known or predicted entity mentions in a sentence.
In this paper, we present a BERT-based neural exhaustive approach that addresses both the NER and RE tasks. We employ a neural exhaustive model (Sohrab and Miwa, 2018; Sohrab et al., 2019b) for NER and extend the model to address the RE task. The model detects flat and nested entities by reasoning over all the spans within a specified maximum span length. Unlike existing models that rely on token-level labels, our model directly employs an entity type as the label of a span. The spans, with their representations, are classified into entity types or non-entity. With the mentions predicted by the NER module, we then feed the detected or known mentions to the RE layer, which enumerates all trigger-argument pairs, as trigger-trigger or trigger-entity pairs, and assigns a role type or no role type to each pair.
The best run for each sub-task achieved an F-score of 76.60% on the entity recognition task, ranking third, and an F-score of 80.46% on the relation extraction task as the top system. Besides, we also compare our model with state-of-the-art models over the wet lab protocols corpus (WLPC): the WLPC baseline, which uses LSTM-CRF and maximum-entropy-based approaches to address the NER and RE tasks, respectively, and the dynamic graph-based information extraction (DyGIE) system. Our model outperforms the WLPC baseline by 4.81% for NER and 7.79% for RE, and the DyGIE system by 3.61% for NER.

Related Work
Most NER work focuses on flat entities. Lample et al. (2016) proposed an LSTM-CRF (conditional random fields) model, which has been widely used and extended for flat NER, e.g., Akbik et al. (2018). In recent studies of neural-network-based flat NER, Gungor et al. (2018, 2019) have shown that morphological analysis using additional word representations based on linguistic properties of the words, especially for morphologically rich languages such as Turkish and Finnish, further improves NER performance compared with using only representations based on the surface forms of words. Recently, nested NER has attracted wide interest in NLP. Zhou et al. (2004) detected nested entities in a bottom-up way: they detected the innermost flat entities and then found other NEs containing the flat entities as sub-strings using rules on the detected entities. The authors reported an improvement of around 3% in F-score under certain conditions on the GENIA data set (Collier et al., 1999). Recent studies show that conditional random fields (CRFs) can produce significantly higher tagging accuracy in flat or nested (stacking flat NER to nested representation) NER (Son and Minh, 2017). Ju et al. (2018) proposed a novel neural model to address nested entities by dynamically stacking flat NER layers until no outer entities are extracted; a cascaded CRF layer is used after the LSTM output in each flat layer. The authors reported that the model outperforms state-of-the-art results, achieving 74.5% in F-score on the GENIA data set. Sohrab and Miwa (2018) proposed a neural model that detects nested entities using an exhaustive approach, outperforming the state-of-the-art results in terms of F-score on the GENIA data set. Sohrab et al. (2019b) further extended the span representations for entity recognition and addressed sensitive span detection in the MEDDOCAN (MEDical DOCument ANonymization) shared task (http://temu.bsc.es/meddocan/), where the system achieved 93.12% and 93.52% in terms of F-score for NER and sensitive span detection, respectively.
Recent successes in neural networks have shown impressive performance on coupling information extraction (IE) tasks, as in joint modeling of entities and relations (Miwa and Bansal, 2016). Yi et al. (2019) proposed a dynamic graph-based information extraction (DyGIE) system for coupling multiple IE tasks, a multi-task learning approach to entity, relation, and coreference extraction. DyGIE uses dynamic graph propagation to explicitly incorporate rich contextual information into the span representations, and the system achieved significant F1-score improvements on different datasets. Kulkarni et al. (2018) established a baseline for IE on the wet lab protocols corpus (WLPC). They employ an LSTM-CRF approach for entity recognition. For relation extraction, they assume the presence of gold entities and train a maximum-entropy classifier using features from the labeled entities.

Neural Exhaustive Approach for NER and Relation Extraction
Our BERT-based neural exhaustive approach is built as a pipeline of two modules:
• Named entity recognition that uses a contextual neural exhaustive approach
• Relation extraction that aims to predict relations from detected/given mentions
To solve the entity and relation recognition tasks, the pipeline can be presented as three layers: a BERT layer, an entity recognition layer, and a relation recognition layer. Figure 1 shows the system architecture for entity and relation recognition.

BERT Layer
For a given sequence, the BERT layer receives sub-word sequences and assigns contextual representations to the sub-words via BERT. We assume each sentence $S$ has $n$ words and the $i$-th word, represented by $S_i$, is split into sub-words. This layer assigns a vector $v_{i,j}$ to the $j$-th sub-word of the $i$-th word. It also produces the representation $v_S$ as a local context for the sentence $S$, which corresponds to the embedding of the [CLS] token.
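For illustration, the following is a minimal sketch of this layer, assuming the HuggingFace transformers library; the model identifier and all variable names are our assumptions, not the authors' released code:

```python
# Hedged sketch of the BERT layer: contextual sub-word vectors, the
# [CLS] sentence context v_S, and first-sub-word word embeddings v_i.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
bert = AutoModel.from_pretrained(MODEL)

sentence = ["Centrifuge", "the", "sample", "for", "5", "minutes"]
enc = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")
out = bert(**enc)

subword_vecs = out.last_hidden_state[0]  # (num_subwords, hidden): v_{i,j}
v_S = subword_vecs[0]                    # [CLS] embedding: sentence context

# Pick the first sub-word of each word as the word embedding v_i.
word_ids = enc.word_ids(0)               # sub-word position -> word index
first_pos = {}
for pos, wid in enumerate(word_ids):
    if wid is not None and wid not in first_pos:
        first_pos[wid] = pos
word_vecs = torch.stack([subword_vecs[first_pos[i]]
                         for i in range(len(sentence))])
```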

Entity Recognition layer
We build a mention detection layer, a.k.a. named entity recognition (NER), on top of the BERT layer. This layer assigns entity or trigger types to overlapping text spans, or word sequences, in a sentence. We first generate mention candidates based on the same idea as the span-based model (Lee et al., 2017; Sohrab and Miwa, 2018; Sohrab et al., 2019a), in which all continuous word sequences are generated up to a maximum span length $L_x$. Since the BERT layer works only on sub-words, we choose the embedding of the first sub-word $v_{i,1}$ as the word embedding $v_i$ of the $i$-th word. The representation $x_{b,e} \in \mathbb{R}^{d_x}$ for the span from the $b$-th word to the $e$-th word in a sentence is calculated from the embeddings of the first word, the last word, and the weighted average of all words in the span as follows:

$$x_{b,e} = \left[ v_b;\ \sum_{i=b}^{e} \alpha_{b,e,i}\, v_i;\ v_e \right], \tag{1}$$

where $\alpha_{b,e,i}$ denotes the attention value of the $i$-th word in a span from the $b$-th word to the $e$-th word, and $[\,;\,;\,]$ denotes concatenation.

[Figure 1: System architecture for the neural exhaustive approach for NER and relation extraction. The example sequence is taken from the Wet Lab Protocols data set.]
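The exhaustive enumeration and the representation of Equation 1 can be sketched in PyTorch as follows; this is our minimal rendering under stated assumptions (a learned linear attention scorer), not the authors' implementation:

```python
# Hedged sketch of exhaustive span enumeration with the span
# representation x_{b,e} = [v_b ; weighted average ; v_e] of Eq. (1).
import torch
import torch.nn as nn

class SpanEnumerator(nn.Module):
    def __init__(self, hidden: int, max_span_len: int = 10):
        super().__init__()
        self.max_span_len = max_span_len
        self.attn = nn.Linear(hidden, 1)  # word-level attention scorer

    def forward(self, word_vecs: torch.Tensor):
        """word_vecs: (n_words, hidden) -> (spans, representations)."""
        n = word_vecs.size(0)
        scores = self.attn(word_vecs).squeeze(-1)              # (n_words,)
        spans, reps = [], []
        for b in range(n):
            for e in range(b, min(b + self.max_span_len, n)):
                alpha = torch.softmax(scores[b:e + 1], dim=0)  # alpha_{b,e,i}
                avg = (alpha.unsqueeze(-1) * word_vecs[b:e + 1]).sum(dim=0)
                reps.append(torch.cat([word_vecs[b], avg, word_vecs[e]]))
                spans.append((b, e))
        return spans, torch.stack(reps)
```

Each resulting representation is then classified with a softmax layer over the entity/trigger types plus non-entity.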

Relation Recognition Layer
The relation recognition layer enumerates all trigger-argument pairs (trigger-trigger and trigger-entity pairs) given the triggers and entities detected by the entity recognition layer and assigns a role type or no role to each pair. We generate relation representations based on the same idea as the deep event extraction system (Trieu et al., 2020). Since each role is constructed by a trigger and an argument, we first compute representations of all triggers and arguments detected by the entity recognition layer. The representations of a trigger and an argument are calculated in the same way. A trigger $t$, ranging from the starting $t_s$-th word to the ending $t_e$-th word, is represented with the concatenation of its span representation $x_t$ (from Equation 1) and a 300-dimensional entity type embedding $s_t$, as follows:

$$v_t = [x_t; s_t].$$

Similarly, the representation of an argument $a$ can be calculated as

$$v_a = [x_a; s_a].$$

The representation $r_i \in \mathbb{R}^{d_r}$ for a relation pair $i$ is then calculated from its trigger representation $v_t$, argument representation $v_a$, and the context representation $v_S$, which is obtained from the sentence representation of the BERT layer:

$$r_i = \mathrm{GELU}\left(W_r [v_t; v_a; v_S] + b_r\right),$$

where $W_r$ and $b_r$ are learnable weights and biases, respectively, and GELU is the Gaussian Error Linear Unit activation function. After obtaining the pair representation $r_i$, we classify it with a softmax function to predict the corresponding role type.
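A minimal sketch of this pair scorer, assuming illustrative dimensions and a role inventory (not the authors' code), is:

```python
# Hedged sketch of the relation layer: v_t = [x_t; s_t], v_a = [x_a; s_a],
# r_i = GELU(W_r [v_t; v_a; v_S] + b_r), followed by a softmax over roles.
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    def __init__(self, span_dim: int, cls_dim: int, n_entity_types: int,
                 n_roles: int, type_dim: int = 300, pair_dim: int = 512):
        super().__init__()
        self.type_emb = nn.Embedding(n_entity_types, type_dim)
        in_dim = 2 * (span_dim + type_dim) + cls_dim
        self.W_r = nn.Linear(in_dim, pair_dim)       # weights W_r, bias b_r
        self.out = nn.Linear(pair_dim, n_roles + 1)  # role types + "no role"

    def forward(self, x_t, t_type, x_a, a_type, v_S):
        v_t = torch.cat([x_t, self.type_emb(t_type)])   # trigger rep
        v_a = torch.cat([x_a, self.type_emb(a_type)])   # argument rep
        r_i = torch.nn.functional.gelu(
            self.W_r(torch.cat([v_t, v_a, v_S])))
        return torch.softmax(self.out(r_i), dim=-1)     # role distribution
```

Here `t_type` and `a_type` are integer indices of the predicted entity/trigger types (e.g., `torch.tensor(2)`), and `v_S` is the [CLS] embedding from the BERT layer.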

Experimental Settings
We provide empirical evidence on the effectiveness of the pipeline architecture for both NER and relation extraction over the wet lab protocols task of W-NUT 2020. The wet lab protocols corpus, with eighteen entity types and fifteen relation types, is randomly split into four subsets: train, development, test, and test release (unlabeled) sets, which contain 370, 122, 123, and 111 lab protocols, respectively. In our experiments, we merge the train and development sets as the training set, use the test set as the development set, and predict annotations for the test release set, which is used as our test set. Our model is implemented in the PyTorch framework. We employed the official wet lab protocols evaluation scripts for NER and relation extraction to evaluate our system's performance on both tasks.

Data Preprocessing
Each text and the corresponding annotation file were preprocessed by several simple rules only for tokenization. After tokenization, each text, with its mapped annotation file, was directly passed to the deep neural approach for mention detection and relation extraction. Note that the offsets were restored to the original offsets in evaluation.
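As an illustration of offset preservation, the sketch below (our assumption, not the actual rule set) records each token's character offsets so that predictions can be mapped back to the original text:

```python
# Hedged sketch: regex tokenization that records character offsets,
# so system output can be restored to the original offsets for evaluation.
import re

def tokenize_with_offsets(text: str):
    """Return (token, start, end) triples; the regex rule is illustrative."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

protocol = "Centrifuge the sample for 5 min."
for tok, start, end in tokenize_with_offsets(protocol):
    assert protocol[start:end] == tok  # offsets recover the original span
```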

Training Settings
We train the model in a pipeline manner based on a pre-trained BERT model; we employed the pre-trained PubmedBERT (Gu et al., 2020). Based on our investigation, we chose 10 as the maximum span length for mention candidates. We also truncate every sentence at 256 sub-words without losing any gold entities or relations (we maintain 100% recall of gold entities and relations in the training set).
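The coverage claim can be verified with a check like the following sketch; the data structures are our illustrative assumptions:

```python
# Hedged sketch: confirm that a maximum span length of 10 and a
# 256-sub-word truncation lose no gold annotations in the training set.
MAX_SPAN_LEN = 10
MAX_SUBWORDS = 256

def full_gold_recall(sentences) -> bool:
    """sentences: dicts with word-level and sub-word-level gold entity
    spans (inclusive); both keys are illustrative assumptions."""
    for sent in sentences:
        for b, e in sent["entity_word_spans"]:
            if e - b + 1 > MAX_SPAN_LEN:
                return False  # span enumeration would miss this mention
        for _, e_sub in sent["entity_subword_spans"]:
            if e_sub >= MAX_SUBWORDS:
                return False  # truncation would cut this mention
    return True
```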

Results and Discussions
In order to evaluate the performance of NER, we conduct experiments on different sets of BERT-based learning representations: PubmedBERT trained on the merged training and dev sets (PubmedBERT-Merge), PubmedBERT trained on the training set only (PubmedBERT-Train), SciBERT trained on the merged training and dev sets (SciBERT-Merge), and SciBERT trained on the training set only (SciBERT-Train).
Based on our preliminary NER results with PubmedBERT and SciBERT, where PubmedBERT outperformed SciBERT, we conduct all our relation extraction experiments using PubmedBERT. For the relation extraction task, we train our model under two data scenarios. First, we perform a clustering approach on the training and dev sets to find similar or duplicate text files in the wet-lab data set. We found that many similar text files with inconsistent annotations exist in the training and dev sets. The similarity approach, with a set threshold, is applied to the training and dev sets to cluster similar or duplicate text protocols. We then eliminate those texts and their corresponding annotation files from the training set to avoid confusing the model and to prevent data leakage. We also applied the predefined relation rules (Kulkarni et al., 2018) to filter out any invalid relations appearing in the system output. We conduct experiments on different sets of PubmedBERT-based learning representations: PubmedBERT fine-tuned with the filtering approach (PubmedBERT-Finetune-Filter), PubmedBERT with fine-tuning only (PubmedBERT-Finetune), PubmedBERT with the filtering approach only (PubmedBERT-Filter), and PubmedBERT without fine-tuning or filtering (PubmedBERT).
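The duplicate filtering could look like the following sketch, where TF-IDF cosine similarity and the 0.9 threshold are our illustrative assumptions rather than the authors' actual setting:

```python
# Hedged sketch: drop training protocols that are near-duplicates of
# dev protocols to avoid inconsistent annotations and data leakage.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_duplicates(train_texts, dev_texts, threshold=0.9):
    vec = TfidfVectorizer().fit(train_texts + dev_texts)
    sim = cosine_similarity(vec.transform(train_texts),
                            vec.transform(dev_texts))
    # Keep a training protocol only if no dev protocol is too similar.
    return [t for i, t in enumerate(train_texts)
            if sim[i].max() < threshold]
```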
In the second data scenario, we train our model keeping the original training, development, and test sets. Based on this original data setting, the PubmedBERT-based learning representations are PubmedBERT-Original-Finetune-Filter, PubmedBERT-Original-Finetune, PubmedBERT-Original-Filter, and PubmedBERT-Original.
We also report the results of ensemble learning that combines the predictions using different span representations to reduce the variance of predictions and the generalization error. Table 1 shows the NER results, where the ensemble approach achieves 82.77% in terms of F-score. Table 2 shows the NER task results of the participating teams, listed in descending order of exact-match F-score. The top system achieves 77.99%, while our team achieves 76.60% in terms of F-score for the NER task. Table 3 shows the results of the relation extraction task on the dev- and test-sets; all the learning approaches reported in this table are used for the ensemble approach. The table shows that the ensemble approach using maximum voting over all the approaches is also effective in improving relation extraction performance, achieving 87.53% and 80.46% in terms of F-score on the dev- and test-set, respectively. In contrast, PubmedBERT-Original-Finetune-Filter and PubmedBERT-Finetune-Filter show the best individual performance on relation extraction, achieving 87.10% and 80.09% in terms of F-score on the dev- and test-set, respectively. Table 4 shows the relation extraction task results of the participating teams. Our relation extraction system achieves 80.46% in terms of F-score as the top system in this task, outperforming the second-best system by 20.89% in terms of F-score.
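A minimal sketch of the maximum-voting ensemble (our reading, with illustrative data structures) is:

```python
# Hedged sketch: majority voting over per-model span predictions;
# each span keeps the label assigned by the most model runs.
from collections import Counter

def majority_vote(predictions):
    """predictions: one dict per model, mapping (start, end) -> label."""
    all_spans = set().union(*(set(p) for p in predictions))
    final = {}
    for span in all_spans:
        label, _ = Counter(p.get(span, "O")
                           for p in predictions).most_common(1)[0]
        if label != "O":  # keep only spans most models agree on
            final[span] = label
    return final

runs = [{(0, 1): "Action"}, {(0, 1): "Action"}, {(0, 1): "Reagent"}]
assert majority_vote(runs) == {(0, 1): "Action"}
```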

Relation Extraction Performances
Since our system is based on span-based representations, we also investigate performance for all vs. single-token vs. multi-token entities. Table 5 shows the breakdown of our model's performance at different entity levels for NER and relation extraction on the test set.
We also compare our model with the state-of-the-art models over the wet lab protocols corpus (WLPC). Table 6 shows the comparison of our model with the WLPC baseline and DyGIE systems. In NER, our model outperforms the WLPC baseline and DyGIE systems by 4.81% and 3.61%, respectively, in terms of F-score. In RE, our model outperforms the WLPC baseline by 7.79% in terms of F-score. For the RE task, DyGIE takes NER predictions as input, whereas gold entity boundaries are given as input to our model and the WLPC baseline; we therefore report the DyGIE RE performance without comparing it with ours. In these comparisons, we use the same train, dev, and test sets and the same evaluation script as reported in the WLPC baseline (Kulkarni et al., 2018) for fair comparison.

Ablation Study
We show the performance of different BERT-based learning models for the NER and relation extraction tasks in the development columns of Table 1 and Table 3, to compare the possible scenarios of the given solutions and to report the best system submissions for NER and relation extraction. For both tasks, the development columns of Table 1 and Table 3 show that the results of the different approaches are close to each other. Table 7 shows the categorical performance of the NER ensemble on the dev-set. In this table, we also break down the number of predicted and correct mentions among the gold annotations of each category: prediction denotes the number of predicted entities, annotation the number of gold entities of each category, and correct the number of true positives, where the model correctly predicts the category. It can be observed that for the frequent classes (e.g., Action, Reagent, Amount), the model shows high performance because there is a reasonable number of training instances for these classes. For the rare classes (e.g., Size, Mention, pH, Numerical), the performance is also consistent. Table 8 shows the categorical performance of the relation extraction ensemble on the dev-set; it suggests that the model generalizes well across relation types, which led to the top system in the shared task.
Since gold entities are provided in the RE task, we also examine two different strategies for training the RE model, presented in Table 9.
The table shows that we can significantly boost RE performance simply by pre-finetuning the NER layer using gold entities. Table 10 shows the performance of NER and RE based on the original BERT-base compared with PubmedBERT. The results show that PubmedBERT outperforms BERT-base on both the NER and RE tasks. In Table 11, we compare our model with different span lengths.
We chose the maximum span size from 8, 10, and 12, which covers more than 99% of mentions, to judge the sensitivity of our approach to different span lengths. It can be observed that the performance of our model is consistent across these span lengths.

Conclusion
This paper presented a BERT-based neural exhaustive approach that addresses both named entity recognition (NER) and relation extraction (RE) tasks. For NER, the approach exhaustively considers all possible spans and is capable of detecting flat and nested entities from the generated mention candidates.
Several enhancements, namely PubmedBERT, SciBERT, BERT-base-uncased, filtering, clustering, and ensembling, were investigated on the wet-lab protocol data set to enhance system performance.
On the NER task, we achieved an F-score of 76.60%, ranking third among the participating systems. On the relation extraction task, we achieved an F-score of 80.46% as the top system. Moreover, our model outperforms the WLPC baseline by 4.81% for NER and 7.79% for RE, and the DyGIE system by 3.61% for NER.
In future work, we will implement joint modeling that addresses NER and relation extraction in an end-to-end manner.