BiTeM at WNUT 2020 Shared Task-1: Named Entity Recognition over Wet Lab Protocols using an Ensemble of Contextual Language Models

Recent improvements in machine-reading technologies have attracted much attention to automation problems and their possibilities. In this context, WNUT 2020 introduces a Named Entity Recognition (NER) task based on wet-laboratory procedures. In this paper, we present a 3-step method based on deep neural language models that achieved the best overall exact-match F1-score (77.99%) of the competition. By fine-tuning 10 different pretrained language models 10 times each, this work shows the advantage of having more models in an ensemble based on a majority-of-votes strategy. On top of that, having 100 different models allowed us to analyse ensemble compositions, which demonstrated the impact of combining multiple pretrained models versus fine-tuning a single pretrained model multiple times.


Introduction
The last decades have seen both the amount and the complexity of biological experiments grow. Coupling this phenomenon with the improvement of machine-reading technologies seems to have led researchers to look for ways to automate wet-laboratory procedures. Such technologies should allow reproducibility while reducing human errors in the process. However, as current protocols are usually written in natural language, a collection of wet-laboratory protocols annotated with entities and relations would help assess current machine-reading performance in this specific setting (Kulkarni et al., 2018).
In this context, WNUT (Workshop on Noisy User-generated Text) 2020 (Tabassum et al., 2020) proposes two tasks: a Named Entity Recognition (NER) task and a Relation Extraction (RE) task. In this paper, we present the 3-step method we used for the NER task. Our approach is essentially based on deep neural language models supported by transformer-like architectures (Vaswani et al., 2017). First, we fine-tuned 10 different pretrained language models on the downstream task. Then, we generated 10 instances of each of those pretrained models, each time with a new random initialization of the last layer, namely the classifier. Finally, we used an ensemble strategy based on a majority of votes. Our approach achieves an exact-match F1-score of 77.99%, which ranks first in the shared task.

Related work
Deep learning approaches trained on large unstructured data have shown considerable success in NLP problems, including NER (Devlin et al., 2019; Liu et al., 2019; Lample et al., 2016; Beltagy et al., 2019; Jin et al., 2019). These models learn representations over the large data and reuse them in a supervised setting for a downstream task. For domain-specific tasks, models trained on large general text can be further trained on large domain-specific data and then adapted to a downstream task (Gururangan et al., 2020; Alsentzer et al., 2019), or models can be trained only on domain-specific data and then adapted to a specific task (Beltagy et al., 2019).

Data
The data provided for this task is a subset of Kulkarni et al.'s corpus (Kulkarni et al., 2018). The dataset consists of 615 unique protocols annotated with 17 entity types plus Action (an example is shown in Figure 1).
The organizers provided a set of protocols for training, development, and test. They further released a final set of unlabelled protocols for test during the competition (called test 2020). Table 1 shows how the dataset has been split into a training set, a development set, a test set, and the competition test set (test 2020).
In Table 2, we see the distribution of all the entities by subset. As we can see, there are 18 entity types, and only two of them (Action and Reagent) account for about 50% of the annotations. The table also shows that entity proportions are fairly similar across all subsets.

Method

Our method proceeds in 3 steps. First, we chose 10 different pretrained models and fine-tuned them on the downstream task. Then, using a voting strategy, we created ensemble models. Finally, we fine-tuned each model 9 more times, each time with a new random initialization of the fully connected layer, to see whether sampling ensemble models from this larger set would improve the results even more.

Transformers with a fully connected layer on top of the token representations
In order to use transformers as a NER model, the only preprocessing we had to do was to break each protocol into sentences. Those sentences then become the sequences that are fed into our model. As there were no overlapping entities in the text, we used a softmax function, which allowed us to classify each token into only one entity. As transformers usually rely on tokenizers that work on word pieces (or sub-tokens), we had to deal with this by assigning a dummy entity to each sub-token that was part of a word. In such cases, at training time, we only assign the true entity to the first sub-token. This allowed us to rebuild the original text quite easily. Indeed, during prediction, a word gets the most probable entity label among all the sub-tokens' predictions for that word. In other words, the most probable entity label is assigned to all the sub-tokens of the word, and the sub-tokens are merged to rebuild the original word with the assigned label. Finally, in a given sequence, if two adjacent words were given the same entity prediction, we considered the two words as a single passage related to that entity.

Using the above setup, we fine-tuned 10 pretrained transformers for 10 epochs using an Adam optimizer (Kingma and Ba, 2014), a learning rate of 3e-5, a batch size of 24 and a maximum sequence length of 256 tokens. We used 1x T4 GPU for all base models and 2x T4 GPUs for the large ones. For a given model, training took on average roughly 16 minutes per epoch, thus about 2.67 hours for the 10 epochs. After each epoch, we predicted the development set, computed the F1-score and saved the model if it improved on the previous epoch's score. Table 3 gives more information about all the pretrained models that we fine-tuned on the NER task. Indeed, 4 models out of 10 were trained on biomedical corpora, such as PubMed and/or BioMed, whereas the others were trained on general corpora, such as Wikipedia.
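The sub-token handling described above can be sketched as follows. This is a minimal illustration rather than our actual implementation: `tokenize`, the dummy label `"X"` and the probability dictionaries are hypothetical stand-ins.

```python
IGNORE = "X"  # hypothetical dummy label for non-initial sub-tokens

def align_labels(words, word_labels, tokenize):
    """Training side: each word's label goes to its first sub-token only;
    the remaining sub-tokens receive the dummy label."""
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        sub_tokens.extend(pieces)
        sub_labels.extend([label] + [IGNORE] * (len(pieces) - 1))
    return sub_tokens, sub_labels

def merge_predictions(words, tokenize, sub_token_probs):
    """Prediction side: each word gets the single most probable label
    over all of its sub-tokens (sub_token_probs[i]: label -> probability)."""
    word_preds, i = [], 0
    for word in words:
        n = len(tokenize(word))
        _, best_label = max((p, lab)
                            for probs in sub_token_probs[i:i + n]
                            for lab, p in probs.items())
        word_preds.append(best_label)
        i += n
    return word_preds
```

With a toy tokenizer that splits long words in two, `align_labels` reproduces the first-sub-token labelling, and `merge_predictions` rebuilds word-level labels from sub-token probabilities.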
Another key difference is the model type, which defines the way a given model has been trained. This includes the training task (e.g., MLM, next sentence prediction, etc.), the tokenizer algorithm, the optimizer and more. We used 5 different kinds of BERT-based models, 3 RoBERTa-based and 2 XLNet-based. For more details regarding the specifics of the architectures, please refer directly to their respective papers.

An ensemble based on a voting strategy
As implemented in (Copara et al., 2020a,b), our ensemble model strategy is based on a majority of votes. This means that for a given ensemble model composition, each constituent model has the right to vote. In other words, for a given protocol and a given sequence, each model returns its predictions, which can be interpreted as passage/entity combinations. Once we collected all the models' predictions, we counted all the passage/entity combinations and validated only those that received a majority of votes.
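A minimal sketch of this voting scheme, assuming each voter's output has already been reduced to a set of passage/entity pairs (the data shapes are illustrative):

```python
from collections import Counter

def majority_vote(model_predictions):
    """model_predictions: one set of (passage, entity) pairs per voter.
    A pair is kept only if a strict majority of the voters predicted it."""
    n_voters = len(model_predictions)
    votes = Counter(pair for preds in model_predictions for pair in set(preds))
    return {pair for pair, count in votes.items() if count > n_voters / 2}
```

For example, with three voters, a pair predicted by two of them survives, while a pair predicted by only one is dropped.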

Sampling
Once we had all the models trained and ready, we wondered whether we could improve performance by adding more voters. The idea is to repeat the first step, each time with a new random initialization of the fully connected layer. We ended up with 100 different models, corresponding to 10 different pretrained models fine-tuned 10 times each.
With only a few models to choose from, we would have been able to evaluate all the possible model compositions; however, as using 50 models out of 100 would have resulted in about 10^29 possible ensembles, we had to sample ensemble model compositions randomly. For each number of models taken into account in a given ensemble, we took a sample of 1000 combinations. This later allows us to show the distribution of our ensemble models' results and examine how it behaves in certain circumstances.
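The sampling step can be sketched as follows; the model identifiers and the fixed seed are illustrative, not part of our actual setup:

```python
import math
import random

def sample_compositions(model_ids, k, n_samples, seed=0):
    """Randomly draw n_samples ensembles of k distinct models each."""
    rng = random.Random(seed)
    return [sorted(rng.sample(model_ids, k)) for _ in range(n_samples)]

models = [f"model_{i:02d}" for i in range(100)]  # 10 pretrained x 10 seeds

# Exhaustive enumeration is infeasible: choosing 50 of 100 models
# gives on the order of 1e29 possible ensembles.
n_possible = math.comb(100, 50)

ensembles = sample_compositions(models, 50, 1000)
```

`math.comb(100, 50)` confirms the order of magnitude quoted above, which is why a sample of 1000 compositions per ensemble size is used instead.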
The ensemble model we chose for the submission was the one that gave us the best F1-score on the test set. It is a composition of 14 models fine-tuned on the task, containing the following pretrained models: 2× BioBERT (BioBERT models with two different random initializations, or seeds), 2× BioClinicalBERT (2 random seeds), 3× PubMedBERT (3 random seeds), 2× RoBERTa base (2 random seeds), 1× RoBERTa large, 1× BioMed RoBERTa and 3× XLNet large (3 random seeds).

Results and Discussions
In Table 4, we see the F1-score for all the 10 models we fine-tuned, across all the 18 entity types. The reported baseline is the CRF baseline that was provided for the shared task. First, we can see that the ensemble model outperforms the baseline by far. When comparing all the models (ensemble aside), we also notice that PubMedBERT is quite consistent, as it often outperforms all the other models, including the ensemble for a few entities, namely Mention, Seal, Temperature and pH. Additionally, when compared to its peers, it clearly shows the best micro and macro F1-scores.
However, when looking at Speed, it seems that the transformer-based models we used are not able to do a better job than the baseline. A closer look at the errors should be taken in order to see what caused such a difference with the baseline (see Section 5.4). When comparing the micro to the macro F1-score standard deviations across all the models, we can see that the macro F1-score standard deviations are systematically higher. This is probably due to the fact that some entities, namely Generic-Measure, Mention, Seal, Size and pH, each accounting for less than 1% of the test set (see Table 2), show relatively high F1-score standard deviations. The same applies to Measure-Type, Numerical and Speed, which each represent less than 2% of the test set. This is in line with the results reported by Dodge et al. (2020), who show that results can vary a lot across seeds when a small amount of data is available. Indeed, as these entities are quite rare, a single misclassification can have a high impact on the macro F1-score. That being said, the micro F1-scores seem relatively stable across all the pretrained models.
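The sensitivity of the macro F1-score to rare entities can be illustrated with toy counts; the numbers below are hypothetical and not taken from our results:

```python
def f1(tp, fp, fn):
    """Per-entity F1 from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro(counts):
    """counts: {entity: (tp, fp, fn)} -> (micro F1, macro F1)."""
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro

# A frequent entity plus a rare one with only 5 gold mentions:
good = {"Reagent": (900, 100, 100), "pH": (5, 0, 0)}
bad  = {"Reagent": (900, 100, 100), "pH": (2, 3, 3)}
```

Misclassifying 3 of the 5 rare mentions barely moves the micro F1-score but drops the macro F1-score by 0.30, which is the pattern observed in the standard deviations above.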

Ensemble results analysis
In this section, we analyse the results observed when sampling different ensemble model compositions. These results are computed exclusively on the test set. The idea behind this experiment is to understand the behaviour of some metrics when adding more models. Figures 2 to 4 show the F1-score, recall and precision distributions with respect to the number of models in a given ensemble, respectively.
The first thing we notice from Figures 2 to 4 is that, as the number of models in an ensemble grows, the variance of the metrics tends to become smaller and steadier.
When looking at Figures 3 and 4, we clearly see that an odd number of voters has a positive impact on recall, while it seems to have a negative impact on precision. It is unclear to us why this behaviour occurs; however, we think it could be linked to the majority rule of our voting strategy, where a majority is easier to reach with an odd number of voters. When looking closely at Figure 2, it appears that this odd/even effect tends to cancel out as the number of voters increases, with even numbers of voters eventually getting slightly better results.
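One way to see why a strict majority is easier to reach with an odd number of voters is a toy binomial model: if each voter independently emits a given passage/entity pair with some probability, the chance that the pair survives a strict-majority vote is noticeably lower for even voter counts, where ties are rejected. This is only an illustrative model under an independence assumption, not an analysis of our actual voters:

```python
from math import comb

def strict_majority_prob(n_voters, p=0.6):
    """Probability that a pair emitted independently by each voter with
    probability p receives a strict majority (> n/2) of the votes.
    With an even n, a tie does not count as a majority."""
    return sum(comb(n_voters, k) * p**k * (1 - p)**(n_voters - k)
               for k in range(n_voters // 2 + 1, n_voters + 1))
```

Under this model, 3 voters accept the pair more often than 4, and 5 more often than 6, mirroring the higher recall (and lower precision) seen for odd voter counts.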
In Figure 3, there is clearly a positive slope that seems to flatten at the end, which means that the more models we have in our ensemble, the higher a recall we should expect. Conversely, this trend is not as clear for precision (Figure 4), where there seems to be a positive relation for odd numbers of voters and a negative one for even numbers, both of which eventually converge to a flat trend. However, in both figures, as already mentioned, the variance of the respective metrics gets steadier and smaller as more models are added to the ensemble composition. Figures 6 to 8 show the same metrics while trying to isolate the effect of adding a new pretrained model versus the effect of adding an already-used pretrained model with a new random initialization of the fully connected layer.
In order to understand the setting of this experiment, we first build a matrix (see Figure 5) where each column is a pretrained model and each row is a fine-tuned version of it. We then compare the performances of ensemble models based on combinations of columns to those based on combinations of rows. In Figures 6 to 8, the x-axis represents the number of rows or columns taken into account.
For instance, the first two boxplots compute the metric distributions of ensembles taking either one row or one column; the following two boxplots take combinations of either two rows or two columns, and so on, up to combinations of 9 rows/columns.
More precisely, the first pink boxplot composes ensembles taking one column of models each: all the BERT base models to begin with, then all the BERT large models, and so on until it computes the metrics for an ensemble composed of all the XLNet large models. The second pink boxplot then takes compositions of 2 columns: for example, it first computes an ensemble with all the BERT base and all the BERT large models, then another with all the BERT base and all the XLNet base models, and so on until it computes an ensemble containing all the XLNet base and XLNet large models.
On the other hand, the blue boxplots compose ensembles with combinations of rows. This means that the first blue boxplot first computes an ensemble composed of the first row (BERT base 1, BERT large 1, . . . , XLNet base 1, XLNet large 1), then of the second row (BERT base 2, BERT large 2, . . . , XLNet base 2, XLNet large 2), and so on until it computes an ensemble with the last row (BERT base 10, BERT large 10, . . . , XLNet base 10, XLNet large 10). In the same manner, the second blue boxplot computes ensembles composed of combinations of two rows: first all the models in the first and second rows, then all the models in the first and third rows, and so on until it computes an ensemble composed of all the models of the last two rows.
In this setting, as the maximum number of possible combinations of rows is C(10,5) = 252, we were able to compute all the possible combinations instead of sampling them. As we have the same number of pretrained models as fine-tuned versions, we end up with the same number of possible combinations of ensembles. That being said, for each number of models taken into account in an ensemble, this allows us to compare the pink boxplot with the blue one in a more convenient manner.

Figure 6: Micro F1-score distribution using ensembles composed of either 1 to 9 different pretrained models (each with 10 different fine-tunings) vs. 1 to 9 different fine-tunings using all the pretrained models.
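The column vs. row selection over the 10×10 matrix can be sketched as follows; the model identifiers are illustrative placeholders:

```python
from itertools import combinations

PRETRAINED = [f"pretrained_{j}" for j in range(10)]  # columns: pretrained models
SEEDS = list(range(10))                              # rows: fine-tuning seeds

def column_ensembles(k):
    """Every ensemble built from k columns: all 10 seeds of k pretrained models."""
    return [[(p, s) for p in cols for s in SEEDS]
            for cols in combinations(PRETRAINED, k)]

def row_ensembles(k):
    """Every ensemble built from k rows: k seeds of all 10 pretrained models."""
    return [[(p, s) for s in rows for p in PRETRAINED]
            for rows in combinations(SEEDS, k)]
```

Since the matrix is square, choosing k rows and choosing k columns yield the same number of ensembles, with a peak of C(10,5) = 252 at k = 5, which is why exhaustive enumeration is feasible here.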
It is worth noting that the more we increase the number of columns and rows present in an ensemble model, the more models the two strategies share. For example, at 9, the pink boxplot shows the distribution of the metrics for all the possible ensemble models containing 9 columns of models (90 models out of 100), while the blue boxplot shows the same metrics for 9 rows of models (also 90 models out of 100). At this point, it is expected that both boxplots converge, as they share 64 model predictions out of 90.
Focusing on the left part of Figure 6, we clearly see the benefits of using more pretrained models. First, an ensemble of only 10 different pretrained models already shows better results. It also looks steadier, as its F1-score distribution is much narrower than that of ensembles composed of multiple fine-tunings of the same pretrained model.
When looking at Figure 7, we see that the major difference between the two distributions lies in the variances of the recall distributions; indeed, taking different pretrained models tends to retrieve important passages more systematically. The trend of both selection strategies seems to be increasing; in other words, in both cases, the more models we add, the more important passages we retrieve.
Finally, it is interesting to see in Figure 8 that precision begins quite high and tends to decrease as we add more fine-tuned models. Conversely, when taking more pretrained models, precision seems to have a positive relation to the number of models we use. As explained before, this relation is also due to the fact that the two ensemble selection strategies share more and more models.
This analysis helped us understand a bit more about what was happening behind our majority-of-votes strategy; it would be interesting to take note of some of the observed behaviours and try to devise new strategies accordingly.

Official results
The official results in terms of precision, recall, and F1-score on the test 2020 set are shown in Table 5. Each team was allowed to submit only one run. Our submitted run was based on the ensemble model described in sections 4.2 and 4.3. Our BiTeM team achieved the highest precision in both exact-match and partial-match evaluation, reaching 84.73% and 88.72% respectively, as well as the highest F1-score in exact-match evaluation, reaching 77.99% among 13 teams. The F1-score of our model in partial match (81.67%) was slightly lower than the best F1-score (81.75%).

Results of the ensemble model on test 2020 data
The precision, recall, and F1-score results for all entities and Action on the test 2020 set in the exact-match evaluation are presented in Table 6. The best F1-score was achieved for pH; Size was the most difficult entity to detect. Figure 9 shows the normalized confusion matrix for the predictions (exact match) of the ensemble model on the test 2020 data. As we can see, more than 78% of Size predictions are mislabelled as Amount. This may be due to the small number of training instances of the Size entity. As we can see in the following examples, 50 mL can refer to both Size and Amount depending on the context: in the first example, 50 mL refers to Amount, and in the second, it refers to Size.

Error analysis
Example 5.4.1 Add more NEB - no β-mercaptoethanol to final volume of 50 mL.
Example 5.4.2 Transfer the aqueous phase to a new 50 mL Falcon tube.
About 17% of the Device predictions are mislabelled as Location, which may be due to inconsistencies in the annotation process: for example, magnetic rack is annotated as Device in a few protocols (protocol 0680, protocol 0683, protocol 0685) and as Location in others (e.g., protocol 32148). Similarly, freezer is annotated interchangeably as Location and Device. Generic-Measure is mostly confused with Concentration (20.4%), and Method is mostly confused with Action. About 12% of Numerical predictions are mislabelled as Concentration.

Conclusion
With almost no preprocessing, we have seen that current pretrained language models seem to be quite efficient for NER tasks (Copara et al., 2020a,b). By analysing our voting strategy, we have also shown the strengths as well as the weaknesses of such ensemble models. For instance, the more models we use, the higher and more stable the performances tend to be; however, a new pretrained model appears to bring more information than fine-tuning the same pretrained model again with a new random initialization of the fully connected weights. With this voting strategy, our submission achieved the best overall exact-match F1-score of the competition, which clearly shows the power of such models. Since almost no knowledge of wet laboratory protocols is required, we think these models open opportunities to out-of-field researchers.
In future work, it would be interesting to refine the selection of pretrained models and to explore bootstrapping instead of fine-tuning the same pretrained model multiple times. It would also be interesting to see whether some preprocessing tweaks could improve the detection of Speed, where our models were outperformed by the baseline.