A Deep Learning-Based System for PharmaCoNER

The Biological Text Mining Unit at BSC and CNIO organized the first shared task on chemical & drug mention recognition from Spanish medical texts called PharmaCoNER (Pharmacological Substances, Compounds and proteins and Named Entity Recognition track) in 2019, which includes two tracks: one for NER offset and entity classification (track 1) and the other one for concept indexing (track 2). We developed a pipeline system based on deep learning methods for this shared task, specifically, a subsystem based on BERT (Bidirectional Encoder Representations from Transformers) for NER offset and entity classification and a subsystem based on Bpool (Bi-LSTM with max/mean pooling) for concept indexing. Evaluation conducted on the shared task data showed that our system achieves a micro-average F1-score of 0.9105 on track 1 and a micro-average F1-score of 0.8391 on track 2.

Introduction
Efficient access to mentions of clinical entities is essential for making use of clinical text, and natural language processing (NLP) is the means of extracting the entities embedded in that text. Over the last few decades, clinical entity extraction has attracted considerable attention from researchers, clinicians, and enterprises in the clinical domain. Progress in clinical entity extraction has been driven mainly by NLP challenges that include biomedical entity recognition and normalization tasks, such as the BioCreative (Critical Assessment of Information Extraction systems in Biology) challenges (e.g., the CHEMDNER (Chemical compound and drug name recognition) track (Leaman et al., 2013)), the i2b2 (Informatics for Integrating Biology and the Bedside) challenges (Uzuner et al., 2011), the SemEval (Semantic Evaluation) challenges (Elhadad et al., 2015) and the ShARe/CLEF eHealth Evaluation Lab shared tasks (Kelly et al., 2016). Many methods have been proposed for biomedical entity recognition and normalization. Machine learning methods such as conditional random fields (CRF) (Lafferty et al., 2001), structured support vector machines (SSVM) (Tsochantaridis et al., 2005) and bidirectional long short-term memory with conditional random fields (BiLSTM-CRF) (Huang et al., 2015) have been applied to biomedical entity recognition, while support vector machines (SVM) (Grouin et al., 2010) and ranking based on convolutional neural networks (CNN) (Li et al., 2017) have been applied to clinical entity normalization. Although these methods have produced promising results, most of them focus on clinical text in English. Recently, clinical entity extraction from clinical text in other languages has also begun to receive attention. For example, in 2016, NTCIR organized the first challenge on information extraction from clinical documents in Japanese (Morita et al., 2013). In 2017, CCKS organized the first challenge on information extraction from clinical records in Chinese (Hu et al., 2017).
To accelerate the development of techniques for information extraction from clinical text in Spanish, Martin Krallinger et al. organized a shared task specifically for chemical & drug mention recognition from Spanish medical texts, called PharmaCoNER, in 2019 (Gonzalez-Agirre et al., 2019). It includes two tracks: track 1 for NER offset and entity classification and track 2 for concept indexing. The organizers provided an annotated corpus of 1000 clinical cases, of which 500 were used as the training set, 250 as the development set and 250 as the test set. We participated in this shared task and developed a pipeline system based on two recent deep learning methods: BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) and Bpool (Bi-LSTM with max/mean pooling) (Conneau et al., 2017). The system, developed on the training and development sets, achieved a micro-average F1-score of 0.9105 on track 1 and a micro-average F1-score of 0.8391 on track 2 on the independent test set.

Material and Methods
As shown in Figure 1, we first developed a preprocessing module that splits clinical cases into sentences, tokenizes the sentences and extracts some features for each token, then a BERT-based subsystem for NER offset and entity classification, and finally a Bpool-based subsystem for concept indexing. Each component is presented in detail in the following sections.

Dataset
The PharmaCoNER organizers asked medical experts to annotate a corpus of 1000 clinical cases with chemical & drug mentions for the shared task according to a pre-defined guideline. The corpus was divided into a training set, a development set and a test set. During the competition, the test set was hidden in a background set of 3751 clinical cases. The statistics of the corpus, including the number of documents and the number of chemical & drug mentions of each type, are listed in Table 1, where "UNK" denotes unknown. It should be noted that the chemical & drug mentions annotated as UNCLEAR were not considered during the competition.

Preprocessing
We split each clinical case into sentences at ';', '?', '!', '\n' or any '.' that is not part of a number, and further split each sentence into tokens using the method proposed by Liu et al. (2015), which was specially designed for clinical text. We adopted the Ab3P tool (footnote 1) to extract the full names of abbreviations, and the SPACCC_POS-TAGGER tool (footnote 2) for POS tagging and lemmatization. In addition, we derived the word shape of each word in the same way as Liu et al. (2015).
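The sentence-splitting rule and the word-shape feature described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, a '.' is treated as "in a number" only when it has a digit on both sides, and the word-shape mapping (uppercase to 'A', lowercase to 'a', digit to '0') is a common convention assumed here because the exact scheme of Liu et al. (2015) is not given in this paper.

```python
import re

# Boundary characters: ';', '?', '!', newline, or a '.' that is NOT
# flanked by digits on both sides (assumption: "2.5" stays intact).
_BOUNDARY = re.compile(r"[;?!\n]|(?<!\d)\.|\.(?!\d)")


def split_sentences(text):
    """Split a clinical case into sentences at the boundary characters."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        chunk = text[start:match.end()].strip()
        if chunk:
            sentences.append(chunk)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def word_shape(token):
    """Map each character to a coarse class (assumed convention)."""
    shape = []
    for ch in token:
        if ch.isdigit():
            shape.append("0")
        elif ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        else:
            shape.append(ch)  # keep punctuation and symbols unchanged
    return "".join(shape)


print(split_sentences("Dosis de 2.5 mg. Sin fiebre; alta"))
print(word_shape("PaO2"))
```

Note that stripping whitespace discards character offsets; a system that must report mention offsets, as track 1 requires, would need to keep the start position of each sentence as well.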

NER offset and entity classification
NER offset and entity classification is a typical NER problem, usually treated as a sequence labeling problem. In this study, we adopted the "BIO" tagging scheme to represent chemical & drug mentions, where 'B', 'I' and 'O' denote the beginning, inside and outside of a chemical & drug mention respectively, and developed a system based on BERT. First, the character-level representation, POS tagging representation and word shape representation of each word were concatenated with the word representation produced by BERT, and then a CRF layer was appended to BERT for chemical & drug mention recognition.
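To make the BIO representation concrete, the sketch below converts character-offset annotations into per-token BIO tags and back. It is illustrative only: the helper names are hypothetical, tokens are taken to be whitespace-separated (the paper uses a clinical tokenizer instead), and a token is assumed to belong to a mention only when it lies entirely inside the mention span.

```python
import re


def token_offsets(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(tokens, mentions):
    """tokens: list of (start, end); mentions: list of (start, end, type)."""
    tags = ["O"] * len(tokens)
    for m_start, m_end, m_type in mentions:
        inside = False
        for i, (t_start, t_end) in enumerate(tokens):
            if t_start >= m_start and t_end <= m_end:
                tags[i] = ("I-" if inside else "B-") + m_type
                inside = True
    return tags


def bio_to_spans(tokens, tags):
    """Recover (start, end, type) mentions from a BIO tag sequence."""
    spans, current = [], None
    for (t_start, t_end), tag in zip(tokens, tags):
        label = tag[2:] if tag != "O" else None
        continues = (tag.startswith("I-") and current is not None
                     and current[2] == label)
        if continues:
            current[1] = t_end  # extend the open mention
            continue
        if current is not None:
            spans.append(tuple(current))
        # A 'B-' tag (or an orphan 'I-' tag) opens a new mention.
        current = [t_start, t_end, label] if label else None
    if current is not None:
        spans.append(tuple(current))
    return spans


text = "anticuerpos antitransglutaminasa tisular IgA positivos"
tokens = token_offsets(text)
mention_end = text.index(" positivos")
tags = spans_to_bio(tokens, [(0, mention_end, "PROTEINAS")])
print(tags)
print(bio_to_spans(tokens, tags))
```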

Concept Indexing
After chemical & drug mentions were recognized, we first constructed <mention, standard terminology> pairs as candidates for matching, and then built a Bpool-based matching model (Conneau et al., 2017) on those candidates. Standard terminologies were selected as candidates in the following two ways:
1) Top n terminologies ranked by Levenshtein distance to the given mention at the character level and at the token level.
2) Terminologies selected by 1), together with the synonyms of the given mention that appear in the standard terminology vocabulary.
After terminology selection, a character-level Bpool-based matching model was used to judge whether the mention and the terminology in each pair matched.

Figure 1: Overview architecture of our system for the PharmaCoNER task

1 https://github.com/ncbi-nlp/Ab3P
2 https://github.com/PlanTL-SANIDAD/SPACCC_POS-TAGGER
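The candidate selection step of this section could be sketched as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, case is folded before comparison, and the character-level and token-level rankings are combined by taking the union of their top n lists, since the paper does not state how the two levels are merged.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def select_candidates(mention, terminologies, n=40, synonyms=()):
    """Top-n terminologies by char-level and token-level distance,
    optionally extended with synonyms found in the vocabulary."""
    m = mention.lower()
    by_char = sorted(
        terminologies, key=lambda t: levenshtein(m, t.lower()))[:n]
    by_token = sorted(
        terminologies,
        key=lambda t: levenshtein(m.split(), t.lower().split()))[:n]
    vocabulary = set(terminologies)
    extra = [s for s in synonyms if s in vocabulary]
    # Union of the three lists, keeping first-seen order.
    seen, candidates = set(), []
    for term in by_char + by_token + extra:
        if term not in seen:
            seen.add(term)
            candidates.append(term)
    return candidates


vocab = ["tacrolimus", "sirolimus", "ácido fólico"]
print(select_candidates("Tacrolimus", vocab, n=2))
```

Each returned terminology would then be paired with the mention and passed to the matching model; the default n=40 follows the value reported in the experimental setup.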

Evaluation
The performance of our system was measured by micro-average precision (P), recall (R), and F1-score (F1), which were calculated using the official tool provided by the PharmaCoNER organizers.
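For reference, micro-averaging pools true positives, false positives and false negatives over all documents before computing P, R and F1. The sketch below illustrates this under the assumption that a prediction counts as correct only on an exact match of offsets and type; it is not the official evaluation tool, and the function name and data layout are hypothetical.

```python
def micro_prf(gold_by_doc, pred_by_doc):
    """gold_by_doc / pred_by_doc: dict mapping a document id to an
    iterable of (start, end, type) annotations."""
    tp = fp = fn = 0
    for doc_id in set(gold_by_doc) | set(pred_by_doc):
        gold = set(gold_by_doc.get(doc_id, ()))
        pred = set(pred_by_doc.get(doc_id, ()))
        tp += len(gold & pred)   # exact matches
        fp += len(pred - gold)   # predicted but not annotated
        fn += len(gold - pred)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = {"d1": [(0, 5, "PROTEINAS"), (10, 20, "NORMALIZABLES")],
        "d2": [(3, 8, "NORMALIZABLES")]}
pred = {"d1": [(0, 5, "PROTEINAS"), (10, 20, "PROTEINAS")],
        "d2": [(3, 8, "NORMALIZABLES")]}
print(micro_prf(gold, pred))
```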

Experimental Setup
In this study, for track 1, we first optimized the model on the development set and then fine-tuned it on the training and development sets for 5 more epochs. For standard terminology selection, we varied n from 10 to 50 in steps of 10, and finally set it to 40. For track 2, we optimized the model on the training and development sets via 10-fold cross validation. The hyper-parameters and the parameter estimation algorithm used for model training are listed in Table 2. The pre-trained BERT was used as the initial neural language model and fine-tuned on all datasets provided by the shared task organizers. The embeddings of characters, POS tags and word shapes were randomly initialized from a uniform distribution. It is worth noting that the parameters inside BERT were updated with a learning rate of 2e-5, whereas the parameters of the other features were updated with a learning rate of 0.003.
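The two learning rates mentioned above amount to splitting the model parameters into two groups. The sketch below shows one way to build such groups in plain Python. It is an assumption-laden illustration: the paper does not name the training framework, the function name and the "bert." name prefix are hypothetical, and the list-of-dicts layout is chosen because it mirrors the per-parameter options accepted by PyTorch optimizers.

```python
def build_param_groups(named_params, bert_prefix="bert.",
                       bert_lr=2e-5, other_lr=0.003):
    """Split (name, parameter) pairs into a BERT group and a group for
    the remaining parameters (character, POS, word-shape embeddings,
    CRF layer), each with its own learning rate."""
    bert_params, other_params = [], []
    for name, param in named_params:
        if name.startswith(bert_prefix):
            bert_params.append(param)
        else:
            other_params.append(param)
    return [
        {"params": bert_params, "lr": bert_lr},
        {"params": other_params, "lr": other_lr},
    ]


# Placeholder strings stand in for real parameter tensors.
demo = [("bert.encoder.layer0.weight", "w0"),
        ("char_embedding.weight", "w1"),
        ("crf.transitions", "w2")]
for group in build_param_groups(demo):
    print(group["lr"], group["params"])
```

With a real model, the pairs would come from the framework's parameter listing (for example `model.named_parameters()` in PyTorch) and the resulting groups would be handed to the optimizer.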

Results
The highest micro-average precision, recall and F1-score of our system on the two tracks are listed in Table 3. Our system achieved a micro-average precision of 0.9123, recall of 0.9088 and F1-score of 0.9105 on track 1, and a micro-average precision of 0.8284, recall of 0.8502 and F1-score of 0.8391 on track 2. Among the three types of chemical & drug mentions considered in the shared task, our system performed best on NORMALIZABLES and worst on NO_NORMALIZABLES for track 1, which may be related to the number of mentions of each type. Table 4 provides additional ablation results, analyzing the contribution of individual features on track 1 and reporting the performance of each standard terminology selection (STS) method on track 2. We found that character-level embedding, POS tagging representation, and word shape representation all contributed to our system on track 1, bringing improvements of 1.69%, 0.51%, and 0.63% in F1-score, respectively. On track 2, when the extended synonyms were removed, the F1-score declined from 0.8048 to 0.7932.

Discussion
For track 1, our analysis found that data processing had a great influence on the NER offset results. Separating letters and digits within a word (for example, "PaO2" was split into 'PaO' and '2') caused some errors of entity boundary or entity type. Separating words at the hyphen '-' also caused some errors. For example, "4methyilumbelliferyl α-D-galactosidasa" is identified as a whole as "PROTEINAS", but in "daclizumab-tacrolimus-MMF-esteroide", "daclizumab" is identified as "PROTEINAS", while "tacrolimus", "MMF" and "esteroide" are identified as "NORMALIZABLES". Our experiments on the development set showed that the effect of tokenization on the micro-average F1-score of NER was about 2%.
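The kind of tokenization discussed above can be reproduced with a short regular expression. The tokenizer below is a hypothetical stand-in written for illustration, not the clinical tokenizer of Liu et al. (2015) used in the system: it separates runs of letters, runs of digits, and individual punctuation characters, which yields the letter/digit and hyphen splits described in this paragraph.

```python
import re

# Runs of letters | runs of digits | any single non-space symbol.
_TOKEN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def tokenize(sentence):
    """Split a sentence into letter runs, digit runs and symbols."""
    return _TOKEN.findall(sentence)


print(tokenize("PaO2"))
print(tokenize("tacrolimus-MMF"))
print(tokenize("α-D-galactosidasa"))
```

Because a mention such as "PaO2" is spread over two tokens, the tagger has to label both correctly to recover the right boundary, which is one way such splits can turn into boundary or type errors.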
Our system mainly made the following three types of errors: (1) abbreviation recognition errors: it is difficult to identify abbreviations in a record correctly; (2) long entities: entities consisting of four or more tokens, such as 'anticuerpos antitransglutaminasa tisular IgA', are hard to identify correctly; (3) drugs: the model fails to recognize some drugs, such as 'dasatinib' and 'nilotinib'.
Since we used a pipeline model, the mistakes made on track 1 are propagated to track 2; about 8% of errors are caused by track 1. In addition, about 10% of errors are caused by the matching model. We also summarized the main causes of low recall in standard terminology selection when constructing <mention, standard terminology> pairs: (1) about 40% of the entities are abbreviations, for which it is difficult to find candidates in SNOMED-CT; (2) about 20% of the entities have the same candidates in SNOMED-CT, which are not normalized entities in the shared task.
For further improvement, there are two possible directions: (1) using joint learning methods for track 1 and track 2; (2) integrating a knowledge graph into our system.

Conclusion
In this study, we developed a deep learning-based pipeline system for the PharmaCoNER shared task, a challenge specifically for clinical entity extraction from clinical text in Spanish.