KFU NLP Team at SMM4H 2019 Tasks: Want to Extract Adverse Drugs Reactions from Tweets? BERT to The Rescue

This paper describes a system developed for the Social Media Mining for Health (SMM4H) 2019 shared tasks. Specifically, we participated in three tasks. The goals of the first two tasks are to classify whether a tweet contains mentions of adverse drug reactions (ADR) and extract these mentions, respectively. The objective of the third task is to build an end-to-end solution: first, detect ADR mentions and then map these entities to concepts in a controlled vocabulary. We investigate the use of a language representation model BERT trained to obtain semantic representations of social media texts. Our experiments on a dataset of user reviews showed that BERT is superior to state-of-the-art models based on recurrent neural networks. The BERT-based system for Task 1 obtained an F1 of 57.38%, with improvements up to +7.19% F1 over a score averaged across all 43 submissions. The ensemble of neural networks with a voting scheme for named entity recognition ranked first among 9 teams at the SMM4H 2019 Task 2 and obtained a relaxed F1 of 65.8%. The end-to-end model based on BERT for ADR normalization ranked first at the SMM4H 2019 Task 3 and obtained a relaxed F1 of 43.2%.


Introduction
Short-text communication forms, such as Twitter microblogging, present a wide variety of facts and opinions on numerous topics, and this treasure trove of information is currently severely underexplored. Here we focus on the problem of discovering adverse drug reaction (ADR) concepts in Twitter messages as part of the Social Media Mining for Health (SMM4H) 2019 shared tasks.
This work is based on the participation of our team, named KFU NLP, in the first three tasks. Organizers of SMM4H 2019 Tasks 1-3 (Weissenbacher et al., 2019) provided participants with datasets of English tweets annotated at the message level with binary annotation indicating the presence or absence of ADRs, text spans of reported ADRs, and their corresponding medical codes from the Medical Dictionary for Regulatory Activities (MedDRA). The goal of Task 1 is to classify the tweets according to the presence of ADRs. For the second task, named entity recognition (NER) aims to detect the mentions of ADRs. The third and final task is designed as an end-to-end problem, intended to perform full evaluation of a system operating in real conditions: given a set of raw tweets, the system has to find the tweets that mention ADRs, find the spans of the ADRs, and normalize them with respect to a given knowledge base (KB). These tasks are especially challenging due to specific characteristics of user-generated texts from social networks, which are noisy and contain misspelled words, abbreviations, emojis, etc. Motivated by the recent success of deep architectures in general and language representation networks in particular, we explore an application of Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) and its extension for the biomedical domain, BioBERT (Lee et al., 2019), to the SMM4H 2019 tasks. For both ADR extraction and medical concept normalization, we conclude that BERT outperforms previous state-of-the-art baselines based on recurrent neural architectures (RNNs), including bidirectional Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Cho et al., 2014) paired with word2vec word embeddings.
The paper is organized as follows. In Section 2, we present the task description, machine learning baselines, and classification experiments for Task 1. We describe our models for end-to-end extraction of ADR concepts in Sections 3 and 4. Finally, we discuss future directions in Section 5.

Task 1: Classification
The goal of this sub-task is to identify the tweets with ADR mentions. This is a necessary filtering step to remove noise, since most of the health-related chatter in the domain does not contain relevant information.

Dataset
The training set consists of 25,678 tweets, of which 2,377 (about 9%) are labeled as positive examples with ADRs; the corpus is therefore strongly imbalanced. Tweet lengths in the training set vary from 1 to 53 words, with an average of 20 words. The test set includes 4,576 tweets. Its minimum tweet length is also 1 word, while the maximum is 186 words, which is much longer than in the training set. However, the average number of words per tweet, 23, is on par with the training set.

Method
Previous studies have shown the effectiveness of classical machine learning approaches (Ofoghi et al., 2016; Jonnagaddala et al., 2016; Alimova and Tutubalina, 2017). We applied an SVM-based model with a set of features as a baseline method. As SVM features, we used the bag-of-words representation, drug names, and ADRs from the Diego Lab ADR lexicon (Sarker et al., 2015). The list of drug names was obtained from the Food and Drug Administration (FDA). We also explored the potential of the sent2vec tool for tweet representation (Pagliardini et al., 2018). The pre-trained Twitter unigram model was used to obtain the vectors 1.
Our main solution is a classifier based on the BERT architecture. For the BERT-based model, the tweet's representation was obtained with the Transformer architecture (Vaswani et al., 2017), and then logistic regression was used as a classifier. We used the implementation from the model's official repository 2.

Experiments
For the SVM-based classifier, we set class weights to 0.3 and 0.7 for the non-ADR and ADR classes, respectively, and applied a linear kernel. The BERT-based model was trained for 20 epochs with a learning rate of 5 × 10^-5, a maximum sequence length of 128, and a batch size of 32. The official evaluation metrics are precision (P), recall (R), and F1-measure (F1) computed for the positive class. During preprocessing, we removed all URLs, user mentions, and retweet symbols using the tweet-preprocessor package 3. We conducted a set of experiments on the training set with 5-fold cross-validation. The results of these experiments show that using sent2vec tweet representations did not improve classification quality. Results on the test set are presented in Table 1. Our baseline SVM classifier (run-2) obtained an F1 score of 51.64%, which is on par with the average result. The BERT-based classifier (run-1) achieved an F1 score of 57.38% and outperformed the F1 score averaged across 43 submissions by 7.19%.
1 https://github.com/epfml/sent2vec
2 https://github.com/google-research/bert
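To make the official metrics concrete, the following minimal sketch computes precision, recall, and F1 for the positive (ADR) class only. The function name and the toy labels are hypothetical illustrations and are not taken from the shared-task data or the organizers' evaluation script.

```python
def positive_class_prf(gold, pred, positive=1):
    """Precision, recall, and F1 computed for the positive class only."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): 1 = tweet mentions an ADR, 0 = it does not.
gold = [1, 0, 0, 1, 1, 0, 0, 0]
pred = [1, 0, 1, 0, 1, 0, 0, 0]
p, r, f1 = positive_class_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because true negatives do not enter any of the three quantities, a classifier that labels every tweet as non-ADR scores zero despite high accuracy on an imbalanced corpus.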

Task 2: Extraction of Adverse Effect Mentions
Following state-of-the-art research (Miftahutdinov et al., 2017; Tutubalina and Nikolenko, 2017; Lee et al., 2019), we view the second task from the perspective of a sequence labeling problem. Sequence labeling refers to the task of learning to predict a label for each token in a sequence of tokens. State-of-the-art methods employ neural architectures based on bidirectional LSTMs and conditional random fields (CRF) (Lample et al., 2016; Tutubalina and Nikolenko, 2017; Giorgi and Bader, 2019). Recent advancements in language representation models such as BERT have opened up new directions of research in sequence labeling.

Dataset
The data for the second sub-task includes 2,367 tweets that are fully annotated for ADR mentions and Indications. This set contains a subset of (i) 1,212 tweets from Task 1 tagged as 'hasADR' and (ii) 1,155 tweets marked as 'noADR' (1,828 ADR mentions in total).

Method
Sequence labeling methods view a message as a sequence of tokens labeled using the BIO tagging scheme: B indicates the beginning of the entity mention, I is used for tokens inside the entity mention, and O indicates tokens outside any entities. To solve the sequence labeling task, we utilize and empirically compare several models: (i) bidirectional LSTM-CRF; (ii) BERT; (iii) BERT for Biomedical Text Mining named BioBERT. We have also utilized a CRF tagger on top of BioBERT. A technical explanation of these neural models is omitted due to space constraints; we refer to the studies listed above.
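As an illustration of the BIO scheme, the sketch below converts token-level entity spans into a tag sequence. The helper name, the `ADR` label suffix, and the example sentence are assumptions made for illustration; they do not reproduce the preprocessing code used in our system.

```python
def bio_tags(tokens, spans, label="ADR"):
    """Return one BIO tag per token.

    spans: list of (start, end) token indices, with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # tokens inside the mention
    return tags


# Hypothetical example: the mention "sick to my stomach" covers tokens 5..8.
tokens = "this drug made me feel sick to my stomach".split()
tags = bio_tags(tokens, [(5, 9)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```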
We have also combined deep neural network representations with additional dictionary-based features. Dictionary-based features are calculated for each token in a text as follows: first, all occurrences of predefined vocabulary entries are found in the text; then the first token of each matched span is tagged with the B-tag, the subsequent matched tokens with the I-tag, and all other tokens in the text with the O-tag. The dictionary-based features are concatenated with the representation learned by the neural network, which captures extensional semantic information of an entity mention. We adopted the dictionaries from our previous work (Miftahutdinov et al., 2017).
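The dictionary lookup described above can be sketched as follows. This is a simplified illustration under several assumptions: matching is exact and case-insensitive on whitespace tokens, longer entries take precedence over shorter ones, and the lexicon is a hypothetical toy list rather than the dictionaries from our previous work.

```python
def dictionary_features(tokens, lexicon):
    """Tag each token with B, I, or O according to lexicon matches.

    lexicon: list of lowercase phrases (hypothetical toy entries here).
    """
    lowered = [t.lower() for t in tokens]
    # Try longer entries first so that "bad headache" wins over "headache".
    entries = sorted((e.split() for e in lexicon), key=len, reverse=True)
    feats = ["O"] * len(tokens)
    for entry in entries:
        n = len(entry)
        if n == 0:
            continue
        for i in range(len(lowered) - n + 1):
            window_free = all(f == "O" for f in feats[i:i + n])
            if lowered[i:i + n] == entry and window_free:
                feats[i] = "B"
                for j in range(i + 1, i + n):
                    feats[j] = "I"
    return feats


tokens = "could not sleep and got a bad headache".split()
lexicon = ["headache", "bad headache", "not sleep"]
feats = dictionary_features(tokens, lexicon)
print(list(zip(tokens, feats)))
```

In a neural tagger, each of the three feature values would typically be encoded as a small vector (for example, one-hot) and concatenated with the token representation before the output layer.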

Experiments
For the NER sub-task, each network was trained for 25 epochs with the batch size set to 32. We used the Adam algorithm as the optimizer with an initial learning rate of 5 × 10^-5. We used the publicly available implementation of BioBERT-CRF 4. Training all 10 networks took 2-3 hours on eight NVIDIA Tesla P40 GPUs. Additionally, we have used the CADEC corpus along with the corpus provided by the organizers.
Since the boundaries of an entity mention in social media texts are hard to define, two types of evaluation were used: strict and relaxed. Precision, recall, and F-measure are used for performance evaluation.
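The difference between the two evaluation modes can be sketched as follows. We assume here that a strict match requires identical span boundaries and that a relaxed match requires only an overlap between the predicted and gold spans; this definition, the function names, and the toy offsets are assumptions for illustration and may differ in detail from the organizers' evaluation script.

```python
def _hit(gold_span, pred_span, relaxed):
    """Spans are (start, end) character offsets with end exclusive."""
    if relaxed:
        # Any overlap counts as a match.
        return pred_span[0] < gold_span[1] and gold_span[0] < pred_span[1]
    # Strict: boundaries must be identical.
    return gold_span == pred_span


def span_prf(gold, pred, relaxed=False):
    """Precision, recall, and F1 over entity spans of a single text."""
    matched_pred = sum(1 for p in pred if any(_hit(g, p, relaxed) for g in gold))
    matched_gold = sum(1 for g in gold if any(_hit(g, p, relaxed) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy offsets (hypothetical): one exact hit, one partial hit, one false alarm.
gold = [(10, 28), (40, 48)]
pred = [(15, 28), (40, 48), (60, 65)]
strict = span_prf(gold, pred, relaxed=False)
relaxed = span_prf(gold, pred, relaxed=True)
print("strict ", strict)
print("relaxed", relaxed)
```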
In order to select the best neural models, we evaluated our models on the CADEC corpus using 5-fold cross-validation at the development stage. BERT showed a 5-7% improvement in the strict evaluation over LSTM-CRF, while BioBERT showed slightly better performance than BERT. BioBERT with CRF stayed roughly on par with the model without CRF.
4 https://github.com/dmis-lab/biobert
During BioBERT evaluation, we encountered unstable results on development sets. Therefore, for the final submission we combined the outputs of ten BioBERT-CRF models trained with the same settings using a simple voting scheme, with the intent of increasing the robustness of the final system. Table 2 shows a comparison of the ensemble model to the official average scores computed using the participants' submissions. Our model obtained the highest relaxed F1 score, 65.8%, among 9 teams.
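One simple way to realize such a voting scheme is a per-token majority vote over the tag sequences produced by the individual models. The sketch below assumes token-level voting with ties broken in favor of the earliest model; both choices, as well as the function name and toy predictions, are assumptions for illustration rather than a description of the exact scheme in our submission.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine tag sequences from several models by per-token majority vote.

    predictions: list of tag sequences, one per model, all of equal length.
    Ties are broken in favor of the earliest model (an assumption).
    """
    voted = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        top = max(counts.values())
        voted.append(next(t for t in token_tags if counts[t] == top))
    return voted


# Toy predictions (hypothetical) from three models for a four-token tweet.
runs = [
    ["O", "B", "I", "O"],
    ["O", "B", "O", "O"],
    ["O", "B", "I", "O"],
]
ensemble = majority_vote(runs)
print(ensemble)
```

Note that per-token voting can produce sequences that violate the BIO constraints (for example, an I tag directly after O), so a post-processing step may be needed in practice.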

Task 3: Medical Concept Normalization
A crucial part of this problem is to translate a text from social media language (e.g., "felt sick to my stomach" or "couldn't sleep much") to formal medical language (e.g., "nausea" and "insomnia", respectively). The SMM4H 2019 Task 3 is designed as an end-to-end task. This setup is closer to a real production environment, where the system has free-form text as input and should be able to produce a set of extracted medical concepts. This end-to-end setup is more challenging due to the sequential two-stage pipeline: the system has to (i) first detect ADR mentions and then (ii) map extracted ADRs to knowledge base entries. For the first step, we use the NER model described in Section 3. The system used for concept normalization is based on our previous works (Tutubalina et al., 2018; Miftahutdinov and Tutubalina, 2019) and is presented below.

Dataset
ADR mentions from the SMM4H 2019 dataset are mapped to Preferred Terms (PTs) of the Medical Dictionary for Regulatory Activities (MedDRA). The SMM4H 2019 training set consists of 1,828 phrases mapped to 489 MedDRA codes. The average number of ADR mentions mapped to a given concept is 3.74. The minimum and maximum numbers of queries mapped to a given concept are 1 and 65, respectively. Figure 1 shows a plot of the code frequency distribution of MedDRA concepts presented in the training set. Additionally, we present statistics on the top 20 entity mentions from the training set in Figure 2.

Method
Following state-of-the-art research (Tutubalina et al., 2018; Sarker et al., 2018; Miftahutdinov and Tutubalina, 2019), we view concept normalization as a classification task. Following Miftahutdinov and Tutubalina (2019), we convert each ADR mention into a vector representation using BERT or an RNN. Next, we employ the standard softmax activation for the output layer. The softmax layer yields a probability distribution over all medical codes seen in the training set for the input mention.
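The final classification step can be illustrated with a minimal sketch of a softmax over candidate concepts. The concept names and logit values below are hypothetical placeholders; in the actual model, the logits would come from a linear layer applied to the mention representation, and the label set would contain all MedDRA codes from the training set.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical concept labels and output-layer scores for one ADR mention.
codes = ["nausea", "insomnia", "headache"]
logits = [2.0, 0.5, -1.0]

probs = softmax(logits)
best = max(range(len(codes)), key=probs.__getitem__)
predicted = codes[best]
print(predicted, round(probs[best], 3))
```

A consequence of this formulation is that concepts absent from the training set can never be predicted, since they have no corresponding output unit.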

Experiments
We trained the BERT model for 40 epochs, using a batch size of 96 and a learning rate of 5 × 10^-5. In order to prevent neural networks from overfitting, we used a dropout of 0.2 to control the inputs and the softmax layer. We used the publicly available implementation of BERT 5.
The strict and relaxed evaluations proposed for Task 2 were also adopted for Task 3. As in previous work, we evaluated our models on the CADEC corpus at the development stage using 5-fold cross-validation. The BERT model consistently outperformed attention-based bidirectional LSTM and GRU paired with pre-trained word embeddings in this set of experiments, showing a 6-9% improvement. We did not experiment with BioBERT for this task.
For the final submission, we used the two-stage pipeline based on the ensemble of BioBERT-CRF for NER and BERT for normalization. Table 3 shows a comparison of our best model to the official average scores computed using the participants' submissions. The end-to-end model ranked first at SMM4H 2019 Task 3 and obtained a relaxed F1 of 43.2%. The strict recall of the end-to-end system is about 15 percentage points lower than the recall of the NER system: 42.7 vs 57.6. Results in Tables 2 and 3 indicate that more than 80% of extracted ADR mentions have been correctly mapped to MedDRA concepts.

Conclusion
In this work, we have explored an application of Bidirectional Encoder Representations from Transformers (BERT) to the task of text classification, extraction of adverse drug reactions, and concept normalization. We have evaluated BERT and BioBERT empirically against bidirectional LSTM and GRU. Experiments have shown that BERT outperforms LSTM and GRU on all three tasks, achieving new state-of-the-art results in ADR extraction and normalization.
We foresee three directions for future work. One potential direction is to investigate neural architectures including BERT and RNNs in the end-to-end setup on other existing corpora. Another future direction is to explore how to effectively use contextual information to map entity mentions to medical concepts. Additionally, the effect of data imbalance can be investigated for BERT-based models.