Transfer Learning for Health-related Twitter Data

Transfer learning is promising for many NLP applications, especially in tasks with limited labelled data. This paper describes the methods developed by team TMRLeiden for the 2019 Social Media Mining for Health Applications (SMM4H) Shared Task. Our methods use state-of-the-art transfer learning methods to classify, extract and normalise adverse drug reactions (ADRs) and to classify personal health mentions from health-related tweets. The code and fine-tuned models are publicly available at https://github.com/AnneDirkson/SharedTaskSMM4H2019.


Introduction
Transfer learning is promising for NLP applications, as it enables the use of universal pre-trained language models (LMs) for domains that suffer from a shortage of annotated data or resources, such as health-related social media. Universal LMs have recently achieved state-of-the-art results on a range of NLP tasks, such as classification (Howard and Ruder, 2018) and named entity recognition (NER) (Akbik et al., 2018). For the Shared Task of the 2019 Social Media Mining for Health Applications (SMM4H) workshop, team TMRLeiden focused on employing state-of-the-art transfer learning from universal LMs to investigate its potential in this domain.
Task Descriptions

ADR Extraction
The purpose of Subtask 1 (S1) is to classify tweets as containing an adverse drug reaction (ADR) or not. These ADR mentions are subsequently extracted in Subtask 2 (S2) and normalised to MedDRA concept IDs in Subtask 3 (S3). MedDRA (Medical Dictionary for Regulatory Activities) is an international, standardised medical terminology (https://www.meddra.org/).

Personal Health Mention Extraction
The goal of Subtask 4 (S4) is to identify tweets that are personal health mentions, i.e. posts that mention a person who is affected as well as their specific condition (Karisani and Agichtein, 2018), as opposed to posts discussing health issues in general. Generalisability to both future data and different health domains is evaluated by including data from the same domain collected years after the training data, as well as data from an entirely different disease domain.
Our Approach

Preprocessing
We preprocessed all Twitter data using the lexical normalisation pipeline by Sarker (2017). We also employed an in-house spelling correction method (Dirkson et al., 2019). Additionally, punctuation and non-UTF-8 characters were removed using regular expressions.
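A minimal sketch of the regex cleanup step is given below. The Sarker (2017) normalisation pipeline and the in-house spelling corrector are separate components and are not reproduced here; the helper name and the exact patterns are illustrative assumptions, not the paper's actual code.

```python
import re

def clean_tweet(text: str) -> str:
    """Hypothetical cleanup helper sketching the regex step only."""
    # One plausible reading of "non-UTF-8 characters": drop anything
    # outside the basic ASCII range (e.g. mojibake from bad decoding).
    text = re.sub(r'[^\x00-\x7f]', ' ', text)
    # Remove punctuation, keeping word characters and whitespace.
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse whitespace runs left behind by the substitutions.
    return re.sub(r'\s+', ' ', text).strip()

print(clean_tweet('this médicine gave me a he@dache!!!'))
# -> 'this m dicine gave me a he dache'
```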

Additional Data
Personal Health Mentions
For S4, the training data consists of data from one disease domain, namely influenza, in two contexts: having a flu infection and getting a flu vaccination. To improve generalisability, we supplemented this data with six labelled data sets from different disease domains (Karisani and Agichtein, 2018). We refer to this combined data set as S4+. For each subset, 10% was used for a combined validation set. For fine-tuning the ULMfit universal language model, which was pre-trained on 28,595 Wikipedia articles (Wikitext-103; Merity et al., 2017b), the DIEGO Drug Chatter corpus (Sarker and Gonzalez, 2017) was combined with the data from S1 and S4+ to form a larger unsupervised corpus of health-related Twitter data ('TwitterHealth'). For S4, fine-tuning was also attempted with only the S4+ data.

              S1      S2*    S3     S4     S4+
Dev           -       130    76     -      -
Train         14,634  910    1,756  6,996  11,832
Validation    1,626   130    76     777    1,314
Test          5,000   1,000  1,000  TBA    TBA

Table 1: Data sets. *Only tweets containing ADRs were used for developing the system. TBA: to be announced.

Concept Normalisation
The MedDRA concept names and their aliases in both MedDRA and the Consumer Health Vocabulary (CHV) were used to supplement the data from S3. This data set is hereafter called S3+.

Text Classification
Text classification was performed with fast.ai ULMfit (Howard and Ruder, 2018). As recommended, the initial learning rate (LR) of 0.01 was determined manually by inspecting a plot of the loss against the log LR. Default language models were fine-tuned using AWD LSTM (Merity et al., 2017a) with (1) one cycle (LR = 0.01) for the last layer and then (2) ten cycles (LR = 0.001) for all layers. Subsequently, this model was used to train a classifier with F1 as the metric, a dropout of 0.5 and a momentum of (0.8, 0.7), in line with the recommendations. Training is done with (1) one cycle (LR = 0.02) on the last layer; (2) unfreezing of the second-to-last layer; (3) another cycle running from a 10-fold decrease of the previous LR down to this LR divided by 2.6^4, as recommended in the fast.ai MOOC. This is repeated for the next layer and then for all layers. The last step consists of multiple cycles until F1 starts to drop.
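A minimal sketch of this recipe in the fastai v1 API follows. The file names and data layout are placeholders, and the exact LR endpoints at each unfreezing stage are assumptions based on the fast.ai MOOC recipe the paper cites.

```python
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)
from fastai.metrics import FBeta

# --- Language model fine-tuning (file names are placeholders) ---
data_lm = TextLMDataBunch.from_csv('data', 'twitter_health.csv')
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
lm.fit_one_cycle(1, 1e-2)        # (1) one cycle on the last layer
lm.unfreeze()
lm.fit_one_cycle(10, 1e-3)       # (2) ten cycles on all layers
lm.save_encoder('ft_enc')

# --- Classifier with gradual unfreezing ---
data_clas = TextClasDataBunch.from_csv('data', 'labelled_tweets.csv',
                                       vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5,
                              metrics=[FBeta(beta=1)])  # F1 as the metric
clf.load_encoder('ft_enc')
clf.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))           # last layer only
clf.freeze_to(-2)                                     # second-to-last layer
clf.fit_one_cycle(1, slice(2e-3 / 2.6 ** 4, 2e-3), moms=(0.8, 0.7))
clf.freeze_to(-3)                                     # next layer
clf.fit_one_cycle(1, slice(2e-4 / 2.6 ** 4, 2e-4), moms=(0.8, 0.7))
clf.unfreeze()                                        # all layers
clf.fit_one_cycle(1, slice(2e-5 / 2.6 ** 4, 2e-5), moms=(0.8, 0.7))
# The paper repeats cycles at this last stage until F1 starts to drop.
```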
As an alternative classifier for S1, we used the absence of ADRs (noADE) according to the BERT-embedding NER method (see below), which was developed for the subsequent Subtask (S2) and aims to extract these ADR mentions. As a baseline for text classification, we used a linear SVC with unigrams as features. The C parameter was tuned on a grid from 0.0001 to 1000 (steps of ×10).
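A minimal scikit-learn sketch of this baseline follows. The F1 scoring and 5-fold cross-validation are assumptions; the paper only specifies the unigram features and the C grid.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('unigrams', CountVectorizer(ngram_range=(1, 1))),  # unigram counts
    ('svc', LinearSVC()),
])
# C grid from 1e-4 to 1e3 in steps of x10.
param_grid = {'svc__C': [10.0 ** k for k in range(-4, 4)]}
search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=5)
search.fit(train_texts, train_labels)  # placeholders for the S1/S4 data
print(search.best_params_)
```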

Named Entity Recognition
For S2, we experimented with different combinations of state-of-the-art Flair embeddings (Akbik et al., 2018), classical GloVe embeddings and BERT embeddings (Devlin et al., 2018), using the Flair package. We used pre-trained Flair embeddings based on a mix of Web data, Wikipedia and subtitles, and the 'bert-base-uncased' variant of the BERT embeddings. We also experimented with Flair embeddings combined with GloVe-style word embeddings (dimensionality of 100) based on FastText embeddings trained on Wikipedia (GloveWiki) or on Twitter data (GloveTwitter). Training for all embeddings was done with an initial LR of 0.1, a batch size of 32 and the maximum number of epochs set to 150.
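A minimal Flair sketch of one such combination (BERT stacked with forward/backward Flair embeddings) follows. The corpus files, the tagger hidden size and the choice of the 'mix' checkpoints (Flair's Web + Wikipedia + subtitles models) are assumptions.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import (BertEmbeddings, FlairEmbeddings,
                              StackedEmbeddings)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# BIO-tagged ADR spans in CoNLL-style column files (names are placeholders).
corpus = ColumnCorpus('data', {0: 'text', 1: 'ner'},
                      train_file='train.txt', dev_file='dev.txt')
tag_dict = corpus.make_tag_dictionary(tag_type='ner')

# BERT stacked with the 'mix' Flair LM embeddings described above.
embeddings = StackedEmbeddings([
    BertEmbeddings('bert-base-uncased'),
    FlairEmbeddings('mix-forward'),
    FlairEmbeddings('mix-backward'),
])
tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dict, tag_type='ner')
ModelTrainer(tagger, corpus).train('models/adr_ner',
                                   learning_rate=0.1,
                                   mini_batch_size=32,
                                   max_epochs=150)
```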
As a baseline for NER, we used a CRF with the default L-BFGS training algorithm and elastic-net regularisation. As features for the CRF, we used the lowercased word, its suffix, the word shape and its POS tag.
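A minimal sketch of this baseline with sklearn-crfsuite follows. The suffix length and the regularisation strengths are illustrative assumptions, not values reported in the paper.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(sent, i):
    """Features for one token: lowercased form, suffix, shape and POS tag."""
    word, pos = sent[i]
    return {
        'word.lower': word.lower(),
        'suffix3': word[-3:],   # suffix length of 3 is an assumption
        'shape': ''.join('X' if c.isupper() else 'x' if c.islower()
                         else 'd' if c.isdigit() else c for c in word),
        'pos': pos,
    }

# Toy example; the real input is (token, POS) pairs with BIO ADR labels.
train_sents = [[('I', 'PRP'), ('hate', 'VBP'), ('headaches', 'NNS')]]
train_tags = [['O', 'O', 'B-ADR']]

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
# L-BFGS with elastic-net regularisation; c1 (L1) and c2 (L2) penalties
# are illustrative values.
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1)
crf.fit(X, train_tags)
```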

Concept Normalisation
For S3, pre-trained GloVe embeddings were used to train document embeddings on the extracted ADR entities in the S3 data, including or excluding the aliases from the CHV (S3+), with concept IDs as labels. We used the default RNN in Flair with a hidden size of 512. The GloVe embeddings (dim = 100) were based on FastText embeddings trained on Wikipedia. Token embeddings were re-projected (dim = 256) before being input to the RNN (a minimal sketch follows at the end of this section).

Results and Discussion
For all four subtasks, our best transfer learning system consistently performs better than the average over all runs submitted to SMM4H. For classifying ADR mentions, our overall best performing system is a ULMfit model trained on the TwitterHealth corpus (see Table 2). Yet, the highest recall is attained by using the absence of named entities (noADE) as a classifier. This is in line with our validation results (see Table 6). For extracting ADRs, our best system is a combination of BERT and Flair embeddings without a separate classifier for sentences containing ADR mentions (see Table 3). However, using BERT embeddings alone with the ULMfit classifier from S1 appears to be more precise. During validation, we found that combinations of GloVe embeddings (based on Twitter or Wikipedia) and Flair embeddings performed poorly compared to the submitted systems (see Table 7). For mapping the ADRs to MedDRA concepts, we only submitted one system with different preceding NER models (see Table 4), since adding the alias information (S3+) decreased both precision and recall (see Table 8). Our RNN document embeddings with only the S3 data, however, performed better than average. Lastly, for the classification of personal health mentions, our best classifier was a ULMfit model fine-tuned on the S4+ data (see Table 5), which outperformed both the average result and the ULMfit model trained on the larger TwitterHealth corpus on all metrics. This system similarly outperformed the other ULMfit model on the validation data (see Table 9).
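For reference, a minimal Flair sketch of the document-embedding normaliser described at the start of this section follows. The corpus files and their FastText-style label format are assumptions.

```python
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# ADR strings labelled with MedDRA concept IDs, one per line in
# FastText format ("__label__<concept-id> <adr text>"); placeholders.
corpus = ClassificationCorpus('data', train_file='s3_train.txt',
                              dev_file='s3_dev.txt')

# Default Flair document RNN: hidden size 512, token embeddings
# re-projected to 256 dimensions before the RNN. WordEmbeddings('glove')
# is one reading; the wiki-based FastText variant would be
# WordEmbeddings('en').
document_embeddings = DocumentRNNEmbeddings(
    [WordEmbeddings('glove')],
    hidden_size=512,
    reproject_words=True,
    reproject_words_dimension=256,
)
classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary())
ModelTrainer(classifier, corpus).train('models/meddra_norm')
```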

Conclusions
Transfer learning using default and recommended settings offers above-average results for various NLP tasks on health-related Twitter data. More research is necessary to investigate whether state-of-the-art performance is attainable with further domain-specific adaptation, for instance by tuning hyper-parameters, training embeddings on medical data, or by handling domain-specific vocabulary that is absent from the language model.