Identifying Personal Experience Tweets of Medication Effects Using Pre-trained RoBERTa Language Model and Its Updating

Post-market surveillance, the practice of monitoring the safe use of pharmaceutical drugs, is an important part of pharmacovigilance. Collecting personal experiences related to pharmaceutical product use could help us gain insight into how the human body reacts to different medications. Twitter, a popular social media service, is being considered as an important alternative data source for collecting personal experience information about medications. Identifying personal experience tweets is a challenging classification task in natural language processing. In this study, we utilized three methods based on Facebook’s Robustly Optimized BERT Pretraining Approach (RoBERTa) to predict personal experience tweets related to medication use: the first combines the pre-trained RoBERTa model with a classifier; the second combines a classifier with the pre-trained RoBERTa model after it has been updated using a corpus of unlabeled tweets; and the third combines a classifier with a RoBERTa model trained from scratch on our unlabeled tweets. Our results show that all of these approaches outperform the published methods (Word Embedding + LSTM) in classification performance (p < 0.05), and that updating the pre-trained language model with tweets related to medications could improve the performance further.


Introduction
Personal experience is an important piece of information for health-related surveillance activities. Understanding one's health experience can help gain insight into the status of one's health, changes in one's health condition after an intervention, or the effects of any medications one took.
Investigating effects related to the use of pharmaceutical products is an important activity of post-market surveillance. First-hand information about patients' medication use most directly reflects the effects of the medication, whether beneficial or adverse. It is therefore necessary to find valuable data sources and to construct efficient methods for processing and analyzing these data.
The widespread availability of social media has made it possible for people to share their personal experiences freely online. Twitter is one of the most prevalent social media services, and data from social media such as Twitter have been applied in many health-related areas, for example: drug adverse events (Bian et al. 2012), public health (Paul et al. 2011; Parker et al. 2013), mental health (Coppersmith et al. 2014; Reece et al. 2017), dental pain (Heaivilin et al. 2011), influenza (Lee et al. 2013; Paul et al. 2015; Gesualdo et al. 2013; Aramaki et al. 2011; Byrd et al. 2016; Kagashe et al. 2017), breast cancer (Thackeray et al. 2013), and epidemic outbreak and spread detection (Ji et al. 2012).
Personal experience is about a person's encounters or observations related to his or her life. Personal experience information related to the use of medication is of unique value for post-market surveillance because it is first-hand information that reflects changes in health condition due to medication usage. Personal Experience Tweets (PETs) related to medication use are Twitter posts expressing one's personal experience after the administration of a medication. The experiences could be undesirable feelings caused by a medication's side effects, or beneficial effects that help improve a medication user's health condition. Collecting and understanding this information can help promote the safe use of medications and advance our healthcare practices.
Here are some examples of PETs related to medication use (the underscored text is for medication effects and the boldfaced for the medication):
"Slow release morphine almost killed me."
"my mother developed bleeding ulcers from naproxen and now they switched her to celebrex isnt that just as bad?"
"Ill check it out -I have a friend on Abilify and hes had some personality changes, IE agitation, hitting stuff, ect."
These tweets show that the effects are associated with a person's experience. In contrast, we define a tweet not describing a personal experience as a non-PET. The following are some examples:
"wish i had some xanax to put me to sleep"
"ativan please help me get some sleep tonight"
"i just took a dose of percocet with some strippers"
The above non-PETs, albeit mentioning medications or containing effect expressions, do not reflect personal experience.
Extracting PETs from the many kinds of Twitter posts is challenging for several reasons. First, Twitter data are noisy, and most tweets may be irrelevant to personal experience of health conditions. In addition, users usually post tweets in an informal and casual style, without following the rules of grammar and/or spelling. Finally, Twitter users are creative in coining short text to fit the needed information within the space limit. These characteristics make it more challenging to identify PETs accurately.

Related Works
Distinguishing PETs from non-PETs can be treated as a binary classification task. In conventional machine learning, algorithms require a set of manually engineered features extracted from the raw text and/or metadata (Jiang et al., 2016; Wijeratne et al., 2017), a process usually known as feature engineering, and the features chosen can significantly impact the classifier's performance. However, engineering useful, let alone optimal, features from tweets is difficult because of the limits of human knowledge and understanding, even for domain experts. Besides, feature engineering typically selects features based on statistics such as information gain, with little or no direct consideration of semantics. In other words, conventional machine learning with feature engineering may not be optimal for this task.
Previous research has made several efforts to improve performance in predicting personal experience tweets related to medication effects. In one of the earliest efforts, personal pronouns were considered an important feature (Jiang and Zheng, 2013). Later, Alvaro and colleagues engineered a set of features (Alvaro et al., 2015) that included Twitter-specific features, n-grams, punctuation elements, and topics, but the group decided to discard the topic feature because of the significant effort required and its minimal contribution to classification performance. A set of 22 engineered features based upon both textual content and metadata of tweets was proposed in constructing a corpus of personal experience tweets (Jiang et al., 2016). Subsequently, Calix and colleagues introduced the concept of the deep gramulator, which adds a textual feature containing expressions that occur in one class but not in the opposite class, to improve the discriminatory ability of the classification (Calix et al., 2017). Advances in neural embedding, which demonstrated state-of-the-art results in many classification tasks on textual data, motivated a new approach combining word embedding (word2vec) with a recurrent neural network, which demonstrated a significant improvement in classification performance (p < 0.05) (Jiang et al., 2018).
Thanks to the development of word embedding techniques and the long short-term memory (LSTM) neural network, subsequent work used word embedding algorithms such as fastText (Bojanowski et al. 2016) and word2vec (Mikolov et al. 2013) to build vector space models (VSM) that represent the semantics of tweets by learning from a corpus of 22 million unlabeled tweets. The vector representations of tweets were fed into an LSTM neural network for classification. All of these methods achieved better classification performance than the previous methods that used 22 human-engineered features with conventional classification algorithms (Jiang et al. 2016).
Unlike the word embedding + LSTM method, which needs to learn the VSM first and then train the LSTM network from scratch for classification, Google introduced a fine-tuning based approach by proposing the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al. 2018), which achieved record-breaking results in 11 downstream NLP tasks. In addition, BERT relies on contextual information rather than term co-occurrences. Facebook subsequently optimized the BERT training procedure and released the Robustly Optimized BERT Pretraining Approach (RoBERTa) model (Liu et al. 2019), which performed even better than BERT in downstream tasks. One important and useful aspect of both approaches is that the pre-trained models can be updated with new data, without the need to generate a new model from scratch with the added data, which generally requires a significant amount of computational resources.
In this study, we set the performance of the word embedding + LSTM neural network method as the baseline and investigated the performance improvements in PET prediction obtained with the pre-trained RoBERTa language model. We also studied updating the pre-trained RoBERTa language model, and training RoBERTa from scratch, with medication-related tweets, and analyzed the impact of each procedure on performance.

Method
In this work, we introduced three ways to identify personal experience tweets about medication effects using the RoBERTa language model: (1) Pretrained RoBERTa: adding a classifier to the standard pre-trained RoBERTa model and fine-tuning the model for classification; (2) Updated RoBERTa: updating the pre-trained RoBERTa language model with our dataset first, then adding a classifier to RoBERTa and fine-tuning the model for classification; and (3) Twitter RoBERTa: training the RoBERTa language model from scratch with our corpus of unannotated tweets, then adding a classifier for classification. Finally, 10-fold cross-validation was performed to gather the performance data, and statistical analysis was performed to determine whether the differences in performance among the methods were due to chance.
The pipelines of data processing and analysis are illustrated in Figure 1. Our process started with gathering Twitter data and performing text encoding after preprocessing. Afterwards, the encoded texts were used with the RoBERTa model and the classifier for each of our methods. The left pipeline is for the Pretrained RoBERTa approach, the middle one for Updated RoBERTa, and the right one for Twitter RoBERTa.

Text Encoding
Byte-Pair Encoding (BPE) (Sennrich et al. 2015) and an attention mask were applied to encode the raw text. BPE is a sub-word level encoding method that uses bytes as the base sub-word units. During tokenization, tokens that are not in the vocabulary, such as acronyms, abbreviations or misspellings, are split into known sub-word tokens. Compared to word-level encoding, BPE is flexible enough to tokenize words with special forms, is adaptable to most English documents, and avoids most unknown tokens in the input text. A sub-word vocabulary with 50K unique tokens was built before pre-training; we tested it with our dataset to ensure that our data were completely covered by this vocabulary and that tweet text was tokenized properly without leaving any unknown tokens. We therefore reused this sub-word vocabulary to encode our data, and each tweet was converted into a sequence of indices of tokens in the vocabulary.
After encoding, each tweet started with a special <s> token and ended with </s>. To achieve the fixed length of a sequence, we set the max token length to 64, and a special <pad> token was introduced to pad sequences to the max length. We ensured that this value of max token length could fit almost all of the tweets: only 0.003% of them were longer than 64 tokens. Also, an Attention Mask was applied to all of the input data to avoid performing attention on padding tokens. For each sentence, 0 is for padding tokens that should be masked, and 1 is for others that are not masked. Figure 2 shows an example of text encoding.
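The padding and masking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pad_and_mask` and the example token indices are hypothetical, the ids assigned to <s>, <pad> and </s> are assumed values, and truncating over-length tweets is an assumption, since the text does not say how the rare tweets longer than 64 tokens were handled.

```python
from typing import List, Tuple

BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # assumed ids for <s>, <pad>, </s>
MAX_LEN = 64                      # max token length stated in the text


def pad_and_mask(token_ids: List[int], max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Wrap sub-word ids in <s> ... </s>, pad to max_len, and build the attention mask."""
    # Assumption: over-length tweets are truncated so that <s> and </s> still fit.
    body = token_ids[: max_len - 2]
    ids = [BOS_ID] + body + [EOS_ID]
    mask = [1] * len(ids)        # 1 = real token, attended to
    n_pad = max_len - len(ids)
    ids += [PAD_ID] * n_pad      # fill the remainder with <pad>
    mask += [0] * n_pad          # 0 = padding token, masked out
    return ids, mask


# Hypothetical sub-word indices for a four-token tweet.
ids, mask = pad_and_mask([713, 16, 10, 1296])
print(len(ids), sum(mask))  # 64 6
```

The mask has exactly as many ones as there are real tokens (the four sub-words plus <s> and </s>), so attention is never computed over the padded positions.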

Pre-training
Pre-training the language model on a large corpus helps the model learn general properties of the language, so that it can perform better on downstream target tasks for which only a small dataset is available. The pre-trained model we used is RoBERTa, whose structure is based on Google's BERT model, with 12 layers, 768 hidden neurons, 12 self-attention heads and a total of 110M parameters. The RoBERTa model was released by Facebook AI (Liu et al. 2019) and pre-trained with the masked language model (MLM) task: 15% of tokens were randomly and dynamically selected for replacement; of these, 80% were replaced by a special token <mask>, 10% were kept unchanged, and the remaining 10% were replaced by a random token from the vocabulary. The pre-training procedure was performed on a total of over 160GB of uncompressed text for 500K steps with a batch size of 8K.
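The masking rule of the MLM task can be sketched in a few lines. The code below is an illustrative sketch rather than the released RoBERTa implementation: the function name `mask_tokens`, the id assigned to <mask>, the vocabulary size and the special-token ids are assumptions, and the label value -100 for unselected positions is a convention borrowed from PyTorch's cross-entropy loss, which ignores that index by default.

```python
import random
from typing import List, Tuple

MASK_ID = 50264          # assumed id of the <mask> token
VOCAB_SIZE = 50265       # assumed vocabulary size (about 50K tokens)
SPECIAL_IDS = {0, 1, 2}  # assumed ids of <s>, <pad>, </s>; never selected
IGNORE = -100            # label for positions the model is not asked to predict


def mask_tokens(ids: List[int], rng: random.Random,
                select_prob: float = 0.15) -> Tuple[List[int], List[int]]:
    """Apply the 15% / 80-10-10 masking rule and return (corrupted_ids, labels)."""
    corrupted = list(ids)
    labels = [IGNORE] * len(ids)
    for i, tok in enumerate(ids):
        if tok in SPECIAL_IDS or rng.random() >= select_prob:
            continue                      # token not selected for prediction
        labels[i] = tok                   # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_ID        # 80%: replace with <mask>
        elif roll < 0.9:
            corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


# Dynamic masking: calling the function again draws a fresh mask for the same tweet.
rng = random.Random(0)
example = [0, 713, 16, 10, 1296, 2]  # hypothetical encoded tweet
first_pass = mask_tokens(example, rng)
second_pass = mask_tokens(example, rng)
```

Because the mask is drawn at call time rather than fixed once during preprocessing, the same tweet can contribute different prediction targets in different passes, which is what "dynamically selected" refers to.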

Language Model (LM) Updating
Although the pre-trained model extracts the general features of linguistic expression from a large corpus, the data for our task may follow a different distribution. To adapt the pre-trained model to our task, we updated the pre-trained RoBERTa model with our corpus of 10M unlabeled tweets before training the classifier. In this updating procedure, we applied the same masking strategy as the masked LM task in the pre-training procedure described previously, with a set of newly designated hyperparameters (training steps: 53K/106K/160K, batch size: 64, optimizer: Adam, learning rate: 2×10⁻⁵).

Training RoBERTa from Scratch
Another way to let the model learn the properties and distribution of a new language environment is to train a new model from scratch with the new dataset, and this is also a viable option for our task. To determine whether a RoBERTa model trained on our corpus of tweets could perform better than Facebook's pre-trained model and the updating approach in predicting personal experience tweets, a new Twitter RoBERTa model was constructed with the same corpus of tweets as that used in the updating procedure. Due to the hardware ... Figure 3 illustrates the overview of the procedure of LM updating and training the Twitter RoBERTa from scratch.

Classifier Fine-tuning
A classifier consisting of a simple feedforward neural network was constructed by following RoBERTa's original design, which is adapted to RoBERTa's base concepts and structure. Facebook AI also officially recommends it for most downstream classification tasks. The classifier is made up of one hidden layer containing 768 units with a tanh activation function, followed by a sigmoid output. Between the RoBERTa model and its classifier, the first position of RoBERTa's output sequence (corresponding to the beginning-of-sentence token <s>) was extracted and used as the input of the classifier. A dropout with a rate of 0.1 was added before the hidden layer to prevent overfitting. We utilized this classifier structure for all three of our methods and fine-tuned the whole model with the officially recommended hyperparameters (epochs: 2, batch size: 32, optimizer: Adam, learning rate: 1×10⁻⁵).
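The forward pass of such a classifier head can be sketched in plain Python. This is a minimal inference-time sketch, not the authors' implementation: the names `dense` and `classifier_head` are hypothetical, the weights are random placeholders rather than trained values, the hidden size is reduced from 768 to 8 to keep the example small, and the dropout layer is omitted because dropout is only active during training.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one output value per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def classifier_head(cls_vec: Vector, w_hidden: Matrix, b_hidden: Vector,
                    w_out: Vector, b_out: float) -> float:
    """Map the <s> representation to a PET probability: hidden layer, tanh, sigmoid."""
    hidden = [math.tanh(h) for h in dense(cls_vec, w_hidden, b_hidden)]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid output in (0, 1)


HIDDEN = 8  # the text uses 768; reduced here for illustration only
rng = random.Random(0)
w_hidden = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
b_hidden = [0.0] * HIDDEN
w_out = [rng.gauss(0.0, 0.02) for _ in range(HIDDEN)]
cls_vec = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]  # stand-in for RoBERTa's <s> output

prob = classifier_head(cls_vec, w_hidden, b_hidden, w_out, 0.0)
print(0.0 < prob < 1.0)  # True
```

During fine-tuning, the gradient of the classification loss flows through this head into the RoBERTa layers, which is why the text speaks of fine-tuning the whole model rather than the classifier alone.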

Baselines
Jiang and colleagues (2018; 2019) investigated and published a set of methods based on word embedding algorithms and the LSTM neural network, which outperformed methods using human-engineered features with conventional classification models. Using a large corpus of unlabeled tweets, their approach generated a vector space model (VSM) to encode the words, and then trained and tested an LSTM-based classifier with a smaller set of annotated tweets. In our approach, we built the same (baseline) models by following the published structures and procedures: a VSM with 128 dimensions built by the word2vec, GloVe and fastText algorithms, and an LSTM layer with 128 hidden units and an L2 regularizer, followed by a fully connected layer with a sigmoid output. The models were trained with an Adam optimizer with a learning rate of 2×10⁻⁴ and a batch size of 32 for 5 epochs.

Data
Two corpora of Twitter data were used in our work.
A total of 22 million raw tweets were collected using the Twitter Streaming APIs from August 25, 2015, to December 7, 2016, and another set of 52 million raw tweets was collected from 2006 to 2017 using a home-made crawler that followed the permission policy specified in Twitter's robots.txt file. Both sets were gathered by searching for tweets with keywords from a set of brand and generic medication names. These two corpora were merged and filtered. After dropping duplicates and eliminating non-English tweets, a corpus of 10 million tweets remained. To study the changes in classification performance, the same corpus of 12,331 annotated tweets, published on GitHub by Jiang et al. (2018), was utilized.
For this task, the corpus of 10 million cleaned tweets was used for training the Twitter RoBERTa model from scratch as well as for updating the LM. Note that neither the LM updating nor the training-from-scratch procedure used any annotation labels, and the 12K annotated tweets were excluded from the 10 million tweets. The baseline methods used the same 10 million tweets to build their vector space models of neural embedding. Likewise, the baseline classifiers were trained and tested with the same 12,331 labeled tweets. Table 1 lists the composition of the annotated tweets.

Statistical Analysis
To determine whether differences in the results among the methods could be due to chance, we conducted statistical analyses comparing our methods with the baseline methods. In our hypothesis testing, the null hypothesis was that there is no difference between a pair of methods when the data remain the same. To this end, we partitioned the data into the same subsets for all methods in cross-validation; that is, each fold contains the same set of tweets for every method. This treatment allowed us to use the paired t-test on the performance measures of each pair of methods. We set the significance threshold to 0.05, meaning that a p-value below 0.05 (p < 0.05) was taken to indicate a statistically significant difference that is unlikely to be due to chance.
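The paired comparison described above can be sketched as follows. This is an illustrative sketch, not the analysis code used in the study: the function name `paired_t_statistic` and the per-fold scores are hypothetical, and instead of computing an exact p-value (which needs the t-distribution, not available in the Python standard library) the statistic is compared with the two-sided critical value for 9 degrees of freedom at the 0.05 level, which applies to 10 folds.

```python
import math
import statistics
from typing import Sequence

T_CRIT_DF9 = 2.262  # two-sided critical value of Student's t, alpha = 0.05, df = 9


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t statistic of the per-fold differences between two methods on identical folds."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods must be evaluated on the same folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Sample standard deviation (n - 1 in the denominator); raises
    # ZeroDivisionError if every fold has exactly the same difference.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err


# Hypothetical F1 scores over 10 shared folds (not results from the paper).
method_a = [0.80, 0.82, 0.81, 0.83, 0.79, 0.84, 0.82, 0.80, 0.81, 0.83]
method_b = [0.76, 0.79, 0.77, 0.80, 0.77, 0.79, 0.78, 0.78, 0.76, 0.80]

t_value = paired_t_statistic(method_a, method_b)
significant = abs(t_value) > T_CRIT_DF9  # equivalent to p < 0.05 for df = 9
```

Pairing by fold matters: because both methods see exactly the same tweets in each fold, fold-to-fold variation in difficulty cancels out in the differences, which makes the test more sensitive than an unpaired comparison of the two sets of scores.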

Results
To compare the performance of our methods with that of the baseline methods, 10-fold cross-validation was conducted for each method and the mean value of each classification measure was collected. Table 2 shows the classification performance measures of our methods and the baselines (the highest values are in boldface). Table 3 (in appendix) lists the results of the statistical analysis of each performance measure in cross-validation between our methods and the baseline methods.

Discussions
According to the results in Table 2, the RoBERTa approaches with or without updating achieved better performance than the baseline methods in all measures, and the Twitter RoBERTa model trained with our data also performed better in all measures except precision. These differences were statistically significant according to the p-values in Table 3 (p < 0.05). Overall, the RoBERTa models performed better than the Word Embedding + LSTM methods in this task.
The most noticeable improvements of the pre-trained and updated RoBERTa models over the baseline methods are in precision and recall, whereas the precision of the Twitter RoBERTa model remained relatively unchanged. Recall (sensitivity) is the proportion of actual positive instances that are predicted correctly, and precision is the proportion of positive predictions that are correct. A higher recall means that the model discovers more of the positive instances, and a higher precision means fewer false positives (FP) relative to true positives (TP) in the prediction. In other words, the RoBERTa models improve sensitivity and identify PETs more precisely, resulting in more true positives in the predicted PET class.
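The definitions of precision and recall used in this discussion can be made concrete with a short sketch. The function name `precision_recall_f1` and the toy labels are hypothetical; class 1 stands for PET and class 0 for non-PET.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive (PET) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 3 PETs and 5 non-PETs; one PET is missed and one non-PET is flagged.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
# precision = 2/3 (2 of 3 predicted PETs are correct); recall = 2/3 (2 of 3 PETs found)
```

Neither measure depends on the number of true negatives, which is why both remain informative on a dataset dominated by non-PETs.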
Another notable measure is the AUC of the ROC curve, which was also improved significantly, as shown by the curves in Figure 4. The ROC (Receiver Operating Characteristic) curve plots the true positive rate (TPR, or sensitivity) on the y-axis against the false positive rate (FPR) on the x-axis, and is commonly used to show how well a model can distinguish between two classes. The area under the ROC curve (AUC) summarizes the curve as a single score. The results in Table 3 show that the lowest p-values between our methods and the baseline methods are those for AUC, which may imply that, among all performance measures, AUC was improved most significantly. That is to say, our methods are good choices for this task and are more robust in distinguishing PETs from non-PETs.
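How an ROC curve and its AUC are obtained from classifier scores can be sketched as follows. The function names `roc_points` and `auc` and the toy scores are hypothetical; the sketch sweeps a decision threshold over the distinct scores and integrates the resulting curve with the trapezoidal rule, which is one common way to compute the AUC and not necessarily the procedure used in the study.

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(FPR, TPR) points obtained by lowering the decision threshold step by step."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear curve, using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: two PETs (label 1) and two non-PETs (label 0) with predicted scores.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
curve = roc_points(y_true, scores)
print(auc(curve))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation; in the toy example three of the four PET/non-PET pairs are ranked correctly, giving 0.75.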
Our methods also achieved a modest improvement in accuracy, but better accuracy should not by itself be interpreted as better performance. Our dataset is imbalanced (PETs : non-PETs = 1 : 3.16, as shown in Table 1), and accuracy is based on the prediction of both the positive and negative classes, so a high accuracy can be attributed largely to the imbalance: a classifier that labels every tweet as a non-PET would already reach an accuracy of about 76% (3.16/4.16). Thus, accuracy is not the most informative measure for this task.
The results also show that performing LM updating before classifier fine-tuning could yield further improvement in accuracy, precision, F1, and AUC. Nevertheless, the p-values indicate that updating the LM for more steps does not produce significant additional gains. For the Twitter RoBERTa model, which was trained from scratch, the number of training steps did affect performance in some measures, as supported by our statistical analysis. This outcome suggests that a larger number of steps is needed for performance improvement when training from scratch, whereas a small number of steps is enough for LM updating to achieve better performance than the original RoBERTa model.
The improvement of these RoBERTa-based methods over the baseline approaches may be attributed to the level of the features. The features extracted by a VSM such as word2vec are based on word-level co-occurrence, whereas RoBERTa extracts contextual features and may therefore be more powerful in processing tweet-like text, which is riddled with misspellings and incorrect grammar. A possible explanation for the performance difference between Updated RoBERTa and Twitter RoBERTa is the slow learning process when training from scratch. The updating process starts from the pre-trained RoBERTa model, which has already been pre-trained on a very large dataset by Facebook, so it may adapt to our dataset more easily, and a larger number of updating steps did little to improve performance further. Twitter RoBERTa, in contrast, was trained from scratch, and because only 15% of tokens were randomly masked, the model could learn from only a small part of each sentence at each step. It may therefore take more time to learn the data distribution, and a larger number of training steps is recommended.

Conclusion
In this study, we investigated different ways to use Facebook's RoBERTa model to improve performance in predicting personal experience tweets on medication use. Our results demonstrated that fine-tuning the pre-trained RoBERTa model achieved better classification performance than the previous Word Embedding + LSTM methods, and that the original pre-trained RoBERTa performed better than a new RoBERTa model trained from scratch. More importantly, updating the pre-trained RoBERTa language model with our data could yield even better performance. Using 10-fold cross-validation, we statistically tested the performance differences between our approaches and the baseline methods, and the results confirmed that the improvement is statistically significant (p < 0.05). This suggests that the pre-trained RoBERTa model and the LM updating method are better choices for this task and significantly improve the ability to identify personal experience tweets. It is conceivable that our method could be applied to other classification tasks using Twitter data related to health issues.

Acknowledgement
The authors wish to thank the College of Technology at Purdue University Northwest for providing funding to support this work.

References
Alvaro, N., Conway, M., Doan, S., Lofi, C., Overington, J. and Collier, N., 2015. Crowdsourcing Twitter annotations to identify first-hand experiences of prescription drug use. Journal of Biomedical Informatics, 58, pp. 280-287.