Shot Or Not: Comparison of NLP Approaches for Vaccination Behaviour Detection

Vaccination behaviour detection deals with predicting whether or not a person received or was about to receive a vaccine. We present our submission for the vaccination behaviour detection shared task at the SMM4H workshop. Our findings are based on three prevalent text classification approaches: rule-based, statistical and deep learning-based. Our final submissions are: (1) an ensemble of statistical classifiers with task-specific features derived using lexicons, language processing tools and word embeddings; and (2) an LSTM classifier with pre-trained language models.


Introduction
Public opinion about vaccines is diverse. Most people support vaccination, but some of these people do not receive vaccination. On the other hand, people who are vaccinated may also have concerns regarding the safety or efficacy of vaccines. In other words, a person's stance towards vaccines (referred to as 'vaccine hesitancy') is distinct from whether or not they received a vaccine shot (referred to as 'vaccination behaviour'). While automatic detection of vaccine hesitancy has been explored in the past, computational approaches to detect vaccination behaviour have been limited. Towards this, our paper deals with vaccination behaviour detection (SMM4H shared task #4). Vaccination behaviour and vaccine hesitancy can together help to understand penetration of vaccination programmes and the trust that communities place in large-scale vaccination programmes (Holt et al., 2016).
Vaccination behaviour detection is the task of predicting whether or not a given piece of text refers to a person receiving or intending to receive a vaccine. For example, the tweet 'I took the vaccine this morning, feeling great!' is positive because the speaker reports having received the vaccine. On the contrary, 'Vaccines drastically reduce risks of infection' is negative because the tweet describes vaccines but does not report a vaccine being administered.
Past work in vaccination behaviour detection uses n-grams as features of a statistical classifier (Skeppstedt et al., 2017; Huang et al., 2017). However, alternatives to n-grams have shown promise in several Natural Language Processing (NLP) tasks. Therefore, we compare three typical NLP approaches for vaccination behaviour detection: rule-based, statistical and deep learning techniques. Our submissions to the shared task use statistical and deep learning-based text classification. The systems are trained on a concatenation of the training and the validation set. The work reported in this paper ranked first among nine teams, as communicated by the shared task committee.

Approaches
In this section, we describe the three approaches that we employ for vaccination behaviour detection: statistical, rule-based and deep learning-based.

Statistical Approach
Our statistical approach uses an ensemble of three classifiers: logistic regression and a support vector machine, both using LIBLINEAR (Fan et al., 2008), and a random forest using scikit-learn (Pedregosa et al., 2011). We use the following non-default parameters: (a) the positive misclassification cost is set to 3 in logistic regression; (b) the random forest uses 100 estimators. Majority voting is used to combine predictions from the classifiers, i.e., a tweet must be predicted as positive by at least two classifiers for it to be predicted as positive by the ensemble.
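The majority-voting rule can be sketched in a few lines of Python. This is a minimal illustration, not our implementation: the function name `majority_vote` and the three prediction lists are hypothetical, and labels are assumed to be encoded as 1 (positive) and 0 (negative).

```python
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine binary predictions (1 = positive, 0 = negative).

    predictions[i][j] is classifier i's label for tweet j. A tweet is
    positive when more than half of the classifiers predict positive,
    which for three classifiers means at least two.
    """
    n_classifiers = len(predictions)
    n_tweets = len(predictions[0])
    combined = []
    for j in range(n_tweets):
        positive_votes = sum(clf_preds[j] for clf_preds in predictions)
        combined.append(1 if 2 * positive_votes > n_classifiers else 0)
    return combined


# Hypothetical outputs of the three classifiers on four tweets.
logreg_preds = [1, 0, 1, 0]
svm_preds = [1, 0, 0, 0]
forest_preds = [0, 1, 1, 0]

ensemble_preds = majority_vote([logreg_preds, svm_preds, forest_preds])
print(ensemble_preds)
```

With an odd number of classifiers no ties can occur, so no tie-breaking rule is needed.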
The random forest classifier uses unigrams as features. The features for logistic regression and  (Bird and Loper, 2004). This feature follows the intuition that presence of certain POS tags such as verbs may serve as signals;
4. Negation: Presence of a negation word. This captures cases where the act of receiving a vaccine is mentioned but the negation word changes the output class;
5. Word Similarity: For each word, we obtain similarity with 'receive', 'get' and 'take', and use the highest similarity as this feature. We use pre-trained embeddings from Mikolov et al. (2013). This allows the presence of words related to the act of receiving to be used as a signal for prediction;
6. Sentence Vector: A sentence vector is computed as an average of word vectors using GloVe embeddings (Pennington et al., 2014);
7. Length: Number of characters and words;
8. Emotion: Word counts of each emotion category as given by SenticNet (Cambria et al., 2014).
The combination of classifiers, misclassification costs and features has been experimentally validated.
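The Word Similarity and Sentence Vector features can be illustrated with a small sketch. Several points here are assumptions rather than details of our system: the toy three-dimensional embeddings stand in for the pre-trained vectors, similarity is taken to be cosine similarity, the maximum is taken over all words of a tweet, and out-of-vocabulary words are skipped.

```python
import math
from typing import Dict, List, Sequence

# Toy embeddings with made-up values, for illustration only.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "receive": [1.0, 0.0, 0.0],
    "get": [0.8, 0.6, 0.0],
    "take": [0.6, 0.0, 0.8],
    "got": [0.8, 0.6, 0.0],
    "flu": [0.0, 0.0, 1.0],
    "shot": [0.0, 1.0, 0.0],
}
ANCHOR_WORDS = ("receive", "get", "take")


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity_feature(tokens: Sequence[str],
                            embeddings: Dict[str, List[float]]) -> float:
    """Highest similarity between any known token and any anchor word."""
    scores = [
        cosine(embeddings[token], embeddings[anchor])
        for token in tokens if token in embeddings
        for anchor in ANCHOR_WORDS if anchor in embeddings
    ]
    return max(scores) if scores else 0.0


def sentence_vector(tokens: Sequence[str],
                    embeddings: Dict[str, List[float]],
                    dim: int = 3) -> List[float]:
    """Average of the vectors of known tokens; zero vector if none is known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


tokens = ["got", "flu", "shot", "today"]  # "today" is out of vocabulary
print(word_similarity_feature(tokens, TOY_EMBEDDINGS))
print(sentence_vector(tokens, TOY_EMBEDDINGS))
```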

Rule-based Approach
Since vaccination behaviour detection may appear to be only about detecting administration of a vaccine, we implement a naïve method to detect vaccination behaviour. Our rule-based approach looks for words indicating 'receive' (without negation) to predict vaccination behaviour as follows:
1. If a tweet contains one among the words 'give', 'take', 'taking', 'gave', 'giving', 'get', 'getting', 'took', 'receive' or 'received' and no negation word, predict the tweet as positive.
2. Else, predict the tweet as negative.
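The two rules above can be sketched as follows. The list of 'receive' words is taken from the rule itself; the negation word list and the regular-expression tokenizer are assumptions made for this illustration, since neither is specified here.

```python
import re
from typing import List

RECEIVE_WORDS = {
    "give", "take", "taking", "gave", "giving",
    "get", "getting", "took", "receive", "received",
}
# Assumed negation lexicon; the actual list used is not given in the text.
NEGATION_WORDS = {
    "no", "not", "never", "dont", "don't", "didnt", "didn't",
    "wont", "won't", "cannot", "can't",
}


def tokenize(tweet: str) -> List[str]:
    """Lowercase the tweet and keep alphabetic tokens (apostrophes allowed inside)."""
    raw = re.findall(r"[a-z']+", tweet.lower())
    return [token.strip("'") for token in raw if token.strip("'")]


def rule_based_predict(tweet: str) -> int:
    """Return 1 (positive) if a 'receive' word occurs and no negation word does."""
    tokens = set(tokenize(tweet))
    has_receive_word = bool(tokens & RECEIVE_WORDS)
    has_negation = bool(tokens & NEGATION_WORDS)
    return 1 if has_receive_word and not has_negation else 0


print(rule_based_predict("I took the vaccine this morning, feeling great!"))
print(rule_based_predict("Vaccines drastically reduce risks of infection"))
print(rule_based_predict("I did not get the flu shot"))
```

Because the rule only checks for word presence, any tweet that mentions 'getting' or 'taking' in a non-reporting sense is also predicted as positive.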

Deep Learning-based Approach
We experiment with five typical deep learning-based models: We then build a softmax layer on top of this pretrained LSTM, and fine-tune the neural network with supervision (Howard and Ruder, 2018).
All models are implemented using TensorFlow (Abadi et al., 2016). The parameters are experimentally determined.
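As a rough sketch of the softmax layer placed on top of the pre-trained LSTM, let $h$ denote the vector that the LSTM produces for a tweet $x$, and let $w_k$ and $b_k$ be the weights and bias of the added layer for class $k$. These symbols are introduced here for illustration only, and how $h$ is pooled from the LSTM states is not specified in this section.

```latex
\[
P(y = k \mid x) \;=\;
\frac{\exp\left(w_k^{\top} h + b_k\right)}
     {\sum_{j \in \{0, 1\}} \exp\left(w_j^{\top} h + b_j\right)},
\qquad k \in \{0, 1\}
\]
```

Supervised fine-tuning then updates both the added layer and the LSTM parameters on the labeled tweets; a cross-entropy objective over $P(y \mid x)$ would be the usual choice, although that is an assumption here.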

Experimental Setup
The shared task provided three labeled datasets of tweets for evaluation: a training dataset (5751 tweets of which 1692 are positive), a validation dataset (1215 tweets of which 306 are positive) and a test dataset (161 tweets, labels undisclosed). We re-implement two past works as baselines (Skeppstedt et al., 2017; Huang et al., 2017). The two baselines use n-grams as features of a statistical classifier.

Results
We present our results in six parts. We first describe the performances on the training, validation and test sets. Then, to understand the components contributing to the performance, we perform additional evaluation: (a) impact of the size of the training set on the performance; (b) impact of the data source from which the language model is trained in the case of the deep learning approach; and (c) impact of the features on the performance of the statistical approach. Finally, we present an analysis of errors made by our system.

Performance on Training Set
The performance of our methods using 10-fold cross-validation is shown in Table 2. The performance of the re-implementation of baselines is comparable to the original papers. The low values in case of the rule-based approach highlight that vaccination behaviour detection is not a trivial task of detecting words that indicate administration of a vaccine. The best F-scores are achieved by the statistical approach (80.75%) and LSTM-LM (80.87%). This is an improvement of 3-4% over the baseline.
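For readers unfamiliar with the metric, the following sketch computes an F-score from gold and predicted labels. It assumes that the reported F-score is the F1 of the positive class, which is not stated explicitly in this section; the function name and the toy labels are hypothetical.

```python
from typing import Sequence


def positive_f_score(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 of the positive class (label 1): harmonic mean of precision and recall."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    false_pos = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
gold_labels = [1, 1, 0, 0, 1]
predicted_labels = [1, 0, 0, 1, 1]
print(positive_f_score(gold_labels, predicted_labels))
```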

Performance on Validation and Test Sets
The statistical approach achieves an average F-score of 81.56%, while LSTM-LM achieves 80.43% on the validation set. Similarly, the performance of our methods on the test dataset is shown in Table 3. We obtain an F-score of 86.06% with the statistical approach and 88.74% with the LSTM-LM on the test set of 161 instances.

Impact of Size of the Training Set
To analyse the impact of the training set size on the resultant performance, we show the F-scores for the two best approaches for varying sizes of the training set in Table 4. '20%' indicates that 20% of the training set was used to train the system while the validation set was used for evaluation. We observe that when training on a small amount of labeled data, LSTM-LM performs much better than the statistical model. This shows the benefit of transfer learning: knowledge learned from unlabeled data can be used to train a model with a small number of labeled instances.
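The protocol can be sketched as follows. This is an illustration under stated assumptions: the subsets are drawn by seeded random sampling (how the subsets were actually selected is not specified here), and `train_and_score` is a hypothetical callable that trains a system on the given subset and returns its F-score on the validation set.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (tweet text, label)


def subsample(examples: Sequence[Example], fraction: float, seed: int = 0) -> List[Example]:
    """Return a random subset containing roughly `fraction` of the examples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    subset_size = max(1, round(fraction * len(examples)))
    return random.Random(seed).sample(list(examples), subset_size)


def learning_curve(
    train: Sequence[Example],
    validation: Sequence[Example],
    fractions: Sequence[float],
    train_and_score: Callable[[Sequence[Example], Sequence[Example]], float],
) -> Dict[float, float]:
    """Train on each subset of the training set and score on the full validation set."""
    return {
        fraction: train_and_score(subsample(train, fraction), validation)
        for fraction in fractions
    }


# Placeholder data of the same size as the training set (5751 tweets).
toy_train = [("tweet %d" % i, i % 2) for i in range(5751)]
print(len(subsample(toy_train, 0.2)))
```

Fixing the seed keeps the subsets identical across the systems being compared, so that differences in F-score are not caused by different samples.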

Impact of Language Model Source in LSTM-LM
A pre-trained language model represents knowledge learned from source data that is applied to a classifier. To understand if the domain of this source data has an impact on the performance of the resultant classifier, we compare how effective different domains are for vaccination behaviour detection. We compare three datasets in Table 6. The SMM4H dataset is the training dataset for the task, while WikiText-103 (Merity et al., 2016) and IMDB (Maas et al., 2011) are datasets from Wikipedia and a movie review corpus respectively. The latter two are significantly larger than the SMM4H dataset. However, they only result in a marginally higher performance.

Impact of Features in the Statistical Approach
To understand how the features contribute to the statistical approach, we conduct ablation tests. Table 5 shows the degradation in F-score when each of the features is removed. The degradation is positive in all cases, which supports the value of the proposed features. The highest degradation is observed in case of POS-based features.
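The ablation procedure can be sketched as a loop that removes one feature group at a time. The `evaluate` callable is hypothetical: it stands for training the statistical system with the given feature groups and returning its F-score. The toy scoring function and its numbers are invented purely so that the sketch runs, and are not results.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def ablation_degradation(
    feature_groups: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Return, for each feature group, the score drop when that group is removed."""
    all_groups = frozenset(feature_groups)
    full_score = evaluate(all_groups)
    return {
        group: full_score - evaluate(all_groups - {group})
        for group in sorted(all_groups)
    }


# Made-up contributions, for illustration only.
TOY_CONTRIBUTION = {"pos": 3, "negation": 1, "length": 0}


def toy_evaluate(groups: FrozenSet[str]) -> float:
    """Stand-in scorer: a base score plus each included group's toy contribution."""
    return 70 + sum(TOY_CONTRIBUTION[group] for group in groups)


print(ablation_degradation(TOY_CONTRIBUTION, toy_evaluate))
```

A positive value means that removing the feature group lowers the score, i.e., the group contributes to the performance.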

Error Analysis
We analyse incorrectly predicted instances from the validation set. About 50% of errors contain first- or second-person pronouns. Nearly 44% of false negatives have negative sentiment about flu shots because of actual or expected unpleasant side-effects. The ratio of false negatives to false positives is 1.40. An analysis of 50 random false positives and 50 random false negatives is shown in Figures 1 and 2 respectively. The label 'Unsure' indicates that the error could not be assigned to any of the other categories. Some incorrectly classified instances for the different error sources are:
• Negative opinion but no claim whether they would take it, as in the case of 'Getting a flu vaccine after reading this article is crazy!'.
• Mentions of taking a flu shot without expressing sentiment, such as 'Flu shots for hubby and daughter... check.'.
• Took it or about to take it and expressed favourable opinion about shots, as in the case of the tweet 'We're headed to the @Brigham-Womens flu shot clinic! Getting vaccinated is good for you and your community.'.

Conclusions
We evaluate three text classification approaches for the task of vaccination behaviour detection. The rule-based approach considers simple presence of words, the statistical approach uses an ensemble of classifiers and task-specific features, while the deep learning approaches employ five neural models. On comparing the three approaches, we observe that an ensemble of statistical classifiers using task-specific features and a deep learning model using a pre-trained language model and an LSTM classifier obtain comparable performance for vaccination behaviour detection. Our findings in the error analysis, which show that vaccine hesitancy often conflicts with vaccination behaviour detection, will be helpful for future work.