Adverse Drug Effect and Personalized Health Mentions, CLaC at SMM4H 2019, Tasks 1 and 4

CLaC Labs participated in Tasks 1 and 4 of SMM4H 2019. We pursued two main objectives in our submission: first, to use textual features in a deep neural framework, and second, to test the potential of combining more than one word embedding. The results appear to be positively affected by the proposed architectures.


Introduction
The ongoing SMM4H challenge defines evolving shared tasks on Twitter data (Weissenbacher et al., 2019). Epidemiologists aim to detect mentions of health issues on Twitter early. One of the challenges is to detect real reports of personally experienced health issues and to distinguish them from generalizations, hypotheticals, news, and institutional advice.
Task 1 of SMM4H 2019, "Automatic classification of adverse effects mentions in tweets", asks systems to distinguish tweets that report an adverse drug effect (AE) from those that do not. The training data consists of 25,672 tweets with an imbalanced distribution: 2,374 positive and 23,298 negative labels. An example of an adverse effect mention in a tweet is:

"saphris gives me a mad appetite omg i hate this"

Task 4 is on "Generalizable identification of personal health experience mentions". Two specialized training sets were released, "flu vaccination" and "flu infection", comprising approximately 6,200 and 1,100 tweets respectively. The Task 4 training data was balanced. A sample positive tweet from this task is:

"I must say that flu shot packed a punch. #WorstInoculationEver"

The CLaC submission to SMM4H 2019 had three general goals: first, to experiment with architectures that can address both tasks; second, to compare different word embeddings for their individual, but also their combined, effectiveness; and third, to test whether we can augment the basic word-vector input with additional local and global knowledge from word lists and text preprocessing. The experiments remain inconclusive, due to an error in our submission pipeline.
User mentions (@) were removed from the tweets. URLs are annotated, as are the first-person pronouns I, my, and mine.
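These preprocessing steps can be sketched as follows. This is a minimal illustration using Python's re module; the placeholder token <URL> and the exact regular expressions are our assumptions, not necessarily those of the actual pipeline:

```python
import re

def preprocess(tweet: str) -> str:
    """Remove @-mentions and annotate URLs, as described above."""
    # Drop user mentions entirely.
    tweet = re.sub(r"@\w+", "", tweet)
    # Replace URLs with a placeholder token so their presence is annotated.
    tweet = re.sub(r"https?://\S+", "<URL>", tweet)
    # Collapse whitespace left over after removals.
    return " ".join(tweet.split())

print(preprocess("@doc saphris gives me a mad appetite http://t.co/x omg"))
# "saphris gives me a mad appetite <URL> omg"
```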
Negation and modality The spans of negation and modality are determined using the system of Rosenberg et al. (2012) and projected onto the token representation: tokens inside the span of a negation or modality trigger are marked by a binary flag appended to the respective word vector (see Figure 1). The presence of negation and/or modality may signal uncertainty in a tweet, indicating that it might not convey facts.
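The span projection can be sketched as follows; the character-offset representation of spans and the helper name are our assumptions:

```python
def span_flags(tokens, spans):
    """Mark each token with 1 if it falls inside any negation/modality span.

    tokens: list of (start, end) character offsets, one pair per token.
    spans:  list of (start, end) offsets of negation/modality scopes.
    Returns one binary flag per token, to be appended to its word vector.
    """
    flags = []
    for t_start, t_end in tokens:
        inside = any(s <= t_start and t_end <= e for s, e in spans)
        flags.append(1 if inside else 0)
    return flags

# "I may not go": the scope of "may" covers "not go" (offsets 6..12)
tokens = [(0, 1), (2, 5), (6, 9), (10, 12)]
flags = span_flags(tokens, [(6, 12)])  # [0, 0, 1, 1]
```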
URL Tweets about a personal experience do not usually include a URL. Specifically for Task 4, 80% of the tweets including a URL are negative. A binary URL feature encodes presence or absence of a URL in the tweet.

POS embedding We experimented with the notion of part-of-speech embeddings to address sparsity. A representation for each POS tag is obtained by training Word2vec on a POS-tagged corpus (on the tag sequences instead of the words themselves). We use the Penn Treebank tag set (36 tags) with a window size of 5.
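The idea can be illustrated with a toy stand-in. In the paper Word2vec itself is trained on the tag sequences; the stdlib-only co-occurrence counting below is only a simplified distributional sketch of the same principle, and the example corpus is invented:

```python
from collections import Counter

# Penn Treebank tags of a toy POS-tagged corpus; in the paper, Word2vec is
# trained on such tag sequences (window size 5) instead of the words.
tag_sentences = [
    ["PRP", "MD", "RB", "VB", "DT", "NN"],    # "I may not take this drug"
    ["NNP", "VBZ", "PRP", "DT", "JJ", "NN"],  # "saphris gives me a mad appetite"
]

def cooccurrence_vector(tag, sentences, window=5):
    """Toy distributional stand-in for a POS embedding: counts of tags
    co-occurring with `tag` within the window (not actual Word2vec)."""
    counts = Counter()
    for sent in sentences:
        for i, t in enumerate(sent):
            if t == tag:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                counts.update(sent[lo:i] + sent[i + 1:hi])
    return counts

# With gensim one would instead train, e.g.:
#   Word2Vec(sentences=tag_sentences, window=5, ...)
vec = cooccurrence_vector("PRP", tag_sentences)
```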
ADR lexicon Terms from the Diego Lab adverse drug reaction lexicon (Nikfarjam et al., 2015) are indicated as a binary, tweet-level feature, in order to increase recall.
First person personal pronoun The first-person pronouns I, my, and mine are indicated at token level by a separate binary feature. In both tasks, a tweet describing a personal experience is more likely to be a positive sample; the feature is therefore intended to enhance recall.
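Putting the token-level features together, the input vector for each token (as in Figure 1) is a plain concatenation; the toy embedding dimensionality and the flag ordering here are illustrative assumptions:

```python
def token_vector(embedding, neg_flag, mod_flag, first_person_flag):
    """Concatenate the word embedding with the binary token-level features
    (negation scope, modality scope, first-person pronoun), as in Figure 1."""
    return list(embedding) + [neg_flag, mod_flag, first_person_flag]

# A toy 4-dimensional embedding for "my", inside a modality scope:
vec = token_vector([0.1, -0.2, 0.3, 0.0],
                   neg_flag=0, mod_flag=1, first_person_flag=1)
# vec == [0.1, -0.2, 0.3, 0.0, 0, 1, 1]
```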

System architecture
Our system has two parallel branches and is trained in two stages. One branch works only with BERT word embeddings; the other works on our concatenated token-level features plus word embeddings (Word2Vec/Glove), shown in Figure 1. The input vectors of each branch are fed into Bi-LSTMs, followed by attention and finally two softmax decision neurons.
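The attention layer on top of the Bi-LSTM outputs can be sketched as a softmax-weighted sum of the hidden states. The scoring function is not specified above, so the per-token scores are taken as given here; this is a minimal pooling sketch, not the actual implementation:

```python
import math

def attention_pool(hidden_states, scores):
    """Softmax the per-token attention scores and return the weighted sum
    of the Bi-LSTM hidden states (one pooled vector per tweet)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

# Three token states of dimension 2; the second token gets most attention.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                                 [0.0, 2.0, 0.0])
```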
After optimizing each branch with binary cross-entropy loss, the parameters of the networks are frozen for the second stage of training. We train an SVM on the input vector that concatenates the class probabilities provided by the softmax neurons with the tweet-level features, ADR and URL.
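The second-stage input is thus a simple concatenation of the two branches' softmax outputs with the tweet-level flags; a sketch, where the exact ordering of the components is our assumption:

```python
def svm_input(p_bert_branch, p_feature_branch, adr_flag, url_flag):
    """Concatenate the softmax class probabilities of both frozen branches
    with the tweet-level ADR and URL features for the second-stage SVM."""
    return list(p_bert_branch) + list(p_feature_branch) + [adr_flag, url_flag]

# Branch outputs (neg, pos) for one tweet containing an ADR term but no URL:
x = svm_input([0.3, 0.7], [0.4, 0.6], adr_flag=1, url_flag=0)
# x is one row of the training matrix fed to the SVM.
```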
The network is optimized using the Adam optimizer (Kingma and Ba, 2014) with learning rate lr = 0.001 for 5 epochs (for both tasks). For Task 1, the class weights cw_pos = 1 and cw_neg = 0.4 are used for the positive and negative samples respectively. For the SVM, the RBF kernel is used with γ = 0.001. The hyper-parameters were chosen by cross-validation. The first-stage deep net is implemented using Keras (https://keras.io) and the second-stage SVM classification is implemented using Scikit-learn (Pedregosa et al., 2011).
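The effect of the class weights can be written as a weighted binary cross-entropy; a sketch with the stated weights (the exact weighting mechanism inside the Keras loss is assumed here):

```python
import math

CW_POS, CW_NEG = 1.0, 0.4  # Task 1 class weights from the paper

def weighted_bce(y_true, p_pred):
    """Binary cross-entropy where positive samples are weighted 1.0 and
    negative samples 0.4, countering the roughly 1:10 class imbalance."""
    losses = []
    for y, p in zip(y_true, p_pred):
        if y == 1:
            losses.append(-CW_POS * math.log(p))
        else:
            losses.append(-CW_NEG * math.log(1 - p))
    return sum(losses) / len(losses)

loss = weighted_bce([1, 0], [0.9, 0.2])
```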

Development phase
During the development phase we considered a number of different features and performed an ablation study with more than 130 different configurations. For this phase, 22,000 and 3,672 samples were used for the training and test sets respectively.
An interesting observation was the different behavior of the word embeddings in the presence of language features. For Task 1, Glove embeddings usually performed better, whereas in Task 4, Word2Vec embeddings were generally superior. In Task 1, adding textual features to Word2Vec embeddings resulted in a decrease in performance; adding the same features to Glove, however, resulted in increased performance. This effect was small, but persistent across ablation of the other features, and we concluded that the different behaviors of the embedding vectors could be leveraged in an ensemble. For Task 1, the ADR word list generally increased recall in our ablation studies, demonstrating that domain-specific gazetteer lists can effectively supplement training data. In combination with Glove, textual features such as negation and modality increased precision but diminished recall. Adding ADR to this combination (Glove+Neg+Mod+BERT) compensates for the drop in recall without significantly decreasing precision. The results also corroborate the hypothesis that the first-person pronoun feature enhances recall (W2V+1st and W2V+1st+BERT compared to W2V and W2V+BERT).
Looking at the confusion matrix reveals that the model (specifically Glove+BERT) associates drug mentions in the subject position with positive labels, incurring a considerable number of false positives, for instance:

"this lozenge has my sore throat fading"

"paxil makes you susceptible to sunburns?"
The ADR feature (Glove+ADR+BERT) reduces these false positives, while introducing other false positives. As mentioned before, ADR generally increases recall, but in some configurations with Glove it also increased precision, which is interesting; we will study this in more detail.
Modality reduces false positives and is the most effective token-level textual feature; several false positives of Glove+BERT are correctly classified once the modality feature is present. When combined with Glove, the negation feature degrades the F1 score; however, it interplays well with the modality feature.
For Task 4, combining textual features with Word2Vec increases precision. The URL feature by itself increases precision even more, but incurs a larger drop in recall.

Evaluation phase
Task 1 We submitted three configurations to Task 1: Glove with our textual features, W2V alone, and W2V with the first-person pronoun feature (all used in an ensemble with BERT). These were not our top-performing configurations during development; rather, we included W2V to bridge to Task 4, and we included two runs with different textual features and one without. The performance of our system in the competition is provided in Table 3; the competition performance of all three models is within ±2% F1 of our development results. Moreover, the three configurations performed near-identically, and all three were above the competition mean.
It is interesting to note that the Word2Vec embeddings trained on Sentiment140 data proved as effective on this data set as Glove with the textual features, in contrast to our development experiments. We interpret the fact that W2V in an ensemble with BERT lies above the competition's mean as confirming the importance of our genre selection for Word2vec training, as Table 4 demonstrates.

Conclusions
We participated in the SMM4H 2019 shared task with two major ideas. First, we used textual annotations in a deep net architecture, specifically proposing encodings for negation, modality, and a gazetteer list. Our observations during the development phase showed that textual features are effective for enhancing the performance of the system, but that standard embedding vectors without additional textual features give comparable performance on these datasets. Our second idea was to use more than one type of embedding as an ensemble and to aggregate the predictions with a support vector machine rather than simple majority voting. This worked well, but again, on the datasets of this challenge, the computational overhead seems questionable for the degree of improvement achieved.