CLaC at SMM4H Task 1, 2, and 4

CLaC Labs participated in Tasks 1, 2, and 4 using the same base architecture for all tasks with various parameter variations. This was our first exploration of this data and the SMM4H Tasks; a unified system was therefore useful to compare the behavior of our architecture over the different datasets and how it interacts with different linguistic features.


Base system
The base system is a feed-forward neural network with a recurrent neuron. We decided to explore that architecture for independent purposes and used the SMM4H tasks to compare performance on different datasets and task descriptions.
We considered three variations of this architecture:
Full: A recurrent neuron that outputs a 20-dimensional vector is followed by a 3-layer feedforward neural net, all embedded in two decision neurons with soft-max activations. The feedforward network has 50, 25, and 12 neurons in its first, second, and third layers, respectively. Unless otherwise mentioned, the network has been trained for 100 epochs.
The recurrent neuron consists of an LSTM cell using tanh activations [Hochreiter and Schmidhuber, 1997]. The activation functions for the feedforward networks are also tanh.
NR: Only the recurrent neuron and the decision neurons are used, the feed-forward (N)etwork is (R)emoved.
Full+At: Attention is added to the full architecture. In contrast to Full, where the LSTM cell outputs a single vector, in Full+At the recurrent neuron outputs the sequence of vectors from all time steps.
We used the Keras package [Chollet and others, 2015] to implement the neural networks using TensorFlow as backend [Abadi et al., 2015].
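To make the layer sizes above concrete, the following sketch counts the trainable parameters of the Full and NR variants. It is an illustration under stated assumptions, not our implementation: it assumes an input dimension of 100 (word vectors without appended features), the standard LSTM parameterization with four gates and bias terms, and fully connected layers with biases. The function names are hypothetical, and the attention component of Full+At is not covered.

```python
def lstm_params(input_dim, units):
    # Four gates (input, forget, cell, output), each with input weights,
    # recurrent weights, and a bias vector.
    return 4 * (input_dim * units + units * units + units)


def dense_params(inputs, outputs):
    # Weight matrix plus one bias per output neuron.
    return inputs * outputs + outputs


def count_params(input_dim=100, lstm_units=20, hidden=(50, 25, 12), classes=2):
    # LSTM cell, then the feedforward layers, then the decision neurons.
    total = lstm_params(input_dim, lstm_units)
    previous = lstm_units
    for width in hidden:
        total += dense_params(previous, width)
        previous = width
    return total + dense_params(previous, classes)


full = count_params()          # Full: LSTM + 50/25/12 + 2 decision neurons
nr = count_params(hidden=())   # NR: LSTM feeding the decision neurons directly
print("Full:", full)
print("NR:", nr)
```

Under these assumptions, Full has 12343 trainable parameters and NR has 9722, so the LSTM cell accounts for most of the parameters in both variants.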

Input parameters
Tweets are normalized to a size of 25, padded with leading zeros or shortened from the end as required.
The input per tweet consists thus of 25 word vectors of size 100 compiled by the Word2Vec method [Mikolov et al., 2013] over the training data. The Gensim package [Řehůřek and Sojka, 2010] is used for the training of word vectors. The minimum number of occurrences for a word to be considered in the vocabulary is 1 and the window size has been set to 5. Other parameters involved in word vector training were left to the default values of the Gensim package.
Tweet representations are then binned to a batch size of 5, unless otherwise indicated.
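The normalization and batching steps can be sketched as follows. This is a minimal illustration, assuming that "shortened from the end" means dropping trailing tokens and that each tweet has already been mapped to a list of 100-dimensional word vectors; the lookup in the trained Word2Vec model is omitted, and the function names are hypothetical.

```python
TWEET_LEN = 25   # tokens per tweet, as stated above
VEC_DIM = 100    # word vector size, as stated above
BATCH_SIZE = 5   # default batch size, as stated above


def normalize_tweet(vectors, length=TWEET_LEN, dim=VEC_DIM):
    """Pad with leading zero vectors or drop trailing tokens to reach `length`."""
    kept = [list(v) for v in vectors[:length]]        # shorten from the end
    padding = [[0.0] * dim for _ in range(length - len(kept))]
    return padding + kept                             # zeros go in front


def batches(tweets, size=BATCH_SIZE):
    """Yield consecutive groups of at most `size` normalized tweets."""
    for start in range(0, len(tweets), size):
        yield tweets[start:start + size]


# A 3-token tweet is padded in front; a 40-token tweet loses its last 15 tokens.
short = normalize_tweet([[1.0] * VEC_DIM] * 3)
long = normalize_tweet([[float(i)] * VEC_DIM for i in range(40)])
```

Note that with this scheme any token beyond position 25 never reaches the network, regardless of its relevance to the task.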

Text features and knowledge sources
We also experimented with a few linguistic text features and a gazetteer list to see whether they might influence the results.

Gazetteer
Inspired by Task 1, detection of drug mentions, we scraped the name field of product fields in DrugBank [Wishart et al., 2017] to compile a gazetteer list for drugs. Due to time constraints, this resource was only minimally refined and contained many multi-word drug names such as One A Day and dosage specifications (Aspirin 80mg). The gazetteer information was appended to the word vector. Runs that use the gazetteer are identified as +Gaz.
For part-of-speech tagging we followed [Toutanova et al., 2003]. Following sentence splitting, tweets were tokenized and Twitter-specific tokens (@name and URLs) were removed from the token set. The remaining tokens were assigned one of 36 part-of-speech tags, resulting in a feature value range of integers from 1 to 36.
Following [Doandes, 2003], the part-of-speech tags were used to identify verb clusters. Voice, tense and aspect were assigned to each verb cluster, and the main verb in each verb cluster was identified. These features were also added to the respective word vectors of the main verbs.
We selected only indicative tenses for our binary tense feature.
Tokens were also checked against two ad hoc gazetteer lists of explicit negation triggers and modality terms and the binary features neg and mod were added to the respective word vectors.
Thus we created 4 linguistic features (tense, voice, POS, and modality) in addition to the gazetteer feature, that can be appended to word vectors for those words onto which the features project.
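How such features might be appended to a word vector can be sketched as below. The encoding is an assumption made for illustration: one extra dimension per feature, the POS tag as its integer index, and all other features as binary flags. The word lists are placeholders rather than our actual gazetteers, and the lookup matches single tokens only, so multi-word entries such as One A Day would need span matching that is not shown.

```python
DRUG_GAZETTEER = {"aspirin", "ibuprofen"}       # placeholder entries
NEGATION_TRIGGERS = {"not", "no", "never"}      # placeholder entries
MODALITY_TERMS = {"might", "should", "could"}   # placeholder entries


def append_features(token, vector, pos_tag, tense=0, voice=0):
    """Return the word vector extended with gazetteer and linguistic features.

    pos_tag is an integer from 1 to 36; tense and voice are binary flags
    that would only be set on the main verb of a verb cluster.
    """
    if not 1 <= pos_tag <= 36:
        raise ValueError("POS tag index must lie between 1 and 36")
    word = token.lower()
    features = [
        float(word in DRUG_GAZETTEER),      # +Gaz
        float(pos_tag),                     # POS
        float(tense),                       # tense
        float(voice),                       # voice
        float(word in NEGATION_TRIGGERS),   # neg
        float(word in MODALITY_TERMS),      # mod
    ]
    return list(vector) + features


# A 100-dimensional vector grows to 106 dimensions in this encoding.
extended = append_features("Aspirin", [0.0] * 100, pos_tag=12)
```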

Task 1
Task 1 was a basic binary categorization task, identifying tweets where a drug was mentioned in its medical sense (the detailed description of the tasks and data can be found in the overview paper [Weissenbacher et al., 2018]). The training data consisted of over 9000 tweets, balanced in both categories. Table 1 shows the results from some of the runs we compared in order to evaluate the effectiveness of our features. We selected a validation set of 1000 tweets from the training data and trained on the remaining tweets. We compared the training accuracy and the validation accuracy to get some indication of the degree of overtraining. We observe that the difference between training accuracy and validation accuracy is surprisingly small for such a small dataset. Moreover, the differences between our feature bundles are also rather small. The gazetteer list led to a marked improvement for training accuracy, but not necessarily validation accuracy. Paradoxically, the two best validation accuracy performances came from NR and Full+All (with Full+All+At adding a percentage point). That means that on the validation data, the contribution of the neural net plus gazetteer plus all linguistic features (plus attention) was matched by simply removing the neural net (NR).
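The hold-out procedure and the comparison of training and validation accuracy can be sketched as follows. This is an illustration only: the random selection, the seed, and the function names are assumptions, and 9000 stands in for the actual size of the training set.

```python
import random


def split_validation(examples, validation_size=1000, seed=0):
    """Randomly hold out `validation_size` examples; return (train, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[validation_size:], shuffled[:validation_size]


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def overfitting_gap(train_accuracy, validation_accuracy):
    """Training minus validation accuracy; a larger gap suggests more overtraining."""
    return train_accuracy - validation_accuracy


# Tweet indices stand in for the tweets themselves.
train, validation = split_validation(range(9000))
```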
We achieved a greater performance increase in training accuracy across all our configurations when training on Task 2 training data as well as on Task 1 training data. This improvement carries over to validation accuracy and F1 measure, but inconsistently. However, the overall results of different configurations showed less variation when also training on Task 2 training data. We speculate that this stabilization may stem from some disruptive effect of data from another task (but that can be expected to contain drug mentions) which might counterbalance overfitting. Our competition runs were all trained on both Task 1 and Task 2 training data.
It was clear from the beginning that our architecture is severely mismatched to the simple categorization task. The very small differences that our experiments generated show that the variations do not truly access different tweets. The extremely high training accuracy indicates to us a high degree of overfitting, with the danger of making the entire system somewhat brittle. Table 2 shows that our best competition run on Task 1 was with the Full architecture; adding the gazetteer list and two linguistic features reduced the performance. But the near equal performance of Full and NR+Gaz+POS (see footnote 1) confirms the findings of the validation data, namely that the performance contribution of the network can be matched by the gazetteer list plus some linguistic features. Interestingly, our official test results top the results we obtained on our validation set, which shows that the performance in this case was stable. The performance difference between the best and the last system was 0.1399.

Task 2
Task 2 had a semantic component that Task 1 lacked: it concerned distinguishing actual medication intake from possible medication intake and mere mention of a medication in a 3-way decision. We augmented the basic architecture with a third decision neuron for this task. The training data for Task 2 consisted of 14482 tweets that were highly imbalanced. Again, a validation set of 1000 tweets was randomly selected from the training data. Table 3 shows that the richer task definition led to a greater variance in team performance: the difference between the first and last placed team's best runs is .341 micro-averaged F measure. Unlike for Task 1, our performance was not commensurate with our validation performance: in validation runs Full+All+At was also the best run with a validation accuracy of 0.85. Note that our performance is determined in part by the lowest recall. These results suggest to us, first, that a custom-tailored architecture that better addresses the task can make a greater difference and, second, that our architecture showed more signs of overfitting than in Task 1.
Footnote 1: less obvious due to rounding in Table 2.
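For reference, a micro-averaged F measure for a multi-class decision can be computed as in the sketch below. Which classes enter the average is an assumption here (the overview paper defines the official metric); the integer labels 1, 2, and 3 and the function name are hypothetical.

```python
def micro_f(gold, predicted, classes):
    """Micro-averaged precision, recall and F1 over the given classes."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        for c in classes:
            if p == c and g == c:
                tp += 1      # correctly predicted class c
            elif p == c:
                fp += 1      # predicted c, gold label differs
            elif g == c:
                fn += 1      # gold is c, prediction differs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: six tweets, averaging over classes 1 and 2 only.
gold = [1, 1, 2, 3, 3, 2]
pred = [1, 2, 2, 3, 1, 3]
precision, recall, f1 = micro_f(gold, pred, classes=(1, 2))
```

Because counts are pooled over classes before the ratios are taken, a low recall on the selected classes directly lowers the final score.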

Task 4
Task 4 was the most semantics-oriented task we attempted. The binary task was to identify tweets that clearly indicate that someone received, or intended to receive, a flu vaccine.
Of the 8000 tweets mentioned in the task description, only 4502 tweets could be downloaded for our training data. Despite the very small size of the training data and the potentially deeper semantic distinction, our system performed the closest to the competition mean. Note that the general drug name gazetteer list was not useful for this task. CLaC's best run was by Full+All. It is interesting that what appeared to us as the semantically most difficult task has our best performance (measured in distance to the competition mean) due to a recall of .89. We speculate that there may be certain linguistic patterns that our features were able to detect that made this task more amenable to our architecture (in comparison) based on the fact that Full+All outperforms NR and Full+Voice+Tense significantly.

Conclusion
CLaC decided late to participate in SMM4H, testing across several tasks a uniform architecture that was not designed for them. Our conclusion is that the architecture, and in particular the input binning and normalizing techniques, have to be carefully reviewed, as they risk ignoring key terms in the input. The linguistic features showed some effect, as did the addition and removal of the network. Repeatedly, trial runs showed that removing the network could be offset by adding linguistic features to the recurrent neuron. The detailed interplay of these parameters has to be further studied.
However, we conclude that using the same architecture across several tasks (that are related, but differ significantly) is an interesting exercise and allowed us to gain additional insight. Despite its potential for gross overfitting, the architecture has shown promise. The linguistic features also proved effective, and most importantly, the two components interplay effectively, as demonstrated by the fact that in two tasks Full+All was our best performing run.
While each of the three tasks is interesting in itself and clearly has relevance to society at large, we find the juxtaposition of the three tasks very interesting for the ML/NLP researcher.