NLP@UNED at SMM4H 2019: Neural Networks Applied to Automatic Classifications of Adverse Effects Mentions in Tweets

This paper describes a system for automatically classifying mentions of adverse effects in tweets, developed for Task 1 of the Social Media Mining for Health Applications (SMM4H) Shared Task 2019. We developed a system based on LSTM neural networks, inspired by the excellent results obtained by deep learning classifiers in the previous edition of this task. The network is trained together with Twitter GloVe pre-trained word embeddings.


Introduction
The Shared Task (Weissenbacher et al., 2019) of the 2019 Social Media Mining for Health Applications (SMM4H) Workshop proposed several Natural Language Processing (NLP) tasks based on social media mining for health monitoring. These tasks are as interesting as they are difficult to solve: the systems must handle many linguistic variations and model the different ways people express medical concepts on social media. In addition, we must take into account the noise caused by creative sentences, misspellings, and ambiguous or sarcastic expressions, all of which make these tasks hard to tackle.
For this shared task we decided to participate in the first task, which consists of finding tweets that mention Adverse Drug Reactions (ADRs), taking into account the linguistic variations between ADRs and indications (the reason for using the medication). We developed a system based on LSTM networks because of their strong results in the last edition of this task (Xherija, 2018).

Dataset
In this section we describe the dataset of Task 1 and the pre-processing applied to it. The task consists of finding tweets that mention ADRs, so we have to deal with raw text extracted from Twitter.
The publicly available dataset contains, for each tweet: (i) the user ID, (ii) the tweet ID, and (iii) a binary annotation indicating the presence or absence of ADRs. The dataset contains 24,606 manually tagged tweets: around 10% (2,358) mention ADRs, and the remaining 90% (22,248) do not.

Pre-processing
We normalized typical Twitter strings, replacing @user with <USER>, #hashtag with <HASHTAG>, and https://... with <URL>, to decrease the vocabulary size and reduce dataset variability by grouping several tokens under the same meaning.
We also handled elongated words such as "my goooood". In these cases we replaced each token with a unique representation, for example "aaargh" and "arrggggh" with "argh".
Finally, the last step was to replace constructions like "it's" with "it is" or "OMG" with "Oh my god" and to tokenize the text. For this step we used regular expressions and NLTK (Loper and Bird, 2002). Specifically, we used the TweetTokenizer class, which is especially useful for processing tweets: it splits the text into tokens, as other tokenizers do, but it also handles text elements such as emojis or exclamatory particles, which are correctly separated into new tokens.
We did not remove stop-words or lowercase the text, because doing so might change the meaning of a tweet drastically.
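The normalization steps above can be sketched with standard regular expressions. This is an illustrative sketch: the exact patterns, the contraction list, and the function name are our assumptions, not the paper's own code.

```python
import re

# Small sample of the contraction/abbreviation replacements described above.
CONTRACTIONS = {"it's": "it is", "OMG": "Oh my god"}

def normalize_tweet(text):
    """Illustrative normalization along the lines described in the paper."""
    text = re.sub(r"https?://\S+", "<URL>", text)   # URLs     -> <URL>
    text = re.sub(r"@\w+", "<USER>", text)          # mentions -> <USER>
    text = re.sub(r"#\w+", "<HASHTAG>", text)       # hashtags -> <HASHTAG>
    # Collapse runs of a repeated character so elongated words share one
    # representation ("aaargh", "arrggggh" -> "argh").  Note this simple
    # rule also shortens legitimate double letters.
    text = re.sub(r"(\w)\1+", r"\1", text)
    for short, long_ in CONTRACTIONS.items():
        text = re.sub(r"\b%s\b" % re.escape(short), long_, text,
                      flags=re.IGNORECASE)
    return text
```

NLTK's TweetTokenizer would then be applied to the normalized text to obtain the final token sequence.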
System

Our system is a neural network used along with Twitter GloVe (Pennington et al., 2014) embeddings. The input of the system is a tweet (a sequence of words), which is fed into an Embedding layer with a fixed input size; the weights of this layer are given by the GloVe word embeddings trained on 2 billion tweets. We chose these embeddings over others such as word2vec (Mikolov et al., 2013), godin (Godin et al., 2015) or shin (Shin et al., 2016) because Twitter GloVe is trained on tweets, which gives us a larger vocabulary that is also more similar to the text provided by the task. As can be seen in Figure 1, the next layer of our system is a Bi-LSTM layer. We chose it because a single LSTM has no access to future tokens, as they have not yet been seen. A Bi-LSTM has access to both past and future tokens, giving us complete knowledge of the tweet: one LSTM scans the sentence in one direction and the other scans it in the reverse direction. After these layers we use Dropout to prevent overfitting (Peng et al., 2015), with a rate of 0.3 for the Embedding layer and 0.5 for the Bi-LSTM layer. Finally, we added a Dense layer with a sigmoid activation function at the end of the network to obtain the final output.
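The architecture just described can be sketched in Keras. This is a minimal sketch assuming TensorFlow 2.x; the vocabulary size, sequence length, and LSTM width are illustrative placeholders, not the paper's exact values, and the random embedding matrix stands in for the real GloVe vectors.

```python
import numpy as np
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Embedding, Dropout, Bidirectional, LSTM, Dense

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 40, 100   # illustrative placeholders

# In practice this matrix is filled row by row from the Twitter GloVe
# vectors; random values stand in here so the sketch is self-contained.
embedding_matrix = np.random.normal(size=(VOCAB_SIZE, EMB_DIM))

model = Sequential([
    Embedding(VOCAB_SIZE, EMB_DIM,
              embeddings_initializer=initializers.Constant(embedding_matrix)),
    Dropout(0.3),                    # dropout rate used for the Embedding layer
    Bidirectional(LSTM(64)),         # forward and backward scans of the tweet
    Dropout(0.5),                    # dropout rate used for the Bi-LSTM layer
    Dense(1, activation="sigmoid"),  # probability that the tweet mentions an ADR
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The sigmoid output is a single probability, which is thresholded to obtain the binary ADR / no-ADR label.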
Regarding hyperparameters, we tested several configurations before submitting the runs. In these tests we tuned the number of epochs, the batch size (32, 64 and 128), the embedding size (vectors of 50 and 100 dimensions in both embeddings), and the optimizer, considering Adam (Kingma and Ba, 2014) and AdaGrad (Duchi et al., 2011). We also handled the vocabulary tokens by padding the sequences on the right. In the end we chose the 3 configurations that reported the best results; their hyperparameters are shown in Table 1.
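Right padding, as mentioned above, simply truncates or extends every token-id sequence to the fixed input length. A minimal hand-rolled sketch (the helper name and pad id are ours; Keras' pad_sequences offers the same behaviour with padding="post"):

```python
def pad_right(sequences, max_len, pad_id=0):
    """Truncate or right-pad each token-id sequence to exactly max_len."""
    return [seq[:max_len] + [pad_id] * max(0, max_len - len(seq))
            for seq in sequences]
```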

Experiments and Results
For the implementation of the system we chose Keras and TensorFlow (Abadi et al., 2016), while for the pre-processing of the data we used Scikit-learn (Pedregosa et al., 2011), in particular for padding and for splitting the dataset into training, validation and test sets.
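The dataset split can be sketched with scikit-learn's train_test_split. The split ratios and the toy data below are our assumptions; the stratify argument keeps the roughly 10/90 ADR class ratio of the full dataset in every subset.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the token-id sequences and binary ADR labels.
X = [[i, i + 1] for i in range(10)]
y = [0, 1] * 5

# Hold out 20% for test, then carve 25% of the remainder out as validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)
```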
To evaluate our system we used the evaluation script provided by the organizers. Several experiments are shown in Table 2. In these experiments we used a network without embeddings (Base) and with two types of embeddings, one pre-trained on Wikipedia pages (Wikipedia GloVe) and the other trained on tweets (Twitter GloVe). Given the better performance of the configuration using the Twitter GloVe pre-trained embeddings, we decided to use it for the runs submitted to the task. Table 3 shows the official results for the three runs we submitted to Task 1, together with the task average score provided by the organizers. According to the results obtained, a greater number of epochs provides better overall results, although the recall begins to fall.

Conclusions
Considering the experiments carried out on the training set and the results obtained, we can say that using embeddings pre-trained on tweets has been positive, that a greater number of epochs has given us better performance, and that the best feature of our system is its recall, which is above the task average.
In the future, we will try to build a more complex system to improve its performance. For this task we will add new features such as POS tags and character embeddings, as well as an attention mechanism.