Detecting and Extracting of Adverse Drug Reaction Mentioning Tweets with Multi-Head Self Attention

This paper describes our system for the first and second shared tasks of the fourth Social Media Mining for Health Applications (SMM4H) workshop. We enhance tweet representation with a language model and distinguish the importance of different words with Multi-Head Self-Attention. In addition, transfer learning is exploited to make up for the data shortage. Our system achieved competitive results on both tasks with an F1-score of 0.5718 for task 1 and 0.653 (overlap) / 0.357 (strict) for task 2.


Introduction
Automatic adverse drug reaction (ADR) detection and extraction are of great benefit to public health, since they allow pharmacovigilance (Sarker and Gonzalez, 2015) to be performed on a broader and more automated scale. Recent research has focused on online public sources such as tweets because of their availability and authenticity (Onishi et al., 2018; Adrover et al., 2015; Salathé and Khandelwal, 2011).
The SMM4H shared task (Weissenbacher et al., 2019) was proposed to advance ADR recognition. Task 1 is a binary classification task that separates tweets mentioning an ADR from tweets that mention only a drug name; task 2 then extracts the positions of the ADR entities. Building on our work from last year (Wu et al., 2018), we extend our previous model with hierarchical tweet representation and multi-head self-attention (HTR-MSA) into a model that uses both hierarchical tweet representation and attention (HTA), and we use it to participate in both tasks. Moreover, additional features and a language model are incorporated to enhance the semantic representations. In task 1, transfer learning on a smaller dataset is exploited. In task 2, we add a CRF layer for the named entity recognition task.

Our Approach
Our HTA model consists of three parts: hierarchical word representation, hierarchical tweet representation, and tweet classification. Each part is introduced in turn.

Hierarchical Word Representation
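In order to cope with out-of-vocabulary medical terminology, misspellings and user-created abbreviations, we model each word at the character level before applying traditional word representation. We denote the character sequence of the $i$-th word as $[c_{i,1}, c_{i,2}, \ldots, c_{i,N}]$, and each character $c_{i,j}$ is mapped to an embedding vector $e_{i,j}$ through a character embedding matrix $E \in \mathbb{R}^{V \times D}$, where $V$ denotes the character vocabulary size and $D$ denotes the dimension of the character embedding.
After the character embeddings are generated, a character-level convolutional neural network is employed to capture local features of combined characters. Assuming the window size of the CNN filters is $2w + 1$ and $U_c$, $b_c$ are the kernel and bias parameters respectively, the convolutional representation $h_{i,j}$ of the character embedding vectors from position $j - w$ to $j + w$ is formed as follows:
$$h_{i,j} = f\left(U_c \, e_{i,(j-w):(j+w)} + b_c\right),$$
where $e_{i,(j-w):(j+w)}$ is the concatenation of the embedding vectors in the window and $f$ is a nonlinear activation function. To remove unnecessary information, we apply max pooling to retain only the most salient features of the $i$-th word.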
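The character-level convolution and max pooling described above can be sketched in plain Python. This is a minimal illustration rather than our actual implementation: the function name `char_cnn_word_vector`, the zero padding at word boundaries and the choice of ReLU as the activation are assumptions made for the example.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def char_cnn_word_vector(char_ids, emb, kernels, biases, w=1):
    """Max-pooled character-CNN feature vector of one non-empty word.

    char_ids: character indices of the word
    emb:      V x D character embedding table (list of lists)
    kernels:  one flat weight list of length (2w+1)*D per filter
    biases:   one bias value per filter
    """
    D = len(emb[0])
    pad = [0.0] * D
    # Zero-pad (an assumption) so every position j has a full window j-w..j+w.
    vecs = [pad] * w + [emb[c] for c in char_ids] + [pad] * w
    pooled = [float("-inf")] * len(kernels)
    for j in range(w, w + len(char_ids)):
        # Concatenate the embeddings inside the window around position j.
        window = [x for v in vecs[j - w : j + w + 1] for x in v]
        for f, (kernel, b) in enumerate(zip(kernels, biases)):
            h = relu(sum(k * x for k, x in zip(kernel, window)) + b)
            # Max pooling over positions keeps the most salient response.
            pooled[f] = max(pooled[f], h)
    return pooled
```

Each filter contributes one value to the word vector, so the output length equals the number of filters regardless of how many characters the word has.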
Other features are added at the word level, such as word2vec-twitter (Godin et al., 2015) word embeddings, POS tags from the NLTK library (Bird et al., 2009) and a sentiment lexicon. To strengthen the medical meaning of the word representation, the appearance of a word in the SIDER 4.1 medical lexicon is transformed into a one-hot vector as an additional feature. In addition, the ELMo language model embedding (Peters et al., 2018) is incorporated to compensate for the limited data and to obtain better semantic representations. Because ELMo incorporates character-level information, it suits our task better than language models that rely on a fixed word look-up dictionary.
The final output of our hierarchical word representation is the concatenation of character representation, word embedding, pos-tag, sentiment lexicon, medical lexicon feature and language model output.
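As a rough sketch of this concatenation, the snippet below assembles one word vector from its parts. The helper names, the two-dimensional one-hot encoding of lexicon membership and the single-number sentiment feature are assumptions for illustration; the text does not fix the exact encodings.

```python
def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def word_representation(char_vec, word_vec, pos_index, n_pos_tags,
                        sentiment, word, medical_lexicon, lm_vec):
    """Concatenate all word-level features into one flat vector."""
    # Assumed encoding: [1, 0] if the word is in the lexicon, else [0, 1].
    in_lexicon = [1.0, 0.0] if word.lower() in medical_lexicon else [0.0, 1.0]
    return (
        list(char_vec)                      # character-CNN output
        + list(word_vec)                    # pre-trained word embedding
        + one_hot(pos_index, n_pos_tags)    # POS tag
        + [float(sentiment)]                # sentiment lexicon score (assumed scalar)
        + in_lexicon                        # medical lexicon membership
        + list(lm_vec)                      # language model embedding
    )
```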

Hierarchical Tweet Representation
We first feed the word representations obtained from the previous module into a Bi-LSTM layer to encode long-distance information. The Bi-LSTM output for a sentence of length $M$ is denoted as $H = [h_1, h_2, \ldots, h_M]$. The second layer uses multi-head self-attention (Vaswani et al., 2017) to mine the internal relations between words in the same sentence. In our layout, the representation vector $m_{i,j}$ of the $j$-th word learned by the $i$-th attention head is computed as a weighted summation over $H$:
$$\alpha^i_{j,k} = \frac{\exp(h_j^\top U_i h_k)}{\sum_{n=1}^{M} \exp(h_j^\top U_i h_n)},$$
$$m_{i,j} = W_i \left(\sum_{k=1}^{M} \alpha^i_{j,k} h_k\right),$$
where $U_i$ and $W_i$ are the parameters of the $i$-th self-attention head, and $\alpha^i_{j,k}$ represents the attention weight between the $j$-th and $k$-th words. After concatenating the outputs of the $h$ different self-attention heads, we obtain the representation $m_j = [m_{1,j}; m_{2,j}; \ldots; m_{h,j}]$ of the $j$-th word.
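The following plain-Python sketch shows one way such a multi-head self-attention layer could be computed. It assumes a bilinear score between word pairs parameterized by U_i, a softmax over the sentence, and W_i as an output projection; this concrete form and all function names are assumptions for illustration.

```python
import math


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(H, U, W):
    """One head: score h_j^T U h_k, softmax over k, then project with W."""
    out = []
    for h_j in H:
        scores = [sum(a * b for a, b in zip(h_j, matvec(U, h_k))) for h_k in H]
        alpha = softmax(scores)
        # Weighted summation of the Bi-LSTM outputs.
        context = [sum(a * h_k[d] for a, h_k in zip(alpha, H))
                   for d in range(len(h_j))]
        out.append(matvec(W, context))
    return out


def multi_head_self_attention(H, heads):
    """Concatenate the per-head outputs for each word."""
    per_head = [attention_head(H, U, W) for U, W in heads]
    return [[x for head in per_head for x in head[j]] for j in range(len(H))]
```

Here `heads` is a list of `(U, W)` pairs, one per head, so the output dimension of each word is the sum of the row counts of the `W` matrices.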

Tweet Classification
For task 1, we use an additive attention mechanism to selectively combine word representations. The model is trained with a cost-sensitive weighted loss function (Santos-Rodríguez et al., 2009). Sentence-level binary labels are then generated for task 1. For task 2, however, word-level labels are needed, so we apply a CRF layer on top of the self-attention vectors produced by the lower layer to predict word-level entity tags.
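To make the task 1 head concrete, the sketch below pools word vectors with additive attention and computes a class-weighted cross-entropy for a single example. The parameter names (`V`, `v`, `q`), the tanh scoring function and the specific weighting scheme are assumptions; the cost-sensitive loss we cite may differ in detail.

```python
import math


def additive_attention_pool(M, V, v, q):
    """Pool word vectors M into one sentence vector with additive attention.

    Score of word j: q^T tanh(V m_j + v); weights are a softmax of the scores.
    """
    scores = []
    for m in M:
        hidden = [math.tanh(sum(a * b for a, b in zip(row, m)) + bias)
                  for row, bias in zip(V, v)]
        scores.append(sum(a * b for a, b in zip(q, hidden)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    pooled = [sum(a * m[d] for a, m in zip(alpha, M)) for d in range(len(M[0]))]
    return pooled, alpha


def weighted_cross_entropy(p_pos, label, pos_weight=1.0, neg_weight=1.0):
    """Binary cross-entropy where each class has its own misclassification cost."""
    eps = 1e-12  # avoids log(0)
    if label == 1:
        return -pos_weight * math.log(p_pos + eps)
    return -neg_weight * math.log(1.0 - p_pos + eps)
```

Setting `pos_weight` above `neg_weight` penalizes missed ADR tweets more heavily, which is one common way to handle a minority positive class.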

Experiment Settings
In our experiments, the word embeddings are 400-dimensional and the Bi-LSTM network has 2×200 units. The CNN has 400 filters with a window size of 3. The multi-head self-attention network has 16 heads, and the output dimension of each head is 16. Adam is used as the optimizer.
In task 1, the model is first trained on the CADEC medical ADR dataset (Karimi et al., 2015) for transfer learning. However, we do not adopt this method in task 2 because of the relatively small training dataset of that task. For word-level classification in task 2, we train a marginal CRF that outputs probabilities.
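The hyperparameters above imply a few derived dimensions, which the following sketch makes explicit. The class name `HTAConfig` and its field names are hypothetical; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HTAConfig:
    word_emb_dim: int = 400
    lstm_units_per_direction: int = 200
    cnn_filters: int = 400
    cnn_window: int = 3
    attention_heads: int = 16
    head_dim: int = 16

    @property
    def bilstm_output_dim(self):
        # Forward and backward states are concatenated: 2 x 200 = 400.
        return 2 * self.lstm_units_per_direction

    @property
    def attention_output_dim(self):
        # Concatenating 16 heads of size 16 gives 256 dimensions per word.
        return self.attention_heads * self.head_dim

    @property
    def cnn_half_window(self):
        # A window of size 2w + 1 = 3 corresponds to w = 1.
        return (self.cnn_window - 1) // 2
```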
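One reading of a CRF "with probabilities as output" is a linear-chain CRF that returns per-word tag marginals, computed with the forward-backward algorithm. The sketch below illustrates that computation in log space; it is an assumption about the general technique, not a description of the specific CRF implementation we used, and the function names are illustrative.

```python
import math


def logsumexp(xs):
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def crf_marginals(emissions, transitions):
    """Per-position tag probabilities of a linear-chain CRF.

    emissions[t][s]:   score of tag s at position t
    transitions[a][b]: score of moving from tag a to tag b
    """
    T, S = len(emissions), len(emissions[0])
    # Forward pass: alpha[t][s] sums over all tag prefixes ending in s at t.
    alpha = [list(emissions[0])]
    for t in range(1, T):
        alpha.append([
            emissions[t][s]
            + logsumexp([alpha[t - 1][a] + transitions[a][s] for a in range(S)])
            for s in range(S)
        ])
    # Backward pass: beta[t][s] sums over all tag suffixes following s at t.
    beta = [[0.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [
            logsumexp([transitions[s][b] + emissions[t + 1][b] + beta[t + 1][b]
                       for b in range(S)])
            for s in range(S)
        ]
    log_z = logsumexp(alpha[-1])  # log partition function
    return [[math.exp(alpha[t][s] + beta[t][s] - log_z) for s in range(S)]
            for t in range(T)]
```

Each row of the result sums to one, so a word-level tag can be chosen by taking the most probable tag at each position or by thresholding the probability of an ADR tag.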

Experiment Results
Detailed evaluation scores are given in Table 1 and illustrate the effectiveness of our approach. In task 1, our model outperforms the average score among all participants by 0.070. In task 2, the improvement is also substantial: 0.115 on relaxed F1 and 0.040 on strict F1. In addition, compared to the best model we submitted for task 1 last year (Wu et al., 2018), which reached an F1 score of 0.522, our method with the language model and transfer learning improves on the original model by 0.050.
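The gap between relaxed and strict F1 comes from how a predicted span is matched against a gold span. The sketch below assumes that strict matching requires identical boundaries while relaxed matching accepts any overlap, with spans given as half-open `(start, end)` character offsets; the official scorer may differ in detail, and the function name is illustrative.

```python
def span_f1(pred, gold, relaxed=False):
    """F1 over entity spans given as half-open (start, end) offsets."""
    def match(p, g):
        if relaxed:
            # Any overlap between the two spans counts as a match.
            return p[0] < g[1] and g[0] < p[1]
        # Strict: both boundaries must agree exactly.
        return p == g

    matched_pred = sum(1 for p in pred if any(match(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(match(p, g) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A prediction that clips one character off an ADR mention is therefore counted as correct under the relaxed criterion but as both a false positive and a false negative under the strict one.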

Conclusion
We design HTA, a hierarchical tweet representation and attention model for SMM4H shared tasks 1 and 2. Our model attains high evaluation scores on both tasks and shows promising application value.