Hierarchical neural model with attention mechanisms for the classification of social media text related to mental health

Mental health problems represent a major public health challenge. Automated analysis of text related to mental health aims to support medical decision-making, inform public health policies and improve health care. Such analysis may involve text classification. Traditionally, automated classification has been performed mainly using machine learning methods involving costly feature engineering. Recently, the performance of those methods has been dramatically improved by neural methods; however, mainly convolutional neural networks (CNNs) have been explored. In this paper, we apply a hierarchical recurrent neural network (RNN) architecture with an attention mechanism to social media data related to mental health. We show that this architecture improves overall classification results compared to previously reported results on the same data. Benefiting from the attention mechanism, it can also efficiently select the text elements crucial for classification decisions, which can in turn be used for in-depth analysis.


Introduction
Mental health problems represent a major public health challenge worldwide, and the accumulation of big data offers the opportunity for improving healthcare processes, interventions, and public health policies (Stewart and Davis, 2016). Recent advances in data science, machine learning and Natural Language Processing (NLP) hold great promise in providing technical solutions for the analysis of large sets of clinically relevant information in Psychiatry (Torous and Baker, 2016). This includes not only routinely collected data such as Electronic Health Records (EHRs), but also patient-generated text or speech. Patient-generated content has been made available by social media, mainly in the form of tweets or forum posts (Névéol and Zweigenbaum, 2017; Gonzalez-Hernandez et al., 2017).
As opposed to, e.g., documentation produced by healthcare professionals, social media data captures thoughts, feelings and discourse in people's own voice, and these types of data sources are becoming very important for monitoring a number of public health issues, including mental health problems such as drug abuse, alcohol abuse, and depression (De Choudhury et al., 2014; Wongkoblap et al., 2017; Conway and O'Connor, 2016; Mikal et al., 2016; Sarker et al., 2016).
In this work, we address the problem of automatically classifying social media posts related to mental health derived from Reddit. Convolutional neural networks (CNNs) applied to this task have shown good performance in previous studies (Gkotsis et al., 2017). However, the performance of recurrent neural networks (RNNs) on the same task remains understudied. RNNs can be particularly beneficial here, as they are able to model the sequential structure of text. We also explore the contribution of attention mechanisms to establishing a hierarchy over the sequences.
To be more precise, we apply a hierarchical RNN architecture, as described in (Yang et al., 2016), to the classification of social media posts related to mental health problems, and seek to answer the following main questions: (a) Is a sequence-based model more beneficial than a CNN model for the accurate classification of social media posts? (b) Which parts of posts are most important for classifying a post into its mental health topic, as identified by the attention mechanism?
Our main contribution in this work is twofold: (1) we attempt to apply an RNN architecture to the text classification task of determining which mental health problem a post is about, which, to our knowledge, is the first attempt of its kind. We show that the ability of RNNs to take into account the sequence of events reflected in the post content can be beneficial for the classification of health-related social media text; (2) we also study the results of applying an attention mechanism to pinpoint the parts of a text that contribute most to classification decisions. Those results can be useful for in-depth analysis, for filtering out irrelevant content, and for reducing the computational cost of real-life applications. We provide a few examples, and discuss future directions in this area.
Recently, CNNs have been actively exploited for text classification in the medical domain (Baker and Korhonen, 2017; Yates et al., 2017). For instance, Yates et al. (2017) made an attempt at hierarchical classification: they merge the outputs of several CNNs, one per post, to create a representation (roughly, an automatically learned feature set) of the user's activity across his/her posts.
CNNs learn to extract a hierarchy of crucial text elements. RNNs, on the other hand, handle text as a sequence. This property of RNNs can be especially beneficial for analyzing health-related text, in which the order of described events can be important.
RNNs have been successfully used for document representation and consequently applied to a series of downstream NLP tasks such as topic labeling, summarization, and question answering (Li et al., 2015;Yang et al., 2016;Liu and Lapata, 2017).
As RNN architectures typically exploit an attention mechanism for hierarchical analysis, we also study whether this mechanism can provide insight into which words and sentences contribute to classification decisions. The mechanism also opens a range of attractive, less costly modeling perspectives, for instance the attempt by Vaswani et al. (2017) to replace recurrence entirely with attention. One of the side benefits of using an attention mechanism is that the results of its application can be interpreted, providing a powerful tool for further text analysis. In our work we reproduce the hierarchical document classification architecture (HIERRNN) proposed by Yang et al. (2016). This architecture progressively builds a document representation from its sentence representations, which in turn are composed of the representations of the words they contain. Those document representations are directly used by the architecture to make classification decisions.
To do so, the architecture employs a series of RNN encoders. An encoder reads an input sequence of words $X = \{x_1, \ldots, x_J\}$ and calculates a forward sequence of hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_J)$ and a backward sequence of hidden states $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_J)$. The hidden states $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$ are concatenated to obtain the resulting representation $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$.
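In code, such a bidirectional encoder with concatenated states can be expressed as follows (a minimal Keras sketch using the dimensions given later in our experiments; all names are ours):

```python
from tensorflow.keras import layers, Input, Model

# Bidirectional GRU over a sequence of word embeddings: merge_mode="concat"
# yields h_j = [h_fwd_j ; h_bwd_j] at every position j.
words = Input(shape=(None, 200))                            # J embeddings of size 200
states = layers.Bidirectional(layers.GRU(50, return_sequences=True),
                              merge_mode="concat")(words)   # output shape (J, 100)
encoder = Model(words, states)
```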
To be more precise, the architecture contains bidirectional encoders modeling the sentences of a document $d = \{x_1, \ldots, x_T\}$. Each sentence vector can be computed from its word representations in several ways (average, maximum, sum, etc.); we compute a weighted sum of those representations, with weights given by the attention mechanism. The resulting sentence vectors are input to the document encoder. The document vector (again computed from sentence representations) is in turn input to a softmax layer over document labels (see Figure 1).
The attention mechanism is used to weight the aggregated representations. More formally, an attention function maps a query and a set of key-value pairs to an output. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
To detect both words and sentences that are important to the meaning of a document, we employ the hierarchical attention mechanism:

$$u_j = t(W h_j + b), \qquad \alpha_j = \frac{\exp(u_j^\top u_g)}{\sum_k \exp(u_k^\top u_g)},$$

where $t(\cdot)$ is a non-linear activation function (tanh in our case). The importance of a unit is thus measured as the similarity of $u_j$ to the context vector $u_g$, which is jointly learned during the training process and serves as the query. The importance weight $\alpha_j$ is normalized through a softmax function. The document vector is then computed as the weighted sum

$$v = \sum_j \alpha_j h_j.$$
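The computation above can be sketched in a few lines of NumPy (an illustrative reconstruction; the variable names and toy dimensions are ours):

```python
import numpy as np

def attention_pool(H, W, b, u_g):
    """Pool encoder states H (J x 2d) into one vector via the attention above:
    u_j = tanh(W h_j + b), alpha = softmax(u_j . u_g), v = sum_j alpha_j h_j."""
    U = np.tanh(H @ W + b)                 # unit representations u_j
    scores = U @ u_g                       # similarity to the context query u_g
    scores -= scores.max()                 # for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H, alpha                # weighted sum and attention weights

# toy usage: 5 hidden states of size 100 (2 x 50 for a bidirectional encoder)
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 100))
W, b, u_g = rng.normal(size=(100, 100)), np.zeros(100), rng.normal(size=100)
v, alpha = attention_pool(H, W, b, u_g)
```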

Experimental Setup
We study the performance of the hierarchical architecture on the task of classifying posts from social media related to mental health.

Data
We use a dataset of posts from the social media platform Reddit. Each entry has been posted to a so-called subreddit, a topic-specific community within the platform. We use the posts and subreddits related to 11 mental health problems (i.e., a multiclass classification problem) that have been previously identified and used for text classification (Gkotsis et al., 2016, 2017).¹ In total, the dataset consists of 538,272 posts, with an imbalanced distribution per mental health topic (ranging from 4,360 posts in addiction to 197,436 in depression). The data and the mental health topics are described in detail in (Gkotsis et al., 2017). The 11 mental health topics are listed in Table 2.
¹ Data was obtained through the corresponding author of these studies and stored on encrypted computers.

Implementation Details
We implemented our document-level architecture using the Keras toolkit, with Gated Recurrent Units (GRUs) (Cho et al., 2014) as RNNs. We followed the implementation details in Yang et al. (2016): the word embedding dimensionality is set to 200, and the size of the hidden units of the encoder is 50. We set the input vocabulary size to 30K. We limit sentence length to 70 tokens, as is standard in downstream NLP tasks (Hewlett et al., 2017). We fix the size of a document to 17 sentences (an empirically chosen value corresponding to the third quartile of the overall distribution of document lengths in sentences); shorter documents are padded with dummy sentences. For training, we use a mini-batch size of 70. We train all models with stochastic gradient descent with a momentum of 0.9, minimizing the categorical cross-entropy loss, and choose the best learning rate using grid search.
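The following is a minimal Keras sketch of this setup with the stated hyperparameters. For brevity it pools word and sentence states with a maximum (one of the configurations compared in the evaluation below); the attention variant replaces the pooling layers with the weighted sum defined earlier. The learning rate is a placeholder, as the actual value was chosen by grid search, and all names are ours:

```python
from tensorflow.keras import layers, models, optimizers

MAX_SENTS, MAX_WORDS = 17, 70          # document / sentence length limits
VOCAB, EMB, HID, N_CLASSES = 30000, 200, 50, 11

# word-level encoder applied to a single sentence
words_in = layers.Input(shape=(MAX_WORDS,), dtype="int32")
x = layers.Embedding(VOCAB, EMB)(words_in)
x = layers.Bidirectional(layers.GRU(HID, return_sequences=True))(x)
sent_vec = layers.GlobalMaxPooling1D()(x)      # max over word states
sent_encoder = models.Model(words_in, sent_vec)

# sentence-level encoder applied to the whole document
doc_in = layers.Input(shape=(MAX_SENTS, MAX_WORDS), dtype="int32")
s = layers.TimeDistributed(sent_encoder)(doc_in)
s = layers.Bidirectional(layers.GRU(HID, return_sequences=True))(s)
doc_vec = layers.GlobalMaxPooling1D()(s)       # max over sentence states
out = layers.Dense(N_CLASSES, activation="softmax")(doc_vec)

model = models.Model(doc_in, out)
model.compile(optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss="categorical_crossentropy")
```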
As our dataset is highly imbalanced, we provide the system with class weights computed as inversely proportional to each class frequency.
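For illustration, one common way to compute such weights is the following (a sketch; it uses the same normalization as scikit-learn's "balanced" heuristic, which is an assumption on our part):

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# passed to Keras as model.fit(..., class_weight=inverse_frequency_weights(y_train))
print(inverse_frequency_weights(["depression"] * 9 + ["addiction"]))
```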

Evaluation
We compare our results for HIERRNN with the attention mechanism (RNN-att) to two other configurations, where we take a) the maximum of the vectors (RNN-max) or b) the average of the vectors (RNN-av) at both the word and sentence levels. We also compare our results to the baseline reported by Gkotsis et al. (2017) for a CNN-based architecture (CNN). This is a rather simple architecture with 5 layers: an embedding layer, a convolution layer (with a filter window of 5), a max-pooling layer, a fully connected layer and an output sigmoid layer. The results are directly comparable, as the experiments were performed on the same data split.
In terms of evaluation metrics, we use the standard set of precision (PR), recall (RC) and F-measure (FM). In addition, we manually review a random sample of the results from the attention mechanism and provide a few paraphrased examples (Benton et al., 2017).
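These metrics can be computed per class and averaged with scikit-learn, for instance (a sketch with toy labels):

```python
from sklearn.metrics import classification_report

# toy gold and predicted topic labels
y_true = ["depression", "addiction", "depression", "bipolar"]
y_pred = ["depression", "depression", "depression", "bipolar"]
print(classification_report(y_true, y_pred, zero_division=0))
```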

Results
Results of our experiments are presented in Tables 1 and 2. All three HIERRNN configurations yield an improvement over CNN: a minor improvement of 1 FM for RNN-av, 2 FM for RNN-max, and the highest improvement of 4 FM for RNN-att. Thus, we believe that considering the sequential nature of text, as RNN models do, can be beneficial for analyzing posts related to mental health.
We should also note the improvement due to the attention mechanism as compared to the maximum and averaging strategies (2.5 FM on average). Those results are consistent with the results presented by Yang et al. (2016) for other types of texts (e.g., reviews) and other types of labels (e.g., ratings).
As for per-class performance, RNN-att improves it by 6 FM on average. The improvement in precision is half as large as the improvement in recall (6% relative change in PR vs. 12% in RC). This difference is particularly notable for the rarer classes; we tend to attribute it to intrinsic properties of RNNs (see Table 2).² A relatively high performance improvement of 8 FM is observed for 8 classes of posts (BPD, bipolar, schizophrenia, selfharm, addiction, cripplingalcoholism, Opiates, autism), which are underrepresented (on average they make up 4% of all test set posts) and have relatively short documents (9 sentences on average vs. 11 sentences across all classes). Besides intrinsic properties of RNNs, our modeling approximation (we limit the document size to 17 sentences to avoid optimization issues) could also contribute to this improvement.
As can be seen from the confusion matrix in Figure 2, the intrinsic overlap of post content across themes can be misleading for classification: e.g., as also shown by Gkotsis et al. (2017), many Opiates posts are misclassified as cripplingalcoholism and vice versa. However, HIERRNN is in general more precise and shows less confusion between classes: e.g., the confusion of schizophrenia with depression is halved compared to CNN.
One of the advantages of the attention mechanism is that its weights can be visualized and interpreted by humans (which is not always the case with neural network layers). In this work, we focus on the analysis of sentence-level attention weights. This information can be especially helpful for reducing the number of post sentences that need to be analyzed, and hence for creating less costly classification solutions.
Table 3 provides the results of our analysis of attention weight distributions. For this analysis we filtered out one-sentence documents. We study how often an absolute sentence position receives the maximum or the minimum weight, out of the total number of cases in which this position is present across documents (i.e., the document is long enough). We report the top three maximum and minimum positions, as well as average entropy values for the attention distributions over sentences.³ We report similar statistics for a selection of classes in Table 4.
Our analysis shows that RNN-att picks up a certain semantic importance pattern: the most attention is paid to the first sentence, then to the second, and finally to the last. The least attention is systematically paid to the sentence just after the attention peak at the beginning (4th position), to a sentence in the middle (7th position), and to a sentence just before the end (14th position).
At the same time, attention weights are spread quite evenly between the peak positions (average entropy of 1.93). The entropy values tend to increase for classes that are better represented and whose posts are on average longer (e.g., depression, suicidewatch). In those longer documents, relevant information is not concentrated in one place, and several sentences are likely to be equally important.
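The analysis behind Tables 3 and 4 can be sketched as follows: for each multi-sentence document, record which absolute sentence position receives the largest and the smallest weight, and compute the entropy of the weight distribution (an illustrative reconstruction that simplifies the per-position normalization used in the tables):

```python
import numpy as np
from collections import Counter

def attention_position_stats(docs_alpha):
    """docs_alpha: list of per-document attention weight vectors (each sums to 1)."""
    max_pos, min_pos, entropies = Counter(), Counter(), []
    for alpha in docs_alpha:
        alpha = np.asarray(alpha, dtype=float)
        if len(alpha) < 2:                      # filter out one-sentence documents
            continue
        max_pos[int(alpha.argmax()) + 1] += 1   # 1-based sentence positions
        min_pos[int(alpha.argmin()) + 1] += 1
        entropies.append(float(-(alpha * np.log2(alpha + 1e-12)).sum()))
    return max_pos, min_pos, np.mean(entropies)

# toy usage with two documents of different lengths
print(attention_position_stats([[0.6, 0.1, 0.3], [0.25, 0.25, 0.3, 0.2]]))
```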
Table 5 provides some examples of attention distributions for documents of different lengths, belonging to different classes. For instance, in a longer document from suicidewatch, the most relevance is given to the first two sentences, containing words like "rejection" and "depression", whereas a neutral sentence ("I met this girl.") receives a low weight. For a short document of two sentences from cripplingalcoholism, three quarters of the weight is concentrated on the first sentence. This sentence is especially relevant to the topic and contains keywords such as "beer" and "sober". Note that, for instance, for a schizophrenia post (a class for which performance improved significantly, by 10 FM compared to CNN), the elaboration of the topic of auditory hallucinations in the first two sentences might have been taken into account by the RNN.
However, RNNs usually require more computational power to train than other neural architectures.⁴ We believe that such information on attention distributions can be particularly useful for creating low-resource models, which could operate on filtered data (e.g., only the first two sentences of a post).

Discussion and Conclusions
In this paper, we have applied a hierarchical Recurrent Neural Network (RNN) architecture to the classification of posts related to mental health, which is, to our knowledge, the first attempt of its kind. The ability to classify posts in this manner is a first step towards targeted interventions, e.g., by redirecting posts requiring moderator attention.
Our model progressively builds a document representation: it aggregates important words into sentence vectors, and then aggregates important sentence representations into document representations, which are directly used for inference.
We have shown that the intrinsic ability of RNNs to consider input as a sequence in general, and the hierarchical structure of this architecture in particular, can be beneficial for the analysis of health-related online text. We observed a performance improvement of 4 F-measure (FM) compared to Convolutional Neural Network (CNN) solutions. This improvement is mainly due to performance gains for the rarer classes (8 FM on average).
We have also shown that the attention mechanism is capable of efficiently identifying the words and sentences of a document that are relevant for classification decisions. We provided a detailed study of attention distribution patterns at the sentence level and showed that the beginning of a document, as well as its last sentence, are the most important. At the same time, attention tends to be distributed evenly between those positions.
In the future, we plan to reproduce our study for other types of health-related text, including Electronic Health Records (EHRs), where the sequence of events can be even more important for classification decisions. We also plan to investigate attention weights at the word level and compare those results to the results produced using state-of-the-art weighting techniques, e.g., TF-IDF.
We also plan to systematically compare the performance of different attention mechanisms, with the purpose of finding a robust solution able to replace the computationally expensive recursion step.

Table 3: Absolute sentence positions that receive the most and the least attention. We provide the top three positions, with the percentage of their occurrences that received maximum or minimum attention. E.g., the 2nd sentence receives the most attention in 13% of the cases in which a post contains a 2nd sentence. H refers to entropy.

Table 4: Absolute sentence positions that receive the most and the least attention, for a selection of classes. We provide the top three positions, with the percentage of their occurrences that received maximum or minimum attention. E.g., for Opiates, the 17th sentence receives maximum attention in 4% of the cases in which a post contains a 17th sentence. H refers to entropy.

Table 5: Paraphrased examples of attention weight distributions over post sentences. Medication names have been replaced with [medication].