Adapting Deep Learning Methods for Mental Health Prediction on Social Media

Mental health problems pose a significant challenge to an individual's well-being. Text analysis of rich resources, such as social media, can contribute to a deeper understanding of mental illnesses and provide means for their early detection. We tackle the challenge of detecting social media users' mental health status through deep learning-based models, moving away from traditional approaches to the task. In a binary classification task of predicting whether a user suffers from one of nine different disorders, a hierarchical attention network outperforms previously set benchmarks for four of the disorders. Furthermore, we explore the limitations of our model and analyze the phrases relevant for classification by inspecting the model's word-level attention weights.


Introduction
Mental health is a serious issue in the modern world. According to the World Health Organization's 2017 report and Wykes et al. (2015), more than a quarter of Europe's adult population experiences an episode of a mental disorder during their life. The problem is exacerbated by the fact that as much as 35-50% of those affected go undiagnosed and receive no treatment for their illness. In line with WHO's Mental Health Action Plan (Saxena et al., 2013), the natural language processing community contributes to gathering information and evidence on mental conditions, focusing on text analysis of authors affected by mental illness.
Researchers can utilize large amounts of text on social media sites to gain a deeper understanding of mental health and develop models for early detection of various mental disorders (De Choudhury et al., 2013a; Coppersmith et al., 2014; Gkotsis et al., 2016; Benton et al., 2017; Sekulić et al., 2018; Zomick et al., 2019). In this work, we experiment with the Self-reported Mental Health Diagnoses (SMHD) dataset (Cohan et al., 2018), consisting of thousands of Reddit users diagnosed with one or more mental illnesses. The contribution of our work is threefold. First, we adapt a deep neural model, proven successful in large-scale document classification, to user classification on social media, outperforming previously set benchmarks for four out of nine disorders. In contrast to the majority of preceding studies on mental health prediction in social media, which relied mostly on traditional classifiers, we employ a Hierarchical Attention Network (HAN) (Yang et al., 2016). Second, we explore the limitations of the model in terms of the data needed for successful classification, specifically the number of users and the number of posts per user. Third, through the attention mechanism of the model, we analyze the phrases most relevant for classification and compare them to previous work in the field. We find similarities between lexical features and n-grams identified by the attention mechanism, supporting previous analyses.


The SMHD Dataset
The SMHD dataset consists of Reddit users diagnosed with one or more mental disorders, together with matched control users; controls are matched to diagnosed users based, among other criteria, on the subreddits (communities on Reddit) they post in. Diagnosed users' language is normalized by removing posts with specific mental health signals and discussions, in order to analyze the language of general discussions and to make it more comparable to the control groups. The nine disorders and the number of users per disorder, as well as the average number of posts per user, are shown in Table 1. For each disorder, Cohan et al. (2018) analyze the differences in language use between diagnosed users and their respective control groups.
They also provide benchmark results for the binary classification task of predicting whether a user belongs to the diagnosed or the control group. We reproduce their baseline models for each disorder and compare them to our deep learning-based model, explained in Section 2.3.

Selecting the Control Group
Cohan et al. (2018) select nine or more control users for each diagnosed user and run their experiments with these mappings. Since this exact mapping is not available, we had to select the control groups ourselves for each of the nine conditions. For each diagnosed user, we draw exactly nine control users from the pool of 335,952 control users present in SMHD and proceed to train and test our binary classifiers on the newly created sub-datasets.
To create a statistically fair comparison, we run the selection process multiple times and reimplement the benchmark models used in Cohan et al. (2018). Multiple sub-datasets with different control groups not only provide unbiased results, but also show how the results of a binary classification can differ depending on the control group.
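As a minimal sketch of this selection procedure (the function and variable names are ours, and the sampling details are an assumption; the paper only specifies nine controls per diagnosed user and repeated runs), the sub-dataset construction could look like:

```python
import random

def sample_control_groups(diagnosed_ids, control_pool, ratio=9, runs=5, seed=0):
    """Build `runs` sub-datasets; each draws `ratio` control users per
    diagnosed user from the pool, without replacement within a run."""
    rng = random.Random(seed)
    sub_datasets = []
    for _ in range(runs):
        controls = rng.sample(control_pool, ratio * len(diagnosed_ids))
        sub_datasets.append({"diagnosed": list(diagnosed_ids),
                             "control": controls})
    return sub_datasets

# Toy example: 2 diagnosed users, a pool of 100 control users.
datasets = sample_control_groups(["u1", "u2"], [f"c{i}" for i in range(100)])
```

Each run yields a different control group, so training and evaluating a classifier on every sub-dataset and averaging the scores gives a less biased estimate than a single draw.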

Hierarchical Attention Network
We adapt a Hierarchical Attention Network (HAN) (Yang et al., 2016), originally used for document classification, to user classification on social media. A HAN consists of a word sequence encoder, a word-level attention layer, a sentence encoder, and a sentence-level attention layer. It employs GRU-based sequence encoders (Cho et al., 2014) at the sentence and document levels, ultimately yielding a document representation. The word sequence encoder produces a representation of a given sentence, which is then forwarded to a sentence sequence encoder that, given a sequence of encoded sentences, returns a document representation. Both the word-level and sentence-level encoders apply an attention mechanism on top, helping them aggregate the representation of a given sequence more accurately. For details of the architecture, we refer the interested reader to Yang et al. (2016).
In this work, we model a user as a document, enabling an intuitive adaptation of the HAN: just as a document is a sequence of sentences, we model a social media user as a sequence of posts, and, similarly, we identify posts with sentences, both being sequences of tokens. This interpretation lets us apply the HAN, which has had great success in document classification, to user classification on social media.
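The attention pooling applied at both levels of a HAN can be sketched in a few lines. This is an illustrative NumPy version with made-up dimensions and random parameters (the actual model operates on trained GRU states with learned weights): each encoder state is projected through a one-layer MLP, scored against a learned context vector, and the softmax-normalized scores weight the sum of the states (Yang et al., 2016).

```python
import numpy as np

def attention_pool(H, W, b, u_ctx):
    """Attention pooling over a sequence of encoder states.
    H: (seq_len, hidden) matrix of GRU states; W, b: projection
    parameters; u_ctx: learned context vector."""
    u = np.tanh(H @ W + b)            # (seq_len, attn_dim) projection
    scores = u @ u_ctx                # (seq_len,) relevance scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()              # attention weights, sum to 1
    return alpha @ H, alpha           # pooled vector and the weights

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))           # e.g. 5 tokens, hidden size 8
W = rng.normal(size=(8, 4))
b = np.zeros(4)
u_ctx = rng.normal(size=4)
vec, alpha = attention_pool(H, W, b, u_ctx)
```

The same pooling is applied once over the word states of each post and once over the resulting post vectors, which is also what makes the weights inspectable later on.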

Experimental Setup
The HAN uses two layers of bidirectional GRU units with a hidden size of 150, each followed by a 100-dimensional attention mechanism. The first layer encodes posts, while the second encodes a user as a sequence of encoded posts. The output layer is a 50-dimensional fully-connected network, with binary cross-entropy as the loss function. We initialize the input layer with 300-dimensional GloVe word embeddings (Pennington et al., 2014). We train the model with Adam (Kingma and Ba, 2014), with an initial learning rate of 10^-4 and a batch size of 32, for 50 epochs. The model that performs best on the development set is selected.
We implement the baselines as in Cohan et al. (2018). Logistic regression and a linear SVM are trained on tf-idf weighted bag-of-words features, with all of a user's posts concatenated and all tokens lower-cased. Optimal parameters are found on the development set, and the models are evaluated on the test set. FastText (Joulin et al., 2016) is trained for 100 epochs, using character n-grams of size 3 to 6, with a 100-dimensional hidden layer. We take diagnosed users from the predefined train-dev-test split and select the control group as described in Subsection 2.2. To ensure unbiased results and a fair comparison to the baselines, we repeat the control group selection five times for each disorder and report the average over the runs.
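As a sketch of the linear baselines (with toy stand-in data and default hyperparameters of our choosing, not the tuned settings from the paper), the tf-idf plus logistic regression pipeline could look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data: each "document" is all of a user's posts,
# concatenated and lower-cased, as in the baseline setup.
users = ["i feel anxious and tired all the time",
         "great game last night go team",
         "cannot sleep everything feels hopeless",
         "new recipe turned out great highly recommend"]
labels = [1, 0, 1, 0]  # 1 = diagnosed, 0 = control

clf = make_pipeline(TfidfVectorizer(lowercase=True), LogisticRegression())
clf.fit(users, labels)
probs = clf.predict_proba(users)  # (n_users, 2) class probabilities
```

Swapping `LogisticRegression` for `sklearn.svm.LinearSVC` gives the SVM baseline; in practice, the regularization strength would be tuned on the development set.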

Binary Classification per Disorder
We report the F1 scores per disorder in Table 2, for the task of binary classification of users, with the diagnosed class as the positive one. Our model outperforms the baseline models for Depression, ADHD, Anxiety, and Bipolar disorder, while it proves insufficient for PTSD, Autism, OCD, Schizophrenia, and Eating disorder. We hypothesize that the reason for this is the size of the particular sub-datasets, shown in Table 1. We observe higher F1 scores for the HAN on disorders with sufficient data, suggesting once again that deep neural models are data-hungry (Sun et al., 2017). Logistic regression and the linear SVM achieve higher scores where there are fewer diagnosed users. In contrast to Cohan et al. (2018), supervised FastText yields worse results than the tuned linear models.
We further investigate the impact of dataset size on the final classification results. We limit the number of posts per user available to the model to examine the amount needed for reasonable performance. Results with 50, 100, 150, 200, and 250 available posts per user are presented in Figure 1. Experiments were run three times for each disorder and each number of available posts, each time with a different control group selected. We observe a positive correlation between the amount of data provided to the model and its performance, although this tendency has an upper bound. As the average number of posts per user is roughly 160 (Table 1), it is reasonable to expect a model to perform well with similar amounts of data available. However, further analysis is required to determine whether the model reaches a plateau because large amounts of data are not needed for the task, or because it is not expressive enough.

Attention Weights Analysis
The HAN, through its attention mechanism, provides a clear way to identify posts, and words or phrases within those posts, that are relevant for classification. We examine attention weights at the word level and compare the most attended words to prior research on depression. Depression is selected as the most prevalent disorder in the SMHD dataset, with a number of studies in the field (Rude et al., 2004; Chung and Pennebaker, 2007; De Choudhury et al., 2013b; Park et al., 2012). For each post, we extract the two words with the highest attention weights as the most relevant for classification. If the two words appear next to each other in the post, we treat them as a bigram. Some of the top 100 most common unigrams and bigrams are presented in Table 3, aggregated under the most common LIWC categories.
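The extraction step described above can be sketched as follows (the function name and the toy post are ours):

```python
def top_attended(tokens, weights):
    """Return the two highest-attention words of a post; if they are
    adjacent in the post, merge them into a single bigram."""
    order = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
    i, j = sorted(order[:2])          # positions of the top-2 words
    if j == i + 1:                    # adjacent -> treat as a bigram
        return [f"{tokens[i]} {tokens[j]}"]
    return [tokens[i], tokens[j]]

top_attended(["i", "feel", "so", "alone"], [0.4, 0.35, 0.05, 0.2])
# -> ["i feel"]
```

Counting the unigrams and bigrams this returns over all posts of diagnosed users yields the frequency ranking summarized in Table 3.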
We observe similar patterns between the features the HAN deems relevant and previous research on signals of depression in language. The importance of personal pronouns in distinguishing depressed authors from the control group is supported by multiple studies (Rude et al., 2004; Chung and Pennebaker, 2007; De Choudhury et al., 2013b; Cohan et al., 2018). In the categories Affective processes, Social processes, and Biological processes, Cohan et al. (2018) report significant differences between the depressed and control groups, as they do for some other disorders. Besides the above-mentioned words and their abbreviations, among the most commonly attended words are swear words, as well as other forms of informal language. The attention mechanism's weighting suggests that words and phrases proven important in previous studies, which used lexical features and linear models, are relevant for the HAN as well.

Related Work
In recent years, social media has been a valuable source for psychological research. While most studies use Twitter data (Coppersmith et al., 2014, 2015a,b; Benton et al., 2017), a recent stream turns to Reddit as a richer source of high-volume data (De Choudhury and De, 2014; Shen and Rudzicz, 2017; Gjurković and Šnajder, 2018; Cohan et al., 2018; Sekulić et al., 2018; Zirikly et al., 2019). Previous approaches to predicting an author's mental health usually relied on linguistic and stylistic features, e.g., Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001), a widely used feature extractor in various studies regarding mental health (Rude et al., 2004; Coppersmith et al., 2014; Sekulić et al., 2018; Zomick et al., 2019). Recently, Song et al. (2018) built a feature attention network for depression detection on Reddit, showing high interpretability but little improvement in accuracy. Orabi et al. (2018) concatenate all the tweets of a Twitter user into a single document and experiment with various deep neural models for depression detection. Some previous studies use deep learning methods on a post level to infer general information about a user (Kshirsagar et al., 2017; Ive et al., 2018; Ruder et al., 2016) or to detect mental health concepts in the posts themselves (Rojas-Barahona et al., 2018), while we focus on utilizing all of a user's text. Yates et al. (2017) use a CNN on a post level to extract features, which are then concatenated into a user representation used for self-harm and depression assessment. A CNN requires a fixed post length, constraining the data available to the model, while a HAN utilizes all of the data from posts of arbitrary lengths.
A social media user can be modeled as a collection of their posts, so we look at neural models for large-scale text classification. Liu et al. (2018) split a document into chunks and use a combination of CNNs and RNNs for document classification. While this approach proves successful for scientific paper categorization, it is unintuitive for social media text, as there is no clear way to split a user's data into equally sized chunks. Yang et al. (2016) use a hierarchical attention network for document classification, an approach that we adapt for Reddit. A step further would be adding another level of hierarchy, similar to Jiang et al. (2019), who use a multi-depth attention-based hierarchical RNN to tackle the problem of long document semantic matching.

Ethical considerations
Acknowledging the social impact of NLP research (Hovy and Spruit, 2016), mental health analysis must be approached carefully, as it is an extremely sensitive matter (Šuster et al., 2017). To acquire the SMHD dataset, we comply with the Data Usage Agreement, designed to protect users' privacy. We do not attempt to contact the users in the dataset, nor to identify or link them with other user information.

Conclusion
In this study, we experimented with hierarchical attention networks for the task of predicting the mental health status of Reddit users. For disorders with a sufficient number of diagnosed users, a HAN proves better than the baselines. However, the results worsen as the available data decreases, suggesting that traditional approaches remain better for smaller datasets. The analysis of attention weights at the word level suggested similarities to previous studies of depressed authors. Incorporating mental health-specific insights from previous work could further benefit the model. Future work includes the analysis of post-level attention weights, with the goal of finding patterns in the relevance of particular posts and, through them, time periods when a user is in distress. As some disorders share similar symptoms, e.g., depressive episodes in bipolar disorder, exploiting correlations between labels through multi-task or transfer learning might prove useful. To improve classification accuracy, a transformer-based model for encoding users' posts should be tested.

Acknowledgments
This work has been funded by the Erasmus+ programme of the European Union and the Klaus Tschira Foundation.