Detection of Mental Health from Reddit via Deep Contextualized Representations

We address the problem of automatically detecting psychiatric disorders from the linguistic content of social media posts. We build a large-scale dataset of Reddit posts from users with eight disorders and a control user group. We extract and analyze linguistic characteristics of posts and identify differences between diagnostic groups. We build strong classification models based on deep contextualized word representations and show that they outperform previously applied statistical models with simple linguistic features by large margins. We compare user-level and post-level classification performance, and evaluate an ensembled multiclass model.


Introduction
The global prevalence of mental disorders has been estimated at 29.2% in a meta-study of 174 surveys across 63 countries (Steel et al., 2014). Mental illness is one of the leading causes of disability globally, and the costs of mental health treatment have run into the trillions of dollars (World Health Organization, 2014; Vigo et al., 2016; Patel et al., 2018). Additionally, an estimated 14.3% of deaths worldwide are attributable to mental illness, a rate significantly higher than that of a control population (Walker et al., 2015). Limited mental health resources and funding have necessitated new approaches to addressing the global impact of this problem. Encouragingly, early detection of and early intervention in mental illness have shown promising results for improving treatment and long-term outcomes for many psychiatric disorders, and thus have the potential to reduce the costly burden that mental illness places on our society and global economy (Bird et al., 2010; Treasure and Russell, 2011; De Girolamo et al., 2012; Murru and Carpiniello, 2018).
Advances in artificial intelligence in general, and computational linguistics in particular, have made important contributions to detecting and predicting mental illness among the population, particularly via social media (Guntuku et al., 2017; Wongkoblap et al., 2017). Using computational linguistics, researchers have leveraged the widespread use of social media to analyze large, publicly available datasets for identifying linguistic markers of mental illness. To date, unique linguistic markers and patterns have been identified for several psychiatric conditions, such as major depressive disorder (MDD) (De Choudhury et al., 2013; Vedula and Parthasarathy, 2017), general anxiety disorder (GAD) (Shen and Rudzicz, 2017), bipolar disorder (BD) (Huang et al., 2017; Sekulić et al., 2018), eating disorders (ED) (Mohammadi et al., 2019; Naderi et al., 2019), schizophrenia (SZ) (Mitchell et al., 2015; Birnbaum et al., 2017; Zomick et al., 2019), obsessive compulsive disorder (OCD) (Coppersmith et al., 2015a), and post-traumatic stress disorder (PTSD) (Coppersmith et al., 2014), among others (Coppersmith et al., 2015a). Linguistic findings have spanned various domains of language, including the use of pronouns, emotion words, tentative language, tangentiality, punctuation, and content. The majority of these models have been developed to predict whether a given user has self-disclosed receiving a diagnosis for a psychiatric condition and is currently suffering from mental illness.
However, much of this previous research on social media and mental health has focused on comparing users with particular disorders to control users. In this work we expand this focus to compare across a wide set of common disorders. This is directly applicable to real-world diagnostic scenarios, where clinicians select a diagnosis from a large set of disorders rather than simply deciding whether an individual is healthy. In addition, prior work has focused on data collection and analysis, with less emphasis on building strong predictive models. In this work, we apply state-of-the-art neural network models developed for other natural language tasks to the problem of mental health detection from social media.

Related Work
In recent years, there has been increased interest in the NLP community in the automatic detection of psychiatric conditions from language. Many researchers have focused on analyzing vast amounts of language from social media posts to study mental health (Birnbaum et al., 2017; Coppersmith et al., 2015a; Mitchell et al., 2015). With the advent of social media, many people who suffer from various forms of mental illness have found a sense of community and support, and these platforms offer a mode of expression for discussing their experiences openly online. Additionally, many online platforms allow users to post anonymously, giving them the security to discuss their experiences and struggles without fear of being stigmatized or discriminated against (Balani and De Choudhury, 2015; Berry et al., 2017; Highton-Williamson et al., 2015).
In order to analyze language patterns related to various disorders from social media data, researchers have developed innovative approaches for automatically labeling such data. Coppersmith et al. (2014) developed a widely used approach for gathering data for a range of psychological disorders, using regular expressions to identify public self-disclosures of diagnoses on social media. They tested this approach on Twitter data and collected a dataset of tweets from individuals with bipolar disorder, depression, PTSD, SAD, and a control group. They analyzed several linguistic features across conditions using a clustering algorithm and built predictive classifiers to distinguish between diagnosed and control users. Cohan et al. (2018a) expanded this approach to study a larger set of disorders using Reddit data. Reddit is one of the fastest-growing and most widely used social media platforms, averaging over 330 million active monthly users, and, as of 2018, it was the fourth most visited website in the US (Hutchinson, 2018). Unlike Twitter, Reddit imposes no limits on the length of posts, enabling analysis of longer language samples. In addition, Reddit is composed of subreddits, forums dedicated to specific topics, and many subreddits relate to specific mental health conditions. Cohan et al. (2018a) collected a large dataset of Reddit posts and analyzed linguistic features between different conditions and a control group. They also trained binary classifiers to distinguish each condition from the control.
Our work builds directly on this prior work. Following Coppersmith et al. (2014) and Cohan et al. (2018a), we collect a large, expanded dataset of Reddit posts. Unlike prior work, we do not focus on pairwise analyses of linguistic features between each condition and the control group; rather, we compare features across conditions to highlight important differences that can distinguish between various disorders. While others have trained simple predictive models of these disorders, we instead use state-of-the-art deep contextualized models that have been highly successful across several NLP tasks. Our work contributes to the problem of mental health detection from social media data and provides insights for others to build on.

Data Collection
In this study we focus on 8 mental health conditions: schizophrenia (SZ), borderline personality disorder (BPD), post-traumatic stress disorder (PTSD), eating disorder (ED), major depressive disorder (MDD), general anxiety disorder (GAD) and bipolar disorder. While datasets for many of these conditions have been collected at varying scales, to the best of our knowledge our dataset includes the largest cohort of users whose posts have been collected for many of these conditions. To build this cohort, we collect users with self-identified mental health conditions from Reddit using the Pushshift API 1 . We search for users in mental health related subreddits and use keywords to search for mental health related words. Our distant labeling approach is further explained below in Section 3.1. We also identify a group of control users who have none of the targeted conditions. We first collect a large pool of users by scraping posts from common subreddits like r/AskReddit, and filter the control users by the process described in Section 3.2. The number of posts collected for each condition is shown in Table 1 and the number of users whose posts were collected for each is shown in Table 2.

Distant Labeling
We generally follow the self-identification technique previously employed by Mitchell et al. (2015), Coppersmith et al. (2015a), and Cohan et al. (2018a). Specifically, we construct separate regular expressions for self-identification checking and for condition resolution. We use 2-way human annotation to verify the performance of our labeling algorithm; the second version of the algorithm achieves high precision (over .95) when tested on a held-out validation set. We found that posts directly identifying with "eating disorder" are scarce, so we collapse identifications with "anorexia", "arfid", "bulimia", and "binge" into a single "eating disorder" category. We also calculate comorbidity statistics for our extracted user set, as shown in Figure 1, and find that they correlate well with previously reported statistics (Coppersmith et al., 2015a; Cohan et al., 2018b).
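As a sketch of how such distant labeling can work, the following illustrative Python shows a self-identification check followed by condition resolution; the patterns and keyword map are hypothetical stand-ins, not the paper's actual regular expressions.

```python
import re

# Illustrative self-identification pattern (not the paper's actual regex):
# matches first-person diagnosis disclosures and captures the condition span.
SELF_ID = re.compile(
    r"\bI(?:\s+was|\s+am|\s+have\s+been|'ve\s+been)\s+diagnosed\s+with\s+([^.,;\n]+)",
    re.IGNORECASE,
)

# Map condition keywords to a single label; eating-disorder variants are
# collapsed into one "ED" category, as described above.
CONDITIONS = {
    "schizophrenia": "SZ", "borderline": "BPD", "ptsd": "PTSD",
    "depression": "MDD", "anxiety": "GAD", "bipolar": "bipolar",
    "anorexia": "ED", "arfid": "ED", "bulimia": "ED", "binge": "ED",
    "eating disorder": "ED",
}

def resolve_condition(post: str):
    """Return a condition label if the post self-identifies, else None."""
    m = SELF_ID.search(post)
    if not m:
        return None
    span = m.group(1).lower()
    for keyword, label in CONDITIONS.items():
        if keyword in span:
            return label
    return None
```

Third-person mentions ("My friend was diagnosed with ...") fail the first-person check, which is one reason the precision of such patterns must still be verified by human annotation.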

Preprocessing
Following Cohan et al. (2018b), we do not include in our classification any control user that has a sensitive post, defined as a post that either (1) contains mental health related keywords or (2) was posted in a mental health related subreddit. In addition, under the CL condition of our classification experiments (described below in Section 6), we remove these sensitive posts from mental-group users. For post-level preprocessing, we replace emojis with descriptive text using the demoji package 2 , normalize HTML entities such as "&#x200b;", "&amp;" and "&nbsp;", and mask out URL, email and subreddit references with regular expressions.
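A minimal, stdlib-only sketch of this post-level preprocessing might look as follows; the `demoji` emoji-replacement step is omitted since it requires a third-party package, and the masking tokens are our own illustrative choices:

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
SUBREDDIT_RE = re.compile(r"\br/\w+")

def preprocess(post: str) -> str:
    """Normalize HTML entities and mask URL/email/subreddit references.

    The paper additionally replaces emojis with descriptive text via the
    third-party demoji package, which this stdlib-only sketch omits.
    """
    post = html.unescape(post)         # "&amp;" -> "&", "&#x200b;" -> "\u200b"
    post = post.replace("\u200b", "")  # drop zero-width spaces left by unescaping
    post = URL_RE.sub("[URL]", post)
    post = EMAIL_RE.sub("[EMAIL]", post)
    post = SUBREDDIT_RE.sub("[SUBREDDIT]", post)
    return post
```

Masking rather than deleting references preserves the fact that the user linked something, which may itself be a weak signal, without leaking identifying content into the model.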

Linguistic Indicators of Mental Health
After collecting and preprocessing the data, we analyzed linguistic characteristics of mental health using Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2015). LIWC is a text analysis program that computes word counts for semantic classes and structural features. It relies on an internal dictionary that maps words to psychologically motivated categories. These include standard linguistic features (e.g., the percentage of words that are pronouns or articles), markers of psychological processes (e.g., affect, social, and cognitive words), and punctuation categories (e.g., periods, commas). LIWC dimensions have been used in many studies to predict outcomes including personality (Pennebaker and King, 1999), deception (Newman et al., 2003), and health (Pennebaker et al., 1997). We extracted 73 features using LIWC 2015; a full description of these features is found in Pennebaker et al. (2015). To construct a single feature vector per user, we concatenated all of a user's posts, extracted the LIWC features from the combined text, and performed length normalization.
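Dictionary-based feature extraction of this kind can be sketched as follows; since the LIWC 2015 dictionary is proprietary, the toy category word lists below are illustrative stand-ins, with `*` marking LIWC-style prefix wildcards:

```python
import re
from collections import Counter

# Toy stand-in for the proprietary LIWC 2015 dictionary.
TOY_DICT = {
    "i":   ["i", "me", "my", "mine"],
    "anx": ["worr*", "nervous", "anxi*", "afraid"],
}

def liwc_features(posts):
    """Length-normalized category percentages over a user's concatenated posts."""
    tokens = re.findall(r"[a-z']+", " ".join(posts).lower())
    counts = Counter()
    for tok in tokens:
        for cat, words in TOY_DICT.items():
            if any(tok == w or (w.endswith("*") and tok.startswith(w[:-1]))
                   for w in words):
                counts[cat] += 1
    n = max(len(tokens), 1)
    # Express each category as a percentage of total tokens (length normalization).
    return {cat: 100.0 * counts[cat] / n for cat in TOY_DICT}
```

Normalizing by token count is what makes feature vectors comparable between prolific and infrequent posters.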
Prior work on identifying linguistic indicators of mental health has compared LIWC features from users with individual disorders against healthy control users. However, it is often unclear whether such findings are specific to the disorder, or indicative of mental disorders more generally. For example, in pairwise analyses, personal pronoun usage has been found to be increased in individuals with schizophrenia (Zomick et al., 2019); this pattern may not be specific to schizophrenia, but may be indicative of other mental disorders as well. Because of this gap in prior work, we began by comparing LIWC features directly across the 8 diagnostic groups and the control group. Figure 2 shows a heatmap of the z-score normalized average LIWC features across users in each group. The x-axis shows the 8 diagnostic groups and the control group, the y-axis shows the LIWC features, and the color of each cell indicates whether the scaled value is high (blue), low (red), or average (white). As the figure shows, the control group has the greatest number of red cells, i.e., LIWC features with low relative frequency. It is clear from the figure that the 8 diagnostic groups have language usage patterns different from the control group, and in particular show higher frequencies for several linguistic dimensions. Further, there appear to be several interesting similarities and differences in linguistic patterns across the diagnostic groups. To investigate these differences, we ran one-way ANOVAs comparing each LIWC feature across the 8 diagnostic groups and the control group, using Bonferroni correction to control for family-wise type I errors. The results indicated significant differences across groups for all 73 LIWC variables. We then ran Tukey post-hoc tests to identify which pairs of conditions were most similar and most different.
Because of limited space, we focus here on the linguistic dimensions with the greatest variance among the groups, indicated by the highest F-statistics. These categories were anx, the use of anxiety words (F(8, 24442) = 531.911, p < .0001), and I, the use of the first person singular pronoun (F(8, 24442) = 438.738, p < .0001). Figure 3 shows the results of the post-hoc analysis. Pairwise comparisons among psychiatric conditions revealed several interesting findings. Each condition differed significantly from the control group on both features (users in the control group were significantly less likely to use anxiety-related words and "I" than users in each condition), but differences between the psychiatric conditions varied. For example, users with SZ were significantly less likely to use anxiety-related words than the other groups. Another interesting finding was that users with BPD used first person singular pronouns significantly more than all other psychiatric conditions with the exception of ED. These findings shed light on linguistic variation across different psychiatric conditions, and provide further motivation for developing methods that distinguish between individuals with these disorders by leveraging social media posts.
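The analysis above would normally use a statistics package (e.g. scipy.stats.f_oneway), but the one-way ANOVA F-statistic is simple enough to sketch in plain Python for illustration:

```python
def one_way_anova_F(groups):
    """F-statistic for a one-way ANOVA across groups of feature values.

    Returns (F, df_between, df_within). The F value is then compared
    against an F distribution, with the alpha level Bonferroni-corrected
    by dividing by the number of features tested (73 in this analysis).
    """
    k = len(groups)                           # number of groups
    N = sum(len(g) for g in groups)           # total observations
    grand = sum(sum(g) for g in groups) / N   # grand mean
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_b, df_w = k - 1, N - k
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w
```

Identical group means give F = 0, while well-separated groups with small within-group spread drive F up, which is exactly why anx and I, the two features with the highest F values, were singled out.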

Methods for Classification Experiments
Having identified significant differences in linguistic features between the disorders and the control group, we next explore several classification methods for automatically identifying different mental health conditions. Previous efforts to identify such conditions on Reddit have primarily employed simple logistic regression or SVMs using bag-of-words representations or LIWC features, and some have explored RNN/CNN-based text encoder models (Coppersmith et al., 2014, 2015a; Cohan et al., 2018b; Sekulic and Strube, 2019). However, recent advances in contextual representations such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), which have enabled substantial performance gains across many NLP tasks, have not been well integrated into mental health identification, largely due to model size and the scalability issues posed by the large number of posts generated by each user. In this work we focus on methods utilizing contextual representations for mental health identification and compare their effectiveness to a logistic regression baseline trained on LIWC features. We present an attention-based model using BERT representations as input features, as well as a REALM-like model (Guu et al., 2020) inspired by recent advances in open-domain question answering. All of these models are trained for a user-level classification task, detecting whether a user has a particular diagnosis based on an aggregated representation of their posts. In addition, we conduct post-level classification experiments in a standard BERT fine-tuning setting, to assess the importance of global context in classification.
Finally, in addition to these binary classifiers (diagnosis vs. control), we ensemble all our binary classification models into a multiclass classifier over the diagnostic groups, which is the ultimate application goal of this work.
For user-level classification, we select users not belonging to the comorbidity group, along with a control group. To reduce the size of the data, we remove posts shorter than 50 characters from both the mental group and the control group, as we hypothesize that such short posts may not provide enough information for classification. We also exclude control users with fewer than 20 non-sensitive posts. When pairing with mental health users, we select control users who have a similar total number of posts, who have no mental health sensitive posts, and who have at least some subreddit overlap with the mental health users, as described by Cohan et al. (2018b). We consider two experimental settings: in the CL (clean) experiments we exclude all mental health sensitive posts for mental health users, while in the UNCL (unclean) experiments we include these posts for their corresponding users. The intuition is that under the UNCL setting our model should be able to make predictions based on explicit semantic triggers, resulting in better performance, whereas under the CL setting the model must rely on underlying syntactic differences that may generalize better than explicit semantic features.
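The filtering steps described above might be sketched as follows; the function and variable names are our own illustrative choices, not the paper's code:

```python
def build_user_examples(users, sensitive, clean=True,
                        min_post_chars=50, min_control_posts=20):
    """Filter posts per the CL/UNCL settings described above (sketch).

    users: dict mapping user id -> (group, posts), group in {"mental", "control"}.
    sensitive: predicate flagging mental-health-related posts (by keyword
    or by the subreddit the post appeared in).
    """
    examples = {}
    for uid, (group, posts) in users.items():
        # Drop short posts that likely carry too little signal.
        posts = [p for p in posts if len(p) >= min_post_chars]
        if group == "control":
            # Control users with any sensitive post are excluded outright,
            # as are those with too few remaining posts.
            if any(sensitive(p) for p in posts) or len(posts) < min_control_posts:
                continue
        elif clean:
            # CL setting: strip sensitive posts from mental-group users.
            posts = [p for p in posts if not sensitive(p)]
        examples[uid] = (group, posts)
    return examples
```

Under `clean=False` (the UNCL setting), the sensitive posts of mental-group users are retained, giving the model access to explicit semantic triggers.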
Below we describe the Attention-Based model and the REALM model that we adapt for this work.

Attention-Based Model
A direct solution to the scalability issue with this data is to restrict gradient updates to a small portion of the model parameters. We use the pre-trained BERT model from Hugging Face (Wolf et al., 2019) to encode every post, averaging the representations at all positions to obtain a pooled post vector; we then build an attention-based classifier (Bahdanau et al., 2014; Sutskever et al., 2014) over all the post-level representations for a single user 3 . This resembles the setting of many "probing tasks" (Hewitt and Manning, 2019) used to investigate whether BERT embeddings encode useful linguistic information, in our case about a user's mental health condition.

REALM-like Models
Guu et al. (2020) propose Retrieval-Augmented Language Model pretraining (REALM), which augments a pretrained LM with a textual knowledge retriever. To tackle the scalability issue of retrieving over large corpora, a retrieval encoder parameterized by θ is used to encode all documents in the textual knowledge corpus, and is tuned only on its top-k retrieval results. Guu et al. (2020) show by gradient analysis that a document z receives a positive update if the estimated probability of a correct answer y based on z is higher than its expectation over all documents in the corpus. To adapt REALM to our task, we reformulate our classification problem as a "retrieve-then-predict" pipeline similar to Open Domain Question Answering (ODQA). Specifically, given a user's total set of posts Z, we first select the top-k posts {z_1, ..., z_k} that are most helpful in predicting the user's mental health condition, and we base our prediction only on these posts. Unlike in ODQA, we do not have a question x that can be utilized for relevant document selection, so we instead use a trainable attention head h to calculate the retrieval probability p(z). The probability of a user having condition y can thus be factorized as

p(y | Z) = Σ_{z ∈ topk(Z)} p(y | z) p(z), where p(z) ∝ exp(h^T Embed_doc(z)).

Here, Embed_doc(·) is implemented as a BERT-style transformer parameterized by θ, and p(y | z) is implemented as a BERT-based classifier parameterized by φ. When training, we first index all user posts with Embed_doc(z) using the current θ, then jointly tune θ, φ, and h on the top-k user posts w.r.t. p(z). Every several epochs, we re-index all posts with the tuned parameters θ.
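As an illustration of the shared mechanism, the following NumPy sketch shows attention pooling over frozen post embeddings, as in BERT-ATT; the softmax attention weights play the same role as the retrieval probability p(z) in the REALM-like variant, where only the top-k posts would be kept. All names are our own illustrative choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_classify(post_embs, h, w, b):
    """Attention pooling over frozen post embeddings (BERT-ATT sketch).

    post_embs: (n_posts, d) pooled BERT vectors, one per post (frozen).
    h:         (d,) trainable attention head; softmax(post_embs @ h) is
               analogous to the retrieval probability p(z).
    w, b:      weights of a linear probe producing the diagnosis logit.
    """
    attn = softmax(post_embs @ h)       # one weight per post, sums to 1
    user_vec = attn @ post_embs         # (d,) user-level representation
    return 1.0 / (1.0 + np.exp(-(user_vec @ w + b)))   # P(diagnosis)
```

With h = 0 the attention weights are uniform and the model degenerates to mean pooling over posts; training h lets the classifier upweight the posts most indicative of a condition.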

Experimental Settings
For REALM-like models we update the user corpus index every 5 epochs. At every step we use the top 10 documents per user to tune the model; we set the learning rate for the attention-based classifier to 1e-3 and for the BERT parameters to 1e-5. For the attention-based model, we set the learning rate of the classifier to 1e-3 and keep the BERT parameters frozen. For our post-level classification model we set the learning rates in the same way as in the REALM-like model. Note that when fine-tuning the BERT-based model, we pool the sentence representation from the [CLS] token, unlike the non-tunable model (BERT-ATT), where we average across all positions to obtain the pooled representation. In all cases except for our LIWC-feature-based logistic regression model, we use a held-out development set for model selection; for the LIWC-based regression we run a parameter grid search using cross-validation on the training set. For the multiclass classification experiments, we ensemble the best model set under the CL setting as the multiclass classifier.

Mental Health Detection Results
In this section we present the results for user-level and post-level binary classification, and for the multiclass ensemble classification. For user-level classification, the attention-based model (BERT-ATT) outperforms the REALM-like model, probably because the REALM model makes its predictions using only the top-10 segments retrieved from all of a user's posts, while the BERT-ATT model is able to attend to all the posts at once. This result aligns well with the intuition that linguistic traits of mental health conditions are global, and may be difficult to determine from a small portion of posts, especially when posts containing sensitive keywords are removed.

User-Level Classification
The BERT-ATT model performs best for the bipolar category (CL F1: .879; UNCL F1: .931), for which we have the largest user group, indicating the importance of obtaining large-scale datasets for the success of deep mental health detection. In all cases, our results suggest that contextualized representations are better features for mental health prediction than LIWC features, but are also more likely to model shallow semantic traits.

Post-Level Classification
To create a balanced set comparable to user-level classification, we sample 50,000 mental-group posts and 50,000 control-group posts as the training set, and 5,000 + 5,000 posts each for dev and test. Table 4 shows the post-level binary classification results, which range from an F1 of .596 for MDD to an F1 of .736 for ED, substantially lower than the user-level classification performance. This suggests that linguistic signals related to mental health problems do not appear in every post of a mental-group user. However, post-level performance exhibits trends similar to those of user-level classification with LIWC features, with ED the easiest subset and MDD the hardest: this may mean that linguistic traits for ED have broader coverage among a user's posts, while for MDD the scope is probably smaller. This is consistent with the results reported by Coppersmith et al. (2015a).

As the BERT-ATT model performs best under the CL setting, we ensemble all BERT-ATT models as the multiclass classifier. Again, to create a balanced test set, we sample 100 users from each condition group's test set. We then predict the user's condition by selecting the label with the highest score from the model. With this naive ensemble method we achieve micro-F1 = .2175 and macro-F1 = .195. The fact that these results are only slightly above the random baseline (.125) indicates that, though deep contextualized word representations are strong features under binary settings, the model is not well calibrated (DeGroot and Fienberg, 1983; Niculescu-Mizil and Caruana, 2005), as is often the case for modern deep networks (Guo et al., 2017). To see whether there are any identifiable patterns in the errors, we plot the prediction heatmap for the multiclass classification task, as shown in Figure 4.
Table 4: Post-level BERT Classification Results

We find that there is a discrepancy in confidence between the different models, and this confidence correlates strongly neither with training-data size nor with binary classification performance. Though cells on the main diagonal generally have a darker shade, indicating a promising separation of the feature sets useful for identifying their designated condition, the mislabeling distribution bears little resemblance to the comorbidity distribution characterized in Figure 1. Further experimentation is needed to improve the multiclass ensemble classification performance.
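The micro and macro F1 scores used above for the single-label multiclass ensemble can be computed as in the following sketch (for single-label multiclass predictions, micro-F1 reduces to accuracy):

```python
def macro_micro_f1(y_true, y_pred, labels):
    """Micro and macro F1 for single-label multiclass predictions.

    Micro-F1 pools all decisions (equivalent to accuracy here), while
    macro-F1 averages per-class F1 and so is sensitive to hard classes.
    """
    per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Define F1 = 0 when the class is never correctly predicted.
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro = sum(per_class) / len(labels)
    return micro, macro
```

The gap between our micro and macro scores reflects exactly this sensitivity: conditions whose binary models are poorly calibrated drag the macro average down.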

Conclusions and Future Work
In this paper we collect and analyze a large-scale dataset of social media posts from users with various mental health conditions. We analyze linguistic characteristics of the posts, directly comparing features across the various conditions. We build strong classification models based on deep contextualized representations and demonstrate that they outperform the LIWC-feature-based logistic regression baseline by a large margin. Although the LIWC feature representation is less useful for classification, it remains a useful representation for analyzing posts to gain insight into the differences between groups. Our experimental results show that linguistic traits relevant to mental health detection are more easily recognized at the user level, and thus effectively aggregating post-level signals is crucial to accurate prediction. We also find that these contextualized representations rely heavily on semantic content and consistently perform better when semantic indicators are obvious. Finally, we show that the prediction scores of our classification models, even the accurate ones, are not well calibrated and thus are not accurate estimators of mental health risk. These results call for more interpretable models for mental health detection. Future research may explore learning better deep features and additional classification paradigms to further improve performance on this impactful problem.