Do Models of Mental Health Based on Social Media Data Generalize?

Proxy-based methods for annotating mental health status in social media have grown popular in computational research due to their ability to gather large training samples. However, an emerging body of literature has raised new concerns regarding the validity of these types of methods for use in clinical applications. To further understand the robustness of distantly supervised mental health models, we explore the generalization ability of machine learning classifiers trained to detect depression in individuals across multiple social media platforms. Our experiments not only reveal that substantial loss occurs when transferring between platforms, but also that there exist several unreliable confounding factors that may enable researchers to overestimate classification performance. Based on these results, we enumerate recommendations for future mental health dataset construction.


Introduction
In the last decade, there has been substantial growth in the area of digital psychiatry. Automated methods using natural language processing have been able to detect mental health disorders based on a person's language in a variety of data types, such as social media (Mowery et al., 2016; Morales et al., 2017), speech (Iter et al., 2018), and other writings (Kayi et al., 2017; Just et al., 2019). As in-person clinical visits are made increasingly difficult by socioeconomic barriers and public-health crises, such as COVID-19, tools for measuring mental wellness using implicit signal become more important than ever (Abdel-Rahman, 2019; Bojdani et al., 2020).
Early work in this area leveraged traditional human subject studies in which individuals with clinically validated psychiatric diagnoses volunteered their language data to train classifiers and perform quantitative analyses (Rude et al., 2004; Jarrold et al., 2010). In an effort to model larger, more diverse populations with less overhead, a substantial portion of research in the last decade has instead explored data annotated via automated mechanisms (Coppersmith et al., 2015a; Winata et al., 2018).
Studies leveraging proxy-based annotations have supported their design by demonstrating alignment with existing psychological theory regarding language usage by individuals living with a mental health disorder (Cavazos-Rehg et al., 2016; Vedula and Parthasarathy, 2017). For example, feature analyses have highlighted higher amounts of negative affect and increased personal pronoun prevalence amongst depressed individuals (Park et al., 2012; De Choudhury et al., 2013). Given these consistencies, the field has largely turned its attention toward optimizing predictive power via state-of-the-art models (Orabi et al., 2018; Song et al., 2018).
The ultimate goal of these efforts has been threefold: to better personalize psychiatric care, to enable early intervention, and to monitor population-level health outcomes in real time. Nonetheless, research has largely trudged forward without stopping to ask one critical question: do models of mental health conditions trained on automatically annotated social media data actually generalize to new data platforms and populations?
Typically, the answer is no, or at least not without modification. Performance loss is to be expected in a variety of scenarios due to underlying distributional shifts, e.g. domain transfer (Shimodaira, 2000; Subbaswamy and Saria, 2020). Accordingly, substantial effort has been devoted to developing computational methods for domain adaptation (Imran et al., 2016; Chu and Wang, 2018). Outcomes from this work often provide a solid foundation for use across multiple natural language processing tasks (Daume III and Marcu, 2006). However, it is unclear to what extent factors specific to mental health require tailored intervention.
In this study, we demonstrate that at a baseline, proxy-based models of mental health status do not transfer well to other datasets annotated via automated mechanisms. Supported by five widely used datasets for predicting depression in social media users from both Reddit and Twitter, we present a combination of qualitative and quantitative experiments to identify troublesome confounds that lead to poor predictive generalization in the mental health research space. We then enumerate evidence-based recommendations for future mental health dataset construction.
Ethical Considerations. Given the sensitive nature of data containing mental health status of individuals, additional precautions based on guidance from Benton et al. (2017a) were taken during all data collection and analysis procedures. Data sourced from external research groups was retrieved according to each dataset's respective data usage policy. The research was deemed exempt from review by our Institutional Review Board (IRB) under 45 CFR § 46.104.

Domain Adaptation in Mental Health
Domain adaptation (or "transfer") of statistical classifiers is a well-studied computational problem with high relevance across several areas of natural language processing (Jiang, 2008; Peng and Dredze, 2017). It is particularly useful in situations where acquiring ample training data for a target application is intractable (e.g. monetary, time constraints) or impossible (e.g. privacy constraints) (Rieman et al., 2017). For example, in the sub-field of machine translation, significant effort is devoted to finding ways to effectively use large corpora of formal parallel text to train models for application in domains with informal and dynamic language, such as social media and conversational speech (Wang et al., 2017; Murakami et al., 2019).
Traditional challenges encountered when transferring models between domains include variance in source and target class distributions (Japkowicz and Stephen, 2002), semantic misalignment (Wu and Huang, 2016), and sparse vocabulary overlap (Stojanov et al., 2019). Fortunately, once these issues are identified, it is typically possible to decrease the transfer performance gap via methods such as structural correspondence learning, feature subspace mapping, and adversarial training (Blitzer et al., 2006; Bach et al., 2016; Tzeng et al., 2017).
Domain adaptation is of particular interest in the mental health space, where there exist numerous complexities in obtaining a sufficient sample of training data. For instance, the sensitive nature of mental health data necessitates extra care when creating and supporting new datasets (Benton et al., 2017a). Additionally, behavioral disorders are known to display variable clinical presentations amongst different populations, which can make identification of ground truth difficult (De Choudhury et al., 2017; Arseniev-Koehler et al., 2018).
The latter point highlights the presence of label noise inherent in mental health data (Mitchell et al., 2009; Shing et al., 2018). This facet serves as one of two primary issues unique to this research space that may hinder attempts at domain transfer. Indeed, prior work found that diverse and sometimes conflicting views humans have regarding suicidal ideation can make obtaining reliable gold-standard labels fundamentally challenging and lead to degradation in model performance.
Sampling-related biases present the other main area of concern for successful domain transfer by mental health classifiers. Attributes such as personality, gender, age, and disorder co-morbidity have been found to significantly affect the presentation of mental health disorders in language data (Cummins et al., 2015; Preoţiuc-Pietro et al., 2015). Moreover, the proxy-based annotation mechanisms used to label large social media datasets with mental health status invite the introduction of self-disclosure bias into the modeling task (Amir et al., 2019). Specifically, labels sourced from populations of individuals who self-disclose certain attributes may contain activity-level and thematic biases that cause poor generalization in larger populations (Lippincott and Carrell, 2018).
Research leveraging text data for mental health status classification has primarily considered only a constrained form of domain transfer. In a within-subject analysis, Ireland and Iserman (2018) examined differences in language usage by Reddit users who had posted in an anxiety support forum, both within and outside mental health forums. Similarly, Wolohan et al. (2018) explored the predictive power of models trained to detect depression within Reddit users as a function of access to text from explicit mental health related subreddits. Both studies highlighted a mitigation of overt mental health discussion outside of the support forums, but still detected linguistic nuances in individuals with an affiliation to the mental health forums. Shen et al. (2018) attempted to use transfer learning with large amounts of English Twitter data annotated with individual-level depression labels to improve predictive performance of depression classifiers in Chinese Weibo data. Using the English and Chinese versions of the Linguistic Inquiry and Word Count tool (LIWC) (Pennebaker et al., 2001; Huang et al., 2012) in conjunction with other modalities of social data (e.g. profile metadata, images), the authors showed that signal from Twitter was useful for classification on Weibo.
Recent work from Ernala et al. (2019) was the first to explore some of the aforementioned difficulties with domain transfer in the mental health space. Multiple different annotation mechanisms were used to train Twitter-based models for identifying schizophrenia, which were then applied to Facebook data from an independent population of clinically diagnosed schizophrenia patients. Three different types of proxy signals with varying degrees of manual supervision were each found to generalize poorly to the clinical population. While the authors' analysis suggested the domains were similar enough to justify transfer attempts, only limited post-hoc analysis of the data platform effect was carried out. Thus, it remains unclear to what extent the annotation methodologies, as opposed to platform effects (or other confounds), caused the degradation.

Data
We select depression classification as our task because it is perhaps the most widely studied, has multiple datasets from different platforms, and is of critical importance to society. Estimated to affect 4.4% of the global population, depression presents a significant economic burden and remains the most common psychiatric disorder associated with deaths by suicide (Hawton et al., 2013; World Health Organization, 2017). Occupying the lion's share of the computational literature, depression classification is a critical first target for evaluating generalization of mental health models in social media (Chancellor and De Choudhury, 2020).
To quantify the nature of domain transfer loss, we consider five datasets. Datasets were selected based on their common adoption in the literature. Table 1 presents summary statistics. Construction details are in Appendix A as a courtesy to the reader.

Mitigating Bias
Each dataset was curated in part by a system of simple rules (e.g. matches to "I was diagnosed with depression," participation in a depression support forum). While these heuristics are useful for identifying candidates to include within each dataset, they also risk introducing bias that may render the modeling task trivial. For example, individuals who disclose a depression diagnosis are likely to also share their experience with other psychiatric conditions (Benton et al., 2017b), while language used in dedicated mental-health subreddits systematically differs from the rest of Reddit (De Choudhury and De, 2014; Ireland and Iserman, 2018).
To encourage our mental health classifiers to learn subtle linguistic nuances that cannot be easily captured using straightforward logic, we make efforts to exclude unambiguous mental health content from all training and evaluation procedures. In line with prior work, we discard posts that include mentions of clinically defined psychiatric conditions, adopting the list of mental health terms enumerated by Cohan et al. (2018) as a reference. This list (N=458) extends work from Yates et al. (2017) by including disorders tangential to depression, common misspellings, and colloquial references.
As is standard for mental health modeling, we also discard posts made in subreddits dedicated to providing mental health support (Yates et al., 2017; Cohan et al., 2018; Wolohan et al., 2018). Since new subreddits are created daily and our version of the Topic-Restricted Text dataset contains posts made after collection of RSDD and SMHD, we create an updated list of mental health support subreddits. To do so, we examine the empirical distribution of posts amongst subreddits within the Topic-Restricted Text dataset and rank each subreddit S based on pointwise mutual information (PMI) for the depression group D, log(p(S|D)/p(S)). We then manually examine the top 1,000 subreddits by PMI and identify all subreddits whose description affirms an association with mental health.
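The ranking step can be sketched as follows; the data layout (per-post (user, subreddit) pairs and a set of depression-group user IDs) is an assumption for illustration rather than our pipeline's actual interface:

```python
import numpy as np
from collections import Counter

def rank_subreddits_by_pmi(posts, depression_ids):
    """Rank subreddits by PMI with the depression group, log(p(S|D)/p(S)).

    posts: iterable of (user_id, subreddit) pairs.
    depression_ids: set of user IDs labeled with depression.
    """
    posts = list(posts)
    all_counts = Counter(s for _, s in posts)
    dep_counts = Counter(s for u, s in posts if u in depression_ids)
    n_all = sum(all_counts.values())
    n_dep = sum(dep_counts.values())
    pmi = {s: np.log((c / n_dep) / (all_counts[s] / n_all))
           for s, c in dep_counts.items()}
    # The top-ranked subreddits are the candidates for manual review.
    return sorted(pmi.items(), key=lambda kv: kv[1], reverse=True)
```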
Our list (N=242) expands existing resources from Yates et al. (2017) and Cohan et al. (2018) by providing 162 additional mental health subreddits, many of which were actually created before the collection of RSDD and SMHD. 1 While this step diminishes the risk of mental health content saturating the Topic-Restricted Text dataset, the list's expansion beyond that of the RSDD and SMHD lists suggests that the former two Reddit datasets may indeed still have overt mental health content. We explore how different degrees of subreddit-based filtering may affect generalization in §6.4.

Models
We begin by training classification models for predicting depression on each dataset. All classification experiments leverage the same training procedure and features (see Appendix D for details). As a classifier, we use ℓ2-regularized logistic regression. Despite our model's relative simplicity, we are able to achieve respectable within-domain classification performance while maintaining an ability to interpret learned parameters. In prior mental health studies, logistic regression with access to appropriately engineered features has served as a difficult benchmark to beat (Benton et al., 2017b).
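As a minimal sketch, the classifier corresponds to scikit-learn's default ℓ2-penalized logistic regression; the feature matrix construction (Appendix D) is assumed to happen upstream, and the function name is illustrative:

```python
from sklearn.linear_model import LogisticRegression

def train_depression_classifier(X_train, y_train, C=1.0):
    """Fit an l2-regularized logistic regression on the engineered
    features of Appendix D (X_train) and binary labels (y_train)."""
    clf = LogisticRegression(penalty="l2", C=C, max_iter=1000)
    clf.fit(X_train, y_train)
    return clf

# Coefficients stay interpretable: one signed weight per feature dimension,
# e.g. dict(zip(feature_names, clf.coef_[0])) for a list of feature names.
```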

Model Validation
To validate our modeling framework against prior work, we first establish within-domain predictive baselines. This step also allows us to contextualize performance by estimating the intrinsic difficulty of modeling each dataset (DeMasi et al., 2017).
Methods. We use train/development/test splits if they have been established by the dataset distributor; otherwise, we sample 20% from the available data to be used as a held-out test set and then create an additional 80/20 train/dev split using the remaining data. For each dataset, we use an independent grid search to select regularization strength C that maximizes F1 in the dataset's development split (see Appendix E). We use a binarization threshold of 0.5 (noninclusive) for all datasets.
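The selection procedure can be sketched as below; the grid values are illustrative, and `train_fn` stands in for a fitting routine like the one sketched in §4:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def select_regularization(X, y, train_fn, grid=(0.01, 0.1, 1.0, 10.0)):
    """Pick the C that maximizes development-set F1.

    Used when no official splits exist: 20% is first held out for test
    (not shown), and an 80/20 train/dev split is carved from the rest.
    """
    X_tr, X_dev, y_tr, y_dev = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    best_C, best_f1 = None, -np.inf
    for C in grid:
        clf = train_fn(X_tr, y_tr, C=C)
        # Binarize at 0.5, noninclusive, as in all of our experiments.
        y_hat = (clf.predict_proba(X_dev)[:, 1] > 0.5).astype(int)
        score = f1_score(y_dev, y_hat)
        if score > best_f1:
            best_C, best_f1 = C, score
    return best_C
```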
Results. We report test set F1 for each dataset in the bottom row of Table 2. Our models perform on par with prior research for the two Twitter datasets and the Topic-Restricted Text dataset. Results for RSDD and SMHD improve upon their respective baseline models, but are inferior to neural methods.

Transfer Experiments
We conduct a series of experiments to measure the generalization of models between depression datasets and explain sources of model degradation.

Cross-domain Transfer
Task formulation and dataset design remain a significant source of nuance across prior studies for mental health status prediction (Morales et al., 2017; Chancellor and De Choudhury, 2020). As such, we hypothesize that standardizing training settings (e.g. class balance, sample size) will account for discrepancies in cross-domain performance.
Methods. We consider two experimental designs. In the first experiment (†), we downsample all datasets to have the same training/development size as the smallest class in the smallest dataset (i.e. CLPsych). In the second experiment (††), we balance class distributions independently for each dataset based on the dataset's smaller class, but allow sample size to vary between datasets. The former experiment allows us to establish equitable baselines between datasets, while the latter enables us to explore whether access to additional training data ameliorates transfer loss.

For both experiments, we start by combining training and development splits. Then, for each dataset, we sample from the combined splits based on the parameters of the experiment and split the resulting sample into 5 class-stratified folds. We train 5 classifiers per dataset, using 4 folds for training each time, and apply the classifiers to each dataset's test set. Since a substantial portion of individuals in SMHD are part of RSDD, we refrain from conducting experiments between the two datasets.
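A sketch of the sampling and fold construction, with `rng` a NumPy random generator (np.random.default_rng); the per-experiment sampling parameters are as described above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def balanced_sample(labels, n_per_class, rng):
    """Pick user indices so that each class has n_per_class members.
    For experiment (†), n_per_class is fixed by the smallest class of the
    smallest dataset (CLPsych); for (††), it is the size of each dataset's
    own smaller class."""
    labels = np.asarray(labels)
    return np.concatenate([
        rng.choice(np.flatnonzero(labels == c), n_per_class, replace=False)
        for c in np.unique(labels)
    ])

def training_folds(X, y, n_splits=5):
    """Yield 5 training index sets (4 of 5 class-stratified folds each);
    every resulting classifier is applied to each dataset's test set."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, _ in skf.split(X, y):
        yield train_idx
```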
Results. We report F1 score (µ ± σ) for both experiments in Table 2. In line with existing research, within-domain training outperforms cross-domain training in each of our datasets for both sampling settings. While additional samples available for training in the second experiment moderately improve within-domain performance, they are not uniformly helpful for mitigating transfer loss to other datasets. Models generally outperform a random classifier at ranking depression risk in cross-domain transfer scenarios. However, some models are poorly calibrated for new domains and consequently obtain low F1 scores (e.g. CLPsych → SMHD). Addressing miscalibration in domain transfer scenarios remains an open research question (Pampari and Ermon, 2020; Park et al., 2020).
We find that models trained on Twitter data transfer to Reddit data better than models in the reverse direction. Not surprisingly, given their overlap in training samples, models trained on the SMHD and RSDD datasets transfer to other domains in an equitable manner, trading improvements with each other across transfer settings. These results indicate that sample size and class balance are not solely responsible for generalization loss.

Temporal Transfer
Typical sources of transfer loss concern differences in features between domains (Blitzer et al., 2007; Ben-David et al., 2010). However, other factors may govern model degradation for depression classification. One such cause of loss is temporal misalignment between the datasets (Table 1). Prior work has shown that language dynamics may hinder models upon deployment (Dredze et al., 2016; Huang and Paul, 2018). In social media, where users adopt new linguistic norms rapidly, performance may be more volatile (Brigadir et al., 2015).

Class Misalignment
As an exercise to understand whether temporal artifacts are present in the datasets, we first consider training and evaluating single-domain models with a temporal misalignment between the control and depression groups. By training on mutually exclusive time periods for each class, we hypothesize the classifier will learn not only to distinguish between groups, but also to distinguish between time periods. If this hypothesis holds true, we expect performance metrics to be artificially inflated when the classes are temporally exclusive.
Methods. We split each dataset into one-year periods based on the calendar year. For each year, we identify individuals in the Twitter datasets with at least 200 posts and individuals in the Reddit datasets with at least 100 posts. 2 We balance the number of individuals across time periods and groups within each dataset, but allow this sample size to vary across datasets. To account for growth in post frequency over time (which increases the number of documents that generate individual feature vectors), we perform additional post-level sampling. We randomly select 200 posts per year in the Twitter datasets and 100 posts per year in the Reddit datasets. Samples of individuals within each time period are additionally separated into 5 stratified folds. Folds are established so that individuals in the training data of one time period are never present in the test data of another time period.

To evaluate the degree to which temporal effects are present, we sample groups from all possible combinations of time periods. For example, in one setting, both the control and depression groups are sampled from 2013; in another setting, the control group is sampled from 2013, while the depression group is sampled from 2015. For each combination, we use 4 of the stratified folds for training and the remaining fold for evaluation, and then repeat the process for all folds. We compare performance when classes are sampled from the same time period against performance when classes are sampled from mutually exclusive time periods.
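The enumeration of time-period pairings can be sketched as follows; the year range shown is illustrative rather than the exact periods of Table 1:

```python
from itertools import product

def temporal_pairings(years=(2013, 2014, 2015, 2016)):
    """Enumerate (control_year, depression_year) settings. Aligned settings
    draw both groups from the same year; misaligned settings draw them
    from mutually exclusive years. Comparing F1 across the two regimes
    exposes temporal artifacts."""
    aligned = [(y, y) for y in years]
    misaligned = [(c, d) for c, d in product(years, years) if c != d]
    return aligned, misaligned
```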
Results. We achieve a 3-22% increase in F1 across all datasets when classes are sampled from mutually exclusive time periods instead of being temporally aligned. The improvement suggests that temporal artifacts exist, as the classifier is able to not only identify signal relevant to classifying depression, but also to classifying data from different periods of time. This result highlights the importance of sampling classes evenly over time.

Latency
We now measure the effect temporal artifacts have on cross-domain performance. We hypothesize that model degradation scales with deployment latency.
Methods. We use the same data sampling mechanism described in §5.2.1. However, we now only consider the case in which control and depression groups are sampled from the same time period. As before, we train a classifier on 4 of the 5 stratified folds for a time period in one dataset. We then evaluate within-domain performance using the remaining fold and cross-domain performance using one fold from each time period in the other datasets. We assume ground truth is consistent over multiple time periods; given the episodic nature of depression, we recognize this may promote pessimistic results for some periods (Tsakalidis et al., 2018).
Results. Examining within-domain results in Figure 1 (left), predictive performance tends to be better for more recent temporal splits regardless of training period. Classifiers trained on old data (relative to the evaluation period) tend to perform on par with aligned regimens, while classifiers trained on new data show linear losses over time. Losses are significant after 2-3 years depending on the dataset.
Though some trends do emerge, cross-domain performance as a function of temporal latency is relatively variable. Visualized in Figure 1 (right), models trained on the Twitter datasets benefit most from temporal alignment in cross-domain settings. Models trained on Topic-Restricted Text show significant drop-offs in predictive performance when applied to older samples within all Reddit datasets. While models trained on RSDD perform better on Topic-Restricted Text as latency is reduced, models trained on SMHD do not exhibit the same trend.

Post-hoc Analysis
In the previous section, we identified the degree to which loss occurs under a variety of domain transfer settings. However, these settings do not account for all performance disparities. In this section, we measure differences between the datasets to understand the source of loss.

Vocabulary Overlap
Traditionally, different feature vocabularies account for domain transfer loss (Serra et al., 2017; Chen and Gomes, 2019; Stojanov et al., 2019). Therefore, we hypothesize that limited feature overlap and poor vocabulary alignment across datasets could hinder cross-domain generalization.
Methods. We explore this phenomenon by computing the Jaccard Similarity (JS) of vocabularies between each dataset. We examine correlations between JS and F1 scores from the cross-domain transfer experiments discussed in §5.1.
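Both computations are straightforward; a minimal sketch, assuming each dataset's vocabulary is available as a set of unigrams:

```python
from scipy.stats import pearsonr

def jaccard_similarity(vocab_a, vocab_b):
    """|A ∩ B| / |A ∪ B| over two datasets' unigram vocabularies."""
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# Correlate pairwise overlap with the cross-domain F1 scores from §5.1:
# r, p = pearsonr([jaccard_similarity(a, b) for a, b in pairs], f1_scores)
```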
Results. We find the minimum similarity occurs between the CLPsych and RSDD datasets (JS = 0.10) while the maximum occurs between the Topic-Restricted Text and SMHD datasets (JS = 0.65). 3 Only a weak correlation between similarity and performance exists (Pearson ρ < 0.18), suggesting poor generalization is not solely due to differences in vocabulary.

Topical Alignment
Our classification models leverage reduced feature representations in the form of LDA topic distributions (Blei et al., 2003) and mean-pooled pre-trained GloVe embeddings (Pennington et al., 2014). Because these low-dimensional features are designed to capture and reflect semantics, we hypothesized they would mitigate transfer loss due to poor vocabulary alignment. Lacking support for this hypothesis from our cross-domain transfer results, we look closer at the themes present within each dataset.
Methods. We identify the unigrams that are most unique to each dataset and group. For each dataset, we use scores assigned by our KL-divergence-based feature selection method (see Appendix D) to rank the most informative features per class (Chang et al., 2012). We jointly examine the top-500 most informative unigrams per class, noting high-level themes common across the datasets.
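One plausible reading of this scoring scheme is sketched below; the exact smoothing and normalization of a Chang et al. (2012)-style selection may differ:

```python
import numpy as np

def kl_feature_scores(word_class_counts, stopword_class_counts):
    """Score unigrams by KL divergence between their class distribution
    and the class distribution of stop words.

    word_class_counts: dict, word -> per-class count array.
    stopword_class_counts: per-class counts aggregated over stop words.
    """
    p_bg = stopword_class_counts / stopword_class_counts.sum()
    scores = {}
    for w, counts in word_class_counts.items():
        p_w = (counts + 1) / (counts + 1).sum()  # add-one smoothing
        scores[w] = float(np.sum(p_w * np.log(p_w / p_bg)))
    # Highest-scoring unigrams diverge most from the class-neutral baseline.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```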
Results. With respect to similarities, we note that words used in discussion about gender and sexuality are strongly associated with each of the depression groups (e.g. 'cis', 'homophobia', 'masculine'), likely a reflection of marginalized groups being at higher risk of depression (Budge et al., 2013). Also ubiquitous amongst each of the datasets are references to self-injurious behavior (e.g. 'wrists', 'self-harm', 'hotline'). Increased emoji usage and references to athletics ('nbafinals', 'scorer') are strong indicators of the control group in each dataset, as are terms reflecting current events.
With respect to differences, associations between word usage and depression are subjectively easier to interpret within the Reddit datasets. For example, discussion of mental-health treatment (e.g. 'counselor', 'therapy', 'wellbutrin') and familial and intimate relationships ('brother-in-law', 'soulmate') is prominent within the Reddit datasets. In contrast, language associated with depression within the Twitter datasets tends to reflect slightly more nuanced elements of the condition, e.g. social inequity ('sexism', '#yesallwomen') and fantasy ('fanfics', 'cosplay', 'villians'). These themes align with empirical findings that women are at a higher risk of depression (Kessler, 2003) and that depressed individuals often find solace in niche subcultures (Blanco and Barnett, 2014; Bowes et al., 2015).
Additionally, we find several temporally-isolated references within the Twitter datasets (e.g. '#RIPRobinWilliams', '#SDCC'). In the Multi-task Learning dataset, we also see several terms using non-American English (e.g. 'colour', 'favourite'), which may represent a geographic imbalance amongst the sampled individuals.

Stability of LIWC
The Linguistic Inquiry and Word Count (LIWC) dictionary has been an effective tool for measuring linguistic nuances of mental health disorders regardless of textual formality (Mowery et al., 2016; Turcan and McKeown, 2019). Our version of the dictionary (2007) maps approximately 12k words to 64 dimensions (e.g. negative emotion, leisure) that have been empirically validated to capture an individual's social and psychological states (Tausczik and Pennebaker, 2010). 4 A single LIWC feature value represents the proportion of words used across an individual's post history that match the given LIWC dimension. In the same way that we expect semantic distributions (§6.2) to ameliorate transfer loss, we hypothesize that models trained on this representation will be more robust when vocabulary overlap is sparse.
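A simplified sketch of this computation; the flattened word-to-dimensions lexicon interface is an assumption, and LIWC's wildcard prefix matching (e.g. 'happi*') is omitted:

```python
from collections import Counter

def liwc_features(tokens, liwc_lexicon):
    """Proportion of an individual's tokens matching each LIWC dimension.

    liwc_lexicon is assumed to map each word to the set of dimensions
    it belongs to."""
    dim_counts = Counter()
    for tok in tokens:
        for dim in liwc_lexicon.get(tok, ()):
            dim_counts[dim] += 1
    n = max(len(tokens), 1)
    return {dim: c / n for dim, c in dim_counts.items()}
```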
Methods. We explore this hypothesis from three angles: 1) We perform cross-domain transfer experiments using LIWC as the only feature set provided for training and evaluation; 2) We fit LIWC-based classifiers 100 times per dataset using random 70% samples and examine correlations of the learned coefficients; 3) We compute the average feature value of each LIWC dimension per class and measure the difference between classes.
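The second angle can be sketched as follows, with `train_fn` assumed to return a fitted linear classifier exposing `coef_`:

```python
import numpy as np
from scipy.stats import spearmanr

def coefficient_samples(X, y, train_fn, n_runs=100, frac=0.7, seed=0):
    """Refit a LIWC-only classifier on random 70% samples and stack the
    learned coefficient vectors."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = []
    for _ in range(n_runs):
        idx = rng.choice(n, int(frac * n), replace=False)
        coefs.append(train_fn(X[idx], y[idx]).coef_[0])
    return np.vstack(coefs)

def cross_dataset_agreement(coefs_a, coefs_b):
    """Spearman correlation between mean coefficient vectors learned on
    two datasets (LIWC dimensions assumed aligned)."""
    rho, _ = spearmanr(coefs_a.mean(axis=0), coefs_b.mean(axis=0))
    return rho
```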
Results. We note that domain-transfer experiments using LIWC as the only feature set maintain high degrees of transfer loss while sacrificing within-domain performance. Moreover, correlations between model coefficients across datasets are relatively low for all comparisons, peaking at a Spearman ρ of 0.338 between the RSDD and SMHD datasets, which already share a substantial number of users. In general, LIWC coefficients tend to be more correlated within platforms than between them.
Examination of the underlying class differences provides insight into linguistic differences between each dataset's depression group. In line with prior work, function word use, first-person pronoun use, and cognitive mechanisms are more common within the depression group of each dataset, though their relative prevalence varies. Conversation regarding relativity (i.e. space, motion, time) is strongly associated with the control groups in the Twitter data, but is more associated with the depression groups in the Reddit data. Anger and perceptual topics are more prevalent within the depression groups for Twitter than Reddit.

Self-disclosure Bias
In the aforementioned analysis, posts from mental health subreddits and those including mental health terms were excluded. Nonetheless, individuals within each of the depression groups for the Reddit datasets displayed language that was unambiguously associated with seeking support or sharing personal experience with mental health issues. Accordingly, we hypothesize that existing filters are unable to remove confounds in individuals who disclose a depression diagnosis on Reddit.
Methods. To measure this effect, we examine differences in the distribution of subreddits that individuals in the depression group of the Topic-Restricted Text data post in relative to individuals in the control group. Specifically, we fit a logistic regression model mapping the subreddit distribution of individuals' posts to their mental health status after applying each subreddit filter list (e.g. RSDD, SMHD, ours). We compare predictive performance of these models and the learned coefficient weights to understand the effect of filtering. As a baseline, we retain posts from the r/depression subreddit in the feature set. Then, in order of increasing coverage, we apply subreddit filters from RSDD, SMHD, and our study, and measure classification performance. For each filter, we examine the learned coefficient weights to develop a sense of the personality and interests of individuals in the depression group.
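A sketch of this procedure; the per-user count dictionaries are an illustrative interface:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def subreddit_status_model(user_subreddit_counts, labels, filter_list):
    """Predict mental health status from where users post.

    user_subreddit_counts: one dict per user, subreddit -> post count.
    filter_list: subreddits to exclude (the RSDD, SMHD, or our list).
    """
    dists = []
    for counts in user_subreddit_counts:
        kept = {s: c for s, c in counts.items() if s not in filter_list}
        total = sum(kept.values()) or 1
        dists.append({s: c / total for s, c in kept.items()})
    vec = DictVectorizer()
    X = vec.fit_transform(dists)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    # Large positive coefficients flag subreddits predictive of depression.
    return clf, vec.get_feature_names_out()
```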
Results. The baseline F1 score in the development set maxes out at 0.83, reflecting the fact that several individuals in the control group had posted in the r/depression subreddit at some point in their history, but were not labeled as having depression because the automatic annotation procedure considered only recent original posts. Performance degrades with the expansion of excluded subreddits from each filter, settling at an F1 of 0.72. Coefficients from the model highlight subreddits related to themes of sexuality (r/bisexual, r/actuallesbians), gender (r/ftm), personality (r/introvert, r/INFP), drugs (r/Trees, r/LSD), and relationships (r/MakeNewFriendsHere, r/BreakUps) as being predictive of depression.
The strong classification performance achieved after our filtering measures is evidence that distributional differences in online interaction remain in the "cleaned" Topic-Restricted Text dataset. As our subreddit list is more robust than both the RSDD and SMHD lists, there is reason to believe similar confounds exist in these datasets. The coefficient analysis provides a window into the types of themes that could incorrectly confuse a classification model during generalization attempts.

Recommendations
We have demonstrated that issues of transfer loss persist in the mental health space, at least for the proxy-based social media datasets considered in our study. Importantly, we identified confounds that emerge as a result of each dataset's respective design. Critically, existing datasets have flaws that make them difficult to use for constructing models for new data types and populations.
Topical Alignment. Researchers must account for self-disclosure bias and confounds of personality when curating new datasets. First discussed in §6.2, models trained on the Reddit datasets learn dependencies between support-driven topics, such as medication usage and relationship advice, and depression. In contrast, models trained on the Twitter datasets identify the same correlations between sexuality, gender, and depression that Reddit-based models detect, but also learn about the recreational outlets (i.e. fantasy) and social concerns (i.e. racism, sexism) common in depressed individuals.
We hypothesize that semantic divergences reflect self-disclosure bias and differences in platform interaction patterns (Malik et al., 2015;Shelton et al., 2015). Twitter's status-and reply-based structure serves as a place for individuals to share personal thoughts and experiences in reaction to their daily life. Meanwhile, Reddit's community-based forums require active engagement with specific topics and may silo individuals who wish to discuss their mental health beyond defined areas. The latter gains support from our analysis of subreddit distributions in the Topic-Restricted Text data ( §6.4).
Topical nuances in language may appropriately reflect elements of identity associated with mental health disorders (i.e. traumatic experiences, coping mechanisms). However, if not contextualized during model training, this type of signal has the potential to raise several false alarms upon application to new populations. Accordingly, we urge researchers to minimize the presence of overt topical disparities between classes in their datasets.
Mitigating Temporal Artifacts. Researchers must take steps to remove temporal artifacts in new datasets. Experiments conducted in §5.2 reveal that group-based temporal alignment and latency between model training and deployment can have a significant effect on predictive performance. Variability of performance over time is surprising, as there is no clinical evidence to suggest that the underlying symptoms of depression (on a population level) change over time (APA, 2013).
We hypothesize two reasons for this observation. First, since depression presents in an episodic manner, we may expect data closest to the date of annotation to be the most predictive of an individual's labeled mental status (Melartin et al., 2004). If most posts used for annotation occurred in recent time windows, then it is possible that content in older posts is less relevant to the depressive state of individuals in our data sets. Second, and more problematic, is the possibility that signal used by our classifiers is only a spurious correlation.
At a bare minimum, our results highlight the importance of sampling classification groups so that post volume is equal over time. Discrepancies may wrongly suggest that temporal artifacts are useful for detecting mental health disorders. Going further, researchers should remove temporally-specific references and minimize highly-dynamic language in their datasets. Avenues for accomplishing the latter include using NER to redact n-grams that serve as spurious correlations (Ritter et al., 2011) and leveraging adversarial training to evaluate the degree to which mental health signal may be learned without a notion of time (Tzeng et al., 2017).
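As one illustrative sketch of such NER-based redaction, using spaCy as an off-the-shelf tagger; the chosen entity types are assumptions, and tokens like hashtags may need separate handling:

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Entity types most likely to act as temporally-specific signal.
REDACT = {"PERSON", "EVENT", "DATE", "ORG", "GPE"}

def redact_entities(text):
    """Replace named entities with generic type tokens before training."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in REDACT:
            out.append(text[last:ent.start_char])
            out.append(f"<{ent.label_}>")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

# e.g. redact_entities("Robin Williams died in August 2014")
#      might yield "<PERSON> died in <DATE>"
```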

Limitations and Future Work
Though our study provides a robust perspective toward understanding generalization capabilities of mental health classifiers for social media, we recognize that more learning opportunities exist. Our study only considers a handful of datasets, two platforms, a single mental health disorder, and homogeneous annotation mechanisms. Still unexplored, in large part due to the precautions necessary for securing sensitive mental health data, is how well models trained on data from actual clinical populations generalize to proxy-based datasets and other clinical populations. While high co-morbidity rates between depression and other mental health disorders may allow us to infer model behavior for alternative conditions, we also recognize that presentations of different psychiatric disorders can be quite variable and warrant their own research (Benton et al., 2017b; Arseniev-Koehler et al., 2018).
Another limitation in our work is the lack of depression-to-control group matching from the original reference material. Preoţiuc-Pietro et al. (2015) and De Choudhury et al. (2017) demonstrate that mental health disorders such as depression can have variable presentations based on demographic attributes. The attributes originally used to construct our Twitter datasets were inferred via now-outdated text-based models. Accordingly, demographic inference errors may be propagated to and correlated with depression classification errors. Moreover, these attributes were not considered within the construction of any of the Reddit datasets we explored. The effect of demographics on generalization remains a valuable topic for future exploration.
Finally, our attempts at domain transfer are constrained. Namely, we do not invoke explicit domain adaptation methods (Peng and Dredze, 2017; Li et al., 2018; Huang and Paul, 2019). Moving forward, we plan to explore algorithmic strategies to mitigate the biases discovered in this study.

A Datasets
RSDD. The Reddit Self-reported Depression Diagnosis (RSDD) dataset was introduced by Yates et al. (2017). To align the theme of language generated by individuals across classification groups, each individual in the depression group was greedily matched with 12 individuals from the candidate control pool based on Hellinger distance between each individual's post distribution over subreddits. To preserve privacy of individuals within the dataset, usernames were anonymized and post metadata was redacted. Accordingly, linkages between each individual within the depression group and their respective control group pairs could not be recreated.
SMHD. The Self-Reported Mental Health Diagnoses (SMHD) dataset was constructed in a similar manner to RSDD, albeit expanded to support 9 conditions, leverage more precise regular expressions, and abide by a more conservative term/subreddit filter set (Cohan et al., 2018). As with RSDD, linkages between individuals in the depression group and their controls were not preserved in our version of the dataset nor could they be readily reproduced. A substantial portion of individuals in SMHD are also part of RSDD; for this reason, we refrain from conducting domain transfer experiments between the two datasets.
Topic-Restricted Text. To expand the scope of our analysis, we follow methods described in Wolohan et al. (2018) to curate an additional Reddit dataset in which annotations are assigned based on community participation and explicit mental health signal is removed (hence "topic-restricted text"). Per the original paper, individuals who initiated one of 10k recent posts in r/depression were considered members of the depression group, while individuals who initiated one of 10k recent posts in r/AskReddit (but not in the recent r/depression query) were considered to be members of the control group. Due to the anonymous nature of the RSDD and SMHD datasets, we were unable to determine if any individuals found within the Topic-Restricted Text dataset were also in RSDD or SMHD.

B Temporal Filtering
To limit the introduction of temporal artifacts into the classification process, all datasets were truncated in time so that at least 100 unique data points (e.g. Tweets, Reddit comments) were present in the first and final month across individuals in both classes. Date ranges selected based on this criterion are presented in Table 1.

C Tokenization
To maintain our ability to interpret results consistently, the same preprocessing pipeline was applied across all datasets. Text within both Tweets and Reddit comments was tokenized using a modified version of the Twokenizer (O'Connor et al., 2010). English contractions were expanded, while specific retweet tokens, username mentions, URLs, and numeric values were replaced by generic tokens. As pronoun usage tends to differ in individuals living with depression (Vedula and Parthasarathy, 2017), we removed any English pronouns from our stop word set. 6 Case was standardized across all tokens, with a single flag included if an entire post was made in uppercase letters.
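A rough analogue of this pipeline; the regular expressions and stop word set are illustrative simplifications of the modified Twokenizer setup we actually use:

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NUMERIC_RE = re.compile(r"\b\d+(?:\.\d+)?\b")

# Pronouns carry depression-relevant signal, so they are deliberately
# absent from this (illustrative) stop word set.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to"}

def preprocess(post):
    """Replace URLs, mentions, and numbers with generic tokens, standardize
    case (flagging all-caps posts), and drop stop words. The real pipeline
    additionally uses a modified Twokenizer and expands contractions."""
    flags = ["<UPPER_FLAG>"] if post.isupper() else []
    post = URL_RE.sub("<URL>", post)
    post = MENTION_RE.sub("<USER>", post)
    post = NUMERIC_RE.sub("<NUMERIC>", post)
    tokens = [t.lower() for t in post.split()]
    return flags + [t for t in tokens if t not in STOPWORDS]
```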

D Features
Text from all documents for an individual is concatenated and tokenized using the procedure described in Appendix C. The vocabulary of each training procedure is fixed to a maximum of 100,000 unigrams selected based on KL-divergence of the class-unigram distribution with the class-distribution of stop words (Chang et al., 2012). This reduced bag-of-words representation is then used to generate the following additional feature dimensions: a 50-dimensional LDA topic distribution (Blei et al., 2003), a 64-dimensional LIWC category distribution (Tausczik and Pennebaker, 2010), and a 200-dimensional mean-pooled vector of pretrained GloVe embeddings (Pennington et al., 2014). The reduced bag-of-words representation is transformed using TF-IDF weighting (Ramos et al., 2003). 7
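A condensed sketch of this assembly, assuming a dense count matrix over the selected vocabulary and an in-memory GloVe lookup; the scikit-learn components are stand-ins for whichever implementations were actually used:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfTransformer

def build_features(bow, glove, vocab):
    """Assemble the per-individual representations described above.

    bow:   dense (n_users, |V|) count matrix over the selected vocabulary.
    glove: dict mapping words to 200-d pretrained GloVe vectors.
    vocab: list of words, aligned with the columns of bow.
    LIWC proportions (64-d) are computed separately from the lexicon.
    """
    tfidf = TfidfTransformer().fit_transform(bow)  # weighted BOW
    topics = LatentDirichletAllocation(
        n_components=50, random_state=42).fit_transform(bow)  # 50-d topics
    emb = np.vstack([glove.get(w, np.zeros(200)) for w in vocab])
    weights = bow / np.maximum(bow.sum(axis=1, keepdims=True), 1)
    pooled = weights @ emb  # 200-d mean-pooled embeddings
    return tfidf, topics, pooled
```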