On the State of Social Media Data for Mental Health Research

Data-driven methods for mental health treatment and surveillance have become a major focus of computational science research in the last decade. However, progress in the domain remains bounded by the availability of adequate data. Prior systematic reviews have not necessarily made it possible to measure the degree to which data-related challenges have affected research progress. In this paper, we offer an analysis focused specifically on the state of social media data that exists for conducting mental health research. We do so by introducing an open-source directory of mental health datasets, annotated using a standardized schema to facilitate meta-analysis.


Introduction
The last decade has seen exponential growth in computational research devoted to modeling mental health phenomena using non-clinical data (Bucci et al., 2019). Studies analyzing data from the web, such as social media platforms and peer-to-peer messaging services, have been particularly appealing to the research community due to their scale and deep entrenchment within contemporary culture (Perrin, 2015; Fuchs, 2015; Graham et al., 2015). Such studies have yielded novel insights into population-level mental health (De Choudhury et al., 2013; Amir et al., 2019a) and shown promising avenues for the incorporation of data-driven analyses in the treatment of psychiatric disorders (Eichstaedt et al., 2018).
These research achievements have come despite complexities specific to the mental health space that often make it difficult to obtain a sufficient sample of high-quality data. For instance, behavioral disorders are known to display variable clinical presentations amongst different populations, rendering annotations of ground truth inherently noisy (Arseniev-Koehler et al., 2018). Scalable methods for capturing an individual's mental health status, such as using regular expressions to identify self-reported diagnoses or grouping individuals based on activity patterns, have provided opportunities to construct datasets aware of this heterogeneity (Coppersmith et al., 2015b; Kumar et al., 2015). However, they typically rely on oversimplifications that lack the clinical validation and robustness of an instrument like a mental health battery (Zhang et al., 2014; Ernala et al., 2019).
Ethical considerations further complicate data acquisition, with the sensitive nature of mental health data requiring tremendous care when constructing, analyzing, and sharing datasets (Benton et al., 2017). Privacy-preserving measures, such as de-identifying individuals and requiring IRB approval to access data, have made it possible to share some data across research groups. However, these mechanisms can be technically cumbersome to implement and, due to HIPAA, are subject to strict governance policies when clinical information is involved (Price and Cohen, 2019). Moreover, many privacy-preserving practices require that signals relevant to modeling mental health, such as an individual's demographics or their social network, be discarded (Bakken et al., 2004). This missingness has the potential to limit algorithmic fairness, statistical generalizability, and experimental reproducibility (Gorelick, 2006). Although mental health researchers may anecdotally recall difficulties acquiring quality data or reproducing prior art due to data sharing constraints, no study to our knowledge has explicitly quantified this challenge.
Indeed, prior reviews of computational research for mental health have noted several of the aforementioned challenges, but have predominantly discussed technical methods (e.g. model architectures, feature engineering) developed to surmount existing constraints (Guntuku et al., 2017;Wongkoblap et al., 2017). Recent work from Chancellor and De Choudhury (2020), completed concurrently with our own, was the first review to focus specifically on the shortcomings of data for mental health research. Our study affirms the findings of Chancellor and De Choudhury (2020), using an expanded pool of literature that more acutely focuses on language found in social media data. To this end, we construct a new open-source directory of mental health datasets, annotated using a standardized schema that not only enables researchers to identify relevant datasets, but also to identify accessible datasets. We draw upon this resource to offer nuanced recommendations regarding future dataset curation efforts.

Data
To generate evidence-based recommendations regarding mental health dataset curation, we require knowledge of the extant data landscape. Unlike some computational fields, which enjoy a surplus of well-defined and uniformly-adopted benchmark datasets, mental health researchers have thus far relied on a decentralized medley of resources. This decentralization, spurred in part by the variable presentations of psychiatric conditions and in part by the sensitive nature of mental health data, requires us to compile a new database of literature. In this section, we detail our literature search, establish inclusion/exclusion criteria, and define a list of dataset attributes to analyze.

Dataset Identification
Datasets were sourced using a breadth-focused literature search. After including data sources from the three aforementioned systematic reviews (Guntuku et al., 2017; Wongkoblap et al., 2017; Chancellor and De Choudhury, 2020), we searched for literature lying primarily at the intersection of the natural language processing (NLP) and mental health communities. We sought peer-reviewed studies published between January 2012 and December 2019 in relevant conferences (e.g. NAACL, EMNLP, ACL, COLING), workshops (e.g. CLPsych, LOUHI), and health-focused journals (e.g. JMIR, PNAS, BMJ).
We searched Google Scholar, ArXiv, and PubMed to identify additional candidate articles. We used two search term structures: 1) (mental health | DISORDER) + (social | electronic) + media, and 2) (machine learning | prediction | inference | detection) + (mental health | DISORDER), where '|' indicates a logical or and DISORDER was replaced by one of 13 mental health keywords. Additional literature was identified using snowball sampling from the citations of these papers. To moderately restrict the scope of this work, computational research regarding neurodegenerative disorders (e.g. Dementia, Parkinson's Disease) was ignored.
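To make the query expansion concrete, the sketch below enumerates the Cartesian product implied by the two templates; the disorder keywords shown are an illustrative subset, not our full list of 13.

```python
from itertools import product

# Illustrative subset of the 13 disorder keywords (not the full set).
DISORDERS = ["depression", "anxiety", "ptsd", "schizophrenia"]

# Template 1: (mental health | DISORDER) + (social | electronic) + media
template_1 = [
    f"{condition} {medium} media"
    for condition, medium in product(["mental health"] + DISORDERS,
                                     ["social", "electronic"])
]

# Template 2: (machine learning | prediction | inference | detection)
#             + (mental health | DISORDER)
template_2 = [
    f"{method} {condition}"
    for method, condition in product(
        ["machine learning", "prediction", "inference", "detection"],
        ["mental health"] + DISORDERS)
]

queries = template_1 + template_2
print(len(queries), "candidate queries")  # e.g. "mental health social media"
```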

Selection Criteria
To enhance parity amongst datasets considered in our meta-analysis, we require datasets found within the literature search to meet three additional criteria. While excluded from subsequent analysis, datasets that do not meet these criteria are maintained with complete annotations in the aforementioned digital directory. In future work, we will expand our scope of analysis to reflect the multi-faceted computational approaches used by the research community to understand mental health.
1. Datasets must contain non-clinical electronic media (e.g. social media, SMS, online forums, search query text).
2. Datasets must contain written language (i.e. text) within each unit of data.
3. Datasets must contain a dependent variable that captures or proxies a psychiatric condition listed in the DSM-5 (APA, 2013).
Our first criterion excludes research that examines electronic health records or digitally-transcribed interviews (Gratch et al., 2014; Holderness et al., 2019). Our second criterion excludes research that, for example, primarily analyzes search query volume or mobile activity traces (Ayers et al., 2013; Renn et al., 2018). It also excludes research based on speech data (Iter et al., 2018). Our third criterion excludes research in which annotations are only loosely associated with their stated mental health condition. For instance, we filter out research that seeks to identify diagnosis dates in self-disclosure statements (MacAvaney et al., 2018), in addition to research that proposes using sentiment as a proxy for mental illness (Davcheva et al., 2019). This last criterion also inherently excludes datasets that lack annotation of mental health status altogether (e.g. data dumps of online mental health support platforms and text-message counseling services) (Loveys et al., 2018; Demasi et al., 2019).
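Viewed operationally, these criteria form a simple conjunctive filter over directory entries. A minimal sketch follows, with hypothetical field names; the directory itself does not prescribe these names.

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    """Hypothetical simplified record from the directory."""
    is_nonclinical_electronic_media: bool  # criterion 1
    contains_written_text: bool            # criterion 2
    has_dsm5_dependent_variable: bool      # criterion 3

def meets_criteria(record: DatasetRecord) -> bool:
    # All three criteria must hold for inclusion in the meta-analysis.
    return (record.is_nonclinical_electronic_media
            and record.contains_written_text
            and record.has_dsm5_dependent_variable)

# e.g. a speech-only corpus fails criterion 2 and is excluded
print(meets_criteria(DatasetRecord(True, False, True)))  # False
```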

Annotation Schema
We develop a high-level schema to code properties of each dataset. In addition to standard reference information (i.e. Title, Year Published, Authors), we note the following characteristics:
• Platforms: Electronic media source (e.g. Twitter, SMS)
• Tasks: The mental health disorders included as dependent variables (e.g. depression, suicidal ideation, PTSD)
• Annotation Method: Method for defining and annotating mental health variables (e.g. regular expressions, community participation/affiliation, clinical diagnosis)
• Annotation Level: Resolution at which ground-truth annotations are made (e.g. individual, document, conversation)
• Size: Number of data points at each annotation resolution for each task class
• Language: The primary language of text in the dataset
• Data Availability: Whether the dataset can be shared and, if so, the mechanism by which it may be accessed (e.g. data usage agreement, reproducible via API, distribution prohibited by collection agreement)
If a characteristic is not clear from a dataset's associated literature, we leave the characteristic blank; missing data points are denoted where applicable. While we simplify these annotations for a standardized analysis (e.g. the different psychiatric batteries used to annotate depression in individuals, such as the PHQ-9 and CES-D, are simplified as "Survey (Clinical)"), we maintain the full specifics in the digital directory.
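To illustrate the schema, a single directory entry might be encoded as follows; all field values here are hypothetical and do not describe a real dataset in the directory.

```python
# Hypothetical directory entry following the annotation schema;
# values are illustrative, not drawn from an actual dataset.
example_entry = {
    "title": "Example Depression Dataset",
    "year_published": 2018,
    "authors": ["A. Author", "B. Author"],
    "platforms": ["Twitter"],
    "tasks": ["Depression"],
    "annotation_method": "Survey (Clinical)",  # simplified from e.g. PHQ-9
    "annotation_level": "Individual",
    "size": {"individuals": 1500, "documents": None},  # None = not reported
    "language": "English",
    "availability": "DUA",  # data usage agreement required (see appendix)
}
```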

Analysis
All datasets known to be available for distribution are listed with annotations in the appendix, while the remaining datasets can be found in our digital directory.
Platforms. We identified 20 unique electronic media platforms across the 102 datasets. Twitter (47 datasets) and Reddit (22 datasets) were the most widely studied platforms. YouTube, Facebook, and Instagram were relatively underutilized for mental health research, each appearing fewer than ten times in our analysis, despite being the three most widely adopted social media platforms globally (Perrin and Anderson, 2019). We expect our focus on NLP to moderate the presence of YouTube- and Instagram-based datasets, though not entirely, given that both platforms offer expansive text fields (i.e. comments, tags) in addition to their primary content of video and images (Chancellor et al., 2016a; Choi et al., 2016). It is more likely that use of these platforms (and Facebook) for research is hindered by increasingly stringent privacy policies and ethical concerns (Panger, 2016; Benton et al., 2017).
Tasks. We identified 36 unique mental health related modeling tasks across the 102 datasets. While the majority of tasks appeared in only a single dataset, a few tasks were considered quite frequently. Depression (42 datasets), suicidal ideation (26 datasets), and eating disorders (11 datasets) were the most common psychiatric conditions examined. Anxiety, PTSD, self-harm, bipolar disorder, and schizophrenia were also prominently featured conditions, each found within at least four unique datasets. A handful of studies sought to characterize finer-grained attributes associated with higher-level psychiatric conditions (e.g. symptoms of depression, stress events and stressor subjects) (Mowery et al., 2015; Lin et al., 2016). The dearth of anxiety-specific datasets was somewhat surprising given the condition's prevalence and the abundance of psychometric batteries for assessing anxiety (Cougle et al., 2009; Antony and Barlow, 2020). That said, generalized anxiety disorder (GAD) accounts for only a small proportion of the overall prevalence of anxiety disorders (Bandelow and Michaelis, 2015), and many other types of anxiety disorders (e.g. social anxiety, PTSD, OCD) were typically treated as independent conditions (Coppersmith et al., 2015a; De Choudhury et al., 2016).
Annotation. We identified 24 unique annotation mechanisms. It was common for several annotation mechanisms to be used jointly to increase the precision of the defined task classes and/or evaluate the reliability of distantly supervised labeling processes. For example, some form of regular expression matching was used to construct 43 datasets, with 23 of these including manual annotations as well. Community participation/affiliation (24 datasets), clinical surveys (22 datasets), and platform activity (3 datasets) were also common annotation mechanisms. The majority of datasets contained annotations made at the individual level (63 datasets), with the rest containing annotations made at the document level (40 datasets).
Size. Of the 63 datasets with individual-level annotations, 23 associated articles reported the number of documents and 62 noted the number of individuals available. Of the 40 datasets with document-level annotations, 37 associated articles noted the number of documents and 12 noted the number of unique individuals. The distribution of dataset sizes was primarily right-skewed.
One concerning trend that emerged across the datasets was the presence of a relatively low number of unique individuals. Indeed, these small sample sizes may further inhibit model generalization from platforms that are already demographically skewed (Smith and Anderson, 2018). The largest datasets, which present the strongest opportunity to mitigate the issues posed by poorly representative online populations, tend to leverage the noisiest annotation mechanisms. For example, datasets that define a mainstream online community as a control group may expect to find that approximately 1 in 20 of the labeled individuals are actually living with mental health conditions such as depression (Wolohan et al., 2018), while regular expressions may fail to distinguish between genuine and non-genuine disclosures of a mental health disorder up to 10% of the time (Cohan et al., 2018).
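To see why regular-expression labeling scales well yet remains noisy, consider the minimal self-disclosure pattern sketched below; the pattern itself is illustrative, and actual studies use far richer pattern sets, often paired with manual review.

```python
import re

# Illustrative self-reported diagnosis pattern; real labeling pipelines
# combine many such patterns with manual verification.
DIAGNOSIS_PATTERN = re.compile(
    r"\bi (?:was|have been|am) diagnosed with (depression|anxiety|ptsd)\b",
    re.IGNORECASE,
)

genuine = "I was diagnosed with depression last year."
non_genuine = "Imagine if I was diagnosed with depression, lol."

# Both strings match: the pattern cannot separate genuine disclosures
# from hypothetical or sarcastic ones, one source of the label noise
# discussed above.
print(bool(DIAGNOSIS_PATTERN.search(genuine)))      # True
print(bool(DIAGNOSIS_PATTERN.search(non_genuine)))  # True
```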
Primary Language. Six primary languages were found amongst the 102 datasets: English (85 datasets), Chinese (10 datasets), Japanese (4 datasets), Korean (2 datasets), Spanish (1 dataset), and Portuguese (1 dataset). This is not to say that some of the datasets do not include other languages, but rather that the predominant language found in the datasets occurs with this distribution. While an overwhelming focus on English data is a theme throughout the NLP community, it is a specific concern in this domain, where culture often influences the presentation of mental health disorders (Loveys et al., 2018).
Availability. We were able to identify the availability of only 48 of the 102 unique datasets in our literature search. Of these 48 datasets, 13 were known not to be available for distribution, generally due to limitations defined in the original collection agreement or removal from the public record (Park et al., 2012;Schwartz et al., 2014). The remaining 35 datasets were available via the following distribution mechanisms: 18 may be reproduced using an API and instructions provided within the associated article, 12 require a signed data usage agreement and/or IRB approval, 3 are available without restriction, and 2 may be retrieved directly from the author(s) with permission. Of the 22 datasets that used clinically-derived annotations (e.g. mental health battery, medical history), 7 were unavailable for distribution due to terms of the original data collection process and 1 was removed from the public record. The remaining 14 had unknown availability.

Discussion
In this study, we introduced and analyzed a standardized directory of social media datasets used by computational scientists to model mental health phenomena. In doing so, we have provided a valuable resource poised to help researchers quickly identify new datasets that support novel research. Moreover, we have provided evidence that affirms conclusions from Chancellor and De Choudhury (2020) and may further encourage researchers to rectify existing gaps in the data landscape. Based on this evidence, we will now discuss potential areas of improvement within the field.
Unifying Task Definitions. In just 102 datasets, we identified 24 unique annotation mechanisms used to label over 35 types of mental health phenomena. This total represents a conservative estimate given that nominally equivalent annotation procedures often varied non-trivially between datasets (e.g. PHQ-9 vs. CES-D assessments, affiliations based on Twitter followers vs. engagement with a subreddit) (Faravelli et al., 1986;Pirina and Çöltekin, 2018). Minor discrepancies in task definition reflect the heterogeneity of how several mental health conditions manifest, but also introduce difficulty contextualizing results between different studies. Moreover, many of these definitions may still fall short of capturing the nuances of mental health disorders (Arseniev-Koehler et al., 2018). As researchers look to transition computational models into the clinical setting, it is imperative they have access to standardized benchmarks that inform interpretation of predictive results in a consistent manner (Norgeot et al., 2020).
Sharing Sensitive Data. Most existing mental health datasets rely on some form of self-reporting or distinctive behavior to assign individuals into task groups, but admittedly fail to meet ideal ground truth standards. The clinically-annotated datasets that do exist are either proprietary or do not provide a clear mechanism for inquiring about availability. The dearth of large, shareable datasets based on actual clinical diagnoses and medical ground truth is problematic given recent research that calls into question the validity of proxy-based mental health annotations (Ernala et al., 2019; Harrigian et al., 2020). By leveraging privacy-preserving technology (e.g. blockchain, differential privacy) to share patient-generated data, researchers may ultimately be able to train more robust computational models (Elmisery and Fu, 2010; Zhu et al., 2016; Dwivedi et al., 2019). In lieu of implementing complicated technical approaches to preserve the privacy of human subjects within mental health data, researchers may instead consider establishing secure computational environments that enable collaboration amongst authenticated users (Boebert et al., 1994; Rush et al., 2019).
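As one concrete instance of the privacy-preserving direction, a data steward could release aggregate cohort statistics under differential privacy rather than sharing raw posts. The sketch below applies the standard Laplace mechanism to a counting query; the epsilon value and the query are illustrative assumptions, not part of any dataset discussed here.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has L1 sensitivity 1 (adding or removing one
    individual changes the count by at most 1), so Laplace noise
    with scale 1/epsilon suffices.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# e.g. share a noisy count of users matching a depression cohort definition
print(laplace_count(1500, epsilon=0.5))
```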
Addressing Bias. There remains more to be done to ensure models trained using these datasets perform consistently irrespective of population.
Several studies in our review attempted to leverage demographically-matched or activity-based control groups as a comparison to individuals living with a mental health condition (Coppersmith et al., 2015b;Cohan et al., 2018). A recent article found discrepancies between the prevalence of depression and PTSD as measured by the Centers for Disease Control and Prevention and as estimated using a model trained to detect the two conditions (Amir et al., 2019b). While the study posits reasons for the difference, it is unable to confirm any causal relationship.
More recently, Aguirre et al. (2021) found evidence of demographic (gender and racial/ethnic) bias within datasets from Coppersmith et al. (2014a, 2015c) that can create fairness issues in downstream tasks. They found poor representation and strong group imbalance in these datasets; however, simple changes in dataset size and balance alone could not fully account for performance disparities between groups. Indeed, common signs of depression recognized in prior linguistic analyses (e.g. differences in distributions for some categories of LIWC) were found not to be equally informative for all demographics. Thus, while performance disparities between demographic groups may certainly arise due to poor representation at training time, disparities may also arise due to an ill-founded assumption that mental health outcomes for all groups can be treated equivalently (Kessler et al., 2003; Shah et al., 2019). Either way, there exists a need to rethink dataset curation and model evaluation so that traditionally underrepresented groups are not further hindered from receiving adequate mental health care. This all said, the presence of downstream bias in mental health models is admittedly difficult to define and even more difficult to fully eliminate (Gonen and Goldberg, 2019; Blodgett et al., 2020).
Nonetheless, addressing the lack of demographically representative sampling described above would be a valuable starting point. Increasingly accurate demographic inference tools may aid in constructing datasets with demographically representative cohorts (Huang and Carley, 2019; Wood-Doughty et al., 2020). Researchers may also consider expanding the diversity of languages in their datasets to account for variation in mental health presentation that arises due to cultural differences (Loveys et al., 2018).
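A practical first step toward auditing the disparities discussed above is stratified evaluation: computing the same performance metric separately for each demographic group. The sketch below computes per-group recall for a hypothetical binary classifier; the group labels and data are invented for illustration.

```python
from collections import defaultdict

def recall_by_group(y_true, y_pred, groups):
    """Compute per-group recall for a binary mental health classifier.

    y_true, y_pred: sequences of 0/1 labels; groups: a demographic
    label per individual. Large gaps between groups signal potential
    downstream bias worth investigating.
    """
    tp, pos = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            pos[group] += 1
            tp[group] += int(pred == 1)
    return {g: tp[g] / pos[g] for g in pos if pos[g] > 0}

# Hypothetical toy example
print(recall_by_group([1, 1, 1, 0], [1, 0, 1, 0], ["A", "A", "B", "B"]))
# {'A': 0.5, 'B': 1.0}
```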

A Available Datasets
Ultimately, we identified 35 unique mental health datasets that were available for distribution. A subset of annotations for these datasets, along with original reference information, can be found in Table 1 (see next page). We categorize dataset availability using four distinct distribution mechanisms.
• DUA: The dataset requires researchers to sign a data usage agreement that outlines the terms and conditions by which the dataset may be analyzed; in some cases, this also requires institutional authorization and oversight (e.g. IRB approval)
• API: The dataset may be reproduced (with a reasonable degree of effort) using instructions provided in the dataset's primary article and access to a public-facing application programming interface (API)
• AUTH: The dataset may be accessed by directly contacting the original author(s)
• FREE: The dataset is hosted on a public-facing server, accessible by all without any additional restrictions

Of the datasets that were available for distribution via one of the above mechanisms, we noted the following 27 unique mental health conditions/predictive tasks:
• Attention Deficit Hyperactivity Disorder (ADHD)
• Alcoholism (ALC)
• Anxiety (ANX)
• Social Anxiety (ANXS)
• Asperger's (ASP)
• Autism (AUT)
• Bipolar Disorder (BI)