Qualitative Analysis of Depression Models by Demographics

Models for identifying depression using social media text exhibit biases towards different gender and racial/ethnic groups. Representation and balance of groups within the dataset contribute to these biases, but differences in content and social media use may further explain them. We present an analysis of the content of social media posts from different demographic groups. Our analysis shows that there are content differences between depression and control subgroups across demographic groups, and that temporal topics and demographic-specific topics are correlated with downstream depression model error. We discuss the implications of our work for creating future datasets, as well as for designing and training models for mental health.


Introduction
Models of mental health trained on social media data exhibit biases in downstream performance on different gender and racial/ethnic demographic groups (Aguirre et al., 2021). An important factor is that minority groups (People of Color in general) are underrepresented in datasets, and models thus underperform on them compared to majority groups. While the size and balance of datasets contribute to the gap in performance, there may also be differences in the manner in which depressive behavior is exhibited across demographic groups, creating problems in generalization.
Differences in depression prevalence across demographics have long been known (Brody et al., 2018), although there is no clear explanation for why this is the case (Hasin et al., 2018). On social media, demographic-based mental health analyses have used matched control samples (Dos Reis and Culotta, 2015), which allow for comparison of behaviors across groups (Coppersmith et al., 2014; Amir et al., 2019). These types of analyses have focused on downstream performance of trained models (Aguirre et al., 2021) and how they show differences in depression rates, but there have been no qualitative studies investigating these demographic differences (Chancellor and De Choudhury, 2020; Harrigian et al., 2020b).
Others have used qualitative studies to analyze behaviors and performance of machine learning models in general (Chen et al., 2018). Previous work has analyzed representative sentences (Ettinger, 2020) and hashtags (Sykora et al., 2020), performed thematic analyses using the Linguistic Inquiry and Word Count dictionary (Wolohan et al., 2018), or trained topic models (Harrigian et al., 2020a; Yazdavar et al., 2017; Mitchell et al., 2015).
We propose a qualitative language analysis to reveal what differences occur and how these differences can contribute to downstream performance. What language trends characterize depression, and how do these vary across demographic groups? We use an analysis method similar to Mueller et al. (2021), but instead of training a Latent Dirichlet Allocation (LDA) (Blei et al., 2003) topic model and computing pointwise mutual information to obtain topics related to demographics, we train a Partially-Labeled LDA model (Ramage et al., 2011), which allows us to assign labels for demographic groups as well as depression and control groups, obtaining label-specific topics for our user groups.
We base our analysis on datasets from previous work using Twitter. We train simple text-based models based on previous work on these datasets (Harrigian et al., 2020a;Aguirre et al., 2021). We use a labeled topic model to characterize what content indicates depression and how this content varies by demographic group.
Our analysis shows variations in content between depression and control subgroups across demographic groups; however, most of these differences are due to non-clinical phenomena, e.g. viral content trends such as TV show awards. Further, model error analysis corroborates that temporal trends and non-generalizable topics of demographic groups are correlated with downstream model error. Our qualitative analysis approach can be utilized to analyze language differences across demographics on other datasets and mental health tasks. We discuss the implications of our work for creating new datasets, as well as for designing and training language models for mental health.

Ethical Considerations
Given the sensitive nature of mental health topics and demographics of individuals, additional precautions (based on depression diagnoses (Benton et al., 2017a); gender identity (Larson, 2017); race/ethnicity identity (Wood-Doughty et al., 2020)) were taken during this study. Data sourced from external research groups was retrieved according to each dataset's respective data use policy. For gender labels, due to current limitations on datasets and methods, we consider the folk perception of gender, as described in Larson (2017), and for race/ethnicity labels we use the mutually exclusive categories non-Hispanic White, non-Hispanic Black, non-Hispanic Asian and Hispanic/Latinx, following Wood-Doughty et al. (2020). We acknowledge that both our gender and racial/ethnic categories do not fully capture many individuals' gender and/or race/ethnicity. Additionally, we acknowledge the limitations of the demographic inference methods employed to obtain the demographic labels, which have been raised in multiple previous studies (Mueller et al., 2021; Aguirre et al., 2021). While we carefully consider these issues, we believe the urgency of understanding mental health models (Aguirre et al., 2021) warrants our work, and we hope that our results provide sufficient evidence to justify further study in this area. This research was deemed exempt from review by our Institutional Review Board (IRB) under 45 CFR § 46.104.

Data
We use two datasets for depression identification on Twitter from previous studies: the CLPsych 2015 Shared Task (Coppersmith et al., 2015b), and the multi-disorder multitask learning for mental health dataset (Benton et al., 2017b).
CLPsych. The dataset contains publicly available tweets of individuals, where the diagnosed group was collected by self-report through regular expression matching, e.g. "I was diagnosed with <disorder>". Control individuals were approximated by matching inferred age and gender using tools from the World Well-Being Project (Sap et al., 2014) from a pool of random accounts. While the original dataset covered four conditions, we select the depression users (475) and their matched control users, resulting in 950 individuals.
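The self-report collection heuristic can be illustrated with a minimal regular-expression matcher. The pattern and disorder list below are a simplified sketch for illustration, not the actual expressions used by the shared task:

```python
import re

# Illustrative self-report matcher in the spirit of "I was diagnosed with
# <disorder>"; the real collection patterns are more extensive.
DIAGNOSIS_PATTERN = re.compile(
    r"\bi (?:was|have been|am) diagnosed with "
    r"(depression|ptsd|anxiety|bipolar disorder)\b",
    re.IGNORECASE,
)

def match_self_report(tweet: str):
    """Return the matched disorder name, or None if the tweet is not a self-report."""
    m = DIAGNOSIS_PATTERN.search(tweet)
    return m.group(1).lower() if m else None
```

A matcher this simple will miss paraphrases and can catch quoted or hypothetical diagnoses, which is why such heuristics are typically paired with manual review.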
Multitask. This dataset combines subsets of several datasets (Coppersmith et al., 2015a,b,c). All used the same collection process: self-report through regular expression matching, with control individuals matched on inferred age and gender using the same tool. Additionally, the complete public history of tweets is collected for each individual, as opposed to the latest 3000 tweets in CLPsych, resulting in a larger dataset. We select the depression users (1400) and their matched control users, resulting in 2800 individuals.
While both dataset collection methods are nearly identical, the time period in which the tweets were collected, and the number of tweets and individuals are different for each dataset, likely leading to different types of depression indicators. Note that while there is an overlap between Multitask and CLPsych of 110 individuals, it is a small percentage of both datasets (∼ 4% and ∼ 10% respectively).

Methodology
Demographic Labels. While both datasets utilized gender and age inferences to match control and disorder groups at collection time, these models are now outdated, and labels for race/ethnicity were not made available. We obtain new race/ethnicity and gender labels from the work of Aguirre et al. (2021). Demographic statistics for both datasets are available in Appendix A. Since the racial/ethnic minority groups are extremely underrepresented in the datasets, we combine them to create a Person of Color (PoC) group.
Mental Health Models. We create mental health models for these datasets based on recent work (Harrigian et al., 2020a; Aguirre et al., 2021). Following standard pre-processing procedures, we filter numeric values, username mentions, retweets and URLs from raw tweet text.
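A minimal sketch of this pre-processing step, assuming simple regular expressions (the exact filters used in the cited work may differ):

```python
import re

# Hypothetical helper implementing the filtering described above:
# retweet markers, URLs, @-mentions and numerals are stripped.
RT_RE      = re.compile(r"^rt\b[: ]*", re.IGNORECASE)
URL_RE     = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NUM_RE     = re.compile(r"\b\d+\b")

def preprocess(tweet: str) -> str:
    """Remove retweet prefixes, URLs, mentions and numbers; normalize whitespace."""
    text = RT_RE.sub("", tweet)
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = NUM_RE.sub("", text)
    return " ".join(text.split())
```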

Topic Model Analysis
Model. We use a topic model analysis to identify topic distribution differences between demographic groups. We train on each dataset (separately) a Partially Labeled LDA model (Ramage et al., 2011), which incorporates per-label latent topics into an LDA model. We assign both depression and demographic labels to individuals, with K = 5 topics per label and 20 latent topics not associated with any labels, for a total of 50 topics, following the number of topics from previous work. Intuitively, this has the effect of credit attribution: associating words with either depression, demographic groups, or dataset-latent topics for each individual.
Metrics. To measure topic prevalence between groups, we use the enrichment ($E$) metric from Marlin et al. (2012); Ghassemi et al. (2014). For a topic $c$ and label $y$, $E$ is the mean topic probability over the documents in that group:

$$E_{c,y} = \frac{1}{N_y} \sum_{i:\, y_i = y} q_{ic}$$

where, for each document $i$, $y_i$ is its label, $q_{ic}$ its probability of topic $c$, and $N_y$ the number of documents with label $y$. The metric $E$ has the effect of highlighting topics regardless of topic importance within the group. In order to preserve topic importance, we take the non-normalized average difference in $E$ between the control and depression groups:

$$\Delta_c = E_{c,\text{depression}} - E_{c,\text{control}}$$

where negative values indicate topics most aligned with the control group and positive values topics most aligned with depression.
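These metrics can be sketched in a few lines of NumPy; `q`, `y` and the function names are toy examples (each row of `q` is one individual's topic distribution):

```python
import numpy as np

# Sketch of the enrichment metric E and the difference score Delta.
# q[i, c] is document i's probability of topic c; y[i] is its label.
def enrichment(q: np.ndarray, y: np.ndarray, group: str) -> np.ndarray:
    """Mean topic probability over the documents belonging to `group`."""
    return q[y == group].mean(axis=0)

def delta(q: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Positive -> topic aligned with depression; negative -> control."""
    return enrichment(q, y, "depression") - enrichment(q, y, "control")

q = np.array([[0.8, 0.2],
              [0.6, 0.4],
              [0.1, 0.9],
              [0.3, 0.7]])
y = np.array(["depression", "depression", "control", "control"])
# delta(q, y) -> [0.5, -0.5]: topic 0 aligns with depression, topic 1 with control
```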
Finally, to measure error rate attribution to topics, we use the topic error rate $\hat{E}$ metric from Chen et al. (2018), the share of a topic's probability mass that falls on misclassified individuals:

$$\hat{E}_c = \frac{\sum_i q_{ic} \, \mathbb{1}[\hat{y}_i \neq y_i]}{\sum_i q_{ic}}$$

Processing. In addition to removing numeric values, username mentions, retweets and URLs, we also remove English stopwords, pronouns and emojis in order to create more coherent topics for our annotators. Removing stopwords and pronouns has the potential to erase depression signals, as previous studies have found signals in pronoun usage, and may also suppress voices and languages that do not fit certain norms. A full list of stopwords and pronouns is provided in Appendix B. We excluded from our results topics that did not have any coherent semantic grouping, as annotated by one of the authors and 2 volunteers looking at the top 15 most probable words per topic, obtaining fair multi-annotator agreement (Fleiss' kappa $\kappa = 0.332$). Topics that the majority of annotators selected as coherent were then labeled by one of the authors.

Table 2: Top 5 topics as measured by topic error rate $\hat{E}$. Higher values represent higher prevalence among individuals that mental health models misclassified.
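The topic error rate admits a similarly compact sketch: for each topic, the fraction of its probability mass that falls on misclassified individuals (all values below are toy examples):

```python
import numpy as np

# Sketch of the topic error rate from Chen et al. (2018): each topic's
# probability mass, weighted by whether the model misclassified the document.
def topic_error_rate(q: np.ndarray, errors: np.ndarray) -> np.ndarray:
    """Share of each topic's total mass on misclassified documents."""
    return (q * errors[:, None]).sum(axis=0) / q.sum(axis=0)

q = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
errors = np.array([1, 0, 1])  # 1 = model misclassified this individual
# topic 0: (0.9 + 0.5) / 1.6 = 0.875 ; topic 1: (0.1 + 0.5) / 1.4 ~ 0.429
```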

Analysis
The topic model identified label-specific topics for depression and control. Appendix C shows the topics for both datasets as well as the top 10 most probable words per topic. Some depression topics are reasonable, e.g. mental-health (in both datasets) and social media stats (possibly related to internet statistics and popularity). Similarly, control topics like sports and beauty are active, positive and self-caring topics that are plausibly representative of our control group. However, some topics in both the depression and control groups are not clearly tied to the groups, e.g., for the depression group, topics like One Direction and 5 Seconds of Summer. These might be topics introduced by temporal phenomena impeding model generalization (Harrigian et al., 2020a), rather than representative topics for those labels.

Content Differences
We characterize the difference in content between the depression and control groups for each demographic. Table 1 shows the top and bottom 5 topics most prevalent in the depression subgroup of each demographic category, as measured by ∆ on the Multitask dataset; only topics whose ∆ is statistically significant are shown, as determined by bootstrapping with 1000 iterations and a 95% CI. Some depression topics (One Direction and 5 Seconds of Summer) are not representative across demographics, while reasonable topics, e.g. mental-health, are representative of depression across demographic groups. Additionally, the topics most prevalent in the control subgroups (School and Sports) are the same across all demographics and represent qualities that are not related to depression, showing the robustness of these indicators and the well-formed nature of the control group in the dataset.
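The significance filter described above can be sketched as a percentile bootstrap over individuals; `bootstrap_ci` is a hypothetical helper and the exact resampling scheme is an assumption about the procedure, shown on toy data:

```python
import numpy as np

# Resample individuals with replacement, recompute Delta each time, and keep
# a topic only if the resulting 95% CI excludes zero.
def bootstrap_ci(q, y, n_iter=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = []
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=n)
        qb, yb = q[idx], y[idx]
        # Skip degenerate resamples that lost one of the two groups.
        if {"depression", "control"} <= set(yb):
            deltas.append(qb[yb == "depression"].mean(axis=0)
                          - qb[yb == "control"].mean(axis=0))
    deltas = np.array(deltas)
    lo = np.percentile(deltas, 100 * alpha / 2, axis=0)
    hi = np.percentile(deltas, 100 * (1 - alpha / 2), axis=0)
    return lo, hi  # a topic is significant where lo > 0 or hi < 0
```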
The Body Negative topic, attributed to the Female-White label, is very prevalent in the depression subgroups of both female groups but not in the male subgroups, suggesting that there are differences in online depression language between gender groups.
For Male PoC individuals, the mental-health depression topic is not prevalent in the depression subgroup, while One Direction is. Given that the Male PoC group has the fewest users in the dataset, this suggests that its depression subgroup is not a representative group of individuals for depression and yields spurious topics, confirming prior work on dataset size being a factor in performance differences across demographics (Aguirre et al., 2021).
Further, topics representing non-English languages (Arabic and Spanish) or a minority dialect (AAVE) are more prevalent in the control subgroups of demographics where they are not expected, e.g. Arabic and AAVE in the Female-White group. Perhaps this is evidence of demographic label noise, further underscoring the need to obtain self-reported demographic labels in mental health datasets for more concrete analyses.

Depression Model Errors
We analyze the predictions of our depression models to identify content differences between demographics that are correlated with model errors. Table 2 shows the top 5 topics most prevalent among individuals that were wrongly classified by the models on each dataset. Expanding the results of Section 5.1, we find that topics that are not representative across demographics, e.g. One Direction, are correlated with downstream classification errors of mental health models. This suggests that topics that are prevalent in depression subgroups but unrelated to depression are misleading the models.
Additionally, the majority of topics most prevalent on model errors (e.g. Justin Bieber, One Direction and People's Choice), apart from being unrelated to mental health and not representative across demographics, are influenced by temporal phenomena, e.g. short-term events such as the People's Choice Awards, that stem from the time period in which the dataset was collected. Such topics are not generalizable, corroborating evidence from prior work on the challenges temporal themes pose to model generalization (Harrigian et al., 2020a).
Further, some topics prevalent on model errors are an effect of dataset balance. For example, the topic AAVE is a Female-PoC labeled topic, but it is also very prevalent on model errors, suggesting that there are very few examples of AAVE in the dataset and the mental health model is oversensitive to this language. On the other hand, the topic beauty, labeled as control, is over-represented in the dataset. This suggests that datasets should be balanced based on demographics, following prior work (Aguirre et al., 2021).

Conclusion
We performed a qualitative analysis to find content differences related to mental health across demographic groups. We showed that there are content differences between depression and control subgroups, although most of these differences are due to non-clinical phenomena, e.g. temporal topics. Additionally, we find that dataset size might be a factor in these content differences. Furthermore, model error analysis corroborates that temporal topics and demographic-specific topics are correlated with downstream model error.
Our findings support prior work on the importance of methods that seek to generalize temporal topics (Harrigian et al., 2020a). We also find supporting evidence of the importance of dataset size as well as dataset balance in order to generalize to minority groups (Aguirre et al., 2021). Though in our analysis we only consider one mental health disorder (depression), our methodology was able to generalize across two datasets. This suggests that it is a valid method for qualitative analysis on finding content differences in other mental health datasets. Additionally, while we were limited in our demographic labels by current demographic models and dataset sizes, we showed that our approach is valid across two demographic axes and could be expanded to include other demographic axes (such as age and economic status), and include genders and racial/ethnic groups outside of the ones considered in this work. We hope our work warrants further studies of mental health language differences across more diverse demographic groups yielding more inclusive datasets and research.

B Partially Labeled LDA
We need a procedure to identify topic distribution differences between demographic groups. Prior work has accomplished this by training an LDA topic model and either using pointwise mutual information (PMI) (Mueller et al., 2021) or an enrichment metric E (Marlin et al., 2012; Ghassemi et al., 2014) to measure how distinctive a given topic is of a given demographic group. Instead, we train, on both datasets separately, a Partially Labeled LDA model (Ramage et al., 2011), which incorporates per-label latent topics into an LDA model. Unlike LDA, each document d can only use the topics associated with the set of labels L_d assigned to d, where each label l ∈ L_d is assigned some number of topics K. The model computes the joint likelihood of observed words w, observed labels l and topic assignments z, given available labels Λ and document-topic α and topic-word η priors from a Dirichlet distribution: P(w, l, z | Λ, α, η). We assign both depression and demographic labels to individuals, with K = 5 topics per label and 20 latent topics not associated with any labels, for a total of 50 topics. Intuitively, this has the effect of credit attribution: associating words with either depression, demographic groups, or dataset-latent topics for each individual.
To measure topic differences between groups (RQ1) we use the enrichment ($E$) metric from Marlin et al. (2012); Ghassemi et al. (2014):

$$E_{c,y} = \frac{1}{N_y} \sum_{i:\, y_i = y} q_{ic}$$

where $N_y$ is the number of documents with label $y$. To measure error rate (RQ2) we use the topic error rate $\hat{E}$ metric from Chen et al. (2018):

$$\hat{E}_c = \frac{\sum_i q_{ic} \, \mathbb{1}[\hat{y}_i \neq y_i]}{\sum_i q_{ic}}$$

Additionally, in order to preserve topic importance, we take the non-normalized average difference in $E$ between the control and depression groups ($\Delta$). For each document $i$, with corresponding label $y_i$, topic $c$, and topic probability $q_{ic}$:

$$\Delta_c = E_{c,\text{depression}} - E_{c,\text{control}}$$

where negative values indicate topics most aligned with the control group and positive values topics aligned with depression. In addition to filtering numeric values, username mentions, retweets and URLs, we also filter stopwords, pronouns and emojis to obtain more coherent topics. We excluded from our results topics that did not have any coherent semantic grouping, as annotated by one of the authors looking at the top 10 most probable words per topic.

| followed stats unfollowers bro follower moment question yea hahahaha bus awkward teacher thats la unfollower | 0.0287
Relationships | babe idk ugly rn bae boyfriend wtf mad bored honestly annoying w k tbh kiss | 0.0467

Table 6: Label, topic title, top words and topic importance of CLPsych dataset.