RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses

Self-reported diagnosis statements have been widely employed in studying language related to mental health in social media. However, existing research has largely ignored the temporality of mental health diagnoses. In this work, we introduce RSDD-Time: a new dataset of 598 manually annotated self-reported depression diagnosis posts from Reddit that include temporal information about the diagnosis. Annotations include whether a mental health condition is present and how recently the diagnosis happened. Furthermore, we include exact temporal spans that relate to the date of diagnosis. This information is valuable for various computational methods to examine mental health through social media because one’s mental health state is not static. We also test several baseline classification and extraction approaches, which suggest that extracting temporal information from self-reported diagnosis statements is challenging.


Introduction
Researchers have long sought to identify early warning signs of mental health conditions to allow for more effective treatment (Feightner and Worrall, 1990). Recently, social media data has been utilized as a lens to study mental health (Coppersmith et al., 2017). Data from social media users who are identified as having various mental health conditions can be analyzed to study common language patterns that indicate the condition; language use could give subtle indications of a person's wellbeing, allowing the identification of at-risk users. Once identified, users could be provided with relevant resources and support.
While social media offers a huge amount of data, acquiring manually-labeled data relevant to mental health conditions is both expensive and not scalable. However, a large amount of labeled data is crucial for classification and large-scale analysis. To alleviate this problem, NLP researchers in mental health have used unsupervised heuristics to automatically label data based on self-reported diagnosis statements such as "I have been diagnosed with depression" (De Choudhury et al., 2013; Coppersmith et al., 2014a; Yates et al., 2017).
A binary status of a user's mental health condition does not tell a complete story, however. People's mental state changes over time (Wilkinson and Pickett, 2010), so the assumption that language characteristics found in a person's historical social media posts reflect their current state is invalid. For example, the social media language of an adult diagnosed with depression in early adolescence might no longer reflect any depression. Although the extraction of temporal information has been well studied in the clinical domain (Lin et al., 2016; Dligach et al., 2017), it has remained largely unexplored in the mental health domain. Given the specific language of self-reported diagnosis posts and the volatility of mental conditions over time, the time of diagnosis provides critical signals for examining mental health through language.
To address this shortcoming of available datasets, we introduce RSDD-Time: a dataset of temporally annotated self-reported diagnosis statements, based on the Reddit Self-Reported Depression Diagnosis (RSDD) dataset (Yates et al., 2017). RSDD-Time includes 598 diagnosis statements that are manually annotated to include pertinent temporal information. In particular, we identify whether the condition is current, meaning that it is apparently present according to the self-reported diagnosis post. Next, we identify how recently a particular diagnosis occurred. We refer to these as condition state and diagnosis recency, respectively. Furthermore, we identify the time expressions that relate to the diagnosis, when provided.
In summary, our contributions are: (i) We explain the necessity of temporal considerations when working with self-reported diagnoses. (ii) We release a dataset of annotations for 598 self-reported depression diagnoses. (iii) We provide and analyze baseline classification and extraction results.
Related work

Public social media has become a lens through which mental health can be studied, as it provides a public narration of user activities and behaviors (Conway and O'Connor, 2016). Understanding and identifying mental health conditions in social media (e.g., Twitter and Reddit) has been widely studied (De Choudhury et al., 2013; Coppersmith et al., 2014b; De Choudhury and De, 2014; Mitchell et al., 2015; Gkotsis et al., 2016; Yates et al., 2017). To obtain ground truth knowledge of mental health conditions, researchers have used crowdsourced surveys and heuristics such as self-disclosure of a diagnosis (De Choudhury et al., 2013; Tsugawa et al., 2015). The latter approach uses high-precision patterns such as "I was diagnosed with depression." Only statements claiming an actual diagnosis are considered, because people sometimes use phrases such as "I am depressed" casually. In these works, individuals self-reporting a depression diagnosis are presumed to be depressed. Although the automated approaches have yielded far more users with depression than user surveys (tens of thousands, rather than hundreds), there is no indication of whether or not the diagnosis was recent, or whether the condition is still present. In this work, we address this by presenting manual annotations of nearly 600 self-reported diagnosis posts. This dataset is valuable because it allows researchers to train and test systems that automatically determine diagnosis recency and condition state information.
Annotating diagnosis posts

We developed an annotation scheme for the temporal aspects of diagnosis statements and apply it to a set of 598 diagnosis posts randomly sampled from the Reddit Self-Reported Depression Diagnosis (RSDD) dataset (Yates et al., 2017). In the annotation environment, the diagnosis match is presented with a context of 300 characters on either side. A window of 150 characters on either side was too narrow, while presenting the whole post as context made annotation too slow and rarely provided additional information.
Annotation scheme

Two kinds of text spans are annotated: diagnoses (e.g., "I was diagnosed") and time expressions that are relevant to the diagnosis (e.g., "two years ago"). On diagnosis spans, the following attributes are marked:

• Diagnosis recency determines when the diagnosis occurred (not the onset of the condition). Six categorical labels are used: very recently (up to 2 months ago), more than 2 months but up to 1 year ago, more than 1 year but up to 3 years ago, more than 3 years ago, unspecified (when there is no indication), and unspecified but not recent (when the context indicates that the diagnosis happened in the past, yet there is insufficient information to assign it to one of the first four labels).
• For condition state, the annotator assesses the context for indications of whether the diagnosed condition is still current or past. The latter includes cases where the condition is reported to be fully under control through medication. We use a five-point scale (current, probably current, unknown, probably past, and past). This can be mapped to a three-point scale for coarse-grained prediction (i.e., moving the probable categories to the center or the extremes).
• When a diagnosis is presented as uncertain or incorrect, we mark it as diagnosis in doubt. This can be because the poster puts the diagnosis into question, or because it was later revised (e.g., "I was diagnosed with depression before they changed it to ADHD").
• Occasionally, incorrect diagnosis matches are found in RSDD. These are marked as false positive. This includes diagnoses of conditions other than depression, self-diagnoses, and matches that occur in block quotes from other posts. False positive posts are not included in the analyses below.
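The diagnosis recency buckets above can be sketched as a mapping from elapsed time to a label. A minimal sketch (the function name and the 61-day approximation of "2 months" are our assumptions, not part of the released tools):

```python
def recency_label(days_since_diagnosis):
    """Map days elapsed since diagnosis to a recency bucket.

    None means the post gives no indication of the diagnosis date.
    The "unspecified but not recent" label depends on contextual cues
    and cannot be derived from elapsed time alone, so it is omitted.
    """
    if days_since_diagnosis is None:
        return "unspecified"
    if days_since_diagnosis <= 61:        # up to 2 months
        return "very recently"
    if days_since_diagnosis <= 365:       # >2 months, up to 1 year
        return "2 months to 1 year"
    if days_since_diagnosis <= 3 * 365:   # >1 year, up to 3 years
        return "1 to 3 years"
    return "more than 3 years"
```

In practice the elapsed time must first be derived from a time expression and the post date, which is what the span annotations below support.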
Time expressions indicating the time of diagnosis are marked similarly to the TIMEX3 specification (Pustejovsky et al., 2005), with additional support for ages, years in school, and references to other temporal anchors. Because of these additions, we also annotate prepositions pertaining to the temporal expression when present (e.g., 'at 14', 'in 2004'). Each span also carries an indication of how its associated diagnosis can be assigned to one of the diagnosis recency labels. Explicit time expressions allow immediate assignment given the post date (e.g., yesterday, last August, in 2006). If the recency can be inferred assuming a poster's age at post time is known, it is inferable from age (e.g., at 17, in high school). A poster's age could be established using mentions by the author, or estimated with automatic age prediction.
Inter-annotator agreement

After an initial annotation round with 4 annotators that allowed for the scheme and guidelines to be improved, the entire dataset was annotated by a total of 6 annotators, with each post being at least double annotated; disagreements were resolved by a third annotator where necessary. We report pairwise inter-annotator agreement in Table 1. Cohen's kappa is linearly weighted for the ordinal categories (condition state and diagnosis recency).
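Linearly weighted Cohen's kappa can be computed with scikit-learn; a sketch on made-up annotator labels (not the actual annotations):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical condition-state labels from two annotators, on the
# five-point ordinal scale encoded as integers 0 (past) .. 4 (current).
annotator_a = [4, 4, 2, 1, 0, 3, 4, 2]
annotator_b = [4, 3, 2, 0, 0, 3, 4, 1]

# weights="linear" penalizes disagreements by their ordinal distance,
# so confusing "current" with "probably current" costs less than
# confusing "current" with "past".
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="linear")
```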
Agreement on false positives and doubtful diagnoses is low. For future analyses that focus on detecting potential misdiagnoses, further study would be required to improve agreement, but it is tangential to the focus on temporal analysis in this study.
Estimating the state of a condition is inherently ambiguous, but agreement is moderate at 0.41 weighted kappa. The five-point scale can be backed off to a three-point scale, e.g., by collapsing the three middle categories into don't know; pairwise percent agreement then improves from 0.52 to 0.68. The recency of a diagnosis can be established with substantial agreement (κ = 0.64). Time expression attributes can be annotated with almost perfect agreement.

Availability

The annotation data and annotation guidelines are available at https://github.com/Georgetown-IR-Lab/RSDD-Time. The raw post text is available from the RSDD dataset via a data usage agreement (details available at http://ir.cs.georgetown.edu/resources/rsdd.html).

Corpus analysis
Counts for each attribute are presented in Table 2. Figure 1 shows the incidence and interaction between condition state and diagnosis recency in our dataset. About half the cases have a condition state that is current, but interestingly, there are also many cases (55) where the condition relates (at least probably) to the past. There is also a large number of cases (225) where it is not clear from the post whether the condition is current. This shows that many self-reported diagnosis statements may not reflect a current condition, which could make a dataset noisy, depending on the objective.

For diagnosis recency, we observe that the majority of diagnosis times are either unspecified or happened in the unspecified past. For 245 cases, however, the diagnosis recency can be inferred from the post, usually because there is an explicit time expression (59% of cases) or by inference from age (41%).

Next, we investigate the interaction between condition state and diagnosis recency. We observe that the majority of past conditions (rightmost two columns) are associated with a diagnosis recency of more than 3 years ago or of an unspecified past. On the other hand, many current conditions (leftmost column) have an unspecified diagnosis time. This is expected, because individuals who specifically indicate that their condition is not current also tend to specify when they were first diagnosed, whereas individuals with current conditions may not mention their time of diagnosis.
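The kind of interaction analysis described above amounts to cross-tabulating the two attributes; a sketch with pandas on hypothetical per-post annotations (not the real counts in Table 2):

```python
import pandas as pd

# Hypothetical per-post annotations, one row per diagnosis post.
posts = pd.DataFrame({
    "condition_state": ["current", "current", "unknown",
                        "past", "current", "past"],
    "diagnosis_recency": ["unspecified", "very recently", "unspecified",
                          "more than 3 years", "unspecified",
                          "unspecified past"],
})

# Cross-tabulating surfaces interactions such as past conditions
# pairing with old or unspecified-past diagnosis times.
table = pd.crosstab(posts["condition_state"], posts["diagnosis_recency"])
```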

Experiments
To gain a better understanding of the data and provide baselines for future work to automatically perform this annotation, we explore methods for attribute classification for diagnosis recency and condition state, and rule-based diagnosis time extraction. We split the data into a training dataset (399 posts) and a testing dataset (199 posts). We make this train/test split available for future work in the data release. For our experiments, we then disregard posts that are labeled as false positive (yielding 385 posts for training and 188 for testing), and we only consider text in the context window with which the annotator was presented.

Diagnosis recency and condition state classification
We train several models to classify diagnosis recency and condition state. In each, we use basic bag-of-character-n-grams features. Character n-grams of length 2-5 (inclusive) are considered and weighted using tf-idf. For labels, we use the combined classes described in Section 2. To account for class imbalance, samples are weighted by the inverse frequency of their category in the training set. We compare three models: logistic regression, a linear-kernel Support Vector Machine (SVM), and Gradient-Boosted ensemble Trees (GBT) (Chen and Guestrin, 2016). The logistic regression and SVM models are ℓ2 normalized, and the GBT models are trained with a maximum tree depth of 3 to avoid overfitting.
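The feature setup can be reproduced with scikit-learn; a minimal sketch on toy data (the texts and labels are invented stand-ins, and `class_weight="balanced"` approximates the inverse-frequency sample weighting):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples standing in for diagnosis-post context windows.
texts = ["I was diagnosed with depression last week",
         "diagnosed years ago, I'm fine now",
         "just got my diagnosis yesterday",
         "my diagnosis back in high school"]
labels = ["current", "past", "current", "past"]

# Character n-grams of length 2-5, tf-idf weighted, feeding a
# class-weighted logistic regression.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 5)),
    LogisticRegression(class_weight="balanced"),
)
model.fit(texts, labels)
```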
We present results in Table 3. The GBT method performs best for diagnosis recency classification, and logistic regression performs best for condition state classification. This difference may be due to class skew: the condition state data is more skewed, with current and don't know accounting for almost 80% of the labels.

Time expression extraction
To automatically extract time expressions, we use the rule-based SUTime library (Chang and Manning, 2012). Because diagnoses often include an age or year in school rather than an absolute time, we added rules specifically to capture these time expressions. The rules were manually generated by examining the training data, and will be released alongside the annotations.
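SUTime rules are written in Stanford's TokensRegex format; a rough Python-regex analogue of an age rule (purely illustrative, not the released rules) is:

```python
import re

# Matches age anchors such as "at 14", "at age 21", "when I was 17".
AGE_PATTERN = re.compile(
    r"\b(?:at(?:\s+age)?|when\s+I\s+was)\s+(\d{1,2})\b",
    re.IGNORECASE)

def find_age_expressions(text):
    """Return (matched span, age) pairs for age-based temporal anchors."""
    return [(m.group(0), int(m.group(1)))
            for m in AGE_PATTERN.finditer(text)]
```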
RSDD-Time temporal expression annotations are only concerned with time expressions that relate to the diagnosis, whereas SUTime extracts all temporal expressions in a given text. We use a simple heuristic to resolve this issue: choose the time expression closest to the post's diagnosis by character distance. In the case of a tie, the heuristic arbitrarily selects the leftmost expression. This heuristic improves precision by eliminating many unnecessary temporal expressions, but has the potential to reduce recall by eliminating some correct expressions that are not the closest to the diagnosis.
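The closest-expression heuristic is a one-liner over character offsets; a sketch (function name and offset representation are our assumptions):

```python
def closest_expression(diagnosis_start, expressions):
    """Pick the time expression nearest the diagnosis match.

    expressions: list of (start_offset, text) pairs found in the
    context window. Ties in character distance are broken toward the
    leftmost (smallest-offset) expression.
    """
    if not expressions:
        return None
    return min(expressions,
               key=lambda e: (abs(e[0] - diagnosis_start), e[0]))[1]
```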
Results for temporal extraction are given in Table 4. Notice that the custom age rules greatly improve the recall of the system. The experiment also shows that the closest heuristic improves precision at the expense of recall (both with and without the age rules). Overall, the best results in terms of F1 score are achieved using both the closest heuristic and the age rules. A more sophisticated algorithm could be developed to increase the candidate expression set (to improve recall) and to better predict which temporal expressions likely correspond to the diagnosis (to improve precision).

Table 4: Results using SUTime, with additional rules for predicting age expressions and when limiting the candidate expression set using the closest heuristic.

Conclusion
In this paper, we explained the importance of temporal considerations when working with language related to mental health conditions. We introduced RSDD-Time, a novel dataset of manually annotated self-reported depression diagnosis posts from Reddit. Our dataset includes extensive temporal information about the diagnosis, including when the diagnosis occurred, whether the condition is still current, and exact temporal spans. Using RSDD-Time, we applied rule-based and machine learning methods to automatically extract these temporal cues and predict temporal aspects of a diagnosis. While the results are encouraging, the experiments and dataset leave much room for further exploration.