Towards Developing an Annotation Scheme for Depressive Disorder Symptoms: A Preliminary Study using Twitter Data

Major depressive disorder is one of the most burdensome and debilitating diseases in the United States. In this pilot study, we present a new annotation scheme representing depressive symptoms and psycho-social stressors associated with major depressive disorder and report annotator agreement when applying the scheme to Twitter data.


Introduction
Major depressive disorder, one of the most debilitating forms of mental illness, has a lifetime prevalence of 16.2% (Kessler et al., 2003) and a 12-month prevalence of 6.6% (Kessler and Wang, 2009) in the United States. In 2010, depression was the fifth biggest contributor to the United States' disease burden, with only lung cancer, lower back pain, chronic obstructive pulmonary disease, and heart disease responsible for more poor health and disability (US Burden of Disease Collaborators, 2013).
Social media, particularly Twitter, is increasingly recognised as a valuable resource for advancing public health (Ayers et al., 2014; Dredze, 2012), in areas such as understanding population-level health behaviour (Myslín et al., 2013; Hanson et al., 2013), pharmacovigilance (Freifeld et al., 2014; Chary et al., 2013), and infectious disease surveillance (Chew and Eysenbach, 2010; Paul et al., 2014). Twitter's value in the mental health arena, the focus of this paper, is particularly marked, given that it provides access to first-person accounts of user behaviour, activities, thoughts, feelings, and relationships that may be indicative of emotional wellbeing.
The main contribution of this work is the development and testing of an annotation scheme, based on DSM-5 depression criteria (American Psychiatric Association, 2013) and depression screening instruments, designed to capture depressive symptoms in social media data, particularly Twitter. In future work, the annotation scheme described here will be applied to a large corpus of Twitter data and used to train and test Natural Language Processing (NLP) algorithms.
The paper is structured as follows. Section 2 describes related work. Section 3 sets out the methodology used, including a list of semantic categories related to depression and psycho-social stressors derived from the psychology literature, and a description of our annotation process and environment. Section 4 presents the results of our annotation efforts and Section 5 provides commentary on those results.
Huang et al., in a large-scale study of electronic health records, used structured data to identify cohorts of depressed and non-depressed patients and, based on the narrative text component of the patient record, built a regression model capable of predicting depression diagnosis one year in advance (Huang et al., 2014). Pestian et al. showed that an NLP approach based on machine learning performed better than clinicians in distinguishing between suicide notes written by suicide completers and notes elicited from healthy volunteers (Pestian et al., 2010; Pestian et al., 2012). Using machine learning methods, Xuan et al. identified linguistic characteristics (e.g. impoverished syntax and lexical diversity) associated with dementia through an analysis of the work of three British novelists, P.D. James (no evidence of dementia), Agatha Christie (some evidence of dementia), and Iris Murdoch (diagnosed dementia) (Xuan et al., 2011).
More specifically focused on Twitter and depression, De Choudhury et al. describe the creation of a corpus crowdsourced from Twitter users with depression-indicative scores on the Center for Epidemiologic Studies Depression Scale (CES-D) (Radloff, 1977). They then used this corpus to train a classifier whose output, when applied to geocoded Twitter data derived from 50 US states, was shown to correlate with US Centers for Disease Control (CDC) depression data (De Choudhury et al., 2013). Jashinsky et al. used a set of Twitter keywords organised around several themes (e.g. depression symptoms, drug use, suicidal ideation) and identified strong correlations between the frequency of suicide-related tweets (as identified by keywords) and state-level CDC suicide statistics (Jashinsky et al., 2014). Coppersmith et al. identified Twitter users with self-disclosed depression diagnoses ("I was diagnosed with depression") using regular expressions, and discovered that when depressed Twitter users' tweets were compared with a cohort of non-depressed Twitter users' tweets, there were significant differences between the two groups in their expression of anger, use of pronouns, and frequency of negative emotions (Coppersmith et al., 2014).
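To make the regular-expression approach to self-disclosed diagnoses concrete, the following is a minimal sketch in Python. The pattern and the function name `has_self_disclosed_diagnosis` are illustrative assumptions; they are not the expressions used by Coppersmith et al. (2014).

```python
import re

# Hypothetical pattern for first-person diagnosis statements such as
# "I was diagnosed with depression". This is an illustration only, not the
# expression used by Coppersmith et al. (2014).
DIAGNOSIS_PATTERN = re.compile(
    r"\bI\s+(?:was|have\s+been|got)\s+diagnosed\s+with\s+"
    r"(?:\w+\s+){0,3}?depression\b",  # allow up to three modifiers, e.g. "major"
    re.IGNORECASE,
)


def has_self_disclosed_diagnosis(tweet):
    """Return True if the tweet contains a first-person diagnosis statement."""
    return DIAGNOSIS_PATTERN.search(tweet) is not None
```

A pattern like this favours precision over recall: it ignores third-person reports ("She was diagnosed with depression") and casual uses of the word ("This is so depressing").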

Annotation Studies
Annotation scheme development and evaluation is an important subtask for some health and biomedical-related NLP applications (Conway et al., 2010; Mowery et al., 2013; Roberts et al., 2007; Vincze et al., 2008; Kim et al., 2003). Work on building annotation schemes (and corpora) for mental health signals in social media is less well developed, but pioneering work exists. For example, Homan et al. created a 4-value distress scale for rating tweets, with annotations performed by novice and expert annotators (Homan et al., 2014). To our knowledge, there exists no clinical depression annotation scheme that explicitly captures elements from common diagnostic protocols for the identification of depression symptoms in Twitter data.

Methods
Our first step was the iterative development of a Depressive Disorder Annotation Scheme based on widely-used diagnostic criteria (Section 3.1). We then went on to evaluate how well annotators were able to apply the schema to a small corpus of Twitter data, and assessed pairwise inter-annotator agreement across the corpus (Section 3.2).

Classes
Our Depressive Disorder Annotation Scheme is hierarchically structured and comprises two mutually exclusive nodes: No evidence of clinical depression and Evidence of clinical depression. The Evidence of clinical depression node has two non-mutually-exclusive types, Depression Symptom and Psycho-Social Stressor, derived from our literature review (top-down modeling) and dataset (bottom-up modeling). A summary of the scheme is shown in Figure 1.
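One way to picture the hierarchy is as a nested mapping with a validity check that enforces the mutual exclusivity of the two top-level nodes. The sketch below is a simplified illustration, not the full scheme: it lists only a few classes named elsewhere in this paper, and the names `SCHEME` and `is_valid_annotation` are hypothetical.

```python
NO_EVIDENCE = "No evidence of clinical depression"
EVIDENCE = "Evidence of clinical depression"

# Partial, illustrative listing: only a few classes named in the paper are
# included. The complete scheme contains more parent and child classes.
SCHEME = {
    NO_EVIDENCE: {},
    EVIDENCE: {
        "Depression Symptom": ["Low mood", "Anhedonia", "Fatigue or loss of energy"],
        "Psycho-Social Stressor": ["Economic problems"],
    },
}


def is_valid_annotation(labels):
    """Check one tweet's label set against the top-level exclusivity rule.

    A tweet is either labelled with No evidence alone, or with one or more
    classes under Evidence. Symptom and stressor classes may co-occur.
    """
    labels = set(labels)
    evidence_classes = {c for classes in SCHEME[EVIDENCE].values() for c in classes}
    if labels == {NO_EVIDENCE}:
        return True
    return bool(labels) and labels <= evidence_classes
```

Under these assumptions, a tweet labelled with both Low mood and Economic problems is valid, whereas one labelled with both No evidence of clinical depression and Low mood is not.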
For Depression Symptom classes, we identified 9 of the 10 parent-level depression symptoms from five resources for evaluating depression: Depression Screening Day Scale (HANDS) (Baer et al., 2000) [...]. For Psycho-Social Stressor classes, we synthesised 12 parent-level classes based on the Diagnostic and Statistical Manual of Mental Disorders, Edition 4 (DSM IV) Axis IV "psychosocial and environmental problems" (American Psychiatric Association, 2000) and work by Gilman et al. (Gilman et al., 2013). We identified other potential parent classes based on annotation of 129 randomly selected tweets from our corpus. The hierarchical structure of the scheme, emphasising parent and child classes assessed in this study, is depicted in Figure 2.

Pilot Annotation Study
The goal of this preliminary study was to assess how reliably our annotation scheme could be applied to Twitter data. To create our initial corpus, we queried the Twitter API using lexical variants of "depression" (e.g., "depressed" and "depressing") and randomly sampled 150 tweets from the data set. Of these 150 tweets, we filtered out 21 retweets (RT). The remaining tweets (n=129) were annotated with the annotation scheme and adjudicated by consensus review by the authors (A1, A2), both biomedical informaticists by training. Two clinical psychology student annotators (A3, A4) were trained to apply the guidelines using the extensible Human Oracle Suite of Tools (eHOST) annotation tool (South et al., 2012) (Figure 3). Following this initial training, A3 and A4 annotated the same 129 tweets as A1 and A2.
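The corpus construction steps (keyword match, random sample, retweet removal) can be sketched as follows. This is a minimal illustration that operates on tweets already retrieved as plain strings; it makes no Twitter API calls. The keyword pattern, the "RT" prefix test, and the function name `build_pilot_corpus` are assumptions rather than the study's actual implementation.

```python
import random
import re

# Assumed lexical variants of "depression"; the text names only "depressed"
# and "depressing" as examples, so the full query list is not known.
VARIANT_PATTERN = re.compile(r"\bdepress(?:ion|ed|ing)\b", re.IGNORECASE)

# Assumed convention: a retweet begins with the token "RT".
RETWEET_PATTERN = re.compile(r"^\s*RT\b")


def build_pilot_corpus(tweets, sample_size=150, seed=0):
    """Keep keyword-matching tweets, sample them, then drop retweets."""
    matching = [t for t in tweets if VARIANT_PATTERN.search(t)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(matching, min(sample_size, len(matching)))
    return [t for t in sample if not RETWEET_PATTERN.match(t)]
```

Because retweets are removed after sampling, the final corpus is smaller than the sample size, which mirrors the reduction from 150 to 129 tweets described above.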
In this study, we calculated the frequency distribution of annotated classes for each annotator. To assess inter-annotator agreement (IAA), we compared annotators with each other (IAA_ba, between annotators) and against the adjudicated reference standard (IAA_ar, against the reference standard) using F1-measure. Note that F1-measure, the harmonic mean of sensitivity and positive predictive value, is equivalent to positive specific agreement, which can act as a surrogate for kappa in situations where the number of true negatives becomes large (Hripcsak and Rothschild, 2005). We also assessed IAA_ar at both parent and child levels of the annotation scheme hierarchy (see Figure 2 for example parent/child classes). In addition to presenting IAA_ar by annotator for each parent class, we also characterise the distribution of the following disagreement types:

1. Presence/absence of clinical evidence (CE), e.g., No evidence of clinical depression vs. Fatigue or loss of energy
2. Spurious class (SC), e.g., false class annotation
3. Missing class (MC), e.g., missing class annotation
4. Other (OT), e.g., errors not mentioned above
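The equivalence between F1-measure and positive specific agreement noted above can be checked with a short sketch. It assumes each annotator's output is represented as a set of (tweet id, class) pairs; that representation and both function names are illustrative choices, not the study's tooling.

```python
def positive_specific_agreement(annotations_a, annotations_b):
    """Return 2a / (2a + b + c), where a counts pairs chosen by both
    annotators and b, c count pairs chosen by only one of them."""
    set_a, set_b = set(annotations_a), set(annotations_b)
    both = len(set_a & set_b)
    only_a = len(set_a - set_b)
    only_b = len(set_b - set_a)
    denominator = 2 * both + only_a + only_b
    return 2 * both / denominator if denominator else 0.0


def f1_against_reference(reference, candidate):
    """Harmonic mean of sensitivity and positive predictive value."""
    ref, cand = set(reference), set(candidate)
    true_positives = len(ref & cand)
    if true_positives == 0:
        return 0.0
    ppv = true_positives / len(cand)         # positive predictive value
    sensitivity = true_positives / len(ref)  # recall
    return 2 * ppv * sensitivity / (ppv + sensitivity)
```

Neither function uses true negatives, which is why the measure remains usable when most possible (tweet, class) pairs are left unannotated by everyone. The scores in this paper are reported on a 0-100 scale, so the returned value would be multiplied by 100.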
In Table 3, we report IAA_ar for each annotator compared to the reference standard for both parent and child classes. IAA_ar ranged from 60 to 90 for the parent classes (e.g. Media) and from 41 to 87 for child classes (e.g. Media: book). The IAA_ar difference between parent and child class performance ranged from 3 to 36 points. Table 4 enumerates IAA_ar for the observed parent classes. Note that only 12 (55%) of the parent classes were observed in the reference standard. A1 had variable agreement levels, including 4 subtypes between 80-100, 6 subtypes between 60-79, and 3 subtypes between 40-59. A2 had consistently high agreement, with 10 subtypes between 80-100 followed by 1 subtype between 20-39. A3 achieved 3 subtypes between 60-79 and 1 subtype between 40-59. A4 performed with 2 subtypes between 80-100, 3 subtypes between 60-79, 1 subtype between 40-59, and 2 subtypes between 20-39.
We observed between 15 and 57 disagreements across annotators when compared to the reference standard (see Table 5), with No evidence of clinical depression accounting for 60-77% of disagreements. Missing classes accounted for 16-33% of disagreements.
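As an illustration of how disagreements might be assigned to the four types and how child classes collapse to their parents, consider the sketch below. The decision rules are assumptions inferred from the type descriptions; the paper does not spell out the exact procedure, and the function names are hypothetical.

```python
from collections import Counter

NO_EVIDENCE = "No evidence of clinical depression"


def to_parent(label):
    """Collapse a child class such as "Media: book" to its parent "Media"."""
    return label.split(":", 1)[0].strip()


def disagreement_type(reference, annotator):
    """Classify one tweet's disagreement as CE, SC, MC, or OT (None if equal).

    Assumed rules: CE when exactly one side uses No evidence; SC when the
    annotator only adds classes; MC when the annotator only omits classes;
    OT for everything else, e.g. a one-to-one mismatch.
    """
    ref, ann = set(reference), set(annotator)
    if ref == ann:
        return None
    if (NO_EVIDENCE in ref) != (NO_EVIDENCE in ann):
        return "CE"
    extra, missing = ann - ref, ref - ann
    if extra and missing:
        return "OT"
    return "SC" if extra else "MC"


def tally_disagreements(pairs):
    """Count disagreement types over (reference, annotator) label-set pairs."""
    counts = Counter()
    for reference, annotator in pairs:
        kind = disagreement_type(reference, annotator)
        if kind is not None:
            counts[kind] += 1
    return counts
```

Applying `to_parent` to both label sets before comparison yields parent-level agreement, so a specificity difference such as Media vs Media: book counts as a disagreement only at the child level.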

Discussion
We developed an annotation scheme to represent depressive symptoms and psycho-social stressors associated with depressive disorder, and conducted a pilot study to assess how well the scheme could be applied to tweets. We observed that content from most tweets can be represented with one class annotation (see Table 1), an unsurprising result given the constraints on expressivity imposed by Twitter's 140-character limit. In several cases, two symptoms or social stressors are expressed within a single tweet, most often Low mood and a second class (e.g. Economic problems).
We observed low to moderate IAA_ba between annotators (Table 2). Annotators A1 and A2 achieved the highest agreement, suggesting they have a more similar understanding of the schema than any other annotator pair. Comparing our agreement scores to related work is challenging. However, Homan et al. report a comparable, moderate kappa (50) between two novice annotators when annotating whether a tweet represents distress.
When comparing IAA_ar, annotators achieved moderate to high agreement at the parent level against the reference standard (Table 3). Annotators A1 and A2 had higher parent- and child-level agreement than annotators A3 and A4. This may be explained by the fact that the schema was initially developed by A1 and A2 and that the reference standard was adjudicated by consensus between A1 and A2. Around half of the depressive symptoms and psycho-social stressors were not observed during the pilot study (e.g. Anhedonia, Fatigue or loss of energy, Recurrent thoughts of death or suicidal ideation; see Table 4), although they may well appear in a larger annotation effort. The reference standard consists mainly of No evidence of clinical depression and Low mood classes, suggesting that other depressive symptoms and psycho-social stressors (e.g. Psychomotor agitation or retardation) are less often expressed or more difficult to detect without more context than is available in a single tweet. For these most prevalent subtypes, good to excellent agreement was achieved by all 4 annotators. Considerably lower agreement was observed for annotators A3 and A4 for less prevalent classes. In contrast, A1 and A2 maintained similar moderate and high agreement, respectively. In future experiments, we will leverage all annotators' annotations when generating the reference standard (i.e. the reference standard will be created using majority vote).
Table 5: Count (%) of disagreements by type for each annotator compared against the reference standard
[...]    [...]    [...] (7)    2 (4)    0 (0)
OT       2 (5)    0 (0)        3 (6)    4 (7)
Total    38       15           49       57
[...] identifying a tweet as containing No evidence of clinical depression (see Table 5). The line between the presence and absence of evidence for clinical depression is difficult to draw in these cases due to the use of humour ("So depressed :) #lol"), misuse or exaggerated use of the term ("I have a bad case of post concert depression"), and lack of context ("This is depressing").
In very few cases, disagreements were the result of other differences such as specificity (Media vs Media: book) or one-to-one mismatch (Weather: NOS vs Media: book). This result is unsurprising given that agreement tends to decrease as the number of categories becomes large, especially for less prevalent categories (Poesio and Vieira, 1998). We acknowledge several limitations in our pilot study, notably the small sample size and the single initial query term. We will address these limitations in future work by annotating a significantly larger corpus (over 5,000 tweets) and querying the Twitter API with a more diverse list of clinician-validated keywords than was used in this pilot annotation study.
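The majority-vote reference standard proposed for future experiments could be built along the following lines. This is a minimal sketch under the assumption that each annotator supplies a set of classes per tweet and that a class is kept only when more than half of the annotators chose it. The function name and the handling of ties are illustrative.

```python
from collections import Counter


def majority_vote(label_sets):
    """Return the classes chosen by a strict majority of annotators.

    label_sets holds one set of classes per annotator for a single tweet.
    With an even number of annotators, a tied class is not kept; such
    tweets would still need adjudication (an assumption of this sketch).
    """
    threshold = len(label_sets) // 2 + 1
    counts = Counter(label for labels in label_sets for label in set(labels))
    return {label for label, votes in counts.items() if votes >= threshold}
```

With four annotators, a class needs three votes to enter the reference standard, so a 2-2 split yields an empty result for that tweet and would have to be resolved by other means.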

Conclusions
We conclude that there are considerable challenges in attempting to reliably annotate Twitter data for mental health symptoms. However, several depressive symptoms and psycho-social stressors derived from DSM-5 depression criteria and depression screening instruments can be identified in Twitter data.