A Cross-modal Review of Indicators for Depression Detection Systems

Automatic detection of depression has attracted increasing attention from researchers in psychology, computer science, linguistics, and related disciplines. As a result, promising depression detection systems have been reported. This paper surveys these efforts by presenting the first cross-modal review of depression detection systems and discusses best practices and the most promising approaches to this task.


Introduction
Given advancements in hardware and software, coupled with the explosion of smartphone use, the forms that potential health care solutions can take have begun to change, and interest in developing technologies to assess mental health has grown. Among the latest technologies are depression detection systems, which use indicators from an individual in combination with machine learning to make automated depression level assessments. Researchers have made significant progress, but challenges remain. One major challenge is the disconnect between language technology subfields: approaches to depression assessment from natural language processing (NLP), speech processing, and human-computer interaction (HCI) tend to silo by subfield, with little discussion about the utility of combining promising approaches. This disconnect calls for a bridge that facilitates greater collaboration across subfields and modalities.
Experts across several fields are attempting to build valid tools for depression assessment. Each subfield tends to approach the task from a unique perspective, with slightly different goals and completely different data sources. Due to these experimental differences, it is difficult to compare approaches and even more difficult to combine promising approaches. Considering data sources alone: NLP research has aimed to detect depression from writing, both formal and informal (e.g., online text); speech processing research has aimed to assess depression level from audio; and HCI and related fields have tried to assess depression level from video. Each data source is then labeled for depression through different approaches, including rating scales, self-report surveys, and manual annotation. As a result, the definition of depression varies across studies. Regardless of these differences, every study and system shares the common goal of discovering a way to use technology to help assess depression.
This survey paper aims to serve as a bridge between the subfields by providing the first review of depression detection systems across subfields and modalities. This paper focuses on the following research questions: How has depression been defined and annotated in detection systems? What kinds of depression data exist or could be obtained for depression detection systems? What (multimodal) indicators have been used for the automatic detection of depression? How do we evaluate depression detection systems? Each research question could serve as the main focus of an entire paper. Therefore, this review briefly touches upon each question and dedicates the most focus to reviewing indicators of depression and, subsequently, features for depression detection systems. We cover numerous features across modalities, including visual, acoustic, linguistic, and social. We briefly review approaches to defining and annotating depression, existing data sources, and how to evaluate depression detection systems. Lastly, we end our discussion with the practical and ethical issues that require attention when building systems for depression detection.

Clinical Definition and Diagnostics
According to the Diagnostic and Statistical Manual of Mental Disorders (APA, 2013), the most widely used resource in diagnosing mental disorders in the United States, most people will experience some feelings of depression in their lifetime, although these do not meet the criteria of an illness until a person has experienced, for longer than a two-week period, a depressed mood and/or a markedly diminished interest/pleasure in combination with four or more of the following symptoms: significant unintentional weight loss or gain, insomnia or sleeping too much, agitation or psychomotor retardation noticed by others, fatigue or loss of energy, feelings of worthlessness or excessive guilt, diminished ability to think or concentrate, indecisiveness, or recurrent thoughts of death. In addition, diagnosis requires that the symptoms cause clinically significant distress or impairment in social, occupational, or other important areas of functioning.
Commonly used assessment tools for depression include clinical interviews and self-assessments. The Hamilton Rating Scale for Depression (HAM-D) (Hamilton, 1960) is a widely used assessment tool and is often regarded as the standard assessment tool for depression for both diagnosis and research purposes (Cummins et al., 2015a). The HAM-D is clinician-administered, includes 21 questions, and takes 20 to 30 minutes to complete. The interview assesses the severity of symptoms associated with depression and gives a patient a score, which relates to their level of depression. Symptoms covered include depressed mood, insomnia, agitation, and anxiety. Each question has 3 to 5 possible responses which range in severity, scored on a 0-2 or 0-4 scale depending on the importance of the symptom. All scores are then summed and the total falls into one of 5 categories (normal to severe).
There also exist commonly used self-report measures, including the Beck Depression Inventory (BDI-II) (Beck et al., 1961). The BDI-II is a self-report questionnaire that consists of 21 items and takes 5 to 10 minutes to complete. The question items aim to cover important cognitive, affective, and somatic symptoms associated with depression. Each question receives a score on a scale from 0-3 depending on how severe the symptom was over the previous week. Similar to the HAM-D, all scores are summed and the final score is categorized into 4 different levels (minimal to severe). Other diagnostic tools include the Montgomery-Åsberg Depression Rating Scale (Montgomery and Asberg, 1979), the Patient Health Questionnaire (Kroenke et al., 2001), and the Quick Inventory of Depressive Symptomology (Rush et al., 2003).
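To make the scoring procedure concrete, the following is a minimal sketch of mapping summed item scores to severity bands. The cutoffs shown follow commonly cited BDI-II bands, but exact thresholds vary by source and should be checked against the scale manual before any real use.

```python
# Minimal sketch of self-report scale scoring: item scores are summed and the
# total is mapped to a severity band. Cutoffs below follow commonly cited
# BDI-II bands; consult the scale manual before relying on them.

def bdi_severity(item_scores):
    """Map 21 BDI-II item scores (each 0-3) to a total and severity label."""
    assert len(item_scores) == 21 and all(0 <= s <= 3 for s in item_scores)
    total = sum(item_scores)
    if total <= 13:
        return total, "minimal"
    elif total <= 19:
        return total, "mild"
    elif total <= 28:
        return total, "moderate"
    return total, "severe"

print(bdi_severity([1] * 21))  # (21, 'moderate')
```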

Scalable Approaches to Annotation
When working with datasets, it is not always feasible to acquire clinical ratings of depression level. As a result, researchers have come up with innovative ways of acquiring depression labels at scale, notably from social media sources. Given the explosion of social media, this domain is especially rich in data for mental health research. However, any research in this domain must take into account the ability of online users to be anonymous or even deceptive. Coppersmith et al. (2015) looked for tweets that explicitly stated a diagnosis, e.g., "I was just diagnosed with depression". Moreno et al. (2011) evaluated Facebook status updates, using references to depression symptoms such as "I feel hopeless" to ultimately determine a depression label. Choudhury et al. (2013) used crowdsourcing, via the Amazon Mechanical Turk platform, to collect Twitter usernames as well as labels for depression. Reece and Danforth (2016) used a similar crowdsourcing approach to collect both depression labels and Instagram photo data. In some approaches to annotation, depression is subsumed into broader categories like distress, anxiety, or crisis. For example, Milne et al. (2016) used judges to manually annotate how urgently a post required attention, using a green/amber/red/crisis triage system.
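As an illustration of the self-disclosure approach of Coppersmith et al. (2015), a pattern matcher of the following form could flag candidate posts. The regular expression here is a simplified, hypothetical stand-in for their actual procedure, which also involved manual review of matches.

```python
import re

# Simplified sketch of self-disclosure matching in the style of
# Coppersmith et al. (2015); the pattern is illustrative, not their exact one,
# and matched posts would still require manual review.
DIAGNOSIS_PATTERN = re.compile(
    r"\bi (was|have been|got) (just )?diagnosed with depression\b",
    re.IGNORECASE,
)

posts = [
    "I was just diagnosed with depression",
    "Reading a paper about depression detection",
]
candidates = [p for p in posts if DIAGNOSIS_PATTERN.search(p)]
print(candidates)  # ['I was just diagnosed with depression']
```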
These innovative approaches to data annotation highlight the potential of social media data. This domain offers a very rich data source which can be used to build, train, and test models to automatically perform mental health assessments at a large scale.

Datasets
The task of depression detection is inherently interdisciplinary, and all disciplines (psychology, computer science, linguistics) bring an essential set of skills and insight to the problem. However, it is not always the case that a team is fortunate enough to have collaborators from all disciplines. One way to promote collaboration is to organize shared tasks; the AVEC Depression Sub-challenges (2013-2016) are examples of depression detection system challenges that spurred interest, promoted research, and built connections across the research community. In this section, we describe the kinds of depression data that exist, listed in Table 1. We focus solely on datasets that are publicly available to download. For a detailed list of databases, both private and public, that have been used in speech processing studies, see Cummins et al. (2015a).
Both the AVEC 2013 and 2014 corpora are available to download. The AVEC challenges are organized competitions aimed at comparing multimedia processing and machine learning methods for automatic audio, video, and audiovisual emotion and depression analysis, with all participants competing under strictly the same conditions. The AVEC 2013 corpus (Valstar et al., 2013) includes 340 video clips in German of subjects performing an HCI task while being recorded by a webcam and a microphone. The video files each contain a range of vocal exercises, including free and read speech tasks. The level of depression is labeled with a single value per recording using the BDI-II. The AVEC 2014 corpus (Valstar et al., 2014) is a subset of the AVEC 2013 corpus. In total, the corpus includes 300 videos in German; the duration ranges from 6 seconds to 4 minutes. The files include a read speech passage (Die Sonne und der Wind, "The Sun and the Wind") and an answer to a free response question.
The Crisis Text Line is a free 24/7 crisis support texting hotline where live trained crisis counselors receive and respond quickly to texts. The main goal of the organization is to support people with mental health issues through texting. The organization runs an open data collaboration. In order to gain access, researchers must complete an Institutional Review Board application with their own university and an application with Crisis Text Line, which gives researchers access to a vast amount of text data annotated by conversation issue, including but not limited to depression, anger, sadness, body image, homelessness, self-harm, and suicidal ideation.
The Distress Analysis Interview Corpus (DAIC) contains clinical interviews in English designed to support the diagnosis of psychological distress conditions such as anxiety, depression, and post-traumatic stress disorder. The interviews were conducted by an animated virtual interviewer called Ellie. The DAIC interviews were meant to simulate the first step in identifying mental illness in health care settings, which is a semi-structured interview where health care providers ask a series of open-ended questions with the intent of identifying clinical symptoms. The corpus includes audio and video recordings and extensive questionnaire responses. Each interview includes a depression score from the PHQ-8 (Kroenke et al., 2009). A portion of the corpus was released during the AVEC 2016 Depression Sub-challenge and is available to download. The publicly available dataset also includes transcripts of the interviews.
The DementiaBank Database represents data collected between 1983 and 1988 as part of the Alzheimer Research Program at the University of Pittsburgh (Becker et al., 1994). DementiaBank is a shared database of multimedia interactions for the study of communication in dementia. A subset of the participants in the dataset also have HAM-D depression scores.
The ReachOut Triage Shared Task dataset consists of 65,024 forum posts written between July 2012 and June 2015 (Milne et al., 2016). A subset of the corpus (1,227 posts) was manually annotated by three separate expert judges indicating how urgently a post required a moderator's attention. Labels included crisis, red, amber, and green.
The SemEval-2014 Task 7 dataset (Pradhan et al., 2014) represents clinical notes which are annotated for disorder mentions, including mental disorders such as depression.

Indicators of Depression
Ideally, machine learning tools for depression detection should have access to the same streams of information that a clinician utilizes in the process of forming a diagnosis. Therefore, features used by such classifiers should represent each communicative modality: face and gesture, voice and speech, and language. This section provides a review of each modality, highlighting markers that have been used successfully in detection systems.

Visual Indicators
Visual indicators have been widely explored for depression analysis, including body movements, gestures, subtle expressions, and periodic muscular movements.
Girard et al. (2014) investigated whether a relationship exists between nonverbal behavior and depression severity. To measure nonverbal behavior, they used the Facial Action Coding System (FACS) (Ekman et al., 1978). FACS is a system for taxonomizing human facial movements by their appearance on the face. It is a commonly used tool that has become the standard way to systematically categorize physical expressions, and it has proven very useful to psychologists. FACS is composed of facial Action Units (AUs), which represent the fundamental actions of individual muscles or groups of muscles. Girard et al. (2014) found that participants with high levels of depression made fewer affiliative facial expressions, more non-affiliative facial expressions, and diminished head motions. Scherer et al. (2013b) also investigated visual features using FACS and found that depression could be predicted by a more downward angle of the gaze, less intense smiles, shorter average durations of smile, longer self-touches, and fidgeting.
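Frame-level AU estimates are available from open toolkits such as OpenFace; a sketch of turning them into per-video descriptors might look as follows, assuming a table of per-frame AU intensity columns. The AU12_r-style column names follow OpenFace's naming convention but should be verified against the actual tool output.

```python
import pandas as pd

# Sketch: summarize frame-level AU intensities into per-video statistics.
# Column names assume OpenFace's AUxx_r convention; verify against the
# toolkit's actual output before relying on them.
frames = pd.DataFrame({
    "AU12_r": [0.0, 1.5, 2.0, 0.5],  # lip corner puller (smile intensity)
    "AU04_r": [1.0, 1.2, 0.8, 1.1],  # brow lowerer
})

summary = frames.agg(["mean", "std", "max"])
print(summary)
# Statistics such as mean smile intensity can then serve as features,
# echoing findings like less intense, shorter smiles in depressed
# participants (Scherer et al., 2013b).
```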
In addition to FACS features for video analysis, others have considered Space-Time Interest Point (STIP) features (Cummins et al., 2013), which capture spatio-temporal changes including movements of the face, hands, shoulders, and head. Using STIP features, one study found that depression could be detected with 76.7% accuracy; its results showed that body expressions, gestures, and head movements can be significant visual cues for depression detection.

Speech Indicators
Recent research has shown promise in using speech as a diagnostic and monitoring aid for depression (Cummins et al., 2015a,b, 2014; Williamson et al., 2014a). The human speech production system is very complex, and as a result slight cognitive or physiological changes can produce acoustic changes in speech. This idea has driven the research on using speech as an objective marker for depression. Depressed speech has consistently been associated with a wide range of prosodic, source, formant, and spectral indicators. For a thorough review of speech processing research for depression detection, see Cummins et al. (2015a).
Many researchers have provided evidence for the robustness of prosodic indicators in capturing depression level, specifically noting the promise of speech rate (Mundt et al., 2012; Hönig et al., 2014). Cannizzaro et al. (2004) examined the relationship between depression and speech by performing statistical analyses of different acoustic measures, including speaking rate, percent pause time, and pitch variation; their results demonstrated that reduced speaking rate and pitch variation correlated significantly with HAM-D scores. Stassen et al. (1998) found that, for 60% of patients in their study, speech pause duration was significantly correlated with HAM-D score. Alpert et al. (2001) also found significant differences in speech pause duration between the spontaneous speech of their depressed group and their control group. Mundt et al. (2012) found six prosodic timing measures to be significantly correlated with depression severity, including total speech time, total pause time, percentage pause time, speech pause ratio, and speaking rate. Hönig et al. (2014) reported a positive correlation between increasing levels of speaker depression and average syllable duration. Trevino et al. (2011) found that changes in speech rate are stronger at the phoneme level, observing stronger relationships between speech rate and depression severity when using phone-duration and phone-specific measures instead of a global speech rate. Cohn et al. (2009) investigated vocal prosody and found that variation in fundamental frequency and latency of response to interviewer questions achieved 79% accuracy in distinguishing participants with moderate/severe depression from those with no depression.

Beyond prosody, Moore et al. (2008) investigated the suitability of a classification system built from a combination of prosodic, voice quality, spectral, and glottal features, and reported a maximum accuracy of 91% for male speakers and 96% for female speakers when classifying between absence and presence of depression. Low et al. (2011) investigated various acoustic features, including spectral, cepstral, prosodic, glottal, and Teager-energy-operator-based features; in their best performing systems, using sex-dependent models, they achieved 87% accuracy for males and 79% for females. Cummins et al. (2011) found spectral features, particularly mel-frequency cepstral coefficients (MFCCs), to be useful, distinguishing 23 depressed participants from 24 controls with an accuracy of 80% in a speaker-dependent configuration. Scherer et al. (2013a) found that glottal features (normalized amplitude quotient and quasi-open quotient) differed significantly between depressed and control groups; when used to detect depression, these glottal features differentiated between the two groups with 75% accuracy. Alghowinem et al. (2013) investigated a number of feature sets for detecting depression from spontaneous speech and found loudness and intensity features to be the most discriminative.
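Many of the prosodic and spectral measures above can be approximated with open-source tooling. The sketch below, using librosa, computes MFCC statistics, an F0-variation estimate, and a crude percent-pause-time proxy from energy-based silence detection; the top_db threshold and the pause proxy are illustrative simplifications of the measures used in the cited studies, and "speech.wav" is a placeholder input.

```python
import numpy as np
import librosa

# Sketch of simple acoustic descriptors; thresholds are illustrative and the
# pause measure is a crude energy-based proxy for the cited prosodic timing
# features. Assumes a mono speech file "speech.wav".
y, sr = librosa.load("speech.wav", sr=None)

# Spectral: mean and std of 13 MFCCs over the recording.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_stats = np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Source/prosodic: F0 contour via pYIN; variation as std over voiced frames.
f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
f0_variation = np.nanstd(f0)

# Timing: percent pause time from energy-based non-silent intervals.
speech_intervals = librosa.effects.split(y, top_db=30)
speech_samples = sum(end - start for start, end in speech_intervals)
percent_pause = 1.0 - speech_samples / len(y)

print(mfcc_stats.shape, f0_variation, percent_pause)
```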

Linguistic and Social Indicators
While most literature concerning depression detection systems has focused on the speech signal, there is a related body of work on detecting depression from writing using linguistic cues. For clinical psychologists, language plays a central role in diagnosis. Therefore, when building language technology in the domain of mental health, it is essential to consider both the acoustic and the linguistic signal. For an in-depth review of NLP applications for mental health assessment, see Calvo et al. (2017).
Features derived from the speech signal are motivated by the ways in which the cognitive and physical changes associated with depression can lead to differences in speech. Similarly, psychological and sociological theories suggest that depressed language can be characterized by specific linguistic features. Aaron Beck's (1967) cognitive theory of depression posits that people prone to depression possess depressive schemata, leading them to see themselves and the world in pervasively negative terms. A stressful event can activate these schemata, leading an individual to perceive the event in a negative way and, as a result, triggering an episode of depression. Pyszczynski and Greenberg (1987) speculated that depressed individuals think a great deal about themselves, stressing the role of self-focused attention and extreme self-criticism. Also related is the social integration model of Durkheim (1951), which posits that the perception of oneself as not integrated into society (detached from social life) is key to suicidality and is also relevant to depressed persons' perceptions of self.
These theories have motivated empirical studies of depressed language, which have in turn provided support for their validity. Stirman and Pennebaker (2001) provided evidence consistent with both the self-focus and social integration perspectives by studying the word usage of suicidal and non-suicidal poets. They conducted a comparison of 300 poems from the early, middle, and late periods of nine poets who committed suicide and nine who did not. They used the Linguistic Inquiry and Word Count (LIWC) dictionary (Pennebaker et al., 2007), which is a text analysis tool that can be used to count words in psychologically meaningful categories. Using LIWC, they found that suicidal poets used more first-person singular (I, me, my) words and fewer first-person plural (we, us, our) words. In related work, Poulin et al. (2014) used medical records and a text analysis approach to predict suicide risk with an accuracy of 65%, finding that certain words were predictive of suicide.
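LIWC itself is proprietary, but its core counting mechanism is straightforward; below is a toy sketch of dictionary-based category counting. The two word lists are illustrative stand-ins, not LIWC's actual lexicon.

```python
import re
from collections import Counter

# Toy LIWC-style category counter. The word lists are illustrative stand-ins;
# the real LIWC lexicon is proprietary and far more extensive.
CATEGORIES = {
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
    "negative_emotion": {"sad", "hopeless", "hurt", "worthless", "alone"},
}

def category_rates(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORIES.items():
            if token in words:
                counts[category] += 1
    # Normalize by document length, as LIWC reports percentages.
    return {c: counts[c] / max(len(tokens), 1) for c in CATEGORIES}

print(category_rates("I feel hopeless and alone, nothing helps me"))
```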
Later work by Rude et al. (2004) analyzed narratives written by currently-depressed, formerly-depressed, and never-depressed college students. In the context of an essay task, they examined linguistic patterns using LIWC, including the use of first person singular, first person plural, social references, and negatively/positively valenced words. As hypothesized based on Pyszczynski and Greenberg's model of self-focus, depressed students used significantly more first person singular words than did never-depressed individuals. They also found that depressed students used more negatively valenced words and fewer positive emotion words, supporting both the negative focus predicted by Beck's cognitive theory of depression and the self-preoccupation predicted by Pyszczynski and Greenberg's control theory of depression. Given the success of LIWC in Rude et al.'s work, many other researchers have incorporated LIWC into depression detection systems with encouraging results. Nguyen et al. (2014) found LIWC to be useful in capturing topic and mood, which showed good predictive validity in classifying between clinical and control groups in blog post texts. Morales and Levitan (2016b) incorporated LIWC into a depression detection system and found certain LIWC categories to be useful in measuring specific depression symptoms, including sadness and fatigue.
Various approaches to modeling word usage have had much success in detecting depression. Coppersmith et al. (2015) identified depression with high accuracy using n-gram models over Twitter text. Althoff et al. (2016) presented a large-scale quantitative study on the discourse of counseling conversations. They developed a set of discourse features to measure how correlated linguistic aspects of conversations were with outcomes. Features in their study included sequence-based conversation models, language model comparisons, message clustering, and psycholinguistics-inspired word frequency analyses. Their results were also consistent with Pyszczynski and Greenberg's theory of depression, in that less self-focus on the part of texters was associated with more successful conversations. In addition, Schwartz et al. (2014) showed that regression models based on Facebook language can be used to predict an individual's degree of depression.
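An n-gram approach of the kind used by Coppersmith et al. (2015) can be approximated with a standard bag-of-n-grams pipeline. The minimal sketch below uses four toy posts in place of a real labeled Twitter corpus; it illustrates the technique, not their exact system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Minimal bag-of-n-grams classifier in the spirit of n-gram approaches to
# depression detection; the four toy posts stand in for a real labeled corpus.
texts = [
    "i feel so hopeless and tired all the time",
    "can't sleep again, everything feels pointless",
    "great run this morning, feeling energized",
    "had a lovely dinner with friends tonight",
]
labels = [1, 1, 0, 0]  # 1 = depressed, 0 = control

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["feeling hopeless and tired"]))  # likely [1]
```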
In addition to considering word usage, researchers have also explored syntactic characteristics of depressed language. Zinken et al. (2010) investigated whether an analysis of depressed patients' syntax could help predict improvement of symptoms. This work built upon previous findings showing the health benefits of expressive writing (Pennebaker, 1997). Building upon this work, Zinken et al. considered the psychological relevance of the syntactic structures of language use. Word use and syntactic structure were analyzed to explore whether the degree to which a participant constructs relationships between events in a brief text can inform the likelihood of successful participation in depression treatment. They also used LIWC, targeting two categories: causation words and insight words. In addition, they manually coded eight different syntactic structures (ranging from simple to complex) in the patients' narratives. They found that certain structures were correlated with patients' potential to complete a self-help treatment. Zinken et al.'s findings demonstrate the promise of investigating the syntactic characteristics of an individual's language use. Moreover, related work has found differences in the frequencies of part-of-speech (POS) tags to be useful in detecting depression from writing (Morales and Levitan, 2016b).

Resnik et al. (2015) explored the use of supervised topic models for detecting depression from Twitter. They used 3 million tweets from about 2,000 Twitter users, of whom roughly 600 self-identified as having been diagnosed with depression. This work provided a more sophisticated model for text-based feature development for detecting depression, yielding promising results using supervised Latent Dirichlet Allocation (LDA). LDA uncovers underlying structure in a collection of documents by treating each document as if it were generated as a mixture of different topics. Qualitative examples confirmed that LDA models can uncover meaningful and potentially useful latent structure for the automatic identification of important topics for depression detection.
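A basic unsupervised LDA pipeline can be sketched with scikit-learn, as below; note that the supervised variant used by Resnik et al. (2015) additionally conditions topics on the depression label and is not part of scikit-learn. The toy documents stand in for tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sketch of plain (unsupervised) LDA topic modeling; Resnik et al. (2015)
# used supervised LDA, which is not shown here. Toy documents stand in
# for a real tweet collection.
docs = [
    "tired exhausted sleep insomnia awake",
    "sad hopeless crying alone worthless",
    "game score team win season",
    "sleep tired awake night insomnia",
]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)  # per-document topic mixtures
print(doc_topics.round(2))
```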
With the rise of social media, posts on sites such as Twitter and Facebook provide an interesting domain in which to investigate depression. Not only do these domains provide rich text data, but also social metadata that captures important social behaviors and characteristics, like number of friends/followers, number of likes, retweets, etc. De Choudhury et al. (2014) studied Facebook data shared voluntarily by 165 new mothers. Their work aimed to detect and predict the onset of postpartum depression (PPD). They considered multiple behavioral features, including activity (frequency of status updates, media items, and wall posts), social capital (likes and comments on status updates or media), and emotional expression and linguistic style measured through LIWC. They found that experiences of PPD were best predicted by increased isolation, which was modeled by reduced social activity and interaction on Facebook and decreased access to social capital. Wang et al. (2013) constructed a model to detect depression from online blog posts. The features they extracted included first person singular and plural pronouns, the polarity of each sentence computed by their polarity calculation algorithm, the ratio of first person singular pronouns to first person plural pronouns, use of emoticons, user interactions with others (@username mentions), and number of posts. Using 180 users, the features given above, and three different kinds of classifiers, Wang et al. (2013) report a precision of 80% when classifying between depressed and non-depressed users.
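Behavioral and interaction features like those of Wang et al. (2013) reduce to simple counts over a user's posts. The following hypothetical extraction sketch illustrates the idea; the feature set and posts are illustrative, not their exact pipeline.

```python
import re

# Sketch of per-user social/linguistic features in the style of
# Wang et al. (2013); the feature set and post data are illustrative.
def user_features(posts):
    text = " ".join(posts).lower()
    tokens = re.findall(r"[a-z']+|@\w+", text)
    fps = sum(t in {"i", "me", "my"} for t in tokens)    # 1st person singular
    fpp = sum(t in {"we", "us", "our"} for t in tokens)  # 1st person plural
    return {
        "num_posts": len(posts),
        "mentions": sum(t.startswith("@") for t in tokens),
        "fps_to_fpp_ratio": fps / max(fpp, 1),
    }

posts = ["I feel like nobody understands me", "thanks @friend for checking in"]
print(user_features(posts))
```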

Multimodal Indicators
Researchers have also investigated multimodal indicators for depression detection. Scherer et al. (2013a) investigated visual signals and voice quality in a multimodal system, finding that they were able to distinguish interviewees with depression from those without with an accuracy of 75%.
Morales and Levitan (2016b) provided a comparative investigation of speech versus text-based features for depression detection, finding that combining the modalities led to the best performing system. In addition, Morales and Levitan investigated using an automatic speech recognition (ASR) system to automatically transcribe speech and found that text-based features generated from ASR transcripts were useful for depression detection.
Fraser et al. (2016) extracted a large number of textual and acoustic features. Textual features included POS tags, parse tree constituents, psycholinguistic measures, measures of complexity, vocabulary richness, and informativeness. Acoustic features included fluency measures, MFCCs, voice quality features, and measures of periodicity and symmetry. Using these multimodal features, Fraser et al. were able to detect depression with 65.8% accuracy. Related work on suicide risk assessment found that multimodal indicators were able to discriminate between suicidal and non-suicidal patients (Venek et al., 2016).
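A common way to combine modalities is early (feature-level) fusion: concatenating per-sample acoustic and textual feature vectors before classification. The schematic sketch below uses random stand-in features; it demonstrates only the fusion step, not any of the cited systems.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Schematic early-fusion sketch: concatenate acoustic and textual feature
# vectors per participant, then classify. Random features stand in for real
# MFCC/LIWC-style descriptors; this illustrates the fusion step only.
rng = np.random.default_rng(0)
n = 40
acoustic = rng.normal(size=(n, 26))   # e.g., MFCC means/stds
textual = rng.normal(size=(n, 10))    # e.g., LIWC category rates
labels = rng.integers(0, 2, size=n)   # depressed vs. control

fused = np.hstack([acoustic, textual])  # feature-level (early) fusion
scores = cross_val_score(SVC(), fused, labels, cv=5)
print(scores.mean())  # chance-level here, since the features are random
```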

Evaluation
Depression detection can be divided into three different prediction tasks: presence (depressed vs. not depressed), severity (normal, mild, moderate, severe, and very severe), and score-level prediction. Each task comes with a set of evaluation metrics. For the first two tasks, performance is usually reported in terms of classification accuracy (Acc.). Given that accuracy is heavily affected by skew in datasets, sensitivity (Sens.), specificity (Spec.), precision (Prec.), and F1-score (the harmonic mean of precision and recall) are often also reported. For score-level prediction, performance is usually reported as a measure of the difference between predicted values and the values actually observed, such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). In Table 2 we report, to our knowledge, the best performing depression detection systems from 2016.
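For concreteness, the sketch below computes the standard metrics named above for a hypothetical system's outputs; all predictions and gold labels shown are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error)

# Illustrative metric computation for both task framings; the predictions
# below are made up for demonstration.
y_true_cls = [1, 0, 1, 1, 0, 0]          # presence: depressed vs. not
y_pred_cls = [1, 0, 0, 1, 0, 1]
print("Acc.", accuracy_score(y_true_cls, y_pred_cls))
print("Prec.", precision_score(y_true_cls, y_pred_cls))
print("Sens. (recall)", recall_score(y_true_cls, y_pred_cls))
print("F1", f1_score(y_true_cls, y_pred_cls))

y_true_score = [5, 14, 22, 30]           # score-level prediction (e.g., BDI-II)
y_pred_score = [8, 12, 25, 26]
print("MAE", mean_absolute_error(y_true_score, y_pred_score))
print("RMSE", np.sqrt(mean_squared_error(y_true_score, y_pred_score)))
```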
As Table 2 highlights, it is very difficult to make systematic comparisons across studies. Data, task, labels, and experimental set-up tend to vary across studies, so it is hard to determine which approach is most promising. With regard to features, however, combining features from multiple modalities tends to lead to improvements (Morales and Levitan, 2016a; Scherer et al., 2013a; Fraser et al., 2016; Williamson et al., 2016; Valstar et al., 2016). In many cases, researchers may only have access to certain labels. However, when data sources do contain score labels, reporting both regression error and classification performance metrics will help facilitate comparisons across systems. Given that each feature or subset of features is meant to measure specific depression indicators or symptoms, it is also extremely important to understand how well each feature is performing. Therefore, it is best to always include correlation experiments, such as Pearson correlation tests, in order to make transparent which features are important.
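Per-feature correlation tests of the kind recommended here are a one-liner with SciPy; for example, with made-up values:

```python
from scipy.stats import pearsonr

# Illustrative per-feature correlation test: how strongly does a single
# feature (e.g., percent pause time) track depression scores? All values
# are made up for demonstration.
percent_pause = [0.12, 0.30, 0.25, 0.41, 0.18, 0.35]
depression_scores = [4, 18, 15, 27, 9, 22]

r, p = pearsonr(percent_pause, depression_scores)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```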

Confounding Factors
Specific variability factors have been shown to be strong confounds for depression detection systems (Cummins et al., 2015a, 2014, 2013, 2011; Sturim et al., 2011). Variability factors include traits like the gender, age, emotion, or personality of the speaker. Therefore, it is important to keep these factors in mind when building a detection system. For example, in many studies systems have achieved better results using sex-dependent classifiers (Moore et al., 2008; Low et al., 2011; Yang et al., 2016). Others (Morales and Levitan, 2016a) have used unsupervised clustering prior to depression detection, finding that this approach could tease out participant differences and in turn lead to performance improvements (see the sketch below). However, these approaches to dealing with variability factors usually mean a reduction in training data, which at times can be a substantial trade-off.

Another factor to consider is comorbidity, the simultaneous presence of two chronic diseases or conditions. For example, Alzheimer's disease (AD) and depression frequently co-occur. Fraser et al. (2016) found that their depression detection system performed considerably worse on patients with comorbid depression and AD than on patients with only depression. Comorbidity can therefore make the task more difficult, given the wide overlap of symptoms between the two conditions.

Factors such as gender, age, and comorbidity can have substantial effects on system performance. To better understand performance across studies and the effect of variability factors, more transparency is necessary with regard to dataset details and descriptions. In addition, researchers should begin to consider more diverse populations in their studies. Thus far, most research and data collection efforts have focused on detecting depression in young and otherwise healthy participants. In order for detection systems to generalize, datasets representing other populations need to be considered.
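One way to operationalize the clustering-first idea is to partition participants with k-means and train a separate classifier per cluster. The schematic sketch below is loosely in the spirit of Morales and Levitan (2016a), not their exact pipeline, and uses random stand-in features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Schematic sketch of clustering participants before depression detection,
# loosely in the spirit of Morales and Levitan (2016a); not their exact
# pipeline. Random features stand in for real descriptors.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = rng.integers(0, 2, size=60)

clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
models = {}
for c in np.unique(clusters):
    mask = clusters == c
    # Each per-cluster model sees less training data (the trade-off
    # discussed above).
    models[c] = LogisticRegression().fit(X[mask], y[mask])

print({c: m.score(X[clusters == c], y[clusters == c])
       for c, m in models.items()})
```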

Discussion
As with any technology or tool, there is always a risk of misuse, and it is therefore important to discuss the general ethical considerations of pursuing this line of research. It is especially important to define and outline appropriate use of these systems. Mental health professionals should view language technology for depression detection as a mechanism to complement current diagnoses by giving them access to a novel, rich, and non-intrusive data source. It is understandable that mental health professionals, as well as the general population, may be uncomfortable with the possibility that technologies might be able to predict psychological states, especially when relatively accurate predictions can be made. To be clear, these systems are not proposed as standalone diagnostic tools that could replace current approaches to diagnosing mental health issues, but instead as part of a broader awareness, detection, and support system. These technologies provide numerous advantages, including large-scale and remote assessment, which in turn could help a broader population. These methods could also provide a lower-cost complement to traditional depression assessments. In addition, these tools could help health professionals manage current patients more efficiently, allowing clinicians to monitor their patients continuously. Determining how machines should augment and assist in diagnosis is a complicated issue. However, there exists evidence that mechanical prediction (statistical, algorithmic, etc.) is typically as accurate as or more accurate than clinical prediction (Grove et al., 2000). Moreover, mechanical predictions do not require expert judgment and are completely reproducible. Despite the general ethical considerations, it is important to highlight the potential of mental health assessment tools to enhance quality of life for society.

Conclusion
In this paper, we present a review of the latest work on depression detection systems. We provide a cross-modal review of indicators for depression detection systems, covering visual, acoustic, linguistic, and social features. We also outline approaches to defining and annotating depression, existing data sources, and how to evaluate depression detection systems. This paper serves as a bridge between the subfields by providing the first review across subfields and modalities. Given that depression detection is inherently a multimodal problem, this paper contributes a resource for understanding multimodal features as well as the factors to consider when designing a depression detection system. Lastly, in order for the research community to progress together, researchers should begin to follow best practices (Stodden and Miguez, 2013). Best practices lead to communication standards, which will help disseminate reproducible research, facilitate innovation by enabling data and code re-use, and enable broader communication of the output of computational research. Without the data and code that underlie scientific discoveries, it is all but impossible to verify published findings. We urge researchers to focus on reproducible research, through the dissemination, availability, and accessibility of data and code.