Identifying Depressive Symptoms from Tweets: Figurative Language Enabled Multitask Learning Framework

Existing studies on using social media for deriving mental health status of users focus on the depression detection task. However, for case management and referral to psychiatrists, health-care workers require a practical and scalable depressive disorder screening and triage system. This study aims to design and evaluate a decision support system (DSS) to reliably determine the depressive triage level by capturing fine-grained depressive symptoms expressed in user tweets through the emulation of the Patient Health Questionnaire-9 (PHQ-9) that is routinely used in clinical practice. The reliable detection of depressive symptoms from tweets is challenging because the 280-character limit on tweets incentivizes the use of creative artifacts in the utterances, and figurative usage contributes to effective expression. We propose a novel BERT-based robust multi-task learning framework to accurately identify the depressive symptoms using the auxiliary task of figurative usage detection. Specifically, our proposed novel task sharing mechanism, co-task aware attention, enables automatic selection of optimal information across the BERT layers and tasks by soft-sharing of parameters. Our results show that modeling figurative usage can demonstrably improve the model's robustness and reliability for distinguishing the depression symptoms.


Introduction
A recent survey conducted by the WHO shows that a total of 322 million people in the world are living with depression. At its most severe, depression can lead to suicide and is responsible for 850,000 deaths every year (WHO and others, 2017). Early detection and appropriate treatment can encourage remission and prevent relapse (Halfin, 2007). However, the stigma associated with depression makes patients reluctant to seek support or provide truthful answers to physicians (Haselton et al., 2015). Additionally, clinical diagnosis depends on the patient's self-reports of their behavior, which require them to reflect on and recall events from the past that may have become obscured over time. In contrast, social media offers a unique platform for people to share their experiences in the moment, express emotions and stress in their raw intensity, and seek social and emotional support for resilience. As such, depression studies based on social media offer unique advantages over scheduled surveys or interviews (Coppersmith et al., 2014; Manikonda and De Choudhury, 2017; De Choudhury et al., 2016). Social media self-narratives contain large amounts of implicit and reliable information expressed in real time, which is essential for practitioners to glean and understand a user's behavior outside of the controlled clinical environment. The majority of these existing studies have formulated the social media depression detection task as a binary classification problem (i.e., depressive/non-depressive) and are therefore limited to only identifying depressive users.

Symptom | Sample Tweet
S1: Lack of Interest | I don't think I care about anything at all lol it's f*** w my brain , boutta go nuts
S2: Feeling Down | im alone at home with no money and depressed as f***'
S3: Sleeping Disorder | This is not a good night at all . Rough .
S4: Lack of Energy | i have not moved all day . still in bed .
S5: Eating Disorder | its good not to eat..
S6: Low Self-Esteem | i am so ugly but will never stop posting pics 4 validation lol
S7: Concentration Problem | my mind is screaming so many things
S8: Hyper/Lower Activity | wish i didn't sit around every day wishing all my days away .why .
S9: Self-Harm | Cut all my elbow up but can't feel it
Table 1: Sample tweets (rephrased) and their associated PHQ-9 symptoms.
To help healthcare professionals (HPs) intervene in a timely manner, for example through automatic triaging, it is necessary to develop an intelligent decision support system that provides HPs with fine-grained depression-related symptoms. The triage process is a critical step in giving care to patients because, by prioritizing patients at different triage levels based on the severity of their clinical condition, one can enhance the utilization of healthcare facilities and the efficacy of healthcare interventions. There have been a few efforts to create datasets for capturing depression severity; however, they are limited to (1) only clinical interviews (Valstar et al., 2013; Ringeval et al., 2019) and questionnaires (De Choudhury et al., 2013), and (2) individuals who voluntarily participate in the study.
In this work, we exploit Twitter data to identify the indications (specifically, PHQ-9 guided symptoms) of depression. We developed a high-quality dataset consisting of a total of 12,000 tweets, with 3738 tweets posted by 205 self-reported depressed users over a two-week period, which were manually annotated using symptom categories based on the PHQ-9 questionnaire (Kroenke and Spitzer, 2002). In Table 1, we provide sample tweets associated with the nine PHQ-9 depression symptoms. Our research hypothesis is that depressed individuals discuss their symptoms on Twitter in a way that can be tracked reliably. Nonetheless, users' social media posts pose unique challenges, as discussed below:
• Usage of figurative language: Depressive users often tend to use figurative language ('FL') elements, such as sarcasm and metaphor, to describe their symptoms. For example, one user wrote metaphorically, "My skin is paper, razor is the pen", while another user wrote "I want to cut myself". While both of these utterances refer to one specific medical concept, "Self-Harm", the first sentence uses paper and pen metaphorically to convey self-harm. Furthermore, previous studies utilizing social media data reported prediction errors when drug or symptom names were used in a figurative sense (Iyer et al., 2019).
• Usage of implicit sense: The creative expressions used by depressive users also carry an implicit sense that is not evident from a literal reading. For example, one user expresses their desperation as "What if life comes after death, grab my knife and find out myself.", implicitly referring to "Self-Harm", while another user gives a compliment with "You have a killer look.", or captures anger through "If looks could kill, I would be dead by now.".
• Usage of highly polysemous words: The vocabulary of social media language contains polysemous words that require an understanding of the context to determine the semantic labels. For example, "woke up and nose started bleeding" and "I wish I had the nerve to press the blade deeper into my skin so I don't stop bleeding this time." use "bleeding" in different contexts and senses.
Other challenges include recognizing misspelled words, slang, acronyms and unconventional contractions.
To account for these creative linguistic devices widely observed in the utterances of depressive users, we propose a Figurative Language enabled Multi-Task Learning framework (FiLaMTL) that works on the concept of a task sharing mechanism (Ruder, 2017; Yadav et al., 2018b; Yadav et al., 2019). In this work, we improve the performance and robustness of FiLaMTL for the primary task of 'symptom identification' combined with the supervisory task of 'figurative usage detection' in a multi-task learning setting. We introduce a mechanism named 'co-task aware attention', which enables layer-specific soft sharing of the parameters for the tasks of interest. The proposed attention mechanism is parameterized with the
1. Linguistic Marker: Language often reflects how people think and is a well-known tool used by psychiatrists to assess people's mental health condition (Fine, 2006). Numerous studies (Coppersmith et al., 2014; De Choudhury et al., 2013; Yadav et al., 2020b) have shown that modeling word use and social language, combined with network analysis, is effective in recognizing depression. A widely adopted resource for understanding the linguistic patterns in mental health is the well-known Linguistic Inquiry Word Count (LIWC) (Pennebaker et al., 2007). Other researchers exploited sentiment analysis (Xue et al., 2014; Huang et al., 2014; Yadav et al., 2018a), topic modeling (Resnik et al., 2015) and emotion features (Chen et al., 2018; Aragón et al., 2019) to detect depression. Furthermore, substantial progress has been made with the introduction of shared tasks (Coppersmith et al., 2015; Milne et al., 2016). Recently, most studies (Yates et al., 2017; Benton et al., 2017) have shifted from traditional linguistic indicators to automated feature generation using neural network based techniques to predict or assess at-risk depressive users.
2. Visual Marker: Visual information such as head pose, body movement, facial expressions, gestures and eye blinks provides important cues for analyzing depression. Girard et al. (2014) examined whether a relationship exists between non-verbal cues and depression severity using the Facial Action Coding System (FACS) (Ekman and Friesen, 1978). In another prominent study utilizing FACS, Scherer et al. (2013) identified that a more downward gaze angle, dull smiles, shorter average smile lengths, and longer self-touches may predict depression. Several studies (de Melo et al., 2019; Cummins et al., 2013) have also investigated Space-Time Interest Point (STIP) features that capture spatiotemporal changes such as facial motion and movement of the hand, foot, and head.
3. Speech Marker: Recent studies have shown the potential of exploiting speech for depression detection and monitoring (Cummins et al., 2015a; Cummins et al., 2015b). Numerous studies (Mundt et al., 2012; Hönig et al., 2014) have revealed the strength of prosodic markers, specifically the speech rate, in analyzing the level of depression. Moore II et al. (2007) proposed a depression classification system based on a wide range of acoustic features such as prosodic, spectral, voice quality, and glottal features. Other prominent studies (Mundt et al., 2012; Cummins et al., 2011) have explored spectral features like prosodic timing measures, mel-frequency cepstral coefficients (MFCC) and glottal features to accurately classify depressed and control groups.
4. Multi-modal Marker: In recent times, there has been a visible surge in investigating multi-modal indicators to diagnose depression, particularly due to datasets made publicly available through research challenges like the Audio/Visual Emotion Recognition (AVEC) Workshop Depression Subchallenge (2013-2019) (Schuller et al., 2011; Valstar et al., 2013; Ringeval et al., 2019) and the popular Distress Analysis Interview Corpus (DAIC). Several computational models (Tzirakis et al., 2017; Ringeval et al., 2017; Ringeval et al., 2018) based on machine learning and sophisticated deep learning techniques have been proposed to address the challenges posed by AVEC each year. The best system at AVEC 2019 (Ray et al., 2019) proposed an attention-based fusion technique to judiciously select the feature representations obtained from multimodal sources.

Corpus Creation and Analysis: D2S
In this section, we describe how we crawled our dataset, Depression to (2) Symptoms (D2S), using the Twitter streaming application programming interface, filtered out irrelevant profiles, annotated the tweets of depressive users, and had the annotations verified by a psychiatrist to prepare the gold standard dataset.
1. Dataset Crawling: We utilized the lexicon developed by Yazdavar et al. (2017) in collaboration with a psychologist. The lexicon contains around 1000 depression-related terms categorized into the nine symptom categories of PHQ-9. A subset of highly informative depression-indicative terms from the lexicon, which are likely to be used by depressive individuals, was used as seed terms to crawl the public profiles of Twitter users with at least one of those terms in their profile description. Through this process, we collected 5,000 users and their tweets.
2. Filtering and Identifying Depressed Users: As users on social media often use sarcasm and metaphor to implicitly express their feelings, contemporary approaches that do not capture context well miss a sub-population of depressive users. To improve upon these approaches, we proceed as follows. After filtering out the retweets, we removed the profiles with fewer than 100 tweets and obtained 1567 users. To emulate the PHQ-9 using social media, we chunked the tweets of each profile into two-week buckets. To ensure high-quality data and identify potential depressive profiles with severity levels from mild to severe based on PHQ-9 scoring, we filtered the profiles based on their posting frequency. After filtering out the profiles that had not tweeted on at least 5 days in the most recent bucket, we obtained 575 profiles. Although these profiles had depression-related terms in the description, due to the lack of context-sensitivity in the profile identification process, a subset of them were false positives, i.e., non-depressive. A few of these non-depressive profiles were meant to share motivational quotes for depressive users. We closely examined the visual markers (i.e., profile image and shared images) and linguistic markers (i.e., profile name, description and tweets) of each of those profiles and removed the users having no depressive tweets. Finally, we obtained 205 depressive users and selected the bucket of the most recent tweets over two weeks for annotation, which amounts to 3738 tweets.
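The automatic part of this filtering (retweet removal, the 100-tweet minimum, and the 5-active-days rule on the latest two-week bucket) can be sketched as follows. This is a minimal illustration, not the authors' code: the record fields `created_at` and `is_retweet`, the function names, and the choice to anchor the latest bucket at each user's newest tweet are all assumptions made for the example.

```python
from datetime import datetime, timedelta

MIN_TWEETS = 100               # profiles with fewer original tweets are dropped
BUCKET = timedelta(days=14)    # two-week window emulating the PHQ-9 period
MIN_ACTIVE_DAYS = 5            # distinct posting days required in the latest bucket


def latest_bucket(tweets):
    """Return the tweets from the two weeks ending at the newest tweet (assumed anchoring)."""
    newest = max(t["created_at"] for t in tweets)
    return [t for t in tweets if newest - t["created_at"] < BUCKET]


def filter_profiles(tweets_by_user):
    """Apply the retweet, volume, and activity filters; return the latest bucket per kept user."""
    kept = {}
    for user, tweets in tweets_by_user.items():
        originals = [t for t in tweets if not t["is_retweet"]]
        if len(originals) < MIN_TWEETS:
            continue
        bucket = latest_bucket(originals)
        active_days = {t["created_at"].date() for t in bucket}
        if len(active_days) >= MIN_ACTIVE_DAYS:
            kept[user] = bucket
    return kept
```

The manual inspection of profile images, descriptions, and tweets that removes non-depressive profiles is not covered by this sketch.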
3. Anonymization, Annotation and Verification: Prior to annotation, we anonymized the user profiles with random numbers and replaced the mentions and URLs in tweets with the strings '@USER' and 'URL', respectively. Four native English speakers from multiple disciplines were assigned to independently annotate the tweets into the 9 categories of PHQ-9. The annotators were also asked to identify the tweets that use FL such as sarcasm and metaphor. The annotators were provided with the definitions and samples of annotated tweets for each of the 9 categories of PHQ-9 as well as for FL. The average inter-annotator reliability scores for the symptoms, depressive vs. non-depressive, and figurative classes were K=0.83, K=0.87, and K=0.79, respectively, based on Cohen's Kappa statistics. We resolved conflicting annotations with a majority voting strategy, and evenly split votes were resolved by a psychiatrist. After preparing the final gold standard data, we randomly selected 100 annotated tweets from each of the symptom categories, including the non-depressive ones, and had them verified by a psychiatrist.
We performed topic analysis to examine the usage of utterances associated with each symptom. Table 3 illustrates the topic distribution of each symptom. We observe from the table that, to express their feelings, depressive individuals use metaphoric phrases such as 'body is begging' and 'feel like trash'; sarcastic expressions such as 'am eating ?'; and implicit utterances such as 'up all night' and 'feel myself falling'.
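The placeholder substitution and the vote resolution described in the anonymization and annotation step can be illustrated with a short sketch. The regular expressions and function names are assumptions for this example, and each annotator's decision is treated as a single label for simplicity, whereas the actual annotation allows several symptoms per tweet.

```python
import re
from collections import Counter

MENTION = re.compile(r"@\w+")        # assumed pattern for user mentions
URL = re.compile(r"https?://\S+")    # assumed pattern for links


def anonymize(tweet):
    """Replace links and user mentions with the placeholder strings."""
    tweet = URL.sub("URL", tweet)    # links first, so an '@' inside a link is not touched
    return MENTION.sub("@USER", tweet)


def resolve_label(votes):
    """Return the majority label, or None when the top labels are tied.

    A None result marks the tweet for adjudication by the psychiatrist.
    """
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

With four annotators a 2-2 split is possible, which is why the tie case is returned separately instead of being broken arbitrarily.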

Ethical Concerns:
Psychiatric research using social media data poses several ethical concerns regarding user privacy, which must be taken into consideration (Hovy and Spruit, 2016; Valdez and Keim-Malpass, 2019). Following the ethical practices adopted by previous research on Twitter data (Coppersmith et al., 2015), we constructed our dataset using only public Twitter profiles. We anonymized the profiles before presenting them to the annotators, who pledged not to attempt to contact or deanonymize any of the users or to share the data with others. The dataset will be shared with researchers who agree to follow similar ethical guidelines.

Methods
Our proposed approach to identifying depressive symptoms builds on Bidirectional Encoder Representations from Transformers (BERT) and multi-task learning (Yadav et al., 2020a) with soft-parameter sharing. This section describes the proposed methodology for identifying depression symptoms from user tweets.

Problem Definition
Given an input tweet sequence $T$ consisting of $n$ words, i.e., $T = \{w_1, w_2, \ldots, w_n\}$, our multi-label classification task is to learn the function $fun(\cdot)$ that predicts the set of probable classes $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_k$ from the set of class labels $Y$. Mathematically, $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_k = fun(T, \theta_1, \theta_2, \ldots, \theta_P)$, where $\theta_i$ $(i = 1, \ldots, P)$ are the model parameters. The function $fun(\cdot)$ returns the probability of each symptom class assigned to the tweet. We choose the set of most probable classes based on a threshold value, which is a hyper-parameter. In our proposed multi-task learning framework, the primary task is symptom identification with nine labels from PHQ-9. We consider figurative usage detection as the auxiliary task, with three class labels: 'metaphor', 'sarcasm', and 'others'.
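The threshold-based selection of classes can be illustrated with a short sketch. The label names follow Table 1; the default threshold of 0.5 and the function name are assumptions for illustration, since the text only states that the threshold is a tuned hyper-parameter.

```python
SYMPTOMS = [
    "Lack of Interest", "Feeling Down", "Sleeping Disorder",
    "Lack of Energy", "Eating Disorder", "Low Self-Esteem",
    "Concentration Problem", "Hyper/Lower Activity", "Self-Harm",
]


def predict_labels(probabilities, threshold=0.5):
    """Keep every symptom whose probability reaches the threshold (multi-label)."""
    return [name for name, p in zip(SYMPTOMS, probabilities) if p >= threshold]
```

Because each class is compared with the threshold independently, a tweet can receive zero, one, or several symptom labels.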
cannot be reflective of the entire population, and (iii) it cannot be verified if self-reported depressed users are being truthful

Background
BERT is a powerful language representation model that has the ability to make predictions that go beyond natural sentence boundaries (Lin et al., 2019). Unlike CNN/LSTM models, language models benefit from the abundant knowledge gained from pre-training with self-supervision and have strong feature extraction capability. BERT uses a word-piece tokenizer (Wu et al., 2016) to tokenize the input sentence. When the model uses word-piece tokens and randomly masks a portion of a word for prediction in the masked language model (MLM) task, it attempts to recover a piece of the word rather than the whole word. To mitigate this issue, recently, Devlin et al. (2019)
where $s_i^j$ is the $i$-th token representation obtained from the $j$-th transformer encoder, $h_i \in \mathbb{R}^d$, and $d$ is the dimension of the [CLS] token hidden state representation obtained from BERT.
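To make the word-piece issue concrete, the following toy sketch splits words with a greedy longest-match-first rule and then groups the pieces that belong to the same word, which is the grouping a whole-word masking scheme would mask together. The vocabulary, function names, and the `##` continuation prefix convention shown here are illustrative assumptions, not the tokenizer used in the experiments.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]                   # no piece matches: unknown word
        pieces.append(piece)
        start = end
    return pieces


def whole_word_groups(pieces):
    """Group piece indices so that all pieces of one word can be masked together."""
    groups = []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups
```

With the toy vocabulary `{"sleep", "##less", "night", "##s"}`, the word "sleepless" splits into two pieces; masking only one of them would ask the model to recover a fragment, while masking the whole group asks it to recover the word.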

Figurative Language enabled Multi-task Learning (FiLaMTL) Framework
We explore the utility of learning two tasks together in the FiLaMTL framework. For the depression symptom identification (SI) task, FiLaMTL helps achieve inductive transfer from the figurative usage detection (FUD) task by leveraging additional sources of information to improve performance on the primary task. We focus on designing soft-parameter sharing rather than hard-parameter sharing, as it offers a way to effectively share the required parameters between the tasks (Misra et al., 2016). We achieve this with a co-task aware attention module that finds the best shared representation for multi-task learning. Specifically, the proposed network models shared representations using linear combinations and learns the optimal combinations for the primary and the auxiliary tasks. Let us denote the hidden states (from eq. 2) for the primary task (SI) as $H_s \in \mathbb{R}^{L \times d}$ and for the auxiliary task (FUD) as $H_f \in \mathbb{R}^{L \times d}$. For a given layer $l \in \{1, \ldots, L\}$, we compute the effective shared representation as follows:
where $h_s^l \in \mathbb{R}^d$ and $h_f^l \in \mathbb{R}^d$ are the hidden state representations obtained from the $l$-th BERT layer for SI and FUD, respectively, and $r_s^l \in \mathbb{R}^d$ and $r_f^l \in \mathbb{R}^d$ are the modified hidden state representations of the $l$-th BERT layer after applying the effective soft-sharing of features across the two tasks. We will discuss the scaling factors $\alpha$ and $\beta$ shortly.

Soft-parameter Sharing between Tasks
In multi-task learning, the inductive bias of the auxiliary task helps to improve the performance of the primary task. However, parameter sharing between the tasks is non-trivial, as we need an optimal strategy to improve the performance of the primary task. Towards this end, we devise a strategy to automatically learn the factor by which a feature from a particular task needs to be accommodated when learning the optimal set of shared features for a given task. This co-task aware sharing of the features leads to the optimal linear combination of feature spaces across the tasks. Given the two tasks, "symptom identification" and "figurative usage detection", we learn how much each task's features contribute to the shared feature space, which leads to an overall improvement on the tasks. We achieve this by introducing a "co-task factor matrix" $\alpha \in \mathbb{R}^{T \times T}$, where $T$ is the number of tasks at hand. In our case $T = 2$, as we are dealing with two tasks. An element $\alpha_{(x,y)}$ of the matrix $\alpha$ denotes "the factor by which the $y$-th task feature obtained from a particular layer of BERT should contribute to the shared-feature representation for the $x$-th task". Moreover, this matrix is learned by end-to-end training of the proposed multi-task learning framework.
Figure 1: Representation of the co-task factor matrix for three tasks (entries $\alpha_{1,1}$ through $\alpha_{3,3}$).

Soft-parameter Sharing between Layers
Given input tokens and a task $t$, BERT produces the set of layer activations $h_t^1, h_t^2, \ldots, h_t^L$ in the form of hidden state representations. Each layer of BERT attempts to address a certain problem (Tenney et al., 2019) and feeds the information to the layer above. Tenney et al. (2019) discovered that the information learned at a few layers of BERT is sufficient to reliably model and address the lower-level tasks of an NLP pipeline, such as part-of-speech tagging, but that many layers are needed to model higher-level tasks such as relation extraction or co-reference resolution. The hidden state representation $h_{t_1}^l$, obtained from the $l$-th layer of BERT for a given task $t_1$, may not be as useful for another task $t_2$. Its usefulness depends on the complexity of the task. Inspired by this, we introduce soft-parameter sharing among the BERT layers. Similar to the soft-parameter sharing between tasks, we propose a mechanism to automatically learn the factors by which a feature from a particular BERT layer needs to be accommodated when learning the optimal set of shared features for a given task. We achieve this by introducing a "layer factor matrix" $\beta \in \mathbb{R}^{L \times T \times T}$, where $L$ and $T$ denote the number of BERT layers and the number of tasks, respectively. An element $\beta_{(x,y,z)}$ of the matrix $\beta$ denotes "the factor by which the $z$-th task feature obtained from the $x$-th layer of BERT should contribute to the shared-feature representation for the $y$-th task". Similar to the co-task factor matrix $\alpha$, the layer factor matrix $\beta$ is also a network parameter and can be learned by end-to-end training of the proposed multi-task learning framework. We show hypothetical matrices for $\alpha$ and $\beta$ in Figures 1 and 2, respectively.
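One way to read the two factor matrices together is sketched below. The exact combination rule is not preserved in this text, so the sketch assumes that the co-task factor and the layer factor multiply and that the shared vector for task x at layer l is the resulting weighted sum over the source tasks y. The function name and the fixed numeric factors are illustrative; in the framework both matrices are learned end to end.

```python
def shared_representation(hidden, alpha, beta):
    """Mix per-task hidden states layer by layer (assumed combination rule).

    hidden[l][y]  : hidden-state vector of task y at layer l (list of floats)
    alpha[x][y]   : co-task factor, contribution of task y to task x
    beta[l][x][y] : layer factor for layer l, target task x, source task y
    Returns shared[l][x], the mixed vector for task x at layer l.
    """
    num_tasks = len(alpha)
    shared = []
    for l, layer in enumerate(hidden):
        dim = len(layer[0])
        row = []
        for x in range(num_tasks):
            mixed = [0.0] * dim
            for y in range(num_tasks):
                weight = alpha[x][y] * beta[l][x][y]   # assumption: factors multiply
                for k in range(dim):
                    mixed[k] += weight * layer[y][k]
            row.append(mixed)
        shared.append(row)
    return shared
```

Setting a row of `alpha` to a one-hot vector recovers a task that ignores the other task's features, which shows how soft sharing includes the no-sharing case as a special setting.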

Symptom Identification and Figurative Usage Detection
For each task, we obtain the effective shared-feature representation from each BERT layer. The final feature is obtained by average pooling of the individual features as follows:
where $z_s \in \mathbb{R}^d$ and $z_f \in \mathbb{R}^d$ correspond to the final pooled features for the symptom identification and figurative usage detection tasks, respectively. $W_s$, $W_f$, $b_s$ and $b_f$ are the weight and bias matrices, and $f$ is a non-linear activation. For symptom identification, we employ a feed-forward network with a sigmoid activation function to find the probability of each class label for a given tweet,
where $l_s$ and $l_f$ are the logits for the symptom identification and figurative usage detection tasks, respectively. $W_{sl}$, $W_{fl}$, $b_{sl}$ and $b_{fl}$ are the weight and bias matrices.
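The pooling and classification steps can be sketched for a single task as follows. The equations themselves are not preserved in this text, so the order of operations (average pooling over layers, then a projection with the non-linearity, then an output layer with sigmoid) is an assumption, as are the function and parameter names.

```python
import math


def average_pool(layer_vectors):
    """Element-wise mean of the per-layer shared representations."""
    count = len(layer_vectors)
    return [sum(column) / count for column in zip(*layer_vectors)]


def dense(vector, weights, bias):
    """Affine map; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def symptom_probabilities(layer_vectors, w_proj, b_proj, w_out, b_out):
    """Pool over layers, project with a sigmoid non-linearity, then score each label."""
    pooled = average_pool(layer_vectors)
    projected = sigmoid(dense(pooled, w_proj, b_proj))   # assumed placement of f
    logits = dense(projected, w_out, b_out)
    return sigmoid(logits)                               # one probability per label
```

Applying sigmoid per label, instead of a softmax over labels, is what allows several symptoms to be predicted for the same tweet.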

Experiments
We first provide details about the baseline models and then present our results on the SI and FUD tasks. Later, we assess the performance of our approach on the depression detection task.

Implementation Details
We shuffle the D2S dataset and split it into 70% training (TRAIN), 10% development (DEV), and 20% test (TEST) sets. For both the symptom identification (SI) and figurative usage detection (FUD) models, we chose the hyper-parameters using the development set. In all of our experiments, we fine-tuned the BERT-wwm model for 10 epochs with a batch size of 32. We fine-tuned and extracted the features from the top three layers of the BERT model. In the proposed FiLaMTL framework, the overall loss of the network is the weighted combination of the losses computed for the two tasks. The network is trained with the binary cross-entropy loss function for both tasks. We set the weight to 0.7 for the symptom identification task and 0.3 for the figurative usage detection task. We use the sigmoid activation function as the non-linear activation to project the BERT hidden state representation to another representation of dimension 256. We used the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate of 0.0001. For regularization, we used dropout (Srivastava et al., 2014) with a value of 0.5 on each of the hidden layers. We then ran each best model on TEST and report recall, precision, and F1-Score.
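The weighted training objective can be written out as a small sketch. It assumes the overall loss is the weighted sum of the two per-task binary cross-entropy losses with the stated weights of 0.7 and 0.3; the function names and the clamping constant are illustrative choices.

```python
import math


def binary_cross_entropy(probabilities, targets, eps=1e-7):
    """Mean binary cross-entropy over the label positions of one example."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)   # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


def joint_loss(si_probs, si_targets, fud_probs, fud_targets, w_si=0.7, w_fud=0.3):
    """Weighted combination of the SI and FUD losses (weights from the text)."""
    return (w_si * binary_cross_entropy(si_probs, si_targets)
            + w_fud * binary_cross_entropy(fud_probs, fud_targets))
```

Giving the primary task the larger weight keeps the auxiliary signal from dominating the gradient while still letting it shape the shared features.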

Baseline Models and Results
We compare FiLaMTL with highly competitive baseline models and evaluate the models on Precision, Recall, and weighted F1-Score. Since BERT has already demonstrated remarkable performance on multiple NLP tasks over SOTA deep learning (DL) methods, we restricted our baselines to BERT-based models rather than other DL techniques, as discussed below:
(1) STL-BERT: This is a domain-adapted BERT-based model for the SI and FUD tasks, in which the BERT model is fine-tuned on the corresponding dataset.
(2) MTL-H-BERT, Dense: A variation of the multi-task BERT model in which a single BERT model generates the features for both tasks. The features are transformed into another representation by a task-specific dense layer.
(3) MTL-S-BERT, Cross-Stitch: This model is a re-implementation of Misra et al. (2016), where the model learns an optimal combination of shared and task-specific representations using soft-parameter sharing via cross-stitch units.
(4) MTL-S-BERT, Co-Attention: This model was inspired by the framework of Lu et al. (2016). First, we compute the word-level attention weights, as discussed in Lu et al. (2016), for the hidden state representations of both tasks. These weights are multiplied with the corresponding hidden state representations to compute the attentive features. Similar to MTL-S-BERT, Cross-Stitch, we employ cross-stitch units to obtain the final hidden state representation for both tasks.
Table 4 provides a comparative summary of the results of our proposed approach and the baseline models, demonstrating that our 'co-task-aware' multi-task FiLaMTL model outperforms the SOTA single-task learning model and the variations of BERT-inspired multi-task learning models. We train the MTL models in two different ways: with hard-parameter sharing and with soft-parameter sharing. As Table 4 shows, the multi-task learning framework based on soft-parameter sharing (MTL-S-BERT, Cross-Stitch) improves performance on the main task as well as on the FUD task over the single-task model. However, the multi-task hard-parameter sharing model (MTL-H-BERT, Dense) was found to be useful only for the FUD task and not for the SI task. This may be due to noise in the dataset across the tasks, which prevents the model from learning the efficient task-specific representation required to correctly identify the symptom from the input tweet. We also observe that the soft-parameter sharing based baseline model (MTL-S-BERT, Co-Attention) could not produce the desired results, because the additional attention mechanism on top of the strong self-attention mechanism overfits the model, which leads to degraded performance. The existing parameter-sharing strategies thus show inconsistent performance in the multi-task learning framework. We explored a variation of soft-parameter sharing to further understand the relevance of co-task aware attention in the multi-task learning setting.
Our FiLaMTL outperforms both the hard-parameter and soft-parameter sharing based baseline models on both tasks (Table 4). This also demonstrates that providing information about FL to the BERT model significantly improves the performance of the model, thus enabling generalization to other text classification tasks where there is extensive usage of FL.
Evaluating the FiLaMTL on Depression Detection Task: To further verify the effectiveness of our proposed FiLaMTL model, we utilized a transfer learning procedure, where an intermediate shared model obtained on the SI and FUD tasks is fine-tuned on the depression detection (DD) task. To that end, we experimented with (a) STL-BERT and (b) FiLaMTL fine-tuned for the DD task, on the D2S corpus and the benchmark CLPsych dataset (Coppersmith et al., 2015). The data statistics can be viewed in Table 6. In FiLaMTL (fine-tuned), we initialize the parameters of the BERT model with the weights obtained from the FiLaMTL model (BERT model of the FUD task) reported in Table 4. Our motivation is that first fine-tuning on the FUD and SI tasks can help the language models adapt to the depression domain with some understanding of figurative usage detection, thus making the training on DD more stable.
Analysis: To get a deeper insight into how FiLaMTL performed compared with the baseline models, we examined the classification of tweets on the SI task and made the following observations: 1. Understanding figurative sense:

Table 9: Qualitative analysis of FiLaMTL in identifying figurative language. Surviving column headers: MTL-S-BERT, Co-attention; FiLaMTL. Surviving example (Understanding figurative sense): T1: holy sh**. i look like death
Error Analysis: The following are the major errors made by our approach on the SI task:
1. Ambiguity: PHQ-9 classes related to sleeping disorder, eating disorder, and self-harm are easy to distinguish. However, classes such as PHQ-1 (feeling down) and PHQ-6 (low self-esteem) are difficult to separate because of overlapping expressions, often leading to misclassification. As observed in Table 8, both of these classes are semantically similar, which also makes manual labelling challenging.
2. Cryptic tweets: Our model is unable to handle tweets that are only a few words long. The lack of context required for robust identification of symptoms can only be remedied by consulting past user interactions and communications.
3. Multiple Symptoms: FiLaMTL is unable to predict all the PHQ-9 classes indicated by a tweet, leading to incomplete predictions, as shown in Table 8, tweets T5 and T6.

Conclusion
In this research, we explored a new dimension of social media, using Twitter to identify depressive symptoms. Towards this end, we created a new benchmark dataset (D2S) for identifying fine-grained, PHQ-9 emulated depressive symptoms that contains figurative language. We also introduced a robust BERT-based MTL framework that jointly learns to automatically discover the complementary features required to identify the symptoms with the help of the auxiliary task of figurative usage detection. Our experimental results show the effectiveness of introducing figurative usage detection for depressive symptom identification. In the future, we aim to enhance the dataset with other modalities, such as images and memes, to assist the model in better understanding the figurative sense in symptom identification.