A Time-Aware Transformer Based Model for Suicide Ideation Detection on Social Media

Social media’s ubiquity fosters a space for users to exhibit suicidal thoughts outside of traditional clinical settings. Understanding the build-up of such ideation is critical for the identification of at-risk users and suicide prevention. Suicide ideation is often linked to a history of mental depression. The emotional spectrum of a user’s historical activity on social media can be indicative of their mental state over time. In this work, we focus on identifying suicidal intent in English tweets by augmenting linguistic models with historical context. We propose STATENet, a time-aware transformer based model for preliminary screening of suicidal risk on social media. STATENet outperforms competitive methods, demonstrating the utility of emotional and temporal contextual cues for suicide risk assessment. We discuss the empirical, qualitative, practical, and ethical aspects of STATENet for suicide ideation detection.


Introduction
Globally, close to 800,000 people die by suicide each year, and 20 times more people attempt suicide. Suicide is the second leading cause of death in the 15 to 29 year age group (WHO, 2014), and the suicide rate in the US has risen by 35% since 1999 (Hedegaard et al., 2020). Extending clinical and psychological care to people showing suicidal ideation relies heavily on identifying those at risk. Tragically, 80% of patients do not undergo psychiatric treatment, and about 60% of those who died of suicide denied having suicidal thoughts to mental health practitioners (McHugh et al., 2019). Recent studies (Coppersmith et al., 2018) also show that people exhibiting suicidal ideation make frequent use of social media, e.g., Twitter, to share their mental state, with eight out of ten disclosing their suicidal thoughts and plans (Golden et al., 2009).
Figure 1: We study a user whose latest tweet is not indicative of suicidal intent. Without seeing the user's recent historic tweet, which shows self-harm tendencies, it is difficult to accurately assess suicidal risk. However, analyzing a user's tweeting history sequentially without factoring in time irregularities between tweets may lead to an inaccurate representation of a user's mental state. Time-aware modeling of the temporal dependency between historic tweets reduces the impact of tweets from 3 years ago, providing a more realistic risk assessment. All examples in this paper have been paraphrased for user privacy (Chancellor et al., 2019).
1 https://github.com/midas-research/STATENet_Time_Aware_Suicide_Assessment
While recent advances in computational social science (Coppersmith et al., 2018; Ji et al., 2019) have made progress in assessing suicidal risk on social media, analyzing the linguistic traits of tweets is often not sufficient for accurate suicidal intent detection. Additional user-level context such as tweeting history can be instrumental in identifying a build-up of negative emotions that are often linked to suicide ideation (Oliffe et al., 2012; Robins et al., 1959). Such a build-up can occur weeks, months, or even years before the onset of suicidal ideation (Overholser, 2003), and suicidal activity can also be influenced by past ideation or suicide attempts (Van Heeringen and Marušic, 2003). Analyzing the user history and emotion spectrum, as shown in Figure 1, can provide crucial context to estimate suicidal risk in a tweet authored by that user. Such an Emotional Historic Context (EHC) of a user over time can be characteristic of their mental health (Coppersmith et al., 2014).
Modeling temporal user context, either as a bag-of-tweets (Gaur et al., 2019) or sequentially (Cao et al., 2019; Matero et al., 2019), helps in identifying suicidal intent. However, in Figure 1, we show that the impact of varying time intervals between tweets is crucial for an accurate assessment. It is critical to model the large gap between the user's recent tweets that are collectively indicative of suicidal intent and those three years apart. Such uneven Temporal Tweeting Irregularities (TTI), ranging from seconds to years (Wojcik and Hughes, 2019) between successive tweets, influence the assessment of a user's tweet differently. Sequential models such as Long Short-Term Memory (LSTM) networks assume that posting intervals are uniform, which hinders their ability to learn a user's emotion spectrum over varying time intervals.
Contributions: Taking into account a user's emotional historic context and temporal tweeting irregularities, we propose STATENet: Suicidality assessment Time-Aware TEmporal Network, a neural framework that evaluates the presence of suicidal intent on social media (Sec. 3.1). Building on transfer learning's success in Natural Language Processing, STATENet uses a dual transformer-based architecture to learn the linguistic and emotional cues in tweets. STATENet jointly learns from the language of the tweet to be assessed (Sec. 3.2) and the historic Plutchik-based (Plutchik, 1980) emotional spectrum of a user in a time-sensitive manner (Sec. 3.3). Through a series of experiments (Sec. 4) on real-world data (Sec. 4.1), we show that STATENet significantly outperforms competitive methods (Sec. 5), achieving an F1 score of 80%. We demonstrate practical applicability through a qualitative analysis (Sec. 5.4), and discuss the ethical implications of this study (Sec. 6).
At a minimum, we establish the validity of time-aware emotional temporal context for identifying suicide ideation on social media. We focus on the intersection of NLP and suicidal risk assessment by taking a step towards improving risk assessment in a non-intrusive manner. Our work could be considered a preliminary screening tool that, optimistically, forms a component in a larger infrastructure involving psychologists, health care providers, and social media enterprises. In practice, STATENet would flag tweets as "at-risk" for suicidality as part of a human-in-the-loop system to support decisions about potential intervention.

Related Work
Traditional Methods: Researchers have developed various psychoclinical methods to measure suicidal risk (Pestian et al., 2016), such as the Suicide Probability Scale (Bagge and Osman, 1998), Depression Anxiety Stress Scales-21 (Crawford and Henry, 2003), Adult Suicide Ideation Questionnaire (wa Fu et al., 2007), Suicidal Affect-Behavior-Cognition Scale (Harris et al., 2015), etc. While these methods are professional and effective, they require participants to either answer questionnaires (Venek et al., 2017) or engage in interviews (Scherer et al., 2013), hence not reaching suicidal people who are either unable to access these resources or have a low motivation to seek professional help (Zachrisson et al., 2006;Essau, 2005). Studies suggest that taking a suicide assessment can negatively impact individuals showing depressive symptoms (Harris and Goh, 2016).
NLP Methods: In recent years, social media has shown promise in providing insights into the psychological state of individuals (Paul and Dredze, 2011). Jashinsky et al. (2014) reported that Twitter is a viable tool for real-time monitoring (Braithwaite et al., 2016) of suicide risk. Early efforts in utilizing social media include the use of user features (Masuda et al., 2013) and online suicide notes (Pestian et al., 2010;Huang et al., 2007). Since then, the focus has been on using psycholinguistic lexicons such as LIWC (De Choudhury et al., 2016;Sawhney et al., 2018b) and textual features such as POS, tense, etc. for classification (Ji et al., 2018;Huang et al., 2014). Shared tasks such as CLPsych (Zirikly et al., 2019) and CLEF eRISK (Losada et al., 2019) have seen a rise in the use of deep learning for suicidality prediction. CNN based architectures (Du et al., 2018;Sawhney et al., 2018a;Shing et al., 2018;Naderi et al., 2019) and LSTM based architectures (Ji et al., 2018;Tadesse et al., 2020) utilize pre-trained word embeddings to predict suicide risk. Although these text-based methods capture the semantic nature of posts in isolation, no user associated context is provided that can give insight into the user's mental state to improve predictive power (Venek et al., 2017). A user-dependent, personalized context can truly process the "natural" language of a user and understand the semantic context from the perspective of that specific user (Flek, 2020). User context may include the user's emotion spectrum (Ren et al., 2016), social graph methods  and temporal context (Mathur et al., 2020). Suicide risk assessment for preliminary screening has been done at both binary (suicidal intent present, suicidal intent absent) (Cao et al., 2019;De Choudhury et al., 2016;Mathur et al., 2020;Losada et al., 2019), and multiple (Zirikly et al., 2019;Vioules et al., 2018;Gaur et al., 2019) levels of risk ranging from no risk to severe risk.
Contextual Methods: The best performing model, the dual context BERT (Matero et al., 2019), at the CLPsych 2019 shared task (Zirikly et al., 2019) for suicidal estimation on Reddit exemplifies the utility of temporal context. The Dual Context BERT utilizes post level BERT embeddings passed sequentially through an attention-based RNN. Similarly, Cao et al. (2019) employ a LSTM and fastText-based architecture for modeling temporal context. These RNN and LSTM based approaches assume that users' historical posts are equally spaced in time, hindering the suicide ideation detection model's ability to learn their relative importance in a time-aware manner. Time-aware sequential models have shown improvements in other clinical tasks (Baytas et al., 2017), such as patient subtyping, and in other domains like user activity modeling (Zhu et al.). More recently, Mathur et al. (2020) and Sinha et al. (2019) have modeled a user's historic emotion spectrum using latent representations of GloVe embeddings of historic tweets. These latent features are then aggregated based on specific functions such as exponential decay and sinusoids as opposed to learning them as sequences. These approaches assume that suicidal ideation conforms to specific trajectories, which may not generalize well across users (Giletta et al., 2015) and lose the context of individual historic tweets by aggregating them. Approaches besides deep learning have also been explored, such as the work done by Vioules et al. (2018), which uses the martingale framework (Ho, 2005) with sentiment scores and tweet level features such as likes to study two users on Twitter.

Notations and Problem Formulation
We acknowledge that modeling suicidal intent as a binary classification task is a strong simplification, and in this work, we focus on identifying the presence of suicide ideation within a tweet using a user-level temporal context. We denote a tweet to be assessed for suicidal risk as $t_i$. We formulate the problem as a classification task to predict a label $y_i$ for the tweet $t_i$, where $y_i \in \{\text{suicidal intent present}, \text{suicidal intent absent}\}$.

Encoding the Tweet to be Assessed
Studies have shown that the linguistic styles of social media users can aid in understanding their mental state (De Choudhury et al., 2013) and that their suicidal behaviour is correlated with suicidal tweets (Sueki, 2015). Static word embeddings such as GloVe (Pennington et al., 2014) have been used to encode tweets for detecting suicide ideation in the past. However, recent studies have shown that pre-trained transformer models yield more comprehensive representations of linguistic features in a tweet (Salminen et al., 2020). We found that SentenceBERT (Reimers and Gurevych, 2019) empirically outperforms embeddings used in previous works such as FastText (Cao et al., 2019), ELMo (Mohammadi et al., 2019), etc. We use the 768-dimensional encoding obtained from SentenceBERT. Formally,
$T'_i = \mathrm{SentenceBERT}(t_i)$
where $T'_i \in \mathbb{R}^{768}$ is linearly transformed using a dense layer to $T_i \in \mathbb{R}^{d}$ with dimension $d$.

User Historical Emotion Spectrum
Individual Historic Tweet Encoding: Amplification of emotional factors such as emotional reactivity (Tarrier et al., 2007), intensity (Links et al., 2008) and instability (Palmier-Claus et al., 2012) can increase suicide risk. Building on this, we extract the emotion spectrum of each historic tweet $h^i_k$. Although proficient in semantic modeling of text, general text encoders fail to capture the fine-grained emotions expressed in social media posts. To capture fine-grained emotions, we utilize Plutchik's wheel of emotions (Plutchik, 1980). This taxonomy suggests three hierarchical sets of eight emotions arranged as four pairs of opposing dualities. The primary set of emotions described by the wheel are: Joy-Sadness, Surprise-Anticipation, Anger-Fear, and Trust-Disgust. We obtain an encoding that models the emotional spectrum of a historical tweet, and thus that of a user at a historic time. Based on empirical comparisons and the success of transfer learning in NLP, we fine-tune pre-trained BERT embeddings on the Emonet dataset (Abdul-Mageed and Ungar, 2017). The dataset consists of a total of 1,608,233 tweets labeled across 24 emotions as per Plutchik's wheel of emotions. The presence of the primary emotions in the dataset is skewed towards joy, sadness, and fear, with their representation being 20.57%, 8.85%, and 6.13%, respectively, with other emotions having fewer samples. These are labeled through distant supervision using a total of 665 emotion hashtags.
We call this transformer the PlutchikTransformer. This transformer tokenizes each historical post and adds the [CLS] token at the beginning of each post. We use the final hidden state corresponding to this [CLS] token (768-dimensional encoding) as the aggregate representation of the emotional spectrum. We define the emotion vector ($E^i_k \in \mathbb{R}^{768}$) of each historic tweet $h^i_k$ as:
$E^i_k = \mathrm{PlutchikTransformer}(h^i_k)$
Modeling Historical Tweets Sequentially: The emotional historic context of tweets can be used to model progressive emotional states of the author of those tweets (Abdul-Mageed and Ungar, 2017; De Choudhury et al., 2013). This makes recurrent neural networks (RNNs), and particularly LSTMs (Hochreiter and Schmidhuber, 1997), the most natural methods for encoding and learning from a sequence of a user's historical tweets. However, the time interval between the posting of historic tweets can vary widely, from a few seconds to a few years (Wojcik and Hughes, 2019). Such variations can be an important factor in analyzing the emotional states of a user over time (Sueki, 2015). LSTM cells assume the input to be an equally spaced sequence and thus are unable to model irregularities in posting times of historical tweets. Using the relative time difference between the user's historical tweets can progressively model the user's emotions more accurately over time. Hence, we propose the use of a Time-aware LSTM (T-LSTM) (Baytas et al., 2017), where the time lapse between successive tweets is fed to the T-LSTM cell, as shown in Figure 3. The T-LSTM cell thus incorporates the actual time differences between tweets, along with each historical tweet's emotional context $E^i_k$. T-LSTM applies time decay to the memory according to the elapsed time between successive elements and weights the short-term memory component $C^S_{k-1}$ of the previous cell memory. Intuitively, the greater the time elapsed between two tweets, the less impact they should have on each other.
To achieve this, T-LSTM uses a monotonically decreasing function of elapsed time, which transforms time into appropriate weights. Following Baytas et al. (2017), time lapses are incorporated in the T-LSTM as:
$C^S_{k-1} = \tanh(W_d C_{k-1} + b_d)$ (Short-term memory)
$\hat{C}^S_{k-1} = C^S_{k-1} * g(\Delta_k)$ (Discounted short-term memory)
$C^T_{k-1} = C_{k-1} - C^S_{k-1}$ (Long-term memory)
$C^*_{k-1} = C^T_{k-1} + \hat{C}^S_{k-1}$ (Adjusted previous memory)
where $C_{k-1}$ and $C_k$ are previous and current cell memories, and $\{W_d, b_d\}$ are network parameters. $\Delta_k$ is the elapsed time between historic tweets $h_{k-1}$ and $h_k$, and $g(\cdot)$ is a heuristic decaying function that reduces the effect of short-term memory as $\Delta_k$ increases. We select $g(\Delta_k) = 1/\Delta_k$ empirically and as suggested in Baytas et al. (2017). For each historic tweet $h^i_k$, the T-LSTM cell modifies LSTM gate operations to compute the current hidden state ($H^i_k \in \mathbb{R}^{d}$) by feeding $C^*_{k-1}$ instead of $C_{k-1}$.
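The memory adjustment described above can be illustrated with a minimal pure-Python sketch. It follows the T-LSTM decomposition of Baytas et al. (2017) as we understand it; the function name `adjust_previous_memory`, the list-based representation, and the choice of time unit for the elapsed time are assumptions made for illustration and are not taken from the STATENet implementation.

```python
import math


def adjust_previous_memory(c_prev, w_d, b_d, delta):
    """Return a time-adjusted previous cell memory (hypothetical helper).

    c_prev : previous cell memory, a list of d floats
    w_d    : d x d weight matrix given as a list of rows
    b_d    : bias vector of length d
    delta  : elapsed time between two successive tweets (must be > 0;
             the unit is an assumption of this sketch)
    """
    if delta <= 0:
        raise ValueError("elapsed time must be positive")
    decay = 1.0 / delta  # g(delta) = 1 / delta
    adjusted = []
    for row, bias, c in zip(w_d, b_d, c_prev):
        # Short-term component: tanh(W_d C_prev + b_d), one row at a time.
        short = math.tanh(sum(w * x for w, x in zip(row, c_prev)) + bias)
        # Long-term component is what remains of the memory.
        long_term = c - short
        # Only the short-term component is discounted by elapsed time.
        adjusted.append(long_term + short * decay)
    return adjusted
```

With an elapsed time of 1 the memory is passed through unchanged, while a very large gap removes almost all of the short-term component and keeps only the long-term part.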

Joint Network Optimization
To identify the presence of suicidal intent in a tweet, STATENet jointly learns from the language of the tweet to be assessed and the emotional historic spectrum in a time-aware manner. For this, we concatenate $T_i$ and $H^i_k$ and pass the result through a dense layer with a Rectified Linear Unit (ReLU) (Hahnloser et al., 2000) to form a prediction vector. Finally, a softmax function (Goodfellow et al., 2016) is used to output the probability that suicidal intent is present.
$\hat{y}_i = \mathrm{softmax}(\mathrm{ReLU}(W_y [T_i \oplus H^i_k] + b_y))$
where $\hat{y}_i$ is the final suicide risk assessment, $\oplus$ denotes concatenation, and $\{W_y, b_y\}$ are network parameters. Tweets indicating suicidal intent form a very small proportion of the data (Ji et al., 2019). To address this problem of class imbalance (in practice, the imbalance is much greater in the real world), we train STATENet using the Class-Balanced loss proposed by Cui et al. (2019) along with Focal Loss (Lin et al., 2017). This loss function applies a class-wise re-weighting scheme by introducing a weighting factor that is inversely proportional to the number of samples in each class. The loss function $L$ is:
$L = \mathrm{CB}_{focal}(\hat{y}_i, y_i)$
where $\mathrm{CB}_{focal}$ is the class-balanced focal loss, $\hat{y}_i$ is the predicted label and $y_i$ is the label of the current tweet. $\beta$ and $\gamma$ are hyperparameters.
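As an illustration of the class-wise re-weighting described above, the following sketch computes a class-balanced focal loss for a single example. It uses the weighting factor from Cui et al. (2019) and the modulating term from Lin et al. (2017) in a softmax form, which is an assumption of this sketch; the function name and the default values of `beta` and `gamma` are hypothetical and not the settings used for STATENet.

```python
import math


def class_balanced_focal_loss(probs, label, samples_per_class,
                              beta=0.9999, gamma=2.0):
    """Class-balanced focal loss for one example (illustrative sketch).

    probs             : predicted class probabilities (softmax output)
    label             : index of the true class
    samples_per_class : number of training samples in each class
    beta, gamma       : hyperparameters (default values are assumptions)
    """
    n_y = samples_per_class[label]
    # Weight shrinks as the class gets more training samples.
    weight = (1.0 - beta) / (1.0 - beta ** n_y)
    # Clamp to avoid log(0) for a confidently wrong prediction.
    p_t = max(probs[label], 1e-12)
    # Focal term down-weights examples that are already well classified.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)
```

For the same predicted probability, an example from the rarer class receives a larger loss than one from the majority class, and setting `gamma` to 0 reduces the expression to a class-weighted cross-entropy.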

Dataset
We use the Twitter timeline data of users from the dataset introduced by Sinha et al. (2019). Sinha et al. (2019) began with a collection of Twitter posts based on a lexicon of 143 suicidal phrases. After manual inspection of the dataset for trivially non-suicidal tweets, their final dataset contained 34,306 tweets. Some of these tweets were authored by the same user; thus, the total number of unique users for which tweets were to be classified was 32,558. We summarize the annotation instructions (Sawhney et al., 2018b) that were followed by two annotators, both students of Clinical Psychology, for annotating the collected 34,306 tweets:
• Suicidal Intent (SI) Present: Posts where suicide ideation or previous attempts are discussed in a somber and non-flippant tone.
• Suicidal Intent (SI) Absent: Tweets with no evidence for risk of suicide, including song lyrics, condolence messages, awareness posts, and news.
It is important to note that this process produced suicide risk labels at the level of individual tweets and not for individual user histories. An acceptable inter-annotator agreement was achieved with a Cohen's Kappa score (Cantor, 1996) of 0.72, under the supervision of a professional clinical psychologist. The resulting dataset contains 3,984 suicidal tweets. The Twitter timeline was collected for each user. These timelines span over ten years from 2009 to 2019. The mean number of tweets in user history is 748 (max 3,200) with a standard deviation of 789 tweets. We trim the user history to the 100 most recent tweets for users with a large number of historical tweets. The mean time difference between two consecutive tweets for a user is two days with a standard deviation of almost 24 days between two tweets, indicative of large variations across users. 4,070 users were found to have no historical tweets.
Data Preprocessing: We de-identified the dataset by performing named entity recognition and removing any identifiable information such as email addresses, URLs, and names. Next, we follow standard procedures of converting the text to lowercase, removing punctuation and accents, stripping whitespace, and removing stopwords. We split the tweets in the dataset on the basis of users such that there is no overlap between users in the train, validation, and test sets. We perform a stratified 70:10:20 split across the three sets, such that the train, validation, and test sets consist of 24,014, 3,431, and 6,861 tweets, respectively. Although there may be multiple tweets to be assessed by the same user, their associated history differs according to the tweets' posting timestamps. We ensure that for each tweet to be classified, only the historical tweets having timestamps older than that of the tweet to be assessed are used for historic modeling.
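The history selection rule above (only tweets strictly older than the tweet being assessed, trimmed to the most recent 100) can be sketched as follows. The function names, the `(timestamp, text)` tuple layout, and the use of numeric timestamps are assumptions made for illustration.

```python
def select_history(timeline, assess_time, max_history=100):
    """Pick the historic tweets usable for one tweet to be assessed.

    timeline    : list of (timestamp, text) tuples in any order
    assess_time : timestamp of the tweet to be assessed
    max_history : keep at most this many of the most recent older tweets
    """
    if max_history <= 0:
        return []
    # Keep only tweets posted strictly before the tweet being assessed.
    older = [item for item in timeline if item[0] < assess_time]
    # Order chronologically so the sequence model sees oldest first.
    older.sort(key=lambda item: item[0])
    return older[-max_history:]


def elapsed_times(history):
    """Time gaps between successive tweets in a chronological history."""
    return [later[0] - earlier[0]
            for earlier, later in zip(history, history[1:])]
```

Filtering by timestamp per assessed tweet means that two tweets from the same user receive different histories, and it prevents later tweets from leaking into the context of an earlier one.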

Experimental Settings
Baseline Methods: We evaluate STATENet using the macro F1 and recall for suicidal intent present ($\mathrm{recall}_s$), against two types of baseline methods: tweet-level (TL) and user-level (UL). UL baselines were adapted for tweet-level assessment by concatenating embeddings of the tweet to be assessed with the user-level features. We implement all methods with PyTorch 1.5 (Paszke et al., 2019) and optimize using mini-batch AdamW with a batch size of 256 and a learning rate of 0.0001. We use the cosine scheduler with a warmup step of 5 (Gotmare et al., 2018). We train the model for 20 epochs and apply early stopping with a patience of 5 epochs. The model takes 4,361s to train on an Nvidia Tesla K80 GPU.
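To make the learning-rate schedule concrete, the sketch below computes a cosine schedule with warmup in plain Python. The linear shape of the warmup, the interpretation of the warmup value as 5 scheduler steps, and the function name `lr_at_step` are assumptions of this sketch; the text does not specify them.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=5):
    """Learning rate at a given step (illustrative schedule).

    Linear warmup to base_lr over warmup_steps (assumed shape),
    followed by cosine decay towards zero at total_steps.
    """
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, capped at 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The warmup avoids large updates while the optimizer statistics are still unreliable, and the cosine phase lowers the rate smoothly rather than in discrete drops.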

Comparative Performance
We note from Table 1 that STATENet significantly (p < 0.005) outperforms competitive baselines. We compare against both text-only and temporal contextual models for suicidal risk assessment. STATENet and other contextual models perform better than the non-contextual RF + tweet features and C-LSTM models. We believe this is because temporal contextual models offer greater insight into the author's historical mental state, thereby increasing predictive power. STATENet and sequential models outperform the Contextual CNN, likely due to their ability to better learn representations from the temporal dependence in historical tweets, as opposed to Contextual CNN's bag-of-tweets approach. We also observe that STATENet significantly outperforms competitive sequential models. We attribute this to the ability of the time-aware LSTM in STATENet to capture irregularities in tweeting intervals of users. Such time-aware modeling likely learns more accurate latent representations of users' emotional historic context. While exponential decay and episodic modeling perform well, we note that STATENet does better in terms of all metrics, particularly recall for the suicidal intent present class. We believe this is because not every user's emotional historic context may conform to the fixed trajectories on which these approaches aggregate historic tweets.
Table 2 (ablation study; only the rows recoverable from the source are shown). Columns: model | macro F1 | recall for suicidal intent present.
(Plutchik) | 0.730 | 0.608*
Current + Sequential History (BERT) | 0.767* | 0.786*
Current + Sequential History (Plutchik) | 0.778* | 0.795*
Current + TA History (Plutchik) | 0.799* | 0.810*

Ablation Study
To assess EHC and TTI, we perform an ablation study (Table 2) with different configurations. Without considering historic tweets, the performance of the model drops drastically. We believe that adding historic tweets, even in a random order, adds additional contextual cues about the user, resulting in improved performance. We observe that the PlutchikTransformer variant of Current + Sequential History outperforms its BERT counterpart. This can be attributed to the ability of the PlutchikTransformer to capture the EHC of a user. STATENet jointly models the Current Tweet and EHC in a time-aware manner, overcoming the limitation of previous models that assume equal time intervals between posts. On inspecting the results for the 647 users without any historic tweets, we find that STATENet performs well with a recall of 0.74 and macro F1 of 0.75. This reiterates the ability of linguistic-only, non-contextual models in suicidal intent identification. This is particularly interesting because, for users with no available history, assessment can still be performed to some degree.

Temporal Analysis
The tweet's language should be studied with historical context to better understand the user's emotional state over time, based on the EHC. To analyze the importance of the order and temporal dependency of historic tweets, we first try a non-sequential, bag-of-tweets-like variant. We feed the Plutchik transformer-based encodings to a Contextual CNN. We observe that the bag-of-tweets approach is slightly better than the Contextual CNN baseline, likely because of the transformer-based encoding as opposed to static GloVe embeddings used by the baseline. The non-sequential approach drastically underperforms over ten runs in comparison to temporal variants. Further investigating EHC and TTI, we first feed historic tweets in sequential order (Sequential Model) to a regular LSTM, and then we factor in TTI through T-LSTM in STATENet. Figure 4 shows that STATENet is significantly (p < 0.005) better than the sequential but non-time-aware variant, and shows the least variation in performance over 10 different runs. We believe that the difference in performance between the Sequential Model and STATENet is due to the temporal dependency of historic tweets on the elapsed time between successive tweets.
Figure 5: (a), (b) and (c) show emotion intensity across the 8 primary emotions based on the Plutchik Wheel for Users 1, 2, and 3 over time, respectively, from White to Blue. $h^i_k$ represents the $k$-th historic tweet associated with the current tweet $t_i$. (d) shows model predictions for tweets, where Green and Red represent correct and incorrect assessment of suicidal risk, respectively, for different tweets. We display only the 8 primary emotions of the Plutchik wheel for brevity.

Qualitative Analysis
To provide detailed insight and aid interpretability, we analyze some cases where STATENet performs well. We also highlight the limitations of STATENet through error analysis. We qualitatively analyze three interesting cases in Table 3 and Figure 5. We see that the tweet to be assessed for User 1 does not show any explicit suicidal intent and alone may not be sufficient to assess suicidal risk. However, temporal models correctly classify the tweet as they learn the build-up of sadness in the historic tweets, which we observe from the Plutchik emotional intensity in Figure 5a.
When the current tweet of the user is non-indicative, temporal models can get additional context by learning the historic activity of the user. Often, temporal patterns are variable, and posting frequencies vary drastically. These TTI present challenges when relying only on the sequence of historic tweets rather than the actual time lapses. For instance, initial tweets of User 2 showed sadness and suicidal intent, whereas the recent historic tweet ($h^2_{k_3}$) of the user represents joy (Figure 5b). LSTM-based models aggregate sadness and hence assume the history to be suicidal. In contrast, STATENet is able to learn from the variable time lapses and their relative importance in the context of suicide ideation. However, we found some cases where all models failed. For User 3, the current tweet does not contain strong semantic indicators of suicidal intent. Moreover, historic tweets do not show any recognizable emotional pattern (Figure 5c). Such a case illustrates the complexities associated with suicide risk assessment.
Another interesting observation from Figure 5 is that the learned Plutchik emotion intensity distribution for users is skewed towards joy (positive) and sadness (negative). Although the highly granular emotional context captured by the PlutchikTransformer improves STATENet's performance (Sec. 5.2) over the more generic language features captured by BERT, we leave further exploration of the impact of emotion granularity to future research.

Discussion
Ethical Considerations: The preponderance of the work presented in our discussion presents heightened ethical challenges. As explored in Coppersmith et al. (2018), we address the trade-off between privacy and effectiveness. While data is essential in making models like STATENet effective, we must work within the purview of acceptable privacy practices to avoid coercion and intrusive treatment. To that end, we utilize publicly available Twitter data in a purely observational (Norval and Henderson, 2017;Broer, 2020), and non-intrusive manner. Although informed consent of each user was not sought as it may be deemed coercive, automated de-identification of the dataset was performed to reduce the risk of including any identifying data in the raw data. All tweets shown as examples in Figure 1 and Section 5.4 have been paraphrased as per the moderate disguise scheme suggested in Bruckman (2002) to protect the privacy of individuals (Fiesler and Proferes, 2018). The annotation of user data has been kept separately from raw user data on protected servers linked only through anonymous IDs (Benton et al., 2017). Assessments made by STATENet are sensitive and should be shared selectively to avoid misuse, such as Samaritan's Radar (Hsin et al., 2016). Our work does not make any diagnostic claims related to suicide. We study the social media posts in a purely observational capacity (Norval and Henderson, 2017) and do not intervene with the user experience in any way.
Limitations: We acknowledge that studying suicidality is subjective in nature (Keilp et al., 2012) and that the interpretation of the analysis presented may vary across individuals. Due to the situatedness of language, the studied data may be susceptible to demographic, annotator, and medium-specific biases (Hovy and Spruit, 2016). We recognize that suicide risk exists on a diverse spectrum, and the simplification of binary labels could lead to artificial notions of risk (Bryan and Rudd, 2006).
Practical Implications: Through STATENet, we suggest a neural architecture for preliminary screening of at-risk users on social media to aid the prioritization of clinical resources. Our work observes Twitter in a non-intrusive manner and does not interfere with the user experience in any way. STATENet should form part of a distributed human-in-the-loop (de Andrade et al., 2018) system for finer interpretation of risk. Focusing on STATENet's practical applicability, we work with tweet-level annotations rather than the more subjective and difficult-to-scale user-level annotations. We emphasize tweet-level prediction; however, STATENet can also be applied to user-level suicide risk assessment given its dual text and historic modeling components.

Conclusion
Motivated by the rising use of social media for exhibiting suicide ideation as opposed to standard clinical practice (McHugh et al., 2019), we present STATENet. Building on psychological studies on analyzing a user's temporal emotional spectrum, STATENet models the time-aware emotional context of users through historical tweets for more accurate suicide risk estimation on social media. We show STATENet's applicability as a preliminary tool in assessing suicidality in tweets, and present a qualitative analysis for a deeper understanding of STATENet. Through this work, we aim to form a component in a larger human-in-the-loop infrastructure for analyzing potentially concerning suicide-related social media posts. In future work, we plan to explore the impact of varying amounts of historical context for a user. Priority-based suicide risk assessment, which ranks tweets by suicidal risk rather than classifying them, forms another future direction. Additionally, we would like to quantify the impact of varying degrees of granularity of learned emotional features from tweets on STATENet's performance.