Learning Models for Suicide Prediction from Social Media Posts

We propose a deep learning architecture and test three other machine learning models to automatically detect individuals who will attempt suicide within (1) 30 days and (2) six months, using their social media post data provided in the CLPsych 2021 Shared Task. Additionally, we create and extract three sets of handcrafted features for suicide detection based on the three-step theory of suicide and prior work on emotions and the use of pronouns among persons exhibiting suicidal ideation. Extensive experiments show that some of the traditional machine learning methods outperform the baseline with an F1 score of 0.741 and an F2 score of 0.833 on subtask 1 (prediction of a suicide attempt 30 days prior). However, the proposed deep learning method outperforms the baseline with an F1 score of 0.737 and an F2 score of 0.843 on subtask 2 (prediction of suicide 6 months prior).


Introduction
According to the World Health Organization (WHO), close to 800,000 people die due to suicide every year, which is one person every 40 seconds. The US Centers for Disease Control and Prevention (CDC) reported that suicide was the tenth leading cause of death overall in the United States. Recently, there has been a trend of applying natural language processing (NLP) techniques to unstructured physician notes from electronic health record (EHR) data to detect high-risk patients (Fernandes et al., 2018).
With the proliferation of social media, where information is shared freely, mining data from these platforms has become an obvious way to extend this body of work to more naturalistic settings. Consequently, researchers have started to apply machine learning and NLP-based techniques to detect suicide ideation on social media platforms (Ramírez-Cifuentes et al., 2020; Roy et al., 2020). Some of them focused on handcrafted features, including TF-IDF (Zhang et al., 2011), LIWC (Tausczik and W, 2010), N-grams, Part-of-Speech (PoS) tags and emotions (Shah et al., 2020; Zirikly et al., 2019; Zhang et al., 2015; Ji et al., 2020), while others explored language embeddings (Cao et al., 2019; Jones et al., 2019; Sawhney et al., 2018; Coppersmith et al., 2018).
In this paper, we present several approaches to detect suicide ideation from Twitter posts (1) 30 days before the attempt and (2) six months before the attempt. We use the dataset provided by the CLPsych 2021 Shared Task (Macavaney et al., 2021) towards this goal.
The main contributions of our work are:
• Explored and generated multiple handcrafted feature sets motivated by prior work in this area
• Proposed a new deep learning architecture that uses latent features from tweets to detect suicide attempts
• Tested several machine learning algorithms using only handcrafted features and only latent features
• Achieved better performance than the baseline in terms of F1, F2 and True Positive Rate (TPR) on both subtasks
Summary of Findings: The main takeaways from this work are:
• Extensive testing on the dataset shows that the latent feature (Doc2Vec (Lau and Baldwin, 2016)) is better at detecting suicide attempts from the tweets than the handcrafted features
• Most of our models performed better at detecting individuals who have attempted suicide or were a victim of suicide than at detecting control individuals who have not
• KNN and SVM with latent features perform best on subtask 1 with respect to F1, F2 and TPR, while our proposed C-Attention (C-Att) network performs best on subtask 2 with respect to F1, F2 and TPR

Method
Before describing the methods in detail, we summarize the features used in our work. We use two classes of features: latent features and handcrafted features. These are described in the sections below.

Latent Features
Latent features are typically obtained as language embeddings. In our case, we used Doc2Vec (Lau and Baldwin, 2016) to generate both word embeddings and document embeddings for each post. Doc2Vec creates a vectorized representation of a group of words (or a single word, when used in that mode) taken collectively as a single unit. For every document in the corpus, Doc2Vec computes a feature vector. There are two models for implementing Doc2Vec: the Distributed Memory version of Paragraph Vector (PV-DM) and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW). For our experiments, we used the Distributed Memory (DM) version. DM randomly samples consecutive words from a sentence and predicts a center word from this sampled set of context words together with the document's feature vector.
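To make the PV-DM objective concrete, the following minimal sketch enumerates the training examples such a model would see: each example pairs a document identifier and a window of context words with the center word to be predicted. It is an illustration only and not the training code used in this work; the function name `pv_dm_examples`, the window size and the toy post are assumptions, and a real implementation would learn vectors from these examples rather than just list them.

```python
from typing import List, Tuple

Example = Tuple[int, List[str], str]  # (document id, context words, center word)


def pv_dm_examples(doc_id: int, tokens: List[str], window: int = 2) -> List[Example]:
    """List the (doc_id, context, center) triples a PV-DM model would train on.

    The document id stands in for the document's feature vector, which PV-DM
    combines with the context word vectors to predict the center word.
    """
    examples: List[Example] = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]        # up to `window` words before
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after
        context = left + right
        if context:                                # skip one-word documents
            examples.append((doc_id, context, center))
    return examples


# Toy post (hypothetical), tokenized on whitespace.
post = "the sky is grey today".split()
triples = pv_dm_examples(doc_id=0, tokens=post, window=1)
```

Because the document identifier appears in every example drawn from the same post, the vector attached to it is pushed to encode what the local context words alone cannot, which is what makes it usable as a per-post latent feature.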

Emotions
Emotions can be good indicators of depression and suicide ideation (Desmet and Hoste, 2013; Coppersmith et al., 2016; Cao et al., 2020; Ghosh et al., 2020), so we include emotions as one of the handcrafted features. We used the method proposed by Shao et al. (2019) to generate 12 emotion tags: contentment, pride, fear, anxiety, sadness, disgust, relief, shame, anger, interest, agreeableness and joy. In addition, we generated emotion intensity scores using the NRC lexicon (Mohammad, 2018) for the emotions anger, anticipation, disgust, fear, joy, sadness, surprise and trust. After removing duplicates, we selected 17 emotion tags.

Parts of Speech
We use NLTK (Bird et al., 2009) to generate Part-of-Speech tags. PoS tags can capture differences in syntactic structure between users who attempt suicide and the control group (Ji et al., 2020). It has been shown (Roubidoux, 2012) that persons attempting suicide use more first-person pronouns. Therefore, we also count the occurrences of first-person pronouns such as "I", "me", "mine" and "myself" and include this count as another PoS-related handcrafted feature. In total, we generated 34 PoS tags per post for the "30 days prior prediction" subtask and 37 PoS tags for the "6 months prior prediction" subtask.
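The first-person pronoun count can be sketched as follows. This is a minimal illustration under stated assumptions, not the feature extraction code used here: it tokenizes with a simple regular expression instead of NLTK, the name `count_first_person` is hypothetical, and the pronoun set is limited to the four examples listed above.

```python
import re

# Only the four pronouns named in the text; a fuller list is an open choice.
FIRST_PERSON = {"i", "me", "mine", "myself"}


def count_first_person(post: str) -> int:
    """Count first-person pronoun tokens in a post, ignoring case."""
    # Runs of letters and apostrophes are treated as tokens, so a contraction
    # such as "I'm" stays one token and is NOT counted in this sketch.
    tokens = re.findall(r"[a-z']+", post.lower())
    return sum(1 for token in tokens if token in FIRST_PERSON)


example_count = count_first_person("I told myself it was mine, not me.")
```

Whether contractions and possessives such as "my" should contribute to the count is a design decision that the text does not specify.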

Three-step theory of suicide and suicide dictionary
We then generate a dictionary of words based on the three-step theory of suicide (3ST) (Klonsky and May, 2015), which begins with ideation, followed by unmitigated strengthening of the idea due to insufficient social support, and ends with an attempt. These stages are underpinned by feelings of hopelessness (Dixon et al., 1991), thwarted belongingness and burdensomeness (Chu et al., 2018; Forkmann and Teismann, 2017). Violence usually differentiates attempters and non-attempters (Stack, 2014). Surviving an attempt is expected to be accompanied by feelings of shame (Wiklander et al., 2012; Wolk-Wasserman, 1985). We expect these feelings to be out of phase with each other, creating a leading, an inline and a lagging indicator of a suicide attempt. We used the Word2vec software (Mikolov et al., 2013a,b,c) and its accompanying utility (also available in online versions) to construct these dictionaries by evaluating the closest neighbors of seed words (gloom and burden, violence, hurt and shame); each dictionary contains about 100 words after some manual cleanup and editing. The manual cleanup involved removing stop words, words with hyphens, special characters, some vernacular tokens, and words that differed in capitalization alone. We generated this feature set by counting each keyword in each post. In addition, we manually created a dictionary of suicide keywords based on suicide-related words published in (Low et al., 2020; Yao et al., 2020), and counted how many suicide-related keywords occurred in each post.
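The dictionary cleanup and per-post keyword counting can be sketched as below. This is a hedged approximation: the cleanup in this work was done manually, so the rules here only mimic the mechanical parts (stop words, hyphens, special characters, capitalization duplicates) and leave out the removal of vernacular tokens. The names `clean_dictionary` and `keyword_counts`, the tiny stop-word list and the sample neighbor list are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Tiny illustrative stop-word list (assumption); a real list would be larger.
STOP_WORDS = {"the", "and", "of", "a", "to"}


def clean_dictionary(neighbors: Iterable[str]) -> List[str]:
    """Apply the mechanical cleanup rules to a list of nearest-neighbor words."""
    seen = set()
    cleaned: List[str] = []
    for word in neighbors:
        lowered = word.lower()
        if lowered in STOP_WORDS:
            continue                      # stop words
        if not lowered.isalpha():
            continue                      # hyphens, digits, special characters
        if lowered in seen:
            continue                      # differs in capitalization alone
        seen.add(lowered)
        cleaned.append(lowered)
    return cleaned


def keyword_counts(post: str, dictionary: List[str]) -> Dict[str, int]:
    """Count how often each dictionary keyword occurs in a post."""
    tokens = Counter(re.findall(r"[a-z]+", post.lower()))
    return {word: tokens[word] for word in dictionary}


# Hypothetical neighbor list for the seed word "burden".
raw_neighbors = ["Burden", "burden", "heavy-hearted", "the", "weight", "load!"]
dictionary = clean_dictionary(raw_neighbors)
features = keyword_counts("The weight of this burden, this burden.", dictionary)
```

Each post then contributes one count per keyword, so a dictionary of about 100 words yields a feature vector of about 100 entries per post.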

Models
In this work, we proposed a deep learning model and used a few other machine learning models. The proposed deep learning model, which we refer to as the C-Attention Network, is our primary model.

C-Attention Network
Figure 1 depicts our C-Attention network, which uses latent features to detect suicide attempts. This network is similar to our prior C-Attention Embedding model, with the following differences:
• In this work we consider each post as a small document and use Doc2Vec to generate a 100-dimension embedding representation for each post, whereas the prior work generated a sentence embedding for each sentence in a speech.
• We removed the positional encoding layer since there is no positional dependency among posts.
In summary, the architecture first calculates the embeddings of the dataset, then processes them via a multi-head self-attention (MHA) module that captures the intra-feature relationships, followed by an attention layer, a single convolution layer and a softmax layer. The MHA module is the same as that proposed in (Vaswani et al., 2017) for the popular transformer architecture.
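The core operation inside the MHA module, scaled dot-product self-attention (Vaswani et al., 2017), can be sketched in a few lines. This is a deliberately simplified illustration and not the C-Attention implementation: it uses a single head, omits the learned query, key and value projections (the post embeddings serve as all three), and leaves out the subsequent attention, convolution and softmax layers. The function names and the toy 2-dimension embeddings are assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(posts: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention with Q = K = V = posts."""
    dim = len(posts[0])
    attended: List[Vector] = []
    for query in posts:
        # Similarity of this post to every post, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in posts
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of all post embeddings.
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, posts)) for j in range(dim)]
        )
    return attended


# Three toy post embeddings (hypothetical; the text uses 100 dimensions).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(embeddings)
```

Without a positional encoding, this operation has no notion of post order: reordering the input posts reorders the outputs in the same way and changes nothing else, which is consistent with dropping the positional encoding layer when posts are treated as an unordered set.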

Latent Features with Other Machine Learning Models
In this approach, we combined all the posts for each user. Stop words were removed from the posts and the remaining words were lemmatized. The average length of posts was found to be 140 words. Long posts were chunked into 150-word segments to retain meaningful information in each post. A single 200-dimension embedding vector was generated for each segment using Doc2Vec as described in Section 2.1. We applied linear discriminant analysis (LDA) (McLachlan, 2004) for dimensionality reduction before classification, and the output of LDA was fed to the machine learning models. We considered K-Nearest Neighbors (KNN) (Jiang et al., 2012) with K=3, a Support Vector Machine (SVM) (Ríssola et al., 2019) with a linear kernel (referred to as SVM(EB) in the rest of the paper) and a Decision Tree (D-Tree) (Song and Ying, 2015) classifier.
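The chunking step can be sketched as follows. This is a minimal sketch that assumes whitespace tokenization and fixed, non-overlapping segments; the function name `chunk_words` is hypothetical, and whether the original pipeline chunked before or after stop-word removal is not stated.

```python
from typing import List


def chunk_words(text: str, size: int = 150) -> List[str]:
    """Split text into consecutive segments of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# A hypothetical 320-word post yields segments of 150, 150 and 20 words.
long_post = " ".join(["word"] * 320)
segments = chunk_words(long_post)
segment_lengths = [len(segment.split()) for segment in segments]
```

Each resulting segment would then be embedded separately, so a user with long posts contributes more vectors than a user with short ones.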

Handcrafted Features with Other Machine Learning Models
We used three other machine learning models on the handcrafted features described in Section 2.2 to address both subtasks. The three machine learning models were: Random Forest Classifier (RF) (Breiman, 2001), Logistic Regression (LR) (Aladag et al., 2018) and Support Vector Machine (SVM) (Ríssola et al., 2019) (referred to as SVM(HF) for the rest of the paper). We used all of the handcrafted feature sets, since we found that leaving out any of them caused a drop in performance. We tuned the parameters of each model to get the best predictions: for example, we set the kernel to rbf (radial basis function) for the SVM(HF) model, set the solver to liblinear (limited to one-versus-rest schemes) for the LR model, and set the max depth to 4 for the RF model.

Results
Table 1 and Table 2 show the performance results. The results reported in Table 1 were obtained by running the KNN, SVM(EB) and SVM(HF) models, which were trained on the entire training set. The performance of the models is measured in terms of F1 and F2 scores, True Positive Rate (TPR), False Positive Rate (FPR) and Area Under the ROC Curve (AUC).
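For reference, the count-based metrics above can be computed from a confusion matrix as in the sketch below. The function names and the example counts are assumptions chosen purely for illustration; they are not results from this work. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
from typing import Dict


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from confusion counts; beta > 1 weights recall more heavily."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def rates(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """True and false positive rates from confusion counts."""
    return {
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,   # recall on the positive class
        "FPR": fp / (fp + tn) if (fp + tn) else 0.0,   # controls flagged as positive
    }


# Made-up counts for illustration only.
tp, fp, fn, tn = 8, 4, 2, 6
f1 = f_beta(tp, fp, fn, beta=1.0)
f2 = f_beta(tp, fp, fn, beta=2.0)
tpr_fpr = rates(tp, fp, fn, tn)
```

Because F2 weights recall more heavily than precision, a model that misses few positive cases can score a higher F2 than F1 even when it produces a fair number of false positives.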

Analysis/Discussion
The results reported in Table 1 were generated by the KNN, SVM(EB) and SVM(HF) models, which performed best on the training set. From Table 1, we can see that the baseline provided by the CLPsych 2021 shared task outperformed all of these methods. After a thorough investigation of the results, we observed that the models that did not perform best on the training set performed better on the test set. This probably indicates that we overfit our models to the training set.
As a result, in the following experiments, we randomly split the training set into 80% for training and 20% for validation, and use the models that performed best on the validation set to predict suicide in the test set. The new performance results on the test set are shown in Table 2.
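A random 80/20 split of this kind can be sketched as follows. The helper name `train_val_split`, the fixed seed and the choice to split a list of user identifiers are assumptions; the unit of splitting and any seeding are not specified in the text.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_val_split(
    items: Sequence[T], val_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle items reproducibly and hold out `val_fraction` for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)      # local RNG, global state untouched
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]  # (training items, validation items)


# Hypothetical user identifiers.
user_ids = list(range(100))
train_ids, val_ids = train_val_split(user_ids)
```

Splitting at the level of users rather than individual posts would keep all posts from one person on the same side of the split, which avoids leaking a user's writing style from training into validation.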
We noted that in subtask 1, KNN and SVM(EB) performed best in terms of F1, F2 and TPR. The best AUC was achieved by KNN only, and the best FPR was achieved by RF. In subtask 2, C-Att performed best in terms of F1, F2 and TPR; the best FPR was achieved by RF; and the best AUC was achieved by the baseline. Our experimental results indicate that:
• In general, latent features perform better than handcrafted features in this shared task
• The C-Att model performs better on longer-range suicide predictions, while KNN and SVM(EB) work better on shorter-range suicide predictions
• Except for RF, our models perform better at detecting suicidal individuals than control individuals

Conclusion
In this work, we introduce the C-Attention model and test other machine learning models to automatically detect suicidal individuals based on the latent feature (Doc2Vec) and handcrafted features including emotions, PoS, and the three-step theory of suicide and suicide dictionary. Our results show that both KNN and SVM(EB) achieved the best F1 score of 0.741 and F2 score of 0.833 on subtask 1 (prediction of a suicide attempt 30 days prior), and C-Att reached the best F1 score of 0.737 and F2 score of 0.843 on subtask 2 (prediction of suicide 6 months prior). Ultimately, this work supports the use of social media as an avenue to better predict and understand the experience of suicidal thoughts. However, more work is needed to better decipher why certain features and models best predict suicidality in large, diverse, representative samples.

Ethics Statement
Secure access to the shared task dataset was provided with IRB approval under University of Maryland, College Park protocol 1642625.