Latent Suicide Risk Detection on Microblog via Suicide-Oriented Word Embeddings and Layered Attention

Although detection of suicidal ideation on social media has made great progress in recent years, posts in which people express themselves implicitly, or even contrary to their true feelings, remain an obstacle that prevents detectors from achieving higher performance. Enlightened by the hidden "tree hole" phenomenon on microblog, where people at suicide risk tend to disclose their real inner feelings and thoughts in the microblog space of authors who have committed suicide, we explore the use of tree holes to enhance microblog-based suicide risk detection from the following two perspectives. (1) We build suicide-oriented word embeddings based on tree hole contents to strengthen the sensitivity to suicide-related lexicons and contexts. (2) A two-layered attention mechanism is deployed to grasp intermittently changing points in individuals' open blog streams, which reveal one's inner emotional world to some extent. Our experimental results show that with suicide-oriented word embeddings and attention, microblog-based suicide risk detection can achieve over 91% accuracy. A large-scale, well-labelled suicide data set is also reported in the paper.


Introduction
Suicide is a growing problem in today's society. Each year nearly 800,000 people worldwide die by suicide, which is one person every 40 seconds, and many more attempt it (Organization et al., 2014). Suicide prevention contributes to human well-being, and timely sensing of suicidal ideation is an essential part of it.
Existing Solutions. Traditional suicide risk assessment instruments such as the Suicide Probability Scale (Bagge and Osman, 1998), the Adult Suicide Ideation Questionnaire (Fu et al., 2007), and the Suicidal Affect-Behavior-Cognition Scale (Harris et al., 2015) require respondents to either fill in a questionnaire or participate in a professional interview. However, they reach only a small group of people. In particular, they cannot function for people who are suffering but tend to hide their inmost thoughts and refuse to seek help from others (Essau, 2005; Rickwood et al., 2007; Zachrisson et al., 2006).
Recently, the penetration of social media (such as forums and microblogs), with its large-scale, low-cost, and open nature, has enabled researchers to overcome this limitation and detect individuals' suicidal ideation in a timely manner. Although great efforts have been made (Alambo et al., 2019; Cheng et al., 2017; Du et al., 2018; Sawhney et al., 2018; Coppersmith et al., 2018; Viouls et al., 2018), the performance of social media based detection is constrained by implicit and contrary-to-feeling posts from people who hide their inmost feelings and thoughts on social media. To illustrate, consider a user's normal blogs vs. his/her hidden posts in a microblog tree hole in Figure 1. People of suicidal tendency (referred to as suicidal people in this study) tend to disclose their real inner feelings in the microblog space of authors who have committed suicide. Hundreds of such

tree holes exist on Sina microblog; one example tree hole contains 1,700,000 posts from suicide attempters. In Figure 1, we can sense severe hopelessness in the tree hole posts but not in the normal posts, where the user even expresses an uplifted feeling. After cross-examining suicidal users' normal and hidden posts as shown in Table 1, we discover that users' hidden posts in the tree hole show more self-concern, less others-concern, and express suicidal thoughts more directly. In comparison, suicidal users' normal posts contain far fewer suicide-related words, and the users are reluctant to show their suicidal feelings in their normal posts. Moreover, the self-concern and others-concern figures in the first two rows indicate that people at suicide risk are not even willing to talk about themselves in their normal posts, which makes it greatly challenging to detect suicide risk from users' open normal posts.

Table 1: Comparison of suicidal users' normal posts and tree hole posts.

                                                     Normal Posts   Tree Hole Posts
  Avg proportion of self-concern words per post           14%             50%
  Avg proportion of others-concern words per post         68%             12%
  Avg proportion of suicide-related words per post         5%             95%
  Avg number of posts per user in a year                 69.3            52.1
  Total number of posts from all users in a year      252,901         190,087

Our Work. The aim of this study is to break through the above limitation and achieve new state-of-the-art performance on latent suicide risk detection from one's open normal microblogs. We leverage tree hole posts from the following two perspectives. (1) We construct suicide-oriented word embeddings based on tree hole contents to strengthen the sensitivity to suicide-related lexicons and contexts. (2) A two-layered attention mechanism is deployed to grasp intermittently changing points in individuals' open blog streams, revealing one's inner emotional world to some extent.
Our experimental results on 252,901 open normal microblogs show that with suicide-oriented word embeddings and two-layered attention, latent suicide risk detection can achieve over 91% accuracy.
In summary, the paper makes the following contributions.
• We build effective suicide-oriented word embeddings to better understand the implicit meanings of words contained in users' normal posts, and propose a two-layered attention model to capture the changing points which reveal suicide risk from individuals' blog streams. Our latent suicide risk detection from users' normal posts not only outperforms the state-of-the-art approaches, but is also powerful enough to detect implicitly and contrarily expressed posts.

Related Work

Traditional Suicide Risk Assessment. Besides the scales and questionnaires mentioned above, other assessment instruments have also been proposed (Just et al., 2017). While these measurements are professional and effective, they require respondents to either fill in a questionnaire or participate in a professional interview, constraining their reach to suicidal people who have low motivation to seek help from professionals (Essau, 2005; Rickwood et al., 2007; Zachrisson et al., 2006). A recent study found that taking a suicide assessment may have a negative effect on individuals with depressive symptoms (Harris and Goh, 2017).

Suicide Risk Detection from Social Media
Recently, detection of suicide risk from social media has been making great progress owing to its advantages of reaching a massive population at low cost and in real time (Braithwaite et al., 2016). Harris et al. (2014) reported that suicidal users tend to spend more time online, have a greater likelihood of developing online personal relationships, and make greater use of online forums. Suicide Risk Detection from Suicide Notes. Pestian et al. (2010) built a suicide note classifier using machine learning techniques, which performed better than human psychologists in distinguishing fake online suicide notes from real ones. Huang et al. (2007) hunted suicide notes via lexicon-based keyword matching on MySpace.com (a popular site for adolescents and young adults, particularly sexual minority adolescents, with over 1 billion registered users worldwide) to check whether users had an intent to commit suicide.
Suicide Risk Detection from Community Forums. Li et al. (2013) applied textual sentiment analysis and summarization techniques to users' posts and posts' comments in a Chinese web forum in order to identify suicide expressions. Masuda et al. (2013) examined online forums in Japan and discovered that the number of communities a user belongs to, the intransitivity, and the fraction of suicidal neighbors in the social network contributed the most to suicide ideation. De Choudhury et al. (2016) built a logistic regression framework to analyze Reddit users' shift tendency from mental health sub-communities to a suicide support sub-community. Heightened self-attentional focus, poor linguistic coherence and coordination with the community, reduced social engagement, and manifestation of hopelessness, anxiety, impulsiveness, and loneliness in shared contents are distinct markers characterizing these shifts. Based on suicide lexicons detailing suicide indicators, suicide ideation, suicide behavior, and suicide attempts, Alambo et al. (2019) built four corresponding semantic clusters to group semantically similar posts on Reddit and questions in a questionnaire together, and used the clusters to assess the aggregate suicide risk severity of a Reddit post.
Suicide Risk Detection from Microblogs. Jashinsky et al. (2014) used search keywords and phrases relevant to suicide risk factors to filter potential suicide-related tweets, and observed a strong correlation between Twitter-derived suicide data and real suicide data, showing that Twitter can be viewed as a viable tool for real-time monitoring of suicide risk factors on a large scale. The correlation study between suicide-related tweets and suicidal behaviors was also conducted based on a cross-sectional survey (Sueki, 2015), where participants answered a self-administered online questionnaire, containing questions about Twitter use, suicidal behaviour, depression and anxiety, and demographic characteristics. The survey result showed that Twitter logs could help identify suicidal young Internet users.
Based on eight basic emotion categories (joy, love, expectation, anxiety, sorrow, anger, hate, and surprise), Ren et al. (2015) examined three accumulated emotional traits (i.e., emotion accumulation, emotion covariance, and emotion transition) as the special statistics of emotions expressions in blog streams for suicide risk detection. A linear regression algorithm based on the three accumulated emotional traits was employed to examine the relationship between emotional traits and suicide risk. The experimental result showed that by combining all of three emotion traits together, the proposed model could generate more discriminative suicidal prediction performance.
Natural language processing and machine learning techniques, such as Latent Dirichlet Allocation (LDA), Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, Decision Tree, etc., were applied to identify users' suicidal ideation based on their linguistic contents and online behaviors on Sina Weibo (Guan et al., 2014; Zhang et al., 2014a; Huang et al., 2014; Zhang et al., 2014b; Huang et al., 2015; Guan et al., 2015; Cheng et al., 2017) and Twitter (Abboute et al., 2014; Burnap et al., 2015; O'Dea et al., 2015; Coppersmith et al., 2015). Deep learning based architectures such as the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Long Short-Term Memory network (LSTM) were also exploited to detect users' suicide risk on social media (Du et al., 2018; Sawhney et al., 2018; Coppersmith et al., 2018). Viouls et al. (2018) detected users' change points in emotional well-being on Twitter through a martingale framework, which is widely used for change detection in data streams.

Suicide-oriented Word Embeddings

Although there are good works on word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2016; Devlin et al., 2018), the lack of domain information limits their performance on suicide detection. Given a series of pre-trained word embeddings and a suicide-related dictionary, we aim to generate suicide-related word embeddings which strengthen the sensitivity to suicide-related lexicons and contexts. In this study, we call them Suicide-oriented Word Embeddings, as we take advantage of the information in the Tree Hole data set, which can be regarded as a kind of latent emotional expression of individuals. As suicidal individuals on social media often use suicide-related words/phrases in their posts, we employ the Chinese suicide dictionary (Lv et al., 2015) to generate suicide-domain-associated embeddings. The Chinese suicide dictionary was built by analyzing 1.06 million active blog users' posts and lists 2,168 words/phrases related to suicidal ideation.
These words/phrases belong to 13 categories, and each word/phrase is assigned a risk weight from 1 to 3 indicating its relevance to suicide. We list 5 representative categories in Table 2.
Since pre-trained word embeddings already contain rich semantic information and contextual information, we only need to enrich existing word embeddings with suicide-related information.
We employ a masked classification task to do this. Generally, a sentence should contain suicide-related words/phrases if it expresses suicidal ideation. Hence, we select 10,000 sentences from the Tree Hole data set and ensure that every sentence contains at least one word/phrase appearing in the Chinese suicide dictionary.
Moreover, we utilize the selected sentences for a suicidal expression classification task. A sentence is regarded as a suicidal expression only if it includes at least one suicide-related word/phrase. In this way, we perform a sentence-level classification to refine the pre-trained word embeddings and let them learn which words/phrases are relevant to suicidal expression.

Figure 2: An example of the strategy for embeddings training is shown on the left, and the architecture of the masked classification task to train the suicide-oriented word embeddings is on the right.
In training, for each epoch we randomly select 50% of the sentences and replace all their suicide-related words/phrases with "[mask]". For the remaining 50% of the sentences, we randomly insert two "[mask]" tokens into every sentence, so that the suicidal expression classifier cannot classify a sentence based only on whether it contains the word "[mask]".
An example is given in Figure 2 (a). Masked sentence 1 is a sentence in which we replace all suicide-related words/phrases with "[mask]", and Masked sentence 2 is a sentence into which we randomly insert two "[mask]" tokens. We label Masked sentence 1 as 0 (non-suicide) and Masked sentence 2 as 1 (suicide).
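As a minimal sketch of the masking strategy described above (the dictionary contents and tokenization here are placeholders, not the paper's actual 2,168-entry resource):

```python
import random

# Hypothetical mini dictionary; the real Chinese suicide dictionary
# (Lv et al., 2015) contains 2,168 words/phrases.
SUICIDE_WORDS = {"die", "hopeless", "worthless"}

def mask_sentence(tokens, mask_all):
    """Apply the paper's masking strategy to a tokenized sentence.

    mask_all=True  -> replace every dictionary word with "[mask]" (label 0),
                      as for Masked sentence 1.
    mask_all=False -> insert two "[mask]" tokens at random positions (label 1),
                      as for Masked sentence 2, so the classifier cannot rely
                      on the mere presence of "[mask]".
    """
    if mask_all:
        masked = ["[mask]" if t in SUICIDE_WORDS else t for t in tokens]
        return masked, 0  # non-suicide label
    masked = list(tokens)
    for _ in range(2):
        masked.insert(random.randrange(len(masked) + 1), "[mask]")
    return masked, 1  # suicide label
```

For example, `mask_sentence(["i", "feel", "hopeless", "today"], True)` masks only the dictionary word, while the `mask_all=False` branch lengthens the sentence by exactly two tokens.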
As there is no clear boundary between suicide-related words/phrases and others in the pre-trained word embeddings, this suicidal expression classification forces suicide-related words/phrases to be enriched with domain information and makes all of them encode their relationship with suicidal ideation. After the classification model converges on the Tree Hole data set, we obtain suicide-oriented word embeddings which contain both semantic information from the pre-trained word embeddings and suicide information from the suicide dictionary.
As illustrated in Figure 2, given a sentence A = {w_1, w_2, ..., w_n} written by a user in the Tree Hole, where n is the length of the sentence, the aim of suicidal expression classification is to classify whether this sentence contains an expression of suicidal ideation. We define X = {x_1, x_2, ..., x_n} ∈ R^{n×d_e} as the word embeddings of A, where d_e is the dimension of the embeddings. Figure 2 shows the architecture of the suicidal expression classification model.
We employ an LSTM layer to extract text features from A, followed by a fully connected layer for classification. We feed the word embeddings X into the LSTM:

    h_t = LSTM(h_{t-1}, x_t)                              (1)
    [k_1, k_2] = softmax((H_a W_1 + b_1)^T W_2 + b_2)     (2)

where h_{t-1}, h_t represent the hidden states at time t-1 and t, H_a = {h_1, h_2, ..., h_n} ∈ R^{n×d_e} is the sentence representation of A, and [k_1, k_2] stands for the probability of whether the sentence contains an expression of suicidal ideation or not. W_1 ∈ R^{d_e×1}, W_2 ∈ R^{n×2}, b_1 ∈ R^{1×1} and b_2 ∈ R^{1×2} are trainable parameters.
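To make the parameter shapes concrete, the classification head can be sketched numerically; random values stand in for the LSTM hidden states H_a, and the toy dimensions (n = 6, d_e = 8) are illustrative, not the paper's settings:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over all entries.
    e = np.exp(z - z.max())
    return e / e.sum()

n, d_e = 6, 8                           # toy sentence length and embedding dim
rng = np.random.default_rng(0)

H_a = rng.normal(size=(n, d_e))         # stands in for LSTM states h_1..h_n
W1, b1 = rng.normal(size=(d_e, 1)), np.zeros((1, 1))
W2, b2 = rng.normal(size=(n, 2)), np.zeros((1, 2))

# [k_1, k_2] = softmax((H_a W_1 + b_1)^T W_2 + b_2)
scores = (H_a @ W1 + b1).T @ W2 + b2    # shape (1, 2)
k = softmax(scores)                     # probabilities of the two classes
```

Note how W_1 first projects each hidden state to a scalar (giving an n-vector) and W_2 then maps that n-vector to the two class logits, matching the stated shapes W_1 ∈ R^{d_e×1} and W_2 ∈ R^{n×2}.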
Suicide Risk Detection Model

Given a sequence of posts T from one user, T = {(s_1, p_1), (s_2, p_2), ..., (s_m, p_m)}, where m denotes the number of posts and (s_i, p_i) stands for the text and picture of the i-th post, the aim is to detect whether the user is at risk of suicide. Let X = {x_1, x_2, ..., x_n} ∈ R^{n×d_e} be the word embeddings of s_i, where n represents the length of s_i and d_e is the dimension of the embeddings. Figure 3 shows the architecture of the proposed two-layered attention model.

Feature Extraction
Text Feature Extraction. We employ an LSTM layer and an attention mechanism to extract text features from s_i. We feed the word embeddings X into the LSTM:

    h_t = LSTM(h_{t-1}, x_t)

where h_{t-1}, h_t represent the hidden states at time t-1 and t. We obtain the primary textual representation H_i^p = {h_1, h_2, ..., h_n} ∈ R^{n×d_e} of s_i after the LSTM layer. To capture the critical suicide-related textual information in H_i^p, we apply the attention mechanism Att_I:

    Att_I = softmax(H_i^p W_3 + b_3)

where Att_I ∈ R^{n×1} is the attention vector giving the distribution of weights over the words of the primary textual representation, and W_3 ∈ R^{d_e×1} and b_3 ∈ R^{1×1} are trainable parameters. We then multiply H_i^p by the attention vector Att_I to get the final textual representation Ĥ_i ∈ R^{d_e×1} of text s_i:

    Ĥ_i = (H_i^p)^T Att_I

Image Feature Extraction. We extract image features with a 34-layer pre-trained ResNet (He et al., 2016). For convenience of computation, we convert the last fully connected layer of ResNet from 512 × 1000 to 512 × d_e:

    I_i = O W_4 + b_4

where O ∈ R^{1×512} is the input of the last fc-layer, and W_4 ∈ R^{512×d_e} and b_4 ∈ R^{1×d_e} are trainable parameters. I_i ∈ R^{1×d_e} is then the visual representation of picture p_i.

User's Feature Extraction. As illustrated in Table 3, we extract 12 features F ∈ R^{12×1} from the user's profile and posting behaviour. Since not everyone has tree hole data, we do not use tree hole data in the suicide risk detection model.
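The word-level attention Att_I can be sketched numerically as follows; random values stand in for the LSTM outputs H_i^p, and the dimensions (n = 5, d_e = 4) are toy values for illustration only:

```python
import numpy as np

def softmax_cols(z):
    # Column-wise numerically stable softmax.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

n, d_e = 5, 4                            # toy sentence length and embedding dim
rng = np.random.default_rng(1)

H_p = rng.normal(size=(n, d_e))          # primary textual representation (LSTM output)
W3, b3 = rng.normal(size=(d_e, 1)), np.zeros((1, 1))

Att_I = softmax_cols(H_p @ W3 + b3)      # (n, 1) word-level attention weights
H_hat = H_p.T @ Att_I                    # (d_e, 1) final textual representation
```

The attention weights sum to 1 over the n words, so H_hat is a convex combination of the per-word hidden states, matching Ĥ_i ∈ R^{d_e×1}.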

Suicide Risk Detection
Given the textual representation Ĥ_i and the visual representation I_i, we employ a concatenation operation ⊕ to obtain the post representation E_i ∈ R^{2d_e×1} for a single post (s_i, p_i):

    E_i = Ĥ_i ⊕ (I_i)^T

Similarly to the above, we employ another LSTM layer and attention mechanism to generate the global post representation G ∈ R^{30×1}:

    h_t = LSTM(h_{t-1}, E_t)

where h_{t-1}, h_t represent the hidden states at time t-1 and t, and H_g = {h_1, h_2, ..., h_m} ∈ R^{m×d_e} represents the primary post representation of a user after the LSTM layer. As not every post of a user expresses suicidal ideation, we apply the attention mechanism Att_II to capture the high-suicide-risk post information in H_g. An attention vector Att_II ∈ R^{m×1} is computed to represent the different risk weights of the posts:

    Att_II = softmax(H_g W_5 + b_5)

where W_5 ∈ R^{d_e×1} and b_5 ∈ R^{1×1} are trainable parameters. Then, based on the attention Att_II, we obtain the global post representation G for a user:

    G = (((H_g)^T Att_II)^T W_6 + b_6)^T

where W_6 ∈ R^{d_e×30} and b_6 ∈ R^{1×30} are trainable parameters.
Finally, we apply a concatenation operation to jointly consider G and F, and compute the probability of suicide through a fully connected layer:

    [y_1, y_0] = softmax((G ⊕ F)^T W_7 + b_7)

where y_1, y_0 represent the probability that the user is or is not at risk of suicide, and W_7 ∈ R^{42×2} and b_7 ∈ R^{1×2} are trainable parameters.
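The final classification layer can be sketched numerically; random values stand in for G and F, and note that the stated shape W_7 ∈ R^{42×2} matches the concatenation of G ∈ R^{30×1} and F ∈ R^{12×1}:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over all entries.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
G = rng.normal(size=(30, 1))    # global post representation
F = rng.normal(size=(12, 1))    # user profile/behaviour features (Table 3)
W7, b7 = rng.normal(size=(42, 2)), np.zeros((1, 2))

# [y_1, y_0] = softmax((G ⊕ F)^T W_7 + b_7)
GF = np.concatenate([G, F], axis=0)     # (42, 1) concatenated representation
y = softmax(GF.T @ W7 + b7)             # (1, 2) suicide / non-suicide probabilities
```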

Data Collection
To perform suicide risk detection via social media, we construct two data sets: one from the Tree Hole and another from Weibo.
Tree Hole data set. We studied a suicidal community that exists in the comments on a Chinese student's last post before the student committed suicide. On March 17, 2012, this student, whose screen name is "Zoufan", left her last words on Weibo and then committed suicide. Over the past seven years, more than 160,000 people have gathered there and written over 1,700,000 comments, a number that is still growing. They express their suicidal thoughts, recount their tragic experiences, and describe their plans for suicidal behavior. In psychology, we can understand this community as a tree hole. We crawled all comments from May 1, 2018 to April 30, 2019 and selected the top 4,000 active users. After that, four doctoral students majoring in computational mental healthcare were employed to annotate whether each user is at risk of suicide. Specifically, we decide that a user is "at suicide risk" based only on the self-reports in his/her tree hole posts. If a user expresses clear suicidal thoughts, such as "At this moment, I especially want to die. I feel very tired. I really want to be free.", more than 5 times on different days, we label him/her as at suicide risk. Finally, we obtain 190,087 sentences from 3,652 users, with an average length of 11.96 words per post.
Suicide data set. To collect users at suicide risk, we crawl each user's profile and all their posts on Weibo according to the user list from the Tree Hole data set. Besides, we select users who have never submitted any post containing an expression of suicidal ideation and label them as non-suicide-risk. We discard users who have more than 1,500 followers or more than 2,000 posts, because they may be public figures or organizations. The statistics of the suicide data set are shown in Table 4.

Data Preprocessing
We carry out the following data preprocessing procedures: 1) Emoji. We replace each emoji with a corresponding word such as "happy" or "cry" to help our model understand the emotion of the user's post. 2) URL. As URLs are of no use for our detection, we simply remove them from sentences. 3) Image. All images posted by users were resized to 224 × 224 for normalized input.
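The text-side preprocessing steps can be sketched as follows; the emoji-to-word mapping is a hypothetical stand-in (the paper does not give the actual mapping), and image resizing is omitted since it only requires a standard 224 × 224 resize:

```python
import re

# Hypothetical emoji-to-word mapping; the paper's actual mapping is not given.
EMOJI_MAP = {"😊": "happy", "😭": "cry"}

URL_RE = re.compile(r"https?://\S+")

def preprocess(text):
    """Apply steps 1) and 2): emoji -> word, then strip URLs."""
    for emoji, word in EMOJI_MAP.items():
        text = text.replace(emoji, word)   # 1) replace emoji with words
    return URL_RE.sub("", text).strip()    # 2) remove URLs, trim whitespace
```

For example, `preprocess("so sad 😭 http://t.cn/abc")` yields `"so sad cry"`.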

Experimental Setup
In the suicide detection task, we treat the most recent 100 posts from one user as one sample. After combining D_2 and D_3, we obtain 7,329 microblog users; the training, validation, and test sets contain 6,129, 600, and 600 users, respectively. All sentences are padded to the length of the longest sentence in the data set with the word "<PAD>". The batch size is 16 during training, and the learning rate is 0.001. Adam (Kingma and Ba, 2015) is adopted as the optimizer.
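The padding step can be sketched as follows (a minimal illustration; "<PAD>" follows the paper's convention, while the tokenization is assumed):

```python
def pad_batch(sentences, pad_token="<PAD>"):
    """Pad every tokenized sentence to the longest length in the batch."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]
```

After padding, every sentence in the batch has the same length, so the word embeddings form a rectangular tensor suitable for the LSTM input.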
We compare suicide-oriented word embeddings with the following well-developed word embeddings.
(1) Word2vec: The foundational work that considers the local semantic information of words. We obtain pre-trained Word2vec word embeddings from Li et al. (2018).
(2) GloVe: A context-based unsupervised algorithm that applies a co-occurrence matrix to jointly consider local and global semantic information. We apply the open-source tool (https://github.com/stanfordnlp/GloVe) to train word embeddings on all sentences from the Tree Hole data set.
(3) FastText: A fast text classification and representation learning model based on Word2vec and hierarchical softmax. Pre-trained FastText word embeddings were obtained from the official project.
(4) Bert: A recent language model based on the transformer. We acquire the pre-trained Bert model from the official project and dynamically generate the word embedding of each word in a sentence.
We also compare our suicide risk detection model with the following well-designed methods.
(1) LSTM (Coppersmith et al., 2018): An attention mechanism based Long Short-Term Memory model which can capture contextual information between suicide-related words and others.
(2) Naive Bayes (NB) and Support Vector Machine (SVM) (Pedregosa et al., 2011): Two representative machine learning methods with well-designed features. We use SC-LIWC information (Cheng et al., 2017) as textual features; saturation, brightness, warm/clear color, and five-color theme information from pictures as visual features; and the user's behaviour features from Table 3.

Results
Three sets of tests were conducted to evaluate the performance of the suicide risk detection model with suicide-oriented word embeddings.

Effectiveness of Suicide-oriented Word Embeddings
We compare the performance of LSTM and SDM with seven word embeddings, as illustrated in Table 5. We find that without the suicide-related dictionary, Bert outperforms the other three word embeddings, with 2% higher accuracy and 1.5% higher F1-score on both models. After leveraging the suicide-related dictionary, the suicide-oriented word embeddings based on FastText achieve the best performance, with accuracies of 88.00% and 91.33% and F1-scores of 88.14% and 90.92% on the two models. There is a clear gap between suicide-oriented word embeddings and normal word embeddings, which verifies the effectiveness of the former.

Effectiveness of Suicide Risk Detection Model
We compare the performance of four models as shown in Table 6. In this case, LSTM and SDM employ So-FastText word embeddings as their input. SDM improves the accuracy by over 3.33% and obtains a 2.78% higher F1-score on the full data set.

Table 5: Performance comparison for different word embeddings and different detection models, where "So-W2v", "So-GloVe" and "So-FastText" represent suicide-oriented word embeddings based on Word2vec, GloVe and FastText respectively. Acc and F1 represent accuracy and F1-score.

Performance on Harder sub-testset
To verify the effectiveness of the models in dealing with people's implicit and contrary expressions in microblog posts, we filter out 130 suicide-risk people from the test set who do not show obvious suicidal ideation in their normal posts. These 130 people constitute a subset of the test set named the Harder sub-testset. Observing the performance of the four models as shown in Table 6, SDM keeps about 8% higher accuracy and F1-score on the Harder sub-testset compared with the other models, and its decline is smaller than that of the other models. This suggests that SDM performs better than existing models in dealing with people's implicit and contrary expressions.

Ablation Test for Suicide Risk Detection Model
To show the contribution of different inputs to the final classification performance, we design an ablation test of SDM, removing different inputs. All SDMs are based on the So-FastText embeddings.
Since not every post contains an image and users' features contain missing values, we do not use images alone or users' features alone as input to the SDMs. As illustrated in Table 7, textual information is a crucial input to our SDM. Besides, users' features play a more important role than visual information. The more modalities we use, the better the performance we get.

Discovery
To further explore the correlation between the normal posts and hidden posts of the same user, we employ the Pearson correlation coefficient (Tutorials, 2014). For each user i, we obtain a normal vector V_i^n = {v_{i,Jan}^n, v_{i,Feb}^n, ..., v_{i,Dec}^n} ∈ R^12, where v_{i,Jan}^n is the total number of times suicide-related words/phrases appear in user i's normal posts in January. In a similar way, we get a hidden vector V_i^h = {v_{i,Jan}^h, v_{i,Feb}^h, ..., v_{i,Dec}^h} ∈ R^12. Each normal post also has an attention weight from Att_II which represents its suicide risk. Then, similarly, an attention risk vector V_i^a = {v_{i,Jan}^a, v_{i,Feb}^a, ..., v_{i,Dec}^a} ∈ R^12 is computed, where v_{i,Jan}^a is the total suicide risk for user i in January. We denote the Pearson correlation coefficient of V_i^n and V_i^h as ρ_{nh,i}, and that of V_i^a and V_i^h as ρ_{ah,i}, i.e., the correlations between normal and hidden posts, and between attention and hidden posts, for user i. As shown in Figure 4, there is a high positive linear correlation between normal and hidden posts for user 1, with ρ_{nh} = 0.84, and a high negative linear correlation for user 4, with ρ_{nh} = -0.7. For the other two suicide-risk users, there is no obvious linear correlation, with ρ_{nh} = 0.33 and -0.12 respectively. In contrast, the correlations ρ_{ah} between attention and hidden posts are higher than 0.6 for all four users, indicating high positive linear correlation, which verifies the ability of the two-layered attention mechanism to reveal one's inner emotional world.
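The per-user correlation computation above can be sketched as follows (the 12-dimensional monthly vectors would come from the counts and attention weights described in the text; here short toy vectors illustrate the formula):

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation coefficient between two activity vectors,
    e.g. the 12-month normal vector V_i^n and hidden vector V_i^h."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    u, v = u - u.mean(), v - v.mean()          # center both vectors
    return float((u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

For two proportional vectors the coefficient is 1.0 (perfect positive linear correlation), and for a reversed trend it approaches -1.0, matching the ρ_{nh} readings discussed for users 1 and 4.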

Conclusion
In this paper, we explore the use of tree holes to enhance microblog-based suicide risk detection. Suicide-oriented word embeddings based on tree hole contents are built to strengthen the sensitivity to suicide-related lexicons, and a two-layered attention mechanism is deployed to grasp intermittently changing points in individuals' open blog streams. Based on the above word embeddings and attention mechanism, we propose a suicide risk detection model which outperforms well-designed approaches on the benchmark data set. The experimental results also show that our model performs well on people's implicit and contrary expressions.