Cross-Lingual Suicidal-Oriented Word Embedding toward Suicide Prevention

Early intervention for suicide risk with social media data has received increasing attention. Using a suicide dictionary created by mental health experts is one effective way to detect suicidal ideation. However, little attention has been paid to validating whether and how existing dictionaries for other languages (i.e., English and Chinese) can be used for predicting suicidal ideation in a low-resource language (i.e., Korean) for which no knowledge-based suicide dictionary has yet been developed. To this end, we propose a cross-lingual suicidal ideation detection model that can identify whether a given social media post includes suicidal ideation or not. To utilize the existing suicide dictionaries developed for other languages (i.e., English and Chinese) in word embedding, our model translates a post written in the target language (i.e., Korean) into English and Chinese, and then uses the separate suicidal-oriented word embeddings developed for English and Chinese, respectively. By applying an ensemble approach across the different languages, the model achieves high accuracy, over 87%. We believe our model is useful for assessing suicidal ideation with social media data to prevent potential suicide risk at an early stage.


Introduction
As online social media has become the norm for sharing our daily lives, people often share their emotions, feelings, and mental state. This has spurred scholars to identify diverse mental health problems such as depression, anxiety, bipolar disorder, or suicidal thoughts using abundant user behavior data on online social media (Ji et al., 2019; Pavalanathan and De Choudhury, 2015; Kim et al., 2020). Such user behavior data can provide a cue for identifying individual mental state or even suicide risk (O'dea et al., 2015; Ren et al., 2015; Coppersmith et al., 2018), which can be used to support mental health care (Shen and Rudzicz, 2017; Suhara et al., 2017).
Among the diverse mental health problems, suicide has become a major and growing concern worldwide. The OECD (Organization for Economic Cooperation and Development) reported 11.2 deaths per 100,000 population in OECD countries in 2017 (OECD, 2020). In particular, the suicide rates of Korea and the USA were 24.6 and 13.9 deaths per 100,000 population in 2016, ranking 1st and 8th, respectively.
The awareness of the severity of suicide has led researchers to assess mental health using social media data to recognize potential warning signs of suicide at an early stage (Pavalanathan and De Choudhury, 2015; O'dea et al., 2015). In particular, linguistic characteristics (e.g., frequently used words like 'family', 'sad', or 'dream') of social media posts have been extensively investigated (Gaur et al., 2019; Lv et al., 2015). As prior research showed that certain linguistic features revealed in an individual's language can be linked to suicide risk (McCarthy, 2010; Sueki, 2015), there have been attempts to develop machine-learning models using a suicide dictionary created and curated by mental health experts. For example, an English suicide dictionary was created and validated by four clinical psychiatrists (Gaur et al., 2019); a Chinese suicide dictionary was curated by eleven mental health experts (Lv et al., 2015).
The predictive power of such suicide dictionaries with domain knowledge (in English or Chinese) for identifying suicide risk in an English- or Chinese-written social media post has been demonstrated (Gaur et al., 2019; Lv et al., 2015). However, little attention has been paid to validating whether an existing dictionary developed for a specific language (e.g., English or Chinese) can be used for predicting suicidal ideation in other languages (e.g., Korean or Japanese) for which no suicide dictionary has yet been developed. It is essential to investigate whether and how existing suicide dictionaries developed by domain experts can be utilized for predicting suicidal ideation in non-English- or non-Chinese-speaking countries, because building and validating such a knowledge-based dictionary requires substantial effort.
To shed light on this issue, we propose a cross-lingual suicidal ideation detection model that can identify whether a given social media post includes suicidal ideation or not. To utilize the existing suicide dictionaries developed for other languages (i.e., English and Chinese) in word embedding, our model translates a post written in the target language (i.e., Korean) into English and Chinese and then uses the separate word embeddings developed for English and Chinese, respectively. Our model then uses attention to build a representation for post embedding. The attention helps find words that are more relevant to suicidal ideation, thereby obtaining a better post representation. By applying an ensemble approach across the different languages, which can reflect linguistic or cultural differences, our proposed model finally predicts the suicidal ideation of the given post in Korean.
We highlight the main contributions of our work as follows.
• To the best of our knowledge, this is the first attempt to utilize the suicide dictionaries developed for other languages (i.e., English and Chinese) in predicting suicidal ideation in Korean. We believe the proposed model provides a cost-effective way to detect suicide risk from a social media post written in a low-resource language where a knowledge-based suicide dictionary does not exist. The proposed model achieves high accuracy, over 87%.
Note that the Korean suicidal-oriented word embedding is built by a computational approach without a medical knowledge base, yet it shows considerable performance in suicidal ideation detection. We believe the suicidal-oriented word embeddings can be useful for researchers who want to assess suicidal ideation using social media data for preventing potential suicide risk at an early stage.


Related Work

It has become the norm for people to share their daily lives or feelings on diverse social media. This in turn has led researchers to investigate individuals' mental health problems using a deluge of user activity data on social media (Ji et al., 2019; Pavalanathan and De Choudhury, 2015; Shing et al., 2018), because such user behavior can provide a cue for identifying individual mental state or even suicide risk (O'dea et al., 2015; Ren et al., 2015; Coppersmith et al., 2018; Sinha et al., 2019). There has been great interest in developing models to detect suicide risk based on user behavior, such as the number of posts or followers (Kumar et al., 2015; Cao et al., 2019), and on linguistic characteristics (e.g., frequently used words like 'family', 'sad', or 'dream') revealed in social media posts (Gaur et al., 2019; Lv et al., 2015). For example, a linguistic analysis of social media data found several signals that can be linked to suicide attempts and suicidal ideation. De Choudhury et al. (2016) analyzed user posts on Reddit and found that individuals who could become suicidal tend to exhibit changes in linguistic structure, interpersonal awareness, and social interaction on social media. Such distinctive markers of shift can be used for identifying individual suicidal ideation.

Suicide Dictionary Development
As it has been reported that certain linguistic features revealed in an individual's language can be linked to suicide risk (McCarthy, 2010; Sueki, 2015), there have been attempts to develop learning-based models using a suicide dictionary created and curated by mental health experts. For example, Gaur et al. (2019) created an English suicide dictionary validated by four clinical psychiatrists, and Lv et al. (2015) curated a Chinese suicide dictionary with eleven mental health experts. However, little attention has been paid to whether existing dictionaries developed for specific countries or languages (e.g., English or Chinese) can be used for predicting suicidal ideation in other languages such as Korean or Japanese, for which no suicide dictionary has yet been developed. This paper proposes and evaluates a model for predicting suicidal ideation using Korean social media data by exploiting multiple suicide dictionaries developed for other languages (i.e., English and Chinese).

Cross-lingual Suicidal Ideation Detection Model
We propose a suicidal ideation detection model that can identify whether a given post includes suicidal ideation or not. To utilize the existing suicide-related dictionaries developed for other languages (i.e., English and Chinese) in word embedding, our model translates a post written in the target language (i.e., Korean) into English and Chinese and then uses the separate word embeddings developed for English and Chinese, respectively. Note that we use Naver Papago (Lee et al., 2016) for translation, which is known to be an efficient translator from Korean to other languages. By applying an ensemble approach across the different languages, our proposed model finally predicts the suicidal ideation of the given post in Korean. Figure 1 illustrates the overall architecture of our proposed model.

Suicidal-oriented Word Embedding
We adopt a suicidal-oriented word embedding similar to prior work (Cao et al., 2019) that refines a word embedding to capture domain knowledge from a pre-built suicide-related dictionary. Figure 2 illustrates the model that identifies whether a given sentence contains a suicidal expression or not.

Generating suicidal and non-suicidal expressions
For training a suicidal-oriented word embedding, we use a pre-built suicide-related dictionary. If such a dictionary contains word-level information that exhibits how strongly a word is associated with suicidal ideation (like the Chinese dictionary (Lv et al., 2015)), we apply the word-masking classification method similar to the prior work (Cao et al., 2019). To this end, we generate suicidal and non-suicidal expressions for a given suicide-related post collected for word embedding, e.g., Weibo Tree Hole data (Cao et al., 2019). The suicidal expression is generated from the input post itself. To generate a non-suicidal expression, we replace all the suicide-related words (in the dictionary) with "[mask]" in the given input. To avoid learning from the "[mask]" tokens themselves, we also randomly add two "[mask]" tokens to the suicidal expression. During training, we randomly select 50% of the generated suicidal and non-suicidal expressions, respectively, for each epoch.

If a pre-built suicide-related dictionary contains sentence-level information, such as the Gold Standard Dataset (Gaur et al., 2019), which includes English sentences related to suicidal ideation, directly applying the word-masking method (Cao et al., 2019) is not possible because words for masking cannot be extracted from a sentence-level dictionary. Hence, we use the sentences belonging to the dictionary as suicidal expressions and non-suicide-related posts as non-suicidal expressions for developing the sentence-level word embedding. To generate non-suicidal expressions, we randomly select (non-suicide-related) posts on Reddit. Note that the ratio of the generated suicidal and non-suicidal expressions is 1:1.
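The word-level masking procedure can be sketched as follows. This is a minimal illustration: the function name, the random insertion positions, and the tokenized-input assumption are ours, not the authors' implementation.

```python
import random

def make_expressions(post_words, dictionary_words, n_extra_masks=2, seed=0):
    """Generate a (suicidal, non-suicidal) expression pair from one post.

    The non-suicidal expression replaces every dictionary word with "[mask]";
    the suicidal expression keeps the post but adds two random "[mask]"
    tokens so the model cannot learn to key on "[mask]" itself.
    """
    rng = random.Random(seed)
    dictionary = set(dictionary_words)

    # Non-suicidal: mask out all suicide-related words from the dictionary.
    non_suicidal = ["[mask]" if w in dictionary else w for w in post_words]

    # Suicidal: the post itself, with extra "[mask]" tokens inserted at
    # random positions.
    suicidal = list(post_words)
    for _ in range(n_extra_masks):
        suicidal.insert(rng.randrange(len(suicidal) + 1), "[mask]")
    return suicidal, non_suicidal
```

At each training epoch, one would then sample 50% of the generated pairs, as described above.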

Word embedding
Given a set of words A_j = {w_1, w_2, ..., w_n} in an expression j labeled as suicidal or non-suicidal, let Y_j = {y_1, y_2, ..., y_n} ∈ IR^(n×d_e) be the word embedding of A_j, where n is the number of words in expression j and d_e is the dimension of the embedding. Each word embedding in Y_j is then fed into an LSTM cell to derive a textual representation:

h_t = LSTM(y_t, h_(t−1)),

where h_(t−1) and h_t represent the hidden states at time t − 1 and t, respectively. Note that H_A = {h_1, h_2, ..., h_n} ∈ IR^(n×d_e) represents the textual representation of A_j. Finally, the model classifies whether the expression is suicidal or not:

ŷ_j = softmax(h_n W_1 + b_1),

where W_1 ∈ IR^(d_e×2) and b_1 ∈ IR^(1×2) are trainable parameters.
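A minimal from-scratch sketch of the recurrence over a word-embedding matrix might look as follows. The gate layout and parameter names follow standard LSTM conventions and are not taken from the paper; a real implementation would use a deep-learning framework's LSTM layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(y_t, h_prev, c_prev, W, U, b):
    """One step of the recurrence, with gates stacked as [i, f, o, g]."""
    d = h_prev.shape[0]
    z = W @ y_t + U @ h_prev + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    g = np.tanh(z[3 * d:])
    c_t = f * c_prev + i * g   # new cell state
    h_t = o * np.tanh(c_t)     # new hidden state
    return h_t, c_t

def encode(Y, W, U, b):
    """Run the recurrence over a word-embedding matrix Y (n x d_e) and
    return the stacked hidden states H (n x d), i.e., the textual
    representation of the expression."""
    d = U.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    states = []
    for y_t in Y:
        h, c = lstm_step(y_t, h, c, W, U, b)
        states.append(h)
    return np.stack(states)
```

The last row of the returned matrix corresponds to h_n, which feeds the final classification layer.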

Post Attention Layer
Given a post p_i, by passing it through the corresponding suicidal-oriented word embedding, we obtain the word embedding X_i = {x_1, x_2, ..., x_n} ∈ IR^(n×d_e) of post p_i. We then feed X_i into the LSTM layer:

h_t = LSTM(x_t, h_(t−1)).

Note that H_p = {h_1, h_2, ..., h_n} ∈ IR^(n×d_e) is the textual representation of p_i after the LSTM layer. We then apply the attention mechanism to reflect the important suicide-related information of H_p:

Attention = softmax(tanh(H_p W_3 + b_3)ᵀ),
S = tanh([Attention H_p ; h_n] W_4 + b_4),

where Attention ∈ IR^(1×n) is the attention score vector, S ∈ IR^(1×32) is the final contextual vector, h_n is the hidden state of the last LSTM cell, tanh is the activation function, and [· ; ·] denotes concatenation. W_3 ∈ IR^(256×1), b_3 ∈ IR^(1×1), W_4 ∈ IR^(512×32), and b_4 ∈ IR^(1×32) are trainable parameters. Figure 3 represents the architecture of the post attention layer.
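Under the shapes stated above (W_3 ∈ IR^(256×1) scoring each hidden state, W_4 ∈ IR^(512×32) projecting the attention-weighted context concatenated with h_n), the post attention layer can be sketched in NumPy as follows. The exact composition of the operations is inferred from those shapes and is our assumption, not the authors' released code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def post_attention(H, W3, b3, W4, b4):
    """Attention over the LSTM hidden states H (n x 256) of one post.

    W3 (256x1), b3 (1x1) score each hidden state; the attention-weighted
    context is concatenated with the last hidden state h_n and projected
    by W4 (512x32), b4 (1x32) into the contextual vector S (1x32).
    """
    scores = np.tanh(H @ W3 + b3).reshape(1, -1)   # 1 x n attention scores
    attention = softmax(scores)                     # 1 x n, sums to 1
    context = attention @ H                         # 1 x 256 weighted context
    h_n = H[-1:, :]                                 # 1 x 256 last hidden state
    S = np.tanh(np.concatenate([context, h_n], axis=1) @ W4 + b4)  # 1 x 32
    return attention, S
```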

Ensemble Layer
For a given set of contextual vectors for the different languages, S_KR, S_CN, and S_EN in our case, we concatenate them to obtain the total post representation Q ∈ IR^(1×96):

Q = [S_KR ; S_CN ; S_EN].

A fully-connected layer with the ReLU activation function is first applied to Q. We then finally classify whether the post p_i includes suicidal ideation or not:

ŷ_i = softmax(ReLU(Q W_5 + b_5) W_6 + b_6),

where W_5 ∈ IR^(96×32), b_5 ∈ IR^(1×32), W_6 ∈ IR^(32×2), and b_6 ∈ IR^(1×2) are trainable parameters.
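The ensemble layer maps directly onto a few matrix operations. A NumPy sketch under the stated shapes (with hypothetical parameter values, not the trained weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def relu(x):
    return np.maximum(x, 0.0)

def ensemble_predict(S_kr, S_cn, S_en, W5, b5, W6, b6):
    """Concatenate the per-language contextual vectors (each 1x32) into
    Q (1x96), apply a ReLU fully-connected layer (W5: 96x32, b5: 1x32),
    then a softmax output layer (W6: 32x2, b6: 1x2) over the two classes
    {suicidal, non-suicidal}."""
    Q = np.concatenate([S_kr, S_cn, S_en], axis=1)  # 1 x 96 post representation
    hidden = relu(Q @ W5 + b5)                       # 1 x 32
    return softmax(hidden @ W6 + b6)                 # 1 x 2 class probabilities
```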

Suicide Data
To develop models for predicting suicidal ideation for posts written in Korean, we collected suicide-related and non-suicide-related Korean posts from Naver Cafe. To improve model performance, we further collected data for generating suicide word embeddings for Chinese, English, and Korean, respectively. Note that all the collected data is anonymized, so no user information is identifiable.

Target Post Data for Predicting Suicidal Ideation
To collect suicide-related and non-suicide-related posts, we selected Naver Cafe, operated by Naver, one of the most popular web-based services in Korea (Nam et al., 2009; Park et al., 2014). Like a subreddit on Reddit, a user can create a topic-based community in Naver Cafe, called a 'cafe', where members can communicate with others by writing posts or commenting on posts. We collected 10,000 suicide-related posts from a representative suicide-related cafe in Korea, 'Talking about Suicide', where users share their interest in suicide, and 21,723 non-suicide-related posts from two popular cafes, 'Goodbye Single' and 'Cafe Powder Room', where people socialize and share their daily lives. Table 1 shows examples of the suicide-related and non-suicide-related posts.

Data for Suicidal-Oriented Word Embedding
To train the suicidal-oriented word embeddings for each language (i.e., Korean, English, and Chinese), we further collected three sets of data, one per language, to generate the suicidal and non-suicidal expressions explained in Section 3.1.

Suicide Dictionary
We first obtained the pre-built existing suicide dictionaries based on domain knowledge in Chinese and English. We also created a suicide dictionary in Korean to evaluate the model performance with a dictionary written in the same language (i.e., Korean) as our target suicide-related posts but computationally generated without any medical knowledge base. We detail the suicide dictionary for each language, summarized in Table 2, as follows.
The Chinese suicide dictionary was built by Lv et al. (2015) and includes 2,168 words extracted from 1.06 M posts in Sina Weibo. Note that each word has a score in the range of 1 to 3, assigned by three experts, indicating how strongly the given word expresses suicidal ideation.
We obtained an English suicide dictionary, titled the Gold Standard Dataset, which was developed by Gaur et al. (2019). It contains 500 users' posts from the "r/SuicideWatch" subreddit on Reddit. Each user is annotated with one of five levels of suicide severity (i.e., Indicator, Ideation, Behavior, Attempt, and Supportive) by practicing psychiatrists. We used only four levels, excluding the 'Supportive' level to avoid confusion, because a user in the 'Supportive' class can be regarded as one without suicide risk but may show linguistic characteristics similar to users in the other classes. Finally, we obtained 7,286 posts written by 373 users.
To generate a Korean suicide dictionary, we collected posts from representative suicide-related online communities in Korea. We collected 1,258 and 6,332 suicide-related posts from two suicide-related public web forums, "Lifeline Korea" 4 and "Companions of Life, Suicide Prevention Counselling" 5 , where a user can share his/her suicidal ideation and be supported by a mental health counselor. We further collected 2,410 suicide-related posts from the Naver cafe 'Talking about Suicide'. Note that the additional data collected from the Naver cafe is only used for generating the Korean dictionary, not for training the suicidal ideation detection model. Following the method proposed in prior work (De Choudhury et al., 2013; Burnap et al., 2015), we then extracted the top 1,000 keywords from all the collected posts using TF-IDF.
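A simplified stand-in for the TF-IDF keyword extraction might look like the following. Ranking words by their maximum TF-IDF score across posts is our simplification; the cited prior work may weight or aggregate differently.

```python
import math
from collections import Counter

def top_keywords(posts, k=1000):
    """Rank words by their maximum TF-IDF score over a collection of
    tokenized posts and return the top-k keywords."""
    n_docs = len(posts)
    # Document frequency: how many posts contain each word.
    df = Counter(w for post in posts for w in set(post))
    best = {}
    for post in posts:
        tf = Counter(post)
        for w, c in tf.items():
            # Term frequency times inverse document frequency.
            score = (c / len(post)) * math.log(n_docs / df[w])
            best[w] = max(best.get(w, 0.0), score)
    return [w for w, _ in sorted(best.items(), key=lambda x: -x[1])[:k]]
```

Words appearing in every post get an IDF of zero and are effectively filtered out, which is the desired behavior for generic filler words.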

Posts for Word Embedding
To train a suicidal-oriented word embedding, we further collected suicide-related posts for Chinese and Korean, whose dictionaries contain word-level information, and non-suicide-related posts for English, whose dictionary contains sentence-level information, for generating suicidal and non-suicidal expressions. In particular, for training the Chinese word embedding, we collected the Tree Hole posts in Sina Weibo used in prior studies (Cao et al., 2019; Zhao et al., 2018), where users have shared their thoughts on suicide by exchanging over 100 M comments. Using the Weibo API, we obtained 6,093 posts from March 11th to 31st, 2020. For training the Korean word embedding, we obtained another set of 2,410 suicide-related posts from the Naver cafe 'Talking about Suicide', where we collected the data for predicting suicidal ideation. Note that this newly added data is only used for word embedding. For training the English word embedding, we collected 102 K non-suicide-related posts from three subreddits on Reddit, "r/AskReddit", "r/Showerthoughts", and "r/CasualConversation", where users share casual topics or daily events.

4 https://www.lifeline.or.kr/
5 http://www.counselling.or.kr/

Language Difference on Suicide
To analyze whether and how similar suicide topics are shared across different languages, we compare the top 100 keywords identified in each language. To identify the suicide-related keywords, we further collected 107,606 English suicide-related posts from the "r/SuicideWatch" subreddit on Reddit and 6,297 Chinese non-suicide-related posts from the "Popular" section in Weibo. Note that the non-suicide-related posts were used to exclude generally popular keywords from the top suicide-related keywords. Table 3 summarizes the number of posts for each language used in this analysis.

To compare the different languages, Korean and Chinese posts were translated into English using Naver Papago (Lee et al., 2016). We then performed stemming and extracted unigrams and bigrams. To exclude keywords commonly used in both suicide-related and non-suicide-related posts, we removed the keywords that also appear in the top 100 keywords of the non-related posts. Finally, we obtained the top 100 keywords from the suicide-related posts in each language.

Figure 4 shows a Venn diagram representing how the top 100 keywords for the suicide-related posts in the different languages overlap. As shown in Figure 4, 27 keywords are common to all three languages. In particular, the commonly overlapping words tend to directly express suicidal ideation ('die', 'want die'), show negative emotion ('hate', 'cry', 'hurt', 'sad', 'wrong', 'pain'), and mention family ('mother', 'mom', 'parents', 'family', 'dad'). The intersection between Korean and English (45 words) includes more common keywords than that between Korean and Chinese (38 words) or between Chinese and English (41 words). This implies that Korean and English tend to share more common topics on suicide than the other pairs. For example, we find that the keywords overlapping between Korean and English tend to be related to loneliness ('left', 'alone') and hope ('able', 'want').
Taking a closer look at the 44 top keywords unique to Korean, the target language for our model evaluation in Section 4.1, we find that they tend to mention life plans ('job', 'dream'), school life ('high school', 'middle school', 'student', 'grade'), beauty ('face'), siblings ('brother', 'sister'), and the past ('year ago', 'ago'), which are not observed in the other languages, i.e., English and Chinese.
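The filtering step described above, dropping keywords that also rank among the top non-suicide-related keywords before taking the top 100, can be sketched as follows (function and argument names are ours):

```python
def distinctive_top_keywords(suicide_ranked, nonsuicide_ranked, k=100):
    """Given two keyword lists ranked by score (highest first), drop the
    suicide-related keywords that also appear in the non-suicide-related
    top-k, then return the top-k of the remainder."""
    common = set(nonsuicide_ranked[:k])
    return [w for w in suicide_ranked if w not in common][:k]
```

Applying this per language, then intersecting the resulting sets, yields the overlaps visualized in the Venn diagram.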
In summary, our analysis reveals that utilizing suicide word embeddings for other languages can help improve the performance of our model that predicts suicidal ideation, as different languages are likely to share similar topics. In addition, the ensemble of multiple languages in our model can be useful since it can capture linguistic or cultural differences around suicide.

Experiments
We evaluate the proposed cross-lingual suicidal ideation detection model by answering the following research questions:

• RQ1: Can the word embedding refined by a suicide dictionary with domain knowledge in other languages (e.g., Chinese, English) improve the model performance?
• RQ2: Is the refined word embedding based on the suicide dictionary created with a computational approach (without domain expert knowledge) useful in identifying suicidal ideation, compared to one with a pre-built existing suicide dictionary created by domain experts in a foreign language?
• RQ3: Can an ensemble from the multiple models with different languages improve the model performance?

Models
To answer the above questions, we evaluate the following models:

• ML-baseline is the mono-lingual (ML) model, which takes a post written in a single language as an input. For example, ML-baseline (language: CN) indicates a model taking a post written in Chinese translated from Korean as an input. Note that we use well-known general pre-trained embeddings, i.e., Word2vec (Le and Mikolov, 2014) and FastText (Joulin et al., 2017).
• ML-refined is the same as the ML-baseline but uses the word embedding refined by the suicide dictionary, as explained in Section 3. For example, ML-refined (language: English, word-embedding: refined-word2vec) represents a model that uses the word2vec word embedding refined by the English suicide dictionary to learn posts written in English translated from Korean.
• CL-mixed is an ensemble cross-lingual (CL) model that combines multiple mono-lingual language models, e.g., ML-baseline (Korean) and ML-refined (Chinese). Note that we use the general pre-trained word embedding (e.g., word2vec) for the language where the suicide dictionary with domain knowledge does not exist (i.e., Korean), and the refined word embedding(s) for the language(s) where the suicide dictionary is constructed by domain experts (i.e., Chinese and English).
• CL-ours is the same as the CL-mixed but uses the word embedding refined by the suicide dictionary for the input language, Korean.

Results
To answer the questions, we evaluate the performance of each model, summarized in Table 4.

RQ1: Effect on Using Suicide Dictionaries in Other Languages

We first compare the results of the ML-baseline and ML-refined models. We find that the ML-refined (CN or EN) models show similar or lower performance than the ML-refined (KR) or ML-baseline (KR) models, meaning that using an existing suicide dictionary developed by domain experts in another language does not help improve the model performance. This may be due to the cultural difference in suicide-related language, discussed in Section 4.3, which showed different language usage around suicide across languages; e.g., only 38 of the top 100 suicide-related keywords are shared between Korean and Chinese. Note that the ML-baseline (CN, EN) models perform worse than the ML-baseline (KR), indicating that a translated post (e.g., from Korean to Chinese) can be used for identifying suicidal ideation but shows limited performance.

RQ2: Effect on Using Suicidal-Oriented Word Embedding Created by a Computational Approach
To answer the second question, we evaluate the model with the suicidal-oriented word embedding created by a computational approach (without domain expert knowledge), the ML-refined (KR) model. As shown in Table 4, the ML-refined (KR) model improves over the ML-baseline (KR) model. This implies that when a suicide dictionary generated by domain experts does not exist, a suicidal-oriented word embedding generated by a computational approach is still useful for identifying suicidal ideation. This is because suicidal people tend to use their own distinctive words (Gaur et al., 2019), and the computational approach can capture such distinct patterns.

RQ3: Effect on Using Cross-Lingual Suicidal-Oriented Word Embeddings
To evaluate our ensemble approach that uses multiple languages together, we compare the CL-mixed (KR+CN) and ML-baseline (KR) models. As shown in Table 4, the CL-mixed models overall outperform the ML-baseline models, meaning that our cross-lingual approach is useful for identifying suicidal ideation. By combining the model for Korean with one for Chinese or English, a potential limitation due to cultural and linguistic differences can be mitigated.
Lastly, our final model, CL-ours, shows the best performance, achieving 87.5% accuracy. This demonstrates that the proposed cross-lingual model can detect suicidal ideation with high accuracy, which has important implications for preventing and managing potential suicide risks.

Conclusion
This paper proposed a cross-lingual suicidal ideation detection model that provides a cost-effective way to predict suicidal ideation from social media data written in a language for which no suicide dictionary exists. We proposed to apply (i) suicidal-oriented word embeddings developed for other languages (i.e., English and Chinese), (ii) an attention mechanism for post representation, and (iii) an ensemble approach to reflect potential cultural and linguistic differences. The proposed model achieved high accuracy, over 87%, signifying its utility in detecting suicidal ideation using social media data for preventing potential suicide risk at an early stage.