Suicidal Risk Detection for Military Personnel

We analyze social media posts to detect the suicidal risk of military personnel, which is especially crucial for countries with compulsory military service such as the Republic of Korea. From a widely used Korean social Q&A site, we collect posts containing military-relevant content written by active-duty military personnel. We then annotate the posts with two groups of experts: military experts and mental health experts. Our dataset includes 2,791 posts with 13,955 corresponding expert annotations of suicidal risk levels, and this dataset is available to researchers who consent to a research ethics agreement. Using various fine-tuned state-of-the-art language models, we predict the level of suicide risk, reaching a .88 F1 score for classifying the risks.


Introduction
Suicide is one of the major causes of death in the military. In countries where military service is compulsory under a conscription system, active-duty military personnel live physically separated from their family and friends for an extended period of time, often against their will. In the Republic of Korea, for example, most men are obligated to serve in the military for about a year and a half, leading to a large active-duty population of about 600,000 as of this year. Many of them experience difficulty adapting to the isolated environment, and some of them are at risk of suicide.
One approach to detecting the suicide risk signs of active-duty soldiers is the analysis of social media posts, similar to the approach used for detecting the suicide risk of the general public (Milne et al., 2016; Yates et al., 2017; Zirikly et al., 2019). However, research on military suicide finds distinct risk factors, such as combat exposure, injury, bereavement, and negative unit climate, associated only with military service (Nock et al., 2013). For this reason, we cannot directly apply findings from suicide risk research on the general public to military personnel. In this paper, we take on the challenge of collecting social media posts related to military service in the Republic of Korea, annotating them, and analyzing them with NLP methods for detecting the suicide risk of active-duty soldiers.
The first and most challenging step is to create an annotated dataset of military-related social media posts written by active-duty soldiers. We collect posts from a popular social question-and-answer (Q&A) platform. Because the platform allows anonymous posts, it contains a considerable number of military-related posts that reveal possible suicide risk and other mental health issues. Annotation poses a challenge, as the mental health and suicide risk of soldiers should be analyzed by mental health experts experienced in the military setting. Such experts are difficult to find, so we reach out to two separate groups: military experts and mental health experts. We ask both groups for annotation, and our analysis covers both the annotation results and the suicide risk prediction results.
Our focused contribution is in building a dataset of 2,791 social media posts written by military personnel in Korean, with 13,955 corresponding expert annotations of suicidal risk levels. We fine-tune various state-of-the-art language models to classify the risks, developing simple yet effective baselines and achieving up to a .88 F1 score.

Constructing Annotated Dataset
We describe the steps in collecting relevant posts, preprocessing, and annotation. We also explain how we dealt with research ethics concerns.

Table 1: Annotation criteria for the four suicidal risk levels.

Imminent Risk (3)
• Expresses self-harm or suicide directly and explicitly
• Makes concrete plans for suicide: seeking access to hazardous tools or pills
• Existence of triggering events, making a will, etc.

High Risk (2)
• Expresses self-harm or suicide indirectly and implicitly
• Desire for suicidal behavior, suicidal ideation, self-harm
• Risk factors becoming severe due to stressful events, relationship problems, etc.

Low Risk (1)
• Expresses depression, stress, or anxiety due to environmental or internal factors
• Maladaptation to (military) service, but could become adaptable through adequate measures
• Requires continuous treatment by a therapist or psychiatrist

No Risk (0)
• No help required because no risky sign is detected
• Expresses mild sadness
• Simple questions
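The four levels function as ordinal integer labels throughout the experiments below; a trivial Python mapping (the variable name is ours) makes the encoding explicit:

```python
# Risk levels from Table 1, encoded as ordinal integer labels.
RISK_LEVELS = {0: "No Risk", 1: "Low Risk", 2: "High Risk", 3: "Imminent Risk"}
```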

Data Collection
Collecting Posts. We collect relevant posts from Naver Knowledge iN, an online Korean Q&A (question answering) platform, in 2019. As on Quora.com, people ask questions through anonymous posts without length constraints, and in some cases users disclose personal matters to obtain advice from others. To collect relevant posts, we use 58 military-related keywords combined with suicide- or self-harm-related terms, for instance 'military force', 'army + self-harm', and 'army + suicide'. For every keyword, we collect the most recent 1,000 posts without any metadata such as username or timestamp, because these features could make person identification quite easy. Through this process, we collect 44,108 posts.

Preprocessing. We preprocess the collected posts in three steps. Because the 58 search keywords return many duplicated posts, we first remove duplicates, which reduces the data size. Then we manually remove any post written by family or friends of a soldier, or by anyone unrelated to the military. We retain only posts written by soldiers themselves so that the trained model can detect suicide risk signals of active military personnel based on their first-person accounts.
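As an illustration of the first preprocessing step, here is a minimal deduplication sketch; the whitespace normalization rule is our assumption, not the paper's exact procedure:

```python
# Exact-duplicate removal across the overlapping keyword queries.
def deduplicate(posts):
    seen, unique = set(), []
    for post in posts:
        key = " ".join(post.split())  # normalize whitespace before comparing (assumed rule)
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```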
Next, we manually remove personally identifiable information and all named entities in the text. We found 44 unit names, 7 school names, 10 grade numbers, 2 region names, and 2 personal names and user ids, and replaced them with unidentifiable placeholders. After preprocessing, we are left with 2,791 posts, with an average length of 92.7 words.

Ethical Concerns. We carefully consider potential ethical concerns throughout this research. We collect posts only if they are publicly available on Naver Knowledge iN, and we do not collect any metadata of the posts because it could potentially be used to identify authorship. Also, we manually inspect every post to remove personally identifiable information, masking all named entities. These processes are costly and make it very difficult to build a large-scale corpus. Annotators are shown only anonymized posts, and the annotated data will be available to researchers who give express consent not to contact or de-anonymize any of the posts. This study was reviewed and approved by the KAIST Institutional Review Board (#KH2019-122).
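A sketch of the placeholder substitution described above; the entity strings and placeholder tokens here are hypothetical stand-ins for the manually compiled lists:

```python
import re

# Hypothetical entity lists; the real lists (44 unit names, 7 school names,
# 10 grade numbers, 2 region names, 2 personal names/user ids) were compiled
# during manual inspection of the posts.
ENTITY_PLACEHOLDERS = {
    "UNIT_NAME": ["1st Example Division"],
    "SCHOOL_NAME": ["Example High School"],
    "PERSON_NAME": ["Gildong Hong"],
}

def mask_entities(text: str) -> str:
    """Replace each known entity mention with an unidentifiable placeholder."""
    for placeholder, names in ENTITY_PLACEHOLDERS.items():
        for name in names:
            text = re.sub(re.escape(name), f"<{placeholder}>", text)
    return text
```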

Annotating Suicidal Risk
Five annotators evaluated the degree of suicidal risk of the writers (military personnel) in the anonymized posts. We annotate risk at the post level because anonymous posts have no user names to begin with, and the remaining posts are anonymized by removing user names due to the ethical concerns above.

Annotation Criteria. Table 1 shows our annotation criteria, which come from existing shared task settings (Milne et al., 2016; Zirikly et al., 2019) and guidelines such as the 'Classification criteria of soldiers in need' issued by the Ministry of National Defense. All posts are annotated with a risk level from 0 to 3, from the lowest to the highest level of detected risk.

Annotation Process. We recruit two external expert annotators (E1: psychiatrist, E2: psychotherapist) and three internal expert annotators (I1: military counselor, I2: commander, I3: commander). Each annotator independently assigned one of the four risk levels to each of the 2,791 posts. For posts that showed disagreement within a group, that group's annotators were asked to evaluate the risk once again independently. Annotators could choose whether to change their initial evaluation, but in most cases they did not.
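A small sketch of the second-pass selection rule (the data layout and names are ours): a post goes back to a group only when that group's annotators disagree among themselves.

```python
# Select posts for re-annotation: within-group disagreement triggers a second pass.
def needs_second_pass(labels_by_group):
    """labels_by_group: per-post labels, e.g. {"external": [3, 2], "internal": [3, 3, 3]}"""
    return {group: len(set(labels)) > 1 for group, labels in labels_by_group.items()}

print(needs_second_pass({"external": [3, 2], "internal": [3, 3, 3]}))
# -> {'external': True, 'internal': False}
```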

Annotation Results
Here we show risk level frequencies for each annotator and the degrees of agreement among them.

Distribution of Risk Levels. As shown in Table 2, most posts are labeled as No Risk or Low Risk. External annotators tend to evaluate most posts as No Risk, while internal annotators label more posts as Low Risk than No Risk. The proportion of posts labeled as High Risk or Imminent Risk is relatively small.

Table 3: Cohen's inter-annotator agreement (IAA) coefficients across annotators. The overall IAA shows fair agreement; agreement within each group is higher than agreement between groups.
Inter-Annotator Agreement (IAA). Table 3 presents Cohen's IAA among annotators. We find that within-group agreement is higher than between-group agreement. Within each group, the external annotators show fair agreement (Krippendorff's α = 0.58), as do the internal annotators (Krippendorff's α = 0.55). The overall agreement among the five annotators (Krippendorff's α = 0.37) is lower than the within-group agreements. In addition, when annotations are binarized to 'Flagged' or not, the values are α = 0.53 for the internal group, α = 0.52 for the external group, and α = 0.30 overall. Again, we observe fair agreement within groups, but the agreement across all annotators is rather low because of the difference between the two groups.
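The group-level α values can be computed with the `krippendorff` package; a sketch with a toy annotator-by-post matrix standing in for the real annotation table (the choice of measurement level is our assumption):

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators (E1, E2, I1, I2, I3), columns = posts; toy values only.
ratings = np.array([
    [0, 1, 3, 0, 1, 2],
    [0, 1, 2, 0, 0, 2],
    [1, 1, 3, 0, 1, 2],
    [1, 2, 3, 1, 1, 3],
    [1, 1, 3, 0, 1, 2],
], dtype=float)

overall = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
external = krippendorff.alpha(reliability_data=ratings[:2], level_of_measurement="ordinal")
internal = krippendorff.alpha(reliability_data=ratings[2:], level_of_measurement="ordinal")

# Binarized 'Flagged' agreement: any risk level >= 1 counts as flagged.
flagged = (ratings >= 1).astype(float)
alpha_flagged = krippendorff.alpha(reliability_data=flagged, level_of_measurement="nominal")
```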
Comparison between Perspectives. Table 4 shows a few manually selected examples of risk annotation by the two groups. For the first and second examples, the two groups assigned the same scores. The first example explicitly expresses suicidal thoughts and even a failed suicide attempt, so all annotators agreed that the writer seems to require immediate help. The second example merely asks about a skin problem, and both groups annotated this post as No Risk.
The third and fourth examples show the difference between the two perspectives. For the third example, the internal annotators rated the risk higher than the external annotators: the writer is under stress due to problems adjusting to military life, suffering from depression, and thinking about suicide. The internal annotators judge that this would have a highly negative effect on the poster's life in the unit, so the risk factor is relatively high and the military must pay attention, from the perspective of a commander responsible for this soldier's life and work.
Table 4: Examples of annotation by the external (E) and internal (I) groups, translated from Korean.

E (High, 3.00), I (High, 3.00):
"I'm a soldier now.. I'm so tired of depression, insomnia, and hallucination. Every day I try to sleep, some voices tell me not to sleep, so I can't sleep without medicine... When I'm in a group of people, my heart beats fast, hard to breathe, head hurts, and I feel dizzy. And when I look in the mirror, I'm so surprised to see someone behind me, even though nobody's there. It was a shock to me... because I can definitely hear and see it. I feel like I'm lying. I don't know what's happening. I feel so sorry to my family, but after I die, hard time would be just a moment. I heard that 'Actifed' is bad for people with high blood pressure like me. I bought 100 pills at the pharmacy. I took all tablets of Actifed, but just throwing up 4 times and being paralyzed for an hour. I still feel pain when I move. I'm sorry I couldn't die."

"After a few days I was assigned to my platoon, I have felt weird symptoms. Due to the rebuke and curse from senior soldiers, I was so nervous that I couldn't carry out my mission efficiently, and couldn't think or judge well just like a teenager. I hate to be with others, and I didn't really want to live every day. I usually come up with suicidal ideas, but I try hard to withstand the situation by thinking of my parents, and shedding tears alone. I have a continuous headache with dizziness, get to sleep irregularly, and I'm in a daze. It feels like what I'm doing isn't mine, and I feel depressed all day long. I don't even have an appetite. But unit says it's difficult to discharge me early because my situation is not bad enough, and it doesn't look very serious. My unit's refusing though my medical report says 'Consider maladaptation to service.' I'm having a hard time every day. I want to get out of the unit and get counseling and proceed with treatment."

However, the external annotators expect little suicidal risk because the writer wants to visit a therapist or psychiatrist anyway rather than moving on to suicidal behavior. The fourth example expresses thoughts about killing oneself, so the external annotators give it a higher risk score, but the internal annotators see less risk because they have commonly heard this type of negative expression about the mandatory military service among soldiers.
Considering these examples and others, we conclude that both perspectives should be taken into account, both jointly and separately, when predicting the suicidal risk of military personnel with a computational model.

Experiments
We classify the posts by their annotated risk levels at the post level. To aggregate the multiple annotations for each post, we take the maximum of the risk annotations, with the aim of raising an alert whenever there is any possibility of suicidal risk. This may increase false positives, but the experts' view is that, in practice, false positives are preferable to false negatives.
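The aggregation is a one-liner, but worth stating precisely; a minimal sketch (the function name is ours):

```python
# Aggregate one post's annotations at the highest assigned risk level,
# preferring false positives over false negatives.
def aggregate(annotations):
    """annotations: per-annotator risk levels for one post, e.g. [0, 1, 1, 2, 0]."""
    return max(annotations)

assert aggregate([0, 1, 1, 2, 0]) == 2  # a single High Risk vote sets the post's label
```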
Classification Types. We classify posts in three ways: 1) the four risk levels, 2) Flagged or not, and 3) Urgent or not. For the binary classifications, we consider Low, High, and Imminent Risk posts as Flagged, and High and Imminent Risk posts as Urgent.

Models. We leverage two types of models: 1) a Convolutional Neural Network (CNN) and 2) pre-trained language models, both of which were used in the relevant shared task (Zirikly et al., 2019). Participating teams demonstrated that a CNN is effective for the risk classification task (Morales et al., 2019), and ASU (Ambalavanan et al., 2019) showed that fine-tuning a pre-trained language model is highly effective. Note that our dataset contains Korean posts with post-level risk annotations, so these previous models must be adapted to our dataset.
Specifically, we use a CNN with pre-trained Korean subword-level word embeddings (Park et al., 2018) as the input to two convolution layers. For pre-trained language models, we can choose between multilingual models trained on corpora that include Korean and a Korean-specific model; the models we use are described in Appendix A.1.

Among the annotator groups, the internal expert group yields the highest overall scores for both F1 and accuracy (F1 = 0.88, acc = 0.92), while the external expert group shows the lowest average F1 in 4-level risk classification (F1 = 0.56, acc = 0.85). This is caused by the small number of Imminent Risk labels in the external annotators' evaluations.
Also, Flagged-post classification performance is higher than that of the 4-level or Urgent-post classification, which implies that our model distinguishes posts with any level of risk from No Risk posts well. In the 'Flagged' condition, labels (1), (2), and (3) are combined into a single class, so the imbalance between classes is partially relieved, leading to a better F1 score. In practice, this would be quite helpful when considering intervention.
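A sketch of how the three settings can be scored with scikit-learn; the paper does not state whether its F1 is macro- or weighted-averaged, so the averaging choice below is an assumption:

```python
from sklearn.metrics import f1_score

# Toy labels; y_true holds max-aggregated 4-level risk labels in {0, 1, 2, 3}.
y_true = [0, 1, 3, 2, 0, 1, 0, 2]
y_pred = [0, 1, 2, 2, 0, 0, 0, 2]

# 1) 4-level risk classification (averaging choice assumed).
f1_4level = f1_score(y_true, y_pred, average="weighted")

# 2) Flagged (level >= 1) and 3) Urgent (level >= 2) binary settings.
f1_flagged = f1_score([int(y >= 1) for y in y_true], [int(y >= 1) for y in y_pred])
f1_urgent = f1_score([int(y >= 2) for y in y_true], [int(y >= 2) for y in y_pred])

print(f1_4level, f1_flagged, f1_urgent)
```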

Discussion and Conclusions
In this paper, we tackle the problem of detecting the suicidal risk of military personnel from their social media posts. We focus on the specific population of personnel in compulsory military service because it requires a unique approach to fully understand their suicidal risk. As a first step, we collect 2,791 military-relevant posts from a social Q&A platform written by active-duty soldiers and remove any identifying information from the data. Then five annotators (three military experts and two clinicians) evaluate the degree of suicidal risk of the posts. After constructing the dataset, we fine-tune pre-trained language models, achieving at most a 0.88 F1 score.
Our research can be a first step toward proper intervention programs and institutional support for soldiers with mental health issues; such follow-up would maximize the value of our model and data. We also plan to add domain-specific features to our model, collect more data, and integrate existing suicidal risk datasets in various languages to improve performance.

A.1 Models

KoBERT. A BERT-base model trained on a corpus consisting of Korean Wikipedia and news data, intended to improve on multilingual BERT for Korean. The pre-trained model is publicly available on GitHub (https://github.com/SKTBrain/KoBERT). We fine-tune this model on our data, with all details the same as described for Multilingual BERT above. The number of trainable parameters is 110M.

XLM-R. A state-of-the-art pre-trained cross-lingual language model trained on a corpus that includes Korean documents (Conneau et al., 2020). The model is based on the RoBERTa architecture and has been shown to be highly effective on cross-lingual language understanding tasks. Like the other BERT variants, we fine-tune this model on our data, adding the same classification layer. The number of trainable parameters is 275M.

A.2 Hyperparameters
The batch size is set to 32 and the maximum sequence length to 512 for the CNN, Multilingual BERT, KoBERT, and XLM-R. The learning rate of all models is set to 3e-5, following previous settings (Devlin et al., 2018; Conneau et al., 2020). The batch size and sequence length are chosen manually to fit the models to our computing infrastructure. All models are trained on a single RTX 2080Ti GPU, and every run converges within at most 3 hours.
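Putting the stated hyperparameters together, a minimal fine-tuning sketch with HuggingFace Transformers; the checkpoint name, toy data, and training-loop details are our assumptions, and the paper's exact setup may differ:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hyperparameters as stated above; the checkpoint is one plausible choice.
MODEL_NAME = "xlm-roberta-base"  # or "bert-base-multilingual-cased"; KoBERT needs its own tokenizer
BATCH_SIZE, MAX_LEN, LR = 32, 512, 3e-5

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

# Toy stand-in for the real training set of (anonymized post, risk level) pairs.
train_data = [("example post text", 1), ("another example", 0)]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), truncation=True, max_length=MAX_LEN,
                    padding=True, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)
model.train()
for batch in loader:
    loss = model(**batch).loss  # cross-entropy over the 4 risk levels
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```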

A.3 Data Splits
We train the classifiers on a training set of 1,674 posts and evaluate on a test set of 559 posts. We use a validation set of 558 posts for tuning the hyperparameters of our models. The number of examples in each split is shown in Tables 7-9.
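The split sizes correspond to roughly 60/20/20 of the 2,791 posts; a sketch (the random seed, and whether the split was stratified by label, are our assumptions):

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the 2,791 (anonymized post, risk level) pairs.
examples = [(f"post {i}", i % 4) for i in range(2791)]

# 60/40, then split the remaining 40% in half.
train, rest = train_test_split(examples, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(val), len(test))  # 1674 558 559
```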

A.4 Most Frequent Baselines
When aggregating all five annotators' labels into the 4 risk levels, the Low Risk label accounts for 54.28% of all posts. Under the binarized Flagged label, Flagged is more frequent (75.24%) than Not Flagged, and under the binarized Urgent label, Not Urgent accounts for 78.68%. Aggregating the labels of the three internal experts shows a similar tendency: Low Risk is the most frequent class at 54.32% of the 4 risk levels, Flagged posts are more frequent (75.42%) than Not Flagged, and Not Urgent posts are more frequent (78.90%) than Urgent posts.

Related Work

This kind of research requires annotated data, so much effort has been made toward data collection and dissemination. The 2nd Workshop on Computational Linguistics and Clinical Psychology (CLPsych'15) introduced a shared task (Coppersmith et al., 2015) to identify users with depression and post-traumatic stress disorder (PTSD) using a Twitter dataset. The shared task at CLPsych'19 introduced an assessment of suicide risk based on social media postings, using data from Reddit to identify four levels of risk (Zirikly et al., 2019). Yates et al. (2017) introduced a large-scale Reddit dataset containing 9,000 users with self-reported depression diagnoses, along with over 107,000 control users. Other research created a general Reddit dataset for the assessment of suicide risk via online postings (Shing et al., 2018).
Unlike previous studies, our work focuses on a specific at-risk population. The suicidal risk of military personnel can more easily result in tragic consequences because of their easier access to firearms (Nock et al., 2013; Oh and Lee, 2017).

Mental Health Problems of Military Personnel.
Since mental health problems in the military differ from those of the general population, they should be treated distinctly. Previous research on soldiers' mental health examines patients with PTSD and other traumatic experiences, mainly investigating medical records, questionnaires, psychological measurement tools, interviews, or administrative data (Kim et al., 2011; Bryan et al., 2013a; Thompson et al., 2014; Bryan et al., 2013b; Reger et al., 2018; Anestis et al., 2019; Start et al., 2019). One study using social media posts investigates the temporal changes in military personnel's posts during the year preceding their death, through content coding and multilevel models (Bryan et al., 2018). That work focuses on explaining the factors of suicide from posts, rather than training a model to predict risk on unseen data.
Our work applies a computational method to social media posts to predict suicidal risk from unseen posts without additional manual coding. This research opens an important new direction in the computational analysis of mental health for a special at-risk population.