Mental Distress Detection and Triage in Forum Posts: The LT3 CLPsych 2016 Shared Task System

This paper describes the contribution of LT3 for the CLPsych 2016 Shared Task on automatic triage of mental health forum posts. Our systems use multiclass Support Vector Machines (SVM), cascaded binary SVMs and ensembles with a rich feature set. The best systems obtain macro-averaged F-scores of 40% on the full task and 80% on the green versus alarming distinction. Multiclass SVMs with all features score best in terms of F-score, whereas feature filtering with bi-normal separation and classifier ensembling are found to improve recall of alarming posts.


Introduction
The 2016 ACL Workshop on Computational Linguistics and Clinical Psychology included a shared task focusing on triage classification in forum posts from ReachOut.com, an online service for youth mental health issues. The aim is to automatically classify an unseen post as one of four categories indicating the severity of mental distress. ReachOut staff has annotated a corpus of posts with crisis/red/amber/green semaphore labels that indicate how urgently a post needs moderator attention.
The system described in this paper is based on a suicidality classification system intended for Dutch social media (Desmet and Hoste, 2014). Therefore, we approach the current mental distress triage task from a suicide detection standpoint.

Related Work
Machine learning and natural language processing have already shown potential in modelling and detecting suicidality in the arts (Stirman and Pennebaker, 2001; Mulholland and Quinn, 2013) and in electronic health records (Haerian et al., 2012). However, work on computational approaches to the automatic detection of suicidal content in online user-generated media is scarce.
One line of research focuses on detecting suicidality in individuals relying on their post history: Huang et al. (2007) aim to identify Myspace.com bloggers at risk of suicide by means of a keyword-based approach using a manually collected dictionary of weighted suicide-related terms. Users were ranked by pattern-matching keywords on their posts. This approach suffered from low precision (35%), and the data do not allow recall to be measured, i.e. the number of actually suicidal bloggers that are missing from the results. Similarly, Jashinsky et al. (2014) manually selected keywords by testing search queries linked to various risk factors in a user's Twitter profile. In order to validate this search approach, users posting tweets that match the suicide keywords were grouped by US state for trend analysis. The proportion of at-risk tweeters vs. control-group tweeters was strongly correlated with the actual state suicide rates. While this methodology yields a correct proportion of at-risk users, it is unclear how many of those tweets are false positives and how many at-risk tweets are missing.
Going beyond a keyword-based approach, Guan et al. (2015) applied linear regression and random forest learners to Chinese Weibo.com microbloggers. Suicidality labels were assigned to users in the data set by means of an online psychological evaluation survey. As classification features they took social media profile metadata and psychometric linguistic categories in a user's post history. Results showed that the linear regression and random forest classifiers obtain similar scores, with a maximum F-score of 35% (23% precision and 79% recall).
As in the CLPsych 2016 Shared Task, another line of research aims to classify suicidality on the post level, rather than the level of user profiles. Desmet and Hoste (2014) proposed a detection approach using machine learning with a rich feature set on posts in the Dutch social media platform Netlog. Their corpus was manually annotated by suicide intervention experts for suicide relevance, risk and protective factors, source origin, subject of content, and severity. Two binary classification tasks were formulated: a relevance task which aimed to detect posts relevant to suicide, and a threat detection task to detect messages that indicate a severe suicide risk. For the threat detection task, a cascaded setup which first filters irrelevant messages with SVM and then predicts the severity with k-Nearest Neighbors (KNN) performed best: 59.2% F-score (69.5% precision and 51.6% recall). In general, both KNN and SVM outperform Naive Bayes and SVM was more robust to the inclusion of bad features. The system presented in this paper is for the most part an extension and English adaptation of this suicidal post detection pipeline.

System Overview
We investigated a supervised classification-based approach to the mental distress triage task using SVMs. Below, we describe the data and features that were used, and the way classifiers were built, optimized and combined.

Data
Labeled data sets: 1/8th of the manually annotated training data was sampled as a held-out development set (n = 118 with at least 4 instances of each class), the remainder (n = 829) was used for training. In the results section, we also report on the held-out test set (n = 241).
Reddit background corpus: In order to perform terminology extraction and topic modelling, we collected domain-relevant text from Reddit.com, a predominantly English social news and bulletin board website. We used the title and body text from all opening posts in mental health and suicide-related boards posted between 2006 and 2014, resulting in an 82.7 million token corpus of over 270,000 posts. The selected boards mainly contain user-generated discussion on mental health, depression, and suicidal thoughts, similar to the ReachOut forums.
Tokenization and preprocessing: All textual data was tokenized and lower-cased to reduce variation. For topic modelling, emoji and punctuation were removed. Pattern (De Smedt and Daelemans, 2012) was used for lemmatization.

Features
We aimed to develop a rich feature set that focused on lexical and semantic information, with fine-grained and more abstract representations of content. Some syntactic and non-linguistic features were also included.
Bag-of-words features: We included binary token unigrams, bigrams and trigrams, along with character trigrams and fourgrams. The latter provide robustness to the spelling variation typically found in social media.
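To illustrate why character n-grams help with spelling variation, the following is a minimal sketch of binary n-gram extraction. It is not the system's actual implementation: the function names and the whitespace tokenizer are assumptions made for the example.

```python
def ngrams(seq, n):
    """Return the set of contiguous n-grams in a sequence (presence only)."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def bag_of_words_features(text):
    """Binary token 1- to 3-grams and character 3- and 4-grams of a post."""
    text = text.lower()
    # Assumption: whitespace splitting stands in for the real tokenizer.
    tokens = text.split()
    features = set()
    for n in (1, 2, 3):
        features |= {("tok", " ".join(g)) for g in ngrams(tokens, n)}
    for n in (3, 4):
        features |= {("chr", "".join(g)) for g in ngrams(text, n)}
    return features


a = bag_of_words_features("I feel sooo tired")
b = bag_of_words_features("I feel soooo tired")
# The token unigrams "sooo" and "soooo" differ, but character n-grams overlap.
shared_char_ngrams = {f for f in a & b if f[0] == "chr"}
```

In this toy case the two spellings share no token unigram for the lengthened word, yet most of their character n-grams coincide, which is the robustness the paragraph above refers to.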
Term lists: Domain-specific multiword terms were derived from the Reddit background corpus, using the TExSIS terminology extraction tool (Macken et al., 2013). One list was based on suicide-specific boards (/r/SuicideWatch and /r/suicidenotes, 2884 terms), the other included terms only found in other mental health boards (1384 terms).
Lexicon features: We computed positive and negative opinion word ratio and overall post sentiment using both the MPQA (Wilson et al., 2005) and Hu and Liu's (2004) opinion lexicons. We added positive, negative and neutral emoji counts based on the BOUNCE emoji sentiment lexicon (Kökciyan et al., 2013). We also included the relative frequency of all 64 psychometric categories in the Linguistic Inquiry and Word Count (LIWC) dictionary (Pennebaker et al., 2007). LIWC features proved useful for modelling suicidality in literary works (Stirman and Pennebaker, 2001). Furthermore, we included diminisher, intensifier, negation, and "allness" lexica because of their significance in suicide notes analysis (Osgood and Walker, 1959; Gottschalk and Gleser, 1960; Shapero, 2011).
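As an illustration of the opinion word ratios, a sketch along the following lines could be used. The word sets below are toy placeholders rather than entries from the MPQA or Hu and Liu lexicons, and defining overall sentiment as the difference between the two ratios is an assumption, since the exact formula is not specified here.

```python
def opinion_ratios(tokens, positive, negative):
    """Return (positive ratio, negative ratio, overall sentiment) for a post.

    Assumption: overall sentiment = (positive - negative) / token count.
    """
    n = len(tokens)
    if n == 0:
        return 0.0, 0.0, 0.0
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return pos / n, neg / n, (pos - neg) / n


# Toy lexicon entries, for illustration only.
POSITIVE = {"happy", "hope"}
NEGATIVE = {"hopeless", "sad"}

ratios = opinion_ratios(["i", "feel", "hopeless", "and", "sad"], POSITIVE, NEGATIVE)
```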
Topic models: Using the gensim topic modelling library (Řehůřek and Sojka, 2010) we trained several LDA (Blei et al., 2003) and LSI (Deerwester et al., 1990) topic models with varying granularity (k = 20, 50, 100, 200). A similarity query was done on each model resulting in two feature groups: k topic similarity scores and the average similarity score. This should allow the classifier to learn which latent topics are relevant for the task, and to what extent the topics align with the ones in the Reddit background corpus. In line with Resnik et al. (2015), we used topic models to capture latent semantic and syntactic structure in the mental health domain. However, we did not include supervised topic models.
Syntactic features: Two binary features were implemented indicating whether the imperative mood was used in a post and whether person alternation occurred (i.e. combinations of first and second person pronouns).
Post metadata: We furthermore included several non-linguistic features based on a post's metadata: the time of day a post was made (expressed in three-hour blocks), the board in which it was posted, whether the post includes a subject line or a URL, the role of the author and whether he or she is a moderator, whether the post is the first in a thread, whether there are (moderator) reactions or kudos (i.e. thumbs-up votes).
When applied to the training data, this resulted in 59 feature groups and 107,852 individual features, the majority of which were bag-of-words features (almost 96%).

Classifiers
Using SVMs, we tested three different approaches to the problem of correctly assigning the four triage labels to the forum posts. We considered detection of posts with a high level of alarm (crisis or red) to be the priority. Where possible, recall of the priority labels was promoted, since false negatives are most problematic there.
With multiclass SVMs, one model is used to predict all four labels at once. We hypothesized that distinguishing green from non-green posts would require different information than detecting the more alarming categories. We therefore also tested cascades of three binary SVMs, in which each classifier predicts a higher level of alarm: green vs. rest; red or crisis vs. rest; and crisis vs. rest. The binary results are combined in a way that the label with the highest level of alarm is assigned. This essentially sacrifices some precision on lower-priority classes for better high-priority recall.
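The combination rule for the cascade can be sketched as follows. This is one reading of the description above, not the system's code: the function name and the boolean interface are assumptions, and each argument stands for the positive ("more alarming") decision of one binary classifier.

```python
def combine_cascade(non_green, red_or_crisis, crisis):
    """Assign the most alarming label supported by any binary decision.

    non_green:      green vs. rest classifier predicted "rest"
    red_or_crisis:  red-or-crisis vs. rest classifier predicted positive
    crisis:         crisis vs. rest classifier predicted positive
    """
    if crisis:
        return "crisis"
    if red_or_crisis:
        return "red"
    if non_green:
        return "amber"
    return "green"
```

Under this reading, a post is labelled crisis whenever the crisis classifier fires, even if the green vs. rest classifier considers it green, which is how the cascade trades precision on lower-priority classes for recall on the high-priority ones.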
Finally, we tested ensembles of various multiclass and binary systems. Predictions were combined with two voting methods: normal majority voting (reported as ensemble-majority), and crisis-priority voting (ensemble-priority) where the most alarming label with at least 2 votes is selected.
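The two voting schemes can be sketched as below. Tie-breaking in majority voting and the fallback when no label reaches two votes are not specified in the text; both are handled here by assumption (ties go to the more alarming label, and priority voting falls back to majority voting).

```python
from collections import Counter

# Labels ordered from least to most alarming.
ALARM_ORDER = ["green", "amber", "red", "crisis"]


def majority_vote(predictions):
    """Most frequent label; ties broken towards the more alarming label (assumption)."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [label for label, c in counts.items() if c == top]
    return max(tied, key=ALARM_ORDER.index)


def priority_vote(predictions, min_votes=2):
    """Most alarming label that received at least `min_votes` votes."""
    counts = Counter(predictions)
    for label in reversed(ALARM_ORDER):
        if counts[label] >= min_votes:
            return label
    # Assumption: fall back to majority voting if no label has enough votes.
    return majority_vote(predictions)


votes = ["green", "green", "green", "red", "red"]
```

For the example votes, majority voting returns green while priority voting returns red, which shows why the priority scheme tends to raise recall on alarming posts.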

Optimization
Typically, the performance of a machine learning algorithm is not optimal when it is used with all implemented features and with the default algorithm settings. SVMs are known to perform well in the presence of irrelevant features, but dimensionality reduction can still be beneficial for classification accuracy and resource usage. In this section, we describe the methods we tested for feature selection and hyperparameter optimization.
With feature filtering, a metric is used to determine the informativeness of each feature, given the training data. Yang (1997) found that Information Gain (IG) allows aggressive feature removal with minimal loss in accuracy. Forman (2003) corroborates this finding, but remarks that IG is biased towards the majority class, unlike the Bi-Normal Separation (BNS) metric, which typically achieves better minority class recall. In the results, we compare both filtering methods (-ig and -bns) to no filtering (-nf). IG was applied with a threshold of 0.005 (92-97% reduction), BNS with threshold 3 (79-93% reduction for binary tasks, no multiclass support).
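For reference, BNS as defined by Forman (2003) scores a binary feature as the absolute difference between the inverse normal CDF of its true positive rate and that of its false positive rate. The sketch below is a minimal illustration using the Python standard library; the function names are hypothetical, and the clipping constant follows Forman's suggestion rather than a setting reported for this system.

```python
from statistics import NormalDist


def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|.

    tp, fp: number of positive / negative documents containing the feature
    pos, neg: total number of positive / negative documents
    Rates are clipped to [eps, 1 - eps] to keep the inverse CDF finite.
    """
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    inv = NormalDist().inv_cdf
    return abs(inv(tpr) - inv(fpr))


def filter_features(stats, pos, neg, threshold=3.0):
    """Keep features whose BNS score reaches the threshold.

    stats maps a feature name to its (tp, fp) document counts.
    """
    return {f for f, (tp, fp) in stats.items() if bns(tp, fp, pos, neg) >= threshold}


# Toy counts, for illustration only.
kept = filter_features({"informative": (99, 1), "uninformative": (50, 50)}, 100, 100)
```

Because the score depends only on the two rates and not on class size, a feature that is frequent in a small positive class can still score highly, which is consistent with the better minority class recall noted above.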
We also applied wrapped optimization, where combinations of selected feature groups and hyperparameters are evaluated with SVM using three-fold cross-validation. Exhaustive exploration of all combinations was not possible, so we used genetic algorithms to approximate an optimal solution (Desmet et al., 2013). In the results section, all reported systems have been optimized for feature group and hyperparameter selection, except for multiclass-unopt (baseline without filtering or optimization) and multiclass-hyper (only hyperparameter optimization, no feature filtering or selection).

Results and discussion
In Table 4, we report the four-label classification results of all systems. Most systems perform well in comparison to the shared task top score of 42% macro-averaged F-score, with the multiclass-nf submission scoring highest at 40%. This indicates that the implemented features and approach are within the current state of the art. Arguably, macro-averaged F-score is a harsh metric for this task: it treats the three alarming categories as disjunct, although confusion between those classes can be high and the distinction may not matter much from a usability perspective. Since the test set only contained one crisis instance, failing to detect it effectively limits the ceiling for macro-averaged F-score to 67%. This partly explains the low scores in Table 4. For comparison, we list F-score, precision and recall for the green vs. alarming distinction in Table 4. Alarming posts can be detected with F = 80% and recall up to 89% (ensemble-priority).
We tested three classifier configurations, and find that a multiclass approach performs as well as or better than more complex systems. On the development data, ensemble systems perform best, although this is not confirmed by the four-label test results, possibly due to paucity of crisis instances. It appears that ensembles are a sensible choice especially if recall is important. This may be due to the inclusion of the high-recall binary-bns cascade, the low precision of which is offset by ensemble voting. Overall, the aim of improving recall with cascaded and ensemble classifiers seems to have been effective: compared to multiclass systems, they all favour recall over precision more, both on development and test data.
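The 67% ceiling mentioned above follows from simple arithmetic, sketched here under the assumption that the macro-average is taken over the three alarming classes only (crisis, red and amber): if the single crisis post is missed, the crisis F-score is 0, and even perfect scores on the other two classes average out to two thirds.

```python
def macro_f(per_class_f):
    """Unweighted mean of per-class F-scores."""
    return sum(per_class_f.values()) / len(per_class_f)


# Best case when the only crisis instance is missed: crisis F = 0, others perfect.
best_case = {"crisis": 0.0, "red": 1.0, "amber": 1.0}
ceiling = macro_f(best_case)
```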
The unoptimized multiclass-unopt acts as a majority baseline that always predicts green, indicating that hyperparameter optimization is essential. Feature selection, on the other hand, does not yield such a clear benefit. On the held-out test data, the nf systems consistently outperform their ig and bns counterparts in terms of F-score. On the development data, feature filtering has a positive effect on recall, particularly when BNS is applied. In summary, the applied feature selection techniques are sometimes successful in removing the bulk of the features without harming performance, although the results suggest that they may remove too many or cause overfitting.

Conclusion
This paper discussed an SVM-based approach to the CLPsych 2016 shared task. We found that our systems performed well within the state of the art, with macro-averaged F-scores of 40% on the full task, and 80% for the distinction between green and alarming posts, suggesting that confusion between the three alarming classes is high. Multiclass systems performed best, but ensemble classifiers and feature filtering with BNS perform comparably and are better suited when high recall is required.