Classification of mental health forum posts

We detail our approach to the CLPsych 2016 triage of mental health forum posts shared task. We experiment with a number of features in a logistic regression classification approach. Our baseline approach, with lexical features from a post and from previous posts in the reply chain, gives our best performance of 0.33, which is roughly the median for the task.


Introduction
The CLPsych 2016 shared task requires the triage of forum posts from ReachOut.com, a support forum for youth mental health issues. The triage task centres on directing forum moderators to the posts which require the most immediate attention (Calvo et al., 2016). For this task, a set of posts from the forum is annotated with one of the labels crisis, red, amber or green, indicating decreasing degrees of urgency for moderator attention. All unlabelled posts are also made available to systems.
This task follows other studies of social media discourse as it relates to clinical psychology (Thompson et al., 2014; Schwartz et al., 2014; Coppersmith et al., 2015; Schrading et al., 2015). Analysis of ReachOut.com posts is interesting as posts are made by young individuals who have originally come to the forum seeking some kind of help, but over time may participate in several different capacities. Typically most users will initially need support, but this need may substantially increase or decrease over time; users may also support each other, or use the forums for activity unrelated to mental health. Our approach to this task was primarily focussed on implementing a straightforward baseline and experimenting with a few ideas derived from experience looking at the data in detail. While the data is inherently sequential, we choose not to model this as a sequence problem, primarily because we expect the meaningful sequences to be fairly short: typically users either create new posts that are generally relevant to the original post in a thread, or reply to a specific post.
We further motivate this local post comparison by considering the annotation flowchart distributed with the data. Many labelling decisions are affected by whether the user's state is considered to be the same, or whether their condition has worsened. Key to this task is capturing change in author language, and identifying how this reflects a change in their state of mind and a change in their condition.
We implement a feature set based on basic post features and author history and thread context, using the sequence of replies that lead to a post as the context for that post. We experiment with a number of additional features, but our baseline approach provides our best result of 0.33, which puts our performance at the median overall.

Features
We make use of post lexical features, author history and thread history for classification.

Preprocessing
Prior to extracting features, we perform some basic preprocessing on post text. We unescape HTML entities, remove images and replace emoticons with the name of the emoticon to simplify processing. We remove blockquotes entirely, as we want extracted features to come from the content of the current post. We tokenise using the NLTK TweetTokenizer, as we expect the web forum text to be fairly casual and similar to the Twitter domain for the purposes of tokenisation.
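The preprocessing steps above can be sketched as follows. This is a minimal illustration, not our exact implementation: the emoticon table is a hypothetical subset, and the tokeniser falls back to whitespace splitting if NLTK is unavailable.

```python
import html
import re

try:
    from nltk.tokenize import TweetTokenizer  # tokeniser used in the paper
    _tokenize = TweetTokenizer().tokenize
except ImportError:                           # stand-in if NLTK is absent
    _tokenize = str.split

# Hypothetical subset of an emoticon-to-name table.
EMOTICONS = {":)": "smile", ":(": "frown"}

def preprocess(post_html):
    text = html.unescape(post_html)                       # unescape HTML entities
    text = re.sub(r"<img[^>]*>", " ", text)               # remove images
    text = re.sub(r"<blockquote>.*?</blockquote>", " ",   # drop quoted text
                  text, flags=re.DOTALL)
    for emo, name in EMOTICONS.items():                   # emoticon -> its name
        text = text.replace(emo, " {} ".format(name))
    return _tokenize(text)
```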

Lexical features
We extract unigrams and bigrams as post features, and use this same feature space for the context features below.
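A minimal sketch of unigram and bigram extraction over a tokenised post (the function name is illustrative):

```python
def ngram_features(tokens):
    """Return the set of unigrams and bigrams for a token sequence."""
    feats = set(tokens)                                   # unigrams
    feats.update("{} {}".format(a, b)                     # bigrams
                 for a, b in zip(tokens, tokens[1:]))
    return feats
```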

Reply chain features
Instead of using the sequence of posts in a thread as context, we use the chain of replies leading to a post as that post's context. We draw on two posts in this chain: the most recent post with the same author as the current post, and the most recent post overall. We retrieve unigrams and bigrams for these posts, then extract three types of features: n-grams that occur in both the previous post and the current post; those that occur in the current post but not the previous post; and those that occur in the previous post but not the current one. Note that there are separate feature spaces for author posts and non-author posts.
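The three reply-chain feature types reduce to set operations over the n-gram sets of the current and context posts. The sketch below uses a prefix argument so that author and non-author contexts land in separate feature spaces; the naming scheme is an assumption for illustration.

```python
def chain_features(current, context, prefix):
    """Binary features from comparing the current post's n-gram set
    against a context post's n-gram set."""
    feats = {}
    for ng in current & context:          # n-grams shared with the context post
        feats["{}_shared_{}".format(prefix, ng)] = 1
    for ng in current - context:          # n-grams new in the current post
        feats["{}_new_{}".format(prefix, ng)] = 1
    for ng in context - current:          # n-grams dropped from the context post
        feats["{}_dropped_{}".format(prefix, ng)] = 1
    return feats
```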

Unused features
We experimented with a number of features which did not improve results. These include n-gram features from the first post in a post's thread; lemmas instead of words; cosine similarity between post bags-of-words; and thread type. We manually identify thread types for threads which have a substantially different structure to others, such as Turning Negatives Into Positives and TwittRO. We identify 1 thread as game, 2 as media (e.g. image threads), 5 as semi-structured and 5 as short (e.g. TwittRO).

Data and training
The released training corpus contains 65,024 posts, 947 of which are annotated with triage labels. For development, we split these into a train set of 697 posts and a development set of 250 posts. We use a scikit-learn logistic regression classifier, performing a grid search over regularisation hyperparameters with 10-fold cross-validation over the train set. Results on development data are shown in Table 1. Figure 1 shows the confusion matrix, including green classifications. We note that a large number of confusions occur between amber and green, largely due to their larger representation in the data. For the full task we use all 947 annotated posts for training. The test set adds a further 731 posts.
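The training setup can be sketched with scikit-learn as below. The grid of C values and the scoring metric are assumptions for illustration; the 10-fold cross-validation and logistic regression classifier follow the text.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def train(X, y):
    """Grid-search the regularisation strength C with 10-fold CV
    and return the best logistic regression model."""
    grid = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # assumed grid
        cv=10,
        scoring="f1_macro",                          # assumed scorer
    )
    grid.fit(X, y)
    return grid.best_estimator_
```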
We experimented with a cascaded classification approach, classifying crisis v. non-crisis, red v. non-red and amber v. non-amber in sequence; however, this approach did not perform well. We also experimented with treating the task as a regression task, mapping crisis to a value of 1.0, red to 0.66, amber to 0.33, and green to 0.0. The idea is that we expect there to be a gradient of post severity rather than a distinct underlying set of 4 labels, and that this gradient may be better modelled via a regression approach. Our implementation yields lower results than our approach using discrete labels, but we consider this a possible direction for future approaches to this task.
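The label-to-severity mapping for the regression variant, and a way of snapping a regressor's continuous prediction back to a discrete label, could look like the following. The mapping values come from the text; the nearest-label decoding is an assumption about how predictions are discretised.

```python
# Severity values for the four triage labels, as given in the text.
SEVERITY = {"crisis": 1.0, "red": 0.66, "amber": 0.33, "green": 0.0}

def to_severity(labels):
    """Map triage labels to regression targets."""
    return [SEVERITY[label] for label in labels]

def to_label(value):
    """Snap a predicted severity back to the closest discrete label
    (assumed decoding; not necessarily the authors' scheme)."""
    return min(SEVERITY, key=lambda label: abs(SEVERITY[label] - value))
```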

Results
We submit two runs, with L2 regularisation (run 1, regularisation parameter C = 1) and L1 regularisation (run 2, regularisation parameter C = 100). Our official results are in Table 2, with per-label breakdowns of each run in Tables 3 and 4. While other labellings fall outside the official metric for the shared task, we are interested in the performance of a system trained only on non-green vs. green, as opposed to all 4 triage labels. We run this configuration with the same settings as run 1. This configuration has an F-score of 0.80 on our development data, and a score of 0.82, which is above our multiple-label F-score of 0.73. This may be a useful setup for a two-stage classification, or for an actual deployment for ReachOut.com moderators.

Discussion
Run 1 performs at the median, and may be an informative baseline. Interestingly, many of the features that we explored decreased performance or did not significantly improve it. This is possibly due to feature sparsity: the amount of training data is relatively small, and most of these features are likely uninformative. We note that L2 regularisation gives our best performance: since the data set is small, L2's retention of more features from the training data compensates for feature sparsity better than L1 regularisation.

Conclusion
We participated in the CLPsych 2016 shared task, providing a baseline approach using a small feature set that gave a near-median performance of 0.33. We look forward to continuing to work on this task.