Automatic Triage of Mental Health Forum Posts

As part of the 2016 Computational Linguistics and Clinical Psychology (CLPsych) shared task, participants were asked to construct systems to automatically classify mental health forum posts into four categories, representing how urgently posts require moderator attention. This paper details the system implementation from the University of Florida, in which we compare several distinct models and show that best performance is achieved with domain-specific preprocessing, n-gram feature extraction, and cross-validated linear models.


Introduction
As more and more social interaction takes place online, the wealth of data provided by these online platforms is proving to be a useful source of information for identifying early warning signs of poor mental health. The goal of the 2016 CLPsych shared task was to predict the degree of moderator attention required for posts on the ReachOut forum (https://au.reachout.com/), an online youth mental health service that provides support to young people aged 14-25. Along with the analysis of forum-specific meta-information, this task includes aspects of sentiment analysis, the field of study that analyzes people's opinions, sentiments, attitudes, and emotions from written language (Liu, 2012). Several studies have explored the categorization and prediction of user sentiment in social media platforms such as Twitter (Agarwal et al., 2011; Kouloumpis et al., 2011; Spencer and Uchyigit, 2012; Zhang et al., 2011). Other studies have also applied sentiment analysis techniques to MOOC discussion forums (Wen et al., 2014) and suicide notes (Pestian et al., 2012), both highly relevant to this shared task.
Our straightforward approach draws from successful text classification and sentiment analysis methods, including the use of a sentiment lexicon (Liu, 2010) and Word2Vec distributed word embeddings (Mikolov et al., 2013), along with more traditional methods such as normalized n-gram counts. We utilize these linguistic features, as well as several hand-crafted features derived from the metainformation of posts and their authors, to construct logistic regression classifiers for predicting the status label of ReachOut forum posts.

Dataset
As part of the shared task, participants were provided a collection of ReachOut forum posts from July 2012 to June 2015. In addition to the textual post content, posts also contained meta-information such as author ID, author rank/affiliation, post time, thread ID, etc. A training set of 947 such posts was provided, each with a corresponding moderator attention label (green, amber, red, or crisis). An additional 65,024 unlabeled posts were also provided. The test set consisted of 241 unlabeled forum posts.

System
In this section, we describe the implementation details of our classification system. In short, our relatively straightforward approach involves selecting and extracting heterogeneous sets of features for each post, which are then used to train separate logistic regression classifiers for predicting the moderator attention label. We report results for each model individually, and experiment with various classifier ensembles. Results were obtained following a randomized hyperparameter search and 10-fold cross-validation process. For clarity, we subdivide our features into two categories: post attributes and text-based features. We only extracted features for the 947 posts in the labeled training set; however, several of our features were historical in nature, utilizing information from the entirety of the unlabeled dataset of 65,024 posts.

Table 1 (excerpt): Post attribute features.
(feature name and type not recoverable): The forum "ranking" of the current author.
Affiliation (Binary): Whether the current author is a member of the ReachOut forum staff.
Board Fraction (Numeric): The fraction of the current author's total posts that were made in the current post's subforum.

Attribute Features
As a starting point for classifying posts as green, amber, red, or crisis, we began by examining several attributes of each post and its corresponding author.
Many of our attribute features were immediately available from the raw dataset and required no further processing. A small sample of these statistics includes the post's view count, kudos count, author rank, and the subforum in which the post is located.
We also incorporated historical attributes that were derived from the entirety of the unlabeled dataset. These include items such as thread size, mean author kudos/views, number of unique reply authors, etc. Our full list of post attributes is shown in Table 1.
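As an illustration of how such a historical attribute might be derived, the following minimal sketch computes the board fraction (the fraction of an author's total posts made in the current post's subforum). The dictionary keys `author_id` and `board` are hypothetical field names chosen for this example; they are not taken from the shared task data format.

```python
from collections import Counter

def board_fractions(posts):
    """Return, for each post, the share of its author's posts made in the same subforum.

    `posts` is a list of dicts with hypothetical keys "author_id" and "board".
    """
    # Total number of posts per author across the whole corpus.
    per_author = Counter(p["author_id"] for p in posts)
    # Number of posts per (author, subforum) pair.
    per_author_board = Counter((p["author_id"], p["board"]) for p in posts)
    return [
        per_author_board[(p["author_id"], p["board"])] / per_author[p["author_id"]]
        for p in posts
    ]

# Toy data, invented for illustration only.
example_posts = [
    {"author_id": "a", "board": "x"},
    {"author_id": "a", "board": "x"},
    {"author_id": "a", "board": "y"},
    {"author_id": "b", "board": "y"},
]
example_fractions = board_fractions(example_posts)
```

Computing the counts over the full corpus, rather than over the labeled posts alone, is what makes the feature historical in the sense used above.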

Text Features
Each post in the dataset was associated with two sources of free text: the subject line and the body content. Since the post content itself is what moderators themselves look to when deciding whether action should be taken, we speculated that these features were of the greatest importance. We applied several text-based feature extraction techniques, and began with an in-depth preprocessing phase.

Preprocessing
Since the textual information of each post was formatted as raw HTML, our first preprocessing step involved converting the post content to plain text. During this process, we replaced all user mentions (i.e., @user) with a special string token. We also built a map of all embedded images, of which the majority were forum-specific emoticons, and replaced occurrences in the text with special tokens denoting which image was used. We performed a similar technique for links, replacing each one with a special link identifier token. Finally, in an effort to reduce noise in the text, we removed all text contained within <BLOCKQUOTE> tags, which typically contained text that a post is replying to. After these conversions, we stripped all remaining HTML tags from each post, resulting in plain-text subject and body content.
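The HTML conversion steps described above can be sketched with Python's standard `html.parser` module. This is an illustrative reconstruction under stated assumptions, not the system's actual code: the token strings (`xxmentionxx`, `xxlinkxx`, `xximg...xx`), the choice to drop link anchor text, and the use of the image file name to identify an image are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

MENTION_TOKEN = "xxmentionxx"  # hypothetical token names
LINK_TOKEN = "xxlinkxx"

class PostTextExtractor(HTMLParser):
    """Collect plain text, skipping quoted replies and replacing links/images with tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.quote_depth = 0  # > 0 while inside <blockquote>
        self.link_depth = 0   # > 0 while inside <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BLOCKQUOTE> arrives as "blockquote".
        if tag == "blockquote":
            self.quote_depth += 1
        elif self.quote_depth == 0 and tag == "a":
            self.link_depth += 1
            self.parts.append(f" {LINK_TOKEN} ")
        elif self.quote_depth == 0 and tag == "img":
            # Assumption: the image file name identifies the emoticon.
            src = dict(attrs).get("src") or ""
            stem = src.rsplit("/", 1)[-1].split(".")[0]
            name = re.sub(r"[^0-9a-z]+", "", stem.lower())
            self.parts.append(f" xximg{name}xx ")

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.quote_depth > 0:
            self.quote_depth -= 1
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Keep text only when outside quoted replies and outside link anchors.
        if self.quote_depth == 0 and self.link_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"@\w+", MENTION_TOKEN, text)  # replace user mentions
    return " ".join(text.split())                # collapse whitespace

example_html = (
    '<p>Thanks @sam <img src="/i/smiley_happy.gif"> see '
    '<a href="http://x.org">this</a></p><BLOCKQUOTE>old text</BLOCKQUOTE>'
)
example_text = html_to_text(example_html)
```

The placeholder tokens are purely alphanumeric so that they would survive a later punctuation-stripping step unchanged.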
While examining the corpus, we also noticed the frequent presence of text-based emoticons, such as ':)' and '=('. We used an emoticon sentiment lexicon, which maps text-based emoticons to either a positive or negative sentiment, to convert each textual emoticon to one of two special tokens denoting the corresponding emoticon's polarity. We manually annotated 12 additional emoticons that were not present in the pre-existing lexicon.
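A minimal sketch of this emoticon normalization follows. The small lexicon and the two token names are invented for illustration; the actual lexicon and the 12 manually annotated emoticons are not reproduced here.

```python
import re

# Hypothetical excerpt of an emoticon polarity lexicon.
EMOTICON_POLARITY = {
    ":)": "positive",
    ":-)": "positive",
    ":D": "positive",
    ":(": "negative",
    "=(": "negative",
    ":'(": "negative",
}

# Hypothetical polarity tokens.
POLARITY_TOKENS = {"positive": " xxposemoxx ", "negative": " xxnegemoxx "}

# Sort longest first so that ":-)" is tried before shorter emoticons.
_EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_POLARITY, key=len, reverse=True))
)

def replace_emoticons(text):
    """Replace each known emoticon with the token for its polarity."""
    replaced = _EMOTICON_PATTERN.sub(
        lambda m: POLARITY_TOKENS[EMOTICON_POLARITY[m.group(0)]], text
    )
    return " ".join(replaced.split())

example_emoticon_text = replace_emoticons("feeling ok :) but tired =(")
```

This step has to run before punctuation is removed, since the emoticons consist almost entirely of punctuation characters.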
Since we found the subject and body text to be highly related, we concatenated these texts into a single string per post. In an effort to further reduce noise in the text, we examined the subject line of each post, and if it was of the form "Re: ..." and contained the same subject text of the post it was replying to, we discarded the subject line.
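The subject-line rule can be sketched as below. The function name, the `parent_subject` argument, and the case-insensitive comparison are assumptions made for this example.

```python
def merge_subject_and_body(subject, body, parent_subject=None):
    """Concatenate subject and body, dropping a "Re: ..." subject that repeats the parent's."""
    subject = subject.strip()
    if parent_subject is not None and subject.lower().startswith("re:"):
        # Compare the text after "Re:" with the subject of the post being replied to.
        if subject[3:].strip().lower() == parent_subject.strip().lower():
            subject = ""
    return (subject + " " + body).strip()

merged_reply = merge_subject_and_body("Re: Feeling lost", "I hear you", "Feeling lost")
merged_new = merge_subject_and_body("Feeling lost", "Today was hard")
```

A reply whose subject differs from the parent's keeps its subject line, since in that case the subject carries new information.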
Finally, we finished our preprocessing phase with several traditional techniques, including converting all text to lowercase and removing all punctuation. We also converted non-unicode symbols to their best approximation. We did not remove traditional stop words, since our experiments showed that doing so decreased classifier performance for this domain.
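These final normalization steps might look like the following sketch. It reads the symbol conversion as mapping accented or otherwise non-ASCII characters to their closest ASCII form, which is an interpretation rather than something the description above confirms.

```python
import string
import unicodedata

def normalize_text(text):
    """Lowercase, approximate non-ASCII characters, and strip punctuation."""
    # Decompose characters (e.g. an accented "e" into "e" plus a combining accent),
    # then drop whatever cannot be represented in ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    lowered = ascii_text.lower()
    # Delete every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

example_normalized = normalize_text("Caf\u00e9, I'm NOT okay!")
```

Stop words are deliberately left in place, consistent with the observation above that removing them hurt performance.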

N-Gram Features
The majority of our text features are derived from traditional n-gram extraction methods. Given the large number of unlabeled posts in the dataset, we trained our text vectorizers on the entire corpus (minus the test set posts). After constructing a vocabulary of n-grams occurring in the corpus, we counted the number of occurrences of each n-gram in each post's text, and normalized the counts by term frequency-inverse document frequency (tf-idf). Following initial feedback, our n-gram methods employed normalized unigram counts.
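To make the fit-on-corpus, transform-per-post pattern concrete, here is a small pure-Python sketch of unigram tf-idf. The exact weighting used in the system is not specified above; this example assumes one common variant, a smoothed idf of ln((1 + N) / (1 + df)) + 1 followed by L2 normalization.

```python
import math
from collections import Counter

def fit_idf(corpus):
    """Learn smoothed inverse document frequencies from a list of preprocessed texts."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }

def tfidf_vector(doc, idf):
    """Map one text to a sparse, L2-normalized tf-idf vector (term -> weight)."""
    counts = Counter(t for t in doc.split() if t in idf)  # ignore unseen terms
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy corpus standing in for the large unlabeled collection.
toy_corpus = ["i feel fine", "i feel sad", "i cant cope"]
toy_idf = fit_idf(toy_corpus)
toy_vector = tfidf_vector("i feel sad", toy_idf)
```

In this toy example, "sad" appears in fewer documents than "feel" or "i" and therefore receives the largest weight.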

Sentiment Lexicon Features
Because a primary goal of the shared task was to gauge the mental state of posting authors, we borrowed a basic technique from sentiment analysis and utilized a pre-existing sentiment lexicon, which contains a list of words annotated as positive or negative. We counted the number of occurrences of both positive and negative words in the text of each post.
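A minimal sketch of these two count features is shown below. The word sets are tiny invented stand-ins for the actual lexicon.

```python
def lexicon_counts(text, positive_words, negative_words):
    """Return (number of positive tokens, number of negative tokens) in a preprocessed text."""
    tokens = text.split()
    positive = sum(1 for t in tokens if t in positive_words)
    negative = sum(1 for t in tokens if t in negative_words)
    return positive, negative

# Hypothetical lexicon excerpts for illustration.
example_positive = {"happy", "good"}
example_negative = {"sad", "hopeless"}
example_counts = lexicon_counts(
    "i feel happy but also sad and hopeless", example_positive, example_negative
)
```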

Embedding Features
Since the amount of unlabeled text was so large relative to the labeled posts, we sought to learn a basic language model from past forum discussions. Our word embedding features are based on the recent success of Word2Vec (Mikolov et al., 2013), a method for representing individual words as distributed vectors. Our specific implementation utilized Doc2Vec (Le and Mikolov, 2014), a related method for computing distributed representations of entire documents. Our model used an embedding vector size of 400 and a window size of 4. After training the Doc2Vec model on the entire corpus of post text (minus test posts), we computed a 400-dimensional vector for the text of each training post.

Topic Modeling Features
As a final measure to incorporate the abundance of unlabeled text in the dataset, we trained a custom Latent Dirichlet Allocation (LDA) (Blei et al., 2003) model with 20 topics on the entire corpus of post text (minus test posts). LDA is a popular topic modeling technique which groups words into distinct topics, assigning both word-topic and topic-document probabilities. Once trained, we used our LDA model to predict a topic distribution (i.e., a 20-dimensional vector) for the text of each post.

Results
After extracting features for each of the 947 posts in the training set, we trained a separate logistic regression classifier on each source of text features, plus one trained on all of the attribute-based features. Because we hypothesized that the content of the replies to a particular post could be indicative of the nature of the post itself, for each set of text features we trained an additional model on the concatenated text of all direct reply posts only, ignoring the text of the post itself.
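The reply-only text used by these additional models could be assembled as in the sketch below. The keys `id`, `parent_id`, and `text` are hypothetical; the description above does not state how direct replies are identified in the data.

```python
from collections import defaultdict

def reply_texts(posts):
    """Map each post id to the concatenated text of its direct replies only."""
    replies = defaultdict(list)
    for post in posts:
        if post["parent_id"] is not None:
            replies[post["parent_id"]].append(post["text"])
    # Posts without replies map to an empty string.
    return {post["id"]: " ".join(replies[post["id"]]) for post in posts}

# Toy thread, invented for illustration.
example_thread = [
    {"id": 1, "parent_id": None, "text": "i feel alone"},
    {"id": 2, "parent_id": 1, "text": "we are here"},
    {"id": 3, "parent_id": 1, "text": "stay strong"},
]
example_replies = reply_texts(example_thread)
```

The resulting strings can then pass through the same feature extractors as the post text itself.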
For each model, we performed a randomized hyperparameter search in conjunction with a 10-fold cross-validation step based on macro-averaged F1 score. Results for each feature set are shown in Table 2, where it is clear that the model trained on n-grams of the post text (subject + body) performs the best across all metrics. We show a more detailed breakdown of this model's performance in Table 3, which includes per-label metrics.
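For reference, macro-averaged F1 is the unweighted mean of the per-label F1 scores, so rare labels count as much as frequent ones. The sketch below computes it directly; which labels enter the average is left as a parameter, since that choice is not spelled out above.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores over the given labels."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy predictions, invented for illustration.
example_true = ["green", "green", "amber", "red"]
example_pred = ["green", "amber", "amber", "red"]
example_score = macro_f1(example_true, example_pred, ["green", "amber", "red"])
```

In a cross-validation loop, a score of this kind would be computed on each held-out fold and averaged across folds to compare hyperparameter settings.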

Discussion
Given the relatively small amount of labeled data, it comes as no surprise that the traditional n-gram approach performs better than the more complex text-based methods. Because our vectorizers and vocabulary were trained on the full corpus of unlabeled and training posts before fine-tuning predictions on the test posts, this model is able to capture trends in word usage across all four labels.
We sought to combine the models shown in Table 2 with various ensemble methods, but found that no combination of classifiers trained on heterogeneous feature sets produced better results than the straightforward n-gram technique. Thus, the simplest text-based method proved also to have the best performance, a benefit for deploying such a system.
To gain better insight into our best-performing model, we show the top 10 features per label in Table 4, obtained by inspecting the model coefficients of the fully-trained logistic regression classifier. Here (aside from the Amber label, which is a bit more ambiguous, as expected), there is a clear distinction and trend in the type of language used between posts of different labels.
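Ranking features by coefficient can be sketched as follows, assuming the trained model exposes one weight per vocabulary term for each label. The vocabulary and coefficient values here are invented for illustration and are not results from the system.

```python
def top_features(coefficients, vocabulary, k=10):
    """For each label, return the k terms with the largest coefficients."""
    top = {}
    for label, weights in coefficients.items():
        # Pair each weight with its term and sort from largest to smallest weight.
        ranked = sorted(zip(weights, vocabulary), reverse=True)
        top[label] = [term for _, term in ranked[:k]]
    return top

# Hypothetical vocabulary and per-label weights.
example_vocabulary = ["thanks", "alone", "sleep"]
example_coefficients = {
    "green": [1.2, -0.8, 0.1],
    "crisis": [-1.0, 1.5, 0.3],
}
example_top = top_features(example_coefficients, example_vocabulary, k=2)
```

Because each label has its own weight vector in a multiclass linear model, the same term can rank highly for one label and carry a negative weight for another.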