Text Analysis and Automatic Triage of Posts in a Mental Health Forum

We present an approach for the automatic triage of message posts in the ReachOut.com mental health forum, which was the shared task of the 2016 Computational Linguistics and Clinical Psychology (CLPsych) workshop. This effort is aimed at providing the trained moderators of ReachOut.com with a systematic triage of forum posts, enabling them to more efficiently support the young users aged 14-25 communicating with each other about their issues. We use different features and classifiers to predict the users' mental health states, marked as green, amber, red, and crisis. Our results show that random forests significantly outperform our baseline multi-class SVM classifier. In addition, we perform feature importance analysis to characterize key features in the identification of critical posts.


Introduction
Mental health issues profoundly impact the wellbeing of those afflicted and the safety of society as a whole (Üstün et al., 2004). Major effort is still needed to identify and aid those who are suffering from mental illness, but doing so on a case-by-case basis is impractical and expensive (Mark et al., 2005). These limitations inspired us to develop an automated mechanism that can robustly classify the mental state of a person. The abundance of publicly available data allows us to access each person's record of comments and message posts online in an effort to predict and evaluate their mental health.

Shared Task Description
The CLPsych 2016 shared task provides a collection of 65,514 posts from ReachOut.com, a forum dedicated to providing a means for members aged 14-25 to express their thoughts in an anonymous environment. These posts were all selected from the years 2012 through 2015. Of these posts, 947 have been carefully analyzed and each assigned a label: green (the user shows no sign of mental health issues), amber (the user's posts should be reviewed further to identify any issues), red (there is a very high likelihood that the user has mental health issues), or crisis (the user needs immediate attention). These 947 post-label pairs constitute our training data. We use the training data to produce a model that assigns a label to any given post. A separate selection of 241 posts is set aside as the test data, used to evaluate the accuracy of the model.

Methods
Our approach for the automatic triage of posts in the mental health forum, much like any other classification pipeline, is composed of three phases: feature extraction, selection of a learning algorithm, and validation and parameter tuning in a cross-validation framework.

Feature extraction
Feature extraction is one of the key steps in any machine learning task and can significantly influence the performance of learning algorithms (Bengio et al., 2013). In the feature extraction phase we extracted the following information from the given XML files of forum posts: the author, the author's ranking in the forum, the time of submission and editing, the number of likes and views, the body of the post, the subject, the thread associated with the post, and the changeability of the text. For the representation of textual data (subject and body) we use both tf-idf and the word embedding representation of the data (Mikolov et al., 2013b; Mikolov et al., 2013a; Zhang et al., 2011). Skip-gram word embeddings, which are trained in the course of language modeling, have been shown to capture syntactic and semantic regularities in the data (Mikolov et al., 2013c; Mikolov et al., 2013a). To train the word embeddings we use skip-gram neural networks (Mikolov et al., 2013a) on the collection of all the textual data (subject/text) of the 65,514 posts provided in the shared task. In our word embedding training, we use the word2vec implementation of skip-gram (Mikolov et al., 2013b). We set the dimension of the word vectors to 100 and the window size to 10, and we sub-sample the frequent words with a ratio of $10^{-3}$. Subsequently, to encode the body or subject of a post we use the tf-idf weighted sum of the word vectors in that post (Le and Mikolov, 2014). The features are summarized in Table 1. To avoid excluding potentially important features, stop words are not removed.
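The tf-idf weighted sum of word vectors can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work: the toy posts, the 2-dimensional `toy_embeddings` (the paper uses 100 dimensions), the helper names, and the particular tf-idf variant (raw term count times log(N/df)) are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf_table(tokenized_posts):
    """Inverse document frequency log(N / df) for every word in the collection."""
    n_posts = len(tokenized_posts)
    doc_freq = Counter()
    for tokens in tokenized_posts:
        doc_freq.update(set(tokens))  # count each word once per post
    return {word: math.log(n_posts / df) for word, df in doc_freq.items()}

def encode_post(tokens, embeddings, idf, dim):
    """tf-idf weighted sum of the word vectors of one post."""
    vector = [0.0] * dim
    for word, tf in Counter(tokens).items():
        if word not in embeddings or word not in idf:
            continue  # skip out-of-vocabulary words
        weight = tf * idf[word]
        for i, value in enumerate(embeddings[word]):
            vector[i] += weight * value
    return vector

# Toy data (hypothetical): three tokenized posts and 2-dimensional embeddings
posts = [["i", "feel", "sad"], ["i", "feel", "fine"], ["thanks", "all"]]
toy_embeddings = {
    "i": [1.0, 0.0],
    "feel": [0.0, 1.0],
    "sad": [1.0, 1.0],
    "fine": [-1.0, 1.0],
    "thanks": [0.5, 0.5],
    "all": [0.0, 0.5],
}
idf = idf_table(posts)
post_vector = encode_post(posts[0], toy_embeddings, idf, dim=2)
```

Because stop words are kept, frequent words such as 'i' still contribute to the sum, but with a small idf weight.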

Automatic Triage
The Random Forest (RF) classifier (Breiman, 2001) is employed to predict the users' mental health states (green, amber, red, and crisis) from the posts in the ReachOut forum. A random forest is an ensemble method based on the use of multiple decision trees (Breiman, 2001). Random forest classifiers have several advantages, including estimation of important features in the classification, efficiency when a large proportion of the data is missing, and efficiency when dealing with a large number of features (Cutler et al., 2012); therefore random forests fit our problem very well. The validation step is conducted over the 947 labeled instances in a 10-fold cross-validation process. Different parameters of random forests, including the number of trees, the measure of split quality, the number of features in splits, and the maximum depth, are tuned using cross-validation. In this work, we use the scikit-learn implementation of Random Forests (Pedregosa et al., 2011).
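The validation and tuning loop can be sketched as follows. This is an illustrative outline in plain Python rather than the scikit-learn code used in this work: the helper names, the parameter names in `param_grid`, and the candidate values are hypothetical, and the shuffled (non-stratified) split is an assumption.

```python
import itertools
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, evaluate, k=10):
    """Average a score over k train/validation splits.

    `evaluate(train_idx, valid_idx)` is a caller-supplied function that
    fits a model on the training indices and returns its validation score.
    """
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, valid_idx))
    return sum(scores) / k

# Hypothetical grid over the four parameter families named in the text
param_grid = {
    "n_trees": [100, 500],
    "split_criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 20],
}
candidates = [
    dict(zip(param_grid, values))
    for values in itertools.product(*param_grid.values())
]

# Split the 947 labeled posts into 10 folds
folds = k_fold_indices(947, k=10)
```

Each entry of `candidates` would be scored with `cross_validate`, and the setting with the best average validation score kept.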
Our results on the training set show that incorporating unlabeled data into the training, using label propagation by means of nearest-neighbor search, does not increase the classification accuracy. Therefore, the unlabeled data are not incorporated in the training.
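One simple reading of label propagation by nearest-neighbor search is sketched below: every unlabeled post vector receives the label of its closest labeled vector. The function name, the toy vectors, and the choice of Euclidean distance are assumptions; the exact procedure and metric are not specified here.

```python
import math

def nearest_neighbor_labels(labeled, unlabeled):
    """Give each unlabeled vector the label of its closest labeled vector.

    `labeled` is a list of (vector, label) pairs; `unlabeled` is a list of
    vectors. Euclidean distance is an assumption made for this sketch.
    """
    propagated = []
    for vector in unlabeled:
        _, label = min(labeled, key=lambda pair: math.dist(pair[0], vector))
        propagated.append(label)
    return propagated

# Toy 2-dimensional post vectors (hypothetical)
labeled_posts = [([0.0, 0.0], "green"), ([1.0, 1.0], "amber"), ([3.0, 3.0], "red")]
unlabeled_posts = [[0.1, 0.2], [2.6, 2.9], [1.2, 0.9]]
pseudo_labels = nearest_neighbor_labels(labeled_posts, unlabeled_posts)
```

The propagated labels would then be added to the training set, which is the step that did not improve accuracy in our experiments.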
For the comparison phase, we consider a multi-class Support Vector Machine (SVM) classifier with a radial basis function kernel as the baseline method (Cortes and Vapnik, 1995; Weston and Watkins, 1998).
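For reference, the radial basis function kernel is commonly defined as k(x, y) = exp(-gamma * ||x - y||^2). A minimal sketch follows; the default value of `gamma` is a placeholder and not the setting used for the baseline.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Identical vectors have similarity 1; similarity decays with distance
same = rbf_kernel([0.0, 0.0], [0.0, 0.0])
apart = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=0.5)
```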

Results
Our results show that random forests significantly outperform SVM classifiers. The 4-way classification accuracies are summarized in Table 3. The evaluations on the test set for the random forest approach are summarized in Table 3.

Important Features
Random Forests can readily identify the most relevant features in the classification (Cutler et al., 2012; Breiman, 2001). A Random Forest consists of a number of decision trees. During training, one can calculate how much each feature decreases the weighted impurity in a tree. The impurity decrease for each feature can be averaged and normalized over all trees of the ensemble, and the features can be ranked according to this measure (Breiman et al., 1984; Breiman, 2001). We extracted the most discriminative features in the automatic triage of the posts using the mean decrease impurity of the best Random Forest we obtained in the cross-validation (Breiman et al., 1984).
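The mean decrease impurity computation can be made concrete with a small sketch. It assumes Gini impurity as the split measure and represents each tree as a flat list of splits; the toy forest, the feature names, and the helper functions are hypothetical, and library implementations may order the averaging and normalization steps differently.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease achieved by splitting `parent` in two."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

def mean_decrease_impurity(forest, n_total):
    """Average per-feature impurity decrease over trees, normalized to sum to 1.

    `forest` is a list of trees; each tree is a list of splits given as
    (feature_name, parent_labels, left_labels, right_labels).
    """
    totals = defaultdict(float)
    for tree in forest:
        for feature, parent, left, right in tree:
            decrease = impurity_decrease(parent, left, right, n_total)
            totals[feature] += decrease / len(forest)
    norm = sum(totals.values())
    return {feature: value / norm for feature, value in totals.items()}

# Toy forest of two one-split trees over four labeled posts (hypothetical)
labels = ["green", "green", "red", "red"]
toy_forest = [
    [("word:feel", labels, ["green", "green"], ["red", "red"])],  # perfect split
    [("likes", labels, ["green", "green", "red"], ["red"])],      # partial split
]
importances = mean_decrease_impurity(toy_forest, n_total=4)
```

In this toy case the feature that separates the classes perfectly receives three times the importance of the feature that separates them only partially.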
Our results show that, of the top 100 features, 88 were related to the frequency of particular words in the body of the post, 4 were related to the posting/editing time (00:00 to 23:00) and the day of the month (1st to 31st), 4 were indicators of the author and author ranking, 2 were related to the frequency of words in the subject, 1 was the number of views, and 1 was the number of likes a post gets.
The top 50 discriminative features, their importance, and their average values for each class are provided in Table 3.1. We have also presented the inverse document frequency (IDF) to indicate how much information each word encodes within the collection of posts (Robertson, 2004). Many interesting patterns can be observed in the word usage of each class. For example, the word 'feel' occurs significantly more often in the red and crisis posts. Surprisingly, there were some stop words among the most important features. For instance, the words 'to' and 'not' occur on average half as often in green posts as in non-green posts. Another example is the usage of the word 'me', which occurs more frequently in non-green posts. Furthermore, posts with more 'likes' are less likely to be non-green.
Subject: As indicated in Table 3.1, posts that have the word 're' in their subjects are more likely to belong to the green class.
Time: As shown in Figure 1 and Table 3.1, red posts are on average submitted on a day closer to the end of the month. In addition, the proportion of red and crisis posts in the interval from 5 A.M. to 7 A.M. was much higher than that of green and amber posts.
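The per-class average values discussed above can be computed along the following lines. The counts and labels in this sketch are invented for illustration and are not taken from the ReachOut data; `class_means` is a hypothetical helper.

```python
from collections import defaultdict

def class_means(feature_values, labels):
    """Mean of one feature within each triage class."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value, label in zip(feature_values, labels):
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

# Hypothetical counts of the word 'feel' in six posts
feel_counts = [0, 1, 1, 3, 2, 4]
post_labels = ["green", "green", "amber", "red", "red", "crisis"]
means = class_means(feel_counts, post_labels)
```

The same helper applies to any numeric feature, such as the number of likes or the day of the month on which a post was submitted.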

Conclusion
In this work, we explored the automatic triage of message posts in a mental health forum. Using Random Forest classifiers we obtain a higher triage accuracy than with our baseline method, i.e., a multi-class support vector machine. Our results showed that the incorporation of unlabeled data did not increase the classification accuracy of the Random Forest, which could be due to the fact that Random Forests themselves are efficient enough in dealing with missing data points (Cutler et al., 2012). Furthermore, our results suggest that employing full vocabularies is more discriminative than using sentence embeddings. This could be interpreted as indicating that the occurrence of particular words matters more than the occurrence of particular concepts. In addition, taking advantage of the capability of Random Forests to estimate feature importance in classification, we explored the most relevant features contributing to the automatic triage.

Table 4: The 50 most discriminative features of posts and their mean values for each class of green, amber, red, and crisis, ranked according to their feature importance. For the words we have also provided their IDF.