Using contextual information for automatic triage of posts in a peer-support forum

Mental health forums are online spaces where people can share their experiences anonymously and receive peer support. These forums require the supervision of moderators to provide support in delicate cases, such as posts expressing suicidal ideation. The large increase in the number of forum users makes the moderators' task unmanageable without the help of automatic triage systems. In the present paper, we present a machine learning approach for the triage of posts. Most approaches in the literature focus on the content of the posts; only a few take advantage of features extracted from the context in which they appear. Our approach consists of the development and implementation of a large variety of new features from both the content and the context of posts, such as previous messages, interaction with other users, and the author's history. Our method competed in the CLPsych 2017 Shared Task, obtaining first place in several of the subtasks. Moreover, we found that models that exploit post context significantly improve the detection of flagged posts (posts that require moderators' attention), while models that focus on post content perform better in the detection of the most urgent cases.


Introduction
According to the World Health Organization (WHO), 20% of children and adolescents in the world have mental disorders or problems (WHO, 2014). Suicide ranks as the second leading cause of death in the 15-29 age group, and every 40 seconds a person dies by suicide somewhere in the world. The WHO identified early identification and intervention as key factors in ensuring that people receive the care they need (WHO, 2014). Mental health problems have a strong impact on our society and require the use of new techniques for their study, prevention, and intervention.
In this context, text mining tools are emerging as a powerful channel to study and detect the mental state of writers (Calvo and Mac Kim, 2013; Bedi et al., 2014, 2015; De Choudhury et al., 2013a,b; Coppersmith et al., 2015). In particular, there is growing interest in the study and detection of suicidal ideation in texts from social networks. In this line, Tong et al. (2014) and O'Dea et al. (2015) developed automatic detection systems to identify suicidal thoughts in tweets, and Homan et al. (2014) studied the network structure of users with suicidal ideation in a forum. Furthermore, the CLPsych 2016 shared task proposed the triage of posts, based on urgency, from a peer-support mental health forum (for a more exhaustive review see (Calvo et al., 2017)). In the present article, we build an automatic post triage system and compete in the CLPsych 2017 shared task. The automatic detection of suicidal ideation in social networks and forums provides a powerful tool for early intervention in serious situations. Additionally, these techniques allow tracking the prevalence of different suicide risk factors in the population (Jashinsky et al., 2014; Fodeh et al., 2017), which provides valuable information that can be capitalized on for the design of prevention plans.

CLPsych 2017 Shared Task
The CLPsych 2017 Shared Task involves the triage of posts from an Australian mental health forum, Reachout.com, which provides a peer-support online space for adolescents and young adults. Reachout.com offers a space to read about other people's experiences and talk anonymously. Additionally, the forum has trained moderators who intervene in delicate situations, such as when a user is expressing suicidal ideation. There is an escalation process to follow when forum members might be at risk of harm. As the number of forum members increases, reading all posts becomes impossible, so an automatic triage system that efficiently guides moderators' attention to the most urgent posts becomes essential. The CLPsych 2017 Shared Task consists of labeling each forum post with one of four triage levels: crisis, red, amber and green (in decreasing priority). A crisis label indicates that the author is at risk, so moderators should prioritize the post above all others, while a green label indicates that the post does not require the attention of any moderator. See the shared task overview for a detailed description of the annotation process and the ethical considerations.
The CLPsych 2017 Shared Task dataset consists of 157,963 posts written between July 2012 and March 2017 (see Table 1). Among these posts, 1,188 were labeled by 3 annotators in order to train the model (training set), and 400 were selected to form the test set. Posts in the training set were written between April 2015 and June 2016, while posts in the test set were written between August 2016 and March 2017.
Fifteen teams took part in the CLPsych 2017 Shared Task, with unlimited submissions per group. Each post of the dataset contains the text of the subject and the body, structured in XML format. Additional metadata is also provided, such as board, thread, post time, or whether the post was written by a moderator. We had access to the test dataset only after the competition had finished.

        crisis   red   amber   green    total
train       40   137     296     715     1188
test        42    48      94     216      400
extra        -     -       -       -   156375

Table 1: Training dataset and extra unlabeled dataset statistics. Crisis, red, amber and green are the four triage levels and reflect a decreasing priority of required moderator intervention/response.

The official metrics of the task are:
• Macro-averaged f-score: the average f1-score among the crisis, red and amber labels.
• F-score for flagged vs. non-flagged: the average f1-score between the flagged (crisis + red + amber) and non-flagged (green) labels. This is considered by the task organizers as the most important metric, given that it measures the system's capability to identify posts that need moderators' attention.
The official measures are f-scores, as accuracy is known to be less sensitive to the misclassification of minority-class elements in highly unbalanced datasets. In this paper, we also analyze the f-score for crisis vs. non-crisis, which measures the system's capability to identify the most serious cases. This competition is a new version of the CLPsych 2016 Shared Task, which had the same goal but a smaller dataset. The approaches used in the 2016 competition involved a wide variety of features, such as N-grams, lexicon-based features, word embeddings, and metadata. Most of the models extracted features from the content of the posts, but only a few authors took advantage of features extracted from the context of the posts, such as n-grams of the previous posts in the thread, or the author's previous posts (Malmasi et al., 2016; Cohan et al., 2016).
In the present work, we extract and test a large variety of new features from both the body of the posts and the context in which the posts occur, such as: (1) the author's history, (2) adjacent posts, and (3) the authors' interaction network. We hypothesize that the contextual features will be useful to capture new elements that allow building a better profile of the author of the posts. This idea is grounded in Van Orden et al.'s (2010) observation that suicidal behavior tends to persist over the lifetime, and in De Choudhury et al.'s (2013b) studies showing that interaction patterns carry valuable information about the underlying mental state of the users.

Method
To triage posts we apply a supervised classification-based approach. In the present section, we describe the text preprocessing step, the features used, the feature transformation process, and the classification method.
First, we preprocessed the body of the post: we removed HTML formatting and eliminated quotes (HTML quote tags), and we converted ReachOut links, other webpage links, author mentions, and the forum's emoticons to tokens such as #reachout_link, #ref_link, #reference and #SmileyHappy, respectively. Then we transformed the text to lowercase and word-tokenized it with happierfuntokenizing.py (World Well-Being Project, 2017), which can handle most common emoticons.
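As an illustration, the preprocessing step above can be sketched as follows. This is a minimal re-implementation, not the paper's exact code: a simple regex tokenizer stands in for happierfuntokenizing.py, and the regular expressions are plausible approximations of the substitution scheme described.

```python
import re

def preprocess(body):
    """Hedged sketch of the preprocessing pipeline: strip quotes and HTML,
    replace links and mentions with special tokens, lowercase, tokenize."""
    body = re.sub(r"<blockquote>.*?</blockquote>", " ", body, flags=re.S)  # drop quoted text
    body = re.sub(r"<[^>]+>", " ", body)                                   # strip remaining HTML tags
    body = re.sub(r"https?://\S*reachout\S*", " #reachout_link ", body)    # ReachOut links
    body = re.sub(r"https?://\S+", " #ref_link ", body)                    # other webpage links
    body = re.sub(r"@\w+", " #reference ", body)                           # author mentions
    body = body.lower()
    return re.findall(r"#?\w+", body)  # crude stand-in for happierfuntokenizing.py

print(preprocess("<p>See @mod1 at https://example.com</p>"))
# ['see', '#reference', 'at', '#ref_link']
```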
We extracted a total of 2799 features from each post. We organized the features into seven main categories: four of them are content-based (Word2vec, N-grams, Metadata, Body), and the remaining three are context-based (Interaction, Adjacent, Author features).
After the feature extraction process, a Z-score transformation was applied to all features, except the n-gram features, for which we performed TF-IDF weighting. Then, missing values were filled with the mean value of those features in the unlabeled dataset.
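A minimal sketch of this transformation, assuming the features are held in a NumPy matrix with NaN marking missing values. In the paper the imputation means come from the unlabeled dataset; here, for brevity, the same matrix serves as the reference.

```python
import numpy as np

def zscore_impute(X, ref=None):
    """Z-score each column and fill NaNs with the column mean of `ref`
    (in the paper, `ref` would be the large unlabeled dataset)."""
    ref = X if ref is None else ref
    mu = np.nanmean(ref, axis=0)
    sd = np.nanstd(ref, axis=0)
    sd = np.where(sd == 0, 1.0, sd)     # zero-variance features are removed upstream
    X = np.where(np.isnan(X), mu, X)    # mean imputation of missing values
    return (X - mu) / sd

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
Z = zscore_impute(X)
print(Z.mean(axis=0))  # each column has mean ~0 after the transform
```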
In the following, we present a brief description of each category (see Table 2 for feature statistics). In this section, we use parentheses to indicate the number of features. Given the large number of features, in some categories we built subsets containing the features that we considered the most relevant (see Supplementary Materials).

Word2vec representation (50 features)
We used all post bodies in the unlabeled dataset to train a Skip-gram model (Mikolov et al., 2013a) of 50 dimensions. We discarded infrequent tokens (fewer than 5 occurrences) and very frequent tokens (frequency higher than 10^-3). We set the window size and negative sampling to 15 (values found to be optimal in two semantic tasks over the TASA corpus (Altszyler et al., 2017)). Word2vec semantic representations were generated with the Gensim Python library (Rehurek and Sojka, 2010). After training, the resulting Word2vec features of a post were computed as the average of all word embeddings in the post.

N-grams (2274 features)
We extracted unigrams and bigrams from all post bodies, kept the 3000 most frequent N-grams in the training corpus (following Brew (2016)), and applied a TF-IDF transformation. As the training and test sets contain posts from different time periods, language patterns may have changed during this time lapse. In order to eliminate N-grams whose usage changed the most, we excluded all N-grams with a frequency lower than 5x10^-5 in the posts of the unlabeled dataset from the period Aug 2016 - Mar 2017 (726 N-grams were eliminated in this way).

Metadata features (23 features)
We included several non-linguistic features derived from the posts' metadata and removed all features showing lack of variability in our training set (std = 0). The selected features are: week day (7), board (5), whether the author is a moderator (1), whether the author created the thread (1), and time since the last edit (1). Additionally, we subdivided the day into 8 timeslots of 3 hours and created post-time features, consisting of 8 dummy variables that identify the timeslot of the post (8).
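The timeslot encoding is straightforward; a small sketch (the function name is ours):

```python
from datetime import datetime

def timeslot_dummies(post_time):
    """Split the day into eight 3-hour timeslots and return 8 dummy variables
    marking the slot in which the post was written."""
    slot = post_time.hour // 3            # 0..7
    return [1 if i == slot else 0 for i in range(8)]

print(timeslot_dummies(datetime(2017, 3, 1, 14, 30)))
# hour 14 falls in slot 4 (12:00-15:00): [0, 0, 0, 0, 1, 0, 0, 0]
```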
Missing values in lexicon-based features that have a neutral value were filled with that neutral value (for example, in DAL, pleasantness ranges from 1 (unpleasant) to 3 (pleasant), so we replaced missing values with 2). All features showing lack of variability (std = 0) in our training set were removed.

Interaction features (155 features)
We believe that the interaction patterns between users hold valuable information about the underlying intention and emotions of the posts. To this end, we built a directed mention graph where a node (post) n_i has an incoming edge from n_j if n_j mentioned the author of n_i within a 10-post temporal window, and an outgoing edge to n_j if n_i mentioned the author of n_j in the same period. First, we take advantage of this network to extract seven basic network structural features, such as: in/out degree, number of in/out edges from different authors, number of loops, number of posts from the author in the window, and out degree of the author mentioned in the post.
Then, we define node attributes on this graph based on a set of Body and Word2vec features, denoted f_a. For the k-th node (post) in our network, we define a new set of interaction-based features F^int_a by averaging the feature f_a across the neighborhood of the post, Nei(k):

F^int_a(k) = (1 / |Nei(k)|) * Σ_{j ∈ Nei(k)} f_a(j)

74 features were extracted from incoming edges and 74 from outgoing edges. The extracted features consist of Word2vec (50), WWBP lexicons (20), Hedonometer (1), pronouns (2) and semantic coherence (1), which is measured as the cosine similarity between the Word2vec embedding of the neighboring post and the central post.
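A sketch of the neighborhood averaging with NetworkX, on a three-post toy graph (the node attribute `f_a` stands for any of the base features above; the helper function is ours):

```python
import networkx as nx
import numpy as np

# Toy directed mention graph: nodes are posts, each carrying a base feature f_a.
G = nx.DiGraph()
G.add_nodes_from([(0, {"f_a": 1.0}), (1, {"f_a": 3.0}), (2, {"f_a": 5.0})])
G.add_edges_from([(1, 0), (2, 0)])  # posts 1 and 2 mention post 0's author

def interaction_feature(G, k, name="f_a", incoming=True):
    """F^int_a(k): mean of feature `name` over the incoming
    (or outgoing) neighborhood of post k."""
    nei = list(G.predecessors(k)) if incoming else list(G.successors(k))
    if not nei:
        return float("nan")  # later filled from the unlabeled dataset
    return float(np.mean([G.nodes[j][name] for j in nei]))

print(interaction_feature(G, 0))        # (3.0 + 5.0) / 2 = 4.0
print(G.in_degree(0), G.out_degree(0))  # basic structural features: 2 0
```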
Missing values in features that use Word2vec similarity were filled with the mean similarity between successive posts in the unlabeled dataset, missing values in features that count outgoing edges were filled with -1, and missing values in F^int_a features were filled with the mean value of feature a in the unlabeled dataset. (The self-harm regular expressions used in the body features include patterns such as "\w* to live" and "end my life", where \w* refers to 0 or more alphabetic letters. The selected self-harm expressions were inspired by posts from the subreddit SuicideWatch; in keyword spotting, it is important not to be influenced by the training data in order to avoid overfitting.)

Adjacent features (152)
For each post, we extract 76 features from the previous post in the same thread and 76 features from the previous post written by the same author in the thread. The extracted features consist of: Word2vec (50), WWBP lexicons (20), Hedonometer (1), pronouns (2), semantic coherence between the post and the previous post (1), post day of the previous post (1), and time between posts (1).
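The semantic coherence feature reduces to a cosine similarity between averaged embeddings; a minimal sketch (the function name is ours):

```python
import numpy as np

def coherence(v_post, v_prev):
    """Semantic coherence: cosine similarity between the averaged
    word2vec embeddings of a post and the previous post."""
    denom = np.linalg.norm(v_post) * np.linalg.norm(v_prev)
    return float(np.dot(v_post, v_prev) / denom) if denom else 0.0

print(coherence(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```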

Authors' features (77 features)
We replicated and extended Shickel and Rashidi's (2016) idea of deriving attributes from the history of the authors. For each post, we computed the mean value of several features over all the previous posts written by the same author. These features provide a baseline for the author, which may allow the machine learning algorithm to identify when a post differs from the typical behavior of its author. The extracted features consist of Word2vec (50), WWBP lexicons (20), Hedonometer (1), pronouns (2), and post day (1). Additionally, we added other features to capture more general author behavior, such as entropy of thread and board participation (2) and median time between posts measured in log-scale, log(#minutes + 1) (1).
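Two of these author-history features can be sketched as follows (both helper names are ours; the running mean over previous posts provides the author baseline, and the participation entropy measures how spread out an author's activity is across boards):

```python
import numpy as np

def history_mean(previous_values):
    """Mean of a per-post feature over the author's previous posts."""
    return float(np.mean(previous_values)) if previous_values else float("nan")

def participation_entropy(board_counts):
    """Entropy (in bits) of the author's board-participation distribution."""
    p = np.asarray(board_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(history_mean([0.2, 0.4, 0.6]))   # ~0.4
print(participation_entropy([5, 5]))   # two boards, equal use -> 1.0 bit
```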

Models
We used Support Vector Machine (SVM) classifiers with linear kernels and Radial Basis Function (RBF) kernels. Each model was trained on different combinations of features, and the hyperparameter C was selected with a grid search scheme for each model. In the grid search, the performance metric was the macro f-score with 10-fold Cross-Validation (CV). The C hyperparameter was varied among {0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100} for the SVM-RBF models and among {0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1} for the SVM-linear models. As the training dataset is highly imbalanced, both SVM models were trained with class weights inversely proportional to class frequencies in the training dataset. We also tested XGBoost and Random Forest models, which underperformed the SVM models, and a feature selection process, which did not produce significant performance improvements for the SVM-RBF and SVM-linear models (see Supplemental Material A.3). All the models were implemented in Python with the Sklearn or XGBoost packages, and all other parameters, not included in the grid search, were set to their default values.
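The model-selection scheme can be sketched with scikit-learn; synthetic imbalanced data stands in for the real feature matrix, and `class_weight="balanced"` implements the inverse-frequency class weights described above:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Synthetic 3-class imbalanced dataset standing in for the post features.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# SVM-RBF with inverse-frequency class weights; C chosen by grid search
# maximizing the macro f-score under 10-fold cross-validation.
C_GRID = [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100]
grid = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"),
                    param_grid={"C": C_GRID},
                    scoring="f1_macro", cv=10)
grid.fit(X, y)
print(grid.best_params_)
```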
We have built nine collections of features composed of different categories and feature subsets (for a full description of the collections see Supplementary Materials A.2). With these feature collections, we trained 18 SVM models, half with an RBF kernel and half with a linear kernel. Additionally, we implemented four ensemble models composed of SVMs combined with a majority voting method. We used ensembles of four and seven SVMs with RBF and linear kernels; the voting SVMs differ in their training features (see Supplementary Materials A.4 for a full description). In case of a tie between classes, the post is classified as the most urgent class. Table 3 shows the top-performing models of the CLPsych 2017 challenge, divided by metric; only the best model of each team is shown. We obtained 2nd place in the macro-averaged f-score with an ensemble of 4 SVM-linear models, 1st place in the flagged vs. non-flagged f-score with an ensemble of 7 SVM-RBF models, 1st place in the urgent vs. non-urgent f-score with an SVM-RBF trained with Word2vec + N-grams + a subset of body features, and 1st place in the crisis vs. non-crisis f-score with an SVM-RBF trained with Word2vec + N-grams + a subset of metadata features.
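The voting rule, including the tie-break in favor of the most urgent class, can be sketched as follows (function and constant names are ours):

```python
from collections import Counter

URGENCY = ["crisis", "red", "amber", "green"]  # decreasing priority

def vote(predictions):
    """Majority vote over the SVMs' predicted labels; ties are broken
    in favor of the most urgent class."""
    counts = Counter(predictions)
    best = max(counts.values())
    tied = [label for label in counts if counts[label] == best]
    return min(tied, key=URGENCY.index)  # most urgent label wins ties

print(vote(["red", "green", "red", "amber"]))       # red (clear majority)
print(vote(["crisis", "green", "green", "crisis"])) # tie -> crisis
```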

Results
In Table 4 we show our models' results, ordered by performance in the flagged vs. non-flagged metric, which is considered by the organizers as the most relevant metric, as it measures the system's capability to identify posts that need moderators' attention.
It is worth noting that there is no universal best model; however, our approach obtains very good results on all performance measures. In particular, our models tend to outperform the other teams' models in the flagged vs. non-flagged f-score, where nine of the top ten models are from our team (see bold scores in Table 4).
Table 4: Our models' scores, ordered by performance in the flagged vs. non-flagged metric. We show in bold the scores of the models that are within the top ten among the 251 models that participated in the shared task.

Among our models, those that take advantage of contextual features tend to obtain better flagged vs. non-flagged f-scores (p-value = 4.09e-09, Wilcoxon rank-sum test). The amber class includes posts where the author is following up on their own previous red or crisis post; thus, the inclusion of contextual features is essential to capture these situations. On the other hand, complex models with many features may learn the particularities and details of the authors present in the training set, decreasing the predictive capability on posts from authors never seen before (89% of the authors in the training set are not in the test set). This overfitting effect in complex models can be observed in the correlation between the number of features (column N in Table 4) and the difference in f-scores between cross-validation and the test set (column CV macro minus column macro in Table 4): a Spearman correlation of 0.523 with p-value = 0.012. This effect may also explain the good performance obtained by less complex models, such as the SVM-linear trained with only 50 Word2vec features. Furthermore, models that use only content features tend to obtain better results in the urgent vs. non-urgent and crisis vs. non-crisis metrics (p-value = 4.17e-09 and p-value = 4.15e-09 respectively, Wilcoxon rank-sum test). We propose that training with a greater amount of data and more user diversity would avoid this overfitting, boosting the performance of the models that use more features.
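The two statistical checks used above are standard SciPy calls; a sketch on invented illustrative numbers (not the paper's actual scores):

```python
from scipy.stats import ranksums, spearmanr

# Wilcoxon rank-sum test: f-scores of contextual vs. content-only models
# (illustrative values only).
contextual = [0.87, 0.88, 0.86, 0.89, 0.87]
content_only = [0.80, 0.81, 0.79, 0.82, 0.80]
stat, p = ranksums(contextual, content_only)

# Spearman correlation between model size and the CV-test f-score gap
# (illustrative values only).
n_features = [50, 340, 900, 2400, 2799]
cv_test_gap = [0.01, 0.03, 0.05, 0.08, 0.10]
rho, p_rho = spearmanr(n_features, cv_test_gap)

print(p < 0.05, rho)  # perfectly monotone toy data gives rho near 1
```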
Finally, we extracted the 25 most relevant features according to the random forest importance measure, trained on the training dataset with all 2799 features (see Table 5). Among the most important features, 10 come from the Interaction category, 8 from Body, 4 from Word2vec, 2 from Authors' and 1 from N-grams.
Furthermore, Table 5 shows that crisis posts tend to exhibit more negative PERMA elements, more negative sentiment, more first-person references and less happiness than non-crisis posts (p-value < 0.5e-6 in each comparison, Wilcoxon rank-sum test). Although Word2vec dimensions do not have a straightforward interpretation, it can be seen that there are no shared Word2vec components between the relevant interaction features and the selected Word2vec features extracted from the posts' text. These results show that the content of severe posts and of the posts interacting with them provide different features that are useful for the post triage task.

Conclusion
Mental health forums, such as ReachOut.com, are online spaces where users can share their experiences and get peer support. The large increase in the number of users makes the moderators' task considerably difficult. This can result in the loss of critical messages that would require immediate attention. In this context, an automatic triage system is a valuable tool to guide moderators' efforts.
In the present paper, we present a machine learning approach for the automatic triage of posts from the ReachOut.com forum. Our models participated in the CLPsych 2017 Shared Task competition, obtaining very good results across all official metrics.
The CLPsych 2017 Shared Task is the second edition of the 2016 task, with more training data and a more balanced test set. Most of the approaches used in the CLPsych 2016 Shared Task extracted features from the content of the posts; only a few took advantage of features extracted from the posts' context. In the present paper we focused on the development and implementation of a large variety of new features from both the content and the context of posts. The content-based features consist of N-grams, Word2vec, metadata and other features from the body of the posts, while the context-based features extract attributes from the content and structure of the user history, other posts in the conversation, and the interaction network.
Our implementation obtained first place in several official metrics. In particular, we obtained the best performance in the flagged vs. non-flagged measure, which tests the system's capability to identify posts that require attention from moderators.
We found that the exploitation of contextual features tends to improve the detection of posts that require attention from moderators. On the other hand, complex models with many features may learn the particularities and details of the authors present in the training set, decreasing the predictive capability on posts from authors never seen before. To avoid this overfitting effect, we propose feeding the models with a greater amount of training data and more user diversity. This can be easily achieved with online classifiers (Bordes et al., 2005), in which the model continuously learns from the manual classifications made by the moderators, ensuring that the system is kept up to date. A feature importance analysis emphasizes the importance of the interactions among users and the content of the interacting posts. In this respect, we showed that the content of crisis posts and of their interacting posts provide different elements that are useful for the post triage task. These analyses also highlighted the predictive capabilities of new open-source psycholinguistic measures designed by the World Well-Being Project (WWBP), especially the ones related to well-being elements (PERMA).

A.1 Features subsets
We built subsets of features, in which we selected the ones that we considered the most relevant in each category: • Subsets of body features (23): self-harm regular expressions (1), MentalDisLex (1), advisor and helpline keywords (2), negative PERMA features (5), neuroticism from OCEAN (1), affect lexicon from WWBP (1), pronouns (2), Hedonometer (1), negative lexicon from EmoLex (1) and word2vec semantic similarity to keywords (8).
• Subsets of metadata features (7): A selection of 5 boards (ToughTimes Hosted chats, Everyday life stuff, Intros, Something Not Right, Getting Help), whether the author is a moderator or not (1), and whether the author created the thread (1).
• Subsets of interaction features (57): number of in/out edges from different authors (2), number of loops (1), number of posts from the author in the window (1), out degree of the author mentioned in the post (1), mean pronouns from incoming edges (2) and mean word2vec from incoming edges (50).
• The subsets of author and adjacent features (50 and 100 features respectively) consist of the subsets of features that consider Word2vec representations.
• Subsets of N-grams (50): We performed a random forest feature importance procedure over word2vec and N-grams, in which we kept the 100 most relevant features. The selected features consist of all 50 Word2vec features and 50 N-grams; thus, this procedure only discarded N-grams.

A.2 Features collections
Starting from the set of all the features, we progressively discarded some of them, thus generating nine collections of features of decreasing size. Each collection was used to train SVM-linear and SVM-RBF models, resulting in 18 of our 22 models (the other four are ensembles). The collections are:
• all features (2799)

A.3 Models comparison
In Table 6 we compare the macro f-scores of different models in a 10-fold cross-validation scheme with the training set and the 337 features of the selection collection (described in section A.2). The models were implemented with the sklearn or xgboost Python packages. For each model, a grid search was applied to select the best parameters. For the SVM-RBF model the hyperparameter C was varied among {0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100}; for the SVM-linear among {0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1}; for XGBoost the max depth was varied among [2, 4, 6, 8] and the learning rate among [0.001, 0.01, 0.1, 0.3]; and for the Random Forest the max features was varied among [10, 20, 40, 60, 80, 100, 120, 140, 160, 200]. All other parameters were set to their default values. Among the models, the SVM classifiers outperformed the tree-based models.

model           CV macro f-score
SVM-RBF         0.549
SVM-linear      0.490
XGBoost         0.486
Random Forest   0.442

Table 6: Macro f-scores of different models in a 10-fold cross-validation scheme with the training set and the 337 features of the selection collection.

Given the large number of features (337), we also tried a feature selection stage using the importance measure of a random forest classifier. In the grid search, not only the parameter C was varied but also the number of selected features, taking values among [50, 100, 150, 200, 250, 300]. The best SVM-RBF model obtained an f-score of 0.518 with the selection of the best 300 features, while the SVM-linear model obtained an f-score of 0.514 with the selection of the best 250 features. Since the feature selection process did not produce significant performance improvements, it was not included in the contest models.

A.4 Ensemble models
We implemented four ensemble models composed of SVMs combined with a majority voting method.
The features sets of the voting models which compose the ensembles architectures are: