Classifying ReachOut posts with a radial basis function SVM

The ReachOut clinical psychology shared task challenge addresses the problem of providing an automatic triage for posts to a support forum for people with a history of mental health issues. Posts are classified into green, amber, red and crisis. The non-green categories correspond to increasing levels of urgency for some form of intervention. The Thomson Reuters submissions arose from an idea about self-training and ensemble learning. The available labeled training set is small (947 examples) and the class distribution unbalanced. It was therefore hoped to develop a method that would make use of the larger dataset of unlabeled posts provided by the organisers. This did not work, but the performance of a radial basis function SVM intended as a baseline was relatively good. Therefore, the report focuses on the latter, aiming to understand the reasons for its performance.


Introduction
The ReachOut clinical psychology shared task challenge addresses the problem of providing an automatic triage for posts to a support forum for people with a history of mental health issues. Posts are classified into green, amber, red and crisis. The non-green categories correspond to increasing levels of urgency for some form of intervention, and can be regarded as positive. Green means "all clear", no need for intervention.
Table 1 (excerpt). crisis: "Life is pointless. Should call psych."
The entry from Thomson Reuters was planned as a system in which an ensemble of base classifiers is followed by a final system combination step. This did not pan out, so we report results on a baseline classifier. All of the machine learning was done using scikit-learn (Pedregosa et al., 2011).
The first step, shared between all runs, was to split the labeled data into a training partition of 625 examples (Train) and two development sets (Dev_test1 and Dev_test2) of 161 examples each. There were two development sets only because of the plan to do system combination. This turned out to be fortunate. All data sets were first transformed into Pandas (McKinney, 2010) data-frames for convenient onward processing. When the test set became available, it was similarly transformed into the test data-frame (Test).
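The 625/161/161 partition can be sketched as follows. This is a minimal illustration, not the code used for the submission: the data-frame contents, the column names and the class counts are synthetic placeholders, and the use of stratified sampling for the split is an assumption (the text does not say how the partition was drawn).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 947 labeled posts. The class counts below are
# made up for illustration and are NOT the real ReachOut distribution.
labels = ["green"] * 560 + ["amber"] * 240 + ["red"] * 110 + ["crisis"] * 37
posts = pd.DataFrame({
    "text": [f"post {i}" for i in range(len(labels))],  # placeholder text
    "label": labels,
})

# Train: 625 examples. Stratification is an assumption of this sketch.
train, rest = train_test_split(
    posts, train_size=625, stratify=posts["label"], random_state=0
)

# The remaining 322 examples are halved into the two development sets.
dev_test1, dev_test2 = train_test_split(
    rest, test_size=0.5, stratify=rest["label"], random_state=0
)

print(len(train), len(dev_test1), len(dev_test2))
```

Keeping each partition as a data-frame means the text and the metadata columns travel together through the later feature extraction steps.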
The first submitted run was an RBF SVM, intended as a strong baseline. This run achieved a better score than any of the more elaborate approaches, and, together with subsequent analysis, sheds some light on the nature of the task and the evaluation metrics used.

An RBF-based SVM
This first run used the standard scikit-learn (Pedregosa et al., 2011) SVM (footnote 2), with a radial basis function kernel. scikit-learn provides a grid search function that uses stratified cross-validation to tune the classifier parameters.
The RBF kernel is:
$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$
where $\gamma = \frac{1}{2\sigma^2}$, and the objective function is:
$$\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert_2^2 + C \sum_i \xi_i$$
where $\lVert w \rVert_2$ is the 2-norm of the weight vector defining the separating hyperplane and $\xi_i$ is a slack variable that is positive when the ith point violates the margin (and greater than 1 when it is misclassified). The C parameter affects the tradeoff between training error and model complexity. A small C tends to produce a simpler model, at the expense of possibly underfitting, while a large one tends to fit all training data points, at the expense of possibly overfitting.
The approach to multi-class classification is the "one versus one" method of Knerr et al. (1990). Under this approach, a binary classifier is trained for each pair of classes. The winning class is determined by voting.
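As a small check on the definitions above, the following sketch computes the RBF kernel by hand, compares it with scikit-learn's `rbf_kernel`, and confirms that a four-class one-versus-one SVM exposes one decision value per pair of classes. The data are random placeholders, and the values of γ and C are chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                             # placeholder features
y = np.repeat(["green", "amber", "red", "crisis"], 10)   # placeholder labels

gamma = 0.01

# K(x, x') = exp(-gamma * ||x - x'||^2), computed directly from the definition.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

# The same kernel as computed by scikit-learn.
K_sklearn = rbf_kernel(X, X, gamma=gamma)
kernels_match = np.allclose(K_manual, K_sklearn)

# One-versus-one: 4 classes give 4 * 3 / 2 = 6 pairwise binary classifiers.
clf = SVC(kernel="rbf", gamma=gamma, C=1.0, decision_function_shape="ovo")
clf.fit(X, y)
n_pairwise = clf.decision_function(X).shape[1]

print(kernels_match, n_pairwise)
```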

Features
The features used were:
• single words and 2-grams weighted with scikit-learn's TFIDF vectorizer, using a vocabulary size limit (|V|) explored by grid search. The last example post would, inter alia, have a feature for 'pointless' and another for 'call psych'.
• a feature representing the author type provided by ReachOut's metadata. This indicates whether the poster is a ReachOut staff member, an invited visitor, a frequent poster, or one of a number of other similar categories.
• a feature providing the kudos that users had assigned to the post. This is a natural number reflecting the number of 'likes' a post has attracted.
• a feature indicating whether the post being considered was the first in its thread. This is derived from the thread IDs and post IDs in each post.
Footnote 2: A Python wrapper for LIBSVM (Chang and Lin, 2011).
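One way to assemble a feature set of this kind in scikit-learn is sketched below. The column names (`text`, `author_type`, `kudos`, `first_in_thread`), the one-hot encoding of the author type and the second example post are assumptions made for illustration; the text does not specify how the metadata features were encoded.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Two toy posts. The first text is the crisis example from Table 1;
# the second post and all metadata values are invented.
posts = pd.DataFrame({
    "text": [
        "Life is pointless. Should call psych.",
        "Thanks everyone, feeling better today.",
    ],
    "author_type": ["frequent_poster", "staff"],
    "kudos": [0, 3],
    "first_in_thread": [1, 0],
})

features = ColumnTransformer([
    # Single words and 2-grams, TF-IDF weighted, vocabulary capped at |V|.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=3000), "text"),
    # Author type as a categorical feature (one-hot is an assumption).
    ("author", OneHotEncoder(handle_unknown="ignore"), ["author_type"]),
    # Kudos and the first-in-thread flag passed through unchanged.
    ("numeric", "passthrough", ["kudos", "first_in_thread"]),
])

X = features.fit_transform(posts)
vocab = features.named_transformers_["tfidf"].vocabulary_

print(X.shape, "pointless" in vocab, "call psych" in vocab)
```

With the default tokenizer the crisis example does yield both a 'pointless' unigram and a 'call psych' 2-gram, as described in the feature list.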

Datasets, class distributions and evaluation metrics
Class distributions. We have four datasets: the two sets of development data, the main training set and the official test set distributed by the organisers. Table 2 shows that the class distributions of the three evaluation sets and the training set differ. In particular, the final test set used for official scoring has only one instance of the crisis category, when one might expect around ten. Of course, none of the teams knew this at submission time. The class distributions are always imbalanced, but it is a surprise to see the extreme imbalance in the final test set.
Evaluation metrics. The main evaluation metric used for the competition is a macro-averaged F1-score restricted to amber, red and crisis. This is very sensitive to the unbalanced class distributions, since it weights all three positive classes equally. A classifier that correctly hits the one positive example for crisis will achieve a large gain in score relative to one that does not. Micro-averaged F1, which simply counts true positives, false positives and false negatives over all the positive classes, might have proven a more stable target. An alternative is the multi-class Matthews correlation coefficient (Gorodkin, 2004). Or, since the labels are really ordinal, similar to a Likert scale, quadratic weighted kappa (Vaughn and Justice, 2015) could be used.
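The sensitivity of the restricted macro-averaged F1 can be illustrated with a toy example. The labels below are invented: twelve posts with a single crisis instance, scored once with every post correct and once with only the crisis post mislabelled as red. The sketch also shows how the alternative metrics mentioned above can be computed with scikit-learn.

```python
from sklearn.metrics import cohen_kappa_score, f1_score, matthews_corrcoef

positive = ["amber", "red", "crisis"]        # classes averaged in the official score
order = ["green", "amber", "red", "crisis"]  # ordinal order, used for weighted kappa

# Invented gold labels with exactly one crisis post.
y_true = ["green"] * 6 + ["amber"] * 3 + ["red"] * 2 + ["crisis"]
y_hit = list(y_true)             # every post classified correctly
y_miss = y_true[:-1] + ["red"]   # identical, except crisis is predicted as red

# Official-style metric: macro-averaged F1 over the three positive classes.
macro_hit = f1_score(y_true, y_hit, labels=positive, average="macro")
macro_miss = f1_score(y_true, y_miss, labels=positive, average="macro",
                      zero_division=0)

# Micro-averaging pools TP/FP/FN over the positive classes instead.
micro_miss = f1_score(y_true, y_miss, labels=positive, average="micro")

# Alternatives: multi-class MCC and quadratic weighted kappa.
mcc_miss = matthews_corrcoef(y_true, y_miss)
kappa_miss = cohen_kappa_score(y_true, y_miss, labels=order, weights="quadratic")

print(macro_hit, macro_miss, micro_miss)
```

In this toy case one misclassified post out of twelve moves the macro-averaged score from 1.0 to 0.6, whereas the micro-averaged score only drops to 5/6.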

Grid search with unbalanced, small datasets
Class weights. Preliminary explorations revealed that the classifier was producing results that over-represented the 'green' category. To rectify this, the grid search was re-done using a non-uniform class weight vector of 1 for 'green' and 20 for 'crisis', 'red' and 'amber'. The effect of this was to increase by a factor of 20 the effective misclassification penalty for the three positive classes. The grid search used for the final submission fixed γ=0.01, tried C at 15 logarithmically spaced values between 1 and 1000 inclusive and all vocabulary size limits in {10, 30, 100, 300, 1000, 3000, 10000}, and assumed that author type, kudos and first-in-thread were always relevant and should always be used. The scoring metric used for this grid search was mean accuracy. The optimal parameters for this setting were estimated to be: C=51.79, |V|=3000.
[...] this case would have led the classifier astray.
Once optimal parameters had been selected, the classifier was re-trained on the concatenation of Train, Dev_test1 and Dev_test2, and predictions were generated for Test. Table 4 contains classification reports for the class-weighted version that was submitted and a non-weighted version that was prepared after submission. The source of the improved official score achieved by the class-weighted version is a larger F-score on the red category, at the expense of a smaller score on the green category, which is not one of the positive categories averaged in the official scoring metric.
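A class-weighted grid search over C of the kind described above might look like the following sketch. The data are synthetic (generated with `make_classification`), the number of cross-validation folds is an assumption, and the search over the vocabulary size limit is omitted for brevity; only the fixed γ, the C grid, the class weights and the accuracy scoring follow the description in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# 15 logarithmically spaced values between 1 and 1000 inclusive.
C_grid = np.logspace(0, 3, 15)

# Weight 1 for green, 20 for each of the three positive classes.
class_weight = {"green": 1, "amber": 20, "red": 20, "crisis": 20}

# Synthetic, imbalanced four-class data standing in for the real features.
X, y_int = make_classification(
    n_samples=200, n_features=20, n_informative=6, n_classes=4,
    n_clusters_per_class=1, weights=[0.6, 0.25, 0.1, 0.05], random_state=0,
)
y = np.array(["green", "amber", "red", "crisis"])[y_int]

search = GridSearchCV(
    SVC(kernel="rbf", gamma=0.01, class_weight=class_weight),
    param_grid={"C": C_grid},
    scoring="accuracy",  # mean accuracy, as in the submitted run
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # fold count assumed
)
search.fit(X, y)

print(round(C_grid[8], 2), search.best_params_)
```

If the grid is generated this way, its ninth point is 10^(12/7) ≈ 51.79, which matches the reported optimum for C.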

Analysis
The left axis of Figure 1 shows how the performance changes as a function of the number of examples used. This graph uses the parameter settings and class weights from the main submission (i.e. |V|=3000, C=51.79, γ=0.01). The lower curve (green) shows the mean and standard deviation of the official score for test sets selected by cross-validation. The upper curve (red) shows performance on the (cross-validated) training set, which is always at ceiling. The right axis corresponds to the blue curve in the middle of Figure 1 and indicates the number of support vectors used for various sizes of training set. Almost every added example is being catered for by a new support vector, suggesting overfitting. There is just a little generalisation for the green class, almost none for the others.
Figure 2 shows the variation in macro-F1 with C and γ. The scoring function for this grid search is the official macro-averaged F1 restricted to non-green classes, in contrast to the average accuracy used elsewhere. The optimal value selected by this cross-validation is C=64 and γ=0.0085. This is roughly the same as the C=51.79, γ=0.01 chosen by cross-validation on average accuracy.
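The two quantities plotted in the analysis, the number of support vectors as the training set grows and the cross-validated official score, can be obtained along the following lines. This is a sketch on synthetic data: the training sizes, the fold count and the stratified subsampling are assumptions, and the numbers it prints say nothing about the ReachOut data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.svm import SVC

# Official-style scorer: macro-averaged F1 restricted to the non-green classes.
official_scorer = make_scorer(
    f1_score, labels=["amber", "red", "crisis"], average="macro", zero_division=0
)

# Synthetic, imbalanced four-class data standing in for the real features.
X, y_int = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_redundant=2, n_classes=4,
    n_clusters_per_class=1, weights=[0.55, 0.25, 0.12, 0.08], random_state=0,
)
y = np.array(["green", "amber", "red", "crisis"])[y_int]

def make_svm():
    # Parameter settings and class weights of the main submission.
    return SVC(kernel="rbf", C=51.79, gamma=0.01,
               class_weight={"green": 1, "amber": 20, "red": 20, "crisis": 20})

sv_counts, mean_scores = {}, {}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for n in (100, 200, 300):
    # Stratified subsample of size n, so that every class is present.
    X_n, _, y_n, _ = train_test_split(X, y, train_size=n, stratify=y,
                                      random_state=0)
    sv_counts[n] = int(make_svm().fit(X_n, y_n).n_support_.sum())
    mean_scores[n] = cross_val_score(make_svm(), X_n, y_n, cv=cv,
                                     scoring=official_scorer).mean()

print(sv_counts, mean_scores)
```

A support vector count that stays close to the training set size as it grows is the pattern read above as a sign of overfitting.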

Discussion
The ReachOut challenge is evidently a difficult problem. The combination of class imbalance and an official evaluation metric that is very sensitive to performance on sparsely inhabited classes means that the overall results are likely to be unstable.
It is not obvious what metric is the best fit for the therapeutic application, because the costs of misclassification, while clearly non-uniform, are difficult to estimate, and the rare classes are intuitively important. It would take a detailed clinical outcome study to determine exactly what the tradeoffs are between false positives, false negatives and misclassifications within the positive classes.
The labeled data set, while of decent size, and representative of what can reasonably be done by annotators in a small amount of time, is not so large that the SVM-based approach, with the features used, has reached its potential. The use of the class weight vector does appear to be helpful in improving the official score: it improves performance on the red label at the cost of a small loss of performance on the green label.