Multi-Task Learning for Mental Health using Social Media Text

We introduce initial groundwork for estimating suicide risk and mental health in a deep learning framework. By modeling multiple conditions, the system learns to make predictions about suicide risk and mental health at a low false positive rate. Conditions are modeled as tasks in a multi-task learning (MTL) framework, with gender prediction as an additional auxiliary task. We demonstrate the effectiveness of multi-task learning by comparison to a well-tuned single-task baseline with the same number of parameters. Our best MTL model predicts potential suicide attempt, as well as the presence of atypical mental health, with AUC>0.8. We also find additional large improvements using multi-task learning on mental health tasks with limited training data.


Introduction
Suicide is one of the leading causes of death worldwide, and over 90% of individuals who die by suicide experience mental health conditions. 1 However, detecting the risk of suicide, as well as monitoring the effects of related mental health conditions, is challenging. Traditional methods rely on both self-reports and impressions formed during short sessions with a clinical expert, but it is often unclear when suicide is a risk in particular. 2 Consequently, conditions leading to preventable suicides are often not adequately addressed.
Automated monitoring and risk assessment of patients' language has the potential to complement traditional assessment methods, providing objective measurements to motivate further care and additional support for people with difficulties related to mental health. This paves the way towards verifying the need for additional care with insurance coverage, for example, as well as offering direct benefits to clinicians and patients.
We explore some of the possibilities in the deep learning and mental health space using written social media text that people with different mental health conditions are already producing. Uncovering methods that work with such text provides the opportunity to help people with different mental health conditions by leveraging a task they are already participating in.
However, these studies typically model each condition in isolation, which misses the opportunity to model coinciding influence factors. Tasks with underlying commonalities (e.g., part-of-speech tagging, parsing, and NER) have been shown to benefit from multi-task learning (MTL), as the learning implicitly leverages interactions between them (Caruana, 1993;Sutton et al., 2007;Rush et al., 2010;Collobert et al., 2011;Søgaard and Goldberg, 2016). Suicide risk and related mental health conditions are therefore good candidates for modeling in a multi-task framework.
In this paper, we propose multi-task learning for detecting suicide risk and mental health conditions. The tasks of our model include neuroatypicality (i.e., atypical mental health) and suicide attempt, as well as the related mental health conditions of anxiety, depression, eating disorder, panic attacks, schizophrenia, bipolar disorder, and posttraumatic stress disorder (PTSD), and we explore the effect of task selection on model performance. We additionally include the effect of modeling gender, which has been shown to improve accuracy in tasks using social media text (Volkova et al., 2013;Hovy, 2015).
Predicting suicide risk and several mental health conditions jointly opens the possibility for the model to leverage a shared representation for conditions that frequently occur together, a phenomenon known as comorbidity. Further including gender reflects the fact that gender differences are found in the patterns of mental health (WHO, 2016), which may help to sharpen the model. The MTL framework we propose allows such shared information across predictions and enables the inclusion of several loss functions with a common shared underlying representation. This approach is flexible enough to extend to factors other than the ones shown here, provided suitable data.
We find that choosing tasks that are prerequisites or related to the main task is critical for learning a strong model, similar to . We further find that modeling gender improves accuracy across a variety of conditions, including suicide risk. The best-performing model from our experiments demonstrates that multi-task learning is a promising new direction in automated assessment of mental health and suicide risk, with possible application to the clinical domain.

Our contributions
1. We demonstrate the utility of MTL in predicting mental health conditions from social user text -a notoriously difficult task (Coppersmith et al., 2015a;Coppersmith et al., 2015b) -with potential application to detecting suicide risk. 2. We explore the influence of task selection on prediction performance, including the effect of gender. 3. We show how to model tasks with a large number of positive examples to improve the prediction accuracy of tasks with a small number of positive examples. 4. We compare the MTL model against a singletask model with the same number of parameters, which directly evaluates the multi-task learning approach. 5. The proposed MTL model increases the True Positive Rate at 10% false alarms by up to 9.7% absolute (for anxiety), a result with direct impact for clinical applications.

Ethical Considerations
As with any author-attribute detection, there is the danger of abusing the model to single out people (overgeneralization, see Hovy and Spruit (2016)).
We are aware of this danger, and sought to minimize the risk. For this reason, we don't provide a selection of features or representative examples. The experiments in this paper were performed with a clinical application in mind, and use carefully matched (but anonymized) data, so the distribution is not representative of the population as a whole. The results of this paper should therefore not be interpreted as a means to assess mental health conditions in social media in general, but as a test for the applicability of MTL in a well-defined clinical setting.

Model Architecture
A neural multi-task architecture opens the possibility of leveraging commonalities and differences between mental conditions. Previous work (Collobert et al., 2011;Caruana, 1993) has indicated that such an architecture allows for sharing parameters across tasks, and can be beneficial when there is varying degrees of annotation across tasks. 3 This makes MTL particularly compelling in light of mental health comorbidity, and given that different conditions have different amounts of associated data. Previous MTL approaches have shown considerable improvements over single task models, and

Single-task
Multi-task Figure 1: STL model in plate notation (left): weights trained independently for each task t (e.g., anxiety, depression) of the T tasks. MTL model (right): shared weights trained jointly for all tasks, with task-specific hidden layers. Curves in ovals represent the type of activation used at each layer (rectified linear unit or sigmoid). Hidden layers are shaded.
the arguments are convincing: Predicting multiple related tasks should allow us to exploit any correlations between the predictions. However, in much of this work, an MTL model is only one possible explanation for improved accuracy. Another more salient factor has frequently been overlooked: The difference in the expressivity of the model class, i.e., neural architectures vs. discriminative or generative models, and critically, differences in the number of parameters for comparable models. Some comparisons might therefore have inadvertently compared apples to oranges.
In the interest of examining the effect of multitask learning specifically, we compare the multitask predictions to models with equal expressivity. We evaluate the performance of a standard logistic regression model (a standard approach to text-classification problems), a multilayer perceptron single-task learning (STL) model, and a neural MTL model, the latter two with equal numbers of parameters. This ensures a fair comparison by isolating the unique properties of MTL from the dimensionality-reduction aspects of deep architectures in general.
The neural models we evaluate come in two forms. The first, depicted in plate notation on the left in Figure 1, are the STL models. These are feedforward networks with two hidden layers, trained independently to predict each task. On the right in Figure 1 is the MTL model, where the first hidden layer from the bottom is shared between all tasks. An additional per-task hidden layer is used to give the model flexibility to map from the task-agnostic representation to a task-specific one. Each hidden layer uses a rectified linear unit as non-linearity. The output layer uses a logistic non-linearity, since all tasks are binary predictions. The MTL model can easily be extended to a stack of shared hidden layers, allowing for a more complicated mapping from input to shared space. 4 As noted in Collobert et al. (2011), MTL benefits from mini-batch training, which both allows optimization to jump out of poor local optima, and more stochastic gradient steps in a fixed amount of time (Bottou, 2012). We create mini-batches by sampling from the users in our data, where each user has some subset of the conditions we are trying to predict, and may or may not be annotated with gender. At each mini-batch gradient step, we update weights for all tasks. This not only allows for randomization and faster convergence, it also provides a speed-up over the individual selection process reported in earlier work (Collobert et al., 2011).
Another advantage of this setup is that we do not need complete information for every instance: Learning can proceed with asynchronous updates, dependent on what the data in each batch has been annotated for, while sharing representations throughout. This effectively learns a joint model with a common representation for several different tasks, allowing the use of several "disjoint" data sets, some with limited annotated instances.
Optimization and Model Selection Even in a relatively simple neural model, there are a number of hyperparameters that can (and have to) be tuned to achieve good performance. We perform a line search for every model we use, sweeping over L 2 regularization and hidden layer width. We select the best model based on the development loss. Figure 4 shows the performance on the corresponding test sets (plot smoothed by rolling mean of 10 for visibility).
In our experiments, we sweep over the L2 regularization constant applied to all weights in {10 −4 , 10 −3 , 10 −2 , 0.1, 0.5, 1.0, 5.0, 10.0}, and hidden layer width (same for all layers in the network) in {16, 32, 64, 128, 256, 512, 1024, 2048}. We fix the mini-batch size to 256, and 0.05 dropout on the input layer. Choosing a small minibatch size and the model with lowest development loss helps to account for overfitting. We train each model for 5,000 iterations, jointly updating all weights in our models. After this initial joint training, we select each task separately, and only update the task-specific layers of weights independently for another 1,000 iterations (selecting the set of weights achieving lowest development loss for each task individually). Weights are updated using mini-batch Adagrad (Duchi et al., 2011) -this converges more quickly than other optimization schemes we experimented with. We evaluate the tuning loss every 10 epochs, and select the model with the lowest tuning loss.

Data
We train our models on a union of multiple Twitter user datasets: 1) users identified as having anxiety, bipolar disorder, depression, panic disorder, eating disorder, PTSD, or schizophrenia (Coppersmith et al., 2015a), 2) those who had attempted suicide (Coppersmith et al., 2015c), and 3) those identified as having either depression or PTSD from the 2015 Computational Linguistics and Clinical Psychology Workshop shared task (Coppersmith et al., 2015b), along with neurotypical gendermatched controls (Twitter users not identified as having a mental condition). Users were identified as having one of these conditions if they stated explicitly they were diagnosed with this condition on Twitter (verified by a human annotator), and the data was pre-processed to remove direction indications of the condition. For a subset of 1,101 users, we also manually-annotate gender. The final dataset contains 9,611 users in total, with an average of 3521 tweets per user. The number of users with each condition is included in Table 1. Users in this joined dataset may be tagged with multiple conditions, thus the counts in this table do not sum to the total number of users.
We use the entire Twitter history of each user as input to the model, and split it into character 1-to-5-grams, which have been shown to capture more information than words for many Twitter text classification tasks (Mcnamee and Mayfield, 2004;Coppersmith et al., 2015a). We compute the relative frequency of the 5,000 most frequent n-gram features for n ∈ {1, 2, 3, 4, 5} in our data, and then feed this as input to all models. This input representation is common to all models, allowing for fair comparison.

Experiments
Our task is to predict suicide attempt and mental conditions for each of the users in these data. We evaluate three classes of models: baseline logistic regression over character n-gram features (LR), feed-forward multilayer perceptrons trained to predict each task separately (STL), and feedforward multi-task models trained to predict a set of conditions simultaneously (MTL). We experiment with a feed-forward network against independent logistic regression models as a way to directly test the hypothesis that MTL may work well in this domain.
We also perform ablation experiments to see which subsets of tasks help us learn an MTL model that predicts a particular mental condition best. For all experiments, data were divided into five equal-sized folds, three for training, one for tuning, and one for testing (we report the performance on this).
All our models are implemented in Keras 5 with Theano backend and GPU support. We train the models for a total of up to 15,000 epochs, using mini-batches of 256 instances. Training time on all five training folds ranged from one to eight hours on a machine with Tesla K40M.

Evaluation Setup
We compare the accuracy of each model at predicting each task separately.
In clinical settings, we are interested in minimizing the number of false positives, i.e., incorrect diagnoses, which can cause undue stress to the patient. We are thus interested in bounding this quantity. To evaluate the performance, we plot the false positive rate (FPR) against the true positive rate (TPR). This gives us a receiver operating characteristics (ROC) curve, allowing us to inspect the performance of each model on a specific task at any level of FPR.
While the ROC gives us a sense of how well a model performs at a fixed true positive rate, it makes it difficult to compare the individual tasks at a low false positive rate, which is also important for clinical application. We therefore report two more measures: the area under the ROC curve (AUC) and TPR performance at FPR=0.1 (TPR@FPR=0.1). We do not compare our models to a majority baseline model, since this model would achieve an expected AUC of 0.5 for all tasks, and F-score and TPR@FPR=0.1 of 0 for  all mental conditions -users exhibiting a condition are the minority, meaning a majority baseline classifier would achieve zero recall. Figure 2 shows the AUC-score of each model for each task separately, and Figure 3 the true positive rate at a low false positive rate of 0.1. Precisionrecall curves for model/task are in Figure 5. STL is a multilayer perceptron with two hidden layers (with a similar number of parameters as the proposed MTL model). The MTL +gender and MTL models predict all tasks simultaneously, but are only evaluated on the main respective task.

Results
Both AUC and TPR (at FPR=0.1) demonstrate that single-task models models do not perform nearly as well as multi-task models or logistic regression. This is likely because the neural networks learned by STL cannot be guided by the inductive bias provided by MTL training. Note, however, that STL and MTL are often times comparable in terms of F1-score, where false positives and false negatives are equally weighted. Figure 2, multi-task suicide predictions reach an AUC of 0.848, and predictions for anxiety and schizophrenia are not far behind. Interestingly however, schizophrenia stands out as being the only condition to be best predicted with a single-task model. MTL models show improvements over STL and LR models for predicting suicide, neuroatypicality, depression, anxiety, panic, bipolar disorder, and PTSD. The inclusion of gender in the MTL models leads to direct gains over an LR baseline in predicting anxiety disorders: anxiety, panic, and PTSD.

As shown
In Figure 3, we illustrate the true positive rate -that is, how many cases of mental health conditions that we correctly predict -given a low false positive rate -that is, a low rate of predicting people have mental health conditions when they do not. This is particularly useful in clinical settings, where clinicians seek to minimize overdiagnosing. In this setting, MTL leads to the best performance across the board, for all tasks under consideration: Neuroatypicality, suicide, depression, anxiety, eating, panic, schizophrenia, bipolar disorder, and PTSD. Including gender in MTL further improves performance for neuroatypicality, suicide, anxiety, schizophrenia, bipolar disorder, and PTSD. These differences in AUC are significant at p = 0.05 according to bootstrap sampling tests with 5000 samples. The wide difference between MTL and STL can be explained in part by the increased feature set size -MTL training may, in this case, provide a form of regularization that STL cannot exploit. Further, modeling the common mental health conditions with the most data (depression, anxiety) helps in pulling out more rare conditions comorbid with these common health conditions. This provides evidence that an MTL model can help in predicting elusive conditions by using large data for common conditions, and a small amount of data for more rare conditions. Figures 2 and  3 both suggest that adding gender as an auxiliary task leads to more predictive models, even though the difference is not statistically significant for most tasks. This is consistent with the findings in previous work (Volkova et al., 2013;Hovy, 2015). Interestingly, though, the MTL model is worse at predicting gender itself. While this could be a direct result of data sparsity (recall that we have only a small subset annotated for gender), which could be remedied by annotating additional users for gender, this appears unlikely given the other findings of our experiments, where MTL helped in specifically these sparse scenarios.

Utility of Authorship Attributes
However, it has been pointed out by  that not all tasks benefit from a MTL setting in the same way, and that some tasks serve purely auxiliary functions. Here, gender prediction does not benefit from including mental conditions, but helps vice versa. In other words, predicting gender is qualitatively different from predicting mental health conditions: it seems likely that the signals for anxiety ares much more similar to the ones for depression than for, say, being male, and can therefore add to detecting depression. However, the distinction between certain conditions does not add information for the distinction of gender. The effect may also be due to the fact that these data were constructed with inferred gender (used to match controls), so there might be a degree of noise in the data.
Choosing Tasks Although MTL tends to dominate STL in our experiments, it is not clear whether modeling several tasks provide a beneficial bias in MTL models in general, or if there exists specific subsets of auxiliary tasks that are most beneficial for predicting suicide risk and related mental health conditions. We perform ablation experiments by training MTL models on a subset of auxiliary tasks, and prediction for a single main task. We focus on four conditions to predict well: suicide attempt, anxiety, depression, and bipolar disorder. For each main task, we vary the auxiliary tasks we train the MTL model with. Since considering all possible subsets of tasks is combinatorily unfeasible, we choose the following task subsets as auxiliary: • all: all mental conditions along with gender • all conds: all mental conditions, no gender • neuro: only neurotypicality • neuro+mood: neurotypicality, depression, and bipolar disorder (mood disorders) • neuro+anx: neurotypicality, anxiety, and panic attack (anxiety conditions) • neuro+targets: neurotypicality, anxiety, depression, suicide attempt, bipolar disorder • none: no auxiliary tasks, equivalent to STL Table 2 shows AUC for the four prediction tasks with different subsets of auxiliary tasks. Statistically significant improvements over the respective LR baselines are denoted by superscript. Restricting the auxiliary tasks to a small subset tends to hurt performance for most tasks, with exception to bipolar, which benefits from the prediction of depression and suicide attempt. All main tasks achieve their best performance using the full set of additional tasks as auxiliary. This suggests that the biases induced by predicting different kinds of mental conditions are mutually beneficial -e.g., multi-task models that predict suicide attempt may also be good at predicting anxiety.
Based on these results, we find it useful to think of MTL as a framework to leverage auxiliary tasks as regularization to effectively combat data paucity and less-than-trustworthy labels. As we have demonstrated, this may be particularly useful when predicting mental health conditions and suicide risk.

Discussion: Multi-task Learning
Our results indicate that an MTL framework can lead to significant gains over single-task models for predicting suicide risk and several mental health conditions. We find benefit from predict-ing related mental conditions and demographic attributes simultaneously.
We experimented with all the optimizers that Keras provides, and found that Adagrad seems to converge fastest to a good optimum, although all the adaptive learning rate optimizers (such as Adam, etc.) tend to converge quickly. This indicates that the gradient is significantly steeper along certain parameters than others. Default stochastic gradient descent (SGD) was not able to converge as quickly, since it is not able to adaptively scale the learning rate for each parameter in the modeltaking too small steps in directions where the gradient is shallow, and too large steps where the gradient is steep. We further note an interesting behavior: all of the adaptive learning rate optimizers yield a strange "step-wise" training loss learning curve, which hits a plateau, but then drops after about 900 iterations, only to hit another plateau, and so on. Obviously, we would prefer to have a smooth training loss curve. We can indeed achieve this using SGD, but it takes much longer to con- Figure 5: Precision-recall curves for predicting each condition. verge than, for example, Adagrad. This suggests that a well-tuned SGD would be the best optimizer for this problem, a step that would require some more experimentation and is left for future work.
We also found that feature counts have a pronounced effect on the loss curves: Relative feature frequencies yield models that are much easier to train than raw feature counts.
Feature representations are therefore another area of optimization, e.g., different ranges of character n-grams (e.g., n > 5) and unigrams. We used character 1-to-5-grams, since we believe that these features generalize better to a new domain (e.g., Facebook) than word unigrams. However, there is no fundamental reason not to choose longer character n-grams, other than time constraints in regenerating the data, and accounting for overfitting with proper regularization.
Initialization is a decisive factor in neural models, and Goldberg (2015) recommends repeated restarts with differing initializations to find the optimal model. In an earlier experiment, we tried initializing a MTL model (without task-specific hidden layers) with pretrained word2vec embeddings of unigrams trained on the Google News n-gram corpus. However, we did not notice an improvement in F-score. This could be due to the other factors, though, such as feature sparsity. Table 3 shows parameters sweeps with hidden layer width 256, training the MTL model on the social media data with character trigrams as input features. The sweet spots in this table may be good starting points for training models in future work.

Related Work
MTL was introduced by Caruana (1993), based on the observation that humans rarely learn things in isolation, and that it is the similarity between related tasks that helps us get better.
Some of the first works on MTL were motivated by medical risk prediction , and it is now being rediscovered for this purpose (Lipton et al., 2016). The latter use a long shortterm memory (LSTM) structure to provide several medical diagnoses from health care features (yet no textual or demographic information), and find small, but probably not significant improvements over a structure similar to the STL we use here.   The target in previous work was medical conditions as detected in patient records, not mental health conditions in social text. The focus in this work has been on the possibility of predicting suicide attempt and other mental health conditions using social media text that a patient may already be writing, without requiring full diagnoses.
The framework proposed by Collobert et al. (2011) allows for predicting any number of NLP tasks from a convolutional neural network (CNN) representation of the input text. The model we present is much simpler: A feed-forward network with n-gram input layer, and we demonstrate how to constrain n-gram embeddings for clinical application. Comparing with further models is possible, but distracts from the question of whether MTL training can help in this domain. As we have shown, it can.

Conclusion and Future Work
In this paper, we develop neural MTL models for 10 prediction tasks (suicide, seven mental health conditions, neurotypicality, and gender). We compare their performance with STL models trained to predict each task independently.
Our results show that an MTL model with all task predictions performs significantly better than other models, reaching 0.846 TPR for neuroatypicality where FPR=0.1, and AUC of 0.848, TPR of 0.559 for suicide. Due to the nature of MTL, we find additional contributions that were not the original goal of this work: Pronounced gains in detecting anxiety, PTSD, and bipolar disorder. MTL predictions for anxiety, for example, reduce the error rate from a single-task model by up to 11.9%.
We also investigate the influence of model depth, comparing to progressively deeper STL feed-forward networks with the same number of parameters. We find: (1) Most of the modeling power stems from the expressivity conveyed by deep architectures. (2) Choosing the right set of auxiliary tasks for a given mental condition can yield a significantly better model. (3) The MTL model dramatically improves for conditions with the smallest amount of data. (4) Gender prediction does not follow the two previous points, but improves performance as an auxiliary task.
Accuracy of the MTL approach is not yet ready to be used in isolation in the clinical setting. However, our experiments suggest this is a promising direction moving forward. There are strong gains to be made in using multi-task learning to aid clinicians in their evaluations, and with further partnerships between the clinical and machine learning community, we foresee improved suicide prevention efforts.