Deep learning for language understanding of mental health concepts derived from Cognitive Behavioural Therapy

In recent years, we have seen deep learning and distributed representations of words and sentences make an impact on a number of natural language processing tasks, such as similarity, entailment and sentiment analysis. Here we introduce a new task: understanding of mental health concepts derived from Cognitive Behavioural Therapy (CBT). We define a mental health ontology based on CBT principles, annotate a large corpus in which these phenomena are exhibited, and perform understanding using deep learning and distributed representations. Our results show that deep learning models combined with word embeddings or sentence embeddings significantly outperform non-deep-learning models on this difficult task. This understanding module will be an essential component of a statistical dialogue system delivering therapy.


Introduction
Promotion of mental well-being is at the core of the action plan on mental health 2013-2020 of the World Health Organisation (WHO) (World Health Organization, 2013) and of the European Pact on Mental Health and Well-being of the European Union (EU high-level conference: Together for Mental Health and Well-being, 2008). The biggest potential breakthrough in fighting mental illness would lie in finding tools for early detection and preventive intervention (Insel and Scholnick, 2006). The WHO action plan stresses the importance of health policies and programmes that not only meet the needs of people affected by mental disorders but also protect mental well-being. The emphasis is on early evidence-based non-pharmacological intervention, avoiding institutionalisation and medicalisation. What is particularly important for successful intervention is the frequency with which the therapy can be accessed (Hansen et al., 2002). This gives automated systems a huge advantage over conventional therapies, as they can be used continuously at marginal extra cost. Health assistants that can deliver therapy have gained great interest in recent years (Bickmore et al., 2005; Fitzpatrick et al., 2017). These systems, however, are largely based on hand-crafted rules. On the other hand, the main research effort in statistical approaches to conversational systems has focused on limited-domain information-seeking dialogues (Schatzmann et al., 2006; Geist and Pietquin, 2011; Gasic and Young, 2014; Fatemi et al., 2016; Li et al., 2016; Williams et al., 2017).
In this paper we introduce a new task: understanding of mental health concepts derived from Cognitive Behavioural Therapy (CBT). We present an ontology that is formulated according to Cognitive Behavioural Therapy principles. We label a high-quality mental health corpus, which exhibits targeted psychological phenomena. We use the whole unlabelled dataset to train distributed representations of words and sentences. We then investigate two approaches for classifying the user input according to the defined ontology. The first model involves a convolutional neural network (CNN) operating over distributed word representations. The second involves a gated recurrent unit (GRU) network operating over distributed representations of sentences. Our models perform significantly better than chance, and for classes with a large amount of data they reach the inter-annotator agreement. This understanding module will be an essential component of a statistical dialogue system delivering therapy.
The paper is organised as follows. In Section 2 we give a brief background of the statistical approach to dialogue modelling, focusing on dialogue ontology and natural language understanding. In Section 3 we review related work in the area of automated mental-health assistants. The sections that follow represent the main contribution of this work: a CBT ontology in Section 4, a labelled dataset in Section 5, and models for language understanding in Section 6. We present the results in Section 7 and our conclusion in Section 8.
arXiv:1809.00640v1 [cs.CL] 3 Sep 2018

Background
A dialogue system can be treated as a trainable statistical model suitable for goal-oriented information-seeking dialogues (Young, 2002). In these dialogues, the user has a clear goal that he or she is trying to achieve, and this involves extracting particular information from a back-end database. The ontology, a structured representation of the database, is a central element of a dialogue system. It defines the concepts that the dialogue system can understand and talk about. Another critical component is the natural language understanding unit, which takes textual user input and detects the presence of the ontology concepts in the text.

Dialogue ontology
Statistical approaches to dialogue modelling have been applied to relatively simple domains. These systems interface with databases of up to 1000 entities, where each entity has up to 20 properties, i.e. slots (Cuayáhuitl, 2009). There has been a significant amount of work in spoken language understanding focused on exploiting large knowledge graphs in order to improve coverage (Tür et al., 2012; Heck et al., 2013). Despite these efforts, little work has been done on mental health ontologies for supporting cognitive behavioural therapy in dialogue systems. Available medical ontologies follow a symptom-treatment categorisation and are not suitable for dialogue or natural language understanding (Bluhm, 2017; Hofmann, 2014; Wang et al., 2018).

Natural language understanding
Within a dialogue system, a natural language understanding unit extracts meaning from user sentences. Both classification (Mairesse et al., 2009) and sequence-to-sequence (Yao et al., 2014;Mesnil et al., 2015) models have been applied to address this task.
In this work we treat understanding of mental health concepts as a classification task. To facilitate this process, we use distributed representations.

Related work
The aim of building an automated therapist has been around since the first time researchers attempted to build a dialogue system (Weizenbaum, 1966). Automated health advice systems built to date typically rely on expert-coded rules and have limited conversational capabilities (Rojas-Barahona and Giorgino, 2009; Vardoulakis et al., 2012; Ring et al., 2013; Riccardi, 2014; DeVault et al., 2014; Ring et al., 2016). One particular system that we would like to highlight is an affectively aware virtual therapist (Ring et al., 2016). This system is based on Cognitive Behavioural Therapy and the system behaviour is scripted using VoiceXML. There is no language understanding: the agent simply asks questions and the user selects answers from a given list. The agent is however able to interpret hand gestures, posture shifts, and facial expressions. Another notable system (DeVault et al., 2014) has a multi-modal perception unit which captures and analyses user behaviour for both behavioural understanding and interaction. The measurements contribute to the indicator analysis of affect, gesture, emotion and engagement. Again, no statistical language understanding takes place and the behaviour of the system is scripted. The system does not provide therapy to the user but is rather a tool that can support healthcare decisions (by human healthcare professionals).
The Stanford Woebot chat-bot proposed by Fitzpatrick et al. (2017) is designed for delivering CBT to young adults with depression and anxiety. It has been shown that interaction with this chat-bot can significantly reduce the symptoms of depression when compared to a group of people directed to read a CBT manual. The conversational agent appears to be effective in engaging the users. However, the understanding component of Woebot has not been fully described. The dialogue decisions are based on decision trees. At each node, the user is expected to choose one of several predefined responses. Limited language understanding was introduced at specific points in the tree to determine routing to subsequent conversational nodes. Still, one of the main deficiencies reported by the trial participants in Fitzpatrick et al. (2017) was the inability to converse naturally. Here we address this problem by performing statistical natural language understanding.

CBT ontology
To define the ontology we draw from principles of Cognitive Behavioural Therapy (CBT). This is one of the best studied psychotherapeutic interventions, and the most widely used psychological treatment for mental disorders in Britain (Bhasi et al., 2013). There is evidence that CBT is more effective than other forms of psychotherapy (Tolin, 2010). Unlike other, longer-term forms of therapy such as psychoanalysis, CBT can have a positive effect on the client within a few sessions. Also, because it is highly structured, it is more amenable to computer interpretation. This is why we adopted CBT as the basis of our work.
Cognitive Behavioural Therapy is derived from Cognitive Therapy model theory (Beck, 1976;Beck et al., 1979), which postulates that our emotions and behaviour are influenced by the way we think and by how we make sense of the world. The idea is that, if the client changes the way he or she thinks about their problem, this will in turn change the way he or she feels, and behaves.
A major underlying principle of CBT is the idea of cognitive distortions, and the value in challenging them. In CBT, clients are helped to test their assumptions and views of the world in order to check if they fit with reality. When clients learn that their perceptions and interpretations are distorted or unhelpful they then work on correcting them. Within the realm of cognitive distortion, CBT identifies a number of specific self-defeating thought processes, or thinking errors. There is a core of around 10 to 15 thinking errors, with their exact titles having some fluidity. A strong component of CBT is teaching clients to be able to recognize and identify the thinking errors themselves, and ultimately discard the negative thought processes and 're-think' their problems.
We consider the main analytical step in this therapy: an adequate decoding of these 'thinking error' concepts, and the identification of the key emotion(s) and the situational context of a particular problem. Therefore, our ontology consists of thinking errors, emotions, and situations.

Thinking errors
Notwithstanding slight variations in number and terminology, the list of thinking errors is fairly well standardised in the CBT literature. We present one such list in Table 1. However, it is important to note that there is a fair degree of overlap between different thinking errors, for example, between Jumping to Negative Conclusions and Fortune Telling, or between Disqualifying the Positives and Mental Filtering. In addition, within the data used (and as is likely to be the case in any data of spontaneous expressions of psychological upset), a single problem can exhibit several thinking errors simultaneously. Thus, the situation is much more challenging than in simple information-seeking dialogues, where ontologies are typically clearly defined and there is no or very little overlap between concepts.

Emotions
In addition to thinking errors, we define a set of emotions. We mainly focus on negative emotions, relevant to people in psychological distress. In CBT, emotions tend to be divided into positive and negative, or helpful/healthy and unhelpful/unhealthy emotions (Branch and Willson, 2010). The set of emotions for this work evolved over time in the early days of annotation. Although we initially agreed to focus on 'unhealthy' emotions, as defined by CBT, there seemed also to be a place for the 'healthy' emotion Grief/sadness. Overall, the list of emotions used was drawn from a number of sources, including CBT literature, the annotators' own knowledge of what they work with in psychological therapy, and the common emotions that were seen emerging from the data early on in the process. Note that more than one emotion might be expressed within an individual problem, for example Depression and Loneliness. The list of emotions is given in Table 2.

Situations
While our main emphasis was on thinking errors and emotions, we also defined a small set of situations. The list of situations again evolved during the early days of annotation, with a longer original list being reduced down, for simplicity. Again, it is possible for more than one situation (for example Work and Relationships) to apply to a single problem. The considered situations are given in Table 3.

The corpus
The corpus consists of 500K written posts that users anonymously posted on the Koko platform (https://itskoko.com/). This platform is based on the peer-to-peer therapy proposed by Morris et al. (2015). In this set-up, a user anonymously posts their problem (referred to as the problem) and is prompted to consider their most negative take on the problem (referred to as the negative take). Subsequently, peers post responses that attempt to offer a re-think and give a more positive angle on the problem. When first developed, this peer-to-peer framework was shown to be more efficacious than expressive writing, an intervention that is known to improve physical and [...]. Since then, the app developed by Koko has collected a very large number of posts and associated responses. Initially, any first-time Koko user would be given a short introductory tutorial in the art of 're-thinking'/'re-framing' problems (based on CBT principles), before being able to use the platform. This however changed over time, as the age of the users decreased, and a different tutorial, emphasizing empathy and optimism, was used (less CBT-based than the 're-thinking'). Most of the data annotated in this study was drawn from the earlier phase. Figure 1 gives an annotated post example.

Annotation
A subset of posts was annotated by two psychological therapists using a web annotation tool that we developed. The annotation tool allowed annotators to have a quick view of the posts, showing up to 50 posts per page, to navigate through posts, to check pending posts and to annotate them by adding or removing thinking errors, emotions and situations. All annotations were stored in a MySQL database. Initially 1000 posts were analysed. These were used to define the ontology. Then 4035 posts were labelled with thinking errors, emotions and situations. It takes an experienced psychological therapist about one minute to annotate one post. Note that the same post can exhibit multiple thinking errors, emotions and situations, which makes the whole process more complex. We randomly selected 50 posts and calculated the inter-annotator agreement using a contingency table for thinking errors, emotions and situations, showing agreement and disagreement between the two annotators. Cohen's kappa was then calculated, discounting the possibility that the agreement may happen by chance. The result is shown in [...] due to the unbounded number of thinking errors per post. In other words, the annotators typically have three or four thinking errors in common but one of them might have detected one or two more. Still, the agreement is much higher than chance, so we think that, while challenging, it is possible to build a classifier for this task. The distributions of labelled posts with multiple sub-categories for the three super-categories are shown in Figure 2.
The task of decoding thinking errors and emotions is closely related to the task of sentiment analysis. In sentiment analysis we are concerned with positive or negative sentiment expressed in a sentence. Detecting thinking errors or emotions could be perceived as detecting different kinds of negative sentiment.
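For reference, the inter-annotator agreement computed in the annotation step can be sketched as follows. This is a minimal illustration of Cohen's kappa for a single label treated as a yes/no decision by two annotators; the function name and the example counts are hypothetical and are not taken from the study, which does not spell out its per-label procedure.

```python
def cohens_kappa(both_yes, only_a, only_b, both_no):
    """Cohen's kappa for two annotators making a binary decision on one label.

    The four arguments are the cells of a 2x2 contingency table
    (hypothetical counts, for illustration only).
    """
    n = both_yes + only_a + only_b + both_no
    if n == 0:
        raise ValueError("empty contingency table")
    # Observed agreement: fraction of posts where both annotators agree.
    observed = (both_yes + both_no) / n
    # Chance agreement, from each annotator's marginal rate of saying "yes".
    a_yes = (both_yes + only_a) / n
    b_yes = (both_yes + only_b) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    if expected == 1.0:
        return 1.0  # both annotators always give the same single answer
    return (observed - expected) / (1 - expected)


# Made-up counts over 50 posts.
kappa = cohens_kappa(both_yes=20, only_a=5, only_b=5, both_no=20)
```

Kappa is 0 when the observed agreement equals what the marginal rates would produce by chance, and 1 for perfect agreement.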
Distributed representations of words, sentences and documents have gained success in sentiment detection and similarity tasks (Le and Mikolov, 2014a;Maas et al., 2011;Kiros et al., 2015). A key advantage of these representations is that they can be obtained in an unsupervised manner, thus allowing exploitation of large amounts of unlabelled data. This is precisely what we have in our set-up, where only a small portion of our posts is labelled. We utilise GloVe (Pennington et al., 2014) word vectors, which have previously achieved competitive results in a similarity task. We train the word vectors on the whole dataset and then use a convolutional neural network (CNN) to extract features from posts where words are represented as vectors.
We also consider distributed representations of sentences. A particularly competitive model is the skip-thought model, which is obtained from an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage (Kiros et al., 2015). On similarity tasks it outperforms the simpler doc2vec model (Le and Mikolov, 2014a). An approach that represents sentences by weighted averages of word vectors and then modifies them using PCA and SVD outperforms skip-thought vectors (Arora et al., 2017). This method however does not do well on a sentiment analysis task due to down-weighting of words like "not". As these often appear in our corpus, we chose skip-thought vectors for investigation here.
The skip-thought model allows a dense representation of the utterance. We train skip-thought vectors using the method described in Kiros et al. (2015). The automatically generated post shown in Fig 3 demonstrates that skip-thought vectors can convey the sentiment well in accordance with the context. We then train a gated recurrent unit (GRU) network using the skip-thought vectors as input.

Convolutional neural network model
The convolutional neural network (CNN) model is preferred over a recurrent neural network (RNN) model because the posts are generally too long for an RNN to maintain memory over words. The CNN used in this work is inspired by Kim (2014) and operates over pre-trained GloVe embeddings of dimensionality $d$. As shown in Fig 4, the network has two inputs, one for the problem and the other for the negative take. These are represented as two tensors. A convolutional operation involves a filter $w \in \mathbb{R}^{ld}$, which is applied to a window of $l$ words to produce a feature map. Then, a max-pooling operation is applied to produce two vectors: $p$ for the problem and $n$ for the negative take. The two inputs are processed separately because the negative take is usually a summary of the post, carrying stronger sentiment (see Figure 1). We use a gating mechanism to combine $p$ and $n$ as follows:
$$g = \sigma(W_p p + W_n n + b) \qquad (1)$$
$$h = W \left( g \odot p + (\mathbf{1} - g) \odot n \right) \qquad (2)$$
Here, $\sigma$ is the sigmoid function, $W_p$, $W_n$ and $W$ are weight matrices, $b$ is a bias term, $\mathbf{1}$ is a vector of ones, $\odot$ is the element-wise product, and $g$ is the output of the gating mechanism. The extracted feature $h$ is then processed with a one-layer fully-connected neural network (FNN) to perform binary classification.
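The feature extraction and gating described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the gate takes the form g = σ(W_p p + W_n n + b) and h = W(g ⊙ p + (1 − g) ⊙ n), uses a single filter width, applies a ReLU non-linearity, and shares filters between the two inputs. The function names and toy sizes are hypothetical.

```python
import numpy as np


def conv_max_pool(X, filters, l):
    """X: (num_words, d) word embeddings; filters: (k, l*d).

    Returns a (k,) vector: each filter's maximum response over all windows.
    """
    num_words, _ = X.shape
    # Flatten every window of l consecutive word vectors into a row of length l*d.
    windows = np.stack([X[i:i + l].reshape(-1) for i in range(num_words - l + 1)])
    feature_maps = np.maximum(windows @ filters.T, 0.0)  # ReLU is an assumption
    return feature_maps.max(axis=0)  # max-pooling over time


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_combination(p, n, W_p, W_n, W, b):
    """Combine problem features p and negative-take features n with a gate."""
    g = sigmoid(W_p @ p + W_n @ n + b)
    return W @ (g * p + (1.0 - g) * n)


rng = np.random.default_rng(0)
d, l, k = 8, 3, 5                           # toy sizes, not the paper's settings
problem = rng.normal(size=(12, d))          # 12 words in the problem
negative_take = rng.normal(size=(6, d))     # 6 words in the negative take
filters = rng.normal(size=(k, l * d))
p = conv_max_pool(problem, filters, l)
n = conv_max_pool(negative_take, filters, l)
W_p, W_n, W = (rng.normal(size=(k, k)) for _ in range(3))
h = gated_combination(p, n, W_p, W_n, W, np.zeros(k))
```

When the gate weights are zero, g is 0.5 everywhere and the output reduces to a plain average of p and n; training lets the gate decide, per feature, how much to trust the negative take.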

Gated recurrent unit model
We use the gated recurrent unit (GRU) model to process skip-thought sentence vectors, for two reasons. First, most posts contain fewer than five sentences, so a recurrent neural network is more suitable than a convolutional neural network. Second, since our corpus comprises only very limited labelled data, a GRU should perform better than a long short-term memory (LSTM) network as it has fewer parameters. Denote each post as $P = \{s_1, s_2, ..., s_t, ...\}$, where $s_t$ is the $t$-th sentence in post $P$. First, we use an already trained GRU to extract skip-thought embeddings $e_t$ from the sentences $s_t$. Then, taking the sequence of sentence vectors $\{e_1, e_2, ..., e_t, ...\}$ as input, another GRU is used as follows: [...] bias terms, $\odot$ is the element-wise product, and $\sigma$ is the sigmoid function.
Finally, the last hidden state h T is fed into a FNN with one hidden layer of the same size as input. The model is illustrated in
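To make the recurrence over sentence vectors concrete, here is a minimal NumPy sketch of a GRU that consumes a sequence of sentence embeddings and returns the last hidden state. It uses one common textbook formulation of the GRU (update gate z, reset gate r, candidate state); the exact parameterisation used in this work is an assumption, and the parameter names and toy sizes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_last_state(E, params, hidden_size):
    """E: (T, d) sequence of sentence embeddings. Returns the final hidden state."""
    h = np.zeros(hidden_size)
    for e in E:
        # Update gate: how much of the new candidate state to take.
        z = sigmoid(params["W_z"] @ e + params["U_z"] @ h + params["b_z"])
        # Reset gate: how much of the previous state feeds the candidate.
        r = sigmoid(params["W_r"] @ e + params["U_r"] @ h + params["b_r"])
        h_tilde = np.tanh(params["W_h"] @ e + params["U_h"] @ (r * h) + params["b_h"])
        h = (1.0 - z) * h + z * h_tilde
    return h


rng = np.random.default_rng(0)
d, hidden = 6, 4                      # toy sizes, not the paper's settings
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, d))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(hidden, hidden))
    params[f"b_{gate}"] = np.zeros(hidden)

sentences = rng.normal(size=(3, d))   # stand-ins for three skip-thought vectors
h_T = gru_last_state(sentences, params, hidden)
```

The returned vector plays the role of the last hidden state that is passed to the feed-forward classifier.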

Training set-up
We first train 100- and 300-dimensional GloVe embeddings and skip-thought embeddings using the same mechanisms as in Pennington et al. (2014) and Kiros et al. (2015), respectively. Some posts contain very long sentences, so we bound the sentence length at 50 words. For the GRU model, we do not treat the problem separately from the negative take, as the GRU will in any case put more importance on the information that comes last. We split the labelled data in an 8:1:1 ratio for training, validation and testing in a 10-fold cross-validation for both GRU and CNN training. A distinct network is trained for each concept, i.e. one for thinking errors, one for emotions and one for situations. The hidden size of the FNN is 150.
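One way to realise an 8:1:1 split inside a 10-fold cross-validation is to rotate the folds so that each fold serves once as the test set, with a neighbouring fold used for validation. The sketch below shows this scheme; how the validation fold was actually chosen is not stated in the text, so the rotation rule and the function name are assumptions. Shuffling the posts beforehand is omitted for brevity.

```python
def rotating_folds(num_items, num_folds=10):
    """Yield (train, valid, test) index lists in roughly an 8:1:1 ratio."""
    # Assign items to folds round-robin so fold sizes differ by at most one.
    folds = [list(range(i, num_items, num_folds)) for i in range(num_folds)]
    for k in range(num_folds):
        valid_k = (k + 1) % num_folds  # assumed choice of validation fold
        test = folds[k]
        valid = folds[valid_k]
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, valid_k) for i in fold]
        yield train, valid, test


# 4035 is the number of labelled posts reported for the corpus.
splits = list(rotating_folds(4035))
```

Each post appears in exactly one test set across the ten splits, so per-fold scores can be averaged into a single estimate.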
We used filter windows of 2, 3, and 4 with 50 feature maps for the CNN model. For the GRU model, the hidden size is set at 150, so that both models have a comparable number of parameters. Mini-batches of size 24 are used and gradients are clipped with maximum norm 5. We initialise the learning rate at 0.001, with a decay rate of 0.986 every 10 steps. The non-recurrent weights are initialised with a truncated normal distribution (0, 0.01), and the recurrent weights with orthogonal initialisation (Saxe et al., 2013). To overcome over-fitting, we employ dropout with rate 0.8 and l2-normalisation. Both models were trained with the Adam algorithm and implemented in TensorFlow (Girija, 2016).
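The learning-rate schedule and gradient clipping mentioned above can be written out explicitly. The sketch below is a framework-free illustration using the stated values (initial rate 0.001, decay 0.986 every 10 steps, maximum norm 5). Whether the decay was applied in discrete steps or continuously, and whether clipping was by global norm or per tensor, is not specified in the text; both are assumptions here, and the function names are hypothetical.

```python
import math


def learning_rate(step, base=0.001, decay=0.986, decay_steps=10, staircase=True):
    """Exponentially decayed learning rate at a given training step."""
    # staircase=True applies the decay once per full block of decay_steps.
    exponent = step // decay_steps if staircase else step / decay_steps
    return base * decay ** exponent


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [[g * scale for g in vec] for vec in grads]


lr_after_1000_steps = learning_rate(1000)
clipped = clip_by_global_norm([[6.0, 8.0]])  # norm 10 is rescaled to norm 5
```

With these settings the rate shrinks by a factor of 0.986 to the power 100 over the first 1000 steps, which is roughly a fourfold reduction.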

Baselines
For rule-based models, we chose a chance classifier and a majority classifier, where all the posts are treated as positive examples for each class. In addition, we trained two non-deep-learning models, a logistic regression (LR) model and a Support Vector Machine (SVM). Both take bag-of-words features as input and are implemented in sklearn (Pedregosa et al., 2011). For completeness, we also trained 100- and 300-dimensional PV-DM document embeddings (Le and Mikolov, 2014b) as distributed representations of the posts using the gensim toolkit (Řehůřek and Sojka, 2010), and employed FNNs to do the classification; the hidden size is set to 800 so that the number of parameters is comparable across all deep learning models. All the baseline models are trained with the same set-up as described in section 6.4.
Table 6 shows the F1-measure of the compared models that detect thinking errors, emotions and situations under the 1:1 oversampling ratio. We only include the results of the best performing models, SVMs, CNNs and GRUs, due to limited space. The results show that both deep learning models outperform SVM-BOW at the larger embedding dimension. Although SVM-BOW is comparable to the 100 dimensional GRU-Skip-thought in terms of average F1, in all other cases CNN-GloVe and GRU-Skip-thought outperform SVM-BOW. We also find that CNN-GloVe on average works better than GRU-Skip-thought, which is expected, as the space of words is smaller than the space of sentences, so the word vectors can be more accurately trained. While the CNN operating on 100 dimensional word vectors is comparable to the CNN operating on 300 dimensional word vectors, the GRU-Skip-thought tends to be worse on 100 dimensional skip-thoughts, suggesting that sentence vectors generally need to be of a higher dimension than word vectors to represent the meaning accurately.
Table 7 shows a more detailed analysis of the 300 dimensional CNN-GloVe performance, where both precision and recall are presented, indicating that the oversampling mechanism can help overcome the data bias problem. To illustrate the capabilities of this model, we give samples of two posts and their predicted and true labels in Figure 6, which shows that our model discerns the classes reasonably well even in some difficult cases.
Table 6: F1 score of the models trained with embeddings with dimensionality of 300 and 100 respectively.
While oversampling is essential for both models, GRU-Skip-thought is less sensitive to lower oversampling ratios, suggesting that skip-thoughts can already capture sentiment on the sentence level. Therefore, including only a limited ratio of positive samples is sufficient to train the classifier. In contrast, models using word vectors need more positive data to learn sentence sentiment features.
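As an illustration of the oversampling discussed above, the sketch below duplicates positive examples at random until the positive-to-negative ratio reaches a target value such as 1:1. The text available here does not describe the exact procedure, so reading the ratio as positives to negatives, sampling with replacement, and the function name are all assumptions.

```python
import random


def oversample(examples, labels, ratio=1.0, seed=0):
    """Return shuffled (example, label) pairs with positives duplicated
    until their count is about ratio times the number of negatives."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    target = int(round(ratio * len(neg)))
    extra = []
    if pos and target > len(pos):
        # Sample existing positives with replacement to fill the gap.
        extra = [rng.choice(pos) for _ in range(target - len(pos))]
    data = [(x, 1) for x in pos + extra] + [(x, 0) for x in neg]
    rng.shuffle(data)
    return data


# Toy data: 2 positive and 8 negative posts for one class.
posts = [f"post {i}" for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced = oversample(posts, labels, ratio=1.0)
```

Oversampling should be applied to the training portion only, so that validation and test scores reflect the original class balance.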

Conclusion
We presented an ontology based on the principles of Cognitive Behavioural Therapy. We then annotated data that exhibits psychological problems and computed the inter-annotator agreement.
We found that classifying thinking errors is a difficult task, as suggested by the low inter-annotator agreement. We trained GloVe word embeddings and skip-thought embeddings on 500K posts in an unsupervised fashion and generated distributed representations both of words and of sentences. We then used the GloVe word vectors as input to a CNN and the skip-thought sentence vectors as input to a GRU. The results suggest that both models significantly outperform a chance classifier for all thinking errors, emotions and situations, with CNN-GloVe on average achieving better results. Areas of future investigation include richer distributed representations, or a fusion of distributed representations from word-level, sentence-level and document-level, to acquire more powerful semantic features. We also plan to extend the current ontology, with its focus on thinking errors, emotions and situations, to include a much larger number of concepts. The development of a statistical system delivering therapy will moreover require further research on other modules of a dialogue system.
Table 7: Precision, recall, F1 score and accuracy for 300 dim CNN-GloVe with oversampling ratio 1:1
Figure 7: Weighted AVG. F1 for different models