Using Automated Metaphor Identification to Aid in Detection and Prediction of First-Episode Schizophrenia

The diagnosis of serious mental health conditions such as schizophrenia is based on the judgment of clinicians whose training takes several years, and cannot be easily formalized into objective measures. However, previous research suggests there are disturbances in aspects of the language use of patients with schizophrenia. Using metaphor-identification and sentiment-analysis algorithms to automatically generate features, we create a classifier that, with high accuracy, can predict which patients will develop (or currently suffer from) schizophrenia. To our knowledge, this study is the first to demonstrate the utility of automated metaphor identification algorithms for detection or prediction of disease.


Introduction
Schizophrenia is a severe mental disorder that has a devastating impact on those who suffer from it, as well as on their families and communities. Schizophrenia is characterized by psychotic behaviors (hallucinations, delusions, thought disorders, movement disorders), flat affect and anhedonia, and trouble with focusing and executive functioning, among other symptoms (American Psychiatric Association, 2013). It afflicts over 21 million people worldwide, and is associated with a 100-150 percent increase in early mortality (Goff et al., 2005; World Health Organization, 2016; Simeone et al., 2015). As a result, diagnosis and treatment of schizophrenia have important public health consequences. Unfortunately, practitioners who are qualified to diagnose and treat serious mental health issues such as schizophrenia are in chronically short supply, and their accumulated knowledge cannot be easily formalized into reproducible metrics (Patel et al., 2007).
However, clinical research into the symptoms and mechanisms of schizophrenia suggests that disturbances in language use, and especially in metaphor use and affect, characterize schizophrenia. This suggests that automated NLP methods may have the potential to help in diagnosis and prognosis of schizophrenia. In this paper, we work from open-ended transcripts of patients interviewed by non-specialists. We then apply NLP algorithms for metaphor identification and sentiment analysis to automatically generate features for a classifier that, with high accuracy, can predict which patients will develop schizophrenia and which patients would currently be diagnosed with schizophrenia by psychiatrists.

Background & Related Work

NLP and Computational Psychiatry
Several recent studies have shown that NLP text-analysis techniques can be successfully applied to predict mental illness. Vincze et al. (2016) use linguistic and demographic features to predict whether a speech transcript was produced by an individual with mild cognitive impairment or by a healthy control. To our knowledge, Elvevåg et al. (2007) were the first to use automated NLP methods to predict whether or not patients suffered from schizophrenia. The technical specifics of their method are unclear, as the paper was intended for a clinical audience, but they use a k-nearest neighbors algorithm in a feature space made up of n-gram features and distributional semantic features to classify 26 schizophrenia patients and 25 healthy controls. They achieve classification accuracy of 78.4% on this task. Mota et al. (2012) employ a graph-based method to classify transcripts taken from interviews with eight patients with schizophrenia, eight healthy controls, and eight manic patients, achieving both precision and recall of 0.875. Bedi et al. (2015) apply semantic coherence measures and measures based on part-of-speech tags to predict whether 34 youths at risk of psychosis would have a psychotic episode within 2.5 years of being interviewed (five of whom did transition within the study period). They correctly classify 100% of participants.

Metaphor, Affect and Schizophrenia
Mental-health clinicians have long had intuitions that schizophrenia patients differ from healthy individuals in their use of metaphor. A survey by Kuperberg (2010) of over 50 years of observations in the schizophrenia literature concludes that schizophrenia patients "may use common words in an idiosyncratic or bizarre manner." Particularly colorful (and metaphorical) examples of bizarre speech recorded by Andreasen (1986) include patients who referred to watches as "time vessels" and to gloves as "hand shoes." Billow et al. (1997) carried out the first experimental exploration of this phenomenon. They measure the metaphor production of patients with schizophrenia and healthy controls during free responses to a structured interview. They find that patients with schizophrenia produce felicitous, coherent metaphors at rates comparable to those of healthy controls, but produce deviant metaphorical speech with significantly greater frequency.
It is not clear what could account for these differences in metaphor production, but neuroscientific studies of patients with schizophrenia offer some clues. Research shows that schizophrenia is associated with dysfunction of the amygdala, a brain structure responsible for regulating emotion (Rasetti et al., 2009). Other work demonstrates impairments in emotion perception and production in patients with schizophrenia (Vaskinn et al., 2008) and even demonstrates that face emotion recognition deficits are a predictor of psychosis onset. Based on these findings, and recognizing the important role that metaphor plays in emotional language (see Kövecses, 2003), Elvevåg et al. (2011) hypothesize that metaphor production disturbances in patients with schizophrenia are deeply tied to "emotional" language (i.e., language with high affective polarity). However, it should be noted in this regard that most work on metaphor processing has focused on the cortical regions involved (Chen et al., 2008; Schmidt et al., 2010; Benedek et al., 2014).

Sentiment Analysis & Metaphor Detection Algorithms
Sentiment analysis is a natural-language processing task that involves determining, for a given text, whether the text conveys a positive or negative sentiment, and how positive or negative the sentiment is. The book by Liu (2015) gives a comprehensive overview of sentiment analysis. Metaphor detection is the task of determining whether a given word, phrase, or passage is being used metaphorically or literally. It is an emerging field in NLP, with research still in relatively early stages. A variety of different machine-learning and statistical methods have been applied to the task, including clustering (Birke and Sarkar, 2006; Shutova et al., 2010; Shutova and Sun, 2013); topic models (Bethard et al., 2009; Heintz et al., 2013); topical structure and imageability analysis (Strzalkowski et al., 2013); semantic similarity graphs; and feature-based classifiers (Gedigian et al., 2006; Li and Sporleder, 2009; Turney et al., 2011; Dunn, 2013a,b; Hovy et al., 2013; Mohler et al., 2013; Neuman et al., 2013; Tsvetkov et al., 2013, 2014; Klebanov et al.). Metaphor detection methods differ in how they define the task of metaphor detection. For instance, some algorithms seek to determine whether a phrase (such as sweet victory) is metaphorical (Krishnakumaran and Zhu, 2007; Turney et al., 2011; Tsvetkov et al., 2014; Bracewell et al., 2014; Gutiérrez et al., 2016), while others attempt to tag metaphoricity at the level of the utterance (Dunn, 2013a), or at the level of individual tokens in running text (Klebanov et al.; Schulder and Hovy, 2014; Do Dinh and Gurevych, 2016). For a recent review, see Shutova (2015). For our purposes, we decided that token-level metaphor detection offered the most appropriate level of granularity, and we chose the algorithm of Do Dinh and Gurevych (2016) because of its state-of-the-art performance at this task at the time we began this project.

First-Episode Schizophrenia Transcripts
Our main data set consists of interviews with 17 patients who have suffered a first episode of schizophrenia (denoted by 1EP+) and 15 healthy controls (denoted by 1EP-). Healthy controls were obtained from the same source population as patients with schizophrenia in the metropolitan New York City region, using web-based advertising on Craigslist, as well as by posting of flyers in and around the region. Participants engaged in open-ended interviews lasting approximately one hour, during which they were encouraged to express themselves narratively. Participants were queried on four topics, for which interviewers provided clarifying questions if they were not spontaneously discussed. The four discussion topics, as well as details of interviewer training and participant selection criteria, are discussed in more detail in the supplementary materials as well as in Ben-David et al. (2014). Independent transcribers transcribed the interviews. Participants were matched for socioeconomic characteristics and education level. The average age of the 1EP- cohort was 35, and the average age of the 1EP+ cohort was 39. However, the 1EP- cohort was 47% male, while the 1EP+ cohort was 76% male. We refer to this data set as 1EP.

Prodromal Psychosis Transcripts
We use the data set introduced by Bedi et al. (2015) of transcripts from 34 youths at clinical high risk (CHR) for psychosis, based on the Structured Interview for Prodromal Syndromes (Miller et al., 2003). Demographic details are provided in Bedi et al. (2015). There were no significant differences for age, gender, ethnicity or medication usage between CHR converters vs. CHR nonconverters. Notably, all CHR participants were ascertained using gold-standard clinical measures for which the researchers obtained excellent interrater reliability with other CHR programs in North America. Open-ended baseline interviews were collected from the participants using the same protocol as above. Participants were then assessed quarterly for 2.5 years to determine whether they had transitioned to psychosis. Five of the participants suffered a first episode of psychosis within the assessment period (denoted by CHR+); the remainder did not (denoted by CHR-).

Experiments
The review of the literature in §2.2 suggests that a constellation of disturbances in metaphor use and extremeness/lability of sentiment may characterize schizophrenia. In order to assess whether these phenomena can truly distinguish patients with schizophrenia from healthy controls or to predict future schizophrenic episodes, we produce five features. Four of these features are derived from sentiment scores produced by a sentiment analysis algorithm, and one is derived from metaphor tags produced by a metaphor identification algorithm.

Feature Set
Metaphoricity We hope to detect the alteration in metaphor production observed in patients with schizophrenia by Billow et al. (1997) using an automated metaphor detection algorithm that tags word tokens as metaphorical or not. We adapt the token-level metaphor identification algorithm of Do Dinh and Gurevych (2016) to our task. In particular, we use a multilayer perceptron (MLP) architecture with three layers. The input layer consists of the concatenation of the word embeddings for each token and the two tokens before and after (not including non-content tokens, and padded with a randomly created embedding at sentence beginnings and endings). The vector for each token is composed of the word's 300-dimensional Word2Vec skip-gram negative sampling word embedding, concatenated with a one-hot binary vector that indicates the token's part of speech. The hidden layer has ten fully connected hidden units with the hyperbolic tangent activation function. The output layer classifies a token as literal or metaphorical using the softmax activation function.
Training is accomplished by minimizing a cross-entropy objective using stochastic gradient descent; the learning rate is decremented linearly during each epoch, for a maximum of 100 epochs. As in Do Dinh and Gurevych (2016), the MLP is trained on the VU Amsterdam Metaphor Corpus (VUAMC), a subset of the BNC where each token has been annotated as metaphorical or not (Steen et al., 2010), using cross-validation with an 80%-20% train-test split to optimize the regularization and learning rate parameters.
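The input construction and forward pass described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of Do Dinh and Gurevych (2016): the embedding dimension is shrunk from 300 to 4, the part-of-speech tag set is a hypothetical three-tag set, the weights are random rather than trained, the padding vector is assumed to span the whole per-token vector, and all function names are hypothetical.

```python
import math
import random

random.seed(0)

EMB_DIM = 4                         # assumption: toy size; the paper uses 300
POS_TAGS = ["NOUN", "VERB", "ADJ"]  # assumption: hypothetical reduced tag set
WINDOW = 2                          # two tokens before and two after
HIDDEN = 10                         # ten tanh hidden units
TOKEN_DIM = EMB_DIM + len(POS_TAGS)
IN_DIM = TOKEN_DIM * (2 * WINDOW + 1)

def rand_vec(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]

# Randomly created padding vector used at sentence beginnings and endings.
PAD = rand_vec(TOKEN_DIM)

def token_vector(embedding, pos):
    """Word embedding concatenated with a one-hot part-of-speech vector."""
    one_hot = [1.0 if pos == tag else 0.0 for tag in POS_TAGS]
    return list(embedding) + one_hot

def window_input(token_vecs, i):
    """Concatenate the vectors at positions i-2 .. i+2, padding at the edges."""
    out = []
    for j in range(i - WINDOW, i + WINDOW + 1):
        out.extend(token_vecs[j] if 0 <= j < len(token_vecs) else PAD)
    return out

# Untrained, randomly initialised weights (training is not shown here).
W1 = [rand_vec(IN_DIM) for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [rand_vec(HIDDEN) for _ in range(2)]  # two classes: literal, metaphorical
b2 = [0.0, 0.0]

def forward(x):
    """tanh hidden layer followed by a softmax over the two classes."""
    h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    z = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(W2, b2)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

# A toy three-token sentence of content words with random "embeddings".
sentence = [(rand_vec(EMB_DIM), "NOUN"),
            (rand_vec(EMB_DIM), "VERB"),
            (rand_vec(EMB_DIM), "ADJ")]
vecs = [token_vector(emb, pos) for emb, pos in sentence]
probs = forward(window_input(vecs, 0))  # [P(literal), P(metaphorical)]
```

With real 300-dimensional embeddings and a full tag set, only `EMB_DIM` and `POS_TAGS` would change; the window concatenation is the same.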
We then measure the percentage of all tokens labeled metaphorical by the metaphor identification algorithm in each transcript, denoting it by Met. We present an example text tagged by this algorithm in Figure 1. Notably, the algorithm mistakenly tags the adverbially used preposition up in ended up as metaphorical; Do Dinh and Gurevych (2016) cite this as one of the common failure modes of their algorithm, along with failure to detect metaphors that are only clearly metaphorical from a large amount of surrounding context.
We ended up going to different high schools ...and then at home we also ran in different social circles and things like that.
Figure 1: Sample sentence from one of the transcripts in the 1EP data set. Tokens in bold were tagged metaphorical by the token-level metaphor detection algorithm.
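As a minimal sketch of how the Met feature could be computed from token-level tags, assuming each transcript is represented as a list of booleans (True for tokens tagged metaphorical); the function name and the toy tags below are hypothetical.

```python
def metaphor_rate(tags):
    """Percentage of tokens in one transcript tagged as metaphorical."""
    if not tags:
        return 0.0
    return 100.0 * sum(1 for t in tags if t) / len(tags)

# Toy example: 2 of 10 tokens tagged metaphorical.
toy_tags = [False, False, True, False, False,
            False, False, False, True, False]
met = metaphor_rate(toy_tags)
```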

Sentiment
We posit that the sentiment scores produced by automated sentiment analysis algorithms should be able to detect disturbances in the production of emotional language, particularly in regard to metaphor. To this end, we create four features that summarize the distribution of sentiment scores in each transcript. In order to obtain token- and phrase-level sentiment scores, we use the implementation of the Recursive Neural Tensor Network sentiment analysis algorithm (Socher et al., 2012) that is included in the Stanford CoreNLP toolkit, with default settings. This implementation comes pre-trained on the Stanford Sentiment Treebank. Tokens are tagged on an integer scale from 1 (Very Negative) to 5 (Very Positive). For each transcript, we take the percentage of all token-level sentiment scores that were either extremely negative (score of 1) or extremely positive (score of 5), which we denote by SentTok, and similarly compute the percentage of all phrase-level sentiment scores that were extreme, which we denote by SentPhr. We also compute a sentiment coherence measure over the sequence of scores s_i, where s_i denotes either the sentiment score for token i (to compute CohTok) or the sentiment score for phrase i (to compute CohPhr).
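The extreme-sentiment percentage can be sketched as below, assuming the scores for a transcript are already available as a list of integers on the 1 to 5 scale (for example, exported from the sentiment tagger). The function name and toy scores are hypothetical; the same function would apply to token-level scores (SentTok) and phrase-level scores (SentPhr). The coherence features are not sketched here.

```python
def extreme_sentiment_rate(scores):
    """Percentage of sentiment scores at either end of the 1-5 scale."""
    if not scores:
        return 0.0
    extreme = sum(1 for s in scores if s in (1, 5))
    return 100.0 * extreme / len(scores)

# Toy example: one score of 1 and one score of 5 among eight scores.
toy_scores = [3, 3, 1, 5, 2, 4, 3, 3]
sent_tok = extreme_sentiment_rate(toy_scores)
```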

Classification Algorithms
For all algorithms and data sets, we present results produced by leave-one-out cross-validation because of the small number of transcripts available. We use a radial-basis-function support-vector classifier and a convex-hull classifier to classify transcripts based on the variables above. The convex-hull classifier was previously used by Bedi et al. (2015). A test point is classified as originating from a CHR- participant if it lies within the convex hull of all the CHR- data points in the training set; otherwise, it is classified as CHR+. The intuition behind the convex-hull approach is that individuals who eventually develop psychosis do not necessarily do so following a unique path to conversion, and moreover psychosis itself cannot be considered a well-defined single condition (Binbay et al., 2012); thus it is reasonable to hypothesize that the "breakdown" of mental abilities may occur along different trajectories for individual CHR+ patients.
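The convex-hull decision rule under leave-one-out cross-validation can be illustrated with a small sketch. To stay self-contained it assumes only two features per transcript (the actual feature space has five dimensions, where a hull-membership test would typically be posed as a linear program), uses made-up toy points, and treats degenerate hulls in a simplified way. All names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(hull, p):
    """True if p lies inside or on the boundary of the hull."""
    if len(hull) < 3:
        return p in hull  # simplification for degenerate hulls
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def leave_one_out(points, labels):
    """Predict each point from the hull of the remaining CHR- points."""
    preds = []
    for i, held_out in enumerate(points):
        negatives = [p for j, (p, lab) in enumerate(zip(points, labels))
                     if j != i and lab == "CHR-"]
        hull = convex_hull(negatives)
        preds.append("CHR-" if inside_hull(hull, held_out) else "CHR+")
    return preds

# Made-up two-dimensional feature vectors, for illustration only.
toy_points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3), (6, 6), (5, -1)]
toy_labels = ["CHR-"] * 6 + ["CHR+"] * 2
toy_preds = leave_one_out(toy_points, toy_labels)
```

One property of this rule is visible even in the toy data: a held-out CHR- point that is a vertex of the full CHR- hull falls outside the hull of the remaining points, so it is labelled CHR+.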

Statistical Analysis
As predicted, we find that the metaphor identification algorithm does indeed tag a significantly higher proportion of the tokens in the transcripts of patients with schizophrenia as metaphorical (6.3%) than in the healthy controls' transcripts (5.2%; t = 3.76, p < .001). No significant differences were found between patients with schizophrenia and healthy controls on the other variables of interest. No significant difference was found between males and females in metaphor use frequency (t = 1.105, p = 0.28).
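For readers unfamiliar with the test statistic, a two-sample t statistic on per-transcript Met values can be sketched as follows. The text does not state which variant of the t-test was used, so the unequal-variance (Welch) form here is an assumption, and the numbers are made-up toy values rather than the study data.

```python
import math
import statistics

def welch_t(a, b):
    """Two-sample t statistic without assuming equal variances."""
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Made-up per-transcript metaphor percentages, for illustration only.
toy_patients = [6.0, 6.5, 6.4]
toy_controls = [5.0, 5.4, 5.2]
t_stat = welch_t(toy_patients, toy_controls)
```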

Classification Performance
First-Episode Schizophrenia Transcripts Table 1 shows the performance of classifiers that individually use each of the five features described in §4.1 as predictors, as well as the classifier that uses all five in tandem (All). Baseline represents the results of a simple majority classifier (because 18 of the 33 transcripts belonged to patients with schizophrenia, this entails classifying all transcripts as belonging to patients with schizophrenia). Because the 1EP set was not balanced for gender or age, we also present the results of classifying men as having schizophrenia and women as not having schizophrenia (Gender) as well as the results of training a classifier on age (Age). Bedi and Mota represent the classification results attained by applying the features/method of Bedi et al. (2015) and Mota et al. (2012), respectively. Using all of the features to train the support-vector classifier performed better than using any of the features individually. The accuracy of the classifier based on all the features was significantly better than baseline (Fisher's exact test, p < .005). Notably, our features outperformed the features suggested by Bedi et al. (2015) and by Mota et al. (2012).
Prodromal Psychosis Transcripts On the prodromal transcripts, a classifier trained on all the features once again outperformed classifiers trained on any of the features individually, which performed at or near baseline. Interestingly, the convex-hull classifier outperformed the support-vector classifier on this data. The convex-hull classifier trained on all five features correctly identified the outcome of 33 of the 34 CHR patients (97.1% accuracy). The sole patient who was misclassified belonged to the CHR+ group. This is comparable to the 100% accuracy of the Bedi et al. (2015) method and superior to the 79.4% accuracy of the Mota et al. (2012) method.
In order to explore the relationship between the two data sets, we also applied the best classifier trained on the 1EP data to the prodromal data. The 1EP classifier tagged 29 of the 34 CHR patients as patients with schizophrenia, including all five patients in the CHR+ group. This hypersensitivity suggests that the cues that discriminate between patients with first-episode schizophrenia and healthy controls tend to place CHR patients into the same category as patients with first-episode schizophrenia. Because every CHR+ patient was tagged as 1EP+, we believe that our method would be useful as a tool for channeling limited attention and resources toward patients at particularly high risk (above and beyond the criteria that currently flag a patient as being CHR).

Conclusion
To our knowledge, this study is the first to demonstrate the utility of automated metaphor identification algorithms in a public-health setting, and particularly for the prediction or detection of schizophrenia. On the task of schizophrenia diagnosis from transcripts, our algorithm outperforms the two existing methods detailed in the literature. Our results also contribute to clinical knowledge of the nature of language-use abnormalities in schizophrenia, as they support previous research which finds that those suffering from schizophrenia produce more metaphors in free speech than healthy controls. Previously it was only possible to measure such disturbances by labor-intensive and subjective hand-coding of transcripts for metaphoricity, or by the assessment of expert clinicians, whose time is limited. This work breaks new ground by showing that such disturbances can be measured in an automated and reproducible fashion, using features generated via machine learning.
Our work is somewhat constrained by the small sample size available to us. As our data comes from a vulnerable population, obtaining a larger data set is challenging, but essential for future work. In fact, two of the authors are in the process of collecting data from a total of 120 CHR individuals. This would enable a more thorough investigation of a larger and more sophisticated suite of linguistic features, and especially a more fine-grained analysis of the interaction of metaphor and emotional language in schizophrenia.