Dilated LSTM with attention for Classification of Suicide Notes

In this paper we present a dilated LSTM with attention mechanism for document-level classification of suicide notes, last statements and depressed notes. We achieve an accuracy of 87.34% compared to competitive baselines of 80.35% (Logistic Model Tree) and 82.27% (Bi-directional LSTM with Attention). Furthermore, we provide an analysis of both the grammatical and thematic content of suicide notes, last statements and depressed notes. We find that the use of personal pronouns, cognitive processes and references to loved ones are most important. Finally, we show through visualisations of attention weights that the Dilated LSTM with attention is able to identify the same distinguishing features across documents as the linguistic analysis.


Introduction
Over recent years the use of social media platforms, such as blogging websites, has become part of everyday life, and there is increasing evidence that social media can influence both suicide-related behaviour (Luxton et al., 2012) and other mental health conditions (Lin et al., 2016). Whilst social media platforms such as Facebook are making efforts to tackle suicide and other mental health conditions online (Facebook, 2019), there are still concerns that there is not enough support and protection, especially for younger users (BBC, 2019). This has led to a notable increase in research on suicidal and depressed language usage (Coppersmith et al., 2015; Pestian et al., 2012) and subsequently triggered the development of new healthcare applications and methodologies that aid the detection of concerning posts on social media platforms (Calvo et al., 2017). More recently, there has also been an increased use of deep learning techniques for such tasks (Schoene and Dethlefs, 2018); however, there is little evidence as to which features are most relevant for accurate classification. Therefore we first analyse the most important linguistic features in suicide notes, depressed notes and last statements. Last statements have been of interest to researchers in both the legal and mental health communities because an inmate's last statement is written, similarly to a suicide note, shortly before their death (Texas Department of Criminal Justices, 2019). However, the main difference remains that, unlike in cases of suicide, inmates on death row have no choice left with regard to when, how and where they will die. Furthermore, there has been extensive analysis of the mental health of death row inmates, where depression was found to be one of the most common mental illnesses. Work in suicide note identification has also compared the different states of mind of depressed and suicidal people, because depression is often related to suicide (Mind, 2013).
Secondly, we introduce a recurrent neural network architecture that enables us to (1) model long sequences at document level and (2) visualise the words most important to accurate classification. Finally, we evaluate the results of the linguistic analysis against the neural network visualisations and demonstrate how these features align. We believe that exploring and comparing suicide notes with last statements and depressed notes, both qualitatively and quantitatively, could help us to find further differentiating factors and aid in identifying suicidal ideation.

Related Work
The analysis and classification of suicide notes, depression notes and last statements has traditionally been conducted separately. Work on suicide notes has often focused on identifying suicidal ideation online (O'dea et al., 2017) or distinguishing genuine from forged suicide notes (Coulthard et al., 2016), whilst the main purpose of analysing last statements has been to identify psychological factors or key themes (Schuck and Ward, 2008).
Suicide Notes Recent years have seen an increase in the analysis of suicidal ideation on social media platforms, such as Twitter. Shahreen et al. (2018) searched the Twitter API for specific keywords and analysed the data using both traditional machine learning techniques and neural networks, achieving an accuracy of 97.6% with the latter. Burnap et al. (2017) developed a classifier to distinguish suicide-related themes such as reports of suicides and casual references to suicide. Work by Just et al. (2017) used a dataset annotated for suicide risk by experts and a linguistic analysis tool (LIWC) to determine linguistic profiles of suicide-related Twitter posts. Other work by Pestian et al. (2010) has looked into the analysis and automatic classification of sentiment in notes, where traditional machine learning algorithms were used. Another important area of suicide note research is distinguishing forged suicide notes from genuine ones. Jones and Bennell (2007) used a supervised classification model and a set of linguistic features to distinguish genuine from forged suicide notes, achieving an accuracy of 82%.
Depression notes Work on identifying depression and other mental health conditions has become more prevalent over recent years, and a shared task was dedicated to distinguishing depression and PTSD (Post-Traumatic Stress Disorder) on Twitter using machine learning (Coppersmith et al., 2015). Morales et al. (2017) have argued that changes in cognition in people with depression can lead to different language usage, which manifests itself in the use of specific linguistic features. Research conducted by Resnik et al. (2015) also used linguistic signals to detect depression with different topic modelling techniques. Work by Rude et al. (2004) used LIWC to analyse documents written by students who had experienced depression, currently depressed students, and students who had never experienced depression, finding that individuals who had experienced depression used more first-person singular pronouns and negative emotion words. Nguyen et al. (2014) used LIWC to detect differences in language in online depression communities, where it was found, using a Lasso model (Tibshirani, 1996), that negative emotion words are good predictors of depressed text compared to control groups. Research conducted by Morales and Levitan (2016) showed that using LIWC to identify sadness and fatigue helped to accurately classify depression.
Last statements Most work on the analysis of last statements of death row inmates has used data from the Texas Department of Criminal Justice, made available on their website (Texas Department of Criminal Justices, 2019). Recent work by Foley and Kelly (2018) has primarily focused on the analysis of psychological factors, finding that themes of 'love' and 'spirituality' in particular were constant whilst requests for forgiveness declined over time. Kelly and Foley (2017) identified that mental health conditions occur often in death row inmates, one of the most common being depression. Research conducted by Heflick (2005) studied Texas last statements using qualitative methods and found that belief in an afterlife and claims of innocence are common themes in these notes. Eaton and Theuer (2009) studied qualitatively the level of apology and remorse in last statements, whilst also using logistic regression to predict the presence of apologies, achieving an accuracy of 92.7%. Lester and Gunn III (2013) used the LIWC program to analyse last statements, finding nine main themes, including affective and emotional processes. Also, Foley and Kelly (2018) found in a qualitative analysis that the most common themes in last statements were love (78%), spirituality (58%), regret (35%) and apology (35%).

Data
For our analysis and experiments we use three different datasets, which have been collected from different sources. For the experiments we use standard data preprocessing techniques and remove all identifying personal information.
Last Statements Death Row This dataset has been made available by the Texas Department of Criminal Justices (2019) and contains 545 records of prisoners who received the death penalty between 1982 and 2017 in Texas, U.S.A. A total of 431 prisoners wrote notes prior to their death. We conducted a basic analysis of the available data, hereafter referred to as LS.
Suicide Note The data for this corpus has mainly been taken from Schoene and Dethlefs (2016), but has been further extended with notes introduced by The Kernel (2013) and Tumblr (2013). There is a total of 161 suicide notes in this corpus, hereafter referred to as GSN.
Depression Notes We used the data collected by Schoene and Dethlefs (2016) of 142 notes written by people identifying themselves as depressed and lonely, hereafter referred to as DL.

Linguistic Analysis
To gain more insight into the content of the datasets, we performed a linguistic analysis to show differences in structure and contents of notes. For the purpose of this study we used the Linguistic Inquiry and Word Count software (LIWC) (Tausczik and Pennebaker, 2010), which has been developed to analyse textual data for psychological meaning in words. We report the average of all results across each dataset.
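As an illustration of the kind of surface statistics reported in this analysis (words per sentence, and the percentage of tokens falling into a category lexicon), the computation can be sketched as follows. This is a minimal sketch, not the LIWC software itself; the example note, the tiny pronoun lexicon and both helper functions are hypothetical:

```python
import re

def words_per_sentence(text):
    """Average number of words per sentence in a note."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w for s in sentences for w in s.split()]
    return len(words) / len(sentences)

def category_rate(text, lexicon):
    """Percentage of tokens that fall into a category lexicon,
    mirroring how LIWC reports category scores per document."""
    tokens = [t.lower().strip(".,!?") for t in text.split()]
    hits = sum(1 for t in tokens if t in lexicon)
    return 100.0 * hits / len(tokens)

note = "I am sorry. I love you all. Please forgive me."
pronouns = {"i", "you", "me", "we", "my", "your"}
print(round(words_per_sentence(note), 2))       # 3.33
print(round(category_rate(note, pronouns), 2))  # 40.0
```

Averaging such per-document scores across a corpus gives the figures reported in the tables below.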
Dimension Analysis Firstly, we looked at the word count and different dimensions of each dataset (see Table 1). It has previously been argued by Tausczik and Pennebaker (2010) that the words people use can give insight into a person's emotions, thoughts and motivations, where LIWC dimensions correlate with emotions as well as social relationships. The number of words per sentence is highest for DL writers and lowest for last statement writers. Research by Osgood and Walker (1959) has suggested that people in stressful situations break their communication down into shorter units. This may indicate elevated stress levels in individuals writing notes shortly before their execution. Clout stands for the social status or confidence expressed in a person's use of language (Pennebaker et al., 2014). This dimension is highest for people writing their last statements, whereas depressed people rank lowest. Cohan et al. (2018) have noted that this might be due to the fact that depressed individuals often have a lower socio-economic status. The Tone of a note refers to its emotional tone, covering both positive and negative emotions, where scores below 50 indicate a more negative emotional tone (Cohn et al., 2004). Tone is highest overall in LS and lowest in DL, indicating a more negative overall tone in DL and a positive tone in LS.
Function Words and Content Words Next, we looked at selected function words and grammatical differences, which can be split into two categories: Function Words (see Table 2), reflecting how humans communicate, and Content Words (see Table 2), demonstrating what humans say (Tausczik and Pennebaker, 2010). Previous studies have found that although function words make up only a small part of a person's vocabulary, they account for more than 50% of the words a person uses when communicating. Furthermore, it was found that there is a difference in how human brains process function and content words (Miller, 1991).
Previous research has connected function words with indicators of people's social and psychological worlds (Tausczik and Pennebaker, 2010), where it has been argued that the use of function words requires basic skills. The highest amount of function words was used in DL notes, whilst GSN and LS have similar amounts. Rude et al. (2004) found that high usage of first-person singular pronouns ("I") in particular could indicate higher emotional and/or physical pain, as the focus of the writer's attention is on themselves. Just et al. (2017) also identified a larger amount of personal pronouns in suicide-related social media content. Previous work by Hancock et al. (2007) found that people use more negations when expressing negative emotions, and fewer words overall, compared to more positive emotions. This also appears to hold here: the amount of Negations was highest in the DL corpus and lowest in the LS corpus, whilst the overall word count was lowest for DL and negative emotions were highest. Furthermore, it was found that verbs, adverbs and adjectives are often used to communicate content; however, previous studies have found (Jones and Bennell, 2007; Gregory, 1999) that individuals who commit suicide are under a higher drive and therefore reference a higher amount of objects (through nouns) rather than using descriptive language such as adjectives and adverbs.
Affect Analysis The analysis of emotions in suicide notes and last statements has often been addressed in research (Schoene and Dethlefs, 2018; Lester and Gunn III, 2013). The number of Affect words is highest in LS notes and lowest in DL notes, which could be related to the emotional Tone of a note (see Table 1). This also applies to the amount of Negative emotions, which are highest in DL notes, and Positive emotions, which are highest in LS notes.
Previous research has analysed the amount of Anger and Sadness in GSN and DL notes and has shown that both are more prevalent in DL notes, as these are typical feelings expressed when people suffer from depression (Schoene and Dethlefs, 2016).
The term Cognitive processes encompasses a number of different aspects; we found the highest amount of cognitive processes in DL notes and the lowest in LS notes. Boals and Klein (2005) found that people use cognitive mechanisms to cope with traumatic events such as break-ups, using more causal words to organise and explain events and thoughts for themselves. Arguably this explains the lower amount in LS notes, as LS writers often have a long time to organise their thoughts, events and feelings whilst waiting for their sentence to be carried out (Death Penalty Information Centre, 2019). Insight encompasses words such as think or consider, whilst Cause encompasses words that express reasoning or causation of events, e.g. because or hence. These terms were previously coined cognitive process words by Gregory (1999), who argued that such words are used less in GSN notes because the writer has already finished the decision-making process, whilst other types of discourse would still try to justify and reason over events and choices. This can also be found in the analysis of our own data, where both GSN and LS notes show similar but lower frequencies of terms in those two categories compared to DL writers. Tentativeness refers to language use that indicates a person is uncertain about a topic and uses a number of filler words. A person who uses more tentative words may not have expressed an event to another person, and therefore has not yet processed the event and formed it into a story (Tausczik and Pennebaker, 2010). The amount of tentative words used is highest in DL notes and lowest in LS notes.
This might be due to the fact that LS writers already had to reiterate over certain events multiple times as they go through the process of prosecution.
Personal Concerns Personal Concerns refers to the topics most commonly brought up in the different notes, where we note that both Money and Work are most often referred to in GSN notes and least often in LS notes. This might be due to the fact that Mind (2013) lists these two topics as common causes of suicidal feelings.
Table 7 shows that the focus of LS letters is primarily on the past, whilst GSN and DL letters focus on the present. The focus on the past in DL as well as GSN notes could be because these notes draw on past experiences to express the issues of the writer's current situation or problems. The most frequent use of the future tense is in LS letters, which could be due to LS writers' common focus on the afterlife (Heflick, 2005). Overall, it was noted that for most analyses GSN falls between the two extremes of LS and DL.

Learning Model
The primary model is the long short-term memory (LSTM) network, given its suitability for language and time-series data (Hochreiter and Schmidhuber, 1997). We feed into the LSTM an input sequence x = (x_1, ..., x_N) of words in a document alongside a label y ∈ Y denoting the class from any of the three datasets. The LSTM learns to map inputs x to outputs y via a hidden representation h_t, which can be found recursively from an activation function:

h_t = f(h_{t-1}, x_t),

where t denotes a time step. During training, we minimise a loss function, in our case the categorical cross-entropy:

L = − Σ_{c ∈ Y} y_c log ŷ_c.

LSTMs manage their weight updates through a number of gates that determine the amount of information that should be retained and forgotten at each time step. In particular, we distinguish an 'input gate' i that decides how much new information to add at each time step, a 'forget gate' f that decides what information not to retain, and an 'output gate' o determining the output. More formally, and following the definition by Graves (2013), this leads us to update our hidden state h as follows (where σ refers to the logistic sigmoid function, ⊙ to element-wise multiplication, and c is the 'cell state'):

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
h_t = o_t ⊙ tanh(c_t)

A standard LSTM solves some of the problems that vanilla RNNs have (Hochreiter and Schmidhuber, 1997), but it still has shortcomings when learning long dependencies. One of them is due to the cell state of an LSTM: the cell state is changed by adding some function of the inputs. When we backpropagate and take the derivative of c_t with respect to c_{t-1}, the added term disappears and less information travels through the layers of the model. For our implementation of a dilated LSTM, we follow the implementation of recurrent skip connections with exponentially increasing dilations in a multi-layered learning model by Chang et al. (2017). This allows LSTMs to better learn input sequences and their dependencies, so that temporal and complex data dependencies are learned at different layers. Whilst the dilated LSTM alleviates the problem of learning long sequences, it does not contribute to identifying which words in a sequence are more important than others. Therefore we extend this network with (1) an embedding layer and (2) an attention mechanism to further improve the network's ability. A graph illustration of our learning model can be seen in Figure 2.
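For concreteness, a single time step of the gated update described above can be sketched in plain NumPy. The stacked weight layout (W, U, b holding all four gates row-wise) and the toy dimensions are implementation choices of this sketch, not something specified in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b, H):
    """One LSTM time step; W, U, b stack the parameters of the
    input gate, forget gate, cell candidate and output gate row-wise."""
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])        # input gate: how much new information to add
    f = sigmoid(z[H:2*H])      # forget gate: what not to retain
    g = np.tanh(z[2*H:3*H])    # candidate cell state
    o = sigmoid(z[3*H:4*H])    # output gate
    c = f * c_prev + i * g     # additive cell-state update
    h = o * np.tanh(c)         # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 5, 4                    # toy input and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(3, D)):   # a toy 3-step input sequence
    h, c = lstm_step(x, h, c, W, U, b, H)
print(h.shape)  # (4,)
```

The additive form of the cell-state update is exactly what makes gradients of c_t with respect to earlier cell states well behaved over short spans, and what the dilation mechanism below extends to longer spans.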
Dilated LSTM with Attention Each document D contains sentences S_i, where w_it represents the words in sentence i. Firstly, we embed the words into vectors through an embedding matrix W_e, which is then used as input to the dilated LSTM.
The most important part of the dilated LSTM is the dilated recurrent skip connection, where c_t^{(l)} is the cell in layer l at time t, s^{(l)} is the skip length (or dilation) of layer l, x_t^{(l)} is the input to layer l at time t, and f(·) denotes an LSTM cell:

c_t^{(l)} = f(x_t^{(l)}, c_{t-s^{(l)}}^{(l)}),

with the dilation s^{(l)} increasing exponentially across the L layers of the model. The dilated LSTM alleviates the problem of learning long sequences; however, not every word in a sequence has the same meaning or importance.
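The effect of the skip connection can be illustrated with a deliberately simplified recurrence (plain addition in place of an LSTM cell; both the function and the values are hypothetical): with dilation s, the sequence decomposes into s independent chains, which is what lets information and gradients travel further back in time.

```python
def dilated_recurrence(xs, s, step, init):
    """Run a recurrence where the state at time t is computed from the
    state at t - s (the dilated skip connection) instead of t - 1."""
    states = [init] * s                    # padding for the first s steps
    for t, x in enumerate(xs):
        states.append(step(x, states[t]))  # states[t] holds the state at t - s
    return states[s:]

# With step = addition and dilation 2, even and odd positions form
# two independent running sums.
out = dilated_recurrence([1, 2, 3, 4, 5], s=2, step=lambda x, h: x + h, init=0)
print(out)  # [1, 2, 4, 6, 9]
```

Stacking such layers with exponentially increasing s, as in Chang et al. (2017), lets different layers cover dependencies at different temporal scales.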
Attention layer The attention mechanism was first introduced by Bahdanau et al. (2015), but has since been used in a number of different tasks including machine translation (Luong et al., 2015), sentence pairs detection (Yin et al., 2016) , neural image captioning (Xu et al., 2015) and action recognition (Sharma et al., 2015).
Our implementation of the attention mechanism is inspired by Yang et al. (2016), using attention to find the words that are most important to the meaning of a sentence at document level. We use the output of the dilated LSTM as direct input into the attention layer, where O denotes the output of the final layer L of the dilated LSTM at time t+1.
The attention for each word w in a sentence s is computed as follows, where u_it is the hidden representation of the dilated LSTM output O_it, α_it represents the normalised alpha weights measuring the importance of each word, u_w is a jointly learned word-level context vector, and S_i is the sentence vector:

u_it = tanh(W_w O_it + b_w)
α_it = exp(u_it^T u_w) / Σ_t exp(u_it^T u_w)
S_i = Σ_t α_it O_it
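A minimal NumPy sketch of this attention computation over a sequence of recurrent outputs; the toy dimensions and the random context vector u_w are illustrative (in the model, W_w, b_w and u_w are learned jointly with the network):

```python
import numpy as np

def word_attention(O, W_w, b_w, u_w):
    """Attention over recurrent outputs O (T x d):
    u_t = tanh(W_w o_t + b_w), alpha = softmax(u_t . u_w),
    sentence vector s = sum_t alpha_t * o_t."""
    U = np.tanh(O @ W_w + b_w)            # (T, a) hidden word representations
    scores = U @ u_w                      # (T,) similarity to context vector
    alpha = np.exp(scores - scores.max()) # numerically stable softmax
    alpha /= alpha.sum()                  # normalised attention weights
    s = alpha @ O                         # attention-weighted sentence vector
    return alpha, s

rng = np.random.default_rng(1)
T, d, a = 6, 8, 4                         # toy sequence length and sizes
O = rng.normal(size=(T, d))
alpha, s = word_attention(O, rng.normal(size=(d, a)), np.zeros(a), rng.normal(size=a))
print(round(float(alpha.sum()), 6))  # 1.0
```

The normalised weights α are what the visualisations in the evaluation section highlight: words with larger α contribute more to the sentence vector and hence to the classification.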

Experiments and Results
For our experiments we use all three datasets; Table 8 shows the results of the experiment series. We establish three performance baselines on the datasets using three algorithms previously applied to similar datasets. Firstly, we use ZeroR and LMT (Logistic Model Tree), previously used by Schoene and Dethlefs (2016).
Additionally, we benchmark our algorithm against the Bi-directional LSTM with attention originally proposed by Yang et al. (2016), which has also been used on similar datasets before (Schoene and Dethlefs, 2018).

Evaluation
In order to evaluate the DLSTM with attention, we look in more detail at the predicted labels and visualise examples of each note type to show which features are assigned the highest attention weights.

Label Evaluation
In Figure 2 we show the confusion matrix for the DLSTM with attention. It can be seen that LS notes are most often correctly predicted and DL notes are least likely to be predicted accurately. The same applies to the results of the main competing model (Bi-directional LSTM with attention); Figure 3 shows that this model still misclassifies LS notes as DL notes.
The most important words highlighted in a last statement note (see Figure 4) are personal pronouns as well as an apology and expressions of love towards friends and family members. This corresponds with the higher amount of personal pronouns, positive emotions and references to family in LS notes compared to GSN and DL notes. Furthermore, it can be seen that there is a low amount of cognitive process words and more action verbs such as killing or hurt, which could confirm that inmates have had more time to process events and thoughts and no longer need cognitive words as a coping mechanism (Boals and Klein, 2005). Figure 5 shows a GSN note, where the most important words are also pronouns, references to family, requests for forgiveness and endearments. Previous research has shown that forgiveness is an important feature, and that instructions such as help or phrases like do not follow are key to accurately classifying suicide notes (Pestian et al., 2010). Terms of endearment for loved ones often appear at the start or towards the end of a note (Gregory, 1999). The DL note in Figure 6 shows a greater amount of cognitive process verbs, such as feeling or know, as well as negations, which confirms the previous analysis using LIWC. Figure 7 shows a visualisation of an LS note. In this instance the word God was replaced with up; when looking into the usage of the word up in other LS notes, we found that it was commonly used in reference to religious topics such as God, heaven or up there.
Whilst there is still consistency in highlighting personal pronouns (e.g. you), it can be seen that the end of the note is missing and action verbs such as hurt or take become more important. The visualisation in Figure 9 demonstrates how the personal pronoun I has been removed from several DL notes; DL notes are the least likely to be predicted accurately, as shown in Figure 2.
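The per-class reading of the confusion matrices above can be reproduced with a small helper; the toy labels below are illustrative only, not the paper's actual predictions:

```python
import numpy as np

def per_class_recall(y_true, y_pred, labels):
    """Confusion matrix and per-class recall, the quantities used above
    to judge which note type is predicted most reliably."""
    idx = {lab: k for k, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[idx[t], idx[p]] += 1         # rows: true class, columns: predicted
    recall = cm.diagonal() / cm.sum(axis=1)
    return cm, recall

# Hypothetical predictions over the three note types.
y_true = ["LS", "LS", "GSN", "GSN", "DL", "DL"]
y_pred = ["LS", "LS", "GSN", "DL", "DL", "LS"]
cm, recall = per_class_recall(y_true, y_pred, ["LS", "GSN", "DL"])
print(list(recall))  # [1.0, 0.5, 0.5]
```

Reading down a column of cm also shows which classes a given label absorbs, e.g. how many true DL notes end up predicted as LS.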

Conclusion
In this paper we have presented a new learning model for classifying long sequences. We have shown that the model outperforms the baseline by 6.99% and a competitor model by 5.07%. Furthermore, we have provided an analysis of the linguistic features of three datasets, which we later compared in a qualitative evaluation by visualising the attention weights on examples from each dataset. We have shown that the neural network pays attention to similar linguistic features as provided by LIWC and found in human-evaluated related research.