Predicting Psychological Health from Childhood Essays with Convolutional Neural Networks for the CLPsych 2018 Shared Task (Team UKNLP)

This paper describes the systems we developed for tasks A and B of the 2018 CLPsych shared task. The first task (task A) focuses on predicting behavioral health scores at age 11 using childhood essays. The second task (task B) asks participants to predict future psychological distress at ages 23, 33, 42, and 50 using the age 11 essays. We propose two convolutional neural network-based methods that map each task to a regression problem. Among seven teams, we ranked third on task A with a disattenuated Pearson correlation (DPC) score of 0.5587. Likewise, we ranked third on task B with an average DPC score of 0.3062.


Introduction
The Fifth Annual Workshop on Computational Linguistics and Clinical Psychology (CLPsych) includes a shared task on predicting current and future psychological health from childhood essays. The organizers provided participants with a training dataset of 9217 essays written by 11-year-olds and 4235 essays written at age 50. A further 1000 age 11 essays are provided for testing. The data is from the National Child Development Study (NCDS) (Power and Elliott, 2005), which followed the lives of 17000 people born in England, Scotland, and Wales in 1958. There are three shared tasks using this dataset: (i) Task A involves predicting behavioral health scores at age 11 using childhood essays. Specifically, participants were asked to develop methods to score the anxiety and depression levels of a child given their essay. (ii) Task B asks participants to predict future psychological distress at ages 23, 33, 42, and 50 using the age 11 essays. Ground truth training scores are provided for ages 23, 33, and 42. Participants are not given age 50 distress scores and must infer them based on scores at the previous ages. (iii) The innovation challenge involves generating essays written at age 50 given the age 11 essays.
In this paper, we summarize our submissions for the 2018 CLPsych shared tasks A and B. Section 2 describes our two submissions, models UKNLPA and UKNLPT. In Section 3, we present the official results, and we discuss future directions in Section 4.

Methods
We submitted results from two different models, UKNLPA and UKNLPT, to tasks A and B. Both use the same convolutional neural network (CNN) architecture that has been shown to work well across a wide variety of tasks (Kim, 2014; Rios and Kavuluru, 2015, 2017; Tran and Kavuluru, 2017). After a brief overview of the CNN architecture in Section 2.1, we describe the UKNLPA model in Section 2.2 and the UKNLPT model in Section 2.3.

Convolutional Neural Networks
The basic CNN architecture for both UKNLPA and UKNLPT is shown in Figure 1. The CNN contains three main components. The first component is the input layer, which takes an essay x as input and represents it as a matrix D, where each row is a word vector. The number of rows depends on the number of words in the essay. The next component transforms D into a vector. Each convolution filter maps every successive n-gram (n successive word vectors) to a real number. Applied to every successive n-gram in the essay, a filter therefore produces a vector representation (feature map) of the essay, whose length depends on the number of words in the essay. Multiple convolution filters produce multiple feature maps. To form a fixed-size vector representation of the essay, we use max-over-time pooling across each feature map. These max values are combined to form the final fixed-size vector representation of the essay. In the remainder of this paper we refer to this fixed-size vector as g(x). We refer to prior work (Kim, 2014) for more details about the architecture.
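The convolution and max-over-time pooling steps described above can be sketched in plain Python. This is only an illustrative sketch, not our implementation: the toy word vectors, the filter weights, the helper names (`conv_feature_map`, `g`, `toy_essay`), and the ReLU non-linearity are all assumptions made for the example.

```python
def conv_feature_map(D, filt, bias=0.0):
    """Slide one filter over every successive n-gram of word vectors.

    D    : list of word vectors, one row per word in the essay
    filt : list of n weight vectors, one per position in the n-gram
    Returns a feature map with len(D) - n + 1 entries.
    """
    n = len(filt)
    feature_map = []
    for start in range(len(D) - n + 1):
        total = bias
        for offset in range(n):
            total += sum(w * v for w, v in zip(filt[offset], D[start + offset]))
        feature_map.append(max(0.0, total))  # ReLU is an assumption of this sketch
    return feature_map


def g(D, filters):
    """Max-over-time pooling: keep the largest value of each feature map."""
    return [max(conv_feature_map(D, filt)) for filt in filters]


def toy_essay(num_words, dim=4):
    """Hypothetical stand-in for the embedded essay matrix D."""
    return [[0.1 * i + 0.01 * j for j in range(dim)] for i in range(num_words)]


# Two hypothetical filters: one spanning 3 words, one spanning 4 words.
toy_filters = [
    [[1.0, 0.0, 0.0, 0.0]] * 3,
    [[0.0, -1.0, 0.0, 1.0]] * 4,
]

short_vec = g(toy_essay(6), toy_filters)
long_vec = g(toy_essay(20), toy_filters)
print(len(short_vec), len(long_vec))  # both vectors have one entry per filter
```

The feature maps for the 6-word and 20-word essays differ in length, but pooling reduces each map to a single value, so the pooled vector has the same size for both essays.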

UKNLPA
For our first model, we represent each essay x as
$$h(x) = g(x) \,\Vert\, l(x) \,\Vert\, s(x),$$
where h(x) is the concatenation ($\Vert$) of the CNN feature vector g(x) with 59 Linguistic Inquiry and Word Count (LIWC) features (Pennebaker et al., 2001) l(x) and a binary feature s(x) representing the gender of the child.
Let m represent the number of psychological health scores across tasks A and B, six in total: age 11 total, age 11 anxiety, age 11 depression, and distress values at ages 23, 33, and 42. To predict these six scores, we pass h(x) through an affine output layer
$$\hat{\mathbf{y}} = W h(x) + b,$$
where $W$ and $b \in \mathbb{R}^m$ are the weight matrix and bias vector of the output layer.
Training Procedure To train our model, we use the Huber loss as our training objective. The Huber loss combines the mean squared error (MSE) loss with the mean absolute error (MAE) loss. We define the Huber loss as
$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta, \\ \delta\,|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise,} \end{cases}$$
where δ is a hyperparameter that controls the transition between MSE and MAE, y is the ground truth encoding for one of the six psychological health factors, and $\hat{y}$ is the prediction for that factor. For small errors, the Huber loss is equivalent to MSE, and a weighted MAE is used for large errors. Therefore, the Huber loss is less sensitive to outliers than MSE.
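As a minimal sketch of the loss just described, the following assumes the standard piecewise Huber definition (quadratic for errors up to δ, linear beyond it). The function name `huber` is hypothetical; the default δ = 0.1 is the value reported in our model configuration.

```python
def huber(y, y_hat, delta=0.1):
    """Standard Huber loss for one scalar target (assumed form).

    Quadratic (MSE-like) when |y - y_hat| <= delta,
    linear (weighted MAE-like) otherwise.
    """
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)


small = huber(1.0, 1.05)   # inside the quadratic region
large = huber(0.0, 3.0)    # inside the linear region
squared = 0.5 * 3.0 ** 2   # what half squared error would charge for the same miss
print(small, large, squared)
```

For the large error, the linear branch charges far less than half squared error would, which is why an outlier score pulls the fit less under this objective.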
During preliminary experiments, we tried training all outputs jointly and separately. We found that our model performs best across all psychological health factors when trained jointly, except for age 11 total. Thus, we trained two models: one with a multi-task loss optimized across all six health factors and one trained only on age 11 total. We mask the loss for missing values of a particular outcome variable. Finally, because age 50 ground truth scores were not given for training, we output the age 42 predictions directly as the scores for age 50.
Linear Model We train a ridge regression model with three sets of features: term frequency-inverse document frequency (TFIDF) weighted unigrams and bigrams, 59 LIWC features, and a binary feature representing gender.
Ensemble Our final UKNLPA model is an ensemble of multiple CNNs and the linear model. Specifically, we average the predictions of five CNNs trained on different 80/20 splits of the training dataset with the predictions from the linear model, where all models are weighted equally.
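Masking the loss for missing outcome variables can be sketched as follows. The sketch assumes that missing ground truth scores are marked with `None` and that the loss is averaged over the observed outcomes only; both choices, and the helper name `masked_mean_loss`, are illustrative assumptions rather than a description of our exact code.

```python
def masked_mean_loss(y_true, y_pred, loss_fn):
    """Average a per-outcome loss over the outcomes that were observed.

    y_true  : list of ground truth scores, with None for a missing score
    y_pred  : list of predicted scores, same length as y_true
    loss_fn : function (y, y_hat) -> float applied to each observed pair
    """
    terms = [loss_fn(t, p) for t, p in zip(y_true, y_pred) if t is not None]
    return sum(terms) / len(terms) if terms else 0.0


def squared_error(y, y_hat):
    return (y - y_hat) ** 2


# The second outcome is missing, so its (large) error does not contribute.
loss = masked_mean_loss([1.0, None, 3.0], [0.0, 5.0, 1.0], squared_error)
print(loss)
```

Because missing outcomes contribute no loss term, an essay with only some scores recorded can still be used to train the remaining outputs of the multi-task model.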
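The equal-weight averaging can be illustrated with a short sketch. The prediction values are made up, and the optional `weights` argument is a hypothetical extension that is not needed for the equal-weight case.

```python
def ensemble_average(model_predictions, weights=None):
    """Weighted average of per-model prediction vectors for one essay.

    model_predictions : list of prediction vectors, one per model
    weights           : optional list of model weights (equal if omitted)
    """
    if weights is None:
        weights = [1.0] * len(model_predictions)
    total = sum(weights)
    num_outputs = len(model_predictions[0])
    return [
        sum(w * preds[j] for w, preds in zip(weights, model_predictions)) / total
        for j in range(num_outputs)
    ]


# Made-up predictions for two outcomes from five CNNs and one linear model.
cnn_preds = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
linear_pred = [8.0, 6.0]
averaged = ensemble_average(cnn_preds + [linear_pred])
print(averaged)
```

With six equally weighted members, the linear model contributes one sixth of each final score in this sketch.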

Model Configuration and Preprocessing
We preprocess each essay by lowercasing all words. Next, we replace each newline character with a special NEWLINE token and replace all illegible words with the token ILLEGIBLE. Likewise, all words that appear fewer than five times in the training dataset are replaced with the token UNK. For tokenization, we use a simple regex (\w\w+). We train the UKNLPA model with the Adam optimizer (Kingma and Ba, 2015) using a learning rate of 0.001. We initialize the word vectors of our model with 300-dimensional pre-trained 6B GloVe embeddings¹ (Pennington et al., 2014). The CNN is trained with windows that span 3, 4, and 5 words with 300 filters per window. Hence, the final neural vector representation of each essay h(x) has 960 dimensions (900 CNN features, 59 LIWC features, and one gender feature). Our model is regularized using both dropout and L2 regularization. We apply dropout to the embedding layer and to the CNN output g(x) with a dropout probability of 0.2. The L2 regularization parameter is set to 0.001 and the Huber loss parameter δ is set to 0.1. We train for a total of 25 epochs with a mini-batch size of and checkpoint after each epoch. The best checkpoint based on a held-out validation dataset is used at test time. For the linear model, we set the L2 regularization parameter to 0.1. Finally, we note that social class was not used for UKNLPA. Preliminary experiments showed that it either did not improve or negatively impacted our validation results.
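The preprocessing steps above can be sketched with the standard library. The order of operations, the helper names, and the toy corpus are assumptions; the handling of illegible words is omitted because it depends on how they are marked in the raw data, which is not described here.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w\w+")  # tokens of at least two word characters


def tokenize(essay):
    """Lowercase, mark newlines with a NEWLINE token, then apply the regex."""
    text = essay.lower().replace("\n", " NEWLINE ")
    return TOKEN_RE.findall(text)


def build_vocab(token_lists, min_count=5):
    """Keep tokens that occur at least min_count times in the training data."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, count in counts.items() if count >= min_count}


def replace_rare(tokens, vocab):
    """Map every out-of-vocabulary token to UNK."""
    return [tok if tok in vocab else "UNK" for tok in tokens]


tokens = tokenize("I am a Vet\nI like dogs")
print(tokens)  # ['am', 'vet', 'NEWLINE', 'like', 'dogs']

# Toy corpus: 'dogs' appears five times, 'vet' only once.
corpus = [tokenize("dogs dogs dogs dogs dogs vet")]
vocab = build_vocab(corpus, min_count=5)
print(replace_rare(["dogs", "vet"], vocab))  # ['dogs', 'UNK']

# Feature count: 3 window sizes x 300 filters, plus 59 LIWC features and gender.
feature_dim = 3 * 300 + 59 + 1
```

Note that the regex drops one-character tokens such as "i" and "a", since it requires at least two word characters.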

UKNLPT
The architecture of our second model shares the CNN design introduced in Section 2.2. The final feature vector for this model, $h \in \mathbb{R}^p$, is defined as
$$h = g(x) \,\Vert\, l(x) \,\Vert\, s(x),$$
where $\Vert$ denotes concatenation, g(x) is the CNN-based feature vector composition, l(x) is a feature vector encoding LIWC scores, and s(x) is a feature vector encoding gender and social class for some input essay x. For each example, we emit two sub-outputs: one for linear regression and one for binary classification, the latter serving as a "switch" mechanism that determines whether the regression sub-output is passed to the final output. The regression output, denoted by $\hat{y} \in \mathbb{R}^m$ for m output variables, is defined as
$$\hat{y} = W_1 h + b_1,$$
where $W_1 \in \mathbb{R}^{m \times p}$ and $b_1 \in \mathbb{R}^m$ are parameters of the network. The sub-output $\bar{y} \in \mathbb{R}^m$ serving as the "switch" is defined as
$$\bar{y} = \sigma(W_2 h + b_2),$$
where $W_2 \in \mathbb{R}^{m \times p}$ and $b_2 \in \mathbb{R}^m$ are parameters of the network and σ is the sigmoid function. The final output $y \in \mathbb{R}^m$ of the network is defined as
$$y_j = \begin{cases} \hat{y}_j & \text{if } \bar{y}_j > 0.5, \\ 0 & \text{otherwise,} \end{cases}$$
for j = 1, ..., m. The idea is to recreate the distribution of the count-based scores by jointly learning to discriminate between the zero and non-zero case, the former of which occurs frequently in the ground truth. For this model, the age 50 predictions are made by averaging the age 33 and age 42 predictions.
¹ https://nlp.stanford.edu/projects/glove/
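The switch mechanism can be illustrated with a small sketch. The parameter values are made up, and the 0.5 decision threshold on the sigmoid output is an assumption of this example; the helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, h):
    """Compute W h + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def switch_output(h, W1, b1, W2, b2, threshold=0.5):
    """Pass each regression value through only if its switch is 'on'."""
    y_hat = affine(W1, b1, h)                        # regression sub-output
    y_bar = [sigmoid(z) for z in affine(W2, b2, h)]  # switch sub-output
    return [r if s > threshold else 0.0 for r, s in zip(y_hat, y_bar)]


def age50_prediction(age33, age42):
    """UKNLPT age 50 score: the mean of the age 33 and age 42 predictions."""
    return (age33 + age42) / 2.0


h = [1.0, 2.0]                     # toy feature vector (p = 2)
W1, b1 = [[1.0, 1.0], [0.5, 0.0]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0]
final = switch_output(h, W1, b1, W2, b2)
print(final)  # [3.0, 0.0]: the second output is switched off
```

In this toy case the second regression value (0.5) is suppressed because its switch probability falls below the threshold, which lets the network emit exact zeros for the frequent zero-score case.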

Model Configuration
The model is trained with a learning rate of 0.001 using the Adam optimizer (Kingma and Ba, 2015). The input text is lowercased and tokenized on contiguous spans of alphabetic characters using the same regular expression introduced in Section 2.2.
Training Procedure Training this model involves optimizing two separate loss objectives, one for each of the sub-outputs $\hat{y}$ and $\bar{y}$. Suppose the ground truth is encoded as a vector $y \in \mathbb{R}^m$, where m is the number of target variables to be predicted; then the mean squared error loss for $\hat{y}$ on a single example is defined as
$$L_{\text{reg}}(y, \hat{y}) = \frac{1}{m} \sum_{j=1}^{m} (y_j - \hat{y}_j)^2,$$
where $y_j$ and $\hat{y}_j$ denote the jth values of $y$ and $\hat{y}$, respectively. For the switch output $\bar{y}$, the example-based binary cross entropy loss is defined as
$$L_{\text{switch}}(y, \bar{y}) = -\frac{1}{m} \sum_{j=1}^{m} \left[ \gamma_j \log \bar{y}_j + (1 - \gamma_j) \log (1 - \bar{y}_j) \right],$$
where $\gamma_j = \min(y_j, 1)$. Each example-based loss is averaged over the batch dimension to obtain a mini-batch loss. The two learning objectives are optimized in alternation for each mini-batch. We checkpoint at each epoch; the epoch with the best score (based on averaging the DPC measure over the m prediction variables) on the held-out development set of 500 examples is kept for test-time predictions.
We train two separate "instances" of the aforementioned model: one to learn the a11 bsag total variable and one to learn jointly the remaining five variables, which share a similar range and distribution: a11 bsag anxiety, a11 bsag depression, a23 pdistress, a33 pdistress, and a42 pdistress. Each "instance" is an ensemble of 3 models, each trained with a different random parameter initialization and training/development set split.
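The two per-example objectives can be sketched as below, assuming count-valued (non-negative integer) targets so that min(y_j, 1) is 0 or 1. Averaging over the m outputs and the small `eps` guard inside the logarithms are assumptions of this sketch, as are the function names.

```python
import math


def mse_loss(y, y_hat):
    """Mean squared error over the m target variables of one example."""
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / len(y)


def switch_bce_loss(y, y_bar, eps=1e-12):
    """Binary cross entropy for the switch output of one example.

    The binary target for output j is gamma_j = min(y_j, 1):
    0 for a zero score and 1 for any non-zero count.
    """
    total = 0.0
    for t, s in zip(y, y_bar):
        gamma = min(t, 1)
        total -= gamma * math.log(s + eps) + (1 - gamma) * math.log(1 - s + eps)
    return total / len(y)


y = [0, 3]                      # toy count-based ground truth scores
reg = mse_loss(y, [0.0, 2.0])   # regression sub-output misses by one point
sw = switch_bce_loss(y, [0.5, 0.5])  # an uninformative switch
print(reg, sw)
```

An uninformative switch that outputs 0.5 everywhere incurs a loss of about log 2 per output, which is the baseline the classifier has to beat.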

Experiments
In this section, we compare our methods on the official test set. The competition reports two evaluation metrics: mean absolute error (MAE) and disattenuated Pearson correlation (DPC). Final rankings for task A are based on the Total DPC. The average of the age 23 to 42 distress DPC scores is used to rank participants on task B.
Besides our two submissions, UKNLPA and UKNLPT, we also report the results for three variants of UKNLPA:
• uk linear - the ridge regression model introduced in Section 2.2.
• uk cnn - an ensemble consisting of five CNNs trained on different 80/20 splits of the training dataset.
• uk ens2 - an ensemble of uk linear and uk cnn. Compared to the method described in Section 2.2, uk ens2 gives more weight to the linear model.
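For reference, the two metrics can be sketched as follows. MAE and Pearson correlation are standard; the disattenuation step assumes the classical correction for attenuation (dividing the correlation by the square root of the product of the two measurement reliabilities). The reliability values used by the organizers are not stated here, so the ones below are purely hypothetical placeholders.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient (undefined for constant inputs)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def disattenuate(r, reliability_true, reliability_pred):
    """Classical correction for attenuation (assumed form of DPC)."""
    return r / math.sqrt(reliability_true * reliability_pred)


r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
err = mae([0.0, 0.0, 2.0], [0.0, 0.0, 0.0])
# Hypothetical reliabilities, chosen only to show the direction of the correction.
dpc = disattenuate(0.4, 0.64, 1.0)
print(r, err, dpc)
```

Because the correction divides by a quantity no larger than one, a disattenuated correlation is never smaller in magnitude than the raw correlation.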

Results
The task A results are shown in Table 1. Officially, UKNLPA placed third and UKNLPT placed fourth based on the Total DPC score. We observe that no single method is best across all three categories. uk ens2 outperforms UKNLPA for the Total category. However, uk ens2 underperformed UKNLPA and uk cnn for anxiety. For both Total DPC and Depression DPC, uk linear performs comparably to uk cnn. Given that uk cnn is an ensemble, this suggests that simple linear models are strong baselines for this task. Furthermore, the best performer based on MAE does not necessarily perform best on DPC measures. For example, both UKNLPT and uk cnn achieve an MAE of 0.944 even though there is a 10% difference between their DPC depression scores. Because each of the psychological health aspects follows a zero-inflated probability distribution (many of the observed ground truth values are zero), MAE favors models that predict zero more often. DPC, in contrast, favors models whose predictions are more linearly correlated with the ground truth rather than models that predict the exact psychological scores.
Table 2 shows the official results for task B. UKNLPA ranked third, while UKNLPT ranked seventh. Similar to task A, we find that on average uk ens2 slightly outperforms UKNLPA. Furthermore, we find that no single method performs best across all ages. We observe that uk linear outperforms the CNN ensemble uk cnn on the age 42 and age 50 distress DPC metrics. However, uk cnn outperforms uk linear for ages 23 and 33. For all methods except UKNLPT, we use the age 42 predictions to predict age 50 distress because ground truth age 50 distress scores were not provided for the training dataset. Because uk linear performed better than uk cnn on age 42, it also performs best on age 50. Likewise, because uk cnn performs poorly on age 50, ensembling it with uk linear performs worse than using uk linear alone.

Conclusion
In this paper, we describe our submissions to the 2018 CLPsych shared tasks A and B. Overall, our method UKNLPA ranked third on both tasks and UKNLPT ranked fourth on task A. We identify two avenues for future work.
• The childhood essays contain certain common characteristics. For example, many essays contain illegible words and spelling mistakes. If a word is misspelled, then we may ignore it because it occurs infrequently. We therefore hypothesize that data cleaning techniques, such as using a spell checker to correct spelling issues, may improve our results.
• For both tasks A and B, we observe that no single method performs best across all psychological health categories. Therefore, it would be beneficial to use different methods for each category depending on what performs best. Furthermore, if we combine the CNN and linear models with more sophisticated ensemble approaches, we may improve our overall results.

Office of Research Integrity Review
This study has undergone ethics review by the University of Kentucky IRB and has been deemed exempt given that it does not involve human subjects; more specifically, (1) the data analyzed is de-identified, (2) we (the participants) do not have access to a code to re-identify subjects, and (3) there is no collaborator listed on our protocol who has access to identifiers.

Figure 1: The CNN model layout. We append auxiliary features to the max-pooled CNN features, then pass the result to an affine output layer. For UKNLPA, the auxiliary features are the 59 LIWC features and the gender. UKNLPT uses LIWC, gender, and social class auxiliary features.

Table 1: Official task A results. Models we submitted for the competition are marked with *. Our models that were not official submissions for the competition are marked with †.

Table 2: Official task B results. Models we submitted for the competition are marked with *. Our models that were not official submissions for the competition are marked with †.