Current and Future Psychological Health Prediction using Language and Socio-Demographics of Children for the CLPsych 2018 Shared Task

This article is a system description and report on the submission of a team from the University of Pennsylvania to the CLPsych 2018 shared task. The goal of the shared task was to use childhood language as a marker for both current and future psychological health over individual lifetimes. Our system employs multiple textual features derived from the essays that individuals wrote at the age of 11, together with their socio-demographic variables at that age. We considered several word clustering approaches and explored the use of linear regression based on different feature sets. Our approach showed the best results for predicting distress at the age of 42 and for predicting current anxiety on Disattenuated Pearson Correlation, and ranked fourth in the future health prediction task. In addition to the subtasks presented, we attempted to provide insight into mental health aspects at different ages. Our findings indicate that misspellings, words with illegible letters and increased use of personal pronouns are correlated with poor mental health at age 11, while descriptions about future physical activity, family and friends are correlated with good mental health.


Introduction
Studying early markers of well-being is a significant emerging frontier in child development research, which examines the strengths, assets and abilities that establish a positive developmental trajectory for children (Masten and Coatsworth, 1998). Humans are affected by experiences early in their childhood in ways that shape their life course. Language can be very useful in predicting well-being in the short term (Schwartz et al., 2013b). Predicting the long-term future from language, however, is largely unexplored by the NLP community, and could aid a variety of applications aimed at understanding early life markers and developing preventative care. The CLPsych 2018 shared task explores the ability of language to predict a person's long-term well-being. The competition uses a corpus of individuals who were surveyed at various points in their lives since birth to monitor their health and socioeconomic status. At age 11, the participants wrote short essays on where they saw themselves at age 25, fourteen years in the future; these essays are used to predict aspects of their mental health, measured by depression syndrome, anxiety syndrome, and the total Bristol Social Adjustment Guide (BSAG) score (Stott and Sykes, 1963). The two subtasks are to predict these aspects of a) current mental health at age 11 (Task A), and b) future mental health at ages 23, 33, and 42 (Task B). Additional non-linguistic variables, including gender and childhood parental social class, were also provided.
For our participation in this shared task, we treat the task as a regression problem using a standard regularised linear regression algorithm (i.e. Ridge Regression). We use a wide range of automatically derived textual features (based on word clustering and other pre-trained models) to obtain different representations of the language used by individuals. Our regression model returns a continuous score for each aspect of mental health for each individual. The results are measured on Disattenuated Pearson Correlation (shown as r_disatt in the results of our paper) between the predictions and the actual survey outcomes. This metric is similar to a Pearson correlation, but it accounts for measurement error and thus yields values with larger variance. The measurement error (accounted for by its inverse, reliability) is taken from the literature on the reliability of the psychological distress questionnaires (0.77; Ploubidis et al., 2017) and of similar language-based predictions (0.70; Park et al., 2014). The metric is thus the Pearson correlation divided by the square root of the product of the two reliabilities: $r_{disatt} = r_{Pearson} / \sqrt{0.77 \times 0.70}$.
Table caption: Descriptive statistics of sociodemographics at age 11 for the individuals in training and test datasets.
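As a minimal sketch of the metric described above, the following assumes the standard correction for attenuation, in which the Pearson correlation is divided by the square root of the product of the two reliabilities (0.77 and 0.70, as quoted in the text). The function names and default parameter names are illustrative and are not taken from the shared task's evaluation script.

```python
import math


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def disattenuated_r(predictions, outcomes,
                    rel_outcome=0.77, rel_prediction=0.70):
    """Pearson r corrected for the unreliability of both measurements.

    The default reliabilities are the two values quoted in the text;
    the correction itself is the standard attenuation formula (assumed).
    """
    r = pearson_r(predictions, outcomes)
    return r / math.sqrt(rel_outcome * rel_prediction)
```

Because the denominator is smaller than one, the corrected value is always larger in magnitude than the raw correlation, and it can exceed 1 when the raw correlation is very high.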
In addition to the shared task we also looked at characterizing language for each mental health indicator using both open and closed vocabulary approaches.

System Overview
In our approach, we aggregate the word counts over all of an individual's text, irrespective of word order (a bag-of-words approach). Each individual in the dataset is thus represented by a distribution over words. We then use automatically derived groups of co-occurring words (or 'topics') to obtain a lower-dimensional distribution for each individual. These topics, built using automatic clustering methods from separate large datasets, capture a set of semantic and syntactic relationships (e.g. words reflecting depression, pronouns, etc.). In addition, we use the sociodemographics of each individual.
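The representation described above can be sketched roughly as follows. This is an illustration under simplifying assumptions: whitespace tokenization stands in for the tokenizer actually used, topic usage is taken to be a weighted sum of relative word frequencies, and the toy topics and weights are hypothetical rather than drawn from the clusters used in the paper.

```python
from collections import Counter


def word_distribution(texts):
    """Relative frequency of each word over all of one individual's texts."""
    counts = Counter(
        word for text in texts for word in text.lower().split()
    )
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def topic_distribution(word_dist, topics):
    """Project a word distribution onto a smaller set of topics.

    `topics` maps a topic name to {word: weight}. The usage of a topic is
    the weighted sum of the relative frequencies of its words (assumed).
    """
    return {
        name: sum(w * word_dist.get(word, 0.0) for word, w in words.items())
        for name, words in topics.items()
    }


# Hypothetical toy topics, for illustration only.
toy_topics = {
    "family": {"wife": 1.0, "children": 1.0},
    "sport": {"football": 1.0},
}
dist = word_distribution(["I will play football", "my wife and children"])
usage = topic_distribution(dist, toy_topics)
```

With hard clusters every weight can be set to 1 (or to the word's importance score), whereas soft clusters such as LDA topics would carry a per-word probability as the weight.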

Data
This study has undergone IRB ethics review at the University of Pennsylvania and has been deemed exempt. The shared task uses data from the National Child Development Study (Davie et al., 1972), which is a British birth cohort study following an initial 17,416 babies born in Britain in one week in March 1958. The study was augmented in subsequent childhood sweeps by immigrants to Great Britain born in the study's target week, bringing the total NCDS sample to 18,558. Surviving members of this birth cohort have been surveyed on eight further occasions in order to monitor their changing health, education, social and economic circumstances, of which the data for ages 11, 23, 33 and 42 are shared in this task.
When the children of the NCDS were eleven years old in 1969, they were asked to write an essay about what they thought their life would be like at age 25. 10,511 essays were then restored and transcribed from historic records (see (Davie et al., 1972) for details of the transcription process). The statistics of both the training and test datasets shared, which exclude any essays that contained fewer than 50 words, are presented in Table 1. The descriptive statistics of the mental health outcomes for the training dataset are presented in Table 2. The inter-correlations between mental health aspects at multiple ages are shown in Table 3.

Features and Methods
We briefly summarize the features used in our prediction task. The entire pipeline of feature extraction, out-of-sample prediction (for the shared task) and language insights used the Differential Language Analysis ToolKit (DLATK) Python package.

Features
Unigram Features (unigrams) We use unigrams as features in order to capture a broad range of textual information. First, we tokenized the essays into unigrams using a modified version of Chris Potts' HappyFunTokenizer (Manning et al., 2014), which captures social media content such as emoticons and hashtags. We use the unigrams used by at least 1% of individuals in the training set, resulting in 1,147 features (out of 55,486 features).
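The 1% threshold described above amounts to a document-frequency filter over individuals. The sketch below assumes the essays are already tokenized (it does not reproduce HappyFunTokenizer), and the function name and `min_fraction` parameter are illustrative.

```python
def frequent_unigrams(tokenized_essays, min_fraction=0.01):
    """Return the unigrams used by at least `min_fraction` of individuals.

    `tokenized_essays` holds one list of tokens per individual.
    """
    n_individuals = len(tokenized_essays)
    doc_freq = {}
    for tokens in tokenized_essays:
        # Count each unigram once per individual, however often it recurs.
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return sorted(
        token for token, df in doc_freq.items()
        if df / n_individuals >= min_fraction
    )


# Toy data with a 50% threshold so the effect is visible on three essays.
essays = [
    ["i", "will", "be"],
    ["i", "will", "play"],
    ["i", "like", "football"],
]
kept = frequent_unigrams(essays, min_fraction=0.5)
```

Filtering on the fraction of individuals rather than on raw counts keeps a word that many children use once and drops a word that a single child repeats many times.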
UnigramMeta After extracting unigrams, we calculate two meta features for each individual: a) average length of unigrams, and b) number of unigrams per essay. These features were shown to predict depression in social media users (Guntuku et al., 2017c).
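The two meta features can be computed directly from a tokenized essay. This is a small sketch with illustrative feature names; it assumes the length of a unigram is measured in characters, which the text does not state explicitly.

```python
def unigram_meta(tokens):
    """Compute the two meta features for one individual's tokenized essay."""
    n_tokens = len(tokens)
    # Average unigram length in characters (assumed unit); 0.0 if empty.
    avg_length = sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0
    return {"avg_unigram_length": avg_length, "num_unigrams": n_tokens}


features = unigram_meta(["i", "will", "play"])
```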
Word2Vec Word Clusters (W2V) Neural methods have recently gained popularity for learning low-rank word embeddings and have obtained state-of-the-art results on a number of semantic tasks (Mikolov et al., 2013b). These methods, like many recent word embeddings, capture local context order rather than just 'bag-of-words' relatedness, and therefore also capture syntactic information. We use the skip-gram model with negative sampling (Mikolov et al., 2013a) to learn word embeddings from a corpus of 400 million tweets also used in (Lampos et al., 2014). We use a hidden layer size of 50 with the Gensim implementation. We then apply spectral clustering over a word-word similarity matrix generated by Word2Vec to obtain 200 hard clusters of words, i.e. one word can belong to only one cluster. The importance score associated with every word represents how central the word is in its cluster. These features were shown to predict income and personality of users on social media (Lampos et al., 2014; Guntuku et al., 2017a). These clusters are available online.
LDA Word Clusters (LDA) A different type of clustering is obtained by using topic models, the most popular of which is Latent Dirichlet Allocation (Blei et al., 2003). LDA models each post as a mixture of different topics, each topic representing a distribution over words, thus obtaining soft clusters of words. We use the 2000 clusters introduced in (Schwartz et al., 2013a), which were computed over a large dataset of posts from 70,000 Facebook users. These features were shown to predict multiple user traits on social media, such as depression, personality (Schwartz et al., 2013a), and other demographic and psychological traits (Jaidka et al., 2018). These clusters are available online.
Linguistic Inquiry and Word Count (LIWC) LIWC (Pennebaker et al., 2007) is a dictionary comprising 64 different categories (e.g., topical categories, emotions, parts-of-speech) which are manually constructed based on psychological theory. We use LIWC to represent the language of each individual as normalized frequency distributions of these categories, by counting the words associated with each category for each user and normalizing them based on the total number of words that the user wrote. These features were shown to predict user traits across multiple modalities such as essays, social media and blogs (Boyd and Pennebaker, 2017). LIWC has also been used to understand the relationship between a person's social media activities and real-life behaviors, such as substance use (Ding et al., 2017).
NRC Emotion Lexicon (NRCEmot) The NRC Emotion Lexicon (Mohammad and Turney, 2013) is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were done manually through crowdsourcing. We use the NRC Lexicon to represent the language of each individual as normalized frequency distributions of these emotions.
Personality We used automatic text-regression methods (Schwartz et al., 2013a)
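The dictionary-based features above (LIWC and NRCEmot) share the same computation: count the words that fall into each category and normalize by the total number of words the individual wrote. The sketch below illustrates this with a hypothetical toy lexicon; the entries are not taken from LIWC or the NRC lexicon, and wildcard stems of the kind real dictionaries may contain are not handled.

```python
def category_frequencies(tokens, lexicon):
    """Normalized frequency of each lexicon category in one token list.

    `lexicon` maps a category name to the set of words belonging to it.
    A word may belong to several categories and is counted in each.
    """
    total = len(tokens)
    if total == 0:
        return {category: 0.0 for category in lexicon}
    return {
        category: sum(1 for t in tokens if t in words) / total
        for category, words in lexicon.items()
    }


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {
    "i": {"i", "me", "my"},
    "posemo": {"happy", "good"},
    "family": {"family", "wife", "children"},
}
tokens = ["i", "am", "happy", "with", "my", "family"]
freqs = category_frequencies(tokens, toy_lexicon)
```

Normalizing by essay length matters here because, as reported later in the paper, essay length is itself correlated with the outcomes.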

Methods
Task A and B We stratified individuals into five folds. In this five-fold cross-validation setting, we tried linear regression with ridge regularization. We used the implementation from Scikit-Learn (Pedregosa et al., 2011) which uses Stochastic Gradient Descent for inference. Parameter tuning plays a vital role in the performance of regression algorithms. We measure Pearson correlation on our training set using five-fold cross-validation and optimize parameters using grid search for each feature set individually. The performance was measured by calculating the Disattenuated Pearson Correlation (r_disatt) and Mean Absolute Error (MAE) over the aggregated predictions from the five folds.
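The evaluation loop described above (ridge regression, five folds, grid search over the regularization strength, correlation computed on pooled out-of-fold predictions) can be sketched in plain Python as follows. This is not the authors' pipeline: it swaps in a closed-form ridge solution instead of the Scikit-Learn solver, and it assigns folds by index rather than by stratification. All function names are illustrative.

```python
import math


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def ridge_fit(X, y, alpha):
    """Closed-form ridge on centered data; the intercept is not penalized."""
    n, d = len(X), len(X[0])
    mean_x = [sum(row[j] for row in X) / n for j in range(d)]
    mean_y = sum(y) / n
    Xc = [[row[j] - mean_x[j] for j in range(d)] for row in X]
    yc = [v - mean_y for v in y]
    gram = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve(gram, xty)
    b = mean_y - sum(wi * mi for wi, mi in zip(w, mean_x))
    return w, b


def cv_predictions(X, y, alpha, k=5):
    """Out-of-fold predictions pooled over k folds (fold = index mod k)."""
    preds = [0.0] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        w, b = ridge_fit([X[i] for i in train], [y[i] for i in train], alpha)
        for i in range(fold, len(y), k):
            preds[i] = b + sum(wj * xj for wj, xj in zip(w, X[i]))
    return preds


def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))


def grid_search(X, y, alphas, k=5):
    """Pick the alpha whose pooled out-of-fold predictions correlate best."""
    scores = {a: pearson_r(cv_predictions(X, y, a, k), y) for a in alphas}
    return max(scores, key=scores.get), scores
```

Pooling the predictions before computing the correlation, rather than averaging five per-fold correlations, matches the description of measuring performance "over the aggregated predictions from the five folds".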
Language Insights In addition to Task A and B, we also tried to identify language that characterizes each of the mental health outcomes using both an open and a closed vocabulary approach. For the open vocabulary approach we used Differential Language Analysis (DLA) (Schwartz et al., 2013a). Here we individually correlate the unigram features against each of our outcomes (age 11 anxiety, depression and BSAG score, age 23 distress, age 33 distress and age 42 distress) via ordinary least squares regression. We only considered unigrams used by at least 0.1% of users (5,457 total features). For the closed vocabulary approach we used LIWC categories and applied the same analysis (univariate correlations via ordinary least squares regression). In both approaches we added gender as a covariate in the regression model, but this produced few (or zero) significant (p < 0.05) results for the distress outcome at various ages. We also applied a Benjamini-Hochberg correction (Benjamini and Hochberg, 1995) to the significance threshold in order to compensate for multiple comparisons.
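Because thousands of features are each tested against an outcome, the multiple-comparison step matters. The following is a sketch of the standard Benjamini-Hochberg step-up procedure applied to a list of p-values; it is not the DLATK implementation, and the p-values in the example are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it is significant under BH.

    Sort the m p-values, find the largest rank k with p_(k) <= (k / m) * alpha,
    and reject every hypothesis whose rank is at most k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Made-up p-values for four hypothetical features.
flags = benjamini_hochberg([0.001, 0.03, 0.035, 0.9])
```

Note the step-up behaviour in the example: the second p-value (0.03) misses its own threshold of 0.025 but is still retained because the third one (0.035) passes its threshold of 0.0375.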

Results and Discussion
Task A The results of our methods at predicting current mental health in a cross-validation setting are presented in Table 4. For total BSAG score, unigrams show the best performance, followed by LDA clusters, LIWC and Word2Vec clusters. It is interesting that both LDA and Word2Vec clusters perform well, even though they were trained on datasets from a different modality than essays (i.e. social media). UnigramMeta and SocioDemographic features rank next in performance, which is interesting considering they are a very low-dimensional representation. For Depression, the performance of different features is relatively similar, with the exception that Word2Vec clusters have marginally better performance than LIWC. Predicting Anxiety yields the lowest performance of all three aspects of mental health, with minor changes in the rank order of different features.
NRCEmot and language-predicted Personality features do not perform well, specifically for predicting Anxiety, possibly because of differences in both the modality on which and the time at which these features were built compared to the essays being analyzed. NRCEmot was primarily developed for identifying emotion-related words on Twitter. The large difference between the language of Twitter and the essays written by the children in this sample would have led to poor generalisation of NRCEmot. The Personality model was also built on another social media platform, Facebook; considering the time period in which the model was built and that in which the essays were written, drift in language (Biber and Finegan, 1989; Jaidka et al., 2018; Wijaya and Yeniterzi, 2011), apart from modality differences, would have led to poor generalization of the feature space.
At the time of submission, we did not evaluate the performance of unigram features, and submitted the predictions from LDA topics for total BSAG score and Depression, and predictions from Word2Vec clusters for Anxiety on the test set.
Task B The results of our methods at predicting future mental health in a cross-validation setting are presented in Table 5. Predicting future distress is a much tougher task than predicting current mental health aspects, as the performance metrics also show. Surprisingly, SocioDemographics outperform all language features in the prediction of future distress. Socio-economic status is known to affect health over an individual's life course, as suggested by prior research (Smith, 2007), and in this cohort it is seen to outperform the language of essays that children wrote about their impression of their future self.
Among language features, performance at predicting distress worsens as the time increases between when the child wrote the essay and the age for which the prediction is made (i.e. r_disatt at Age 23 > r_disatt at Age 33 > r_disatt at Age 42). For predicting distress at Age 23 and 42, unigrams rank best, followed by LDA and Word2Vec clusters. For Age 33, LDA clusters outperform unigrams and W2V. It should also be noted that the mental health aspects at age 11 are not strongly correlated with the mental health aspects at age 23 and 33 (Table 3), which potentially indicates that the linguistic characteristics of the essays that the children wrote at age 11 might not be able to accurately reflect their future mental health.
Considering the complexity of the task involved, it can be hypothesized that the relationship between the language features and the outcomes is non-linear, potentially consisting of multiple latent variables. Using stacked autoencoders to capture the non-linearity in the task could potentially improve the modeling performance (Guntuku et al., 2016b). Further, simpler text selection/categorization techniques, like representing all misspelled words, words not in a dictionary, or punctuation by a single category, might be worth exploring, thereby reducing the feature space to dimensions which contribute to the modeling task (Preoţiuc-Pietro et al., 2017).
Table 6 shows the intercorrelations between meta-language features, mental health and socio-demographics. Here we see that higher social classes are correlated (significantly, though with a low effect size) with increased word usage and increased word length (Ling, 2005). All age 11 mental health measures are negatively correlated with word length and word totals. Males have higher depression and BSAG at age 11, while females have higher distress at age 23, 33 and 42.
Figure 1 shows the results of our open vocabulary approach (DLA). Here color represents the word's frequency in the corpus (darker for more frequent) and size represents correlation strength. Misspelled words like 'will', 'wen', 'marid', 'mared', 'old' are associated with bad psychological health at age 11, while words like 'house', 'saturday', 'friends', 'playing' are associated with the language of those with good psychological health (Ginsburg et al., 2007). Language of individuals with bad psychological health at age 11 is also associated with words containing letters which were illegible to transcribe (as indicated by * ), and several spelling errors ('marid', 'mared', 'houes', 'gow') which are not found in the language of mentally healthier children (Crum et al., 1993).
It is interesting that the words 'and' and 'will' seem like low-hanging fruit for validating this approach.

Language Insights
Distress at ages 23 and 33 is positively correlated with words about daily activities of life ('shopping', 'hairdresser', 'sewing', 'school'), whereas words associated with sports ('football', 'training', 'cricket', 'boat', etc.) are negatively correlated with it (Power and Elliott, 2005). It is interesting that several insights about their future mental health can be gleaned using responses to such prompts.
The results of the LIWC analysis are in Figure 2. Here red cells are positively correlated with the outcome (more distress, anxiety, etc.), blue cells are negatively correlated (less distress, anxiety, etc.) and white cells are not significant after correction for multiple comparisons. Here we see 'posemo', 'family' and 'affiliation' are all protective at age 11 (Kellam et al., 1977). Bad mental health is associated with both the 'i' and 'informal' categories at age 11, and with pronoun usage at older ages. While 'leisure' is protective at all ages, no categories are associated with mental illness at every age. This is consistent with the linguistic manifestation of several mental health conditions (e.g. depression).

Conclusions
This paper reported on the participation of a team from the University of Pennsylvania in the CLPsych 2018 shared task on identifying current and future mental health of children based on language from essays they wrote.
Our methods were based on linear regression using different types of word clusters. The methods we presented were designed to be as task-agnostic as possible. Our approach showed the best results for predicting distress at the age of 42 and for predicting current anxiety on Disattenuated Pearson Correlation, and ranked fourth in the future health prediction task. Our method did not perform well compared to other teams in predicting current mental health. Fitting more complex non-linear models might have yielded better performance for that subtask. It is interesting that SocioDemographic features outperformed all language features in predicting future distress. Among language features, normalized word counts (unigrams) performed best at most subtasks. In addition to the subtasks presented, we attempted to provide insight into mental health aspects at different ages. Our findings show that a) mental health aspects at age 11 correlate poorly with mental health at ages 23 and 33 for the children in this cohort; b) males have higher depression scores than females at age 11, while females have higher distress at ages 23, 33 and 42; c) mental health measures are negatively correlated with word length and total number of words used in the essay; d) misspellings, words with illegible letters and increased use of personal pronouns ('I') are correlated with poor mental health at age 11, while descriptions about future physical activity, family and friends are correlated with good mental health.
For future work, since the SocioDemographic features performed best at predicting future distress, we could apply methods such as User-Factor Adaptation, which focus on the author of the content in addition to the content (Lynn et al., 2017; Zhu et al., 2018). It would also be interesting to investigate whether word clusters trained on historical sources (e.g. Google Books) might yield reliable feature representations when studying mental health aspects at different ages, to emulate the linguistic associations of the elderly, for whom data from other platforms such as social media is scarce.