Using Syntactic and Semantic Context to Explore Psychodemographic Differences in Self-reference

Psychological analysis of language has repeatedly shown that an individual’s rate of mentioning 1st person singular pronouns predicts a wealth of important demographic and psychological factors. However, these analyses are performed out of context, both syntactic and semantic, which may change the magnitude or even the direction of such relationships. In this paper, we put “pronouns in their context”, exploring the relationship between self-reference and age, gender, and depression depending on syntactic position and verbal governor. We find that pronouns are overall more predictive when taking dependency relations and verb semantic categories into account, and that the direction of the relationship can change depending on the semantic class of the verbal governor.


Introduction
Approximately 1 in 18 English words on Facebook is a first-person singular pronoun. Extensive work in psychological analyses of language has consistently found strong relations between first-person pronoun use and psychological attributes of individuals (Kendall, 1998; Pennebaker and Stone, 2003; Pennebaker, 2011; Twenge et al., 2012; Oishi et al., 2013; Carey et al., 2015). Although such findings have been replicated extensively, little is known about how the syntactic or semantic context of the pronouns may affect their relationship with human traits. Usage in subject or object position may vary, and the type of verb governing the reference may further change its relationship. For instance, while younger individuals are more likely to use 1st-person singular pronouns overall, older individuals may be more likely to use them as the subject of social verbs.
In this study we dive deep into this one type of word, which makes up a large portion of our daily language. We first look at the relationship between first-person singular pronouns and age, gender, and depression. We then consider the syntactic position of the pronoun and its occurrence in the subject and direct object positions. Next, we explore the self-referenced use of verbs compared to their general use across different semantic categories, followed by an examination of the rate of the 1st-person singular pronoun as the subject and the object of different verb categories.
We ultimately show that pronoun relationships with human outcomes can change drastically depending on their syntactic position and the category of their verbal governor. More specifically, our contributions include: (a) taking the role of context into account in the psychological analysis of personal pronouns, (b) distributional clustering of verbs using Canonical Correlation Analysis (CCA), and (c) exploring the integration of verbal semantic categories in the analysis of pronouns. Utilizing verb categories instead of actual verbs enables generalization and reduces sparsity in the semantic comparison of the contexts in which personal pronouns are used.

Background
A wealth of studies have explored pronoun use with regard to age, gender, and personality types. In fact, a whole book, "The Secret Life of Pronouns", has been dedicated to summarizing such studies, which have built up over several decades of work (Pennebaker, 2011). We could not come close to a full survey of such work, but rather list some of the most notable and recent results for outcomes related to those of this study. Pennebaker et al. (2003) and Chung & Pennebaker (2007) found that the use of self-references (i.e. 'I', 'me') decreases with age. Pennebaker et al. (2003) and Argamon et al. (2007) showed that females use significantly more first-person singular personal pronouns than males. Bucci and Freedman (1981), Weintraub (1981), and Zimmermann et al. (2013) found that first-person singular pronouns are positively correlated with depressive symptoms. These analyses do not take into consideration the role of syntactic and semantic context, which may carry interesting information about psychological factors.

Method
Data Set: Facebook Status Updates. Our dataset consists of the status updates of 74,867 Facebook users who volunteered to share their posts in the "MyPersonality" application, sent between January 2009 and October 2011. All users (a) had English as a primary language, (b) had indicated their gender and age, (c) were less than 65 years old (due to data sparsity beyond this age), and (d) had at least 1,000 words in their status updates (in order to accurately estimate language usage rates). This dataset contains 309 million words within 15.4 million status updates. All users completed a 100-item personality questionnaire, an International Personality Item Pool (IPIP) proxy to the NEO-PI-R (Goldberg, 1999). User-level degree of depression (DDep) was estimated as the average response to seven depression facet items (nested within the larger Neuroticism item pool of the questionnaire) (Schwartz et al., 2014).
Dependency Features. We used dependency annotations to determine the syntactic function of personal pronouns, i.e. subject (S) and direct object (DO). We obtained dependency parses of our corpus using the Stanford Parser (Socher et al., 2013), which provides universal dependencies as (relation, head, dependent) triples. In the next step, we extracted the words in the nominal subject ("nsubj") and direct object ("dobj") positions, including the nsubj 1st-person singular pronoun "I" and the dobj 1st-person singular pronoun "me". We also extracted the corresponding verb for each nominal subject and direct object word.
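The extraction step can be illustrated with a minimal sketch. It assumes the parses are already available as (relation, head, dependent) triples. The toy triples and the helper name `extract_positions` are hypothetical and are not the Stanford Parser's API.

```python
from collections import Counter

# Hypothetical parser output: (relation, head, dependent) triples.
triples = [
    ("nsubj", "love", "I"),
    ("dobj", "love", "pizza"),
    ("nsubj", "called", "she"),
    ("dobj", "called", "me"),
    ("nsubj", "think", "I"),
    ("det", "pizza", "the"),
]

# First-person singular form expected in each syntactic position.
FIRST_PERSON = {"nsubj": "i", "dobj": "me"}


def extract_positions(triples):
    """Keep nsubj/dobj triples and flag first-person singular dependents.

    Returns (relation, governing word, dependent, is_first_person) rows.
    The head is assumed to be a verb; a real pipeline would check its tag.
    """
    rows = []
    for rel, head, dep in triples:
        if rel not in FIRST_PERSON:
            continue  # ignore relations other than nsubj and dobj
        is_1p = dep.lower() == FIRST_PERSON[rel]
        rows.append((rel, head.lower(), dep.lower(), is_1p))
    return rows


rows = extract_positions(triples)
# Count how often each verb governs a first-person pronoun, per relation.
first_person_verbs = Counter(
    (rel, verb) for rel, verb, _, is_1p in rows if is_1p
)
```

Keeping the governing verb next to each pronoun occurrence is what later lets the counts be grouped by verb category.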
Verb categorization. To integrate verbal semantic categories into the syntactic analysis of pronouns, we utilize two verb categorization methods: (a) linguistically-driven Levin's Verb Classes, and (b) empirically-driven verb clustering based on CCA.
Levin's verb classes (Levin, 1993) include around 3100 English verbs classified into 47 top-level classes and 193 second- and third-level classes. This classification is based on Levin's hypothesis that the syntactic behavior of a verb is influenced by its semantic properties, which implies that identifying sets of verbs with comparable behavior at the syntax level will lead to coherent clusters of semantically similar verbs. In this paper we used all of the 193 second- and third-level Levin classes (Lev). As an alternative, we also used the 50 most frequent sub-classes in our social media data (LevTop).
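Selecting the most frequent classes for LevTop can be sketched as follows. This is a toy illustration under stated assumptions: the verb-to-class lookup and the corpus are hypothetical, class frequency is taken to be the summed token frequency of member verbs, and the helper name `top_classes` is invented here.

```python
from collections import Counter

# Hypothetical verb-to-class lookup. A verb maps to a list because
# it may be listed under more than one class.
verb_to_classes = {
    "love": ["admire"],
    "hate": ["admire"],
    "hit": ["hit"],
    "kick": ["hit"],
    "mix": ["mix"],
}


def top_classes(verb_tokens, verb_to_classes, k):
    """Return the k classes whose member verbs occur most often."""
    counts = Counter()
    for verb in verb_tokens:
        # Verbs outside the lexicon contribute nothing.
        for cls in verb_to_classes.get(verb, []):
            counts[cls] += 1
    return [cls for cls, _ in counts.most_common(k)]


# Toy corpus of verb tokens; "run" is not in the lookup.
corpus_verbs = ["love", "hate", "love", "kick", "mix", "love", "hit", "run"]
lev_top = top_classes(corpus_verbs, verb_to_classes, k=2)
```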
To derive empirically driven clusters we use Canonical Correlation Analysis (CCA), a multi-view dimensionality reduction technique. CCA has previously been used in word clustering methods such as multi-view learning of word embeddings (Dhillon et al., 2011) and multilingual word embeddings (Ammar et al., 2016). The advantage of a multi-view technique is that we can leverage both the subject and object context. More precisely, we performed sparse CCA on matrix x, which contains 5k by 10k verb-by-nominal-subject (nsubj) co-occurrences, and matrix z, which contains 5k by 10k verb-by-direct-object (dobj) co-occurrences. The output of CCA is a subject-by-component matrix (u: subject-view) and an object-by-component matrix (v: object-view). We then build matrix S by multiplying x by u and matrix O by multiplying z by v, giving the verbs by CCA components from the subject view and the verbs by object components from the object view, respectively. To cluster verbs directly from the CCA components, we use the average score of the subject-view and object-view components, assigning verbs to those components for which they have a non-zero absolute weight (CCA-D). Sparse CCA zeroes out verbs from multiple components so that verbs can be assigned to components directly, but we also explore normal CCA and cluster the verbs using k-means (k = 30) on the z-scaled values of the S and O matrices (CCA-KM).

Table 1: Area under the ROC curve (AUC) for gender (higher is better) and Mean Square Error (MSE) for age and depression prediction (lower is better), with prediction using 1st-person pronoun use overall, in subject and object position, and given verb categories.

Both Levin's and CCA-based verb classes are derived from syntactic behavior. As a result, they often do not distinguish antonyms. For instance, Levin's "admire" verb class contains both "love" and "hate". Building on research showing that positive and negative emotions differ across age and gender (Schwartz et al., 2013), we integrate valence information into our verb clustering. We used positive and negative sentiment scores from the EmoLex word-emotion association lexicon (Mohammad and Turney, 2013), dividing each of our clusters into positive, negative, and neutral sub-classes.
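The direct component assignment (CCA-D) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the sparse loading matrices u and v have already been produced by some sparse CCA solver, uses tiny hypothetical matrices in place of the 5k by 10k counts, and introduces the helper names `matmul` and `cca_d_assign` for illustration only.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def cca_d_assign(x, z, u, v, tol=1e-12):
    """Assign verbs to components with a non-zero averaged score.

    x: verbs x subjects co-occurrence counts
    z: verbs x objects co-occurrence counts
    u: subjects x components loadings (subject view)
    v: objects x components loadings (object view)
    Returns {component index: [verb indices]}.
    """
    s = matmul(x, u)  # verbs x components, subject view
    o = matmul(z, v)  # verbs x components, object view
    clusters = {}
    for verb_idx, (s_row, o_row) in enumerate(zip(s, o)):
        for comp, (s_val, o_val) in enumerate(zip(s_row, o_row)):
            # Average the two views; keep the verb if the weight is non-zero.
            if abs((s_val + o_val) / 2.0) > tol:
                clusters.setdefault(comp, []).append(verb_idx)
    return clusters


# Hypothetical toy data: 3 verbs, 2 subjects, 2 objects, 2 components.
x = [[2, 0], [0, 3], [1, 0]]
z = [[1, 0], [0, 2], [4, 0]]
u = [[0.5, 0.0], [0.0, 0.7]]
v = [[0.4, 0.0], [0.0, 0.6]]
clusters = cca_d_assign(x, z, u, v)
```

Because the loadings are sparse, most averaged scores are exactly zero, so each verb lands in only a few components. With dense loadings from ordinary CCA this rule would assign every verb everywhere, which is why a separate clustering step such as k-means is needed in that case.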
Analysis. We explore the use of 1st-person singular pronouns across age and gender in different syntactic and semantic contexts. Features are encoded as the mean from maximum likelihood estimation over the probability of mentioning a first-person singular pronoun in a given context:
(a) The overall usage of the first-person singular pronoun, P(1p).
(b) The probability of using the first-person singular pronoun in the nsubj and the dobj positions, P(1p|r), where r ∈ {nsubj, dobj}.
(c) The probability of using the first-person singular pronoun in the nsubj and the dobj positions of a given verb category, P(1p|r, c), where r ∈ {nsubj, dobj} and c ∈ vcat, the set of all verb categories being considered.
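One plausible reading of these three features is as per-user relative frequencies, sketched below. The exact normalizers are an assumption rather than something stated in the text. Here r denotes the dependency relation (nsubj or dobj), c one verb category in vcat, and count(·) a count over a single user's status updates.

```latex
\begin{align}
P(1p) &= \frac{\mathrm{count}(1p)}{\mathrm{count}(\text{all words})} \\[4pt]
P(1p \mid r) &= \frac{\mathrm{count}(1p \text{ in relation } r)}
                     {\mathrm{count}(\text{any word in relation } r)} \\[4pt]
P(1p \mid r, c) &= \frac{\mathrm{count}(1p \text{ in relation } r \text{ to a verb in } c)}
                        {\mathrm{count}(\text{any word in relation } r \text{ to a verb in } c)}
\end{align}
```

Under this reading, each conditional feature restricts both numerator and denominator to the same context, so a user who rarely uses a verb category is not penalized for low raw counts.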

Evaluation
The goal of our work is to expand the knowledge of how the first-person singular pronoun, one of the most common word types in English, is related to who we are: our demographics and psychological states. We work toward this goal in an empirical fashion, first replicating known general relationships of 1st-person singular pronouns with gender, age, and depression, then exploring their use in different syntactic positions, and finally looking at relationships within specific semantic contexts according to the verb classes described earlier.

Verb Clusters                                               r
..., blow, roll, hack, cast                                 .22
hold, handle, grasp, clutch, wield, grip                    .18
hit, kick, strike, slap, smash, smack, bang, butt          -.10
add, mix, connect, link, combine, blend                    -.04

Replication. We use standardized linear and logistic regression to correlate gender, age, and depression with P(1p) (first-person singular pronoun use). We control for age in the case of gender, gender in the case of age, and both gender and age in the case of depression by including them as covariates in the regression and reporting the unique coefficient for the variable in question. Logistic regression is used for gender, since it is binary, while linear regression is used for the continuous age and depression variables. Confirming past results, we found significant relationships between first-person pronoun usage and gender (β = .11, p < .001), age (r = −0.17, p < .001), and depression score (r = 0.06, p < .01).
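The idea of reporting a unique standardized coefficient while controlling for a covariate can be illustrated for the linear case with a single covariate, such as age regressed on pronoun rate with gender as the control. The sketch below is not the authors' code. It obtains the coefficient by residualizing both variables on the covariate, which gives the same value as fitting the multiple regression directly. The data and function names are hypothetical, and the logistic case used for gender is not covered.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize to zero mean and unit (population) variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def residualize(target, covariate):
    """Remove the linear effect of one covariate (OLS with intercept)."""
    mt, mc = mean(target), mean(covariate)
    slope = sum(
        (c - mc) * (t - mt) for c, t in zip(covariate, target)
    ) / sum((c - mc) ** 2 for c in covariate)
    return [t - (mt + slope * (c - mc)) for c, t in zip(covariate, target)]


def unique_coefficient(outcome, feature, covariate):
    """Standardized coefficient of feature in outcome ~ feature + covariate."""
    y, x, g = zscore(outcome), zscore(feature), zscore(covariate)
    y_res, x_res = residualize(y, g), residualize(x, g)
    # Regress the residualized outcome on the residualized feature.
    return sum(a * b for a, b in zip(x_res, y_res)) / sum(a * a for a in x_res)


# Hypothetical toy data: per-user pronoun rate, gender flag, and age.
pronoun_rate = [0.04, 0.07, 0.05, 0.08, 0.06, 0.03]
gender = [0, 1, 0, 1, 1, 0]
age = [41, 22, 35, 19, 30, 48]
beta_age = unique_coefficient(age, pronoun_rate, gender)
```

For depression, where both age and gender are controlled, the same idea applies with residualization on two covariates. In practice a regression library would handle that case more conveniently.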
Syntactic Context. Taking dependency relationships into account (P(1p|r)), we observed shifts in the magnitude of correlations. Specifically, we found significant negative correlations between age and using the 1st-person singular pronoun in the subject (r = −0.12, p < .001) and object (r = −0.17, p < .001) positions. For gender, we found a significant positive correlation between being female and the probability of using the 1st-person singular pronoun overall (r = 0.11, p < .001) and in subject position (r = 0.16, p < .001).
Syntactic and Semantic Context. Table 1 reports the area under the ROC curve (AUC) for gender prediction and the Mean Square Error (MSE) for predicting age and depression based on P(1p), P(1p|r), and P(1p|r, c), derived from various categorization approaches. We used AUC since it captures finer differences in performance by evaluating the class probabilities of test instances rather than only whether each prediction was right or wrong. We applied 10-fold cross-validation with a linear SVM in the case of gender and ridge regression in the case of age and depression. The results reveal a consistent pattern: in gender, age, and depression prediction, all the features that take context into account outperform P(1p), the most widely reported measure of self-reference in the literature. This suggests that there is more information to be gained by utilizing syntactic and semantic context. In other words, considering the first person in subject and object position across different contexts reveals a more complex and more insightful set of relations. We achieve the best performance by utilizing verb categories. We first observe that integrating sentiment helps in nearly all verb categorization approaches. Next, we see that while both CCA and Levin verb clusters yield improvement in prediction accuracy, our performance gains using the data-driven CCA-based verb clustering are not as large as those from Levin's linguistically-driven classes.
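The reason AUC can separate models that plain accuracy cannot is easy to see on a toy example. The sketch below computes AUC as the fraction of positive-negative pairs that the scores rank correctly, with ties counting half. The labels and scores are hypothetical and are not drawn from the paper's experiments.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score with every negative score.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples on the correct side of a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical scores from two models on the same four test users.
labels = [1, 1, 0, 0]
scores_a = [0.9, 0.4, 0.3, 0.6]
scores_b = [0.6, 0.2, 0.4, 0.9]

acc_a, acc_b = accuracy(labels, scores_a), accuracy(labels, scores_b)
auc_a, auc_b = roc_auc(labels, scores_a), roc_auc(labels, scores_b)
```

Both toy models make the same number of thresholded errors, yet the first ranks most positives above negatives while the second does not. That difference in ranking is what AUC picks up.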
While we believe our features can improve prediction accuracy, that is not the primary application of social science research. Rather, it is correlating the behavior of referencing the self with psychological conditions, like depression, in order to gain human insights. In the case of correlating behavior with a psychological measure, Pearson coefficients above .1 are considered noteworthy and above .3 are considered approaching a "correlational upperbound" (Meyer et al., 2001).
Tables 2, 3, and 4 show the most predictive features, using the best performing clustering method (i.e. Levin & Sentiment). Note that in the case of age and gender, not only does the magnitude of the relationship change, but the direction can reverse entirely.
For example, while males are less likely to use first-person singular pronouns overall, they are much more likely to use them as the subject of aggressive physical contact verbs like "kick", "shoot", "slap", and "smash", suggesting men are more likely to express themselves as agents of aggressive contact. On the other hand, women use first-person singulars in the social sphere, particularly in an affiliative context. They assert themselves as agents of empowering and encouraging others (e.g. "love", "enjoy", "cherish", "admire") and faith in others (e.g. "trust", "value", "support", "respect").

Verb Clusters                                               r
1st person singular pronoun use                             .06
1st person singular nominal subject
    cry, worry, suffer, fear, bother, ache, mourn, anger    .11
    scare, annoy, confuse, depress, upset, disappoint       .11
1st person singular direct object
    kill, murder, slay, slaughter, butcher                  .09
    scare, annoy, confuse, depress, upset, disappoint       .07

Conclusion
We have shown that the well-studied link between the first-person singular pronoun and human psycho-demographics is largely dependent on its syntactic and semantic context. Many theories and conclusions are built on such relationships, but here we show these relationships depend on verbal context; correlations can shrink, grow, and even change direction depending on the verbs governing the pronoun. For example, while the usage of the 1st person singular pronoun decreases with age, it increases when used as the subject of verbs such as "thank" and "celebrate", or as the object of verbs such as "join". Similarly, while females tend to use 1st person singular pronouns more than males, they use them less often as the subject of "destroy" verbs or as the object of "hit" and "kick" verbs. By integrating syntactic dependency relationships along with semantic classes of verbs, we can capture more nuanced linguistic relationships with human factors. Beyond pronouns, we ultimately aim to expand the regimen of open-vocabulary techniques available for the analysis of psychologically relevant outcomes.