Identifying Locus of Control in Social Media Language

Individuals express their locus of control, or “control”, in their language when they identify whether or not they are in control of their circumstances. Although control is a core concept underlying rhetorical style, it is not clear whether control is expressed by how or by what authors write. We explore the roles of syntax and semantics in expressing users’ sense of control (i.e., being “controlled by” or “in control of” their circumstances) in a corpus of annotated Facebook posts. We present rich insights into these linguistic aspects and find that while the language signaling control is easy to identify, it is more challenging to label whether it is internally or externally controlled, with lexical features outperforming syntactic features at the task. Our findings could have important implications for studying self-expression in social media.


Introduction
Language is a form of action, and written occurrences constitute a performance of power and control (Fairclough, 2001). Research in natural language processing has long focused on distilling the constituents of writing that conveys authority (Danescu-Niculescu-Mizil et al., 2012), dominance (Bradley and Lang, 1999), dogmatism (Fast and Horvitz, 2016), expertise (Levy and Potts, 2015) and politeness (Danescu-Niculescu-Mizil et al., 2012). These studies have shown how authors' use of certain lexical and syntactic patterns achieves specific rhetorical effects. In this study, we contribute to the growing literature with an analysis of how individuals express control, and compare the insights and predictive power obtained from both lexical and syntactic features.
We operationalize the key aspects of locus of control described by existing psychology theories (Rotter, 1966) to identify an external locus of control when authors express that they feel controlled by other people or the environment. On the other hand, authors communicate an internal locus of control when they ascribe the control of their decisions and circumstances to themselves. This study poses the question: do writers rely more on content than syntax to convey their control? Following its psychological underpinnings, we expect that overall lexical choice, self-reference, and certain verb categories would be indicative features for modeling an author's locus of control. Our study attempts to validate these assumptions by answering two research questions:
• To what extent can lexical and syntactic features predict control relevance and internal and external locus of control?
• Which verb categories are more associated with control relevance and internal vs. external locus of control?
We train a multi-stage predictive model to classify posts on Facebook as (a) control-relevant or not, and (b) internal vs. external control. We find that syntactic and semantic features work well to identify control relevance, with verb categories and pronoun use being the most predictive features. However, determining internal vs. external control is a more challenging task, where lexical features vastly outperform syntax-based features. The best performance is seen by a model that combines lexical and syntactic features along with self-reference ratio. We will be releasing our tools and models along with annotated and anonymized datasets for the benefit of the research community.
* Masoud Rouhizadeh & Kokil Jaidka co-led this work.

Background
Locus of control, or "control", reflects the extent to which people ascribe the cause or control of events in their lives to themselves or to external factors (Rotter, 1966). The language indicating internal control in a given situation signals intentionality (the author is describing an action that he/she intended) and awareness (the author is aware of the effect of the action). Internal control is often associated with causing a given event or doing something that is clearly a choice. The external control language, on the other hand, is characterized by lack of intention or awareness, or by concrete mention of being controlled by others. It is usually linked to describing an out-of-control event, or something that is not a choice. Organizations and governments are attempting to better understand issues of locus of control and self-efficacy, which are closely related to physical health and job performance (Marmot et al., 1991; Harter et al., 2003). The study by Jaidka et al. (2018b) explored the relationship between locus of control (measured through surveys) and the Big 5 personality taxonomy (John and Srivastava, 1999), and touched briefly upon the content-related linguistic signals. However, the present study is more interested in comparing the lexical (aka open-vocabulary or bag-of-words) and syntactic signals of control that are embedded in language. If deployed on a large scale with the informed consent of the authors, such a model may allow unobtrusive, cost-effective estimates of well-being.
Signals within the language should provide insight into the cognitive sense of control experienced by an individual; however, to our knowledge, no study has examined whether these signals are stronger in the way people express themselves, or in what they say. Although the discourse and lexical signals that convey status and dominance have been studied in previous work, they involve controlling other people (Danescu-Niculescu-Mizil et al., 2012; Fast and Horvitz, 2016), while locus of control is an assessment of the writer's own affairs, perhaps indicated through lexical features that focus on first person pronouns.
Dominance has also been qualitatively measured in terms of the affective context at the level of an utterance (Bradley and Lang, 1999); however, we found that this also did not correspond to our notion of locus of control. With reference to the syntactic features, we will test whether control may be understood through different verb forms, and active versus passive voice. Locus of control is different from semantic role labeling: for internal control, we are mainly interested in whether the author is the agent (rather than identifying who the agent is, in a sentence), and how much they control their life. External control includes a complex mixture of semantic roles such as patient, experiencer, theme, recipient, and beneficiary.

Locus of control data
Posts on Facebook capture self-expression by a diverse audience about their daily lives, which makes them a natural starting point for exploring the linguistic features of control. For data collection, we deployed a survey on Qualtrics comprising several demographic questions, the Sense of Control facet items from the MIDUS survey (Brim et al., 2004) and a 3-item health inventory measuring poor health, general health and weekly exercise, taken from the CDC's Behavioral Risk Factor Surveillance System (BRFSS) and the European Social Survey (ESS). We invited users to share access to their Facebook status updates and randomly sampled 2000 users who had posted at least 100 words (with 839 status updates on average). We then randomly sampled 2 status updates per user to obtain 4000 statuses. We cleaned up the text by removing URLs, hashtags, user handles, etc. and split each status into sentences using our in-house pattern-based sentence splitter inspired by the CMU ARK Twitter Twokenize script (Owoputi et al., 2013). Finally, from each status update, we selected one sentence that has 2 or more words.
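The cleaning and sentence-selection steps above can be illustrated with a minimal sketch. The regular expressions and the function names `clean_status` and `candidate_sentences` are assumptions made for illustration; they stand in for the in-house splitter, which is not reproduced here, and the sketch does not decide which qualifying sentence is finally chosen.

```python
import re

# Hypothetical patterns; the in-house splitter is likely more elaborate.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[#@]\w+")           # hashtags and user handles
SENT_RE = re.compile(r"(?<=[.!?])\s+")    # naive split after ., ! or ?


def clean_status(text):
    """Strip URLs, hashtags and user handles, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())


def candidate_sentences(status, min_words=2):
    """Return the sentences of a cleaned status with at least min_words words."""
    sentences = SENT_RE.split(clean_status(status))
    return [s for s in sentences if len(s.split()) >= min_words]


status = "Wow! Finally got a best man! http://t.co/abc #wedding @bob"
print(candidate_sentences(status))  # ['Finally got a best man!']
```

One sentence per status would then be drawn from the returned candidates.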

Locus of control annotation
Building a useful computational model requires labeled training data. We labeled the Facebook dataset using three trained annotators pursuing a Master's program in Psychology, to construct the first public corpus annotated with control. We asked the annotators to determine whether the author of the sentence is in control (internal control) or being controlled by others or circumstances (external control). To ensure quality work, we provided examples corresponding to each point on the scale. These examples are also included in the annotation scheme described in the Appendix.
In this paper, we focus on Stage 1 and Stage 2 of annotation, which distinguish control-relevant sentences and classify them as either internal or external control. We evaluate the reliability of the annotations by computing percent-agreement. The highest pairwise inter-annotator agreement between annotators is 90.2% for Stage 1 and 79% for Stage 2.
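As a minimal sketch of the reliability measure, pairwise percent agreement is the share of items on which two annotators assign the same label. The annotator names and labels below are hypothetical and only illustrate the computation.

```python
from itertools import combinations


def percent_agreement(labels_a, labels_b):
    """Percentage of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


def pairwise_agreement(annotations):
    """Map each annotator pair to its percent agreement."""
    return {
        (a, b): percent_agreement(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }


# Hypothetical Stage 1 labels (1 = control-relevant, 0 = not).
annotations = {
    "A": [1, 1, 0, 1],
    "B": [1, 0, 0, 1],
    "C": [1, 1, 0, 0],
}
scores = pairwise_agreement(annotations)
print(max(scores.values()))  # 75.0
```

Percent agreement does not correct for chance agreement, so it is best read alongside the label distribution.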

Methods
To train and test the classifiers for Stage 1 and Stage 2, we use the Differential Language Analysis ToolKit (DLATK), a Python package developed for social media text analysis. Lexical features are extracted in DLATK and syntax-based features are separately generated and imported into DLATK feature sets.  (2015). They built 200 open-ended word clusters by applying spectral clustering to the word-to-word similarity matrix from the neural embeddings of Mikolov et al. (2013). Pronouns: We clustered all pronouns (except for possessives) into 1st-, 2nd-, and 3rd-person regardless of their syntactic role.
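The pronoun clustering can be sketched as a simple lookup. The word lists below are an assumption, since the exact pronoun inventory is not given in the text; possessive forms such as "my" or "their" are left out, and "her" is kept because it also serves as an object pronoun.

```python
# Hypothetical pronoun inventory; possessive forms are deliberately excluded.
PRONOUN_PERSON = {
    "1st": {"i", "me", "myself", "we", "us", "ourselves"},
    "2nd": {"you", "yourself", "yourselves"},
    "3rd": {"he", "him", "himself", "she", "her", "herself",
            "it", "itself", "they", "them", "themselves"},
}


def pronoun_counts(tokens):
    """Count 1st-, 2nd- and 3rd-person pronouns, ignoring syntactic role."""
    counts = {person: 0 for person in PRONOUN_PERSON}
    for token in tokens:
        word = token.lower()
        for person, words in PRONOUN_PERSON.items():
            if word in words:
                counts[person] += 1
    return counts


print(pronoun_counts("I told you they would call me".split()))
# {'1st': 2, '2nd': 1, '3rd': 1}
```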

Syntax-based features
We acquire dependency parses of our corpus with SyntaxNet (Andor et al., 2016) and its Parsey McParseface model, which produces universal dependencies as (relation, head, dependent) triples in CoNLL format. We obtain subject-verb tuples (SVs) and subject-verb-object triples (SVOs) from the dependency trees. In our in-house evaluation on a random set of 100 Tweets, SyntaxNet with the Parsey McParseface model outperforms the Stanford Parser (Socher et al., 2013) on extracting SVs and SVOs from social media (P=.75, R=.68, TN Rate=.90 for the former; P=.51, R=.55, TN Rate=.80 for the latter). SyntaxNet is also a better tool for our purpose than the Tweebo Parser (Kong et al., 2014), which only provides dependency graphs and not the relations.
Verb predicate features: We identify the verb classes using (a) linguistically-driven Levin's Classes (Lev) (Levin, 1993), and (b) an in-house manual clustering of the 130 most frequent verbs into 40 semantic classes (M). Inspired by previous research (Rouhizadeh et al., 2016), we extract five sets of dependency-driven verb predicate features: (1) Pronouns-SVO: occurrence of 1st, 2nd, and 3rd person pronouns in the subject/object positions, (2) VerbCat-1-2-3PP: occurrence of 1st, 2nd, and 3rd person pronouns in the subject/object positions of each verb category (from Levin's or our own verb classes), (3) VerbCat-1PP: occurrence of 1st or non-1st pronouns (i.e. self-reference ratio) in the subject/object positions of each verb category, (4) VerbCat-SVO: all words in the subject/object positions of each verb category, and (5) VerbCat: all categories of all the verbs in subject-verb and subject-verb-object contexts.
POS-ngrams: We capture shallow syntactic features by constructing Penn Part-of-Speech (POS) unigrams, bigrams and trigrams after tagging every word with its POS information using the Python NLTK package (Bird et al., 2009).
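The extraction of subject-verb tuples and subject-verb-object triples from a dependency parse can be sketched as follows. This is a simplified illustration rather than the extraction code used in the study: the `Token` record, the relation names (`nsubj`, plus `dobj` or `obj` depending on the Universal Dependencies version) and the lack of a check that the head is actually a verb are all assumptions.

```python
from collections import namedtuple

# Minimal CoNLL-style token: 1-based id, word form, head id, dependency relation.
Token = namedtuple("Token", "id form head deprel")

SUBJECT_RELS = {"nsubj"}
OBJECT_RELS = {"dobj", "obj"}  # UD v1 and UD v2 names for direct objects


def extract_sv_svo(tokens):
    """Return (SV tuples, SVO triples) found in one parsed sentence."""
    by_id = {t.id: t for t in tokens}
    subjects, objects = {}, {}
    for t in tokens:
        if t.deprel in SUBJECT_RELS:
            subjects.setdefault(t.head, []).append(t.form)
        elif t.deprel in OBJECT_RELS:
            objects.setdefault(t.head, []).append(t.form)

    svs, svos = [], []
    for head_id, subs in subjects.items():
        if head_id not in by_id:  # skip malformed heads
            continue
        verb = by_id[head_id].form
        for subj in subs:
            objs = objects.get(head_id)
            if objs:
                svos.extend((subj, verb, obj) for obj in objs)
            else:
                svs.append((subj, verb))
    return svs, svos


# Hand-written parse of "It made my day ." for illustration.
sentence = [
    Token(1, "It", 2, "nsubj"),
    Token(2, "made", 0, "root"),
    Token(3, "my", 4, "nmod:poss"),
    Token(4, "day", 2, "dobj"),
    Token(5, ".", 2, "punct"),
]
print(extract_sv_svo(sentence))  # ([], [('It', 'made', 'day')])
```

Verb predicate features such as VerbCat-1PP could then be derived by mapping each extracted verb to its category and checking whether the subject or object is a first-person pronoun.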

Results
We train multiple classifiers on different sets of lexical and syntactic features after performing feature selection based on univariate regression between each feature and the outcome. We also perform dimensionality reduction by randomized principal component analysis (PCA). After evaluating a number of classification models available in Python's sklearn package, we report the results from the logistic regression classifier with L2 penalization and the inverse of regularization strength set to 0.1, which obtains the best results. Although we have experimented with a variety of feature combinations, which are included in the Appendix, we discuss selected classifiers (for meaningful comparisons) as well as the best-performing ones.
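A pipeline of this shape can be sketched in scikit-learn as below. Only the L2 penalty and the value C = 0.1 come from the text; the use of `SelectKBest` with `f_regression`, the number of selected features, the number of PCA components and the synthetic data are assumptions chosen to make the example run, not the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_control_classifier(k_best=50, n_components=10, seed=0):
    """Univariate selection, then randomized PCA, then logistic regression."""
    return Pipeline([
        # Keep the k features with the strongest univariate relation to the label.
        ("select", SelectKBest(score_func=f_regression, k=k_best)),
        # Randomized SVD solver for the PCA step.
        ("pca", PCA(n_components=n_components, svd_solver="randomized",
                    random_state=seed)),
        # L2 is scikit-learn's default penalty; C is the inverse regularization strength.
        ("clf", LogisticRegression(C=0.1, max_iter=1000)),
    ])


# Synthetic stand-in for a sentence-by-feature matrix and binary labels.
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=10, random_state=0)
model = build_control_classifier().fit(X, y)
predictions = model.predict(X)
```

Keeping all three steps in one `Pipeline` means that selection and PCA are refit on each training fold during cross-validation, which avoids leaking test information into the feature space.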

Predicting control relevance
We report the prediction performance of the main feature sets in Table 1a. We see that lexical ngrams are moderately effective, whereas LIWC, ANEW, and w2v features are not helpful. On the other hand, pronouns appear to be very helpful, despite their linguistic and semantic simplicity. Among the syntax-based features, using the verb categories of SVs and SVOs provides the second best F1 score (Levin's classes are better than our classes), and adding self-reference ratio to verb categories generates the best results. Interestingly, POS-ngrams perform on par with the lexical ngrams, suggesting that the syntactic information encoded in POS sequences is just as good as the lexical information in ngrams, with considerably less sparsity and more efficient computation.

Predicting Internal/External control
The best results for classifying internal vs. external control are in Table 1b. Unlike control relevance, ngrams are the most helpful features here, and pronouns and verb category features result in poor performance. LIWC, ANEW, and w2v do not noticeably improve the results when combined with ngrams, but adding verb categories and self-reference ratio to this combination creates the best result. This suggests that identifying internal/external control benefits from an ensemble of syntactic, semantic and lexical features, and that simple lexical features are more helpful than word embeddings. Similar to control relevance, POS-ngrams are helpful here, although they are not as good as lexical ngrams. Table 2 shows the most predictive verb categories (with subject or subject-object arguments) based on logistic regression coefficients. We see that not only does the magnitude of the relationship change, but the direction can also reverse completely (positive values indicate control relevance in Table 2a and external control in Table 2b). Interestingly, we see that verbs for audio/visual activities, demand, and start/end of processes are more frequently used in control-related contexts, whereas using love and like verbs is an informative signal that a given sentence is not related to control. In addition, verbs of cognition, missing, feeling, and hope are positively associated with being controlled by others or the environment, whereas verbs indicating attempt are correlated with the author's control of the situation. Although not complete, these semantic categories are intuitively well-associated with the psychological theory of the control concept. In the Appendix, we also identify the POS ngram patterns which are significantly correlated with both control relevance and internal vs. external control.

Error Analysis
Inspecting the misclassified instances, we observed the following general patterns:
• False negatives:
- Control-relevant imperative sentences were frequently misclassified as control-irrelevant, e.g., "have a good day everyone"; "Find Jesus."; "Stay focused!"
- Internal LoC sentences with modals and 'get' were misclassified as external LoC, e.g., "Finally got a best man!"; "Join me and you can save money on all you purchase here."
• False positives:
- Control-irrelevant sentences with emotion verbs ("like", "hate") were often misclassified as control-relevant, e.g., "I Hate SNOW!"; "I dont like cologne, but it smells nice."
- External LoC sentences with possessive pronouns and the verb "make" were misclassified as internal LoC, e.g., "It made my day."; "Is your name math because i have a problem with you."

Validation against psychological LoC
An ideal validation of language-based control against psychological LoC would find that users expressing an internal LoC in their social media posts would also be more likely to have an internal psychological LoC as measured by the Sense of Control scale (Brim et al., 2004), and poorer health as reported in previous work (Birnbaum et al., 2010). With only two observations per user, we speculate that our data do not afford us enough power to confirm this relationship. A Spearman correlation between the counts of internal- and external-labeled sentences per user and their self-reported LoC and health items revealed a weak association between the number of sentences labeled as external LoC per author and both their psychological LoC as per the survey questionnaire (ρ = −.06, p < 0.05, N = 2000) and their physical and general health (−0.08 < ρ < −0.06, p < 0.01, N = 2000). Authors with a higher number of sentences labeled as internal LoC did not show a significant association with internal LoC, but were more likely to exercise regularly (ρ = 0.04, p < 0.05, N = 2000). These findings, although weak, correspond in direction to the findings reported in the literature (Birnbaum et al., 2010), suggesting that future work should investigate to what extent the linguistic expressions of control can extrapolate to psychological LoC.
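For reference, Spearman's ρ is the Pearson correlation of the rank-transformed variables, with tied values sharing their average rank. The sketch below is a plain-Python illustration on hypothetical per-user values; a library routine such as `scipy.stats.spearmanr` would additionally report a p-value.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have equal length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: external-labeled sentence counts (0 to 2 per user)
# and self-reported sense-of-control scores.
external_counts = [0, 1, 2, 0, 1, 2]
loc_scores = [5.0, 4.0, 3.0, 6.0, 4.5, 2.0]
rho = spearman_rho(external_counts, loc_scores)
```

With only two sentences per user, the counts take at most three distinct values, so ties are frequent and the average-rank handling matters.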

Conclusion
We describe a computational linguistic approach to identify the locus of the author's control in their social media writing. Utilizing its psychological underpinnings, we create an annotated dataset of 4000 sentences, labeled for control relevance and for internal versus external control. We show that identifying control is largely dependent on syntactic information for control relevance and on lexical information for internal/external control. From the NLP standpoint, this suggests that solely using bag-of-words features may not be sufficient for predicting specific psychological outcomes. We have also distinguished our work from dominance and thematic roles, which may be the closest approximations to LoC; however, these concepts do not completely translate into each other.
Differential language analyses identified interesting associations between verb categories and control. We found that audio/visual verbs tend to occur significantly more in control-related text, and some emotion verbs such as "miss" or "feel" are correlated with lack of control of the surroundings. A caveat of using language-based models is that their association with traits is correlational, not causal. Furthermore, the differences in platform affordances (Jaidka et al., 2018c) and diachronic drifts in language use over time (Jaidka et al., 2018a) imply that language models to identify control may need domain adaptation before they are applied to other corpora, language from other time periods and other social media platforms, as well as post-hoc domain adaptation before they can scale to measure community traits (Rieman et al., 2017).
Existing psychological theories are mainly based on self-expressed and self-perceived locus of control in questionnaires, which may be susceptible to self-report biases. Instead, we demonstrate that some of these constructs can reliably be extracted from language samples, such as social media posts, which are unsolicited self-expressions of control.
Internal locus of control has been argued to be important for mental and physical health, and to characterize well-run ("empowered") work teams. We hope that the model presented here can be used, with appropriate consent and privacy safeguards, for unobtrusive monitoring of LoC in many therapy and work settings.