Predicting Human Trustfulness from Facebook Language

Trustfulness — one’s general tendency to have confidence in unknown people or situations — predicts many important real-world outcomes, such as mental health and the likelihood of cooperating with others, including clinicians. While data-driven measures of interpersonal trust have previously been introduced, here we develop the first language-based assessment of the personality trait of trustfulness by fitting one’s language to an accepted questionnaire-based trust score. Further, using trustfulness as a case study, we explore the role of questionnaire size as well as word count in developing language-based predictive models of users’ psychological traits. We find that fitting to a longer questionnaire can yield greater test set accuracy, while, for training, it is beneficial to include users who took smaller questionnaires, which offers more observations for training. Similarly, after noting a decrease in individual prediction error as word count increases, we find that a word-count-weighted training scheme is helpful when very few users are available in the first place.


Introduction
Trust, in general, indicates confidence that an entity or entities will behave in an expected manner (Singh and Bawa, 2007). While trust has been computationally explored as a property of relationships between people, i.e. interpersonal trust (Golbeck et al., 2003; Colquitt et al., 2007; Murray et al., 2012), few have considered trustfulness: a personality trait of an individual indicating their tendency, outside of any other context, to trust in people, institutions, and situations (Nannestad, 2008).
Trustfulness is tied to many real-world and social outcomes. For example, it predicts individual mental health and well-being (Helliwell and Wang, 2010) and how likely one is to join or cooperate in diverse social groups (Uslaner, 2002; Stolle, 2002). The importance of trustfulness is thought to be increasing as modern societies increasingly interact online with unknown people (Dinesen and Bekkers, 2016). This suggests it could become increasingly important in the clinical domain, where trust has been shown to be essential in securing a strong and effective patient-clinician bond (Brennan et al., 2013; Lambert and Barley, 2001). Trait trustfulness also relates to self-disclosure, which in turn greatly aids the clinician in her provision of care (Steel, 1991). Provider trust is also likely important for effectively treating a patient, especially in online therapeutic sessions, as it signals trustworthiness and care, but research on this topic remains sparse.
Unfortunately, traditional trustfulness measurement options (e.g. surveys) are expensive to scale to large populations and to repeat frequently (i.e. in clinical practice), and they carry biases (Baumeister et al., 2007; Youyou et al., 2017). Researchers are actively searching for alternative behavior-based methods of measurement (Nannestad, 2008).
Language use in social media offers a behavior from which one can measure psychological traits like trust. Over the last five years, more and more researchers have turned to Facebook or Twitter language to develop psychological trait predictors, fitting user language to psychological scores from questionnaires. According to standard psychometric validity tests, such language-based approaches have been found to rival other accepted measures, such as questionnaires and assessments from friends (Park et al., 2015). However, while language-based predictive models for many traits now exist, none have considered a model for trustfulness, a trait which some have argued is now of marked importance as modern societies are increasingly interacting online with unknown people (Dinesen and Bekkers, 2016). Further, across such trait prediction work, little attention has been paid to the role of (1) questionnaire size: how many questions are used to assess an individual's trait, and (2) word count: how many words the user has written from which the language-based predictions are made. Here, we answer the call for more behavior-based trait measurement (Baumeister et al., 2007; Youyou et al., 2017) by developing a language-based (i.e. behavioral) predictive model of trustfulness fit to questionnaire scores, and we seek to draw insights into the role of word count and questionnaire size in predictive modeling.
Contributions. This work makes several key contributions. We (1) introduce the first language-based assessment of trustfulness (henceforth "trust"), evaluated over out-of-sample trust questionnaires, enabling large-scale or frequently repeated trust measurement. We also (2) study the effect of the number of questions in the psychological survey to which the model is fit (in other words, which matters more: the number of questions in the questionnaire or the number of users who took it?), (3) explore the relationship between users' word count and model error, and (4) introduce a weighting scheme to train on low word count users. Altogether, we add trustfulness, an important trait for clinical care, to a growing battery of language-based assessments.

Background
Previous computational work on trust has focused on interpersonal trust: an expectation concerning the future behavior of a specific person toward another known person (Bamberger, 2010). Interpersonal trust primarily concerns situations in which there are two known individuals (the truster and the trustee) who share a history of previous interactions. Such trust requires studying a history of interactions indicating how well each participant might understand the other's personality (Kelton et al., 2008; Golbeck et al., 2003). Interpersonal trust has been studied especially in the context of online social networks, where it is sometimes possible to track users from their first interactions (Kuter and Golbeck, 2007; DuBois et al., 2011; Liu et al., 2008, 2014). While some of these works have considered the amount of communication (Adali et al., 2010), content is rarely considered, and none of these past works have attempted to measure the trait of trustfulness, as we do here.
Trustfulness (also referred to as "generalized trust"), in contrast with interpersonal trust, measures trust between strangers. As Stolle (2002) put it: [Trustfulness] indicates the potential readiness of citizens to cooperate with each other and an abstract preparedness to engage in civic endeavors with each other. Attitudes of trustfulness extend beyond the boundaries of face-to-face interactions and incorporate people who are not personally known.
This version of trust has been tied to the belief in the average goodness of human nature (Yamagishi and Yamagishi, 1994), and it involves a willingness to be vulnerable and to engage with random others despite interpersonal risks (Mayer and Davis, 1999; Rousseau et al., 1998). It has been shown to be predictive of individual mental health and physical well-being (Abbott and Freeth, 2008; Helliwell and Wang, 2010). For communities, trust is a key indicator of social capital (Coleman, 1988; Putnam, 1993), and it is highly predictive of economic growth (Delhey and Newton, 2005; Knack and Zak, 2003).

Trustfulness from Questionnaires
Trustfulness, like other personality traits, is typically measured with either questionnaires or behavioral observations during experiments (Ermisch et al., 2009). Data linking trust experiments to individual linguistic data is not available or easily acquired, so we fit our language-based model of trust to a gold standard of questionnaire-based trust. A variety of such questionnaires exist with high inter-correlation, including the Faith in People scale (Rosenberg, 1957), Yamagishi & Yamagishi's (1999) Trust Scale, and the Trust facet of the Agreeableness trait in the Big Five personality questionnaire (Goldberg et al., 2006). Here, due to its availability, we chose to fit our language-based trust predictor to the latter of these questionnaires: the trust personality facet.

Data Set.
We use trust facet scores from the trait questionnaire of consenting participants of the MyPersonality study. From this dataset we derive two versions of trust measurement scores: (1) using 10 trustfulness questions (referred to as 10-question trust), or (2) using a subset of 3 questions (referred to as 3-question trust). Participants either answered all 10 questions (as part of a larger set of over 300 questions) or just the 3-question version (as part of a 100-question set). Each question is answered on a scale of 1 to 5, from totally disagree to completely agree. For example, the following are the questions of the 3-item version:

• I believe that others have good intentions.
• I suspect hidden motives in others. *
• I trust what people say.

Some questions (e.g. * above) are "reverse scored," so a 1 becomes a 5 and vice versa. One's final trust score is the mean of the responses to the individual trust questions. Although 3-question trust is less accurate, it may be useful in that it enables training data from more users.
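As a concrete illustration, the scoring just described (reverse-scored items flipped on the 1-5 scale, then the mean taken) can be sketched as follows; the item identifiers are hypothetical:

```python
def trust_score(responses, reverse_items):
    """Compute a facet score from Likert responses.

    responses: dict mapping item id -> integer response in 1..5.
    reverse_items: set of item ids that are reverse scored (1 <-> 5).
    """
    adjusted = [6 - v if item in reverse_items else v
                for item, v in responses.items()]
    # Final score is the mean of the (possibly flipped) item responses.
    return sum(adjusted) / len(adjusted)

# e.g. "I suspect hidden motives in others" (q2) is reverse scored:
score = trust_score({"q1": 4, "q2": 2, "q3": 5}, reverse_items={"q2"})
```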
From MyPersonality, we used a dataset containing 19,455 Facebook users who wrote at least 1,000 words across all of their status updates. In some experiments we additionally included 6,590 users who had fewer than 1,000 words. In total, 26,045 users took the Big Five questionnaire, answering at least the 3 trust-focused questions in it (short version). Among all the users, only 621 completely answered all 10 trust-related questions (long version). Table 1 presents the number of users in detail. It is worth mentioning that not only did the participants consent for their Facebook and questionnaire data to be used in research, but the data has also been anonymized.

Method
                  Long Version   Short Version
  Threshold-1000       438          19,455
  Threshold-10         621          26,045

Table 1: Number of users who filled out the long or short version of the questionnaire, by word count. Threshold-X means setting the word count threshold to X words. The long version column counts users who had a 10-question trust score; the short version column counts users who had a 3-question trust score.

We build a language-based model for the trait of trustfulness. From Facebook status updates,
we extracted two types of user-level lexical features, which have previously been shown to be effective for trait prediction (Park et al., 2015): (a) ngrams of length 1 to 3 and (b) LDA topics. To extract the ngrams from the text we used the HappierFunTokenizer. We did not apply any text normalization, as past work has found that the forms in which people choose to write a word are often themselves predictive of personality (Schwartz et al., 2013). Two types of ngram features were extracted: one containing the relative frequency of each ngram, freq(ngram, user) / freq(*, user), and the other simply a binary indicator of whether the user mentioned the ngram at all. Considering ngrams mentioned by at least 1% of users, we obtained 50,166 ngram features for each of the two types. Topic features were derived from the posteriors of Latent Dirichlet Allocation; we use the 2,000 LDA topic posteriors made publicly available by Schwartz et al. (2013).
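A minimal sketch of the two ngram feature types, assuming tokenization has already been done (e.g. by the HappierFunTokenizer); the function name and interface are ours:

```python
from collections import Counter

def ngram_features(tokens_per_user, n_max=3):
    """tokens_per_user: list of token lists, one per user (all of a user's
    statuses concatenated). Returns two lists of per-user feature dicts:
    relative frequencies and binary mentioned-at-all indicators."""
    rel, binary = [], []
    for toks in tokens_per_user:
        # Count all 1- to n_max-grams for this user.
        counts = Counter(
            " ".join(toks[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(toks) - n + 1)
        )
        total = sum(counts.values())
        # freq(ngram, user) / freq(*, user)
        rel.append({g: c / total for g, c in counts.items()})
        # Binary indicator: the user mentioned the ngram at least once.
        binary.append({g: 1 for g in counts})
    return rel, binary
```

In the full pipeline these sparse per-user dicts would then be restricted to ngrams used by at least 1% of users before modeling.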
We use a series of steps to avoid high-dimensionality issues and prevent overfitting. First, an occurrence threshold is applied to remove ngrams used by less than 1% of users. Second, we select features with at least a small relationship with our trust labels, according to a univariate family-wise error rate < 60. Third, we run a singular value decomposition (in randomized batches) to effectively decrease the size of the feature space and reduce collinearity across dimensions (Boutsidis et al., 2015). We perform this process based on the training data, and then apply the resulting feature reduction to the test data.
Each type of feature (i.e. ngram relative frequencies, ngram booleans, and topics) is qualitatively and distributionally different from the others (Almodaresi et al., 2017). Thus, we perform the reduction on relative-frequency ngrams, boolean ngrams, and topics separately, so that the comparatively few topic features are not lost among the relatively plentiful ngrams. At the end, we merge all types of features into a single feature matrix (an embedding with approximately 5% of the number of training observations). Similar feature reduction pipelines have been shown to perform well in language-based predictive analytics. We then use ridge regression to fit our dimensionally reduced feature set to the trust labels from the Big Five questionnaires.
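The per-family reduction followed by ridge regression might be sketched as below; this is our own numpy illustration rather than the authors' implementation, and the parameter names (k for the ~5% dimensionality target, alpha for the ridge penalty) are ours:

```python
import numpy as np

def reduce_and_fit(feature_blocks, y, k=0.05, alpha=1.0):
    """feature_blocks: list of (n_users x n_feats) arrays, one per feature
    family (e.g. relative ngrams, boolean ngrams, topics); y: trust scores.
    Each block is SVD-reduced separately, the reduced blocks are merged,
    and ridge regression is fit on the result."""
    n = len(y)
    d = max(1, int(k * n))  # target dimensions per block (~5% of observations)
    projs, reduced = [], []
    for X in feature_blocks:
        # SVD-based reduction: project onto the top right-singular vectors.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        V = Vt[:min(d, Vt.shape[0])].T
        projs.append(V)          # kept to apply the same reduction at test time
        reduced.append(X @ V)
    Z = np.hstack(reduced)       # single merged feature matrix
    # Closed-form ridge regression: w = (Z'Z + alpha*I)^-1 Z'y
    w = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ y)
    return projs, w
```

At test time the stored projections would be applied to the test users' features before predicting with w, mirroring the train-then-apply discipline described above.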
Questionnaire Size and Word Count. While the 10-question trust score is more accurate, we have fewer than 1,000 users with this label. Our default setup uses the users with 10-question trust as the test set while we train over the much larger set of users with only 3-question trust. We then experiment to determine whether this setup is ideal.
Previous work has suggested that user attribute prediction benefits from an approximate minimum threshold of 1,000 words per user in order to get accurate estimates of one's personality (Schwartz et al., 2013). Since our dataset contains 6,590 users with fewer than 1,000 words, we explore whether we can include these users in an effective way to improve the model. To this end, we weight each user's contribution to the loss function in proportion to the number of words he or she has written, using two different weighting schemes, linear and logistic, parameterized by the word count wc and thresholds T_max and T_min of 1,000 and 200 respectively.
Thus, users with more than 1,000 words are weighted 1 while those with fewer than 200 words are weighted 0 (we settled on these min and max values based on our study of mean error per word count; Figure 1).
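The exact formulas for the two schemes are not reproduced here, but weight functions satisfying the stated constraints (0 at or below T_min = 200 words, 1 at or above T_max = 1,000) might look like the following sketch; the logistic steepness value is our assumption:

```python
import math

T_MAX, T_MIN = 1000, 200

def linear_weight(wc):
    # Ramps linearly from 0 at T_MIN to 1 at T_MAX, clipped outside that range.
    return min(1.0, max(0.0, (wc - T_MIN) / (T_MAX - T_MIN)))

def logistic_weight(wc, steepness=0.01):
    # Sigmoid centered midway between the thresholds, hard-clipped so that
    # users outside [T_MIN, T_MAX] get exactly 0 or 1 as described.
    if wc >= T_MAX:
        return 1.0
    if wc <= T_MIN:
        return 0.0
    mid = (T_MAX + T_MIN) / 2
    return 1 / (1 + math.exp(-steepness * (wc - mid)))
```

These per-user weights would then be passed as sample weights to the ridge regression loss during training.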

Evaluation
We focus on evaluating our language model by comparing its performance at predicting 10-question vs. 3-question trust labels. We make this comparison in 3 settings: (1) train and test on 10-question trust scores, (2) train and test on 3-question trust scores, and (3) train on 3-question and test on 10-question trust scores.
For the first setting, where all users answered the same number of questions, we performed 10-fold cross-validation. For the second and third settings, we consider all users with a 10-question trust score as our test group and the remaining users, who had only a 3-question trust score, as the train group. This lets us determine how well a model trained on 3-question trust predicts not only 3-question trust itself but also 10-question trust, and lets us compare the latter with the model trained on the small group of users with 10-question trust. In all three experiments, we used 1,000 as the word count threshold and the same group of users as the test group. We present results as both mean squared error and disattenuated correlation, which accounts for measurement error:

r_dis(a,b) = r_(a,b) / sqrt(r_(a,a) * r_(b,b))

where r_(a,a) = .70 is the reliability of the trust questionnaire and r_(b,b) = .70 is the expected reliability of the language-based trust measurement, based on evaluations of language-based personality assessment reliability (Park et al., 2015); every r on the right-hand side of the equation is a Pearson product-moment correlation coefficient.

Table 2: Comparing the language model's performance on 3-question vs. 10-question trust scores. Pearson r_dis is the disattenuated Pearson r and MSE is the mean squared error.
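The disattenuation correction is a one-liner, shown here for illustration:

```python
import math

def disattenuated_r(r_ab, r_aa=0.70, r_bb=0.70):
    """r_dis = r_ab / sqrt(r_aa * r_bb): the observed Pearson correlation
    corrected for the reliabilities of the two measures (both .70 here,
    per the questionnaire and language-based assessment reliabilities)."""
    return r_ab / math.sqrt(r_aa * r_bb)
```

For example, an observed correlation of .35 between two measures each with reliability .70 disattenuates to .50.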
As shown in Table 2, our model's r_dis when training on the limited 10-question data is only 0.259, suggesting we cannot learn a very accurate model from such a small number of users. Comparing the second and third settings, we see that testing on the 10-question trust score outperforms testing on the 3-question trust score by a margin of 0.07 in disattenuated Pearson r and 0.11 in MSE. To further understand why 10-question trust seems easier to predict, we calculated the variance of both 3-question and 10-question trust, yielding σ² = 0.85 and σ² = 0.72 respectively. This suggests that 10-question trust has less noise than 3-question trust. Given these results, in all of the following experiments we train only on 3-question trust labels and test on 10-question trust labels.

Table 3: Comparing the performance of our language model with sentiment as a baseline, using different feature sets. ngr_r: ngrams as relative frequencies; ngr_b: ngrams as boolean variables. Bold indicates the best performance. Pearson r_dis is the disattenuated Pearson r and MSE is the mean squared error.

We next evaluate the performance of our trust model by comparing it to two baseline models. Because positiveness is associated with trust (Helliwell and Wang, 2010), we consider a baseline of sentiment scores using the NRC hashtag sentiment lexicon, an integral part of the best system at SemEval-2013 (Mohammad et al., 2013). We also compare to clusters of words derived from word2vec embeddings (Mikolov et al., 2013) using spectral clustering (Preoţiuc-Pietro et al., 2015). Table 3 shows the predictive performance of our model in comparison to the sentiment and word2vec baselines. Our best model (ngr_r + ngr_b + topics) had an 8% reduction in mean squared error over sentiment and achieved a correlation of r_dis = .494, which is considered a large relationship between a behavior (language use) and a psychological trait (Meyer et al., 2001), and is just below state-of-the-art language-based assessments of other personality traits (Park et al., 2015).
In the next experiment we examine how the error rate changes as a function of word count per user using various combinations of features. We trained 4 models using (1) relative-frequency ngrams, (2) binary ngrams, (3) topics, and (4) all features together. We predict the 10-question trust score of our test users and plot the test users' error rate with respect to their word count, shown in Figure 1. Overall, users' trust scores become more predictable as they use more words, flattening out after 1,000 words. Additionally, for users with few words, relative-frequency and binary ngrams are equally predictive and better than topics. For users with many words, the predictive power of binary ngrams fades, likely reflecting features becoming primarily ones. Similarly, topic-based models perform better for talkative users, likely because more words yield better topic estimation.

Figure 2: Effect of increasing the number of training users who have more than 1,000 words, given 6,590 users with fewer than 1,000 words available for training: "threshold-1000" trains ridge regression on users with at least 1,000 words, "threshold-200" trains ridge regression on users with at least 200 words, "linear" trains linearly weighted ridge regression on users with at least 200 words, and "logistic" trains logistically weighted ridge regression on users with at least 200 words.

Now that we know word count is correlated with prediction error, we explore a word count weighting scheme that enables us to include the 6,590 users with fewer than 1,000 words in training. Such users are included in three different ways: (1) without any weighting, (2) with linear weighting, and (3) with logistic weighting.
In Figure 2 we compare the various model training setups at different training sizes. As shown, when we have just a few users with more than 1,000 words, including more users with low word counts improves performance, regardless of the model. However, as the number of users with more than 1,000 words increases, injecting low word count users hurts performance. In addition, the weighting schemes do not seem to help at all in this situation.
To get an idea of the types of features signalling high and low trust predictions, we ran a differential language analysis (Schwartz et al., 2013) to identify the top 50, independently, most predictive features. Figure 3 shows word clouds of features both positively and negatively correlated with the 3-question trust score, limited to those passing a Benjamini-Hochberg false discovery rate of α = 0.01 (Benjamini and Hochberg, 1995). Many of the ngrams correspond with the definition of trustfulness, such as the pro-social words among the positive predictors (e.g. 'friends', 'family', 'thanks'). On the other hand, many curse words can be seen among the negative predictors.
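For illustration, the Benjamini-Hochberg filtering step over the feature-trust correlation p-values can be sketched as follows (a standalone implementation, not the authors' code):

```python
def benjamini_hochberg(pvals, alpha=0.01):
    """Return the indices of p-values passing the Benjamini-Hochberg FDR
    procedure: sort the m p-values, find the largest rank k such that
    p_(k) <= alpha * k / m, and keep the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])
```

Only feature-trust correlations whose p-values survive this filter would appear in the word clouds.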

Conclusion
We introduced the first language-based model for measuring trustfulness, and used it to study novel and useful aspects of the predictive modeling of user traits. First, we found that language use in social media can predict trustfulness about as accurately as other personality traits. Then, we found that, when fitting a language model to questionnaire scores, including more users who took a shorter questionnaire can lead to improvement over using fewer users who took a longer questionnaire. We also showed that the language model usually performs better for users with a higher total word count, with error flattening out around 1,000 words, and that when there are few users (i.e. < 1,000) it is worth lowering the minimum word count threshold to include more users for training. However, using a weighting scheme was not helpful.
Our scalable measure of trust enables future work to investigate interesting questions about trust, such as those involving large-scale or frequent assessment. For example, it may allow large-scale assessment of trait trustfulness across different patient populations or samples of clinicians. Also, if clients were to opt in to sharing their social media, therapists might use this model to detect drops in patient trust, which may help in understanding when one is more or less receptive. Trends over time may help signal interpersonal improvements or regressions, as well as negative interactions with others. It should be noted that while trust is thought of as a relatively stable personality aspect or trait, some research suggests it is malleable over time (Jones and George, 1998), so changes in trust over time could be another meaningful direction for future study. Finally, the present model may be helpful for the generation of trustful chat bots, such as virtual assistants or therapeutic aids.