Personality Driven Differences in Paraphrase Preference

Personality plays a decisive role in how people behave in different scenarios, including online social media. Researchers have used such data to study how personality can be predicted from language use. In this paper, we study phrase choice as a particular stylistic linguistic difference, as opposed to the mostly topical differences identified previously. Building on previous work on demographic preferences, we quantify differences in paraphrase choice from a massive Facebook data set with posts from over 115,000 users. We quantify the predictive power of phrase choice in user profiling and use phrase choice to study psycholinguistic hypotheses. This work is relevant to future applications that aim to personalize text generation to specific personality types.

For psychological traits of users, a key set of traits is represented by personality, with the Five Factor Model or the 'Big Five' being the most widely used model for representing personality. This posits the existence of five traits in which people vary: openness to experience, conscientiousness, extraversion, agreeableness and neuroticism (McCrae and John, 1992). Methods for user trait prediction can uncover sociological insight into user behaviour or implicit biases and also improve a range of applications in recommender systems, targeted marketing or in natural language processing where they can lead to improvements in tasks such as text classification (Hovy, 2015) or sentiment analysis (Volkova et al., 2013). While these methods achieve good predictive performance, they pose significant challenges to the anonymization of identity online.
However, stylistic rather than topical differences are needed in some applications. For example, (Mirkin et al., 2015) propose that the output text of machine translation systems should reproduce the traits of the author of the source text. In this case, topical information is fixed, and the trait information can be transmitted only using stylistic cues. Following the work of (Preoţiuc-Pietro et al., 2016b) who studied demographic traits, we study in this paper user personality differences in paraphrase choice -a specific type of stylistic difference. Paraphrases represent alternative ways to convey the same information (Barzilay, 2003), using either single words or short phrases. Table 1 presents a couple of motivating examples of two group of words and phrases which are all paraphrases of each other ordered by the frequency of use for each personality trait.
In this study, we measure for the first time the differences in paraphrase usage between personality types from a large social media data set in an attempt to obtain language differences isolated from topical influence. Our analysis measures similarities between personality traits, the predictive power of stylistic words and a number of psycholinguistic theories about word choice. The paraphrase scores for each of the five personality traits are available online. 1

Data
Our complete data set consists of approximately 15 million Facebook status updates posted by 115,312 users, representing the full MyPersonality data set . Participants volunteered to share their status updates as part of the MyPersonality application, providing informed consent for data collection. In the MyPersonality application they took a variety of questionnaires, including the International Personality Item Pool proxy for the NEO Personality Inventory Revised (NEO-PI-R) (McCrae and John, 1992;Costa and McCrae, 2008), based on which the five personality trait scores are computed for each user (ranging from 1 to 5).
We split our users into binary groups for each personality trait. In order to have non-overlapping groups, we selected the top 20% users as being high in one trait and the bottom 20% as low in that trait. Data set statistics are presented in Table 2. Our methodology requires a split of users into dichotomous groups in order to compute paraphrase preference. We acknowledge that this split represents a simplification of personality traits and of the subsequent personality prediction task, although this was also used in some previous research (Mairesse et al., 2007;Celli et al., 2014) and, due to the ordinal nature of the personality scores, is highly unlikely to qualitatively affect our results.

Quantifying Personality Differences
We use the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) as our source of paraphrases, owing to its very large size and quality. PPDB 2.0 (Pavlick et al., 2015b) contains 23.820.422 paraphrases derived from a large collection of bilingual texts by pivoting methods. The phrases part of paraphrases are up to three tokens in length (1-3 grams).
In PPDB 2.0, each paraphrase pair comes with predicted scores for the relation type between the two phrases ('Equivalence', 'Entailment', 'Exclusion', 'Other relation', 'Unrelated') obtained using a supervised regression model using lexical, distributional and other features (Pavlick et al., 2015a). While there is no inarguable definition of the paraphrase term (Androutsopoulos and Malakasiotis, 2010; Bhagat and Hovy, 2013), in this work we are most interested in the most restrictive type of relationship ('Equivalence') as described in (Pavlick et al., 2015a). We thus use paraphrase pairs that have an equivalence score of at least 0.2 (chosen based upon the inspection of the pairs), leaving us with 6.157.570 paraphrase pairs. Given a paraphrase pair, we use phrase occurrence statistics computed over our data set to measure the phrase choice difference over user attributes. For the rest of this paragraph, we exemplify with the trait of extraversion, but the computation is analogous for the other four traits.
To score how much a user group favors a phrase w, we compute the scores Extravert(w) and Introvert(w). These are computed by counting the number of times phrase w was used by a user divided by the total number of words of that used, then averaging across all users high or low extraversion respectively. For each phrase we then compute a score: Within a paraphrase pair (w 1 , w 2 ), the difference Extraversion(w 1 )−Extraversion(w 2 ) measures the   Table 1: Two example groups of phrases that are all paraphrases of each other. Words and phrases are ordered by frequency of use. The top words are more frequently used by users low in each personality trait, with words further down the list being more specific of users high in the respective personality trait. The number in brackets represents the score with which the word is related to each trait (described in Section 3).
stylistic distance between users high in extraversion compared to users low in extraversion. This method of computing stylistic distance is similar to the work of Pavlick and Nenkova (2015) who studied paraphrasing in the context of formality and complexity and to that of Preoţiuc-Pietro et al. (2016b) who looked at differences between gender, age and occupational class groups.
In a few experiments, we also use paraphrase clusters which are created by using the transitive closure of pairwise paraphrases, as the supervised model for scoring equivalence combined with our threshold leads to transitivity not holding in our list of pairs. Within these clusters, we subtract the mean phrase score to adjusts for topic prevalence and to lead to a score of 0 representing a point of alignment across all clusters. In total, we derive 785.226 paraphrase clusters (mean = 7.43 words, median = 4 words, st.dev = 11.06 words). Out of these, on average 171.788 clusters (mean = 5.20 words) across the five personality traits contain at least two words scored for phrase choice, as we remove words with low frequency in our data (a relative frequency of under 10 −5 in our data set).

Predicting Personality
We first test the predictive power of paraphrases in the prediction task of whether a user is high or low in each personality trait. We randomly select 90% of the users to build the scores for all phrases and keep 10% of users for evaluating prediction accuracy. We use the Naïve Bayes classifier to assign a score to each user. We use this classifier as this computes for each word the log probability of the word belonging to one class (similar to the measure we previously defined) and computes the dot product between this distribution and the user phrase frequency vector. We chose this algorithm over others to directly tests the viability of our metric. The prior class distribution is estimated based on the training data and we use Laplace smoothing.
To measure the influence of paraphrase choice, we compare the performance of the model using only phrases appearing in at least one paraphrase pair (a proxy for stylistic choice, 62.919 phrases), the rest of the phrases separately (a proxy for topical information, 54.197 phrses) as well as the combined set of phrases. The vocabulary consists of 117.117 phrases (1-3 grams) which have a relative frequency of over 10 −5 in our data set. Results on predicting personality for unseen users measured in accuracy are shown in Table 3.  Table 3: User attribute prediction results evaluated in accuracy. Using only paraphrases that capture more stylistic rather than topical differences between different personality trait groups, our method still shows good predictive power comparing to using all phrase (1-3 grams) features.
We notice that overall personality can be predicted with significant margins even when using a simple Naive Bayes approach without any feature selection. Both phrases part of paraphrase pairs and not part of paraphrase pairs significantly improve on the random baseline with one exception (Extraversion and paraphrases). However, the numbers are lower than in the case of user demographics (Preoţiuc-Pietro et al., 2016b), which is to be expected when predicting psychological traits (Schwartz et al., 2013;Rangel et al., 2015).
We highlight that in the case of openness to experience, the phrases that are part of paraphrase pairs obtain better prediction performance in accuracy than the other set of phrases. The latter perform better when predicting conscientiousness, extraversion and neuroticism and comparable in case of agreeableness. Combining all phrases consistently obtains the best results.

Trait Differences
A very revealing aspect of paraphrase choice for each trait is the order of preference within a para-phrase cluster, as exemplified in Table 1. To quantify this preference across all clusters, we compute the cluster rank similarity between all pairs of user traits. The average Kendall τ rank correlation coefficient across all clusters is presented in Table 4. As certain personality trait scores are correlated and some users might be part of multiple groups, we also show the correlations between the trait scores in Table 5. As the number of users is very large (>100.000), all correlations in Tables 4 and 5 are significant.
The results on paraphrase choice show a few distinctive patterns. In both paraphrase choice and actual personality scores, neuroticism is anticorrelated with all other four traits, albeit more strongly in case of personality scores. Openness to experience is weakly negatively correlated with all four traits in paraphrase choice, while it is overall weakly positively correlated with the other traits in personality scores. Paraphrase choice is positively correlated across the other three traits (conscientiousness, extraversion, agreeableness), similarly to actual personality scores and with comparable correlations numbers.
Overall, this analysis demonstrates that overall, stylistic paraphrase choice largely reflects user level differences with some variation in case of openness to experience.

Ope Con
Ext

Linguistic Hypotheses
We investigate a number of psycholinguistic hypotheses about language choice and style by using our paraphrase based method. We argue that word choice within a paraphrase pair excludes the topical influence that confounds studies using all words (Sarawgi et al., 2011) 6.1 Word Properties Using unigram paraphrases, we study if any user group is more likely to use a word based on the following properties: Word Length We compute the difference in word length in a paraphrase pair as a simple proxy for word complexity.

Number of Syllables
We compute the difference in the number of syllables in a paraphrase pair as another simple proxy for word complexity.
Word Rareness To measure word frequency, we use a reference corpus retrieved from the 10% sample of the Twitter stream between 2 January -28 February 2011 (∼ 400 million tweets), filtered for English using the Trendminer pipeline (Preoţiuc-Pietro et al., 2012). We measure which word from a pair is more frequently used overall by computing a ratio between the frequencies of the two words within a pair.

Perceived Happiness
We use the Hedonometer (Dodds et al., 2011(Dodds et al., , 2015 to obtain happiness ratings for single words. The Hedonometer consists of crowdsourced happiness ratings for 10,221 of the most frequent English words. The ratings range between 8.5 and 1.3 (µ = 5.37, σ = 1.08). Note these do not only infer the emotional polarity of words (e.g., 'happiness' is more positive than 'terror'), but also how words are perceived by the reader individually without text context (e.g., 'mommy' is perceived happier than 'mom'). We compare the user group preference with the difference in happiness ratings.
Affective Norms To compliment the happiness ratings, we use information about the affective norms of words. In the dimensional model of emotions, any particular emotion can be defined as a set of values on a number of different dimensions.
One of the most popular models consists of three dimensions (Mehrabian and Russell, 1974): Valence -pleasant vs. unpleasant; Arousal -excited vs. calm; Dominance -controlled vs. in-control.
We use a list of ∼14,000 words rated in all three affective norms introduced in (Warriner et al., 2013). For words rated in both perceived happiness and valence, the correlation is very high (r = .918).
Concreteness Concreteness evaluates the degree to which the concept denoted by a word refers to a perceptible entity (Brysbaert et al., 2014). Although the paraphrase pairs refer to the same entity, some words are perceived as more concrete (or conversely more abstract) than others. The dual-coding theory posits that humans process and represent verbal and non-verbal information in separate, related systems. According to this, both concrete and abstract words are represented in the verbal system, but only concrete words are represented in the non-verbal system. Thus, concrete words are more easily learned, remembered and processed than abstract words (Paivio, 2013). We use a list of 37,058 English words with ratings of concreteness on a scale from 5 (e.g., 'tiger' -5) to 1 (e.g., 'spirituality' -1.07) introduced in (Brysbaert et al., 2014).
Imageability The construct of imageability represents how easily a particular word elicits a mental picture of the word's referent (Toglia and Bat tig, 1978). Imagery is thought to be an important aspect of the non-verbal system in the dualcoding theory and is correlated with concreteness (r = .78) (Gilhooly and Logie, 1980). We use 6,000 ratings on the ease or difficulty with which words arouse mental images for mono-and disyllabic words (Cortese and Fugett, 2004;Schock et al., 2012), ranging from e.g., 1.2 -'an' to 7 -'blizzard'.
Sensory Experience Sensory experience ratings reflect the extent to which a word evokes a sensory and/or perceptual experience in the mind of the reader (Juhasz and Yap, 2013). In contrast to imageability which explicitly refers to visual and sound images and asks raters to attempt to build a mental image of the concept, the sensory experience ratings measures the ability for a word to evoke an actual sensation (taste, touch, sight, sound, or smell) that occurs when reading the word. Although sensory experience and imageability are correlated (r = .586) (Juhasz and Yap, 2013), the two variables independently predict unique variance in lexical-decision latencies (Juhasz et al., 2011). We use the ratings from (Juhasz and Yap, 2013) which consist of 5,000 word ratings (e.g., 1 -'those'; 3 - .163 * * −.002 −.060 * * −.032 * −.014 Table 6: Correlation coefficients between word property differences and word preference by users high in each personality trait across all paraphrase pairs -p < 0.05, two tailed t-test, significant after false discovery rate multi-comparison corrections: Benjamini-Hochberg ( * ), Bonferroni ( * * ).

Paraphrase Entropy
Additionally, we are interesting in identifying which personality groups prefer using a more diverse set of alternative phrases, rather than using a few idiosyncratic phrases. Using all paraphrase clusters (1-3 grams), we compute the average entropy over paraphrase cluster distributions. A higher entropy means the distribution is less peaked towards a specific word, thus showing higher variety in choice.

Results
We establish if a group of users prefers words within paraphrase pairs with one of the characteristics presented in the previous section using the following method. For each trait and paraphrase pair, we compute the stylistic difference between the words within a pair (see Section 3). Then, for each trait, we run a Pearson correlation between the vector of stylistic difference scores for each pair and the vector containing the differences in word characteristics (e.g. the difference between the number of syllables of the two words). For each word property, we only retain the paraphrase pairs where we can measure both words, which leads to different numbers of pairs (and hence difference significance thresholds) for each test. The Pearson correlation results are shown in Table 6. We observe there are several statistically significant differences in paraphrase choice between the user groups. Paraphrase entropy by personality trait groups are presented in Table 7.
The trait that leads to the largest number of significant correlations with phrase choice is openness to experience. Users high in openness prefer words which are longer and with more syllables. These patterns are consistent with the theory that open people are intellectually attuned, creative, and curious (McCrae and Costa Jr, 1997). Simultaneously, openness to experience was negatively related to concreteness, dominance, valence and happiness. This indicates that users who are high in openness are more likely to express themselves in indirect and abstract ways, and they are less likely to prefer explicitly happier words. Again, these are consistent with a more cerebral or artistic mode of communication. Word rareness is anti-correlated with high in openness. However, we noticed that word rareness captures in a large extent also misspellings and alternative spellings. In terms of entropy however, openness to experience generates by far the largest difference in group means for entropy. Those interested in novelty and new experiences may especially dislike phrasing the same concept in the same way over time when other options are available, prefer idiosyncratic words and may have larger vocabularies.
Conscientiousness, extraversion and agreeableness have similar correlations across all phrase choice traits. Users high in these three traits prefer words that are longer and have more syllables. However, for extraversion and agreeableness, ageof-acquisition results show that these groups tend not to choose words acquired later and entropy results show a more limited breadth in usage, both indicative of less complex word choice. Especially, introverts score higher in these choices, perhaps because introverts prefer solitary activities such as reading and may therefore have larger and more sophisticated vocabularies (Furnham, 1981).
All three traits prefer happier and more dominant words, which, at least for extraversion, is unsurprising as these qualities are part of the definition of the trait (Watson and Clark, 1997). Users high in agreeableness are also known to express higher positive valence and conscientious users tend to be more dominant.
Despite the opposite patterns in language use associated with these three traits and openness, these are positively correlated in the user population. Therefore, the two sets of correlations are not simply the same effect explained in two different ways.
Neuroticism exhibits the fewest correlations with phrase choice. Users high in this trait prefer words that are shorter, have fewer syllables and have a slightly lower entropy, which indicates a mild tendency for simpler, idionsyncratic words. Finally, users high the neuroticism prefer words that are higher in sensory experience, and to a lesser de-gree, that are more concrete. This underlines the preference of this group of users to use social media as a means of communicating about the immediate context.

Conclusions
We have studied phrase choice, a particular type of stylistic language difference, across the Big Five personality traits for the first time. We used a large data-driven paraphrase dictionary as our source of paraphrases in combination with statistics computed over large volumes of Facebook status updates. We have shown paraphrase words are, with one exception, predictive of the personality traits and that differences exist in phrase choices. Our analysis of several psycholinguistic word characteristics showed that personality correlates with many systematic word choices and these are intuitive and correspond to theories of personality.
Differences in paraphrase choice are likely to be useful in text-to-text generation and dialogues systems. Tailoring automatically generated text based on personality traits might be desirable in multiple scenarios, such as for tutoring or customer support. However, in most of these cases, the topic is fixed and personalization can be achieved only at a stylistic level. To this end, we make our scored paraphrase choices across personality traits publicly available.