Word Embeddings (Also) Encode Human Personality Stereotypes

Word representations trained on text reproduce human implicit bias related to gender, race and age. Methods have been developed to remove such bias. Here, we present results showing that human stereotypes exist even for much more nuanced judgments such as personality, for a variety of person identities beyond the typically legally protected attributes, and that these are similarly captured in word representations. Specifically, we collected human judgments about a person’s Big Five personality traits formed solely from information about the occupation, nationality or a common noun description of a hypothetical person. Analysis of the data reveals a large number of statistically significant stereotypes in people. We then demonstrate that the bias captured in lexical representations is statistically significantly correlated with the documented human bias. Our results, showing bias for a large set of person descriptors for such nuanced traits, put in doubt the feasibility of broadly and fairly applying debiasing methods and call for the development of new methods for auditing language technology systems and resources.


Introduction
Implicit association tests probe biases individuals may harbor by measuring the reaction times of people when asked to sort word stimuli with clearly positive/negative valence and words associated with racial groups or less morally relevant categories such as insects/flowers and musical instruments/weapons (Greenwald et al., 1998). Recent work has revealed that word representations trained on large text corpora reproduce the human preference for flowers and musical instruments but also, disturbingly, gender, race and age-related bias (Caliskan et al., 2017).
These findings pose a dilemma. Having systems learn that flowers/musical instruments are pleasant and insects/weapons unpleasant appears to be useful common sense knowledge that systems can leverage to better interact with people. Having racist, sexist and ageist systems, however, is highly undesirable, as these are integrated in broader technologies like machine translation, which can reinforce the stereotype. Stereotypes are highly problematic because even simply evoking them can trigger change in behavior (Duguid and Thomas-Hunt, 2015; Spencer et al., 2016).
Guided by these compelling arguments, many researchers have started looking for ways to debias word representations and language technologies. In response to the examples in the supplementary materials of Caliskan et al. (2017), showing that Google Translate translates 'doctor' as male and 'nurse' as female, Google has rolled out a new version of their systems for certain language pairs, in which both translation versions are displayed. Similarly, earlier work has zeroed in on the gender bias in word representations and has proposed methods for debiasing, which take a set of words to be debiased as an argument to the algorithm (Bolukbasi et al., 2016). Work further developing this line of analysis and debiasing has appeared in recent computational linguistics venues (Zhao et al., 2017, 2018; Rudinger et al., 2018). This line of work is in stark contrast with earlier work in the field, which treated human stereotypes encoded in text as common sense knowledge that could be helpful in automating tasks such as named entity tagging and coreference resolution (Bergsma and Lin, 2006; Ji and Lin, 2009).
In this complex context, we set out to study how broad stereotypes are, both in terms of groups they may affect and the subtlety of distinction involved in the stereotype. For this purpose, we turn to personality stereotypes evoked by a single descriptor of a person, such as nationality, profession and arbitrary words describing people. We verify that people hold stereotypes about personality and that the human stereotypes can be recovered fairly accurately from word representations. Given the wide variety of descriptors to which stereotypes apply, we argue that an approach different from classic debiasing approaches for dealing with the problem ought to be established. We discuss some of these thoughts and considerations in the concluding section of this paper.

Big Five Personality Traits
The Big 5 personality traits, OCEAN, are the most common framework for studying personality in psychology studies (John and Srivastava, 1999). In this framework, personality is described in five dimensions: openness to experience, conscientiousness, extroversion, agreeableness and neuroticism. One of the most compact instruments to assess personality in this scale is the Ten Item Personality Inventory (TIPI) (Gosling et al., 2003). TIPI defines the extreme ends of each personality dimension by two simple descriptions:
O: conventional/uncreative ↔ open to new experiences/complex
C: disorganized/careless ↔ dependable/self-disciplined
E: reserved/quiet ↔ extroverted/enthusiastic
A: critical/quarrelsome ↔ sympathetic/warm
N: calm/emotionally stable ↔ anxious/easily upset
OCEAN personality traits have been used in a number of computational linguistics studies, such as developing dialog systems whose generation components can be tuned to project specific personality (Mairesse and Walker, 2007), predicting perceived personality from social media posts (Celli et al., 2013), automatic personality detection from essays (Majumder et al., 2017) and predicting specific traits, such as neuroticism, strongly linked with risk for depression and anxiety (Resnik et al., 2013).

Human Stereotype Collection
We collected human personality stereotypes for 98 professions and 135 nationalities, recruiting participants on Amazon Mechanical Turk (footnote 4). The professions were drawn from the list of nouns that are children of the node 'person' in the WordNet Is-A hierarchy. The list is large, with over 2,300 entries overall. From this list, two of the authors selected 98 professions. Similarly, nationalities were drawn from the CIA fact book and narrowed down to 135 by two of the authors. We used the Ten Item Personality Inventory (TIPI) (Gosling et al., 2003) to elicit the participants' expectations about the personality of people with given nationalities or professions. Participants were given tasks consisting of ten nationalities/professions, to be judged for a single personality trait. The top of the page displayed the TIPI ends for the personality dimensions presented above. The participants were asked to rate where a person with the given profession/nationality would fall on a 7-point scale. The middle of the scale is interpreted as 'have no expectation/could be either', -3 corresponds to the negative end of the dimension, defined by the description on the left above, and 3 corresponds to the positive end of the dimension, defined by the description on the right. The order of the nationalities/professions was randomly assigned in each task. One of the ten professions/nationalities in the task was a repeat, used for quality control. Participants who gave different ratings for the repeated nationality/profession were excluded from the study, as were participants who gave the same answer for all ten nationalities/professions.
Only participants residing in the United States were given access to the task.
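The two exclusion rules described above can be sketched as a simple filter. This is a minimal illustration, not the authors' actual processing code: the function name `keep_participant`, the worker identifiers and the ratings are all hypothetical, and the sketch assumes the positions of the two presentations of the repeated item are known for each task.

```python
from typing import List


def keep_participant(ratings: List[int], repeat_a: int, repeat_b: int) -> bool:
    """Return True if one task submission passes both quality checks.

    ratings            -- the ten ratings on the -3..3 scale, in presentation order
    repeat_a, repeat_b -- positions of the two presentations of the repeated item
    """
    # Check 1: the repeated item must receive the same rating both times.
    consistent = ratings[repeat_a] == ratings[repeat_b]
    # Check 2: the participant must not give one identical answer throughout.
    varied = len(set(ratings)) > 1
    return consistent and varied


# Hypothetical submissions: (ratings, position of first repeat, position of second repeat).
submissions = {
    "worker_a": ([2, 1, 0, -1, 3, 2, 0, 1, -2, 2], 0, 9),  # consistent and varied
    "worker_b": ([2, 1, 0, -1, 3, 2, 0, 1, -2, 1], 0, 9),  # repeat rated 2, then 1
    "worker_c": ([0] * 10, 0, 9),                           # same answer everywhere
}

kept = [w for w, (r, a, b) in submissions.items() if keep_participant(r, a, b)]
print(kept)
```

Only the first hypothetical worker survives both checks; the second fails the repeat check and the third fails the variation check.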

Analysis of Human Bias
After excluding inconsistent participants, we had 30 judgments for the vast majority of nationalities and 25 judgments for the professions.
We use the Wilcoxon signed-rank test to determine if the mean of the human judgments for each of the five personality traits is different from zero at 95% confidence. We found that 92.5% of the nationalities had at least one statistically significant personality trait; about 40% had numerical values greater than 1 or less than -1 on the seven-point scale, indicating a high bias. Similarly, 98% of the professions had at least one statistically significant personality trait (footnote 5); about 94% had numerical values greater than 1 or less than -1. Often people, including the authors, expect bias to be negative, but most of the bias we observe is positive: certain groups were perceived to be agreeable, open to experiences, conscientious and not neurotic. These results can be seen in Table 1. The existence of national stereotypes (from members of the same nation) has been documented, and also shown not to correlate at all with actual self-reported or perceived personalities of the members of the culture (Terracciano et al., 2005). In our study, the nationality stereotypes are from Americans towards other cultures and are likely similarly unfounded.
Table 1 (column headers): Professions: mean > 0, mean < 0, mean > 1, mean < -1. Nationalities: mean > 0, mean < 0, mean > 1, mean < -1.
Table 2: Column 1 is the percentage of professions or nationalities with n out of 5 statistically significant personality traits, i.e., mean different from zero at 95% confidence using the Wilcoxon signed-rank test. Column 2 is the percentage of professions or nationalities with n out of 5 statistically significant personality traits and absolute value of the mean greater than or equal to 1, indicating high bias.
Many of the stereotypes we observe in our study are predictable: Australians and Swedish are ranked at the top positive end for openness; Japanese and Chinese are most conscientious; Americans are extroverts; Canadians and New Zealanders are rated as most agreeable. In professions, priests and accountants are perceived as least open; drug dealers as least conscientious; chemists and mathematicians as introverts; drug dealers and prosecutors as disagreeable; tour guides and pianists as least neurotic.
There were few professions/nationalities for which all five dimensions of personality were statistically significant. Australians, Finnish, New Zealanders, tour guides, designers, house decorators and art dealers have highly positive bias towards them. Judges and senators also have significant bias in all traits, but the direction varies across traits for them. Overall statistics are shown in Table 2.
Footnote 4: Data available at https://github.com/oagarwal/personality-bias
Footnote 5: We do not perform any adjustments for multiple comparisons. A number of these findings may be spurious, but the number of significant findings far exceeds the 5% of significant results expected due to statistical chance.
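The significance test used in this section can be sketched for a single group and trait as follows. This is an illustrative, dependency-free version that assumes the normal approximation with a tie correction and drops zero ratings; the paper does not state which variant or software was used, and library implementations may use an exact distribution for small samples. The function name and the ratings are hypothetical.

```python
import math
from typing import List


def wilcoxon_signed_rank_p(ratings: List[float]) -> float:
    """Two-sided p-value for H0: the ratings are symmetric around zero.

    Assumption: normal approximation with tie correction; zeros are dropped.
    """
    nonzero = [r for r in ratings if r != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    # Rank the absolute values, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    # Test statistic: sum of the ranks belonging to positive ratings.
    w_plus = sum(rank for rank, r in zip(ranks, nonzero) if r > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical 25 ratings on the -3..3 scale for one profession and one trait.
ratings = [2, 1, 3, 2, 1, 0, 2, 1, 1, 2, 3, 1, 2, 0, 1, 2, -1, 1, 2, 1, 3, 2, 1, 1, 2]
p_value = wilcoxon_signed_rank_p(ratings)
print(p_value < 0.05)
```

With 25 to 30 integer ratings per group, ties are frequent, which is why the sketch includes the tie correction in the variance.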

Personality Bias Prediction
In this section, we test the extent to which the stereotypes in the human data can be explained by co-occurrence statistics between the nationality/profession and descriptors related to the personality dimensions. Prior work (Bhatia, 2016) has shown that co-occurrence statistics can be used to predict human bias in judging the probability of occurrence of real-life events such as terrorist attacks.
In the prominent work on word representations and bias (Caliskan et al., 2017), human stereotypes were reconstructed by substituting human reaction times in sorting words with the cosine similarity between sets of words. In the original psychology studies, the word stimuli are drawn from prior studies which established that people consider certain words to be highly positive or negative. For example, some words with positive connotations used in the study include 'freedom, rainbow, miracle, laughter' and words with negative connotations include 'abuse, sickness, tragedy, ugly'.
We do not do any similar pre-screening of descriptors. The personality descriptors in our study come from a standard instrument developed for personality assessment (see Table 3). Predictions in our final evaluation are performed for a broad
We use off-the-shelf word representations to measure the (cosine) similarity between a list of personality descriptors and a target nationality or profession. We experimented with GloVe representations (Pennington et al., 2014) trained on Common crawl (6B tokens, 400K vocab, 300d) and symmetric pattern (SP) based representations (Schwartz et al., 2015). We used TIPI to collect human judgments, but these descriptors of personality are likely too short for the noisy automatic creation of personality stereotypes. For this reason, we use a larger inventory of personality trait descriptors, Goldberg's Big Five markers (Goldberg, 1992). It has about ten descriptors associated with each of the positive and negative dimensions of a personality trait, all shown in Table 3.
Different words and phrases are present in the two vector representations in our study. While multi-word expressions such as 'drug dealer' and 'movie star' are present in the SP embeddings, they are missing from the GloVe embeddings. Some other words such as 'guilt-ridden' and 'guilt-free' are present in GloVe embeddings but missing from the SP embeddings. Results for each representation are reported using all markers and person descriptors available in the representation.
Let t denote a target description of a person (e.g., doctor), pd be the set of positive Goldberg personality markers (e.g., energetic, extrovert) for a trait and nd be the set of negative Goldberg personality markers (e.g., reserved, introvert) for a trait. We first develop a baseline where the predicted bias score is the difference between the mean cosine similarity of the target description t with each of the positive markers for the trait and the mean cosine similarity of t with each of the negative markers for the trait:
$$\mathrm{score} = \frac{\sum_{p \in pd} \mathrm{sim}(t, p)}{|pd|} - \frac{\sum_{n \in nd} \mathrm{sim}(t, n)}{|nd|}$$
We build separate models for each of the five personality traits. Each of the models has descriptions of both nationalities and professions and we do not differentiate between the two.
Next, we use linear regression to predict the personality scores, using as features the cosine similarity of the target description of the person with each of the Goldberg personality markers (e.g., energetic, introvert) for the trait:
$$\mathrm{score} = \sum_{p \in pd} w_p \, \mathrm{sim}(t, p) + \sum_{n \in nd} w_n \, \mathrm{sim}(t, n)$$
where w_p and w_n are the weights learned by regression for each of the Goldberg personality markers.
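The unsupervised baseline above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the three-dimensional vectors are made up for the example (real GloVe or SP vectors would be loaded from file), the marker lists are tiny stand-ins for the Goldberg inventory, and `baseline_score` is a hypothetical name. Markers missing from the vocabulary are skipped, in the spirit of using all markers available in a representation.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def baseline_score(target: str, pos_markers: List[str], neg_markers: List[str],
                   emb: Dict[str, Sequence[float]]) -> float:
    """Mean similarity to positive markers minus mean similarity to negative markers."""
    t = emb[target]
    # Only markers present in the embedding vocabulary contribute.
    pos = [cosine(t, emb[m]) for m in pos_markers if m in emb]
    neg = [cosine(t, emb[m]) for m in neg_markers if m in emb]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Toy embeddings, invented for illustration only.
toy_emb = {
    "comedian": [0.9, 0.1, 0.2],
    "librarian": [0.1, 0.9, 0.3],
    "energetic": [1.0, 0.0, 0.1],
    "talkative": [0.8, 0.2, 0.0],
    "reserved": [0.0, 1.0, 0.1],
    "quiet": [0.1, 0.9, 0.2],
}
extroversion_pos = ["energetic", "talkative"]
extroversion_neg = ["reserved", "quiet"]

for person in ("comedian", "librarian"):
    print(person, round(baseline_score(person, extroversion_pos, extroversion_neg, toy_emb), 3))
```

The raw baseline score lives on the cosine scale rather than the -3 to 3 rating scale, which is fine for rank-based evaluation; the regression model instead learns one weight per marker so its output is on the rating scale.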
We use leave-one-out cross-validation because we have human judgments for just 233 descriptions of people. Finally, we calculate the Spearman correlation between the average human scores and the predictions on the n test points, one from each model in cross-validation.
Further, we test the model on new descriptions from WordNet. We randomly selected 140 descriptions and crowdsourced judgments about them in the same manner as the training data. The resulting correlations are shown in Table 4.
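The evaluation protocol described above (hold out one description, fit on the rest, predict the held-out one, then correlate all held-out predictions with the human averages) can be sketched as follows. To stay dependency-free, the sketch substitutes a one-feature least-squares line for the multi-feature regression used in the paper; the feature values and human averages are invented, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """One-feature least squares (stand-in for the multi-feature regression)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def leave_one_out_predictions(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Predict each point with a model fitted on all the other points."""
    predictions = []
    for i in range(len(xs)):
        train_x = list(xs[:i]) + list(xs[i + 1:])
        train_y = list(ys[:i]) + list(ys[i + 1:])
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    return predictions


# Invented data: one similarity-based feature and the average human rating per description.
feature = [0.10, 0.25, -0.05, 0.40, -0.20, 0.15]
human_mean = [0.8, 1.5, 0.1, 2.2, -0.9, 1.0]

loo_pred = leave_one_out_predictions(feature, human_mean)
rho = spearman(loo_pred, human_mean)
print(len(loo_pred), rho > 0)
```

Because Spearman correlation depends only on ranks, it rewards a model for ordering descriptions correctly even when the predicted magnitudes are off.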
On the leave-one-out results on the training data consisting of nationalities and professions, the regression model is clearly superior to the unsupervised baseline. On the test data, the best correlation for Conscientiousness and Agreeableness is achieved by the baseline with SP representations.
We also computed the class of bias for each of the predictions: positive bias, negative bias and no bias. The accuracy was 55-60% for each of the cases except neuroticism (42%). Both representations assigned the same bias class for 65%, 80%, 73%, 79% and 93% of the descriptions for the OCEAN traits respectively. No single word representation works consistently better.
All correlations are statistically significant and hold up well between the training and test data, even though the test data has much more varied descriptions of people. Notably, Openness and Conscientiousness are predicted most accurately, and for a number of personality dimensions the results on the heterogeneous test set are higher than for the training set of nationalities and professions.
Some examples of test descriptions and bias scores that stood out are shown in Table 5. People have a significant bias, which is predicted by the embedding-based classifier as well. The classifier (GloVe) predicted high bias, i.e., a score ≥ 1 or ≤ −1, for 21%, 23%, 14.5%, 11% and 2.5% of the 2,638 WordNet person descriptors for the OCEAN traits respectively.
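The high-bias criterion above (a score of at least 1 in absolute value) amounts to a simple count over predicted scores. The sketch below illustrates it on invented scores; `high_bias_fraction` is a hypothetical name, and the numbers are not from the paper.

```python
from typing import Sequence


def high_bias_fraction(scores: Sequence[float], threshold: float = 1.0) -> float:
    """Fraction of predicted scores that are >= threshold or <= -threshold."""
    return sum(1 for s in scores if abs(s) >= threshold) / len(scores)


# Invented predicted scores for one trait over eight person descriptors.
predicted = [1.2, -0.3, 0.9, -1.0, 0.0, 2.1, -1.7, 0.4]
print(high_bias_fraction(predicted))
```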

Discussion and conclusion
We introduced a corpus of human stereotypes of personality. We showed that off-the-shelf vector space representations can be leveraged to derive personality stereotypes from corpora. We used the model to make predictions on thousands of person descriptors, with larger samples. This list allows us to inspect a much larger scope of possible bias than smaller targeted categories. For example, in a much more controversial direction of work, our approach can be used to train a model that predicts sentiment valence, possibly starting with words from prior studies. Then we can, as we did in the work here, predict which other words may have similar bias, potentially recovering many more nuanced groups.
Our findings indicate that debiasing methods that need an explicit set of words to be debiased are unlikely to be effective in removing all stereotype-like data. Moreover, as has now been revealed, debiasing methods only mask the bias rather than fully remove its influence on downstream tasks like clustering and gendered prediction (Gonen and Goldberg, 2019).
One of the earliest papers reporting correlation between lexical co-occurrence and human implicit bias association tests has a somewhat more optimistic view (Lynott et al.). They provide examples in which people exhibit gender and racial implicit bias but, when asked to be thoughtful in performing a task, make decisions not aligned with that bias. This view aligns with the model of two systems of thinking: fast stereotypes that are highly inaccurate in many cases, and slow, deliberate thinking that overrides these stereotypes (Kahneman, 2011). It remains an open problem what the slow processing mechanisms should be for automated systems, but clearly developing such systems and the necessary benchmarks to test them would mark an important milestone in the development of language technology.