Exploring the Effect of Author and Reader Identity in Online Story Writing: The STORIESINTHEWILD Corpus

Current story writing and story editing systems rely on human judgments of story quality to evaluate performance, often ignoring the subjectivity in ratings. We analyze the effect of author and reader characteristics and of writing setup on the quality of stories in a short storytelling task. To study this effect, we create and release STORIESINTHEWILD, containing 1,630 stories collected on a volunteer-based crowdsourcing platform. Each story is rated by three different readers and comes paired with the author's and readers' age, gender, and personality. Our findings show significant effects of authors' and readers' identities, as well as writing setup, on story writing and ratings. Notably, compared to younger readers, readers age 45 and older rate stories as significantly less creative and less entertaining. Readers also prefer stories written all at once, rather than in chunks, finding them more coherent and creative. We also observe linguistic differences associated with authors' demographics (e.g., older authors wrote more vivid and emotional stories). Our findings suggest that reader and author demographics, as well as writing setup, should be accounted for in story writing evaluations.


Introduction
Reading or writing a story is an inherently subjective task that depends on the experiences and identity of the author, those of the reader, and the structure of the writing process itself (Morgan and Murray, 1935; Conway and Pleydell-Pearce, 2000; Clark et al., 2018). Despite this subjectivity, many natural language processing tasks treat human judgments of story quality as the gold standard for evaluating systems that generate or revise text. In creative applications, such as machine-in-the-loop story writing systems (Clark et al., 2018), it is important to understand sources of variation in judgments if we hope to have reliable, reproducible estimates of quality.
In this work, we investigate how an author's and reader's identity, as well as the overall writing setup, influence how stories are written and rated. We introduce and release STORIESINTHEWILD, containing 1,630 short stories written on a volunteer-based crowdsourcing platform, paired with author demographics and personality information. For each story, we obtain three sets of ratings from third-party evaluators, along with their demographics and personality.
Our findings confirm that author identity, reader identity, and writing setup affect story writing and rating in STORIESINTHEWILD. Notably, raters generally preferred stories written all at once rather than broken up into multiple stages. Raters age 45 and over rated stories as less creative and more confusing, and liked them less, compared to raters under age 45. Additionally, we find that, in our corpus, men were more likely than women to write about female characters and their social interactions, and that, compared to younger authors, older authors wrote more vivid and emotional stories. We also find evidence that reader and author personality, and their interaction, influence ratings of story creativity.
Our new dataset and results are first steps in analyzing how writing setup and author and reader traits can influence ratings of story quality, and suggest that these characteristics should be accounted for in human evaluations of story quality.

Background and Research Questions
To guide our study, we craft several research questions informed by existing literature on story writing and the relationship between author identity and language, outlined below.
RQ1 How are author gender, age, and personality traits associated with language variation in stories? A wealth of work has shown an association between an author's mental states and their language patterns. Variation in pronoun usage, topic choices, and narrative complexity correlates strongly with the author's age and gender (Nguyen et al., 2016) and moderately with their personality (Yarkoni, 2010). We aim to confirm these differences in a prompted storytelling setting, since most work has focused on self-narratives (e.g., diaries and social media posts; Pennebaker and Seagal, 1999; Hirsh and Peterson, 2009; Schwartz et al., 2013), with the exception of the essays studied by Pennebaker et al. (2014).
RQ2 How are rater gender, age, and personality traits associated with variation in story quality ratings? Ratings of stories are often used only to evaluate a story writing system's output (e.g., Fan et al., 2018; Yao et al., 2019) or to develop automatic evaluation metrics (e.g., Hashimoto et al., 2019; Purdy et al., 2018), ignoring the rater's identity. However, prior work has shown differences in crowdsourcing workers' behavior or annotations based on task framing (Levin et al., 2002; August et al., 2018; Sap et al., 2019) or the annotator's own identity or experiences (Breitfeller et al., 2019; Geva et al., 2019). We seek to confirm and characterize these differences in our story rating task. As a follow-up to RQ2, we also investigate the interaction between author and rater demographics on story ratings.
RQ3 Is writing setup associated with different ratings of story quality? Past work has investigated story writing as a turn-taking game (Clark et al., 2018) or as a distributed activity (Teevan et al., 2016) rather than a single event. We investigate whether writing setup (writing a story all at once or sentence-by-sentence) impacts overall story quality.

STORIESINTHEWILD Collection
We introduce and release STORIESINTHEWILD, containing 1,630 short stories (§3.1) paired with author demographics and personality information. We pair these stories with third-party ratings (§3.2) to evaluate the effect of writing setup and author identity on story writing.

Crowdsourcing Stories
To construct STORIESINTHEWILD, we first collected 1,630 written stories using a volunteer-based online study platform, LabintheWild (Reinecke and Gajos, 2015). Following best practices in recruiting on LabintheWild (August et al., 2018), we advertised our study as a way for participants to learn more about themselves by seeing how a simple pronoun-based classifier can predict their personality from their story writing (described in Appendix A.1). We first collected participants' demographics (age, gender, race, and education level). Then, participants chose the topic of their story by selecting one of five preview thumbnails, each representing one of five image strips that participants subsequently used as prompts for their story. We selected the images from the Visual Storytelling dataset of Flickr images (Huang et al., 2016) and a cartoon dataset (Iyyer et al., 2017). All images are shown in Figure 1 in Appendix A.
Writing setup After choosing a topic, all authors are presented with a five-image sequence corresponding to their chosen topic. We then randomly assign authors to one of two writing setups: (1) all at once or (2) sequential, both shown in Figure 2 in Appendix A. In (1), participants simply write a full 5-10 sentence story. In (2), participants write five sets of 1-2 sentences in an accordion of text boxes, each box corresponding to an image in the strip. The second writing setup is inspired by machine-in-the-loop turn-taking for story writing (Clark et al., 2018). Once a text box is submitted, participants can no longer edit that text.
In both setups, participants are instructed to tell a story rather than just describe the images, to make sure their story has a clear beginning, middle, and end, and to use correct punctuation. The task took around 9 minutes in both conditions.
Following the story writing, participants can optionally fill out the Ten Item Personality Inventory (TIPI; Gosling et al., 2003).

Author demographics Of the authors in STORIESINTHEWILD, 57% were women and 40% men (3% declined to state their gender), with an average age of 25±12 years and an average of 14.30±4.20 years of education, including primary school. Of the authors, 56% were white, 28% Asian, and 3% African-American (13% selected another ethnicity/race); we did not restrict participation to any specific country. 1,133 authors (70%) took the personality questionnaire.

Rating Stories
We create an Amazon Mechanical Turk task to obtain quality ratings for each of the stories collected in our previous task. For each story, we ask U.S.-based workers to rate it on six dimensions (listed in Table 1), using a 7-point Likert scale. These include five fine-grained quality dimensions (e.g., grammaticality, coherence) and an overall impression of the story ("I liked this story"). Each worker also optionally filled out their demographic information (age, race, gender, education level). Additionally, as a measure of intellect and creativity, workers filled out the four openness items from the Mini-IPIP Big Five personality scale (Donnellan et al., 2006).
Rater demographics 56% of our raters were women and 42% were men. 79% identified as white, 6% as African-American, and 6% as Asian. On average, their age was 40±12 years, and they had 15±3 years of education, including primary school.

Analyses
We investigate the effects of author and rater characteristics on stories' language and ratings. Unless otherwise specified, we only consider the male and female gender labels and use continuous representations of age and personality. We also explore the impact of the writing setup (whether authors wrote stories all at once or in sequential chunks) on story ratings. Note that our findings measure associations between aggregate categories (e.g., number of pronouns used, authors over age 45) and should not be interpreted as applying to specific individuals.

Author Identity (RQ1)
To analyze which types of words are associated with different demographic identities, we extract psychologically relevant linguistic categories from stories using the Linguistic Inquiry and Word Count lexicon (LIWC; Pennebaker et al., 2015). For each LIWC category, we fit a linear regression model on the z-scored features, controlling for writing setup and topic choice. We only report regression coefficients (βs) that are significant after Holm correction for multiple comparisons (Holm, 1979).
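To make the analysis concrete, the following is a minimal sketch of this per-category pipeline, assuming a pandas DataFrame with one row per story and pre-z-scored LIWC columns; it is shown for a single author trait (age), and all column names (e.g., author_age, writing_setup) are hypothetical, not our actual schema.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

def fit_liwc_regressions(stories: pd.DataFrame, liwc_cols: list[str]) -> pd.DataFrame:
    """Regress each z-scored LIWC category on an author trait,
    controlling for writing setup and topic choice."""
    betas, pvals = [], []
    for col in liwc_cols:
        model = smf.ols(
            f"{col} ~ author_age + C(author_gender)"
            " + C(writing_setup) + C(topic)",
            data=stories,
        ).fit()
        betas.append(model.params["author_age"])
        pvals.append(model.pvalues["author_age"])
    # Holm correction across all LIWC categories tested
    reject, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    return pd.DataFrame({"category": liwc_cols, "beta": betas,
                         "p_holm": p_holm, "significant": reject})
```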

Rater Identity (RQ2)
We examine the association between rater traits and their story ratings using linear regressions controlling for image type and writing setup (similar to §4.1). We also investigate interaction effects with author demographics, and show the full results of our regressions in Appendix B.3.
Gender, age For age, we first noticed that older workers rated stories noticeably more negatively than younger workers (e.g., r = −.08, p < .001 for both the like and entertaining ratings). Inspecting the data, we found this trend was most pronounced for raters age 45 or older, so we perform the analyses below using a binarized age variable, splitting raters into those age 45 or older (N = 921) and those younger than 45 (N = 1916).
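The sketch below illustrates this two-step analysis; the DataFrame and its toy values are stand-ins for the actual per-rating data, and the column names are hypothetical.

```python
import pandas as pd
from scipy.stats import pearsonr

# Toy stand-in for the per-rating table; the real data come from the corpus
ratings = pd.DataFrame({
    "rater_age": [23, 31, 47, 52, 38, 60],
    "like":      [6, 5, 4, 3, 6, 2],
})

# Continuous association between rater age and the "like" rating
r, p = pearsonr(ratings["rater_age"], ratings["like"])

# Binarize rater age at 45, where the negative trend was most pronounced
ratings["rater_45_plus"] = (ratings["rater_age"] >= 45).astype(int)
```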
Personality Openness to experience is often linked to creativity (McCrae, 1987), so we explore how ratings of creativity are associated with rater and author openness-to-experience scores. We find significant correlations between story ratings and rater openness to experience. Specifically, raters with higher openness to experience rated stories as generally more creative (|β| = 0.38, p < 0.05) and less confusing (|β| = 0.64, p < 0.001). Additionally, authors with higher openness scores wrote stories that were rated as more creative (|β| = 0.35, p < 0.1).

Author-Rater Identity Interactions
We also investigate story ratings through the lens of author and rater demographics to see if any shared traits across raters and authors were associated with rater preferences.
While both reader and writer openness to experience were associated with significantly higher ratings of creativity, the interaction between the two was negative (|β| = 0.50, p < 0.1): when both writer and reader openness were high, creativity ratings were lower than the two main effects alone would predict. No other interactions (e.g., age, gender) were significant in our sample.

Differences in writing setup (RQ3)
We quantify the differences in ratings between our two writing setups. We average the ratings for each story and report differences in Table 1 using Cohen's d. We find that stories written all at once are rated higher across all dimensions than stories written sequentially.
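A minimal sketch of this comparison, assuming a hypothetical per-rating DataFrame with story_id, writing_setup, and per-dimension rating columns (names illustrative):

```python
import numpy as np
import pandas as pd

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d between two groups, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

def setup_effect(ratings: pd.DataFrame, dim: str = "like") -> float:
    """Average the three ratings per story, then compare the two setups."""
    per_story = (ratings.groupby(["story_id", "writing_setup"])[dim]
                        .mean().reset_index())
    return cohens_d(
        per_story.loc[per_story["writing_setup"] == "all_at_once", dim].to_numpy(),
        per_story.loc[per_story["writing_setup"] == "sequential", dim].to_numpy(),
    )
```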
We also find that certain story topics were preferred over others (F = 26.17, p < 0.001). Specifically, stories written about the dog prompt were liked significantly more than others (p < 0.001), and those about the jail prompt significantly less (p < 0.001).

Conclusion
In this study, we find that differences in author characteristics are associated with linguistic differences in stories, and that rater characteristics are associated with differences in ratings. For authors, men were more likely than women to write about female characters and their social interactions, and compared to younger authors, older authors wrote more vivid and emotional stories. Raters preferred stories written all at once rather than broken up into multiple stages, and raters age 45 and older rated stories significantly lower than raters under age 45. We release our dataset, STORIESINTHEWILD, containing 1,630 stories with quality ratings and anonymized author and rater demographics.
Our results suggest that author and reader characteristics (e.g., demographics, personality) could explain variation in story writing evaluations. While prior work has shown that some study designs are more robust to this variation (e.g., ranking instead of rating; Yannakakis and Martínez, 2015), rater differences could still lead to variation in annotations. We recommend that evaluations include some means of collecting rater characteristics, such as a short demographics and personality questionnaire, in order to assess the influence of these variables.
Furthermore, future work could explore alternative ways of collecting author and reader characteristics during evaluations. While demographic questionnaires are common and short (e.g., collecting gender and age requires only two questions), full personality questionnaires are time-consuming, asking multiple questions per trait. Study designers could instead use reduced questionnaires, such as the Ten Item Personality Inventory (TIPI; Gosling et al., 2003). Alternatively, focusing on fewer, more highly trained raters who represent a diverse set of demographics and personalities could reduce the cost of collecting many rater demographics. Finally, future work should investigate whether annotator variance might be better captured by psychological factors related to reading (e.g., a propensity for liking long sentences or fiction) rather than stable traits such as personality or demographics.
Our finding that author personality and gender were associated with topic selection and story writing also suggests that studies could leverage participant behavior to predict personality characteristics. While these associations are not yet strong enough to provide robust measures of personality or demographics, future studies could explore how to leverage them to predict author characteristics in story writing or other writing evaluations, rather than relying on questionnaires.

A STORIESINTHEWILD Collection
We provide additional details about our data collection process, including the image prompts shown to authors (Figure 1) and the writing setups (Figure 2).

A.1 Motivating LabintheWild authors
Since LabintheWild is a volunteer-based crowdsourcing platform, we design our task so that participants are motivated by learning about their personality through story writing. The study was advertised on the front page of LabintheWild and posted on social media to recruit participants. Once a participant finishes their story, we compute their personality estimate (using the Five Factor Model) based on their story language. Specifically, we extract their pronoun usage using the pronoun categories in LIWC (Pennebaker et al., 2015) and predict personality scores using the coefficients from Schwartz et al. (2013). At the end of the task, we display their personality predictions along with a short description of the trait most present in their writing (i.e., the trait whose score has the highest magnitude).
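A hypothetical sketch of this feedback step is below; the pronoun category names follow the LIWC 2015 lexicon, but the coefficient values are placeholders, not the published weights from Schwartz et al. (2013).

```python
# Pronoun-based LIWC categories used as features (LIWC 2015 category names)
PRONOUN_CATS = ["i", "we", "you", "shehe", "they", "ipron"]

# trait -> (intercept, one weight per pronoun category); placeholder values
COEFS = {
    "openness":     (0.0, [-0.10, 0.02, 0.01, 0.03, 0.00, 0.05]),
    "extraversion": (0.0, [0.04, 0.06, 0.02, 0.01, -0.01, -0.02]),
    # ... one entry per Five Factor trait
}

def predict_traits(liwc_rates: dict) -> dict:
    """Linear personality estimates from a story's LIWC pronoun rates."""
    scores = {}
    for trait, (intercept, weights) in COEFS.items():
        scores[trait] = intercept + sum(
            w * liwc_rates.get(cat, 0.0)
            for w, cat in zip(weights, PRONOUN_CATS)
        )
    return scores

def most_present(scores: dict) -> str:
    """The trait displayed as 'most present': largest-magnitude score."""
    return max(scores, key=lambda trait: abs(scores[trait]))
```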
Optionally, participants could take a short personality questionnaire (TIPI; Gosling et al., 2003) before seeing their writing-based personality results.
Those who answered these questions could then see both their questionnaire-based and their writing-based personality estimates at the end of the task. The end of the task also debriefs participants, explaining the goal of the study and providing researcher contact information. The debriefing also includes disclaimers about the personality scores computed from story writing and reiterates that the results should not be used for clinical or diagnostic purposes.

B Analyses
We present further details of our demographic analyses, both between the author demographics and their language use (§B.2) and between the author and reader demographics (§B.3).

B.1 Author demographics and topic choice
To ensure the validity of our other analyses, we examine whether an author's identity was associated with their choice of one of the five topics (Figure 1). We find that only an author's agreeableness affected their choice of image prompt, with highly agreeable authors preferring the dog story (Cohen's d = 0.30, p < 0.001; Figure 1a) and less agreeable authors preferring the jail story (d = 0.41, p < 0.001; Figure 1b). Other demographic variables were comparable across image prompts (as measured by one-way ANOVAs).
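A minimal sketch of this check, assuming a hypothetical per-author DataFrame with a chosen-topic column and trait scores (column names illustrative):

```python
import pandas as pd
from scipy.stats import f_oneway

def topic_choice_anova(authors: pd.DataFrame, trait: str = "agreeableness"):
    """One-way ANOVA: does the trait differ across authors' chosen topics?"""
    groups = [group[trait].dropna().to_numpy()
              for _, group in authors.groupby("topic")]
    return f_oneway(*groups)  # returns (F statistic, p-value)
```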

B.2 Linguistic signal of author demographics
As described in §4.1, we first extract language categories from stories using the LIWC lexicon (Pennebaker et al., 2015). Then, we use a linear regression model to compute the association between each category and the author's demographics, using z-scored LIWC features for easier interpretation of the regression coefficients (βs). Our findings, outlined in Table 2, show that an author's identity and personality are somewhat associated with the types of stories they tell (controlling for the type of image prompt used). Men focused on describing characters (pronoun, social), specifically female characters, whereas women displayed more hierarchical, logical storytelling (Analytic; Pennebaker et al., 2014). Controlling for gender, we find that older authors wrote more vivid stories with more emotional tone (Tone, Exclam), more friendship words, and more visual descriptions (percept). In contrast, younger authors wrote in a more past-focused way.
Controlling for age and gender, we find effects of the author's agreeableness and conscientiousness on the types of language used in stories. We do not see significant effects for the extraversion, openness, or neuroticism scales, likely due to our smaller sample size of 1.6k authors (e.g., compared to the 75k users in Schwartz et al., 2013). As shown in Table 2, less conscientious authors wrote more negative stories, whereas more conscientious authors were more positive and focused on character motivations (drives, reward). Less agreeable authors used more swear words.

B.3 Rater and author interaction
As explained in §4.2, we analyze how rater and author traits relate to story ratings. We run linear regression models with story ratings as dependent variables and rater demographics and personality traits as independent variables. We include author demographics and interaction features in these models to see whether any shared traits across raters and authors were associated with rater preferences. As in all previous analyses, we include writing setup and image type in each model as control variables. We report p-values and β coefficients for each regression feature. Full details on the regression results are in Table 3.
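A minimal sketch of one such model, assuming a per-rating DataFrame with hypothetical column names; the `*` in the formula expands to both main effects plus their interaction term.

```python
import pandas as pd
import statsmodels.formula.api as smf

def openness_interaction(ratings: pd.DataFrame):
    """Regress creativity ratings on rater and author openness, their
    interaction, and the image-type / writing-setup controls."""
    model = smf.ols(
        "creative ~ rater_openness * author_openness"
        " + C(image_type) + C(writing_setup)",
        data=ratings,
    ).fit()
    term = "rater_openness:author_openness"  # patsy's name for the interaction
    return model.params[term], model.pvalues[term]
```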