Toward a Better Story End: Collecting Human Evaluation with Reasons

Creativity is an essential element of human nature used for many activities, such as telling a story. Building on human creativity, researchers have attempted to teach a computer to generate stories automatically or to support this creative process. In this study, we undertake the task of story ending generation. This is a relatively new task, in which the last sentence of a given incomplete story is automatically generated. This is challenging because, in order to predict an appropriate ending, the generation method should comprehend the context of events. Despite the importance of this task, no clear evaluation metric has been established thus far; hence, it has remained an open problem. Therefore, we study the various elements involved in evaluating an automatic method for generating story endings. First, we introduce a baseline hierarchical sequence-to-sequence method for story ending generation. Then, we conduct a pairwise comparison against human-written endings, in which annotators choose the preferable ending. In addition to a quantitative evaluation, we conduct a qualitative evaluation by asking annotators to specify the reason for their choice. From the collected reasons, we discuss what elements the evaluation should focus on, so as to propose effective metrics for the task.


Introduction
Creativity is vital to human nature, and storytelling is among the most important representations of human creativity. Humans use stories for entertainment and practical purposes, such as teaching lessons and creating advertisements. Stories are deeply rooted in our lives.
In computer science, understanding how humans read and create a story, and imitating these activities with a computer, is a major challenge. Mostafazadeh et al. (2016) proposed the Story Cloze Test (SCT) as a reading comprehension task and released a large-scale corpus, ROCStories. SCT presents the first four sentences of a five-sentence story, with the last sentence excluded. A system must select, from two choices, the sentence that fills in the missing fifth sentence. One of the two options is the "right ending", i.e., the one that appropriately completes the story; the other is the "wrong ending".
Herein, we consider story ending generation (SEG) (Guan et al., 2019; Li et al., 2018; Zhao et al., 2018). This is a relatively new task inspired by SCT, and it is designed to be generation-oriented. In SEG, the last sentence of a given incomplete story is generated automatically. This is challenging because the system should comprehend the context to generate an appropriate ending.
Despite the importance of this task, no clear evaluation metric has been established thus far. To serve as a reference for future proposals of the evaluation metrics, we conduct human evaluations and study the various elements involved in evaluating an automatic SEG method.
The main contributions of this paper are:
• In order to show how well a baseline method performs and what drawbacks it has for SEG, we conducted a pairwise comparison against human-written right endings.
• Besides a quantitative evaluation, we conducted a qualitative evaluation by asking annotators to specify the reason for their choice. From the collected reasons, we explored the elements that the evaluation should focus on, so as to propose effective metrics for SEG.

Related Work
Automatic evaluation metrics that measure word matching are not effective in text generation, especially in dialog generation (Liu et al., 2016). Further, in story generation, it seems difficult to evaluate text generation methods with conventional automatic evaluation metrics. As SEG is a relatively new task, metrics for human evaluation have also not been established. Zhao et al. (2018) defined two criteria, Consistency and Readability, for human evaluation; for each criterion, human assessors rated endings on a scale of 0 to 5. Li et al. (2018) assigned one of four levels to each ending (Bad (0), Relevant (1), Good (2), and Perfect (3)). Three judgement criteria were provided to annotators: Grammar and Fluency, Context Relevance, and Logic Consistency. They also conducted a direct comparison of the story endings generated by their baseline and their proposed approach. Guan et al. (2019) defined two metrics, Grammar and Logicality, for human evaluation; for each metric, a score of 0, 1, or 2 was assigned.

[Figure 1: Given a story where the last sentence has been excluded (e.g., "Morgan enjoyed long walks on the beach. She and her boyfriend decided to go for a long walk. After walking for over a mile, something happened. Morgan decided to propose to her boyfriend.", with the excluded ending "Her boyfriend was upset he didn't propose to her first."), a method is required to generate an appropriate ending to complete the story (here, the generated ending is "She was so happy that she had a good time."). Our baseline method has two steps: a sentence encoder and a context encoder. The sentence encoder processes each sentence and generates the corresponding sentence embeddings. These sentence embeddings are input to the context encoder, which calculates a representation of the context. The recurrent neural network (RNN) decoder receives the context embedding and generates a sentence to complete the story.]
In order to measure the distance from the goal of "a system writing story endings like humans", it is useful to compare the generated endings directly with human-written endings. Therefore, we conducted a pairwise comparison against human-written "right endings". To show what elements of stories humans focus on, we also conducted a qualitative evaluation by asking annotators to specify the reasons for their choices.

Baseline Method for SEG
We define S = {s_1, s_2, ..., s_n} as a story consisting of n sentences. In SEG, S' = {s_1, s_2, ..., s_{n-1}} is given as input, and a method is required to generate an appropriate ending s_n. We refer to S' as the "context".
A hierarchical approach is useful for generating long text (Liu et al., 2018) and is also effective in story generation (Fan et al., 2018; Ravi et al., 2018). Using the sequence-to-sequence (Seq2seq) model (Sutskever et al., 2014) as a point of departure, we introduce a baseline method that handles the input text hierarchically. This follows conventional methods that use a hierarchical structure for document modeling (Li et al., 2015) and query suggestion (Sordoni et al., 2015).
To be more precise, we use a two-step encoder. The first encoder receives {s_1, s_2, ..., s_{n-1}} as word-level input and outputs the sentence embeddings {v_1, v_2, ..., v_{n-1}}. Then, the second encoder receives the sentence embeddings as sentence-level input and generates a distributed representation of the entire context. We name the first encoder the "sentence encoder" and the second the "context encoder", and refer to this method as "Hierarchical Seq2seq" (H-Seq2seq). An overview of the method is shown in Figure 1.
Sentence Encoder: As the sentence encoder, we apply the pre-trained "InferSent" model, a bidirectional long short-term memory (BiLSTM) network with max pooling, trained on a natural language inference task (Conneau et al., 2017). InferSent was devised as a supervised universal sentence embedding model and has demonstrated good performance on various tasks.
Context Encoder: Using the sentence embeddings obtained with the sentence encoder, we apply another encoding layer to obtain a context embedding v_context for the entire input sequence (the context). We use a gated recurrent unit (GRU) (Cho et al., 2014) for the context encoder so as to treat the sentences as a time series, followed by a batch normalization layer (Ioffe and Szegedy, 2015).
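The two-step encoding can be sketched as follows. This is a minimal numpy illustration rather than the actual implementation: max pooling over raw word vectors stands in for the pre-trained InferSent encoder, the context GRU is hand-rolled, and the batch normalization layer and decoder are omitted. All function names and shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode_sentence(word_vecs):
    # Stand-in for InferSent: max-pool over the word vectors of one sentence.
    # word_vecs: (num_words, d) -> sentence embedding of shape (d,)
    return word_vecs.max(axis=0)

def gru_step(h, x, W, U, b):
    # One GRU step; W, U stack the update/reset/candidate weights, shape (3d, d).
    d = h.shape[0]
    z = sigmoid(W[:d] @ x + U[:d] @ h + b[:d])                # update gate
    r = sigmoid(W[d:2*d] @ x + U[d:2*d] @ h + b[d:2*d])       # reset gate
    h_tilde = np.tanh(W[2*d:] @ x + U[2*d:] @ (r * h) + b[2*d:])
    return (1.0 - z) * h + z * h_tilde

def encode_context(sentences, W, U, b):
    # Run the context GRU over the sentence embeddings of s_1 .. s_{n-1}.
    d = W.shape[1]
    h = np.zeros(d)
    for sent in sentences:
        h = gru_step(h, encode_sentence(sent), W, U, b)
    return h  # the context embedding v_context, fed to the decoder
```

In the real model, the decoder would condition on the returned context embedding to generate the ending word by word.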
Then, we input v_context to the RNN decoder. Compared with tasks such as translation, current datasets for story generation are relatively small, and we believe that techniques for avoiding overfitting are especially effective in such a situation. We use "word dropout", which drops words from input sentences according to a Bernoulli distribution (Iyyer et al., 2015). By analogy with word dropout, we also introduce dropout at the sentence level: when obtaining sentence embeddings, sentence-level dropout randomly drops some elements according to a given probability ratio and scales the remaining elements.
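Sentence-level dropout as described can be sketched as follows. This is a minimal numpy version; the probability p is illustrative, and the 1/(1-p) scaling of survivors follows the standard inverted-dropout scheme implied by "scales the remaining elements".

```python
import numpy as np

def sentence_dropout(sent_embs, p=0.2, rng=None):
    # Inverted dropout applied to sentence embeddings: zero out each element
    # with probability p and scale the survivors by 1/(1-p), so the expected
    # value of the embedding is unchanged.
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(sent_embs.shape) >= p).astype(sent_embs.dtype)
    return sent_embs * mask / (1.0 - p)
```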

H-Seq2seq: As explained in Section 3, pre-trained InferSent was applied as the sentence encoder.

Seq2seq: As a particularly simple method, we used basic Seq2seq for comparison. To examine the strength of the hierarchical approach, non-hierarchical basic Seq2seq is useful: the series of input words is handled collectively, without considering sentence-level information.

Human Right Ending: We used the human-written "right ending" in SCT as the ground truth. The two candidates in SCT were written by a person who did not write the original story (Mostafazadeh et al., 2016). Therefore, we can consider the "right ending" as the answer when SEG is solved by humans.
Note that while H-Seq2seq uses pre-trained embeddings, Seq2seq does not: for Seq2seq, we randomly initialized the word embeddings to simplify its implementation. For a more accurate evaluation, the results with and without the pre-trained embeddings should be compared, and different parameters should also be examined. However, in story generation, it is difficult to evaluate methods with conventional automatic evaluation metrics, while conducting all evaluations with humans is unrealistic. Therefore, we focused on investigating how well a baseline model can solve SEG and on discussing how to conduct human evaluation. Although more sophisticated methods for SEG have already been proposed (Guan et al., 2019; Li et al., 2018; Zhao et al., 2018), they are beyond the scope of this study; we leave applying the evaluation method discussed in this paper to more advanced models as future work.

Dataset
Referring to the setting of the SCT competition in SemEval-2017 (Mostafazadeh et al., 2017), we used the "Spring 2016 release" and "Winter 2017 release" of ROCStories for training, and the "Spring 2016 release" validation and test sets of SCT for validation and testing (Table 1).

Quantitative Evaluation with MTurk
As a story is created on the premise that a human will read it, evaluation by human readers is considered to be the most accurate evaluation. We conducted human evaluation with help from Amazon Mechanical Turk (MTurk) workers. We evaluated the performance of the model depending on whether the generated ending correctly considers the context and properly completes the story. MTurk workers were given a four-sentence incomplete story and two options for an ending (5th sentence), and they were asked to indicate the better ending. We instructed workers that the given stories originally consisted of five sentences but the 5th sentence was lost, and that they were required to choose the 5th sentence to complete each story. The workers were given four choices: option A is more appropriate (A), option B is more appropriate (B), both options are equally appropriate (both A and B), and neither option is suitable (neither A nor B). For each pair of methods, we used 200 stories from the SCT test set for comparison. Five MTurk workers evaluated each story and its corresponding candidate endings. The most popular answer among the five workers was considered as the agreement among the workers. The results are shown in Table 2.
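Aggregating the five workers' answers into "the most popular answer" can be sketched as follows. How the original study broke ties is not stated, so the tie-breaking here (first occurrence among the tied labels) is an assumption.

```python
from collections import Counter

def aggregate_votes(votes):
    """Return the most popular answer among the workers' votes.

    Ties are broken by first occurrence in the vote list (an assumption;
    the paper does not specify a tie-breaking rule).
    """
    counts = Counter(votes)
    top = max(counts.values())
    for v in votes:  # preserve input order among tied labels
        if counts[v] == top:
            return v
```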

Qualitative Evaluation with MTurk
We conducted another experiment similar to that in Section 4.3, in which workers were additionally required to write the reason they chose their answer. We focused on comparing H-Seq2seq against human-written endings; examples are shown in Table 3. As in Table 2, the agreement among the five workers was counted: H-Seq2seq = 0, Human = 42, both = 5, neither = 3 (out of 50 stories in total). The collected reasons are publicly available.¹

Sentiment Analysis
Regarding SCT, when crowdsourced workers wrote the "right ending" and "wrong ending" without constraints, the "right ending" tended to be more positive (Sharma et al., 2018). Referring to this finding, we analyzed the endings in SCT, the endings in ROCStories, and 1,871 endings generated by H-Seq2seq. To calculate sentiment, we used the VADER sentiment analyzer (Hutto and Gilbert, 2014). The results are shown in Table 4. Focusing on the difference in sentiment between the "right ending" and "wrong ending" in SCT, Sharma et al. (2018) aimed to improve SCT as a reading comprehension task: to eliminate the bias, they applied constraints when having crowdsourced workers write new "right endings" and "wrong endings". In contrast, our goal is to clarify what sentiment bias exists when humans write stories freely (the SCT "right endings" and ROCStories), and how this bias is reproduced in SEG as a generation task. As story generation aims to imitate stories that humans write freely, our focus is not on setting constraints on human writing.
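The analysis above amounts to scoring each ending and comparing group means. The sketch below illustrates that computation; a tiny hand-made lexicon stands in for VADER (whose compound score also lies in [-1, 1]), so the words and scores here are purely illustrative.

```python
from statistics import mean

# Toy lexicon standing in for a real analyzer such as VADER (illustrative only).
TOY_LEXICON = {"happy": 0.8, "great": 0.6, "enjoyed": 0.5,
               "upset": -0.6, "terrible": -0.8}

def toy_compound(text):
    # Average the scores of known words; 0.0 if no lexicon word appears.
    hits = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return mean(hits) if hits else 0.0

def mean_sentiment(endings):
    # Mean sentiment over a set of endings, as compared across groups in Table 4.
    return mean(toy_compound(e) for e in endings)

right = ["She was so happy that she had a good time.",
         "They had a great time at the party."]
wrong = ["Her boyfriend was upset he didn't propose to her first."]
```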

Discussion
In the quantitative evaluation with MTurk, H-Seq2seq beat Seq2seq in 30 of the 200 stories. As this exceeds the number of stories in which Seq2seq beat H-Seq2seq (18 stories), we conclude that H-Seq2seq performs better than Seq2seq. Comparing H-Seq2seq with Human Right Ending, H-Seq2seq is far from generating endings like those written by humans. However, in 20 stories, the generated ending was evaluated as equal to or better than the human-written one.
To clarify the characteristics of the endings generated by H-Seq2seq, we analyzed the qualitative evaluation results. Table 3 shows that endings containing positive emotions are frequently generated. This tendency is also supported by the sentiment analysis (Table 4). Considering the human-written endings, the mean score of the endings from ROCStories is 0.119, which is significantly different from 0 (p < 0.05). Hence, when crowdsourced workers write short stories about everyday life, they tend to write stories with happy endings.
We then analyzed the 250 reasons for the choices (five answers for each of the 50 stories). Some examples are shown in Table 3. To identify important elements of the reasons, we tried topic modeling with latent Dirichlet allocation (LDA) (Blei et al., 2003). However, the elements that characterize the topics depended on the content of the story (such as eating or going somewhere), and it was not clear which elements of the reasons were important for the choice. Therefore, we instead used word frequency to analyze the 250 reasons. The 20 most frequent words are shown in Figure 2. Using this word frequency as a reference, we checked all 250 reasons to determine the important factors. "Logical" is a frequently used word; some reasons insisted on the importance of logicality. "Make" and "sense" are both frequent because the idiom "make sense" was commonly used. When a word unrelated to the context was generated, annotators evaluated the generated ending as bad, saying "no mention". As mentioned earlier, an ending is often emotionally biased toward being positive; accordingly, the reasons also included references to emotions, such as "happy". The second example in Table 3 shows that an immoral story seems to be disliked: even the human-written ending was considered inappropriate. Moreover, there were cases where a choice was made based on common sense, such as "dogs do love to play in the snow." There was also a "neither" case involving grammar: an annotator explained that "trying to buy a car" implies that the purchase was not successful, and therefore the human-written ending implying that a car was bought was considered inappropriate. Thus, it would be desirable for the evaluation metric to be designed considering that annotators are conscious of emotions, morals, and common sense, in addition to logic and grammar.

[Table 3: Examples of stories and collected reasons.
Example 1 — Context: "Howard is a senior. He feels a lot of bittersweet thoughts. He holds a senior party with all of his friends. They all enjoyed it and drank a lot." Human (A): "Howard liked socializing." H-Seq2seq (B): "He is happy that he has a good time." Collected reasons: Human — "he was sad about leaving his friends"; both — "Both make sense, even if B has a tad more detail."; both — "Either ending will work for the story. A might be a bit better."; both — "Both fit, he might have a bittersweet feeling but he would likely be happy at the end of the party, esp if they all enjoyed themselves."; both — "He wanted to interact with his friends, and they "all enjoyed it", so he was happy."
Example 2 — Context: "Lily and Pam were popular girls in school. They invited Joy to a diner after school. Joy was not popular and Lily and Pam knew it. They invited her just to bully her when they got there!" Human: "Joy had brought a gun and shot both the bullies in the face." H-Seq2seq: "They had a great time at the party." Collected reasons: "a terrible story which shows that bullying is risky; sometimes very risky."; neither — "Neither, they certainly didn't have a great time, and why would she shoot them?"]
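The word-frequency analysis of the collected reasons can be sketched as follows. The tokenization and the small stopword list are illustrative simplifications, not the exact preprocessing used for Figure 2.

```python
from collections import Counter
import re

# A small illustrative stopword list (the actual list used is not specified).
STOPWORDS = {"the", "a", "an", "is", "it", "to", "and", "of", "was", "i"}

def top_words(reasons, k=20):
    # Count content words across all collected reasons and return the k most
    # frequent, as in the Figure 2 analysis.
    words = []
    for reason in reasons:
        words += [w for w in re.findall(r"[a-z']+", reason.lower())
                  if w not in STOPWORDS]
    return Counter(words).most_common(k)
```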

Conclusion
We undertook an SEG task and examined how to make manual evaluation more effective. As a baseline method, we introduced a hierarchical sequence-to-sequence model; our focus is not on proposing a better model, but on discussing how to conduct human evaluation. Through quantitative and qualitative evaluations, we showed how well a baseline model performs and what drawbacks it has. To examine the qualities of the generated endings, we asked crowdsourced workers to provide reasons for their choices. This qualitative evaluation illustrates the characteristics of our baseline method. The analysis indicated that the evaluation metric should be designed considering that workers are conscious of emotions, morals, and common sense when they evaluate story endings. Although the amount of analyzed data is limited, we believe that the findings obtained by this human-reasoned evaluation will contribute to suggesting metrics for story generation in future research.