UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

Despite the success of existing referenced metrics (e.g., BLEU and MoverScore), they correlate poorly with human judgments for open-ended text generation, including story and dialog generation, because of the notorious one-to-many issue: there are many plausible outputs for the same input, which may differ substantially in wording or semantics from the limited number of given references. To alleviate this issue, we propose UNION, a learnable unreferenced metric for evaluating open-ended story generation, which measures the quality of a generated story without any reference. Built on top of BERT, UNION is trained to distinguish human-written stories from negative samples and to recover the perturbation in negative stories. We propose an approach for constructing negative samples by mimicking the errors commonly observed in existing NLG models, including repeated plots, conflicting logic, and long-range incoherence. Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories, which correlates better with human judgments and is more generalizable than existing state-of-the-art metrics.


Introduction
Significant advances have been witnessed with the neural encoder-decoder paradigm (Sutskever et al., 2014), transformer-based architectures (Vaswani et al., 2017), and large-scale pretrained models (Devlin et al., 2019; Radford et al., 2019) in a wide array of natural language generation (NLG) tasks, including machine translation (Bahdanau et al., 2015), story generation (Fan et al., 2018; Guan et al., 2020), and many more. However, the research is increasingly hindered by the lack of effective evaluation metrics, particularly for open-ended text generation tasks such as story generation.

Table 1: Generated story samples given the same leading context from ROCStories (Mostafazadeh et al., 2016). B stands for BLEU (Papineni et al., 2002), M for MoverScore (Zhao et al., 2019), and U for UNION. A story can be reasonable even if it is dissimilar to the reference with a low BLEU score (B=0.14 in Sample 2), or unreasonable even if it has a high MoverScore (M=0.35 in Sample 3). In contrast, UNION is more reliable for evaluating story generation.
Since human evaluation is time-consuming, expensive, and difficult to reproduce, the community commonly uses automatic metrics for evaluation. Previous studies in conditional language generation tasks (e.g., machine translation) have developed several successful referenced metrics, which roughly quantify the lexical overlap (e.g., BLEU (Papineni et al., 2002)) or semantic similarity (e.g., MoverScore (Zhao et al., 2019)) between a generated sample and the reference. However, such referenced metrics correlate poorly with human judgments when evaluating open-ended text generation (Liu et al., 2016) due to the one-to-many nature of the task (Zhao et al., 2017), as illustrated in Table 1. Specifically, a generated sample can be reasonable if it is coherent with the given input and self-consistent within its own context, without necessarily being similar to the reference in wording or semantics, as shown in Samples 2 and 3.
To address the one-to-many issue, unreferenced metrics have been proposed to measure the quality of a generated sample without any reference. Kannan and Vinyals (2017) presented a learnable unreferenced metric which measures text quality by learning to distinguish human-written texts from generated samples. However, such a discriminator-based metric can easily over-fit to specific data (Garbacea et al., 2019) or suffer from model bias, since the quality of generated texts varies substantially across different NLG models. In fact, the generalization or robustness issue is critical for any learnable metric.
Therefore, we propose UNION, a learnable UNreferenced metrIc for evaluating Open-eNded story generation. UNION learns to distinguish human-written stories from negative samples that are auto-constructed by perturbing human-written stories. It is trained without depending on specific NLG models or any human annotation, making it more generalizable to distribution drift (Sellam et al., 2020) than the discriminator-based metric and metrics that learn from human preference (e.g., Adem (Lowe et al., 2017)). To capture commonly observed issues in generated stories, such as repeated plots, conflicting logic, and inter-sentence incoherence, we adopt four negative sampling techniques to construct negative samples: repetition, substitution, reordering, and negation alteration. In addition, we design an auxiliary reconstruction objective for UNION, which recovers the perturbation from a negative sample. This objective is shown to further improve the performance of UNION.
Our contributions are summarized as follows: I. We propose a learnable unreferenced metric UNION for evaluating open-ended story generation, which alleviates the one-to-many issue of referenced metrics. UNION does not depend on any output of NLG models or human annotation. II. Extensive experiments show that UNION correlates better with human judgments than state-of-the-art metrics, and is more generalizable to data drift (samples from different datasets) and quality drift (samples with different quality levels).

Related Work
Automatic evaluation is crucial for language generation tasks. We roughly divide existing metrics into referenced, unreferenced, and hybrid metrics, according to whether they rely on human-written references when calculating the metric score.

Referenced metrics usually measure how similar a generated text is to the reference text. Therefore, they are developed mainly for conditional language generation tasks such as machine translation and text summarization, where plausible outputs are largely limited within the semantics of the input. Commonly used referenced metrics include word-overlap based metrics (e.g., BLEU (Papineni et al., 2002), ROUGE (Lin, 2004)) and embedding based metrics (e.g., BertScore (Zhang et al., 2020), MoverScore (Zhao et al., 2019)). However, referenced metrics are reported to correlate poorly with human judgments in open-ended generation tasks, including open-domain dialog generation (Liu et al., 2016) and story generation, where the input contains only limited information and there are many plausible outputs for the same input, which can vary substantially in wording or semantics.

Unreferenced metrics measure the quality of a sample without any reference. The most classic unreferenced metric is perplexity, which measures how likely a sample is to be generated by a given language model trained on human-written texts. However, recent work has shown that natural language is rarely the most probable text (Holtzman et al., 2020), and that perplexity is inadequate for measuring quality (Hashimoto et al., 2019). Therefore, perplexity may not indicate the actual text quality well. The discriminator-based metric (Kannan and Vinyals, 2017) measures how easily a discriminator distinguishes generated samples from human-written texts. However, training such a discriminator can easily over-fit to a specific dataset, thereby leading to poor generalization and low correlation with human judgments (Garbacea et al., 2019). In addition to the above point-wise metrics, which score an individual sample, Semeniuta et al. (2019) proposed the Fréchet InferSent Distance (FID) to evaluate the model-level quality and diversity of generated samples by computing the Fréchet distance between the Gaussian distribution fitted to human text embeddings and that fitted to generated sample embeddings. However, in real data, the distribution of embeddings may be far from Gaussian. Recently, Zhou and Xu (2020) proposed to evaluate sample-level quality by comparing a pair of samples, and further adopted a skill rating system to evaluate model-level quality based on the sample-level pair-wise comparison. However, this approach can hardly evaluate a single sample on its own, without other samples for comparison.

Hybrid metrics combine referenced and unreferenced metrics. For open-domain dialog evaluation, Lowe et al. (2017) proposed the learnable metric Adem, which learns from the human-annotated score of a response given its post and ground truth. However, such a metric shows very poor generalization and is not robust to simple attacks such as word substitution or random word shuffling (Sai et al., 2019). Furthermore, RUBER and its variants (Tao et al., 2018; Ghazarian et al., 2019) evaluate a response by directly averaging a non-learnable referenced embedding similarity score and a learnable unreferenced post-response relatedness score, the latter learned through negative sampling without human annotations. However, merely measuring input-output relatedness is not sufficient for evaluating long text generation, as the intrinsic coherence and consistency within the generated text is a critical factor. Additionally, some metrics which learn from human preference achieve substantial results in conditional language generation, e.g., RUSE (Shimanaka et al., 2018) and BLEURT (Sellam et al., 2020). RUSE trains a regression model to score a reference-candidate pair using their sentence embeddings, and BLEURT uses multiple automatic metrics (e.g., BLEU) as supervision signals for pretraining on synthetic data before being fine-tuned on human judgments. However, BLEURT heavily relies on the quality of the automatic metrics used for pretraining, and there are as yet no such reliable metrics for open-ended text generation.

Methodology
UNION is expected to measure the overall quality of a generated story. In this section, we begin with common issues that can be observed in the output of NLG models. We then propose four negative sampling techniques based on these observations. Afterward, we introduce how UNION is trained and used for story evaluation. The overall paradigm of UNION is shown in Figure 1.

Figure 1: Overview of the UNION metric. UNION is trained to distinguish human-written stories from negative samples constructed by four negative sampling techniques, as well as to reconstruct the original human-written stories.

Empirical Observations
The key aspect of UNION is the construction of negative samples, which provides a range of lexical, syntactic, and semantic variations to simulate the errors made by NLG models. Therefore, we first present our empirical observations regarding the question "What makes a story unreasonable for NLG models?".
We analyzed 381 unreasonable stories generated by various NLG models, such as Plan&Write (Yao et al., 2019) and fine-tuned GPT-2 (Radford et al., 2019), based on ROCStories (Mostafazadeh et al., 2016), and summarized four major types of errors: repeated plots (repeating similar texts), poor coherence (unrelated keywords or events but a reasonable main plot), conflicting logic (wrong causal or temporal relationships), and chaotic scenes (difficult to understand, or containing multiple of the previous errors). To characterize the error types, we manually annotated all the unreasonable stories, hiring seven annotators for each story (see the full details in Section 4.2). In addition to the four error types, we also provided annotators with an option Others. We summarize the proportion of stories annotated with different error types in Table 2.
We can see that the four error types are the major issues of unreasonable stories, which provides a rationale for constructing negative samples to evaluate generated stories. Besides, all the Spearman correlations between every two error types are less than 0.15 (p-value > 0.01), suggesting that different error types correlate weakly with each other. Furthermore, the stories annotated with 1/2/3/4 errors constitute 23.36%/36.48%/34.65%/4.46% of the annotated stories, respectively. Most of the unreasonable stories have more than one error, which motivates us to apply multiple sampling techniques simultaneously to construct negative samples.

Constructing Negative Samples
We construct negative samples to cover as many of the aforementioned issues as possible. Since using machine-generated texts as negative samples easily leads to poor generalization (over-fitting to specific data or model bias (Garbacea et al., 2019)), we devise four negative sampling techniques to automatically construct a large number of negative samples from human-written stories, as follows:

Repetition: Generating repetitive text is commonly observed in many state-of-the-art NLG models (Fan et al., 2018; Radford et al., 2019), where the models focus repeatedly on what they have recently generated, particularly with maximum-likelihood based decoding strategies (Holtzman et al., 2020). To mimic this issue, we introduce lexical and sentence-level repetition using two policies: we either repeat an N-gram (N=1,2,3,4) in a random sentence, or randomly select a sentence to repeat and remove the following sentence to keep the number of sentences unchanged.

Substitution: The coherence of a story is mainly embodied in the relationships between keywords in the context (Clark et al., 2018; Guan et al., 2020). Therefore, we create incoherent samples by random keyword and sentence substitution, at the word level and sentence level respectively. For word-level substitution, we replace a random 15% of the keywords in a story with their corresponding antonyms (e.g., replacing "deny" with "confirm"), or otherwise with another random keyword of the same part-of-speech (POS), sampled according to mention frequency. We use the commonsense knowledge base ConceptNet (Speer and Havasi, 2012) for keyword recognition and antonym lookup. ConceptNet consists of commonsense triples (h, r, t), meaning that the head concept h has relation r with the tail concept t, e.g., (evaluation, IsA, judgment). We regard words that appear as heads or tails in ConceptNet as keywords. Given a keyword, we look up as its antonyms those keywords with which it has negated relations, including Antonym, NotDesires, NotCapableOf, and NotHasProperty. If no antonym is found for a keyword, we replace it with a random keyword of the same POS; we adopt NLTK for POS tagging. For sentence-level substitution, we randomly replace a sentence in a story with one sampled from the other stories in the dataset.

Reordering: Conflicting logic usually results from wrong causal relationships and temporal dependencies in the context. Therefore, we randomly reorder the sentences in a story to create negative stories with conflicting plots.

Negation Alteration: Negation words such as "not" are crucial for language generation because they may flip the semantics of a sentence, which is also an important cause of conflicting logic. We perform negation alteration by adding or removing negation words using rules for different types of verbs (see Appendix A).

Since there may be multiple error types in a generated story, we apply different sampling techniques simultaneously to construct a negative sample. We first sample the number (n) of techniques from {1,2,3,4} with the distribution {50%, 20%, 20%, 10%}. We then repeatedly sample a technique without replacement from {repetition, substitution, reordering, negation alteration} with the distribution {10%, 30%, 40%, 20%} until n techniques have been drawn. Last, we apply the sampled techniques to a human-written story to obtain a perturbed sample. A constructed example is shown in Table 3.
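The technique-sampling procedure above can be sketched as follows. This is a minimal illustration of the described scheme, not the authors' released code; the function name and data structures are our own, and the technique implementations themselves are elided.

```python
import random

# Techniques and their sampling weights as stated in the paper.
TECHNIQUES = ["repetition", "substitution", "reordering", "negation_alteration"]
TECH_WEIGHTS = [0.10, 0.30, 0.40, 0.20]
# Weights for applying 1, 2, 3, or 4 techniques to one story.
NUM_WEIGHTS = [0.50, 0.20, 0.20, 0.10]

def sample_techniques(rng: random.Random) -> list:
    """Sample how many techniques to apply, then which ones, without replacement."""
    n = rng.choices([1, 2, 3, 4], weights=NUM_WEIGHTS, k=1)[0]
    pool = list(TECHNIQUES)
    weights = list(TECH_WEIGHTS)
    chosen = []
    for _ in range(n):
        t = rng.choices(pool, weights=weights, k=1)[0]
        i = pool.index(t)
        pool.pop(i)      # remove the drawn technique so it is not drawn again
        weights.pop(i)
        chosen.append(t)
    return chosen
```

The sampled list would then be applied in turn to a human-written story to yield one perturbed negative sample.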

Leading Context
Ken was out jogging one morning.

Reference By Human
The weather was crisp and cool. Ken felt good and energetic. He decided to keep jogging longer than normal. Ken went several more miles out of his way.

Auto-Constructed Negative Sample
The weather was crisp and cool and cool. Ken felt bad and energetic. Ken DID NOT GO several more miles out of his way. He decided to keep jogging longer than normal.

Table 3: An example of negative sample construction. The repeated bigram is in italic, the substituted keyword is underlined, the reordered sentences are indicated in bold, and the altered negation words are CAPITALIZED.

Modeling
Let $\{(s_n, r_n, y_n)\}_{n=1}^{N}$ denote the training dataset of size $N$ for training the UNION metric, where $s_n$ is a human-written story or an auto-constructed negative sample, and $r_n$ is the corresponding original story of $s_n$. The label $y_n \in \{0, 1\}$ indicates whether $s_n$ is written by a human: if $s_n$ is a negative sample, $y_n = 0$; otherwise $y_n = 1$, and in this case $s_n$ is exactly the same as $r_n$.

For better story understanding, we leverage BERT (Devlin et al., 2019) to obtain contextualized representations of the input. Given a story $s_n = (s_1, s_2, \cdots, s_p)$ of length $p$ (each $s_i$ is a word), BERT outputs a sequence of contextualized vectors:

$$(\mathbf{v}_{\rm [CLS]}, \mathbf{v}_1, \cdots, \mathbf{v}_p, \mathbf{v}_{\rm [SEP]}) = {\rm BERT}({\rm [CLS]}, s_1, \cdots, s_p, {\rm [SEP]}),$$

where $\mathbf{v}_{\rm [CLS]}$ and $\mathbf{v}_{\rm [SEP]}$ are the representations of the special tokens [CLS] and [SEP], respectively. We add a task-specific linear layer on top of the [CLS] vector to predict the UNION score, i.e., the probability that $s_n$ is written by a human:

$$\hat{y}_n = \sigma(\mathbf{W}_c \mathbf{v}_{\rm [CLS]} + b_c),$$

where $\mathbf{W}_c$ and $b_c$ are trainable parameters and $\sigma$ denotes the sigmoid function. We use the cross-entropy loss to optimize the prediction objective:

$$\mathcal{L}_{\rm pred} = -\sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right].$$

In addition to the main prediction task, we devise an auxiliary reconstruction task which requires reconstructing the corresponding human-written story $r_n$ from the perturbed story $s_n$. To this end, we add an additional linear layer at the last layer of BERT, which takes as input the vectors output by the last transformer block and computes a probability distribution over the entire vocabulary through a softmax layer:

$$P(\hat{r}_i \mid s_n) = {\rm softmax}(\mathbf{W}_r \mathbf{v}_i + \mathbf{b}_r),$$

where $\hat{r}_i$ is the predicted $i$-th token, and $\mathbf{W}_r$ and $\mathbf{b}_r$ are the parameters of the additional linear layer. The model is then trained by minimizing the negative log-likelihood:

$$\mathcal{L}_{\rm recon} = -\sum_{i=1}^{p} \log P(r_i \mid s_n),$$

where $r_i$ is the $i$-th token of the human-written story $r_n$. The combined loss function $\mathcal{L}$ of the full model is computed as:

$$\mathcal{L} = \mathcal{L}_{\rm pred} + \lambda \mathcal{L}_{\rm recon},$$

where $\lambda$ is an adjustable hyperparameter.
We fine-tune all the parameters of UNION on the training dataset, including BERT and the two additional linear layers. In practical use, UNION measures the quality of a newly generated sample ŝ by taking ŝ as input and predicting the corresponding score ŷ.
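As a concrete illustration of the combined objective, the following minimal NumPy sketch computes a per-story loss combining the prediction cross-entropy with a λ-weighted reconstruction negative log-likelihood. The function name and the flat log-probability inputs are our own simplifications for illustration, not the paper's implementation.

```python
import numpy as np

def union_loss(y_true, y_pred, recon_logprobs, lam=0.1):
    """Combined loss: prediction cross-entropy + lam * reconstruction NLL.

    y_true:  1 if the story is human-written, else 0
    y_pred:  predicted probability from the [CLS] classifier head
    recon_logprobs: log P(r_i | s) for each token of the original story
    lam:     scale factor (the paper sets it to 0.1)
    """
    l_pred = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    l_recon = -np.sum(recon_logprobs)
    return l_pred + lam * l_recon
```

In the real model the log-probabilities would come from BERT's classifier head and token-level softmax outputs; here they are plain arrays so the arithmetic is easy to inspect.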

Experiment
We conducted extensive experiments to evaluate UNION on two story datasets. First, we compared UNION against existing text generation metrics. Then, we assessed its generalization under distribution drift, including dataset drift and quality drift. Last, we measured the effect of each negative sampling technique with ablation studies.

Baselines
We compared UNION with the following three kinds of baseline metrics. Referenced metrics: the sentence BLEU score (geometric mean of 1-gram up to 4-gram precision) (Papineni et al., 2002), which measures the lexical similarity between a candidate sample and its reference, and MoverScore (Zhao et al., 2019), which measures semantic similarity. Unreferenced metrics: Perplexity computed by the GPT-2 model (Radford et al., 2019), and a discriminative evaluator (DisScore) (Kannan and Vinyals, 2017) trained on top of BERT to distinguish generated samples from human-written stories. Hybrid metrics: RUBER-BERT (Ghazarian et al., 2019), which improves the original RUBER (Tao et al., 2018) with contextualized embeddings from BERT, and the supervised metric BLEURT (Sellam et al., 2020), which is fine-tuned on human judgments after pretraining on large-scale synthetic data with multiple automatic metrics as supervision signals.
In addition, we also reported the performance of the referenced and unreferenced versions of RUBER-BERT, denoted as RUBER_r-BERT and RUBER_u-BERT, respectively.
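For reference, the sentence-BLEU baseline (geometric mean of modified 1- to 4-gram precisions) can be sketched in a few lines. This simplified version omits the brevity penalty and smoothing that full BLEU implementations apply, so it is only an illustration of the n-gram overlap idea, not the exact baseline used in the experiments.

```python
from collections import Counter
from math import exp, log

def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Geometric mean of modified n-gram precisions (no brevity penalty)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(log(overlap / total))
    return exp(sum(log_precisions) / max_n)
```

As Table 1 suggests, such a score collapses to near zero for any reasonable story that merely uses different words than the single reference, which is exactly the one-to-many failure mode UNION avoids.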
We set the parameters of UNION following the uncased base version of Devlin et al. (2019): the transformer has 12 layers, 768-dimensional hidden states, and 12 attention heads. We used a batch size of 10 and a learning rate of 5e-5. The scale factor λ is set to 0.1. We directly used publicly available pretrained parameters of BERT and GPT-2 (base version) for all the baselines.

Data Preparation
We used two datasets for evaluation: ROCStories (ROC for short) (Mostafazadeh et al., 2016) and WritingPrompts (WP) (Fan et al., 2018). The ROC dataset contains 98,161 five-sentence human-written stories, with an average length of 49.4 words. To achieve better generalization performance, we followed Guan et al. (2020) in delexicalizing the stories by masking all male/female/unknown names with placeholders. The WP dataset consists of 303,358 stories paired with writing prompts collected from an online forum. The average length of the prompt/story is 28.4/734.5 words respectively, much longer than those in ROC. Since it is still challenging for state-of-the-art NLG models to maintain a reasonable plot throughout a long story, and hard to obtain acceptable annotation agreement in manual evaluation of long stories, we retained about 200 words (with correct sentence boundaries) from the start of each WP story and truncated the rest for subsequent experiments.
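Truncating to roughly 200 words while preserving sentence boundaries can be done as in the sketch below. The paper does not specify its exact procedure, so the regex-based sentence splitting and the cut-off rule here are our own illustrative assumptions.

```python
import re

def truncate_story(text: str, max_words: int = 200) -> str:
    """Keep roughly the first `max_words` words, cutting only at sentence ends."""
    # Naive sentence split on ., !, ? followed by whitespace (an assumption;
    # a real pipeline would use a proper sentence tokenizer).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    kept, count = [], 0
    for sent in sentences:
        n = len(sent.split())
        if kept and count + n > max_words:
            break  # stop before exceeding the budget, at a sentence boundary
        kept.append(sent)
        count += n
    return " ".join(kept)
```

The first sentence is always kept even if it alone exceeds the budget, so the output is never empty for non-empty input.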
We randomly selected 90%/5%/5% of the stories from both datasets for training/validation/test of UNION and the learnable baseline metrics, and created the evaluation set for all the metrics by generating stories conditioned on the test sets with state-of-the-art story generation models: the fusion convolutional seq2seq model (Fan et al., 2018), Plan&Write (Yao et al., 2019), fine-tuned GPT-2 (Radford et al., 2019), and knowledge-enhanced GPT-2 (Guan et al., 2020).
The data statistics are shown in Table 4. The number of negative samples for learning the metrics, when necessary, is the same as that of human-written stories on each dataset. Specifically, we created negative samples for DisScore by generating stories with the above NLG models. For RUBER_u-BERT, a given leading context is appended with a randomly sampled continuation. All the stories in the evaluation set are manually labeled. In addition, we annotated another 400 stories in ROC and 200 in WP for training BLEURT. Seven annotators were hired to judge the quality of each story with a binary score (1 for a reasonable story, and 0 otherwise). Furthermore, we asked annotators to label the error type of a story if it was labeled as unreasonable: repeated plots, poor coherence, conflicting logic, chaotic scenes, or others. We resorted to Amazon Mechanical Turk (AMT) for annotation, and the average score of the seven annotators is treated as the final score. We provide the full annotation instructions in the supplementary file.

Correlation Results
Correlation analysis has been widely used to evaluate automatic metrics for language generation (Tao et al., 2018; Sellam et al., 2020). We employed UNION and the other metrics to score the collected samples, and then calculated the Pearson (r), Spearman (ρ), and Kendall (τ) correlation coefficients between metric scores and human judgments. Pearson's r estimates linear correlation, while Spearman's ρ and Kendall's τ estimate monotonic correlation; τ is usually less sensitive to abnormal values than ρ. We used the standard statistical package stats in SciPy for correlation calculation and significance tests.
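The three coefficients can be computed with the SciPy functions mentioned above; the score lists below are made-up toy values for illustration, not data from the paper.

```python
from scipy import stats

# Hypothetical human judgments (averages of 7 binary scores) and metric scores.
human = [0.0, 1 / 7, 3 / 7, 5 / 7, 1.0]
metric = [0.05, 0.20, 0.55, 0.60, 0.90]

r, r_pvalue = stats.pearsonr(human, metric)        # linear correlation
rho, rho_pvalue = stats.spearmanr(human, metric)   # rank (monotonic) correlation
tau, tau_pvalue = stats.kendalltau(human, metric)  # rank correlation via pairs
```

Each call also returns a p-value for the significance test, which is how the significance results in Table 5 would be obtained.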
As summarized in Table 5, the referenced metrics correlate worse with human judgments, particularly BLEU, which is based on lexical similarity. Measuring semantic similarity instead (MoverScore, RUBER_r-BERT) improves the correlation but is still limited, indicating that referenced metrics are not competitive for evaluating open-ended language generation. Perplexity is ineffective on WP because the generated stories in that dataset are much longer and hence suffer from more serious repetition errors than those in ROC, which easily results in low perplexity (i.e., a high negated-perplexity score) (Holtzman et al., 2020) but poor human judgment scores. Furthermore, UNION outperforms the other baselines, including the supervised metric BLEURT, by a large margin, which also demonstrates the advantage of unreferenced metrics. Besides, removing the reconstruction training objective (-Recon) leads to a remarkably worse correlation, indicating that the auxiliary task further improves the performance of UNION.

Generalization to Dataset and Quality Drift
It is extremely important for learnable metrics to deal with dataset drift and quality drift (Sellam et al., 2020). To assess generalization to dataset drift, we first trained the learnable metrics on ROC and then directly used them to evaluate generated stories from WP, and vice versa. In this setting, UNION still achieves better correlation with human judgments than the other learnable metrics. Moreover, our method of constructing negative samples is generalizable across the two datasets.

To assess the generalization of UNION to quality drift, we created biased test sets from ROC by sampling stories of different quality levels with different probabilities. Specifically, the annotation score of each story ranges from 0 to 1 (i.e., 0, 1/7, 2/7, ..., 1), since there are seven annotators for each sample. We then created 8 biased sets, indexed from 1 to 8 with a variable I. For the I-th set, we sampled the stories whose annotation score is k/7 with a probability depending on I, such that sets with smaller I contain more low-quality stories and sets with larger I contain more high-quality ones. We then computed the Pearson correlation of different metrics with human judgments on the 8 sets. Results in Figure 2 (right) show that: I. UNION has a higher correlation than the other metrics on all the biased sets. II. UNION is more reliable and robust than the other metrics, with much less variance. For instance, MoverScore performs much better on Set #1 (with more low-quality stories) than on Set #8 (with more high-quality stories). Interestingly, Perplexity performs much better on high-quality sets than on low-quality ones, because high-quality stories are closer to the human-written stories from which a language model learns. III. The ablated UNION without the reconstruction objective has a lower correlation and larger variance, indicating that the auxiliary task improves both the discriminative and generalization ability.

Ablation Studies
To understand the effect of each negative sampling technique, we conducted ablation tests on the ROC dataset. Each time, we ablated one technique for constructing negative samples, re-trained UNION on the resulting data, and evaluated it on five evaluation sets: the set of all 400 samples, and four other sets, each containing 19 reasonable samples together with the unreasonable samples of one error type. The error type of a story is decided if at least three of the seven annotators annotate the same error type.
Table 7 shows the Pearson correlation results. UNION is remarkably better than its ablated versions on the all-sample set, indicating the necessity of all four techniques for constructing negative samples. Reordering seems to be the most important technique, which agrees with our observation that conflicting logic is the major issue in existing story generation models. Furthermore, as expected, the correlation drops remarkably on the evaluation set of a given error type when the corresponding negative sampling technique is removed. Interestingly, it is easier for UNION to evaluate repetitive/chaotic stories, which seem to be the easier cases in story generation.
Conclusion

We present UNION, an unreferenced metric for evaluating open-ended story generation. UNION is trained to distinguish human-written stories from auto-constructed negative samples and to recover the perturbation in negative samples. Extensive experiments show that UNION outperforms state-of-the-art metrics in terms of correlation with human judgments on two story datasets, and is more robust to dataset drift and quality drift. The results also show the effectiveness of the proposed four negative sampling techniques. As future work, we will explore similar ideas for designing unreferenced metrics for dialog generation.

A Negation Alteration Rules
We designed elaborate rules for negation alteration.
The transformation rules from affirmative sentences to negative ones are shown in Table 8. In reverse, from negative to affirmative, we removed the negation words ("not" or "n't") and altered the corresponding forms of the verbs. Although there are other words with negative meanings (e.g., "nobody"), they can be altered by another negative sampling technique: substitution with antonyms (e.g., replacing "nobody" with "somebody"). Therefore, we did not process these words when performing negation alteration.
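A toy sketch of rule-based negation alteration is given below. It covers only the auxiliary-verb case and a naive "n't" removal; the auxiliary list, do-support handling, and verb-form changes in the paper's Table 8 rules are richer than this, so everything here should be read as an illustrative assumption.

```python
# Auxiliaries after which "not" can be inserted directly (an assumed subset).
AUX = {"is", "are", "was", "were", "am", "be", "do", "does", "did",
       "have", "has", "had", "can", "could", "will", "would", "should", "may", "might"}

def add_negation(sentence: str) -> str:
    """Insert "not" after the first auxiliary verb, if one is present."""
    words = sentence.split()
    for i, w in enumerate(words):
        if w.lower() in AUX:
            return " ".join(words[:i + 1] + ["not"] + words[i + 1:])
    # No auxiliary found; a full system would add do-support ("went" -> "did not go").
    return sentence

def remove_negation(sentence: str) -> str:
    """Drop "not" and strip "n't" contractions (verb forms are left unchanged)."""
    words = [w for w in sentence.split() if w.lower() != "not"]
    return " ".join(w[:-3] if w.lower().endswith("n't") else w for w in words)
```

Note that `remove_negation("She didn't go")` yields "She did go" rather than "She went"; restoring the correct verb form is exactly the part the paper's verb-type rules handle.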

B Annotation Instruction
We show a screenshot of the annotation interface on AMT for a generated story given a leading context from ROCStories in Figure 4. The annotation instructions for WritingPrompts are similar.

C Annotation Results
We averaged the scores of the seven annotators as the final score for each story. Therefore, the annotation score ranges from 0 to 1 (i.e., 0, 1/7, 2/7, ..., 1). The distribution of stories over different scores is shown in Figure 3. Besides, we show 8 typical samples, one for each score, in Table 9.

D Reconstruction performance
Besides the prediction objective, we also trained UNION with an auxiliary reconstruction task, which recovers the perturbation from a negative sample. During testing, we compute the Spearman correlation between human judgments and UNION's editing behavior. We measure the editing behavior by labeling 1 if UNION edits the input story, and 0 if UNION outputs the same story unchanged. The correlation is 0.1990 (p-value < 0.01) on the whole test set of ROCStories, and 0.3442/0.1652/0.1623/0.2943 on the evaluation sets containing repetitive/incoherent/conflicting/chaotic stories (the same setting as Table 7, with each set mixed with reasonable stories), respectively. The results show that it is easier to recognize repetitive/chaotic stories, which agrees with the results in Table 7.
As for the editing output, although the key motivation of the reconstruction task is to provide more specific supervision signals for recognizing errors, UNION can generate meaningful editing results for unreasonable stories. We observe that UNION can correct lexical errors. For example, given the story "[FEMALE] worked real hard", UNION changed "real" to "really". However, since UNION adopts a non-autoregressive generative framework, it is difficult to generate a grammatical story if the input has sentence-level errors. Still, UNION can accurately recognize such errors. For example, given the repetitive story "we had a great time.we had a great time.", it generated "we had a great time.we ..". In the future, we plan to improve the design by aligning the input and output tokens and auto-tagging editing operations during training with the reconstruction task.

E Case Study
We present several samples based on ROCStories and the corresponding judgments of different metrics in Table 10. We can see that it is difficult for the baseline metrics to recognize the possible issues in stories; they rate the typical unreasonable stories (S2-S5) even higher than the reasonable one (S1). In comparison, UNION judges the quality of a story more accurately regardless of whether it is similar to the reference, suggesting that UNION can alleviate the one-to-many issue more effectively than referenced metrics (e.g., MoverScore). For instance, although S2 maintains a barely reasonable plot throughout the story except for a repetitive sentence, annotators still give it a zero score because such repetition errors do not appear in human-written stories. UNION successfully recognizes the issue thanks to the proposed negative sampling techniques, which mimic the errors commonly observed in NLG models. Therefore, UNION is more reliable for evaluating open-ended story generation.

F Error Analysis
Although UNION outperforms the state-of-the-art metrics, it should be noted that the correlation with human judgments is still at a low level. In Table 11, we present some typical cases where generated stories are misjudged by UNION. Firstly, although the proposed perturbation techniques provide many lexical and syntactic variations, it is still hard to recognize some errors such as semantic repetition and emotional conflict (S6-S9). Secondly, we observed that UNION may misjudge some reasonable stories (e.g., S10). This could be because some perturbed stories remain reasonable. For example, exchanging the order of two sentences without a specific temporal relation (e.g., "he had to go through a lot of training" and "he took a first responder's course") does not break the story's coherence. Training with such noisy samples may make UNION misjudge some reasonable stories. Therefore, as future work, it is worth exploring more perturbation techniques for negative sample construction to reduce noise and cover more error types that UNION fails to recognize. Besides, it is necessary to introduce external knowledge to help judge the logic of stories.

Task Description
Each story contains five sentences. For each story, we will feed the first sentence into a generative system, and the following sentences will be generated by the system. The requirement for this manual evaluation is to judge the overall quality of the story, especially in terms of its logicality.

Evaluation Criterion
In the process of evaluation, you need to carefully read the whole story, including the first sentence and the generated sentences, and annotate whether the story is logically reasonable (and the error type if unreasonable) in terms of its coherence with the given beginning and the inter-sentence causal and temporal dependencies. In this process, you may encounter sentences that are not completely grammatical. Please make a logical evaluation based on the main part of the sentence (such as some keywords) and your intuitive impression.
If the story is unreasonable, the error types roughly contain: repeated plots (repeating similar text), bad coherence (unrelated entities or events but a reasonable main plot), conflicting logic (wrong causal or temporal relationships), and chaotic scenes (difficult to understand, or with multiple of the previous errors).
Here are several examples of stories which are logically unreasonable:

1. ... i was on my way to a party to a party ...
Annotation: Unreasonable (Repeated Plots), word-level repetition of "to a party"

2. ... i was on my way to a party . i 'd gotten out of my seat . and i was on my way to a party ...
Annotation: Unreasonable (Repeated Plots), sentence-level repetition of "i was on my way to a party"

3. [MALE] felt he was getting sick . he had to go to an emergency room . it was his first major surgery . he had a terrible stomach ache . he was nervous about a test in an hour .
Annotation: Unreasonable (Bad Coherence), "test" is unrelated to the context

4. i was riding my bike to a park . i stopped into the parking lot . i saw a man with a bike . i asked him if he was on a date with him . he agreed to the date and we went on a date .

5. [FEMALE] one day decided to visit Germany . she couldn't afford to go though , not without help . so she got to work , trying to raise the money . [FEMALE] raised half the money herself and asked for her parents help . she was excited to get to go home and have a great time .
Annotation: Unreasonable (Conflicting Logic), "go home" is conflicting with "visit Germany"

6. [FEMALE] swept and mopped the or . she put her clothes in the washing machine . she was ready to go to bed . when she was done , she washed the clothes . she went to bed .
Annotation: Unreasonable (Conflicting Logic), "when she was done" is conflicting with "she washed the clothes"

7. [MALE] was on thin ice with his job . he had a friend over to help him . [MALE] was able to hold his breath the entire time . he was so cold that he froze in his tracks . [MALE] ally felt good about himself .
Annotation: Unreasonable (Chaotic Scenes), difficult to understand

8. [MALE] was out jogging one morning . suddenly he noticed a little puddle and started hitting . he went to the store to buy some new parts . luckily , the house was gone , and [MALE] was mad . luckily , his car was gone and he was able to buy it .
Annotation: Unreasonable (Chaotic Scenes), difficult to understand

9. [MALE] was out jogging one morning . [MALE] was out jogging one morning . the weather was crisp and cool was crisp and cool . [MALE] felt bad and energetic . [MALE] did not go several more miles out of his way . he decided to keep jogging longer than normal .
Annotation: Unreasonable (Chaotic Scenes), multiple errors including repetition and conflicting logic

If the story is unreasonable but the error type does not belong to the above, please annotate the story as Unreasonable (Others).

Notes
Some stories may not be easy to judge accurately. When determining whether a story is reasonable, choose the label you think most appropriate according to your own understanding of the examples and your subjective impression of the story. Please apply the same evaluation criterion to all stories.
Most importantly, in the process of evaluation, please do NOT add story details between the first sentence and the generated sentences based on your own imagination! All the male/female/neutral names in the stories have been replaced with the special tokens [MALE]/[FEMALE]/[NEUTRAL], respectively. Besides, we lowercase all the initials.

Leading Context:
[MALE] had joined the volunteer fire department .
Generated Story:
his first day there he saw a homeless man . he gave the man some water because he was thirsty . the man told [MALE] it was the most delicious water he ever tasted . [MALE] gave the man a small bucket of water .

Is the story logically reasonable?

[FEMALE], 12, had grown up in a low-income single-parent household. But ani, 7, was wealthy and spoiled, so she was very bratty. At first she hated [FEMALE] and was always mean to her! But then, finally, the two girls began to become friends. When their dad left the house, he went to their room. When he came back, he found them in the closet. He scolded them and grounded them for a year. The girls weren't happy with their new stepmother. (Chaotic) H=0.00, M=0.45, B=0.52, U=0.00

Table 10: Judgments of different metrics for several typical generated samples. H, M, B and U stand for human ratings, MoverScore, BLEURT and the proposed metric UNION, respectively. Italic words denote the improper entities or events and the specified error type, including Repeated plots, poor Coherence, Conflicting logic, and Chaotic scenes. For comparison, we normalize all the scores to the range of [0,1] following Tao et al. (2018).
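The caption's normalization of all scores to [0,1] is plausibly min-max scaling over the evaluated set; the exact procedure of Tao et al. (2018) is not restated here, so treat this as an assumed sketch:

```python
def min_max_normalize(scores):
    # Scale a list of metric scores into [0, 1]; a constant list maps to all zeros.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(min_max_normalize([1, 2, 3]))  # -> [0.0, 0.5, 1.0]
```

Such scaling makes metrics with different native ranges (e.g., BLEU vs. MoverScore) directly comparable in one table.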

Table 11 samples (columns: Reference, Generated Samples, H, U):

S6
Reference: [MALE] wanted to impress his friends. [MALE] knew how to juggle. [MALE] juggled for his friends. His friends enjoyed the show. [MALE] was happy to captivate his friends attention.
Generated: He decided to make a homemade pie for himself. He invited all his friends over for a party. Everyone showed up with a recipe.

Reference: ] went to the state capital on the charity's behalf. She argued for laws protecting nonprofits and the people they serve. [FEMALE] was very successful and felt proud of her accomplishments.
Generated: She never had a college degree and had not studied well. She was very good at her job but always felt like a fraud. A lot of people were disappointed with her. [FEMALE] was very upset and decided to major in philosophy. (Conflicting) H=0.14, U=1.00

S9
Reference: [MALE] was very nervous. The big day had finally come and it was time to pop the question. He held her hand, but she didn't know what was going to happen. [MALE] got down on one knee and asked her to marry him. With tears in her eyes she accepted and they embraced.
Generated: He was at the bar with his girlfriend when a man got into his car. He saw that he was going to be alone. The man was shocked by the situation and asked if she was okay. [MALE] went to his office and got his wife's name. (Chaotic)

Figure 3: Number distribution of annotated stories with different human annotation scores. The total number for ROCStories/WritingPrompts is 400/200, respectively.

Figure 4: A screenshot of the annotation interface on AMT for manual evaluation.

Table 2: Error type proportions of 381 unreasonable stories, including Repeated plots / poor Coherence / Conflicting logic / Chaotic scenes / Others.

Table 4: Data statistics. RUBER_u is short for RUBER_u-BERT. NS (Negative Sampling) indicates whether a metric requires negative samples for training/validation. † means the stories are generated by NLG models and manually annotated.

Table 5: Correlation with human judgments on the ROC and WP datasets. r/ρ/τ indicates the Pearson/Spearman/Kendall correlation, respectively. The best performance is highlighted in bold. Correlation scores marked with * significantly correlate with human judgments (p-value < 0.01).

Table 6: Correlation results in the dataset-drift setting, where the metrics are trained on one dataset and then used for the other.

Specifically, a generalizable metric is expected to reliably evaluate outputs from different datasets even without re-training. Moreover, since the quality of generated samples can vary significantly across NLG models, a reliable metric should be able to evaluate samples of different quality levels. Therefore, we conducted experiments to assess the generalization ability of UNION in this section.

Table 6 shows the Pearson correlation with human judgments in this setting. Compared with the results in Table 5, all the metrics trained on one dataset show remarkable drops in correlation when they are used for the other dataset, because the two datasets differ significantly in length and topic. Nevertheless, UNION performs more robustly than the other metrics, with much better correlations.

Generalization over different biased test sets. Left: distribution of stories of different annotation scores in different test sets. Right: the Pearson correlation of different metrics with human judgments on different test sets, where UNION-Recon denotes UNION without the reconstruction task.

10 https://docs.scipy.org/doc/scipy/reference/stats.html
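The r/ρ/τ values in Tables 5 and 6 can be computed with the SciPy stats module cited in the footnote; the human ratings and metric scores below are made-up illustrations, not values from the paper:

```python
from scipy import stats

# Hypothetical human ratings and metric scores for six stories.
human  = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
metric = [0.1, 0.3, 0.2, 0.7, 0.6, 0.9]

r, p_r     = stats.pearsonr(human, metric)    # linear correlation
rho, p_rho = stats.spearmanr(human, metric)   # rank correlation
tau, p_tau = stats.kendalltau(human, metric)  # ordinal association

print(round(r, 3), round(rho, 3), round(tau, 3))
```

Each call also returns a p-value, which is what the significance markers (*) in Table 5 are based on.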

Table 7: Pearson correlation with different negative sampling techniques. The numbers in parentheses denote the number of stories. The error types include Repeated plots, poor Coherence, Conflicting logic, and Chaotic scenes. The proportions in parentheses indicate the relative change with respect to UNION (the first row).

Table 8: Transformation rules from affirmative sentences to negative ones by adding negation words for different types of verbs. ∼ stands for the current verb and v. for the base form of the verb. The CAPITALIZED words in the sentence examples indicate the altered results. The negation word "not" can be randomly replaced with the short form "n't".

Table 9: Story samples for different human annotation scores and the annotated error types, including Repeated plots, poor Coherence, Conflicting logic, and Chaotic scenes. Bold sentences are the given leading context. Italic words denote the improper entities or events.

Table 11: Typical misjudgments by UNION. H and U stand for human ratings and UNION, respectively. Italic words denote the improper entities or events and the specified error type, including Repeated plots, poor Coherence, Conflicting logic, and Chaotic scenes.