Semantic Frame Forecast

This paper introduces Semantic Frame Forecast, a task that predicts the semantic frames that will occur in the next 10, 100, or even 1,000 sentences in a running story. Prior work focused on predicting the immediate future of a story, such as one to a few sentences ahead. However, when novelists write long stories, generating a few sentences is not enough to help them gain high-level insight to develop the follow-up story. In this paper, we formulate a long story as a sequence of “story blocks,” where each block contains a fixed number of sentences (e.g., 10, 100, or 200). This formulation allows us to predict the follow-up story arc beyond the scope of a few sentences. We represent a story block using the term frequencies (TF) of semantic frames in it, normalized by each frame’s inverse document frequency (IDF). We conduct semantic frame forecast experiments on 4,794 books from the Bookcorpus and 7,962 scientific abstracts from CODA-19, with block sizes ranging from 5 to 1,000 sentences. The results show that automated models can forecast the follow-up story blocks better than the random, prior, and replay baselines, indicating the feasibility of the task. We also learn that the models using the frame representation as features outperform all the existing approaches when the block size is over 150 sentences. The human evaluation also shows that the proposed frame representation, when visualized as word clouds, is comprehensible, representative, and specific to humans.


Introduction
Writing a good novel is hard. Creative writers can get stuck in the middle of their drafts and struggle to develop follow-up scenes. Writing support systems, such as Heteroglossia (Huang et al., 2020a), generate paragraphs or ideas to help writers figure out the next part of the ongoing story. However, little literature focuses on plot prediction for long stories. Much prior work focused on predicting the immediate future of a story, i.e., one to a few sentences later. For example, the Creative Help system uses a recurrent neural network model to generate the next sentence to support writing (Roemmele and Gordon, 2015); the Scheherazade system uses crowdsourcing and artificial intelligence techniques to interactively construct the narrative sentence by sentence (Li and Riedl, 2015); Clark et al. (2018) study machine-in-the-loop story writing where the machine constantly generates a suggestion for the next sentence to stimulate writers; and Metaphoria (Gero and Chilton, 2019) generates metaphors, an even smaller unit, to inspire writers based on an input word by searching relations and ranking distances on ConceptNet (Liu and Singh, 2004).
Figure 1: The semantic frame forecast is a task that predicts the semantic frames that will occur in the next part of a story based on the texts written so far.
Generating a coherent story across multiple sentences is challenging, even with cutting-edge pretrained models (See et al., 2019). To generate coherent stories, researchers often first generate a high-level representation of the story plots and then use it as a guide to generate a full story. For example, Martin et al. (2018) propose an event representation that uses an SVO tuple to generate story plots; Plan-and-write (Yao et al., 2019) uses the RAKE algorithm (Rose et al., 2010) to extract the keyword in each sentence to form a storyline and treats it as an intermediate representation; Fan et al. (2019) use predicate-argument pairs annotated by semantic role labelers to model the structure of stories; and Zhang et al. take words with a certain part-of-speech tag as anchors and show that using anchors as the intermediate representation can improve the story quality. However, these projects all focused on short stories: the event representation is developed on a Wikipedia movie plot summary dataset (Bamman et al., 2013), where a summary has an average of 14.52 sentences; Plan-and-write uses the ROCStories dataset (Mostafazadeh et al., 2016), where each story has only 5 sentences; Fan et al. test their algorithm on the WritingPrompts dataset (Fan et al., 2018), where stories have 734 words (around 42 sentences) on average; and Zhang et al.'s anchor representation is developed on the VIST dataset (Huang et al., 2016), where a story has 5 sentences.
All the existing intermediate representations are generated on a sentence basis, meaning that the length of the representations increases along with the story length. That is, when applying these representations to novels that usually have more than 50,000 words (as defined by the National Novel Writing Month (wik, 2020)), it is not likely that such representations can still work. We thus introduce a new Frame Representation that compiles semantic frames into a fixed-length TF-IDF vector and a Semantic Frame Forecast task that aims to predict the next frame representation using the information in the current story block (see Figure 1). Two different datasets are built to examine the effectiveness of the proposed frame representation: one from Bookcorpus (Zhu et al., 2015), a fiction dataset; and one from CODA-19 (Huang et al., 2020b), a scientific abstract dataset. We establish several baselines and test them on different story block sizes, up to 1,000 sentences. The results show that the proposed frame representation successfully captures the story plot information and helps the semantic frame forecast task, especially for story blocks with more than 150 sentences. To enable humans to perceive and comprehend frame representations, we further propose a process that visualizes a vector-based frame representation as word clouds. Human evaluations show that word clouds represent a story block with reasonable specificity, and that our proposed model produces word clouds that are more representative than those of BERT.

Related Work
Automated Story Generation. Classic story generation focuses on generating logically coherent stories, plot planning (Riedl and Young, 2010), and case-based reasoning (Gervás et al., 2004). Recently, several neural story generation models have been proposed (Peng et al., 2018; Fan et al., 2018), even including massive pretrained models (Radford et al., 2019; Keskar et al., 2019). However, researchers realize that word-by-word generation models cannot efficiently model the long dependency across sentences (See et al., 2019). Models using intermediate representations as guidance to generate stories are then proposed (Yao et al., 2019; Martin et al., 2018; Ammanabrolu et al., 2020; Fan et al., 2019). These works are developed toward short stories and thus are insufficient for modeling novels (see Section 1).
Automated Story Understanding. Story understanding is a longstanding goal of AI (Roemmele and Gordon, 2018). Several tests were proposed to evaluate AI models' ability to reason about the event sequence in a story. Roemmele et al. (2011) proposed the Choice of Plausible Alternatives (COPA) task, focusing on commonsense knowledge related to identifying causal relations between sequences. Mostafazadeh et al. (2016) proposed the Story Cloze Test, in which the model is required to select which of two given sentences best completes a particular story. Ippolito et al. (2019) proposed the Story Infilling task, which aims to generate the middle span of a story that is coherent with the foregoing context and will reasonably lead to the subsequent plots. Under the broader umbrella of story understanding, some prior work aimed to predict the next event in a story (Granroth-Wilding and Clark, 2016) or to identify the right follow-up line in dialogues (Lowe et al., 2016).

Semantic Frame Forecast
As shown in Figure 1, we formulate a long story as a sequence of fixed-length story blocks. Each story block (Figure 2 (1)) has a set of semantic frames (Figure 2 (2)) (Baker et al., 1998). We convert a story block into the Frame Representation (Figure 2 (3)), a TF-IDF vector over semantic frames, by computing the term frequency in that story block and the inverse document frequency over all the story blocks in the corpus. FrameNet (Baker et al., 1998) defined a total of 1,221 different semantic frames, so the generated TF-IDF vector has 1,221 dimensions. The Semantic Frame Forecast is then defined as a task to predict the frame representation of the (n+1)-th story block using the foregoing content, namely the n-th story block.
Figure 2: The steps to generate the frame representation for story blocks. The human-readable word clouds are generated to illustrate the conceptual meaning of the frame representation.
Evaluation Metric. We use Cosine Similarity between the predicted vector and the gold-standard vector (compiled from the human-written story block) for evaluation. Many other metrics, such as Mean-Squared Error (MSE), also exist to measure the distance between two vectors.
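The metric can be illustrated with a minimal sketch in plain Python. The function names are ours, and returning 0 for an all-zero vector is an assumption made here for illustration, not a detail stated in the paper. The cosine distance (1 − cosine similarity) is included because it is the training objective used by the neural baselines described later.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is scored as 0
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, i.e., 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

predicted = [0.0, 1.2, 0.4]
gold = [0.1, 1.0, 0.0]
score = cosine_similarity(predicted, gold)
```

Because cosine similarity ignores vector magnitude, a prediction is rewarded for getting the relative weights of frames right rather than their absolute TF-IDF values.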

Data
We build the dataset from the existing Bookcorpus dataset (Zhu et al., 2015) and CODA-19 dataset (Huang et al., 2020b). This section describes how we preprocess the data, remove undesired content, and build the final dataset.
Bookcorpus Dataset. We obtain a total of 15,605 raw books and their corresponding metadata. To get high-quality fictional content, we remove books using the following heuristic rules: (i) short books whose size is less than 10KB; (ii) books that contain HTML code; (iii) books that are in the epub format (an e-book file format); (iv) books that are not in English; (v) books that are in the "Non-Fiction" genre; (vi) books that are in the "Anthologies" genre; (vii) books that are in the "Graphic Novels & Comics" genre. Since most books contain book information, author information, and some nonfictional content at the beginning and end of the book, we use regular expressions to match the term "Chapter" to locate the chapter title. Only the contents between the first chapter title and the last chapter title are kept. The last chapter is also removed as there are no certain boundaries to identify the story ending. Books whose chapter titles are unlocatable are also removed. After removing all the unqualified books, a total of 4,794 books are used in our dataset. We transliterate all non-ASCII characters into ASCII characters using Unidecode (https://pypi.org/project/Unidecode/) to fulfill the requirement of Open-SESAME (Swayamdipta et al., 2017). Open-SESAME is then used to parse the semantic frames for each sentence.
The books are split into training/validation/test sets following a 70/10/20 split, resulting in 3,357, 479, and 958 books, respectively. To measure the effect of frame representation for different context lengths, we vary the story block length, using 5, 10, 20, 50, 100, 150, 200, 300, 500, and 1,000 sentences. When creating instances, we first split a book into story blocks of the specified length and, with the context window size (see Figure 1) set to 1, extract every pair of consecutive story blocks as an instance. The IDF of each semantic frame is then computed over the story blocks in the training set. Combining the IDF with the TF value in each story block, we convert story blocks into frame representations. We use scikit-learn's implementation (Pedregosa et al., 2011) of TF-IDF with a slight modification to the IDF: scikit-learn uses $\mathrm{idf}(t) = \log\frac{n}{\mathrm{df}(t)+1}$ to compute a smoothed IDF, whereas we use $\mathrm{idf}(t) = \log\frac{n}{\mathrm{df}(t)}$. Detailed statistics are shown in Table 1.
CODA-19 Dataset. We envision a broader definition of "creativity" in writing and attempt to apply story arc prediction technologies to the domains outside novels, for example, scholarly articles. As an earlier exploration, we choose to use a smaller set of human-annotated abstracts (CODA-19 (Huang et al., 2020b)) rather than machine-extracted full text (CORD-19 (Wang et al., 2020a)) in our proof-of-concept study, avoiding formatting issues (e.g., reference format, parsing errors) and intensive data cleaning effort. The original CODA-19 dataset contains 10,966 human-annotated English abstracts for five different aspects: Background, Purpose, Method, Finding/Contribution, and Other. We remove sentences that are annotated as "Other," an aspect for sentences that are not directly related to the content (e.g., terminology definitions or copyright notices). Abstracts that contain Unicode characters are also removed. A total of 7,962 abstracts are used in our dataset. We then use Open-SESAME to parse the semantic frames for each sentence. We adopt CODA-19's original split, where the training set, validation set, and testing set have 6,509, 737, and 716 abstracts, respectively. Three different lengths of story block are used: 1, 3, and 5. We then create instances and compute TF-IDF as described above. Table 2 shows the details.

Models
We implement two naive baselines, an information retrieval baseline, two machine learning baselines, two deep learning baselines, an existing model and a text generation baseline.
Replay Model. For each instance, the replay model takes the frame representation in the n-th story block as the prediction, i.e., the same frames will occur again.
Prior Model. The prior model computes the mean of the frame representation over the training set and uses it as the prediction for all the testing instances.
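The two naive baselines are simple enough to state in a few lines. The following is a minimal sketch with hypothetical function names; we assume the prior is the element-wise mean of the training frame representations.

```python
def replay_predict(current_vec):
    """Replay baseline: predict that block n+1 has the same frames as block n."""
    return list(current_vec)

def fit_prior(train_vecs):
    """Prior baseline: element-wise mean of the training frame representations."""
    n = len(train_vecs)
    dim = len(train_vecs[0])
    return [sum(vec[d] for vec in train_vecs) / n for d in range(dim)]

train_vecs = [[1.0, 3.0, 0.0], [3.0, 5.0, 0.0]]
prior = fit_prior(train_vecs)          # used as the prediction for every test instance
replayed = replay_predict([0.5, 0.0, 2.0])
```

Neither baseline has trainable parameters: replay depends only on the current block, and the prior depends only on corpus-level statistics.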
Information Retrieval with Frame Representation. For each instance, the information retrieval model searches for the most similar story block in the training set and takes the frame representation of its next story block as the prediction. In this setting, we use the cosine similarity between frame representations to measure the story similarity. For block size 5 in the Bookcorpus dataset, there are around 3.7 million instances in the training set, which makes the search infeasible to finish.
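A brute-force version of this retrieval step might look like the sketch below. The function names and toy vectors are hypothetical, and the exhaustive scan is our assumption about how the search could be done; it also makes clear why the cost grows with the number of training instances, since every query is compared against every training block.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity; all-zero vectors are scored as 0 (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def retrieve_next_block(query_vec, train_pairs):
    """Return the follow-up vector of the training block most similar to the query.

    `train_pairs` is a list of (block n vector, block n+1 vector) tuples.
    """
    _, best_next = max(train_pairs, key=lambda pair: cosine_similarity(query_vec, pair[0]))
    return best_next

train_pairs = [
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ([0.0, 0.0, 1.0], [1.0, 1.0, 0.0]),
]
prediction = retrieve_next_block([0.9, 0.1, 0.0], train_pairs)
```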

Random Forest with Frame Representation. The foregoing story block's frame representation is used as the feature for prediction. We use scikit-learn's implementation of Random Forest Regressor (Pedregosa et al., 2011) with a max depth of 3 and 20 estimators. For block sizes that have more than one million training instances (5 and 10 in the Bookcorpus dataset), we randomly sample one million instances to train the model.
LGBM with Frame Representation. This is the same as the previous setup but trained using the LGBM Regressor model (Ke et al., 2017) with the max depth 5, the number of leaves 5, and the number of estimators 100. For block sizes that have more than one million training instances (5 and 10 in the Bookcorpus dataset), we randomly sample one million instances to train the model.

DAE with Frame Representation. This is the same as the previous setting but trained with the Denoising Autoencoder architecture (Bengio et al., 2013). We feed in the foregoing story block's frame representation and output the frame representation for the follow-up story block. Thirty percent of the input is dropped randomly. The model is optimized using the cosine distance (1 − cosine similarity).
Both the encoder and decoder are created via five dense layers with a hidden size of 512. We use a learning rate of 1e-5 and a batch size of 512 and train the model with early stopping after 20 epochs without improvement. The best model on the validation set is kept for testing.
Event-Rep. We adapt the event representation of Martin et al. (2018) and extract an event tuple from each sentence with an automatic parser (Qi et al., 2020). Unlike Martin et al.'s implementation, where the empty placeholder ∅ only replaces unidentified objects and modifiers, we find that subjects can also be frequently missing in fiction books. For example, in "'Come out?' Zack asked. 'Come out of where?'", the verb "come" does not have a subject in either case. In "Fine, follow me.", "follow" has an object but does not have a subject. Therefore, we allow the subject s to take the ∅ placeholder in our implementation. All words are stemmed by NLTK (Loper and Bird, 2002). After extracting the event representation, the sequence of event tuples in the foregoing story block is fed into a five-layer LSTM model (Hochreiter and Schmidhuber, 1997) to predict its follow-up frame representation. Note that the length of the event tuple sequence changes along with the block size. We thus set the maximum length of the sequence to the 95th percentile of the length in the training set. Sequences longer than the maximum length are left-truncated. The model is trained with a hidden size of 512, a learning rate of 3e-5, a dropout rate of 0.05, and a batch size of 64. We optimize the model using the cosine distance and apply early stopping after three epochs without improvement. The best model on the validation set is kept for testing.
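The length cap and left-truncation described above can be sketched as follows. The function names are hypothetical, and the nearest-rank percentile definition is an assumption, since the text does not say which percentile estimator is used. Left-truncation is read here as dropping the oldest tuples so that the most recent part of the block is kept.

```python
def percentile_length(lengths, pct=95):
    """Nearest-rank percentile of sequence lengths (assumed definition; pct is an integer)."""
    ordered = sorted(lengths)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100) in integer arithmetic
    return ordered[rank - 1]

def left_truncate(sequence, max_len):
    """Drop items from the left so that at most `max_len` most recent items remain."""
    start = max(0, len(sequence) - max_len)
    return list(sequence[start:])

# Toy event tuples of the form (subject, verb, object, modifier); None stands for the empty placeholder.
events = [("zack", "ask", None, None), (None, "come", None, "out"), (None, "follow", "me", None)]
train_lengths = list(range(1, 101))      # hypothetical training sequence lengths
max_len = percentile_length(train_lengths)
kept = left_truncate(events, 2)
```

Keeping the right-hand end of the sequence preserves the events closest to the block being predicted, which matches the left-truncation of text described for BERT and GPT-2.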
BERT. We take the pure text in the foregoing story block as the feature and apply the pretrained BERT model (Devlin et al., 2019). BERT has a token length limitation, so we set the maximum number of tokens to 500 for Bookcorpus and 300 for CODA-19. Inputs longer than this limit are truncated from the left. We take the [CLS] token representation from the last layer and add a dense layer on top of it to predict the follow-up frame representation. The model is trained with a learning rate of 1e-5 and a batch size of 32. We optimize the model using the cosine distance and apply early stopping after five epochs without improvement. The model with the best score on the validation set is kept for testing.
SciBERT. This is the same as the previous setting but is trained using the pretrained SciBERT model (Beltagy et al., 2019). We only test this approach on the CODA-19 dataset since it is from the scientific domain.
GPT-2 (For Bookcorpus Only). We also include a text generation model, GPT-2 (gpt2-xl) (Radford et al., 2019), with block sizes of 5, 10, 20, and 50. Since GPT-2 is computationally expensive, we conduct the experiment on a subset of the dataset, where 1,000 instances are randomly selected. We feed the text in the latest story block (n) into GPT-2 and generate 70, 150, 300, and 700 words for block sizes 5, 10, 20, and 50, respectively (5 sentences ≈ 70 words; 10 sentences ≈ 150 words in Bookcorpus, etc.). For stories that exceed GPT-2's word limit, we truncate the text from the left. Story blocks larger than 100 sentences would contain more than 1,400 words, which alone exceeds GPT-2's word limit. Generated stories are then parsed by Open-SESAME to extract the semantic frames and turned into frame representations as the predictions.

Experimental Results and Analysis
Table 3 and Table 4 show the experimental results. In this section, we summarize the main findings.
Predicting forthcoming semantic frames is remarkably challenging yet possible. Machine-learning models outperform the two naive baselines for different story lengths. In the Bookcorpus dataset, BERT performs the best for story blocks under 100 sentences, while LGBM performs the best for story blocks over 150 sentences. In the CODA-19 dataset, SciBERT performs the best for block sizes of 1 and 3, while DAE performs the best for a block size of 5. While the task is very challenging, these results shed light on the semantic frame forecast task. However, the improvement is not big, as shown in the DELTA row, suggesting that semantic frame forecast requires more investigation and understanding.
Table 3: Baseline result for Bookcorpus dataset. BERT and Event-Rep work better in smaller block sizes, while models using frame representation perform better in larger block sizes. DELTA represents the difference between the best model and the prior baseline (an extremely simple but strong baseline) in that specific block size. The small value of DELTA shows that semantic frame forecast is challenging yet possible.
Table 4: Baseline result for CODA-19 dataset. SciBERT performs the best in block sizes 1 and 3. Using the frame representation as the feature, DAE performs the best for block size 5. DELTA shows the difference between the best model and the prior baseline in that specific block size. The small value of DELTA shows that semantic frame forecast is challenging yet possible.
"Prior" is a robust and strong baseline. In both the Bookcorpus dataset and the CODA-19 dataset, the prior baseline is strong. As the story gets longer, the performance also increases. This suggests that when the story block gets bigger, more and more frames will constantly occur.
Replay baseline shows the relation of consecutive story blocks. The replay baseline assumes that the events that happen now will likely happen again shortly. The results in Table 3 and Table 4 partially confirm this assumption. To understand more about the assumption, we use the replay baseline to predict the (n+i)-th story block from the n-th story block in the Bookcorpus dataset. Figure 3 shows the results. We can see that things that happen now are more likely to happen again in the near future than in story blocks farther from the current one.
Figure 3: Using the replay baseline to predict the (n+i)-th story block from the n-th story block (story block size = 5, 10, ..., 1,000). Things that happen in the current story block are more likely to happen again shortly.
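The offset analysis can be sketched as below: for one book, the frame representation of block n is compared with that of block n+i and the scores are averaged. The function names and toy vectors are hypothetical, and averaging within a single book is our simplification of the corpus-level experiment.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity; all-zero vectors are scored as 0 (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def replay_score_at_offset(block_vectors, offset):
    """Mean similarity between block n and block n+offset over one book.

    Returns None when the book has too few blocks for the requested offset.
    """
    scores = [
        cosine_similarity(block_vectors[n], block_vectors[n + offset])
        for n in range(len(block_vectors) - offset)
    ]
    return sum(scores) / len(scores) if scores else None

book_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
near = replay_score_at_offset(book_vectors, 1)
far = replay_score_at_offset(book_vectors, 2)
```

In this toy book the score at offset 1 is higher than at offset 2, mirroring the trend the text reports for Bookcorpus.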
Event-Rep works better in short stories. In the Bookcorpus dataset, event representation works better than the frame representation in small block sizes (5, 10, and 20). However, starting from a block size of 50, the model cannot perform as well as the other models. We thus conclude that event representation works better in short stories. The main reason is that event representations are generated on a sentence-by-sentence basis and will create overwhelming information on long stories. The existing intermediate representations (see Section 1) are mostly generated from sentences and will likely have the same issue as the event representation. Compared to the existing works, the proposed frame representation encodes a story block, no matter how long it is, into a fixed-length vector and therefore performs better on longer stories.
Table 5: Result of the downsampling experiment. Although performance drops for all models, our observations still hold. Therefore, the conclusions are not merely caused by the effect of data size.
BERT performs very well in short stories. The results of BERT and SciBERT in Table 3 and Table 4 show that textual information is helpful in predicting story blocks. BERT performs better when the block size is under 100 in the Bookcorpus dataset and below 3 in CODA-19. However, handling long texts remains challenging for BERT, as its computational complexity scales with the square of the token length. Researchers have started reducing the computational complexity of transformer-based models to allow modeling of long texts, for example with Linformer (Wang et al., 2020b), Longformer (Beltagy et al., 2020), Reformer (Kitaev et al., 2020), and BigBird (Zaheer et al., 2020). However, these models still require a lot of computation power and are not yet ready for general use.
The good performance does not merely come from the number of instances. Deep learning methods often require more instances for training. To show that the result in Table 3 is not mainly caused by the number of instances, we conduct the same experiment on the Bookcorpus dataset using 88,720 training instances for block sizes ranging from 5 to 200. Table 5 shows the results. The performance is affected, but the conclusions we make above still stand, showing that the number of instances is not the main factor for our observations. Meanwhile, we find that BERT is affected more than LGBM. Compared to Table 3, BERT's performance in Table 5 changes by −0.0092 to −0.0051, whereas LGBM's changes by only −0.0039 to −0.0007. Although this suggests that the number of instances can cause the difference, it also shows that the frame representation can be used with fewer instances.
GPT-2 is not effective. GPT-2 is not effective in predicting the story flow even though it can generate reasonable sentences. Even the naive Replay baseline outperforms the GPT-2 baseline in predicting the story block. We hypothesize that GPT-2 is not good at maintaining coherence among sentences or events, especially in the creative writing domain. Similar phenomena have also been observed by others and used to motivate the need for guided generation models or progressive generation models (Wang et al., 2020c; Tan et al., 2020).
Table 6: Results of using 2 or 5 foregoing story blocks to predict the (n+1)-th story block. LGBM improves further when using more context, but BERT fails to model the longer context, and its performance even gets hurt.

Using a Larger Context Window
This paper focuses on using 1 story block to forecast the next one, i.e., window size = 1 (see Figure 1). As a proof of concept, we also use 2 and 5 blocks (window size = 2 and 5) for prediction. We use two models: LGBM with frame representation, and BERT with text. For LGBM, we simply concatenate the frame representations of the input story blocks to create the input vector. For BERT, we put the event tuple and the text together as the input. Table 6 shows the results. While BERT does not benefit from using more context, LGBM's performance improves, suggesting the potential of using a larger context window. More research is required to understand the effects.
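The concatenation used for LGBM can be sketched as follows. The function name is hypothetical, and the ordering of the context blocks (oldest first) is an assumption; the text only states that the frame representations are concatenated.

```python
def windowed_instances(block_vectors, window_size):
    """Build (features, target) pairs from one book's block vectors.

    Features are the concatenated vectors of the `window_size` blocks that
    precede the target block, oldest first (ordering is an assumption).
    """
    instances = []
    for end in range(window_size, len(block_vectors)):
        context = block_vectors[end - window_size:end]
        features = [value for vec in context for value in vec]
        instances.append((features, block_vectors[end]))
    return instances

book_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
instances = windowed_instances(book_vectors, window_size=2)
```

With 1,221 frames, a window of k blocks yields a 1,221 × k dimensional input, so the feature size grows with the window but still not with the number of sentences per block.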

Which Semantic Frames Affect the Follow-Up Story More?
Different frames may contribute differently to the prediction of the follow-up story. To understand which frames play a more important role in the story, we conduct an ablation study on the LGBM model with block size 150. We remove one frame from the input frame representation and record the performance change, where a larger performance drop means the removed frame is more important. A total of 50 frames are selected randomly for the ablation study. Table 7 shows the top and bottom five frames. We hypothesize that the more generic frames, such as "State_continue" and "Proper_reference," might be less important to the follow-up stories, but it will require more research to understand the impacts fully.
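The ablation procedure can be sketched as below. This is an illustration, not the authors' code: `predict` is a hypothetical callable standing in for the trained model, and removing a frame is implemented by zeroing its dimension in the input vector, which is our assumption about what removal means here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity; all-zero vectors are scored as 0 (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def ablate_frame(vec, index):
    """Copy of `vec` with one frame dimension zeroed out (assumed form of removal)."""
    out = list(vec)
    out[index] = 0.0
    return out

def frame_importance(predict, inputs, targets, frame_indices):
    """Performance drop (baseline score minus ablated score) for each frame index."""
    def mean_score(xs):
        return sum(cosine_similarity(predict(x), y) for x, y in zip(xs, targets)) / len(xs)

    baseline = mean_score(inputs)
    return {
        index: baseline - mean_score([ablate_frame(x, index) for x in inputs])
        for index in frame_indices
    }

# Toy check with a stand-in model that simply replays its input.
inputs = [[1.0, 1.0], [1.0, 0.0]]
targets = [[1.0, 1.0], [1.0, 0.0]]
importance = frame_importance(lambda x: x, inputs, targets, frame_indices=[0, 1])
```

A larger value in `importance` corresponds to a larger performance drop, i.e., a frame the model relies on more.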

Human Evaluation
We further evaluate the proposed method with humans. We first visualize the vector of semantic frames into word clouds so that humans can perceive and comprehend it. We then use online crowd workers to test the (i) representativeness and (ii) the specificity of the produced word clouds.

Visualizing Semantic Frame Vectors into Word Clouds. Figure 4 shows the workflow of generating word clouds based on a frame representation (i.e., a TF-IDF vector). In FrameNet, "lexical units" are the terms that can trigger a specific frame. Compared to showing the name and definition of a frame, lexical units are easier for people to read and comprehend. Therefore, we use the top 30 frames (ranked by their TF-IDF weights) and randomly select up to three lexical units for each frame to form a word cloud. The size and color of each lexical unit are computed according to the frame's TF-IDF weight, where a higher TF-IDF value results in a larger font and darker color. Finally, we arrange the lexical units into three word clouds for nouns, verbs, and adjectives using their POS tags. All the word clouds are generated using d3-cloud (Davies, 2016).
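The selection step of this workflow can be sketched in plain Python. The data structures are hypothetical: `frame_weights` maps frame names to TF-IDF weights, and `lexical_units` maps each frame to (word, POS) pairs, which in practice would come from FrameNet. Rendering with d3-cloud is outside the scope of the sketch.

```python
import random

def select_cloud_terms(frame_weights, lexical_units, top_k=30, units_per_frame=3, seed=0):
    """Pick lexical units for the top-weighted frames and group them by POS.

    Returns a dict with keys "noun", "verb", and "adjective", each holding
    (word, weight) pairs; the weight would drive font size and color.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    top_frames = sorted(frame_weights.items(), key=lambda item: item[1], reverse=True)[:top_k]
    clouds = {"noun": [], "verb": [], "adjective": []}
    for frame, weight in top_frames:
        units = lexical_units.get(frame, [])
        chosen = rng.sample(units, min(units_per_frame, len(units)))
        for word, pos in chosen:
            if pos in clouds:  # other parts of speech are ignored
                clouds[pos].append((word, weight))
    return clouds

frame_weights = {"Motion": 2.0, "Statement": 1.0, "Rare_frame": 0.1}
lexical_units = {
    "Motion": [("go", "verb"), ("move", "verb")],
    "Statement": [("say", "verb"), ("remark", "noun")],
    "Rare_frame": [("thing", "noun")],
}
clouds = select_cloud_terms(frame_weights, lexical_units, top_k=2)
```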

Representativeness
This task evaluates which model can generate the most representative word cloud for a story block.
Task Setups. In this Human Intelligence Task (HIT), we show the actual next story block (n + 1) and two or three sets of [noun, verb, adjective] word clouds (n + 1) produced by different models based on the previous story block (n). The goal is to measure, from the users' perspective, how well the generated word clouds represent the actual human-written follow-up stories. Workers from Amazon Mechanical Turk (MTurk) are asked to read the story and select the word cloud that better represents the story block. In the worker interface, we set up a 3-minute lock for submission and a reach-to-the-bottom lock for the story panel to make sure the workers read the story. Nine different workers are recruited for each task. We empirically estimate the working time to be less than 6 minutes per HIT and set the price to $0.99/HIT (hourly wage = $10).
We choose block size 150 to compare two models: LGBM with frame representation and BERT with text. Ground-truth word clouds are also added to some of the HITs to check the validity of the task. A total of 150 instances are randomly selected from the Bookcorpus testing set. For each instance, the foregoing story block is fed into LGBM and BERT to predict the frame representation of the follow-up story block. Out of the 150 instances, 50 are conducted with ground truth, where a total of three word clouds are shown. The other 100 instances are used for comparing LGBM against BERT directly.
Results. Over the 50 HITs where ground truth is included, (ground truth, LGBM, BERT) wins (32, 15, 16) HITs, respectively (ties exist). Each HIT has nine assignments, one from each of nine workers. Regarding the assignment voting, (ground truth, LGBM, BERT) gets (199, 131, 120) votes, respectively. The result suggests that humans can correctly perceive the word clouds' conceptual meaning, as the ground truth is rated the best.
Over the 100 HITs where LGBM and BERT are compared directly, (LGBM, BERT) wins (59, 41) HITs. Regarding the assignment voting, (LGBM, BERT) gets (472, 428) votes, respectively. The result shows that LGBM is better than BERT at a block size of 150, which aligns with our automatic evaluation results using cosine similarity (see Section 6).

Specificity
This task evaluates whether using the proposed word cloud to represent a story block is specific enough for humans to distinguish the correct story from the distractor.
Task Setups. In this HIT, we show two story blocks (n) and one set of [noun, verb, adjective] word clouds (n). Note that the current story block (n) and its ground-truth word cloud (n) are used to examine if humans can correctly perceive the semantic information from the word cloud visualization. One story block is the answer that is referred to by the word clouds and the other one is a distractor. Workers are asked to read the two story blocks and select the story block that is referred to by the word clouds. Nine different workers are recruited for each HIT. We use the same worker interface design and built-in worker qualifications as those in Section 7.1. A HIT takes an estimated 2.33 minutes and is priced at $0.38.
We choose block size 20 and use the ground-truth word clouds for this experiment. Fifty instances from 50 different books are randomly selected from the Bookcorpus testing set. For each instance, we also randomly select a 20-sentence story block from a different book as the distractor.
Results. Of the 450 assignments, 63.8% of the answers were correct. When aggregating the assignments using majority voting, 74% of 50 HITs were answered correctly. We thus believe that it is reasonably specific for humans to represent a story block using the proposed word clouds.
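The two accuracy figures above correspond to two ways of aggregating the same worker answers, sketched below with hypothetical toy data. Assignment-level accuracy counts every worker answer, while HIT-level accuracy counts a HIT as correct when a strict majority of its workers chose the right story block; with nine workers per HIT, ties cannot occur.

```python
def assignment_accuracy(hit_answers):
    """Fraction of individual worker answers that are correct."""
    total = sum(len(answers) for answers in hit_answers)
    correct = sum(sum(answers) for answers in hit_answers)
    return correct / total

def majority_vote_accuracy(hit_answers):
    """Fraction of HITs in which a strict majority of workers answered correctly."""
    wins = sum(1 for answers in hit_answers if 2 * sum(answers) > len(answers))
    return wins / len(hit_answers)

# Toy data: three HITs with nine answers each (True = correct story block chosen).
hit_answers = [
    [True] * 9,
    [True] * 5 + [False] * 4,
    [True] * 2 + [False] * 7,
]
per_assignment = assignment_accuracy(hit_answers)
per_hit = majority_vote_accuracy(hit_answers)
```

Majority voting can yield a higher accuracy than the per-assignment rate, as in the reported results, because individual mistakes are outvoted within a HIT.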

Conclusion
This paper proposes a semantic frame forecast task that aims to forecast the semantic frames in the next 10, 100, or even 1,000 sentences of a story. A long story is formulated as a sequence of story blocks that contain a fixed number of sentences. We further introduce a frame representation that can encode a story block into a fixed-length TF-IDF vector over semantic frames. Experiments on both the Bookcorpus dataset and CODA-19 dataset show that the proposed frame representation helps semantic frame forecast in large story blocks. By visualizing the frame representation as word clouds, we also show that it is comprehensible, representative, and specific to humans. In the future, we will introduce the frame representation into story generation models to ensure coherence when generating long stories. We will also explore the possibility of supporting writers to develop the next part of their stories by generating semantic frames as clues using semantic frame forecast.