Controllable Story Generation with External Knowledge Using Large-Scale Language Models

Existing pre-trained large language models have shown unparalleled generative capabilities. However, they are not controllable. In this paper, we propose MEGATRON-CNTRL, a novel framework that uses large-scale language models and adds control to text generation by incorporating an external knowledge base. Our framework consists of a keyword predictor, a knowledge retriever, a contextual knowledge ranker, and a conditional text generator. As we do not have access to ground-truth supervision for the knowledge ranker, we make use of weak supervision from sentence embedding. The empirical results show that our model generates more fluent, consistent, and coherent stories with less repetition and higher diversity compared to prior work on the ROC story dataset. We showcase the controllability of our model by replacing the keywords used to generate stories and re-running the generation process. Human evaluation results show that 77.5% of these stories are successfully controlled by the new keywords. Furthermore, by scaling our model from 124 million to 8.3 billion parameters we demonstrate that larger models improve both the quality of generation (from 74.5% to 93.0% for consistency) and controllability (from 77.5% to 91.5%).


Introduction
* This work was done during the internship of Peng Xu at NVIDIA. Corresponding authors: Peng Xu, Mostofa Patwary.

Context: [FEMALE] was on a long road trip .
Guan et al. (2020): she had just driven into town . she had been driving for hours . she was so tired she decided to take a nap . when she woke up , she still had n't slept well .
MEGATRON-CNTRL-124M (keyword: driving): she was on her way to see a strange country. she had never seen anything like that before. finally she saw a small town that was so small. it was a great little town with a nice park !
MEGATRON-CNTRL-8B (keyword: driving): she was driving in the middle of [PLACE]. all of a sudden the tire pressure light came on. she got out to check her tire. it was flat so she used the roadside assistance.
MEGATRON-CNTRL-8B-ANT (keyword: attract): she really wanted to see a few attractions. the first one she saw was a giant water park. it was amazing. it ended up being a fun experience.
Table 1: Stories generated by models of increasing capacity and controllability. As the model size grows, story quality becomes increasingly coherent, fluent, and logically consistent. The last row demonstrates how the MEGATRON-CNTRL-8B-ANT model controls the story generation with a new keyword, "attract". Note that [MALE] and [FEMALE] …

Text generation has recently attracted significant attention from the research community as large pretrained language models, such as GPT-2 (Radford et al., 2018, 2019), demonstrated promising results for generating long, grammatically correct, and fluent text. Finetuning these models has shown significant improvements in downstream tasks, such as persona chat (Wolf et al., 2019). However, one non-negligible drawback of these large models is the lack of knowledge which humans use to produce natural text. For example, GPT-2 based models produce degraded generations that are illogical and ungrammatical for knowledge-driven generation tasks, such as story generation. Guan et al.
(2020) therefore introduced commonsense knowledge to the pre-trained language model by further finetuning on commonsense datasets. Although implicit encoding of knowledge is helpful for knowledge incorporation, there is still a lack of training mechanism to teach the model when and what to incorporate from external knowledge.
In addition, these large pre-trained language models are hard to control. Recently, plug-and-play language models (Dathathri et al., 2019) addressed whole-document controllability by adding a linear classifier on top of GPT-2 to predict whether generated text observes a particular style or property. Keskar et al. (2019) controlled a 1.2B parameter language model via control codes prepended to the model input. Boyd et al. (2020) controlled the personality of a dialogue agent by conditioning it on prior conversations of a target actor. However, these controlling conditions are predefined, limited in their capability, and are only used once at the beginning to condition the generation of the rest of the document. They do not provide control granularity at either the sentence or sub-document level.

Figure 1: Overview of our generation process. Based on an input context, we generate keywords for future context, use the keywords to retrieve relevant knowledge from an external knowledge base, filter it based on its relevance to the context, and use the top-scored knowledge sentences to guide the generation.
In this work, we address these shortcomings and develop an efficient controllable text generation framework that we apply to the story generation task. In order to provide manual control to users through a set of interpretable keywords, we first develop a keyword predictor model for the next sentence. These keywords are then used to retrieve knowledge sentences from an external knowledge base. Not all of the retrieved knowledge is relevant to the story context, and it is often noisy. To this end, we introduce a novel contextual ranker that ranks knowledge sentences based on their relevance to the context. As we do not have access to ground-truth supervision for this contextual knowledge ranker, we make use of sentence embeddings for weak supervision. The top-ranked knowledge sentences from the knowledge ranker are then fed to the conditional text generator to guide generation. By providing the knowledge in addition to the context, we give the generator rich information to attend to and help the model better understand the rationale between sentences. Table 1 shows an example of controllable story generation with increasing model capacity.

Summary of Contributions:
• We propose a novel generation framework that allows dynamic incorporation of external knowledge into a language model, as well as control over text generation.
• Using both automatic metrics as well as human evaluations, we demonstrate that our model generates more fluent, consistent, and coherent stories with lower repetition rate and higher diversities compared to the previous state-of-the-art on ROC story datasets (Mostafazadeh et al., 2016).
• We showcase the controllability of our model by replacing the keywords used to generate stories. Human evaluation results show that up to 91.5% of the generated stories are successfully controlled by the new keywords.
• We scale our model from 124 million to 8.3 billion parameters and demonstrate that both the quality and the controllability of the generations improve as the model size increases.

Framework
In our problem setup, we complete a story using the first sentence as input, similar to Guan et al. (2020). We augment the generation process with an external knowledge base and develop a methodology that can guide and control the story generation. Our approach consists of the following four steps, connected as shown in Figure 1:

1. Given the story context, a keyword predictor model first predicts a set of keywords for the next sentence yet to be generated.

2. A knowledge retriever then takes the generated keywords and queries an external knowledge base, where each knowledge triple is converted into natural-language "knowledge sentences" using templates.
3. A contextual knowledge ranker then ranks the external knowledge sentences based on their relevance to the story context.
4. Finally, a generator takes both the story context as well as the top-ranked knowledge sentences as input and generates the next sentence in the story. The output sentence is appended to the story context and steps 1-4 are repeated.
This formulation naturally allows controllability by replacing the keyword prediction process with manual external keywords. This work uses dynamic planning of the keywords and knowledge at each generation step, which allows users to participate in and control the generation on the go. As a result, they do not need to pre-specify the keywords explicitly. We also note that it is challenging to statically plan all the knowledge needed for generation at the beginning; this issue becomes severe for long generations. To formalize this method, we start by introducing the notation used throughout the paper and then detail each of the four steps above in the following subsections.

Notation: A knowledge base $G$ is defined as a set of knowledge triples $t$ = (subject, relation, object). A knowledge sentence $r$ is defined as $r = T(t)$ by mapping $t$ using predefined templates $T$. For example, (eiffel tower, At-Location, paris) is transformed into "eiffel tower is at paris". We should highlight that since our framework transforms the triple knowledge base into natural-language sentences, any knowledge base in natural-language format can be readily incorporated into our framework. We use superscripts to index story sentences and define a story $S$ of length $l$ as a sequence of individual story sentences $s^i$, where $S = \{s^1, s^2, \cdots, s^l\}$. We use $K^i = \{k^i_1, \cdots, k^i_q\}$ to denote the keywords associated with story sentence $s^i$. A keyword $k^i_j$ is made up of subword tokens from our language model's vocabulary. Note that the number of keywords $q$ per sentence varies and can be zero. We define $R^i = \{r^i_1, \cdots, r^i_v\}$ as the knowledge associated with $s^i$, where $r^i_j$ denotes the $j$-th knowledge sentence associated with $s^i$. The number of knowledge sentences $v$ varies per sentence and can be zero. Note that $v \neq q$ because a keyword can have multiple knowledge triples associated with it.
Given this notation, we define the story context as $X^{i-1} = \{x^1, \cdots, x^{i-1}\}$, where $x^i = (R^i, s^i)$ pairs each story sentence with its associated knowledge. The goal of this work is to generate $x^i$ given $X^{i-1}$, that is, to first predict the knowledge $R^i$ contained in $s^i$ and then predict $s^i$ itself.
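As a small illustration of the template mapping $T$ described above, the sketch below converts a knowledge triple into a knowledge sentence. The template strings per relation are illustrative assumptions, not the paper's exact templates.

```python
# Hypothetical templates for a few ConceptNet-style relations; the paper's
# actual template set is not specified here.
TEMPLATES = {
    "At-Location": "{subject} is at {object}",
    "Capable-Of": "{subject} can {object}",
    "Used-For": "{subject} is used for {object}",
}

def triple_to_sentence(triple):
    """Map a knowledge triple t = (subject, relation, object) to a
    natural-language knowledge sentence r = T(t)."""
    subject, relation, obj = triple
    return TEMPLATES[relation].format(subject=subject, object=obj)

print(triple_to_sentence(("eiffel tower", "At-Location", "paris")))
```

This mirrors the paper's example of (eiffel tower, At-Location, paris) becoming "eiffel tower is at paris".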

Keyword Predictor Model
To provide manual control to users, we first develop a keyword predictor model. Given the current story context $X^{i-1}$, the model predicts a set of keywords $K^i$ for the next sentence yet to be generated. Predicting keywords instead of directly predicting knowledge triples not only allows us to control the generation in an interpretable manner, but also greatly reduces the search space for the knowledge triples. We formulate this keyword prediction problem similarly to a left-to-right language model, where the goal is to predict the string of concatenated keywords:

$$p(K^i \mid X^{i-1}) = \prod_j p(k^i_j \mid K^i_{<j}, X^{i-1}),$$

where $K^i_{<j}$ denotes all the predicted keywords up to the $j$-th keyword and $p$ is the probability distribution. We use a GPT-2 (Radford et al., 2019) transformer to model this probability distribution and optimize the keyword predictor with maximum-likelihood training and a next-token prediction loss. Following Yao et al. (2019), we provide the labels for $K^i$ by extracting keywords from the ground-truth training sentence $s^i$ using the RAKE algorithm (Rose et al., 2010). Note that our model allows generation of multiple keywords and thus provides the flexibility to choose a subset of them as the control signal for generation.
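As a rough illustration of how keyword labels could be linearized into a single target string for next-token-prediction training, consider the sketch below. The `<kw>` separator and the stopword-based extractor are assumptions made for brevity; the paper extracts keywords with RAKE.

```python
SEP = " <kw> "  # assumed separator token between keywords

def naive_keywords(sentence,
                   stopwords={"she", "was", "her", "to", "the", "a", "on"}):
    """Stand-in for RAKE: keep non-stopword tokens as keywords."""
    return [w for w in sentence.lower().rstrip(".").split()
            if w not in stopwords]

def keyword_target(sentence):
    """Concatenate extracted keywords into the training target K^i."""
    return SEP.join(naive_keywords(sentence))

print(keyword_target("she was driving on the highway."))
```

During training, the model learns to emit this concatenated string token by token, conditioned on the story context.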

Knowledge Retrieval
In this step, we use the keywords $K^i$ generated in Section 2.1 to retrieve all the related knowledge triples from our knowledge base $G$. This is done by converting all knowledge triples into knowledge sentences using predefined templates and then matching the keywords against the knowledge sentences. This results in the knowledge set $\tilde{R}^i = \{\tilde{r}^i_1, \cdots, \tilde{r}^i_z\}$ of size $z$. Future work will focus on replacing this simple retrieval with a learnable module similar to Guu et al. (2020).

Algorithm 1 Building Pseudo Label of R i
Input: Story sentence $s^i$ and its preceding sentence $s^{i-1}$, USE encoder $U$, RAKE keyword extractor, and knowledge base $G$
Output: Pseudo label of $R^i$
1: Extract keywords $K^i$ from $s^i$ using RAKE
2: Find $\tilde{R} = \{T(t) \mid t \in G \text{ and } \exists k^i_j \in K^i \text{ s.t. } k^i_j \in t\}$
3: Encode each $\tilde{r}_j \in \tilde{R}$ to $U_{r_j}$ using USE
4: Encode $[s^{i-1}, s^i]$ to $U_s$
5: Compute the cosine similarity score between each $U_{r_j}$ and $U_s$
6: return the $\tilde{r}_j$ with the top $N$ highest scores

Building Pseudo Label of R i
The main challenge for controlling generation with knowledge is that we have no explicit access to the hidden, latent controlling knowledge humans use to supervise their story writing. That is, $R^i$, the knowledge associated with $s^i$, is not available. We therefore propose to use a weakly supervised signal to build pseudo labels of $R^i$ from $s^i$. We hypothesize that $R^i$ should 1) overlap with $s^i$ in terms of keywords and 2) have strong connections to both $s^i$ and the preceding sentence $s^{i-1}$. We include $s^{i-1}$ along with $s^i$ because it is hard to retrieve appropriate knowledge using only $s^i$ due to the ambiguity of natural language. We do not include previous context beyond $s^{i-1}$ because additional context overwhelms the information contained in $s^i$.
Following our hypothesis, we first extract keywords $K^i$ from $s^i$ using RAKE (Rose et al., 2010) and then match $K^i$ with all knowledge triples in $G$. Transforming the retrieved triples into knowledge sentences gives us the set $\tilde{R}^i$. We then concatenate the sentences $s^{i-1}$ and $s^i$ and encode them using the Universal Sentence Encoder (USE) (Cer et al., 2018), a widely used toolkit for semantic similarity: $U_s = U([s^{i-1}, s^i])$, where $U$ denotes the USE encoder. For each $\tilde{r}^i_j \in \tilde{R}^i$, we then calculate the cosine similarity between $U_s$ and $U_{r_j} = U(\tilde{r}^i_j)$ and sort $\tilde{R}^i$ by this score. We take the $\tilde{r}^i_j$ with the top $N$ highest scores as the pseudo label of $R^i$. Algorithm 1 describes this process. During the training of each of the following models, we use this pseudo label in place of $R^i$.
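The scoring in steps 3-6 of Algorithm 1 can be sketched as below: rank candidate knowledge embeddings by cosine similarity to the embedded sentence pair and keep the top $N$. Random toy vectors stand in for USE embeddings.

```python
import numpy as np

def top_n_by_cosine(u_s, u_rs, n):
    """Return the indices of the n candidates most similar to u_s.

    u_s:  1-D embedding of the concatenated sentence pair [s^{i-1}, s^i].
    u_rs: 2-D array, one row per candidate knowledge sentence embedding.
    """
    u_s = u_s / np.linalg.norm(u_s)
    u_rs = u_rs / np.linalg.norm(u_rs, axis=1, keepdims=True)
    scores = u_rs @ u_s                      # cosine similarity per candidate
    return np.argsort(-scores)[:n].tolist()  # indices of the top-n scores

# Toy 2-D embeddings standing in for 512-dimensional USE vectors.
u_s = np.array([1.0, 0.0])
u_rs = np.array([[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
print(top_n_by_cosine(u_s, u_rs, 2))
```

The selected candidates become the pseudo-positive set used to train the contextual knowledge ranker.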

Contextual Knowledge Ranker
While knowledge retrieval with keywords reduces the controlling-knowledge candidate space from the knowledge base $G$ to the subset $\tilde{R}^i$, this set is still large and noisy since words are ambiguous and can have multiple senses. We therefore contextualize the knowledge sentences in $\tilde{R}^i$ to obtain relevant and useful ones under $X^{i-1}$. To do this, we develop a contextual knowledge ranker. The model is trained with pseudo labels extracted with access to the future sentence $s^i$, as described in Sec. 2.3.
We use a BERT model to encode both the context $X^{i-1}$ and each knowledge sentence $\tilde{r}^i_j \in \tilde{R}^i$. To adapt to the input format of BERT, we append a [SEP] token to each $R^j$ and $s^j$ inside the context $X^{i-1}$, and add a [CLS] token to the beginning of $X^{i-1}$. For segment ids, we mark tokens from the knowledge base as 0 and those from the story as 1. The representations of $X^{i-1}$ and $\tilde{r}^i_j$ are obtained by applying a linear layer on top of the embedding of the [CLS] token:

$$V_x = W_1 \cdot \mathrm{BERT}(X^{i-1}), \qquad V_j = W_2 \cdot \mathrm{BERT}(\tilde{r}^i_j),$$

where $W_1$ and $W_2$ are learnable weights. We then calculate the relevance score $C_j$ between $X^{i-1}$ and $\tilde{r}^i_j$ as the inner product of $V_x$ and $V_j$:

$$C_j = V_x^{\top} V_j.$$

We take $R^i$ (Sec. 2.3) as positive samples and $\tilde{R}^i \setminus R^i$ as negative samples to train our ranker. Given a positive and a negative knowledge sentence $r_p$ and $r_n$ with scores $C_p$ and $C_n$, we define the ranking loss as

$$\mathcal{L} = \max(0, M - C_p + C_n),$$

where $M$ is a margin determined empirically. Algorithm 2 describes the ranker training process. At inference time, we simply calculate $C_j$ for all $\tilde{r}^i_j \in \tilde{R}^i$, sort them by score, and pick the top $N$ most relevant knowledge sentences as $R^i$.
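The margin ranking loss above can be sketched numerically as follows, with toy vectors standing in for the projected [CLS] representations $V_x$, $V_p$, and $V_n$:

```python
import numpy as np

def margin_ranking_loss(v_x, v_p, v_n, margin=5.0):
    """L = max(0, M - C_p + C_n), with inner-product relevance scores."""
    c_p = float(np.dot(v_x, v_p))   # relevance of the positive knowledge
    c_n = float(np.dot(v_x, v_n))   # relevance of the negative knowledge
    return max(0.0, margin - c_p + c_n)

v_x = np.array([1.0, 2.0])          # toy context representation
print(margin_ranking_loss(v_x, np.array([3.0, 1.0]), np.array([0.5, 0.0])))
```

The loss is zero once the positive score exceeds the negative score by at least the margin $M$ (the paper sets $M = 5.0$), pushing relevant knowledge above irrelevant knowledge in the ranking.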

Conditional Generator
The conditional generator is a language model that incorporates the controlling knowledge and generates the next sentence. It takes the concatenation of the story context $X^{i-1}$ and the controlling knowledge $R^i$ as input and generates $s^i$. A GPT-2 transformer is used to model this conditional probability distribution. We describe the concatenated input representation in Appendix A.5.
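As a rough sketch of the generator's input construction, the context and ranked knowledge can be flattened into one token sequence. The delimiter and ordering below are assumptions; the paper's exact representation is given in its Appendix A.5.

```python
def build_generator_input(context_sentences, knowledge_sentences):
    """Flatten ranked knowledge R^i and story context X^{i-1} into one
    string for the GPT-2 conditional generator (layout is assumed)."""
    knowledge = " [SEP] ".join(knowledge_sentences)
    context = " ".join(context_sentences)
    return f"{knowledge} [SEP] {context}"

print(build_generator_input(
    ["[FEMALE] was on a long road trip ."],
    ["car is used for driving"]))
```

After the next sentence is generated, it is appended to the context and the four steps repeat.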

Algorithm 2 Knowledge Ranker Training
Parameters: BERT model parameters $\Theta$ and ranker model parameters $W_1$ and $W_2$
Input: A story $S$ with $l$ sentences and a knowledge base $G$
1: Initialize $\Theta$ using a pre-trained BERT model and $W_1$, $W_2$ randomly
2: Dataset $D = \emptyset$
3: Call Algorithm 1 to retrieve $R^1$ from $G$ using $s^1$
4: for $i \in \{2, \ldots, l\}$ do
5:   Call Algorithm 1 to retrieve $R^i$ using $s^i$
6:   Get $\tilde{R}^i$ using knowledge retrieval (Section 2.2)
7:   for sampled pairs $(r_p, r_n)$ with $r_p \in R^i$ and $r_n \in \tilde{R}^i \setminus R^i$ do
8:     Build context $X^{i-1}$
9:     Add $(X^{i-1}, r_p, r_n)$ to $D$
10:   end for
11: end for
12: for $(X, r_p, r_n) \in D$ do
13:   Calculate loss $\mathcal{L}$ using Equation 3
14:   Optimize BERT, $W_1$, $W_2$
15: end for
16: return BERT, $W_1$, $W_2$

Experimental Setup

Datasets
We use the ROC story dataset (Mostafazadeh et al., 2016) for our experiments. It consists of 98,161 stories, where each story contains five sentences. 88,344/4,908/4,909 stories are used for the train/validation/test sets, respectively. Following Guan et al. (2020), each sentence is delexicalized by replacing all the names and entities in stories with the special placeholders [MALE], [FEMALE], and [NEUTRAL] for male, female, and unknown names and entities, respectively. Given the first sentence of each story, our model's task is to generate the rest of the story. For our external knowledge base, we use ConceptNet (Speer and Havasi, 2012), which consists of 600k knowledge triples.

Models
We used Megatron-LM (Shoeybi et al., 2019) pre-trained BERT and GPT-2 models to initialize our contextual knowledge ranker and generative models, respectively. For the model configurations (hidden size, number of layers, and attention heads), we used the BERT and GPT-2 configurations from Megatron-LM. For generation with our GPT-2 models, we used a top-k sampling scheme (Fan et al., 2018) with k = 40 and a softmax temperature of 0.7. We detail the training hyperparameters and the input representations for GPT-2 and BERT in Appendix A.1 & A.2. Both the keyword predictor and the conditional sentence generator follow the same settings.
To train our contextual knowledge ranker, we set the margin to 5.0. We set the number of knowledge sentences in R i to 10. Therefore, for a given story context, the top 10 retrieved knowledge sentences from ConceptNet according to USE are chosen as the positive samples. We further select 40 negative samples to compute our margin loss. We then randomly sample 50 (positive, negative) pairs for each story context to train our contextual knowledge ranker. In total, we used ∼15 million pairs for training and ∼1 million pairs for validation. After training our ranker, we achieve a validation accuracy of 0.9.

Controllability Experiment Setup
To test the controllability of our model, we perform experiments where we change keywords to their antonyms. With antonyms, we expect maximal change to the story generation. To do that, we first use MEGATRON-CNTRL-124M to generate keywords $K$ and a corresponding full story $S$. Then we identify the first keyword $k^i_a \in K^i$ whose antonym is available in WordNet (Miller, 1995). If multiple antonyms for $k^i_a$ are available, we sample one uniformly at random. Afterwards, we provide the start of the story $\{s^1, s^2, \cdots, s^{i-1}\}$, the keywords shared with the original story $\{K^1, K^2, \cdots, K^{i-1}\}$, and the antonym of $k^i_a$ to either MEGATRON-CNTRL-124M or a larger model (e.g., MEGATRON-CNTRL-355M). We then let the model finish the generation. We refer to these generations as MEGATRON-CNTRL-ANT; for example, we call the antonym generations from the MEGATRON-CNTRL-355M model MEGATRON-CNTRL-355M-ANT.
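The keyword-swap step can be sketched as below. The tiny antonym table is an illustrative stand-in for a WordNet lookup, and the deterministic choice of the first antonym replaces the paper's uniform random sampling.

```python
# Hypothetical antonym table standing in for WordNet antonym lemmas.
ANTONYMS = {"small": ["large", "big"], "tired": ["rested"]}

def swap_first_antonym(keywords):
    """Replace the first keyword that has a known antonym; the paper
    samples uniformly when several antonyms exist."""
    for i, k in enumerate(keywords):
        if k in ANTONYMS:
            new = keywords.copy()
            new[i] = ANTONYMS[k][0]   # deterministic pick for illustration
            return new
    return keywords

print(swap_first_antonym(["town", "small", "park"]))
```

The story prefix up to sentence $i-1$ is then re-fed to the model together with the swapped keyword, and generation continues from there.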

Baselines
We compare our model with the following state-of-the-art story generation models. (1) Plan and write (Yao et al., 2019): The authors use an LSTM-based model to first generate a sequence of keywords for planning the story. These keywords are then used to condition the generation.
(2) Knowledge enhanced GPT-2 (Guan et al., 2020): This work is currently the SOTA for ROC story generation. It finetunes a pre-trained GPT-2 model with knowledge triples from commonsense datasets. Similar to our method, the knowledge triples are converted to sentences with templates. A multitask learning framework is then developed to further finetune the story generation task and classify corrupted stories from real ones. We do not compare to Fan et al. (2019) because Guan et al. (2020) already showed that their model significantly outperforms it. (3) GPT-2-124M: This baseline finetunes a GPT-2 model with a next-token prediction loss on the story.

Evaluation
We conduct both automatic as well as human evaluations to assess our generation.

Automatic Evaluation
We use the following metrics to compare different models. Repeat: measures the redundancy of the generated story by reporting the percentage of stories that contain at least one repeated 4-gram (Shao et al., 2019). Distinct: measures the diversity of generated stories by reporting the ratio of distinct 4-grams to all generated 4-grams. Perplexity: In the inference phase, our models involve two generation steps: (i) generate the set of knowledge sentences $R^i$ from the story context $X^{i-1}$, and (ii) generate the story sentence $s^i$ from $X^{i-1}$ and $R^i$. To report the perplexity of the conditional generator, we sample $R^i$ sequentially before generating each story sentence $s^i$ and report the total perplexity of all sentences $s^i$ for $i \in [2, l]$, where $l$ is the number of sentences in the story.
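The two n-gram metrics can be sketched as follows, assuming whitespace tokenization (the exact tokenization used for scoring is not specified here):

```python
def ngrams(tokens, n=4):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def repeat_score(stories, n=4):
    """Fraction of stories containing at least one repeated n-gram."""
    flagged = sum(1 for s in stories
                  if (g := ngrams(s.split(), n)) and len(set(g)) < len(g))
    return flagged / len(stories)

def distinct_score(stories, n=4):
    """Ratio of distinct n-grams to all generated n-grams."""
    all_grams = [g for s in stories for g in ngrams(s.split(), n)]
    return len(set(all_grams)) / len(all_grams)

stories = ["a b c d a b c d", "e f g h i"]
print(repeat_score(stories), distinct_score(stories))
```

Lower repeat and higher distinct indicate less repetitive, more diverse generations.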

Human Evaluation on Quality
We conduct human evaluations on Amazon Mechanical Turk (AMT; https://www.mturk.com/) to analyze the quality of our generations along three axes: fluency, coherence, and consistency. To evaluate fluency, we show the annotators a pair of generated stories from two models. We ask them to evaluate each sentence independently and choose the story with better overall fluency. Fluency of a story is defined as a measure of intra-sentence linguistic quality and grammatical correctness, taken over all sentences of the story. For coherence, we provide the same story pairs as in fluency but ask annotators to choose the story with better inter-sentence causal and temporal dependencies. We let the annotators choose a tie for both fluency and coherence.
Different from the fluency and coherence settings, we show only one generated story to annotators to evaluate consistency. They are required to judge whether the story is logically consistent, based on whether it self-contradicts. We set up these three evaluations as independent AMT tasks to make sure the tasks do not influence each other and introduce spurious correlations between labels. To reduce noise in our labeling process, we only accepted workers with an approval rating over 90% and more than 1k accepted jobs. We further limited the annotators to the United States. For each example, we explicitly asked them to spend at least 15 seconds evaluating coherence and 10 seconds evaluating each of the other two properties. In total, we randomly sampled 200 stories and assigned five annotators to each story. We adopted majority voting among the five annotators to make final decisions.

Human Evaluation on Controllability
To evaluate how controllable our model is, we conduct another human evaluation just for controllability. We show the annotators the start of a story, original keywords, and the corresponding generation. We then show the antonyms of the keywords, along with the corresponding generated story, and ask the annotators if the new story has changed compared to the original story in accordance with the meaning of the keyword's antonyms. The rest of the AMT settings for these experiments are the same as our consistency experiments.

Results
In this section, we first perform automatic and human evaluations with different model sizes and compare our framework to the existing baselines. We then evaluate the controllability of our model and finally show an ablation study varying the GPT-2 and BERT model sizes. The detailed configurations of the model sizes are shown in Table 2. We provide several generated stories in Appendix A.7, varying the length of the given context. We use M-CNTRL to denote MEGATRON-CNTRL in the tables due to limited space.

Table 2: Model configurations, listing the conditional generator, keyword generator, and knowledge ranker sizes for each model.

Table 3: Pairwise comparison between our models and baselines. Percentages in the format "A% - B%" indicate how often annotators rank the samples from source A better than those from source B for a given category, and vice versa. Percentage pairs do not sum to 100% because annotators were allowed to choose "tie" as being of equal quality. MEGATRON-CNTRL-124M achieves better results than all baselines. Scaling the models shows better coherence and fluency.

Table 4 shows that our smallest model, MEGATRON-CNTRL-124M, achieves better distinct and consistency scores than previous work. For repetition, our model is worse than Yao et al. (2019), which was also observed in Guan et al. (2020). The reason could be that their small 8M-parameter model is better at learning short-term statistics (e.g., 4-grams), while large models are better at learning long-term dependencies. Compared to other GPT-2 based models (i.e., GPT-2-124M and Guan et al. (2020)), MEGATRON-CNTRL-124M achieves a lower repeat and a higher distinct score; hence our model generates less repetitive stories. We notice from Table 4 that our perplexity (PPL) score is much higher than that of the other GPT-2-based models. Our hypothesis for why this occurs is that other GPT-2-based methods directly model and report $P(s^i \mid s^1, s^2, \cdots, s^{i-1})$, while our conditional generator models and reports $P(s^i \mid X^{i-1}, R^i)$. When computing perplexity, $[s^1, s^2, \cdots, s^{i-1}]$ are given ground-truth tokens, but $R^i$ and all $R$ in $X^{i-1}$ must be sampled from a distribution learned with weak supervision. This sampling introduces noise and non-determinism that results in higher perplexity. This discrepancy is not an issue when analyzing automatic evaluation metrics within our model family. When scaling our model from 124M up to 8B parameters, we see a consistent drop in PPL and an increase in distinct. This shows that larger models generate better stories with more diversity.

Automatic and Human Evaluations
Human evaluation results are presented in the last column of Table 4 (consistency) and in Table 3. Comparing MEGATRON-CNTRL-124M to Yao et al. (2019), we achieve much better coherence, fluency, and consistency scores, which shows the benefit of large pre-trained transformer models. Comparing MEGATRON-CNTRL-124M to Guan et al. (2020), which uses a similar base model, we find that fluency is similar; however, we should note that Guan et al. (2020) is not controllable, and our model has significantly better coherence (+7.0%) in Table 3 and consistency (+7.5%) in Table 4. We attribute this to the use of the retrieved knowledge, $R^i$. By explicitly providing facts pertinent to the next sentence, the conditional generative model can focus on just generating text. By comparison, a standard autoregressive GPT-2 model is tasked with predicting both the topics and the text of the next sentence.
Scaling this up, and comparing MEGATRON-CNTRL-355M to Guan et al. (2020), our model significantly outperforms in all aspects. Furthermore, a thorough comparison among MEGATRON-CNTRL-355M, MEGATRON-CNTRL-774M, MEGATRON-CNTRL-2B, and MEGATRON-CNTRL-8B shows that scaling the model size further almost always improves the quality of generation in terms of fluency, coherence, and consistency. For consistency, our best model at 8B parameters achieves a score of 93%.

Controllability Evaluation
We evaluate the controllability by changing keywords to their antonyms as detailed in Sections 3.3 & 3.5. Table 5 shows repeat and distinct for MEGATRON-CNTRL-124M as well as the controlled versions at three different sizes. Altering control with antonym keywords gives lower repeat and higher distinct scores than the original generation. As the model size increases, the repeat stays almost constant while distinct improves. These results show that changing keywords manually results in diverse, non-repetitive text.

Further supporting this hypothesis, the evaluation of controllability in Table 6 shows that MEGATRON-CNTRL-124M-ANT achieves a high controllability score of 77.5%. This means that after changing the keywords to their antonyms, 77.5% of the newly generated stories change their semantic content to follow the new antonym keywords. We also show that larger models are better able to leverage keyword control: scaling the model size from 124M to 355M and 8B further improves the controllability score to 84.5% and 91.5%, respectively. We again observe that the quality (e.g., coherence) of our controlled generation improves as the model size scales to 8B.

Ablation Studies
In this section, we conduct ablation studies on the planning strategy and external knowledge.

Planning Strategy
In this section, we investigate the effects of the planning strategy in our framework. Yao et al. (2019) showed that static planning works better than dynamic planning for LSTM-based models. To introduce static planning into our model, we predicted all the keywords and relevant knowledge sentences from the starting sentence and then generated the entire story. When we compare these generations with the stories generated by dynamic planning, we see in Table 7 (first and third rows) that dynamic planning outperforms the static planning strategy with a higher distinct score (+0.7%) and lower repetition (-3.8%). This is due to the direct guidance over each sentence provided by the knowledge retrieved through dynamic planning. In contrast, in static planning, the retrieved knowledge sentences are all predicted together at the beginning using only the starting sentence, which makes the supervision for each story sentence weaker and noisier.

External Knowledge
In this section, we investigate the importance of the retrieved knowledge. Table 7 (first and second rows) shows that, when excluding the knowledge from our framework (i.e., MEGATRON-CNTRL-124M w/o knowledge), the distinct score decreases by 0.8% and repetition increases by 3.6%, highlighting the importance of external knowledge in our approach. Unlike dynamic planning, we observe that in static planning the external knowledge does not play an important role in the quality of the generations, and using or not using the knowledge leads to similar quality. This observation also confirms that knowledge needs to be planned dynamically.

Future Work
The short story sentences in the ROC story dataset limit our exploration of several potential research directions. For example, how far does the control signal propagate in longer generations? Investigating this using longer story datasets (e.g., WRITINGPROMPTS (Fan et al., 2018)) is a subject for future work. Another interesting direction is incorporating structure-level controllability by adding it either as an extra input to the conditional generator or as a multitask learning supervision for each sequence. We also observed that in some cases our model simply mentions the given keyword in the sentence and talks about things that are not strictly related to or restricted by it. For example, the generated story of MEGATRON-CNTRL-8B in Table 15 only mentions the keyword "realize" instead of centering around it. This is caused by the RAKE keyword extractor, which does not always extract keywords that represent the sentence well. One way to mitigate this issue is to leverage longer context to identify better keywords, which we leave to future work.

Related Work
Knowledge: Incorporation of knowledge into language models has shown promising results for downstream tasks, such as factually correct generation (Logan et al., 2019), commonsense knowledge graph construction (Bosselut et al., 2019), and entity typing (Zhang et al., 2019). More recently, several works have shown that learned mechanisms for explicit or implicit knowledge incorporation can lead to state-of-the-art results in question answering (Guu et al., 2020; Lee et al., 2019; Lewis et al., 2020) and dialogue modeling (Roller et al., 2020).
Storytelling There are several different storytelling tasks described throughout the literature. Storytelling can be classified into story completion, story ending generation (Guan et al., 2019), story generation from prompts (Fan et al., 2018) or titles, and story generation from a given sentence (Guan et al., 2020). Different approaches have been developed to model the structure of stories with storylines, skeletons (Xu et al., 2018), conditional variational autoencoders (Wang and Wan, 2019), and a coarse-to-fine framework (Fan et al., 2019). Other works focus on incorporating commonsense knowledge into story generation with attention-based models (Guan et al., 2019). Recently, pre-trained language models have been finetuned on both story completion datasets and commonsense knowledge to further improve the quality of story completion (Guan et al., 2020). However, few works address the controllability of language model generation, especially for the large pre-trained models common in today's literature.
Controllable Generation Controllable text generation has a wide range of applications, including control through persona (Boyd et al., 2020), politeness (Niu and Bansal, 2018), etc. Wiseman et al. (2018) controlled generation by learning latent, discrete templates from data. Fu et al. (2019) discovered the importance of pivot words that determine sentence attributes and presented a lexical analysis framework. To control large pre-trained models, Keskar et al. (2019) demonstrated the ability to control text generation through a wide range of aspects, such as domains and links. Plug-and-play language models (Dathathri et al., 2019) also address whole-document controllability by adding a linear classifier on top of GPT-2 to predict whether generated text exhibits a particular style or property. Prabhumoye et al. (2020) provide a survey of five modules for control. Differing from these works, we control the generation through keywords backed by external knowledge.

Conclusion
In this paper, we proposed a novel framework that adds control to text generation with external knowledge. Our model first generates a set of keywords, and a knowledge retriever then queries an external knowledge base for triples related to those keywords. Based on relevance to the story context, a contextual knowledge ranker ranks the retrieved knowledge sentences and feeds the top ones to a conditional generator, which generates the next story sentence. Experimental results on the ROC story dataset showed that our model outperforms state-of-the-art models by generating less repetitive, more diverse, and more logically consistent stories. Human evaluation of the controllability of our model shows that 91.5% of generated stories are successfully controlled by changing keywords to their antonyms. In line with current trends, we also demonstrated that larger pre-trained language models consistently improve both the quality of the generated stories and controllability.

A Appendices
A.1 GPT-2 Hyperparameters: We used the BPE subword tokenizer from Radford et al. (2019) to tokenize each sentence of the ROC story dataset. The maximum sequence length is set to 1024, with learned positional embeddings. We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001. We added dropout with probability 0.1 to both the embedding layer and the multi-head attention layers. Gradients were clipped to a global norm of 5.0. We finetuned the GPT-2 models for 5 epochs and selected the checkpoint with the lowest perplexity on the validation set.
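The global-norm gradient clipping used above can be sketched in plain Python. This is an illustrative re-implementation, not the paper's code; in practice a framework utility such as PyTorch's clip_grad_norm_ would be used:

```python
import math

# Illustrative re-implementation of global-norm gradient clipping,
# with the clipping threshold of 5.0 used for GPT-2 finetuning above.
GRAD_CLIP_NORM = 5.0

def clip_by_global_norm(grads, max_norm=GRAD_CLIP_NORM):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# A gradient vector with norm 10 is scaled down to norm 5.
print(clip_by_global_norm([6.0, 8.0]))  # [3.0, 4.0]
```

Clipping by the global norm (rather than per-parameter) preserves the direction of the overall gradient while bounding its magnitude.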

A.2 BERT Hyperparameters:
We set the maximum sequence length to 512, with learned positional embeddings. We used a WordPiece tokenizer with the bert-large-uncased vocabulary. The model was also optimized with Adam with a learning rate of 0.0001, but with a weight decay of 0.01. Gradients were clipped to a global norm of 1.0. We also added dropout with probability 0.1 to the embedding layer and the multi-head attention layers. For the margin, we tried 0.1, 0.5, 1.0, and 5.0; a margin of 5.0 gave the best result.
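The margin enters a max-margin (hinge) ranking objective that pushes the score of relevant knowledge above that of irrelevant knowledge. The following is a minimal sketch of such a loss for one positive/negative pair; the exact formulation used to train our ranker may differ in detail:

```python
# Hinge (max-margin) ranking loss: the positive knowledge sentence should
# score higher than the negative one by at least `margin` (5.0 above).
def margin_ranking_loss(pos_score, neg_score, margin=5.0):
    return max(0.0, margin - (pos_score - neg_score))

# Positive already beats negative by more than the margin: zero loss.
print(margin_ranking_loss(9.0, 2.0))  # 0.0
# Scores too close: a positive loss pushes them further apart.
print(margin_ranking_loss(3.0, 2.0))  # 4.0
```

A larger margin (here 5.0) forces a wider score gap between relevant and irrelevant knowledge before the loss reaches zero.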

A.3 Model Size
In addition to analyzing the effect of scale on our conditional generative model, we also performed an ablation study on the sizes of our GPT-2-based keyword predictor and BERT-based ranker models. The results in Table 8 show that increasing the keyword predictor size from 774M to 2B parameters reduces repetition while keeping the other scores similar. Increasing the size of our contextual ranker from 336M to 1.2B parameters reduces repetition but also decreases diversity, so it is not clear which is better. We conjecture that because the positive samples R_i used to train our contextual ranker are only weakly supervised, and because we used templates to synthetically convert knowledge triples into knowledge sentences, scaling the model size may overfit to this noise. We therefore use the smaller, more computationally efficient 336M-parameter model for the ranker in all our experiments.

A.5 Input Format
For the format of R_j, we add a "SEP" string to separate the different knowledge sentences r_k^j in R_j, and an "EOK" string to denote the end of the knowledge sentences.
For the story context X_{i-1} = {x_1, x_2, ..., x_{i-1}}, where x_j = [R_j, s_j], we append an "OS" token to the end of each s_j to denote the end of the sentence. At the end of the story, an "<|endoftext|>" token is appended. We then concatenate X_{i-1} with R_i as the input to the conditional generator.
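As a concrete illustration, the concatenation described above can be sketched as follows. The helper names and example sentences are ours, not from the paper's code; the token spellings follow the description above:

```python
# Sketch of the conditional generator's input construction (helper names
# and example strings are illustrative only).

def format_knowledge(knowledge_sentences):
    """Join the ranked knowledge sentences with "SEP" and end with "EOK"."""
    return " SEP ".join(knowledge_sentences) + " EOK "

def build_generator_input(story_so_far, next_knowledge):
    """Concatenate the story context X_{i-1} with R_i.

    story_so_far: list of (knowledge_sentences, story_sentence) pairs,
                  i.e. each x_j = [R_j, s_j]; "OS" closes each sentence.
    next_knowledge: the ranked knowledge sentences R_i for the next step.
    """
    parts = []
    for knowledge, sentence in story_so_far:
        parts.append(format_knowledge(knowledge))
        parts.append(sentence + " OS ")
    parts.append(format_knowledge(next_knowledge))
    return "".join(parts)

story = [(["tire is part of car"], "she was driving in town.")]
print(build_generator_input(story, ["flat tire causes stop"]))
# tire is part of car EOK she was driving in town. OS flat tire causes stop EOK
```

The generator thus always sees each previous sentence preceded by the knowledge it was conditioned on, followed by the freshly ranked knowledge for the sentence to be generated.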

A.6 Preprocessing
We follow the implementation of Guan et al. (2020) to replace " ." with ". ". More details are provided in Section 3.1.

A.7 Computation Resources
We used up to 20 DGX-1V servers (a total of 160 Tesla V100 32GB GPUs) for our experiments. Our infrastructure is highly optimized for large-scale deep learning applications. The servers are connected via NVSwitch, with 300 GB/sec of bandwidth between GPUs inside a server and 100 GB/sec of interconnect bandwidth between servers using 8 InfiniBand adapters per server. As our story dataset is small, training the GPT-2 models usually takes less than three hours. Training the BERT model took eight hours, as we have ~15 million samples.

Guan et al. (2019): she accepted without saying a word. later, he picked out a sushi roll. [NEUTRAL] was so happy with her new sushi rolls. she had to wait for him to come back.
MEGATRON-CNTRL-124M (keywords: sure, date): she decided to go to the restaurant. she ordered a blowfish. they gave her a small grin. she had a great time.
MEGATRON-CNTRL-8B (keywords: thought, rude): [NEUTRAL] was nervous because he was worried she would be rude. but at the table, she was sweet and friendly. and she was very nice, even though she ate little sushi. [NEUTRAL] wondered if maybe blind dates were actually very wonderful!
MEGATRON-CNTRL-8B-ANT (keyword: uncertain): [NEUTRAL] was uncertain. she was not a sushi person but agreed to try it. [NEUTRAL] and her blind date ordered salmon sushi. the salmon sushi was delicious.

Guan et al. (2019): he stopped and picked up his brush. he used that brush to brush his teeth. afterwards, he was ready to ride back home. he went on his way home.
MEGATRON-CNTRL-124M (keyword: top): he got a great view from the top. his legs felt great. he climbed up the mountain and felt much better. he went back to the hill and reached the top.
MEGATRON-CNTRL-8B (keyword: top): he was at the top of the hill looking down. the weather looked beautiful. the clouds suddenly passed. the sun began to shine over the land again.
MEGATRON-CNTRL-8B-ANT (keyword: bottom): he saw a huge boulder at the bottom. he could n't believe its size. he had to climb to the top of the hill. he was afraid he would fall if he stood on top.

Context: her favorite glasses were ruined.
Guan et al. (2019): she was really upset. she had no choice but to buy new ones. she went to the mall to shop and found a great pair. she bought them and was so happy.
MEGATRON-CNTRL-124M (keyword: wore): she wore them every day. then she had a horrible accident. her sister had broken them. she had to buy new ones.
MEGATRON-CNTRL-8B (keyword: wore): she wore them daily. she got them at the shoe store. she found them and they were her favorite. after a while they got broken.
MEGATRON-CNTRL-8B-ANT (keyword: refresh): she was desperate to refresh them. she found a recipe online. she poured bleach in them. she was glad to have functional glasses.

Context: we went to a dinner and show one night. the first guy we saw was really nice. after that we asked him out. he agreed.
MEGATRON-CNTRL-124M (keyword: got): we got into a bad argument.
MEGATRON-CNTRL-8B (keyword: NONE): we had a really good time.

A.8 Human Evaluation Interface for Annotators
Below we provide the interfaces used for human evaluations (coherence, fluency, logical consistency, and controllability).