Enhancing Topic-to-Essay Generation with External Commonsense Knowledge

Automatic topic-to-essay generation is a challenging task since it requires generating novel, diverse, and topic-consistent paragraph-level text with a set of topics as input. Previous work tends to perform essay generation based solely on the given topics while ignoring massive commonsense knowledge. However, this commonsense knowledge provides additional background information, which can help to generate essays that are more novel and diverse. To fill this gap, we propose to integrate commonsense from an external knowledge base into the generator through a dynamic memory mechanism. In addition, adversarial training based on a multi-label discriminator is employed to further improve topic-consistency. We also develop a series of automatic evaluation metrics to comprehensively assess the quality of the generated essays. Experiments show that with external commonsense knowledge and adversarial training, the generated essays are more novel, diverse, and topic-consistent than those of existing methods under both automatic and human evaluation.


Introduction
Automatic topic-to-essay generation (TEG) aims at generating novel, diverse, and topic-consistent paragraph-level text given a set of topics. It not only has plenty of practical applications, e.g., benefiting intelligent education or assisting in keyword-based news writing (Leppänen et al., 2017), but also serves as an ideal testbed for controllable text generation (Wang and Wan, 2018). Despite these wide applications, progress on the TEG task lags behind other generation tasks such as machine translation (Bahdanau et al., 2014) or text summarization (Rush et al., 2015). Feng et al. (2018) are the first to propose the TEG task; they utilize a coverage vector to incorporate topic information for essay generation. However, the model performance is not satisfactory: the generated essays not only lack novelty and diversity, but also suffer from poor topic-consistency. One main reason is that the source information is extremely insufficient compared to the target output on the TEG task. We summarize the comparison of information flow between the TEG task and other generation tasks in Figure 1. In machine translation and text summarization, the source input provides enough semantic information to generate the desired target text. However, the TEG task aims to generate paragraph-level text based solely on several given topics. Extremely insufficient source information is likely to make the generated essays of low quality, both in terms of novelty and topic-consistency.

Figure 1: Toy illustration of the information volume on three different text generation tasks, which shows that the source information is extremely insufficient compared to the target output on the TEG task.
In this paper, in order to enrich the source information of the TEG task, we elaborately devise a memory-augmented neural model to incorporate commonsense knowledge effectively. The motivation is that commonsense from an external knowledge base can provide additional background information, which is of great help to improve the quality of the generated essay.

Figure 2: Incorporating commonsense knowledge into topic-to-essay generation via the dynamic memory mechanism. Given the input topics "port", "life", and "sea", concepts retrieved from ConceptNet via relations such as IsA, HasA, UsedFor, LocateAt, and Property are stored in the input memory; the dashed line indicates that the memory is dynamically updated. Example output essay: "Our life is a movement, a journey, an adventure towards a goal. Are you nearer to your port of goal today than you were yesterday? Since your ship was first sailed upon the sea of life, you have never been still for a single moment. The sea is too deep, you could not find an anchor if you would, and there can be no pause until you come into port."

Figure 2 intuitively shows an example. For the given topic "life", some closely related concepts (e.g., "adventure", "journey", "goal") are connected as a graph structure in ConceptNet¹. These related concepts form an important part of the skeleton of the essay and provide additional key information for the generation. Therefore, such external commonsense knowledge can contribute to generating essays that are more novel and diverse. More specifically, this commonsense knowledge is integrated into the generator through a dynamic memory mechanism. In the decoding phase, the model can attend to the most informative memory concepts for each word. At the same time, the memory matrix is dynamically updated to incorporate information from the generated text. This interaction between the memory and the generated text contributes to a coherent transition of topics. To enhance topic-consistency, we adopt adversarial training based on a multi-label discriminator. The discriminative signal comprehensively evaluates the coverage of the output on the given topics, making the generated essays more closely surround the semantics of all input topics.
The main contributions of this paper are summarized as follows:

• We propose a memory-augmented neural model with adversarial training to integrate external commonsense knowledge into topic-to-essay generation.
• We develop a series of automatic evaluation metrics to comprehensively assess the quality of the generated essay.
• Experiments show that our approach can outperform existing methods by a large margin. With the help of commonsense knowledge and adversarial training, the generated essays are more novel, diverse, and topic-consistent.
¹ A large-scale commonsense knowledge base.

Figure 3: The sketch of our proposed model and adversarial training. The multi-label discriminator is trained with a binary cross-entropy loss, while the generator is trained with a reward-based objective.

Proposed Model
Given a topic sequence $x = (x_1, \ldots, x_m)$ containing $m$ topics, the TEG task aims to generate a topic-consistent essay $y = (y_1, \ldots, y_n)$ containing $n$ words, where $n \gg m$. Figure 3 presents a sketch of our model and training process. The proposed model consists of a memory-augmented generator and a multi-label discriminator. We adopt adversarial training to alternately train the generator and the discriminator.

Memory-Augmented Generator
The memory-augmented generator $G_\theta$ is responsible for generating the desired essay $y$ conditioned on the input topics $x$. Figure 4 illustrates the overview of $G_\theta$, which consists of an encoder and a decoder with the memory mechanism.

Encoder: We implement the encoder as a bidirectional LSTM (Hochreiter and Schmidhuber, 1997), which aims to integrate topic information. It reads the input topic sequence $x$ from both directions and computes hidden states for each topic:
$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}\big(e(x_i), \overrightarrow{h}_{i-1}\big), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}\big(e(x_i), \overleftarrow{h}_{i+1}\big),$$
where $e(x_i)$ is the embedding of $x_i$. The final hidden representation of the $i$-th topic is
$$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i],$$
where the semicolon represents vector concatenation.
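To make the encoder concrete, the following is a minimal PyTorch-style sketch. It is our own illustration under the stated setup (200-dimensional embeddings, concatenated bidirectional states), not the authors' released code; the class and variable names are ours.

```python
import torch
import torch.nn as nn

class TopicEncoder(nn.Module):
    """Bidirectional LSTM that maps m topic words to m hidden representations."""

    def __init__(self, vocab_size: int, emb_dim: int = 200, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs forward and backward passes; their states are
        # concatenated, matching h_i = [h_i(forward); h_i(backward)].
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, topic_ids: torch.Tensor) -> torch.Tensor:
        # topic_ids: (batch, m) -> h: (batch, m, 2 * hidden)
        e = self.embed(topic_ids)
        h, _ = self.lstm(e)
        return h
```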
Decoder: External commonsense knowledge can enrich the source information, which helps generate essays that are more novel and diverse. Therefore, we equip the decoder with a memory mechanism to effectively incorporate commonsense knowledge from ConceptNet. ConceptNet is a semantic network consisting of triples $R = (h, r, t)$, meaning that head concept $h$ has relation $r$ with tail concept $t$. Since the commonsense knowledge of each topic can be represented by its neighboring concepts in the knowledge base, we use each topic as a query to retrieve $k$ neighboring concepts. The pre-trained embeddings of these concepts are stored as commonsense knowledge in a memory matrix $M_0 \in \mathbb{R}^{d \times mk}$, where $d$ is the dimension of the embedding vector.

Figure 4: The overview of our memory-augmented generator $G_\theta$. At time-step $t$, the decoder attends to the concept memory and topic representations to generate a new word. In addition, the memory matrix is dynamically updated via the adaptive gate mechanism.

In the decoding phase, the generator $G_\theta$ refers to the memory matrix for text generation. Specifically, the hidden state $s_t$ of the decoder at time-step $t$ is:
$$s_t = \mathrm{LSTM}\big([e(y_{t-1}); c_t; m_t], s_{t-1}\big), \quad (3)$$
where $[e(y_{t-1}); c_t; m_t]$ denotes the concatenation of the vectors $e(y_{t-1})$, $c_t$, and $m_t$, and $y_{t-1}$ is the word generated at time-step $t-1$. $c_t$ is the context vector computed by integrating the hidden representations of the input topic sequence:
$$c_t = \sum_{i=1}^{m} \alpha_{t,i}\, h_i, \qquad \alpha_{t,i} = \frac{\exp\big(f(s_{t-1}, h_i)\big)}{\sum_{j=1}^{m} \exp\big(f(s_{t-1}, h_j)\big)},$$
where $f(s_{t-1}, h_i)$ is an alignment model (Bahdanau et al., 2014) that measures the dependency between $s_{t-1}$ and $h_i$. $m_t$ in Eq. (3) is the memory vector extracted from $M_t$, which encodes the commonsense knowledge to assist in essay generation. Inspired by Sukhbaatar et al. (2015), we use the attention mechanism to find the entries in $M_t$ that are most relevant to the output. Formally,
$$q_t^i = \frac{\exp\big((M_t^i)^\top (W s_{t-1} + b)\big)}{\sum_{j} \exp\big((M_t^j)^\top (W s_{t-1} + b)\big)}, \qquad m_t = \sum_{i} q_t^i\, M_t^i, \quad (9)$$
where $W$ and $b$ are weight parameters, $M_t^i$ is the $i$-th column of $M_t$, and $q_t^i$ is the $i$-th value of $q_t$.

Dynamic Memory: As the generation progresses, the topic information that needs to be expressed keeps changing, which requires the memory matrix to be dynamically updated. In addition, the dynamic memory mechanism enables interaction between the memory and the generated text, which contributes to a coherent transition of topics in the generated essay. Concretely, for each memory entry $M_t^i$ in $M_t$, we first compute a candidate update $\tilde{M}_t^i$:
$$\tilde{M}_t^i = \tanh\big(U_1 M_t^i + V_1 s_t\big),$$
where $U_1$ and $V_1$ are trainable parameters. Inspired by the Highway network (Srivastava et al., 2015), we adopt an adaptive gate mechanism to determine how much the $i$-th memory entry should be updated:
$$g_t^i = \sigma\big(U_2 M_t^i + V_2 s_t\big), \qquad M_{t+1}^i = g_t^i \odot \tilde{M}_t^i + (1 - g_t^i) \odot M_t^i,$$
where $U_2$ and $V_2$ are trainable parameters, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.
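The attention read of Eq. (9) and the gated update can be sketched as follows in PyTorch. This is a minimal sketch under our own assumptions: the memory is stored row-major as an (mk, d) tensor rather than the paper's d x mk columns, W, b, U1, and V1 follow the text, while the gate parameters U2 and V2 are names we introduce for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMemory(nn.Module):
    """Attention read over the concept memory plus a highway-style gated update."""

    def __init__(self, d: int, hidden: int):
        super().__init__()
        self.W = nn.Linear(hidden, d)           # projects decoder state into memory space (W, b)
        self.U1 = nn.Linear(d, d, bias=False)   # candidate update, from the old entry
        self.V1 = nn.Linear(hidden, d)          # candidate update, from the decoder state
        self.U2 = nn.Linear(d, d, bias=False)   # gate parameters (our assumption)
        self.V2 = nn.Linear(hidden, d)

    def read(self, M: torch.Tensor, s_prev: torch.Tensor) -> torch.Tensor:
        # M: (mk, d) concept memory; s_prev: (hidden,) previous decoder state.
        q = F.softmax(M @ self.W(s_prev), dim=0)    # attention weights q_t over entries
        return q @ M                                 # m_t: weighted sum of memory entries

    def update(self, M: torch.Tensor, s_t: torch.Tensor) -> torch.Tensor:
        cand = torch.tanh(self.U1(M) + self.V1(s_t))     # candidate entries M~_t
        gate = torch.sigmoid(self.U2(M) + self.V2(s_t))  # adaptive gate g_t
        return gate * cand + (1.0 - gate) * M            # interpolate old and candidate
```

At each decoding step, `read` would supply $m_t$ before computing $s_t$, and `update` would then refresh the memory with the new state.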

Multi-Label Discriminator
The discriminator $D_\phi$ is introduced to evaluate the topic-consistency between the input topics and the generated essay, which further improves text quality. Since the source input contains a variable number of topics, we implement $D_\phi$ as a multi-label classifier that distinguishes real text with several topics from generated text. In detail, suppose there are a total of $|X|$ topics; the discriminator produces a sigmoid probability for each of $(|X| + 1)$ classes. The score at the $i$-th index ($i \in \{1, \cdots, |X|\}$) represents the probability that the sample is real text with the $i$-th topic, and the score at the $(|X| + 1)$-th index represents the probability that the sample is generated text. We implement $D_\phi$ as a CNN-based classifier (Kim, 2014) with one binary (sigmoid) output per class.
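The following is a minimal sketch of such a discriminator, assuming a Kim (2014)-style CNN; the filter sizes and counts are illustrative choices of ours, not values reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLabelDiscriminator(nn.Module):
    """CNN over an essay, with one sigmoid score per topic label plus one
    extra score for the 'generated text' class."""

    def __init__(self, vocab_size: int, num_topics: int, emb_dim: int = 200,
                 n_filters: int = 100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
        self.out = nn.Linear(n_filters * len(kernel_sizes), num_topics + 1)

    def forward(self, essay_ids: torch.Tensor) -> torch.Tensor:
        # essay_ids: (batch, n) -> per-class probabilities: (batch, |X| + 1)
        e = self.embed(essay_ids).transpose(1, 2)            # (batch, emb, n)
        pooled = [F.relu(c(e)).max(dim=2).values for c in self.convs]
        return torch.sigmoid(self.out(torch.cat(pooled, dim=1)))
```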

Adversarial Training
Inspired by SeqGAN (Yu et al., 2017), we adopt adversarial training and train the memory-augmented generator $G_\theta$ via the policy gradient method (Williams, 1992). The generator $G_\theta$ can be viewed as an agent whose state at time-step $t$ is the sequence of currently generated words $y_{1:t-1} = (y_1, \cdots, y_{t-1})$ and whose action is the prediction of the next word $y_t$. Once the reward $r(y_{1:t-1}, y_t)$ based on both state $y_{1:t-1}$ and action $y_t$ is observed, the training objective of the generator $G_\theta$ is to minimize the negative expected reward:
$$J(\theta) = -\mathbb{E}_{y_{1:n} \sim G_\theta}\Big[\sum_{t=0}^{n-1} r(y_{1:t}, y_{t+1})\Big],$$
where $G_\theta(y_{t+1} \mid y_{1:t})$ is the probability of selecting the word $y_{t+1}$ given the previously generated words. Applying the likelihood-ratio trick and sampling, we can build an unbiased estimate of the gradient of $J(\theta)$:
$$\nabla_\theta J(\theta) \simeq -\sum_{t=0}^{n-1} r(y_{1:t}, y_{t+1})\, \nabla_\theta \log G_\theta(y_{t+1} \mid y_{1:t}),$$
where $y_{t+1}$ is the sampled word. Since the discriminator can only evaluate a complete sequence, Monte Carlo search with the roll-out policy $G_\theta$ is applied to sample the unknown $n - t$ words. The final reward function is computed as:
$$r(y_{1:t-1}, y_t) = \frac{1}{N} \sum_{j=1}^{N} D\big(y_{1:n}^{j}\big), \qquad y_{1:n}^{j} \sim \mathrm{MC}^{G_\theta}(y_{1:t}; N),$$
where $N$ is the number of searches and $y_{1:n}^{j}$ is a complete sequence sampled with the roll-out policy $G_\theta$ from state $y_{1:t}$. $D(y)$ is defined as:
$$D(y) = \frac{1}{m} \sum_{i=1}^{m} D_\phi(x_i \mid y),$$
where $D_\phi(x_i \mid y)$ denotes the probability predicted by $D_\phi$ that the complete sequence $y$ belongs to topic $x_i$. $D(y)$ can be treated as a measure of the coverage of the input topics by the output; a high $D(y)$ requires the generated essay to closely surround the semantics of all input topic words. The discriminator is trained to predict all true topics by minimizing the binary cross-entropy loss:
$$\mathcal{L}(\phi) = -\sum_{i=1}^{|X|+1} \Big[ z_i \log D_\phi(x_i \mid y) + (1 - z_i) \log\big(1 - D_\phi(x_i \mid y)\big) \Big],$$
where $z_i \in \{0, 1\}$ indicates whether class $i$ is a true label of $y$. We alternately train the generator $G_\theta$ and the discriminator $D_\phi$. An overview of the training process is summarized in Algorithm 1.

Algorithm 1: Adversarial training algorithm.
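The reward and loss computations above can be sketched as follows. This is a minimal sketch under our assumptions: `rollout(prefix, n)` is a hypothetical helper that samples a complete length-n sequence from the roll-out policy, and `discriminator(full)` is assumed to return the (|X| + 1) sigmoid scores.

```python
import torch

def topic_coverage_reward(disc_probs: torch.Tensor, topic_idx: list[int]) -> torch.Tensor:
    """D(y): average discriminator probability over the input topics.
    disc_probs: (|X| + 1,) sigmoid outputs of the discriminator."""
    return disc_probs[topic_idx].mean()

def mc_reward(prefix, n, N, rollout, discriminator, topic_idx):
    """r(y_{1:t-1}, y_t): average D over N Monte Carlo completions of the prefix."""
    total = 0.0
    for _ in range(N):
        full = rollout(prefix, n)                     # assumed roll-out helper
        total = total + topic_coverage_reward(discriminator(full), topic_idx)
    return total / N

def pg_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE estimate: minimize -sum_t r_t * log G_theta(y_t | y_{1:t-1})."""
    return -(rewards.detach() * log_probs).sum()
```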

Experiments
In this section, we introduce the dataset, evaluation metrics, all baselines, and settings in detail.

Datasets
We conduct experiments on the ZHIHU corpus (Feng et al., 2018), which consists of Chinese essays whose lengths range between 50 and 100 words. We select topic words based on frequency and remove rare topic words; the total number of topic labels is set to 100. The sizes of the training set and the test set are 27,000 and 2,500, respectively. For tuning hyper-parameters, we set aside 10% of the training samples as the validation set.

Settings
We tune hyper-parameters on the validation set. We use 200-dimensional pre-trained word embeddings. The vocabulary size is 50,000 and the batch size is 64. We use a single-layer LSTM with hidden size 512 for both the encoder and the decoder. We pre-train our model for 80 epochs with the MLE method. The optimizer is Adam (Kingma and Ba, 2014) with a learning rate of $10^{-3}$ for pre-training and $10^{-5}$ for adversarial training. Besides, we use dropout (Srivastava et al., 2014) to avoid overfitting and clip the gradients (Pascanu et al., 2013) to a maximum norm of 10.
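For reference, the reported hyper-parameters can be collected into a single configuration object; the key names in this sketch are our own.

```python
# Hyper-parameters as reported above (dict keys are our own naming).
CONFIG = {
    "embedding_dim": 200,
    "vocab_size": 50_000,
    "batch_size": 64,
    "lstm_layers": 1,
    "hidden_size": 512,
    "pretrain_epochs": 80,       # MLE pre-training
    "lr_pretrain": 1e-3,         # Adam
    "lr_adversarial": 1e-5,
    "grad_clip_max_norm": 10.0,
}
```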

Baselines
We adopt the following competitive baselines: SC-LSTM (Wen et al., 2015) uses a gating mechanism to control the flow of topic information.
PNN (Wang et al., 2016) applies a planning-based neural network to generate topic-consistent text.
MTA (Feng et al., 2018) utilizes coverage vectors to integrate topic information. Their work also includes TAV, which represents topic semantics as the average of all topic embeddings, and TAT, which applies an attention mechanism to select relevant topics.
CVAE (Yang et al., 2018b) presents a conditional variational auto-encoder with a hybrid decoder to learn topics via latent variables.
Plan&Write (Yao et al., 2018) proposes a plan-and-write framework with two planning strategies to improve diversity and coherence.

Evaluation Metrics
In this paper, we adopt two evaluation methods: automatic evaluation and human evaluation.

Automatic Evaluation
The automatic evaluation of TEG remains an open and tricky question since the output is highly flexible. Previous work (Feng et al., 2018) only adopts the BLEU (Papineni et al., 2002) score, based on n-gram overlap, for evaluation. However, relying on BLEU alone is unreasonable because TEG is an extremely flexible task: there are multiple ideal essays for a given set of input topics. To remedy this, we develop a series of evaluation metrics to comprehensively measure the quality of the output from various aspects.
Consistency: An ideal essay should closely surround the semantics of all input topics. Therefore, we pre-train a multi-label classifier to evaluate the topic-consistency of the output. Given the input topics $x$, we define the topic-consistency of the generated essay $\hat{y}$ as:
$$\mathrm{Consistency}(\hat{y}) = \varphi(x, \hat{x}),$$
where $\varphi$ is the Jaccard similarity function and $\hat{x}$ is the set of topics predicted for $\hat{y}$ by the pre-trained multi-label classifier. We adopt the SGM model proposed by Yang et al. (2018a) to implement this classifier.

Novelty: The novelty of the output is reflected by its difference from the training texts. We calculate the novelty of each generated essay $\hat{y}$ as:
$$\mathrm{Novelty}(\hat{y}) = 1 - \max_{y' \in C_x} \varphi(\hat{y}, y'),$$
where $\varphi$ is the Jaccard similarity function and $C_x$ consists of training samples whose corresponding labels are similar to $x$. Formally,
$$C_x = \{\, y_i \mid \varphi(x_i, x) \geq \tau,\ (x_i, y_i) \in \text{training set} \,\},$$
where $\tau$ is a preset threshold.

Diversity: We also calculate the proportion of distinct n-grams in the generated essays (Dist-n) to evaluate the diversity of the outputs.
In addition, the BLEU scores of different systems are also reported for reference.
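The three metrics can be sketched in a few lines of Python. This is a minimal sketch assuming $\varphi$ is applied to word sets; the multi-label classifier that predicts $\hat{x}$ and the retrieval of the similar-label set $C_x$ are assumed to be available separately.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets."""
    return len(a & b) / max(len(a | b), 1)

def consistency(pred_topics: set, input_topics: set) -> float:
    """Jaccard overlap between classifier-predicted topics of the essay
    and the input topics."""
    return jaccard(pred_topics, input_topics)

def novelty(essay_words: set, similar_train_essays: list[set]) -> float:
    """One minus the largest Jaccard similarity to any training essay
    whose label set is close to the input topics (the set C_x)."""
    if not similar_train_essays:
        return 1.0
    return 1.0 - max(jaccard(essay_words, t) for t in similar_train_essays)

def dist_n(tokens: list[str], n: int) -> float:
    """Dist-n: proportion of distinct n-grams among all n-grams."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)
```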

Human Evaluation
We also perform human evaluation to more accurately assess the quality of the generated essays. Each item contains the input topics and the outputs of different models. 200 items are distributed to 3 annotators, who have no prior knowledge of which model each generated essay comes from. They are required to score each generated essay from 1 to 5 in terms of four criteria: novelty, diversity, coherence, and topic-consistency. For novelty, we use TF-IDF features to retrieve the 10 most similar training samples to provide references for the annotators.

Table 1: Results of automatic evaluation. Dist-n evaluates the diversity of the output. The best performance is highlighted in bold and "*" indicates the best result achieved by the baselines.

Results and Discussion
In this section, we report the experimental results.
Besides, further analysis is also provided.

Experimental Results
The automatic evaluation results are shown in Table 1. Our approach achieves the best performance on all metrics. For instance, the proposed model achieves an 11.85% relative improvement over the best baseline on BLEU score, demonstrating the effectiveness of our approach in improving the quality of the generated essay. More importantly, in terms of novelty, diversity, and topic-consistency, our model substantially outperforms all baselines. Table 2 presents the human evaluation results, from which we can draw similar conclusions. Our approach outperforms the baselines by a large margin, especially in terms of diversity and topic-consistency. For example, the proposed model achieves improvements of 15.33% in diversity score and 12.28% in consistency score over the best baseline. The main reason for this increase in diversity is that we integrate commonsense knowledge into the generator through the memory mechanism; this external knowledge provides additional background information, making the generated essays more novel and diverse. In addition, adversarial training is employed to increase the coverage of the output on the target topics, which further enhances the topic-consistency.

Ablation Study
To understand the importance of the key components of our approach, we perform an ablation study by training multiple ablated versions of our model: without adversarial training, without the memory mechanism, and without the dynamic update. Table 3 and Table 4 present the automatic and human evaluation results of the ablation study, respectively. All three ablations result in a decrease in model performance, indicating that both adversarial training and the dynamic memory mechanism contribute to improving the quality of the output. However, an interesting finding is that adversarial training and the memory mechanism focus on improving different aspects of the model.

Memory mechanism
We find that the memory mechanism can significantly improve novelty and diversity. As shown in Table 3 and Table 4, compared to removing adversarial training, the model exhibits larger degradation in terms of novelty and diversity when the memory mechanism is removed. This shows that external commonsense knowledge enriches the source information, leading to outputs that are more novel and diverse.
Adversarial training Another conclusion is that adversarial training is more effective than the memory mechanism at enhancing the topic-consistency of the generated essay. In detail, Table 4 shows that the human consistency scores for the ablated versions without adversarial training and without the memory mechanism decline by 0.53 and 0.31, respectively. The reason is that the discriminative signal in training not only evaluates the quality of the generated text, but also models its degree of association with the input topics, thus enhancing the topic-consistency.

Validity of Memory Module
Here we visualize the attention weights in Eq. (9) to provide a more comprehensive understanding of the memory module. Figure 5 shows the heatmap of the memory attention weights throughout the process of essay generation.
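Such a heatmap can be produced with a few lines of matplotlib; a minimal sketch, assuming the per-step attention weights $q_t$ have been collected into a concepts-by-words matrix:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_memory_attention(weights: np.ndarray, concepts: list[str]) -> None:
    """Heatmap of memory attention weights q_t over decoding steps.
    weights: (num_concepts, essay_length) matrix of attention values."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.imshow(weights, aspect="auto", cmap="viridis")
    ax.set_yticks(range(len(concepts)))
    ax.set_yticklabels(concepts)
    ax.set_xlabel("Word index of the generated essay")
    ax.set_ylabel("Memory concepts")
    fig.tight_layout()
    plt.show()
```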
The attention of coarse-grained topics According to Figure 5, in the early stage of decoding (word index 0 to 30), the generated words focus on the topic "finance". In this case, the generator pays more attention to concepts related to "finance" (area A in the heatmap). As the generation shifts its focus to the topic "career", some concepts related to "career" (area C in the heatmap) are assigned larger attention weights. This indicates that our approach can automatically select the most informative concepts based on the topic currently in focus in the generated text.
The attention of fine-grained words Figure 5 also shows that even when focusing on the same topic, our model can finely select the most relevant concepts based on the generated word. For example, when the model generates the word "finance" or "economics", it pays the most attention to the concept "economics". This further demonstrates that the memory module can provide external commonsense knowledge, which greatly benefits the generation of high-quality text.
Coherent transition between topics The dynamic memory can also enhance the coherence of the generated essay. For instance, in the output essay in Figure 5, "I want to know what can I do to enrich my knowledge and plan my future" is a transition sentence from the topic "finance" to the topic "career". When generating this sentence, the concepts of both topics (area B in the heatmap) receive a certain degree of attention. This illustrates that the dynamic interaction between the memory and the generated text makes the transition between topics smoother, thus improving the coherence of the output.

Table 5 presents the outputs of different systems with "mother" and "childhood" as input topics. As shown in Table 5, the baselines tend to generate low-quality essays. For instance, the outputs of SC-LSTM and PNN contain massive duplicate phrases. Neither MTA nor CVAE can express information about the topic "childhood". Although Plan&Write can embody information about both topics, its output is relatively incoherent and less informative. Besides, for the outputs of these baselines, there exist similar samples in the training set, indicating poor novelty. Although these baselines strive to incorporate topic information in their own ways, it is difficult to develop a coherent topic line based solely on several input topics; this limitation leads to poor coherence and topic-consistency. In contrast, the proposed model succeeds in generating novel, high-quality text that closely surrounds the semantics of all input topics. The reason is that our approach integrates commonsense knowledge into the generator through the dynamic memory mechanism. With this additional background information, our model is able to expand fully on the topics and generate a novel, coherent essay. Besides, adversarial training based on the multi-label discriminator further improves the quality of the output and enhances topic-consistency.

Related Work
Automatic topic-to-essay generation (TEG) aims to compose novel, diverse, and topic-consistent paragraph-level text for several given topics. Feng et al. (2018) are the first to propose the TEG task, utilizing a coverage vector to integrate topic information. However, the performance is unsatisfactory, showing that more effective model architectures need to be explored, which is also the original intention of our work. A similar topic-to-sequence learning task is Chinese poetry generation. Early work adopts rule- and template-based methods (Tosa et al., 2008; Yan et al., 2013). With the advent of neural networks, both Zhang and Lapata (2014) and Wang et al. (2016) employ recurrent neural networks and planning to perform generation. Yan (2016) further proposes a new generative model with a polishing schema. To balance linguistic accordance and aesthetic innovation, subsequent work adopts a memory network to choose each term from reserved inventories. Yang et al. (2018b) and others further utilize a conditional variational autoencoder to learn topic information. Yi et al. (2018) simultaneously train two generators via mutual reinforcement learning. However, unlike poetry generation, which exhibits obvious structural rules, the TEG task requires generating long unstructured plain text. Such unstructured target output tends to cause the topic drift problem, bringing severe challenges to the TEG task.
Another similar task is story generation, which aims to generate a story based on a short description of an event. Jain et al. (2017) employ statistical machine translation to explore story generation, while Lewis et al. (2018) propose a hierarchical strategy. Other work utilizes reinforcement learning to extract a skeleton of the story to promote coherence. To improve diversity and coherence, Yao et al. (2018) present a plan-and-write framework with two planning strategies to fully leverage the storyline. However, story generation and the TEG task focus on different goals: the former focuses on logical reasoning and aims to generate a coherent story with plots, while the latter strives to generate an essay with aesthetic quality based on the input topics. Besides, the source information of the TEG task is more insufficient, putting higher demands on the model.

Conclusion
This work presents a memory-augmented neural model with adversarial training for automatic topic-to-essay generation. The proposed model integrates commonsense from an external knowledge base into the generator through a dynamic memory mechanism to enrich the source information. In addition, adversarial training based on a multi-label discriminator is employed to further enhance topic-consistency. A series of evaluation metrics is also developed to comprehensively assess the quality of the generated essays. Extensive experimental results show that the proposed method outperforms competitive baselines by a large margin. Further analysis demonstrates that with external commonsense knowledge and adversarial training, the generated essays are more novel, diverse, and topic-consistent.