Language Generation with Multi-Hop Reasoning on Commonsense Knowledge Graph

Despite the success of generative pre-trained language models on a series of text generation tasks, they still suffer in cases where reasoning over underlying commonsense knowledge is required during generation. Existing approaches that integrate commonsense knowledge into generative pre-trained language models simply transfer relational knowledge by post-training on individual knowledge triples while ignoring rich connections within the knowledge graph. We argue that exploiting both the structural and semantic information of the knowledge graph facilitates commonsense-aware text generation. In this paper, we propose Generation with Multi-Hop Reasoning Flow (GRF) that enables pre-trained models with dynamic multi-hop reasoning on multi-relational paths extracted from the external commonsense knowledge graph. We empirically show that our model outperforms existing baselines on three text generation tasks that require reasoning over commonsense knowledge. We also demonstrate the effectiveness of the dynamic multi-hop reasoning module with reasoning paths inferred by the model that provide rationale to the generation.


Introduction
Despite the recent success of pre-trained language models such as GPT-2 (Radford et al., 2019) on various language generation tasks, these models still struggle on generation tasks that require reasoning over commonsense knowledge that is not explicitly stated in the context. For example, Figure 1 illustrates an example in the story ending generation task, where external commonsense knowledge in the form of relational paths can guide the generation of the key concepts "substance" and "lava" in the story ending by providing background knowledge such as (volcano, MadeOf, lava) besides the story context. Although pre-trained models have been demonstrated to possess commonsense reasoning ability (Trinh and Le, 2018) by implicitly learning some relational patterns from large-scale corpora, they do not fully utilize the commonsense knowledge bases that provide more explicit knowledge grounding.

Figure 1: An example of using structural relational knowledge as commonsense grounding in story ending generation. Blue nodes correspond to the concepts in the context, orange nodes correspond to those in the story ending, and green nodes are intermediate concepts that connect the evidence chain.
To address this defect, incorporating external commonsense knowledge to enhance models' reasoning ability has been widely explored (Ye et al., 2019; Lv et al., 2019). In language generation, previous work (Bhagavatula et al., 2020; Guan et al., 2020) transfers commonsense knowledge into pre-trained language models by utilizing triple information in commonsense knowledge bases such as ConceptNet (Speer and Havasi, 2012) and ATOMIC.
However, this approach has two drawbacks.
First, recovering knowledge triples at the post-training stage (Guan et al., 2020) hardly enables the model to utilize the encoded knowledge when fine-tuning on generation tasks that require reasoning over underlying commonsense knowledge. Second, it ignores the abundant structural relational relevance of the concepts in the knowledge graph (Guan et al., 2020; Bhagavatula et al., 2020), which may provide multiple plausible pieces of evidence for complex reasoning. Thus, a richer and more explicit way of utilizing external commonsense knowledge is to exploit both the structural and semantic information of the knowledge graph and reason over multi-hop relational paths where multiple connected triples provide chains of evidence for grounded text generation.
In this paper, we propose Generation with Multi-Hop Reasoning Flow (GRF), a generation model that performs multi-hop reasoning on the external knowledge graph for knowledge-enriched language generation. The model operates on the sub-graph extended from the concepts in the input text as commonsense knowledge grounding. It first encodes the multi-relational graph with compositional operation to obtain graph-aware representations for the concepts and the relations ( §3.2.1). Then, the multi-hop reasoning module performs dynamic reasoning via aggregating triple evidence along multiple relational paths to generate the salient concept under the context ( §3.2.3). Finally, the generation distribution combines the probability of copying concepts from the knowledge graph and that of choosing a word from the standard vocabulary with a gate control ( §3.2.4). The overall model architecture is shown in Figure 2. We conduct experiments on three commonsense-aware text generation tasks including story ending generation (Mostafazadeh et al., 2016), abductive natural language generation (Bhagavatula et al., 2020), and explanation generation for sense making . Results show that our model outperforms strong baselines on these tasks, thereby demonstrating the benefit of multi-hop commonsense reasoning in language generation.
Our contributions can be summarized as follows: 1) We propose GRF, a novel generation model that utilizes external structural commonsense knowledge to facilitate explicit commonsense reasoning in text generation. 2) We propose a dynamic multi-hop reasoning module that aggregates evidence along relational paths for the grounded generation of critical concepts. 3) We conduct extensive experiments, including automatic and human evaluation, on three commonsense-aware text generation tasks and show that our model outperforms a variety of strong baselines. We also visualize reasoning paths inferred by the model to demonstrate the effectiveness of the multi-hop reasoning module.
Related Work

Commonsense-Aware Neural Text Generation

Incorporating commonsense knowledge is essential for text generation to augment the limited textual information. In dialogue generation, Zhou et al. (2018) enriched the context representations of the post with neighbouring concepts on ConceptNet using graph attention. In story ending generation, Guan et al. (2019) proposed incremental encoding with multi-source attention to incorporate a one-hop knowledge graph for concepts in the story context. In topic-to-essay generation, Yang et al. (2019) augmented the generator with a concept memory that is updated dynamically with a gate mechanism. Recently, some work also attempted to integrate external commonsense knowledge into generative pre-trained language models such as GPT-2 (Radford et al., 2019). Guan et al. (2020) conducted post-training on synthetic data constructed from commonsense knowledge bases by translating triples into natural language texts using templates. Bhagavatula et al. (2020) transferred embeddings of COMeT (Bosselut et al., 2019), a GPT-2 model fine-tuned to generate the tail entity of a triple in a commonsense knowledge graph, into another GPT-2 model for text generation. In comparison, our model utilizes both the structural and semantic information of the commonsense knowledge graph during generation and does not suffer from the catastrophic forgetting problem (Kirkpatrick et al., 2016) caused by implicit knowledge transfer.

Multi-Hop Reasoning on Graph Structure
Performing explicit multi-hop reasoning on graph structure has been demonstrated to be an effective approach for query answering over incomplete knowledge graphs (Das et al., 2018; Chen et al., 2018; Lin et al., 2018), multi-hop question answering (Bauer et al., 2018; Qiu et al., 2019), and knowledge-grounded dialogue (Moon et al., 2019). In particular, reasoning on knowledge graphs to answer relational queries typically adopts REINFORCE to learn concrete policies that search for entities or relations. In multi-hop question answering tasks, the reasoning process is augmented with an entity graph (Qiu et al., 2019) or concept paths (Bauer et al., 2018) to enhance semantic connections among document segments. In dialogue generation, Tuan et al. (2019) modeled multiple hops on relationship graphs with a Markov transition matrix, and other work proposed a two-stage architecture that selects information from a knowledge graph before generating the response. Compared with these generation models that operate on knowledge graphs within a specific domain, our focus is to utilize general commonsense knowledge to supply evidence for text generation.

Problem Formulation
In this paper, we focus on text generation tasks where reasoning over external commonsense knowledge is required. Without loss of generality, the input source is a text sequence x = (x_1, x_2, ..., x_N) which may consist of several sentences. The output target is another text sequence y = (y_1, y_2, ..., y_M). To facilitate the reasoning process, we resort to an external commonsense knowledge graph 𝒢 = (𝒱, ℰ), where 𝒱 denotes the concept set and ℰ denotes the relations connecting these concepts. Since direct reasoning on the complete graph is intractable, we extract a sub-graph G = (V, E) given the input text, where V ⊂ 𝒱 and E ⊂ ℰ. The sub-graph consists of inter-connected H-hop paths starting from the source concepts C_x extracted from the input text. We only consider concepts with 1-gram surface texts. The task is then formulated as generating the best hypothesis ŷ which maximizes the following conditional probability:

ŷ = argmax_y P(y | x, G).   (1)

We leave the detailed sub-graph extraction process to §4.2 and describe our proposed model in the next section.

Static Multi-Relational Graph Encoding
Graph Neural Network (GNN) frameworks, such as the graph convolutional network (GCN) (Kipf and Welling, 2017) and the graph attention network (GAT) (Velickovic et al., 2018), have been shown to be effective at encoding graph-structured data by aggregating node information from local neighbours. To model the relational information in the knowledge graph, R-GCN (Schlichtkrull et al., 2018) generalizes GCN with relation-specific weight matrices but is reported to be over-parameterized (Marcheggiani and Titov, 2017; Schlichtkrull et al., 2018). We follow Vashishth et al. (2020) and use a non-parametric compositional operation φ(·) to combine the node embedding and the relation embedding. Specifically, given the input graph G = (V, E) and a GCN with L_G layers, for each node v ∈ V we update the node embedding at the (l+1)-th layer by aggregating information from its local neighbours N(v), which consist of pairs of a node u and the connecting relation r:

h_v^{l+1} = ReLU( Σ_{(u,r)∈N(v)} (1/|N(v)|) W_N^l φ(h_u^l, h_r^l) + W_S^l h_v^l ),

where h_v^0 is initialized by looking up the word embedding and h_r^0 by the relation-type embedding. W_N^l and W_S^l are two learnable weight matrices specific to the l-th layer. We define the compositional operation as φ(h_u, h_r) = h_u − h_r, inspired by the TransE model (Bordes et al., 2013). The relation embedding is also updated via another linear transformation:

h_r^{l+1} = W_R^l h_r^l.

Finally, we obtain node embeddings h_v^{L_G} and relation embeddings h_r^{L_G} that encode the static graph context for dynamic reasoning during decoding.
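To make the compositional update concrete, below is a minimal numpy sketch of one such layer. Only the composition φ(h_u, h_r) = h_u − h_r and the weight matrices W_N, W_S, W_R come from the description above; the ReLU nonlinearity and the mean over neighbours are our assumptions for illustration.

```python
import numpy as np

def comp_phi(h_u, h_r):
    # Compositional operation from the paper: phi(h_u, h_r) = h_u - h_r (TransE-style)
    return h_u - h_r

def comp_gcn_layer(node_emb, rel_emb, in_edges, W_N, W_S, W_R):
    """One compositional GCN layer (sketch).

    node_emb: dict node -> vector; rel_emb: dict relation -> vector;
    in_edges: dict node v -> list of (u, r) neighbour pairs.
    Mean aggregation and ReLU are illustrative assumptions.
    """
    new_nodes = {}
    for v, h_v in node_emb.items():
        nbrs = in_edges.get(v, [])
        agg = np.zeros_like(h_v)
        for u, r in nbrs:
            agg += W_N @ comp_phi(node_emb[u], rel_emb[r])
        if nbrs:
            agg /= len(nbrs)
        new_nodes[v] = np.maximum(0.0, agg + W_S @ h_v)  # ReLU
    # Relations are updated with a separate linear transformation
    new_rels = {r: W_R @ h_r for r, h_r in rel_emb.items()}
    return new_nodes, new_rels
```

Stacking L_G such layers yields the graph-aware embeddings used by the reasoning module.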

Context Modeling with Pre-Trained Transformer
We adopt the GPT-2 model (Radford et al., 2019), a pre-trained multi-layer transformer decoder, to model the contextual dependency of the text sequence. The input to the model is the concatenation of the source and the target sequence: s = (x_1, ..., x_N, [bos], y_1, ..., y_M). The hidden states are computed layer by layer as

h_t^0 = e_t + p_t,   h_t^l = T_block(h_{≤t}^{l−1}),  l ∈ [1, L_D],

where e_t and p_t are the token embedding vector and the positional embedding vector, and T_block is the transformer block with masked self-attention. The final hidden state at the t-th time step, h_t^{L_D}, which encodes the context information, is used as the input to the multi-hop reasoning module.
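The input layout, and the offset between target step t and its position t + N in the concatenated sequence (which reappears in the output distribution later), can be illustrated with a small helper; the function name and return values are ours, not the paper's.

```python
def build_input_sequence(source_tokens, target_tokens, bos="[bos]"):
    """Build s = (x_1, ..., x_N, [bos], y_1, ..., y_M) for the decoder.

    With 0-based indexing, x_i occupies positions 0..N-1, [bos] position N,
    and target token y_t position N + t, illustrating the t -> t + N offset
    between target steps and sequence positions.
    """
    s = list(source_tokens) + [bos] + list(target_tokens)
    n = len(source_tokens)
    target_positions = [n + t for t in range(1, len(target_tokens) + 1)]
    return s, target_positions
```

For example, with a 3-token source, y_1 sits at sequence position 4 (index 3 + 1).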

Dynamic Multi-Hop Reasoning Flow
To perform explicit reasoning on the graph structure during generation, we devise a dynamic reasoning module that utilizes both structural patterns of the knowledge graph and contextual information to propagate evidence along relational paths at each decoding step.
Specifically, the module broadcasts information on G by updating the scores of outer nodes from their visited neighbours for multiple hops until all the nodes on G are visited. Initially, nodes corresponding to the concepts in C_x are given a score of 1 while all other unvisited nodes are assigned a score of 0.
For an unvisited node v ∈ V, its node score ns(v) is computed by aggregating evidence from N_in(v), which denotes the set of visited nodes u together with the edges r that directly connect them to v. Here γ is a discount factor that controls the intensity of the information flow from the previous hops, and f(·) is the aggregator that assembles scores from connected nodes. We consider two forms of aggregators: max(·) and mean(·). We use max(·) for the main results and present the results with mean(·) in the ablation study. R(u, r, v) is the triple relevance, computed from the current decoder state h_t^{L_D} and the graph-aware embeddings of u, r, and v, which reflects the relevancy of the evidence given by the triple (u, r, v) under the current context. After H hops, the final distribution over the nodes is obtained by normalizing the node scores:

P(c_t | s_{<t}, G) = ns(c_t) / Σ_{v∈V} ns(v),   (11)

where c_t is the concept of the selected node at the t-th time step. Intuitively, the reasoning module learns to dynamically distribute probability along the paths by considering the triple evidence according to the current decoder state.
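The score-propagation scheme above can be sketched in plain Python. The triple relevances R(u, r, v) are supplied here as precomputed inputs (in the full model they depend on the decoder state), and the way a neighbour's score is combined with its triple relevance — multiplication below — is our assumption; the discount γ and the max/mean aggregator follow the description above.

```python
def multi_hop_scores(source_concepts, in_edges, relevance, hops, gamma=0.5, agg=max):
    """Sketch of the dynamic reasoning flow over the sub-graph.

    source_concepts: concepts in C_x, initialized with score 1.
    in_edges: dict node v -> list of (u, r) edges entering v.
    relevance: dict (u, r, v) -> triple relevance R(u, r, v).
    Returns a normalized score distribution over visited nodes.
    """
    ns = {c: 1.0 for c in source_concepts}
    for _ in range(hops):
        new_scores = {}
        for v, edges in in_edges.items():
            if v in ns:
                continue  # already visited
            # Combine each visited neighbour's score with its triple relevance
            evidence = [relevance[(u, r, v)] * ns[u] for (u, r) in edges if u in ns]
            if evidence:
                new_scores[v] = gamma * agg(evidence)
        ns.update(new_scores)
    total = sum(ns.values())
    return {v: s / total for v, s in ns.items()}
```

Passing agg=max reproduces the main setting; agg=lambda xs: sum(xs) / len(xs) gives the mean(·) variant used in the ablation.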

Generation Distribution with Gate Control
The final generation distribution combines the distribution over the concepts (Eq. 11) and the distribution over the standard vocabulary (Eq. 7). We use a soft gate probability g_t, which denotes whether to copy a concept, to control the weights of the two distributions, similar to the copy mechanism (Gu et al., 2016; See et al., 2017).
The final output distribution is the linear combination of the two distributions weighted by g_t and 1 − g_t respectively:

P(y_t | y_{<t}, x, G) = g_{t+N} · P(c_{t+N} | s_{<t+N}, G) + (1 − g_{t+N}) · P(s_{t+N} | s_{<t+N}),

where N is the length of the input text sequence.
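The gated mixture can be sketched as follows; the two input distributions here are toy dictionaries rather than model outputs, and a concept that also appears in the standard vocabulary simply accumulates mass from both terms.

```python
def mix_distributions(p_concept, p_vocab, gate):
    """Final output distribution with gate control:
    P(y_t) = g_t * P(concept) + (1 - g_t) * P(vocab).

    p_concept and p_vocab map tokens to probabilities; gate is g_t in [0, 1].
    """
    merged = {w: (1.0 - gate) * p for w, p in p_vocab.items()}
    for c, p in p_concept.items():
        # Concepts shared with the vocabulary accumulate probability mass
        merged[c] = merged.get(c, 0.0) + gate * p
    return merged
```

Because each input distribution sums to 1, the convex combination also sums to 1 for any gate value.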

Training and Inference
To train the proposed model, we minimize the negative log-likelihood L_gen of generating the ground-truth target sequence y_gold = (y_1, y_2, ..., y_M, [eos]). We add an auxiliary gate loss L_gate to supervise the probability of selecting a concept or a generic word. We additionally introduce a weak supervision loss L_weak to induce the predicted triple relevance to match heuristic edge labels obtained by breadth-first search from the source concepts to the concepts in y_gold on the graph. Both loss functions take the form of binary cross-entropy. We observe that both loss terms encourage the model to learn multi-hop reasoning on the graph more effectively. The final loss to be optimized is the linear combination L = L_gen + α·L_gate + β·L_weak.
During the inference stage, the input to the model is (x 1 , · · · , x N , [bos]). The model generates a token at a time and concatenates it to the input sequence to generate the next token. The generation process terminates when the special ending symbol [eos] is generated.
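The inference loop can be sketched as below, with step_fn standing in for the full model's next-token prediction (our placeholder, not part of the paper).

```python
def greedy_generate(step_fn, source_tokens, bos="[bos]", eos="[eos]", max_len=50):
    """Sketch of the inference loop: start from (x_1, ..., x_N, [bos]),
    generate one token at a time, append it to the input sequence,
    and stop when [eos] is produced (or at max_len, a safety cap we add)."""
    seq = list(source_tokens) + [bos]
    generated = []
    for _ in range(max_len):
        token = step_fn(seq)
        if token == eos:
            break
        generated.append(token)
        seq.append(token)
    return generated
```

In the actual model, step_fn would run the transformer and the reasoning module over the current sequence; beam search replaces this greedy loop in the reported experiments.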

Datasets and Metrics
The statistics of the datasets are shown in Table 1. Story Ending Generation (SEG) requires generating a reasonable ending given a four-sentence story context. The stories come from the ROCStories corpus (Mostafazadeh et al., 2016). We use the same data split as Guan et al. (2019). Abductive NLG (αNLG) requires generating an explanatory hypothesis given two observations: O_1 as the cause and O_2 as the consequence. We use the official data split from Bhagavatula et al. (2020).
Explanation Generation (EG) requires generating an explanation given a counter-factual statement for sense making. We randomly split 85% of the data as the training set, 10% as the test set, and the remaining 5% as the development set.
For automatic evaluation, we use metrics including BLEU-4 (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) to evaluate the abductive NLG and the explanation generation tasks. We follow common practice in story generation (Guan et al., 2019, 2020) and use BLEU-1/2 to evaluate the generated endings. We also adopt Distinct-n to measure the diversity of the generated endings.

Extracting Sub-Graphs as Knowledge Grounding
To supply knowledge grounding for language generation, we use ConceptNet (Speer and Havasi, 2012) as the external commonsense knowledge graph. We extract a sub-graph G = (V, E) which consists of multiple inter-connected paths starting from the source concepts C_x in the input sequence. To recognize concepts in the input text sequence, we perform fuzzy matching with the lemmatized form of the surface texts using Spacy and filter out stop words. Following Guan et al. (2019), we only consider verbs and nouns as candidate concepts, since we find the extracted graph is much noisier when all matched concepts are used.
Specifically, we iterate the following process for H hops: starting from the nodes in the current sub-graph (initialized with C_x), we search for the direct neighbours of each node and preserve the top-B nodes together with the connecting edges to enlarge the sub-graph. For each candidate node, the selection is based on its incoming degree, defined as the number of nodes in the current sub-graph that directly connect to it. Intuitively, we keep those salient concepts that are commonly visited and support information flow on the graph.

Implementation Details

Our model employs the small version of GPT-2 with 12 layers, 768-dimensional hidden states, and 12 attention heads for contextual modeling, and a 2-layer GCN model. We choose the max(·) aggregator for the main results since it yields better performance. During sub-graph extraction, we set the maximum number of hops H = 2 and preserve the top-B = 100 nodes per hop. We find this setting balances the coverage and noise of the knowledge graph. Detailed statistics of the extracted sub-graphs are presented in Table 2. To train the model, we use the Adam optimizer (Kingma and Ba, 2015) with β_1 = 0.9, β_2 = 0.999, ε = 1 × 10^{−6} and linearly decrease the learning rate to zero with no warmup. We search for the best hyper-parameters according to BLEU-4 on the development set of each task. At the inference stage, we adopt beam search decoding with a beam size of 3 for our model and all the baselines we produce. We conduct all experiments using the PyTorch framework (Paszke et al., 2017).
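The H-hop sub-graph extraction described above can be sketched as follows; the deterministic tie-breaking by concept name is our assumption for reproducibility.

```python
def extract_subgraph(source_concepts, neighbours, hops=2, top_b=100):
    """Sketch of the H-hop sub-graph extraction.

    neighbours: dict concept -> set of directly connected concepts.
    At each hop, candidate neighbours of the current sub-graph are ranked by
    incoming degree (number of current sub-graph nodes connecting to them)
    and the top-B are added to the sub-graph.
    """
    nodes = set(source_concepts)
    for _ in range(hops):
        in_degree = {}
        for u in nodes:
            for v in neighbours.get(u, ()):
                if v not in nodes:
                    in_degree[v] = in_degree.get(v, 0) + 1
        # Keep the B candidates most connected to the current sub-graph
        ranked = sorted(in_degree, key=lambda v: (-in_degree[v], v))
        nodes.update(ranked[:top_b])
    return nodes
```

With H = 2 and B = 100 this matches the setting reported for the experiments; edges of the kept nodes would be preserved alongside them in the full pipeline.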

Compared Baselines
We produce the following baselines on the three generation tasks to compare with our model: Seq2Seq is a sequence-to-sequence model based on gated recurrent units (GRU) with an attention mechanism. We also utilize the copying mechanism (Gu et al., 2016) for the model to generate out-of-vocabulary words.
GPT2-FT is a GPT-2 model fine-tuned on the task-specific dataset with its model initialization from Radford et al. (2019). GPT2-OMCS-FT is a commonsense-enhanced GPT-2 model first post-trained on the Open Mind Common Sense (OMCS) corpus, from which ConceptNet is constructed. The model is then fine-tuned on the task-specific dataset.
We also compare our model with baseline models designed for each specific task. For story ending generation, we compare with IE+GA, which is based on incremental encoding and graph attention (Guan et al., 2019), and WriterForcing, which forces the attention to focus on important keyphrases and avoids generating generic words.
For abductive NLG, we compare with two baselines introduced by Bhagavatula et al. (2020): COMeT-Txt-GPT2 which uses the output texts generated by COMeT as prefix texts to the GPT-2 model while fine-tuning, and COMeT-Emb-GPT2 which integrates the embeddings of the outputs generated by COMeT into the GPT-2 model during fine-tuning.

Automatic Evaluation
We present the automatic evaluation results on the test sets of the explanation generation and the abductive NLG tasks in Table 3. We have the following observations: I. Our model outperforms all the baselines that utilize pre-trained language models or incorporate external commonsense knowledge in terms of all evaluation metrics, indicating that incorporating the rich structural information of commonsense knowledge graphs can enhance overall generation quality.
II. Simply post-training on a commonsense knowledge source degrades performance on these two tasks. This is possibly because the triple-level post-training corpus cannot provide rich enough semantics for the model to generalize to tasks that emphasize reasoning and explaining.
For story ending generation, we present the evaluation results in Table 4. Our model outperforms all the baselines in BLEU and distinct metrics. We also observe that post-training on external commonsense data improves the generation diversity of the pre-trained language model, which accords with the findings of Guan et al. (2020). We suspect that post-training on the commonsense data enables the model to generate concepts related to the story context, which improves the text diversity.

Human Evaluation
To evaluate the fluency and the reasonability of the generated texts under the specific task settings, we conduct pair-wise comparisons with COMeT-Emb-GPT2 on αNLG, IE+GA on SEG, and the two fine-tuned GPT-2 models on all three tasks. For human evaluation, we randomly sample 100 sentences from the test set for each pair of models and obtain 1,100 sentences from five models. We recruit three annotators to state a preference among win, tie, and lose, given the input context and two outputs generated by our model and a baseline respectively, according to two criteria: fluency and reasonability. For fluency, we require the annotators to focus only on the grammatical correctness and readability of the generated results, disregarding the input context. When evaluating reasonability, the annotators are required to assess whether the generated sentence is reasonable under the given context in each task. In SEG and αNLG, annotators are asked to focus on the causal and temporal relevance between the generated results and the contexts. On EG, annotators are mainly asked to check whether the generated results properly explain the counter-factual points in the statements.
The human evaluation results are presented in Table 6, where our model significantly outperforms the compared baselines in terms of both criteria on all the datasets. Specifically, incorporating structural commonsense knowledge yields a significant improvement in generating reasonable texts given the context. Table 7 shows the inter-rater agreement, where five out of six annotations show moderate (0.4 ≤ κ < 0.6) or good (0.6 ≤ κ < 0.8) agreement. We check the annotation results and find that the GPT-2 baselines also generate story endings with good grammar, which makes it hard for the annotators to reach a high consensus when evaluating the fluency criterion of the story ending generation task (κ = 0.315).

Ablation Study
We conduct an ablation study to verify the effect of different model components. As shown in Table 5, all the components contribute to the final performance. Removing the dynamic reasoning module (w/o DMRF) results in the largest performance drop, thereby indicating that dynamic multi-hop reasoning plays a major role in this task. Ablating the graph representation module (w/o SMGE) also degrades performance, since it encodes the graph structure with relational information that benefits concept selection. We also show the results of our reasoning module with the mean(·) aggregator and observe some performance drop compared with max(·).

Table 7: Inter-rater agreement measured by Fleiss' kappa (Fleiss, 1971), which evaluates the agreement of multiple annotators in terms of fluency and reasonability.

Impact of the Size of Training Data
To demonstrate the complementary performance gain of utilizing relational paths besides textual modeling, we sample different fractions of the αNLG training data for training and evaluate on the original test set. We compare our method with knowledge-agnostic fine-tuning of the GPT-2 model and the commonsense-enhanced GPT-2 post-trained on OMCS. As shown in Figure 3, our model achieves consistent performance gains over the chosen baselines with different amounts of training data, which demonstrates the generalization ability of the proposed model with the aid of structural relational knowledge.

Effectiveness of Dynamic Multi-Hop Reasoning
We demonstrate the effectiveness of the multi-hop reasoning module through both quantitative and qualitative analysis. We investigate the impact of the hyper-parameter γ that controls the information flow in the multi-hop reasoning module. As shown in Figure 4, the best performance is obtained when γ is between 0.4 and 0.6. When γ → 0, the multi-hop reasoning module reduces to local scoring of each concept and ignores evidence accumulated along the paths. When γ → 1, the node score increases monotonically along the paths, which also hinders the model's ability to select the correct concept. Therefore, we set γ = 0.5 for all the main results of our model.
To qualitatively assess the ability of the dynamic reasoning module, we visualize a test case on αNLG with top-ranked concepts and scored reasoning paths. As shown in Figure 5, at the first hop the reasoning module starts from the source concepts "adopt" and "puppy" and assigns higher scores to neighbouring concepts that are verbs, considering the generated context. At the second hop, the module utilizes more plausible evidence along 2-hop reasoning paths and selects "play" (g_t = 0.64), which is more reasonable given both observations.

Case Study
Figure 5: Visualization of a test case with reasoning paths inferred by our model. We highlight the top-3 concepts with reasoning paths at the 1-Hop and 2-Hop reasoning steps respectively.

We provide some test cases on the three datasets in Table 8 and observe that: I. Baseline models tend to generate generic content, while the GRF is able to generate more specific concepts by exploring the plausible relations between concepts. For example, in the first case the GRF generates "expensive", which is the antonym of "cheap" under the story context. II. Baseline models sometimes fail to identify the transition of the narrative, as shown in the third case, where the GRF generates "seasick" as a plausible explanation for the transition from "cruise" to "beach". III. The GRF generates proper attributes of the source concepts in the input context with the aid of external commonsense knowledge, as shown in the last two cases of explanation generation.

Conclusion
We present Generation with Multi-Hop Reasoning Flow that reasons over structured commonsense knowledge during text generation. The proposed method leverages both the structural and semantic information of the external knowledge base by performing dynamic multi-hop reasoning on the relational paths. We conduct extensive experiments and empirically show that our method outperforms existing approaches that integrate commonsense knowledge to pre-trained language models on three text generation tasks. We also demonstrate the interpretability of our method with inferred reasoning paths that provide rationale to the generated results.

Story Ending Generation
Story Context

I wanted a simple bike for commuting. So I bought a cheap one on sale. But it didn't fit me properly. And it was uncomfortable to ride.

IE+GA
So I decided to buy a new one.

GPT2-FT
So I decided to buy a new bike.

GPT2-OMCS-FT
So I decided to buy a bike from a bike shop instead.

GRF
So I decided to get a more expensive bike.

Story Context
Ava made shakes for her kids on a hot summer day. She called them in from play, but they dallied. By time they came in, the shakes were almost melted. Ava blended in more ice cubes and refreshed them.

IE+GA
Then she went home and ate them.

GPT2-FT
Ava was proud of her kids for being so good at cooking.

GPT2-OMCS-FT
She was proud of her kids for being so thoughtful!

GRF

Her kids thanked her profusely for helping them cool off.

Observation 1
The Smith family went on a cruise for their summer vacation.

Observation 2
From then on, the Smiths went to the beach each summer instead.

GPT2-FT
The Smith family had a great time on the beach.

GPT2-OMCS-FT
The Smith family went to the beach.

COMeT-Emb-GPT2

They didn't have a nice vacation.

GRF
The Smith family got seasick on the cruise.

Observation 1
Nancy bought her dog a squeaky stuffed animal.

Observation 2
The dog had ripped the toy to shreds.

GPT2-FT
Nancy found a toy that looked like a toy.

GPT2-OMCS-FT
Nancy found a toy that looked like a toy.

COMeT-Emb-GPT2

The squeaky stuffed animal was the first to come in.

GRF

Nancy's dog scratched the stuffed animal.

Explanation Generation

Statement
Coke is made of alcohol.

GPT2-FT
Coke is a drink.

GPT2-OMCS-FT
Coke is not a liquid.

GRF
Coke is made from corn.

Statement
She cut up a blanket.

GPT2-FT
A blanket is not sharp enough to cut.

GPT2-OMCS-FT
A blanket is too small to be cut.

GRF
Blankets are too soft to be cut.

Table 8: Words in blue denote source concepts in the input contexts while words in orange are the associated concepts generated by the GRF.