CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning

Recently, large-scale pre-trained language models have demonstrated impressive performance on several commonsense-reasoning benchmark datasets. However, building machines with commonsense that can compose realistically plausible sentences remains challenging. In this paper, we present a constrained text generation task, CommonGen, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts (e.g., {dog, frisbee, catch, throw}), the task is to generate a coherent sentence describing an everyday scenario using these concepts (e.g., "a man throws a frisbee and his dog catches it"). The CommonGen task is challenging because it inherently requires 1) relational reasoning with background commonsense knowledge and 2) compositional generalization ability to work on unseen concept combinations. Our dataset, constructed through a combination of crowdsourcing and existing caption corpora, consists of 77k commonsense descriptions over 35k unique concept-sets. Experiments show that there is a large gap between state-of-the-art text generation models (e.g., T5) and human performance (31.6% vs. 63.5% in the SPICE metric). Furthermore, we demonstrate that the learned generative commonsense reasoning capability can be transferred to improve downstream tasks such as CommonsenseQA (from 76.9% to 78.4% in dev accuracy) by generating additional context.


Introduction
Commonsense reasoning, the ability to make acceptable and logical assumptions about ordinary scenes in our daily life, has long been acknowledged as a critical bottleneck of artificial intelligence and natural language processing (Davis and Marcus, 2015). Most recent commonsense reasoning challenges, such as CommonsenseQA (Talmor et al., 2019) and HellaSwag (Zellers et al., 2019b), have been framed as discriminative tasks, i.e., AI systems are required to choose the correct option from a set of choices based on a given context. While significant progress has been made on these discriminative tasks, we argue that commonsense reasoning in text generation poses a distinct complementary challenge. In this paper, we advance machine commonsense towards generative reasoning ability.

Figure 1: An example of the CommonGen task. A concept-set is a collection of objects/actions; the expected output is an everyday scenario covering all given concepts.
Concept-set: {dog, frisbee, catch, throw}
[Humans]
- A dog leaps to catch a thrown frisbee.
- The dog catches the frisbee when the boy throws it.
- A man throws away his dog's favorite frisbee expecting him to catch it in the air.
[Machines]
- GPT2: A dog throws a frisbee at a football player.
- UniLM: Two dogs are throwing frisbees at each other.
- BART: A dog throws a frisbee and a dog catches it.
- T5: dog catches a frisbee and throws it to a dog
Concept-set: {exercise, rope, wall, tie, wave}
[Humans]
- A man in a gym exercises by waving ropes tied to a wall.
- The gym owner decided to tie a rope to the wall so people could make a wave in it for exercise.
[Machines]
- GPT2: A woman is tied up in a rope and swinging a wave at a wall.
- UniLM: A man with a rope and tie is doing some exercise on a wall.
- BART: A man is tied to a rope and is waving his arms and doing exercises on the wall.

Generative Commonsense Reasoning
Humans acquire the ability to compose sentences by learning to understand and use common concepts that they recognize in their surrounding environment (Tincoff and Jusczyk, 1999). The acquisition of such an ability is regarded as a significant milestone of human development (Moore, 2013). Can machines acquire such generative commonsense reasoning ability? To initiate the investigation, we present COMMONGEN, a novel constrained generation task that requires machines to generate a sentence describing a day-to-day scene using concepts from a given concept-set. For example, in Figure 1, given a set of concepts {dog, frisbee, catch, throw}, machines are required to generate a sentence such as "a man throws a frisbee and his dog catches it in the air."

Figure 2: Two key challenges of COMMONGEN: relational reasoning with underlying commonsense knowledge about given concepts (left), and compositional generalization for unseen combinations of concepts (right). Training examples: x1 = {apple, bag, put} → y1 = "a girl puts an apple in her bag"; x2 = {apple, tree, pick} → y2 = "a man picks some apples from a tree"; x3 = {apple, basket, wash} → y3 = "a boy takes an apple from a basket and washes it." Test example: x = {pear, basket, pick, put, tree} → y = ? Reference: "a girl picks some pears from a tree and puts them in her basket."

To successfully solve the task, models need to incorporate two key capabilities: a) relational reasoning and b) compositional generalization. Grammatically sound sentences may not always be realistic, as they might violate our commonsense (e.g., "a dog throws a frisbee ..."). In order to compose a plausible sentence that describes an everyday scenario, models need to construct a grammatical sentence while adhering to and reasoning over the commonsense relations between the given concepts. Models additionally need compositional generalization ability to infer about unseen concept compounds. This encourages models to reason about a potentially infinite number of novel combinations of familiar concepts, an ability believed to be a limitation of current AI systems (Lake and Baroni, 2017; Keysers et al., 2020).
Therefore, in support of the COMMONGEN task, we present a dataset consisting of 35,141 concept-sets associated with 77,449 sentences. We explicitly design our dataset collection process to capture the key challenges of relational reasoning and compositional generalization described above, through an actively controlled crowd-sourcing process. We establish comprehensive baseline performance for state-of-the-art language generation models with both extensive automatic evaluation and manual comparisons. The best model, based on T5 (Raffel et al., 2019), achieves 31.60% in the SPICE metric, a significant gap compared to human performance of 63.50%, demonstrating the difficulty of the task. Our analysis shows that state-of-the-art models struggle at the task, generating implausible sentences, e.g., "dog throws a frisbee ...", "giving massage to a table", etc. Additionally, we show that successful COMMONGEN models can benefit downstream tasks (e.g., commonsense-centric question answering) by generating useful context as background scenarios. We believe these findings point to interesting future research directions for the community of commonsense reasoning.

Task Formulation and Key Challenges
We formulate the proposed COMMONGEN task with mathematical notations and discuss its inherent challenges with concrete examples. The input is an unordered set of k concepts x = {c_1, c_2, ..., c_k} ∈ X (i.e., a concept-set), where each concept c_i ∈ C is a common object (noun) or action (verb). We use X to denote the space of all possible concept-sets and C to denote the concept vocabulary (a subset of ConceptNet's unigram concepts). The expected output is a simple, grammatical sentence y ∈ Y that describes a common scenario in our daily life, using all given concepts in x (morphological inflections are allowed). A scenario can depict either a static situation or a short series of actions. The COMMONGEN task is to learn a function f : X → Y that maps a concept-set x to a sentence y. The unique challenges of this task come from two aspects:

Relational Reasoning with Commonsense. Expected generative reasoners should prioritize the most plausible scenarios over many other less realistic ones. As shown in Figure 2, models need to recall necessary relational commonsense facts that are relevant to the given concepts, and then reason about an optimal composition of them for generating a desired sentence. In order to complete a scenario, generative commonsense reasoners also need to reasonably associate additional concepts (e.g., 'woman', 'gym') as agents or background environments for completing a coherent scenario.
This not only requires understanding underlying commonsense relations between concepts, but also incrementally composing them towards a globally optimal scenario. The underlying reasoning chains are inherently based on a variety of background knowledge such as spatial relations, object properties, physical rules, temporal event knowledge, social conventions, etc. However, they may not be recorded in any existing knowledge bases.
Compositional Generalization. Humans can compose a sentence to describe a scenario about concepts that they may have never seen co-occurring. For example, in Figure 2, there is a testing concept-set x = {pear, basket, pick, put, tree}. The concept 'pear' never appears in the training data, and 'pick' never co-occurs with 'basket'. We, humans, can generalize from the seen scenarios in the training data and infer a plausible output ŷ = "a girl picks some pears from a tree and puts them into her basket." This compositional generalization ability via analogy, i.e., to make "infinite use of finite means" (Chomsky, 1965), is challenging for machines. The analogical challenge not only requires inference about similar concepts (e.g., 'apple' → 'pear') but also about their latent associations.

Dataset Construction and Analysis

Figure 3 illustrates the overall workflow of our data construction for the proposed COMMONGEN task. We utilize several existing caption corpora for sampling frequent concept-sets (Sec. 3.1) that reflect common scenarios. We employ AMT crowd workers to collect human-written sentences (Sec. 3.2) for the development and test sets, while carefully monitoring the quality of the crowd workers and refining the annotations dynamically. Finally, we present the statistics of the COMMONGEN dataset and an analysis of its challenges (Sec. 3.4).

Collecting Concept-Sets from Captions
It would be unreasonable to present an arbitrary set of concepts (e.g., x = {apple, fold, rope}) and ask a reasoner to generate a commonsense scenario, since such an arbitrary set of concepts can be too unrelated. Therefore, our concept-sets are supposed to reflect reasonable concept co-occurrences in everyday situations. As web images and video clips capture diverse everyday scenarios, we use their caption text as a natural resource for collecting concept-sets and their corresponding descriptions of commonsense scenarios. More specifically, we collect visually-grounded sentences from several existing caption datasets, including image captioning datasets such as Flickr30k (Young et al., 2014), MSCOCO (Lin et al., 2014), and Conceptual Captions (Sharma et al., 2018), as well as video captioning datasets including LSMDC (Rohrbach et al., 2017), ActivityNet (Krishna et al., 2017), and VATEX (Wang et al., 2019b).
We first conduct part-of-speech tagging over all sentences in the corpora such that words in sentences can be matched to the concept vocabulary of ConceptNet. Then, we compute the sentence frequency of concept-sets consisting of 3∼5 concepts. That is, for each combination of three/four/five concepts in the vocabulary, we know how many sentences are in the corpora covering all concepts.
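A rough sketch of this matching-and-counting step is shown below; the tagger output format and the tiny vocabulary are illustrative, not the paper's actual pipeline:

```python
from itertools import combinations
from collections import Counter

def concept_set_frequencies(tagged_sentences, vocab, k_range=(3, 4, 5)):
    """Count, for each k-combination of vocabulary concepts appearing in a
    sentence, how many sentences cover all k concepts.

    tagged_sentences: lists of (lemma, pos) tuples from a POS tagger.
    vocab: set of ConceptNet unigram concepts (nouns/verbs).
    """
    freq = Counter()
    for sent in tagged_sentences:
        # keep content words that match the concept vocabulary
        concepts = sorted({lemma for lemma, pos in sent
                           if pos in ("NOUN", "VERB") and lemma in vocab})
        for k in k_range:
            for combo in combinations(concepts, k):
                freq[combo] += 1
    return freq

vocab = {"dog", "frisbee", "catch", "throw", "park"}
sents = [
    [("dog", "NOUN"), ("catch", "VERB"), ("frisbee", "NOUN")],
    [("man", "NOUN"), ("throw", "VERB"), ("frisbee", "NOUN"), ("dog", "NOUN")],
]
freq = concept_set_frequencies(sents, vocab)
print(freq[("catch", "dog", "frisbee")])  # 1
```

In the real pipeline, the tagging and lemmatization would come from a standard NLP toolkit, and the counts would be accumulated over all six caption corpora.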
Ideally, we want the selected concept-sets in our dataset to reflect the natural distribution of concept-sets in the real world. At first glance, a reasonable solution may seem to be sampling concept-sets according to their frequencies in the source datasets. However, we find that this method leads to a rather unnaturally skewed collection of concept-sets, due to the inherent data biases of the source datasets. We therefore design a function to score a concept-set x based on scene diversity and an inverse frequency penalty. Denoting by S(x) the set of unique sentences that contain all given concepts {c_1, c_2, ..., c_k}, we have

score(x) = |S(x)| · ( |⋃_{s ∈ S(x)} {w | w ∈ s}| / Σ_{s ∈ S(x)} |s| ) · ρ(x),  where  ρ(x) = |X| / max_{c_i ∈ x} |{x' | c_i ∈ x' and x' ∈ X}|.

The first term in the score is the number of unique sentences covering all given concepts in x, and the second term represents the diversity of the scenes described in these sentences. The last term, ρ(x), is the inverse frequency penalty. Specifically, we find the concept in x that has the maximum "set frequency" (i.e., the number of unique concept-sets containing that concept), and take its inverse, multiplied by the number of all concept-sets for normalization. This penalty based on inverse set-frequency effectively controls the bias towards highly frequent concepts. With the distribution of such scores over concept-sets, we sample our candidate examples for the next steps.
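A minimal implementation of such a scoring function might look as follows; the scene-diversity term here (ratio of unique tokens to total tokens over the covering sentences) is one plausible instantiation of the description above, not necessarily the authors' exact formula:

```python
def rho(x, all_concept_sets):
    """Inverse-frequency penalty: |X| divided by the set frequency of the
    most frequent concept in x."""
    max_set_freq = max(
        sum(1 for xp in all_concept_sets if c in xp) for c in x)
    return len(all_concept_sets) / max_set_freq

def score(x, sentences, all_concept_sets):
    """x: a concept-set (frozenset of concepts).
    sentences: tokenized corpus sentences (lists of lemmas)."""
    S = [s for s in sentences if x <= set(s)]  # sentences covering all of x
    if not S:
        return 0.0
    total = sum(len(s) for s in S)
    unique = len({w for s in S for w in s})
    diversity = unique / total                 # scene-diversity proxy
    return len(S) * diversity * rho(x, all_concept_sets)
```

Candidate concept-sets would then be sampled with probability proportional to these scores.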

Crowd-Sourcing References via AMT
In order to ensure the best quality, the references for the evaluation examples are crowdsourced from workers on Amazon Mechanical Turk, amounting to 10,060 references over 2.5k distinct concept-sets. Note that these newly collected references for the dev and test examples ensure a fair comparison targeting generalization, considering potential data leakage (i.e., recent pre-trained language models might have seen the caption datasets). Each concept-set was assigned to at least 3 workers. In addition to references for the given concept-sets, we also ask the workers to provide rationale sentences explaining which commonsense facts they used, to ensure that the described scenarios are common in daily life (example rationales are shown in Fig. 9). We control the quality by actively filtering workers who produced low-quality references, removing their annotations, and re-opening the slots only to qualified workers. There were 1,492 accepted workers in total and 171 disqualified workers after the active filtering. We use three criteria to efficiently narrow down candidates for manually removing low-quality workers: 1) concept coverage via part-of-speech tagging, 2) especially high perplexity, and 3) the length of the rationales. Meanwhile, we also dynamically replaced concept-sets for which the majority of the references did not make sense, to ensure the final quality.

Down-Sampling Training Examples
In order to evaluate the compositional generalization ability, we down-sample the remaining candidate concept-sets to construct a distantly supervised training dataset (i.e., using caption sentences as the human references). We explicitly control the overlap of concept-sets between the training examples and the dev and test examples. The basic statistics of the final dataset are shown in Table 1. There are on average four sentences for each example in the dev and test sets, which provides a richer and more diverse test-bed for automatic and manual evaluation. Table 1 also shows the ratio of unseen concept compositions (i.e., concepts, concept-pairs, and concept-triples) in the dev and test sets. Notably, all pairs of concepts in every test concept-set are unseen in the training data and thus pose a challenge for compositional generalization.
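The unseen-composition statistic can be computed along these lines (a sketch with toy data; the real dataset reports the same ratio for single concepts and concept-triples as well):

```python
from itertools import combinations

def unseen_pair_ratio(train_sets, eval_sets):
    """Fraction of concept pairs in the eval concept-sets that never
    co-occur in any training concept-set."""
    seen = {frozenset(p)
            for x in train_sets for p in combinations(sorted(x), 2)}
    eval_pairs = {frozenset(p)
                  for x in eval_sets for p in combinations(sorted(x), 2)}
    unseen = sum(1 for p in eval_pairs if p not in seen)
    return unseen / len(eval_pairs)

train = [{"apple", "bag", "put"}, {"apple", "tree", "pick"}]
test = [{"pear", "basket", "pick"}]
print(unseen_pair_ratio(train, test))  # 1.0: no test pair co-occurs in training
```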

Analysis of Underlying Common Sense
We now present a deeper analysis of the dataset by utilizing the largest commonsense knowledge graph (KG), ConceptNet (Speer et al., 2017), as a tool to study connectivity and relation types.

Connectivity Distribution. If the concepts inside a given concept-set are more densely connected with each other on the KG, then it is likely to be easier to write a scenario about them. In each 5-size concept-set (i.e., a concept-set consisting of five concepts), there are 10 unique pairs of concepts, whose connections we are interested in. As shown in Figure 4, if we look at the one-hop links on the KG, about 60% of the 5-size concept-sets have less than one link among all concept-pairs. On the other hand, if we consider two-hop links, then nearly 50% of them are almost fully connected (i.e., each pair of concepts has a connection). These two observations together suggest that COMMONGEN has a reasonable difficulty: the concepts are neither too distant nor too close, and thus the inputs are neither too difficult nor too trivial.
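A sketch of how such connectivity statistics can be computed, using a toy adjacency map in place of the full ConceptNet graph:

```python
from itertools import combinations

def hop_distance(graph, a, b, max_hops=2):
    """Shortest path length between concepts a and b (BFS), or None if
    farther than max_hops. graph: dict mapping concept -> set of neighbors."""
    frontier, dist, visited = {a}, 0, set()
    while frontier and dist <= max_hops:
        if b in frontier:
            return dist
        visited |= frontier
        frontier = {n for u in frontier
                    for n in graph.get(u, ()) if n not in visited}
        dist += 1
    return None

def connected_pair_fraction(graph, concept_set, max_hops):
    """Fraction of concept pairs linked within max_hops hops on the KG."""
    pairs = list(combinations(sorted(concept_set), 2))
    linked = sum(1 for a, b in pairs
                 if hop_distance(graph, a, b, max_hops) is not None)
    return linked / len(pairs)

# toy KG fragment (undirected adjacency, relation labels omitted)
kg = {"dog": {"frisbee"}, "frisbee": {"dog", "throw"},
      "throw": {"frisbee", "catch"}, "catch": {"throw"}}
print(connected_pair_fraction(kg, {"dog", "frisbee", "throw", "catch"}, 1))  # 0.5
print(connected_pair_fraction(kg, {"dog", "frisbee", "throw", "catch"}, 2))
```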
Relation Distribution. Furthermore, the relation types of such connections can tell us what kinds of commonsense knowledge are potentially useful for relational reasoning towards generation. We report the frequency of different relation types[2] of the one/two-hop connections among concept-pairs in the dev and test examples in Fig. 8. To better summarize the distributions, we categorize these relations into five major types and present their distributions in Table 2, respectively for one-hop and two-hop connections between concept pairs.

Methods
We briefly introduce the baseline methods that are tested on the COMMONGEN task.
Encoder-Decoder Models. Bidirectional RNNs and Transformers (Vaswani et al., 2017) are the two most popular architectures for seq2seq learning. We use both with the addition of an attention mechanism (Luong et al., 2015) and copying ability (Gu et al., 2016), based on the open-source framework OpenNMT-py (Klein et al., 2017); we denote them as bRNN-CopyNet and Trans-CopyNet respectively. To alleviate the influence of concept ordering in such sequential learning methods, we randomly permute the input concepts multiple times for training and decoding and then average the performance. To explicitly eliminate the order-sensitivity of inputs, we replace the encoder with a mean-pooling-based MLP network (MeanPooling-CopyNet).

Non-Autoregressive Generation. Recent advances (Lee et al., 2018; Stern et al., 2019) in conditional sentence generation have brought an emerging interest in (edit-based) non-autoregressive generation models, which iteratively refine generated sequences. We assume that these models could potentially perform better because of their explicit modeling of iterative refinement, and thus study the most recent such model, the Levenshtein Transformer (LevenTrans) by Gu et al. (2019). We also include a recently enhanced version, ConstLeven (Susanto et al., 2020), which incorporates lexical constraints into LevenTrans.

Pre-trained Language Generation Models. We also employ various pre-trained language generation models, including GPT-2, BERT-Gen, UniLM, BART, and T5 (Raffel et al., 2019), to tackle this task and test their generative commonsense reasoning ability. We fine-tune all the above models on our training data in a seq2seq format.

[2] Relation definitions are listed at https://github.com/commonsense/conceptnet5/wiki/Relations.
Specifically, to use GPT-2 for this sequence-to-sequence task, we condition the language model on the format "c_1 c_2 ... c_k = y" during fine-tuning, where each c_i is a concept in the given concept-set, concepts are separated by blanks, and y is a target sentence. For inference, we sample from the fine-tuned GPT-2 model after the prompt "c_1 c_2 ... c_k =" with beam search and use the first generated sentence as the output. For BERT-Gen, we use the s2s-ft package to fine-tune it in a sequence-to-sequence fashion similar to the LM objective employed by UniLM.
As for T5, the state-of-the-art text-to-text pre-trained model, which is pre-trained with a multi-task objective by prepending a task description before the input text, we prepend the input concept-set with a simple prompt, "generate a sentence with:", and fine-tune the model with source sentences of the form "generate a sentence with c_1 c_2 ... c_k." For decoding, we employ standard beam search with a beam size of 5 for all compared models. We also report their results with a lexically-constrained decoding method, dynamic beam allocation (DBA) (Post and Vilar, 2018), which does not show improvement over conventional beam search.[4]

Table 3: Experimental results of different baseline methods on the COMMONGEN test set. The first group of models are non-pretrained models, while the second group are large pretrained models that we have fine-tuned. The best models are in bold and the second best are underlined within each metric. We highlight the metrics used in our official leaderboard. (Results on the dev set are in Table 7.)
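The input formats described for GPT-2 and T5 can be sketched as follows; the exact separators and prompt wording are our reading of the description above, not verified against the released code:

```python
def gpt2_format(concepts, target=None):
    """GPT-2 LM format: "c1 c2 ... ck = y" for fine-tuning; the prefix up
    to "=" serves as the inference prompt."""
    prompt = " ".join(concepts) + " ="
    return f"{prompt} {target}" if target is not None else prompt

def t5_format(concepts):
    """T5 text-to-text format with a simple task prompt."""
    return "generate a sentence with: " + " ".join(concepts) + "."

print(gpt2_format(["dog", "frisbee", "catch", "throw"]))
# dog frisbee catch throw =
print(t5_format(["dog", "frisbee", "catch", "throw"]))
# generate a sentence with: dog frisbee catch throw.
```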

Evaluation
We first introduce the automatic evaluation metrics, then present main experimental results with manual analysis, and finally introduce the potential application in transferring CommonGen-trained models for other downstream tasks.

Metrics
Following other conventional generation tasks, we use several widely-used automatic metrics to assess performance, such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which mainly measure surface similarities. We also report concept Coverage, the average percentage of input concepts that are present in the lemmatized outputs.
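A minimal sketch of the Coverage metric; the suffix-stripping lemmatizer below is a crude stand-in for a real one (e.g., spaCy's):

```python
def coverage(concepts, output, lemmatize=None):
    """Fraction of input concepts present in the (lemmatized) output.
    `lemmatize` defaults to a crude suffix-stripping rule for illustration."""
    if lemmatize is None:
        def lemmatize(w):
            for suf in ("ing", "es", "ed", "s"):
                if w.endswith(suf) and len(w) > len(suf) + 2:
                    return w[: -len(suf)]
            return w
    lemmas = {lemmatize(tok) for tok in output.lower().split()}
    covered = sum(1 for c in concepts if lemmatize(c) in lemmas)
    return covered / len(concepts)

print(coverage(["dog", "frisbee", "catch", "throw"],
               "A man throws a frisbee and his dog catches it"))  # 1.0
```

Note that this counts morphological inflections ("throws", "catches") as covering the base concepts, matching the task definition.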
In addition, we argue that it is more suitable to use evaluation metrics specially designed for captioning tasks, such as CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016). These metrics assume that system generations and human references use similar concepts, and thus focus on evaluating the associations between mentioned concepts instead of n-gram overlap. For example, the SPICE metric uses dependency parse trees as a proxy for scene graphs to measure the similarity of scenarios.

[4] The hyper-parameters used are reported in the appendix.

To estimate human performance under each metric, we treat each reference sentence in the dev/test data as a "system prediction" and compare it against all other references, which is equivalent to computing inter-annotator agreement under each metric. Thus, systems with better generative ability than the average crowd worker should exceed this estimate.
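The human-performance estimate can be sketched as follows; token-overlap F1 is used here only as a placeholder for the real metrics (SPICE, CIDEr, etc.), which are far more involved:

```python
def token_f1(pred, ref):
    """Toy surface-similarity metric (token-overlap F1); a stand-in for
    SPICE/CIDEr, which this sketch does not implement."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def human_estimate(references, metric=token_f1):
    """Treat each reference as a 'prediction' scored against the remaining
    references (multi-reference: take the max), then average."""
    scores = []
    for i, pred in enumerate(references):
        others = references[:i] + references[i + 1:]
        scores.append(max(metric(pred, ref) for ref in others))
    return sum(scores) / len(scores)

refs = ["a dog catches a frisbee",
        "the dog catches the thrown frisbee",
        "a man throws a frisbee to his dog"]
print(round(human_estimate(refs), 3))
```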

Experimental Results
Automatic Evaluation. Table 3 presents the experimental results on a variety of metrics. We can see that all fine-tuned pre-trained models (the lower group) outperform the non-pretrained models (the upper group) by a significant margin. This is not surprising because their pretraining objectives, including masked language modeling, word ordering, and text infilling (predicting missing words or text spans), are relevant to our task. On the other hand, we find that the key disadvantage of non-pretrained models with CopyNet is still the failure to use all given concepts (i.e., low coverage), which results in worse performance. Among the pre-trained models, UniLM, BART, and T5 perform the best, which may be due to their inherent sequence-to-sequence pre-training frameworks. We find that BART has the best concept coverage, which is probably due to its comprehensive pre-training tasks that aim to recover noised text. These results suggest that further modifying pre-trained models is a promising direction for generative commonsense.

Figure 5: Example generations for the concept-set {hand, sink, soap, wash}.
[Machines]
- [model name missing]: hands washing soap on the sink.
- BERT-Gen: a woman washes her hands with a sink of soaps.
- UniLM: hands washing soap in the sink
- BART: a man is washing his hands in a sink with soap and washing them with hand soap.
- T5: hand washed with soap in a sink.
[Humans]
1. A girl is washing her hands with soap in the bathroom sink.
2. I will wash each hand thoroughly with soap while at the sink.
3. The child washed his hands in the sink with soap.
4. A woman washes her hands with hand soap in a sink.
5. The girl uses soap to wash her hands at the sink.
Manual Evaluation. We conduct a manual evaluation focused on commonsense plausibility, comparing the 6 best-performing models in Table 4. We ask five graduate students to each compare 1,500 pairs of model-generated sentences, ranking the models within 100 concept-sets that are covered by all the models. The final average ranking results are shown in Table 4; the inter-annotator agreement is 0.85 in Kendall's rank correlation coefficient.
Note that the coverage-weighted hit@1 rate correlates with the SPICE metric the most, i.e., 0.94 in Spearman's ρ for model ranks, while METEOR and ROUGE-2 are both at 0.88 and BLEU-4 at 0.78.
Case Study. Fig. 5 shows the top generations of different models and human references for the input concept-set {hand, sink, soap, wash} (more cases are shown in Fig. 9 in the appendix). We find that non-pretrained seq2seq models (e.g., bRNN, MeanPooling, ConstLeven) can successfully use part of the given concepts, while the generated sentences are less meaningful and coherent. In contrast, the outputs of fine-tuned pre-trained language models are significantly more commonsensical, and most of them use all given concepts. ConstLeven tends to exploit frequent patterns to compose a nonsensical sentence that nonetheless uses all concepts. GPT-2 and UniLM incorrectly compose the dependencies among hand, wash, and soap. The phrase 'a sink of soaps' in BERT-Gen's output makes it less natural. BART and T5 generate relatively reasonable scenarios, but neither is as natural as the human references: BART's output contains repetitive content, while T5's lacks a human agent.

Figure 6: Learning curve for the transfer study. We use several trained COMMONGEN (CG) models to generate choice-specific context for the CSQA task. Detailed numbers are shown in Tab. 8 in the appendix.

Influence of Dynamic Beam Allocation. Considering that all tested models decode sentences with beam search, one may wonder what happens if we use a decoding method specially designed for constrained decoding. Thus, we employ dynamic beam allocation (DBA) (Post and Vilar, 2018). The results are shown in Table 5; the models are the same as in Table 3, with only the decoding method changed to DBA. We can see that all methods are negatively impacted by this decoding method. This suggests that, for the COMMONGEN task and pre-trained language models, knowledge-based decoding or re-ranking may be better future directions.

Transferring CommonGen Models
One may wonder how fine-tuned COMMONGEN models can benefit commonsense-centric downstream tasks, such as Commonsense Question Answering (CSQA; Talmor et al., 2019), with their generative commonsense reasoning ability. To this end, we use the models trained on the COMMONGEN dataset to generate useful context. We extract the nouns and verbs in the question and in each choice, and combine the concepts of the question q with those of each choice c_i to build five concept-sets. Then, we feed these concept-sets into a trained COMMONGEN model (e.g., T5) to generate a scenario sentence g_i for each, serving as choice-specific context. Finally, we prepend the generated output to the question, i.e., "<s>G: g_i | Q: q </s> C: c_i </s>". Note that the state-of-the-art RoBERTa-based models for CSQA use the same format without "G: g_i |" in fine-tuning.
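A sketch of this context-construction step; `build_csqa_inputs` and `fake_generate` are illustrative names, and the real pipeline plugs in a fine-tuned CommonGen model and a proper POS-based concept extractor:

```python
def build_csqa_inputs(question, question_concepts, choices, generate):
    """For each answer choice, merge question and choice concepts into a
    concept-set, generate a context sentence g_i with a CommonGen model,
    and assemble the RoBERTa-style input string described above.

    `generate` stands in for a trained CommonGen model (e.g., fine-tuned T5).
    """
    inputs = []
    for choice_text, choice_concepts in choices:
        concept_set = sorted(set(question_concepts) | set(choice_concepts))
        g = generate(concept_set)  # choice-specific context sentence
        inputs.append(f"<s>G: {g} | Q: {question} </s> C: {choice_text} </s>")
    return inputs

# hypothetical stand-in for a fine-tuned CommonGen generator
fake_generate = lambda cs: "people work to " + " ".join(cs)
out = build_csqa_inputs(
    "What do people aim to do at work?",
    ["aim", "work"],
    [("complete job", ["complete", "job"])],
    fake_generate,
)
print(out[0])
```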
We show the learning-efficiency curve in Fig. 6, where y is the accuracy on the official dev set and x is the number of training steps. The details of the experiments are shown in the appendix.
We highlight the performance of the original RoBERTa-Large model as the baseline. We find that some CommonGen models further improve the performance by a large margin (e.g., from 76.9% to 78.4% with UniLM) and converge at better accuracy in the end. Note that BERT-Gen and ConstLeven cause negative transfer due to the low quality of their generated contexts. In particular, we find that the context generated by the T5-based CommonGen model (CG-T5) speeds up training by about 2 times, comparing the 550th step of CG-T5 (74.85%) with the 1,250th step of the original RoBERTa (74.77%).
Through manual analysis, we find that successful COMMONGEN models can generate more reasonable and natural sentences for correct choices, and noisy sentences for wrong choices. For example, with CG-T5, for q = "What do people aim to do at work?", the correct choice c_i = 'complete job' yields g_i = "people work to complete a job aimed at achieving a certain goal.", while the wrong choice c_j = 'wear hats' yields g_j = "people wearing hats aim their guns at each other while working on a construction site." (question concepts and choice concepts are underlined in the original figure).

Related Work

Constrained Text Generation. Constrained text generation aims to decode sentences with expected attributes, such as style (Fu et al., 2018; Luo et al., 2019b; Li et al., 2018), topics (Feng et al., 2018), etc. Two settings related to our task are lexically constrained decoding and word ordering (Zhang and Clark, 2015; Hasler et al., 2018; Dinu et al., 2019; Hokamp and Liu, 2017; Puduppully et al., 2017; Miao et al., 2019). However, these methods are not easily adopted by recent pre-trained language models and thus not directly useful for our task. Topical story generation (Fan et al., 2018; Yao et al., 2019) is also a related direction, though it targets generating longer, creative stories around given topics, making it hard to adopt directly for our task. Additionally, the COMMONGEN task brings further challenges, mentioned in Section 2, that prior constrained generation methods cannot address together in a unified model.

Incorporating Commonsense for NLG. A few recent works incorporate commonsense knowledge in language generation tasks such as essay generation (Guan et al., 2019; Yang et al., 2019a), image captioning (Lu et al., 2018), video storytelling, and conversational systems (Zhang et al., 2020a). These works suggest that generative commonsense reasoning has great potential to benefit downstream applications. Our proposed COMMONGEN is, to the best of our knowledge, the very first constrained sentence generation dataset for assessing and conferring generative machine commonsense, and we hope it can benefit such applications.
Our transferring study in Sec. 5.3 also shows the potential benefits of CommonGen-generated contexts.

Conclusion
Our major contributions in this paper are threefold:
• we present COMMONGEN, a novel constrained text generation task for generative commonsense reasoning, with a large dataset;
• we carefully analyze the inherent challenges of the proposed task, i.e., a) relational reasoning with latent commonsense knowledge, and b) compositional generalization;
• our extensive experiments systematically examine recent pre-trained language generation models (e.g., UniLM, BART, T5) on the task, finding that their performance is still far from human level, as they generate grammatically sound yet realistically implausible sentences.
Our study points to interesting future research directions on modeling commonsense knowledge in the language generation process, towards conferring machines with generative commonsense reasoning ability. We hope COMMONGEN will also benefit downstream NLG applications such as conversational systems and storytelling models.

A Supplementary Figures and Tables
We include additional figures and tables that we mentioned in the main content here.
• Figure 8 shows the detailed distribution of the commonsense relations between given concepts, the summary of which was shown in Table 2 of the main content. • Figure 9 presents 4 more case studies with human rationales which we asked our crowd workers to provide. • Figure 7 shows instructions and AMT interface for crowd-sourcing human references. • Table 7 shows the model performances on the dev set of COMMONGEN, as a reference for future development. • Table 8 is the full results of the learning curve in Figure 5. We highlight the highest checkpoints and the speed-up by the CG-T5, which are discussed in Section 5.3.  BERT-gen, UniLM, UniLMv2 are all based on their official source code 10 . The  are both adopted by the huggingface transformers 11 framework (Wolf et al., 2019). All models use beam searching as their decoding algorithms and beam-size are mostly 5, which is selected from {5, 10, 20}. All our models were trained on Quadro RTX 6000 GPUs. The training time of X-CopyNet and LevenTrans models are less than 12 hours with a single GPU. The second group of models are trained between 12 and 24 hours, expect for T5large, which we used 3 GPUs and fine-tuned about 48 hours. Note that all the above methods are self-contained in our submitted code as long as users follow the associated readme instructions.

Transfer study experiments. We use the same hyper-parameters that were searched for the baseline RoBERTa-Large model. The best hyper-parameters of RoBERTa-Large for CommonsenseQA are:
• batch size = 16, learning rate = 1e-5
• maximum updates = 3,000 (∼5 epochs)
• warmup steps = 150, dropout rate = 0.1
• weight decay = 0.01, adam epsilon = 1e-6
We tried 10 random seeds and used the best one (42). Then, we followed the steps described in Sec. 5.3 to run the other CG-enhanced models with the same hyper-parameters. This suggests that further hyper-parameter search may yield even better performance.

[10] https://github.com/microsoft/unilm
[11] https://github.com/huggingface/

Table 7: Experimental results of different baseline methods on the COMMONGEN dev set. The first group of models are non-pretrained models, while the second group are large pretrained models that we have fine-tuned. The best models are in bold and the second best are underlined within each metric.

Figure 7: Our annotation interface on the AMT platform. The upper part is the instruction for the annotators, where we provide an example. Note that we give part-of-speech hints (from the caption corpora) to speed up annotation, but we do not remove sentences with other parts of speech as long as they also make sense.