CosMo: Conditional Seq2Seq-based Mixture Model for Zero-Shot Commonsense Question Answering

Commonsense reasoning refers to the ability to evaluate a social situation and act accordingly. Identifying the implicit causes and effects of a social context is the key capability that enables machines to perform commonsense reasoning. The dynamic world of social interactions requires context-dependent, on-demand systems to infer such underlying information. However, current approaches in this realm cannot perform commonsense reasoning when facing an unseen situation, mostly because they fail to identify a diverse range of implicit social relations, and hence fail to estimate the correct reasoning path. In this paper, we present the Conditional Seq2Seq-based Mixture model (CosMo), which provides dynamic and diverse content generation. We use CosMo to generate context-dependent clauses, which form a dynamic Knowledge Graph (KG) on-the-fly for commonsense reasoning. To show the adaptability of our model to context-dependent knowledge generation, we address the task of zero-shot commonsense question answering. The empirical results indicate an improvement of up to +5.2% over the state-of-the-art models.


Introduction
People understand narratives of everyday life by capitalising on their commonsense knowledge. They can easily reason about unobserved causes and effects in relation to the events described in narratives, as well as plausible characteristics and mental states of the persons involved. Although this kind of reasoning seems trivial for humans, it is still out of reach for current natural language understanding (NLU) systems. Recently, there has been fast-growing interest in building AI systems with such human-like reasoning capabilities based on inferential commonsense knowledge (Storks et al., 2019). Such systems are often evaluated by answering questions based on narratives. As illustrated in Figure 1, given a narrative "Austin often spends her weekend at the lake fishing with friends" and a question regarding the intention of Austin, an AI system is supposed to associate this event with relevant events in an inferential knowledge base or a web-scale corpus, find plausible reasons for those events, and conclude that "wanted to relax" is the most probable answer. Knowledge-based approaches to such commonsense question answering (QA) require an inferential knowledge base. ATOMIC (Sap et al., 2019a) is the largest commonsense knowledge base of this kind; it contains 300,000 short textual descriptions of events and 877,000 typed if-then relations between events, categorised into 9 dimensions. For instance, IF the event "X puts trust in Y" occurs and the target relation is "xWant", THEN "X wants to develop a relationship". Prior work utilises knowledge in ATOMIC by formulating the learning problem as event prediction in if-then relations. In particular, such work encodes the textual description of an event and a relation into an embedding, and maximises the probability of predicting the description of the associated event or a characteristic of an involved person.
However, due to the nature of commonsense knowledge, given an event and a relation, there are multiple plausible associated events. Moreover, these models fail to predict all events associated with a given event and relation, and hence fail to identify every implicit reasoning path needed to address the task of question answering.
Figure 1: The illustration of the process of answering a question with our proposed model (figure labels: generated clauses based on context; generated clauses after taking first hop; answer classifier). The model receives a context and a question. Using the Conditional SEQ2SEQ-based Mixture Model (COSMO), clauses based on the context and different relations are generated. The generated clauses are then used to generate a new set of clauses based on the question. The Answer Classifier (AC) module selects the answer with the highest score based on the clauses generated at the last step and the scores of COSMO. The path to the correct answer is highlighted in green.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
To address the above challenges, we propose a conditional SEQ2SEQ-based mixture model, named COSMO, to answer questions on everyday inferential knowledge in a zero-shot setting. As the events in our target commonsense KGs, such as ATOMIC, are described in text, we distil the knowledge into a Sequence-to-Sequence (SEQ2SEQ) model. The distilled model memorises the knowledge of the KG and generalises it, on demand, to new events for a given context and specific relation. However, a direct application of maximum-likelihood training on SEQ2SEQ models leads to deteriorated performance, because the underlying distribution over diverse outputs is inherently multi-modal (Shen et al., 2019). To address this, we incorporate a latent variable into a pre-trained SEQ2SEQ transformer (Yan et al., 2020). Each value of the variable corresponds to an embedding, which indicates the hidden factors explaining different hypotheses. Unlike existing training methods for such models, which cannot guarantee a definitive alignment of different hypotheses with different components during training, our proposed constrained-EM (expectation-maximisation) training procedure enforces that, for the same input, different model outputs align with different latent embeddings.
To tackle the task of commonsense QA, we use COSMO to generate context-dependent information for the desired relations. The newly generated events form a dynamic knowledge graph, which is reasoned over until the model chooses an answer. The ubiquitous nature of everyday commonsense knowledge and the lack of training data for each situation motivate addressing the task in a zero-shot setting: it is unrealistic to expect manually constructed training datasets in each commonsense QA domain. Given the lack of training data in such question-answering settings, we devise a bespoke answer scoring module to assess the likelihood of each answer. Given a narrative and a question, the model generates sequences of plausible fact descriptions, as reasoning paths, by choosing different values of the latent variable. For each answer and each reasoning path, the scoring module computes a score based on the similarity between the answer and the last fact description in the reasoning path. The most probable answer is determined jointly by its answer score and the probability of the associated reasoning path.
To sum up, our contributions are threefold:
• We propose COSMO, a conditional SEQ2SEQ-based mixture model for zero-shot question answering on inferential knowledge. COSMO constructs a dynamic KG on-demand, which is used to reason over and answer commonsense questions.
• We propose a novel training procedure to enforce hard alignments between latent variables and diverse hypotheses, to ensure the diversity of the associated KG events distilled in the SEQ2SEQ.
• Our experimental results on SocialIQA (Sap et al., 2019b) show that our model achieves superior accuracy and significantly more diverse hypotheses than competitive baselines.

Related Works
Commonsense Knowledge Bases The introduction of commonsense Knowledge Graphs (KGs) has provided a source of information for machines on the task of commonsense reasoning. These KGs encode commonsense relations between different pairs of events or concepts. Speer et al. (2017) assembled a KG from a variety of sources, ConceptNet, which represents general knowledge in the form of tuples. While ConceptNet is more centred around taxonomic knowledge, Sap et al. (2019a) constructed ATOMIC, a KG consisting of inferential knowledge. Information in ATOMIC is presented as if-then relationships between events, mental states, and personas, and covers both the agent of the event and others who might be affected. An automatic approach has also been proposed for collecting an eventuality KG, ASER, consisting of events, activities, and states. Although these KGs provide rich commonsense information, reasoning about social situations requires a dynamic, on-demand approach to context-dependent information generation.

Commonsense Knowledge Base Completion
As an essential way of enabling machines to perform commonsense reasoning, methods for automatic KG construction and completion have been studied recently. Sap et al. (2019a) used an LSTM as a generator for commonsense knowledge about social situations. Davison et al. (2015) proposed an unsupervised method to extract commonsense knowledge from pretrained models in order to complete KGs. Malaviya et al. (2020) developed a model which takes both structural and semantic characteristics of the nodes in a KG into account to address this task. On the other hand, some works have developed generative models on top of pre-trained language models to extract new commonsense information. However, when adapting to the KG, the previously acquired knowledge of these models is forgotten. Unlike these approaches, our neural model ensures both diversity and memorisation.
Commonsense Question Answering The recent surge of commonsense question answering datasets (Sap et al., 2019b; Zellers et al., 2018) has led to many supervised approaches to this task. Most of these approaches are based on transfer learning, where a large-scale pre-trained model (Lan et al., 2019; Devlin et al., 2018) is finetuned on the target task (Sap et al., 2019b). On the other hand, with the release of new commonsense KGs, such as ATOMIC (Sap et al., 2019a) and ConceptNet (Speer et al., 2017), the possibility of enriching language models with these KGs has been investigated extensively (Lv et al., 2020; Mitra et al., 2019; Banerjee and Baral, 2020). Most of these approaches map the context of a question to an entity or event in the KG and perform reasoning on the KG (Weissenborn et al., 2017; Paul and Frank, 2019; Lin et al., 2019). While these methods enable the system to perform multi-hop reasoning, they are limited to the set of entities in the KG. Other works deploy generative models to generate context-aware information to answer the questions (Shwartz et al., 2020; Banerjee and Baral, 2020). These approaches overcome the limitation of static KGs, but they lack diversity in generating new relations, which limits their inference paths. Furthermore, for scoring the answer choices they rely solely on the conditional likelihood of the generative model, which performs poorly when the distributions of the KG and the QA data differ.
Zero-shot Question Answering In recent years, zero-shot learning has become a popular way to overcome the inability of machine learning systems to perform on unobserved data (Wang et al., 2019). While this method has been thoroughly researched in other fields (Zhao et al., 2019), the need for such an approach has inspired many works on question answering systems as well. In visual question answering, Teney and Hengel (2016) adopted a zero-shot learning approach to extract features from unseen text descriptions of given images. An unsupervised extractive question answering model has also been proposed, which uses unsupervised data generation to convert the QA task into a cloze translation task. Puri (2020) improved the quality of unsupervised extractive question answering systems by introducing an approach involving answer generation, question generation, and roundtrip filtration. In addition, Li et al. (2020) used Wikipedia data to overcome the drawbacks of earlier unsupervised approaches. Compared to previous question answering models, the novelty of our model lies in generating new clauses from given contexts.

Our Approach
In this section, we present our model COSMO for question answering in the zero-shot setting. In this setting, there is no training data for the QA task, thus we train our model only on ATOMIC to acquire inferential knowledge. We apply the trained model to answer multi-choice questions on everyday inferential knowledge by augmenting it with a non-parametric answer scoring module. Formally, given a narrative describing a context $c$, a question $q$ that requires commonsense, and a set of $m$ candidate answers $A = \{a_1, \dots, a_m\}$, the task is to choose the most plausible answer from the set $A$. In order to measure the plausibility of answer candidates, the required commonsense inferential knowledge is drawn from a large external knowledge base $B$, which is a set of typed if-then relations. Each if-then relation takes the form "if $z_i$ and $r$ then $z_j$", where $z_i$ denotes a word sequence describing the event $i$, and $r$ is a relation between events from the pre-defined set $R$. Such a relation is also referred to as an inference dimension in ATOMIC.
To this end, we define the target task as finding the most plausible answer by applying inferential reasoning over the knowledge base $B$. The reasoning process is characterised by finding a plausible reasoning path $Z$, which starts from a given context $c$ and follows the target relation determined by a question $q$ to reach an answer $a$. A reasoning path is a sequence of events correlated by if-then relations derived from $B$. Therefore, for a given context $c$ and a question $q$, we find the most likely answer by solving an optimisation problem (Eq. (1)) built from three terms: $\Pr(r \mid q)$ is the probability of a target relation $r$ given the question; $\Pr(a \mid Z)$ denotes an answer scoring module estimating the probability of an answer $a$ given a reasoning path; and $\Pr(Z \mid c, r, B)$ is the probability of a reasoning path $Z = (z_0, \dots, z_T)$, conditioned on the context $c$, the target relation $r$, and the knowledge base $B$. The local distribution $\Pr(Z \mid c, r, B)$ can be further factorized into
$$\Pr(Z \mid c, r, B) = \Pr(z_0 \mid c, r, B) \prod_{t=1}^{T} \Pr(z_t \mid z_{<t}, c, r, B), \qquad (2)$$
where $\Pr(z_0 \mid c, r, B)$ is the distribution of the first event given a context and a target relation, and $\Pr(z_t \mid z_{<t}, c, r, B)$ characterises the distribution of future events. To simplify inferential reasoning, we assume that $z_t$ with $t > 0$ is conditionally independent of $c$ and $z_{<t-1}$ given $z_{t-1}$, such that both $\Pr(z_t \mid z_{t-1}, r, B)$ and $\Pr(z_0 \mid c, r, B)$ can be estimated by the same module, which predicts a future event by taking a textual description and the target relation as input. The module is referred to as the KB module because it is trained on ATOMIC to acquire inferential knowledge.
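As a sketch, one reading of the optimisation problem that is consistent with the three terms defined above is the following. It assumes that the terms are multiplied and that the maximisation runs jointly over candidate answers, relations, and reasoning paths; marginalising over $Z$ instead of maximising would be an equally plausible alternative, so this form should be treated as an assumption rather than the definitive objective.

```latex
\hat{a} \;=\; \arg\max_{a \in A} \; \max_{r \in R,\; Z} \;
  \Pr(r \mid q)\; \Pr(a \mid Z)\; \Pr(Z \mid c, r, B)
```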
In the following, we will detail the KB module as well as its training procedure on ATOMIC, followed by presenting the answer scoring module and how to apply them together for the target task.

KB Module for Inferential Reasoning
The KB module aims to encode if-then relations in the KB B into model parameters, and apply the knowledge to infer future events given a target relation and a text describing a current event or a context. As both inputs and outputs are word sequences, we formulate the task of event prediction as a sequence to sequence prediction problem. As a result, we are able to exploit the powerful pre-trained transformer based SEQ2SEQ models (Devlin et al., 2018;Yan et al., 2020) as the backbone.
The key challenge of using pre-trained SEQ2SEQ models on ATOMIC is the diversity of output events for a given event $z_i$ and relation $r$. The key idea herein is to align different latent factors $h_k$ with different outputs for the same input. Each latent factor is the value of a latent variable for alignment. After introducing the latent variable, we obtain a conditional mixture model of the following form:
$$\Pr(z \mid x, r; \theta_B) = \sum_{k=1}^{K} \Pr(z \mid h_k, x, r; \theta_B)\, \Pr(h_k \mid x), \qquad (3)$$
where $x$ denotes either the description of an event or a context. Here we replace $B$ with the model parameters $\theta_B$, as the knowledge base is encoded into the model parameters. We further assume that $\Pr(h_k \mid x)$ follows a uniform distribution during prediction, because target QA datasets follow a different distribution from ATOMIC.
As each latent variable value can be represented by a symbol, the module for $\Pr(z \mid h_k, x, r)$ is realized by a SEQ2SEQ model. It takes as input a token sequence consisting of a latent variable value $h_k$, a word sequence $x$, and a relation symbol $r$, and predicts a word sequence representing the next event $z$. We enrich the input vocabulary of the chosen SEQ2SEQ model with the symbols of $h_k$ and $r$, which are mapped to the corresponding latent embeddings and relation embeddings during forward propagation.
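The input construction described here can be sketched as follows. The symbol formats (`<h0>`, `<xWant>`), the ordering of the symbols around the word sequence, and the helper names are illustrative assumptions; the text only states that the input contains a latent symbol, the word sequence, and a relation symbol, and that the vocabulary is extended with the new symbols.

```python
from typing import Dict, List


def latent_symbol(k: int) -> str:
    """Hypothetical surface form of the symbol for latent value h_k."""
    return f"<h{k}>"


def relation_symbol(r: str) -> str:
    """Hypothetical surface form of the symbol for relation r."""
    return f"<{r}>"


def build_input(k: int, x: str, r: str) -> List[str]:
    """Token sequence: latent symbol, the words of x, relation symbol.

    The ordering is an assumption made for illustration only.
    """
    return [latent_symbol(k), *x.split(), relation_symbol(r)]


def extend_vocab(vocab: Dict[str, int], num_latent: int,
                 relations: List[str]) -> Dict[str, int]:
    """Return a copy of vocab with fresh ids for latent and relation symbols."""
    extended = dict(vocab)
    new_symbols = [latent_symbol(k) for k in range(num_latent)]
    new_symbols += [relation_symbol(r) for r in relations]
    for symbol in new_symbols:
        if symbol not in extended:
            # Each new symbol gets the next free id, i.e. its own embedding row.
            extended[symbol] = len(extended)
    return extended


tokens = build_input(2, "PersonX puts trust in PersonY", "xWant")
vocab = extend_vocab({"PersonX": 0, "puts": 1}, num_latent=3,
                     relations=["xWant", "xIntent"])
```

Giving every new symbol its own id means each latent value and each relation receives a separate embedding, which is what lets different values of $h_k$ steer the decoder towards different outputs.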
We select ProphetNet (Yan et al., 2020) as the SEQ2SEQ backbone model, because it achieves the best performance on a number of natural language generation tasks. The encoder and the decoder of this model utilize n-stream self-attention mechanism and future n-gram prediction in order to encourage planning for the future tokens and prevent overfitting on strong local correlations.
Training The goal of training is to learn the parameters of the following model on ATOMIC:
$$\max_{\theta_B} \prod_{r(x,z) \in B} \sum_{k=1}^{K} \Pr(z \mid h_k, x, r; \theta_B)\, \Pr(h_k \mid x). \qquad (4)$$
More specifically, we train the model on each if-then relation of the form "if $x$ and $r$ then $z$" in ATOMIC by taking $x$ and $r$ as input and predicting $z$. Prior work on diverse machine translation (He et al., 2018a; Shen et al., 2019; Cho et al., 2019) suggests applying online hard EM by interleaving the following two steps for each mini-batch.
• E-step: estimate the value of the latent variable through $\hat{k} = \arg\max_k \Pr(z \mid h_k, x, r; \theta_B)$ using the current parameters $\theta_B$.
• M-step: update the model parameters $\theta_B$ by minimising the cross-entropy loss on $\Pr(z \mid h_{\hat{k}}, x, r; \theta_B)$.
However, the greedy search in the E-step may still assign the same latent variable value to different target sequences. To eliminate this problem, we constrain the E-step by requiring that different target sequences of the same input be assigned different latent variable values. More specifically, for each output set $S_{r,x} := \{j \mid \exists\, r(x, z_j) \in B\}$ sharing the same input $x$ and $r$ in a mini-batch, we solve a combinatorial optimisation problem (Eq. (5)) over binary variables $u_{j,k}$, where $u_{j,k}$ indicates the alignment between $z_j$ and $h_k$, and $|S_{r,x}|$ is the number of target events in the output set. We always set $K$ larger than $|S_{r,x}|$. We solve the problem by a heuristic search: we compute the log probability $\log \Pr(z_j \mid h_k, x, r; \theta_B)$ for all combinations of $k$ and $j$, sort those log probabilities, and select the $u_{j,k}$ in that order subject to the hard constraints. More details about the algorithm can be found in the pseudocode in Algorithm 1.
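As a hedged sketch, the constrained E-step described here can be read as an assignment problem. The form below is an assumption reconstructed from the description (each target receives exactly one latent value, and no latent value is shared between targets of the same input); the exact constraints may be stated differently.

```latex
\max_{u} \; \sum_{j \in S_{r,x}} \sum_{k=1}^{K} u_{j,k}\, \log \Pr(z_j \mid h_k, x, r; \theta_B)
\quad \text{s.t.} \quad
\sum_{k=1}^{K} u_{j,k} = 1 \;\; \forall j \in S_{r,x}, \qquad
\sum_{j \in S_{r,x}} u_{j,k} \le 1 \;\; \forall k, \qquad
u_{j,k} \in \{0, 1\}
```

Summing the first constraint over $j$ gives $\sum_{j,k} u_{j,k} = |S_{r,x}|$, which is feasible only when $K \ge |S_{r,x}|$.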
Algorithm 1: Conditional Mixture Model
input: source x, targets {z_j} for j = 1..J, relation r, latent variables {h_k} for k = 1..K (K > J)
output: model parameters θ_B
    Γ := empty list
    for j = 1, ..., J do
        for k = 1, ..., K do
            l_{j,k} := p(z_j | h_k, x, r; θ_B)
            add (x, r, z_j, h_k, l_{j,k}) to Γ
        end for
    end for
    Γ := Sort(Γ) based on the values of l_{j,k}
    create empty lists ∆, ζ, ι
    repeat
        for (x, r, z_j, h_k, l_{j,k}) ∈ Γ do
            if z_j ∉ ζ and h_k ∉ ι then add (x, r, z_j, h_k) to ∆, add z_j to ζ, add h_k to ι
        end for
    until every z_j is assigned an h_k
    for (x, r, z_j, h_k) ∈ ∆ do
        run forward and backward propagation w.r.t. the training objective (Equation 5)
        update the model parameters θ_B
    end for
    return θ_B
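The assignment part of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming the scores are already available as a matrix of log probabilities and that sorting is in descending order; the function name and the toy numbers are hypothetical, and the model forward pass and parameter update are left out.

```python
from typing import Dict, List, Set, Tuple


def constrained_e_step(log_probs: List[List[float]]) -> Dict[int, int]:
    """Greedily align each target j with a distinct latent value k.

    log_probs[j][k] stands for log Pr(z_j | h_k, x, r) for targets that share
    the same input (x, r). Returns a mapping {j: k} with all k distinct.
    """
    num_targets = len(log_probs)
    num_latent = len(log_probs[0]) if log_probs else 0
    # The text sets K > J; K >= J is enough for this greedy pass to succeed.
    if num_latent < num_targets:
        raise ValueError("need at least as many latent values as targets")

    # All (score, j, k) triples, best score first.
    candidates: List[Tuple[float, int, int]] = [
        (log_probs[j][k], j, k)
        for j in range(num_targets)
        for k in range(num_latent)
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)

    assignment: Dict[int, int] = {}
    used_latents: Set[int] = set()
    for _, j, k in candidates:
        # Hard constraints: one latent per target, one target per latent.
        if j not in assignment and k not in used_latents:
            assignment[j] = k
            used_latents.add(k)
        if len(assignment) == num_targets:
            break
    return assignment


# Toy scores: both targets prefer latent 0 under an unconstrained arg max.
toy_scores = [[-1.0, -2.0, -5.0],
              [-1.5, -3.0, -4.0]]
alignment = constrained_e_step(toy_scores)
```

In the toy example an unconstrained hard-EM E-step would send both targets to latent 0, whereas the constrained pass keeps target 0 on latent 0 and moves target 1 to its next best free latent.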

Answer Scoring Module
The main objective of the answer scoring module in Eq. (1) is to assign a score to each answer candidate. For this purpose, prior work proposed an averaged word probability approach. In this method, event prediction is treated as a language modelling task, so the score is defined as the average probability of generating each token of an event. However, this method is not theoretically grounded and relies largely on heuristics. Based on the probabilistic model in Eq. (1), the true answers should be semantically similar to the last events in plausible reasoning paths derived from contexts and questions. Thus, the answer scoring module depends solely on the last events of reasoning paths: $\Pr(a \mid Z) = \Pr(a \mid z_T)$, where $z_T$ denotes the event generated at time step $T$. As the distribution is characterised by the semantic similarity between the last events and answer candidates, we define it in terms of $d(a, z)$, a distance function between an answer and an event, and $\gamma$, a hyperparameter adjusting the temperature.
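As a hedged sketch, one distribution consistent with this description is a softmax over negative distances. It assumes normalisation over the candidate answers in $A$ and that a smaller distance should yield a higher probability; if $d$ is implemented as a similarity rather than a distance, the sign in the exponent flips.

```latex
\Pr(a \mid z_T) \;=\;
  \frac{\exp\!\left(-\gamma\, d(a, z_T)\right)}
       {\sum_{a' \in A} \exp\!\left(-\gamma\, d(a', z_T)\right)}
```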
After distilling the knowledge of ATOMIC into the SEQ2SEQ model, we plug the trained KB module and the answer scoring module into the model. COSMO answers questions by applying Eq. (1). We apply beam search to find the top-k reasoning paths for each latent variable value, up to a pre-specified number of hops $T$. The most plausible answer is the one reached by the most probable reasoning path whose last event achieves the highest similarity with the answer.
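The selection step can be sketched as follows. This is a minimal illustration under stated assumptions: the answer distribution is a softmax over negative distances, the joint score is the sum of the path log probability and the answer log probability, and a token-level Jaccard distance stands in for the embedding-based similarity. The generated events and their probabilities are invented toy values.

```python
import math
from typing import Callable, List, Sequence, Tuple


def jaccard_distance(a: str, b: str) -> float:
    """Toy stand-in for d(a, z): 1 minus token overlap (not the paper's choice)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 1.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def answer_distribution(answers: Sequence[str], last_event: str,
                        distance: Callable[[str, str], float],
                        gamma: float = 1.0) -> List[float]:
    """Softmax over -gamma * d(a, z_T); the softmax form is an assumption."""
    logits = [-gamma * distance(a, last_event) for a in answers]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_answer(answers: Sequence[str],
                  paths: Sequence[Tuple[str, float]],
                  distance: Callable[[str, str], float],
                  gamma: float = 1.0) -> int:
    """Index of the answer maximising log Pr(path) + log Pr(answer | last event).

    paths holds (last_event, path_log_probability) pairs, e.g. from beam search.
    """
    best_index, best_score = -1, -math.inf
    for last_event, path_log_prob in paths:
        probs = answer_distribution(answers, last_event, distance, gamma)
        for index, prob in enumerate(probs):
            score = path_log_prob + math.log(prob)
            if score > best_score:
                best_index, best_score = index, score
    return best_index


answers = ["contribute to the local community",
           "close his store soon",
           "reward more for more deserving persons"]
# Hypothetical last events of two reasoning paths with made-up probabilities.
paths = [("to reward deserving persons", math.log(0.6)),
         ("to be fair", math.log(0.4))]
chosen = select_answer(answers, paths, jaccard_distance)
```

Working in log space keeps the product of path probability and answer probability numerically stable when reasoning paths span several hops.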

Experiments
In this section, we report the evaluation of our model on zero-shot question answering. To this end, we use the test and development sets of SocialIQA (Sap et al., 2019b). We evaluate the performance of our model in two variations. In the zero-hop setup, we only generate clauses using COSMO based on the context of the question, and the answers are then scored against the generated clauses. In the one-hop setting, we take a step further and generate more clauses from the clauses generated in the previous step. This approach helps us uncover more implicit context-dependent information. The final answer is chosen against the combination of all generated clauses. To further analyse the capability of our proposed model in terms of clause generation, we compare our model to other approaches on ATOMIC (Sap et al., 2019a) and on the test and development sets of SocialIQA.
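The difference between the two setups can be sketched as below. The `generate` callable is a hypothetical stand-in for COSMO's clause generator, and the toy lookup table is invented for illustration; only the zero-hop versus one-hop control flow follows the description.

```python
from typing import Callable, Dict, List


def generate_clauses(context: str, relation: str,
                     generate: Callable[[str, str], List[str]],
                     hops: int = 0) -> List[str]:
    """Collect clauses for scoring.

    hops=0 (zero-hop): clauses generated from the context only.
    hops=1 (one-hop): additionally expand every generated clause once.
    The result is the combination of all generated clauses.
    """
    frontier = generate(context, relation)
    all_clauses = list(frontier)
    for _ in range(hops):
        next_frontier: List[str] = []
        for clause in frontier:
            next_frontier.extend(generate(clause, relation))
        all_clauses.extend(next_frontier)
        frontier = next_frontier
    return all_clauses


# Toy generator backed by a lookup table (purely illustrative data).
toy_table: Dict[str, List[str]] = {
    "Alex rewarded every person accordingly": ["to be fair", "to show appreciation"],
    "to be fair": ["to reward good work"],
    "to show appreciation": [],
}


def toy_generate(text: str, relation: str) -> List[str]:
    return toy_table.get(text, [])


zero_hop = generate_clauses("Alex rewarded every person accordingly", "xIntent", toy_generate)
one_hop = generate_clauses("Alex rewarded every person accordingly", "xIntent", toy_generate, hops=1)
```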

Datasets
SocialIQA This dataset consists of commonsense questions, which aim to evaluate a model's capability in inferring implicit social context. Each question in this dataset is presented with a context, which describes the situation, and three answer candidates. To address this task, we convert each question to one of the relations in the KG using a pattern-based system. The details of this module are provided in Appendix 6.1. The dataset contains a total of 37,588 questions. However, in a zero-shot setting we only use the development and test sets for evaluation, which contain 1,954 and 2,224 questions, respectively.
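A pattern-based conversion of this kind might look like the sketch below. The regular expressions are hypothetical examples, not the patterns used in the appendix, and the sketch ignores the distinction between relations about the agent (x-prefixed) and relations about others (o-prefixed) that a full system would have to resolve from the context.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Hypothetical question patterns mapped to ATOMIC relation names.
QUESTION_PATTERNS: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"^why did .+ do (this|that)\b", re.IGNORECASE), "xIntent"),
    (re.compile(r"^what does .+ need to do before\b", re.IGNORECASE), "xNeed"),
    (re.compile(r"^how would you describe\b", re.IGNORECASE), "xAttr"),
    (re.compile(r"^how would .+ feel\b", re.IGNORECASE), "xReact"),
    (re.compile(r"^what will .+ want to do next\b", re.IGNORECASE), "xWant"),
    (re.compile(r"^what will happen to\b", re.IGNORECASE), "xEffect"),
]


def question_to_relation(question: str) -> Optional[str]:
    """Return the first matching relation name, or None if nothing matches."""
    text = question.strip()
    for pattern, relation in QUESTION_PATTERNS:
        if pattern.search(text):
            return relation
    return None


relation = question_to_relation("Why did Alex do this?")
```

Returning `None` for unmatched questions makes the fallback explicit, so a caller can decide whether to skip the question or try every relation.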
ATOMIC This dataset consists of 877K triples of subject, relation, and object, where each triple describes a social commonsense situation. Each subject is an event (e.g., "PersonX puts trust in PersonY"), which poses a social situation. The relations are categorised into 9 dimensions (e.g., xEffect). The object, whose type is determined by the relation, describes the causes of the subject event, its effects on the agent and on others, or attributes of the agent. The original data split of 710K/80K/87K for train/development/test by Sap et al. (2019a) is used in our experiments.

Baselines
To evaluate our proposed zero-shot question answering model, we compare COSMO to the state-of-the-art pretrained language model GPT (Radford et al., 2018). We also consider different variations of GPT-2 (Radford et al., 2019), including GPT2-117M, GPT2-345M, and GPT2-762M. For this purpose, the questions are converted to statement sentences (as described in Appendix B). The language model scores the answers based on the cross-entropy loss of the concatenation of context, question, and answers. In addition, we compare our model to two COMET-based variations, COMET-CA and COMET-CGA. We also report the performance of the state-of-the-art supervised methods, BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019).
To further analyse the performance of our model in clause generation, we compare it to COMET, the state-of-the-art automatic knowledge base completion model that has been used in the zero-shot commonsense question answering task. Also, to assess diversity, we compare our proposed model to the SEQ2SEQ model presented in Section 3 without the latent variable (ProphetNet) (Yan et al., 2020), and to a state-of-the-art model that applies a latent variable (MoE) (Shen et al., 2019).

Metrics
To evaluate the performance of models in the zero-shot question answering task, we report the accuracy of each model in choosing the correct answer. In addition, to assess the effectiveness of our proposed scoring function, we compare three variations of it to two different scoring functions proposed in prior work. Furthermore, diverse clause generation is an advantage of our proposed model. As a quantitative evaluation, for a set of clauses generated by the model, denoted as $\{\hat{y}_m\}_{m=1}^{M}$, we use div_bleu and div_ngram (He et al., 2018b), where ngram(y) indicates the set of unique n-grams in a sequence $y$. For each model, the top-50 clauses generated by beam search are considered for evaluation purposes. For div_bleu, we report the average result of BLEU-1, BLEU-2, BLEU-3, and BLEU-4, with Smoothing1 (Chen and Cherry, 2014). For div_ngram, we report the average results of 1-gram, 2-gram, 3-gram, and 4-gram.
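As an illustration of the n-gram diversity measure, the sketch below assumes that div_ngram is the number of unique n-grams across all generated clauses divided by the sum of the per-clause unique n-gram counts. This is a common formulation and is stated here as an assumption; whitespace tokenisation is also a simplification.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of unique n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def div_ngram(hypotheses: List[str], n: int) -> float:
    """Unique n-grams over all hypotheses / sum of per-hypothesis unique n-grams.

    Lies in (0, 1] when any n-gram exists: 1.0 means no n-gram is shared
    between hypotheses, and lower values mean more overlap.
    """
    per_hypothesis = [ngrams(h.split(), n) for h in hypotheses]
    total = sum(len(s) for s in per_hypothesis)
    if total == 0:
        return 0.0
    return len(set().union(*per_hypothesis)) / total


identical = div_ngram(["to be fair", "to be fair"], 1)
disjoint = div_ngram(["to be fair", "reward good work"], 2)
```

Under this definition, a beam that returns near-duplicate clauses scores close to 1/M, so the measure directly penalises the mode collapse that the latent variable is meant to prevent.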

Experimental Details
To train our proposed Knowledge Graph Neuralisation model, COSMO, we finetune the SEQ2SEQ model starting from the pretrained model of Yan et al. (2020). Our implementation is based on FAIRSEQ. The model consists of 12 encoder layers and 12 decoder layers. The embedding size and batch size are set to 1024 and 512, respectively. The number of future n-grams is set to 2. We use the Adam optimiser (Kingma and Ba, 2014) with a peak learning rate of $1 \times 10^{-4}$. For the question answering module, we consider answering without taking a hop and with one hop. Since we evaluate our model in a zero-shot setting, the $\gamma$ in Equation 7 is set to one, and we use the cosine similarity function as the distance function. At each step, we use a beam size of 10 for clause generation.
To overcome the answer imbalance given specific questions, prior work adds the Pointwise Mutual Information (PMI) of question and answers to the original scoring function. The distance function (Eq. (1)) in our proposed model is evaluated in three variations: directly using the probability of the SEQ2SEQ model (SEQ2SEQ), the BLEU function, and cosine distance. The results indicate that using the cosine distance function improves the results by +4.93% and +6.48% over the second-best approach on the development and test sets, respectively. The experiments suggest that comparing the clauses generated at each step with the answers provides a stronger classifier over answers than considering only the probability of the generative model. The weaker performance of the latter configuration may be rooted in the training phase of generative models, where some phrases are frequently seen together regardless of the context and therefore receive a higher probability from the model.

MODEL | Dev Acc. | Test Acc.
averaged word probability | 36.59 | 33.67
averaged word probability (without pmi) | 34 |

Diverse Clause Generation One of the strengths of our proposed model is the ability to generate diverse clauses given a subject and a relation. Table 3 shows the results of diverse generation in terms of div_ngram and div_bleu. As can be seen, for div_ngram, our proposed model achieves the highest performance on the test set of ATOMIC. Furthermore, we observe that our model outperforms the baseline methods on the development and test sets of SocialIQA. The results for div_bleu also show that, on ATOMIC and on the SocialIQA development and test sets, our model outperforms all the baselines on all variations of the BLEU function. The results demonstrate the capability of our proposed model in diverse clause generation.

Qualitative Analysis
In this section, we demonstrate the capability of our model in diverse clause generation and its effect on commonsense question answering. Table 4 provides two examples from the test set of SocialIQA. For each example, the top-5 clauses generated by our proposed model and by COMET, given the context and question, are provided. Both examples show the capability of our model to generate diverse outputs, which results in finding the correct answer.

Example 1
Context: Alex is a store owner and observed every person's contribution carefully. Alex rewarded every person accordingly.
Question: Why did Alex do this?
Answers: contribute to the local community; close his store soon; reward more for more deserving persons
Correct Answer: reward more for more deserving persons
COMET: to be a good employee; to be helpful; to be a good salesperson; to make sure everything goes smoothly; to be a good citizen
COSMO: to be fair; to reward good work; to appreciate good work; to reward good people; to show appreciation
Example 2
Context: Bailey was a nice person so she called the family together.

Conclusion
In this paper, we propose a novel approach for neuralising large-scale commonsense knowledge graphs, the Conditional SEQ2SEQ-based Mixture Model, COSMO. Our proposed model provides diverse clause generation to ensure coverage for the target task. We use the proposed model to generate diverse context-related clauses which, together with our proposed answer classifier, address the task of zero-shot commonsense question answering. Empirical results on a zero-shot commonsense question answering dataset show the superiority of our model over the state-of-the-art methods, by up to 5.2%. Furthermore, our model outperforms the baselines in terms of the diversity of generated clauses.