Self-supervised Knowledge Triplet Learning for Zero-shot Question Answering

The aim of all Question Answering (QA) systems is to generalize to unseen questions. Most current methods rely on learning every possible scenario, which in turn depends on expensive data annotation. Moreover, such annotations can introduce unintended bias, which makes systems focus more on the bias than on the actual task. In this work, we propose Knowledge Triplet Learning, a self-supervised task over knowledge graphs. We propose methods for using such a model to perform zero-shot QA, and our experiments show considerable improvements over large pre-trained generative models.


Introduction
The ability to understand natural language and answer questions is one of the core focuses of the field of natural language processing. To measure and study the different aspects of question answering, several datasets have been developed, such as SQuAD (Rajpurkar et al., 2018), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019), which require systems to perform extractive question answering. On the other hand, datasets such as SocialIQA (Sap et al., 2019b), CommonsenseQA (Talmor et al., 2018), Swag (Zellers et al., 2018) and Winogrande (Sakaguchi et al., 2019) require systems to choose the correct answer from a given set. These multiple-choice question answering datasets are very challenging, but recent large pre-trained language models such as BERT (Devlin et al., 2018), XLNet (Yang et al., 2019b) and RoBERTa (Liu et al., 2019b) have shown very strong performance on them. Moreover, as shown in Winogrande (Sakaguchi et al., 2019), acquiring unbiased labels requires a "carefully designed crowdsourcing procedure", which adds to the cost of data annotation. This is also quantified in other natural language tasks such as Natural Language Inference (Gururangan et al., 2018) and Argument Reasoning Comprehension (Niven and Kao, 2019), where such annotation artifacts lead to a "Clever Hans Effect" in the models (Kaushik and Lipton, 2018; Poliak et al., 2018).
One way to resolve this is to design and create datasets carefully, as in Winogrande (Sakaguchi et al., 2019). Another is to ignore the data annotations and build systems that perform unsupervised question answering (Teney and Hengel, 2016; Lewis et al., 2019). In this paper, we focus on building unsupervised zero-shot multiple-choice question answering systems.
The task of unsupervised question answering is very challenging. Recent work (Fabbri et al., 2020; Lewis et al., 2019) tries to generate a synthetic dataset from a text corpus such as Wikipedia in order to solve extractive QA. Other work (Bosselut and Choi, 2019; Shwartz et al., 2020) uses large pre-trained generative language models such as GPT-2 (Radford et al., 2019) to generate knowledge, questions, and answers, and compares them against the given choices.
In this work, we utilize the information present in Knowledge Graphs such as ATOMIC (Sap et al., 2019a) and ConceptNet (Liu and Singh, 2004) and define a new task of Knowledge Triplet Learning. Knowledge Triplet Learning is similar to Knowledge Representation Learning but not limited to it. Knowledge Representation Learning (Lin et al., 2018) learns low-dimensional projected and distributed representations of the entities and relations defined in a knowledge graph. As shown in Figure 1, we define a triplet (h, r, t), and given any two of its components we try to recover the third. This forces the system to learn all possible relations between the three inputs. We map the question answering task to Knowledge Triplet Learning by mapping the context, question and answer to (h, r, t) respectively. We define two different ways to perform self-supervised Knowledge Triplet Learning: the task can be designed as a representation generation task or as a language modeling task. We compare both strategies in this work. We show how to use models trained on this task to perform zero-shot question answering without any additional knowledge or supervision. We also show that models pre-trained on this task perform considerably better than strong pre-trained language models on few-shot learning. We evaluate our approach on the SocialIQA dataset.
The contributions of this paper are summarized as follows:
• We define Knowledge Triplet Learning over Knowledge Graphs and show how to use it for zero-shot question answering.
• We compare two strategies for the above task.
• We achieve state-of-the-art results for zero-shot and propose a strong baseline for the few-shot question answering task.

Knowledge Triplet Learning
We define the task of Knowledge Triplet Learning (KTL) in this section. We define G = (V, E) as a Knowledge Graph, where V is the set of vertices and E is the set of edges. V consists of entities, which can be phrases or named entities depending on the given input Knowledge Graph. Let S be a set of fact triples, S ⊆ V × E × V, with the format (h, r, t), where h and t belong to the set of vertices V and r belongs to the set of edges E. Here h and t indicate the head and tail entities, whereas r indicates the relation between these entities. For example, from the ATOMIC knowledge graph, (PersonX puts PersonX's trust in PersonY, How is PersonX seen as?, faithful) is one such triple. Here the head is PersonX puts PersonX's trust in PersonY, the relation is How is PersonX seen as? and the tail is faithful. Note that V does not contain homogeneous entities, i.e., both faithful and PersonX puts PersonX's trust in PersonY belong to V.
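We define the task of KTL as follows: given a triple (h, r, t) as input, we learn the following three functions:

$$f_t(h, r) \Rightarrow t, \qquad f_h(r, t) \Rightarrow h, \qquad f_r(h, t) \Rightarrow r \tag{1}$$
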
That is, each function learns to generate one component of the triple given the other two. The intuition behind learning these three functions is as follows. Let us take the above example: (PersonX puts PersonX's trust in PersonY, How is PersonX seen as?, faithful). The first function, $f_t(h, r)$, learns to generate the answer t given the context and the question. The second function, $f_h(r, t)$, learns to generate a context in which the question and the answer may be valid. The final function, $f_r(h, t)$, is a Jeopardy-style function that generates the question connecting the context and the answer.
In multiple-choice QA, given the context, two answer choices may be true for two different questions. Similarly, given the question, two answer choices may be true for two different contexts. For example, given the context PersonX puts PersonX's trust in PersonY, the answers PersonX is considered trustworthy by others and PersonX is polite are true for two different questions, How does this affect others? and How is PersonX seen as?. Learning these three functions enables us to score these relations between the context, question, and answers.
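As a rough illustration of the three functions, the sketch below turns one fact triple into three (inputs, target) training pairs. The `Triple` type and the `ktl_examples` helper are hypothetical names introduced only for this example; they are not part of any released implementation.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """A fact triple (h, r, t); the field names are illustrative."""
    head: str      # h: the context, e.g. an ATOMIC event
    relation: str  # r: the question
    tail: str      # t: the answer


def ktl_examples(triple: Triple) -> List[Tuple[str, Tuple[str, str], str]]:
    """Return (function name, inputs, target) for the three KTL functions."""
    h, r, t = triple
    return [
        ("f_t", (h, r), t),  # generate the answer from context and question
        ("f_h", (r, t), h),  # generate a context from question and answer
        ("f_r", (h, t), r),  # generate the question from context and answer
    ]


triple = Triple(
    head="PersonX puts PersonX's trust in PersonY",
    relation="How is PersonX seen as?",
    tail="faithful",
)
examples = ktl_examples(triple)
for name, inputs, target in examples:
    print(name, inputs, "=>", target)
```

Each fact therefore contributes three supervision signals without any manual labeling, which is what makes the task self-supervised.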

Using KTL to perform QA
After learning these functions in a self-supervised way, we can use them to perform question answering. Given a triple (h, r, t), we define a scoring function score(h, r, t) (equation 2), where h is the context, r is the question and t is one of the answer options. D is a distance function which measures the distance between the generated output and the ground truth. The distance function varies depending on the instantiation of the framework, which we study in the following sections. The final answer is selected as:

$$\mathrm{ans} = \arg\min_{t} \ \mathrm{score}(h, r, t)$$

Since the scores are distances from the ground truth, we select the answer which has the minimum score.
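The selection rule can be sketched as follows. This is a minimal sketch under a stated assumption: the three per-function distances are simply summed, which is one plausible way to combine them and is not confirmed by the text. The names `score`, `select_answer` and the toy distance table are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

# A distance between a generated component and its ground truth (lower is better).
Distance = Callable[[str, str, str], float]


def score(h: str, r: str, t: str, d_t: Distance, d_h: Distance, d_r: Distance) -> float:
    # ASSUMPTION: the distances for f_t, f_h and f_r are summed.
    return d_t(h, r, t) + d_h(h, r, t) + d_r(h, r, t)


def select_answer(h: str, r: str, options: Sequence[str],
                  d_t: Distance, d_h: Distance, d_r: Distance) -> int:
    """Return the index of the answer option with the minimum score."""
    scores = [score(h, r, t, d_t, d_h, d_r) for t in options]
    return min(range(len(options)), key=scores.__getitem__)


# Toy distances standing in for a trained model: option -> (D_t, D_h, D_r).
toy: Dict[str, Tuple[float, float, float]] = {
    "faithful": (0.2, 0.3, 0.1),
    "angry": (0.9, 0.8, 0.7),
    "tired": (0.6, 0.9, 0.5),
}
d_t = lambda h, r, t: toy[t][0]
d_h = lambda h, r, t: toy[t][1]
d_r = lambda h, r, t: toy[t][2]

options = list(toy)
best = select_answer("PersonX puts PersonX's trust in PersonY",
                     "How is PersonX seen as?", options, d_t, d_h, d_r)
print(options[best])
```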
In the following sections, we define the different ways we can implement this framework.

Knowledge Representation Learning
In this implementation, we use Knowledge Representation Learning to learn equation (1). In contrast to standard Knowledge Representation Learning, where systems learn a score function $f_r(h, t)$, i.e., whether the fact triple (h, r, t) is true or false, in this work we learn to generate the vector representations of the inputs, i.e., $f_r(h, t) \Rightarrow r$. We can view the functions in equation (1) as generator functions which, given two inputs, learn to generate a vector representation of the third. As our triples (h, r, t) can have many-to-many relations between each pair, we first project the two inputs from the input encoding space to a different space, similar to the work of TransD (Ji et al., 2015). We use a Transformer encoder Enc to encode our triples to the encoding space. We learn two projection functions, $M_{i1}$ and $M_{i2}$, to project the two inputs, and a third projection function, $M_o$, to project the entity to be generated. We combine the two projected inputs using a function C. These functions can be implemented using feedforward networks. Writing $O$ for the entity to be generated:

$$\hat{O} = C\big(M_{i1}(\mathrm{Enc}(I_1)),\ M_{i2}(\mathrm{Enc}(I_2))\big), \qquad O_p = M_o(\mathrm{Enc}(O))$$

where $I_i$ is the input, $\hat{O}$ is the generated output vector and $O_p$ is the projected vector. The M and C functions are learned using fully connected networks. In our implementation, we use RoBERTa as the Enc transformer, with the output representation of the [cls] token as the phrase representation. We train this model using two types of loss functions: L2Loss, where we minimize the L2 norm between the generated output and the projected ground truth, and Noise Contrastive Estimation (Gutmann and Hyvärinen, 2010), where along with the ground truth we have k noise samples. These noise samples are selected from other (h, r, t) triples such that the target output is not another true fact triple, i.e., (h, r, $t_{noise}$) is false. The NCELoss is defined as:

$$\mathrm{NCELoss}(\hat{O}, O_p) = -\log \frac{\exp\big(\mathrm{sim}(\hat{O}, O_p)\big)}{\exp\big(\mathrm{sim}(\hat{O}, O_p)\big) + \sum_{k} \exp\big(\mathrm{sim}(\hat{O}, N_k)\big)}$$

where $N_k$ are the projected noise samples, sim is the similarity function, which can be the L2 norm or cosine similarity, $\hat{O}$ is the generated output vector and $O_p$ is the projected vector.
The distance function D in equation (2) for such a model is defined by the distance function used in the loss. For L2Loss, it is the L2 norm; for NCELoss, we use 1 − sim.
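The noise sampling, the contrastive loss and the resulting distance can be sketched in plain Python. This is a sketch under assumptions: it uses the common softmax-style form of the NCE objective with cosine similarity, operates on small hand-written vectors instead of encoder outputs, and all function names (`sample_noise_tails`, `nce_loss`, `nce_distance`) are hypothetical.

```python
import math
import random
from typing import List, Sequence, Set, Tuple

Vector = Sequence[float]
Fact = Tuple[str, str, str]


def cosine_sim(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sample_noise_tails(facts: Set[Fact], h: str, r: str, k: int,
                       rng: random.Random) -> List[str]:
    """Pick up to k tails from other triples such that (h, r, tail) is not a known fact."""
    candidates = sorted({t for (_, _, t) in facts if (h, r, t) not in facts})
    return rng.sample(candidates, min(k, len(candidates)))


def nce_loss(generated: Vector, target: Vector, noise: List[Vector]) -> float:
    """ASSUMED form: negative log of the softmax weight given to the true target."""
    pos = math.exp(cosine_sim(generated, target))
    neg = sum(math.exp(cosine_sim(generated, n)) for n in noise)
    return -math.log(pos / (pos + neg))


def nce_distance(generated: Vector, target: Vector) -> float:
    """Distance used at inference time for an NCE-trained model: 1 - sim."""
    return 1.0 - cosine_sim(generated, target)


facts = {("e1", "q1", "a1"), ("e1", "q1", "a2"), ("e2", "q1", "a3"), ("e3", "q2", "a4")}
noise_tails = sample_noise_tails(facts, "e1", "q1", k=10, rng=random.Random(0))

generated = [1.0, 0.0]
good_loss = nce_loss(generated, [1.0, 0.0], [[0.0, 1.0]])   # target matches the output
bad_loss = nce_loss(generated, [-1.0, 0.0], [[0.0, 1.0]])   # target opposes the output
print(sorted(noise_tails), good_loss < bad_loss)
```

The loss is small when the generated vector is closer to the projected ground truth than to the noise samples, which is the extra signal that a pure L2 objective does not provide.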

Span Masked Language Modeling
In Span Masked Language Modeling, we model equation (1) as a masked language modeling task. We tokenize and concatenate the triple (h, r, t) with a separator token between them. To learn $f_r$, we mask the tokens of r, feed the resulting sequence to a Transformer encoder Enc, and use a feed-forward network to unmask the sequence of tokens. Similarly, we mask h to learn $f_h$ and t to learn $f_t$.
We train the same Transformer encoder to perform all three functions. We use the cross-entropy loss over the masked tokens to train the model. For $f_t$, for example, the loss is computed from $P_{MLM}$, the masked language modeling probability of each token $t_i$ given the unmasked tokens of h and r and the other masked tokens in t. Note that we do not perform progressive unmasking, i.e., all the masked tokens are predicted jointly. The distance function D in equation (2) for this model is the same as this loss function.
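The masking scheme can be illustrated with a small sketch. It assumes whitespace tokenization and placeholder `<mask>` and `</s>` strings, with a separator after each component; a real subword tokenizer and its special tokens would differ. The `mask_component` helper is a hypothetical name.

```python
from typing import Dict, List

MASK, SEP = "<mask>", "</s>"  # placeholder special tokens (assumption)


def mask_component(h: List[str], r: List[str], t: List[str], target: str) -> List[str]:
    """Concatenate h, r, t with separators, masking every token of `target`.

    target is one of "h", "r" or "t" and selects which function is trained:
    masking t trains f_t, masking h trains f_h, masking r trains f_r.
    """
    parts: Dict[str, List[str]] = {"h": h, "r": r, "t": t}
    if target not in parts:
        raise ValueError("target must be 'h', 'r' or 't'")
    sequence: List[str] = []
    for name in ("h", "r", "t"):
        tokens = parts[name]
        # The whole span is masked at once; there is no progressive unmasking.
        sequence += [MASK] * len(tokens) if name == target else list(tokens)
        sequence.append(SEP)
    return sequence


h = "PersonX puts PersonX's trust in PersonY".split()
r = "How is PersonX seen as?".split()
t = ["faithful"]

masked_t = mask_component(h, r, t, "t")
masked_r = mask_component(h, r, t, "r")
print(" ".join(masked_t))
print(" ".join(masked_r))
```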

SocialIQA
To study our framework, we evaluate it on SocialIQA. This dataset is about reasoning over social interactions and the implications of social events. Each instance in this dataset contains a context C, which is a social situation, a question Q about this situation, and three answer options. There are several question types, derived from the different ATOMIC inference dimensions, such as Intent, Effect, Attributes, etc. There are 33,410 training samples and 1,954 validation samples, with a withheld test set.
Though this dataset is derived from the ATOMIC knowledge graph, the fact triples present in the graph are considerably different from the contexts, questions and answers present in the dataset, as the latter are crowdsourced. The average length of the event description in ATOMIC is 10 and the maximum length is 18, whereas in SocialIQA the average length is 36 and the maximum is 124. This shows the varied types of questions present in SocialIQA, which pose a much greater challenge for unsupervised learning.

Baselines
We compare our models to three strong baselines. The first is a pre-trained language model, GPT-2 (all sizes), scored using the language modeling cross-entropy loss. We concatenate the context and question, compute the cross-entropy loss for each of the answer choices, and choose the answer which has the minimum loss. The second baseline is another pre-trained language model, RoBERTa-large, for which we follow the same Span Masked Language Model (SMLM) scoring using the pre-trained model. For the third baseline, we fine-tune the RoBERTa-large model using the original Masked Language Modeling task over our concatenated fact triples (h, r, t).
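The minimum-loss selection used by the language model baselines can be sketched as below. The per-token log-probabilities are made-up numbers standing in for model outputs, and averaging the loss over the answer tokens is an assumption made here so that answers of different lengths are comparable; `cross_entropy` and `pick_answer` are hypothetical helper names.

```python
from typing import List, Sequence


def cross_entropy(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood of the answer tokens (averaging is an assumption)."""
    return -sum(token_logprobs) / len(token_logprobs)


def pick_answer(choice_logprobs: List[Sequence[float]]) -> int:
    """Return the index of the answer choice with the minimum cross-entropy loss."""
    losses = [cross_entropy(lp) for lp in choice_logprobs]
    return min(range(len(losses)), key=losses.__getitem__)


# Hypothetical log-probabilities of each answer's tokens given context + question.
choice_logprobs = [
    [-0.1, -0.2],        # choice A
    [-1.0, -2.0, -0.5],  # choice B
    [-0.7],              # choice C
]
prediction = pick_answer(choice_logprobs)
print(prediction)
```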

KTL Training
We train the Knowledge Representation Learning (KRL) model using both L2Loss and NCELoss. For NCELoss, we train it with both the L2 norm and cosine similarity. Both the KRL model and the SMLM model use RoBERTa-large as the Transformer encoder. We train the models with the following hyper-parameters: batch sizes of 16 and 32; learning rates in the range [1e-5, 5e-5]; warm-up steps in the range [0, 0.1]. We use the transformers package (Wolf et al., 2019). From the ATOMIC knowledge graph, we generate 595,595 unique triplets. All of these triplets are positive facts, and we learn from them. For NCE, we choose k equal to 10, i.e., 10 negative samples. We perform 3 hyperparameter trials for each model, and train models with 3 different seeds [0, 21, 42].

Table 1: Accuracy comparison with our baseline models on the SocialIQA dataset. We compare the models on the zero-shot task. We compare them on both the Train-Val split (35k) and the Validation split (2k), to enable a better measure of statistical significance.

Table 1 shows our evaluation and baseline comparisons for the zero-shot task. We observe that our KTL-trained models perform significantly better than the baselines. When comparing the different KRL models, the NCELoss with cosine similarity performs the best. This might be due to the additional supervision provided by the negative samples, as the L2Loss model only tries to minimize the distance between the two projections. When comparing the different KTL instantiations, we see that the SMLM model performs the best overall but has a slightly higher deviation. We are analyzing our model to understand this phenomenon better. Our KRL model performs on par with the current state-of-the-art model, Self-Talk (Shwartz et al., 2020), which uses two GPT-type models, while our models have half the parameters of Self-Talk. We also compare our models on the entire Train-Val set of 35,364 questions in the zero-shot setting to better gauge the statistical significance of the differences in model accuracy.
Table 2 compares the different pre-trained Transformer encoders on the few-shot question answering task. We randomly sample three sets of 2,400 samples from the training set as training data. We train our models on these three sets and measure the validation set accuracy. We see that the Transformer encoders trained on KTL perform significantly better than the baseline models in this setting. This shows that encoders trained on KTL are able to learn from only a few samples. We plan to continue analyzing our models and to evaluate the KTL framework on different datasets.

Related Work

Unsupervised Question Answering
Recent work on unsupervised question answering approaches the problem in two ways: as a domain adaptation or transfer learning problem (Chung et al., 2018), or as a data augmentation problem (Dhingra et al., 2018; Wang et al., 2018). Lewis et al. (2019), Fabbri et al. (2020) and Puri et al. (2020) use style transfer or template-based generation of question, context and answer triples, and learn from these to perform unsupervised extractive question answering. Another approach uses generative models such as GPT-2 (Radford et al., 2019) to generate the answer given a question, or to generate clarifying explanations and/or questions, in order to perform unsupervised question answering (Shwartz et al., 2020; Bosselut and Choi, 2019; Bosselut et al., 2019). In contrast, our work focuses on learning from knowledge graphs and on generating vector representations or, using the masked language modeling objective, sequences of tokens that are not restricted to the answer but also cover the context and the question.

Use of External Knowledge for Question Answering
There are several approaches to adding external knowledge to models to improve question answering. Broadly, they can be classified into two groups: learning from unstructured knowledge and learning from structured knowledge. In learning from unstructured knowledge, recent large pre-trained language models (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2018; Liu et al., 2019b; Clark et al., 2020; Lan et al., 2019; Joshi et al., 2020; Bosselut et al., 2019) learn general-purpose text encoders from a huge text corpus. On the other hand, learning from structured knowledge includes learning from structured knowledge bases (Yang and Mitchell, 2017; Bauer et al., 2018; Mihaylov and Frank, 2018; Wang and Jiang, 2019) by learning knowledge-enriched word embeddings. Using structured knowledge to refine pre-trained contextualized representations learned from unstructured knowledge is another approach (Peters et al., 2019; Yang et al., 2019a; Zhang et al., 2019; Liu et al., 2019a). A further way of using external knowledge is to retrieve knowledge sentences from text corpora (Mitra et al., 2019; Banerjee et al., 2019; Baral et al., 2020; Das et al., 2019; Chen et al., 2017; Banerjee and Baral, 2020; Banerjee, 2019) or knowledge triples from knowledge bases (Min et al., 2019; Wang et al., 2020) that are useful for answering a specific question. In our work, we use knowledge graphs to learn a self-supervised generative task in order to perform zero-shot multiple-choice QA.

Knowledge Representation Learning
Over the years, several methods have been proposed for the task of knowledge representation learning, i.e., embedding the entities and relations of knowledge graphs in a low-dimensional continuous vector space. We mention a few of them here: TransE (Bordes et al., 2013), which views relations as translation vectors between head and tail entities; TransH (Wang et al., 2014), which overcomes TransE's inability to model complex relations; and TransD (Ji et al., 2015), which aims to reduce the number of parameters by proposing two different mapping matrices for head and tail entities. For a more detailed treatment, we refer the reader to the survey by Lin et al. (2018). KRL has been used in various ways to generate natural answers (Yin et al., 2016; He et al., 2017) and to generate factoid questions (Serban et al., 2016). In our work, we modify TransD and adapt it to our KTL framework to perform zero-shot QA.
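For readers unfamiliar with translation-based embeddings, the TransE idea of treating a relation as a translation from head to tail (h + r ≈ t) can be sketched as follows. The vectors are made-up two-dimensional values, the L2 norm is used here for the distance, and `transe_distance` is a hypothetical helper name.

```python
import math
from typing import Sequence


def transe_distance(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """L2 distance between the translated head (h + r) and the tail t."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


head = [1.0, 0.0]
relation = [0.0, 1.0]
true_tail = [1.0, 1.0]      # exactly head + relation
corrupt_tail = [4.0, 5.0]   # an unrelated entity

d_true = transe_distance(head, relation, true_tail)
d_corrupt = transe_distance(head, relation, corrupt_tail)
print(d_true, d_corrupt)
```

A plausible triple gets a small distance and a corrupted one a large distance. KTL differs in that it generates a representation of the missing component rather than only scoring a complete triple.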

Conclusion
In this work, we propose a new framework of Knowledge Triplet Learning over Knowledge Graphs. We show that learning all three possible functions, $f_r$, $f_h$, and $f_t$, helps the model perform zero-shot multiple-choice question answering. We learn from the ATOMIC knowledge graph and evaluate our framework on the SocialIQA dataset. Our framework achieves state-of-the-art results on the zero-shot question answering task and sets a strong baseline for the few-shot question answering task.