Learning to Collaborate for Question Answering and Asking

Question answering (QA) and question generation (QG) are closely related tasks that could improve each other; however, the connection between these two tasks is not well explored in the literature. In this paper, we give a systematic study that seeks to leverage the connection to improve both QA and QG. We present a training algorithm that generalizes both Generative Adversarial Network (GAN) and Generative Domain-Adaptive Nets (GDAN) under the question answering scenario. The two key ideas are improving the QG model with QA by incorporating an additional QA-specific signal into the loss function, and improving the QA model with QG by adding artificially generated training instances. We conduct experiments on both document based and knowledge based question answering tasks. We have two main findings. Firstly, the performance of a QG model (e.g., in terms of BLEU score) could be easily improved by a QA model via policy gradient. Secondly, directly applying GAN, which regards all the generated questions as negative instances, could not improve the accuracy of the QA model. Learning when to regard generated questions as positive instances could bring a performance boost.


Introduction
In this work, we consider the task of joint learning of question answering and question generation. Question answering (QA) and question generation (QG) are closely related natural language processing tasks. The goal of QA is to obtain an answer given a question. The goal of QG is almost the reverse: to generate a question from the answer. In this work, we consider answer selection (Yang et al., 2015; Balakrishnan et al., 2015) as the QA task, which assigns a numeric score to each candidate answer and selects the top ranked one as the answer. We consider QG as a generation problem and exploit sequence-to-sequence learning (Seq2Seq) (Du et al., 2017) as the backbone of the QG model. The key idea of this work is that QA and QG are two closely related tasks, and we seek to leverage the connection between them to improve both QA and QG. Our primary motivations are twofold. On one hand, the Seq2Seq based QG model is trained by maximizing the literal similarity between the generated sentence and the ground truth sentence with a maximum-likelihood estimation objective function (Du et al., 2017). However, there is no signal indicating whether or not the generated sentence could be correctly answered by the input. This problem could be mitigated by incorporating a QA-specific signal into the QG loss function. On the other hand, the capacity of a statistical model depends on the quality and the amount of the training data (Sun et al., 2017). In our scenario, the capacity of the QA model depends on the difference between the positive and negative patterns embodied in the training examples. A desirable training dataset should contain question-answer pairs that are literally similar yet have different category labels, i.e. some question-answer pairs are correct and some are wrong. However, this kind of dataset is hard to obtain in most situations because of the lack of manual annotation efforts.
From this perspective, the QA model could benefit from the QG model by incorporating additional question-answer pairs whose questions are automatically generated by the QG model. (An alternative way is to automatically generate answers for each question. Solving the problem in this condition requires an answer generation model (He et al., 2017), which is out of the focus of this work. Our algorithm could also be adapted to this scenario.)
To achieve this goal, we present a training algorithm that improves the QA model and the QG model in a loop. The QA model improves QG by introducing an additional QA-specific loss function, the objective of which is to maximize the expectation of the QA scores of the generated question-answer pairs. The policy gradient method (Williams, 1992) is used to update the QG model. In turn, the QG model improves QA by incorporating additional training instances. Here the key problem is how to label the generated question-answer pair. The application of Generative Adversarial Network (GAN) (Goodfellow et al., 2014) in this scenario regards every generated question-answer pair as a negative instance. On the contrary, Generative Domain-Adaptive Nets (GDAN) regards every generated question-answer pair, appended with a special domain tag, as a positive instance. However, it is non-trivial to label the generated question-answer pairs, because some of them are good paraphrases of the ground truth while others might be negative instances with similar utterances. To address this, we bring in a collaboration detector, which takes two question-answer pairs as the input and determines their relation as collaborative or competitive. The output of the collaboration detector is regarded as the label of the generated question-answer pair.
We conduct experiments on both document based (Yang et al., 2015) and knowledge (e.g. web table) based question answering tasks (Balakrishnan et al., 2015). Results show that the performance of a QG model (e.g., in terms of BLEU score) could be consistently improved by a QA model via policy gradient. However, regarding all the generated questions as negative instances (competitive) could not improve the accuracy of the QA model. Learning when to regard generated questions as positive instances (collaborative) could improve the accuracy of the QA model.

Related Work
Our work connects to existing works on question answering (QA), question generation (QG), and the use of generative adversarial nets in question answering and text generation.
We consider two kinds of answer selection tasks in this work: one takes a table as the answer (Balakrishnan et al., 2015) and the other takes a sentence as the answer (Yang et al., 2015). In the natural language processing community, there are also other types of QA tasks, including knowledge based QA (Berant et al., 2013), community based QA (Qiu and Huang, 2015) and reading comprehension (Rajpurkar et al., 2016). We believe that our algorithm is generic and could also be applied to these tasks with dedicated QA and QG model architectures. Although sophisticated features could yield a more accurate QA model, in this work we implement a simple yet effective neural network based QA model, which can be readily learned jointly with the QG model through back-propagation.
Question generation has drawn a lot of attention recently, partly influenced by the remarkable success of neural networks in natural language generation. There are several works on generating questions from different resources, including a sentence (Heilman, 2011), a topic (Chali and Hasan, 2015), a fact from a knowledge base (Serban et al., 2016), etc. In the computer vision community, there are also recent studies on generating questions from an image (Mostafazadeh et al., 2016). Our QG model belongs to sentence-based question generation.
GAN has been successfully applied in computer vision tasks (Goodfellow et al., 2014). There are also some recent trials that adapt GAN to text generation, question answering, dialogue generation, etc. The relationship between the discriminator and the generator in GAN is competitive. The key finding of this work is that directly applying the idea of "competitive" in GAN does not improve the accuracy of a QA model. We contribute a generative collaborative network that learns when to collaborate and yields empirical improvements on two QA tasks.
This work relates to recent studies which attempt to improve the performance of a discriminative QA model with generative models (Dong et al., 2017). These works regard QA as the primary task and use auxiliary tasks, such as question generation and question paraphrasing, to improve the primary task. This is one part of our goal; our other goal is to improve the QG model with the QA system, and further to improve both tasks incrementally in a loop.
In terms of assigning a category label to the generated question, Generative Adversarial Network (GAN) (Goodfellow et al., 2014) and Generative Domain-Adaptive Nets (GDAN) could be viewed as special cases of our algorithm. Our algorithm learns when to assign positive or negative labels, while GAN always assigns negative labels and GDAN always assigns positive labels. Besides, our work differs from previous work in that our question generation model is a generative model, while theirs is actually a discriminative matching model. The approach of (Dong et al., 2017) learns to generate a question from a question via paraphrasing, and uses the generated questions in the inference process. In this work, the QA model and the QG model are applied separately in the inference process. This inspires us to jointly conduct QA and QG in the inference process, which we leave as future work.

Generative Collaborative Network
In this section, we first formulate the task of QA and QG, and then present our algorithm that jointly trains the QA and QG models.

Task Formulation
This work involves two tasks: question answering (QA) and question generation (QG).
There are different kinds of QA tasks in the natural language processing area. To verify the scalability of our algorithm, we consider two types of answer selection tasks, both of which are fundamental QA tasks in the research community and of great importance in industrial applications including web search and chatbots. Both tasks take a question q and a list of candidate answers A = {a_1, a_2, ..., a_n} as input, and output the answer a_i that has the largest probability of correctly answering the question. The only difference is that the answer in the task of answer sentence selection (Yang et al., 2015) is a natural language sentence, while the answer in table search (Balakrishnan et al., 2015) is a structured table consisting of a caption, attributes and cells. Our QA model is abbreviated as P_qa(a, q; θ_qa), whose output is the probability that q and a form a correct question-answer pair, and whose parameters are denoted as θ_qa.
The task of QG takes an answer a, which is a natural language sentence or a structured table, and outputs a natural language question q which could be answered by a. Inspired by the remarkable progress of sequence-to-sequence (Seq2Seq) learning in natural language generation, we deal with QG in an end-to-end fashion and develop a generative model based on Seq2Seq learning. Our QG model is abbreviated as P_qg(q|a; θ_qg), whose output is the probability of generating q from a, and whose parameters are denoted as θ_qg.

Algorithm Description
We describe the joint learning algorithm in this part. The end goal is to leverage the connection of QA and QG to improve the performance on both QA and QG tasks. A brief illustration of the training process is given in Figure 1, which includes a QA model, a QG model and a collaboration detector (CD). A formal description of the algorithm is given in Algorithm 1. We can see that the QA model and the QG model both have two training objectives. One part is the standard supervised learning objective based on task-specific supervision. The other part of the objective is obtained by leveraging auxiliary tasks. The supervised objective of the QA model is to maximize the probability of the correct category label, and the supervised objective of the QG model is to maximize the probability of the correct question sequence. Since the goal of QA is to predict whether a question-answer pair is correct or not, training the QA model requires negative QA pairs whose labels are zero. These negative QA pairs are not used for training the QG model, as the goal of QG is to generate the correct question.
The main contribution of this work is to explore effective learning objectives that leverage auxiliary tasks. In order to improve the QA model, we generate additional training instances, each of which is composed of a question, an answer and a category label. In this work, we clamp the answer part and feed the answer to the QG model to generate the question. We use beam search and select the top ranked result as the question. Here the issue is how to infer the label of the generated instance. We believe that heuristically assigning the label as 0 or 1 is problematic. For instance, let us suppose the answer sentence is "Microsoft was founded by Paul Allen and Bill Gates on April 4, 1975.", and the ground truth question is "who founded Microsoft". In this case, the generated question "who is the founder of Microsoft" is a good one, yet "who is the founder of Google" and "how old is Bill Gates" are both bad cases.
Algorithm 1 Generative Collaborative Network for QA and QG
1: Input: training data D; the batch size for QG training m; the beam size for QG inference k; hyper parameters λ_qa and λ_qg; hyper parameters b_qa and b_qg; pretrained collaboration detector P_cd(q, q')
2: Output: QA model P_qa(a, q) parameterized by θ_qa; QG model P_qg(q|a) parameterized by θ_qg
3: pretrain P_qa(a, q) and P_qg(q|a) separately on D
4: repeat
5: get a minibatch of positive QA pairs PD = {q^p_i, a^p_i} ∈ D, 1 ≤ i ≤ m, in which a^p_i is the answer of q^p_i
6: get a minibatch of negative QA pairs ND = {q^n_i, a^n_i} ∈ D, 1 ≤ i ≤ m, in which a^n_i is not the answer of q^n_i
7: apply P_qg(q|a) on PD to generate GD, a list of question-answer beams
8: apply P_qa(a, q) on GD to assign a QA-specific score to each generated instance
9: choose the top ranked result from each beam in GD, and then apply P_cd(q, q') on the selected instance
10: update θ_qa by maximizing the QA objective (the supervised objective plus the auxiliary objective weighted by λ_qa; see Equation (1))
11: update θ_qg by maximizing the QG objective (the supervised objective plus the expected reward weighted by λ_qg; see Equation (2))
12: until models converge
To address this, we introduce an additional collaboration detector (CD) to infer the label of the generated instance. Intuitively, the CD acts as a discriminative paraphrase model, which measures the semantic similarity between the ground truth question and the generated question. In Equation (1), I_bqa(x) is an indicator function whose value is 1 if the value of x is larger than a threshold b_qa, such as 0.5 or 0.3. The hyper parameter λ_qa controls the weight of the auxiliary objective of the QA model. In turn, the QA model is used to assign a QA-specific score P_qa(a, q') to each generated question q'. We follow the recent reinforcement learning based approach for dialogue prediction and define simple heuristic rewards that characterize good questions. The goodness of the generated question is measured by the prediction of the QA model. Similar to the strategy adopted by (Zaremba and Sutskever, 2015), we use a baseline b_qg (e.g. 0.5) to decrease the learning variance. The expected reward (Williams, 1992) for a generated question is given in Equation (2). In this way, the parameters of the QG model can be updated in the conventional way with stochastic gradient descent.
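The labeling rule and the reward described above can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, the auxiliary QA term is assumed to be a λ_qa-weighted log-likelihood under the CD-inferred label, and the reward is assumed to be the QA score minus the baseline b_qg, multiplied by the log-probability of the sampled question as in REINFORCE. Probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def cd_label(cd_score, b_qa=0.5):
    """Indicator I_bqa(x): 1 (collaborative) if the CD score exceeds b_qa, else 0 (competitive)."""
    return 1 if cd_score > b_qa else 0


def qa_auxiliary_objective(qa_prob, cd_score, b_qa=0.5, lambda_qa=1.0):
    """Assumed form of the auxiliary QA term for one generated pair.

    The generated pair is treated as positive or negative according to the
    collaboration detector, and the QA log-likelihood of that label is
    weighted by lambda_qa.
    """
    label = cd_label(cd_score, b_qa)
    log_lik = math.log(qa_prob) if label == 1 else math.log(1.0 - qa_prob)
    return lambda_qa * log_lik


def qg_policy_gradient_term(qa_prob, log_prob_question, b_qg=0.5):
    """Assumed REINFORCE surrogate: (QA score - baseline) * log P_qg(q'|a)."""
    return (qa_prob - b_qg) * log_prob_question


# A generated question that the CD scores 0.4 is competitive at b_qa = 0.5
# but collaborative at the more permissive threshold b_qa = 0.3.
print(cd_label(0.4, b_qa=0.5), cd_label(0.4, b_qa=0.3))
```

Under these assumptions, lowering b_qa makes the QA model treat more generated questions as positive instances, while the sign of the QG term depends on whether the QA score is above or below the baseline.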
We pretrain the QA model and the QG model before the joint learning process. The main reason is that a randomized QA model will provide unreliable rewards to the QG model, and a randomized QG model will generate bad questions.

The Neural Architecture of Each Module
Our algorithm includes a question answering (QA) model, a question generation (QG) model and a collaboration detector (CD) model. We implement these models with dedicated neural networks. As mentioned before, our training algorithm is applied to both document-based and web table based question answering tasks. In this section, we take the table based QA and QG tasks to describe the neural architecture of each module. A question/query q is a natural language expression consisting of a list of words. A table t has a fixed schema including one or more headers, one or more cells, and a caption. A header indicates the property of a column, and a cell is a unit where a row and a column intersect. The caption is typically an explanatory text about the table.
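The table schema above can be pictured with a minimal data structure. This is only an illustrative sketch: the class and function names are hypothetical, and storing a plain sentence in the caption field is an assumption that follows from the experimental setting, where document-based QA is treated as a table whose headers and cells are empty.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Table:
    """A table with a caption, column headers, and rows of cells."""
    caption: str
    headers: List[str] = field(default_factory=list)
    cells: List[List[str]] = field(default_factory=list)  # one inner list per row

    def column(self, j: int) -> List[str]:
        """Return the cells of column j, i.e. the values described by headers[j]."""
        return [row[j] for row in self.cells]


def sentence_as_table(sentence: str) -> Table:
    """Assumed encoding of an answer sentence: caption only, no headers or cells."""
    return Table(caption=sentence)


flights = Table(
    caption="List of flights",
    headers=["from", "to"],
    cells=[["london", "berlin"], ["paris", "rome"]],
)
print(flights.column(1))
```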

The Question Answer (QA) Model
We develop a neural network to match a natural language question/query to a structured table. Since a table has multiple aspects including headers, cells and the caption, the model is developed to capture the semantic relevance between a query and a table at different levels.
As the meaning of a query is sensitive to the word order, e.g. the intentions of "list of flights london to berlin" and "list of flights berlin to london" are totally different, we represent a query with a sequential model. In this work, a recurrent neural network (RNN) is used to map a query of variable length to a fixed-length vector. We use the gated recurrent unit (GRU) (Cho et al., 2014) as the basic computation unit, which adaptively forgets the history and remembers the input.
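For reference, one common way to write the GRU update is given below. The symbols e_i (the embedding of the i-th word), h_i (the hidden state), and the weight matrices W and U are notation assumed here for illustration, and the sign convention of the update gate varies across papers.

```latex
\begin{aligned}
z_i &= \sigma\left(W_z e_i + U_z h_{i-1}\right) \\
r_i &= \sigma\left(W_r e_i + U_r h_{i-1}\right) \\
\tilde{h}_i &= \tanh\left(W_h e_i + U_h \left(r_i \odot h_{i-1}\right)\right) \\
h_i &= z_i \odot \tilde{h}_i + \left(1 - z_i\right) \odot h_{i-1}
\end{aligned}
```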
Here z_i and r_i denote the update and reset gates of the GRU. We use a bi-directional RNN to get the meaning of a query from the forward and backward directions, and concatenate the two last hidden states as the query vector.
An important property of a table is that exchanging two rows or two columns does not change its meaning. To satisfy this condition, we develop an attention based approach, in which the headers and cells are regarded as the external memory. Each header/cell is represented as a vector. Given a query vector, the model calculates the weight of each memory unit and then outputs a query-specific header/cell representation through a weighted average (Bahdanau et al., 2015; Sukhbaatar et al., 2015). This process could be repeated several times, so that more abstract evidence could be retrieved and composed to support the final decision. Similar techniques have been successfully applied in table-based question answering (Yin et al., 2015b; Neelakantan et al., 2015).
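The following sketch illustrates why a weighted average over memory units is insensitive to row or column order. It is not the model from this work: dot-product scoring, the function names, and the way the query state is updated between hops (adding the retrieved summary) are all assumptions made for illustration.

```python
import math


def attend(query, memory):
    """Softmax attention of a query vector over a list of memory vectors.

    Returns the attention weights and the weighted average of the memory.
    Dot-product scoring is an assumption; the source does not fix the scorer.
    """
    scores = [sum(q * m for q, m in zip(query, unit)) for unit in memory]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    summary = [
        sum(w * unit[d] for w, unit in zip(weights, memory))
        for d in range(len(query))
    ]
    return weights, summary


def multi_hop(query, memory, hops=2):
    """Repeat the attention step, folding each summary into the query state."""
    state = list(query)
    for _ in range(hops):
        _, summary = attend(state, memory)
        state = [s + r for s, r in zip(state, summary)]
    return state


cells = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]
_, summary = attend(query, cells)
_, shuffled = attend(query, list(reversed(cells)))
# The summary is the same regardless of the order of the memory units.
print(all(abs(a - b) < 1e-9 for a, b in zip(summary, shuffled)))
```

Because the summary is a sum over memory units, permuting the units only reorders the terms of the sum, which is the invariance the paragraph above asks for.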
We represent the table caption with an RNN, the same strategy we have adopted to represent the query. Element-wise multiplication is used to compose the query vector and the caption vector. Furthermore, since the number of co-occurring words directly reflects the relatedness between the question and the answer, we incorporate the embedding of the co-occurring word count as an additional feature. Finally, we feed the concatenation of all the vectors to a softmax layer whose output length is 2. We have implemented a ranking based loss function l_qa = max(0, 1 − P_qa(a, q) + P_qa(a', q)) and a negative log-likelihood (NLL) based loss function l_qa = − log(P_qa(a, q)). We find that NLL works better and use it in the following experiments.
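The two loss functions above can be written down directly. The sketch below uses hypothetical function names and treats the QA outputs as plain probabilities; p_pos stands for P_qa(a, q) on the correct answer and p_neg for P_qa(a', q) on an incorrect one.

```python
import math


def ranking_loss(p_pos, p_neg, margin=1.0):
    """Hinge loss max(0, margin - P_qa(a, q) + P_qa(a', q))."""
    return max(0.0, margin - p_pos + p_neg)


def nll_loss(p_pos):
    """Negative log-likelihood -log P_qa(a, q) of the correct pair."""
    return -math.log(p_pos)


print(ranking_loss(0.9, 0.2))
print(nll_loss(0.9))
```

Since both scores are probabilities, the hinge term with margin 1 only reaches zero when the correct pair scores 1 and the incorrect pair scores 0.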

The Question Generation (QG) Model
Inspired by the notable progress that sequence-to-sequence learning (Seq2Seq) (Sutskever et al., 2014) has made in natural language generation, we implement a table-to-sequence (Table2Seq) approach that generates a natural language question from a structured table.
Table2Seq is composed of an encoder and a decoder. The encoder maps the caption, headers, and cells into continuous vectors, which are fed to the decoder to generate a question in a sequential way. Similar to the way we have adopted in the QA model, we represent the caption with a bidirectional GRU based RNN. The vector of each word in the caption is the concatenation of the hidden states from both directions. The vectors of headers and cells are regarded as additional hidden states of the encoder. The representation of each cell is also mixed with the corresponding header representation. The initial vector of the decoder is the average of the caption vector, the header vector, and the cell vector.
The backbone of the decoder is an attention based GRU RNN, which generates a word at each time step and repeats the process until generating the end-of-sentence symbol. We made two modifications to adapt the decoder to the table structure. The first modification is that the attention model is calculated over the headers, cells and the caption of a table. Ideally, the decoder should learn to focus on a region of the table when generating a word. The second modification is a table based copying mechanism. It has been shown that the copying mechanism (Gu et al., 2016) is an effective way to replicate low-frequency words from the source to the target sequence in sequence-to-sequence learning. In the decoding process, a word is generated either from the target vocabulary via a standard softmax or from a table via the copy mechanism. A neural gate g_t is used to trade off between generating from the target vocabulary and copying from the table. The probability of generating a word y is a combination, weighted by the gate g_t, of α_t(y), the attention probability of the word y from the table at time step t, and β_t(y), the probability of predicting the word y from the softmax at time step t.
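A gate-weighted mixture of the copy and generation distributions can be sketched as follows. The function name is hypothetical, and assigning g_t to the copy distribution and 1 − g_t to the vocabulary distribution is an assumption; the opposite convention would work equally well as long as it is applied consistently.

```python
def mix_copy_generate(gate, copy_probs, vocab_probs):
    """Combine the table copy distribution and the vocabulary softmax.

    gate        -- g_t in [0, 1], assumed here to weight the copy distribution
    copy_probs  -- alpha_t: attention probability of each word in the table
    vocab_probs -- beta_t: softmax probability of each word in the vocabulary
    """
    words = set(copy_probs) | set(vocab_probs)
    return {
        w: gate * copy_probs.get(w, 0.0) + (1.0 - gate) * vocab_probs.get(w, 0.0)
        for w in words
    }


alpha = {"berlin": 0.7, "flights": 0.3}             # words available in the table
beta = {"list": 0.5, "of": 0.3, "flights": 0.2}     # words in the target vocabulary
mixed = mix_copy_generate(0.4, alpha, beta)
print(sorted(mixed.items()))
```

A word such as "berlin" that is missing from the target vocabulary still receives probability mass through the copy term, which is the motivation for copying low-frequency words.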
Since every component of the Table2Seq is differentiable, the parameters could be optimized in an end-to-end fashion with back-propagation. Given a question-answer pair (x, y), the supervised training objective is to maximize the probability of question word at each time step. In the inference process, beam search is used to generate top-k confident results, where k is the beam size.

The Collaboration Detector (CD)
The goal of a collaboration detector is to determine the label of the instance generated by the QG model. A positive prediction, namely a predicted value equal to 1, stands for a collaborative relationship between the generated instance and the ground truth, while a negative prediction stands for a competitive relationship.
We consider this task as predicting the category of the given two question-answer pairs, one of which is the ground truth, and another is the generated question-answer pair. Since the answer part is the same, we simplify the problem as classifying two questions as related or not, which is a binary classification problem.
The neural architecture of the collaboration detector (CD) is exactly the same as the caption component in the QA model. We represent the two questions with a bidirectional RNN, and use element-wise multiplication to do the composition. The result is further concatenated with a co-occurring word count embedding, followed by a softmax layer. The model is trained by minimizing the negative log-likelihood of the label, which is provided in the training data.
The training data of the CD model includes two parts. The first part is from the Quora dataset, which is built for detecting whether pairs of question text are semantically equivalent. The Quora dataset has 345,989 positive question pairs and 255,027 negative pairs. We further obtain the second part of the training data from web queries, which are more similar to the queries in our two QA tasks. We obtain the query dataset from query logs by clustering the web queries that click the same web page. In this way, we obtain 6,118,023 positive query pairs. We use a heuristic rule to generate the negative instances for the query dataset. For each pair of query text, we clamp the first query and retrieve a query that is most similar to the second query. To improve the efficiency of this process, we randomly sample 10,000 queries and define the "similarity" as the number of co-occurring words in two questions. In this way we collect another 6,118,023 negative pairs of query text. We initialize the values of the word embeddings with 300d GloVe vectors, which are learned on Wikipedia texts. We use held-out data consisting of 20K query pairs to check the performance of the CD model. The accuracy of the CD model on the held-out dataset is 83%. In the joint training process, we clamp the parameters of the CD model and use its outputs to facilitate the learning of the QA model.
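The heuristic for building negative query pairs can be sketched as below. The function names are hypothetical, whitespace tokenization is assumed, and excluding the two queries of the positive pair from the candidates is an added assumption that the description above does not state.

```python
import random


def cooccurrence(q1, q2):
    """Similarity defined as the number of distinct words shared by two queries."""
    return len(set(q1.split()) & set(q2.split()))


def sample_negative(pair, pool, sample_size=10000, rng=None):
    """Build a negative pair (first, q*) from a positive pair (first, second).

    q* is the query most similar to `second` among a random sample of the pool.
    Returns None when no candidate is left after filtering.
    """
    rng = rng or random.Random(0)
    first, second = pair
    sampled = rng.sample(pool, min(sample_size, len(pool)))
    # Assumption: never reuse the queries of the positive pair themselves.
    candidates = [c for c in sampled if c not in (first, second)]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: cooccurrence(c, second))
    return first, best


pool = ["list of flights berlin to london", "weather in berlin", "hotels in london"]
positive = ("flights from london to berlin", "list of flights london to berlin")
print(sample_negative(positive, pool))
```

In this toy pool the retrieved negative shares every word with the second query but reverses the direction of travel, which is the kind of literally similar yet incorrect pair the heuristic is meant to surface.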

Experiment
We conduct experiments on table-based QA and document-based QA tasks. We will describe experimental settings and report results on these two tasks in this section.

Table based QA and QG
Setting. We take table retrieval (Balakrishnan et al., 2015) as the table-based QA task. Given a query and a collection of candidate table answers, the task aims to return the table that is most relevant to the query. Figure 2 gives an example of this task, in which a query matches different aspects of a table. We regard document-based QA tasks as a special case of the table-based QA task, in which the cells and the headers are both empty.
We conduct experiments on web data. The queries come from real-world user queries which we obtain from the search log of a commercial search engine. We filter them down to only those that are directly answered by a table. In this way, we collect 1.49M query-table pairs. An example of the data is given in Figure 2. We randomly select 1.29M as the training set, 0.1M as the dev set and 0.1M as the test set. We evaluate the performance on table-based QA with Mean Average Precision (MAP) and Precision@1 (P@1) (Manning et al., 2008). We use the same candidate retrieval adopted in (Yan et al., 2017), namely representing a table as a bag-of-words, to guarantee the efficiency of the approach. Each query has 50 candidate tables on average. It is still an open problem to automatically evaluate the performance of a natural language generation system (Lowe et al., 2017). In this work, we use the BLEU-4 (Papineni et al., 2002) score as the evaluation metric, which measures the overlap between the generated question and the reference question. The hyper parameters are tuned on the validation set and the performance is reported on the test set.
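For readers unfamiliar with the two ranking metrics, the sketch below computes them from relevance labels listed in ranked order. The function names are hypothetical, each inner list holds the 0/1 relevance of one query's candidates from best to worst score, and a query with no relevant candidate is assumed to contribute an average precision of 0.

```python
def average_precision(ranked_labels):
    """Average of precision@rank over the ranks that hold a relevant candidate."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(all_ranked_labels):
    """MAP: mean of the per-query average precision."""
    return sum(average_precision(r) for r in all_ranked_labels) / len(all_ranked_labels)


def precision_at_1(all_ranked_labels):
    """P@1: fraction of queries whose top ranked candidate is relevant."""
    return sum(1 for r in all_ranked_labels if r and r[0]) / len(all_ranked_labels)


# Two queries: the correct table is ranked second for the first, first for the second.
rankings = [[0, 1, 0], [1, 0, 0]]
print(mean_average_precision(rankings), precision_at_1(rankings))
```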

Results and Analysis
We report the results and our analysis on table-based QA and QG respectively in this part.
We first report the results of single systems on table-based QA. We compare to four single systems implemented by (Yan et al., 2017). In BM25, each table is represented as a flattened vector, and the similarity between a query and a table is calculated with the BM25 algorithm. WordCnt uses the number of co-occurring words in the query-caption pair, query-header pair, and query-cell pair, respectively. MT based PP is a phrase-level feature. The features come from a phrase table which is extracted from a bilingual corpus via a statistical machine translation approach (Koehn et al., 2003). LambdaMART (Burges, 2010) is used to train the ranker. CNN uses a convolutional neural network to measure the similarity between the query and the table caption, table headers, and table cells.
We also implement four different joint learning settings. In these settings, the QA model and the QG model are all pretrained, and the same method (policy gradient) is used to improve the QG model via the QA predictions. The only difference is how the QA model benefits from the QG model. As we use external resources to train a CD model, we also implement Seq2SeqPara for comparison. We train a question generator with a Seq2Seq model on the CD training data, and regard the generated questions as positive instances. Our generative collaborative network is abbreviated as GCN. GCN (competitive) is analogous to (Goodfellow et al., 2014), where all the generated questions are regarded as negative instances (with label zero). On the contrary, GCN (collaborative) is analogous to GDAN, where the generated questions are regarded as positive instances.
Our main observation from Table 1 is that simply regarding all the generated questions as negative instances ("competitive") could not bring a performance boost. On the contrary, regarding the generated questions as positive ones ("collaborative") improves the QA model. Our algorithm (GCN) significantly improves the TQNN model.
Based on these results, we believe that the relationship between the QA model and the QG model should not be always competitive. Learning when to collaborate through leveraging a CD model is a practical way to improve the performance on question answering.
As described in Equation (1), the influence of the CD model on the QA model also depends on the value of the hyper parameter b_qa. A small value of b_qa stands for a preference for "collaborative", while a large value of b_qa represents a preference for "competitive". Results are given in Figure 3. The GCN model performs better when b_qa is in the range [0.3, 0.5], in which the model prefers "collaborative".
We conduct an additional experiment to test whether our algorithm could improve an existing system. We take BM25 as the baseline, and incorporate one of the five joint models as an additional feature. LambdaMART is used to train the combined ranker. Results are given in Table 2. We can see that the baseline system could be dramatically improved by our system, although the improvements from the different approaches are on par.
Here we show the performance of different approaches on table-based QG. Results in terms of BLEU-4 are given in Table 3. Different from the trends on QA, "competitive" performs better than "collaborative" on QG. This is reasonable because, as the joint training progresses, the QA model in "collaborative" keeps telling the QG model that the generated instances are good enough. On the contrary, the "competitive" model is more critical and tells the QG model how wrong the generated questions are. In this way, the QG model could be increasingly improved by the QA signal. The QG model is easier to improve than the QA model. Our GCN approach obtains a significant improvement over the baseline model on this task.

[Table 3: BLEU-4 on the dev and test sets for table-based QG, starting from the Table2Seq baseline.]
We also report the learning curve of the GCN model as the joint training progresses. The performance on the dev set is given in Figure 4.

Document based QA and QG
To test the scalability of the algorithm, we also apply it to document based QA and QG tasks. The QA task is answer sentence selection (Yang et al., 2015). Given a question and a list of candidate answer sentences from a document, the goal is to find the most relevant answer sentence as the answer.
Since the WikiQA dataset (Yang et al., 2015) is too small to learn a powerful question generator, we use the MARCO dataset (Nguyen et al., 2016), which was originally designed for reading comprehension yet also has manually annotated labels for sentence/passage selection. A characteristic of the MARCO dataset is that the ground truth of the test set is not publicly available. Therefore, we follow previous work and split the original validation set into the dev set and the test set. The results on QA and QG are given in Table 4 and Table 5.
We can see that the results are largely consistent with the results on the table-based QA and QG tasks. Our GCN algorithm achieves promising performance compared to strong baseline methods.

Conclusion
We present an algorithm dubbed generative collaborative network for jointly training the question answering (QA) model and the question generation (QG) model. Different from standard GAN, the relationship between QA model (discriminator) and the QG model (generator) in our algorithm is not always competitive. We show that "collaborative" performs better than "competitive" in terms of QA accuracy, and our algorithm that learns when to collaborate obtains further improvement on both QA and QG tasks. This work could be further improved from several directions. Our current algorithm focuses on the joint training of QA and QG models, while the inferences of these two models are independent. How to conduct joint inference is an interesting future work. Besides, the samples are currently generated from the QG model via beam search. Improving the diversity of the samples requires different sampling mechanisms. Another potential direction is to jointly learn the collaboration detector together with the QA and QG models.