Self-Training for Jointly Learning to Ask and Answer Questions

Building curious machines that can answer as well as ask questions is an important challenge for AI. The two tasks of question answering and question generation are usually tackled separately in the NLP literature. At the same time, both require significant amounts of supervised data which is hard to obtain in many domains. To alleviate these issues, we propose a self-training method for jointly learning to ask as well as answer questions, leveraging unlabeled text along with labeled question answer pairs for learning. We evaluate our approach on four benchmark datasets: SQUAD, MS MARCO, WikiQA and TrecQA, and show significant improvements over a number of established baselines on both question answering and question generation tasks. We also achieved new state-of-the-art results on two competitive answer sentence selection tasks: WikiQA and TrecQA.


Introduction
Question Answering (QA) is a well-studied problem in NLP which focuses on answering questions using some structured or unstructured sources of knowledge. Alongside question answering, there has also been some work on question generation (QG) (Heilman, 2011; Du et al., 2017), which focuses on generating questions based on given sources of knowledge.
QA and QG are closely related tasks: we can think of QA and QG as inverses of each other. However, the NLP literature views the two as entirely separate tasks. In this paper, we explore this relationship between the two tasks by jointly learning to generate as well as answer questions. An improved ability to generate as well as answer questions will help us build curious machines that can interact with humans in a better manner. Joint modeling of QA and QG is useful as the two can be used in conjunction to generate novel questions from free text and then answers for the generated questions. We use this idea to perform self-training (Nigam and Ghani, 2000) and leverage free text to augment the training of QA and QG models.
QA and QG models are typically trained on question-answer pairs, which are expensive to obtain in many domains. However, it is cheaper to obtain large quantities of free text. Our self-training procedure leverages unlabeled text to boost the quality of our QA and QG models. This is achieved by a careful data augmentation procedure which uses pre-trained QA and QG models to generate additional labeled question-answer pairs. This additional data is then used to retrain our QA and QG models and the procedure is repeated.
This addition of synthetic labeled data needs to be performed carefully. During self-training, typically the most confident samples are added to the training set (Zhu, 2005) in each iteration. We use the performance of our QA and QG models as a proxy for estimating the confidence value of the questions. We describe a suite of heuristics inspired by curriculum learning (Bengio et al., 2009) to select the questions to be generated and added to the training set at each epoch. Curriculum learning is inspired by the incremental nature of human learning and orders training samples on the easiness scale, so that easy samples are introduced to the learning algorithm first and harder samples are introduced successively. We show that introducing questions in increasing order of hardness leads to improvements over a baseline that introduces questions randomly.
We use a seq2seq model with soft attention (Sutskever et al., 2014; Bahdanau et al., 2014) for QG and a neural model inspired by the Attentive Reader (Hermann et al., 2015) for QA. However, these can be any QA and QG models. We evaluate our approach on four datasets: SQUAD, MS MARCO, WikiQA and TrecQA. We use a corpus of English Wikipedia as unlabeled text. Our experiments show that the self-training approach leads to significant improvements over a number of established approaches in QA and QG on these benchmarks. On the two answer sentence selection QA tasks (WikiQA and TrecQA), we obtain state-of-the-art results.

Problem Setup
In this work, we focus on the task of machine comprehension where the goal is to answer a question q based on a passage p. We model this as an answer sentence selection task i.e., given the set of sentences in the passage p, the task is to select the sentence s ∈ p that contains the answer a. Treating QA as an answer sentence selection task is quite common in literature (e.g. see Yu et al., 2014). We model QG as the task of transforming a sentence in the passage into a question. Previous work in QG (Heilman and Smith, 2009) transforms text sentences into questions via some set of manually engineered rules. However, we take an end-to-end neural approach.
Let D_0 be a labeled dataset of (passage, question, answer) triples where the answer is given by selecting a sentence in the passage. We also assume access to unlabeled text T which will be used to augment the training of the two models.

The Question Answering Model
Since we model QA as the task of selecting an answer sentence from the passage, we treat each sentence in the corresponding passage as a candidate answer for every input question.
We employ a neural network model inspired by the Attentive Reader framework proposed in Hermann et al. (2015). We map all words in the vocabulary to corresponding $d$-dimensional vector representations via an embedding matrix $E \in \mathbb{R}^{d \times V}$. Thus, the input passage $p$ can be denoted by the word sequence $\{p_1, p_2, \ldots, p_{|p|}\}$ and the question $q$ can similarly be denoted by the word sequence $\{q_1, q_2, \ldots, q_{|q|}\}$, where each token $p_i \in \mathbb{R}^d$ and $q_i \in \mathbb{R}^d$.
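We use a bi-directional LSTM (Graves et al., 2005) with dropout regularization as in Zaremba et al. (2014) to encode contextual embeddings of each word in the passage:

$$\overrightarrow{h}_t = \mathrm{LSTM}(\overrightarrow{h}_{t-1}, p_t), \qquad \overleftarrow{h}_t = \mathrm{LSTM}(\overleftarrow{h}_{t+1}, p_t)$$

The final contextual embeddings $h_t$ are given by concatenation of the forward and backward pass embeddings: $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. Similarly, we use another bi-directional LSTM and encode contextual embeddings of each word in the question.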
Then, we use an attention mechanism (Bahdanau et al., 2014) to compute the alignment distribution $a$ based on the relevance among passage words and the question: $a_i = \mathrm{softmax}(q^T W h_i)$. The output vector $o$ is a weighted combination of all contextual embeddings: $o = \sum_i a_i h_i$. Finally, the correct answer $a^*$ among the set of candidate answers $A$ is the candidate that the model scores highest given $o$. We learn the model by maximizing the log-likelihood of correct answers. Given the training set $\{(p^{(i)}, q^{(i)}, a^{(i)})\}_{i=1}^{N}$, the log-likelihood is:

$$L_{QA} = \sum_{i=1}^{N} \log P(a^{(i)} \mid p^{(i)}, q^{(i)}; \theta)$$

Here, $\theta$ represents all the model parameters to be estimated.
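The attention step above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the implementation used in our experiments: the function names `softmax` and `attend`, the use of a single question vector `q`, and the toy dimensions are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attend(q, W, H):
    """Bilinear attention of a question vector over passage states.

    q: question encoding, shape (d_q,)
    W: bilinear weight matrix, shape (d_q, d_h)
    H: contextual passage embeddings h_1..h_n, shape (n, d_h)
    Returns the alignment distribution a (n,) and output vector o (d_h,).
    """
    scores = H @ (W.T @ q)   # scores[i] = q^T W h_i
    a = softmax(scores)      # alignment distribution over passage positions
    o = a @ H                # o = sum_i a_i h_i
    return a, o

# Toy example with random values (hypothetical sizes: n=5, d_h=4, d_q=3).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
q = rng.normal(size=3)
W = rng.normal(size=(3, 4))
a, o = attend(q, W, H)
```

The output vector `o` has the same dimensionality as a single contextual embedding, so it can be fed directly to whatever scoring layer ranks the candidate answers.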

The Question Generation Model
We use a seq2seq model (Sutskever et al., 2014) with soft attention (Bahdanau et al., 2014) as our QG model. The model transduces an input sequence $x$ to an output sequence $y$. Here, the input sequence is a sentence in the passage and the output sequence is a generated question. Let $x = \{x_1, x_2, \ldots, x_{|x|}\}$, $y = \{y_1, y_2, \ldots, y_{|y|}\}$ and $\mathcal{Y}$ be the space of all possible output questions. Thus, we can represent the QG task as finding $\hat{y} \in \mathcal{Y}$ such that $\hat{y} = \arg\max_{y} P(y \mid x)$.
Here, $P(y \mid x)$ is the conditional probability of a question sequence $y$ given input sequence $x$.

Decoder: Following Sutskever et al. (2014), the conditional factorizes over token-level predictions:

$$P(y \mid x) = \prod_{t=1}^{|y|} P(y_t \mid y_{<t}, x)$$

Here, $y_{<t}$ represents the subsequence of words generated prior to the time step $t$. For the decoder, we again follow Sutskever et al. (2014):

$$P(y_t \mid y_{<t}, x) = \mathrm{softmax}\left(W \tanh\left(W_t [h^{(d)}_t; c_t]\right)\right)$$

where $h^{(d)}_t$ is the decoder RNN state at time step $t$, and $c_t$ is the attention-based encoding of the input sequence $x$ at decoding time step $t$ (described later). Also, $W$ and $W_t$ are model parameters to be learned. We use an LSTM with dropout (Zaremba et al., 2014) as the decoder RNN. The LSTM generates the new decoder state $h^{(d)}_t$ given the representation of the previously generated word $y_{t-1}$, obtained using a look-up dictionary, and the previous decoder state $h^{(d)}_{t-1}$.

Encoder: We use a bi-directional LSTM (Graves et al., 2005) with an attention mechanism as our sentence encoder. We use two LSTMs: one that makes a forward pass over the sequence and another that makes a backward pass, as in the QA model described earlier. We use dropout regularization for LSTMs as in Zaremba et al. (2014) in our implementation. The final context-dependent token representation $h^{(e)}_t$ is the concatenation of the forward and backward pass token representations: $h^{(e)}_t = [\overrightarrow{h}^{(e)}_t; \overleftarrow{h}^{(e)}_t]$. To obtain the final context-dependent representation $c_j$ at the decoding time step $j$, we take a weighted average over token representations: $c_j = \sum_i a_{ij} h^{(e)}_i$. Following Bahdanau et al. (2014), the attention weights $a_{ij}$ are calculated by bilinear scoring followed by softmax normalization:

$$a_{ij} = \frac{\exp\left(h^{(d)T}_j W_b h^{(e)}_i\right)}{\sum_{i'} \exp\left(h^{(d)T}_j W_b h^{(e)}_{i'}\right)}$$

where $W_b$ is a learned bilinear weight matrix.

Learning and Inference: We train the encoder-decoder framework by maximizing the data log-likelihood on a large training set with respect to all the model parameters $\theta$. Let $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$ be the training set. The log-likelihood can be written as:

$$L_{QG} = \sum_{i=1}^{N} \log P(y^{(i)} \mid x^{(i)}; \theta)$$

We use beam search for inference. As in previous works, we introduce a <UNK> token to model rare words during decoding. These <UNK> tokens are finally replaced by the token in the input sentence with the highest attention score.
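The <UNK> replacement step at the end of decoding can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `replace_unk`, the list-of-lists layout of the attention weights, and the toy sentence are hypothetical and chosen only for the example.

```python
def replace_unk(output_tokens, source_tokens, attention, unk="<UNK>"):
    """Replace each <UNK> in the generated question with the source token
    that received the highest attention weight at that decoding step.

    attention[t][i] is the (assumed) attention weight on source token i
    when output token t was generated.
    """
    result = []
    for t, token in enumerate(output_tokens):
        if token == unk:
            weights = attention[t]
            best = max(range(len(source_tokens)), key=lambda i: weights[i])
            result.append(source_tokens[best])
        else:
            result.append(token)
    return result

# Toy example: uniform attention everywhere except at the <UNK> step.
source = ["marie", "curie", "won", "the", "nobel", "prize"]
output = ["who", "won", "the", "<UNK>", "prize", "?"]
uniform = [1.0 / 6] * 6
attention = [list(uniform) for _ in output]
attention[3] = [0.05, 0.05, 0.05, 0.05, 0.7, 0.1]  # peak on "nobel"
question = replace_unk(output, source, attention)
```

Tokens that are not <UNK> pass through unchanged, so the step can be applied to every beam hypothesis after decoding.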

Self-training Framework for Joint Training of QA and QG Models
In our self-training framework, we are given unlabeled text in addition to the labeled passage, question and answer triples. Self-training (Yarowsky, 1995; Riloff et al., 2003), also known as self-teaching, is one of the earliest techniques for using unlabeled data along with labeled data to improve learning. During self-training, the learner keeps on labeling unlabeled examples and retraining itself on an enlarged labeled training set. We extend self-training to jointly learn two models (namely, QA and QG) iteratively. The QA and QG models are first trained on the labeled corpus.
Then, the QG model is used to create more questions from the unlabeled text corpus and the QA model is used to answer these newly created questions. These new questions (carefully selected by an oracle; details later) and the original labeled data are then used to (stochastically) update the two models. This procedure can be repeated as long as both models continue to improve.
Algorithm 1:
1  θ_qa^(0) ← Train initial QA model.
2  θ_qg^(0) ← Train initial QG model.
3  Init: i = 0
4  while performance on dev set rises do
5      CQ_i ← Set of candidate questions generated using our QG model θ_qg^(i) from the unlabeled text T which are not in D.
6      Q_i ← k × m^i questions drawn from CQ_i using our question selector oracle QS.
7      A_i ← Set of answers to questions Q_i obtained using our QA model θ_qa^(i).
8      Let D_i be the set of chosen questions Q_i and answers A_i.
9      Subsample S1 ⊂ D_i of size k1 and S2 ⊂ D_0 of size k2. Let S = S1 ∪ S2.
10     θ_qa^(i+1), θ_qg^(i+1) ← Update the QA and QG models on S.
11     i ← i + 1

Algorithm 1 describes the procedure in detail. In each successive iteration, we allow the addition of more questions than were introduced in the previous iteration, by a multiplicative factor. This scheme adds fewer questions initially, when the QA and QG models are weak, and more questions thereafter, when the two models have (hopefully) improved. We found that this scheme works better in practice than adding a fixed number of questions in each iteration. The two models are updated on a subsample of the newly generated datapoints and the original labeled data.
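The control flow of Algorithm 1 can be sketched in Python as below. This is a hedged outline rather than our actual training code: the callables `train_qa`, `train_qg`, `generate`, `select`, `answer` and `dev_score` are hypothetical placeholders that the caller must supply, the stopping rule (stop when the dev score does not strictly improve) is one possible reading of "while performance on dev set rises", and `max_iters` is an added safety cap.

```python
import random

def self_train(train_qa, train_qg, generate, select, answer, dev_score,
               D0, T, k=2, m=2, k1=2, k2=2, max_iters=10, seed=0):
    """Sketch of the self-training loop.

    D0: labeled (passage, question, answer) triples.
    T:  unlabeled passages.
    train_*(model, data) returns an updated model (model=None for initial training).
    generate(qg, passage) returns candidate questions for a passage.
    select(qa, qg, candidates, n) is the question selection oracle.
    answer(qa, passage, question) returns the predicted answer sentence.
    """
    rng = random.Random(seed)
    qa, qg = train_qa(None, D0), train_qg(None, D0)   # steps 1-2
    seen = {q for (_, q, _) in D0}                    # questions already used
    best = dev_score(qa, qg)
    i = 0                                             # step 3
    while i < max_iters:                              # step 4 (with a safety cap)
        # Step 5: candidate questions not already in the data.
        candidates = [(p, q) for p in T for q in generate(qg, p) if q not in seen]
        # Step 6: the oracle picks k * m^i questions.
        chosen = select(qa, qg, candidates, k * m ** i)
        if not chosen:
            break
        # Steps 7-8: answer the chosen questions with the current QA model.
        Di = [(p, q, answer(qa, p, q)) for (p, q) in chosen]
        # Step 9: mix new synthetic data with original labeled data.
        S = rng.sample(Di, min(k1, len(Di))) + rng.sample(D0, min(k2, len(D0)))
        # Step 10: update both models on S; keep them only if dev improves.
        new_qa, new_qg = train_qa(qa, S), train_qg(qg, S)
        score = dev_score(new_qa, new_qg)
        if score <= best:
            break
        qa, qg, best = new_qa, new_qg, score
        seen.update(q for (_, q, _) in Di)
        i += 1                                        # step 11
    return qa, qg, i
```

Because the number of selected questions grows as `k * m ** i`, early iterations add only a handful of synthetic examples while later iterations add many more, mirroring the multiplicative schedule described above.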
Self-training has been seldom used in NLP. Most prominently, it has been used for WSD (Yarowsky, 1995), noun learning (Riloff et al., 2003) and AMR parsing and generation (Konstas et al., 2017). However, it has not been explored in this way for QA and QG.

The Question Selection Oracle
A key challenge in self-training is selecting which unlabeled data sample to label (i.e., which generated questions to add to the training set). The self-training process may erroneously generate some bad or incorrect questions which can sidetrack the learning process. Thus, we implement a question selection oracle which determines which questions to add among the potentially very large set of questions generated by the QG model in each iteration.
Traditional wisdom in self-training (Yarowsky, 1995; Riloff et al., 2003) advises selecting a subset of questions on which the models have the highest confidence. We experiment with this idea, proposing multiple self-training oracles which introduce questions in the order of how confident the QA and QG models are on the new potential question:
• QG: The QG oracle introduces questions in the order of how confident the QG model is in generating the question. This is calculated by a number of heuristics (described later).
• QA: The QA oracle introduces questions in the order of how confident the QA model is in answering the question. This too is calculated by some heuristics (described later).
• QA+QG: The QA+QG oracle introduces a question when both QA and QG models are confident about the question. The oracle computes the minimum confidence of the QA and QG models for a question and introduces questions which have the highest minimum confidence score.
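The three oracles differ only in how they score a candidate question, which the following sketch makes explicit. The function name `select_questions` and the tuple layout `(question, qa_conf, qg_conf)` are assumptions for illustration; how the confidence values themselves are computed is left to the heuristics described later.

```python
def select_questions(candidates, n, oracle="QA+QG"):
    """Pick the n candidate questions the chosen oracle is most confident about.

    candidates: list of (question, qa_conf, qg_conf) tuples, where the two
    confidence values are assumed to be comparable (e.g. both in [0, 1]).
    """
    def score(candidate):
        _, qa_conf, qg_conf = candidate
        if oracle == "QA":
            return qa_conf
        if oracle == "QG":
            return qg_conf
        if oracle == "QA+QG":
            # Both models must be confident: use the weaker of the two scores.
            return min(qa_conf, qg_conf)
        raise ValueError("unknown oracle: " + oracle)

    ranked = sorted(candidates, key=score, reverse=True)
    return [question for question, _, _ in ranked[:n]]

# Toy confidences (made up for the example).
cands = [("q1", 0.9, 0.2), ("q2", 0.6, 0.7), ("q3", 0.3, 0.95)]
picked = select_questions(cands, 2, oracle="QA+QG")
```

In this toy example the QA oracle would favor q1 and the QG oracle q3, while the QA+QG oracle ranks q2 first because it is the only question both models are reasonably confident about.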
Our question selection heuristics are based on the ideas of curriculum learning and diversity:
1. Curriculum learning (Bengio et al., 2009; Sachan and Xing, 2016a) requires ordering questions on the easiness scale, so that easy questions can be introduced to the learning algorithm first and harder questions can be introduced successively. The main challenge in learning the curriculum is that it requires the identification of easy and hard questions. In our setting, such a ranking of easy and hard questions is difficult to obtain. A human judgement of the 'easiness' of a question might not correlate with what is easy for our algorithms in their feature and hypothesis space. We explore various heuristics that define a measure of easiness and learn the ordering by selecting questions using this measure.
2. A number of cognitive scientists (Cantor, 1946) argue that alongside curriculum learning, it is important to introduce diverse (even if sometimes hard) samples. Inspired by this, we introduce a measure of diversity and show that we can achieve further improvements by coupling the curriculum learning heuristics with a measure for diversity.
Curriculum Learning: Studies in cognitive science (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009) have shown that humans learn much better when the training examples are not randomly presented but organized in increasing order of difficulty. In the machine learning community, this idea was introduced with the nomenclature of curriculum learning (Bengio et al., 2009), where a curriculum is designed by ranking samples based on manually curated difficulty measures. A manifestation of this idea is self-paced learning (SPL) (Kumar et al., 2010; Jiang et al., 2014, 2015), which selects samples based on the local loss term of the sample. We extend this idea and explore the following heuristics for our various oracles:

1) Greedy Optimal (GO): The simplest greedy heuristic is to pick a question $q$ which has the minimum expected effect on the QA and QG models. The expected effect of adding $q$ can be written as:

$$\sum_{a \in A} p(a^* = a)\, \mathbb{E}[L_{QA/QG}]$$

where $L_{QA/QG}$ denotes the QA objective, the QG objective, or both, depending on which oracle we are using. $p(a^* = a)$ can be estimated by computing the scores of each of the answer candidates for $q$ and normalizing them. $\mathbb{E}[L_{QA/QG}]$ can be estimated by retraining the model(s) after adding this question.

2) Change in Objective (CiO): Choose the question $q$ that causes the smallest increase in $L_{QA/QG}$. If there are multiple questions with the smallest increase in objective, pick one of them randomly.

3) Mini-max ($M^2$): Choose the question $q$ that minimizes the expected risk when including the question with the answer candidate $a$ that yields the maximum error.

4) Expected Change in Objective (ECiO): In this greedy heuristic, we pick a question $q$ which has the minimum expected effect on the model. The expected effect can be written as:

$$\sum_{a \in A} p(a^* = a)\, \mathbb{E}[L_{QA/QG}]$$

Here, $p(a^* = a)$ can again be obtained by computing the scores of each of the answer candidates for $q$ and normalizing them, and $\mathbb{E}[L_{QA/QG}]$ can be estimated by evaluating the model.

5) Change in Objective-Expected Change in Objective (CiO-ECiO): We pick a question $q$ which has the minimum value of the difference between the change in objective and the expected change in objective described above. Intuitively, the difference represents how much the model is surprised to see this new question.

Time Complexity: GO and CiO require updating the model, $M^2$ and ECiO require performing inference on candidate questions, and CiO-ECiO requires retraining as well as inference. Thus, $M^2$ and ECiO are computationally the most efficient.

Ensembling: We introduce an ensembling strategy that combines the heuristics into an ensemble. We tried two ensembling strategies. The first strategy computes the average score over all the heuristics for all potential (top-K in beam) questions and picks questions with the highest average. The second strategy uses the minimum instead of the average. The minimum works better than the average in practice and we use it in our experiments. The use of the minimum is inspired by agreement-based learning (Liang et al., 2008), a well-known extension of self-training which uses multiple views of the data (described using different feature sets or models) and adds new unlabeled samples to the training set when multiple models agree on the label.

Diversity: The strategy of introducing easy questions first and then gradually introducing harder questions is intuitive as it allows the learner to improve gradually. Yet, it has one key deficiency. With curriculum learning, by focusing on easy questions first, our learning algorithm is usually not exposed to a diverse set of questions. This is particularly a problem for deep-learning approaches that learn representations during the process of learning. Hence, when a harder question arrives, it can be difficult for the learner to adjust to the new question as the current representation may not be appropriate for the new level of question difficulty. We tackle this by introducing an explore and exploit (E&E) strategy. E&E ensures that while we still select easy questions first, our selection is also as diverse as possible. We define a measure for diversity as the angle between the question vectors:

$$\angle(q_i, q_j) = \cos^{-1}\left(\frac{q_i \cdot q_j}{\|q_i\|\,\|q_j\|}\right)$$

E&E picks the question which optimizes a convex combination (with the mixing weight tuned on the dev set) of the curriculum learning objective and the sum of angles between the candidate questions and the questions in the training set.
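The selection criteria above can be made concrete with a short sketch. It is illustrative only and rests on several assumptions that are not part of the description above: candidate scores are normalized with a softmax, the per-candidate losses are supplied by the caller (from retraining for GO, or from evaluating the model for ECiO), the mini-max criterion is simplified to the worst-case loss over answer candidates, and `easiness` is a higher-is-easier score (for instance the negated curriculum objective). All function names are hypothetical.

```python
import math
import numpy as np

def normalize(scores):
    """Softmax-normalize answer-candidate scores into p(a* = a) (assumed normalization)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_effect(scores, losses):
    """sum_a p(a* = a) * loss_a, where loss_a is the objective if a were the answer."""
    return sum(p * l for p, l in zip(normalize(scores), losses))

def worst_case(losses):
    """Simplified mini-max criterion: the loss under the worst answer candidate."""
    return max(losses)

def ensemble_min(heuristic_scores):
    """Combine confidence scores (one row per heuristic) by the per-question minimum."""
    return np.min(np.asarray(heuristic_scores, dtype=float), axis=0)

def angle(u, v):
    """Angle in radians between two question vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def explore_exploit(cand_vecs, easiness, train_vecs, lam=0.5):
    """Index of the candidate maximizing lam * easiness + (1 - lam) * diversity,
    where diversity is the sum of angles to the questions already in training."""
    diversity = [sum(angle(c, t) for t in train_vecs) for c in cand_vecs]
    combined = [lam * e + (1.0 - lam) * d for e, d in zip(easiness, diversity)]
    return int(np.argmax(combined))

# Toy example: q1 has a confident but risky answer distribution, q2 a flat, safe one.
questions = {"q1": ([2.0, 0.0], [0.5, 4.0]), "q2": ([0.0, 0.0], [1.0, 1.5])}
by_expectation = min(questions, key=lambda q: expected_effect(*questions[q]))
by_worst_case = min(questions, key=lambda q: worst_case(questions[q][1]))
```

In the toy example the expectation-based criterion prefers q1 while the worst-case criterion prefers q2, which shows why the heuristics can disagree and why combining them (here by the minimum) may be useful. Setting `lam=1.0` in `explore_exploit` recovers pure curriculum ordering, and smaller values trade easiness for diversity.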

Experiments
Implementation Details: We perform the same preprocessing on all the text. We lower-case all the text and use NLTK for word tokenization. For training our neural networks, we only keep the most frequent 50k words (including entity and placeholder markers), and map all other words to a special <UNK> token. We choose word embedding size d = 100, and use the 100-dimensional pretrained GloVe word embeddings (Pennington et al., 2014) for initialization. We set k, m, k_1 and k_2 (hyperparameters for self-training) by grid search on a held-out development set.

Datasets: We report our results on four datasets: SQUAD (Rajpurkar et al., 2016), MS MARCO (Nguyen et al., 2016), WikiQA (Yang et al., 2015) and TrecQA (Wang et al., 2007). SQUAD is a reading comprehension dataset with questions posed by crowd workers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. MS MARCO contains questions which are real anonymized queries issued through Bing or Cortana, and the documents are related web pages which may or may not help answer the question. WikiQA is also a dataset of queries taken from Bing query logs. Based on user clicks, each query is associated with a Wikipedia page. The summary paragraph of the page is taken as candidate answer sentences, with labels on whether the sentence is a correct answer to the question provided by crowd workers. Finally, TrecQA is an answer sentence selection dataset from the TREC QA track. While WikiQA and TrecQA are directly answer sentence selection tasks, the other two are not. Hence, we treat the SQUAD and MS MARCO tasks as answer sentence selection tasks, assuming a one-to-one correspondence between answer sentences and annotated correct answer spans. Note that only a very small proportion of answers (< 0.2% in the training set) span two or more sentences.
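Converting a span-annotated example into an answer sentence selection example amounts to finding the sentence that contains the annotated span. The sketch below shows one way to do this; it is an assumption-laden illustration, not our preprocessing code. In particular, the regex-based `split_sentences` is a naive stand-in for a real sentence splitter, and the character-offset interface is hypothetical.

```python
import re

def split_sentences(passage):
    """Naive sentence splitter (stand-in for a proper tokenizer).
    Returns a list of (start, end, text) with character offsets into passage."""
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]*\s*", passage):
        text = match.group().strip()
        if text:
            spans.append((match.start(), match.end(), text))
    return spans

def answer_sentence_index(passage, answer_start, answer_end):
    """Index of the single sentence that fully contains the answer span
    [answer_start, answer_end), or None if the span crosses sentences."""
    for i, (start, end, _) in enumerate(split_sentences(passage)):
        if start <= answer_start and answer_end <= end:
            return i
    return None

passage = "Paris is the capital of France. It lies on the Seine."
start = passage.index("Seine")
label = answer_sentence_index(passage, start, start + len("Seine"))
```

Spans that cross a sentence boundary return `None`, which corresponds to the small fraction of answers that span two or more sentences.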
Since SQUAD and MS MARCO have a hidden test set, we only use the training and development sets for our evaluation purposes and we further split the provided development set into a dev and test set. This is also the data analysis setting used in previous works (Du et al., 2017). In fact, we use the same setting as these works for comparison. The statistics of the four datasets and the respective train, dev and test splits are given in Table 1. For WikiQA and TrecQA datasets, we use the standard data splits. We use a large randomly subsampled corpus of English Wikipedia and use the first paragraph of each document as unlabeled text for self-training.

For QG, our evaluation metrics include METEOR (Denkowski and Lavie, 2014) and Rouge-L (Lin, 2004), which measure the overlap between generated and ground truth questions.

Baselines: For SQUAD and MS MARCO datasets, we use four QA baselines that have been used in previous works. The first two baselines, WordCnt and NormWordCnt, have been taken from Yang et al. (2015) and Yin et al. (2015), and are based on simple word overlap, which has been shown to be a strong baseline. These compute word co-occurrence between a question sentence and the candidate answer sentence. While WordCnt uses unnormalized word co-occurrence, NormWordCnt uses normalized word co-occurrence. The third and fourth baselines are CDSSM (Shen et al., 2014) and ABCNN (Yin et al., 2015), which use a neural network approach to model semantic relatedness of sentence pairs. For the WikiQA and TrecQA datasets, we report results of various existing state-of-the-art approaches on the two datasets. For QG, we compare our model against the following four baselines used in previous work (Du et al., 2017). The first baseline is a simple IR baseline taken from Rush et al. (2015) which generates questions by memorizing them from the training set and uses edit distance (Levenshtein, 1966) to calculate the distance between a question and the input sentence.
The second baseline is an MT system, MOSES (Koehn et al., 2007), which models question generation as a translation task where raw sentences are treated as source texts and questions are treated as target texts. The third baseline, DirectIn, uses the longest sub-sentence of the input sentence (using a set of simple sentence splitters) as the question. The fourth baseline, H&S, is a rule-based overgenerate-and-rank system proposed by Heilman and Smith (2010).

The Question Selection Oracle: The first question we wish to answer is: Is careful question selection even necessary? To answer this, we plot MAP scores for our best QA model (QA+QG, Ensemble+E&E) when we do not have a curriculum learning based oracle (i.e., with an oracle which picks questions to be added to the dataset randomly) in Figure 1 as a function of epochs. We observe that the MAP score degrades instead of improving with time. This supports our claim that we need to augment the training set by a more careful procedure. We also plot MAP scores for our best QA model (Ensemble+E&E) when we use various question selection oracles as a function of the amount of unlabeled data in Figure 2. We can observe that when we do not have a curriculum learning based oracle, the MAP score degrades as more and more unlabeled data is added. We also observe that the QA+QG oracle performs better than QA and QG, which confirms that the best oracle is one that selects questions in increasing degree of hardness in terms of both question answering and question generation. This holds for all the experimental settings. Thus, we only show results for the QA+QG strategies in our subsequent experiments.

Evaluating Question Answering: First, we evaluate our models on the question answering task. Ensemble+E&E(K) is the variant where we perform self-training using K Wikipedia paragraphs.
Hence, Ensemble+E&E(0) is the variant of our model without any self-training. We vary K to see the impact of the size of unlabeled Wikipedia paragraphs on the self-training model. Table 2 shows the results of the QA evaluations on the SQUAD and MS MARCO datasets. We can observe that our QA model has competitive or better performance compared to all the baselines on both datasets in terms of all three evaluation metrics. When we incorporate ensembling or diversity, we see a further improvement in the result. Tables 3 and 4 show results of QA evaluations on the WikiQA and TrecQA datasets, respectively. We can again observe that our QA model is competitive with all the baselines. When we introduce ensembling and diversity while jointly learning the QA and QG models, we see incremental improvements. In both these answer sentence selection tasks, our approach achieves new state-of-the-art results.

Results of existing approaches:

Method                        MAP     MRR
CNN (Yang et al., 2015)       0.665   0.652
APCNN (Santos et al., 2016)   0.696   0.689
NASM (Miao et al., 2016)      0.707   0.689
ABCNN (Yin et al., 2015)      0.702   0.692
KVMN (Miller et al., 2016)    0.707   0.727
Wang et al. (2016b)           0.706   0.723
Wang et al. (2016a)           0.734   0.742
Wang and Jiang (2016)         0.743   0.755

Note: results for some baselines, including Du et al. (2017), were not reported for all the settings.
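For reference, MAP and MRR over ranked candidate answer sentences can be computed as in the sketch below. This is a generic illustration of the two metrics, not our evaluation script: the function names are hypothetical, each question is represented by the relevance labels of its candidates in ranked order, and questions without any correct candidate contribute a score of 0 here (an assumption; evaluation scripts for these datasets may instead exclude such questions).

```python
def average_precision(labels):
    """Average precision for one question.
    labels: 0/1 relevance of the candidates, in the order ranked by the model."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """1 / rank of the first correct candidate (0 if there is none)."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def mean_metric(metric, ranked_lists):
    """Mean of a per-question metric over all questions (MAP or MRR)."""
    return sum(metric(labels) for labels in ranked_lists) / len(ranked_lists)

# Two toy questions with ranked candidate labels.
ranked = [[0, 1, 0, 1], [1, 0, 0]]
map_score = mean_metric(average_precision, ranked)
mrr_score = mean_metric(reciprocal_rank, ranked)
```

For the first toy question the correct sentences sit at ranks 2 and 4, giving an average precision of (1/2 + 2/4) / 2 = 0.5; the second question has its correct sentence at rank 1, so both MAP and MRR come out to 0.75.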
Evaluating Question Generation: Table 5 shows the results for QG on each of the three evaluation metrics on all four datasets. We can observe that the QG model described in our paper performs much better than all the baselines. We again observe that self-training while jointly training the QA and QG models leads to even better performance. These results show that self-training and leveraging the relationship between QA and QG is very useful for boosting the performance of the QA and QG models, while additionally using only cheap unlabeled data.
Human Evaluations: We asked two people not involved with this research to evaluate 1000 (randomly selected) questions generated by our best QG model and our best performing baseline (Du et al., 2017) on SQUAD for fluency and correctness on a scale of 1 to 5. The raters were also shown the passage sentence used to generate the question. The raters were blind to which system produced which question. The Pearson correlation between the raters' judgments was r = 0.89 for fluency and r = 0.78 for correctness. In our analyses, we used the averages of the two raters' judgments. The evaluation showed that our system generates questions that are more fluent and correct than those by the baseline. The mean fluency rating of our best system was 4.15 compared to 3.35 for the baseline, a difference which is statistically significant (t-test, p < 0.001).
Evaluating the Question Selection Oracle: As discussed earlier, the choice of which subset of questions to add to our labeled dataset while self-training is important. To evaluate the various heuristics proposed in our paper, we show the effect of the question selection oracle on the final QA and QG performance in Tables 2, 3, 4 and 5. These comparisons are shown in the shaded grey portions of the tables for self-training with 10,000 Wikipedia paragraphs as unlabeled data.
We can observe that all the proposed heuristics (and ensembling and diversity strategies) lead to improvements in the final performance of both QA and QG. The heuristics arranged in increasing order of performance are: $M^2$, ECiO, GO, CiO and CiO-ECiO. While the choice of heuristic seems to have a smaller impact on the final performance, we do see a much more significant performance gain from ensembling to combine the various heuristics and using E&E to incorporate diversity. The incorporation of diversity is important because neural network models, which learn latent representations of data, usually find it hard to adjust to a new level of question difficulty, as the current representation may not be appropriate for the new level of difficulty.

Low data scenarios: A key advantage of our self-training approach is that it can leverage unlabeled text, and thus requires less labeled data. To test this, we plot MAP for our best self-training model and various QA baselines as we vary the proportion of the labeled training set in Figure 3. However, we keep the unlabeled text fixed (10K Wikipedia paragraphs). We observe that all the baselines drop significantly in performance as we reduce the proportion of the labeled training set. However, the drop happens at a much slower rate for our self-trained model. Thus, we can conclude that our approach requires less labeled data as compared to the baselines.

Does more unlabeled text always help?: Another important question is: Does more unlabeled text always improve our models? Will the performance improve if we add more and more unlabeled data during self-training? According to our results in Tables 2, 3, 4 and 5, the answer is "probably yes". As we can observe from these tables, the performance of the QA and QG models improves as we increase K, the size of the unlabeled data used during training of the various Ensemble+E&E(K) models.
Having said that, we do see a tapering effect on the performance results, so it is clear that the performance will be capped by some upper-bound and we will need better ways of modeling language and meaning to make progress.

Related Work
Our work proposes an approach for jointly modeling QA and QG. While QA has received a lot of attention from the research community, with large scale community evaluations such as NTCIR, TREC and CLEF spurring progress, the focus on QG is much more recent. Recently, there has been a renewed interest in reading comprehension (also known as machine comprehension, a nomenclature popularized by Richardson et al. (2013)). Various approaches (Sachan et al., 2015; Wang et al., 2015; Sachan and Xing, 2016b; Narasimhan and Barzilay, 2015) have been proposed for solving this task. After the release of large benchmarks such as SQUAD, MS MARCO and WikiQA, there has been a surge in interest in using neural network or deep-learning models for QA (Yin et al., 2015; Seo et al., 2016; Shen et al., 2016; Liu et al., 2017). In our work, we deal with the answer sentence selection task and adapt the Attentive Reader framework proposed in Hermann et al. (2015) as our base model. While all these models were trained on question-answer pairs, we propose a self-training solution to additionally leverage unlabeled text. Similarly, there have been works on QG. Traditionally, rule-based approaches with post-processing (Woo et al., 2016; Heilman and Smith, 2009, 2010) were the norm in QG. However, recent papers build on neural network approaches such as seq2seq (Du et al., 2017), CNNs and RNNs for QG. We also choose the seq2seq paradigm in our work. However, we leverage unlabeled text in contrast to these models. Finally, some very recent works have concurrently recognized the relationship between QA and QG and have proposed joint training (Wang et al., 2017) for the two. Our work differs from these as we additionally propose self-training to leverage unlabeled data to improve the two models.

Self-training has seldom been used in NLP. Most prominently, it has been used for word sense disambiguation (Yarowsky, 1995), noun learning (Riloff et al., 2003) and recently, AMR parsing and generation (Konstas et al., 2017). However, it has not been explored in this way for QA and QG.
An important decision in the workings of our self-training algorithm was the question selection using curriculum learning. While curriculum learning has seldom been used in NLP, we draw some ideas for curriculum learning from Sachan and Xing (2016a), who conduct a case study of curriculum learning for question answering. However, their work focuses only on QA and not QG.

Conclusion
We described self-training algorithms for jointly learning to answer and ask questions while leveraging unlabeled data. We experimented with neural models for question answering and question generation and various careful strategies for question filtering based on curriculum learning and diversity promotion. This led to improved performance for both question answering and question generation on multiple datasets and new state-of-the-art results on the WikiQA and TrecQA datasets.