Multi-Task Learning with Language Modeling for Question Generation

This paper explores the task of answer-aware question generation. Based on an attention-based pointer-generator model, we propose to incorporate an auxiliary language modeling task to help question generation in a hierarchical multi-task learning structure. Our joint-learning model enables the encoder to learn a better representation of the input sequence, which guides the decoder to generate more coherent and fluent questions. On both the SQuAD and MARCO datasets, our multi-task learning model boosts performance, achieving state-of-the-art results. Moreover, human evaluation further confirms the high quality of our generated questions.


Introduction
Question generation (QG) has received increasing interest in recent years due to its benefits to several real-world applications: (1) QG can aid the creation of annotated questions to boost question answering systems; (2) QG enables dialogue systems to ask questions, making them more proactive (Shum et al., 2018; Colby, 1975); (3) QG can help generate questions for reading comprehension texts in education. In this paper, we focus on answer-aware QG: given a sentence and an answer span as input, we want to generate a question whose response is the answer.
Previous work on QG has mainly followed two approaches: rule-based and neural-based. The neural-based approach has developed rapidly thanks to the release of large-scale reading comprehension datasets such as SQuAD (Rajpurkar et al., 2016) and MARCO (Nguyen et al., 2016). Most neural approaches to QG employ the encoder-decoder framework, incorporating an attention mechanism to focus on the informative parts of the input and a copy mode to copy tokens from the input text (Subramanian et al., 2018; Zhao et al., 2018; Sun et al., 2018). To make better use of answer information, prior work leverages multi-perspective matching, and Sun et al. (2018) propose a position-aware model that puts more emphasis on the context words surrounding the answer. Zhao et al. (2018) aggregate paragraph-level information to help QG. Another line of work treats question answering and question generation as dual tasks. Some other works generate questions from a text without answers as input (Subramanian et al., 2018). Although some progress has been made, there is still much room for improvement in QG.

Multi-task learning is an effective way to improve model expressiveness via related tasks by introducing more data and richer semantic information to the model (Caruana, 1998). Many works in NLP have adopted multi-task learning and proved its effectiveness on textual entailment (Hashimoto et al., 2017), keyphrase generation (Ye and Wang, 2018), and document summarization (Guo et al., 2018). To the best of our knowledge, no prior work attempts to employ multi-task learning for question generation. Although language modeling has been applied in multi-task learning for classification tasks, those settings differ from our generation task.
In this work, we propose to incorporate language modeling as an auxiliary task to help QG via multi-task learning. We adopt the pointer-generator (See et al., 2017) reinforced with features as the baseline model, which yields state-of-the-art results (Sun et al., 2018). The language modeling task is to predict the next word and the previous word, taking plain text as input without relying on any annotation. The two tasks are combined in a hierarchical structure, where the low-level language modeling encourages the representation to learn richer language features that help the high-level network generate more expressive questions.

Figure 1: Overall structure of our joint-learning model.
We conduct extensive experiments on two reading comprehension datasets: SQuAD and MARCO. We experiment with different settings to demonstrate the efficacy of our multi-task learning model: with/without language modeling and with/without features. Experimental results show that language modeling consistently yields a clear performance gain over the baselines on all evaluation metrics, including BLEU, perplexity, and distinct. Our full model outperforms the existing state-of-the-art results on both datasets, achieving BLEU-4 scores of 16.23 on SQuAD and 20.88 on MARCO, respectively. We also conduct a human evaluation, and our generated questions receive higher scores on all three metrics: matching, fluency, and relevance.

Model Description
The baseline model is an attention-based seq2seq pointer-generator reinforced with lexical features, following Sun et al. (2018). In our proposed model, we employ multi-task learning with language modeling as an auxiliary task for QG. The overall structure of our model is shown in Figure 1.

Feature-enriched Pointer Generator
The feature-rich encoder is a bidirectional LSTM that produces a sequence of hidden states h_t. The encoder takes a sequence of word-and-feature vectors (x_1, ..., x_T) as input, where each x_t concatenates the word embedding e_t, the answer position embedding a_t, and the lexical feature embedding l_t, i.e., x_t = [e_t; a_t; l_t]. The lexical features consist of POS tags, NER tags, and word case.
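The per-token concatenation can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the dimensions follow the experimental settings (300-d word embeddings, 32-d embeddings for answer position and each lexical feature), and the random vectors stand in for learned embeddings.

```python
import numpy as np

# Hypothetical dimensions: 300-d word embeddings, 32-d each for answer
# position and the three lexical features (POS, NER, word case).
D_WORD, D_ANS, D_LEX = 300, 32, 3 * 32
T = 10  # sequence length

rng = np.random.default_rng(0)
e = rng.normal(size=(T, D_WORD))  # word embeddings e_t
a = rng.normal(size=(T, D_ANS))   # answer-position embeddings a_t
l = rng.normal(size=(T, D_LEX))   # lexical feature embeddings l_t

# x_t = [e_t; a_t; l_t]: the per-token input to the feature-rich encoder
x = np.concatenate([e, a, l], axis=-1)
print(x.shape)  # (10, 428)
```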
The attention-based decoder is another unidirectional LSTM, conditioned on the previous decoder state s_{i-1}, the previously decoded word w_{i-1}, and the context vector c_{i-1} generated via the attention mechanism (Bahdanau et al., 2014). A two-layer feed-forward network then produces the vocabulary distribution P_vocab. The pointer generator (See et al., 2017) adds a copy mode P_copy(w), which allows copying words from the source text via pointing. The final probability distribution combines both modes with a generation probability p_g ∈ [0, 1]:

P(w) = p_g · P_vocab(w) + (1 − p_g) · P_copy(w)

The model is trained to minimize the negative log-likelihood of the target sequence. We denote this loss as E.
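The mixing of the two modes can be sketched as below. This is a simplified NumPy sketch under the standard pointer-generator formulation (See et al., 2017): P_copy is obtained by scattering the attention mass onto the vocabulary ids of the source tokens. The toy distributions and ids are invented for illustration.

```python
import numpy as np

def final_distribution(p_vocab, attn, src_ids, p_gen, vocab_size):
    """Combine generation and copy modes as in a pointer-generator:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on
    source positions holding word w)."""
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_ids, attn)  # scatter attention onto vocab ids
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

p_vocab = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])  # toy vocab dist
attn = np.array([0.5, 0.3, 0.2])     # attention over 3 source tokens
src_ids = np.array([2, 2, 4])        # their vocabulary ids
p = final_distribution(p_vocab, attn, src_ids, p_gen=0.7, vocab_size=6)
print(round(p.sum(), 6))  # 1.0 -- still a valid distribution
```

Because both P_vocab and the attention weights each sum to one, any convex combination of them is again a proper distribution.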

Language Modeling
The language model predicts the next word and the previous word in the sequence with a forward LSTM and a backward LSTM, respectively. First, we feed the input sequence into a bidirectional LSTM to obtain hidden representations h_t^lm. These states are then fed into a softmax layer to predict the next and the previous word. The training objective is to minimize the average of the negative log-likelihood of the next word and the previous word in the sequence.
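The bidirectional LM objective described above can be sketched as follows. This is an illustrative NumPy computation, assuming we already have the model's probabilities for the true next word (from the forward LSTM) and the true previous word (from the backward LSTM) at each step; the probability values are invented.

```python
import numpy as np

def lm_loss(fwd_probs, bwd_probs):
    """Average NLL of the next word (forward LSTM) and the previous
    word (backward LSTM). fwd_probs[t] is the model probability of the
    true next word at step t; bwd_probs[t] that of the true previous
    word."""
    nll_fwd = -np.log(fwd_probs).mean()
    nll_bwd = -np.log(bwd_probs).mean()
    return 0.5 * (nll_fwd + nll_bwd)

fwd = np.array([0.5, 0.25, 0.125])  # toy next-word probabilities
bwd = np.array([0.5, 0.5, 0.5])     # toy previous-word probabilities
print(round(lm_loss(fwd, bwd), 4))  # 1.0397
```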

Multi-task Learning
Instead of sharing representations between the two tasks (Rei, 2017) or encoding them at the same level (Chen et al., 2018; Kendall et al., 2018), we adopt a hierarchical structure that treats language modeling as the low-level task and the pointer-generator network as the high-level task, because language modeling is more fundamental and its semantic information benefits question generation. In detail, we first feed the input sequence into the language modeling layer to obtain a sequence of hidden states. We then concatenate these states with the input sequence to form the input of the feature-rich encoder. Finally, the LM loss is added to the main loss to form a combined training objective:

L = E + β · E_lm

where E is the question generation loss, E_lm is the language modeling loss, and β is a hyper-parameter controlling the relative importance of the two tasks.
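The hierarchical wiring can be sketched as below. This is a shape-level NumPy sketch, not the paper's implementation: the LM hidden size and the placeholder loss values are hypothetical, and the random arrays stand in for real LM states and encoder inputs.

```python
import numpy as np

T, D_IN, D_LM = 10, 428, 64  # hypothetical sequence length and sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(T, D_IN))         # feature-enriched input x_t
h_lm = rng.normal(size=(T, 2 * D_LM))  # bidirectional LM states h_t^lm

# Low-level LM states are concatenated with the input sequence to form
# the input of the high-level feature-rich encoder.
encoder_input = np.concatenate([x, h_lm], axis=-1)

# Combined training objective with beta weighting the auxiliary task.
beta = 0.6
loss_qg, loss_lm = 2.0, 1.5  # placeholder loss values
loss = loss_qg + beta * loss_lm
print(encoder_input.shape, round(loss, 2))  # (10, 556) 2.9
```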

Dataset
We conduct experiments on two reading comprehension datasets: SQuAD and MARCO, using the data shared by

Experiment Settings
Our vocabulary contains the 20,000 most frequent words in each training set. Word embeddings are initialized with the pre-trained 300-dimensional GloVe vectors and are fine-tuned during training. The representations of answer position, POS tags, NER tags, and word case are each randomly initialized as 32-dimensional vectors. The encoder of our baseline model consists of 2 BiLSTM layers, and the hidden size of both the encoder and decoder is set to 512. In our joint model, grid search is used to determine β; results are shown in Figure 2. Accordingly, we set β to 0.6. We select the best-trained checkpoint based on the dev set. To mitigate fluctuations in the training procedure, we average the 5 nearest checkpoints to obtain a single averaged model. Beam search is used with a beam size of 12.
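Checkpoint averaging amounts to a parameter-wise mean over the saved models. A minimal sketch, using plain dicts of NumPy arrays in place of real saved model states:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average parameter values across checkpoints, key by key, to
    smooth out training fluctuations. Here each checkpoint is a dict of
    name -> array; a real system would load serialized model weights."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}

# Five toy checkpoints whose single parameter takes values 0..4.
ckpts = [{"w": np.full(3, float(i))} for i in range(5)]
avg = average_checkpoints(ckpts)
print(avg["w"])  # [2. 2. 2.]
```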

Automatic Evaluation
Results on BLEU. The experimental results on BLEU (Papineni et al., 2002) are shown in Table 1. Our full model (w/ features + language modeling) significantly outperforms previous models and achieves state-of-the-art results on both datasets, with a BLEU-4 score of 16.23 on SQuAD and 20.88 on MARCO, respectively.

Results without Features. To investigate the robustness of our model, we conduct an experiment whose input sequence contains only word embeddings and answer position, without lexical features.

Results with a 3-layer Encoder. To validate that the improvement does not come merely from a deeper network, we replace the language modeling module with an extra encoder layer, i.e., we adopt a 3-layer encoder. Comparing this model (w/ features + 1-layer encoder) with the full model (w/ features + language modeling), we see that our joint-learning model performs better than simply adding an extra encoding layer. The results on MARCO also clearly show that a deeper network does not guarantee better performance.

Perplexity and Diversity. Since BLEU only measures hard matching between references and generated text, we further adopt perplexity and distinct to judge the quality of the generated questions. The results in Table 2 indicate that the language modeling task helps the model generate more fluent and readable questions. In addition, the generated questions show better diversity.
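The distinct metric used above is commonly computed as the ratio of unique n-grams to total n-grams in the generated output. A minimal sketch under that standard definition, with invented example questions:

```python
def distinct_n(sentences, n):
    """distinct-n: ratio of unique n-grams to total n-grams across the
    generated questions (higher means more diverse output)."""
    total, unique = 0, set()
    for sent in sentences:
        toks = sent.split()
        for i in range(len(toks) - n + 1):
            unique.add(tuple(toks[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

gens = ["what year was it", "what year did it end"]
print(round(distinct_n(gens, 1), 3))  # 6 unique / 9 total = 0.667
```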

Human Evaluation
To better assess the quality of the generated questions, we perform a human evaluation. Three annotators are asked to grade the generated questions on three aspects: matching indicates whether a question can be answered with the given answer; fluency indicates whether a question is fluent and grammatical; relevance indicates whether a question can be answered according to the given context. The rating score ranges from 0 to 2. We randomly sample 100 cases from each dataset for evaluation. Results are shown in Table 3. The agreement coefficient between annotators is high, validating the quality of our annotation. The results show that by incorporating language modeling, the generated questions receive higher scores on all three metrics.
Context: Prior to the early 1960s, access to the forest's interior was highly restricted, and the forest remained basically intact.
Answer: The early 1960s
Reference: Accessing the Amazon rainforest was restricted before what era?
Baseline: When did access to the forest's interior?
Joint-model: When did access to the forest's interior become restricted?
Context: This teaching by Luther was clearly expressed in his 1525 publication on the bondage of the will, which was written in response to on free will by Desiderius Erasmus (1524).
Answer: 1525
Reference: When did Luther publish on the bondage of the will?
Baseline: In what year was the bondage of the will on the bondage of the will?
Joint-model: When was the bondage of the will published?

Table 4: Examples of generated questions by different models.

Case Study
Further, Table 4 gives two examples of questions generated on the SQuAD dataset by the baseline model and our joint model, respectively. Questions generated by our proposed model are clearly more complete and grammatical.

Conclusion
This paper shows that, equipped with language modeling as an auxiliary task, a neural QG model can learn better representations that help the decoder generate more accurate and fluent questions. In future work, we will apply the auxiliary language modeling task to other neural generation systems to test its generalization ability.