BERT for Question Generation

In this study, we investigate the use of the pre-trained BERT language model for question generation tasks. We introduce two neural architectures built on top of BERT. The first is a straightforward use of BERT, which reveals the defects of directly applying BERT to text generation. The second remedies these defects by restructuring the use of BERT in a sequential manner, so that it takes information from previously decoded results. Our models are trained and evaluated on the question-answering dataset SQuAD. Experimental results show that our best model yields state-of-the-art performance, advancing the BLEU 4 score of the existing best models from 16.85 to 18.91.


Introduction
The question generation (QG) task, which takes a context and an answer as input and generates a question that targets the given answer, has received tremendous interest in recent years from both industrial and academic communities (Zhao et al., 2018) (Zhou et al., 2017) (Du et al., 2017). State-of-the-art models mainly adopt neural approaches, training a neural network based on the sequence-to-sequence framework. So far, the best-performing result is reported in (Zhao et al., 2018), which advances the state-of-the-art result from 13.9 to 16.8 (BLEU 4).
The existing QG models mainly rely on recurrent neural networks (RNN) augmented by attention mechanisms. However, the inherently sequential nature of RNN models makes long sequences difficult to handle. As a result, the existing QG models (Du et al., 2017) (Zhou et al., 2017) mainly use only sentence-level information as context. When applied to a paragraph-level context, the existing models show significant performance degradation. However, as indicated by (Du et al., 2017), providing paragraph-level information can improve QG performance. To handle long contexts, the work of (Zhao et al., 2018) introduces a maxout pointer mechanism with a gated self-attention encoder for processing paragraph-level input. That work reports state-of-the-art performance for QG tasks.
Recently, the NLP community has seen excitement around neural learning models that make use of pre-trained language models (Devlin et al., 2018) (Radford et al., 2018). The latest development is BERT, which has shown significant performance improvements over various natural language understanding tasks, such as document summarization, document classification, etc. In this study, we investigate the use of the pre-trained BERT language model for question generation tasks. We introduce two neural architectures built on top of BERT. The first is a straightforward use of BERT, which reveals the defects of directly applying BERT to text generation. As will be shown in the experiments, a naive use of BERT offers poor performance because, by construction, BERT produces all tokens at once without considering the decoding results of previous steps. Thus, as our second model, we propose a sequential question generation model based on BERT that takes information from previously decoded results. Our model is simple but effective. We think this is a feature of BERT, as the power of BERT simplifies neural architecture design for natural language processing tasks. Our model outperforms the existing best models (Zhao et al., 2018) and pushes the state-of-the-art result from 16.85 to 21.04 (BLEU 4).
The rest of this paper is organized as follows. First, in Section 2, we review the BERT model, which is the building block for our models. In Section 3, we introduce two BERT adaptations for QG tasks. Section 4 provides the performance evaluation, and Section 5 concludes our findings and discusses future work.

BERT Overview
The BERT model is built from a stack of multi-layer bidirectional Transformer encoders (Vaswani et al., 2017). The BERT model has three architecture parameter settings: the number of layers (i.e., transformer blocks), the hidden size, and the number of self-attention heads in a transformer block. Two BERT models of different sizes have been released.
• BERT base: 12 layers, 768 hidden dimensions, and 12 attention heads (in each transformer block), for a total of 110M parameters.
• BERT large: 24 layers, 1024 hidden dimensions, and 16 attention heads (in each transformer block), for a total of 340M parameters.
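As a rough sanity check on the two model sizes listed above, the following sketch counts parameters from the three architecture settings. Several quantities are assumptions that do not appear in this paper: a feed-forward width of four times the hidden size, 512 position embeddings, two segment embeddings, a pooling layer on top of [CLS], and the 30522-token vocabulary for both models. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count (weights plus biases) of a BERT encoder."""
    ffn = 4 * hidden                      # assumed feed-forward width
    layer_norm = 2 * hidden               # scale and shift vectors
    # Token, position, and segment embedding tables plus one layer norm.
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    # Query, key, value, and output projections (heads do not change the count).
    attention = 4 * (hidden * hidden + hidden)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden     # assumed dense layer on top of [CLS]
    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT base:  {base / 1e6:.1f}M parameters")
print(f"BERT large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the counts land close to the 110M and 340M figures quoted above. The number of attention heads does not enter the count, since the heads partition the same projection matrices.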
To use the BERT model, the input must be arranged in BERT's specific input format. In general, a special token [CLS] is inserted as the first token of BERT's input sequence. The final hidden state of the [CLS] token is designed to be used as the final sequence representation for classification tasks. The input token sequence can be a pack of multiple sentences. To distinguish the information from different sentences, a special token [SEP] is added between the tokens of two consecutive sentences. In addition, a learned embedding is added to every token to denote whether it belongs to sentence A or sentence B. For example, given a sentence pair $(s_i, s_j)$ where $s_i$ contains $|s_i|$ tokens and $s_j$ contains $|s_j|$ tokens, the BERT input sequence is formulated as follows:

$$X = (\text{[CLS]}, s_{i,1}, \dots, s_{i,|s_i|}, \text{[SEP]}, s_{j,1}, \dots, s_{j,|s_j|}, \text{[SEP]})$$

The input representation of a given token is the sum of three embeddings: the token embedding, the segment embedding, and the position embedding. The input representation is then fed forward into extra layers to perform a fine-tuning procedure. BERT can be employed in three kinds of tasks: sequence-level classification, span-level prediction, and token-level prediction.

BERT for Question Generation

BERT-QG
As an initial attempt, we adapt the BERT model for QG as follows. First, for a given context paragraph $C = [c_1, \dots, c_{|C|}]$ and an answer phrase $A = [a_1, \dots, a_{|A|}]$, the input sequence $X$ is aligned as

$$X = (\text{[CLS]}, C, \text{[SEP]}, A, \text{[SEP]})$$

Let BERT() be the BERT model. We first obtain the hidden representation $H \in \mathbb{R}^{|X| \times h}$ by $H = \mathrm{BERT}(X)$, where $|X|$ is the length of the input sequence and $h$ is the size of the hidden dimension. Then, $H$ is passed to a dense layer $W \in \mathbb{R}^{h \times |V|}$ followed by a softmax function as follows.

$$\Pr(w \mid x_i) = \mathrm{softmax}(H \cdot W)$$
The softmax is applied at each position along the sequence, yielding one distribution over the vocabulary per position. All the parameters are fine-tuned jointly to maximize the log-probability of the correct token $q_i$. The model architecture is illustrated in Figure 1. As shown in the figure, we align a given context paragraph and a given answer as the input sequence and feed the input sequence into the BERT model to generate a sequence of tokens as the generated question.
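The BERT-QG pipeline described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the hidden states are random stand-ins rather than the output of a real BERT encoder, the vocabulary and hidden size are toy values, and the helper names (`build_qg_input`, `project`) are hypothetical.

```python
import math
import random


def build_qg_input(context, answer):
    """Align context C and answer A as [CLS] C [SEP] A [SEP] with segment ids."""
    tokens = ["[CLS]"] + context + ["[SEP]"] + answer + ["[SEP]"]
    segments = [0] * (len(context) + 2) + [1] * (len(answer) + 1)
    return tokens, segments


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(H, W):
    """Apply the dense layer W (h x |V|) and a softmax to every row of H."""
    vocab_size = len(W[0])
    rows = []
    for hidden in H:
        logits = [sum(hidden[k] * W[k][v] for k in range(len(hidden)))
                  for v in range(vocab_size)]
        rows.append(softmax(logits))
    return rows


rng = random.Random(0)
vocab = ["who", "wrote", "hamlet", "?", "[SEP]"]  # toy vocabulary (assumption)
h = 4                                              # toy hidden size (assumption)

tokens, segments = build_qg_input(
    context=["shakespeare", "wrote", "hamlet"], answer=["shakespeare"])

# Stand-in for H = BERT(X): random vectors, NOT a real encoder.
H = [[rng.uniform(-1, 1) for _ in range(h)] for _ in tokens]
W = [[rng.uniform(-1, 1) for _ in vocab] for _ in range(h)]

probs = project(H, W)
# Every position is decoded at once, independently of the others.
predicted = [vocab[row.index(max(row))] for row in probs]
print(tokens)
print(predicted)
```

Because each output position is chosen independently in a single pass, no prediction can depend on the token chosen at another position.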

BERT-SQG
In text generation tasks, as suggested by (Sutskever et al., 2014), considering the previously decoded results has a significant impact. In BERT-SQG, we therefore take the previously decoded results into consideration when decoding a token. We adapt the BERT model for question generation as follows. First, for a given context paragraph $C = [c_1, \dots, c_{|C|}]$, an answer phrase $A = [a_1, \dots, a_{|A|}]$, and the previously decoded tokens $\hat{Q} = [\hat{q}_1, \dots, \hat{q}_i]$, the input sequence $X_i$ is formulated as

$$X_i = (\text{[CLS]}, C, \text{[SEP]}, A, \text{[SEP]}, \hat{q}_1, \dots, \hat{q}_i, \text{[MASK]})$$

Then, the input sequence $X_i$ is represented by the BERT embedding layers and fed forward through the BERT model. After that, we take the final hidden state of the last token [MASK] in the input sequence. We denote the final hidden vector of [MASK] as $h_{\text{[MASK]}} \in \mathbb{R}^{h}$. We adapt the BERT model by adding an affine layer $W_{SQG} \in \mathbb{R}^{h \times |V|}$ to the output of the [MASK] token. We compute the probabilities $\Pr(w \mid X_i) \in \mathbb{R}^{|V|}$ by a softmax function as follows.

$$\Pr(w \mid X_i) = \mathrm{softmax}(h_{\text{[MASK]}} \cdot W_{SQG})$$
Subsequently, the newly generated token is appended to the decoded tokens in $X_i$, and the question generation process is repeated with the new input sequence (as illustrated in Figure 2) until [SEP] is predicted. We report the generated tokens as the predicted question.
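The sequential procedure above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `toy_predict_mask` is a hypothetical scripted stand-in for "BERT plus the affine layer on [MASK]", the `max_len` cap is an added safety assumption, and the loop picks the most probable token greedily at each step rather than using beam search.

```python
def generate_question(context, answer, predict_mask, max_len=30):
    """Decode one token at a time until [SEP] is predicted."""
    decoded = []
    while len(decoded) < max_len:
        # X_i = [CLS] C [SEP] A [SEP] q_1 ... q_i [MASK]
        x_i = (["[CLS]"] + context + ["[SEP]"] + answer + ["[SEP]"]
               + decoded + ["[MASK]"])
        probs = predict_mask(x_i)          # dict: token -> probability
        token = max(probs, key=probs.get)  # greedy choice (assumption)
        if token == "[SEP]":
            break
        decoded.append(token)
    return decoded


def toy_predict_mask(x_i):
    """Hypothetical stand-in model that follows a fixed script."""
    script = ["who", "wrote", "hamlet", "?", "[SEP]"]
    # Number of tokens already decoded: those between the last [SEP] and [MASK].
    step = x_i[::-1].index("[SEP]") - 1
    target = script[min(step, len(script) - 1)]
    others = [tok for tok in script if tok != target]
    probs = {tok: 0.1 / len(others) for tok in others}
    probs[target] = 0.9
    return probs


question = generate_question(
    context=["shakespeare", "wrote", "hamlet"],
    answer=["shakespeare"],
    predict_mask=toy_predict_mask,
)
print(" ".join(question))
```

Each call to the model sees all previously decoded tokens, which is the difference from the single-pass decoding of BERT-QG.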

Datasets
The SQuAD dataset contains 536 Wikipedia articles and around 100K reading comprehension questions (and the corresponding answers) posed about the articles. The answers to the questions are text spans in the articles.
We follow the same data split settings as previous work on QG tasks (Du et al., 2017) (Zhao et al., 2018) so that we can directly compare with the state-of-the-art results. Table 1 summarizes some statistics for the compared datasets.
• SQuAD 73K: In this set, we follow the same setting as (Du et al., 2017); the accessible parts of the SQuAD training data are randomly divided into a training set (80%), a development set (10%), and a test set (10%). We report results on the 10% test set.
• SQuAD 81K: In this set, we follow the same setting as (Zhao et al., 2018); the accessible SQuAD development data set is divided into a development set (50%) and a test set (50%).
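A random split of the kind described in the two settings above could look like the following sketch. The function name `split_examples`, the fixed seed, and the integer stand-ins for examples are all assumptions for illustration; the sizes that result depend on how the original splits were actually drawn and will not reproduce the published splits.

```python
import random


def split_examples(examples, fractions, seed=0):
    """Shuffle examples and cut them into consecutive parts by fraction."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, fraction in enumerate(fractions):
        if i == len(fractions) - 1:
            end = len(shuffled)  # last part takes whatever remains
        else:
            end = start + round(fraction * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


examples = list(range(1000))  # stand-in for (context, answer, question) triples

# 80% / 10% / 10% split, as in the SQuAD 73K setting.
train, dev, test = split_examples(examples, [0.8, 0.1, 0.1])

# 50% / 50% split of a development set, as in the SQuAD 81K setting.
dev_half, test_half = split_examples(examples, [0.5, 0.5])

print(len(train), len(dev), len(test))
print(len(dev_half), len(test_half))
```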

Implementation Details
Table 1: Dataset statistics. SQuAD 73K is the setting of (Du et al., 2017), and SQuAD 81K is the setting of (Zhao et al., 2018).

             Train    Test     Dev
SQuAD 73K    73240    11877    10570
SQuAD 81K    81577    8964     8964

We use the PyTorch version of BERT to train our BERT-QG and BERT-SQG models. The pre-trained model is the officially provided BERT base model (12 layers, 768 hidden dimensions, and 12 attention heads) with a vocabulary of 30522 tokens. The dropout probability is set to 0.1 between transformer layers. The Adamax optimizer is applied during training, with an initial learning rate of 5e-5. The batch size for each update is set to 28. All our models are trained for 5 epochs on two TITAN RTX GPUs. After each epoch, we run the model on the development set and select the epoch with the highest accuracy as the model used for the final evaluation. In addition, our BERT-SQG model uses beam search for sequence decoding, with the beam size set to 3.
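To illustrate the beam search decoding mentioned for BERT-SQG, here is a minimal sketch with a beam size of 3. It is a generic textbook-style beam search, not the authors' code: `toy_step` and its probability table are hypothetical stand-ins for the model's next-token distribution, and no length normalization is applied.

```python
import math


def beam_search(step_log_probs, beam_size=3, end_token="[SEP]", max_len=30):
    """Keep the beam_size best partial sequences at every decoding step."""
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_prob in step_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens[:-1], score))  # drop the end token
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished beams if max_len is hit
    return max(finished, key=lambda c: c[1])[0]


# Hypothetical next-token probabilities, keyed by the tokens decoded so far.
TABLE = {
    (): {"what": 0.6, "who": 0.4},
    ("what",): {"is": 0.5, "was": 0.5},
    ("who",): {"wrote": 0.9, "is": 0.1},
}


def toy_step(tokens):
    """Return log-probabilities for the next token; default is to stop."""
    probs = TABLE.get(tuple(tokens), {"[SEP]": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


best_beam = beam_search(toy_step, beam_size=3)
best_greedy = beam_search(toy_step, beam_size=1)
print(best_beam, best_greedy)
```

In this toy table, greedy decoding (beam size 1) commits to the locally best first token, whereas a beam of 3 keeps the alternative alive long enough to find the sequence with the higher overall probability.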

Model Comparison
In this paper, we compare our models with the best-performing models in the literature (Du et al., 2017) (Zhao et al., 2018). The compared models in the experiment are:
• NQG-RC (Du et al., 2017): A seq2seq question generation model based on a bidirectional LSTM.
• PLQG (Zhao et al., 2018): A seq2seq network that contains a gated self-attention encoder and a maxout pointer decoder to enable the handling of long text input. The PLQG model is the state-of-the-art model for QG tasks.

Evaluation Results
Table 2 shows the comparison results using sentence-level context, and Table 3 shows the results on paragraph-level context. We compare the models using the standard metrics BLEU (Papineni et al., 2002) and ROUGE-L.
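For readers unfamiliar with the BLEU metric used in the comparison, the sketch below computes a simplified sentence-level BLEU 4 against a single reference, without smoothing. It is meant only to show how the score is formed (clipped n-gram precisions combined with a brevity penalty); published results are normally computed at the corpus level with a standard evaluation script, so this function is an assumption-laden illustration, not the evaluation code behind the reported numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with one reference and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)


reference = "who wrote the play hamlet ?".split()
exact = bleu(reference, reference)
partial = bleu("who wrote the play macbeth ?".split(), reference)
unrelated = bleu("where is the city ?".split(), reference)
print(exact, partial, unrelated)
```

Scores reported in the literature are usually this quantity multiplied by 100.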
We note the following findings about the results. First, as can be observed, BERT-QG offers poor performance. In fact, the performance of BERT-QG is far below the results of the other models. This result is expected, as BERT-QG generates the sentences without considering the previously decoded results. However, when the previously decoded results are taken into account (BERT-SQG), we effectively utilize the power of BERT and obtain state-of-the-art results compared with the existing RNN variants for QG. As shown in Table 2, BERT-SQG outperforms the existing best-performing model by 2% on both benchmark datasets.
Second, the results in Table 3 further show that BERT-SQG successfully processes paragraph-level contexts and further pushes the state of the art from 16.85 to 21.04 in terms of BLEU 4 score. Note that NQG-RC and PLQG both use the RNN architecture, and RNN-based models all struggle with long text input. We see that the BERT model, which is based on transformer blocks, effectively addresses the issue of processing long text. The results of our BERT-SQG model are consistent across the two datasets and achieve the best score at the paragraph level.

Conclusion
In this paper, we demonstrate that BERT can be adapted to question generation tasks. We concede that our BERT-SQG model is simple. However, we think this is a feature of BERT, as the power of BERT simplifies neural architecture design for specific tasks. Although our model is simple, it achieves state-of-the-art performance on both sentence-level and paragraph-level input and provides a strong baseline for future research.

Table 2: Comparison between our model and the published methods using sentence-level context

Table 3: Comparison between our model and the published methods using paragraph-level context