CopyBERT: A Unified Approach to Question Generation with Self-Attention

Contextualized word embeddings provide better initialization for neural networks that deal with various natural language understanding (NLU) tasks including Question Answering (QA) and, more recently, Question Generation (QG). Apart from providing meaningful word representations, pre-trained transformer models (Vaswani et al., 2017) such as BERT (Devlin et al., 2019) also provide self-attentions which encode syntactic information that can be probed for dependency parsing (Hewitt and Manning, 2019) and POS tagging (Coenen et al., 2019). In this paper, we show that the information from the self-attentions of BERT is useful for language modeling of questions conditioned on paragraph and answer phrases. To control the attention span, we use a semi-diagonal mask and utilize a shared model for encoding and decoding, unlike sequence-to-sequence models. We further employ a copy mechanism over self-attentions to achieve state-of-the-art results for Question Generation on SQuAD v1.1 (Rajpurkar et al., 2016).


Introduction
Automatic question generation (QG) is the task of generating meaningful questions from text. With more question answering (QA) datasets like SQuAD (Rajpurkar et al., 2016) having been released recently (Trischler et al., 2016; Choi et al., 2018; Reddy et al., 2019; Yang et al., 2018), there has been increased interest in QG, as these datasets can be used not only for creating QA models but also for QG models.
QG, similar to QA, gives an indication of a machine's ability to comprehend natural language text. Both QA and QG are used by conversational agents. A QG system can be used in the creation of artificial question answering datasets, which in turn helps QA (Duan et al., 2017). It can specifically be used in conversational agents for starting a conversation or drawing attention to specific information (Mostafazadeh et al., 2016). Yao et al. (2012) and Nouri et al. (2011) use QG to create and augment conversational characters. In a similar approach, Kuyten et al. (2012) create a virtual instructor to explain clinical documents.

Figure 1: Given the input sequence with answer phrase {a_i}, i = 1..A, and semi-diagonal mask M (§3.2), the model explicitly uses the H multi-headed self-attention matrices from the L transformer layers to create A ∈ R^{n×n×L×H}. This tensor, along with S ∈ R^{n×L×H} obtained from the BERT sequence output H ∈ R^{n×h}, is used to learn the copy probability p_c(q_i|·) (§3.3.2). Finally, a weighted combination p(q_i|·) is obtained with the simple generation probability p_g(q_i|·) (§3.4).

In this paper, we propose a QG model with the following contributions:

• We introduce a copy mechanism for BERT-based models with a unified encoder-decoder framework for question generation. We further extend this copy mechanism using self-attentions.
• Without losing performance, we improve the training speed of BERT-based language models by choosing predictions on output embeddings that are offset by one position.

Related Work
Most QG models that use neural networks rely on a sequence-to-sequence architecture in which a paragraph and an answer are encoded appropriately before decoding the question. Sun et al. (2018) use an answer-position-aware attention to enrich the encoded input representation. Recent work showed that learning to predict clue words based on answer words helps in creating a better QG system. With similar motivation, gated self-attention networks were used by Zhao et al. (2018) to fuse appropriate information from the paragraph before generating the question. More recently, the self-attentions of a transformer have been shown to perform answer-agnostic question generation (Scialom et al., 2019).
The pre-training task of masked language modeling for BERT (Devlin et al., 2019) and other such models (Joshi et al., 2019) makes them suitable for natural language generation tasks. Wang and Cho (2019) argue that BERT can be used as a generative model. However, only a few attempts have been made so far to use these pre-trained models for conditional language modeling. Dong et al. (2019) and Chan and Fan (2019) use a single BERT model for both encoding and decoding and achieve state-of-the-art results in QG. However, both use the [MASK] token as the input for predicting the word in place, which makes training slower as it warrants recurrent generation (Chan and Fan, 2019) or generation with random masking (Dong et al., 2019). Both models consider only the output representations of BERT for language modeling.
However, Jawahar et al. (2019) and Tenney et al. (2019) show that BERT learns different linguistic features at different layers. Also, Hewitt and Manning (2019) successfully probed for dependency trees from the self-attention matrices of BERT. We therefore hypothesize that BERT can implicitly encode the different aspects of the input for QG (Sun et al., 2018; Zhao et al., 2018) within the self-attentions across layers. As self-attention can learn soft alignments, it can be used explicitly for a copy mechanism (§3.3.2) and can yield better results (§4.3) than a model that only implicitly uses self-attentions for QG (§3.3.1). Similar to Dong et al. (2019), we also employ a shared architecture for unified encoding-decoding but make explicit use of self-attentions across layers, leading to similar or better results at a fraction of their training cost.

Model
In the sequence-to-sequence learning framework, separate encoder and decoder models are used. Such an application to BERT would lead to high computational complexity. To alleviate this, we use a shared model for encoding and decoding (Dong et al., 2019). This not only reduces the number of parameters but also allows cross-attention between source and target words in each layer of the transformer model. While such an architecture can be used in any conditional natural language generation task, here we apply it to QG.

Question Generation
For a sequence of paragraph tokens P = [p_1, p_2, ..., p_P], start and end positions of an answer phrase s_a = (a_s, a_e) in the paragraph, and question tokens Q = [q_1, q_2, ..., q_Q], with p_1 = bop, p_P = eop and q_Q = eoq representing begin of paragraph, end of paragraph and end of question respectively, the task of question generation is to maximize the likelihood of Q given P and s_a. To this end, with m such training examples, we maximize the following objective:

max Σ_{j=1}^{m} Σ_{i=1}^{Q} log p(q_i | q_{<i}, P, s_a)

where q_{<i} represents the previous question tokens [q_1, q_2, ..., q_{i−1}]. A fixed-length sequence of n tokens is created by concatenating P and Q, followed by pad tokens, into S = [P; Q]. Similar to Devlin et al. (2019), each input token is accompanied by a segment id to differentiate between the parts of the text. The answer tokens in the paragraph and the question tokens are given segment id 1 and the rest 0, as illustrated in Figure 1. We pass these as inputs to a pre-trained BERT-based model.
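The input construction described above can be sketched in a few lines. This is a minimal illustration; the function name, token strings and padding symbol are ours, not the paper's:

```python
def build_inputs(paragraph, question, answer_span, n):
    """Concatenate paragraph and question tokens into a fixed-length
    sequence S of length n, padded at the end, and build segment ids:
    1 for answer tokens inside the paragraph and for question tokens,
    0 for everything else (illustrative sketch)."""
    s = paragraph + question
    s = s + ["[PAD]"] * (n - len(s))
    a_s, a_e = answer_span  # inclusive start/end indices into the paragraph
    seg = [0] * n
    for k in range(a_s, a_e + 1):                 # answer phrase -> 1
        seg[k] = 1
    for k in range(len(paragraph), len(paragraph) + len(question)):
        seg[k] = 1                                # question tokens -> 1
    return s, seg
```

The resulting sequence and segment ids are what would be fed to the shared BERT encoder-decoder together with the semi-diagonal mask of §3.2.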

Semi-diagonal Masking
To control the information flow, we employ a semi-diagonal mask. A simple diagonal mask on the self-attentions of the transformer decoder ensures that each word only attends to the words seen thus far (Vaswani et al., 2017). Self-attentions of the encoder do not require such masking because the input words should inform each other while encoding. Since we use a unified encoder-decoder architecture, we ensure that each word in the paragraph attends to all other words in the paragraph but not to any of the words in the question, and that each word in the question attends only to previous words in the question in addition to all the words in the paragraph. This results in a semi-diagonal mask, which was also proposed by Dong et al. (2019) and is shown in Figure 1. Formally, from S in §3.1, we have I_p = [1, 2, ..., P] as the sequence of paragraph indices and I_q = [P+1, P+2, ..., P+Q] as the sequence of question indices with n = P + Q (ignoring the pad tokens). The semi-diagonal mask M ∈ R^{n×n}, added to the attention scores before the softmax, is defined as:

M_ij = 0 if (i ∈ I_p and j ∈ I_p) or (i ∈ I_q and (j ∈ I_p or P < j ≤ i)), and M_ij = −∞ otherwise.
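The masking rule above can be built directly from the paragraph and question lengths. A minimal sketch, where the large negative constant standing in for −∞ and the function name are our own choices:

```python
import numpy as np

NEG_INF = -1e9  # large negative value standing in for -inf before the softmax

def semi_diagonal_mask(P, Q):
    """Additive attention mask M of shape (n, n), n = P + Q.
    Paragraph positions (0..P-1) attend to all paragraph positions;
    question positions (P..P+Q-1) attend to the whole paragraph and
    to question positions up to and including themselves."""
    n = P + Q
    M = np.full((n, n), NEG_INF)
    M[:P, :P] = 0.0                      # paragraph -> paragraph
    M[P:, :P] = 0.0                      # question -> paragraph
    for i in range(P, n):                # question -> earlier question tokens
        M[i, P : i + 1] = 0.0
    return M
```

Adding this matrix to the raw attention scores drives the softmax weight of every disallowed position towards zero.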

Copy Mechanism
Pre-trained transformer models not only yield better contextual word embeddings but also give informative self-attentions (Hewitt and Manning, 2019; Reif et al., 2019). We explicitly incorporate these pre-trained self-attentions into our QG models. This also matches our motivation for using the copy mechanism (Gu et al., 2016) with BERT, as the self-attentions can be used to obtain attention probabilities over the input paragraph text, which are necessary for a copy mechanism. For the input sequence S with the semi-diagonal mask M ∈ R^{n×n} and segment ids D, we first encode with BERT(S, M, D) to obtain hidden representations of the sequence H = {h_i}, i = 1..n, with H ∈ R^{n×h}. We then define the copy probability p_c(y_i|·) := p_c(y_i | q_{<i}, P, s_a) as:

p_c(y_i|·) = Σ_{k : t_k = y_i} p_a(k | y_i)

where p_a(k|y_i) ∈ R is the attention probability of copying token t_k ∈ Y = {P} ∪ {y_j}, j = 1..i−1 (the set of all paragraph tokens and question predictions thus far) from input position k to question position i. The distribution p_a ∈ R^n is set to zero for tokens not appearing in Y, whereas we add the corresponding attention probabilities for tokens occurring multiple times. We summarize these per-position probabilities compactly in a matrix P_a ∈ R^{n×n}. We now define several methods to obtain P_a with different copy mechanisms.
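The fold from per-position attention mass into per-token copy mass (summing over repeated tokens, zeroing tokens outside the candidate set) can be sketched as follows; all names here are illustrative, not from the paper's code:

```python
def copy_distribution(p_a, tokens, candidates):
    """Turn per-position attention probabilities p_a (length n, aligned
    with `tokens`) into a per-token copy distribution: positions holding
    the same token are summed; tokens outside `candidates` (the paragraph
    tokens plus predictions so far) receive zero probability."""
    p_c = {}
    for k, tok in enumerate(tokens):
        if tok in candidates:
            p_c[tok] = p_c.get(tok, 0.0) + p_a[k]
    return p_c
```
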

Normal Copy
First, we employ a simpler way to obtain attention probabilities, called normal copy: P_a = softmax(H W_n H^T) ∈ R^{n×n}, where W_n ∈ R^{h×h} is a parameter matrix.
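A minimal sketch of this bilinear scoring, under the assumption that the softmax is taken row-wise and that the semi-diagonal mask is added to the scores before normalizing (the function name is ours):

```python
import numpy as np

def normal_copy(H, W_n, M):
    """Normal-copy attention: P_a = row-wise softmax(H W_n H^T + M).
    H: (n, h) hidden states, W_n: (h, h) parameters, M: (n, n) additive
    mask with 0 for allowed and a large negative value for disallowed."""
    scores = H @ W_n @ H.T + M
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)
```
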

Self-Copy
In a transformer architecture (Vaswani et al., 2017), if there are L layers and H attention heads at each layer, there are M = L × H self-attention matrices of size n × n. For example, in the BERT-Large model (Devlin et al., 2019), there are 24 × 16 = 384 such matrices. Each of these self-attention matrices carries unique information. In this copy mechanism, called self-copy, we obtain P_a as a weighted average of all these self-attentions.
At each time step, we obtain a probability score for each of the M self-attention matrices in A ∈ R^{n×n×M}, signifying their corresponding importance. Given a parameter matrix W_a ∈ R^{h×M}, we compute S = H W_a ∈ R^{n×M} and obtain:

P'_{a,i} = softmax(S'_i) A^T_i ∈ R^{1×n}

where S' ∈ R^{n×1×M} is the 3D tensor obtained by adding dimension 2 to S, A^T ∈ R^{n×M×n} is the transposed tensor of the 3D self-attention matrices A, and S'_i ∈ R^{1×M} and A^T_i ∈ R^{M×n} are the i-th slices of the tensors S' and A^T. The final attention probabilities P_a ∈ R^{n×n} are obtained by removing dimension 2 from P'_a. Thus, the final attention probabilities are a weighted average over all self-attention matrices.
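The per-position weighting of the M self-attention matrices can be sketched with a single einsum; this is a minimal illustration under the shapes described above, with illustrative names and a plain row-wise softmax:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_copy(H, A, W_a):
    """Self-copy: for each position i, weight the M = L*H self-attention
    matrices by softmax(h_i W_a) and combine their i-th rows.
    H: (n, h) hidden states, A: (n, n, M) stacked attention matrices,
    W_a: (h, M) parameters. Returns P_a of shape (n, n)."""
    S = softmax(H @ W_a)                 # (n, M): per-position head weights
    # P_a[i, j] = sum_m S[i, m] * A[i, j, m]  (convex combination of heads)
    return np.einsum("im,ijm->ij", S, A)
```

Because each head's attention row is a probability distribution and the head weights sum to one, every row of the resulting P_a is again a valid distribution over input positions.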

Two-Hop Self-Copy
A self-attention matrix, as mentioned above, can be considered an adjacency matrix of a graph whose nodes are words. The probability scores represent soft edges between words. A self-attention matrix can thus be considered as 1-hop attention. We would like to explore 2-hop attentions, i.e., we look for neighbouring nodes of neighbouring nodes. Note that if P_a is an adjacency matrix, the nodes that are connected in two hops are given by P_a^2. Both 1-hop and 2-hop attentions can be useful for the copying mechanism. Let P_1-hop = P_a and P_2-hop = P'^2_a, where P_a and P'_a are defined as in §3.3.2 with different parameters; two-hop self-copy is then defined as a combination of P_1-hop and P_2-hop.
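The graph view of two-hop attention is just matrix squaring. A short sketch of that step alone (the function name is ours):

```python
import numpy as np

def two_hop(P_1):
    """Squaring a row-stochastic attention matrix composes two hops:
    (P^2)[i, j] = sum_k P[i, k] * P[k, j], i.e. the total probability of
    reaching word j from word i through any intermediate word k."""
    return P_1 @ P_1
```

Since the product of two row-stochastic matrices is row-stochastic, the 2-hop matrix is still a valid attention distribution per position.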

Copy-Generate Probability
Once the copy probability p_c is obtained, the combined probability is a weighted combination with the generation probability p_g:

p(q_i|·) = c_i · p_g(q_i|·) + (1 − c_i) · p_c(q_i|·)

where c_i is the likelihood of generating a token from the vocabulary rather than copying a token from the source and predicted tokens at position i:

c_i = σ(w^T h_{q_{i−1}})

with h_{q_{i−1}} ∈ R^{h×1} the hidden representation of the question token at position i−1, w ∈ R^{h×1} a parameter vector and σ the sigmoid non-linearity. The generation probability is given by:

p_g(q_i|·) = softmax(h_{q_{i−1}}^T V)

where V ∈ R^{h×|V|} is a parameter matrix over the input vocabulary of size |V|.
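The gating step can be sketched as follows. This is an assumption-laden illustration: the function names are ours, and the convention that the gate weights the generate distribution (with 1 − c weighting the copy distribution) is our reading, not necessarily the authors' exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mix_copy_generate(h_prev, w, p_g, p_c):
    """Blend generate and copy distributions with a scalar gate.
    h_prev: (h,) hidden state of the previous question token,
    w: (h,) gate parameters, p_g / p_c: distributions over the same
    vocabulary. Gate c weights p_g; (1 - c) weights p_c (assumed)."""
    c = sigmoid(w @ h_prev)
    return c * p_g + (1.0 - c) * p_c
```

Since the mixture of two distributions with weights c and 1 − c is itself a distribution, no renormalization is needed.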

Experiments
We apply the different variations of the CopyBERT model described in the previous section to SQuAD v1.1 (Rajpurkar et al., 2016). For our experiments, we follow the training, validation and test split used in Du et al. (2017).

Training Setup
For training, we used a batch size of 6 and a learning rate of 3e-5 with early stopping. The loss reaches its minimum between 2 and 3 epochs. We also trained with a batch size of 24 using gradient accumulation and found it gave similar results after the same number of optimization steps. We fixed the maximum sequence length at 384 and, when the sequence length was exceeded, chose the part (document stride) of the paragraph that contained the answer phrase. We decoded using beam search with a beam width of 5, stopping at the generated token eoq.

Evaluation Metrics and Models
For evaluating our models, we report the standard metrics of BLEU4, METEOR and ROUGE-L. As baselines, we take two non-BERT state-of-the-art models (Du and Cardie, 2018; Zhang and Bansal, 2019) and the two BERT-based QG models (Dong et al., 2019; Chan and Fan, 2019). We experimented with 4 settings: one without any copy mechanism (No Copy), one using normal copy (Normal Copy; §3.3.1), one using self-copy (Self-Copy; §3.3.2) and finally two-hop self-copy (Two-Hop Self-Copy; §3.3.3).

Table 1: Results on the test split of Du et al. (2017). BERT refers to the BERT-Large (cased) model (Devlin et al., 2019).

Model                                  BLEU4   METEOR   ROUGE-L
CorefNQG (Du and Cardie, 2018)         15.16   19.12    -
SemdriftQG (Zhang and Bansal, 2019)    18.37   22.65    46.68
Recurrent-BERT (Chan and Fan, 2019)    20.33   23.88    48.23
UniLM (Dong et al., 2019)              22.12   -        -

We see a considerable gain in BLEU4 by using Self-Copy (+1.8 over No Copy and +0.87 over Normal Copy), supporting the hypothesis of using multi-layered, multi-headed self-attentions for the copy mechanism. UniLM, which is pre-trained from a BERT-Large checkpoint with three sequence generation pre-training tasks (Dong et al., 2019) and further fine-tuned on the SQuAD dataset for 10 epochs, achieves a 22.12 BLEU4 score. We achieve comparable performance by only using the self-copy mechanism. Figure 2 shows attention patterns of self-copy in question generation.

Results
To further validate the self-copy mechanism, we also experimented with initializing from a variant of BERT called SpanBERT (Joshi et al., 2019), which is pre-trained to predict longer masked spans to encourage better entity masking and has already been shown to improve QA results compared to BERT (Joshi et al., 2019). Although Two-Hop Self-Copy did not improve upon Self-Copy, these attentions can serve as explainability for QG, offering a good intuition behind copying different words, which we plan to explore in future work.

Figure 2: CopyBERT attention visualizations of the copy probability on SQuAD examples. Top: attention-focused paragraph tokens on the y-axis and generated question tokens on the x-axis, where we see that the learnt copy probabilities consistently extract words from the paragraph context. Bottom: long-span attention pattern over the paragraph words (x-axis), where the copy probability looks for question words (y-axis) even when most of the question words are present in the local context around the answer phrase.

Training Speed
CopyBERT trains significantly faster than UniLM. UniLM takes around 10 epochs of fine-tuning on the QG task to reach its best performance. This is because the model uses the input token [MASK] to predict a target question word and, as a result, can only train on some percentage of randomly chosen words to ensure that the probability is conditioned on previous question words. CopyBERT, in contrast, takes only 2 to 3 epochs to achieve its best performance. It took CopyBERT around 14 hours on a single GPU with 12 GB of memory to train for 3 epochs, whereas UniLM took around 45 hours on the same hardware to train for 10 epochs to achieve similar results to those reported in Dong et al. (2019). We expect Recurrent-BERT (Chan and Fan, 2019) to take even longer to train due to its sequential nature.

Conclusion
We showed that a unified encoder-decoder transformer model initialized with contextualized word embeddings and further extended with a copy mechanism can already give state-of-the-art results, without additional pre-training on generation tasks (Dong et al., 2019). We also sped up the training of QG models that use BERT by choosing predictions on output embeddings that are offset by one position (§3.3). This work shows the significance of explicitly using the self-attentions of BERT-like models. These models can further be used in other tasks such as abstractive summarization and machine translation to see qualitative improvements.