Learning to Ask More: Semi-Autoregressive Sequential Question Generation under Dual-Graph Interaction

Traditional Question Generation (TQG) aims to generate a question given an input passage and an answer. When there is a sequence of answers, we can perform Sequential Question Generation (SQG) to produce a series of interconnected questions. Because of the frequent information omission and coreference between questions, SQG is rather challenging. Prior works regarded SQG as a dialog generation task and produced each question recurrently. However, they suffered from error cascades and could capture only limited context dependencies. To this end, we generate questions in a semi-autoregressive way. Our model divides questions into different groups and generates each group in parallel. During this process, it builds two graphs focusing on information from passages and answers, respectively, and performs dual-graph interaction to gather information for generation. Besides, we design an answer-aware attention mechanism and a coarse-to-fine generation scenario. Experiments on our new dataset containing 81.9K questions show that our model substantially outperforms prior works.


Introduction
Question Generation (QG) aims to teach machines to ask human-like questions from a range of inputs such as natural language texts (Du et al., 2017), images (Mostafazadeh et al., 2016) and knowledge bases (Serban et al., 2016). In recent years, QG has received increasing attention due to its wide applications. Asking questions in dialog systems can enhance the interactiveness and persistence of human-machine interactions. QG benefits Question Answering (QA) models through data augmentation and joint learning. It also plays an important role in education (Heilman and Smith, 2010) and clinical (Weizenbaum et al., 1966) systems.
Traditional Question Generation (TQG) is defined as the reverse task of QA: a passage and an answer (often a certain span from the passage) are provided as inputs, and the output is a question grounded in the input passage and targeting the given answer. When there is a sequence of answers, we can perform Sequential Question Generation (SQG) to produce a series of interconnected questions. Table 1 shows an example comparing the two tasks. Intuitively, questions in SQG are much more concise, and together with the given answers they can be regarded as QA-style conversations. Since it is more natural for human beings to test knowledge or seek information through coherent questions (Reddy et al., 2019), SQG has wide applications, e.g., enabling virtual assistants to ask questions based on previous discussions to provide better user experiences.

SQG is challenging in two respects. First, information omissions between questions lead to complex context dependencies. Second, coreferences occur frequently between questions.

Prior works regarded SQG as a dialog generation task (namely conversational QG) where questions are generated autoregressively (recurrently), i.e., a new question is produced based on previous outputs. Although many powerful dialog generation models can be adopted to address the challenges mentioned above, there are two major obstacles. First, these models suffer from error cascades: empirical results reveal that later generated questions tend to become shorter and of lower quality, in particular becoming more irrelevant to the given answers, e.g., "Why?", "What else?". Second, models that recurrently generate each question struggle to capture complex context dependencies, e.g., long-distance coreference. Essentially, SQG is rather different from dialog generation since all answers are given in advance, and they act as strict semantic constraints during text generation.
To deal with these problems, we perform SQG in a semi-autoregressive way. More specifically, we divide the target questions into different groups (questions in the same group are closely related) and generate all groups in parallel. In particular, our scenario becomes non-autoregressive if each group contains only a single question. Since we eliminate the recurrent dependencies between questions in different groups, the generation process is much faster, and our model can better handle the problems caused by error cascades. To gather information for the generation process, we perform dual-graph interaction, where a passage-info graph and an answer-info graph are constructed and iteratively updated with each other. The passage-info graph is used to better capture context dependencies, and the answer-info graph is used to make the generated questions more relevant to the given answers with the help of our answer-aware attention mechanism. Besides, a coarse-to-fine text generation scenario is adopted for coreference resolution between questions.
Prior works performed SQG on CoQA (Reddy et al., 2019), a high-quality dataset for conversational QA. As will be further illustrated, a number of examples in CoQA are not suitable for SQG. Some researchers (Gao et al., 2019) directly discarded these data, but the remaining questions may become incoherent, e.g., the antecedents of many pronouns are unclear. To this end, we build a new dataset from CoQA containing 81.9K relabeled questions. In summary, the main contributions of our work are:

• We build a new dataset containing 7.2K passages and 81.9K questions from CoQA. To the best of our knowledge, it is the first dataset specially built for SQG.
• We perform semi-autoregressive SQG under dual-graph interaction. This is the first time that SQG is not regarded as a dialog generation task. We also propose an answer-aware attention mechanism and a coarse-to-fine generation scenario for better performance.
• We use extensive experiments to show that our model outperforms previous works by a substantial margin. Further analysis illustrates the impact of different components.
Related Work

Traditional Question Generation

TQG was traditionally tackled by rule-based methods (Lindberg et al., 2013; Mazidi and Nielsen, 2014; Hussein et al., 2014; Labutov et al., 2015), e.g., filling handcrafted templates under certain transformation rules. With the rise of data-driven learning approaches, neural networks (NN) have gradually taken the mainstream. Du et al. (2017) pioneered NN-based QG by adopting the Seq2seq architecture (Sutskever et al., 2014). Many ideas have since been proposed to make it more powerful, including answer position features, specialized pointer mechanisms, self-attention (Scialom et al., 2019), answer separation (Kim et al., 2019), etc. In addition, enhancing the Seq2seq model into more complicated structures using variational inference, adversarial training and reinforcement learning (Kumar et al., 2019) has also gained much attention. There are also works performing TQG under certain constraints, e.g., controlling the topic (Hu et al., 2018) and difficulty (Gao et al., 2018) of questions. Besides, combining QG with QA (Wang et al., 2017) has also been the focus of many researchers.

Sequential Question Generation
As human beings tend to use coherent questions for knowledge testing or information seeking, SQG plays an important role in many applications. Prior works regarded SQG as a dialog generation task (namely conversational QG). Pan et al. (2019) pre-trained a model performing dialog generation, and then fine-tuned its parameters by reinforcement learning to make the generated questions relevant to the given answers. Gao et al. (2019) iteratively generated questions from previous outputs and leveraged off-the-shelf coreference resolution models to introduce a coreference loss; in addition, extra human annotations were performed on sentences from the input passages for conversation flow modeling. Since SQG is essentially different from dialog generation, we discard the dialog view and propose the first semi-autoregressive SQG model. Compared with the additional human annotation used in Gao et al. (2019), our dual-graph interaction deals with context dependencies automatically. Besides, our answer-aware attention mechanism is much simpler than the fine-tuning process in Pan et al. (2019) for making outputs more answer-relevant.

Dataset
As the reverse task of QA, QG is often performed on existing QA datasets, e.g., SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2016), etc. However, questions are independent in most QA datasets, making TQG the only choice. In recent years, the appearance of large-scale conversational QA datasets like CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018) has made it possible to train data-driven SQG models, and the CoQA dataset was widely adopted by prior works. Since the test set of CoQA is not released to the public, its training set (7.2K passages with 108.6K questions) was split into new training and validation sets, and its validation set (0.5K passages with 8.0K questions) was used as the new test set.
Different from traditional QA datasets where the answers are certain spans from the given passages, answers in CoQA are free-form text with corresponding evidence highlighted in the passage. This poses a serious problem for QG. As an example, consider the yes/no questions, which account for 19.8% of all questions. Given the answer "yes" and the corresponding evidence "...the group first met on July 5, 1967 on the campus of the Ohio state university...", there are many potential outputs, e.g., "Did the group first meet in July?", "Did the group first meet at Ohio State?". When considering the context formed by previous questions, the set of potential outputs grows even larger (the original question in CoQA is "Was it founded the same year?"). When there are too many potential outputs with significantly different semantic meanings, training a converged QG model becomes extremely difficult. For this reason, Gao et al. (2019) directly discarded questions that cannot be answered by spans from the passages. However, the remaining questions can become incoherent, e.g., the antecedents of many pronouns become unclear.
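The span criterion used for this filtering can be sketched as a token-level containment check. The function name and whitespace tokenization below are our own simplification, not code from any released implementation:

```python
# Hypothetical sketch: keep a QA-pair only if its answer is a contiguous
# token span of the input passage (the criterion used to filter CoQA).
def is_span_answer(passage: str, answer: str) -> bool:
    """Token-level containment check (case-insensitive)."""
    p_toks = passage.lower().split()
    a_toks = answer.lower().split()
    if not a_toks:
        return False
    return any(p_toks[i:i + len(a_toks)] == a_toks
               for i in range(len(p_toks) - len(a_toks) + 1))

passage = "The group first met on July 5, 1967 on the campus."
assert is_span_answer(passage, "July 5, 1967")
assert not is_span_answer(passage, "yes")
```

A real implementation would use the same tokenizer as the dataset and match at character offsets, but the idea is the same: "yes" never appears in the evidence, so the yes/no question is discarded.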
To this end, we build a new dataset from CoQA by preserving all 7.7K passages and rewriting all questions and answers. More specifically, we first discarded questions that are unsuitable for SQG. To do so, three annotators were hired to vote for the preservation/deletion of each question. A question is preserved if and only if it can be answered by a certain span from the input passage. As a result, most deleted questions were yes/no questions and unanswerable questions. The kappa score between the results given by different annotators was 0.83, indicating strong inter-annotator agreement. For the remaining QA-pairs, we preserved their original order and replaced all answers by spans from the input passages. After that, we rewrote all questions to make them coherent. To avoid over-editing, annotators were asked to modify as little as possible. It turned out that in most cases, they only needed to deal with coreference, since the antecedents of many pronouns no longer existed. To further guarantee the annotation quality, we hired a project manager who daily examined 10% of the annotations from each annotator and provided feedback. An annotation batch was considered valid only when the accuracy of the examined results surpassed 95%. The annotation process took 2 months, and we finally obtained a dataset containing 7.7K passages with 81.9K QA-pairs.

Model

In this section, we formalize the SQG task and introduce our model in detail. As shown in Figure 1, the model first builds a passage-info graph and an answer-info graph with its passage-info encoder and answer-info encoder respectively. After that, it performs dual-graph interaction to get representations for the decoder. Finally, different groups of questions are generated in parallel under a coarse-to-fine scenario. Both the encoders and the decoder take the form of the Transformer architecture (Vaswani et al., 2017).

Problem Formalization
In SQG, the input is a passage composed of n sentences P = {S_1, ..., S_n} and a sequence of l answers {A_1, ..., A_l}; the output is a sequence of questions {Q_1, ..., Q_l}, where each Q_i can be answered by A_i according to the input passage P and the previous QA-pairs.
As mentioned above, we perform SQG in a semi-autoregressive way, i.e., the target questions are divided into different groups. Ideally, questions in the same group are expected to be closely related, while questions in different groups should be as independent as possible. Our model takes a simple but effective unsupervised question clustering approach. The intuition is: if two answers come from the same sentence, the two corresponding questions are likely to be closely related. More specifically, if the k-th sentence S_k contains p answers from {A_1, ..., A_l}, we cluster them into an answer-group G^ans_k = {A_{j_1}, A_{j_2}, ..., A_{j_p}}. By replacing each answer in G^ans_k with its corresponding question, we get a question-group G^ques_k = {Q_{j_1}, Q_{j_2}, ..., Q_{j_p}}, and we further define a corresponding target-output T_k as "Q_{j_1} [sep] Q_{j_2} [sep] ... [sep] Q_{j_p}", where [sep] is a special token. In Table 1, there are four target-outputs T_1, T_2, T_4, T_5 (there is no T_3 since the third sentence in Table 1 does not contain any answer); T_2 is "What was he doing there? [sep] On what? [sep] ... [sep] What was Tim doing?", corresponding with the second sentence, and T_5 is "What did he say?", corresponding with the last sentence. Supposing there are m answer- and question-groups, our model generates all m target-outputs in parallel, i.e., all questions are generated in a semi-autoregressive way.
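The clustering step above can be sketched as follows. The function and its inputs are illustrative simplifications: in practice the sentence index of each answer would come from its span position, not a precomputed list:

```python
# Sketch of the unsupervised clustering step: answers whose spans fall in
# the same sentence form one group; the questions of a group are joined
# with the "[sep]" token into a single target-output.
def group_by_sentence(answer_sentence_ids, questions):
    """answer_sentence_ids[i] = index of the sentence containing answer i."""
    groups = {}
    for q, sid in zip(questions, answer_sentence_ids):
        groups.setdefault(sid, []).append(q)
    # One target-output per sentence that contains at least one answer.
    return {sid: " [sep] ".join(qs) for sid, qs in sorted(groups.items())}

targets = group_by_sentence(
    [1, 2, 2, 5],
    ["Who was at the park?", "What was he doing?", "On what?", "What did he say?"])
# Sentence 2 yields "What was he doing? [sep] On what?"
```

Note that sentences without answers (e.g., sentence 3 in Table 1) simply produce no target-output, matching the missing T_3 in the running example.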

Passage-Info Encoder
As shown in Figure 1, our passage-info encoder maps the input sentences {S_i} to representations {s_i}, where each s_i ∈ R^{2d_s}. We regard each sentence as a sequence of words and replace each word by its pre-trained word embedding (Mikolov et al., 2013), a dense vector. The sequence of word embeddings is then sent to a Transformer-encoder that outputs a corresponding sequence of vectors. By averaging these vectors, we get the local representation s^local_i ∈ R^{d_s} of S_i. After obtaining the local representations of all sentences in passage P, another Transformer-encoder is adopted to map the sequence {s^local_i} to a sequence {s^global_i}, where s^global_i ∈ R^{d_s} is called the global representation for S_i. In other words, the passage-info encoder takes a hierarchical structure. We expect the local and global representations to capture intra- and inter-sentence context dependencies respectively, and the final representation for S_i is s_i = [s^local_i; s^global_i].

(Figure 2: Illustration of answer embeddings and an answer-attention head for the fourth sentence in Table 1.)
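The hierarchical structure can be sketched as below, with mean-pooling standing in for both Transformer-encoders (a deliberate simplification; the real encoders are learned):

```python
import numpy as np

# Minimal sketch of the hierarchical passage-info encoder. Mean-pooling
# stands in for the two Transformer-encoders, which is an assumption made
# purely to keep the sketch short.
def encode_passage(word_vecs_per_sent):
    # Local representation: pool over the words of each sentence.
    local = np.stack([w.mean(axis=0) for w in word_vecs_per_sent])   # (n, d_s)
    # Global representation: a toy "encoder" over the sentence sequence;
    # here every sentence attends uniformly to the whole passage.
    global_ = np.repeat(local.mean(axis=0, keepdims=True), len(local), axis=0)
    # Final representation s_i = [s_i^local; s_i^global] in R^{2 d_s}.
    return np.concatenate([local, global_], axis=1)                  # (n, 2*d_s)
```

The point of the sketch is the shape bookkeeping: each sentence ends up with a 2d_s-dimensional vector whose first half is intra-sentence and whose second half is inter-sentence information.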

Answer-Info Encoder
As described above, the input answers are split into m answer-groups. For the answer-group G^ans_k corresponding with the k-th sentence of the input passage, we define {G^ans_k, S_k} as a "rationale" R_k, and further obtain its representation r_k ∈ R^{2d_r} with our answer-info encoder, which is based on a Transformer-encoder taking the sentence S_k as its input.
To further incorporate information from G^ans_k, two more components are added to the answer-info encoder, as shown in Figure 2. First, we adopt answer-tag features. For each word w_i in sentence S_k, the embedding layer computes [x^w_i; x^a_i] ∈ R^{d_r} as its final embedding, where x^w_i is the pre-trained word embedding and x^a_i contains the answer-tag features. More specifically, we give w_i a label from {O, B, I} according to whether it is "outside", "the beginning of", or "inside of" any answer from G^ans_k, and use a vector corresponding with this label as x^a_i. Second, we design the answer-aware attention mechanism. In the multi-head attention layer, there are not only l_h vanilla "self-attention heads", but also l_a "answer-aware heads" for each answer in G^ans_k. In an answer-aware head corresponding with answer A, words not belonging to A are masked out during attention. The output of the Transformer-encoder is a sequence of vectors H^enc_k (each in R^{d_r}) corresponding with the input word sequence from S_k.
After getting H^enc_k, we further send this sequence of vectors to a bi-directional GRU network (Chung et al., 2014) and take its last hidden state as the final rationale embedding r_k ∈ R^{2d_r}.
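The two answer-specific components can be sketched as follows. Spans and sizes are toy values, and the mask shown is per-position rather than the full attention matrix:

```python
import numpy as np

# Sketch of the answer-tag features and the answer-aware attention mask
# (toy example; real spans come from answer positions in the sentence).
def bio_tags(n_words, answer_spans):
    """Label each word O/B/I w.r.t. the answers of this sentence's group."""
    tags = ["O"] * n_words
    for start, end in answer_spans:          # [start, end) token spans
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def answer_aware_mask(n_words, span):
    """Key mask for one answer-aware head: positions outside the answer
    span are masked out (True = masked)."""
    mask = np.ones(n_words, dtype=bool)
    mask[span[0]:span[1]] = False
    return mask
```

Each of the l_a answer-aware heads would receive the mask of its own answer, so its attention distribution is supported only on the answer tokens.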
We further build a passage-info graph V and an answer-info graph U based on these representations. For the rationale corresponding with the k-th sentence of the input passage, we add a node u_k to U and a node v_k to V. For the example in Table 1, U is composed of {u_1, u_2, u_4, u_5} and V is composed of {v_1, v_2, v_4, v_5}, as shown in Figure 1. The initial representation for u_k is computed by:

u_k^(0) = W_u [e_k; r_k] + b_u    (1)

where r_k ∈ R^{2d_r} is the rationale representation, e_k ∈ R^{d_e} is the embedding of index k, and W_u ∈ R^{(d_e+2d_r)×d_g}, b_u ∈ R^{d_g} are trainable parameters. The initial representation for v_k is computed analogously:

v_k^(0) = W_v [e_k; s_k] + b_v

where s_k ∈ R^{2d_s} is the sentence representation and W_v, b_v are trainable parameters. After adding these nodes, there are m nodes in U and V respectively. For u_i, u_j ∈ U corresponding with the i-th and j-th input sentences respectively, we add an edge between them if |i − j| < δ (δ is a hyper-parameter). We add edges into V in the same way, so the two graphs are isomorphic.
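The graph construction rule can be sketched as:

```python
# Sketch of building the two isomorphic graphs: one node per sentence that
# contains answers, and an edge between nodes i and j iff |i - j| < delta.
def build_edges(sentence_ids, delta=3):
    """sentence_ids: sorted indices of the sentences that contain answers."""
    return [(i, j) for i in sentence_ids for j in sentence_ids
            if i < j and j - i < delta]

# For the running example with answer-bearing sentences 1, 2, 4, 5:
edges = build_edges([1, 2, 4, 5], delta=3)
# -> [(1, 2), (2, 4), (4, 5)]
```

Since both graphs share the same node index set and the same rule, a single edge list suffices for U and V.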

Dual-Graph Interaction
In our answer-info graph U, node representations contain information focused on the input answers. In the passage-info graph V, node representations capture inter- and intra-sentence context dependencies. As mentioned above, a good question should be answer-relevant while capturing complex context dependencies, so we should combine the information in both U and V. Our dual-graph interaction is a process where U and V iteratively update their node representations with each other. At time step t, representations u_i^(t-1) and v_i^(t-1) are updated into u_i^(t) and v_i^(t) under three steps.

First, we introduce the information transfer step. Taking U as an example, each u_i^(t-1) receives a message a_i^(t) from its neighbors (two nodes are neighbors if there is an edge between them) by:

a_i^(t) = Σ_{u_j ∈ N(u_i)} (W_{ij} u_j^(t-1) + b_{ij})

where N(u_i) is composed of all neighbors of node u_i, and W_{ij} ∈ R^{d_g×d_g}, b_{ij} ∈ R^{d_g} are parameters controlling the information transfer. For pairs (u_i, u_j) and (u_{i'}, u_{j'}) with |i − j| = |i' − j'|, we use the same W and b. In other words, we can first create a sequence of matrices {W_1, W_2, ...} ⊂ R^{d_g×d_g} and vectors {b_1, b_2, ...} ⊂ R^{d_g}, and then use |i − j| as the index to retrieve the corresponding W_{ij}, b_{ij}. For graph V, we similarly compute messages ã_i^(t).

In the second step, we compute multiple gates. For each u_i^(t-1) in U, we compute an "update gate" y_i^(t) and a "reset gate" z_i^(t) by:

y_i^(t) = σ(W_y [u_i^(t-1); a_i^(t)]),  z_i^(t) = σ(W_z [u_i^(t-1); a_i^(t)])

where W_y, W_z ∈ R^{2d_g×d_g} are parameters. Similarly, for each v_i^(t-1) in V we compute gates ỹ_i^(t) and z̃_i^(t).

Finally, we perform the information interaction, where each graph updates its node representations under the control of the gates computed by the other graph. More specifically, the node representations are updated in a GRU-like fashion:

u_i^(t) = (1 − ỹ_i^(t)) ⊙ u_i^(t-1) + ỹ_i^(t) ⊙ tanh(W_a [z̃_i^(t) ⊙ u_i^(t-1); a_i^(t)])

v_i^(t) = (1 − y_i^(t)) ⊙ v_i^(t-1) + y_i^(t) ⊙ tanh(W_b [z_i^(t) ⊙ v_i^(t-1); ã_i^(t)])

Using gates computed by the other graph to update the node representations of each graph lets the information in the input passage and the answers interact more frequently, both of which act as strong constraints on the output questions.
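One interaction step might be sketched as below. Note that this reflects a GRU-style reading of the update equations, with toy dense message passing (a shared adjacency matrix instead of distance-indexed weight matrices), so it is a simplification rather than the exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy sketch of one dual-graph interaction step. u, v: (m, d_g) node
# representations; adj: (m, m) adjacency matrix; Wy, Wz, Wa, Wb: (2*d_g, d_g).
def interact(u, v, adj, Wy, Wz, Wa, Wb):
    # 1) Information transfer: aggregate neighbor messages in each graph.
    a = adj @ u                      # messages in the answer-info graph
    b = adj @ v                      # messages in the passage-info graph
    # 2) Gates: update gate y and reset gate z from each graph's own state.
    y_u = sigmoid(np.concatenate([u, a], axis=1) @ Wy)
    z_u = sigmoid(np.concatenate([u, a], axis=1) @ Wz)
    y_v = sigmoid(np.concatenate([v, b], axis=1) @ Wy)
    z_v = sigmoid(np.concatenate([v, b], axis=1) @ Wz)
    # 3) Interaction: each graph is updated under the OTHER graph's gates.
    u_new = (1 - y_v) * u + y_v * np.tanh(np.concatenate([z_v * u, a], axis=1) @ Wa)
    v_new = (1 - y_u) * v + y_u * np.tanh(np.concatenate([z_u * v, b], axis=1) @ Wb)
    return u_new, v_new
```

Running this function T times corresponds to the T iterations of the interaction process.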
By iteratively performing the three steps T times, we get the final representations u_i^(T) and v_i^(T) for all nodes.

Decoder
For the k-th input sentence S_k containing certain answers, our decoder generates the corresponding target-output T_k. As mentioned above, the generation processes of all target-outputs are independent. The decoder is based on the Transformer-decoder, containing a (masked) multi-head self-attention layer, a multi-head encoder-attention layer, a feed-forward projection layer and the softmax layer. To compute keys and values for the multi-head encoder-attention layer, it leverages the outputs of our answer-info encoder, i.e., it uses the H^enc_k described above to generate the T_k corresponding with the k-th sentence.
To generate coherent questions, we need to capture the context dependencies between the input answers and passages. To this end, both u_k^(T) and v_k^(T), which come from the dual-graph interaction process, are used as additional inputs for generating T_k. First, they are concatenated with the output of each head from both the (masked) multi-head self-attention layer and the multi-head encoder-attention layer before being sent to the next layer. Second, they are concatenated with the inputs of the feed-forward projection layer. The two representations are also expected to make the generated questions more relevant to the given inputs.
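The injection of the two graph representations can be sketched as a per-position concatenation; the helper name and shapes are illustrative, not from any released code:

```python
import numpy as np

# Toy sketch of feeding u_k^(T) and v_k^(T) into the decoder: the same two
# graph vectors are appended to every position's attention-head output.
def augment_head_output(head_out, u_k, v_k):
    """head_out: (L, d_head) outputs for a target of length L."""
    L = head_out.shape[0]
    graph_feats = np.tile(np.concatenate([u_k, v_k])[None, :], (L, 1))
    return np.concatenate([head_out, graph_feats], axis=1)   # (L, d_head + 2*d_g)
```

Since u_k^(T) and v_k^(T) are fixed for a given group, the concatenation broadcasts the same graph context to every decoding step of T_k.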

Coarse-To-Fine Generation
Since the semi-autoregressive generation scenario makes it more challenging to deal with coreferences between questions (especially questions in different groups), we perform question generation in a coarse-to-fine manner. The decoder only needs to generate "coarse questions" in which all pronouns are replaced by a placeholder "[p]". To get the final results, we use an additional pre-trained coreference resolution model to fill pronouns into the placeholders. For a fair comparison, we use the coreference resolution model (Clark and Manning, 2016) adopted by the prior works CoreNQG (Du and Cardie, 2018) and CorefNet (Gao et al., 2019).
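The placeholder scheme can be sketched as follows, with a hand-written pronoun list and a simple substitution standing in for the pre-trained coreference model:

```python
# Sketch of the coarse-to-fine scenario: the decoder emits "[p]" instead of
# pronouns; an external coreference model (mocked here by a given list of
# resolved pronouns) fills the placeholders afterwards.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "his", "their"}

def to_coarse(question: str) -> str:
    """Replace pronouns with the placeholder used at training time."""
    return " ".join("[p]" if w.lower() in PRONOUNS else w
                    for w in question.split())

def fill_placeholders(coarse: str, resolved: list) -> str:
    """Fill each [p], in order, with the pronoun chosen by a coref model."""
    out = coarse
    for pron in resolved:
        out = out.replace("[p]", pron, 1)
    return out

assert to_coarse("What was he doing there?") == "What was [p] doing there?"
assert fill_placeholders("What was [p] doing there?", ["he"]) == \
       "What was he doing there?"
```

Training on the coarse form removes pronoun choice from the decoder's burden, which is exactly what makes cross-group coreference tractable without autoregressive conditioning.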

Experiments
In this section, we first introduce the three kinds of baselines. After that, we compare and analyse the results of different models under both automatic and human evaluation metrics.

Baselines
We compared our model with seven baselines that can be divided into three groups. First, we used three TQG models: the Seq2seq (Du et al., 2017) model that pioneered NN-based QG, the CopyNet (See et al., 2017) model that introduced the pointer mechanism, and CoreNQG (Du and Cardie, 2018), which used hybrid features (word, answer and coreference embeddings) for the encoder and adopted a copy mechanism for the decoder. Second, since prior works regarded SQG as a conversation generation task, we directly used two powerful multi-turn dialog systems: the latent variable hierarchical recurrent encoder-decoder architecture VHRED (Serban et al., 2017) and the hierarchical recurrent attention architecture HRAN (Xing et al., 2018). Third, we used the prior works mentioned above. For Pan et al. (2019), we adopted the ReDR model, which had the best performance. For Gao et al. (2019), we used the CorefNet model. Although the CFNet model in that paper got better results, it required additional human annotations denoting the relationship between input sentences and target questions, so comparing CFNet with the other methods would be unfair. It is worth mentioning that when generating questions with the second and third groups of baselines, only previously generated outputs were used as the dialog history, i.e., the gold standard questions remain unknown (in some prior works, they were directly used as the dialog history, which we think is inappropriate in practice).

Automatic Evaluation Metrics
Following convention, we used BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004) and METEOR (Lavie and Agarwal, 2007) as automatic evaluation metrics. We also computed the average word count of the generated questions. As shown in Table 2, our semi-autoregressive model outperformed the other methods substantially. Focusing on the second and third groups of baselines, which regard SQG as a multi-turn dialog generation task, we can see that models from the third group are more powerful since they make better use of information from the input passages. Besides, models from the second group tend to generate the shortest questions. Finally, similar to the well-known tendency of dialog systems to generate dull and generic responses, these models also suffer from producing general but meaningless questions like "What?", "How?", "And else?".
When we compare the first and third groups of baselines (which are all QG models), it is not surprising that SQG models show more advantages than TQG models, as they take the relationships between questions into consideration. Besides, CorefNet achieves the best performance among the baselines, clearly surpassing ReDR. This indicates that, compared with implicitly performing reinforcement learning through QA models, explicitly using target answers as inputs can be more effective. Note that if we directly compare performance between the SQG task and the TQG task under the same model (e.g., the Seq2seq model), evaluation scores for TQG are much higher, which is not surprising since SQG additionally has to deal with the dependencies between questions. Another fact lies in the computation of the automatic evaluation metrics. As shown in Table 2, questions in SQG datasets are much shorter than in TQG datasets. Since our automatic evaluation metrics are based on n-gram overlaps between generated and gold standard questions, the scores go down significantly as n grows (for this reason, BLEU-4 scores are not listed in Table 2). This also illustrates the importance of performing human evaluation.
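To see why the scores fall as n grows on short questions, consider the modified n-gram precision at the core of BLEU-n (a minimal sketch, ignoring the brevity penalty and multi-reference clipping):

```python
from collections import Counter

# Modified n-gram precision on a short question: unigram overlap is high,
# but higher-order overlap vanishes quickly.
def ngram_precision(candidate, reference, n):
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

cand, ref = "what was he doing ?", "what was tim doing ?"
# ngram_precision(cand, ref, 1) -> 0.8, but ngram_precision(cand, ref, 4) -> 0.0
```

On a five-token question, a single substituted word already kills every 4-gram, which is why BLEU-4 becomes uninformative on concise SQG outputs.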

Human Evaluation
It is generally acknowledged that automatic evaluation metrics are far from sufficient for SQG, so we perform human evaluation in five aspects. Fluency measures whether a question is grammatically correct and fluent to read. Coherence measures whether a question is coherent with the previous ones. Coreference measures whether a question uses correct pronouns. Answerability measures whether a question targets the given answer. Relevance measures whether a question is grounded in the given passage. Since human evaluation is rather expensive and time-consuming, we selected the best TQG model (CoreNQG) and the best SQG model (CorefNet) to compare with our model. We randomly selected 20 passages from the test set with 207 given answers and asked 10 native speakers to evaluate the outputs of each model independently. Under each aspect, reviewers were asked to choose a score from {1, 2, 3}, where 3 indicates the best quality.
The average scores for each evaluation metric are shown in Table 4. Our model achieves the best or competitive performance on each metric. On fluency, all models perform well, and CorefNet, which outputs the shortest questions, gets the best score. As for coherence, CoreNQG gets poor results since it generates questions independently. On coreference, our model scores only slightly lower than CorefNet, which added direct supervision to attention weights via a coreference resolution model. Finally, our model achieves the best performance on both answerability and relevance. However, it is worth noticing that all models perform rather poorly on these two aspects, indicating that making a concise question meaningful (i.e., targeting the given answer) while using more information from the input passage (i.e., performing proper information omission) is a major challenge in SQG. Besides, as pointed out by Table 3, questions in our SQG dataset are significantly shorter than those in TQG datasets, making subtle errors much easier to notice.

Ablation Test
In this section, we perform an ablation test to verify the influence of different components of our model. First, we modify the final update step of the dual-graph interaction so that each graph is updated under its own gates, yielding the no-interact model, i.e., the two graphs are updated independently without any interaction. Second, we build a uni-graph model by removing the passage-info encoder (the remaining rationale graph is updated similarly to Li et al. (2015)). Third, we discard the answer-aware heads in the rationale encoder to get a uni-heads model. Then, we build the no-co2fine model without the coarse-to-fine generation scenario. Finally, we build a non-auto model that performs SQG in a non-autoregressive way, i.e., each question is generated in parallel.
Table 6: Example outputs from different models. We mark the given answers in the passage as blue.
As shown in Table 5, each component of our model plays an important part. Results for the no-interact model indicate that, compared with independently updating the passage-info graph and the answer-info graph, letting the two kinds of information interact through our dual-graph interaction scenario is more powerful. Not surprisingly, the uni-graph model, which removes the passage encoder (i.e., focuses less on context dependencies between sentences of the input passage), and the uni-heads model, which discards our answer-aware attention mechanism (i.e., focuses less on the given answers), perform significantly worse than our full model. Besides, our coarse-to-fine scenario helps to better deal with the dependencies between questions, where coreferences are widespread. Finally, although the architecture of the non-auto model is a special case of our model in which each group contains only a single question, its performance drops significantly, indicating the importance of semi-autoregressive generation. Even so, the dual-graph interaction still makes it perform better than the Seq2seq and CopyNet models in Table 2.

Running Examples
In Table 6, we present some generated examples comparing our model with the strongest baseline, CorefNet. On the one hand, our model performs better than CorefNet; in particular, its output questions better target the given answers (turns 2, 6, 7). It also correctly handles coreferences (e.g., distinguishing "Peter" and "Sammie"). On the other hand, the generated questions have poor quality when the gold standard questions involve more reasoning (turns 2, 6). Besides, the gold standard questions are also more concise (turns 4, 6).

Conclusion
In this paper, we focus on SQG, an important yet challenging task. Different from prior works that regard SQG as a dialog generation task, we propose the first semi-autoregressive SQG model, which divides questions into different groups and generates each group of closely-related questions in parallel. During this process, we first build a passage-info graph and an answer-info graph, and then perform dual-graph interaction to get representations capturing the context dependencies between passages and questions. These representations are further used during our coarse-to-fine generation process. To perform experiments, we analyze the limitations of existing datasets and create the first dataset specially built for SQG, containing 81.9K questions. Experimental results show that our model outperforms previous works by a substantial margin.
For future work, the major challenge is generating more meaningful, informative yet concise questions. Besides, more powerful question clustering and coarse-to-fine generation scenarios are also worth exploring. Finally, performing SQG on other types of inputs, e.g., images and knowledge graphs, is an interesting topic.

A Examples of Data Labeling
In Table 7, we use a typical example to show how we relabeled CoQA. As introduced in our paper, we first deleted questions that cannot be answered by a certain span from the passage. In Table 7, we deleted the QA-pairs in turns 15, 18 and 19 since they are yes/no questions; turns 3 and 16 since the answer "female" is not a span from the input passage; and turn 13 since its answer is scattered across the sentence "Some of his cats have orange fur, some have black fur, some are spotted and one is white".
After deleting the questions unsuitable for SQG, we replaced the remaining answers with certain spans from the input passage. As shown in Table 7, in most cases the original answer was already such a span. We slightly modified the answers in turns 2 and 7 from "Eight" and "Three" into "8" and "3" respectively. Finally, we rewrote all remaining questions to make them coherent. During this process, we mainly dealt with information omission and coreference. In our example, we added the word "feline" into the question of turn 14 since question 13 was deleted.

B Details of Experiments
We used the 200-dimensional pre-trained GloVe word embeddings as the initial values of the word embeddings; these embeddings were further fine-tuned during training. The NLTK package was used for sentence splitting and word tokenization. In our model, we set d_s, d_r and d_g to 200, 256 and 128 respectively. For the passage-info encoder, we used 16 heads in the multi-head attention layer. For the answer-info encoder, we used 8 vanilla self-attention heads and 6 additional answer-aware heads for each answer. To construct the two graphs, we set δ to 3. In the dual-graph interaction, we set T to 4. To train our model, we used an Adam optimizer with momentums β_1 = 0.9, β_2 = 0.99 and ε = 10^-8 to minimize the loss function. We varied the learning rate throughout training, with a warm-up phase followed by a decreasing phase similar to the original Transformer. Besides, we applied dropout with a rate between 0.4 and 0.5 to prevent over-fitting. Our model was trained on two Nvidia RTX 2080Ti graphics cards.
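The learning-rate schedule described above follows the original Transformer ("Noam" schedule); a sketch, with an assumed warm-up length since the text does not state one, is:

```python
# Sketch of the Transformer learning-rate schedule: linear warm-up followed
# by inverse-square-root decay. d_model and warmup are assumed values.
def noam_lr(step, d_model=128, warmup=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate rises linearly for the first `warmup` steps, peaks at step `warmup`, then decays proportionally to the inverse square root of the step number.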
Since we noticed that the available baseline codes used different scripts to compute BLEU,