Semantic Graphs for Generating Deep Questions

This paper proposes the problem of Deep Question Generation (DQG), which aims to generate complex questions that require reasoning over multiple pieces of information in the input passage. In order to capture the global structure of the document and facilitate reasoning, we propose a novel framework that first constructs a semantic-level graph for the input document and then encodes the semantic graph by introducing an attention-based GGNN (Att-GGNN). Afterward, we fuse the document-level and graph-level representations to perform joint training of content selection and question decoding. On the deep-question-centric HotpotQA dataset, our model greatly improves performance on questions requiring reasoning over multiple facts, leading to state-of-the-art performance. The code is publicly available at https://github.com/WING-NUS/SG-Deep-Question-Generation.


Introduction
Question Generation (QG) systems play a vital role in question answering (QA), dialogue systems, and automated tutoring applications: they enrich training QA corpora, help chatbots start conversations with intriguing questions, and automatically generate assessment questions, respectively. Existing QG research has typically focused on generating factoid questions relevant to one fact obtainable from a single sentence (Duan et al., 2017; Zhao et al., 2018; Kim et al., 2019), as exemplified in Figure 1(a). However, the comprehension and reasoning aspects of questioning have been much less explored, resulting in questions that are shallow and not reflective of the true creative human process.
People have the ability to ask deep questions about events, evaluation, opinions, synthesis, or reasons, usually in the form of Why, Why-not, How, and What-if questions, which require an in-depth understanding of the input source and the ability to reason over disjoint relevant contexts; e.g., asking Why did Gollum betray his master Frodo Baggins? after reading the fantasy novel The Lord of the Rings. Learning to ask such deep questions has intrinsic research value concerning how human intelligence embodies the skills of curiosity and integration, and will have broad application in future intelligent systems. Despite a clear push towards answering deep questions (exemplified by multi-hop reading comprehension (Cao et al., 2019) and commonsense QA (Rajani et al., 2019)), generating deep questions remains un-investigated. There is thus a clear need to push QG research towards generating deep questions that demand higher cognitive skills.

Figure 1: Examples of (a) shallow question generation from a single sentence and (b) deep question generation over two paragraphs.

(a) Input Sentence: Oxygen is used in cellular respiration and released by photosynthesis, which uses the energy of sunlight to produce oxygen from water.

(b) Input Paragraph A: Pago Pago International Airport. Pago Pago International Airport, also known as Tafuna Airport, is a public airport located 7 miles (11.3 km) southwest of the central business district of Pago Pago, in the village and plains of Tafuna on the island of Tutuila in American Samoa, an unincorporated territory of the United States.
Input Paragraph B: Hoonah Airport. Hoonah Airport is a state-owned public-use airport located one nautical mile (2 km) southeast of the central business district of Hoonah, Alaska.
Question: Are Pago Pago International Airport and Hoonah Airport both on American territory? Answer: Yes
In this paper, we propose the problem of Deep Question Generation (DQG), which aims to generate questions that require reasoning over multiple pieces of information in the passage. Figure 1(b) shows an example of a deep question that requires comparative reasoning over two disjoint pieces of evidence. DQG introduces three additional challenges that are not captured by traditional QG systems. First, unlike generating questions from a single sentence, DQG requires document-level understanding, which may introduce long-range dependencies when the passage is long. Second, we must be able to select relevant contexts to ask meaningful questions; this is non-trivial as it involves understanding the relation between disjoint pieces of information in the passage. Third, we need to ensure correct reasoning over multiple pieces of information so that the generated question is answerable by information in the passage.
To facilitate the selection of and reasoning over disjoint relevant contexts, we distill important information from the passage and organize it as a semantic graph, in which the nodes are extracted based on semantic role labeling or dependency parsing and connected by different intra- and inter-semantic relations (Figure 2). Semantic relations provide important clues about which contents are question-worthy and what reasoning should be performed; e.g., in Figure 1, both the entities Pago Pago International Airport and Hoonah Airport have the located at relation with a city in the United States. It is then natural to ask a comparative question, e.g., Are Pago Pago International Airport and Hoonah Airport both on American territory?. To efficiently leverage the semantic graph for DQG, we introduce three novel mechanisms: (1) proposing a novel graph encoder, which incorporates an attention mechanism into the Gated Graph Neural Network (GGNN), to dynamically model the interactions between different semantic relations; (2) fusing the word-level passage embeddings and the node-level semantic graph representations to obtain a unified semantic-aware passage representation for question decoding; and (3) introducing an auxiliary content selection task that is jointly trained with question decoding, which assists the model in selecting relevant contexts in the semantic graph to form a proper reasoning chain. We evaluate our model on HotpotQA (Yang et al., 2018), a challenging dataset in which the questions are generated by reasoning over text from separate Wikipedia pages. Experimental results show that our model - incorporating both the use of the semantic graph and the content selection task - improves performance by a large margin, in terms of both automated metrics (Section 4.3) and human evaluation (Section 4.5). Error analysis (Section 4.6) validates that our use of the semantic graph greatly reduces the number of semantic errors in generated questions.
In summary, our contributions are: (1) the very first work, to the best of our knowledge, to investigate deep question generation, (2) a novel framework which combines a semantic graph with the input passage to generate deep questions, and (3) a novel graph encoder that incorporates attention into a GGNN approach.

Related Work
Question generation aims to automatically generate questions from textual inputs. Rule-based techniques for QG usually rely on manually-designed rules or templates to transform a piece of given text into questions (Heilman, 2011; Chali and Hasan, 2012). These methods are confined to their transformation rules or templates, making them difficult to generalize. Neural-based approaches take advantage of the sequence-to-sequence (Seq2Seq) framework with attention. These models are trained in an end-to-end manner, requiring far less labor and enabling better language flexibility, compared with rule-based methods. A comprehensive survey of QG can be found in Pan et al. (2019).
Many improvements have been proposed since the first Seq2Seq model of Du et al. (2017): applying various techniques to encode the answer information, thus allowing for better-quality answer-focused questions (Sun et al., 2018; Kim et al., 2019); improving the training by combining supervised and reinforcement learning to maximize question-specific rewards (Yuan et al., 2017); and incorporating various linguistic features into the QG process (Liu et al., 2019a). However, these approaches only consider sentence-level QG. In contrast, our work focuses on the challenge of generating deep questions with multi-hop reasoning over document-level contexts.
Recently, work has started to leverage paragraph-level contexts to produce better questions. Du and Cardie (2018) incorporated coreference knowledge to better encode entity connections across documents. Zhao et al. (2018) applied a gated self-attention mechanism to encode contextual information. However, in practice, semantic structure is difficult to distil solely via self-attention over the entire document. Moreover, despite considering longer contexts, these works are trained and evaluated on SQuAD (Rajpurkar et al., 2016), which we argue is insufficient to evaluate deep QG because more than 80% of its questions are shallow and only relevant to information confined to a single sentence (Du et al., 2017).

Figure 2: The framework of our proposed model (on the right) together with an input example (on the left). The model consists of four parts: (1) a document encoder to encode the input document, (2) a semantic graph encoder to embed the document-level semantic graph via Att-GGNN, (3) a content selector to select relevant question-worthy contents from the semantic graph, and (4) a question decoder to generate a question from the semantic-enriched document representation. The left figure shows an input example and its semantic graph. Dark-colored nodes in the semantic graph are question-worthy nodes that are labeled to train the content selection task.

Methodology
Given the document D and the answer A, the objective is to generate a question Q̂ that satisfies:

Q̂ = arg max_Q P(Q | D, A),

where the document D and the answer A are both sequences of words. Different from previous works, we aim to generate a Q̂ which involves reasoning over multiple evidence sentences E = {s_i}_{i=1}^{n}, where each s_i is a sentence in D. Also, unlike traditional settings, A may not be a sub-span of D because reasoning is involved to obtain the answer.

General Framework
We propose an encoder-decoder framework with two novel features specific to DQG: (1) a fused word-level document and node-level semantic graph representation to better utilize and aggregate the semantic information among the relevant disjoint document contexts, and (2) joint training over the question decoding and content selection tasks to improve selection and reasoning over relevant information. Figure 2 shows the general architecture of the proposed model, including three modules: semantic graph construction, which builds the DP- or SRL-based semantic graph for the given input; semantic-enriched document representation, employing a novel Attention-enhanced Gated Graph Neural Network (Att-GGNN) to learn the semantic graph representations, which are then fused with the input document to obtain graph-enhanced document representations; and joint-task question generation, which generates deep questions via joint training of node-level content selection and word-level question decoding. In the following, we describe the details of each module.

Semantic Graph Construction
As illustrated in the introduction, the semantic relations between entities serve as strong clues in determining what to ask about and the types of reasoning involved. To distill such semantic information in the document, we explore both SRL- (Semantic Role Labelling) and DP- (Dependency Parsing) based methods to construct the semantic graph. Refer to Appendix A for the details of graph construction.
• SRL-based Semantic Graph. The task of Semantic Role Labeling (SRL) is to identify what semantic relations hold among a predicate and its associated participants and properties, including "who" did "what" to "whom", etc. For each sentence, we extract predicate-argument tuples via SRL toolkits 1 . Each tuple forms a subgraph where each tuple element (e.g., arguments, location, and temporal) is a node. We add inter-tuple edges between nodes from different tuples if they have an inclusive relationship or potentially mention the same entity.
• DP-based Semantic Graph. We employ the biaffine attention model (Dozat and Manning, 2017) for each sentence to obtain its dependency parse tree, which is further revised by removing unimportant constituents (e.g., punctuation) and merging consecutive nodes that form a complete semantic unit. Afterwards, we add inter-tree edges between similar nodes from different parse trees to construct a connected semantic graph.
The left side of Figure 2 shows an example of the DP-based semantic graph. Compared with SRL-based graphs, DP-based ones typically model more fine-grained and sparse semantic relations, as discussed in Appendix A.3. Section 4.3 gives a performance comparison of these two formalisms.

Semantic-Enriched Document Representations
We separately encode the document D and the semantic graph G via an RNN-based passage encoder and a novel Att-GGNN graph encoder, respectively, then fuse them to obtain the semantic-enriched document representations for question generation.
Document Encoding. Given the input document D = [w_1, · · · , w_l], we employ a bi-directional Gated Recurrent Unit (GRU) to encode its contexts. We represent the encoder hidden states as X_D = [x_1, · · · , x_l], where x_i is the context embedding of w_i, formed as a concatenation of its bi-directional hidden states.
Node Initialization. We define the SRL- and DP-based semantic graphs in a unified way. The semantic graph of the document D is a heterogeneous multi-relation graph G = (V, E), where V = {v_i}_{i=1:N_v} and E = {e_j}_{j=1:N_e} denote the graph nodes and the edges connecting them, and N_v and N_e are the numbers of nodes and edges in the graph, respectively. Each node v = {w_j}_{j=m_v:n_v} is a text span in D with an associated node type t_v, where m_v / n_v is the starting / ending position of the text span. Each edge also has a type t_e that represents the semantic relation between nodes.
We obtain the initial representation h^0_v for each node v = {w_j}_{j=m_v:n_v} by computing word-to-node attention. First, we concatenate the last hidden states of the document encoder in both directions as the document representation d_D = [x_l ; x_1]. Afterwards, for a node v, we calculate the attention distribution of d_D over all the words {w_{m_v}, · · · , w_j, · · · , w_{n_v}} in v as follows:

β^v_j = exp(f(d_D, x_j)) / Σ_{k=m_v}^{n_v} exp(f(d_D, x_k)),

where f is a learned compatibility score and β^v_j is the attention coefficient of the document embedding d_D over a word w_j in the node v. The initial node representation h^0_v is then given by the attention-weighted sum of the embeddings of its constituent words, i.e., h^0_v = Σ_{j=m_v}^{n_v} β^v_j x_j. Word-to-node attention ensures that each node captures not only the meaning of its constituent words but also the semantics of the entire document. The node representation is then enhanced with two additional features, the POS embedding p_v and the answer tag embedding a_v, to obtain the enhanced initial representation [h^0_v ; p_v ; a_v].

Graph Encoding. We then employ a novel Att-GGNN to update the node representations by aggregating information from their neighbors. To represent the multiple relations on the edges, we base our model on the multi-relation Gated Graph Neural Network (GGNN), which provides a separate transformation matrix for each edge type. For DQG, it is essential for each node to pay attention to different neighboring nodes when performing different types of reasoning. To this end, we adopt the idea of Graph Attention Networks (Velickovic et al., 2017) to dynamically determine the weights of neighboring nodes in message passing using an attention mechanism.
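As a concrete illustration, the word-to-node attention used for node initialization can be sketched in NumPy as follows. The bilinear scoring matrix W is an assumed parameterization of the compatibility score, since the paper does not spell out the exact attention function:

```python
import numpy as np

def init_node_embedding(x, d_doc, span, W):
    """Word-to-node attention (a sketch): attend the document
    embedding d_doc over the words of one node's text span.

    x     : (l, d) word embeddings for the whole document
    d_doc : (d,) document embedding (the [x_l ; x_1] of the paper,
            with dimensions folded to d here for simplicity)
    span  : (m_v, n_v) inclusive word indices of the node
    W     : (d, d) bilinear scoring matrix (an assumption)
    """
    m, n = span
    words = x[m:n + 1]                 # word embeddings of the span
    scores = words @ W @ d_doc         # one compatibility score per word
    beta = np.exp(scores - scores.max())
    beta /= beta.sum()                 # softmax over the span
    return beta @ words                # attention-weighted sum h^0_v
```

With identical word embeddings the attention is uniform and the node embedding equals that shared word vector; with a sharply peaked score it approaches the highest-scoring word.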
Formally, given the initial hidden states of the graph nodes, the model performs K state transitions. At each state transition k, an aggregation function is applied to each node v_i to collect messages from the nodes directly connected to v_i. The neighbors are distinguished by their incoming and outgoing edges as follows:

h^k_{N⊣(i)} = Σ_{e_ij ∈ N⊣(i)} α^k_ij W_{t_e_ij} h^k_j ,    h^k_{N⊢(i)} = Σ_{e_ji ∈ N⊢(i)} α^k_ji W_{t_e_ji} h^k_j ,

where N⊣(i) and N⊢(i) denote the sets of incoming and outgoing edges of v_i, respectively, W_{t_e_ij} denotes the weight matrix corresponding to the edge type t_e_ij from v_i to v_j, and α^k_ij is the attention coefficient of v_i over v_j, derived as follows:

α^k_ij = softmax_j( LeakyReLU( a^T [W_A h^k_i ; W_A h^k_j] ) ),

where a and W_A are learnable parameters. Finally, a GRU is used to update the node state by incorporating the aggregated neighboring information:

h^{k+1}_i = GRU( h^k_i , [h^k_{N⊣(i)} ; h^k_{N⊢(i)}] ).
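One propagation step of this scheme can be sketched as below. This is a deliberate simplification rather than the paper's exact model: the incoming/outgoing split is collapsed into a single directed pass, and the GRU state update is replaced by a residual average to keep the sketch short:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def att_ggnn_step(h, edges, W_rel, W_a, a):
    """One simplified Att-GGNN propagation step.

    h     : (N, d) current node states
    edges : list of (i, j, t) directed edges i -> j with type t
    W_rel : dict t -> (d, d) per-edge-type transform (multi-relation GGNN)
    W_a, a: GAT-style attention parameters, shapes (d, d) and (2d,)
    """
    N, d = h.shape
    msgs = np.zeros_like(h)
    norm = np.zeros(N)
    for i, j, t in edges:
        # GAT-style score between the receiving and sending node
        score = leaky_relu(a @ np.concatenate([W_a @ h[j], W_a @ h[i]]))
        w = np.exp(score)                    # unnormalized attention
        msgs[j] += w * (W_rel[t] @ h[i])     # typed, attention-weighted message
        norm[j] += w
    norm[norm == 0] = 1.0                    # nodes with no in-edges
    agg = msgs / norm[:, None]               # normalized aggregation
    return 0.5 * (h + agg)                   # stand-in for the GRU update
```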
After the K-th state transition, we denote the final structure-aware representation of node v as h^K_v.

Feature Aggregation. Finally, we fuse the semantic graph representations H^K with the document representations X_D to obtain the semantic-enriched document representations E_D for question decoding:

E_D = Fuse(X_D, H^K).

We employ a simple matching-based strategy for the feature fusion function Fuse. For a word w_i ∈ D, we match it to the smallest-granularity node that contains the word w_i, denoted as v_{M(i)}. We then concatenate the word representation x_i with the node representation h^K_{v_M(i)}. When there is no corresponding node v_{M(i)}, we concatenate x_i with a special vector close to 0.
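A minimal sketch of this matching-based fusion, using plain Python lists so that `+` is concatenation (mirroring the concatenation of x_i with the node vector); the node ids and spans are hypothetical:

```python
def fuse_document_graph(x, node_spans, node_states, null_vec):
    """Fuse word embeddings with final node states (matching-based sketch).

    x           : list of word vectors (as plain lists) for the document
    node_spans  : {node_id: (start, end)} inclusive token spans
    node_states : {node_id: final node vector h^K_v, as a list}
    null_vec    : near-zero vector for words outside every node
    """
    fused = []
    for i, x_i in enumerate(x):
        # all nodes whose span covers word i, keyed by span length
        covering = [(e - s, v) for v, (s, e) in node_spans.items()
                    if s <= i <= e]
        if covering:
            _, v = min(covering)              # smallest-granularity node
            fused.append(x_i + node_states[v])  # list '+' = concatenation
        else:
            fused.append(x_i + null_vec)        # word outside the graph
    return fused
```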
The semantic-enriched representation E_D provides the following important information to benefit question generation: (1) semantic information: the document representation incorporates semantic information explicitly through concatenation with the semantic graph encoding; (2) phrase information: a phrase is often represented as a single node in the semantic graph (cf. Figure 2); its constituent words are therefore aligned with the same node representation; (3) keyword information: a word (e.g., a preposition) not appearing in the semantic graph is aligned with the special node vector mentioned before, indicating that the word does not carry important information.

Joint Task Question Generation
Based on the semantic-rich input representations, we generate questions via joint training on two tasks: Question Decoding and Content Selection. Question Decoding. We adopt an attention-based GRU model with copying (Gu et al., 2016; See et al., 2017) and coverage (Tu et al., 2016) mechanisms as the question decoder. The decoder takes the semantic-enriched representations E_D = {e_i, ∀w_i ∈ D} from the encoders as the attention memory to generate the output sequence one word at a time. To make the decoder aware of the answer, we use the average of the word embeddings in the answer to initialize the decoder hidden states.
At each decoding step t, the model learns to attend over the input representations E D and compute a context vector c t based on E D and the current decoding state s t . Next, the copying probability P cpy ∈ [0, 1] is calculated from the context vector c t , the decoder state s t and the decoder input y t−1 . P cpy is used as a soft switch to choose between generating from the vocabulary, or copying from the input document. Finally, we incorporate the coverage mechanisms (Tu et al., 2016) to encourage the decoder to utilize diverse components of the input document. Specifically, at each step, we maintain a coverage vector cov t , which is the sum of attention distributions over all previous decoder steps. A coverage loss is computed to penalize repeatedly attending to the same locations of the input document.
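The coverage bookkeeping described above can be sketched in a few lines; this shows only the running coverage vector and the per-step coverage loss, not the full decoder:

```python
def coverage_step(cov, attn):
    """One decoder step of the coverage mechanism (Tu et al., 2016), sketched.

    cov  : running sum of attention distributions from previous steps
    attn : this step's attention distribution over source positions
    Returns the updated coverage vector and the step's coverage loss,
    sum_i min(cov_i, attn_i), which penalizes re-attending to positions
    that have already received attention mass.
    """
    loss = sum(min(c, a) for c, a in zip(cov, attn))
    new_cov = [c + a for c, a in zip(cov, attn)]
    return new_cov, loss
```

Attending twice to the same positions yields a non-zero loss on the second step, which is what discourages repetition in the generated question.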

Content Selection.
To raise a deep question, humans select and reason over relevant content. To mimic this, we propose an auxiliary task of content selection to jointly train with question decoding. We formulate this as a node classification task: deciding whether each node should be involved in the process of asking, i.e., whether it appears in the reasoning chain for raising a deep question, as exemplified by the dark-colored nodes in Figure 2.
To this end, we add one feed-forward layer on top of the final layer of the graph encoder, taking the output node representations H^K as input for classification. We deem a node a positive ground-truth example for training the content selection task if its contents appear in the ground-truth question or it acts as a bridge entity between two sentences.
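The ground-truth labeling heuristic can be sketched as follows; the exact matching rule (lower-cased substring containment) is an assumption, as the paper does not specify how node text is matched against the question:

```python
def label_nodes(node_texts, question, bridge_entities):
    """Heuristic ground-truth labels for content selection (a sketch).

    A node is labeled positive if its text appears in the ground-truth
    question, or if it is a bridge entity connecting evidence sentences.
    node_texts      : {node_id: node text span}
    question        : ground-truth question string
    bridge_entities : iterable of bridge-entity strings
    """
    q = question.lower()
    bridges = {b.lower() for b in bridge_entities}
    return {v: (t.lower() in q or t.lower() in bridges)
            for v, t in node_texts.items()}
```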
Content selection helps the model to identify the question-worthy parts that form a proper reasoning chain in the semantic graph. This synergizes with the question decoding task which focuses on the fluency of the generated question. We jointly train these two tasks with weight sharing on the input representations.

Data and Metrics
To evaluate the model's ability to generate deep questions, we conduct experiments on HotpotQA (Yang et al., 2018), containing ∼100K crowd-sourced questions that require reasoning over separate Wikipedia articles. Each question is paired with two supporting documents that contain the evidence necessary to infer the answer. In the DQG task, we take the supporting documents along with the answer as inputs to generate the question. However, state-of-the-art semantic parsing models have difficulty producing accurate semantic graphs for very long documents. We therefore pre-process the original dataset to select relevant sentences, i.e., the evidence statements and the sentences that overlap with the ground-truth question, as the input document. We follow the original data split of HotpotQA to pre-process the data, resulting in 90,440 / 6,072 examples for training and evaluation, respectively.
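The sentence-selection pre-processing can be sketched as below; the overlap measure and its threshold are assumptions, since the paper does not specify the exact criterion for "sentences that overlap with the ground-truth question":

```python
def select_relevant_sentences(sentences, evidence_idx, question, min_overlap=2):
    """Pre-processing sketch: keep evidence sentences plus sentences
    sharing at least `min_overlap` words with the ground-truth question
    (the word-overlap threshold is an assumed heuristic).

    sentences    : list of sentence strings from the supporting documents
    evidence_idx : set of indices of annotated evidence sentences
    question     : ground-truth question string
    """
    q_words = set(question.lower().split())
    keep = []
    for i, s in enumerate(sentences):
        overlap = len(set(s.lower().split()) & q_words)
        if i in evidence_idx or overlap >= min_overlap:
            keep.append(s)
    return keep
```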
Following previous works, we employ BLEU 1-4 (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and ROUGE-L (Lin, 2004) as automated evaluation metrics. BLEU measures the average n-gram overlap with a set of reference sentences; METEOR and ROUGE-L adapt this overlap idea to machine translation and text summarization evaluation, respectively. Critically, we also conduct human evaluation, where annotators evaluate the generation quality on three important aspects of deep questions: fluency, relevance, and complexity.

Baselines
We compare our proposed model against several strong baselines on question generation.
• Seq2Seq + Attn: the basic Seq2Seq model with attention, which takes the document as input to decode the question.
• NQG++: enhances the Seq2Seq model with a feature-rich encoder containing answer position, POS, and NER information.
• ASs2s (Kim et al., 2019): learns to decode questions from an answer-separated passage encoder together with a keyword-net based answer encoder.
• S2sa-at-mp-gsa (Zhao et al., 2018): an enhanced Seq2Seq model incorporating gated self-attention and maxout-pointers to encode richer passage-level contexts (B4 in Table 1). We also implement a version that uses the coverage mechanism and our answer encoder for fair comparison, labeled B5.
• CGC-QG (Liu et al., 2019a): another enhanced Seq2Seq model that performs word-level content selection before generation; i.e., it decides which words to generate and which to copy using rich syntactic features, such as NER, POS, and DEP.

Implementation Details. For fair comparison, we use the original implementations of ASs2s and CGC-QG to apply them on HotpotQA. All baselines share a 1-layer GRU document encoder and question decoder with hidden units of 512 dimensions. Word embeddings are initialized with 300-dimensional pre-trained GloVe (Pennington et al., 2014). For the graph encoder, the node embedding size is 256, plus POS and answer tag embeddings with 32 dimensions each. The number of layers K is set to 3 and the hidden state size is 256. Other settings for training follow standard best practice 2 .

Comparison with Baseline Models
The top two parts of Table 1 show the experimental results comparing against all baseline methods. We make three main observations: 1. The two versions of our model, P1 and P2, consistently outperform all other baselines in BLEU. Specifically, our model with the DP-based semantic graph (P2) achieves an absolute improvement of 2.05 in BLEU-4 (+15.2%) compared to the document-level QG model that employs gated self-attention and has been enhanced with the same decoder as ours (B5). This shows the significant effect of semantic-enriched document representations, equipped with auxiliary content selection, for generating deep questions.
2. The results of CGC-QG (B6) exhibit an unusual pattern compared with other methods, achieving the best METEOR and ROUGE-L but the worst BLEU-1 among all baselines. As CGC-QG performs word-level content selection, we observe that it tends to include many irrelevant words in the question, leading to lengthy questions (33.7 tokens on average, versus 17.7 for ground-truth questions and 19.3 for our model) that are unanswerable or contain semantic errors. Our model greatly reduces such errors with node-level content selection based on semantic relations (shown in Table 3). 3. While both SRL-based and DP-based semantic graph models (P1 and P2) achieve state-of-the-art BLEU, the DP-based graph (P2) performs slightly better (+3.3% in BLEU-4). A possible explanation is that SRL fails to include fine-grained semantic information in the graph, as the parsing often results in nodes containing long sequences of tokens.

Ablation Study
We also perform ablation studies to assess the impact of different components on model performance against our DP-based semantic graph (P2) model. These are shown as rows A1-A4 in Table 1. Similar results are observed for the SRL version.
• Impact of semantic graph. When we do not employ the semantic graph (A2, -w/o Semantic Graph), the BLEU-4 score of our model dramatically drops to 13.85, which indicates the necessity of building semantic graphs to model semantic relations between relevant contents for deep QG. Despite its vital role, the result of A1 shows that generating questions purely from the semantic graph is unsatisfactory. We posit three reasons: 1) the semantic graph alone is insufficient to convey the meaning of the entire document, 2) sequential information in the passage is not captured by the graph, and 3) the automatically built semantic graph inevitably contains noise. These reasons necessitate the composite document representation.
• Impact of Att-GGNN. Using a normal GGNN (A3, -w/o Multi-Relation & Attention) to encode the semantic graph, performance drops to 14.15 (−3.61%) in BLEU-4 compared to the model with Att-GGNN (A4, -w/o Multi-Task). This reveals that different entity types and their semantic relations provide auxiliary information needed to generate meaningful questions. Our Att-GGNN model (P2) incorporates attention into the normal GGNN, effectively leveraging the information across multiple node and edge types.
• Impact of joint training. By turning off the content selection task (A4, -w/o Multi-Task), the BLEU-4 score drops from 15.53 to 14.66, showing the contribution of joint training with the auxiliary task of content selection. In Section 4.7, we further show that content selection helps to learn a QG-aware graph representation, training the model to focus on question-worthy content and form a correct reasoning chain in question decoding.

Human Evaluation
We conduct human evaluation on 300 random test samples consisting of: 100 short (<50 tokens), 100 medium (50-200 tokens), and 100 long (>200 tokens) documents. We ask three workers to rate the 300 generated questions as well as the ground-truth questions between 1 (poor) and 5 (good) on three criteria: (1) Fluency, which indicates whether the question follows the grammar and accords with correct logic; (2) Relevance, which indicates whether the question is answerable and relevant to the passage; (3) Complexity, which indicates whether the question involves reasoning over multiple sentences from the document. We average the scores from raters on each question and report the performance over the five top models from Table 1. Raters were unaware of the identity of the models in advance.

Table 3: Error analysis on 3 different methods, with respect to 5 major error types (excluding "Correct"). Pred. and G.T. show an example of the predicted question and the ground-truth question, respectively (e.g., the erroneous prediction What was the ranking of the population of the city Barack Obama was born in 1999?). Semantic Error: the question has a logic or commonsense error; Answer Revealing: the question reveals the answer; Ghost Entity: the question refers to entities that do not occur in the document; Redundant: the question contains unnecessary repetition; Unanswerable: the question has none of the above errors but cannot be answered from the document.

Table 2 shows our human evaluation results, which further validate that our model generates questions of better quality than the baselines. Let us explain two observations in detail: • Compared against B4 (S2sa-at-mp-gsa), improvements are more salient in "Fluency" (+13.33%) and "Complexity" (+8.48%) than in "Relevance" (+6.27%). The reason is that the baseline produces more shallow questions (affecting complexity) or questions with semantic errors (affecting fluency). We observe similar results when removing the semantic graph (A2, -w/o Semantic Graph). These demonstrate that our model, by incorporating the semantic graph, produces questions with fewer semantic errors and utilizes more context.
• All metrics decrease in general when the input document becomes longer, with the most obvious drop in "Fluency". When the input context is long, it becomes difficult for models to capture question-worthy points and conduct correct reasoning, leading to more semantic errors. Our model tries to alleviate this problem by introducing the semantic graph and content selection, but question quality still drops because noise in the semantic graph increases as the document becomes longer.

Error Analysis
In order to better understand question generation quality, we manually check the sampled outputs and list the 5 main error sources in Table 3. Among them, "Semantic Error", "Redundant", and "Unanswerable" are noticeable errors for all models. However, we find that the baselines produce more unreasonable subject-predicate-object collocations (semantic errors) than our model. In particular, CGC-QG (B6) has the largest semantic error rate of 26.4% among the three methods; it tends to copy irrelevant contents from the input document. Our model greatly reduces such semantic errors to 8.3%, as we explicitly model the semantic relations between entities by introducing typed semantic graphs. The other noticeable error type is "Unanswerable"; i.e., the question is itself correct but cannot be answered from the passage. Again, CGC-QG produces remarkably more unanswerable questions than the other two models, and our model achieves results comparable to S2sa-at-mp-gsa (B4), likely because answerability requires a deeper understanding of the document as well as commonsense knowledge. These issues cannot be fully addressed by incorporating semantic relations alone. Examples of questions generated by different models are shown in Figure 3.

Analysis of Content Selection
We introduced the content selection task to guide the model to select relevant content and form proper reasoning chains in the semantic graph. To quantitatively validate the relevant content selection, we calculate the alignment of node attention α_{v_i} with respect to the relevant nodes v_i ∈ RN and the irrelevant nodes v_i ∉ RN, under the conditions of both single-task training and joint training, where RN represents the ground-truth node set for content selection. Ideally, a successful model should focus on relevant nodes and ignore irrelevant ones; this is reflected by the ratio between Σ_{v_i ∈ RN} α_{v_i} and Σ_{v_i ∉ RN} α_{v_i}. When jointly training with content selection, this ratio is 1.214, compared with 1.067 under single-task training, consistent with our intuition about content selection. Ideally, a successful model should also concentrate on the parts of the graph that help to form proper reasoning. To quantitatively validate this, we compare the concentration of attention in the single- and multi-task settings by computing the entropy H = −Σ α_{v_i} log α_{v_i} of the attention distributions. We find that content selection increases the entropy from 3.51 to 3.57 on average. To gain better insight, in Figure 3, we visualize the semantic graph attention distribution of an example. We see that the model pays more attention (darker shading) to the nodes that form the reasoning chain (the highlighted paths in purple), consistent with the quantitative analysis.

Figure 3: Attention visualization over the semantic graph of an example whose input includes the sentences: 2) "Na Na" appeared on the Disney film, "Confessions of a Teenage Drama Queen". 3) Confessions of a Teenage Drama Queen is a 2004 American teen musical comedy film directed by Sara Sugarman and produced by Robert Shapiro and Matthew Hart for Walt Disney Pictures.
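The two diagnostics used in this analysis, the relevant/irrelevant attention ratio and the attention entropy, can be sketched as:

```python
import math

def attention_stats(alpha, relevant):
    """Content-selection diagnostics (a sketch).

    alpha    : {node: attention mass over the semantic graph}
    relevant : set of ground-truth reasoning-chain nodes (RN)
    Returns (ratio of relevant to irrelevant attention mass,
             entropy of the normalized attention distribution).
    """
    rel = sum(a for v, a in alpha.items() if v in relevant)
    irr = sum(a for v, a in alpha.items() if v not in relevant)
    total = rel + irr
    entropy = -sum((a / total) * math.log(a / total)
                   for a in alpha.values() if a > 0)
    return rel / irr, entropy
```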

Conclusion and Future Works
We propose the problem of DQG to generate questions that require reasoning over multiple disjoint pieces of information. To this end, we propose a novel framework that incorporates semantic graphs to enhance the input document representations and generates questions by jointly training with the task of content selection. Experiments on the HotpotQA dataset demonstrate that introducing the semantic graph significantly reduces semantic errors, and that content selection benefits the selection of, and reasoning over, disjoint relevant contents, leading to questions of better quality.
There are at least two potential future directions. First, a graph structure that accurately represents the semantic meaning of the document is crucial for our model. Although DP-based and SRL-based semantic parsing are widely used, more advanced semantic representations could also be explored, such as discourse structure representations (van Noord et al., 2018; Liu et al., 2019b) and knowledge graph-enhanced text representations (Cao et al., 2017; Yang et al., 2019). Second, our method could be improved by explicitly modeling the reasoning chains when generating deep questions, inspired by related methods (Lin et al., 2018; Jiang and Bansal, 2019) in multi-hop question answering.

A Supplemental Material
Here we give a more detailed description of the semantic graph construction, for which we employ two methods: Semantic Role Labelling (SRL) and Dependency Parsing (DP).

A.1 SRL-based Semantic Graph
The primary task of semantic role labeling (SRL) is to indicate exactly what semantic relations hold among a predicate and its associated participants and properties. Given a document $D$ with $n$ sentences $\{s_1, \cdots, s_n\}$, Algorithm 1 gives the detailed procedure for constructing the semantic graph based on SRL.

Algorithm 1 Build SRL-based Semantic Graphs
Input: Document $D = \{s_1, \cdots, s_n\}$
Output: Semantic graph $G$
1: Initialize an empty graph $G = (V, E)$
2: for each sentence $s \in D$ do
3:   $S \leftarrow \mathrm{SRL}(s)$
4:   for each tuple $t = (a, v, m)$ in $S$ do
5:     for each span $x \in \{a, v, m\}$ do
6:       add $x$ to $V$, linking it to each similar node $v_i \in V$ with a SIMILAR edge
7:     add edges $\langle a, r_{a \to v}, v \rangle$ and $\langle v, r_{v \to m}, m \rangle$ to $E$

We first create an empty graph $G = (V, E)$, where $V$ and $E$ are the node and edge sets, respectively. For each sentence $s$, we use the state-of-the-art BERT-based model (Shi and Lin, 2019) provided in the AllenNLP toolkit (https://demo.allennlp.org/semantic-role-labeling) to perform SRL, resulting in a set of SRL tuples $S$. Each tuple $t \in S$ consists of an argument $a$, a verb $v$, and (possibly) a modifier $m$, each of which is a text span of the sentence. We treat each of $a$, $v$, and $m$ as a node and link it to an existing node $v_i \in V$ if the two are similar. Two nodes $A$ and $B$ are similar if one of the following rules is satisfied: (1) $A$ is equal to $B$; (2) $A$ contains $B$; (3) the number of overlapping words between $A$ and $B$ is larger than half the minimum number of words in $A$ and $B$. The edge between two similar nodes is assigned a special semantic relationship SIMILAR, denoted as $r_s$. Afterwards, we add the two edges $\langle a, r_{a \to v}, v \rangle$ and $\langle v, r_{v \to m}, m \rangle$ to the edge set, where $r_{a \to v}$ and $r_{v \to m}$ denote the semantic relationships between $(a, v)$ and $(v, m)$, respectively. As a result, we obtain a semantic graph with multiple node and edge types based on SRL, which captures the core semantic relations between entities within the document.
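The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not our released implementation: the SRL step is replaced by hand-written example tuples, the edge-type names ARG->V and V->MOD are placeholders, and the similarity test is a simplified string-based reading of the three rules.

```python
def similar(a, b):
    """Node-similarity test from Section A.1: equality, containment,
    or word overlap exceeding half the size of the shorter span."""
    if a == b or a in b or b in a:
        return True
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) > min(len(wa), len(wb)) / 2

def build_srl_graph(srl_tuples):
    """Build an SRL-based semantic graph from (argument, verb, modifier)
    tuples. Nodes are text spans; edges are (src, relation, dst) triples."""
    nodes, edges = [], []

    def add_node(span):
        for existing in nodes:
            if existing != span and similar(existing, span):
                edges.append((span, "SIMILAR", existing))
        if span not in nodes:
            nodes.append(span)

    for a, v, m in srl_tuples:
        for span in (a, v, m):
            if span:
                add_node(span)
        edges.append((a, "ARG->V", v))
        if m:
            edges.append((v, "V->MOD", m))
    return nodes, edges

# Hypothetical SRL output for two sentences of a document.
tuples = [
    ("Pago Pago International Airport", "is located", "southwest of Pago Pago"),
    ("Pago Pago Airport", "serves", None),
]
nodes, edges = build_srl_graph(tuples)
```

Here the two airport mentions share enough words to trigger rule (3), so the second mention is linked to the first with a SIMILAR edge, joining the two sentence-level subgraphs.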

A.2 DP-based Semantic Graph
Dependency Parsing (DP) analyzes the grammatical structure of a sentence, establishing relationships between "head" words and the words that modify them, in a tree structure. Given a document $D$ with $n$ sentences $\{s_1, \cdots, s_n\}$, Algorithm 2 gives the detailed procedure for constructing the semantic graph based on dependency parsing.

To better represent the entity connections within the document, we first employ the coreference resolution system of AllenNLP to replace the pronouns that refer to the same entity with its original entity name. For each sentence $s$, we employ the AllenNLP implementation of the biaffine attention model (Dozat and Manning, 2017) to obtain its dependency parse tree $T_s$. Afterwards, we perform the following operations to refine the tree:
• IDENTIFY NODE TYPES: each node in the dependency parse tree is a word associated with a POS tag. To simplify the node type system, we manually categorize the POS types into three groups: verb, noun, and attribute. Each node is then assigned to one group as its node type.
• PRUNE TREE: we then prune each tree by removing unimportant constituents (e.g., punctuation) based on pre-defined grammar rules. Specifically, we do this recursively from top to bottom: for each node $v$, we visit each of its child nodes $c$; if $c$ needs to be pruned, we delete $c$ and directly link each child node of $c$ to $v$.
• MERGE NODES: each node in the tree represents only one word, which may lead to a large and noisy semantic graph, especially for long documents. To ensure that the semantic graph retains only important semantic relations, we merge consecutive nodes that form a complete semantic unit. Specifically, we apply a simple yet effective rule: merge a node $v$ with its child $c$ if they form a consecutive modifier, i.e., both $v$ and $c$ are of the attribute (modifier) type and are consecutive in the sentence.
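The PRUNE TREE and MERGE NODES steps can be sketched on a toy tree as follows. This is an illustrative sketch only: the dict-based tree encoding, the type label "attr", and the single punctuation pruning rule are assumptions for the example, not the full rule set used in our system.

```python
PRUNE_TYPES = {"punct"}  # illustrative rule: drop punctuation nodes

def prune(node):
    """Top-down pruning: delete prunable children and re-attach their
    children to the current node, then recurse into the kept children."""
    kept, stack = [], list(node["children"])
    while stack:
        child = stack.pop(0)
        if child["type"] in PRUNE_TYPES:
            stack = child["children"] + stack  # lift grandchildren, re-check them
        else:
            kept.append(child)
    node["children"] = kept
    for child in kept:
        prune(child)
    return node

def merge_modifiers(node):
    """Merge a node with a child when both are attribute (modifier) nodes
    and adjacent in the sentence, forming one semantic unit."""
    for child in list(node["children"]):
        merge_modifiers(child)
        if (node["type"] == "attr" and child["type"] == "attr"
                and abs(node["pos"] - child["pos"]) == 1):
            if node["pos"] < child["pos"]:
                node["word"] = node["word"] + " " + child["word"]
            else:
                node["word"] = child["word"] + " " + node["word"]
            node["pos"] = min(node["pos"], child["pos"])
            node["children"].remove(child)
            node["children"].extend(child["children"])
    return node

# Toy parse fragment for "very large airport ,"
tree = {"word": "airport", "type": "noun", "pos": 3, "children": [
    {"word": "large", "type": "attr", "pos": 2, "children": [
        {"word": "very", "type": "attr", "pos": 1, "children": []}]},
    {"word": ",", "type": "punct", "pos": 4, "children": []},
]}
refined = merge_modifiers(prune(tree))
```

After refinement, the punctuation node is gone and "very" and "large" collapse into a single attribute node "very large" under "airport".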
After obtaining the refined dependency parse tree $T_s$ for each sentence $s$, we add inter-tree edges to construct the semantic graph by connecting nodes that are similar but come from different parse trees. For each possible node pair $(v_i, v_j)$, we add an edge between them with the special edge type SIMILAR (denoted as $r_s$) if the two nodes are similar, i.e., they satisfy the same condition as described in Section A.1. Figure 4 shows a real example of the DP- and SRL-based semantic graphs. In general, the DP-based graph contains fewer words per node than the SRL-based graph, allowing it to capture more fine-grained semantic relations. For example, "a leading member of the Native American self-determination movement" is treated as a single node in the SRL-based graph, while in the DP-based graph it is represented as a semantic triple $\langle$a leading member, pobj, the Native American self-determination movement$\rangle$. As its nodes are more fine-grained, the DP-based graph is typically sparser than the SRL-based graph, which may hinder message passing during graph propagation.

A.3 Examples
In our experiments, we compared the performance when using DP- and SRL-based graphs. We find that although both the SRL- and DP-based semantic graphs outperform all baselines in terms of BLEU 1-4, the DP-based graph performs slightly better than the SRL-based one (+3.3% in BLEU-4).