Generating Question-Answer Hierarchies

The process of knowledge acquisition can be viewed as a question-answer game between a student and a teacher in which the student typically starts by asking broad, open-ended questions before drilling down into specifics (Hintikka, 1981; Hakkarainen and Sintonen, 2002). This pedagogical perspective motivates a new way of representing documents. In this paper, we present SQUASH (Specificity-controlled Question-Answer Hierarchies), a novel and challenging text generation task that converts an input document into a hierarchy of question-answer pairs. Users can click on high-level questions (e.g., “Why did Frodo leave the Fellowship?”) to reveal related but more specific questions (e.g., “Who did Frodo leave with?”). Using a question taxonomy loosely based on Lehnert (1978), we classify questions in existing reading comprehension datasets as either GENERAL or SPECIFIC. We then use these labels as input to a pipelined system centered around a conditional neural language model. We extensively evaluate the quality of the generated QA hierarchies through crowdsourced experiments and report strong empirical results.


Introduction
Q: What is this paper about? A: We present a novel text generation task which converts an input document into a model-generated hierarchy of question-answer (QA) pairs arranged in a top-down tree structure (Figure 1). Questions at higher levels of the tree are broad and open-ended while questions at lower levels ask about more specific factoids. An entire document has multiple root nodes ("key ideas") that unfold into a forest of question trees. While readers are initially shown only the root nodes of the question trees, they can "browse" the document by clicking on root nodes of interest to reveal more fine-grained related information. We call our task SQUASH (Specificity-controlled Question Answer Hierarchies).

Massive Attack (band)
On 21 January 2016, the iPhone application "Fantom" was released. The application was developed by a team including Massive Attack's Robert Del Naja and let users hear parts of four new songs by remixing them in real time, using the phone's location, movement, clock, heartbeat, and camera. On 28 January 2016, Massive Attack released a new EP, Ritual Spirit, which includes the four songs released on Fantom.
Figure 1: A subset of the QA hierarchy generated by our SQUASH system that consists of GENERAL and SPECIFIC questions with extractive answers.
Q: Why represent a document with QA pairs? A: Questions and answers (QA) play a critical role in scientific inquiry, information-seeking dialogue and knowledge acquisition (Hintikka, 1981, 1988; Stede and Schlangen, 2004). For example, web users often use QA pairs to manage and share knowledge (Wagner, 2004; Wagner and Bolloju, 2005; Gruber, 2008). Additionally, unstructured lists of "frequently asked questions" (FAQs) are regularly deployed at scale to present information. Industry studies have demonstrated their effectiveness at cutting costs associated with answering customer calls or hiring technical experts (Davenport et al., 1998). Automating the generation of QA pairs can thus be of immense value to companies and web communities.
Q: Why add hierarchical structure to QA pairs? A: While unstructured FAQs are useful, pedagogical applications benefit from additional hierarchical organization. Hakkarainen and Sintonen (2002) show that students learn concepts effectively by first asking general, explanation-seeking questions before drilling down into more specific questions. More generally, hierarchies break up content into smaller, more digestible chunks. User studies demonstrate a strong preference for hierarchies in document summarization (Buyukkokten et al., 2001; Christensen et al., 2014) since they help readers easily identify and explore key topics (Zhang et al., 2017).
Q: How do we build systems for SQUASH? A: We leverage the abundance of reading comprehension QA datasets to train a pipelined system for SQUASH. One major challenge is the lack of labeled hierarchical structure within existing QA datasets; we tackle this issue in Section 2 by using the question taxonomy of Lehnert (1978) to classify questions in these datasets as either GENERAL or SPECIFIC. We then condition a neural question generation system on these two classes, which enables us to generate both types of questions from a paragraph. We filter and structure these outputs using the techniques described in Section 3.
Q: How do we evaluate our SQUASH pipeline? A: Our crowdsourced evaluation (Section 4) focuses on fundamental properties of the generated output such as QA quality, relevance, and hierarchical correctness. Our work is a first step towards integrating QA generation into document understanding; as such, we do not directly evaluate how useful SQUASH output is for downstream pedagogical applications. Instead, a detailed qualitative analysis (Section 5) identifies challenges that need to be addressed before SQUASH can be deployed to real users.
Q: What are our main contributions?
A1: A method to classify questions according to their specificity based on Lehnert (1978).
A2: A model controlling specificity of generated questions, unlike prior work on QA generation.
A3: A novel text generation task (SQUASH), which converts documents into specificity-based hierarchies of QA pairs.
A4: A pipelined system to tackle SQUASH along with crowdsourced methods to evaluate it.
Q: How can the community build on this work? A: We have released our codebase, dataset and a live demonstration of our system at http://squash.cs.umass.edu/. Additionally, we outline guidelines for future work in Section 7.

Obtaining training data for SQUASH
The proliferation of reading comprehension datasets like SQuAD (Rajpurkar et al., 2016, 2018) has enabled state-of-the-art neural question generation systems (Kim et al., 2018). However, these systems are trained for individual question generation, while the goal of SQUASH is to produce a general-to-specific hierarchy of QA pairs. Recently released conversational QA datasets like QuAC and CoQA (Reddy et al., 2018) contain a sequential arrangement of QA pairs, but question specificity is not explicitly marked. Motivated by the lack of hierarchical QA datasets, we automatically classify questions in SQuAD, QuAC and CoQA according to their specificity using a combination of rule-based and automatic approaches.

Rules for specificity classification
What makes one question more specific than another? Our scheme for classifying question specificity maps each of the 13 conceptual question categories defined by Lehnert (1978) to three coarser labels: GENERAL, SPECIFIC, or YES-NO. As a result of this mapping, SPECIFIC questions usually ask for low-level information (e.g., entities or numerics), while GENERAL questions ask for broader overviews (e.g., "what happened in 1999?") or causal information (e.g., "why did..."). Many question categories can be reliably identified using simple templates and rules; a complete list is provided in Table 1.
Templates and rules cover only about half of the questions (Table A1); for the remaining half, we resort to a data-driven approach. First, we manually label 1000 questions in QuAC 5 using our specificity labels. This annotated data is then fed to a single-layer CNN binary classifier (Kim, 2014) using ELMo contextualized embeddings. 6 On an 85%-15% train-validation split, we achieve a high classification accuracy of 91%. The classifier also transfers to other datasets: on 100 manually labeled CoQA questions, we achieve a classification accuracy of 80%. To obtain our final dataset (Table 2), we run our rule-based approach on all questions in SQuAD 2.0, QuAC, and CoQA and apply our classifier to label questions that were not covered by the rules. We further evaluate the specificity of the questions generated by our final system using a crowdsourced study in Section 4.3.
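The rule-first, classifier-second labeling scheme can be illustrated with a minimal sketch. The patterns below are illustrative assumptions only and are not the templates listed in Table 1; `classify_with_rules`, `label_question` and the `fallback` argument (which stands in for the trained CNN classifier) are hypothetical names.

```python
import re
from typing import Callable, Optional

# Illustrative patterns only: these are NOT the templates from Table 1.
RULES = [
    (re.compile(r"^(is|are|was|were|did|does|do|has|have|had|can|could)\b"), "YES-NO"),
    (re.compile(r"^(why|how did|how do|what happened)\b"), "GENERAL"),
    (re.compile(r"^(who|when|where|which|how many|how much|what year)\b"), "SPECIFIC"),
]


def classify_with_rules(question: str) -> Optional[str]:
    """Return a specificity label if some rule fires, else None."""
    q = question.strip().lower()
    for pattern, label in RULES:
        if pattern.search(q):
            return label
    return None


def label_question(question: str, fallback: Callable[[str], str]) -> str:
    """Apply the rules first; defer to a learned classifier when no rule fires."""
    label = classify_with_rules(question)
    return label if label is not None else fallback(question)
```

Questions such as "What is Syco?" match none of these toy patterns and would therefore be routed to the fallback classifier.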

A pipeline for SQUASHing documents
To SQUASH documents, we build a pipelined system (Figure 2) that takes a single paragraph as input and produces a hierarchy of QA pairs as output; for multi-paragraph documents, we SQUASH each paragraph independently of the rest. At a high level, the pipeline consists of five steps: (1) answer span selection, (2) question generation conditioned on answer spans and specificity labels, (3) extractively answering generated questions, (4) filtering out bad QA pairs, and (5) structuring the remaining pairs into a GENERAL-to-SPECIFIC hierarchy. The remainder of this section describes each step in more detail and afterwards explains how we leverage pretrained language models to improve individual components of the pipeline.
5 We use QuAC because its design encourages a higher percentage of GENERAL questions than other datasets, as the question-asker was unable to read the document to formulate more specific questions.
6 Implemented in AllenNLP.

Answer span selection
Our pipeline begins by selecting an answer span from which to generate a question. To train the system, we can use ground-truth answer spans from our labeled datasets, but at test time how do we select answer spans? Our solution is to consider all individual sentences in the input paragraph as potential answer spans (to generate GENERAL and SPECIFIC questions), along with all entities and numerics (for just SPECIFIC questions). We did not use data-driven sequence tagging approaches like previous work (Du and Cardie, 2017, 2018), since our preliminary experiments with such approaches yielded poor results on QuAC. More details are provided in Appendix C.
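A minimal sketch of this test-time span selection follows. It assumes a naive regex sentence splitter and regex-based numeric detection, and it expects named entities to be supplied by the caller (for example from an external NER tagger); none of these choices are taken from the paper, and all function names are hypothetical.

```python
import re
from typing import List, Sequence, Tuple


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def find_numerics(paragraph: str) -> List[str]:
    """Find digit sequences, allowing internal '.' or ',' separators."""
    return re.findall(r"\b\d+(?:[.,]\d+)*\b", paragraph)


def candidate_spans(paragraph: str, entities: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair every candidate answer span with the specificity label to generate for it."""
    spans = []
    for sentence in split_sentences(paragraph):
        # Sentences seed both GENERAL and SPECIFIC questions.
        spans.append((sentence, "GENERAL"))
        spans.append((sentence, "SPECIFIC"))
    for item in list(entities) + find_numerics(paragraph):
        # Entities and numerics seed SPECIFIC questions only.
        spans.append((item, "SPECIFIC"))
    return spans
```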

Conditional question generation
Given a paragraph, answer span, and desired specificity label, we train a neural encoder-decoder model on all three reading comprehension datasets (SQuAD, QuAC and CoQA) to generate an appropriate question.
Figure 2: An overview of the process by which we generate a pair of GENERAL-SPECIFIC questions, which consists of feeding input data ("RC" is Reading Comprehension) through various modules, including a question classifier and a multi-stage pipeline for question generation, answering, and filtering.
Data preprocessing: At training time, we use the ground-truth answer spans from these datasets as input to the question generator. To improve the quality of SPECIFIC questions generated from sentence spans, we use the extractive evidence spans for CoQA instances (Reddy et al., 2018) instead of the shorter, partially abstractive answer spans (Yatskar, 2019). In all datasets, we remove unanswerable questions and questions whose answers span multiple paragraphs. A few very generic questions (e.g., "what happened in this article?") were manually identified and removed from the training dataset. Some other questions (e.g., "where was he born?") are duplicated many times in the dataset; we downsample such questions to a maximum limit of 10. Finally, we preprocess both paragraphs and questions using byte-pair encoding (Sennrich et al., 2016).
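The duplicate downsampling step could look like the sketch below. It assumes training examples are dictionaries with a "question" field, that duplicates are detected after lowercasing and trimming, and that the first ten occurrences are the ones kept; the text does not specify any of these details.

```python
from collections import defaultdict
from typing import Dict, List


def downsample_duplicates(examples: List[Dict[str, str]], max_copies: int = 10) -> List[Dict[str, str]]:
    """Keep at most `max_copies` examples per distinct question string."""
    seen = defaultdict(int)
    kept = []
    for example in examples:
        # Assumption: case and surrounding whitespace are ignored when matching.
        key = example["question"].strip().lower()
        if seen[key] < max_copies:
            seen[key] += 1
            kept.append(example)
    return kept
```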
Architecture details: We use a two-layer biLSTM encoder and a single-layer LSTM (Hochreiter and Schmidhuber, 1997) decoder with soft attention (Bahdanau et al., 2015) to generate questions, similar to prior work. Our architecture is augmented with a copy mechanism (See et al., 2017) over the encoded paragraph representations. Answer spans are marked with <SOA> and <EOA> tokens in the paragraph, and representations for tokens within the answer span are attended to by a separate attention head. We condition the decoder on the specificity class (GENERAL, SPECIFIC and YES-NO) by concatenating an embedding for the ground-truth class to the input of each time step. We implement models in PyTorch v0.4 (Paszke et al., 2017), and the best-performing model achieves a perplexity of 11.1 on the validation set. Other hyperparameter details are provided in Appendix B.
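The two input-side mechanisms, answer-span markers and class conditioning, can be sketched without any deep learning library. The function names are hypothetical, plain Python lists stand in for tensors, and the half-open `[start, end)` indexing convention is an assumption.

```python
from typing import List, Sequence


def mark_answer_span(tokens: List[str], start: int, end: int) -> List[str]:
    """Wrap tokens[start:end] (end exclusive) in <SOA> ... <EOA> markers."""
    return tokens[:start] + ["<SOA>"] + tokens[start:end] + ["<EOA>"] + tokens[end:]


def condition_on_class(
    step_inputs: Sequence[Sequence[float]], class_embedding: Sequence[float]
) -> List[List[float]]:
    """Concatenate the same class embedding to the decoder input at every time step."""
    return [list(vec) + list(class_embedding) for vec in step_inputs]
```

Each decoder input grows by the size of the class embedding, while the class vector itself stays constant across time steps.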
Test time usage: At test time, the question generation module is supplied with answer spans and class labels as described in Section 3.1. To promote diversity, we over-generate prospective candidates (Heilman and Smith, 2010) for every answer span and later prune them. Specifically, we use beam search with a beam size of 3 to generate three highly probable question candidates. As these candidates are often generic, we additionally use top-k random sampling (Fan et al., 2018), a recently proposed diversity-promoting decoding algorithm, with k = 10 to generate ten more question candidates per answer span. Hence, for every answer span we generate 13 question candidates. We discuss issues with using just standard beam search for question generation in Section 5.1.
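A single decoding step of top-k random sampling can be sketched as follows: keep the k most probable tokens, renormalize, and sample from them. This is a generic illustration of the technique rather than the authors' implementation; `top_k_sample` is a hypothetical name and the next-token distribution is passed in as a plain list.

```python
import random
from typing import Optional, Sequence


def top_k_sample(probs: Sequence[float], k: int = 10, rng: Optional[random.Random] = None) -> int:
    """Sample a token index from the k most probable entries of `probs`."""
    rng = rng or random.Random()
    # Indices of the k largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # random.choices renormalizes the weights implicitly.
    return rng.choices(top, weights=weights, k=1)[0]
```

Repeating this step until an end-of-sequence token is drawn yields one sampled question; running the whole procedure ten times, next to the three beam search outputs, gives the 13 candidates per span.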

Answering generated questions
While we condition our question generation model on pre-selected answer spans, the generated questions may not always correspond to these input spans. Sometimes, the generated questions are either unanswerable or answered by a different span in the paragraph. By running a pretrained QA model over the generated questions, we can detect questions whose answers do not match their original input spans and filter them out. The predicted answer for many questions has partial overlap with the original answer span; in these cases, we display the predicted answer span during evaluation, as a qualitative inspection shows that the predicted answer is more often closer to the correct answer than the original span. For all of our experiments, we use the AllenNLP implementation of the BiDAF++ question answering model trained on QuAC with no dialog context.

Question filtering
After over-generating candidate questions from a single answer span, we use simple heuristics to filter out low-quality QA pairs. We remove generic and duplicate question candidates and pass the remaining QA pairs through the multistage question filtering process described below.
Irrelevant or repeated entities: Top-k random sampling often generates irrelevant questions; we reduce their incidence by removing any candidates that contain nouns or entities unspecified in the passage. As with other neural text generation systems (Holtzman et al., 2018), we commonly observe repetition in the generated questions and deal with this phenomenon by removing candidates with repeated nouns or entities.
Unanswerable or low answer overlap: We remove all candidates marked as "unanswerable" by the question answering model, which prunes 39.3% of non-duplicate question candidates. These candidates are generally grammatically correct but considered irrelevant to the original paragraph by the question answering model. Next, we compute the overlap between original and predicted answer span by computing word-level precision and recall (Rajpurkar et al., 2016). For GENERAL questions generated from sentence spans, we attempt to maximize recall by setting a minimum recall threshold of 0.3. Similarly, we maximize recall for SPECIFIC questions generated from named entities with a minimum recall constraint of 0.8. Finally, for SPECIFIC questions generated from sentence spans, we set a minimum precision threshold of 1.0, which filters out questions whose answers are not completely present in the ground-truth sentence.
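The overlap filter can be sketched as below, using the thresholds stated above. The sketch assumes lowercased whitespace tokenization (the SQuAD evaluation script additionally strips punctuation and articles), treats the original input span as the reference, and uses hypothetical names throughout.

```python
from collections import Counter
from typing import Tuple


def precision_recall(predicted: str, original: str) -> Tuple[float, float]:
    """Word-level precision and recall of the predicted span against the original span."""
    pred_tokens = predicted.lower().split()
    orig_tokens = original.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    common = sum((Counter(pred_tokens) & Counter(orig_tokens)).values())
    precision = common / len(pred_tokens) if pred_tokens else 0.0
    recall = common / len(orig_tokens) if orig_tokens else 0.0
    return precision, recall


def passes_overlap_filter(predicted: str, original: str, specificity: str, span_type: str) -> bool:
    """Apply the per-category thresholds: 0.3 recall, 0.8 recall, or 1.0 precision."""
    precision, recall = precision_recall(predicted, original)
    if specificity == "GENERAL" and span_type == "sentence":
        return recall >= 0.3
    if specificity == "SPECIFIC" and span_type == "entity":
        return recall >= 0.8
    if specificity == "SPECIFIC" and span_type == "sentence":
        return precision >= 1.0
    return False  # Combination not described in the text.
```

A precision of 1.0 means every predicted token appears in the input sentence, which is why that threshold rejects answers that stray outside the sentence.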
Low generation probability: If multiple candidates remain after applying the above filtering criteria, we select the most probable candidate for each answer span. SPECIFIC questions generated from sentences are an exception to this rule: for these questions, we select the ten most probable candidates, as there might be multiple question-worthy bits of information in a single sentence. If no candidates remain, in some cases we use a fallback mechanism that sequentially ignores filters to retain more candidates.
Figure 3 (input paragraph): Subsequently, Yoda battles Palpatine in a lightsaber duel that wrecks the Senate Rotunda. In the end, neither is able to overcome the other and Yoda is forced to retreat. He goes into exile on Dagobah so that he may hide from the Empire and wait for another opportunity to destroy the Sith. At the end of the film, it was revealed that Yoda has been in contact with Qui-Gon's spirit, learning the secret of immortality from him and passing it on to Obi-Wan.

Forming a QA hierarchy
The output of the filtering module is an unstructured list of GENERAL and SPECIFIC QA pairs generated from a single paragraph. Figure 3 shows how we group these questions into a meaningful hierarchy. First, we choose a parent for each SPECIFIC question by maximizing the overlap (word-level precision) of its predicted answer with the predicted answer for every GENERAL question. If a SPECIFIC question's answer does not overlap with any GENERAL question's answer (e.g., "Dagobah" and "destroy the Sith") we map it to the closest GENERAL question whose answer occurs before the SPECIFIC question's answer ("What happened in the battle ...?").
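The grouping procedure can be sketched as follows. The sketch assumes each QA pair is a dictionary with "question", "answer" and "start" fields (the character offset of the answer in the paragraph), that precision is measured as the fraction of SPECIFIC answer tokens found in the GENERAL answer, and that at least one GENERAL question exists. When no GENERAL answer precedes a non-overlapping SPECIFIC answer, the sketch falls back to the first GENERAL question, a case the text does not cover.

```python
from collections import Counter
from typing import Dict, List


def word_precision(specific_answer: str, general_answer: str) -> float:
    """Fraction of SPECIFIC answer tokens that also occur in the GENERAL answer."""
    spec = specific_answer.lower().split()
    gen = general_answer.lower().split()
    if not spec:
        return 0.0
    return sum((Counter(spec) & Counter(gen)).values()) / len(spec)


def assign_parents(general: List[dict], specific: List[dict]) -> Dict[str, List[str]]:
    """Map each GENERAL question to the SPECIFIC questions grouped under it."""
    hierarchy = {g["question"]: [] for g in general}
    for s in specific:
        scores = [word_precision(s["answer"], g["answer"]) for g in general]
        best = max(range(len(general)), key=lambda i: scores[i])
        if scores[best] == 0.0:
            # No overlap: pick the closest GENERAL answer that starts earlier.
            earlier = [i for i, g in enumerate(general) if g["start"] <= s["start"]]
            if earlier:
                best = max(earlier, key=lambda i: general[i]["start"])
            # Otherwise `best` stays at index 0 (assumption, not from the paper).
        hierarchy[general[best]["question"]].append(s["question"])
    return hierarchy
```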

Leveraging pretrained language models
Recently, pretrained language models based on the Transformer architecture (Vaswani et al., 2017) have significantly boosted question answering performance (Devlin et al., 2019) as well as the quality of conditional text generation (Wolf et al., 2019). Motivated by these results, we modify components of the pipeline to incorporate language model pretraining for our demo. Specifically, our demo's question answering module is the BERT-based model in Devlin et al. (2019), and the question generation module is trained by fine-tuning the publicly available GPT2-small model (Radford et al., 2019). Please refer to Appendix D for more details. These modifications produce better results qualitatively and speed up the SQUASH pipeline since question overgeneration is no longer needed. Note that the figures and results in Section 4 use the original components described above.
Table 3: Human evaluations demonstrate the high individual QA quality of our pipeline's outputs. All interannotator agreement scores (Fleiss κ) show "fair" to "substantial" agreement (Landis and Koch, 1977).

Evaluation
We evaluate our SQUASH pipeline on documents from the QuAC development set using a variety of crowdsourced 13 experiments. Concretely, we evaluate the quality and relevance of individual questions, the relationship between generated questions and predicted answers, and the structural properties of the QA hierarchy. We emphasize that our experiments examine only the quality of a SQUASHed document, not its actual usefulness to downstream users. Evaluating usefulness (e.g., measuring if SQUASH is more helpful than the input document) requires systematic and targeted human studies (Buyukkokten et al., 2001) that are beyond the scope of this work.

Individual question quality and relevance
Our first evaluation measures whether questions generated by our system are well-formed (i.e., grammatical and pragmatic). We ask crowd workers whether or not a given question is both grammatical and meaningful. 14 For this evaluation, we acquire judgments for 200 generated QA pairs and 100 gold QA pairs 15 from the QuAC validation set (with an equal split between GENERAL and SPECIFIC questions). The first row of Table 3 shows that 85.8% of generated questions satisfy this criterion with a high agreement across workers.
Question relevance: How many generated questions are actually relevant to the input paragraph? While the percentage of unanswerable questions that were generated offers some insight into this question, we removed all of them during the filtering pipeline (Section 3.4). Hence, we display an input paragraph and generated question to crowd workers (using the same data as the previous well-formedness evaluation) and ask whether or not the paragraph contains the answer to the question. The second row of Table 3 shows that 78.7% of our questions are relevant to the paragraph, compared to 83.3% of gold questions.
13 All our crowdsourced experiments were conducted on the Figure Eight platform with three annotators per example (scores calculated by counting examples with two or more correct judgments). We hired annotators from predominantly English-speaking countries with a rating of at least Level 2, and we paid them between 3 and 4 cents per judgment.
14 As "meaningful" is potentially a confusing term for crowd workers, we ran another experiment asking only for grammatical correctness and achieved very similar results.
15 Results on this experiment were computed after removing 3 duplicate generated questions and 10 duplicate gold questions.
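The scoring rule used for the crowdsourced experiments (three annotators per example, with an example counted as correct when two or more judgments are positive) amounts to a majority vote. A minimal sketch with hypothetical function names and boolean judgments:

```python
from typing import Sequence


def majority_correct(judgments: Sequence[bool], min_votes: int = 2) -> bool:
    """An example counts as correct when at least `min_votes` annotators say so."""
    return sum(1 for j in judgments if j) >= min_votes


def score(examples: Sequence[Sequence[bool]]) -> float:
    """Fraction of examples judged correct by the majority rule."""
    if not examples:
        return 0.0
    return sum(majority_correct(j) for j in examples) / len(examples)
```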

Individual answer validity
Is the predicted answer actually a valid answer to the generated question? In our filtering process, we automatically measured answer overlap between the input answer span and the predicted answer span and used the results to remove low-overlap QA pairs. To evaluate answer recall after filtering, we perform a crowdsourced evaluation on the same 300 QA pairs as above by asking crowdworkers whether or not a predicted answer span contains the answer to the question. We also experiment with a more relaxed variant (partially contains instead of completely contains) and report results for both task designs in the third and fourth rows of Table 3. Over 85% of predicted spans partially contain the answer to the generated question, and this number increases if we consider only questions that were previously labeled as well-formed and relevant. The lower gold performance is due to the contextual nature of the gold QA pairs in QuAC, which causes some questions to be meaningless in isolation (e.g., "What did she do next?" has unresolvable coreferences).

Cowell formed a new company Syco, which is divided into three units - Syco Music, Syco TV and Syco Film. Cowell returned to music with his latest brainchild signed to Syco ...
What is Syco?
How many units does Syco have?

Returning home to Brantford after six months abroad, Bell continued experiments with his "harmonic telegraph". The basic concept behind his device was that messages could ...
What was Bell's telegraph?
Where did he take his experiments?

After five years, however, Limon would return to Broadway to star as a featured dancer in Keep Off the Grass under the choreographer George Balanchine.
Why did he return to Broadway?
Who did he work with?

Tan Dun earned widespread attention after composing the score for Ang Lee's Crouching Tiger, Hidden Dragon (2000), for which he won an Academy Award, a Grammy Award ....
How was Tan Dun received?
What award did he win?

From 1969 to 1971, Cash starred in his own television show, The Johnny Cash Show, on the ABC network. The show was performed at the Ryman Auditorium in Nashville. ...
What did he do in 1969?
What network was he in?

Figure 4: SQUASH question hierarchies generated by our system with reference snippets. Questions in the hierarchy are of the correct specificity class (i.e., GENERAL, SPECIFIC).
Table 4: Human evaluation of the structural correctness of our system. The labels "different / same paragraph" refer to the location of the intruder question. The results show the accuracy of specificity and hierarchies.

Structural correctness
To examine the hierarchical structure of SQUASHed documents, we conduct three experiments.
How faithful are output questions to input specificity? First, we investigate whether our model is actually generating questions with the correct specificity label. We run our specificity classifier (Section 2) over 400 randomly sampled questions (50% GENERAL, 50% SPECIFIC) and obtain a high classification accuracy of 91%. 16 This automatic evaluation suggests the model is capable of generating different types of questions.
Are GENERAL questions more representative of a paragraph than SPECIFIC questions? To see if GENERAL questions really do provide more high-level information, we sample 200 GENERAL-SPECIFIC question pairs 17 grouped together as described in Section 3.5. For each pair of questions (without showing answers), we ask crowd workers to choose the question which, if answered, would give them more information about the paragraph. As shown in Table 4, in 89.5% of instances the GENERAL question is preferred over the SPECIFIC one, which confirms the strength of our specificity-controlled question generation system. 18
How related are SPECIFIC questions to their parent GENERAL question? Finally, we investigate the effectiveness of our question grouping strategy, which bins multiple SPECIFIC QA pairs under a single GENERAL QA pair. We show crowd workers a reference GENERAL QA pair and ask them to choose the most related SPECIFIC question given two choices, one of which is the system's output and the other an intruder question. We randomly select intruder SPECIFIC questions from either a different paragraph within the same document or a different group within the same paragraph. As shown in Table 4, crowd workers prefer the system's generated SPECIFIC question at a rate higher than random chance (50%) regardless of where the intruder comes from. As expected, the preference and agreement are higher when intruder questions come from different paragraphs, since groups within the same paragraph often contain related information (Section 5.2).
16 Accuracy computed after removing 19 duplicates.
17 We avoid gold-standard control experiments for structural correctness tests since questions in the QuAC dataset were not generated with a hierarchical structure in mind. Pilot studies using our question grouping module on gold data led to sparse hierarchical structures which were not favored by our crowd workers.
18 We also ran a pilot study asking workers "Which question has a longer answer?" and observed a higher preference of 98.6% for GENERAL questions.

Qualitative Analysis
In this section we analyze outputs (Figure 4, Figure 5) of our pipeline and identify its strengths and weaknesses. We additionally provide more examples in the appendix (Figure A1).

B (beam search):
In what year did the US army take place?
In what year did the US army take over?
In what year did the US army take place in the US?
T (top-k sampling):
What year was he enlisted?
When did he go to war?
When did he play as anti aircraft?

What is our pipeline good at?
Meaningful hierarchies: Our method of grouping the generated questions (Section 3.5) produces hierarchies that clearly distinguish between GENERAL and SPECIFIC questions; Figure 4 contains some hierarchies that support the positive results of our crowdsourced evaluation.
Top-k sampling: Similar to prior work (Fan et al., 2018; Holtzman et al., 2019), we notice that beam search often produces generic or repetitive beams (Table 5). Even though the top-k scheme always produces less probable questions than beam search, our filtering system prefers a top-k question 49.5% of the time.

What kind of mistakes does it make?
We describe the various types of errors our model makes in this section, using the Paul Weston SQUASH output in Figure 5 as a running example. Additionally, we list some modeling approaches we tried that did not work in Appendix C.
Reliance on a flawed answering system: Our pipeline's output is tied to the quality of the pretrained answering module, which both filters out questions and produces final answers. QuAC has long answer spans that cause low-precision predictions with extra information (e.g., "Who was born in Springfield?"). Additionally, the answering module occasionally swaps two named entities present in the paragraph.
Redundant information and lack of discourse: In our system, each QA pair is generated independently of all the others. Hence, our outputs lack an inter-question discourse structure. Our system often produces a pair of redundant SPECIFIC questions where the text of one question answers the other (e.g., "Who was born in Springfield?" vs. "Where was Weston born?"). These errors can likely be corrected by conditioning the generation module on previously produced questions (or additional filtering); we leave this to future work.
Lack of world knowledge: Our models lack commonsense knowledge ("How old was Weston when he was born?") and can misinterpret polysemous words. Integrating pretrained contextualized embeddings into our pipeline is one potential solution.
Multiple GENERAL QA per paragraph: Our system often produces more than one tree per paragraph, which is undesirable for short, focused paragraphs with a single topic sentence. To improve the user experience, it might be ideal to restrict the number of GENERAL questions we show per paragraph. While we found it difficult to generate GENERAL questions representative of entire paragraphs (Appendix C), a potential solution could involve identifying and generating questions from topic sentences.
Coreferences in GENERAL questions: Many generated GENERAL questions contain coreferences due to the contextual nature of the QuAC and CoQA training data ("How did he get into music?"). Potential solutions could involve either constrained decoding to avoid beams with anaphoric expressions or using the CorefNQG model of Du and Cardie (2018).

Which models did not work?
We present modelling approaches which did not work in Appendix C. These include i) end-to-end modelling to generate sequences of questions using QuAC, ii) a span selection NER system, iii) generation of GENERAL questions representative of entire paragraphs, and iv) an answering system trained on the combination of QuAC, CoQA and SQuAD.

Related Work
Our work on SQUASH is related to research in three broad areas: question generation, information retrieval and summarization.
Question Generation: Our work builds upon neural question generation systems (Du and Cardie, 2018). Our work conditions generation on specificity, similar to difficulty-conditioned question generation (Gao et al., 2018). QA pair generation has previously been used for dataset creation (Serban et al., 2016; Du and Cardie, 2018). Joint modeling of question generation and answering has improved the performance of individual components (Tang et al., 2017; Wang et al., 2017; Sachan and Xing, 2018) and enabled visual dialog generation (Jain et al., 2018).
Information Retrieval: Our hierarchies are related to the interactive retrieval setting (Hardtke et al., 2009; Brandt et al., 2011) where similar webpages are grouped together. SQUASH is also related to exploratory (Marchionini, 2006) and faceted search (Yee et al., 2003).

Future Work
While Section 5.2 focused on shortcomings in our modeling process and steps to fix them, this section focuses on broader guidelines for future work involving the SQUASH format and its associated text generation task.
Evaluation of the SQUASH format: As discussed in Section 1, previous research shows support for the usefulness of hierarchies and QA in pedagogical applications. We did not directly evaluate this claim in the context of SQUASH, focusing instead on evaluating the quality of QA pairs and their hierarchies. Moving forward, careful user studies are needed to evaluate the efficacy of the SQUASH format in pedagogical applications, which might be heavily domain-dependent; for example, a QA hierarchy for a research paper is likely to be more useful to an end user than a QA hierarchy for an online blog. An important caveat is the imperfection of modern text generation systems, which might cause users to prefer the original human-written document over a generated SQUASH output. One possible solution is a three-way comparison between the original document, a human-written SQUASHed document, and a system-generated output. For fair comparison, care should be taken to prevent experimenter bias while crowdsourcing QA hierarchies (e.g., by maintaining similar text complexity in the two human-written formats).
Collection of a SQUASH dataset: Besides measuring the usefulness of the QA hierarchies, a large dedicated dataset can help to facilitate end-to-end modeling. While asking human annotators to write full SQUASHed documents will be expensive, a more practical option is to ask them to pair GENERAL and SPECIFIC questions in our dataset to form meaningful hierarchies and write extra questions whenever no such pair exists.

QA budget and deeper specificity hierarchies: In our work, we generate questions for every sentence and filter bad questions with fixed thresholds. An alternative formulation is an adaptive model dependent on a user-specified QA budget, akin to "target length" in summarization systems, which would allow end users to balance coverage and brevity themselves. A related modification is increasing the depth of the hierarchies. While two-level QA trees are likely sufficient for documents structured into short and focused paragraphs, deeper hierarchies can be useful for long unstructured chunks of text. Users can control this property via a "maximum children per QA node" hyperparameter, which along with the QA budget will determine the final depth of the hierarchy.
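To make the interaction between the QA budget and the "maximum children per QA node" hyperparameter concrete, the following is a minimal sketch under simplifying assumptions that are ours and not part of the system: a single question tree, with levels filled completely from the root downwards. The function name `min_depth` is hypothetical.

```python
def min_depth(qa_budget: int, max_children: int) -> int:
    """Smallest number of levels (root included) that one QA tree needs
    to hold `qa_budget` QA pairs when each node has at most `max_children`
    children. Assumes levels are filled completely, top to bottom."""
    if qa_budget < 1 or max_children < 1:
        raise ValueError("qa_budget and max_children must be positive")
    depth = 0
    capacity = 0      # QA pairs that fit in the levels counted so far
    level_width = 1   # nodes available at the next level (the root level has 1)
    while capacity < qa_budget:
        capacity += level_width
        level_width *= max_children
        depth += 1
    return depth


if __name__ == "__main__":
    for budget in (4, 5, 13, 14):
        print(budget, min_depth(budget, max_children=3))
```

Under these assumptions, a budget of four QA pairs with at most three children per node fits in a two-level tree, while a fifth QA pair forces a third level.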
We propose SQUASH, a novel text generation task which converts a document into a hierarchy of QA pairs. We present and evaluate a system which leverages existing reading comprehension datasets to attempt solving this task. We believe SQUASH is a challenging text generation task and we hope the community finds it useful to benchmark systems built for document understanding, question generation and question answering. Additionally, we hope that our specificity-labeled reading comprehension dataset is useful in other applications such as 1) finer control over question generation systems used in education applications, curiosity-driven chatbots and healthcare.

Confirming our intuition, Table 2 shows us that QuAC has the highest percentage of GENERAL questions. On the other hand, CoQA and SQuAD, which allowed the question-asker to look at the passage, are dominated by SPECIFIC questions. These findings are consistent with a comparison across the three datasets in Yatskar (2019). Interestingly, the average answer length for SPECIFIC questions in QuAC is 12 tokens, compared to 17 tokens for GENERAL questions. We provide the exact distribution of rule-labeled, hand-labeled and classifier-labeled questions in Table A1.

B Hyperparameters for Question Generation
Our question generation system consists of a two-layer bidirectional LSTM encoder and a unidirectional LSTM decoder. The LSTM hidden unit size in each direction and the token embedding size are each set to 512. The class specificity embedding size is 16. Embeddings are shared between the paragraph encoder and question decoder. All attention computations use a bilinear product (Luong et al., 2015). A dropout of 0.5 is used between LSTM layers. Models are trained using Adam (Kingma and Ba, 2014) with a learning rate of 10^-3, gradient clipping of 5.0 and a minibatch size of 32. Early stopping on validation perplexity is used to choose the best question generation model.
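For convenience, the hyperparameters listed above can be collected in a single configuration object. The sketch below only restates the values from this section; the class name and field names are our own illustrative choices and do not correspond to identifiers in the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionGenerationConfig:
    """Hyperparameters from Appendix B (field names are hypothetical)."""
    encoder_layers: int = 2
    encoder_bidirectional: bool = True
    decoder_bidirectional: bool = False
    lstm_hidden_size: int = 512            # per direction
    token_embedding_size: int = 512
    specificity_embedding_size: int = 16
    share_encoder_decoder_embeddings: bool = True
    attention: str = "bilinear"            # Luong et al. (2015)
    dropout: float = 0.5                   # applied between LSTM layers
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    gradient_clipping: float = 5.0
    minibatch_size: int = 32
    early_stopping_metric: str = "validation_perplexity"


CONFIG = QuestionGenerationConfig()

if __name__ == "__main__":
    print(CONFIG)
```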

C What did not work?
End-to-End Sequential Generation. We experimented with an end-to-end neural model which generated a sequence of questions given a sequence of answer spans. As training data, we leveraged the sequence IDs and follow-up information in the QuAC dataset, without specificity labels. We noticed that during decoding the model rarely attended over the history and often produced questions irrelevant to the context. A potential future direction would involve using the specificity labels for an end-to-end model.
Span Selection NER system. As discussed in Section 3.1, we could frame answer span selection as a sequence labelling problem. We experimented with the NER system in AllenNLP (with ELMo embeddings) on the QuAC dataset, with the ground-truth answer spans marked with BIO tags after overlapping answers were merged together. We recorded low F1 scores of 33.3 and 15.6 on sentence-level and paragraph-level input respectively.
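The preprocessing for the span selection experiment (merging overlapping answers, then marking them with BIO tags) can be sketched as follows. This is an illustrative reconstruction and not the code used in our experiments: the function names are hypothetical, and we assume answers are given as non-empty token-index spans with an exclusive end index.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def merge_overlapping(spans: List[Span]) -> List[Span]:
    """Merge spans that share at least one token; adjacent spans stay separate."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def bio_tags(num_tokens: int, spans: List[Span]) -> List[str]:
    """Tag the first token of each merged span 'B', the rest 'I', others 'O'."""
    tags = ["O"] * num_tokens
    for start, end in merge_overlapping(spans):
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


if __name__ == "__main__":
    # Two overlapping answers over an 8-token sentence collapse into one span.
    print(bio_tags(8, [(1, 4), (3, 6)]))
```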
Paragraph-level question generation. Our question generation model rarely generated GENERAL questions representative of the entire paragraph, even when we fed the entire paragraph as the answer span. We noticed that most GENERAL questions in our dataset were answered by one or two sentences in the paragraph.
Answering system trained on all datasets. Recently, Yatskar (2019)

Table A1: Distribution of scheme adopted to classify questions in different datasets. "CNN" refers to the data-driven classifier. Roughly half the questions were classified using the rules described in Table 1.

Figure A1: Three SQUASH outputs generated by our system, showcasing the strengths and weaknesses described in Section 5.

D Technical Note on Modified System
This technical note describes the modifications made to the originally published system to make it faster and more accurate. We have incorporated language modelling pre-training in our modules using GPT-2 small (Radford et al., 2019) for question generation and BERT (Devlin et al., 2019) for question answering. The official code and demo use the modified version of the system.

D.1 Dataset
The primary modification in the dataset tackles the problem of coreferences in GENERAL questions, as described in Section 5.2. This is a common problem in QuAC and CoQA due to their contextual setup. We pass every question through the spaCy pipeline extension neuralcoref 20, using the paragraph context to resolve coreferences. We have also blacklisted a few more question templates (such as "What happened in <year>?") due to their unusually high prevalence in the dataset.
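As an illustration of template blacklisting, the sketch below drops questions that match the one template named above. The regular expression, the assumption that a year is a four-digit number, and the function name are ours; the full set of blacklisted templates in the released system may differ.

```python
import re
from typing import Iterable, List

# Hypothetical pattern list: only "What happened in <year>?" is named in the
# text, and we assume <year> is a four-digit number.
BLACKLISTED_TEMPLATES = [
    re.compile(r"^what happened in \d{4}\s*\?$", re.IGNORECASE),
]


def drop_blacklisted(questions: Iterable[str]) -> List[str]:
    """Keep only questions that match none of the blacklisted templates."""
    return [
        q for q in questions
        if not any(p.match(q.strip()) for p in BLACKLISTED_TEMPLATES)
    ]


if __name__ == "__main__":
    print(drop_blacklisted([
        "What happened in 1998?",
        "What happened in 1998 to the band?",
        "Why did the band split?",
    ]))
```

Anchoring the pattern at both ends keeps longer questions that merely begin with the template, such as the second example.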

D.2 Question Generation
Our question generation system is now fine-tuned from a pretrained GPT-2 small model (Radford et al., 2019). Our modified system is based on Wolf et al. (2019) and uses their codebase 21 as a starting point. We train our question generation model using the paragraph and answer as language modelling context. For GENERAL questions, our input sequence looks like "<bos> ..paragraph text.. <answer-general> ..answer text.. <question-general> ..question text.. <eos>" and equivalently for SPECIFIC questions. In addition, we leverage GPT-2's segment embeddings to denote the specificity of the answer and question. Each token in the input is assigned one out of five segment embeddings (paragraph, GENERAL answer, SPECIFIC answer, GENERAL question and SPECIFIC question). Finally, answer segment embeddings were used in place of paragraph segment embeddings at the location of the answer in the paragraph to denote the position of the answer in the paragraph. For an illustration, refer to Figure A2.

20 https://github.com/huggingface/neuralcoref/
21 https://github.com/huggingface/transfer-learning-conv-ai

Figure A2: An illustration of the model used for generating a SPECIFIC question. A paragraph (context), answer and question are concatenated and the model is optimized to generate the question. Separate segment embeddings are used for paragraphs, GENERAL answers, GENERAL questions, SPECIFIC answers and SPECIFIC questions. Note that the answer segment embedding is also used within the paragraph segment to denote the location of the answer.

The question generation model now uses top-p nucleus sampling with p = 0.9 (Holtzman et al., 2019) instead of beam search and top-k sampling. Due to improved question generation quality, we no longer need to over-generate questions.
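The input layout and segment assignment described in this section can be sketched as below. This is a simplified illustration, not the released implementation: it works on whitespace tokens rather than GPT-2's byte-pair tokens, the function name is hypothetical, and the segment assigned to each special token (<bos>, the answer and question markers, <eos>) is our assumption, since the text does not specify it.

```python
from typing import List, Tuple


def build_input(
    paragraph: List[str],
    answer_span: Tuple[int, int],  # token indices into paragraph, end exclusive
    question: List[str],
    specificity: str,              # "general" or "specific"
) -> Tuple[List[str], List[str]]:
    """Return (tokens, segments), with one segment label per token."""
    if specificity not in ("general", "specific"):
        raise ValueError("specificity must be 'general' or 'specific'")
    answer_segment = f"answer-{specificity}"
    question_segment = f"question-{specificity}"
    start, end = answer_span
    answer = paragraph[start:end]

    # <bos> + paragraph; assumption: <bos> takes the paragraph segment.
    tokens = ["<bos>"] + paragraph
    segments = ["paragraph"] * len(tokens)
    # Mark the answer's location inside the paragraph with the answer segment.
    for i in range(start, end):
        segments[1 + i] = answer_segment

    # Answer marker + answer text; assumption: the marker takes the answer segment.
    tokens += [f"<{answer_segment}>"] + answer
    segments += [answer_segment] * (1 + len(answer))

    # Question marker + question text + <eos>; assumption: all take the question segment.
    tokens += [f"<{question_segment}>"] + question + ["<eos>"]
    segments += [question_segment] * (2 + len(question))
    return tokens, segments


if __name__ == "__main__":
    paragraph = "Massive Attack released Ritual Spirit in 2016".split()
    question = "What did they release ?".split()
    for token, segment in zip(*build_input(paragraph, (3, 5), question, "specific")):
        print(f"{token:20s} {segment}")
```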

D.3 Question Answering
We have switched to a BERT-based question answering module (Devlin et al., 2019) which is trained on SQuAD 2.0 (Rajpurkar et al., 2018). We have used an open-source PyTorch implementation to train this model 22.

D.4 Question Filtering
We have simplified the question filtering process to incorporate a simple QA budget (described in Section 7). Users are allowed to specify a custom "GENERAL fraction" and "SPECIFIC fraction", which denote the fractions of GENERAL and SPECIFIC questions retained in the final output.
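A minimal sketch of such fraction-based filtering is shown below. The text above does not specify how questions are ranked or how fractional counts are rounded, so both are assumptions here: each question carries a hypothetical quality score (higher is better), and the number retained is rounded up. The function name is also our own.

```python
import math
from typing import List, Tuple

ScoredQuestion = Tuple[str, str, float]  # (question, "GENERAL" | "SPECIFIC", score)


def apply_qa_budget(
    questions: List[ScoredQuestion],
    general_fraction: float,
    specific_fraction: float,
) -> List[ScoredQuestion]:
    """Keep the top-scoring fraction of questions within each specificity class."""
    fractions = {"GENERAL": general_fraction, "SPECIFIC": specific_fraction}
    kept: List[ScoredQuestion] = []
    for label, fraction in fractions.items():
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{label} fraction must lie in [0, 1]")
        group = sorted(
            (q for q in questions if q[1] == label),
            key=lambda q: q[2],
            reverse=True,
        )
        # Assumption: round the retained count up.
        kept.extend(group[: math.ceil(fraction * len(group))])
    return kept


if __name__ == "__main__":
    candidates = [
        ("What is Fantom?", "GENERAL", 0.9),
        ("What happened next?", "GENERAL", 0.2),
        ("When was Fantom released?", "SPECIFIC", 0.8),
        ("Who developed it?", "SPECIFIC", 0.6),
        ("What EP followed?", "SPECIFIC", 0.4),
        ("How many songs?", "SPECIFIC", 0.1),
    ]
    for q in apply_qa_budget(candidates, general_fraction=0.5, specific_fraction=0.5):
        print(q)
```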