Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations

We investigate the problem of generating informative questions in information-asymmetric conversations. Unlike previous work on question generation which largely assumes knowledge of what the answer might be, we are interested in the scenario where the questioner is not given the context from which answers are drawn, but must reason pragmatically about how to acquire new information, given the shared conversation history. We identify two core challenges: (1) formally defining the informativeness of potential questions, and (2) exploring the prohibitively large space of potential questions to find the good candidates. To generate pragmatic questions, we use reinforcement learning to optimize an informativeness metric we propose, combined with a reward function designed to promote more specific questions. We demonstrate that the resulting pragmatic questioner substantially improves the informativeness and specificity of questions generated over a baseline model, as evaluated by our metrics as well as humans.


Introduction
Conversations are a primary means to seek and communicate information between humans, where asking the right question is an important prerequisite for effective exchange of knowledge. Learning to ask questions in conversations can help computer systems not only acquire new knowledge, but also engage human interlocutors by making them feel heard (Huang et al., 2017).
Previous work on question generation often falls into three classes: generating questions according to a discrete schema or end goal (Bordes et al., 2017; Zhang et al., 2018b), transforming the answer statement into a question (Mitkov et al., 2003; Rus et al., 2010; Heilman and Smith, 2010), or generating questions with data-driven systems by conditioning on the context where the answer comes from (Du et al., 2017; Zhou et al., 2017). Despite their successful adaptation to conversations to predict the question that elicits the observed answer (Gao et al., 2019; Pan et al., 2019; Nakanishi et al., 2019), they are not suitable for modeling communication of knowledge in open-domain conversations, because the crucial problem of what to communicate has already been assumed to be addressed by conditioning on the schema of information need or the context that contains the answer.
Figure 1: Asking questions in a conversation to acquire information. In this communication setting, the question asker has access to the background and topic, but no access to the private textual knowledge that contains the answer. In this example, the baseline non-pragmatic question generator (BL) generates an uninformative question (one that has already been answered), while our pragmatic system (Ours) and humans (Ref) actively seek new information.
We instead study the problem of question generation in a more realistic setting, i.e., in open-domain information-seeking conversations where the question asker cannot access the answering context. This is an important step towards practical natural language processing (NLP) systems that can reason about the state of mind of agents they interact with purely through natural language interactions, so that they can generate more helpful responses. In this paper, we build a question generator that reasons pragmatically about what information the answerer can provide, and generates questions to gather new information in a conversation (see Figure 1 for an example).
We identify several key challenges in this task: (1) generating informative questions without access to potential answers; (2) evaluating generated questions beyond comparing them to the reference question, because multiple questions can reveal unseen information despite being very different to each other; (3) navigating a large search space of potential questions to improve informativeness by reasoning about the other agent's knowledge, which is more complex than limited reference games in previous work on computational pragmatics.
To address these issues, we first develop a baseline question generation model that generates questions in a conversation without conditioning on the unseen knowledge. We then propose automatic metrics to quantify how much new information questions reveal, as well as how specific they are to the conversation. Next, we use reinforcement learning to optimize our question generator on these metrics. In our experiments on the QuAC dataset, we show that the proposed method substantially improves the specificity and informativeness of the generated questions as evaluated by our automatic metrics. These results are corroborated by blinded human evaluation, where questions generated by our system are also of higher overall quality than those by the baseline system as judged by humans. To recap, our main contributions are:
• To the best of our knowledge, our work represents the first attempt at studying question generation to seek information in open-domain communication, which involves challenging NLP problems, e.g., evaluation of open-ended language generation and pragmatic reasoning;
• To address these problems, we propose automatic metrics to quantify the informativeness and specificity of questions, which are essential for efficient iterative system development;
• We show that optimizing the proposed metrics via reinforcement learning leads to a system that behaves pragmatically and has improved communication efficiency, as also verified by human evaluation. This represents a practical method for pragmatic reasoning in an open-domain communication setting.

Related Work
Question Generation. Question generation has long been studied in the education and psychology communities as a means to assess and promote reading comprehension in humans (Davey and McBride, 1986). In natural language processing, question generation has been explored to improve the systems in various natural language processing tasks, e.g., the quality of question answering systems (Duan et al., 2017) as well as information retrieval in an open-domain question answering system (Nogueira et al., 2019). Some of the first question generation systems are rule-based (Mitkov et al., 2003; Rus et al., 2010; Heilman and Smith, 2010), while large-scale question answering datasets, e.g., SQuAD (Rajpurkar et al., 2016, 2018), have recently kindled research interest in data-driven approaches. Du et al. (2017) and Zhou et al. (2017) apply sequence-to-sequence (seq2seq) models to generate SQuAD questions from Wikipedia sentences containing the answers.
The release of large conversational question answering datasets such as QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2019) has enabled recent work (Gao et al., 2019; Pan et al., 2019; Nakanishi et al., 2019) to extend previous neural seq2seq question generators by conditioning them on the conversation history and the context that contains the answer, while Scialom and Staiano (2019) remove answers to the reference question to generate curiosity-driven questions from the rest of the context.
Despite their success, most existing approaches to question generation are limited to either reading comprehension settings where potential answers are known a priori, or goal-oriented settings where the schema of knowledge is limited (Bordes et al., 2017; Zhang et al., 2018b). This prevents them from being applied to an open-domain communication setting, where the purpose of questions is to acquire information that is unknown ahead of time.
Evaluating System-generated Questions. Automatic evaluation of system-generated text has long been an important topic in NLP. Traditional n-gram overlap-based approaches (Papineni et al., 2002; Lin, 2004) are computationally efficient, but have been shown to correlate poorly with human judgement of quality (Novikova et al., 2017). More recently, Zhang et al. (2020) leverage large pretrained language models (BERT, Devlin et al., 2019) to relax the limitation of exact n-gram overlap. Hashimoto et al. (2019) combine human judgement with system-reported likelihood of generated text to make population-level estimates of quality and diversity. However, most existing metrics either evaluate generated text against very few references, or provide only relative ranking for multiple systems at a population level rather than reliable feedback for each example. This renders them inapplicable to generating informative questions in a conversation, where multiple questions can be equally informative and relevant in a given scenario, and per-example feedback is necessary.
Pragmatic Reasoning for Informativeness. Pragmatic reasoning is tightly related to informativeness and efficiency in communication. Starting from the cooperative maxims for conversational pragmatic reasoning (Grice, 1975), Frank and Goodman (2012) developed a computational framework that has been applied to reference games with images (Andreas and Klein, 2016) and colors (Monroe et al., 2017), as well as generating descriptions for images (Cohn-Gordon et al., 2019). Decision-theoretic principles (van Rooy, 2003) have also been applied to quantify the informativeness of community questions (Rao and Daumé III, 2018). These approaches usually assume that either the list of referents (images, colors, or answers) or the space of utterances (descriptions or questions) is enumerable or can be directly sampled from, or both. More crucially, the speaker agent usually has complete access to this information to readily gauge the effect of different utterances. We instead study a more realistic information-seeking setting, where the questioner cannot access the answers, let alone aggregate them for pragmatic reasoning, and where these simplifying assumptions will not hold.

Method
In this section, we outline the setup for the communication problem we set out to address, present a baseline system, and lay out our approach to extending it to reason pragmatically to acquire information more efficiently.

Problem Setup
We consider a communication game between two agents, a teacher and a student (see Figure 1 for an example). The two agents share a common topic of discussion T (Background and Topic in the figure), as well as a common goal for the student to acquire some knowledge K on this topic that only the teacher has direct access to (Private Knowledge in the figure). We consider the scenario where the agents can only communicate with each other by engaging in a conversation, where the conversation history H is shared between the agents. We further constrain the conversation to one where the student asks questions about the shared topic, and the teacher provides answers based on K. Note that this setup is very similar to that of the "Game of Interrogation" of Groenendijk (1999), except that we relax the definition by using natural language instead of focusing on predicate logic, as we will detail in the sections that follow.
In this paper, we are interested in building a model of the student (question asker) in this scenario. Specifically, we investigate how to enable the student to reason pragmatically about which questions to ask to efficiently acquire knowledge, given only the topic T and the conversation history H. This setting of information-seeking conversations involves many interesting and challenging problems in natural language processing:
• Quantifying textual information. We need to be able to quantify how much knowledge the student has acquired from K.
• Evaluating language generation when a single reference is insufficient. At any state in the conversation, there is usually more than one valid question, some more effective and more appropriate than others. To address this problem, we need to come up with evaluation metrics and objective functions accordingly, rather than relying on the similarity between generated questions and the single reference that is available in existing datasets.
• Pragmatic reasoning with partial information and a large search space. In order to train computational agents capable of pragmatic reasoning, previous work typically takes the approach of either limiting the space of referents, or the space of possible utterances, or both. However, the former is infeasible in a communication setting as the student does not have access to K beyond what is already revealed in the conversation, and the latter is also impractical for natural conversations that cover a diverse set of topics.
We address these challenges by proposing two automatic reward functions that evaluate the informativeness and specificity of questions, and optimizing them with reinforcement learning.

Generating Questions in Conversations
Before we delve into the proposed approaches for training a question generator model to be pragmatic, an introduction of the model itself is due.
For the purposes of this paper, we assume that the shared topic T, the shared conversation history H, and the teacher's knowledge K (which the student has no access to) are all made available to agents in natural language. Since we consider information-seeking conversations only, the conversation history is grouped into pairs of questions and answers: $H = [(q_1, a_1), (q_2, a_2), \ldots, (q_{|H|}, a_{|H|})]$.
To generate conversational questions, we build a sequence-to-sequence model that encodes the information available to the student and decodes it into the next question in the conversation (see Figure 2(a)). Specifically, we first model the shared topic T with a bi-directional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997), and use the resulting topic representation $h_T$ in the conversation encoder. Then we obtain a representation of the conversation with hierarchical LSTM encoders: we first encode each pair of question and answer with $h_T$ using a BiLSTM, then feed these pair representations into a unidirectional LSTM in the direction that the conversation unfolds. To generate the question, we apply an LSTM decoder with attention both on the topic and the conversation history (Bahdanau et al., 2015). This allows us to efficiently batch computation for each conversation by sharing these representations across different turns. We include a detailed description of the model in Appendix A.
As a baseline, we train this model to minimize the negative log likelihood (NLL) of questions observed in the training set:
$$\ell_{\text{NLL}} = -\frac{1}{N_p} \sum_{i=1}^{N} \sum_{j=1}^{|H^{(i)}|} \log P_\theta\big(q^{(i)}_j \mid H^{(i)}_{<j}, T\big), \tag{1}$$
where θ stands for model parameters, N is the total number of conversations in the training dataset, $H^{(i)}$ the conversation history of the i-th conversation in the dataset, and $N_p = \sum_{i=1}^{N} |H^{(i)}|$ is the total number of question-answer pairs in the training dataset. Intuitively, this trains the model to mimic the observed questions in the dataset, but does not provide guarantees or assessment of how well generated questions are actually able to acquire information from the teacher agent.

Evaluating Informativeness through Question Answering
In order to train the question generation model to generate pragmatically apt questions that reveal new information from K, we need to be able to quantify informativeness in communication first. However, informativeness is difficult to quantify in an open-domain dialogue, and sometimes even subjective. In this paper, we focus on providing an objective metric for how much new information is revealed by a question. Since questions do not reveal information directly, but rather rely on the answers to them to introduce new facts into the conversation, we begin by defining the informativeness of an answer $a$ once it is provided. Specifically, we are interested in characterizing how much new information an answer $a$ reveals about K beyond what is already provided in the conversation history $H_{<j}$ up until this point in the conversation. Theoretical quantities like mutual information might seem appealing in this context given their strong grounding in information theory. However, applying them would potentially require us to fully specify the state space the world can be in for an open-domain conversation, as well as to estimate the probability distribution over potential configurations, neither of which is trivial, if feasible at all. Therefore, we turn to more practical quantities in defining the informativeness of an answer $a$ given the conversation history $H_{<j}$, by leveraging the observation that the more new information an answer reveals about K, the more likely it is to contain words that have not already been mentioned in $H_{<j}$. Making use of the unigram precision function $\mathrm{Prec}(a, a')$ between the predicted answer $a$ and an answer $a'$ that is already provided in the conversation history $H_{<j}$, we define the informativeness of the predicted answer as follows:
$$I_{\text{ans}}(a; H_{<j}) := 1 - \max_{(q', a') \in H_{<j}} \mathrm{Prec}(a, a'). \tag{2}$$
Intuitively, the more $a$ overlaps with any of the previously revealed answers, the less new information it contains. This metric of informativeness has the advantages of objectivity and ease of automatic evaluation.
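The answer informativeness described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: whitespace tokenization, lowercasing, clipped token counts, and the handling of empty inputs are all assumptions, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference.

    Counts are clipped, so a token repeated in the candidate is only
    credited as often as it appears in the reference (an assumption).
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)


def answer_informativeness(answer: str, previous_answers: List[str]) -> float:
    """One minus the highest unigram precision against any earlier answer."""
    if not answer.split():
        return 0.0  # assumption: an empty answer reveals nothing new
    if not previous_answers:
        return 1.0  # assumption: with no history, every word is new
    return 1.0 - max(unigram_precision(answer, prev) for prev in previous_answers)


# Toy usage with made-up sentences: a repeated answer scores lower
# than an answer that introduces mostly unseen words.
history = ["alice wrote a novel in paris"]
print(answer_informativeness("alice wrote a novel in paris", history))
print(answer_informativeness("bob painted a mural in rome", history))
```

Taking the maximum over previous answers means a single near-duplicate in the history is enough to mark an answer as uninformative, which matches the intuition stated in the text.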
Also note that the choice of unigram precision is here not one of necessity but simplicity and practicality. It is in principle interchangeable

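[Figure 2: model architecture diagram. Recoverable panel labels: BiGRU encoders; example input question "What other songs were on the album?"; Max pooling + Affine + Sigmoid; output "True next question?"]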
We use this definition of answer informativeness, which we denote $I_{\text{ans}}$, to define the utility of potential questions. Specifically, we define the informativeness of a question as the amount of new information it can immediately reveal through its answer:
$$I(q; C_{<j}) := I_{\text{ans}}\big(\mathrm{QA}(q, C_{<j}); H_{<j}\big), \tag{3}$$
where $C_{<j} = (H_{<j}, T, K)$ is the complete context available to the teacher up until the question is raised, and $\mathrm{QA}(q, C_{<j})$ is a pretrained conversational question answering (QA) model that answers the question $q$ from the knowledge source K given this context. This is equivalent to using a point estimate for $P(a \mid q, C_{<j})$ to evaluate $q$'s expected utility, which is practical for pragmatic reasoning at scale by avoiding the need for aggregating over a large set of candidate answers for each question. In contrast, previous work on pragmatics often requires probabilistic normalization in the space of speaker utterances (questions) and listener actions (answers), which is intractable in our setting.
This definition of informativeness is also explainable: it is easy for a human to inspect the answer provided by the QA model and compare it to previous ones to understand how much new information has been revealed. Note that this definition itself also does not rely on any specific QA model, although more accurate QA models could result in more accurate estimates of informativeness. For simplicity, we use a bidirectional attention flow model (Seo et al., 2017) with self-attention (Clark and Gardner, 2018) as adapted for conversational QA by Choi et al. (2018) (see Figure 2(b)).
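To make the point-estimate idea concrete, the sketch below scores a question by running a QA model once and measuring how novel its single predicted answer is. Everything here is illustrative: `make_toy_qa` is a hypothetical word-overlap retriever standing in for the pretrained conversational QA model, the novelty helper uses a simplified set-membership precision, and the sentences are invented.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs revealed so far
QAModel = Callable[[str, History], str]


def novelty(answer: str, previous_answers: List[str]) -> float:
    """1 minus the best unigram precision against earlier answers (simplified)."""
    tokens = answer.lower().split()
    if not tokens:
        return 0.0
    best = 0.0
    for prev in previous_answers:
        prev_words = set(prev.lower().split())
        hits = sum(1 for tok in tokens if tok in prev_words)
        best = max(best, hits / len(tokens))
    return 1.0 - best


def question_informativeness(question: str, history: History,
                             qa_model: QAModel) -> Tuple[str, float]:
    """Return the predicted answer and its novelty, so both can be inspected."""
    predicted = qa_model(question, history)
    return predicted, novelty(predicted, [a for _, a in history])


def make_toy_qa(knowledge: List[str]) -> QAModel:
    """Hypothetical stand-in QA model: pick the sentence sharing most words."""
    def qa(question: str, history: History) -> str:
        q_words = set(question.lower().rstrip("?").split())
        return max(knowledge, key=lambda s: len(q_words & set(s.lower().split())))
    return qa


qa = make_toy_qa(["alice wrote a novel in paris", "bob painted a mural in rome"])
history = [("what did alice do?", "alice wrote a novel in paris")]
for q in ["where did alice write?", "what did bob paint?"]:
    print(q, question_informativeness(q, history, qa))
```

Returning the predicted answer alongside the score reflects the explainability argument in the text: a reader can compare the answer with earlier ones to see why a question was scored as it was.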

Evaluating Question Specificity
Now that we have a metric to evaluate informativeness, can we maximize it and obtain a good model for generating pragmatic conversational questions? It turns out that there are two issues with naïvely optimizing this value: generated questions could be overly generic or disruptive of the conversation flow while still acquiring new information. For instance, questions like What else? almost always reveal new information. On the other hand, in our example in Figure 1, Did they go on tour for their 1983 album? seems more disruptive (topic-changing) as the next question in the conversation than the candidate questions in the figure.
To address this, we take a similar approach to previous work by selecting negative examples to target these issues and training a classifier to distinguish them from questions that were actually part of the conversation (Lowe et al., 2017; Rao and Daumé III, 2018). Once this classifier is trained, we can make use of the score it assigns different candidate questions to evaluate how specific each is to the current conversation history. Specifically, we select two kinds of negative questions: frequent questions from the training set (frequency>1) and random questions other than the observed one from the same conversation. We train a model (with shared parameters with the QA model, see Figure 2(b)) to assign a probability that a question is the true next question (positive) given the conversation history, and define this quantity as the specificity of the question:
$$S(q; H_{<j}, T) := P_\xi(q \text{ is the true next question} \mid H_{<j}, T), \tag{4}$$
where ξ denotes the parameters of the classifier, optimized with binary cross entropy loss. Once this classifier is trained jointly with the QA model, we can use this specificity reward to bias the model towards generating questions that are not only informative, but also specific to the given conversation history.
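The negative sampling scheme for the specificity classifier can be sketched as follows. This is a simplified illustration under stated assumptions: conversations are represented by their questions only (answers and topic are omitted), one negative of each kind is drawn per turn, and the function name and data layout are hypothetical.

```python
import random
from collections import Counter
from typing import List, Tuple

Example = Tuple[List[str], str, int]  # (previous questions, candidate, label)


def build_specificity_examples(conversations: List[List[str]],
                               seed: int = 0) -> List[Example]:
    """Pair each true next question with frequent and same-conversation negatives."""
    rng = random.Random(seed)
    counts = Counter(q for conv in conversations for q in conv)
    # "Frequent" questions: those seen more than once in the training set.
    frequent = sorted(q for q, n in counts.items() if n > 1)

    examples: List[Example] = []
    for conv in conversations:
        for j, true_q in enumerate(conv):
            history = conv[:j]
            examples.append((history, true_q, 1))  # positive: true next question
            freq_pool = [q for q in frequent if q != true_q]
            if freq_pool:  # negative: a generic, frequently asked question
                examples.append((history, rng.choice(freq_pool), 0))
            same_conv = [q for k, q in enumerate(conv) if k != j and q != true_q]
            if same_conv:  # negative: another question from this conversation
                examples.append((history, rng.choice(same_conv), 0))
    return examples


toy = [["who is she?", "what else?", "when was it released?"],
       ["what did he write?", "what else?"]]
for history, candidate, label in build_specificity_examples(toy):
    print(label, candidate, "| history length:", len(history))
```

Each example would then be fed to a binary classifier trained with cross entropy; the classifier's positive-class probability serves as the specificity score.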
Conceptually, our specificity idea is related to a few separate but connected concepts in NLP, namely discourse coherence, relevance, and reducing genericness in natural language generation.
The coherence and relevance of a piece of text in a discourse is highly correlated with the perceived quality of the generated text. Previous work has approached generating coherent utterances in conversations through encouraging the model to learn similar distributed representations throughout the conversation (Baheti et al., 2018; Xu et al., 2018; Zhang et al., 2018a). In contrast, we achieve the same goal with a discriminative classifier, which is trained to contrast the true follow-up question (relevant and coherent) against randomly sampled questions (irrelevant) from other conversations and out-of-order questions (incoherent). The idea of discerning discourse consistency has also been applied to large pretrained language models (Devlin et al., 2019; Iter et al., 2020), where it has been shown to sometimes yield performance gains when they are finetuned on downstream tasks.
On the other hand, since we sample frequent questions in the training set as negative examples for the classifier, it also discourages the model from generating overly generic questions. Previous work has attacked the problem of genericness in conversational natural language generation by proposing auxiliary training objectives, e.g., ones that maximize the utility of the generated utterance estimated with adversarial networks (Rao and Daumé III, 2019), specificity estimates that are estimated from data (Ko et al., 2019a,b), or the mutual information between the generated turn and previous ones (Li et al., 2016). Our proposed method can be viewed as a generalization of these approaches, where the objective to be optimized at the time of generation is implicitly specified via a parameterized model by choosing negative examples for contrast.

Generating Informative and Specific Questions
Given the informativeness metric and specificity reward, we can improve upon these by maximizing the following reward function that blends the two in a weighted sum:
$$R(q; C_{<j}) = \lambda_1 I(q; C_{<j}) + (1 - \lambda_1) S(q; H_{<j}, T). \tag{5}$$
Since this quantity can only be evaluated once a complete question has been generated, the non-differentiability of the decoding process prevents us from directly optimizing it with respect to θ using gradient-based optimization. However, we can still estimate the gradient of the expected reward of generated questions, $\mathbb{E}_{q \sim P_\theta}[R(q)]$, using REINFORCE (Williams, 1992), a reinforcement learning technique. For an example $q$, the gradient estimate is the gradient of the following loss function:
$$\ell_R(\hat{q}) = -\big[R(\hat{q}) - b(\hat{q})\big] \log P_\theta(\hat{q}), \tag{6}$$
where $\hat{q}$ is a sample from $P_\theta$ and we dropped the dependency on $C_{<j}$ for notational clarity. $b(\hat{q})$ is called the baseline function, which, if chosen carefully, reduces the variance of this gradient estimate and results in faster convergence. We apply a technique called self-critical sequence training (Rennie et al., 2017), which selects $b(\hat{q}) = R(q_G)$, the reward obtained by the greedily decoded sequence, $q_G$, from the question generator.
To ensure that the generator maximizes the desired reward function without losing fluency in generated questions, we combine R with negative log likelihood during model finetuning (Paulus et al., 2018). We finetune a pretrained question generator (with $\ell_{\text{NLL}}$) using the following objective:
$$\ell = \lambda_2 \ell_R + (1 - \lambda_2) \ell_{\text{NLL}}. \tag{7}$$
Here, $\ell_R = \frac{1}{N_p} \sum_{i=1}^{N} \sum_{j=1}^{|H^{(i)}|} \ell_R\big(\hat{q}^{(i)}_j\big)$. We choose $\lambda_1 = 0.5$ and $\lambda_2 = 0.98$ in our experiments, which were chosen by tuning the model on the dev set.
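A small numeric sketch may help make the training signal concrete. It assumes that λ1 weights informativeness (with 1 − λ1 on specificity), that λ2 weights the reinforcement learning term (with 1 − λ2 on NLL), and that the loss is written so that minimizing it raises the probability of samples that beat the greedy baseline. The function names and the reward values are illustrative, not taken from the experiments.

```python
def blended_reward(informativeness: float, specificity: float,
                   lambda1: float = 0.5) -> float:
    """Weighted sum of the two rewards (weight placement is an assumption)."""
    return lambda1 * informativeness + (1.0 - lambda1) * specificity


def self_critical_loss(sample_reward: float, greedy_reward: float,
                       sample_log_prob: float) -> float:
    """REINFORCE loss with the greedy decode's reward as the baseline.

    If the sampled question beats the greedy one, the advantage is positive
    and minimizing the loss pushes the sample's log-probability up.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sample_log_prob


def finetune_objective(rl_loss: float, nll_loss: float,
                       lambda2: float = 0.98) -> float:
    """Mix the RL loss with NLL to keep generated questions fluent."""
    return lambda2 * rl_loss + (1.0 - lambda2) * nll_loss


# Made-up numbers: the sampled question is more informative than the greedy one.
r_sample = blended_reward(informativeness=0.8, specificity=0.6)
r_greedy = blended_reward(informativeness=0.2, specificity=0.6)
rl = self_critical_loss(r_sample, r_greedy, sample_log_prob=-2.0)
print(r_sample, r_greedy, rl, finetune_objective(rl, nll_loss=3.0))
```

In a real training loop the log-probability would be a differentiable tensor from the decoder and the rewards would be treated as constants, so gradients flow only through the log-probability term.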

Experiments
Data. For our experiments, we use the QuAC dataset presented by Choi et al. (2018). Although other similar datasets share some common characteristics, some crucial differences render them inapplicable for our experiments. For instance, CoQA (Reddy et al., 2019) gives both agents access to the context, while Wizard of Wikipedia (Dinan et al., 2019) does not assign the student agent clear goals of acquiring new information.
Since QuAC's test set is held private for fair evaluation, for this work we repurpose the original dev set as our test set. We randomly split the training set into training and development partitions, ensuring that the Wikipedia entities discussed in conversations do not overlap between these partitions. The goal of the split is to obtain a development set that is roughly as large as the repurposed test set. The statistics of our data split can be found in Table 1.
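One way to obtain an entity-disjoint split like the one described above is to shuffle entities rather than conversations, then assign whole entities to the development partition until it reaches the target size. The sketch below assumes each conversation is a dictionary with a hypothetical "entity" field; it illustrates the idea and is not the authors' splitting script.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Conversation = Dict[str, object]


def split_by_entity(conversations: List[Conversation], dev_size: int,
                    seed: int = 0) -> Tuple[List[Conversation], List[Conversation]]:
    """Split so that no entity appears in both the train and dev partitions."""
    by_entity = defaultdict(list)
    for conv in conversations:
        by_entity[conv["entity"]].append(conv)

    entities = sorted(by_entity)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(entities)

    train: List[Conversation] = []
    dev: List[Conversation] = []
    for entity in entities:
        # Fill dev with whole entities until it reaches the target size.
        target = dev if len(dev) < dev_size else train
        target.extend(by_entity[entity])
    return train, dev


toy = [{"entity": e, "id": i} for i, e in enumerate(["A", "A", "B", "C", "C", "D"])]
train, dev = split_by_entity(toy, dev_size=2)
print(len(train), len(dev))
```

Because entities are assigned whole, the development set can slightly overshoot the requested size, which is consistent with aiming for a partition that is only roughly as large as the test set.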
Training. We follow the recipe available in AllenNLP (Gardner et al., 2018) to train the QA model on QuAC, and make sure that it obtains performance on par with that reported by Choi et al. (2018) on the official dev set (with multiple answer references). We use the Adam optimizer (Kingma and Ba, 2015) with default hyperparameters to train and finetune our question generator, and anneal the learning rate by 0.5 whenever dev performance does not improve for more than 3 consecutive epochs (patience=3). When training finishes, the specificity classifier achieves approximately 75% $F_1$ on the dev set when the true next question, sampled frequent questions and random questions from the same conversation have a balanced ratio of 1:1:1. For unanswerable questions in QuAC, we revise Equation (3) and set informativeness to zero if the predicted answer is CANNOTANSWER, as the answer does not reveal new information about the hidden knowledge K.
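The learning rate annealing rule can be illustrated with a short framework-free simulation. The initial learning rate of 1e-3 and the convention that higher dev scores are better are assumptions made for the example; the function name is hypothetical.

```python
from typing import List


def annealed_learning_rates(dev_scores: List[float], initial_lr: float = 1e-3,
                            factor: float = 0.5, patience: int = 3) -> List[float]:
    """Learning rate in effect after each epoch, halving on plateaus.

    The rate is multiplied by `factor` once the dev score has failed to
    improve for more than `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("-inf")
    epochs_without_improvement = 0
    schedule = []
    for score in dev_scores:
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                lr *= factor
                epochs_without_improvement = 0
        schedule.append(lr)
    return schedule


# Made-up dev scores that plateau after the second epoch.
print(annealed_learning_rates([0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.7]))
```

Deep learning frameworks typically ship a plateau-based scheduler with the same behavior, so in practice this logic would rarely be written by hand.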

Metric-based Evaluation
For the baseline model and our model finetuned for informativeness and specificity, we generate predictions with greedy decoding for simplicity. We evaluate them on conventionally used metrics such as perplexity (PPLX) of the reference question and the $F_1$ score of the ROUGE-L metric (ROUGE-L) (Lin, 2004) between the predicted questions and the reference. The former helps verify the overall quality of our model, while the latter helps us compare single-reference metrics to our proposed ones. We also report the informativeness metric (INFO) and specificity reward (SPEC) for these models, and compare them to the reference questions on these measures on both the dev and test sets. As shown in Table 2, the baseline model and our pragmatically finetuned model achieve comparable performance when evaluated against the reference question using n-gram overlap metrics (ROUGE-L), and the perplexity of the reference question is only slightly worse. As expected, these metrics tell us nothing about how well the model is going to fare in actual communication, because perplexity does not evaluate the usefulness of generated questions, and ROUGE-L can barely tell these systems apart.
We can also see in Table 2 that the finetuned model improves upon the baseline model on both informativeness and specificity. Further, we notice that despite their high specificity, the reference questions are only about as informative as our baseline questions on average, which is a bit surprising at first sight. Further analysis reveals that about 12.6% of dev questions and 15.7% of test questions are considered unanswerable by crowd workers, which is a byproduct of the information-asymmetric setting adopted when the data was collected. As a result, many reference questions could be considered uninformative by our definition, since they might cause the QA model to abstain from answering.

Human Evaluation
Although the results in Table 2 show that our model sees substantial improvements on the proposed informativeness and specificity metrics, it remains unclear whether these improvements correlate well with human judgement of quality, which is critical in the application of the resulting system. To study this, we conduct a comparative human evaluation.
We randomly selected 200 turns from the test set, and asked two NLP PhD students to evaluate the reference questions, as well as those generated by the baseline model and our model. These questions are evaluated on their overall quality, informativeness, and specificity, where the annotators are asked to rank the candidate questions on each metric with ties allowed. System identity is hidden from the annotators, and the order of the systems is shuffled for each comparison. Prior to annotation, both annotators were educated to follow the same guidelines to encourage high agreement (see Appendix D for details).
As shown in Table 3, human annotators favor our system over the baseline on informativeness (93.75% of our questions are considered equally or more informative), and to a lesser extent, overall quality (81.5%) and specificity (82%). We find that 26.1% of questions generated by these systems are identical, which inflates the number of ties in human evaluation. We expect a starker contrast if a sampling-based decoding strategy were applied for generation diversity, which we leave to future work. We also attribute this difference in human-perceived quality on these three aspects partly to the inherent nature of these annotation tasks: while our annotators agree on 80.3% of the pair-wise judgments regarding informativeness, agreement decreases to 70.7% for overall quality and 69.2% for specificity since they are more subjective. It is encouraging, however, that our system is also considered equally or more informative than the human reference 80.25% of the time. The lower human-perceived overall quality of the questions our system generates is largely attributable to the over-genericness of these questions compared to the references, and a sometimes blatant lack of common sense (e.g., questions like "What did he do after his death?").

Analysis
We further analyze concrete examples of generated questions in conversations to understand the behavior of our informativeness and specificity metrics.
Case Study. To sanity check whether our informativeness metric and specificity reward match human intuition, we manually inspect a few examples from the test set. Figure 3 presents a case where all the questions our system generated are considered equally or more informative than the reference and baseline generated questions by our metric. As shown in the example, the baseline system is prone to generating topical but uninformative questions (BL 2 and BL 3 ). Our system finetuned on our reward function is more pragmatic and asks relevant questions that can likely be answered from the unseen paragraph K. Our informativeness metric also correctly identifies that both Ours 3 and Ref 3 are good questions that reveal new information about K, although there is very little overlap between the two. On the other hand, the specificity reward successfully identifies that BL 3 and Ref 4 are the least specific questions of their respective turns: the former is disconnected from the most recent topic under discussion (the song), while the latter is phrased in an overly generic way.
We also demonstrate some clear failure cases. In Figure 4, we see that our informativeness and specificity measures make judgements a human is unlikely to make, as the topic implies K is unlikely to contain information about Moyet's first album/recording. In fact, the QA model fails to recognize that these questions (BL 1,2 , Ours 1,2,3 , Ref 1 ) are unanswerable, and instead assigns them high informativeness. The specificity model, on the other hand, fails to recognize near paraphrases (BL 1 vs Ours 1 ) and a question that was likely just answered (BL 3 ). A positive finding in this example is that the informativeness metric is well-aligned with pragmatic behavior in the fourth turn: had Moyet won the Grammy, the previous answer (A 3 ) would have mentioned it instead of just her nomination.
We include the answering contexts for these examples in Appendix B for the reader's reference.
Explainable Informativeness. As stated in Section 3.3, our definition of informativeness is explainable to humans, which we demonstrate with concrete examples. For instance, in the example in Figure 3, although the question "What happened in 1983?" is phrased rather vaguely, the QA model is able to identify its correct answer from the paragraph, "The band released their third album, True, in March 1983", which offers new information (note the answer in the figure only reflects the actual human-human conversation, not this hypothetical one). Similarly, the QA model correctly identifies that the question our model generated on the second turn (Ours 2) has the same answer as the human reference (Ref 2), which introduces a new entity into the conversation. BL 2 and BL 3 are deemed uninformative in this case since the QA model offered the same answer about the album True again. Although this answer is about an incorrect entity in this context (the album True instead of Parade, which is the focus of discussion), the large amount of overlap between this answer and Ans 1 is still sufficient to regard these questions as less informative.
We note that this informativeness metric does have an exploitable flaw: it does not prevent the questioner from asking vague, open-ended questions (e.g., "What else do you know?") to acquire knowledge. In fact, we find this strategy is also adopted by QuAC's crowd workers. However, our specificity reward penalizes genericness, and therefore alleviates this issue in the questions our system generates. We show that our system repeats n-grams from previous questions less frequently, and refer the reader to Appendix C for details.
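The reasoning behind these judgements can be sketched in a few lines of code. The snippet below is a minimal illustration and not our actual implementation: it assumes the QA model's predicted answer is already available as a string, uses whitespace tokenization, and uses unigram precision against previously revealed answers as a stand-in for the overlap measure defined in Section 3.3. The function names and the choice to score unanswerable questions as 0 are assumptions made for this example.

```python
from collections import Counter

CANNOTANSWER = "CANNOTANSWER"  # abstention token appended to the paragraph


def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also occur in reference (clipped counts)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, ref_counts[tok])
                  for tok, count in Counter(cand_tokens).items())
    return overlap / len(cand_tokens)


def informativeness(predicted_answer, previous_answers):
    """Score in [0, 1]: high when the predicted answer reveals new content.

    Assumption: an unanswerable question is scored 0, and the score is
    one minus the largest overlap with any previously revealed answer.
    """
    if predicted_answer.strip() == CANNOTANSWER:
        return 0.0
    if not previous_answers:
        return 1.0
    return 1.0 - max(unigram_precision(predicted_answer, prev)
                     for prev in previous_answers)
```

Under this sketch, a question whose predicted answer repeats an earlier answer (as with BL 2 and BL 3 in Figure 3) receives a score near 0, while a question whose answer shares no tokens with the history receives a score of 1.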

Conclusion
In this paper, we presented a question generation system for information-seeking conversations. By optimizing our proposed automatic metrics for informativeness and specificity, the model is able to generate pragmatically relevant and specific questions to acquire new information about an unseen source of textual knowledge. Our proposed method presents a practical if shallow implementation of pragmatics in an open-domain communication setting beyond simple reference games. We hope that our work brings the community's attention to this important problem of natural language communication under information asymmetry.

A Model Details
In this section, we include the details of the question generation model and the informativeness/specificity model we used in our experiments.

A.1 Question Generator
We tokenize the input to the encoder and decoder models in our question generator with the spaCy toolkit, and initialize word representations with 100-dimensional GloVe vectors (Pennington et al., 2014). As shown in Figure 2, we also introduce special XML-like symbols to delimit different parts of the input to various models. The representations of these special symbols are randomly initialized, and finetuned with those of the top 1000 most frequent words in the training set during training.
For the topic, we concatenate the title of the Wikipedia page and the background on the entity with special symbols, feed them into a topic BiLSTM model, and obtain the topic representation with a multi-layer perceptron (MLP) attention mechanism, using the concatenated final state from each direction of the BiLSTM as the key.
To represent the input words in the decoder, we use the same embedding matrix as the encoder. We also employ weight tying between the input embeddings and the output weights for word prediction to reduce parameter budget (Press and Wolf, 2017). For each word in the decoder input, we concatenate its embedding with $h^{\mathrm{attn}}_T$ for topical context. We provide the decoder access through attention to all of the representations of encoded tokens. Finally, the weighted average of encoder representations is combined with the decoder LSTM's representation of the decoded sequence to yield a probabilistic distribution over words in the vocabulary.

A.2 Informativeness/Specificity Model
For informativeness, we follow closely the open implementation of BiDAF++ for QuAC that is available in AllenNLP (Gardner et al., 2018). For each word, we concatenate its word representation with a character representation derived by a convolutional neural network from its character spelling. We replace the ELMo embeddings with GloVe ones for computational efficiency, which results in a relatively small drop in QA performance compared to AllenNLP's implementation (by about 2-3 F1 on the official dev set). Note that following Choi et al. (2018), we use gated recurrent units (GRUs; Cho et al., 2014) in this part of the model. For the specificity model, we first encode the topic and conversation history in a similar fashion as we did for the encoder in the question generator. Then, this representation is combined with the question representation from the BiGRU encoder in the QA model via a bidirectional attention (bi-attention) mechanism. The resulting representation is combined with the question representation from the bi-attention in the QA model, and max pooled over time, before an affine transform is applied to convert the representation into a score.

B Contexts for Case Study Examples
We include in Figures 5 and 6 the contexts that contain the answer for the examples we studied in Section 6, with gold answers in the case study highlighted in the paragraphs. Following Choi et al. (2018), we concatenate an artificial CANNOTANSWER token to the end of the paragraph for the question answering model to abstain from answering the question.
The band released their third album, True, in March 1983. Produced by Tony Swain and Steve Jolley, the album featured a slicker pop sound. It was at this point that Steve Norman began playing saxophone for the band. Preceded by the title track which reached number one in various countries, the album also reached number one in the UK. Their next single, "Gold", reached number 2.
[The follow-up album, Parade, was released in June 1984, and its singles were again big successes in the charts in Europe, Oceania and Canada.]Ans 1 [The album's opening song, "Only When You Leave"]Ans 2, [became the band's last American hit.]Ans 3 At the end of 1984, the band performed on the Band Aid charity single and in 1985 performed at Wembley Stadium as part of Live Aid. During this same year, Spandau Ballet achieved platinum status with the compilation The Singles Collection, which kept the focus on the band between studio albums and celebrated its five years of success. However, the album was released by Chrysalis Records without the band's approval and the band instigated legal action against the label. In 1986, Spandau Ballet signed to CBS Records and released the album Through the Barricades, in which the band moved away from the pop and soul influences of True and Parade and more toward rock. Though the first single, "Fight for Ourselves" peaked at 15 in the UK, the title track and the album both reached the Top 10 in the UK and Europe. After a hiatus from recording, the band released their next album, Heart Like a Sky, in September 1989. The album and its singles were unsuccessful in the UK, and the album itself was not released in the United States. It did, however, do well in Italy (where its singles "Raw" and "Be Free with Your Love" reached the Top 10) and also in Belgium, Germany and the Netherlands. CANNOTANSWER
Figure 5: Private context that contains the answers to questions in our case study example in Figure 3.

C Specificity Analysis
We examine the outputs of our model to assess whether finetuning on the specificity reward results in more specific questions rather than generic and repetitive ones. To measure this, we compute the n-gram overlap between generated questions and all questions in the conversation history for all systems. The lower this repetition is, the more likely the system is bringing up new entities or topics in its questions, and thus the more specific its questions are to the given conversation history. As can be seen in Figure 7, our system improves upon the baseline system by reducing this repetition noticeably in longer n-grams (n ≥ 3). When n is very large (n ≥ 8), our pragmatic system is less repetitive even compared to the human reference, which often contains long and repetitive questions like "Are there any other interesting aspects about this article?" as a generic inquiry for more information.
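The repetition measure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our evaluation script: it uses lowercased whitespace tokenization, computes the proportion for a single question, and leaves aside how proportions are aggregated over the test set. The function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngram_proportion(question, history_questions, n):
    """Proportion of the question's n-grams already seen in earlier questions."""
    seen = set()
    for previous in history_questions:
        seen.update(ngrams(previous.lower().split(), n))
    current = ngrams(question.lower().split(), n)
    if not current:  # question shorter than n tokens
        return 0.0
    return sum(gram in seen for gram in current) / len(current)


history = ["what did he do"]
print(repeated_ngram_proportion("what else did he do", history, 2))  # 0.5
```

A lower value indicates that the question shares fewer n-grams with earlier questions in the same conversation.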
Following a period of personal and career evaluation, [Hoodoo]Ans 2 was released in 1991. The album sold respectably in the UK, [and Moyet was nominated for a Grammy for the single "It Won't Be Long".]Ans 3 However, the release of Hoodoo marked the beginning of an eight-year fight for Moyet to secure complete control of her artistic direction. Like many similar artists (including Aimee Mann and the late Kirsty MacColl), Moyet was reluctant to record a radio-friendly "pop" album simply for the sake of creating chart hits. Moyet's next album, Essex (1994), was also a source of controversy for her; in order for the album to be released, her label (now Sony) insisted that certain Essex tracks be re-recorded and re-produced, and that there be additional material remixed to create a more 'commercial' package. The video for the single "Whispering Your Name" again featured Dawn French. Following the release of Essex, Sony released a greatest hits compilation of Moyet's work. Singles entered the UK charts at No. 1 and, following a UK tour, was re-issued as a double CD set which included "Live (No Overdubs)", a bonus live CD. Upon re-issue, Singles charted again, this time in the Top 20. Due to prolonged litigation with Sony, Moyet did not record or release a new studio album for over eight years after the release of Essex.
Figure 7: Proportion of repeated n-grams in questions from the conversation history. As can be seen from the plot, our pragmatic system reduces the amount of n-grams repeated from previous questions, especially for longer n-grams.

D Human Evaluation Details
In this section, we include further details about how the human evaluation is carried out to compare different systems.
We begin by randomly sampling 200 turns of questions from the 7354 question-answer pairs in the test set, and collect the questions from the human reference, the baseline system, and our finetuned system. Then, for each turn, we shuffle the order of the three candidate questions, and present them in a group to the annotators. The questions are accompanied by the entity and topic under discussion, as well as the conversation history from the QuAC dataset that led up to the turn under evaluation. We ask the annotators to provide a ranking amongst the questions from these unidentified systems, and allow ties when the annotators cannot observe a qualitative difference between two or more question candidates. An example annotation task for a turn of questions can be found in Figure 8. To encourage high inter-annotator agreement, we first conducted a trial annotation on 20 examples from the dev set, and composed annotation guidelines (see Figure 9) with some minimal examples to clarify edge cases.
Figure 8: The annotation interface for the human evaluation, which is built with Qualtrics.com. The annotators are given the title of the Wikipedia article under discussion, a short introductory paragraph, the title of the Wikipedia section discussed, the QuAC conversation history that leads up to the current turn, and the candidate questions to evaluate. They are asked to rank these candidate questions on overall quality, informativeness, and specificity, as we outline in the guidelines.
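The sampling and shuffling procedure described above can be sketched as follows. This is an illustrative sketch only, not the script used in our study: the dictionary keys (`turn_id`, `reference`, `baseline`, `ours`), the function name, and the fixed random seed are all assumptions made for the example.

```python
import random


def build_annotation_items(turns, num_samples=200, seed=0):
    """Sample turns without replacement and shuffle each turn's candidates.

    Each element of `turns` is assumed to be a dict holding one question per
    system. The returned `key` list records which system produced each
    question and is kept hidden from annotators.
    """
    rng = random.Random(seed)
    sampled = rng.sample(turns, num_samples)
    items = []
    for turn in sampled:
        candidates = [(name, turn[name])
                      for name in ("reference", "baseline", "ours")]
        rng.shuffle(candidates)  # hide system identity behind a random order
        items.append({
            "turn_id": turn["turn_id"],
            "questions": [question for _, question in candidates],
            "key": [name for name, _ in candidates],
        })
    return items
```

Keeping the shuffled system labels in a separate field allows the rankings to be mapped back to systems after annotation, while the annotators only ever see the anonymized questions.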

Evaluating Questions in an Information-Gathering Conversation
In this task, you will be asked to read a conversation between two agents on a given topic (an entity from Wikipedia, e.g., "Albert Einstein"), and evaluate a set of follow-up questions as candidates for the next utterance in the conversation. More specifically, the agents discuss a given section in that Wikipedia article (e.g., "Early Life"). Only one of the two agents, the teacher, or answerer, has access to the text of the section, from which answers are provided. The student's (asker's) goal is to have a meaningful conversation and gather information from this unseen section of text through the conversation.

Setting
You will be provided the same information that is available to the student, i.e., the shared conversational topic (Wikipedia page title, a short introductory paragraph), the section title under discussion, as well as the entire history of conversation between the teacher and the student.

Task
Your task is to evaluate the quality of three candidate questions for each combination of topic under discussion, section title, and conversation history. You will be ranking these questions on three different evaluation metrics, where ties are allowed for any metric (and encouraged if there isn't a clear signal setting candidate questions apart). Specifically, you will be evaluating these questions on the following.
Overall Quality. A good question should be fluent, specific, and move the conversation forward. Does this question seem relevant to the conversation? Does it move the conversation forward by gathering more information? Is it grammatical and/or fluent? If you had to choose one of these questions to ask as the student, in which order would you choose these questions (ties are allowed)?
Informativeness. A good question in this setting should gather new information that hasn't already been revealed by the teacher. Does this question attempt to gather new information from the section under discussion? Note that a question doesn't truly gather new information from the section if references in it are phrased too vaguely to be resolved to anything specific in the conversation history, or if it asks about something completely irrelevant to the (unseen) section under discussion. Depending on the context, a seemingly repetitive question can actually gather more information (e.g., asking about other films an actor/actress has appeared in given the knowledge of some of his/her films). Use your best judgement in these cases.
Specificity. A good question should also be tightly related to the topic under discussion, as well as what has just been discussed. Is this question specific for the current conversation, merely applicable to general discussions about this topic, applicable to discussions about virtually any topic, or worse, obviously irrelevant to the current discussion? Note that pronoun use (e.g., "her", "it") shouldn't be discounted as less specific than mentioning the specific entities, as they are commonly used to refer to topics or entities under discussion.

Figure 9: Human evaluation guidelines to compare system-generated questions with the human reference.