Are You Asking the Right Questions? Teaching Machines to Ask Clarification Questions

Inquiry is fundamental to communication, and machines cannot effectively collaborate with humans unless they can ask questions. In this thesis work, we explore how we can teach machines to ask clarification questions when faced with uncertainty, a goal of increasing importance in today's automated society. We do a preliminary study using data from StackExchange, a plentiful online resource where people routinely ask clarifying questions on posts so that they can better offer assistance to the original poster. We build neural network models inspired by the idea of the expected value of perfect information: a good question is one whose expected answer is going to be most useful. To build generalizable systems, we propose two future research directions: a template-based model and a sequence-to-sequence based neural generative model.


Introduction
A main goal of asking questions is to fill information gaps, typically through clarification questions, which naturally occur in conversations (Purver, 2004; Ginzburg, 2012). A good question is one whose likely answer is going to be the most useful. Consider the exchange in Figure 1, in which an initial poster (whom we'll call "Terry") asks for help configuring environment variables. This question is underspecified, and a responder ("Parker") asks a clarifying question: "(a) What version of Ubuntu do you have?" Parker could alternatively have asked one of the other questions, (b) or (c). Parker should not ask (b) because it is not useful; they should not ask (c) because it is too specific and an answer of "No" gives little help. Parker's question (a) is optimal: it is both likely to be useful and plausibly answerable by Terry. Our goal in this work is to automate Parker. Specifically, after Terry writes their initial post, we aim to generate a clarification question so that Terry can immediately amend their post in hopes of getting faster and better replies.
Our work has two main contributions:
1. A novel neural-network model for addressing this task that integrates the notion of expected value of perfect information (§2).
2. A novel dataset, derived from StackExchange, that enables us to learn a model to ask clarifying questions by looking at the types of questions people ask (§4.1).
To develop our model we take inspiration from the decision-theoretic framework of the Expected Value of Perfect Information (EVPI) (Avriel and Williams, 1970), a measure of the value of gathering additional information. In our setting, we use EVPI to calculate which question is most likely to elicit an answer that would make the post more informative. Formally, for an input post p, we want to choose a question q that maximizes $\mathbb{E}_{a \sim p, q}[U(p + a)]$, where a is a hypothetical answer and U is a utility function measuring the completeness of post p if a were to be added to it. To achieve this, we construct two models: (1) an answer model, which estimates P[a | p, q], the likelihood of receiving answer a if one were to ask question q on post p; (2) a completeness model, U(p), which measures how complete a post is. Given these two models, at prediction time we search over a shortlist of possible questions for the one that maximizes the EVPI.
We are able to train these models jointly based on (p, q, a) triples that we extract automatically from StackExchange. Figure 1 depicts how we do this using StackExchange's edit history. In the figure, the initial post fails to state what version of Ubuntu is being run. In response to Parker's question in the comments section, Terry, the author of the post, edits the post to answer Parker's clarification question. We extract the initial post as p, the question posted in the comments section as q, and the edit to the original post as the answer a to form our (p, q, a) triples.
Our results show significant improvements from using the EVPI formalism over both standard feedforward network architectures and bag-of-ngrams baselines, even when our system builds on strong information retrieval scaffolding. In comparison, without this scaffolding, the bag-of-ngrams model outperforms the feedforward network. We additionally analyze the difficulty of this task for non-expert humans.

Related Work
The problem of question generation has received sparse attention from the natural language processing community. Most prior work focuses on generating reading comprehension questions: given text, write questions that one might find on a standardized test (Vanderwende, 2008; Heilman, 2011; Rus et al., 2011). Comprehension questions, by definition, are answerable from the provided text. Clarification questions are not. Outside reading comprehension questions, Labutov et al. (2015) studied the problem of generating question templates via crowdsourcing, Liu et al. (2010) use template-based question generation to help authors write better related work sections, Mostafazadeh et al. (2016) consider question generation from images, and Artzi and Zettlemoyer (2011) use human-generated clarification questions to drive a semantic parser.

Model Description
In order to choose what question to ask, we build a neural network model inspired by the theory of expected value of perfect information (EVPI). EVPI measures the following: if I were to acquire information X, how useful would that be to me? However, because we haven't acquired X yet, we have to take this quantity in expectation over all possible X, weighted by each X's likelihood. In the question generation setting, for any given question q that we can ask, there is a set A of possible answers that could be given. For each possible answer a ∈ A, there is some probability of getting that answer, and some utility if that were the answer we got. The value of this question q is the expected utility over all possible answers. The theory of EVPI then states that we want to choose the question q that maximizes:

$$\sum_{a \in A} P[a \mid p, q] \, U(p + a) \qquad (1)$$

In Eq 1, p is the post, q is a potential question from a set of candidate questions Q (§3.1) and a is a potential answer from a set of candidate answers A (§3.1). P[a|p, q] (§3.2) measures the probability of getting an answer a given an initial post p and a clarifying question q. U(p + a) (§3.3) measures how useful it would be if p were augmented with answer a. Finally, using these pieces, we build a joint neural network that we can optimize end-to-end over our data (§3.4). Figure 2 describes the behavior of our model during test time.
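As a rough illustration of the selection rule, the sketch below scores each candidate question by its expected utility, the sum over candidate answers of P[a | p, q] times U(p + a), and returns the best one. The function names, the toy probability table and the toy utility values are all hypothetical stand-ins for the learned answer model and utility calculator; they are not taken from the paper.

```python
from typing import Callable, Sequence


def select_question(
    post: str,
    questions: Sequence[str],
    answers: Sequence[str],
    answer_prob: Callable[[str, str, str], float],
    utility: Callable[[str, str], float],
) -> str:
    """Return the candidate question with the highest expected utility."""

    def expected_utility(q: str) -> float:
        # Sum over candidate answers of P[a | p, q] * U(p + a).
        return sum(answer_prob(a, post, q) * utility(post, a) for a in answers)

    return max(questions, key=expected_utility)


# Toy stand-ins for the learned models (hypothetical values).
def toy_answer_prob(a: str, p: str, q: str) -> float:
    table = {
        ("version?", "Ubuntu 14.04"): 0.9,
        ("version?", "no"): 0.1,
        ("is it broken?", "Ubuntu 14.04"): 0.1,
        ("is it broken?", "no"): 0.9,
    }
    return table[(q, a)]


def toy_utility(p: str, a: str) -> float:
    # An answer naming the OS version makes the post more complete than "no".
    return {"Ubuntu 14.04": 0.8, "no": 0.1}[a]


best = select_question(
    "How do I set environment variables?",
    ["version?", "is it broken?"],
    ["Ubuntu 14.04", "no"],
    toy_answer_prob,
    toy_utility,
)
print(best)
```

With these toy numbers the first question has expected utility 0.9 × 0.8 + 0.1 × 0.1 = 0.73 and the second 0.1 × 0.8 + 0.9 × 0.1 = 0.17, so the first is selected.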

Question & Answer Candidate Generator
Given a post, our first step is to generate a set of candidate questions and answers. Our model learns to ask questions by looking at questions asked in previous similar situations. We first identify 10 posts similar to the given post in our dataset using Lucene (software used extensively in information retrieval) and then consider the questions asked to these posts as our set of question candidates and the edits made to the posts in response to the questions as our set of answer candidates.

Figure 2: The behavior of our model during test time. Given a post p, we retrieve 10 posts similar to p using Lucene and consider the questions asked to those as question candidates and the edits made to the posts in response to the questions as answer candidates. Our answer model generates an answer representation F_ans(p, q_j) for each question candidate q_j and calculates how close an answer candidate a_k is to F_ans(p, q_j). Our utility calculator calculates the utility of the post if it were updated with the answer a_k. We select the question q_j that maximizes the expected utility of the post p (Equation 1).
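The candidate generation step can be sketched in a few lines. The paper uses Lucene for retrieval; the sketch below swaps in a small pure-Python TF-IDF cosine ranking as a stand-in, so the scoring differs from Lucene's. The `Triple` layout and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (post, question, answer); hypothetical layout


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def tfidf_vectors(docs: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Return one sparse TF-IDF vector per document plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def candidates(post: str, dataset: List[Triple], k: int = 10) -> Tuple[List[str], List[str]]:
    """Retrieve the k most similar posts and return their questions and answers."""
    vecs, idf = tfidf_vectors([p for p, _, _ in dataset])
    # Terms unseen in the dataset get weight zero.
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(post)).items()}
    ranked = sorted(range(len(dataset)), key=lambda i: cosine(query, vecs[i]), reverse=True)[:k]
    return [dataset[i][1] for i in ranked], [dataset[i][2] for i in ranked]


dataset = [
    ("how to set environment variables on ubuntu", "What version of Ubuntu?", "Ubuntu 14.04"),
    ("wifi driver not loading", "Which card?", "Intel 7260"),
    ("cannot mount usb drive", "What filesystem?", "ext4"),
]
questions, answers = candidates("set environment variables", dataset, k=2)
print(questions[0])
```

The questions and answers returned this way play the roles of the candidate sets Q and A.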

Answer Modeling
Given a post p and a question candidate q_i, our second step is to calculate how likely this question is to be answered using one of our answer candidates a_k. To calculate this probability, we first generate an answer representation F_ans(p, q_i) and then measure how close the answer candidate a_k is to our answer representation using the equation:

$$P[a_k \mid p, q_i] = \frac{1}{Z} \exp\left(-\lambda \, \lVert a_k - F_{ans}(p, q_i) \rVert^2\right) \qquad (2)$$

where λ is a tunable parameter that controls the variance of the distribution and Z is a normalization constant.
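A minimal sketch of this closeness-to-probability step, assuming the probability decays exponentially with the squared distance between a candidate answer vector and the answer representation, scaled by λ and normalized over the candidate set. The vectors here are plain lists of floats standing in for learned representations, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(u, v))


def answer_probabilities(
    answer_repr: Sequence[float],
    candidate_answers: Sequence[Sequence[float]],
    lam: float = 1.0,
) -> List[float]:
    """Turn distances to the answer representation into a distribution."""
    scores = [math.exp(-lam * squared_distance(a, answer_repr)) for a in candidate_answers]
    z = sum(scores)  # normalizer over the candidate set
    return [s / z for s in scores]


probs = answer_probabilities([0.0, 0.0], [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
print(probs)
```

Larger values of `lam` concentrate more of the probability mass on the closest candidate. For very distant candidates the exponentials can underflow, so a practical version would likely work in log space.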
We train our answer generator using the following intuition: a question can be asked in several different ways. For example, in Figure 1, the question "What version of Ubuntu do you have?" can be asked in other ways like "What version of operating system are you using?", "Version of OS?", etc. Additionally, a question can generate several different answers. For instance, "Ubuntu 14.04 LTS", "Ubuntu 12.0" and "Ubuntu 9.0" are all valid answers. To capture these generalizations, we define the following loss function:

$$\mathrm{loss}_{ans}(p, q, \bar{a}, Q) = \lVert F_{ans}(p, q) - \bar{a} \rVert^2 \qquad (3)$$

In Equation 3, the first term forces the answer representation F_ans(p_i, q_i) to be as close as possible to the correct answer a_i, and the second term forces it to be close to the answer a_j corresponding to a question q_j very similar to q_i (i.e. ||q_i − q_j|| is near zero).
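The two-term intuition can be sketched as follows. The first term is the squared distance from Equation 3. The second term is an assumption: the exact weighting is not recoverable from the text here, so this sketch uses 1 − tanh of the squared question distance, which is near one for near-identical questions and near zero for distant ones. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(u: Vector, v: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(u, v))


def answer_loss(
    answer_repr: Vector,
    answer: Vector,
    question: Vector,
    neighbors: List[Tuple[Vector, Vector]],
) -> float:
    """Squared distance to the paired answer plus similarity-weighted
    distances to the answers of other candidate questions.

    neighbors holds (question_vector_j, answer_vector_j) pairs.
    """
    loss = squared_distance(answer_repr, answer)  # first term (Equation 3)
    for q_j, a_j in neighbors:
        # Assumed weighting: close to 1 when q_j is almost identical to q.
        weight = 1.0 - math.tanh(squared_distance(question, q_j))
        loss += weight * squared_distance(answer_repr, a_j)
    return loss


base = answer_loss([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [])
with_twin = answer_loss([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [([0.0, 0.0], [1.0, 1.0])])
with_far = answer_loss([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [([10.0, 10.0], [1.0, 1.0])])
print(base, with_twin, with_far)
```

An identically worded neighbor contributes its full answer distance, while a very different question contributes almost nothing.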

Utility Calculator
Given a post p and an answer candidate a_k, our third step is to calculate the utility of the updated post, i.e. U(p + a_k), which measures how useful it would be if a given post p were augmented with an answer a_k. We use the intuition that a post p_i, when updated with the answer a_i that it is paired with in our dataset, would be more complete than if it were updated with some other answer a_j. Therefore we label all the (p_i, a_i) pairs from our dataset as positive (y = 1) and label p_i paired with the other nine answer candidates generated using Lucene (§3.1) as negative (y = 0). The utility of the updated post is then defined as U(p + a) = σ(F_utility(p, ā)), where F_utility is a feedforward neural network. We want this utility to be close to one for all the positively labelled (p, a) pairs and close to zero for all the negatively labelled (p, a) pairs. We therefore define our loss using the binary cross-entropy formulation below:

$$\mathrm{loss}_{util}(y, p, \bar{a}) = y \log(\sigma(F_{utility}(p, \bar{a}))) \qquad (4)$$
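The sketch below shows how such a utility and its loss might be computed from a raw network output. Equation 4 as printed shows only the positive-class term; this sketch instead uses the standard binary cross-entropy, with the leading minus sign and the negative-class term, which is an assumption about the intended loss. The `score` argument stands in for the output of F_utility, and the function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def utility(score: float) -> float:
    """U(p + a) given the raw score produced by the utility network."""
    return sigmoid(score)


def utility_loss(y: int, score: float, eps: float = 1e-12) -> float:
    """Standard binary cross-entropy between label y and sigmoid(score)."""
    u = sigmoid(score)
    return -(y * math.log(u + eps) + (1 - y) * math.log(1.0 - u + eps))


print(utility_loss(1, 5.0), utility_loss(1, -5.0))
```

The loss is small when a positive pair receives a high score or a negative pair receives a low score, matching the stated goal of utilities near one and zero respectively.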

Our joint neural network model
Our fundamental representation is based on a recurrent neural network, specifically the long short-term memory (LSTM) architecture (Hochreiter and Schmidhuber, 1997), over word embeddings obtained using a GloVe (Pennington et al., 2014) model trained on the entire data dump of StackExchange. We define three LSTMs corresponding to p, q and a, and two feedforward neural networks corresponding to our answer model F_ans(p, q) and our utility calculator F_utility(p, ā). We jointly train the parameters of all our neural network models to minimize the sum of the loss of our answer model (Eq 3) and our utility calculator (Eq 4):

$$\sum_i \mathrm{loss}_{ans}(p_i, q_i, \bar{a}_i, Q_i) + \mathrm{loss}_{util}(y_i, p_i, \bar{a}_i)$$

Given such an estimate P[a|p, q] of an answer and a utility U(p + a) of the updated post, predictions can be made by choosing the q that maximizes Eq 1.

Table 1: Results of two setups, 'Lucene negative candidates' and 'Random negative candidates', on askubuntu when trained on a combination of three domains: askubuntu, unix and superuser. We report four metrics: accuracy (percent of time the top ranked question was correct), mean reciprocal rank (the reciprocal of the ranked position of the correct question in the top 10 list), recall at 3 (percent of time the correct answer is in the top three) and recall at 5.

Dataset
StackExchange is a network of online question answering websites containing timestamped information about the posts, comments on the posts and the history of the revisions made to the posts. Using this, we create our dataset of {post, question, answer} triples, where post is the initial unedited post, question is the comment containing a question and answer is the edit made to the post that matches the question comment. We extract a total of 37K triples from the following three domains of StackExchange: askubuntu, unix and superuser.
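One way the triple extraction could look is sketched below. The matching rule used here (take the earliest comment containing a question mark that is followed by a later revision, and treat the lines added since the initial revision as the answer) is a simplifying assumption; the exact criterion for matching an edit to a question comment is not specified in the text above. The `Comment` and `Revision` records are hypothetical, not the StackExchange data dump schema.

```python
import difflib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Comment:
    time: int  # any increasing timestamp
    text: str


@dataclass
class Revision:
    time: int
    body: str


def added_text(old: str, new: str) -> str:
    """Lines present in the new body but not in the old one."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return "\n".join(line[2:] for line in diff if line.startswith("+ "))


def extract_triple(
    revisions: List[Revision], comments: List[Comment]
) -> Optional[Tuple[str, str, str]]:
    """Return (post, question, answer) or None if no match is found."""
    if not revisions:
        return None
    ordered = sorted(revisions, key=lambda r: r.time)
    initial = ordered[0]
    for comment in sorted(comments, key=lambda c: c.time):
        if "?" not in comment.text:
            continue  # assumed heuristic: only comments that look like questions
        for rev in ordered[1:]:
            if rev.time > comment.time:
                return initial.body, comment.text, added_text(initial.body, rev.body)
    return None


revs = [
    Revision(0, "How do I set env vars?"),
    Revision(20, "How do I set env vars?\nI am on Ubuntu 14.04."),
]
coms = [Comment(5, "Thanks!"), Comment(10, "What version of Ubuntu do you have?")]
triple = extract_triple(revs, coms)
print(triple)
```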

Experimental Setups
We define our task as follows: given a post and 10 question candidates, select the correct question candidate. For every post p in our dataset of (p, q, a) triples, the question q paired with p is our positive question candidate. We define two approaches to generate negative question candidates. Lucene Negative Candidates: we retrieve nine question candidates using Lucene (§3.1). Random Negative Candidates: we randomly sample nine other questions from our dataset.

Primary Research Questions
The primary research questions that we evaluate experimentally are: a. Does a neural architecture improve upon a simple bag-of-ngrams baseline? b. Does the EVPI formalism provide leverage over a similarly expressive feed-forward network? c. How much harder is the task when the negative candidate questions come from Lucene rather than being selected randomly?

Baseline Methods
Random: Randomly permute the set of 10 candidate questions uniformly. Bag-of-ngrams: Construct a bag-of-ngrams representation for the post, the question and the answer and train a classifier to minimize hinge loss on misclassification. Feed-forward neural: Concatenate the post LSTM representation, the question LSTM representation and the answer LSTM representation and feed the result through a feed-forward neural network with two fully-connected hidden layers.

Results
We describe results on a test split of askubuntu when our models are trained on the union of all data, summarized in Table 1. The left half of this table shows results when the candidate set is from Lucene (the "hard" setting) and the right half shows the same results when the candidate set is chosen randomly (the "easy" setting).
Here, we see that for all the evaluation metrics, EVPI outperforms all the baselines by at least a few percentage points. A final performance of 51% recall at 3 in the "hard" setting is encouraging, though clearly there is a long way to go for a perfect system.
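The four metrics reported in Table 1 can be computed roughly as follows, assuming each test example provides one model score per candidate question plus the index of the gold question. This input format and the function names are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def rank_of_gold(scores: Sequence[float], gold_index: int) -> int:
    """1-based rank of the gold candidate when sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(gold_index) + 1


def evaluate(examples: List[Tuple[Sequence[float], int]]) -> Dict[str, float]:
    ranks = [rank_of_gold(scores, gold) for scores, gold in examples]
    n = len(ranks)
    return {
        "accuracy": sum(r == 1 for r in ranks) / n,   # gold ranked first
        "mrr": sum(1.0 / r for r in ranks) / n,       # mean reciprocal rank
        "recall@3": sum(r <= 3 for r in ranks) / n,   # gold in top three
        "recall@5": sum(r <= 5 for r in ranks) / n,   # gold in top five
    }


examples = [
    ([0.9, 0.1, 0.2, 0.3, 0.4], 0),  # gold ranked 1st
    ([0.5, 0.9, 0.8, 0.7, 0.1], 0),  # gold ranked 4th
]
metrics = evaluate(examples)
print(metrics)
```

In this toy case the gold ranks are 1 and 4, so accuracy and recall at 3 are both 0.5, recall at 5 is 1.0, and the mean reciprocal rank is (1 + 0.25) / 2 = 0.625.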

How good are humans at this task?
In this section we address two natural questions: (a) How does the performance of our system compare to a human solving the same task? (b) Just because the system selects a question that is not the exact gold standard question, is it certainly wrong?
To answer these questions, we had 14 computer science graduate students perform the task on 50 examples. Most of these graduate students are not experts in unix or ubuntu, but are knowledgeable. Given a post and a randomized list of ten possible questions, they were instructed to select what they thought was the single best question to ask, and additionally mark as "valid" any additional questions that they thought would also be okay to ask. We also asked them to rate their confidence in {0, 1, 2, 3}. Most found this task quite challenging because many of the questions are about subtle nuances of operating system behavior. The annotators' accuracy on the "hard" task of Lucene-selected questions was only 36%, significantly better than our best system (23%), but still far from perfect. When we limited the evaluation to those examples on which they were more confident (confidence of 2 or 3), their accuracy rose to 42%, but never surpassed that. A major problem for the human annotators is the amount of background knowledge required to solve this problem. On an easier domain, or with annotators who are truly experts, we might expect these numbers to be higher.

Proposed Research Directions
In our preliminary work, we focus on the question selection problem, i.e. selecting the right clarification question from a set of prior questions. To enable our system to generalize well to new contexts, we propose two future research directions:

Template Based Question Generation
Consider a template like "What version of ___ are you running?". This template can generate thousands of specific variants found in the data like "What version of Ubuntu are you running?", "What version of apt-get are you running?", etc. We propose the following four-step approach to our template-based question generation method:
1. Cluster questions based on their lexical and semantic similarity.
2. Generate a template for each cluster by removing topic-specific words from questions.
3. Given a post, select a question template from a set of candidate question templates using a model similar to our preliminary work.
4. Finally, fill in the blanks in the template using topic-specific words retrieved from the post.
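Steps 2 and 4 of this proposal can be sketched with simple string operations. The sketch assumes a given set of topic-specific words (how to identify them is left open in the proposal) and uses a hypothetical `___` blank marker. It does not cover the clustering or template selection steps.

```python
import re
from typing import Iterable

BLANK = "___"  # hypothetical marker for a template slot


def make_template(question: str, topic_words: Iterable[str]) -> str:
    """Replace every topic-specific word in the question with a blank."""
    words = sorted(topic_words, key=len, reverse=True)  # prefer longer matches
    if not words:
        return question
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE
    )
    return pattern.sub(BLANK, question)


def fill_template(template: str, word: str) -> str:
    """Fill the first blank with a topic word retrieved from the post."""
    return template.replace(BLANK, word, 1)


topic_words = {"ubuntu", "apt-get"}
t1 = make_template("What version of Ubuntu are you running?", topic_words)
t2 = make_template("What version of apt-get are you running?", topic_words)
print(t1)
print(fill_template(t1, "Ubuntu"))
```

Both example questions collapse to the same template, which is the property that lets one template stand for many specific variants.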

Neural Network Generative Model
Sequence-to-sequence neural network models have proven to be effective for several language generation tasks such as machine translation (Sutskever et al., 2014), dialog generation, etc. These models are based on an encoder-decoder framework in which the encoder takes in a sequence of words and generates a vector representation, which is then taken in by a decoder to generate the output sequence of words.
Along similar lines, we propose a model for generating the clarification question one word at a time, given the words of a post. A recent neural generative question answering model (Yin et al., 2016) built an answer language model which decides, at each time step, whether to generate a common vocabulary word or an answer word retrieved from a knowledge base. Inspired by this work, we propose to build a question generation model which will decide, at each time step, whether to generate a common vocabulary word or a topic-specific word retrieved from the current post, thus incorporating the template-based method into a more general neural network framework.
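One possible way to realize the per-step decision is a soft mixture: a gate value in [0, 1] weights a distribution over common vocabulary words against a distribution over words copied from the post. This is only an illustrative assumption, since the proposal does not fix the mechanism, and the distributions below are toy values rather than model outputs.

```python
from typing import Dict


def mix_distributions(
    p_vocab: Dict[str, float], p_copy: Dict[str, float], gate: float
) -> Dict[str, float]:
    """Combine a vocabulary distribution and a copy-from-post distribution.

    gate is the probability of generating from the common vocabulary;
    1 - gate is the probability of copying a word from the post.
    """
    words = set(p_vocab) | set(p_copy)
    return {
        w: gate * p_vocab.get(w, 0.0) + (1.0 - gate) * p_copy.get(w, 0.0)
        for w in words
    }


p_vocab = {"what": 0.6, "version": 0.4}        # common vocabulary words
p_copy = {"ubuntu": 0.7, "variables": 0.3}     # words found in the post
step = mix_distributions(p_vocab, p_copy, gate=0.25)
next_word = max(step, key=step.get)
print(next_word)
```

With a low gate value the model favors copying, so the topic word from the post is chosen at this step, which mirrors filling a template blank.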