Learning to Ask for Conversational Machine Learning

Natural language has recently been explored as a new medium of supervision for training machine learning models. Here, we explore learning classification tasks using language in a conversational setting – where the automated learner does not simply receive language input from a teacher, but can proactively engage the teacher by asking questions. We present a reinforcement learning framework, where the learner’s actions correspond to question types and the reward for asking a question is based on how the teacher’s response changes performance of the resulting machine learning model on the learning task. In this framework, learning good question-asking strategies corresponds to asking sequences of questions that maximize the cumulative (discounted) reward, and hence quickly lead to effective classifiers. Empirical analysis across three domains shows that learned question-asking strategies expedite classifier training by asking appropriate questions at different points in the learning process. The approach allows learning classifiers from a blend of strategies, including learning from observations, explanations and clarifications.


Introduction
Figure 1: Question-Answer dialog can enable learning from a mix of strategies, including label observations (traditional supervised learning), explanations and clarifications (to overcome parsing limitations). The output from the teacher-learner interaction is a classification model (here, for important emails). We present a framework that (a) enables learning classifiers from a mix of such supervision; (b) learns to ask appropriate sequences of questions to accelerate this.
The ability to learn new tasks and behaviors from language is characteristic of human intelligence. In recent years, the fields of machine learning and NLP have seen a renewed interest in incorporating natural language supervision in models of machine intelligence (Narasimhan et al., 2015; Elhoseiny et al., 2013; Goldwasser and Roth, 2014; Fried et al., 2018; Wang et al., 2016). In particular, methods such as BabbleLabble (Hancock et al., 2018) and LNL (Srivastava et al., 2017) show progress towards realistic applications of supervised learning from language on tasks such as information extraction and email categorization. However, until now, such methods have been limited in two ways.
First, despite a body of work on leveraging language for tasks involving human-robot interaction (She and Chai, 2017; Cakmak and Thomaz, 2012; Krishnamurthy and Kollar, 2013) and interactive learning in non-linguistic settings (see Section 2), existing approaches for training machine learning models from language are largely non-interactive, i.e., the learner agent receives statically collected text-based advice from a teacher as input, but does not directly engage with the teacher.¹ In comparison, when humans learn, they do not rely only on passively receiving instruction from a teacher. Rather, the interaction takes the form of a mixed-initiative dialog, where they ask questions and proactively seek clarifications to simplify learning. These questions can generalize learning to novel situations, explore hypotheses, or fill information gaps. The ability to ask questions can, thus, fundamentally facilitate learning.
Second, existing approaches have focused on using language either as a standalone replacement for labeled data (Hancock et al., 2018), or to drive learning, for example by specifying features for learning tasks (Eisenstein et al., 2009). In contrast, many realistic scenarios of learning from language would involve not learning from language alone, but learning from a mix of supervision, including both traditional labeled data and natural language advice. Thus, automated learners should be capable of learning from a blend of observations, explanations and clarifications.
In this work, we introduce a framework for learning from language in a conversational setting (LiD, for Learning with Interactive Dialog), which is a step towards alleviating these shortcomings. Language provides a natural medium for conversational interactions between a learner and a teacher, specifically in the form of question-answer dialog. The premise driving our work is that the ability to ask questions can be leveraged by an automated learner to accelerate its learning. We explore a data-driven approach for learning effective question-asking strategies in the specific context of learning classification tasks. The signal for learning to ask questions is grounded in the learning task itself, i.e., the value of a question is evaluated in utilitarian terms of how it affects performance on a downstream classification task. This follows a Wittgensteinian view of language as a cooperative game (Wittgenstein, 1953) between agents (here, the teacher and a learner) with a shared goal (here, building an effective classifier). While the space of questions that an interactive learner can ask can be vast in general, here we focus on leveraging interactivity for three specific aspects (highlighted in Figure 1):
1. Asking for labels of specific examples.
2. Asking for explanations of a concept.
3. Requesting clarifications about explanations.
¹ Zhang et al. (2018) diverge from prior work in this respect, and model language games between teachers and learners. However, their learning tasks are toy-like, and the method does not generalize to realistic scenarios.
As illustrated in Figure 1, these dimensions can facilitate multiple aspects of the learning process, including learning from labeled examples (similar to traditional supervised learning), learning from natural language explanations (similar to recent work on learning from explanations) and alleviating limitations in the learner's semantic parsing abilities (in vein with work such as ). Learning systems that reify these abilities can enable users to interactively teach new concepts using a blend of traditional and language-based supervision. Our contributions are:
• A reinforcement learning formulation to guide question-asking strategies for learning from language.
• A method for interactively training classifiers using a mix of labeled data, natural language explanations and clarifications.
Our exploration highlights some of the challenges involved in interactive learning from language.

Challenges in Relation to Previous Work
From the perspective of traditional supervised learning, the problem of asking questions can be seen as cognate with active learning. Methods in active learning have explored various criteria for choosing which of a set of unlabeled examples to label next while training supervised machine learning models (Settles, 2012; Collins et al., 2008). This can be seen as asking a specific kind of question (as illustrated in Figure 1). Learning to ask questions generalizes active learning in multiple ways, by possibly soliciting a wider range of data measurements. These include feature labels ('Are emails with subject "urgent" usually important?'), label proportions ('Around what fraction of emails are important?'), constraints on model expectations ('Are you more likely to reply to important emails?'), etc. Approaches such as Srivastava et al. (2018) map such language to data measurements that computational models can reason over. Statistical frameworks such as Generalized Expectation (Druck et al., 2008), Posterior Regularization (Ganchev et al., 2010) and Bayesian Measurements (Liang et al., 2009) then allow for model training from a broad range of such data measurements in conjunction with unlabeled data, rather than using labeled examples. Other recent approaches such as Huang et al. (2015) and Siddiquie and Gupta (2010) have expanded pool-based active learning to learning from multiple types of queries, especially in the context of multi-label and multi-class learning. Similarly, Parikh and Grauman (2011) explore feature space construction for visual tasks in an interactive setting. Although in principle soliciting different types of data measurements can help learning, each type requires its own interface. The advantage of using natural language as a medium is that it allows us to unify the different modes of supervision into a single, familiar user interface.
However, using natural language as a medium of supervision comes with its own set of challenges, as we discuss next.

Dependence on Language Interpreter
Since both generation and transmission of language advice can be noisy, the optimal question-asking strategy may depend not only on the information content of data measurements, but also on factors such as the quality of the learner's semantic parsing model and the teacher's skill.³ To explain, while useful in an information-theoretic sense, a teacher's explanations may be too complex for the learning agent's parser to handle, in which case it might be preferable to stick to asking the teacher about instance labels (which would require minimal parsing). Thus, question-asking strategies need to be sensitive to the learner's own semantic parsing ability, which may also change during the course of interactions with users.

Context Dependence
Rather than learning a static criterion for choosing what question to ask (as in active learning), our focus is on asking questions in conversational settings, which are inherently dynamic processes. To explain, asking a teacher to rephrase an explanation only makes sense in specific contexts (when the interpretation of something said previously is unclear). Further, the question to ask can depend on factors such as the task domain, supervision previously received, etc. These factors motivated our choice of a reinforcement learning approach for learning question-asking strategies.
³ In this work, we are not interested in learning semantic parsing models. We presume the existence of pretrained semantic parsers for learning agents. Our focus is rather on whether some question-asking strategies may be more effective than others for a learning agent with given capabilities.

Relation to Question Generation approaches:
The problem of learning to ask questions has previously been explored by several approaches. Vanderwende (2008) and Olney et al. (2012) explore generating reading comprehension questions conditioned on a given text. More recently, Romeo et al. (2016) and Rao and Daumé III (2018) present neural network models that rank questions in community QA forums, whereas Misra et al. (2018) generate questions for visual scene understanding. Our framework differs significantly from these in its sequential nature, and in that the questions asked are grounded in quantitative performance on a downstream task.

Approach
In this section, we describe our framework for interactive learning from question-answer dialog. We first describe an approach to learn classifiers using a mix of explanations and labeled examples in Section 3.1. This is a preliminary step towards question-asking strategies that subsume active learning as well as language advice, and constitutes a subroutine that is repeatedly invoked in our approach. Section 3.2 describes our reinforcement learning formulation for learning question-asking strategies in simulated conversational settings. The space of actions consists of a vocabulary of question types that a learner can ask, and the reward is based on the improvement in the classification model that results from the teacher's response to an asked question.

Learning classifiers from a mix of observations and explanations
We base our learning framework on previous work by Srivastava et al. (2018), who train log-linear classifiers (with parameters θ) using natural language explanations of the individual classes and unlabeled data. Further, they use the semantics of linguistic quantifiers (such as 'usually', 'always', etc.) as priors in a Posterior Regularization objective to drive the model training. In particular, their training objective takes the following form:
$$J_Q(\theta) = \mathcal{L}_{\text{unlabeled}}(\theta) - \min_{q \in Q} \mathrm{KL}\big(q \,\|\, p_\theta(Y \mid X)\big) \qquad (1)$$
The objective reflects a tension between explaining the unlabeled data (likelihood term) and emulating the natural language advice provided by a teacher. The KL divergence measures the difference between the trained model's predictions on unlabeled data, p_θ(Y|X), and the language advice (each explanation is incorporated as a data measurement; the conjunction of these defines the 'valid set' of posterior distributions Q that perfectly concur with the natural language advice). The second term essentially computes the minimum distance between the model posterior and the set Q.
Figure 2: We assume that the dialog between the learner and the teacher is in the form of turn-wise conversations consisting of a sequence of questions asked by the learner, and the teacher's responses to those questions. At each step in this process, the teacher's response is parsed by the learner (using a pre-trained semantic parser), and can be incorporated into the learner's concept model as either a labeled example or a data measurement (the learner can also choose to seek a clarification). A reward (denoted by r_t) can be computed at each step, which denotes the marginal change in classification performance on a held-out set of examples due to the last response. In this framework, learning good question-asking strategies corresponds to asking sequences of questions that maximize the cumulative (discounted) reward, and hence quickly lead to effective concept models. The framework also allows for asking sequences of multiple questions before seeing a major jump in model performance.
Here, we show that we can naturally extend this approach to learn classifiers from a mix of both labeled and unlabeled data, and natural language explanations. To do this, we simply append a log-likelihood term for the labeled examples to the objective in Equation 1, giving the updated objective (Equation 2). Here, L_labeled(θ) denotes the log-likelihood term for a set of n_labeled labeled examples X_labeled = {(x_k, y_k)}_{k=1}^{n_labeled} (normalized by n_labeled), whereas the other two terms are as before: L_unlabeled(θ), denoting the log-likelihood over a set of n_unlabeled unlabeled examples, and a posterior regularizer term (KL divergence) penalizing violations of the parsed natural language advice. In the E-step of the Posterior Regularization training (Ganchev et al., 2010), the computation of the posterior regularizer remains unchanged. However, the M-step is modified so that the classifier parameters θ are learned using both the inferred labels for the unlabeled data and the provided labels for the labeled examples.
In Equation 2, µ > 0 determines the relative weights of provided example labels and natural language advice in the optimization objective and is a hyper-parameter for the method. In learning scenarios where there is little labeled data, we would like to rely primarily on constraints specified from natural language explanations, and on unlabeled data. On the other hand, in scenarios where a lot of labeled data is available, enabling robust inductive inference, we would like to rely primarily on it rather than on explanations. While setting up the optimization problem, the value of µ can be adapted to reflect this intuition. In our experiments, we found setting µ = 1/max(n_labeled, 1) to work well across settings.
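As a sketch only, one way to write a combined objective consistent with the description above is given below. The placement of µ as a weight on the unlabeled-data and regularizer terms is an assumption made here for illustration; the text only specifies that µ trades off provided labels against language advice.

```latex
% Assumed form: \mu scales the terms that do not depend on provided labels.
J(\theta) = \mathcal{L}_{\text{labeled}}(\theta)
  + \mu \Big[ \mathcal{L}_{\text{unlabeled}}(\theta)
  - \min_{q \in Q} \mathrm{KL}\big(q \,\|\, p_\theta(Y \mid X)\big) \Big],
\qquad
\mu = \frac{1}{\max(n_{\text{labeled}},\, 1)}
```

Under this schedule, µ = 1 when no labeled examples are available, and µ = 0.1 once ten labeled examples have been collected, so reliance on language advice shrinks as labels accumulate.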

RL formulation for learning to ask
Figure 2 illustrates our framework for learning classification tasks in a question-answer dialog setting. We assume the presence of a teacher to answer questions posed by the learner. We restrict the structure of dialog to a sequence of questions (q_1, ..., q_T) asked by the learner, and the teacher's responses (u_1, ..., u_T) to them. We further assume the presence of a held-out set of labeled examples of the concept, which can be used to evaluate the learner's classification performance as the dialog progresses. At each step t, the learner's action a_t consists of choosing a question to ask the teacher. The teacher's response to the learner's question is parsed (in the form of a labeled example, or a data measurement), and is then incorporated into the learner's concept model (by retraining with the additional labeled example or the new data measurement). The classification performance, c_t, of the updated model is evaluated on the held-out set. The change in classification performance from the previous step, r_t, approximates the marginal value of the question in learning the task, and constitutes the learner's reward at that step.
Our approach for learning question-asking strategies models the dialog as a simple Markov Decision Process. Since our state and action spaces are discrete (as described in the following sections), we can use a table-based SARSA learning procedure (which, unlike Q-learning, is on-policy) to estimate the state-action values Q(s, a) of different question types in different contexts. We next describe the state space, actions and rewards, and the learning procedure.

Action Space
As mentioned earlier, there can be a multitude of questions that a learner can ask a teacher. Here, we are interested in exploring three specific types of questions that are especially germane for facilitating learning from a mix of labeled examples and explanations. These consist of the following:
1. Seeking labels for specific examples: This is similar to traditional active learning. In particular, we can have a different action corresponding to every active learning criterion, each of which chooses which example to label next. In our experiments, we use two active learning techniques (with a corresponding action for each):
• Random: Ask for the class label of a randomly chosen unlabeled instance.
• Maximum Uncertainty: Ask for the class label of the instance in the data for which the current concept model is most uncertain (highest entropy). If there are multiple such instances, randomly pick one among them.

2. Asking for an explanation for the concept: This action seeks out from the teacher a short natural language explanation of the concept. This is then incorporated in the concept model as a quantitative constraint. In general, this can encompass several types of questions:
• Asking for probability estimates about specific labels and features, e.g., 'How often are emails about meetings important?'
• Asking for discriminative features for particular concept labels, e.g., 'Can you think of a feature that if present always denotes that an email is important?'
• Asking about class probabilities, e.g., 'Around what fraction of emails in your inbox are important?'
In principle, each of the above provides admissible constraints which the classifier training procedure (from Section 3.1) can handle. However, to simplify analysis, in our experiments we conflate these actions into one category, and ask for general explanations of the form 'Can you give me an explanation of the concept?', which could return a variety of data measurements (subsuming P(y), P(y|x) and P(x|y), previously explored in Srivastava et al. (2018)).
3. Requesting clarification about the previous explanation: This action asks for a clarification about the interpretation of a previous explanation (which can be helpful in cases when the learner is uncertain about the interpretation). For this, the learner verifies if the interpretation of the previous explanation (using the learner's semantic parser) was correct or not. To do this, we generate a question of the form 'Did you mean . . .' using a synchronous grammar which deterministically maps logical forms to natural language descriptions (see Figure 1 for an example). The teacher responds with yes, if the parsed logical form matches the gold annotated logical form, and with no otherwise. In case the teacher responds with no, the current explanation is discarded (not used in model training), and the learner moves ahead to ask for a new label or explanation.
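The maximum-uncertainty criterion from the first question type can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `posteriors` input (one predicted class distribution per unlabeled instance) are assumptions introduced here.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_most_uncertain(posteriors, rng=random, tol=1e-12):
    """Return the index of the instance with the highest predictive entropy.

    posteriors: list of per-instance class distributions from the current
    concept model (hypothetical input format). Ties are broken uniformly
    at random, as described in the text.
    """
    scores = [entropy(p) for p in posteriors]
    best = max(scores)
    candidates = [i for i, s in enumerate(scores) if best - s <= tol]
    return rng.choice(candidates)
```

For a binary task, an instance predicted at (0.5, 0.5) would be selected ahead of one predicted at (0.9, 0.1).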
Simulating Interactions: We note that each of the question types described above, namely (1) asking for labels for examples, (2) asking for concept explanations, and (3) verifying interpretations of language explanations, can be simulated with corresponding statically collected data, consisting of (1) labeled examples for classification, (2) natural language explanations of classes, and (3) annotations of those explanations with logical forms. This has a significant implication: rather than relying on questioning human users in real time, we can simulate the conversational exchange by asking questions to an oracle, which has access to pre-collected data of the above-described form for each classification task. While this is a coarse approximation of actual dialog between an automated learner and human teachers, it can serve as a useful proof-of-concept, and allows for quick experimentation. We rely on this simulated setting for learning policies for question asking.
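A simulated teacher of this kind might be organized as below. This is a hedged sketch: the class name `OracleTeacher`, its fields, and the convention of returning `None` once explanations are exhausted are all assumptions made for illustration, and logical forms are represented as opaque values compared by equality.

```python
from dataclasses import dataclass, field


@dataclass
class OracleTeacher:
    """Simulated teacher backed by statically collected data (hypothetical layout)."""
    labels: dict        # example id -> gold class label
    explanations: list  # natural language explanations, in collected order
    gold_forms: dict    # explanation text -> gold annotated logical form
    _next: int = field(default=0, init=False)

    def label(self, example_id):
        """Answer a request for the label of a specific example."""
        return self.labels[example_id]

    def explain(self):
        """Return the next unused explanation, or None when none remain."""
        if self._next >= len(self.explanations):
            return None
        text = self.explanations[self._next]
        self._next += 1
        return text

    def confirm(self, explanation, parsed_form):
        """Answer a 'Did you mean ...' question about a parsed explanation."""
        return "yes" if self.gold_forms[explanation] == parsed_form else "no"
```

Each of the three methods corresponds to one question type, so a learner can be run against the oracle without any human in the loop.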

Rewards
The reward, r_t, evaluates the change in classification performance due to an asked question at each step t of the dialog. The performance of the classification model is evaluated on a held-out set of n_heldout = 50 labeled examples for each learning task. In our experiments, we use the model's F1 score as the metric for classification performance, c_t. We define the reward as the absolute change in model performance from the previous step: r_t = c_t − c_{t−1}.
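The per-step reward and the cumulative discounted reward that the learner maximizes can be sketched as below. The signed difference, the `initial_score` argument and the discount factor `gamma=0.9` are assumptions for illustration; the text does not state a discount value.

```python
def step_rewards(scores, initial_score=0.0):
    """Per-step rewards: change in held-out performance after each question.

    scores: held-out performance c_1, ..., c_T after each teacher response.
    initial_score: performance before any question is asked (assumed).
    """
    rewards = []
    prev = initial_score
    for c in scores:
        rewards.append(c - prev)
        prev = c
    return rewards


def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward: sum over t of gamma**t * r_t."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With a discount below 1, a policy that reaches a given F1 score in fewer questions receives a higher return than one that reaches the same score later.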

State-space
Next, we describe the featurized state space for our reinforcement learning formulation. The best question to ask at a particular point can likely depend on the state of the conversation. This could include factors such as the pedagogical phase in the learning process (exploratory vs confirmatory), previous questions asked, etc. Thus, defining a rich enough state space is an important consideration for a formulation of conversational learning. In our treatment, we assume a discrete state-space, which is defined as the cross product of the following (also discrete) features.
• Curricular stage in the learning process: We use a discrete variable to model the curricular stage of the learner, approximating it by the number of steps (questions previously asked) in the interaction at any point. We cluster the number of steps into the following five bins of values: BEGINNER (0 steps), NOVICE (1-5 steps), INTERMEDIATE (6-10 steps), ADVANCED (11-15 steps) and MATURE (> 15 steps).
• Reward in the previous state: We discretize the value of reward as belonging to one of four ranges (abstractly named GOOD, INCREASING, FLAT, and DECREASING), with thresholds chosen to correspond to the inter-quartile ranges for the value of rewards observed in evaluating a random policy.
• Velocity of reward: This is a ternary variable indicating whether the value of the discrete variable for the reward (above) is BETTER, WORSE or the SAME compared to the previous step.
• Type of the previous two actions: As mentioned in the description of the action space.
• Domain of learning task: Indicates the domain of the current classification task. We use datasets corresponding to three domains: EMAIL CATEGORIZATION, SHAPE CLASSIFICATION and BIRD SPECIES IDENTIFICATION; hence this variable can take these three values.
• Confidence of previous parse: We model the learner's confidence in parsing the response u_t from a teacher as the ratio of the probability of the highest-probability (predicted) logical form from the learner's semantic parser to that of the next best logical form. We discretize this ratio into three values, corresponding to the upper (HIGH), lower (LOW) and middle two interquartile ranges (MEDIUM) for the value of the ratio over all explanations in our data.
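The discretization above can be sketched as a function from raw dialog quantities to a hashable state tuple. The function names, the quartile cut-off parameters `low_cut` and `high_cut`, and the handling of boundary values are assumptions for illustration; the reward bin and velocity are passed in already discretized.

```python
def curricular_stage(num_steps):
    """Bin the number of questions asked so far into the five stages."""
    if num_steps <= 0:
        return "BEGINNER"
    if num_steps <= 5:
        return "NOVICE"
    if num_steps <= 10:
        return "INTERMEDIATE"
    if num_steps <= 15:
        return "ADVANCED"
    return "MATURE"


def parse_confidence(best_prob, second_prob, low_cut, high_cut):
    """Discretize the ratio of the top two parse probabilities.

    low_cut and high_cut stand for the first and third quartiles of the
    ratio over all explanations (hypothetical parameters).
    """
    ratio = best_prob / second_prob if second_prob > 0 else float("inf")
    if ratio < low_cut:
        return "LOW"
    if ratio > high_cut:
        return "HIGH"
    return "MEDIUM"


def make_state(num_steps, reward_bin, velocity, prev_actions, domain, confidence):
    """Cross product of the discrete features, as a hashable tuple."""
    return (curricular_stage(num_steps), reward_bin, velocity,
            tuple(prev_actions), domain, confidence)
```

Because the result is a plain tuple, it can be used directly as a key in a Q-table.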
While our state space captures several facets, it does not model some other important factors:
• Teacher behavior: Whether the teacher provides correct information, and uses easily interpretable language.
• Task difficulty: This refers to how expressible a classification task is using language explanations. For example, some concepts may be significantly easier to explain using language than others, depending on the logical language available to the semantic parser. For instance, it may be impossible to use language to explain digit recognition in terms of pixel-level features.

Model training
Since the action and state spaces are discrete and not prohibitively large, we use the on-policy SARSA learning algorithm for policy control (Rummery and Niranjan, 1994), where we represent the state-action Q-values for pairs of states s and actions a as a table. We use an ε-greedy strategy (ε = 0.20), i.e., the strategy balances between exploitation and exploration by picking the next action to be the estimated optimal one (having the maximum estimated Q-value for a state) with probability 1 − ε, and choosing the next action randomly with probability ε. The initial policy is defined by initializing the Q(s, a) values uniformly at random between 0 and 1.
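A tabular SARSA learner with ε-greedy action selection can be sketched as follows. Only ε = 0.20 and the random initialization in [0, 1) come from the text; the class name, the learning rate `alpha`, the discount `gamma` and the lazy initialization of table entries are assumptions for illustration.

```python
import random
from collections import defaultdict


class SarsaAgent:
    """Tabular SARSA with epsilon-greedy exploration (illustrative sketch)."""

    def __init__(self, actions, epsilon=0.20, alpha=0.1, gamma=0.9, seed=0):
        self.actions = list(actions)
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma
        self.rng = random.Random(seed)
        # Q(s, a) entries are created on first access, uniformly in [0, 1).
        self.q = defaultdict(self.rng.random)

    def act(self, state):
        """Explore with probability epsilon, otherwise pick the greedy action."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next, a_next, done=False):
        """On-policy update using the action actually chosen in s_next."""
        target = r if done else r + self.gamma * self.q[(s_next, a_next)]
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```

The update bootstraps from the action the policy actually takes next, which is what distinguishes SARSA from Q-learning's maximization over next actions.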

Data
Our empirical analysis uses existing datasets for learning classifiers from language. These are datasets for email categorization from natural language explanations from Srivastava et al. (2017), and bird species classification and synthetic shape classification tasks from Srivastava et al. (2018). In all, these consist of 67 classification tasks belonging to these three domains. For each task, the corresponding data, consisting of natural language explanations of classes as well as annotated logical forms for these explanations, is available.
For each task, we hold out a random sample of 50 examples for evaluating classifier performance. The rest of the examples (ranging between 50 and 100 for individual tasks) are considered as unlabeled data at the start of each interaction. At each step in a simulated interaction, either (1) the label

Experiments
In this section, we evaluate LiD's performance for three domains of classification tasks. As previously mentioned, for policy learning we simulate conversational interactions between teachers and learners by asking questions to an oracle, which has access to previously collected data about each classification task. One limitation of learning question-asking strategies from simulated interactions is that for some classification tasks, we may run out of explanations in the course of model training (since the number of explanations of a concept is limited). In these cases, we end the interaction as soon as all explanations have been provided to the learner.

Learned vs Random policy evaluation
First, we compare learned policies for question asking with a naive policy that randomly takes a new action at each step in the learning process. Figure 3 shows averaged cumulative reward for question asking strategies on 20 unseen classification tasks, after SARSA learning for 10 epochs on the remaining 37 classification tasks. In the figure, the x-axis corresponds to the number of steps in a dialog, and the y-axis denotes the cumulative reward (averaged over 20 tasks) for a learned policy vs a random policy. We note that policy learning leads to consistently superior performance (on average, LiD achieves any given level of classification performance in fewer steps), which unambiguously indicates value in asking the right sequences of questions. We observed that the trend was also similar in most individual learning tasks. For example, the cumulative reward after 10 steps was higher for the learned policy than the random policy for 18 of the 20 learning tasks. The difference in performance was statistically significant at p < 0.05 using a signed permutation test. We characterize some learned behaviors that drive this improved performance in Section 5.4.
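A sign-flip permutation test on paired per-task differences (learned minus random cumulative reward) can be sketched as below. Whether the reported test was one- or two-sided, exact or sampled, is not stated; the two-sided Monte Carlo variant, the function name and the sample count here are assumptions.

```python
import random


def signed_permutation_test(diffs, num_samples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    diffs: per-task differences between the two policies.
    Returns an estimated p-value under the null hypothesis that the
    differences are symmetric about zero.
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(num_samples):
        # Randomly flip the sign of each paired difference.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (num_samples + 1)
```

The test only uses the paired differences, so it makes no assumption about the distribution of rewards across tasks.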

Reliance on NL vs Parsing accuracy
Intuitively, the semantic parsing competence of a learner should be a significant consideration in whether it should rely on explanations. To test this, we simulate scenarios of learners with different levels of semantic parsing ability by choosing the true logical form for any explanation with a probability equal to the simulated parsing accuracy, and choosing an alternative logical form from the remaining candidates in the k-best list from the semantic parser otherwise. Figure 4 depicts the effect of parsing competence on learned question-asking strategies. The empirical behavior largely corroborates our expectation, as the learned strategies increasingly avoid seeking natural language explanations of concepts as the parsing performance worsens. In the base case where the learner has no parsing competence, the model learns to ask almost exclusively for labeled examples (in the figure, the fraction is seen to converge close to 0.1 rather than 0 due to the ε-greedy nature of the policy).
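The simulation of a parser with a chosen accuracy can be sketched as follows. The function name, the uniform choice among alternatives, and the fallback to the gold form when the k-best list contains no alternative are assumptions for illustration.

```python
import random


def simulated_parse(gold_form, k_best, accuracy, rng=random):
    """Return the gold logical form with probability `accuracy`; otherwise
    return a different candidate drawn uniformly from the k-best list."""
    alternatives = [form for form in k_best if form != gold_form]
    if not alternatives or rng.random() < accuracy:
        return gold_form
    return rng.choice(alternatives)
```

Setting `accuracy` to 0 reproduces the base case of a learner with no parsing competence, while 1 corresponds to a perfect parser.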

Differential value of explanations
A notable issue in learning from explanations, which we do not model here, is that a teacher's multiple explanations of a concept can have a large variance in their utility to a learner. In particular, we might expect that teachers would be more likely to provide the most useful explanations first, and minor explanations subsequently. From an ablation study, we observe that this is indeed a valid concern. Figure 5 shows the average marginal increase in classification performance over 50 visual shape classification tasks from explanations with different ranks (based on the order in which human users actually provided them). This indicates that explanations provided later by teachers contribute significantly less towards classification performance.

Examples of Learned behavior
From a qualitative analysis, learned policies are seen to be intuitive and interpretable. In particular, the policies overwhelmingly seek clarifications when confidence in the parser is low. On the other hand, there are strong inclinations to continue using an action type as long as it yields high returns. Interestingly, the optimal policy differs significantly in behavior across domains. As an example, learned policies rely nearly twice as much on explanations for bird species identification as for email classification tasks. The probable reason is that parsing is harder for email explanations, as features in this domain are often compositional.

User study
We perform a small user study to also evaluate the performance of the learned policies on an email categorization task with actual human teachers. We still train the question-asking strategy using the simulated teacher framework (since learning the policy from crowdsourced human users would be expensive and slow). Twenty users were asked to interact with the learned LiD policy to teach a chosen email-classification task. For each task, the system asked a sequence of 10 questions, and the human teacher's responses were incorporated into the system to update the classification model. The users were also asked to teach another task with questions asked through a random policy. Table 1 shows the average cumulative reward for humans interacting with LiD vs a random policy for this experiment. We note that LiD leads to better performance on average. This trend is the same as in the simulated analysis, although we note that the learning is slower with real teachers than in the simulated setting on the same tasks, and the gain in performance is substantially smaller. A contributing reason for this is likely annotator bias (Geva et al., 2019), since in the simulated testing scenarios, the teacher's explanations often come from a small set of turkers whose language explanations for teaching other tasks were used for training the learner's semantic parsing model. We note that the learned policy was rated by human users as more natural than a random policy on a Likert scale (with range 1-5).

Conclusion
In this paper, we have provided a reinforcement learning formulation for learning to ask questions for interactive training of machine learning models. This framework is attractive in grounding the value of questions asked in a measurable downstream task. Further, change in model performance is a natural reward to drive this learning. While this provides a conceptually useful framework for framing question generation, in its current form the approach makes simplistic assumptions on the types of questions that can be asked, as well as on the structure of the dialog between the teacher and the learner. While the system outperforms a random policy on learning classification tasks, the dialog looks contrived from a human perspective. An interesting direction could be to pair the framework with neural text generation methods to model fine-grained question types, and generate more natural-looking interactions through dialog. An important scientific question is to characterize learning tasks for which learning from language is likely to outperform pure inductive learning. Future work can also extend the approach to other supervised learning tasks, as well as bootstrap from natural dialog data.