Toward Automatically Measuring Learner Ability from Human-Machine Dialog Interactions using Novel Psychometric Models

While dialog systems have been widely deployed for computer-assisted language learning (CALL) and formative assessment systems in recent years, relatively limited work has been done with respect to the psychometrics and validity of these technologies in evaluating and providing feedback regarding student learning and conversational ability. This paper formulates a Markov decision process-based measurement model, and applies it to text chat data collected from crowdsourced native and non-native English language speakers interacting with an automated dialog agent. We investigate how well the model measures speaker conversational ability, and find that it effectively captures the differences in how native and non-native speakers of English accomplish the dialog task. Such models could have important implications for CALL systems of the future that effectively combine dialog management with measurement of learner conversational ability in real time.


Introduction
Advances in multimodal dialog technologies have helped improve the state of the art in interactive computer-assisted language learning (CALL) and educational assessment applications in recent years. However, while much progress has been made with respect to the technology infrastructure and automated processing required in such dialog applications, relatively less work has carefully investigated the efficacy and validity of such assessment instruments, for instance, how well they measure students' capabilities. In other words, there is relatively little investigation into the psychometrics of such CALL applications and dialog-based assessments 1.
Interactive tasks such as multi-turn conversations have had limited use as standardized assessments due in part to the difficulty of evaluating these performances. When such assessment tasks are used, the conversational performance is scored primarily using human raters (take, for instance, the IELTS exam 2). Machine scoring of complex task performances has made substantial progress, especially in the domain of written essays (Shermis, 2014), but has been limited by path complexity in interactive performances such as dialog (Graesser et al., 2005).
While technical language use (e.g., grammar or pronunciation) might be scorable at the word or phrase grain size, pragmatic conversational ability can only be judged in the context of the conversation history, personal goals, and interpersonal dynamics. In a conversational task, for example, the "correctness" of single utterances cannot be scored independently, as their function, and therefore their value, depends upon the current state of the dialog. An utterance at one stage of the conversation might be of high value while the same utterance at a different point would be detrimental. Each utterance must be evaluated based on the speaker's conversational goals, what they have already accomplished in the conversation, and what sequence of interactions might bring them closer to their goal.
Such data are poorly suited to traditional psychometric models that assume conditionally independent performance data, such as classical test theory or item response theory (De Boeck and Wilson, 2004); they require a more structured and dynamic model (Mislevy et al., 2002). It is this modeling gap that we attempt to bridge in this paper using Markov Decision Process (MDP)-based measurement modeling (LaMar, 2018). To our knowledge, this is the first attempt at developing a psychometric model for dialog data that explicitly accounts for temporal dependencies in the observed data stream.
1 Psychometrics is the field of study concerned with the theory and technique of psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits. Psychometricians use a specialized set of statistical tools to create scientifically valid "standardized" assessments of various behaviors. Typically, a test is considered to have been standardized if data have been collected on large numbers of subjects using a set of structured rules for administration and scoring. These data are used to determine the mean score and the standard deviation, which the psychometrician then uses to benchmark the performance of those being tested. For more details, see Association et al. (1999) or Weiss and Zureich (2008).
2 https://www.ielts.org/
While the field does need more research into the psychometrics and validity of dialog-based summative assessments, there has been substantial work by the learning and formative assessment community in examining learning gains/progressions and modeling cognitive strategies in conversational tutoring applications (see, for example, Person et al., 2001; VanLehn et al., 2002; Heffernan and Koedinger, 2002; Michael et al., 2003; Pon-Barry et al., 2006; Rus et al., 2013). Researchers have also examined how one can perform adaptive dialog management to personalize the instruction to individual participants over the course of the interaction (Forbes-Riley and Litman, 2011; Vail and Boyer, 2014). This includes using learning progressions, natural language processing, and affective computing not only to adaptively select appropriate tasks for the learner to work on, but also to adapt the scaffolding while the learner is working on a task (Rus et al., 2013).
Such research has important implications for dialog system design as well. Particularly for CALL applications, it is important to integrate formative assessment of student ability into the dialog management process, in order to better adapt instruction to student needs, both in terms of the level of instruction (obtained in real time through measurement models) as well as the content and dialog path (decided by the dialog manager). We envision that future statistical dialog systems could combine statistical dialog management achieved using Partially Observable Markov Decision Processes or POMDPs (see for instance Young, 2006;Williams and Young, 2007;Young et al., 2010) in tandem with statistical measurement (using POMDP-based models) in order to develop more effective conversational language learning applications.
Our work also directly relates to user modeling in dialog systems. While there is plenty of theoretical work on such models (see, for example, Kobsa, 1990; Kass, 2012), implemented statistical versions of user models typically estimate the probability of a particular user response given a candidate system response or an interaction history thereof (e.g., Eckert et al., 1997; Levin et al., 2000; Horvitz and Paek, 2001; Pietquin, 2005; Kim et al., 2008). However, the difference in our case is that in order to serve as a measurement model of student performance, our MDP represents the cognitive model of an ideal automated interlocutor. Given a specified set of model parameters, the MDP model can generate action (or response) probabilities for every possible conversational state, depending on a learner/user-specific latent 'conversational ability' parameter which needs to be estimated for each user. Note that for the purposes of this paper, we will be broadly looking at conversational ability (in achieving a certain goal), and not necessarily technical English language proficiency.
The rest of the paper is organized as follows: Section 2 lays out the mathematical foundations of how MDP models can be used to model learner ability, including the equations for statistical parameter estimation. Section 3 then describes the dialog infrastructure used along with details regarding the conversational task and crowdsourcing data collection, followed by the formulation of the task-specific MDP for our use case in Section 4. Section 5 analyzes the results of running the model on our dataset and studies how well the model differentiates between native and non-native speakers (who are potential language learners) of English, with example dialogs included for illustration purposes. Finally, we conclude with a discussion of the current state of the art and outstanding issues for future research.

Markov Decision Process Measurement Models
As an extension of inverse reinforcement learning, partially observable Markov Decision Processes (POMDPs) have recently been used to represent a cognitive model that describes both human decision making and people's ability to infer the goals and beliefs of others. Baker et al. (2011) describe a "Bayesian theory of mind" in which cognition is modeled as a POMDP. They hypothesize that people act based on their beliefs, modeled by the state space, action set, and transition functions, and in accordance with their desires, which are modeled by the reward structure. With this cognitive framing, POMDPs can be used for measurement within a goal-directed task by comparing actions selected by human participants with the model's predicted probability of those actions (LaMar, 2018). The model and estimation algorithms are described briefly below; full details can be found in LaMar (2018). Note that in this work we utilize the more constrained MDP, in which the problem state is assumed to be observable, but extensions to full POMDP models are a natural next step.

Mathematical Formulation
As a decision model, the MDP defines the probability of selecting an action $a \in A$ given a specific state of the task $s \in S$. This probability, $p(a|s)$, is known as the policy. Action selection occurs within the context of a reward function $r(s, a, s')$, which specifies the immediate reward for taking action $a$ in state $s$ and entering state $s'$, and a transition model $p(s'|s, a)$, which is the probability of transitioning to state $s'$ given that action $a$ was taken in state $s$. An additional parameter $\gamma \in [0, 1]$, known as the discount parameter, represents the relative value of future versus immediate rewards. From this specification, one can calculate the Q function, which is the expected sum of discounted rewards obtained by taking action $a$ while in state $s$,
\[ Q(s,a) = \sum_{s' \in S} p(s'|s,a) \left( r(s,a,s') + \gamma \sum_{a' \in A} p(a'|s') \, Q(s',a') \right). \tag{1} \]
Note that $\sum_{a' \in A} p(a'|s') \, Q(s',a')$ is the expected value of the next state, marginalized over the possible next actions. Thus the quantity inside the large parentheses is the sum of the immediate reward and the discounted value of the future state. The expectation of this sum is then taken over all possible states $s'$ that might result from action $a$ in state $s$. The Q function is recursive, as the value of a state is defined using the Q function itself, but can be calculated using dynamic programming (Howard, 1960).
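To make the recursion in Equation 1 concrete, the following minimal Python sketch evaluates Q for a fixed policy by sweeping the right-hand side of the equation until the values stop changing. The dictionary layout, the function name `evaluate_q`, and the two-state toy task are illustrative assumptions and are not taken from the model described in this paper; repeated sweeps are one simple dynamic programming scheme among several.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
# transitions[(s, a)] lists (next_state, probability, reward) triples,
# i.e. p(s'|s,a) and r(s,a,s') stored side by side (assumed layout).
Transitions = Dict[Tuple[State, Action], List[Tuple[State, float, float]]]
# policy[s][a] = p(a|s); terminal states simply have no entry.
Policy = Dict[State, Dict[Action, float]]


def evaluate_q(transitions: Transitions, policy: Policy, gamma: float = 0.9,
               tol: float = 1e-10, max_sweeps: int = 10_000) -> Dict[Tuple[State, Action], float]:
    """Iterate the right-hand side of Eq. 1 for a fixed policy until convergence."""
    q = {sa: 0.0 for sa in transitions}
    for _ in range(max_sweeps):
        delta = 0.0
        for (s, a), outcomes in transitions.items():
            total = 0.0
            for s_next, prob, reward in outcomes:
                # Expected value of the next state, marginalized over next actions.
                v_next = sum(p * q[(s_next, a_next)]
                             for a_next, p in policy.get(s_next, {}).items())
                total += prob * (reward + gamma * v_next)
            delta = max(delta, abs(total - q[(s, a)]))
            q[(s, a)] = total
        if delta < tol:
            break
    return q


# Hypothetical toy task: from "start", "finish" ends the episode with reward 1,
# while "wait" costs 0.2 and stays in "start".
TOY_TRANSITIONS: Transitions = {
    ("start", "finish"): [("done", 1.0, 1.0)],
    ("start", "wait"): [("start", 1.0, -0.2)],
}
TOY_POLICY: Policy = {"start": {"finish": 0.5, "wait": 0.5}}
toy_q = evaluate_q(TOY_TRANSITIONS, TOY_POLICY, gamma=0.9)
```

With a discount below one the sweeps contract toward a unique fixed point; with $\gamma = 1$ convergence depends on the task eventually terminating, which this sketch does not check.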
When MDPs are used in the context of artificial agents, they generally employ an optimal policy, which selects the action that maximizes Q in each state. To model human performance, however, optimal decision making is not assumed. Instead, a Boltzmann policy is used (Baker et al., 2009),
\[ p(a|s) = \frac{\exp\left(\beta \, Q(s,a)\right)}{\sum_{a' \in A} \exp\left(\beta \, Q(s,a')\right)}, \tag{2} \]
where $\beta \in [0, \infty)$ represents the decision maker's ability to choose actions that will result in higher total rewards. As $\beta$ increases, the probability of choosing an optimal action increases. When $\beta$ goes to zero, actions are selected uniformly at random from the action set.
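The limiting behavior of the Boltzmann policy can be checked numerically. The sketch below assumes the standard softmax form, in which $p(a|s)$ is proportional to $\exp(\beta \, Q(s,a))$; the function name and the example Q values are hypothetical.

```python
import math
from typing import Dict, Hashable


def boltzmann_policy(q_values: Dict[Hashable, float], beta: float) -> Dict[Hashable, float]:
    """Return p(a|s) for one state, given Q(s, a) for each available action."""
    if beta < 0:
        raise ValueError("beta must be non-negative")
    # Subtract the maximum before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    q_max = max(q_values.values())
    weights = {a: math.exp(beta * (q - q_max)) for a, q in q_values.items()}
    normalizer = sum(weights.values())
    return {a: w / normalizer for a, w in weights.items()}


# Hypothetical Q values for three actions in a single state.
example_q = {"ask_name": 1.0, "ask_size": 0.0, "repeat_question": -2.0}
uniform = boltzmann_policy(example_q, beta=0.0)    # every action equally likely
near_greedy = boltzmann_policy(example_q, beta=50.0)  # mass concentrates on the best action
```

At `beta=0.0` each of the three actions receives probability one third, and for large `beta` nearly all probability falls on the highest-valued action, matching the two limits described in the text.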

MDPs for Measurement and Inference
Researchers have recently extended the MDP framework to study the quality of inferences that can be made about student/learner cognition based on records of action; for instance, to model learner goals and beliefs (Rafferty et al., 2015; Baker et al., 2009), to model inquiry strategies (LaMar et al., 2017), and to model student decision making ability (LaMar, 2018). Using the Boltzmann policy (Eq. 2), the MDP model can be seen as a generative latent-trait model provided that the latent traits of interest can be formulated as parameters of the model. While elements of the reward function and the transition model can be parameterized for inference about the decision maker's goals and beliefs, here we focus on the capability parameter $\beta_j$, a person-specific Boltzmann parameter, indicating a person's capability to optimally solve the given problem. The formulation of the Q function remains as in Equation 1, except that we note explicitly the dependency upon the capability parameter $\beta_j$. The conditional probability of student $j$ selecting action $a$ when in state $s$ now becomes
\[ p(a|s, \beta_j) = \frac{\exp\left(\beta_j \, Q_{\beta_j}(s,a)\right)}{\sum_{a' \in A} \exp\left(\beta_j \, Q_{\beta_j}(s,a')\right)}. \]
If the reward and transition parameters are fixed to objectively correct values, the Q function acts as a scoring function, determining the relative value of the actions available in each state. The $\beta_j$ parameter is then similar to a traditional ability parameter in IRT, measuring the extent to which the highest valued action is taken at each decision point.

Parameter Estimation
The observed data for student $j$ consist of a sequence of state-action pairs, $\{(s_{j1}, a_{j1}), \ldots, (s_{jN_j}, a_{jN_j})\}$, where $N_j$ is the total number of actions taken by the student. Each pair indicates a state and the action taken in that state. The Markov property applies to this model, allowing us to take each action to be conditionally independent, conditioned upon student capability and the system state in which the action was taken. Thus the probability of the observed data can be written as
\[ L(\beta_j) = \prod_{n=1}^{N_j} p(a_{jn} | s_{jn}, \beta_j), \]
where the optimal value of the person-specific ability parameter, $\hat{\beta}_j$, can be estimated by finding the value of $\beta_j$ that maximizes this likelihood:
\[ \hat{\beta}_j = \arg\max_{\beta_j} \prod_{n=1}^{N_j} p(a_{jn} | s_{jn}, \beta_j). \]
To estimate the population parameters of the lognormal distribution 3 , $\mu$ and $\sigma$, we use marginal maximum likelihood (MML), marginalizing over the person-specific parameter distributions. The person-specific $\beta_j$ can be estimated using either maximum a posteriori (MAP) or maximum likelihood estimation (MLE) methods. With smaller population sizes, MLE has been found to be more robust and is used for this study. Both the MML and MLE estimations are performed using a two-phase numerical optimization, with a global optimization algorithm followed by a local optimization algorithm, both drawn from the NLopt library. Gaussian quadrature is used for the approximation of the integrals, and the Q function is approximated using policy iteration methods.
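The person-level maximum likelihood step can be illustrated with a deliberately simplified sketch. It departs from the procedure described in this section in two labeled ways: the Q values are held fixed rather than recomputed for each candidate $\beta_j$, and a golden-section search on a bounded interval stands in for the two-phase NLopt optimization. All names and the toy data are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Tuple

QTable = Dict[Hashable, Dict[Hashable, float]]  # q[s][a] = Q(s, a), assumed fixed


def log_likelihood(beta: float, observations: List[Tuple[Hashable, Hashable]],
                   q: QTable) -> float:
    """Sum of log p(a|s, beta) over observed (state, action) pairs under a Boltzmann policy."""
    total = 0.0
    for s, a in observations:
        q_s = q[s]
        q_max = max(q_s.values())
        # Numerically stable log of the softmax normalizer.
        log_z = beta * q_max + math.log(sum(math.exp(beta * (v - q_max)) for v in q_s.values()))
        total += beta * q_s[a] - log_z
    return total


def estimate_beta(observations: List[Tuple[Hashable, Hashable]], q: QTable,
                  beta_max: float = 50.0, iterations: int = 200) -> float:
    """Golden-section search for the maximizer on [0, beta_max].

    With Q held fixed the log-likelihood is concave in beta, so a
    unimodal search is adequate. beta_max is an arbitrary cap (assumption).
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, beta_max
    for _ in range(iterations):
        x1 = hi - inv_phi * (hi - lo)
        x2 = lo + inv_phi * (hi - lo)
        if log_likelihood(x1, observations, q) < log_likelihood(x2, observations, q):
            lo = x1
        else:
            hi = x2
    return (lo + hi) / 2.0


# Toy data: one state with two actions; the better action is chosen 3 times out of 4.
toy_q_table: QTable = {"s": {"good": 1.0, "bad": 0.0}}
toy_observations = [("s", "good")] * 3 + [("s", "bad")]
beta_hat = estimate_beta(toy_observations, toy_q_table)
```

In this toy case the estimate can be checked by hand: the likelihood is maximized when the model's probability of the better action equals the observed proportion of 0.75, which gives $\hat{\beta} = \ln 3$. When every observed action is optimal the likelihood has no finite maximizer, so the returned value then reflects only the arbitrary cap.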

Dialog System
We use an open-source dialog system 4 to develop a text-based chatbot application. Note, however, that this work is not limited to or dependent on the dialog system being used. Indeed, there are multiple academic (Olympus (Bohus et al., 2007), Alex (Jurčíček et al., 2014), Virtual Human Toolkit (Hartholt et al., 2013), OpenDial 5 , etc.) and industrial (Voxeo 6 , Alexa 7 , etc.) implementations of dialog systems, any of which can be used, but many of these use special architectures, interfaces, and languages, paying relatively less attention to existing W3C and other industry standards (see Ramanarayanan et al. (2017) for more details). We, however, choose to use the Anonymous cloud-based dialog system for its standards compliance, modularity, and flexibility in developing both text- and speech-based applications. In this study we limit ourselves to text-based dialog for simplicity.

Figure 1: Example webpage screenshot of the text dialog interface that participants might see for the task described in this paper.

Conversational item design
This study leverages a conversational practice task developed for English language learners, where subjects are asked to pose as a customer service representative at a pizza restaurant, and field an order from an automated customer (played by the dialog system). See Figure 1 for a screenshot of the web-based dialog interface that participants interacted with. Participants are instructed that their primary goal is to sell a pizza while ensuring that they collect all information necessary to complete the order (such as the name of the customer, their address if delivery is requested, etc.). They are further instructed that if they manage to sell the customer mushroom toppings, they will be awarded a bonus for task performance. We used regular expressions to perform the natural language understanding. Figure 2 depicts the dialog flow of the conversational item. Recall that for the purposes of this paper, the target of measurement is the student's ability to navigate conversational conventions and achieve the pre-specified task goal (to maximize the pizza sale) through conversation with the automated customer, and not their technical language skills.
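As an illustration of regular-expression-based understanding for a task of this kind, the sketch below maps a participant utterance to zero or more dialog moves. The expressions actually used in the study are not given here, so every pattern and intent label is a hypothetical stand-in.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical intent patterns; the study's own expressions are not reproduced.
INTENT_PATTERNS: Dict[str, Pattern[str]] = {
    "ask_name": re.compile(r"\bname\b", re.IGNORECASE),
    "ask_size": re.compile(r"\bsize\b|\b(small|medium|large)\b", re.IGNORECASE),
    "ask_delivery": re.compile(r"\b(delivery|deliver(ed)?|pick ?up|take ?out)\b", re.IGNORECASE),
    "ask_address": re.compile(r"\baddress\b", re.IGNORECASE),
    "offer_mushrooms": re.compile(r"\bmushrooms?\b", re.IGNORECASE),
}


def classify(utterance: str) -> List[str]:
    """Return every intent whose pattern matches, or ['no_match'] if none does."""
    matched = [intent for intent, pattern in INTENT_PATTERNS.items()
               if pattern.search(utterance)]
    return matched or ["no_match"]
```

Exact-match patterns of this sort are brittle with respect to spelling: under these assumed patterns, `classify("and your adress?")` finds no intent while `classify("and your address?")` is recognized as an address request.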

Crowdsourcing data collection
We used Amazon Mechanical Turk for our crowdsourcing data collection experiments. Crowdsourcing has been used in the past for the assessment of dialog systems as well as for the collection of dialog interactions (see, for instance, McGraw et al., 2010; Rayner et al., 2011; Jurčíček et al., 2011; Ramanarayanan et al., 2016). In addition to interacting with the text chatbot interface to complete the conversational task, workers were requested to fill out a 2-3 minute survey regarding different aspects of the interaction, such as their overall experience, how engaged they felt while interacting with the system, how well the system understood them, and basic demographic information. Particularly relevant for this study are participants' self-reported first language and their ratings of system performance, defined as a qualitative measure of how the system performed relative to participant expectations and whether the system responses were appropriate. In all, we collected and analyzed dialogs from 390 participants, 54% of whom self-reported as native English language speakers and 70% of whom were male, primarily in the 20-40 age range. See Tables 3-7 for example dialogs.

MDP Model for the Pizza Dialog Task
To serve as a measurement model for student performance, the MDP must represent the cognitive model of an ideal pizza shop representative. The full MDP cognitive model consists of a set of actions, a state space, the transition functions, and the reward structure. In Table 1 the action set is listed in the left column, while the transition function is partially illustrated by the probability of effects from each action. The state space is defined by a set of state variables, which includes information slot boolean variables such as gotSize, gotAddress, and gotCustomerName. For order information that might affect the choice of future actions, we model the possible values along with a value for "unknown." For example, the wantsMushroom variable has three discrete values: 0 for unknown, -1 for "does not want mushrooms," and 1 for "wants mushrooms"; wantsDelivery is coded similarly.
The possibility of customer annoyance (isAnnoyed), which was omitted from Table 1 for clarity, adds complexity to the dialog task. The cognitive model assumes that every time the customer is asked a question that they have already answered, they have a .5 probability of becoming annoyed. This means that while pestering the customer to order mushrooms might result in a mushroom pizza order, it also might result in an annoyed customer. Annoyed customers do not buy pizzas. The final isSold state variable gets set to 1 only if all the required information has been gathered (name, topping preference, size, delivery or takeout, and address in the case of delivery) and the customer is not annoyed.
Note that the model is intended to reflect the thinking of a competent participant engaging with the task and thus includes stochastic transitions based on "likely" outcomes from different conversational moves. For example, in the test task the customer will always ask for delivery; however, the cognitive model for the pizza salesperson gives the probability of the customer wanting delivery as .5, as the representative does not know whether the customer will want delivery until they ask.
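The state variables and the repeated-question rule described for the pizza task can be sketched in Python as follows. Field names follow the state variables named in the text; the class, the two helper functions, and the treatment of an already annoyed customer are illustrative assumptions, and only one action (asking about delivery) is shown rather than the full action set of Table 1.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class PizzaState:
    gotCustomerName: bool = False
    gotSize: bool = False
    gotAddress: bool = False
    wantsMushroom: int = 0   # 0 unknown, -1 does not want, 1 wants
    wantsDelivery: int = 0   # coded the same way
    isAnnoyed: bool = False


def order_complete(state: PizzaState) -> bool:
    """Condition under which isSold would be set: all required slots known, customer not annoyed."""
    address_ok = state.wantsDelivery == -1 or state.gotAddress
    return (state.gotCustomerName and state.gotSize
            and state.wantsMushroom != 0 and state.wantsDelivery != 0
            and address_ok and not state.isAnnoyed)


def ask_delivery(state: PizzaState) -> List[Tuple[PizzaState, float]]:
    """Return (next_state, probability) pairs for asking about delivery."""
    if state.wantsDelivery != 0:
        # Question already answered: 0.5 chance the customer becomes annoyed.
        if state.isAnnoyed:
            return [(state, 1.0)]  # assumption: an annoyed customer stays annoyed
        return [(replace(state, isAnnoyed=True), 0.5), (state, 0.5)]
    # First time asking: the representative expects delivery with probability 0.5.
    return [(replace(state, wantsDelivery=1), 0.5),
            (replace(state, wantsDelivery=-1), 0.5)]
```

Using a frozen dataclass keeps states hashable, so they can serve directly as dictionary keys when tabulating Q values over the state space.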
The rewards are defined based on actions taken and values of the state variables. Table 2 shows the five reward or cost conditions used for this model. The biggest reward comes from completing the call with a pizza ordered, for a reward of 3 points. An additional point is awarded if the pizza order includes mushrooms and an additional half point is given for collecting the not-required, but nice-to-have phone number of the customer. When customers get annoyed, the whole conversation gets unpleasant causing a cost of 0.3 points (equivalently a reward of -0.3) for every action taken while the customer is annoyed. Finally, each and every action costs 0.2 points. This "time cost" is essential for motivating efficiency in action planning.
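A short worked tally shows how these five conditions combine over a whole conversation. The sketch below sums undiscounted rewards for an episode summary; the function name and its arguments are hypothetical, and it assumes that the annoyance cost is charged in addition to the per-action time cost and that the mushroom bonus applies only when a sale is made.

```python
def episode_reward(n_actions: int, n_annoyed_actions: int, sold: bool,
                   with_mushrooms: bool, got_phone: bool) -> float:
    """Undiscounted total reward for one conversation (assumed additive costs)."""
    total = -0.2 * n_actions             # time cost on every action
    total -= 0.3 * n_annoyed_actions     # extra cost while the customer is annoyed
    if sold:
        total += 3.0                     # completed pizza order
        if with_mushrooms:
            total += 1.0                 # bonus for selling mushrooms
    if got_phone:
        total += 0.5                     # optional phone number collected
    return total


# Eight actions, no annoyance, pizza sold with mushrooms, phone number collected:
# 3 + 1 + 0.5 - 8 * 0.2 = 2.9
best_case = episode_reward(8, 0, True, True, True)
```

Under these assumptions, each extra question is worth asking only if its expected gain exceeds the 0.2 time cost, which is the efficiency pressure the time cost is meant to create.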
Given this specification, the PizzaOrder MDP can generate action probabilities for every possible conversational state, dependent upon the $\beta_j$ parameter, which we take to be conversational ability as demonstrated by the task.

Excerpts from example dialogs:
Human: the pizza will be ready in 30 minutes
System: Alright, thanks! Goodbye!

H: and your adress?
System: I'm sorry, I didn't catch that. Can you repeat that, please?
Human: and your address?
System: 23 Maple Street.
Human: Ok, that will take 30min
System: Alright, thanks! Goodbye!

Analyses and Observations
We ran the MDP measurement model over the full data set of 390 dialogs, estimating the person-specific conversational ability (which, again, is the ability to achieve the pre-specified task goal through conversation, and not technical language skill) for each dialog as $\beta_j$. In our analyses we examine $\log(\beta_j)$, which should follow a more familiar normal distribution. To evaluate the validity 8 of the resulting estimates, we analyzed them vis-a-vis participants' reported native language, with the expectation that native English speakers would generally have higher conversational ability 9 . Recall that our sample included 209 native English speakers and 181 participants who reported a non-English language as their native tongue.
We observed that participants who reported English as their first language had average $\log(\beta_j)$ estimates 0.31 logits higher than participants who reported a non-English language as their first language (t = 3.00, df = 374, p = 0.003). Figure 3 shows a boxplot of the estimated $\log(\beta_j)$ values grouped by native language.
We also compared the dialog-ability estimates to participant-reported satisfaction ratings with respect to the dialog system performance. Here we hypothesize that participants who used phrases that the dialog engine did not recognize would both be dissatisfied with the performance of the system and have low estimated conversational ability. In our sample, 254 participants reported that the system performed well (4 or 5 on a 5-point Likert scale), while 103 participants rated the system at 3 or lower. Participants who rated the system as performing well had an average $\log(\beta_j)$ estimate 0.47 logits higher than those who rated the system poorly (t = 3.64, df = 159.7, p < 0.001) (Figure 4), which is consistent with our hypothesis. However, note that these system performance ratings are subjective and might vary depending on the speaker sample and the specific conversational item under study.
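The fractional degrees of freedom reported for the ratings comparison are consistent with an unequal-variance (Welch) t-test, although the exact test used is not stated; that reading is an inference. The sketch below computes the Welch statistic and its Welch-Satterthwaite degrees of freedom for two samples. The function name is hypothetical and the numbers are toy values, not study data.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, approximate degrees of freedom) for two independent samples."""
    n_x, n_y = len(x), len(y)
    # Squared standard errors of each group mean (sample variance / n).
    se2_x = variance(x) / n_x
    se2_y = variance(y) / n_y
    t_stat = (mean(x) - mean(y)) / math.sqrt(se2_x + se2_y)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (n_x - 1) + se2_y ** 2 / (n_y - 1))
    return t_stat, df


# Toy values only, standing in for log(beta_j) estimates in two rating groups.
toy_high = [1.0, 2.0, 3.0, 4.0]
toy_low = [0.0, 1.0, 2.0]
toy_t, toy_df = welch_t(toy_high, toy_low)
```

Converting the statistic to a p-value requires the t distribution's cumulative distribution function, which is omitted here to keep the sketch dependency-free.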
While these results provide, as yet, only weak validity evidence for the measurement model, they do indicate that the model is performing as expected. We also examined the actual dialogs of different participants interacting with the system in order to better understand how the model of student dialog reflects actual student performance. We have listed example dialogs of non-native participants with different estimated dialog ability and self-reported system performance ratings. Note that these are presented as is, without correcting for errors in spelling or grammar. Table 3 shows an example dialog that was assigned a low dialog ability estimate ($\log(\beta_j)$) as well as a low system performance rating. In this case, while the Hindi speaker mentioned the deal on mushrooms, he asked for the pizza size again even though the automated customer had already given him that information. Per our earlier model specification, this might have 'annoyed' the automated customer. Crucially, though, he failed to ask the automated customer whether he wanted delivery, and subsequently for his address, which resulted in a low $\log(\beta_j)$ score on the task overall. Table 4 shows an example where the automated customer did not get annoyed, but which nonetheless shows clear gaps in the non-native participant's conversational competence in achieving the goal of maximizing the sale. In contrast to these examples, the Indonesian speaker (Table 6) asked the automated customer for each of the requisite pieces of information, resulting in a successful interaction that received a high $\log(\beta_j)$ score, despite the fact that he didn't sell the customer mushrooms. A native speaker of Dutch (Table 5), who performed well on the task in general but was scored slightly lower ($\log(\beta_j) = 0.258$), did persist in selling mushroom toppings to the automated customer while asking for his name and address, but misspelled the word 'address'. However, the participant caught this error in the next dialog turn, ultimately resulting in successful completion. Note that there were also cases that received a high $\log(\beta_j)$ score with low system performance ratings, many of which were due to system natural language understanding issues. Going forward, we will aim to improve this aspect of the system to improve user experience and modeling accuracy.

Discussion and Outlook
We have presented a Markov decision process-based measurement model (MDP-MM) for the assessment of learners' ability to complete a simple customer interaction dialog task. We put forth a formal mathematical description of the model, including a maximum likelihood-based method to estimate the parameters of the model given input data. On applying the model to crowdsourced customer service dialog interactions at a pizza restaurant, we observed that the model's ability estimate ($\log(\beta_j)$) differs in a statistically significant manner between native and non-native speakers of English, and between participants who rated system performance high and those who rated it low. Note that the MDP-MM is particularly useful over traditional methods of measurement when the dialogs increase in complexity and branching, and the resulting paths cannot be easily enumerated for scoring.
We plan to investigate several lines of research going forward. First, while we have shown the model's efficacy, to a certain extent, in capturing the conversational ability of participants in successfully completing a given task, neither the degree of nativeness nor participants' ratings of system performance are ideal correlates for establishing the validity of the model. A more appropriate variable might be, for instance, a third-party expert rating of their conversational ability (where experts could be English language teachers, for instance). In addition, we hand-crafted a specific set of actions, transition probabilities, and rewards for the model presented in this paper based on our subjective expertise. Careful selection of these parameters is important because they directly influence model behavior. Future iterations could benefit from a more scientifically objective method of model specification. We will also need more data from more conversational items and participants to concretely establish the utility of the model and its applicability to a wide variety of dialog use cases in a statistically significant manner.
Second, while this paper has focused on conversational task ability, our longer term goal is to apply such a model to the measurement of conversational language proficiency. This will require modifications to both the task (the goals, dialog flow design, natural language understanding and dialog management logic) as well as the specific variables we measure (such as fluency, language use, vocabulary and grammatical accuracy, pragmatics and historical discourse context, among others).
Third, while the proposed model assumes that the state of the system is known at every given point of time for simplicity, relaxing this assumption is a natural next step. In such a case, we would have to use a partially observable extension of the MDP-MM model (or a POMDP-MM) that explicitly models the uncertainty in the observation process that estimates the state of the system at every time step.
A fourth important future research direction, as mentioned in this paper's introductory paragraphs, involves the integration of statistical measurement of student conversational ability with dialog management, especially for computer-assisted language learning (CALL) or formative assessment applications. Such integration would feed the measurement of learner conversational ability and/or language proficiency into the dialog manager, allowing one to adapt the conversational instruction flow based both on the content of what the learner said and on their conversational ability. In addition, popular statistical dialog management modules are based on POMDPs, which might allow for easier combination with the POMDP-based measurement model into a unified model, given that both share the underlying mathematical framework. For example, in such a scenario, one could imagine that the user action model, user goal model, and dialog model in a POMDP-based dialog manager (which estimate the user's next action and state, and the next dialog system state, respectively, as described in Young, 2006) would now depend (and be conditional) on the user's conversational ability and/or language proficiency estimate.
Finally, we also plan to evaluate model efficacy and integrability in a full-blown spoken dialog scenario (as opposed to text chat, as in this paper). In addition, the current paper uses simple regular expression-based natural language understanding; incorporating more accurate statistical natural language understanding modules could further improve model performance and estimation accuracy. Such improvements and the early nature of the model notwithstanding, the relative lack of previous work in measuring conversational ability in CALL dialog and the results presented in this paper speak to the necessity and potential of such measurement models in developing more comprehensive and effective CALL applications.