Scoring Interactional Aspects of Human-Machine Dialog for Language Learning and Assessment using Text Features

While there has been much work in the language learning and assessment literature on human and automated scoring of essays and short constructed responses, there is little to no work examining text features for scoring of dialog data, particularly interactional aspects thereof, to assess conversational proficiency over and above constructed response skills. Our work bridges this gap by investigating both human and automated approaches towards scoring human–machine text dialog in the context of a real-world language learning application. We collected conversational data of human learners interacting with a cloud-based standards-compliant dialog system, triple-scored these data along multiple dimensions of conversational proficiency, and then analyzed the performance trends. We further examined two different approaches to automated scoring of such data and show that these approaches are able to perform at or above par with human agreement for a majority of dimensions of the scoring rubric.


Introduction
Learning and assessment solutions in today's educational marketplace are placing increasing importance and resources on developing technologies that are dialogic (as opposed to monologic) in nature. Conversational proficiency is a crucial skill for success in today's workplace (Weldy and Icenogle, 1997; Oliveri and Tannenbaum, 2019), which makes R&D on technologies that help develop and assess this skill important to complement our understanding from sociolinguistics (see for example Young, 2011; Doehler and Pochon-Berger, 2015). Dialog system technologies are one solution capable of addressing and automating this need by allowing learners to practice and improve their interactional competence at scale (Suendermann-Oeft et al., 2017; Yu et al., 2019). However, such conversational technologies need to be able to provide targeted and actionable feedback to users in order for them to be useful to learners and widely adopted. Automated scoring of multiple aspects of conversational proficiency is one way to address this need.
While the automated scoring of text and speech data has been a well-explored topic for several years, particularly for essays and short constructed responses in the case of the former (Shermis and Burstein, 2013; Burrows et al., 2015; Madnani et al., 2017) and monolog speech for the latter (Neumeyer et al., 2000; Witt and Young, 2000; Xi et al., 2012; Bhat and Yoon, 2015), there has been a relative dearth of work on the interpretable automated scoring of dialog. Evanini et al. (2015) examined the automatic scoring of pseudo-dialogues, i.e., there were no branching dialog states; the system's response was fixed and did not vary based on the learner's response. Litman et al. (2016) developed a system to predict expert human rater scores based on audio signal and fluency features. Ramanarayanan et al. (2017a) analyzed this scoring problem at the level of each response in the dialog (i.e., each turn) instead of the entire conversation and across multiple dimensions of speaking proficiency. However, no study has performed a comprehensive examination of the automated scoring of content of whole dialog responses (with branching) based primarily on text features, using a comprehensive multidimensional rubric and scoring paradigm designed specifically for dialog data, and for interaction aspects in particular.
This study describes our contributions toward (i) developing a comprehensive rubric design specifically tailored to conversational dialog along multiple dimensions, particularly those focused on interaction, (ii) triple-scoring a selection of dialog data based on this rubric, and finally (iii) examining the performance of two methods for automated scoring of such data (the first a state-of-the-art feature engineering method that passes word and character n-grams, length and syntax features into multiple state-of-the-art classifiers, and the second a model engineering method that leverages end-to-end memory networks to model dependencies between turn and prompt histories using memory components) and analyzing this performance vis-à-vis human raters. Note that for the purposes of this paper, while our data is spoken dialog, we will focus on text features derived from transcriptions, and therefore will focus on how they can be used to score various aspects of interaction in an interpretable manner. A subsequent future analysis will comprehensively examine how these can be combined with speech features.

Table 1: Human scoring rubric for interaction aspects of conversational proficiency. Scores are assigned on a Likert scale from 1-4 ranging from low to high proficiency. A score of 0 is assigned when there were issues with audio quality or system malfunction or off-topic or empty responses.

Construct | Sub-construct | Description
Interaction | Engagement | Examines the extent to which the user engages with the dialog agent and responds in a thoughtful manner.
Interaction | Turn Taking | Examines the extent to which the user takes the floor at appropriate points in the conversation without noticeable interruptions or gaps.
Interaction | Repair | Examines the extent to which the user successfully initiates and completes a repair in case of a misunderstanding or error by the dialog agent.
Interaction | Appropriateness | Examines the extent to which the user reacts to the dialog agent in a pragmatically appropriate manner.
Overall Holistic Performance | - | Measures the overall performance.

Collection
We crowdsourced, using Amazon Mechanical Turk, the collection of 2288 conversations of nonnative speakers interacting with a dialog application designed to test general English speaking competence in workplace scenarios, and pragmatic skills in particular. The application, dubbed "Request Boss", requires participants to interact with their boss and request a meeting with her to review presentation slides using pragmatically appropriate language. To develop and deploy this application, we leveraged HALEF (http://halef.org), an open-source modular cloud-based dialog system that is compatible with multiple W3C and open industry standards (Ramanarayanan et al., 2017b). The HALEF dialog system logs speech data collected from participants to a data warehouse, which are then transcribed and scored.

Human Scoring
In order to understand how well participants performed in our conversational task, we had each of the 2288 dialog responses triple-scored by human expert raters on a custom-designed rubric. This rubric was iteratively modified and refined to score constructs specific to dialog data. The final conversational scoring rubric defined 12 sub-constructs under the 3 broad constructs of linguistic control, task fulfillment and interaction, apart from an overall holistic score. However, for purposes of this first study, we will focus on the relatively understudied interaction construct, in particular aspects of engagement, turn-taking, repair and (pragmatic) appropriateness. See Table 1 for more details. We asked expert raters to score each dialog for each rubric dimension on a scale from 1 to 4, and to assign a score of 0 to dialogs that contained no audio, corrupted audio, or significantly off-topic responses. The expert raters were scoring leaders with significant experience in scoring various spoken and written assessments of English language proficiency. We used an automatic randomized design to assign three (out of eight possible) raters to every dialog such that (i) all raters had a commensurate number of responses to rate, and (ii) the same group of raters did not rate the same set of files (achieved by randomization; this prevents unwitting biases due to individual raters affecting the overall score analysis).
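The randomized rater assignment described above can be illustrated with a minimal sketch. The exact procedure used in the study is not specified, so the greedy least-loaded strategy, the function name `assign_raters`, and the rater labels below are all assumptions made for illustration only.

```python
import random
from collections import Counter


def assign_raters(dialog_ids, raters, per_dialog=3, seed=0):
    """Hypothetical helper: give each dialog `per_dialog` distinct raters.

    Workloads stay balanced because the least-loaded raters are always
    chosen; ties are broken at random so rater groups vary across dialogs.
    """
    rng = random.Random(seed)
    load = Counter({r: 0 for r in raters})
    assignment = {}
    for dialog_id in dialog_ids:
        shuffled = list(raters)
        rng.shuffle(shuffled)  # random tie-breaking
        # sorted() is stable, so raters with equal load keep the shuffled order
        chosen = sorted(shuffled, key=lambda r: load[r])[:per_dialog]
        for r in chosen:
            load[r] += 1
        assignment[dialog_id] = tuple(sorted(chosen))
    return assignment, load


assignment, load = assign_raters(range(2288), list("ABCDEFGH"))
print(max(load.values()) - min(load.values()))  # workload spread across raters
```

With this strategy the workload of any two raters differs by at most one dialog at every step, while the random tie-breaking keeps any fixed group of three raters from always scoring together.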

Table 2: Content scoring features.

Feature | Description
Word n-grams | Word n-grams are collected for n = 1 to 2. This feature captures patterns about vocabulary usage (key words) in responses.
Character n-grams | Character n-grams (including whitespace) are collected for n = 2 to 5. This feature captures patterns that abstract away from grammatical and other language use errors.
Response length | Defined as log(chars), where chars represents the total number of characters in a response.
Syntactic dependencies | A feature that captures grammatical relationships between individual words in a sentence. This feature captures linguistic information about "who did what to whom" and abstracts away from a simple unordered set of key words.
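As a rough illustration of the n-gram and length features listed above, the following sketch extracts them from a single transcribed response. Lowercasing, whitespace tokenization, and the feature-name prefixes are assumptions for this example and are not taken from the study; syntactic dependency features are omitted because they require a parser.

```python
import math


def word_ngrams(text, n_min=1, n_max=2):
    """Set of word n-grams for n_min <= n <= n_max (whitespace tokens)."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def char_ngrams(text, n_min=2, n_max=5):
    """Set of character n-grams, whitespace included."""
    s = text.lower()
    return {
        s[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(s) - n + 1)
    }


def extract_features(response):
    """Hypothetical feature dictionary: binary n-gram indicators plus log length."""
    feats = {f"w:{g}": 1 for g in word_ngrams(response)}
    feats.update({f"c:{g}": 1 for g in char_ngrams(response)})
    feats["log_chars"] = math.log(len(response)) if response else 0.0
    return feats


features = extract_features("Can we meet")
print(sorted(k for k in features if k.startswith("w:")))
```

Sparse dictionaries of this kind map naturally onto the high-dimensional presence/absence feature space that linear models handle well.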

Machine Scoring
This section first lays out our setup for interpretable machine scoring, including details of the feature extraction and machine learning methods. We then analyze human performance (by examining inter-rater statistics) and use this to benchmark the performance of machine scoring methods. Following standard convention in automated scoring, we only consider dialogs with a non-zero score to train scoring models (because a separate filtering model is typically trained to eliminate "unscorable" responses, which include responses with no, garbled or off-topic audio data; see Higgins et al., 2011, for a more detailed motivation and rationale for this approach).

Feature Engineered Content Scoring
We used a set of features that have been employed in many previously published approaches to building content scoring models (see Madnani et al., 2017, 2018). We refer to this system as c-rater ML; see Table 2 for more details. All of the features are binary (indicating presence or absence) and try to capture how well responses contain (a) the right concepts (approximately captured by words and bigrams), (b) the right syntactic relationships between those concepts (approximately captured by dependency triples), (c) spelling and morphological relations (character n-grams) and (d) length of the response (captured by length features). We used SKLL (https://github.com/EducationalTestingService/skll), an open-source Python package that wraps around the scikit-learn package (Pedregosa et al., 2011), to perform machine learning experiments. We experimented with rescaled linear support vector machine (SVM) and multilayer perceptron (MLP) regressors. The former allows us to interpret how the algorithm performs, while the latter is used for comparison purposes to understand how deep neural networks might perform on this task given the data we have. In our case, we found that the SVM outperformed the MLP across the board, possibly because our feature space is sparse and high-dimensional, consisting of binary presence/absence features. We ran 10-fold cross-validation experiments and report the best overall results for the SVM system. We used cross entropy (log-loss) as an objective function for optimizing learner performance. We further tuned and optimized the free parameters of each learner using a grid-search method. We computed both accuracy and quadratic weighted kappa (which takes into account the ordered nature of the categorical labels) as metrics, reported in Table 3.
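To make the quadratic weighted kappa metric concrete, here is a minimal pure-Python sketch using the standard definition with weights (i - j)^2 / (k - 1)^2 over k ordered categories. The function name and the default 1-4 label range (taken from the rubric scale) are choices made for this illustration; the study itself relied on existing tooling.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_label=1, max_label=4):
    """Quadratic weighted kappa between two lists of integer labels.

    Assumes the labels are not all concentrated in a single category,
    otherwise the expected disagreement (the denominator) is zero.
    """
    k = max_label - min_label + 1
    n = len(y_true)
    # Confusion matrix of observed label pairs
    observed = [[0.0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_label][p - min_label] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two label distributions
            weighted_expected += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 - weighted_observed / weighted_expected


print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4]))
```

Because the penalty grows with the squared distance between labels, predicting a 1 for a true 4 costs far more than predicting a 3, which is why this metric suits ordered score scales better than plain accuracy.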

End to End Memory Network (MemN2N) architecture
We also investigated the efficacy of the End to End Memory Network (MemN2N) architecture (Sukhbaatar et al., 2015; Chen et al., 2016) adapted to the dialog scoring task. The end to end MemN2N architecture models dependencies in text sequences using a recurrent attention model coupled with a memory component, and is therefore suited to modeling how response and prompt histories contribute to a dialog score. In our case, the MemN2N architecture learns a mapping between an output score and an input tuple consisting of the current response, the response history and the prompt history. See Figure 1. We modified the original MemN2N architecture in Sukhbaatar et al. (2015) in the following ways: (i) instead of the original (query, fact history, answer) tuple that is used to train the network in the original paper, we have a (current response, response history, prompt history, score) tuple in our case. In other words, we not only embed and learn memory representations between the current response and the history of previous responses, but also the history of prior system prompts that have been encountered thus far; (ii) we used an LSTM instead of a matrix multiplication at the final step of the network before prediction; and (iii) we experimented with Google word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) initializations for word embeddings in addition to experimenting with multiple memory hops. We train the network at the turn level; in other words, for each turn, the training data would consist of an input of (response for current turn, response history, prompt history) and an output of the dialog-level score (in other words, each turn is assumed to have the same score as that of the full dialog). During testing, we compute the score for each dialog in the test set as the median of scores predicted by the trained network for each turn in that dialog.
We used a similar cross-validation setup as described in §3.1, with the exact same 10 folds and with experiments optimizing a cross-entropy-based objective function as in the earlier case, to enable a fair comparison. We tuned hyperparameters of the network using the hyperas toolkit. This included the number of neurons in the Dense and LSTM layers as well as the addition of Dropout layers after each memory component. We experimented with 1, 2 and 3 memory hops and found 2 to be optimal. Interestingly, we also found that initializing the memory embedding matrices with pretrained Google word2vec or GloVe embeddings, rather than randomly, helped more for prompt history encoding than for response history encoding.
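The turn-level training setup and the median-based aggregation at test time can be sketched as follows. This is only an illustration of the data handling, not of the network itself; the dictionary keys, the function names, and the choice to include the prompt that elicited the current response in the prompt history are assumptions.

```python
from statistics import median


def make_turn_examples(prompts, responses, dialog_score):
    """Build one training example per turn, each labeled with the dialog-level score.

    prompts[t] is assumed to be the system prompt that elicited responses[t].
    """
    examples = []
    for t, response in enumerate(responses):
        examples.append({
            "current_response": response,
            "response_history": list(responses[:t]),    # earlier learner turns
            "prompt_history": list(prompts[:t + 1]),    # prompts seen so far
            "score": dialog_score,                      # same label for every turn
        })
    return examples


def dialog_score_from_turns(turn_predictions):
    """Aggregate per-turn predictions into one dialog-level score via the median."""
    return median(turn_predictions)


examples = make_turn_examples(["p1", "p2", "p3"], ["r1", "r2", "r3"], dialog_score=3)
print(len(examples), dialog_score_from_turns([2, 3, 3]))
```

Note that `statistics.median` averages the two middle values for an even number of turns, which can yield a value between rubric levels; how such cases were resolved in the study is not stated.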

Observations and Results
The final two columns of Table 3 display two inter-rater agreement statistics, Conger κ and Krippendorff α, for the human expert scores assigned to the data. Recall that each dialog was scored by 3 out of 8 possible raters. We observe a moderate to high agreement between raters for all dimensions of the scoring rubric, which is not too surprising given that all our raters had significant experience in rating monologic speech data. Table 3 also shows the performance of our two different systems in scoring various aspects of interaction at the level of the entire dialog. Observe that fusing the MemN2N with the c-rater ML system leads to a small but significant improvement over either of the systems alone. Additionally, it is interesting to note that the quadratic weighted kappa (QW κ) of the fusion system is in a similar range to the κ and α metrics for human inter-rater agreement, particularly for engagement and turn-taking subscores. While these measures are not directly comparable, this trend is encouraging nonetheless, suggesting that a combination of n-gram, length, syntactic dependency and memory-based attention over embedding representations of words over the entire dialog is useful in capturing at least some aspects of these sub-constructs of interaction. On the other hand, the fusion system performance for repair and appropriateness subscores is still below par, suggesting that more feature engineering and modeling research is required to model these aspects of interaction. These dimensions of interaction are also harder to predict, given that repair and pragmatic appropriateness are more high-level and abstract in nature.
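For readers unfamiliar with Krippendorff α, the following minimal sketch computes it for a design like ours, where each dialog receives ratings from only a subset of the raters. The interval distance (squared score difference) is an assumption made for this illustration; the text does not state which distance metric was used, and the function name is hypothetical.

```python
from collections import Counter


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with an interval (squared-difference) metric.

    `units` is a list of lists; each inner list holds the scores one dialog
    received. Dialogs with fewer than two scores carry no agreement
    information and are skipped. Assumes the pooled scores are not all equal.
    """
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)

    # Observed disagreement: squared differences within each dialog
    observed = 0.0
    for u in units:
        pair_sum = sum(
            (a - b) ** 2
            for i, a in enumerate(u)
            for j, b in enumerate(u)
            if i != j
        )
        observed += pair_sum / (len(u) - 1)
    observed /= n

    # Expected disagreement: squared differences across all pooled scores
    counts = Counter(v for u in units for v in u)
    expected = sum(
        counts[c] * counts[k] * (c - k) ** 2 for c in counts for k in counts
    ) / (n * (n - 1))

    return 1.0 - observed / expected


print(krippendorff_alpha_interval([[1, 1, 1], [4, 4, 4], [2, 2, 2]]))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.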

Discussion
This paper has examined approaches to both human and machine scoring of text dialogs collected as part of a language learning application, particularly looking at interactional aspects. We observed, through careful design of the human scoring paradigm, a moderate-to-high agreement between the raters. We further examined two methods for automated scoring of such data (the first a feature engineering method that passes word and character n-grams, length and syntax features into an SVM-based classifier, and the second a model engineering method that leverages an end-to-end memory network (MemN2N) to model dependencies between turn and prompt histories using memory components) and found that a fusion of both methods performs close to or at par with human inter-rater agreement statistics.
While our results are encouraging, there is still much work ahead in understanding and scoring interactional competence. One of the key reasons for this is that the features we considered were text-based, and it is unclear how features that don't directly consider information from audio or visual channels are useful in predicting properties related to interaction (engagement, for instance). Repair and appropriateness, and even turn taking to a lesser extent, are related to proficiency in language use, and hence it makes sense that features such as n-grams and syntax use might be somewhat useful in predicting these aspects of interaction. However, some of the results might also be explained by some of our examined features being highly correlated with more interpretable/relevant features. For instance, length might be an indication of a more proficient and verbose speaker, which might in turn correlate with a high level of engagement. Nonetheless, an understanding of how meaningful our text-based results are will be incomplete without examining features derived from audio (and visual streams, if available), including non-verbal and prosodic cues.
It is also worth mentioning tangentially related work on dialog interaction quality at this point (see for instance Schmitt and Ultes, 2015; Stoyanchev et al., 2019; See et al., 2019). While such work primarily focuses on investigating techniques to measure and improve the quality of the overall dialog interaction as opposed to providing targeted assessment and feedback on the quality of spoken language used during interactions, it might nonetheless be useful to take this body of work into account while developing techniques for automated proficiency scoring. This lays out multiple avenues for future work. First, as mentioned earlier, we will examine both text and speech signals for a more complete treatment of the scoring problem. Second, we would like to look at other broad aspects of conversational proficiency, such as delivery (for instance, fluency, intonation, vocabulary and grammar) and topic development (elaboration and task specificity, for example) in addition to building on the interaction aspects described here. Third, we will investigate combining feature-engineering and model-engineering approaches towards developing specific features and model architecture improvements that will help us push the automated scoring performance even higher. These will feed into our ultimate goal of being able to provide language learners with targeted, actionable feedback on different facets of conversational proficiency.