Discussion Tracker: Supporting Teacher Learning about Students’ Collaborative Argumentation in High School Classrooms

Teaching collaborative argumentation is an advanced skill that many K-12 teachers struggle to develop. To address this, we have developed Discussion Tracker, a classroom discussion analytics system based on novel algorithms for classifying argument moves, specificity, and collaboration. Results from a classroom deployment indicate that teachers found the analytics useful, and that the underlying classifiers perform with moderate to substantial agreement with humans.


Introduction
Collaborative argumentation in student dialogue is essential to individual learning as well as group problem-solving (Reznitskaya and Gregory, 2013). Strong collaborative argumentation is characterized by specific claims, supporting evidence, and reasoning about that evidence as well as by building upon, questioning, and debating ideas posed by others. However, teaching collaborative argumentation is an advanced skill that many high school teachers struggle to develop (Lampert et al., 2010), partially due to the practical challenge of keeping track of important features of students' talk while managing class and reflecting on students' talk when no record of it exists.
To address this challenge, we have developed Discussion Tracker (DT), a system that leverages natural language processing (NLP) to provide teachers with automatically generated data about three important dimensions of students' collaborative argumentation: argument moves, specificity, and collaboration. Discussion Tracker includes visualizations, interactive coded transcripts, collaboration maps, analytics across discussions, and instructional planning. In contrast to teacher dashboards, which largely focus on discussion analytics such as amount of student/teacher talk, teacher wait time, and teacher question type (Chen et al., 2014; Gerritsen et al., 2018; Pehmer et al., 2015; Blanchard et al., 2016), DT focuses on students' collaborative argumentation. In contrast to related NLP algorithms, which largely focus on coding student essays (Ghosh et al., 2016; Klebanov et al., 2016; Nguyen and Litman, 2016), asynchronous online discussions (Swanson et al., 2015), and news articles (Li and Nenkova, 2015), DT's NLP algorithms address the challenges of coding transcripts of synchronous, face-to-face classroom discussions.

Description of Discussion Tracker (DT)
To use DT, a teacher first uploads a classroom discussion transcript. Next, NLP classifiers code the transcript using a previously developed scheme for representing three important dimensions of collaborative argumentation (Olshefski et al., 2020): argument moves (claim, evidence, explanation), specificity (low, medium, high), and collaboration (new, agree, extension, challenge/probe). Student turns are the unit of analysis for collaboration. Argumentative Discourse Units (ADUs), which are either entire turns or segments within turns, are the units of analysis for argumentation and specificity.
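The coding scheme above can be pictured as a small data model. The following is a minimal sketch rather than DT's actual representation: the class and field names (ADU, StudentTurn, reference_turn) are hypothetical, and the example sentences and their labels are invented purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ArgumentMove(Enum):
    CLAIM = "claim"
    EVIDENCE = "evidence"
    EXPLANATION = "explanation"


class Specificity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Collaboration(Enum):
    NEW = "new"
    AGREE = "agree"
    EXTENSION = "extension"
    CHALLENGE_PROBE = "challenge/probe"


@dataclass
class ADU:
    """An argumentative discourse unit: a whole turn or a segment of one."""
    text: str
    move: ArgumentMove
    specificity: Specificity


@dataclass
class StudentTurn:
    """One student turn; collaboration is coded once per turn."""
    speaker: str
    collaboration: Collaboration
    adus: List[ADU] = field(default_factory=list)
    # Hypothetical field: index of the earlier turn this turn responds to.
    reference_turn: Optional[int] = None


# Invented example turn containing two ADUs.
turn = StudentTurn(
    speaker="S1",
    collaboration=Collaboration.NEW,
    adus=[
        ADU("I think he regrets it.", ArgumentMove.CLAIM, Specificity.LOW),
        ADU("On the last page he says so.", ArgumentMove.EVIDENCE, Specificity.MEDIUM),
    ],
)
```

Keeping collaboration on the turn and the other two codes on each ADU mirrors the two units of analysis described above.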
Each NLP classifier in DT was developed by training on a previously collected and freely available corpus 1 of collaborative argumentation (Olshefski et al., 2020). A pretrained BERT model (Devlin et al., 2019; Wolf et al., 2019) is used to generate word embeddings for each word in an ADU (or turn, for collaboration). An average pooling layer is then used to compute the final embedding for the target ADU. For predicting specificity, a softmax classifier is applied to the target ADU embedding. For predicting argument moves, the target ADU as well as a window of surrounding ADUs are embedded, then concatenated to form the final feature vector. A softmax layer is applied on top of the feature vector to complete the argument move classifier. This improves our prior argumentation models by using a pre-trained neural network and adding context information. The collaboration classifier is slightly more complex, since collaboration labels depend on the relationship between a target turn and a particular reference turn. For the purpose of this work we assume that the reference turn is already provided in the input transcript. A pretrained BERT model and average pooling layer are used to generate embeddings for the target and reference turns. An element-wise multiplication between the two embeddings is performed, yielding the feature vector used by a softmax classifier. All models use the bert-base-uncased BERT variant from the HuggingFace library (Wolf et al., 2019), chosen because it yields the smallest available dimensionality and thus keeps computational complexity to a minimum. The three models were built using the Keras library (Chollet and others, 2015). The Adam optimizer was used, along with early stopping on validation loss to automatically determine the number of training epochs (the validation set was chosen randomly and consisted of 10% of the initial training set for each fold).
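The three feature constructions described above (average pooling, context concatenation, and element-wise multiplication) can be sketched in plain Python. This is an illustration under stated assumptions, not the DT implementation: the function names are hypothetical, the 2-dimensional vectors stand in for real BERT token embeddings, neighbours are concatenated in transcript order, and positions outside the transcript are zero-padded, a detail the description does not specify.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Average token embeddings into a single unit-level embedding."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def context_features(adu_embeddings: List[Vector], target: int, window: int) -> Vector:
    """Concatenate the target ADU embedding with its neighbours' embeddings.

    Positions outside the transcript are zero-padded (an assumption of this sketch).
    """
    dim = len(adu_embeddings[0])
    features: Vector = []
    for i in range(target - window, target + window + 1):
        if 0 <= i < len(adu_embeddings):
            features.extend(adu_embeddings[i])
        else:
            features.extend([0.0] * dim)
    return features


def pair_features(target_turn: Vector, reference_turn: Vector) -> Vector:
    """Element-wise product of the target and reference turn embeddings."""
    return [t * r for t, r in zip(target_turn, reference_turn)]


# Toy 2-dimensional "token embeddings" for three units.
units = [
    mean_pool([[1.0, 2.0], [3.0, 4.0]]),  # [2.0, 3.0]
    mean_pool([[0.0, 1.0]]),              # [0.0, 1.0]
    mean_pool([[2.0, 2.0], [4.0, 0.0]]),  # [3.0, 1.0]
]

# Argument-move input: target unit 0 with one neighbour on each side.
move_input = context_features(units, target=0, window=1)

# Collaboration input: unit 0 as target turn, unit 2 as reference turn.
collab_input = pair_features(units[0], units[2])
```

In each case the resulting vector would be passed to a softmax layer. Note that the context feature vector grows linearly with the window size, while the pairwise product keeps the dimensionality of a single embedding.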
After classification, all discussion analytics are automatically generated from the NLP codes. The DT overview screen (Figure 1) includes pie charts indicating the distribution of the codes for students' argument moves, specificity, and collaboration. Other screens include interactive coded transcripts (Figure 2), collaboration maps (Figure 3), identification of strengths and weaknesses to support teacher goal-setting (Figure 4), and a history page (not shown) that compares the code distributions across discussions.
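As a rough illustration of how such analytics follow from the codes alone, the share of each code behind a pie chart can be computed by counting labels. The function name and the list of codes below are invented for this sketch and are not taken from DT or its data.

```python
from collections import Counter
from typing import Dict, List


def code_distribution(codes: List[str]) -> Dict[str, float]:
    """Return the fraction of units carrying each code (empty input gives {})."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Invented collaboration codes for the student turns of one discussion.
collaboration_codes = [
    "new", "extension", "extension", "agree",
    "challenge/probe", "extension", "new", "extension",
]
shares = code_distribution(collaboration_codes)
```

The same function applies unchanged to argument move and specificity codes, and comparing its output across transcripts gives the kind of cross-discussion view the history page provides.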
We initially implemented a desktop version of DT using Python and Tkinter. The screenshots in the figures and the usability evaluation below are based on this version. To make DT more portable across hardware and to allow teachers to easily use DT on multiple machines (e.g., school, home), we now have a web version of DT 2. This version is implemented in Python and uses the REMI package 3 to convert Python into HTML and launch a webserver that accepts requests for the site and handles user input. With this setup it is easy to integrate the classifiers, which are implemented as a REST API on the same server.
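To make the idea of exposing a classifier behind a REST endpoint concrete, here is a minimal sketch using only the Python standard library. It does not reproduce DT's server or use REMI: the JSON shape ({"adus": [...]}), the function names, and the port are all assumptions, and classify_specificity is a placeholder that stands in for a trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List, Tuple


def classify_specificity(adus: List[str]) -> List[str]:
    """Placeholder for a trained classifier; always answers "medium"."""
    return ["medium" for _ in adus]


def handle_classify(body: bytes) -> Tuple[int, bytes]:
    """Map a JSON request body to an HTTP status code and a JSON response body."""
    try:
        adus = json.loads(body)["adus"]
    except (ValueError, KeyError, TypeError):
        adus = None
    if not isinstance(adus, list):
        error = {"error": "expected a JSON object with an 'adus' list"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"specificity": classify_specificity(adus)}).encode()


class ClassifierHandler(BaseHTTPRequestHandler):
    """Answers POST requests by running the (placeholder) classifier."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_classify(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Start the server; blocks until interrupted. The port is arbitrary."""
    HTTPServer(("localhost", port), ClassifierHandler).serve_forever()
```

Separating handle_classify from the HTTP handler keeps the request logic testable without opening a socket, and a web front end on the same machine could call the endpoint over localhost.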

Evaluation
From January to March 2020, we collected data (corpus C2) to evaluate both teacher perceptions of DT and NLP classifier performance. In particular, the desktop version of DT was used by 18 high school English Language Arts teachers from 4 schools, where: 1) each teacher led a discussion about a literary text that was audio-recorded and observed by a researcher, 2) each teacher completed an online survey within a day, 3) experienced annotators 4 hand-coded transcripts of the discussion for the three dimensions of collaborative argumentation discussed above and uploaded them into the DT system, 4) within two weeks, researchers conducted a 45-minute cognitive interview (Voet and Wever, 2017) with each teacher while they were using DT to look at their students' discussion 5, and 5) the same day, teachers completed a second survey that mirrored the first with additional items for ratings of DT.
DT Usability. We measured teachers' perceptions of the overall usefulness of DT and of specific features/visualizations through Likert-scale items on the survey from step 5 above. Survey items were based on Holden and Rada's (2011) teacher survey of perceived usability of technology. To remove noise that might distract from this usability evaluation, we evaluated DT under the best possible NLP conditions by using the manual codings of collaborative argumentation from step 3 above to generate all analytics. The NLP codings are separately evaluated in the classifier discussion below. Table 1 indicates that teachers perceived DT to be very helpful for their learning about facilitating collaborative argumentation. For nine of the 13 items, all teachers selected either "Agree" or "Strongly agree" (a mean score of 4.5), and no item received a "Strongly disagree." Although the item "I find the system easy to use" received the lowest score (4.11), all teachers either agreed or agreed strongly with the item. Other items that scored higher, however, varied more in responses. For example, although the majority of teachers agreed with "The collaboration diagram is helpful," three neither agreed nor disagreed.

Question | Mean
The overview of the discussion is helpful. | 4.67
The pie charts of different features of the student discussion are helpful. | 4.78
The annotated transcript of student discussion is helpful. | 4.89
The collaboration diagram is helpful. | 4.22
The system-generated strengths and weaknesses are helpful. | 4.44
The goal-setting is helpful. | 4.56
The instructional resources are helpful. | 4.17
I find the system easy to use. | 4.11
The system helps me to recognize my students' strengths during discussion. | 4.72
The system helps me to recognize my students' weakness during discussion. | 4.72
The system gives me more insight into student learning than I usually get from thinking about the discussion. | 4.67
The system encourages me to make more changes to my facilitation of discussion than I usually do. | 4.28
Overall, Discussion Tracker is helpful for my teaching of literature discussions. | 4.72
Table 1: Teacher survey items and Likert score means.

NLP Classifier Performance. As the gold standard for evaluating DT classifier performance, we used the manual annotations from step 3 of the data collection discussed above. Table 2 shows the distribution of the gold-standard codes, while Table 3 shows classifier performance when compared to these gold standards. 6 The results in Table 3 were obtained by training each classifier separately on corpus C1 (footnote 1) and testing on corpus C2 (the 18 discussions collected in this study). Hyperparameter optimization was performed using cross-validation on C1 to determine how much contextual information before and after the target ADU to consider (i.e., the context window size). This yielded an argument move classifier that embeds a window of 2 ADUs preceding and 2 ADUs following the target ADU. Though all classifiers show respectable results, predictions for argument move and specificity are more consistent across individual class labels, as evidenced by the small difference between macro and micro F-score. The lower macro F-score for collaboration is due to poor prediction performance for the agree and challenge/probe codes.
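The gap between macro and micro F-score can be illustrated with a small computation. The sketch below is a generic implementation of the two averages, not the evaluation code used for Table 3, and the gold and predicted labels are invented toy data rather than results from C2.

```python
from typing import Dict, List


def per_class_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """F1 for each label, computed from true/false positives and false negatives."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1; every class counts equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def micro_f1(gold: List[str], pred: List[str]) -> float:
    """Pooled F1; for single-label classification this equals accuracy."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data: a frequent class predicted well, a rare class never predicted.
gold = ["extension"] * 8 + ["agree"] * 2
pred = ["extension"] * 10

micro = micro_f1(gold, pred)  # 8 of 10 correct
macro = macro_f1(gold, pred)  # mean of F1("agree") = 0 and F1("extension") = 16/18
```

In this toy case the micro score stays high because most units belong to the well-predicted class, while the macro score is pulled down by the class that is never predicted correctly. This is the same pattern described above for the collaboration classifier.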

Summary and Future Directions
In this work we described the development of a classroom analytics system and reported usability results from a real-world classroom deployment. We conducted a survey that showed teachers found the system easy to use and the analytics (based on human-annotated labels) helpful in analyzing collaborative argumentation. Evaluation of the automated NLP classifiers showed that they are in moderate to substantial agreement with the labels provided by human annotators. The main goal of future work is to continue to enhance our neural classification methods and to develop an end-to-end, completely automated system. To this end, we will perform new data collections and improve classifier performance, incorporate Automatic Speech Recognition to perform automated transcription, and develop algorithms to automatically segment turns into ADUs. In addition, we will further develop the interface by addressing teacher feedback and improving the system's ease of use. Finally, teachers will evaluate our newer versions of DT, including versions where the analytics are based on classifier outputs rather than human-annotated labels.