Modeling Second-Language Learning from a Psychological Perspective

Psychological research on learning and memory has tended to emphasize small-scale laboratory studies. However, large datasets of people using educational software provide opportunities to explore these issues from a new perspective. In this paper we describe our approach to the Duolingo Second Language Acquisition Modeling (SLAM) competition which was run in early 2018. We used a well-known class of algorithms (gradient boosted decision trees), with features partially informed by theories from the psychological literature. After detailing our modeling approach and a number of supplementary simulations, we reflect on the degree to which psychological theory aided the model, and the potential for cognitive science and predictive modeling competitions to gain from each other.


Introduction
Educational software that aims to teach people new skills, languages, and academic subjects has become increasingly popular. The widespread deployment of these tools has created interesting opportunities to study the process of learning in large samples. The Duolingo shared task on Second Language Acquisition Modeling (SLAM) was a competitive modeling challenge run in early 2018. The challenge, organized by Duolingo [1], a popular second language learning app, was to use log data from thousands of users completing millions of exercises to predict patterns of future translation mistakes in held-out data. The data was divided into three sets covering Spanish speakers learning English (en_es), English speakers learning Spanish (es_en), and English speakers learning French (fr_en). This paper reports the approach used by our team, which finished in third place for the en_es data set, second place for es_en, and third place for fr_en.
[1] http://duolingo.com
Learning and memory have been a core focus of psychological science for over 100 years. Most of this work has sought to build explanatory theories of human learning and memory using relatively small-scale laboratory studies. Such studies have identified a number of important and apparently robust phenomena in memory, including the nature of the retention curve (Rubin and Wenzel, 1996), the advantage for spaced over massed practice (Ruth, 1928; Cepeda et al., 2006; Mozer et al., 2009), the testing effect (Roediger and Karpicke, 2006), and retrieval-induced forgetting (Anderson et al., 1994). The advent of large datasets such as the one provided in the Duolingo SLAM challenge may offer a new perspective and approach that may prove complementary to laboratory-scale science (Griffiths, 2015; Goldstone and Lupyan, 2016). First, the much larger sample sizes may help to better identify parameters of psychological models. Second, datasets covering more naturalistic learning situations may allow us to test the predictive accuracy of psychological theories in a more generalizable fashion (Yarkoni and Westfall, 2017).
Despite these promising opportunities, it remains unclear how much of current psychological theory might be important for tasks such as the Duolingo SLAM challenge. In the field of educational data mining, researchers trying to build predictive models of student learning have typically relied on traditional, interpretable models and approaches that are rooted in cognitive science (e.g., Atkinson, 1972b,a; Corbett and Anderson, 1995; Pavlik and Anderson, 2008). However, a recent paper found that state-of-the-art results could be achieved using deep neural networks with little or no cognitive theory built in (so-called "deep knowledge tracing", Piech et al., 2015). Khajah et al. (2016) compared deep knowledge tracing (DKT) to more standard "Bayesian knowledge tracing" (BKT) models and showed that it was possible to equate the performance of the BKT model with that of DKT by adding features and parameters that represent core aspects of the psychology of learning and memory, such as forgetting and individual abilities. An ongoing debate remains in this community over whether flexible models with lots of data can improve on more heavily structured, theory-based models (Tang et al., 2016; Xiong et al., 2016; Zhang et al., 2017).
For our approach to the SLAM competition, we decided to use a generic and fairly flexible model structure that we provided with hand-coded, psychologically inspired features. We therefore positioned our entry to SLAM somewhat in between the approaches mentioned above. Specifically, we used gradient boosting decision trees (GBDT, Ke et al., 2017) for the model structure, which is a powerful classification algorithm that is known to perform well across various kinds of data sets. Like deep learning, GBDT can extract complex interactions among features, but it has some advantages including faster training and easier integration of diverse inputs.
We then created a number of new psychologically-grounded features for the SLAM dataset covering aspects such as user perseverance, learning processes, contextual factors, and cognate similarity. After finding a model that provided the best held-out performance on the test data set, we conducted a number of "lesioning" studies where we selectively removed features from the model and re-estimated the parameters in order to assess the contribution of particular types of features. We begin by describing our overall modeling approach, and then discuss some of the lessons learned from our analysis.

Task Approach
We approached the task as a binary classification problem over instances. Each instance was a single word within a sentence of a translation exercise, and the classification problem was to predict whether a user would translate the word correctly or not. Our approach can be divided into two components: constructing a set of features that is informative about whether a user will answer an instance correctly, and designing a model that can achieve high performance using this feature set.

Feature Engineering
We used a variety of features, including features directly present in the training data, features constructed using the training data, and features that use information external to the training data. Except where otherwise specified, categorical variables were one-hot encoded.

Exercise features
We encoded the exercise number, client, session, format, and duration (i.e., number of seconds to complete the exercise), as well as the time since the user started using Duolingo for the first time.

Word features
Using spaCy [2], we lemmatized each word to produce a root word. Both the root word token and the original token were used as categorical features. Due to their high cardinality, these features were not one-hot encoded but were preserved in single columns and handled in this form by the model (as described below).
Along with the tokens themselves, we encoded each instance word's part of speech, morphological features, and dependency edge label. We noticed that some words in the original dataset were paired with the wrong morphological features, particularly near where punctuation had been removed from the sentence. To fix this, we reprocessed the data using Google SyntaxNet [3].
We also encoded word length and several word characteristics gleaned from external data sources. Research in psychology has suggested certain word features that play a role in how difficult a word is to process, as measured by how long readers look at the word as well as people's performance in lexical-decision and word-identification tasks. Two such features that have somewhat independent effects are word frequency (i.e., how often the word occurs in natural language; Rayner, 1998) and age-of-acquisition (i.e., the age at which children typically exhibit the word in their vocabulary; Brysbaert and Cortese, 2011; Ferrand et al., 2011). We therefore included a feature that encoded the frequency of each word in the language being acquired, calculated from Speer et al. (2017), and a feature that encoded the mean age-of-acquisition (of the English word, in English native speakers), derived from published age-of-acquisition norms for 30,000 words (Kuperman et al., 2012), which covered many of the words present in the dataset. Additionally, words sharing a common linguistic derivation (also called "cognates"; e.g., "secretary" in English and "secretario" in Spanish) are easier to learn than words with dissimilar translations (De Groot and Keijzer, 2000). As an approximate measure of linguistic similarity, we used the Levenshtein edit distance between the word tokens and their translations, scaled by the length of the longer word. We found translations using Google Translate [4] and calculated the Levenshtein distance to reflect the letter-by-letter similarity of the word and its translation (Hyyrö, 2001).
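The scaled edit distance described above can be sketched as follows. This is a minimal illustration using the textbook dynamic-programming algorithm rather than the implementation used for the competition; the function names `levenshtein` and `scaled_distance` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def scaled_distance(word: str, translation: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(word), len(translation))
    if longest == 0:
        return 0.0  # two empty strings are treated as identical
    return levenshtein(word, translation) / longest


# The cognate pair from the text differs by one substitution and one insertion.
print(scaled_distance("secretary", "secretario"))
```

Lower values indicate closer cognates, while a pair with no characters in common at any position approaches 1.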

User features
Just as we did for word tokens, we encoded the user ID as a single-column, high-cardinality feature. We also calculated several other user-level features that related to the "learning type" of a user. In particular, we encoded features that might be related to psychological constructs such as the motivation and diligence of a user. These features could help predict how users interact with old and novel words they encounter.
As a proxy for motivation, we speculated that more motivated users would complete more exercises every time they decide to use the app. To estimate this, we grouped each user's exercises into "bursts." Bursts were separated by at least an hour. We used three concrete features about these bursts, namely the mean and median number of exercises within bursts as well as the total number of bursts of a given user (to give the model a feature related to the uncertainty in the central tendency estimates).
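A rough sketch of the burst features follows. It assumes exercise start times are available per user in hours (the unit and the name `burst_features` are assumptions made for illustration) and treats a gap of at least one hour as the boundary between bursts.

```python
from statistics import mean, median


def burst_features(times_hours, gap_hours=1.0):
    """Group one user's exercise times into bursts and summarize them.

    times_hours: exercise start times for a single user, in hours.
    A new burst starts whenever the gap to the previous exercise
    is at least gap_hours.
    """
    times = sorted(times_hours)
    if not times:
        return {"mean_burst": 0.0, "median_burst": 0.0, "n_bursts": 0}
    sizes = [1]  # number of exercises in each burst
    for prev, curr in zip(times, times[1:]):
        if curr - prev >= gap_hours:
            sizes.append(1)   # gap is long enough: open a new burst
        else:
            sizes[-1] += 1    # same sitting: extend the current burst
    return {
        "mean_burst": mean(sizes),
        "median_burst": median(sizes),
        "n_bursts": len(sizes),
    }


# Three sittings containing 3, 2, and 1 exercises respectively.
print(burst_features([0.0, 0.1, 0.2, 5.0, 5.5, 30.0]))
```

The burst count is returned alongside the averages because a mean computed from two bursts is much less reliable than one computed from two hundred.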
As a proxy for diligence, we speculated that a very diligent user might be using the app regularly at the same time of day, perhaps following a study schedule, compared to a less diligent user whose schedule might vary more. The data set did not provide a variable with the time of day, which would have been an interesting feature on its own. Instead, we were able to extract for each exercise the time of day relative to the first time a user had used the app, ranging from 0 to 1 (with 0 indicating the same time, 0.25 indicating a relative shift by 6 hours, etc.). We then discretized this variable into 20-minute bins and computed the entropy of the empirical frequency distribution over these bins. A lower entropy score indicated less variability in the times of day a user started their exercises.
[4] https://cloud.google.com/translate/
The entropy score might also give an indication for context effects on users' memory. A user practicing exercises more regularly is more likely to be in the same physical location when using the app, which might result in better memory of previously studied words (Godden and Baddeley, 1975).
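The entropy feature can be sketched as below. The sketch assumes each exercise carries a timestamp expressed in days since the user's first exercise, so that the fractional part gives the relative time of day; the function name and the use of the natural logarithm are assumptions, since the text does not specify a base.

```python
import math
from collections import Counter


def time_of_day_entropy(days, bin_minutes=20):
    """Entropy of a user's relative time-of-day distribution.

    days: times of the user's exercises, in days since their first
    exercise. The fractional part (0 to 1) is the time of day relative
    to that first exercise.
    """
    n_bins = (24 * 60) // bin_minutes  # 72 bins of 20 minutes each
    bins = Counter(int((d % 1.0) * n_bins) % n_bins for d in days)
    total = sum(bins.values())
    # Shannon entropy in nats over the occupied bins only.
    return -sum((c / total) * math.log(c / total) for c in bins.values())


# A user who always practices at the same relative time has zero entropy.
regular = time_of_day_entropy([0.0, 1.0, 2.0, 3.0])
# A user spread evenly over two distinct times has entropy log(2).
varied = time_of_day_entropy([0.0, 0.5, 1.0, 1.5])
print(regular, varied)
```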

Positional features
To account for the effects of surrounding words on the difficulty of an instance, we created several features related to the instance word's context in the exercise. These included the token of the previous word, the next word, and the instance word's root in the dependency tree, all stored in single columns as with the instance token itself. We also included the part of speech of each of these context words as additional features. When there was no previous word, next word, or dependency-tree root word, a special None token or None part of speech was used.

Temporal features
A user's probability of succeeding on an instance is likely related to their prior experience with that instance. To capture this, we calculated several features related to past experience.
First, we encoded the number of times the current exercise's exact sentence had been seen before by the user. This is informed by psychological research showing memory and perceptual processing improvements for repeated contexts or "chunks" (e.g., Chun and Phelps, 1999).
We also encoded a set of features recording past experience with the particular instance word. These features were encoded separately for the instance token and for the instance root word created by lemmatization. For each token (and root) we tracked user performance through four weighted error averages. At the user's first encounter of the token, each error term E starts at zero. After an encounter with an instance of the token with label L (0 for success, 1 for error), it is updated according to the equation:
$$E \leftarrow E + \alpha (L - E)$$
where α determines the speed of error updating.
The four weighted error terms use α = {.3, .1, .03, .01}, allowing both short-run and long-run changes in a user's error rate with a token to be tracked. Note that in cases where a token appears multiple times in an exercise, a single update of the error features is conducted using the mean of the token labels. Along with the error tracking features, for each token we calculated the number of labeled, unlabeled, and total encounters; time since last labeled encounter and last encounter; and whether the instance is the first encounter with the token.
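The error-tracking features might be implemented along the following lines. The sketch assumes the update takes the standard exponentially weighted form E ← E + α(L − E), which matches the description of E starting at zero and α controlling the speed of updating; the names `ALPHAS` and `update_errors` are hypothetical.

```python
ALPHAS = (0.3, 0.1, 0.03, 0.01)  # fast to slow learning rates from the text


def update_errors(errors, labels, alphas=ALPHAS):
    """Apply one update to all four weighted error averages for a token.

    errors: current error averages, one per alpha (all zero at first).
    labels: labels of the token's occurrences in one exercise
            (0 = correct, 1 = error). Several occurrences in the same
            exercise are collapsed into their mean, giving one update.
    """
    target = sum(labels) / len(labels)
    return [e + a * (target - e) for e, a in zip(errors, alphas)]


errors = [0.0, 0.0, 0.0, 0.0]          # first encounter with the token
errors = update_errors(errors, [1])     # the user makes a mistake
errors = update_errors(errors, [0, 1])  # token appears twice, one mistake
print(errors)
```

Larger values of α make the average respond quickly to recent mistakes, while smaller values retain a long-run estimate of the user's error rate on that token.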
In the training data, all instances are labeled as correct or incorrect, so the label for the previous encounter is always available. In the test data, labels are unavailable, so predictions must be made using a mix of labeled and unlabeled past encounters. In particular, for a user's test set with n exercises, each exercise will have between zero and n − 1 preceding unlabeled exercises.
To generate training-set features that are comparable to test-set features, we selectively ignored some labels when encoding temporal features on the training set. Specifically, for each user we first calculated the number of exercises n in their true test set [5]. Then, when encoding the features for each training instance, we selected a random integer r in the range [0, n − 1], and ignored the labels in the prior r exercises. That is, we encoded features for the instance as though other instances in those prior exercises were unlabeled, and ignored updates to the error averages from those exercises. The result of this process is that each instance in the training set was encoded as though it were between one and n exercises into the test set.
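The label-masking procedure can be illustrated with a simplified sketch. It tracks a single error average for a single token, assumes every exercise contains that token, and assumes the exponentially weighted update E ← E + α(L − E); the function name and arguments are hypothetical. Storing a snapshot of the average after each exercise makes it cheap to look up the value as it stood r exercises earlier.

```python
import random


def masked_error_features(exercise_errors, n_test, alpha=0.1, seed=0):
    """Error-average feature per exercise with recent labels hidden.

    exercise_errors: mean error label (0 to 1) of the token in each of
                     one user's exercises, in chronological order.
    n_test: number of exercises in that user's test set.
    """
    rng = random.Random(seed)
    snapshots = [0.0]  # snapshots[k] = error average after k exercises
    features = []
    for i, label in enumerate(exercise_errors):
        r = rng.randint(0, n_test - 1)  # prior exercises treated as unlabeled
        visible = max(i - r, 0)         # labeled exercises that remain visible
        features.append(snapshots[visible])
        # The true running average still advances for later instances.
        snapshots.append(snapshots[-1] + alpha * (label - snapshots[-1]))
    return features


# With n_test = 1, r is always 0 and nothing is hidden.
print(masked_error_features([1, 1, 0], n_test=1, alpha=0.5))
```

With larger values of `n_test`, many instances are encoded from a stale snapshot, mimicking the situation late in the test set where most recent encounters are unlabeled.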

Modeling
After generating all of the features for the training data, we trained GBDT models to minimize log loss. GBDT works by iteratively building regression trees, each of which seeks to minimize the residual loss from prior trees. This allows it to capture non-linear effects and high-order interactions among features. We used the LightGBM [6] implementation of GBDT (Ke et al., 2017).
For continuous-valued features, GBDT can split a leaf at any point, creating different predicted values above and below that threshold. For categories that are one-hot encoded, it can split a leaf on any of the category's features. This means that for a category with thousands of values, potentially thousands of tree splits would be needed to capture its relation to the target. Fortunately, LightGBM implements an algorithm for partitioning the values of a categorical feature into two groups based on their relevance to the current loss, and creates a single split to divide those groups (Fisher, 1958). Thus, as alluded to above, high-cardinality features like token and user ID were encoded as single columns and handled as categories by LightGBM.
[5] If the size of the test set were not available, it could be estimated based on the fact that it is approximately 5% of each participant's data.
[6] http://lightgbm.readthedocs.io/
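The idea behind the two-group categorical split can be illustrated with a simplified sketch. This is not LightGBM's code: it uses plain residuals and a squared-error criterion instead of the library's gradient statistics and regularization, and the function name is hypothetical. The key step is sorting categories by their mean residual, after which only contiguous cut points in that ordering need to be examined.

```python
from collections import defaultdict


def best_categorical_split(categories, residuals):
    """Partition category values into two groups with one split.

    Categories are ordered by mean residual, then every cut point in
    that ordering is scored by the squared-error reduction it yields.
    Returns (left_group, right_group) as sets of category values.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for c, r in zip(categories, residuals):
        sums[c] += r
        counts[c] += 1
    ordered = sorted(sums, key=lambda c: sums[c] / counts[c])
    total_sum, total_n = sum(sums.values()), len(residuals)

    best_gain, best_left = float("-inf"), set()
    left_sum, left_n = 0.0, 0
    for k, c in enumerate(ordered[:-1]):  # keep the right side non-empty
        left_sum += sums[c]
        left_n += counts[c]
        right_sum, right_n = total_sum - left_sum, total_n - left_n
        # Maximizing this quantity minimizes the summed squared error.
        gain = left_sum ** 2 / left_n + right_sum ** 2 / right_n
        if gain > best_gain:
            best_gain, best_left = gain, set(ordered[: k + 1])
    return best_left, set(ordered) - best_left


cats = ["a", "a", "b", "b", "c", "c"]
res = [1.0, 0.9, -1.0, -1.1, 0.95, 1.05]
print(best_categorical_split(cats, res))
```

Here a single split separates category "b" from "a" and "c", whereas a one-hot encoding would need separate splits on individual indicator columns to express groupings over many values.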
We trained a model for each of the three language tracks of en_es, es_en, and fr_en, and also trained a model on the combined data from all three tracks, adding an additional "language" feature. Following model training, we averaged the predictions of each single-language model with that of the all-language model to form our final predictions. Informal experimentation showed that model averaging provided a modest performance boost, and that weighted averages did not clearly outperform a simple average.
To tune model hyper-parameters and evaluate the usefulness of features, we first trained the models on the train data set and evaluated them on the dev data set. Details of the datasets and the actual files are provided on the Harvard Dataverse (Settles, 2018). Once the model structure was finalized, we trained on the combined train and dev data and produced predictions for the test data. The LightGBM hyperparameters used for each model are listed in Table 1.

Performance
The AUROC of our final predictions was .8585 on en_es, .8350 on es_en, and .8540 on fr_en. For reference, this placed us within .01 of the winning entry for each problem (.8613 on en_es, .8383 on es_en, and .8570 on fr_en). Also note that the Duolingo-provided baseline model (L2-regularized regression trained with stochastic gradient descent weighted by frequency) obtains .7737 on en_es, .7456 on es_en, and .7707 on fr_en. We did not attempt to optimize F1 score, the competition's secondary evaluation metric.
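For readers unfamiliar with the metric, AUROC can be computed from ranks as in the sketch below. This is a generic illustration of the measure, not the competition's evaluation script, and the function name is hypothetical. The value equals the probability that a randomly chosen positive instance (here, an error) receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum formulation.

    labels: 0/1 ground truth (1 = positive class).
    scores: predicted scores, higher meaning more likely positive.
    """
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1  # extend over a run of tied scores
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of .5 corresponds to chance-level ranking and 1 to perfect separation, which puts the gap between the baseline and the top entries in context.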

Feature Removal Experiments
To better understand which features or groups of features were most important to our model's predictions, we conducted a set of experiments in which we lesioned (i.e., removed) a group of features and re-trained the model on the train set, evaluating performance on the dev set. For simplicity, we ran each of the lesioned models on all language data and report the average performance. We did not run individual-language models as we did for our primary model. The results of the lesion experiments are shown in Figure 1. The models are as follows.
none: All features are included.
temporal: Temporal information, including number and timing of past encounters with the word and error tracking information, is removed.
Interestingly, we found that for both user-level and word-level features, the bulk of the model's predictive power could be achieved using IDs alone, represented as high-cardinality categorical features. Removing other word features, such as morphological features and part of speech, created only a small degradation of performance. In the case of users, removing features such as entropy and average exercise burst length led to a tiny increase in performance. In the case of both users and words, though, we found that in the absence of ID features the other features are helpful and lead to better performance than removing all features. We also found that removing all information about neighboring words and the dependency-parse root word degraded performance. This confirms that word context matters, and suggests that users commonly make errors in word order, subject-verb matching, and other grammatical rules.
Our external word features (Levenshtein distance to translation, frequency, and age of acquisition) provided a slight boost to model performance, showing the benefit of considering what makes a word hard to learn from a psychological and linguistic perspective. Adding temporal features about past encounters and errors helped the models, but not as much as we expected. While not included in the final model, we had also tried augmenting the temporal feature set with more features related to massing and spacing of encounters with a word, but found that this did not improve performance. This is perhaps not surprising given how small the benefit of the existing temporal features is in our model.
Though not plotted above, we also ran a model lesioning exercise-level features including client, session type, format, and exercise duration. This model achieved an AUROC of .787, far lower than any other lesion. This points to the fact that the manner in which memory is assessed often affects observed performance (e.g., the large literature in psychology on the difference between recall and recognition memory, Yonelinas, 2002).

Discussion
When approaching the Duolingo SLAM task, we hoped to leverage psychological insights in building our model. We found that in some cases, such as when using the word's age-of-acquisition, this was helpful. In general, though, our model gained its power not from hand-crafted features but from applying a powerful inference technique (gradient boosted decision trees) to raw input about user IDs, word IDs, and exercise features.
There are multiple reasons for the limited applicability of psychology to this competition. First, computational psychological models are typically designed based on small laboratory data sets, which might limit their suitability for generating highly accurate predictions in big data settings. Because they are designed not for prediction but for explanation, they tend to use a small number of input variables and allow those variables to interact in limited ways. In contrast, gradient boosted decision trees, as well as other cutting-edge techniques like deep learning, can extract high-level interactions among hundreds of features. While they are highly opaque, require a lot of data, and are not amenable to explanation, these models excel at prediction.
Second, it is possible that our ability to use theories of learning, including ideas about massed and spaced practice, was disrupted by the fact that the data may have been adaptively created using these very principles (Settles and Meeder, 2016). If Duolingo adaptively sequenced the spacing of trials based on past errors, then the relationship between future errors and past spacing may have substantially differed from that found in the psychological literature (Cepeda et al., 2006).
Finally, if the task had required broader generalization, psychologically inspired features might have performed more competitively. In the SLAM task, there is a large amount of labeled training data for every user and for most words. This allows simple ID-based features to work because the past history of a user will likely influence their future performance. However, with ID-based features there is no way to generalize to newly encountered users or words, which have an ID that was not in the training set. The learned ID-based knowledge is useless here because there is no way to generalize from one unique ID to another. Theory-driven features, in contrast, can often generalize to new settings because they capture aspects that are shared across (subsets of) users, words, or situations of the learning task. For example, if we were asked to generalize to a completely new language such as German, many parts of our model would falter, but word frequency, age of acquisition, and Levenshtein distance to first-language translation would still likely prove to be features with high predictive utility.
In sum, we believe that the Duolingo SLAM dataset and challenge provide interesting opportunities for cognitive science and psychology. Large-scale, predictive challenges like this one might be used to identify features or variables that are important for learning. Then, complementary laboratory-scale studies can be conducted which establish the causal status of such features through controlled experimentation. Conversely, insights from controlled experiments can be used to generate new features that aid predictive models on naturalistic datasets (Griffiths, 2015; Goldstone and Lupyan, 2016). This type of two-way interaction could lead to long-run improvements in both scientific explanation and real-world prediction.