A Memory-Sensitive Classification Model of Errors in Early Second Language Learning

In this paper, we explore a variety of linguistic and cognitive features to better understand second language acquisition in early users of the language learning app Duolingo. With these features, we trained a random forest classifier to predict errors in early learners of French, Spanish, and English. Of particular note was our finding that mean and variance in error for each user and token can be a memory efficient replacement for their respective dummy-encoded categorical variables. At test, these models improved over the baseline model with AUROC values of 0.803 for English, 0.823 for French, and 0.829 for Spanish.


Introduction
Learning a new language is often a challenging task for adults. However, there are many linguistic and cognitive factors that can facilitate (or impair) acquisition of a non-native language, ranging from properties of the languages a learner already knows, to the methods and nature of study. Much work has sought to manipulate these factors in order to both further our understanding of the cognitive systems in play and facilitate learning.
Here, we present a model that explores these factors to predict outcomes for three populations of language learners that use Duolingo, a language learning app that gamifies lessons for a wide variety of to-be-learned languages. We start by describing the various features we developed from the data before describing the random forest model used and the subsequent outcomes.

Related Work
Little work has been done building predictive models of adult language acquisition, but many studies have explored the linguistic factors that impact vocabulary learning in a non-native language. Semantic properties of nouns, for example, have been found to impact word learning. Cognates, or words that overlap in form and meaning in both languages (e.g. lemon in English and limón in Spanish), have been shown to be easier to learn (de Groot & Keijzer, 2000). The same study showed that words that are rated as more concrete (hat as opposed to liberty) are easier to learn. While perhaps more surprising than the cognate result, this effect is often explained by the fact that more concrete words create more perceptual connections to their conceptual referents (it is easier to imagine a physical hat than the abstract concept of liberty), and it is therefore easier to connect new words to concepts via those connections.
There are likewise many factors that can hinder word learning. For example, interlingual homographs, or words that share surface form but have different meanings (pan as something to fry on in English and bread in Spanish), are harder to process and may therefore also be harder to learn (Dijkstra, Timmermans & Schriefers, 2000).
Beyond the linguistic particulars of individual words, the temporal dynamics of learning can powerfully moderate memory. One of the most well-established results in cognitive psychology is that two repetitions of a to-be-learned item are best separated by some temporal gap, if the goal is long-term retention (Ebbinghaus, 1885/1964; Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006; Cepeda, Vul, Rohrer, Wixted, & Pashler, 2008; Donovan & Radosevich, 1999; T. D. Lee & Genovese, 1988). That is, given a fixed amount of available time to learn something, a learner is better off distributing that time over multiple learning sessions than cramming it all into a single session. Further, the more time that is allowed to pass before a learner encounters a previously learned item again, the longer into the future the learner can expect to retain that item (or equivalently, the greater the probability of successful retrieval of that item at a particular future time; but see Cepeda et al. 2008).

Data
The data were collected in 2017 from Duolingo, as part of the NAACL HLT 2018 Shared Task on Second Language Acquisition Modeling (SLAM, Settles, Brust, Gustafson, Hagiwara & Madnani, 2018). The data consisted of exercise and phrase level information for three populations of language learners in their first 30 days of using the app: English-speaking learners of Spanish and French as well as Spanish-speaking learners of English.
The data were split into a training set, which consisted of each user's first 80% of sessions, a development set (for testing model generalization before the test phase) that contained the next 10% of each user's data, and a test set that contained the final 10% of exercises for each user. The training data set consisted of 1,882,701 exercises in total (38.9% from learners of Spanish, 43.8% from learners of English and 17.3% from learners of French), while the development data contained 255,383 exercises (45.3% from learners of Spanish, 37.6% from learners of English and 17.1% from learners of French), and the test set contained 249,484 exercises (45.9% from learners of Spanish, 37.4% from learners of English and 16.7% from learners of French).
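The per-user chronological split described above can be illustrated with a minimal sketch. The record layout (user id, time stamp, payload) and the function name split_by_user are assumptions made for illustration; they are not the SLAM file format or the shared task's own tooling.

```python
from collections import defaultdict


def split_by_user(exercises, train_frac=0.8, dev_frac=0.1):
    """Split each user's time-ordered records into train/dev/test lists.

    `exercises` is an iterable of (user_id, timestamp, payload) tuples.
    This layout is hypothetical and only serves the illustration.
    """
    by_user = defaultdict(list)
    for record in exercises:
        by_user[record[0]].append(record)

    train, dev, test = [], [], []
    for records in by_user.values():
        records.sort(key=lambda r: r[1])  # chronological order within user
        n = len(records)
        train_end = int(n * train_frac)
        dev_end = int(n * (train_frac + dev_frac))
        train.extend(records[:train_end])       # earliest 80%
        dev.extend(records[train_end:dev_end])  # next 10%
        test.extend(records[dev_end:])          # final 10%
    return train, dev, test
```

Splitting within each user, rather than across the pooled data, means every user contributes to all three sets, so the later sets always contain that user's most recent behavior.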

Features
Our approach to modeling errors in second language acquisition was driven primarily by two distinct bodies of research: linguistic effects in second language acquisition, and drivers of robust memory in general. As such we discuss each set of features separately.

Linguistic features
In this section, we describe the semantic and morpho-syntactic features added to the model. Values for tokens that were not in the databases listed below were set to the mean of the feature.
Word length. Orthographic and phonological length (orthoLength and phonLength, respectively) are predictive of word difficulty, as longer written or spoken words generally leave more room for potential errors (Baddeley, Thomson & Buchanan, 1975). Phonological length was taken from the CLEARPOND database (Marian, Bartolotti, Chabal & Shook, 2012).
Word neighbors. A greater number of orthographic and phonological neighbors (orthoNei and phonNei) for a given word in both the to-be-learned and known languages might cause interference, leading to errors. These data were also taken from the CLEARPOND database.
Word Frequency. The log-transformed frequency (logWordFreq) of the English, Spanish and French words to be learned was also included as a predictor, as was the average log frequency of the orthographic (logOrthoNeiFreq) and phonological (logPhonNeiFreq) neighbors in the to-be-learned as well as the known language.
Edit Distance. Because cognate status impacts language learning, we calculated the Levenshtein distance between each token and its translation into the user's language (English for Spanish and French learners, and Spanish for English learners). Single-word translations were obtained from the Google Translate API. Cognates like lemon and limón should have a short edit distance, while words like boy and niño will have relatively longer distances.
Interlingual homographs. Additionally, the interlingual homograph status of each token (whether or not the token shares its surface form with a translation of a different token) was added for each language using the Google Translate API.
Morphological Complexity. As a proxy for how morphologically complex a given word is, the number of morphological features present in the provided morphology columns was summed (morphoComplexity).
Concreteness. Means and standard deviations of concreteness ratings were taken from Brysbaert, Warriner and Kuperman (2014) and added to the model.
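To make the edit distance feature concrete, the following is a minimal sketch of the standard dynamic-programming Levenshtein distance, applied to the cognate and non-cognate examples mentioned above. The function name is our own, and the translation step (the Google Translate API call) is deliberately left out; the sketch only assumes that both strings use precomposed accented characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("lemon", "limón"))  # 2: e -> i and o -> ó
print(levenshtein("boy", "niño"))     # 4: no cognate overlap
```

Note that accented and unaccented vowels count as different characters here, so limón is two edits away from lemon rather than one.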

Memory features
Repetition & Experience. Each instance (i.e., each token in each exercise for each user) was labeled with (1) the number of times the current user had encountered that token, up to and including the current instance (nthOccurrence) and (2) the number of instances the user had seen in total, up to and including the current instance (userTrial).
Spaced Repetition. The amount of time that elapses between successive repetitions of a given item strongly moderates memory for that item (see "Related Work", above). As such, we extracted a number of spacing-related features. To measure the temporal lag, and to capture the power law relationship between time and forgetting, we calculated (separately for each user) the log(days) that had elapsed between: (1) each token and its previous occurrence (tokenLag1), (2) each token's previous occurrence and its next most recent occurrence (tokenLag2), (3) each token's stem (e.g. help, for helping) and its previous occurrence (stemLag1), (4) each token's stem's previous occurrence and its next most recent occurrence (stemLag2), (5) each token's combination of several morphological features (number, person, tense, verbform) and the previous occurrence of that particular combination (morphoLag1; to capture any possible spacing effect for verb conjugation skills) and (6) the previous occurrence of that same combination of morphological features and its next most recent occurrence. Finally, (7) since some evidence suggests that the temporal gap between an item's first and second occurrence is particularly important for retention (Karpicke & Roediger, 2007), we also labeled each instance with the log(days) that elapsed between the first and second occurrence of the token's stem (lagTr1Tr2).
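As an illustration of the repetition and lag features, here is a minimal sketch that labels each event with nthOccurrence, tokenLag1, and tokenLag2. The event layout (user, token, time in days) and the helper names are assumptions for illustration. How zero-length gaps were handled is not stated above; returning None for them (since log(0) is undefined) is our own choice.

```python
import math


def _log_gap(later, earlier):
    """log(days) between two time stamps, or None if the gap is not positive."""
    gap = later - earlier
    return math.log(gap) if gap > 0 else None


def add_token_lags(events):
    """Label (user, token, day) events with repetition and lag features.

    `day` is a time stamp expressed in days (hypothetical layout).
    """
    history = {}  # (user, token) -> earlier time stamps, oldest first
    rows = []
    for user, token, day in sorted(events, key=lambda e: e[2]):
        seen = history.setdefault((user, token), [])
        rows.append({
            "user": user,
            "token": token,
            "day": day,
            "nthOccurrence": len(seen) + 1,
            # current occurrence vs. the previous one
            "tokenLag1": _log_gap(day, seen[-1]) if len(seen) >= 1 else None,
            # previous occurrence vs. the one before it
            "tokenLag2": _log_gap(seen[-1], seen[-2]) if len(seen) >= 2 else None,
        })
        seen.append(day)
    return rows
```

The same function could in principle be reused for the stem and morphological-combination lags by passing the stem, or the tuple of morphological features, in place of the token.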

Categorical Features
Included in our classifier were a number of categorical features, each encoded as binary indicator variables distributed over a number of columns equal to the number of levels in the category. Importantly, our approach to modeling was constrained by limited computational power and memory, so we chose to include only categorical features with a relatively small number of levels, to reduce the dimensionality of the data. Those features were: part of speech (pos; 16 levels), countries (94 levels), session (3 levels), format (3 levels), and all of the morphological features available for each language (46 levels for learners of Spanish, 17 levels for learners of English, and 10 levels for learners of French). Client was also included, though we treated iOS and Android as equivalent, preserving only the distinction between web and mobile access to the Duolingo application (2 levels). Notably, the above listing omits two of the categorical features we considered of greatest potential value in predicting early learner errors: user (223 levels for learners of Spanish, 179 levels for learners of English, and 216 levels for learners of French; 618 total) and token (2116 levels for learners of Spanish, 1615 for learners of English, and 1682 for learners of French). Some users inevitably learn faster and make fewer errors than others, and some tokens are simply harder to learn on average. Instead of encoding these with dummy variables, we elected to replace the user feature with two continuous values, determined jointly by the user and the combination of the levels of the features format, session, and client for each instance: (1) the mean and (2) the variance of the error rate for that user under that combination of feature levels (userMeanError, userVarError, respectively), for a total of three values for each user.
Similarly, we replaced the token feature with (1) the mean and (2) the variance of the error rate for each combination of the features token, stem, format, and pos, creating four values per token. This approach allowed us to substantially reduce demands on computational resources while simultaneously capturing much of the predictive power that fully encoding each user and token would have provided. The particular features used to create means within user and token were chosen to maximize potential differences between accuracy in different modalities. Indeed, to foreshadow our results, these features each ranked among the most important for our random forest classifier.
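The replacement of a high-cardinality categorical feature with the mean and variance of the error rate can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used: the row layout, the function names, the use of population variance, and the fallback to the overall training statistics for unseen combinations are all our own choices, and the feature level values in the example are only illustrative.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def error_rate_stats(rows, key_fields):
    """Map each combination of `key_fields` to (mean, variance) of the 0/1 error label."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row["error"])
    return {key: (fmean(labels), pvariance(labels)) for key, labels in groups.items()}


def encode(rows, stats, key_fields, prefix, default):
    """Attach <prefix>MeanError / <prefix>VarError columns to copies of `rows`."""
    encoded = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        mean_err, var_err = stats.get(key, default)  # assumed fallback for unseen keys
        encoded.append({**row, f"{prefix}MeanError": mean_err, f"{prefix}VarError": var_err})
    return encoded


# Statistics are computed on training rows only, then applied to any split.
train = [
    {"user": "u1", "format": "listen", "session": "lesson", "client": "web", "error": 1},
    {"user": "u1", "format": "listen", "session": "lesson", "client": "web", "error": 0},
    {"user": "u1", "format": "reverse_translate", "session": "lesson", "client": "web", "error": 0},
    {"user": "u2", "format": "listen", "session": "practice", "client": "mobile", "error": 1},
]
user_keys = ("user", "format", "session", "client")
user_stats = error_rate_stats(train, user_keys)
labels = [r["error"] for r in train]
overall = (fmean(labels), pvariance(labels))
encoded = encode(train, user_stats, user_keys, "user", overall)
```

Whatever the number of users or tokens, this adds only two numeric columns per encoded feature, which is the source of the memory savings relative to one indicator column per level.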

Interactions
Several interactions between features were also coded into the model. Due to time constraints, only the following interactions were added: stemLag1 x stemLag2 and stemLag1 x stemLag2 x lagTr1Tr2, to capture spacing effects; lagTr1Tr2 x morphoComplexity and lagTr1Tr2 x morphoLag1, to capture lag differences between morphological features; format x prevFormat, to capture possible task switching effects; orthoNei x format, phonNei x format, and format x client, to capture differences due to listening vs. typing; and finally morphoComplexity x pos, as any complexity effect may be stronger for nouns and verbs than for function words.

Model
In order to focus on feature engineering, random forest techniques were chosen over gradient boosting, logistic regression or other classification techniques. The random forest classifier scales well to large datasets, is not particularly prone to overfitting problems, and requires less parameter tuning.
Random forest classifiers combine the outputs of multiple decision tree classifiers, each using random features at each decision, in order to generate one final prediction (Breiman, 2001). Each decision tree classifier split the data using a random subset of features (equal in number to the square root of the total number of features in this model). Each split of the data was again split along the other included features until the leaves of the tree contained only data points with the same label (i.e., only error or only no-error instances). For each learner population, we generated 1000 decision trees to generate predictions. Out-of-bag errors were used to estimate errors in training.
Figure 1 shows the top 20 importance scores for each language (out of an across-language total of 174 features or interactions). The importance score of a random forest model conveys the predictive power of a given feature relative to the other predictors. Color depicts which features were engineered and which were provided in the raw data. Full importance values for each language are listed in Appendix A, including the directionality of the relationship between each continuous feature and the error rate. For example, because userMeanError is higher on incorrect trials than correct trials, the directionality is considered positive.
Figure 1: Top 20 importance features grouped by to-be-learned language. Error bars represent standard deviation of the importance of each feature across decision trees. For categorical features, the importances of each level, and their variances (to generate standard deviations), were summed to calculate the overall importance and variability in importance, respectively.
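The aggregation of importances for dummy-encoded categorical features described in the Figure 1 caption can be sketched as below. The input layout, the column names, and the function name are hypothetical; the per-column importances and their standard deviations across trees are assumed to have been extracted from a fitted forest beforehand.

```python
import math
from collections import defaultdict


def aggregate_importances(level_stats, column_to_feature):
    """Sum per-column importances and variances up to their parent feature.

    level_stats: {column_name: (importance, std_across_trees)}
    column_to_feature: {column_name: parent_feature_name}; columns missing
    from the mapping are treated as their own feature.
    Returns {feature: (summed_importance, std_from_summed_variances)}.
    """
    total_importance = defaultdict(float)
    total_variance = defaultdict(float)
    for column, (importance, std) in level_stats.items():
        feature = column_to_feature.get(column, column)
        total_importance[feature] += importance
        total_variance[feature] += std ** 2  # variances add; standard deviations do not
    return {
        feature: (total_importance[feature], math.sqrt(total_variance[feature]))
        for feature in total_importance
    }


# Hypothetical values for two levels of pos and one continuous feature.
level_stats = {"pos_NOUN": (0.02, 0.03), "pos_VERB": (0.01, 0.04), "userTrial": (0.05, 0.01)}
mapping = {"pos_NOUN": "pos", "pos_VERB": "pos"}
summary = aggregate_importances(level_stats, mapping)
```

Summing variances in this way treats the importances of the individual levels as uncorrelated across trees, and summing importances favors features with many levels, a caveat that applies to any feature encoded over many columns.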

Results and Discussion
The mean and variance in error rate for each user (userMeanError and userVarError) were the most important features, indicating that each user's history was strongly predictive of their performance at test, and that the variability within each user was nearly as predictive as the difference between users.
Countries, the third most important feature, may have ranked third in all three languages because the importance measure was calculated by summing over each feature level, possibly overstating the value of that feature in total. Nevertheless countries may represent user background information not given in the dataset including their previous language experience (as a Portuguese speaking user from Brazil may be learning Spanish via English, but would likely make different errors than an English monolingual from Canada).
The next most important generated feature was the average time spent on each token within an exercise (timePerToken). This likely captures time spent on each exercise better because it accounts for the length of the exercise at the token level.
Next is userTrial, which was calculated simply as which learning instance a given user is on. This likely captures the experience a user accumulates with the language and perhaps the app more generally.
Next of note are the mean and variance in error rate for each token, showing that each token has some properties that capture difficulty. This is especially true for learners of French, as the importance of tokenMeanError is ranked fourth in French as compared to eighth in both English and Spanish.
The interaction between format and previous format shows that there is some cost associated with task switching, perhaps to a slightly higher degree in English and Spanish, as this feature did not quite rank among the top ten in French, but was surpassed in that language by the lag between the first two occurrences of a token's stem.
Finally, the various lag features that reflect recent experience, and many of their interactions, comprise the next most important features, indicating that spacing effects are generally predictive of errors in the overall model, the highest of these being the lag between the first two occurrences of a given token's stem (lagTr1Tr2). This is an important and potentially useful feature. A measure of this lag is easy to calculate and necessarily occurs early in learning, making it useful in predictions that are memory intensive and catered to particular users or tokens.
Overall, these features, and indeed many of the engineered features, improved the models over baseline, as seen in Table 1. This is particularly noteworthy considering user and token were removed in our model (and replaced with user- and token-level error rates), but were included in the SLAM baseline provided with the data. Indeed, the mean and variance across users and tokens account for ~25% of the importance across all languages. Though these features aggregate error rates in the training data, the metrics did not differ considerably when evaluated with the development data (AUROC = .824, .818, and .802 for English, French and Spanish respectively). This shows that aggregating is a feasible approach in cases where computational constraints prohibit the exact representation of important high dimensional categorical features. Notably, the within-user variability was an important

Future work
Due to the time-limited nature of this shared modeling task, considerable work remains to be done to both optimize the performance of this model and further understand the cognitive processes involved in early language learning. To improve the model, we would first refine the relative importance of the current features, by performing ablation tests and model comparisons; some of the current features play little to no role in improving model performance. Furthermore, many interactions in the current feature space, such as userMeanError x tokenMeanError, may be important predictors, given each individual feature's importance and given that each user's previous language experience will impact the difficulty associated with any given token. The spacing effect might likewise interact with individual user and token related information.
There is additionally much work to be done in quantifying the benefit of using user- and token-level error rates as opposed to dummy-encoded variables. While these features are a memory- and time-sensitive solution, we have not yet explored how much model performance is affected by this change relative to a dummy-encoded solution, how much time is saved, and how much data is required to achieve this performance.
Our approach focused on linguistic and cognitive features that are known in their respective literatures to impact learning, and so the bulk of our efforts were devoted to feature engineering. Future work will therefore dedicate more resources to model development. While in the present work only a random forest ensemble classifier was used to generate predictions, logistic regression, deep learning, and/or other modeling approaches may better suit this particular learning task, and should be thoroughly explored.
Finally, there are many more features that can be developed, including word embeddings of tokens and syntactic structure differences. Our work has scratched the surface of linguistic and cognitive theory that might be applied to modeling language learning, but the vast scientific literatures in those and other fields no doubt offer rich possibilities for new features. The relative contribution of all of these features and their interactions to machine learning models of error production is likely to greatly expand our knowledge of early second language learning.