Predicting Second Language Learner Successes and Mistakes by Means of Conjunctive Features

This paper describes the system developed by the Centre for English Corpus Linguistics for the 2018 Duolingo SLAM challenge. It aimed at predicting the successes and mistakes of second language learners on each of the words that compose the exercises they answered. Its main characteristic is to include conjunctive features, built by combining word ngrams with metadata about the user and the exercise. It achieved a relatively good performance, ranking fifth out of 15 systems. Complementary analyses carried out to gauge the contribution of the different sets of features to the performance confirmed the usefulness of the conjunctive features for the SLAM task.


Introduction
This paper presents the participation of the Centre for English Corpus Linguistics (CECL) in the 2018 Duolingo shared task on Second Language Acquisition Modeling (SLAM), which was held in conjunction with the 13th Workshop on Innovative Use of NLP for Building Educational Applications. The objective of the task is to build a model to predict whether second language learners will make a mistake on each of the words (tokens) that compose the exercises they answered. There were three tracks: English speakers learning Spanish (es en), Spanish speakers learning English (en es) and English speakers learning French (fr en).
To develop the model, the organizers of the challenge made available a very large number of exercises carried out by a large number of learners on Duolingo, a free online language-learning platform that has attracted more than 200 million learners since its launch in 2012 (see Settles et al. (2018) for details). In this training set, the tokens on which each learner made a mistake were marked, but the error itself was not provided. This task is thus very different from the one at the root of many applications of natural language processing in the field of education that aim to automatically evaluate texts produced by second language learners (Weigle, 2013). The traditional approach for the latter, which relies on linguistic indices more or less strongly correlated with text quality such as lexical richness, syntactic complexity and especially the presence of errors of different types (e.g., Burstein et al., 2004; Futagi et al., 2008; Yannakoudakis et al., 2011; Santos et al., 2012; Ramineni and Williamson, 2013; Somasundaran et al., 2015; Bestgen, 2016, 2017), is obviously not applicable to the SLAM challenge. Compared to the automatic evaluation of learner texts, the SLAM task has several advantages (+), but also several disadvantages (-):
+ Each learner produced a relatively large number of responses, making it possible to estimate his or her level of competence;
+ The learners' responses are spaced out in time, making it possible to model the evolution of their competence throughout their learning;
+ The same exercises were presented to a large number of different learners, making it possible to get a relatively good estimate of the difficulty of each exercise;
- The exercises are very short, as 99% of the utterances consist of no more than six tokens, which strongly limits the linguistic context available for any NLP procedure;
- Above all, as indicated above, the prompt to be processed by the learner is provided, but not the actual answer.
As previous CECL research in this field dealt with automatic evaluation and only partially took into account the temporal dimension of learning (Bestgen and Granger, 2014), I chose to break the problem down into two steps:
• Try to get the best prediction without using the sequential information available in the dataset.
• Add the sequential information and see whether it can improve the prediction.
As the second step was not successful, this report focuses on the first. The system is therefore not really an attempt to model second language acquisition, but to predict the successes and mistakes of second language learners. It can be seen as a baseline system since it does not take into account the richest information made available.
The developed system achieved relatively good performance, ranking fifth out of 15 systems, but nevertheless at a respectable distance from the best systems. Its main characteristic is to include conjunctive features, built by combining several primitive features. In machine learning, such conjunctive features are classically obtained by means of a polynomial kernel, but this greatly lengthens the time needed to learn the model (Fan et al., 2008; Yoshinaga and Kitsuregawa, 2012). It was more efficient to build them manually and to use a (much faster) linear approach to learn the model.
The remainder of this report describes the datasets made available for this challenge, the system developed and the results obtained, as well as the analyses performed to get a better idea of the usefulness of the various components of the system.

Data
As explained in Settles et al. (2018), each instance to be categorized corresponded to a token of an exercise that has been presented to a user in one of three possible types of exercise, in one of three possible types of session and at a given time of his or her participation in the learning activities of the Duolingo platform. Several other metadata were provided for each exercise such as the country from which a user had done it. For each token, a series of morpho-syntactic features were also provided. The datasets were very large.
The fr en dataset, which was by far the smallest, contained more than 410 000 exercises and almost 1 200 000 tokens. The other datasets were approximately 2.12 times (es en) and 2.83 times (en es) larger.
These datasets were divided by the organizers into three sets, the TRAIN set with 80% of the data, the DEV set with 10% and the TEST set with the remaining 10%. The final results of the challenge were determined by the organizers on the TEST set. In this report, all the developments that led to the predictive models were done only on the fr en dataset because its smaller size allowed the fastest processing. They were based on the TRAIN set to build the models and on the DEV set for evaluation.

Main Features Used
A quick glance at the exercises undertaken by students during their first 30 days of learning with the Duolingo platform (Settles et al., 2018) suggested that they were relatively simple from a lexical and syntactic point of view, so I chose to base the features on the tokens and to disregard the morpho-syntactic information.
Each instance (i.e., a token in an exercise) was encoded as a vector of 47 binary features, consisting of the following three feature sets:
• The main part (5 features) was composed of the target token and the tokens (T) that surround it in the exercise. For a token such as "pas" (not) in the exercise "Ce n' est pas un sandwich" (This is not a sandwich), the following five features were encoded: the trigram including the two tokens that precede it (n' est pas), the bigram including the token that precedes it (est pas), the token itself (pas), the bigram including the next token (pas un) and the trigram including the two following tokens (pas un sandwich) 1 . When an ngram is incomplete because the token is too close to the beginning or the end of the exercise, the missing element is replaced by the pseudo-token "<s>".
• The second set of features (7 features) was based on three metadata: the unique identifier for each student (U), the exercise format (F: three different values), and the session type (S: three different values). These features were encoded alone and in conjunction, producing the following features: U, F, S, UF, US, FS and UFS.
• Finally, the conjunction of each token feature 2 with each of the metadata features, such as n' est pas UFS, was encoded (35 features).
Each different type of feature was prefixed with a unique character sequence to avoid any collision between features of different types. Of the 47 features used to encode each instance, some were very common in the dataset, such as the format, the session and their conjunctions; others were moderately frequent, such as a user id or a token; but the majority were much rarer, such as the conjunction of a user, a format, a session and a trigram.
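The encoding described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual SAS code; the prefix strings and function names are assumptions.

```python
def token_ngram_features(tokens, i):
    """Five token-based features for the token at position i, padding
    incomplete ngrams with the pseudo-token '<s>'."""
    pad = ["<s>", "<s>"] + tokens + ["<s>", "<s>"]
    j = i + 2  # position of the target token in the padded list
    return [
        "T3L:" + " ".join(pad[j - 2:j + 1]),  # trigram ending on the token
        "T2L:" + " ".join(pad[j - 1:j + 1]),  # bigram ending on the token
        "T1:" + pad[j],                       # the token itself
        "T2R:" + " ".join(pad[j:j + 2]),      # bigram starting on the token
        "T3R:" + " ".join(pad[j:j + 3]),      # trigram starting on the token
    ]

def metadata_features(user, fmt, session):
    """The 7 metadata features: U, F, S and their conjunctions."""
    return ["U:" + user, "F:" + fmt, "S:" + session,
            "UF:" + user + "_" + fmt, "US:" + user + "_" + session,
            "FS:" + fmt + "_" + session,
            "UFS:" + user + "_" + fmt + "_" + session]

def encode_instance(tokens, i, user, fmt, session):
    """Full 47-feature vector: 5 token ngrams + 7 metadata features
    + 35 conjunctions of each token ngram with each metadata feature."""
    t = token_ngram_features(tokens, i)
    m = metadata_features(user, fmt, session)
    tm = [a + "|" + b for a in t for b in m]  # 5 x 7 = 35 conjunctions
    return t + m + tm
```

For the token "pas" in "Ce n' est pas un sandwich", `encode_instance` produces exactly 47 binary feature names, with the type prefixes (`T1:`, `UF:`, ...) playing the role of the collision-avoiding character sequences mentioned above.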

Sequential Information Use
Each feature that included a target token and had been previously seen by a user was duplicated with a new value reflecting the number of times the user had seen it, the proportion of mistakes he or she made on it, and the time elapsed since he or she had last seen it. These values were transformed by means of an exponential 3 function. No further details are given on these features because they proved very inefficient, as shown in the analyses reported below.
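The exact transform is not specified in the paper; a plausible form for such an exponential mapping, which compresses counts and delays into the interval (0, 1], is sketched below. The function name and the decay rate are purely illustrative assumptions.

```python
import math

def exp_decay(x, rate=0.1):
    """Map a non-negative count or delay x into (0, 1] with an
    exponential function; `rate` is an arbitrary illustrative value."""
    return math.exp(-rate * x)
```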

Procedure to Build the Models
The feature extraction was performed by means of a series of custom SAS programs running in SAS University Edition (freely available for research at http://www.sas.com/en us/software/universityedition.html). The predictive models used during the development phase were built on the fr en dataset by means of the L1-regularized logistic regression (L1-LR) available in the LIBLINEAR package (-s 6, Fan et al., 2008). The only metaparameter that could be optimized was the regularization parameter C. A series of tests carried out on the TRAIN and DEV fr en sets led to setting it to 0.75. The L1-LR with this same C parameter was used in all the analyses reported here, except for the models used for the final submission, which were built by means of the L2-regularized logistic regression (-s 7, L2-LR) because it appeared while preparing the submission that it produced slightly higher performances.
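An equivalent pipeline can be sketched with scikit-learn, whose `LogisticRegression` wraps the same LIBLINEAR solver; the paper called LIBLINEAR directly, so the vectorization step and the toy data below are assumptions for illustration only.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy instances: each is a dict of binary (string-named) features,
# mimicking the prefixed feature names described above.
X_dicts = [{"T1:pas": 1, "F:reverse_translate": 1},
           {"T1:un": 1, "F:reverse_tap": 1}]
y = [1, 0]  # 1 = the learner made a mistake on the token

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)  # sparse binary design matrix

# L1-regularized logistic regression via LIBLINEAR, C = 0.75 as in the paper
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.75)
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # predicted probability of a mistake
```

Switching to the L2-regularized variant used for the final submission amounts to `penalty="l2"` with the track-specific C values given below.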

Analyses and Results
All the performances are summarized in terms of the area under the receiver operating characteristic curve (AUROC), the challenge's main evaluation metric. The F1 score was also proposed as a secondary metric by the challenge organizers, but it is not reported here because no attempt was made to optimize it 4 .
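For reference, AUROC can be computed directly from the predicted mistake probabilities; the tiny example below (data values are illustrative) uses scikit-learn's implementation.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # 1 = actual mistake
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted mistake probabilities
auroc = roc_auc_score(y_true, y_score)
# 3 of the 4 (positive, negative) pairs are correctly ordered -> 0.75
```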
In the tables presented below, T stands for the Token ngrams, M for the Metadata, with U for User, F for Format and S for Session, Mc for the conjunctive features derived from the metadata and TM for the conjunctive features derived from the token ngrams and the metadata.

Performance on the Test Set
The performance and ranking of the base model and of the model that takes into account the sequential information are given in Table 1, along with the performances of the system ranked first, those of the two closest teams in the ranking and those of the baseline provided by the organizers. As a reminder, the proposed models were developed for the fr en dataset and simply applied to the two other tracks. For the three tracks, the regularization parameter C for the L2-LR was set on the basis of the TRAIN and DEV sets at the following values: 0.10 for fr en and es en and 0.05 for en es. The final models were learned on the concatenated TRAIN and DEV sets.
The performances of the proposed models were significantly better than the baseline, but not as good as the best system. They were lower than those of the team ranked fourth in two tracks, but higher in the fr en track, on the basis of which they were developed. The benefits brought by using the sequential information were very small, probably because the procedure employed did not introduce new features, but duplicated a number of them with different values.

In-depth Analysis of the Feature Sets
The remainder of this report analyzes in detail the contribution of the different sets of features to the performance of the base model. All these analyses were conducted on the TRAIN and DEV fr en dataset as explained above. First, the ablation approach was used to assess the independent contribution of each set of features to the overall performance of the system. It consists of removing one or more sets of features from the model and re-evaluating it.
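The ablation procedure can be sketched generically as follows; the function names and the dictionary interface are illustrative assumptions, not the paper's actual code.

```python
def ablation(feature_sets, train_fn, eval_fn):
    """Drop one feature set at a time, retrain, and report the AUROC
    drop relative to the full model.

    feature_sets: dict mapping a set name (e.g. 'T', 'M', 'TM') to its
    list of features; train_fn builds a model from a feature list;
    eval_fn returns that model's AUROC on the DEV set."""
    all_feats = [f for fs in feature_sets.values() for f in fs]
    full_auroc = eval_fn(train_fn(all_feats))
    drops = {}
    for name in feature_sets:
        kept = [f for n, fs in feature_sets.items() if n != name
                for f in fs]
        drops[name] = full_auroc - eval_fn(train_fn(kept))
    return drops
```

A large drop for a given set name indicates that the corresponding features carry information not recoverable from the remaining sets.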
As Table 2 shows, the conjunctive features, including those built from the metadata alone, made a significant contribution to performance. The model that only includes the token ngrams clearly underperformed. The metadata are thus necessary to achieve an acceptable performance.
A second analysis was conducted to evaluate the impact of the three lengths of ngrams in the base model (Table 3). The results indicated that the trigrams were not very useful, contrary to the bigrams.
To get a better idea of the usefulness of the conjunctive features, Table 4 presents the number of features of each type to which the L1-LR assigned a non-zero weight (Andrew and Gao, 2007). It also indicates how many of these features were present in the DEV set. This table shows that the conjunctive features, including the more complex ones, were frequently selected by the L1-LR and that a non-negligible proportion of them were present in the DEV set. These are of course the types that encompassed the largest number of different features.
However, an ablation approach on these feature subtypes suggests that many conjunctive features are not truly essential, as shown in Table 5. The first row of the table reports the performance of the base model. The second section shows that the conjunctions of four and of three types of features are not necessary for achieving this performance. The third section indicates that it is the conjunctive features combining the tokens and the exercise format that make the most important contribution (see below for examples). With regard to the conjunctive features based on the metadata only, UF (alone or with Session in UFS) is the most useful. The last line of the table corresponds to the model without conjunctive features (except the token ngrams). Overall, it appears that the Session features contributed little.

All these observations confirm the interest of some of the conjunctive features for the SLAM task, the token ngrams being a specific type of conjunctive features whose usefulness is well established in NLP. Their interest can be illustrated concretely by the two following examples. In the fr en TRAIN set, users made 78% of errors on the token "-" when it is preceded by the token "après" (after), forming the bigram "après -" (N = 198) found in "après-midi" (afternoon). This overall percentage hides a large difference between the reverse-tap exercises (N = 91), on which 100% of errors were made, and the reverse-translate exercises (N = 51), in which 49% of errors were made. The opposite profile is observed for the bigram "Vous connaissez" (You know), whose target token is "connaissez", for which there were in general 66% of errors (N = 73). When presented in the reverse-translate format, there were 94% of errors (N = 48), while there were only 9% of errors in the reverse-tap format (N = 22).

Conclusion
The base model presented in this paper does not take into account the longitudinal nature of the data made available by the organizers. Despite this, it achieved relatively high performances, ranking fifth out of 15 teams with an average of 0.016 AUROC point less than the best team, while outperforming nine teams by more than 0.016 AUROC point. It must however be recognized that the inclusion of longitudinal information in this approach was inefficient. A psycholinguistically motivated approach would probably have produced better results (Settles and Meeder, 2016). The papers of the best teams participating in this challenge should make it possible to determine whether they used non-sequential features identical or similar to those used here. If not, it might be interesting to determine whether the conjunctive features used here could further improve their systems' performances.
It would also be interesting to look at other metadata provided by the organizers. In particular, the country from which a user did the exercises could perhaps make it possible to take into account L1 transfer, which is known to affect the type of errors produced by learners of a foreign language (Wong and Dras, 2009; Jarvis et al., 2013).
In a future edition of the challenge, it might be interesting to include in the test set a larger proportion of tokens that do not appear (or appear very rarely) in the training set and to carry out part of the evaluation separately on those tokens. In the current datasets, only 116 of the 1 920 different tokens present in the fr en TEST set were absent from the TRAIN and DEV sets. Moreover, these 116 different tokens represented only 0.12% of the instances to categorize (168 out of 135 525). It should be noted that the datasets included a sizable proportion of rarely seen tokens (i.e., 27% of the different tokens in the fr en TRAIN and DEV sets were present at most 3 times), but they represented only a very small fraction of the TEST set (less than 0.5%). Increasing the proportion of new or infrequently seen tokens in the test materials could favor the use of features that generalize to unseen tokens. If this path is followed, it could be interesting to provide, in the training datasets, the exercises and the mistakes actually produced to further the development of predictive models that try to figure out the relation between a token and the mistake (while providing only the exercises for the test material to avoid the use of simple error detection systems).