Neural sequence modelling for learner error prediction

This paper describes our submission to the 2018 Duolingo Shared Task on Second Language Acquisition Modeling (SLAM), in which we apply two recurrent neural network sequence models, a sequence labelling model and a sequence-to-sequence model, to the prediction of future learner errors. We show that the two models capture complementary information, as combining them improves performance. Furthermore, the same network architecture and group of features can be used directly to build competitive prediction models in all three language tracks, demonstrating that our approach generalises well across languages.


Introduction
Most recent work on second language acquisition (SLA) has focused on intermediate-to-advanced learners in assessment settings, driven by a series of shared tasks (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013, 2014; Lee et al., 2015, 2016; Daudaravicius et al., 2016). The 2018 Duolingo Shared Task on Second Language Acquisition Modeling (SLAM) (Settles et al., 2018) targets early-stage learners and aims to provide personalised learning instruction. Participating teams are provided with transcripts from exercises submitted by learners over their first 30 days of learning on Duolingo, 1 which are annotated for token (word) level errors. The task is to predict what errors each learner will make in the future, based on their learning history. There are three language tracks in this shared task:
• en es: native Spanish speakers learning English;
• es en: native English speakers learning Spanish;
• fr en: native English speakers learning French.
Teams can either focus on a particular language track, or explore generalised models and features across all three languages.
Inspired by the success of neural sequence models in grammatical error detection and correction (Yuan and Briscoe, 2016;Rei and Yannakoudakis, 2016;Yannakoudakis et al., 2017;Schmaltz et al., 2017), we propose two recurrent neural network sequence models for this problem: sequence labelling and sequence-to-sequence modelling. We demonstrate the utility of these two models for the future learner error prediction task. We also provide evidence of performance gains by using an ensemble of these two models, suggesting that they are complementary to each other.
For model development, we focus on the English track only and language-specific features are introduced and studied. When it comes to official evaluation, two new prediction systems, one for the es en track and another for the fr en track, are built using the same network architecture and the same (hyper-)parameter setting, without tuning for new datasets or languages. Competitive results on all three language tracks show that our approach generalises well and might be used as a generic solution across different languages.
The remainder of this paper is organised as follows: Section 2 describes our approach and two neural sequence models in detail, Section 3 discusses the feature types that we exploit in our models, Section 4 reports our experiments and results on the development set for the en es track, Section 5 presents our official results on the test sets for all three language tracks. Finally, Section 6 provides conclusions and ideas for future work.

Approach
We introduce two models for the task of future learner error prediction: a sequence labelling model and a sequence-to-sequence model. The following sections describe these two models.

Neural sequence labelling
We treat error prediction as a sequence labelling problem. Similar to Yannakoudakis et al. (2017), we construct a bidirectional recurrent neural network for detecting future learner errors. Unlike their system, only error-free, correct sequences are fed into our model, and the goal is to predict where a learner is likely to make token-level errors based on their learning history. The model receives a sequence of tokens x = (x_1, x_2, ..., x_T) as input, and assigns a label y to each input token x. A bidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) is used to learn context-specific representations:

$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1})$

$\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t+1})$

$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$

where $\overrightarrow{h}_t$ is the hidden state of the forward-moving LSTM at time t, which reads the input sequence from the first token to the last; $\overleftarrow{h}_t$ is the hidden state of the backward-moving LSTM at time t, which reads the input sequence in reverse order; and $h_t$ is the concatenation of the two hidden states, capturing both historical and future sequential information.
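The bidirectional reading order described above can be sketched as follows. This is a toy illustration only: a scalar tanh recurrence with hypothetical weights stands in for the LSTM cell, whose gating we omit.

```python
import math

def rnn_pass(xs, w_x=0.5, w_h=0.3):
    # Simplified scalar recurrence standing in for an LSTM cell:
    # h_t = tanh(w_x * x_t + w_h * h_{t-1}), with h_0 = 0.
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_states(xs):
    fwd = rnn_pass(xs)                                   # left to right
    bwd = list(reversed(rnn_pass(list(reversed(xs)))))   # right to left
    # h_t is the concatenation [fwd_t; bwd_t], giving each position
    # access to both historical and future context.
    return list(zip(fwd, bwd))
```

Each output position thus carries one state computed from the prefix and one from the suffix of the sequence, which is exactly why the concatenated $h_t$ can inform a per-token prediction.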
A softmax output layer predicts the label distribution for each input token, given the whole input sequence x:

$P(y_t \mid x) = \mathrm{softmax}(W_o h_t)$

where $W_o$ is an output weight matrix. We optimise the model by minimising the categorical cross-entropy between the predicted label distributions and the gold labels:

$L = -\sum_{t=1}^{T} \log P(y_t \mid x)$

Sequence-to-sequence modelling

We utilise a sequence-to-sequence model with a soft attention mechanism similar to that of Yuan and Briscoe (2016), which contains a bidirectional LSTM encoder and an attention-based LSTM decoder. The encoder first reads and encodes an input sequence x = (x_1, x_2, ..., x_T) into hidden state representations h = (h_1, h_2, ..., h_T), in the same way as in our sequence labelling model (see Section 2.1). The decoder then generates an output sequence y = (y_1, y_2, ..., y_T) 2 by predicting the next token y_t based on the input sequence x and all the previously generated tokens {y_1, y_2, ..., y_{t-1}}:

$P(y_t \mid \{y_1, ..., y_{t-1}\}, x) = \mathrm{softmax}(W_o s_t)$

where $W_o$ is a decoder output weight matrix, and $s_t$ is the hidden state of the LSTM decoder at decoding time t:

$s_t = \mathrm{LSTM}(y_{t-1}, s_{t-1}, c_t)$

Here $c_t$ is the input sequence representation used for predicting the output token $y_t$, calculated with a soft attention mechanism:

$c_t = \sum_{j=1}^{T} \alpha_{tj} h_j$

The weight $\alpha_{tj}$ is computed with a softmax function:

$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}$

A feedforward neural network is used to represent the energy function:

$e_{tj} = v_\alpha^\top \tanh(W_\alpha s_{t-1} + U_\alpha h_j)$

where $W_\alpha$ and $U_\alpha$ are attention weight matrices and $v_\alpha$ is a weight vector.
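The soft attention step (a softmax over energies, followed by a weighted sum of encoder states) can be sketched as follows. The `energies` argument stands in for the output of the feedforward energy function, which we do not reproduce here.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of real-valued scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(encoder_states, energies):
    # alpha_tj = softmax(e_tj); c_t = sum_j alpha_tj * h_j
    alphas = softmax(energies)
    dim = len(encoder_states[0])
    context = [sum(a * h[i] for a, h in zip(alphas, encoder_states))
               for i in range(dim)]
    return alphas, context
```

With equal energies the weights are uniform and the context vector is simply the mean of the encoder states; higher-energy positions pull the context towards their hidden states.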

Feature space
Besides the original word tokens, new features (in the form of discrete labels) are introduced, which provide additional exercise and learner information. These features are described briefly below.

Exercise-level feature set
• user: a unique identifier for each learner;
• format: the exercise format (reverse translate, reverse tap, or listen); 3
• session: the exercise session type (lesson, practice, or test); 4
• client: the learner's device platform (android, ios, or web);
• country: the country from which the learner submitted the exercise.

Language-specific feature set
CEFR word level: The Common European Framework of Reference (CEFR) (Council of Europe, 2011) describes what language learners can do at different stages of their learning, and defines language proficiency in six levels: A1, A2, B1, B2, C1 and C2, with A1 being the lowest and C2 the highest. These six CEFR levels can be grouped into three broad levels: basic (A1 and A2), independent (B1 and B2) and proficient (C1 and C2).
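The six-to-three level grouping can be expressed as a small lookup table (a hypothetical helper for illustration, not part of the shared task code):

```python
# Map each fine-grained CEFR level to its broad group.
CEFR_GROUPS = {
    "A1": "basic", "A2": "basic",
    "B1": "independent", "B2": "independent",
    "C1": "proficient", "C2": "proficient",
}

def broad_level(cefr_level):
    # Accept lower- or upper-case level names, e.g. "b1" or "B1".
    return CEFR_GROUPS[cefr_level.upper()]
```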
The CEFR levels for all the English words appearing in the dataset are extracted from the English Vocabulary Profile (EVP), 5 which is based on the 50-million-word Cambridge Learner Corpus (CLC) and the 1.2-billion-word Cambridge English Corpus (CEC). The EVP is a free online vocabulary resource that contains information about which words and phrases are known and used by learners at each CEFR level (Capel, 2012).
Even though we only focus on English words here, it is worth noting that the CEFR framework itself is not specific to English.

CLC error rate: We collect error rate information from the CLC, a large annotated corpus of learner English developed by Cambridge University Press and Cambridge English Language Assessment since 1993 (Nicholls, 2003). It comprises examination scripts written by learners of English who took Cambridge English examinations around the world, covering over 80 L1s and representing all six CEFR levels.
Two criteria are applied to create two sub-corpora:
• CLC(KET): contains examination scripts for A2 Key, formerly known as Cambridge English: Key (KET), 6 and A2 Key for Schools, formerly known as Cambridge English: Key for Schools (KETfS). 7
KET is the lowest-level General English examination in the Cambridge English range, targeting the A2 level. KETfS is at the same level as KET, but its examination content is targeted at the interests and experiences of schoolchildren.
• CLC(ES): contains examination scripts written by native speakers of Spanish, which account for around 24.6% of the non-native speakers represented in the CLC.
For every word w, an error rate E(w) is defined as:

$E(w) = \frac{\mathrm{count}(s \neq w,\ t = w)}{\mathrm{count}(t = w)}$

where $\mathrm{count}(t = w)$ is the number of times the word w is seen in the target side (i.e. the corrected version) of the corpus, and $\mathrm{count}(s \neq w,\ t = w)$ is the number of times any word except w in the source side (i.e. the original version) has been corrected to the word w in the target side.
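Under the simplifying assumption of token-aligned source (original) and target (corrected) sides, E(w) could be computed along these lines; the real CLC annotation is richer than a one-to-one token alignment.

```python
from collections import Counter

def error_rates(parallel_corpus):
    # parallel_corpus: list of (source_tokens, target_tokens) pairs,
    # assumed aligned token by token (a simplification).
    target_counts = Counter()     # count(t = w)
    corrected_counts = Counter()  # count(s != w, t = w)
    for source, target in parallel_corpus:
        for s, t in zip(source, target):
            target_counts[t] += 1
            if s != t:
                corrected_counts[t] += 1
    # E(w) = count(s != w, t = w) / count(t = w)
    return {w: corrected_counts[w] / n for w, n in target_counts.items()}
```

For example, a word that only ever appears as a correction of some other word gets E(w) = 1, while a word never involved in a correction gets E(w) = 0.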
We compute E(w) from the CLC, CLC(KET) and CLC(ES), and use the resulting error rates to create two new features, CLC-KET and CLC-ES.

All the exercise-level and token-level features are extracted directly from the metadata and pre-processed data provided by the shared task organisers. The language-specific features are only generated for the English data, to be used in the en es track.

Dataset and evaluation
The shared task dataset comprises answers submitted by more than 6,000 Duolingo users over the course of their first 30 days of learning. Token-level binary labels are provided:

Correct reference: She is my mother
Learner answer:    She is mader
Labels:            0   0  1  1

Matched tokens are given the label '0'; missing or misspelt tokens (ignoring capitalisation, punctuation and accents) are given the label '1' to indicate an error. Only the correct references and their label sequences are provided, not the learners' original responses. Therefore, in our experiments, we map each correct reference to its label sequence.
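A minimal sketch of this labelling scheme, under the simplifying assumption that we only check normalised membership rather than performing a proper alignment (so word order and duplicates are ignored):

```python
import string
import unicodedata

def normalise(token):
    # Ignore capitalisation, punctuation and accents, as in the task.
    token = unicodedata.normalize("NFKD", token)
    token = "".join(c for c in token if not unicodedata.combining(c))
    return token.lower().strip(string.punctuation)

def label_tokens(reference, answer):
    # A reference token gets '0' if its normalised form appears
    # anywhere in the learner's answer, otherwise '1' (error).
    answered = {normalise(t) for t in answer}
    return [0 if normalise(t) in answered else 1 for t in reference]
```

On the example above, "my" is missing and "mother" is misspelt as "mader", so both receive the label 1.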
The dataset is partitioned sequentially into training, development and test sets, which all contain the same group of learners. The training set contains the first 80% of the sessions for each learner, followed by the next 10% for development and the final 10% for testing. Each learner's test items are subsequent to their development items, which in turn are all subsequent to their training items.
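The per-learner sequential split described above can be sketched as a hypothetical helper (the organisers' exact splitting code is not reproduced here):

```python
def split_sessions(sessions):
    # Sequentially split one learner's sessions 80/10/10 into
    # train/dev/test, mirroring the shared task partition: test
    # items follow dev items, which follow training items.
    n = len(sessions)
    train_end = int(n * 0.8)
    dev_end = int(n * 0.9)
    return sessions[:train_end], sessions[train_end:dev_end], sessions[dev_end:]
```

Because the split is chronological rather than random, models are always evaluated on a learner's future behaviour, never on interleaved past sessions.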
During development, we focus on learners of English. The training set provided for the en es track contains 2,622,958 tokens in 824,012 sentences; only 13% of the tokens are labelled '1'. The development set comprises an additional 387,374 tokens in 115,770 sentences. All the data has been pre-processed by the shared task organisers using the Google SyntaxNet dependency parser. 8
System performance is evaluated in terms of area under the ROC curve (AUROC) and F1 (with a threshold of 0.5).
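Both metrics can be computed with short self-contained helpers. The AUROC sketch below uses the rank-statistic (Mann-Whitney U) formulation and, for simplicity, ignores tied scores:

```python
def auroc(labels, scores):
    # Rank-based AUROC; assumes no tied scores and at least one
    # positive and one negative label.
    ranked = sorted(zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    rank_sum = sum(i + 1 for i, (_, y) in enumerate(ranked) if y == 1)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)

def f1_at_threshold(labels, scores, threshold=0.5):
    # Binarise scores at the threshold, then compute F1 on the
    # positive (error) class.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```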

Training
All our models are built using OpenNMT (Klein et al., 2017). For the sequence labelling model, our training procedure is similar to that of Yannakoudakis et al. (2017); for the sequence-to-sequence model, we follow Yuan and Briscoe (2016). We set the source and target word embedding sizes, as well as the LSTM hidden layer size, to 750. We do not limit the vocabulary size or the maximum sentence length, as both are small enough to train effectively. The new features defined in Section 3 are added to the models incrementally, and results are presented in the next section.

Results
Evaluation results on the development set for the en es track are reported. We also include a baseline model provided by the shared task organisers for comparison purposes. The baseline model uses L2-regularised logistic regression, trained with stochastic gradient descent (SGD) weighted by frequency (Settles et al., 2018).
Sequence labelling model Results for the sequence labelling models are presented in Table 1; all our models outperform the baseline (Table 1 #0). We start by adding exercise-level features incrementally (Table 1 #1-5). Introducing new exercise-level features yields consistent improvements in overall performance, and the model trained on all our exercise-level features gives the best AUROC and F1 scores (Table 1 #5).
Token-level features (Table 1 #6-7) and language-specific features (Table 1 #8-10) are then added to the current best model. However, none of them yields further gains. A closer inspection of the training data reveals a number of cases where the POS and DEP tags provided by the shared task organisers are unreliable. Since we use the tags in the dataset directly, without cleaning the noisy data or re-processing it, it is not surprising that adding these features hurts performance.
In terms of the language-specific features, we also notice that the CEFR word level feature is not very informative as not all the words in the dataset are also in the EVP; and for words that are, most of them turn out to be at either A1 or A2 level.
Sequence-to-sequence model We follow the same training procedure to build sequence-to-sequence models (see Table 2). Similar results are observed: all our models perform better than the baseline (Table 2 #0); exercise-level features contribute to the overall performance improvements (Table 2 #1-5); and token-level and language-specific features appear to be detrimental, bringing performance down (Table 2 #6-10). The best sequence-to-sequence model uses all the exercise-level features, achieving an AUROC score of 0.837 and an F1 score of 0.464 (Table 2 #5).
Combining two sequence models Our best sequence labelling model (seqlabel) and our best sequence-to-sequence model (seq2seq) achieve the same AUROC score of 0.837; seqlabel, however, yields a better F1 score of 0.480, compared to 0.464 for seq2seq.
We further combine these two best models using linear interpolation:

$P = (1 - \lambda)\, P_{\mathrm{seqlabel}} + \lambda\, P_{\mathrm{seq2seq}}$

where $P_{\mathrm{seqlabel}}$ is the score from the sequence labelling model, $P_{\mathrm{seq2seq}}$ is the score from the sequence-to-sequence model, and $\lambda$ is a parameter that controls the impact the sequence-to-sequence model has on the final score. After tuning $\lambda$ on the development set, we set it to 0.5.
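This interpolation amounts to a one-line combination of the two models' per-token scores:

```python
def combine(p_seqlabel, p_seq2seq, lam=0.5):
    # Linear interpolation of per-token error scores from the
    # sequence labelling and sequence-to-sequence models.
    return [(1 - lam) * a + lam * b for a, b in zip(p_seqlabel, p_seq2seq)]
```

With lam = 0.5 the combined score is simply the average of the two models' scores; lam = 0 recovers the sequence labelling model alone.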
Results of our best individual models and the final combined model are reported in Table 3. We can see that the combined model yields the overall best results, which suggests that our two individual neural sequence models capture complementary information even though they are both trained on the same group of features.

Official evaluation results
Our submissions to the shared task are the results of our best systems. As each participating team is allowed to submit up to 10 runs, we first run our best sequence labelling, sequence-to-sequence and combined systems from the previous section on the en es test set.
Having determined that our language-specific features are not helpful, we train new models for the es en and fr en tracks using the same network architecture and the same group of features as for en es. No tuning of (hyper-)parameters is performed for the new datasets or languages. The official results of our submissions for all three language tracks are reported in Table 4. Results on the en es test set are similar to those on the en es development set (see Table 3); no significant drop is observed. The combined model produces the best overall performance, and the seqlabel model outperforms the seq2seq model.
In the fr en track, the combined model again yields the highest AUROC and F1 scores, followed by the seq2seq model and then the seqlabel model. Our es en seq2seq model had not finished training by the shared task submission deadline, so we only submitted the es en seqlabel model. Based on the results for the other two language tracks, we expect that our es en results could be further improved by combining a seqlabel model with a seq2seq model.

Conclusions and future work
In this paper, we have described the use of recurrent neural sequence labelling and sequence-to-sequence models for future learner error prediction. We have provided evidence of further performance gains from combining them, showing that these two types of sequence models are complementary. We have also explored different types of features, which capture exercise-level, token-level and language-specific information. Furthermore, we have demonstrated that the same network architecture and group of features can be applied directly to build competitive prediction systems for all three languages, without the need for language-specific parameter tuning.
Results of our best systems on the official test sets yield: AUROC=0.841 (ranked sixth out of the fifteen participating teams) and F1=0.479 (ranked third) for the en es track; AUROC=0.835 (ranked sixth) and F1=0.508 (ranked third) for fr en; and AUROC=0.807 (ranked sixth) and F1=0.435 (ranked fifth) for es en.
Plans for future work include combining the training and development sets to train new models, using better-quality token-level features, and exploring other exercise-level features, such as the amount of time the learner took to construct and submit their answer and the number of days since the learner started using Duolingo. We would also like to test our approach, as well as our language-specific features, on a broader scale (i.e. using corpora which cover language learners at different levels, ideally ranging from basic to proficient).

Table 4: Official results of our submitted systems on the test sets for all three tracks: seqlabel is our best sequence labelling model, seq2seq is our best sequence-to-sequence model, and combined is the combination of these two models. For comparison, we also include the baseline results provided by the shared task organisers and the results of the top-performing systems.
Acknowledgements
and Ahmed Zaidi for providing us with the CLC and EVP resources. Our gratitude goes also to the shared task organisers for coordinating the task. We acknowledge NVIDIA for an Academic Hardware Grant.