TMU System for SLAM-2018

We introduce the TMU systems for the Second Language Acquisition Modeling (SLAM) shared task 2018 (Settles et al., 2018). To model learner error patterns, it is necessary to maintain a considerable amount of information about the types of exercises learners have solved in the past and the manner in which they answered them. Tracking a learner's extensive learning history, including their correct and mistaken answers, is essential to predict the learner's future mistakes. Therefore, we propose a model that tracks the learner's learning history efficiently. Our systems ranked fourth in the English and Spanish subtasks and fifth in the French subtask.


Introduction
Second language acquisition modeling (SLAM) is an interesting research topic in the fields of psychology, linguistics, and pedagogy, as well as engineering. Popular language learning applications such as Duolingo accumulate learners' data on a large scale; thus, there has been increasing interest in applying machine learning to SLAM using such data. In this study on SLAM, we aim to clarify both (1) the inherent nature of second language learning and (2) effective machine learning / natural language processing (ML/NLP) engineering strategies for building personalized adaptive learning systems.
In order to predict a learner's future mistakes, it is important to track a long history of which exercises were solved by that learner and how, and to be able to model it (Piech et al., 2015; Khajah et al., 2014, 2016). Therefore, we propose a model that can efficiently track a learner's learning history.

Figure 1: An exercise example. The given exercise is the "correct" input; the output is "1" each time the learner makes a mistake.

2018 Duolingo Shared Task on SLAM
We used data from Duolingo in this shared task. Duolingo is the most popular online language-learning application. Learners solve exercises, and this shared task uses only three types of exercises. Exercise (a) is a reverse translate item, in which learners translate a written prompt from the language they know into the language they are learning. Exercise (b) is a reverse tap item, in which learners construct an answer from a set of words and distractors in the second language. Exercise (c) is a listen item, in which learners listen to and transcribe an utterance in the second language. The shared task provides exercise data for the following three groups of second language learners:

• English learners (who already speak Spanish)
• Spanish learners (who already speak English)
• French learners (who already speak English)

The Duolingo data set, which contains more than 2 million annotated words, is created from the answers submitted by more than 6,000 learners during their first 30 days. In these exercises, learners answer questions in the second language they are learning; thus, they inevitably make various mistakes along the way. In this task, we predict mistakes at the word level given an exercise. Figure 1 shows an exercise example. Given a "correct" exercise as input, a system has to predict labels as output. In general, most tokens are perfect matches; the remaining tokens are either missing or spelled incorrectly (ignoring capitalization, punctuation, and accents). The former are assigned the label "0" (OK), while the latter are assigned the label "1" (mistake).
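The labeling scheme above can be illustrated with a small sketch. This is a hypothetical simplification, not the organizers' official scoring code: each token of the correct answer is labeled 0 if the learner's answer contains it (ignoring capitalization) and 1 if it is missing or misspelled.

```python
def label_tokens(correct_tokens, learner_tokens):
    """Label each token of the correct answer: 0 (OK) if the learner
    produced it, 1 (mistake) if it is missing or misspelled.
    Capitalization is ignored, as in the shared task."""
    answered = [t.lower() for t in learner_tokens]
    labels = []
    for token in correct_tokens:
        t = token.lower()
        if t in answered:
            labels.append(0)
            answered.remove(t)  # each answer token matches at most once
        else:
            labels.append(1)
    return labels

# A learner asked to produce "I am a boy" writes "I am a boys":
print(label_tokens(["I", "am", "a", "boy"], ["I", "am", "a", "boys"]))
# → [0, 0, 0, 1]
```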

TMU System
To track a large amount of each learner's history, our proposed TMU system has two components: (1) a base component that predicts whether a learner has made a mistake on a given word in an exercise (Fig. 2, Prediction Bi-LSTM) and (2) a component that tracks learner-specific information about the exercises learned so far and the words that he or she may have mistaken (Fig. 2, History LSTM). By feeding the hidden states of the Prediction Bi-LSTM into the History LSTM, the system is expected to track the long history of learned exercises.
For prediction, we receive an exercise as input and make predictions at the word level. Using a Bi-LSTM for sequence labeling at the exercise level allows us to share information, such as POS tags or dependency edge labels, within each exercise for better prediction. We perform training by feeding input exercises arranged in chronological order for each learner. Table 1 lists all the features used by our system. We use features (1-7) included in the dataset distributed by the task organizers, as well as the tracking history (8) (Section 3.3) and labels for language identification (9). We trained a single model on the three languages, English, Spanish, and French, and used the language identification feature to distinguish them.
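The chronological arrangement of each learner's exercises can be sketched as follows. This is a minimal illustration with hypothetical field names; the actual dataset schema differs.

```python
from collections import defaultdict

def arrange_chronologically(exercises):
    """Group exercises by learner and sort each learner's exercises
    by the 'days' feature, so the model sees them in the order in
    which they were actually solved."""
    by_learner = defaultdict(list)
    for ex in exercises:
        by_learner[ex["user"]].append(ex)
    for user in by_learner:
        by_learner[user].sort(key=lambda ex: ex["days"])
    return by_learner

log = [
    {"user": "u1", "days": 2.5, "tokens": ["I", "am"]},
    {"user": "u1", "days": 0.1, "tokens": ["Hello"]},
    {"user": "u2", "days": 1.0, "tokens": ["Yo", "soy"]},
]
ordered = arrange_chronologically(log)
print([ex["days"] for ex in ordered["u1"]])  # → [0.1, 2.5]
```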

Features
There are three types of inputs to the Bi-LSTM. The first type comprises word-level features, which capture information that changes for each word in an exercise; in particular, the word surface and POS tag are used as word-level features. The second type comprises exercise-level features; in particular, days, session, format, time, and history. The third type comprises learner-level features; for these, the learner and language features are extracted for each learner.

Prediction Bidirectional LSTM
We used a bidirectional LSTM (Bi-LSTM) to predict whether a learner has mistaken each word in an exercise. The $k$-th word and POS tag of the $j$-th exercise of the $i$-th learner are converted into the distributed representations $e^i_{(j,k)}$ and $p^i_{(j,k)}$, respectively. Further, the session and format of the $j$-th exercise of the $i$-th learner are converted into the distributed representations $s^i_j$ and $f^i_j$, respectively. Days and time are represented as $b^i_j$ and $t^i_j$, respectively. User and language are converted into the distributed representations $u^i$ and $l^i$, respectively. History is the last hidden state $c^i_{(j-1,M)}$ of the History LSTM, which will be described later (Section 3.3).
The input to the Bi-LSTM is given as
$$x^i_{(j,k)} = [e^i_{(j,k)}; p^i_{(j,k)}; s^i_j; f^i_j; b^i_j; t^i_j; u^i; l^i; c^i_{(j-1,M)}],$$
the concatenation of all features, where $N$ is the length of the $j$-th exercise and $k = 1, \dots, N$. $x^i_{(j,k)}$ is converted into the forward hidden state $\overrightarrow{h}^i_{(j,k)}$ and the backward hidden state $\overleftarrow{h}^i_{(j,k)}$, and their concatenation $h^i_{(j,k)} = [\overrightarrow{h}^i_{(j,k)}; \overleftarrow{h}^i_{(j,k)}]$ is fed into the extra hidden layer:
$$\hat{h}^i_{(j,k)} = \tanh(W_h h^i_{(j,k)} + b_h),$$
where $\hat{h}^i_{(j,k)} \in \mathbb{R}^{d_{\hat{h}} \times 1}$ is the extra hidden layer output, $W_h \in \mathbb{R}^{d_{\hat{h}} \times d_h}$ is a weight matrix, and $b_h \in \mathbb{R}^{d_{\hat{h}} \times 1}$ is a bias. The extra hidden layer output $\hat{h}^i_{(j,k)}$ is linearly transformed by the output layer, and the probability distribution $p^i_{(j,k)} \in \mathbb{R}^{t \times 1}$ over the OK/mistake tags is obtained using the softmax function:
$$p^i_{(j,k)} = \mathrm{softmax}(W_{\hat{h}} \hat{h}^i_{(j,k)} + b_{\hat{h}}),$$
where $W_{\hat{h}} \in \mathbb{R}^{t \times d_{\hat{h}}}$ is a weight matrix, $b_{\hat{h}} \in \mathbb{R}^{t \times 1}$ is a bias, and $t$ is the number of tags, which is set to 2 in our study.
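The extra hidden layer and softmax output can be sketched in pure Python. This is a toy illustration with made-up dimensions and weights; `h` stands in for a precomputed Bi-LSTM state, not a real trained model.

```python
import math

def affine(W, x, b):
    """y = Wx + b for a list-of-lists matrix and list vectors."""
    return [sum(w * v for w, v in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Toy dimensions: d_h = 3, d_hhat = 2, t = 2 tags (OK / mistake).
h = [0.5, -1.0, 0.2]  # stand-in for the concatenated Bi-LSTM state
W_h, b_h = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [0.0, 0.0]
W_out, b_out = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]

h_hat = [math.tanh(v) for v in affine(W_h, h, b_h)]  # extra hidden layer
p = softmax(affine(W_out, h_hat, b_out))             # P(OK), P(mistake)
print(sum(p))  # probabilities sum to 1
```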

History LSTM
As previously mentioned, to correctly predict each learner's mistakes, it is important to consider not only the history of learned exercises but also the learner's answers to those exercises. Thus, the History LSTM tracks all previous information about the learned exercises and how each learner answered them. For the $j$-th exercise, $o^i_{(j,1)}, o^i_{(j,2)}, \dots, o^i_{(j,N)}$ is given as input to the $j$-th History LSTM, where $o^i_{(j,k)} = [h^i_{(j,k)}; g^i_{(j,k)}]$. Here, $h^i_{(j,k)}$ (Section 3.2) carries information about the $j$-th exercise of the $i$-th learner, and $g^i_{(j,k)} \in \mathbb{R}^{1 \times 1}$ is the gold answer of the $i$-th learner for the $k$-th word of the $j$-th exercise. In addition, the first hidden state and cell memory of the $j$-th History LSTM are initialized with the last hidden state and cell memory of the previous, $(j{-}1)$-th History LSTM. The hidden state $c^i_{(j,1)}$ is created from $o^i_{(j,1)}$ using the LSTM for the next step of the Prediction Bi-LSTM.
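The way state is threaded across a learner's exercises can be sketched as follows. This is a schematic with a stand-in recurrence rather than real LSTM gates; the point is only how the final state of exercise $j$ initializes exercise $j{+}1$.

```python
import math

def lstm_step(state, x):
    """Stand-in for one LSTM step: mixes the input into the running
    state. A real implementation would use gated LSTM updates."""
    h, c = state
    c = [0.5 * ci + 0.5 * xi for ci, xi in zip(c, x)]
    h = [math.tanh(ci) for ci in c]
    return h, c

def track_history(exercises, dim=2):
    """Run the History LSTM over one learner's exercises in order.
    Each input o_(j,k) = [h_(j,k); g_(j,k)] pairs a Prediction
    Bi-LSTM state with the gold answer; the final (h, c) of exercise
    j initializes exercise j+1."""
    state = ([0.0] * dim, [0.0] * dim)  # first exercise starts from zeros
    for exercise in exercises:
        for h_pred, gold in exercise:
            o = [h_pred, float(gold)]
            state = lstm_step(state, o)
    return state

# Two exercises; each token is (Bi-LSTM summary value, gold label).
history = [[(0.3, 0), (0.1, 1)], [(0.7, 0)]]
h, c = track_history(history)
```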

Training
The objective function is defined as follows:
$$L_\theta = -\sum_{(x, g) \in D} \log p(g \mid x; \theta),$$
where $D$ is the training data and $\theta$ represents the model parameters. We use Backpropagation Through Time (BPTT) for training. In general, low-frequency words are replaced by an unk token to learn the unk vector. In our study, however, unknown words appear not because they are low-frequency, but because they have not been learned yet. Hence, we replace words that appear for the first time in an exercise with the unk token to learn the unk vector. In addition, we use the words without unk replacement to track the history for the History LSTM.
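The first-occurrence replacement can be sketched as follows. This is a minimal illustration; tokenization and casing details in the real system may differ.

```python
def replace_first_occurrences(exercises):
    """Replace each word the learner sees for the first time with
    <unk>, keeping later occurrences intact. The original, unreplaced
    tokens are still used when feeding the History LSTM."""
    seen = set()
    replaced = []
    for tokens in exercises:
        out = []
        for token in tokens:
            key = token.lower()
            if key in seen:
                out.append(token)
            else:
                out.append("<unk>")
                seen.add(key)
        replaced.append(out)
    return replaced

exercises = [["I", "am", "Japanese"], ["I", "am", "happy"]]
print(replace_first_occurrences(exercises))
# → [['<unk>', '<unk>', '<unk>'], ['I', 'am', '<unk>']]
```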
The final loss is calculated as follows:
$$L_\theta = \alpha L^{unk}_\theta + (1 - \alpha) L^{orig}_\theta,$$
where $L^{unk}_\theta$ is calculated with the words appearing for the first time replaced by unk, while $L^{orig}_\theta$ is calculated with the original words. Here, $\alpha$ expresses the degree of emphasis placed on unk versus learned words. For example, when the word "Japanese" appears for the first time:

Original exercise: I am Japanese
Replaced by unk: I am <unk>

If no unk exists in an exercise, $L_\theta$ has the same value as $L^{orig}_\theta$.
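The interpolation itself is a one-liner; the sketch below uses arbitrary example loss values, and the $\alpha$ value shown is illustrative only.

```python
def combined_loss(loss_unk, loss_orig, alpha):
    """L = alpha * L_unk + (1 - alpha) * L_orig. When an exercise
    contains no first-occurrence word, loss_unk equals loss_orig and
    the combined loss reduces to loss_orig."""
    return alpha * loss_unk + (1 - alpha) * loss_orig

print(combined_loss(2.0, 1.0, 0.01))  # a small alpha emphasizes L_orig
```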

Testing
At test time, predictions were made on the exercises of the test data, arranged in chronological order for each learner. We update the History LSTM using the output and hidden states of the Prediction Bi-LSTM. Unlike the training data, the test data does not have gold answers. Hence, the system converted the probability outputs of the Prediction Bi-LSTM component into pseudo-gold answers using argmax.
In addition, we performed ensemble predictions. The parameters of the ensemble models are initialized with different values. As the final prediction result, we used the average of the probability outputs of each Prediction Bi-LSTM; each model used its own probability outputs, converted with argmax, as pseudo-gold answers.

Table 2 shows the number of exercises in the train, dev, and test data for each language. The hyperparameters of our model are listed in Table 3. All words that appeared in the training data were included in the vocabulary. Preliminary experiments showed that the AUROC of a single model trained on the data of all three languages was higher than that of models trained for each language separately; therefore, we trained a single model on the three language tracks, English, Spanish, and French. In particular, the AUROC increased for the low-resource French track. Each model of the ensemble uses different dev and training sets randomly sampled from the data. Since we needed to evaluate the learning results of the future days of each learner, we combined the provided official training and dev sets and arranged the exercises in chronological order of days for each learner. We then randomly sampled exercises from the final learning exercises of each learner to create a dev set, and the remaining data were used as training data.

Table 4 lists the results of SLAM for English learners, Spanish learners, and French learners. The systems are ranked by their AUROC. The TMU system ranked fourth in the English and Spanish subtasks, and fifth in the French subtask.
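The ensemble step can be sketched as follows. This is a minimal illustration with made-up probabilities: per-model mistake probabilities are averaged for the final prediction, and the thresholded output of a single model serves as the pseudo-gold answer fed to its own History LSTM.

```python
def ensemble_average(model_probs):
    """Average the per-token mistake probabilities of several
    independently initialized models."""
    n = len(model_probs)
    return [sum(p[i] for p in model_probs) / n
            for i in range(len(model_probs[0]))]

def pseudo_gold(probs, threshold=0.5):
    """Convert one model's mistake probabilities into 0/1 pseudo-gold
    labels (argmax over the two tags) for its History LSTM."""
    return [1 if p >= threshold else 0 for p in probs]

# Three models' mistake probabilities for a four-token exercise:
runs = [[0.1, 0.8, 0.4, 0.6],
        [0.2, 0.9, 0.3, 0.7],
        [0.3, 0.7, 0.2, 0.8]]
print([round(v, 3) for v in ensemble_average(runs)])  # → [0.2, 0.8, 0.3, 0.7]
print(pseudo_gold(runs[0]))                           # → [0, 1, 0, 1]
```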

Analysis of Tracking History
In order to confirm the importance of history tracking, we compared the model that considers the history (W/ History Model) with the model that does not (W/O History Model), using the hyperparameters listed in Table 3. Table 5 lists the evaluation results. The AUROC of the W/ History Model is considerably higher than that of the W/O History Model. As we expected, it is important to consider what learners have learned in the past and how they responded to it in order to improve future predictions.

Conclusion
In this study, we described the TMU system for the 2018 SLAM shared task. Our system is based on RNNs; it has two components: (1) a Bi-LSTM for predicting learners' errors and (2) an LSTM for tracking learners' learning history.
In this work, we have not used any language-specific information. As future work, we plan to exploit additional data for each language, such as pre-trained word representations, n-grams, and character-based features. Additionally, we hope to incorporate word difficulty features (Kajiwara and Komachi, 2018): in particular, the more complex a word is, the more difficult it is likely to be to learn.