CLUF: a Neural Model for Second Language Acquisition Modeling

Second Language Acquisition Modeling is the task of predicting whether second language learners will respond correctly in future exercises based on their learning history. In this paper, we propose a neural network based system that utilizes rich contextual, linguistic and user information. Our neural model consists of a Context encoder, a Linguistic feature encoder, a User information encoder and a Format information encoder (CLUF). Furthermore, a decoder is introduced to combine the encoded features and make the final predictions. Our system ranked first in the English track and second in the Spanish and French tracks, with AUROC scores of 0.861, 0.835 and 0.854 respectively.


Introduction
Education systems that can adapt the presentation of educational materials to students' personal learning needs have great potential. Specifically, in the area of second language learning, we try to predict whether the learning materials are too easy or too hard for language learners. Therefore, we study the Second Language Acquisition Modeling (SLAM) task to build a model of the language learning process.
Bayesian Knowledge Tracing (BKT) (Corbett and Anderson, 1994; Pardos and Heffernan, 2010; Pelánek, 2017), which models students' knowledge over time, is a well-established approach. It uses a Hidden Markov Model (HMM) with binary hidden states to represent knowledge acquisition for each concept separately. BKT has been successfully applied to subjects like mathematics and programming, where a limited number of concepts can be predefined. However, in language learning, it is difficult to define a small number of concepts, especially when the vocabulary size increases over time. Deep Knowledge Tracing (DKT) (Piech et al., 2015; Wilson et al., 2016) is a more recent approach to knowledge tracing that uses Recurrent Neural Networks (RNNs) to model a student's learning trace. Although RNNs and their commonly used variants, such as Gated Recurrent Units (Cho et al., 2014) and Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), are capable of capturing dynamic temporal behavior in a time sequence, it is hard to model an extremely long learning history that can range over months or even years. Half-life Regression (Settles and Meeder, 2016) is a novel approach for the SLAM task, which combines a psycholinguistic model of human memory with modern machine learning techniques. It has demonstrated state-of-the-art performance for predicting student recall rates.
Mapping symbols, such as characters or words, into a continuous space is a popular method in natural language processing (Hinton, 1986; Mikolov et al., 2013; Pennington et al., 2014; Mikolov et al., 2017). It has achieved remarkable success in many tasks, for example, neural language modeling (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2010), machine translation (Bahdanau et al., 2015), text classification (Lai et al., 2015; Zhang et al., 2015; Conneau et al., 2017), sentiment analysis (dos Santos and Gatti, 2014; Poria et al., 2015) and machine reading comprehension (Xiong et al., 2017; Hu et al., 2017). In this work, we introduce a similar neural approach for the SLAM task, where we use neural encoders to extract features from each exercise as well as from metadata about the student and the session. Specifically, we build a Context encoder, a Linguistic feature encoder, a User information encoder and a Format information encoder (CLUF) to calculate high-level representations from characters, words, part-of-speech (POS) labels, syntactic dependency labels, user id and country, exercise type, client, etc.

Dataset
The Duolingo SLAM dataset (Settles et al., 2018) is organized into three language tracks:
• en_es: English learners (who already speak Spanish)
• es_en: Spanish learners (who already speak English)
• fr_en: French learners (who already speak English)
According to Table 1, most tokens (more than 80%) are perfect matches and are given the label 0 for "OK". Tokens that are missing or spelled incorrectly (ignoring capitalization, punctuation, and accents) are given the label 1, denoting a mistake. Across the three language tracks, en_es has the lowest positive ratio, while es_en has the highest out-of-vocabulary (OOV) ratio. Table 2 shows the features provided with the SLAM dataset. In our system, we used all features except the morphology features and syntactic dependency edges, as they did not yield any improvement in our experiments. Perhaps this is because the neural networks already encode similar information from characters, words and their syntactic dependency labels.
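The labeling rule described above can be illustrated with a small sketch. It is not the dataset's actual alignment procedure: it assumes the expected and produced tokens are already paired, and the helper names `normalize` and `token_label` are hypothetical.

```python
import string
import unicodedata


def normalize(token: str) -> str:
    """Lowercase a token and drop accents and ASCII punctuation."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(
        ch
        for ch in decomposed
        if not unicodedata.combining(ch) and ch not in string.punctuation
    )


def token_label(expected: str, produced) -> int:
    """Return 0 ("OK") for a match, 1 for a missing or misspelled token."""
    if produced is None:  # assumption: None marks a missing token
        return 1
    return 0 if normalize(expected) == normalize(produced) else 1


labels = [
    token_label("está", "Esta"),  # differs only in case and accent
    token_label("gato", "gata"),  # genuine spelling difference
    token_label("gato", None),    # token missing from the response
]
```

Only ASCII punctuation is stripped in this sketch; a fuller treatment would also handle marks such as "¿".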

Method
We used four encoders in total to model the students' learning behavior. Inputs to these encoders are embeddings learned from one-hot representations of raw features. The context encoder consists of a character level LSTM encoder and a word level LSTM encoder. The linguistic feature encoder is also an LSTM model, where the POS and syntactic dependency label embeddings are the inputs. Finally, the user encoder and the format encoder are both fully-connected neural networks. The user encoder takes into account the user id, the user's nationality and other user related information, while the format encoder encodes the exercise format, session type, client type and time used for the exercise. The decoder combines the outputs of these encoders and then makes predictions through a sigmoid unit.

Context Encoder
The context encoder operates at both the word level and the character level. Word level encoding captures semantics and long-range dependencies better than character level encoding. However, learning new words is a key part of language learning, and by modeling the character sequence we may be able to learn certain word formation rules.
The word level context encoder is a Bidirectional LSTM model. Given a sequence of words represented as one-hot vectors $(w_1, w_2, \ldots, w_N)$, the word embedding of $w_t$ is
$$x_t = E_w w_t,$$
where $E_w$ is the word embedding matrix, which is learned during training.
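The embedding lookup described above amounts to multiplying the embedding matrix by a one-hot vector, which selects a single column. A minimal sketch, assuming $E_w$ is stored with one column per vocabulary entry; the matrix values and the helper names are illustrative only.

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Toy embedding matrix: embedding size 2, vocabulary size 3 (made-up values).
E_w = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

w_t = one_hot(2, 3)              # the word with vocabulary index 2
x_t = matvec(E_w, w_t)           # embedding via matrix product
column = [row[2] for row in E_w]  # the same embedding via direct indexing
```

In practice the product is never formed explicitly; frameworks index the matrix directly, which gives the same vector.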
Given the input vector $x_t$, the forward, backward, and combined activations of the $j$-th hidden layer are computed for $j = 1, 2, \ldots, K_0$, where $K_0$ is the number of layers of the network.
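The forward, backward and combined activations of one bidirectional layer can be sketched as follows. This is an illustration under two assumptions: a simple tanh recurrent cell stands in for the LSTM cell to keep the example short, and the two directions are combined by concatenation. All function names and parameter values are hypothetical.

```python
import math


def rnn_step(x, h_prev, W_x, W_h, b):
    """One step of a simple tanh recurrent cell (a stand-in for an LSTM cell)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_x[k], x))
            + sum(w * hi for w, hi in zip(W_h[k], h_prev))
            + b[k]
        )
        for k in range(len(b))
    ]


def run_direction(inputs, W_x, W_h, b, reverse=False):
    """Run the cell over the sequence, left-to-right or right-to-left."""
    h = [0.0] * len(b)  # initial hidden state
    outputs = [None] * len(inputs)
    positions = range(len(inputs))
    for t in (reversed(positions) if reverse else positions):
        h = rnn_step(inputs[t], h, W_x, W_h, b)
        outputs[t] = h
    return outputs


def bidirectional_layer(inputs, fwd_params, bwd_params):
    """Combine forward and backward states at each position by concatenation."""
    forward = run_direction(inputs, *fwd_params)
    backward = run_direction(inputs, *bwd_params, reverse=True)
    return [f + bk for f, bk in zip(forward, backward)]


params = ([[1.0]], [[0.5]], [0.0])  # toy (W_x, W_h, b) with hidden size 1
sequence = [[1.0], [2.0], [3.0]]
states = bidirectional_layer(sequence, params, params)
```

Stacking several such layers means feeding the combined states of layer $j-1$ as the inputs of layer $j$.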
The character level context encoder is a hierarchical LSTM model. Given a sequence of one-hot representations of the characters in word $w_t$, $(c_1, c_2, \ldots, c_M)$, the embedding of $c_i$ is $E_c c_i$, where $E_c$ is the character embedding matrix, which is learned during training.
The outputs of the lookup layer are then fed into a multilayer LSTM unit. The mean-over-time (MoT) layer takes the LSTM outputs $H_{w_t}$ as input and averages them over the character positions of $w_t$. The outputs of the MoT layer, $(h_{w_1}, h_{w_2}, \ldots, h_{w_N})$, are then the inputs to a Bidirectional LSTM model with $K_2$ layers, indexed by $j = 1, 2, \ldots, K_2$. The final outputs of the context encoder are computed as
$$O = (o_1, o_2, \ldots, o_N),$$
where $o_t = g_t + \hat{g}_t$.
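The mean-over-time step and the final summation can be sketched in a few lines. The sketch assumes that the MoT layer is a plain average over the character positions of a word and that the two summed sequences have the same length and dimension; the function names are hypothetical.

```python
def mean_over_time(hidden_states):
    """Average a list of per-character hidden vectors into one word vector."""
    steps = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / steps for d in range(dim)]


def combine_outputs(g, g_hat):
    """Element-wise sum o_t = g_t + g_hat_t at every position t."""
    return [[a + b for a, b in zip(g_t, gh_t)] for g_t, gh_t in zip(g, g_hat)]


# Toy hidden states for a three-character word (made-up values).
H_w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_w = mean_over_time(H_w)

# Toy outputs of the two context encoders for a two-word sequence.
O = combine_outputs([[1.0, 2.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]])
```

Because the outputs are summed rather than concatenated, both encoders must produce vectors of the same size.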

Linguistic Feature Encoder
The linguistic feature encoder is also an LSTM model. Similar to the context encoder, we trained embedding representations of the POS labels and the syntactic dependency labels. The POS embeddings and syntactic dependency embeddings are concatenated and then fed into an LSTM unit. Here $pos_t$ is the POS embedding of word $w_t$, $dep_t$ is the syntactic dependency label embedding of word $w_t$, $j$ is the layer index, and the LSTM unit has $K_3$ layers.
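The input construction for the linguistic feature encoder can be sketched as a per-word concatenation of two embedding lookups. The tag names, embedding values and the function `linguistic_inputs` below are illustrative assumptions, not values from the system.

```python
# Toy embedding tables (made-up values; real embeddings are learned).
pos_embedding = {"NOUN": [0.1, 0.2], "VERB": [0.3, 0.4]}
dep_embedding = {"nsubj": [0.5], "ROOT": [0.6]}


def linguistic_inputs(pos_tags, dep_labels):
    """Concatenate the POS and dependency label embeddings for each word."""
    return [
        pos_embedding[pos] + dep_embedding[dep]  # list concatenation
        for pos, dep in zip(pos_tags, dep_labels)
    ]


inputs = linguistic_inputs(["NOUN", "VERB"], ["nsubj", "ROOT"])
```

Each resulting vector is what a recurrent layer would receive at the corresponding time step.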

User Encoder
The user encoder is a one-layer fully-connected feedforward network. The encoder takes user metadata as inputs: $u$, the embedding of the user id; $s$, the embedding of the user's nationality; and $days$, the time since the student started learning this language. $W_\mu$ and $b_\mu$ are trained network parameters. We used the tanh activation function for the user encoder.
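A one-layer tanh network over the user metadata can be sketched as below. The sketch assumes the three inputs are concatenated into a single vector before the affine map, which the description suggests but does not state outright; the weights and the function names are hypothetical.

```python
import math


def dense_tanh(weights, bias, inputs):
    """Fully-connected layer followed by tanh: tanh(W x + b)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def user_encoder(u, s, days, W_mu, b_mu):
    """Encode user metadata; concatenating the inputs is an assumption."""
    features = u + s + [days]
    return dense_tanh(W_mu, b_mu, features)


# Toy inputs and parameters (made-up values).
u = [0.5, -0.5]   # user id embedding
s = [1.0]         # nationality embedding
days = 0.2        # time since the student started learning
W_mu = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
b_mu = [0.0, 0.0]
mu_1 = user_encoder(u, s, days, W_mu, b_mu)
```

The same layer shape would apply to any encoder that maps a flat feature vector to a fixed-size representation.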

Format Encoder
Similar to the user encoder, the format encoder is also a one-layer fully-connected feedforward network. The inputs are the format, session, client, and response time; $W_f$ and $b_f$ are trainable parameters.

Decoder
The decoder takes the outputs $(O, L, \mu_1, f_1)$ of the context encoder, linguistic encoder, user encoder and format encoder as inputs. The prediction for word $w_t$ in the given sequence $(w_1, w_2, \ldots, w_N)$ is computed from these inputs using the trainable parameters $W_\nu$, $b_\nu$, $W_\gamma$, $b_\gamma$, $W_p$, and $b_p$. For decoding, we used the sigmoid activation function $\sigma$.

Training
The model is trained to minimize a loss function in which $\alpha$ is the hyperparameter that balances the negative and positive samples and $y_t$ is the label at time step $t$. In our experiment, we set $\alpha$ to 0.7.
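One loss consistent with this description is a class-weighted binary cross-entropy. The sketch below is an assumption rather than the exact training objective: it supposes that $\alpha$ weights the positive (mistake) class, that $1 - \alpha$ weights the negative class, and that the loss is averaged over time steps. The function name `weighted_bce` is hypothetical.

```python
import math


def weighted_bce(probs, labels, alpha=0.7, eps=1e-12):
    """Class-weighted binary cross-entropy averaged over time steps."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= alpha * y * math.log(p) + (1.0 - alpha) * (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# An equally confident error costs more on a positive than on a negative.
missed_positive = weighted_bce([0.1], [1])
false_alarm = weighted_bce([0.9], [0])
```

With `alpha=0.5` the expression reduces to half of the standard binary cross-entropy, so values above 0.5 shift the emphasis toward the rarer positive class.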

Experiments
We treated words that appear fewer than five times in the training data as unknown tokens. For students with more than one nationality, only the first one was used. The embedding size was set to 100, and Dropout (Srivastava et al., 2014) regularization was applied with a dropout rate of 0.5. We used the Adam optimization algorithm (Kingma and Ba, 2014) with a learning rate of 0.001. The word level context encoder was a two-layer Bidirectional LSTM. The character level context encoder had one LSTM layer for encoding each word and three Bidirectional LSTM layers above the MoT layer. Furthermore, the linguistic encoder was a two-layer LSTM. Both the user encoder and the format encoder were one-layer fully-connected feedforward networks.

Table 6: The relative improvement over the baseline.
Term                 en_es    es_en    fr_en
Relative impr (%)    11.24    11.93    9.72
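The rare-word rule from the experimental setup can be sketched with a frequency count. The `<unk>` symbol and the function names are assumptions made for illustration; only the threshold of five comes from the text.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical symbol for the unknown token


def build_vocab(train_tokens, min_count=5):
    """Keep words that appear at least `min_count` times in the training data."""
    counts = Counter(train_tokens)
    return {token for token, count in counts.items() if count >= min_count}


def map_rare(tokens, vocab):
    """Replace every out-of-vocabulary word with the unknown token."""
    return [token if token in vocab else UNK for token in tokens]


train_tokens = ["the"] * 5 + ["cat"] * 4
vocab = build_vocab(train_tokens)
mapped = map_rare(["the", "cat", "dog"], vocab)
```

Words seen only at test time fall into the same unknown bucket as rare training words.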

Results
The evaluation metrics for the SLAM task were the Area Under the Receiver Operating Characteristic (AUROC) curve and the F1 score.
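Both metrics can be computed directly from labels and predictions. The sketch below uses the pairwise-ranking definition of AUROC (the probability that a random positive is scored above a random negative, with ties counted as one half), which is quadratic in the number of samples and meant for illustration only. The F1 score takes hard 0/1 predictions; the threshold that produces them is left to the caller.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


example_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

AUROC is independent of any decision threshold, whereas F1 depends on how the predicted probabilities are binarized.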
As shown in Tables 3, 4 and 5, our model achieved AUROC scores of 0.861, 0.835, and 0.854 and F1 scores of 0.559, 0.524 and 0.569 for the en_es, es_en, and fr_en tracks, respectively. We ranked first in the en_es track and second in the es_en and fr_en tracks. Table 6 shows that CLUF gained significant improvements on all tracks compared to the baseline model. The improvements on the en_es and es_en tracks were close, while the improvement on the fr_en track was a bit lower. We think this is because the fr_en track (327k exercises) has much less training data than the en_es (824k exercises) and es_en (732k exercises) tracks.
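A relative improvement over a baseline is conventionally computed as the score difference divided by the baseline score. The sketch below assumes that convention; the two scores in the example are hypothetical round numbers chosen for illustration, not results from any track.

```python
def relative_improvement(model_score, baseline_score):
    """Relative improvement over the baseline, in percent."""
    return (model_score - baseline_score) / baseline_score * 100.0


# Hypothetical scores for illustration only (not values from the tables).
improvement = relative_improvement(0.88, 0.80)
```

Because the difference is divided by the baseline score, the same absolute gain yields a larger relative improvement when the baseline is lower.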

Discussion
Our intuition behind CLUF is to factorize raw features into four independent parts: 1) the word surface form, which models the word formation rules; 2) the linguistic encoder, which provides linguistic and syntactic dependency information; 3) the user part, which explores students' second language acquisition skills; and 4) the format part, which covers the exercise format, session type, client type and response time.
Table 7 shows the performance of our CLUF model when one of the context, linguistic, user and format encoders is excluded. We can see that the performance drops substantially if we do not use the contextual or format features. On the other hand, excluding the linguistic features does not affect the performance much. Finally, we can achieve fairly good performance even if we do not use any user information.

Conclusion
We presented a neural network based model, CLUF, for the SLAM task. We encoded the contextual, linguistic, user and format features separately. Our system achieved one of the best results in this task. Moreover, our CLUF model was language invariant, as it performed approximately equally well across the three language tracks. We further explored how effective each encoder was. We found that the context encoder was the most effective, while the linguistic encoder was the least effective.