Deep Factorization Machines for Knowledge Tracing

This paper introduces our solution to the 2018 Duolingo Shared Task on Second Language Acquisition Modeling (SLAM). We used deep factorization machines, a wide and deep learning model of pairwise relationships between users, items, skills, and other entities considered. Our solution (AUC 0.815) beat the logistic regression baseline (AUC 0.774) but not the top-performing model (AUC 0.861), and reveals interesting strategies to build upon item response theory models.


Introduction
Given the massive amount of data collected by online platforms, it is natural to wonder how to use it to personalize learning. Students should receive, based on their estimated knowledge, tailored exercises and lessons, so they can be guided through databases of potentially millions of exercises.
With this objective in mind, numerous models have been designed for student modeling (Desmarais and Baker, 2012). Based on the outcomes of students, one can infer the parameters of these so-called student models, measure knowledge, and tailor instruction accordingly.
In the 2018 Duolingo Shared Task on Second Language Acquisition Modeling (Settles et al., 2018), we had access to the attempts of thousands of students on sentences composed of thousands of possible words, each word being labeled as correct or incorrect, and we had to predict whether a student would correctly write each word of a new sentence. Sentences were annotated with valuable side information such as lexical, morphological, or syntactic features. This problem is known as knowledge tracing (Corbett and Anderson, 1994) or predicting student performance (Minaei-Bidgoli et al., 2003) in the literature. In this particular challenge, it is done at the word level.
In this paper, we explain the motivations that led us to our solution, and show how our models recover typical models in educational data mining as special cases. In Section 2, we review related work. In Section 3, we present the existing DeepFM model and clarify how it can be applied to knowledge tracing, notably the SLAM task. In Section 4, we detail the data preparation needed to apply DeepFM. Finally, we present our results in Section 5 and further work in Section 6.

Related Work
Item Response Theory (IRT) models (Hambleton et al., 1991) have been extensively studied and deployed in many real-world applications such as standardized tests (e.g., the GMAT). They model the ability (level) of students and diverse parameters of items (such as difficulty), and involve many criteria for selecting items to measure the ability of examinees.
Related work in knowledge tracing consists in predicting the sequence of outcomes for a given learner. Historically, Bayesian Knowledge Tracing (BKT) modeled the learner as a Hidden Markov model (Corbett and Anderson, 1994), but with the advent of deep learning, a Deep Knowledge Tracing (DKT) model has been proposed (Piech et al., 2015), relying on long short-term memory (Hochreiter and Schmidhuber, 1997). However, Wilson et al. (2016) have shown that a simple variant of IRT could outperform DKT models.
None of these IRT, BKT, or DKT models considers side information, such as knowledge components, which is why new models naturally arose. Vie and Kashima (2018) have used Bayesian factorization machines for knowledge tracing, and recovered most student models as special cases.
Wide and deep learning models have been proposed by Google (Cheng et al., 2016) to learn lower-order and higher-order features. Guo et al. (2017) have proposed a variant where they replace the wide linear model by a factorization machine, and this is the best model we got for the Shared Task challenge.

DeepFM for knowledge tracing
We now introduce some vocabulary. We assume that our observed instances can be described by C categories of discrete or continuous features (such as user id, item id, or country, but also time). Entities denote pairs of a category and a discrete value (such as user=2 or country=FR), or the category itself if it is continuous (such as time). We denote by N the number of possible entities and number them from 1 to N. The DeepFM model we describe will learn an embedding for each of those entities.
Each instance can be encoded as a sparse vector x of size N: each component is set to a certain value (for example, 1 if the category of the corresponding entity is discrete, the value itself if it is continuous, and 0 if the entity is not present in the observation). For each instance, our model outputs a probability $p(x) = \psi(y_{FM} + y_{DNN})$, where ψ is a link function such as the sigmoid σ or the cumulative distribution function (CDF) Φ of the standard normal distribution.
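The encoding just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the category names, their values, and the helper names `build_entity_index` and `encode` are hypothetical and do not come from the shared task code.

```python
def build_entity_index(categories):
    """Map each entity to a 0-based index.

    `categories` maps a category name to its list of discrete values,
    or to None when the category is continuous (hypothetical convention).
    A continuous category yields a single entity keyed by (name, None).
    """
    index = {}
    for name, values in categories.items():
        if values is None:
            index[(name, None)] = len(index)
        else:
            for value in values:
                index[(name, value)] = len(index)
    return index


def encode(instance, index):
    """Encode one instance as a vector x of size N (dense list for clarity)."""
    x = [0.0] * len(index)
    for name, value in instance.items():
        if (name, None) in index:
            # Continuous category: store the value itself.
            x[index[(name, None)]] = float(value)
        else:
            # Discrete category: set the matching entity to 1.
            x[index[(name, value)]] = 1.0
    return x


# Illustrative data only.
categories = {"user": [1, 2, 3], "country": ["FR", "JP"], "time": None}
index = build_entity_index(categories)
x = encode({"user": 2, "country": "FR", "time": 4.5}, index)
```

Here N = 6 and only three components of x are nonzero; a real implementation would store x in a sparse format.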
The DeepFM model is made of two components, the FM component and the Deep component.

FM component
Given an embedding size d ∈ ℕ, the output of a factorization machine is the following:
$$y_{FM} = \langle w, x \rangle + \sum_{1 \le k < l \le N} \langle v_k, v_l \rangle \, x_k x_l$$
The first term shows that a bias $w_k \in \mathbb{R}$ is learned for each entity k. The second term models the pairwise interactions between entities by learning a vector $v_k \in \mathbb{R}^d$ for each entity k.
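As a minimal sketch of the FM component (a per-entity bias plus pairwise interactions weighted by dot products of embeddings), the following pure-Python functions compute the output first naively, then with the well-known linear-time reformulation of the pairwise term. The function names and toy values are ours, not those of any particular library.

```python
def fm_output(x, w, V):
    """Naive FM output: <w, x> + sum over k < l of <v_k, v_l> * x_k * x_l."""
    n = len(x)
    linear = sum(w[k] * x[k] for k in range(n))
    pairwise = 0.0
    for k in range(n):
        for l in range(k + 1, n):
            dot = sum(a * b for a, b in zip(V[k], V[l]))
            pairwise += dot * x[k] * x[l]
    return linear + pairwise


def fm_output_fast(x, w, V):
    """Same value in O(N d), using for each latent dimension f:
    sum_{k<l} v_kf v_lf x_k x_l = 0.5 * ((sum_k v_kf x_k)^2 - sum_k (v_kf x_k)^2).
    """
    n = len(x)
    d = len(V[0]) if V else 0
    linear = sum(w[k] * x[k] for k in range(n))
    pairwise = 0.0
    for f in range(d):
        s = sum(V[k][f] * x[k] for k in range(n))
        s_sq = sum((V[k][f] * x[k]) ** 2 for k in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


# Toy example: entities 0 and 2 are active, embedding size d = 2.
x_toy = [1.0, 0.0, 1.0]
w_toy = [0.1, 0.2, 0.3]
V_toy = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0]]
y_naive = fm_output(x_toy, w_toy, V_toy)
y_fast = fm_output_fast(x_toy, w_toy, V_toy)
```

Since x is sparse in practice, both sums only need to run over the active entities.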
Relation to existing student models
If d = 0 and ψ is the sigmoid function σ, then $p(x) = \sigma(\langle w, x \rangle)$ and the FM component behaves like logistic regression.
In particular, if there are two fields, users (with n possible values) and items, then the encoding $x_{ij}$ of user i and item j is a concatenation of two one-hot vectors, and $p(x_{ij}) = \sigma(w_i + w_{n+j}) = \sigma(\theta_i - d_j)$ for appropriate values of w (namely $w_i = \theta_i$ and $w_{n+j} = -d_j$, where $d_j$ here denotes the difficulty of item j, not the embedding size), which means the Rasch model is recovered.
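The correspondence with the Rasch model can be checked numerically. The sketch below assumes 0-based indices and made-up ability and difficulty values; all names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def one_hot_pair(i, j, n_users, n_items):
    """Concatenate a one-hot user vector and a one-hot item vector."""
    x = [0.0] * (n_users + n_items)
    x[i] = 1.0
    x[n_users + j] = 1.0
    return x


def fm_probability_d0(x, w):
    """FM with embedding size 0 and sigmoid link: plain logistic regression."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)))


def rasch_probability(theta_i, d_j):
    """Rasch model: ability minus difficulty, passed through a sigmoid."""
    return sigmoid(theta_i - d_j)


# Made-up parameters: 3 users, 2 items.
theta = [0.5, -1.0, 2.0]
difficulty = [0.3, 1.2]
# User biases are the abilities, item biases are the negated difficulties.
w = theta + [-dj for dj in difficulty]

max_gap = max(
    abs(fm_probability_d0(one_hot_pair(i, j, 3, 2), w)
        - rasch_probability(theta[i], difficulty[j]))
    for i in range(3)
    for j in range(2)
)
```

The maximum gap over all user-item pairs is zero up to floating-point error.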
As pointed out by Settles et al. (2018), their baseline model is a logistic regression with side information, which makes it similar to an additive factor model. To see more connections between our FM component and existing educational data mining models, see Vie and Kashima (2018).

Deep component
The deep component is an L-layer feedforward neural network that outputs:
$$y_{DNN} = g(W^{(L)} a^{(L)} + b^{(L)})$$
where each layer 0 ≤ ℓ < L verifies:
$$a^{(\ell+1)} = g(W^{(\ell)} a^{(\ell)} + b^{(\ell)})$$
for an activation function g and learned parameters W, a, b for each layer, and the first layer is given by the corresponding embeddings $v_{i_c}$ of the activated entities (the ones for each category c = 1, . . . , C, which correspond to the nonzero entries of x):
$$a^{(0)} = (v_{i_1}, \ldots, v_{i_C})$$
In order to select the hyperparameters, we followed the instructions of Guo et al. (2017) and the default values of the available implementation on GitHub.
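A forward pass of such a feedforward network over the concatenated embeddings of the active entities might look like the sketch below. The choice of ReLU as activation, the shapes, and all names are assumptions made for illustration; they are not taken from the implementation used in the experiments.

```python
def relu(z):
    return [max(0.0, t) for t in z]


def affine(W, a, b):
    """Compute W a + b for a weight matrix W given as a list of rows."""
    return [sum(wij * aj for wij, aj in zip(row, a)) + bi
            for row, bi in zip(W, b)]


def deep_component(embeddings, hidden_layers, output_layer, g=relu):
    """Feedforward pass.

    embeddings:    one embedding per active entity (one per category)
    hidden_layers: list of (W, b) pairs
    output_layer:  (W, b) pair mapping to a single output unit
    g:             activation function (ReLU assumed here)
    """
    # First layer input: concatenation of the active entity embeddings.
    a = [t for v in embeddings for t in v]
    for W, b in hidden_layers:
        a = g(affine(W, a, b))
    W_out, b_out = output_layer
    return g(affine(W_out, a, b_out))[0]


# Toy example: C = 2 categories, embedding size 2, one hidden layer of width 2.
embeddings = [[1.0, -1.0], [2.0, 0.0]]
hidden = [([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0])]
output = ([[2.0, 3.0]], [0.5])
y_dnn = deep_component(embeddings, hidden, output)
```

In DeepFM the embeddings fed to this network are the same vectors used by the FM component, so both parts are trained jointly.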

Training
Training is performed by minimizing the log loss of the output probabilities compared to the true outcomes of the students over the tokens. For all models trained, the optimizer was Adam (Kingma and Ba, 2014), with learning rate $\gamma = 10^{-3}$ and minibatches of size 1024.
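For reference, the log loss minimized during training can be written as a short function. This is a generic sketch of the standard binary cross-entropy, with clipping added by us for numerical safety; it does not reproduce the optimizer.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary log loss between outcomes in {0, 1} and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip to avoid log(0); eps is an illustrative choice.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# An uninformative model that always predicts 0.5 has loss log(2).
uninformed = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```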

Fundamental, discrete categories
Fundamental categories (<fundamental>) refer to features that have discrete values, such as user (which refers to the user ID) or countries (which can be in a many-to-many relationship).

Noisy discrete categories
Duolingo also provided SyntaxNet features (morphosyntactic rules). We call them noisy (<noisy> below) because they are the output of another algorithm. Moreover, not all of them were specified: some entries were missing.

Continuous categories
• time for answering the question
• days since the user subscribed to the Duolingo platform

Encoding
In the baseline model provided by Duolingo, all fundamental features were encoded as a concatenation of n-hot encoders. Logistic regression on this encoding achieved AUC 0.772.
Here are the models we considered.
• IRT: user + token, d = 0
• Logistic regression baseline
The implementation of Deep Factorization Machines we used required a concatenation of one-hot encoders, so we picked the first country in the list of countries for each instance. It also could not handle missing entries, so for the noisy, partially specified categories, we used a None entity.
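The two preprocessing choices described above (keeping the first country, and mapping missing noisy features to a None entity) could be sketched as follows. The field names and the helper `to_one_hot_fields` are hypothetical and do not reflect the actual column names of the dataset.

```python
def to_one_hot_fields(instance, noisy_fields):
    """Turn an instance into one value per field, suitable for one-hot encoding.

    instance:     dict of raw features; "countries" holds a list (assumed name)
    noisy_fields: names of noisy categories that may be missing
    """
    fields = dict(instance)
    # Many-to-many countries: keep only the first one.
    countries = fields.pop("countries", [])
    fields["country"] = countries[0] if countries else "None"
    # Missing noisy entries become an explicit "None" entity.
    for name in noisy_fields:
        if fields.get(name) is None:
            fields[name] = "None"
    return fields


raw = {"user": "u1", "countries": ["FR", "BE"], "Number": None}
prepared = to_one_hot_fields(raw, noisy_fields=["Number", "Tense"])
```

Each field then contributes exactly one active entity, which is what a concatenation of one-hot encoders requires.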

Results
We first tried different models on a validation set. Models were trained for 500 epochs for the vanilla FM, or for 100 epochs with early stopping for DeepFM, and then refit on the validation set.

On validation set
A vanilla FM was used with ψ = Φ, the CDF of the standard normal distribution, as link function, as in the implementation of Rendle (2012). Then, for our experiments, we used the TensorFlow implementation of DeepFM provided by Alibaba on GitHub. Our encoding is available on GitHub. Vanilla FM had performance comparable to the LR baseline. This agrees with the findings of Vie and Kashima (2018) that a bigger dimension may not necessarily help.
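The two link functions considered in this paper, the sigmoid σ and the standard normal CDF Φ, can be compared with the standard library alone. The function names are ours.

```python
import math


def sigmoid(z):
    """Logistic link: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def probit(z):
    """Probit link: Phi(z), the CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Both links map a score of 0 to probability 0.5,
# but Phi saturates faster than the sigmoid for large scores.
links_at_two = (sigmoid(2.0), probit(2.0))
```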

On test set
The DeepFM model improved on the baseline by about 4 points of AUC: we got AUC 0.815 (versus 0.774 for the baseline), while the top-performing solution had AUC 0.861. Our best-performing model was DeepFM using only the discrete features, with a latent embedding size of 10, trained for a fixed number of epochs (50). DeepFM* using all features was slightly worse.
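Since all results are reported in AUC, a small sketch of the metric may help: AUC is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, ties counting for one half. The quadratic-time version below is for illustration only, on made-up labels and scores.

```python
def auc(y_true, scores):
    """Area under the ROC curve, computed by comparing all positive/negative pairs."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ranked correctly.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```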

Further Work
We could embed the dependency graph provided by Duolingo in the encoding of the vanilla FM.
Ensemble methods such as xgboost (Chen and Guestrin, 2016) could be considered, as typically encountered in challenges.
Here we want to combine the information about the student, which is quite poor (almost only their outcomes), with the much richer knowledge about tokens (syntactic trees, word2vec, etc.). This is why we could use extra embeddings, such as an LSTM encoding of the sentence, as features for the token.
The performance of DeepFM*, which used all features, was slightly worse than that of DeepFM, which was limited to the fundamental features. We might mitigate this problem by using a field-aware factorization machine (Juan et al., 2016) that learns a parameter per category of feature in order to give more importance to some categories (such as user) than to others (such as date).

Conclusion
In this paper, we showed how to use deep factorization machines for knowledge tracing. Our findings show interesting combinations of features, together with embeddings provided by deep neural networks. In some way, it shows how to learn dense embeddings from the sparse features typically encountered in learning platforms.