Heterogeneous Graph Neural Networks for Concept Prerequisite Relation Learning in Educational Data

Prerequisite relations among concepts are crucial for educational applications such as curriculum planning and intelligent tutoring. In this paper, we propose a novel concept prerequisite relation learning approach, named CPRL, which combines concept representations learned from a heterogeneous graph with concept pairwise features. Furthermore, we extend CPRL to weakly supervised settings to make our method more practical, including learning prerequisite relations from learning object dependencies and generating training data with data programming. Our experiments on four datasets show that the proposed approach achieves state-of-the-art results compared with existing methods.


Introduction
With the increasing availability of learning resources and the demand for self-regulated learning, there is a rising need to organize knowledge in a reasonable order. Concept prerequisite relations essentially capture the dependencies among concepts, and they are crucial for people to learn, organize, apply and generate knowledge (Margolis and Laurence, 1999). For example, if someone wants to learn about Conditional Random Fields, the knowledge about Hidden Markov Models should be learned first. Consequently, the concept Hidden Markov Model is a prerequisite concept of the concept Conditional Random Fields. Nowadays, prerequisite relations among concepts play a crucial role in educational applications such as curriculum planning and intelligent tutoring (Wang and Liu, 2016; Chen et al., 2018).
Recently, several attempts have been made to extract prerequisite relations among concepts from textbooks (Liang et al., 2018), MOOCs (Massive Open Online Courses) (Pan et al., 2017), courses (Liang et al., 2015a; Liang et al., 2017; Li et al., 2019a; Roy et al., 2019) and scientific papers (Gordon et al., 2016). They either proposed local statistical measures, such as reference distance (Liang et al., 2015a) and cross-entropy (Gordon et al., 2016), to quantify the prerequisite relations between concepts, or proposed handcrafted features to learn a prerequisite relation classifier (Pan et al., 2017). Liang et al. (2017) proposed CPR-Recover to recover concept prerequisite relations from course dependencies. More recently, Li et al. (2019a) applied variational graph autoencoders to learn concept prerequisite relations from courses, while Roy et al. (2019) developed a supervised learning approach called PREREQ.
However, several challenges remain in learning the prerequisite relations among concepts. Firstly, there are multiple and complex relations among concepts and learning resources, but previous work did not fully utilize them. Secondly, labeling training data is enormously expensive and time-consuming, especially when domain expertise is required to judge concept prerequisite relations.
In order to address these challenges, we propose a novel concept prerequisite relation learning approach, named CPRL, which first learns concept representations via a relational graph convolutional network (R-GCN) (Schlichtkrull et al., 2018) over a heterogeneous graph, and predicts concept prerequisite relations with a Siamese network. The model is further optimized with learning object dependencies and handcrafted features.
Moreover, we extend CPRL to weakly supervised settings to make our approach more practical, including learning prerequisite relations from learning object dependencies and generating training data with the data programming paradigm.
Our contributions can be summarized as follows:
• We propose a heterogeneous concept-learning object graph (HCLoG), which models the multiple and complex relations among concepts and learning resources to learn concept representations.
• We propose a novel concept prerequisite relation learning approach, named CPRL, which combines evidence from concept representations learned via R-GCN on the HCLoG, learning object dependencies, and concept pairwise features.
• We extend CPRL under weakly supervised settings to avoid costly training data labeling.
• We conduct extensive experiments on four real-world datasets from different domains: Textbook, MOOC, LectureBank and University Course, and our approach achieves new state-of-the-art performance.

Problem Formulation
The educational data can be a textbook or a course, which can be modeled as a sequence of learning objects (LOs for short), such as book chapters, MOOC videos and lectures. The educational data contains concepts, and we would like to extract the prerequisite relations among these concepts, as shown in Figure 1. For convenience, we will use the following notations:
• D = {o_1, o_2, ..., o_M} is a sequence of learning objects, where o_i denotes the i-th learning object in D and is represented as a document. The document can be the text of a book chapter, or the speech script of a MOOC video.
• C = {c_1, c_2, ..., c_N} is the set of concepts in D.
Therefore, the problem can be formally defined as: given an educational data source D and its corresponding concepts C, the goal is to learn a function F_θ : C × C → {0, 1} which predicts whether c_i is a prerequisite concept of c_j by mapping the concept pair (c_i, c_j) to a binary class.

The CPRL Framework
The overview of our proposed CPRL is shown in Figure 2.
We first build a heterogeneous concept-learning object graph from the educational data, and then use a relational graph convolutional network (R-GCN) (Schlichtkrull et al., 2018) to represent the concepts and learning objects. Next, pairwise features for concept pairs are extracted from their textual and structural information. Finally, all features are combined to learn the concept prerequisite relations.
It should be noted that the dependencies among learning objects can be viewed as a signal of weak supervision, which are also used to train the model.

Heterogeneous Concept-Learning Object Graph
We build a heterogeneous concept-learning object graph from the educational data, which contains both concepts and learning objects, so that concept co-occurrence and learning object-concept relations can be explicitly modeled.
The heterogeneous concept-learning object graph is defined as a graph G = (V, E), where V consists of two types of nodes: concept nodes V c = {c 1 , c 2 , ..., c N } and learning object nodes V o = {o 1 , o 2 , ..., o M }, and E represents the relations among them.
Specifically, we define the following three types of edges in G.
1. An edge between a concept and a learning object, weighted by the term frequency-inverse document frequency (TF-IDF) of the concept in the document: the term frequency is the number of times the concept appears in the document, while the inverse document frequency is the logarithmically scaled inverse fraction of the number of documents that contain the concept. E.g., e_co in Figure 2.
2. An edge between two concepts that co-occur within a fixed-size sliding window over the documents, weighted by point-wise mutual information (PMI): pmi(i, j) = log [ p(i, j) / (p(i) p(j)) ], where p(i, j) = #W(i, j) / #W and p(i) = #W(i) / #W, with #W(i, j) the number of sliding windows that contain both c_i and c_j, #W(i) the number of sliding windows that contain c_i, and #W the total number of sliding windows.
3. An edge between two learning objects, weighted by the normalized distance between the two learning objects in the educational data: dis(i, j) = |j − i| / M. E.g., e_oo in Figure 2.
Thus, the adjacency matrix A ∈ R^((M+N)×(M+N)) of the graph G is defined as:

A_ij = tfidf(i, j), if i is a LO and j is a concept (or vice versa); pmi(i, j), if i and j are concepts and pmi(i, j) > 0; dis(i, j), if i and j are LOs; 1, if i = j; 0, otherwise.
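As a concrete illustration, the three edge types above can be sketched as follows. This is a didactic stand-in, not the paper's implementation: concepts are treated as single tokens, the graph is stored as a plain dict, and the function name `build_hclog` is our own.

```python
import math
from collections import Counter
from itertools import combinations

def build_hclog(docs, concepts, window=10):
    """Sketch of the HCLoG adjacency. docs: list of token lists, one per
    learning object in course order; concepts: single-token concept
    strings (a simplification). Concept nodes are indexed 0..N-1 and
    learning object nodes N..N+M-1."""
    N, M = len(concepts), len(docs)
    cidx = {c: i for i, c in enumerate(concepts)}
    A = {}

    # Concept-LO edges: TF-IDF of the concept in the document.
    df = Counter(c for doc in docs for c in set(doc) if c in cidx)
    for d, doc in enumerate(docs):
        tf = Counter(t for t in doc if t in cidx)
        for c, f in tf.items():
            w = f * math.log(M / df[c])
            A[(cidx[c], N + d)] = A[(N + d, cidx[c])] = w

    # Concept-concept edges: PMI over fixed-size sliding windows.
    windows = [doc[s:s + window] for doc in docs
               for s in range(max(1, len(doc) - window + 1))]
    nw = len(windows)
    single = Counter(c for w_ in windows for c in set(w_) if c in cidx)
    pair = Counter(p for w_ in windows
                   for p in combinations(sorted(set(w_) & set(cidx)), 2))
    for (ci, cj), n in pair.items():
        pmi = math.log(n * nw / (single[ci] * single[cj]))
        if pmi > 0:  # only positive-PMI edges are kept
            A[(cidx[ci], cidx[cj])] = A[(cidx[cj], cidx[ci])] = pmi

    # LO-LO edges: normalized distance dis(i, j) = |j - i| / M.
    for i in range(M):
        for j in range(M):
            if i != j:
                A[(N + i, N + j)] = abs(j - i) / M

    # Self-loops (the i = j case of the adjacency definition).
    for v in range(N + M):
        A[(v, v)] = 1.0
    return A
```

A real implementation would use sparse matrices and multi-word concept matching, but the edge semantics are the same.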

Concept Representation via R-GCN
Since there are different types of relations among the nodes in the heterogeneous concept-learning object graph, we employ R-GCN to learn the representations of concepts and LOs. We first use pre-trained GloVe word embeddings (Pennington et al., 2014) to represent each concept node in G. To represent a learning object, we average the word embeddings of the concepts it contains. Then, we update the node representations with R-GCN by aggregating messages from each node's direct neighbors:

h_i^(l+1) = σ( Σ_{r∈R} Σ_{j∈N_i^r} (1 / c_{i,r}) W_r^l h_j^l + W_0^l h_i^l ),

where N_i^r is the set of neighbors of node i under relation r ∈ R, W_r^l ∈ R^(d×d) is a relation-specific weight matrix, W_0^l ∈ R^(d×d) is a general weight matrix, h_i^l is the hidden state of node i at the l-th layer, σ is the ReLU function, and c_{i,r} = Σ_{j∈N_i^r} A_ij is a normalization constant.
We stack L such layers, and represent the concepts and learning objects by the hidden states of the nodes in the L-th layer.
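The propagation rule can be sketched as a dense NumPy computation. This is a didactic stand-in (real R-GCN implementations use sparse, relation-typed adjacency); the function name and argument layout are our own.

```python
import numpy as np

def rgcn_layer(H, A_by_rel, W_rel, W0):
    """One R-GCN propagation step. H: (n, d) node states;
    A_by_rel: relation name -> (n, n) adjacency matrix;
    W_rel: relation name -> (d, d) relation-specific weight;
    W0: (d, d) self-connection weight."""
    out = H @ W0  # self-connection term W_0^l h_i^l
    for r, A in A_by_rel.items():
        # c_{i,r} = sum_j A_ij normalizes each node's incoming messages
        c = A.sum(axis=1, keepdims=True)
        c[c == 0] = 1.0  # guard isolated nodes against division by zero
        out += (A / c) @ (H @ W_rel[r])
    return np.maximum(out, 0.0)  # sigma = ReLU
```

Stacking two calls of this function with appropriately sized weights corresponds to the L = 2 setting used in the experiments.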

Prerequisite Relation Classification
After representing concepts via R-GCN, a Siamese network is used to predict whether concept c_i is a prerequisite of c_j.
We first take the concept representations of c_i and c_j as the input of a Siamese network, as shown in Figure 3, to calculate the likelihood of c_i being a prerequisite concept of c_j:

p(c_i, c_j) = σ( W [h_i ; h_j ; h_i ⊗ h_j ; h_i − h_j] + b ),

where σ is the sigmoid function, ⊗ and − are the element-wise multiplication and subtraction operators, and [·; ·] denotes vector concatenation.
Finally, we use the cross-entropy loss:

L_c = − Σ_{(c_i, c_j)∈T} [ y_ij log p(c_i, c_j) + (1 − y_ij) log(1 − p(c_i, c_j)) ],

where T is the training set and y_ij ∈ {0, 1} is the ground truth label of (c_i, c_j).

Optimized with LO Dependencies
Intuitively, the dependencies among learning objects can reflect the prerequisite relations among concepts, but how can we utilize the learning object dependencies to enhance our model?
In the heterogeneous concept-learning object graph, concepts and learning objects are represented in the same space, so they can be fed to the same Siamese network.
Formally, we feed the representations of learning objects o_i and o_j to the same Siamese network described in the previous section, and obtain the likelihood p(o_i, o_j) of the learning object dependency. Similarly, we define the loss function as:

L_o = − Σ_{(o_i, o_j)} [ y_ij log p(o_i, o_j) + (1 − y_ij) log(1 − p(o_i, o_j)) ].

Predicting the dependencies among learning objects can be considered an auxiliary task for concept prerequisite relation learning, so the loss function becomes L = L_c + µL_o.

Fusing Handcrafted Pairwise Features
In order to fully utilize the information in LOs, we also extract concept pairwise features from their textual and structural information. Liang et al. (2015a) pointed out that if, when learning concept A, one needs to refer to concept B a lot but not vice versa, then B is more likely to be a prerequisite of A than A of B. Inspired by this idea, we propose a new generic metric, named learning object reference distance (LOrd), over a learning object sequence D = {o_1, o_2, ..., o_M} to measure prerequisite relations among concepts.
For a concept pair (c_i, c_j), we propose the reference weight (rw) to quantify how much c_j is referred to by the LOs that mention concept c_i:

rw(c_i, c_j) = Σ_{o∈D} f(o, c_i) r(o, c_j) / Σ_{o∈D} f(o, c_i),

where f(o, c_i) is the frequency with which concept c_i appears in learning object o, and r(o, c_j) ∈ {0, 1} denotes whether concept c_j appears in o. Then, LOrd is defined as:

LOrd(c_i, c_j) = rw(c_i, c_j) − rw(c_j, c_i).

Obviously, LOrd can be easily calculated for textbooks, MOOC courses and university courses. In addition, for MOOCs, we use the features from (Pan et al., 2017), while for textbooks, we extract several pairwise features as in (Pan et al., 2017), including semantic relatedness, Wikipedia reference distance and complexity level distance. The details are given in the Appendix.
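The LOrd metric reduces to a few lines of code. The sign convention (positive LOrd suggesting c_j is a prerequisite of c_i, by analogy with RefD) and the token-list document format are our assumptions.

```python
def lord(ci, cj, docs):
    """Learning object reference distance for a concept pair.
    docs: list of token lists, one per learning object, in course order."""
    def rw(a, b):
        # frequency-weighted fraction of a's occurrences whose LO also mentions b
        den = sum(doc.count(a) for doc in docs)
        num = sum(doc.count(a) for doc in docs if b in doc)
        return num / den if den else 0.0
    return rw(ci, cj) - rw(cj, ci)
```

For example, if one concept always co-occurs with another but the reverse does not hold, the score is strongly asymmetric, matching the intuition from Liang et al. (2015a).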
Moreover, for textbooks we also extract a head matching feature and a ToC distance feature for concept pairs. The head matching feature indicates whether two concepts share a common head, obtained by suffix matching; a shared head usually implies the existence of a prerequisite relation, e.g., tree and binary tree. The ToC distance measures the distance between concepts in the table of contents of D.
All the pairwise features are concatenated and fed into a feed-forward neural network, which generates the prediction p_F(c_i, c_j) for the concept pair (c_i, c_j). The loss function for the pairwise features is:

L_f = − Σ_{(c_i, c_j)∈T} [ y_ij log p_F(c_i, c_j) + (1 − y_ij) log(1 − p_F(c_i, c_j)) ].

Therefore, the overall loss function is L = L_c + µL_o + λL_f, where µ and λ are two hyperparameters.

The CPRL with Weak Supervision
In practice, it is expensive to collect massive hand-labeled data for model training. One intuitive way to alleviate the labeling cost is to train the model in one domain (e.g., Calculus) and then use it to predict concept prerequisite relations in other domains (e.g., Data Structure and Physics). However, this idea fails in practice, as we show in our experiments.
Therefore, we extend our model under the weak supervision settings in two ways.
The first is learning prerequisite relations from LO dependencies. Since concepts and LOs are embedded into the same space through the R-GCN on the heterogeneous graph, our model can implicitly infer the prerequisite relations between concepts by explicitly learning the dependencies between LOs. This variant is called CPRL_lo.
The second is to use the data programming paradigm (Ratner et al., 2016) to create probabilistic training data. Data programming expresses weak supervision strategies or domain heuristics as labeling functions (LFs), and then estimates their label accuracies by fitting a generative model. The process is shown in Figure 4. We apply m such LFs to the n unlabeled concept pairs {(c_i^t, c_j^t)}_{t=1}^n to generate a label matrix Λ ∈ {−1, 0, 1}^(n×m). Then, we use the weak supervision framework Snorkel (Ratner et al., 2019a) to train a generative model, which takes the label matrix Λ as input and produces a probabilistic training label Ỹ = p(Y | Λ) for each concept pair. The generated labels can then be used to train our model.
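To make the pipeline concrete, here is a minimal stand-in: two toy labeling functions and a majority-vote aggregator in place of Snorkel's generative model (which additionally estimates per-LF accuracies and correlations). All names, heuristics and thresholds here are illustrative, not the paper's actual label functions.

```python
def apply_lfs(pairs, lfs):
    """Build the label matrix Lambda in {-1, 0, 1}^(n x m): each LF
    votes 1 (prerequisite), -1 (not a prerequisite), or 0 (abstain)."""
    return [[lf(p) for lf in lfs] for p in pairs]

def soft_labels(label_matrix):
    """Majority-vote stand-in for the generative model: each row of
    votes becomes a probabilistic label p(y = 1 | Lambda)."""
    probs = []
    for row in label_matrix:
        votes = [v for v in row if v != 0]
        probs.append(sum(v == 1 for v in votes) / len(votes) if votes else 0.5)
    return probs

# Toy LFs over concept-string pairs (c_i, c_j):
lf_head = lambda p: 1 if p[1].endswith(" " + p[0]) else 0   # "tree" -> "binary tree"
lf_same = lambda p: -1 if p[0] == p[1] else 0               # identical concepts
```

The resulting soft labels can directly weight the noise-aware loss described below, instead of the hard labels y_ij.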
With the probabilistic training data, L_c and L_f are changed to their noise-aware variants:

L̃_c = − Σ_{(c_i, c_j)} [ ỹ_ij log p(c_i, c_j) + (1 − ỹ_ij) log(1 − p(c_i, c_j)) ],

and analogously for L̃_f, where ỹ_ij is the probabilistic label. This variant is called CPRL_dp.

Datasets
In order to validate the effectiveness of our model, we conducted experiments on four datasets from different domains.
• Textbook: we selected six Chinese textbooks in each of the three domains of Calculus, Data Structure, and Physics, then extracted 89, 84 and 139 concepts and labeled 449, 439 and 623 prerequisite relations for each domain, respectively. The datasets will be made publicly available.
• MOOC: we used MOOC data 1 mentioned in (Pan et al., 2017), which involves two domains: Data Structure and Algorithms (DSA) and Machine Learning (ML).
• LectureBank: This dataset 2 (Li et al., 2019a) contains 1,352 English lecture files collected from university courses, and the annotations of prerequisite relations on 208 concepts.
• University Course: This dataset 3 (Liang et al., 2017) has 654 courses with 861 course prerequisite edges from various universities in USA, and 1008 pairs of concepts with prerequisite relations are manually annotated.
The sets of concepts and the prerequisite relations among them were annotated by experts and released with the datasets. The statistics of the datasets are listed in the Appendix.

Baselines
We used the following state-of-the-art approaches as baselines.
Binary classifiers: We compared our model with the binary classifiers used in (Pan et al., 2017), including Naïve Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR) and Random Forest (RF).
RefD: RefD (Liang et al., 2015b) is a simple link-based metric for measuring the prerequisite relations among concepts.
GAE: GAE denotes graph autoencoder, which encodes a graph with GCN, and predicts links through the adjacency matrix reconstruction. Li et al. (2019a) used GAE for concept prerequisite relation learning.
VGAE: VGAE is an extension to GAE, which was also used in (Li et al., 2019a) for concept prerequisite relation learning.
PREREQ: PREREQ (Roy et al., 2019) obtains latent representations of concepts through the pairwise-link LDA model, and identifies concept prerequisite relations through a Siamese network.
We also compared our weakly-supervised variants with CPR-Recover (Liang et al., 2017), which is an unsupervised approach, and can recover concept prerequisite relations from course dependencies.
Consistent with prior work, we mainly used the F-score (F1) to evaluate CPRL against all the baselines. We also compared precision (P) and recall (R) with the other methods.

Implementation Details
In all datasets, only concept prerequisite pairs are manually annotated, and we split the positive samples into training and test sets. For a fair comparison with previous research, 90% of the samples of LectureBank were used for training and the remaining 10% for testing; for the other datasets, the proportions were 70% and 30%. We then generated negative samples by randomly sampling unrelated concept pairs from the vocabulary, in addition to the reversed pairs of the positive samples. To address the class imbalance problem, negatives were sampled at 3.5 and 1.5 times the number of positive examples in the training and test sets for the Textbook dataset and the other datasets, respectively. The results are averaged over 5 train-test splits.
The parameters were initialized randomly from a Gaussian distribution with zero mean and standard deviation σ = 0.3. The initial learning rate γ was 0.5 for Textbook and 0.1 for the other datasets, and was annealed every 50 epochs by a factor of 0.99. We trained CPRL with stochastic gradient descent and stopped training if the training loss did not decrease for 30 consecutive epochs. For the baseline models, we used the default parameter settings of their original implementations, and also used 300-dimensional GloVe vectors as the pre-trained word embeddings.
For R-GCN, we set the number of layers to L = 2, the embedding size of the first convolution layer to 256, and that of the second convolution layer to the number of concepts in each dataset. We experimented with other settings and found that small changes did not influence the results much. In addition, we set λ = 0.2 and µ = 0.1, since they gave the best performance. The influence of the parameters L, λ and µ is discussed in the Appendix. Table 1 shows the precision, recall and F-score on the four datasets with different domains.

Performance Comparison
From the table, we find that (1) CPRL achieves the best F-score against all baselines on all datasets, except for the DSA domain of the MOOC dataset. (2) CPRL performs best on LectureBank and University Course even without pairwise features and learning object dependencies, which indicates that HCLoG can effectively model the multiple and complex relations among concepts and learning resources to learn better concept representations. (3) RefD can indeed measure the prerequisite relations among concepts, obtaining higher precision but lower recall. (4) GAE and VGAE utilize GCN for adjacency matrix reconstruction, but they perform worse than CPRL. The reason is that CPRL learns concept representations from the heterogeneous concept-learning object graph, which fully exploits the complex relations among concepts and learning objects, while GAE and VGAE only use the graph among concepts.

Ablation Study
In order to evaluate the effects of pairwise features and LO dependencies, we conducted ablation experiments on the Textbook and MOOC datasets. The results are shown in Table 2, where row-wise best results are in bold, and CPRL_f and CPRL_c denote the models that minimize L_c + λL_f and L_c, respectively.
As shown in Table 2, CPRL performs better than CPRL_f and CPRL_c on most of the datasets, so pairwise features and learning object dependencies both contribute to the performance. Moreover, even CPRL_c outperforms the baselines in Table 1, which demonstrates the effectiveness of the heterogeneous graph.

Effectiveness of Weak Supervision
In order to evaluate our weakly supervised prerequisite relation learning approaches, we compared our two variants CPRL_lo and CPRL_dp with CPR-Recover (Liang et al., 2017) on the Textbook dataset; the results are shown in Table 3.
From the table, we find that CPRL_lo and CPRL_dp outperform CPR-Recover on all metrics, and CPRL_dp achieves the best performance. This shows that knowledge of learning object dependencies can be transferred to concept prerequisite relation learning through the concept-learning object graph. In addition, data programming with our designed labeling functions can generate helpful training data and achieve performance comparable to the supervised CPRL.

Verification of Domain Transfer Ability
In order to explore the transfer ability of our model between domains, we conducted an experiment on the Textbook dataset. Specifically, for CPRL, we first trained the model in one domain and then used it to predict prerequisite relations between concepts in another domain. For CPRL_dp, we obtained the best thresholds (such as θ_max^LOrd and θ_min^LOrd) of the LFs in one domain and then applied them to the other domains. The results are shown in Table 4, where each row and column represent the source and target domain respectively, and the values in the cells are F1-scores.
We observe that (1) F-scores drop severely for CPRL, so we cannot simply transfer the model across domains due to the differences among concepts and LOs.
(2) CPRL_dp is more stable and can be used in practice, since we only need to label a small amount of training data in one domain.

Effectiveness of Ensemble
Our approach learns concept prerequisite relations from a single learning object sequence, such as one textbook. Since the concepts in textbooks of the same domain largely overlap, the prerequisite relations learned from different textbooks can be aggregated.
Here, we used a simple majority voting strategy for aggregation, and the results are shown in Figure 6. From the figure, we see a significant improvement from the ensemble.

Related Work
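The aggregation can be sketched as a per-pair majority vote across textbooks. The data layout (one prediction dict per textbook) is an assumption, and ties break to the negative class here:

```python
from collections import Counter

def ensemble_vote(predictions):
    """predictions: one dict per textbook, mapping (c_i, c_j) -> 0/1.
    Returns the majority-vote label for every pair seen in any model."""
    ones, seen = Counter(), Counter()
    for pred in predictions:
        for pair, y in pred.items():
            ones[pair] += y
            seen[pair] += 1
    # a pair is positive only with a strict majority of the models that scored it
    return {p: int(2 * ones[p] > seen[p]) for p in seen}
```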

Prerequisite Relation Learning
Learning prerequisite relations between concepts has attracted much recent attention; existing work can be classified into three categories: approaches based on local statistical information, recovery-based approaches, and learning-based approaches.
As local statistical information, reference distance (Liang et al., 2015a) and cross-entropy (Gordon et al., 2016) were proposed to measure concept prerequisite relations. CPR-Recover (Liang et al., 2017) is a recovery-based approach, which recovers prerequisite relations from course dependencies. Learning-based approaches are the most popular. For example, Pan et al. (2017) proposed contextual, structural and semantic features for concept prerequisite relation classification. Roy et al. (2019) applied the pairwise-link LDA model to represent concepts, and trained a Siamese network to identify prerequisite relations. Li et al. (2019a) trained variational graph autoencoders to predict concept prerequisite relations. However, these approaches did not model the multiple and complex relations among concepts and learning resources. Meanwhile, they also need a large training set, which is costly to obtain. In order to reduce the amount of training data required, active learning was investigated for concept prerequisite learning in (Liang et al., 2018) and (Liang et al., 2019).

Weakly Supervised Learning
One of the most significant bottlenecks for machine learning is the need for large training sets. Nowadays, it is promising to use weakly supervised learning techniques to reduce the amount of human intervention needed. For example, distant supervision can produce noisy training data by aligning unlabeled data with an external knowledge base, e.g., for relation extraction (Smirnova and Cudré-Mauroux, 2018). Crowdsourcing (Yuen et al., 2011) and heuristic rules can also generate noisy training data.
However, such weakly supervised data are incomplete, inexact and inaccurate, so it is important to integrate multiple sources of noisy labels to produce more accurate data. Data programming (Ratner et al., 2016) provides a simple and unifying framework for the creation of training sets: it expresses weak supervision strategies as labeling functions, and then uses a generative model to denoise the labels. Snorkel (Ratner et al., 2019a) is a system built around the data programming paradigm for rapidly creating, modeling, and managing training data. Several works have explored data programming for training data creation. For example, SwellShark (Fries et al., 2017) was proposed for quickly building biomedical named entity recognition taggers using lexicons, heuristics, and other forms of weak supervision instead of hand-labeled data. GWASkb, with thousands of genotype-phenotype associations, was created using Snorkel (Kuleshov et al., 2019). Snorkel has also been used for chemical reaction relation extraction (Mallory et al., 2020), discourse structure learning (Badene et al., 2019) and medical entity classification (Fries et al., 2020).
In addition, data programming has been further improved for different situations. For example, MeTaL (Ratner et al., 2019b) was proposed for modeling and integrating weak supervision sources with different unknown accuracies, correlations, and granularities. Cross-modal data programming was proposed in (Dunnmon et al., 2020). FlyingSquid (Fu et al., 2020) speeds up weak supervision with triplet methods.

Conclusion
In this paper, we propose a novel concept prerequisite relation learning approach, named CPRL, which combines concept representations learned from a heterogeneous graph with concept pairwise features. Furthermore, we extend CPRL to weakly supervised settings to make our method more practical. The experiments on four datasets show that our method achieves state-of-the-art performance. In addition, we also demonstrate the effectiveness of our weakly supervised prerequisite relation learning variants.
In the future, we plan to design more effective labeling functions or employ more reliable weakly supervised learning approaches (Li et al., 2019b; Guo et al., 2019) to further improve the performance. Moreover, we will also introduce concept prerequisite relations into curriculum planning and intelligent tutoring applications, e.g., organizing learning resources into a reasonable order and incorporating prerequisite relations into knowledge tracing.

The optimal thresholds of the labeling functions can be obtained by grid search with a small amount of training data; some empirical values are given in Table 6.

A.4 Influence of Parameters
In order to determine the parameters λ and µ in the loss function, we conducted experiments on the Textbook dataset in the Physics domain with different values of λ and µ; Figure 8 shows the results. We therefore chose λ = 0.2 and µ = 0.1 in our experiments, which gave the best performance. In addition, we also evaluated our approach with different numbers of R-GCN layers (L), and the result is shown in Figure 9. From the figure, we can see that the F-score increases gradually and then finally drops. Thus, we chose L = 2 in our experiments.

Table 5: Statistics of the Datasets. In University Course, each course is described only using its brief introduction, so the average number of tokens in the learning objects is limited.

A.5 Impact of Training Set Size
To compare with previous research, we used 90% of the positive samples of LectureBank and 70% of the positive samples of the other datasets to train the model. To further explore the ability of our model, we trained it with different numbers of training samples; Figure 10 shows the results in the Physics domain of the Textbook dataset.
It shows that, when more positive samples are used for training, the model reaches a higher F1-score. Moreover, it outperforms the baselines with only about 30% of the positive samples, which implies our model's ability to fully exploit the training samples.