Prerequisite Relation Learning for Concepts in MOOCs

What prerequisite knowledge should students master before moving forward to learn subsequent coursewares? We study the extent to which the prerequisite relations between knowledge concepts in Massive Open Online Courses (MOOCs) can be inferred automatically, and in particular, what kinds of information can be leveraged to uncover potential prerequisite relations between knowledge concepts. We first propose a representation learning-based method for learning latent representations of course concepts, and then investigate how different features capture the prerequisite relations between concepts. Our experiments on three datasets from Coursera show that the proposed method achieves significant improvements (+5.9-48.0% by F1-score) compared with existing methods.


Introduction
Mastery learning was first formally proposed by Benjamin Bloom in 1968 (Bloom, 1981), suggesting that students must achieve a level of mastery (e.g., 90% on a knowledge test) in prerequisite knowledge before moving forward to learn subsequent knowledge concepts. Since then, prerequisite relations between knowledge concepts have become a cornerstone of curriculum design in schools and universities. Prerequisite relations can essentially be considered as dependencies among knowledge concepts, which are crucial for people to learn, organize, apply, and generate knowledge (Laurence and Margolis, 1999). Figure 1 shows a real example from Coursera. The student wants to learn "Conditional Random Field" (in video 18 of CS229). The prerequisite knowledge might be "Hidden Markov Model" (in video 25 of another course).

Organizing the knowledge structure with prerequisite relations benefits educational tasks such as curriculum planning and automatic reading list generation (Jardine, 2014), and helps improve education quality (Rouly et al., 2015). For example, as shown in Figure 1, with explicit prerequisite relations among concepts (in red), a coherent and reasonable learning sequence can be recommended to the student (in blue). Traditionally, prerequisite relationships were provided by teachers or teaching assistants (Novak, 1990); in the era of MOOCs, however, this becomes infeasible, as teachers face hundreds of thousands of students with various backgrounds. Meanwhile, the rapid growth of Massive Open Online Courses has made thousands of courses available, and students are free to choose any of them. Therefore, there is a clear need for methods that automatically discover the prerequisite relationships among knowledge concepts in the large course space, so that students from different backgrounds can easily explore the knowledge space and better design their personalized learning schedules.
There have been a few efforts to automatically detect prerequisite relations for knowledge bases. For example, Talukdar and Cohen (2012) proposed a method for inferring prerequisite relationships between entities in Wikipedia, and Liang et al. (2015) presented a more general approach to predicting prerequisite relationships. A few other works aim to extract prerequisite relationships from textbooks (Yosef et al., 2011; Wang et al., 2016). However, it is far from sufficient to directly apply these methods to the MOOC environment, for the following reasons. First, most previous attempts focus on prerequisite inference for Wikipedia concepts (either Wikipedia articles or Wikipedia concepts in textbooks), but many course concepts are not included in Wikipedia (Schweitzer, 2008; Okoli et al., 2014). We can leverage Wikipedia, in particular its existing entity relationships, but cannot rely on Wikipedia alone for detecting prerequisite relations in MOOCs. Second, with thousands of courses from different universities and very different disciplines, the MOOC scenario is much more complicated: there are not only inter-course concept relationships, but also intra-course and even intra-disciplinary relationships. Moreover, user interactions with the MOOC system might also be helpful for identifying prerequisite relations. How to fully leverage these different sources of information to better infer prerequisite relations in MOOCs is a challenging issue.
In this paper, we attempt to figure out what kinds of information in MOOCs can be used to uncover the prerequisite relations among concepts. Specifically, we consider three aspects: course concept semantics, course video context and course structure. First, semantic relatedness plays an important role in prerequisite relations between concepts. If two concepts have very different semantic meanings (e.g., "matrix" and "anthropology"), it is unlikely that they have a prerequisite relation. However, because course videos in MOOCs are short, statistical features alone do not provide sufficient information for capturing concept semantics. We therefore propose an embedding-based method that incorporates external knowledge from Wikipedia to learn semantic representations of concepts in MOOCs, and based on it, one semantic feature that calculates the semantic relatedness between concepts. Second, motivated by the reference distance (RefD) (Liang et al., 2015), we propose three new contextual features, i.e., Video Reference Distance, Sentence Reference Distance and Wikipedia Reference Distance, which infer prerequisite relations in MOOCs from different aspects of context information; they are more general and informative than RefD and overcome its sparsity problem. Third, we examine different distributional patterns of concepts in MOOCs, including appearing position, distributional asymmetry, video coverage and survival time, and propose three structural features that exploit these patterns for prerequisite inference in MOOCs.
To evaluate the proposed method, we construct three datasets, each consisting of multiple real courses in a specific domain from Coursera 1 , the largest MOOC platform in the world. We also compare our method with representative works on prerequisite learning and provide an in-depth analysis of the contributions of the proposed features. The experimental results show that our method achieves state-of-the-art results on prerequisite relation discovery in MOOCs. In summary, our contributions include: a) the first attempt, to the best of our knowledge, to detect prerequisite relations among concepts in MOOCs; b) a set of novel features that utilize contextual, structural and semantic information in MOOCs to identify prerequisite relations; c) three useful datasets based on real Coursera courses for evaluating our method.

Problem Formulation
In this section, we first give some necessary definitions and then formulate the problem of prerequisite relation learning in MOOCs.
A MOOC corpus is composed of n courses in the same subject area, denoted as D = {C 1 , · · · , C i , · · · , C n }, where C i is one course. Each course C can be further represented as a video sequence C = (V 1 , · · · , V i , · · · , V |C| ), where V i denotes the i-th teaching video of course C. Finally, we view each video V as a document of its video texts (video subtitles or speech script), i.e., V = (s 1 · · · s i · · · s |V| ), where s i is the i-th sentence of the video texts.
Course concepts are the subjects taught in a course, i.e., concepts that are not merely mentioned but actually discussed and taught in the course. Let us denote the course concept set of D as K = K 1 ∪ · · · ∪ K n , where K i is the set of course concepts in C i .
Prerequisite relation learning in MOOCs is formally defined as follows. Given a MOOC corpus D and its corresponding course concepts K, the objective is to learn a function P : K 2 → {0, 1} that maps a concept pair ⟨a, b⟩, where a, b ∈ K, to a binary class that predicts whether a is a prerequisite concept of b.
In order to learn this mapping, we need to answer two crucial questions. How should we represent a course concept? What information regarding a concept pair is helpful for capturing their prerequisite relation? We first propose an embedding-based method to learn appropriate semantic representations for each course concept in K. Based on the learned representations, we propose 7 novel features to capture whether a concept pair has a prerequisite relation. These features utilize different aspects of information and can be classified into 1 semantic feature, 3 contextual features and 3 structural features. In the following section, we first describe the semantic representations in detail, and then formally introduce our proposed features.

Concept Representation & Semantic Relatedness
We first learn appropriate representations for course concepts. Given the course concepts K as input, we utilize a Wikipedia corpus to learn semantic representations for concepts in K. A Wikipedia corpus W is a set of Wikipedia articles and can be represented as a sequence of words W = w 1 · · · w i · · · w m , where w i denotes a word and m is the length of the word sequence. Our method consists of two steps: (1) entity annotation, and (2) representation learning. Entity Annotation. We first automatically annotate the entities in W to obtain an entity set E and an entity-annotated Wikipedia corpus W′ = x 1 · · · x i · · · x m′ , where x i corresponds to either a word w ∈ W or an entity e ∈ E. Note that m′ < m because multiple adjacent words can be labeled as one entity. Many entity linking tools are available for entity annotation, e.g., TAGME (Ferragina and Scaiella, 2010), AIDA (Yosef et al., 2011) and TremenRank (Cao et al., 2015). However, the rich hyperlinks created by Wiki editors provide a more natural way. In our experiments, we simply use the hyperlinks in Wikipedia articles as annotated entities.
Representation Learning. We then learn word embeddings (Mikolov et al., 2013b,a) on W′ to obtain low-dimensional, real-valued vector representations for the entities in E and the words in W. Let us denote by v e and v w the vectors of e ∈ E and w ∈ W, respectively. For a course concept a ∈ K, suppose a is an N-gram term g 1 · · · g N with g 1 , · · · , g N ∈ W; we obtain its semantic representation v a as follows.
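One plausible form of this definition (a reconstruction based on the description in the next paragraph; the exact equation may differ) is:

```latex
v_a =
\begin{cases}
  v_e, & \text{if } a \text{ corresponds to an entity } e \in E \\[4pt]
  \sum_{i=1}^{N} v_{g_i}, & \text{otherwise}
\end{cases}
```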
It means that if a is a Wikipedia entity, we can directly use its entity vector; otherwise, we obtain its vector via the addition of its individual word vectors. In this way, a has no corresponding vector only if one of its constituent words is absent from the whole Wikipedia corpus. This case is unusual because a large online encyclopedia corpus easily covers almost all individual words of the vocabulary. Our experimental results verify that over 98% of the course concepts have vector representations.
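As an illustration, this lookup-and-compose rule can be sketched in a few lines of Python (the dictionary-based interface and all names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def concept_vector(concept, entity_vecs, word_vecs):
    """Return the semantic vector of a course concept, or None if unavailable.

    entity_vecs: dict mapping annotated Wikipedia entities to vectors
    word_vecs:   dict mapping individual words to vectors
    """
    if concept in entity_vecs:
        # The concept matches a Wikipedia entity: use its entity vector directly.
        return np.asarray(entity_vecs[concept], dtype=float)
    words = concept.split()
    if any(w not in word_vecs for w in words):
        # A constituent word is absent from the corpus vocabulary (rare in practice).
        return None
    # Otherwise, compose the concept vector by adding its word vectors.
    return np.sum([np.asarray(word_vecs[w], dtype=float) for w in words], axis=0)
```

In practice the two dictionaries would be populated from the embeddings trained on W′.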

Feature 1: Semantic Relatedness
For a given concept pair ⟨a, b⟩, the semantic relatedness between a and b, denoted as ω(a, b), is our first feature (the only semantic feature). With the learned semantic representations, the semantic relatedness of two concepts is directly reflected by their distance in the vector space. We define ω(a, b) ∈ [0, 1] as the normalized cosine similarity between v a and v b , as follows.
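One plausible form of this normalization (a reconstruction; the exact equation may differ) is:

```latex
\omega(a, b) = \frac{\cos(v_a, v_b) + 1}{2},
\qquad
\cos(v_a, v_b) = \frac{v_a \cdot v_b}{\lVert v_a \rVert \, \lVert v_b \rVert}
```

which maps the cosine similarity from [−1, 1] into [0, 1].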

Contextual Features
Context information in course videos provides important clues for inferring prerequisite relations. In the videos where concept A is taught, if the teacher also frequently mentions concept B but not vice versa, then B is more likely to be a prerequisite of A than A of B. For example, "gradient descent" is a prerequisite concept of "back propagation". In teaching videos of "back propagation", the concept "gradient descent" is frequently mentioned when illustrating the optimization details of back propagation. On the contrary, "back propagation" is unlikely to be mentioned when teaching "gradient descent". A similar observation also exists in Wikipedia, based on which Liang et al. (2015) proposed an indicator, namely reference distance (RefD), to infer prerequisite relations among Wikipedia articles. However, RefD is computed based on the link structure of Wikipedia; it is thus only feasible for Wikipedia concepts and is not applicable to plain text. We overcome these shortcomings of RefD and propose three novel features, which utilize context information from different sources, namely course videos, video sentences and Wikipedia articles, to infer prerequisite relations in MOOCs.

Feature 2: Video Reference Distance
Given a concept pair ⟨a, b⟩ where a, b ∈ K, we propose the video reference weight (V rw) to quantify how strongly b is referred to by the videos of a, defined as follows.
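A plausible form of V rw, consistent with the definitions of f and r given next (a reconstruction; the exact equation may differ), is:

```latex
Vrw(a, b) = \frac{\sum_{V \in D} f(a, V)\, r(V, b)}{\sum_{V \in D} f(a, V)}
```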
where f (a, V) indicates the term frequency of concept a in video V, which reflects how important concept a is to this video, and r (V, b) ∈ {0, 1} denotes whether concept b appears in video V. Intuitively, if b appears in more important videos of a, V rw (a, b) tends to be larger; the range of V rw (a, b) is between 0 and 1. Then, the video reference distance (V rd) is defined as the difference of V rw between the two concepts, as follows.
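By analogy with the Wikipedia reference distance W rd defined in Feature 4, a plausible form is:

```latex
Vrd(a, b) = Vrw(b, a) - Vrw(a, b)
```

so that V rd (a, b) is positive when the videos of b refer to a more strongly than vice versa, indicating that a is a likely prerequisite of b.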
In practice, this feature may be too sparse if the MOOC corpus is small: an arbitrary concept pair may have no co-occurrence in any course video. We extend the video reference distance to a more general version by considering the semantic relatedness among concepts. Besides the cases in which a refers to b, we also consider the cases in which a-related concepts refer to b. We first define the generalized video reference weight (GV rw) as follows.
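Given that GV rw is described as a weighted average of V rw over the top-M related concepts, a plausible form (a reconstruction; the exact equation may differ) is:

```latex
GVrw(a, b) = \frac{\sum_{i=1}^{M} \omega(a, a_i)\, Vrw(a_i, b)}{\sum_{i=1}^{M} \omega(a, a_i)}
```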
where a 1 , · · · , a M ∈ K are the top-M most similar concepts to a, measured by the semantic relatedness function ω(·, ·) in Feature 1. GV rw is the weighted average of V rw (a i , b), indicating how strongly b is referred to by a-related concepts in their corresponding videos. Note that a 1 = a, thus GV rw (a, b) ≡ V rw (a, b) when M = 1. Similarly, we define the generalized video reference distance (GV rd) as follows.
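Mirroring V rd, a plausible form is:

```latex
GVrd(a, b) = GVrw(b, a) - GVrw(a, b)
```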
Intuitively, if most of b-related concepts refer to a but not vice versa, then a is likely to be a prerequisite of b. For example, it is plausible for the related concepts of "gradient descent", e.g., "steepest descent" and "Newton's method", to mention "matrix" but clearly not vice versa.
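To make the video-level feature concrete, here is a small sketch (the toy data structures are assumptions; vrw and vrd follow the textual definitions above, not the paper's released code):

```python
def vrw(a, b, videos):
    """Video reference weight: how strongly the videos featuring concept a refer to b.

    videos: list of term-frequency dicts, one per course video in the corpus
    """
    num = sum(v.get(a, 0) * (1 if b in v else 0) for v in videos)
    den = sum(v.get(a, 0) for v in videos)
    return num / den if den else 0.0

def vrd(a, b, videos):
    """Video reference distance: positive when b's videos refer to a more than
    vice versa, suggesting that a is a prerequisite of b."""
    return vrw(b, a, videos) - vrw(a, b, videos)
```

For instance, if videos teaching "back propagation" frequently mention "gradient descent" but not the reverse, vrd("gradient descent", "back propagation", videos) comes out positive.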

Feature 3: Sentence Reference Distance
Sentence reference distance is similar to Feature 2, but operates at the sentence level. Following the same design pattern as Feature 2, we define the sentence reference weight (Srw) and the sentence reference distance (Srd) as follows.
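Plausible forms, mirroring V rw and V rd at the sentence level (a reconstruction; the exact equations may differ), are:

```latex
Srw(a, b) = \frac{\sum_{s \in D} r(s, a)\, r(s, b)}{\sum_{s \in D} r(s, a)},
\qquad
Srd(a, b) = Srw(b, a) - Srw(a, b)
```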
where r (s, a) ∈ {0, 1} is an indicator of whether concept a appears in sentence s. Srw(a, b) calculates the ratio of sentences containing a that also mention b. We also define the generalized sentence reference weight (GSrw) and the generalized sentence reference distance (GSrd) as follows.
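Mirroring the generalized video-level features, plausible forms are:

```latex
GSrw(a, b) = \frac{\sum_{i=1}^{M} \omega(a, a_i)\, Srw(a_i, b)}{\sum_{i=1}^{M} \omega(a, a_i)},
\qquad
GSrd(a, b) = GSrw(b, a) - GSrw(a, b)
```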

Feature 4: Wikipedia Reference Distance
Contextual information from Wikipedia is also useful for detecting prerequisite relations. As mentioned before, RefD is not general enough to be applied in our setting, because it is limited to Wikipedia concepts. Therefore, we improve this indicator to a more general one, which is also suitable for non-wiki concepts.
Specifically, for a concept a ∈ K, let us denote the top-M most related wiki entities of a as R a = {e 1 , · · · , e M }, where e 1 , · · · , e M ∈ E. Because concepts in K and entities in E are jointly embedded in the same vector space in Section 3.1, we can easily obtain R a with the semantic relatedness metric ω(·, ·) of Feature 1. We then define the Wikipedia reference weight (W rw) as follows.
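Given the binary indicator Erw described next, a plausible form that averages over the M entities related to a (a reconstruction; the exact equation may differ) is:

```latex
Wrw(a, b) = \frac{1}{M} \sum_{e \in R_a} Erw(e, b)
```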
where Erw(e, a) is a binary indicator: Erw(e, a) = 1 if the Wikipedia article of e refers to any entity in R a , and Erw(e, a) = 0 otherwise. W rw (a, b) measures how frequently a-related wiki entities refer to b-related wiki entities. Finally, the Wikipedia reference distance (W rd) is defined as the difference of W rw between a and b, i.e., W rd (a, b) = W rw (b, a) − W rw (a, b).

Structural Features
Since course concepts are usually introduced following their learning dependencies, the structure of MOOC courses also contributes significantly to prerequisite relation inference in MOOCs. However, structure-based features for prerequisite detection have not been well studied in previous works. In this section, we investigate different kinds of structural information, including the appearing positions of concepts, the learning dependencies of videos and the complexity levels of concepts, and propose three novel features to infer prerequisite relations in MOOCs. Before introducing these features, let us define two useful notations. C(a) denotes the set of courses in which a is a course concept, i.e., C(a) = {C i |C i ∈ D, a ∈ K i }. I(C, a) denotes the set of video indexes that contain concept a in course C. For example, if a appears in the first and the fourth video of C, then I(C, a) = {1, 4}.

Feature 5: Average Position Distance
In a course, the prerequisite concepts of a specific concept tend to be introduced before it, and its subsequent concepts tend to be introduced after it. Based on this observation, for a concept pair ⟨a, b⟩, we calculate the distance between the average appearing positions of a and b as one feature, namely the average position distance (Apd). If C(a) ∩ C(b) ≠ ∅, Apd (a, b) is formally defined as follows.
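A plausible form, which averages the gap between the mean appearing positions of b and a over their shared courses and normalizes by course length (the normalization by |C| is an assumption), is:

```latex
Apd(a, b) = \frac{1}{|C(a) \cap C(b)|}
\sum_{C \in C(a) \cap C(b)}
\frac{1}{|C|}
\left(
  \frac{\sum_{j \in I(C, b)} j}{|I(C, b)|}
  -
  \frac{\sum_{i \in I(C, a)} i}{|I(C, a)|}
\right)
```

which is positive when a tends to appear earlier than b.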

Feature 6: Distributional Asymmetry Distance
We also use the learning dependencies of course videos to help infer the learning dependencies of course concepts. We observe that a prerequisite concept is more likely to be frequently mentioned in subsequent videos than a subsequent concept is in preceding videos. Specifically, if video V a is a precursor video of V b and a is a prerequisite concept of b, then it is likely that f (b, V a ) < f (a, V b ), where f (a, V) denotes the term frequency of a in video V. We thus define another feature, namely the distributional asymmetry distance (Dad), to calculate the extent to which a given concept pair satisfies this distributional asymmetry pattern. Formally, in course C, for a given concept pair ⟨a, b⟩, we first define S(C) = {(i, j)|i ∈ I(C, a), j ∈ I(C, b), i < j}, i.e., all possible video pairs of a, b that have a sequential relation. Then, the distributional asymmetry distance of ⟨a, b⟩ is formally defined as follows.
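One plausible instantiation, counting the fraction of sequential video pairs that satisfy the pattern f (b, V a ) < f (a, V b ) (the indicator form is an assumption; the exact equation may differ), is:

```latex
Dad(a, b) = \frac{1}{|C(a) \cap C(b)|}
\sum_{C \in C(a) \cap C(b)}
\frac{1}{|S(C)|}
\sum_{(i, j) \in S(C)}
\mathbb{1}\!\left[\, f(a, V_j^C) > f(b, V_i^C) \,\right]
```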
where V C i denotes the i-th video of course C. If C(a) ∩ C(b) = ∅, we set Dad (a, b) = 0.

Feature 7: Complexity Level Distance
Two related concepts with a prerequisite relationship tend to differ in their complexity levels, meaning that one concept is basic while the other is advanced. For example, "data set" and "training set" have a learning dependency, and the latter concept is more advanced than the former. In contrast, "test set" and "training set" have no such relation, because their complexity levels are similar. The complexity level of a course concept is implicit in its distribution over courses. Specifically, we observe that if a concept covers more videos in a course, or survives for a longer time in a course, then it is more likely to be a basic concept rather than an advanced one. We formally define the average video coverage (avc) and the average survival time (ast) of a concept a as follows.
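Plausible forms, consistent with the description of video coverage and survival time (the normalization by course length |C| is an assumption), are:

```latex
avc(a) = \frac{1}{|C(a)|} \sum_{C \in C(a)} \frac{|I(C, a)|}{|C|},
\qquad
ast(a) = \frac{1}{|C(a)|} \sum_{C \in C(a)} \frac{\max(I(C, a)) - \min(I(C, a)) + 1}{|C|}
```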
where max(I(C, a)) / min(I(C, a)) is the video index where a appears for the last/first time in course C. Based on the above equations, we define the complexity level distance (Cld) between concepts a and b as follows.
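One plausible combination of the two quantities (the equal weighting is an assumption) is:

```latex
Cld(a, b) = \bigl(avc(a) - avc(b)\bigr) + \bigl(ast(a) - ast(b)\bigr)
```

which is positive when a is the more basic concept of the pair.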

Data Sets
To validate the effectiveness of our features, we conducted experiments on three MOOC corpora from different domains: "Machine Learning" (ML), "Data Structure and Algorithms" (DSA), and "Calculus" (CAL). To the best of our knowledge, there is no public data set for mining prerequisite relations in MOOCs, so we construct our own as follows. First, for each chosen domain, we select its relevant courses from Coursera, one of the leading MOOC platforms, and download all course materials using coursera-dl 2 , a widely-used tool for automatically downloading Coursera.org videos. For example, for ML, we select 5 related courses 3 from 5 different universities and obtain a total of 548 course videos. Then, we manually label the course concepts of each course: (1) extract candidate concepts from the documents of video subtitles following the method of Parameswaran et al. (2010); (2) label each candidate as "course concept" or "not course concept" to obtain a set of course concepts for the course.
Finally, we manually annotate the prerequisite relations among the labeled course concepts. If the number of course concepts is n, the number of all possible pairs to be checked could reach n × (n − 1)/2, which requires arduous human labeling work. Therefore, for each dataset, we randomly select 25 percent of all possible pairs for evaluation. For each course concept pair ⟨a, b⟩, three human annotators majoring in the corresponding domain were asked to label it as "a is b's prerequisite", "b is a's prerequisite" or "no prerequisite relationship", using their own knowledge and additional textbook resources. We take a majority vote of the annotators to create the final labels and assess the inter-annotator agreement using the average of the pairwise κ statistics (Landis and Koch, 1981) between all pairs of the three annotators.
The statistics of the three datasets are listed in Table 1, where #courses and #videos are the total numbers of courses and videos in each dataset and #concepts is the number of labeled course concepts. #pairs denotes the number of labeled concept pairs for evaluation, in which '+' denotes the number of positive instances, i.e., pairs that have prerequisite relations, and '−' denotes the number of negative instances.
2 https://github.com/coursera-dl/coursera-dl
3 These courses are: "Machine Learning (Stanford)", "Machine Learning (Washington)", "Practical Machine Learning (JHU)", "Machine Learning With Big Data (UCSD)" and "Neural Networks for Machine Learning (UofT)"

Evaluation Results
For each dataset, we apply 5-fold cross validation to evaluate the performance of the proposed method, i.e., testing our method on one fold while training the classifier on the other 4 folds. There are usually far fewer positive instances than negative instances, so we balance the training set by oversampling the positive instances (Yosef et al., 2011; Talukdar and Cohen, 2012). In our experiments, we employ 4 different binary classifiers: NaïveBayes (NB), Logistic Regression (LR), SVM with a linear kernel (SVM) and Random Forest (RF). We use precision (P ), recall (R), and F1-score (F 1 ) to evaluate the prerequisite classification results. The experimental results are presented in Table 2.
Contextual features are shaped by the parameter M , i.e., the number of related concepts being considered. In our experiments, we tried different settings of M and report the results when M =1 and M =10 in Table 2. As for the semantic representation, we use the latest publicly available Wikipedia dump 4 and apply the skip-gram model (Mikolov et al., 2013b) to train word embeddings using the Python library gensim 5 with default parameters.
As shown in Table 2, the evaluation results vary across classifiers. NaïveBayes performs the worst. This seems to be caused by the fact that the independence assumption is not satisfied by our features; for example, Feature 2 and Feature 3 both utilize local context information, only at different granularities, and are thus highly correlated. Random Forest beats the others, with the best F 1 across all three datasets. Its average F 1 outperforms SVM, NB and LR by 7.0%, 11.1% and 8.3%, respectively (M =10). The reason is as follows. Rather than being a simple descriptive feature, each of our proposed features determines whether a concept pair has a prerequisite relation from a specific aspect; its function is similar to that of an independent weak classifier. Therefore, instead of a linear combination of features for classification (e.g., SVM and LR), an ensemble model (e.g., Random Forest) is more suitable for this task. The performance is slightly better with M =10 for all classifiers: +0.20% for SVM, +0.53% for NB, +0.73% for LR and +3.63% for RF in average F 1 . These results verify the effectiveness of considering related concepts in the contextual features. We use RF and set M =10 in the following experiments.

Comparison with Baselines
We further compare our approach with three representative methods for prerequisite inference.

Baseline Approaches
Hyponym Pattern Method (HPM). Prerequisite relationships often exist between hyponym-hypernym concept pairs (e.g., "Machine Learning" and "Supervised Learning"). As a baseline, we adopt the 10 lexico-syntactic patterns used by Wang et al. (2016) to extract hyponym relationships between concepts. If a concept pair matches at least one of these patterns in the MOOC corpus, we judge it to have a prerequisite relation.
Reference Distance (RD). We also employ the RefD proposed by Liang et al. (2015) as one of our baselines. However, this method is only applicable to Wikipedia concepts. To make it comparable with our method, for each of our datasets we construct a subset by picking out the concept pairs ⟨a, b⟩ in which both a and b are Wikipedia concepts. For example, we find that 49% of the course concepts in ML have corresponding Wikipedia articles and 28% of the concept pairs in ML meet the above condition. We use the new datasets constructed from ML, DSA and CAL, namely W-ML, W-DSA, and W-CAL, to compare our method with RefD.
Supervised Relationship Identification (SRI). Wang et al. (2016) employed several features to identify prerequisite relationships between concepts. We consider two variants as baselines: T-SRI, which uses only the text-based features, and F-SRI, which additionally includes the Wikipedia-based features.

Performance Comparison
In Table 3 we summarize the comparison results of the different methods across the datasets ("MOOC" refers to our method). Our method outperforms the baseline methods on all six datasets 6 . For example, the F 1 of our method on ML outperforms T-SRI and HPM by 10.5% and 43.6%, respectively. Specifically, we have the following observations. First, HPM achieves relatively high precision but low recall. This is because when A "is a" B, a prerequisite relation often exists from B to A, but clearly not vice versa. Second, T-SRI is reasonably effective for learning prerequisite relations, with F 1 ranging from 62.1% to 65.2%. However, T-SRI only considers relatively simple features, such as the sequential order and co-occurrence of concepts. With more comprehensive feature engineering, the F 1 of our method significantly outperforms T-SRI (+10.5% on ML, +9.1% on DSA and +7.1% on CAL). Third, incorporating Wikipedia-based features (F-SRI) yields a further improvement (+0.93% in average F 1 compared with T-SRI).

Feature Contribution Analysis
To gain insight into the importance of each feature in our method, we perform a contribution analysis over the different features. Here, we run our approach 10 times on the ML dataset.
In each of the first 7 runs, one feature is removed; in each of the remaining 3 runs, one group of features is removed, e.g., removing the contextual features means removing GV rd, GSrd and W rd at the same time. We record the decrease in F1-score for each setting. Table 4 lists the evaluation results after ignoring different features.
According to the decrease in F1-scores, we find that all the proposed features are useful for predicting prerequisite relations. In particular, we observe that Cld (Feature 7), whose removal decreases our best F1-score by 7.4%, plays the most important role. This suggests that most concept pairs do differ in complexity level: for two concepts, the difference in their coverage and survival times across courses is important for prerequisite relation detection. In contrast, with a 1.9% decrease, semantic relatedness (Feature 1) is relatively less important. We can easily find two concepts that have related semantic meanings (e.g., "test set" and "training set") but no prerequisite relationship. However, semantic relatedness is critical for the contextual features, because it overcomes the sparsity of context in their calculation: F1-score decreases by a further 5.4% when we do not consider related concepts in the contextual features, i.e., set M =1. As for the feature group contributions, we observe that the structural features, with a decrease of 9.2%, have a greater impact than the other two groups. This is as expected, because this group includes Cld. Among the three structural features, Apd contributes relatively less. The reason is that a professor may frequently mention a prerequisite concept after introducing a subsequent concept, to help students better understand it.

Related Works
To the best of our knowledge, there has been no previous work on mining prerequisite relations among concepts in MOOCs. Earlier work proposed to induce prerequisite relations among courses to support curriculum planning. Liu et al. (2011) studied the learning dependency between knowledge units, special text fragments containing concepts, using a classification-based method. In the area of education, researchers have tried to find general prerequisite structures from students' test performance (Vuong et al., 2011; Scheines et al., 2014). Different from them, we focus on more fine-grained prerequisite relations, i.e., the prerequisite relations among course concepts. Among the few related works on mining prerequisite relations among concepts, Liang et al. (2015) and Talukdar and Cohen (2012) studied prerequisite relationships between Wikipedia articles. They assumed that hyperlinks between Wikipedia pages indicate a prerequisite relationship and designed several useful features. Based on these Wikipedia features plus some textbook features, Wang et al. (2016) proposed a method to construct a concept map from textbooks, which jointly learns the key concepts and their prerequisite relations. However, the restriction to Wikipedia concepts is also the bottleneck of these studies. In our work, we propose more general features to infer prerequisite relations among concepts, regardless of whether a concept is in Wikipedia or not. Liang et al. (2017) proposed an optimization-based framework to discover concept prerequisite relations from course dependencies. Gordon et al. (2016) utilized cross-entropy to learn concept dependencies in scientific corpora. Besides local statistical information, our method also utilizes external knowledge to enrich concept semantics, which makes it more informative.
Our work is also related to the study of automatic relation extraction. Different research lines have formed around this topic, including hypernym-hyponym relation extraction (Ritter et al., 2009; Wei et al., 2012), entity relation extraction (Zhou et al., 2006; Fan et al., 2014; Lin et al., 2015) and open relation extraction (Fader et al., 2011). However, previous works mainly focus on factual relations; the extraction of cognitive relations (e.g., prerequisite relations) has not been well studied yet.

Conclusions and Future Work
We conducted a new investigation into automatically inferring prerequisite relations among concepts in MOOCs. We precisely define the problem and propose several useful features from different aspects, i.e., contextual, structural and semantic features. Moreover, we apply an embedding-based method that jointly learns the semantic representations of Wikipedia concepts and MOOC concepts to help implement the features. Experimental results on online courses from different domains validate the effectiveness of the proposed method. Promising future directions include investigating how to utilize user interaction in MOOCs for better prerequisite learning, as well as how deep learning models can be used to automatically learn useful features for inferring prerequisite relations.