A Novel Cascade Model for Learning Latent Similarity from Heterogeneous Sequential Data of MOOC

Recent years have witnessed the proliferation of Massive Open Online Courses (MOOCs). As MOOCs attract massive numbers of learners, their forum contents need to be classified in order to help both learners and instructors. We therefore investigate an important application: associating forum threads with the subtitles of video clips. This task can be regarded as a document ranking problem, and the key is how to learn a distinguishable text representation from word sequences and learners’ behavior sequences. In this paper, we propose a novel cascade model that captures both the latent semantics and the latent similarity by modeling MOOC data. Experimental results on two real-world datasets demonstrate that our textual representation outperforms state-of-the-art unsupervised counterparts for this application.


Introduction
With the rapid development of Massive Open Online Courses (MOOCs), more and more learners participate in MOOCs (Anderson et al., 2014). Due to the lack of effective management, most of the discussion forums within MOOCs are overloaded and in chaos (Huang et al., 2014). Therefore, a key problem is how to manage the forum contents.
To manage the forum contents, forum threads can be regarded as documents and classified into groups. There are several straightforward methods, such as defining sub-forums according to weeks and asking learners to tag threads. However, their effectiveness is limited (Rossi and Gnawali, 2014), because learners have few incentives to tag threads. Recently, machine learning solutions have been proposed, e.g., content-related thread identification (Wise et al., 2016), confusion classification (Agrawal et al., 2015) and sentiment classification (Ramesh et al., 2015). However, they are developed for specific research problems and cannot be applied to our problem. Moreover, they require labeled data, which means that domain experts must label data for each course.
We observe that the video clips of a MOOC usually have many well-formed subtitles composed by instructors. Moreover, within MOOC settings, the course contents can be broken down into knowledge points, and each video clip corresponds to a single knowledge point. Consequently, we propose to associate threads with the subtitles of video clips, i.e., thread-subtitle matching. In this way, the videos relevant to a thread can be recommended to learners, and the chaotic threads in discussion forums can also be well grouped.
However, it is challenging to identify the relevant video clips for threads without labeled data. To address this issue, we regard the task as a document ranking problem based on the similarity between documents. The key problem is to learn a textual representation that clusters similar documents and at the same time distinguishes irrelevant ones.
Intuitively, the bag-of-words (BOW) model can be used to calculate the similarity between threads and subtitles (Salton and Buckley, 1988). However, BOW cannot effectively capture the semantics of words and documents. Recently studied semantic word embeddings, e.g., Word2Vec (Mikolov et al., 2013), can capture the semantics. Para2Vec (Le and Mikolov, 2014) can capture the similarity to some degree, but does not explicitly model the latent similarity of documents.
Since the latent similarity is crucial to determine whether a document can be associated to the right target, in our task, the document representation is expected to preserve both the latent semantics and similarity.
In this paper, we leverage two kinds of sequential information: 1) the word sequences of subtitles and forum contents, and 2) the clickstream log of learning behaviors. Specifically, unlike conventional representation learning methods, e.g., Word2Vec and Para2Vec, we consider the clickstream data, which reflects the relationship between a thread and a video's subtitle. For instance, if a user watches a video and then clicks a thread in the forums, the video is likely to be relevant to the thread. In order to learn representations from the two types of data, we propose a novel cascade model.
Our basic idea is to jointly model three components: 1) word-word coherence, 2) document-document coherence, and 3) word-document coherence. The three components are cascaded for learning the low-dimensional word embeddings. Then the learned embeddings are used to calculate similarities between threads and subtitles.
To summarize, our contributions include:
• We study an application-oriented research problem, which is how to capture the latent similarity when learning text representation.
• We propose a novel cascade model to learn the document representation from heterogeneous sequential data: 1) word sequence and 2) learners' clickstream.
• We collect two real-world MOOC datasets and conduct thorough experiments. The results demonstrate that our proposed model outperforms the state-of-the-art unsupervised counterparts on the application.

Related Work
MOOC data has attracted extensive research attention, and many interesting research problems have been studied, for example dropout prediction (Qiu et al., 2016), sentiment analysis of learning gains (Ramesh et al., 2015), instructor intervention (Chaturvedi et al., 2014) and answer recommendation (Jenders et al., 2016). In particular, Agrawal et al. (2015) consider a task similar to ours, which is to recommend video clips to threads. However, their solution is designed for that specific task and needs labeled data. Our solution is an unsupervised learning method, and the learned embeddings have other applications, e.g., thread retrieval.
How to represent text is a fundamental research problem in the field of information retrieval. Existing approaches can be generally classified into unsupervised methods and supervised methods (Tang et al., 2015). Although supervised embeddings, such as those learned with deep neural networks (Mikolov et al., 2010; Kim, 2014), can obtain good performance in specific tasks, they need human effort to obtain labels. Unsupervised word embeddings usually leverage various levels of textual information. For example, Word2Vec learns word embeddings based on word coherence, and Para2Vec utilizes word and document coherence to learn their embeddings. In particular, the Hierarchical Document Vector (HDV) (Djuric et al., 2015) leverages both streaming documents and their contents to achieve better representation, which is similar to our proposed model. However, HDV regards the documents as the context of words and therefore cannot learn the latent similarity, since it fails to explicitly reflect the relationship between document and word. In order to model the heterogeneous MOOC data, we develop a cascade representation model.
To our knowledge, Jiang et al. (2017) also propose an unsupervised learning model (called NOSE) for the task of thread-subtitle matching within MOOC settings. However, NOSE needs to build a heterogeneous textual network beforehand and may suffer from the heterogeneity issue, which our model can avoid.
Recently, representation learning has been applied to many tasks, such as network embedding (Grover and Leskovec, 2016) and location embedding. In this paper, we focus on learning representation of words and documents in MOOCs.
jumps from videos to threads may look for further relevant information from forums when s/he is watching a video, or s/he wants to review the relevant videos when s/he reads a thread.
However, learning from the clickstream log merely guarantees that similar documents are close to each other in the embedding space; it does not scatter different documents. To address this issue, we strengthen the relationship between words and their affiliated documents. Thus, words within the same document are gathered, and words from different documents are scattered, in the embedding space. Consequently, the latent similarity can be embodied by the word embeddings.
Based on the aforementioned idea, we can model the data by three components: 1) latent semantics at word level, 2) latent similarity at document level, and 3) latent similarity between words and documents. To integrate all three kinds of information into a uniform learning framework, we propose a novel cascade model, as shown in Fig. 1. $L_1$, $L_2$ and $L_3$ correspond to the negative log-likelihoods of the three components, respectively. Formally, we aim at minimizing the overall objective:
$$L = L_1 + L_2 + L_3 \tag{1}$$
Note that $L_3$ not only learns the latent similarity, but also builds a connection between words and documents. In this way, our learned word embeddings can be applied to our task without training classifiers on labeled data.

Word-level Latent Semantics
The component $L_1$, which corresponds to the red (bottom) part of Fig. 1, leverages the Word2Vec model to learn the semantics of words.
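In this paper, we adopt the continuous bag-of-words (CBOW) architecture. The objective is to minimize the negative log-likelihood:
$$L_1 = -\sum_{t=1}^{T} \log P(w_t \mid w_{t-c_w} : w_{t+c_w}) \tag{2}$$
where $T$ is the length of the word sequence, $c_w$ is the context window length used in the word sequence, and $w_{t-c_w} : w_{t+c_w}$ is the subsequence $(w_{t-c_w}, \ldots, w_{t+c_w})$ excluding $w_t$ itself. The probability $P(w_t \mid w_{t-c_w} : w_{t+c_w})$ is defined by the softmax function
$$P(w_t \mid w_{t-c_w} : w_{t+c_w}) = \frac{\exp(\bar{v}^{\top} v_{w_t})}{\sum_{w} \exp(\bar{v}^{\top} v_{w})}$$
where the sum in the denominator runs over all words $w$ in the vocabulary, $v_{w_t}$ is the vector representation of word $w_t$, and $\bar{v}$ is the averaged vector representation of the subsequence. Two methods can be employed to calculate $L_1$ efficiently: hierarchical softmax and negative sampling (Mikolov et al., 2013).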
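As an illustration of the CBOW probability described above, the following minimal sketch computes a full softmax over a toy vocabulary. It is not the training code used in our experiments: the function name `cbow_probability` and the toy vectors are hypothetical, and for simplicity a single embedding table is assumed for both context and target words, whereas practical Word2Vec implementations keep separate input and output vectors and replace the full softmax by hierarchical softmax or negative sampling.

```python
import math

def cbow_probability(target, context, emb):
    """P(target | context) under a full softmax over the vocabulary in emb."""
    dim = len(next(iter(emb.values())))
    # Average the context word vectors to obtain the context representation.
    v_bar = [sum(emb[w][k] for w in context) / len(context) for k in range(dim)]
    # Score every vocabulary word by its dot product with the averaged vector.
    scores = {w: sum(a * b for a, b in zip(v_bar, vec)) for w, vec in emb.items()}
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    exp_scores = {w: math.exp(s - top) for w, s in scores.items()}
    return exp_scores[target] / sum(exp_scores.values())

# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
toy_emb = {
    "graph": [1.0, 0.0],
    "node": [0.9, 0.1],
    "edge": [0.8, 0.2],
    "essay": [0.0, 1.0],
}
p_node = cbow_probability("node", ["graph", "edge"], toy_emb)
p_essay = cbow_probability("essay", ["graph", "edge"], toy_emb)
```

With these toy vectors, a word whose vector lies close to the averaged context ("node") receives a higher probability than an unrelated one ("essay").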

Document-level Latent Similarity
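Similar to $L_1$, we adopt the CBOW architecture for calculating $L_2$, as shown by the green (top) part of Fig. 1. The objective is to minimize the negative log-likelihood:
$$L_2 = -\sum_{m=1}^{M} \log P(d_m \mid d_{m-c_d} : d_{m+c_d}) \tag{3}$$
where $M$ is the number of documents, $c_d$ is the context window length used in clickstreams, and $d_{m-c_d} : d_{m+c_d}$ is the subsequence of documents around $d_m$, excluding $d_m$ itself. The probability $P(d_m \mid d_{m-c_d} : d_{m+c_d})$ is also defined by the softmax function. Hierarchical softmax and negative sampling can again be employed to approximate the log-likelihood function.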
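To make the notion of a document context window concrete, the sketch below extracts (center, context) pairs from one learner's clickstream, treating each visited subtitle or thread as a token. The helper `context_windows` and the document identifiers are hypothetical and only illustrate the windowing; they do not reproduce our preprocessing.

```python
def context_windows(sequence, window):
    """Yield (center, context) pairs; the context excludes the center itself."""
    for i, center in enumerate(sequence):
        left = sequence[max(0, i - window):i]
        right = sequence[i + 1:i + 1 + window]
        yield center, left + right

# Hypothetical clickstream of one learner: video subtitles (v*) and threads (t*).
clicks = ["v1", "t7", "v1", "v2", "t9"]
pairs = list(context_windows(clicks, window=1))
```

Each pair supplies one term of the document-level objective: the center document is predicted from the documents visited just before and after it.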

Document-Word Latent Similarity
To learn the latent similarity, we make use of the relationship between words and documents, so that similar documents are clustered while different documents are scattered. Therefore, we propose the third component, $L_3$, shown in the middle part of Fig. 1. Different from $L_1$ and $L_2$, we employ negative sampling of documents to calculate its approximation, because there are numerous threads in MOOC forums. Given a pair $(w_t, d_m)$, representing that word $w_t$ appears in document $d_m$, $L_3$ is defined as:
$$L_3 = -\log \sigma(v_{w_t}^{\top} v_{d_m}) - \sum_{c=1}^{C} \log \sigma(-v_{w_t}^{\top} v_{d_c}) \tag{4}$$
where $\sigma(x)$ is the sigmoid function, $C$ is the number of sampled negative documents, and $d_c$ is the $c$-th sampled negative document.
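The document-word component can be illustrated with a small sketch of a negative-sampling loss for a single (word, document) pair. We assume the standard form, namely the negative log-sigmoid of the word-document score plus the negative log-sigmoid of the negated score for each sampled negative document; the function names and the toy vectors are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l3_loss(v_word, v_pos_doc, v_neg_docs):
    """Negative-sampling loss for one word and the document that contains it."""
    # Reward a high score between the word and its own document.
    loss = -math.log(sigmoid(dot(v_word, v_pos_doc)))
    # Penalise high scores between the word and each sampled negative document.
    for v_neg in v_neg_docs:
        loss -= math.log(sigmoid(-dot(v_word, v_neg)))
    return loss

# Toy vectors (hypothetical): the word either agrees or disagrees with its document.
aligned = l3_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
misaligned = l3_loss([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
```

The loss is small when the word vector points towards its own document and away from the negatives, and large in the opposite case, which is the behaviour needed to gather words of the same document and scatter the others.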

Model Training
We adopt stochastic gradient descent (SGD) to minimize $L$. For the components $L_1$ and $L_2$, we apply the training methods proposed in (Mikolov et al., 2013) to the two kinds of sequences, i.e., words and documents, respectively. For training $L_3$, given the pair $(w_t, d_m)$, we calculate the gradients:
$$\frac{\partial L_3}{\partial v_{w_t}} = \sum_{d_j} \big(\sigma(v_{w_t}^{\top} v_{d_j}) - \mathbb{1}(d_j)\big)\, v_{d_j}, \qquad \frac{\partial L_3}{\partial v_{d_j}} = \big(\sigma(v_{w_t}^{\top} v_{d_j}) - \mathbb{1}(d_j)\big)\, v_{w_t}$$
where $d_j$ ranges over both the positive and negative samples, $d_j \in \{d_m\} \cup \{d_c \sim P_n(w) \mid c = 1, \ldots, C\}$. $P_n(w)$ is the noise distribution, which we set to the unigram distribution raised to the 3/4 power, the same as in Word2Vec. $\mathbb{1}(x)$ is an indicator function defined as:
$$\mathbb{1}(x) = \begin{cases} 1, & \text{if } x \text{ is the positive sample } d_m \\ 0, & \text{otherwise} \end{cases}$$
The time complexity of updating $L$ is $O(T \log T + M \log M + TC)$ when using the hierarchical softmax method for $L_1$ and $L_2$, or $O((2T + M)C)$ when using the negative sampling method. Based on this complexity analysis, our cascade model is efficient enough to be applied to MOOC datasets.
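The training step for the document-word component can be sketched as follows. This is a simplified illustration under stated assumptions, not our implementation: the gradient is taken to be the usual negative-sampling form (sigmoid of the score minus a 0/1 label), the learning rate `lr`, the document counts and all vectors are hypothetical, and negatives are drawn with `random.choices` from the smoothed distribution after excluding the positive document.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def noise_distribution(counts, power=0.75):
    """Counts raised to the 3/4 power and normalised to sum to 1."""
    weights = {d: c ** power for d, c in counts.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

def sgd_step(v_word, doc_vecs, pos_doc, neg_docs, lr=0.05):
    """One in-place SGD update for the pair (word, pos_doc) with sampled negatives."""
    grad_word = [0.0] * len(v_word)
    for d in [pos_doc] + list(neg_docs):
        v_doc = doc_vecs[d]
        label = 1.0 if d == pos_doc else 0.0
        err = sigmoid(sum(a * b for a, b in zip(v_word, v_doc))) - label
        for k in range(len(v_word)):
            grad_word[k] += err * v_doc[k]    # accumulate with the old document vector
            v_doc[k] -= lr * err * v_word[k]  # update the document vector
    for k in range(len(v_word)):
        v_word[k] -= lr * grad_word[k]        # update the word vector last

random.seed(0)
counts = {"d1": 8, "d2": 1, "d3": 1}  # hypothetical document frequencies
p_n = noise_distribution(counts)
doc_vecs = {"d1": [0.1, 0.2], "d2": [0.3, -0.1], "d3": [-0.2, 0.4]}
v_w = [0.5, -0.5]
candidates = [d for d in p_n if d != "d1"]  # never sample the positive document
negs = random.choices(candidates, weights=[p_n[d] for d in candidates], k=2)
sgd_step(v_w, doc_vecs, "d1", negs)
```

Raising the counts to the 3/4 power flattens the distribution, so that rarely seen documents are sampled as negatives more often than their raw frequency would suggest.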

Experiment
Data Sets. We collect the sequential data of two MOOCs, from Coursera and China University MOOC respectively. The former is an interdisciplinary course called People and Network, and the latter is called Introduction to MOOC. From both courses, we collect the subtitles of video clips, the forum contents and the learners' clickstream logs. Table 1 shows the statistical information of the two MOOCs.
For evaluation, we invite the teaching assistants (TAs) of the respective courses to label test samples in advance. Note that our model is unsupervised. Therefore, the labeled data (thread-subtitle matching pairs) are only used for evaluation, and we do not use a development set.
Experimental Setting. We compare our embeddings with unsupervised rivals, and the labels are only used for evaluation. To ensure a fair comparison, we represent documents by their averaged word embeddings. Note that in the training phase we represent each thread/subtitle with a vector, so that the words within a document are clustered close to each other. We evaluate the following methods.
• Word2Vec: word embeddings that leverage word-level coherence; we adopt the CBOW architecture.
• Para2Vec: paragraph embeddings that consider document-level context information. We also adopt the CBOW framework.
• Hierarchical Document Vector (HDV): the latest word embeddings with a hierarchical architecture for modeling streaming documents and their contents.
• Cascade Document Representation (CDR): our proposed model which captures both the latent semantics and latent similarity.
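For all of the methods above, a document is represented at test time by its averaged word embeddings, and subtitles are ranked by their similarity to a thread. The sketch below illustrates this step. The similarity measure is an assumption: cosine similarity is used here as one common choice. The function names, tokens and vectors are likewise hypothetical.

```python
import math

def average_embedding(tokens, emb):
    """Mean of the vectors of the tokens found in emb; unknown tokens are skipped."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        raise ValueError("no known tokens in document")
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def rank_subtitles(thread_tokens, subtitles, emb):
    """Return subtitle ids sorted from most to least similar to the thread."""
    t_vec = average_embedding(thread_tokens, emb)
    scores = {sid: cosine(t_vec, average_embedding(toks, emb))
              for sid, toks in subtitles.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): two subtitles and one thread with an unknown token.
emb = {"graph": [1.0, 0.0], "node": [0.9, 0.1],
       "essay": [0.0, 1.0], "grade": [0.1, 0.9]}
subtitles = {"video_networks": ["graph", "node"],
             "video_grading": ["essay", "grade"]}
ranking = rank_subtitles(["node", "graph", "question"], subtitles, emb)
```

Because every method is reduced to the same averaged representation, differences in ranking quality can be attributed to the embeddings themselves rather than to the way documents are composed.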
We use the hyper-parameters recommended in the previous literature and apply the same parameter setting to all evaluated methods, so that the comparison is fair. The window size is 5 by default, the number of negative samples is empirically set to 5, and the size of the hidden layer is set to 100 for all methods. We use Precision@K (denoted by P@K) as the metric: if the retrieved top-K subtitles hit at least one ground-truth label, we regard the result as true; otherwise, it is false. In our experiments, we run each case 10 times and report the average result.
Result. First, we use all the data to learn word embeddings with each model. Then the learned word vectors are used to calculate the similarity between threads and subtitles and to rank the subtitles. Table 2 reports the results of thread-subtitle matching. There are some anomalies in the P@3 and P@5 results, which may be caused by the dataset: in the first MOOC (People and Network), video subtitles contain relatively few words, and it is therefore hard to obtain effective representations. Overall, the proposed model achieves better performance than the baselines, and we highlight the Precision@1 results. Compared to HDV, which also considers the streaming documents, our model is better at every task. This indicates that our model can effectively capture the latent similarity.
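The P@K metric described in the experimental setting can be sketched as follows: a thread counts as a hit if at least one of its ground-truth subtitles appears among its top-K ranked subtitles. The function name and the toy rankings are hypothetical and serve only to illustrate the computation.

```python
def precision_at_k(rankings, ground_truth, k):
    """Fraction of threads whose top-k subtitles contain at least one labeled match."""
    hits = sum(
        1 for tid, ranked in rankings.items()
        if set(ranked[:k]) & set(ground_truth[tid])
    )
    return hits / len(rankings)

# Toy rankings and labels (hypothetical).
rankings = {"t1": ["s2", "s1", "s3"], "t2": ["s3", "s1", "s2"]}
ground_truth = {"t1": ["s1"], "t2": ["s2"]}
p_at_1 = precision_at_k(rankings, ground_truth, 1)
p_at_2 = precision_at_k(rankings, ground_truth, 2)
p_at_3 = precision_at_k(rankings, ground_truth, 3)
```

Under this definition P@K can only stay the same or increase as K grows, since a larger K never removes a hit.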
We also investigate the effect of the number of dimensions, i.e., the size of the neural network's hidden layer. From Fig. 2, we find that CDR achieves better performance than the baselines for various numbers of dimensions. In addition, the best results are obtained when the dimension is set to 100 or 200 on both datasets.

Conclusion
In this paper, we propose an approach to an important problem: how to learn distinguishable representations from the word sequences in documents and the clickstreams of learners. To model the heterogeneous data, we develop a cascade model that jointly learns the latent semantics and the latent similarity without labeled data. We conduct experiments on two real datasets, and the results demonstrate the effectiveness of our model. Moreover, our model is not limited to MOOC data. For instance, the proposed algorithm can be applied to streaming documents, e.g., webpage click streams, since our method can model document-document sequences. We leave this as future work.