Towards Automatic Description of Knowledge Components

A key aspect of cognitive diagnostic models is the specification of the Q-matrix associating the items and some underlying student attributes. In many data-driven approaches, test items are mapped to the underlying, latent knowledge components (KCs) based on observed student performance, with little or no input from human experts. As a result, these latent skills typically model the data accurately, but may be hard to describe and interpret. In this paper, we focus on the problem of describing these knowledge components. Using a simple probabilistic model, we extract, from the text of the test items, the keywords that are most relevant to each KC. On a small dataset from the PSLC datashop, we show that this is surprisingly effective, retrieving unknown skill labels in close to 50% of cases. We also show that our method clearly outperforms typical baselines in specificity and diversity.


Introduction
Recent years have seen significant advances in automatically identifying latent attributes useful for cognitive diagnostic assessment. For example, the Q-matrix (Tatsuoka, 1983) associates test items with skills of the students taking the test. Data-driven methods were introduced to automatically identify latent knowledge components (KCs) and map them to test items based on observed student performance, cf. Barnes (2005) and Section 2 below.
A crucial issue with these automatic methods is that the latent skills optimize some well-defined objective function, but may be hard to describe and interpret. Even for manually designed Q-matrices, knowledge components may not be described in detail by the designer. In that situation, a data-generated description can provide useful information. In this short paper, we show how to extract keywords relevant to each KC from the textual content corresponding to each item. We build a simple probabilistic model, with which we score possible keywords. This proves surprisingly effective on a small dataset obtained from the PSLC datashop.
After a quick overview of the automatic extraction of latent attributes in Section 2, we describe our keyword extraction procedure in Section 3. The data is introduced in Section 4, and we present our experimental results and analysis in Section 5.

Extraction of Knowledge Component Models
The Rule Space model (Tatsuoka, 1983; Tatsuoka, 1995) was introduced to statistically classify students' item responses into a set of ideal response patterns associated with different cognitive skills. A major assumption of Rule Space is that students only need to master specific skills in order to successfully complete items. Using the Rule Space model for cognitive diagnostic assessment requires experts to build and reduce an incidence or Q-matrix encoding the combination of skills, a.k.a. attributes, needed for completing items (Birenbaum et al., 1992), and to generate ideal item responses based on the reduced Q-matrix (Gierl et al., 2000). The ideal response patterns can then be used to analyze student response patterns.
The extensive expert effort required by traditional Q-matrix design has motivated attempts to discover the Q-matrix from observed response patterns, in effect reverse engineering the design process. Barnes (2005) proposed a multi-start hill-climbing method to create the Q-matrix, but experimented only on a limited number of skills. Desmarais et al. (2011) refined expert Q-matrices using matrix factorization. Although this proved useful to automatically improve expert-designed Q-matrices, non-negative matrix factorization is sensitive to initialization and prone to local minima. Sun et al. (2014) generated binary Q-matrices using an alternate recursive method that automatically estimates the number of latent attributes, yielding high matrix coverage rates. Others (Liu et al., 2012; Chen et al., 2014) estimate the Q-matrix under the setting of well-known psychometric models that integrate guess and slip parameters to model the variation between ideal and observed response patterns. They formulate Q-matrix extraction as a latent variable selection problem solved by regularized maximum likelihood, but require knowing the number of latent attributes. Finally, Sparse Factor Analysis (Lan et al., 2014) was recently introduced to address data sparsity in a flexible probabilistic model. It requires setting the number of attributes and relies on user-generated tags to facilitate the interpretability of the estimated factors.
These approaches to the automatic extraction of a Q-matrix address the problem from various angles and an extensive comparison of their respective performance is still required. However, none of these techniques address the problem of providing a textual description of the discovered attributes. This makes them hard to interpret and understand, and may limit their practical usability.

Probabilistic Keyword Extraction
We focus on the textual content associated with each item in order to identify salient terms as keywords. The textual content associated with an item may be, for example, the body of the question, the optional hints, or the text contained in the answers (Figure 1).
For each item i, we denote by d_i its textual content (e.g. the body text in Figure 1). We also assume a binary mapping of items to K skills c_k, k = 1 ... K.
Skills are typically latent skills obtained automatically (unsupervised) from data. They may also be defined by a manually designed Q-matrix for which skill descriptions are unknown. In analogy with text categorization, the textual content is a document d_i and each skill is a class (or cluster) c_k. Our goal is to identify keywords from the documents that describe the classes.
For each KC c_k, we estimate a unigram language model based on all text d_i associated with that KC. This essentially amounts to building a Naive Bayes classifier (McCallum and Nigam, 1998), estimating relative word frequencies in each KC:

  P(w|c_k) = \frac{\sum_{i \in c_k} n_{wi}}{\sum_{i \in c_k} |d_i|},    (1)

where n_{wi} is the number of occurrences of word w in document d_i, and |d_i| is the length (in words) of document d_i. In some models, such as Naive Bayes or more advanced multinomial mixture models (Gaussier et al., 2002), it is essential to smooth the probability estimates (1) appropriately; for the purpose of this paper, however, smoothing has little impact. The conditional probability estimates (1) may be seen as the profile of c_k.

Important words for describing a KC c ∈ {c_1, ..., c_K} have significantly higher probability in c than in the other KCs. One metric to evaluate how two distributions differ is the (symmetrized) Kullback-Leibler divergence:

  KL(c, \bar{c}) = \sum_w \left( P(w|c) - P(w|\bar{c}) \right) \log \frac{P(w|c)}{P(w|\bar{c})},    (2)

where \bar{c} denotes all KCs except c, and P(w|\bar{c}) is estimated similarly to Eq. (1). Note that Eq. (2) is an additive sum of positive, word-specific contributions k(w) = (P(w|c) - P(w|\bar{c})) \log(P(w|c)/P(w|\bar{c})). Large contributions come from significant differences, either way, between the profile of a KC, P(w|c), and the average profile of all other KCs, P(w|\bar{c}). As we want to focus on keywords that have significantly higher probability for that KC, and disregard words that have higher probability outside, we use a signed score:

  s(w) = \left| P(w|c) - P(w|\bar{c}) \right| \log \frac{P(w|c)}{P(w|\bar{c})},    (3)

where the log ensures that the score is positive if and only if P(w|c) > P(w|\bar{c}).

Figure 2 illustrates this graphically. Some words (blue horizontal shading) have high probability in c (top) but also outside (middle), hence s(w) close to zero (bottom): they are not specific enough. The most important keywords (green upward shading, right) are more frequent in c than outside, hence receive a large score. Some words (red downward shading, left) are less frequent in c than outside: they do contribute to the KL divergence, but are atypical in c. They receive a negative score.
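The scoring procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: function names are ours, documents are assumed to arrive as lists of preprocessed tokens, and a tiny epsilon stands in for the (largely inconsequential) smoothing mentioned above.

```python
import math
from collections import Counter

def kc_profile(docs, vocab, eps=1e-9):
    """Relative word frequencies over all documents of one KC (Eq. 1)."""
    counts, total = Counter(), 0
    for doc in docs:
        counts.update(doc)
        total += len(doc)
    denom = total + eps * len(vocab)
    return {w: (counts[w] + eps) / denom for w in vocab}

def signed_scores(p_c, p_rest):
    """Signed KL contribution per word (Eq. 3):
    s(w) = |P(w|c) - P(w|~c)| * log(P(w|c) / P(w|~c))."""
    return {w: abs(p_c[w] - p_rest[w]) * math.log(p_c[w] / p_rest[w])
            for w in p_c}

def top_keywords(kc_docs, kc, n=10):
    """Top-n keywords for KC `kc`, given {kc_name: [token lists]}."""
    vocab = {w for docs in kc_docs.values() for d in docs for w in d}
    p_c = kc_profile(kc_docs[kc], vocab)
    rest = [d for k, docs in kc_docs.items() if k != kc for d in docs]
    p_rest = kc_profile(rest, vocab)
    s = signed_scores(p_c, p_rest)
    return sorted(s, key=s.get, reverse=True)[:n]
```

Words frequent in the target KC but rare elsewhere get large positive scores, while words frequent only outside the KC score negative and never surface in the top-n list.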

Data
In order to test and illustrate our method, we focus on a dataset from the PSLC datashop (Koedinger et al., 2010). We used the OLI C@CM v2.5 - Fall 2013, Mini 1 dataset. This OLI dataset tests proficiency with the CMU computing infrastructure. It is especially well suited for our study because the full text of the items (cf. Fig. 1) is available in HTML format and can easily be extracted. Other datasets only include screenshots of the items, making text extraction more challenging.
There are 912 unique steps in that dataset, and less than 84K tokens of text (Table 1), small by NLP standards. We picked two KC models included in PSLC for that dataset. The noSA model has 108 distinct KCs with minimally descriptive labels (e.g. "vpn"), assigning between 1 and 52 items to each KC. The C75 model is fully unsupervised and has the best BIC reported in PSLC. It contains 44 unique KCs, simply labelled Cxx with xx between -1 and 91, and assigns 5 to 78 items per KC. In both models, there are 823 items with at least one KC assigned.
We use a standard text preprocessing chain. All text (body, hint and responses) in the dataset is tokenized and lowercased, and we remove all tokens appearing in an in-house stoplist, as well as tokens not containing at least one alphabetical character.
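This preprocessing chain is straightforward to reproduce; a minimal sketch follows, where the stoplist shown is an illustrative stand-in for our in-house list and the tokenization regex is an assumption:

```python
import re

# Illustrative stand-in for the in-house stoplist.
STOPLIST = {"the", "a", "an", "of", "to", "is", "and"}

def preprocess(text, stoplist=STOPLIST):
    """Tokenize and lowercase, then drop stoplisted tokens and
    tokens without at least one alphabetical character."""
    tokens = re.findall(r"[\w$-]+", text.lower())
    return [t for t in tokens
            if t not in stoplist and any(ch.isalpha() for ch in t)]
```

For example, `preprocess("Connect to the VPN using 2 steps.")` keeps only the content-bearing tokens, dropping the stoplisted words and the purely numeric token.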

Experimental Results
From the preprocessed data, we estimate all KC profiles using Eq. (1), on different data sources:
1. Only the body of the question ("body"),
2. Body plus hints ("b+h"),
3. Body, hints and responses ("all").
For each KC, we extract the top 10 keywords according to s_c(w) (Eq. 3).

We first illustrate this on the noSA KC model, for which we can use the minimally descriptive KC labels as a partial reference. Table 2 shows the top keywords extracted from the body text for a sample of knowledge components. Even for knowledge components with very few items, the extracted keywords are clearly related to the topic suggested by the label.

Table 2: Top keywords extracted from the body text for a sample of noSA knowledge components.

KC label     #items  Top 10 keywords
identify-sr  52      phishing email scam social learned indicate legitimate engineering anti-phishing
p2p          27      risks mitigate applications p2p protected law file-sharing copyright illegal print
quota03      12      quota printing andrew print semester consumed printouts longer unused cost
vpn          11      vpn connect restricted libraries circumstances accessing need using university
dmca         9       copyright dmca party notice student digital played regard brad policies
dmca         2       penalties illegal possible file-sharing fines 80,000 $ imprisonment high years
bandwidth    1       penalties maximum limitations exceed times long bandwidth suspended network access
Although the label itself is not available when estimating the model, words from the label often appear among the keywords (sometimes with slight morphological differences). Our first metric evaluates the quality of the extraction by the number of times words from the (unknown) label appear in the keywords. For the model in Table 2, this occurs for 44 of the 108 KCs in the model (41%). These KCs are associated with 280 items (34%), suggesting that labels are more commonly found within keywords for small KCs. This may also be due to vague labels for large KCs (e.g. identify-sr in Table 2), although the overall keyword description is quite clear (phishing, email, scam).
We now focus on two ways to evaluate keyword quality: diversity (the number of distinct keywords) and specificity (how many KCs a keyword describes). Desirable keywords are specific to one or a few KCs; a side effect is that there should be many different keywords. We therefore compute 1) how many distinct keywords there are overall, 2) how many keywords appear in a single KC, and 3) the maximum number of KCs sharing the same keyword. As a baseline, we compare against the simple strategy of picking as keywords the tokens with maximum probability in the KC profile (1). This baseline is common practice when describing probabilistic topic models (Blei et al., 2003). Table 3 compares the KL score ("KL-*" rows) and the maximum probability baseline ("MP-*" rows) for the two KC models. The total number of keywords is fairly stable, as we extract up to 10 keywords per KC in all cases (some KCs have a single item and not enough text). The KL rows clearly show that our KL-based method generates many more distinct keywords than MP, implying that MP extracts the same keywords for many more KCs.
• With KL, we have up to 727 distinct keywords (out of 995) for noSA and 372 out of 440 for C75, i.e. an average of 1.18 to 1.37 (median 1) KCs per keyword. With MP, the keywords describe on average 3.1 KCs for noSA and 2.97 for C75.
• With KL, as many as 577 keywords (i.e. more than half) appear in a single noSA KC. By contrast, as few as 221 MP keywords have a unique KC. For C75, the numbers are 316 (72%) vs. 88 to 131.
• With KL, no keyword is used to describe more than 9 to 19 noSA KCs and 6 to 12 C75 KCs. With MP, some keywords appear in as many as 87 noSA KCs and all 44 C75 KCs, showing that MP keywords are much less specific at describing the content of a KC.
These results all point to the fact that the KL-based method provides better diversity as well as specificity in naming the different KCs.
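The three statistics above follow directly from the per-KC keyword lists. A minimal sketch (the function name is ours):

```python
from collections import Counter

def keyword_stats(kc_keywords):
    """Diversity/specificity statistics from {kc_name: [keywords]}.

    Returns a tuple: (number of distinct keywords overall,
    number of keywords appearing in a single KC,
    maximum number of KCs sharing one keyword)."""
    kc_count = Counter()
    for kws in kc_keywords.values():
        kc_count.update(set(kws))  # count each KC at most once per keyword
    singles = sum(1 for n in kc_count.values() if n == 1)
    return len(kc_count), singles, max(kc_count.values())
```

A keyword list where one token (say, "correct") appears in every KC would show up here as a large third statistic, which is exactly the failure mode of the MP baseline.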
Source of textual content: Somewhat surprisingly, using less textual content, i.e. the body only, consistently produces better diversity (more distinct keywords) and better specificity (fewer KCs per keyword). The hint text yields little change, and the response text seriously degrades both diversity and specificity, despite nearly doubling the amount of textual data available. This is because responses are very similar across items: they add textual information but tend to smooth out the profiles. This is shown in the comparison between "KL-body" and "MP-all" in Table 4. The latter extracts "correct" and "incorrect" as keywords for most KCs in both models, and for all 44 C75 KCs, because these words frequently appear in the response feedback (Fig. 1). KL-based naming discards these words because they are almost equally frequent in all KCs and are not specific enough. Results on noSA are similar and not included for brevity.

Discussion
We described a simple probabilistic method for knowledge component naming using keywords. This simple method is effective at generating descriptive keywords that are both diverse and specific. We show that our method clearly outperforms the simple baseline that focuses on most probable words, with no impact on computational cost.
Although we only extract keywords from the textual data, one straightforward improvement would be to identify and extract either multiword terms, which may be more explanatory, or relevant snippets from the data. A related perspective would be to combine our relevance scores with, for example, the output of a parser, in order to extract more complex linguistic structures such as subject-verb-object triples (Atapattu et al., 2014).
Our data-generated descriptions could also be useful in the generation or the refinement of Q-matrices. In addition to describing knowledge components, naming KCs could offer significant information on the consistency of the KC mapping. This may offer a new and complementary approach to the existing refinement methods based on the optimization of functional models (Desmarais et al., 2014). It could also complement or replace human input in student model discovery and improvement (Stamper and Koedinger, 2011).