LDTM: A Latent Document Type Model for Cumulative Citation Recommendation

This paper studies Cumulative Citation Recommendation (CCR): given an entity in a Knowledge Base, how can its potential citations be detected effectively from high-volume text streams? Most previous approaches treat all kinds of features indiscriminately to build a global relevance model, in which the prior knowledge embedded in documents cannot be exploited adequately. To address this problem, we propose a latent document type discriminative model that introduces a latent layer to capture the correlations between documents and their underlying types. The model adapts better to different types of documents and yields flexible performance across a broad range of document types. An extensive set of experiments has been conducted on the TREC-KBA-2013 dataset, and the results demonstrate that this model yields a significant gain in recommendation quality compared with the state of the art.


Introduction
Knowledge Bases (KBs), like Wikipedia, are playing increasingly important roles in numerous entity-based information retrieval tasks. Nevertheless, most KBs are hard to keep up-to-date because they are maintained manually by human editors. As reported in (Frank et al., 2012), there is a median time lag of 356 days between the day a news article is published and the time it is cited in the Wikipedia article dedicated to the entity the news concerns. The time lag would be reduced if relevant documents could be automatically detected as soon as they are published online and then recommended to the editors. This task is studied as Cumulative Citation Recommendation (CCR). Formally, given a set of KB entities, CCR is to filter relevant documents from a stream corpus and evaluate their citation-worthiness to the target entities.
A variety of supervised approaches (e.g., classification, learning to rank) have been employed and have achieved promising results (Wang et al., 2013). Nevertheless, most of them leverage all features indiscriminately to build a global relevance model, which leads to unsatisfactory performance. Documents can offer some prior knowledge, which is named type in this paper. The type is the prior knowledge embedded in a document that affects the probability of its being recommended to KBs. For instance, when dealing with a document on a "music" topic, we would like to put less weight on a politician entity, because the document is unlikely to be related to it and is more often related to musicians or musical bands. Besides, the source of a document affects the recommendation strategy too. A document from a news agency is more reliable and citable than one from a social website, even if they tell an identical story about the target KB entity. Hence we consider two kinds of document features to model the prior type knowledge: (1) topic-based features, and (2) source-based features.
* This work was partially performed when the first author was visiting Purdue University and Microsoft Research Asia.
† Corresponding Author
This paper proposes a latent document type discriminative mixture model for CCR. We introduce an intermediate latent layer to model latent document types and define a joint distribution over the document-entity pairs and latent document-types on the observation data. The aim is to achieve a discriminative mixture model that is expected to outperform the global relevance model.
To the best of our knowledge, this is the first research work that leverages prior knowledge embedded in documents to improve CCR performance. An extensive set of experiments conducted on the TREC-KBA-2013 dataset has demonstrated the effectiveness of the proposed mixture model.

Discriminative Models for CCR
Given a set of KB entities $E = \{e_u \mid u = 1, \dots, M\}$ and a document collection $D = \{d_v \mid v = 1, \dots, N\}$, our objective is to estimate the conditional probability of relevance $P(r \mid e, d)$ with respect to an entity-document pair $(e, d)$. Each $(e, d)$ is represented as a feature vector $f(e, d) = (f_1(e, d), \dots, f_K(e, d))$, where $K$ is the dimension of the entity-document feature vector. Moreover, to model the hidden document type, each document is represented as a document-type feature vector $g(d) = (g_1(d), \dots, g_L(d))$, where $L$ is the dimension of the document-type feature vector.

Global Model
This paper utilizes logistic regression to estimate the conditional probability $P(r \mid e, d)$, where $r \in \{1, -1\}$ is a binary label indicating the relevance of an entity-document pair $(e, d)$. The value of $r$ is 1 if the document $d$ is related to the entity $e$; otherwise $r = -1$. Formally, the parametric form is
$$P(r = 1 \mid e, d) = \sigma\Big(\sum_{i=1}^{K} \omega_i f_i(e, d)\Big),$$
where $\sigma(x) = 1/(1 + \exp(-x))$ is the standard logistic function and $\omega_i$ is the combination parameter for the $i$th feature. It is easy to derive that for different values of $r$, the only difference in $P(r \mid e, d)$ is the sign within the logistic function. Therefore, we adopt the general representation $P(r \mid e, d) = \sigma\big(r \sum_{i=1}^{K} \omega_i f_i(e, d)\big)$. This model is denoted as GM in this paper. Several previous approaches can be regarded as global models that adopt different classification functions, such as decision trees (Wang et al., 2013) and Support Vector Machines (SVM) (Bonnefoy et al., 2013).
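As a minimal sketch of the logistic form described above, the following Python snippet evaluates the relevance probability for both values of r. The weights and feature values are hypothetical numbers chosen for illustration, not values from the experiments.

```python
import math

def sigmoid(x):
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def gm_probability(r, weights, features):
    """P(r | e, d) = sigmoid(r * sum_i w_i * f_i(e, d)) for r in {1, -1}."""
    score = sum(w * f for w, f in zip(weights, features))
    return sigmoid(r * score)

# Hypothetical combination weights and entity-document features.
omega = [0.8, -0.3, 1.2]
f_ed = [1.0, 2.0, 0.5]

p_pos = gm_probability(1, omega, f_ed)
p_neg = gm_probability(-1, omega, f_ed)
print(p_pos, p_neg)
```

Because only the sign inside the logistic function changes with r, the two probabilities sum to one.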

Latent Document Type Model
In GM, a fixed set of combination weights (i.e., ω) is learned to optimize the overall performance for all entity-document pairs. However, the best combination strategy for a given pair is not always the best for the others, since both the documents and the entities are heterogeneous. Therefore, we may benefit from a document-type-dependent model in which the combination strategy is chosen individually for each document type to optimize performance on that type. Since it is not feasible to determine a proper combination strategy for each individual document, we classify documents into one of several types. The combination strategy is then tuned to optimize average performance for documents within the same type.
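We propose a latent document type model (LDTM) by introducing an intermediate layer to capture the underlying type information in documents. A latent variable $z$ indicates which type the combination weights $\omega_z$ are drawn from. The choice of $z$ is determined by the document $d$. The joint probability of relevance $r$ and the latent variable $z$ is represented as
$$P(r, z \mid e, d; \alpha, \omega) = P(z \mid d; \alpha)\, P(r \mid e, d, z; \omega),$$
where $P(z \mid d; \alpha)$ is the mixing coefficient, denoting the probability of choosing the hidden type $z$ given document $d$, and $\alpha$ is the corresponding parameter. $P(r \mid e, d, z; \omega)$ denotes the discriminative component, which takes a logistic function. By marginalizing out $z$, we obtain
$$P(r \mid e, d; \alpha, \omega) = \sum_{z=1}^{N_z} P(z \mid d; \alpha)\, \sigma\Big(r \sum_{i=1}^{K} \omega_{zi} f_i(e, d)\Big), \tag{1}$$
where $N_z$ is the number of latent document types, $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic function, and $\omega_{zi}$ is the weight for the $i$th entry in the feature vector under the hidden variable $z$. We adopt a soft-max function $\frac{1}{Z_d} \exp\big(\sum_{j=1}^{L} \alpha_{zj} g_j(d)\big)$ to model $P(z \mid d; \alpha)$, where $Z_d$ is the normalization factor that scales the exponential function to a probability distribution. In this representation, each document $d$ is denoted by a bag of document-type features $(g_1(d), \dots, g_L(d))$. By plugging the soft-max function into Equation (1), we have
$$P(r \mid e, d; \alpha, \omega) = \frac{1}{Z_d} \sum_{z=1}^{N_z} \exp\Big(\sum_{j=1}^{L} \alpha_{zj} g_j(d)\Big)\, \sigma\Big(r \sum_{i=1}^{K} \omega_{zi} f_i(e, d)\Big). \tag{2}$$
Suppose the entity-document pairs in the training set are represented as $T = \{(d_u, e_v)\}$, and $R = \{r_{uv}\}$ denotes the corresponding relevance judgments of $(d_u, e_v)$, where $u = 1, \dots, M$ and $v = 1, \dots, N$. Assuming the training instances in $T$ are independently generated, the conditional likelihood of the training data is written as
$$P(R \mid T; \alpha, \omega) = \prod_{u=1}^{M} \prod_{v=1}^{N} \frac{1}{Z_{d_u}} \sum_{z=1}^{N_z} \exp\Big(\sum_{j=1}^{L} \alpha_{zj} g_j(d_u)\Big)\, \sigma\Big(r_{uv} \sum_{i=1}^{K} \omega_{zi} f_i(e_v, d_u)\Big). \tag{3}$$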
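The mixture described above can be sketched in a few lines of Python: a soft-max over document-type features gives the mixing coefficients, and each latent type contributes its own logistic component. The function names and all numeric values are illustrative assumptions, not part of the original system.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mixing_coefficients(alpha, g_d):
    """Soft-max P(z | d; alpha); alpha[z][j] is the weight of g_j(d) for type z."""
    scores = [sum(a * g for a, g in zip(alpha_z, g_d)) for alpha_z in alpha]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z_d = sum(exps)  # normalization factor (after the stability shift)
    return [e / z_d for e in exps]

def ldtm_probability(r, alpha, omega, f_ed, g_d):
    """P(r | e, d) = sum_z P(z | d; alpha) * sigmoid(r * omega_z . f(e, d))."""
    mix = mixing_coefficients(alpha, g_d)
    return sum(
        p_z * sigmoid(r * sum(w * f for w, f in zip(omega_z, f_ed)))
        for p_z, omega_z in zip(mix, omega)
    )

# Hypothetical parameters for two latent types.
alpha = [[1.0, 0.0], [0.0, 1.0]]   # document-type weights per latent type
omega = [[0.5, 1.0], [-0.5, 0.2]]  # entity-document weights per latent type
f_ed = [1.0, 2.0]                  # entity-document features f(e, d)
g_d = [1.0, 0.0]                   # document-type features g(d)

mix = mixing_coefficients(alpha, g_d)
p_pos = ldtm_probability(1, alpha, omega, f_ed, g_d)
p_neg = ldtm_probability(-1, alpha, omega, f_ed, g_d)
print(mix, p_pos, p_neg)
```

With a single latent type the soft-max returns 1 and the sketch reduces to the global model.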

Parameter Estimation
The parameters (i.e., ω and α) can be estimated by maximizing the data log-likelihood $L(\omega, \alpha)$, which is the logarithm of Equation (3). A typical parameter estimation method is the Expectation-Maximization (EM) algorithm, which iterates the E-step and the M-step until convergence. The E-step computes the posterior probability of $z$ given $d_u$ and $e_v$, denoted as $P(z \mid d_u, e_v)$:
$$P(z \mid d_u, e_v) = \frac{P(z \mid d_u; \alpha)\, P(r_{uv} \mid e_v, d_u, z; \omega)}{\sum_{z'} P(z' \mid d_u; \alpha)\, P(r_{uv} \mid e_v, d_u, z'; \omega)}. \tag{4}$$
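In the M-step, we obtain the following parameter update rules:
$$\omega_z^{*} = \arg\max_{\omega_z} \sum_{u,v} P(z \mid d_u, e_v) \log P(r_{uv} \mid e_v, d_u, z; \omega_z), \qquad \alpha^{*} = \arg\max_{\alpha} \sum_{u,v} \sum_{z} P(z \mid d_u, e_v) \log P(z \mid d_u; \alpha). \tag{5}$$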
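The EM procedure can be illustrated with a small pure-Python sketch. It is not the original implementation: the M-step here takes a few plain gradient-ascent steps instead of calling a quasi-Newton solver, and the training instances, learning rate, and initial parameters are made-up values. Each instance is a tuple (f, g, r) holding the entity-document features, the document-type features, and the label.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(data, alpha, omega):
    """Sum over instances of log sum_z P(z | d) * sigmoid(r * omega_z . f)."""
    total = 0.0
    for f, g, r in data:
        mix = softmax([dot(a, g) for a in alpha])
        total += math.log(sum(p * sigmoid(r * dot(w, f)) for p, w in zip(mix, omega)))
    return total

def em_step(data, alpha, omega, lr=0.1, inner_steps=5):
    """One EM iteration; the M-step is approximated by gradient ascent."""
    # E-step: posterior of each latent type for each training instance.
    posteriors = []
    for f, g, r in data:
        mix = softmax([dot(a, g) for a in alpha])
        joint = [p * sigmoid(r * dot(w, f)) for p, w in zip(mix, omega)]
        norm = sum(joint)
        posteriors.append([j / norm for j in joint])
    # M-step: increase the expected complete-data log-likelihood.
    n = len(data)
    for _ in range(inner_steps):
        grad_w = [[0.0] * len(w) for w in omega]
        grad_a = [[0.0] * len(a) for a in alpha]
        for (f, g, r), post in zip(data, posteriors):
            mix = softmax([dot(a, g) for a in alpha])
            for z, q in enumerate(post):
                # Gradient of q * log sigmoid(r * omega_z . f) w.r.t. omega_z.
                coeff = q * (1.0 - sigmoid(r * dot(omega[z], f))) * r
                for i, fi in enumerate(f):
                    grad_w[z][i] += coeff * fi
                # Gradient of sum_z' q_z' * log softmax_z' w.r.t. alpha_z.
                for j, gj in enumerate(g):
                    grad_a[z][j] += (q - mix[z]) * gj
        omega = [[w + lr * gw / n for w, gw in zip(wz, gz)] for wz, gz in zip(omega, grad_w)]
        alpha = [[a + lr * ga / n for a, ga in zip(az, gz)] for az, gz in zip(alpha, grad_a)]
    return alpha, omega

# Made-up training instances: (f(e, d), g(d), r).
data = [
    ([1.0, 0.5], [1.0, 0.0], 1),
    ([0.2, 1.0], [1.0, 0.0], -1),
    ([1.0, 0.1], [0.0, 1.0], -1),
    ([0.3, 0.9], [0.0, 1.0], 1),
]
alpha = [[0.1, -0.1], [-0.1, 0.1]]
omega = [[0.1, 0.0], [0.0, 0.1]]

ll_before = log_likelihood(data, alpha, omega)
for _ in range(20):
    alpha, omega = em_step(data, alpha, omega)
ll_after = log_likelihood(data, alpha, omega)
print(ll_before, ll_after)
```

Because each M-step only improves the expected log-likelihood rather than maximizing it, this is a generalized EM variant; with a sufficiently small step size the data log-likelihood still does not decrease between iterations.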
To optimize Equation (5), we employ the minFunc toolkit with a quasi-Newton strategy. We adopt the Akaike Information Criterion (AIC) to determine the number of latent types (Fang et al., 2010), which is calculated as $2m - 2L(\omega, \alpha)$, where $m$ is the number of parameters in the model.
LDTM holds two advantages over GM: (1) the combination parameters vary across document types, which yields greater flexibility; (2) it offers probabilistic semantics for the latent document types, so a document can be associated with multiple types.
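To make the AIC-based selection concrete, the sketch below scores several candidate numbers of latent types and keeps the one with the lowest AIC. Two assumptions are made here: the parameter count m is taken as the number of latent types times the combined size of the two feature vectors, and the log-likelihood values are invented for illustration.

```python
def aic(num_params, log_likelihood):
    """AIC = 2m - 2L, where m is the parameter count and L the log-likelihood."""
    return 2 * num_params - 2 * log_likelihood

def select_num_types(fitted, num_pair_features, num_type_features):
    """Pick the number of latent types with the lowest AIC.

    fitted maps a candidate number of types to its fitted log-likelihood.
    Assumption: each type has one omega vector and one alpha vector.
    """
    scores = {
        n: aic(n * (num_pair_features + num_type_features), ll)
        for n, ll in fitted.items()
    }
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical log-likelihoods for 1, 2, and 3 latent types.
fitted = {1: -520.0, 2: -480.0, 3: -476.0}
best, scores = select_num_types(fitted, num_pair_features=10, num_type_features=5)
print(best, scores)
```

In this made-up example the third type raises the log-likelihood only slightly, so the penalty on the extra parameters makes two types the preferred choice.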

Features
This section presents the two types of features used in the discriminative models. Entity-document features (i.e., $f(e, d)$) are used in the discriminative components of GM and LDTM. In addition, LDTM requires document-type features (i.e., $g(d)$) to learn the mixing coefficients in the mixture component.
Since our goal is not to develop new entity-document features, we adopt the same entity-document feature set proposed in our previous work (Wang et al., 2013;Wang et al., 2015a;Wang et al., 2015b), which has proved effective.
In terms of document-type features, we consider two kinds of prior knowledge embedded in documents to model the correlations between documents and their latent types.
Topic-based features One kind of prior knowledge for modeling a document's latent type is its intrinsic topics. Documents with one or more obvious topics are more likely to be recommended to KBs than those without any explicit topic. We capture the underlying topics in documents with word co-occurrences. After removing stop words, we represent each document as a feature vector with the bag-of-words model, where word weights are determined by the TF-IDF scheme.
Source-based features The source of a document is another kind of prior knowledge for evaluating the probability of the document's being recommended to KBs. We leverage a "bag-of-sources" model to represent each document as a source-based feature vector, and term weights are determined by a binary occurrence scheme. Please note that the sources are organized hierarchically. For example, mainstream news is a sub-source of news.
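A bag-of-words TF-IDF representation of this kind can be sketched as follows. The exact TF-IDF variant, tokenizer, and stop-word list are not specified in the text, so this sketch assumes raw term frequency times log(N / df), whitespace tokenization, and a tiny illustrative stop-word list; the example documents are invented.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "a", "of", "in", "and", "is"}

def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [
        [t for t in doc.lower().split() if t not in STOP_WORDS] for doc in docs
    ]
    vocab = sorted({t for toks in tokenized for t in toks})
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

docs = [
    "the band released a new album",
    "the senator proposed a new bill",
    "the band toured in europe",
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[0])
```

Each resulting vector can serve as the document-type feature vector g(d) of the corresponding document.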
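A binary bag-of-sources vector might look like the sketch below. The source names are the ten sources listed in the Dataset section. How the hierarchy enters the vector is not described, so the sketch assumes that a document from a sub-source also activates its parent source; only the mainstream news to news link is taken from the text.

```python
SOURCES = [
    "news", "mainstream news", "social", "weblog", "linking",
    "arxiv", "classified", "reviews", "forum", "memetracker",
]

# Only this parent link is stated in the text; others would be assumptions.
PARENT = {"mainstream news": "news"}

def source_vector(source):
    """Binary occurrence vector over SOURCES, including ancestor sources."""
    active = {source}
    while source in PARENT:  # walk up the hierarchy
        source = PARENT[source]
        active.add(source)
    return [1 if s in active else 0 for s in SOURCES]

print(source_vector("mainstream news"))
print(source_vector("arxiv"))
```

Under this assumption a mainstream news document shares one active dimension with every other news document, which lets the model relate a sub-source to its parent.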

Dataset
We utilize the TREC-KBA-2013 dataset as our experimental dataset. The dataset is composed of a temporally ordered stream corpus and a target KB entity set. The stream corpus contains nearly 1 billion documents crawled from 10 sources: news, mainstream news, social, weblog, linking, arxiv, classified, reviews, forum and memetracker. The corpus has been split, with documents from October 2011 to February 2012 as training instances and the remainder for evaluation. We adopt the same training/test split in our experiments. The entity set is composed of 121 Wikipedia entities and 20 Twitter entities.
Each entity-document pair is labeled as one of 4 relevance levels: (i) Vital, timely information about the entity's current state, actions, or situation, which would motivate a change to an already up-to-date KB article; (ii) Useful, possibly citable but not timely, e.g., background biography or secondary source information; (iii) Neutral, informative but not citable, e.g., a tertiary source like the Wikipedia article itself; and (iv) Garbage, no information about the target entity can be learned from the document, e.g., spam. Annotation details of the dataset are presented in Table 1.

Evaluation Scenarios
According to different granularity settings, we evaluate the proposed models in two scenarios: (i) Vital Only. Only vital entity-document pairs are treated as positive instances. (ii) Vital + Useful. Both vital and useful entity-document pairs are treated as positive instances.
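The two scenarios amount to different mappings from the four relevance levels to the binary label r. A minimal sketch, in which the scenario keys and the function name are illustrative choices:

```python
# Relevance levels that count as positive in each evaluation scenario.
POSITIVE_LEVELS = {
    "vital_only": {"vital"},
    "vital_useful": {"vital", "useful"},
}

def binary_label(level, scenario):
    """Map a relevance level to r in {1, -1} under the given scenario."""
    return 1 if level.lower() in POSITIVE_LEVELS[scenario] else -1

levels = ["Vital", "Useful", "Neutral", "Garbage"]
vital_only = [binary_label(lv, "vital_only") for lv in levels]
vital_useful = [binary_label(lv, "vital_useful") for lv in levels]
print(vital_only, vital_useful)
```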

Comparison Methods
We conduct extensive comparisons with the following methods.
• Global Model (GM). The global discriminative model introduced in section 2.1.
• Source-based Latent Document Type Model (src LDTM). A variant of LDTM that utilizes source-based features as document-type features.
• Topic-based Latent Document Type Model (topic LDTM). A variant of LDTM that utilizes topic-based features as document-type features.
• Combination Latent Document Type Model (combine LDTM). This approach utilizes source-based and topic-based features together as document-type features. In our experimental setting, we simply concatenate the two feature vectors into a single feature vector.
For reference, we also include three top-ranked approaches in TREC-KBA-2013 track.
• Official Baseline (Frank et al., 2013). A strong baseline in which human annotators went through the target entities and came up with a list of keywords for filtering vital documents.

Results and Discussion
Improving precision is harder than improving recall for CCR (Frank et al., 2013). Therefore, we focus on the recommendation quality of CCR and adopt precision and overall accuracy as metrics to evaluate the different approaches. All metrics are computed over the test pool of all entity-document pairs. The results are reported in Table 2. As shown in the 2nd block of Table 2, our mixture models achieve higher or competitive precision and accuracy in both scenarios. Compared with the official baseline, our best mixture model improves precision by about 28%. In both scenarios, the variants of LDTM outperform GM on precision and accuracy, which validates our motivations that (i) introducing latent document types in a mixture model can enhance the recommendation quality, and (ii) source-based and topic-based features can capture the hidden type information of documents. Moreover, topic LDTM generally performs better than src LDTM in both scenarios, which meets our expectation because topic-based features have far more dimensions than source-based features. However, even though the source-based feature vector has only a few dimensions (10 in our experiments), src LDTM still improves precision over GM. Thus, precision could be enhanced further if we develop more valuable features to represent the underlying document types. The combination variant of LDTM achieves the best precision in the Vital Only scenario and the best accuracy in the Vital + Useful scenario. The naïve combination of the two types of features can improve performance, but the improvement is not stable, so better combination strategies are needed.
For all variants of LDTM, the number of latent types determined by AIC is reported in Table 3. The optimal number of latent types in Vital + Useful is larger than that in Vital Only. This reveals that the types of Vital documents for entities are more restricted than those of Useful documents, either by topic or by source. In addition, the optimal number of latent topics is larger than that of latent sources, which also follows our intuition, since topic-based features have more dimensions than source-based features. Since we employ a naïve combination strategy for the two types of features, the number of latent types of combine LDTM is closer to that of topic LDTM, which has more features than src LDTM.
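For reference, the two metrics used in this section can be computed from binary labels as in the following sketch. The labels follow the r ∈ {1, −1} convention of the models; the gold and predicted values are invented and are not results from Table 2.

```python
def precision_and_accuracy(gold, predicted):
    """Precision over predicted positives and accuracy over all pairs."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == 1 and p == 1)
    false_pos = sum(1 for g, p in pairs if g == -1 and p == 1)
    correct = sum(1 for g, p in pairs if g == p)
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, accuracy

# Hypothetical labels for five entity-document pairs.
gold = [1, 1, -1, -1, 1]
predicted = [1, -1, 1, -1, 1]
precision, accuracy = precision_and_accuracy(gold, predicted)
print(precision, accuracy)
```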

Related Work
There are three kinds of approaches developed for CCR in previous work: query expansion (Liu et al., 2013;Wang et al., 2013), classification such as SVM (Kjersten and McNamee, 2012) and Random Forest classifiers (Bonnefoy et al., 2013), and learning to rank approaches (Wang et al., 2013). Transfer learning has also been utilized to transfer the keyword importance learned from training pairs to query pairs (Zhou and Chang, 2013). However, some highly supervised methods require training instances for each entity to build a relevance model, which limits their scalability. A compromise solution is to build a global discriminative model that uses all features indiscriminately.
We spotlight document-type features and study their impact in discriminative mixture models. Mixture models have been applied and proved effective in multiple information retrieval tasks, such as expert search (Fang et al., 2010) and federated search (Hong and Si, 2012). By learning flexible combination weights for different types of training instances, a mixture model can outperform global models that use fixed weights for all instances.

Conclusion
Cumulative Citation Recommendation (CCR) is an important task that automatically detects citation-worthy documents from high-volume text streams for knowledge base entities. We study CCR as a classification problem and propose a latent document type model (LDTM) that introduces a latent layer in a discriminative model to capture the correlations between documents and their intrinsic types. Two variants of LDTM are implemented by modeling the latent types with document source-based and topic-based features respectively. Experimental results on the TREC-KBA-2013 dataset demonstrate that our mixture model can improve CCR performance significantly, especially on precision and accuracy, revealing the advantage of LDTM in enhancing the recommendation quality of citation-worthy documents.
For future work, we wish to explore more useful document-type features and apply more suitable combination strategies to improve the latent document type model.