Which Matters Most? Comparing the Impact of Concept and Document Relationships in Topic Models

Topic models have been widely used to discover hidden topics in collections of documents. In this paper, we investigate the role of two different types of relational information, i.e., document relationships and concept relationships. While exploiting the document network significantly improves topic coherence, the introduction of concepts and their relationships does not influence the results, either quantitatively or qualitatively.


Introduction
Topic models are a suite of generative probabilistic models aimed at discovering the thematic information (or topics) of an unstructured collection of documents. These models, including the well-known Latent Dirichlet Allocation (LDA) (Blei et al., 2003), usually consider texts as the only source of information and rest on the assumption that documents are independent and identically distributed (i.i.d.). However, in several real-world cases, documents are characterized by an underlying relational structure: scientific papers can be related through citations, web pages can link to each other through hyperlinks, and users in social networks can be friends. One of the first approaches that explicitly models the relationships between documents is the Relational Topic Model (RTM) (Chang and Blei, 2009), based on the intuition that connected documents are likely to discuss the same topics.
Traditional topic models also assume that the topic assignment of a word is independent of the other hidden topics, given the document's topic distribution. However, previous work has shown that introducing additional knowledge about the relationships between words improves the coherence of the discovered topics (Yang et al., 2015b; Chen et al., 2013b,c). This type of relationship is commonly viewed as synonymy, but this is not always the case in real-world scenarios because of word ambiguity. It is thus important to consider the concept behind a word, alongside the word itself, when determining its relationships with other words, because this makes it possible to assign the same topic to words that are actually related and not merely synonymous. For example, it becomes possible to grasp that the word "engine", when associated with the concept of "search engine", is distant from "motor" but similar to "information retrieval". A few works investigate the use of named entities in topic models (Kim et al., 2012; Allahyari and Kochut, 2016), but none of them addresses the problem in relational settings.
Contribution In this paper, we investigate the role of two different types of relational information: (1) concept relationships between words and named entities, obtained from word embeddings, and (2) document-level relationships, extracted from a document network. The impact of these two types of relational information is evaluated by considering traditional topic models and by introducing two novel classes of Entity Constrained Topic Models. The source code is available at: https://github.com/MIND-Lab/EC-RTM.

Related Work
Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is a generative probabilistic model that describes a document corpus through a set of K topics, each seen as a distribution of words over a fixed vocabulary. A document is assumed to be composed of a mixture of topics, following a Dirichlet distribution, and words are generated according to topics drawn from this mixture. LDA can be extended by considering different types of relational information.
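For concreteness, the generative process of LDA can be sketched as follows (a minimal illustration with arbitrary hyperparameter values; all names are ours, not from the paper's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, alpha, beta = 5, 1000, 0.1, 0.01

    # Topic-word distributions: one distribution over the vocabulary per topic.
    phi = rng.dirichlet(beta * np.ones(V), size=K)

    def generate_document(n_words):
        # Per-document topic mixture drawn from a Dirichlet prior.
        theta = rng.dirichlet(alpha * np.ones(K))
        words = []
        for _ in range(n_words):
            z = rng.choice(K, p=theta)   # draw a topic for this token
            w = rng.choice(V, p=phi[z])  # draw a word from that topic
            words.append(w)
        return words

    doc = generate_document(50)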
Word-level Relational Topic Models relax the independence assumption of words in a document or in a topic. They can be roughly divided into models that encode word order (Wang et al., 2007; Gruber et al., 2007; Lindsey et al., 2012; Fei et al., 2014; Wallach, 2006) or syntactic dependencies (Boyd-Graber and Blei, 2008), and models that incorporate semantic or domain knowledge relationships (Andrzejewski et al., 2009, 2011; Chen et al., 2013b; Yang et al., 2015b). Lately, the growing interest in word embeddings has led to the incorporation of relationships derived from word embeddings (Petterson et al., 2010; Zhao et al., 2017; Das et al., 2015; Nguyen et al., 2015; Li et al., 2016; Batmanghelich et al., 2016; Nozza et al., 2016).
Document-level Relational Topic Models assume that two linked documents are more likely to have similar topic distributions. The Relational Topic Model (RTM) and its extensions (Chen et al., 2013a; Terragni et al., 2020; Yang et al., 2015a, 2016) build on LDA and model each link as a binary variable denoting the existence of a connection between a pair of documents. Other approaches include regularized topic models (Mei et al., 2008), which augment the model's objective function with a network regularization penalty, and Dirichlet Multinomial Regression (Mimno and McCallum, 2008) and its extensions (Hefny et al., 2013; Wahabzada et al., 2010), which incorporate links by viewing them as per-document attributes. A promising paradigm uses neural variational inference to infer topics (Miao et al., 2016; Bianchi et al., 2020a,b). The Neural Relational Topic Model (NRTM) (Bai et al., 2018) is based on a Stacked Variational AutoEncoder (SVAE) to infer topics and a multilayer perceptron to predict links.

Entity Constrained Topic Models
We propose Entity Constrained Latent Dirichlet Allocation (EC-LDA) and Entity Constrained Relational Topic Models (EC-RTM), two classes of models aimed at incorporating entity-entity and entity-word relationships into traditional topic models. Following (Yang et al., 2015b; Terragni et al., 2020), we constrain the joint distribution of LDA and RTM through potential functions that model entity-entity and/or entity-word relationships. The potentials can be factored out of the joint distribution, and the posterior can be derived using collapsed Gibbs sampling for inference. In addition to the assumptions of EC-LDA, EC-RTM also assumes that two linked documents are likely to discuss the same topics. We report the joint distributions of the proposed models in Appendix A. For further details on constrained topic models, we refer the reader to (Yang et al., 2015b; Terragni et al., 2020).
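Schematically, and following the general form of constrained topic models in (Yang et al., 2015b), the constrained joint distribution multiplies the base model's joint by the knowledge potentials; a sketch in LaTeX notation (the exact distributions are those reported in Appendix A):

    p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) \;\propto\;
    p_{\mathrm{base}}(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)
    \prod_{l \in L} f_l(z_u, u)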
We define the vocabulary E containing the unique named entities of the corpus and the vocabulary W containing the unique words, and derive the vocabulary Γ as their union. Relationships are denoted by a set of knowledge L, and each piece of knowledge l ∈ L is incorporated through a potential function f_l(z, u), which assigns a real-valued score to the hidden topic assignment z of the word or named-entity token u.
We derive the knowledge L using Skip-Gram (Mikolov et al., 2013). Given a large but finite training set Λ for the word embeddings, the embedding model can be expressed as a mapping function C : Γ → R^t. For each token u ∈ Γ, we define a must-constraint set L^m_u, containing the words and named entities that are likely to share the same themes as u:

    L^m_u = { v ∈ Γ : sim(C(u), C(v)) ≥ m },

where sim is the cosine similarity between two vectors and m is a given threshold. We also define a cannot-constraint set L^c_u, containing the words and named entities that are not likely to share the same themes as u:

    L^c_u = { v ∈ Γ : sim(C(u), C(v)) ≤ c },

where c is a given threshold. For example, a must-constraint set for the named entity "Artificial neural network" may be {Artificial neuron, ANN, Perceptron}, which contains named entities that are likely to be assigned to the same topic. Analogously, a cannot-constraint set for "Artificial neural network" may be {Olympic games, Athlete}, which contains named entities related to sports rather than to Machine Learning.
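A minimal sketch of how the constraint sets can be built from a trained embedding model (variable names and thresholds are illustrative, not the paper's):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def constraint_sets(u, vocab, embed, m=0.7, c=0.3):
        # embed: dict mapping each token of the joint vocabulary Gamma
        # (words and "NE/"-prefixed entities) to its embedding vector.
        # m and c are illustrative thresholds, not the paper's values.
        must, cannot = set(), set()
        for v in vocab:
            if v == u or v not in embed or u not in embed:
                continue
            s = cosine(embed[u], embed[v])
            if s >= m:
                must.add(v)      # likely to share u's themes
            elif s <= c:
                cannot.add(v)    # unlikely to share u's themes
        return must, cannot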

Entity-Entity Potential Function
We specify an entity-entity potential function that models the relationships between named entities. Let N_ze be the maximum between 1 and the topic-entity count, i.e., the number of occurrences of entity e assigned to topic z. The function f_l(z, u) is defined as:

    f_l(z, u) = ∏_{e ∈ L^m_u} N_ze · ∏_{e ∈ L^c_u} 1/N_ze.

The function increases the probability that the entity u will be assigned to the same topics as the entities belonging to L^m_u. Similarly, it decreases the probability that u will be drawn from the same topics as the entities contained in L^c_u. The models encoding the Entity-Entity (EE) potential function will be referred to as EC-LDA-EE and EC-RTM-EE.
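A sketch of this potential as it could appear in a sampler, assuming topic-entity counts are kept in a list of dictionaries (the multiplicative form mirrors the equation above; names are ours):

    def entity_entity_potential(z, u, must, cannot, N):
        # N[z][e]: number of occurrences of entity e assigned to topic z;
        # must/cannot map each token to its sets L^m_u and L^c_u.
        score = 1.0
        for e in must.get(u, ()):
            score *= max(1, N[z].get(e, 0))   # favor topics where must-linked entities occur
        for e in cannot.get(u, ()):
            score /= max(1, N[z].get(e, 0))   # disfavor topics where cannot-linked entities occur
        return score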

Entity-Word Potential Function
Let N_zw be the maximum between 1 and the topic-word count, i.e., the number of occurrences of word w assigned to topic z. The following potential function deals with relationships between entities and word tokens:

    f_l(z, u) = ∏_{w ∈ L^m_u ∩ W} N_zw · ∏_{w ∈ L^c_u ∩ W} 1/N_zw   if u ∈ E,
    f_l(z, u) = ∏_{e ∈ L^m_u ∩ E} N_ze · ∏_{e ∈ L^c_u ∩ E} 1/N_ze   if u ∈ W.

The potential function models the following cases:
• if u is a named entity, then we consider only the words that are contained in u's must- and cannot-constraint sets, i.e., L^m_u and L^c_u;
• if u is a word, then we consider only the named entities that are contained in u's must- and cannot-constraint sets, i.e., L^m_u and L^c_u.
The models encoding Entity-Word (EW) relationships are named EC-LDA-EW and EC-RTM-EW.
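In a collapsed Gibbs sampler, either potential simply reweights the standard LDA sampling probability of each candidate topic; a schematic sketch under the usual collapsed-LDA notation (count structures and names are illustrative):

    import random

    def sample_topic(u, d, K, alpha, beta, V, n_dz, n_zu, n_z, potential):
        # n_dz[d][z]: tokens of document d assigned to topic z (current token excluded);
        # n_zu[z][u]: occurrences of token u in topic z; n_z[z]: total tokens in topic z.
        weights = []
        for z in range(K):
            lda_term = ((n_dz[d][z] + alpha)
                        * (n_zu[z].get(u, 0) + beta)
                        / (n_z[z] + beta * V))
            weights.append(lda_term * potential(z, u))  # reweight by f_l(z, u)
        return random.choices(range(K), weights=weights)[0]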

Experimental setting
Datasets The experimental investigation has been performed on two relational benchmark datasets: (1) Cora-ML (McCallum et al., 2005), a citation network over a set of Machine Learning papers (Sen et al., 2008), and (2)
Preprocessing The identification of named entities in text is typically performed through techniques belonging to the task of Named Entity Recognition (NER) (Fersini et al., 2014; Ritter et al., 2011; Li et al., 2020). Once the named entities are recognized, the next step is to associate them with unambiguous concepts, for example resources in a Knowledge Base. This process is known as Named Entity Linking (NEL) (Cucerzan, 2007; Dredze et al., 2010; Basile et al., 2015; Cecchini et al., 2016; Nozza et al., 2019).
In this paper, we used the DBpedia Spotlight tool (Mendes et al., 2011) (confidence = 0.5 and support = 0.0) to identify named entities in the text and associate them with DBpedia resources. We added the prefix "NE/" to each identified entity to distinguish it from words, and applied standard preprocessing to the text. We considered only must-constraints, which were extracted from Wikipedia2Vec (Yamada et al., 2018). For details on the hyperparameters and preprocessing, see Appendix A.
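A sketch of how such an annotation step can be performed against the public DBpedia Spotlight REST endpoint, with the same confidence and support values; the endpoint URL and response handling reflect the public demo service and may differ from the authors' setup:

    import requests

    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": "Genetic programming evolves computer programs.",
                "confidence": 0.5, "support": 0},
        headers={"Accept": "application/json"},
    )
    # Each annotation carries the linked DBpedia URI and the matched surface form.
    for ann in resp.json().get("Resources", []):
        surface, uri = ann["@surfaceForm"], ann["@URI"]
        print(f'NE/{uri.rsplit("/", 1)[-1]}  <- "{surface}"')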

Experimental Results
Quantitative Results Tables 2 and 3 show the performance of the models in terms of all the considered scores over an increasing number of topics on the datasets (computing the KL-metrics is impractical for SVAE and NRTM, since they do not model word- and document-topic distributions). Results show that models that consider relational information generally obtain higher performance than their non-relational counterparts. Conversely, the introduction of the concept constraints in the EC-RTM-EE and EC-RTM-EW models does not seem to provide significant improvements with respect to RTM. This can be explained by the fact that the constraint sets additionally included in the EC-RTM models are already captured in the word-topic distribution obtained by RTM. A different behavior can be observed for the C_V scores, for which NRTM and SVAE obtain significantly higher performance. This opposite trend with respect to the other topic scores can be explained by the fact that C_V rewards the presence of rare words even if they are contained in junk topics, as stated by the authors of (Röder et al., 2015) (see https://bit.ly/3jApSAC).
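As a reference for how such coherence scores can be computed, a minimal sketch using gensim's CoherenceModel on toy data (illustrative only, not the paper's evaluation pipeline):

    from gensim.corpora import Dictionary
    from gensim.models.coherencemodel import CoherenceModel

    # Toy corpus and one candidate topic given as its top words.
    texts = [["genetic", "programming", "fitness"],
             ["population", "evolutionary", "fitness"]]
    dictionary = Dictionary(texts)
    topics = [["genetic", "programming", "fitness", "population"]]

    cm = CoherenceModel(topics=topics, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    print(cm.get_coherence())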

Qualitative Results
In Table 4, we show the top-10 words for Cora-ML for an example topic, "Genetic Programming", for EC-RTM-EE, EC-RTM-EW, LDA, RTM, SVAE, and NRTM. To analyze whether the named entity annotation can contribute to topic interpretability, we also report the words of LDA and RTM (referred to as LDA* and RTM*) run on a version of Cora-ML composed of words only.

Models      Top-10 words
LDA*        problem genetic algorithms problems programming search optimization fitness population space
RTM*        genetic control programming fitness reinforcement population algorithms paper environment behavior
EC-RTM-EE   NE/Genetic programming programs NE/Genetic algorithm population fitness genetic evolutionary program NE/Evolution strategies
EC-RTM-EW   NE/Genetic programming NE/Genetic algorithm population fitness genetic evolutionary NE/Evolution encoding operator operators
SVAE        koza NE/Multidisciplinary design optimization splice bitsback NE/Genetic programming fitness orientation NE/Ploidy NE/Exon coded
NRTM        genetic reactive NE/Genetic programming NE/Case casebased neuroevolution ssa NE/Genetic algorithm coevolutionary problemsolving

Table 4: "Genetic Programming" topic in Cora-ML.

As expected from the quantitative results, the topics extracted by the proposed models do not significantly differ from those of RTM*, further supporting the hypothesis that the imposed constraints were already captured by the original model. Some qualitative considerations can be made regarding the novel entity-level modeling of the documents. While this representation leads to topics containing explicit concepts (e.g., "NE/Genetic programming"), the topics obtained by RTM* seem equally interpretable, because they identify named entities in the form of distinct words (e.g., "genetic, programming, algorithm"). Moreover, the difference in representation is only evident when named entities are composed of two or more words (e.g., "NE/Evolution" and "evolution" are equivalent). The benefit of applying NER and NEL techniques for recognizing named entities in topics may come in handy for automatically providing links to a Knowledge Base (such as Wikipedia), at the computational cost of discovering named entities. Moreover, the proposed potential functions would allow users to manipulate the model to derive explanations for the topic assignments, or to force entities into the same topic based on human domain knowledge. Regarding SVAE and NRTM, their topics seem hard to interpret from a qualitative perspective, confirming the results of the quantitative evaluation.

Conclusion
We proposed two classes of Entity Constrained Topic Models for incorporating different types of relational information. Results demonstrate that models exploiting document-level relationships achieve improvements with respect to their non-relational counterparts. Conversely, concept relationships do not significantly improve either topic coherence or interpretability. As future work, we plan to investigate multi-relational topic models that extract other relationships from the data, and to exploit contextual encoding methods for entity representation, also in multilingual settings (Devlin et al., 2019; Nozza et al., 2020).

The link probability function models each per-pair binary variable associated with a link as a logistic regression (with hidden covariates), parameterized by coefficients η and intercept ν.

A.4 Computing Infrastructure
Experiments were run on three commodity machines using CPUs only; the models can be run with basic infrastructure. Two of the machines have 8GB of RAM, and the third has 16GB.