Salience Rank: Efficient Keyphrase Extraction with Topic Modeling

Topical PageRank (TPR) uses the latent topic distributions inferred by Latent Dirichlet Allocation (LDA) to rank the noun phrases extracted from documents. The ranking procedure consists of running PageRank K times, where K is the number of topics used in the LDA model. In this paper, we propose a modification of TPR, called Salience Rank. Salience Rank only needs to run PageRank once and extracts comparable or better keyphrases on benchmark datasets. In addition to its quality and efficiency benefits, our method has the flexibility to extract keyphrases with varying tradeoffs between topic specificity and corpus specificity.


Introduction
Automatic keyphrase extraction consists of finding a set of terms in a document that provides a concise summary of the text content (Hasan and Ng, 2014). In this paper we consider unsupervised keyphrase extraction, where no human-labeled corpus of documents is used for training a classifier (Grineva et al., 2009; Pasquier, 2010; Liu et al., 2009b; Zhao et al., 2011; Liu et al., 2009a). This scenario often arises in practical applications, as human annotation and tagging is both time and resource consuming. Unsupervised keyphrase extraction is typically cast as a ranking problem: first, candidate phrases are extracted from documents, typically noun phrases identified by part-of-speech tagging; then these candidates are ranked. The performance of unsupervised keyphrase extraction algorithms is evaluated by comparing the most highly ranked keyphrases with keyphrases assigned by human annotators.

* Work done as an intern at Amazon.

This paper proposes Salience Rank, a modification of the Topical PageRank algorithm by Liu et al. (2010). Our method is close in spirit to Single Topical PageRank by Sterckx et al. (2015) and includes it as a special case. The advantages of Salience Rank are twofold:

Performance: The algorithm extracts high-quality keyphrases that are comparable to, and sometimes better than, the ones extracted by Topical PageRank. Salience Rank is also more efficient than Topical PageRank, as it runs PageRank once rather than multiple times.

Configurability: The algorithm is based on the concept of "word salience" (hence its name), which is described in Section 3 and can be used to balance topic specificity and corpus specificity of the extracted keyphrases. Depending on the use case, the output of the Salience Rank algorithm can be tuned accordingly.

Review of Related Models
Below we introduce some notation and discuss approaches that are most related to ours.
Let W = {w_1, w_2, . . . , w_N} be the set of all the words present in a corpus of documents. Let G = (W, E) denote a word graph, whose vertices represent words and whose edges e(w_i, w_j) ∈ E indicate the relatedness between words w_i and w_j in a document (measured, e.g., by co-occurrence or the number of co-occurrences of the two words). The outdegree of vertex w_i is given by Out(w_i) = Σ_{j : w_i → w_j} e(w_i, w_j).
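As a concrete illustration of these definitions (not code from the paper; function names are ours), a weighted co-occurrence word graph and its vertex outdegrees can be built as follows, counting co-occurrences of words at most `window` positions apart:

```python
from collections import defaultdict

def build_word_graph(tokens, window=2):
    """Build an undirected, weighted co-occurrence graph:
    e(w_i, w_j) counts how often the two words appear
    within `window` positions of each other."""
    edges = defaultdict(float)
    for i, wi in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            wj = tokens[j]
            if wi != wj:
                # canonical key so (a, b) and (b, a) share one edge
                edges[tuple(sorted((wi, wj)))] += 1.0
    return dict(edges)

def out_degree(edges, w):
    """Out(w) = sum of edge weights incident to w (undirected graph)."""
    return sum(wt for (a, b), wt in edges.items() if w in (a, b))
```

With `window=2` only adjacent word pairs are linked, which matches the window size used in our experiments.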

Topical PageRank
The main idea behind Topical PageRank (TPR) (Liu et al., 2010) is to incorporate topical information by performing Latent Dirichlet Allocation (LDA) (Blei et al., 2003) on a corpus of documents. TPR constructs a word graph G = (W, E) based on the word co-occurrences within documents. It uses LDA to find the latent topics of the documents, reweights the word graph according to each latent topic, and runs PageRank (Page et al., 1998) once per topic.
In LDA, each word w of a document d is assumed to be generated by first sampling a topic t ∈ T (where T is a set of K topics) from d's topic distribution θ_d, and then sampling a word from the distribution over words φ_t of topic t. Both θ_d and φ_t are drawn from conjugate Dirichlet priors α and β, respectively. Thus, the probability of word w, given document d and the priors α and β, is

p(w | d, α, β) = Σ_{t ∈ T} p(w | t, β) p(t | d, α).    (1)

After running LDA, TPR ranks each word w_i ∈ W of G by

R_t(w_i) = λ Σ_{j : w_j → w_i} (e(w_j, w_i) / Out(w_j)) R_t(w_j) + (1 − λ) p(t | w_i)    (2)

for t ∈ T, where λ is a damping factor and p(t | w) is estimated via LDA. TPR assigns a topic specific preference value p(t | w) to each w ∈ W as the jump probability at each vertex, depending on the underlying topic. Intuitively, p(t | w) indicates how much the word w focuses on topic t.1 At the next step of TPR, the word scores (2) are accumulated into keyphrase scores. In particular, for each topic t, a candidate keyphrase is ranked by the sum of the scores of its words,

R_t(phrase) = Σ_{w_i ∈ phrase} R_t(w_i).    (3)

By combining the topic specific keyphrase scores R_t(phrase) with the probability p(t | d) derived from the LDA, we can compute the final keyphrase scores across all K topics:

R(phrase) = Σ_{t ∈ T} R_t(phrase) · p(t | d).    (4)
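The per-topic random walk (2) and the aggregation (4) can be sketched in a few lines of pure Python. This is an illustrative sketch only (function names and data layouts are ours, and the phrase-scoring step (3) is omitted for brevity):

```python
def personalized_pagerank(edges, preference, damping=0.85, iters=100):
    """One personalized PageRank run on an undirected weighted graph.
    `edges` maps (u, v) -> weight; `preference` maps each word to its
    jump probability. Every word in the graph must appear in `preference`."""
    words = sorted(preference)
    neighbors = {w: {} for w in words}
    for (u, v), wt in edges.items():
        neighbors[u][v] = neighbors[u].get(v, 0.0) + wt
        neighbors[v][u] = neighbors[v].get(u, 0.0) + wt
    out = {w: sum(nb.values()) for w, nb in neighbors.items()}
    rank = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        rank = {w: damping * sum(rank[u] * wt / out[u]
                                 for u, wt in neighbors[w].items())
                   + (1 - damping) * preference[w]
                for w in words}
    return rank

def topical_pagerank(edges, topic_word, doc_topics):
    """TPR word scores: one PageRank per topic t with preference p(t | w),
    then a combination of the per-topic scores weighted by p(t | d)."""
    per_topic = {t: personalized_pagerank(edges, pref)
                 for t, pref in topic_word.items()}
    words = list(next(iter(per_topic.values())))
    return {w: sum(per_topic[t][w] * doc_topics[t] for t in per_topic)
            for w in words}
```

Note that `topical_pagerank` performs K full PageRank runs, one per topic; this is precisely the cost that Salience Rank avoids.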

Single Topical PageRank
Single Topical PageRank (STPR) was recently proposed by Sterckx et al. (2015). It aims to reduce the runtime complexity of TPR while maintaining its predictive performance. Similarly to Salience Rank, it runs PageRank only once. STPR is based on the idea of "topical word importance" TWI(w), defined as the cosine similarity between the vector of word-topic probabilities [p(w | t_1), . . . , p(w | t_K)] and the vector of document-topic probabilities [p(t_1 | d), . . . , p(t_K | d)], for each word w given the document d. STPR then uses PageRank to rank each word w_i ∈ W by replacing p(t | w_i) in (2) with the normalized importance TWI(w_i) / Σ_{w_j ∈ W} TWI(w_j). STPR can be seen as a special case of Salience Rank, where the topic specificity of a word is considered when constructing the random walk, but its corpus specificity is neglected. In practice, however, balancing these two concepts is important, which may explain why Salience Rank outperforms STPR in our experiments.
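The topical word importance is a plain cosine similarity between the two K-dimensional vectors; a minimal sketch (function name is ours):

```python
import math

def topical_word_importance(word_topic_probs, doc_topic_probs):
    """TWI(w): cosine similarity between [p(w|t_1), ..., p(w|t_K)]
    and [p(t_1|d), ..., p(t_K|d)] for one word w and one document d."""
    dot = sum(a * b for a, b in zip(word_topic_probs, doc_topic_probs))
    na = math.sqrt(sum(a * a for a in word_topic_probs))
    nb = math.sqrt(sum(b * b for b in doc_topic_probs))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0
```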

Salience Rank
In order to achieve performance and configurability, the Salience Rank (SR) algorithm combines the K latent topics estimated by LDA into a word metric, called word salience, and uses it as a preference value for each w_i ∈ W. Thus, SR needs to perform only a single run of PageRank on the word graph G in order to obtain a ranking of the words in each document.

Word Salience
In the following we provide quantitative measures for topic specificity and corpus specificity, and define word salience.
Definition 3.1 The topic specificity of a word w is

TS(w) = Σ_{t ∈ T} p(t | w) log( p(t | w) / p(t) ).    (5)

This definition of topic specificity is equivalent to Chuang et al. (2012)'s proposal of the distinctiveness of a word w, which is in turn the Kullback-Leibler (KL) divergence from the marginal probability p(t), i.e., the likelihood that any randomly selected word is generated by topic t, to the conditional probability p(t | w), i.e., the likelihood that an observed word w is generated by a latent topic t. Intuitively, topic specificity measures how much a word is shared across topics: the less w is shared across topics, the higher its topic specificity TS(w).
As TS(w) is non-negative and unbounded, we empirically normalize it to [0, 1] via

TS(w) ← (TS(w) − TS_min) / (TS_max − TS_min),    (6)

with TS_min and TS_max the minimum and maximum topic specificity values in the corpus. In what follows, we always use normalized topic specificity values, unless explicitly stated otherwise.
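The topic specificity (5) and its min-max normalization (6) translate directly into code; an illustrative sketch with our own function names:

```python
import math

def topic_specificity(p_t_given_w, p_t):
    """TS(w) = KL(p(t|w) || p(t)) = sum_t p(t|w) * log(p(t|w) / p(t)).
    Zero-probability topics are skipped (their contribution is 0)."""
    return sum(pw * math.log(pw / pt)
               for pw, pt in zip(p_t_given_w, p_t) if pw > 0)

def normalize_ts(ts_values):
    """Min-max normalize raw TS values over the vocabulary to [0, 1]."""
    lo, hi = min(ts_values.values()), max(ts_values.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all values equal
    return {w: (v - lo) / span for w, v in ts_values.items()}
```

A word whose topic distribution matches the marginal p(t) gets TS(w) = 0, while a word concentrated on a single topic gets the highest specificity.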
We apply a straightforward definition for corpus specificity.
Definition 3.2 The corpus specificity of a word w is CS(w) = p(w | C), the probability of observing w in the corpus C of interest. It can be estimated by counting word frequencies in the corpus. Finally, a word's salience is defined as a linear combination of its topic specificity and corpus specificity.
Definition 3.3 The salience of a word w is

S(w) = (1 − α) CS(w) + α TS(w),    (7)

where α ∈ [0, 1] is a parameter controlling the tradeoff between the corpus specificity and the topic specificity of w.
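Definitions 3.2 and 3.3 can be sketched as follows, assuming relative frequencies as the corpus-specificity estimate and using α = 0.4, the default value in our experiments (function names are ours):

```python
from collections import Counter

def corpus_specificity(corpus_tokens):
    """Estimate CS(w) as the relative frequency of w in the corpus."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def salience(w, cs, ts, alpha=0.4):
    """S(w) = (1 - alpha) * CS(w) + alpha * TS(w), with alpha in [0, 1].
    `cs` and `ts` are dicts of (normalized) specificity values."""
    return (1 - alpha) * cs.get(w, 0.0) + alpha * ts.get(w, 0.0)
```

With alpha = 0 the salience reduces to corpus specificity, with alpha = 1 to topic specificity, matching the two extremes discussed in Section 4.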
On one hand, we aim to extract keyphrases that are relevant to one or more topics while, on the other hand, the extracted keyphrases as a whole should have a good coverage of the topics in the document. Depending on the downstream applications, it is often useful to be able to control the balance between these two competing principles. In other words, sometimes keyphrases with high topic specificity (i.e., phrases that are representative exclusively for certain topics) are more appropriate, while other times keyphrases with high corpus specificity (i.e., phrases that are representative of the corpus as a whole) are more appropriate. Intuitively, it is advantageous for a keyphrase extraction algorithm to have an internal "switch" tuning the extent to which extracted keyphrases are skewed towards particular topics and, conversely, the extent to which keyphrases generalize across different topics.
It needs to be emphasized that the choice of quantitative measures for topic specificity and corpus specificity used above is just one among many possibilities. For example, for topic specificity, one can make use of the topical word importance by Sterckx et al. (2015), or of the several other alternatives proposed by Liu et al. (2010) and mentioned in Section 2.1. For corpus specificity, alternatives to vanilla term frequencies, such as augmented frequency (to discount longer documents) and logarithmically scaled frequency, come to mind.
Taking word salience into account, we modify (2) as follows:

R(w_i) = λ Σ_{j : w_j → w_i} (e(w_j, w_i) / Out(w_j)) R(w_j) + (1 − λ) S(w_i).    (8)

The substantial efficiency boost of SR compared to TPR lies in the fact that with (2), K PageRank runs are required to calculate R_t(w_i), t = 1, . . . , K, before obtaining R(w_i), while with (8), R(w_i) is obtained with a single PageRank run.
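Equation (8) amounts to a single personalized PageRank run with the word saliences as preference values. A minimal sketch, assuming the saliences are normalized into a probability distribution over words (an implementation detail of ours, not spelled out in the text):

```python
def salience_rank(edges, sal, damping=0.85, iters=100):
    """Single PageRank run with word salience S(w) as the preference
    vector, per equation (8); no per-topic runs are needed.
    `edges` maps (u, v) -> weight; `sal` maps each word to its salience."""
    words = sorted(sal)
    nbrs = {w: {} for w in words}
    for (u, v), wt in edges.items():
        nbrs[u][v] = nbrs[u].get(v, 0.0) + wt
        nbrs[v][u] = nbrs[v].get(u, 0.0) + wt
    out = {w: sum(nb.values()) for w, nb in nbrs.items()}
    total = sum(sal.values()) or 1.0
    pref = {w: s / total for w, s in sal.items()}  # normalize saliences
    rank = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        rank = {w: damping * sum(rank[u] * wt / out[u]
                                 for u, wt in nbrs[w].items())
                   + (1 - damping) * pref[w]
                for w in words}
    return rank
```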

Algorithm Description
First, SR performs LDA to estimate the latent topic probabilities p(t) present in the corpus and the probabilities p(t | w), which are used to calculate the topic specificity and the salience of each word w.
Similarly to TPR, SR is performed on the word co-occurrence graph G = (W, E). We use undirected graphs: when sliding a window of size s through the document, a link between two vertices is added if the two words appear within the window. We observed that the edge direction does not affect keyphrase extraction performance much; the same observation was made by Mihalcea and Tarau (2004) and Liu et al. (2010).
We then run the updated version of PageRank derived in (8) and compute the scores of the candidate keyphrases similarly to the way TPR does, using (4). For a fair comparison, noun phrases matching the pattern (adjective)*(noun)+, i.e., zero or more adjectives followed by one or more nouns, are chosen as candidate keyphrases; this is the same pattern suggested by Liu et al. (2010) in the original TPR paper. In effect, SR combines the K PageRank runs of TPR into a single one by using salience as the preference value in the word graph.
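The candidate selection step can be illustrated as follows, assuming the tokens are already POS-tagged and the tags have been simplified to 'ADJ'/'NOUN' labels (a simplification of ours; a real pipeline would use a POS tagger's full tag set):

```python
import re

def extract_candidates(tagged_tokens):
    """Extract candidate keyphrases matching (adjective)*(noun)+ from a
    list of (word, tag) pairs with simplified 'ADJ'/'NOUN' tags."""
    # Encode the tag sequence as one character per token, then let a
    # regular expression find the (adjective)*(noun)+ spans.
    tag_string = "".join("A" if t == "ADJ" else "N" if t == "NOUN" else "x"
                         for _, t in tagged_tokens)
    candidates = []
    for m in re.finditer(r"A*N+", tag_string):
        words = [w for w, _ in tagged_tokens[m.start():m.end()]]
        candidates.append(" ".join(words))
    return candidates
```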

Results
Our experiments are conducted on two widely used datasets from the keyphrase extraction literature, 500N-KPCrowd (Marujo et al., 2013) and Inspec (Hulth, 2003). Following the evaluation process described in Mihalcea and Tarau (2004), we use only the uncontrolled set of keyphrases annotated by the authors for our analysis. Since our approach is completely unsupervised, we combine the training, testing, and validation datasets. The top 50 and top 10 keyphrases were used for evaluation on 500N-KPCrowd and Inspec, respectively.2

In all experiments we used a damping factor λ = 0.85 in PageRank, as in the original PageRank algorithm, and a window size s = 2 to construct the word graphs. Changing the window size s from 2 to 20 does not influence the results much, as also observed in Liu et al. (2010). PageRank is considered converged when the l2 norm of the vector containing the scores R(w_i) changes by less than 10^−6 between iterations. The tradeoff parameter α in SR is fixed at 0.4.

We compare the performance of Salience Rank (SR), Topical PageRank (TPR), and Single Topical PageRank (STPR) in terms of precision, recall, and F measure on 500N-KPCrowd and Inspec. The results are summarized in Table 1; the 95% confidence interval for the F measure is shown in the last column. In terms of the F measure, SR achieves the best results on both datasets: it ties with TPR and outperforms STPR on 500N-KPCrowd, and outperforms both TPR and STPR on Inspec. The source code is available at https://github.com/methanet/saliencerank.git.
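The reported measures are computed per document against the annotated gold keyphrases; a minimal exact-match evaluation sketch (names and the exact-match simplification are ours):

```python
def evaluate(extracted, gold):
    """Precision, recall, and F measure of extracted keyphrases
    against the annotated gold set (exact string match)."""
    ex, gd = set(extracted), set(gold)
    tp = len(ex & gd)  # true positives
    precision = tp / len(ex) if ex else 0.0
    recall = tp / len(gd) if gd else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```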
2 There are two common ways to set the number of output keyphrases: using a fixed value a priori, as we do (Turney, 1999), or deciding a value with heuristics at runtime (Mihalcea and Tarau, 2004).

We further experiment with varying the number of topics K used for fitting the LDA model in SR. Table 2 shows how the F measures change on 500N-KPCrowd as the number of topics varies. Overall, the impact of the number of topics is mild, with K = 50 being the optimal value. The impact of K on TPR can be found in Liu et al. (2010). In our approach, the random walk derived in (8) depends on the word salience, which in turn depends on K; in TPR, not only does each individual random walk (2) depend on K, but the final aggregation of keyphrase rankings depends on K as well.

Table 3: Effect of the α parameter in SR on 500N-KPCrowd. SR was run with 50 LDA topics and the top 50 keyphrases were used for the evaluation. The 95% confidence interval for the F measure is shown in the last column.
We also experiment with varying the tradeoff parameter α of SR. On 500N-KPCrowd, Table 3 illustrates that different values of α can have a considerable impact on the various performance measures. To complement the quantitative results in Table 3, Table 4 presents a concrete example, showing that varying α can lead to qualitative changes in the top ranked keyphrases. In particular, when α = 0 the corpus specificity of the keyphrases SR extracts is high. This is demonstrated by the fact that words such as "theory" and "function" are among the top keyphrases SR selects, which are highly common words in scientific papers. On the other hand, when α = 1 these keyphrases are not present among the top. This toy example illustrates the relevance of balancing topic and corpus specificity in practice: when presenting the keyphrases to a layman, high corpus specificity is suitable, as it conveys more high-level information; when presenting to an expert in the area, high topic specificity is suitable, as it dives deeper into topic specific details.

Input: Individual rationality, or doing what is best for oneself, is a standard model used to explain and predict human behavior, and von Neumann-Morgenstern game theory is the classical mathematical formalization of this theory in multiple-agent settings. Individual rationality, however, is an inadequate model for the synthesis of artificial social systems where cooperation is essential, since it does not permit the accommodation of group interests other than as aggregations of individual interests. Satisficing game theory is based upon a well-defined notion of being good enough, and does accommodate group as well as individual interests through the use of conditional preference relationships, whereby a decision maker is able to adjust its preferences as a function of the preferences, and not just the options, of others. This new theory is offered as an alternative paradigm to construct artificial societies that are capable of complex behavior that goes beyond exclusive self interest.

Table 4: Unique top keyphrases extracted by SR from the input document above.
Unique top keyphrases with α = 0: classical mathematical formalization; preferences; theory; options; function; multiple agent settings
Unique top keyphrases with α = 1: individual interests; group interests; artificial social systems; individual rationality; conditional preference relationships; standard model

Conclusions & Remarks
In this paper, we propose a new keyphrase extraction method, called Salience Rank. It improves upon the Topical PageRank algorithm by Liu et al. (2010) and the Single Topical PageRank algorithm by Sterckx et al. (2015). The key advantages of the new method are twofold: (i) while maintaining and sometimes improving the quality of the extracted keyphrases, it runs PageRank only once instead of K times as in Topical PageRank, and therefore has a lower runtime; (ii) by constructing the underlying word graph with the newly proposed word salience, it allows the user to balance the topic and corpus specificity of the extracted keyphrases. All three methods rely only on the input corpus. They could benefit from external resources such as Wikipedia and WordNet, as indicated by, e.g., Medelyan et al. (2009), Grineva et al. (2009), and Martinez-Romo et al. (2016). In the keyphrase extraction literature, LDA is the most commonly used topic modeling method. Other methods, such as probabilistic latent semantic indexing (Hofmann, 1999) and nonnegative matrix factorization (Sra and Inderjit, 2006), are viable alternatives. However, it is hard to tell in general whether keyphrase quality improves with these alternatives; we suspect that this strongly depends on the domain of the dataset, and a choice may be made based on other practical considerations.
We have fixed the tradeoff parameter α throughout the experiments for a straightforward comparison to the other methods. In practice, one should search for the optimal value of α for the task at hand. An open question is how to theoretically quantify the relationship between α and various performance measures, such as the F measure.