Summarizing Topical Contents from PubMed Documents Using a Thematic Analysis

Improving the search and browsing experience in PubMed® is a key component in helping users detect information of interest. In particular, when exploring a novel field, it is important to provide a comprehensive view of a specific subject. One solution for providing this panoramic picture is to find sub-topics within a set of documents. We propose a method that finds sub-topics, which we refer to as themes, and computes representative titles based on the set of documents in each theme. The method combines a thematic clustering algorithm and the Pool Adjacent Violators algorithm to induce significant themes. Then, for each theme, a title is computed using PubMed document titles and theme-dependent term scores. We tested our system on five disease sets from OMIM® and evaluated the results based on normalized point-wise mutual information and MeSH® terms. For both performance measures, the proposed approach outperformed LDA. The quality of theme titles was also evaluated by comparing them with manually created titles.


Introduction
PubMed (http://pubmed.gov), currently a collection of about 25 million bibliographic records, has grown exponentially in size. With the abundance and diversity of information in PubMed, many queries retrieve thousands of documents, making it difficult for users to browse the results and identify the information most relevant to their topic of interest. The query 'cystic fibrosis', for example, retrieves papers that discuss different aspects of the disease, including its clinical features, treatment options, diagnosis, etc. A possible solution to this problem is to automatically group the retrieved documents into meaningful thematic clusters or themes (we use these terms interchangeably). However, clustering alone does not solve the problem entirely, as a significant amount of human post-processing is required to infer the topic of each cluster.
There exists a vast collection of probabilistic clustering methods. One common problem among most of them is that different results are obtained depending on the cluster initialization, suggesting that some clusters are unstable or weak. However, there is no obvious way to effectively and efficiently evaluate the quality of clusters. In this paper, we combine EM-based thematic clustering (Kim and Wilbur, 2012) with the Pool Adjacent Violators (PAV) algorithm (Ayer et al., 1955; Wilbur et al., 2005). PAV is an isotonic regression algorithm, which we use as a method for converting a score into a probability. Here, we show how PAV can be applied to evaluate the quality of clusters.
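To make the score-to-probability conversion concrete, the following is a minimal sketch of pool adjacent violators over (score, label) pairs. This is illustrative only, not the authors' implementation; the function names and data layout are assumptions.

```python
def pav(points):
    """Pool Adjacent Violators: fit a monotonically non-decreasing
    probability estimate to (score, label) pairs, label in {0, 1}."""
    points = sorted(points)  # order by score
    # each block: [sum_of_labels, count, score_of_last_member]
    blocks = []
    for score, label in points:
        blocks.append([label, 1, score])
        # merge backwards while the monotonicity constraint is violated
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n, sc = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = sc
    # step function: score -> pooled probability
    return [(sc, s / n) for s, n, sc in blocks]

def pav_prob(fit, score):
    """Evaluate the fitted step function at a given score."""
    p = fit[0][1]
    for sc, prob in fit:
        if score >= sc:
            p = prob
        else:
            break
    return p
```

The pooled block averages are non-decreasing by construction, so `pav_prob` returns a monotone estimate of the probability of a positive label at a given score.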
Another issue that motivated this research is that most existing algorithms produce clusters that are not self-descriptive. Presenting meaningful titles can significantly improve the user perception of clustering results. To that end, we utilize PubMed document titles and cluster-related term scores to automatically obtain a title for each theme. The method results in thematic clusters of documents with cluster titles.
Studies similar to our approach are ASI (Adaptive Subspace Iteration) (Li et al., 2004) and SKWIC (Simultaneous Keyword Identification and Clustering of text documents) (Frigui and Nasraoui, 2004). Both perform document clustering and cluster-dependent keyword identification simultaneously. SKWIC can only produce hard clustering, while ASI is computationally very expensive as it heavily depends on matrix operations. A study by Hammouda et al. (2005) suggests automatic keyphrase extraction from a cluster of documents as a surrogate for providing a cluster title, but it treats document clustering and cluster-dependent keyword extraction as separate problems.
Topic modeling (Hofmann, 1999; Blei et al., 2003; Blei and Lafferty, 2005) is the most popular alternative approach, with a similar underlying goal of discovering the hidden thematic structure of a document collection and organizing the collection according to the discovered topics. Topic models are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over words (Steyvers and Griffiths, 2007). However, topic modeling is not a document clustering scheme by nature. Although a list of keywords that represent a topic is available, the title of the cluster may not be evident.

Methods
We here describe the EM-based clustering algorithm and show how PAV is incorporated into it to yield the PAV-EM thematic clustering technique. We further present a cluster summarization method for inducing theme titles.

Theme definition
Let D be a document set and let T be the set of terms in D. Let R denote the relation between elements of T and D, where tRd means t ∈ d. We define a theme as a subject that is described by non-empty sets U ⊆ T and V ⊆ D, where all the elements of U have a high probability of occurring in all the elements of V. An EM framework is used to extract subject terms for a theme (Wilbur, 2002). In addition to the observed data R, a theme is defined by the latent indicator variables {z_d}_{d∈D}, where z_d indicates whether d ∈ V. The parameters are

    Θ = (n_U, {p_t, q_t}_{t∈U}, {r_t}_{t∈T}),    (1)

where n_U is the size of the set U. For any t ∈ U, p_t is the probability that tRd for any d ∈ V, and q_t is the probability that tRd for any d ∈ D − V. For any t ∈ T, r_t is the probability that tRd for any d ∈ D. Assuming all relations tRd are independent of each other, the goal is to maximize the probability of the observed data,

    p(R | Θ) = ∏_{t∈U} [ ∏_{d∈V} p_t^{[tRd]} (1 − p_t)^{1−[tRd]} ∏_{d∈D−V} q_t^{[tRd]} (1 − q_t)^{1−[tRd]} ] ∏_{t∉U} ∏_{d∈D} r_t^{[tRd]} (1 − r_t)^{1−[tRd]},    (2)

where [tRd] is 1 if tRd holds and 0 otherwise. The E-step (expectation step) evaluates the expectation of the logarithm of Eqn. 2. The M-step (maximization step) maximizes this expectation over the parameters Θ. For each term t, we define a quantity α_t, which is the difference between the contribution coming from t depending on whether u_t = 1 or u_t = 0, where u_t indicates membership of t in U. The maximization is completed by choosing the n_U largest α_t's and setting u_t = 1 for each of them and u_t = 0 for all others. Details of this theme extraction scheme can be found in Wilbur (2002).

Algorithm 1: PAV-EM thematic clustering
    Give a value for the parameter q.
    Set X = ∅.
    for i ← 1, n do
        Create q random clusters.
        Run the theme clustering algorithm.
        For each cluster C and document d with score pz^C_d, add the data point to X.
    Obtain the PAV function, PAV(pz^C_d), over X.
    Set S = ∅.
    repeat
        Create q random clusters for {d | d ∉ ∪S}.
        Run the theme clustering algorithm.
        Select any cluster C in which more than 10 documents have PAV(pz^C_d) > 0.9, and add it to S.
    until no more changes in S.
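The M-step selection of subject terms can be illustrated with a small sketch. This is not the authors' implementation: the closed-form MLE rate estimates, the smoothing constant `eps`, and all function names are assumptions for illustration. Here α_t is computed as the log-likelihood gain of modeling a term with separate theme rates (p_t inside V, q_t outside) over a single background rate r_t.

```python
import math

def alpha_scores(docs, in_theme, eps=1e-6):
    """docs: list of sets of terms; in_theme: list of bools (d in V).
    Returns alpha_t for every term t."""
    V = [d for d, z in zip(docs, in_theme) if z]
    notV = [d for d, z in zip(docs, in_theme) if not z]
    terms = set().union(*docs)

    def bern_ll(k, n, p):
        # log-likelihood of k occurrences out of n under rate p
        p = min(max(p, eps), 1 - eps)  # smoothing against log(0)
        return k * math.log(p) + (n - k) * math.log(1 - p)

    alpha = {}
    for t in terms:
        kV = sum(t in d for d in V)
        kN = sum(t in d for d in notV)
        p_t = kV / len(V)              # estimated rate inside the theme
        q_t = kN / len(notV)           # estimated rate outside the theme
        r_t = (kV + kN) / len(docs)    # estimated background rate
        alpha[t] = (bern_ll(kV, len(V), p_t) + bern_ll(kN, len(notV), q_t)
                    - bern_ll(kV + kN, len(docs), r_t))
    return alpha

def select_subject_terms(alpha, n_U):
    """Set u_t = 1 for the n_U largest alpha_t."""
    return sorted(alpha, key=alpha.get, reverse=True)[:n_U]
```

Terms concentrated inside (or outside) the theme receive large α_t, while terms spread evenly across the collection receive α_t near zero.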

PAV-EM thematic clustering
In thematic clustering, a document is assigned to the theme that has the highest probability for that document (Kim and Wilbur, 2012). Although this approach shows reasonable performance for theme-based document clustering, the dynamic nature of random initialization and the multiple subjects described in a document may create many weak themes. Moreover, there is no clear guideline for distinguishing strong and weak themes. Thus, we here propose a method that extracts strong themes more effectively. In the EM-based theme extraction scheme, the log odds score pz^C_d indicates the extent to which a document d is coupled with a specific theme C. If a cluster includes a reasonable number of documents with high pz^C_d scores, the cluster represents a strong theme. Therefore, we can obtain strong themes by collecting such clusters.
Let the probability p(score) be a monotonically non-decreasing function of score. The PAV algorithm (Ayer et al., 1955; Wilbur et al., 2005) is a regression method that derives from the data the monotonically non-decreasing estimate of p(score) that assigns maximal likelihood to the data. For our approach, score = pz^C_d. Algorithm 1 shows the theme clustering process using the PAV algorithm. For the given dataset D and the initial number of clusters q, theme clustering is performed n times, and an isotonic regression function is learned by applying the PAV algorithm. Note that q is an initial guess for the number of clusters and is not guaranteed to remain the same in the output set. For our experiments, we set q = 50 and n = 100. After the PAV algorithm is applied, theme clustering is performed again. At each iteration, we select any cluster in which more than 10 documents have PAV scores higher than 0.9. Unselected documents are re-used for clustering in the next iteration. This procedure is repeated until there are no more changes in the selected cluster set S.

Theme summarization
After obtaining themes (document clusters and their subject terms), we summarize each theme by choosing a text segment from PubMed document titles. A title should cover as many subject terms as possible, but it should also be well-formed, i.e. descriptive enough and humanly understandable. To achieve this goal, we first extract all possible candidates from document titles as follows: (i) Extract all possible candidates as n-grams, where n = 1, ..., 20. Noun phrases are treated as units and must be entirely inside or outside a candidate.
(ii) Check POS tags for starting and ending words in a candidate. Starting with a conjunction, verb, preposition and symbol is not allowed. Ending with a conjunction, verb, preposition, symbol, determiner, adjective or certain pronouns is not allowed.
(iv) Check grammatical dependency relations. We discard candidates in which the head word of a preposition does not appear in the same candidate as the preposition. We also check the case 'between A and B', so that A and B are not separated.
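The boundary checks in step (ii) can be sketched as follows. The coarse tag set and function name here are hypothetical; the actual system presumably relies on a full POS tagger.

```python
# coarse tags assumed for illustration: NOUN, VERB, ADJ, DET, PREP, CONJ, SYM, PRON
BAD_START = {"CONJ", "VERB", "PREP", "SYM"}
BAD_END = {"CONJ", "VERB", "PREP", "SYM", "DET", "ADJ", "PRON"}

def boundary_ok(tagged):
    """tagged: list of (word, tag) pairs forming one candidate.
    Accept the candidate only if its first and last tokens are allowed."""
    if not tagged:
        return False
    return tagged[0][1] not in BAD_START and tagged[-1][1] not in BAD_END

# e.g. "of sudden death" fails (starts with a preposition),
# while "sudden cardiac death" passes.
```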
Next, for each candidate cand_i, a score is calculated by

    score(cand_i) = Σ_{t ∈ cand_i} α_t · tf_t,

where tf_t is the term frequency of the term t and α_t is the theme-dependent term score. However, an ideal title should have enough words to be descriptive, hence we subtract (len(cand_i) − 5)^2 from score(cand_i), where len(cand_i) is the number of words in cand_i, and choose the top-scoring candidate as the title.
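A minimal sketch of this scoring step, assuming theme-dependent term scores and within-theme term frequencies are already available (function and variable names are illustrative, not the authors' code):

```python
def title_score(candidate, term_score, tf):
    """candidate: list of tokens; term_score: theme-dependent term
    scores; tf: term frequencies within the theme."""
    base = sum(term_score.get(t, 0.0) * tf.get(t, 0) for t in candidate)
    # length penalty favors titles of about five words
    return base - (len(candidate) - 5) ** 2

def best_title(candidates, term_score, tf):
    """Return the candidate with the highest penalized score."""
    return max(candidates, key=lambda c: title_score(c, term_score, tf))
```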

Experimental Results
We applied our method to five disease sets from OMIM: "cystic fibrosis", "deafness", "DiGeorge syndrome", "autism" and "hypertrophic cardiomyopathy". These sets consist of 3000, 3000, 956, 2917 and 1997 PubMed documents, respectively, and are available at http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/PAVEM. To evaluate PAV-EM and compare it with a topic modeling method, latent Dirichlet allocation (LDA) (Blei et al., 2003), both approaches were run 10 times for each disease set and scores were averaged over all runs. Mallet was used to run LDA, and the same tokenization was applied to LDA and PAV-EM. The number of topics given to LDA was 50, and the recommended optimization parameter was used for producing LDA topics.

Table 1 presents average runtimes for LDA and PAV-EM. LDA and PAV-EM spent 15.2 and 13.3 seconds on average processing the smallest set, "DiGeorge syndrome". On larger sets, e.g. "autism", it took 46.9 and 31.3 seconds for LDA and PAV-EM, respectively. We also ran another implementation of LDA, which was 30 times slower than Mallet. While both PAV-EM and LDA can be implemented in parallel, these results indicate that PAV-EM may be more efficient for obtaining themes from a larger set of PubMed documents.

Table 1: Average runtimes for LDA and PAV-EM in seconds. Sets 1, 2, 3, 4 and 5 are "cystic fibrosis", "deafness", "DiGeorge syndrome", "autism" and "hypertrophic cardiomyopathy", respectively.

Method
The PAV-EM algorithm automatically learns themes from unlabeled PubMed documents, hence the performance measures used in supervised learning cannot be applied to our setup. Recent studies have shown growing interest in topic coherence measures (Newman et al., 2010; Mimno et al., 2011), which capture the semantic interpretability of topics based on subject terms. Table 2 shows the topic coherence scores measured by normalized point-wise mutual information (NPMI). For both top 5 and top 10 subject terms, PAV-EM achieves better NPMI scores than LDA. NPMI is known to be strongly correlated with human ratings (Aletras and Stevenson, 2013; Röder et al., 2015) and is defined by

    NPMI = Σ_{i=2}^{N} Σ_{j=1}^{i−1} log( (p(t_i, t_j) + ε) / (p(t_i) p(t_j)) ) / ( −log( p(t_i, t_j) + ε ) ),

where p(t_i, t_j) is the fraction of documents containing both terms t_i and t_j, and N indicates the number of top subject terms. ε = 1/|D| is the smoothing factor, where |D| is the size of the dataset.
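The pairwise NPMI computation can be sketched as follows (an illustrative implementation, not the evaluation code used in the paper; it assumes each top term occurs in at least one document and that no pair co-occurs in every document):

```python
import math
from itertools import combinations

def npmi_coherence(top_terms, docs):
    """Average NPMI over pairs of top subject terms.
    docs: list of sets of terms; probabilities are document fractions."""
    D = len(docs)
    eps = 1.0 / D  # smoothing factor

    def p(*terms):
        # fraction of documents containing all given terms
        return sum(all(t in d for t in terms) for d in docs) / D

    scores = []
    for ti, tj in combinations(top_terms, 2):
        pij = p(ti, tj) + eps
        scores.append(math.log(pij / (p(ti) * p(tj))) / -math.log(pij))
    return sum(scores) / len(scores)
```

Terms that frequently co-occur in the same documents yield NPMI near 1, while unrelated terms yield scores near (or below) 0.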
MeSH (Medical Subject Headings) is a controlled vocabulary for indexing and searching biomedical literature (Lowe and Barnett, 1994). MeSH terms assigned to an article are often used to indicate the topics of the article, thus these terms can be used to identify how well documents are grouped by topic. In each cluster, p-values of MeSH terms are calculated using the hypergeometric distribution (Kim and Wilbur, 2001), and the top N significant MeSH terms are used to calculate precision, recall and F1. Table 3 compares PAV-EM with LDA on the MeSH term-based performance. In the table, PAV-EM provides higher recall and F1 for top 1 and top 3 MeSH terms. Higher recall is an advantage in our task because the theme summarization process uses a consensus among PubMed documents to reach a theme title.
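The hypergeometric enrichment test for a MeSH term can be sketched with the standard upper-tail probability (a generic sketch of the test, not the authors' code):

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): probability of seeing at least k documents with a
    given MeSH term in a cluster of size n, when K of the N documents
    in the whole set carry that term."""
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(n, K) + 1)) / total
```

A small p-value means the term is over-represented in the cluster relative to the whole document set, so it is likely to characterize the cluster's topic.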
The next experiment compared machine-generated titles with manually labeled titles. Although human judgements are subjective, it is not uncommon to collect human judgements for evaluating topic modeling methods (Mei et al., 2007; Xie and Xing, 2013). To validate the performance of the theme summarization approach, we first chose 500 documents from each disease set, and produced themes and titles. For each set, the five strongest themes were chosen and shown to three human annotators along with the extracted subject terms. Table 4 shows an example of the proposed approach and the manual annotation for the "hypertrophic cardiomyopathy" set. Among 25 themes, our approach correctly identified 21 theme titles. We considered a machine-generated title correct if it included any of the manually annotated titles.

Conclusion
This study was inspired by an EM-based thematic clustering approach. In this probabilistic framework, theme terms are iteratively selected and documents are assigned to the most likely theme. The number of themes is dynamically adjusted by probabilistic evidence from the documents. The PAV algorithm is utilized to measure the quality of themes. After themes are identified, subject term weights and PubMed document titles are used to form humanly understandable titles. The experimental results show that our approach provides a useful overview of a set of documents. In addition, the method may allow for a new way of browsing semantically clustered documents as well as searching with context-based query suggestions.

Table 4: Comparison of the titles generated by the proposed approach and the manual annotation for the "hypertrophic cardiomyopathy" set.