Distributed Document and Phrase Co-embeddings for Descriptive Clustering

Descriptive document clustering aims to automatically discover groups of semantically related documents and to assign a meaningful label to characterise the content of each cluster. In this paper, we present a descriptive clustering approach that employs a distributed representation model, namely the paragraph vector model, to capture semantic similarities between documents and phrases. The proposed method uses a joint representation of phrases and documents (i.e., a co-embedding) to automatically select a descriptive phrase that best represents each document cluster. We evaluate our method by comparing its performance to an existing state-of-the-art descriptive clustering method that also uses co-embedding but relies on a bag-of-words representation. Results obtained on benchmark datasets demonstrate that the paragraph vector-based method obtains superior performance over the existing approach in both identifying clusters and assigning appropriate descriptive labels to them.


Introduction
Document clustering is a well-established technique whose goal is to automatically organise a collection of documents into a number of semantically coherent groups. Descriptive document clustering goes a step further, in that each identified document cluster is automatically assigned a human-readable label (either a word or phrase) that characterises the semantic content of the documents within the cluster. Descriptive clustering methods have been shown to be useful in a variety of scenarios, including information retrieval (Bharambe and Kale, 2011), analysis of social networks (Zhao and Zhang, 2011), and large-scale exploration (Nassif and Hruschka, 2013) and visualisation of text collections (Kandel et al., 2012).
A number of previously proposed descriptive clustering techniques work by extending a standard document clustering approach. Documents are typically clustered based on a bag-of-words (BoW) representation (i.e., the occurrence counts of the words that appear in each document). Then, each cluster is labelled using the most commonly occurring word or phrase within the cluster (Weiss, 2006). In contrast, the recently proposed CEDL method (Mu et al., 2016) maps documents and candidate cluster labels into a common semantic vector space (i.e., a co-embedding). The co-embedding space facilitates the straightforward assignment of descriptive labels to document clusters. The CEDL method has been shown to generate accurate cluster labels and to achieve improved clustering performance when compared to standard descriptive clustering methods. Nonetheless, its co-embedding is based solely on a BoW representation of the documents and is thus limited in its ability to accurately represent the semantic similarity between documents.
In this paper, we investigate a specific case of descriptive clustering that selects a single multiword phrase to characterise each cluster of documents (Li et al., 2008). Firstly, we assume descriptive phrases are to be selected from a candidate phrase set extracted from the corpus during preprocessing. The proposed method then follows the co-embedding descriptive clustering paradigm of the CEDL algorithm. However, instead of using a BoW representation, we employ the paragraph vector (PV) model (Le and Mikolov, 2014) to learn distributed vector representations of phrases and documents. These distributed representations move beyond unstructured BoW representations by considering the local contexts in which words and phrases appear within documents, which provides a more precise estimate of semantic similarity.
In particular, we present two extensions to the initial PV-based method, each of which learns a common co-embedding space of documents and phrases. The first extension jointly learns co-embeddings of documents and phrases. The second extension constructs 'pseudo-documents' consisting of the lexical context surrounding each occurrence of a particular phrase. Each of these contexts is treated as a separate document instance, and all instances of the same phrase are associated with a single embedding vector. In both cases, after clustering the document embedding vectors, each embedded phrase is a candidate cluster label. To select the most appropriate descriptive label amongst these candidates, we first rank the documents according to their proximity to each candidate label's embedding vector and then select the phrase whose ranking maximises the average precision for a given cluster.
We compare the results obtained by our PV-based descriptive clustering method against two methods: spectral clustering (Shi and Malik, 2000), which only identifies clusters (but does not assign labels to them), and the previously introduced CEDL method (Mu et al., 2016), which carries out both clustering and labelling. Experimental results based on publicly available benchmark text collections demonstrate the effectiveness and superiority of our methods in both clustering performance and labelling quality.

Descriptive Clustering
Descriptive clustering methods typically use an unsupervised approach to firstly group documents into flat or hierarchical clusters (Steinbach et al., 2000). Document clusters are then characterised using a set of informative and discriminative words (Zhu et al., 2006), phrases (Mu et al., 2016; Li et al., 2008) or sentences (Kim et al., 2015).
Early approaches to descriptive clustering followed the description-comes-first (DCF) paradigm (Osiński et al., 2004; Weiss, 2006; Zhang, 2009). DCF-based methods work by firstly identifying a set of cluster labels, and subsequently forming document clusters by measuring the relevance of each document to a potential cluster label. DCF-based approaches have several shortcomings, which include poor clustering performance and low readability of cluster descriptors (Lee et al., 2008; Carpineto et al., 2009). More recent developments in descriptive clustering have proposed alternative techniques, which approach the problems of improving clustering performance and descriptive label quality from different angles. For instance, Scaiella et al. (2012) identify Wikipedia concepts in documents and then compute relatedness between documents according to the link structure of Wikipedia. Navigli and Crisafulli (2010) propose a method that takes into account synonymy and polysemy. Their method utilises the Google Web1T corpus to identify word senses based on word co-occurrences and computes the similarity between documents using the extracted sense information.
More recently, Mu et al. (2016) presented a co-embedding-based descriptive clustering approach that learns a common co-embedding vector space of documents and candidate descriptive phrases. The co-embedded space reduces clustering and cluster labelling to the more straightforward process of computing similarities between pairs of documents and between documents and candidate cluster labels.

Distributed Representation
Distributed representation techniques are becoming increasingly important in a number of supervised learning tasks, e.g., sentiment analysis (Dai et al., 2015), text classification (Dai et al., 2015; Ma et al., 2015) and named entity recognition (Turian et al., 2010). A number of models have been proposed to learn distributed word or phrase representations in order to predict word occurrences given a local context (Mnih and Hinton, 2009; Mikolov et al., 2013b; Mikolov et al., 2013a; Pennington et al., 2014). Subsequently, the PV model was proposed to learn representations of both words and documents (Le and Mikolov, 2014; Dai et al., 2015). The PV model has been shown to be capable of learning a semantically richer representation of documents compared to unstructured BoW models. To our knowledge, our work constitutes the first attempt to use distributed representation models to co-embed documents and phrases for unsupervised descriptive clustering.

Proposed Descriptive Clustering Method
As outlined above, the descriptive clustering task (i.e., grouping documents according to semantic relatedness and characterising the cluster content using a representative descriptive phrase) relies heavily on learning a representation of documents and phrases that can accurately capture relevant semantic information. A particularly effective strategy for descriptive clustering is to jointly map documents and descriptive phrases together into a common embedding space (Mu et al., 2016). The clustering of documents and selection of descriptive phrases for each cluster is then carried out by calculating the cosine similarities between documents (to form clusters), and between documents and descriptive phrases (to determine descriptive labels) in the learned space. Instead of relying on the commonly used BoW model, we propose a novel descriptive clustering approach. Our method uses similarities computed from distributed joint embeddings of documents and phrases, which are learned by considering both the global context provided by the document and the local context of the descriptive phrases. We propose two different strategies to learn these embeddings, as described below.

Joint Learning of Document and Phrase Embeddings
The first strategy jointly learns the distributed representations for documents and phrases by representing phrases, words, and documents as vectors that are used both to predict the occurrence of words in given documents (reflecting global document content information), and to predict the co-occurrences of words and phrases within a sliding window, to reflect the local context information.
We extend the PV model described in Dai et al. (2015) to simultaneously generate word, phrase and document embeddings. The objective is to maximise the log probability of words and phrases conditioned on either their global or local context:

$$\sum_{t \in T_P} \Big( \log p(p_t \mid d_t) + \sum_{c \in C_t} \log p(p_t \mid c) \Big) + \sum_{s \in T_W} \Big( \log p(w_s \mid d_s) + \sum_{c \in C_s} \log p(w_s \mid c) \Big), \tag{1}$$

where $T_P$ is the set of training phrase instances; $p_t \in P$ is the $t$-th phrase instance; $d_t$ denotes the document corresponding to the $t$-th training instance; $c$ denotes a member of the local context $C_t = [q_{t-L}, \ldots, q_{t-1}, q_{t+1}, \ldots, q_{t+L}]$, which occurs within a window of size $L$ around the training instance ($|C_t| = 2L$) and consists of both words and phrases $q \in P \cup W$. Likewise, $T_W$ is the set of training word instances; $w_s \in W$ is the $s$-th target word instance; $d_s$ denotes the document corresponding to the $s$-th training instance; and $C_s$ is its local context with $|C_s| = 2L$. To summarise, the probability terms $p(p_t \mid d_t)$ and $p(w_s \mid d_s)$ model the document content information at a global level, while $p(p_t \mid c)$ and $p(w_s \mid c)$ model the local context. There are $2L + 1$ conditional probabilities estimated for each training instance. The probability of a given lexical unit $q_t \in P \cup W$ (either a word or a phrase) is modelled using the vector embeddings of the $|P| + |W|$ words and phrases and the softmax function:

$$p(q_t \mid x) = \frac{\exp(u_{q_t}^{\top} z_x)}{\sum_{q' \in P \cup W} \exp(u_{q'}^{\top} z_x)}, \tag{2}$$

where $u_{q_t}$ is a weight vector specific to the target word or phrase, and $z_x$ is either the embedding vector $z_{d_t}$ of the document corresponding to instance $t$ or the embedding vector $z_c$ of a word or phrase in the context of $q_t$. Since the document, phrase and word vectors all share the same weight vectors $u_{q_t}$ to predict the target, they are necessarily in the same vector space.
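To make the formulation concrete, the shared-softmax prediction can be sketched in a few lines; the function name and toy dimensions below are illustrative, not part of the original implementation.

```python
import numpy as np

def softmax_prob(u, z, target):
    """Probability of a target word/phrase given a context embedding z.

    u : (V, D) matrix of output weight vectors u_q, one row per unit in
        the shared vocabulary P ∪ W.
    z : (D,) embedding of the conditioning context -- a document vector
        z_d (global context) or a word/phrase vector z_c (local context);
        both live in the same space, so the same weights u apply.
    target : row index of the target unit q_t.
    """
    scores = u @ z
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[target]
```

Because a single weight matrix `u` scores predictions made from document, word and phrase vectors alike, all three kinds of embedding are forced into one common space.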

Phrase Embeddings via Local Context Pseudo-Documents
The previous model considers learning an embedding as a multi-objective problem by trying to predict phrases and words based on the global and local context. Apart from their indexing, Equation (1) treats words and descriptive phrases interchangeably. An alternative approach is to treat phrases as 'pseudo-documents' formed from the sets of words appearing in the local context of each phrase occurrence. Specifically, training instances for a phrase's embedding vector are constructed by extracting the local context around each occurrence of the phrase in the document collection. Using the augmented training set, consisting of both the original documents and the additional pseudo-documents, we can then employ any existing PV model (Le and Mikolov, 2014; Dai et al., 2015) to learn the document-phrase co-embeddings.
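A minimal sketch of the pseudo-document construction, assuming tokenised documents and a phrase given as a token tuple (the function name and window handling are our own illustration):

```python
def build_pseudo_documents(docs, phrase, window=10):
    """Collect, for every occurrence of `phrase` (a tuple of tokens),
    the `window` tokens on either side into one pseudo-document."""
    n = len(phrase)
    pseudo = []
    for tokens in docs:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                # local context: words to the left and right of the phrase
                pseudo.append(tokens[max(0, i - window):i]
                              + tokens[i + n:i + n + window])
    return pseudo
```

All pseudo-documents of a phrase are later tied to a single embedding vector, so the phrase's representation summarises every context in which it occurs.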
However, due to the significant differences in the sizes and numbers of documents and pseudo-documents, there is a danger that the addition of the pseudo-documents can have a detrimental effect on the performance of the model. Thus, we adopt a two-stage training procedure. Firstly, an embedding model is trained using only the documents. Then, we fix the weights of the model and optimise the phrase embeddings by providing the pseudo-documents as the input to the model.
We have integrated the above-mentioned process with two PV approaches, namely the distributed memory model (PV-DM) of Le and Mikolov (2014), and the extension of the distributed BoW model (PV-DBOW) in Dai et al. (2015).
In the PV-DM model, the probability that a target word will appear in a given lexical context is conditioned on the surrounding co-occurring words and also the document:

$$\sum_{t \in T_W} \log p(w_t \mid C_t, d_t),$$

where $w_t$ is the target word for instance $t$, $T_W$ is the set of training word instances, $C_t = [w_{t-L}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+L}]$ are the context words that occur within a window of size $L$ around $w_t$, and $d_t$ denotes the document corresponding to the $t$-th training instance. The probability is modelled using a softmax function.
For a phrase $p$, the objective is to maximise the sum of the log probabilities $\sum_{t \in T_p} \log p(w_t \mid C_t, p)$, where the $w_t$ are the word instances that appear in the local contexts around the phrase, i.e., $T_p$ is the set of word instances across all pseudo-documents for the phrase, and $C_t$ is the set of words that occur around the $t$-th word instance within those pseudo-documents. Explicitly, the optimal embedding vector for the phrase is determined by solving the following optimisation problem:

$$z_p^{*} = \operatorname*{arg\,max}_{z_p} \sum_{t \in T_p} \log p(w_t \mid C_t, p),$$

where the prediction for each instance is made from the concatenation of $z_p$ with the word vectors $\{v_w\}$ of the context $C_t$, and both the word vectors $\{v_w\}$ and the softmax weight vectors $\{u_w\}$ are held fixed. To find an approximate solution, the parameters of the embedding vector are randomly initialised and optimised using stochastic gradient descent; the gradient is calculated via backpropagation (Rumelhart et al., 1986).
The PV-DBOW model simplifies the PV-DM model by ignoring the local context of words in the log probability function: the probability that a target word will appear in a given lexical context is conditioned solely on the document. Dai et al. (2015) introduced a modified version of the PV-DBOW model that treats words and documents as interchangeable inputs to the neural network. This enables the model to jointly learn word and document embeddings in the same space; we denote this model PV-DBOW-W. Essentially, the objective of the PV-DBOW-W model combines the skip-gram model (Mikolov et al., 2013b), which generates word embeddings, with the PV-DBOW objective used for learning document embeddings:

$$\sum_{t \in T_W} \Big( \log p(w_t \mid d_t) + \sum_{c \in C_t} \log p(w_t \mid c) \Big).$$

To optimise the embedding of a specific phrase, denoted $p$, the existing word embeddings remain fixed, and the objective function simplifies to $\sum_{t \in T_p} \log p(w_t \mid p)$, where the $w_t$ are the word instances that appear in the local contexts around the phrase. The optimal embedding vector for this phrase is determined by solving the following optimisation problem:

$$z_p^{*} = \operatorname*{arg\,max}_{z_p} \sum_{t \in T_p} \log p(w_t \mid p),$$

where the weight vectors $\{u_w\}$ are fixed. As in the previous model, the parameters of the embedding vector are randomly initialised and optimised using stochastic gradient descent.
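The fixed-weight phrase inference can be sketched with a full softmax and plain gradient ascent; the original work uses sampling-based approximations for speed, so this is only a minimal illustration (the function name, step size and epoch count are ours):

```python
import numpy as np

def infer_phrase_vector(U, context_word_ids, dim, lr=0.1, epochs=200, seed=0):
    """Optimise a phrase embedding z_p with the softmax weights U fixed,
    maximising sum_t log p(w_t | p) over the pseudo-document words."""
    rng = np.random.default_rng(seed)
    z = rng.normal(scale=0.01, size=dim)
    for _ in range(epochs):
        scores = U @ z
        scores -= scores.max()                  # numerical stability
        p = np.exp(scores)
        p /= p.sum()
        # gradient of sum_t log softmax(U z)[w_t] with respect to z
        grad = np.zeros(dim)
        for w in context_word_ids:
            grad += U[w] - U.T @ p
        z += lr * grad / len(context_word_ids)
    return z
```

After a few hundred steps the log likelihood of the context words is higher than at the (near-zero) starting point, which is all the inference step needs to achieve.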

Descriptive Phrase Selection
Given co-embeddings of documents and phrases, any clustering algorithm can be applied. We use k-means, with a cosine similarity-based distance metric, to cluster the documents. Given the set of documents within each identified cluster $G_1, \ldots, G_K$, the document embedding vectors $\{z_d\}_{d=1}^{N}$ and the descriptive phrase embedding vectors $\{z_p\}_{p=1}^{|P|}$, we then select a descriptive phrase that best represents the documents assigned to each cluster.
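Cosine-based k-means can be approximated by L2-normalising the embeddings first, since Euclidean distance between unit vectors is a monotone function of cosine distance. A sketch using scikit-learn (not the paper's exact implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def cosine_kmeans(embeddings, k, seed=0):
    """k-means under cosine distance: L2-normalise the rows so that
    squared Euclidean distance equals 2 * (1 - cosine similarity)."""
    Z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
```

Normalisation makes vector magnitude (e.g., document length effects) irrelevant to cluster assignment, which is the point of using cosine similarity here.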
A baseline approach for descriptive phrase selection is to select the phrase whose embedding vector is nearest to the cluster centroid; however, proximity to the cluster centroid is not always a good indicator of cluster membership, as it ignores the location of documents belonging to other clusters. An ideal phrase vector should lie closer to documents within the cluster than documents outside of the cluster. Accordingly, we rank documents based on their proximity to a candidate phrase and calculate the average precision of this ranking (where documents belonging to the given cluster are the true positives).
For a cluster $G$, we define the cluster membership indicator for each document $d$ as $y_d = 1$ if $d \in G$ and $y_d = 0$ otherwise. For a given phrase $p$, let $\pi_p(1)$ be the index of the document nearest to the phrase, and $\pi_p(i)$ the index of the $i$-th nearest neighbour. The precision after the $k$ nearest documents are retrieved is

$$P_{\pi_p}(k) = \frac{1}{k} \sum_{i=1}^{k} y_{\pi_p(i)}.$$

The phrase which maximises the average precision

$$\bar{P}_p = \frac{1}{|G|} \sum_{k:\, y_{\pi_p(k)} = 1} P_{\pi_p}(k)$$

is selected as the cluster descriptor, where $|G|$ is the number of documents in the cluster.
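The ranking-based selection can be sketched as follows; `select_descriptor` and the boolean membership mask are our own notation for the procedure above:

```python
import numpy as np

def average_precision(y):
    """y: 0/1 membership indicators in ranked order (nearest first)."""
    hits = np.cumsum(y)
    prec_at_k = hits / np.arange(1, len(y) + 1)
    return prec_at_k[y == 1].mean()

def select_descriptor(doc_vecs, phrase_vecs, members):
    """Pick the phrase whose proximity ranking of the documents maximises
    average precision for the cluster (`members`: boolean mask)."""
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    P = phrase_vecs / np.linalg.norm(phrase_vecs, axis=1, keepdims=True)
    sims = P @ D.T                           # phrase-document cosine sims
    aps = []
    for s in sims:
        order = np.argsort(-s)               # nearest documents first
        aps.append(average_precision(members[order].astype(float)))
    return int(np.argmax(aps)), max(aps)
```

Unlike picking the phrase nearest to the centroid, this criterion penalises phrases that also lie close to documents outside the cluster.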

Results
We evaluate the proposed PV-based descriptive clustering methods in terms of cluster quality and descriptive phrase selection. Additionally, we show a visualisation of the co-embedding space in the supplementary material.

Datasets
We use two well-known, publicly available datasets: the "Reuters-21578 Text Categorization Test Collection" from the Reuters newswire (Lewis, 1997), and the "20 Newsgroups" email dataset. We pre-process the 20 Newsgroups corpus to remove email header information, while for both datasets we extract candidate phrases using TerMine (Frantzi et al., 2000), an automatic term extraction tool.
For the Reuters corpus, we use the complete document collection for training the PV models. For evaluation, we use both the training and testing sets of the modApte split, and select the 10 categories with the largest number of documents. Moreover, we remove documents that belong to multiple categories; this process results in an evaluation set of 8,009 documents. For the 20 Newsgroups dataset, we use the complete set of 18,846 documents for training the PV models. We remove words and phrases that only appear in a single document and then remove any empty documents. This process results in an evaluation set of 18,813 documents with 20 categories, organised into four higher-level parent categories. Table 1 summarises various characteristics of the employed datasets, including: a) the number of documents, b) the number of candidate phrases and c) the category labels.

Paragraph Vector Models
In this section, we provide implementation details for the three PV models (PV-DBOW-WP, PV-DBOW-W, and PV-DM), introduce a fourth model (PV-CAT) and explain the different settings that we use throughout the experiments. The PV-DBOW-WP model is used to jointly train phrase, word and document co-embeddings. For the PV-DBOW-W and PV-DM models, we use the two-stage training approach, in which the document embeddings and softmax weights are trained first, and then the phrase co-embeddings are trained using pseudo-documents. A window size of 10 words around the target phrase is used as the local context to create the pseudo-documents.
Each PV model has a number of parameters, including the dimension of the embedded spaces and the size of the context window. We set all embedding dimensions to 100. For the PV-DBOW-W and PV-DBOW-WP models, we use a context window of 10 words/phrases, while for the PV-DM model we use a window size of 2 words (we tuned the size of the context window by applying the two PV models to a small development set of the Reuters corpus). This disparity in window size is not surprising, since the PV-DM model considers the order of words within the local context and uses different parameters for the vectors at each location in the context window (Equation (5)), whereas an increased window size does not add additional parameters to the PV-DBOW model.
We create an additional model, namely PV-CAT, by concatenating the vector representations induced by the PV-DBOW-W and the PV-DM models. This is performed after training the document and phrase vectors. Intuitively, the concatenation of the PV-DBOW-W and PV-DM feature vectors can provide complementary information, given that the two models are trained using different context window sizes (i.e., 10 and 2 words, respectively).
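PV-CAT then amounts to a simple per-item concatenation of the two learned representations (a sketch; the helper name is ours):

```python
import numpy as np

def pv_cat(z_dbow_w, z_dm):
    """Concatenate PV-DBOW-W and PV-DM vectors row-wise, so each document
    (or phrase) ends up with a single, wider representation."""
    return np.concatenate([z_dbow_w, z_dm], axis=1)
```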
Given that the size of the vocabulary is very large, computing the softmax function during stochastic gradient descent is computationally expensive. For faster training, different optimisation algorithms can be used to approximate the log probability function. We use a combination of negative sampling and hierarchical softmax via backpropagation (Mnih and Hinton, 2009; Mikolov et al., 2013b). Specifically, we use negative sampling and then further optimise the embeddings using hierarchical softmax. Although these are different optimisation approaches, both methods can be applied in this ad hoc manner.
Moreover, we follow the process described in Le and Mikolov (2014) to tune the learning rate. For this, we set the initial learning rate to 0.025 and decrease it linearly during 10 training epochs such that the learning rate is 0.001 during the last training epoch.

Baseline Methods
As our first baseline, we perform spectral clustering based on the affinity matrix produced according to the cosine similarity between the standard term frequency-inverse document frequency (tf-idf) representations of the documents. We use the normalised cut (NC) spectral clustering algorithm proposed by Shi and Malik (2000).
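A minimal version of this baseline using scikit-learn; the toy documents are illustrative, and the smoothing term is our own addition to keep the similarity graph connected:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

docs = ["oil prices rose", "oil exports fell",
        "gui interface toolkit", "gui interface update"]

# tf-idf representation and the cosine-similarity affinity matrix
A = cosine_similarity(TfidfVectorizer().fit_transform(docs))
A = 0.99 * A + 0.01        # small uniform smoothing keeps the graph connected

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
```

Because tf-idf vectors are non-negative, the cosine affinities are automatically non-negative, as spectral clustering requires.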
We also compare our proposed method to the CEDL algorithm (Mu et al., 2016), which uses a measure of second-order similarity between phrases and documents, based on their co-occurrences at the document level, to obtain a spectral co-embedding. We use the same parameters suggested in the original publication, but carry out minor changes to the algorithm to allow the method to be scaled up to larger datasets. To compare clustering performance, we also run the CEDL algorithm without the phrase co-embeddings.

Evaluation of Cluster Quality
In this experiment, we evaluate the clustering performance of the methods by comparing automatically generated document clusters against the gold standard categories. For all methods, we use k-means clustering with cosine similarity as the distance metric. Following previous approaches (Xie and Xing, 2013), we set the number of clusters equal to the number of gold standard categories. As evaluation metrics, we use the macro-averaged F1 score and normalised mutual information. Table 2 compares the clustering performance achieved by the four PV models (PV-DBOW-WP, PV-DBOW-W, PV-DM, and PV-CAT) against the performance of the baselines (i.e., the two versions of the CEDL algorithm and spectral clustering via normalised cut). The PV-DBOW-W and PV-CAT methods yield the best clustering performance on the Reuters dataset. Performance gains over the three baseline methods range from 6% to 12% (F1 score) and from 8% to 9% (normalised mutual information). On the 20 Newsgroups dataset, the PV-DBOW-WP and PV-CAT models outperformed the baseline methods [4] by approximately 2% to 21% (F1 score) and 3% to 18% (normalised mutual information). Moreover, we note that the performance achieved by the PV-CAT model exceeds the best results reported in Xie and Xing (2013) (normalised mutual information of 0.6159 using a multi-grain clustering topic model).
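The evaluation above can be sketched as follows; matching clusters to categories with the Hungarian algorithm before computing macro-F1 is a common convention and may differ in detail from the paper's exact protocol:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, f1_score

def clustering_scores(true, pred):
    """Return (NMI, macro-F1) for a clustering against gold categories,
    after optimally matching cluster ids to category ids."""
    true, pred = np.asarray(true), np.asarray(pred)
    k = max(true.max(), pred.max()) + 1
    cost = np.zeros((k, k))
    for t, p in zip(true, pred):
        cost[p, t] -= 1                 # negated overlap counts
    rows, cols = linear_sum_assignment(cost)
    mapping = dict(zip(rows, cols))
    matched = np.array([mapping[p] for p in pred])
    return (normalized_mutual_info_score(true, pred),
            f1_score(true, matched, average="macro"))
```

NMI is invariant to label permutation, but F1 is not, which is why the matching step is needed before the per-category scores are averaged.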
Finally, we note that co-embedding is not designed as a mechanism for improving cluster quality. For CEDL, the co-embedding of phrases reduced the clustering performance in most datasets. This reduction in performance is equally observed for the paragraph vector models when applied to the Reuters dataset: the jointly trained co-embedding model (i.e., PV-DBOW-WP) achieved a lower performance than the two-stage approach PV-DBOW-W (F1 scores of 0.56 and 0.66, respectively).

[4] Our implementation of the CEDL algorithm did not scale up to the entire dataset ('all'), but average results on random subsets were consistent, as shown in the table.

Evaluation of Cluster Labelling
In this section, we evaluate the cluster labels (i.e., multi-word phrases) selected by the proposed PV-based descriptive clustering methods. As a baseline approach, we use the CEDL algorithm, which produces a co-embedded space of documents and phrases.
For each document cluster, we apply the phrase selection criterion, Equation (9), to identify the phrase that best describes the underlying cluster. Then for each gold standard category, the cluster having the highest proportion of documents belonging to the category is determined. This process means some clusters are assigned to multiple categories while other clusters are left unassigned. For each assigned cluster, we rank all documents according to their similarity to the automatically selected phrase (in the co-embedding space), where documents within the cluster have precedence over documents outside the cluster.
We evaluate the quality of a cluster label by computing the average precision of this ranking in recalling the gold standard category. The average precision is maximised when the documents closest to the selected phrase belong to the gold standard category. Table 3 shows the selected cluster descriptors aligned to the gold standard categories, together with the average precision and mean average precision scores achieved by the CEDL method and the PV-based descriptive clustering models when applied to the Reuters and 20 Newsgroups datasets.
The Reuters dataset presents a challenging case for descriptive clustering methods, given that the distribution of gold standard categories is highly skewed: the majority categories (e.g., 'earn' and 'acq') correspond to more than one cluster, while the remaining clusters cover multiple smaller categories. Nonetheless, we observe that the automatically selected cluster descriptors are related to the corresponding gold standard categories (e.g., 'import coffee' and 'oil export' for the gold standard category 'ship'). In practice, the skewed distribution of gold standard categories can be addressed by using a larger number of clusters in k-means, or by using a clustering algorithm more amenable to heterogeneously sized clusters.
The 20 Newsgroups dataset shows a more balanced distribution of categories than the Reuters corpus, and we note that all descriptive clustering methods were able to identify meaningful cluster descriptors that have a clear correspondence to the gold standard categories (e.g., 'window version' and 'dos app' for the category 'comp.os.ms-windows.misc'). With regard to the mean average precision, we observe that the PV-DBOW-W and PV-CAT models obtained the best performance. Moreover, the PV-CAT model achieved statistically significant improvements over the CEDL baseline in terms of the average precision across the 28 categories, while no significant improvement was observed for the remaining three PV-based models [6].

Table 3: Cluster descriptions and average precision (as percentages) achieved by descriptive clustering methods. CE1: CEDL, PV1: PV-DBOW-WP, PV2: PV-DBOW-W, PV3: PV-DM, PV4: PV-CAT. The average precision metric depends not only on the phrase but also on the location of documents relative to the selected phrase; consequently, the average precision of a phrase may vary among the embeddings.

The results that we obtained demonstrate that the PV-based co-embedding space can effectively capture semantic similarities between documents and phrases. An illustrative example of this is shown in Table 4. In this example, we selected two documents that neighbour the phrase "user interface" in the PV-CAT co-embedded feature space for the 20 Newsgroups dataset. It can be noted that although neither of the two documents explicitly contains the input phrase, the first discusses a semantically similar topic, and the second uses the acronym GUI, i.e., graphical user interface.

[6] For significance testing, we used a paired sign-test, with a significance threshold of 0.05 and Bonferroni multiple-test correction for the 4 tests; the uncorrected p-value for the PV-CAT model is 0.0009.
As another example, we generate a two-dimensional visualisation of the document-phrase co-embeddings using t-SNE (van der Maaten and Hinton, 2008), which demonstrates how co-embedded phrases can be used as 'landmarks' for exploring a corpus. For this example, we use the 'sci' categories from the 20 Newsgroups corpus and select the 200 most frequent phrases in this subset. As input to t-SNE, we use the chordal distance defined by the cosine similarity in the co-embedding space and set the perplexity to 40. Figure 1 in the supplementary material shows the visualisation with the cluster boundaries, and the locations of the documents and co-embedded phrases.
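A sketch of this visualisation step using scikit-learn's t-SNE (with `metric="precomputed"`, which requires `init="random"`); the chordal distance is derived from cosine similarity as d = sqrt(2 - 2 cos):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_layout(embeddings, perplexity=40, seed=0):
    """2-D layout of co-embedded documents and phrases using the chordal
    distance induced by the cosine similarity of the vectors."""
    Z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = np.clip(Z @ Z.T, -1.0, 1.0)
    dist = np.sqrt(np.maximum(2.0 - 2.0 * cos, 0.0))   # chordal distance
    return TSNE(n_components=2, metric="precomputed", init="random",
                perplexity=perplexity, random_state=seed).fit_transform(dist)
```

Documents and phrases are laid out jointly, so each phrase lands amid the documents it is closest to in the co-embedding space; note that t-SNE requires the perplexity to be smaller than the number of points.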

Conclusion
Descriptive document clustering helps information retrieval tasks by automatically organising document collections into semantically coherent groups and assigning descriptive labels to each group. In this paper, we have presented a descriptive clustering method that uses paragraph vector models to support accurate clustering of documents and selection of meaningful and precise cluster descriptors. Our PV-based approach maps phrases and documents to a common feature space to enable the straightforward assignment of descriptive phrases to clusters. We have compared our approach to another state-of-the-art algorithm employing a co-embedding based on bag-of-words representations. The PV-based descriptive clustering method achieved superior clustering performance on both the Reuters and the 20 Newsgroups datasets. An evaluation of the selected cluster descriptors showed that our method selects informative phrases that accurately characterise the content of each cluster.

Table 4: Two documents whose vector embeddings were the 5th and 6th nearest neighbours (according to the cosine of the angle of the corresponding vectors) to the phrase "user interface" in the PV-CAT based co-embedded space.

train/comp.windows.x 67337 Does anyone know the difference between MOOLIT and OLIT? Does Sun support MOOLIT? Is MOOLIT available on Sparcstations? MoOLIT (Motif/Open Look Intrinsic Toolkit allows developers to build applications that can switch between Motif and Open Look at run-time, while OLIT only gives you Open Look. Internet: chunhong@vnet.ibm.com

test/comp.windows.x 68238 Hi there, I'm looking for tools that can make X programming easy. I would like to have a tool that will enable to create X motif GUI Interactivly. Currently I'm Working on a SGI with forms. A package that enables to create GUI with no coding at all (but the callbacks). Any help will be appreciated. Thanks Gabi.