Fusing Document, Collection and Label Graph-based Representations with Word Embeddings for Text Classification

Contrary to the traditional Bag-of-Words approach, we consider the Graph-of-Words (GoW) model, in which each document is represented by a graph that encodes relationships between the different terms. Based on this formulation, the importance of a term is determined by weighting the corresponding node in the document, collection and label graphs, using node centrality criteria. We also introduce novel graph-based weighting schemes by enriching graphs with word-embedding similarities, in order to reward or penalize semantic relationships. Our methods produce more discriminative feature weights for text categorization, outperforming existing frequency-based criteria.


Introduction
With the rapid growth of social media and networking platforms, the amount of available textual content has increased dramatically. Text categorization or classification (TC) refers to the supervised learning task of assigning a document to a set of two or more predefined categories (or classes) (Sebastiani, 2002). Well-known applications of TC include sentiment analysis, spam detection and news classification.
In the TC pipeline, each document is modeled using the so-called Vector Space Model (Baeza-Yates and Ribeiro-Neto, 1999). The main issue here is how to find appropriate weights regarding the importance of each term in a document. Typically, the Bag-of-Words (BoW) model is applied and a document is represented as a multiset of its terms, disregarding co-occurrence between the terms; under this model, the importance of a term in a document is mainly determined by its frequency. Although several variants and extensions of this modeling approach have been proposed (e.g., the n-gram model (Baeza-Yates and Ribeiro-Neto, 1999)), the main weakness comes from the underlying term independence assumption, where the order of the terms is also completely disregarded. After the introduction of deep learning models for TC (Blunsom et al., 2014; Kim, 2014), recent work by Johnson and Zhang (2015) shows how we can effectively use the order of words with CNNs (LeCun et al., 1995). In many cases though, space and time limitations may arise due to complex neural network architectures; as stated by Joulin et al. (2017), computation can still be expensive and prohibitive. Code and data: github.com/y3nk0/Graph-Based-TC
In this paper, we explore fast term weighting criteria for TC that go beyond the term independence assumption. The notion of dependencies between terms is introduced via a Graph-of-Words (GoW) representation model. Under this model, each term is represented as a node in the graph and the edges capture co-occurrence relationships of terms with a specified distance in the document. We implicitly consider information about n-grams in the document as well as the collection of documents -expressed by paths in the graph -without increasing the dimensionality of the problem. Furthermore, we introduce word-embedding similarities as weights in the GoW approach, in order to further boost the performance of our methods. Finally, we successfully mix document, collection and label GoWs along with word vector similarities into a single powerful graph-based framework. An overview of our approach is shown in Fig. 1.

Related Work
Term weighting schemes. A core aspect of the Vector Space Model for document representation is how to determine the importance of a term within a document. Many criteria have been introduced, with the most prominent ones being TF, TF-IDF (Salton and Buckley, 1988; Singhal et al., 1996; Baeza-Yates and Ribeiro-Neto, 1999; Robertson, 2004) and Okapi BM25 (Robertson et al., 1995), while more recent ones include N-gram IDF (Shirakawa et al., 2015). Lan et al. (2005) conducted a comparative study of frequency-based term weighting criteria for text categorization; one of their findings was that, in many cases, the IDF factor is not significant for the categorization task, leading to no improvement in performance. It is interesting to point out that more specialized approaches have been proposed for specific classification tasks, such as the Delta TF-IDF method, an extension of TF-IDF for sentiment analysis (Martineau and Finin, 2009). However, most previously proposed frequency-based weights treat the document as a Bag-of-Words; that way, any structural information about the ordering or, in general, the syntactic relationships of the terms is ignored by the weighting process.
Text categorization. A number of diverse approaches have been proposed for TC (Joachims, 1998; McCallum and Nigam, 1998; Nigam et al., 2000; Sebastiani, 2002; Kim et al., 2006). The first step of TC concerns the feature extraction task, i.e., which features will be used to represent the textual content. Typically, the straightforward Bag-of-Words approach is adopted, where every document is represented by a feature vector that contains boolean or weighted representations of unigrams or n-grams in general. In the case of weighted feature vectors, various term weighting schemes have been used, the most well-known being TF (Term Frequency) and TF-IDF (Term Frequency-Inverse Document Frequency). Although these weighting schemes were initially introduced in the NLP and IR fields, they have also been applied to the TC task. Paltoglou and Thelwall (2010) reported that, in the case of sentiment analysis, extensions of the TF-IDF weighting schemes introduced in the IR field can further improve the classification accuracy. A comprehensive review of this area is offered in the article by Sebastiani (2002).
Deep Learning for TC. With the rise of deep learning models, CNNs were applied to text classification (Blunsom et al., 2014; Kim, 2014; Johnson and Zhang, 2015). Character-level CNNs were later presented for the task of TC. Finally, Joulin et al. (2017) proposed a text classifier which achieves performance equivalent to state-of-the-art TC models, with faster learning times. Our work does not focus on the classifier, as the aforementioned methods do, but on the extraction of better features.
Graph-based text categorization. In the related literature, most graph-based methods for TC rely on graph mining algorithms that extract frequent subgraphs, which are then used to produce feature vectors for classification (Deshpande et al., 2005; Jiang et al., 2010; Nikolentzos et al., 2017). The basic shortcoming of those methods stems from the computational complexity of the frequent subgraph mining algorithms. Furthermore, most of these methods require the user to set a support parameter, which concerns the frequency of appearance of a subgraph. Close to our work are the approaches of Hassan et al. (2007) and Malliaros and Skianis (2015); they explored how random walks and other graph centrality criteria can be applied to determine the importance of a term.
Graph-based text mining, NLP and IR. Representing documents as graphs is a well-known approach in NLP and IR. The TextRank algorithm, proposed by Mihalcea and Tarau (2004), was among the first works to consider a random walk model similar to PageRank over a graph representation of the document, in order to extract representative keywords and sentences. Later, several methods for those tasks followed (Erkan and Radev, 2004; Litvak and Last, 2008; Boudin, 2013; Lahiri et al., 2014). Another domain where graph-based term weighting schemes have been applied is ad hoc Information Retrieval (Rousseau and Vazirgiannis, 2013). The survey by Blanco and Lioma (2012) offers a detailed description of graph-based methods in the text domain.

Preliminaries and Background
Let D = {d_1, d_2, . . . , d_m} be a collection of documents and let C = {c_1, c_2, . . . , c_|C|} be the set of predefined categories. Text categorization is the task of assigning a boolean value to each pair (d_i, c_j) ∈ D × C, i.e., assigning each document to one or more categories (Sebastiani, 2002). The main point here is how to find appropriate weights for the terms within a document. As we present shortly, our approach utilizes network centrality criteria.
Node Centrality Criteria. Centrality (see en.wikipedia.org/wiki/Centrality) is a central notion in graph theory and network analysis in general; it comprises measures that capture the relative importance of a node in the graph based on specific criteria (Newman, 2010). One important characteristic of centrality measures is that they consider either local information of the graph (e.g., degree centrality, in-degree/out-degree centrality in directed networks, weighted degree in weighted graphs, clustering coefficient) (Newman, 2010), or more global information - in the sense that the importance of a node is determined by its properties in the graph as a whole (e.g., PageRank, closeness). Let G = (V, E) be a graph (directed or undirected), and let |V|, |E| be the number of nodes and edges respectively. Next, we define the basic centrality criteria that are used in the proposed methodology.
Degree centrality. The degree centrality is one of the simplest local node importance criteria, capturing the number of neighbors that each node has. Let N(i) be the set of nodes connected to node i. Then, the degree centrality is given by: degree(i) = |N(i)| / (|V| - 1). Closeness centrality. Let dist(i, j) be the shortest path distance between nodes i and j. The closeness centrality of a node i is defined as the inverse of the average shortest path distance from the node to any other node in the graph: closeness(i) = (|V| - 1) / Σ_{j ∈ V} dist(i, j). PageRank centrality. PageRank counts the number and quality of edges pointing to a node to determine a rough estimate of how important the node is: PR(i) = α / |V| + (1 - α) Σ_{j ∈ N_in(i)} PR(j) / out-deg(j), where α is the teleportation probability, N_in(i) is the set of nodes linking to i, and out-deg(j) denotes the out-degree of node j.
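These three criteria are available in off-the-shelf graph libraries; the following sketch (using networkx, with a toy graph rather than a real document) illustrates them:

```python
import networkx as nx

# Toy term graph: nodes are terms, edges are co-occurrence relationships.
G = nx.Graph()
G.add_edges_from([("text", "mining"), ("text", "graph"),
                  ("graph", "mining"), ("graph", "node")])

degree = nx.degree_centrality(G)        # |N(i)| / (|V| - 1)
closeness = nx.closeness_centrality(G)  # (|V| - 1) / sum of shortest-path distances
pagerank = nx.pagerank(G, alpha=0.85)   # teleportation probability 0.15
```

Here "graph" is connected to all three other nodes, so its degree and closeness centralities equal 1.0, and it also obtains the highest PageRank score.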

Proposed Framework
In this section, we present the components of the proposed graph-based framework for TC.

Graph Construction
We model documents as graphs that capture dependencies between terms. More precisely, each document d ∈ D is represented by a graph where the nodes correspond to the terms t of the document and the edges capture co-occurrence relationships between terms within a fixed-size sliding window of size w. That is, for all the terms that co-occur within the window, we add edges between the corresponding nodes of the graph. Note that the windows are overlapping, starting from the first term of the document; at each step, we simply remove the first term of the window and add the next one from the document. As graphs constitute rich modeling structures, several parameters of the construction phase need to be specified, including the directionality of the edges, the addition of edge weights, as well as the size w of the sliding window. Fig. 2 gives a toy example of the construction of GoW for a collection composed of two documents.
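A minimal sketch of this construction (in Python with networkx; the tokenization and window handling are simplified assumptions of this illustration, not the authors' exact implementation):

```python
import networkx as nx

def build_gow(terms, w=3):
    """Undirected Graph-of-Words: connect terms that co-occur
    within a sliding window of size w (i.e., at distance < w)."""
    g = nx.Graph()
    g.add_nodes_from(terms)
    for i in range(len(terms)):
        for j in range(i + 1, min(i + w, len(terms))):
            g.add_edge(terms[i], terms[j])
    return g

doc = "graph of words model for text categorization".split()
g = build_gow(doc, w=3)
```

With w = 3, "graph" is linked to "of" and "words" but not to "model", since only terms within the same window co-occur.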
To summarize, the key point of the graph-based representation for TC is that it addresses the term independence assumption. Even if we consider the n-gram model, information about the relationship between two different n-grams is still not fully captured - as it is in the case of graphs. This has also been noted in other application domains (e.g., IR (Rousseau and Vazirgiannis, 2013)).

Term Weighting
Having the graph, the importance of a term in a document can be inferred from the importance of the corresponding node in the graph. In the previous section, we presented local and global centrality criteria that have been widely used for graph mining and network analysis purposes; here, we propose that those criteria can also be used for weighting terms in the TC task. That way, similar to TF, we can define the Term Weight (TW) weighting scheme as TW(t, d) = centrality(t, d), where centrality(t, d) corresponds to the score of term (node) t in the graph representation G_d of document d. The interesting point here is that TW can be used along with any centrality criterion in the graph, local or global. Furthermore, we can extend this weighting scheme by considering information about the inverse document frequency (IDF factor) of the term t in the collection D. That way, we can derive the TW-IDF model as follows: TW-IDF(t, d) = TW(t, d) × IDF(t, D). (1) In fact, TW and TW-IDF constitute families of graph-based term-weighting schemes and thus can be applied in any text analytics task. Some of them have already been explored in graph-based IR (Rousseau and Vazirgiannis, 2013) and keyword extraction (Mihalcea and Tarau, 2004). The proposed weights are inferred from the interconnection of features (i.e., terms) - as suggested by the graph - and therefore information about n-grams is implicitly captured. That way, the feature space of the learning problem is kept to the one defined by the unique unigrams of our collection (instead of using simultaneously as features all the possible unigrams, bigrams, 3-grams, etc.), but the produced term weights incorporate n-gram information through the graph-based representation.
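TW-IDF can be sketched by combining a node-centrality score with the standard IDF factor; the function and toy graphs below are our own illustration (using networkx), not the paper's code:

```python
import math
import networkx as nx

def tw_idf(doc_graphs, centrality=nx.degree_centrality):
    """TW-IDF(t, d) = centrality of node t in G_d  x  log(|D| / df(t))."""
    n_docs = len(doc_graphs)
    df = {}                                  # document frequency of each term
    for g in doc_graphs:
        for t in g.nodes:
            df[t] = df.get(t, 0) + 1
    weights = []
    for g in doc_graphs:
        tw = centrality(g)
        weights.append({t: tw[t] * math.log(n_docs / df[t]) for t in g.nodes})
    return weights

g1 = nx.Graph([("cat", "sat"), ("sat", "mat")])
g2 = nx.Graph([("dog", "sat"), ("sat", "mat")])
weights = tw_idf([g1, g2])
```

Terms that appear in every document (here "sat") receive zero weight through the IDF factor, while document-specific terms like "cat" keep a positive, centrality-scaled weight.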

Inverse Collection Weight (ICW)
In this paragraph, we introduce the concept of Inverse Collection Weight (ICW) - a graph-based criterion to penalize the weight of terms that are "important" across the whole collection of documents. The main concept behind ICW is the collection level graph G - an extension of the Graph-of-Words to the collection of documents D.
Definition 1 (Collection Level Graph G) Let {G_1, G_2, . . . , G_|D|} be the set of graphs that correspond to the documents d ∈ D. The collection level graph G is defined as the union of graphs G_1 ∪ G_2 ∪ . . . ∪ G_|D| over all documents in the collection.
The union of two graphs G = (V_G, E_G) and H = (V_H, E_H) is defined as the union of their node and edge sets, i.e., G ∪ H = (V_G ∪ V_H, E_G ∪ E_H). The number of nodes in graph G is equal to the number of unique terms in the collection, while the number of edges is equal to the number of unique edges over all document-level graphs (see also Fig. 2).
This graph captures the overall dependencies between the terms of the collection; the relative overall importance of a term in the collection will be proportional to the importance of the corresponding node in G. Following similar methodological arguments as used for IDF (Robertson, 2004), we define a probability distribution over the nodes of G (or equivalently, the unique terms of D), with respect to a centrality (term-weighting in our case) criterion; then, the probability of node (term) t will be: p(t) = TW(t, D) / max_{t' ∈ D} TW(t', D). (2) Note that, in Eq.
(2), we use D instead of G; we consider that the space defined by the document collection D is equivalent to the one defined by graph G with respect to the unique terms of the collection. This way, the notion of TW(t, D) used here is consistent with what was described earlier.
Based on this, we define the ICW measure as: ICW(t, D) = log( max_{t' ∈ D} TW(t', D) / TW(t, D) ). (3) Instead of selecting the maximum centrality at the collection level (Eq. (3)), the sum of all centralities also yields good results. ICW shares common intuition with the inverse total term frequency described in Robertson (2004). In fact, it can be considered an extension of the total collection frequency of a term to the graph-based document representation. Furthermore, similar to TW, it can be used along with any node centrality criterion.
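A sketch of the collection graph and the ICW computation (networkx; the function name is ours, and `agg=max` corresponds to the MAX numerator while `agg=sum` gives the SUM variant):

```python
import math
import networkx as nx

def icw(collection_graph, term, centrality=nx.degree_centrality, agg=max):
    """ICW(t, D) = log( agg_{t'} TW(t', D) / TW(t, D) )."""
    tw = centrality(collection_graph)
    return math.log(agg(tw.values()) / tw[term])

# Collection level graph: union of the document-level graphs.
g1 = nx.Graph([("cat", "sat"), ("sat", "mat")])
g2 = nx.Graph([("dog", "sat"), ("sat", "mat")])
G = nx.compose(g1, g2)   # union of node and edge sets
```

The most central term of the collection ("sat" above) receives ICW = 0, i.e., maximal penalization, while less central terms get higher values.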
Using ICW as a graph-based collection-level term penalization factor, we derive a new class of term-to-document weighting mechanism, namely TW-ICW. This weighting scheme is derived combining different local (i.e., document-level) and global (i.e., collection-level) criteria as follows: (ICW(t, D)).
In the case of TW and ICW, any centrality criterion can be applied. However, computational complexity is a crucial factor that should be taken into account. Nevertheless, as we observed in the experimental evaluation, even simple and easy-to-compute local criteria (e.g., degree) achieve good classification performance.

Label Graphs
Shanavas et al. (2016) introduced supervised term weighting (TW-CRC) as a method to integrate class information with graphs. Similarly, we create a graph for each class (label), where we add all words of documents belonging to the respective class as nodes and their co-occurrences as edges. Our weighting scheme is a variant of TW-CRC; we define LW for a term t as: LW(t) = max_{l ∈ L} deg_l(t) / max( avg_{l' ≠ l*} deg_{l'}(t), min_{t', l} deg_l(t') ), where deg_l(t) is the degree of term t in label graph l and l* is the label graph in which t attains its maximum degree; that is, the maximum degree of term t over all label graphs L is divided by the maximum of two values: the average degree of the term in all other label graphs and the minimum degree over all terms in all label graphs. Then, we obtain ICW-LW as follows: ICW-LW(t, d) = log(ICW(t, D) × LW(t)), and multiply it with TW(t, d) to get TW-ICW-LW. Note that supervised frequency-based methods have also been proposed in previous work (Debole and Sebastiani, 2004; Huynh et al., 2011).
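The LW ratio can be sketched as follows (plain Python; representing each label graph by a per-label degree dictionary is an assumption of this illustration):

```python
def label_weight(term, label_degrees):
    """LW(t): the maximum degree of t over all label graphs, divided by
    max(average degree of t in the remaining label graphs,
        minimum degree over all terms in all label graphs)."""
    degs = [d.get(term, 0) for d in label_degrees]
    best = max(degs)
    best_idx = degs.index(best)
    others = [x for i, x in enumerate(degs) if i != best_idx]
    avg_others = sum(others) / len(others) if others else 0.0
    global_min = min(v for d in label_degrees for v in d.values())
    return best / max(avg_others, global_min)

labels = [{"good": 5, "film": 3}, {"bad": 4, "film": 3}]
```

A class-specific term like "good" (degree 5 in one label graph, absent from the other) gets LW = 5/3 here, while "film", equally central in both classes, gets LW = 1.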

Edge Weighting using Word Embeddings
With our proposed framework, we can now use word embeddings (Bengio et al., 2003) in order to extract similarities between terms. Our goal is to integrate these similarities in the graph representation as weights on the edge between two words.
The key idea behind our approach is that we want to reward semantically close words at the document-graph level (TW) and penalize them at the collection level (ICW). The most commonly used similarity between two words t_1 and t_2 in the word-embedding space is cosine similarity, which ranges between -1 and 1. In order to obtain a valid edge weight, we need to bound it between 0 and 1. We use the angular similarity to represent the weight of an edge between two words, and since the vector elements may be positive or negative, the formula becomes: sim(t_1, t_2) = 1 - arccos(cos(t_1, t_2)) / π. The best performance was given by using Google's pre-trained word embeddings (Mikolov et al., 2013) rather than learning them from the datasets. Since the words included in the pre-trained version of word2vec are case sensitive and not stemmed, we did not apply either of these transformations. For words that do not appear in word2vec, we add a small value as similarity. Other distances (e.g., inverse euclidean, fractional) did not yield any further improvement. A similar approach for generic keyphrase extraction can be found in the work of Wang et al. (2015).
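The angular similarity maps cosine similarity from [-1, 1] to [0, 1]; a minimal sketch:

```python
import math

def angular_similarity(u, v):
    """1 - arccos(cos_sim(u, v)) / pi: identical directions -> 1,
    orthogonal -> 0.5, opposite directions -> 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return 1.0 - math.acos(cos) / math.pi
```

In practice u and v would be the word2vec vectors of the two terms whose edge is being weighted.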
Providing more information in the weights, like number of co-occurrences between words, did not yield better results.

Classification Algorithms
Since the goal of this paper is to introduce new term weighting schemes, we rely on widely used classification algorithms. Specifically, we used linear SVMs, due to their superior performance in TC (Joachims, 1998). Furthermore, as discussed by Leopold and Kindermann (2002), the choice of the SVM kernel function is not very crucial compared to the significance of the term weighting schemes.

Experiments
We have evaluated our method on six freely available standard TC datasets, covering multi-class document categorization, sentiment analysis and subjectivity detection. Specifically: (1) 20NG: newsgroup documents belonging to 20 categories; (2) REUTERS: 8 categories of Reuters-21578; (3) WEBKB: the 4 most frequent categories of webpages from Computer Science departments; (4) IMDB (Pang and Lee, 2004): positive and negative movie reviews; (5) AMAZON (Blitzer et al., 2007): product reviews acquired from Amazon over four different sub-collections; (6) SUBJECTIVITY (Pang and Lee, 2004): subjective sentences gathered from Rotten Tomatoes and objective sentences gathered from IMDB. A summary of the datasets can be found in Table 1.
In the experiments, linear SVMs were used with grid-search cross-validation for tuning the C parameter. We also examined logistic regression and observed similar performance. In the text preprocessing step, we removed stopwords. No stemming or lowercase transformation was applied, in order to match the words in word2vec.
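The classification step described above can be reproduced with scikit-learn; a sketch on toy data (TfidfVectorizer stands in for the paper's graph-based term weights, which would replace it in the full pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

docs = ["good great film", "wonderful touching film",
        "bad awful film", "terrible boring film"]
labels = [1, 1, 0, 0]

X = TfidfVectorizer().fit_transform(docs)  # placeholder for graph-based weights
clf = GridSearchCV(LinearSVC(), {"C": [0.1, 1, 10]}, cv=2)  # tune C as above
clf.fit(X, labels)
```

Swapping the vectorizer's output for a TW-ICW-LW-weighted term-document matrix leaves the rest of the pipeline unchanged.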
For evaluation, we use the macro-average F1 score and classification accuracy on the test sets; that way, we deal with the skewed class size distribution of some datasets (Sebastiani, 2002). For the notation of the proposed schemes, we use TW (centrality measure) (e.g., TW (degree)) to indicate the centrality, and TW-ICW (centrality at G_d, centrality at G) (e.g., TW-ICW (degree, degree)) for the document-level and collection-level graphs respectively. In TW-IDF (w2v), we compute the weighted degree centrality at the document level, with word-embedding similarities as weights. Similarly, in TW-ICW (w2v) we compute both weighted centralities for the document and collection graphs. Finally, we denote as TW-ICW-LW the blending of TW, ICW and label graphs (LW). In label graphs we only make use of the degree centrality, since it is fast and performs best. Table 2 presents the results concerning the categorization performance of the proposed schemes on the six datasets. As discussed previously, the size of the window used to create the graphs is one of the model's parameters. From the extensive experimental evaluation that we performed, we concluded that small window sizes give the most consistent results across various datasets and weighting schemes. For completeness of presentation, we report results for two window sizes. In order to capture more information, larger window sizes are needed for datasets with short documents (e.g., SUBJECTIVITY). Also, since for the baseline methods (TF, TF binary, TF-IDF, w2v, TF-IDF-w2v) there is no notion of window size, the results for w = {2, 3} are the same. We have also examined several centrality criteria (using both undirected and directed graphs), with undirected graphs giving better results.

Results
Comparing TF to the graph-based schemes, namely TW (degree), in almost all cases TW gives higher F1 and accuracy results. Similar observations can be made in the case where the IDF penalization is applied. In most of the datasets, the TW-IDF (degree) scheme performs quite well. The interesting point here, which is confirmed by the related literature (Lan et al., 2005), is that TF-IDF is in general inferior to TF in TC. However, when the IDF penalization factor is applied on the TW term-to-document weighting, a powerful mechanism is derived. In the case of purely graph-based schemes, we observe that some of them produce very good classification results. In almost all cases, TW-ICW-LW (degree or closeness) achieves the best performance.

[Table 2: Macro-F1 and accuracy for window size w. Bold shows the best performance for each window size and blue the best overall on each dataset. * indicates statistically significant improvement over TF at p < 0.05 using a micro sign test. MAX and SUM state the best numerator for ICW in Eq. (3): 20NG (MAX) with w = {3, 4}, IMDB (SUM) with w = {2, 3}, SUBJECTIVITY (MAX) with w = {6, 7}.]
Significant improvement is observed by adding the w2v similarities as weights in the document, collection level and label graphs in almost all datasets. In fact, we have obtained better results in 20NG (TW-ICW (w2v)), WebKB (TW-ICW (w2v)) and Reuters (TW-IDF(w2v)), by boosting semantically close words in the document level and penalizing them in the collection level.
A TF n-gram binary scheme (TF binary) has also been examined, i.e., all the possible n-grams of the collection with binary weights (up to 6-grams in our experiments). For comparison, the size of the unigram feature space considered by our framework is equal to the number of unique terms in the collection, much smaller than the n-gram one. Moreover, graph-based weighting is able to outperform TF (binary) in all datasets.
We clearly see that by fusing document, collection and label graphs we obtain the best results in 5 out of 6 datasets. Label graph information constitutes a powerful weighting method when combined with our proposed collection-level graph approach. Adding word2vec similarities as weights when label graphs are used does not improve the accuracy. This implies that important terms belonging to different labels can be close in the word vector space. Choosing closeness at the document level yields the best performance in 3 datasets. Closeness can only have an effect for larger document lengths and when used along with label graphs. To further investigate the effectiveness of our approach, we have compared our results with current state-of-the-art graph-based and non-graph-based methods. In Table 3 we compare against a CNN for text classification without pre-trained word vectors (Kim, 2014), FastText (Joulin et al., 2017), TextRank (Mihalcea and Tarau, 2004), Word Attraction weights based on word2vec similarities (Wang et al., 2015) and Supervised Term Weighting (TW-CRC) by Shanavas et al. (2016). Our methods produce results comparable to the state of the art. Since we used our own implementations of most of these models, their reported performance may not be optimal.
Selecting the window size w is also important. As we observed, the maximum accuracy is achieved with small window sizes. In any case, even if larger values of w were able to obtain slightly better results, a smaller window size would be preferable, due to the overhead that larger windows introduce (an increase in the density of the graph). Figure 3 depicts the F1 score and accuracy on the WEBKB, REUTERS and SUBJECTIVITY datasets, using the TW, TW-ICW and TW-ICW-LW (degree) schemes for various window sizes. We also notice that larger sliding windows only improve accuracy on datasets with small document length (e.g., SUBJECTIVITY).

Conclusion & Future Work
In this paper, we proposed a graph-based framework for TC. By treating the term weighting task as a node ranking problem over interconnected features defined by a graph, we were able to determine the importance of a term using node centrality criteria. Building on this formulation, we introduced simple-yet-effective weighting schemes at the collection and label level, in order to penalize globally important terms (analogous to "globally frequent terms") and reward locally important terms respectively. We also incorporated additional word-embedding information as weights in the graph-based representations.
Our proposed methods could also be applied in IR. In fact, document-level graph-based term weighting has already been applied there, so it would be interesting to examine the performance of the proposed collection-level (ICW) penalization mechanism. In the unsupervised scenario, where label information is not available, community detection algorithms may be applied to identify clusters of words or documents in collection graphs. Graph-based representations of text could also be fitted into deep learning architectures, following the idea of Lei et al. (2015). Lastly, one could examine a Graph-of-Documents approach, in which nodes represent documents and edges correspond to similarities between them. In this case, graph kernels could be utilized for graph comparison, and/or the Word Mover's Distance (Kusner et al., 2015) between two documents could serve as edge weights.