Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification

Short text classification has found rich and critical applications in news and tweet tagging to help users find relevant information. Due to the lack of labeled training data in many practical use cases, there is a pressing need for studying semi-supervised short text classification. Most existing studies focus on long texts and achieve unsatisfactory performance on short texts due to their semantic sparsity and the limited labeled data. In this paper, we propose a novel heterogeneous graph neural network based method for semi-supervised short text classification, which takes full advantage of few labeled data and large unlabeled data through information propagation along the graph. In particular, we first present a flexible HIN (heterogeneous information network) framework for modeling the short texts, which can integrate any type of additional information and capture the rich relations among the texts and the additional information to address the semantic sparsity. Then, we propose Heterogeneous Graph ATtention networks (HGAT) to embed the HIN for short text classification based on a dual-level attention mechanism, including node-level and type-level attentions. The attention mechanism can learn the importance of different neighboring nodes as well as the importance of different node (information) types to a given node. Extensive experimental results demonstrate that our proposed model significantly outperforms state-of-the-art methods across six benchmark datasets.


Introduction
With the rapid development of online social media and e-commerce, short texts, such as online news, queries, reviews, and tweets, are increasingly widespread on the Internet (Song et al., 2014). Short text classification can be widely applied in many domains, ranging from sentiment analysis to news tagging/categorization and query intent classification (Aggarwal and Zhai, 2012; Meng et al., 2018). In many practical scenarios, labeled data is scarce, as human labeling is time-consuming and may require expert knowledge (Aggarwal and Zhai, 2012). As a consequence, there is a pressing need for studying semi-supervised short text classification with a relatively small number of labeled training data.
Nevertheless, semi-supervised short text classification is nontrivial due to the following challenges. Firstly, short texts are usually semantically sparse and ambiguous, lacking context (Phan et al., 2008). While some methods have been proposed to incorporate additional information such as entities (Wang et al., 2013, 2017), they are unable to consider relational data such as the semantic relations among entities. Secondly, the labeled training data is limited, which renders traditional and neural supervised methods (Wang and Manning, 2012; Kim, 2014) ineffective. As such, how to make full use of the limited labeled data and the large amount of unlabeled data has become a key problem for short text classification (Aggarwal and Zhai, 2012). Finally, we need to capture, at multiple granularity levels, the importance of the information incorporated to address sparsity, and to reduce the weights of noisy information, in order to achieve more accurate classification results.
In this work, we propose a novel heterogeneous graph neural network based method for semi-supervised short text classification, which makes full use of both limited labeled data and large unlabeled data by allowing information propagation through our automatically constructed graph. Particularly, we first present a flexible HIN framework for modeling the short texts, which is able to incorporate any additional information (e.g., entities and topics) as well as capture the rich relations among the texts and the additional information. Then, we propose Heterogeneous Graph ATtention networks (HGAT) to embed the HIN for short text classification based on a new dual-level attention mechanism including node-level and type-level attentions. Our HGAT method considers the heterogeneity of different node types. Additionally, the dual-level attention mechanism captures both the importance of different neighboring nodes (reducing the weights of noisy information) and the importance of different node (information) types to a given node. The main contributions of this paper can be summarized as follows: 1) To the best of our knowledge, this is the first attempt to model short texts together with additional information as an HIN and to adapt graph neural networks to the HIN for semi-supervised classification.
2) We propose novel heterogeneous graph attention networks (HGAT) for the HIN embedding based on a new dual-level attention mechanism which can learn the importance of different neighboring nodes and the importance of different node (information) types to a current node.
3) Extensive experimental results have demonstrated that our proposed HGAT model significantly outperforms seven state-of-the-art methods across six benchmark datasets.

Related Work

Traditional Text Classification
Traditional text classification methods such as SVM (Drucker et al., 1999) require a feature engineering step for text representation. The most commonly used features are BoW, TF-IDF, and topic features such as LDA (Blei et al., 2003). Some recent studies (Rousseau et al., 2015; Wang et al., 2016) model texts as graphs and extract path-based features for classification. Despite their initial success on formal and well-edited texts, all these methods fail to achieve satisfactory performance on short texts, due to the feature sparsity of short texts. To address this problem, efforts have been made to enrich the semantics of short texts. For example, Phan et al. (2008) extracted the latent topics of the short texts with the help of an external corpus. Wang et al. (2013) introduced external entity information from knowledge bases. However, these methods are not able to achieve good performance, as the feature engineering step relies on domain knowledge.

Deep Neural Networks for Text Classification
Deep neural networks, which automatically represent texts as embeddings, have been widely used for text classification. Two representative deep neural models, RNNs (Liu et al., 2016; Sinha et al., 2018) and CNNs (Kim, 2014; Shimura et al., 2018), have shown their power in many NLP tasks, including text classification. Several methods have been proposed to adapt them to short text classification. For example, Zhang et al. (2015) design a character-level CNN which alleviates the sparsity by mining different levels of information within the texts. Wang et al. (2017) incorporate entities and concepts from knowledge bases to enrich the semantics of short texts. However, these methods cannot capture the semantic relations (e.g., entity relations) and rely heavily on large amounts of training data. Clearly, the lack of training data is still a key bottleneck that prohibits them from successful practical applications.

Semi-supervised Text Classification
Considering the cost of human labeling and the fact that unlabeled texts also provide valuable information, semi-supervised methods have been proposed. They can be categorized into two classes: (1) latent variable models (Lu and Zhai, 2008; Chen et al., 2015); and (2) embedding-based models (Meng et al., 2018). The former mainly extend topic models with user-provided seed information and then infer the documents' labels based on the posterior category-topic assignments. The latter use seed information to derive embeddings of documents and label names for text classification. For example, PTE (Tang et al., 2015) models the documents, words, and labels as graphs and learns text (node) embeddings for classification. Meng et al. (2018) leveraged seed information to generate pseudo-labeled documents for pre-training. Yin et al. (2015) used a semi-supervised learning method based on SVM to label the unlabeled documents in an iterative way.
Recently, graph convolutional networks (GCN) have received wide attention for semi-supervised classification (Kipf and Welling, 2017). TextGCN (Yao et al., 2019) models the whole text corpus as a document-word graph and applies GCN for classification. However, all these methods focus on long texts. In addition, they fail to use attention mechanisms to capture important information.

Our Proposed Method
In this paper, we propose a novel heterogeneous graph neural network based method for semi-supervised short text classification, which takes full advantage of both limited labeled data and large unlabeled data by allowing information propagation along the graph. Our method includes two steps. Particularly, to alleviate the sparsity of short texts, we first present a flexible HIN framework for modeling the short texts, which can incorporate any additional information as well as capture the rich relations among the short texts and the added information. Then, we propose a novel model HGAT to embed the HIN for short text classification based on a new dual-level attention mechanism. HGAT considers the heterogeneity of different types of information. In addition, the attention mechanism can learn the importance of different nodes (reducing the weights of noisy information) as well as the importance of different node (information) types.

HIN for Short Texts
We first present the HIN framework for modeling the short texts, which enables the integration of any additional information and captures the rich relations among the texts and the added information. In this way, the sparsity of the short texts is alleviated.
Previous studies have exploited latent topics (Zeng et al., 2018) and external knowledge (e.g., entities) from knowledge bases to enrich the semantics of short texts (Wang et al., 2013, 2017). However, they fail to consider the semantic relation information, such as the relations among entities. Our HIN framework for short texts is flexible in integrating any additional information and modeling the rich relations. Here, we consider two types of additional information, i.e., topics and entities. As shown in Figure 1, we construct the HIN $G = (\mathcal{V}, \mathcal{E})$ containing the short texts $D = \{d_1, \ldots, d_m\}$, topics $T = \{t_1, \ldots, t_K\}$, and entities $E = \{e_1, \ldots, e_n\}$ as nodes, i.e., $\mathcal{V} = D \cup T \cup E$. The set of edges $\mathcal{E}$ represents their relations. The details of constructing the network are described as follows.
First, we mine the latent topics $T$ with LDA (Blei et al., 2003) to enrich the semantics of the short texts. Each topic $t_i = (\theta_{i1}, \ldots, \theta_{iw})$ (where $w$ denotes the vocabulary size) is represented by a probability distribution over the words. We assign each document to the top $P$ topics with the largest probabilities. Thus, an edge between a document and a topic is built if the document is assigned to that topic.
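To make this step concrete, the document-topic edges can be derived with any off-the-shelf LDA implementation. Below is a minimal sketch using scikit-learn; the toy documents and variable names are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["apple unveils a new phone", "the team wins the final game",
        "markets rally on strong earnings", "phone sales beat earnings forecasts"]
K, P = 15, 2  # number of topics (dataset-dependent) and top-P topics per document

bow = CountVectorizer().fit_transform(docs)           # bag-of-words counts
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic = lda.fit_transform(bow)                    # (num_docs, K) topic distributions

# Build a document-topic edge whenever a document is assigned to one of
# its P most probable topics.
doc_topic_edges = [(d, int(t))
                   for d, probs in enumerate(doc_topic)
                   for t in np.argsort(probs)[-P:]]
```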
Second, we recognize the entities $E$ in the documents $D$ and map them to Wikipedia with the entity linking tool TAGME. An edge between a document and an entity is built if the document contains the entity. We treat each entity as a whole word and learn the entity embeddings using word2vec on the Wikipedia corpus. To further enrich the semantics of the short texts and facilitate information propagation, we also consider the relations between entities. Particularly, if the similarity score (cosine similarity) between two entities, computed based on their embeddings, is above a predefined threshold $\delta$, we build an edge between them.
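The entity-entity edges then follow from thresholded cosine similarity over the entity embeddings. A minimal sketch (our illustration, with hypothetical names), assuming the embeddings are stacked row-wise in a matrix:

```python
import numpy as np

def entity_edges(emb: np.ndarray, delta: float = 0.5):
    """Connect every entity pair whose cosine similarity exceeds delta."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T                      # pairwise cosine similarities
    i, j = np.where(np.triu(sim, k=1) > delta)   # upper triangle: each pair once
    return list(zip(i.tolist(), j.tolist()))

# e.g., three toy 4-dimensional entity embeddings -> only the first pair links
edges = entity_edges(np.array([[1.0, 0.0, 0.0, 0.0],
                               [0.9, 0.1, 0.0, 0.0],
                               [0.0, 0.0, 1.0, 0.0]]))   # [(0, 1)]
```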
By incorporating the topics, entities and the relations, we enrich the semantics of the short texts and thus greatly benefit the following classification task. For example, as shown in Figure 1, the short text "the seed of Apple's Innovation: In an era when most technology..." is semantically enriched by the relations with the entities "Apple Inc." and "company", as well as the topic "technology". Thus, it can be correctly classified into the category of "business" with high confidence.

HGAT
We then propose the HGAT model (shown in Figure 2) to embed the HIN for short text classification based on a new dual-level attention mechanism, including node-level and type-level attentions. HGAT considers the heterogeneity of different types of information with a heterogeneous graph convolution. In addition, the dual-level attention mechanism captures the importance of different neighboring nodes (reducing the weights of noisy information) and the importance of different node (information) types to a specific node. Finally, it predicts the labels of documents through a softmax layer.

Heterogeneous Graph Convolution
We first describe the heterogeneous graph convolution in HGAT, considering the heterogeneous types of nodes (information).
As is known, GCN (Kipf and Welling, 2017) is a multi-layer neural network that operates directly on a homogeneous graph and induces the embedding vectors of nodes based on the properties of their neighborhoods. Formally, consider a graph $G = (V, E)$, where $V$ and $E$ represent the sets of nodes and edges, respectively. Let $X \in \mathbb{R}^{|V| \times q}$ be a matrix containing the nodes with their features $x_v \in \mathbb{R}^q$ (each row $x_v$ is the feature vector of a node $v$). For the graph $G$, we introduce its adjacency matrix $A' = A + I$ with added self-connections, and its degree matrix $M$, where $M_{ii} = \sum_j A'_{ij}$. Then the layer-wise propagation rule is as follows:

$$H^{(l+1)} = \sigma\big(\tilde{A} \cdot H^{(l)} \cdot W^{(l)}\big) \quad (1)$$

Here, $\tilde{A} = M^{-\frac{1}{2}} A' M^{-\frac{1}{2}}$ represents the symmetric normalized adjacency matrix, $W^{(l)}$ is a layer-specific trainable transformation matrix, and $\sigma(\cdot)$ denotes an activation function such as ReLU. $H^{(l)} \in \mathbb{R}^{|V| \times q^{(l)}}$ denotes the hidden representations of the nodes in the $l$-th layer. Initially, $H^{(0)} = X$.
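As a concrete illustration of Eq. 1, one GCN propagation step can be written in a few lines of NumPy (a toy sketch, not an efficient or official implementation):

```python
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN propagation step (Eq. 1): H' = ReLU(A_norm @ H @ W)."""
    A_self = A + np.eye(A.shape[0])                  # A' = A + I (self-connections)
    M_inv_sqrt = np.diag(1.0 / np.sqrt(A_self.sum(axis=1)))
    A_norm = M_inv_sqrt @ A_self @ M_inv_sqrt        # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)           # ReLU activation
```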
Unfortunately, GCN cannot be directly applied to our HIN for short texts due to the heterogeneity of the nodes. Specifically, the HIN contains three types of nodes, i.e., documents, topics, and entities, with different feature spaces. For a document $d \in D$, we use its TF-IDF vector as the feature vector $x_d$. For a topic $t \in T$, the word distribution is used to represent the topic, i.e., $x_t = \{\theta_i\}_{i \in [1, w]}$. For each entity, to make full use of the relevant information, we represent it by concatenating its entity embedding and the TF-IDF vector of its Wikipedia description text.
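Continuing the toy sketch above (reusing docs and lda from the LDA example), the three feature spaces might be assembled as follows; ent_emb and ent_desc are hypothetical stand-ins for the word2vec entity embeddings and the Wikipedia description texts:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer().fit(docs)                # docs from the LDA sketch above
x_d = vec.transform(docs).toarray()              # documents: TF-IDF vectors
x_t = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topics: word distributions

ent_emb = np.random.default_rng(0).random((2, 100))   # stand-in word2vec entity embeddings
ent_desc = ["american technology company", "professional baseball team"]
x_e = np.hstack([ent_emb, vec.transform(ent_desc).toarray()])  # entities: embedding || TF-IDF
```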
A straightforward way to adapt GCN to the HIN, which contains different types of nodes $\mathcal{T} = \{\tau_1, \tau_2, \tau_3\}$, is to construct one large feature space by concatenating the feature spaces of the different node types. Each node is then represented by a feature vector in this space, with zeros in the dimensions belonging to the other types. We name this basic method of adapting GCN to the HIN GCN-HIN. However, it suffers from reduced performance since it ignores the heterogeneity of the different information types.
To address the issue, we propose the heterogeneous graph convolution, which considers the difference of the various types of information and projects them into an implicit common space with their respective transformation matrices:

$$H^{(l+1)} = \sigma\Big(\sum_{\tau \in \mathcal{T}} \tilde{A}_\tau \cdot H_\tau^{(l)} \cdot W_\tau^{(l)}\Big) \quad (2)$$

where $\tilde{A}_\tau \in \mathbb{R}^{|V| \times |V_\tau|}$ is the submatrix of $\tilde{A}$ whose rows represent all the nodes and whose columns represent their neighboring nodes of type $\tau$. The representation $H^{(l+1)}$ of the nodes is obtained by aggregating information from the features $H_\tau^{(l)}$ of their neighboring nodes of different types $\tau$, using a different transformation matrix $W_\tau^{(l)} \in \mathbb{R}^{q^{(l)} \times q^{(l+1)}}$ for each type. The transformation matrix $W_\tau^{(l)}$ accounts for the difference between the feature spaces and projects them into an implicit common space $\mathbb{R}^{q^{(l+1)}}$. Initially, $H_\tau^{(0)} = X_\tau$.
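A minimal NumPy sketch of Eq. 2, assuming the normalized adjacency matrix and the per-type node index lists are precomputed (names are illustrative):

```python
import numpy as np

def hetero_gcn_layer(A_norm, H_by_type, W_by_type, ids_by_type):
    """One heterogeneous convolution step (Eq. 2):
    H' = ReLU( sum over types tau of  A_tau @ H_tau @ W_tau )."""
    n = A_norm.shape[0]
    q_out = next(iter(W_by_type.values())).shape[1]
    out = np.zeros((n, q_out))
    for tau, H_tau in H_by_type.items():
        A_tau = A_norm[:, ids_by_type[tau]]     # submatrix: all nodes x type-tau nodes
        out += A_tau @ H_tau @ W_by_type[tau]   # project type tau into the common space
    return np.maximum(out, 0.0)                 # ReLU
```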

Dual-level Attention Mechanism
Typically, given a specific node, different types of neighboring nodes may have different impacts on it; for example, neighboring nodes of one type may carry more useful information than those of another type. Additionally, different neighboring nodes of the same type can also have different importance. To capture the different importance at both node level and type level, we design a new dual-level attention mechanism.
Type-level Attention. Given a specific node $v$, the type-level attention learns the weights of the different types of neighboring nodes. Specifically, we first represent the embedding of a type $\tau$ as the sum of the neighboring node features, $h_\tau = \sum_{v'} \tilde{A}_{vv'} h_{v'}$, where the nodes $v' \in \mathcal{N}_v$ are of type $\tau$. We then compute the type-level attention score based on the current node embedding $h_v$ and the type embedding $h_\tau$:

$$a_\tau = \sigma\big(\mu_\tau^{T} \cdot [h_v \,\|\, h_\tau]\big) \quad (3)$$

where $\mu_\tau$ is the attention vector for the type $\tau$, $\|$ means "concatenate", and $\sigma(\cdot)$ denotes an activation function such as Leaky ReLU.
Then we obtain the type-level attention weights by normalizing the attention scores across all the types with the softmax function:

$$\alpha_\tau = \frac{\exp(a_\tau)}{\sum_{\tau' \in \mathcal{T}} \exp(a_{\tau'})} \quad (4)$$
Node-level Attention. We design the node-level attention to capture the importance of different neighboring nodes and to reduce the weights of noisy nodes. Formally, given a specific node $v$ of type $\tau$ and its neighboring node $v' \in \mathcal{N}_v$ of type $\tau'$, we compute the node-level attention score based on the node embeddings $h_v$ and $h_{v'}$, with the type-level attention weight $\alpha_{\tau'}$ for the node $v'$:

$$b_{vv'} = \sigma\big(\nu^{T} \cdot \alpha_{\tau'} [h_v \,\|\, h_{v'}]\big) \quad (5)$$

where $\nu$ is the attention vector. Then we normalize the node-level attention scores with the softmax function:

$$\beta_{vv'} = \frac{\exp(b_{vv'})}{\sum_{i \in \mathcal{N}_v} \exp(b_{vi})} \quad (6)$$

Finally, we incorporate the dual-level attention mechanism, including type-level and node-level attentions, into the heterogeneous graph convolution by replacing Eq. 2 with the following layer-wise propagation rule:

$$H^{(l+1)} = \sigma\Big(\sum_{\tau \in \mathcal{T}} \mathcal{B}_\tau \cdot H_\tau^{(l)} \cdot W_\tau^{(l)}\Big) \quad (7)$$

Here, $\mathcal{B}_\tau$ represents the attention matrix whose element in the $v$-th row and $v'$-th column is $\beta_{vv'}$ from Eq. 6.
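To make the dual-level mechanism concrete, the following simplified single-node sketch of Eqs. 3-6 assumes, for brevity, that all neighbor features have already been projected into the common space; mu and nu stand for the attention vectors (our illustration, not the paper's implementation):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_level_attention(h_v, nbrs_by_type, a_row, mu, nu):
    """nbrs_by_type: {type: (ids, H_tau)}; a_row: the row of A~ for node v;
    mu: {type: attention vector}; nu: node-level attention vector."""
    types = list(nbrs_by_type)
    # Type-level attention (Eqs. 3-4): score each type via its summed neighbors.
    scores = [leaky_relu(mu[tau] @ np.concatenate(
                  [h_v, a_row[ids] @ H_tau]))          # h_tau = sum_v' A~_vv' h_v'
              for tau, (ids, H_tau) in nbrs_by_type.items()]
    alpha = dict(zip(types, softmax(np.array(scores))))

    # Node-level attention (Eqs. 5-6): score each neighbor, scaled by its type weight.
    b, feats = [], []
    for tau, (ids, H_tau) in nbrs_by_type.items():
        for h_u in H_tau:
            b.append(leaky_relu(nu @ (alpha[tau] * np.concatenate([h_v, h_u]))))
            feats.append(h_u)
    beta = softmax(np.array(b))
    return beta @ np.stack(feats)   # attention-weighted neighborhood aggregation
```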

Model Training
After going through an $L$-layer HGAT, we obtain the embeddings of all nodes (including the short texts) in the HIN. The short text embeddings $H^{(L)}$ are then fed to a softmax layer for classification. Formally,

$$Z = \text{softmax}\big(H^{(L)}\big) \quad (8)$$

During model training, we exploit the cross-entropy loss over the training data with L2-norm regularization. Formally,

$$\mathcal{L} = - \sum_{i \in \mathcal{D}_{\text{train}}} \sum_{j=1}^{C} Y_{ij} \cdot \ln Z_{ij} + \eta \, \|\Theta\|_2 \quad (9)$$

where $C$ is the number of classes, $\mathcal{D}_{\text{train}}$ is the set of short text indices for training, $Y$ is the corresponding label indicator matrix, $\Theta$ denotes the model parameters, and $\eta$ is the regularization factor. For model optimization, we adopt the gradient descent algorithm.
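A minimal PyTorch sketch of this objective (our illustration; the paper's implementation details may differ):

```python
import torch
import torch.nn.functional as F

def hgat_loss(H_L, labels, train_idx, params, eta=5e-6):
    """Cross-entropy over the labeled short texts plus L2 regularization (Eq. 9)."""
    ce = F.cross_entropy(H_L[train_idx], labels[train_idx])  # softmax + cross-entropy
    l2 = sum((p ** 2).sum() for p in params)                 # L2-norm of parameters
    return ce + eta * l2
```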

Experiments
In this section, we evaluate the empirical performance of different methods for semi-supervised short text classification.
Datasets
AGNews: This dataset is adopted from Zhang et al. (2015). We randomly select 6,000 pieces of news from AGNews, evenly distributed over 4 classes.
Snippets: This dataset is released by Phan et al. (2008). It is composed of the snippets returned by a web-search engine.
Ohsumed: We use the benchmark bibliographic classification dataset released by Yao et al. (2019), where documents with multiple labels are removed. We use the titles for short text classification.
TagMyNews: We use the news titles as instances from the benchmark classification dataset released by Vitale et al. (2012), which contains English news from really simple syndication (RSS) feeds.
MR: A movie review dataset in which each review contains only one sentence (Pang and Lee, 2005). Each sentence is annotated as positive or negative for binary sentiment classification.
Twitter: This dataset is provided by NLTK, a Python library; it is also a binary sentiment classification dataset.
For each dataset, we randomly select 40 labeled documents per class, half of which are used for training and the other half for validation. Following Kipf and Welling (2017), all the remaining documents are used for testing; they also serve as unlabeled documents during training.
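For concreteness, such a split can be sketched as follows (the label array here is a stand-in matching the AGNews statistics above):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 1500)             # e.g., AGNews: 6,000 docs, 4 classes
train_idx, val_idx = [], []
for c in np.unique(y):                        # 40 labeled documents per class
    picked = rng.choice(np.where(y == c)[0], size=40, replace=False)
    train_idx += picked[:20].tolist()         # half for training
    val_idx += picked[20:].tolist()           # half for validation
held = set(train_idx) | set(val_idx)
test_idx = [i for i in range(len(y)) if i not in held]  # remaining: test + unlabeled
```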
We preprocess all the datasets as follows. We remove non-English characters, stop words, and low-frequency words appearing fewer than 5 times. Table 1 shows the statistics of the datasets, including the number of documents, the average numbers of tokens and entities per document, the number of classes, and (in parentheses) the proportion of texts containing entities. In our datasets, most of the texts (around 80%) contain entities.

Baselines
To comprehensively evaluate our proposed method for semi-supervised short text classification, we compare it with the following nine state-of-the-art baselines:
SVM: SVM classifiers using TF-IDF features and LDA features (Blei et al., 2003), denoted as SVM+TFIDF and SVM+LDA, respectively.
CNN: CNN (Kim, 2014) with and without pre-trained word embeddings, named CNN-rand and CNN-pretrain, respectively.
LSTM: LSTM (Liu et al., 2016) with and without pre-trained word embeddings, named LSTM-rand and LSTM-pretrain, respectively.
PTE: A semi-supervised representation learning method for text data (Tang et al., 2015). It first learns word embeddings based on heterogeneous text networks containing three bipartite networks of words, documents, and labels, and then averages the word embeddings as document embeddings for text classification.
TextGCN: Text GCN (Yao et al., 2019) models the text corpus as a graph containing documents and words as nodes, and applies GCN for text classification.
HAN: HAN (Wang et al., 2019) embeds HINs by first converting an HIN to several homogeneous sub-networks through pre-defined meta-paths and then applying graph attention networks.
For a fair comparison, all of the above baselines, including the SVMs, CNN, and LSTM, also use the entity information.

Parameter Settings
We choose the values of the parameters K, P, and δ that achieve the best results on the validation set. To construct the HIN for short texts, we set the number of topics in LDA to K = 15 for the datasets AGNews, TagMyNews, MR, and Twitter, K = 20 for Snippets, and K = 40 for Ohsumed. For all the datasets, each document is assigned to the top P = 2 topics with the largest probabilities. The similarity threshold between entities is set to δ = 0.5.
Following previous studies (Vaswani et al., 2017), we set the hidden dimension of our HGAT model and of the other neural models to d = 512, and the dimension of the pre-trained word embeddings to 100. We set the number of layers L of HGAT, GCN-HIN, and TextGCN to 2. For model training, we set the learning rate to 0.005, the dropout rate to 0.8, and the regularization factor η = 5e-6. Early stopping is applied to avoid overfitting.

Experimental Results
Table 2 shows the classification accuracy of the different methods on the six benchmark datasets. We can see that our method significantly outperforms all the baselines by a large margin, which shows the effectiveness of our proposed method for semi-supervised short text classification.
The traditional SVM classifiers, based on human-designed features, achieve better performance than the deep models with random initialization, i.e., CNN-rand and LSTM-rand, in most cases, while CNN-pretrain and LSTM-pretrain, which use pre-trained word embeddings, achieve significant improvements and outperform the SVMs. The graph-based model PTE performs worse than CNN-pretrain and LSTM-pretrain. The reason may be that PTE learns text embeddings based on word co-occurrences, which are sparse in short texts. The graph neural network based models TextGCN and HAN achieve results comparable to the deep models CNN-pretrain and LSTM-pretrain. Our model HGAT consistently outperforms all the state-of-the-art models by a large margin. The reasons include that 1) we construct a flexible HIN framework for modeling the short texts, enabling the integration of additional information to enrich their semantics, and 2) we propose a novel model, HGAT, to embed the HIN for short text classification based on a new dual-level attention mechanism. The attention mechanism captures not only the importance of different neighboring nodes (reducing the weights of noisy information) but also the importance of different types of nodes.

Comparison of Variants of HGAT
We also compare our model HGAT with several variants to validate the effectiveness of its components. As shown in Table 3, we compare HGAT with four variant models. The basic model GCN-HIN directly applies GCN to our constructed HIN for short texts by concatenating the feature spaces of the different types of information; it does not consider the heterogeneity of the various information types. HGAT w/o ATT considers the heterogeneity through our proposed heterogeneous graph convolution, which projects the different types of information into an implicit common space with respective transformation matrices. HGAT-Type and HGAT-Node consider only type-level attention and only node-level attention, respectively. We can see from Table 3 that HGAT w/o ATT consistently outperforms GCN-HIN on all datasets, demonstrating the effectiveness of our proposed heterogeneous graph convolution, which considers the heterogeneity of the various information types. HGAT-Type and HGAT-Node further improve over HGAT w/o ATT by capturing the importance of different information (reducing the weights of noisy information). HGAT-Node achieves better performance than HGAT-Type, indicating that node-level attention is more important. Finally, HGAT significantly outperforms all the variants by considering the heterogeneity and applying the dual-level attention mechanism, including both node-level and type-level attentions.

Impact of Number of Labeled Docs
We choose six representative methods with the best performance: SVM+LDA, CNN-pretrain, LSTM-pretrain, GCN-HIN, TextGCN, and HGAT, to study the impact of the number of labeled documents. Particularly, we vary the number of labeled documents per class and compare their performance on the AGNews dataset. We run each method 10 times and report the average performance. As shown in Figure 3, all the methods achieve better accuracy as the number of labeled documents increases. Generally, the graph-based methods GCN-HIN, TextGCN, and HGAT achieve better performance, indicating that graph-based methods can make better use of limited labeled data through information propagation. Our method consistently outperforms all the other methods. When fewer labeled documents are provided, the baselines exhibit an obvious performance drop, while our model still achieves relatively high performance. This demonstrates that our method can more effectively take advantage of the limited labeled data for short text classification. We believe our method benefits from the flexible HIN and the proposed heterogeneous graph attention networks with dual-level attention.

Parameter Analysis
Figure 4(a) and (b) show the test accuracy of our HGAT model on the AGNews dataset with different numbers of topics K and different numbers of top relevant topics P assigned to a document. As we can see, the test accuracy first increases with the number of topics, reaching the highest value at K = 15, and falls when K is larger than 15. We also tried different numbers of topics for the baselines and observed that the best K is the same as in our model. This is consistent with the intuition that the number of topics should fit the dataset, i.e., it should be model free. For the number of top relevant topics P assigned to the documents, the test accuracy first increases with P and then decreases when P is larger than 2. In our experiments, the two parameters are set based on the validation set of each dataset.

Case Study
[Figure 5: Visualization of the dual-level attention, including node-level attention (shown in red) and type-level attention (shown in blue). Each topic t is represented by its top 10 words with the highest probabilities.]

As Figure 5 shows, we take a short text from AGNews as an example (which is correctly classified into the class of sports) to illustrate the dual-level attention of HGAT. The type-level attention assigns a high weight (0.7) to the short text itself, and lower weights (0.2 and 0.1) to the entities and topics. This means that the text itself contributes more to the classification than the entities and topics. The node-level attention assigns different weights to the neighboring nodes; the node-level weights of nodes belonging to the same type sum to 1. As we can see, the entities $e_3$ (Atlanta Braves, a baseball team), $e_4$ (Dodger Stadium, a baseball stadium), and $e_1$ (Shawn Green, a baseball player) have higher weights than $e_2$ (Los Angeles, which refers to a city most of the time). The topics $t_1$ (game) and $t_2$ (win) have almost the same importance for classifying the text into the class of sports. This case study shows that our proposed dual-level attention can capture key information at multiple granularities for classification and reduce the weights of noisy information.

Conclusion
In this paper, we propose a novel heterogeneous graph neural network based method for semi-supervised short text classification, which takes full advantage of both limited labeled data and large unlabeled data through information propagation. Particularly, we first present a flexible HIN framework for modeling the short texts, which can integrate any additional information and capture the rich relations among the texts and the added information to address the semantic sparsity. Then, we propose a novel model, HGAT, to embed the HIN based on a dual-level attention mechanism including node-level and type-level attentions. HGAT considers the heterogeneity of various information types by projecting them into an implicit common space. Additionally, the dual-level attention captures the key information at multiple granularity levels and reduces the weights of noisy information. Extensive experimental results demonstrate that our proposed model consistently and significantly outperforms the state-of-the-art methods across six benchmark datasets. As our model HGAT is a general HIN embedding approach, it would be interesting to apply it to other tasks, e.g., HIN based recommendation.