Be More with Less: Hypergraph Attention Networks for Inductive Text Classification

Text classification is a critical research topic with broad applications in natural language processing. Recently, graph neural networks (GNNs) have received increasing attention in the research community and demonstrated promising results on this canonical task. Despite this success, their performance can be largely jeopardized in practice because they are: (1) unable to capture high-order interactions between words; (2) inefficient at handling large datasets and new documents. To address these issues, we propose a principled model -- hypergraph attention networks (HyperGAT), which obtains more expressive power with less computational consumption for text representation learning. Extensive experiments on various benchmark datasets demonstrate the efficacy of the proposed approach on the text classification task.


Introduction
Text classification, one of the most fundamental tasks in natural language processing, has attracted sustained research effort due to its wide spectrum of applications, including sentiment analysis (Wang et al., 2016), topic labeling (Wang and Manning, 2012), and disease diagnosis (Miotto et al., 2016). Inspired by the success of deep learning techniques, methods based on representation learning such as convolutional neural networks (CNNs) (Kim, 2014) and recurrent neural networks (RNNs) (Liu et al., 2016) have been extensively explored in the past few years. In essence, the groundbreaking achievements of these methods can be attributed to their strong capability of capturing sequential context information from local consecutive word sequences.
More recently, graph neural networks (GNNs) (Kipf and Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017) have drawn much attention and demonstrated superior performance on the task of text classification (Yao et al., 2019; Wu et al., 2019a). This line of work leverages the knowledge from both training and test documents to construct a corpus-level graph with global word co-occurrence and document-word relations, and considers text classification as a semi-supervised node classification problem. With GNNs, long-distance interactions between words can then be captured to improve the final text classification performance. Despite their promising early results, the usability of existing efforts could be largely jeopardized in real-world scenarios, mainly owing to their limitations in the following two aspects: (i) Expressive Power. Existing GNN-based methods predominantly focus on pairwise interactions (i.e., dyadic relations) between words. However, word interactions are not necessarily dyadic in natural language; they can be triadic, tetradic, or of an even higher order. For instance, consider the idiom "eat humble pie", whose definition is "admit that one was wrong" in common usage. If we adopt a simple graph to model the word interactions, GNNs may misinterpret the word pie as "a baked dish" based on its pairwise connections to the other two words (humble-pie and eat-pie), and further misunderstand the actual meaning of the whole idiom. Hence, how to go beyond pairwise relations and capture high-order word interactions is vital for high-quality text representation learning, but remains largely unexplored. (ii) Computational Consumption. On the one hand, most endeavors with a GNN backbone tend to be memory-inefficient as the scale of data increases, because constructing and learning on a global document-word graph consumes immense memory (Huang et al., 2019). On the other hand, the mandatory access to test documents during training renders those methods inherently transductive, which means that when new data arrives, the model has to be retrained from scratch to handle the newly added documents. Therefore, it is necessary to design a computationally efficient approach for graph-based text classification.
Based on the discussions above, one critical research question to ask is: "Is it feasible to acquire more expressive power with less computational consumption?" To achieve this goal, we propose to adopt a document-level hypergraph (a hypergraph is a generalization of a simple graph, in which a hyperedge can connect an arbitrary number of nodes) to model each text document. The use of document-level hypergraphs potentially enables a learning model not only to alleviate the computational inefficiency issue, but more remarkably, to capture heterogeneous (e.g., sequential and semantic) high-order contextual information for each word. Therefore, more expressive power can be obtained with less computational consumption during the text representation learning process. As conventional GNN models cannot be directly applied to hypergraphs, we propose a new model named HyperGAT to bridge this gap, which is able to capture the high-order word interactions encoded within each hypergraph. In the meantime, its internal dual attention mechanism highlights key contextual information for learning highly expressive text representations. To summarize, our contributions are threefold:
• We propose to model text documents with document-level hypergraphs, which improves the model's expressive power and reduces computational consumption.
• A principled model HyperGAT based on a dual attention mechanism is proposed to support representation learning on text hypergraphs.
• We conduct extensive experiments on multiple benchmark datasets to illustrate the superiority of HyperGAT over other state-of-the-art methods on the text classification task.
Related Work

Graph Neural Networks
Graph neural networks (GNNs) -- a family of neural models for learning latent node representations in a graph -- have achieved remarkable success in different graph learning tasks (Defferrard et al., 2016; Kipf and Welling, 2017; Veličković et al., 2018; Ding et al., 2019a). Most of the prevailing GNN models follow the paradigm of neighborhood aggregation, aiming to learn latent node representations via message passing among local neighbors in the graph. With deep roots in graph spectral theory, the learning process of graph convolutional networks (GCNs) (Kipf and Welling, 2017) can be considered a mean-pooling neighborhood aggregation. Later on, GraphSAGE (Hamilton et al., 2017) was developed to concatenate a node's feature with mean/max/LSTM-pooled neighborhood information, which enables inductive representation learning on large graphs. Graph attention networks (GATs) (Veličković et al., 2018) incorporate trainable attention weights to specify fine-grained weights on neighbors when aggregating the neighborhood information of a node. Recent research further extends GNN models to consider global graph information (Battaglia et al., 2018) and edge information (Gilmer et al., 2017) during aggregation. More recently, hypergraph neural networks (Feng et al., 2019; Bai et al., 2020) have been proposed to capture high-order dependencies between nodes. Our model HyperGAT is the first attempt to bring the power of hypergraphs to the canonical text classification task.

Deep Text Classification
Grounded in the fast development of deep learning techniques, various neural models that automatically represent texts as embeddings have been developed for text classification. Two representative families of deep neural models, CNNs (Kim, 2014; Zhang et al., 2015) and RNNs (Tai et al., 2015; Liu et al., 2016), have shown superior power on the text classification task. To further improve model expressiveness, a series of attentional models have been developed, including hierarchical attention networks (Yang et al., 2016), attention over attention (Cui et al., 2017), etc. More recently, graph neural networks have been shown to be a powerful tool for text classification by considering long-distance dependencies between words. Specifically, TextGCN (Yao et al., 2019) applies graph convolutional networks (GCNs) (Kipf and Welling, 2017) on a single large graph built from the whole corpus and achieves state-of-the-art performance on text classification. Later on, SGC (Wu et al., 2019a) was proposed to reduce the unnecessary complexity and redundant computation of GCNs, and shows competitive results with superior time efficiency. TensorGCN constructs a text graph tensor to learn word and document embeddings by incorporating more context information. Huang et al. (2019) propose to learn text representations on document-level graphs. However, those transductive methods are computationally inefficient and cannot capture the high-order interactions between words that would improve model expressive power.

Methodology
In this section, we introduce a new family of GNN models developed for inductive text classification. By reviewing existing GNN-based endeavors, we first summarize their main limitations. Then we illustrate how we use hypergraphs to model text documents to achieve our goals. Finally, we present the proposed model HyperGAT, which is built on a new dual attention mechanism, and describe its training for inductive text classification.

GNNs for Text Classification
With the booming development of deep learning techniques, graph neural networks (GNNs) have achieved great success in representation learning on graph-structured data (Zhou et al., 2018; Ding et al., 2019b). In general, most of the prevailing GNN models follow the neighborhood aggregation strategy, and a GNN layer can be defined as:

$$\mathbf{h}_i^l = \mathrm{AGGR}^l\big(\mathbf{h}_i^{l-1}, \{\mathbf{h}_j^{l-1} \mid v_j \in \mathcal{N}_i\}\big),$$

where $\mathbf{h}_i^l$ is the representation of node $i$ at layer $l$ (we use $\mathbf{x}_i$ as $\mathbf{h}_i^0$) and $\mathcal{N}_i$ is the local neighbor set of node $i$. AGGR is the aggregation function of GNNs and has a series of possible implementations (Kipf and Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018).
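To make the neighborhood-aggregation template concrete, the following is a minimal sketch of one possible instantiation -- a single mean-pooling aggregation step -- in PyTorch; the function name and the dense-adjacency representation are illustrative assumptions for exposition, not code from the paper.

```python
import torch

def mean_aggregation_layer(H, A, W):
    """One neighborhood-aggregation step (mean-pooling variant).
    H: [n, d] node features, A: [n, n] adjacency matrix with self-loops,
    W: [d, d_out] trainable weight matrix."""
    deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)  # node degrees
    H_agg = (A @ H) / deg                            # average neighbor features
    return torch.relu(H_agg @ W)                     # nonlinearity
```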
Given their capability of capturing long-distance interactions between entities, GNNs have also demonstrated promising performance on text classification (Yao et al., 2019; Wu et al., 2019b). The prevailing approach is to build a corpus-level document-word graph and classify documents through semi-supervised node classification. Despite their success, most of the existing efforts suffer from computational inefficiency, not only because of the mandatory access to test documents, but also because of the construction of corpus-level document-word graphs. In the meantime, those methods are largely limited by the expressiveness of simple graphs for modeling word interactions. Therefore, how to improve model expressive power with less computational consumption is a challenging and imperative problem.

Documents as Text Hypergraphs
To address the aforementioned challenges, in this study we alternatively propose to model text documents with document-level hypergraphs. Formally, hypergraphs can be defined as follows:

Definition 3.1 (Hypergraph). A hypergraph is defined as a graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{v_1, \ldots, v_n\}$ represents the set of nodes in the graph, and $\mathcal{E} = \{e_1, \ldots, e_m\}$ represents the set of hyperedges. Note that any hyperedge $e$ can connect two or more nodes (i.e., $\sigma(e) \geq 2$, where $\sigma(e)$ denotes the number of nodes connected by $e$).
Notably, the topological structure of a hypergraph $G$ can also be represented by an incidence matrix $\mathbf{A} \in \mathbb{R}^{n \times m}$, with entries defined as:

$$A_{ij} = \begin{cases} 1, & \text{if } v_i \in e_j, \\ 0, & \text{otherwise.} \end{cases}$$

In the general case, each node in a hypergraph comes with a $d$-dimensional attribute vector. Therefore, all the node attributes can be denoted as $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]^{\mathrm{T}} \in \mathbb{R}^{n \times d}$, and we can further use $G = (\mathbf{A}, \mathbf{X})$ to represent the whole hypergraph for simplicity.
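As an illustration, here is a minimal sketch of how a document-level incidence matrix could be materialized in PyTorch; the helper name and the list-of-node-index-sets input format are assumptions made for exposition, not the authors' released code.

```python
import torch

def incidence_matrix(num_nodes, hyperedges):
    """Build A in {0,1}^{n x m}, where A[i, j] = 1 iff node v_i belongs
    to hyperedge e_j. `hyperedges` is a list of collections of node indices."""
    A = torch.zeros(num_nodes, len(hyperedges))
    for j, edge in enumerate(hyperedges):
        A[list(edge), j] = 1.0
    return A

# Example: a 4-word document with two hyperedges {0, 1, 2} and {1, 3}.
A = incidence_matrix(4, [{0, 1, 2}, {1, 3}])
```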
For a text hypergraph, nodes represent the words in a document, and node attributes can be either one-hot vectors or pre-trained word embeddings (e.g., word2vec, GloVe). In order to model heterogeneous high-order context information within each document, we include multi-relational hyperedges as follows:

Sequential Hyperedges. Sequential context depicts the language property of local co-occurrence between words, which has demonstrated its effectiveness for text representation learning (Yao et al., 2019). To leverage the sequential context information of each word, we first construct sequential hyperedges for each document in the corpus. One natural way is to adopt a fixed-size sliding window to obtain global word co-occurrence as the sequential context. Inspired by the success of hierarchical attention networks (Yang et al., 2016), here we instead consider each sentence as a hyperedge connecting all the words in that sentence. As an added benefit, using sentences as sequential hyperedges enables our model to capture document structural information at the same time.

Semantic Hyperedges. Furthermore, in order to enrich the semantic context of each word, we build semantic hyperedges to capture topic-related high-order correlations between words (Linmei et al., 2019). Specifically, we first mine the latent topics $T$ from the text documents using LDA (Blei et al., 2003), where each topic $t_i = (\theta_1, \ldots, \theta_w)$ ($w$ denotes the vocabulary size) is represented by a probability distribution over the words. Then we treat each topic as a semantic hyperedge that connects the top-$K$ words with the largest probabilities in the document. With these topic-related hyperedges, we are able to enrich the high-order semantic context of words in each document. It is worth mentioning that although we only discuss sequential and semantic hyperedges in this study, other meaningful hyperedges (e.g., syntax-related) could also be integrated into the proposed model to further improve its expressiveness, which we leave for future work.
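For concreteness, the following is a minimal Python sketch of how the two kinds of hyperedges could be assembled for a single document; the function name, the word-to-node-index mapping `vocab`, and the `topic_word_probs` dictionary (e.g., obtained from a pre-trained LDA model) are illustrative assumptions rather than the authors' actual implementation.

```python
def build_hyperedges(sentences, vocab, topic_word_probs, top_k=10):
    """Construct hyperedges for one document.
    sentences: list of token lists; each sentence yields one sequential hyperedge.
    vocab: dict mapping each word in the document to its node index.
    topic_word_probs: dict {topic_id: {word: probability}} from a trained LDA model.
    Returns a list of hyperedges, each a set of node indices."""
    hyperedges = []
    # Sequential hyperedges: one per sentence, connecting all words in it.
    for sent in sentences:
        nodes = {vocab[w] for w in sent if w in vocab}
        if len(nodes) >= 2:
            hyperedges.append(nodes)
    # Semantic hyperedges: one per topic, connecting the document's top-K
    # words under that topic's word distribution.
    for word_probs in topic_word_probs.values():
        ranked = sorted(((w, p) for w, p in word_probs.items() if w in vocab),
                        key=lambda wp: wp[1], reverse=True)[:top_k]
        nodes = {vocab[w] for w, _ in ranked}
        if len(nodes) >= 2:
            hyperedges.append(nodes)
    return hyperedges
```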

Hypergraph Attention Networks
To support text representation learning on the constructed text hypergraphs, we propose a new model called HyperGAT (as shown in Figure 1). In contrast to conventional GNN models, HyperGAT learns node representations with two different aggregation functions, allowing it to capture heterogeneous high-order context information of words on text hypergraphs. In general, a HyperGAT layer can be defined as:

$$\mathbf{f}_j^l = \mathrm{AGGR}_{edge}^l\big(\{\mathbf{h}_k^{l-1} \mid v_k \in e_j\}\big), \qquad \mathbf{h}_i^l = \mathrm{AGGR}_{node}^l\big(\mathbf{h}_i^{l-1}, \{\mathbf{f}_j^l \mid e_j \in \mathcal{E}_i\}\big),$$

where $\mathcal{E}_i$ denotes the set of hyperedges connected to node $v_i$ and $\mathbf{f}_j^l$ is the representation of hyperedge $e_j$ at layer $l$. $\mathrm{AGGR}_{edge}$ is an aggregation function that aggregates node features into hyperedges, and $\mathrm{AGGR}_{node}$ is another aggregation function that aggregates hyperedge features back into nodes. In this work, we implement these two functions with a dual attention mechanism. We next describe a single layer $l$, from which arbitrary HyperGAT architectures can be built.

Node-level Attention. Given a specific node $v_i$, our HyperGAT layer first learns the representations of all its connected hyperedges $\mathcal{E}_i$. As not all nodes in a hyperedge $e_j \in \mathcal{E}_i$ contribute equally to the hyperedge meaning, we introduce an attention mechanism (i.e., node-level attention) to highlight the nodes that are important to the meaning of the hyperedge and then aggregate them to compute the hyperedge representation $\mathbf{f}_j^l$. Formally:

$$\mathbf{f}_j^l = \sigma\Big(\sum_{v_k \in e_j} \alpha_{jk} \mathbf{W}_1 \mathbf{h}_k^{l-1}\Big),$$

where $\sigma$ is a nonlinearity such as ReLU and $\mathbf{W}_1$ is a trainable weight matrix. $\alpha_{jk}$ denotes the attention coefficient of node $v_k$ in the hyperedge $e_j$, which can be computed by:

$$\alpha_{jk} = \frac{\exp(\mathbf{a}_1^{\mathrm{T}} \mathbf{u}_k)}{\sum_{v_p \in e_j} \exp(\mathbf{a}_1^{\mathrm{T}} \mathbf{u}_p)}, \qquad \mathbf{u}_k = \mathrm{LeakyReLU}(\mathbf{W}_1 \mathbf{h}_k^{l-1}),$$

where $\mathbf{a}_1$ is a trainable weight vector (a.k.a. context vector).
Edge-level Attention. With all the hyperedge representations $\{\mathbf{f}_j^l \mid e_j \in \mathcal{E}_i\}$, we apply an edge-level attention mechanism to highlight the informative hyperedges for learning the next-layer representation of node $v_i$. This process can be formally expressed as:

$$\mathbf{h}_i^l = \sigma\Big(\sum_{e_j \in \mathcal{E}_i} \beta_{ij} \mathbf{W}_2 \mathbf{f}_j^l\Big),$$

where $\mathbf{h}_i^l$ is the output representation of node $v_i$ and $\mathbf{W}_2$ is a weight matrix. $\beta_{ij}$ denotes the attention coefficient of hyperedge $e_j$ on node $v_i$, which can be computed by:

$$\beta_{ij} = \frac{\exp(\mathbf{a}_2^{\mathrm{T}} \mathbf{v}_j)}{\sum_{e_p \in \mathcal{E}_i} \exp(\mathbf{a}_2^{\mathrm{T}} \mathbf{v}_p)}, \qquad \mathbf{v}_j = \mathrm{LeakyReLU}\big([\mathbf{W}_2 \mathbf{f}_j^l \,\|\, \mathbf{W}_1 \mathbf{h}_i^{l-1}]\big),$$

where $\mathbf{a}_2$ is another weight (context) vector for measuring the importance of the hyperedges and $\|$ is the concatenation operation.
The proposed dual attention mechanism enables a HyperGAT layer not only to capture high-order word interactions, but also to highlight the key information at different granularities during the node representation learning process.
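To make the dual attention concrete, below is a minimal PyTorch sketch of a single HyperGAT layer operating on a dense incidence matrix. It follows the equations above under the stated notation, but the class name, the dense masked-softmax implementation, and the tensor layouts are assumptions made for exposition, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperGATLayer(nn.Module):
    """One HyperGAT layer with dual attention over a dense incidence matrix.
    H: [n, d_in] node features; A: [n, m] incidence matrix (nodes x hyperedges)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_out, bias=False)   # node -> hyperedge transform
        self.W2 = nn.Linear(d_out, d_out, bias=False)  # hyperedge -> node transform
        self.a1 = nn.Linear(d_out, 1, bias=False)      # node-level context vector
        self.a2 = nn.Linear(2 * d_out, 1, bias=False)  # edge-level context vector

    @staticmethod
    def masked_softmax(scores, mask, dim):
        # Restrict the softmax to node-hyperedge pairs present in the hypergraph;
        # assumes every slice along `dim` has at least one nonzero mask entry.
        return torch.softmax(scores.masked_fill(mask == 0, float('-inf')), dim=dim)

    def forward(self, H, A):
        n, m = A.shape
        WH = self.W1(H)                                                 # [n, d_out]
        # Node-level attention: aggregate nodes into hyperedge representations.
        scores = self.a1(F.leaky_relu(WH)).expand(-1, m)                # [n, m]
        alpha = self.masked_softmax(scores, A, dim=0)                   # normalize over nodes per hyperedge
        F_edges = torch.relu(alpha.t() @ WH)                            # [m, d_out]
        # Edge-level attention: aggregate hyperedges back into node representations.
        WF = self.W2(F_edges)                                           # [m, d_out]
        pair = torch.cat([WF.unsqueeze(0).expand(n, -1, -1),
                          WH.unsqueeze(1).expand(-1, m, -1)], dim=-1)   # [n, m, 2*d_out]
        scores = self.a2(F.leaky_relu(pair)).squeeze(-1)                # [n, m]
        beta = self.masked_softmax(scores, A, dim=1)                    # normalize over hyperedges per node
        return torch.relu(beta @ WF)                                    # [n, d_out]
```

Stacking two such layers, with the output of the first fed as the node features of the second, recovers a multi-layer architecture of the kind described in the implementation details.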

Inductive Text Classification
For each document, after going through $L$ HyperGAT layers, we obtain the node representations on the constructed text hypergraph. Then we apply a mean-pooling operation over the learned node representations $\mathbf{H}^L$ to obtain the document representation $\mathbf{z}$, and feed it to a softmax layer for text classification. Formally:

$$\mathbf{z} = \mathrm{mean\text{-}pooling}(\mathbf{H}^L), \qquad \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{W}_c \mathbf{z} + \mathbf{b}_c),$$

where $\mathbf{W}_c$ is a parameter matrix mapping the document representation into the output space, $\mathbf{b}_c$ is a bias term, and $\hat{\mathbf{y}}$ denotes the predicted label scores. Specifically, the loss function for text classification is defined as the cross-entropy loss:

$$\mathcal{L} = -\sum_{d} \log \hat{\mathbf{y}}_d[j],$$

where $j$ is the ground-truth label of document $d$ and the sum runs over the labeled training documents. HyperGAT can thus be learned by minimizing this loss over all the labeled documents. Note that HyperGAT eliminates the mandatory access to test documents during training, making it fully inductive and able to generalize to unseen documents.
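The readout and training objective can be sketched as follows; this builds on the HyperGATLayer sketch from the previous section, and the class and function names as well as the batching scheme are illustrative assumptions rather than the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperGATClassifier(nn.Module):
    """Two HyperGAT layers followed by mean-pooling and a linear softmax layer."""
    def __init__(self, d_in, d_hidden, d_out, num_classes):
        super().__init__()
        self.layer1 = HyperGATLayer(d_in, d_hidden)       # sketch from the previous section
        self.layer2 = HyperGATLayer(d_hidden, d_out)
        self.classifier = nn.Linear(d_out, num_classes)   # W_c, b_c

    def forward(self, X, A):
        H = self.layer2(self.layer1(X, A), A)   # [n, d_out] node representations
        z = H.mean(dim=0)                       # mean-pooling -> document representation z
        return self.classifier(z)               # unnormalized label scores

def batch_loss(model, docs):
    """Cross-entropy over a batch of labeled documents; the softmax is folded
    into F.cross_entropy. `docs` is an iterable of (X, A, label) triples."""
    logits = torch.stack([model(X, A) for X, A, _ in docs])
    labels = torch.tensor([y for *_, y in docs])
    return F.cross_entropy(logits, labels)
```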

Baselines.
We compare HyperGAT with three categories of baselines: (i) word-embedding-based methods, including fastText (Joulin et al., 2016) and the more advanced methods SWEM and LEAM; (ii) sequence-based methods that capture text features from local consecutive word sequences, including CNNs (Kim, 2014), LSTMs (Liu et al., 2016), and Bi-LSTM (Huang et al., 2015); (iii) graph-based methods that aim to capture interactions between words, including Graph-CNN (Defferrard et al., 2016), two versions of TextGCN (Yao et al., 2019), and Text-level GNN (Huang et al., 2019). Note that TextGCN (transductive) is the model proposed in the original paper and TextGCN (inductive) is the inductive version implemented by the same authors. Text-level GNN is a state-of-the-art baseline that performs text representation learning on document-level graphs. More details of the baselines can be found in (Yao et al., 2019).

Implementation Details.
HyperGAT is implemented in PyTorch and optimized with the Adam optimizer. We train and test the model on a 12 GB Titan Xp GPU. Specifically, our HyperGAT model consists of two layers with 300 and 100 embedding dimensions, respectively. We use one-hot vectors as the node attributes, and the batch size is set to 8 for all the datasets. The optimal hyperparameter values are selected when the model achieves the highest accuracy on the validation samples. The learning rate is set to 0.0005 for MR and 0.001 for the other datasets. The L2 regularization weight is $10^{-6}$ and the dropout rate is 0.3 for the best performance. We train the model for 100 epochs with an early-stopping strategy. To construct the semantic hyperedges, we train an LDA model for each dataset using the training documents and select the top-10 words from each topic; the number of topics is set to the number of classes. For the baseline models, we either report the results from previous research (Yao et al., 2019) or run the code provided by the authors using the parameters described in the original papers. More details can be found in Appendix A.2. Our data and source code are available at https://github.com/kaize0409/HyperGAT.
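As a rough illustration of this setup, the sketch below wires the reported hyperparameters into a standard PyTorch training loop; the caller-supplied helpers (`train_batches`, `val_docs`, `evaluate`), the patience value, and the default learning rate (the one used for datasets other than MR) are assumptions, not the authors' training script.

```python
import torch

def train_hypergat(model, train_batches, val_docs, evaluate, lr=1e-3,
                   weight_decay=1e-6, max_epochs=100, patience=10):
    """Hypothetical training loop reflecting the reported settings: Adam,
    L2 regularization 1e-6, batches of 8 document hypergraphs (prepared by
    the caller), up to 100 epochs with early stopping on validation accuracy."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    best_val_acc, bad_epochs = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_batches:               # each batch: (X, A, label) triples
            optimizer.zero_grad()
            loss = batch_loss(model, batch)       # cross-entropy sketch from above
            loss.backward()
            optimizer.step()
        val_acc = evaluate(model, val_docs)       # caller-provided validation metric
        if val_acc > best_val_acc:
            best_val_acc, bad_epochs = val_acc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:            # early stopping
                break
    return model
```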

Experimental Results
Classification Performance. We first conduct comprehensive experiments to evaluate model performance on text classification and present the results in Table 2. Overall, our model HyperGAT outperforms all the baselines on the five evaluation datasets, which demonstrates its superior capability for text classification. In addition, we make the following in-depth observations: • Graph-based methods, especially GNN-based models, achieve superior performance over the other two categories of baselines on the first four datasets. This observation indicates that text classification performance can be directly improved by capturing long-distance word interactions. For the MR dataset, however, sequence-based methods (CNNs and LSTMs) show stronger classification capability than most of the graph-based baselines. One potential reason is that sequential context information plays a critical role in sentiment classification, which cannot be explicitly captured by the majority of existing graph-based methods.
• Not surprisingly, without the additional knowledge of test documents, the performance of TextGCN (inductive) falls well behind its original transductive version. Although Text-level GNN achieves performance improvements by adding trainable edge weights between words, its performance is still limited by the information loss of using pairwise simple graphs. In particular, our model HyperGAT achieves considerable improvements over the other GNN-based models, demonstrating the importance of high-order context information for learning word representations.
Computational Efficiency. Table 3 presents the computational cost comparison between the most representative transductive baseline, TextGCN, and our approach. From the reported results, we can clearly see that HyperGAT has a significant advantage in terms of memory consumption. The main reason is that HyperGAT conducts text representation learning at the document level and only needs to store a batch of small text hypergraphs during training. On the contrary, TextGCN requires constructing a large document-word graph from both training and test documents, which inevitably consumes a great amount of memory. Another computational advantage is that HyperGAT is an inductive model that can generalize to unseen documents, so we do not have to retrain the whole model for newly added documents as transductive methods do.
Model Sensitivity. The model performance on 20NG and Ohsumed with different first-layer embedding dimensions is reported in Figure 2; we omit the results on the other datasets since similar trends can be observed. Notably, the best performance of HyperGAT is achieved when the first-layer embedding size is set to 300. This indicates that a small embedding size may render the model less expressive, while the model may overfit if the embedding size is too large. In the meantime, to evaluate the effect of the amount of labeled training data, we compare several best-performing models with different proportions of the training data and report the results on Ohsumed and MR in Figure 3. In general, with the growth of labeled training data, all the evaluated methods achieve performance improvements. More remarkably, HyperGAT significantly outperforms the other baselines with limited labeled data, showing its effectiveness in real-world scenarios.

Ablation Analysis
To investigate the contribution of each module in HyperGAT, we conduct an ablation analysis and report the results in Table 4. Specifically, w/o attention is a variant of HyperGAT that replaces the dual attention with convolution. w/o sequential and w/o semantic are two further variants that exclude sequential and semantic hyperedges, respectively. From the reported results, we can see that HyperGAT achieves better performance by stacking more layers, which verifies the usefulness of long-distance word interactions for text representation learning. Moreover, the performance gap between w/o attention and HyperGAT shows the effectiveness of the dual attention mechanism for learning more expressive word representations. By comparing the results of w/o sequential and w/o semantic, we can see that the context information encoded by the sequential hyperedges is more important, but adding semantic hyperedges further enhances model expressiveness. This also indicates that heterogeneous high-order context information is complementary, and that we could investigate more meaningful hyperedges to further improve the performance of our approach.

Case Study
Embedding Visualization. In order to show the superior embedding quality of HyperGAT over other methods, we use t-SNE (Maaten and Hinton, 2008) to visualize the learned document representations for comparison. Specifically, Figure 4 shows the visualization results of the best-performing baseline, Text-level GNN, and HyperGAT on the test documents of Ohsumed. Note that each node's color corresponds to its label, which is used to verify the model's expressive power on the 23 document classes. From the embedding visualization, we can observe that HyperGAT learns more expressive document representations than the state-of-the-art method Text-level GNN.
Attention Visualization. To better illustrate the learning process of the proposed dual attention mechanism, we take a text document from 20NG (correctly labeled as sport.baseball) and visualize the attention weights computed for the word player. As shown in Figure 5, player is connected to four hyperedges within the constructed document-level hypergraph. The first three lines, which end with periods, represent sequential hyperedges, while the last one, without a period, is a semantic hyperedge. Note that we use orange to denote the node-level attention weights and blue to denote the edge-level attention weights; darker color represents larger attention weight. On the one hand, node-level attention is able to select the nodes (words) carrying informative context within the same hyperedge. For example, win and team in the third hyperedge gain larger attention weights since they are more expressive than the other words in the same sentence. On the other hand, edge-level attention assigns fine-grained weights to highlight meaningful hyperedges. As we can see, the last hyperedge, which connects player with baseball and win, receives a higher attention weight since it better characterizes the meaning of player in the document. To summarize, this case study shows that the proposed dual attention captures key information at different granularities for learning expressive text representations.

Conclusion
In this study, we propose a new graph-based method for inductive text classification. In contrast to existing efforts, we propose to model text documents with document-level hypergraphs and further develop a new GNN model, named HyperGAT, for learning discriminative text representations. Specifically, our method acquires more expressive power with less computational consumption for text representation learning. Extensive experimental results demonstrate the superiority of the proposed model over state-of-the-art methods.