Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks

Text classification is fundamental in natural language processing (NLP), and Graph Neural Networks (GNNs) have recently been applied to this task. However, existing graph-based works can neither capture the contextual word relationships within each document nor fulfil the inductive learning of new words. Therefore, in this work, we propose TextING for inductive text classification via GNN. We first build an individual graph for each document and then use a GNN to learn fine-grained word representations based on their local structure, which can also effectively produce embeddings for unseen words in new documents. Finally, the word nodes are aggregated as the document embedding. Extensive experiments on four benchmark datasets show that our method outperforms state-of-the-art text classification methods.


Introduction
Text classification is one of the primary tasks in the NLP field, as it provides fundamental methodologies for other NLP tasks, such as spam filtering, sentiment analysis, intent detection, and so forth. Traditional methods for text classification include Naive Bayes (Androutsopoulos et al., 2000), k-Nearest Neighbor (Tan, 2006) and Support Vector Machine (Forman, 2008). They rely, however, primarily on hand-crafted features, at the cost of labour and efficiency.
Several deep learning methods have been proposed to address the problem, among which Recurrent Neural Network (RNN) (Mikolov et al., 2010) and Convolutional Neural Network (CNN) (Kim, 2014) are essential ones. Based on them, extended models followed to improve classification performance, for instance TextCNN (Kim, 2014), TextRNN (Liu et al., 2016) and TextRCNN (Lai et al., 2015). Yet they all focus on the locality of words and thus lack long-distance and non-consecutive word interactions. Graph-based methods have recently been applied to address this issue; they treat the text not as a sequence but as a set of co-occurring words. For example, Yao et al. (2019) employ Graph Convolutional Networks (Kipf and Welling, 2017) and turn the text classification problem into a node classification one (TextGCN). Moreover, Huang et al. (2019) improve TextGCN by introducing a message passing mechanism and reducing memory consumption.
However, there are two major drawbacks in these graph-based methods. First, the context-aware word relations within each document are neglected. To be specific, TextGCN (Yao et al., 2019) constructs a single graph with global relations between documents and words, where fine-grained text-level word interactions are not considered (Wu et al., 2019; Hu et al., 2019a,b). In Huang et al. (2019), the edges of the graph are globally fixed between each pair of words, but in fact the same pair of words may affect each other differently in different texts. Second, due to the global structure, the test documents must be present during training. These methods are thus inherently transductive and have difficulty with inductive learning, in which one can easily obtain word embeddings for new documents, with new structures and words, using the trained model.
Therefore, in this work, we propose a novel Text classification method for INductive word representations via Graph neural networks, termed TextING. In contrast to previous graph-based approaches with a global structure, we train a GNN that can depict the detailed word-word relations using only the training documents, and generalise to new documents at test time. We build individual graphs by applying a sliding window inside each document (Rousseau et al., 2015). The information of word nodes is propagated to their neighbours via the Gated Graph Neural Networks (Li et al., 2015, 2019), and then aggregated into the document embedding. We also conduct extensive experiments to examine the advantages of our approach against baselines, even when words in test are mostly unseen (21.06% average gain in such an inductive condition). Noticing that a concurrent work (Nikolentzos et al., 2020) also reinforces the approach with a similar graph network structure, we describe the similarities and differences in the method section. To sum up, our contributions are threefold:
• We propose a new graph neural network for text classification, where each document is an individual graph and text-level word interactions can be learned within it.
• Our approach can generalise to new words that are absent in training, and it is therefore applicable in inductive circumstances.
• We demonstrate that our approach outperforms state-of-the-art text classification methods experimentally.

Method
TextING comprises three key components: the graph construction, the graph-based word interaction, and the readout function. The architecture is illustrated in Figure 1. In this section, we detail how the three components are implemented and how they work.

Graph Construction
We construct the graph for a textual document by representing unique words as vertices and co-occurrences between words as edges, denoted as G = (V, E), where V is the set of vertices and E the set of edges. The co-occurrences describe the relationship of words that occur within a fixed-size sliding window (length 3 by default), and they are undirected in the graph. Nikolentzos et al. (2020) also use a sliding window (of size 2). However, they include a particular master node connecting to every other node, which makes the graph densely connected and blurs the structural information during message passing.
The text is preprocessed in a standard way, including tokenisation and stopword removal (Blanco and Lioma, 2012; Rousseau et al., 2015). Embeddings of the vertices are initialised with word features, denoted as h ∈ R^{|V|×d} where d is the embedding dimension. Since we build an individual graph for each document, the word feature information is propagated and incorporated contextually during the word interaction phase.
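The construction above can be sketched in a few lines. The following is an illustrative sketch (not the authors' code): it builds one undirected document graph by sliding a window of length 3 (the default mentioned above) over an already-tokenised text; function and variable names are our own.

```python
# Build an individual document graph: unique words are vertices,
# co-occurrences within a sliding window are undirected edges.
from itertools import combinations

def build_graph(tokens, window=3):
    """Return (vocab, edges) for one document.

    vocab: sorted list of unique words (the graph vertices).
    edges: set of undirected edges as index pairs (i, j) with i < j.
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    edges = set()
    # Slide the window over the token sequence; short documents
    # yield a single window covering the whole text.
    for start in range(max(1, len(tokens) - window + 1)):
        span = tokens[start:start + window]
        for a, b in combinations(span, 2):
            if a != b:  # no self-loops from repeated words in a window
                i, j = sorted((index[a], index[b]))
                edges.add((i, j))
    return vocab, edges

vocab, edges = build_graph(["the", "cat", "sat", "on", "the", "mat"])
```

Because each document gets its own graph, a new test document simply produces a new vertex set and edge set, which is what enables the inductive setting.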

Graph-based Word Interaction
Upon each graph, we then employ the Gated Graph Neural Networks (Li et al., 2015) to learn the embeddings of the word nodes. A node receives the information a from its adjacent neighbours and then merges it with its own representation to update. As the graph layer operates on first-order neighbours, we can stack such a layer t times to achieve high-order feature interaction, where a node can reach another node t hops away. The interaction at step t is formulated as:

a^t = A h^{t-1} W_a
z^t = σ(W_z a^t + U_z h^{t-1} + b_z)
r^t = σ(W_r a^t + U_r h^{t-1} + b_r)
h̃^t = tanh(W_h a^t + U_h (r^t ⊙ h^{t-1}) + b_h)
h^t = h̃^t ⊙ z^t + h^{t-1} ⊙ (1 − z^t)

where A ∈ R^{|V|×|V|} is the adjacency matrix, σ is the sigmoid function, ⊙ denotes element-wise multiplication, and all W, U and b are trainable weights and biases. z and r function as the update gate and reset gate respectively, determining to what degree the neighbour information contributes to the current node embedding.
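One gated interaction step can be sketched in NumPy as below. This is a minimal illustration of the GRU-style update described above, not the paper's implementation: biases are omitted, and the weight shapes and the plain (unnormalised) adjacency are our own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(h, A, Wa, Wz, Uz, Wr, Ur, Wh, Uh):
    """One word-interaction step.

    h: |V| x d node embeddings, A: |V| x |V| adjacency matrix,
    all weight matrices are d x d (biases omitted for brevity).
    """
    a = A @ h @ Wa                            # aggregate neighbour information
    z = sigmoid(a @ Wz + h @ Uz)              # update gate
    r = sigmoid(a @ Wr + h @ Ur)              # reset gate
    h_tilde = np.tanh(a @ Wh + (r * h) @ Uh)  # candidate state with reset applied
    return h_tilde * z + h * (1.0 - z)        # gated combination of old and new
```

Stacking this step t times lets information flow between nodes up to t hops apart, matching the stacked-layer description in the text.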

Readout Function
After the word nodes are sufficiently updated, they are aggregated into a graph-level representation of the document, from which the final prediction is produced. We define the readout function as:

h_v = σ(f_1(h_v^t)) ⊙ tanh(f_2(h_v^t))
h_G = (1/|V|) Σ_{v∈V} h_v + Maxpooling(h_1, …, h_{|V|})

where f_1 and f_2 are two multilayer perceptrons (MLPs). The former acts as a soft attention weight while the latter performs a non-linear feature transformation. In addition to averaging the weighted word features, we also apply a max-pooling function for the graph representation h_G. The idea behind this is that every word plays a role in the text while the keywords should contribute more explicitly.
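A minimal sketch of this readout, under our own simplifying assumption that f_1 and f_2 are single linear layers rather than full MLPs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def readout(h, W1, W2):
    """Aggregate |V| x d node embeddings into one d-dim graph vector."""
    att = sigmoid(h @ W1)    # soft attention weight per node (role of f_1)
    feat = np.tanh(h @ W2)   # non-linear feature transform (role of f_2)
    hv = att * feat          # attention-weighted word features
    # Average pooling lets every word contribute; max pooling lets
    # keywords contribute more explicitly.
    return hv.mean(axis=0) + hv.max(axis=0)
```

The sum of mean- and max-pooled features mirrors the intuition stated above: all words matter, but salient ones dominate.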
Finally, the label is predicted by feeding the graph-level vector into a softmax layer, and we minimise the loss through the cross-entropy function:

ŷ_G = softmax(W h_G + b)
L = − Σ_i y_{G,i} log(ŷ_{G,i})

where W and b are weights and bias, and y_{G,i} is the i-th element of the one-hot label.

Model Variant
We also extend our model with a multichannel branch, TextING-M, where graphs with local structure (original TextING) and graphs with global structure (subgraphs from TextGCN) work in parallel. The nodes remain the same, whereas the edges of the latter are extracted for each document from the large graph built on the whole corpus. We train the two channels separately and let them vote 1:1 for the final prediction. Although this is not an inductive setting, our point is to investigate whether and how the two could complement each other from micro and macro perspectives.
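The 1:1 vote can be read as averaging the two channels' predicted class distributions and taking the argmax. A small sketch under that interpretation (the probability values below are made up for illustration):

```python
import numpy as np

def vote(p_local, p_global):
    """Average the class distributions of the two channels 1:1
    and return the winning class index."""
    avg = (np.asarray(p_local) + np.asarray(p_global)) / 2.0
    return int(np.argmax(avg))

# Local channel leans negative, global channel leans positive more strongly:
label = vote([0.6, 0.4], [0.3, 0.7])
```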

Experiments
In this section, we aim at testing and evaluating the overall performance of TextING. During the experiments, we principally concentrate on three concerns: (i) the performance and advantages of our approach against other comparable models, (ii) the adaptability of our approach to words never seen in training, and (iii) the interpretability of our approach regarding how words impact a document.
Datasets. For the sake of consistency, we adopt the same four benchmark tasks as (Yao et al., 2019): (i) classifying movie reviews into positive or negative sentiment polarity (MR), (ii) & (iii) classifying documents that appeared on Reuters newswire into 8 and 52 categories (R8 and R52 respectively), and (iv) classifying medical abstracts into 23 categories of cardiovascular diseases (Ohsumed). Table 1 presents the statistics of the datasets as well as their supplemental information.
Experimental Set-up. For all datasets, the training set and the test set are given, and we randomly split the training set 9:1 for actual training and validation respectively. The hyperparameters were tuned according to the performance on the validation set. Empirically, we set the learning rate to 0.01 with the Adam (Kingma and Ba, 2015) optimiser and the dropout rate to 0.5. Some hyperparameters depend on intrinsic attributes of the dataset, for example the word interaction step and the sliding window size; we discuss them in the parameter sensitivity subsection.
Regarding the word embeddings, we used the pre-trained GloVe (Pennington et al., 2014) embeddings.

Table 2: Test accuracy (%) of various models on four datasets. The mean ± standard deviation of our model is reported over 10 runs. Note that some baseline results are from (Yao et al., 2019).

With more interaction steps, a node can receive information from higher-order neighbours and learn its representation more accurately. Nevertheless, the situation reverses with a continuous increment: a node then receives information from every node in the graph, and its representation becomes over-smooth.
Figure 5 illustrates the performance, as well as the graph density, of TextING with a varying window size on MR and Ohsumed. It presents a trend similar to that of the interaction step as the number of neighbours of a node grows.

Conclusion
We proposed a novel graph-based method for inductive text classification, where each text owns its structural graph and text-level word interactions can be learned. Experiments demonstrated the effectiveness of our approach in modelling local word-word relations and word significance in the text.

Figure 1 :
Figure 1: The architecture of TextING. As an example, upon the graph of a document, every word node updates itself from its neighbours, and the nodes aggregate to the ultimate graph representation.

Figure 3 :
Figure 3: Attention visualisation of positive and negative movie reviews in MR.

Table 1 :
The statistics of the datasets, including both short (sentence) and long (paragraph) documents. Vocab means the number of unique words in a document. Prop. NW denotes the proportion of new words in test.
Results. Table 2 presents the performance of our model as well as the baselines. We observe that graph-based methods generally outperform the other types of models, suggesting that the graph model benefits text processing. Further, TextING ranks top on all tasks, suggesting that the individual graph exceeds the global one. Particularly, the result of TextING on MR is remarkably higher.

Under Inductive Condition. To examine the adaptability of TextING under the inductive condition, we reduce the amount of training data to 20 labelled documents per class and compare it with TextGCN. Word nodes absent from the training set are masked for TextGCN to simulate the inductive condition. In this scenario, most of the words in the test set are unseen during training, which behaves like a rigorous cold-start problem. The results of both models on MR and Ohsumed are listed in Table 3. An average gain of 21.06% shows that TextING is much less impacted by the reduction of exposed words.

Table 3 :
Accuracy (%) of TextGCN and TextING on MR and Ohsumed, where MR uses 40 labelled documents (0.5% of full training data) and Ohsumed uses 460 labelled documents (13.7% of full training data).
Parameter Sensitivity. Figure 4 exhibits the performance of TextING with a varying number of graph layers on MR and Ohsumed. The result reveals that, with more layers, a node can receive more information from high-order neighbours.