A Hierarchical Neural Attention-based Text Classifier

Deep neural networks have been displaying superior performance over traditional supervised classifiers in text classification. They learn to extract useful features automatically when a sufficient amount of data is presented. However, along with the growth in the number of documents comes an increase in the number of categories, which often results in poor performance of multiclass classifiers. In this work, we use external knowledge in the form of topic category taxonomies to aid the classification by introducing a deep hierarchical neural attention-based classifier. Our model performs better than or comparable to state-of-the-art hierarchical models at significantly lower computational cost while maintaining high interpretability.


Introduction
A large number of documents are being generated all over the world every day, and as a result automatic text classification has become an essential tool for searching, retrieving, and managing text (Allahyari et al., 2017). There has been an increasing trend in developing data-driven neural text classifiers (Collobert et al., 2011; Lai et al., 2015; Zhang et al., 2015; Yogatama et al., 2017; Conneau et al., 2017), due to their ability to handle large-scale corpora and their robustness in automatic feature extraction.
However, text classification has become increasingly challenging as the number of categories grows with a continually expanding corpus. To alleviate this problem, one form of external knowledge, the class taxonomy, has been introduced to aid the classification in a hierarchical fashion (Koller and Sahami, 1997). In general, hierarchical classifiers can be categorized into two broad approaches: local (top-down and bottom-up) and global (or big-bang) (Silla and Freitas, 2011). The local approaches create a unique classifier for each parent node in the taxonomy (Liu et al., 2001; Quinn and Laier, 2006; Vens et al., 2008; Kowsari et al., 2017), while global approaches create a single classifier for the entire taxonomy (Silla Jr and Freitas, 2009). Kowsari et al. (2017) recently proposed a hierarchical neural-based model with a top-down structure, called HDLTex, which displayed superior performance over traditional non-neural-based models. However, HDLTex inherits the disadvantage of the top-down approach: the number of sub-models grows exponentially with respect to the number of sub-trees. This is especially problematic in HDLTex, as it uses deep networks with a large number of parameters for the sub-models, and the combined model itself grows exponentially with the depth of the taxonomy.
In contrast, we propose a unified global deep neural-based classifier that overcomes the problem of exploding models. The backbone of our approach is a single encoder-decoder structure that sequentially predicts the class label of the next level, conditioned on a dynamic document representation obtained from a variant of the attention mechanism (Bahdanau et al., 2015). The contributions of our paper are as follows:
1. We propose an end-to-end global neural attention-based model for hierarchical classification, which performs better than the state-of-the-art hierarchical classifier at lower computation cost.
2. We empirically show that the use of hierarchical taxonomy provides a robust classifier, by comparing with state-of-the-art flat classifiers.

Literature Review
Traditional text classification methods focus on selecting a good set of features (for example, TF-IDF (Salton and Buckley, 1987)) to represent the documents and employing non-linear classifiers such as SVM (Dumais et al., 1998; Joachims, 1999; Tong and Koller, 2001), decision trees (Apté et al., 1994), or Naive Bayes (McCallum et al., 1998; Kim et al., 2006).
More recent work has employed deep neural networks to merge feature extraction and classification into one joint process, where the model parameters can be learned through back-propagation (Xue et al., 2008; Lai et al., 2015; Zhang et al., 2015). A common theme in these convolutional neural network (CNN)-based or recurrent neural network (RNN)-based approaches is to create a document representation either from the last hidden state of the RNN or via some pooling operation over all hidden states. Furthermore, the attention mechanism (Bahdanau et al., 2015) has been adapted to these CNN/RNN structures for text classification (Lin et al., 2017), providing high interpretability and allowing us to inspect which parts of the text are discriminative for a particular sample.
In addition, external knowledge has been examined as a way to boost the performance of text classifiers (Collobert and Weston, 2008a; Ngiam et al., 2011; Howard and Ruder, 2018). One form of external knowledge is built on top of the hierarchical relations of the classes (Koller and Sahami, 1997), where a class taxonomy is used to improve the performance of the end-level classification [1]. Most hierarchical classifiers [2] perform classification by navigating through the hierarchy in a top-down fashion (Liu et al., 2001; Quinn and Laier, 2006; Vens et al., 2008), where a local classifier is constructed at each parent node. The state-of-the-art hierarchical classifier, HDLTex, was proposed by Kowsari et al. (2017). It combines deep neural networks in a top-down fashion, where a separate neural network (either CNN or RNN) is built at each parent node to classify its children.
[1] Classifiers that do not take into account the hierarchy and are only concerned with predicting the leaf nodes are termed flat classifiers in this work.
[2] We use the term "hierarchical classifiers" to refer to models that follow the external taxonomy of class labels, which is substantially different from hierarchical attention networks (Yang et al., 2016). In Yang et al. (2016), hierarchical attention networks refer to the hierarchical nature of their attention mechanism; the model attends to the sentences first and then attends to the words.

Model
Figure 1: Proposed model architecture
Our proposed model (Figure 1) consists of three parts: 1) a bidirectional LSTM encoder (Hochreiter and Schmidhuber, 1997) that transforms each word into a vector representation based on its context; 2) an attention module that generates dynamic document representations across the different levels of classification; and 3) multi-layer perceptron (MLP) classifiers, one per level, that predict the class at that level based on the dynamically generated document representation and the level masking.
Our hierarchical classification model can be viewed as a sequence-to-sequence model, where a sequence of word embeddings is used to generate a sequence of hierarchical class labels. In addition, we employ an attention module modified from the traditional attention mechanism used in sequential generation tasks (Bahdanau et al., 2015). Instead of computing attention weights conditioned on the hidden state of the decoder at time step i, we condition on the parent category embedding $c_{k-1}$. This is intuitive in our setting, as the document representation should depend on the parent class predicted by the model.
Formally, suppose we are given a document with n tokens $D = (w_1, w_2, \dots, w_n)$ and its category labels of m levels $C = (c_1, \dots, c_m)$, $c_k \in \{c^{l_k}_1, \dots, c^{l_k}_{s_k}\}$, where $l_k$ indicates the k-th level of the class taxonomy and $s_k$ represents the number of classes in level k [3]. A bidirectional LSTM is first used to extract features of the document (Equation 1). The encoder's hidden states $H = (h_1, \dots, h_n)$ are constructed by concatenating the forward and backward hidden states at each position. When classifying the class label at level k, we first form contextual word features $\tilde{H}_k$ by concatenating the previously predicted category embedding $c_{k-1}$ (parent) with each of the encoder's outputs $H = (h_1, \dots, h_n)$. Then, we transform these n vectors in $\tilde{H}_k$ into n attention scores (scalars) through a series of linear and non-linear transformations. The document representation for level k is obtained by multiplying the multi-head attention matrix and the contextual word features. Finally, a two-layer multi-layer perceptron (MLP) is employed to classify the category at level k (Equation 5).
Normally, the softmax in Equation 5 is computed over all class labels across all taxonomy levels. This is undesirable when the taxonomy is deep and the number of classes is large. We address this with a level masking technique, in which we mask out all classes that are not in the current classification level k.
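To make the per-level computation above more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a single attention head (the paper uses several), a tanh projection followed by a dot product as the "linear and non-linear transformations", and plain lists instead of tensors. The names `attend`, `masked_softmax`, `w_proj`, `w_score`, and `level_of_class` are hypothetical and introduced only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (-inf entries map to 0)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attend(hidden_states, parent_embedding, w_proj, w_score):
    """Single-head attention conditioned on the parent category embedding.

    Assumed scoring function: score = w_score . tanh(w_proj @ [h ; parent]).
    """
    # Contextual word features: every encoder output joined with the parent embedding.
    contextual = [h + parent_embedding for h in hidden_states]
    # One scalar attention score per token.
    scores = []
    for features in contextual:
        projected = [math.tanh(v) for v in matvec(w_proj, features)]
        scores.append(sum(w * p for w, p in zip(w_score, projected)))
    weights = softmax(scores)
    # Document representation: attention-weighted sum of the contextual features.
    dim = len(contextual[0])
    doc = [sum(a * f[j] for a, f in zip(weights, contextual)) for j in range(dim)]
    return weights, doc


def masked_softmax(logits, level_of_class, level):
    """Softmax restricted to the classes that belong to the given taxonomy level."""
    if level not in level_of_class:
        raise ValueError("no class belongs to the requested level")
    masked = [z if lvl == level else float("-inf")
              for z, lvl in zip(logits, level_of_class)]
    return softmax(masked)


# Toy usage: 3 tokens with 2-dimensional encoder outputs, 2-dimensional parent embedding.
H = [[0.1, 0.3], [0.9, -0.2], [0.4, 0.4]]
parent = [1.0, 0.0]
weights, doc = attend(H, parent,
                      w_proj=[[0.5, -0.1, 0.2, 0.0], [0.3, 0.8, -0.5, 0.1]],
                      w_score=[1.0, -1.0])
# Five classes in total; only the last three belong to level 2.
probs = masked_softmax([2.0, 0.5, 1.0, -1.0, 0.3], [1, 1, 2, 2, 2], level=2)
```

In this sketch the masked classes receive exactly zero probability, so the prediction at level k can only fall on a class of that level, which is the effect the level masking is meant to achieve.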
The loss is then calculated as the joint cross-entropy loss over all levels of the taxonomy: $l = \sum_{i=1}^{m} l_i$.
Kowsari et al. (2017). Despite its small size, WOS is used as a benchmark dataset for hierarchical classification as it provides the raw text for deep neural models to train on [4]. Deep learning models usually contain a large number of parameters that need to be learned, so a large training set is usually needed to prevent over-fitting (Lawrence et al., 1997; Srivastava et al., 2014). Thus, we curated a bigger dataset with hierarchical labels from DBpedia [5], a provider of Wikipedia meta-information. Compared to WOS, our DBpedia dataset is larger in two aspects: the number of data instances and the number of hierarchical levels (Table 1). The DBpedia ontology was first used in Zhang et al. (2015) for flat text classification. We instead use the DBpedia ontology to construct a dataset with a three-level taxonomy of classes. In order to ensure enough documents per class, we only extract leaf classes with more than 200 documents. We also randomly subsample 3,000 documents per category to balance the leaf-level categories. This results in 381,025 documents in total, which we split into 90% for training (from which 10% were kept aside for validation) and 10% for testing, on which we report our classification metrics [6].
Baselines State-of-the-art flat classifiers such as FastText (Joulin et al., 2017), bidirectional LSTM with max/mean pooling (Collobert and Weston, 2008b; Lee and Dernoncourt, 2016), and the Structured Self-attentive classifier (Lin et al., 2017) are used for comparison. We noticed that the Structured Self-attentive classifier with its default hyperparameters and a high number of attention hops (m >= 8) performed poorly compared to using just one attention hop (m = 1). Therefore, we report the results with one attention hop (m = 1) as our baseline for fair comparison.
We also compare our classifier to the state-of-the-art hierarchical classifier HDLTex (Kowsari et al., 2017).
[4] The LSHTC dataset (Partalas et al., 2015) has been widely used as a benchmark for hierarchical text classification. However, the raw texts are not available, which makes it difficult to extract features for modern neural approaches. Instead, only the tf-idf vectors are provided as inputs with no option to retrieve the original text (even after consulting with the original authors we were unable to procure it).
[5] http://wiki.dbpedia.org/
[6] Our code and data will be released at https://github.com/koustuvsinha/hier-class
Hyperparameters We use 300-dimensional word embeddings, which are randomly initialized and fine-tuned during training. A two-layer bidirectional LSTM with 300 hidden units in each layer is employed. In the multi-head attention mechanism, we use 4 heads (hops) with a Frobenius norm penalty of 0.1, because this gives the best validation performance. The final fully-connected MLP layer $W_D$ has 1200 hidden units. In addition, we apply dropout of 0.4 on the BiLSTM layers and MLP layers to prevent over-fitting. For optimization, we use the standard Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and a weight decay of $10^{-4}$ and $10^{-6}$ for WOS and DBpedia, respectively. The gradients are clipped to 0.5 in order to prevent exploding gradients. All the results are obtained after 25 epochs of training. After every 10 epochs, we reduce the learning rate by half if the validation accuracy is not improving. We employ early stopping to select the best model. In addition, a weighted loss function is utilized to balance the performance on under-represented classes.
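The learning-rate schedule described above can be sketched as follows. This is only one possible reading: it assumes that "not improving" means the best validation accuracy within the last block of 10 epochs does not exceed the best accuracy seen before that block. The function name `learning_rates` and its parameters are hypothetical.

```python
def learning_rates(val_accuracies, base_lr=0.001, check_every=10, factor=0.5):
    """Return the learning rate used in each epoch, given per-epoch validation accuracy.

    Assumption: after each full block of `check_every` epochs, the rate is halved
    when the block's best accuracy does not beat the best accuracy seen earlier.
    """
    lr = base_lr
    best_before = float("-inf")  # best accuracy seen before the current block
    rates = []
    for start in range(0, len(val_accuracies), check_every):
        block = val_accuracies[start:start + check_every]
        rates.extend([lr] * len(block))
        if len(block) == check_every and max(block) <= best_before:
            lr *= factor
        best_before = max(best_before, max(block))
    return rates


# Toy run over 25 epochs: accuracy rises for 10 epochs and then stalls.
accuracies = [0.50 + 0.01 * i for i in range(10)] + [0.55] * 15
schedule = learning_rates(accuracies)
```

In this toy run the first 20 epochs use the base rate of 0.001 and the last 5 use half of it, because the second block of 10 epochs fails to improve on the first.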
Hierarchical Evaluation For evaluating hierarchical models, we present the teacher-forcing result on each level, such as $l_1$, $l_2$, and $l_3$. This indicates the per-level classification performance when we provide the true parent class to the classifier while predicting the next class. However, this setting is unrealistic, since the correct parent class is not available at inference time. Hence we also present the Overall score in Table 2, where the classifier uses its own prediction as the parent class.
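The difference between the two evaluation modes can be illustrated with a short sketch. It assumes a hypothetical callable `predict(doc, level, parent)` that returns the predicted class at a level given a parent class (`None` at the top level), and it assumes that the Overall score is measured on the final (leaf) level; neither detail is specified in the text above.

```python
def hierarchical_scores(documents, gold_paths, predict):
    """Return (per-level teacher-forced accuracy, overall accuracy).

    `gold_paths` holds one tuple of true classes per document, ordered from
    the top level to the leaf level.
    """
    n_levels = len(gold_paths[0])
    teacher_forced_hits = [0] * n_levels
    overall_hits = 0
    for doc, gold in zip(documents, gold_paths):
        # Teacher forcing: the true parent class is supplied at every level.
        for k in range(n_levels):
            parent = gold[k - 1] if k > 0 else None
            if predict(doc, k, parent) == gold[k]:
                teacher_forced_hits[k] += 1
        # Overall: the model's own prediction is fed forward as the parent.
        parent = None
        for k in range(n_levels):
            parent = predict(doc, k, parent)
        if parent == gold[-1]:
            overall_hits += 1
    total = len(documents)
    return [hits / total for hits in teacher_forced_hits], overall_hits / total


# Toy predictor backed by a lookup table (purely illustrative).
table = {
    ("a", 0, None): "sci", ("a", 1, "sci"): "phys",
    ("b", 0, None): "art", ("b", 1, "sci"): "chem", ("b", 1, "art"): "paint",
}
per_level, overall = hierarchical_scores(
    ["a", "b"], [("sci", "phys"), ("sci", "chem")],
    lambda doc, level, parent: table[(doc, level, parent)],
)
```

Here document "b" is classified correctly at the second level when the true parent is given, but not when the wrong top-level prediction is propagated, so the teacher-forced accuracies are 0.5 and 1.0 while the overall accuracy is 0.5.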

Results
Our model performs significantly better than the existing state-of-the-art hierarchical baseline (Table 2). However, both hierarchical classifiers (ours and HDLTex) perform comparably with or slightly worse than the state-of-the-art flat classifiers in terms of accuracy. Nevertheless, the robustness analysis in Table 3 indicates that hierarchical models are more robust in their errors: most of the errors made by hierarchical classifiers remain within the correct subtree of the parent class, while flat classifiers do worse. For example, on WOS, 88.57% of the predictions of our hierarchical model fall within the correct subtree, compared to 85.56% for the flat classifier.
Interestingly, the class taxonomy seems to be more beneficial in boosting the performance of hierarchical classifiers on WOS than on DBpedia. The hierarchical classifiers perform better on the leaf-node level classification of WOS than on that of DBpedia. We attribute this behaviour to the DBpedia documents being shorter on average, which makes them easier for flat classifiers to classify, so that the hierarchical classifiers overfit on the training data.
higher focus of attention. We note that the attention spread becomes much more focused in Level 2 compared to its parent Level 1.
Table 3: Robustness analysis of taxonomy on the WOS dataset. We compare the success rate of our model and the BiLSTM flat classifier. The success rate is defined as the number of times the predicted class is within the same subtree as the correct parent. We calculate this in two scenarios: 1. when the true parent class is manually provided, or teacher-forced (Correct parent), and 2. when the parent class is predicted by our model (Predicted parent).
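The success rate used in the robustness analysis (Table 3) can be sketched as below. The sketch assumes a two-level taxonomy given as a hypothetical `parent_of` mapping from leaf class to parent class, and it reports the success rate as a fraction of all predictions, in line with the percentages quoted in the text.

```python
def subtree_success_rate(predicted_leaves, gold_leaves, parent_of):
    """Fraction of predictions that land in the same subtree as the true leaf class."""
    hits = sum(
        parent_of[predicted] == parent_of[gold]
        for predicted, gold in zip(predicted_leaves, gold_leaves)
    )
    return hits / len(gold_leaves)


# Toy taxonomy: two science leaves and one art leaf.
parent_of = {"phys": "sci", "chem": "sci", "paint": "art"}
rate = subtree_success_rate(
    ["phys", "chem", "paint", "phys"],   # predicted leaf classes
    ["chem", "chem", "phys", "phys"],    # true leaf classes
    parent_of,
)
```

A wrong leaf prediction still counts as a success when it shares the parent of the true class ("phys" predicted for "chem"), which is why this measure can exceed plain leaf-level accuracy.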
In addition to the performance improvement on both datasets over HDLTex, our model takes significantly less time and resources to train, especially when the output taxonomy contains a large number of intermediate non-leaf nodes. As HDLTex needs to build one sub-classifier for each parent node, the number of sub-classifiers grows quickly. For example, there are 80 parent nodes in the taxonomy of the DBpedia dataset and HDLTex needs to build 80 RNNs, where each sub-classifier contains around 67 million parameters. As a consequence, we can barely fit the whole HDLTex model on our CPU [7], because it requires 60 GB of RAM to build these 80 deep neural networks.
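As a rough back-of-the-envelope check of the figures above, the total parameter count of the 80 sub-classifiers can be computed directly. The 4-byte (float32) storage per parameter is an assumption; the text does not say how the weights are stored or how the 60 GB breaks down.

```python
PARENT_NODES = 80                       # sub-classifiers HDLTex builds for DBpedia
PARAMS_PER_SUBCLASSIFIER = 67_000_000   # "around 67 million" per sub-classifier
BYTES_PER_PARAMETER = 4                 # assumption: float32 weights

total_params = PARENT_NODES * PARAMS_PER_SUBCLASSIFIER
weights_gb = total_params * BYTES_PER_PARAMETER / 1e9

print(f"{total_params:,} parameters, about {weights_gb:.2f} GB of weights")
```

Under this assumption the weights alone come to roughly 5.36 billion parameters, or about 21 GB. Optimizer state, activations, and framework overhead would add to this, which is at least compatible with the reported 60 GB, although the exact composition is not given.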

Discussion
Analysis of Attention The intuition behind building dynamic document representations, using multiple attentions across different hierarchical levels, is to have a re-reading effect over the taxonomy. When we first encounter an article as humans, we tend to read it carefully, but on subsequent reads we can easily identify the key aspects of the article. In our exploratory experiments, we find that the attention vectors behave in the same way. For the

Conclusion
In this work, we propose a lightweight neural-based hierarchical classifier that performs better than or comparable to the state-of-the-art hierarchical model at lower computation cost. Our model employs an adapted version of attention to represent documents dynamically through the hierarchy, which provides additional interpretability of the dynamic document representations. In addition, we demonstrate that the robustness of flat text classification can be improved by using external knowledge such as a hierarchical taxonomy. As a future direction, we will extend our model to automatically construct the hierarchical taxonomy in order to improve text classification with a large number of classes.