Hierarchy-Aware Global Model for Hierarchical Text Classification

Hierarchical text classification is an essential yet challenging subtask of multi-label text classification with a taxonomic hierarchy. Existing methods have difficulties in modeling the hierarchical label structure in a global view. Furthermore, they cannot make full use of the mutual interactions between the text feature space and the label space. In this paper, we formulate the hierarchy as a directed graph and introduce hierarchy-aware structure encoders for modeling label dependencies. Based on the hierarchy encoder, we propose a novel end-to-end hierarchy-aware global model (HiAGM) with two variants. A multi-label attention variant (HiAGM-LA) learns hierarchy-aware label embeddings through the hierarchy encoder and conducts inductive fusion of label-aware text features. A text feature propagation model (HiAGM-TP) is proposed as the deductive variant that directly feeds text features into hierarchy encoders. Compared with previous works, both HiAGM-LA and HiAGM-TP achieve significant and consistent improvements on three benchmark datasets.


Introduction
Text classification is widely used in Natural Language Processing (NLP) applications, such as sentiment analysis (Pang and Lee, 2007), information retrieval (Liu et al., 2015), and document categorization (Yang et al., 2016). Hierarchical text classification (HTC) is a particular multi-label text classification (MLC) problem in which the classification result corresponds to one or more nodes of a taxonomic hierarchy. The taxonomic hierarchy is commonly modeled as a tree or a directed acyclic graph, as depicted in Figure 1.
Existing approaches for HTC can be categorized into two groups: the local approach and the global approach. The first group tends to construct multiple classification models and then traverse the hierarchy in a top-down manner. Previous local studies (Wehrmann et al., 2018; Shimura et al., 2018; Banerjee et al., 2019) propose to overcome the data imbalance on child nodes by learning from parent ones. However, these models contain a large number of parameters and easily lead to exposure bias due to the lack of holistic structural information. The global approach treats the HTC problem as a flat MLC problem and uses one single classifier for all classes. Recent global methods introduce various strategies to utilize the structural information of top-down paths, such as recursive regularization (Gopal and Yang, 2013), reinforcement learning, and meta-learning (Wu et al., 2019). So far, no global method encodes the holistic label structure for label correlation features. Moreover, these methods still exploit the hierarchy in a shallow manner, ignoring the fine-grained label correlation information that proves more fruitful in our work.

* This work was done during an internship at Alibaba Group. † Corresponding author.

Figure 1: This short sample is tagged with news, sports, football, features, and books. Note that HTC can be either a single-path or a multi-path problem.
In this paper, we formulate the hierarchy as a directed graph and utilize prior probabilities of label dependencies to aggregate node information. A hierarchy-aware global model (HiAGM) is pro-posed to enhance textual information with the label structural features. It comprises a traditional text encoder for extracting textual information and a hierarchy-aware structure encoder for modeling hierarchical label relations. The hierarchy-aware structure encoder could be either a TreeLSTM or a hierarchy-GCN where hierarchical prior knowledge is integrated. Moreover, these two structure encoders are bidirectionally calculated, allowing them to capture label correlation information in both top-down and bottom-up manners. As a result, HiAGM is more robust than previous top-down models and is able to alleviate the problems caused by exposure bias and imbalanced data.
To aggregate text features and label structural features, we present two variants of HiAGM, a multi-label attention model HiAGM-LA and a text feature propagation model HiAGM-TP. Both variants extract hierarchy-aware text features based on the structure encoders. HiAGM-LA extracts the inductive label-wise text features while HiAGM-TP generates hybrid information in a deductive manner. Specifically, HiAGM-LA updates the label embedding across the holistic hierarchy and then employs node outputs as the hierarchy-aware label representations. Finally, it conducts multi-label attention for label-aware text features. On the other hand, HiAGM-TP directly utilizes text features as the input of the structure encoder in a serial dataflow. Hence it propagates textual information throughout the overall hierarchy. The hidden state of each node in the entire hierarchy represents the class-specific textual information.
The major contributions of this paper are:
• With the prior hierarchy knowledge, we adopt typical structure encoders for modeling label dependencies in both top-down and bottom-up manners, which has not been investigated for hierarchical text classification.
• We propose a novel end-to-end hierarchy-aware global model (HiAGM). We further present two variants for label-wise text features: a hierarchy-aware multi-label attention model (HiAGM-LA) and a hierarchy-aware text feature propagation model (HiAGM-TP).
• We empirically demonstrate that both variants of HiAGM achieve consistent improvements on various datasets when using different structure encoders. Our best model outperforms the state-of-the-art model by 3.25% Macro-F1 and 0.66% Micro-F1 on RCV1-V2.
• We release our code and our experimental splits of Web-of-Science and NYTimes for reproducibility.

Related Work
Existing works for HTC can be categorized into local and global approaches. Local approaches can be subdivided into local classifier per node (LCN) (Banerjee et al., 2019), local classifier per parent node (LCPN) (Dumais and Chen, 2000), and local classifier per level (LCL) (Shimura et al., 2018; Wehrmann et al., 2018; Kowsari et al., 2017). Banerjee et al. (2019) transfer parameters of the parent model to child models as LCN. Wehrmann et al. (2018) alleviate the exposure bias problem with a hybrid of LCL and global optimizations. Peng et al. (2018) decompose the hierarchy into subgraphs and conduct Text-GCN on n-gram tokens. The global approach improves flat MLC models with hierarchy information. Cai and Hofmann (2004) modify SVM into Hierarchical-SVM by decomposition. Gopal and Yang (2013) propose a simple recursive regularization of parameters among adjacent classes. Deep learning architectures are also employed in global models, such as sequence-to-sequence (Yang et al., 2018), meta-learning (Wu et al., 2019), reinforcement learning, and capsule networks (Peng et al., 2019). Those models mainly focus on improving decoders based on the constraint of hierarchical paths. In contrast, we propose an effective hierarchy-aware global model, HiAGM, that extracts label-wise text features with hierarchy encoders based on prior hierarchy information.
Moreover, the attention mechanism was introduced to MLC by Mullenbach et al. (2018) for ICD coding. Rios and Kavuluru (2018) train label representations through a basic GraphCNN and conduct multi-label attention with residual shortcuts. AttentionXML (You et al., 2019) converts MLC into a multi-label attention LCL model through label clusters. Huang et al. (2019) improve HMCN (Wehrmann et al., 2018) with label attention per level. Our HiAGM-LA, however, employs multi-label attention in a single model with a simplified structure encoder, reducing the computational complexity.
Recent works in semantic analysis (Chen et al., 2017b), semantic role labeling, and machine translation (Chen et al., 2017a) show that syntax encoders such as Tree-based RNNs (Tai et al., 2015; Chen et al., 2017a) and GraphCNNs (Marcheggiani and Titov, 2017) improve sentence representations. We modify those structure encoders for HTC with fine-grained prior knowledge in both top-down and bottom-up manners.

Problem Definition
Hierarchical text classification (HTC), a subtask of text classification, organizes the label space with a predefined taxonomic hierarchy. The hierarchy is predefined over the holistic corpus and groups label subsets according to class relations. The taxonomic hierarchy mainly takes the form of a tree-like structure or a directed acyclic graph (DAG) structure. Note that a DAG can be converted into a tree-like structure by distinguishing each label node as a single-path node. Thus, the taxonomic hierarchy can be simplified as a tree-like structure.
As illustrated in Figure 2, we formulate a taxonomic hierarchy as a directed graph G = (V, E), where V refers to the set of label nodes V = {v_1, v_2, ..., v_C}, C denotes the number of label nodes, and E is the set of edges induced by the parent-child relations. Formally, we define HTC as H = (X, L) with a sequence of text objects X = (x_1, x_2, ..., x_N) and an aligned sequence of supervised label sets L = (l_1, l_2, ..., l_N).
As depicted in Figure 1, each sample x_i corresponds to a label set l_i that includes multiple classes. Those corresponding classes belong to either one or more sub-paths in the hierarchy. Note that a sample belonging to a child node v_j ∈ child(i) also belongs to its parent node v_i.
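The parent-inclusion constraint above can be sketched with a toy taxonomy. The label names follow Figure 1, but the parent map itself is illustrative, not taken from any dataset:

```python
# Hypothetical parent map for the Figure 1 taxonomy ("news" is the root).
parent = {"sports": "news", "football": "sports",
          "features": "news", "books": "features"}

def close_upward(labels):
    """Add every ancestor of each label: if a sample carries child v_j,
    it must also carry its parent v_i (multi-path sets stay valid)."""
    closed = set(labels)
    for lab in labels:
        while lab in parent:
            lab = parent[lab]
            closed.add(lab)
    return closed

# The multi-path sample of Figure 1, closed upward:
full = close_upward({"football", "books"})
```

Here `full` contains both sub-paths, news→sports→football and news→features→books, matching the multi-path tag set in Figure 1.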

Hierarchy-Aware Global Model
As depicted in Figure 3, we propose a Hierarchy-Aware Global Model (HiAGM) that leverages the fine-grained hierarchy information and then aggregates label-wise text features. HiAGM consists of a traditional text encoder for textual information and a hierarchy-aware structure encoder for hierarchical label correlation features.
We present two variants of HiAGM for hybrid information aggregation, a multi-label attention model (HiAGM-LA) and a text feature propagation model (HiAGM-TP). HiAGM-LA updates label representations with the structure encoder and generates label-aware text features with a multi-label attention mechanism. HiAGM-TP propagates text representations throughout the holistic hierarchy, thus obtaining label-wise text features with the fusion of label correlations.

Prior Hierarchy Information
The taxonomic hierarchy describes the hierarchical relations among labels, and the major bottleneck of HTC is how to make full use of this established structure. Previous studies directly utilize the hierarchy paths in a static manner, within a pipeline framework, a hierarchical model, or a label assignment model. In contrast, based on Bayesian statistical inference, HiAGM leverages the prior knowledge of label correlations derived from the predefined hierarchy and the corpus. We exploit the prior probabilities of label dependencies as prior hierarchy knowledge.
Suppose that there is a hierarchy edge e_{i,j} between the parent node v_i and the child node v_j. This edge feature f(e_{i,j}) is represented by the prior probabilities P(U_j | U_i) and P(U_i | U_j) as:

P(U_j | U_i) = P(U_j ∩ U_i) / P(U_i) = N_j / N_i,
P(U_i | U_j) = P(U_i ∩ U_j) / P(U_j) = 1,

where U_k denotes the occurrence of v_k, P(U_j | U_i) is the conditional probability of v_j given that v_i occurs, P(U_j ∩ U_i) is the probability of v_j and v_i occurring simultaneously, and N_k refers to the number of occurrences of U_k in the training subset. Note that the hierarchy ensures U_k given that U_{child(k)} occurs, so the bottom-up probability equals 1. We rescale and normalize the prior probabilities of the child nodes v_{child(k)} so that they sum to 1.
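As a rough sketch of this prior computation, the following estimates the top-down edge weights P(U_j | U_i) = N_j / N_i from label counts and normalizes the children of each node to sum to 1. The counts and the child map are made-up toy values, not statistics from any of the datasets:

```python
# Hypothetical training-set label counts N_k and the parent->children map.
counts = {"news": 100, "sports": 60, "features": 40,
          "football": 45, "books": 30}
children = {"news": ["sports", "features"],
            "sports": ["football"], "features": ["books"]}

def topdown_priors(counts, children):
    """Edge weight f_c(e_{i,j}) = N_j / N_i, renormalized per parent."""
    priors = {}
    for i, childs in children.items():
        raw = [counts[j] / counts[i] for j in childs]
        total = sum(raw)
        for j, w in zip(childs, raw):
            priors[(i, j)] = w / total   # children of i sum to 1
    return priors

f = topdown_priors(counts, children)
```

With these toy counts, `f[("news", "sports")]` is 0.6 and each single-child edge gets weight 1.0 after normalization.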

Hierarchy-Aware Structure Encoder
Tree-LSTM and graph convolutional neural networks (GCN) are widely used as structure encoders for aggregating node information in NLP (Tai et al., 2015; Chen et al., 2017a; Rios and Kavuluru, 2018). As depicted in Figure 3, HiAGM models fine-grained hierarchy information based on the hierarchy-aware structure encoder. Based on the prior hierarchy information, we improve typical structure encoders for the directed hierarchy graph. Specifically, the top-down dataflow employs the prior hierarchy information as f_c(e_{i,j}) = N_j / N_i, while the bottom-up dataflow adopts f_p(e_{i,j}) = 1.0.
Bidirectional Tree-LSTM Tree-LSTM can be utilized as our structure encoder. The implementation of Tree-LSTM is similar to that of syntax encoders (Tai et al., 2015; Zhang et al., 2016). The predefined hierarchy is identical for all samples, which allows mini-batch training for this recursive computational module. Following the child-sum Tree-LSTM of Tai et al. (2015), the node transformation is:

\tilde{h}_k = \sum_{j \in child(k)} h_j,
i_k = \sigma(W^{(i)} x_k + U^{(i)} \tilde{h}_k + b^{(i)}),
f_{kj} = \sigma(W^{(f)} x_k + U^{(f)} h_j + b^{(f)}),
o_k = \sigma(W^{(o)} x_k + U^{(o)} \tilde{h}_k + b^{(o)}),
u_k = \tanh(W^{(u)} x_k + U^{(u)} \tilde{h}_k + b^{(u)}),
c_k = i_k \odot u_k + \sum_{j \in child(k)} f_{kj} \odot c_j,
h_k = o_k \odot \tanh(c_k),

where h_k and c_k represent the hidden state and the memory cell state of node k respectively.
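A drastically simplified sketch of the recursive computation over the fixed hierarchy: plain sums stand in for the gated Tree-LSTM updates, a bottom-up pass aggregates children into parents, a top-down pass propagates parents into children, and the per-node states of the two passes are concatenated. The tree and the inputs are toy values:

```python
import numpy as np

children = {0: [1, 2], 1: [], 2: []}             # node 0 is the root
x = {k: np.ones(4) * (k + 1) for k in children}  # toy node input features

def bottom_up(k):
    # child-sum style aggregation: parent state = own input + children states
    h = x[k] + sum((bottom_up(j) for j in children[k]), np.zeros(4))
    memo_up[k] = h
    return h

def top_down(k, parent_h):
    # top-down propagation: child state = own input + parent state
    memo_down[k] = x[k] + parent_h
    for j in children[k]:
        top_down(j, memo_down[k])

memo_up, memo_down = {}, {}
bottom_up(0)
top_down(0, np.zeros(4))
# bidirectional node representation: h_bi = h_up concatenated with h_down
h_bi = {k: np.concatenate([memo_up[k], memo_down[k]]) for k in children}
```

The point of the sketch is only the dataflow: each node ends with an 8-dimensional state fusing information from both directions, exactly as the concatenation h_bi does in the full model.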
To induce label correlations, HiAGM employs a bidirectional Tree-LSTM as the fusion of a child-sum module and a top-down module:

h^{bi}_k = h^{↑}_k ⊕ h^{↓}_k,

where h^{↑}_k and h^{↓}_k are separately calculated by the bottom-up and top-down Tree-LSTM modules, and ⊕ indicates the concatenation of hidden states. The final hidden state h^{bi}_k of node k is the hierarchical node representation.

Hierarchy-GCN GCN (Kipf and Welling, 2017) is proposed to enhance node representations based on local graph structural information. Several NLP studies have built Text-GCNs for rich word representations upon syntactic structure and word correlation (Marcheggiani and Titov, 2017; Vashishth et al., 2019; Yao et al., 2019; Peng et al., 2018). We introduce a simple hierarchy-GCN for the hierarchy structure, thus exploiting the aforementioned fine-grained hierarchy information.
Hierarchy-GCN aggregates dataflows along the top-down, bottom-up, and self-loop edges. In the hierarchy graph, each directed edge represents a pair-wise label correlation feature, so these dataflows should in principle conduct node transformations with edge-wise linear transformations. However, edge-wise transformations would lead to over-parameterized edge-wise weight matrices. Our Hierarchy-GCN therefore simplifies this transformation with a weighted adjacency matrix that represents the hierarchical prior probabilities. Formally, Hierarchy-GCN encodes the hidden state of node k based on its associated neighbourhood N(k) = {k} ∪ child(k) ∪ parent(k) as:

h'_k = ReLU( Σ_{j ∈ N(k)} a_{k,j} · (W_g^{d(j,k)} h_j + b_g^{d(j,k)}) ),

where W_g^{d(j,k)} ∈ R^{dim×dim} and b_g^{d(j,k)} ∈ R^{dim} are shared within each edge direction. d(j,k) indicates the hierarchical direction from node j to node k, i.e., top-down, bottom-up, or self-loop. Note that a_{k,j} ∈ R denotes the hierarchy probability f_{d(j,k)}(e_{j,k}), where the self-loop edge employs a_{k,k} = 1, top-down edges use f_c(e_{j,k}) = N_k / N_j, and bottom-up edges use f_p(e_{j,k}) = 1. The holistic edge feature matrix F = (a_{k,j}) ∈ R^{C×C} is the weighted adjacency matrix of the directed hierarchy graph. Finally, the output hidden state h_k of node k denotes its label representation corresponding to the hierarchy structural information.
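A minimal sketch of one such layer, assuming direction-shared weight matrices and omitting the gating for brevity; the prior probabilities sit in the top-down adjacency matrix while bottom-up edges use weight 1.0. All tensors are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
C, dim = 3, 4                          # 3 label nodes, hidden size 4
H = rng.normal(size=(C, dim))          # input node states
# top-down prior weights a_{k,j}: node 0 is the parent of nodes 1 and 2
A_td = np.array([[0.0, 0.6, 0.4],
                 [0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0]])
W_td, W_bu, W_self = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

def hierarchy_gcn_layer(H):
    # top-down: each child receives its parent's state, weighted by the prior
    down = A_td.T @ (H @ W_td)
    # bottom-up: each parent receives its children's states with weight 1.0
    up = (A_td > 0).astype(float) @ (H @ W_bu)
    loop = H @ W_self                  # self-loop with a_{k,k} = 1
    return np.maximum(down + up + loop, 0.0)   # ReLU

out = hierarchy_gcn_layer(H)
```

Because the whole structure lives in the two adjacency matrices, stacking or batching this layer only changes the shape of `H`, not the weights.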

Hybrid Information Aggregation
Previous global models classify labels upon the original textual information and improve the decoder with predefined hierarchy paths. In contrast, we construct a novel end-to-end hierarchy-aware global model (HiAGM) for the mutual interaction of text features and label correlations. It combines a traditional text classification model with a hierarchy encoder, thus obtaining label-wise text features. HiAGM is extended to two variants, a parallel model for an inductive fusion (HiAGM-LA) and a serial model for a deductive fusion (HiAGM-TP).
Given a document x = (w_1, w_2, ..., w_s), the sequence of token embeddings is first fed into a bidirectional GRU layer to extract contextual text features. Then, multiple CNNs are used for generating n-gram features. The concatenation of the n-gram features is filtered by a top-k max-pooling layer to extract the key information. Finally, by reshaping, we obtain the continuous text representation S = (s_1, ..., s_n), where s_i ∈ R^{d_c}, d_c indicates the output dimension of the CNN layer, and n = n_k × n_c is the product of the top-k number n_k and the number of CNNs n_c.
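The pooling-and-reshaping step can be sketched as follows, with made-up shapes; top-k max-pooling here simply keeps the k largest activations per channel of each CNN's feature map:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_c = 10, 8               # sequence length, CNN output channels
n_k, n_c = 4, 3                    # top-k per CNN, number of parallel CNNs
# stand-ins for the n_c CNN feature maps computed over bi-GRU outputs
cnn_maps = [rng.normal(size=(seq_len, d_c)) for _ in range(n_c)]

def topk_pool(feat, k):
    # keep the k largest activations per channel (order not preserved)
    return np.sort(feat, axis=0)[-k:]

# concatenate the pooled maps into S with n = n_k * n_c rows
S = np.concatenate([topk_pool(m, n_k) for m in cnn_maps], axis=0)
```

With these shapes, S is a (12, 8) matrix, i.e., n = 4 × 3 feature vectors of dimension d_c = 8, matching the text representation fed to both variants.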
Hierarchy-Aware Multi-Label Attention The first variant of HiAGM, called HiAGM-LA, is based on multi-label attention. The attention mechanism is often utilized as a memory unit in text classification (Yang et al., 2016; Du et al., 2019). Recent LCL studies (Huang et al., 2019; You et al., 2019) construct one multi-label attention-based model per level so as to avoid optimizing label embeddings across different levels.
Our HiAGM-LA is similar to those baselines but simplifies the multi-label attention LCL models into a single global model. Based on our hierarchy encoders, HiAGM-LA overcomes the convergence problem of label embeddings across levels: label representations are enhanced with bidirectional hierarchical information, and this local structural information makes it feasible to learn label features across different levels in a single model. Formally, suppose that the trainable label embedding of node k is randomly initialized as L_k ∈ R^{d_l}. The initial label embedding L_k is directly fed into the structure encoder as the input vector x_k of the aligned label node. Then, the output hidden states h ∈ R^{C×d_c} represent the hierarchy-aware label features. Given the text representation S ∈ R^{n×d_c}, HiAGM-LA calculates the label-wise attention value α_{ki} as:

α_{ki} = exp(h_k s_i^T) / Σ_{j=1}^{n} exp(h_k s_j^T),

where α_{ki} indicates how informative the i-th text feature vector is for the k-th label. The inductive label-aligned text features V ∈ R^{C×d_c} are then obtained by multi-label attention as v_k = Σ_{i=1}^{n} α_{ki} s_i and fed into the classifier for prediction. Furthermore, we could directly use the hidden states of the hierarchy encoder as pretrained label representations so that HiAGM-LA could be even lighter in the inference process.
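A small numerical sketch of the attention step, assuming a plain dot-product score between label features and text features (the exact scoring function may differ in the released implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, C, d_c = 5, 3, 4
S = rng.normal(size=(n, d_c))      # text features s_1..s_n
h = rng.normal(size=(C, d_c))      # hierarchy-aware label features

scores = h @ S.T                                 # (C, n) logits h_k · s_i
scores -= scores.max(axis=1, keepdims=True)      # numerical stability
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
V = alpha @ S                                    # label-aligned features v_k
```

Each row of `alpha` is a distribution over the n text feature vectors, so V holds one d_c-dimensional text summary per label.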
Hierarchical Text Feature Propagation Graph neural networks are capable of message passing (Gilmer et al., 2017; Duvenaud et al., 2015), learning both local node correlations and the overall graph structure. To avoid the noise from heterogeneous fusion, the second variant, called HiAGM-TP, obtains label-wise text features in a deductive manner. It directly takes the text features S as the node inputs and updates textual information through the hierarchy-aware structure encoder, mainly conducting the propagation of text features. Formally, the node inputs V are reshaped from the text features by a single linear transformation:

V = reshape( flatten(S) · M ),

where the trainable weight matrix M ∈ R^{(n·d_c)×(C·d_v)} transforms the text features S ∈ R^{n×d_c} into the node inputs V ∈ R^{C×d_v}. Given the predefined structure, each sample updates its textual information throughout the same holistic taxonomic hierarchy. In a mini-batch learning manner, the initial node representation V is fed into the hierarchy encoder. The output hidden state h denotes the deductive hierarchy-aware text features and serves as the input of the final classifier. Compared with HiAGM-LA, the transformation of HiAGM-TP is conducted on textual information without the fusion of label embeddings. Thus, the structure encoder is activated in both training and inference for passing textual messages across the hierarchy. It converges more easily but has slightly higher computational complexity than HiAGM-LA.
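The single-sample transformation can be sketched as below, with illustrative dimensions; a batch version would apply the same flattened linear map per sample:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_c = 6, 8                      # text feature map: n vectors of size d_c
C, d_v = 4, 5                      # C label nodes, node input size d_v
S = rng.normal(size=(n, d_c))      # text features from the text encoder
M = rng.normal(size=(n * d_c, C * d_v))  # trainable projection

# flatten S, project with one linear map, reshape into per-node inputs V
V = (S.reshape(-1) @ M).reshape(C, d_v)
```

V then plays the role of x_k for the structure encoder: every label node receives its own slice of the projected textual information.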

Classification
We flatten the hierarchy by taking all nodes as prediction targets for multi-label classification, regardless of whether a node is a leaf or an internal node. The final hierarchy-aware features are fed into a fully connected layer for prediction. HiAGM is complemented by recursive regularization (Gopal and Yang, 2013), L_r = Σ_{i∈C} Σ_{j∈child(i)} (1/2) ||w_i − w_j||^2, applied to the parameters of the final fully connected layer. For multi-label classification, HiAGM uses a binary cross-entropy loss function:

L_c = − Σ_{i=1}^{N} Σ_{j=1}^{C} [ y_{ij} log(ŷ_{ij}) + (1 − y_{ij}) log(1 − ŷ_{ij}) ],

where y_{ij} and ŷ_{ij} are the ground truth and the sigmoid score for the j-th label of the i-th sample. Thus, the final loss function is L_m = L_c + λ · L_r.
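A sketch of the combined objective on toy values; `bce` and `recursive_reg` mirror L_c and L_r above, while the child map and the classifier weights are illustrative:

```python
import numpy as np

def bce(y, p, eps=1e-9):
    # binary cross-entropy L_c over all samples and labels
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def recursive_reg(W, children):
    # L_r: pull each child's classifier weights toward its parent's
    return sum(0.5 * np.sum((W[i] - W[j]) ** 2)
               for i, childs in children.items() for j in childs)

y = np.array([[1.0, 0.0, 1.0]])    # toy ground truth y_ij
p = np.array([[0.9, 0.2, 0.7]])    # toy sigmoid scores
W = np.eye(3)                      # toy classifier weight rows w_k
lam = 1e-6                         # penalty coefficient lambda
L_m = bce(y, p) + lam * recursive_reg(W, {0: [1, 2]})
```

With orthonormal toy rows, each parent-child pair contributes 0.5 · 2 = 1 to L_r, so the regularizer equals 2 before scaling by λ.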

Experiment
In this section, we introduce our experiments with datasets, evaluation metrics, implementation details, comparison, ablation study, and analysis of experimental results.

Experiment Setup
We evaluate our proposed architecture on the RCV1-V2, Web-of-Science (WOS), and NYTimes (NYT) datasets for comparison and ablation studies.
Datasets RCV1-V2 (Lewis et al., 2004) and NYT (Sandhaus, 2008) are both news categorization corpora, while WOS (Kowsari et al., 2017) includes abstracts of papers published on Web of Science. These typical text classification datasets are all annotated with ground-truth hierarchical taxonomic labels. We use the benchmark split of RCV1-V2 and select a small part of the training set for validation. The WOS dataset is randomly split into training, validation, and test subsets. For NYT, we randomly select and split subsets from the original raw data, removing samples with no label or with only a single first-level label. Note that WOS is a single-path HTC dataset, while NYT and RCV1-V2 include multi-path taxonomic tags. The statistics of the datasets are shown in Table 1.

Evaluation Metrics
We measure the experimental results with standard evaluation metrics (Gopal and Yang, 2013): Micro-F1 and Macro-F1. Micro-F1 takes the overall precision and recall of all instances into account, while Macro-F1 equals the average F1-score over labels. Hence Micro-F1 gives more weight to frequent labels, while Macro-F1 weights all labels equally.
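The distinction can be made concrete with a small sketch that computes both scores from binary label matrices:

```python
import numpy as np

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(y_true, y_pred):
    # per-label counts over the sample axis
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    micro = f1(tp.sum(), fp.sum(), fn.sum())          # pool counts
    macro = float(np.mean([f1(*t) for t in zip(tp, fp, fn)]))  # average F1
    return micro, macro

y_true = np.array([[1, 0], [1, 1], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [0, 1]])
micro, macro = micro_macro_f1(y_true, y_pred)  # micro = 6/7, macro = 5/6
```

The frequent first label is perfectly predicted, so the pooled micro score (6/7) is higher than the macro score (5/6), which lets the rarer, imperfect second label count equally.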

Implementation Details
We use a one-layer bi-GRU with 64 hidden units and 3 parallel CNN layers with filter region sizes of {2, 3, 4}. The vocabulary is created from the most frequent words with a maximum size of 60,000. We use 300-dimensional pretrained word embeddings from GloVe (Pennington et al., 2014) and randomly initialize the embeddings of out-of-vocabulary words above the minimum count of 2.
The key information for text classification can usually be extracted from the opening statements, so we set the maximum length of token inputs to 256. The fixed threshold for tagging is set to 0.5. Dropout is employed in the embedding layer and the MLP layer with a rate of 0.5, and in the bi-GRU layer and the node transformation with rates of 0.1 and 0.05 respectively. Additionally, for HiAGM-LA, the label embedding is initialized by Kaiming uniform while the other model parameters are initialized by Xavier uniform (Glorot and Bengio, 2010). We use the Adam optimizer with a mini-batch size of 64, learning rate α = 1 × 10^−4, momentum parameters β_1 = 0.9 and β_2 = 0.999, and ε = 1 × 10^−6. The penalty coefficient of recursive regularization is set to 1 × 10^−6. We evaluate on the test subset with the model that performs best on the validation subset.

Table 2 (partial): Micro-F1 and Macro-F1 on RCV1-V2.
Model | Micro | Macro
Local Models:
HR-DGCNN-3 (Peng et al., 2018) | 76.18 | 43.34
HMCN | 80.80 | 54.60
HFT(M) (Shimura et al., 2018) | 80.29 | 51.40
Htrans (Banerjee et al., 2019) | 80.51 | 58.49
Global Models:
SGM (Yang et al., 2018) | 77.30 | 47.49
HE-AGCRCNN (Peng et al., 2019) | — | —
Note that the prior probability matrix in HiAGM-TP is fine-tuned during training while the one in HiAGM-LA is fixed. w/o Rec denotes training without recursive regularization. "†" and "‡" indicate a statistically significant difference (p<0.01) from TextRCNN and TextRCNN+LabelAttention respectively.

Comparison
In Table 2, we compare the performance of HiAGM to traditional MLC models and state-of-the-art HTC studies on RCV1-V2. With recursive regularization on the last MLP layer, the conventional text classification models also obtain competitive performance. Both HiAGM-LA and HiAGM-TP outperform most state-of-the-art results of global and local studies, especially in Macro-F1, which shows the strong benefit of our hierarchy encoders for HTC. HiAGM-LA achieves 61.90% Macro-F1 and 82.54% Micro-F1, while HiAGM-TP obtains the best performance of 63.35% Macro-F1 and 83.96% Micro-F1.
To clarify the improvement of our proposed HiAGM, we also experiment without recursive regularization. Compared with the recent state-of-the-art work (HiLAP), our HiAGM-LA and HiAGM-TP without recursive regularization still achieve competitive improvements of 1.75% and 3.13% in terms of Macro-F1. This demonstrates that recursive regularization is complementary to, but not necessary for, our proposed architecture. Note that the SGM result in Table 2 is reproduced with the benchmark split upon the released project of SGM. According to Table 4, HiAGM achieves consistent improvements on HTC across the RCV1-V2, WOS, and NYT datasets, indicating the strong benefit of label-wise text features for the HTC task. The results show that our proposed global model HiAGM has an advanced capability of enhancing text features for HTC.
All in all, HiAGM strongly improves the performance on the benchmark dataset RCV1-V2 and the other two classical text classification datasets. In particular, it obtains better results in Macro-F1, indicating that HiAGM has a strong ability to tackle data-sparse classes deep in the hierarchy.

Analysis
Hybrid Information Aggregation According to Table 2, both variants outperform the baseline models and previous studies, indicating that the enhanced text features are beneficial for HTC. We present the ablation study of the two variants and the structure encoders in Table 3. Both HiAGM-LA and HiAGM-TP are trained with fixed prior probabilities. With the help of its recursive computation process, the bidirectional Tree-LSTM achieves better performance on learning hierarchy-aware label embeddings, but at the cost of lower computational efficiency compared to Hierarchy-GCN. Regarding HiAGM-TP, Hierarchy-GCN shows both better performance and better efficiency than the bidirectional Tree-LSTM.
The two variants have their respective advantages. Specifically, HiAGM-TP outperforms HiAGM-LA with both the Bi-TreeLSTM and Hierarchy-GCN encoders. The multi-label attention variant, HiAGM-LA, may induce noise from the randomly initialized label embeddings. In contrast, HiAGM-TP aggregates the fusion of local structural information and text feature maps without the negative impact of label embeddings. As for efficiency, HiAGM-LA is more computationally efficient than HiAGM-TP, especially in the inference process: the label representations from the hierarchy encoder can be used as pretrained label embeddings for multi-label attention during inference, so HiAGM-LA can omit the hierarchy-aware structure encoder module after training.
We recommend HiAGM-TP when performance matters most, and HiAGM-LA when faster inference is required while still keeping empirically good performance.

GCN Layers
The impact of the number of GCN layers is also an important issue for HiAGM. As illustrated in Figure 4, the one-layer structure encoder consistently performs best in both HiAGM-LA and HiAGM-TP. This indicates that correlations between non-adjacent nodes are not essential for HTC and even introduce noise into hierarchical information aggregation. This empirical conclusion is consistent with the implementation of recursive regularization (Peng et al., 2018; Gopal and Yang, 2013) and transfer learning (Banerjee et al., 2019; Shimura et al., 2018) between adjacent labels or levels.
Prior Probability According to the aforementioned comparisons, our simplified structure encoders with prior probabilities are clearly beneficial for HTC. We also investigate different choices of prior probabilities with the Hierarchy-GCN encoder on the HiAGM-TP variant, as shown in Table 5. Note that the weighted adjacency matrix is initialized by the prior probabilities.
The simple weighted adjacency matrix performs better than the complex edge-wise weight matrices for node transformation. The fixed weighted adjacency matrix also achieves better results than both the original unweighted adjacency matrix and a trainable, randomly initialized one. This demonstrates that the prior probability of the hierarchy is capable of representing hierarchical label dependencies. Furthermore, the best result is obtained by the setting that obeys the calculation direction of the prior probability. Comparing the fixed adjacency matrix with the trainable one, we find that the weighted adjacency matrix can be fine-tuned for higher flexibility and better performance.
Table 5: Ablation study of the fine-grained hierarchy information on RCV1-V2 based on the GCN-based HiAGM-TP. Edge-Wise Matrix denotes that each directional edge has a distinct trainable weight matrix for transformation, while the other settings use the weighted adjacency matrix. P is f_c(e_{i,j}) = N_j / N_i and 1 is f_p(e_{i,j}) = 1.0. "*" allows information propagation between all nodes, while the other settings obey the constraint of the hierarchy.

In Table 5, the settings that allow interactions between all nodes perform worse than those that only allow propagation along the hierarchy paths. As analyzed for the GCN layers, interactions between non-adjacent nodes have a negative impact on HTC. The ablation study of the prior probability validates this conclusion as well.

Performance Study
We analyze the performance improvement by grouping labels according to their levels. We compute level-based Micro-F1 scores on NYT for the baseline, HiAGM-LA, and HiAGM-TP. Figure 5 shows that our models retain better performance than the baseline on all levels, especially the deeper ones.

Conclusion
In this paper, we propose a novel end-to-end hierarchy-aware global model that extracts label structural information for aggregating label-wise text features. We present a bidirectional Tree-LSTM and a Hierarchy-GCN as hierarchy-aware structure encoders. Furthermore, our framework is extended into a parallel variant based on multi-label attention and a serial variant based on text feature propagation. Our approach empirically achieves significant and consistent improvements on three distinct datasets, especially on low-frequency labels. In particular, both variants outperform the state-of-the-art model on the RCV1-V2 benchmark dataset, and our best model obtains a Macro-F1 score of 63.35% and a Micro-F1 score of 83.96%.