Hierarchical Text Classification with Reinforced Label Assignment

While existing hierarchical text classification (HTC) methods attempt to capture label hierarchies for model training, they either make local decisions regarding each label or completely ignore the hierarchy information during inference. To resolve the mismatch between training and inference and to model label dependencies in a more principled way, we formulate HTC as a Markov decision process and propose to learn a Label Assignment Policy via deep reinforcement learning to determine where to place an object and when to stop the assignment process. The proposed method, HiLAP, explores the hierarchy during both training and inference time in a consistent manner and makes inter-dependent decisions. As a general framework, HiLAP can incorporate different neural encoders as base models for end-to-end training. Experiments on five public datasets and four base models show that HiLAP yields an average improvement of 33.4% in Macro-F1 over flat classifiers and outperforms state-of-the-art HTC methods by a large margin. Data and code can be found at https://github.com/morningmoni/HiLAP.


Introduction
In recent years there has been a surge of interest in leveraging hierarchies (taxonomies) to organize objects (e.g., documents), leading to the development of hierarchical text classification (HTC), a task that aims to predict for an object multiple appropriate labels in a given label hierarchy, which together constitute a sub-tree. HTC methods have found a wide range of applications such as question answering (Qu et al., 2012), online advertising (Agrawal et al., 2013), and scientific literature organization (Peng et al., 2016). In contrast to "flat" classification, the key challenges of HTC lie in modeling the large-scale, imbalanced, and in particular, structured label space.
Figure 1: We aim at consistent, multi-path, and non-mandatory leaf node prediction. For a Caribbean restaurant with a beer bar, inconsistent prediction may place it at node "Beer Bars" but not "Bars", which contradict each other; single-path prediction may only recognize that it is a beer bar; mandatory leaf node prediction would have to assign a leaf node "Dominican" even if the nation of the cuisine is uncertain.
Based on how the hierarchy is explored, HTC methods can be summarized into flat, local, and global approaches (Silla and Freitas, 2011). Flat approaches (Hayete and Bienkowska, 2005; Johnson and Zhang, 2014) assume all the labels in the given hierarchy are independent. Some predict labels at the leaf nodes and heuristically add their ancestor labels, which is problematic as the labels of some objects may not be at the leaf nodes (non-mandatory leaf node prediction, see Fig. 1) and all the non-leaf nodes are completely neglected. Some simply ignore the hierarchy and perform standard multi-label classification, in which label inconsistencies (one label is predicted positive but its ancestors are not) may occur and post-processing is needed to correct such contradictions. Local approaches (Koller and Sahami, 1997; Cesa-Bianchi et al., 2006) [...] been predicted positive. One critical issue is that the number of local classifiers depends on the size of the label hierarchy, making local approaches infeasible to scale.
Figure 2: An illustrative example of the label assignment policy. At t = 0, x_i is placed at the root label and the policy decides whether x_i should be placed at its two children (red). At t = 1, x_i is placed at label "Restaurants", which adds its three children as candidates. At t = 6, the stop action is taken and the label assignment is thus terminated. We then take all the labels where x_i has been placed (blue) as the labels of x_i.
Global approaches use one single classifier and model the label hierarchy more explicitly. Traditional global approaches (Wang et al., 2001; Silla Jr and Freitas, 2009) are largely based on specific flat models and often make unrealistic assumptions (Cai and Hofmann, 2004) as in flat approaches. Recent neural approaches (Kim, 2014; Yang et al., 2016) mainly focus on flat classification while their performance in HTC is relatively less studied. Even if the classification is supposed to be hierarchical, prior work (Gopal and Yang, 2013; Johnson and Zhang, 2014; Peng et al., 2018) still makes flat and independent predictions or utilizes simple constraints without considering the holistic quality of label assignment. One recent framework (Wehrmann et al., 2018) attempts to leverage both local and global information, but it uses static features as input and its inference process is still flat.
In this paper, we formulate HTC as a Markov decision process to better capture label dependencies and measure the holistic quality of label assignment. We present HiLAP, a global framework that learns a label assignment policy to determine where to place the objects and when to stop the assignment process. HiLAP explores the label hierarchy during both training and inference in a consistent manner, which alleviates the exposure bias often found in prior local and global approaches. By learning when to stop, HiLAP is more flexible than approaches that only support mandatory leaf node prediction or require thresholding. In addition, HiLAP supports multi-path prediction and its predictions of one object on different paths are inter-dependent, which not only guarantees label consistency but matches the nature of HTC. Furthermore, HiLAP estimates the holistic quality of all the labels assigned to one object via reinforcement learning instead of evaluating each label independently via maximum likelihood as in prior studies. To summarize, HiLAP achieves better effectiveness compared to flat and local approaches as it examines the label hierarchy during both training and inference. HiLAP has more flexibility and generalization capacity than previous global approaches in that it has no constraints on the structure of the hierarchy or the labels of the objects (Cai and Hofmann, 2004), generalizes to neural representation learning models (Gopal and Yang, 2013), and makes inter-dependent predictions while ensuring label consistency (Wehrmann et al., 2018;Peng et al., 2018).
HiLAP can be combined with various neural encoding models and trained in an end-to-end fashion. In our experiments, we select four representative encoding models as the base models to evaluate the effectiveness of HiLAP. Experimental results on five public datasets from different domains show that combining the base models with HiLAP yields an average performance improvement of 33.4% in Macro-F1 over corresponding flat classifiers and outperforms state-of-the-art HTC methods by a large margin. In particular, the ablation study shows that HiLAP is especially beneficial to unpopular labels at the bottom levels.

Hierarchical Label Assignment

Overview
Problem Formulation. We define a label hierarchy H = (L, E) as a tree- or DAG (directed acyclic graph)-structured hierarchy with a set of nodes (labels) L and a set of edges E indicating the parent-child relation between the labels. Taking a set of objects X = {x_1, x_2, ..., x_N} and their labels L = {L_1, L_2, ..., L_N} as input, we aim to learn a label assignment policy P to place each object x_i to its labels L_i on the label hierarchy H. The label assignment is supposed to be consistent, multi-path, and non-mandatory leaf node prediction (refer to Figs. 1 and 2). We define one base model B as a mapping f that converts raw object x_i to a finite dimensional vector, i.e., the object embedding e_d ∈ R^D. B can be any neural representation learning model and its output e_d is used as the input of P for policy learning. The major challenge, compared to the standard classification setup, is that we need to model E, i.e., the relation between labels.
Figure 3: The architecture of the proposed framework HiLAP. One CNN model (Kim, 2014) is used as the base model for illustration. The object embedding e_d generated by the base model is combined with the embedding of the currently assigned label l_t and used as the state representation s_t, based on which actions are taken by the policy network. The time step shown corresponds to t = 1 in Fig. 2.
Our Framework. Prior studies either have a mismatch between training and inference as different routines are followed in the two phases, or compute losses with respect to each individual label and make flat predictions during inference time. In contrast, we learn a policy that (1) makes consistent, inter-dependent predictions by traversing the label hierarchy and maintaining state representation; (2) measures the holistic quality of label assignment via reinforcement learning. Specifically, the policy P puts x_i at the root label in the beginning. At each time step, P decides which label x_i should be further placed to, among all the children labels of where x_i has been placed, until a special stop action is taken. An illustration of how HiLAP labels one object is shown in Fig. 2 and the overall architecture of HiLAP is shown in Fig. 3.

Reinforcement Learning for Hierarchical Label Assignment
We describe the details of policy learning including its actions, rewards, states, and the policy network in this section. We formulate HTC as a Markov decision process (MDP): at each time step, the agent observes current state, takes an action, and receives a reward. The end goal is to train a policy network to determine where to place the objects and when to stop.
Actions. Specifically, we regard the process of placing an object x_i to the right positions on the label hierarchy as making a sequence of actions, where an action a_t at time step t is to select one label l_{t+1} from the action space A_t and place x_i to that label l_{t+1}. We denote the children of label l_t as C(l_t). At the beginning of each episode, x_i is placed at the root label l_0 and the action space A_0 = C(l_0), i.e., all the labels at level 1. When x_i is placed at another label l_1, its children C(l_1) are then added to the action space A_1 while l_1 itself is removed. In addition, one stop action with embedding e_stop ∈ R^C is included in the action space so that the model can automatically learn when to stop placing object x_i to new labels. Intuitively, when the confidence of placing x_i to another label is lower than that of the stop action, the label assignment process is terminated. In short, the action space A_t consists of all the unvisited children labels of where the object x_i has been placed and the stop action. One distinction of HiLAP is that it takes the inter-dependencies of labels across different paths and levels into consideration while previous approaches make independent predictions on different paths. For example, HiLAP can first place x_i to a label at level 3 if the probability of that label is high and then place it to another label at level 1 on another path.
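The evolution of the action space described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `STOP` sentinel, the `assign_labels` helper, the toy hierarchy, and the hand-picked scores are all hypothetical, and a fixed score table stands in for the learned policy. Actions are chosen greedily here, as at inference time.

```python
STOP = "<stop>"  # hypothetical sentinel standing in for the stop action

def assign_labels(children, root, score, max_steps=100):
    """Greedily place one object on a label hierarchy until stop is chosen.

    children: dict mapping a label to the list of its child labels.
    score:    callable(current_label, action) -> float, a stand-in for the
              learned policy network (an assumption for this sketch).
    """
    current, assigned = root, []
    actions = list(children.get(root, [])) + [STOP]   # A_0 = C(l_0) plus stop
    for _ in range(max_steps):
        best = max(actions, key=lambda a: score(current, a))
        if best == STOP:
            break
        assigned.append(best)
        actions.remove(best)                           # a visited label leaves A_t
        for child in children.get(best, []):           # its children join A_t
            if child not in assigned and child not in actions:
                actions.append(child)
        current = best
    return assigned

# Toy hierarchy and scores (hypothetical values, loosely following Fig. 2).
children = {
    "Root": ["Restaurants", "Bars"],
    "Restaurants": ["Caribbean", "Italian", "Chinese"],
    "Bars": ["Beer Bars", "Wine Bars"],
    "Caribbean": ["Dominican", "Haitian", "Puerto Rican"],
}
toy_scores = {"Restaurants": 0.9, "Bars": 0.8, "Caribbean": 0.7,
              "Beer Bars": 0.6, STOP: 0.5}
labels = assign_labels(children, "Root", lambda cur, a: toy_scores.get(a, 0.0))
print(labels)
```

Because a label only enters the action space after its parent has been visited, every returned label set is a valid sub-tree, and labels on different paths compete within the same action space.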
Rewards. The agent receives scalar rewards as feedback for its actions. Different from existing work where each label of one example is treated independently, HiLAP measures the quality of all the labels assigned to each example x_i by rewarding the agent with the Example-based F1 (see Sec. 4.1 for details of this metric). Intuitively, the agent would realize how similar the assigned and the ground-truth labels of one example are. Instead of waiting until the end of the label assignment process and comparing the predicted labels with the gold labels, we use reward shaping (Mao et al., 2018), i.e., giving intermediate rewards at each time step, to accelerate the learning process. Specifically, we set the reward r of x_i at time step t to be the difference of Example-based F1 scores between the current and the last time step: r_t^{x_i} = F1_t^{x_i} - F1_{t-1}^{x_i}. If the current F1 is better than that at the last time step, the reward is positive, and vice versa. The cumulative reward from the current time step to the end of an episode cancels the intermediate rewards and thus reflects whether the current action improves the holistic label assignment or not. As a result, the learned policy does not focus only on the current placement but has a long-term view that takes following actions into account.
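The shaped reward and its telescoping property can be checked with a short Python sketch. The per-example F1 below uses the common set-overlap definition, which is an assumption here since the metric is defined elsewhere in the paper; the function names and the toy labels are hypothetical.

```python
def example_f1(predicted, gold):
    """F1 between the predicted and gold label sets of a single example."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def shaped_rewards(placements, gold):
    """Reward at each step = F1 after the placement minus F1 before it."""
    rewards, assigned, previous = [], [], 0.0
    for label in placements:
        assigned.append(label)
        current = example_f1(assigned, gold)
        rewards.append(current - previous)
        previous = current
    return rewards

gold = {"Restaurants", "Caribbean", "Bars"}
placements = ["Restaurants", "Italian", "Caribbean"]  # second placement is wrong
rewards = shaped_rewards(placements, gold)
print(rewards)        # the wrong placement receives a negative reward
print(sum(rewards))   # equals the final F1 of the whole assignment
```

With no discounting, the intermediate terms cancel and the sum of rewards from any step onward equals the final F1 minus the F1 before that step, which is the long-term view described above.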
States and Policy Network. We parameterize action a_t by a policy network π(a | s; W). For each object, its representation e_d is generated by the base model B. For each label, a label embedding l ∈ R^C is randomly initialized and updated during training. The embeddings of the object e_d and the currently assigned label l_t are concatenated and projected to a vector s_t ∈ R^C via a two-layer feed-forward network. s_t has the same size as the label embedding l and is used as the state representation at time step t. By stacking the action embeddings (i.e., the embeddings of candidate labels and the stop action), we obtain an action matrix A_t with size |A_t| × C. A_t is multiplied with the state embedding s_t, which outputs the probability distribution of actions. Finally, an action a_t is sampled based on the probability distribution of the action space.
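The scoring step can be illustrated as follows. This sketch assumes that the product of A_t and s_t is normalized with a softmax, which the text does not state explicitly; the helper names and the toy vectors are hypothetical, and plain Python lists stand in for tensors.

```python
import math
import random

def action_distribution(state, action_embeddings):
    """Softmax over dot products between s_t and each row of A_t."""
    logits = [sum(a * s for a, s in zip(row, state)) for row in action_embeddings]
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

state = [1.0, 0.0]                                   # s_t with C = 2
actions = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]       # A_t with |A_t| = 3 rows
probs = action_distribution(state, actions)
choice = sample_action(probs, random.Random(0))
print(probs, choice)
```

Since the number of rows of A_t changes as labels are visited, the distribution is recomputed over a different candidate set at every step.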
We use policy gradient (Williams, 1992) as the optimization algorithm. In addition, we adopt a self-critical training approach (Rennie et al., 2017).
For each object x_i, two label assignments are generated: \hat{L}_{x_i} is sampled from the probability distribution, and \bar{L}_{x_i}, the baseline label assignment, is greedily obtained by choosing the action with the highest probability at each time step. We use \tilde{r}_t^{x_i} = r_t^{x_i}(\hat{L}_{x_i}) - r_t^{x_i}(\bar{L}_{x_i}) as the actual reward, which ensures that the policy network learns to place the object to positions with higher F1 score than the greedy baseline. Formally, we measure the global loss O_g as follows.
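A form consistent with the policy-gradient setup described here is the standard REINFORCE objective (Williams, 1992), written with the symbols π(a | s; W), X, and the cumulative reward v used in the surrounding text. The range of the sum over time steps and the absence of any normalization are assumptions of this sketch.

```latex
O_g = - \sum_{x_i \in X} \sum_{j} v_j^{x_i} \, \log \pi\left(a_j \mid s_j ; W\right)
```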
where v_j^{x_i} = \sum_{t=j}^{T} γ^{t-j} \tilde{r}_t^{x_i} is the cumulative future reward at time j and γ ∈ [0, 1] is the discount factor. At the time of inference, we greedily select the labels with the highest probability as the predicted labels of x_i.
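The cumulative future reward and the self-critical baseline can be computed as in the sketch below. The function names are hypothetical, and padding the shorter episode with zero rewards when the sampled and greedy episodes differ in length is an assumption of this sketch, not something the text specifies.

```python
def self_critical_rewards(sampled, greedy):
    """Per-step reward of the sampled episode minus that of the greedy baseline."""
    n = max(len(sampled), len(greedy))
    pad = lambda r: list(r) + [0.0] * (n - len(r))   # assumption: zero-pad shorter episode
    return [s - g for s, g in zip(pad(sampled), pad(greedy))]

def discounted_returns(rewards, gamma):
    """v_j = sum over t >= j of gamma^(t - j) * r_t, computed right to left."""
    returns, running = [0.0] * len(rewards), 0.0
    for j in range(len(rewards) - 1, -1, -1):
        running = rewards[j] + gamma * running
        returns[j] = running
    return returns

adjusted = self_critical_rewards([0.5, -0.1], [0.5, 0.0, 0.1])
returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(adjusted, returns)
```

Each v_j would then weight the log-probability of the action taken at step j, so a sampled action is reinforced only when the episode it belongs to does better than the greedy one from that step onward.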

Top-Down Supervised Pre-Training
Instead of learning from scratch, we use supervised learning to pre-train HiLAP. We denote the supervised variant as HiLAP-SL. While most parameters of HiLAP-SL are shared and used to initialize HiLAP (except that e_stop is randomly initialized), its way of exploring the label hierarchy H is dissimilar.
The major difference is that HiLAP-SL explores the label hierarchy H in a top-down manner independently. At each time step t, the object goes down one level on the hierarchy and the labels under the same parent are discriminated locally. Specifically, the local per-parent label probability distribution p_t^Local is estimated as p_t^Local = σ(C_t s_t), where σ denotes the sigmoid function, and C_t ∈ R^{|C(l_t)| × C} denotes the candidate embeddings of HiLAP-SL, i.e., an embedding matrix consisting of the children of the current label l_t, rather than all the labels where x_i has been placed.
Another difference is that in HiLAP the actions are sampled and thus might place the objects to incorrect labels, while in HiLAP-SL only the ground-truth positions are traversed during training. Specifically, if there are K (≥ 1) ground-truth labels at the same level, the object embedding e_d would be copied K times and losses on the K different paths would be measured independently (see Fig. 6 in Appendix for illustration). The local loss of HiLAP-SL is defined in terms of per-level losses O_t over the levels up to T, where T is the lowest label's level of one example and O_t estimates the binary cross entropy over the candidate labels C(l_t), computed from L_i(l) and p_{t,i}^Local(l), which evaluate label l of x_i. Intuitively, HiLAP-SL works as if there were a set of local classifiers, although most of its parameters (except for the label embedding l) are shared by all the labels so that there is no need to train multiple classifiers.
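One level of this local supervision can be sketched as follows. The sigmoid follows the definition p_t^Local = σ(C_t s_t) given earlier; averaging the binary cross entropy over the candidates (rather than summing it) and the clipping constant are assumptions, and the function names and toy values are hypothetical.

```python
import math

def local_probabilities(state, child_embeddings):
    """sigmoid(C_t s_t): one independent probability per child of l_t."""
    return [1.0 / (1.0 + math.exp(-sum(c * s for c, s in zip(row, state))))
            for row in child_embeddings]

def local_bce(probs, targets, eps=1e-12):
    """Binary cross entropy over the candidate labels of one parent."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)              # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)                        # assumption: mean over candidates

state = [0.0, 0.0]                                   # toy s_t
child_embeddings = [[1.0, 2.0], [3.0, 4.0]]          # toy C_t with two children
probs = local_probabilities(state, child_embeddings)
loss = local_bce(probs, targets=[1, 0])              # only the first child is a gold label
print(probs, loss)
```

Because the same state projection and label embeddings are used at every level, this behaves like a set of local classifiers without training a separate model per parent.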

Experiment Setup
Datasets. We conduct extensive experiments on five public datasets from various domains (summarized in Table 1 and detailed in Appendix A). The first two datasets are related to news categorization, including RCV1 (Lewis et al., 2004) and the NYT annotated corpus (Sandhaus, 2008). The third dataset consists of businesses and their reviews: we hypothesize that one business can be represented by its reviews and use the reviews to predict business categories. The last two datasets are related to protein functional catalogue (FunCat) and gene ontology (GO) prediction (Vens et al., 2008), which are used to test the generalization ability of HiLAP to non-textual data. For all the datasets, the lowest labels of one example may not be at the leaf nodes and there could be multiple labels at each level, making them harder and more realistic than mandatory-leaf or single-path datasets such as IPC (WIPO, 2014) and LSHTC (Partalas et al., 2015).
Recall that F1^{x_i} is used as the reward in HiLAP.
Base Models for Feature Encoding. Unlike most existing global HTC methods, which rely on pre-specified features (Gopal and Yang, 2013) as input or build on specific models (Cai and Hofmann, 2004; Vens et al., 2008; Silla Jr and Freitas, 2009), our framework is trained in an end-to-end manner by leveraging a differentiable feature representation learning model as the base model. Specifically, we use TextCNN (Kim, 2014), HAN (Yang et al., 2016), and bow-CNN (Johnson and Zhang, 2014) on the three textual datasets, and a feed-forward network on the two non-textual datasets. The details of the base models are provided in Appendix C due to limited space.
To incorporate one base model into our framework, we remove its final feed-forward layer that projects the object representation e_d to a flat probability distribution of all labels (p^Flat), and use e_d directly as the input of HiLAP. As one will see in the later experiments, HiLAP consistently improves the base model by modeling the label hierarchy in an effective manner.

Compared Methods
1. Traditional HTC Methods. A major line of work for HTC is Support Vector Machines (SVM) and their hierarchical variants. Specifically, SVM performs standard multi-label classification using the one-vs-the-rest (OvR) strategy. Leaf-SVM treats each leaf node as a label and adds the ancestors of predicted leaf nodes. Variants such as HSVM (Tsochantaridis et al., 2005), Top-Down SVM (TD-SVM) (Liu et al., 2005), and Hierarchically Regularized SVM (HR-SVM) (Gopal and Yang, 2013) are also tested. Other state-of-the-art HTC methods that we compare with include Clus-HMC (Vens et al., 2008) and CSSA (Bi and Kwok, 2011).

2. Neural HTC Methods. There are not many neural methods that specifically target HTC. We mainly compare with two recent neural models: HR-DGCNN (Peng et al., 2018), which extends hierarchical regularization (Gopal and Yang, 2013) to Graph-CNN and compares favorably to flat models like RCNN (Lai et al., 2015) and XML-CNN (Liu et al., 2017), and HMCN (Wehrmann et al., 2018), which outperforms state-of-the-art HTC methods such as HMC-LMLP (Cerri et al., 2016). We also compare with the base models that we use for feature encoding. The main aim is to see how much gain each of them obtains when combined with HiLAP.

Implementation Details
For datasets without a held-out set, we randomly sample 10% from the training set as the validation set following Johnson and Zhang (2014) and Peng et al. (2018). We only use the first 256 tokens of each document for representation learning. All the models are trained using an Adam optimizer with initial learning rate 1e-3 and weight decay 1e-6. We use GloVe (Pennington et al., 2014) with size 50 as word embeddings for TextCNN (Kim, 2014) and HAN (Yang et al., 2016). We create a vocabulary of the most frequent 30,000 words in the training data and generate multi-hot vectors as the input of bow-CNN (Johnson and Zhang, 2014). For our framework, since the parameter updates are performed after T steps, we cache the object representation e_d and reuse it at each step for better efficiency. More details are provided in Appendix D for reproducibility.
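The vocabulary and multi-hot preprocessing can be sketched with the standard library alone. This is a simplified, document-level illustration under stated assumptions: the function names are hypothetical, tokenization is taken as given, and any region-level processing inside bow-CNN is not reproduced here.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size=30000):
    """Map the most frequent tokens in the training data to integer ids."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(max_size))}

def multi_hot(tokens, vocab, max_tokens=256):
    """Binary vector marking which vocabulary words occur in the first max_tokens tokens."""
    vec = [0] * len(vocab)
    for tok in tokens[:max_tokens]:
        idx = vocab.get(tok)
        if idx is not None:                 # out-of-vocabulary tokens are ignored
            vec[idx] = 1
    return vec

train_docs = [["good", "beer", "good"], ["good", "food"]]   # toy tokenized documents
vocab = build_vocab(train_docs, max_size=1)                  # tiny size for illustration
vector = multi_hot(["food", "good"], vocab)
print(vocab, vector)
```

Unlike a bag-of-words count vector, the multi-hot vector records only presence, so repeated words do not increase the corresponding entry.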

Performance Comparison
1. Comparison with State-of-the-art Methods. We compare the performance of HiLAP to state-of-the-art HTC methods and show the results in Tables 2 and 3. On RCV1, HiLAP (HAN) achieves similar performance to HR-DGCNN even though the corresponding base model HAN is originally worse than HR-DGCNN. HiLAP (TextCNN) outperforms most baselines in Macro-F1 and performs similarly to TD-SVM even though it uses one global classifier while TD-SVM [...]
2. Comparison using Same Base Models. We compare the performance of different frameworks that support the use of exactly the same base models and summarize the results in Fig. 4 (for HMCN, we replace its static features with the same base model for fair comparison). Due to the extreme imbalance of the data, directly applying a flat model may suffer from low Macro-F1, i.e., the predictions of flat models are inevitably biased to the most popular labels. HMCN also has the same issue, resulting in Macro-F1 lower than 10 when combined with some base models. In contrast, HiLAP outperforms the baselines significantly in Macro-F1, which implies that our method is better [...]
Footnote 4: The results are not comparable with Johnson and Zhang (2014) due to implementation details and the fact that they tune the threshold for each label using k-fold cross-validation. See Appendix B for more discussions.
Table 4: Performance comparison on Functional Catalogue and Gene Ontology. We compare with state-of-the-art hierarchical classification methods that take exactly the same raw features as input (i.e., we exclude models designed specifically for text objects).
We compare with CSSA (Bi and Kwok, 2011), Clus-HMC (Vens et al., 2008), and HMCN (Wehrmann et al., 2018) on the FunCat and GO datasets, as they represent the state-of-the-art on these datasets. An SVM classifier is also evaluated to better understand the difficulties of the task. We use the same raw features as the input of all the methods for apples-to-apples comparison and list the results in Table 4. Note that the metric area under the average precision-recall curve (AUPRC) (Wehrmann et al., 2018) is not applicable because HiLAP does not use a flat probability distribution of all the labels. As one can see, HiLAP outperforms all the baselines on both datasets by a large margin. In particular, we observe significant improvement on Macro-F1 over the best baseline (47.9% and 53.9%, respectively), which shows that our method is especially better at classifying sparse labels than previous approaches.

1. Ablation Study on Different Framework Components. We show the ablation analysis of HiLAP in Table 5. Using Flat-Only degenerates HiLAP to the flat baseline. By comparing the results of Flat-Only and HiLAP-SL-NoFlat (a variant of HiLAP-SL without flat loss), we further confirm that flat approaches are likely to neglect sparse labels, which results in low Macro-F1. Local approaches (HiLAP-SL-NoFlat), on the other hand, are slightly worse in terms of Micro-F1 and EBF but significantly better on Macro-F1.
By combining flat and local information, HiLAP-SL achieves performance close to Flat-Only on Micro-F1 and EBF, and even higher Macro-F1 than HiLAP-SL-NoFlat. HiLAP-NoSL is initialized by the pre-trained HiLAP-SL model without mixing the supervised loss during its training. We can see that using the reinforced loss alone still improves the performance on all the three metrics. After removing the flat loss during the training of HiLAP, HiLAP-NoFlat shows slightly lower performance than the full HiLAP model, indicating that the flat component serves as a regularization of the base model and is beneficial to the overall performance.
2. Performance Study on Label Granularity and Popularity. We analyze the sources of performance gains by dividing the labels based on their levels and number of supporting examples. Fig. 5 shows the absolute Macro-F1 differences between several methods and the base model. We observe similar results for other setups and omit them for a clearer view. As depicted in Fig. 5, HiLAP and HiLAP-SL are especially beneficial to unpopular labels (P3) at the bottom levels (L3).
3. Analysis of Label Inconsistency. Label inconsistencies often happen in approaches that perform flat inference, but they are not measured by standard evaluation metrics like F1 scores. To provide a picture of how severe the issue is, we further conduct experiments to check the percentage of objects that are predicted with inconsistent labels (Table 6). We found, for example, that 29,186/781,265 (3.74%) predictions of TextCNN are inconsistent on RCV1. In contrast, HiLAP ensures 0% label inconsistency without the need for post-processing, because its predictions are always valid sub-trees of the label hierarchy (refer to Fig. 2).
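The inconsistency check can be expressed compactly. The sketch below assumes a tree-shaped hierarchy in which every label has exactly one parent (a DAG would need a set of parents per label and an explicit choice of what counts as consistent); the function names, the `"Root"` label, and the toy hierarchy are hypothetical.

```python
def is_consistent(predicted, parent, root="Root"):
    """True if the parent of every predicted label is the root or also predicted."""
    predicted = set(predicted)
    return all(parent[label] == root or parent[label] in predicted
               for label in predicted)

def inconsistency_rate(predictions, parent, root="Root"):
    """Fraction of objects whose predicted label set is inconsistent."""
    bad = sum(not is_consistent(p, parent, root) for p in predictions)
    return bad / len(predictions)

parent = {"Bars": "Root", "Beer Bars": "Bars", "Restaurants": "Root"}
predictions = [{"Beer Bars"}, {"Bars", "Beer Bars"}]   # first one misses "Bars"
rate = inconsistency_rate(predictions, parent)
print(rate)
```

Checking only the direct parent is sufficient: if every predicted label has its parent predicted, then by induction all of its ancestors are predicted as well.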

Related Work
Hierarchical classification approaches have been developed for many applications. For text classification, both traditional methods (Lewis et al., 2004; Gopal and Yang, 2013) and neural methods (Johnson and Zhang, 2014; Peng et al., 2018) have been proposed to classify, e.g., the topics of newswire and web content (Sun and Lim, 2001) or categories of laws and patents (Bi and Kwok, 2015; Cai and Hofmann, 2004; Rousu et al., 2005). Many previous studies (Liu et al., 2005; Sun and Lim, 2001) train a set of local classifiers and make predictions in a top-down manner. In particular, Bi and Kwok (2015) develop Bayes-optimal predictions that minimize the global risks, but their model is still locally trained. Such local approaches are not popular among recent neural-based HTC models (Johnson and Zhang, 2014; Peng et al., 2018) since it is usually infeasible to train many neural classifiers locally. Global methods, on the other hand, train only one classifier. Although global methods are desirable, they are relatively less studied due to the complexity of the problem. Existing global models are generally modified based on specific flat models. Hierarchical-SVM (Cai and Hofmann, 2004; Qiu et al., 2009) generalizes Support Vector Machine (SVM) learning based on discriminant functions that are structured in a way that mirrors the label hierarchy. One limitation is that Hierarchical-SVM only supports balanced trees (all possible labels are presumed to be at the same level in their experiments). Hierarchical naive Bayes (Silla Jr and Freitas, 2009) modifies naive Bayes by updating the weights of a label's ancestors whenever that label's weights are updated. There are other global methods that are based on association rules (Wang et al., 2001), C4.5 (Clare and King, 2003), kernel machines (Rousu et al., 2005), and decision trees (Vens et al., 2008).
Constraints such as the regularization that enforces the parameters of one node and its parent to be similar (Gopal and Yang, 2013) are also proposed to leverage the label hierarchy while maintaining scalability. However, their use of the label hierarchies is somewhat limited compared with HiLAP.

Conclusions
We proposed an end-to-end reinforcement learning approach to hierarchical text classification (HTC) where objects are labeled by placing them at the proper positions in the label hierarchy. The proposed framework makes consistent and inter-dependent predictions, in which any neural-based representation learning model can be used as a base model and a label assignment policy is learned to determine where to place the objects and when to stop the assignment process. Experiments on five public datasets and four base models showed that our approach outperforms state-of-the-art HTC methods significantly. For future work, we will explore the effectiveness of the proposed framework on other base models and forms of data (e.g., images). We will also introduce more losses covering other aspects in the objective function to further improve the performance of our framework.