Efficient Strategies for Hierarchical Text Classification: External Knowledge and Auxiliary Tasks

In hierarchical text classification, we perform a sequence of inference steps to predict the category of a document from top to bottom of a given class taxonomy. Most studies have focused on developing novel neural network architectures to deal with the hierarchical structure, but we prefer to look for efficient ways to strengthen a baseline model. We first define the task as a sequence-to-sequence problem. Afterwards, we propose an auxiliary synthetic task of bottom-up classification. Then, from external dictionaries, we retrieve textual definitions for the classes of all the hierarchy's layers, and map them into the word vector space. We use the class-definition embeddings as an additional input to condition the prediction of the next layer and in an adapted beam search. Whereas the modified search did not provide large gains, the combination of the auxiliary task and the additional input of class definitions significantly enhances the classification accuracy. With our efficient approaches, we outperform previous studies, using a drastically reduced number of parameters, on two well-known English datasets.


Introduction
Hierarchical text classification (HTC) aims to categorise a textual description within a set of labels that are organized in a structured class hierarchy (Silla and Freitas, 2011). The task is perceived as a more challenging problem than flat text classification, since we need to consider the relationships of the nodes from different levels in the class taxonomy (Liu et al., 2019).
Both flat text classification and HTC have been tackled using traditional machine learning classifiers (Liu et al., 2005; Kim et al., 2006) or deep neural networks (Peng et al., 2018; Conneau et al., 2017). Nevertheless, the majority of the latest approaches rely on models with a large number of parameters that require extended training time. In the flat-classification scenario, some studies have addressed the problem of efficiency by proposing methods that do not focus on the model architecture, but on external ways of improving the results (Joulin et al., 2017; Howard and Ruder, 2018). However, such strategies are still underdeveloped for HTC, and the most recent and effective methods remain computationally expensive (Yang et al., 2019; Banerjee et al., 2019).
The described context opens our research question: How can we improve HTC at a lower computational cost? Therefore, our focus and main contributions are:
• A robust model for HTC, with few parameters and short training time, that follows the paradigm of sequence-to-sequence learning.
• The practical application of an auxiliary (and not expensive) task that strengthens the model capacity for prediction in a bottom-up scheme.
• An exploration of strategies that take advantage of external information about textual definition of the classes. We encode the definitions in the word vector space and use them in: (1) each prediction step and (2) an adapted beam search.
2 Efficient strategies for hierarchical text classification

2.1 Sequence-to-sequence approach

Hierarchical classification resembles multi-label classification in which there are hierarchical relationships between labels, i.e., labels at lower levels are conditioned by labels at higher levels in the hierarchy. For that reason, we differ from previous work and address the task as a sequence-to-sequence problem, where the encoder receives a textual description and the decoder generates a class at each step (from the highest to the lowest layer in the hierarchy). Our baseline model is therefore a sequence-to-sequence neural network (Sutskever et al., 2014) composed of:

Embedding layer: Transforms a word into a vector w_i, where i ∈ {1,...,N} and N is the number of tokens in the input document. We use pre-trained word embeddings from Common Crawl (Grave et al., 2018) for the weights of this layer, and we do not fine-tune them during training.
Encoder: A bidirectional GRU (Cho et al., 2014) that takes as input a sequence of word vectors and computes a hidden vector h_i for each time step i of the sequence.
Attention layer: We employ the attention variant of Bahdanau et al. (2015) and generate a context vector a_i for each encoder output h_i.
Decoder: Uses the context vector a_i and hidden vector h_i to predict the class c_{l_j,k} of the hierarchy, where j ∈ {1,...,M}, M is the number of levels in the class taxonomy, l_j represents the j-th layer of the hierarchy, and c_{l_j,k} is the k-th class in level l_j. Similar to the encoder, we use a bidirectional GRU.
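To make the attention step concrete, the following is a minimal NumPy sketch of Bahdanau-style additive attention. The weight matrices, their shapes, and the function signature are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

def bahdanau_attention(s, H, W_s, W_h, v):
    """Additive (Bahdanau) attention: score_i = v^T tanh(W_s s + W_h h_i).
    s: decoder hidden state, shape (d,); H: encoder outputs, shape (N, d).
    Returns the context vector a (weighted sum of encoder outputs)
    and the attention weights over the N time steps."""
    scores = np.tanh(H @ W_h.T + s @ W_s.T) @ v   # (N,) unnormalized scores
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    context = weights @ H                         # (d,) context vector a_i
    return context, weights
```

In the full model, this context vector is fed to the decoder GRU at every prediction step alongside its hidden state.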

Auxiliary task
For an input sequence of words, the model predicts a sequence of classes. Given the nature of recurrent neural networks, iterating over a sequence stores historical information, so the computation of the last output can take all previous inputs into consideration.
Previous work in HTC (Kowsari et al., 2017;Sinha et al., 2018) usually starts by predicting the most general category (Parent node) and continues to a more specific class (Child nodes) each time. However, by following the common approach, the prediction of the most specific classes will have a smaller impact than the more general ones when the error propagates. In this way, it could be harder to learn the relationship of the last target class with the upper ones.
Inspired by reversing the order of words in the input sequence (Sutskever et al., 2014), we propose an auxiliary synthetic task that changes the order of the target class levels in the output sequence. In other words, we go upward from the child nodes to the parent. With the proposed task, the parent and child nodes will have a similar impact on the error propagation, and the network could learn more robust representations.
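The target-side transformation is simple to state in code. A minimal sketch, with hypothetical WOS-style label paths as input:

```python
def make_auxiliary_targets(label_paths):
    """Build bottom-up auxiliary targets by reversing each top-down
    class path (parent -> ... -> child becomes child -> ... -> parent)."""
    return [list(reversed(path)) for path in label_paths]

# Hypothetical two-level paths: [parent area, child subarea]
paths = [["CS", "Machine Learning"], ["Medical", "Cancer"]]
aux = make_auxiliary_targets(paths)
# aux == [["Machine Learning", "CS"], ["Cancer", "Medical"]]
```

Because the auxiliary targets are a pure permutation of the main targets, the task adds no new labels or parameters to the model.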

Class-definition embeddings for external knowledge integration
We analyze the potential of using textual definitions of classes for external knowledge integration. For each class c_{l_j,k} in any level l_j of the hierarchy, we obtain a raw text definition from an external dictionary and compute a vector representation cv, which we henceforth call the class-definition vector (CDV). We then use the CDV representations in the two following strategies.
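A minimal sketch of computing a CDV from a dictionary definition, assuming the pre-trained embeddings are available as a plain token-to-vector mapping (tokenization and dictionary lookup are simplified here):

```python
import numpy as np

def class_definition_vector(definition, embeddings, dim=300):
    """Average the pre-trained word vectors of the definition's tokens
    to obtain the class-definition vector (CDV); tokens missing from
    the embedding table are skipped."""
    vecs = [embeddings[t] for t in definition.lower().split()
            if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

The CDVs live in the same space as the input word embeddings, which is what later allows comparing them against a document representation.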

Parent node conditioning (PNC)
For a given document D, we classify it among the target classes C = (c_{l_1,k},...,c_{l_M,k}), where M is the number of layers in the taxonomy. In our approach, we predict the highest-level class c_{l_1,k} and then use its CDV representation cv_{l_1,k} as an additional input (alongside the encoder outputs) to the attention layer for the prediction of the next-level class c_{l_2,k}. We continue this process for all layers of the class hierarchy.
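One plausible realization of feeding the parent CDV to the attention layer is to tile it across time steps and concatenate it with the encoder outputs; this is a sketch under that assumption, not necessarily how the paper wires it:

```python
import numpy as np

def condition_on_parent(encoder_outputs, parent_cdv):
    """Parent node conditioning (PNC) sketch: repeat the parent
    class-definition vector for every encoder time step and append it
    to each encoder output, so the attention layer sees the predicted
    parent class when scoring the next-level candidates."""
    n = encoder_outputs.shape[0]
    tiled = np.tile(parent_cdv, (n, 1))            # (n, cdv_dim)
    return np.concatenate([encoder_outputs, tiled], axis=1)
```

At the first decoding step there is no parent yet, so a zero vector (or a learned placeholder) would stand in for the CDV.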

Adapted beam search
Beam search is a search strategy commonly used in neural machine translation (Freitag and Al-Onaizan, 2017), but the algorithm can be applied to any problem that involves step-by-step decoding. We assess the impact of applying beam search to HTC, and introduce an adapted version that takes advantage of the computed CDV representations. In each step of the decoding phase, we predict a class that belongs to the corresponding level of the class hierarchy. Given a time step i, beam search expands all the k (beam size) possible class candidates and sorts them by their logarithmic probability. In addition to the original calculation, we compute the cosine distance between the CDV of a class candidate and the average vector of the word embeddings from the textual description z that we want to classify (the CD component in Equation 1):

score(c_{l_j,k}) = log P(c_{l_j,k}) + CD(cv_{l_j,k}, z̄)    (1)

where z̄ is the average word embedding of z. We add the new term to the logarithmic probability of each class candidate, re-order the candidates based on the new score, and preserve the top-k.
Our intuition behind the added component is similar to shallow fusion in the decoder of a neural machine translation system (Gulcehre et al., 2017). Thus, the class-definition representation might introduce a bias in the decoding and help to discriminate between classes with similar scores in the classification model.
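The re-ranking step can be sketched as follows. The exact form and weighting of the CD term in Equation 1 are not fully specified here, so this sketch assumes the cosine similarity is added (so that candidates whose definitions better match the document are promoted):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rescore_beam(candidates, cdvs, doc_vec, k):
    """Adapted beam step: add the CDV/document cosine term to each
    candidate's log-probability, re-sort, and keep the top k.
    candidates: list of (class_name, log_prob) pairs;
    cdvs: mapping from class name to its CDV;
    doc_vec: average word embedding of the input description z."""
    rescored = [(c, lp + cosine_sim(cdvs[c], doc_vec))
                for c, lp in candidates]
    rescored.sort(key=lambda x: x[1], reverse=True)
    return rescored[:k]
```

Because the term only re-orders candidates the model already proposed, it adds no trainable parameters and a negligible decoding cost.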

Experimental setup
Datasets. We test our model and proposed strategies in two well-known hierarchical text classification datasets previously used in the evaluation of state-of-the-art methods for English: Web of Science (WOS; Kowsari et al., 2017) and DBpedia (Sinha et al., 2018). The former includes parent classes of scientific areas such as Biochemistry or Psychology, whereas the latter considers more general topics like Sports Season, Event or Work. General information for both datasets is presented in Table 1.
Model, hyper-parameters and training. We use the AllenNLP framework (Gardner et al., 2018) to implement our methods. Our baseline consists of the model specified in §2.1. For all experiments, we use 300 units in the hidden layer, an embedding size of 300, and a batch size of 100. During training, we employ the Adam optimiser (Kingma and Ba, 2014) with default parameters (β1 = 0.9, β2 = 0.98, ε = 10^-9). We also use a learning rate of 0.001, which is divided by ten after four consecutive epochs without improvement on the validation split. Furthermore, we apply a dropout of 0.3 in the bidirectional GRU encoder-decoder, clip the gradient at 0.5, and train the model for 30 epochs. For evaluation, we select the model from the 30 epochs with the best accuracy on the validation set.
Settings for the proposed strategies.
• For learning with the auxiliary task, we interleave the loss function between the main prediction task and the auxiliary task (§2.2) every two epochs, with the same learning rate. We aim for both tasks to have equivalent relevance in the network training.
• To compute the class-definition vectors, we extract the textual definitions using the Oxford Dictionaries API. We vectorize each token of the definitions using pre-trained Common Crawl embeddings (the same as in the embedding layer) and average them.
• For the beam search experiments, we employ a beam size (k) of five, and assess both the original and adapted strategies. We note that the sequence-to-sequence baseline model uses a beam size of one.

Table 2 presents the average accuracy results of our experiments with each proposed method over the test set. For all cases, we maintain the same architecture and hyper-parameters in order to estimate the impact of the auxiliary task, parent node conditioning, and the beam search variants independently. Moreover, we examine the performance of the combination of our approaches.
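The two-epoch interleaving schedule for the auxiliary task can be sketched as a simple function of the epoch index; the exact scheduling mechanics in the actual training loop are an assumption here:

```python
def epoch_targets(epoch, path):
    """Alternate the training objective every two epochs:
    epochs 0-1 train the main top-down task, epochs 2-3 the
    bottom-up auxiliary task, epochs 4-5 top-down again, and so on."""
    top_down = (epoch // 2) % 2 == 0
    return list(path) if top_down else list(reversed(path))
```

Both phases share the model, optimiser, and learning rate; only the order of the target class sequence changes.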

Results and discussion
In the individual analysis, we observe that the parent node conditioning and the auxiliary task provide significant gains over the seq2seq baseline, which supports our initial hypothesis about the relevance of the auxiliary loss and the information of the parent class. Conversely, we note that the modified beam search strategy has the lowest gain of all the experiments on WOS, although it provides one of the best scores for DBpedia. One potential reason is the new term added for the top-k candidate selection (see Eq. 1), as it strongly depends on the quality of the sentence representation. The classes of WOS include scientific areas that are usually more complex to define than the categories of the DBpedia database.
We also notice that the accuracy increment is relatively higher for all experiments on the WOS corpus than on DBpedia. A primary reason might be the number of documents in each dataset, as DBpedia contains almost seven times as many documents as WOS. With a large number of training samples, the architecture is capable of learning how to discriminate correctly between classes from the original training data alone. However, in less-resourced scenarios, our proposed approaches with external knowledge integration could achieve a high positive impact.
As our strategies are orthogonal and focus on different parts of the model architecture, we proceed to combine them and assess their joint performance. In the case of WOS, we observe that every combination of strategies improves on its single counterparts, and the best accuracy is achieved by merging the auxiliary task and PNC, but with an original beam search of size five. Concerning DBpedia, most of the results are very close to each other, given the high accuracy already achieved by the seq2seq baseline. However, we note the relevance of combining the PNC strategy with the original or modified beam search to increase performance.
Finally, we compare our strategies to the best HTC models reported in previous studies (Kowsari et al., 2017; Sinha et al., 2018). We observe that the results of our methods are outstanding in terms of accuracy and number of parameters. Moreover, training each model takes around one hour (for the 30 epochs), and the proposed auxiliary task does not add any significant delay.

Related work
Most of the studies for flat text classification primarily focus on proposing a variety of novel neural architectures (Conneau et al., 2017;Zhang et al., 2015). Other approaches involve a transfer learning step to take advantage of unlabelled data. McCann et al. (2017) used the encoder unit of a neural machine translation model to provide context for other natural language processing models, while Howard and Ruder (2018) pre-trained a language model on a general-domain monolingual corpus and then fine-tuned it for text classification tasks.
In HTC, there are local or global strategies (Silla and Freitas, 2011). The former exploits local information per layer of the taxonomy, whereas the latter addresses the task with a single model for all the classes and levels. Neural models show excellent performance for both approaches (Kowsari et al., 2017;Sinha et al., 2018). Furthermore, other studies focus on using transfer learning for introducing dependencies between parent and child categories (Banerjee et al., 2019) and deep reinforcement learning to consider hierarchy information during inference (Mao et al., 2019).
The incorporation of external information in neural models has offered potential in different tasks, such as flat text classification. By using categorical metadata of the target classes (Kim et al., 2019) and linguistic features at word level (Margatina et al., 2019), previous studies have notably improved flat text classification at a moderate computational cost. Besides, Liu et al. (2016) outperform several state-of-the-art classification baselines by employing multitask learning.
To our knowledge, the latter strategies have not been explicitly exploited for HTC. For this reason, our study focuses on the exploration and evaluation of methods that enable hierarchical classifiers to achieve an overall accuracy improvement with as little added complexity as possible.

Conclusion
We presented a bag of tricks to efficiently improve hierarchical text classification by adding an auxiliary task of reverse hierarchy prediction and integrating external knowledge (vectorized textual definitions of classes in a parent node conditioning scheme and in the beam search). Our proposed methods established new state-of-the-art results with class hierarchies on the WOS and DBpedia datasets in English. Finally, we also open a path for studying the integration of knowledge into the decoding phase, which can benefit other tasks such as neural machine translation.