NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

In this paper, we introduce NeuralClassifier, a toolkit for neural hierarchical multi-label text classification. NeuralClassifier is designed for quick implementation of neural models for the hierarchical multi-label classification task, which is more challenging than flat classification and common in real-world scenarios. A salient feature is that NeuralClassifier currently provides a variety of text encoders, including FastText, TextCNN, TextRNN, RCNN, VDCNN, DPCNN, DRNN, AttentiveConvNet and Transformer encoder. It also supports other text classification scenarios, including binary-class and multi-class classification. The toolkit is built on PyTorch and its core operations are computed in batches, which makes it efficient under GPU acceleration. Experiments show that models built with our toolkit achieve performance comparable with results reported in the literature.


Introduction
Text classification is an important task in Natural Language Processing with many applications, such as web search, information retrieval, ranking and document classification (Deerwester et al., 1990; Pang et al., 2008). As a result of the great success of deep neural networks, a series of classification models based on neural networks that achieve very good performance in practice have been proposed (Kim, 2014; Lai et al., 2015; Joulin et al., 2016; Conneau et al., 2016; Liu et al., 2016; Johnson and Zhang, 2017; Vaswani et al., 2017; Yin and Schütze, 2017; Wang, 2018; Qiao et al., 2018; Guo et al., 2019).
Hierarchical Multi-label Classification (HMC) is a branch of the classification problem, and a more challenging one in real-world scenarios. Unlike traditional flat, single-label text classification, it considers the interrelationships among labels and assigns each text document multiple labels, which are organized into a hierarchical structure in the form of a tree or a DAG (Directed Acyclic Graph). Regularizing the deep architecture with the dependencies among labels, as adopted by existing solutions (Gopal and Yang, 2013; Peng et al., 2018), is a more natural way to solve the hierarchical multi-label text classification problem, especially for large-scale datasets. HMC has a wide variety of real-world applications, such as question answering (Qu et al., 2012), online advertising (Agrawal et al., 2013), and scientific literature organization (Peng et al., 2016).
There exist several open-source statistical hierarchical or multi-label text classification toolkits, such as scikit-multilearn and sklearn-hierarchicalclassification, which provide users with various hierarchical or multi-label classification modules based on scikit-learn's interfaces and conventions. On the other hand, there is limited choice among neural hierarchical multi-label text classification toolkits. Although many researchers have released their code along with their hierarchical or multi-label text classification papers (Kowsari et al., 2017; Peng et al., 2018), the implementations mostly focus on specific model structures and specific tasks, which greatly limits their extension to other similar tasks.
In this paper, we introduce NeuralClassifier, an open-source neural hierarchical multi-label text classification toolkit based on PyTorch. It is designed for solving the hierarchical multi-label text classification problem with effective and efficient neural models. It provides a variety of models and features, which users can select and combine through a simple configuration file. We adopt a layer-wise implementation, which includes an input layer, an embedding layer, an encoder layer and an output layer. To the best of our knowledge, our work is the first neural hierarchical multi-label text classification toolkit with rich models. A summary comparison with existing toolkits is given in Table 1. NeuralClassifier is:
• Rich in models and features: An important feature of our work is that, compared with existing toolkits, NeuralClassifier reimplements a large number of state-of-the-art text encoders, including FastText (Joulin et al., 2016), TextCNN (Kim, 2014), TextRNN (Liu et al., 2016), RCNN (Lai et al., 2015), VDCNN (Conneau et al., 2016), DPCNN (Johnson and Zhang, 2017), AttentiveConvNet (Yin and Schütze, 2017), DRNN (Wang, 2018), Transformer encoder (Vaswani et al., 2017) and Star-Transformer encoder (Guo et al., 2019). NeuralClassifier also includes a variety of useful features or widgets, e.g., word-based and char-based input, optimizers, loss functions, embedding methods and attention mechanisms. All of these can be set through a configuration file; Figure 1 shows a segment of one. Users can configure different text encoders and features through the configuration file, and can easily modify the source code for more advanced development.
• Suitable for almost all text classification tasks: NeuralClassifier is designed for hierarchical and multi-label classification, which naturally also covers binary-class and multi-class classification, so it can be considered a universal toolkit for text classification tasks. In particular, for the hierarchical multi-label classification task, the taxonomy can be organized in the form of a tree or a DAG, and instances are multi-labeled during training and testing. The toolkit also provides a complete evaluation mechanism. An illustration with results is shown in Figure 2. Users can choose their task type through the configuration file alone, without writing any code.
• Effective and efficient: We conduct experiments with a variety of models and features provided by NeuralClassifier. The experiments show that models built with our toolkit achieve performance comparable with results reported in the literature. Furthermore, NeuralClassifier is implemented with batched computation that can be accelerated on GPU. Our experiments demonstrate that NeuralClassifier is an effective and efficient toolkit.
The rest of this paper is organized as follows: Section 2 describes the architecture of NeuralClassifier in detail. The experimental evaluations and results are discussed in Section 3. Finally, Section 4 concludes this paper.

NeuralClassifier Architecture
The framework of NeuralClassifier is shown in Figure 3. It is composed of four layers: input layer, embedding layer, encoder layer and output layer. At the first layer (input layer), the input word sequence is organized and processed as words, characters, or their corresponding n-grams. For FastText, custom features such as keywords and topics are also supported. The embedding of the input data is generated at the embedding layer and subsequently encoded at the encoder layer. On top of the system, different loss functions are constructed in the output layer to serve different real-world tasks, i.e., binary-class, multi-class, multi-label and hierarchical multi-label classification. The user can deploy the toolkit through a configuration file without writing any code. A salient feature is that users can easily reuse and integrate any widget in NeuralClassifier to construct their own model structures.
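As an illustration of the input layer's word, character and n-gram views, the following is a minimal sketch in plain Python. The function names, the dictionary keys and the whitespace tokenization are assumptions made for this example and are not taken from the toolkit's source code.

```python
from typing import Dict, List


def ngrams(units: List[str], n: int, sep: str = " ") -> List[str]:
    """Return all contiguous n-grams over a sequence of units (words or chars)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return [sep.join(units[i:i + n]) for i in range(len(units) - n + 1)]


def build_input_features(text: str, max_n: int = 2) -> Dict[str, List[str]]:
    """Split whitespace-tokenized text into word, char and n-gram views.

    Hypothetical helper: key names are illustrative only.
    """
    words = text.split()
    # Simplification: characters are concatenated across words, so char
    # n-grams may span a word boundary.
    chars = [c for w in words for c in w]
    return {
        "token": words,
        "char": chars,
        "token_ngram": [g for n in range(2, max_n + 1) for g in ngrams(words, n)],
        "char_ngram": [g for n in range(2, max_n + 1) for g in ngrams(chars, n, sep="")],
    }


features = build_input_features("neural text classifier")
print(features["token_ngram"])
```

Each view could then be mapped to its own vocabulary and embedding table before being passed to an encoder.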
The following will describe the pertinent details of the four layers and the user interface.

Layer Units
• Input Layer. The input text sequence is processed at the input layer. A tokenized input sequence can be represented as words and characters, along with their n-grams. Custom feature inputs such as keywords and topics are also supported when the text encoder is FastText. All of the above can be flexibly configured by users. In addition, reading input data can be accelerated with multiple processes. See Figure 4 for an example of input data.
• Embedding Layer. Various embeddings are processed at this layer. Four embedding types can be chosen: random embedding, pre-trained embedding, region embedding and position embedding. An embedding can be initialized randomly or from a pre-trained embedding file. Region embedding (Qiao et al., 2018) is a supervised, enhanced word embedding method in which the representation of a word or char has two parts: the embedding of the word itself, and a weighting matrix that interacts with the local context, referred to as the local context unit. Position embedding (Vaswani et al., 2017) is an embedding method that encodes position information in the input sequence.
• Encoder Layer. We reimplement a large number of state-of-the-art text encoders at the encoder layer, including FastText, TextCNN, TextRNN, RCNN, VDCNN, DPCNN, DRNN, Transformer encoder, Star-Transformer encoder and AttentiveConvNet. Each encoder has its own hyperparameters that can be configured by users.
• Output Layer. This layer determines the specific classification task: binary-class, multi-class, multi-label or hierarchical multi-label. For the single-label (binary-class and multi-class) classification task, we provide three candidate loss functions, which are SoftmaxCrossEntropy, BCELoss and SoftmaxFocalLoss (Lin et al., 2017).
For the hierarchical multi-label classification task, we use BCELoss or SigmoidFocalLoss as the loss function for multi-label classification and add a recursive regularization (Gopal and Yang, 2013; Peng et al., 2018) for hierarchical classification. With this regularization framework, we can incorporate the hierarchical dependencies between the class labels into the regularization structure of the parameters, thereby encouraging classes that are nearby in the hierarchy to share similar model parameters.
In addition, such a regularization approach is more suitable for large-scale hierarchical multi-label classification tasks. Users can easily use the above functions through the configuration file.
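The combination of a multi-label loss with recursive regularization can be sketched as follows. This is a plain-Python illustration that assumes the penalty form of Gopal and Yang (2013), i.e. half the squared distance between each parent's and child's parameter vectors, applied to one weight vector per label. The function names, the toy taxonomy and the default weight `lam` are assumptions for this example, not the toolkit's implementation.

```python
import math
from typing import Dict, List


def bce_with_logits(logits: List[float], targets: List[int]) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def recursive_regularization(weights: Dict[str, List[float]],
                             children: Dict[str, List[str]]) -> float:
    """Sum over parent-child pairs of 0.5 * ||w_parent - w_child||^2."""
    penalty = 0.0
    for parent, kids in children.items():
        for child in kids:
            penalty += 0.5 * sum((p - c) ** 2
                                 for p, c in zip(weights[parent], weights[child]))
    return penalty


def hierarchical_loss(logits: List[float], targets: List[int],
                      weights: Dict[str, List[float]],
                      children: Dict[str, List[str]],
                      lam: float = 1e-6) -> float:
    """Flat multi-label loss plus the weighted hierarchical penalty."""
    return bce_with_logits(logits, targets) + lam * recursive_regularization(weights, children)


# Toy taxonomy (hypothetical): Sports -> {Football, Tennis}
weights = {"Sports": [1.0, 0.0], "Football": [1.0, 1.0], "Tennis": [0.0, 0.0]}
children = {"Sports": ["Football", "Tennis"]}
penalty = recursive_regularization(weights, children)
loss = hierarchical_loss([0.0, 2.0, -1.0], [1, 1, 0], weights, children, lam=0.1)
```

Setting `lam` to zero recovers the flat multi-label loss, which is one way to contrast hierarchical and flat training.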

User Interface
NeuralClassifier provides abundant configuration interfaces, including the common settings, input settings, training settings and network structure settings. Through the configuration file, users can construct most state-of-the-art neural hierarchical multi-label text classification models. JSON is used as the configuration file format.
The configuration file has four major parts:
• Common settings include the type of classification task, i.e., single-label or multi-label and whether it is hierarchical (task_info), which running device to use (device), the specified model (model_name), directories of input and output data (data), how many subprocesses to use for data loading (num_worker), etc.
• Input settings include various configurations about input data, such as maximum input sequence length (max_token_len), minimum input token count (min_token_count), dictionary size (max_token_dict_size), pre-trained embeddings of input data (token_pretrained_file), etc.
• Training settings include the batch size (batch_size), type of loss function (loss_type), optimizer (optimizer_type), learning rate (learning_rate), number of epochs (num_epochs), which GPUs to use (visible_device_list), etc.
• Network structure settings specify which text encoder to use, such as TextCNN, TextRNN, RCNN or Transformer. Each text encoder has corresponding hyperparameters that can be configured. Taking TextCNN as an example, users can configure the sizes and number of convolution kernels and the value of k in top-k max pooling (kernel_sizes, num_kernels, top_k_max_pooling).
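To make the four groups of settings concrete, the sketch below builds a small configuration as a Python dictionary and serializes it to JSON. It assumes the setting names in parentheses above are written with underscores. The nesting, the grouping keys (`feature`, `train`), the sub-keys of `task_info` and `data`, and all values are illustrative assumptions rather than the toolkit's actual schema.

```python
import json

# Illustrative values only; nesting and sub-keys are assumptions.
config = {
    # Common settings
    "task_info": {"label_type": "multi_label", "hierarchical": True},  # sub-keys assumed
    "device": "cuda",
    "model_name": "TextCNN",
    "data": {"train_file": "data/train.json", "num_worker": 4},  # hypothetical path and key
    # Input settings ("feature" as a grouping key is an assumption)
    "feature": {
        "max_token_len": 256,
        "min_token_count": 2,
        "max_token_dict_size": 1000000,
        "token_pretrained_file": "",
    },
    # Training settings ("train" as a grouping key is an assumption)
    "train": {
        "batch_size": 64,
        "loss_type": "BCELoss",
        "optimizer_type": "Adam",
        "learning_rate": 0.008,
        "num_epochs": 5,
        "visible_device_list": "0",
    },
    # Network structure settings for the selected encoder
    "TextCNN": {"kernel_sizes": [2, 3, 4], "num_kernels": 100, "top_k_max_pooling": 1},
}

# Serialize to JSON text and read it back, as a loader would.
config_text = json.dumps(config, indent=2)
loaded = json.loads(config_text)

# Look up the hyperparameters of whichever encoder model_name selects.
encoder_settings = loaded[loaded["model_name"]]
print(encoder_settings)
```

Keeping the encoder hyperparameters under a key named after the model means that switching encoders only requires changing `model_name`.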

Extension
Users can write their own custom modules for any of these layers, and self-defined modules can be integrated into the toolkit easily. For example, a user who wants to implement a new classifier model only needs to implement the part at the encoder layer. All the other network structures can be used and controlled through the configuration file.
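One common way to support this kind of extension is a name-to-class registry that the configuration file indexes into. The sketch below shows that pattern in plain Python. The `Encoder` base class, the `register_encoder` decorator, `build_encoder` and the `MeanPool` encoder are all hypothetical names introduced for illustration, and the toolkit's actual extension mechanism may differ.

```python
from typing import Callable, Dict, List, Type


class Encoder:
    """Hypothetical base class: maps a sequence of embeddings to one vector."""

    def encode(self, embedded: List[List[float]]) -> List[float]:
        raise NotImplementedError


# Registry from configuration name to encoder class (an assumed design).
ENCODERS: Dict[str, Type[Encoder]] = {}


def register_encoder(name: str) -> Callable[[Type[Encoder]], Type[Encoder]]:
    """Class decorator that makes an encoder selectable by name."""
    def wrap(cls: Type[Encoder]) -> Type[Encoder]:
        ENCODERS[name] = cls
        return cls
    return wrap


@register_encoder("MeanPool")
class MeanPoolEncoder(Encoder):
    """A user-defined encoder: average the token embeddings per dimension."""

    def encode(self, embedded: List[List[float]]) -> List[float]:
        dim = len(embedded[0])
        return [sum(vec[d] for vec in embedded) / len(embedded) for d in range(dim)]


def build_encoder(config: dict) -> Encoder:
    """Instantiate the encoder named by the configuration's model_name."""
    return ENCODERS[config["model_name"]]()


encoder = build_encoder({"model_name": "MeanPool"})
doc_vector = encoder.encode([[1.0, 3.0], [3.0, 5.0]])
```

With this design, the input, embedding and output layers stay unchanged and only the new encoder class has to be written.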

Evaluation
In this section, we conduct several experiments to evaluate the performance of NeuralClassifier using datasets from two public benchmarks, namely RCV1 (Lewis et al., 2004) and Yelp. The experiments consist of three parts: (1) results of using rich models and features in Section 3.1; (2) the influence of hierarchical information in Section 3.2; (3) speed with batch size in Section 3.3.

Results of using rich models and features
We use the various models and features provided by our toolkit to illustrate the performance of NeuralClassifier on the hierarchical multi-label text classification problem. Concretely, we select the best model through coarse-grained experiments on each of the two benchmarks and fix it, and then fine-tune the features and hyperparameters, such as model structures, input representations, activation functions, optimizers and learning rate. The best-performing models are as follows: (1) RCNN with a two-layer Bi-GRU and a one-layer CNN for the RCV1 dataset (input = word, optimizer = Adam, learning rate = 0.008); (2) TextRNN with a one-layer Bi-GRU for the Yelp dataset (input = word, optimizer = Adam, learning rate = 0.008). Table 2 shows the results of the best models on the two benchmarks. The best models achieve results comparable with the state-of-the-art HMC models. The results show the effectiveness of our implementation and the usability of a variety of models and features. Table 3 shows the performance of different text encoders. In particular, different combinations of these strategies can guide model setup for better performance in real applications.

Table 2:
Models                           RCV1 Micro-F1   Yelp Micro-F1
HR-DGCNN (Peng et al., 2018)     0.7610          -
HMCN (Wehrmann et al., 2018)     0.8080          0.6640
Our best models                  0.8099          0.6704
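For reference, Micro-F1 for multi-label predictions pools true positives, false positives and false negatives over all documents and labels before computing F1. The following is a minimal plain-Python sketch of that standard definition. The function name and the toy label sets are assumptions for this example, and it is not the toolkit's evaluation code.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-document gold and predicted label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with hypothetical labels.
gold = [{"A", "B"}, {"C"}]
pred = [{"A"}, {"C", "D"}]
score = micro_f1(gold, pred)  # tp = 2, fp = 1, fn = 1
```

Because the counts are pooled, frequent labels contribute more to Micro-F1 than rare ones.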

Influence of hierarchical information
Hierarchical classification problems can also be solved by flat methods, which treat hierarchical classification as flat classification and ignore the hierarchical relationships between labels. As mentioned before, our toolkit is configurable, so we can easily set different loss functions through the configuration. In this section, we discuss the influence of hierarchical information. Table 4 shows the results of using the HMC loss function (Hierarchical) and the traditional multi-label loss function (Flat). As can be seen from the results, hierarchical information can greatly improve performance. This also demonstrates the effectiveness of our implementation.

Speed with Batch Size
As NeuralClassifier is implemented with batched computation, it can be greatly accelerated by parallel computing on GPU. We test the speed of the training process on RCV1 using an NVIDIA Tesla P40 GPU. As shown in Figure 5, training can be significantly accelerated by using a large batch size, demonstrating the efficiency of our implementation.

Conclusion
This paper presents NeuralClassifier, an open-source neural hierarchical multi-label text classification toolkit. NeuralClassifier provides a large variety of text encoders and features. Users can easily design models for different text classification tasks through the configuration file. We conduct a series of experiments, and the results show that models built on NeuralClassifier can achieve state-of-the-art results with an efficient running speed.