HFT-CNN: Learning Hierarchical Category Structure for Multi-label Short Text Categorization

We focus on the multi-label categorization task for short texts and explore the use of a hierarchical structure (HS) of categories. In contrast to existing work that uses a non-hierarchical flat model, our method leverages the hierarchical relations between the pre-defined categories to tackle the data sparsity problem: the lower the HS level, the worse the categorization performance, because the number of training data per category in a lower level is much smaller than that in an upper level. We propose an approach that effectively utilizes the data in the upper levels to contribute to categorization in the lower levels by applying a Convolutional Neural Network (CNN) with a fine-tuning technique. Results on two benchmark datasets show that the proposed method, Hierarchical Fine-Tuning based CNN (HFT-CNN), is competitive with state-of-the-art CNN-based methods.


Introduction
Short text categorization is widely studied since the recent explosive growth of online social networking applications (Song et al., 2014).
In contrast with longer documents, short texts are less topic-focused.
Major attempts to tackle the problem expand short texts with knowledge extracted from textual corpora, machine-readable dictionaries, and thesauri (Phan et al., 2008; Wang et al., 2008; Chen et al., 2011; Wu et al., 2012). However, because of the domain-independent nature of dictionaries and thesauri, it is often the case that the data distribution of the external knowledge differs from that of the test data collected from a specific domain, which deteriorates overall categorization performance. A methodology that maximizes the impact of pre-defined domains/categories is needed to improve categorization performance.
More recently, many authors have attempted to apply deep learning techniques to text categorization, including CNNs (Wang et al., 2015; Zhang and Wallace, 2015; Wang et al., 2017), attention-based CNNs (Yang et al., 2016), bag-of-words based CNNs (Johnson and Zhang, 2015a), and combinations of CNNs and recurrent neural networks (Lee and Dernoncourt, 2016; Zhang et al., 2016). Most of them demonstrated that neural network models are powerful for learning features from texts, but they focused on single-label problems or problems with only a few labels. Several efforts have been made toward multi-label classification (Johnson and Zhang, 2015b; Liu et al., 2017). Liu et al. explored a family of new CNN models tailored for extreme multi-label classification (Liu et al., 2017). They used a dynamic max pooling scheme, a binary cross-entropy loss, and a hidden bottleneck layer to improve overall performance. Results on six benchmark datasets with label-set sizes of up to 670K showed that their method attained the best or second-best performance in comparison with seven state-of-the-art methods, including FastText (Joulin et al., 2017) and a bag-of-words based CNN (Johnson and Zhang, 2015a). However, all of these attempts aimed at utilizing a large volume of data.
We address the problem of multi-label short text categorization and explore the use of a HS of categories. The lower levels of categories are fine-grained compared to the upper levels. Moreover, it is often the case that the amount of training data in a lower level is much smaller than that in an upper level, which deteriorates overall categorization performance. We propose an approach that effectively utilizes the data in the upper levels to contribute to categorization in the lower levels by applying fine-tuning to a CNN, which can learn a HS of categories and incorporate the granularity of categories into categorization. We transfer the parameters of a CNN trained on the upper levels to the lower levels according to the HS, and then fine-tune the parameters.

Figure 1: HFT-CNN model

The main contributions of our work can be summarized as follows: (1) we propose a method that maximizes the impact of pre-defined categories to alleviate data sparsity in multi-label short texts; (2) we empirically examine fine-tuning with a CNN to learn a HS of categories defined by lexicographers; and (3) results on two benchmark datasets show that our method is competitive with state-of-the-art CNN-based methods, and that it is especially effective for categorizing short texts consisting of a few words with a large number of labels.
Hierarchical Fine-Tuning based CNN

CNN architecture
Similar to other CNNs (Johnson and Zhang, 2015a; Liu et al., 2017), our HFT-CNN model shown in Figure 1 is based on (Kim, 2014). Let x_i ∈ R^k be the k-dimensional word vector of the i-th word in a sentence, obtained by applying the skip-gram model provided in fastText (https://github.com/facebookresearch/fastText). A sentence of length n is represented as x_{1:n} = [x_1, x_2, ..., x_n] ∈ R^{nk}. A convolution filter w ∈ R^{hk} is applied to a window of h words to produce a new feature, c_i = f(w · x_{i:i+h-1} + b), where b ∈ R is a bias term and f is a non-linear activation function. We apply this convolution filter to each possible window in the sentence and obtain a feature map m ∈ R^{n-h+1}. As shown in Figure 1, we then apply a max pooling operation over the feature map and take the maximum value m̂ as the feature of this filter. We obtain multiple features by using multiple filters with varying window sizes. These features form a pooling layer and are passed to a fully connected layer, to which we apply dropout (Hinton et al., 2012); dropout randomly sets values in the layer to 0. Finally, we obtain a probability distribution over categories. The network is trained by stochastic gradient descent to minimize the binary cross-entropy (BCE) between the predicted and actual distributions.

Hierarchical structure learning
Our key idea is to use a fine-tuning technique in the CNN to tackle the data sparsity problem, especially in the lower levels of a HS. Following the HS, we transfer the parameters of a CNN trained in the upper levels to the lower levels, which are poorly trained because of the lack of data, and then fine-tune the parameters of the CNN for the lower levels (Figure 1). This approach can effectively utilize the data in the upper levels to contribute to categorization in the lower levels.
Fine-tuning is motivated by the observation that the early layers of a CNN contain generic features that are effective for many tasks, while later layers become progressively more specific to the details of the classes in the original dataset. The same motivation applies to a HS of categories: we first learn to distinguish among generic categories at the upper level of the hierarchy, then learn lower-level distinctions only within the appropriate top-level branch of the HS. We note that fine-tuning the last few layers is usually sufficient for transfer learning, as the last few layers hold the most specific features. However, a HS with deep levels requires the early layers to be fine-tuned as well, because the distance between the upper and lower levels of categories is significant. For this reason, we transfer the two layers shown in Figure 1, i.e., the word embedding layer and the convolutional layer, and use them as the initial parameters for learning the second level of the hierarchy. We repeat this procedure from the top level to the bottom level. Since a HS consists of many levels, we fine-tune between adjacent levels only, because they are more correlated with each other than distant levels.
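The transfer step above can be sketched as copying the shared lower-layer parameters from the parent-level model into the child-level model before training the child level. This is a schematic illustration with hypothetical parameter names and shapes, not the paper's Chainer code:

```python
import numpy as np

def init_child_from_parent(parent_params, child_params,
                           shared=("embedding", "conv")):
    """Hierarchical fine-tuning initialization: copy the shared lower
    layers (word embedding and convolution) from the model trained on
    the parent level; the child's output layer stays freshly initialized
    because the child level has its own, finer-grained category set."""
    for name in shared:
        child_params[name] = parent_params[name].copy()
    return child_params

rng = np.random.default_rng(1)
parent = {"embedding": rng.normal(size=(1000, 8)),   # vocab x embed dim
          "conv": rng.normal(size=(3 * 8,)),         # one window-3 filter
          "output": rng.normal(size=(8, 50))}        # 50 level-1 categories
child = {"embedding": rng.normal(size=(1000, 8)),
         "conv": rng.normal(size=(3 * 8,)),
         "output": rng.normal(size=(8, 200))}        # 200 level-2 categories
child = init_child_from_parent(parent, child)
# child is now trained on level-2 data, starting from the transferred weights
```

Repeating this from the top level downward yields one fine-tuned model per level of the hierarchy.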

Multi-label categorization
Each test instance is classified into categories with probabilities/scores by applying HFT-CNN. We then utilize a constraint of the HS to obtain the final results, which differs from the existing work on non-hierarchical flat models (Johnson and Zhang, 2015a; Liu et al., 2017). This is done by using one of two scoring functions: a Boolean Scoring Function (BSF) and a Multiplicative Scoring Function (MSF). Both functions set a threshold value, and categories whose scores exceed the threshold are considered for selection. The difference is that BSF has a constraint that a category can only be selected if its ancestor categories are selected. MSF has no such constraint: we extract all categories whose scores exceed the threshold and sort them in descending order as the system's assignments.
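A minimal sketch of the two selection schemes as described above, on a toy two-level hierarchy. The function names, the threshold value, and the `parent` map representation are our own illustration, not the paper's code:

```python
def depth(cat, parent):
    """Number of ancestors of cat in the hierarchy (root categories have 0)."""
    d = 0
    while parent[cat] is not None:
        cat, d = parent[cat], d + 1
    return d

def bsf_select(scores, parent, threshold=0.5):
    """Boolean Scoring Function: keep a category only if its score exceeds
    the threshold AND its ancestor categories were themselves selected."""
    selected = set()
    # iterate top-down so ancestors are decided before their descendants
    for cat in sorted(scores, key=lambda c: depth(c, parent)):
        if scores[cat] > threshold and (parent[cat] is None
                                        or parent[cat] in selected):
            selected.add(cat)
    return selected

def msf_select(scores, threshold=0.5):
    """Selection without the ancestor constraint: keep every category
    above the threshold, sorted in descending order of score."""
    return sorted((c for c, s in scores.items() if s > threshold),
                  key=lambda c: -scores[c])

# toy hierarchy: roots A, B; child A1 under A, child B1 under B
parent = {"A": None, "B": None, "A1": "A", "B1": "B"}
scores = {"A": 0.9, "B": 0.3, "A1": 0.8, "B1": 0.7}
```

Here BSF drops B1 even though its score clears the threshold, because its ancestor B was not selected, whereas the unconstrained selection keeps it.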

Data and HFT-CNN model setting
We selected two benchmark datasets having a HS from the extreme classification repository: RCV1 (Lewis et al., 2004) and Amazon670K (Leskovec and Krevl, 2015). All documents in RCV1 and item descriptions in Amazon670K were POS-tagged with TreeTagger (Schmid, 1995); we kept nouns, verbs, and adjectives, and then applied fastText. Each dataset has official training and test sets, which we used in the experiments. For RCV1, we used titles from the training and test sets; the maximum title length was 13 words. Each text in Amazon670K consists of a product name and its item description; we extracted the first 13 words of each item description and used them in the experiments. Table 1 presents statistics on the datasets. We divided the training data into two folds: 5% for tuning the parameters and the remainder for training the models. Our model settings are shown in Table 2.
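The preprocessing described above (POS filtering plus truncation to 13 words) can be sketched as follows. The `tagged` list is a hypothetical stand-in for TreeTagger output, and the coarse tag set is our simplification:

```python
def preprocess(tagged, keep_pos=("NN", "VB", "JJ"), max_words=13):
    """Keep only words whose POS tag is in keep_pos (nouns, verbs, and
    adjectives in the paper) and truncate to the first max_words words."""
    words = [w for w, pos in tagged if pos in keep_pos]
    return words[:max_words]

# hypothetical (word, POS) pairs as a tagger might produce them
tagged = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VB"),
          ("over", "IN"), ("a", "DT"), ("lazy", "JJ"), ("dog", "NN")]
tokens = preprocess(tagged)
```

The resulting token sequence is what gets embedded with fastText and fed to the CNN.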

Evaluation Metrics
We used the standard F1 measure. Furthermore, we evaluated our method with two rank-based metrics: precision at top k (P@k) and Normalized Discounted Cumulative Gain (NDCG@k), which are commonly used for comparing extreme multi-label classification methods (Liu et al., 2017). We calculated P@k and NDCG@k for each test instance and then averaged over all test instances.
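For reference, the two rank-based metrics for a single test instance can be computed as below (binary relevance, which is the usual convention in the extreme multi-label literature):

```python
import math

def precision_at_k(ranked, gold, k):
    """P@k: fraction of the top-k predicted categories that are correct."""
    return sum(1 for c in ranked[:k] if c in gold) / k

def ndcg_at_k(ranked, gold, k):
    """NDCG@k with binary relevance: DCG of the top-k prediction divided
    by the DCG of an ideal ranking of the gold categories."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, c in enumerate(ranked[:k]) if c in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["A", "B", "C", "D"]   # system ranking for one test instance
gold = {"A", "C"}               # its true categories
p3 = precision_at_k(ranked, gold, 3)   # 2 of the top 3 are correct
n3 = ndcg_at_k(ranked, gold, 3)
```

The reported scores are these per-instance values averaged over the test set.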

Basic results
To examine the effect of fine-tuning, we compared HFT-CNN with a hierarchical method without fine-tuning (WoFT-CNN) and with a flat model. In WoFT-CNN, the CNN parameters are trained independently for each level and are not transferred. Flat means that we simply applied our CNN model without the hierarchy. The results are shown in Table 3. HFT-CNN is better than WoFT-CNN and the flat model, except for Micro-F1 obtained by WoFT-CNN(M) on Amazon670K. We also found that the overall results obtained by MSF were better than those obtained by BSF.

Comparison with state-of-the-art method
We chose XML-CNN as a comparative method because it attained the best or second-best performance against seven existing methods on six benchmark datasets (Liu et al., 2017). The original XML-CNN is implemented in Theano, while we implemented HFT-CNN in Chainer. To avoid the influence of library differences, we re-implemented XML-CNN in Chainer, following the author-provided implementation, and compared it with HFT-CNN. We recall that we set convolutional filters with window sizes (2, 3, 4) and a stride of 1 because of the short texts. To make a fair comparison, we also evaluated XML-CNN with the same window sizes and stride as HFT-CNN.
Liu et al. evaluated their method with P@k and NDCG@k. We used these metrics as well as the F1 measure. We did not set a threshold value on BSF and MSF when evaluating with these metrics; instead, we used the ranked list of categories assigned to each test instance.

Table 3: Basic results. (B) and (M) refer to BSF and MSF, respectively. Bold font shows the best result within each line. A method marked with "*" indicates the score is not statistically significantly different from the best one (t-test, p-value < 0.05).

The results are shown in Table 4. HFT-CNN with BSF/MSF has the best scores, with statistical significance, compared to both XML-CNNs. On RCV1, HFT-CNN(B) was worse than XML-CNN(1) in P@1 and NDCG@1, while the differences between HFT-CNN(M) and XML-CNN(1) on the same metrics were statistically significant. This is not surprising, because hierarchical fine-tuning does not contribute to accuracy at the top level: the parameters trained at the top level are not changed at that level.
We also examined the effect of the depth of the hierarchical structure on each system's performance. Figure 2 shows Micro-F1 at each hierarchical level. The deeper the hierarchical level, the worse the system's performance; however, HFT-CNN is still better than the XML-CNNs. The improvement by MSF was 1.00 ∼ 1.34% in Micro-F1 and 3.77 ∼ 10.07% in Macro-F1 on RCV1. On Amazon670K, the improvement was 1.10 ∼ 9.26% in Micro-F1 and 1.10 ∼ 3.60% in Macro-F1. This shows that hierarchical fine-tuning is well suited to learning the hierarchical category structure.
We recall that we focused on the multi-label problem. Figure 3 illustrates Micro-F1 and Macro-F1 against the number of categories per short text. On RCV1 (Figure 3), the Micro-F1 difference between HFT-CNN and the XML-CNNs was not statistically significant at any number of categories, while Macro-F1 by HFT-CNN was consistently better than the XML-CNNs, except at 13 categories. On Amazon670K, when the number of categories assigned to the short text is less than 38, HFT-CNN was better than, or not statistically significantly different from, the XML-CNNs on both F1 scores. However, when it exceeds 39, HFT-CNN was worse than the XML-CNNs. One possible reason is the use of BSF: a category can only be selected if its ancestor categories are selected, so once a test instance is classified into the wrong categories, their child categories also cannot be correctly assigned.
In contrast, as shown in Figure 5, HFT-CNN with MSF was better than the XML-CNNs in both Micro- and Macro-F1, even at the deep levels of the hierarchy. From these observations, a more robust scoring function is needed for further improvement.
It is also important to note how the ratio of training data affects overall performance, since we focused on the data sparsity problem. The figure shows Micro- and Macro-F1 against the ratio of training data. Overall, the curves show that more training data helps performance, while the curves for HFT-CNN drop more slowly than those of the other methods on both datasets and both evaluation metrics. From the above observations, we conclude that fine-tuning works well, especially when the number of training data per category is small.

Conclusion
We have presented an approach to multi-label categorization for short texts. The comparative results with XML-CNN showed that HFT-CNN is competitive, especially in cases where only a small amount of training data exists. Future work includes: (i) incorporating lexical semantics such as named entities and domain-specific senses for further improvement; (ii) extending the method to utilize label dependency constraints (Bi and Kwok, 2011); and (iii) improving the accuracy of the top-ranked categories to address the P@1 and NDCG@1 metrics.