Hyperbolic Capsule Networks for Multi-Label Classification

Although deep neural networks are effective at extracting high-level features, classification methods usually encode an input into a vector representation via simple feature aggregation operations (e.g. pooling). Such operations limit performance. For instance, a multi-label document may contain several concepts, and a single vector cannot sufficiently capture its salient and discriminative content. We therefore propose Hyperbolic Capsule Networks (HyperCaps) for Multi-Label Classification (MLC), which have two merits. First, hyperbolic capsules are designed to capture fine-grained document information for each label, which makes it possible to characterize complicated structures among labels and documents. Second, Hyperbolic Dynamic Routing (HDR) is introduced to aggregate hyperbolic capsules in a label-aware manner, so that label-level discriminative information is preserved along the depth of the network. To efficiently handle large-scale MLC datasets, we additionally present an adaptive routing method that adjusts the number of capsules participating in the routing procedure. Extensive experiments are conducted on four benchmark datasets. Compared with state-of-the-art methods, HyperCaps significantly improves the performance of MLC, especially on tail labels.


Introduction
The main difference between Multi-Class Classification (MCC) and Multi-Label Classification (MLC) is that datasets in MCC have only several mutually exclusive classes, while datasets in MLC contain far more correlated labels. MLC allows label co-occurrence in one document, which indicates that the labels are not disjoint. In addition, a large fraction of the labels are the infrequently occurring tail labels (Bhatia et al., 2015), which is also referred to as the power-law label distribution. A multi-label document usually has several head and tail labels, and hence contains several concepts about both its head and tail labels simultaneously.
Recent works for text classification, such as CNN-KIM (Kim, 2014) and FASTTEXT (Joulin et al., 2017), focus on encoding a document into a fixed-length vector as the distributed document representation (Le and Mikolov, 2014). These encoding-based deep learning methods use simple operations (e.g. pooling) to aggregate the features extracted by neural networks and construct the document vector representation. A Fully-Connected (FC) layer is usually applied on top of the document vector to predict the probability of each label, and each row of its weight matrix can be interpreted as a label vector representation (Du et al., 2019b). In this way, the label probability can be predicted by computing the dot product between the label and document vectors, which is proportional to the scalar projection of the label vector onto the document vector, as shown in Figure 2. For example, the label "movie" should have the largest scalar projection onto a document about "movie". However, even though the learned label representation of "music" can be distinguished from that of "movie", it may also have a large scalar projection onto the document.
Moreover, multi-label documents often contain several concepts about multiple labels, such as a document about "sport movie". However, the document vector representation is identical for all labels, and training instances for tail labels are inadequate compared to head labels. The imbalance between head and tail labels makes it hard for the FC layer to make predictions, especially on tail labels. In this case, one vector cannot sufficiently capture the salient and discriminative content of a document. Therefore, constructing the document vector representation via simple aggregation operations limits the performance of MLC.
Capsule networks (Sabour et al., 2017; Yang et al., 2018a) have recently been proposed to use dynamic routing in place of pooling, and have achieved better performance on classification tasks. In fact, capsules are fine-grained features compared to the distributed document representation, and dynamic routing is a label-aware feature aggregation procedure. Zhao et al. (2019) improve the scalability of capsule networks for MLC. However, they only use CNN to construct capsules, which captures local contextual information (Wang et al., 2016). Effectively learning the document information about multiple labels is crucial for MLC. Thus we propose to connect CNN and RNN in parallel to capture both local and global contextual information, which are complementary to each other. Nevertheless, Euclidean capsules necessitate designing a non-linear squashing function.
Inspired by hyperbolic representation learning methods, which demonstrate that the hyperbolic space has more representation capacity than the Euclidean space (Nickel and Kiela, 2017; Ganea et al., 2018a), Hyperbolic Capsule Networks (HYPERCAPS) are proposed. Capsules are constrained in the hyperbolic space, which does not require the squashing function. Hyperbolic Dynamic Routing (HDR) is introduced to aggregate hyperbolic capsules in a label-aware manner. Moreover, in order to fit the large label set of MLC and improve the scalability of HYPERCAPS, adaptive routing is presented to adjust the number of capsules participating in the routing procedure.
The main contributions of our work are therefore summarized as follows: • We propose to connect CNN and RNN in parallel to simultaneously extract local and global contextual information, which would be complementary to each other.
• HYPERCAPS with HDR is formulated to aggregate features in a label-aware manner, and hyperbolic capsules benefit from the representation capacity of the hyperbolic space.
• Adaptive routing is furthermore presented to improve the scalability of HYPERCAPS and fit the large label set of MLC.
• Extensive experiments on four benchmark MLC datasets demonstrate the effectiveness of HYPERCAPS, especially on tail labels.

Preliminaries
In order to make neural networks work in the hyperbolic space, the formalism of the Möbius gyrovector space is adopted (Ganea et al., 2018b). An n-dimensional Poincaré ball B^n is a Riemannian manifold defined as B^n = {x ∈ R^n : ‖x‖ < 1}, with its tangent space around p ∈ B^n denoted as T_p B^n and the conformal factor as λ_p := 2 / (1 − ‖p‖²). The exponential map exp_p : T_p B^n → B^n for w ∈ T_p B^n \ {0} is consequently defined as

exp_p(w) = p ⊕ (tanh(λ_p ‖w‖ / 2) · w / ‖w‖).

To work with hyperbolic capsules, Möbius operations in the Poincaré ball also need to be formulated.
Möbius addition for u, v ∈ B^n is defined as

u ⊕ v = ((1 + 2⟨u, v⟩ + ‖v‖²) u + (1 − ‖u‖²) v) / (1 + 2⟨u, v⟩ + ‖u‖²‖v‖²),

where ⟨·, ·⟩ denotes the Euclidean inner product.
Thus Möbius summation can be formulated as

⊕_{i=m}^{n} p_i = p_m ⊕ · · · ⊕ p_n.

Möbius scalar multiplication for k ∈ R and p ∈ B^n \ {0} is defined as

k ⊗ p = tanh(k · tanh⁻¹(‖p‖)) · p / ‖p‖,

and k ⊗ p = 0 when p = 0 ∈ B^n. Möbius matrix-vector multiplication for M ∈ R^{m×n} and p ∈ B^n with Mp ≠ 0 is defined as

M ⊗ p = tanh((‖Mp‖ / ‖p‖) · tanh⁻¹(‖p‖)) · Mp / ‖Mp‖,

and M ⊗ p = 0 when Mp = 0. HDR is developed based on these operations.
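These three Möbius operations can be sketched directly from their closed forms; the following NumPy sketch is illustrative (curvature fixed to 1, as in the Poincaré ball above):

```python
import numpy as np

def mobius_add(u, v):
    """Mobius addition u ⊕ v in the Poincare ball (closed form)."""
    uv, u2, v2 = np.dot(u, v), np.dot(u, u), np.dot(v, v)
    num = (1 + 2 * uv + v2) * u + (1 - u2) * v
    return num / (1 + 2 * uv + u2 * v2)

def mobius_scalar(k, p, eps=1e-12):
    """Mobius scalar multiplication: k ⊗ p = tanh(k * artanh(||p||)) * p / ||p||."""
    n = np.linalg.norm(p)
    if n < eps:
        return np.zeros_like(p)  # k ⊗ 0 = 0
    return np.tanh(k * np.arctanh(n)) * p / n

def mobius_matvec(M, p, eps=1e-12):
    """Mobius matrix-vector multiplication; defined as 0 when M p = 0."""
    Mp = M @ p
    Mpn = np.linalg.norm(Mp)
    if Mpn < eps:
        return np.zeros_like(Mp)
    n = np.linalg.norm(p)
    return np.tanh((Mpn / n) * np.arctanh(n)) * Mp / Mpn
```

Sanity checks worth keeping in mind: 0 is the identity for ⊕, 1 ⊗ p = p, and the identity matrix acts as the identity under ⊗, so all three operations keep points inside the ball.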

Local and Global Hyperbolic Capsules
Neural networks are generally used as effective feature extractors for text classification. Kernels of CNN can capture local n-gram contextual information at different positions of a text sequence, while hidden states of RNN can represent global long-term dependencies of the text (Wang et al., 2016). Hence, we propose to obtain the combination of local and global hyperbolic capsules by connecting CNN and RNN in parallel, which are complementary to each other. Given a text sequence of a document with T word tokens x = [x_1, . . . , x_T], pre-trained w-dimensional word embeddings (e.g. GLOVE (Pennington et al., 2014)) are used to compose word vector representations E = [e_1, . . . , e_T] ∈ R^{T×w}, upon which CNN and RNN connected in parallel construct local and global hyperbolic capsules in the Poincaré ball. Figure 3 illustrates the framework of HYPERCAPS.

Local Hyperbolic Capsule Layer
N-gram kernels K ∈ R^{k×w} with different window sizes k are applied on the local region of the word representations E_{t:t+k−1} ∈ R^{k×w} to construct the local features as

m_t = ϕ(Σ K • E_{t:t+k−1}),

where • denotes the element-wise multiplication, the summation runs over all k×w entries, and ϕ is a non-linearity (e.g. ReLU). For simplicity, the bias term is omitted.
With d channels in total, the local hyperbolic capsule at position t can be constructed as

u_t = exp_0([m_t^(1), . . . , m_t^(d)]).

Therefore, a k-gram kernel with stride 1 constructs T − k + 1 local hyperbolic capsules. The local hyperbolic capsule set is denoted as {u_1, . . . , u_L}.
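The sliding n-gram computation can be sketched for a single channel as below (in the model, d kernels per window size would produce d-dimensional pre-capsule features, which exp_0 then maps into the ball); names and shapes here are illustrative assumptions:

```python
import numpy as np

def ngram_features(E, K, phi=np.tanh):
    """Slide a k-gram kernel K (k x w) over word embeddings E (T x w) with stride 1.

    Position t yields phi(sum(K * E[t:t+k])): one scalar per position, so a
    k-gram kernel produces exactly T - k + 1 features, matching the number of
    local capsules stated in the text."""
    T, _ = E.shape
    k = K.shape[0]
    return np.array([phi(np.sum(K * E[t:t + k])) for t in range(T - k + 1)])
```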

Global Hyperbolic Capsule Layer
Bidirectional GRU (Chung et al., 2014) is adopted to incorporate forward and backward global contextual information and construct the global hyperbolic capsules. Forward and backward hidden states at time-step t are obtained by

→h_t = GRU(→h_{t−1}, e_t),  ←h_t = GRU(←h_{t+1}, e_t).

Each of the 2T hidden states in total can be taken as a global hyperbolic capsule using the exponential map, i.e. →g_t = exp_0(→h_t), and likewise for the backward capsules. The global hyperbolic capsule set is denoted as {u_1, . . . , u_G}.

Hyperbolic Compression Layer
As discussed in (Zhao et al., 2019), the routing procedure is computationally expensive for a large number of capsules. Compressing the capsules into a smaller number not only relieves the computational complexity, but also merges similar capsules and removes outliers. Therefore, a hyperbolic compression layer is introduced. Each compressed local hyperbolic capsule is calculated as a weighted Möbius summation over all the local hyperbolic capsules, for instance

ũ = ⊕_{k=1}^{L} r_k ⊗ u_k,

where r_k is a learnable weight parameter, and likewise for compressing the global hyperbolic capsules. Let the set {u_1, . . . , u_P} denote the compressed local and global hyperbolic capsules together, which are then aggregated in a label-aware manner via HDR.
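The weighted Möbius summation used for compression can be sketched as follows (a hedged NumPy sketch; `weights` plays the role of the learnable r_k, and the Möbius helpers are restated for self-containment):

```python
import numpy as np

def mobius_add(u, v):
    """Mobius addition in the Poincare ball."""
    uv, u2, v2 = np.dot(u, v), np.dot(u, u), np.dot(v, v)
    num = (1 + 2 * uv + v2) * u + (1 - u2) * v
    return num / (1 + 2 * uv + u2 * v2)

def mobius_scalar(k, p, eps=1e-12):
    """Mobius scalar multiplication."""
    n = np.linalg.norm(p)
    if n < eps:
        return np.zeros_like(p)
    return np.tanh(k * np.arctanh(n)) * p / n

def compress_capsules(capsules, weights):
    """Weighted Mobius summation: (r_1 ⊗ u_1) ⊕ ... ⊕ (r_L ⊗ u_L).

    The running sum stays inside the ball at every step, so the compressed
    capsule is itself a valid hyperbolic capsule."""
    out = np.zeros_like(capsules[0])
    for r, u in zip(weights, capsules):
        out = mobius_add(out, mobius_scalar(r, u))
    return out
```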

Hyperbolic Dynamic Routing
The purpose of Hyperbolic Dynamic Routing (HDR) is to iteratively aggregate local and global hyperbolic capsules into label-aware hyperbolic capsules, whose activations stand for probabilities of the labels.

Label-Aware Hyperbolic Capsules
Given the compressed local and global hyperbolic capsule set {u_1, . . . , u_P} in layer ℓ, let {v_1, . . . , v_Q} denote the label-aware hyperbolic capsule set in the next layer ℓ+1, where Q equals the number of labels.
Following (Sabour et al., 2017), the compressed hyperbolic capsules are first transformed into a set of prediction capsules {û_{j|1}, . . . , û_{j|P}} for the j-th label-aware capsule, each of which is calculated by

û_{j|i} = W_{ij} ⊗ u_i,

where W_{ij} is a learnable parameter.
Then v_j is calculated as a weighted Möbius summation over all the prediction capsules by

v_j = ⊕_{i=1}^{P} c_{ij} ⊗ û_{j|i},

where c_{ij} denotes the coupling coefficient that indicates the connection strength between û_{j|i} and v_j. The coupling coefficients c_{ij} are iteratively updated during the HDR procedure and computed by the routing softmax

c_{ij} = exp(b_{ij}) / Σ_k exp(b_{ik}),

where the logits b_{ij} are the log prior probabilities between capsule i and j, initialized as 0.
Once the label-aware hyperbolic capsules are produced, each b_{ij} is then updated by

b_{ij} ← b_{ij} + K(d_B(û_{j|i}, v_j)),

where d_B(·, ·) denotes the Poincaré distance, which can be written as

d_B(u, v) = cosh⁻¹(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))),

and K is an Epanechnikov kernel function (Wand and Jones, 1994) with

K(x) = 1 − (x/γ)² for x ∈ [0, γ], and K(x) = 0 otherwise,

where γ is the maximum Poincaré distance between two points in the Poincaré ball, taken as d_B(p, 0) with ‖p‖ = 1 − ε (ε = 10⁻⁵) to avoid numerical errors. HDR is summarized in Algorithm 1. Different from the routing procedure described in (Sabour et al., 2017), HDR does not require the squashing function, since all the hyperbolic capsules are constrained in the Poincaré ball.
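The distance-and-kernel agreement score at the heart of HDR can be sketched as below. The exact scaling of the Epanechnikov kernel is an assumption (a quadratic kernel supported on [0, γ], following the definition of γ); the Poincaré distance is the standard closed form:

```python
import numpy as np

def poincare_dist(u, v):
    """d_B(u, v) = arccosh(1 + 2||u-v||^2 / ((1-||u||^2)(1-||v||^2)))."""
    diff = u - v
    num = 2 * np.dot(diff, diff)
    den = (1 - np.dot(u, u)) * (1 - np.dot(v, v))
    return np.arccosh(1 + num / den)

def gamma_max(eps=1e-5):
    """Maximum distance used to scale the kernel: d_B(p, 0) with ||p|| = 1 - eps."""
    return poincare_dist(np.array([1 - eps, 0.0]), np.zeros(2))

def epanechnikov(x, gamma):
    """Quadratic kernel on [0, gamma): close capsule pairs get large agreement,
    pairs near the maximum distance get agreement ~0."""
    return max(0.0, 1.0 - (x / gamma) ** 2)

# One routing update for a single (i, j) pair would then read:
#   b[i, j] += epanechnikov(poincare_dist(u_hat_ji, v_j), gamma_max())
```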

Adaptive Routing
The large number of labels in MLC is one major source of computational complexity for the routing procedure. Since most labels are unrelated to a given document, calculating label-aware hyperbolic capsules for all the unrelated labels is redundant. Therefore, an encoding-based adaptive routing layer is used to efficiently decide the candidate labels for the document.
The adaptive routing layer produces the candidate probability of each label by

c = σ(W_c z + b_c),

where σ denotes the sigmoid function and z is the encoded document representation. W_c and the bias b_c are learnable parameters updated by minimizing the binary cross-entropy loss (Liu et al., 2017)

L_c = − Σ_j [ y_j log(c_j) + (1 − y_j) log(1 − c_j) ],

where c_j ∈ [0, 1] is the j-th element of c and y_j ∈ {0, 1} denotes the ground truth for label j. The adaptive routing layer selects the candidate labels during test. Label-aware hyperbolic capsules are then constructed via HDR to predict the probabilities of these candidate labels. During the training process, negative sampling is used to improve the scalability of HYPERCAPS. Let N+ denote the true label set and N− denote the set of randomly sampled negative labels; the loss function is derived as

L = − Σ_{j∈N+} log(a_j) − Σ_{j∈N−} log(1 − a_j),

where a_j = σ(d_B(v_j, 0)) is the activation of the j-th label-aware capsule, which increases with its distance from the origin of the Poincaré ball.
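The label activation and the negative-sampling loss can be sketched as follows (illustrative; `caps` is an assumed mapping from label index to its label-aware capsule):

```python
import numpy as np

def label_activation(v):
    """a_j = sigmoid(d_B(v_j, 0)): capsules pushed farther from the origin
    of the Poincare ball correspond to more probable labels."""
    v2 = np.dot(v, v)
    d = np.arccosh(1 + 2 * v2 / (1 - v2))  # distance to the origin
    return 1.0 / (1.0 + np.exp(-d))

def negative_sampling_loss(caps, pos, neg):
    """-sum_{j in N+} log a_j  -  sum_{j in N-} log(1 - a_j)."""
    loss = 0.0
    for j in pos:   # true labels: push activations toward 1
        loss -= np.log(label_activation(caps[j]))
    for j in neg:   # sampled negatives: push activations toward 0
        loss -= np.log(1.0 - label_activation(caps[j]))
    return loss
```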

Experiments
The proposed HYPERCAPS is evaluated on four benchmark datasets whose label set sizes range from 54 to 4,271. We compare with state-of-the-art methods in terms of widely used metrics. Performance on tail labels is also compared to demonstrate the superiority of HYPERCAPS for MLC. An ablation test is further carried out to analyse the contribution of each component of HYPERCAPS.

Experimental Setup
Datasets Experiments are carried out on four publicly available MLC datasets, including the small-scale AAPD (Yang et al., 2018b) and RCV1 (Lewis et al., 2004), and the large-scale ZHIHU 1 and EUR-LEX57K (Chalkidis et al., 2019). Labels are divided into head and tail sets according to their number of training instances, i.e. labels with fewer than the average number of training instances are assigned to the tail label set. Their statistics can be found in Table 1.
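The head/tail division described above can be reproduced in a few lines (the label counts below are hypothetical, purely for illustration):

```python
def split_head_tail(train_counts):
    """Labels with fewer training instances than the average count go to the
    tail set; all others are head labels."""
    avg = sum(train_counts.values()) / len(train_counts)
    head = {label for label, c in train_counts.items() if c >= avg}
    tail = {label for label, c in train_counts.items() if c < avg}
    return head, tail
```

With a power-law label distribution, a small head set absorbs most instances while the vast majority of labels land in the tail, which matches the Pareto-like split reported for the four datasets.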
Evaluation metrics We use the rank-based evaluation metrics which have been widely adopted for MLC tasks (Bhatia et al., 2015; Liu et al., 2017), i.e. Precision@k (P@k for short) and nDCG@k, which are respectively defined as

P@k = (1/k) Σ_{l ∈ rank_k(a)} y_l,

DCG@k = Σ_{l ∈ rank_k(a)} y_l / log(l + 1),

nDCG@k = DCG@k / Σ_{l=1}^{min(k, ‖y‖_0)} 1 / log(l + 1),

where y_j ∈ {0, 1} denotes the ground truth for label j, rank_k(a) denotes the indices of the candidate label-aware hyperbolic capsules with the k largest activations in descending order, and ‖y‖_0 is the number of true labels for the document instance. The final results are averaged over all the test instances.
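A straightforward NumPy sketch of the two metrics (log base 2 is assumed, a common convention in these definitions):

```python
import numpy as np

def precision_at_k(scores, y, k):
    """P@k: fraction of the k highest-scored labels that are true labels."""
    topk = np.argsort(scores)[::-1][:k]
    return float(y[topk].sum()) / k

def ndcg_at_k(scores, y, k):
    """nDCG@k: DCG over the top-k ranking, normalized by the ideal DCG
    truncated at min(k, number of true labels)."""
    topk = np.argsort(scores)[::-1][:k]
    dcg = sum(y[l] / np.log2(rank + 2) for rank, l in enumerate(topk))
    ideal = sum(1.0 / np.log2(r + 2) for r in range(min(k, int(y.sum()))))
    return dcg / ideal
```

For instance, with true labels {0, 2} and scores that rank label 0 first and label 1 second, P@2 = 1/2 and nDCG@1 = 1.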
Baselines To demonstrate the effectiveness of HYPERCAPS on the benchmark datasets, six comparative text classification methods are chosen as baselines. FASTTEXT (Joulin et al., 2017) is a representative encoding-based method which uses average pooling to construct document representations and an MLP to make predictions. SLEEC (Bhatia et al., 2015) is a typical label-embedding method for MLC, which uses k-nearest-neighbor search to predict the labels. XML-CNN (Liu et al., 2017) employs CNN as a local n-gram feature extractor and a dynamic pooling technique as the aggregation method. SGM (Yang et al., 2018b) applies the seq2seq model with an attention mechanism, which takes the global contextual information into account. REGGNN (Xu et al., 2019) uses a combination of CNN and LSTM with a dynamic gate that controls the information from these two parts. NLP-CAP (Zhao et al., 2019) is a capsule-based approach for MLC which reformulates the routing algorithm; it uses only CNN to construct capsules and applies the squashing function to them.
Implementation Details All words are converted to lower case and padding is used to handle the various lengths of the text sequences. The maximum length for AAPD, RCV1 and EUR-LEX57K is set to 500, while the maximum length for ZHIHU is 50. To compose the word vector representations, pre-trained 300-dimensional GLOVE (Pennington et al., 2014) word embeddings are used for AAPD, RCV1 and EUR-LEX57K, while ZHIHU uses its own 256-dimensional word embeddings. The dimension of the Poincaré ball is set to 32, with radius 1 − ε (ε = 10⁻⁵) to avoid numerical errors. Multiple one-dimensional convolutional kernels (with window sizes 2, 4 and 8) are applied in the local hyperbolic capsule layer. The number of compressed local and global hyperbolic capsules is 128. The adaptive routing layer is not applied on the small-scale datasets AAPD and RCV1. The maximum candidate label number is set to 200 for the large-scale datasets ZHIHU and EUR-LEX57K. For the baselines, the hyperparameters recommended by their authors are adopted.

Experimental Results
The proposed HYPERCAPS is evaluated on the four benchmark datasets by comparing with the six baselines in terms of P@k and nDCG@k with k = 1, 3, 5. Results on all the labels averaged over the test instances are shown in Table 2. nDCG@1 is omitted since it gives the same value as P@1.
It is notable that HYPERCAPS obtains competitive results on the four datasets. The encoding-based FASTTEXT is generally inferior to the other baselines, as it applies average pooling on word vector representations, which ignores word order in the construction of document representations. The typical MLC method SLEEC takes advantage of label correlations by embedding the label co-occurrence graph. However, SLEEC uses TF-IDF vectors to represent documents, so word order is also ignored. XML-CNN uses a dynamic pooling technique to aggregate the local contextual features extracted by CNN, while SGM uses an attention mechanism to aggregate the global contextual features extracted by LSTM. REGGNN is generally superior to both of them, as it combines the local and global contextual information dynamically and takes label correlations into consideration using a regularized loss. However, the two capsule-based methods NLP-CAP and HYPERCAPS consistently outperform all the other methods owing to dynamic routing, which aggregates the fine-grained capsule features in a label-aware manner.
Moreover, NLP-CAP only uses CNN to extract local contextual information, while HYPERCAPS benefits from the parallel combination of local and global contextual information. In addition, NLP-CAP applies the non-linear squashing function to capsules in the Euclidean space, while HDR is designed for hyperbolic capsules, which take advantage of the representation capacity of the hyperbolic space. Therefore, HYPERCAPS outperforms NLP-CAP as expected. This result further confirms that the proposed HYPERCAPS with HDR is effective at learning label-aware hyperbolic capsules for MLC.

Performance on Tail Labels
In MLC, tail labels have low occurring frequency and are hence hard to predict compared to head labels. The performance on tail labels of the four benchmark datasets is evaluated in terms of nDCG@k with k = 1, 3, 5. Figure 4 shows the results of the five deep learning based MLC methods, i.e. XML-CNN, SGM, REGGNN, NLP-CAP and HYPERCAPS. nDCG@1 is smaller than nDCG@3 on AAPD, RCV1 and ZHIHU, since most of their test instances contain fewer than three tail labels. It is remarkable that HYPERCAPS outperforms all the other methods on tail labels.
REGGNN takes advantage of the local and global contextual information and label correlations, thus it outperforms XML-CNN and SGM. The two capsule-based methods NLP-CAP and HYPERCAPS are both superior to the other methods, which indicates that the label-aware dynamic routing is effective for the prediction on tail labels. In addition, the fact that HYPERCAPS significantly improves the prediction performance compared to NLP-CAP implies that the representation capacity of the hyperbolic space and the combination of local and global contextual information are helpful for learning on tail labels. The results demonstrate the superiority of the proposed HYPERCAPS on tail labels for MLC.

Ablation Test
An ablation test is informative for analyzing the effect of the different components of the proposed HYPERCAPS, which can be decomposed into local Euclidean capsules only (denoted as L), global Euclidean capsules only (denoted as G), a combination of the local and global Euclidean capsules (denoted as L + G), and a combination of the local and global hyperbolic capsules (denoted as L + G + H). Euclidean capsules (in L, G and L + G) are aggregated via the original dynamic routing (Sabour et al., 2017), while hyperbolic capsules (in L + G + H) are aggregated via our HDR. Figure 5 shows the results on EUR-LEX57K in terms of P@k with k = 1, 3, 5. To make the comparison fair, the number of total compressed capsules is set to 256 for all four models. Adaptive routing is also applied, with the maximum candidate label number equally set to 200. Generally, the proposed combination of local and global contextual information contributes to the effectiveness of the model (L + G). It is therefore practical to combine the local and global contextual information via dynamic routing. HDR further improves the performance by making use of the representation capacity of the hyperbolic space. Overall, each of the components benefits the performance of HYPERCAPS for MLC.
In summary, extensive experiments are carried out on four MLC benchmark datasets with various scales. The results demonstrate that the proposed HYPERCAPS can achieve competitive performance compared with the baselines. In particular, effectiveness of HYPERCAPS is shown on tail labels. The ablation test furthermore confirms that the combination of local and global contextual information is practical and HYPERCAPS benefits from the representation capacity of the hyperbolic space.
Related Work

Multi-Label Classification
Multi-label classification (MLC) aims at assigning multiple relevant labels to one document. The label set in MLC is large compared to Multi-Class Classification (MCC). Besides, the correlations among labels (e.g. hierarchical label structures (Banerjee et al., 2019)) and the existence of tail labels make MLC a hard task (Bhatia et al., 2015).
As data sparsity and scalability issues arise with the large number of labels, XML-CNN (Liu et al., 2017) employs CNN as an efficient feature extractor, whereas it ignores label correlations, which are often used to deal with tail labels. The traditional MLC method SLEEC (Bhatia et al., 2015) makes use of label correlations by embedding the label co-occurrence graph. The seq2seq model SGM (Yang et al., 2018b) uses the attention mechanism to consider label correlations, while REGGNN (Xu et al., 2019) applies a regularized loss specified for label co-occurrence. REGGNN additionally chooses to dynamically combine the local and global contextual information to construct document representations.

Capsule Networks
Capsule networks were recently proposed to address the representation limitations of CNN and RNN. The concept of the capsule is first introduced by (Hinton et al., 2011). (Sabour et al., 2017) replaces the scalar output features of CNN with vector capsules and pooling with dynamic routing. (Hinton et al., 2018) proposes the EM-algorithm-based routing procedure between capsule layers. (Gong et al., 2018) proposes to regard dynamic routing as an information aggregation procedure, which is more effective than pooling. (Yang et al., 2018a) and (Du et al., 2019a) investigate capsule networks for text classification. (Zhao et al., 2019) then presents a capsule compression method and reformulates the routing procedure to fit MLC.
Our work is different from the predecessors as we design the Hyperbolic Dynamic Routing (HDR) to aggregate the parallel combination of local and global contextual information in form of hyperbolic capsules, which are constrained in the hyperbolic space without the requirement of non-linear squashing function. In addition, adaptive routing is proposed to improve the scalability for large number of labels.

Hyperbolic Deep Learning
Recent research on representation learning (Nickel and Kiela, 2017) indicates that hyperbolic space is superior to Euclidean space in terms of representation capacity, especially in low dimension. (Ganea et al., 2018b) generalizes operations for neural networks in the Poincaré ball using the formalism of the Möbius gyrovector space. Some works lately demonstrate the superiority of the hyperbolic space for several natural language processing tasks, such as textual entailment (Ganea et al., 2018a), machine translation (Gulcehre et al., 2019) and word embedding (Tifrea et al., 2019). Our work presents the Hyperbolic Capsule Networks (HYPERCAPS) for MLC.

Conclusion
We present the Hyperbolic Capsule Networks (HYPERCAPS) with Hyperbolic Dynamic Routing (HDR) and adaptive routing for Multi-Label Classification (MLC). The proposed HYPERCAPS takes advantage of the parallel combination of fine-grained local and global contextual information and the label-aware feature aggregation method HDR to dynamically construct label-aware hyperbolic capsules for head and tail labels. Adaptive routing is additionally applied to improve the scalability of HYPERCAPS by controlling the number of capsules during the routing procedure. Extensive experiments are carried out on four benchmark datasets. Results compared with the state-of-the-art methods demonstrate the superiority of HYPERCAPS, especially on tail labels. As recent works explore the superiority of the hyperbolic space over the Euclidean space for several natural language processing tasks, we intend to couple HYPERCAPS with hyperbolic neural networks (Ganea et al., 2018b) and hyperbolic word embedding methods such as POINCARÉGLOVE (Tifrea et al., 2019) in the future.

Figure 1 and Figure 6 show the label distributions of the four benchmark datasets. Head and tail labels are divided based on the average number of training instances (listed in Table 1), i.e. labels with fewer than the average number of training instances are tail labels. We observe that this division generally follows the Pareto principle, as nearly 80% of the labels are divided into the tail label set.