Investigating Capsule Network and Semantic Feature on Hyperplanes for Text Classification

As an essential component of natural language processing, text classification has relied heavily on deep learning in recent years. Various neural networks have been designed for text classification on the basis of word embeddings. However, polysemy is a fundamental feature of natural language, and it brings challenges to text classification. A polysemic word has more than one sense, while the word embedding procedure conflates the different senses of a polysemic word into a single vector. Extracting a distinct representation for each specific sense could thus lead to fine-grained models with strong generalization ability. It has been demonstrated that the multiple senses of a word actually reside in linear superposition within the word embedding, so that specific senses can be extracted from the original word embedding. Therefore, we propose to use capsule networks to construct vectorized representations of semantics and to utilize hyperplanes to decompose each capsule to acquire the specific senses. A novel dynamic routing mechanism named ‘routing-on-hyperplane’ selects the proper sense for the downstream classification task. Our model is evaluated on 6 different datasets, and the experimental results show that it is capable of extracting more discriminative semantic features and yields a significant performance gain compared to other baseline methods.


Introduction
Text classification is a crucial task in natural language processing with many applications, such as sentiment analysis, intent identification and topic labeling [Aggarwal and Zhai, 2012; Wang and Manning, 2012a]. In recent years, many studies have relied on neural networks and have shown promising performance.
* Authors contributed equally.
The success of deep learning models for NLP is based on the progress in learning distributed word representations in a semantic vector space, where each word is mapped to a vector called a word embedding. The word's representation is calculated by relying on the distributional hypothesis, i.e., the assumption that semantically similar or related words appear in similar contexts [Mikolov et al., 2013; Langendoen, 1959]. Normally, each word's representation is constructed by counting all its context features. However, for a polysemic word with multiple senses, the context features of the different senses are mixed together, leading to an inaccurate word representation. As demonstrated in [Arora et al., 2018], the multiple senses of a word actually reside in linear superposition within the word embedding:

$$v \approx \alpha_1 v_{\text{sense}_1} + \alpha_2 v_{\text{sense}_2} + \alpha_3 v_{\text{sense}_3} + \cdots \tag{1}$$

where the coefficients $\alpha_i$ are nonnegative and $v_{\text{sense}_1}, v_{\text{sense}_2}, \ldots$ are the hypothetical embeddings of the different senses. As a result, the word embedding $v$ deviates from every individual sense, which introduces ambiguity into the subsequent task. We therefore need to extract the separate senses from the overall word representation to avoid this ambiguity.
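To make the superposition in Equation (1) concrete, the following toy sketch mixes two sense vectors and measures how far the resulting embedding is from each sense. The sense vectors and mixing coefficients are hypothetical values chosen only for illustration; they are not taken from any trained embedding.

```python
import math


def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical sense embeddings (orthogonal here only to keep the example simple).
v_sense1 = [1.0, 0.0, 0.0]
v_sense2 = [0.0, 1.0, 0.0]

# Hypothetical nonnegative mixing coefficients alpha_1 and alpha_2.
alpha1, alpha2 = 0.6, 0.4

# Equation (1) restricted to two senses: v = alpha_1 * v_sense1 + alpha_2 * v_sense2.
v = [alpha1 * s1 + alpha2 * s2 for s1, s2 in zip(v_sense1, v_sense2)]

# The mixed embedding is not aligned with either sense (both similarities are below 1).
print(round(cosine(v, v_sense1), 3))
print(round(cosine(v, v_sense2), 3))
```

Both similarities stay strictly below 1, which is the deviation from any single sense described in the text.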
Similar to word embeddings, the recently proposed capsule network constructs vectorized representations for different entities [Hinton et al., 2018; Sabour et al., 2017; Hinton et al., 2011]. A dynamic routing mechanism, 'routing-by-agreement', is implemented to ensure that the output of a capsule is sent to an appropriate parent in the layer above. Very recently, capsule networks have been applied in the field of NLP, where each capsule is obtained from the word embedding. Compared with standard neural nets, which use a single scalar (the output of a neural unit) to represent the detected semantics, the vectorized representation in a capsule network enables us to utilize hyperplanes to extract a component from the overall representation and obtain the specific sense.
Therefore, we propose to attach different hyperplanes to capsules to tackle the ambiguity caused by polysemy. Each capsule is decomposed by projecting its output vector onto the hyperplanes, which extracts a specific semantic feature. The projected capsule denotes a specific sense of the target words, and a novel dynamic routing mechanism named 'routing-on-hyperplane' decides which specific senses are selected for the downstream classification and which senses are ignored. Similar to routing-by-agreement [Sabour et al., 2017], we aim to activate a higher-level capsule whose output vector agrees with the predictions from the lower-level capsules. In contrast to routing-by-agreement, before an active capsule at one level makes predictions for the next-level capsules, its output vector is projected onto a trainable hyperplane. The hyperplanes are trained discriminatively to extract specific senses. Moreover, in order to encourage the diversity of the hyperplanes, a penalization term is implemented in our model. We define the cosine similarity between the normal vectors of the hyperplanes as a measure of redundancy, and minimize it together with the original loss.
We test our model (HCapsNet) on the text classification task and conduct extensive experiments on 6 datasets. Experimental results show that the proposed model learns more discriminative features and outperforms other baselines. Our main contributions are summarized as follows:
• We explore the capsule network for text classification and propose to decompose capsules by projecting them onto hyperplanes to tackle the polysemy problem in natural language.
• We propose routing-on-hyperplane to dynamically select specific senses for the subsequent classification. A penalization term is designed to obtain diversified hyperplanes and offer representations of the multiple senses of words.
• Our work is among the few studies which demonstrate that capsule networks have promising applications in natural language processing tasks.

Related Work
Neural models for text classification include convolutional neural networks [Kim, 2014], recursive neural networks [Socher et al., 2013] and recurrent neural networks. Several recent studies apply CNNs to text classification with large training datasets and deep, complex model structures [Schwenk et al., 2017; Johnson and Zhang, 2017]. Some models were proposed to combine the strengths of CNNs and RNNs [Lai et al., 2015; Zhang et al., 2016]. Moreover, the accuracy was further improved by attention-based neural networks [Lin et al., 2017; Vaswani et al., 2017; Yang et al., 2016]. However, these models are less efficient than capsule networks. As a universal phenomenon of language, polysemy has attracted much attention from linguists. It has been demonstrated that learning a distinct representation for each sense of an ambiguous word could lead to more powerful and fine-grained models based on vector-space representations [Li and Jurafsky, 2015].

Capsule Network
The capsule network was proposed to address the representational limitations of CNNs and RNNs by extracting features in the form of vectors. The technique was first proposed in [Hinton et al., 2011] and improved in [Sabour et al., 2017] and [Hinton et al., 2018]. A vector-based representation is able to encode latent inter-dependencies between groups of input features during the learning process. Introducing capsules also allows us to use a routing mechanism to generate high-level features, which is a more efficient way of encoding features.
Several types of capsule networks have been proposed for natural language processing. Capsule networks with routing-by-agreement have been investigated for text classification, where they were found to exhibit significant improvement when transferring from single-label to multi-label text classification. Capsule networks also show good performance in multi-task learning [Xiao et al., 2018]. Xia et al. [2018] discovered the potential of capsule-based models for zero-shot learning. However, existing capsule networks for natural language processing cannot model polysemic words or phrases, which contain multiple senses.

Model
In this section, we begin by introducing the idea of routing-on-hyperplane and formulating it in detail. The architecture of HCapsNet is then formally presented in the second subsection. Finally, the penalization term and the loss function used in this paper are explained.

Routing-on-hyperplane
Suppose that we have already decided on the output vectors of all the capsules in the layer $L$ and we now want to decide which capsules to activate in the layer $L+1$. We should also consider how to assign each active capsule in the layer $L$ to one active capsule in the layer $L+1$. The output vector of capsule $i$ in the layer $L$ is denoted by $u_i$, and the output vector of capsule $j$ in the layer $L+1$ is denoted by $v_j$.
First, we attach a trainable hyperplane to each capsule in the layer $L$. The capsules' output vectors are projected onto these hyperplanes before making predictions. More specifically, for capsule $i$ we define a trainable matrix $W^h_i$, which is used to determine the normal vector $w_i$ of the attached hyperplane. By restricting $\|w_i\|_2 = 1$, we obtain the projected output vector $u_{\perp i}$ of the capsule:

$$u_{\perp i} = u_i - (w_i^{\top} u_i)\, w_i$$

In this way, the output vectors of the capsules are projected onto specific hyperplanes to obtain different components, which denote the specific senses in our task. Whether to retain or ignore a specific sense (projected capsule) is decided through an iterative procedure that consists of making predictions and calculating the agreement. When the prediction of a projected capsule agrees strongly with a target parent capsule, the probability of retaining the projected capsule gradually increases. In other words, when a specific sense is highly relevant to the subsequent classification, we choose to keep it and ignore the others. Therefore, $u_{\perp i}$ is used to make predictions for the capsules in the layer $L+1$ and to calculate the coupling coefficients $c_{ij}$. When making predictions, the capsules in the layer $L$ multiply their projected output vector $u_{\perp i}$ by a weight matrix $W_{ij}$:

$$\hat{u}_{\perp j|i} = W_{ij}\, u_{\perp i}$$

where $\hat{u}_{\perp j|i}$ denotes the 'vote' of capsule $i$ for capsule $j$. The agreement between the prediction vector $\hat{u}_{\perp j|i}$ and the current output vector of the parent capsule $j$ is fed back to the coupling coefficient $c_{ij}$ between the two capsules: $c_{ij}$ is increased if they agree strongly. Similar to [Sabour et al., 2017], we define the agreement as the scalar product between the two vectors. $b_{ij}$ accumulates the agreement after each iteration, and the softmax function is applied to ensure that the coupling coefficients between capsule $i$ and all the capsules in the layer above sum to one:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

Each iteration results in a temporary output vector of capsule $j$: the weighted sum over all prediction vectors $\hat{u}_{\perp j|i}$ using the coefficients $c_{ij}$. Moreover, to ensure that the length of the output vector of capsule $j$ can represent a probability and to prevent it from becoming too large, we use a non-linear 'squashing' function, which makes the vector's length range from zero to one without changing the vector's direction:

$$s_j = \sum_i c_{ij}\, \hat{u}_{\perp j|i}, \qquad v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\, \frac{s_j}{\|s_j\|}$$

$v_j$ is then fed back to calculate the agreement for the next iteration. The coupling coefficients $c_{ij}$ and the output vector of capsule $j$ gradually converge after several iterations. After the last iteration of the routing process, the coupling coefficients $c_{ij}$ are determined. The hyperplanes extract specific senses and help to route the lower-level capsules to the right parent capsules. We detail the whole routing-on-hyperplane algorithm in Algorithm 1.

Algorithm 1 Routing-on-hyperplane returns the output vector of capsule $j$ in the layer $L+1$ given the output vector of capsule $i$ in the layer $L$. $W_{ij}$ are trainable parameters denoting the transformation matrix between the two adjacent layers. $W^h_i$ are trainable parameters for each capsule $i$, used to calculate the normal vector $w_i$ of the proposed hyperplane; we restrict $\|w_i\|_2 = 1$.
1: initialize the routing logits: for all capsule $i$ in the layer $L$ and capsule $j$ in the layer $L+1$: $b_{ij} \leftarrow 0$
2: for every capsule $i$ in the layer $L$: determine the normal vector $w_i$ from $W^h_i$, with $\|w_i\|_2 = 1$
3: for every capsule $i$ in the layer $L$: $u_{\perp i} \leftarrow u_i - (w_i^{\top} u_i)\, w_i$
4: for all capsule $i$ in the layer $L$ and capsule $j$ in the layer $L+1$: $\hat{u}_{\perp j|i} \leftarrow W_{ij}\, u_{\perp i}$
5: for each routing iteration do
6:   for all capsule $i$ in the layer $L$: $c_{ij} \leftarrow \exp(b_{ij}) / \sum_k \exp(b_{ik})$
7:   for all capsule $j$ in the layer $L+1$: $s_j \leftarrow \sum_i c_{ij}\, \hat{u}_{\perp j|i}$
8:   for all capsule $j$ in the layer $L+1$: $v_j \leftarrow \mathrm{squash}(s_j)$
9:   for all capsule $i$ and capsule $j$: $b_{ij} \leftarrow b_{ij} + \hat{u}_{\perp j|i} \cdot v_j$
10: end for
11: return $v_j$
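The routing procedure described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the unit normal vectors are passed in directly (the text only says they are determined by the trainable matrices $W^h_i$), the squashing function is taken to be the one of [Sabour et al., 2017], and all function and argument names are hypothetical.

```python
import math


def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in m]


def project_on_hyperplane(u, w):
    """Project u onto the hyperplane whose normal is w (w is normalized first)."""
    norm = math.sqrt(dot(w, w))
    w = [x / norm for x in w]
    scale = dot(w, u)
    return [u_k - scale * w_k for u_k, w_k in zip(u, w)]


def squash(s):
    """Shrink the length of s into [0, 1) while keeping its direction."""
    sq = dot(s, s)
    if sq == 0.0:
        return list(s)  # assumption: a zero vector stays zero
    factor = sq / (1.0 + sq) / math.sqrt(sq)
    return [factor * x for x in s]


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def routing_on_hyperplane(u, normals, W, iterations=3):
    """u[i]: output of lower capsule i; normals[i]: its hyperplane normal;
    W[i][j]: transformation matrix from lower capsule i to parent capsule j."""
    n_in, n_out = len(u), len(W[0])
    # Project every lower-level capsule onto its hyperplane.
    u_proj = [project_on_hyperplane(u[i], normals[i]) for i in range(n_in)]
    # Prediction vectors ("votes") of capsule i for parent capsule j.
    u_hat = [[matvec(W[i][j], u_proj[i]) for j in range(n_out)] for i in range(n_in)]
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    v, c = [], []
    for _ in range(iterations):
        # Coupling coefficients of each lower capsule sum to one over the parents.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            dim = len(u_hat[0][j])
            s_j = [sum(c[i][j] * u_hat[i][j][k] for i in range(n_in)) for k in range(dim)]
            v.append(squash(s_j))
        # Accumulate the agreement between each vote and the parent output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += dot(u_hat[i][j], v[j])
    return v, c
```

The function returns the parent output vectors together with the coupling coefficients used in the last iteration, so that the coefficients can be inspected afterwards.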

HCapsNet Model Architecture
We propose a model named HCapsNet for text classification based on the theory of capsule networks and routing-on-hyperplane. The architecture is illustrated in Figure 1. The model consists of three layers: one bi-directional recurrent layer, one convolutional capsule layer, and one fully connected capsule layer. The input of the model is a sentence $S$ consisting of a sequence of word tokens $t_1, t_2, \ldots, t_n$. The output of the model contains a series of capsules. Each top-level capsule corresponds to a sentence category. The length of the top-level capsule's output vector is the probability $p$ that the input sentence $S$ belongs to the corresponding category.
A recurrent neural network can capture long-distance dependencies within a sentence. For this reason, a bi-directional recurrent neural network is the first layer of HCapsNet. We concatenate the left context and the right context as the word's elementary representation $x_i$, which is the input to the second layer.
The second layer is a convolutional capsule layer. This is the first layer consisting of capsules, and we call the capsules in this layer primary capsules. Primary capsules are groups of detected features, which amounts to piecing instantiated parts together to make familiar wholes. Since the output of the bi-directional recurrent neural network is not in the form of capsules, no routing method is used in this layer.
The final layer is a fully connected capsule layer. Each capsule corresponds to a sentence class. All the capsules in this layer receive the output of the lower-level capsules through the routing-on-hyperplane method described in Section 3.1. The length of the top-level capsule's output vector represents the probability that the input sentence belongs to the corresponding category.

Penalization Term
HCapsNet may suffer from a redundancy problem if the output vectors of the capsules are always projected onto similar hyperplanes during the routing-on-hyperplane procedure. Thus, we need a penalization term to encourage the diversity of the hyperplanes. We introduce a simple penalization term with low time complexity and space cost. First, we construct a matrix $X_i$ whose rows are the normal vectors $w$ of the hyperplanes for the $i$-th word. The product of $X_i$ and its transpose, minus an identity matrix, is defined as a measure of redundancy. The penalization term is the sum of the redundancy over all words:

$$P = \sum_i \|P_i\|_F^2, \qquad P_i = X_i X_i^{\top} - I$$

where $\|\cdot\|_F$ stands for the Frobenius norm of a matrix. Similar to adding an L2 regularization term, this penalization term $P$ is multiplied by a coefficient, and we minimize it together with the original loss.
Consider two rows $w^a$ and $w^b$ of $X_i$, which are two normal vectors of hyperplanes for the $i$-th word. We have restricted $\|w\| = 1$, as described in Algorithm 1. Any non-diagonal element $x_{ab}$ ($a \neq b$) of the matrix $X_i X_i^{\top}$ corresponds to the cosine similarity between the two normal vectors:

$$x_{ab} = \sum_k w^a_k\, w^b_k$$

where $w^a_k$ and $w^b_k$ are the $k$-th elements of the vectors $w^a$ and $w^b$, respectively. In one extreme case, where the two normal vectors are orthogonal to each other, i.e., the word is projected to two extremely different meanings, the corresponding $x_{ab}$ is 0. Otherwise, its absolute value is positive. In the other extreme case, where the two normal vectors are identical, i.e., the word is projected to the same vector, the absolute value of $x_{ab}$ is 1. The diagonal elements $x_{ab}$ ($a = b$) of $X_i X_i^{\top}$ are the cosine similarities of the normal vectors with themselves, so they are all 1. The identity matrix $I$ is subtracted from $X_i X_i^{\top}$ to eliminate these meaningless elements.
We minimize the Frobenius norm of $P_i$ to encourage its non-diagonal elements to converge to 0, in other words, to encourage each word vector to be projected onto orthogonal hyperplanes and to obtain diversified interpretations.
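The redundancy measure can be sketched in a few lines of Python. The sketch assumes that the redundancy of a word is the squared Frobenius norm of $X_i X_i^{\top} - I$, with the unit normal vectors of that word's hyperplanes as the rows of $X_i$; the use of the squared norm and the function names are assumptions made for illustration.

```python
import math


def unit(w):
    """Rescale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]


def redundancy(normals):
    """Squared Frobenius norm of X X^T - I, where the rows of X are unit normals."""
    rows = [unit(w) for w in normals]
    total = 0.0
    for a, w_a in enumerate(rows):
        for b, w_b in enumerate(rows):
            # x_ab is the cosine similarity between normal vectors a and b.
            x_ab = sum(p * q for p, q in zip(w_a, w_b))
            target = 1.0 if a == b else 0.0  # entry of the identity matrix
            total += (x_ab - target) ** 2
    return total


def penalization(normals_per_word):
    """Sum of the redundancy over all words."""
    return sum(redundancy(normals) for normals in normals_per_word)


# Orthogonal normals give no penalty; identical normals are penalized.
print(redundancy([[1.0, 0.0], [0.0, 1.0]]))  # prints 0.0
print(redundancy([[1.0, 0.0], [1.0, 0.0]]))  # prints 2.0
```

The two printed values correspond to the two extreme cases discussed in the text: orthogonal hyperplanes and identical hyperplanes.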

Loss Function
In HCapsNet, each top-level capsule corresponds to a sentence category. The length of the top-level capsule's output vector represents the probability that the input sentence belongs to the corresponding category. We would like the top-level capsule for category $k$ to have a long output vector if the input sentence belongs to category $k$ and a short output vector if it does not. Similar to [Sabour et al., 2017], we use a separate margin loss $L_k$ for each top-level capsule $k$. The total loss $L$ is the sum of the losses of all top-level capsules, together with the weighted penalization term:

$$L_k = T_k \max(0,\, m^+ - \|v_k\|)^2 + \lambda_1 (1 - T_k) \max(0,\, \|v_k\| - m^-)^2$$

$$L = \sum_k L_k + \lambda_2 P$$

where $T_k$ is 1 if the sentence belongs to class $k$ and 0 otherwise, and $\|v_k\|$ is the length of the output vector of capsule $k$. We introduce $\lambda_1$ to down-weight the loss for absent categories, which avoids shrinking the lengths of the capsules' output vectors in the initial learning stage. $P$ is the penalization term introduced in Section 3.3. In our experiments, $m^+ = 0.9$, $m^- = 0.1$ and $\lambda_1 = 0.5$.
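As a worked illustration of this loss, the sketch below evaluates a margin loss of the form used in [Sabour et al., 2017] with the constants reported in this paper. The exact form of the loss (squared hinge terms, penalization term weighted by $\lambda_2$) is an assumption based on that reference and on the surrounding description, and the function names are hypothetical.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam1=0.5):
    """Sum of per-capsule margin losses.

    lengths[k] is the length of the output vector of top-level capsule k,
    target is the index of the correct category.
    """
    total = 0.0
    for k, length in enumerate(lengths):
        t_k = 1.0 if k == target else 0.0
        present = max(0.0, m_plus - length) ** 2   # correct capsule should be long
        absent = max(0.0, length - m_minus) ** 2   # other capsules should be short
        total += t_k * present + lam1 * (1.0 - t_k) * absent
    return total


def total_loss(lengths, target, penalty, lam2=0.01):
    """Margin loss plus the weighted hyperplane penalization term."""
    return margin_loss(lengths, target) + lam2 * penalty


# A confident, correct prediction incurs no margin loss.
print(margin_loss([0.95, 0.05], target=0))  # prints 0.0
```

With capsule lengths 0.95 and 0.05 and the first category as the target, both hinge terms are inactive, so the loss is zero; swapping the lengths activates both terms.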

Experiments
We compare our method with the widely used text classification methods and baseline models (listed in Table 1).

Datasets
HCapsNet is evaluated on six widely studied datasets covering three common text classification tasks: sentiment analysis, question classification and topic classification. These datasets are Stanford Sentiment Treebank [Socher et al., 2013], Movie Review Data [Pang and Lee, 2005], Subjectivity dataset [Pang and Lee, 2004], TREC [Li and Roth, 2002] and AG's corpus of news articles [Zhang et al., 2015b]. Summary statistics of the datasets are listed in Table 2.

Hyperparameters
In our experiments, we use 300-dimensional word2vec [Mikolov et al., 2013] vectors to initialize word representations. In the first bi-directional RNN layer of HCapsNet, we use a Long Short-Term Memory network whose hidden state has dimension 256. The second layer contains 32 channels of primary capsules, and the number of capsules in one channel depends on the sentence length. Each primary capsule contains 8 atoms, which means that the dimension of the primary capsules is 8. The top-level capsules are obtained after 3 routing iterations. The dimension of the output vector of the top-level capsules is 16. For all the datasets, we use mini-batches of size 25. We use Adam [Kingma and Ba, 2014] as our optimization method with a learning rate of 1e-3. $\lambda_2$ is 0.01.

Results and Discussions
Table 1 reports the results of our model on the different datasets in comparison with widely used text classification methods and state-of-the-art approaches. We make the following observations.
Our HCapsNet achieves the best results on 5 out of 6 datasets, which verifies the effectiveness of our model. In particular, HCapsNet outperforms the vanilla capsule network Capsule-B, which only utilizes the dynamic routing mechanism without hyperplane projection, by a remarkable margin.
HCapsNet does not perform best on the TREC dataset. One main reason may be that TREC is used for question type classification, where all samples are question sentences. The task is mainly determined by interrogative words. For example, a sentence containing 'where' will probably be classified as 'location'. The ability to tackle polysemy does not play an important role here, so our model obtains a result similar to that of Capsule-B.

Ablation Study
To analyze the effect of different components including hyperplane projection, penalization term, and routing iterations, we report the results of variants of HCapsNet in Table 4.
The results show that the capsule network performs best when conducting 3 routing iterations, which is in line with the conclusion in [Sabour et al., 2017]. Compared with the vanilla capsule network (row 3), applying routing-on-hyperplane brings a noticeable improvement (row 2). This demonstrates the necessity of integrating hyperplane projection into the routing procedure to tackle the polysemy problem. Moreover, the penalization term described in Section 3.3 also marginally improves the accuracy, which suggests that the orthogonality constraint on the hyperplanes is beneficial for text classification.

Case Study
Table 3 shows some sample cases from the SST validation set, which consists of movie reviews for sentiment analysis. We analyze the attended primary capsule representation for the polysemic words in brackets. Specifically, we report the output vector of the projected primary capsule that is most attended by the routing mechanism. The word 'wonder' in the first sample sentence means something that fills you with surprise and admiration, which shows a very positive sentiment. However, the polysemic word 'wonder' in the second and third sentences means to think about something and try to decide what is true, which is neutral in sentiment. We can observe that, for the same word, the attended projected capsule representations are quite different for different word senses. The projected representations for the same sense are similar: the Euclidean distance is 0.23 (rows 2 and 3). In contrast, for the different senses, the Euclidean distance is 1.12 (rows 1 and 2). This property helps our model to make all the predictions correctly, while Capsule-B cannot handle the latter two sentences. Similarly, the word 'cold' conveys two different senses in the last two samples (rows 4 and 5), meaning cruel and low temperature, respectively. The corresponding projected vectors are also quite different, which verifies the ability of routing-on-hyperplane to tackle polysemy.

Table 3: Projected primary capsules' representations for polysemic words. P and N denote positive and negative classification results, respectively. ✓ denotes a correct classification and × denotes an incorrect classification.
and dreary weather is a perfect metaphor for the movie itself, which contains few laughs and not much drama | [-0.3810, -0.3923, -0.3016, -0.3045, 0.2417, -0.3109, 0.5999, -0.0391] | N × P

Visualizing Routing Results
After several iterations of the routing algorithm, each primary capsule and the top-level capsule are connected via a calculated coupling coefficient. The coupling coefficient reflects how much a low-level capsule contributes to a specific high-level capsule. Routing-on-hyperplane can also be viewed as a parallel attention mechanism that allows each capsule at one level to attend to some active capsules at the level below and to ignore others. We can thus draw a heat map to show which phrases are heavily taken into account and which ones are skipped by routing-on-hyperplane in the task of text classification. We randomly select 4 examples of reviews from the test set of SST for which the model has a high confidence (>0.8) in predicting the label. As shown in Figure 2, the words whose coupling coefficient is greater than 0.7 are marked. It is easy to see that our routing method can effectively extract the sentiment words that strongly indicate the sentiment behind the sentence and assign a greater coupling coefficient between the corresponding capsules, for example 'gem', 'spunky', 'disappointed' and 'bad taste'.
The four example sentences in Figure 2 are:
• Almost every scene in this film is a gem that could stand alone, a perfectly realized observation of mood, behavior and intent.
• A spunky, original take on a theme that will resonate with singles of many ages.
• The story drifts so inexorable into cliches about tortured (and torturing) artists and consuming but impossible love that you can't help but become more disappointed as each overwrought new sequence plods on.
• The premise is in extremely bad taste, and the film's supposed insights are so poorly thought out and substance free that even a high school senior taking his or her first psychology class could dismiss them.
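The marking rule used for the heat map can be sketched as a simple threshold on the coupling coefficients. The sketch assumes one coupling coefficient per token towards the predicted category, which simplifies the model (it has several primary capsules per position); the coefficient values below are hypothetical and are not taken from the paper.

```python
def attended_words(tokens, coefficients, threshold=0.7):
    """Return the tokens whose coupling coefficient exceeds the threshold."""
    if len(tokens) != len(coefficients):
        raise ValueError("tokens and coefficients must have the same length")
    return [tok for tok, c in zip(tokens, coefficients) if c > threshold]


# Hypothetical coefficients for the start of one example sentence.
tokens = ["a", "spunky", "original", "take"]
coefficients = [0.41, 0.93, 0.66, 0.52]

print(attended_words(tokens, coefficients))  # prints ['spunky']
```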

Visualizing Effects of The Hyperplane
In order to assess the effect of the hyperplane, we randomly select 3 examples from the SST dataset and draw the distribution maps of the primary capsules' output vectors before and after the projection operation. As the dimension of the primary capsule's output vector is 8, t-distributed Stochastic Neighbor Embedding (t-SNE) is performed on the vectors to reduce the dimension for visualization. As illustrated in Figure 3, the three pictures in the first row show the distribution before the projection operation for the three example sentences, and the three pictures in the second row show the distribution after the projection. The blue points in the distribution maps denote normal words, and the red crosses denote the words attended by the routing algorithm, as defined in Section 4.6.
The relationship between the semantic capsules can be estimated by analyzing the distribution of the low-dimensional data in Figure 3. We find that the originally scattered points which denote attended words converge after the projection. The projected vectors of the attended words are close to each other, showing that they contain similar senses which are beneficial for the subsequent task. In contrast, the capsules before projection contain multiple senses and show a scattered pattern. This demonstrates that the hyperplanes can effectively extract the guided senses, which are then attended by the routing-on-hyperplane mechanism.

Conclusion and Future Work
In this paper, we explore the capsule network for text classification and propose to decompose the capsule by projecting it onto hyperplanes to tackle the polysemy problem in natural language. Routing-on-hyperplane, a dynamic routing method, is implemented to select the sense-specific projected capsules for the subsequent classification task. We assess the effect of the hyperplane through a case study and by analyzing the distribution of the capsules' output vectors. The experiments demonstrate the superiority of HCapsNet, and our proposed routing-on-hyperplane method outperforms the existing routing method in the text classification task.
In the future, we would like to investigate the application of our approach to various tasks, including reading comprehension and machine translation. We believe that capsule networks have broad applicability to natural language processing tasks. Our core idea, decomposing the semantic capsules by projecting them onto hyperplanes, is a necessary complement to the capsule network for tackling the polysemy problem in various natural language processing tasks.