Investigating Capsule Networks with Dynamic Routing for Text Classification

In this study, we explore capsule networks with dynamic routing for text classification. We propose three strategies to stabilize the dynamic routing process and alleviate the disturbance of noise capsules, which may contain “background” information or may not have been successfully trained. A series of experiments is conducted with capsule networks on six text classification benchmarks. Capsule networks achieve state-of-the-art results on 4 out of 6 datasets, which shows their effectiveness for text classification. We additionally show that capsule networks exhibit significant improvement over strong baseline methods when transferring from single-label to multi-label text classification. To the best of our knowledge, this is the first work in which capsule networks have been empirically investigated for text modeling.


Introduction
Modeling articles or sentences computationally is a fundamental topic in natural language processing. It could be as simple as a keyword/phrase matching problem, but it could also be a nontrivial problem if compositions, hierarchies, and structures of texts are considered. For example, a news article which mentions a single phrase "US election" may be categorized into the political news with high probability. But it could be very difficult for a computer to predict which presidential candidate is favored by its author, or whether the author's view in the article is more liberal or more conservative.
Earlier efforts in modeling texts have achieved limited success on text categorization using a simple bag-of-words classifier (Joachims, 1998; McCallum et al., 1998), implying that understanding the meaning of individual words or n-grams is a necessary step towards more sophisticated models. It is therefore not a surprise that distributed representations of words, a.k.a. word embeddings, have received great attention from the NLP community for addressing the question of "what" is to be modeled at the basic level (Mikolov et al., 2013; Pennington et al., 2014). In order to model higher-level concepts and facts in texts, an NLP researcher has to think carefully about the so-called "what" question: what is actually modeled beyond word meanings. A common approach to the question is to treat texts as sequences and focus on their spatial patterns; representatives include convolutional neural networks (CNNs) (Kim, 2014; Zhang et al., 2015; Conneau et al., 2017) and long short-term memory networks (LSTMs) (Tai et al., 2015; Mousa and Schuller, 2017). Another common approach is to completely ignore the order of words and focus on their composition as a collection; representatives include probabilistic topic modeling (Blei et al., 2003; Mcauliffe and Blei, 2008) and Earth Mover's Distance based modeling (Kusner et al., 2015; Ye et al., 2017).
Those two approaches, albeit quite different from a computational perspective, can be assessed by a common measure: their answers to the "what" question. In neural network approaches, spatial patterns aggregated at lower levels contribute to representing higher-level concepts. Here, they form a recursive process to articulate what is to be modeled. For example, a CNN builds convolutional feature detectors to extract local patterns from a window of vector sequences and uses max-pooling to select the most prominent ones. It then hierarchically builds such pattern extraction pipelines at multiple levels. Being a spatially sensitive model, a CNN pays a price for the inefficiency of replicating feature detectors on a grid. As argued in (Sabour et al., 2017), one has to choose between replicating detectors whose size grows exponentially with the number of dimensions, or increasing the volume of the labeled training set in a similar exponential way. On the other hand, methods that are spatially insensitive are perfectly efficient at inference time regardless of any order of words or local patterns. However, they are unavoidably more restricted in encoding the rich structures present in a sequence. Improving the efficiency of encoding spatial patterns while keeping the flexibility of their representation capability is thus a central issue.
A recent method called the capsule network, introduced by Sabour et al. (2017), has the potential to address the aforementioned issue. They introduce an iterative routing process to decide the credit attribution between nodes from lower and higher layers. A metaphor (and also an argument) they make is that the human visual system intelligently assigns parts to wholes at inference time without hard-coding patterns to be perspective-relevant. As an outcome, their model can encode the intrinsic spatial relationship between a part and a whole, constituting viewpoint-invariant knowledge that automatically generalizes to novel viewpoints. In our work, we follow a similar spirit and use this technique to model texts. Three strategies are proposed to stabilize the dynamic routing process and alleviate the disturbance of noise capsules, which may contain "background" information such as stop words and words that are unrelated to specific categories. We conduct a series of experiments with capsule networks on top of pre-trained word vectors for six text classification benchmarks. More importantly, we show that capsule networks achieve significant improvement over the compared baseline methods when transferring from single-label to multi-label text classification.

Our Methodology
Our capsule network, depicted in Figure 1, is a variant of the capsule networks proposed in Sabour et al. (2017). It consists of four layers: an N-gram convolutional layer, a primary capsule layer, a convolutional capsule layer, and a fully connected capsule layer. In addition, we explore two capsule frameworks that integrate these four components in different ways. In the rest of this section, we elaborate on the key components in detail.

N-gram Convolutional Layer
This layer is a standard convolutional layer which extracts N-gram features at different positions of a sentence through various convolutional filters. Suppose $x \in \mathbb{R}^{L \times V}$ denotes the input sentence representation, where $L$ is the length of the sentence and $V$ is the embedding size of words. Let $x_i \in \mathbb{R}^{V}$ be the $V$-dimensional word vector corresponding to the $i$-th word in the sentence. Let $W^a \in \mathbb{R}^{K_1 \times V}$ be the filter for the convolution operation, where $K_1$ is the N-gram size used while sliding over a sentence to detect features at different positions. A filter $W^a$ convolves with the word window $x_{i:i+K_1-1}$ at each possible position (with a stride of 1) to produce a column feature map $m^a \in \mathbb{R}^{L-K_1+1}$. Each element $m^a_i \in \mathbb{R}$ of the feature map is produced by

$$m^a_i = f\left(x_{i:i+K_1-1} \circ W^a + b_0\right) \tag{1}$$

where $\circ$ is element-wise multiplication, $b_0$ is a bias term, and $f$ is a nonlinear activation function (i.e., ReLU). We have described the process by which one feature is extracted from one filter. Hence, for $a = 1, \ldots, B$, that is, $B$ filters in total with the same N-gram size, one can generate $B$ feature maps, which can be rearranged as

$$M = [m_1, m_2, \ldots, m_B] \in \mathbb{R}^{(L-K_1+1) \times B} \tag{2}$$
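The N-gram convolution described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes that the element-wise product of a window with a filter is summed to a scalar before the bias and ReLU are applied, and the function names and toy numbers are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def ngram_feature_map(x, W, b0=0.0):
    """Slide one K1 x V filter W over a sentence x (L x V) with stride 1."""
    L, K1 = len(x), len(W)
    feature_map = []
    for i in range(L - K1 + 1):
        # Element-wise product of the word window with the filter,
        # summed to a scalar (assumption), then bias and ReLU.
        s = sum(x[i + k][v] * W[k][v]
                for k in range(K1) for v in range(len(W[k])))
        feature_map.append(relu(s + b0))
    return feature_map

def ngram_conv_layer(x, filters, b0=0.0):
    """Stack the B feature maps column-wise into M of shape (L - K1 + 1) x B."""
    maps = [ngram_feature_map(x, W, b0) for W in filters]
    return [list(row) for row in zip(*maps)]

# Toy example (made-up numbers): L = 4 words, V = 2, K1 = 3, B = 2 filters.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[-1.0, 0.0], [0.0, -1.0], [0.0, 0.0]],
]
M = ngram_conv_layer(x, filters)
print(M)  # one row per window position, one column per filter
```

Each row of `M` is the B-dimensional N-gram vector that the next layer consumes.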

Primary Capsule Layer
This is the first capsule layer, in which vector-output capsules replace the scalar-output feature detectors of CNNs to preserve the instantiated parameters such as the local order of words and semantic representations of words. Suppose $p_i \in \mathbb{R}^{d}$ denotes the instantiated parameters of a capsule, where $d$ is the dimension of the capsule. Let $W^b \in \mathbb{R}^{B \times d}$ be the filter shared across different sliding windows. For each matrix multiplication, we have a window sliding over each N-gram vector, denoted as $M_i \in \mathbb{R}^{B}$; the corresponding N-gram phrases in the form of capsules are then produced as a column-list of capsules $p \in \mathbb{R}^{(L-K_1+1) \times d}$. Each capsule $p_i \in \mathbb{R}^{d}$ in the column-list is computed as

$$p_i = g\left((W^b)^{\top} M_i + b_1\right) \tag{3}$$

where $g$ is the nonlinear squash function applied to the entire vector and $b_1$ is the capsule bias term. For all $C$ filters, the generated capsule feature maps can be rearranged as

$$P = [p_1, p_2, \ldots, p_C] \in \mathbb{R}^{(L-K_1+1) \times C \times d} \tag{4}$$

where in total $(L-K_1+1) \times C$ $d$-dimensional vectors are collected as capsules in $P$.
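As a rough illustration of the primary capsule computation, the sketch below maps each B-dimensional N-gram vector through a B x d filter and squashes the result. It assumes the squash function of Sabour et al. (2017), which keeps the direction of a vector and maps its norm into [0, 1); the small `eps` constant, the function names, and the toy values are assumptions made for this example. Only one of the C filters is shown.

```python
import math

def squash(s, eps=1e-9):
    """Keep the direction of s and map its norm to ||s||^2 / (1 + ||s||^2)."""
    sq_norm = sum(v * v for v in s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * v for v in s]

def primary_capsules(M, Wb, b1):
    """Map every B-dimensional row M_i of M to a d-dimensional capsule p_i."""
    d = len(Wb[0])
    capsules = []
    for M_i in M:
        pre = [sum(Wb[k][j] * M_i[k] for k in range(len(M_i))) + b1[j]
               for j in range(d)]
        capsules.append(squash(pre))
    return capsules

# Toy example (made-up numbers): two N-gram vectors, B = 2, d = 3.
M = [[3.0, 0.0], [0.0, 4.0]]
Wb = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]
b1 = [0.0, 0.0, 0.0]
p = primary_capsules(M, Wb, b1)
for capsule in p:
    print(capsule, math.sqrt(sum(v * v for v in capsule)))
```

A pre-activation of norm 3 is squashed to a capsule of norm 9/10, so longer pre-activations yield capsules whose length is closer to 1.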

Child-Parent Relationships
As argued in (Sabour et al., 2017), the capsule network tries to address the representational limitation and exponential inefficiencies of convolutions by means of transformation matrices. It allows the network to automatically learn child-parent (or part-whole) relationships, constituting viewpoint-invariant knowledge that automatically generalizes to novel viewpoints. In this paper, we explore two different types of transformation matrices to generate the prediction vector (vote) $\hat{u}_{j|i} \in \mathbb{R}^{d}$ from a child capsule $i$ to the parent capsule $j$. The first one shares weights $W^{t_1} \in \mathbb{R}^{N \times d \times d}$ across child capsules in the layer below, where $N$ is the number of parent capsules in the layer above. Formally, each corresponding vote can be computed by

$$\hat{u}_{j|i} = W^{t_1}_j u_i + \hat{b}_{j|i} \in \mathbb{R}^{d} \tag{5}$$

where $u_i$ is a child capsule in the layer below and $\hat{b}_{j|i}$ is the capsule bias term.
In the second design, we replace the shared weight matrix $W^{t_1}_j$ with non-shared weight matrix $W^{t_2}_{i,j}$, where the weight matrices $W^{t_2} \in \mathbb{R}^{H \times N \times d \times d}$ and $H$ is the number of child capsules in the layer below.
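The difference between the two designs can be made concrete with a small sketch: with shared weights every child capsule uses the same N matrices, while with non-shared weights each child has its own set. The code omits the bias term for brevity, and the function names and toy matrices are hypothetical.

```python
def matvec(W, u):
    """Multiply a d x d matrix W by a d-dimensional vector u."""
    return [sum(W[r][c] * u[c] for c in range(len(u))) for r in range(len(W))]

def votes_shared(children, Wt1):
    """u_hat[i][j] = Wt1[j] @ u_i, with Wt1 of shape N x d x d."""
    return [[matvec(W_j, u) for W_j in Wt1] for u in children]

def votes_non_shared(children, Wt2):
    """u_hat[i][j] = Wt2[i][j] @ u_i, with Wt2 of shape H x N x d x d."""
    return [[matvec(W_ij, u) for W_ij in Wt2[i]]
            for i, u in enumerate(children)]

# Toy example (made-up numbers): H = 2 children, N = 2 parents, d = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
children = [[1.0, 0.0], [0.0, 2.0]]

shared = votes_shared(children, [identity, double])
non_shared = votes_non_shared(children, [[identity, double],
                                         [double, identity]])
print(shared[1][1])      # child 1 uses the same matrix as child 0 for parent 1
print(non_shared[1][1])  # child 1 has its own matrix for parent 1
```

In terms of parameters, the shared design stores N·d·d weights, whereas the non-shared design stores H·N·d·d weights, so it grows with the number of child capsules.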

Dynamic Routing
The basic idea of dynamic routing is to construct a non-linear map in an iterative manner, ensuring that the output of each capsule gets sent to an appropriate parent in the subsequent layer. For each potential parent, the capsule network can increase or decrease the connection strength by dynamic routing, which is more effective than primitive routing strategies such as max-pooling in CNNs; max-pooling essentially detects whether a feature is present at any position of the text but loses spatial information about the feature. We explore three strategies to boost the accuracy of the routing process by alleviating the disturbance of noisy capsules.

Orphan Category
Inspired by Sabour et al. (2017), an additional "orphan" category is added to the network, which can capture the "background" information of the text, such as stop words and words that are unrelated to specific categories, helping the capsule network model the child-parent relationship more efficiently. Adding an "orphan" category is more effective for text than for images, since there is no single consistent "background" object in images, whereas stop words such as the predicates "is" and "am" and the pronouns "his" and "she" are consistent across texts.

Leaky-Softmax
We explore Leaky-Softmax (Sabour et al., 2017) in place of the standard softmax when updating the connection strength between the child capsules and their parents. Besides the orphan category in the last capsule layer, we also need a lightweight method between two consecutive layers that routes the noise child capsules to an extra dimension without any additional parameters or computational cost.
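One way to realize such a leak, sketched below, is to append a single zero logit to the parent logits, apply the softmax, and then drop the share assigned to the extra dimension. The exact formula is not spelled out in this section, so this formulation is an assumption and the function name is hypothetical.

```python
import math

def leaky_softmax(logits):
    """Softmax over the parent logits plus one extra zero 'leak' logit.

    The probability mass of the leak is discarded, so the returned
    coefficients sum to less than 1 (assumed formulation).
    """
    m = max(max(logits), 0.0)           # subtract the max for stability
    exps = [math.exp(b - m) for b in logits]
    leak = math.exp(0.0 - m)
    total = leak + sum(exps)
    return [e / total for e in exps]

uniform = leaky_softmax([0.0, 0.0, 0.0])
noisy = leaky_softmax([-10.0, -10.0])
print(uniform, sum(uniform))
print(noisy, sum(noisy))
```

With all logits at zero, the leak takes an equal share alongside the parents. When a child agrees with none of its parents (strongly negative logits), almost all of its mass goes to the leak, which needs no extra parameters.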

Coefficients Amendment
We also attempt to use the probability of existence of child capsules in the layer below to iteratively amend the connection strength, as in Eq. (6).
Given each prediction vector $\hat{u}_{j|i}$ and its probability of existence $\hat{a}_{j|i}$, where $\hat{a}_{j|i} = \hat{a}_i$, each iterative coupling coefficient of connection strength $c_{j|i}$ is updated by

$$c_{j|i} = \hat{a}_{j|i} \cdot \text{leaky-softmax}(b_{j|i}) \tag{6}$$

where $b_{j|i}$ is the logit of the coupling coefficient. Each parent capsule $v_j$ in the layer above is a weighted sum over all prediction vectors $\hat{u}_{j|i}$:

$$v_j = g\Big(\sum_i c_{j|i}\, \hat{u}_{j|i}\Big), \qquad a_j = \lVert v_j \rVert \tag{7}$$

where $a_j$ is the probability of parent capsule $j$ and $g$ is the nonlinear squash function (Sabour et al., 2017) applied to the entire vector. Once all of the parent capsules are produced, each logit $b_{j|i}$ is updated by

$$b_{j|i} = b_{j|i} + \hat{u}_{j|i} \cdot v_j \tag{8}$$

For simplicity of notation, the parent capsules and their probabilities in the layer above are denoted as

$$v, a = \text{Routing}(\hat{u}) \tag{9}$$

where $\hat{u}$ denotes all of the child capsules in the layer below, and $v$ and $a$ denote all of the parent capsules and their probabilities, respectively.
Our dynamic routing algorithm is summarized in Algorithm 1.

Algorithm 1: Dynamic Routing Algorithm
procedure ROUTING($\hat{u}_{j|i}$, $\hat{a}_{j|i}$, $r$, $l$)
  Initialize the logits of coupling coefficients $b_{j|i} = 0$
  for $r$ iterations do
    for all capsule $i$ in layer $l$ and capsule $j$ in layer $l+1$: $c_{j|i} = \hat{a}_{j|i} \cdot \text{leaky-softmax}(b_{j|i})$
    for all capsule $j$ in layer $l+1$: $v_j = g(\sum_i c_{j|i} \hat{u}_{j|i})$, $a_j = \lVert v_j \rVert$
    for all capsule $i$ in layer $l$ and capsule $j$ in layer $l+1$: $b_{j|i} = b_{j|i} + \hat{u}_{j|i} \cdot v_j$
  return $v_j$, $a_j$
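The routing procedure of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative reading of the algorithm under stated assumptions, not the code used in the experiments: the leaky softmax is taken to be a softmax with one extra zero logit, the squash function is that of Sabour et al. (2017), the agreement is measured by the dot product between a vote and the parent capsule, and all names and toy values are hypothetical.

```python
import math

def squash(s, eps=1e-9):
    sq_norm = sum(v * v for v in s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * v for v in s]

def leaky_softmax(logits):
    m = max(max(logits), 0.0)
    exps = [math.exp(b - m) for b in logits]
    total = math.exp(0.0 - m) + sum(exps)   # extra zero logit acts as the leak
    return [e / total for e in exps]

def routing(u_hat, a_hat, r=3):
    """u_hat[i][j]: vote of child i for parent j (a d-dimensional list).
    a_hat[i]: probability of existence of child i. Requires r >= 1."""
    H, N, d = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * N for _ in range(H)]       # logits of coupling coefficients
    for _ in range(r):
        # Coupling coefficients, amended by the child's existence probability.
        c = [[a_hat[i] * p for p in leaky_softmax(b[i])] for i in range(H)]
        # Parent capsules: squashed weighted sum of the votes.
        v = [squash([sum(c[i][j] * u_hat[i][j][k] for i in range(H))
                     for k in range(d)])
             for j in range(N)]
        # Agreement between each vote and its parent updates the logits.
        for i in range(H):
            for j in range(N):
                b[i][j] += sum(u_hat[i][j][k] * v[j][k] for k in range(d))
    a = [math.sqrt(sum(x * x for x in v_j)) for v_j in v]
    return v, a

# Toy example (made-up votes): both children vote for parent 0 only.
u_hat = [[[1.0, 0.0], [0.0, 0.0]],
         [[1.0, 0.0], [0.0, 0.0]]]
a_hat = [1.0, 1.0]
v, a = routing(u_hat, a_hat, r=3)
print(a)
```

In this toy case the children agree on parent 0, so its coupling coefficients and its probability grow with each iteration, while parent 1 receives no support.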

Convolutional Capsule Layer
In this layer, each capsule is connected only to a local region $K_2 \times C$ spatially in the layer below. The capsules in the region are multiplied by transformation matrices to learn child-parent relationships, followed by routing-by-agreement to produce parent capsules in the layer above.
Suppose $W^{c_1} \in \mathbb{R}^{D \times d \times d}$ and $W^{c_2} \in \mathbb{R}^{K_2 \times C \times D \times d \times d}$ denote shared and non-shared weights, respectively, where $K_2 \cdot C$ is the number of child capsules in a local region in the layer below and $D$ is the number of parent capsules to which the child capsules are sent. When the transformation matrices are shared across the child capsules, each potential parent capsule $\hat{u}_{j|i}$ is produced by

$$\hat{u}_{j|i} = W^{c_1}_j u_i + \hat{b}_{j|i} \tag{10}$$

where $\hat{b}_{j|i}$ is the capsule bias term, $u_i$ is a child capsule in a local region $K_2 \times C$, and $W^{c_1}_j$ is the $j$-th matrix in tensor $W^{c_1}$. Then, we use routing-by-agreement to produce the parent capsule feature maps, giving in total $(L-K_1-K_2+2) \times D$ $d$-dimensional capsules in this layer. When using non-shared weights across the child capsules, we replace the transformation matrix $W^{c_1}_j$ in Eq. (10) with $W^{c_2}_j$.
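The windowing step of this layer can be illustrated in isolation. The sketch below only gathers the K2 x C child capsules that each output position sees; the votes and the routing that follow are left out. It assumes a stride of 1 and no padding, and the function name and toy sizes are hypothetical.

```python
def local_regions(P, K2):
    """P[t][c] is the d-dimensional capsule of filter c at position t.

    Returns one region per output position; each region is the flat list
    of the K2 * C child capsules seen by the parents at that position.
    """
    T = len(P)  # T = L - K1 + 1 positions in the primary capsule layer
    return [
        [P[t + k][c] for k in range(K2) for c in range(len(P[t + k]))]
        for t in range(T - K2 + 1)
    ]

# Toy sizes (hypothetical): L = 10, K1 = 3, C = 4 filters, d = 2, K2 = 3.
L, K1, C, d, K2 = 10, 3, 4, 2, 3
P = [[[0.0] * d for _ in range(C)] for _ in range(L - K1 + 1)]
regions = local_regions(P, K2)
print(len(regions), len(regions[0]))  # output positions, children per region
```

The number of regions equals L − K1 − K2 + 2, and routing each region to D parent capsules yields the (L − K1 − K2 + 2) × D capsules stated above.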

Fully Connected Capsule Layer
The capsules in the layer below are flattened into a list of capsules and fed into fully connected capsule layer in which capsules are multiplied by transformation matrix $W^{d_1} \in \mathbb{R}^{E \times d \times d}$ or $W^{d_2} \in \mathbb{R}^{H \times E \times d \times d}$ followed by routing-by-agreement to produce final capsule $v_j \in \mathbb{R}^{d}$ and its probability $a_j \in \mathbb{R}$ for each category. Here, $H$ is the number of child capsules in the layer below, $E$ is the number of categories plus an extra orphan category.

The Architectures of Capsule Network
We explore two capsule architectures (denoted as Capsule-A and Capsule-B) to integrate these four components in different ways. Capsule-A starts with an embedding layer which transforms each word in the corpus to a 300-dimensional ($V = 300$) word vector, followed by a 3-gram ($K_1 = 3$) convolutional layer with 32 filters ($B = 32$) and a stride of 1 with ReLU non-linearity. All the other layers are capsule layers, starting with a $B \times d$ primary capsule layer with 32 filters ($C = 32$), followed by a $3 \times C \times d \times d$ ($K_2 = 3$) convolutional capsule layer with 16 filters ($D = 16$) and a fully connected capsule layer in sequence.
Each capsule has 16-dimensional ($d = 16$) instantiated parameters, and its length (norm) describes the probability of the existence of the capsule. The capsule layers are connected by the transformation matrices, and each connection is also multiplied by a routing coefficient that is dynamically computed by the routing-by-agreement mechanism.
The basic structure of Capsule-B is similar to that of Capsule-A, except that we adopt three parallel networks with filter windows ($N$) of 3, 4, and 5 in the N-gram convolutional layer (see Figure 2). The final outputs of the fully connected capsule layers are fed into an average pooling layer to produce the final results. In this way, Capsule-B can learn more meaningful and comprehensive text representations.
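To make the layer sizes of Capsule-A concrete, the following sketch counts the capsules produced at each stage for a sentence of a given length, using the hyperparameters stated above and the size formulas from the layer descriptions. It assumes a stride of 1 and no padding; the function name, the dictionary keys, and the example sentence length are hypothetical.

```python
def capsule_a_shapes(L, num_classes, K1=3, C=32, K2=3, D=16, d=16):
    """Capsule counts per layer for a sentence of L words (no padding assumed)."""
    conv_positions = L - K1 + 1            # N-gram convolutional layer
    primary = conv_positions * C           # primary capsules, each d-dimensional
    conv_caps = (L - K1 - K2 + 2) * D      # convolutional capsules
    E = num_classes + 1                    # categories plus the orphan category
    return {
        "conv_positions": conv_positions,
        "primary_capsules": primary,
        "conv_capsules": conv_caps,
        "fc_children": conv_caps,          # H: flattened input of the last layer
        "fc_parents": E,
        "capsule_dim": d,
    }

# Example: a 20-word sentence and a binary task (for instance sentiment).
shapes = capsule_a_shapes(L=20, num_classes=2)
for name, value in shapes.items():
    print(f"{name:>17}: {value}")
```

For Capsule-B the same arithmetic would apply to each of the three parallel branches, with K1 set to 3, 4, and 5 respectively.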

Experimental Datasets
In order to evaluate the effectiveness of our model, we conduct a series of experiments on six benchmarks: movie reviews (MR) (Pang and Lee, 2005), Stanford Sentiment Treebank, an extension of MR (SST-2) (Socher et al., 2013), the Subjectivity dataset (Subj) (Pang and Lee, 2004), the TREC question dataset (TREC) (Li and Roth, 2002), customer reviews (CR) (Hu and Liu, 2004), and AG's news corpus (Conneau et al., 2017). These benchmarks cover several text classification tasks such as sentiment classification, question categorization, and news categorization. The detailed statistics are presented in Table 1.

Implementation Details
In the experiments, we use 300-dimensional word2vec (Mikolov et al., 2013) vectors to initialize the embedding vectors. We use mini-batches of size 50 for AG's news and of size 25 for the other datasets. We train the model with the Adam optimization algorithm and a learning rate of 1e-3. We use 3 iterations of routing for all datasets, since this optimizes the loss faster and converges to a lower loss at the end.
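The settings above can be collected in a small configuration object, sketched here for convenience. Only the numeric values come from the text; the class name, field names, and dataset key are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    embedding_dim: int = 300        # word2vec vectors used for initialization
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    routing_iterations: int = 3
    batch_size: int = 25            # 50 for AG's news, 25 for the other datasets

def config_for(dataset: str) -> TrainConfig:
    """Return the settings for a dataset; 'ag_news' is a hypothetical key."""
    return TrainConfig(batch_size=50 if dataset == "ag_news" else 25)

print(config_for("ag_news"))
print(config_for("mr"))
```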

Quantitative Evaluation
In our experiments, the evaluation metric is classification accuracy. We summarize the experimental results in Table 2. From the results, we observe that the capsule networks achieve the best results on 4 out of 6 benchmarks, which verifies the effectiveness of the capsule networks. In particular, our model substantially and consistently outperforms

Ablation Study
To analyze the effect of varying different components of our capsule architecture for text classification, we also report an ablation test of the Capsule-B model under different setups of the capsule network. The experimental results are summarized in Table 5. Generally, all three proposed dynamic routing strategies contribute to the effectiveness of Capsule-B by alleviating the disturbance of noise capsules, which may contain "background" information such as stop words and words that are unrelated to specific categories.

Single-Label to Multi-Label Text Classification
The capsule network demonstrates promising performance in single-label text classification, which assigns a label from a predefined set to a text (see Table 2). Multi-label text classification is, however, a more challenging practical problem. From single-label to multi-label (with $n$ category labels) text classification, the label space is expanded from $n$ to $2^n$, thus more training is required to cover the whole label space. For single-label texts, it is practically easy to collect and annotate the samples. However, the burden of collection and annotation for a large-scale multi-label text dataset is generally extremely high. How deep neural networks (e.g., CNN and LSTM) can best cope with multi-label text classification remains an open problem, since obtaining a large-scale multi-label dataset is a time-consuming and expensive process. In this section, we investigate the capability of the capsule network on multi-label text classification using only single-label samples as training data. With feature properties as part of the information extracted by capsules, we may generalize the model better to multi-label text classification without an overly extensive amount of labeled data.
The evaluation is carried out on the Reuters-21578 dataset (Lewis, 1992). This dataset consists of 10,788 documents from the Reuters financial newswire service, where each document contains either multiple labels or a single label. We reprocess the corpus to evaluate the capability of capsule networks to transfer from single-label to multi-label text classification. For development and training, we only use the single-label documents in the Reuters dev and training sets. For testing, Reuters-Multi-label uses only the multi-label documents in the test set, while Reuters-Full includes all documents in the test set. The characteristics of these two datasets are described in Table 3.
Following (Sorower, 2010), we adopt Micro Averaged Precision (Precision), Micro Averaged Recall (Recall), and Micro Averaged F1 scores (F1) as the evaluation metrics for multi-label text classification. Each of these scores is first computed on individual class labels and then averaged over all classes; they are called label-based measures. In addition, we also measure the Exact Match Ratio (ER), which considers a partially correct prediction as incorrect and only counts fully correct samples.
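The four metrics can be illustrated with a short sketch. It assumes the usual micro-averaged definitions, in which true positives, false positives, and false negatives are pooled over all labels and documents before the ratios are computed; the function name and the toy gold and predicted label sets are made up for illustration.

```python
def multilabel_scores(gold, pred):
    """gold, pred: lists of label sets, one set per document."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))   # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))   # predicted but not gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))   # gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Exact Match Ratio: a document counts only if its label set is fully correct.
    er = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return precision, recall, f1, er

# Toy example with hypothetical label names.
gold = [{"interest", "money-fx"}, {"trade"}, {"earn", "acq"}]
pred = [{"interest", "money-fx"}, {"trade", "acq"}, {"earn"}]
precision, recall, f1, er = multilabel_scores(gold, pred)
print(precision, recall, f1, er)
```

In the toy example only the first document is fully correct, so ER is much lower than the micro-averaged scores, which give partial credit for the second and third documents.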
The experimental results are summarized in Table 4. From the results, we observe that the capsule networks achieve substantial and significant improvements in terms of all four evaluation metrics over the compared baseline methods on the test sets of both the Reuters-Multi-label and Reuters-Full datasets. In particular, a larger improvement is achieved on the Reuters-Multi-label dataset, which contains only the multi-label documents in the test set. This is within our expectation, since the capsule network is capable of preserving the instantiated parameters of the categories trained on single-label documents. The capsule network has a much stronger transfer capability than conventional deep neural networks. In addition, the good results on Reuters-Full indicate that the capsule network has robust superiority over competitors on single-label documents.

Connection Strength Visualization
To visualize the connection strength between capsule layers clearly, we remove the convolutional capsule layer and connect the primary capsule layer directly to the fully connected capsule layer, where the primary capsules denote N-gram phrases in the form of capsules. The connection strength shows the importance of each primary capsule for the text categories, acting like a parallel attention mechanism. This should allow the capsule networks to recognize multiple categories in the text even though the model is trained on single-label documents. Due to space limitations, we choose one multi-label document from the Reuters-Multi-label test set whose category labels (i.e., Interest Rates and Money/Foreign Exchange) are correctly predicted (fully correct) by our model with high confidence (p > 0.8) to report in Table 6. The category-specific phrases such as "interest rates" and "foreign exchange" are highlighted in red. We use a tag cloud to visualize the 3-gram phrases for the Interest Rates and Money/Foreign Exchange categories. The stronger the connection strength, the bigger the font size. From the results, we observe that capsule networks can correctly recognize and cluster the important phrases with respect to the text categories. Histograms are used to show the intensity of the connection strengths between primary capsules and the fully connected capsules, as shown in Table 6 (bottom line). Due to space limitations, only five histograms are shown. The routing procedure correctly routes the votes into the Interest Rates and Money/Foreign Exchange categories.
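The ranking behind such a tag cloud can be sketched as follows: each 3-gram position is paired with its connection strength to one category, and the strongest phrases are listed first. The sketch simplifies the setting to one primary capsule per position, and the connection strengths below are made-up numbers for illustration, not model output; the function name is hypothetical.

```python
def top_ngrams(tokens, strengths, n=3, k=2):
    """Rank n-gram phrases by their connection strength to one category."""
    phrases = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if len(phrases) != len(strengths):
        raise ValueError("need exactly one strength per n-gram position")
    ranked = sorted(zip(phrases, strengths),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

tokens = "interest rates on the london money market".split()
# Made-up connection strengths for illustration only (not model output).
strengths = [0.9, 0.4, 0.1, 0.2, 0.6]
for phrase, c in top_ngrams(tokens, strengths):
    print(f"{c:.1f}  {phrase}")
```

In a tag cloud, the font size of each phrase would then be scaled by its strength.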
To experimentally verify the convergence of the routing algorithm, we also plot the learning curve, showing the training loss over time for different numbers of routing iterations. From Figure 3, we observe that Capsule-B with 3 or 5 iterations of routing optimizes the loss faster and converges to a lower loss at the end than the capsule network with 1 iteration.

Related Work
Early methods for text classification adopted the typical features such as bag-of-words, n-grams, and their TF-IDF features (Zhang et al., 2008).

Table 6 (example document from the Reuters-Multi-label test set): Interest rates on the London money market were slightly firmer on news U.K. Chancellor of the Exchequer Nigel Lawson had stated target rates for sterling against the dollar and mark, dealers said. They said this had come as a surprise and expected the targets, 2.90 marks and 1.60 dlrs, to be promptly tested in the foreign exchange markets. Sterling opened 0.3 points lower in trade weighted terms at 71.3. Dealers noted the chancellor said he would achieve his goals on sterling by a combination of intervention in currency markets and interest rates. Operators feel the foreign exchanges are likely to test sterling on the downside and that this seems to make a fall in U.K. Base lending rates even less likely in the near term, dealers said. The feeling remains in the market, however, that fundamental factors have not really changed and that a rise in U.K. Interest rates is not very likely. The market is expected to continue at around these levels, reflecting the current 10 pct base rate level, for some time.
Table 6 (histogram categories): Orphan, Mergers/Acquisitions, Money/Foreign Exchange, Trade, Interest Rates.

Recent advances in deep neural networks and representation learning have substantially improved the performance of text classification tasks. The dominant approaches are recurrent neural networks, in particular LSTMs, and CNNs. Kim (2014) reported on a series of experiments with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks. The CNN models improved upon the state of the art on 4 out of 7 tasks. Zhang et al. (2015) offered an empirical exploration of character-level convolutional networks (Convnets) for text classification, and their experiments showed that Convnets outperformed traditional models. Joulin et al. (2016) proposed fastText, a simple and efficient text classification method that could be trained on a billion words within ten minutes. Conneau et al. (2017) proposed very deep convolutional networks (with 29 convolutional layers) for text classification. Tai et al. (2015) generalized the LSTM to tree-structured network topologies (Tree-LSTM), which achieved the best results on two text classification tasks.
Recently, a novel type of neural network based on the concept of capsules has been proposed to overcome the representational limitations of CNNs and RNNs. Hinton et al. (2011) first introduced the concept of "capsules" for this purpose. Capsules with transformation matrices allowed networks to automatically learn part-whole relationships. Subsequently, Sabour et al. (2017) proposed capsule networks that replaced the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement. The capsule network has shown its potential by achieving a state-of-the-art result on MNIST data. Unlike max-pooling in CNNs, capsule networks do not throw away information about the precise position of the entity within the region. For low-level capsules, location information is place-coded by which capsule is active. Xi et al. (2017) further tested the application of capsule networks on CIFAR data with higher dimensionality. Hinton et al. (2018) proposed a new iterative routing procedure between capsule layers based on the EM algorithm, which achieved significantly better accuracy on the smallNORB data set. Zhang et al. (2018) generalized existing routing methods within the framework of weighted kernel density estimation. To date, no work has investigated the performance of capsule networks in NLP tasks. This study takes the lead on this topic.

Conclusion
In this paper, we investigated capsule networks with dynamic routing for text classification. Three strategies were proposed to boost the performance of the dynamic routing process by alleviating the disturbance of noisy capsules. Extensive experiments on six text classification benchmarks show the effectiveness of capsule networks in text classification. More importantly, capsule networks also show significant improvement over the compared baseline methods when transferring from single-label to multi-label text classification.