MCapsNet: Capsule Network for Text with Multi-Task Learning

Multi-task learning can share knowledge among related tasks and implicitly increase the training data. However, it has long been hampered by interference among tasks. This paper investigates the performance of capsule networks for text, and proposes a capsule-based multi-task learning architecture, which is unified, simple and effective. Exploiting the feature-clustering advantages of capsules, the proposed task routing algorithm clusters the features for each task in the network, which helps reduce the interference among tasks. Experiments on six text classification datasets demonstrate the effectiveness of our models and their feature-clustering characteristics.


Introduction
Multi-task learning (MTL) has achieved great success in the field of natural language processing, as it can share knowledge among multiple tasks and implicitly increase the volume of training data. The combination of multi-task learning and deep neural networks (DNNs) generates a further synergy via a regularization effect on the DNNs (Collobert and Weston, 2008), which helps alleviate overfitting and learn a more universal representation.
Inspired by this, many DNN-based multi-task learning models have been proposed to improve performance. As depicted in Figure 1, they can be categorized into three groups by structure: the tree scheme (Collobert and Weston, 2008; Liu et al., 2015), the parallel scheme (Liu et al., 2016) and the mediate scheme. The tree scheme reuses some shallow layers of the network and separates the higher layers for different tasks; it is the most common architecture for MTL but can only share low-level knowledge. To share deeper knowledge among tasks, more layers are linked in the parallel and mediate schemes, but these severely suffer from interference among tasks: useless features are fully shared along with the helpful ones, which may contaminate the feature spaces of the tasks. Besides, models under these two schemes usually employ multiple subnets, which contain more parameters and are hard to train. Apparently, there is a contradiction between knowledge sharing and interference. Sharing too much between tasks inevitably brings about interference, as the feature space of each task may be contaminated by others, and shared useless features may mislead the network's predictions. This dilemma is caused by the lack of management of the sharing process, in which the network cannot discriminate the features and collect the appropriate ones for each task.
Capsule networks (Hinton et al., 2011; Mousa et al., 2017; Hinton et al., 2018) embed features into capsules and connect neighboring layers via "routing-by-agreement". The dynamic routing algorithm can decide the routes of capsules, namely, cluster the features for each category. Intuitively, this property of capsule networks can be employed in MTL to discriminate the features for the tasks.
In this paper, we explore the performance of capsule networks for text (CapsNet-1, CapsNet-2) and show the benefits and potential of capsule networks for NLP. We then propose a capsule-based architecture for multi-task learning (MCapsNet), which is unified, simple, effective and able to cluster the features for tasks. We design a Task Routing algorithm to route the feature flows to tasks and vote for the classes, which reduces the interference. In extensive experiments, our approach achieves competitive results in the single-task scenario and shows obvious improvement in the multi-task scenario, which demonstrates its effectiveness and its ability to reduce the interference among multiple tasks. Our visualization experiments also intuitively show the feature-clustering mechanism and how it helps make the right predictions.
The contributions of this paper are three-fold: • This paper investigates the performance of capsule networks on text and designs two effective capsule-based models for text classification, which give clear improvements on several benchmarks.
• We novelly combine the capsule and multitask learning, which can help reduce the interference among tasks.
• The proposed task routing algorithm can route the capsules to multiple tasks, by which the features are clustered into groups for the tasks.

Convolutional Neural Network and Multi-Task Learning
Capsule networks are built upon the convolutional neural network (CNN) and make heavy use of convolution operations. The main differences between them are that a capsule network uses vectors to represent features and discards the pooling operation. CNNs are good at feature extraction and can capture short- and long-range relations through the feature maps over a text sequence (Kalchbrenner et al., 2014; Kim, 2014). In this section, we provide some formulations for CNNs and some background knowledge on multi-task learning.

Single-Task CNN for Text Classification
Given a text sequence x_{1:l} = x_1 x_2 ··· x_l of length l, the target of the CNN is to predict the category ŷ of x_{1:l} from a set {y_1, y_2, ···, y_C}, or a one-hot form of ŷ, where C is the class number. Using f(·) to denote the network, the prediction process can be formalized as f(x_1, x_2, ···, x_l) = ŷ.
In detail, the convolutional neural network f(·) first uses a lookup table to embed the word sequence x_{1:l} into vectors x. Then the CNN produces the representation of the input sequence by stacking convolution, pooling and fully-connected layers in order:

F = K * x,  F̃ = p(F),  ŷ = softmax(w F̃ + b),

where K is the kernel of the convolution operation *; p(·) denotes the pooling operation; F and F̃ represent the feature maps; w and b denote the weight and bias respectively in the fully-connected layer. The parameters of the network are optimized via variants of SGD (stochastic gradient descent) to minimize the cross-entropy loss between the prediction ŷ and the ground truth label ỹ:

L = −Σ_i Σ_j ỹ_{ij} log ŷ_{ij},

where i and j enumerate the training samples and classes respectively.
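The single-task forward pass above can be illustrated with a minimal numpy sketch. This is our own illustration with a single kernel and max-over-time pooling, not the paper's exact configuration; all names and shapes here are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_forward(x, K, w, b):
    """Single-kernel text CNN sketch: 1-D convolution over the embedded
    sequence x (l x d), max-over-time pooling p(.), then a fully
    connected softmax layer producing the class distribution y_hat."""
    l = x.shape[0]
    n = K.shape[0]  # kernel window size
    F = np.array([np.sum(x[i:i + n] * K) for i in range(l - n + 1)])
    F_pool = F.max()                  # pooling keeps the strongest feature
    return softmax(w * F_pool + b)    # y_hat over C classes

def cross_entropy(y_hat, y_true):
    """Loss between the prediction y_hat and the one-hot ground truth."""
    return float(-np.sum(y_true * np.log(y_hat + 1e-12)))
```

In practice many kernels of several window sizes are used, and the parameters K, w, b are learned by SGD rather than fixed.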

Multi-Task Learning
A multi-task learning model is usually a variant or combination of single-task models (CNNs, RNNs or DNNs), like the architectures illustrated in Figure 1. Given K text classification tasks {T_1, T_2, ···, T_K}, a multi-task learning model f(·) should be able to make predictions for samples x^(k) from every task T_k. The overall loss for all the tasks is usually a linear combination of the per-task costs:

L = Σ_{k=1}^{K} λ_k L^(k),  L^(k) = −(1/N_k) Σ_{i=1}^{N_k} Σ_{j=1}^{C_k} ỹ^(k)_{ij} log ŷ^(k)_{ij},

where λ_k, N_k and C_k denote the weight, the number of training samples and the class number of task T_k.
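The linear combination of per-task costs can be sketched as follows; the helper names and the averaging over the N_k samples are our own illustrative assumptions.

```python
import numpy as np

def task_loss(Y_hat, Y_true):
    """Cross-entropy for one task, averaged over its N_k samples (rows)
    and summed over its C_k classes (columns)."""
    N = Y_hat.shape[0]
    return float(-np.sum(Y_true * np.log(Y_hat + 1e-12)) / N)

def overall_loss(per_task_losses, lambdas):
    """Linear combination of the K per-task losses with weights lambda_k."""
    return float(sum(lam * L for lam, L in zip(lambdas, per_task_losses)))
```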

Capsule Networks for Text
Capsule network (CapsNet) was first proposed by Sabour et al. (2017) for image classification; it is position sensitive and shows strong performance on several classification tasks. As depicted in Figure 2, we propose several capsule networks for text, which are suitable for text representation and multi-task learning. They are comprised of a convolutional layer, a primary capsule layer and a class capsule layer. In the rest of this section, we first formulate the single-task capsule networks (CapsNet-1 and CapsNet-2) for text classification, and then extend them to a multi-task version (MCapsNet).

Primary Capsule Layer
Given an embedded sample x ∈ R^{l×d} with length l and d-dimensional word vectors, the capsule network first employs a plain convolutional layer to extract local features from N-grams. Each kernel K_i with a bias b emits a feature map F_i by convolution:

F_i = x * K_i + b.

By assembling the I feature maps together, we have an I-channel layer F = [F_1, F_2, ···, F_I]. The generated feature maps are then fed into the primary capsule layer, which pieces the instantiated parts together via another convolution. Primary capsules use vectors instead of scalars to preserve the instantiation parameters of each feature, which can not only represent the intensity of activation but also record some details of the instantiated parts in the input. In this way, a capsule can be regarded as a short representation of the instantiated parts detected by a kernel.
Sliding over the feature maps F, each kernel K_j outputs a series of d-dimensional capsules p_j ∈ R^d. These capsules comprise a channel P_j of the primary capsule layer:

p_j = g(F * K_j + b),

where g is the nonlinear squash function and b is the capsule bias term. All the J channels can be arranged as P = [P_1, P_2, ···, P_J].
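A minimal numpy sketch of the primary capsule layer, emitting one capsule per window position; the kernel shapes are our own illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Nonlinear squash g: keeps a vector's direction and maps its
    length into [0, 1)."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def primary_capsules(F, kernels, b):
    """Slide each kernel K_j over the feature maps F (positions x I
    channels) and emit one d-dimensional capsule p_j per position;
    the capsules emitted by one kernel form a channel P_j."""
    channels = []
    for Kj in kernels:                       # Kj: (n, I, d)
        n = Kj.shape[0]
        chan = [squash(np.einsum('ni,nid->d', F[t:t + n], Kj) + b)
                for t in range(F.shape[0] - n + 1)]
        channels.append(np.array(chan))      # channel P_j: (positions, d)
    return np.stack(channels)                # P: (J, positions, d)
```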

Connection Between Capsule Layers
The capsule network generates the capsules of the next layer using "routing-by-agreement". This process takes the place of the pooling operation, which usually discards location information; it helps augment the robustness of the network and also helps cluster features for prediction. Between two neighboring layers l and l+1, a "prediction vector" û_{j|i} is first computed from capsule u_i in the lower layer l by multiplying with a weight matrix W_ij:

û_{j|i} = W_ij u_i.

Then, in the higher layer l+1, a capsule s_j is generated as a linear combination of all the prediction vectors with weights c_ij:

s_j = Σ_i c_ij û_{j|i},

where the c_ij are coupling coefficients decided by the iterative dynamic routing process. The coupling coefficients are calculated by a "routing softmax" over the initial logits b_ij, which are the log prior probabilities that capsule i should be coupled to capsule j:

c_ij = exp(b_ij) / Σ_k exp(b_ik).
This "routing softmax" guarantees that the coupling coefficients leaving each lower-layer capsule sum to 1.
The length of a capsule represents the probability that the input sample contains the object the capsule describes, that is, the activation of the capsule. So the length of a capsule is limited to the range [0, 1] with a nonlinear squashing function:

v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖).

By that, short vectors are pushed to shrink toward zero length, and long ones toward one.

Dynamic Routing
Suppose capsule layer l has been generated. We have to decide the intensity of the connection between capsule i in the l-th layer and capsule j in the (l+1)-th layer, that is, the coupling coefficient c_ij. The initial logit b_ij is updated with the routing agreement a_ij, which is calculated by a scalar product between the capsules of the two layers:

a_ij = û_{j|i} · v_j.

The agreement a_ij is added to the logit to calculate the capsules in the next layer:

b_ij ← b_ij + a_ij.
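Putting the pieces together, the dynamic routing procedure of Sabour et al. (2017) can be sketched in numpy as follows; the tensor shapes are our own assumptions for illustration.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Map a capsule's length into [0, 1) while keeping its direction."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, iters=3):
    """Routing-by-agreement between layers l and l+1.
    u_hat: prediction vectors u_hat_{j|i}, shape (num_in, num_out, dim)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))              # initial logits b_ij
    for _ in range(iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)     # routing softmax over j
        s = np.einsum('ij,ijd->jd', c, u_hat)    # s_j = sum_i c_ij u_hat_{j|i}
        v = squash(s)                            # output capsules v_j
        b = b + np.einsum('ijd,jd->ij', u_hat, v)  # agreement a_ij = u_hat . v_j
    return v
```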

Class Capsule Layer and Loss
The class capsule layer, as the top-level layer, is comprised of C class capsules, each of which corresponds to a category. The length of the instantiation parameter vector of each capsule represents the probability that the input sample belongs to that category, and its direction preserves the characteristics of the features, so it can be regarded as an encoded vector for the input sample.
Margin Loss To increase the difference between the lengths of the class capsules, CapsNet utilizes a separated margin loss:

L_j = G_j max(0, m⁺ − ‖v_j‖)² + λ (1 − G_j) max(0, ‖v_j‖ − m⁻)²,

where v_j is the capsule for class j; m⁺ and m⁻ are the top and bottom margins respectively, which push the lengths beyond the two margins; G_j = 1 if and only if class j is the ground truth; λ is the weight for the absent classes, which reduces their contribution and avoids shrinking the lengths of all the capsules too much during early training. In this paper, λ is set to 0.5.
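The margin loss translates directly into code. In this sketch the margin values m⁺ = 0.9 and m⁻ = 0.1 follow Sabour et al. (2017) and are our assumption; the paper itself only fixes λ = 0.5.

```python
import numpy as np

def margin_loss(v_lengths, G, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Separated margin loss over the C class capsules.
    v_lengths: ||v_j|| for each class capsule; G: one-hot ground truth;
    m_pos / m_neg are the top and bottom margins; lam down-weights the
    absent classes (0.5 in the paper)."""
    present = G * np.maximum(0.0, m_pos - v_lengths) ** 2
    absent = lam * (1.0 - G) * np.maximum(0.0, v_lengths - m_neg) ** 2
    return float(np.sum(present + absent))
```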
Orphan Category A drawback of CapsNet is that it tends to account for everything in the input sample, including "background" information such as stop words and punctuation that would interfere with the prediction. So an orphan category, which belongs to none of the categories of the task, is added to the class capsules in the output layer. The orphan category helps collect the less contributive capsules that contain too much "background" information, reducing the interference with the normal categories.

Substitutional Modules for Multi-Task
Task Routing The dynamic routing algorithm was first proposed by Sabour et al. (2017) to replace the pooling operation used in conventional convolutional neural networks. It maintains the position information of features, which benefits both image and text representation. More importantly, this routing-by-agreement method can cluster the features into each class. Inspired by this, we employ the same idea to cluster the features for different tasks and propose the Task Routing algorithm (Algorithm 3.5), which gives a simple and efficient solution to the question that existing MTL models (Liu et al., 2017; Ruder et al., 2017; Fang et al., 2017) try to address: "What features should be shared among tasks, and what should not?" With it, the network can decide the contribution of the features to each task and set appropriate coupling coefficients between features and tasks.
More concretely, we introduce a coupling coefficient c^(k)_ij between capsule i in the l-th layer and class capsule j in the (l+1)-th layer for task k, which is the result of a softmax function over the logits b^(k)_ij:

c^(k)_ij = exp(b^(k)_ij) / Σ_m exp(b^(k)_im).

Thus c^(k)_ij is restricted to the range [0, 1] and represents the probability that capsule i is coupled to class capsule j in task k; it is updated by the procedure described in Algorithm 3.5.
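A sketch of the task routing idea: the routing logits and coupling coefficients carry a task index k, so each task clusters its own features. The dict-based interface and all shapes are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Map a capsule's length into [0, 1) while keeping its direction."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def task_routing(u_hat_per_task, k, iters=3):
    """Task routing sketch: like dynamic routing, but with one set of
    logits b_ij^(k) per task, so every task builds its own coupling
    between features and classes. u_hat_per_task maps a task id to
    prediction vectors of shape (num_in, C_k, dim); only the picked
    task's class capsules are routed and returned."""
    uh = u_hat_per_task[k]
    b = np.zeros(uh.shape[:2])                   # per-task logits b_ij^(k)
    for _ in range(iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)     # coefficients c_ij^(k)
        v = squash(np.einsum('ij,ijd->jd', c, uh))
        b = b + np.einsum('ijd,jd->ij', uh, v)   # agreement update
    return v                                     # class capsules v_j^(k)
```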

Multi-Task Loss
The loss for each task is the sum of the margin losses over all its classes, Σ_{j=1}^{C_k} L^(k)_j. By linearly combining the losses of all tasks, we get the multi-task loss

L = Σ_{k=1}^{K} β^(k) Σ_{j=1}^{C_k} L^(k)_j,

where β^(k) is the weight of each task loss and Σ_{k=1}^{K} β^(k) = 1. In this paper, all β^(k) are set to 1/K to keep a balance among the K tasks.

Multi-Task Training
In order to juggle several tasks in a unified network, following Collobert and Weston (2008), the tasks are trained alternately in a stochastic manner. The steps can be described as follows:
1. Pick a task k at random;
2. Select an arbitrary sample s from task k;
3. Feed the sample s into the MCapsNet and update the parameters;
4. Go back to 1.
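The alternating schedule can be sketched as follows; `train_step` is a hypothetical callback standing in for the forward/backward pass of the network.

```python
import random

def multitask_train(tasks, train_step, num_steps, seed=0):
    """Stochastic alternating multi-task training.
    tasks: maps a task id to its list of training samples;
    train_step(k, s): hypothetical callback that feeds sample s of
    task k through the network and updates the parameters."""
    rng = random.Random(seed)
    task_ids = list(tasks)
    for _ in range(num_steps):
        k = rng.choice(task_ids)      # 1. pick a task at random
        s = rng.choice(tasks[k])      # 2. pick an arbitrary sample from it
        train_step(k, s)              # 3. feed it forward and update
```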

Architectures of CapsNets for Text
As illustrated in Figure 2, we propose a capsule-based multi-task learning architecture, MCapsNet, which is based on the single-task structures CapsNet-1 and CapsNet-2. Their architectures are detailed as follows.
CapsNet-1 As depicted in Figure 2, CapsNet-1 is a fundamental framework with three layers. The first layer is a plain convolution with 256 kernels of window size 3 and stride 1. For the activation function, we use ReLU to add nonlinearity. This layer extracts local features from the input sequences, which are the basis for constructing primary capsules.
The primary capsule layer employs 32 kernels with window size 3 and stride 1. The emitted primary capsules are 8-dimensional and have a bigger receptive field, helping reassemble the piecewise features into wholes.
The last layer is the class capsule layer, which is comprised of 16-dimensional capsules for the classes. They are connected to the PrimaryCaps with routing-by-agreement, and the coupling coefficients are updated by the dynamic routing algorithm.
CapsNet-2 On the basis of CapsNet-1, CapsNet-2 upgrades the convolutional layer to use multiple kernel sizes, which enriches the features. Concatenating them allows the primary capsules to see the features from different kernel sizes at the same time.
MCapsNet MCapsNet is a unified multi-task structure based on CapsNet-2. It replaces dynamic routing with task routing (Algorithm 3.5), which enables the network to route the features to each task. The whole network is optimized in a stochastic way with multi-task training (Section 3.5).

Table 1: Statistics for the six datasets

Dataset  Train  Dev   Test  Classes  Type
MR       9500   -     1100  2        review
SST-1    8544   1101  2210  5        sentiment
SST-2    6920   872   1821  2        sentiment
Subj     9000   -     1000  2        subjectivity
TREC     5900   -     500   6        question
AG's     120k   -     7600  4        news
Implementation Details For word embeddings, we use the 300-dimensional Word2Vec vectors (Mikolov et al., 2013), which cover a vocabulary of 3M words. All the routing logits b^(k)_ij are initialized to zero, so that all the capsules in adjacent layers (û_{j|i}, v_j) are initially connected with equal probability c_ij. The coupling coefficients are updated by routing with 3 iterations, which performs best for our approach. For training, we use the Adam optimizer (Kingma and Ba, 2014) with an exponentially decaying learning rate. Moreover, we use mini-batches of size 8 for all the datasets.

Experiment
We test our capsule-based models on six datasets in both single-task and multi-task scenarios to demonstrate the effectiveness of our approaches. In this section we also conduct further investigations, such as an ablation study and visualization, to give a comprehensive understanding of the characteristics of our models.

Datasets
For both single-task and multi-task scenarios, we conduct extensive experiments on six benchmarks: movie reviews (MR) (Bo and Lee, 2005), the Stanford Sentiment Treebank (SST-1 and SST-2) (Socher et al., 2013), subjectivity classification (Subj) (Pang et al., 2004), the question dataset (TREC) (Li and Roth, 2002) and the AG's news corpus (Mousa et al., 2017). These datasets cover a wide range of text classification tasks, which can fully test a model; details are listed in Table 1.

Single-Task Learning Results
We first test our approach on the six text classification datasets under the single-task scheme. As Table 2 shows, our single-task network enhanced by capsules is already a strong model. CapsNet-1, which has one kernel size, obtains the best accuracy on 2 out of 6 datasets and competitive results on the others. CapsNet-2, with multiple kernel sizes, further improves the performance and gets the best accuracy on 4 datasets. This proves that our capsule networks are effective for text. In particular, our capsule networks outperform conventional CNNs such as DCNN, CNN-MC and VD-CNN by a large margin (on average 1.1%, 0.7% and 1.0% respectively), which shows the advantages of capsule networks over conventional CNNs in clustering features and leveraging position information.

Routing Iteration
The coupling coefficients c_ij are updated by the dynamic routing algorithm, which determines the connections between capsules. To find the best number of updating iterations for the coupling coefficients, we test CapsNet-2 with a series of iteration counts (1, 3 and 5) on the MR dataset. As shown in Figure 3, the network with 3 iterations converges fast and performs best, which stays in line with the conclusion of (Sabour et al., 2017). So we use 3 iterations in all our experiments.
Ablation Study on Orphan Category The orphan category in the class capsule layer helps collect the noise capsules that contain "background" information such as stop words, punctuation or unrelated words. We conduct an ablation experiment on the orphan category, and the result (Table 2) shows that the network with the orphan category performs better than the one without it by 0.4%. This demonstrates the effectiveness of the orphan category.

Multi-Task Learning Results
Up to now, we have obtained an optimized single-task architecture. In this section, we equip CapsNet-2 with task routing and the multi-task training procedure, yielding the model MCapsNet, so that this capsule-based architecture can learn several datasets in a unified network. Extensive experiments are conducted to demonstrate the effectiveness of our multi-task learning architecture, as well as its ability for feature clustering.

Multi-Task Performance
We simultaneously train MCapsNet on the six tasks in Table 1 and compare it with the single-task scenario (Table 3). We can see that our multi-task architecture clearly improves the performance over the single-task models, which demonstrates its benefits. As Table 3 shows, MCapsNet also outperforms the state-of-the-art multi-task learning models by at least 1.1%. This shows the advantage of our task routing algorithm, which clusters the features for each task instead of freely sharing them among tasks.

Routing Visualization
To show how capsules benefit multi-task learning, we visualize the coupling coefficients c^(k)_ij ∈ [0, 1] between the primary and class capsules. We use kernels of size 1 for the primary capsule layer so that every capsule represents exactly one 3-gram phrase. The strength of these connections indicates the importance of the 3-grams to the corresponding task and class.
We feed a random sample from the MR dataset into MCapsNet. In the first row of Table 4, we show the most important 3-gram phrases for the two tasks MR and Subj (two classes each) as word clouds, where the size of a gram represents the weight of its coupling coefficient. We can see that the task routing algorithm leads the grams into the most related tasks, which lets each task consider only the features helpful to it. In other words, task routing builds a feature space for each task and prevents the tasks from contaminating each other. This demonstrates that MCapsNet has the ability of feature clustering, which benefits MTL by reducing the interference. Table 4: Visualization of the task routing for a positive sample from MR, "it 's not so much enjoyable to watch as it is enlightening to listen to new sides of a previous reality , and to visit with some of the people who were able to make an impact in the theater world"

Related Work
Related work can be divided into two threads. The first thread is capsule networks, which have proven effective on many classification tasks. The concept of the capsule was first proposed by Hinton et al. (2011), who first used vectors to describe the pose of an object. This work improves the representation ability of neural networks over vanilla CNNs and also enhances the robustness of the network to transformations. The dynamic routing algorithm was then proposed in (Sabour et al., 2017) to replace the pooling operation, building a part-whole relationship for object recognition. Dynamic routing maintains the position information of features that pooling operations generally discard, and the results show that the proposed method improves the state-of-the-art performance on the MNIST dataset. Next, Hinton et al. (2018) employ matrices to depict the pose and design a new routing procedure between capsule layers based on the EM algorithm. This work shows a strong ability to address transformation problems and gains significant improvement on the smallNORB dataset.
All these methods are proposed for computer vision, while in this paper we investigate the benefits of capsules for text.
The other thread is multi-task learning. The earliest idea can be traced back to (Caruana, 1997), and much work has since been done in this field to improve performance. Collobert and Weston (2008) develop a multi-task learning model based on CNNs, which shares only one lookup table to train better word embeddings. Liu et al. (2015) propose a DNN-based model for multi-task learning, which shares some low layers but separates the high-level layers to complete several different tasks.
Some models are proposed to share deeper layers of the networks, which can exchange high-level knowledge among tasks and gain better performance. For example, Liu et al. (2016) introduce several RNN architectures and design different schemes for knowledge sharing. These trials improve the performance of the models, but they give no consideration to the interference in multi-task learning. Liu et al. (2017) add adversarial losses to multi-task RNNs, which can alleviate the interference among tasks by finding a common feature space; however, the model has multiple subnets and various losses, which requires more computation and training skills.
Different from these methods, we bring the idea of capsules into the natural language processing (NLP) field and propose a capsule-based multi-task learning architecture with a task routing algorithm. This approach can cluster the features for each task, reducing the interference among them.

Conclusion and Future Work
This paper investigates the performance of capsule networks for text representation and proposes several effective architectures. By means of the characteristics of capsule networks, we design a unified, simple yet effective architecture with task routing for multi-task learning, which has the ability to cluster the features, building a private feature space for every task.
In future work, we would like to investigate the relations of various tasks in multi-task learning by exploiting the potential of capsule network.