Towards Scalable and Reliable Capsule Networks for Challenging NLP Applications

Obstacles hindering the development of capsule networks for challenging NLP applications include poor scalability to large output spaces and less reliable routing processes. In this paper, we introduce: (i) an agreement score to evaluate the performance of routing processes at instance-level; (ii) an adaptive optimizer to enhance the reliability of routing; (iii) capsule compression and partial routing to improve the scalability of capsule networks. We validate our approach on two NLP tasks, namely: multi-label text classification and question answering. Experimental results show that our approach considerably improves over strong competitors on both tasks. In addition, we gain the best results in low-resource settings with few training instances.


Introduction
In recent years, deep neural networks have achieved outstanding success in natural language processing (NLP), computer vision and speech recognition. However, these deep models are data-hungry and generalize poorly from small datasets, very much unlike humans (Lake et al., 2015).
This is an important issue in NLP since sentences with different surface forms can convey the same meaning (paraphrases) and not all of them can be enumerated in the training set. For example, "Peter did not accept the offer" and "Peter turned down the offer" are semantically equivalent, but use different surface realizations.
In image classification, progress on the generalization ability of deep networks has been made by capsule networks (Sabour et al., 2017; Hinton et al., 2018). They are capable of generalizing to the same object in different 3D images with various viewpoints. Such generalization capability can be learned from examples with few viewpoints by extrapolation (Hinton et al., 2011). This suggests that capsule networks can similarly abstract away from different surface realizations in NLP applications.
1 Our code is publicly available at http://bit.ly/311Dcod
Figure 1 example sentences (linked by "Extrapolate"): "Jerry completed his project."; "Jerry managed to finish his project."; "Jerry succeeded in finishing his project."
Figure 1 illustrates this idea of how observed sentences in the training set are generalized to unseen sentences by extrapolation. In contrast, traditional neural networks require massive amounts of training samples for generalization. This is especially true in the case of convolutional neural networks (CNNs), where pooling operations wrongly discard positional information and do not consider hierarchical relationships between local features (Sabour et al., 2017). Capsule networks, instead, have the potential for learning hierarchical relationships between consecutive layers by using routing processes without parameters, which are clustering-like methods (Sabour et al., 2017) and additionally improve the generalization capability. We contrast such routing processes with pooling and fully connected layers in Figure 2.
Despite some recent success in NLP tasks (Wang et al., 2018; Xia et al., 2018; Xiao et al., 2018; Zhang et al., 2018a; Zhao et al., 2018), a few important obstacles still hinder the development of capsule networks for mature NLP applications.
For example, selecting the number of iterations is crucial for routing processes, because they iteratively route low-level capsules to high-level capsules in order to learn hierarchical relationships between layers. However, existing routing algorithms use the same number of iterations for all examples, which gives no reliable way to judge whether routing has converged for a given example. As shown in Figure 3, a routing process with five iterations on all examples converges to a lower training loss at system level, but at instance level, for one example, convergence has still not been reached.
Additionally, training capsule networks is more difficult than training traditional neural networks such as CNNs and long short-term memory (LSTM) networks, due to the large number of capsules and potentially large output spaces, which require extensive computational resources in the routing process.
In this work, we address these issues via the following contributions:
• We formulate routing processes as a proxy problem minimizing a total negative agreement score in order to evaluate how routing processes perform at instance level, which will be discussed in more depth later.
• We introduce an adaptive optimizer to self-adjust the number of iterations for each example in order to improve instance-level convergence and enhance the reliability of routing processes.
• We present capsule compression and partial routing to achieve better scalability of capsule networks on datasets with large output spaces.
• Our framework outperforms strong baselines on multi-label text classification and question answering. We also demonstrate its superior generalization capability in low-resource settings.

NLP-Capsule Framework
We have motivated the need for capsule networks that scale to large output spaces and for more reliable routing processes at instance level. We now build a unified capsule framework, which we call NLP-Capsule. It is shown in Figure 4 and described below.

Convolutional Layer
We use a convolutional operation to extract features from documents by taking a sliding window over document embeddings.
Let $X \in \mathbb{R}^{l \times v}$ be a matrix of stacked $v$-dimensional word embeddings for an input document with $l$ tokens. Furthermore, let $W^a \in \mathbb{R}^{l \times k}$ be a convolutional filter with a width $k$. We apply this filter to a local region $X_{i:i+k-1} \in \mathbb{R}^{k \times l}$ to generate one feature:
$$m_i = f(W^a \circ X_{i:i+k-1})$$
where $\circ$ denotes element-wise multiplication, and $f$ is a nonlinear activation function (i.e., ReLU). For ease of exposition, we omit all bias terms.
Then, we can collect all $m_i$ into one feature map $(m_1, \ldots, m_{(v-k+1)/2})$ after sliding the filter over the current document. To increase the diversity of feature extraction, we concatenate multiple feature maps extracted by three filters with different window sizes (2, 4, 8) and pass them to the primary capsule layer.
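The sliding-window feature extraction described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the conventional text-CNN layout in which a filter of width k covers k consecutive token embeddings, that the element-wise product is summed into one scalar before the activation, and a stride of 1. The names `conv_feature_map`, `X`, `W` and `stride` are hypothetical.

```python
def relu(x):
    """ReLU activation f."""
    return max(0.0, x)


def conv_feature_map(X, W, stride=1):
    """Slide one filter W (k rows) over token embeddings X (l rows).

    Assumption: each feature m_i is the ReLU of the summed element-wise
    product between W and the window X[i:i+k]; bias terms are omitted.
    """
    k = len(W)
    features = []
    for i in range(0, len(X) - k + 1, stride):
        window = X[i:i + k]
        total = sum(w * x
                    for w_row, x_row in zip(W, window)
                    for w, x in zip(w_row, x_row))
        features.append(relu(total))
    return features


# Toy document: 4 tokens with 2-dimensional embeddings, one filter of width 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
W = [[1.0, 1.0], [1.0, -1.0]]
feature_map = conv_feature_map(X, W)
```

Running the same function with filters of width 2, 4 and 8 and concatenating the resulting lists would mirror the multi-window setup described in the text.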

Primary Capsule Layer
In this layer, we use a group-convolution operation to transform feature maps into primary capsules. As opposed to using a scalar for each element in the feature maps, capsules use a group of neurons to represent each element in the current layer, which has the potential for preserving more information.
Using $1 \times 1$ filters $W^b = \{w_1, \ldots, w_d\} \in \mathbb{R}^d$, in total $d$ groups are used to transform each scalar $m_i$ in the feature maps into one capsule $p_i$, a $d$-dimensional vector, denoted as:
$$p_i = g(p_{i1} \oplus p_{i2} \oplus \cdots \oplus p_{id}) \in \mathbb{R}^d$$
where $p_{ij} = m_i \cdot w_j \in \mathbb{R}$ and $\oplus$ is the concatenation operator. Furthermore, $g$ is a non-linear function (i.e., squashing function). The length $\|p_i\|$ of each capsule $p_i$ indicates the probability of it being useful for the task at hand. Hence, a capsule's length has to be constrained into the unit interval $[0, 1]$ by the squashing function $g$:
$$g(x) = \frac{\|x\|^2}{1 + \|x\|^2} \frac{x}{\|x\|}$$
Capsule Compression One major issue in this layer is that the number of primary capsules grows in proportion to the size of the input documents, which requires extensive computational resources in routing processes (see Section 2.3). To mitigate this issue, we condense the large number of primary capsules into a smaller number. In this way, we can merge similar capsules and remove outliers. Each condensed capsule $u_i$ is calculated by using a weighted sum over all primary capsules, denoted as:
$$u_i = \sum_j b_j p_j \in \mathbb{R}^d$$
where the parameter $b_j$ is learned by supervision.
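The primary capsule construction and capsule compression can be illustrated with a short sketch. It uses the squashing function of Sabour et al. (2017), g(x) = (||x||^2 / (1 + ||x||^2)) (x / ||x||). The layout of the compression weights is an assumption made for illustration (one row of weights per condensed capsule); the text only states that the weights are learned by supervision. All function names are hypothetical.

```python
import math


def squash(vec):
    """Squashing function g: keeps the direction, maps the length into [0, 1)."""
    sq_norm = sum(x * x for x in vec)
    if sq_norm == 0.0:
        return [0.0] * len(vec)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in vec]


def primary_capsule(m_i, filters):
    """Turn one scalar feature m_i into a d-dimensional capsule p_i."""
    return squash([m_i * w_j for w_j in filters])


def compress(capsules, weights):
    """Condense primary capsules into len(weights) capsules.

    Assumption: weights[r][j] is the learned weight of primary capsule j
    in condensed capsule r (one weight row per condensed capsule).
    """
    dim = len(capsules[0])
    return [[sum(b * p[t] for b, p in zip(row, capsules)) for t in range(dim)]
            for row in weights]


def length(vec):
    """Euclidean length of a capsule."""
    return math.sqrt(sum(x * x for x in vec))


# Three scalar features become three 3-dimensional capsules,
# which are then condensed into two capsules.
p = [primary_capsule(m, [0.5, -1.0, 2.0]) for m in (0.2, 1.0, 3.0)]
u = compress(p, [[0.7, 0.3, 0.0], [0.0, 0.2, 0.8]])
```

Because the number of rows in `weights` is fixed, the number of capsules entering the routing step no longer depends on document length, which is the point of the compression step.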

Aggregation Layer
Pooling is the simplest aggregation function routing condensed capsules into the subsequent layer, but it loses almost all information during aggregation. Alternatively, routing processes are introduced to iteratively route condensed capsules into the next layer for learning hierarchical relationships between two consecutive layers. We now describe this iterative routing algorithm. Let $\{u_1, \ldots, u_m\}$ and $\{v_1, \ldots, v_n\}$ be a set of condensed capsules in layer $\ell$ and a set of high-level capsules in layer $\ell+1$, respectively. The basic idea of routing is two-fold.
First, we transform the condensed capsules into a collection of candidates $\hat{u}_{j|1}, \ldots, \hat{u}_{j|m}$ for the $j$-th high-level capsule in layer $\ell+1$. Following Sabour et al. (2017), each element $\hat{u}_{j|i}$ is calculated by:
$$\hat{u}_{j|i} = W^c_j u_i$$
where $W^c$ is a linear transformation matrix.
Then, we represent a high-level capsule $v_j$ by a weighted sum over those candidates, denoted as:
$$v_j = \sum_{i=1}^{m} c_{ij} \hat{u}_{j|i}$$
where $c_{ij}$ is a coupling coefficient iteratively updated by a clustering-like method.
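The two steps above (building candidates, then taking a coupling-weighted sum) can be written out as a small sketch. It assumes one transformation matrix per high-level capsule, as in Sabour et al. (2017), and uses fixed uniform coupling coefficients rather than iteratively updated ones. The names `candidates`, `aggregate`, `W_c`, `U` and `c` are hypothetical.

```python
def matvec(W, u):
    """Matrix-vector product for nested lists."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def candidates(W_c, U):
    """u_hat[j][i] = W_c[j] @ U[i]: candidate of capsule i for output j.

    Assumption: one transformation matrix per high-level capsule j.
    """
    return [[matvec(W_j, u_i) for u_i in U] for W_j in W_c]


def aggregate(u_hat, c):
    """v_j = sum_i c[i][j] * u_hat[j][i] for every high-level capsule j."""
    dim = len(u_hat[0][0])
    return [[sum(c[i][j] * u_hat[j][i][t] for i in range(len(c)))
             for t in range(dim)]
            for j in range(len(u_hat))]


U = [[1.0, 2.0], [3.0, 4.0]]        # m = 2 condensed capsules
W_c = [[[1.0, 0.0], [0.0, 1.0]],    # identity map for output 0
       [[2.0, 0.0], [0.0, 2.0]]]    # scaling map for output 1
c = [[0.5, 0.5], [0.5, 0.5]]        # uniform coupling c[i][j]
v = aggregate(candidates(W_c, U), c)
```

With uniform coupling, each high-level capsule is simply the mean of its candidates; routing replaces these fixed weights with ones that reflect agreement.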
Our Routing As discussed earlier, routing algorithms like dynamic routing (Sabour et al., 2017) and EM routing (Hinton et al., 2018), which use the same number of iterations for all samples, perform well according to training loss at system level, but on instance level for individual examples, convergence has still not been reached.This increases the risk of unreliability for routing processes (see Figure 3).
To evaluate the performance of routing processes at instance level, we formulate them as a proxy problem minimizing the negative agreement score (NAS) function:
$$f(u) = -\sum_{i,j} c_{ij} \langle v_j, \hat{u}_{j|i} \rangle$$
The basic intuition behind this is to assign higher weights $c_{ij}$ to an agreeable pair $\langle v_j, \hat{u}_{j|i} \rangle$ if the capsules $v_j$ and $\hat{u}_{j|i}$ are close to each other, such that the total agreement score $\sum_{i,j} c_{ij} \langle v_j, \hat{u}_{j|i} \rangle$ is maximized. However, the choice of NAS functions remains an open problem. Hinton et al. (2018) hypothesize that the agreeable pairs in NAS functions are from Gaussian distributions. Instead, we study NAS functions by introducing Kernel Density Estimation (KDE), since this yields a non-parametric density estimator requiring no assumptions that the agreeable pairs are drawn from parametric distributions. Here, we formulate the NAS function in a KDE form:
$$f(u) = -\sum_{i,j} c_{ij} \, k(d(v_j, \hat{u}_{j|i}))$$
where $d$ is a distance metric with the $\ell_2$ norm, and $k$ is an Epanechnikov kernel function (Wand and Jones, 1994) with:
$$k(x) = \begin{cases} 1 - x & x \in [0, 1) \\ 0 & x \geq 1 \end{cases}$$
The solution we used for KDE is taking Mean Shift (Comaniciu and Meer, 2002) to minimize the NAS function $f(u)$. Then, $c_{ij}^{\tau+1}$ can be updated using standard gradient descent:
$$c_{ij}^{\tau+1} = c_{ij}^{\tau} + \alpha \cdot k(d(v_j^{\tau}, \hat{u}_{j|i}))$$
where $\alpha$ is the hyper-parameter to control step size.
To address the issue of convergence not being reached at instance level, we present an adaptive optimizer to self-adjust the number of iterations for individual examples according to their negative agreement scores (see Algorithm 1).Following Zhao et al. (2018), we replace standard softmax with leaky-softmax, which decreases the strength of noisy capsules.
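The idea of letting each example decide its own number of routing iterations can be sketched as follows. Algorithm 1 is not reproduced in this text, so the code below is an illustration under stated assumptions and not the authors' procedure: the stopping rule (stop once the negative agreement score changes by less than `tol`), the coupling-weighted mean used in place of a full Mean Shift step, and the plain row normalization used instead of leaky-softmax are all assumptions. The names `adaptive_routing`, `alpha`, `tol` and `max_iter` are hypothetical.

```python
import math


def epanechnikov(x):
    """Kernel profile: 1 - x on [0, 1), 0 otherwise."""
    return 1.0 - x if 0.0 <= x < 1.0 else 0.0


def dist(a, b):
    """Euclidean (l2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adaptive_routing(u_hat, alpha=0.5, tol=1e-4, max_iter=50):
    """Route until the negative agreement score (NAS) stops changing.

    u_hat[j][i] is the candidate from low-level capsule i for output j.
    Returns (v, c, iterations_used) for this single example.
    """
    n, m, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    c = [[1.0 / n] * n for _ in range(m)]          # c[i][j], uniform start
    previous_nas = None
    for iteration in range(1, max_iter + 1):
        # Assumed simplification: v_j is the coupling-weighted mean.
        v = []
        for j in range(n):
            weight = sum(c[i][j] for i in range(m))
            v.append([sum(c[i][j] * u_hat[j][i][t] for i in range(m)) / weight
                      for t in range(dim)])
        k = [[epanechnikov(dist(v[j], u_hat[j][i])) for j in range(n)]
             for i in range(m)]
        nas = -sum(c[i][j] * k[i][j] for i in range(m) for j in range(n))
        if previous_nas is not None and abs(previous_nas - nas) < tol:
            return v, c, iteration                 # converged for this example
        previous_nas = nas
        # Gradient step on c, then renormalise each row to sum to one.
        for i in range(m):
            row = [c[i][j] + alpha * k[i][j] for j in range(n)]
            total = sum(row)
            c[i] = [x / total for x in row]
    return v, c, max_iter


u_hat = [  # n = 2 outputs, m = 3 low-level capsules, 2-dimensional candidates
    [[0.10, 0.20], [0.12, 0.18], [0.60, 0.70]],
    [[0.50, 0.50], [0.52, 0.48], [0.05, 0.00]],
]
v, c, iterations = adaptive_routing(u_hat)
```

Different inputs return different values of `iterations`, which is the instance-level behavior the fixed-iteration routing schemes lack.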

Representation Layer
This is the top-level layer containing the final capsules calculated by iteratively minimizing the NAS function (see Eq. 1), where the number of final capsules corresponds to the entire output space. Therefore, once the output space becomes large (thousands of labels), computing this function becomes extremely expensive, which creates a scalability bottleneck for capsule networks.
Partial Routing As opposed to the entire output space of a dataset, the sub-output space corresponding to an individual example is rather small; for example, only a few labels are assigned to one document in text classification. As a consequence, it is redundant to route low-level capsules to the entire output space for each example in the training stage. This motivated us to present a partial routing algorithm with constrained output spaces, in which our NAS function is computed only over $D^+ \cup D^-$, where $D^+$ and $D^-$ denote the sets of real (positive) and randomly selected (negative) outputs for each example, respectively. Both sets are far smaller than the entire output space.
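Building the constrained output space for one training example can be sketched as below. The sampling scheme (uniform rejection sampling of negatives) and the names `constrained_output_space`, `num_labels` and `num_negatives` are assumptions for illustration; the text only states that the negative outputs are randomly selected.

```python
import random


def constrained_output_space(positive_labels, num_labels, num_negatives, rng):
    """Return (D_plus, D_minus) for one training example.

    D_plus holds the true labels; D_minus holds randomly drawn labels
    that are not true labels. Rejection sampling avoids enumerating
    the full label set. Labels are assumed to be ints in [0, num_labels).
    """
    d_plus = set(positive_labels)
    num_negatives = min(num_negatives, num_labels - len(d_plus))
    d_minus = set()
    while len(d_minus) < num_negatives:
        y = rng.randrange(num_labels)
        if y not in d_plus:
            d_minus.add(y)
    return sorted(d_plus), sorted(d_minus)


rng = random.Random(0)
d_plus, d_minus = constrained_output_space([3, 17, 42], num_labels=4000,
                                           num_negatives=10, rng=rng)
routed_outputs = d_plus + d_minus   # only these capsules take part in routing
```

In this toy setting routing touches 13 output capsules instead of 4000, which is where the training-time savings would come from.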

Experiments
The major focus of this work is to investigate the scalability of our approach on datasets with a large output space, and generalizability in low-resource settings with few training examples.Therefore, we validate our capsule-based approach on two specific NLP tasks: (i) multi-label text classification with a large label scale; (ii) question answering with a data imbalance issue.

Multi-label Text Classification
The multi-label text classification task refers to assigning multiple relevant labels to each input document, while the entire label set might be extremely large. We use our approach to encode an input document and generate the final capsules corresponding to the number of labels in the representation layer. The length of the final capsule for each label indicates the probability that the document has this label.
Baselines We compare our approach to the following baselines: non-deep learning approaches using TF-IDF features of documents as inputs: FastXML (Prabhu and Varma, 2014) and PD-Sparse (Yen et al., 2016); deep learning approaches using raw text of documents as inputs: FastText (Joulin et al., 2016), Bow-CNN (Johnson and Zhang, 2014), CNN-Kim (Kim, 2014) and XML-CNN (Liu et al., 2017); and a capsule-based approach, Cap-Zhao (Zhao et al., 2018). For evaluation, we use standard rank-based measures (Liu et al., 2017) such as Precision@k and Normalized Discounted Cumulative Gain (NDCG@k).
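For reference, the two rank-based measures can be computed per document as in the sketch below, using their common definitions with binary relevance. Here `scores` would be the capsule lengths for all labels and `relevant` the set of true label indices; these names and the helper `top_k` are hypothetical.

```python
import math


def top_k(scores, k):
    """Indices of the k highest-scoring labels, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def precision_at_k(scores, relevant, k):
    """Fraction of the top-k predicted labels that are true labels."""
    return sum(1 for j in top_k(scores, k) if j in relevant) / k


def ndcg_at_k(scores, relevant, k):
    """DCG of the top-k ranking divided by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, j in enumerate(top_k(scores, k)) if j in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


scores = [0.9, 0.1, 0.8, 0.3]   # one score per label
relevant = {0, 3}               # true labels of the document
p2 = precision_at_k(scores, relevant, 2)
n3 = ndcg_at_k(scores, relevant, 3)
```

Reported numbers are averages of these per-document values over the test set.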

Implementation Details
The word embeddings are initialized as 300-dimensional GloVe vectors (Pennington et al., 2014).In the convolutional layer, we use a convolution operation with three different window sizes (2,4,8) to extract features from input documents.Each feature is transformed into a primary capsule with 16 dimensions by a group-convolution operation.All capsules in the primary capsule layer are condensed into 256 capsules for RCV1 and 128 capsules for EUR-Lex by a capsule compression operation.
To avoid routing low-level capsules to the entire label space in the inference stage, we use a CNN baseline (Kim, 2014), trained on the same dataset as our approach, to generate 200 candidate labels and take these labels as a constrained output space for each example.
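The inference-time candidate selection amounts to keeping the highest-scoring labels from the baseline. A minimal sketch, assuming the baseline exposes one score per label (the function name `candidate_labels` and the toy scores are hypothetical):

```python
def candidate_labels(baseline_scores, k=200):
    """Indices of the k labels the baseline scores highest, best first."""
    order = sorted(range(len(baseline_scores)),
                   key=lambda j: baseline_scores[j], reverse=True)
    return order[:k]


# Toy example with 5 labels and k = 3 instead of 200.
scores = [0.05, 0.90, 0.30, 0.75, 0.10]
top = candidate_labels(scores, k=3)
```

Only the capsules for the returned labels would then take part in routing for that example.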

Experimental Results
In Table 2, we can see a noticeable margin brought by our capsule-based approach over the strong baselines on EUR-Lex, and competitive results on RCV1.These results appear to indicate that our approach has superior generalization ability on datasets with fewer training examples, i.e., RCV1 has 729.67 examples per label while EUR-Lex has 15.59 examples.
In contrast to the strongest baseline XML-CNN with 22.52M parameters and 0.08 seconds per batch, our approach has 14.06M parameters, and takes 0.25 seconds in an acceleration setting with capsule compression and partial routing, and 1.7 seconds without acceleration. This demonstrates that our approach provides competitive computational speed with fewer parameters compared to the competitors. This finding agrees with our speculation on generalization: the distance between our approach and XML-CNN increases as fewer training data samples are available. In Table 3, we also find that our approach with 70% of training examples achieves about 5% improvement over XML-CNN with 100% of examples on 4 out of 6 metrics.

Routing Comparison
We compare our routing with those of Sabour et al. (2017) and Zhang et al. (2018b) on the EUR-Lex dataset and observe that it performs best on all metrics (Table 4). We speculate that the improvement comes from enhanced reliability of routing processes at instance level.

Question Answering
The question answering (QA) selection task refers to selecting the best answer from a set of candidates for each question. For a question-answer pair $(q, a)$, we use our capsule-based approach to generate two final capsules $v_q$ and $v_a$ corresponding to the respective question and answer. The relevance score of a question-answer pair can be defined as their cosine similarity:
$$s(q, a) = \cos(v_q, v_a) = \frac{v_q^\top v_a}{\|v_q\| \cdot \|v_a\|}$$
As summarized in Table 5, we conduct our experiments on the TREC QA dataset collected from TREC QA track 8-13 data (Wang et al., 2007). The intuition behind this dataset selection is that the cost of hiring human annotators to collect positive answers for individual questions can be prohibitive, since positive answers can be conveyed in multiple different surface forms. This issue arises particularly in TREC QA, with only 12% positive answers. Therefore, we use this dataset to investigate the generalizability of our approach.
For evaluation, we use standard measures (Wang et al., 2007) such as Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).
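The scoring and evaluation steps can be sketched as follows: candidate answers are ranked by the cosine similarity between the question and answer capsules, and MAP and MRR are computed from the ranked correctness labels using their common definitions. The data layout (a list of `(score, is_correct)` pairs per question) and all function names are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two capsule vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average_precision(ranked):
    """Mean of precision values at the ranks of the correct answers."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked):
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, relevant in enumerate(ranked, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def map_mrr(questions):
    """questions: list of [(score, is_correct), ...], one list per question."""
    aps, rrs = [], []
    for answers in questions:
        ranked = [rel for _, rel in sorted(answers, key=lambda t: -t[0])]
        aps.append(average_precision(ranked))
        rrs.append(reciprocal_rank(ranked))
    return sum(aps) / len(aps), sum(rrs) / len(rrs)


score = cosine([1.0, 0.0], [1.0, 1.0])          # relevance of one (q, a) pair
questions = [
    [(0.9, 0), (0.8, 1), (0.1, 1)],             # first correct answer at rank 2
    [(0.2, 0), (0.7, 1)],                       # correct answer ranked first
]
mean_ap, mrr = map_mrr(questions)
```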
Implementation Details The word embeddings used for question-answer pairs are initialized as 300-dimensional GloVe vectors. In the convolutional layer, we use a convolution operation with three different window sizes (3, 4, 5). All 16-dimensional capsules in the primary capsule layer are condensed into 256 capsules by the capsule compression operation.
Experimental Results and Discussions In Table 6, the best performance on the MAP metric is achieved by our approach, which verifies the effectiveness of our model. We also observe that our approach exceeds traditional neural models like CNN, LSTM and NTN-LSTM by a noticeable margin. This finding also agrees with our observation in multi-label classification: our approach has superior generalization capability in low-resource settings with few training examples.
In contrast to the strongest baseline HD-LSTM with 34.51M parameters and 0.03 seconds for one batch, our approach has 17.84M parameters and takes 0.06 seconds in an acceleration setting, and 0.12 seconds without acceleration.

Multi-label Text Classification
Multi-label text classification aims at assigning a document to a subset of labels whose label set might be extremely large. With increasing numbers of labels, issues of data sparsity and scalability arise. Several methods have been proposed for the large multi-label classification case.
Tree-based models (Agrawal et al., 2013; Weston et al., 2013) induce a tree structure that recursively partitions the feature space with non-leaf nodes. Then, the restricted label spaces at leaf nodes are used for classification. Such a solution entails higher robustness because of a dynamic hyper-plane design and its computational efficiency. FastXML (Prabhu and Varma, 2014) is one such tree-based model, which learns a hierarchy of training instances and optimizes an NDCG-based objective function for nodes in the tree structure.
Label embedding models (Balasubramanian and Lebanon, 2012; Chen and Lin, 2012; Cisse et al., 2013; Bi and Kwok, 2013; Ferng and Lin, 2011; Hsu et al., 2009; Ji et al., 2008; Kapoor et al., 2012; Lewis et al., 2004; Yu et al., 2014a) address the data sparsity issue with two steps: compression and decompression. The compression step learns a low-dimensional label embedding that is projected from the original high-dimensional label space. When data instances are classified to these label embeddings, they will be projected back to the high-dimensional label space, which is the decompression step. Recent works came up with different compression or decompression techniques, e.g., SLEEC (Bhatia et al., 2015).
Linear classifiers: PD-Sparse (Yen et al., 2016) introduces a Fully-Corrective Block-Coordinate Frank-Wolfe algorithm to address data sparsity.

Question Answering
State-of-the-art approaches to QA fall into two categories: IR-based and knowledge-based QA.
IR-based QA first preprocesses the question and employs information retrieval techniques to retrieve a list of passages relevant to the question. Next, reading comprehension techniques are adopted to extract answers within the span of the retrieved text. For answer extraction, early methods relied on manually designed patterns (Pasca). A recent popular trend is neural answer extraction. Various neural network models are employed to represent questions (Severyn and Moschitti, 2015; Wang and Nyberg, 2015). Since the attention mechanism naturally explores relevancy, it has been widely used in QA models to relate the question to candidate answers (Tan et al., 2016; Santos et al., 2016; Sha et al., 2018). Moreover, some researchers leveraged external large-scale knowledge bases to assist answer selection (Savenkov and Agichtein, 2017; Shen et al., 2018; Deng et al., 2018).
Knowledge-based QA conducts semantic parsing on questions and transforms parsing results into logical forms. Those forms are adopted to match answers from structured knowledge bases (Yao and Van Durme, 2014; Yih et al., 2015; Bordes et al., 2015; Yin et al., 2016; Hao et al., 2017). Recent developments focused on modeling the interaction between question and answer pairs: tensor layers (Qiu and Huang, 2015; Wan et al., 2016) and holographic composition (Tay et al., 2017) have pushed the state of the art.

Capsule Networks
Capsule networks were initially proposed by Hinton et al. (2011) to improve the representations learned by neural networks compared with vanilla CNNs. Subsequently, Sabour et al. (2017) replaced the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement. Hinton et al. (2018) then proposed a new iterative routing procedure between capsule layers based on the EM algorithm, which achieves better accuracy on the smallNORB dataset. Zhang et al. (2018a) applied capsule networks to relation extraction in a multi-instance multi-label learning framework. Xiao et al. (2018) explored capsule networks for multi-task learning. Xia et al. (2018) studied the zero-shot intent detection problem with capsule networks, which aims to detect emerging user intents in an unsupervised manner. Zhao et al. (2018) investigated capsule networks with dynamic routing for text classification, and transferred knowledge from the single-label to multi-label cases. Cho et al. (2019) studied capsule networks with determinantal point processes for extractive multi-document summarization.
Our work is different from our predecessors in the following aspects: (i) we evaluate the performance of routing processes at instance level, and introduce an adaptive optimizer to enhance the reliability of routing processes; (ii) we present capsule compression and partial routing to achieve better scalability of capsule networks on datasets with a large output space.

Conclusion
Making computers perform more like humans is a major issue in NLP and machine learning. This not only includes making them perform on similar levels (Hassan et al., 2018), but also requires them to be robust to adversarial examples (Eger et al., 2019) and to generalize from few data points (Rücklé et al., 2019). In this work, we have addressed the latter issue.
In particular, we extended existing capsule networks into a new framework with advantages concerning scalability, reliability and generalizability. Our experimental results have demonstrated its effectiveness on two NLP tasks: multi-label text classification and question answering.
Through our modifications and enhancements, we hope to have made capsule networks more suitable to large-scale problems and, hence, more mature for real-world applications. In the future, we plan to apply capsule networks to even more challenging NLP problems such as language modeling and text generation.

Figure 1: The extrapolation regime for an observed sentence can be found during training. Then, the unseen sentences in this regime may be generalized successfully.

Figure 2: Outputs attend to a) active neurons found by pooling operations b) all neurons c) relevant capsules found in routing processes.

Figure 3: left) System-level routing evaluation on all examples; right) Instance-level routing evaluation on one example.

Figure 4: An illustration of the NLP-Capsule framework. In the primary capsule layer, $1 \times 1$ filters $W^b = \{w_1, \ldots, w_d\} \in \mathbb{R}^d$ in $d$ groups transform each scalar $m_i$ in the feature maps into one capsule $p_i$, a $d$-dimensional vector.

Table 1: Characteristics of the datasets. Each label of RCV1 has about 729.67 training examples, while each label of EUR-Lex has merely about 15.59 examples.
To further study the generalization capability of our approach, we vary the percentage of training examples from 100% to 50% on the entire training set, leading to the number of training examples per label decreasing from 15.59 to 7.77. Figure 5 shows that

Table 2: Comparisons of our NLP-Cap approach and baselines on two text classification benchmarks, where '-' denotes methods that failed to scale due to memory issues.

Table 3: Experimental results on different fractions of training examples from 50% to 100% on EUR-Lex.

Table 4: Performance on the EUR-Lex dataset with different routing processes.

Table 5: Characteristics of the TREC QA dataset. %Positive denotes the percentage of positive answers.

Table 6: Experimental results on the TREC QA dataset.