An Analysis of Capsule Networks for Part of Speech Tagging in High- and Low-resource Scenarios

Neural networks are a common tool in NLP, but it is not always clear which architecture to use for a given task. Different tasks, different languages, and different training conditions can all affect how a neural network will perform. Capsule Networks (CapsNets) are a relatively new architecture in NLP. Due to their novelty, CapsNets are being used more and more in NLP tasks. However, their usefulness is still mostly untested. In this paper, we compare three neural network architectures (LSTM, CNN, and CapsNet) on a part of speech tagging task. We compare these architectures in both high- and low-resource training conditions and find that no architecture consistently performs the best. Our analysis shows that our CapsNet performs nearly as well as a more complex LSTM under certain training conditions, but not others, and that our CapsNet almost always outperforms our CNN. We also find that our CapsNet implementation shows faster prediction times than the LSTM for Scottish Gaelic but not for Spanish, highlighting the effect that the choice of languages can have on the models.


Introduction
Neural networks have become a common tool in natural language processing (NLP) for many tasks, but are different architectures better suited for different tasks, languages, and/or resources? To try to answer this question, we examine the performance of two common neural network architectures, long short-term memory networks (LSTM) (Greff et al., 2017) and convolutional neural networks (CNN) (LeCun et al., 1989), against the newer capsule networks (CapsNets), another neural network architecture based on CNNs (Hinton et al., 2011).
While LSTMs and CNNs are common in NLP, capsule networks are relatively new to the field. Because they are so recent, it is not always clear if or when they are better than other widely used sequence models. This paper investigates the CapsNet architecture in comparison with LSTMs and CNNs. For our analysis, we apply these three architectures to a part of speech (POS) tagging task, on two languages, in both low- and high-resource scenarios.
Much of the focus of NLP research is on resource-rich languages like English. However, the performance of different models can depend on the linguistic properties of the language under study (Bender, 2009) and the amount of training data available. To compare the performance of these architectures under different training conditions, we look at Spanish (another resource-rich language) and Scottish Gaelic (a low-resource language), using different amounts of training data. This comparison is a step in the right direction, but it has two limitations: the neural network architectures are implemented in different frameworks, and only two languages are compared.
The main contribution of this paper is comparing the LSTM, CNN, and CapsNet architectures across different training conditions. Our analysis finds that none of the architectures consistently performs best across training conditions. This illustrates how different languages and training conditions can inform which architecture is best suited for a given NLP task, and that there is no obviously correct answer.

Related Work
CapsNets are a relatively new type of neural network. Hinton et al. (2011) introduce the architecture, with later modifications by Sabour et al. (2017) (dynamic routing) and Hinton et al. (2018) (EM routing). A CapsNet is essentially a modified version of a CNN that trades max pooling for a more data-retentive process called routing by agreement. Rather than only the prediction with the highest score being chosen, the weighted sum of all predictions is considered for classification. Essentially, a CapsNet uses convolution to create first-round predictions for objects (primary capsules) and then uses routing by agreement to predict the presence of higher-level objects (secondary capsules).
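The routing step described above can be illustrated with a minimal, pure-Python sketch of dynamic routing in the style of Sabour et al. (2017). It is not the implementation used in this paper: the function names, the toy prediction vectors, and the list-based data layout are assumptions made only for illustration.

```python
import math

def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vector_length(v):
    return math.sqrt(sum(x * x for x in v))

def route(u_hat, num_iterations=3):
    """Routing by agreement.

    u_hat[i][j] is the prediction vector that primary capsule i makes for
    secondary capsule j. Returns one output vector per secondary capsule.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    v = []
    for _ in range(num_iterations):
        c = [softmax(row) for row in b]  # coupling coefficients per primary capsule
        v = []
        for j in range(n_out):
            # Weighted sum of all predictions for secondary capsule j.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                # Agreement: dot product between a prediction and the output.
                b[i][j] += sum(p * o for p, o in zip(u_hat[i][j], v[j]))
    return v

# Hypothetical toy input: 3 primary capsules, 2 secondary capsules, 2 dimensions.
# The predictions for capsule 0 agree with each other; those for capsule 1 do not.
TOY_PREDICTIONS = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[0.9, 0.1], [0.0, 0.1]],
    [[1.0, -0.1], [-0.1, 0.0]],
]
TOY_OUTPUTS = route(TOY_PREDICTIONS)
```

In this sketch the length of each output vector plays the role of a class score, so the secondary capsule whose incoming predictions agree ends up with the longer vector.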
Many implementations of CapsNets are designed for image recognition (Hinton et al., 2011; Sabour et al., 2017; Hinton et al., 2018). However, the CapsNet architecture is being applied more and more to NLP tasks, including Chinese word segmentation (Li et al., 2018), and multi-label text classification and question answering (Zhao et al., 2019). This paper continues this path by investigating how CapsNets compare to other neural network architectures for the task of part of speech tagging.

Data
Our comparison considers two languages: Spanish 1 and Scottish Gaelic 2. Spanish is a resource-rich language, being the second most spoken language by number of native speakers, the fourth most spoken language by total number of speakers, and the third or fourth most widely used language on the internet 3. Scottish Gaelic is a low-resource language, with 57,375 fluent speakers in Scotland per the 2011 census 4. The Spanish data come from the UD Spanish AnCora treebank 5. The Scottish Gaelic data come from the UD ARCOSG treebank 6. Both corpora use 17 part of speech tag classes.
To study how different low-resource conditions affect training, we artificially create training partitions of different sizes. From the original training data (train100), we create partitions consisting of 50% (train50), 10% (train10), and 1% (train1) of the training sentences. The amount of data for each partition is shown in Table 1 for Spanish and Table 2 for Scottish Gaelic. We use FastText word embeddings (Grave et al., 2018) for both Spanish (2,000,000 words) and Scottish Gaelic (14,318 words). The embedding dimension is 300.
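The partitioning described above could be sketched as follows. The paper does not state how sentences were selected for each partition, so the random shuffle, the fixed seed, the nesting of smaller partitions inside larger ones, and the function name `make_partitions` are all assumptions made for illustration.

```python
import random

def make_partitions(sentences, percents=(50, 10, 1), seed=0):
    """Return a dict mapping partition names to lists of training sentences.

    "train100" is the full training set. Every other partition is a prefix of
    one shared shuffle, so train1 is contained in train10, which is contained
    in train50 (an assumption, not something the paper specifies).
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    partitions = {"train100": list(sentences)}
    for pct in percents:
        # Keep at least one sentence even for very small corpora.
        size = max(1, len(shuffled) * pct // 100)
        partitions[f"train{pct}"] = shuffled[:size]
    return partitions

# Hypothetical stand-in for a treebank: 1,000 placeholder "sentences".
example = make_partitions([f"sentence {i}" for i in range(1000)])
```

Sampling whole sentences rather than tokens keeps each tagged sequence intact, which matches the description of partitions as percentages of the training sentences.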
1 Indo-European, Romance
2 Indo-European, Celtic
3 Third by number of internet users by language, fourth by number of websites by language
4 Only 1.1% of Scotland's population over 3 years old
5 UD Ancora
6 UD Scottish Gaelic ARCOSG

Approach
In this section, we describe the implementation details of our CapsNet, CNN, and LSTM methods. Our CapsNet and CNN implementations build on top of Yeung et al.'s implementation 7, which was kept as close as possible to the architectures described by Sabour et al. (2017). Importantly, we tried to keep all three models as close to each other as possible in order to make our comparison as faithful as possible. However, certain differences persist for this project. For example, the CapsNet and CNN are implemented in Python using TensorFlow 8 and Keras 9, whereas the LSTM is implemented in Scala using DyNet 10. The hyperparameters for our CapsNet and CNN implementations were chosen to be as close as possible to the original implementation. The hyperparameters of the LSTM were chosen to be a reasonable approximation to the CapsNet and CNN models. It is important to note that our comparison does not attempt to compare the best possible configuration of each architecture.

CapsNet Implementation
Our CapsNet model has two 1D convolutional layers, two routing by agreement capsule layers and one fully connected layer. Both convolutional layers have 256 channels, a kernel size of 3, and a stride of 1. The primary capsule layer has 160 capsules with 8 dimensions, a kernel size of 3 and stride of 1. There are 17 secondary capsules with dimensions of 16 and 3 dynamic routing passes.

CNN Implementation
Our CNN model has three 1D convolutional layers, a max pooling layer, and two fully connected layers. The first two convolutional layers are identical to the first two layers of the CapsNet. The third convolutional layer has 128 channels, a kernel size of 3, and a stride of 1. The two fully connected layers have sizes of 328 and 192. These settings were chosen to make the CNN implementation as comparable as possible to the CapsNet implementation.

LSTM Implementation
The LSTM code we used is a reimplementation of the LSTM-CRF approach of Lample et al. (2016). To make this implementation as similar as possible to the previous two approaches, we: (a) removed the CRF layer, 11 and (b) removed the character-level biLSTM encoder from the word embeddings. Thus, the actual LSTM architecture used consists of three layers: (i) an input layer with 300-dimensional FastText word embeddings; (ii) one biLSTM intermediate layer, where each LSTM has a hidden state of dimension 128; and (iii) a linear output layer coupled with a softmax function to output the POS tags.
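To give a rough sense of the size of this model, the following sketch counts trainable parameters for the three-layer architecture described above. It assumes a textbook LSTM cell (four gates, each with input weights, recurrent weights, and one bias vector), an output layer that reads the concatenated forward and backward states, and 17 output tags. The actual DyNet builder may use a different cell variant, so the numbers are indicative only, and the function names are hypothetical.

```python
def lstm_params(input_dim, hidden_dim):
    """Parameters of one textbook LSTM cell.

    Four gates (input, forget, output, cell candidate), each with an
    input weight matrix, a recurrent weight matrix, and a bias vector.
    """
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def tagger_params(embed_dim=300, hidden_dim=128, num_tags=17):
    """Parameter counts for a biLSTM tagger, excluding the word embeddings."""
    bilstm = 2 * lstm_params(embed_dim, hidden_dim)  # forward and backward LSTMs
    output = (2 * hidden_dim) * num_tags + num_tags  # linear layer plus bias
    return {"bilstm": bilstm, "output": output, "total": bilstm + output}

counts = tagger_params()
```

Under these assumptions the biLSTM and output layer together hold 443,665 parameters, almost all of them in the recurrent layer.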

Results
In addition to our four training data conditions per language, we evaluate the effect of learning the word embeddings during training for all models ("learn" vs. "no learn"), yielding 24 training conditions per language. We trained all models five times, each with a different random seed, and averaged the results. Each condition trained for 10 epochs, with early stopping after 2 epochs if the loss did not improve. The results are given in Table 3 (Spanish) and Table 4 (Scottish Gaelic). We report Precision, Recall, F1, training time, and prediction time. These results show a few trends:
1. The LSTM always outperforms the CapsNet and CNN for Spanish, but the CapsNet and CNN occasionally outperform the LSTM for Scottish Gaelic, whose training dataset is an order of magnitude smaller than the Spanish one.
2. The difference in F1 in the "no learn" condition between the Spanish 10% and Scottish Gaelic 100% partitions, which have a comparable number of sentences, is greater for the LSTM (down 9.93%) than for the CapsNet (down 2.98%) or CNN (up 3.93%). This suggests that properties of the language, not just the amount of data, play a role in performance.
3. [...] F1 improves by 0.01%, but the CapsNet and CNN models improve by 4.82% and 3.39%, respectively.
4. Another obvious difference is in the model training and prediction times. Training the CapsNet and CNN is much slower than training the LSTM. However, for Scottish Gaelic the CapsNet is much faster than the LSTM at prediction time. This is an encouraging result, considering that our CapsNet implementation is in Python, whereas the LSTM is implemented in a faster Scala framework.
Overall, the LSTM performs best in most conditions, but the CapsNet often comes close. The CapsNet also usually outperforms the CNN. These performance differences are potentially offset by faster prediction time, depending on the language. The balance between predictive accuracy, training time, and prediction time can be delicate, especially when looking at low-resource languages. These results suggest that, depending on the use case, a CapsNet architecture may be preferable to an LSTM, even though the LSTM tends to perform best when more resources are available, under the common hyperparameters investigated here.
We also compared different hyperparameters for the LSTM and CapsNet; the results are shown in Table 5. The values we chose for the LSTM hidden state size and the CapsNet capsule layer kernel size perform the best in nearly all conditions.
11 In initial experiments we observed that the CRF layer had a major contribution to other sequence models such as named entity recognition, but no impact on POS tagging.
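The experimental grid and stopping rule described at the start of this section can be sketched as follows. The sketch reads "early stopping after 2 epochs if the loss did not improve" as a patience of two consecutive epochs without a new best loss, which is an assumption, as are the function names and the example loss values.

```python
from itertools import product

MODELS = ("LSTM", "CNN", "CapsNet")
PARTITIONS = ("train100", "train50", "train10", "train1")
EMBEDDINGS = ("learn", "no learn")

def training_conditions():
    """All (model, partition, embedding mode) combinations for one language."""
    return list(product(MODELS, PARTITIONS, EMBEDDINGS))

def epochs_run(losses, max_epochs=10, patience=2):
    """Number of epochs actually trained, given one loss value per epoch.

    Training stops early once the loss has failed to improve on the best
    value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(losses), max_epochs)

conditions = training_conditions()
# Made-up losses: the third and fourth epochs do not beat the second.
stopped_at = epochs_run([1.0, 0.8, 0.9, 0.85, 0.7])
```

With three models, four partitions, and two embedding modes, the grid contains 24 conditions per language, matching the count given in the text.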

Conclusion
In this paper, we compare the performance of three neural network architectures (LSTM, CNN, and CapsNet) on part of speech tagging and find that LSTMs are not always better under the common hyperparameters investigated. We examine how the best performing model changes under different high- and low-resource training conditions using Spanish and Scottish Gaelic. We show that the relatively new CapsNet architecture performs nearly as well as the more complex LSTM under certain conditions and outperforms the CNN under most conditions we examined. These results suggest that there is no single clear choice of model architecture, and that the properties of a language and the amount of training data can affect which architecture performs best. Future work should address the limitations of this paper. Specifically, future effort should consider more training conditions, including other languages; the consistency of these results within groups of similar languages; and bringing the implementations of these architectures closer together, to ensure that the performance differences are due to the architecture and not an artifact of how each was implemented.