A General-Purpose Tagger with Convolutional Neural Networks

We present a general-purpose tagger based on convolutional neural networks (CNN), used for both composing word vectors and encoding context information. The CNN tagger is robust across different tagging tasks: without task-specific tuning of hyper-parameters, it achieves state-of-the-art results in part-of-speech tagging, morphological tagging and supertagging. The CNN tagger is also robust against the out-of-vocabulary problem; it performs well on artificially unnormalized texts.

In this paper, we present a state-of-the-art general-purpose tagger that uses CNNs both to compose word representations from characters and to encode context information for tagging. We show that the CNN model is more capable than the LSTM model for both functions, and more stable for unseen or unnormalized words, which is the main benefit of character composition models. Yu and Vu (2017) compared the performance of CNN and LSTM as character composition models for dependency parsing, and concluded that CNN performs better than LSTM. In this paper, we show that this is also the case for POS tagging. Furthermore, we extend the scope to morphological tagging and supertagging, in which the tag set is much larger or long-distance dependencies between words are more important.
In these three tagging tasks, we compare our tagger with the bilstm-aux tagger (Plank et al., 2016) and the CRF-based morphological tagger MarMot (Müller et al., 2013) as baselines. The CNN tagger shows robust performance across the three tasks, and achieves the highest average accuracies in all tasks. It considerably outperforms the LSTM tagger in morphological tagging and both baselines in supertagging.
To test the robustness of the taggers against the OOV problem, we also conduct experiments on unnormalized text by artificially corrupting words in the normal dev sets. With increasing degrees of unnormalization, the performance of the CNN tagger degrades much more slowly than that of the other two, which suggests that the CNN tagger is more robust against unnormalized text.
Therefore, we conclude that our CNN tagger is a robust state-of-the-art general-purpose tagger that can effectively compose word representations from characters and encode context information.

Model
Our proposed CNN tagger has two main components: the character composition model and the context encoding model. Both components are essentially very similar CNN models that capture different levels of information: the first CNN captures morphological information from character n-grams, the second captures contextual information from word n-grams. Figure 1 shows a diagram of both models of the tagger.

Character Composition Model
The character composition model is similar to Yu and Vu (2017), where several convolution filters are used to capture character n-grams of different sizes. The outputs of each convolution filter are fed through a max pooling layer, and the pooling outputs are concatenated to represent the word.
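The paper does not specify an implementation framework, so the following is a PyTorch sketch of the character composition model under stated assumptions: the filter widths (3, 5, 7, 9) and 25 output channels per filter follow the hyper-parameters given later in the paper, while the character embedding size (50 here) and the padding scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character composition: convolutions over character embeddings,
    one filter per n-gram width, max-pooled over positions and
    concatenated to form the composed word vector."""

    def __init__(self, n_chars, char_dim=50, widths=(3, 5, 7, 9), channels=25):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, channels, kernel_size=w, padding=w // 2)
            for w in widths
        )

    def forward(self, char_ids):
        # char_ids: (batch, 32) padded character indices for one word
        x = self.embed(char_ids).transpose(1, 2)      # (batch, char_dim, 32)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)               # (batch, 4 * 25) = (batch, 100)
```

With the hyper-parameters above, the composed vector is 100-dimensional regardless of word length, since max pooling collapses the position axis.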

Context Encoding Model
The context encoding model captures the context information of the target word by scanning through the word representations of its context window. The word representation can be word embeddings only (w), composed vectors only (c), or the concatenation of both (w+c).
A context window consists of N words to both sides of the target word and the target word itself. To indicate the target word, we concatenate a binary feature to each of the word representations, with 1 indicating the target and 0 otherwise, similar to Vu et al. (2016). In addition to the binary feature, we also concatenate a position embedding to encode the relative position of each context word, similar to Gehring et al. (2017).
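The window construction above can be sketched as follows (a PyTorch sketch; the position embedding dimensionality, 10 here, is an assumption, as the paper does not state it):

```python
import torch
import torch.nn as nn

class ContextWindow(nn.Module):
    """Builds the context-encoder input: for each of the 2N+1 words in the
    window, concatenate its representation with a binary target-word flag
    and a learned embedding of its relative position."""

    def __init__(self, n=7, pos_dim=10):
        super().__init__()
        self.n = n
        self.pos_embed = nn.Embedding(2 * n + 1, pos_dim)

    def forward(self, word_reprs):
        # word_reprs: (batch, 2N+1, d), target word at the centre index self.n
        batch, win, _ = word_reprs.shape
        flag = torch.zeros(batch, win, 1)
        flag[:, self.n, 0] = 1.0                       # 1 marks the target word
        pos = self.pos_embed(torch.arange(win)).expand(batch, win, -1)
        return torch.cat([word_reprs, flag, pos], dim=2)
```

With N=7 and 100-dimensional word representations, each window position becomes a 111-dimensional vector (100 + 1 flag + 10 position dimensions).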

Hyper-parameters
For the character composition model, we take a fixed input size of 32 characters for each word, with padding on both sides or cutting from the middle if needed. We apply four convolution filters with sizes of 3, 5, 7, and 9. Each filter has an output channel of 25 dimensions, thus the composed vector is 100-dimensional. We apply Gaussian noise with standard deviation of 0.1 on the composed vector during training.
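The fixed-size character input can be prepared with a small helper like the one below (a sketch; the paper says words are padded on both sides or cut from the middle, so we assume symmetric padding and that the prefix and suffix are kept when cutting):

```python
def fix_length(chars, size=32, pad="<PAD>"):
    """Pad a character sequence evenly on both sides to `size`, or cut
    from the middle if it is too long, keeping prefix and suffix."""
    if len(chars) >= size:
        half = size // 2
        # keep the first `half` and the last `size - half` characters
        return chars[:half] + chars[len(chars) - (size - half):]
    total = size - len(chars)
    left = total // 2
    return [pad] * left + chars + [pad] * (total - left)
```

Cutting from the middle rather than the end preserves both prefix and suffix character n-grams, which carry most of the morphological signal.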
For the context encoding model, we take a context window of 15 words (7 words to both sides of the target word) as input and predict the tag of the target word. We also apply four convolution filters, with sizes of 2, 3, 4, and 5; each filter is stacked with another filter of the same size, and the output has 128 dimensions, so the context representation is 512-dimensional. We apply one 512-dimensional hidden layer with ReLU non-linearity before the prediction layer, and dropout with probability of 0.1 after the hidden layer during training.
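A PyTorch sketch of the context encoder with these hyper-parameters follows; the placement of the ReLU between the two stacked convolutions and the max pooling over window positions are assumptions, as the paper does not spell them out:

```python
import torch
import torch.nn as nn

class ContextCNN(nn.Module):
    """Context encoder: four pairs of stacked convolutions (widths 2-5,
    128 channels each), max-pooled and concatenated to 512 dimensions,
    followed by a 512-unit ReLU hidden layer and the prediction layer."""

    def __init__(self, in_dim, n_tags, widths=(2, 3, 4, 5), channels=128):
        super().__init__()
        self.stacks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(in_dim, channels, w, padding=w // 2),
                nn.ReLU(),
                nn.Conv1d(channels, channels, w, padding=w // 2),
            )
            for w in widths
        )
        self.hidden = nn.Linear(len(widths) * channels, 512)
        self.out = nn.Linear(512, n_tags)

    def forward(self, window):
        # window: (batch, 15, in_dim) representations of the context window
        x = window.transpose(1, 2)                      # (batch, in_dim, 15)
        pooled = [s(x).max(dim=2).values for s in self.stacks]
        h = torch.relu(self.hidden(torch.cat(pooled, dim=1)))
        return self.out(h)                              # (batch, n_tags)
```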
The model is trained with averaged stochastic gradient descent with a learning rate of 0.1, momentum of 0.9 and mini-batch size of 100. We apply L2 regularization with a rate of 10 −5 on all the parameters of the network except the embeddings.
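The regularization scheme (L2 on all parameters except the embeddings) can be expressed with parameter groups, as in this sketch. Note one deliberate simplification: plain SGD with momentum stands in for the paper's *averaged* SGD; weight averaging could be layered on top, e.g. with torch.optim.swa_utils.AveragedModel. The rule that any parameter whose name contains "embed" is an embedding is an assumption about naming.

```python
import torch
import torch.nn as nn

def make_optimizer(model, lr=0.1, momentum=0.9, l2=1e-5):
    """SGD with momentum; L2 regularization (weight decay) on all
    parameters except embeddings, via two parameter groups."""
    embed, rest = [], []
    for name, p in model.named_parameters():
        (embed if "embed" in name else rest).append(p)
    return torch.optim.SGD([
        {"params": rest, "weight_decay": l2},    # L2 on network weights
        {"params": embed, "weight_decay": 0.0},  # no L2 on embeddings
    ], lr=lr, momentum=momentum)
```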

Data
We use treebanks from version 1.2 of Universal Dependencies (UD); when there are several treebanks for one language, we only use the canonical one. There are in total 22 treebanks, as in Plank et al. (2016). Each treebank is split into train, dev, and test sets; we use the dev sets for early stopping during training.
In order to compare with more previous work on POS tagging, we additionally experiment with POS tagging on the more established Penn Treebank Wall Street Journal (WSJ) data set (Marcus et al., 1993). We use the standard split, where sections 0-18 are used for training, 19-21 for tuning, and 22-24 for testing (Collins, 2002).
For POS tagging we use Universal POS tags, which are an extension of Petrov et al. (2012). The universal tag set tries to capture the "universal" properties of words and facilitate cross-lingual learning. Therefore the tag set is very coarse and leaves out most of the language-specific properties to morphological features.
Morphological tags encode the language-specific morphological features of the words, e.g., number, gender, and case. They are represented in the UD treebanks as one string which contains several key-value pairs of morphological features.[4] Supertags (Joshi and Bangalore, 1994) are tags that encode more syntactic information than standard POS tags, e.g., the head direction or the subcategorization frame. We use dependency-based supertags (Foth et al., 2006), which are extracted from the dependency treebanks. Adding such tags to the feature models of statistical dependency parsers significantly improves their performance (Ouchi et al., 2014; Faleńska et al., 2015). Supertags can be designed with different levels of granularity. We use the standard Model 1 from Ouchi et al. (2014), where each tag consists of head direction, dependency label, and dependent directions. The SUPER task is more difficult than POS and MORPH because it generally requires taking long-distance dependencies between words into consideration.
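Splitting a UD morphological tag string into key-value pairs, as the CRF baseline does (the neural taggers keep the whole string as one tag), is straightforward; a minimal sketch:

```python
def parse_morph(tag):
    """Split a UD FEATS string such as 'Case=Nom|Gender=Fem|Number=Sing'
    into a dict of key-value pairs; '_' marks an empty feature set."""
    if tag in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in tag.split("|"))
```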
These three tagging tasks differ strongly in tag set sizes. Generally, the POS set sizes for all the languages are no more than 17, and the SUPER set sizes are around 200. When treating morphological features as a single string (i.e., not splitting them into key-value pairs), the sizes of the MORPH tag sets range from about 100 up to 2000.

Setups
As baselines to our models, we take the two state-of-the-art taggers MarMot[5] (denoted as CRF) and bilstm-aux[6] (denoted as LSTM). We train the taggers with the recommended hyper-parameters from their documentation.
To ensure a fair comparison (especially between LSTM and CNN), we generally treat the three tasks equally and do not apply task-specific tuning, i.e., we use the same features and the same model hyper-parameters for each task. Also, we do not use any pre-trained word embeddings.
For the LSTM tagger, we use the recommended hyper-parameters from the documentation,[7] including 64-dimensional word embeddings (w) and 100-dimensional composed vectors (c). We train the w, c and w+c models as in Plank et al. (2016).

[4] German, French and Indonesian do not have MORPH tags in UD-1.2, and are thus not evaluated in this task.
[5] http://cistern.cis.lmu.de/marmot/
[6] https://github.com/bplank/bilstm-aux
[7] We use the most recent version of the tagger, stacking 3 layers of LSTM as recommended. The average accuracy for POS in our evaluation is slightly lower than reported in the paper, presumably due to different versions of the tagger, but it does not influence the conclusion.
We train the CNN taggers with the same dimensionalities for word representations.
For the CRF tagger, we predict POS and MORPH jointly as in the standard setting, which performs much better than separate predictions, as shown in Müller et al. (2013). Also, the CRF tagger splits the morphological tags into key-value pairs, whereas the two neural taggers treat the whole string as a tag. We predict SUPER as a separate task.

Results
The test results for the three tasks are shown in Table 1, in three groups. The first group of seven columns contains the results for POS, where both LSTM and CNN have three variations of input features: word only (w), character only (c), and both (w+c). For MORPH and SUPER, we only use the w+c setting for both LSTM and CNN.
On macro-average, the three taggers perform similarly on the POS task, with the CNN tagger being slightly better. In the MORPH task, CNN is again slightly ahead of CRF, while LSTM is about 2 points behind. In the SUPER task, CNN outperforms both taggers by a large margin: 2 points higher than LSTM and 8 points higher than CRF.
Comparing the input features of the LSTM and CNN taggers, both perform similarly with w as input, which suggests that the two taggers are comparable at encoding context for POS. However, with only c, CNN performs much better than LSTM (95.54 vs. 92.61), and close to w+c (96.18). Also, c consistently outperforms w for all languages with CNN. This suggests that the CNN character composition model alone is capable of learning most of the information that the word-level model can learn, while the LSTM model is not.
The more interesting cases are MORPH and SUPER, where CNN performs much better than LSTM. One potential explanation for the considerably large difference is that the LSTM tagger may be more sensitive to hyper-parameters and require task-specific tuning. We use the same setting tuned for the POS task, so it underperforms on the other tasks. Another factor could be the large tag sets of the MORPH task, which are orders of magnitude larger than POS, especially for cs, eu, fi, hr, pl, and sl, all of which have more than 500 distinct tags; the LSTM tagger performs poorly on these languages. In the SUPER task, where information from long-distance context is more important, CNN performs much better than both CRF and LSTM. CRF simply has a much smaller context window, hence its poor performance. The LSTM model can theoretically model long-distance contexts, but the information may gradually fade away during the recurrence, whereas the CNN model treats all words equally as long as they are in the context window.

Table 1: Tagging accuracies of the three taggers on the three tasks on the test sets of UD 1.2; the highest accuracy for each task on each language is marked in boldface.
On the more established WSJ data set, Table 2 shows the tagging performance of the CNN model along with some previous work as reference. Generally, the differences among the taggers are very small, and we cannot conclude that any one of them is considerably better on this data set. This result is expected, since English is not a morphologically rich language and WSJ is a large data set with a relatively low OOV rate. Note that the Convnet tagger of dos Santos and Zadrozny (2014) used pre-trained word embeddings while our CNN tagger does not.

Unnormalized Text
It is a common scenario to use a model trained on news data to process text from social media, which may include intentional or unintentional misspellings. Unfortunately, we do not have social media data for all the languages. However, we design an experiment to simulate unnormalized text, by systematically editing the words in the dev sets with one of four operations: insertion, deletion, substitution, and swap. For example, if we modify the word abcdef at position 2 (0-based), the modified words would be abxcdef, abdef, abxdef, and abdcef, respectively, where x is a random character from the alphabet of the language. For each operation, we create a group of modified dev sets, where all words longer than two characters are edited by the operation with a probability of 0.25, 0.5, 0.75, or 1. For each language, we use the models trained on the normal training sets and predict POS for the modified dev sets. The average accuracies are shown in Figure 2.

Table 2: POS tagging accuracies on the WSJ test set.

CRF (Müller et al., 2013)                97.30
Convnet (dos Santos and Zadrozny, 2014)  97.32
bi-LSTM (Ling et al., 2015)              97.36
bi-LSTM (Plank et al., 2016)             97.22
CNN (this work)                          97.30
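The four corruption operations described above can be sketched as follows; how the edit position is sampled is not specified in the paper, so the uniform sampling here is an assumption:

```python
import random

def corrupt(word, op, alphabet, rng=random):
    """Apply one unnormalization edit at a random position:
    insertion, deletion, substitution, or swap of adjacent characters.
    Words of length <= 2 are left unedited, as in the experiment."""
    if len(word) <= 2:
        return word
    i = rng.randrange(0, len(word) - 1)    # leave room for the swap case
    x = rng.choice(alphabet)               # random character of the language
    if op == "insertion":
        return word[:i] + x + word[i:]
    if op == "deletion":
        return word[:i] + word[i + 1:]
    if op == "substitution":
        return word[:i] + x + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    raise ValueError(op)
```

With the paper's example (position 2, random character x), this reproduces abxcdef, abdef, abxdef, and abdcef for the word abcdef.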
Generally, all models suffer from increasing degrees of unnormalization, but CNN always degrades the least and the most slowly. In the extreme case where almost all words are unnormalized, CNN performs 4 to 8 points higher than LSTM and 4 to 12 points higher than CRF. This suggests that the CNN is more robust to misspelt words.
Looking into the specific cases of misspelling, CNN is least sensitive to substitution, while insertion and deletion have a stronger effect, and swap degrades its performance the most. In the case of substitution, the distortion to the character n-gram patterns is smaller than with the length-changing edits, i.e., insertion and deletion, and thus has a smaller negative impact. In the case of swap, however, the effect is similar to substituting two characters instead of one, hence the larger degradation. LSTM and CRF, on the other hand, are affected the most by substitution.

Conclusion
In this paper, we propose a general-purpose tagger that uses two CNNs for both character composition and context encoding. On the universal dependency treebanks (v1.2), the tagger achieves state-of-the-art or comparable results for POS tagging and morphological tagging, and to the best of our knowledge, it performs by far the best for supertagging. The tagger works well across different tagging tasks without tuning the hyper-parameters, and it is also robust against unnormalized text. Our tagger uses a greedy window-based approach, which mainly aims at showing the effectiveness of CNN in composing word representations and encoding contexts. However, a globally normalized decoding method, e.g. beam-search or sentence-level inference as in Collobert et al. (2011), could potentially further improve the tagger's performance, which is left for future work.