Character Composition Model with Convolutional Neural Networks for Dependency Parsing on Morphologically Rich Languages

We present a transition-based dependency parser that uses a convolutional neural network to compose word representations from characters. The character composition model shows great improvement over the word-lookup model, especially for parsing agglutinative languages. These improvements are even larger than those obtained with pre-trained word embeddings from extra data. On the SPMRL data sets, our system outperforms the previous best greedy parser (Ballesteros et al., 2015) by a margin of 3% on average.


Introduction
As with many other NLP tasks, dependency parsing suffers from the out-of-vocabulary (OOV) problem, and probably more than others, since training data with syntactic annotation is usually scarce. This problem is particularly severe when the target is a morphologically rich language. For example, in the SPMRL shared task data sets (Seddah et al., 2013, 2014), 4 out of 9 treebanks contain more than 40% word types in the development set that are never seen in the training set. One way to tackle the OOV problem is to pre-train the word embeddings, e.g., with word2vec (Mikolov et al., 2013), on a large set of unlabeled data. This comes with two main advantages: (1) more word types, which means that the vocabulary is extended by the unlabeled data, so that some of the OOV words now have a learned representation; (2) more word tokens per type, which means that the syntactic and semantic similarities of the words are modeled better than with the parser training data alone.
Pre-trained word embeddings can alleviate the OOV problem by expanding the vocabulary, but they do not model morphological information. Instead of looking up word embeddings, many researchers propose to compose the word representation from characters for various tasks, e.g., part-of-speech tagging (dos Santos and Zadrozny, 2014; Plank et al., 2016), named entity recognition (dos Santos and Guimarães, 2015), language modeling (Ling et al., 2015), and machine translation (Costa-jussà and Fonollosa, 2016). In particular, Ballesteros et al. (2015) use a bidirectional long short-term memory (LSTM) character model for dependency parsing. Kim et al. (2016) present a convolutional neural network (CNN) character model for language modeling, but make no comparison among the character models, and state that "it remains open as to which character composition model (i.e., LSTM or CNN) performs better".
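The type-level OOV rate quoted above can be computed as in the following minimal sketch. The function name `oov_type_rate` and the toy token lists are illustrative assumptions, not part of the shared task tooling.

```python
def oov_type_rate(train_tokens, dev_tokens):
    """Fraction of distinct development word types never seen in training."""
    train_types = set(train_tokens)
    dev_types = set(dev_tokens)
    if not dev_types:
        return 0.0
    unseen = dev_types - train_types
    return len(unseen) / len(dev_types)


# Toy data (hypothetical): two of the four development types are unseen.
train = ["the", "dog", "barks", "the", "cat"]
dev = ["the", "dogs", "bark", "cat"]
rate = oov_type_rate(train, dev)
print(f"OOV type rate: {rate:.0%}")
```

The rate is computed over word types rather than tokens, matching the "word types in the development set" wording above.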
We propose to apply the CNN model by Kim et al. (2016) in a greedy transition-based dependency parser with feed-forward neural networks (Chen and Manning, 2014;Weiss et al., 2015). This model requires no extra unlabeled data but performs better than using pre-trained word embeddings. Furthermore, it can be combined with word embeddings from the lookup table since they capture different aspects of word similarities.
Experimental results show that the CNN model works especially well on agglutinative languages, where the OOV rates are high. On other morphologically rich languages, the CNN model also performs at least as well as the word-lookup model.
Furthermore, our CNN model outperforms both the original and our re-implementation of the bidirectional LSTM model by Ballesteros et al. (2015) by a large margin. This provides empirical evidence on the aforementioned open question, suggesting that the CNN is the better character composition model for dependency parsing.

Baseline Parsing Model
As the baseline parsing model, we re-implement the greedy parser of Weiss et al. (2015) with some modifications, outlined below, which bring about 0.5% improvement. Since most treebanks contain non-projective trees, we use an approximate non-projective transition system similar to Attardi (2006). It adds two transitions (LEFT-2 and RIGHT-2) to the Arc-Standard system (Nivre, 2004) that attach the top of the stack to the third token on the stack, or vice versa. We also extend the feature templates in Weiss et al. (2015) by extracting the children of the third token in the stack. The complete transitions and feature templates are listed in Appendix A.
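The extended transition system can be illustrated with the following minimal sketch. It assumes that LEFT-2 makes the third token a dependent of the top of the stack and that RIGHT-2 makes the top a dependent of the third token; the exact definitions are those of Appendix A, so the direction convention, the function name `parse`, and the toy sentence are all assumptions. Labels and precondition checks are omitted.

```python
def parse(num_tokens, transitions):
    """Run a transition sequence over tokens 1..num_tokens (0 is the root).

    Returns the (head, dependent) arcs in the order they were built.
    The top of the stack is the last list element.
    """
    stack, buffer, arcs = [0], list(range(1, num_tokens + 1)), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT":       # second token becomes a dependent of the top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT":      # top becomes a dependent of the second token
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        elif t == "LEFT-2":     # assumed: third token depends on the top
            dep = stack.pop(-3)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-2":    # assumed: top depends on the third token
            dep = stack.pop()
            arcs.append((stack[-2], dep))
        else:
            raise ValueError(f"unknown transition: {t}")
    return arcs


# Hypothetical 4-token tree in which the arcs 1->3 and 2->4 cross,
# so plain Arc-Standard could not build it.
sequence = ["SHIFT", "SHIFT", "SHIFT", "RIGHT-2",
            "SHIFT", "RIGHT", "RIGHT", "RIGHT"]
arcs = parse(4, sequence)
print(arcs)
```

The single RIGHT-2 step is what lets token 3 attach to token 1 while token 2 is still on the stack waiting for its own dependent.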
Note that Weiss et al. (2015) directly concatenate the embeddings of the words, tags, and labels of all the tokens together as input to the hidden layer. Instead, we first group the embeddings of the word, tag, and label of each token and compute an intermediate representation with shared parameters, then concatenate all the representations as input to the hidden layer.
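The grouping step can be sketched as follows. This is only an illustration under stated assumptions: the word, tag, and label embeddings of a token are concatenated and passed through one weight matrix `W` and bias `b` shared by all tokens, with a ReLU nonlinearity. The nonlinearity, the helper names, and the toy dimensions are assumptions and are not taken from Equation (1).

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def token_representation(word_vec, tag_vec, label_vec, W, b):
    """Intermediate representation of one token (ReLU is an assumption)."""
    grouped = word_vec + tag_vec + label_vec
    return [max(0.0, h + bi) for h, bi in zip(matvec(W, grouped), b)]


def hidden_layer_input(tokens, W, b):
    """Concatenate the per-token representations; W and b are shared."""
    out = []
    for word_vec, tag_vec, label_vec in tokens:
        out.extend(token_representation(word_vec, tag_vec, label_vec, W, b))
    return out


# Toy sizes (hypothetical): word dim 2, tag dim 1, label dim 1, output dim 2.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, -1.0]]
b = [0.0, 0.0]
tokens = [([1.0, 2.0], [0.5], [3.0]),
          ([0.0, 1.0], [1.0], [0.0])]
hidden = hidden_layer_input(tokens, W, b)
print(hidden)
```

Because `W` is shared, the number of parameters in this step does not grow with the number of tokens in the feature templates.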

LSTM Character Composition Model
To tackle the OOV problem, we want to replace the word-lookup table with a function that composes the word representation from characters.
As a baseline character model, we re-implement the bidirectional LSTM character composition model following Ballesteros et al. (2015). We replace the lookup table in the baseline parser with the final outputs of the forward and backward LSTMs, and Equation (1) is changed accordingly so that these outputs take the place of the word embedding. We refer the readers to Ling et al. (2015) for the details of the bidirectional LSTM.

CNN Character Composition Model
In contrast to the LSTM model, we propose to use a "flat" CNN as the character composition model, similar to Kim et al. (2016). Equation (1) is thus replaced with Equation (2). Concretely, the input of the model is a concatenated matrix of character embeddings $C \in \mathbb{R}^{d_i \times n}$, where $d_i$ is the dimensionality of character embeddings (number of input channels) and $n$ is the length of the padded word. We apply convolutional kernels $K \in \mathbb{R}^{d_o \times d_i \times k}$ with ReLU nonlinearity on the input, where $d_o$ is the number of output channels and $k$ is the length of the kernel. The output of the convolution operation is $O \in \mathbb{R}^{d_o \times (n-k+1)}$, and we apply max-over-time pooling, which takes the maximum activation of the kernel along each channel, obtaining the final output $o \in \mathbb{R}^{d_o}$. This output corresponds to the most salient n-gram representation of the word, as used in Equation (2). We then concatenate the outputs of several such CNNs with different kernel lengths, so that the information from different n-grams is extracted and can interact.

Table 1: LAS on the test sets; the best LAS in each group is marked in bold face.
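The convolution, ReLU, and max-over-time pooling steps of the CNN character model can be sketched in plain Python as below. This is a minimal illustration, not the parser's implementation: the function names, the toy kernels, and the memory layout (each filter stored as $k$ rows of $d_i$ weights rather than a $d_o \times d_i \times k$ tensor) are assumptions.

```python
def conv_relu_maxpool(chars, kernel):
    """Apply one kernel group to a padded word.

    chars:  n columns, each a list of d_i floats (one per character).
    kernel: d_o filters, each a list of k rows of d_i weights.
    Returns d_o values: ReLU activation, then max over n - k + 1 positions.
    """
    n = len(chars)
    pooled = []
    for filt in kernel:
        k = len(filt)
        best = 0.0  # ReLU outputs are never below zero
        for start in range(n - k + 1):
            act = sum(w * x
                      for row, col in zip(filt, chars[start:start + k])
                      for w, x in zip(row, col))
            best = max(best, max(0.0, act))
        pooled.append(best)
    return pooled


def compose_word(chars, kernels):
    """Concatenate the pooled outputs of kernels with different lengths."""
    out = []
    for kernel in kernels:
        out.extend(conv_relu_maxpool(chars, kernel))
    return out


# Toy input (hypothetical): n = 4 characters, d_i = 1 channel each.
chars = [[1.0], [2.0], [3.0], [-1.0]]
bigram_sum = [[[1.0], [1.0]]]                 # d_o = 1, k = 2
trigram_neg = [[[-1.0], [-1.0], [-1.0]]]      # d_o = 1, k = 3
word_vector = compose_word(chars, [bigram_sum, trigram_neg])
print(word_vector)
```

In the toy run the bigram filter fires most strongly on the window (2, 3), while the trigram filter is negative on every window and is clipped to zero by the ReLU.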

Experimental Setup
We conduct our experiments on the treebanks from the SPMRL 2014 shared task (Seddah et al., 2013, 2014), which includes 9 morphologically rich languages: Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish, and Swedish. All the treebanks are split into training, development, and test sets by the shared task organizers. We use the fine-grained predicted POS tags provided by the organizers, and evaluate the labeled attachment scores (LAS) including punctuation. We experiment with the CNN-based character composition model (CNN) along with several baselines. The first baseline (WORD) uses the word-lookup model described in Section 2.1 with randomly initialized word embeddings. The second baseline (W2V) uses word embeddings pre-trained with word2vec (Mikolov et al., 2013), using the CBOW model and default parameters, on the unlabeled texts from the shared task organizers. The third baseline (LSTM) uses a bidirectional LSTM as the character composition model following Ballesteros et al. (2015). Appendix C lists the hyper-parameters of all the models.
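For reference, LAS counts a token as correct only when both its predicted head and its predicted label match the gold annotation. A minimal sketch follows, assuming each token is represented as a `(head, label)` pair and that punctuation tokens are simply kept in the lists; the function name `las` and the toy sentence are illustrative, not the official evaluation script.

```python
def las(gold, pred):
    """Labeled attachment score in percent.

    gold, pred: lists of (head_index, label) pairs, one per token,
    in the same order. Punctuation tokens are included.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)


# Toy sentence (hypothetical): the second token has the right head
# but the wrong label, so it does not count.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "punct")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (3, "punct")]
score = las(gold, pred)
print(f"LAS = {score:.2f}%")
```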
Further analysis suggests that combining the character composition models with word-lookup models could be beneficial since they capture different aspects of word similarities (orthographic vs. syntactic/semantic). We therefore experiment with four combined models in two groups: (1) randomly initialized word embeddings (LSTM+WORD vs. CNN+WORD), and (2) pre-trained word embeddings (LSTM+W2V vs. CNN+W2V).
The experimental results are shown in Table 1, with Int denoting internal comparisons (in three groups) and Ext denoting external comparisons; the highest LAS in each group is marked in bold face.

Internal Comparisons
In the first group, we compare the LAS of the four single models WORD, W2V, LSTM, and CNN. In the macro average over all languages, the CNN model performs 2.17% higher than the WORD model, and 1.24% higher than the W2V model. The LSTM model, however, performs only 0.9% higher than the WORD model and 1.27% lower than the CNN model.
The CNN model shows large improvements in four languages: three agglutinative languages (Basque, Hungarian, Korean), and one highly inflected fusional language (Polish). They all have high OOV rates and are thus difficult for the baseline parser, which does not model morphological information. Also, morphemes in agglutinative languages tend to have unique, unambiguous meanings, which makes them easier for the convolutional kernels to capture.
In the second group, we observe that the additional word-lookup model does not significantly improve the CNN model (from 82.75% in CNN to 82.90% in CNN+WORD on average), while the LSTM model is improved by a much larger margin (from 81.48% in LSTM to 82.56% in LSTM+WORD on average). This suggests that the CNN model has already learned the most important information from the word forms, while the LSTM model has not. Also, the combined CNN+WORD model is still better than the LSTM+WORD model, despite the large improvement in the latter.

External Comparisons
We also report the results of the two models from Ballesteros et al. (2015): B15-WORD with randomly initialized word embeddings and B15-LSTM as their proposed model. Finally, we report the best published results (BestPub) on this data set (Björkelund et al., 2013, 2014). On average, the B15-LSTM model improves on their own baseline by 1.1%, similar to the 0.9% improvement of our LSTM model, which is much smaller than the 2.17% improvement of the CNN model. Furthermore, the CNN model improves on a strong baseline: our WORD model already performs 2.22% higher than the B15-WORD model. Comparing the individual performances on each language, we observe that the CNN model almost always outperforms the WORD model, except for Hebrew. However, both LSTM and B15-LSTM outperform their baselines only on the three agglutinative languages (Basque, Hungarian, and Korean), and fall below them on the other six. Ballesteros et al. (2015) do not examine the effect of adding a word-lookup model to the LSTM model as in our second group of internal comparisons. However, Plank et al. (2016) show that combining the same LSTM character composition model with a word-lookup model improves the performance of POS tagging by a very large margin. This partially confirms our hypothesis that the LSTM model does not learn sufficient information from the word forms.
Considering both internal and external comparisons, in both average and individual performances, we argue that the CNN is more suitable than the LSTM as a character composition model for parsing.
When comparing to the best published results (Björkelund et al., 2013, 2014), we have to note that their approach uses explicit morphological features, ensembling, ranking, etc., all of which can boost parsing performance. We only use a greedy parser with far fewer features, but close more than half of the 6-point gap between the previous best greedy parser and the best published result.

Discussion on CNN and LSTM
We conjecture that the main reason for the better performance of the CNN over the LSTM is its flexibility in processing sub-word information. The CNN model uses different kernels to capture n-grams of different lengths. In our setting, a kernel with the minimum length of 3 can capture short morphemes, and one with the maximum length of 9 can practically capture a normal word. With the flexibility of capturing patterns from morphemes up to words, the CNN model almost always outperforms the word-lookup model. In theory, the LSTM can model much longer sequences; however, it composes the word step by step through recurrence. Such deep network architectures would require more data to learn the same sequence than a CNN, which can directly use a large kernel to match the pattern. For dependency parsing, training data is usually scarce, which could be why the LSTM has not reached its full potential.

Analyses on OOV and Morphology
The motivation for using character composition models is the hypothesis that they can address the OOV problem. To verify the hypothesis, we analyze the LAS improvements of the CNN and LSTM models on the development sets in two cases: (1) both the head and the dependent are in vocabulary, or (2) at least one of them is out of vocabulary. Table 2 shows the results, where the two cases are denoted as ΔIV and ΔOOV. The general trend in the results is that the improvements of both models in the OOV case are larger than in the IV case, which means that the character composition models indeed alleviate the OOV problem. Also, the CNN improves on seven languages in the IV case and eight languages in the OOV case, and it performs consistently better than the LSTM in both cases.
To analyze the informativeness of the morphemes at different positions, we conduct an ablation experiment. We split each word equally into thirds, approximating the prefix, stem, and suffix. Based on that, we construct six modified versions of the development sets, in which we mask one or two third(s) of the characters in each word. Then we parse them with the CNN models trained on normal data. Table 3 shows the degradations of LAS on the six modified data sets compared to parsing the original data, where the position of ♣ signifies the location of the masks. The three agglutinative languages Basque, Hungarian, and Korean suffer the most with masked words. In particular, the suffixes are the most informative for parsing in these three languages, since they cause the most loss when masked, and the least loss when left unmasked. The pattern is quite different on the other languages, in which the distinction of informativeness among the three parts is much smaller.
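The IV/OOV breakdown can be sketched as follows. Each evaluated arc is assumed to be a triple of dependent form, head form, and a flag saying whether the parser got the arc right; treating the artificial root as in-vocabulary, the function name `split_las`, and the toy arcs are all assumptions made for illustration.

```python
def split_las(arcs, vocab):
    """LAS in percent, separately for the IV and OOV cases.

    arcs:  list of (dependent_form, head_form, is_correct) triples.
    vocab: set of word forms seen in the training data.
    An arc is IV only if both the dependent and the head are in vocab.
    """
    counts = {"IV": [0, 0], "OOV": [0, 0]}  # [correct, total]
    for dep, head, correct in arcs:
        case = "IV" if dep in vocab and head in vocab else "OOV"
        counts[case][0] += int(correct)
        counts[case][1] += 1
    return {case: (100.0 * c / t if t else 0.0)
            for case, (c, t) in counts.items()}


# Toy data (hypothetical): "soundly" is unseen in training.
vocab = {"<ROOT>", "the", "dog", "sleeps"}
word_arcs = [("the", "dog", True), ("dog", "sleeps", True),
             ("sleeps", "<ROOT>", True), ("soundly", "sleeps", False)]
cnn_arcs = [("the", "dog", True), ("dog", "sleeps", True),
            ("sleeps", "<ROOT>", True), ("soundly", "sleeps", True)]

base, model = split_las(word_arcs, vocab), split_las(cnn_arcs, vocab)
delta = {case: model[case] - base[case] for case in base}
print(delta)
```

The differences in `delta` play the role of ΔIV and ΔOOV: the improvement of a character model over the word-lookup baseline within each case.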
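The construction of the six masked variants can be sketched as below. The rounding of the split points for words whose length is not divisible by three, the placeholder character `*`, and the function names are assumptions; only the idea of masking one or two thirds of each word comes from the text.

```python
from itertools import combinations


def split_thirds(word):
    """Split a word into three roughly equal parts (rounding is assumed)."""
    n = len(word)
    i, j = n // 3, 2 * n // 3
    return [word[:i], word[i:j], word[j:]]


def mask_word(word, masked, mask="*"):
    """Replace every character in the parts listed in `masked` (0, 1, 2)."""
    parts = split_thirds(word)
    return "".join(mask * len(part) if idx in masked else part
                   for idx, part in enumerate(parts))


def all_mask_patterns():
    """The six patterns: any one third or any two thirds masked."""
    return [set(c) for r in (1, 2) for c in combinations(range(3), r)]


word = "prächtiger"
variants = [mask_word(word, pattern) for pattern in all_mask_patterns()]
print(variants)
```

Masking only the last third of `prächtiger` gives `prächt****`, while masking the first two thirds gives `******iger`, the case in which only the approximate suffix stays visible.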

Conclusion
In this paper, we propose to use a CNN to compose word representations from characters for dependency parsing. Experiments show that the CNN model consistently improves the parsing accuracy, especially for agglutinative languages. In an external comparison on the SPMRL data sets, our system outperforms the previous best greedy parser.
We also provide empirical evidence and analysis, showing that the CNN model indeed alleviates the OOV problem and that it is better suited than the LSTM in dependency parsing.

Table 5: The list of tokens to extract feature templates, where $s_i$ denotes the $i$-th token in the stack, $b_i$ the $i$-th token in the buffer, $lc_i$ the $i$-th leftmost child, and $rc_i$ the $i$-th rightmost child.

B Character Input Preprocessing
For the CNN input, we use a list of characters with a fixed length for batch processing. We add some special symbols apart from the normal letters, digits, and punctuation: <SOW> as the start of the word, <EOW> as the end of the word, <MUL> as multiple characters in the middle of the word squeezed into one symbol, <PAD> as padding equally on both sides, and <UNK> as characters unseen in the training data. For example, if we limit the input length to 9, a short word ein will be converted into <PAD>-<PAD>-<SOW>-e-i-n-<EOW>-<PAD>-<PAD>; a long word prächtiger will be <SOW>-p-r-ä-<MUL>-g-e-r-<EOW>. In practice, we set the length to 32, which is long enough for almost all the words.
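The conversion can be sketched as follows. The sketch reproduces the two examples above, but two details are assumptions: how many characters are kept on each side of <MUL>, and which side receives the extra <PAD> when the padding is odd. The function name `encode_word` and the optional `alphabet` argument are also illustrative.

```python
SOW, EOW, MUL, PAD, UNK = "<SOW>", "<EOW>", "<MUL>", "<PAD>", "<UNK>"


def encode_word(word, length, alphabet=None):
    """Convert a word into exactly `length` character symbols."""
    if length < 3:
        raise ValueError("length must leave room for <SOW>, one symbol, <EOW>")
    # Characters unseen in training become <UNK> (when an alphabet is given).
    chars = [c if alphabet is None or c in alphabet else UNK for c in word]
    room = length - 2  # slots left after <SOW> and <EOW>
    if len(chars) > room:
        # Squeeze the middle of a long word into one <MUL> symbol.
        head = (room - 1) // 2
        tail = room - 1 - head
        chars = chars[:head] + [MUL] + chars[len(chars) - tail:]
    symbols = [SOW] + chars + [EOW]
    pad = length - len(symbols)
    left = pad // 2  # assumed: any extra <PAD> goes on the right
    return [PAD] * left + symbols + [PAD] * (pad - left)


short = "-".join(encode_word("ein", 9))
long_ = "-".join(encode_word("prächtiger", 9))
print(short)
print(long_)
```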

C Hyper-Parameters
The common hyper-parameters of all the models are tuned on the development set in favor of the WORD model:
• 100,000 training steps with random sampling of mini-batches of size 100;
• test on the development set every 2,000 steps;
• early stop if the LAS on the development set does not improve for 3 times in a row;
• learning rate of 0.1, with exponential decay rate of 0.95 for every 2,000 steps;
• L2-regularization rate of $10^{-4}$;
• averaged SGD with momentum of 0.9;
• parameters are initialized following He et al. (2015);
• dimensionality of the embeddings of each word, tag, and label is 256, 32, and 32, respectively;
• dimensionality of the hidden layers is 512 and 256;
• dropout on both hidden layers with rate of 0.1;
• total norm constraint of the gradients is 10.
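The learning rate schedule and the early stopping rule listed above can be sketched as follows. Two readings are assumed here: the decay is applied as a staircase (once per 2,000 steps), and "does not improve" means failing to exceed the best development LAS seen before. The function names and the toy LAS history are illustrative.

```python
def learning_rate(step, base=0.1, decay=0.95, every=2000):
    """Staircase exponential decay (staircase form is an assumption)."""
    return base * decay ** (step // every)


def should_stop(dev_las_history, patience=3):
    """True if the last `patience` evaluations all failed to beat
    the best development LAS observed before them."""
    if len(dev_las_history) <= patience:
        return False
    best_before = max(dev_las_history[:-patience])
    return all(s <= best_before for s in dev_las_history[-patience:])


# Evaluations happen every 2,000 steps; the history below is hypothetical.
history = [80.0, 81.0, 80.5, 80.9, 81.0]
print(learning_rate(4000), should_stop(history))
```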
The hyper-parameters for the CNN model are:
• dimensionality of the character embedding is 32;
• 4 convolutional kernels of lengths 3, 5, 7, 9;
• number of output channels of each kernel is 64;
• fixed length for the character input is 32.
The hyper-parameters for the LSTM model are:
• 128 hidden units for both LSTMs;
• all the gates use orthogonal initialization;
• gradient clipping of 10;
• no L2-regularization on the parameters.