Stanford’s Graph-based Neural Dependency Parser at the CoNLL 2017 Shared Task

This paper describes the neural dependency parser submitted by Stanford to the CoNLL 2017 Shared Task on parsing Universal Dependencies. Our system uses relatively simple LSTM networks to produce part of speech tags and labeled dependency parses from segmented and tokenized sequences of words. In order to address the rare word problem that is especially prevalent in languages with complex morphology, we include a character-based word representation that uses an LSTM to produce embeddings from sequences of characters. Our system was ranked first according to all five relevant metrics for the system: UPOS tagging (93.09%), XPOS tagging (82.27%), unlabeled attachment score (81.30%), labeled attachment score (76.30%), and content word labeled attachment score (72.57%).


Introduction
In this paper, we describe Stanford's approach to tackling the CoNLL 2017 shared task on Universal Dependency parsing (Nivre et al., 2016; Zeman et al., 2017; Nivre et al., 2017a,b). Our system builds on the deep biaffine neural dependency parser presented by Dozat and Manning (2017), which uses a well-tuned LSTM network to produce vector representations for each word, then uses those vector representations in novel biaffine classifiers to predict the head token of each dependent and the class of the resulting edge. In order to adapt it to the wide variety of treebanks in Universal Dependencies, we make two noteworthy extensions to the system: first, we incorporate a word representation built up from character sequences using an LSTM, theorizing that this should improve the model's ability to adapt to rare or unknown words in languages with rich morphology; second, we train our own taggers for the treebanks using an architecture nearly identical to the one used for parsing, in order to capitalize on potential improvements in part of speech tag quality over baseline or off-the-shelf taggers. This approach achieves state-of-the-art results on the macro average of the shared task datasets according to all five POS tagging and attachment accuracy metrics.
One noteworthy feature of our approach is its relative simplicity. It uses a single tagger/parser pair per language, trained on only words and tags; thus we refrain from taking advantage of ensembling, lemmas, or morphological features, any one of which could potentially push accuracy even higher.

Deep biaffine parser
The basic architecture of our approach follows that of Dozat and Manning (2017), which is closely related to Kiperwasser and Goldberg (2016), the first neural graph-based (McDonald et al., 2005) parser. 1 In Dozat and Manning's 2017 parser, the input to the model is a sequence of tokens and their part of speech tags, which is put through a multilayer bidirectional LSTM network. The output state of the final LSTM layer (which excludes the cell state) is then fed through four separate ReLU layers, producing four specialized vector representations: one for the word as a dependent seeking its head; one for the word as a head seeking all its dependents; another for the word as a dependent deciding on its label; and a fourth for the word as a head deciding on the labels of its dependents. 2 These vectors are then used in two biaffine classifiers: the first computes a score for each pair of tokens, with the highest score for a given token indicating that token's most probable head; the second computes a score for each label for a given token/head pair, with the highest score representing the most probable label for the arc from the head to the dependent. This is shown graphically in Figure 1.

Figure 1: The architecture of our parser. Arrows indicate structural dependence, but not necessarily trainable parameters.
Put formally, given a sequence of n word embeddings (to be described in more detail in Section 2.2) and tag embeddings, we concatenate each pair together and feed the result into a BiLSTM with initial state r_0: 3

x_i = v_i^(word) ⊕ v_i^(tag)
R = BiLSTM(X; r_0)

We then produce four distinct vectors from each recurrent hidden state h_i (without the recurrent cell state c_i) using ReLU perceptron layers:

h_i^(arc-dep) = ReLU(W^(arc-dep) h_i + b^(arc-dep))
h_i^(arc-head) = ReLU(W^(arc-head) h_i + b^(arc-head))
h_i^(rel-dep) = ReLU(W^(rel-dep) h_i + b^(rel-dep))
h_i^(rel-head) = ReLU(W^(rel-head) h_i + b^(rel-head))

In order to produce a prediction y_i^(arc) for token i, we use a biaffine classifier involving the (arc) hidden vectors:

s_i^(arc) = H^(arc-head) U^(arc) h_i^(arc-dep) + H^(arc-head) b^(arc)
y_i^(arc) = argmax_j s_ij^(arc)

Note first the similarity between the expression for s_i^(arc) and a traditional affine classifier of the form W h + b, with each of W and b first being transformed by H^(arc-head). Note also that both terms of the biaffine layer have intuitive interpretations: the first relates to the probability of word j being the head of word i given the information in both h^(arc) vectors (for example, the probability of word i depending on word j given that word i is "the" and word j is "cat"); the second relates to the probability of word j being the head of word i given only the information in the head's vector (for example, the probability of word i depending on word j given that word j is "the", which should be very small no matter what word i is).

2 Interestingly, other researchers have found similar approaches to be beneficial for other tasks; cf. Reed and de Freitas (2016); Miller et al. (2016); Daniluk et al. (2017).
3 We adopt the convention of using lowercase italics for scalars, lowercase bold for vectors, uppercase italics for matrices, and uppercase bold for tensors. We maintain this convention when indexing and stacking; so a_i is the ith vector of matrix A, and matrix A is the stack of all vectors a_i.
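As a concrete illustration, the arc-scoring biaffine classifier described above can be sketched in NumPy. The dimensions and random weights below are toy stand-ins for trained parameters, not the values used in our system:

```python
import numpy as np

def biaffine_arc_scores(H_dep, H_head, U, b):
    """Biaffine arc scores: S[i, j] combines a bilinear term using both the
    dependent and head vectors with a head-only bias term.

    H_dep, H_head: (n, d) stacks of dependent / head vectors.
    U: (d, d) bilinear weight; b: (d,) head-only bias.
    Row i of the result holds token i's score for every candidate head j.
    """
    bilinear = H_dep @ U @ H_head.T       # term using both vectors
    head_only = H_head @ b                # term using only the head's vector
    return bilinear + head_only[None, :]

rng = np.random.default_rng(0)
n, d = 5, 8
H_dep = rng.normal(size=(n, d))
H_head = rng.normal(size=(n, d))
U = rng.normal(size=(d, d))
b = rng.normal(size=d)

S = biaffine_arc_scores(H_dep, H_head, U, b)
pred_heads = S.argmax(axis=1)             # most probable head for each token
```

Taking the row-wise argmax recovers each token's most probable head, exactly as in the prediction equation above.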
After deciding on a head y_i for word i, we use another biaffine transformation, this time involving the (rel) hidden vectors, to produce a predicted label:

s_i^(rel) = h_{y_i}^(rel-head)ᵀ U^(rel) h_i^(rel-dep) + W^(rel) (h_i^(rel-dep) ⊕ h_{y_i}^(rel-head)) + b^(rel)
y_i^(rel) = argmax_j s_ij^(rel)

Again, each term in this expression has an intuitive interpretation: the first term relates to the probability of observing a label given the information in both h^(rel) vectors (e.g. the probability of the label det given word i is "the" with head "cat"); the second relates to the probability of observing a label given either h^(rel) vector (e.g. the probability of the label det given that word i is "the" or that word j is "cat"); the last relates to the prior probability of observing a label. We jointly train these two biaffine classifiers by optimizing the sum of their softmax cross-entropy losses. At test time, we ensure the tree is well-formed by iteratively identifying and fixing cycles for each proposed root and selecting the one with the highest score, which is both simple and sufficient for our purposes. 4

Character-level model

Dozat and Manning (2017) represented words as the sum of a pretrained vector 5 and a holistic word embedding for frequent words. However, that approach seems insufficient for languages with rich morphology, so we add a third representation built up from sequences of characters. Each character is given a trainable vector embedding, and each sequence of character embeddings is fed into a unidirectional LSTM. The LSTM produces a sequence of recurrent states (r_1, . . . , r_n), which we need to convert into a single vector. The simplest approach is to take the last one, which represents a summary of all the information aggregated one character at a time, and linearly transform it to the desired dimensionality.
Another approach, suggested by Cao and Rei (2016), is to use attention over the hidden states and then transform the resulting context vector to the desired size; in theory, this should allow the model to learn morpheme information more easily by attending more closely to the LSTM output at morpheme boundaries. We choose to combine both approaches, using the hidden states for attention and the cell state for summarizing, as shown in Figure 2. That is, given a sequence of n character embeddings and an initial state r_0 for the LSTM, we feed each embedding into the LSTM as before, extracting hidden and cell states:

h_k, c_k = LSTM(r_{k-1}, v_k^(char))

We then compute linear attention over the stack of hidden vectors H and concatenate the resulting context vector to the final cell state:

a = softmax(H w^(attn))
v^(char) = W (Hᵀ a ⊕ c_n) + b

In this way we use the hidden states for attention and the cell state as a final summary vector. After computing the character-level word embedding, we add together elementwise the pretrained embedding, the holistic frequent-token embedding, and the newly generated character-level embedding. We also add together embeddings for the language's UPOS and XPOS tags. The resulting two vectors are used as input to the BiLSTM parser in Section 2.1.
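A minimal NumPy sketch of this attention-plus-cell-state summary follows; the shapes and random values are illustrative stand-ins for trained LSTM states and weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def char_word_embedding(H, c_final, w_attn, W_out, b_out):
    """Combine attention over LSTM hidden states with the final cell state.

    H: (m, d) hidden states over m characters; c_final: (d,) final cell state.
    w_attn: (d,) linear attention weights; W_out: (k, 2d); b_out: (k,).
    """
    a = softmax(H @ w_attn)          # attention weights over character positions
    context = H.T @ a                # attention-weighted mix of hidden states
    # concatenate the context vector with the cell-state summary, then project
    return W_out @ np.concatenate([context, c_final]) + b_out

rng = np.random.default_rng(1)
m, d, k = 6, 4, 5                    # chars, LSTM state size, word-embedding size
H = rng.normal(size=(m, d))
c_final = rng.normal(size=d)
w_attn = rng.normal(size=d)
W_out = rng.normal(size=(k, 2 * d))
b_out = rng.normal(size=k)

v_char = char_word_embedding(H, c_final, w_attn, W_out, b_out)
```

The projection maps the concatenated (2d)-vector down to the word-embedding size so it can be summed elementwise with the pretrained and holistic embeddings.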

POS tagger
The final piece of our system is a separately trained part of speech tagger. The architecture for the tagger is almost identical to that of the parser (and shares fundamental properties with other neural taggers; cf. Ling et al. (2015); Plank et al. (2016)): it uses a BiLSTM over word vectors (using the tripartite representation from Section 2.2), then uses ReLU layers to produce one vector representation for each type of tag.
Thus we use a BiLSTM over the word vectors, as with the parser architecture:

R = BiLSTM(X; r_0)
h_i^(upos) = ReLU(W^(upos) r_i + b^(upos))
h_i^(xpos) = ReLU(W^(xpos) r_i + b^(xpos))

And we use affine classifiers for each type of tag, whose predicted tag embeddings are added together as input for the parser:

s_i^(upos) = U^(upos) h_i^(upos) + u^(upos)
s_i^(xpos) = U^(xpos) h_i^(xpos) + u^(xpos)

The tag classifiers are trained jointly using cross-entropy losses that are summed together during optimization, but the tagger is trained independently from the parser.
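As a hedged sketch, the two affine tag classifiers and their summed cross-entropy loss can be written as follows; the tag-set sizes and random parameters here are illustrative only:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def joint_tag_loss(h_upos, h_xpos, W_u, b_u, W_x, b_x, y_upos, y_xpos):
    """One affine classifier per tag set; the two cross-entropy losses are
    summed, so both classifiers are trained jointly."""
    s_upos = W_u @ h_upos + b_u
    s_xpos = W_x @ h_xpos + b_x
    loss = -log_softmax(s_upos)[y_upos] - log_softmax(s_xpos)[y_xpos]
    return s_upos, s_xpos, loss

rng = np.random.default_rng(2)
d, n_upos, n_xpos = 8, 17, 40                 # illustrative tag-set sizes
h_u, h_x = rng.normal(size=d), rng.normal(size=d)
W_u, b_u = rng.normal(size=(n_upos, d)), rng.normal(size=n_upos)
W_x, b_x = rng.normal(size=(n_xpos, d)), rng.normal(size=n_xpos)

s_u, s_x, loss = joint_tag_loss(h_u, h_x, W_u, b_u, W_x, b_x, 3, 7)
```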

Training details
Our model largely adopts the same hyperparameter configuration laid out by Dozat and Manning (2017), with a few exceptions. The parser uses three BiLSTM layers with 100-dimensional word and tag embeddings and 200-dimensional recurrent states (in each direction); the arc classifier uses 400-dimensional head/dependent vector states and the label classifier uses 100-dimensional ones; we drop word and tag embeddings independently with 33% probability; 6 we use same-mask dropout (Gal and Ghahramani, 2015) in the LSTM, ReLU layers, and classifiers, dropping input and recurrent connections with 33% probability; and we optimize with Adam (Kingma and Ba, 2014), setting the learning rate to 2e-3 and β1 = β2 = .9. We train models for up to 30,000 training steps (where one step/iteration is a single minibatch with approximately 5,000 tokens), saving the model every 100 steps if fewer than 1,000 iterations have passed, and afterwards only saving if validation accuracy increases (or training accuracy for languages with no validation data). When 5,000 training steps pass without improving accuracy, we terminate training.

6 When only one is dropped, we scale the other by a factor of two.
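The independent word/tag embedding dropout with the factor-of-two rescaling described above can be sketched as follows (a simplified illustration, not our actual implementation):

```python
import numpy as np

def drop_word_tag(word_vec, tag_vec, p=0.33, rng=None):
    """Drop the word and tag embeddings independently with probability p.

    When exactly one of the two is dropped, the survivor is scaled by two so
    the expected magnitude of the summed input stays roughly constant.
    """
    rng = rng if rng is not None else np.random.default_rng()
    keep_word = rng.random() >= p
    keep_tag = rng.random() >= p
    scale = 2.0 if keep_word != keep_tag else 1.0
    word = word_vec * scale if keep_word else np.zeros_like(word_vec)
    tag = tag_vec * scale if keep_tag else np.zeros_like(tag_vec)
    return word, tag
```

Because the two drops are independent, sometimes both embeddings survive unscaled, sometimes both are zeroed, and when exactly one survives it is doubled.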
For the character model, we use 100-dimensional uncased character embeddings with 400-dimensional recurrent states. We don't drop characters, but we do include 33% dropout in the LSTM and attention connections.
In the tagger we use nearly identical settings, with a few exceptions: the BiLSTM is only two layers deep, we increase the dropout between recurrent connections to 50%, and we use cased character embeddings.
Our approach for dealing with the surprise languages was to train delexicalized "language family" parsers with the same architecture detailed in Section 2.1 on the UPOS tags of UDPipe v1.1 (Straka et al., 2016), with no word-level information. For Buryat (Altaic), we used as input the training datasets for Turkish, Uyghur, Kazakh, Korean, and Japanese; for Kurmanji (Indo-Iranian), we used Persian, Urdu, and Hindi; for North Sámi (Uralic), we used Finnish, Finnish-FTB, Estonian, and Hungarian; and for Upper Sorbian (Slavic), we used Bulgarian, Czech, Old Church Slavonic, Polish, Russian, Russian-SynTagRus, Slovak, Slovenian, Slovenian-SST, and Ukrainian.
There's substantial variability in training and testing speed across treebanks, but on an NVidia Titan X GPU the models train at 100 to 1000 sentences/sec and test at 1000 to 5000 sentences/sec. Even without GPU acceleration a tagger or parser can be run on an entire test treebank in ten to twenty seconds. By far the greatest runtime overhead comes not from the model itself, but from reading in the large matrices of pretrained embeddings, which can take several minutes. A full run over the 81 test sets on the TIRA virtual machine (Potthast et al., 2014) takes about 16 hours, but when parallelized on faster machines it can be done in under an hour.

Results
Our model uses the provided tokenization and segmentation and produces UPOS tags, XPOS tags, arcs, and labels. Thus the relevant metrics for the system are UPOS accuracy, XPOS accuracy, unlabeled attachment score, labeled attachment score, and content labeled attachment score. Our system achieves the highest aggregated score on all five of these metrics in the shared task. Below we explore where our model does particularly well, and where it can be improved. We choose to evaluate on CLAS performance because we feel it more accurately reflects model performance, being a principled extension of the common practice of removing punctuation from evaluation. We also exclude surprise languages from the following analyses.

Table 1: Results on each treebank in the shared task plus the macro average over all of them. State-of-the-art performance by the system is in bold.
One small point to that end is that our system assumes tokenization and segmentation have already been done; we therefore trained on gold segmentation and evaluated using the segmentation provided by UDPipe. For most treebanks this was easily sufficient, but for Vietnamese, Chinese, Japanese, and Arabic, UDPipe's lower performance at segmenting or tokenizing was correlated with a relatively large gap between CLAS and gold-aligned CLAS. Because our model reports comparable numbers for nearly all other treebanks, we take this to mean that alignment errors propagated through the system into parsing errors.

Nonprojectivity
In Universal Dependencies, unlike many other popular benchmarks, several treebanks have a large fraction of crossing dependencies, so any competitive system will need to be able to produce nonprojective arcs. One of the most frequently used approaches for producing fully nonprojective parsers in transition-based systems is to add the swap action (Nivre, 2009). This makes any arbitrary nonprojective arc possible, but increases the number of transition steps required to produce that arc. One valid concern is that this might bias the model toward producing projective arcs; in our graph-based system, by contrast, there's little reason to think nonprojective arcs should be harder to predict than projective ones. Here we aim to explore how the fraction of nonprojective arcs in a treebank affects the performance of the two types of systems.
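For concreteness, the fraction of crossing (nonprojective) arcs in a tree can be computed directly from the head indices. A quadratic-time sketch, assuming the standard CoNLL convention of 1-indexed tokens with head 0 for the root:

```python
def crossing_arc_count(heads):
    """Count arcs that cross at least one other arc.

    heads[i - 1] is the head of token i; head 0 denotes the root.
    Two arcs cross when exactly one endpoint of each lies strictly
    inside the span of the other.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    crossing = 0
    for i, (a1, b1) in enumerate(arcs):
        for j, (a2, b2) in enumerate(arcs):
            if i != j and (a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1):
                crossing += 1
                break           # this arc crosses something; count it once
    return crossing
```

A fully projective tree yields zero, while interleaved attachments (e.g. token 1 headed by token 3 while token 2 is headed by token 4) are counted as crossing.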
To test the relative performance of a graph-based and a transition-based model, we compute the difference in per-treebank CLAS performance between our parser and the UDPipe v1.1 baseline (Straka et al., 2016), which uses a transition-based parser with the swap operation (Straka et al., 2015). We then plot this against the frequency of nonprojective arcs in the test set. To determine whether there is a significant relationship between the difference in performance and test-set nonprojectivity, we fit the data to a generalized linear mixed effects regression model (Fisher, 1930), using Markov chain Monte Carlo sampling (Hadfield, 2010). We include log data size, morphological complexity (see Section 5.2), and training set projectivity as random effects. We plot the data with the learned regression lines in Figure 3a. What we find is that the margin between the performance of the graph-based and transition-based parsers increases significantly with the nonprojectivity of the test set (p < 0.001). This remains significant even when outliers 7 are excluded (p < 0.05). To the extent that UDPipe represents a typical nonprojective transition-based parser, our results suggest that a graph-based approach is better suited than a transition-based one to parsing UD treebanks that have significant syntactic freedom or complexity.
Predicting crossing arcs requires more operations (and therefore more long-term planning on behalf of the parser) when using the swap action in a transition-based system, but in our graph-based system they can be predicted as easily as projective arcs. One might hypothesize that because of this, a transition-based swapping system would need to see more examples of crossing dependencies than a graph-based system in order to generalize well. The data shown in Figure 3b support this hypothesis: we computed the difference between the projectivity of each test and training set, and used this as the fixed effect in another mixed effects model with data size, morphological complexity, and train/test nonprojectivity as random effects. We find that when the training set has drastically fewer crossing dependencies than the test set, the graph-based model achieves relatively higher accuracy; but when the transition-based parser can train on many crossing arcs, the models are closer in performance (p < 0.001), even when excluding the same outliers (p < 0.05). This suggests that the graph-based approach learns and generalizes crossing dependencies more efficiently than the transition-based approach, although this again comes with the assumption that UDPipe's parser is representative of most transition-based swapping parsers when it comes to producing nonprojective parses.

Figure 5: Performance difference between a version of our model trained on our own predicted tags and a version trained on UDPipe v1.1 tags as a function of the performance difference between our taggers and the UDPipe taggers (fitted regression: 0.35x + 0.75).

Data size
We use the same hyperparameter configuration for all datasets, regardless of how much training data there is for that treebank, which means we may have overfit to small training datasets or underfit to large ones. To test this, we computed the per-treebank difference between the test CLAS performance of our model and that of the highest-performing model other than ours, and plotted that difference against the log training data size in Figure 4. We fit the differences to another mixed effects regression model with train/test projectivity and morphological complexity set as random effects, finding that our system on average tends to do relatively better on larger datasets compared to other approaches and worse on smaller ones (p < 0.001). When the outliers are excluded, 8 this tendency is still significant (p < 0.001). This suggests that our model is overfitting to smaller datasets, and that increasing regularization or decreasing model capacity may improve accuracy for lower-resource languages.

Figure 6: Performance difference between parsers using our taggers and parsers without tags (left) and between parsers using UDPipe v1.1's tags and parsers without tags (right), with both histograms fit to skew normal distributions.

Ablation Studies

POS Tagger
We chose to train our parsers on our own predicted tags instead of using the provided taggers; here we aim to justify that strategy empirically with an ablation study. We trained another set of parsers with otherwise identical hyperparameter settings using the baseline tags provided by UDPipe v1.1, and computed the difference in CLAS between our reported models and the new ones. We also computed the difference in UPOS accuracy between UDPipe v1.1's taggers and our own. In Figure 5, we plot how the difference in tagger quality affects the CLAS of the parser, making two noteworthy observations. The first is that the models trained on our own tags perform statistically significantly better than the models trained on UDPipe tags according to a Wilcoxon test (p < 0.001). The second is that this difference can be explained by the improvement of our tagger over UDPipe v1.1, again accounting for dataset size, nonprojectivity, and morphology in a mixed effects model (p < 0.001). This suggests that improving upstream tagger performance is an effective way of improving downstream parser accuracy. We also examined the effect of training size on the difference in parser performance, finding no significant correlation (p > 0.05).
The approach laid out in this paper uses one neural network to tag the sequences of tokens, and a second neural network to produce a parse from the tokens and tags. One might ask to what extent the tagger network is actually necessary, for a number of reasons: presumably whatever predictive patterns it learns from the token sequences would also be learnable by the parser network; errors by the tagger are likely to be propagated by the parser; and Ballesteros et al. (2015) found that POS tags are drastically less important for character-based parsers. In order to examine how useful the POS tag information is to our character-based system, we trained an additional set of parsers without UPOS or XPOS input, comparing them to the other two, with the differences graphed in Figure 6. We find that the variant with no POS tag input is likewise significantly worse than our reported model according to a Wilcoxon test (p < 0.001), but not statistically different from the one trained with UDPipe tags (p > 0.05). This suggests that predicted POS tags are still useful for achieving maximal parsing accuracy in our system, provided the tagger's performance is sufficiently high.

Character model
One of the ways in which we build on Dozat and Manning's 2017 work is by adding a character-level word representation similar to that of Ballesteros et al. (2015), hypothesizing that it should allow the model to more effectively learn the relationships between words in languages with rich morphology and loose word order. We test this using another ablation study: we trained a second set of taggers and parsers on the dataset with only whole-token and pretrained vectors, leaving out the vector composed from character sequences. If our hypothesis is correct, as morphological complexity increases, the difference between the models should increase as well.

Figure 7: Performance difference between our character-based approach and a pure token-based approach for parsing (left) and tagging (right) as a function of approximated morphological complexity (for maximal comparability, we use the original character-based taggers for the token-based parsers).
The basis of our approach to quantifying morphological complexity is the assumption that in a morphologically complex language, the ratio of the vocabulary size |V^(X)| of a corpus to the corpus size |X| will be relatively high, because the same lemma may occur in many different forms; in a morphologically simplex language, that ratio will be smaller, because a given lemma will normally appear in only a few forms. Assuming both languages have the same number of lemmas, the vocabulary size of the complex language will then be larger. The most principled way of modeling this intuition is through Heaps' law (Herdan, 1960; Heaps, 1978),

|V^(X)| = k |X|^w,

which says that the log vocabulary size increases linearly in the log corpus size: log |V^(X)| = log k + w log |X|.
We can take advantage of Heaps' law directly in approximating morphological complexity. Morphologically richer languages should increase the size of their vocabulary at a faster rate as the corpus grows, because a new token added to the corpus has a higher probability of being a previously observed lemma with a previously unobserved morphological form, thereby increasing the vocabulary size; in a morphologically simplex language, previously observed lemmas are unlikely to have many morphological forms that could increase |V|. Therefore, we would expect the exponent w of Heaps' law to be larger for morphologically richer languages. Thus we use the coefficient w as our metric for morphological richness, and plot the difference between models trained with character-level word embeddings and token-level word embeddings against this value in Figure 7. First we perform a Wilcoxon signed-rank test, finding that the difference between the two approaches is statistically significant for the taggers (p < 0.001) and parsers (p < 0.001). Then we fit a mixed effects model to the data with treebank size and training/test projectivity as random effects, finding that the character-level approach tends to improve performance significantly more as complexity grows, both for parsing (p < 0.005) and tagging (p < 0.001). 9 This indicates that incorporating subword information into UD parsing models is a promising way to improve performance on languages with significant morphology.
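This estimation can be sketched with a small pure-Python routine: fit w by least squares on log vocabulary size versus log corpus size at growing prefixes of the token stream (a simplified illustration of the general fitting procedure, not our exact implementation):

```python
import math

def heaps_exponent(tokens, num_samples=20):
    """Estimate w in Heaps' law |V| = k * |X|**w by least squares on
    (log corpus size, log vocabulary size) pairs at growing prefixes."""
    xs, ys, seen = [], [], set()
    step = max(1, len(tokens) // num_samples)
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            xs.append(math.log(i))           # log corpus size so far
            ys.append(math.log(len(seen)))   # log vocabulary size so far
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# A stream of all-distinct tokens gives w near 1; a single repeated
# token gives w of 0, matching the intuition that richer morphology
# means faster vocabulary growth.
```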

Conclusion
In this paper we described our relatively simple neural system for parsing that achieved state-of-the-art performance on the 2017 CoNLL Shared Task on UD parsing without utilizing lemmas, morphological features, or ensembling. The system uses BiLSTM networks for tagging and parsing, and includes character-level word representations in addition to token-level ones. We also examined what can be learned more generally from our model's performance. We explored the relative performance of nonprojective graph-based and transition-based architectures on this task, finding evidence that modern graph-based parsers might be better at producing nonprojective arcs (with some caveats). Additionally, our network performs better when there's an abundance of data, suggesting that more regularization could improve accuracy on lower-resource languages.
We also sought to quantitatively justify the additional complexity of our system. We considered how important the POS tagger is to the system, comparing the downstream performance of parsers using our tagger, the baseline tagger, and no tagger at all. We find that our tagger beats both baselines significantly, whereas the two baselines don't statistically differ from each other, indicating that POS tags can help our system but must be sufficiently accurate. The character-based approach was found to significantly boost performance on languages that scored high on our metric for morphological complexity, both for parsing and tagging, suggesting that constructing token representations from subtoken information is effective for capturing the influence of morphology on syntax, and that the naïve approach of using only holistic word embeddings is insufficient. Our success at the shared task demonstrates that a well-tuned, straightforward neural approach to parsing and tagging can achieve state-of-the-art performance for datasets with a wide variety of syntactic properties.