CLCL (Geneva) DINN Parser: a Neural Network Dependency Parser Ten Years Later

This paper describes the University of Geneva’s submission to the CoNLL 2017 shared task Multilingual Parsing from Raw Text to Universal Dependencies (listed as the CLCL (Geneva) entry). Our submitted parsing system is the grandchild of the first transition-based neural network dependency parser, which was the University of Geneva’s entry in the CoNLL 2007 multilingual dependency parsing shared task, with some improvements to speed and portability. These results provide a baseline for investigating how far we have come in the past ten years of work on neural network dependency parsing.


Introduction
The system described in this paper is the grandchild of the first transition-based neural network dependency parser (Titov and Henderson, 2007b), which was the University of Geneva's entry in the CoNLL 2007 multilingual dependency parsing shared task (Titov and Henderson, 2007a). The system has undergone some developments and modifications, in particular the faster discriminative version introduced by Yazdani and Henderson (2015), but in many respects the design and implementation of this parser are unchanged since 2007. One of the motivations for our submission to this CoNLL 2017 multilingual dependency parsing shared task is to provide a baseline for evaluating to what extent recent advances in neural network models and training do in fact improve performance over "traditional" recurrent neural networks. We are listed in the table of results as the CLCL (Geneva) entry.
As with previous work using the Incremental Neural Network architecture (e.g. Henderson, 2003), the main philosophy of our submission is that we build language-universal inductive biases into the model structure of the recurrent neural network, but we do not do any feature engineering. Training the neural network induces language-specific hidden representations automatically. To provide such a baseline, we use UDPipe for all pre-processing (Straka et al., 2016), and MaltParser for all projectivisation (Nivre et al., 2006). The only exception is our strategy for surprise languages, discussed below.
These goals match the aim of the 2017 Universal Dependencies shared task, described in the introductory overview (Zeman et al., 2017). This task makes true cross-linguistic comparison possible thanks to the Universal Dependencies annotation project, which underlies the data used in this shared task. We train exactly the same parsing model on every language, thereby allowing further comparisons. In addition, the feature induction abilities of the recurrent neural network help minimise any remaining cross-lingual differences due to pre-processing or annotation.

Data
We use only the provided treebanks. For large treebanks, we train the model on the UD treebank (Nivre et al., 2017a), with some tuning of metaparameters using the development set.
For surprise languages, we train on the concatenation of the treebank for the language, no matter how small, and the treebank of an identified source language with a larger treebank. In post-testing experiments, we also apply this same strategy to other small treebanks, resulting in substantial improvements (average 43% better) over the submitted results.
We don't use externally trained word embeddings (we trained our own internally to the parser) or any other data resource.

Preprocessing
Tokenisation, word segmentation and sentence segmentation are provided by UDPipe (Straka et al., 2016). We do not use the morphological transducers from Apertium/Giellatekno that had been made available for the shared task.
Because our parser can only produce projective dependency trees, we apply the projectivisation transformation of the MaltParser package (Nivre et al., 2006) to all treebanks before training.

Parser
We apply a single DINN parser to each language. We do not use any ensemble methods. This makes our results more useful for comparison, and allows our model to be used within an ensemble with other parsers.
We use the parser described in Yazdani and Henderson (2015), the Discriminative Incremental Neural Network parser (DINN). Like the previous version of this parser (Titov and Henderson, 2007b), it uses a recurrent neural network (RNN) to predict the actions for a fast shift-reduce dependency parser. Decoding is done with a beam search where pruning occurs after each shift action. The RNN model has an output-dependent structure that matches locality in the parse structure, making it an "incremental" neural network (INN, previously called SSN). This INN computes hidden vectors that encode the preceding partial parse, and estimates the probabilities of the parser actions given this history. Unlike the previous generative INN parser, DINN is a discriminative parser, using lookahead instead of word prediction. In order to combine beam search with a discriminative model, word predictions are replaced by a binary correctness probability which is trained discriminatively.

Transition-Based Neural Network Parsing
In DINN, the neural network is used to estimate the conditional probabilities of a transition-based statistical parsing model.

The Probabilistic Parsing Model
In shift-reduce dependency parsing, a parser configuration consists of a stack P of words, a queue Q of words and the partial labelled dependency trees constructed by the previous history of parser actions. The parser starts with an empty stack P and all the input words in the queue Q. It stops when it reaches a configuration with an empty queue Q, with any words left on the stack then being attached to ROOT. We use an arc-eager algorithm, which has 4 actions that all manipulate the word s on top of the stack P and the word q at the front of the queue Q: Left-Arc r adds a dependency arc from q to s labelled r, and then pops s from the stack. Right-Arc r adds an arc from s to q labelled r. Reduce pops s from the stack. Shift moves q from the queue to the stack. For exact details, see Titov and Henderson (2007b).
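The four actions can be illustrated with a minimal sketch. It follows only the description in this section; the names `Config`, `parse` and `ROOT`, the use of integer word indices, and the choice to attach only headless stack words to ROOT at the end are assumptions made for the illustration, not details of the DINN implementation.

```python
from dataclasses import dataclass, field

ROOT = 0  # assumed index for the artificial ROOT node; words are numbered 1..n


@dataclass
class Config:
    stack: list                                # P, top of the stack is stack[-1]
    queue: list                                # Q, front of the queue is queue[0]
    arcs: list = field(default_factory=list)   # (head, label, dependent) triples


def left_arc(c, label):
    # Arc from q (head) to s (dependent), then pop s.
    s = c.stack.pop()
    c.arcs.append((c.queue[0], label, s))


def right_arc(c, label):
    # Arc from s (head) to q (dependent); q stays on the queue.
    c.arcs.append((c.stack[-1], label, c.queue[0]))


def reduce_(c):
    c.stack.pop()


def shift(c):
    c.stack.append(c.queue.pop(0))


def parse(n_words, actions):
    """Apply a sequence of (action name, label) pairs and return the arcs."""
    c = Config(stack=[], queue=list(range(1, n_words + 1)))
    for name, label in actions:
        if name == "left-arc":
            left_arc(c, label)
        elif name == "right-arc":
            right_arc(c, label)
        elif name == "reduce":
            reduce_(c)
        elif name == "shift":
            shift(c)
        else:
            raise ValueError(f"unknown action: {name}")
    if c.queue:
        raise ValueError("the parse must end with an empty queue")
    # Assumption: only stack words that have no head yet are attached to ROOT.
    has_head = {dep for _, _, dep in c.arcs}
    for w in c.stack:
        if w not in has_head:
            c.arcs.append((ROOT, "root", w))
    return c.arcs


# Toy three-word sentence: word 2 heads words 1 and 3.
arcs = parse(3, [("shift", None), ("left-arc", "nsubj"), ("shift", None),
                 ("right-arc", "obj"), ("shift", None)])
```

In this toy run, word 2 is the only word left without a head when the queue empties, so it is the single word attached to ROOT.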
To model parse trees, we model the sequences of parser actions which generate them. We take a history-based approach to modelling these sequences of parser actions. At each step of the parse sequence, the parser chooses between the possible next actions using an estimate of each action's conditional probability:

$$P(T \mid S) = P(D^1 \cdots D^m \mid S) = \prod_{t=1}^{m} P(D^t \mid D^1 \cdots D^{t-1}, S)$$

where $T$ is the parse tree, $D^1 \cdots D^m$ is its equivalent sequence of shift-reduce parser actions and $S$ is the input sentence.
Unlike in previous dependency parser evaluations, the evaluation script for this shared task requires that exactly one word be attached to the ROOT node of the sentence. We implemented this constraint by modifying the calculation of the set of possible next actions. If an action would lead to a parser configuration where all possible ways of finishing the parse result in more than one word being attached to ROOT, then that action is not a possible action.

Estimating Action Probabilities
To estimate each $P(D^t \mid D^1 \cdots D^{t-1}, S)$, we need to condition on the unbounded sequences $D^1 \cdots D^{t-1}$ and $S$. To condition on the words in the queue, we use a bounded lookahead:

$$P(D^t \mid D^1 \cdots D^{t-1}, S) \approx P(D^t \mid D^1 \cdots D^{t-1}, w^t_{a_1} \cdots w^t_{a_k})$$

where $w^t_{a_1} \cdots w^t_{a_k}$ are the first $k$ words at the front of the queue at time $t$. At every Shift, one word is moved from the lookahead onto the stack and a new word from the input is added to the lookahead.
To estimate the probability of a decision at time $t$ conditioned on the history of actions $D^1 \cdots D^{t-1}$, the recurrent neural network computes a hidden representation $h^t$ that encodes this history. This model is depicted in Figure 1. The hidden representation at time $t$ is computed from selected previous hidden representations, plus pre-defined features. The model defines a set of link types $c \in C$ which select previous states $t_c < t$ and connect them to the current hidden layer $h^t$ via the hidden-hidden weights $W^c_{HH}$. The model also defines a set of features $f \in F$ calculated from the previous decision and the current queue and stack, which are connected to the current hidden layer via the input-hidden weights $W_{IH}$:

$$h^t = \sigma\Big(\sum_{c \in C} h^{t_c} W^c_{HH} + \sum_{f \in F} W_{IH}(f,:)\Big)$$

where $\sigma$ is the sigmoid function and $W(i,:)$ is row $i$ of matrix $W$. The probability of each decision is estimated with a softmax layer (a normalised exponential) with outputs for all decisions that are possible at this step, conditioned on the hidden representation:

$$P(D^t = d \mid h^t) = \frac{\exp\big(h^t W_{HO}(:,d)\big)}{\sum_{d'} \exp\big(h^t W_{HO}(:,d')\big)}$$

where $W_{HO}$ is the weight matrix from hidden representations to the outputs and $d'$ ranges over the decisions that are possible at this step.
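The hidden-layer and softmax computations described in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: hidden vectors are treated as row vectors, matrices are lists of rows, and the function names (`hidden_state`, `action_probs`) and argument layouts are hypothetical rather than taken from the DINN code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(v, M):
    """Row vector v times matrix M (a list of rows)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def hidden_state(linked_states, W_HH, active_features, W_IH):
    """Sigmoid of the summed recurrent and input contributions.

    linked_states:   dict mapping link type c to the previous hidden vector it selects
    W_HH:            dict mapping link type c to its hidden-hidden matrix
    active_features: indices f of the features that fire in this configuration
    W_IH:            input-hidden matrix, one row per feature
    """
    total = [0.0] * len(W_IH[0])
    for c, h_prev in linked_states.items():
        for j, x in enumerate(vec_mat(h_prev, W_HH[c])):
            total[j] += x
    for f in active_features:
        for j, x in enumerate(W_IH[f]):
            total[j] += x
    return [sigmoid(x) for x in total]


def action_probs(h, W_HO, possible):
    """Softmax restricted to the actions that are possible at this step."""
    scores = {a: sum(h[i] * W_HO[i][a] for i in range(len(h))) for a in possible}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {a: math.exp(s - top) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


# Toy example: no recurrent links, one active feature, three actions of which two are possible.
h = hidden_state({}, {}, [0], [[0.0, 0.0]])
probs = action_probs(h, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], possible=[0, 2])
```

Restricting the normalisation to the possible actions means that an excluded action simply receives no probability mass, which matches the statement that the softmax has outputs only for decisions that are possible at the current step.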

Hidden and Input Features
C and F are the only hand-coded parts of the model. Because C defines the recurrent connections in the neural network, it is responsible for passing information about the unbounded parse history to the current decision. Because RNNs are biased towards learning correlations which are close together in the connected sequence of hidden layers, we exploit this bias by making the structure of the neural network match the structure of the output parse. This is achieved by including a previous state in C if it had a word on the top of the stack or at the front of the queue which is also relevant to the current decision. In the version we use in this experiment, we use a minimal set of these link types, specified in section 5.
The input features F are typical of any statistical model. But in the case of neural networks, it is common to decompose the parametrisation of these features into a matrix for the feature role (e.g. front-of-the-queue) and a vector for the feature value (e.g. a word). This decomposition of features overcomes feature sparsity, because the same value vector can be shared across multiple roles. Word embedding vectors are the most common example of this decomposition.
Unlike in the previous versions of the parser, Yazdani and Henderson (2015) added feature decompositions in the definition of the input-to-hidden weights $W_{IH}$:

$$W_{IH}(f,:) = W_{emb.}(val(f),:)\, W^{role(f)}_{HH}$$

Every row in $W_{emb.}$ is an embedding for a feature value, which may be a word, lemma, POS tag, or dependency relation. $val(f)$ is the index of the value for feature role $role(f)$, for example the particular word that is at the front of the queue. The matrix $W^{role(f)}_{HH}$ is the feature role matrix, which maps the feature value embedding to the role-value feature vector for the given feature role $role(f)$. For simplicity, we assume here that the size of the embeddings and the size of the hidden representations of the DINN are the same. In this way, the parameters of the embedding matrix $W_{emb.}$ are shared among the various feature input link types $role(f)$, which can improve the model in the case of sparse features $f$.
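The role/value decomposition can be made concrete with a small sketch: the same embedding row is multiplied by a different role matrix depending on where the value occurs. The names `input_row`, `W_emb` and `W_role`, the role labels and the toy numbers are all hypothetical and chosen only for illustration.

```python
def input_row(value_index, role, W_emb, W_role):
    """Embedding row for the feature value, mapped through the matrix of its role."""
    e = W_emb[value_index]
    M = W_role[role]
    return [sum(e[i] * M[i][j] for i in range(len(e))) for j in range(len(M[0]))]


# Two feature values with 2-dimensional embeddings, shared across both roles.
W_emb = [[1.0, 0.0],
         [0.0, 1.0]]
W_role = {
    "queue_front": [[1.0, 2.0], [3.0, 4.0]],
    "stack_top":   [[0.0, 1.0], [1.0, 0.0]],
}

as_queue_front = input_row(1, "queue_front", W_emb, W_role)
as_stack_top = input_row(1, "stack_top", W_emb, W_role)
```

Because the value embedding is shared, a word seen rarely in one role still benefits from updates made when it occurs in another role, which is the sparsity argument made above.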
We train our own word embeddings within the parsing model, using only the parsed training data. We tried initialising with Facebook embeddings on a sample of languages, but random initialisation worked better.
Unlike Yazdani and Henderson (2015), we did not cache any features, either in testing or in training. Caching can have a big impact on speed, but it has not been shown to improve accuracies.

Discrimination of Correct Partial Parses
Unlike in the previous generative models, the above formulas for computing the probability of a parse make independence assumptions, in that words to the right of $w^t_{a_k}$ are assumed to be independent of $D^t$. And even for words in the lookahead, it can be difficult to learn correlations with the unstructured lookahead string. If a discriminative model uses normalised estimates for decisions, then once a wrong decision is made there is no way for the estimates to express that this decision has led to a structure that is incompatible with the current or future lookahead string (see Lafferty et al. (2001) for more discussion). For this reason, there is no obvious way to make effective use of beam search for a normalised discriminative model.
To overcome this problem, DINN estimates a correctness probability after every Shift action. This output is trained to discriminate correct from incorrect parse prefixes, using the same hidden representation as used to predict parser actions, as depicted in Figure 1. A beam search is then used to consider multiple possible partial parses so that the correctness probability can be used to select between them. The total score of a parse is the product of the probabilities of all its actions and the correctness probabilities at the shift of each word. For more details on this technique, see Yazdani and Henderson (2015).
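As a rough sketch of how such a score could drive the beam, the snippet below combines action probabilities and correctness probabilities in log space and keeps the best candidates. Working in log space, the function names and the beam representation as (parse, score) pairs are assumptions made for this example; the pruning would be invoked after each Shift, as described in the Parser section.

```python
import math


def parse_log_score(action_probs, correctness_probs):
    """Log of the product of all action probabilities and all correctness probabilities."""
    return (sum(math.log(p) for p in action_probs)
            + sum(math.log(p) for p in correctness_probs))


def prune(beam, width):
    """Keep the `width` highest-scoring (partial parse, log score) pairs."""
    return sorted(beam, key=lambda item: item[1], reverse=True)[:width]


# Two actions with probabilities 0.5 and 0.8, one Shift with correctness probability 0.9.
score = parse_log_score([0.5, 0.8], [0.9])
kept = prune([("a", -1.0), ("b", -0.5), ("c", -2.0)], width=2)
```

Summing logarithms is equivalent to multiplying the probabilities and avoids numerical underflow on long action sequences.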

Experimental Settings
The implementation of DINN uses a parameter file to define the hidden-hidden connections, the input-hidden features, training meta-parameters, and various other parameters of the parser. We use the same settings for all languages. For the official submission, we used the following settings.
- We used a frequency cutoff of 3 for words and lemmas.
- We did not normalise the input string to lowercase.
- Word embeddings are initialised randomly.
- We do not apply any feature caching.
- Validation occurs at every iteration.
- The configurations of the Input-to-Hidden layer connections are as follows:
  + Look at the 4 last elements in the stack and the 4 next elements from the input (except for the treebanks fr, ko, it_partut, grc_proiel and cu, where we look at the 5 last elements from the stack).
  + For each element, use all possible features from UDPipe (except UPoS if XPoS exists).
- The configurations of the Hidden-to-Hidden layer connections are as follows:

  Closest   Current   H-to-H
  Queue     Queue     +
  Top       Top       +
  Queue     Top       +

In this specification of the hidden-to-hidden connections, Queue refers to the front of the input queue and Top refers to the top of the stack in the parser configuration. This specification uses the same simplified set of connections between hidden states used in Yazdani and Henderson (2015). We assume that the induced hidden features primarily relate to the word on the top of the syntactic stack and the word at the front of the queue, since these are the words used in any action. To decide which previous state's hidden features are most relevant to the current decision, we look at these words in the current parser configuration. For each such word, we look for previous states where the top of the stack or the front of the queue was the same word. If more than one previous state matches, then the hidden vector of the most recent one is used. If no state matches, then no connection is made.
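The lookup of previous states can be sketched as follows. The reading of the connection table used here (the first column names the position the word occupied in the previous state, the second the position it occupies now) is our interpretation for the sake of the example, and the data layout and function names are hypothetical.

```python
# (position in the previous state, position in the current configuration)
LINKS = [("Queue", "Queue"), ("Top", "Top"), ("Queue", "Top")]


def most_recent_state(history, word, position):
    """Index of the latest previous state whose `position` slot held `word`, else None.

    history is a list of (top_of_stack_word, front_of_queue_word) pairs, oldest first.
    """
    slot = 0 if position == "Top" else 1
    for t in range(len(history) - 1, -1, -1):
        if history[t][slot] == word:
            return t
    return None


def select_links(history, current_top, current_queue):
    """Map each link type to the previous state it connects to, omitting missing links."""
    current = {"Top": current_top, "Queue": current_queue}
    links = {}
    for previous_pos, current_pos in LINKS:
        word = current[current_pos]
        if word is None:
            continue
        t = most_recent_state(history, word, previous_pos)
        if t is not None:
            links[(previous_pos, current_pos)] = t
    return links


# Word 2 sat at the front of the queue in states 1 and 2 and is now on top of the stack.
history = [(None, 1), (1, 2), (None, 2)]
links = select_links(history, current_top=2, current_queue=3)
```

Only the most recent matching state is linked, and link types with no matching state are simply absent, mirroring the two rules stated at the end of the paragraph above.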

Training
One aspect of the current implementation which is basically unchanged from ten years ago is the training protocol. Learning rates and weight decay regularisation rates are reduced during training whenever there is a decrease in accuracy on the development set, and early stopping is used to prevent overtraining. Training and development splits are those provided by the shared task. The development set is also used to select which iteration's model is used.
To deal with small treebanks without development sets, we use a fixed training protocol developed by looking at the training of models with other small training sets. We run for a total of 8 iterations and change the learning rate and weight decay values every other iteration.
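A schedule of this kind might look like the sketch below. The reduction factor of 0.5, the starting values and the function names are placeholders chosen for illustration; the actual values used in the parser are not stated in this paper.

```python
def adjust_rates(dev_accuracies, learning_rate, weight_decay, factor=0.5):
    """Reduce both rates once for every drop in development accuracy.

    `factor` is a hypothetical reduction factor, not the value used by the parser.
    """
    for previous, current in zip(dev_accuracies, dev_accuracies[1:]):
        if current < previous:
            learning_rate *= factor
            weight_decay *= factor
    return learning_rate, weight_decay


def best_iteration(dev_accuracies):
    """Index of the iteration with the highest development accuracy."""
    return max(range(len(dev_accuracies)), key=dev_accuracies.__getitem__)


# Accuracy drops after iterations 1 and 3, so both rates are reduced twice.
accuracies = [0.70, 0.72, 0.71, 0.73, 0.72]
lr, decay = adjust_rates(accuracies, learning_rate=0.1, weight_decay=0.01)
chosen = best_iteration(accuracies)
```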

Dealing with surprise languages and other small datasets
To build a model for the surprise languages, we use simple cross-lingual techniques. For the official test phase, we identified the most similar languages to the surprise language with a string-based technique, concatenated the treebanks, trained and tested on the surprise languages.
The string-based technique constructs a list of words for each language. We used the sample data for the surprise language and the training data for the languages for which we have enough resources. Call these languages with big data sets B. We denote T as the set of lists of words of B, and t is a word in T . For a given surprise language, we calculate the similarity score S for each t. We treat two words as similar if and only if the first three characters of these two words are identical and the edit distance between these two words is less than or equal to 1. We choose the language that has the best S for training our model for the surprise language. This procedure yields the following similar languages for training. We call them source languages.
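The word-similarity test described above can be written down directly; how the per-word matches are aggregated into the score S is not spelled out in the text, so the sketch below assumes that S counts the surprise-language words that have at least one similar word in the candidate language's list. The function names, the handling of words shorter than three characters and the toy word lists are likewise assumptions.

```python
def edit_distance_at_most_one(a, b):
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]   # one substitution
    return a[i:] == b[i + 1:]           # one extra character in the longer word


def similar(a, b):
    """Same first three characters and edit distance at most 1."""
    return a[:3] == b[:3] and edit_distance_at_most_one(a, b)


def similarity_score(surprise_words, candidate_words):
    """Assumed score: number of surprise-language words with a similar candidate word."""
    by_prefix = {}
    for w in set(candidate_words):
        by_prefix.setdefault(w[:3], []).append(w)
    return sum(
        any(edit_distance_at_most_one(s, w) for w in by_prefix.get(s[:3], []))
        for s in set(surprise_words)
    )


def pick_source(surprise_words, candidates):
    """Candidate language whose word list gives the highest score."""
    return max(candidates, key=lambda lang: similarity_score(surprise_words, candidates[lang]))


# Toy word lists, invented for the example.
source = pick_source(["voda", "dobry"], {"cs": ["voda", "dobrý"], "tr": ["su", "iyi"]})
```

Grouping candidate words by their first three characters keeps the comparison cheap, since only words that already share the required prefix need the edit-distance check.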
Buryat: Russian (ru_syntagrus), Turkish (tr)
Upper Sorbian: Czech (cs), Norwegian (no_bokmaal)
Kurmanji: Spanish (es), Turkish (tr)
North Sami: Czech (cs), Finnish (fi_ftb)
To train a parser for the surprise language, we concatenate the datasets for the source languages with three copies of the dataset for the target language. Because our frequency threshold is three, this means that all words in the target language dataset are included in the vocabulary. We then train a parser on this concatenated dataset, using the surprise language corpus also as a development set.
In addition to the surprise languages, there are other languages whose available data is just enough for a small training set without any development set. For the submitted test run, we did not do anything special for these datasets (other than the training schedule discussed above), training parsing models on the individual datasets. But in subsequent experiments we tried treating them in the same way as surprise languages, with much improved results, discussed below.
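The effect of the three copies on the vocabulary can be checked with a small sketch. It assumes that the cutoff keeps every word whose count is at least the threshold, which is the reading under which three copies guarantee inclusion; `build_vocab` and the toy sentences are hypothetical.

```python
from collections import Counter


def build_vocab(source_sentences, target_sentences, copies=3, cutoff=3):
    """Vocabulary of the concatenated data: words seen at least `cutoff` times."""
    data = source_sentences + target_sentences * copies
    counts = Counter(word for sentence in data for word in sentence)
    return {word for word, n in counts.items() if n >= cutoff}


source = [["a", "b"], ["a", "c"], ["a"]]   # stand-in for the source-language treebank
target = [["x", "y"]]                      # stand-in for the tiny target-language sample
vocab = build_vocab(source, target)
```

Every target-language word occurs at least three times after copying, so it survives the cutoff even if it appears only once in the original sample, whereas rare source-language words are still filtered out.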

Test Phase Results
Evaluation was run on the provided TIRA platform (Potthast et al., 2014) using the data provided by the organisers (Nivre et al., 2017b), but blind to us, as described in the introduction. The results of our submission are shown in the next three tables. Accuracy by LAS is shown in Table 1. Accuracy on surprise languages is shown in Table 2. Accuracy on parallel UD data is shown in Table 3.

Analysis of results
Our results rank 25th out of the 33 participants overall, 22nd on the large treebanks only, 19th on the PUD treebanks only, 30th on the small treebanks, with only 6% accuracy (see below), and 20th on surprise languages. They are rather firmly in the bottom third, around 22nd-25th place, and they rarely beat the baseline. They are, however, well above the baseline or close to it (above or below) for twelve treebanks (fi_ftb, 8th, well above; fr_pud, 9th/33, well above; grc, 15th, just under; hi, 17th, just above; hi_pud, 10th, well above; it, 19th; ja_pud, 16th, just below; ko, 18th, well above; la_ittb, 18th, same; pl, 14th, above; sk, 18th, above; sl, 14th, a little above). There are a number of treebanks where the submitted parser does very poorly (fr_partut, 17%; ga, 4%; gl_treegal, 2.75%; kk, 1%; la, 6%; sl_sst, 4%; ug, 9%; uk, 8%). These are all small treebanks with no development set, which we treated in the same way as all other treebanks. As discussed in the post-test results section, treating these treebanks with the approach that we used for surprise languages instead yielded results that are on average 43% LAS better.
Table 4 shows the training and parsing times, calculated on the training and development sets, respectively.
Our shared task submission was prepared primarily by one computer science MSc student.

Post-Test Results
In the post-test results, we aim to increase the performance on the small treebanks, and correct errors in the submitted system.

Postprocessing
In the submitted parser, we overlooked the need to deprojectivise the output of the parser. In the post-test results, we run the MaltParser deprojectivisation routine on the output of the DINN parser before evaluation. Deprojectivisation makes little or no difference for most languages, but there is an improvement on some. Improvements range from zero to 1.3% LAS, with an average improvement of 0.16%. We report some deprojectivisation results on small treebanks in Table 5.

Dealing with small treebank languages
In the test phase, we trained a separate model on each small treebank alone. Given that our results were particularly unsatisfactory on small treebanks, in the post-test phase we tried a different technique: we treated small treebanks like surprise languages.
For small treebanks, we identified the best source language by exhaustively searching all the possible languages. As with surprise languages, we then concatenated three copies of the small treebank to the larger treebank and trained a parser on this combined dataset. Table 5 shows the treebank configurations and results on the development set and test set. This new method raises the total average score of our parser by 4.20% LAS.

Conclusions and Future Work
With this submission, we have shown how a neural network dependency parser whose main architecture is largely unchanged from ten years ago performs with respect to the state of the art. These results can serve as a baseline for future work evaluating to what extent recently proposed methods have a measurable impact on neural network dependency parser accuracy.