Combining Global Models for Parsing Universal Dependencies

We describe our entry, C2L2, to the CoNLL 2017 shared task on parsing Universal Dependencies from raw text. Our system features an ensemble of three global parsing paradigms, one graph-based and two transition-based. Each model leverages character-level bi-directional LSTMs as lexical feature extractors to encode morphological information. Though relying on baseline tokenizers and focusing only on parsing, our system ranked second in the official end-to-end evaluation with a macro-average of 75.00 LAS F1 score over 81 test treebanks. In addition, we had the top average performance on the four surprise languages and on the small treebank subset.


Introduction
General Parsing Approach Our submitted system to the CoNLL 2017 shared task (Zeman et al., 2017) focuses only on the task of dependency parsing, assuming that tokenization, sentence boundary detection, part-of-speech (POS) tagging and morphological features are already handled by a baseline model. In this paper, we highlight our neural-network-based feature extractors and ensemble of global parsing models, including two novel global transition-based models.
Bi-directional long short-term memory networks (bi-LSTMs; Graves and Schmidhuber, 2005) have recently achieved state-of-the-art performance on syntactic parsing (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016; Dozat and Manning, 2017). Our system leverages the representational power of bi-LSTMs to generate compact features for both graph-based and transition-based parsing frameworks. For the latter, the compact features further enable the application of dynamic programming techniques (Huang and Sagae, 2010; Kuhlmann et al., 2011) for global training and exact decoding. With just two bi-LSTM vectors as features, all three global parsing paradigms in our system have efficient $O(n^3)$ implementations. The full system consists of 3-5 instances of each of these unlabeled parsing models (9-15 in total, depending on the treebank), and another ensemble of arc labelers.
Adaptation of General Approach to the Shared Task The CoNLL 2017 shared task presents two unique challenges:
1. A large fraction of the datasets are in morphologically rich languages. Some languages have an exceedingly high out-of-vocabulary ratio of over 30%.
2. For many languages, very little training data is provided. Furthermore, there are four surprise languages, for which we only have tens of sample sentences.
We address the first challenge with character-level bi-LSTMs, which have previously been shown to be effective in multi-lingual POS tagging (Plank et al., 2016) and dependency parsing (Ballesteros et al., 2015; Alberti et al., 2017). Character-level representation gives better coverage, and it directly learns sub-word information through end-to-end training.
The second challenge is approached by transferring delexicalized information. For each of those languages with little training data, we select the most similar language according to linguistic typology. We then train delexicalized models taking only part-of-speech and morphology tags as input features, which are made available through baseline prediction during test time.
Our full system scored a macro-average LAS F1 score of 75.00, which ranked second among all participating systems. Additionally, in the categories of small treebanks and surprise languages, we obtained the best average performance.

Figure 1 illustrates our pipelined system. It processes raw text in four stages, starting from baseline UDPipe (Straka et al., 2016) tokenization and sentence delimitation. For this stage we use predictions provided by the organizers instead of training our own UDPipe models.
For each sentence, Stage II (§3) extracts a dense feature vector for each word in the sentence. For most languages, we employ character-level bi-LSTMs to capture morphological information. On top of the character-level representations, there is another layer of bi-LSTMs processing at the word level, the output of which gives context-sensitive features associated with every word in the sentence. For the four surprise languages and a selected set of languages with small training treebanks, we substitute the character-level encodings of each word in Stage II with the concatenation of part-of-speech (POS) tag embeddings and morphological feature embeddings, but keep the word-level bi-LSTMs. We call these delexicalized features, as opposed to the lexicalized features in the general case. All later stages are kept the same. The POS tags and morphological features are provided by baseline UDPipe predictions.
Stage III (§4.1) focuses on unlabeled parsing with an ensemble of three global models: one first-order graph-based maximum spanning tree algorithm (MST), and two transition-based models, namely arc-hybrid and arc-eager dynamic programming (AHDP and AEDP). They share the same underlying feature extractors. We combine outputs from the unlabeled parsing models with a uniform-weight reparsing model (Sagae and Lavie, 2006).
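The voting half of uniform-weight reparsing can be sketched as follows. This is a minimal illustration, not the submitted implementation: the function names and the head-list encoding are assumptions, and the tree decoder that would consume the resulting vote matrix is left out.

```python
from collections import Counter

def arc_votes(parses):
    """Count how many parsers proposed each (head, modifier) arc.

    Assumed encoding: each parse is a list `heads` where heads[m - 1] is the
    head of word m; words are 1-indexed and 0 is the artificial root.
    """
    votes = Counter()
    for heads in parses:
        for m, h in enumerate(heads, start=1):
            votes[(h, m)] += 1  # uniform weight: every model counts once
    return votes

def vote_matrix(parses):
    """Turn the votes into a dense matrix with matrix[h][m] = vote count."""
    n = len(parses[0])
    votes = arc_votes(parses)
    return [[votes.get((h, m), 0) for m in range(n + 1)] for h in range(n + 1)]
```

The matrix plays the role of an arc-score table, so any tree decoder that maximizes a sum of arc scores can be run on it to obtain the combined parse.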
The final stage (§4.2) of our system is arc labeling. Based on the extracted LSTM features and predicted unlabeled parse trees, this stage assigns the highest-scoring label to each arc. Similar to Stage III, we train multiple models with different random initializations, and the ensemble prediction is obtained via majority vote.
Our system was implemented with DyNet (Neubig et al., 2017). Each single model is of small size and runs efficiently. The submitted full system completed the test phase in 4.64 hours with 2 threads. We provide implementation details for all the modules and training process in §6. The code is available at https://github.com/CoNLL-UD-2017/C2L2.

Feature Extractors
In Stage II of our system, we first extract features for each word in isolation, then consider one sentence at a time for context-sensitive representations. These two feature extractors both leverage the representational power of bi-LSTMs.

Character LSTMs
The most straightforward ways to represent a word are binary features and word embeddings. Though popular in many existing parsers, they are not ideal for languages with high out-of-vocabulary (OOV) ratios. In Universal Dependencies, the 56 development sets have an average OOV ratio of 14.4%, with four languages (et, hu, ko and sk) higher than 30%, posing a severe challenge for lexical representation. On the other hand, the average out-of-charset (OOC) ratio is 0.03%, with the highest (zh) not exceeding 0.1%, suggesting the promise of character-level representations in terms of coverage.
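As a rough illustration of the two coverage statistics, the sketch below computes token-level OOV and OOC ratios of a development set against a training set. Counting over tokens (rather than types) is an assumption, as are the function names.

```python
def oov_ratio(train_tokens, dev_tokens):
    """Fraction of dev tokens whose word form never occurs in training."""
    vocab = set(train_tokens)
    if not dev_tokens:
        return 0.0
    unseen = sum(1 for tok in dev_tokens if tok not in vocab)
    return unseen / len(dev_tokens)

def ooc_ratio(train_tokens, dev_tokens):
    """Fraction of dev characters that never occur in any training token."""
    charset = {ch for tok in train_tokens for ch in tok}
    dev_chars = [ch for tok in dev_tokens for ch in tok]
    if not dev_chars:
        return 0.0
    unseen = sum(1 for ch in dev_chars if ch not in charset)
    return unseen / len(dev_chars)
```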
Our system adopts character-level bi-LSTMs similar to Plank et al. (2016) and Ballesteros et al. (2015). They show that the obtained sub-word information is especially useful for rare and OOV words in morphologically-rich languages.
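Formally, for a word $w$ with character sequence $[\mathrm{BOW}, c_1, \ldots, c_m, \mathrm{EOW}]$, where BOW (or $c_0$) and EOW (or $c_{m+1}$) are two special begin-of-word and end-of-word symbols, we run a forward and a backward LSTM at layer $l$:

$$[\overrightarrow{c}^{\,l}_0, \ldots, \overrightarrow{c}^{\,l}_{m+1}] = \overrightarrow{\mathrm{LSTM}}^{l}([c^{l-1}_0, \ldots, c^{l-1}_{m+1}])$$
$$[\overleftarrow{c}^{\,l}_0, \ldots, \overleftarrow{c}^{\,l}_{m+1}] = \overleftarrow{\mathrm{LSTM}}^{l}([c^{l-1}_0, \ldots, c^{l-1}_{m+1}])$$
$$c^{l}_i = \overrightarrow{c}^{\,l}_i \circ \overleftarrow{c}^{\,l}_i$$

where each $c^{l}_i$ denotes the vector representation at layer $l$ for $c_i$, $\circ$ denotes concatenation of vectors, and $[\cdot]$ is a shorthand for a list of vectors. The inputs to the first layer, $c^{0}_i$, are character embeddings that are jointly trained with the model. We take the concatenation of $\overrightarrow{c}_{m+1}$ and $\overleftarrow{c}_0$ at the final layer of the LSTMs as the output vector. We use two-layer bi-LSTMs in our system.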
Efficiency Improvement Given the Zipfian distribution of word frequencies, most of the time is spent on computing char bi-LSTM representations for frequent words. On the other hand, for those words, it is considerably easier to train decent representations even without char bi-LSTMs. We thus directly learn dense word vectors for frequent words as a proxy for character-level bi-LSTMs. They can be considered fast look-up tables that avoid actually running the LSTMs.
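The look-up shortcut can be sketched as a thin wrapper around a character encoder. Everything named here is hypothetical: `char_encoder` stands in for the character bi-LSTM, `frequent_vectors` for the learned table, and the frequency cut-off `min_count` is not given in this text.

```python
def frequent_words(counts, min_count):
    """Select the words that get their own directly learned vector.

    `min_count` is a hypothetical threshold; the text does not state one.
    """
    return {word for word, count in counts.items() if count >= min_count}

def make_word_encoder(char_encoder, frequent_vectors):
    """Return an encoder that skips the character model for frequent words."""
    def encode(word):
        vec = frequent_vectors.get(word)
        if vec is not None:
            return vec              # fast path: plain table look-up
        return char_encoder(word)   # slow path: run the character model
    return encode
```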

Delexicalized Features
For languages with small treebanks, the provided data is not adequate to learn character bi-LSTMs. We choose to use the available delexicalized information predicted by UDPipe. Namely, we use information from two fields: universal POS tags (UPOS) and morphological tags.
To get dense vectors for each word $w$ in the same form as the output of char bi-LSTMs, we use the concatenation of the UPOS embedding $\vec{p}_w$ and the bag-of-morphology (BOM) embedding $\mathrm{pool}(\{\vec{m}_w\})$. The BOM embeddings require a pooling function $\mathrm{pool}(\cdot)$ because each word may receive multiple morphological tags. In our system, we use the element-wise max operator as the pooling function.
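A small sketch of the delexicalized word vector follows, using plain Python lists as embeddings. The dictionary-based embedding tables and the zero-vector fallback for words without morphological tags are assumptions made for illustration.

```python
def max_pool(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

def delex_features(upos, morph_tags, upos_emb, morph_emb, morph_dim):
    """Concatenate the UPOS embedding with the max-pooled morphology embeddings.

    Assumption: a word with no morphological tags gets an all-zero BOM vector.
    """
    tag_vectors = [morph_emb[tag] for tag in morph_tags]
    pooled = max_pool(tag_vectors) if tag_vectors else [0.0] * morph_dim
    return upos_emb[upos] + pooled  # list concatenation
```

For a noun tagged `Number=Sing` and `Case=Nom`, the pooled half of the vector keeps, per dimension, the larger of the two tag embeddings.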

Word-level LSTMs
The character bi-LSTM vector for each word is computed in isolation from other words in the sentence. In this module, we again leverage bi-LSTMs for integration of contextual information. Similar to §3.1, we pad a sentence with two special symbols, begin-of-sentence (BOS, or $w_0$) and end-of-sentence (EOS, or $w_{n+1}$), into $[\mathrm{BOS}, w_1, \ldots, w_n, \mathrm{EOS}]$. Inputs to the first layer are character bi-LSTM encodings, or the concatenation of POS-tag and BOM embeddings in the case of delexicalized models. We take the bi-directional vectors $\overleftrightarrow{w}_i$ at the final layer as the context-sensitive representation associated with $w_i$. All parsing components described in the following section build on these vectors.

Parsing Components
Our system parses a sentence in two steps, first predicting the unlabeled parse tree, and next predicting the label for each arc in the unlabeled tree.

Global Models for Unlabeled Parsing
Our system includes three different global parsing paradigms: one graph-based and two transition-based. All of these models handle only projective trees. For this reason, before training, we projectivize all gold-standard trees in the training sets.
First-order Graph-based Parsing Our graph-based model is based on the popular edge-factored Eisner's algorithm (Eisner, 1996; Eisner and Satta, 1999). Each potential arc $(h, m)$ in the graph ($O(n^2)$ in total for sentence length $n$) is first scored with a function $\mathrm{score}_{\mathrm{MST}}(h, m)$. Then Eisner's algorithm is used to find the maximum spanning tree among all possible projective trees:

$$\operatorname*{argmax}_{\text{valid parses } y} \sum_{(h, m) \in y} \mathrm{score}_{\mathrm{MST}}(h, m)$$

Following Dozat and Manning (2017), we use a deep bi-affine scoring function over the bi-LSTM vectors of the head and the modifier.

Global Transition-based Parsing We include global training and exact decoders for two transition systems, arc-hybrid and arc-eager. They are based on dynamic programming approaches (Huang and Sagae, 2010; Kuhlmann et al., 2011); we therefore call the two models AHDP and AEDP.
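For readers unfamiliar with edge-factored projective decoding, the following is a compact textbook-style sketch of Eisner's $O(n^3)$ algorithm in plain Python. It is not the submitted code: the score-matrix layout (`scores[h][m]`, node 0 as root) is an assumption, and no single-root constraint is enforced.

```python
def eisner(scores):
    """Return heads of the best projective tree; scores[h][m] scores arc h -> m.

    Node 0 is the artificial root, so heads[0] is -1.
    """
    n = len(scores)
    NEG = float("-inf")
    # Direction index d: 0 = head at right end t, 1 = head at left end s.
    comp = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    inc = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    comp_bp = [[[-1, -1] for _ in range(n)] for _ in range(n)]
    inc_bp = [[-1] * n for _ in range(n)]
    for span in range(1, n):
        for s in range(n - span):
            t = s + span
            # Incomplete spans: join two complete halves, then add one arc.
            best, arg = NEG, -1
            for r in range(s, t):
                val = comp[s][r][1] + comp[r + 1][t][0]
                if val > best:
                    best, arg = val, r
            inc_bp[s][t] = arg
            inc[s][t][0] = best + scores[t][s] if s > 0 else NEG  # root takes no head
            inc[s][t][1] = best + scores[s][t]
            # Complete span headed at t.
            best, arg = NEG, -1
            for r in range(s, t):
                val = comp[s][r][0] + inc[r][t][0]
                if val > best:
                    best, arg = val, r
            comp[s][t][0], comp_bp[s][t][0] = best, arg
            # Complete span headed at s.
            best, arg = NEG, -1
            for r in range(s + 1, t + 1):
                val = inc[s][r][1] + comp[r][t][1]
                if val > best:
                    best, arg = val, r
            comp[s][t][1], comp_bp[s][t][1] = best, arg
    heads = [-1] * n

    def backtrack(s, t, d, complete):
        if s == t:
            return
        if complete:
            r = comp_bp[s][t][d]
            if d == 0:
                backtrack(s, r, 0, True)
                backtrack(r, t, 0, False)
            else:
                backtrack(s, r, 1, False)
                backtrack(r, t, 1, True)
        else:
            if d == 0:
                heads[s] = t
            else:
                heads[t] = s
            r = inc_bp[s][t]
            backtrack(s, r, 1, True)
            backtrack(r + 1, t, 0, True)

    backtrack(0, n - 1, 1, True)
    return heads
```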
The dynamic programming shares computation for parser configurations with the same extracted features. In our system, we only use two bi-LSTM vectors, one from the top of the stack ($\overleftrightarrow{s}_0$), and one from the top of the buffer ($\overleftrightarrow{b}_0$). This compact set of features enables dynamic programming to compress the exponentially large search space down to $O(n^3)$ for the two transition systems.
Below we illustrate the AHDP decoder, with AEDP being similar. The bare deduction system, adapted from Kuhlmann et al. (2011), is:

$$\mathrm{sh}: \frac{[i, j]}{[j, j+1]} \qquad \mathrm{re}_{\curvearrowright}: \frac{[k, i] \quad [i, j]}{[k, j]} \; (k \curvearrowright i) \qquad \mathrm{re}_{\curvearrowleft}: \frac{[k, i] \quad [i, j]}{[k, j]} \; (i \curvearrowleft j)$$

Here $\mathrm{re}_{\curvearrowright}$ attaches $w_i$ to $w_k$, and $\mathrm{re}_{\curvearrowleft}$ attaches $w_i$ to $w_j$. Each deduction item $[i, j]$ corresponds to a push computation detailed in Kuhlmann et al. (2011). For the purpose of our decoder, the deduction item can also be understood as a parser configuration with $w_i$ being $s_0$ and $w_j$ being $b_0$. The deduction system has an axiom $[0, 1]$ and a goal $[0, n+1]$, corresponding to the initial and terminal configurations. Next, we incorporate scoring functions into the deduction system. The scoring functions are bi-affine and take the same form as $\mathrm{score}_{\mathrm{MST}}(\cdot)$. The highest-scoring proof for the goal item $[0, n+1]$ constitutes the predicted transition sequence.
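To make the chart computation concrete, here is a hedged sketch of an exact arc-hybrid decoder over items $[i, j]$. The way scores attach to rules is an assumption, since the scored rules are not reproduced in this text: every transition is scored from the $(s_0, b_0)$ index pair of the configuration in which it is taken, and the score of the shift that pushed $w_i$ is added when $w_i$ is reduced. The callables `score_sh`, `score_re_right` and `score_re_left` are hypothetical stand-ins for the bi-affine scorers.

```python
def ahdp_decode(n, score_sh, score_re_right, score_re_left):
    """Exact arc-hybrid decoding for a sentence of n words.

    Index 0 is the root and n + 1 marks the empty buffer. Each scorer takes
    (s0, b0) word indices. Returns (best score, heads) with heads[0] == -1.
    """
    NEG = float("-inf")
    size = n + 2
    best = [[NEG] * size for _ in range(size)]
    back = [[None] * size for _ in range(size)]
    for j in range(size - 1):
        best[j][j + 1] = 0.0  # axiom [0, 1] and every item produced by sh
    for span in range(2, size):
        for k in range(size - span):
            j = k + span
            for i in range(k + 1, j):
                # w_i was shifted in configuration (s0 = k, b0 = i) and is
                # reduced in configuration (s0 = i, b0 = j).
                base = best[k][i] + best[i][j] + score_sh(k, i)
                candidates = [(base + score_re_right(i, j), k)]  # head of i is k
                if j <= n:  # the end-of-buffer marker cannot be a head
                    candidates.append((base + score_re_left(i, j), j))  # head of i is j
                for value, head in candidates:
                    if value > best[k][j]:
                        best[k][j] = value
                        back[k][j] = (i, head)
    heads = [-1] * (n + 1)
    agenda = [(0, n + 1)]  # start from the goal item
    while agenda:
        k, j = agenda.pop()
        if j - k < 2:
            continue  # items [j, j + 1] contain no reduce
        i, head = back[k][j]
        heads[i] = head
        agenda.append((k, i))
        agenda.append((i, j))
    return best[0][n + 1], heads
```

The three nested loops over `k`, `i` and `j` give the cubic running time mentioned above; the decoder never enumerates transition sequences explicitly.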
Training We employ discriminative training strategies for all three global parsing models. Cost-augmented decoding (Taskar et al., 2005; Smith, 2011) is applied during training. A correct parse tree is required to score higher than an incorrect parse tree by a margin equal to the number of incorrectly attached nodes (Hamming distance). This technique has previously been applied in training a neural MST parser (Kiperwasser and Goldberg, 2016).
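The margin-based objective can be sketched for an arc-factored model as follows. This is an illustration under assumptions: the decoder is passed in as a callable, and `argmax_heads` is only a simple stand-in that ignores the tree constraint; a real setup would plug in an exact tree decoder instead.

```python
def cost_augment(scores, gold_heads):
    """Add a cost of 1 to every arc that disagrees with the gold tree."""
    n = len(scores)
    augmented = [row[:] for row in scores]
    for m in range(1, n):
        for h in range(n):
            if h != gold_heads[m]:
                augmented[h][m] += 1.0
    return augmented

def tree_score(scores, heads):
    """Sum of arc scores of a tree given as a head list (heads[0] unused)."""
    return sum(scores[heads[m]][m] for m in range(1, len(heads)))

def structured_hinge(scores, gold_heads, decode):
    """Hinge loss with a Hamming-distance margin, via cost-augmented decoding."""
    augmented = cost_augment(scores, gold_heads)
    predicted = decode(augmented)
    # Gold arcs carry no added cost, so the gold tree is scored on `scores`.
    loss = tree_score(augmented, predicted) - tree_score(scores, gold_heads)
    return max(0.0, loss), predicted

def argmax_heads(scores):
    """Stand-in decoder: best head per word, with no tree constraint."""
    n = len(scores)
    return [-1] + [
        max((h for h in range(n) if h != m), key=lambda h: scores[h][m])
        for m in range(1, n)
    ]
```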

Arc Labeling
We separate out the stage of arc labeling and adopt a simple labeler proposed by Kiperwasser and Goldberg (2016). For a predicted arc with $h$ as the head and $m$ as the modifier, their associated vectors are concatenated to form the input to an MLP. Each dimension of the output from the MLP corresponds to the score for a potential label, and we select the label with the highest score:

$$\mathrm{label}(h, m) = \operatorname*{argmax}_{\ell} \; \mathrm{MLP}(\overleftrightarrow{w}_h \circ \overleftrightarrow{w}_m)[\ell]$$

The arc-labeling models are trained with gold-standard $(h, m)$ tuples. We use a discriminative hinge loss with a margin of 1.
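A toy version of the labeler, written with plain lists, might look like the following. The parameter layout and the particular multi-class hinge formulation (gold score against the best wrong label) are assumptions; the text only states that a hinge loss with margin 1 is used.

```python
import math

def mlp_scores(x, W1, b1, W2, b2):
    """One tanh hidden layer followed by a linear output layer."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def predict_label(head_vec, mod_vec, params, labels):
    """Score the concatenated head and modifier vectors; return the best label."""
    scores = mlp_scores(head_vec + mod_vec, *params)
    return labels[max(range(len(labels)), key=scores.__getitem__)]

def hinge_loss(scores, gold_index, margin=1.0):
    """Penalize the gold label for not beating the best wrong label by `margin`."""
    best_wrong = max(s for i, s in enumerate(scores) if i != gold_index)
    return max(0.0, margin + best_wrong - scores[gold_index])
```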

Results
The main official evaluation results are given in Table 1. Our system achieved second place in the overall ranking. When considering average performance on small treebanks (8 treebanks) and surprise languages (4 treebanks, detailed in Table 2), we ranked first among all teams.
We relied on the baseline models for tokenization and sentence boundary detection, which is reflected by the gap between our system and the best-performing systems on ja, vi, he and zh. The other large source of gaps comes from languages with large non-projective ratios, such as grc, la and nl. The global transition-based AHDP and AEDP models are not compatible with non-projective parsing, and we did not implement or test non-projective graph-based parsers due to time and resource constraints.
Our system performs relatively well on languages with high OOV ratios, such as hu, ko, lv and et, with the help of character bi-LSTMs. In addition, the strategy of concatenating multiple training treebanks for the same language (see §6) brought success on small treebanks.

Table 3 gives the performance of our system on the 14 additional parallel treebanks. The results are largely consistent with the in-domain evaluation results, and we ranked within the top third for most treebanks except ja_pud, en_pud and ru_pud. We did not implement our own tokenizer for Japanese, which explains the gap. For the other two languages, our selected models were not domain-robust. In a post-evaluation analysis, we parsed the PUD treebanks (Nivre et al., 2017a) with models trained on the canonical treebanks. The two languages see LAS improvements of 7.53 and 14.73, respectively.

Ablation Analysis
To examine the effect of individual components in our ensemble system, we evaluate several variations that use a single model or an incomplete set of models for unlabeled parsing and arc labeling. Results are shown in Table 4. AEDP gives higher unlabeled parsing performance, and an ensemble of three AEDP instances achieves performance comparable to our full system. The arc-labeling ensemble gives a further LAS gain of 0.31.

Implementation Details
Our system was trained on the UD 2.0 dataset (Nivre et al., 2016, 2017b), with the provided training and development splits when available. For languages without development sets, we split the training sets into train/dev sets with ratio 0.9/0.1. We did not use any additional data. All neural network computation was implemented with DyNet (Neubig et al., 2017).

Stage I of our system is the baseline system UDPipe 1.1, and we directly used the outputs provided by the organizers. We implemented modules for all later stages. They were trained with gold-standard features and tokenizations.

For all languages and all treebanks, we trained models with 2-layer-deep and 192-unit-wide (96 units for each direction) word-level bi-LSTMs as feature extractors. Lexicalized character bi-LSTMs are 2 layers deep and 128 units wide, with 64-dimensional input character embeddings. For languages without lexicalized feature extractors, we used the concatenation of 64-dimensional UPOS embeddings and max pooling of 64-dimensional morphological embeddings as input to the word-level bi-LSTMs.
The word-level bi-LSTM feature vectors were passed through MLPs with 1 hidden layer and 192 hidden units before the bi-affine scoring functions for MST, AHDP and AEDP unlabeled parsing. In arc labelers, we concatenated the word-level feature vectors and passed them through a 1-layer MLP with 192 hidden units to get scores for the arc labels. The output layer size depends on the number of labels appearing in the training set for the treebank concerned. We projected language-specific arc tags into universal ones before training.
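The MLP-then-bi-affine scoring path can be illustrated with the sketch below. The exact parameterization is an assumption: it uses a bilinear term plus two linear terms and a bias, one common reading of the bi-affine form in Dozat and Manning (2017), and all parameter names are hypothetical.

```python
import math

def dense_tanh(x, W, b):
    """Single tanh layer: one output unit per row of W."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W, b)
    ]

def biaffine(v_head, v_mod, U, u_head, u_mod, bias):
    """v_head^T U v_mod + u_head . v_head + u_mod . v_mod + bias."""
    bilinear = sum(
        v_head[a] * U[a][c] * v_mod[c]
        for a in range(len(v_head))
        for c in range(len(v_mod))
    )
    linear = sum(u * v for u, v in zip(u_head, v_head))
    linear += sum(u * v for u, v in zip(u_mod, v_mod))
    return bilinear + linear + bias

def arc_score(w_head, w_mod, head_mlp, mod_mlp, biaffine_params):
    """Project each word vector with its own MLP, then score the pair."""
    v_head = dense_tanh(w_head, *head_mlp)
    v_mod = dense_tanh(w_mod, *mod_mlp)
    return biaffine(v_head, v_mod, *biaffine_params)
```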
All the aforementioned hidden layers used tanh as the activation function. The parameters were uniformly initialized (Glorot and Bengio, 2010), except for the weight matrices in the bi-affine scoring functions, which were initialized to be orthogonal (Saxe et al., 2013). We did not use any pretrained word embeddings.
We applied dropout at every stage. MLPs had dropout rates of 0.3 (Srivastava et al., 2014). Bi-LSTMs, both character-level and word-level, also had dropout rates of 0.3 for input and recurrent connections (Gal and Ghahramani, 2016). Further, we zeroed out input vectors to word-level LSTMs 15% of the time, to encourage the models to gain more information from context.
When we trained each model, we randomly shuffled the training set before starting each epoch, and grouped sentences into mini-batches of approximately 100 words. The discriminative loss functions were optimized with the Adam optimizer (Kingma and Ba, 2015), with default hyperparameters except for an initial learning rate of 0.002. We evaluated the models on development data after every 500 mini-batches. We halved the learning rate if the performance plateaued for 5 consecutive evaluations. The process was repeated 3 times before we terminated the training process.
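The batching and learning-rate schedule can be sketched as below. Two points are assumptions: a batch is closed as soon as it reaches the word target, and "repeated 3 times" is read as three halvings followed by termination at the next plateau. The class and function names are hypothetical.

```python
def batches_by_words(sentences, target_words=100):
    """Group sentences into batches of roughly `target_words` tokens."""
    batch, count = [], 0
    for sentence in sentences:
        batch.append(sentence)
        count += len(sentence)
        if count >= target_words:
            yield batch
            batch, count = [], 0
    if batch:
        yield batch

class HalvingSchedule:
    """Halve the learning rate on a plateau; stop after `max_halvings` halvings."""

    def __init__(self, lr=0.002, patience=5, max_halvings=3):
        self.lr = lr
        self.patience = patience
        self.max_halvings = max_halvings
        self.best = float("-inf")
        self.stale = 0      # evaluations since the last improvement
        self.halvings = 0

    def update(self, dev_score):
        """Record one dev evaluation; return False when training should stop."""
        if dev_score > self.best:
            self.best = dev_score
            self.stale = 0
        else:
            self.stale += 1
        if self.stale >= self.patience:
            if self.halvings == self.max_halvings:
                return False
            self.lr /= 2.0
            self.halvings += 1
            self.stale = 0
        return True
```

A training loop would call `update` after every 500 mini-batches and feed `lr` back into the optimizer whenever it changes.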
We employed the technique of stack-propagation (Zhang and Weiss, 2016), where the auxiliary task of UPOS prediction was used as a regularizer. It received 0.1 times the weight of the other components in computing the loss.
For the languages with multiple treebanks, we first concatenated the training treebanks and trained a general model. We then fine-tuned the models on the respective individual treebanks.
To speed up training, we simultaneously trained MST, AHDP, AEDP and arc labeling models with shared LSTM feature extractors. Their losses were linearly combined with weights 0.6, 0.3, 0.3 and 1.5, respectively. After a joint model had been trained, we fine-tuned each of the four tasks separately.
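The linear combination of task losses amounts to a weighted sum. In the sketch below the task keys are hypothetical names, and only the four weights are taken from the text.

```python
# Task weights from the text; the dictionary keys are illustrative names.
TASK_WEIGHTS = {"mst": 0.6, "ahdp": 0.3, "aedp": 0.3, "label": 1.5}

def joint_loss(losses, weights=TASK_WEIGHTS):
    """Weighted sum of per-task losses computed on shared feature extractors."""
    return sum(weights[task] * losses[task] for task in weights)
```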
Our final system included ensembles both for unlabeled parsing and arc labeling. They were obtained with different random initializations of the neural network, but trained in the same fashion. For languages with multiple treebanks, we trained 3 sets of models (3 for each parsing paradigm, 9 unlabeled parsing models in total, plus 3 for arc labeling). For languages with single treebanks, we trained 5 sets of models.
For surprise languages, we first trained delexicalized models using the training data of the most similar language according to the WALS features (Dryer and Haspelmath, 2013). We selected fi, fa, hi and cs for sme, kmr, bxr and hsb, respectively. We then fine-tuned the models on the sample data for these languages. We treated kk and ug similarly, as they have quite small training sets. Both of them used tr as the source language.
The entire training process of all models in the ensemble for all treebanks was done using 8 CPU cores (2 × Intel i7-4790 @ 3.60GHz) in approximately one week. Each model required at most 2GB RAM plus the amount needed for holding the training sets. On the online evaluation platform TIRA (Potthast et al., 2014), the test phase for our full model finished in 4.64 hours with 2 threads. Each model required at most 500MB RAM plus the amount needed for holding the test sets.