Linear Neural Parsing and Hybrid Enhancement for Enhanced Universal Dependencies

To accomplish the shared task on dependency parsing, we explore the use of a linear transition-based neural dependency parser as well as a combination of three such parsers by means of a linear tree combination algorithm. We train separate models for each language on the shared task data. We compare our base parser with two biaffine parsers and also present an ensemble combination of all five parsers, which achieves an average UAS 1.88 points lower than the top official submission. For producing the enhanced dependencies, we exploit a hybrid approach, coupling an algorithmic graph transformation of the dependency tree with predictions made by a multitask machine learning model.


System Overview
The shared task is aimed at performing all the levels of linguistic analysis according to the UD guidelines, starting from raw text all the way to enhanced dependency graphs, in a multi-language setting covering seventeen languages (Bouma et al., 2020).
In this endeavor, we concentrate on the syntactic parsing and enhancement stages, by exploiting existing tools for tokenization, sentence splitting, POS tagging and morphological analysis.
For syntactic parsing we make experiments exploring different ideas, in an attempt to improve state-of-the-art parsers with linear complexity. A parser combination is then used for our official submission, exploiting the linear tree combination algorithm by Attardi and Dell'Orletta (2009), resulting in an overall linear algorithm.
For the enhancement step, we build on previous work on an enhancer for UD based on algorithmic graph transformation, which was used to produce the Italian version of the enhanced dependencies. That script used language-specific heuristics and lexical information, achieving a good degree of accuracy for Italian and English. In this multi-language challenge, we have to deal with partial implementations of the expected enhancement types as well as with varying degrees of compliance with the guidelines across languages. In order to address this additional complexity, we implement a new version of the script that is modular, parametric, and language independent. For specific enhancement tasks, we integrate the output of machine learning classifiers, in an attempt to learn from the training data and make the heuristics more robust and general.

Syntactic parsing
State-of-the-art dependency parsers often adopt a graph-based model, relying on neural networks for the choice of arcs and labels.
We consider as representative of the current SoTA on the English PTB the dependency parsers listed in Table 1.
In particular the Bi-LSTM-based deep biaffine neural dependency parser by Dozat and Manning (2017) has been quite popular and used in three out of five of the top submissions to the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018), in particular in the top non-ensemble submission (Kanerva et al., 2018).
The preference for such models leads to systems with high accuracy but possibly slower speed due to their O(n²) complexity. For example, the original implementation of the Dozat parser is rated at about 400 sents/sec on GPUs, while the neural transition-based parser by Chen and Manning (2014) is rated at 640 sents/sec on CPUs alone. Our experiments attempt to find a parser with linear complexity and hence good speed performance. Indeed the linear transition parser that we choose for our experiments (UUParser) is twice as fast as the latest version of the biaffine parser from Stanford (Stanza).

Parser                                                      UAS    LAS
HPSG (Zhou and Zhao, 2019)                                  96.09  94.68
BIST-Graph (Kiperwasser and Goldberg, 2016)                 93.10  91.00
Biaffine (Dozat and Manning, 2017)                          95.74  94.08
Pointer-TD (Ma et al., 2018)                                95.87  94.19
Pointer-LR (Fernández-González and Gómez-Rodríguez, 2019)   96.04  94.43
UUParser (de Lhoneux et al., 2017)                          94.63  92.77
BIST-Transition (Kiperwasser and Goldberg, 2016)            93.90  91.90
CM (Chen and Manning, 2014)                                 91.80  89.60

Table 1: SoTA dependency parsers, grouped into graph-based (top) and transition-based (bottom).

However, after submission, we discovered a new implementation of the biaffine parser in PyTorch (Zhang, 2019), which is 5 times faster thanks to better exploitation of GPU acceleration. We trained our own models for each language on the shared task treebanks for UUParser, UDPipe and Zysite, while we used a pretrained multilanguage model for UDify and pretrained individual language models for Stanza.

UUParser
We choose UUParser as our base parser. UUParser (de Lhoneux et al., 2017) is a transition-based parser, derived from the parser by Kiperwasser and Goldberg (2016): the bidirectional LSTM's recurrent output vector for each word is concatenated with each possible head's recurrent vector, and the result is used as input to an MLP that scores each resulting arc. The predicted tree structure at training time is the one where each word depends on its highest-scoring head. Labels are generated analogously, with each word's recurrent output vector and its gold or predicted head word's recurrent vector being used in a multi-class MLP. We ported the Kiperwasser parser to Python 3. UUParser was further extended to deal with non-projectivity by means of a swap transition and to support ELMo embeddings as an input to the LSTM.
We further extended UUParser in order to exploit BERT and AlBERT embeddings. Words are first tokenized with the model-specific tokenizer; the embedding of a word split into wordpieces is then obtained as the average of its wordpiece embeddings.
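The averaging step can be sketched as follows (a minimal illustration with numpy; the function name and the word-id bookkeeping are ours, not UUParser's actual API):

```python
import numpy as np

def word_embeddings(wordpiece_vectors, word_ids):
    """Average wordpiece vectors back into word-level embeddings.

    wordpiece_vectors: (n_pieces, dim) array of contextual embeddings.
    word_ids: for each wordpiece, the index of the word it belongs to
              (assumed to cover 0..n_words-1 with no gaps).
    """
    n_words = max(word_ids) + 1
    dim = wordpiece_vectors.shape[1]
    sums = np.zeros((n_words, dim))
    counts = np.zeros(n_words)
    for vec, wid in zip(wordpiece_vectors, word_ids):
        sums[wid] += vec
        counts[wid] += 1
    return sums / counts[:, None]   # mean of the pieces of each word
```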
The code for the extended version is available on GitHub.
In development experiments, using the English and Italian train and development sets provided for the task, we obtained the results in Table 2. For BERT we use the base-uncased model and for AlBERT the large-v2 model, which we keep frozen during training. Given the minor difference between BERT and AlBERT, in our experiments we choose to use the BERT model.
We explored the idea of providing hints to the parser, obtained from structural syntax probes (Hewitt and Manning, 2019). We use a syntax probe to estimate the parse-tree path distance between two tokens. The transition-based parser needs to decide at each step which transition to apply to the words on the top of the stack (s0) and at the front of the input buffer (b0). The probe computes a distance matrix for each pair of tokens in a sentence, and the parser is given, as additional features, the estimated distances between b0 and the top k (default 3) tokens on the stack. These distances should help the parser decide whether to perform a Shift transition rather than a premature Reduce.
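Following Hewitt and Manning (2019), the probe estimates the tree distance between tokens i and j as the squared L2 distance between their contextual embeddings after a learned linear projection B. A sketch of the distance computation and of the extra features fed to the parser (function names are ours):

```python
import numpy as np

def probe_distances(H, B):
    """Estimate pairwise parse-tree distances from contextual embeddings.

    H: (n, d) matrix of token embeddings; B: (k, d) learned probe matrix.
    Entry (i, j) of the result approximates the number of tree edges
    between tokens i and j (Hewitt and Manning, 2019).
    """
    P = H @ B.T                           # project into probe space
    diff = P[:, None, :] - P[None, :, :]  # all pairwise differences
    return (diff ** 2).sum(-1)            # squared L2 distances

def stack_buffer_features(D, b0, stack, k=3):
    """Distances between the front of the buffer (b0) and the top-k
    stack tokens: the additional features given to the parser."""
    top = stack[-k:][::-1]                # stack top first
    return [D[b0, s] for s in top]
```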
The results we obtained with such an extension on the English development corpus were 92.21 UAS and 90.31 LAS, using ELMo embeddings for word representations and BERT for syntax probes: a small improvement over the 91.32 UAS and 89.33 LAS obtained without these features.
We also tested two biaffine parsers: the implementation by Zysite (Zhang, 2019) and Stanza (Qi et al., 2020), which augments the biaffine parser with features to predict the linearization order of two words in a given language and the typical linear distance between them. We report in Table 3 the average speed on all 17 test sets of the challenge obtained by the linear parser and the two quadratic biaffine graph parsers.
The Zysite biaffine parser turns out to be both the most accurate and the fastest. Its training time is however significant: for example, Zysite takes more than 39 hours to train on the Czech treebank, which has 68,495 sentences. The experiments were performed on a Dell server using a single NVIDIA Tesla T4 GPU.

Tokenization, Tagging
UUParser provides neither tokenization nor tagging capabilities, so we have to rely on another set of tools to accomplish these tasks. We choose UDPipe (Straka and Straková, 2017) to perform sentence splitting, tokenization and tagging. This gives us a common tagged representation to use also with alternative parsers.
Some of the parsers tested provide the ability to perform end-to-end parsing from raw text, in particular UDify (Kondratyuk and Straka, 2019) and Stanza. However, they turn out not to be very effective: the pretrained model of UDify does not support all the task languages, and Stanza exhibits odd behavior; for example, it would split a word like "GoogleOS" not just into two tokens, "Google" and "OS", but into two separate sentences.
So eventually we decided to use the same tokenization, produced by UDPipe, as input to all parsers. This also enables us to produce an ensemble version combining the outputs of three parsers.

Ensemble of Parsers
In the official submission, we exploit the linear tree combination algorithm by Attardi and Dell'Orletta (2009) to combine the outputs of an ensemble of dependency parsers. The algorithm is greedy and works by combining the trees top down. It has been shown to outperform more complex algorithms based on computing the Maximum Spanning Tree.
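A simplified sketch of such a greedy top-down combination, using plain vote counting over the arcs proposed by the parsers (the original algorithm's arc scoring may differ from this illustration):

```python
from collections import Counter

def combine_trees(trees):
    """Greedy top-down combination of dependency trees (simplified
    sketch of Attardi and Dell'Orletta, 2009). Each tree maps token id
    -> head id, with head 0 for the root. Arcs are weighted by how many
    parsers propose them; the tree is grown top-down, attaching only
    tokens whose head is already in the partial tree, so the result is
    guaranteed to be a well-formed tree."""
    votes = Counter()
    for tree in trees:
        for dep, head in tree.items():
            votes[(dep, head)] += 1
    tokens = set(trees[0])
    attached = {0}                    # 0 is the artificial root
    combined = {}
    while len(combined) < len(tokens):
        # among unattached tokens, pick the highest-voted arc whose
        # head is already part of the partial tree
        dep, head = max(
            ((d, h) for (d, h) in votes
             if d not in attached and h in attached),
            key=lambda arc: votes[arc])
        combined[dep] = head
        attached.add(dep)
    return combined
```

Since each input is itself a tree rooted at 0, at every step at least one unattached token has a proposed head inside the partial tree, so the greedy loop always terminates.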
The parsers used are UDify, UUParser and UDPipe.
In a later unofficial submission labeled comb5, we included also Zysite and Stanza in the ensemble. Table 7 presents the results of this submission compared to the best performing official submission in the challenge.

Enhanced Dependencies
For producing the enhanced dependencies we follow a "hybrid" approach, using a combination of an algorithmic graph transformation of the syntactic dependency tree coupled with predictions made by three machine learning classifiers. The basic enhancing script is an evolution of the work presented in  to bootstrap enhanced dependencies for the Italian treebank, also used for experiments in .
One classifier is used to recognize the external subjects in xcomp constructions. The second classifier detects when a head should be propagated in conjunctions. The third classifier detects the case of propagation of dependents in conjunctions. The classifiers are trained jointly on the three tasks and produce three binary predictions.
The script that adds the enhanced dependencies is modular, so that it can be adapted to perform just the required analysis depending on the kind of enhanced dependencies present in each language and to bypass those that were not implemented. In addition, the script is parametric with respect to predictions coming from machine learning classifiers, which can be taken into account or ignored. We describe below how the different kinds of enhancements are addressed.

Controlled/Raised Subjects
This type of enhancement applies to subordinate infinitive clauses introduced by the xcomp relation and consists in adding an extra nsubj dependency to the embedded or controlled verb. The difficult aspect of this enhancement is predicting the correct subject for the dependent clause among the different dependents of the main verb. In fact, this extra subject can be the subject, the object or an oblique complement of the main verb, as the following examples testify:
1. Mary wants to buy a book. Mary is the subject of buy.
2. Mary asked John to buy a book. John, the object, is the subject of buy.
3. Maria ha chiesto a Giovanni di comprare un libro. [Mary asked John to buy a book]. In Italian, the buyer, Giovanni, is an indirect complement (obl) of the main verb chiesto.
We train a neural binary classifier to predict which of the dependents of the main verb, if any, should be chosen to play the role of the extra subject of the dependent verb. If more than one token is predicted as an external subject of the subordinate clause, currently all of them are added. The classifier is applied to tokens that have a sibling in an xcomp relation, that are either nouns or pronouns, and whose deprel is one of the following: nsubj, csubj, obj, iobj, obl, nsubj:pass, csubj:pass.
Such tokens are represented by the following features: the form, upos and deprel of the token; the form, upos and deprel of the token's head; the form of the xcomp sibling; and the form of the case or mark dependent which introduces the subordinate phrase. A training example for the first classifier has the features of a token as input and, as output, a binary value indicating whether the token is indeed the nsubj of the subordinate clause.
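As an illustration, the candidate selection and feature extraction just described might look as follows (token fields and function names are ours, based on simplified CoNLL-U rows; including proper nouns among the candidates is our assumption):

```python
def xcomp_candidates(sentence):
    """Collect candidate external subjects for xcomp clauses together
    with their classifier features. `sentence` is a list of dicts with
    keys id, form, upos, deprel, head (simplified CoNLL-U rows)."""
    CAND_DEPRELS = {"nsubj", "csubj", "obj", "iobj", "obl",
                    "nsubj:pass", "csubj:pass"}
    by_head = {}
    for tok in sentence:
        by_head.setdefault(tok["head"], []).append(tok)
    examples = []
    for tok in sentence:
        siblings = by_head.get(tok["head"], [])
        xcomp = next((s for s in siblings if s["deprel"] == "xcomp"), None)
        if (xcomp is None or tok["deprel"] not in CAND_DEPRELS
                # nouns and pronouns only (PROPN is our assumption)
                or tok["upos"] not in ("NOUN", "PROPN", "PRON")):
            continue
        head = next(t for t in sentence if t["id"] == tok["head"])
        # case/mark dependent introducing the subordinate phrase
        mark = next((t for t in sentence if t["head"] == xcomp["id"]
                     and t["deprel"] in ("mark", "case")), None)
        examples.append({
            "form": tok["form"], "upos": tok["upos"],
            "deprel": tok["deprel"],
            "head_form": head["form"], "head_upos": head["upos"],
            "head_deprel": head["deprel"],
            "xcomp_form": xcomp["form"],
            "mark_form": mark["form"] if mark else None,
        })
    return examples
```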

Propagation over Conjuncts
The classifiers for propagation over conjuncts act in a similar way. We train two distinct classifiers for recognizing candidates for head propagation and for dependent propagation over conjuncts. Candidates for head propagation are conjoined subjects and objects, which should each be attached to their head, as in "Paul and Mary are running" or "Paul bought apples and oranges".
Candidates for dependent propagation are subjects, objects and other complements of conjoined verbs, as it is the case of she in "She was reading and watching a movie".
The model is trained to predict whether a candidate for propagation should be safely propagated, by making the implicit relations explicit.
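The dependent-propagation case can be sketched as follows (token fields are ours; the `should_propagate` callback stands in for the learned classifier's yes/no decision):

```python
def propagate_over_conjuncts(sentence, should_propagate=lambda dep, conj: True):
    """Shared-dependent propagation over conjuncts: if `watching` is a
    conj of `reading` and `she` is the nsubj of `reading`, add the
    extra arc nsubj(watching, she). Returns the added arcs as
    (dependent_id, head_id, deprel) triples."""
    extra = []
    for tok in sentence:
        if tok["deprel"] != "conj":
            continue
        first = tok["head"]          # the first conjunct
        for dep in sentence:
            # candidates: subjects, objects and other complements
            if dep["head"] == first and dep["deprel"] in (
                    "nsubj", "obj", "iobj", "obl"):
                if should_propagate(dep, tok):
                    extra.append((dep["id"], tok["id"], dep["deprel"]))
    return extra
```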

Model Architecture
The three classifiers share the same neural network architecture. The first layer collects the embeddings for each form, upos or deprel in the input vector. The embeddings for the forms are obtained from FastText (Bojanowski et al., 2017). The embeddings for upos and deprel are learned as vectors of size 20 each.
The second layer of the classifier concatenates the embeddings from the first layer. The third layer is a flatten layer, which is followed by a fully connected layer with a hidden dimension of 100. This is followed by a dropout with a probability of 50% (chosen by tuning experiments) and finally there is a fully connected layer with a sigmoid activation.
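A numpy sketch of one classifier's forward pass, mirroring the Keras stack described above (the weight shapes are illustrative, and the ReLU activation on the hidden layer is our assumption, as the text does not state it):

```python
import numpy as np

def forward(feature_embeddings, W1, b1, W2, b2, training=False, rng=None):
    """One classifier's forward pass: concatenate the per-feature
    embeddings, apply a 100-unit dense layer, dropout (p=0.5, applied
    only at training time), and a final sigmoid unit."""
    x = np.concatenate(feature_embeddings)   # concatenate + flatten
    h = np.maximum(0.0, W1 @ x + b1)         # dense layer, hidden size 100
    if training:                             # inverted dropout, p = 0.5
        mask = (rng.random(h.shape) >= 0.5) / 0.5
        h = h * mask
    z = W2 @ h + b2                          # final fully connected layer
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid activation
```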
The classifiers are trained jointly with a binary cross entropy loss function and an Adam optimizer (Kingma and Ba, 2015) on the training set of each language. The training is run for up to four epochs, even though in most cases the loss stops decreasing after the second epoch. Validation accuracies during training range around 97-98%. The code is written in Keras on a Tensorflow backend.

Relative Clauses
The treatment of enhancements for relative clauses is quite straightforward. It consists in attaching the relative pronoun to its antecedent with the special ref relation and attaching the referred antecedent as an argument to the main predicate of the relative clause. This enhancement may create circularities in the enhanced graph.
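A sketch of this transformation (token fields are ours; we use the PronType=Rel feature to single out the relative pronoun, which is one plausible way to identify it):

```python
def enhance_relative_clause(sentence, extra_arcs):
    """For a clause attached with acl:relcl, re-link the relative
    pronoun to its antecedent via `ref`, and attach the antecedent as
    an argument of the clause's predicate with the pronoun's original
    relation. Arcs are (dependent_id, head_id, deprel) triples."""
    for tok in sentence:
        if tok["deprel"] != "acl:relcl":
            continue
        antecedent = tok["head"]     # the noun modified by the clause
        pred = tok["id"]             # the predicate of the relative clause
        for dep in sentence:
            if (dep["head"] == pred and dep["upos"] == "PRON"
                    and dep.get("prontype") == "Rel"
                    and dep["deprel"] in ("nsubj", "obj", "obl")):
                # the antecedent takes over the pronoun's role ...
                extra_arcs.append((antecedent, pred, dep["deprel"]))
                # ... and the pronoun points back to it via ref
                extra_arcs.append((dep["id"], antecedent, "ref"))
    return extra_arcs
```

Note that the arc from the antecedent into the relative clause is what may create circularities in the enhanced graph.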

Label Augmentation with Case/Mark Information
The most difficult sub-task turned out to be guessing the right case/mark information for augmenting the relation name of non-core dependents, due to the different interpretations and varying degrees of compliance with the guidelines in the various treebanks. Given the high frequency of occurrence of this type of enhancement, doing this task right has a high impact on the overall performance. As it turned out, the differences concern all the following aspects, and their combinations:
1. the type of deprels considered for the augmentation (e.g. conj is not specialized in Arabic, Bulgarian, Estonian, Finnish, French, Latvian, Lithuanian, Polish, etc.)
2. the case/mark information used (either the lemma or the form of the case/mark dependent)
3. the strategy adopted in the presence of multiple mark/case dependents (whether their concatenation or only the last one, as in English)
4. the strategy adopted when cases/marks are fixed multi-word expressions (whether forms or lemmas are combined)
5. the use or not of morphological case information, and to what extent
6. the presence of non-canonical keywords in some languages (for example labels such as agent and enh introduced in the French treebank to encode diathesis normalization, as described by Candito et al. (2017)).
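The basic augmentation rule, before any patching, can be sketched as follows (the switches model some of the per-treebank variations just listed; field names and the set of augmented deprels are ours):

```python
def augment_label(tok, sentence, use_lemma=True, concat_multiple=False):
    """Extend a non-core deprel such as obl or nmod with the lemma (or
    form) of its case/mark dependent, e.g. obl -> obl:about.
    `use_lemma` and `concat_multiple` stand in for the per-treebank
    variations in which information is used and how multiple markers
    are handled."""
    if tok["deprel"] not in ("obl", "nmod", "advcl", "acl"):
        return tok["deprel"]
    key = "lemma" if use_lemma else "form"
    markers = [d[key].lower() for d in sentence
               if d["head"] == tok["id"] and d["deprel"] in ("case", "mark")]
    if not markers:
        return tok["deprel"]
    # either concatenate all markers or keep only the last one
    chosen = "_".join(markers) if concat_multiple else markers[-1]
    return tok["deprel"] + ":" + chosen
```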
In this sense, including or excluding type specialization depending on the language is too coarse a strategy, since it does not account for all these variations; moreover, the differences are treebank-wise (as opposed to language-wise), in the sense that different subparts of the test set for a specific language may come from different treebanks following different approaches. In order to address these issues, we adopted a simple data-driven approach to adjust the result of a rule-based algorithm implementing the guidelines. We computed a mapping from the labels predicted by our enhancer to the gold labels found in the training data, and filtered out correspondences whose frequency was below a fixed threshold, in order to be tolerant to sporadic errors. As a final "patch", we applied the resulting transformation to produce the final augmented label.
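This patching strategy can be sketched as follows (function names and the threshold value are ours):

```python
from collections import Counter

def learn_patch(pred_gold_pairs, min_count=5):
    """Learn the 'patch' mapping from labels predicted by the
    rule-based enhancer to gold labels in the training data, keeping
    only correspondences seen at least `min_count` times (the threshold
    filters out sporadic errors)."""
    counts = Counter(pred_gold_pairs)        # (predicted, gold) -> freq
    best = {}
    # most_common() is sorted by frequency, so the first gold label
    # retained for each predicted label is the most frequent one
    for (pred, gold), n in counts.most_common():
        if n >= min_count and pred not in best:
            best[pred] = gold
    # identity mappings need no patching
    return {p: g for p, g in best.items() if p != g}

def apply_patch(label, patch):
    """Apply the learned mapping, leaving unknown labels untouched."""
    return patch.get(label, label)
```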
This strategy is far from perfect and clean, but it does take care of systematic differences among languages, such as the use of case features (gen, nom, dat, ins, etc.) in some of the languages with morphological cases. However, it provides no solution to issues related to non-conventional label completions, nor does it solve the problem of selecting the correct mark or case when multiple ones are present (e.g. about whether, along with in English), or of addressing the non-canonical use, with respect to the guidelines, of lemma vs. form in augmentations.

Tuning Parameters
The machine learning modules and the "patch" strategy were not equally effective for all languages. On the basis of performance on the development set, we selected for each language the best choice of parameters for the enhancement script; these were consistently applied in producing the enhanced version of the parser results in all submissions. Table 4 summarizes the choice of parameters for the different languages: the values for the parameter "-e" represent the types of enhancement to be excluded because not implemented for the language (consistently with the parameters of the evaluation script), "ml" means that we used the predictions from the machine learning classifiers, and "patch" means that we used the mapping strategy for fixing label augmentation. The lack of parameters means that only the basic enhancement script was used and all enhancement types were performed.

Results
The official results are those labeled UNIPI-003 in our submission, obtained through the combination of the parsers UDify, UUParser and UDPipe. Table 5 shows the official results obtained in tokenization and tagging on the test sets. Table 6 shows our team's official results on parsing and enhancement. These are encouraging results that show that a transition-based parser can be competitive with graph-based ones.
We then produced a new run comb5 (UNIPI-comb5), as an ensemble of five parsers: UUParser, UDify, UDPipe, Stanza and Zysite. We report these unofficial results in Table 7.
The improvements on parsing by the ensemble of five parsers with respect to the single parser UUParser are summarized in Table 8.
The most significant improvements from the ensemble combinations are +13.07 UAS on Estonian, +5.31 on Tamil, +5.11 on Dutch, +3.59 on Lithuanian, +4.37 on Finnish.
Estonian, Finnish, Latvian and Lithuanian turned out to be the most difficult for our dependency parsers, with a difference between 4.2 and 6.5 UAS points with respect to the submission by Kanerva, and even 10.7 points lower on Arabic.
If we exclude the Baltic languages, the average UAS of the ensemble of parsers is 89.82. As for the enhancement task, its difficulty, besides what we discussed in Section 3.5, seems confirmed by a significant drop from our EULAS score (restricted to UD relations) to the ELAS score, which also takes label enhancements into account. The average drop is 6.26 points, and for some languages more than 10 points. The effectiveness of our "patch" strategy had been carefully assessed on the development data, but did not carry over to the test set: our algorithm performed poorly at predicting labels extended with case information. Perhaps a machine learning approach would have provided better results in this case.

Conclusions
We experimented with both linear transition-based parsers and two implementations of graph-based biaffine parsers. All parsers have difficulties with the Baltic languages, Finnish and Arabic, which we were able to mitigate somewhat by combining them into an ensemble, except for Arabic, which remains 8.2 UAS points lower than the top submission. Our enhanced version of UUParser, using BERT embeddings, performs competitively with the biaffine Zysite parser, except on Estonian, Tamil and Dutch, while it outperforms it by +14 LAS on Arabic. Since all the parsers use the same base model, multilingual uncased BERT, it might be worthwhile to investigate how such models affect the performance on the Baltic languages.
The implementation of the biaffine parser by Zysite was a surprising discovery, since it is capable of outperforming all other parsers in speed, possibly due to its use of a more efficient biaffine operation via torch.einsum().
For adding the enhanced relations to the output of the parser we opted for a hybrid approach, where for some languages, which appear to be more conforming to the guidelines, we applied an algorithmic solution, while for the rest we exploited machine learning classifiers.
In principle the algorithmic approach should be sufficient once languages adhere more strictly to the guidelines. In the meanwhile, we wonder whether it is worthwhile to develop language-specific techniques in order to obtain better results, unless there are ways to devise a language-agnostic solution.

Table 8: Improvements by parser combination on unofficial run.