The ADAPT Enhanced Dependency Parser at the IWPT 2020 Shared Task

We describe the ADAPT system for the 2020 IWPT Shared Task on parsing enhanced Universal Dependencies in 17 languages. We implement a pipeline approach using UDPipe and UDPipe-future to provide initial levels of annotation. The enhanced dependency graph is either produced by a graph-based semantic dependency parser or is built from the basic tree using a small set of heuristics. Our results show that, for the majority of languages, a semantic dependency parser can be successfully applied to the task of parsing enhanced dependencies. Unfortunately, we did not ensure a connected graph as part of our pipeline approach and our competition submission relied on a last-minute fix to pass the validation script which harmed our official evaluation scores significantly. Our submission ranked eighth in the official evaluation with a macro-averaged coarse ELAS F1 of 67.23 and a treebank average of 67.49. We later implemented our own graph-connecting fix which resulted in a score of 79.53 (language average) or 79.76 (treebank average), which would have placed fourth in the competition evaluation.


Introduction
The 2020 IWPT Shared Task on enhanced dependency parsing (Bouma et al., 2020) requires participants to predict the enhanced dependencies (DEPS column in the CoNLL-U format) in addition to sentence boundaries, tokenisation, lemmata, POS tags, morphological features and the basic dependency tree. We take a pipeline approach using (1) UDPipe for sentence splitting and tokenisation, (2) ensembles of UDPipe-future basic parsers that also predict lemmata, POS tags and morphological features, with added support for multi-treebank models (Stymne et al., 2018), and (3) two types of enhancers: (a) copying the basic tree and applying a small set of heuristics (baseline system), and (b) a graph-based semantic dependency parser.
To enable reproduction of our results, we make available our helper scripts and modifications of the semantic parser. 1 Our approach to the task does not guarantee a connected graph, something that we did not account for. Thus, on submission day, we did not have an appropriate solution ready to fix our outputs but were able to provide a valid submission thanks to functionality that was added to the quick-fix tool provided by the organisers 2 to alter the enhanced graph. That solution was designed primarily to make the files pass validation but, in doing so, it harms the F1-score. In a post-competition run, we addressed the connected graph issue with an alternative solution which increased our macro-averaged ELAS F1-score from 67.23 to 79.53 and the treebank average from 67.49 to 79.76.

Segmentation
We use UDPipe 3 (Straka and Straková, 2017) with off-the-shelf UD v2.5 models 4 for the languages of the shared task to split the raw input text into sentences and tokens. In cases where more than one UDPipe model is available for a language, we try all models during development 5 and select for each test language the best overall pipeline according to ELAS on the treebank with the biggest development set for the language. 6

Basic Parsing
We choose UDPipe-future (Straka, 2018) for basic parsing and joint prediction of lemmata, POS tags and morphological features so as to not require a separate tagger. We extend UDPipe-future to train multi-treebank models as introduced by Stymne et al. (2018) with UUParser. 7,8 Inspired by , we use two types of external word embeddings with UDPipe-future: ELMo contextualised word embeddings and FastText character-n-gram-based word embeddings (Bojanowski et al., 2017). 9 For 15 of the 17 test languages, ELMoForManyLangs 10 (Che et al., 2018) provides ELMo models. We train FastText on the raw text provided by the CoNLL'17 shared task for the same 15 languages after shuffling sentences. For the Russian FastText model, we kept getting vectors with large component values even after trying a different machine and a different permutation of sentences, which prevented effective training of the parser. We then used a model trained on two thirds of the Russian data, for which component values and parser LAS were in the expected range. Furthermore, we train UDPipe-future models using FastText and internal embeddings only. For Lithuanian and Tamil, we train UDPipe-future without external word embeddings. The parser still uses an internal word embedding covering all words of the training treebank(s) and a word representation obtained with a bidirectional GRU layer over the input characters.

5 Due to a configuration error, we did not try segmentation with UDPipe models trained on fi_ftb, lt_hse and sv_lines in the official submission.
6 For Czech, we based our decision on results for cs_cac instead of cs_pdt as we did not have full results available for cs_pdt.
7 Multi-treebank models supply each token with the source treebank ID as additional input with a separate embedding table. Like Stymne et al. (2018), we use a vector size of 12. At test time, a proxy treebank must be chosen when the input sentence does not come from one of the training treebanks or the source is unknown.
8 https://github.com/jowagner/UDPipe-Future/tree/multitreebank
9 The FastText word embedding is restricted to a fixed vocabulary of one million tokens, not taking advantage of FastText's ability to produce new vectors for OOVs. UDPipe-future does not fine-tune these word embeddings. Instead, the parser trains an additional embedding exclusively for training words and a character-based representation. The latter two are added and the result is concatenated with the two externally provided representations. As far as we understand the code, an all-zero vector is used for OOVs, i.e. words not in the selected one-million-word FastText vocabulary.
10 https://github.com/HIT-SCIR/ELMoForManyLangs
For each target language, we train (a) mono-treebank models for each training treebank available with surface strings in UD v2.5, preferring the shared-task version when available, and (b) a multi-treebank model for each language using all treebanks for that language for which we also trained mono-treebank models. We train up to seven models with different initialisation for each setting to combine them in ensembles. 11,12 We consider ensembles not just of a single type of model with different initialisation but also combinations of models trained on different treebanks (mono-treebank models) or treebank combinations (multi-treebank models) and in the plain, FastText and ELMo variants. 13 As the number of possible combinations increases exponentially with the number of models, we prune the candidates, giving preference to models using all or only one treebank and to models using ELMo. We then test each ensemble on the development data (raw input segmented with UDPipe) and pick the best ensembles based on ELAS after applying our heuristic enhancer (Section 2.3) to the basic trees.
To pick the proxy treebank (see description in Footnote 7) for multi-treebank parsing, we use the treebank name in the filename of the raw text during development. However, for final testing, the treebank identifier is unavailable (and if it had been available there would have been cases where this treebank is not one of the training treebanks). Given time limits, we decided to simply assign each test set, i.e. each test language, the training treebank with the largest amount of training data as the proxy treebank. 14

11 We trained 68 types of models. We trained seven seeds for 34 of these, five seeds for 30 and three seeds for four. Ensemble sizes three, five and seven are considered, including a combination of (n + 1)/2 models of one type and (n − 1)/2 models of another type with n ∈ {3, 5, 7}.
12 We use our implementation https://github.com/jowagner/ud-combination of the linear combiner of Attardi and Dell'Orletta (2009).
13 While predicting on development data to facilitate model selection, we temporarily introduced a bug in our system causing it to use the first initialisation seed for all ensemble members only, effectively falling back to a single model when only one model type is used. We fixed this bug before we switched to making test set predictions and tried to account for it in the model selection but, under time pressure, made some hard-to-explain ad hoc choices, e.g. we used an ensemble of three models for Czech, two mono-treebank models trained on cs_cac and one multi-treebank model, even though we also had test set predictions with an ensemble of seven models with the same mixture of model types available. For details, see the reproducibility notes in our code repository.
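The ensemble compositions described in Footnote 11 can be enumerated with a short sketch. This is only an illustration under stated assumptions: the model type names are hypothetical, the function `ensemble_compositions` is not part of our released scripts, and neither the candidate pruning nor the number of seeds actually available per model type is modelled.

```python
from itertools import permutations


def ensemble_compositions(model_types, sizes=(3, 5, 7)):
    """Yield (size, {model_type: count}) candidates.

    For each ensemble size n we yield ensembles of a single model type and
    mixed ensembles with (n + 1) / 2 members of one type and (n - 1) / 2
    members of another type.
    """
    for n in sizes:
        # Ensembles of a single model type with n different seeds.
        for model_type in model_types:
            yield n, {model_type: n}
        # Mixed ensembles: the ordering matters because the first type
        # contributes the majority of the members.
        for major, minor in permutations(model_types, 2):
            yield n, {major: (n + 1) // 2, minor: (n - 1) // 2}


# Hypothetical model type names, used for illustration only.
candidates = list(ensemble_compositions(["mono_elmo", "multi_elmo"]))
for size, composition in candidates[:4]:
    print(size, composition)
```

With two model types this gives four candidates per ensemble size, which shows how quickly the candidate set grows once all 68 model types are considered.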

Heuristic Enhancement
We build a baseline system which concentrates on the two enhanced UD phenomena which are very straightforward to implement using simple heuristic rules, namely, co-reference in relative clauses and modifier relations containing case markers. These rules are applied to the output of the basic parser. We have two versions of the modifier relation rule: one in which the value of the case morphological feature is included in the relation label and one without. We also have a rule which adds the lemma of a conjunction to the enhanced label of its head. For each development set, we find the optimal subset of the set of heuristic rules in terms of ELAS among all possible subsets except those combining the two case rules.
This baseline system is clearly suboptimal since it makes no attempt to handle the more interesting enhanced UD phenomena which involve the addition or deletion of arcs, i.e. conjunct propagation, ellipsis and control/raising constructions. Nonetheless, it is useful as a baseline to check that the main system is performing reasonably and is available as a fallback.
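The search over rule subsets can be sketched as follows. The rule names and the toy scoring function are hypothetical placeholders; in our setting the score of a subset would be the ELAS obtained on the development set after applying exactly those rules.

```python
from itertools import combinations

# Hypothetical identifiers for the four heuristic rules described above.
RULES = (
    "relcl_coref",           # co-reference in relative clauses
    "case_with_feature",     # modifier rule including the Case feature value
    "case_without_feature",  # modifier rule without the Case feature value
    "conj_lemma",            # conjunction lemma added to the head's label
)
# The two versions of the modifier rule are never combined.
EXCLUSIVE = {"case_with_feature", "case_without_feature"}


def candidate_rule_subsets(rules=RULES, exclusive=EXCLUSIVE):
    """Return all subsets of rules that do not contain both case rules."""
    subsets = []
    for size in range(len(rules) + 1):
        for subset in combinations(rules, size):
            if exclusive <= set(subset):
                continue  # skip subsets combining the two case rules
            subsets.append(subset)
    return subsets


def best_subset(subsets, score):
    """Pick the subset with the highest score (e.g. development ELAS)."""
    return max(subsets, key=score)


subsets = candidate_rule_subsets()
print(len(subsets))  # 16 subsets in total, 4 of which are excluded
```

With four rules the search space is small enough to evaluate every admissible subset exhaustively on each development set.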

Modelling Enhanced Dependencies
As our main system to predict the enhanced graph, we follow  and treat enhanced dependency parsing as a task similar to semantic dependency parsing. In semantic dependency parsing, words may have multiple heads. Thus,  apply their deep biaffine graph-based dependency parser (Dozat and Manning, 2017) to the task of semantic dependency parsing but replace the softmax cross-entropy loss with sigmoid cross-entropy loss for edge prediction. The above modification changes the modelling objective such that words are no longer competing with one another to be classified as the appropriate head; rather, the parser chooses whether an edge exists between each possible pair of words independently. Whether an edge exists between two words is based on a predefined threshold, where a score above this threshold results in an edge being predicted and, subsequently, the edge's label. In our experiments we use an edge prediction threshold of 0.5. If the parser did not predict an edge for a word, we take the edge with the highest probability. As we want to select the label with the highest probability for each chosen edge, standard softmax cross-entropy loss is used for label prediction as in .
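The decoding step described above can be illustrated with a minimal sketch in plain Python. It is not the code of our parser: the function names are hypothetical, and we assume that `edge_logits[d][h]` holds the unnormalised score for head `h` governing dependent `d`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_edges(edge_logits, threshold=0.5):
    """Return {dependent: [heads]} from a matrix of edge scores.

    Every candidate head is judged independently: an edge is predicted if
    its sigmoid probability exceeds the threshold. If no head passes the
    threshold, the single most probable head is used instead.
    """
    heads = {}
    for dependent, row in enumerate(edge_logits):
        probs = [sigmoid(score) for score in row]
        chosen = [h for h, p in enumerate(probs) if p > threshold]
        if not chosen:
            chosen = [max(range(len(probs)), key=probs.__getitem__)]
        heads[dependent] = chosen
    return heads


def decode_label(label_logits):
    """Pick the most probable label for a chosen edge.

    Softmax is monotonic, so the argmax over the logits equals the argmax
    over the softmax probabilities.
    """
    return max(range(len(label_logits)), key=label_logits.__getitem__)


example = [
    [2.0, -1.0, 0.3],    # two heads exceed the threshold
    [-3.0, -2.0, -4.0],  # no head exceeds it: fall back to the best one
]
print(decode_edges(example))  # {0: [0, 2], 1: [1]}
```

Because each edge is decided locally, nothing in this procedure guarantees that the resulting graph is connected.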
In order for the semantic dependency parser to be able to model relationships where a word may have multiple heads, we create an adjacency matrix where the (i, j)-th entry indicates whether an edge with label type k exists between tokens i and j. We also append the dummy root token to the adjacency matrix so that an edge can be predicted from the main predicate of the sentence to the dummy root token. Figure 1a shows the enhanced UD graph for the phrase Tale of joy and sorrow. In the enhanced representation, each conjunct in the conjoined noun phrase is attached to the governor of the modifier phrase, e.g. there is an additional nmod relation marked in blue between the noun Tale and the second conjunct sorrow. Note that the lemmas of the case and cc dependents are appended to the enhanced dependency labels of their heads. The corresponding edge-existence probabilities of the semantic parser trained on en_ewt are shown in Figure 1b, where the parser correctly predicts an edge from sorrow to the first conjunct joy as well as to the head of the modifier phrase, Tale.
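As an illustration, the adjacency matrix for the example phrase can be built from DEPS-style strings as sketched below. The DEPS values are our reading of the graph described in the text rather than a copy of the treebank annotation, the row/column convention (row = dependent, column = head) is an assumption, and empty nodes with IDs such as 8.1 are not handled.

```python
def build_adjacency(deps_column):
    """Build a (N + 1) x (N + 1) matrix of edge labels.

    deps_column[k] is the DEPS value of token k + 1, e.g.
    "1:nmod:of|3:conj:and". Index 0 is reserved for the dummy root.
    matrix[dependent][head] holds the label, or None if there is no edge.
    """
    n = len(deps_column) + 1
    matrix = [[None] * n for _ in range(n)]
    for dependent, deps in enumerate(deps_column, start=1):
        for item in deps.split("|"):
            # Split only at the first colon: labels may contain colons.
            head, label = item.split(":", 1)
            matrix[dependent][int(head)] = label
    return matrix


tokens = ["Tale", "of", "joy", "and", "sorrow"]
deps = [
    "0:root",                # Tale
    "3:case",                # of
    "1:nmod:of",             # joy
    "5:cc",                  # and
    "1:nmod:of|3:conj:and",  # sorrow has two heads
]
matrix = build_adjacency(deps)
print(matrix[5][1], matrix[5][3])  # nmod:of conj:and
```

The row for sorrow contains two entries, which is exactly the multiple-head situation that a tree-based parser cannot represent.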

Feature Representations
In our experiments, each word $w_i$ in a sentence $S = (w_0, w_1, \ldots, w_N)$ is converted to its vector representation $x_i$. We trained different variants of our semantic parser where $x_i$ is the concatenation of different combinations of features, among them the lemma embedding (the embedding of the word's lemma). All model variants use the lexical information of the first BERT word-piece embedding and the character embedding, joined by vector concatenation. The remaining variation comes from the other types of features that are concatenated to this representation.
For the morphological features, there may be multiple morphological tags $m_1, \ldots, m_t$ for a particular word $w_i$. Thus, we split the full label into separate features (Hall et al., 2007) and embed each morphological property separately. We then sum the individual embedded representations together and divide by the number of active properties.
We follow the same process for the head-information embedding $e^{(h)}_i$. Rather than encoding the head as an integer value, we obtain a direction value and a distance value: for each head-dependent pair (i, j), we subtract the indices of i and j, giving the distance value. If the value is negative, the head is to the left; if it is positive, the head is to the right. We then take the absolute distance value and define ranges: short (1-4), medium (5-9), far (10-14) and long-range (>15). The qualitative direction (left or right) and distance labels are embedded in the same way as morphological features, i.e. they are embedded as separate components, summed together and then divided by the number of features (which in this case is always two). To encode the basic tree, we then concatenate the head representation and the dependency label embedding. It is worth mentioning that more sophisticated approaches for modelling head distance and direction exist for basic dependency parsing (Qi et al., 2018) but we leave using this approach for enhanced dependency parsing as future work.
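The head-information features can be sketched as follows. Several details are assumptions made for this illustration: the offset is computed as head index minus dependent index (so that a negative value means the head is to the left), a distance of exactly 15 is treated as long-range, and the two-dimensional embedding table is a toy stand-in for learned embeddings.

```python
def distance_bucket(distance):
    """Map an absolute head distance to a qualitative range label."""
    if distance <= 4:
        return "short"
    if distance <= 9:
        return "medium"
    if distance <= 14:
        return "far"
    return "long-range"  # assumption: 15 and above


def head_features(dependent, head):
    """Return the (direction, distance range) labels for one basic arc."""
    offset = head - dependent  # assumption: negative means head on the left
    if offset == 0:
        raise ValueError("a token cannot be its own head")
    direction = "left" if offset < 0 else "right"
    return direction, distance_bucket(abs(offset))


def average_embeddings(labels, table):
    """Sum the embeddings of the labels and divide by their number."""
    vectors = [table[label] for label in labels]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy embedding table, for illustration only.
toy_table = {"left": [1.0, 0.0], "short": [0.0, 1.0]}
labels = head_features(dependent=5, head=1)
print(labels)                                 # ('left', 'short')
print(average_embeddings(labels, toy_table))  # [0.5, 0.5]
```

The same averaging function applies to the morphological features, where the number of labels varies from word to word rather than always being two.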

Training Details
Our semantic parser predicts edges in a greedy fashion based on local decisions, i.e. we did not make use of any maximum spanning tree algorithm or enforce any global constraints. Since enhanced dependency graphs may contain cycles, we did not remove any cycles from the graph, but we observed that this sometimes leaves fragments of the graph which are not reachable from the notional ROOT. For graphs with unreachable nodes, we applied our post-processor to attach these (Section 2.5).
We found that this architecture can be easily applied to enhanced dependency parsing given its similar nature to semantic dependency parsing. One caveat is that in enhanced dependency parsing, the label set can be quite large as modifier lemma and case information can be appended to the dependency label, which results in very high memory requirements for certain languages such as Arabic. Additionally, modelling all enhanced labels in this fashion means that the parser is limited in its ability to predict labels for rare modifiers. An examination of the semantic parser output on the en_ewt development set shows that, although the parser often predicts the correct label, it can sometimes predict a wrong label containing a frequent modifier which is not in the sentence, e.g. advcl:if instead of advcl:as.
Our semantic parser is built upon the implementation in AllenNLP. Due to time constraints, we trained our semantic parsing models on the gold training data released by the organisers as opposed to creating jack-knifed silver data. Hyperparameters are similar to those in Dozat and Manning (2017) as we found the larger network size of  to be too restrictive for certain languages with high memory demands. Full hyperparameters of the semantic parser are given in Table 1. We trained for 75 epochs with early stopping if the development score did not improve after 10 epochs.
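The stopping criterion can be illustrated with a small stand-alone sketch. It is not AllenNLP's trainer; the function name and the exact patience semantics (stop once 10 consecutive epochs have passed without a new best development score) are assumptions made for this example.

```python
def stopping_epoch(dev_scores, max_epochs=75, patience=10):
    """Return (last_epoch_trained, best_epoch) for a sequence of dev scores.

    dev_scores[k] is the development score after epoch k + 1. Training
    stops when `patience` epochs have passed without improvement or when
    `max_epochs` is reached.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(dev_scores[:max_epochs], start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return min(len(dev_scores), max_epochs), best_epoch


# The score peaks in epoch 2 and never improves again.
scores = [0.50, 0.60] + [0.55] * 20
print(stopping_epoch(scores))  # (12, 2)
```

Without a development set there is no score to monitor, so a run of this kind would always continue for the full 75 epochs.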

Memory Considerations
We trained our semantic parsing models on two GPUs. The first was an NVIDIA RTX 2080 Ti with 12GB of VRAM, where we had to remove very long sentences (< 0.03% of sentences overall) from the treebanks cs_cac, cs_pdt, it_isdt, ru_syntagrus and sv_talbanken in order to fit a batch into memory. We were also given access to an NVIDIA V100 GPU with 32GB of VRAM, which enabled us to process all treebanks except for ar_padt without removing long sentences. For ar_padt, after removing the longest 75 sentences, the model still required 29GB of VRAM.

BERT Models
For the BERT models, in early development runs we compared multilingual BERT (mBERT) with a language-specific BERT model if there was one available in HuggingFace's (Wolf et al., 2019) models repository. 15 We used a language-specific BERT model for ar (Safaya et al., 2020), bg+cs, en (Devlin et al., 2019), fi (Virtanen et al., 2019), it 16, nl (de Vries et al., 2019), pl 17, ru and sv 18, and for the rest of the languages we used mBERT (Devlin et al., 2019). We found that the language-specific variant was always better than mBERT except for pl_lfg. For fr_sequoia, we tried using the CamemBERT model (Martin et al., 2020). As this model uses RoBERTa (Liu et al., 2019) as opposed to BERT, we installed AllenNLP from the master branch, which uses HuggingFace's AutoTokenizer module and thereby supports many BERT-like models. We noticed a trend of lower results when using the master branch for some languages, but training was also more stable for certain treebanks where we had previously encountered a nan in the loss. 19 Consequently, we include models from the stable release and the bleeding-edge master branch in our development pipeline.

Connecting the Graph
We had no solution ready to connect fragmented graphs produced by our semantic parser 20 on the system submission day and resorted to using the "connect-to-root" option of the quick-fix tool provided by the shared task organisers, who warned that it had not been thoroughly tested.
After the system submission deadline, we investigated the fragmentation issue. The task is to make all nodes reachable from the notional ROOT 21, where reachability is directional. Adding more edges than necessary harms precision and thus the F1-score. We found that the quick-fix tool with the "connect-to-root" option adds edges to every unreachable node. We also noticed a bug in the implementation where certain reachable nodes were being reported as unreachable.
We then implemented an improved tool to connect fragmented graphs, trying to minimise the number of edges added to the graph. We repeatedly check for each unreachable node how many unreachable nodes can be reached from it. Among the nodes that maximise this number, we pick the first node in surface order and make it a child of the notional ROOT, i.e. it becomes an additional root node. This is a rather naive approach which does not try to connect fragments in a sensible manner but, rather, mimics the behaviour of the "connect-to-root" option. Future work could try to show whether our above algorithm adds the minimal number of edges necessary to connect the graph or if a lower optimum exists.

16 https://github.com/dbmdz/berts
17 https://github.com/kldarek/polbert
18 https://github.com/Kungbib/swedish-bert-models
19 We incurred a nan loss for cs_cac, cs_pdt, it_isdt and ru_syntagrus using the AllenNLP stable branch 0.9.0 and used the best model from the available epochs.
20 Between 90.18% (Lithuanian) and 99.51% (Russian) of test sentences in our official submission are not affected, i.e. all nodes are reachable from a root node. This observation excludes Estonian, for which we submitted predictions using our heuristic system.
21 UD distinguishes between the notional ROOT (ID 0) and root nodes. The latter are any nodes that have '0' as a head.

Table 2: Development set ELAS F1 score for the best semantic parser evaluated without connecting fragmented graphs (sem-frag) and for the best combination of heuristic rules (heuristic).

Table 3: Test set results: subm = submitted, frag fix = using our own fragment connector and quick-fix.pl without connect-to-root, re-run = a re-run with bug fixes, no new models but new model selection.

... languages, the difference in performance is large. For et_ewt, which does not have a development set, we suspect that we overfitted our semantic parser on the et_ewt training data by allowing it to train for 75 epochs.
Table 3 shows test set ELAS obtained on the shared task submission site for (a) our submission fully relying on the organisers' quick-fix tool to fix issues in the output of our system, (b) the same predictions post-processed by our own fragment connector that aims to minimise the number of root edges added, and (c) a re-run of our pipeline using the same models for system components as before but with all bug fixes made during development applied to all predictions and with new decisions about which models to apply to the test sets. While the quick-fix tool enabled us to make a valid submission in time, its approach of adding edges from the root node to all unreachable tokens has a strong negative impact on precision, e.g. 62.26 ELAS precision on the Czech CAC development set vs. 87.37 without post-processing. Our own post-competition fix avoids this and would have brought us into the top half of the competition.
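The fragment-connecting procedure described in this section (repeatedly promoting the unreachable node that covers the most unreachable nodes to an additional root) can be sketched as follows. This is a simplified re-implementation for illustration, not our released tool: the function names are hypothetical, the graph is given as a set of (head, dependent) pairs over integer token IDs, and empty nodes are ignored.

```python
def reachable_from(start, children, allowed=None):
    """Return the set of nodes reachable from start via directed edges.

    If allowed is given, the traversal only enters nodes in that set.
    """
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen and (allowed is None or child in allowed):
                seen.add(child)
                stack.append(child)
    return seen


def connect_fragments(num_tokens, edges):
    """Attach fragments to the notional ROOT (ID 0).

    Returns the list of nodes that received a new edge from ROOT.
    """
    children = {}
    for head, dependent in edges:
        children.setdefault(head, set()).add(dependent)
    added = []
    while True:
        reached = reachable_from(0, children)
        unreachable = [n for n in range(1, num_tokens + 1) if n not in reached]
        if not unreachable:
            return added
        pool = set(unreachable)
        # max() returns the first maximal item, i.e. the first candidate in
        # surface order among those covering the most unreachable nodes.
        best = max(
            unreachable,
            key=lambda n: len(reachable_from(n, children, pool)),
        )
        children.setdefault(0, set()).add(best)
        added.append(best)


# Tokens 3, 4 and 5 form a fragment containing the cycle 3 <-> 4.
edges = {(0, 1), (1, 2), (3, 4), (4, 3), (4, 5)}
print(connect_fragments(5, edges))  # [3]
```

In this example a single new root edge makes three nodes reachable, whereas attaching every unreachable node to ROOT would add three edges and lower precision accordingly.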

Conclusion
In this system submission, we use a graph-based semantic parser to parse enhanced dependencies and compare to a baseline in which we create enhanced graphs from the basic tree using a very limited set of heuristics. Avenues for future work include:

Post-processing
Predict the head and label for edges connecting fragments (as opposed to a dummy "0:root" edge), where this information could come from new edges available from lowering the score threshold or from the basic tree.

Label Prediction
The semantic parser performs competitively despite treating enhanced dependency labels containing lemmas and case information as atomic units. However, a more sophisticated approach should still be tried.

Multi-treebank Parsing
When randomising the proxy treebank for multi-treebank models, use a different randomisation for each ensemble member. Predict the best proxy treebank for each test sentence or paragraph (Wagner et al., 2020).

Elided Tokens
Our semantic parser handles elided tokens by appending the elided token to the adjacency matrix and offsetting the head indices. While we used this approach during training on gold data, we did not predict elided tokens and we wish to explore methods for doing so.