75 Languages, 1 Model: Parsing Universal Dependencies Universally

We present UDify, a multilingual multi-task model capable of accurately predicting universal part-of-speech, morphological features, lemmas, and dependency trees simultaneously for all 124 Universal Dependencies treebanks across 75 languages. By leveraging a multilingual BERT self-attention model pretrained on 104 languages, we found that fine-tuning it on all datasets concatenated together with simple softmax classifiers for each UD task can meet or exceed state-of-the-art UPOS, UFeats, Lemmas, (and especially) UAS, and LAS scores, without requiring any recurrent or language-specific components. We evaluate UDify for multilingual learning, showing that low-resource languages benefit the most from cross-linguistic annotations. We also evaluate for zero-shot learning, with results suggesting that multilingual training provides strong UD predictions even for languages that neither UDify nor BERT have ever been trained on.


Introduction
In the absence of annotated data for a given language, it can be considerably difficult to create models that can parse the language's text accurately. Multilingual modeling presents an attractive way to circumvent this low-resource limitation. In much the same way that learning a new language can enhance the proficiency of a speaker's previous languages (Abu-Rabia and Sanitsky, 2010), a model which has access to multilingual information can begin to learn generalizations across languages that would not have been possible through monolingual data alone. Prior work has shown that training data of similar languages can boost evaluation scores of models predicting syntactic information like part-of-speech and dependency trees. Multilinguality not only can improve a model's evaluation performance, but can also reduce the cost of training multiple models for a collection of languages (Johnson et al., 2017; Smith et al., 2018). However, scaling to a higher number of languages can often be problematic. Without an ample supply of training data for the considered languages, it can be difficult to form appropriate generalizations, and especially difficult if those languages are distant from each other. But recent techniques in language model pretraining can profit from a drastically larger supply of unsupervised text, demonstrating the capability of transferring contextual sentence-level knowledge to boost the parsing accuracy of existing NLP models (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2018).
One such model, BERT (Devlin et al., 2018), introduces a self-attention (Transformer) network that results in state-of-the-art parsing performance when fine-tuning its contextual embeddings. And with the release of a multilingual version pretrained on the entirety of the top 104 resourced languages of Wikipedia, BERT is remarkably capable of capturing an enormous collection of cross-lingual syntactic information. Conveniently, these languages nearly completely overlap with languages supported by the Universal Dependencies treebanks, which we will use to demonstrate the ability to scale syntactic parsing up to 75 languages and beyond.
The Universal Dependencies (UD) framework provides syntactic annotations consistent across a large collection of languages (Nivre et al., 2018;Zeman et al., 2018). This makes it an excellent candidate for analyzing syntactic knowledge transfer across multiple languages. UD offers tokenized sentences with annotations ideal for multi-task learning, including lemmas (LEMMAS), treebank-specific part-of-speech tags (XPOS), universal part-of-speech tags (UPOS), morphological features (UFEATS), and dependency edges and labels (DEPS) for each sentence.
We propose UDify, a semi-supervised multitask self-attention model automatically producing UD annotations in any of the supported UD languages. To accomplish this, we perform the following:
1. We input all sentences into a pretrained multilingual BERT network to produce contextual embeddings, introduce task-specific layer-wise attention similar to ELMo (Peters et al., 2018), and decode each UD task simultaneously using softmax classifiers.
2. We apply a heavy amount of regularization to BERT, including input masking, increased dropout, weight freezing, discriminative fine-tuning, and layer dropout.
3. We train and fine-tune the model on the entirety of UD by concatenating all available training sets together.
We evaluate our model with respect to UDPipe Future, one of the winners of the CoNLL 2018 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies (Straka, 2018; Zeman et al., 2018). In addition, we analyze which languages benefit the most from multilingual training, and evaluate the model for zero-shot learning, i.e., on treebanks which do not have a training set. Finally, we provide evidence from our experiments and other related work to help explain why pretrained self-attention networks excel in multilingual dependency parsing.
Our work uses the AllenNLP library built for the PyTorch framework. Code for UDify and a release of the fine-tuned BERT weights are available at https://github.com/hyperparticle/udify.

Multilingual Multi-Task Learning
In this section, we detail the multilingual training setup and the UDify multi-task model architecture. See Figure 1 for an architecture diagram.

Multilingual Pretraining with BERT
We leverage the provided BERT base multilingual cased pretrained model, with a self-attention network of 12 layers, 12 attention heads per layer, and hidden dimensions of 768 (Devlin et al., 2018). The model was trained by predicting randomly masked input words on the entirety of the top 104 languages with the largest Wikipedias. BERT uses a wordpiece tokenizer (Wu et al., 2016), which segments all text into (unnormalized) sub-word units.

Cross-Linguistic Training Issues
Table 1 displays a list of vocabulary sizes, indicating that UD treebanks possess nearly 1.6M unique tokens combined. To sidestep the problem of a ballooning vocabulary, we use BERT's wordpiece tokenizer directly for all inputs. UD expects predictions to be along word boundaries, so we take the simple approach of applying the tokenizer to each word using UD's provided segmentation. For prediction, we use the outputs of BERT corresponding to the first wordpiece per word, ignoring the rest.
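The first-wordpiece selection described above can be sketched as follows. This is a minimal illustration, not the UDify implementation: `wordpiece_offsets` and `toy_tokenize` are hypothetical names, and the toy tokenizer (fixed 3-character chunks) merely stands in for BERT's real wordpiece tokenizer.

```python
from typing import Callable, List, Tuple


def wordpiece_offsets(
    words: List[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Tokenize each word separately and record the index of its first wordpiece."""
    pieces: List[str] = []
    first_piece_index: List[int] = []
    for word in words:
        # Assumption: a word that yields no pieces is replaced by an unknown token.
        word_pieces = tokenize(word) or ["[UNK]"]
        first_piece_index.append(len(pieces))
        pieces.extend(word_pieces)
    return pieces, first_piece_index


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in for a wordpiece tokenizer: 3-character chunks."""
    chunks = [word[i:i + 3] for i in range(0, len(word), 3)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]] if chunks else []


pieces, offsets = wordpiece_offsets(["Parsing", "is", "fun"], toy_tokenize)
# Predictions for word k would be read from the encoder output at offsets[k].
```

Indexing the encoder output with `offsets` yields exactly one vector per UD word, which keeps predictions aligned with UD's word boundaries.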
In addition, the XPOS annotations are not universal across languages, or even across treebanks. Because each treebank can possess a different annotation scheme for XPOS which can slow down inference, we omit training and evaluation of XPOS from our experiments.

Multi-Task Learning with UD
For predicting UD annotations, we employ a multi-task network based on UDPipe Future (Straka, 2018), but with all embedding, encoder, and projection layers replaced with BERT. The remaining components include the prediction layers for each task detailed below, and layer attention (see Section 3.1). Then we compute softmax cross entropy loss on the output logits to train the network. For more details on reasons behind architecture choices, see Appendix A.
UPOS As is standard for neural sequence tagging, we apply a softmax layer along each word input, computing a probability distribution over the tag vocabulary to predict the annotation string.
UFeats Identical to UPOS prediction, we treat each UFeats string as a separate token in the vocabulary. We found this to produce higher evaluation accuracy than predicting each morphological feature separately. Only a small subset of the full Cartesian product of morphological features is valid, so treating each attested combination as a single label also eliminates invalid combinations.
Lemmas Similar to Chrupała (2006) and Müller et al. (2015), we reduce the problem of lemmatization to a sequence tagging problem by predicting a class representing an edit script, i.e., the sequence of character operations to transform the word form to the lemma. To precompute the tags, we first find the longest common substring between the form and the lemma, and then compute the shortest edit script converting the prefix and suffix of the form into the prefix and suffix of the lemma using the Wagner-Fischer algorithm (Wagner and Fischer, 1974). Upon predicting a lemma edit script, we apply the edit operations to the word form to produce the final lemma. See also Straka (2018) for more details. We chose this approach over a sequence-to-sequence architecture like Bergmanis and Goldwater (2018) or Kondratyuk et al. (2018), as a sequence-to-sequence decoder would significantly reduce training efficiency.
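To make the idea of lemmatization as tagging concrete, the sketch below derives a rewrite rule around the longest common substring and applies it to a form. It is a deliberate simplification: instead of a Wagner-Fischer edit script for the prefix and suffix, it stores the prefix and suffix replacements literally. The names `lemma_edit_script` and `apply_edit_script` and the tuple layout are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Tuple

# (chars to drop from the front, prefix to add, chars to drop from the end, suffix to add)
EditScript = Tuple[int, str, int, str]


def lemma_edit_script(form: str, lemma: str) -> EditScript:
    """Derive a prefix/suffix rewrite rule around the longest common substring."""
    match = SequenceMatcher(None, form, lemma, autojunk=False).find_longest_match(
        0, len(form), 0, len(lemma)
    )
    if match.size == 0:
        # Nothing shared: replace the whole form with the lemma.
        return (len(form), lemma, 0, "")
    drop_front = match.a
    add_front = lemma[:match.b]
    drop_back = len(form) - (match.a + match.size)
    add_back = lemma[match.b + match.size:]
    return (drop_front, add_front, drop_back, add_back)


def apply_edit_script(form: str, script: EditScript) -> str:
    """Apply a rewrite rule to a word form; fall back to the form if it does not fit."""
    drop_front, add_front, drop_back, add_back = script
    if drop_front + drop_back > len(form):
        return form
    core = form[drop_front:len(form) - drop_back]
    return add_front + core + add_back


script = lemma_edit_script("running", "run")
lemma = apply_edit_script("sitting", script)  # the same class generalizes to other forms
```

Because the rule is independent of the specific characters kept in the middle, one predicted class can lemmatize many different word forms, which is what makes a softmax classifier over edit scripts feasible.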
Deps We use the graph-based biaffine attention parser developed by Dozat and Manning (2016);Dozat et al. (2017), replacing the bidirectional LSTM layers with BERT. The final embeddings are projected through arc-head and arc-dep feedforward layers, which are combined using biaffine attention to produce a probability distribution of arc heads for each word. We then decode each tree with the Chu-Liu/Edmonds algorithm (Chu, 1965;Edmonds, 1967).
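The arc-scoring step can be illustrated with a small pure-Python sketch. It assumes the common biaffine form score(i, j) = dep_i^T U head_j + b^T head_j followed by a softmax over candidate heads; the function name, argument layout, and toy values are hypothetical, and the tree decoding with Chu-Liu/Edmonds is not shown.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biaffine_head_probs(dep: Matrix, head: Matrix, U: Matrix, bias: Vector) -> Matrix:
    """Return P[i][j], the probability that word j is the head of word i.

    score(i, j) = dep[i]^T U head[j] + bias^T head[j]
    """
    probs = []
    for d in dep:
        # Project the dependent vector through U once: d^T U.
        dU = [sum(d[k] * U[k][m] for k in range(len(d))) for m in range(len(U[0]))]
        scores = [sum((dU[m] + bias[m]) * h[m] for m in range(len(h))) for h in head]
        probs.append(softmax(scores))
    return probs


# Toy projections for two words (values are made up for illustration).
dep_vectors = [[1.0, 0.0], [0.0, 1.0]]
head_vectors = [[1.0, 0.0], [0.0, 1.0]]
U = [[2.0, 0.0], [0.0, 2.0]]
P = biaffine_head_probs(dep_vectors, head_vectors, U, [0.0, 0.0])
```

Each row of `P` is a distribution over possible heads for one word; a maximum spanning tree decoder would then pick a well-formed tree from these scores.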

Fine-Tuning BERT on UD Annotations
We employ several strategies for fine-tuning BERT for UD prediction, finding that regularization is absolutely crucial for producing a high-scoring network.

Layer Attention
Empirical results suggest that when fine-tuning BERT, combining the output of the last several layers is more beneficial for the downstream tasks than just using the last layer (Devlin et al., 2018). Instead of restricting the model to any subset of layers, we devise a simple layer-wise dot-product attention where the network computes a weighted sum of all intermediate outputs of the 12 BERT layers using the same weights for each token. This is similar to how ELMo mixes the output of multiple recurrent layers (Peters et al., 2018).
More formally, let w_i be a trainable scalar for BERT embeddings BERT_{ij} at layer i with a token at position j, and let c be a trainable scalar. We compute contextual embeddings e^{(task)} such that
$$e_j^{(task)} = c \sum_i \mathrm{BERT}_{ij} \cdot \mathrm{softmax}(w)_i$$
To prevent the UD classifiers from overfitting to the information in any single layer, we devise layer dropout, where at each training step, we set each parameter w_i to −∞ with probability 0.1. This effectively redistributes probability mass to all other layers, forcing the network to incorporate the information content of all BERT layers. We compute layer attention per task, using one set of c, w parameters for each of UPOS, UFeats, Lemmas, and Deps.
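A small sketch of layer attention with layer dropout for a single token follows. It assumes the layer weights are normalized with a softmax, which is what setting a weight to −∞ in order to redistribute probability mass suggests. The function name and the fallback used when every layer happens to be dropped are assumptions, not details from the text.

```python
import math
import random
from typing import List, Optional


def layer_attention(
    layers: List[List[float]],  # layers[i] is the embedding of one token at layer i
    w: List[float],             # one trainable scalar per layer
    c: float,                   # trainable global scale
    layer_dropout: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return c * sum_i softmax(w)_i * layers[i], with optional layer dropout."""
    rng = rng or random.Random()
    # Layer dropout: a dropped layer gets weight -inf, so its softmax share is 0.
    masked = [-math.inf if rng.random() < layer_dropout else wi for wi in w]
    if all(m == -math.inf for m in masked):
        masked = list(w)  # assumption: keep all layers if every one was dropped
    peak = max(masked)
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(layers[0])
    return [c * sum(attn[i] * layers[i][d] for i in range(len(layers))) for d in range(dim)]


# Two toy "layers" with equal weights: the output is the scaled average.
mixed = layer_attention([[1.0, 0.0], [0.0, 1.0]], w=[0.0, 0.0], c=2.0)
```

During training one would call this with `layer_dropout=0.1` and keep a separate `(c, w)` pair per task; at evaluation time the dropout probability would be 0.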

Transfer Learning with ULMFiT
The ULMFiT strategy defines several useful methods for fine-tuning a network on a pretrained language model (Howard and Ruder, 2018). We apply the same methods, with a few minor modifications.
We split the network into two parameter groups, i.e., the parameters of BERT and all other parameters. We apply discriminative fine-tuning, setting the base learning rate of BERT to be 5e-5 and 1e-3 everywhere else. We also freeze the BERT parameters for the first epoch to increase training stability.
While ULMFiT recommends decaying the learning rate linearly after a linear warmup, we found that this is prone to training divergence in self-attention networks, introducing vanishing gradients and underfitting. Instead, we apply an inverse square root learning rate decay with linear warmup (Noam) seen in training Transformer networks for machine translation (Vaswani et al., 2017).
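The schedule can be written as a short function. This sketch assumes the learning rate peaks at the group's base rate exactly when warmup ends, which is one common way to parameterize inverse square root decay; the function name is hypothetical, and the 8,000-step warmup is taken from the hyperparameters in Appendix A.1.

```python
def noam_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup to base_lr, then inverse square root decay."""
    step = max(step, 1)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps / step) ** 0.5


# Discriminative fine-tuning: one base rate per parameter group.
BASE_LR = {"bert": 5e-5, "other": 1e-3}
WARMUP_STEPS = 8000

rates_at_32k = {name: noam_lr(32000, lr, WARMUP_STEPS) for name, lr in BASE_LR.items()}
```

After four times the warmup length, each group's rate has decayed to half of its base value, so the decay is much gentler than a linear ramp to zero.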

Input Masking
The authors of BERT recommend not to mask words randomly with [MASK] when fine-tuning the network. However, we discovered that masking often reduces the tendency of the classifiers to overfit to BERT by forcing the network to rely on the context of surrounding words. This word dropout strategy has been observed in other works showing improved test performance on a variety of NLP tasks (Iyyer et al., 2015;Bowman et al., 2016;Clark et al., 2018;Straka, 2018).
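Input masking during fine-tuning amounts to replacing each wordpiece with [MASK] at random. A minimal sketch follows; the function name is hypothetical, and leaving the [CLS] and [SEP] markers untouched is an assumption rather than something stated in the text.

```python
import random
from typing import List, Optional, Sequence


def mask_wordpieces(
    pieces: Sequence[str],
    mask_prob: float,
    rng: Optional[random.Random] = None,
    mask_token: str = "[MASK]",
    special: Sequence[str] = ("[CLS]", "[SEP]"),
) -> List[str]:
    """Independently replace each non-special wordpiece with the mask token."""
    rng = rng or random.Random()
    return [
        mask_token if piece not in special and rng.random() < mask_prob else piece
        for piece in pieces
    ]


sentence = ["[CLS]", "Par", "##sing", "is", "fun", "[SEP]"]
masked = mask_wordpieces(sentence, mask_prob=0.2, rng=random.Random(13))
```

Masking would be applied only at training time; the labels for a masked position stay unchanged, so the classifiers have to recover them from the surrounding context.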

Experiments
We evaluate UDify with respect to every test set in each treebank. As there are too many results to fit within one page, we display a salient subset of scores and compare them with UDPipe Future. The full results are listed in Appendix A.
We do not directly reference metrics from other models in the CoNLL 2018 Shared Task, as the tables of results do not assume gold word segmentation and may not provide a fair comparison. Instead, we retrained the open source UDPipe Future model using gold segmentation and report results here due to its architectural similarity to UDify and its strong performance. Note that the UDPipe Future baseline does not itself use BERT. Evaluation of BERT utilization in UDPipe Future can be found in Straka et al. (2019).
To train the multilingual model, we concatenate all available training sets together, similar to McDonald et al. (2011). Before each epoch, we shuffle all sentences and feed mixed batches of sentences to the network, where each batch may contain sentences from any language or treebank, for a total of 80 epochs.

Hyperparameters
A summary of hyperparameters can be found in Table 6 in Appendix A.1.

Probing for Syntax
Hewitt and Manning (2019) introduce a structural probe for identifying dependency structures in contextualized word embeddings. This probe evaluates whether syntax trees (i.e., unlabeled undirected dependency trees) can be easily extracted as a global property of the embedding space using a linear transformation of the network's contextual word embeddings. The probe trains a weighted adjacency matrix on the layers of contextual embeddings produced by BERT, identifying a linear transformation where squared L2 distance between embedding vectors encodes the distance between words in the parse tree. Edges are decoded by computing the minimum spanning tree on the weight matrix (the lowest sum of edge distances).
We train the structural probe on unmodified and fine-tuned BERT using the default hyperparameters of Hewitt and Manning (2019) to evaluate whether the representations affected by fine-tuning BERT on dependency trees would more closely match the structure of these trees.
Table 2: Test set scores for a subset of high-resource (top) and low-resource (bottom) languages in comparison to UDPipe Future without BERT, with 3 UDify configurations: Lang, fine-tune on the treebank. UDify, fine-tune on all UD treebanks combined. UDify+Lang, fine-tune on the treebank using BERT weights saved from fine-tuning on all UD treebanks combined.
Table 3: Ablation comparing the average of scores over all treebanks: task-specific layer attention (4 sets of c, w computed for the 4 UD tasks), global layer attention (one set of c, w for all tasks), and simple sum of layers (c = 1 and w = ).
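The decoding side of the structural probe can be sketched in a few functions: squared L2 distances under a linear map, a minimum spanning tree over those distances, and the UUAS against gold edges. The matrix B is assumed to be already trained, the function names are hypothetical, and Prim's algorithm is used here simply as one standard way to obtain a minimum spanning tree.

```python
from typing import List, Set, Tuple

Matrix = List[List[float]]
Edge = Tuple[int, int]


def probe_distances(embeddings: Matrix, B: Matrix) -> Matrix:
    """d[i][j] = || B (h_i - h_j) ||^2 for every pair of words."""
    projected = [[sum(row[k] * h[k] for k in range(len(h))) for row in B] for h in embeddings]
    n = len(projected)
    return [
        [sum((a - b) ** 2 for a, b in zip(projected[i], projected[j])) for j in range(n)]
        for i in range(n)
    ]


def minimum_spanning_tree(dist: Matrix) -> Set[Edge]:
    """Prim's algorithm on a dense distance matrix; edges are (low, high) index pairs."""
    n = len(dist)
    in_tree = {0}
    edges: Set[Edge] = set()
    while len(in_tree) < n:
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        in_tree.add(j)
        edges.add((min(i, j), max(i, j)))
    return edges


def uuas(predicted: Set[Edge], gold: Set[Edge]) -> float:
    """Fraction of gold undirected edges recovered by the predicted tree."""
    return len(predicted & gold) / len(gold)


# Toy example: three words embedded on a line, with an identity probe.
dist = probe_distances([[0.0], [1.0], [3.0]], B=[[1.0]])
tree = minimum_spanning_tree(dist)
score = uuas(tree, gold={(0, 1), (0, 2)})
```

In the toy example the tree links neighbouring points, so only one of the two gold edges is recovered.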

Results
We show scores of UPOS, UFeats (FEATS), and Lemma (LEM) accuracies, along with unlabeled and labeled attachment scores (UAS, LAS) evaluated using the official CoNLL 2018 Shared Task evaluation script. Results for a salient subset of high-resource and low-resource languages are shown in Table 2, with a comparison between UDPipe Future and UDify fine-tuning on all languages. In addition, the table compares UDify with fine-tuning on either a single language or both languages (fine-tuning multilingually, then fine-tuning on the language with the saved multilingual weights) to provide a reference point for multilingual influences on UDify. We provide a full table of scores for all treebanks in Appendix A.4. A more comprehensive overview is shown in Table 3, comparing different attention strategies applied to UDify. We display an average of scores over all (89) treebanks with a training set. For zero-shot learning evaluation, Table 4 displays a subset of test set evaluations of treebanks that do not have a training set, i.e., Breton, Tagalog, Faroese, Naija, and Sanskrit. We plot the layer attention weights w after fine-tuning BERT in Figure 2, showing a set of weights per task. And Table 5 compares the unlabeled undirected attachment scores (UUAS) of dependency trees produced using a structural probe on both the unmodified multilingual cased BERT model and the extracted BERT model fine-tuned on the English EWT treebank.

Discussion
In this section, we discuss the most notable features of the results.

Model Performance
On average, UDify reveals a strong set of results that are comparable in performance with the state-of-the-art in parsing UD annotations. UDify excels in dependency parsing, exceeding UDPipe Future by a large margin, especially for low-resource languages. UDify slightly underperforms with respect to Lemmas and Universal Features, likely due to UDPipe Future additionally using character-level embeddings (Santos and Zadrozny, 2014; Ling et al., 2015; Ballesteros et al., 2015; Kim et al., 2016), while (for simplicity) UDify does not. Additionally, UDify severely underperforms the baseline on a few low-resource languages, e.g., cop scriptorum. We surmise that this is due to using mixed batches on an unbalanced training set, which skews the model towards predicting larger treebanks more accurately. However, we find that fine-tuning on the treebank individually with BERT weights saved from UDify eliminates most of these gaps in performance.
Echoing results seen in Smith et al. (2018), UDify also shows strong improvement leveraging multilingual data from other UD treebanks. In low-resource cases, fine-tuning BERT on all treebanks can be far superior to fine-tuning monolingually. A second round of fine-tuning on an individual treebank using UDify's BERT weights can improve this further, especially for treebanks that underperform the baseline. However, for languages that already display strong results, we typically notice worse evaluation performance across all the evaluation metrics. This indicates that multilingual fine-tuning really is superior to single-language fine-tuning with respect to these high-performing languages, showing improvements of up to 20% reduction in error.
Interestingly, Slavic languages tend to perform the best with multilingual training. While languages like Czech and Russian possess the largest UD treebanks and do not differ as much in performance from monolingual fine-tuning, evidenced by the improvements over single-language fine-tuning, we can see that a large degree of morphological and syntactic structure has transferred to low-resource Slavic languages like Upper Sorbian, whose treebank contains only 646 sentences. But this is not only true of Slavic languages, as the Turkic language Kazakh (with fewer than 1,000 training sentences) has also improved significantly.
The zero-shot results indicate that fine-tuning on BERT can result in reasonably high scores on languages that do not have a training set. It can be seen that a combination of BERT pretraining and multilingual learning can improve predictions for Breton and Tagalog, which implies that the network has learned representations of syntax that cross lingual boundaries. Furthermore, despite the fact that neither BERT nor UDify have directly observed Faroese, Naija, or Sanskrit, we see unusually high performance in these languages. This can be partially attributed to each language closely resembling another: Faroese is very close to Icelandic, Naija (Nigerian Pidgin) is a variant of English, and Sanskrit is an ancient Indian language related to Greek, Latin, and Hindi.
Figure 3: Examples of minimum spanning trees produced by the syntactic probe are shown below each sentence, evaluated on BERT (left) and on UDify (right). Gold dependency trees are shown above each sentence in black. Matched and unmatched spanning tree edges are shown in blue and red respectively.
Table 3 shows that layer attention on BERT for each task is beneficial for test performance, much more than using a global weighted average. In fact, Figure 2 shows that each task prefers the layers of BERT differently, uniquely extracting the optimal information for a task. All tasks favor the information content in the last 3 layers, with a tendency to disprefer layers closer to the input. However, an interesting observation is that for Lemmas and UFeats, the classifier prefers to also incorporate the information of the first 3 layers. This meshes well with the linguistic intuition that morphological features are more closely related to the surface form of a word and rely less on context than other syntactic tasks. Curiously enough, the middle layers are highly dispreferred, meaning that the most useful processing for multilingual syntax (tagging, dependency parsing) occurs in the last 3-4 layers. The results released by Tenney et al. (2019) also agree with the intuition behind the weight distribution above, showing how the different layers of BERT generate hierarchical information like a traditional NLP pipeline, starting with low-level syntax (e.g., POS tagging) and building up to high-level syntactic and semantic dependency parsing.

Effect of Syntactic Fine-Tuning on BERT
Even without any supervised training, the distances between BERT's embeddings encode syntax in a way that is close to human-annotated dependencies. But more notably, the results in Table 5 show that fine-tuning BERT on Universal Dependencies significantly boosts UUAS scores when compared to the gold dependency trees, an error reduction of 41%.
This indicates that the self-attention weights have learned a linearly-transformable representation of its vectors more closely resembling annotated dependency trees defined by linguists. Even with just unsupervised pretraining, a global structural property of the vector space of the BERT weights already produces a decent representation of the dependency tree in the squared L2 distance. Following this, it should be no surprise that training with a non-linear graph-based dependency decoder would produce even higher quality dependency trees.

Attention Visualization
We performed a high-level visual analysis of the BERT attention weights to see if they have changed on any discernible level. Our observations reveal something notable: the attention weights tend to be more sparse, and are more often sensitive to constituent boundaries like clauses and prepositions. Figure 4 illustrates this point, showing the attention weights of a particular attention head on an example sentence. We find similar behavior in 13 additional attention heads for the provided example sentence.
We see that some of the attention structure remains after fine-tuning. Previously, the attention head was mostly sensitive to previous words and punctuation. But after fine-tuning, it demonstrates more fine-grained attention towards immediate wordpieces, prepositions, articles, and adjectives. We found similar evidence in other attention heads, which implies that fine-tuning on UD produces attention that more closely resembles localized dependencies within constituents. We also find that BERT base heavily preferred to attend to punctuation, while UDify BERT does to a much lesser degree.

Factors that Enable BERT to Excel at Dependency Parsing and Multilinguality
Goldberg (2019) assesses the syntactic capabilities of BERT and concludes that BERT is remarkably capable of processing syntactic tasks despite not being trained on any supervised data. Conducting similar experiments, Vig (2019) and Sileo (2019) visualize the attention heads within each BERT layer, showing a number of distinct attention patterns, including attending to previous/next words, related words, punctuation, verbs/nouns, and coreference dependencies. This neat delegation of certain low-level information processing tasks to the attention heads hints at why BERT might excel at processing syntax. We see that from the analysis on BERT fine-tuned with syntax using the syntactic probe and attention visualization, BERT produces a representation that keeps constituents close in its vector space, and improves this representation to more closely resemble human-annotated dependency trees when fine-tuned on UD as seen in Figure 3. Furthermore, Ahmad et al. (2018) provide results consistent with their claim that self-attention networks can be more robust than recurrent networks to the change of word order, observing that self-attention networks capture less word order information in their architecture, which is what allows them to generally perform better at cross-lingual parsing. Wu and Dredze (2019) also analyze multilingual BERT and report that the model retains both language-independent as well as language-specific information related to each input sentence, and that the shared embedding space with the input wordpieces correlates strongly with cross-lingual generalization.
From the evidence above, we can see that the combination of strong regularization paired with the ability to capture long-range dependencies with self-attention and contextual pretraining on an enormous corpus of raw text are large contributors that enable robust multilingual modeling with respect to dependency parsing. Pretraining self-attention networks introduces a strong syntactic bias that is capable of generalizing across languages. The dependencies seen in the output dependency trees are highly correlated with the implicit dependencies learned by the self-attention, showing that self-attention is remarkably capable of modeling syntax by picking up on common syntactic patterns in text. The introduction of multilingual data also shows that these attention heads provide a surprising amount of capacity that does not degrade the performance considerably when compared to monolingual training. For example, Devlin et al. (2018) report that fine-tuning the multilingual BERT model pretrained on 104 languages results in only a small degradation in English fine-tuning performance compared to an equivalent model pretrained only on English. This also hints that the BERT model can be compressed significantly without compromising heavily on evaluation performance.

Related Work
This work's main contribution in combining treebanks for multilingual UD parsing is most similar to the Uppsala system for the CoNLL 2018 Shared Task (Smith et al., 2018). Uppsala combines treebanks of one language or closely related languages together over 82 treebanks and parses all UD annotations in a multi-task pipeline architecture for a total of 34 models. This approach reduces the number of models required to parse each language while also showing results that are no worse than training on each treebank individually, and in especially low-resource cases, significantly improved. Combining UD treebanks in a language-agnostic way was first introduced in Vilares et al. (2016), which train bilingual parsers on pairs of UD treebanks, showing similar improvements.
Other efforts in training multilingual models include Johnson et al. (2017), which demonstrate a machine translation model capable of supporting translation between 12 languages. Recurrent models have also been shown to be capable of scaling to a larger number of languages as seen in Artetxe and Schwenk (2018), which define a scalable approach to train massively multilingual embeddings using recurrent networks on an auxiliary task, e.g., natural language inference. Schuster et al. (2019) produce context-independent multilingual embeddings using a novel embedding alignment strategy to allow models to improve the use of cross-lingual information, showing improved results in dependency parsing.

Conclusion
We have proposed and evaluated UDify, a multilingual multi-task self-attention network fine-tuned on BERT pretrained embeddings, capable of producing annotations for any UD treebank, and exceeding the state-of-the-art in UD dependency parsing in a large subset of languages while being comparable in tagging and lemmatization accuracy. Strong regularization and task-specific layer attention are highly beneficial for fine-tuning, and coupled with training multilingually, also reduce the number of required models to train down to one. Multilingual learning is most beneficial for low-resource languages, even ones that do not possess a training set, and can be further improved by fine-tuning monolingually using BERT weights saved from UDify's multilingual training. All these results indicate that self-attention networks are remarkably capable of capturing syntactic patterns, and coupled with unsupervised pretraining are able to scale to a large number of languages without degrading performance.

Acknowledgments
The work described herein has been sup-

A Appendix
In this section, we detail and explain hyperparameter choices and miscellaneous details related to model training and display the full tables of evaluation results of UDify across all UD languages.

A.1 Hyperparameters
Upon concatenating all training sets, we shuffle all the sentences, bundle them into batches of 32 sentences each, and train UDify for a total of 80 epochs before stopping. We hold the learning rate constant until we unfreeze BERT in the second epoch, where we linearly warm up the learning rate for the next 8,000 batches and then apply inverse square root learning rate decay for the remaining epochs. For the dependency parser, we use feedforward tag and arc dimensions of 300 and 800 respectively. We apply a small weight decay penalty of 0.01 to ensure that the weights remain small after each update. For optimization we use the Adam optimizer and we compute softmax cross entropy loss to train the network. We use a default β1 value of 0.9 and lower the β2 value from the typical 0.999 to 0.99. The reasoning is to increase the decay rate of the second moment in the Adam optimizer to reduce the chance of the optimizer being too optimistic with respect to the gradient history. We clip the gradient updates to a maximum L2 magnitude of 5.0. A summary of hyperparameters can be found in Table 6.
To speed up training, we employ bucketed batching, sorting all sentences by their length and grouping similar length sentences into each batch. However, to ensure that the same sentences do not always get grouped within the same batch, we fuzz the lengths of each sentence by a maximum of 10% of its true length when grouping sentences together.
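Bucketed batching with fuzzed lengths can be sketched as follows. The function name is hypothetical, and shuffling the order of the finished batches is an added assumption so that training does not always proceed from short to long sentences.

```python
import random
from typing import List, Optional, Sequence


def bucketed_batches(
    sentences: Sequence[Sequence[str]],
    batch_size: int,
    fuzz: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[List[Sequence[str]]]:
    """Group sentences of similar (noisy) length into batches."""
    rng = rng or random.Random()

    def noisy_length(sentence: Sequence[str]) -> float:
        # Perturb the true length by at most +/- fuzz (10% by default).
        return len(sentence) * (1.0 + rng.uniform(-fuzz, fuzz))

    keyed = sorted((noisy_length(s), idx) for idx, s in enumerate(sentences))
    order = [idx for _, idx in keyed]
    batches = [
        [sentences[i] for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]
    rng.shuffle(batches)  # assumption: randomize the order of batches
    return batches


corpus = [["tok"] * n for n in (5, 1, 4, 2, 3, 6)]
batches = bucketed_batches(corpus, batch_size=2, fuzz=0.0, rng=random.Random(0))
```

With a nonzero `fuzz`, sentences of nearly equal length can swap places between epochs, so batch membership varies while padding stays small.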
Despite using all the regularization strategies shown previously, we still observe overfitting and must apply more aggressive techniques. To further regularize the network, we also increase the attention and hidden dropout rates of BERT from 0.1 to 0.2, and we also apply a dropout rate of 0.5 to all BERT layers before computing layer attention for each of the four tasks and applying a layer dropout with probability 0.1. We increase the masking probability of each wordpiece from 0.15 to 0.2.
With all these regularization strategies and hyperparameter choices combined, we are able to fine-tune BERT for far more epochs before the network starts to overfit, i.e., 80 as opposed to around 10. Even so, we believe even more regularization can improve test performance.
The final multilingual UDify model was trained over approximately 25 days on an NVIDIA GTX 1080 Ti taking an average of 8 hours per epoch. We use half-precision (fp16) training to be able to keep the BERT model in memory. One notable aspect of training is that while we observed the model start to level out in validation performance at around epoch 30, the model continually made small, incremental improvements over each subsequent epoch, resulting in far higher scores than if the model training was terminated early. This can be partially attributed to the decaying inverse square root learning rate.
Due to the high training times, we are only able to report on a small number of training experiments for the most relevant and useful results. Prior to developing the final model, we conducted fine-tuning experiments on pairs of languages to find a set of hyperparameters that worked best for multilingual learning. After this, we gradually scaled up training to 3 languages, 5 languages, 15 languages, and then finally the model presented above. We had high doubts, and wanted to see where the limit was in multilingual training. We were pleasantly surprised to find that this simple training scheme was able to scale up so well to all UD treebanks.

A.2 Training Size Effect on Performance
To gain a better understanding of where the largest score improvements in UDify occur, we plot the LAS improvement UDify provides over UDPipe Future for each treebank, ordered by the size (number of sentences) of the training set, see Figure 5. The results show that the largest improvements tend to occur on small treebanks with fewer than 3,000 training examples. For absolute LAS values, see Figure 6, which indicates that more training resources tend to improve evaluation performance overall.

A.3 Miscellaneous Details
Our results show that modeling language-specific properties is not strictly necessary to achieve high-performing cross-lingual representations for dependency parsing, though we caution that the model can also likely be improved by these techniques.
Fine-tuning BERT on UD introduces a syntactic bias in the network, and we are interested in observing any differences in transfer learning by fine-tuning this new "UD-BERT" on other tasks. We leave a comprehensive evaluation of injecting syntactic bias into language models with respect to knowledge transfer for future work.
We note that saving the weights of BERT and fine-tuning a second round can improve performance as demonstrated in Stickland et al. (2019). The improvements of UDify+Lang over just UDify can be partially attributed to this, but we can see that even these improvements can be inferior to fine-tuning on all UD treebanks.
BERT limits its positional encoding to 512 wordpieces, causing some sentences in UD to be too long to fit into the model. We use a sliding window approach to break up long sentences into windows of 512 wordpieces, overlapping each window by 256 wordpieces. After feeding the windows into BERT, we select the first 256 wordpieces of each window and any remaining wordpieces in the last window to represent the contextual embeddings of each word in the original sentence.
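The windowing and merging logic can be sketched as below, operating on plain lists in place of embedding tensors. The function names are hypothetical, and the sketch ignores the special [CLS] and [SEP] tokens that a real BERT input would also need to fit within each window.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sliding_windows(n: int, window: int = 512, stride: int = 256) -> List[Tuple[int, int]]:
    """Return (start, end) spans of at most `window` items that cover range(n)."""
    if n <= window:
        return [(0, n)]
    spans = []
    start = 0
    while True:
        end = min(start + window, n)
        spans.append((start, end))
        if end == n:
            break
        start += stride
    return spans


def merge_windows(window_outputs: Sequence[Sequence[T]], stride: int = 256) -> List[T]:
    """Keep the first `stride` items of each window and all of the last window."""
    merged: List[T] = []
    for k, output in enumerate(window_outputs):
        is_last = k == len(window_outputs) - 1
        merged.extend(output if is_last else output[:stride])
    return merged


# Stand-in for per-wordpiece embeddings: the positions themselves.
positions = list(range(1000))
spans = sliding_windows(len(positions))
restored = merge_windows([positions[s:e] for s, e in spans])
```

Every wordpiece ends up represented exactly once, and all wordpieces except those at the start of the first window are encoded with at least some left context from their window.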

A.4 Full Results of UD Scores
We show in Tables 7, 8, 9, and 10 UDify scores evaluated on all 124 treebanks with the official CoNLL 2018 Shared Task evaluation script. For comparison, we also include the full test evaluation of UDPipe Future on the subset of 89 treebanks with a training set. We also add a column indicating the size of each treebank, i.e., the number of sentences in the training set.