UDapter: Language Adaptation for Truly Universal Dependency Parsing

Recent advances in multilingual dependency parsing have brought the idea of a truly universal parser closer to reality. However, cross-language interference and restricted model capacity remain major obstacles to this pursuit. To address these issues, we propose a novel multilingual task adaptation approach based on recent work in parameter-efficient transfer learning, which allows for an easy yet effective integration of existing linguistic typology features into the parsing network. The resulting parser, UDapter, consistently outperforms strong monolingual and multilingual baselines on both high-resource and low-resource (zero-shot) languages, setting a new state of the art in multilingual UD parsing. Our in-depth analyses show that soft parameter sharing via typological features is key to this success.


Introduction
Monolingual training of a dependency parser has been successful for languages where relatively large treebanks are available (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016). However, for many languages, annotated treebanks are either insufficient or unavailable. Considering this, multilingual models leveraging Universal Dependency annotations (Nivre et al., 2018) have drawn serious attention from the NLP community (Zhang and Barzilay, 2015; Ammar et al., 2016; de Lhoneux et al., 2018; Kondratyuk and Straka, 2019). Multilingual approaches try to learn generalizations across languages and to share information between them, making it possible to parse a target language even without receiving any supervision in that language. Moreover, multilingual models can be faster to train and easier to maintain than a large set of monolingual models.
However, scaling a multilingual model over a high number of languages can lead to sub-optimal results, especially if the languages that the model is trained on are typologically diverse. In various NLP tasks, multilingual neural models have been found to outperform their monolingual counterparts on low- and zero-resource languages due to positive transfer effects, but underperform them in high-resource languages (Johnson et al., 2017; Arivazhagan et al., 2019; Conneau et al., 2019), a problem also known as "the curse of multilinguality". Generally speaking, a multilingual model without any language-specific supervision is likely to suffer from over-generalization and perform poorly on high-resource languages due to limited capacity compared to monolingual baselines, as verified by our experiments on parsing.
In this paper, we address this problem by striking the right balance between maximum sharing and language-specific capacity in multilingual dependency parsing. Inspired by previous work on parameter-efficient transfer learning (Houlsby et al., 2019; Platanios et al., 2018), we propose a new multilingual parsing architecture that learns to modify its language-specific parameters as a function of language embeddings. This allows the model to share parameters across languages, ensuring generalization and transfer ability, while also enabling language-specific parametrization within a single multilingual model. Furthermore, we propose not to learn language embeddings from scratch, but to leverage a mix of linguistically curated and predicted typological features obtained from the URIEL language typology database (Littell et al., 2017), which supports 3718 languages including all languages represented in UD. While the importance of typological features for cross-lingual parsing transfer has been known at least since Naseem et al. (2012), we are the first to use them effectively as direct input to a neural parser over a large number of languages ranging from high- to zero-resource, and without manual feature selection. We further show that this choice is crucial to the success of our approach, leading to a substantial +27.5 LAS increase on zero-shot languages and no loss on the high-resource languages when compared to the use of randomly initialized and learned language embeddings.
We train and test our model multilingually on the concatenation of 13 syntactically diverse high-resource languages that were used by Kulmizev et al. (2019), and also evaluate on 30 genuinely low-resource languages. Results show that our approach outperforms the state-of-the-art monolingual (Kulmizev et al., 2019) and multilingual (Kondratyuk and Straka, 2019) parsers on both high-resource and low-resource languages (zero-shot learning).
Contributions In this paper, we conduct several experiments on a large set of languages and perform a thorough analysis of our model. Accordingly, we make the following contributions:
• We apply the idea of adapter tuning (Rebuffi et al., 2018; Houlsby et al., 2019) to the task of universal dependency parsing.
• We combine that with the idea of contextual parameter generation (Platanios et al., 2018), leading to a novel language adaptation approach with state-of-the-art UD parsing results.
• We provide a simple but effective method for conditioning the language adaptation on existing typological language features, which we show is crucial for zero-shot performance.

Previous Work
Our work builds on several approaches from different fields such as multilingual neural machine translation, cross-lingual transfer learning and parsing. This section presents the background for these approaches.
Multilingual Neural Networks Research on multilingually trained neural networks has drawn massive attention over the past few years. Early models in multilingual neural machine translation designed dedicated architectures (Dong et al., 2015; Firat et al., 2016), whilst subsequent models, from Johnson et al. (2017) onwards, added a simple language identifier to models with the same architecture as their monolingual counterparts. More recent studies on multilingual NMT have focused on maximizing transfer accuracy for low-resource language pairs while preserving high-resource language accuracy (Platanios et al., 2018; Neubig and Hu, 2018; Aharoni et al., 2019; Arivazhagan et al., 2019), a problem also known as the (positive) transfer versus (negative) interference trade-off. Another line of work in this context focuses on building massively multilingual pre-trained language models that produce contextual representations of words and sentences to be used in downstream tasks (Devlin et al., 2019; Conneau et al., 2019; Lample and Conneau, 2019). The leading model, multilingual BERT (mBERT; Devlin et al., 2019), a deep self-attention network, was trained without language-specific signals on the 104 languages with the largest available Wikipedias. It uses a shared vocabulary of 110K WordPieces (Wu et al., 2016), and it has been shown to facilitate cross-lingual transfer in several applications (Pires et al., 2019; Wu and Dredze, 2019).

Cross-Lingual Dependency Parsing
In dependency parsing, the task is to predict the dependency tree of a sentence from raw text and provided annotations such as part-of-speech tags and word tokenization. Cross-lingually consistent annotation efforts and the availability of treebanks in many languages (McDonald et al., 2013; Nivre et al., 2018) have given cross-lingual studies the opportunity to share information across languages within a single parser. Early studies trained a delexicalized parser (Zeman and Resnik, 2008; McDonald et al., 2013) on one or more source languages and applied it to target languages. Building on the delexicalized approach, later work used additional features such as typological language properties (Naseem et al., 2012), syntactic embeddings (Duong et al., 2015) and cross-lingual word clusters (Täckström et al., 2012; Tiedemann, 2016). More generally, the goal of language embeddings is to encode language information in real-valued vectors in order to enrich the internal representations of multilingual models with the identity of the input language. Another line of work (Naseem et al., 2012; Zhang and Barzilay, 2015) suggests that typological features accommodate rich language information, and that they enable selective sharing of transferred information. Based on this idea, Ammar et al. (2016) use typological features to learn language embeddings as part of parser training, more precisely by augmenting each token and parsing action representation. Unfortunately, though, this technique was found to underperform the simple use of randomly initialized language embeddings (language IDs). In this work, we demonstrate that typological features can in fact be very effective if used in combination with the right adaptation strategy. Finally, Lin et al. (2019) use typological features, along with size and other properties of the training data, to choose the optimal transfer language(s) for various tasks, including UD parsing, in a hard selection manner. By contrast, we focus on a soft parameter sharing approach to maximize useful generalizations within a single universal model.

Proposed Model
In this section, we present our language adaptation approach to achieve a truly universal dependency parser, which we name UDapter. UDapter consists of a biaffine attention layer stacked on top of a pre-trained Transformer encoder (mBERT), similar to Wu and Dredze (2019) and Kondratyuk and Straka (2019); however, the mBERT layers are interleaved with special adapter layers inspired by Houlsby et al. (2019). While the mBERT weights are frozen, the weights of the biaffine attention and adapter layers are generated by a contextual parameter generator (Platanios et al., 2018) that takes a language embedding as input and is updated while training on the treebanks. We would like to stress that the proposed adaptation approach is not restricted to dependency parsing and is in principle applicable to a range of multilingual NLP tasks.
The following sections describe in detail the components of our model.

Biaffine Attention Parser
The top layer of UDapter is the graph-based biaffine attention parser proposed by Dozat and Manning (2016). In this model, an encoder generates an internal representation $r_i$ for each word; the decoder takes $r_i$, passes it through separate feedforward layers (MLPs), and finally uses deep biaffine attention to score arcs connecting a head $j$ and a dependent $i$:

$$h_i^{(arc\text{-}dep)} = \mathrm{MLP}^{(arc\text{-}dep)}(r_i), \qquad h_j^{(arc\text{-}head)} = \mathrm{MLP}^{(arc\text{-}head)}(r_j)$$
$$s_{i,j}^{(arc)} = h_j^{(arc\text{-}head)\top} U^{(arc)}\, h_i^{(arc\text{-}dep)} + h_j^{(arc\text{-}head)\top} u^{(arc)}$$

Similar to the arc scores, label scores are calculated by using a biaffine classifier over two separate feedforward layers. Lastly, the Chu-Liu/Edmonds algorithm (Chu, 1965; Edmonds, 1967) is used to find the highest-scoring valid dependency tree.
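To make the scoring function concrete, here is a minimal PyTorch sketch of deep biaffine arc scoring. The dimensions, class and variable names are illustrative choices of ours, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Minimal sketch of deep biaffine arc scoring in the style of Dozat and Manning (2016)."""
    def __init__(self, enc_dim=768, arc_dim=512):
        super().__init__()
        # Separate feedforward layers for the head and dependent views of each word.
        self.mlp_head = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.mlp_dep = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        # Biaffine weight U and head-only bias vector u.
        self.U = nn.Parameter(torch.randn(arc_dim, arc_dim) * 0.01)
        self.u = nn.Parameter(torch.zeros(arc_dim))

    def forward(self, r):                        # r: (batch, seq_len, enc_dim)
        h_head = self.mlp_head(r)                # (batch, seq_len, arc_dim)
        h_dep = self.mlp_dep(r)                  # (batch, seq_len, arc_dim)
        # s[b, j, i] = h_head[b, j]^T U h_dep[b, i] + h_head[b, j]^T u
        arc_scores = torch.einsum("bjd,de,bie->bji", h_head, self.U, h_dep)
        arc_scores = arc_scores + (h_head @ self.u).unsqueeze(-1)
        return arc_scores                        # one score per (head j, dependent i) pair

# Usage: scores for a batch of 2 sentences of length 10 with 768-dim encoder states.
scores = BiaffineArcScorer()(torch.randn(2, 10, 768))   # -> shape (2, 10, 10)
```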

Transformer Encoder with Adapters
To obtain contextualized word representations, UDapter uses the pre-trained BERT encoder (Devlin et al., 2019), a deep bidirectional Transformer network (Vaswani et al., 2017) trained with a masked language model objective together with next sentence prediction; more specifically, we use multilingual BERT. For a token $i$ in sentence $S$, BERT builds an input representation $w_i$ by summing a WordPiece embedding $x_i$ (Wu et al., 2016) and a position embedding $f_i$. Each $w_i \in S$ is then passed through stacked self-attention layers (SA) to generate the final encoder representation $r_i$:

$$r_i = \mathrm{SA}(w_i;\, \Theta^{(ad)})$$

where $\Theta^{(ad)}$ denotes the adapter modules. During training, instead of fine-tuning the whole encoder network together with the task-specific top layer, we apply adapter modules (Rebuffi et al., 2018; Houlsby et al., 2019), or simply adapters, to capture both task-specific and language-specific information. Adapters are small modules consisting of two feedforward projections with a nonlinearity, added between the layers of a pre-trained network as shown in Figure 1. In this approach, the weights of the original network are frozen, whilst the adapters are trained for a downstream task. Adapter tuning was proposed mainly for parameter efficiency, but adapters also act as an information module for the task being adapted; in this way, the original network serves as a memory for the language(s). We adopt adapter tuning for two reasons: 1) each adapter module consists of only a few parameters, which keeps contextual parameter generation (CPG; see § 3.3) feasible with a reasonable number of trainable parameters; 2) adapters enable task-specific as well as language-specific adaptation via CPG, while the frozen encoder retains the backbone multilingual representations learned in pre-training as a memory for all languages, which is important for multilingual transfer.
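As a rough illustration of the adapter layout described above (two feedforward projections around a nonlinearity, with the backbone frozen), here is a minimal sketch; the hidden and bottleneck sizes and the helper name are placeholders of ours.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Minimal adapter block sketch (Rebuffi et al., 2018; Houlsby et al., 2019):
    a bottleneck of two feedforward projections with a nonlinearity and a residual."""
    def __init__(self, hidden_dim=768, adapter_dim=256):
        super().__init__()
        self.down = nn.Linear(hidden_dim, adapter_dim)   # down-projection
        self.up = nn.Linear(adapter_dim, hidden_dim)     # up-projection
        self.act = nn.GELU()

    def forward(self, h):
        # The residual connection preserves the frozen encoder's representation.
        return h + self.up(self.act(self.down(h)))

# Adapter tuning: freeze the pre-trained encoder so only adapters (and the parser) train.
def freeze_encoder(encoder: nn.Module) -> None:
    for p in encoder.parameters():
        p.requires_grad = False
```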

Contextual Parameter Generator
To control the amount of sharing across languages, we generate the trainable parameters of the model using a contextual parameter generator (CPG) function inspired by Platanios et al. (2018). CPG enables UDapter to retain high multilingual quality without losing performance on individual languages during multi-language training. We define CPG as a function of language embeddings. Since we only train the adapters and the biaffine attention but not the other parts of the network (i.e. adapter tuning), the parameter generator is formalized as

$$\big(\theta^{(ad)}, \theta^{(bf)}\big) = g\,(l_e)$$

where $g$ denotes the parameter generator with language embedding $l_e$, and $\theta^{(ad)}$ and $\theta^{(bf)}$ denote the parameters of the adapters and the biaffine attention respectively. We implement CPG as a simple linear transform of a language embedding, similar to Platanios et al. (2018), so that the weights of the adapters in the encoder and of the biaffine attention are generated by the dot product with the language embedding:

$$\theta^{(ad)} = W^{(ad)} \cdot l_e, \qquad \theta^{(bf)} = W^{(bf)} \cdot l_e$$

where $l_e \in \mathbb{R}^M$, $W^{(ad)} \in \mathbb{R}^{P^{(ad)} \times M}$, $W^{(bf)} \in \mathbb{R}^{P^{(bf)} \times M}$, $M$ is the language embedding size, and $P^{(ad)}$ and $P^{(bf)}$ are the number of parameters of the adapters and the biaffine attention respectively. An important advantage of CPG is that it allows for an easy integration of existing language features. In the experiment section we show that this is indeed key to achieving improvements on both high- and low-resource languages.
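A minimal sketch of how CPG could generate the flattened parameters of a target module (an adapter or the biaffine attention) from a language embedding $l_e$; the shapes and names below are illustrative assumptions of ours, not UDapter's actual configuration.

```python
import torch
import torch.nn as nn

class ContextualParameterGenerator(nn.Module):
    """Minimal CPG sketch (Platanios et al., 2018): a linear map from a language
    embedding to the flattened parameters of one target module."""
    def __init__(self, lang_dim, target_shapes):
        super().__init__()
        self.target_shapes = target_shapes                  # e.g. {"down.weight": (256, 768), ...}
        total = sum(torch.Size(s).numel() for s in target_shapes.values())
        self.proj = nn.Linear(lang_dim, total, bias=False)  # plays the role of W^(ad) or W^(bf)

    def forward(self, lang_emb):                            # lang_emb: (lang_dim,)
        flat = self.proj(lang_emb)                          # all target parameters at once
        params, offset = {}, 0
        for name, shape in self.target_shapes.items():
            n = torch.Size(shape).numel()
            params[name] = flat[offset:offset + n].view(*shape)
            offset += n
        return params                                       # plugged into adapters / biaffine layers

# Usage sketch: generate the weights of one adapter bottleneck for a given language.
cpg = ContextualParameterGenerator(
    lang_dim=32,
    target_shapes={"down.weight": (256, 768), "up.weight": (768, 256)},
)
adapter_params = cpg(torch.randn(32))
```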

Projecting Language Typology to Language Embeddings
The proposed use of soft sharing via CPG enables our model to modify its parsing decisions depending on a language embedding. While this allows UDapter to perform accurately on the languages seen in training, even if they are typologically diverse, information sharing remains a problem for languages not seen during training (zero-shot learning), since no language embedding is available for them. Inspired by Naseem et al. (2012) and Ammar et al. (2016), we address this problem by defining language embeddings as a function of a large set of typological language features, including syntactic and phonological features. We use a multi-layer perceptron $\mathrm{MLP}^{(lang)}$ with two feedforward layers and a ReLU nonlinearity to compute a language embedding $l_e$:

$$l_e = \mathrm{MLP}^{(lang)}(l_t)$$

where $l_t$ is the typological feature vector of a language, consisting of all 103 syntactic, 28 phonological and 158 phonetic inventory features, i.e., no filtering or manual selection was applied to the feature list. We obtain these features from the URIEL language typology database (Littell et al., 2017).

Experiments

Data We train and evaluate UDapter multilingually on the concatenation of the 13 syntactically diverse high-resource languages used by Kulmizev et al. (2019) (see Table 1). During training, a language identifier is added to each sentence, and gold word segmentation is provided. We evaluate our models on the set of training languages (high-resource set), as well as on 30 genuinely low-resource languages that have no or very little training data (low-resource set) in a zero-shot setting, i.e., without using any training data for these languages. The detailed treebank list is provided in Appendix C.
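Stepping back to the typology projection of § 3.4, the following sketch shows how a language embedding could be computed from a URIEL feature vector with a two-layer MLP. The feature count follows the paper (103 + 28 + 158 = 289); the hidden size, embedding size, and the lang2vec call in the final comment are assumptions of ours about tooling, not part of the paper.

```python
import torch
import torch.nn as nn

class LanguageEmbedder(nn.Module):
    """Minimal sketch of MLP^(lang): two feedforward layers with ReLU mapping a
    URIEL typology vector l_t (289 features here) to a language embedding l_e."""
    def __init__(self, n_features=289, hidden_dim=128, lang_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, lang_dim),
        )

    def forward(self, typology_vec):            # typology_vec: (n_features,)
        return self.mlp(typology_vec)           # l_e, fed to the parameter generator

# The typology vector could, for instance, be assembled with the lang2vec package
# (an assumption about tooling, not something the paper specifies):
#   import lang2vec.lang2vec as l2v
#   feats = l2v.get_features("tur", "syntax_knn+phonology_knn+inventory_knn")["tur"]
l_e = LanguageEmbedder()(torch.rand(289))       # -> 32-dim language embedding
```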
For the encoder, we use BERT-multilingual-cased together with its WordPiece tokenizer. Since dependency annotations are between words, we pass the BERT output corresponding to the first wordpiece of each word to the biaffine parser. We apply the same hyper-parameter settings as Kondratyuk and Straka (2019), with additional hyper-parameters for the adapter size and the language embedding size. Note that, in our approach, the pre-trained BERT weights are frozen and only the adapters and the biaffine attention are trained, thus we use the same learning rate for the whole network, applying an inverse square root learning rate decay with linear warmup (Howard and Ruder, 2018). Appendix A gives the hyper-parameter details.
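The first-wordpiece selection mentioned above can be sketched as follows; the function name and index layout are illustrative, not UDify/UDapter's actual implementation.

```python
import torch

def first_wordpiece_pool(encoder_out, first_piece_idx):
    """Minimal sketch: select the encoder state of each word's first wordpiece.

    encoder_out:     (batch, n_pieces, hidden) mBERT output
    first_piece_idx: (batch, n_words) index of the first wordpiece of every word
    returns:         (batch, n_words, hidden) word-level representations
    """
    idx = first_piece_idx.unsqueeze(-1).expand(-1, -1, encoder_out.size(-1))
    return torch.gather(encoder_out, dim=1, index=idx)

# Example: 2 words tokenized as ["play", "##ing"] and ["fair"], with first pieces at
# positions 1 and 3 (position 0 being [CLS]).
enc = torch.randn(1, 5, 768)
words = first_wordpiece_pool(enc, torch.tensor([[1, 3]]))   # -> shape (1, 2, 768)
```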
To enable a direct comparison, we also re-train UDify on our set of 13 high-resource languages, both monolingually (one treebank at a time; mono-udify) and multilingually (on the concatenation of all languages; multi-udify). Finally, we evaluate a version of our model that has only task-specific adapter modules (adapter-only) and no language-specific adaptation, i.e. no contextual parameter generator. For a fair comparison, this model has a larger adapter size (1024) than the full UDapter.
Importantly, all baselines are either trained for a single language, or multilingually without any language-specific adaptation. By comparing UDapter to these parsers, we highlight its unique characteristic of enabling language-specific parametrization via typological features within a multilingual framework, in both the supervised and the zero-shot learning setup.

Results
Average LAS results and standard deviations across languages for the UDify models (Kondratyuk and Straka, 2019) and UDapter are given in Figure 2. The highlight is that UDapter outperforms the monolingual and multilingual baselines on both high-resource and zero-shot languages. Standard deviations among the languages in each set are also slightly reduced. Below we elaborate on these results.
High-resource Languages Labelled Attachment Scores (LAS) on the high-resource set are given in Table 1. UDapter consistently outperforms both our monolingual and multilingual baselines in all languages, and beats previous work, setting a new state of the art, in 9 out of 13 languages. Among our directly comparable baselines, multi-udify gives the worst performance in the typologically diverse high-resource setting. This multilingual model is always clearly worse than its monolingually trained counterpart mono-udify, losing 3 LAS points on average (83.0 vs 86.0). This result echoes previous findings in multilingual NMT (Arivazhagan et al., 2019) and highlights the importance of performing multilingual adaptation even when using high-quality sentence representations like those produced by multilingual BERT. To isolate the contribution of each of our model's components, we first look at the adapter-only model, which has almost the same architecture as multi-udify except for the adapter modules (Rebuffi et al., 2018; Houlsby et al., 2019) and the tuning choice (frozen mBERT weights). Interestingly, this model performs considerably better than multi-udify (85.0 vs 83.0), indicating that adapter modules are also effective in multilingual scenarios. Finally, UDapter achieves the best results overall, with consistent gains over both multi-udify and adapter-only, showing the importance of linguistically informed adaptation even for in-training languages.

Low-Resource Languages
The average LAS on our set of 30 low-resource languages is shown in the last column of Table 1. UDapter outperforms the multi-udify and adapter-only baselines, proving the benefits of our approach on both in-training and zero-shot languages. For a closer look, Table 2 provides individual results for 18 of the languages in our low-resource set. Here we find a more mixed picture: UDapter outperforms multi-udify on 13 out of 18 of these languages, and on 22 out of 30 low-resource languages in total. We emphasize though that obtaining improvements in the zero-shot parsing setup is very difficult, so we believe this result is an important step towards overcoming the problem of the positive/negative transfer trade-off. A more in-depth analysis of this result is provided in the following section.

Analysis
In this section we perform a number of analyses on UDapter, aimed at understanding its impact on different languages, as well as the importance of its various components.

5.1 In which languages is the model most beneficial?
Figure 3 presents the LAS gain of UDapter over the multi-udify baseline for each high-resource language, along with the respective treebank training size. The general trend is that gains are higher for languages with less training data. This suggests that our adaptation approach allows the model to share useful knowledge among the in-training languages, thereby benefiting the less-resourced languages without hurting performance on the highest-resourced ones. For zero-shot languages, the overall difference between the two models is rather small compared to the high-resource languages (+1.2 LAS), as seen in Table 1. While it is harder to find a clear trend among these languages, we notice that UDapter seems to be more beneficial for languages that are not present in the mBERT training corpus, where it outperforms multi-udify in 18 out of 22 (non-mBERT) languages. This suggests that typological feature-based adaptation leads to improved sentence representations even when the pre-trained encoder has never been exposed to a given language.

5.2 How much do the typological features boost the performance?
As one of its defining aspects, UDapter learns language embeddings by projecting from typological features, which comprise syntactic, phonological and phonetic inventory features. A natural alternative is to learn language embeddings from scratch. To understand the role of typological features, we trained a model in which language embeddings are initialized randomly and learned end-to-end for the in-training languages. For the zero-shot languages, we take the average of all in-training language embeddings and use the resulting vector in the parameter sharing function. Figure 4a shows the overall results on the two sets of languages. On the high-resource set, both models perform very similarly: the model with randomly initialized embeddings and the model with typological features achieve 87.1 and 87.3 average LAS respectively. On zero-shot languages, however, the model without typological features underperforms UDapter by a very large margin: 9.0 vs 36.5 average LAS over 30 languages. This confirms our expectation that, for in-training languages, the model can learn reliable language embeddings from the available syntactic annotations alone. At the same time, the results clearly show that typological signals are required to maintain consistent parsing quality on zero-shot languages, without which the model fails. Thus, the indirect supervision provided by typological features is crucial for the zero-shot setting in our model.

5.3 What does the language embedding space look like?
Figure 6 illustrates the 2D vector spaces generated by t-SNE for the typological feature vectors $l_t$ of all languages (6a) and for the language embeddings $l_e$ learned from them by UDapter (6b). Red and blue dots represent in-training and zero-shot languages respectively. Similar clusters in the two spaces were identified manually and highlighted. We find that typologically similar zero-shot languages tend to receive embeddings close to the nearest high-resource language. This is expected, since our model learns to project typological features to language embeddings on the high-resource languages in the first place, and then computes zero-shot language embeddings from typological features using that same projection. Figure 6 also reveals an interesting pattern: while the languages spread homogeneously in the typological feature space (6a), in the learned language embedding space the high-resource languages (red dots) are pushed toward the outside of the space and lie further apart from each other. This shows that the learned language embeddings amplify differences among training languages, thereby drifting away from a linguistically motivated representation space (e.g. Korean parting from Japanese, Turkish parting from Finnish). The use of typological features, though, curbs this tendency and makes it possible to adapt the parser even to languages for which no supervision of any kind is available.

We further analyze the projection weights assigned to different typological features by the first layer of the language embedding network (see § 3.4). Figure 4b shows the averages of the normalized syntactic, phonological and phonetic inventory feature weights. Although dependency parsing is a purely syntactic task, the language embedding network does not attend only to syntactic features, as also observed by Lin et al. (2019). Instead, the projection network uses all available typological features as a proxy to represent languages.
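For readers who want to reproduce a view like Figure 6b, here is a minimal sketch of the t-SNE projection over language embeddings; the embeddings below are a random stand-in for the learned ones and the language codes are merely illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

# One row per language; in practice these would be the learned 32-dim embeddings l_e.
lang_names = ["ar", "en", "fi", "he", "hi", "it", "ja", "ko", "ru", "sv", "tr", "zh", "eu"]
lang_embeddings = np.random.rand(len(lang_names), 32)   # random stand-in, shape (n_langs, dim)

# Project to 2D with t-SNE (perplexity must be smaller than the number of languages).
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(lang_embeddings)
for name, (x, y) in zip(lang_names, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```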

5.4 Is CPG really essential to our adaptation approach?
In section 4.1 we observed that adapter tuning alone (that is, without CPG) improved over the multilingual baseline on the high-resource languages, but worsened it considerably in the zero-shot setup. By contrast, the addition of CPG with typological features led to the best results across all languages. But could we have obtained similar results by simply increasing the adapter size? For instance, in the related field of multilingual MT, increasing the overall capacity of an already very large and deep architecture has proven a powerful alternative to more sophisticated parameter sharing approaches (Arivazhagan et al., 2019). To answer this question we train another adapter-only model with doubled adapter size (2048 instead of the 1024 used in the main experiments). As shown in Figure 5a, this brings a slight gain on the high-resource languages, but actually leads to a small loss in the zero-shot setup. Both versions are outperformed by the full UDapter including CPG, confirming once more the importance of this component. Regarding model size, although the adapter models have a higher total number of parameters, their number of trainable parameters is considerably lower than that of multi-udify. Adapters thus allow enlarging the per-language capacity for in-training languages without adding computational cost. On zero-shot languages, however, the results are the opposite: adapter tuning enables better adaptation to the languages it is trained on, but at the same time hurts generalization and zero-shot transfer.
For our last analysis (Fig. 5b) we look at the role of soft parameter sharing via CPG on different portions of the network, namely: only on the adapter modules 'cpg (adapters)' versus on both adapters and biaffine attention 'cpg (adap.+biaf.)' corresponding to the full UDapter. Results show that most of the gain in the high-resource languages is obtained by only applying CPG on the multilingual encoder. By contrast, for the low-resource languages, typological feature based parameter sharing is most important in the biaffine attention layer. We leave a further investigation of this finding to future work.

Conclusion
In this work, we have presented UDapter, a multilingual dependency parsing model that learns to adapt language-specific parameters on the basis of adapter modules (Rebuffi et al., 2018; Houlsby et al., 2019) and the contextual parameter generation (CPG) method (Platanios et al., 2018). While adapters provide general task-specific adaptation, CPG enables language-specific adaptation, defined as a function of language embeddings projected from linguistically curated typological features: the model thereby retains high per-language performance on the training languages while also allowing better zero-shot transfer. We train this parser on a concatenation of typologically diverse languages and evaluate it on both high-resource and low-resource languages. Experiments show that our parser outperforms both monolingual and multilingual state-of-the-art parsers, reflecting its strong balance between per-language capacity and maximum sharing. Finally, the analyses of the underlying characteristics of our model show that typological features are crucial for zero-shot languages.
The proposed adaptation approach is not restricted to the task of dependency parsing and is in principle applicable to a range of multilingual NLP tasks.

A Implementation Details
We implement UDapter on top of UDify (Kondratyuk and Straka, 2019). The hyper-parameters used in the experiments are given in Table 3. In addition to the default UDify hyper-parameters, we added the adapter size and the language embedding size. Note that for the adapter-only model (see § 4) we used an adapter size of 1024, unlike the final UDapter. To provide a fair comparison, mono-udify and multi-udify are re-trained on our set of 13 high-resource languages without multi-task learning, which we observed to negatively affect parsing performance. Moreover, we did not use layer attention, neither for our model nor for the baselines.