Parameter sharing between dependency parsers for related languages

Previous work has suggested that parameter sharing between transition-based neural dependency parsers for related languages can lead to better performance, but there is no consensus on which parameters to share. We present an evaluation of 27 different parameter sharing strategies across 10 languages, representing five pairs of related languages, each pair from a different language family. We find that sharing transition classifier parameters always helps, whereas the usefulness of sharing word and/or character LSTM parameters varies. Based on this result, we propose an architecture where the transition classifier is shared and the sharing of word and character parameters is controlled by a parameter that can be tuned on validation data. This model is linguistically motivated and obtains significant improvements over a monolingually trained baseline. We also find that sharing transition classifier parameters helps when training a parser on pairs of unrelated languages, although in that case sharing too many parameters does not help.


Introduction
The idea of sharing parameters between parsers of related languages goes back to early work in cross-lingual adaptation (Zeman and Resnik, 2008), and the idea has recently received a lot of interest in the context of neural dependency parsers (Duong et al., 2015; Ammar et al., 2016; Susanto and Lu, 2017). Modern neural dependency parsers, however, use different sets of parameters for representation and scoring, and it is not clear which parameters are best to share.
The Universal Dependencies (UD) project (Nivre et al., 2016), which seeks to harmonize the annotation of dependency treebanks across languages, has seen a steady increase in the number of languages with a treebank in a common standard. Many of these languages are low-resource and have small UD treebanks. It is therefore interesting to find ways to leverage the wealth of information contained in these treebanks, especially for low-resource languages.
In this paper, we evaluate 27 different parameter sharing strategies. We focus on a particular transition-based neural dependency parser (de Lhoneux et al., 2017a,b), which performs close to the state of the art. This parser has three sets of parameters: i) the parameters of a character-based one-layer, bidirectional LSTM; ii) the parameters of a word-based two-layer, bidirectional LSTM; and iii) the parameters of a multi-layered perceptron (MLP) with a single hidden layer. The first two sets are for learning to represent configurations; the third is for selecting the next transition. We consider all combinations of sharing these sets of parameters; in addition, we consider two ways of sharing each set, namely with and without a prefixed language embedding. The latter enables partial, soft sharing. In sum, we consider all 3^3 = 27 combinations of no sharing, hard sharing, and soft sharing of the three sets of parameters. We evaluate the 27 multilingual parsers on 10 languages from the UD project, representing five pairs of related languages, each pair from a different language family. We repeat the experiment with the same set of languages, but using pairs of unrelated languages.
Contributions This paper is, to the best of our knowledge, the first to evaluate different parameter sharing strategies for exploiting synergies between neural dependency parsers of related languages. We evaluate the different strategies on 10 languages, representing five different language families. We find that sharing the transition classifier (MLP) parameters always helps, whereas the usefulness of sharing word and/or character LSTM parameters varies across language pairs.


Parser architecture
The parser is transition-based: an arc-hybrid transition system, extended with a SWAP transition, is used to generate non-projective dependency trees (Nivre, 2009).
For an input sentence of length n with words w_1, ..., w_n, the parser creates a sequence of vectors x_1:n, where the vector x_i representing w_i is the concatenation of a word embedding e(w_i) and the final state of the character-based LSTM after processing the characters of w_i. The character vector ch(w_i) is obtained by running a bi-directional LSTM over the characters ch_j (1 ≤ j ≤ m) of w_i. Each input element is then represented by the word-level bi-directional LSTM as a vector v_i = BiLSTM(x_1:n, i). For each configuration, the feature extractor concatenates the LSTM representations of core elements from the stack and buffer. Both the embeddings and the LSTMs are trained together with the model.
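As a schematic illustration of this input representation, the sketch below builds x_i by concatenating a word embedding with a character-level summary. The real parser uses a trained character BiLSTM; here we stub it with a mean over character embeddings, and all vocabularies and dimensions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
WORD_DIM, CHAR_DIM = 4, 3  # hypothetical, far smaller than real settings

# Toy lookup tables standing in for trained embeddings.
word_emb = {w: rng.standard_normal(WORD_DIM) for w in ["the", "cat", "sleeps"]}
char_emb = {c: rng.standard_normal(CHAR_DIM) for c in "thecatslp"}

def char_repr(word):
    # Stand-in for ch(w_i), the final state of the character BiLSTM
    # over ch_1..ch_m; here simply the mean of character embeddings.
    return np.mean([char_emb[c] for c in word], axis=0)

def input_vector(word):
    # x_i = e(w_i) • ch(w_i): word embedding concatenated with char vector.
    return np.concatenate([word_emb[word], char_repr(word)])

x = [input_vector(w) for w in ["the", "cat", "sleeps"]]
```

Each x_i then feeds the word-level BiLSTM, which the sketch omits.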
A configuration c is represented by a feature function φ(•) over a subset of its elements: φ(•) is a concatenation of the BiLSTM vectors of elements on top of the stack and at the beginning of the buffer. For each configuration, transitions are scored by a classifier, in this case an MLP. The MLP scores transitions together with the arc labels for transitions that involve adding an arc. In practice, we use two interpolated MLPs: one which scores only the transitions, and one which scores transitions together with the arc label. For simplicity, we refer to the interpolated MLP as the MLP.
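The scoring step can be sketched as a single-hidden-layer MLP over the feature vector φ(c). The weights below are random stand-ins for trained parameters, the dimensions are invented, and the interpolation of the two MLPs is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
FEAT_DIM, HIDDEN, N_TRANS = 6, 5, 4  # hypothetical sizes

# Random stand-ins for the trained weights of a one-hidden-layer MLP.
W1, b1 = rng.standard_normal((HIDDEN, FEAT_DIM)), rng.standard_normal(HIDDEN)
W2, b2 = rng.standard_normal((N_TRANS, HIDDEN)), rng.standard_normal(N_TRANS)

def score_transitions(phi_c):
    # One hidden layer with a tanh non-linearity, one score per transition.
    h = np.tanh(W1 @ phi_c + b1)
    return W2 @ h + b2

scores = score_transitions(rng.standard_normal(FEAT_DIM))
best = int(np.argmax(scores))  # greedy choice of the next transition
```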

Parameter sharing
Since our parser has three basic sets of model parameters, we consider sharing all combinations of those three sets. We also introduce two ways of sharing, namely with or without the addition of a vector representing the language. This language embedding enables the model, in theory, to learn what to share between the two languages in question. Since for all three model parameter sets we now have three options (not sharing, sharing, or sharing in the context of a language embedding), we are left with 3^3 = 27 parameter sharing strategies; see Table 2.
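The enumeration of strategies is mechanical: with three options for each of the three parameter sets, a cross product yields the 27 strategies. The labels below are our own shorthand, not the paper's notation:

```python
from itertools import product

# "none" = no sharing, "hard" = hard sharing,
# "soft" = sharing with a language embedding (ID).
OPTIONS = ["none", "hard", "soft"]
PARAM_SETS = ["char_lstm", "word_lstm", "mlp"]

# One strategy assigns one option to each of the three parameter sets.
strategies = [dict(zip(PARAM_SETS, combo))
              for combo in product(OPTIONS, repeat=len(PARAM_SETS))]
print(len(strategies))  # 27
```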
In the setting where we do not share (✗) word parameters (W), we construct a different word lookup table and a different word-level BiLSTM for each language. In the setting where we do hard parameter sharing (✓) of word parameters, we construct only one lookup table and one word BiLSTM for the languages involved. In the setting where we do soft sharing (ID) of word parameters, we share those parameters and, in addition, concatenate a (randomly initialized) language embedding l_i representing the language of word w_i to the vector of the word w_i at the input of the word BiLSTM: x_i = e(w_i) • ch(w_i) • l_i. Similarly for character parameters (C), we either construct a different character BiLSTM and character lookup table for each language (✗), create one of each for all languages and share them (✓), or share them and concatenate the language embedding l_i at the input of the character BiLSTM (ID): ch_j = e(ch_j) • l_i. At the level of configurations or parser states (S), we either construct a different MLP for each language (✗), share the MLP (✓), or share it and concatenate the language embedding l_i to the vector representing the configuration at the input of the MLP (ID): MLP(φ(c) • l_i).
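A minimal sketch of the three sharing modes for the word lookup table follows; the same pattern applies to the character BiLSTM and the MLP inputs. The tables are random toy embeddings, and the vocabulary, language codes, and dimensions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, LANG_DIM = 4, 2          # hypothetical embedding sizes
LANGS = ["ar", "he"]          # e.g. one related language pair

# Randomly initialized language embeddings l_i, one per language.
lang_emb = {l: rng.standard_normal(LANG_DIM) for l in LANGS}

def make_table(words):
    return {w: rng.standard_normal(DIM) for w in words}

vocab = ["foo", "bar"]
per_lang = {l: make_table(vocab) for l in LANGS}  # no sharing: one table per language
shared = make_table(vocab)                        # hard sharing: a single table

def lookup(word, lang, mode):
    if mode == "none":   # separate parameters per language
        return per_lang[lang][word]
    if mode == "hard":   # one shared set of parameters
        return shared[word]
    if mode == "soft":   # shared parameters plus the language embedding l_i
        return np.concatenate([shared[word], lang_emb[lang]])
    raise ValueError(mode)
```

Under hard sharing both languages see identical vectors; under soft sharing the concatenated l_i lets the downstream BiLSTM condition on the language.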

Experiments
Language pairs We use 10 languages in our experiments, representing five language pairs from different language families. Our two SEMITIC languages are Arabic and Hebrew. These two languages differ in that Arabic tends to favour VSO word order whereas Hebrew tends to use SVO, but they are similar in their rich transfixing morphology.
Our two FINNO-UGRIC languages are Estonian and Finnish. These two languages differ in that Estonian no longer has vowel harmony, but they share a rich agglutinative morphology. Our two SLAVIC languages are Croatian and Russian. These two languages differ in that Croatian uses gender in plural nouns, but they otherwise share a rich inflectional morphology. Our two ROMANCE languages are Italian and Spanish. These two languages differ in that Italian uses a possessive adjective with a definite article, but they share a fairly strict SVO order. Finally, our two GERMANIC languages are Dutch and Norwegian. These two languages differ in morphological complexity, but share word ordering features to some extent.
Datasets For all 10 languages, we use treebanks from the Universal Dependencies project. The dataset characteristics are listed in Table 1. To keep the results comparable across language pairs, we down-sample the training sets to the size of the smallest of our languages, Hebrew: we randomly sample 5000 sentences for each training set. Note that while this setting makes the experiment somewhat artificial and will probably overestimate the benefits that can be obtained from sharing parameters when using larger treebanks, we find it interesting to see how much low-resource languages can benefit from parameter sharing, as explained in the introduction.
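The down-sampling step can be sketched as follows; the function name and seed handling are our own:

```python
import random

def downsample(sentences, n=5000, seed=0):
    # Randomly sample n sentences so every training set matches the
    # size of the smallest treebank (Hebrew, in the paper's setting).
    rng = random.Random(seed)
    if len(sentences) <= n:
        return list(sentences)
    return rng.sample(sentences, n)
```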
Baselines and systems This is an evaluation paper, and our results are intended to explore a space of sharing strategies to find better ways of sharing parameters between dependency parsers of related languages. Our baseline is the Uppsala parser trained monolingually. Our systems are parsers trained bilingually by language pair, where we share subsets of parameters between the languages in the pair, and we report on which sharing strategies seem superior across the 10 languages that we consider.
Implementation details We implemented the parameter sharing strategies for the Uppsala parser flexibly in DyNet, and we make the code publicly available.

Results and discussion
Our results on development sets are presented in Table 2. We use labeled attachment score (LAS) as our metric for evaluating parsers. Table 2 presents numbers for a select subset of the 27 sharing strategies; the other results can be found in the supplementary material. Our main observations are: (i) generally, and as observed in previous work, multi-task learning helps: all sharing strategies are on average better than the monolingual baselines, with minor (0.16 LAS points) to major (0.86 LAS points) average improvements; and (ii) sharing the MLP seems to be an overall better strategy than not sharing it: the 10 best strategies all share the MLP. Whereas the usefulness of sharing the MLP is quite robust across language pairs, the usefulness of sharing word and character parameters is more dependent on the language pair. This reflects the linguistic intuition that character- and word-level LSTMs are highly sensitive to phonological and morphosyntactic differences such as word order, whereas the MLP learns to predict less idiosyncratic, hierarchical relations from relatively abstract representations of parser configurations.
Based on this result, we propose a model (OURS) where the MLP is shared and the sharing of word and character parameters is controlled by a parameter that can be tuned on validation data. Results are given in Table 3. We obtain a 0.6 LAS improvement on average, and our proposed model is significantly better than the monolingual baseline with p < 0.01. Significance testing is performed using a randomization test, with the script from the CoNLL 2017 Shared Task.
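The paper uses the CoNLL 2017 Shared Task script for significance testing. As an illustration of the underlying idea only (not the shared task implementation), a paired randomization test over per-item scores can be sketched like this:

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    # Paired randomization (sign-flip) test: under the null hypothesis,
    # each paired difference is equally likely to have either sign, so
    # we compare the observed mean absolute difference against the
    # distribution obtained from random sign flips.
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            hits += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (hits + 1) / (trials + 1)
```

Identical score lists yield a p-value near 1, while a consistent gap across many pairs yields a very small p-value.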

Unrelated languages
We repeated the same set of experiments with unrelated language pairs. We hypothesise that parameter sharing between unrelated language pairs will be less useful in general than between related language pairs. However, it can still be useful: it has been shown previously that unrelated languages can benefit from being trained jointly. For example, Lynn et al. (2014) found that Indonesian was, surprisingly, particularly useful for Irish.
The results are presented in Table 4. The table only presents part of the results; the rest can be found in the supplementary material. As expected, there is much less to be gained from sharing parameters between unrelated pairs. However, it is possible to improve over the monolingual baseline by sharing some of the parameters. In general, sharing the MLP is still helpful, and it is most helpful to share the MLP and optionally one of the two other sets of parameters. Results are close to the monolingual baseline when everything is shared, and sharing word and character parameters but not the MLP hurts accuracy compared to the monolingual baseline.

Related work
Previous work has shown that sharing parameters between dependency parsers for related languages can lead to improvements (Duong et al., 2015; Ammar et al., 2016; Susanto and Lu, 2017). Smith et al. (2018) recently found that sharing parameters between closely related languages, using the same parser as in this paper, is beneficial. Naseem et al. (2012) proposed to selectively share subsets of a parser across languages in the context of a probabilistic parser. Options we do not explore here are learning the architecture jointly with optimizing the task objective (Misra et al., 2016; Ruder et al., 2017), or learning an architecture search model that predicts an architecture based on the properties of datasets, typically with reinforcement learning (Zoph and Le, 2017; Wong and Gesmundo, 2018; Liang et al., 2018). We also do not explore the option of sharing selectively based on more fine-grained typological information about languages, which related work has indicated could be useful (Bjerva and Augenstein, 2018). Rather, we stick to sharing between languages of the same language families.
The strategies explored here do not exhaust the space of possible parameter sharing strategies.For example, we completely ignore soft sharing based on mean-constrained regularisation (Duong et al., 2015).

Conclusions
We present evaluations of 27 parameter sharing strategies for the Uppsala parser across 10 languages, representing five language pairs from five different language families, and we repeat the experiment with pairs of unrelated languages. We make several observations: (a) Generally, multi-task learning helps. (b) Sharing the MLP parameters always helps, both when training a parser on a pair of related languages and when the languages are unrelated. (c) The usefulness of sharing word and character parameters differs depending on the languages. (d) Sharing too many parameters does not help when the languages are unrelated.
In future work, we plan to investigate what happens when training on more than two languages. Here, we focused on a setting with rather small amounts of balanced data; it would be interesting to experiment with datasets that are not balanced with respect to size. Finally, we have restricted our experiments to a specific architecture, using fixed hyperparameters including word and character embedding dimensions. It would be interesting to experiment with different parsing architectures as well as to vary those hyperparameters.

Table 2 :
Performance on development data (LAS; in %) across select sharing strategies. MONO is our single-task baseline; LANGUAGE-BEST uses the best sharing strategy for each language (as evaluated on development data); BEST is the overall best sharing strategy across languages; CHAR shares only the character-based LSTM parameters; WORD shares only the word-based LSTM parameters; ALL shares all parameters. ✓ refers to hard sharing, ID to soft sharing using an embedding of the language ID, and ✗ to not sharing.

Table 3 :
LAS on the test sets for the best of 9 sharing strategies and the monolingual baseline. δ is the difference between OURS and MONO.

Table 4 :
Performance on development data (LAS; in %) across select sharing strategies for unrelated languages. MONO is our single-task baseline; LANGUAGE-BEST uses the best sharing strategy for each language (as evaluated on development data); BEST and WORST are the overall best and worst sharing strategies across languages; CHAR shares only the character-based LSTM parameters; WORD shares only the word-based LSTM parameters; ALL shares all parameters. ✓ refers to hard sharing, ID to soft sharing using an embedding of the language ID, and ✗ to not sharing.