Parser Training with Heterogeneous Treebanks

How to make the most of multiple heterogeneous treebanks when training a monolingual dependency parser is an open question. We start by investigating previously suggested, but little evaluated, strategies for exploiting multiple treebanks based on concatenating training sets, with or without fine-tuning. We go on to propose a new method based on treebank embeddings. We perform experiments for several languages and show that in many cases fine-tuning and treebank embeddings lead to substantial improvements over single treebanks or concatenation, with average gains of 2.0 to 3.5 LAS points. We argue that treebank embeddings should be preferred due to their conceptual simplicity, flexibility and extensibility.


Introduction
In this paper we investigate how to train monolingual parsers in the situation where several treebanks are available for a single language. This is quite a common occurrence; in release 2.1 of the Universal Dependencies (UD) treebanks (Nivre et al., 2017), 25 languages have more than one treebank. These treebanks can differ in several respects: they can contain material from different language variants, domains, or genres, and written or spoken material. Even though the UD project provides guidelines for consistent annotation, treebanks can still differ with respect to annotation choices, consistency and quality of annotation. Some treebanks are thoroughly checked by human annotators, whereas others are based entirely on automatic conversions. All this means that it is often far from trivial to combine multiple treebanks for the same language.
The 2017 CoNLL Shared Task on Universal Dependency Parsing (Zeman et al., 2017) included 15 languages with multiple treebanks. An additional parallel test set of 1000 sentences, PUD, was also made available for a selection of languages. Most of the participating teams did not take advantage of the multiple treebanks, however, and simply trained one model per treebank instead of one model per language. There were a few exceptions to this rule, but these teams typically did not investigate the effect of their proposed strategies in detail.
In this paper we begin by performing a thorough investigation of previously proposed strategies for training with multiple treebanks for the same language. We then propose a novel method, based on treebank embeddings. Our new technique has the advantage of producing a single flexible model for each language, regardless of the number of treebanks. We show that this method leads to substantial improvements for many languages. Of the competing methods, training on the concatenation of treebanks, followed by fine-tuning for each treebank, also performed well, but this method results in longer training times and necessitates multiple unwieldy models per language.

Training with Multiple Treebanks
The most obvious way to combine treebanks for a particular language, provided that they use the same annotation scheme, is simply to concatenate the training sets. This has the advantage that it does not require any modifications to the parser itself, and it produces a single model that can be directly used for any input from the language in question. Björkelund et al. (2017) and Das et al. (2017) used this strategy to parse the PUD test sets in the 2017 CoNLL Shared Task. Few details are given on the results: concatenation was successful on dev data for most languages, but results were mixed on the actual PUD test sets. For the two Norwegian language variants, concatenation has been proposed (Velldal et al., 2017), but it hurts results unless combined with machine translation.
Training on concatenated treebanks can be improved by a subsequent fine-tuning step. In this set-up, after training the model on concatenated data, it is refined for each treebank by training only on its own training set for a few additional epochs. This enables the models to learn differences between treebanks, but it requires more training, and results in separate models for each treebank. When the parser is applied to new data, there is thus a choice of which fine-tuned version to use. This approach was used by Che et al. (2017) and Shi et al. (2017) for languages with multiple treebanks in the CoNLL 2017 Shared Task. Che et al. (2017) apply fine-tuning to all but the largest treebank for each language, and show average gains of 1.8 LAS for a subset of nine treebanks. Shi et al. (2017) show that the choice of treebank for parsing the PUD test set is important, but do not have any specific evaluation of the effect of fine-tuning.
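The difference between these strategies in terms of training runs and resulting models can be made concrete with a small sketch. The names Stage, training_plan and final_models are hypothetical and not part of any parser; the default epoch counts mirror the 30 training epochs and 10 fine-tuning epochs used in the experiments in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    """One training run in a (hypothetical) experiment plan."""
    model: str                       # name of the model this stage produces
    train_on: Tuple[str, ...]        # treebanks whose training sets are used
    epochs: int
    init_from: Optional[str] = None  # model whose weights we continue from


def training_plan(strategy: str, treebanks: List[str],
                  base_epochs: int = 30, ft_epochs: int = 10) -> List[Stage]:
    """Return the training runs needed for one language under a strategy."""
    all_tbs = tuple(treebanks)
    if strategy == "SINGLE":
        # One independent model per treebank.
        return [Stage(tb, (tb,), base_epochs) for tb in treebanks]
    if strategy == "CONCAT":
        # One model trained on the union of all training sets.
        return [Stage("concat", all_tbs, base_epochs)]
    if strategy == "C+FT":
        # Train on the union, then continue separately on each treebank.
        plan = [Stage("concat", all_tbs, base_epochs)]
        plan += [Stage(f"concat+ft:{tb}", (tb,), ft_epochs, init_from="concat")
                 for tb in treebanks]
        return plan
    raise ValueError(f"unknown strategy: {strategy}")


def final_models(plan: List[Stage]) -> List[str]:
    """Models that are used for parsing, excluding intermediate ones."""
    used_as_init = {s.init_from for s in plan if s.init_from is not None}
    return [s.model for s in plan if s.model not in used_as_init]


swedish = ["Talbanken", "LinES"]
print(final_models(training_plan("CONCAT", swedish)))
print(final_models(training_plan("C+FT", swedish)))
```

With two treebanks, concatenation yields one model, whereas fine-tuning yields one model per treebank at the cost of additional epochs, which is the trade-off described above.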
Another approach, not explored in this paper, is shared gated adversarial networks, proposed by Sato et al. (2017) for the CoNLL 2017 Shared Task. They use treebank prediction as an adversarial task. In this model, treebank-specific BiLSTMs are constructed for all treebanks in addition to a shared BiLSTM which is used both for parsing and for the adversarial task. This method requires knowing at test time which treebank the input belongs to. Sato et al. (2017) show that this strategy can give substantial improvements, especially for small treebanks. For large treebanks, however, there are mostly no or only minor improvements.
Our approach for taking advantage of multiple treebanks is to use a treebank embedding to represent the treebank to which a sentence belongs. In our proposed model, all parameters of the model are shared; the treebank embedding facilitates soft sharing between treebanks at the word level, and allows the parser to learn treebank-specific phenomena. At test time, a treebank identifier has to be given for the input data. A key benefit of using treebank embeddings is that we can train a single model for each language using all available data while remaining sensitive to the differences between treebanks. The addition of treebank embeddings requires only minor modifications to the parser (see Section 3.1, The Parser). To the best of our knowledge, the use of treebank embeddings is novel in the monolingual case. The most similar approach we have found in the literature is Lim and Poibeau (2017), who used one-hot treebank representations to combine data for improving monolingual parsing for three tiny treebanks, with improvements of 0.6-1.9 LAS. It is also related to work on domain embeddings for machine translation (Kobus et al., 2017), and language embeddings for parsing (Ammar et al., 2016).
We previously used a similar architecture for combining languages with very small training sets with additional languages (de Lhoneux et al., 2017a). Language embeddings have also been explored for other cross-lingual tasks such as language modeling (Tsvetkov et al., 2016; Östling and Tiedemann, 2017) and POS-tagging (Bjerva and Augenstein, 2018). Cross-lingual parsing, however, often requires substantially more complex models. They typically include features such as multilingual word embeddings (Ammar et al., 2016), linguistic re-write rules (Aufrant et al., 2016), or machine translation (Tiedemann, 2015). Unlike much work on cross-lingual parsing, we do not focus on a low-resource scenario.

Experimental Setup
We perform experiments for 24 treebanks from 9 languages, using UUParser (de Lhoneux et al., 2017a,b). We compare concatenation (CONCAT), concatenation with fine-tuning (C+FT), and treebank embeddings (TB-EMB). In addition we compare these results to using only single treebanks for training (SINGLE). While some of these methods were previously suggested in the literature, no proper evaluation and comparison between them has been performed. For the PUD test data, there is no corresponding training set, so we need to choose a model or set a treebank embedding based on some other treebank. We call this a proxy treebank.
For evaluation we use labeled attachment score (LAS). Significance testing is performed using a randomization test, with the script from the CoNLL 2017 Shared Task.
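As an illustration of the two evaluation steps, the following sketch computes LAS and runs a generic paired approximate randomization test over sentences. It is not the CoNLL 2017 script, which may differ in details such as the handling of label subtypes and the exact test statistic; the function names and the (head, label) token representation are assumptions made for this example, and gold segmentation is assumed so that gold and predicted sentences align word by word.

```python
import random
from typing import List, Sequence, Tuple

Token = Tuple[int, str]  # (head index, dependency label) for one word


def correct_count(gold: Sequence[Token], pred: Sequence[Token]) -> int:
    """Number of words whose predicted head and label both match gold."""
    assert len(gold) == len(pred), "gold segmentation is assumed"
    return sum(g == p for g, p in zip(gold, pred))


def las(gold_sents: List[Sequence[Token]],
        pred_sents: List[Sequence[Token]]) -> float:
    """Labeled attachment score in percent, micro-averaged over words."""
    correct = sum(correct_count(g, p) for g, p in zip(gold_sents, pred_sents))
    total = sum(len(g) for g in gold_sents)
    return 100.0 * correct / total


def randomization_test(gold_sents, sys_a, sys_b,
                       trials: int = 1000, seed: int = 0) -> float:
    """Paired approximate randomization over sentences; returns a p-value."""
    rng = random.Random(seed)
    a = [correct_count(g, p) for g, p in zip(gold_sents, sys_a)]
    b = [correct_count(g, p) for g, p in zip(gold_sents, sys_b)]
    observed = abs(sum(a) - sum(b))
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for ca, cb in zip(a, b):
            # Swap the two systems' outputs for this sentence with prob. 0.5.
            diff += (ca - cb) if rng.random() < 0.5 else (cb - ca)
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)


gold = [[(2, "nsubj"), (0, "root")], [(0, "root")]]
pred = [[(2, "nsubj"), (0, "root")], [(0, "dep")]]
print(round(las(gold, pred), 2))  # 2 of 3 words correct
```

Because the correct-word counts are shuffled per sentence, the test respects the pairing of the two systems on the same input, which is the usual motivation for this kind of test.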

The Parser
We use UUParser (de Lhoneux et al., 2017a), which is based on the transition-based parser of Kiperwasser and Goldberg (2016), and adapted to UD. It uses the arc-hybrid transition system from Kuhlmann et al. (2011) extended with a SWAP transition and a static-dynamic oracle, as described in de Lhoneux et al. (2017b). This model allows the construction of non-projective dependency trees (Nivre, 2009).
A configuration $c$ is represented by a feature function $\phi(\cdot)$ over a subset of its elements and, for each configuration, transitions are scored by a classifier. In this case, the classifier is a multilayer perceptron (MLP) and $\phi(\cdot)$ is a concatenation of the BiLSTM vectors $v_i$ of words on top of the stack and at the beginning of the buffer. The MLP scores transitions together with the arc labels for transitions that involve adding an arc.
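To make the notion of a configuration concrete, here is a minimal sketch of an arc-hybrid transition system with a SWAP transition. It reflects our reading of the cited transition system and should be checked against de Lhoneux et al. (2017b) for the exact definitions; preconditions, the oracle and the MLP scoring are omitted, and the class name, the method names and the placement of the artificial root at the end of the buffer are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


class Configuration:
    """Stack, buffer and arcs of a transition-based parser state."""

    def __init__(self, n_words: int):
        self.stack: List[int] = []
        # Assumption of this sketch: words are numbered 1..n and the
        # artificial root (0) sits at the end of the buffer.
        self.buffer: List[int] = list(range(1, n_words + 1)) + [0]
        self.arcs: Dict[int, Tuple[int, str]] = {}  # dependent -> (head, label)

    def shift(self) -> None:
        """Move the first buffer item onto the stack."""
        self.stack.append(self.buffer.pop(0))

    def left_arc(self, label: str) -> None:
        """Attach the stack top to the first buffer item, then pop it."""
        dep = self.stack.pop()
        self.arcs[dep] = (self.buffer[0], label)

    def right_arc(self, label: str) -> None:
        """Attach the stack top to the item below it, then pop it."""
        dep = self.stack.pop()
        self.arcs[dep] = (self.stack[-1], label)

    def swap(self) -> None:
        """Move the stack top into the buffer, just behind its first item."""
        self.buffer.insert(1, self.stack.pop())

    def is_terminal(self) -> bool:
        return not self.stack and self.buffer == [0]


# "Ett vittne berättade": Ett <- vittne <- berättade <- root
c = Configuration(3)
c.shift(); c.left_arc("det")
c.shift(); c.left_arc("nsubj")
c.shift(); c.left_arc("root")
print(c.arcs, c.is_terminal())
```

In the parser itself, the choice among these transitions (and labels) at each step is made by the MLP over the BiLSTM vectors of the stack and buffer items; the fixed sequence above only stands in for those decisions.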
For an input sentence of length $n$ with words $w_1, \ldots, w_n$, the parser creates a sequence of vectors $x_{1:n}$, where the vector $x_i$ representing $w_i$ is the concatenation of a word embedding $e(w_i)$ and a character vector, obtained by running a BiLSTM over the $m$ characters $ch_1, \ldots, ch_m$ of $w_i$:
$$x_i = e(w_i) \circ \mathrm{BiLSTM}(ch_{1:m})$$
where $\circ$ denotes vector concatenation. Note that no POS-tags or morphological features are used in this parser.
In the TB-EMB setup, we also concatenate a treebank embedding $tb(w_i)$ to the representation of $w_i$:
$$x_i = e(w_i) \circ \mathrm{BiLSTM}(ch_{1:m}) \circ tb(w_i)$$
Finally, each input element is represented by a BiLSTM vector, $v_i$:
$$v_i = \mathrm{BiLSTM}(x_{1:n}, i)$$
All embeddings are initialized randomly, and trained together with the BiLSTMs and MLP. For hyperparameter settings we used default values from de Lhoneux et al. (2017a). The dimension of the treebank embedding is set to 12 in our experiments; we saw only small and inconsistent changes when varying the number of dimensions. We train the parser for 30 epochs per setting. For C+FT we apply fine-tuning for an additional 10 epochs for each treebank. We pick the best epoch based on LAS score on the dev set, using average dev scores when training on more than one treebank, and apply the model from this epoch to the test data.
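The input representation in the TB-EMB setup can be sketched as follows. This is an illustration only, not UUParser code: the class InputLayer, the dimensions WORD_DIM and CHAR_DIM, and the character-averaging stand-in for the character BiLSTM are all assumptions made to keep the example free of dependencies. Only the treebank embedding size of 12 is taken from the text.

```python
import random
from typing import List

TB_DIM = 12    # treebank embedding size reported in the text
WORD_DIM = 8   # hypothetical; not the parser's actual setting
CHAR_DIM = 4   # hypothetical; not the parser's actual setting


def random_vector(dim: int, rng: random.Random) -> List[float]:
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


class InputLayer:
    """Builds x_i = e(w_i) ∘ char(w_i) ∘ tb for every word of a sentence."""

    def __init__(self, vocab, alphabet, treebanks, seed: int = 0):
        rng = random.Random(seed)
        self.word_emb = {w: random_vector(WORD_DIM, rng) for w in vocab}
        self.unk = random_vector(WORD_DIM, rng)
        self.char_emb = {c: random_vector(CHAR_DIM, rng) for c in alphabet}
        self.tb_emb = {t: random_vector(TB_DIM, rng) for t in treebanks}

    def char_vector(self, word: str) -> List[float]:
        # Stand-in for the character BiLSTM: a plain average of character
        # embeddings. This is NOT what the parser does.
        vecs = [self.char_emb.get(c, [0.0] * CHAR_DIM) for c in word]
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    def encode(self, words: List[str], treebank: str) -> List[List[float]]:
        tb = self.tb_emb[treebank]  # same vector for every word in the sentence
        return [self.word_emb.get(w, self.unk) + self.char_vector(w) + tb
                for w in words]


layer = InputLayer(vocab=["ett", "vittne"], alphabet="etvin",
                   treebanks=["Talbanken", "LinES"])
x_tal = layer.encode(["ett", "vittne"], "Talbanken")
x_lin = layer.encode(["ett", "vittne"], "LinES")
print(len(x_tal[0]))  # WORD_DIM + CHAR_DIM + TB_DIM
```

The same sentence receives identical word and character components under both treebanks and differs only in the last 12 dimensions. For data without a matching training set, such as PUD, the treebank argument would be the name of the chosen proxy treebank.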

Data
We performed all experiments on UD version 2.1 treebanks (Nivre et al., 2017), using gold sentence and word segmentation. We selected 9 languages, based on the criteria that they should have at least two treebanks with fully available training data and a PUD test set. The sizes of the training corpora for the 9 languages are shown in Table 1. The situation differs considerably across languages: some have treebanks of roughly the same size, as for Spanish, whereas others have very skewed data sizes with a mix of large and small treebanks, as for Czech. In all cases we use all available data, except for Czech, where we randomly choose a maximum of 15,000 sentences per treebank per epoch for efficiency reasons.

Results
Table 1 shows the results on the test sets of each training treebank and on the PUD test sets. Overall we observe substantial gains when using either C+FT or TB-EMB. On average both C+FT and TB-EMB beat SINGLE by 3.5 LAS points and CONCAT by over 2.0 LAS points when testing on the test sets of the treebanks used for training, and both methods beat both baselines by over 2.0 LAS points for the PUD test set, if we consider the best proxy treebank.
We see positive gains across many scenarios when using C+FT and TB-EMB. First, there are gains for both balanced and unbalanced data sizes, as in the cases of Spanish and French, respectively. Secondly, there are cases with different language variants, as for Portuguese, and different domains, as for Finnish where FTB only contains grammar examples and TDT contains a mix of domains. There are also cases of known differences in annotation choices, as for the Swedish treebanks.
When the data is very skewed, as for Russian, the effect of adding a small treebank to a large one is minor, as expected. While our results are not directly comparable to the adversarial learning in Sato et al. (2017), who used a different parser and test set, the improvements of C+FT and TB-EMB are typically at least on par with and often larger than their improvements. While our improvements are, unsurprisingly, largest for smaller treebanks, we do also see some improvements for large treebanks, in contrast to Sato et al. (2017). Some variation can be observed between languages. In two cases, Italian ISDT and Czech PUD, CONCAT performs marginally better than the more advanced methods, but these differences are not statistically significant. In several cases, especially for small treebanks, CONCAT helps noticeably over SINGLE, whereas it actually hurts for Finnish and Russian. It is, however, nearly always better to combine treebanks in some way than to use only a single treebank. The differences between the two best methods, C+FT and TB-EMB, are typically small and not statistically significant, with the exception of Czech PDT, and for some of the small proxy treebanks for PUD.
Table 1: LAS scores when testing on the training treebank and on the PUD test set with training treebank as proxy. For each test set, the best result is marked in bold. Treebank size is given as number of sentences in the training data. Statistically significant differences, at the 0.05-level, from SINGLE are marked with +, from CONCAT with × and from both these systems with *. For clarity, significance for PUD is only shown for the proxy treebank with the highest score.
The PUD test set can be seen as an example of applying the proposed models to unseen data, without matching training data. For all languages, except Czech, the results for C+FT and TB-EMB with the best proxy treebank are significantly better than the equivalent result for SINGLE, and for six of the nine languages, TB-EMB performs significantly better than CONCAT. It is clear that some treebanks are bad fits to PUD, most notably Finnish FTB and Russian SynTagRus. However, even when a treebank is a bad fit, TB-EMB and C+FT can still improve substantially over using only the single model for the treebank with the best fit, as for Russian where there is a gain of nearly 8 LAS points for TB-EMB over SINGLE, when using GSD as a proxy. For some languages, however, most notably Italian, the choice of proxy treebank makes little difference for TB-EMB and C+FT. It is also interesting to see that in many cases it is not the largest treebank that is the best proxy for PUD. The large difference in results for PUD, depending on which treebank was used as proxy, also seems to point at potential inconsistencies in the UD annotation for several languages.

Error Analysis
To complement the LAS scores, we performed a small manual error analysis for Swedish, looking at the results for the PUD data when parsed using different methods and proxy treebanks. The two Swedish treebanks, Talbanken and LinES, are known to differ in the annotation of a few constructions, notably relative clauses and prepositions that take subordinate clauses as complements. The error analysis reveals that the treebank embedding approach allows the model to learn the distinctive "style" of each treebank, while concatenation, even with fine-tuning, results in more inconsistencies in the output. A typical example is shown in Figure 1. When trained with treebank embeddings (and Talbanken as the proxy treebank), the parser produces the correct tree depicted above the sentence. When trained with fine-tuning instead, the parser incorrectly analyzes the subordinate clause as a relative clause (as shown by the dashed arc below the sentence), because the mark relation is also used for relative pronouns in the LinES treebank, despite the fact that such structures never occur in Talbanken.
Figure 1 (example sentence; dependency trees not reproduced): Ett vittne berättade för polisen att offret hade attackerat den misstänkte i april. Gloss: A witness related for the-police that the-victim had attacked the suspected in April.

Conclusion and Future Work
We have conducted the first large-scale study on how best to combine multiple treebanks for a single language, when all treebanks use the same annotation scheme but may be heterogeneous with respect to domain, genre, size, language variant, annotation style, and quality, as is the case for many languages in the UD project. We propose using treebank embeddings, which represent the treebank a sentence comes from. This method is simple, effective, and flexible, and performs on par with a previously suggested method of using concatenation in combination with fine-tuning, which, however, requires longer training, and produces more models. We show that both these methods give substantial gains for a variety of languages, including different scenarios with respect to the mix of available treebanks. Our results are also at least on par with a previously proposed, but more complex model, based on adversarial learning (Sato et al., 2017). To improve parsing accuracy, it is certainly worth combining multiple treebanks, when available, for a language, using more sophisticated methods than simple concatenation. We recommend the treebank embedding model due to its simplicity.
The proposed methods work well with a transition-based parser with BiLSTM feature extractors without POS-tags or pre-trained embeddings. In future work, we want to investigate how these methods interact with other parsers, and if the combination methods are useful also for tasks like POS-tagging and morphology prediction.
We have not yet investigated methods for choosing a proxy treebank when parsing new data. The results on the PUD test set could indicate which treebank is likely to be the best proxy for the languages explored here. Other factors that could be taken into account when making this choice include degree of domain match and treebank quality. The user may also simply choose the desired annotation style by selecting the corresponding proxy treebank. For the TB-EMB approach, interpolation of the various treebank embeddings is another possibility.
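One simple way to realize the interpolation idea is a weighted average of the trained treebank embedding vectors. The sketch below is speculative: the function name interpolate, the weighting scheme and the toy vectors are assumptions for illustration, and this setting has not been evaluated in this paper.

```python
from typing import Dict, List


def interpolate(embeddings: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average of treebank embedding vectors."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        for k, value in enumerate(embeddings[name]):
            # Normalize so that the weights act as mixing proportions.
            mixed[k] += (weight / total) * value
    return mixed


# Toy 2-dimensional vectors standing in for trained treebank embeddings.
tb_emb = {"Talbanken": [1.0, 0.0], "LinES": [0.0, 1.0]}
print(interpolate(tb_emb, {"Talbanken": 1.0, "LinES": 1.0}))
```

The resulting vector would be used in place of a single treebank embedding at test time; putting all the weight on one treebank recovers the ordinary proxy-treebank setting.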
In the current paper, we explore only the monolingual case, using several treebanks for a single language. Preliminary experiments show that we can combine treebank and language embeddings and add other languages to the mix. Including closely related languages typically gives additional gains, which we will explore in future work.