MSTParser Model Interpolation for Multi-Source Delexicalized Transfer

We introduce interpolation of trained MSTParser models as a resource combination method for multi-source delexicalized parser transfer. We present both an unweighted method and a variant in which each source model is weighted by the similarity of the source language to the target language. Evaluation on the HamleDT treebank collection shows that weighted model interpolation performs comparably to the weighted parse tree combination method, while being computationally much less demanding.


Introduction
The task of delexicalized dependency parser transfer (or delex transfer for short) is to train a parser on a treebank for a source language (src), using only non-lexical features, most notably part-of-speech (POS) tags, and to apply that parser to POS-tagged sentences of a target language (tgt) to obtain dependency parse trees. Delex transfer yields worse results than a supervised lexicalized parser trained on the tgt language treebank. However, for languages with no treebanks available, it may be useful to obtain at least a lower-quality parse tree for tasks such as information retrieval.
Usually, multiple src treebanks are available, and it is non-trivial to select the best one for a given tgt language. Therefore, information from some or all src treebanks is usually combined together. The standard ways are to train a parser on the concatenation of all src treebanks, or to train a separate parser on each src treebank and to combine the parse trees produced by the parsers using a maximum spanning tree algorithm. The tree combination method typically performs better; it can also be easily extended by weighting the src parser predictions by similarity of the src language to the tgt language, which can further improve its results.
In this work, we present a novel method for src information combination, based on interpolation of trained parser models. Our approach was motivated by an intuition that the more fine-grained information provided by the src edge scores could be of benefit, probably serving as src parser confidence. Moreover, model interpolation is significantly less computationally demanding at inference than the parse tree combination method, as instead of running a set of separate src parsers, only one parser is run.

Related Work
Delex transfer was conceived by Zeman and Resnik (2008), who also introduced two important preprocessing steps: mapping treebank-specific POS tagsets to a common set using Interset (Zeman, 2008), and harmonizing treebank annotation styles into a common style, which later led to the HamleDT harmonized treebank collection (Zeman et al., 2012). McDonald et al. (2011) applied delex transfer in a setting with multiple src treebanks available, finding that the problem of selecting the best src treebank without access to a tgt language treebank for evaluation is non-trivial, and proposed the treebank concatenation method as a solution. Søgaard and Wulff (2012) introduced weighting into the method, using a POS n-gram model trained on a tgt POS-tagged corpus to weight src sentences in a weighted perceptron learning scenario (Cavallanti et al., 2010); due to its high computational complexity, we only compare to the unweighted variant in our paper.
The parse tree combination method was introduced by Sagae and Lavie (2006) for a supervised monolingual setting, optionally weighting each src parser with a weight based on its accuracy. In (Rosa and Žabokrtský, 2015), we ported the method to a crosslingual setting by combining delex parsers for different languages, weighted by src-tgt language similarity; we largely build upon that work in this paper.
Other possibilities of estimating src-tgt language similarity for delex transfer include the use of WALS (Dryer and Haspelmath, 2013), focusing e.g. on genealogy distance and word-order features, as done by Naseem et al. (2012) and Täckström et al. (2013), among others.
We are not aware of any prior work on interpolating dependency parser models. However, there is work on interpolating trained phrase-structure parsers, both in a monolingual setting for domain adaptation by McClosky et al. (2010), as well as in a multilingual setting by Cohen et al. (2011).

Method
In this section, we present our suggested approach of combining information from multiple src treebanks for parsing tgt language sentences in a crosslingual delex transfer scenario. The method proceeds as follows:
1. Train a delex parser model on each src treebank (Section 3.1).
2. Normalize the parser models (Section 3.2).
3. Interpolate the parser models (Section 3.3).
4. Parse the tgt text with a delex parser using the interpolated model.

Delexicalized MSTParser
Throughout this work, we use MSTperl (Rosa, 2015b), an unlabelled first-order non-projective single-best implementation of the MSTParser of McDonald et al. (2005b), trained using 3 iterations of MIRA (Crammer and Singer, 2003). The MSTParser model uses a set of binary features $F$ that are assigned weights $w_f$ by training on a treebank. When parsing a sentence, the parser constructs a complete weighted directed graph over the tokens of the input sentence, and assigns each edge $e$ a score $s_e$, which is the sum of the weights of the features that are active for that edge:
$$s_e = \sum_{f \in F} f(e) \cdot w_f \tag{1}$$
where $f(e) = 1$ if feature $f$ is active for edge $e$, and $f(e) = 0$ otherwise. The sentence parse tree is the maximum spanning tree over that graph, found using the algorithm of Chu and Liu (1965) and Edmonds (1967).
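The edge scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MSTperl implementation: the two feature templates in `edge_features`, the feature string format, and the convention that token 0 is an artificial root are all hypothetical. Features that are absent from the model contribute a weight of 0. The maximum spanning tree search itself is omitted.

```python
from typing import Dict, List, Tuple


def edge_features(tags: List[str], head: int, dep: int) -> List[str]:
    """Return the names of the binary features active for the edge head -> dep.

    Hypothetical toy templates (NOT the real MSTperl feature set):
    the POS pair, and the POS pair combined with the signed edge length.
    """
    dist = dep - head  # signed edge length
    pair = f"H={tags[head]}|D={tags[dep]}"
    return [pair, f"{pair}|dist={dist}"]


def edge_score(weights: Dict[str, float], tags: List[str], head: int, dep: int) -> float:
    """Score of an edge: the sum of the weights of its active features."""
    return sum(weights.get(f, 0.0) for f in edge_features(tags, head, dep))


def score_graph(weights: Dict[str, float], tags: List[str]) -> Dict[Tuple[int, int], float]:
    """Score every edge of the complete directed graph over the sentence.

    Assumption: tags[0] is an artificial root, which can be a head
    but never a dependent.
    """
    n = len(tags)
    return {
        (h, d): edge_score(weights, tags, h, d)
        for h in range(n)
        for d in range(1, n)
        if h != d
    }
```

A maximum spanning tree algorithm such as Chu-Liu-Edmonds would then be run over the dictionary returned by `score_graph`.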
The delex feature set we use is based on the set of McDonald et al. (2005a) with lexical features removed. It consists of combinations of signed edge length (distance between head and dependent, bucketed for values above 4 and for values above 10) with the POS tags of the head, the dependent, their neighbours, and all nodes between them. We use the Universal POS Tagset (UPOS) of Petrov et al. (2012). The parser configuration files containing the full feature set, together with the scripts we used for our experiments, are available in (Rosa, 2015a).

Model Normalization
An important preliminary step to model interpolation is to normalize each of the trained models, as the feature weights in models trained over different treebanks are often not on the same scale (we do not perform any regularization during the parser training). We use a simplified version of normalization by standard deviation. First, we compute the uncorrected sample standard deviation of the weights of the features in the model as
$$s_M = \sqrt{\frac{1}{|M|} \sum_{f \in M} (w_f - \bar{w})^2} \tag{2}$$
where $\bar{w}$ is the average feature weight, and $|M|$ is the number of feature weights in model $M$; only features that were assigned a weight by the training algorithm are taken into account. We then divide each feature weight by the standard deviation:
$$w_f := \frac{w_f}{s_M} \tag{3}$$
The choice of normalization by standard deviation is based on its high and stable performance on our development set, and Occam's razor.
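As a minimal sketch of this normalization step, assume a model is represented as a Python dictionary from feature names to weights (a hypothetical representation chosen for illustration, not the MSTperl model format). The function names are likewise illustrative. Only features present in the dictionary are counted, and the weights are divided by the standard deviation without subtracting the mean.

```python
import math
from typing import Dict


def weight_std(model: Dict[str, float]) -> float:
    """Uncorrected sample standard deviation of the feature weights.

    Divides by |M| (not |M| - 1); only features stored in the model count.
    """
    mean = sum(model.values()) / len(model)
    variance = sum((w - mean) ** 2 for w in model.values()) / len(model)
    return math.sqrt(variance)


def normalize_model(model: Dict[str, float]) -> Dict[str, float]:
    """Divide each feature weight by the standard deviation of all weights."""
    std = weight_std(model)
    if std == 0.0:
        # Degenerate case (all weights equal); not discussed in the text.
        raise ValueError("cannot normalize a model with zero weight variance")
    return {f: w / std for f, w in model.items()}
```

After normalization, every model has a weight standard deviation of 1, which puts models trained on different treebanks on a comparable scale.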

Model Interpolation
The interpolated model is a linear combination of the normalized models trained over the src treebanks. The result is a model that can be used in the same way as a standard MSTParser model.
In unweighted model interpolation, the weight of each feature ($w_f$) is computed as the sum of the weights of that feature in the src models ($w_{f,src}$):
$$w_f = \sum_{src} w_{f,src} \tag{4}$$
In the weighted variant of model interpolation, we extend (4) with multiplication by the $KL^{-4}_{cpos^3}$ weight of Rosa and Žabokrtský (2015):
$$w_f = \sum_{src} w_{f,src} \cdot KL^{-4}_{cpos^3}(tgt, src) \tag{5}$$
The $KL^{-4}_{cpos^3}$ weight corresponds to the similarity of the src language to the tgt language, and is defined as the negative fourth power of the KL divergence (Kullback and Leibler, 1951) of coarse POS tag trigram distributions in tgt and src corpora:
$$KL^{-4}_{cpos^3}(tgt, src) = \left( \sum_{cpos^3 \in tgt} f_{tgt}(cpos^3) \cdot \log \frac{f_{tgt}(cpos^3)}{f_{src}(cpos^3)} \right)^{-4} \tag{6}$$
where $cpos^3$ is a UPOS trigram, and $f(cpos^3)$ is its relative frequency in a src or tgt corpus.³
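The following Python sketch illustrates both the language similarity weight and the interpolation of (4) and (5). It rests on several assumptions that the text does not specify: models are dictionaries from feature names to (already normalized) weights, corpora are lists of UPOS tag sequences, trigrams are collected within sentences without boundary padding, and the natural logarithm is used. All function names are hypothetical. For trigrams unseen in the src corpus, the relative frequency is set to 1/N, where N is the number of src tokens.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence


def trigram_counts(sentences: Sequence[Sequence[str]]) -> Counter:
    """Count UPOS trigrams within each sentence (assumption: no padding)."""
    counts: Counter = Counter()
    for tags in sentences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts


def kl_cpos3_weight(tgt: Sequence[Sequence[str]], src: Sequence[Sequence[str]]) -> float:
    """KL divergence of tgt and src trigram distributions, raised to the power -4."""
    tgt_counts, src_counts = trigram_counts(tgt), trigram_counts(src)
    tgt_total, src_total = sum(tgt_counts.values()), sum(src_counts.values())
    src_tokens = sum(len(s) for s in src)
    kl = 0.0
    for trigram, count in tgt_counts.items():
        f_tgt = count / tgt_total
        if trigram in src_counts:
            f_src = src_counts[trigram] / src_total
        else:
            f_src = 1.0 / src_tokens  # fallback for trigrams unseen in src
        kl += f_tgt * math.log(f_tgt / f_src)
    if kl == 0.0:
        # Identical distributions; the text does not cover this case.
        raise ValueError("KL divergence is zero; the weight is undefined")
    return kl ** -4


def interpolate(models: List[Dict[str, float]],
                weights: Optional[List[float]] = None) -> Dict[str, float]:
    """Linear combination of models; all weights are 1 in the unweighted variant."""
    if weights is None:
        weights = [1.0] * len(models)
    result: Dict[str, float] = defaultdict(float)
    for model, lam in zip(models, weights):
        for feature, w in model.items():
            result[feature] += lam * w
    return dict(result)
```

In the weighted variant, `weights` would hold one `kl_cpos3_weight(tgt, src)` value per src model. The returned dictionary has the same form as a single model, so it can be used by the parser without further changes.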

Baseline Methods
In this section, we describe the two baseline resource combination methods against which we compare our model interpolation method.

Treebank Concatenation
The treebank concatenation method of McDonald et al. (2011) proceeds as follows:
1. Concatenate all src treebanks.
2. Train a delex parser on the resulting treebank.
3. Apply the parser to the tgt text.

Parse Tree Combination
The parse tree combination method is defined by Rosa and Žabokrtský (2015) in the following way:
1. Train a delex parser on each src treebank.
2. Apply each of the parsers to the tgt sentence, obtaining a set of parse trees.
3. Construct a weighted directed graph over tgt sentence tokens, with each edge assigned a score equal to the number of parse trees that contain this edge (i.e., each parse tree contributes 0 or 1 to the edge score). In the weighted variant, the contribution of each src parse tree is multiplied by $KL^{-4}_{cpos^3}(tgt, src)$.
4. Find the maximum spanning tree over the graph with the Chu-Liu-Edmonds algorithm.
Note that if each src parse tree contributed the (normalized) score of the edge as assigned by its model, rather than 0 or 1, this method would be equivalent to the model interpolation method.
³ $f_{src}(cpos^3) := \frac{1}{N}$ if the src corpus does not contain the given trigram ($N$ is the number of tokens in the corpus).
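The graph construction in step 3 of the parse tree combination method can be sketched as follows. The representation is an assumption made for illustration: each parse tree is a dictionary mapping a dependent token index to its head index, with 0 denoting an artificial root, and `combine_votes` is a hypothetical name. The maximum spanning tree search of step 4 is not included.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def combine_votes(trees: List[Dict[int, int]],
                  weights: Optional[List[float]] = None) -> Dict[Tuple[int, int], float]:
    """Build the weighted edge graph from the parse trees of the src parsers.

    Each tree maps dependent -> head. An edge (head, dependent) receives the
    weight of every tree that contains it; all weights are 1 in the
    unweighted variant, so the score is then a plain vote count.
    """
    if weights is None:
        weights = [1.0] * len(trees)
    scores: Dict[Tuple[int, int], float] = defaultdict(float)
    for tree, lam in zip(trees, weights):
        for dep, head in tree.items():
            scores[(head, dep)] += lam
    return dict(scores)
```

For example, with three src parsers of which two agree on a tree, the agreed edges get a score of 2 in the unweighted variant, but a highly weighted dissenting parser can outvote them in the weighted variant.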

Dataset
We carry out all experiments using HamleDT 2.0 (Rosa et al., 2014), a collection of 30 treebanks converted into Universal Stanford Dependencies (de Marneffe et al., 2014). We use gold-standard UPOS tags in all experiments; while this is not fully realistic in the setting of under-resourced languages, there exist high-performance semi-supervised taggers that could be used instead of gold tags (Das and Petrov, 2011; Agić et al., 2015), which we plan to evaluate in the future. We use the treebank training sections for parser training and $KL^{-4}_{cpos^3}$ computation, and the test sections for evaluation. We used 12 of the treebanks as a development set for selecting the model normalization method, to avoid overfitting this choice to the whole dataset.

Table 1 contains the results of our model interpolation methods, as well as the baseline methods. For each tgt language, all remaining 29 src treebanks were used for parser training. We base our evaluation on comparing absolute differences in UAS on the whole set of 30 languages as targets.

The performance of the weighted model interpolation is comparable to that of the weighted tree combination: the difference in average UAS between the methods is lower than 0.1%, with model interpolation achieving a higher UAS than tree combination for 16 of the 30 tgt languages.

In the unweighted setting, the situation is quite different, with model interpolation scoring much lower than tree combination (-2.4%), and only slightly higher than treebank concatenation (+0.4%) on average. This suggests that, contrary to our original intuition, edge scores assigned by the src models are not a good proxy for parser confidence, not even when appropriately normalized.⁶ Furthermore, the weighted methods generally outperform the unweighted ones (by +4.0% for tree combination and by +6.4% for model interpolation on average), which suggests, among other things, that the src-tgt language similarity is much more important than the exact values of src edge scores for resource combination in delex transfer.

⁶ The same tendency was observed across all normalization methods evaluated on the development set.

Conclusion
We presented trained parser model interpolation as an alternative method for multi-source crosslingual delexicalized dependency parser transfer. Evaluation on a large collection of treebanks showed that in a setting where the source languages are weighted by their similarity to the target language, model interpolation performs comparably to the parse tree combination approach. Moreover, model interpolation is significantly less computationally demanding than tree combination when parsing the target text: the interpolation can be performed efficiently beforehand, so only a single parser needs to be invoked at runtime, whereas in the tree combination approach each source parser has to be invoked individually.
In the unweighted setting, model interpolation consistently performed much worse than tree combination, which we find rather surprising and plan to investigate further in the future. Still, the weighted methods generally outperformed the unweighted ones. The language similarity measure that we used only requires the source treebanks and a target POS-tagged text, i.e. exactly the resources that are required even for the unweighted delex transfer methods, so there is little reason not to employ the weighting. Therefore, the low performance of the unweighted model interpolation is of less importance than its high performance in the weighted setting.
In this work, we only used the unlabelled MSTParser for all experiments. We believe that extending our method to other parsers constitutes an interesting path for future research.