Cross-lingual Parsing with Polyglot Training and Multi-treebank Learning: A Faroese Case Study

Cross-lingual dependency parsing involves transferring syntactic knowledge from one language to another. It is a crucial component for inducing dependency parsers in low-resource scenarios where no training data for a language exists. Using Faroese as the target language, we compare two approaches using annotation projection: first, projecting from multiple monolingual source models; second, projecting from a single polyglot model which is trained on the combination of all source languages. Furthermore, we reproduce multi-source projection (Tyers et al., 2018), in which dependency trees of multiple sources are combined. Finally, we apply multi-treebank modelling to the projected treebanks, in addition to or alternatively to polyglot modelling on the source side. We find that polyglot training on the source languages produces an overall trend of better results on the target language but the single best result for the target language is obtained by projecting from monolingual source parsing models and then training multi-treebank POS tagging and parsing models on the target side.


Introduction
Cross-lingual transfer methods, i.e. methods that transfer knowledge from one or more source languages to a target language, have led to substantial improvements for low-resource dependency parsing (Rosa and Mareček, 2018; Agić et al., 2016; Guo et al., 2015; Lynn et al., 2014; McDonald et al., 2011; Hwa et al., 2005) and part-of-speech (POS) tagging (Plank and Agić, 2018). In low-resource scenarios, there may not be enough data for data-driven models to learn how to parse. In cases where no annotated data is available, knowledge is often transferred from annotated data in other languages, and when there is only a small amount of annotated data, additional knowledge can be induced from external corpora, for example by learning distributed word representations (Mikolov et al., 2013; Al-Rfou' et al., 2013) and more recent contextualized variants (Peters et al., 2018; Devlin et al., 2019).
This work focuses on dependency parsing for low-resource languages by means of annotation projection (Yarowsky et al., 2001) and synthetic treebank creation (Tiedemann and Agić, 2016). We build on recent work by Tyers et al. (2018), who show that in the absence of annotated training data for the target language, a lexicalized treebank can be created by translating a target language corpus into a number of related source languages and parsing the translations using models trained on the source language treebanks. (In this paper, source language and target language always refer to the projection, not the direction of translation.) These annotations are then projected to the target language using separate word alignments for each source language, combined into a single graph for each sentence and decoded (Sagae and Lavie, 2006), resulting in a treebank for the target language, Faroese in the case of Tyers et al.'s and our experiments.
We adopt the same terminology used in Mulcaire et al. (2019), who use the term cross-lingual transfer to describe methods involving the use of one or more source languages to process a target language. They reserve the term polyglot learning for training a single model on multiple languages and where parameters are shared between languages. For the polyglot learning technique applied to multiple treebanks of a single language, we use the term multi-treebank learning.
Inspired by recent literature involving multilingual learning (Mulcaire et al., 2019; Smith et al., 2018; Vilares et al., 2016), we investigate whether additional improvements can be made by:
1. using a single polyglot parsing model which is trained on the combination of all source languages to create synthetic source treebanks (which are subsequently projected to the target language)
2. training a multi-treebank model on the individually projected treebanks and the treebank produced with multi-source projections.
The former differs from the approach of Tyers et al. (2018), who use multiple discrete, monolingual models to parse the translated sentences, whereas in this work we use a single model trained on multiple source treebanks. The latter differs from training on the target treebank produced by multi-source projection in that the information of the individual projections is still available and training data is not reduced to cases where all source languages provide a projection.
In other words, we aim to investigate whether the current state-of-the-art approach for Faroese, which relies on cross-lingual transfer, can be improved upon by adopting an approach based on source-side polyglot learning and/or target-side multi-treebank learning. We hypothesize that a polyglot model can exploit similarities in morphology and syntax across the included source languages, which will result in a better model to provide annotations for projection. On the target side, we expect that combining different sources of information will result in a more robust target model.
We evaluated our various models on the Faroese test set and observed considerable gains for three of the four source languages (Danish, Norwegian Bokmål and Swedish) by adopting a polyglot model. However, for Norwegian Nynorsk, a stronger monolingual model was able to outperform the polyglot approach. When we extended multi-treebank learning to the target side, we observed additional gains in all cases. Our best result of 71.5, an absolute improvement of 7.2 points over the result reported by Tyers et al. (2018), was achieved with multi-treebank target learning over the monolingual projections.

Background
Tyers et al. (2018) describe a method for creating synthetic treebanks for Faroese based on previous work which uses machine translation and word alignments to transfer trees from source language(s) to the target language. Sentences from Faroese are translated into the four source languages Danish, Swedish, Norwegian Nynorsk and Norwegian Bokmål. The translated sentences are then tokenized, POS tagged and parsed using the relevant source language model trained on the source language treebank. The resulting trees are projected back to the Faroese sentences using word alignments. The four trees for each sentence are combined into a graph with edge scores one to four (the number of trees that support them), from which a single tree per sentence is produced using the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). The resulting trees make up a synthetic treebank for Faroese which is then used to train a Faroese parsing model. The parser output is evaluated using the gold-standard Faroese test treebank developed by Tyers et al. (2018). The approach is compared to a delexicalized baseline, which it outperforms by a large margin. It is also shown that, for Faroese, a combination of the four source languages (multi-source projection) is superior to individual language projection.
The idea of annotation projection using word alignments originates from Yarowsky et al. (2001), who used word alignments to transfer information such as POS tags from source to target languages. This method was later used in dependency parsing by Hwa et al. (2005), who project dependencies to a target language and use a set of heuristics to form dependency trees in the target language. A parser is then trained on the projected treebank and evaluated against gold-standard treebanks. Zeman and Resnik (2008) introduced the idea of delexicalized dependency parsing, whereby a parser is trained using only POS information and is then applied to a target language. McDonald et al. (2011) perform delexicalized dependency parsing using direct transfer and show that this approach outperforms unsupervised approaches for grammar induction. Importantly, this approach can be extended to the multi-source case by training on multiple source languages and predicting a target language. In an additional experiment, they combine annotation projection and multi-source transfer. Tiedemann and Agić (2016) present a thorough comparison of pre-neural cross-lingual parsing. Various forms of projected annotation methods are compared to delexicalized baselines, and the use of machine translation instead of parallel corpora to produce synthetic treebanks in the target language is explored. In contrast to Tyers et al. (2018), they translate a target sentence and project the source parse tree back to the target during test time instead of using this approach to obtain training data for the target language.
Agić et al. (2016) leverage massively multilingual parallel corpora such as translations of the Bible and web-scraped data from the Watchtower Online Library website (https://wol.jw.org/) for low-resource POS tagging and dependency parsing using annotation projection. They project weight matrices (as opposed to decoded dependency arcs) from multiple source languages and average the matrices weighted by word alignment confidences. They then decode the weight matrices into dependency trees on the target side, which are then used to train a parser. This approach utilizes dense information from multiple source languages, which helps reduce noise from source-side predictions, but to the best of our knowledge, the source-side parsing models learn information between source languages independently and the cross-lingual interaction only occurs when projecting the edge scores into multi-source weight matrices.
The idea of projecting dense information in the form of a weighted graph has been further extended by Schlichtkrull and Søgaard (2017), who bypass the need to train the target parser on decoded trees and develop a parser which can be trained directly on weighted graphs. Plank and Agić (2018) use annotation projection for POS tagging. They find that choosing high-quality training instances results in higher accuracy than randomly sampling a larger training set. To this end, they rank the target sentences by the percentage of words covered by word alignments across all source languages and choose the top k covered sentences for training.
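The coverage-based selection described above can be sketched as follows. This is a minimal illustration rather than the implementation of Plank and Agić (2018): the function names, the input layout and the choice to average coverage over source languages are our own assumptions.

```python
from typing import Dict, List, Set, Tuple

# One candidate sentence: (sentence id, number of target tokens,
# {source language: set of target token indices that received an alignment}).
# This layout is a hypothetical convenience, not a format from the cited work.
Candidate = Tuple[str, int, Dict[str, Set[int]]]


def coverage(num_tokens: int, aligned_by_source: Dict[str, Set[int]]) -> float:
    """Mean fraction of target tokens covered by an alignment, over all sources."""
    if num_tokens == 0 or not aligned_by_source:
        return 0.0
    per_source = [len(idx) / num_tokens for idx in aligned_by_source.values()]
    return sum(per_source) / len(per_source)


def top_k_covered(candidates: List[Candidate], k: int) -> List[str]:
    """Rank sentences by alignment coverage and keep the ids of the best k."""
    ranked = sorted(candidates, key=lambda c: coverage(c[1], c[2]), reverse=True)
    return [sent_id for sent_id, _, _ in ranked[:k]]


candidates = [
    ("s1", 4, {"da": {0, 1}, "sv": {0}}),        # coverage 0.375
    ("s2", 2, {"da": {0, 1}, "sv": {0, 1}}),     # coverage 1.0
    ("s3", 5, {"da": set(), "sv": {1}}),         # coverage 0.1
]
selected = top_k_covered(candidates, k=2)
```

Sorting by coverage keeps fully aligned sentences first, so a small k favours instances whose projected annotations are least likely to contain gaps.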
Meechan-Maddon and Nivre (2019) carry out an evaluation on cross-lingual parsing for three low-resource languages which are supported by related languages. They include three experiments: first, training a monolingual model on a small number of sentences in the target language; second, training a cross-lingual model on related source languages which is then applied to the target data; and lastly, training a multilingual model which includes target data as well as data from the related support languages. They found that training a monolingual model on the target data was always superior to training a cross-lingual model. Interestingly, they found that the best results were achieved by training a model on the various support languages as well as the target data, i.e. their multilingual model. While we do not combine the synthetic target treebanks with the source treebanks in our experiments, the results of Meechan-Maddon and Nivre (2019) motivate us to carry out this experiment in the future.

Method
We outline the process used for creating a synthetic treebank for cross-lingual dependency parsing. We use the following resources: raw Faroese sentences taken from Wikipedia, a machine translation system to translate these sentences into all source languages (Danish, Swedish, Norwegian Nynorsk and Norwegian Bokmål), a word aligner to provide word alignments between the words in the target and source sentences, treebanks for the four source languages on which to train parsing models, POS tagging and parsing tools, and, lastly, a target language test set. We use the same raw corpus, alignments and tokenized and segmented versions of the source translations as Tyers et al. (2018), who release all of their data. In this way, the experimental pipeline is the same as theirs but we predict POS tags and dependency annotations using our own models.
Target Language Corpus We use the target corpus built by Tyers et al. (2018), which comprises 28,862 sentences extracted from Faroese Wikipedia dumps using the WikiExtractor script and further pre-processed to remove any non-Faroese texts and other forms of unsuitable sentences.
Machine Translation As noted by Tyers et al. (2018), popular repositories for developing machine translation systems such as OPUS (Tiedemann, 2016) contain too few sentences to train a data-driven machine translation system for Faroese. For instance, there are fewer than 7,000 sentence pairs between Faroese and Danish, Faroese and English, Faroese and Norwegian and Faroese and Swedish. Consequently, to create parallel source sentences, Tyers et al. (2018) use a rule-based machine translation system available in Apertium to translate from Faroese to Norwegian Bokmål. There also exist translation systems from Norwegian Bokmål to Norwegian Nynorsk, Swedish and Danish in Apertium. As a result, the authors use pivot translation from Norwegian Bokmål into the other source languages.
The process is illustrated in Fig. 1. For a more thorough description of the machine translation process and for resource creation in general, see the work of Tyers et al. (2018).
Word Alignments We use word alignments between the Faroese text and the source translations generated by Tyers et al. (2018) using fast_align (Dyer et al., 2013), a word alignment tool based on IBM Model 2.
Source Treebanks We use the Universal Dependencies v2.2 treebanks (Nivre et al., 2018) to train our source parsing models. This is the version used for the 2018 CoNLL shared task on Parsing Universal Dependencies (Zeman et al., 2018).
Source Tagging and Parsing Models In order for our parsers to work well with predicted POS tags, we follow the same steps as used in the 2018 CoNLL shared task for creating training and development treebanks with automatically predicted POS tags (henceforth referred to as silver POS). Since we are required to parse translated text which only has lexical features available, we disregard lemmas, language-specific POS (XPOS) and morphological features and only use the word form and universal POS (UPOS) tag as input features to our parsers. We develop our POS tagging and parsing models using the AllenNLP library (Gardner et al., 2018). We use jackknife resampling to predict the UPOS tags for the training treebanks. We split the training treebank into ten parts, train models on nine parts and predict UPOS for the excluded part. The process is repeated until all ten parts are predicted and they are then combined to recreate the treebank with silver POS tags. Only token features are used to predict the UPOS tag. Finally, we train a model per source language on the full training data to check performance on the respective development set and to POS tag the source language translations before parsing.
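The jackknife procedure for producing silver POS tags can be sketched as below. This is an illustration under stated assumptions, not our AllenNLP setup: the most-frequent-tag tagger is a toy stand-in for the neural tagger, and assigning sentences to parts in an interleaved fashion is our own simplification.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (word form, UPOS tag)


def train_most_frequent_tagger(train: List[Sentence]) -> Callable[[str], str]:
    """Toy stand-in for a real tagger: the most frequent tag per word form."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    overall: Counter = Counter()
    for sent in train:
        for form, tag in sent:
            counts[form][tag] += 1
            overall[tag] += 1
    default = overall.most_common(1)[0][0] if overall else "X"

    def tag(form: str) -> str:
        # Unknown word forms fall back to the overall most frequent tag.
        return counts[form].most_common(1)[0][0] if form in counts else default

    return tag


def jackknife(treebank: List[Sentence], n_parts: int = 10) -> List[Sentence]:
    """Re-tag every sentence with a tagger that never saw it during training."""
    silver: List[Sentence] = [[] for _ in treebank]
    for part in range(n_parts):
        # Sentence i belongs to part i % n_parts (an illustrative split).
        train = [s for i, s in enumerate(treebank) if i % n_parts != part]
        tagger = train_most_frequent_tagger(train)
        for i, sent in enumerate(treebank):
            if i % n_parts == part:
                silver[i] = [(form, tagger(form)) for form, _ in sent]
    return silver


seen = [[("hon", "PRON"), ("ser", "VERB")] for _ in range(4)]
silver_seen = jackknife(seen, n_parts=2)
unseen = [[("a", "DET")], [("b", "NOUN")]]
silver_unseen = jackknife(unseen, n_parts=2)
```

Because each part is tagged by a model trained on the other parts, the parser sees tag errors at training time that resemble those it will meet at prediction time.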
We train two variants of parsing models. The first is a monolingual biaffine dependency parser (Dozat and Manning, 2017) trained on the individual source treebanks. The second is a polyglot model trained on all source treebanks using the multilingual parser of Schuster et al. (2019), which is the same graph-based biaffine dependency parser, extended to enable parsing with multiple treebanks. We additionally add a treebank embedding (Ammar et al., 2016; Stymne et al., 2018) to the input of the polyglot parser to help the parser differentiate between the source languages. We optimize the model for average development set LAS across the included languages. The process is illustrated in Fig. 2.
To ensure that our parser is realistic, we add a pre-trained monolingual word embedding to each monolingual parser, giving a considerable improvement in accuracy on the development sets of the source languages. We use the precomputed Word2Vec embeddings released as part of the 2017 CoNLL shared task on UD parsing (Zeman et al., 2017), which were trained on CommonCrawl and Wikipedia.
In order to use pre-trained word embeddings for the polyglot setting, we need to consider that a polyglot model uses a shared vocabulary across all input languages. In our experiments, we simply use the union of the word embeddings and average the word vector for words that occur in more than one language. Future work should explore cross-lingual word embeddings with a limited amount of parallel data or use aligned contextual embeddings as in Schuster et al. (2019).
Synthetic Source Treebanks Source translations are tokenized with UDPipe (Straka and Straková, 2017) by Tyers et al. (2018). For each source language, the POS model trained on the full training data (see previous section) is used to tag the tokenized translations. Once the text is tagged, we predict dependency arcs and labels with the parsing models of the previous section, and use annotation projection (described below) to provide syntactic annotations for the target sentences.
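The union-and-average treatment of pre-trained embeddings for the shared polyglot vocabulary can be sketched as follows. The function name and the toy two-dimensional vectors are illustrative assumptions; real vectors would be read from the pre-trained embedding files.

```python
from typing import Dict, List

Vector = List[float]


def merge_embeddings(per_language: Dict[str, Dict[str, Vector]]) -> Dict[str, Vector]:
    """Union of all vocabularies; words shared by several languages get the mean vector."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for vectors in per_language.values():
        for word, vec in vectors.items():
            if word in sums:
                sums[word] = [a + b for a, b in zip(sums[word], vec)]
                counts[word] += 1
            else:
                sums[word] = list(vec)
                counts[word] = 1
    return {word: [x / counts[word] for x in total] for word, total in sums.items()}


merged = merge_embeddings({
    "da": {"og": [1.0, 3.0], "hund": [2.0, 2.0]},
    "sv": {"og": [3.0, 5.0], "hus": [0.0, 1.0]},
})
```

Note that the monolingual embedding spaces are not aligned with each other, so the averaged vector of a shared word is only a rough compromise, which is why we describe the solution as naive.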

Annotation Projection Once the synthetic source treebanks are compiled, i.e. the translations are parsed, the annotations are then projected from the source translations to the target language using the word alignments and Tyers et al.'s projection tool, resulting in a Faroese treebank. In some cases, not all tokens are aligned and Tyers et al. (2018) work around this by falling back to a 1:1 mapping between the target index and the source index. There are also cases where there is a mismatch in length between the source and target sentences and some dependency structures cannot be projected to the target language. Tyers et al.'s projection setup removes unsuitable projected trees containing e.g. more than one root token, a token that is its own head or a token with a head outside the range of the sentence.
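The projection and filtering steps can be illustrated with the sketch below. It is a simplified stand-in and not Tyers et al.'s projection tool: the function names, the 1-based index convention, the use of -1 for an unprojectable head and the rejection of trees without any root are all our own assumptions.

```python
from typing import Dict, List


def project_heads(source_heads: List[int], tgt2src: Dict[int, int], n_target: int) -> List[int]:
    """Copy heads from a source tree to the target sentence via word alignments.

    Indices are 1-based and 0 denotes the root. Unaligned target tokens fall
    back to the source token with the same index (1:1 mapping).
    """
    src_of = {t: tgt2src.get(t, t) for t in range(1, n_target + 1)}
    tgt_of: Dict[int, int] = {}
    for t, s in src_of.items():
        tgt_of.setdefault(s, t)  # first target token mapped to each source token
    projected = []
    for t in range(1, n_target + 1):
        s = src_of[t]
        if s < 1 or s > len(source_heads):
            projected.append(-1)  # length mismatch: nothing to project
            continue
        head = source_heads[s - 1]
        projected.append(0 if head == 0 else tgt_of.get(head, -1))
    return projected


def is_valid_tree(heads: List[int]) -> bool:
    """Reject trees without exactly one root, with self-heads or out-of-range heads."""
    if sum(1 for h in heads if h == 0) != 1:
        return False
    for i, h in enumerate(heads, start=1):
        if h == i or h < 0 or h > len(heads):
            return False
    return True


same_length = project_heads([2, 0, 2], {1: 1, 2: 2}, n_target=3)
reordered = project_heads([2, 0, 2], {1: 2, 2: 1, 3: 3}, n_target=3)
too_long = project_heads([2, 0, 2], {}, n_target=4)
```

In this sketch a length mismatch surfaces as a head of -1, so the same validity check that removes multi-root trees also discards sentences that could not be fully projected.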
Multi-source Projection For multi-source projection, the four source-language dependency trees for a Faroese sentence are projected into a single graph, scoring edges according to the number of trees that contain them (Sagae and Lavie, 2006; Nivre et al., 2007). The dependency structure is first built by voting over the directed edges. Afterward, dependency labels and POS tags are decided using the same voting procedure. The process is illustrated in Fig. 3.
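The voting scheme can be sketched as follows. Edge scores are vote counts, as described above. For brevity, decoding is done by exhaustive search over head assignments, which finds the same maximum-scoring tree as a spanning-tree decoder but is only feasible for very short sentences; it is a stand-in, not the decoder used in the actual pipeline. The function names and the rule of voting labels among the trees that support the chosen edge are our own assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence, Tuple

Tree = List[Tuple[int, str]]  # per token: (head, label); heads are 1-based, 0 = root


def edge_votes(trees: List[Tree]) -> Dict[Tuple[int, int], int]:
    """Score each (head, dependent) edge by the number of trees containing it."""
    votes: Counter = Counter()
    for tree in trees:
        for dep, (head, _) in enumerate(tree, start=1):
            votes[(head, dep)] += 1
    return dict(votes)


def is_tree(heads: Sequence[int]) -> bool:
    """Every token must reach the artificial root 0 without entering a cycle."""
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def decode_exhaustive(votes: Dict[Tuple[int, int], int], n: int) -> List[int]:
    """Brute-force search for the tree with the highest total vote count."""
    best, best_score = [], -1
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        score = sum(votes.get((h, d), 0) for d, h in enumerate(heads, start=1))
        if score > best_score:
            best, best_score = list(heads), score
    return best


def vote_labels(trees: List[Tree], heads: List[int]) -> List[str]:
    """Majority label per token, preferring trees that support the chosen edge."""
    labels = []
    for dep, head in enumerate(heads, start=1):
        supporting = [t[dep - 1][1] for t in trees if t[dep - 1][0] == head]
        pool = supporting or [t[dep - 1][1] for t in trees]
        labels.append(Counter(pool).most_common(1)[0][0])
    return labels


trees = [
    [(2, "nsubj"), (0, "root"), (2, "obj")],
    [(2, "nsubj"), (0, "root"), (1, "obj")],
    [(3, "nsubj"), (0, "root"), (2, "obj")],
]
votes = edge_votes(trees)
heads = decode_exhaustive(votes, n=3)
labels = vote_labels(trees, heads)
```

POS tags could be decided with the same majority vote over the tags proposed by the individual projections.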
Target Tagging and Parsing Models At this stage we have Faroese treebanks to train our POS tagging and parsing models. The Faroese treebanks come in two variants: the result of projection from source trees produced by either 1) a monolingual, or 2) the polyglot model. For each case, we train our POS tagging and parsing models directly on these synthetic treebanks and do not make use of word embeddings as we do not have them for Faroese.
Multi-treebank Target Parsing Since we have several synthetic Faroese treebanks, we have the option of training on a single treebank or using a multi-treebank approach where we train on all target treebanks in the same way as we did for inducing the polyglot source model. The process is illustrated in Fig. 4. When training a multi-treebank target model, for each target treebank, we add a treebank embedding denoting the source model used to project annotations to the target treebank. At prediction time, we must include one of these treebank embeddings as input to the model. As we do not have real Faroese data in our target training treebanks, we must choose the treebank embedding of one of the synthetic target treebanks. Stymne et al. (2018) introduce the term "proxy treebank" to refer to cases where the test treebank is not in the training set and a treebank embedding from the training set must be used instead.
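The role of the treebank embedding and the proxy treebank can be illustrated with the sketch below. It only shows how a token representation could be assembled and is not the parser implementation: the treebank identifiers, the embedding dimension of 12 and the random initialisation are hypothetical choices made for the example.

```python
import random
from typing import Dict, List, Optional

Vector = List[float]


def make_treebank_embeddings(treebank_ids: List[str], dim: int, seed: int = 0) -> Dict[str, Vector]:
    """One randomly initialised vector per training treebank (trained jointly in practice)."""
    rng = random.Random(seed)
    return {tb: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for tb in treebank_ids}


def token_input(word_vec: Vector, treebank_id: str, tb_emb: Dict[str, Vector],
                proxy: Optional[str] = None) -> Vector:
    """Concatenate a word vector with the embedding of its treebank.

    If the treebank was not seen in training (e.g. the gold Faroese test set),
    the embedding of a proxy treebank from the training set is used instead.
    """
    if treebank_id not in tb_emb:
        if proxy is None:
            raise KeyError(f"unknown treebank {treebank_id!r} and no proxy given")
        treebank_id = proxy
    return list(word_vec) + tb_emb[treebank_id]


# Hypothetical identifiers for the five synthetic Faroese treebanks.
tb_emb = make_treebank_embeddings(["fo_da", "fo_sv", "fo_nn", "fo_nb", "fo_multi"], dim=12)
train_vec = token_input([1.0] * 100, "fo_da", tb_emb)
test_vec = token_input([0.0] * 100, "fo_gold_test", tb_emb, proxy="fo_nn")
```

Each choice of proxy yields a different set of predictions for the same test sentences, which is why results are reported per proxy treebank.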

Experiments
In this section, we describe our experiments, which include a replication of the main findings of Tyers et al. (2018), using AllenNLP (Gardner et al., 2018) for POS tagging and parsing instead of UDPipe (Straka and Straková, 2017).

Details
The hyper-parameters of our POS tagging and parsing models are given in Table 1. For POS tagging, we adopt a standard architecture with a word and character-level Bi-LSTM (Plank et al., 2016; Graves and Schmidhuber, 2005) to learn context-sensitive representations of our words. These representations are passed to a multilayer perceptron (MLP) classifier followed by a softmax function to choose the tag with the highest probability. For both the POS tagging and parsing models, we use a word embedding dimension of size 100 and a character embedding dimension of size 64. POS tag embeddings of dimension 50 are included in the parser. We train our Faroese models for fifty epochs. We do not split the synthetic Faroese treebanks into training/development portions, though we suspect that doing so would help the models avoid overfitting the training data. For all experiments we report labelled attachment scores (LAS) produced by the official CoNLL 2018 evaluation script (https://github.com/ufal/conll2018/blob/master/evaluation_script/conll18_ud_eval.py).

Results
The development results of our monolingual and polyglot models on the source language treebanks are shown in Table 2. The results for the polyglot model are better for three out of four source languages, whereas for no_nynorsk, the monolingual model marginally outperforms the polyglot one. These results suggest that the polyglot model will contribute better syntactic annotations for Faroese treebanks.
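As a reminder of what the reported metric measures, the following is a minimal sketch of a labelled attachment score for the gold-tokenization case. It is not the official evaluation script, which additionally aligns system and gold tokens; ignoring label subtypes reflects our understanding of how that script computes LAS and should be read as an assumption.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label)


def las(gold: List[Arc], pred: List[Arc]) -> float:
    """Percentage of tokens whose head and (subtype-stripped) label are both correct."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sentences must have the same tokens")
    if not gold:
        return 0.0
    correct = sum(
        1
        for (g_head, g_label), (p_head, p_label) in zip(gold, pred)
        if g_head == p_head and g_label.split(":")[0] == p_label.split(":")[0]
    )
    return 100.0 * correct / len(gold)


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod:poss")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "nmod")]
score = las(gold, pred)
```

In the example, three of the four tokens receive the correct head and label, so the sketch yields a score of 75.0.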
The statistics of the filtered Faroese treebanks obtained via projection with our source parsing models are given in Table 3. The treebank sizes are fairly similar regardless of whether source annotations are provided by a monolingual or a polyglot model, which is expected because the word alignments are the major factor in determining whether a projection is successful. There is a proportionally lower number of sentences for multi-source projection. This is because this method only uses the intersection of sentences which are present across all synthetic treebanks after filtering. The treebank originating from Norwegian Bokmål has the highest number of valid sentences, suggesting that it could be a good candidate for projection to Faroese. It also has the highest source language parsing accuracy (Table 2).
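The reason the multi-source treebank is smaller can be made concrete with a short sketch. Sentence identifiers and function names are illustrative assumptions; the second function restricts a single-source treebank to the sentences eligible for multi-source projection.

```python
from typing import Dict, List, Set


def multi_source_candidates(valid_ids_by_source: Dict[str, Set[str]]) -> Set[str]:
    """Sentences that survived filtering in every single-source treebank."""
    id_sets = list(valid_ids_by_source.values())
    return set.intersection(*id_sets) if id_sets else set()


def restrict(treebank_ids: List[str], eligible: Set[str]) -> List[str]:
    """Keep only the sentences of one treebank that are eligible for multi-source projection."""
    return [sent_id for sent_id in treebank_ids if sent_id in eligible]


valid = {
    "da": {"s1", "s2", "s4"},
    "sv": {"s1", "s2", "s3"},
    "nn": {"s1", "s2", "s3", "s4"},
    "nb": {"s1", "s2", "s3", "s4"},
}
eligible = multi_source_candidates(valid)
nb_subset = restrict(["s1", "s2", "s3", "s4"], eligible)
```

A sentence that fails projection for a single source language is lost to the multi-source treebank even when the other three projections are valid.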
The results of training on our various synthetic Faroese treebanks and predicting the Faroese test set are shown in the first result column of Table 4 (SINGLE). In terms of monolingual vs. polyglot, we find that projecting from a polyglot model helps with four out of the five possible treebanks (with three of them being statistically significant; statistical significance is tested with udapi-python). The polyglot model was outperformed by the monolingual model using Norwegian Nynorsk for projection, though the difference is not statistically significant. On the source side, the monolingual Norwegian Nynorsk model also performed slightly better than the polyglot model (Table 2). This observation supports the intuition that the quality of the projected annotations can be improved by contributing better source annotations, i.e. improving the source model(s) is one way to improve performance of the target model. This is supported by the fact that the source language with the highest LAS (Norwegian Bokmål) is also the best choice for projection (in this single target model setting).
The multi-source approach was not that effective in our case, and some individual sources were able to surpass this combination approach. One could argue that this may be due to the lower amount of training data when using the multi-source treebank. We test this hypothesis by only including those sentences which contributed to multi-source projection in the single-source synthetic treebanks. The results are given in Table 5. Comparing the results in Tables 4 and 5, we see that LAS scores tend to be slightly lower than on the version which included all target sentences, indicating that we did lose some information by filtering out a large number of sentences. However, Norwegian Nynorsk still outperforms the multi-source model for the monolingual setting and both Norwegian models perform better than the multi-source model in the polyglot setting, suggesting that size alone does not explain the under-performance of the multi-source model. It is also worth noting that polyglot training is superior to all monolingual models, which hints that for no_nynorsk (the previously better performing model), the monolingual model was not able to achieve its full potential with the reduced data while the polyglot model was able to provide richer annotations.
Table 5: LAS scores between target models trained on the subset of sentences eligible for multi-source projection (with annotations from the stated source).
Another reason why the multi-source model does not work as well in our experiments as it does in those of Tyers et al. (2018) might be that we use pre-trained embeddings whereas Tyers et al. (2018) do not. In this way, our monolingual models are stronger and likely do not benefit as much from voting.
The second result column (MULTI) of Table 4 shows the effect of training a multi-treebank POS tagger and parser on the Faroese treebanks created by each of the four source languages as well as the treebank which is produced by multi-source projection. This experiment is orthogonal to the experiment using a polyglot model on the source side and so we also test a combination of polyglot source-side parsing and multi-treebank target-side parsing. We see improvements over the single treebank setting for all cases.
Table 6 places our systems in the context of previous results on the same Faroese test set. The highest scoring system in the 2018 CoNLL shared task was that of Rosa and Mareček (2018), who achieved an LAS of 49.4 on the Faroese test set. Note that they use predicted tokenization and segmentation whereas our experiments and Tyers et al.'s use gold tokenization and segmentation, which provides a small artificial boost. Tyers et al. (2018) report an LAS of 64.43 with a monolingual multi-source approach. Our implementation, which uses a different parser (AllenNLP versus UDPipe) and pre-trained word embeddings, achieves an LAS of 68. Our highest score of 71.51 is achieved through the combination of projecting from strong monolingual source models and then training multi-treebank POS tagging and parsing models on the outputs.

Conclusion
We have presented parsing results on Faroese, a low-resource language, using annotation projection from multiple monolingual sources versus a single polyglot model. We also extended the idea of multi-treebank learning to the target treebanks.
The results of our experiments show that the use of a polyglot source model helps in four out of five cases using single treebank target models. The two source languages that have lowest LAS when using monolingual parsers, namely Danish and Swedish, see significant improvements when switching to a polyglot model. Our best performing single target model is trained on Faroese trees projected from Norwegian Bokmål trees produced by a polyglot model. However, the strongest language with monolingual modelling, Norwegian Nynorsk, does not benefit from switching to a polyglot model. When we filtered the target treebank to the subset of sentences selected by multi-source projection, the polyglot model is superior to all five monolingual models, even outperforming the Norwegian Nynorsk model. One explanation of the improvements seen with polyglot modelling is that it introduces a new interaction point for cross-lingual features via the feature extractor of the polyglot parser. With monolingual source models, cross-lingual features only interact indirectly in the graph-decoding stage of multi-source projection.
We also applied the multi-treebank approach to the target-side POS tagger and parser and see improvements for all settings; we found that it always helps to also perform multi-treebank training for the POS tagger. The overall best result is with the setting that uses monolingual source models to create the source trees that are projected to Faroese and combined in a multi-treebank model. The proxy treebank for the multi-treebank model is the treebank that also gave best results with single treebank target models, projected from Norwegian Nynorsk.
We presented a simple solution to deal with using multiple pre-trained embeddings in a model with a shared vocabulary. It was a rather naïve solution and we want to explore the use of available cross-lingual word embedding tools. Additionally, the use of contextual embeddings such as ELMo (Peters et al., 2018) or multilingual BERT (Devlin et al., 2019) would likely provide better representations, with the effect of contributing better annotations for the target language. Indeed, recent work has already shown promising results in this area (Schuster et al., 2019; Kondratyuk, 2019).
In the multi-source projection experiments, our filtering criterion is based on whether the sentence was present across all target treebanks; more sophisticated approaches could be used to select better training instances, as in Plank and Agić (2018).
More generally, we would like to investigate how our findings might change when the number of source languages or treebanks is changed and how the observations carry over to languages other than Faroese. It would also be interesting to use multiple sources of arc weights in a dense graph as in Agić et al. (2016) but with models induced from training on multiple source languages together. To work with language pairs with more deviating word orders and/or translations that are not word-for-word, the choice of word alignment algorithm and the projection algorithm may have to be revised.

Figure 1: Overview of the machine translation process. The Faroese sentences are first translated into Norwegian Bokmål and then from Norwegian Bokmål into the other source languages (pivot translation).

Figure 2: Overview of the monolingual and polyglot parse experiments using Swedish translations as an example. This process is repeated for all source languages.

Figure 3: Multi-source projection. The source language is listed in brackets.

Figure 4: Single versus multi-treebank training. The source language is listed in brackets.

Table 1: Chosen hyperparameters for our POS tagging and parsing models. "both" means the feature is common to both the POS tagger and parser.
Statistical significance is tested with udapi-python

Table 2: Source model LAS scores on the development treebanks using silver POS tags.

Table 3: The number of valid sentences in the Faroese synthetic treebank for each source language after annotation projection and sentence filtering.

Table 4: LAS on the target Faroese test treebank. Single refers to using a single synthetic Faroese treebank to train a Faroese model; Multi uses both a multi-treebank POS tagger and a multi-treebank parser with all synthetic Faroese treebanks. The multi-treebank model is tested with each of the five training treebanks (four projected from individual source languages and one using multi-source projection) as proxy treebank. Statistically significant differences between the monolingual and polyglot setting are indicated by † for each result pair, excluding averages.