Multi-source morphosyntactic tagging for spoken Rusyn

This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as tagging toolkit, we show that a tagger trained on a balanced set of the four source languages outperforms single language taggers by about 9%, and that additional automatically induced morphosyntactic lexicons lead to further improvements. The best observed accuracies for Rusyn are 82.4% for part-of-speech tagging and 75.5% for full morphological tagging.


Introduction
This paper addresses the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn by leveraging the resources available for the neighboring, etymologically related languages. Due to the lack of annotated and parallel Rusyn data, we propose to create Rusyn taggers by combining training data from related resource-richer languages such as Ukrainian, Polish, Slovak and Russian.
We start by giving a brief introduction to the characteristics of Rusyn and present related work in the domain of low-resource language tagging. After describing the training and test data, we present a set of experiments on different multi-source tagging approaches. In particular, we investigate the impact of majority voting, Brown clustering, training corpus adaptation, and the addition of automatically induced morphosyntactic lexicons. Finally, we give an outlook on future work.

Status of Rusyn and corpus data
Rusyn is a Slavic linguistic variety spoken predominantly in Transcarpathian Ukraine, Eastern Slovakia, and Southeastern Poland, and is linguistically close to the Ukrainian language. Its sociolinguistic status is disputed: some scholars see Rusyn as a dialect of Ukrainian, while others claim it to be an independent language, the fourth East Slavic one. Despite its closeness to Ukrainian, Rusyn exhibits numerous distinct features on all linguistic levels, which make Rusyn look more "West Slavic" than Ukrainian. Nowadays, most speakers of Rusyn are bilingual and have native-like command of, e.g., Polish or Slovak. This has an impact on their Rusyn speech and leads to new divergences within the old Rusyn dialect continuum, which can be investigated using the Corpus of Spoken Rusyn (www.russinisch.uni-freiburg.de/corpus), which is currently being built. The corpus comprises several hours of transcribed Rusyn speech from the different countries where Rusyn is spoken. This means that both diatopic and individual speaker variation is reflected in the transcription, which is one reason why the corpus data is orthographically (and morphologically) heterogeneous. Another reason is that variation in transcription practices across individual transcribers could not be avoided completely.
The goal of the research presented here is to automatically provide morphosyntactic annotations for the Corpus of Spoken Rusyn. However, there are virtually no NLP resources (annotated corpora or tools) available for Rusyn at the moment. The different types of variation present in the data complicate the task of developing NLP tools even more. Crucially, there is no parallel corpus available for Rusyn, which means that the popular projection-based approaches cannot be applied (see below).
Considering the lack of annotated Rusyn data and the etymological situation of Rusyn, our approach consists in training taggers for several related languages (namely, the East Slavic languages Ukrainian and Russian and the West Slavic languages Polish and Slovak) and combining and adapting them to Rusyn. This multi-source setting makes sense because the Rusyn dialect continuum features both West Slavic and East Slavic linguistic traits to varying degrees, depending on both the dialect region and the impact of the respective umbrella language. To get an idea of the similarities and differences of the Slavic languages involved, compare the different versions of John 1:1 in Figure 1.

Related work
The task of creating taggers for languages lacking manually annotated training corpora has inspired a lot of recent research. The most popular line of work, initiated by Yarowsky and Ngai (2001), draws on parallel corpora. They annotate the source side of a parallel corpus with an existing tagger, and then project the tags along the word alignment links onto the target side of the parallel corpus. A new tagger is then trained on the target side, with some smoothing to reduce the noise caused by alignment errors. Follow-up work has focused on the inclusion of several source languages (Fossum and Abney, 2005), more accurate projection algorithms (Das and Petrov, 2011; Duong et al., 2013), the integration of external lexicon sources (Li et al., 2012; Täckström et al., 2013), the extension from part-of-speech tagging to full morphological tagging (Buys and Botha, 2016), and the investigation of truly low-resource settings by resorting to Bible translations (Agić et al., 2015). A related approach (Aepli et al., 2014) uses majority voting to disambiguate tags proposed by several source languages. However, these projection approaches are not suited to our setting, as no parallel corpora (not even the Bible) are electronically available for Rusyn.
Another approach consists in training a model for one language and applying it to another, closely related language. In this process, the model is trained not to focus on the exact shape of the words, but on more generic, language-independent cues, such as part-of-speech tags for parsing (Zeman and Resnik, 2008), or word clusters for part-of-speech tagging (Kozhevnikov and Titov, 2014). A related idea consists in translating the words of the model to the target language, either using a hand-written morphological analyzer and a list of cognate word pairs (Feldman et al., 2006), or using bilingual dictionaries extracted from parallel corpora (Zeman and Resnik, 2008) or induced from monolingual corpora (Scherrer, 2014).
Our work mostly follows the second approach: we train taggers on four resource-rich Slavic languages and adapt them to Rusyn using a variety of techniques.

Training data
While morphosyntactically annotated corpora exist for all four source languages, e.g. in the form [...] Fortunately, since version 1.4, the Universal Dependencies project contains treebanks for the four relevant languages with unified part-of-speech tags and morphosyntactic descriptions (Nivre et al., 2016; Zeman, 2015). Two corpora are available for Russian, but the Ukrainian corpus is still rather small (see Table 1). Additionally, we were able to obtain more Ukrainian data developed by the non-governmental Institute of Ukrainian and planned to be included in one of the upcoming Universal Dependencies releases; we converted these additional data from the MultextEast-style tags to universal tags and morphological features.
As Rusyn is written in Cyrillic script, we converted the Slovak and Polish corpora into Cyrillic script. During this process, we applied certain transformation rules in order to "rusynify" our training data (e.g., transform Polish ć to Cyrillic ть or Polish ą to Cyrillic у, which is in line with well-known historical phonological processes). [...] MULTEXT-East project, do not have a positive impact on Rusyn tagging. We therefore do not include these additional resources (except for the derived lexicons discussed in Section 5.5).
We evaluate our methods on a small hand-annotated sample of Rusyn containing 104 sentences, 1 050 tokens, and 96 distinct tags (henceforth RUE1). At the time of conducting the experiments, the Corpus of Spoken Rusyn (RUE2), which we aim to annotate with the presented methods, contains 5 922 sentences with 75 201 tokens. We also report OOV rates on the latter and use it as additional unlabeled data for some of the adaptation processes described below.
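The script conversion with "rusynifying" rules mentioned at the start of this section can be illustrated with a minimal sketch. Only the two rules ć → ть and ą → у are stated in the text; the plain letter mappings below are read off the examples prezydent → презыдент and prezenty → презенты given later in the paper. The full rule table is not reproduced here, and the names `RULES` and `to_cyrillic` are hypothetical.

```python
# Ordered rewrite rules: the two attested "rusynifying" rules come first.
# The single-letter mappings are inferred from the paper's own examples
# (prezydent, prezenty); any other character is passed through unchanged,
# because the complete rule table is not given in the text.
RULES = [
    ("ć", "ть"),
    ("ą", "у"),
    ("p", "п"), ("r", "р"), ("e", "е"), ("z", "з"),
    ("y", "ы"), ("d", "д"), ("n", "н"), ("t", "т"),
]


def to_cyrillic(word, rules=RULES):
    """Apply the first matching rule at each position, left to right."""
    out = []
    i = 0
    while i < len(word):
        for src, tgt in rules:
            if word.startswith(src, i):
                out.append(tgt)
                i += len(src)
                break
        else:
            # No rule applies: keep the character as it is.
            out.append(word[i])
            i += 1
    return "".join(out)
```

Keeping the rules in an ordered list makes it easy to place context-specific or multi-character rules before the generic letter mappings.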

The MarMoT tagger
We use the MarMoT tagger for all of our experiments. MarMoT (Mueller et al., 2013) is a state-of-the-art toolkit for morphological tagging based on Conditional Random Fields (CRFs). It has been shown to work well on full morphological tagging with hundreds of tags (as opposed to part-of-speech tagging, which typically only uses a few dozen tags), thanks to pruning and coarse-to-fine decoding. Unless stated otherwise, we use the default parameters for morphological tagging.
We evaluate the different models on the development sets of the five source corpora as well as on RUE1. A token is considered correctly tagged if its part-of-speech tag is correct and if all morphological features present in the gold annotation are found with the same value.
Table 2: Tagging accuracies and OOV rates for single-language taggers. Rows represent models, columns represent test sets.
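The correctness criterion for a token can be made concrete with a short sketch. It assumes that features are written in the Universal Dependencies style (`Case=Nom|Number=Sing`, with `_` for an empty set); the function names are hypothetical and this is not the evaluation script used in the experiments.

```python
def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' or '' means no features."""
    if feats in ("", "_"):
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def is_correct(gold, pred):
    """gold and pred are (pos, feats) pairs.

    The token counts as correct if the part-of-speech tags match and every
    feature of the gold annotation occurs in the prediction with the same
    value. Additional predicted features are not penalised.
    """
    gold_pos, gold_feats = gold
    pred_pos, pred_feats = pred
    if gold_pos != pred_pos:
        return False
    predicted = parse_feats(pred_feats)
    return all(predicted.get(k) == v for k, v in parse_feats(gold_feats).items())


def accuracy(gold_tokens, pred_tokens):
    """Share of tokens that satisfy the criterion above."""
    hits = sum(is_correct(g, p) for g, p in zip(gold_tokens, pred_tokens))
    return hits / len(gold_tokens)
```

Note that the criterion is asymmetric: a prediction with a superfluous feature still counts as correct, whereas a prediction lacking a gold feature does not.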

Single-language taggers
We start by training five distinct taggers on the five training corpora and apply these taggers to the five source-language test corpora as well as to the Rusyn corpora. The results are shown in Table 2. Unsurprisingly, each test set is best tagged with the tagger based on its own training set. Polish and Russian fare somewhat better than Slovak and Ukrainian. The differences between RU1 and RU2 give an indication of the loss resulting from annotation/conversion differences as well as domain differences within the same language. For Rusyn, the best accuracy is obtained using the Ukrainian tagger, which is in line with the claims on linguistic proximity made above, followed by RU2, which is due to its large size rather than to small etymological distance. Note also that Rusyn is not the worst-performing test language for any of the models, hinting at its role as a bridge language between East and West Slavic.
In order to quantify the reliability of the Rusyn tagging results given the somewhat small test corpus, we split it into two equally-sized parts and computed the accuracies on both parts. The deviation of the accuracy values of these parts from the mean accuracy is indicated after the ± sign in Table 2.
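The split-half reliability estimate can be sketched as follows. The sketch assumes that the test corpus is split in corpus order and that per-token correctness flags are already available; how the two parts were actually chosen is not stated, and the function name is hypothetical.

```python
def split_half_report(correct_flags):
    """Return (mean accuracy, deviation) over two halves of the test set.

    correct_flags holds one boolean per token. The list is cut in the middle
    (assumption: corpus order); for an odd length the second part is one
    token longer.
    """
    half = len(correct_flags) // 2
    parts = [correct_flags[:half], correct_flags[half:]]
    accuracies = [sum(part) / len(part) for part in parts]
    mean = sum(accuracies) / len(accuracies)
    # With two parts, both deviate from the mean by the same amount.
    deviation = abs(accuracies[0] - mean)
    return mean, deviation
```

With two equally-sized parts, the value after the ± sign is simply half the difference between the two part accuracies.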
While no single-language tagger achieves satisfactory accuracy on Rusyn, the results suggest that a combination of the five taggers (or of their training data) could yield improved accuracy on Rusyn. There are essentially two ways of combining taggers: using the five source language taggers and choosing the majority vote, or using a single tagger trained on merged data from the five source corpora.
Aepli et al. (2014) develop a tagger for Macedonian by transferring morphosyntactic annotations from multiple source languages by word alignment, choosing one annotation by majority vote, and training a new tagger on the annotated corpus. We follow a similar method. We start by annotating the Rusyn data with the five source language taggers. A majority annotation is determined in two steps: first, the majority part-of-speech tag is determined, and second, the majority morphological features are determined on the basis of the taggers that have predicted the majority part-of-speech tag. We propose two ways of dealing with ties: we either randomly resolve ties (Random) or weight the tags on the basis of a priori knowledge about the etymological distances of the languages (Weighted). We report results on this direct annotation (see Table 3, rows MAJ-D), but also use the annotated RUE2 corpus to retrain a new tagger (see Table 3, rows MAJ-R). Only the weighted method yields tagging accuracies comparable to those of the best single-language tagger. The impact of retraining is negative, probably due to the fact that the OOV rate on RUE1 hardly decreases. While we could have tuned the weights of the majority-vote models to further improve their accuracy, this option did not look worthwhile in the light of the better results obtained with the approaches discussed below.
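The two-step majority vote with its two tie-breaking strategies can be sketched as follows. Two points are assumptions of this sketch rather than details given in the text: the morphological features are voted on as one bundle (not feature by feature), and the weights in the usage example are invented for illustration. The function names are hypothetical.

```python
import random
from collections import Counter


def vote(labelled, weights=None, rng=None):
    """Pick the majority label from (source, label) pairs.

    Ties are resolved by the summed source weights if weights are given
    (Weighted), otherwise by a random choice among the tied labels (Random).
    """
    counts = Counter(label for _, label in labelled)
    top = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top)
    if len(tied) == 1:
        return tied[0]
    if weights is not None:
        def weight_of(label):
            return sum(weights[src] for src, lab in labelled if lab == label)
        return max(tied, key=weight_of)
    return (rng or random).choice(tied)


def majority_tag(predictions, weights=None, rng=None):
    """predictions maps a source tagger name to its (pos, feats) output."""
    # Step 1: majority part-of-speech tag over all taggers.
    pos = vote([(src, p) for src, (p, _) in predictions.items()], weights, rng)
    # Step 2: majority feature bundle among the taggers that chose that tag.
    agreeing = [(src, f) for src, (p, f) in predictions.items() if p == pos]
    feats = vote(agreeing, weights, rng)
    return pos, feats
```

For example, with predictions `{"UK": ("NOUN", "Case=Nom"), "RU1": ("NOUN", "Case=Acc"), "RU2": ("VERB", "_"), "PL": ("VERB", "_"), "SK": ("ADJ", "_")}` and illustrative weights that favour Ukrainian, the tie between NOUN and VERB and the subsequent tie between the two case values are both resolved in favour of the Ukrainian tagger.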

Creating multi-source taggers
For the multi-source tagger, we concatenate the five training sets, using only the first 10% of RU2 in order to keep the distribution better balanced. As shown in Table 3 (row MS), this simple combination of training resources yields better accuracy than all majority-vote systems and outperforms the best single-language model (UK) by nearly 9%, although with a high variance between the two parts of the corpus. If only parts-of-speech are evaluated, the multi-source tagger achieves an accuracy of 79.2%, compared to 69.7% for the best single-language model (UK).
Following e.g. Owoputi et al. (2013), we include word clusters as an additional feature for tagging. We obtain hierarchical word clusters (c=1 000) with the Brown clustering algorithm (Brown et al., 1992) on the concatenation of all source language and Rusyn texts (1.5M running tokens), and add the clusters as an additional feature to the tagger. This addition yields small improvements for some source languages and for Rusyn (see Table 3, row MS-B), although the latter impact is inconclusive due to the high variance between the two corpus parts. We observe that all word clusters contain words from more than one language, suggesting that the clustering algorithm generalizes well over data from different languages. While larger amounts of unlabeled data will undoubtedly further improve source language tagging, it is less clear whether this will also have a positive impact on Rusyn tagging. In any case, larger Rusyn corpora will be hard to come by.
The idea behind tagger combination was that a lot of Rusyn words can be found in one of the source languages. This has been confirmed, as the OOV rates of the combined taggers (around 24% for Rusyn, see Table 3, rows MAJ-D and MS) are much lower than those of the single-language taggers (between 37% and 54% for Rusyn, see Table 2). However, we assume that even more Rusyn words could be found in a source language if some transformations were applied. In the following two subsections, we investigate two different approaches.
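The construction of the balanced multi-source training set can be sketched as below. The sketch assumes that "the first 10%" is counted in sentences (it could equally be counted in tokens) and that each corpus is a list of sentences; the function and parameter names are hypothetical.

```python
def build_multisource(corpora, keep_percent=None):
    """Concatenate training corpora, optionally truncating some of them.

    corpora maps a corpus name to a list of sentences. keep_percent maps a
    corpus name to the percentage of its leading sentences to keep; corpora
    not listed there are used in full.
    """
    keep_percent = keep_percent or {}
    merged = []
    for name, sentences in corpora.items():
        percent = keep_percent.get(name, 100)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_keep = len(sentences) * percent // 100
        merged.extend(sentences[:n_keep])
    return merged
```

A call such as `build_multisource(corpora, {"RU2": 10})` then keeps the other four corpora untouched and prevents the large Russian corpus from dominating the merged training data.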
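The effect of combining training data on the OOV rate can be illustrated with a small sketch. It assumes a token-based, case-sensitive definition of OOV; the exact definition used for the reported figures is not spelled out, and the toy tokens in the usage note are invented.

```python
def oov_rate(train_tokens, test_tokens):
    """Share of test tokens whose word form never occurs in the training data."""
    vocabulary = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return unseen / len(test_tokens)
```

Because the vocabulary of a merged corpus is the union of the individual vocabularies, `oov_rate(train_a + train_b, test)` can never exceed `oov_rate(train_a, test)`.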

Adding automatically induced lexicons
In Rabus and Scherrer (2017), we describe the automatic induction of morphosyntactic lexicons for Rusyn. In a nutshell, we match Rusyn words extracted from RUE1 and RUE2 with source language words extracted from the Polish, Slovak, Ukrainian and Russian MULTEXT-East lexicons as well as the morphological dictionary of UGtag (Kotsyba et al., 2011), using vowel-sensitive Levenshtein distance, hand-written rules, and a combination of both. The Rusyn words are then associated with the morphosyntactic descriptions of the matched source-language words. The resulting lexicon contains 51 600 token-tag tuples when induced with Levenshtein distance, and 28 900 tuples when induced with rules. Table 3 (rows LEX-R and LEX-L) reports tagging results, where one of the induced lexicons is added to the multi-source tagger. As expected, the OOV rates drop considerably. Both the rule-induced and the Levenshtein-induced lexicon improve accuracy, the latter by 3.5% to 75.5%, the best observed result. Moreover, these results are stable between the two parts of the RUE1 corpus, with only 0.2% difference for the rule-induced lexicon and less than 0.1% difference for the Levenshtein-induced lexicon. If evaluated on the parts-of-speech only, the accuracies increase from 79.2% to 81.3% for the rule-induced lexicon and to 82.4% for the Levenshtein-induced lexicon. Combinations of rule-induction and Levenshtein-induction do not lead to further tagging improvements with respect to the Levenshtein model.
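The Levenshtein-based variant of the lexicon induction can be sketched as follows. This is a simplified illustration, not the procedure of Rabus and Scherrer (2017): it uses plain Levenshtein distance normalised by the length of the longer word instead of the vowel-sensitive variant, it applies the 0.25 threshold mentioned for relative Levenshtein distance, and it keeps all readings of the closest source words. All function names and the toy tags in the usage note are hypothetical.

```python
def levenshtein(a, b):
    """Standard edit distance with unit costs (not the vowel-sensitive variant)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                      # deletion
                               current[j - 1] + 1,                   # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def relative_distance(a, b):
    """Edit distance divided by the length of the longer word (assumption)."""
    return levenshtein(a, b) / max(len(a), len(b), 1)


def induce_lexicon(rusyn_words, source_lexicon, threshold=0.25):
    """Map each Rusyn word to the tags of its closest source-language words.

    source_lexicon maps a source word to a set of morphosyntactic tags.
    Matches with a relative distance above the threshold are discarded.
    """
    induced = {}
    for word in rusyn_words:
        distances = {src: relative_distance(word, src) for src in source_lexicon}
        if not distances:
            continue
        best = min(distances.values())
        if best > threshold:
            continue
        tags = set()
        for src, dist in distances.items():
            if dist == best:
                # Keep every reading of every equally close source word.
                tags |= source_lexicon[src]
        induced[word] = tags
    return induced
```

For instance, with a toy lexicon containing презенты and презыдент, the Rusyn form презенті is one substitution away from презенты (relative distance 0.125) and inherits its tags, while a word with no close source counterpart receives no entry.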

Adapting the corpora to Rusyn
An alternative to adding Rusyn data in the form of lexicons is to modify the source language training corpora directly by making them look more Rusyn-like. The idea behind this method is to provide the tagger with additional Rusyn tokens in sentential context. We proceed as follows: for each source language word, we search for the most similar Rusyn word in the RUE1 and RUE2 corpora, again using Levenshtein distance or the hand-written rules. If the most similar Rusyn word is different from the source word, we replace the source word with the former. For relative Levenshtein distance, we introduce a threshold at 0.25, as already in the lexicon induction experiments, above which word matches are considered noise and are discarded. As the number of known Rusyn words is small in comparison with the number of source words, there is a risk of replacing a source word by an unrelated Rusyn word because the related one simply is not known. We therefore prevent the replacement whenever another source word is closer to the Rusyn candidate. For example, the word презыдент in the Polish corpus (converted from prezydent 'president') would be replaced by the most similar Rusyn word, презенті, which is unrelated. This replacement is blocked because another Polish word, презенты (< prezenty 'gifts'), is even closer to презенті. When more than one Rusyn word exists with the same distance, no replacement takes place. Such ties are more frequent with Levenshtein distance, where they concern 3-5% of tokens, than with the rules, where they concern 1-3% of tokens. In the end, between 8% and 12% of source tokens are replaced with Levenshtein, and between 1% and 5% of source tokens with the rules.
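The replacement procedure with its three safeguards (distance threshold, tie check, blocking by a closer source word) can be sketched as follows. As before, plain Levenshtein distance normalised by the longer word stands in for the measure actually used, and the function names are hypothetical. Under this simplified measure, the pair презыдент/презенті has a relative distance of 3/9 and would already fall above the 0.25 threshold, so the usage note raises the threshold to show the blocking step in isolation.

```python
def levenshtein(a, b):
    """Standard edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def relative_distance(a, b):
    """Edit distance divided by the length of the longer word (assumption)."""
    return levenshtein(a, b) / max(len(a), len(b), 1)


def replacement_for(word, source_vocab, rusyn_vocab, threshold=0.25):
    """Return the Rusyn word that should replace `word`, or None."""
    if not rusyn_vocab or word in rusyn_vocab:
        return None  # nothing to compare with, or already a Rusyn form
    distances = {r: relative_distance(word, r) for r in rusyn_vocab}
    best = min(distances.values())
    candidates = [r for r, d in distances.items() if d == best]
    if best > threshold or len(candidates) != 1:
        return None  # too distant, or several equally close Rusyn words
    candidate = candidates[0]
    # Block the replacement if another source word is closer to the candidate.
    for other in source_vocab:
        if other != word and relative_distance(other, candidate) < best:
            return None
    return candidate


def adapt_sentence(tokens, source_vocab, rusyn_vocab, threshold=0.25):
    """Replace each token by its Rusyn counterpart where one is licensed."""
    return [replacement_for(t, source_vocab, rusyn_vocab, threshold) or t
            for t in tokens]
```

With the source vocabulary {презыдент, презенты} and the Rusyn vocabulary {презенті}, `replacement_for("презенты", ...)` returns презенті, whereas `replacement_for("презыдент", ..., threshold=0.5)` returns `None` because презенты is closer to the candidate.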
The results presented in Table 3 (rows COR-R and COR-L) show that these conversions slightly decrease tagging accuracy for the source languages (which is expected, as training corpora now look less like the source languages), but do not improve the accuracy for Rusyn either compared to the simple multi-source model. We also reran the word clustering tool on the Levenshtein-converted data, under the assumption that the increased frequency of the Rusyn words would improve the reliability of the induced clustering. This assumption was indeed borne out with an accuracy increase of 2.4% absolute (row COR-L-B). However, this result did not surpass the one obtained with induced lexicons.

Conclusion and future work
We have investigated several approaches to morphosyntactic tagging of spoken Rusyn that rely neither on annotated Rusyn training data nor on annotation projection from aligned parallel data. Instead, we argued that fair tagging accuracy could be achieved by training taggers on the etymologically related languages Ukrainian, Slovak, Polish and Russian. The experiments also showed that although Ukrainian is most closely related to Rusyn, all four related languages are useful for tagging. We have shown that a multi-source tagger trained on a balanced set of source language corpora performs rather well and even outperforms majority vote approaches. In contrast, Brown clustering has only been modestly useful in our setting, which may be due to the low amount of unlabeled data used.
We have presented two additional techniques to adapt the taggers to the specificities of Rusyn: adding automatically induced morphosyntactic lexicons, or adapting the training corpora. We oriented the first technique towards maximising recall (e.g., keeping all possible readings of a Rusyn word in the induced lexicons) and the second towards high precision (e.g., only replacing unambiguous words in the corpus). The first approach turned out to be more successful.
However, we believe that further improvements can be achieved. First, the RUE1 corpus, currently our only gold standard, is not completely representative of the material found in RUE2. In fact, the RUE1 test set may actually underestimate the impact of the tagger adaptation methods, as it contains only Rusyn varieties spoken in Ukraine, with a low amount of orthographic variation, whereas RUE2 also contains Rusyn from Poland and Slovakia. As an illustration, consider the OOV rate of the UK tagger (Table 2), which is 2.5% higher on RUE2 than on RUE1. A cursory evaluation of the results confirms this hypothesis, but we cannot quantify it at the moment. Only the manual annotation of a balanced subset of the different RUE2 parts would provide us with a broader data basis for evaluation.
Second, it is crucial to keep in mind that both RUE1 and RUE2, as opposed to the training corpora, are oral corpora with distinct features such as corrections, repetitions, incomplete sentences, unintelligible words or phrases, markers for pauses, etc. Any tagger trained on written data and applied to oral data will inevitably perform worse than when applied to written data (Nivre and Grönqvist, 2001; Westpfahl, 2014).
The final annotation of the Rusyn corpus is not only expected to consist of morphosyntactic descriptions, but also of lemmas. Therefore, we intend to train a separate lemmatization model on the tagged Rusyn corpora. The multi-source approach will be more problematic here, as we do not want the predicted lemmas to be a mix of the four source languages. The prediction of Rusyn lemmas is prevented by two factors: none of our Rusyn data are annotated with Rusyn lemmas, and the orthographic variation would also carry over to the lemmas, which we would like to avoid. Therefore, one goal could be to annotate the Rusyn tokens with Ukrainian lemmas such as those available in the UGtag lexicon.
Finally, all source language corpora used in our experiments are annotated with syntactic dependencies. We assume that a Rusyn dependency parser could be created using methods similar to those presented here for morphosyntactic tagging.