Unsupervised Cross-Lingual Part-of-Speech Tagging for Truly Low-Resource Scenarios

We describe a fully unsupervised cross-lingual transfer approach for part-of-speech (POS) tagging under a truly low-resource scenario. We assume access to parallel translations between the target language and one or more source languages for which POS taggers are available. We use the Bible as parallel data in our experiments: it is small in size and out of domain, but covers many diverse languages. Our approach innovates in three ways: 1) a robust approach to selecting training instances via cross-lingual annotation projection that exploits best practices of unsupervised type and token constraints, word-alignment confidence and the density of projected POS tags, 2) a BiLSTM architecture that uses contextualized word embeddings, affix embeddings and hierarchical Brown clusters, and 3) an evaluation on 12 languages that are diverse in terms of language family and morphological typology. In spite of the limited and out-of-domain parallel data, our experiments demonstrate significant improvements in accuracy over previous work. In addition, we show that using multi-source information, either via projection or output combination, improves performance for most target languages.


Introduction
The majority of the world's languages do not have annotated datasets even for the simplest NLP tasks, such as part-of-speech (POS) tagging. However, efforts to document low-resource languages often include translations, usually of religious text, into high-resource languages. One such parallel corpus is the Bible (Mayer and Cysouw, 2014): 484 languages have a complete Bible translation, while 2,551 have a part of the Bible translated. Our goal is to learn POS taggers for a diverse set of target languages in a truly low-resource scenario, where only a limited and possibly out-of-domain set of translations into one or more high-resource languages is available (e.g., the Bible), together with supervised POS taggers for the high-resource source language(s).
Unsupervised cross-lingual POS tagging via annotation projection has a long research history (Yarowsky et al., 2001; Fossum and Abney, 2005; Das and Petrov, 2011; Duong et al., 2013; Agić et al., 2015, 2016; Buys and Botha, 2016). In contrast to our work, these approaches either use large and/or in-domain parallel data or rely on a large number of source languages for projection. However, since projection can suffer from poor translation, alignment mistakes or wrong assumptions, a key consideration for all these approaches is how to obtain high-quality training instances for the target language (i.e., sentences with accurate POS tags projected from the source language(s)). Coupling token and type constraints (Das and Petrov, 2011; Täckström et al., 2013; Buys and Botha, 2016), word-alignment confidence (Duong et al., 2013), multi-source projection (Agić et al., 2016) and coverage (the percentage of tokens covered by multi-source projection) (Plank and Agić, 2018) have all been shown to lead to training instances of better quality. However, usually only one or two of these have been employed.
Our first contribution is a robust approach for selecting training instances via cross-lingual annotation projection that exploits and expands all of these best practices: coupling type and token constraints obtained in an unsupervised way, word-alignment confidence together with the density of the projected POS tags, and (optionally) multi-source projection (Sub-section 2.1).
Our second contribution is a BiLSTM (Hochreiter and Schmidhuber, 1997) neural architecture that uses pre-trained contextualized word embeddings, affix embeddings and hierarchical Brown clusters (Brown et al., 1992). As contextualized embeddings, we show gains by exploiting the multilingual XLM-R model (Conneau et al., 2019), while affix embeddings are particularly useful for morphologically rich languages, and word clusters have been shown to be useful for non-neural POS tagging (Kupiec, 1992; Täckström et al., 2013; Owoputi et al., 2012). Moreover, in addition to the single-source setups, we propose an approach that utilizes multiple source languages by combining the outputs of single-source taggers via weighted voting at the token level (Sub-section 2.2).
Our third contribution is an extensive experimental setup, with 12 target languages that are diverse in terms of language family and morphological typology and six high-resource source languages (Section 3). While projecting from a single source language can be effective, we show that using multiple sources, either via projection or output combination, further improves the tagging accuracy for most target languages. Our experiments, using limited and out-of-domain parallel data, demonstrate significant improvements over previous work (both unsupervised and semi-supervised), even when comparing our single-source setups to other multi-source ones. We also investigate how much gold data is needed to develop supervised taggers comparable to our best unsupervised models. In addition, we show that cross-lingual annotation projection generalizes across languages of different typologies better than the zero-shot model-transfer approach by Pires et al. (2019). Finally, our tagging scripts and models are made publicly available.

Approach
Our goal is to induce a neural POS tagger for a target language of interest without any direct supervision. Instead, we rely on parallel translations between the target and one or more source languages for which POS taggers are accessible. This section describes our approach: 1) cross-lingual annotation projection via word alignments to prepare the training instances of the target language, and 2) neural POS tagging for the target language.

Cross-Lingual Projection via Word Alignments
Given sentence-aligned parallel data, we align the text of the source and target sides at the word level using GIZA++ (Och and Ney, 2003), after eliminating sentences of more than 80 tokens. We construct bidirectional word alignments by considering only the intersection of the source-to-target and target-to-source alignments, and exclude alignment points whose average alignment probability across the two directions falls below a threshold α.
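This intersection-and-threshold step can be sketched as follows (the data structures are hypothetical; in practice the GIZA++ alignment files would first be parsed into such dictionaries):

```python
def intersect_alignments(src2tgt, tgt2src, alpha=0.1):
    """Keep only links present in both directions whose average
    probability reaches the threshold alpha."""
    kept = {}
    for (s, t), p_st in src2tgt.items():
        p_ts = tgt2src.get((t, s))  # reverse direction is indexed (tgt, src)
        if p_ts is None:
            continue                # not in the intersection
        avg = (p_st + p_ts) / 2.0
        if avg >= alpha:
            kept[(s, t)] = avg
    return kept

# Toy alignments: (1, 2) is bidirectional but too weak, so it is dropped.
links = intersect_alignments(
    {(0, 0): 0.9, (1, 2): 0.04, (2, 1): 0.6},
    {(0, 0): 0.8, (2, 1): 0.02, (1, 2): 0.7},
    alpha=0.1,
)
# links == {(0, 0): 0.85, (2, 1): 0.65}
```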
Tagging of Source Languages. Since cross-lingual projection requires a common POS tagset for all languages, we use the universal POS tagset of the Universal Dependencies (UD) project, which consists of 17 universal POS tags. We rely on off-the-shelf taggers to tag the source text prior to projecting the annotations as described next.
POS Projection using Token and Type Constraints. To project the POS tags from the source to the target language, we use token and type constraints based on the mapping induced by the word-level alignments. The idea of using both token and type constraints was first introduced by Täckström et al. (2013). Type constraints define the set of POS tags a word type can receive. In a semi-supervised learning setup, type constraints can be obtained from an annotated corpus (Banko and Moore, 2004) or from a resource that serves as a POS lookup, such as the Wiktionary (Li et al., 2012; Täckström et al., 2013). For the extraction of type constraints in an unsupervised fashion, we follow the approach of Buys and Botha (2016): we define a tag distribution for each word type on the target side by accumulating the counts of the different POS tags of the source-side tokens that align with the target-side tokens of that word type. The POS tags whose probability is equal to or greater than a threshold β constitute the type constraints of the underlying word type. As a token constraint, every aligned token on the target side is assigned the POS tag of its corresponding source-side token. We combine the token and type constraints in a slightly different way than Täckström et al. (2013) and Buys and Botha (2016). If a token is not aligned, or its token constraint does not exist in the underlying type constraints, the token becomes unconstrained (i.e., receives a NULL tag). Otherwise, the token constraint is applied. The applied token constraints represent the projected tags.
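A minimal sketch of the unsupervised type constraints and their coupling with token constraints, under hypothetical data structures (corpus-wide (target word, projected tag) pairs):

```python
from collections import Counter, defaultdict

def type_constraints(aligned_pairs, beta=0.3):
    """Tag distribution per target word type; keep tags whose
    relative frequency is at least beta."""
    counts = defaultdict(Counter)
    for word, tag in aligned_pairs:
        counts[word][tag] += 1
    constraints = {}
    for word, tags in counts.items():
        total = sum(tags.values())
        constraints[word] = {t for t, c in tags.items() if c / total >= beta}
    return constraints

def project_sentence(tokens, token_tags, constraints):
    """token_tags[i] is the projected tag of tokens[i], or None if
    unaligned. A token whose token constraint is absent from its type
    constraints becomes unconstrained (NULL)."""
    out = []
    for word, tag in zip(tokens, token_tags):
        if tag is not None and tag in constraints.get(word, set()):
            out.append(tag)
        else:
            out.append("NULL")
    return out
```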
In contrast to previous work, we do not use the type constraints to impose restrictions when training the model, as doing so restricts the performance of our neural architecture.
Multilingual Projection. In addition to projecting the POS tags from one language to another, we experiment with a multilingual setup in which we follow Agić et al. (2016) by projecting the tags from multiple source languages prior to training the model (Mul_proj). The intuition is that the projection from a single source might suffer from inaccurate translation or wrongly induced alignments. Moreover, the POS tags of two correctly aligned sentences might differ because of language-dependent specifications. Such problems can be resolved by inducing the tags from multiple sources.
For each target token T, we assign the projected tag that receives the maximum vote, weighted by the alignment confidence of each source:

tag(T) = argmax_i Σ_s p(l_s | T) · P(tag_i,s | T)

where p(l_s | T) is in {0, 1} to represent whether target token T is assigned a tag under the projection from language l_s, while P(tag_i,s | T) is the probability of the alignment resulting in the assignment of tag_i to target token T when projecting from language l_s.
Selection of Training Instances. Prior to training a POS tagger with the projected tags as labels, we score the target sentences by their "annotation" quality and exclude those whose scores fall below a threshold γ. We define the sentence score as the harmonic mean of the density d_S and the alignment confidence a_S, where d_S is the percentage of tokens with projected tags and a_S is the average alignment probability of those tokens.
Filtering out sentences of low density and alignment confidence is crucial for training the model. While choosing the sentences with the top alignment scores has proved successful in previous research (Duong et al., 2013), we add the density factor because our BiLSTM model benefits from longer contiguous labeled sequences.
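A sketch of the sentence-scoring and filtering step (input format hypothetical):

```python
def sentence_score(tags, align_probs):
    """tags[i] is the projected tag ('NULL' if none); align_probs[i] is
    the alignment probability of token i (None for NULL tokens).
    Returns the harmonic mean of density d_S and confidence a_S."""
    projected = [p for t, p in zip(tags, align_probs) if t != "NULL"]
    if not projected:
        return 0.0
    d = len(projected) / len(tags)       # density d_S
    a = sum(projected) / len(projected)  # alignment confidence a_S
    return 2 * d * a / (d + a)           # harmonic mean

score = sentence_score(["NOUN", "VERB", "NULL", "DET"],
                       [0.9, 0.8, None, 0.7])
keep = score >= 0.5  # gamma = 0.5
```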

Neural POS Tagging
The architecture of our POS tagger is a bidirectional long short-term memory (BiLSTM) neural-network model (Hochreiter and Schmidhuber, 1997). BiLSTMs have been widely used for POS tagging (Huang et al., 2015; Wang et al., 2015; Plank et al., 2016; Ma and Hovy, 2016; Cotterell and Heigold, 2017) and other sequence-labeling tasks. The input to our BiLSTM model is a labeled sentence in which each word representation is the concatenation of word and sub-word information, namely pre-trained and randomly initialized word embeddings, affix embeddings and word clusters. Figure 1 shows the complete structure of our neural architecture.
Word and Affix Embeddings. We use two types of word-embedding features: pre-trained contextualized embeddings (PT) and randomly initialized embeddings (RI). For the pre-trained contextualized embeddings, we use the final layer of the multilingual XLM-RoBERTa model, XLM-R (Conneau et al., 2019). XLM-R is a transformer-based multilingual masked language model that is pre-trained on texts in 100 languages, and its performance is competitive with strong monolingual models on a variety of NLP tasks. It also outperforms multilingual BERT, mBERT (Devlin et al., 2019), particularly for low-resource languages. We represent the pre-trained embedding of each word as the average of the embedding vectors of its first and last sub-tokens.
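The first/last sub-token pooling can be sketched as follows, assuming the per-sub-token vectors and the word-to-sub-token spans have already been computed (e.g., from a tokenizer's offsets; names hypothetical):

```python
import numpy as np

def word_embeddings(subtok_vecs, spans):
    """subtok_vecs has one row per sub-token; spans gives each word's
    [start, end) sub-token range. A word's embedding is the average of
    its first and last sub-token vectors (for a single-sub-token word,
    this is just that vector)."""
    out = []
    for start, end in spans:
        first, last = subtok_vecs[start], subtok_vecs[end - 1]
        out.append((first + last) / 2.0)
    return np.stack(out)

vecs = np.array([[1.0, 0.0],   # sub-tokens of word 0
                 [0.0, 1.0],
                 [2.0, 2.0]])  # the single sub-token of word 1
emb = word_embeddings(vecs, [(0, 2), (2, 3)])
# emb == [[0.5, 0.5], [2.0, 2.0]]
```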
It is worth noting that when using our architecture for a target language that is not covered by the XLM-R model, one can instead train a custom XLM transformer-based model, given the availability of monolingual data and suitable computational resources; thus, our architecture is not limited to the languages available in XLM-R.
The randomly initialized embeddings are learned as part of training the model. Coupling the randomly initialized embeddings with the pre-trained ones is essential when the domain of the training data differs from that of the pre-trained embeddings, which is the case in our learning setup: we train on Bible data, while the XLM-R model is trained on text from Wikipedia and a CommonCrawl corpus (see Conneau et al. (2019) for details).
In addition to word embeddings, we use randomly initialized prefix and suffix n-gram character embeddings, where n is in {1, 2, 3, 4}, as the use of affix information has proved effective in POS tagging (Ratnaparkhi, 1996).

Word Clusters. The use of word clusters for POS tagging was first proposed by Kupiec (1992) in a supervised tagging setup, and has since proved effective for unsupervised learning (Täckström et al., 2013; Buys and Botha, 2016). In this work, we follow Owoputi et al. (2012) in utilizing hierarchical Brown clustering (Brown et al., 1992), an HMM-based clustering method whose binary merging criterion is based on the logarithmic probability of a context under a class-based language model, where the objective is to minimize the loss in average mutual information (AMI). The output of hierarchical Brown clustering is a binary tree with n leaf nodes that represent n word clusters, where each word in the vocabulary belongs to a single leaf cluster. Leaf clusters are recursively grouped into parent clusters (interior nodes) until a super cluster covering the entire vocabulary is reached (the root).
We produce hierarchical Brown clusters for each target language by applying Percy Liang's implementation of Brown clustering (Liang, 2005; https://github.com/percyliang/brown-cluster) to monolingual text that combines the Wikipedia and Bible texts of the target language.
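Since each leaf cluster is identified by its binary path from the root, the prefixes of that path name its ancestor clusters. A minimal sketch of extracting such hierarchical features, assuming clusters are given as bit strings:

```python
def cluster_features(bit_string):
    """All prefixes of a leaf cluster's bit string, from the coarsest
    ancestor (one bit) down to the leaf cluster itself."""
    return [bit_string[:i] for i in range(1, len(bit_string) + 1)]

feats = cluster_features("0110")
# feats == ['0', '01', '011', '0110']  -- coarse-to-fine ancestors
```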
For each word, we use the main cluster (the binary representation of the corresponding leaf node) and all of its ancestors (the prefixes of the binary representation) as features. This allows us to exploit the hierarchical clustering information and thus avoid committing to a specific granularity level, where high-level clusters may be insufficiently discriminative, while lower ones may represent over-clustering.

Custom Softmax Activation. We use softmax activation on top of the BiLSTM encoding layer to compute the final output. However, since some words have NULL tags as a result of missing alignments or non-intersecting token and type constraints (Sub-section 2.1), we set the value of the output neuron corresponding to the NULL tag to −∞ so that it does not contribute to the softmax probabilities, thereby prohibiting the model from decoding NULL. Moreover, we mask the words with NULL tags when computing the cross-entropy loss.
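A numpy sketch of the constrained softmax and the masked loss described above (tag inventory hypothetical):

```python
import numpy as np

TAGS = ["NOUN", "VERB", "ADJ", "NULL"]
NULL = TAGS.index("NULL")

def constrained_softmax(logits):
    """Softmax over the tag logits with the NULL neuron set to -inf,
    so NULL can never be decoded."""
    z = logits.astype(float).copy()
    z[..., NULL] = -np.inf
    z -= z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_cross_entropy(logits, gold):
    """Average negative log-likelihood over tokens whose gold tag is
    not NULL; NULL-tagged tokens are masked out of the loss."""
    probs = constrained_softmax(logits)
    losses = [-np.log(probs[i, g]) for i, g in enumerate(gold) if g != NULL]
    return sum(losses) / len(losses)

p = constrained_softmax(np.array([[1.0, 0.0, 0.0, 5.0]]))
# p[0, NULL] == 0.0 even though NULL had the largest logit
```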
Multilingual Decoding. In addition to the Mul_proj setup presented in Sub-section 2.1, we conduct another multilingual setup in which we combine the outputs of the single-source taggers through weighted maximum voting at the token level (Mul_out). The weight of a language pair, w(l_s, l_t), is computed as a softmax over the average sentence-level alignment probabilities obtained when aligning the source language l_s to the underlying target language l_t.
tag(T) = argmax_i Σ_s w(l_s, l_t) · p(tag_i, l_s | T)

where p(tag_i, l_s | T) is in {0, 1} to represent whether target token T is assigned tag_i by the model trained on the projection from language l_s.
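A sketch of this output combination for a single token (inputs hypothetical):

```python
import math

def language_weights(avg_align_probs):
    """Softmax over the average sentence-level alignment probabilities
    of each source language."""
    e = {lang: math.exp(p) for lang, p in avg_align_probs.items()}
    total = sum(e.values())
    return {lang: v / total for lang, v in e.items()}

def combine_outputs(predictions, weights):
    """predictions maps each source language to the tag its
    single-source tagger predicts for token T."""
    scores = {}
    for lang, tag in predictions.items():
        scores[tag] = scores.get(tag, 0.0) + weights[lang]
    return max(scores, key=scores.get)

w = language_weights({"en": 0.8, "de": 0.6, "es": 0.5})
# Two lower-weight sources agreeing on VERB outvote one on NOUN.
tag = combine_outputs({"en": "NOUN", "de": "VERB", "es": "VERB"}, w)
```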
Data

We use the multilingual parallel Bible corpus (Christodouloupoulos and Steedman, 2015) as the source of our parallel data, performing the alignment at the verse and word levels. The Bible text is available in full for our source and target languages except Basque, for which only the New Testament is available.
We use Stanza (Qi et al., 2020) to tag the source-side text of the source languages, except for Arabic, for which we apply MADAMIRA (Pasha et al., 2014) for better performance. However, since MADAMIRA was trained on PTB tags and was not designed to follow the UD guidelines, we map the Arabic PTB tags to their UD counterparts and manually correct the analyses of the 2,500 most frequent Arabic POS and lemma pairs by selecting the most likely analysis for each.
We evaluate our models in terms of POS accuracy on the test sets of the Universal Dependencies, UD v2.5 (Zeman et al., 2019). We also report results on older versions in order to compare to state-of-the-art systems where needed.

Experimental Settings
The alignment and projection thresholds as well as the hyperparameters of the model are manually tuned on Bulgarian, Basque, Finnish and Indonesian when projecting from English, using the UD development sets. We set the alignment threshold α to 0.1 and the threshold γ for the selection of training instances to 0.5. The POS type-distribution threshold β is set to 0.3, which has proved effective for Banko and Moore (2004) and Buys and Botha (2016). Table 1 reports statistics of the selected training instances.

Our BiLSTM networks are one layer deep with 128 nodes, the size of all randomly initialized word and affix embeddings is 64, and the number of Brown clusters is set to 128. We use Adam for optimization (Kingma and Ba, 2014) with a learning rate of 0.0001 and a learning-rate decay of 0.1 at each epoch, for a total of 12 epochs. To avoid overfitting, we apply L2 regularization and two dropout layers, before and after the BiLSTM encoder, with a dropout rate of 0.7. Training processes approximately 2,500 sentences per hour on a single 2.00 GHz CPU.

Table 2 reports the accuracy of our POS taggers for all 72 language pairs, in addition to the two multi-source setups Mul_out and Mul_proj, averaged over three runs. As an upper bound, we report the state-of-the-art supervised results when training on the UD training sets (for Amharic, only a test set is available) using Stanza (Qi et al., 2020; https://stanfordnlp.github.io/stanza/performance.html).

Results
There is noticeable variance in the performance of the different taggers. However, languages of the same family generally transfer best to each other. For instance, English and German transfer best to Afrikaans (IE, Germanic), Spanish yields the best results for Portuguese (IE, Romance), and Russian is the best source for Bulgarian (IE, Slavic). One exception is the case of transferring from Arabic to Amharic (Afro-Asiatic, Semitic). One possible reason is that the Arabic analyzer does not follow the UD guidelines (Sub-section 2.1), which also affects the performance of all the taggers that use Arabic as the source. Since English is a pivotal language whose morphological-annotation guidelines were the basis for those of other languages, transferring from English yields the best performance for seven target languages. On the target side, the Basque taggers show the lowest performance, since the parallel data is only available for the New Testament of the Bible and Basque is a language isolate, which is challenging for cross-lingual transfer learning.
Multi-Source Performance. As expected, the multi-source setups achieve the best average results and the best tagging performance for eight target languages. In addition, Mul_proj outperforms Mul_out in seven cases, which highlights the importance of producing high-quality projected tags prior to training the taggers. As shown in Table 1, Mul_proj results in a significant increase in the number of training sentences, which, along with the quality of the projected tags, yields the best overall performance.
Per-Tag Accuracy. Table 3 reports the accuracy of nouns, verbs and adjectives for each target language in the Mul_proj setup. The accuracy of adjectives is the lowest across all target languages. The only exception is Persian, where the performance on verbs is lower than that on nouns and adjectives, and is the lowest among all target languages. In contrast, the accuracy on nouns is the highest on average and across nine languages, where it exceeds 90% in Afrikaans, Bulgarian and Portuguese, while verbs achieve the highest accuracy in Amharic, Indonesian and Telugu. Each of the three tags is ranked second to lowest in Basque, an isolate with the least available data.

Ablation Study. When we ablate the XLM-R embeddings (No XLM) or the monolingual data used for word clustering (No Mono) in the single-source setups, the average accuracy decreases by an absolute 2.2% and 5.1%, respectively. However, when projecting from multiple sources in the Mul_proj setup, these drops are reduced to only 1.8% and 4.1%, respectively. Figure 2 reports the best performance for each target language in three setups: no ablation (full system), No XLM and No Mono. The impact of eliminating the XLM-R embeddings is most noticeable in Telugu, while it is negligible in Lithuanian, with absolute reductions of 5.8% and 0.6% in POS accuracy, respectively. On the other hand, Hindi benefits most from word clustering, where the No Mono performance is 4.9% below that of No XLM.
The performance drop in the No Mono setup highlights the importance of monolingual data, which is key to the competitive performance of our taggers, especially when compared to other systems that utilize linguistic resources. However, the performance of the system decreases only slightly in the absence of the XLM-R embeddings alone, which offers a reasonable compromise when adequate computational resources are lacking.

Comparison w.r.t. State-of-the-Art
Next, we show that our system outperforms the state-of-the-art unsupervised and semi-supervised cross-lingual POS taggers: the robust selection of training instances and the rich word representation in our neural architecture are more effective than using larger and/or domain-appropriate parallel data, some labeled data, or off-the-shelf resources encapsulating linguistic knowledge.
We first compare our models to two state-of-the-art systems that perform fully unsupervised cross-lingual POS tagging via annotation projection: AGIC (Agić et al., 2016) and BUYS (Buys and Botha, 2016). AGIC is a multilingual annotation-projection system that is the basis of our Mul_proj setup and trains a TnT POS tagger (Brants, 2000). BUYS is a neural model based on the Wsabie algorithm (Weston et al., 2011) that utilizes morphological tags projected via coupled token and type constraints.
We report the performance of our system versus AGIC and BUYS on the test sets of UD v1.2 in Table 4. Our taggers outperform both AGIC and BUYS on all the common language pairs, with error reductions of 49.1% and 9.0%, respectively, despite using smaller, out-of-domain parallel data and only six source languages in the multi-source setup. In contrast, AGIC has the advantage of utilizing 21 source languages for projection, while BUYS uses large parallel data, taken from Europarl, of up to 2M tokens, whose domain is similar to that of the UD test sets.

Next, we compare our system to two semi-supervised cross-lingual POS tagging systems: CTRL (Cotterell and Heigold, 2017) and DsDs (Plank and Agić, 2018). CTRL is a character-level RNN tagger that jointly learns the morphological tags of a high-resource language and the target one, with two experimental setups that utilize 100 and 1,000 manually annotated target tokens, denoted D100 and D1000, respectively. DsDs is a BiLSTM tagger that follows the annotation-projection approach of Agić et al. (2016) and utilizes the Polyglot embeddings (Al-Rfou' et al., 2013) and lexical information from the Wiktionary. Table 5 reports the performance of our system versus CTRL on the test sets of UD v2, and versus DsDs on the development sets of UD v2.1 using the 12 universal tags of Petrov et al. (2012) (only Basque is evaluated on the test set). Our system outperforms CTRL except in the D1000 setup for Portuguese, where our results are still comparable. Our system also outperforms DsDs on four language pairs out of six, with an overall error reduction of 43.7%.

Annotation Projection vs. Supervision
The comparison to the upper-bound supervised results in Table 2 shows that the unsupervised Afrikaans, Indonesian and Portuguese taggers successfully predict at least 90% of the correct decisions made by their corresponding supervised ones.
The impact of such small gaps could be tolerable when utilizing the taggers in downstream tasks, and thus the trade-off between developing an unsupervised tagger versus an expensive supervised one (when possible) should be considered. Next, for each target language except Amharic, we estimate the amount of manual annotation needed to develop a supervised tagger that approximates the performance of the unsupervised one. We do so by iteratively training and evaluating POS taggers in increments of 100 words until the target performance is reached. (We use the UD training data and the same parameters as in the unsupervised setting, but train for 100 epochs instead of 12.) We list the results in Table 6 with respect to the best unsupervised results in Table 2.

Annotation Projection vs. Model Transfer
One approach to zero-shot cross-lingual POS tagging is to apply a tagging model trained for a related language. Pires et al. (2019) investigate zero-shot model transfer by fine-tuning the multilingual BERT language model, mBERT (Devlin et al., 2019), for POS tagging in one language and applying the fine-tuned model to another. While the approach does not require any translation or annotations on the source side, the pre-trained models do not generalize well across languages of different typologies. We compare our approach against zero-shot model transfer when transferring from English to Japanese (different language families and morphological typologies). We utilize the Bible translation, using mBERT instead of XLM-R and training our model for only three epochs in order to replicate the experimental settings of Pires et al. (2019). As shown in Table 7, our approach achieves a relative error reduction of 27.6% when evaluated on the Japanese test set from the CoNLL 2017 shared task (Zeman et al., 2017). This result suggests that annotation projection is less sensitive to the relatedness of the source and target languages (in line with the results in Table 2), and thus generalizes better across languages of different typologies.

Table 8 reports the macro-average POS accuracies when transferring between languages grouped by their typological features: Subject/Object/Verb order (SVO and SOV) and Adjective/Noun order (AN and NA). In the work of Pires et al. (2019), the best performance is achieved when transferring from a language with similar typological features. In contrast, our system is less sensitive to typological similarity: the performance when transferring from SVO languages is comparable to that from SOV sources, while both AN and NA targets benefit equally from NA sources.
This can be explained by the fact that the typological features of the source only contribute to the alignment and projection phases, while the POS model is trained entirely in the target space after erroneous annotations have been eliminated.

Related Work
Unsupervised POS tagging through annotation projection was first proposed by Yarowsky et al. (2001), who transferred POS tags from English to French and Chinese. The work was extended by Fossum and Abney (2005), who combined the outputs of single-source taggers based on different source languages. Multilingual setups were further explored by Agić et al. (2015) and Agić et al. (2016). In an effort to increase the coverage of the projected data, Das and Petrov (2011) proposed graph-based label propagation to expand the projected tags on the target side, while Duong et al. (2013) and Agić et al. (2015) applied self-training and revision, performing the projection and training in iterations. In a different direction, Täckström et al. (2013) and Buys and Botha (2016) organized the projection process through the use of token and type constraints, which we adapt in our approach.
Semi-supervised setups have been explored either by restricting the type constraints through the use of a POS dictionary (Täckström et al., 2013) or by introducing additional signals in training, whether from a POS dictionary (Kirov et al., 2018; Plank and Agić, 2018) or by combining manual and projected annotations (Fang and Cohn, 2016). In contrast, our system is fully unsupervised, and we show that the robust construction of the training data can surpass the use of external resources.
While most prior work, like ours, performs tagging for several target languages, some research focuses on specific language pairs, such as projecting from German to Hittite (Sukhareva et al., 2017) and from Russian to Ukrainian (Huck et al., 2019).

Conclusion and Future Work
We presented a fully unsupervised cross-lingual POS tagger that performs annotation projection utilizing translations from one or more source languages into the target one. We showed that despite the use of limited and out-of-domain parallel data, our models outperform the state-of-the-art systems. We also showed that the robust selection of training instances and the rich word representation in our neural architecture are more effective than utilizing some labeled data or external linguistic resources.
In the future, we plan to enhance the system's handling of morphologically complex languages through unsupervised morphological segmentation. One approach is to perform the alignment and projection at the stem and morpheme levels. In addition, stem and morpheme information can be utilized as additional signals in training.