Distant Supervision from Disparate Sources for Low-Resource Part-of-Speech Tagging

We present a cross-lingual neural part-of-speech tagger that learns from disparate sources of distant supervision, and realistically scales to hundreds of low-resource languages. The model exploits annotation projection, instance selection, tag dictionaries, morphological lexicons, and distributed representations, all in a uniform framework. The approach is simple, yet surprisingly effective, resulting in a new state of the art without access to any gold annotated data.


Introduction
Low-resource languages lack the manually annotated data needed to learn even the most basic models, such as part-of-speech (POS) taggers. To compensate for the absence of direct supervision, work in cross-lingual learning and distant supervision has found creative uses for a number of alternative data sources to learn feasible models:
- aligned parallel corpora to project POS annotations to target languages (Yarowsky et al., 2001; Agić et al., 2015; Fang and Cohn, 2016),
- noisy tag dictionaries for type-level approximation of full supervision (Li et al., 2012),
- combination of projection and type constraints (Das and Petrov, 2011; Täckström et al., 2013),
- rapid annotation of seed training data.
However, only one or two compatible sources of distant supervision are typically employed. In reality, severely under-resourced languages may require a more pragmatic "take what you can get" viewpoint. Our results suggest that combining supervision sources is the way to go about creating viable low-resource taggers.
We propose a method to strike a balance between model simplicity and the capacity to easily integrate heterogeneous learning signals. Our system is a uniform neural model for POS tagging that learns from disparate sources of distant supervision (DSDS). We use it to combine: i) multi-source annotation projection, ii) instance selection, iii) noisy tag dictionaries, and iv) distributed word and sub-word representations. We examine how far we can get by exploiting only the wide-coverage resources that are currently readily available for more than 300 languages, which is the breadth of the parallel corpus we employ.
DSDS yields a new state of the art by jointly leveraging disparate sources of distant supervision in an experiment with 25 languages. We demonstrate: i) substantial gains in carefully selecting high-quality instances in annotation projection, ii) the usefulness of lexicon features for neural tagging, and iii) the importance of word embeddings initialization for faster convergence.

Method
DSDS is illustrated in Figure 1. The base model is a bidirectional long short-term memory network (bi-LSTM) (Graves and Schmidhuber, 2005; Hochreiter and Schmidhuber, 1997; Plank et al., 2016; Kiperwasser and Goldberg, 2016). Let $x_{1:n}$ be a given sequence of input vectors. In our base model, the input sequence consists of word embeddings $w$ and the two output states of a character-level bi-LSTM $c$. Given $x_{1:n}$ and a desired index $i$, the function $\mathrm{BiRNN}_\theta(x_{1:n}, i)$ (here instantiated as an LSTM) reads the input sequence in forward and reverse order, respectively, and uses the concatenated ($\circ$) output states as input for tag prediction at position $i$. Our model differs from prior work in the type of input vectors $x_{1:n}$ and in the distant data sources; in particular, we extend the input with lexicon embeddings, all described next.
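As a reading aid, the model described above can be written out as follows. This is a sketch under assumptions rather than the exact formulation of the paper: $e_i$ stands for the lexicon embedding of token $i$, $\mathrm{RNN}_F$ and $\mathrm{RNN}_R$ denote the forward and backward LSTMs, $\circ$ is vector concatenation, and the softmax output layer with parameters $W$ and $b$ is assumed.

```latex
% Input vector for token i: word embedding, character bi-LSTM states,
% and (assumed notation) the lexicon embedding e_i.
x_i = w_i \circ c_i \circ e_i

% Bidirectional encoding at position i.
v_i = \mathrm{BiRNN}_\theta(x_{1:n}, i)
    = \mathrm{RNN}_F(x_{1:i}) \circ \mathrm{RNN}_R(x_{n:i})

% Tag prediction (assumed softmax output layer).
\hat{y}_i = \arg\max_{t} \; \mathrm{softmax}(W v_i + b)_t
```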
Annotation projection. Ever since the seminal work of Yarowsky et al. (2001), projecting sequential labels from source to target languages has been one of the most prevalent approaches to cross-lingual learning. Its only requirements are that parallel texts are available between the languages and that the source side is annotated for POS.
We apply the approach by Agić et al. (2016), where labels are projected from multiple sources and then decoded through weighted majority voting with word alignment probabilities and source POS tagger confidences. We exploit their wide-coverage Watchtower corpus (WTC), in contrast to the typically used Europarl data. Europarl covers 21 languages of the EU with 400k-2M sentence pairs, while WTC spans 300+ widely diverse languages with only 10-100k pairs, in effect sacrificing depth for breadth and introducing a more radical domain shift. However, as our results show, a small amount of projected data turns out to be the most beneficial, which supports trading depth for breadth.
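The voting step can be illustrated with a minimal sketch. It assumes that each vote is weighted by the product of the word alignment probability and the source tagger confidence; the exact weighting of Agić et al. (2016) may differ, and the function names and toy values below are hypothetical.

```python
from collections import defaultdict


def project_tag(votes):
    """Pick a tag for one target token by weighted majority voting.

    votes: list of (tag, alignment_prob, tagger_confidence) tuples, one per
    source-language token aligned to this target token.
    Returns the winning tag, or None if the token received no votes.
    """
    scores = defaultdict(float)
    for tag, align_prob, tagger_conf in votes:
        # Assumed weighting: product of the two confidence values.
        scores[tag] += align_prob * tagger_conf
    if not scores:
        return None
    # Iterate over sorted tags so that ties are broken deterministically.
    return max(sorted(scores), key=scores.get)


def project_sentence(token_votes):
    """Project a tag onto every token of a target sentence."""
    return [project_tag(votes) for votes in token_votes]


# Toy target sentence with three tokens and votes from several sources.
sentence_votes = [
    [("DET", 0.9, 0.95), ("DET", 0.8, 0.90), ("PRON", 0.6, 0.70)],
    [("NOUN", 0.7, 0.80), ("VERB", 0.9, 0.90), ("NOUN", 0.6, 0.85)],
    [],  # unaligned token: no projected tag
]
projected = project_sentence(sentence_votes)
```

In this toy case the second token receives a single strong VERB vote, but the two weaker NOUN votes add up to a higher total, so NOUN is projected.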
While Agić et al. (2016) selected 20k projected sentences at random to train taggers, we propose a novel alternative: selection by coverage. We rank the target sentences by the percentage of words covered by word alignment from the 21 sources of Agić et al. (2016), and select the top k covered instances for training. Specifically, we employ the mean coverage ranking of target sentences, whereby each target sentence is paired with the arithmetic mean of its 21 individual word alignment coverages, one for each of the 21 source-language sentences. We show that this simple approach to instance selection offers substantial improvements: across all languages, we learn better taggers with significantly fewer training instances.
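A minimal sketch of mean coverage ranking follows. The data layout is an assumption made for illustration: each target sentence carries one set of aligned token indices per source language, and a source without any alignment counts as zero coverage. The toy corpus uses 3 sources instead of 21.

```python
def mean_coverage(alignments, target_len):
    """Arithmetic mean of per-source word alignment coverage.

    alignments: list with one set per source language, holding the indices
    of the target tokens that are aligned to that source sentence.
    """
    if target_len == 0 or not alignments:
        return 0.0
    coverages = [len(aligned) / target_len for aligned in alignments]
    return sum(coverages) / len(coverages)


def select_top_k(corpus, k):
    """Return the ids of the k target sentences with the highest mean coverage.

    corpus: list of (sentence_id, target_len, alignments) triples.
    """
    ranked = sorted(
        corpus,
        key=lambda item: mean_coverage(item[2], item[1]),
        reverse=True,
    )
    return [sentence_id for sentence_id, _, _ in ranked[:k]]


# Toy corpus: three target sentences, three source languages each.
corpus = [
    ("s1", 4, [{0, 1}, {0}, {0, 1, 2}]),
    ("s2", 5, [{0, 1, 2, 3, 4}, {0, 1, 2, 3}, {0, 1, 2, 3, 4}]),
    ("s3", 2, [{0}, set(), {1}]),
]
selected = select_top_k(corpus, k=2)
```

Here `s2` is almost fully covered by all three sources and ranks first, followed by `s1`; `s3` is dropped.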
Dictionaries. Dictionaries are a useful source of distant supervision (Li et al., 2012; Täckström et al., 2013). There are several ways to exploit such information: i) as type constraints during decoding (Täckström et al., 2013), ii) to guide unsupervised learning (Li et al., 2012), or iii) as an additional signal at training time. We focus on the last option and evaluate two ways to integrate lexical knowledge into neural models, while comparing to the former two: a) representing lexicon properties as an n-hot vector of length m, where m is the number of lexicon properties (e.g., if a word has two properties according to lexicon src, it results in a 2-hot vector; if the word is not present in src, a zero vector); b) embedding the lexical features, i.e., $e_{src}$ is a lexicon src embedded into an l-dimensional space. We represent $e_{src}$ as the concatenation of all m embedded properties, each of length l, and as a zero vector otherwise. Tuning on the dev set, we found the second approach, embedding, to perform best, and simple concatenation outperformed mean vector representations.
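The two representations can be contrasted in a short sketch. It reflects one possible reading of the description above: each of the m properties has its own l-dimensional embedding, and the slot of a property the word does not have is filled with zeros. The toy lexicon, the property inventory, and the randomly initialized embeddings (which would be learned with the tagger) are all hypothetical.

```python
import random

PROPERTIES = ["NOUN", "VERB", "ADJ"]  # m = 3 lexicon properties (toy inventory)
LEXICON = {"walk": {"NOUN", "VERB"}, "green": {"ADJ"}}  # toy lexicon "src"
L_DIM = 4  # embedding size l (the paper sets l=40)


def n_hot(word):
    """Variant a): n-hot vector of length m, all zeros for unknown words."""
    props = LEXICON.get(word, set())
    return [1.0 if p in props else 0.0 for p in PROPERTIES]


# One l-dimensional embedding per property; random values stand in for
# parameters that would be trained jointly with the tagger.
rng = random.Random(0)
PROP_EMB = {p: [rng.uniform(-0.1, 0.1) for _ in range(L_DIM)] for p in PROPERTIES}


def embedded(word):
    """Variant b): concatenation of m slots of length l.

    A slot holds the property's embedding if the word has that property,
    and zeros otherwise, so unknown words map to a zero vector.
    """
    props = LEXICON.get(word, set())
    vec = []
    for p in PROPERTIES:
        vec.extend(PROP_EMB[p] if p in props else [0.0] * L_DIM)
    return vec


walk_n_hot = n_hot("walk")        # 2-hot: NOUN and VERB are set
walk_embedded = embedded("walk")  # length m * l = 12
```

Either vector would then be concatenated to the word and character representations of the token before it enters the bi-LSTM.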
We evaluate two dictionary sources, motivated by their ease of access for many languages: WIKTIONARY, a word type dictionary that maps tokens to one of the 12 Universal POS tags (Li et al., 2012; Petrov et al., 2012); and UNIMORPH, a morphological dictionary that provides inflectional paradigms across 350 languages (Kirov et al., 2016). For Wiktionary, we use the freely available dictionaries from Li et al. (2012) and . The size of the dictionaries ranges from a few thousand entries (e.g., Hindi and Bulgarian) to 2M (Finnish UniMorph). Sizes are provided in Table 1, first columns. UniMorph covers between 8 and 38 morphological properties (for English and Finnish, respectively).
Word embeddings. Embeddings are available for many languages. Pre-initialization of w offers consistent and considerable performance improvements in our distant supervision setup (Section 4). We use off-the-shelf Polyglot embeddings (Al-Rfou et al., 2013), which performed consistently better than FastText (Bojanowski et al., 2016).
- GARRETTE: The approach by  that works with projections, dictionaries, and unlabeled target text.
- LI: Wiktionary supervision (Li et al., 2012).
Data. Our set of 25 languages is motivated by the availability of embeddings and dictionaries. In all experiments we work with the 12 Universal POS tags (Petrov et al., 2012). For development, we use 21 dev sets of the Universal Dependencies 2.1 (Nivre et al., 2017). We employ UD test sets on additional languages as well as the test sets of Agić et al. (2015) to facilitate comparisons. Their test sets are a mixture of CoNLL (Buchholz and Marsi, 2006; Nivre et al., 2007) and HamleDT test data (Zeman et al., 2014), and are more distant from the training and development data.
Model and parameters. We extend an off-the-shelf state-of-the-art bi-LSTM tagger with lexicon information. The code is available at: https://github.com/bplank/bilstm-aux. The parameter l=40 was set on dev data across all languages. Besides using 10 epochs, a word dropout rate of p=.25, and 40-dimensional lexicon embeddings, we use the parameters from Plank et al. (2016). For all experiments, we average over 3 randomly seeded runs, and provide mean accuracy. For the learning curve, we average over 5 random samples with 3 runs each. Table 1 shows the tagging accuracy for individual languages, while the means over all languages are given in Figure 2. There are several take-aways.

Results
Data selection. The first take-away is that coverage-based instance selection yields substantially better training data. Most prior work on annotation projection resorts to arbitrary selection; informed selection clearly helps in this noisy data setup, as shown in Figure 2 (a). Training on 5k instances hits a sweet spot; more data (10k) starts to decrease performance, at the cost of longer runtime. Training on all WTC data (around 120k) is worse for most languages. From now on we consider the 5k model trained with Polyglot as our baseline (Table 1, column "5k"), obtaining a mean accuracy of 83.0 over 21 languages.
Embeddings initialization. Polyglot initialization offers a large boost: on average a +3.8% absolute improvement in accuracy for our 5k training scheme, as shown in Figure 2 (b). The big gap in low-resource setups further shows the effectiveness of the embeddings, with up to 10% absolute increase in accuracy when training on only 500 instances.
Lexical information. The main take-away is that lexical information helps neural tagging, and embedding it proves the most helpful. Embedding Wiktionary tags reaches 83.7 accuracy on average, versus 83.4 for n-hot encoding, and 83.2 for type constraints. Type constraints are better on only 4 out of 21 languages, and n-hot encoding on only one (French). The best approach is to embed both Wiktionary and UniMorph, boosting performance further to 84.0, and resulting in our final model. It helps the most on morphologically rich languages such as the Uralic ones.
On the test sets (Table 4, right) DSDS reaches 87.2 over 8 test languages intersecting Li et al. (2012) and Agić et al. (2016). It reaches 86.2 over the more commonly used 8 languages of Das and Petrov (2011), compared to their 83.4. This shows that our novel "soft" inclusion of noisy dictionaries is superior to a hard decoding restriction, and including lexicons in neural taggers helps. We did not assume any gold data to further enrich the lexicons, nor fix possible tagset divergences.
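For contrast with the "soft" inclusion of dictionaries as input features, the following sketch shows what a hard decoding restriction by type constraints can look like. It is a simplified per-token illustration, not the method of Täckström et al. (2013): the function name, the fallback for out-of-dictionary words, and the toy scores are assumptions.

```python
def decode_with_type_constraints(tag_scores, word, dictionary):
    """Pick the best-scoring tag among those the dictionary licenses.

    tag_scores: dict mapping each tag to the model score for this token.
    dictionary: dict mapping word types to their set of allowed tags.
    Words missing from the dictionary are decoded without restriction.
    """
    allowed = dictionary.get(word, set())
    candidates = {t: s for t, s in tag_scores.items() if t in allowed}
    if not candidates:
        # Assumed fallback: no usable constraint, keep all tags.
        candidates = tag_scores
    # Iterate over sorted tags so that ties are broken deterministically.
    return max(sorted(candidates), key=candidates.get)


scores = {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2}
tag_dictionary = {"run": {"VERB"}}  # a noisy entry that misses the noun reading

constrained = decode_with_type_constraints(scores, "run", tag_dictionary)
unconstrained = decode_with_type_constraints(scores, "jog", tag_dictionary)
```

With the hard restriction, an incomplete dictionary entry overrides the model (`run` is forced to VERB even though NOUN scores higher), whereas lexicon features fed into the network leave the final decision to the tagger.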

Discussion
Analysis. The inclusion of lexicons results in higher coverage and is part of the explanation for the improvement of DSDS; see the correlation in Figure 3 (a). What is more interesting is that our model benefits from the lexicon beyond its content: OOV accuracy for words not present in the lexicon improves overall, besides the expected improvement on known OOV; see Figure 3 (b).
[Table caption fragment: in case of equal means, the one with lower std is boldfaced. Averages over language families (with two or more languages in the sample, number of languages in parentheses).]
More languages. All data sources employed in our experiment are very high-coverage. However, for true low-resource languages, we cannot safely assume the availability of all disparate information sources. Table 2 presents results for four additional languages where some supervision sources are missing. We observe that adding lexicon information always helps, even in cases where only 1k entries are available, and embedding it is usually the most beneficial way. For closely-related languages such as Serbian and Croatian, using resources for one aids tagging the other, and modern resources are a better fit. For example, using the Croatian WTC projections to train a model for Serbian is preferable over in-language Serbian Bible data where the OOV rate is much higher.
How much gold data? We assume no access to any gold annotated data. It is thus interesting to ask how much gold data is needed to reach our performance. This is a tricky question, as training within the same corpus naturally favors the same corpus data. We test on both in-corpus (UD) and out-of-corpus data (our test sets) and notice an important gap: while in-corpus only 50 sentences are sufficient, outside the corpus one would need over 200 sentences. This experiment was done for a subset of 18 languages with both in- and out-of-corpus test data.
[...] Li et al. (2012), where the dictionaries are large, and the other languages in Figure 4, with smaller dictionaries. Compared to DAS, our tagger clearly benefits from pre-trained word embeddings, while theirs relies on label propagation through Europarl, a much cleaner corpus that lacks the coverage of the noisier WTC. The same applies to Täckström et al. (2013), as they use 1-5M near-perfect parallel sentences. Even though we use much smaller and noisier data sources, DSDS is almost on par: 86.2 vs. 87.3 for the 8 languages from Das and Petrov (2011), and we even outperform theirs on four languages: Czech, French, Italian, and Spanish.

Related Work
Most successful work on low-resource POS tagging is based on projection (Yarowsky et al., 2001), tag dictionaries (Li et al., 2012), annotation of seed training data, or, more recently, some combination of these, e.g., via multi-task learning (Fang and Cohn, 2016; Kann et al., 2018). Our paper contributes to this literature by leveraging a range of prior directions in a unified, neural test bed.
Figure 4: The performance of LI with our dictionary data over EM iterations, separate for the languages from Li et al. (2012) and all the remaining languages in Table 1.
Most prior work on neural sequence prediction follows the commonly perceived wisdom that hand-crafted features are unnecessary for deep learning methods. Such work relies on end-to-end training without resorting to additional linguistic resources. Our study shows that these resources do help. Only a few prior studies investigate such sources, e.g., for MT (Sennrich and Haddow, 2016; Chen et al., 2017; Li et al., 2017; Passban et al., 2018). For POS tagging, Sagot and Martínez Alonso (2017) use lexicons, but only as n-hot features and without examining the cross-lingual aspect.

Conclusions
We show that our approach of distant supervision from disparate sources (DSDS) is simple yet surprisingly effective for low-resource POS tagging. Only 5k instances of projected data paired with off-the-shelf embeddings and lexical information integrated into a neural tagger are sufficient to reach a new state of the art, and both data selection and embeddings are essential components to boost neural tagging performance.