Cross-lingual tagger evaluation without test data

We address the challenge of cross-lingual POS tagger evaluation in absence of manually annotated test data. We put forth and evaluate two dictionary-based metrics. On the tasks of accuracy prediction and system ranking, we reveal that these metrics are reliable enough to approximate test set-based evaluation, and at the same time lean enough to support assessment for truly low-resource languages.


Introduction
Cross-lingual learning of NLP models is currently in an evaluation impasse. While we can create reliable cross-lingual taggers and parsers for hundreds of low-resource languages (Agić et al., 2016), we can only evaluate our models for languages where some hand-annotated test data is available. The requirement for the uniformity of annotations (McDonald et al., 2013) further strengthens the constraint. The set of languages with readily available test data is very exclusive: namely, it comprises the resource-rich languages from the Universal Dependencies project (Nivre et al., 2015). 1 Recent works have suggested evaluating cross-lingual approaches by proxy, e.g., by using crowdsourced tag dictionaries (Li et al., 2012). In these works, though, the validity of assessment by using tag dictionaries is left completely unaddressed.
1 http://universaldependencies.org/
Contributions. Our work poses the question: How adequate are tag dictionaries for evaluating POS taggers for low-resource languages? Across 25 languages, we compare the POS tagger rankings induced by evaluation against dictionaries to those induced by evaluation on manually annotated gold standards. We select the best out of five competitive taggers for 14 out of 25 languages. We also consider to what extent we can predict true tagging scores. We find that as few as the 100 most frequent tokens with corresponding POS tags suffice to provide reliable estimates of true scores. Finally, we introduce a novel metric that presumes nothing but an English tag dictionary and a small bilingual dictionary for the target language. We also find this metric to be a relatively robust estimator for tagging accuracy. It finds the best tagger for 11 out of 20 languages.
Our code and data are freely available. 2

Metrics
In cross-lingual learning work, it is common to evaluate POS taggers for accuracy by using test data annotated by human experts. For a test set $T$ of $n$ word-tag pairs $(w_i, t_i)$ and its tagging $\hat{T}$, in which $\hat{t}_i$ is the tag predicted for $w_i$, we define the true accuracy $A_{\mathrm{true}}$ as:

$$A_{\mathrm{true}} = \frac{|\{i : \hat{t}_i = t_i\}|}{n}$$

Obviously, this metric can only be computed when test data is available, which is not the case for the vast majority of the world's languages. Note that while we use the term true accuracy, the adequacy of the metric depends on how representative the annotated data is of the underlying distribution. Drawing from Li et al. (2012), who compared Wiktionaries to gold dictionaries extracted from the tagger training sets, prior work proposed an approximate metric for use in absence of test data $T$. It was applied to 10 low-resource languages by using Wiktionaries ranging from only 50 to more than 20k dictionary entries. We take that metric as our starting point.
Soft accuracy. Given a dictionary $D$ whose entries are word forms with their ambiguous taggings $(w, D_w = \{t^w_1, \ldots, t^w_k\})$, we express the approximate or soft accuracy $A_{\mathrm{soft}}$ as:

$$A_{\mathrm{soft}} = \frac{|\{i : (w_i, D_{w_i}) \in D \wedge \hat{t}_i \in D_{w_i}\}|}{|\{i : (w_i, D_{w_i}) \in D\}|}$$

In absence of true tags $t_i$, we ambiguously tag $T$ using the tags from $D$, but only for the tokens $w_i$ that are covered by the dictionary: $(w_i, D_{w_i}) \in D$. We then count the tagger output $\hat{t}_i$ as correct iff it is warranted by the dictionary: $\hat{t}_i \in D_{w_i}$.
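As a minimal sketch, the two definitions above could be computed as follows. The function names, the data layout (a plain mapping from word form to a set of tags), and the toy tokens are illustrative assumptions, not taken from the released code.

```python
from typing import Dict, List, Optional, Set, Tuple


def true_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Share of test tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("empty test set")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def soft_accuracy(
    tagged: List[Tuple[str, str]], tag_dict: Dict[str, Set[str]]
) -> Optional[float]:
    """Share of dictionary-covered tokens whose predicted tag is licensed
    by the dictionary. Returns None if no token is covered."""
    covered = [(word, tag) for word, tag in tagged if word in tag_dict]
    if not covered:
        return None
    correct = sum(tag in tag_dict[word] for word, tag in covered)
    return correct / len(covered)


# Toy example (hypothetical data): only "igra" and "je" are covered.
toy_dict = {"igra": {"NOUN"}, "je": {"VERB", "PRON"}}
toy_output = [("igra", "VERB"), ("je", "VERB"), ("dobra", "ADJ")]
print(soft_accuracy(toy_output, toy_dict))  # 0.5
```

Tokens outside the dictionary are ignored rather than counted as errors, which is why coverage matters for how well the soft score tracks the true one.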
Problems. Crowd-sourced dictionaries can suffer from limited coverage and poor quality. We counter the first issue by covering the most frequent words. We distinguish between $A_{\mathrm{soft}}$ with frequency information (+freq), using the $m$ most frequent words, and without frequency information (−freq), using $m$ random words.
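The +freq and −freq variants differ only in how the $m$ dictionary entries are chosen. The sketch below illustrates one way to do this; the helper name `restrict_dictionary`, its parameters, and the choice to rank only words that actually occur in the frequency corpus are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Optional, Set


def restrict_dictionary(
    tag_dict: Dict[str, Set[str]],
    corpus_tokens: Iterable[str],
    m: int,
    use_freq: bool,
    seed: Optional[int] = None,
) -> Dict[str, Set[str]]:
    """Keep m entries of tag_dict: the m most frequent ones in the corpus
    (+freq) or m entries sampled uniformly at random (-freq)."""
    if use_freq:
        # Count only corpus tokens that the dictionary covers.
        counts = Counter(tok for tok in corpus_tokens if tok in tag_dict)
        words = [word for word, _ in counts.most_common(m)]
    else:
        rng = random.Random(seed)
        # Sort first so that a fixed seed gives a reproducible sample.
        words = rng.sample(sorted(tag_dict), min(m, len(tag_dict)))
    return {word: tag_dict[word] for word in words}


# Toy example (hypothetical data).
toy_dict = {"a": {"DET"}, "b": {"NOUN"}, "c": {"VERB"}}
toy_corpus = ["a", "a", "a", "b", "b", "c", "x"]
print(restrict_dictionary(toy_dict, toy_corpus, m=2, use_freq=True))
```

Repeating the −freq call with different seeds gives the kind of repeated random sampling from which confidence intervals can be computed.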
Tag lists $D_i$ can also be deficient: They can be missing certain tags, or contain incorrect tags, or both. For example, the Croatian Wiktionary only notes the NOUN tagging of igra (en. game), but in reality the word form also has a VERB tagging (en. to play, third person singular).
We can gauge the quality of $D$ in presence of a high-quality dictionary $G = \{(w_i, G_i)\}_{i=1}^{|G|}$, which we can induce from a training set. Namely, for each word $w_i$ covered by both $D$ and $G$, we check how many tags $D_i$ and $G_i$ share, and then use the intersection to estimate dictionary precision and recall.
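A minimal sketch of this intrinsic dictionary evaluation is given below. The text does not state how the per-word intersections are aggregated, so the micro-averaging over all shared words used here is an assumption, as are the function name and the toy entries.

```python
from typing import Dict, Set, Tuple


def dictionary_quality(
    d: Dict[str, Set[str]], g: Dict[str, Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 of dictionary d against a
    gold dictionary g, computed over the words covered by both."""
    shared = d.keys() & g.keys()
    overlap = sum(len(d[w] & g[w]) for w in shared)  # tags found in both
    d_total = sum(len(d[w]) for w in shared)         # tags proposed by d
    g_total = sum(len(g[w]) for w in shared)         # tags listed in g
    precision = overlap / d_total if d_total else 0.0
    recall = overlap / g_total if g_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example (hypothetical entries): "igra" misses VERB in d,
# while "je" carries a spurious NOUN tag.
toy_d = {"igra": {"NOUN"}, "je": {"VERB", "NOUN"}, "only_d": {"ADJ"}}
toy_g = {"igra": {"NOUN", "VERB"}, "je": {"VERB"}, "only_g": {"ADV"}}
print(dictionary_quality(toy_d, toy_g))
```

A missing tag lowers recall and a spurious tag lowers precision, which matches the two kinds of deficiency described for the tag lists.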
Translated dictionaries. With low-resource languages, we cannot presume the availability of tag dictionaries. However, we often have high-quality bilingual dictionaries with translations of common words into a resource-rich language such as English. With these in place, we can "translate" the English dictionary into a low-resource language and exploit the resulting $D_{\mathrm{trans}}$ in the evaluation for $A_{\mathrm{soft}}$. We implement a very simple form of dictionary lookup-based translation, whereby all words in the English word-tag dictionary are replaced by target-language words through bilingual dictionaries.
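The lookup-based translation can be sketched as follows. The function name, the data layout, and the toy bilingual entries are hypothetical; taking the union of tags when several English words map to the same target word is also an assumption, since the text does not specify how such collisions are handled.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def translate_dictionary(
    en_tag_dict: Dict[str, Set[str]],
    bilingual: Dict[str, Iterable[str]],
) -> Dict[str, Set[str]]:
    """Replace each English word by its target-language translations and
    carry its tags over. English words without a translation are dropped."""
    translated: Dict[str, Set[str]] = defaultdict(set)
    for en_word, tags in en_tag_dict.items():
        for target_word in bilingual.get(en_word, ()):
            # Assumption: merge tags when translations collide.
            translated[target_word] |= tags
    return dict(translated)


# Toy example (hypothetical entries).
toy_en = {"game": {"NOUN"}, "play": {"VERB", "NOUN"}, "the": {"DET"}}
toy_bilingual = {"game": ["igra"], "play": ["igra", "igrati"]}
print(translate_dictionary(toy_en, toy_bilingual))
```

The resulting dictionary can be used in place of a native tag dictionary when computing the soft accuracy.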
We expect this bilingual dictionary-based soft metric $A_{\mathrm{trans}}$ to suffer from the same coverage and quality problems as $A_{\mathrm{soft}}$, and to introduce additional "translation noise" on top of that. We maintain that both metrics can still be reliable estimators of tagging accuracy for truly low-resource languages in absence of annotated test data.

Experiments
We perform two sets of experiments: i) numerical score prediction, where we evaluate the approximate metrics $A_{\mathrm{soft}}$ and $A_{\mathrm{trans}}$ as estimators of the true POS tagging accuracies $A_{\mathrm{true}}$, and ii) rank prediction, where we test how well $A_{\mathrm{soft}}$ and $A_{\mathrm{trans}}$ perform in ranking several POS taggers relative to $A_{\mathrm{true}}$.
In numerical score prediction, we evaluate the taggers using all three metrics, and establish empirical relations between dictionary quality and size, and the observed scores.
In rank prediction, we rank five POS taggers using $A_{\mathrm{true}}$, and then attempt to replicate the ranking using $A_{\mathrm{soft}}$ and $A_{\mathrm{trans}}$. We express the quality of predicted rankings using precision at rank 1 (P@1) and Kendall's $\tau_b$ statistic (Knight, 1966).
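To make the two ranking measures concrete, the sketch below computes them for one language from two score tables. The function names, the tagger labels, and the scores are made up for illustration. The $\tau_b$ implementation is a plain quadratic pair count rather than the faster algorithm of Knight (1966), and breaking ties for the top rank by insertion order is an assumption.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equally long score lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to no term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def best_tagger_found(true: Dict[str, float], est: Dict[str, float]) -> bool:
    """True if the estimated metric ranks the truly best tagger first.
    Averaging this over languages gives P@1."""
    return max(true, key=true.get) == max(est, key=est.get)


# Toy example (hypothetical taggers and scores).
a_true = {"t1": 95.1, "t2": 94.8, "t3": 93.0, "t4": 92.5, "t5": 90.0}
a_soft = {"t1": 91.0, "t2": 91.5, "t3": 89.0, "t4": 88.0, "t5": 85.0}
names = sorted(a_true)
tau = kendall_tau_b([a_true[n] for n in names], [a_soft[n] for n in names])
print(round(tau, 2), best_tagger_found(a_true, a_soft))  # 0.8 False
```

In this toy case a single swapped pair at the top leaves $\tau_b$ high but makes the top-1 selection fail, which shows why both measures are reported.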
Data. We train and test our taggers on data from UD version 1.2 (Nivre et al., 2015). We intersect this collection with the dictionaries we make available for this experiment: 9 of the Wiktionaries come from Li et al. (2012), and we collect 16 new ones on top of that. Thus, we experiment with a total of 25 languages from UD. We refer to the 9 languages of Li et al. (2012) as development languages. To make the Wiktionaries and the UD data compatible, we map all POS tags to the tagset by Petrov et al. (2012).
We estimate the frequencies for the +freq variants of the soft metrics by using the multilingual Bible corpus by Christodouloupoulos and Steedman (2014) and the Watchtower corpus (Agić et al., 2016) combined.
We translate the English Wiktionary from Li et al. (2012) by using bilingual dictionaries from Wiktionary to obtain $D_{\mathrm{trans}}$ for 20 languages. 3

Figure 1: Impact of dictionary size and frequency usage (−freq, +freq) on numerical score prediction for nine development languages using the TnT tagger. The shaded regions represent 95% confidence intervals for the −freq dictionaries. The −freq dictionaries are randomly sampled 100 times for each size step, and the steps range from 100 to 10,000 entries for both −freq and +freq.

Results
Score prediction. Here, we discuss how well our metric $A_{\mathrm{soft}}$ estimates the true tagger accuracies by using the Wiktionaries. Figure 1 reveals that even large Wiktionaries do not make for good accuracy estimators if they do not exploit the frequencies. We see the evidence for that in the very wide confidence intervals in our Wiktionary sampling. In contrast, even the smallest of frequency-aware Wiktionaries prove to be much more reliable. They can contain as few as 100 entries, especially if their tagging quality is high. For example, a bad sample of 6k Spanish (es) words and tags might underestimate $A_{\mathrm{true}}$ by 10 points, while using the 100 most frequent Spanish words gets us as close as −4 points even with erroneous tags.
We observe high negative correlations of Wiktionary $F_1$ scores (Pearson's ρ = −0.58) and test set coverages (ρ = −0.60) with the quality of accuracy estimation, expressed as the absolute difference of the two scores, $|A_{\mathrm{true}} - A_{\mathrm{soft}}|$, for the data in Figure 1. In simpler terms: The higher i) the intrinsic quality of the Wiktionary and ii) its coverage, the better the score estimation. There, the Wiktionaries are intrinsically evaluated with respect to the training set dictionaries. We also note that the noisy Wiktionaries ($D$) tend to underestimate $A_{\mathrm{true}}$, while the more reliable gold dictionaries ($G$) overestimate.

3 [...] English UD training set due to much higher coverage in spite of lower precision: $F_1 = 18.51$ for the Wiktionary translations ($D_{\mathrm{trans}}$), compared to $F_1 = 13.22$ for the UD training set translations ($G_{\mathrm{trans}}$) over 20 languages.
The translation-based metric $A_{\mathrm{trans}}$ approximates the true scores better than $A_{\mathrm{soft}}$ for 7/20 languages, and is more stable across languages as all $D_{\mathrm{trans}}$ originate from English (en). See Table 1 for the results on all 25 languages.
Rank prediction. In system ranking, we try to select the best tagger for a given language through our metrics. We note the task is rather hard as all the taggers score very close to one another. Still, we manage to find the best tagger for 14/25 languages with $A_{\mathrm{soft}}$, and for 11/20 with $A_{\mathrm{trans}}$.
For some languages, even in spite of Wiktionary deficiency, we manage to i) select the best tagger and ii) improve the true score prediction through translation from English. For example, the high quality of the Bulgarian (bg) Wiktionary is outweighed by the high coverage of its $D_{\mathrm{trans}}$, and there $A_{\mathrm{trans}}$ significantly improves the prediction. For Farsi (fa), we improve both the score prediction and the tagger selection.

Through Kendall's $\tau_b$ statistic, we rate the quality of the entire rankings, not just of guessing the best out of five taggers. We find that the true and the estimated rankings are statistically dependent at p < 0.05 for all languages. We also find that the taggers are easier to rank when the true scores are lower and further apart. For example, the French (fr) and Spanish (es) taggers are hard to rank as they all score very close to one another, while we easily rank the taggers for Greek (el), Basque (eu), Polish (pl), or Romanian (ro). We argue that such ranking behavior favors evaluation for low-resource languages, where insufficient data is very likely to cause even greater disparity between different POS taggers.

Table 1: Wiktionary size and quality, and metrics evaluation. The dictionary sizes $|D|$ are in $\times 10^3$ entries. Wiktionaries are evaluated for precision (P) and recall (R) against the respective UD training set dictionaries ($G$). In metrics evaluation, scores are obtained by using the full Wiktionaries, averaged ($\bar{A}$) over five POS taggers. *: development languages, with Wiktionaries by Li et al. (2012). ±: 95% confidence intervals; bold: best score estimates, i.e., lowest differences to true scores $|A_{\mathrm{true}} - A_{\mathrm{soft}}|$.

Discussion
Sources of POS tags. Our work aims at supporting cross-lingual POS tagger evaluation. Why did we then evaluate the metrics on outputs of fully supervised taggers? In short, because higher tagging scores are harder to estimate.
We experimented with: i) fully supervised taggers, ii) actual cross-lingual taggers from Agić et al. (2016), for which $\bar{A}_{\mathrm{true}} = 70.56$, and iii) artificial corruption of gold POS tags.
In artificial data corruption for the development languages, we found that the score prediction error correlates with the true score (ρ = 0.54). For the corruption, we created 20 samples of $A_{\mathrm{true}} \in [0, 1]$ for each language with a 0.05 increment. Further, we evaluated $A_{\mathrm{soft}}$ on the cross-lingual taggers. There, we singled out the best taggers for 13/21 intersecting languages, i.e., for 2 languages more than with the fully supervised taggers (11/21). With translated dictionaries, i.e., through $A_{\mathrm{trans}}$, we scored 13/20 (also +2 languages).
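The text does not describe the corruption procedure itself. One simple possibility, shown here purely as an assumption, is to replace the gold tag of a random subset of tokens with a different tag so that the corrupted tagging reaches a chosen true accuracy. The function name and parameters are hypothetical.

```python
import random
from typing import List, Sequence


def corrupt_tags(
    gold: Sequence[str],
    tagset: Sequence[str],
    target_accuracy: float,
    seed: int = 0,
) -> List[str]:
    """Return a copy of gold in which a random share of tags is replaced
    by a different tag, so that accuracy is close to target_accuracy."""
    if not 0.0 <= target_accuracy <= 1.0:
        raise ValueError("target_accuracy must lie in [0, 1]")
    if len(set(tagset)) < 2:
        raise ValueError("need at least two distinct tags to corrupt")
    rng = random.Random(seed)
    n_wrong = round((1.0 - target_accuracy) * len(gold))
    wrong_positions = set(rng.sample(range(len(gold)), n_wrong))
    corrupted = []
    for i, tag in enumerate(gold):
        if i in wrong_positions:
            # Pick any tag other than the gold one.
            corrupted.append(rng.choice([t for t in tagset if t != tag]))
        else:
            corrupted.append(tag)
    return corrupted


# Toy example (hypothetical data): 100 tokens corrupted to 75% accuracy.
tags = ["NOUN", "VERB", "DET", "ADJ"]
gold_tags = tags * 25
noisy = corrupt_tags(gold_tags, tags, target_accuracy=0.75)
print(sum(g == p for g, p in zip(gold_tags, noisy)) / len(gold_tags))  # 0.75
```

Calling such a function over a grid of target accuracies in steps of 0.05 yields a series of synthetic taggers on which the soft metrics can be evaluated against a known true score.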
For these reasons, we decided to show how our metrics perform in the most difficult case. Here, the additional experiments with different sources of POS tags show that the metrics easily scale down to evaluating cross-lingual taggers for low-resource languages.
Held-out data. Annotating a handful of test sentences could serve as an alternative to dictionary-based evaluation. We find that ∼55±27 sentences are needed on average to reach the system ranking accuracy of $A_{\mathrm{trans}}$ for our 20 languages. However, the option of annotating test data might not be feasible for many low-resource languages, while Wiktionaries are currently readily available for more than 300 languages. We also note that the required sample size is negatively correlated with tagging accuracy (ρ = −0.63): the lower the tagger accuracy, the more sentences we need to reasonably estimate it.

Li et al. (2012) gauge 9 Wiktionaries against gold dictionaries to strengthen the argument for their weakly-supervised tagger. Other prior work uses 10 Wiktionaries to extend a cross-lingual tagger evaluation to languages without test sets, but does so indiscriminately: the Wiktionaries range from only 50 to more than 20k random entries. To the best of our knowledge, research on evaluating POS taggers in absence of manually annotated test data is novel to our work.

Related work
We collected 16 new Wiktionaries on top of the 9 provided by Li et al. (2012) for our experiment. Recently, larger Wiktionary datasets 4 have been made available, enabling further experiments with cross-lingual tagging. The dataset of Sylak-Glassman et al. (2015) covers more than 300 languages, and includes parts of speech and morphological features.

Plank et al. (2015) discuss how various metrics for evaluating syntactic dependency parsing correlate with human judgments. We suggest that our translation-based metrics might naturally extend to dependency parsing by, e.g., treating an English dependency relation dictionary as a tag dictionary. The strong correlations between labeling (LA) and attachment scores (UAS) in dependency parsing favor our proposal. 5

Garrette and Baldridge (2013) build taggers for low-resource languages from just 2 hours of manual annotation. Similarly, we show how to reliably evaluate cross-lingual POS taggers by translating as few as the 100 most frequent English Wiktionary entries to the target language.

Conclusions
We evaluated how well the quality of POS taggers can be estimated without annotated test data. Our work has obvious applications to developing unsupervised or weakly supervised POS taggers for low-resource languages.
We were able to reliably estimate tagging accuracies by using very small tag dictionaries. Dictionaries with as few as 100 entries were in the majority of cases sufficient to predict true accuracies within 5%. We only require that these 100 entries be frequently used. Out of 5 competitive POS taggers, we then single out the best ones using our metric for 14/25 languages.
Finally, we showed that even if the dictionaries are "translated" from the English Wiktionary through a small list of bilingual word pairs, we can still predict which POS taggers are best for 11/20 languages. In other words, we found that it is sufficient to translate a small list of frequent words from English to start reliably evaluating cross-lingual taggers for the true targets.