Simple Semi-Supervised POS Tagging

We tackle the question: how much supervision is needed to achieve state-of-the-art performance in part-of-speech (POS) tagging, if we leverage lexical representations given by the model of Brown et al. (1992)? It has become a standard practice to use automatically induced "Brown clusters" in place of POS tags. We claim that the underlying sequence model for these clusters is particularly well-suited for capturing POS tags. We empirically demonstrate this claim by drastically reducing supervision in POS tagging with these representations. Using either the bit-string form given by the algorithm of Brown et al. (1992) or the (less well-known) embedding form given by the canonical correlation analysis algorithm of Stratos et al. (2014), we can obtain 93% tagging accuracy with just 400 labeled words and achieve state-of-the-art accuracy (> 97%) with less than 1 percent of the original training data.


Introduction
While fully supervised POS tagging is largely considered a solved problem today, this is hardly the case for unsupervised POS tagging. Despite much previous work (Smith and Eisner, 2005; Johnson, 2007; Toutanova and Johnson, 2007; Haghighi and Klein, 2006; Berg-Kirkpatrick et al., 2010), results on this task are complicated by varying assumptions and unclear evaluation metrics (Christodoulopoulos et al., 2010). Perhaps most importantly, they are not good enough to be practical. Even with indirect supervision, for example the prototype-driven method of Haghighi and Klein (2006), which assumes a set of word examples for each tag type, the best per-position accuracy remains in the mid-70% range.
Recent work has taken a middle ground between fully supervised and unsupervised setups by exploiting existing resources, for example by projecting POS tags from a supervised language or using tag dictionaries (Das and Petrov, 2011; Li et al., 2012). In this work, we focus on minimizing the amount of labeled data required to obtain a good POS tagger. The key to our approach is the use of lexical representations induced by the clustering model of Brown et al. (1992). We argue that this model is particularly appropriate for representing POS tags given their nearly deterministic nature (Section 2). This sheds light on why the representations derived under this model reveal the underlying POS tag information of words.
We empirically demonstrate the validity of our observation by using these representations to drastically reduce the number of training examples required for good POS tagging performance on English, German, and Spanish newswire datasets. For instance, on the 12-tag English dataset, we obtain tagging accuracy of 93% with just 400 labeled words. We obtain tagging accuracy of 97.03% (about a half percent behind fully supervised models) with just 0.74% of the original training data.
Our aim is orthogonal to the discussion in Manning (2011), who investigates what is needed to go beyond the current state-of-the-art POS tagging performance. Our focus is on reaching that performance with as little supervision as possible.
Certain words are genuinely ambiguous (e.g., set can be a verb or a noun): this was the motivation for the use of statistical models in the early days of computational linguistics (Church, 1988). However, it is also true that many words are deterministically mapped to correct POS tags (e.g., the is always a determiner). A simple experiment highlights this property. Let count(w, t) be the number of times word w is tagged as t in the training data (likewise, count(w) and count(t) are counts of word w and tag t). Define a deterministic mapping f : w → t from words to tags as
$$f(w) = \begin{cases} \arg\max_{t} \mathrm{count}(w, t) & \text{if } \mathrm{count}(w) > 0 \\ \arg\max_{t} \mathrm{count}(t) & \text{otherwise} \end{cases}$$
In our datasets, this naïve procedure in fact yields reasonable tagging accuracies: 92.22% for coarse tags and 88.50% for fine-grained tags (averaged across three languages). This observation suggests that the following restricted class of HMMs might be sufficient for modeling the characteristics of POS tags:
• π(t) is the prior probability of tag type t.
• t(t'|t) is the probability of transitioning from tag type t to tag type t'.
• o(w|t) is the probability of emitting word type w from tag type t.
• (Restriction) For each word type w, we have o(w|t) > 0 only for a unique tag type t and o(w|t') = 0 for all other tag types t' ≠ t.
In other words, we assume that tag types partition word types while imposing a first-order sequence structure on tag types.
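The most-frequent-tag experiment described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the toy data are made up, and the handling of unseen words (falling back to the most frequent tag overall) is this sketch's reading of the role of count(t).

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_words):
    """Build f(w): the tag most often paired with w in the training data."""
    pair_counts = defaultdict(Counter)  # count(w, t)
    tag_counts = Counter()              # count(t)
    for word, tag in tagged_words:
        pair_counts[word][tag] += 1
        tag_counts[tag] += 1
    # Assumed behavior for unseen words: the most frequent tag overall.
    fallback = tag_counts.most_common(1)[0][0]
    mapping = {w: c.most_common(1)[0][0] for w, c in pair_counts.items()}
    return lambda w: mapping.get(w, fallback)

def accuracy(tagger, tagged_words):
    """Per-position accuracy of a word-to-tag function."""
    correct = sum(tagger(w) == t for w, t in tagged_words)
    return correct / len(tagged_words)

# Toy data, invented for illustration only.
train = [("the", "DET"), ("set", "NOUN"), ("set", "VERB"),
         ("set", "NOUN"), ("the", "DET"), ("dog", "NOUN")]
f = train_most_frequent_tag(train)
```

On the toy data, the ambiguous word set is always tagged with its majority tag, which is exactly the kind of error a deterministic mapping cannot avoid.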

Brown et al. (1992) model
This class of restricted HMMs is precisely the model proposed by Brown et al. (1992), henceforth the Brown model. A popular use of this model is agglomerative word clustering: the result is a hierarchy over word types, such as the one shown in Figure 1(a). In practice, each word type is represented as a bit-string indicating the path from the root. These bit-strings have been used as discrete (binary) features in various natural language tasks such as named-entity recognition (Miller et al., 2004) and dependency parsing (Koo et al., 2008).
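For reference, the restriction on the emission distribution can be written out as a formula. The notation C(w), for the unique tag type that can emit word type w, is introduced here for illustration and is not taken from the text. Because every word sequence is compatible with exactly one tag sequence, the probability of a sentence under the restricted HMM reduces to a single product:

```latex
p(w_1, \ldots, w_n)
  = \pi\bigl(C(w_1)\bigr)\, o\bigl(w_1 \mid C(w_1)\bigr)
    \prod_{i=2}^{n} t\bigl(C(w_i) \mid C(w_{i-1})\bigr)\, o\bigl(w_i \mid C(w_i)\bigr)
```

In this form, clustering word types amounts to choosing the map C.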
Recently, Stratos et al. (2014) showed that a variant of canonical correlation analysis (CCA) (Hotelling, 1936) can be used to provably recover the clusters under the Brown model. Under this method, each word is represented as an m-dimensional vector where m is the number of hidden states in the model: see Figure 1(b) for illustration. This can be used as m real-valued features in discriminative models. Note that real-valued representations can reflect ambiguity (e.g., set in the illustration), which can be seen as a benefit over discrete representations.
By the above observation, we conjecture that the hidden states of the Brown model capture POS tags. Then the bit-string and embedding representations essentially "give away" the POS tag associated with a word. In experiments, we show that this is indeed the case and not far removed from the idealized illustration in Figure 1.

Method
In this section, we describe our tagging framework MINITAGGER, which is pleasantly simple but surprisingly effective. It uses an off-the-shelf discriminative classifier to map a word's context to a POS tag. Concretely, given a sentence x and a position i in x, we extract a feature vector φ(x, i) ∈ R^d and train a multi-class classifier to map φ(x, i) to a POS tag. To clarify, this is not the HMM described in the previous section: that HMM underlies the Brown bit-strings and CCA embeddings.
This framework has compelling benefits. First, it allows for learning from partially labeled sentences since each word is an independent sample. Second, training and tagging can be very fast since they do not involve the dynamic programming required for structured models. Third, arbitrary features can be easily and effectively incorporated. Finally, there are many well-oiled public implementations of discriminative classifiers such as support-vector machines (SVMs), so building an efficient and effective MINITAGGER takes only minimal effort.

Feature templates
We use a baseline feature function base that maps a sentence-position pair (x, i) to a 0-1 vector base(x, i) indicating
• Prefixes and suffixes of x_i up to length 4
• Whether x_i is capitalized, numeric, or non-alphanumeric
Let bit(x) be a binary vector indicating prefixes of the Brown bit-string corresponding to x. Table 1 shows the feature templates we use to obtain a vector representation of (x, i). ⊕ is the vector concatenation operation. BASE is a baseline template which uses only the spelling features of the current word and the identities of neighboring words. BIT is the same as BASE but augmented with Brown bit-strings. CCA is the same as BASE but augmented with CCA embeddings with appropriate normalization.
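The spelling and bit-string features above could be computed roughly as follows. This is a sketch under stated assumptions, not the MINITAGGER implementation: the feature names, the one-word neighbor window, the use of every bit-string prefix length, and the restriction of bit-string features to the current word are all choices made for this example. Sparse 0-1 vectors are represented as sets of active feature names, so concatenation (⊕) becomes a union of disjoint name spaces.

```python
def base_features(words, i, max_affix=4):
    """Active 0-1 features for position i: spelling of words[i] plus neighbors."""
    w = words[i]
    feats = set()
    for n in range(1, min(max_affix, len(w)) + 1):
        feats.add("prefix=" + w[:n])
        feats.add("suffix=" + w[-n:])
    if w[:1].isupper():
        feats.add("capitalized")
    if w.isdigit():
        feats.add("numeric")
    if not w.isalnum():
        feats.add("non-alphanumeric")
    # Neighbor identities; a window of one word per side is an assumption.
    feats.add("prev=" + (words[i - 1] if i > 0 else "<s>"))
    feats.add("next=" + (words[i + 1] if i + 1 < len(words) else "</s>"))
    return feats

def bit_features(bitstring):
    """Indicators for every prefix of a Brown bit-string (all lengths assumed)."""
    return {"bit=" + bitstring[:n] for n in range(1, len(bitstring) + 1)}

def bit_template(words, i, brown):
    """BASE features joined with bit-string prefixes of the current word.

    `brown` is a hypothetical dict from word type to its bit-string.
    """
    feats = base_features(words, i)
    if words[i] in brown:
        feats |= bit_features(brown[words[i]])
    return feats
```

Using prefixes of the bit-string, rather than only the full string, lets the classifier back off to coarser clusters higher in the hierarchy.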

Active learning
We also deploy active learning to find the most informative words for labeling in an attempt to reduce the amount of training data. While there is a rich literature on this topic (see Settles (2010) for a survey), we focus on a simple form of margin sampling in this work. Every time the model is allowed to have an additional label, it actively selects, from a pool of unannotated words, the word whose predicted tag is the least confident.
To be precise, let s(x, i, y) be the score of label y for sentence-position pair (x, i). For example, a linear SVM may define s(x, i, y) = w_y^⊤ φ(x, i), where w_y is the model parameter for label y and φ(x, i) is a feature template in Table 1. To obtain M labeled examples, we specify the initial seed size k ≤ M and the step size ξ (for simplicity, assume ξ divides M − k): starting from k labeled seed words, we repeatedly train the model and label the ξ least confident words in the pool until M labels are obtained. Large values of k and ξ can be used to speed up active learning (possibly at the cost of performance loss). Since our focus is on maximizing performance with minimal supervision, we use k = ξ = 1. We leave the speed-performance tradeoff of active learning for future work.
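A generic margin-sampling loop with a seed size and a step size might look like the sketch below. It is not the authors' procedure: the callbacks train, score, and oracle are hypothetical placeholders for the classifier and the annotator, and taking the margin to be the gap between the two highest label scores is the usual definition, assumed here.

```python
def margin(scores):
    """Gap between the best and second-best label scores (small = unsure)."""
    top, second = sorted(scores.values(), reverse=True)[:2]
    return top - second

def select_least_confident(pool_scores, step):
    """Return the `step` pool items with the smallest margins."""
    ranked = sorted(pool_scores, key=lambda item: margin(pool_scores[item]))
    return ranked[:step]

def active_learning(pool, seed, total, step, train, score, oracle):
    """Grow a labeled set from the `seed` items to `total` items.

    train(labeled) -> model, score(model, item) -> {label: score}, and
    oracle(item) -> label are hypothetical callbacks supplied by the caller.
    """
    labeled = {item: oracle(item) for item in seed}
    while len(labeled) < total:
        model = train(labeled)
        pool_scores = {item: score(model, item)
                       for item in pool if item not in labeled}
        if not pool_scores:  # pool exhausted before reaching `total`
            break
        for item in select_least_confident(pool_scores, step):
            labeled[item] = oracle(item)
    return labeled
```

With step = 1 the model is retrained after every new label, which is the slowest but most careful setting; larger steps trade selection quality for fewer retraining rounds.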

Random and frequent-word sampling
In addition to active learning, we consider the following methods for obtaining M labeled words.
• Random sampling: Select M words uniformly at random (without replacement).
• Frequent-word sampling: Select random occurrences of the M most frequent word types.
Note that frequent-word sampling is optimal if there really is a deterministic mapping from word types to tag types. But since the assumption does not hold perfectly, it has severe limitations in practice.
We have found that frequent-word sampling outperforms random sampling for small values of M but starts to lag behind as M increases.
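The two non-active sampling schemes can be sketched as follows. The function names are invented for this example, and drawing exactly one random occurrence per frequent word type is an assumption about how "random occurrences" are chosen.

```python
import random
from collections import Counter

def random_sampling(tokens, m, rng):
    """Pick m token positions uniformly at random, without replacement."""
    return rng.sample(range(len(tokens)), m)

def frequent_word_sampling(tokens, m, rng):
    """Pick one random occurrence of each of the m most frequent word types."""
    positions = {}
    for i, w in enumerate(tokens):
        positions.setdefault(w, []).append(i)
    top_types = [w for w, _ in Counter(tokens).most_common(m)]
    return [rng.choice(positions[w]) for w in top_types]

# Toy corpus, invented for illustration only.
tokens = "the dog saw the cat and the dog".split()
rng = random.Random(0)
picked = frequent_word_sampling(tokens, 2, rng)
```

Frequent-word sampling never labels a second occurrence of the same type, so it cannot learn that a word like set takes different tags in different contexts; this is one way to see why it falls behind as M grows.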

Setting
We experimented on 3 languages: English, German, and Spanish. For all these languages, we used the train/dev/test datasets of the universal treebank (McDonald et al., 2013), in both the reduced tagset version, which we denote by EN12, GE12, and SP12, and the original tagset version, which we denote by EN45, GE16, and SP24. The number in the dataset name refers to the number of tag types in that dataset: e.g., EN45 is an English dataset with 45 possible tags. For each language, we derived Brown representations (by which we mean both the bit-string and embedding forms) from a corpus of unlabeled text. For English, we used a corpus of about 772 million words from various sources of (mostly newswire) text. For German and Spanish, we used the n-gram statistics in Google Ngram (Michel et al., 2011). The number of words was about 64 billion for German and 83 billion for Spanish.
We used the implementation of Liang (2005) to derive bit-string word representations for English; for German and Spanish, we used the agglomerative clustering technique of Stratos et al. (2014) since Liang's implementation did not support operating directly on n-grams. We used the CCA algorithm of Stratos et al. (2014) to derive 50-dimensional word embeddings. Table 2 shows the nearest neighbors (i.e., the closest in ℓ2 distance) of some example words. We see that these embeddings are remarkably good at relating the POS tag information to Euclidean distance, confirming our hypothesis in Section 2. We built a MINITAGGER using the liblinear package of Fan et al. (2008). We primarily compared our model with conditional random fields (CRFs) (Lafferty et al., 2001), for which we used the implementation of Okazaki (2007).
While we do not rigorously compare runtime performance in this work, we note that the computational advantages of MINITAGGER are very useful in practice. Training and tagging take only a few seconds with baseline features; with more complex features such as word embeddings, they still take much less time than is required by a CRF (which takes hours with baseline features).

With limited supervision
We first look at the effect of using Brown representations when only a limited amount of training data is available: a scenario in which the role of such lexical representations can be prominent. We select a subset of training data (by words) with various sampling schemes described in Section 3.2.
Fixed amount of data. Table 3 shows the performance on the development portion when MINITAGGER is trained on 200, 400, and 1000 labeled words (our code is available at https://github.com/karlstratos/minitagger). Active learning together with Brown representations gives dramatic improvement in accuracy when the amount of training data is limited. With 200 randomly sampled labels, the baseline model obtains an average accuracy of 74.97% across EN12, GE12, and SP12. This improves to 86.09% when labels are actively selected with CCA features. A striking result is that we can obtain an accuracy of 93% with only 400 labeled words on the 12-tag English data. The performance of various sampling methods and features on EN12 is plotted in Figure 2: it is clear that active sampling with Brown bit-string or CCA embedding features outperforms the others consistently and very significantly.
Fixed target accuracy. Table 4(a) shows the smallest numbers of labeled words required to achieve target performance, where the target performance is defined to be the accuracy of the fully supervised baseline rounded down to a whole number (Table 5). We repeatedly increase the training size by 100 and report the first size that allows MINITAGGER to reach the target accuracy. These numbers are presented as percentages of the size of the original training data in Table 4(b).
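The search for the smallest sufficient training size can be sketched as below. The evaluate callback, which would train on a given number of labeled words and return dev accuracy in percent, is hypothetical, as is the optional max_size cap.

```python
import math

def target_accuracy(baseline_accuracy):
    """Fully supervised baseline accuracy (percent), rounded down."""
    return math.floor(baseline_accuracy)

def smallest_sufficient_size(evaluate, target, step=100, max_size=None):
    """Return the first training size (a multiple of `step`) reaching `target`.

    evaluate(size) -> dev accuracy in percent is a hypothetical callback that
    trains on `size` labeled words and scores the resulting tagger.
    Returns None if `max_size` is exceeded before the target is reached.
    """
    size = step
    while max_size is None or size <= max_size:
        if evaluate(size) >= target:
            return size
        size += step
    return None
```

Because sizes are only tried in increments of the step, the reported size is an upper bound on the true minimum, accurate to within one step.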
We see that active learning with lexical representations provides dramatic reduction in training data while maintaining good performance. In all cases, using CCA embeddings as features for active learning outperforms using Brown bit-strings, although sampling takes much longer with CCA embeddings since there are many more non-zero features. MINITAGGER does almost as well as when fully supervised with less than 1% of the training data on English: > 97% accuracy with 0.74% of the data on the 12-tag version, and > 96% accuracy with 0.81% of the data on the 45-tag version.

With full supervision
We also examine the effect of Brown representations in a fully supervised setting. Table 5 shows the performance of different tagging methods on the development portion when all training data is used. We see that Brown representations are helpful even under full supervision: MINITAGGER, a simple greedy model, outperforms CRF when equipped with Brown representations.
We also consider training only on the samples in Table 4, which are sufficient to reach state-of-the-art performance on the dev portion. As percentages of the original training data, these samples constitute 0.74% for EN12, 1.13% for GE12, 0.72% for SP12, 0.81% for EN45, 4.98% for GE16, and 1.60% for SP24 (1.66% on average). The accuracy of MINITAGGER equipped with Brown representations is again generally higher than that of CRF. Furthermore, MINITAGGER achieves competitive performance using only a fraction of the original training set.

Related work
We make a few remarks on related work not already discussed earlier. Our work extends a rich body of previous work on reducing annotation efforts with seed examples, unlabeled data, and training example selection (Yarowsky, 1995; Blum and Mitchell, 1998; Collins and Singer, 1999; Miller et al., 2004; Koo et al., 2008; Kim and Snyder, 2013). In particular, Miller et al. (2004) investigate semi-supervised named-entity recognition based on Brown clusters and active learning. Koo et al. (2008) investigate semi-supervised dependency parsing based on Brown clusters.
The direction that Ringger et al. (2007) pursue is perhaps the most similar to ours. They attempt to reduce supervision required for high POS tagging performance based on active learning. But a critical difference is that they do not use word representations: in contrast, word representations are central to our approach.
Another closely relevant work is the work of Garrette and Baldridge (2013) who aim to learn a good POS tagger from limited resources. Notably, they faithfully simulate tagging resource-poor languages with human annotators. Our contribution is different in several important ways. Most importantly, our results are much more striking in the aspect of minimizing supervision. We obtain > 90% accuracy with a few hundred labeled words, whereas Garrette and Baldridge (2013) obtain 71-78% with 1,537-2,650 labeled words and tag dictionaries (i.e., the result of two hours of annotation efforts). They also do not make use of word representations which are the highlight of this work.

Conclusion
We have argued that the sequence model of Brown et al. (1992), often used for deriving lexical representations, is particularly appropriate for capturing POS tags. We have demonstrated this claim by drastically reducing the amount of labeled data required for state-of-the-art POS tagging accuracy with word representations derived under the Brown model. Our simple framework MINITAGGER allows one to learn a functioning POS tagger with merely a few hundred labeled tokens, or an accurate POS tagger with less than 1% of the normally considered amount of training data.
We focused on utilizing lexical representations in a greedy framework, which is well-suited for the per-position accuracy metric (the standard metric for POS tagging). However, the result may be quite different if another metric is chosen, for instance the per-sentence accuracy metric. Thus improving tagging performance under different metrics using lexical representations may be a fruitful direction.
While they are not considered in this work, lexical representations not derived under the Brown model, such as those of the skip-gram model in the WORD2VEC package (Mikolov et al., 2013), can certainly be used for the same task. It may be illuminating to compare such different representations.