Model Selection for Type-Supervised Learning with Application to POS Tagging

Model selection (picking, for example, the feature set and the regularization strength) is crucial for building high-accuracy NLP models. In supervised learning, we can estimate the accuracy of a model on a subset of the labeled data and choose the model with the highest accuracy. In contrast, here we focus on type-supervised learning, which uses constraints over the possible labels of word types for supervision, and in which labeled data is either unavailable or very limited. For the setting where no labeled data is available, we perform a comparative study of previously proposed model selection criteria and one novel criterion on type-supervised POS tagging in nine languages. For the setting where a small labeled set is available, we show that the set should be used for semi-supervised learning rather than for model selection alone: using it for model selection only reduces the error by less than 5%, whereas using it for semi-supervised learning reduces the error by 44%.


Introduction
Fully supervised training of NLP models (e.g., part-of-speech taggers, named entity recognizers, relation extractors) works well when plenty of labeled examples are available. However, manually labeled corpora are expensive to construct in many languages and domains, whereas an alternative, if weaker, form of supervision is often readily available. For example, corpora labeled with POS tags at the token level are only available for around 35 languages, while tag dictionaries of the form displayed in Fig. 1 are available for many more languages, either in commercial dictionaries or in community-created resources such as Wiktionary. Tag dictionaries provide type-level supervision for the word types in the lexicon. Similarly, while sentences labeled with named entities are scarce, gazetteers and databases are more readily available (Bollacker et al., 2008).
There has been substantial research on how best to build models using such type-level supervision, for POS tagging, super sense tagging, NER, and relation extraction (Craven et al., 1999; Smith and Eisner, 2005; Carlson et al., 2009; Mintz et al., 2009; Johannsen et al., 2014), inter alia, focusing on parametric forms and loss functions for model training. However, there has been little research on the practically important aspect of model selection for type-supervised learning. While some previous work used criteria based on the type-level supervision only (Smith and Eisner, 2005; Goldwater and Griffiths, 2007), much prior work used a labeled set for model selection (Vaswani et al., 2010; Soderland and Weld, 2014). We are not aware of any prior work aiming to compare or improve existing type-supervised model selection criteria.
For POS tagging, there is also work on using both type-level supervision from lexicons and projection from another language (Täckström et al., 2013). Methods for training with a small labeled set have also been developed (Søgaard, 2011; Garrette and Baldridge, 2013; Duong et al., 2014), but there have been no studies of the utility of a small labeled set for model selection versus model training. Our contributions are: 1) a simple and generally applicable model selection criterion for type-supervised learning, 2) the first multi-lingual systematic evaluation of model selection criteria for type-supervised models, and 3) empirical evidence that, if a small labeled set is available, the set should be used for semi-supervised learning and not only for model selection.
2 Model selection and training for type-supervised learning

Notation. In type-supervised learning, we have unlabeled text T = {x} of token sequences, and a lexicon lex which lists the possible labels for word types. Model training finds the model parameters θ which minimize a training loss function L(θ; lex, T, h). We use h to represent the configurations and modeling decisions (also known as hyperparameters); examples include the dependency structure between variables, the feature templates, and the regularization strengths. Given a set of fully specified hyperparameter configurations {h_1, ..., h_M}, model selection aims to find the configuration h_m that maximizes the expected performance of the corresponding model θ_m according to a suitable accuracy measure.
x denotes a token sequence, y a label sequence, and lex[x] the set of label sequences compatible with token sequence x according to lex.

Task. For the application in this paper, the task is type-supervised POS tagging, and the parametric model family we consider is that of featurized first-order HMMs (Berg-Kirkpatrick et al., 2010). The hyperparameters specify the feature set used and the strength of an L2 regularizer on the parameters.
Evaluation function. The evaluation function used in model selection is the main focus of this work. We use a function eval(m, T_dev) to estimate the performance of the model trained with hyperparameters h_m on a development set T_dev. In the following subsections, we discuss definitions of eval when the development set T_dev is labeled and when it is unlabeled, respectively.
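The overall selection procedure can be sketched generically. In the snippet below, `train` and `evaluate` are hypothetical stand-ins for the actual training procedure and evaluation function (which the following subsections define); they are not part of the paper's implementation.

```python
def select_model(configs, train, evaluate):
    """Train one model per hyperparameter configuration h_m and return
    the index and score of the configuration whose model scores highest
    under the evaluation function eval(m, T_dev)."""
    best_m, best_score = None, float("-inf")
    for m, h in enumerate(configs):
        theta = train(h)            # minimize L(theta; lex, T, h)
        score = evaluate(m, theta)  # estimate performance of model m
        if score > best_score:
            best_m, best_score = m, score
    return best_m, best_score
```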

T_dev is labeled
When the development set T_dev is labeled, a natural choice of eval is token-level prediction accuracy:

eval_sup(m, T_dev) = (1 / |T_dev|) Σ_i 1(y_m[i] = y_gold[i])

Here, we use i to index all tokens in T_dev; y_gold[i] denotes the correct POS tag, and y_m[i] denotes the predicted POS tag of the i-th token obtained with hyperparameters h_m.
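A minimal implementation of this supervised criterion, assuming the predictions and gold tags are given as flat per-token lists:

```python
def eval_sup(pred_tags, gold_tags):
    """Token-level accuracy of predicted tags against gold tags on a
    labeled development set (one tag per token, in corpus order)."""
    assert len(pred_tags) == len(gold_tags)
    correct = sum(p == g for p, g in zip(pred_tags, gold_tags))
    return correct / len(gold_tags)
```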

T_dev is unlabeled
When token supervision is not available, we cannot compute eval_sup. Instead, previous work on POS tagging with type supervision (Smith and Eisner, 2005) used:

eval_cond(m, T_dev) = Σ_{x ∈ T_dev} log p_{θ_m}(y ∈ lex[x] | x)
eval_joint(m, T_dev) = Σ_{x ∈ T_dev} log p_{θ_m}(x, y ∈ lex[x])

eval_cond estimates the conditional log-likelihood of "lex-compatible" labels given the token sequences, while eval_joint estimates the joint log-likelihood of lex-compatible labels and token sequences.
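For an HMM, both criteria can be computed with a forward pass whose tag set is restricted, at each position, to the lex-compatible tags; the conditional score is the constrained forward score minus the unconstrained one, since log p(y ∈ lex[x] | x) = log p(x, y ∈ lex[x]) − log p(x). The sketch below uses a toy dict-based first-order HMM; the parameter tables log_pi, log_A, log_B (initial, transition, and emission log-probabilities) are our own illustration, not the paper's featurized model.

```python
from math import exp, log

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    xs = list(xs)
    m = max(xs)
    return m + log(sum(exp(x - m) for x in xs))

def log_forward(sent, allowed, log_pi, log_A, log_B):
    """Forward algorithm in log space; the tag set at each position is
    restricted to allowed(word). Returns the log of the summed probability."""
    alpha = {t: log_pi[t] + log_B[t][sent[0]] for t in allowed(sent[0])}
    for w in sent[1:]:
        prev = alpha
        alpha = {t: logsumexp([prev[s] + log_A[s][t] for s in prev]) + log_B[t][w]
                 for t in allowed(w)}
    return logsumexp(alpha.values())

def eval_joint(sents, lex, log_pi, log_A, log_B):
    # sum_x log p(x, y in lex[x]): forward restricted to lex-compatible tags
    return sum(log_forward(s, lambda w: lex[w], log_pi, log_A, log_B)
               for s in sents)

def eval_cond(sents, lex, all_tags, log_pi, log_A, log_B):
    # sum_x log p(y in lex[x] | x): constrained minus unconstrained score
    return sum(log_forward(s, lambda w: lex[w], log_pi, log_A, log_B)
               - log_forward(s, lambda w: all_tags, log_pi, log_A, log_B)
               for s in sents)
```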
The held-out lexicon criterion. We propose a new model selection criterion which estimates prediction accuracy more directly and is applicable to any model type, without requiring that the model define conditional or joint probabilities of label sequences. The idea behind the proposed criterion is simple: we hold out a portion of the lexicon entries and use it to estimate model performance as follows:

eval_devlex(m) = (1/N) Σ_{i : x_i ∈ lex_dev} 1(y_m[i] ∈ lex_dev[x_i]) / |lex_dev[x_i]|

where lex_dev is the held-out portion of the lexicon entries, x_i is the i-th token in T, and N is the number of tokens in T covered by lex_dev.
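A minimal sketch of the criterion, assuming the model's predictions for the tokens of T are given as a flat list and lex_dev maps word types to their tag lists:

```python
def eval_devlex(pred_tags, tokens, lex_dev):
    """Expected token accuracy when the gold tag of each covered token is
    assumed uniform over its held-out lexicon entry lex_dev[w]: a prediction
    inside the entry scores 1/|lex_dev[w]|, a prediction outside scores 0."""
    scores = [1.0 / len(lex_dev[w]) if t in lex_dev[w] else 0.0
              for w, t in zip(tokens, pred_tags) if w in lex_dev]
    return sum(scores) / len(scores)
```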
The remainder of this section details the theory behind this criterion. The expected token-level accuracy of a model trained with hyperparameters h_m is defined as E_{(x, y_gold, y_m) ~ D}[1(y_m = y_gold)], where D is a joint distribution over tokens x (in context), their gold labels y_gold, and the predicted labels y_m (for the configuration h_m). Since, when no labeled data is available, we do not have access to samples from D, we derive an approximation to this distribution using lex and T.
We first split the full lexicon into a training lexicon lex_train and a held-out (or development) lexicon lex_dev (see Fig. 1), by sampling words according to their token frequency c(x) in T and placing them in the development or training portion of the lexicon such that lex_dev covers 25% of the tokens in T. The goal of the sampling process is to make the distribution of tags for words in the development lexicon representative of the tag distribution for all words.
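The split might be implemented as below. The frequency-weighted sampling without replacement (and the +1 smoothing on counts) are our assumptions for illustration; the paper does not specify the exact sampling procedure.

```python
import random

def split_lexicon(lex, token_counts, dev_fraction=0.25, seed=0):
    """Sample word types into a dev lexicon, weighted by token frequency,
    until the dev lexicon covers ~dev_fraction of all tokens in T."""
    rng = random.Random(seed)
    total = sum(token_counts.get(w, 0) for w in lex)
    lex_dev, covered = {}, 0
    while covered < dev_fraction * total and len(lex_dev) < len(lex):
        # frequency-weighted sampling without replacement (+1 smoothing
        # so that zero-count words remain sampleable)
        remaining = [w for w in lex if w not in lex_dev]
        weights = [token_counts.get(w, 0) + 1 for w in remaining]
        w = rng.choices(remaining, weights=weights, k=1)[0]
        lex_dev[w] = lex[w]
        covered += token_counts.get(w, 0)
    lex_train = {w: tags for w, tags in lex.items() if w not in lex_dev}
    return lex_train, lex_dev
```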
Given h_m, we train a tagging model using lex_train and use it to predict labels y_m for all tokens in T. We then use the word tokens covered by the development lexicon and their predicted tags y_m to approximate D: we let P(y_gold | x) be a uniform distribution over the gold labels consistent with the lexicon entry for x, resulting in the approximation

P_D(x_i, y_gold, y_m[i]) ∝ 1(x_i ∈ lex_dev) · 1(y_gold ∈ lex_dev[x_i]) / |lex_dev[x_i]|

where i ranges over the tokens in T and y_m[i] is the model's prediction for x_i. We then compute the expected accuracy as eval_devlex = E_D[1(y_gold = y_m)], select the hyperparameter configuration m that maximizes this criterion, and re-train the model with the full lexicon lex.

3 How to best use a small labeled set T_L?
Several prior works used a labeled set for supervised hyperparameter selection even when only type-level supervision is assumed to be available for training (Vaswani et al., 2010; Soderland and Weld, 2014). In this section, we want to answer the question: if a small labeled set is available, what are the potential gains from using it for model selection only, versus using it for both model training and model selection? A simple way to use a small labeled set for parameter training together with a larger unlabeled set in our type-supervised setting is to do semi-supervised model training as follows (Nigam et al., 2000). Starting with our training loss function L(θ; lex, T_U, h_m), defined using the lexicon lex and the unlabeled set T_U, we define a combined loss function using both the unlabeled set T_U and the labeled set T_L: L(θ; lex, T_U, h_m) + λ L(θ; lex, T_L, h_m). We then select the parameters θ_m that minimize the new loss function, where λ is an additional hyperparameter that usually gives more weight to the labeled set. An advantage of this method is that it can be applied to any type-supervised model using less than 100 lines of code. We implement this method for semi-supervised training, and we use the labeled set both for semi-supervised model training and for hyperparameter selection using a standard five-fold cross-validation approach.
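The combined objective is a one-liner given a loss function; `loss_fn` below is a hypothetical stand-in for L(θ; lex, T, h), since the concrete training loss depends on the model family.

```python
def combined_loss(loss_fn, theta, lex, T_U, T_L, h, lam):
    """Semi-supervised objective: unlabeled-set loss plus a lambda-weighted
    labeled-set loss, as in the interpolation described above."""
    return loss_fn(theta, lex, T_U, h) + lam * loss_fn(theta, lex, T_L, h)
```

In practice λ is treated as one more hyperparameter and tuned on the labeled set, e.g. by cross-validation as described above.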

Experiments
We evaluate the introduced methods for model selection and training with type supervision in two type-supervised settings: when no labeled examples are available, and when a small number of labeled examples are available.
We use a feature-rich first-order HMM model (Berg-Kirkpatrick et al., 2010) with an L2 prior on the feature weights. Instead of using multinomial distributions for the local emissions and transitions, this model uses log-linear distributions (i.e., p(x_i | y_i) ∝ exp(λ · f(x_i, y_i))) with a feature vector f and a weight vector λ. We use the feature set described in Li et al. (2012): transition features; word-tag features (y_i, x_i) (lowercased words with frequency greater than a threshold); whether the word contains a hyphen and/or starts with a capital letter; character suffixes; and whether the word contains a digit. We initialize the transition and emission distributions of the HMM using unambiguous words, as proposed by Zhang and DeNero (2014).

Data. We use the Danish, Dutch, German, Greek, English, Italian, Portuguese, Spanish, and Swedish datasets from the CoNLL-X and CoNLL-2007 shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007). We map the POS labels in the CoNLL datasets to the universal POS tagset (Petrov et al., 2012), and we use the tag dictionaries provided by Li et al. (2012). For each language, we consider M = 15 configurations of the hyperparameters, which vary in the L2 regularization strength (5 values) and the minimum word frequency for word-tag features (3 values). When a small labeled set is available, we additionally choose one of 3 values for the weight of the labeled set (see Section 3). We report the final performance of the models selected using the different criteria as token-level accuracy on an unseen test set.
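The 15-configuration grid can be enumerated as a cross-product; the particular values below are hypothetical placeholders, since the paper does not list the exact regularization strengths or frequency thresholds used.

```python
from itertools import product

# Hypothetical grid values: the paper specifies 5 regularization strengths
# and 3 minimum-frequency thresholds, but not which values were used.
reg_strengths = [0.01, 0.1, 1.0, 10.0, 100.0]
min_word_freqs = [1, 5, 10]

configs = [{"l2": r, "min_freq": f}
           for r, f in product(reg_strengths, min_word_freqs)]
assert len(configs) == 15  # M = 15 configurations per language
```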
No labeled examples. When no labeled examples are available, we do model training and selection using only the unlabeled text T and a tagging lexicon lex. We compare three type-supervised model selection criteria: conditional likelihood, joint likelihood, and the held-out lexicon criterion eval_devlex. Additionally, we include the performance of a method which selects the hyperparameters using labeled data in English and uses these same hyperparameters for English and all other languages (we call this method "English Labeled"). Fig. 2 shows the accuracy of the models chosen by each of the four criteria on nine languages, as well as the average accuracy across languages. The lower (upper) bound on average performance obtained by always choosing the worst (best) hyperparameters is 82.77 (85.83). eval_joint outperformed eval_cond on eight out of the nine languages and achieved a significantly higher average accuracy (85.05 vs. 83.70). eval_devlex outperformed eval_joint on six out of nine languages, but did significantly worse on one language (German), which resulted in a slightly lower average accuracy. Choosing the hyperparameters using English labeled data and applying them to all languages performed comparably to eval_joint, with slightly higher average accuracy even when limited to the non-English languages (85.0 vs. 84.9). Overall, the results showed that the conditional log-likelihood criterion was dominated by the other three, which were comparable in average accuracy. Looking at the eight languages excluding English (since one criterion uses labeled data for English), the newly proposed held-out lexicon criterion was the winning method on five out of eight languages, eval_cond was best on one language, eval_joint was best (or tied for best) on two, and English Labeled was tied for best on one language.
Few labeled examples. We consider two ways of leveraging the labeled examples: (i) type-supervised model training + supervised model selection: use only the unlabeled examples to optimize the model parameters, then use the labeled examples for supervised model selection with eval_sup; and (ii) semi-supervised model training + supervised model selection (see Section 3 for details). Fig. 3 shows how much we can improve on the method with the highest average accuracy from Fig. 2 (eval_joint) when a small number of labeled examples is available. Using the 300 labeled sentences for semi-supervised training and model selection reduced the error by 44.6% (compared to the best model using only type-level supervision, with average accuracy 85.05; the semi-supervised average is 91.8). In contrast, using the 300 sentences to select hyperparameters only reduced the error by less than 5% (the average accuracy was 85.75). Even when only 50 labeled sentences are used for semi-supervised training and supervised model selection, the average accuracy still reaches 89 (results not shown in the figure).

Conclusion
We presented the first comparative evaluation of model selection criteria for type-supervised POS tagging on many languages. We introduced a novel, generally applicable model selection criterion which outperformed previously proposed ones for a majority of languages. We evaluated the utility of a small labeled set for model selection versus model training, and showed that when such a labeled set is available, it should not be used solely for supervised model selection, because using it additionally for model parameter training provides strikingly larger accuracy gains.