Bayesian Modeling of Lexical Resources for Low-Resource Settings

Lexical resources such as dictionaries and gazetteers are often used as auxiliary data for tasks such as part-of-speech induction and named-entity recognition. However, discriminative training with lexical features requires annotated data to reliably estimate the lexical feature weights and may result in overfitting the lexical features at the expense of features which generalize better. In this paper, we investigate a more robust approach: we stipulate that the lexicon is the result of an assumed generative process. Practically, this means that we may treat the lexical resources as observations under the proposed generative model. The lexical resources provide training data for the generative model without requiring separate data to estimate lexical feature weights. We evaluate the proposed approach in two settings: part-of-speech induction and low-resource named-entity recognition.


Introduction
Dictionaries and gazetteers are useful in many natural language processing tasks. These lexical resources may be derived from freely available sources (such as Wikidata and Wiktionary) or constructed for a particular domain. Lexical resources are typically used to complement existing annotations for a given task (Ando and Zhang, 2005; Collobert et al., 2011). In this paper, we focus instead on low-resource settings where task annotations are unavailable or scarce. Specifically, we use lexical resources to guide part-of-speech induction (§4) and to bootstrap named-entity recognizers in low-resource languages (§5).
Given their success, it is perhaps surprising that incorporating gazetteers or dictionaries into discriminative models (e.g. conditional random fields) may sometimes hurt performance. This phenomenon is called weight under-training: lexical features, which detect whether a name is listed in the dictionary or gazetteer, are given excessive weight at the expense of other useful features such as spelling features that would generalize to unlisted names (Smith et al., 2005; Sutton et al., 2006; Smith and Osborne, 2006). Furthermore, discriminative training with lexical features requires sufficient annotated training data, which poses challenges for the unsupervised and low-resource settings we consider here.
Our observation is that Bayesian modeling provides a principled solution. The lexicon is itself a dataset that was generated by some process. Practically, this means that lexicon entries (words or phrases) may be treated as additional observations. As a result, these entries provide information about how names are spelled. The presence of the lexicon therefore now improves training of the spelling features, rather than competing with the spelling features to help explain the labeled corpus.
A downside is that generative models are typically less feature-rich than their globally normalized discriminative counterparts (e.g. conditional random fields). In designing our approach, the hierarchical sequence memoizer (HSM), we aim to be reasonably expressive while retaining practically useful inference algorithms. We propose a Bayesian nonparametric model to serve as a generative distribution responsible for both lexicon and corpus data. The proposed model memoizes previously used lexical entries (words or phrases) but backs off to a character-level distribution when generating novel types (Teh, 2006; Mochihashi et al., 2009). We propose an efficient inference algorithm for the proposed model using particle Gibbs sampling (§3). Our code is available at https://github.com/noa/bayesner.

Model
Our goal is to fit a model that can automatically annotate text. We observe a supervised or unsupervised training corpus. For each label y in the annotation scheme, we also observe a lexicon of strings of type y. For example, in our tagging task ( §4), a dictionary provides us with a list of words for each part-of-speech tag y. (These lists need not be disjoint.) For named-entity recognition (NER, §5), we use a list of words or phrases for each named-entity type y (PER, LOC, ORG, etc.). 1

Modeling the lexicon
We may treat the lexicon for type \(y\), of size \(m_y\), as having been produced by a set of \(m_y\) IID draws from an unknown distribution \(P_y\) over the words or named entities of type \(y\). It therefore provides some evidence about \(P_y\). We will later assume that \(P_y\) is also used when generating mentions of these words or entities in text. Thanks to this sharing of \(P_y\), if \(x\) = Washington is listed in the gazetteer of locations (\(y\) = LOC), we can draw the same conclusions as if we had seen a LOC-labeled instance of Washington in a supervised corpus. Generalizing this a bit, we may suppose that one observation of string \(x\) in the lexicon is equivalent to \(c\) labeled tokens of \(x\) in a corpus, where the constant \(c > 0\) is known as a pseudocount. In other words, observing a lexicon of \(m_y\) distinct types \(\{x_1, \ldots, x_{m_y}\}\) is equivalent to observing a labeled pseudocorpus of \(c\,m_y\) tokens. Notice that given such an observation, the prior probability of any candidate distribution \(P_y\) is reweighted by the likelihood

\[ \frac{(c\,m_y)!}{(c!)^{m_y}} \cdot \bigl(P_y(x_1)\,P_y(x_2) \cdots P_y(x_{m_y})\bigr)^{c}. \]

Therefore, this choice of \(P_y\) can have relatively high posterior probability only to the extent that it assigns high probability to all of the lexicon types.
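As a rough illustration of this pseudocount likelihood, the sketch below evaluates its logarithm for a small lexicon. The function name `lexicon_log_likelihood` and the dictionary `prob` (a stand-in for a candidate \(P_y\) restricted to a handful of strings) are assumptions made for the example, and an integer pseudocount is assumed so that the factorials are well defined.

```python
import math

def lexicon_log_likelihood(lexicon, prob, c=1):
    """Log-likelihood of a lexicon of distinct types, each counted as c tokens.

    `lexicon` is a list of distinct strings, `prob` maps a string to its
    probability under a candidate distribution, and `c` is an integer
    pseudocount (all names here are illustrative).
    """
    m = len(lexicon)
    # log of the multinomial coefficient (c*m)! / (c!)^m
    log_coeff = math.lgamma(c * m + 1) - m * math.lgamma(c + 1)
    # each listed type contributes c pseudo-tokens
    return log_coeff + c * sum(math.log(prob[x]) for x in lexicon)
```

A candidate distribution that gives any listed type a very small probability is penalized heavily, and the penalty grows linearly in `c` on the log scale.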

Discussion
We employ the above model because it has reasonable qualitative behavior and because computationally, it allows us to condition on observed lexicons as easily as we condition on observed corpora. However, we caution that as a generative model of the lexicon, it is deficient, in the sense that it allocates probability mass to events that cannot actually correspond to any lexicon. After all, drawing \(c\,m_y\) IID tokens from \(P_y\) is highly unlikely to result in exactly \(c\) tokens of each of \(m_y\) different types, and yet a run of our system will always assume that precisely this happened to produce each observed lexicon! To avoid the deficiency, one could assume that the lexicon was generated by rejection sampling: that is, the gazetteer author repeatedly drew samples of size \(c\,m_y\) from \(P_y\) until one was obtained that had this property, and then returned the set of distinct types in that sample as the lexicon for \(y\). But this is hardly a realistic description of how gazetteers are actually constructed. Rather, one imagines that the gazetteer author simply harvested a lexicon of frequent types from \(P_y\) or from a corpus of tokens generated from \(P_y\). For example, a much better generative story is that the lexicon was constructed as the first \(m_y\) distinct types to appear \(\geq c\) times in an unbounded sequence of IID draws from \(P_y\). When \(c = 1\), this is equivalent to modeling the lexicon as \(m_y\) draws without replacement from \(P_y\). Unfortunately, draws without replacement are no longer IID or exchangeable: order matters. It would therefore become difficult to condition inference and learning on an observed lexicon, because we would need to explicitly sum or sample over the possibilities for the latent sequence of tokens (or stick segments). We therefore adopt the simpler deficient model.

1 Dictionaries and knowledge bases provide more information than we use in this paper. For instance, Wikidata also provides a wealth of attributes and other metadata for each entity \(s\). In principle, this additional information could also be helpful in estimating \(P_y(s)\); we leave this intriguing possibility for future work.
A version of our lexicon model (with c = 1) was previously used by Dreyer and Eisner (2011, Appendix C), who observed a list of verb paradigm types rather than word or entity-name types.

Prior distribution over P y
We assume a priori that \(P_y\) was drawn from a Pitman-Yor process (PYP) (Pitman and Yor, 1997). Both the lexicon and the ordinary corpus are observations that provide information about \(P_y\). The PYP is defined by three parameters: a concentration parameter \(\alpha\), a discount parameter \(d\), and a base distribution \(H_y\). In our case, \(H_y\) is a distribution over \(X = \Sigma^*\), the set of possible strings over a finite character alphabet \(\Sigma\).
For example, \(H_{\mathrm{LOC}}\) is used to choose new place names, so it describes what place names tend to look like in the language. The draw \(P_{\mathrm{LOC}} \sim \mathrm{PYP}(d, \alpha, H_{\mathrm{LOC}})\) is an "adapted" version of \(H_{\mathrm{LOC}}\). It is \(P_{\mathrm{LOC}}\) that determines how often each name is mentioned in text (and whether it is mentioned in the lexicon). Some names such as Washington that are merely plausible under \(H_{\mathrm{LOC}}\) are far more frequent under \(P_{\mathrm{LOC}}\), presumably because they were chosen as the names of actual, significant places. These place names were randomly drawn from \(H_{\mathrm{LOC}}\) as part of the procedure for drawing \(P_{\mathrm{LOC}}\).
The expected value of \(P_y\) is \(H_y\) (i.e., \(H_y\) is the mean of the PYP distribution), but if \(\alpha\) and \(d\) are small, then a typical draw of \(P_y\) will be rather different from \(H_y\), with much of the probability mass falling on a subset of the strings.
At training or test time, when deciding whether to label a corpus token of \(x\) = Washington as a place or person, we will be interested in the relative values of \(P_{\mathrm{LOC}}(x)\) and \(P_{\mathrm{PER}}(x)\). In practice, we do not have to represent the unknown infinite object \(P_y\), but can integrate over its possible values. When \(P_y \sim \mathrm{PYP}(d, \alpha, H_y)\), then a sequence of draws \(X_1, X_2, \ldots \sim P_y\) is distributed according to a Chinese restaurant process, via

\[ P_y(X_{i+1} = x \mid X_1, \ldots, X_i) = \frac{\mathrm{customers}(x) - d \cdot \mathrm{tables}(x) + \bigl(\alpha + d \sum_{x'} \mathrm{tables}(x')\bigr)\, H_y(x)}{\alpha + i} \qquad (1) \]

where \(\mathrm{customers}(x) \leq i\) is the number of times that \(x\) appeared among \(X_1, \ldots, X_i\), and \(\mathrm{tables}(x) \leq \mathrm{customers}(x)\) is the number of those times that \(x\) was drawn from \(H_y\) (where each \(P_y(X_i \mid \cdots)\) defined by (1) is interpreted as a mixture distribution that sometimes uses \(H_y\)).
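The Chinese restaurant view can be sketched in a few lines of Python. This is a minimal illustration of the standard Pitman-Yor predictive rule with per-type customer and table counts, not the authors' implementation; the class name `PitmanYorCRP`, its method names, and the callable `base_prob` (standing in for \(H_y\)) are assumptions for the example.

```python
import random
from collections import Counter

class PitmanYorCRP:
    """Per-type counts for a Pitman-Yor Chinese restaurant (illustrative)."""

    def __init__(self, discount, concentration, base_prob):
        self.d = discount
        self.alpha = concentration
        self.base_prob = base_prob      # callable: string -> base probability
        self.customers = Counter()      # tokens observed per type
        self.tables = Counter()         # base-distribution draws per type
        self.n_customers = 0
        self.n_tables = 0

    def prob(self, x):
        """Predictive probability of the next draw being x."""
        if self.n_customers == 0:
            return self.base_prob(x)
        cached = self.customers[x] - self.d * self.tables[x]
        backoff = (self.alpha + self.d * self.n_tables) * self.base_prob(x)
        return (cached + backoff) / (self.alpha + self.n_customers)

    def observe(self, x, rng):
        """Add one token of x, opening a new table with the usual probability."""
        cached = self.customers[x] - self.d * self.tables[x]
        backoff = (self.alpha + self.d * self.n_tables) * self.base_prob(x)
        if self.customers[x] == 0 or rng.random() * (cached + backoff) < backoff:
            self.tables[x] += 1
            self.n_tables += 1
        self.customers[x] += 1
        self.n_customers += 1
```

Under this sketch, a lexicon entry and a labeled corpus token would both be added through `observe`, which is what lets the lexicon act as extra observations of the same distribution.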

Form of the base distribution H y
By fitting \(H_y\) on corpus and lexicon data, we learn what place names or noun strings tend to look like in the language. By simultaneously fitting \(P_y\), we learn which ones are commonly mentioned. Recall that under our model, tokens are drawn from \(P_y\) but the underlying types are drawn from \(H_y\), e.g., \(H_y\) is responsible for (at least) the first token of each type. A simple choice for \(H_y\) is a Markov process that emits characters in \(\Sigma \cup \{\$\}\), where $ is a distinguished stop symbol that indicates the end of the string. Thus, the probability of producing $ controls the typical string length under \(H_y\).
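To make the simple Markov choice concrete, here is a minimal sketch of a first-order character model with a stop symbol. It is not the model used in the paper; the add-constant smoothing, the start-context marker `^`, and the names `fit_bigram` and `string_prob` are all assumptions for illustration. The alphabet is assumed to contain every character of the training strings and neither `$` nor `^`.

```python
from collections import defaultdict

STOP = "$"    # distinguished end-of-string symbol
START = "^"   # hypothetical marker for the empty left context

def fit_bigram(strings, alphabet, smoothing=1.0):
    """Return cond_prob(ch, prev): a smoothed first-order character model."""
    symbols = sorted(set(alphabet)) + [STOP]
    counts = defaultdict(lambda: defaultdict(float))
    for s in strings:
        prev = START
        for ch in list(s) + [STOP]:
            counts[prev][ch] += 1.0
            prev = ch

    def cond_prob(ch, prev):
        total = sum(counts[prev].values()) + smoothing * len(symbols)
        return (counts[prev][ch] + smoothing) / total

    return cond_prob

def string_prob(s, cond_prob):
    """Probability of emitting the characters of s followed by the stop symbol."""
    p, prev = 1.0, START
    for ch in list(s) + [STOP]:
        p *= cond_prob(ch, prev)
        prev = ch
    return p
```

Raising the probability of `$` in every context shortens typical strings, which is the length effect described above.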
We use a more sophisticated model of strings: a sequence memoizer (SM), which is a (hierarchical) Bayesian treatment of variable-order Markov modeling (Wood et al., 2009). The SM allows dependence on an unbounded history, and the probability of a given sequence (string) can be found efficiently much as in equation (1).
Given a string \(x = a_1 \cdots a_J \in \Sigma^*\), the SM assigns a probability to it via

\[ H_y(a_{1:J}) = \Bigl(\prod_{j=1}^{J} H_{y,a_{1:j-1}}(a_j)\Bigr) \cdot H_{y,a_{1:J}}(\$) \qquad (2) \]

where \(H_{y,u}(a)\) denotes the conditional probability of character \(a\) given the left context \(u \in \Sigma^*\). Each \(H_{y,u}\) is a distribution over \(\Sigma \cup \{\$\}\), defined recursively as

\[ H_{y,\epsilon} \sim \mathrm{PYP}(d_0, \alpha_0, U_\Sigma), \qquad H_{y,u} \sim \mathrm{PYP}(d_{|u|}, \alpha_{|u|}, H_{y,\sigma(u)}) \quad \text{for } u \neq \epsilon \qquad (3) \]

where \(\epsilon\) is the empty sequence, \(U_\Sigma\) is the uniform distribution over \(\Sigma \cup \{\$\}\), and \(\sigma(u)\) drops the first symbol from \(u\). The discount and concentration parameters \((d_{|u|}, \alpha_{|u|})\) are associated with the lengths of the contexts \(|u|\), and should generally be larger for longer (more specific) contexts, implying stronger backoff from those contexts.

Our inference procedure is largely indifferent to the form of \(H_y\), so the SM is not the only option. It would be possible to inject more assumptions into \(H_y\), for instance via structured priors for morphology or a grammar of name structure. Another possibility is to use a parametric model such as a neural language model (e.g., Jozefowicz et al., 2016), although this would require an inner loop of gradient optimization.

Modeling the sequence of tags y
We now turn to modeling the corpus. We assume that each sentence is generated via a sequence of latent labels \(y = y_{1:T} \in \mathcal{Y}^*\). The observations \(x_{1:T}\) are then generated conditioned on the label sequence via the corresponding \(P_y\) distribution (defined in §2.3). All observations with the same label \(y\) are drawn from the same \(P_y\), and thus this subsequence of observations is distributed according to the Chinese restaurant process (1).
We model \(y\) using another sequence memoizer model. This is similar to other hierarchical Bayesian models of latent sequences (Blunsom and Cohn, 2010), but again, it does not limit the Markov order (the number of preceding labels that are conditioned on). Thus, the probability of a sequence of latent types is computed in the same way as the base distribution in §2.4, that is,

\[ p(y_{1:T}) = \Bigl(\prod_{t=1}^{T} G_{y_{1:t-1}}(y_t)\Bigr) \cdot G_{y_{1:T}}(\$) \qquad (4) \]

\[ G_{\epsilon} \sim \mathrm{PYP}(d_0, \alpha_0, U_{\mathcal{Y}}), \qquad G_{u} \sim \mathrm{PYP}(d_{|u|}, \alpha_{|u|}, G_{\sigma(u)}) \quad \text{for } u \neq \epsilon \qquad (5) \]

where \(G_u\) is the distribution over the next label given the preceding label sequence \(u\), and \(U_{\mathcal{Y}}\) is the uniform base distribution over labels. The probability of transitioning to label \(y_t\) depends on the assignments of all previous labels \(y_1 \ldots y_{t-1}\).

For part-of-speech induction, each label \(y_t\) is the part-of-speech associated with the corresponding word \(x_t\). For named-entity recognition, we say that each word token is labeled with a named entity type (LOC, PER, ...), or with itself if it is not a named entity but rather a "context word." For example, the word token \(x_t\) = Washington could have been emitted from the label \(y_t\) = LOC, or from \(y_t\) = PER, or from \(y_t\) = Washington itself (in which case \(p(x_t \mid y_t) = 1\)). This uses a much larger set of labels \(\mathcal{Y}\) than in the traditional setup where all context words are emitted from the same latent label type O. Of course, most labels are impossible at most positions (e.g., \(y_t\) cannot be Washington unless \(x_t\) = Washington). This scheme makes our generative model sensitive to specific contexts (which is accomplished in discriminative NER systems by contextual features). For example, the SM for \(y\) can learn that "spoke to PER yesterday" is a common 4-gram in the label sequence \(y\), and thus we are more likely to label Washington as a person if \(x\) = ... spoke to Washington yesterday ....
We need one change to make this work, since now \(\mathcal{Y}\) must include not only the standard NER labels \(\mathcal{Y}'\) = {PER, LOC, ORG, GPE} but also words like Washington. Indeed, now \(\mathcal{Y} = \mathcal{Y}' \cup \Sigma^*\). But no uniform distribution exists over the infinite set \(\Sigma^*\), so how should we replace the base distribution \(U_{\mathcal{Y}}\) over labels in equation (5)? Answer: To draw from the new base distribution, sample \(y \sim U_{\mathcal{Y}' \cup \{\mathrm{CONTEXT}\}}\). If \(y\) = CONTEXT, however, then "expand" it by resampling \(y \sim H_{\mathrm{CONTEXT}}\). Here \(H_{\mathrm{CONTEXT}}\) is the base distribution over spellings of context words, and is learned just like the other \(H_y\) distributions in §2.4.
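The two-stage draw from the new base distribution can be sketched as follows. The callables `sample_context_word` and `context_word_prob` are hypothetical stand-ins for sampling from and evaluating the learned spelling distribution over context words; the sketch also assumes that no context word is spelled exactly like an entity label.

```python
import random

def draw_base_label(entity_labels, sample_context_word, rng):
    """Draw a label: uniform over entity labels plus CONTEXT, then expand CONTEXT."""
    choices = list(entity_labels) + ["CONTEXT"]
    y = rng.choice(choices)
    if y == "CONTEXT":
        # replace the placeholder by an actual word drawn from the spelling model
        y = sample_context_word(rng)
    return y

def base_label_prob(y, entity_labels, context_word_prob):
    """Probability of label y under the two-stage base distribution."""
    uniform = 1.0 / (len(entity_labels) + 1)
    if y in entity_labels:
        return uniform
    return uniform * context_word_prob(y)
```

With four entity labels, each of them receives base probability 1/5, and the remaining 1/5 is spread over all possible context-word spellings.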

Inference via particle Markov chain Monte Carlo

Sequential sampler
Taking \(Y\) to be a random variable, we are interested in the posterior distribution \(p(Y = y \mid x)\) over label sequences \(y\) given the emitted word sequence \(x\). Our model does not admit an efficient dynamic programming algorithm, owing to the dependencies introduced among the \(Y_t\) when we marginalize over the unknown \(G\) and \(P\) distributions that govern transitions and emissions, respectively. In contrast to tagging with a hidden Markov model, the distribution of each label \(Y_t\) depends on all previous labels \(y_{1:t-1}\), for two reasons: (i) The transition distribution \(p(Y_t = y \mid y_{1:t-1})\) has unbounded dependence because of the PYP prior (4). (ii) The emission distribution \(p(x_t \mid Y_t = y)\) depends on the emissions observed from any earlier tokens of \(y\), because of the Chinese restaurant process (1). When (ii) is the only complication, block Metropolis-Hastings samplers have proven effective (Johnson et al., 2007). However, this approach uses dynamic programming to sample from a proposal distribution efficiently, which (i) precludes in our case. Instead, we use sequential Monte Carlo (SMC), sometimes called particle filtering, as a proposal distribution. Particle filtering is typically used in online settings, including word segmentation (Borschinger and Johnson, 2011), to make decisions before all of \(x\) has been observed. However, we are interested in the inference (or smoothing) problem that conditions on all of \(x\) (Dubbin and Blunsom, 2012; Tripuraneni et al., 2015). SMC employs a proposal distribution \(q(y \mid x)\) whose definition decomposes as follows:

\[ q(y_{1:T} \mid x_{1:T}) = q_1(y_1 \mid x_1) \prod_{t=2}^{T} q_t(y_t \mid y_{1:t-1}, x_{1:t}) \qquad (6) \]

for \(T = |x|\). To sample a sequence of latent labels, first sample an initial label \(y_1\) from \(q_1\), then proceed incrementally by sampling \(y_t\) from \(q_t(\cdot \mid y_{1:t-1}, x_{1:t})\) for \(t = 2, \ldots, T\).
The final sampled sequence \(y\) is called a particle, and is given an unnormalized importance weight of \(\tilde{w} = \tilde{w}_T \cdot p(\$ \mid y_{1:T})\), where \(\tilde{w}_T\) was built up via

\[ \tilde{w}_t := \tilde{w}_{t-1} \cdot \frac{p(y_{1:t}, x_{1:t})}{p(y_{1:t-1}, x_{1:t-1})\; q(y_t \mid y_{1:t-1}, x_{1:t})} \qquad (7) \]

The SMC procedure consists of generating a system of \(M\) weighted particles with unnormalized importance weights \(\tilde{w}^{(m)}\), \(1 \leq m \leq M\).

Particle Gibbs. We employ SMC as a kernel in an MCMC sampler (Andrieu et al., 2010). In particular, we use a block Gibbs sampler in which we iteratively resample the hidden labeling \(y\) of a sentence \(x\) conditioned on the current labelings for all other sentences in the corpus. In this context, the algorithm is called conditional SMC since one particle is always fixed to the previous sampler state for the sentence being resampled, which ensures that the MCMC procedure is ergodic. At a high level, this procedure is analogous to other Gibbs samplers (e.g. for topic models), except that the conditional SMC (CSMC) kernel uses auxiliary variables (particles) in order to generate the new block variable assignments. The procedure is outlined in Algorithm 1. Given a previous latent state assignment \(y_{1:T}\) and observations \(x_{1:T}\), the CSMC kernel produces a new latent state assignment via \(M\) auxiliary particles where one particle is fixed to the previous assignment. For ergodicity, \(M \geq 2\), where larger values of \(M\) may improve mixing rate at the expense of increased computation per step.
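A conditional SMC kernel without resampling can be sketched roughly as follows. This is a schematic reading of the description above, not a reproduction of Algorithm 1: the function `csmc_kernel` and the callables `step` and `stop_prob` are hypothetical. Here `step(history, x_t, rng, forced)` is assumed to return a label and its incremental importance weight (using `forced` as the label when it is not `None`), and `stop_prob(labels)` is assumed to return the probability of ending the label sequence.

```python
import random

def csmc_kernel(prev_labels, words, step, stop_prob, num_particles, rng):
    """Resample the labeling of one sentence using M weighted particles.

    Particle 0 is pinned to the previous labeling; the others are proposed
    left to right. One particle is then selected in proportion to its weight.
    """
    if num_particles < 2:
        raise ValueError("need at least two particles for ergodicity")
    particles, weights = [], []
    for m in range(num_particles):
        labels, weight = [], 1.0
        for t, x_t in enumerate(words):
            forced = prev_labels[t] if m == 0 else None
            y_t, incremental = step(labels, x_t, rng, forced)
            labels.append(y_t)
            weight *= incremental
        particles.append(labels)
        weights.append(weight * stop_prob(labels))
    # select the new sampler state in proportion to the importance weights
    return rng.choices(particles, weights=weights, k=1)[0]
```

In a full sampler, `step` would also consult the counts contributed by every other sentence and by the lexicons, with the current sentence's own counts removed before it is resampled.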
Proposal distribution. The choice of proposal distribution \(q\) is crucial to the performance of SMC methods. In the case of continuous latent variables, it is common to propose \(y_t\) from the transition probability \(p(Y_t \mid y_{1:t-1})\) because this distribution usually has a simple form that permits efficient sampling. However, it is possible to do better in the case of discrete latent variables. The optimal proposal distribution is the one which minimizes the variance of the importance weights, and is given by

\[ q(y_t \mid y_{1:t-1}, x_{1:t}) := p(y_t \mid y_{1:t-1}, x_{1:t}) = \frac{p(y_t \mid y_{1:t-1})\, p(x_t \mid y_t)}{\sum_{y'} p(y' \mid y_{1:t-1})\, p(x_t \mid y')} \qquad (8) \]

Substituting this expression in equation (7) and simplifying yields the incremental weight update:

\[ \tilde{w}_t := \tilde{w}_{t-1} \cdot \sum_{y'} p(y' \mid y_{1:t-1})\, p(x_t \mid y') \qquad (9) \]

Resampling. In filtering applications, it is common to use resampling operations to prevent weight degeneracy. We do not find resampling necessary here for three reasons. First, note that we resample hidden label sequences that are only as long as the number of words in a given sentence. Second, we use a proposal which minimizes the variance of the weights. Finally, we use SMC as a kernel embedded in an MCMC sampler; asymptotically, this procedure yields samples from the desired posterior regardless of degeneracy (which only affects the mixing rate). Practically speaking, one can diagnose the need for resampling via the effective sample size (ESS) of the particle system:

\[ \mathrm{ESS} = \frac{\bigl(\sum_{m=1}^{M} \tilde{w}^{(m)}\bigr)^2}{\sum_{m=1}^{M} \bigl(\tilde{w}^{(m)}\bigr)^2} \]

In our experiments, we find that ESS remains high (a significant fraction of \(M\)) even for long sentences, suggesting that resampling is not necessary to enable mixing of the Gibbs sampler.
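For a finite candidate label set, one step of the optimal proposal and the effective sample size diagnostic can be sketched as below. The names `propose_label`, `transition_prob`, and `emission_prob` are assumptions for the example; the two callables stand in for the model's predictive probabilities given the current counts, and at least one candidate label is assumed to have nonzero score.

```python
import random

def propose_label(labels, transition_prob, emission_prob, history, x_t, rng):
    """Sample y_t from the optimal proposal; also return the incremental weight."""
    scores = [transition_prob(y, history) * emission_prob(x_t, y) for y in labels]
    # The normalizer equals the incremental weight, which is the same
    # no matter which label ends up being sampled.
    incremental_weight = sum(scores)
    y_t = rng.choices(labels, weights=scores, k=1)[0]
    return y_t, incremental_weight

def effective_sample_size(weights):
    """ESS computed directly from unnormalized importance weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq
```

Equal weights give an ESS equal to the number of particles, while a single dominant particle drives it toward 1, the degenerate case that resampling is meant to address.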
Decoding. In order to obtain a single latent variable assignment for evaluation purposes, we simply take the state of the Markov chain after a fixed number of iterations of particle Gibbs. In principle, one could collect many samples during particle Gibbs and use them to perform minimum Bayes risk decoding under a given loss function. However, this approach is somewhat slower and did not appear to improve performance in preliminary experiments.

Segmental sampler
We now present a sampler for settings such as NER where each latent label emits a segment consisting of 1 or more words. We make use of the same transition distribution \(p(y_t \mid y_{1:t-1})\), which determines the probability of a label in a given context, and an emission distribution \(p(x_t \mid y_t)\) (namely \(P_{y_t}\)); these are assumed to be drawn from hierarchical Pitman-Yor processes described in §2.5 and §2.1, respectively. To allow the \(x_t\) to be a multi-word string, we simply augment the character set with a distinguished space symbol ␣ in \(\Sigma\) that separates words within a string. For instance, New York would be generated as the 9-symbol sequence New␣York$.
Although the model emits New York all at once, we still formulate our inference procedure as a particle filter that proposes one tag for each word. Thus, for a given segment label type \(y\), we allow two tag types for its words:
• I-\(y\) corresponds to a non-final word in a segment of type \(y\) (in effect, a word with a following ␣ attached).
• E-\(y\) corresponds to the final word in a segment of type \(y\).
For instance, \(x_{1:2}\) = New York would be annotated as a location segment by defining \(y_{1:2}\) = I-LOC E-LOC. This says that \(y_{1:2}\) has jointly emitted \(x_{1:2}\), an event with prior probability \(P_{\mathrm{LOC}}\)(New York). Each word that is not part of a named entity is considered to be a single-word segment. For example, if the next word were \(x_3\) = hosted then it should be tagged with \(y_3\) = hosted as in §2.5, in which case \(x_3\) was emitted with probability 1.
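The correspondence between per-word tags and segments can be sketched as a small decoding routine. The function name `tags_to_segments` is an assumption for the example, and the sketch assumes that no context word itself begins with "I-" or "E-".

```python
def tags_to_segments(words, tags):
    """Group words into (string, label) segments from I-y / E-y / context tags."""
    segments, current, current_type = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith("I-") or tag.startswith("E-"):
            label = tag[2:]
            if current and current_type != label:
                raise ValueError("segment label changed mid-segment")
            current.append(word)
            current_type = label
            if tag.startswith("E-"):
                # the final word closes the segment
                segments.append((" ".join(current), label))
                current, current_type = [], None
        else:
            if current:
                raise ValueError("context word inside an unfinished segment")
            # a context word is a single-word segment labeled with itself
            segments.append((word, tag))
    if current:
        raise ValueError("sentence ended inside a segment")
    return segments
```

For the example in the text, the words New, York, hosted with tags I-LOC, E-LOC, hosted decode to the location segment "New York" followed by the context word "hosted".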
To adapt the sampler described in §3.1 for the segmental case, we need only to define the transition and emission probabilities used in equation (8) and its denominator (9).
For the transition probabilities, we want to model the sequence of segment labels. If \(y_{t-1}\) is an I-tag, we take \(p(y_t \mid y_{1:t-1}) = 1\), since then \(y_t\) merely continues an existing segment. Otherwise \(y_t\) starts a new segment, and we take \(p(y_t \mid y_{1:t-1})\) to be defined by the PYP's probability \(G_{y_{1:t-1}}(y_t)\) as usual, but where we interpret the subscript \(y_{1:t-1}\) to refer to the possibly shorter sequence of segment labels implied by those \(t - 1\) tags.
For the emission probabilities, if \(y_t\) has the form I-\(y\) or E-\(y\), then its associated emission probability no longer has the form \(p(x_t \mid y_t)\), since the choice of \(x_t\) also depends on any words emitted earlier in the segment. Let \(s \leq t\) be the starting position of the segment that contains \(t\). If \(y_t\) = E-\(y\), then the emission probability is proportional to \(P_y(x_s x_{s+1} \ldots x_t)\). If \(y_t\) = I-\(y\) then the emission probability is proportional to the prefix probability \(\sum_{x} P_y(x)\) where \(x\) ranges over all strings in \(\Sigma^*\) that have \(x_s x_{s+1} \ldots x_t\) as a proper prefix. Prefix probabilities in \(H_y\) are easy to compute because \(H_y\) has the form of a language model, and prefix probabilities in \(P_y\) are therefore also easy to compute (using a prefix tree for efficiency).
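The two emission cases can be illustrated on a toy distribution with finite support. This sketch ignores the infinite tail contributed by the base distribution and does a linear scan rather than using a prefix tree; the function `emission_score` and the dictionary `dist` are assumptions for the example. It also assumes that a non-final word carries its trailing space, so only strings that continue with another word are counted.

```python
def emission_score(dist, segment_words, is_final):
    """Unnormalized emission score for the words of a segment seen so far.

    `dist` maps complete strings (words joined by single spaces) to their
    probabilities; it is a toy stand-in for the emission distribution.
    """
    so_far = " ".join(segment_words)
    if is_final:
        # E-tag: the segment ends here, so score the complete string
        return dist.get(so_far, 0.0)
    # I-tag: total mass of longer strings that continue with another word
    prefix = so_far + " "
    return sum(p for x, p in dist.items() if x.startswith(prefix))
```

For instance, if the toy table lists New York and New Jersey as locations, tagging New as I-LOC is scored by their combined mass, whereas Newark contributes nothing because it is not a word-level continuation.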
This concludes the description of the segmental sampler. Note that the particle Gibbs procedure is unchanged.
of-speech. In our setting, however, the dictionaries are not constraints but evidence. If monthly is listed in (only) the adjective lexicon, this tells us that \(P_{\mathrm{ADJ}}\) sometimes generates monthly and therefore that \(H_{\mathrm{ADJ}}\) may also tend to generate other words that end with -ly. However, for us, \(P_{\mathrm{ADV}}(\text{monthly}) > 0\) as well, allowing us to still correctly treat monthly as a possible adverb if we later encounter it in a training or test corpus.

Experiments
We follow the experimental procedure described in Li et al. (2012), and use their released code and data to compare to their best model: a second-order maximum entropy Markov model parametrized with log-linear features (SHMM-ME). This model uses hand-crafted features designed to distinguish between different parts-of-speech, and it has special handling for rare words. This approach is surprisingly effective and outperforms alternate approaches such as cross-lingual transfer (Das and Petrov, 2011). However, it also has limitations, since words that do not appear in the dictionary will be unconstrained, and spurious or incorrect lexical entries may lead to propagation of errors.
The lexicons are taken from the Wiktionary project; their size and coverage are documented by Li et al. (2012). We evaluate our model on multi-lingual data released as part of the CoNLL 2007 and CoNLL-X shared tasks. In particular, we use the same set of languages as Li et al. (2012). For our method, we impute the parts-of-speech by running particle Gibbs for 100 epochs, where one epoch consists of resampling the states for each sentence in the corpus. The final sampler state is then taken as a 1-best tagging of the unlabeled data.
Results. The results are reported in Table 1. We find that our hierarchical sequence memoizer (HSM) matches or exceeds the performance of the baseline (SHMM-ME) for nearly all the tested languages, particularly for morphologically rich languages such as German where the spelling distributions \(H_y\) may capture regularities. It is interesting to note that our model performs worse relative to the baseline for English; one possible explanation is that the baseline uses hand-engineered features whereas ours does not, and these features may have been tuned using English data for validation.
Our generative model is supposed to exploit lexicons well. To see what is lost from using a generative model, we also compared with Li et al. (2012) on standard supervised tagging without any lexicons. Even here our generative model is very competitive, losing only on English and Swedish.

Bootstrapping NER with type-level supervision
Name lists and dictionaries are useful for NER particularly when in-domain annotations are scarce. However, with little annotated data, discriminative training may be unable to reliably estimate lexical feature weights and may overfit. In this section, we are interested in evaluating our proposed Bayesian model in the context of low-resource NER.

Data
Most languages do not have corpora annotated for parts-of-speech, named-entities, syntactic parses, or other linguistic annotations. Therefore, rapidly deploying natural language technologies in a new language may be challenging. In the context of facilitating relief responses in emergencies such as natural disasters, the DARPA LORELEI (Low Resource Languages for Emergent Incidents) program has sponsored the development and release of representative "language packs" for Turkish and Uzbek with more languages planned (Strassel and Tracey, 2016). We use the named-entity annotations as part of these language packs which include persons, locations, organizations, and geo-political entities, in order to explore bootstrapping named-entity recognition from small amounts of data. We consider two types of data: (i) in-context annotations, where sentences are fully annotated for named-entities, and (ii) lexical resources. The LORELEI language packs lack adequate in-domain lexical resources for our purposes. Therefore, we simulate in-domain lexical resources by holding out portions of the annotated development data and deriving dictionaries and name lists from them. For each label \(y\) ∈ {PER, LOC, ORG, GPE, CONTEXT}, our lexicon for \(y\) lists all distinct \(y\)-labeled strings that appear in the held-out data. This setup ensures that the labels associated with lexicon entries correspond to the annotation guidelines used in the data we use for evaluation. It avoids possible problems that might arise when leveraging noisy out-of-domain knowledge bases, which we may explore in future.

Table 1: Part-of-speech induction results in multiple languages.

Evaluation
In this section we report supervised NER experiments on two low-resource languages: Turkish and Uzbek. We vary both the amount of supervision and the size of the lexical resources. A challenge when evaluating the performance of a model with small amounts of training data is that there may be high variance in the results. In order to have more confidence in our results, we perform bootstrap resampling experiments in which the training set, evaluation set, and lexical resources are randomized across several replications of the same experiment (for each of the data conditions). We use 10 replications for each of the data conditions reported in Figures 1-2, and report both the mean performance and 95% confidence intervals.
Baseline. We use the Stanford NER system with a standard set of language-independent features (Finkel et al., 2005). This model is a conditional random field (CRF) with feature templates which include character n-grams as well as word shape features. Crucially, we also incorporate lexical features. The CRF parameters are regularized using an L1 penalty and optimized via Orthant-wise limited-memory quasi-Newton optimization (Andrew and Gao, 2007). For both our proposed method and the discriminative baseline, we use a fixed set of hyperparameters (i.e. we do not use a separate validation set for tuning each data condition). In order to make a fair comparison to the CRF, we use our sampler for forward inference only, without resampling on the test data.
Results. We show learning curves as a function of supervised training corpus size. Figure 1 shows that our generative model strongly beats the baseline in this low-data regime. In particular, when there is little annotated training data, our proposed generative model can compensate by exploiting the lexicon, while the discriminative baseline scores terribly. The performance gap decreases with larger supervised corpora, which is consistent with prior results comparing generative and discriminative training (Ng and Jordan, 2002).
In Figure 2, we show the effect of the lexicon's size: as expected, larger lexicons are better. The generative approach significantly outperforms the discriminative baseline at any lexicon size, although its advantage drops for smaller lexicons or larger training corpora.
In Figure 1 we found that increasing the pseudocount c consistently decreases performance, so we used c = 1 in our other experiments. 9

Conclusion
This paper has described a generative model for low-resource sequence labeling and segmentation tasks using lexical resources. Experiments in semisupervised and low-resource settings have demonstrated its applicability to part-of-speech induction and low-resource named-entity recognition. There are many potential avenues for future work. Our model may be useful in the context of active learning where efficient re-estimation and performance in low-data conditions are important. It would also be interesting to explore more expressive parameterizations, such as recurrent neural networks for \(H_y\). In the space of neural methods, differentiable memory (Santoro et al., 2016) may be more flexible than the PYP prior, while retaining the ability of the model to cache strings observed in the gazetteer.

Figure 1: Absolute NER performance for Turkish (y-axis) as a function of corpus size (x-axis). The y-axis gives the F1 score on a held-out evaluation set (averaged over 10 bootstrap replicates, with error bars showing 95% confidence intervals). Our generative approach is compared to a baseline discriminative model with lexicon features (lowest curve). 500 held-out sentences were used to create the lexicon for both methods. Note that increasing the pseudocount c for lexicon entries (upper curves) tends to decrease performance for the generative model; we therefore take c = 1 in all other experiments. This graph shows Turkish; the corresponding Uzbek figure is available as supplementary material.