Unsupervised Disambiguation of Syncretism in Inflected Lexicons

Lexical ambiguity makes it difficult to compute useful statistics of a corpus. A given word form might represent any of several morphological feature bundles. One can, however, use unsupervised learning (as in EM) to fit a model that probabilistically disambiguates word forms. We present such an approach, which employs a neural network to smoothly model a prior distribution over feature bundles (even rare ones). Although this basic model does not consider a token’s context, that very property allows it to operate on a simple list of unigram type counts, partitioning each count among different analyses of that unigram. We discuss evaluation metrics for this novel task and report results on 5 languages.


Introduction
Inflected lexicons, lists of morphologically inflected forms, are commonplace in NLP. Such lexicons currently exist for over 100 languages in a standardized annotation scheme (Kirov et al., 2018), making them one of the most multilingual annotated resources in existence. These lexicons are typically annotated at the type level, i.e., each word type is listed with its possible morphological analyses, divorced from sentential context.
One might imagine that most word types are unambiguous. However, many inflectional systems are replete with a form of ambiguity termed syncretism: a systematic merger of morphological slots. In English, some verbs have five distinct inflected forms, but regular verbs (the vast majority) merge two of these and so distinguish only four. The verb sing has the past-tense form sang but the participial form sung; the verb talk, on the other hand, employs talked for both functions. The form talked is thus said to be syncretic. Our task is to partition the count of talked in a corpus between the past-tense and participial readings.

Table 1: Full paradigms for the German nouns Wort ("word") and Herr ("gentleman") with abbreviated and tabularized UniMorph annotation. The syncretic forms are bolded and colored by ambiguity class. Note that, while in the plural the nominative and accusative are always syncretic across all paradigms, the same is not true in the singular.
In this paper, we model a generative probability distribution over annotated word forms, and fit the model parameters using the token counts of unannotated word forms. The resulting distribution predicts how to partition each form's token count among its possible annotations. While our method actually deals with all ambiguous forms in the lexicon, it is particularly useful for syncretic forms because syncretism is often systematic and pervasive.
In English, our unsupervised procedure learns from the counts of irregular pairs like sang-sung that a verb's past tense tends to be more frequent than its past participle. These learned parameters are then used to disambiguate talked. The method can also learn from regular paradigms. For example, it learns from the counts of pairs like runs-run that singular third-person forms are common. It then uses these learned parameters to guess that tokens of run are often singular or third-person (though never both at once, because the lexicon does not list that as a possible analysis of run).

Formalizing Inflectional Morphology
We adopt the framework of word-based morphology (Aronoff, 1976; Spencer, 1991). In the present paper, we consider only inflectional morphology. An inflected lexicon is a set of word types. Each word type is a 4-tuple of a part-of-speech tag, a lexeme, an inflectional slot, and a surface form.
A lexeme is a discrete object (represented by an arbitrary integer or string, which we typeset in cursive) that indexes the word's core meaning and part of speech. A part-of-speech (POS) tag is a coarse syntactic category such as VERB. Each POS tag allows some set of lexemes, and also allows some set of inflectional slots such as "1st-person present singular." Each allowed ⟨tag, lexeme, slot⟩ triple is realized, in only one way, as an inflected surface form, a string over a fixed phonological or orthographic alphabet Σ. In this work, we take Σ to be an orthographic alphabet.
A paradigm π(t, ℓ) is the mapping from tag t's slots to the surface forms that "fill" those slots for lexeme ℓ. For example, in the English paradigm π(VERB, talk), the past-tense slot is said to be filled by talked, meaning that the lexicon contains the tuple ⟨VERB, talk, PAST, talked⟩. We will specifically work with the UniMorph annotation scheme (Sylak-Glassman, 2016). Here each slot specifies a morpho-syntactic bundle of inflectional features (also called a morphological tag in the literature), such as tense, mood, person, number, and gender. For example, the German surface form Wörtern is listed in the lexicon with tag NOUN, lemma Wort, and a slot specifying the feature bundle ⟨NUM=PL, CASE=DAT⟩. An example of UniMorph annotation is found in Table 1.

What is Syncretism?
We say that a surface form f is syncretic if there exist two distinct slots s_1 ≠ s_2 such that some paradigm π(t, ℓ) maps both s_1 and s_2 to f. In other words, a single form fills multiple slots in a paradigm: syncretism may be thought of as intra-paradigmatic ambiguity. This definition does depend on the exact annotation scheme in use, as some schemes collapse syncretic slots. For example, in German nouns, no lexeme distinguishes the nominative, accusative, and genitive plurals. Thus, a human-created lexicon might employ a single slot NUM=PL, CASE=NOM/ACC/GEN and say that Wörter fills just this slot rather than three separate slots. For a discussion, see Baerman et al. (2005).

Inter-Paradigmatic Ambiguity
A different kind of ambiguity occurs when a surface form belongs to more than one paradigm. A form f is inter-paradigmatically ambiguous if ⟨t_1, ℓ_1, s_1, f⟩ and ⟨t_2, ℓ_2, s_2, f⟩ are both in the lexicon with ⟨t_1, ℓ_1⟩ ≠ ⟨t_2, ℓ_2⟩. For example, talks belongs to the English paradigms π(VERB, talk) and π(NOUN, talk). The model we present in §3 will resolve both syncretism and inter-paradigmatic ambiguity. However, our exposition focuses on the former, as it is cross-linguistically more common.
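To make both definitions concrete, here is a minimal Python sketch over a toy lexicon of ⟨tag, lexeme, slot, form⟩ tuples (the entries and slot labels are invented for illustration and only loosely follow UniMorph); it flags forms that are syncretic within a paradigm and forms that are ambiguous across paradigms.

from collections import defaultdict

# Toy lexicon of (tag, lexeme, slot, form) tuples -- invented illustrative entries.
lexicon = [
    ("VERB", "talk", "PST",         "talked"),
    ("VERB", "talk", "V.PTCP;PST",  "talked"),   # same paradigm, second slot: syncretism
    ("VERB", "talk", "PRS;3;SG",    "talks"),
    ("NOUN", "talk", "PL",          "talks"),    # different paradigm: inter-paradigmatic ambiguity
]

slots_in_paradigm = defaultdict(set)   # (tag, lexeme, form) -> set of slots it fills
paradigms_of_form = defaultdict(set)   # form -> set of (tag, lexeme) paradigms

for t, lex, s, f in lexicon:
    slots_in_paradigm[(t, lex, f)].add(s)
    paradigms_of_form[f].add((t, lex))

syncretic = {f for (t, lex, f), slots in slots_in_paradigm.items() if len(slots) > 1}
inter_paradigmatic = {f for f, paradigms in paradigms_of_form.items() if len(paradigms) > 1}

print(syncretic)            # {'talked'}
print(inter_paradigmatic)   # {'talks'}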

Disambiguating Surface Form Counts
The previous sections §2.1 and §2.2 discussed two types of ambiguity found in inflected lexicons. The goal of this paper is the disambiguation of raw surface form counts, taken from an unannotated text corpus. In other words, given such counts, we seek to impute the fractional counts for individual lexical entries (4-tuples), which are unannotated in raw text. Let us assume that the word talked is observed c(talked) times in a raw English text corpus. We do not know which instances of talked are participles and which are past-tense forms. However, given a probability distribution p_θ(t, ℓ, s | f), we may disambiguate these counts in expectation, i.e., we attribute a fractional count of c(talked) · p_θ(t, ℓ, s | talked) to each analysis ⟨t, ℓ, s⟩ of talked, in particular to the past participle of the VERB talk. Our aim is the construction and unsupervised estimation of the distribution p_θ(t, ℓ, s | f).
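As a minimal sketch of this expected-count partition, assuming we already have the posterior p_θ(t, ℓ, s | f) for a form as a Python dictionary (the probabilities and count below are made up for illustration):

# Hypothetical posterior over analyses of "talked", and its raw corpus count.
posterior = {
    ("VERB", "talk", "PST"):         0.7,   # illustrative numbers, not real estimates
    ("VERB", "talk", "V.PTCP;PST"):  0.3,
}
c_talked = 1000

# Attribute c(talked) * p(analysis | talked) to each analysis.
fractional_counts = {analysis: c_talked * p for analysis, p in posterior.items()}
# -> 700 past-tense tokens and 300 past-participle tokens, in expectation.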
While the task at hand is novel, what applications does it have? We are especially interested in sampling tuples ⟨t, ℓ, s, f⟩ from an inflected lexicon. Sampling is a necessity for creating train-test splits for evaluating morphological inflectors, which has recently become a standard task in the literature (Durrett and DeNero, 2013; Hulden et al., 2014; Nicolai et al., 2015; Faruqui et al., 2016), and has seen two shared tasks (Cotterell et al., 2016, 2017). Creating train-test splits for training inflectors involves sampling without replacement so that all test types are unseen. Ideally, we would like more frequent word types in the training portion and less frequent ones in the test portion. This is a realistic evaluation: a training lexicon for a new language would tend to contain frequent types, so the system should be tested on its ability to extrapolate to rarer types that could not be looked up in that lexicon, as discussed by Cotterell et al. (2015). To make the split, we sample N word types without replacement, which is equivalent to collecting the first N distinct forms from an annotated corpus generated from the same unigram distribution.
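One simple way to realize such a split is sequential weighted sampling without replacement, which yields the same distribution as taking the first N distinct types seen in a token stream drawn i.i.d. from the unigram distribution. The sketch below assumes we have a (possibly fractional) count per type; it is an illustration, not necessarily the exact procedure used to build the released splits.

import random

def split_types(counts, n_train, seed=0):
    """counts: dict mapping each lexical type to its (fractional) token count.
    Draw n_train types without replacement, each draw proportional to count,
    for the training portion; all remaining types form the test portion."""
    rng = random.Random(seed)
    remaining = dict(counts)
    train = []
    for _ in range(min(n_train, len(remaining))):
        types, weights = zip(*remaining.items())
        chosen = rng.choices(types, weights=weights, k=1)[0]
        train.append(chosen)
        del remaining[chosen]
    return train, list(remaining)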
The fractional counts that our method estimates may also be useful for corpus linguistics-for example, tracking the frequency of specific lexemes over time, or comparing the rate of participles in the work of two different authors.
Finally, the fractional counts can aid the training of NLP methods that operate on a raw corpus, such as distributional embedding of surface form types into a vector space. Such methods sometimes consider the morphological properties (tags, lexemes, and slots) of nearby context words. When the morphological properties of a context word f are ambiguous, instead of tagging (which may not be feasible) one could fractionally count the occurrences of the possible analyses according to p_θ(t, ℓ, s | f), or else characterize f's morphology with a single soft indicator vector whose elements are the probabilities of the properties according to p_θ(t, ℓ, s | f).
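A sketch of the second option, building a soft indicator vector for a context word (the helper features_of, which maps an analysis to its attribute-value pairs, and the feature inventory are assumptions for illustration):

def soft_feature_vector(posterior, features_of, feature_index):
    """posterior: dict mapping each analysis (t, lexeme, s) of form f to p(analysis | f).
    features_of: function from an analysis to its attribute=value pairs (e.g. "NUM=PL").
    feature_index: dict from feature to vector position.
    Returns a vector whose k-th entry is the probability that f carries feature k."""
    vec = [0.0] * len(feature_index)
    for analysis, prob in posterior.items():
        for feat in features_of(analysis):
            vec[feature_index[feat]] += prob
    return vec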

A Neural Latent Variable Model
In general, we will only observe unannotated word forms f. We model these as draws from a distribution over form types p_θ(f), which marginalizes out the unobserved structure of the lexicon: which tag, lexeme, and slot generated each form. Training the parameters of this latent-variable model will recover the posterior distribution over analyses of a form, p_θ(t, ℓ, s | f), which allows us to disambiguate counts at the type level.
The latent-variable model is a Bayesian network

p_θ(f) = Σ_{t ∈ T} Σ_{ℓ ∈ L} Σ_{s ∈ S} p_θ(t) · p_θ(ℓ | t) · p_θ(s | t) · δ(f | t, ℓ, s)    (1)

where T, L, S range over the possible tags, lexemes, and slots of the language, and δ(f | t, ℓ, s) returns 1 or 0 according to whether the lexicon lists f as the (unique) realization of ⟨t, ℓ, s⟩. We fix p_θ(s | t) = 0 if the lexicon lists no tuples of the form ⟨t, ·, s, ·⟩, and otherwise model

p_θ(s | t) ∝ exp( u · σ(W v_{t,s}) )    (2)

where v_{t,s} is a multi-hot vector whose "1" components indicate the morphological features possessed by ⟨t, s⟩: namely attribute-value pairs such as POS=VERB and NUM=PL, and σ is an elementwise nonlinearity. Here u ∈ R^d and W is a conformable matrix of weights. This formula specifies a neural network with d hidden units, which can learn to favor or disfavor specific soft conjunctions of morphological features. Finally, we define p_θ(t) ∝ exp ω_t for t ∈ T, and p_θ(ℓ | t) ∝ exp ω_{t,ℓ}, or 0 if the lexicon lists no tuples of the form ⟨t, ℓ, ·, ·⟩. The model's parameter vector θ specifies u, W, and the ω values.
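The following NumPy sketch spells out these two pieces: the slot model of eq. (2) with one hidden layer, and the marginalization of eq. (1) over the analyses that the lexicon lists for a form. The sigmoid nonlinearity and the concrete data structures are our assumptions for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slot_distribution(V_t, u, W):
    """Eq. (2): V_t has one row per slot allowed under tag t, namely the multi-hot
    feature vector v_{t,s}.  Returns the normalized distribution p(s | t)."""
    hidden = sigmoid(V_t @ W.T)           # (num_slots, d) hidden activations
    scores = hidden @ u                   # one real-valued score per slot
    expd = np.exp(scores - scores.max())  # softmax normalization
    return expd / expd.sum()

def form_probability(f, p_tag, p_lex, p_slot, analyses_of):
    """Eq. (1): sum p(t) * p(lex | t) * p(s | t) over the analyses (t, lex, s) that
    the lexicon lists as realized by f (the delta term restricts the sum to these)."""
    return sum(p_tag[t] * p_lex[t][lex] * p_slot[t][s]
               for (t, lex, s) in analyses_of.get(f, []))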

Inference and Learning
We maximize the regularized log-likelihood

Σ_{f ∈ F} c(f) · log p_θ(f) − λ‖θ‖²    (3)

where F is the set of surface form types, c(f) is the observed count of f, and p_θ(f) is defined by (1). It is straightforward to use a gradient-based optimizer, and we do. However, (3) could also be maximized by an intuitive EM algorithm: at each iteration, the E-step uses the current model parameters to partition each count c(f) among possible analyses, as in §2.3, and then the M-step improves the parameters by following the gradient of the supervised regularized log-likelihood as if it had observed those fractional counts. On each iteration, either algorithm loops through all listed (t, s) pairs, all listed (t, ℓ) pairs, and all observed forms f, taking time at most proportional to the size of the lexicon. In practice, training completes within a few minutes on a modern laptop.
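For intuition, the E-step described here is just the count partition of §2.3 applied with the current parameters; a sketch (the posterior function is assumed to come from the current model):

def e_step(form_counts, posterior_fn):
    """Partition each observed count c(f) among the analyses of f according to
    the current posterior p_theta(t, lexeme, s | f)."""
    fractional = {}
    for f, c in form_counts.items():
        for analysis, prob in posterior_fn(f).items():
            fractional[analysis] = fractional.get(analysis, 0.0) + c * prob
    return fractional

# M-step (not shown): improve theta by following the gradient of the supervised,
# regularized log-likelihood, treating these fractional counts as observed data.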

Baseline Models
To the best of our knowledge, this disambiguation task is novel. Thus, we resort to comparing three variants of our model in lieu of a previously published baseline. We evaluate three simplifications of the slot model, to investigate whether the complexity of equation (2) is necessary.

Computing Evaluation Metrics
We first evaluate perplexity. Since our model is a tractable generative model, we may easily evaluate its perplexity on held-out tokens. For each language, we randomly partition the observed surface tokens into 80% training, 10% development, and 10% test. We then estimate the parameters of our model by maximizing (3) on the counts from the training portion, selecting hyperparameters such that the estimated parameters minimize perplexity on the development portion. (Our vocabulary and parameter set are determined from the lexicon; thus we create a regularized parameter ω_{t,ℓ}, yielding a smoothed estimate p_θ(ℓ | t), even if the training count c(ℓ) = 0.) We then report perplexity on the test portion.
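For reference, held-out perplexity here is the usual per-token quantity; a sketch (log base 2, assuming every held-out form is listed in the lexicon so that p_θ(f) > 0):

import math

def perplexity(heldout_counts, form_prob_fn):
    """Per-token perplexity 2^H, where H is the mean negative log2-probability
    of held-out tokens under the trained unigram model p_theta(f)."""
    n_tokens = sum(heldout_counts.values())
    h = sum(-c * math.log2(form_prob_fn(f)) for f, c in heldout_counts.items()) / n_tokens
    return 2.0 ** h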
Using the same hyperparameters, we now train our latent-variable model p_θ without supervision on 100% of the observed surface forms f. We then measure how poorly, for the average surface form token, we recovered the maximum-likelihood distribution p̂(t, ℓ, s | f) that would be estimated with supervision, in terms of KL divergence (in bits):

(1/N) Σ_{f ∈ F} ĉ(f) · KL( p̂(·, ·, · | f) ‖ p_θ(·, ·, · | f) )    (4)

where ĉ(f) is f's supervised token count and N = Σ_f ĉ(f). We can see that this formula reduces to a simple average over disambiguated tokens i, namely (1/N) Σ_i log₂ ( p̂(t_i, ℓ_i, s_i | f_i) / p_θ(t_i, ℓ_i, s_i | f_i) ).
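A sketch of this evaluation (KL in bits, token-weighted as in eq. (4); the two posterior functions are assumed to return dictionaries over the analyses of a form):

import math

def avg_kl_bits(sup_counts, sup_posterior, model_posterior):
    """Token-weighted average of KL( p_hat(. | f) || p_theta(. | f) ), eq. (4).
    sup_counts: supervised token count per form type; the posteriors map a
    form to {analysis: probability}."""
    n_tokens = sum(sup_counts.values())
    total = 0.0
    for f, c in sup_counts.items():
        p_hat, p_model = sup_posterior(f), model_posterior(f)
        kl = sum(p * math.log2(p / p_model[a]) for a, p in p_hat.items() if p > 0)
        total += c * kl
    return total / n_tokens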

Training Details and Hyperparameters
We optimized on training data using batch gradient descent with a fixed learning rate. We used perplexity on development data to jointly choose the learning rate, the initial random θ (from among several random restarts), the regularization coefficient λ ∈ {10^-1, 10^-2, 10^-3, 10^-4}, and the neural network architecture. The NEURAL architecture shown in eq. (2) has 1 hidden layer, but we actually generalized this to consider networks with k ∈ {1, 2, 3, 4} hidden layers of d = 100 units each. In some cases, the model selected on development data had k as high as 3. Note that the LINEAR model corresponds to k = 0.
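The joint selection amounts to a small grid search over these choices; a sketch (train_and_eval is a hypothetical caller-supplied helper that trains with one configuration and returns development perplexity; the learning-rate grid and number of restarts shown are placeholders, not values from the paper):

import itertools

def select_hyperparameters(train_and_eval,
                           learning_rates=(0.1, 0.01),        # placeholder grid
                           lambdas=(1e-1, 1e-2, 1e-3, 1e-4),
                           depths=(0, 1, 2, 3, 4),            # 0 = LINEAR, k >= 1 = NEURAL
                           n_restarts=5):                     # placeholder
    """Return the configuration (lr, lambda, k, restart) minimizing dev perplexity."""
    grid = itertools.product(learning_rates, lambdas, depths, range(n_restarts))
    return min(grid, key=lambda cfg: train_and_eval(*cfg))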

Datasets
Each language constitutes a separate experiment. In each case we obtain our lexicon from the UniMorph project and our surface form counts from Wikipedia. To approximate the supervised counts needed to estimate p̂ in the KL evaluation, we analyzed the surface form tokens in Wikipedia (in context) using the tool of Straka et al. (2016), as trained on the disambiguated Universal Dependencies (UD) corpora (Nivre et al., 2016). We wrote a script to convert the resulting analyses from UD format into ⟨t, ℓ, s, f⟩ tuples in UniMorph format for five languages: Czech (cs), German (de), Finnish (fi), Hebrew (he), and Swedish (sv), each of which displays both kinds of ambiguity in its UniMorph lexicon. Lexicons with these approximate supervised counts are provided as supplementary material.

Results
Our results are graphed in Fig. 1; exact numbers are found in Table 2. We find that the NEURAL model slightly outperforms the other models on all languages except German, and the LINEAR model is quite competitive as well; both outperform UNIF. NEURAL matches the supervised distributions reasonably closely, achieving an average KL of < 1 bit on all languages but German.

Related Work
By far the closest work to ours is the seminal paper of Baayen and Sproat (1996), who asked the following question: "Given a form that is previously unseen in a sufficiently large training corpus, and that is morphologically n-ways ambiguous [...] what is the best estimator for the lexical prior probabilities for the various functions of the form?" While we address the same task, i.e., estimation of a lexical prior, Baayen and Sproat (1996) assume supervision in the form of a disambiguated corpus. We are the first to treat the specific task in an unsupervised fashion. We discuss other work below.
Supervised Morphological Tagging. Morphological tagging is a common task in NLP; the state of the art is currently held by neural models (Heigold et al., 2017). This task is distinct from the problem at hand. Even if a tagger obtains the possible analyses from a lexicon, it is still trained in a supervised manner to choose among analyses.
Unsupervised POS Tagging. Another vein of work that is similar to ours is unsupervised part-of-speech (POS) tagging. Here, the goal is to map sequences of forms into coarse-grained syntactic categories. Christodoulopoulos et al. (2010) provide a useful overview of previous work. This task differs from ours on two counts. First, we are interested in finer-grained morphological distinctions: the universal POS tagset (Petrov et al., 2012) makes 12 distinctions, whereas UniMorph has languages expressing hundreds of distinctions. Second, POS tagging deals with the induction of syntactic categories from sentential context. We note that purely unsupervised morphological tagging has yet to be attempted, to the best of our knowledge.

Conclusion
We have presented a novel generative latent-variable model for resolving ambiguity in unigram counts, notably ambiguity due to syncretism. Given a lexicon, our unsupervised model partitions the corpus count for each ambiguous form among the analyses listed for it in that lexicon. We empirically evaluated our method on 5 languages under two evaluation metrics. The code is available at https://sjmielke.com/papers/syncretism, along with type-disambiguated unigram counts for all lexicons provided by the UniMorph project (100+ languages).