A Rich Morphological Tagger for English: Exploring the Cross-Linguistic Tradeoff Between Morphology and Syntax

A traditional claim in linguistics is that all human languages are equally expressive—able to convey the same wide range of meanings. Morphologically rich languages, such as Czech, rely on overt inflectional and derivational morphology to convey many semantic distinctions. Languages with comparatively limited morphology, such as English, should be able to accomplish the same using a combination of syntactic and contextual cues. We capitalize on this idea by training a tagger for English that uses syntactic features obtained by automatic parsing to recover complex morphological tags projected from Czech. The high accuracy of the resulting model provides quantitative confirmation of the underlying linguistic hypothesis of equal expressivity, and bodes well for future improvements in downstream HLT tasks including machine translation.


Introduction
Different languages use different grammatical tools to convey the same meanings. For example, to indicate that a noun functions as a direct object, English (a morphologically poor language) places the noun after the verb, while Czech (a morphologically rich language) uses an accusative case suffix. Consider the following two glossed Czech sentences: ryba jedla ("the fish ate") and oni jedli rybu ("they ate the fish").
The key insight is that the morphology of Czech (i.e., the case ending -u), carries the same semantic content as the syntactic structure of English (i.e., the word order) (Harley, 2015). Theoretically, this common underlying semantics should allow syntactic structure to be transformed into morphological structure and vice versa. We explore the veracity of this claim computationally by asking the following: Can we develop a tagger for English that uses the signal available in English-only syntactic structure to recover the rich semantic distinctions conveyed by morphology in Czech? Can we, for example, accurately detect which English contexts would have a Czech translation that employs the accusative case marker?
Traditionally, morphological analysis and tagging is a task that has been limited to morphologically rich languages (MRLs) (Hajič, 2000; Drábek and Yarowsky, 2005; Müller et al., 2015; Buys and Botha, 2016). In order to build a rich morphological tagger for a morphologically poor language (MPL) like English, we need some way to build a gold-standard set of richly tagged English data for training and testing. Our approach is to project the complex morphological tags of Czech words directly onto the English words they align to in a large parallel corpus. After evaluating the validity of these projections, we develop a neural network tagging architecture that takes as input a number of English features derived from off-the-shelf dependency parsing and attempts to recover the projected Czech tags.
A tagger of this sort is interesting in many ways. Whereas the best NLP tools are typically available for English, morphological tagging at this granularity has until now been applied almost exclusively to MRLs. The task is also scientifically interesting, in that it takes semantic properties that are latent in the syntactic structure of English and transforms them into explicit word-level annotations. Finally, such a tool has potential utility in a range of downstream tasks, such as machine translation into MRLs (Sennrich and Haddow, 2016).

Table 1: The subset of UniMorph subtags used here.

    Subtag    Values
    GENDER    FEM, MASC, NEUT
    NUMBER    SG, DU, PL
    CASE      NOM, GEN, DAT, ACC, VOC, ESS, INS
    PERSON    1, 2, 3
    TENSE     FUT, PRS, PST
    GRADE     CMPR, SPRL
    NEGATION  POS, NEG
    VOICE     ACT, PASS

Projecting Morphological Tags
Training a system to tag English text with multidimensional morphological tags requires a corpus of English text annotated with those tags. Since no such corpus exists, we must construct one. Past work (focused on translating out of English into MRLs) assigned a handful of morphological annotations using manually developed heuristics (Drábek and Yarowsky, 2005; Avramidis and Koehn, 2008), but this is hard to scale. We therefore instead look to obtain rich morphological tags by projecting them (Yarowsky et al., 2001) from a language (such as Czech) where such rich tags have already been annotated.
We use the Prague Czech-English Dependency Treebank (PCEDT) (Hajič et al., 2012), a complete translation of the Wall Street Journal portion of the Penn Treebank (PTB) (Marcus et al., 1993). Each word on the Czech side of the PCEDT was originally hand-annotated with complex 15-dimensional morphological tags containing positional subtag values for morphological categories specific to Czech. 1 We manually mapped these tags to the UniMorph Schema tagset (Sylak-Glassman et al., 2015), which provides a universal, typologically informed annotation framework for representing morphological features of inflected words in the world's languages. UniMorph tags are in principle up to 23-dimensional, but tags are not positionally dependent, and not every dimension needs to be specified. Table 1 shows the subset of UniMorph subtags used here. PTB tags have no formal internal subtag structure. See Figure 1 for a comparison of the PCEDT, UniMorph, and PTB tag systems for a Czech word and its aligned English translation.

The PCEDT also contains automatically generated word alignments produced by using GIZA++ (Och and Ney, 2003) to align the Czech and English sides of the treebank. We use these alignments to project morphological tags from the Czech words to their English counterparts through the following process. For every English word, if the word is aligned to a single Czech word, take its tag. If the word is aligned to multiple Czech words, take the annotation from the alignment point belonging to the intersection of the two underlying GIZA++ models used to produce the many-to-many alignment. 2 If no such alignment point is found, take the tag of the leftmost aligned word. Unaligned English words receive no annotation.
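The projection heuristic above can be sketched as follows. This is a minimal illustration only: the data structures for the alignments, tags, and the intersection of the two directional GIZA++ models are hypothetical stand-ins for the actual PCEDT/GIZA++ output.

```python
def project_tags(en_tokens, alignments, cs_tags, intersection_points):
    """Project Czech morphological tags onto aligned English tokens.

    alignments: dict mapping an English token index to a list of
    Czech token indices (a many-to-many alignment).
    intersection_points: set of (en_idx, cs_idx) pairs present in the
    intersection of the two directional alignment models.
    """
    projected = {}
    for i, _ in enumerate(en_tokens):
        cs_indices = alignments.get(i, [])
        if len(cs_indices) == 1:
            # Single alignment: take that Czech word's tag directly.
            projected[i] = cs_tags[cs_indices[0]]
        elif len(cs_indices) > 1:
            # Prefer an alignment point found in the intersection of the
            # two directional models; otherwise fall back to the leftmost.
            chosen = next((j for j in cs_indices
                           if (i, j) in intersection_points),
                          min(cs_indices))
            projected[i] = cs_tags[chosen]
        # Unaligned English words receive no annotation.
    return projected
```

A word aligned to several Czech tokens thus inherits at most one tag, preferring the higher-precision intersection alignment.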

Validating Projections
If we believe that we can project semantic distinctions over bitext, we must ensure that the elements linked by projection in both source and target languages carry roughly the same meaning. This is difficult to automate, and no gold-standard dataset or metric has been developed. Thus, we offer the following approximate evaluation.

Figure 1: The PCEDT tag (VB-S---3P-AA---) of the Czech word je was mapped to an equivalent UniMorph tag (V;ACT;POS;PRS;3;SG). The English translation of je, which is the copula is, has the PTB tag VBZ. While the PCEDT and UniMorph tags are composed of subtags, the PTB tag has no formal internal composition.
English is not bereft of morphological marking, and its use of morphology, though limited, does sometimes coincide with that of Czech. For example, both languages use overt morphology to mark nouns as singular or plural, adjectives and adverbs as superlative or comparative, and verbs as either present or past. 3 In these cases it is possible to directly map word-level PTB tags in English to word-level UniMorph tags in Czech, and to compare how often projected tags conform to this expected mapping. For example, the PTB tag VBZ is mapped to the UniMorph tag V;PRS;3;SG. Table 2 shows a set of expected projections along with how often the expectations are met across the PCEDT. In particular, we calculate the percentage of cases in which an English word with a particular PTB tag has the expected Czech tag projected onto it. This calculation is performed only in those cases where the aligned words agree in their core part of speech, since we would not expect, for example, verbs to have superlative/comparative morphology.
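The agreement check described above can be sketched as follows. The mapping table is a hypothetical subset: only the VBZ entry is given in the text, and the remaining entries illustrate the kind of expected mappings involved, not the exact table used.

```python
# Expected PTB -> UniMorph subtag mappings where English morphology
# overtly marks the same distinction as Czech. Only VBZ is attested
# in the text; the other entries are illustrative assumptions.
EXPECTED = {
    "VBZ": {"V", "PRS", "3", "SG"},
    "VBD": {"V", "PST"},
    "NNS": {"N", "PL"},
    "JJS": {"ADJ", "SPRL"},
    "JJR": {"ADJ", "CMPR"},
}

def agreement_rate(pairs):
    """pairs: (ptb_tag, projected_subtags) for words whose aligned
    Czech counterpart agrees in core part of speech. Returns the
    fraction of words whose projected tag contains every subtag value
    the PTB tag predicts."""
    checked = [(ptb, subtags) for ptb, subtags in pairs if ptb in EXPECTED]
    if not checked:
        return 0.0
    hits = sum(EXPECTED[ptb] <= subtags for ptb, subtags in checked)
    return hits / len(checked)
```

Running this over the projected PCEDT annotations would yield the per-tag agreement percentages of the kind reported in Table 2.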
A qualitative examination of these results suggests that projections are usually valid in at least those cases where our limited linguistic intuitions predict they should be. For example, the dual number feature (DU) was projected in only 12 instances, but was almost always projected to the English words "two," "eyes," "feet," and "hands." These concepts naturally come in pairs, and this distinction is explicitly marked in Czech, but not English. We interpret this evaluation as suggesting that we can trust projection even in cases where we do not have pre-existing expectations of how English and Czech grammars should align.

Features
With our projections validated, we turn to the prediction model itself. Based on the idea that languages with rich morphology use that morphology to convey distinctions in meaning similar to those conveyed by syntax in a morphologically poor language, we extract lexical and syntactic features from English text itself as well as from both dependency and CFG parses. We use the following basic features derived directly from the text: the word itself, the single-word lexical context, and the word's POS-tag neighbors. We also use features derived from dependency trees.
• Head features. The word's head word, and separately, the head word's POS.
• Head chain POS. The chain of POS tags beginning with the word and moving upward along the dependency graph.
• Head chain labels. The chain of dependency labels moving upward.
• Child words. The identity of any child word having an arc label of det or case, under the Universal Dependencies label set. 4

Finally, we use features from CFG parsing:

• POS features. A word's part-of-speech (POS) tag, its parent's, and its grandparent's.
• Chain features. We compute chains of tree nodes, starting with the word's POS tag and moving upward (e.g., NN NP S).
• The distance to the root.
Non-lexical features are treated as real-valued when appropriate (such as the distance to the root), while all others are treated as binary. For lexical features, we use pretrained GloVe embeddings, specifically the 200-dimensional, 400K-vocabulary uncased embeddings from Pennington et al. (2014). This approach is similar to that of Tran et al. (2015), but we additionally augment the pretrained embeddings with randomly initialized embeddings for vocabulary items outside of the 400K lexicon.
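The assembly of an input vector from these three feature types can be sketched as follows. All function and field names here are illustrative assumptions, not the actual implementation; the out-of-vocabulary handling mirrors the random-initialization augmentation described above.

```python
import numpy as np

def build_input_vector(word_feats, embeddings, dim=200):
    """Concatenate non-lexical features with embeddings of lexical ones.

    word_feats: dict with 'binary' (0/1 indicator values), 'real'
    (floats such as the distance to the root), and 'lexical' (word
    strings). embeddings: dict mapping word -> np.ndarray. Words
    outside the pretrained lexicon receive a randomly initialized
    vector, which is then reused on later occurrences.
    """
    parts = [np.asarray(word_feats["binary"], dtype=np.float32),
             np.asarray(word_feats["real"], dtype=np.float32)]
    for w in word_feats["lexical"]:
        if w not in embeddings:
            # Augment the pretrained lexicon with a random OOV vector.
            embeddings[w] = np.random.uniform(-0.1, 0.1, dim).astype(np.float32)
        parts.append(embeddings[w])
    return np.concatenate(parts)
```

The resulting vector is the input x to the neural model described in the next section.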

Neural Model
Table 3: An English sentence from the test set, WSJ §22, tagged with rich morphological tags by our neural tagger:

    Other        PL, NOM
    companies    PL, NOM
    are          ACT, 3, PRS, PL
    introducing  ACT, 3, PRS, PL
    related      PL, ACC
    products     PL, ACC

Note, for example, that case is tagged correctly, with Other companies tagged as nominative and related products tagged as accusative. Legend: CASE (NOM = nominative, ACC = accusative), TENSE (PRS = present), NUMBER (PL = plural), VOICE (ACT = active), and PERSON (3).

In order to take advantage of correlated information between subtags, we present a neural model
which learns a common representation of input tokens and passes it on to a series of subtag classifiers that are trained jointly. Informally, this means that we learn a shared representation in the hidden layers and then use separate weight functions to predict each component of the morphological analysis from this shared representation of the input. We use a feed-forward neural net with two hidden layers and rectified linear unit (ReLU) activation functions (Glorot et al., 2011). A UniMorph tag m can be decomposed into its N subtags as m = [m^(1), m^(2), ..., m^(N)], where each m^(i) may be represented as a one-hot vector. The weight matrices (W^(1), W^(2)) and bias vectors (b^(1), b^(2)) connecting the hidden layers are parameters shared by the whole model, but each of the N subtag classes has its own weight matrix W_i and bias vector b_i. All are randomly initialized from truncated normal distributions. Given an input vector x, we first compute a new input x' = [x_nonlex : E x_lex_0 : E x_lex_1 : ... : E x_lex_n], where [a : b] represents vector concatenation. All lexical features x_lex_i are replaced by their embeddings from the embedding matrix E.
The definition of p(m) then follows:

    h_1 = ReLU(W^(1) x' + b^(1))
    h_2 = ReLU(W^(2) h_1 + b^(2))
    p(m^(i) | x) = softmax(W_i h_2 + b_i)
    p(m) = ∏_{i=1}^{N} p(m^(i) | x)

The set of parameters is θ = {E, W^(1), b^(1), W^(2), b^(2), W_1, b_1, ..., W_N, b_N}. The loss is defined as the cross-entropy, and the model is trained using gradient descent with minibatches. The models were trained using TensorFlow (Abadi et al., 2015). We complete a coarse-grained grid search over the learning rate, hidden layer size, and batch size, tuning on whole-tag accuracy on the development set, WSJ §00. Based on this tuning, we choose a hidden layer size of 1000, a learning rate of 0.01, and a batch size of 50.
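The forward pass of this architecture can be sketched in plain NumPy as follows. This is an illustration of the shared-trunk, per-subtag-head design, not the trained TensorFlow model; parameter shapes and names are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, params, subtag_heads):
    """Two ReLU hidden layers shared across all subtags, followed by a
    separate softmax classifier per subtag category.

    params: dict with shared weights W1, b1, W2, b2.
    subtag_heads: dict mapping a subtag name (e.g. 'CASE') to its own
    (W_i, b_i) pair, so each category is predicted from the shared
    hidden representation with its own weights.
    """
    h1 = relu(params["W1"] @ x + params["b1"])
    h2 = relu(params["W2"] @ h1 + params["b2"])
    return {name: softmax(W @ h2 + b)
            for name, (W, b) in subtag_heads.items()}
```

Because the heads share h2, gradients from every subtag classifier update the same trunk during joint training, which is what lets correlated subtags (e.g., CASE and NUMBER) inform one another.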

Experiment Setup
Our goal is to predict rich morphological tags for monolingual English text. The tagger was trained on §02-21 of the WSJ portion of the PTB. §00 was used for tuning. Training tags were projected from the equivalent Czech portion of the PCEDT, across the standard alignments provided by the PCEDT, as described in §2. Projected tags were treated as a gold standard to be recovered by the tagger. The full training set consisted of 39,832 sentences (726,262 words). Evaluation of the tagger was done on §22 of the WSJ portion of the PCEDT. Table 4 shows the accuracy of the neural tagger for each subtag category from Table 1, indicating how often the tagger recovered the English projections of the Czech subtags. Baseline 1 is computed by selecting the most common Czech (sub)tag value in every case.

Results and Analysis
Baseline 2 is computed similarly to the evaluation of projection validity presented in §3. For each English word, the UniMorph subtag values that can be obtained by translating the PTB tag are compared to the projected subtag value in the same category (e.g., TENSE). This baseline penalizes cases in which a value for a category exists in the gold projection but the value from the PTB tag translation either does not match or is not present at all. The poor performance of this baseline highlights how little information can be gleaned from traditional English PTB tags themselves, a consequence of the poverty of English inflectional morphology. In baselines 2 and 3, values for negation and voice were never present from the PTB tags, since both negation and passive voice are indicated by separate words in English. 5

Table 4: Performance of the neural tagger on §22 of the WSJ portion of the PTB. We report both subtag and whole-tag accuracies. Baseline 1 simply outputs the most frequent subtag value. Baseline 2 outputs the subtag value that can be obtained from a human-annotated PTB tag, and penalizes PTB-derived values that are either incorrect or missing. Baseline 3 does the same comparison, but penalizes only incorrect values, not those which are missing. Accuracy which exceeds or equals all baselines is bolded, while that which exceeds only baselines 1 and 2 is italicized.
In baseline 3, we remove the effect of morphological poverty from consideration by comparing the values obtained from PTB tag translation to gold projected values only when both sources provide a value for a given category. The strong performance of this baseline, particularly in person and number, may be partly due to the fact that the tags are human-annotated as well as the fact that fewer comparisons are made in an attempt to isolate the effects of morphological poverty. In addition, baseline 3 need only predict instances of 3rd person, since person is only marked by PTB tags for one tag, VBZ. Similarly, PTB tags only explicitly mark number for the tags VBZ, NN, NNS, NNP, and NNPS.
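The difference between baselines 2 and 3 comes down to how missing PTB-derived values are scored, which can be made precise with a short sketch. The helper name and data layout are hypothetical, for illustration only.

```python
def baseline_scores(examples):
    """examples: (ptb_value, gold_value) pairs for one subtag category;
    a value is None when that source provides nothing for the category.

    Baseline 2 scores every word that has a gold projected value,
    counting a missing PTB-derived value as an error. Baseline 3
    compares only where both sources provide a value, removing the
    effect of English morphological poverty from consideration.
    """
    gold_cases = [(p, g) for p, g in examples if g is not None]
    b2 = sum(p == g for p, g in gold_cases) / len(gold_cases)
    both = [(p, g) for p, g in examples if p is not None and g is not None]
    b3 = sum(p == g for p, g in both) / len(both)
    return b2, b3
```

Since PTB tags supply values like PERSON only for VBZ, baseline 3's denominator shrinks to exactly the cases English marks overtly, which is why it scores so much higher than baseline 2.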
The neural tagger outperforms baselines 1 and 2 everywhere, showing that the syntactic structure of English does contain enough signal to recover the complex semantic distinctions that are overt in Czech morphology. For case, especially, accuracy is nearly double that of baseline 1. Table 3 shows an example English sentence in which case and number have been tagged correctly.

We examined the contribution of different grammatical aspects of English by training standard MaxEnt classifiers for each subtag using different subsets of features. The individual classifiers were trained with Liblinear's (Fan et al., 2008) MaxEnt model. We varied the regularization constant from 0.001 to 100 in multiples of 10, choosing the value in each situation that maximized performance on the dev set, PCEDT §00. Table 5 contains the results. First, word identity contributes more than POS on its own. This suggests that the distribution of morphological features is at least partially conditioned by lexical factors, in addition to grammatical properties such as POS. The addition of POS context, which includes the POS of the preceding and following words, yields modest gains, except for case, where it leads to a 5.2% increase in accuracy. POS context can be viewed as an approximation of true syntactic features, which yield greater improvements. Dependency parse features are particularly effective in helping to predict case, since case is typically assigned by a verb governing a noun in a head-dependency relationship. The direct encoding of this relationship yields an especially salient feature for the case classifier. Even with these improvements, the case feature remains the most difficult to predict, suggesting that even more salient features have yet to be discovered.

Conclusion
To our knowledge, this is the first work to construct a rich morphological tagger for English that does not rely on manually-developed syntactic heuristics. This significantly extends the applicability and usability of the proposed general tagging framework, which offers the ability to use automatic parsing features in one language and (potentially automatically generated) morphological feature annotation in the other. Validating the claim that languages apply different aspects of grammar to represent equivalent meanings, we find that English-only lexical, contextual, and syntactic features derived from off-the-shelf parsing tools encode the complex semantic distinctions present in Czech morphology. In addition to allowing this scientific claim to be computationally validated, we expect this approach to generalize to tagging any morphologically poor language with the morphological distinctions made in another morphologically rich language.