What do character-level models learn about morphology? The case of dependency parsing

When parsing morphologically-rich languages with neural models, it is beneficial to model input at the character level, and it has been claimed that this is because character-level models learn morphology. We test these claims by comparing character-level models to an oracle with access to explicit morphological analysis on twelve languages with varying morphological typologies. Our results highlight many strengths of character-level models, but also show that they are poor at disambiguating some words, particularly in the face of case syncretism. We then demonstrate that explicitly modeling morphological case improves our best model, showing that character-level models can benefit from targeted forms of explicit morphological modeling.


Introduction
Modeling language input at the character level (Ling et al., 2015; Kim et al., 2016) is effective for many NLP tasks, and often produces better results than modeling at the word level. For parsing, Ballesteros et al. (2015) have shown that character-level input modeling is highly effective on morphologically-rich languages, and the three best systems on the 45 languages of the CoNLL 2017 shared task on universal dependency parsing all use character-level models (Dozat et al., 2017; Shi et al., 2017; Björkelund et al., 2017; Zeman et al., 2017), showing that they are effective across many typologies. The effectiveness of character-level models in morphologically-rich languages has raised a question and indeed debate about explicit modeling of morphology in NLP. Ling et al. (2015) propose that "prior information regarding morphology ... among others, should be incorporated" into character-level models, while Chung et al.
(2016) counter that it is "unnecessary to consider these prior information" when modeling characters. Whether we need to explicitly model morphology is a question whose answer has a real cost: as Ballesteros et al. (2015) note, morphological annotation is expensive, and this expense could be reinvested elsewhere if the predictive aspects of morphology are learnable from strings.
Do character-level models learn morphology? We view this as an empirical claim requiring empirical evidence. The claim has been tested implicitly by comparing character-level models to word lookup models (Qian et al., 2016; Belinkov et al., 2017). In this paper, we test it explicitly, asking how character-level models compare with an oracle model with access to morphological annotations. This extends experiments showing that character-aware language models in Czech and Russian benefit substantially from oracle morphology (Vania and Lopez, 2017), but here we focus on dependency parsing (§2), a task that benefits substantially from morphological knowledge, and we experiment with twelve languages using a variety of techniques to probe our models.
Our summary finding is that character-level models lag the oracle in nearly all languages (§3). The difference is small, but suggests that there is value in modeling morphology. When we tease apart the results by part of speech and dependency type, we trace the difference back to the character-level model's inability to disambiguate words, even when encoded with arbitrary context (§4). Specifically, it struggles with case syncretism, in which noun case, and thus syntactic function, is ambiguous. We show that the oracle relies on morphological case, and that a character-level model provided only with morphological case rivals the oracle, even when case is provided by another predictive model (§5). Finally, we show that the crucial morphological features vary by language (§6).

Dependency parsing model
We use a neural graph-based dependency parser combining elements of two recent models (Kiperwasser and Goldberg, 2016; Zhang et al., 2017). Let w = w_1, ..., w_{|w|} be an input sentence of length |w| and let w_0 denote an artificial ROOT token. We represent the ith input token w_i by concatenating its word representation (§2.3), e(w_i), and its part-of-speech (POS) representation, p_i. Using a semicolon (;) to denote vector concatenation, we have:

x_i = [e(w_i); p_i]    (1)

We call x_i the embedding of w_i since it depends on context-independent word and POS representations. We obtain a context-sensitive encoding h_i with a bidirectional LSTM (bi-LSTM), which concatenates the hidden states of a forward and backward LSTM at position i. Using h_i^f and h_i^b respectively to denote these hidden states, we have:

h_i = [h_i^f; h_i^b]    (2)

We use h_i as the final input representation of w_i.

Head prediction
For each word w_i, we compute a distribution over all other word positions j ∈ {0, ..., |w|} \ {i}, denoting the probability that w_j is the headword of w_i:

P_head(w_j | w_i, w) = exp(a(h_i, h_j)) / Σ_{j'≠i} exp(a(h_i, h_{j'}))    (3)

Here, a is a neural network that computes an association between w_i and w_j using model parameters U_a, W_a, and v_a:

a(h_i, h_j) = v_a · tanh(U_a h_i + W_a h_j)    (4)
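As a concrete illustration, the head distribution can be sketched in pure Python with toy dense parameters; the matrix shapes and function names here are ours, not the paper's implementation:

```python
import math

def associate(h_i, h_j, U_a, W_a, v_a):
    """a(h_i, h_j) = v_a . tanh(U_a h_i + W_a h_j): score of w_j as head of w_i."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    hidden = [math.tanh(u + w) for u, w in zip(matvec(U_a, h_i), matvec(W_a, h_j))]
    return sum(v * h for v, h in zip(v_a, hidden))

def head_distribution(encodings, i, U_a, W_a, v_a):
    """Softmax over all candidate head positions j != i (position 0 is ROOT)."""
    scores = {j: associate(encodings[i], encodings[j], U_a, W_a, v_a)
              for j in range(len(encodings)) if j != i}
    z = sum(math.exp(s) for s in scores.values())
    return {j: math.exp(s) / z for j, s in scores.items()}
```

In a real implementation the parameters are trained jointly with the rest of the network; this sketch only shows the scoring and normalization.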

Label prediction
Given a head prediction for word w_i, we predict its syntactic label ℓ_k ∈ L using a similar network:

P_label(ℓ_k | w_i, w_j, w) = exp(f(h_i, h_j)_k) / Σ_{k'} exp(f(h_i, h_j)_{k'})    (5)

where L is the set of output labels and f is a function that computes label scores using model parameters U_ℓ, W_ℓ, and V_ℓ:

f(h_i, h_j) = V_ℓ tanh(U_ℓ h_i + W_ℓ h_j)    (6)

The model is trained to minimize the summed cross-entropy losses of both head and label prediction. At test time, we use the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) to ensure well-formed, possibly non-projective trees.
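The decoding step can be sketched with a standard recursive cycle-contraction implementation of Chu-Liu-Edmonds over a dense score matrix; this is a generic sketch of the algorithm, not the authors' code:

```python
NEG = float("-inf")

def chu_liu_edmonds(score):
    """Maximum spanning arborescence rooted at node 0.

    score[h][d] is the score of arc h -> d; forbidden arcs (self-loops,
    arcs into the root) should be float('-inf').  Returns {dependent: head}.
    """
    n = len(score)
    # Greedily pick the best head for every non-root node.
    head = {d: max(range(n), key=lambda h: score[h][d] if h != d else NEG)
            for d in range(1, n)}
    # Detect a cycle among the chosen arcs.
    cycle = None
    for start in head:
        path, v = [], start
        while v != 0 and v not in path:
            path.append(v)
            v = head[v]
        if v != 0 and v in path:
            cycle = path[path.index(v):]
            break
    if cycle is None:
        return head
    # Contract the cycle into one super-node and recurse.
    in_cycle = set(cycle)
    inv = [v for v in range(n) if v not in in_cycle]
    new_id = {v: i for i, v in enumerate(inv)}
    c = len(inv)                      # id of the contracted node
    new_score = [[NEG] * (c + 1) for _ in range(c + 1)]
    enter, leave = {}, {}
    for h in range(n):
        for d in range(n):
            if score[h][d] == NEG:
                continue
            if h not in in_cycle and d not in in_cycle:
                new_score[new_id[h]][new_id[d]] = score[h][d]
            elif h not in in_cycle and d in in_cycle:
                gain = score[h][d] - score[head[d]][d]
                if gain > new_score[new_id[h]][c]:
                    new_score[new_id[h]][c] = gain
                    enter[new_id[h]] = d  # which cycle arc gets broken
            elif h in in_cycle and d not in in_cycle:
                if score[h][d] > new_score[c][new_id[d]]:
                    new_score[c][new_id[d]] = score[h][d]
                    leave[new_id[d]] = h
    sub = chu_liu_edmonds(new_score)
    # Expand the contracted solution back to the original nodes.
    result = {}
    for d_new, h_new in sub.items():
        if d_new == c:                # arc entering the cycle
            broken = enter[h_new]
            result[broken] = inv[h_new]
            for v in cycle:
                if v != broken:
                    result[v] = head[v]
        elif h_new == c:              # arc leaving the cycle
            result[inv[d_new]] = leave[d_new]
        else:
            result[inv[d_new]] = inv[h_new]
    return result
```

The contraction subtracts each cycle node's current head score so that the recursion chooses the cheapest arc to break; the constant cycle weight does not affect the argmax.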

Computing word representations
We consider several ways to compute the word representation e(w_i) in Eq. 1:

word. Every word type has its own learned vector representation.
char-lstm. Characters are composed using a bi-LSTM (Ling et al., 2015), and the final states of the forward and backward LSTMs are concatenated to yield the word representation.
char-cnn. Characters are composed using a convolutional neural network (Kim et al., 2016).
trigram-lstm. Character trigrams are composed using a bi-LSTM, an approach that we previously found to be effective across typologies (Vania and Lopez, 2017).
oracle. We treat the morphemes of a morphological annotation as a sequence and compose them using a bi-LSTM. We only use the universal inflectional features defined in the UD annotation guidelines. For example, the morphological annotation of "chases" is ⟨chase, person=3rd, num=SG, tense=Pres⟩.
For the remainder of the paper, we use the name of each model as shorthand for the dependency parser that uses it as input (Eq. 1).
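The subword decompositions used by the trigram-lstm and oracle models can be sketched as follows; the '^'/'$' boundary markers and the CoNLL-U-style FEATS string are assumptions of this sketch, not necessarily the paper's exact scheme:

```python
def char_trigrams(word):
    """Character trigrams of a word, with '^' and '$' as assumed
    word-boundary markers."""
    s = "^" + word + "$"
    return [s[i:i + 3] for i in range(max(1, len(s) - 2))]

def oracle_sequence(lemma, feats):
    """Morpheme sequence composed by the oracle's bi-LSTM: the lemma
    followed by the inflectional features from a UD FEATS string
    such as 'Number=Sing|Person=3|Tense=Pres' ('_' means no features)."""
    if feats in ("", "_"):
        return [lemma]
    return [lemma] + feats.split("|")
```

Each sequence is then fed to the corresponding bi-LSTM composer in place of raw characters.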

Experiments
Data We experiment on twelve languages with varying morphological typologies (Table 1) in the Universal Dependencies (UD) treebanks, version 2.0. Note that while Arabic and Hebrew follow a root & pattern typology, their datasets are unvocalized, which might reduce the observed effects of this typology. Following common practice, we remove language-specific dependency relations and multiword token annotations. We use the gold sentence segmentation, tokenization, universal POS (UPOS), and morphological (XFEATS) annotations provided in UD.
Implementation and training Our Chainer (Tokui et al., 2015) implementation encodes words (Eq. 2) in two-layer bi-LSTMs with 200 hidden units, and uses 100 hidden units for head and label predictions (output of Eqs. 4 and 6). We set the batch size to 16 for char-cnn and 32 for the other models, following a grid search. We apply dropout to the embeddings (Eq. 1) and to the input of the head prediction. We use the Adam optimizer with initial learning rate 0.001, clip gradients to 5, and train all models for 50 epochs with early stopping. For the word model, we limit the vocabulary to the 20K most frequent words, replacing less frequent words with an unknown word token. The char-lstm, trigram-lstm, and oracle models use a one-layer bi-LSTM with 200 hidden units to compose subwords. For char-cnn, we use the small model setup of Kim et al. (2016).

Table 2 presents test results for every model on every language, establishing three results. First, they support previous findings that character-level models outperform word-based models: the char-lstm model outperforms the word model on LAS for all languages except Hindi and Urdu, for which the results are identical. (Note that Hindi and Urdu are mutually intelligible.) Second, they establish strong baselines for the character-level models: the char-lstm generally obtains the best parsing accuracy, closely followed by char-cnn. Third, they demonstrate that character-level models rarely match the accuracy of an oracle model with access to explicit morphology. This reinforces a finding of Vania and Lopez (2017): character-level models are effective tools, but they do not learn everything about morphology, and they seem to be closer to oracle accuracy in agglutinative than in fusional languages.
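The 20K-word vocabulary truncation for the word model can be sketched as follows; the `<unk>` token name is an assumption of this sketch, not the paper's identifier:

```python
from collections import Counter

UNK = "<unk>"  # assumed name for the unknown-word token

def build_vocab(tokens, max_size=20000):
    """Keep only the max_size most frequent word types in training."""
    return {w for w, _ in Counter(tokens).most_common(max_size)}

def lookup(word, vocab):
    """Map any word outside the kept vocabulary to the unknown token."""
    return word if word in vocab else UNK
```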

Why do characters beat words?
In character-level models, orthographically similar words share many parameters, so we would expect these models to produce good representations of OOV words that are morphological variants of training words. Does this effect explain why they are better than word-level models?
Sharing parameters helps with both seen and unseen words Table 3 shows how the character model improves over the word model for both non-OOV and OOV words. On the agglutinative languages Finnish and Turkish, where the OOV rates are 23% and 24% respectively, we see the largest LAS improvements, with especially large gains in accuracy on OOV words. However, the effects are more mixed in other languages, even those with relatively high OOV rates. In particular, languages with rich morphology like Czech, Russian, and (unvocalized) Arabic see more improvement than languages with moderately rich morphology but high OOV rates, like Portuguese or Spanish. This pattern suggests that parameter sharing between pairs of observed training words can also improve parsing performance. For example, if "dog" and "dogs" are observed in the training data, they will share activations in their context and on their common prefix.
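The seen/OOV accuracy breakdown behind this analysis (as in Table 3) can be computed with a helper along these lines; the function name and grouping are ours:

```python
def accuracy_by_oov(train_tokens, test_tokens, correct):
    """Split token-level accuracy (e.g. LAS) into seen vs. OOV tokens.
    `correct` holds a 0/1 flag per test token indicating a correct parse."""
    seen = set(train_tokens)
    groups = {"seen": [], "oov": []}
    for word, ok in zip(test_tokens, correct):
        groups["seen" if word in seen else "oov"].append(ok)
    return {g: (sum(v) / len(v) if v else None) for g, v in groups.items()}
```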

Why do morphemes beat characters?
Let's turn to our main question: what do character-level models learn about morphology? To answer it, we compare the oracle model to char-lstm, our best character-level model.

Morphological analysis disambiguates words
In the oracle, morphological annotations disambiguate some words that the char-lstm must disambiguate from context. Consider these Russian sentences from Baerman et al. (2005):

(1) Maša čitaet pis'mo.
    'Masha reads a letter.'

(2) Na stole ležit pis'mo.
    'A letter lies on the table.'

Pis'mo ("letter") acts as the object in (1) and as the subject in (2). This knowledge is available to the oracle via morphological case: in (1), the case of pis'mo is accusative, and in (2) it is nominative. Could this explain why the oracle outperforms the character model?
To test this, we look at accuracy for word types that are empirically ambiguous-those that have more than one morphological analysis in the training data. Note that by this definition, some ambiguous words will be seen as unambiguous, since they were seen with only one analysis. To make the comparison as fair as possible, we consider only words that were observed in the training data. Figure 1 compares the improvement of the oracle on ambiguous and seen unambiguous words, and as expected we find that handling of ambiguous words improves with the oracle in almost all languages. The only exception is Turkish, which has the least training data.
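Identifying empirically ambiguous word types from the training annotations can be sketched as:

```python
from collections import defaultdict

def ambiguous_types(training_pairs):
    """Word types observed with more than one morphological analysis.
    `training_pairs` is an iterable of (word, analysis) tuples."""
    analyses = defaultdict(set)
    for word, analysis in training_pairs:
        analyses[word].add(analysis)
    return {w for w, seen in analyses.items() if len(seen) > 1}
```

As noted above, a truly ambiguous type seen with only one analysis in training will be counted as unambiguous by this definition.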
Morphology helps for nouns Now we turn to a more fine-grained analysis conditioned on the annotated part-of-speech (POS) of the dependent. We focus on four languages where the oracle strongly outperforms the best character-level model on the development set: Finnish, Czech, German, and Russian. We consider five POS categories that are frequent in all languages and consistently annotated for morphology in our data: adjective (ADJ), noun (NOUN), pronoun (PRON), proper noun (PROPN), and verb (VERB). Table 4 shows that the three noun categories (NOUN, PRON, and PROPN) benefit substantially from oracle morphology, especially in the three fusional languages: Czech, German, and Russian.

Morphology helps for subjects and objects
We analyze results by the dependency type of the dependent, focusing on types that interact with morphology: root, nominal subjects (nsubj), objects (obj), indirect objects (iobj), nominal modifiers (nmod), adjectival modifiers (amod), obliques (obl), and (syntactic) case markings (case). Figure 2 shows the differences between the confusion matrices of the char-lstm and the oracle for words on which both models correctly predict the head. The differences on Finnish are small, which we expect given the similar overall LAS of the two models. But for the fusional languages, a pattern emerges: the char-lstm consistently underperforms the oracle on nominal subject, object, and indirect object dependencies, the labels most closely associated with noun categories. From inspection, it appears to frequently mislabel objects as nominal subjects when the dependent noun is morphologically ambiguous. For example, in the sentence in Figure 3, Gelände ("terrain") is an object, but the char-lstm incorrectly predicts that it is a nominal subject. In the training data, Gelände is ambiguous: it can be accusative, nominative, or dative.
In German, the char-lstm frequently confuses objects and indirect objects. By inspection, we found 21 mislabeled cases, 20 of which would likely have been labeled correctly if the model had had access to morphological case (usually dative). In Czech and Russian, the results are more varied: indirect objects are frequently mislabeled as objects, obliques, nominal modifiers, and nominal subjects. We note that indirect objects are relatively rare in these data, which may partly explain their frequent mislabeling.

Characters and case syncretism
So far, we've seen that for our three fusional languages (German, Czech, and Russian) the oracle strongly outperforms a character model on nouns with ambiguous morphological analyses, particularly on core dependencies: nominal subjects, objects, and indirect objects. Since the nominative, accusative, and dative morphological cases are strongly (though not perfectly) correlated with these dependencies, it is easy to see why the morphologically-aware oracle is able to predict them so well. We hypothesized that these cases are more challenging for the character model because these languages feature a high degree of syncretism (functionally distinct words that have the same form), and in particular case syncretism. For example, referring back to examples (1) and (2), the character model must disambiguate pis'mo from its context, whereas the oracle can directly disambiguate it from a feature of the word itself. To understand this, we first designed an experiment to see whether the char-lstm could successfully disambiguate noun case, using a method similar to that of Belinkov et al. (2017). We train a neural classifier that takes as input a word representation from the trained parser and predicts a morphological feature of that word, for example that its case is nominative (Case=Nom). The classifier is a feedforward neural network with one hidden layer, followed by a ReLU non-linearity. We consider two representations of each word: its embedding (x_i; Eq. 1) and its encoding (h_i; Eq. 2). To understand the importance of case, we consider it alongside number and gender features, as well as whole feature bundles.
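The forward pass of such a diagnostic classifier can be sketched in pure Python; the parameter names and shapes are illustrative, and a real probe would train W1, b1, W2, b2 by backpropagation:

```python
import math

def probe(x, W1, b1, W2, b2):
    """Diagnostic classifier: one hidden layer with ReLU, then a softmax
    over feature values (e.g. Case=Nom vs. Case=Acc)."""
    def affine(W, b, v):
        return [sum(w * vi for w, vi in zip(row, v)) + bi
                for row, bi in zip(W, b)]
    hidden = [max(0.0, h) for h in affine(W1, b1, x)]  # ReLU
    logits = affine(W2, b2, hidden)
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]
```

The input x would be either the embedding x_i or the encoding h_i taken from the trained parser, with the parser's weights frozen.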
The oracle relies on case Table 5 shows the results of morphological feature classification on Czech; we found very similar results for German and Russian (Appendix A.2). The oracle embeddings have almost perfect accuracy, which is just what we expect, since the representation only needs to preserve information from its input. The char-lstm embeddings perform well on number and gender, but less well on case. These results suggest that character-level models still struggle to learn case when given only the input text. Comparing the char-lstm with a baseline model that predicts the most frequent feature for each type in the training data, we observe that the two show similar trends, although the character model slightly outperforms the baseline.
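The most-frequent-feature baseline can be sketched as:

```python
from collections import Counter, defaultdict

def most_frequent_baseline(train_pairs):
    """For each word type, predict its most frequent feature value in
    the training data; `train_pairs` is an iterable of (word, feature)."""
    counts = defaultdict(Counter)
    for word, feat in train_pairs:
        counts[word][feat] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}
```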
The classification results from the encoding are particularly interesting: the oracle still performs very well on morphological case, but less well on other features, even though they appear in the input. In the character model, accuracy in morphological prediction also degrades in the encoding, except for case, where accuracy improves by 12%. These results make intuitive sense: representations learn to preserve information from their input that is useful for subsequent predictions. In our parsing model, morphological case is very useful for predicting dependency labels, and since it is present in the oracle's input, it is passed almost completely intact through each representation layer. The character model, which must disambiguate case from context, draws as much additional information as it can from surrounding words through the LSTM encoder. But other features, and particularly whole feature bundles, are presumably less useful for parsing, so neither model preserves them with the same fidelity.

Table 5: Morphological tagging accuracy from char-lstm and oracle embedding and encoder representations in Czech. Baseline simply chooses the most frequent tag. All means we concatenate all annotated features in UD as one tag.

Explicitly modeling case improves parsing accuracy Our analysis indicates that case is important for parsing, so it is natural to ask: can we improve the neural model by explicitly modeling case? To answer this question, we ran a set of experiments, considering two ways to augment the char-lstm with case information: multi-task learning (MTL; Caruana, 1997) and a pipeline model in which we augment the char-lstm with either predicted or gold case. For example, we use ⟨p, i, z, z, a, Nom⟩ to represent pizza with nominative case. For MTL, we follow the setup of Søgaard and Goldberg (2016) and Coavoux and Crabbé (2017): we increase the number of bi-LSTM layers from two to four and use the first two layers to predict morphological case, reserving the remaining two layers for the parser. For the pipeline model, we train a morphological tagger to predict morphological case (Appendix A.1); this tagger does not share parameters with the parser.

Table 6: LAS results when case information is added. Bold highlights the best results for models without explicit access to gold annotations.

Table 6 summarizes the results on Czech, German, and Russian.
We find that augmenting the char-lstm with either oracle or predicted case improves its accuracy, although the effect differs across languages. The improvements from predicted case are interesting, since in non-neural parsers, predicted case usually harms accuracy (Tsarfaty et al., 2010); however, we note that our taggers use gold POS, which might help. The MTL models achieve similar or slightly better performance than the character-only models, suggesting that supplying case in this way is beneficial. Curiously, the MTL parser is worse than the pipeline parser, but the MTL case tagger is better than the pipeline case tagger (Table 7). This indicates that the MTL model learns to encode case in its representations, but does not learn to use it effectively for parsing. Finally, we observe that augmenting the char-lstm with either gold or predicted case improves parsing performance for all three languages, and indeed closes the performance gap with the full oracle, which has access to all morphological features. This is especially interesting because it shows that carefully targeted linguistic analyses can improve accuracy as much as wholesale linguistic analysis.
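The input construction for the pipeline augmentation, appending a (gold or predicted) case tag to the character sequence, can be sketched as:

```python
def char_case_input(word, case=None):
    """Character sequence augmented with a trailing case tag, e.g.
    <p, i, z, z, a, Nom> for 'pizza' in nominative case.  `case` is the
    gold or predicted case value, or None for the plain char-lstm input."""
    return list(word) + ([case] if case is not None else [])
```

The augmented sequence is then composed by the same bi-LSTM as the ordinary character input.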

Understanding head selection
The previous experiments condition their analysis on the dependent, but dependency is a relationship between dependents and heads. We also want to understand the importance of morphological features to the head. Which morphological features of the head are important to the oracle?
Composing features in the oracle To see which morphological features the oracle depends on when making predictions, we augmented our model with a gated attention mechanism following Kuncoro et al. (2017). Our new model attends to the morphological features of candidate head w_j when computing its association with dependent w_i (Eq. 3), and morpheme representations are then scaled by their attention weights to produce a final representation. Let f_{i1}, ..., f_{ik} be the k morphological features of w_i, and denote by e(f_{i1}), ..., e(f_{ik}) their corresponding feature embeddings. As in §2, h_i and h_j are the encodings of w_i and w_j, respectively. The morphological representation m_j of w_j is:

m_j = [e(f_{j1}), ..., e(f_{jk})] k

where k is a vector of attention weights:

k = softmax([e(f_{j1}), ..., e(f_{jk})]^T V [h_i; h_j])

The intuition is that dependent w_i can choose which morphological features of w_j are most important when deciding whether w_j is its head. Note that this model is asymmetric: a word only attends to the morphological features of its (single) parent, and not its (many) children, which may have different functions. We combine the morphological representation with the word's encoding via a sigmoid gating mechanism.
The gating mechanism allows the model to choose between the computed word representation and the weighted morphological representations, since for some dependencies, morphological features of the head might not be important. In the final model, we replace Eq. 3 and Eq. 4 with the following:

P_head(w_j | w_i, w) = exp(a(h_i, z_j)) / Σ_{j'≠i} exp(a(h_i, z_{j'}))
a(h_i, z_j) = v_a · tanh(U_a h_i + W_a z_j)

where z_j is the gated combination of the encoding h_j and the morphological representation m_j. The modified label prediction is:

P_label(ℓ_k | w_i, w_j, w) = exp(f(h_i, z_j)_k) / Σ_{k'} exp(f(h_i, z_j)_{k'})

where f is again a function that computes label scores:

f(h_i, z_j) = V_ℓ tanh(U_ℓ h_i + W_ℓ z_j)

Attention to headword morphological features We trained our augmented model (oracle-attn) on Finnish, German, Czech, and Russian. Its accuracy is very similar to that of the oracle model (Table 8), so we obtain a more interpretable model with no change to our main results. Next, we look at the learned attention vectors to understand which morphological features are important, focusing on the core arguments: nominal subjects, objects, and indirect objects. Since our model knows the case of each dependent, this enables us to understand what features it seeks in potential heads for each case. For simplicity, we only report results for words where both head and label predictions are correct.

Figure 4 shows how attention is distributed across multiple features of the head word. In Czech and Russian, we observe that the model attends to Gender and Number when the noun is in nominative case. This makes intuitive sense, since these features often signal subject-verb agreement. As we saw in earlier experiments, these are features for which a character model can learn reliably good representations. For most other dependencies (and all dependencies in German), Lemma is the most important feature, suggesting a strong reliance on the lexical semantics of nouns and verbs. However, we also notice that the model sometimes attends to features like Aspect, Polarity, and VerbForm; since these features are present only on verbs, we suspect that the model may simply use them as convenient signals that a word is a verb, and thus a likely head for a given noun.
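The attention-and-gating computation described above can be sketched in pure Python; the parameter names (V, Wg_h, Wg_m) are illustrative, and a real model would use trained parameters and vectorized operations:

```python
import math

def gated_morph_attention(h_i, h_j, feat_embs, V, Wg_h, Wg_m):
    """Attention over a candidate head's morphological feature embeddings,
    then a sigmoid gate mixing the weighted features m_j with the head's
    encoding h_j.  All parameter names here are illustrative."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    query = matvec(V, h_i + h_j)            # h_i + h_j is the concat [h_i; h_j]
    scores = [sum(q * f for q, f in zip(query, fe)) for fe in feat_embs]
    z = sum(math.exp(s) for s in scores)
    k = [math.exp(s) / z for s in scores]   # attention weights over features
    dim = len(feat_embs[0])
    m_j = [sum(k[a] * feat_embs[a][d] for a in range(len(feat_embs)))
           for d in range(dim)]             # weighted morphological rep.
    gate = [1.0 / (1.0 + math.exp(-(g1 + g2)))
            for g1, g2 in zip(matvec(Wg_h, h_j), matvec(Wg_m, m_j))]
    z_j = [g * hj + (1.0 - g) * mj for g, hj, mj in zip(gate, h_j, m_j)]
    return k, z_j
```

Inspecting the returned weights k for correctly parsed dependents is what produces analyses like Figure 4.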

Conclusion
Character-level models are effective because they can represent OOV words and orthographic regularities of words that are consistent with morphology. But they depend on context to disambiguate words, and for some words this context is insufficient. Case syncretism is a specific example that our analysis identified, but the main results in Table 2 hint at the possibility that different phenomena are at play in different languages.
While our results show that prior knowledge of morphology is important, they also show that it can be used in a targeted way: our character-level models improved markedly when we augmented them only with case. This suggests a pragmatic reality in the middle of the wide spectrum between pure machine learning from raw text input and linguistically-intensive modeling: our new models don't need all prior linguistic knowledge, but they clearly benefit from some knowledge in addition to raw input. While we used a data-driven analysis to identify case syncretism as a problem for neural parsers, this result is consistent with previous linguistically-informed analyses (Seeker and Kuhn, 2013; Tsarfaty et al., 2010). We conclude that neural models can still benefit from linguistic analyses that target specific phenomena where annotation is likely to be useful.