A Framework for Understanding the Role of Morphology in Universal Dependency Parsing

This paper presents a simple framework for characterizing morphological complexity and how it encodes syntactic information. In particular, we propose a new measure of morpho-syntactic complexity in terms of governor-dependent preferential attachment that explains parsing performance. Through experiments on dependency parsing with data from Universal Dependencies (UD), we show that representations derived from morphological attributes deliver important parsing performance improvements over standard word form embeddings when trained on the same datasets. We also show that the new morpho-syntactic complexity measure is predictive of the gains provided by using morphological attributes over plain forms on parsing scores, making it a tool to distinguish languages using morphology as a syntactic marker from others.


Introduction
While word embedding has proven a good solution to reduce data sparsity in parsing (Koo et al., 2008), treating word forms as atomic units is at odds with the fact that words have a potentially complex internal structure. Furthermore, it makes parameter estimation difficult for morphologically rich languages (MRL), in which the number of possible forms a word can take can be very large.¹
Recently, researchers have started to work on morphologically informed word embeddings (Cao and Rei, 2016; Botha and Blunsom, 2014), aiming at better capturing lexical, syntactic, and morphological information. But encoding lexicon and morphology in the same space makes it difficult to distinguish the role of each in syntactic tasks such as dependency parsing. Furthermore, morphologically rich languages, for which we hope to see a real impact from those morphologically aware representations, might not all rely to the same extent on morphology for syntax encoding. Some might benefit mostly from reduced data sparsity, while others, for which paradigm richness correlates with freer word order (Comrie, 1981), will also benefit from the encoding of morphological information.

¹ A typical English noun has 2 forms while a Finnish one may have more than 30. This shows in the data: English lemmas have 1.39 forms on average while Finnish ones have 2.19, as measured on UD data (Nivre et al., 2016).
This paper aims at characterizing the role of morphology as a syntax encoding device for various languages. Using simple word representations, we measure the impact of morphological information on dependency parsing and relate it to two measures of morphological complexity: the basic form per lemma ratio and a new measure (HPE) defined in terms of the head attachment preferences encoded by a word's morphological attributes. We show that this new measure is predictive of the parsing result differences observed when using different word representations, and that it allows one to distinguish, amongst morphologically rich languages, those that use morphology for syntactic purposes from those that use it as a more semantic marker. To the best of our knowledge, this work is the first attempt at systematically measuring the syntactic content of morphology in a multi-lingual environment.
Section 2 presents the representation learning method and the dependency parsing model. It also defines two measures of morphological complexity. Section 3 describes the experimental setting and analyses parsing results in terms of the previously defined morphological complexity measures. Section 4 gives some conclusions and future work perspectives.

Framework
This section details: (i) our method for learning lexical and morphological representations, (ii) how these can be used for graph-based dependency parsing, and (iii) how to measure morphological complexity. Our representation learning and parsing techniques are purposely very simple, in order to let us separate lexical and morphological information and weigh the role of morphology in dependency parsing of MRL.

Word Representation
We construct separate vectorial representations for lemmas, forms and morphological attributes, either learned via dimension reduction of their own cooccurrence count matrices or represented as raw one-hot vectors.
Let V be a vocabulary (of lemmas, forms, or morphological attributes, including values for POS, number, case, tense, mood, etc.) for a given language. Correspondingly, let C be the set of contexts defined over elements of V: lemmas appear in the context of other lemmas, forms in the context of forms, and attributes in the context of attributes. Then, given a corpus annotated with lemmas and morphological information, we can gather the cooccurrence counts in a matrix M ∈ ℕ^(|V|×|C|), such that M_ij is the frequency of lemma (form or morphological attribute) V_i appearing in context C_j in the corpus. Here, we consider plain sequential contexts (i.e. a surrounding bag of "words") of length 1, although we could extend them to more structured contexts (Bansal et al., 2014). Those cooccurrence matrices are then reweighted by unshifted Positive Point-wise Mutual Information (PPMI) and reduced via Singular Value Decomposition (SVD). For more information on word embedding via matrix factorization, please refer to Levy et al. (2015).
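The PPMI-plus-SVD pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the toy matrix and the dimension are arbitrary.

```python
import numpy as np

def ppmi_svd_embed(M, dim):
    """Embed the rows of a cooccurrence count matrix M (|V| x |C|) by
    reweighting with unshifted PPMI, then truncating an SVD to `dim`."""
    M = M.astype(float)
    total = M.sum()
    row = M.sum(axis=1, keepdims=True)   # marginal counts of vocabulary items
    col = M.sum(axis=0, keepdims=True)   # marginal counts of contexts
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(M * total / (row @ col))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)  # keep positive part
    U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
    return U[:, :dim] * S[:dim]          # rank-`dim` row representations

# toy cooccurrence matrix: 4 vocabulary items, 3 contexts
M = np.array([[4, 0, 1],
              [0, 5, 1],
              [3, 1, 0],
              [0, 0, 6]])
R = ppmi_svd_embed(M, dim=2)
print(R.shape)  # (4, 2)
```

The same routine applies unchanged to the lemma, form, and attribute matrices; only V and C differ.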
Despite its apparent simplicity, this model is as expressive as more popular state-of-the-art embedding techniques. Indeed, Goldberg and Levy (2014) have shown that the SkipGram objective with negative sampling of Word2vec (Mikolov et al., 2013) can be framed as the factorization of a shifted PMI-weighted cooccurrence matrix.
This matrix reduction procedure gives us vectors for lemmas, forms and morphological attributes, noted R. Note that while a word has only one lemma and one form, it will often realize several morphological attributes. We tackle this issue by simply summing over all the attributes of a word (noted Morph(w)). If we note r_w the vectorial representation of word w, we have:

r_w = Σ_{a ∈ Morph(w)} r_a,

where r_a is the representation of attribute a in R. Simple additive models have been shown to be very efficient for compositionally derived embeddings (Arora et al., 2017).
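The additive composition described above can be sketched as follows; the attribute names and vectors are purely illustrative.

```python
import numpy as np

# Hypothetical rows of R for a few morphological attributes (illustrative only).
R = {
    "POS=Noun":    np.array([0.9, 0.1, 0.0]),
    "Case=Nom":    np.array([0.2, 0.7, 0.1]),
    "Number=Sing": np.array([0.0, 0.3, 0.8]),
}

def word_repr(morph_attrs):
    """r_w: sum of the representations of all attributes realized by the word."""
    return sum(R[a] for a in morph_attrs)

r_w = word_repr(["POS=Noun", "Case=Nom", "Number=Sing"])
print(r_w)  # [1.1 1.1 0.9]
```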

Dependency Parsing
We work with graph-based dependency parsing, which offers very competitive parsing models, as recently re-emphasized by Dozat et al. (2017) in the CoNLL 2017 shared task on dependency parsing (Zeman et al., 2017). Let x = (w_1, w_2, ..., w_n) be a sentence, T_x the set of all possible trees over it, ŷ the tree that we predict for x, and Score(·, ·) a scoring function over sentence-tree pairs:

ŷ = argmax_{y ∈ T_x} Score(x, y).

We use edge factorization to make the inference problem tractable: a tree's score is thus the sum of its edge scores. We use a simple linear model:

Score(x, y) = Σ_{e ∈ y} θ · φ(x, e),

where φ(x, e) is a feature vector representing edge e in sentence x, and θ ∈ R^m is a parameter vector to be learned.
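The edge-factored linear score can be sketched as follows; the toy parameter vector and feature map are hypothetical, standing in for the learned θ and φ.

```python
import numpy as np

def tree_score(theta, edge_feats, tree):
    """Edge-factored linear model: a tree's score is the sum over its
    edges of the dot product theta . phi(x, e). `edge_feats` maps
    (governor, dependent) pairs to their feature vectors."""
    return sum(float(theta @ edge_feats[e]) for e in tree)

# toy example: 2 edges with 3-dimensional feature vectors
theta = np.array([1.0, -1.0, 0.5])
edge_feats = {(0, 1): np.array([1.0, 0.0, 2.0]),
              (1, 2): np.array([0.0, 1.0, 0.0])}
print(tree_score(theta, edge_feats, [(0, 1), (1, 2)]))  # 2.0 + (-1.0) = 1.0
```

Inference then amounts to finding the tree in T_x maximizing this sum, which the decoding algorithm handles.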
The vector representation of an edge e_ij, whose governor is the i-th word w_i and whose dependent is the j-th word w_j, is defined by the outer product of their respective representations in context. Let ⊕ denote vector concatenation, ⊗ the outer product, and w_{k±1} the word just before/after w_k; then:

φ(x, e_ij) = (w_{i-1} ⊕ w_i ⊕ w_{i+1}) ⊗ (w_{j-1} ⊕ w_j ⊕ w_{j+1}).

Recall that each w_k here stands for its vector of length d_V from R. We use the averaged Passive-Aggressive online algorithm for structured prediction (Crammer et al., 2006) to learn the model θ. Given a score for each edge, we use Eisner's algorithm (Eisner, 1996) to retrieve the best projective spanning tree. Even though some languages display a fair amount of non-projective edges, on average Eisner's algorithm scores higher than the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965) in our setting.
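The in-context edge representation can be sketched as below. This is a minimal illustration under the definitions above; boundary padding for the first and last words is omitted for brevity.

```python
import numpy as np

def edge_features(reprs, i, j):
    """phi(x, e_ij): outer product of the in-context representations of
    governor w_i and dependent w_j, each being the concatenation of the
    vectors of the word and its immediate left and right neighbours."""
    gov = np.concatenate([reprs[i - 1], reprs[i], reprs[i + 1]])
    dep = np.concatenate([reprs[j - 1], reprs[j], reprs[j + 1]])
    return np.outer(gov, dep).ravel()

# toy sentence of 5 words with 2-dimensional word representations
reprs = [np.ones(2) * k for k in range(5)]
phi = edge_features(reprs, 2, 3)
print(phi.shape)  # (36,): a (3*2) x (3*2) outer product, flattened
```

Note the quadratic blow-up of the outer product, which is why the attribute vectors are kept low-dimensional.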

Measuring Morpho-Syntactic Complexity
Some languages use morphological cues to encode syntactic information, while others encode more semantic information with them. For example, the Case feature (especially core cases) is of prime syntactic importance, for it encodes the type of relation words have with each other. On the contrary, the Possessor feature (in Hungarian, for example) is more semantic in nature and need not impact sentence structure. This observation would support a different treatment for each language; however, those languages tend to be treated uniformly in work dealing with MRL.
Form to Lemma Ratio A basic measure of morphological complexity is the form per lemma ratio, which we note F/L. It captures the tendency of words to inflect in a given language. Because some word classes tend not to inflect and not all forms are equally productive, we also note F/iL the ratio of forms per inflected lemma. Given a language l with a lemma vocabulary V_l and a form counting function c : V_l → ℕ that returns the number of forms a lemma can take, we have:

F/L = (1 / |V_l|) Σ_{λ ∈ V_l} c(λ),    F/iL = (1 / |V_l^i|) Σ_{λ ∈ V_l^i} c(λ),

where V_l^i = {λ ∈ V_l : c(λ) > 1} is the set of inflected lemmas. F/L and F/iL do not measure the informative content of morphology, but simply its productivity. Bentz et al. (2016) compared five different measures of morphological complexity, amongst which word entropy and the micro-averaged version of F/L (which they call TTR), and showed that they all have high positive correlation given enough data.
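Both ratios can be computed directly from (lemma, form) token pairs; the following sketch uses empirical form counts as a surrogate for c, as done in the experiments below. The toy data are illustrative.

```python
from collections import defaultdict

def form_lemma_ratios(pairs):
    """Compute F/L and F/iL from (lemma, form) token pairs.
    F/L averages form counts over all lemmas; F/iL only over
    lemmas attested with more than one form."""
    forms = defaultdict(set)
    for lemma, form in pairs:
        forms[lemma].add(form)
    counts = [len(v) for v in forms.values()]
    fl = sum(counts) / len(counts)             # forms per lemma
    inflected = [c for c in counts if c > 1]   # inflected lemmas only
    fil = sum(inflected) / len(inflected) if inflected else 1.0
    return fl, fil

pairs = [("dog", "dog"), ("dog", "dogs"),
         ("be", "is"), ("be", "was"), ("be", "been"),
         ("the", "the")]
fl, fil = form_lemma_ratios(pairs)
print(fl, fil)  # 2.0 2.5
```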

Head POS Entropy
In order to compare the morpho-syntactic complexity of different languages, we introduce a new measure called Head Part-of-speech Entropy, or HPE. The HPE of a token t represents the amount of information t has about the part-of-speech of its governor. More formally, let POS(Gov(t)) be the set of parts-of-speech that t can depend on, and let π_t(p) be the probability of t actually depending on part-of-speech p; then the HPE is defined as:

HPE(t) = − Σ_{p ∈ POS(Gov(t))} π_t(p) log π_t(p).

This is a measure of a token's preferential attachment to its head. A token with a low HPE tends to attach often to the same part-of-speech, while a token with a high HPE will attach to many different parts-of-speech. Thus a language with a low HPE will tend to encode a lot of syntactic information in the morphology, rather than, say, in word order. For example, a noun can attach to another noun as a genitive, to a verb as a subject or object, or even to an adjective in the case of transitive adjectives. French nouns do not inflect for case, thus attachment to another noun or verb can only be inferred from the words' relative positions. On the contrary, Gothic nouns do inflect for case, making verb or noun attachment clear directly from the morphological analysis.
We compute the HPE of a language as the average HPE of its attribute sets over a given corpus. Likewise, we use the empirical counts as a surrogate for c in F/L and F/iL.
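The corpus-level HPE can be sketched as follows: group tokens by their attribute set, estimate π empirically for each set, and average the resulting entropies. The toy corpus is illustrative, contrasting a case-marked set (deterministic attachment, entropy 0) with an unmarked one (ambiguous attachment).

```python
import math
from collections import Counter, defaultdict

def hpe(tokens):
    """Average Head POS Entropy over a corpus, where `tokens` is a list
    of (attribute_set, governor_pos) pairs. Each attribute set keys an
    empirical distribution over governor parts-of-speech; HPE averages
    the entropy of these distributions over attribute sets."""
    dists = defaultdict(Counter)
    for attrs, gov_pos in tokens:
        dists[frozenset(attrs)][gov_pos] += 1
    entropies = []
    for counter in dists.values():
        total = sum(counter.values())
        h = -sum((n / total) * math.log2(n / total) for n in counter.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)

# toy corpus: case-marked nouns attach consistently, unmarked ones do not
tokens = [({"POS=Noun", "Case=Nom"}, "VERB")] * 4 + \
         [({"POS=Noun"}, "VERB")] * 2 + [({"POS=Noun"}, "NOUN")] * 2
print(hpe(tokens))  # (0.0 + 1.0) / 2 = 0.5
```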

Experiments
In order to test the hypothesis that morphological representations contain syntactic information crucial for dependency parsing of morphologically rich languages, but that this information is not equally distributed across MRL, we run experiments on data from the Universal Dependencies (Nivre et al., 2016) project.

Data Description
For conciseness, we focused on eleven languages that display varying degrees of morphological complexity and belong to four different language families. Basque (eu) is an isolate and it is an ergative language. English (en), Gothic (got), Danish (da) and Swedish (sv) are Germanic languages, and French (fr) and Romanian (ro) are Romance languages (Indo-European). Finnish (fi), Estonian (et) and Hungarian (hu) are Finno-Ugric languages. Hebrew (he) is a Semitic language. Basic statistics are provided in Table 1.

Experimental Settings
For the experiments we use the train/dev/test data provided by UD 2.0. Basic statistics about the data are reported in the appendix. Lemmas and forms are embedded in 150 dimensions, while morphological attributes are embedded in 50 dimensions, because they are much less numerous (fewer than 100). All embeddings are induced on the respective language's train set only, using a context window of size 1 (i.e. the directly preceding and following words).
Parsers are trained for 10 iterations using either lemma, form or morphological representations, and we pick the best iteration on the basis of UAS on the development set.
While we used gold lemmas as provided in the corpora, we ran two experiments for morphological attributes: one with gold attributes and one with predicted attributes. Morphological attributes are predicted with a simple multinomial logistic regression per attribute (POS, Tense, Case, Gender...), where we add a special undef value (except for POS) to represent the lack of an attribute (e.g., nouns have no Tense in English). The models predict attribute values for the center word of trigrams represented by feature vectors encoding word prefixes and suffixes of length 1, 2 and 3, word length and capitalization. We used the logistic regression implemented in the Scikit-Learn library (Pedregosa et al., 2011) with the default settings. It can output an argmaxed decision or a softmaxed decision, and we tried both as input to the parser. The argmaxed decision gives a vector of zeros and ones, while the softmaxed decision gives a continuous vector in which the values for each attribute sum to one (e.g., the probabilities assigned to the possible Gender values Masculine, Feminine, Neuter and Undef must sum to one). Those vectors are then used unchanged for the one-hot representation, or passed through an embedding matrix for the embedding representation.
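The prediction setup can be sketched as below for a single attribute. This is a simplified illustration of the described pipeline, not the authors' code; the feature names, padding tokens, and toy sentences are assumptions.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def trigram_features(words, k):
    """Prefix/suffix (length 1-3), length and capitalization features
    for the trigram centred on position k (simplified feature set)."""
    feats = {}
    for pos, w in (("prev", words[k - 1]), ("cur", words[k]), ("next", words[k + 1])):
        for n in (1, 2, 3):
            feats[f"{pos}_pre{n}"] = w[:n]
            feats[f"{pos}_suf{n}"] = w[-n:]
        feats[f"{pos}_len"] = len(w)
        feats[f"{pos}_cap"] = w[0].isupper()
    return feats

# toy training data for one attribute (Number), with boundary padding
sents = [["<s>", "the", "dogs", "bark", "</s>"],
         ["<s>", "the", "dog", "barks", "</s>"]]
X_dicts = [trigram_features(s, 2) for s in sents]
y = ["Plur", "Sing"]

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
clf = LogisticRegression(max_iter=1000).fit(X, y)

probs = clf.predict_proba(X)          # softmaxed decision: each row sums to one
onehot = np.eye(len(clf.classes_))[probs.argmax(axis=1)]  # argmaxed decision
print(probs.sum(axis=1))              # [1. 1.]
```

In the full setup one such classifier is trained per attribute, and the per-attribute vectors are concatenated before being fed to the parser.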

Results
For clarity, we focus on comparing results using form embeddings and gold morphological representations. They are given in Table 2. Because the analysis carries over to the labeled case, we stick to unlabeled attachment scores (UAS) for the analysis. A more complete table is provided in the appendix, as well as a complete labeled attachment score (LAS) table. Morphological complexity measures are also reported.
One-hot gold morphological attributes consistently outperform form embeddings. This is expected, since form embeddings were trained on much less data than is usually considered necessary. However, improvements are not consistent across languages, ranging from 1.14 points for English to 15.20 points for Finnish. While those differences are not explained by morphological productivity alone (Figure 1a), a measure of preferential attachment gives a good account of them (Figure 1b). Those inconsistencies become even more striking when considering results using predicted attributes. We notice that despite a general performance drop of 5-12 points, predicted attributes still perform significantly better than form embeddings for those morphologically rich languages that have an HPE lower than 0.65, as depicted in Figure 1b.

Figure 1: Accuracy differences (y-axis) between parsers using form embeddings and parsers using one-hot attributes, with respect to morphological complexity (x-axis). Red dots represent the gold attributes scores and blue squares the predicted attributes scores.
Figures 1a and 1b plot the differences in parsing scores. For each language, the red dot corresponds to the score difference between using form embeddings and gold attributes one-hot representations, and the blue square corresponds to the score difference between using the same form embeddings and predicted attributes softmax representations (the complete scores are given in the appendix). Both figures show trends: score differences seem to increase with F/iL and decrease with HPE. But while the F/iL plot suffers from outliers (Hungarian, Estonian and Romanian), the HPE plot shows a clear boundary between languages benefiting fully from morphological information (even predicted) and those benefiting primarily from reduced data sparsity. While Hebrew seems to be an outlier, this might be due to its annotation style, in which attached prepositions, articles and possessive markers are treated as independent words rather than as morphological inflections, as other languages do, thus artificially increasing parsing accuracy with a lot of trivial dependencies.
This shows that HPE is indeed a good measure of the syntactic informativeness of a language's morphology, and that it can help decide between encoding morphological information and just reducing data sparsity. Furthermore, it seems to be linked to the distinction that Kibort and Corbett (2010) make between morphosyntax and morphosemantics.

Conclusion
We have contributed a new measure of morpho-syntactic complexity (HPE) that helps distinguish languages that use morphology for syntactic purposes from languages that use morphology to encode more semantic information. We showed that this measure correlates much better with differences in parsing results using morphological representations than the simple form per lemma ratio does. It could thus be used to help design language-specific word representations.
It is worth mentioning that we focused here on dependent-marked head selection. It would be interesting to have a similar measure for head-marking situations, with dependencies marked on the governor. We leave this for future work.

Acknowledgement
This work was supported by ANR Grant GRASP No. ANR-16-CE33-0011-01 and a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020. We also thank the reviewers for their valuable feedback.

Appendix Tables

Table 3 reports results for the predicted attributes experiment, together with the POS and averaged attribute prediction accuracies. Scores are reported for the four representation regimes of predicted attributes: predictions can be either probability distributions (Soft) or argmax (Hard), and either used as such (OH) or passed through an embedding (Emb). Table 4 reports all the labeled attachment scores for parsers using either gold lemmas, forms or gold attributes, either as one-hot vectors or as dense embeddings. Table 5 reports LAS for the predicted attributes experiment, with the same four representation regimes as in Table 4: the first row is LAS using the form representation, and rows 2 to 5 are LAS using morphological representations, either one-hot or embedded, with either hard or soft decisions.