Universal Dependencies and Morphology for Hungarian - and on the Price of Universality

In this paper, we present how the principles of universal dependencies and morphology have been adapted to Hungarian. We report the most challenging grammatical phenomena and our solutions to those. On the basis of the adapted guidelines, we have converted and manually corrected 1,800 sentences from the Szeged Treebank to universal dependency format. We also introduce experiments on this manually annotated corpus for evaluating automatic conversion and the added value of language-specific, i.e. non-universal, annotations. Our results reveal that converting to universal dependencies is not necessarily trivial; moreover, using language-specific morphological features may have an impact on overall performance.


Introduction
Morphological tagging and syntactic parsing are key components in most natural language processing (NLP) applications. Linguistic resources and parsers for morphological and syntactic analysis have been developed for several languages, see e.g. the shared tasks on morphologically rich languages (Seddah et al., 2013; Seddah et al., 2014). However, the comparison of results achieved for different languages is not straightforward, as most languages and databases apply a unique tagset and were annotated following different guidelines. In order to overcome these issues, the project Universal Dependencies and Morphology (UD) has recently been initiated within the NLP community (Nivre, 2015). The main goal of the UD project is to develop a "universal", i.e. language-independent, morphological and syntactic representation which can contribute to the implementation of multilingual morphological and syntactic parsers from a computational linguistic point of view. Furthermore, it can enhance studies on linguistic typology and contrastive linguistics.
From the viewpoint of syntactic parsing, the languages of the world are usually categorized according to their level of morphological richness (which is negatively correlated with configurationality). At one end, there is English, a strongly configurational language while there is Hungarian at the other end of the spectrum with rich morphology and free word order (Fraser et al., 2013). In this paper, we present how UD principles were adapted to Hungarian, with special emphasis on Hungarian-specific phenomena.
Hungarian is one of the prototypical morphologically rich languages; thus, our UD principles can provide important best practices for the universalization of other morphologically rich languages. The UD guidelines for Hungarian were motivated by both linguistic considerations and data-driven observations. We developed a converter from the existing Szeged Dependency Treebank (Vincze et al., 2010) to UD and manually corrected 1,800 sentences from the newspaper domain. The experience gained during the converter development and during the manual correction could reinforce the linguistic guidelines. Moreover, the manually corrected gold standard corpus provides the opportunity for empirical evaluations such as assessing the converter and comparing dependency parsers that employ the original and the universal morphological representations. Thus, we evaluated the quality of the automatic conversion, which reveals that converting to universal dependencies is not necessarily trivial, at least for Hungarian. We also show that using different morphological tagsets may have an impact on overall parsing performance and that utilizing language-specific, i.e. non-universal, information has a considerable added value at both the morphological and syntactic layers.
The chief contributions of the paper are the introduction of
• the universal morphology and dependency principles for Hungarian, leading to insights for other morphologically rich languages,
• empirical experiments on the upper bound of the accuracy of automatic conversion and pre-parsing,
• comparative evaluations for assessing the added value of language-specific information at the morphological and syntactic layers along with the interaction of these two.

Related Work
Standardized tagsets for both morphological and syntactic annotations have been constantly developed in the international NLP community. For instance, the MSD morphological coding system was developed for a set of Eastern European languages (Erjavec, 2012), within the MULTEXT-EAST project. Interset functions as an interlingua for several morphological coding systems, which can convert different tagsets to the same morphological representation (Zeman, 2008). There have also been some attempts to define a common set of parts-of-speech: Rambow et al. (2006) defined a multilingual tagset for part-of-speech (POS) tagging and parsing, while McDonald and Nivre (2007) identified eight POS tags based on data from the CoNLL-2007 Shared Task. Petrov et al. (2012) offered a tagset of 12 POS tags and applied this tagset to 22 languages. Universal Dependencies (UD) is the latest standardized tagset that we are aware of. UD is an international project that aims at developing a unified annotation scheme for dependency syntax and morphology in a language-independent framework (Nivre, 2015). Hungarian was among the first 10 languages of the project, participating also in the first official release in January 2015. In the latest release (Version 1.3, May 2016), there are annotated datasets available for 40 languages, including English, German, French, Hungarian and Irish, among others 1. In these datasets, the very same tagsets are applied at the morphological and syntactic levels and texts are annotated on the basis of the same linguistic principles, to the widest extent possible.
1 http://universaldependencies.org/
The UD tagset encodes morphological information in the form of POS tags and feature-value pairs. As for syntactic information, each word is assigned to its parent word in the dependency tree and the grammatical function of the specific word is encoded in dependency labels. Dependency labels, POS tags and features are universal (i.e. there is a fixed set of them without the possibility of introducing new members), but values and dependency labels can have language-specific additions if needed. Features are divided into two categories: lexical features and inflectional features. Lexical features are characteristics of the lemmas rather than the word forms, whereas inflectional features are characteristics of the word forms. Both lexical and inflectional features can be layered: some features are marked more than once on the same word, e.g. a Hungarian noun may denote its possessor's number as well as its own number. In this case, the Number feature has an added layer, Number[psor].
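To make the notion of feature-value pairs and layered features concrete, the following minimal Python sketch parses a feature string written in the pipe-separated `Name=Value` notation used in UD data files. The helper names `parse_feats` and `split_layer` and the example feature string are our own illustrative assumptions; the layer is written here as `Number[psor]`, the spelling used in the UD feature inventory.

```python
def parse_feats(feats):
    """Parse a UD-style feature string such as 'Case=Nom|Number=Sing'.

    An empty string or '_' (no features) yields an empty dict.
    """
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def split_layer(name):
    """Split a feature name into (base, layer).

    'Number[psor]' -> ('Number', 'psor'); 'Case' -> ('Case', None)
    """
    if name.endswith("]") and "[" in name:
        base, layer = name[:-1].split("[", 1)
        return base, layer
    return name, None


# Illustrative (assumed) feature string for a possessed noun such as
# "kertje" (garden-3SGPOSS): the noun's own number and its possessor's
# number are stored as two layers of the same feature.
feats = parse_feats("Case=Nom|Number=Sing|Number[psor]=Sing|Person[psor]=3")
own_number = feats["Number"]
possessor_number = feats["Number[psor]"]
```

Keeping the layer inside the feature name means that the noun's own number and the possessor's number never overwrite each other in the resulting dictionary.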
Our UD principles introduced in this paper follow the central UD guidelines (Nivre, 2015) and we did our best to align with the existing guidelines for other morphologically rich languages as well. On the other hand, there are several Hungarian-specific phenomena that required changes and extensions of the original UD principles.
The only available manually annotated treebank for Hungarian is the Szeged Corpus (Csendes et al., 2004) and the Szeged Dependency Treebank (Vincze et al., 2010). It contains approximately 82,000 sentences and 1.5 million tokens, all manually annotated for POS-tagging and constituency and dependency syntax. We developed an automatic tool that converts the morphological descriptions of the Szeged Corpus to universal morphology tags and the dependency trees of the Szeged Treebank to universal dependencies.

Universal Morphology for Hungarian
In this section, we present the morphological tagset applied to Hungarian.
When adapting the principles of Universal Morphology to Hungarian, we were able to automatically convert most of the morphological features used in the Szeged Treebank 2.5 (Vincze et al., 2014), which was based on MSD principles (Erjavec, 2012). However, we faced some problematic issues, which we will discuss in detail in this section. The details of the universal morphological codeset for Hungarian are available on our website 2.

Possessive constructions
The possessor in Hungarian possessive constructions can have two different surface forms, without any difference in meaning: the possessor can be morphologically marked or not, just like the English constructions the girl's doll and the doll of the girl. Thus, both of the following possessive constructions are widely used:
(1) a szomszéd kertje
    the neighbor garden-3SGPOSS
    "the neighbor's garden"
(2) a szomszédnak a kertje
    the neighbor-DAT the garden-3SGPOSS
    "the neighbor's garden"
In Example 1, the possessor is not marked, i.e. it shares its form with the nominative form of the noun. In Example 2, however, the possessor is morphologically marked, sharing its form with the dative form of the noun. Nevertheless, the possessed is morphologically marked in both cases, which was a novelty in the UD project as the languages already included in the data do not mark the possessor on the possessed noun but use determiners for this purpose (cf. my car but az autóm (the car-1SGPOSS)). Moreover, the number of the possessed can be marked on the noun in elliptical constructions such as: (3)

Object-verb agreement
Another Hungarian-specific feature was the definiteness of the object. As a special type of agreement, the definiteness of the object determines which paradigm of the verb is to be chosen. In other words, the form of the verb changes when the definiteness of the object changes (Törkenczy, 2005).
. I can see you.
In this way, the feature Definiteness needs to be applied to verbs in Hungarian; moreover, it has a language-specific value due to the special form triggered by second person objects. Thus, Definiteness has three possible values in Hungarian: Definite, Indefinite and 2.

Determiners and pronouns
Determiners, pronouns and ordinal numbers also constituted a peculiarity. According to Hungarian grammatical traditions, ordinal numbers have been treated as numerals but in the universal morphology, they have to be annotated as adjectives. Thus, their POS tags were automatically converted to adjectives.
Demonstrative pronouns were also treated differently in the original annotation used in the Szeged Treebank and in universal morphology. While the demonstrative pronouns ez and az are tagged as pronouns independently of their positions in the Szeged Treebank, in universal morphology such words occurring before an article should be tagged as a determiner (see Example 9) but when they are used as an NP, they should be tagged as a pronoun (see Example 10, translated as "I have read that.").
These cases were also automatically converted, following the universal morphology guidelines.
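The positional rule for demonstratives can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the rule set of our converter: tokens are represented as hypothetical (lemma, POS) pairs, articles are assumed to be already tagged DET, and the two toy token sequences are constructed for this sketch.

```python
DEMONSTRATIVE_LEMMAS = {"ez", "az"}


def retag_demonstratives(tokens):
    """Retag demonstratives that directly precede an article as DET.

    tokens: list of (lemma, pos) pairs in which articles are tagged DET
    and demonstratives are tagged PRON (an assumed input format).
    Demonstratives in any other position keep the PRON tag.
    """
    result = []
    for i, (lemma, pos) in enumerate(tokens):
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        before_article = following is not None and following[1] == "DET"
        if pos == "PRON" and lemma in DEMONSTRATIVE_LEMMAS and before_article:
            pos = "DET"
        result.append((lemma, pos))
    return result


# Toy input: demonstrative + article + noun ("this book").
before_article = retag_demonstratives(
    [("ez", "PRON"), ("a", "DET"), ("könyv", "NOUN")]
)
# Toy input: demonstrative used as an NP, directly followed by a verb.
as_noun_phrase = retag_demonstratives([("az", "PRON"), ("olvas", "VERB")])
```

The rule reads the tags of the unmodified input, so a retagged demonstrative never triggers a second retagging of the word before it.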

Verbal prefixes
In our original treebank, verbal particles that were spelt as a separate token had their own part-of-speech, i.e. verbal particle. According to the UD description, however, not all function words that are traditionally called particles automatically qualify for the PART tag. They may be adpositions or adverbs by origin and should therefore be tagged ADP or ADV, respectively. Thus, we manually compiled a list that contained the original part-of-speech of words that were tagged as verbal prefixes. For instance, el "away" was treated as an adverb and agyon (brain-SUP) as a noun; the latter is usually used in phrases like agyonüt "kill someone by hitting on his head". Based on this list, we were able to automatically assign UD POS tags to verbal prefixes.
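The list-based assignment amounts to a dictionary lookup, as the following minimal Python sketch shows. Only the two entries el and agyon are taken from the text above; the dictionary name, the function name and the choice to return None for unlisted forms (so that they can be inspected manually) are assumptions made for this illustration.

```python
# Excerpt of a manually compiled prefix list: form -> UD POS tag.
# Only these two entries are mentioned in the text; a full list
# would be considerably longer.
PREFIX_POS = {
    "el": "ADV",      # "away", an adverb by origin
    "agyon": "NOUN",  # brain-SUP, a noun by origin
}


def ud_pos_for_prefix(form):
    """Return the UD POS tag for a separately spelt verbal prefix.

    Returns None when the form is not on the list, so that the caller
    can flag the token for manual inspection (an assumed policy).
    """
    return PREFIX_POS.get(form.lower())


tag_el = ud_pos_for_prefix("el")
tag_agyon = ud_pos_for_prefix("Agyon")
tag_unknown = ud_pos_for_prefix("xyz")
```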

Universal Dependency in Hungarian
When adapting the universal dependency labels to Hungarian, we could find a one-to-one correspondence between the original labels of the Szeged Treebank and the UD labels in most, but not all, of the cases. These labels could be automatically converted to the UD format, making use of the dependency and morphological annotations found in the original treebank. However, we encountered some problematic cases during conversion, which we will discuss below in detail. The details of the universal dependency rules for Hungarian are available on our website 3.

Non-overt copulas
Traditionally, it is the verb that functions as the head of the clause in dependency grammars but in certain languages, there are verbless clauses where the predicate consists of a single nominal element (typically a noun or an adjective) at the surface level. The dependency analysis of such sentences may be problematic due to the lack of an overt verb. Some studies such as Polguère and Mel'čuk (2009) argue for a zero copula in such cases, especially when the copula is empty only in certain slots of the verbal paradigm. For instance, in Hungarian, the copula has its zero form only in the present tense, indicative mood, third person forms as shown in Examples 11-14:
(11) Present tense, indicative mood, Sg1:
The original dependency analysis in the Szeged Treebank inserts a zero copula (VAN), i.e. a virtual node in the dependency tree, which functions as the head of the clause and the nominal predicate is attached to it. Figure 1 shows such an analysis of the sentence E gondolat sem új (this thought not new) "This thought is not novel at all".
Beside the function head analysis (i.e. where function words, e.g. the copula is the head), there is another approach to dependencies, namely, the content head analysis, where the head is a content word instead of a function word. In the latter case, the main grammatical relations can be found among content words and all the other function words are attached to the main structure. UD applies the content head analysis, which means that in copular constructions, the nominal element is the head and the copula (if present) is attached to it with a cop relation. In a similar way, the head of adpositional constructions is the noun and the adposition is attached to it.
Sentences with nominal predicates were automatically converted from the original treebank into the UD format: Figure 2 shows the UD analysis of the sentence found in Figure 1. Likewise, postpositional constructions were converted: the noun was treated as the head and the postposition was attached to it with a case label.
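The re-heading step behind this conversion can be illustrated with a small Python sketch. It is a simplified illustration rather than our converter: the `Token` class, the label names PRED, SUBJ, NEG, DET and ROOT, and the attachments in the toy tree are all assumptions made for this example and do not reproduce Figure 1. The sketch only changes heads; mapping the remaining original labels to UD labels would be a separate step.

```python
from dataclasses import dataclass, replace


@dataclass
class Token:
    id: int
    form: str
    head: int               # 0 marks the root of the sentence
    deprel: str
    virtual: bool = False   # True for an inserted zero copula


def to_content_head(tokens, pred_label="PRED"):
    """Make nominal predicates the heads of their clauses.

    Every token labelled pred_label is assumed to depend on a copula
    (overt or virtual). The predicate takes over the copula's head and
    label, the copula's other dependents are re-attached to the
    predicate, an overt copula becomes a "cop" dependent, and virtual
    copulas are dropped from the output.
    """
    out = [replace(t) for t in tokens]  # work on copies
    by_id = {t.id: t for t in out}
    for pred in out:
        if pred.deprel != pred_label:
            continue
        cop = by_id[pred.head]
        pred.head, pred.deprel = cop.head, cop.deprel
        for t in out:
            if t is not pred and t.head == cop.id:
                t.head = pred.id
        cop.head, cop.deprel = pred.id, "cop"
    return [t for t in out if not t.virtual]


# Toy function-head tree for "E gondolat sem új" with a virtual copula.
sentence = [
    Token(1, "E", 2, "DET"),
    Token(2, "gondolat", 5, "SUBJ"),
    Token(3, "sem", 5, "NEG"),
    Token(4, "új", 5, "PRED"),
    Token(5, "VAN", 0, "ROOT", virtual=True),
]
converted = to_content_head(sentence)
```

In the toy output, új becomes the root and gondolat and sem depend on it, while the virtual node VAN no longer appears in the tree.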

Subordinate clauses
Subordinate clauses also proved to be a problematic issue as UD principles make a sharp distinction among several types of subordinate clauses (e.g. clausal subject, clausal object, adverbial clause), in contrast with the Szeged Dependency Treebank, which applies one single label for all types of subordinate clauses. Some types of subordinate clauses had a special label in the constituency version of the treebank; hence, their conversion was straightforward. In other cases, we could rely on manually constructed conversion rules but the resulting trees had to be corrected manually.

Multiword named entities
The UD treatment of multiword named entities required a Hungarian-specific solution. According to the UD principles, the first token of the multiword expressions should be marked as the head. However, in Hungarian, it is always the last element of the multiword expression that is inflected. Examples 15-16 demonstrate that the first element cannot be inflected, only the last one. Due to the above morphosyntactic facts, we marked the last token of multiword named entities as the head in the Hungarian UD treebank while all the other UD treebanks mark the first token as the head.

Dative forms
In Hungarian, nouns that bear the suffix -nAk can fulfill several grammatical roles in the sentence such as:  Leslie had to apologize to his friend.
While these forms do not show any difference at the morphological level, they have very different roles at the syntactic and semantic levels. Thus we decided not to make any distinction in the morphological annotation but they should have different syntactic labels. Indirect objects are marked with the label iobj, possessors with the label nmod:poss and other occurrences with nmod:obl. Obviously, these annotations had to be carried out manually as most of these cases could not be easily and unequivocally converted to the UD format only on the basis of morphology and syntax.

Light verb constructions
Light verb constructions are verb + noun combinations where most of the semantic content of the whole expression is carried by the noun while the syntactic head is the verb (e.g. to have a shower, to make a decision). They are not uniformly treated in Version 1.3 of the UD treebanks. Light verb constructions are either not marked at all or if they are marked, they may have a special structure or special labels (Nivre and Vincze, 2015). The Hungarian treebank belongs to the latter group, that is, members of light verb constructions bear a special label. For instance, Figure 3 shows that the label dobj:lvc can be found between the nominal and verbal component of the light verb construction döntést hoz (decision-ACC bring) "to make a decision". In this way, the dobj part of the label marks that syntactically it is a verb-object relation but semantically, it is a light verb construction, marked by the lvc extension of the label.

Experiments
We developed a converter from the existing Szeged Dependency Treebank (Vincze et al., 2010) to UD and manually corrected 1,800 sentences from the newspaper domain. The manually corrected UD sentences are available in the UD repository v1.3. The experience gained during the manual correction could reinforce the linguistic conversion rules and the manually corrected gold standard corpus provides the opportunity for empirical evaluations, which we introduce in this section.

On the Accuracy of Automatic Converters
Most of the UD treebanks are the result of automatic conversion from a dependency treebank originally built on different principles. The accuracy of these automatic converters is unknown, i.e. we do not know how much information was lost or how much noise was introduced by the converters. To empirically investigate this in the case of Hungarian UD, we compared the converted and the manually corrected, i.e. gold standard, trees of the 1,800 sentences.
The converter itself is based on linguistic rules (it is available on our website 5) which were iteratively improved by manually investigating the results of conversion on sentences of the Szeged Dependency Treebank. The final version of the converter achieves a UAS of 87.81 and a LAS of 75.99 on the 1,800 sentences compared against the manually corrected UD trees. We believe that this level of accuracy is not sufficient for releasing the rest of the 80,000 sentences of the automatically converted Szeged Dependency Treebank. On the other hand, some of the shortcomings of the automatic conversion could be corrected by exploiting annotation found in other versions of the Szeged Treebank. For instance, the type of certain subordinate clauses is marked in the constituency version of the treebank, which can be transformed into UD labels. Moreover, coreference annotations from the subcorpora annotated for coreference relations could enhance the proper attachment of relative clauses. We intend to add these pieces of information to our converter in the future, so higher accuracy scores can be expected for the automatic conversion process: just with the above mentioned corrections, an additional 6 percentage points could be achieved in terms of LAS as about 20% of the errors are due to subordinate or relative clauses.
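For readers less familiar with the two metrics, the following Python sketch shows how unlabeled and labeled attachment scores are computed when converted trees are compared against gold standard trees. The function name, the (head, label) pair representation and the four-token toy data are assumptions made for this illustration; the sketch is not the evaluation script used in our experiments.

```python
def attachment_scores(gold, predicted):
    """Compute (UAS, LAS) in percent over aligned token lists.

    gold, predicted: lists of (head, label) pairs, one per token.
    UAS counts tokens whose head is correct; LAS counts tokens whose
    head and label are both correct.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("token lists must be non-empty and aligned")
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy data: token 2 has a wrong label, token 3 has a wrong head.
gold = [(2, "det"), (4, "nsubj"), (4, "neg"), (0, "root")]
converted = [(2, "det"), (4, "nmod"), (2, "neg"), (0, "root")]
uas, las = attachment_scores(gold, converted)
```

A token with a wrong head lowers both scores, whereas a token with a correct head but a wrong label lowers only LAS, which is why LAS can never exceed UAS.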

On the Price of Universality
We carried out experiments for investigating whether there is any difference between using the original MSD (Vincze et al., 2014) and the new universal morphological (UM) descriptions. We were particularly interested in the utility of the two representations for dependency parsing. We trained two models of the MarMoT morphological tagger (Mueller et al., 2013) using the two morphological representations in 10-fold cross-tagging on our manually corrected 1,800 sentences. Then we trained and evaluated the Bohnet dependency parser (Bohnet, 2010) on the train/test split of the UD repository v1.3 utilizing the two different predicted morphological descriptions. We used the default parameters for both the MarMoT tagger and the Bohnet parser.
Table 1: Dependency parsing results on the Hungarian Universal Dependency dataset. In the case of LAS(main label), we do not check the language-specific part of the dependency labels in the evaluation, whereas at LAS(full label) we compare both the universal and the language-specific parts of the dependency labels.
was used for training the Bohnet parser. main label refers here to the universal dependency labels while full label refers to using the concatenation of universal and language-specific labels. The difference between the last two columns of the table is whether we checked the full or only the main dependency labels at evaluation. Table 1 shows that MSD outperforms UM consistently in each of the experiments. Although these differences are not high, this suggests that some information encoded in the MSD morphology is not represented in UM, i.e. we have to pay a price to be universal. We can observe the greatest difference when training and evaluating on full dependency labels, i.e. language-specific morphological features contribute to the prediction of language-specific dependency labels.
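The difference between the two labeled evaluation settings can be sketched as follows. The sketch assumes that the language-specific part of a label follows a colon, as in dobj:lvc or nmod:poss; the function names and the toy data are illustrative assumptions rather than our evaluation code.

```python
def main_label(label):
    """Strip the language-specific extension: 'dobj:lvc' -> 'dobj'."""
    return label.split(":", 1)[0]


def labeled_attachment_score(gold, predicted, full_label=True):
    """LAS in percent over aligned lists of (head, label) pairs.

    With full_label=True the complete label must match; with
    full_label=False only the universal (main) part is compared.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("token lists must be non-empty and aligned")

    def normalize(label):
        return label if full_label else main_label(label)

    hits = sum(
        g_head == p_head and normalize(g_label) == normalize(p_label)
        for (g_head, g_label), (p_head, p_label) in zip(gold, predicted)
    )
    return 100.0 * hits / len(gold)


# Toy data: all heads are correct, but two language-specific
# extensions are wrong or missing in the prediction.
gold = [(2, "nmod:poss"), (3, "nsubj"), (0, "root"), (3, "dobj:lvc")]
predicted = [(2, "nmod:obl"), (3, "nsubj"), (0, "root"), (3, "dobj")]
las_full = labeled_attachment_score(gold, predicted, full_label=True)
las_main = labeled_attachment_score(gold, predicted, full_label=False)
```

Since matching full labels implies matching main labels, the main-label score is always at least as high as the full-label score for the same parser output.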
We made a manual error analysis of the results with regard to attachment (UAS) errors, i.e. we compared the outputs of the dependency parsers trained by using predicted universal codes and predicted MSD morphological codes, respectively. Results are presented in Table 2. We found that the benefits of the original language-specific annotation (MSD) manifest mostly in the treatment of subordinate clauses, adverbial modifiers and infinitival complements. These results might be explained by the fact that in certain cases, MSD contains more detailed grammatical information than the UM formalism. For instance, MSD encodes whether a conjunction connects clauses or words/phrases, information that is missing from UM. Also, higher results were achieved for cases when two nouns or adjectives were following each other and one of them modified the other (as in magas rangú képviselői "representatives of high standings"). However, sentences containing an overt or covert form of the copula could be parsed more effectively by using universal morphology codes.

The Added Value of Language-specific UD Labels
We also investigated the impact of the language-specific parts of the dependency labels. As the numbers in Table 1 show, slightly better results can be achieved both in terms of UAS and LAS when training the model with full labels than with main labels. This highlights the importance of adding language-specific distinctions to the universal ones because they may contain information that can be exploited during the tree decoding. They contribute even to unlabeled attachment decisions. To take an example, UD does not make any distinction among different types of nominal modifiers, treating them as nmod. However, for Hungarian, we applied extra labels such as nmod:poss for possessors (see Section 3.1) and nmod:obl for nominal arguments of the verb. The former should always be attached to the possessed noun, whereas the latter is attached to a verb (see also Examples 18 and 19 with the dative morphological case). Thus, the parser can learn these fine-grained distinctions, which might be beneficial for the unlabeled attachment scores as well.
Also, we would like to point out that the utilization of language-specific labels does not contradict the UD principles. In UD, each language should select the appropriate labels according to its needs but there is no need to apply all of the labels/features. General labels like nsubj or dobj will be used in most (maybe all) of the UD languages but there are other labels or feature-value pairs that are applicable for only a handful of languages. These are now called "language-specific" features but in principle, their status is not different from those that are more widely applied. So we believe that introducing "language-specific" additions does not harm the UD principles. Moreover, the chief objective of our experiments was to highlight the added value of language-specific features and we were able to show that they can even improve parsing accuracy when evaluated exclusively on the general labels. The main goal of UD is to provide a way in which parsing results across languages are comparable; hence, using language-specific features during decoding but evaluating only on general labels is in line with this comparison principle. Moreover, it indicates for UD treebank developers that, besides general labels, language-specific ones have to be taken seriously.

Conclusions
In this paper, the principles of universal dependencies and morphology for Hungarian were introduced by reporting the most challenging grammatical phenomena and our solutions to those. We converted and then manually corrected 1,800 sentences from the Szeged Treebank to universal dependency format and introduced experiments on this manually annotated corpus for evaluating automatic conversion and the added value of language-specific, i.e. non-universal, annotations. We would like to draw attention to the importance of understanding i) the information loss of the automatic UD converters; ii) the price of being constrained by universal morphology principles; and iii) the utility of exploiting language-specific dependency labels in UD.