Projecting Multiword Expression Resources on a Polish Treebank

Multiword expressions (MWEs) are linguistic objects containing two or more words and showing idiosyncratic behavior at different levels. Treebanks with annotated MWEs enable studies of such properties, as well as training and evaluation of MWE-aware parsers. However, few treebanks contain full-fledged MWE annotations. We show how this gap can be bridged in Polish by projecting 3 MWE resources on a constituency treebank.


Introduction
Multiword expressions (MWEs) are linguistic objects containing two or more words and showing idiosyncratic behavior at different linguistic levels. For instance, at the morphological level they can have restricted paradigms, e.g., in Polish (PL) zjadłbym konia z kopytami (lit. I would eat a horse with its hooves) 'I am very hungry' can only occur in the conditional mood. At the syntactic level they can: (i) exhibit defective agreement, e.g., in French (FR) in grands-mères 'grandmothers' the adjective does not agree with the noun in gender, unlike all regular adjectival modifiers, (ii) impose agreement constraints which do not apply to compositional structures, e.g., to have one's heart in one's mouth imposes agreement in person between both possessive pronouns and the subject, (iii) block some transformations typical of their structures, e.g., *the bucket was kicked by him, (iv) prohibit or require modifiers, e.g., (FR) germer dans le cerveau de quelqu'un (lit. to germinate in someone's brain) imposes a pronominal or nominal modifier of brain, etc. At the semantic level, MWEs show a varying degree of non-compositionality, e.g., to pull strings is semantically opaque but can be understood compositionally if the components themselves are interpreted in an idiomatic way (to pull as 'to use', and strings as 'one's influence').
Treebanks in which MWEs have been explicitly annotated are highly valuable resources that enable us to study such more or less unpredictable properties. They are also a basic prerequisite for training and evaluating parsers, which should ideally perform syntactic analysis jointly with MWE identification (Finkel and Manning, 2009; Green et al., 2013; Candito and Constant, 2014; Le Roux et al., 2014; Wehrli, 2014; Nasr et al., 2015; Constant and Nivre, 2016; Waszczuk et al., 2016).
Lexical MWE resources develop more rapidly than MWE-annotated treebanks (Losnegaard et al., 2016). They already exist for a large number of languages and are often distributed under open licenses. It is, thus, interesting to examine how far MWE lexicons can help in completing the existing treebanks with annotation layers dedicated to MWEs. Our case study deals with four Polish resources: (i) the named-entity annotation layer of a Polish reference corpus, (ii) an e-lexicon of nominal, adjectival and adverbial MWEs, (iii) a valence dictionary with a phraseological component, and (iv) a treebank with no initial MWE annotations. We show how the first three resources can be automatically projected on the fourth, by identifying syntactic nodes satisfying (totally or partly) the appropriate lexical and syntactic constraints.

Resources
The National Corpus of Polish (NCP) (Przepiórkowski et al., 2012) contains a manually double-annotated and adjudicated subcorpus of over 1 million words. Its named entity layer (NCP-NE), which builds on the morphosyntactic layer (relying in its turn on the segmentation layer), contains over 80,000 annotated NEs, 20% of which are MWNEs. Only the latter were used in the experiments described below. The annotation schema assumes notably the markup of nested, overlapping and discontinuous NEs, i.e., the annotation structures form trees (Savary et al., 2010).
SEJF (Czerepowicka and Savary, 2015) is a grammatical lexicon of Polish continuous MWEs containing over 4,700 compound nouns, adjectives and adverbs, where inflectional and word-order variation is described via fine-grained graph-based rules. It is provided in two forms: intensional (multiword lemmas and inflection rules) and extensional (list of morphologically annotated variants). The latter, generated automatically from the former, was used in our projection experiments. Tab. 1 shows a sample extensional entry containing a MWE inflected form, its lemma and morphological tag: noun (subst) in singular (sg) genitive (gen) and feminine gender (f).
Walenty is a large-scale Polish valence dictionary containing about 50,000, 3,700, 3,000, and 1,000 subcategorization frames (in its 2015 version) for Polish verbs, nouns, adjectives, and adverbs, respectively. Its encoding formalism is rather expressive and theory-neutral, and includes an elaborate phraseological component (Przepiórkowski et al., 2014). Thus, over 8,000 verbal frames contain lexicalized arguments of head verbs, i.e., they describe VMWEs. For instance, the idiom highlighted in example (1) is described in Walenty as shown in Tab. 2. Each component separated by a '+' represents one required verbal argument with its lexical, morphological, syntactic, and (sometimes) semantic constraints. Here, the subject is compulsory and has a structural case (subj{np(str)}), which notably means that it normally occurs in the nominative, but turns to the genitive when realized as a numeral phrase (of a certain type). The subject being a required argument in a verbal frame does not contradict the fact that it can regularly be omitted in Polish, as in (1). The second required argument is a direct object realized as a nominal phrase in structural case, i.e., normally in the accusative but turning to the genitive when the sentence is negated, as in (1). The lexicalized object's head has the lemma język 'tongue', should be in singular (sg) and does not admit modifiers (natr). The second complement is a prepositional nominal phrase (prepnp) headed by the preposition za 'behind' governing the instrumental case (inst) and a lexicalized non-modifiable (natr) noun with the lemma ząb 'tooth' in plural (pl). Walenty's syntax is compact and meant to be easily handled by lexicographers, but has proved sufficiently formalized to be directly applicable to NLP tasks, such as automatic generation of grammar rules (Patejuk, 2015).
Składnica is a Polish constituency treebank comprising about 9,000 sentences with manually disambiguated syntactic trees (Świdziński and Woliński, 2010).
It was created by automatically generating all possible parses with a large-coverage DCG grammar, and then manually selecting the correct parse. It does not contain MWE annotations. Its morphosyntactic tagset is mostly equivalent to the one used in Walenty, although it uses Polish terms: mian=mianownik 'nominative', dk=dokonany 'perfective aspect', etc. Fig. 1 shows the correct syntax tree from Składnica for example (1). Each non-terminal node includes a feature structure (FS). Here, the FS of the node fno (nominal phrase), above the terminal język 'tongue', is highlighted. It includes the feature neg=nie, meaning that this node occurs within the scope of a negated verb. This makes it easy to validate constraints from Walenty entries, such as the structural genitive of direct objects.
Figure 1: Syntax tree of example (1) in Składnica. The categories denote: ff 'finite phrase', fl 'adjunct', fno 'nominal phrase', formaczas 'verbal phrase', formaprzym 'adjectival phrase', formarzecz 'nominal phrase', fpm 'prepositional phrase', fpt 'adjectival phrase', fw 'required phrase', fwe 'verbal phrase', partykuła 'particle', przyimek 'preposition', wypowiedzenie 'utterance', zdanie 'sentence', znakkońca 'ending punctuation'. The feature structure of the fno node dominating the terminal język 'tongue' is highlighted. The feature codes include: przypadek 'case', rodzaj 'gender', liczba 'number', osoba 'person', rekcja 'case government', and neg 'negation'. The values denote: dop 'genitive', mnz 'masculine inanimate', poj 'singular', and nie 'negated'.
A notable feature of Składnica is that dependents of the verbs are explicitly marked as either arguments (fw) or adjuncts (fl), i.e., valency is accounted for. Note, however, that the valency of head verbs in VMWEs can differ from the one of the same verbs occurring as simple predicates.

Projection
Since Składnica contains no explicit MWE annotations, we produced them automatically by projecting NCP-NE, SEJF and Walenty on the syntax trees. The projection for NCP-NE was straightforward and did not require manual validation, since Składnica is a subcorpus of the NCP, whose NE annotation and adjudication were performed manually. The projection for SEJF and Walenty, which was followed by manual validation, consisted in searching for syntactic nodes satisfying all the lexical constraints and part of the syntactic constraints of a MWE entry. The required lexical nodes had to be contiguous for SEJF but not for Walenty.
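The contiguity requirement for SEJF can be illustrated with a minimal sketch. It is a simplification under stated assumptions: it matches extensional entries against plain token sequences rather than against syntactic nodes, and the entry used here (czarnej dziury, genitive singular of czarna dziura 'black hole') is a hypothetical illustration, not a quotation from SEJF. The function names build_index and project are likewise our own.

```python
from typing import Dict, List, Tuple

# (inflected form, lemma, morphological tag); a hypothetical extensional entry.
ENTRIES = [("czarnej dziury", "czarna dziura", "subst:sg:gen:f")]

Index = Dict[Tuple[str, ...], List[Tuple[str, str]]]


def build_index(entries: List[Tuple[str, str, str]]) -> Index:
    """Map each inflected form (as a tuple of lowercased tokens) to its entries."""
    index: Index = {}
    for form, lemma, tag in entries:
        key = tuple(form.lower().split())
        index.setdefault(key, []).append((lemma, tag))
    return index


def project(tokens: List[str], index: Index) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, lemma, tag) for every contiguous span matching an entry."""
    lowered = [token.lower() for token in tokens]
    max_len = max((len(key) for key in index), default=0)
    hits = []
    for start in range(len(lowered)):
        # MWEs have at least two components, so spans start at length 2.
        for length in range(2, max_len + 1):
            end = start + length
            if end > len(lowered):
                break
            for lemma, tag in index.get(tuple(lowered[start:end]), []):
                hits.append((start, end, lemma, tag))
    return hits


tokens = ["Zbliżył", "się", "do", "czarnej", "dziury", "."]
hits = project(tokens, build_index(ENTRIES))
```

Every hit produced this way is only a candidate, since a matching token sequence may still have a literal reading, which is why manual validation follows.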
Here, we give more details on the Walenty-to-Składnica projection, which was the most challenging one. It required defining correspondences at different levels. Explicit morphological values and phrase types could be translated rather straightforwardly due to largely compatible tagsets (np→fno 'nominal phrase', nom→mian 'nominative', etc.). Context-dependent values like str (structural case) were encoded in conditional statements taking combinations of features into account. For instance, the argument specification obj{np(str)} translated into a feature structure containing one of the following: [category = fno, przypadek = bier, neg = tak], [category = fno, przypadek = dop, neg = nie] (nominal object, either in the accusative in an affirmative sentence or in the genitive in a negative one).
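The conditional translation of obj{np(str)} can be sketched as a list of alternative feature structures, one of which must be subsumed by the node under inspection. This is a minimal sketch: the dictionary encoding and the function names matches and satisfies_any are assumptions made for illustration, and the sample node only reproduces the feature values named in the caption of Fig. 1.

```python
from typing import Dict, List

FeatureStructure = Dict[str, str]

# Alternatives for the structural-case object, as listed in the text:
# accusative in an affirmative sentence, genitive in a negated one.
OBJ_NP_STR: List[FeatureStructure] = [
    {"category": "fno", "przypadek": "bier", "neg": "tak"},
    {"category": "fno", "przypadek": "dop", "neg": "nie"},
]


def matches(node_fs: FeatureStructure, pattern: FeatureStructure) -> bool:
    """True if every feature required by the pattern has the same value in the node."""
    return all(node_fs.get(key) == value for key, value in pattern.items())


def satisfies_any(node_fs: FeatureStructure,
                  alternatives: List[FeatureStructure]) -> bool:
    """True if the node matches at least one of the alternative patterns."""
    return any(matches(node_fs, pattern) for pattern in alternatives)


# The fno node above 'język' in Fig. 1: genitive under negation.
jezyk_fno = {"category": "fno", "przypadek": "dop", "rodzaj": "mnz",
             "liczba": "poj", "neg": "nie"}

# A genitive object in an affirmative sentence is not licensed by str.
affirmative_genitive = {"category": "fno", "przypadek": "dop", "neg": "tak"}

accepted = satisfies_any(jezyk_fno, OBJ_NP_STR)
rejected = satisfies_any(affirmative_genitive, OBJ_NP_STR)
```

Keeping the alternatives as data rather than as nested conditions makes it straightforward to add further context-dependent values in the same way.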
Once these correspondences were defined, identifying a Walenty entry in Składnica consisted in checking if the current sentence contained a subtree in which: (i) the lexically constrained arguments and adjuncts (and their own, recursively embedded, lexically constrained dependents) were present, (ii) selected syntactic constraints (those concerning np and prepnp phrases) were fulfilled. For instance, in Fig. 1, a head verb, a direct object with a lexicalized head and a lexicalized prepositional complement were searched for, but an ellipsis of the subject was allowed.
Query language
The MWE projection task is handled by: (i) a query language, providing an interface between the MWE resources and the treebank, (ii) procedures for compiling lexicon entries into the queries, and (iii) an interpreter which runs a query over treebank subtrees to check whether the corresponding MWE entry occurs in them.
Formally, we defined our core query language using the following abstract syntax:
b (Booleans) ::= true | false
n (node queries) ::= […]
Thus, the properties of a given syntactic node or tree can be verified via an appropriate node query (NQ) or tree query (TQ), respectively. Both kinds of queries are recursive and TQs can additionally build on NQs. For instance, from the query interpretation point of view, the TQ root n is satisfied for a given tree iff its root satisfies the NQ n. Also, the TQ child t is satisfied iff at least one of its root's children trees satisfies the TQ t. Finally, particular feature values (category, przypadek, etc.) can be verified using the NQ satisfy (node → b), which takes an arbitrary node-level predicate (node → b) and tells whether it is satisfied over the current syntactic node.
The particularity of this query language is the mark construction, which marks a syntactic node as a part of a MWE. When a TQ t containing mark has been executed over a tree T , t's result contains all nodes matched with mark, provided that T satisfies all the constraints encoded in t.
Mark does not check any constraints by itself, but it can be easily combined with other NQs via query conjunction (i.e., n ∧ mark).
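The constructs described so far can be sketched as a small set of combinators. This is a minimal sketch under assumptions, not the interpreter used in the experiments: a tree is identified with its root node, a query returns None on failure and the list of marked nodes on success, and the names conj and disj (for ∧ and ∨) are our own. When several children satisfy a child query, the sketch keeps the markings of the first one, which is a design choice made here for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """A syntactic node; a tree is identified with its root node."""
    label: str
    features: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


# None means "not satisfied"; a list holds the nodes marked on success.
Result = Optional[List[Node]]
Query = Callable[[Node], Result]


def satisfy(pred: Callable[[Node], bool]) -> Query:
    """NQ: check an arbitrary node-level predicate; marks nothing."""
    return lambda node: [] if pred(node) else None


def mark(node: Node) -> Result:
    """NQ: always succeeds and marks the current node."""
    return [node]


def conj(*queries: Query) -> Query:
    """Conjunction: all queries must succeed, otherwise every marking is dropped."""
    def run(x: Node) -> Result:
        marked: List[Node] = []
        for query in queries:
            result = query(x)
            if result is None:
                return None
            marked.extend(result)
        return marked
    return run


def disj(*queries: Query) -> Query:
    """Disjunction: markings of a failed alternative never reach the result."""
    def run(x: Node) -> Result:
        for query in queries:
            result = query(x)
            if result is not None:
                return result
        return None
    return run


def root(node_query: Query) -> Query:
    """TQ: satisfied iff the root of the tree satisfies the node query."""
    return lambda tree: node_query(tree)


def child(tree_query: Query) -> Query:
    """TQ: satisfied iff at least one child tree satisfies the tree query."""
    def run(tree: Node) -> Result:
        for subtree in tree.children:
            result = tree_query(subtree)
            if result is not None:
                return result
        return None
    return run


leaf = Node("formarzecz", {"lemma": "język"})
tree = Node("fno", {"przypadek": "dop", "neg": "nie"}, [leaf])

is_jezyk = satisfy(lambda n: n.features.get("lemma") == "język")
is_genitive = satisfy(lambda n: n.features.get("przypadek") == "dop")
is_accusative = satisfy(lambda n: n.features.get("przypadek") == "bier")

# Genitive nominal phrase whose child is the leaf 'język', which gets marked.
query = conj(root(is_genitive), child(root(conj(is_jezyk, mark))))
marked = query(tree)

# The leaf is marked first, but the failing case check discards the marking.
failing = conj(child(root(conj(is_jezyk, mark))), root(is_accusative))
nothing = failing(tree)
```

Returning the markings as a value, rather than storing them on the nodes, means that a failed branch cannot leave stale marks behind.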
Note that, based on our core language, more complex queries can be expressed, for instance: […]
The query interpreter is defined over the core language only and handles MWE-related marking. For instance, given a query of type t1 ∨ t2, while evaluating t1, some subtree nodes may be marked as potential MWE components. But if t1 finally evaluates to false, all these markings are wiped out. This behavior is guaranteed by the implementation of the core disjunction (∨) operator.
Compiling MWE entries
Let us focus on the Walenty-to-query compilation and on the entry from Tab. 2 in particular. Its querified version checks that (i) the base form of the lexical head, reached via the head-annotated edges (marked in gray in Fig. 1), corresponds to the main verb of the entry (i.e., trzymać), and (ii) each of the lexically-constrained elements of the frame (i.e., noun phrase język and prepositional phrase za zębami) is realized by one of the children trees of the queried tree. Part (i) of the query is implemented by the version of the member query (see Eq. 2) restricted to head-annotated edges. Implementation of (ii) depends on the particular frame element. Tree queries corresponding to (i) and (ii) are then combined using the ∧ operator.
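The two-part check described above can be sketched as follows. This is a simplified illustration under several assumptions: the tree is a hand-built toy loosely modelled on Fig. 1 (its shape and head annotations are not taken from Składnica), each frame element is reduced to a list of lemmas that must all occur in the same child tree, and the names compile_entry, lexical_head and find_leaf are hypothetical. A fuller compilation would also check category, case and number.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    label: str
    lemma: Optional[str] = None   # base form, set on leaves only
    is_head: bool = False         # True if the edge from the parent is head-annotated
    children: List["Node"] = field(default_factory=list)


def lexical_head(tree: Node) -> Optional[Node]:
    """Follow head-annotated edges from the root down to a leaf."""
    node = tree
    while node.children:
        node = next((c for c in node.children if c.is_head), None)
        if node is None:
            return None
    return node


def find_leaf(tree: Node, lemma: str) -> Optional[Node]:
    """Return the first leaf of the tree carrying the given lemma, if any."""
    if not tree.children:
        return tree if tree.lemma == lemma else None
    for subtree in tree.children:
        hit = find_leaf(subtree, lemma)
        if hit is not None:
            return hit
    return None


def compile_entry(verb_lemma: str,
                  elements: List[List[str]]) -> Callable[[Node], Optional[List[Node]]]:
    """Build a query returning the marked nodes, or None if the entry does not match."""
    def query(tree: Node) -> Optional[List[Node]]:
        # Part (i): the lexical head must be the main verb of the entry.
        head = lexical_head(tree)
        if head is None or head.lemma != verb_lemma:
            return None
        marked = [head]
        # Part (ii): each lexicalized element must be realized by one child tree.
        for lemmas in elements:
            for subtree in tree.children:
                found = [find_leaf(subtree, lemma) for lemma in lemmas]
                if all(leaf is not None for leaf in found):
                    marked.extend(found)
                    break
            else:
                return None
        return marked
    return query


def leaf(label: str, lemma: str, is_head: bool = True) -> Node:
    return Node(label, lemma=lemma, is_head=is_head)


# Toy tree: verb, lexicalized object and lexicalized prepositional complement.
tree = Node("zdanie", children=[
    Node("ff", is_head=True, children=[leaf("formaczas", "trzymać")]),
    Node("fw", children=[
        Node("fno", is_head=True, children=[leaf("formarzecz", "język")])]),
    Node("fw", children=[
        Node("fpm", is_head=True, children=[
            leaf("przyimek", "za"),
            Node("fno", children=[leaf("formarzecz", "ząb")])])]),
])

query = compile_entry("trzymać", [["język"], ["za", "ząb"]])
marked = query(tree)
lemmas = [node.lemma for node in marked] if marked is not None else None
```

No subject is required by the sketch, which mirrors the decision to allow subject ellipsis in the projection.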
The obj{lex(np(str),sg,'język',natr)} frame element is also translated to a ∧-combined set of tree queries, which individually check that all the given restrictions are satisfied: the lexical head is język, the number is singular, etc. The node query which verifies that język is the lexical head is combined with mark, so that it is designated as a part of the resulting MWE annotation, provided that all the other entry-related constraints are also satisfied. Modifiers, if specified, are recursively compiled into tree queries which are then applied over children trees. Here, natr specifies that no modifiers are allowed, a constraint compiled into a query which checks that the corresponding tree is non-branching (i.e., has no other children apart from its head; in Fig. 1 this constraint is satisfied by the subtree rooted in the fno node above the leaf język). The other element of the frame, which describes the prepositional argument za zębami, is compiled into a query in a similar way.

Results
Notable errors in the projection procedure stem from allowing for the ellipsis of compulsory but […] (3)-(4). Thus, the precision of the SEJF/Walenty projection was equal to 0.85. The idiomaticity rate (El Maarouf and Oakes, 2015), i.e., the ratio of occurrences with idiomatic reading to all correctly recognized occurrences, is about 0.95. We expect that if NEs were taken into account, this ratio would be even higher, since NEs seem to exhibit compositional readings relatively rarely. Note also that false positives are much more frequent for entries stemming from Walenty than for those from SEJF, which shows the higher complexity of verbal MWEs as compared to other, continuous, MWEs.

Summary and Perspectives
The automatic projection of MWE resources on a treebank results in a manually validated resource containing over 2,000 VMWEs in about 9,000 constituency trees, available under the GPL v3 license. The results are represented in a simplified custom XML format, meant for easy use, e.g., in automatic grammar extraction. This format refers to identifiers of sentences and tokens in the Składnica trees, which enables users to automatically project annotations on the original treebank.
We believe we have shown examples of fine-grained and high-quality MWE resources which might be promoted as standards for the international community. Adapting their formalisms to many languages should be possible with affordable efforts (already undertaken by us for French). In return, relatively reliable mapping procedures based on such resources may help bridge the gap towards large and comprehensive MWE annotation in treebanks, which is currently a bottleneck in MWE-oriented research.
Another interesting finding, worth confirming in other languages, is the high idiomaticity rate of MWEs. It is a hint that automated MWE identification based on purely syntactic methods and rich resources may achieve high accuracy, even in the absence of semantic non-compositionality models.
Future work includes repeating the experiments with the new version of Walenty released in 2016, as well as estimating the projection recall. We also wish to enhance the lexicon projection process, so as to account for more fine-grained constraints, and tune the degree of flexibility in constraint validation. Finally, an appropriate MWE annotation schema is needed in which each MWE occurrence would be linked to its corresponding entry in a MWE lexicon, and its required arguments, whether lexicalized or not, would be marked.