Modeling MWEs in BTB-WN

The paper presents the characteristics of the predominant types of MultiWord expressions (MWEs) in the BulTreeBank WordNet – BTB-WN. Their distribution in BTB-WN is discussed with respect to the overall hierarchical organization of the lexical resource. Also, a catena-based modeling is proposed for handling the issues of lexical semantics of MWEs.


Introduction
In this paper we present the distribution and treatment of MultiWord Expressions (MWEs) within BTB-WN, a data-driven Bulgarian WordNet. 1 Currently BTB-WN contains about 22 000 synsets covering CoreWordNet synsets, all the content words within BulTreeBank (about 8 000 lemmas) and the top part of a frequency list over 70 million running words. For the purpose of this work we use two subsets: (1) the current version of BTB-WN; and (2) a subset mapped to the Bulgarian Wikipedia in order to establish a connection between the lexical information in BTB-WN and encyclopedic knowledge (Simov et al., 2019). The second set is used to evaluate the impact of the encyclopedic domain on the distribution of the MWEs. From the first subset 981 examples of MWEs have been extracted, and from the second one 506 examples.
In the past few years extensive literature has been dedicated to MWEs. In spite of that, there is no single guiding principle or widely accepted classification, since MWEs are not homogeneous and can be classified at different levels that interact in various ways: morphology, lexicology, syntax, and semantics. Also, the typology becomes more complex at a cross-language level due to the differing approaches to MWEs and differing language systems. For that reason we rely on the classification developed within WG 4 of PARSEME COST Action. This classification takes into account the part-of-speech of the MWE head, which in our view is suitable for the treatment of MWEs in wordnets. Thus, it categorises the MWEs into the following types: Nominal (Named Entities, NN compounds, other), Verbal (phrasal verbs, light verb constructions, VP idioms, other), Adjectival, Prepositional and Other.
* Laska Laskova and Petya Osenova are also affiliated with Sofia University "St. Kl. Ohridski", Faculty of Slavic Studies.
1 For more information on the creation and development of BTB-WN see (Osenova and Simov, 2018a).
We focus on modeling the compositionality of MWEs as reflected in their morphosyntactic and semantic properties. With respect to semantics we follow (Bentivogli and Pianta, 2004) in requiring the representation of both types of meaning: a) the meaning related to the whole MWE and b) the meanings related to its constituent words. Such an approach is especially important for cases in which the MWE also allows a fully compositional usage. For example, the classical MWE "kick the bucket" has an idiomatic meaning, but in an appropriate context it might also have a compositional (literal) usage. We differ from the above-mentioned authors in that we do not introduce a new relation, composed-of, for handling compositionality, but directly annotate the corresponding words within the MWE with their literal meanings. As a modelling device for these MWEs we extend the catena framework of (Osenova and Simov, 2018b), since this approach can handle the morphosyntactic behaviour as well as the compositionality issues stemming from semantics. The novelty here is the incorporation of the lexical meaning coming from the WordNet into the catena model.
The structure of the paper is as follows: the next section outlines the related work; Section 3 presents a classification of the MWEs in BTB-WN. Section 4 proposes an extension of the catena model that incorporates lexical semantics. Section 5 concludes the paper.

Related Work
Constant et al. (2017) elaborate on the diversity of the MWEs and the schemes for their categorization. The article lists the most commonly seen categories of MWEs: idioms; light-verb constructions; verb-particle constructions; noun and verb compounds; complex function words; multiword named entities and multiword terms. The authors note that these categories are non-exhaustive and can overlap. Recently, the work on identification of MWEs continued with a focus on Verbal MWEs. The 2018 edition of the shared task PARSEME (Ramisch et al., 2018) relied on enhanced and revised guidelines defining the following typology of verbal MWEs: light-verb constructions; verbal idioms; inherently reflexive verbs; verb-particle constructions; multi-verb constructions; inherently clitic verbs; inherently adpositional verbs. Here we do not go into such a detailed typology and rely instead on the more general verbal classification from the PARSEME WG 4 presented briefly above.
As already mentioned, our approach is similar to the one proposed in (Bentivogli and Pianta, 2004). They consider the addition of syntagmatic information to WordNet by providing co-occurrences of meanings within an MWE. In order to do this they relate each noun, verb, adjective or adverb in a given MWE to the appropriate synset via the new relation composed-of. In addition to MWEs the authors propose to include in WordNet Recurrent Free Phrases, which are completely compositional but have some additional features that distinguish them from arbitrary compositional phrases. These features might stem from additional knowledge carried by the phrase, or from statistically idiosyncratic patterns. The grouping of such phrases by their meaning is called a phraset. Phrasets are useful not only for providing co-occurrences of meanings for their constituent words, but also in multilingual settings, where they might fill lexical gaps. They are also useful in NLP tasks such as Word Sense Disambiguation, Machine Translation, etc. A similar approach to MWEs has been taken in the creation of the Basque WordNet (Agirre et al., 2006). As mentioned above, we do not introduce a new relation, but directly annotate the words in MWEs with the appropriate literal meanings. Furthermore, we do not restrict the annotation to the compositional parts of MWEs: whole MWEs are annotated as well. For the moment no phrasets have been added to BTB-WN, but we consider such a step a good direction for future development.
In a series of papers, including (Simov and Osenova, 2014) and (Osenova and Simov, 2018b), we presented the modeling of MWEs in terms of catena. These papers demonstrate how the (partial) variability and compositionality can be represented. The last paper reflects the multilingual application of the model. In our work here we extend this model to represent also the literal meanings of the distinct components in MWEs.

MWE types in BTB-WN
The classification we present preserves the general grouping of synsets into the four syntactic types which can be found in Princeton WordNet (PWN) and other wordnets alike: nominal, verbal, adjectival and adverbial ones. All prepositional MWEs are classified either as adjectival or as adverbial MWEs. It is worth noting that both types of phrases, PP and, less often, AdvP, can be modifiers or adjuncts depending on the context. For example, от първа ръка ("first-hand") can modify the verb знам ("know") (i.e. "I know something first-hand") and the noun информация ("information") (i.e. "I have first-hand information"), which denotes one of the components involved in the situation described by the verb.
We examine each of the four subsets for recurring syntactic patterns and evaluate them in terms of semantic compositionality, grammatical deviation (archaic, morphologically frozen forms included), and flexibility. The last is understood as a complex feature that takes into account morphological variation, word order permutation, and the possibility to modify the sub-units of an MWE.
It is assumed that MWEs exhibiting the degree of compositionality and flexibility typical for phrases generated ad hoc in discourse should still be included in the lexicon if they are associated with a particular type of genre or speech act, or are otherwise conventionalized (Calzolari et al., 2002). One such example is the terminological unit промяна на климата (change of climate, 'climate change'), which corresponds to the two MWE forms in the PWN synset {climate change, global climate change}.
A small number of the two- or three-word sequences extracted from BTB-WN appeared to be marginal for the MWE spectrum. They arose in the process of the bidirectional mapping of BTB-WN and PWN synsets as instances of periphrastic translation: whenever there is no word or MWE in the target language to express the concept, dictionaries offer descriptive phrases whose length and level of syntactic complexity may vary. Consider these two examples: the VP гледам гневно of the structural type V + Adv, 'look disapprovingly', which is mapped to the PWN synset {glower, glare}, and the four-word sequence казвам|изговарям буква по буква, lit. 'say|pronounce letter per letter', which is used to translate the English verb "spell". While the meaning of both Bulgarian expressions can be derived from the meaning of their sub-units, the latter is a collocation, and the former is not.
The distribution of the various structural types within BTB-WN shows a slightly bigger share of the patterns which do not have an N as a head, and the third most numerous group is that of the verbal MWEs (see Table 1). Only 2 of the 25 compound noun phrases have not been matched to a Wikipedia article. In contrast to English, the NN pattern in Bulgarian is not only rare, but it is also reserved for terms in which at least one of the sub-units has reduced semantic transparency, as in елен лопатар, "fallow deer", or is foreign, as in уеб страница, "web page".
In comparison, Table 2 presents the percentage of the different structural types of MWEs within the synsets mapped to Wikipedia articles in an initial attempt to enrich BTB-WN with encyclopedic knowledge. Not surprisingly, the three most numerous groups are all nominal, which stems from the fact that Wikipedia articles mainly cover general concepts and named entities (McCrae, 2018).
In the following subsections a more detailed description for each MWE type is provided.

Multiword Adverbials
With a few exceptions, all of the examined prepositional phrases are adverbial adjuncts corresponding to a Prep(ositional) head followed by a postmodifier N(oun) or Adv(erb); in some cases the second element is modified by another PP or Adj(ective) -see Table 3.
The opposite, however, is not true: some of the adverbial adjuncts follow different syntactic patterns which often, but not always, have phonetic, rhythmic and/or lexical repetition as their common denominator. This feature is related to the iconicity that reflects the meaning of the MWE (e.g. examples 9 and 10, and especially 8, where the two sub-units are nonsensical if not concatenated, which is to say that they do not have a lemma status on their own).
Example 9, Сегиз-тогиз (lit. "now-then"), and example 10, напред-назад (lit. "forth-back"), represent the result of a type of syntactic contraction in which the conjunction is omitted. Example 13 represents another typical syntactic transformation that accompanies the process [...] Table 3 illustrate two productive derivational models and, consequently, predictable multiword time and manner adverbial constructions, in which a fixed preposition (за, "for", or по, "on") is followed by an adjective which in turn has to be semantically and grammatically compatible with the elliptical head noun (време, "time", and начин, "manner", respectively). Neither the MWEs with a prepositional head, nor any of the adverbials show any degree of morphosyntactic variation. All of them have a fixed word order.

Multiword Adjectives
There are only three MWEs of the PP modifier type in BTB-WN: на високо равнище and на високо ниво, "top-level", and от първа ръка, "first-hand", which belong to two different synsets. The rest of the modifiers have a syntactic Adjective as their head (see Table 4), and it is the only sub-unit subject to morphological modification. Two cases are of particular interest: example 14, рохко сварен, "soft-boiled", and example 15, добре дошъл, "welcome". The former is an MWE with limited selective power, since it typically collocates with the neuter noun яйце, "egg". This narrows down the possible morphological realizations to two forms, рохко сварен-о 'soft-boiled-SG.N' or рохко сварен-и 'soft-boiled-PL'. The latter is usually used predicatively and refers to one or more persons. Thus its form depends on the gender and number of the referent(s). Again, the order of the adjectival MWE elements is fixed.

Multiword Verbs
The majority of verbal MWEs contain at least one reflexive se- or si-verb (отморявам си, отдъхна си, "relax"), or dative/accusative clitics. 93.83% of all verbal MWEs in BTB-WN are of this kind. Although they are often mapped to English phrasal verbs in translation (Kordoni and Simova, 2014), we do not consider reflexive verbs and verbs that include accusative or dative pronominal particles, such as унася ме, "doze off", or хрумва ми, "come to mind", as MWEs (for a different approach see (Ramisch et al., 2018)). Thus, there are only 124 multiword verbs per se. Among these 124 verbal phrases we distinguished several syntactic patterns, as illustrated in Table 5. Examples 18 and 19 illustrate the light verb construction, with правя, "make", and водя, "lead", as phrasal heads respectively. Another frequent light verb in BTB-WN is давам, "give". Typically, light verb MWEs are found in synsets with verbs that are derived from the nominal sub-unit, e.g. (правя) гаргар-а → гаргаря се (make a gargle, to gargle), or vice versa, e.g. кореспондирам → (водя) кореспонденция (to correspond, correspondence). In these cases the two synsets are {правя гаргара, гаргаря се} and {водя кореспонденция, кореспондирам}. Examples 20 and 24 belong to different structural types but they have one thing in common: a sub-unit that refers to a body part, accompanied by the reflexive possessive marker si. In example 20 the body part is 'mouth' while in example 24 it is 'nose'. Even when they are used in a sentence with a plural subject, the number of the noun element typically remains singular, e.g. Затваря-йте си уст-а.та!, 'Shut-IMP.2PL PTCL.REFL.POSS mouth-SG.DET'. The verbal MWEs might allow for adjectival modification of their noun elements.

Multiword Nouns
This type reflects predominantly named entities and specialized terminological units or everyday idiomatic phrases. Thus, they can be highly recursive in structure. In Table 6 [...] Examples 29-34 demonstrate patterns for geographical names. Here the patterns are more diverse structurally. The pattern 'adjective(s) plus noun' seems to be regular (examples 29 and 30). Also, the pattern 'noun plus (adjective) noun' (examples 31 and 32) and the pattern 'noun plus prepositional phrase' can be distinguished (example 33). Not surprisingly, there are some names that are opaque to the Bulgarian morphosyntax (example 34). From the point of view of the annotation with literal meanings the non-opaque cases require special attention because of the usage of common words in them. Components like "strip" (example 29), "dead" (example 30), and "new" (example 32) need to be annotated with the appropriate meanings. If we consider "New South Wales" and "New York", the adjective "new" needs to be annotated with two different meanings in the two cases: recently discovered and recently created.
6 Also in some other Slavonic languages.
Examples 35-37 illustrate the organization names. The observed patterns are: 'noun plus prepositional phrase' (example 35) and 'adjective(s) plus noun plus (prepositional phrase)' (examples 36 and 37). These names are included in BTB-WN because of the mapping to the PWN. Since the organization names could be quite complex, a special (chunk) grammar will be required to deal with them. The grammar would include rules for annotating the literal meanings of the MWE components.
In Table 7 [...] In examples 43 and 45 the head nouns ("mean" and "victory") determine the whole meaning of the phrases, but the meaning of the whole MWEs is not compositional because of the missing appropriate meanings for the adjectives. We do not want to include such meanings as separate synsets because of their limited distribution and thus the risk of introducing unnecessary ambiguity.
The presented examples in this section demonstrate a great diversity with respect to their morphosyntactic, syntactic and semantic characteristics. In the majority of the cases it seems that the literal meanings of the constituent words of the MWEs are transparent. This allows for an easy interpretation of the literal meanings within the appropriate context. Even when the MWEs are highly idiomatic, there might exist a context in which the speaker would refer to the literal meaning of the constituent words.
Let us recall that the notion of catena (chain) was initially introduced in (O'Grady, 1998) as a mechanism for representing the syntactic structure of idioms. He showed that this task requires a definition of syntactic patterns that does not coincide with constituents. He defined the catena in the following way: The words A, B, and C (order irrelevant) form a chain if and only if A immediately dominates B and C, or if and only if A immediately dominates B and B immediately dominates C. In our work here we convert MWEs into a representation previously defined in (Simov and Osenova, 2014), in which the catena is depicted as a dependency tree fragment with appropriate grammatical and semantic information. The variations of the MWEs are represented by underspecifying the corresponding features, including valency frames and non-canonical basic forms.
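The three-word definition quoted above can be read as a connectedness condition on the dependency tree: a set of words forms a catena when it is continuous with respect to immediate dominance. The following minimal Python sketch checks this condition for a set of any size, which is a generalization assumed here for illustration. The function name `is_catena`, the head-map encoding and the toy analysis of "kick the bucket" are hypothetical and are not part of the lexicon format described in this paper.

```python
from typing import Dict, Iterable, Optional


def is_catena(heads: Dict[str, Optional[str]], words: Iterable[str]) -> bool:
    """Return True if `words` form a connected fragment of the dependency tree.

    `heads` maps each word to its immediate head (None for the root).
    In a tree, a non-empty set of nodes is connected exactly when one and
    only one of its members has its head outside the set.
    """
    selected = set(words)
    if not selected:
        return False
    if not selected.issubset(heads):
        raise KeyError("unknown word in selection")
    # Members whose head is not itself selected are the "tops" of fragments.
    tops = [w for w in selected if heads[w] not in selected]
    return len(tops) == 1


# Hypothetical dependency analysis: kick -> bucket -> the
HEADS = {"kick": None, "bucket": "kick", "the": "bucket"}

print(is_catena(HEADS, ["kick", "bucket"]))  # verb plus its object
print(is_catena(HEADS, ["kick", "the"]))     # "bucket" is skipped
```

Under this assumed tree, the pair "kick" and "the" is rejected because the intervening head "bucket" is missing, whereas "kick" and "bucket" are accepted even though they are not adjacent in the surface string.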
The lexical entry uses the following format: a lexicon-catena (LC), semantics (SM) and valency (Frame). The lexicon-catena for the MWE is stored in its basic form. The realization of the catena in a sentence has to obey the rules of the grammar. In this way the possible word order is managed. The semantics of a lexical entry specifies the list of elementary predicates contributed by the lexical item. When the MWE allows for some modification (including adjunction) of its elements, e.g. modifiers of a noun, the lexical entry in the lexicon needs to specify the role of these modifiers. See, for example, the MWE represented in Fig. 1. The valency frame contains two alternative elements for the indirect object, introduced by two different prepositions. The two descriptions are alternatives because the verb has no more than one indirect object. If there is also a direct object, then the valency set will contain elements for it as well. The semantic contribution of the indirect object is specified for each valency element. This semantic contribution is added to the semantic contribution of the lexical entry when the valency element is realized. Grammatical features and lemmas are also represented in the dependency tree fragments. The catenae for the frame and for the whole lexical entry are unified on the basis of nodes with the same names.
In order to record the meaning of the whole MWE and the literal meanings of its constituent words, we extend the above lexical entry in the following way: The meaning of the whole MWE is recorded within the field SM as an additional item. When the predicate semantics is available (as in the example), it includes more than one predicate: one for the meaning of the MWE and one or more for the "assumed" arguments.
For the literal meanings of the constituent words we include a new field called constituent word literal meanings (LM). Fig. 2 provides an example of the new lexical entry for the MWE затварям си устата (close one's mouth-the) "shut up". For the mapping to the synsets we use the corresponding meanings from the Princeton WordNet 3.1: shut_up%2:30:00:: ("cause to be quiet or not talk"); shut%2:35:00:: ("move so that an opening or passage is obstructed; make shut"); and mouth%1:08:01:: ("the opening through which food is taken in and vocalizations emerge"). Thus, through the WordNet mappings both meanings, figurative and compositional, are provided.
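To make the extended entry more concrete, the sketch below encodes the four fields (LC, SM, Frame, LM) as a small Python data structure and fills it with the sense keys given above for затварям си устата. It is only an illustration under stated assumptions: the class and field names, the node names n1 to n3 and the attachment of си to the verb are hypothetical, the SM field is reduced to the sense key of the whole MWE, and the valency frame is left empty. The mapping of the first digit of a sense key to a part of speech follows the Princeton WordNet sense key format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# ss_type digit of a Princeton WordNet sense key -> part of speech.
SS_TYPE = {"1": "noun", "2": "verb", "3": "adjective",
           "4": "adverb", "5": "adjective satellite"}


def sense_key_pos(sense_key: str) -> str:
    """Read the part of speech from a sense key such as 'mouth%1:08:01::'."""
    _, _, rest = sense_key.partition("%")
    return SS_TYPE[rest.split(":")[0]]


@dataclass
class CatenaNode:
    name: str            # node name (hypothetical), used to link LC and LM
    lemma: str
    head: Optional[str]  # name of the governing node, None for the root


@dataclass
class LexicalEntry:
    lc: List[CatenaNode]                              # lexicon-catena (LC)
    sm: List[str]                                     # semantics (SM), simplified
    frame: List[str] = field(default_factory=list)    # valency (Frame), empty here
    lm: Dict[str, str] = field(default_factory=dict)  # literal meanings (LM)


ENTRY = LexicalEntry(
    lc=[
        CatenaNode("n1", "затварям", None),
        CatenaNode("n2", "си", "n1"),   # attachment assumed for illustration
        CatenaNode("n3", "уста", "n1"),
    ],
    sm=["shut_up%2:30:00::"],           # meaning of the whole MWE
    lm={"n1": "shut%2:35:00::", "n3": "mouth%1:08:01::"},
)


def literal_reading(entry: LexicalEntry) -> Dict[str, str]:
    """Map each annotated lemma to the sense key of its literal meaning."""
    lemma_of = {node.name: node.lemma for node in entry.lc}
    return {lemma_of[name]: key for name, key in entry.lm.items()}


for lemma, key in literal_reading(ENTRY).items():
    print(lemma, key, sense_key_pos(key))
```

Keeping LM separate from SM mirrors the distinction drawn in the text: the idiomatic meaning belongs to the entry as a whole, while the literal meanings are attached to individual nodes of the catena and can be consulted when a context triggers the compositional reading.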

Conclusions
In this paper we presented the typology and the characteristics of the MWEs in BTB-WN. Nearly 400 Bulgarian MWEs were encoded as lexical entries based on the catena model. This suggests that the model is feasible for modeling Bulgarian MWEs and potentially also for describing MWEs in other languages. The approach taken in this work reflects the intuition of the human annotators to assign literal meanings to the constituent elements in MWEs even when they are highly idiomatic. In our work so far no examples have been found in which one or more of the elements lack a literal meaning.
A more balanced and incremental view on compositionality has been introduced, since language is highly generative and might also provide contexts in which some of the literal meanings are triggered. An open question is the handling of ambiguity when the respective element has more than one literal meaning.