Using OntoLex-Lemon for Representing and Interlinking German Multiword Expressions in OdeNet and MMORPH

We describe work on porting two large German lexical resources to the OntoLex-Lemon model in order to establish complementary links between them. One resource is OdeNet (Open German WordNet); the other is a further development of the German version of the MMORPH morphological analyzer. We show how the Multiword Expressions (MWEs) contained in OdeNet can be morphologically specified using the lexical representation and linking features of OntoLex-Lemon, which also support the formulation of restrictions on the usage of such expressions.


Introduction
WordNets are well-established lexical resources with a wide range of applications. For more than twenty years they have been elaborately set up and maintained by hand, especially the original Princeton WordNet of English (PWN) (Fellbaum, 1998). In recent years, there has been increasing activity in which open WordNets for different languages have been automatically extracted from other resources and enriched with lexical semantics information, building the so-called Open Multilingual WordNet (Bond and Paik, 2012). These WordNets were linked to PWN via shared synsets. In this context, a German lexical semantics resource named Open German WordNet (OdeNet) is being developed with the aim of being included as the first open German WordNet in the Open Multilingual WordNet. This paper deals with the morphological enrichment of OdeNet, with a focus on complex OdeNet entries. The first morphological resource we are considering for this task is an updated German version of the MMORPH morphological analyzer (Petitpierre and Russell, 1995). Besides this resource, we have consulted the on-line editions of Duden and CanooNet, as well as entries in the German Wiktionary, for manually checking a few lexical features of both OdeNet and MMORPH.
As a means of representation we have adopted OntoLex-Lemon (Cimiano et al., 2016), as this model has been shown to be able to represent both classical lexicographic descriptions (McCrae et al., 2017) and lexical semantic networks such as WordNet (McCrae et al., 2014). OntoLex-Lemon is a further development of the "LExicon Model for Ontologies" (lemon). Guidelines for mapping Global WordNet formats onto lemon-based RDF have been published, and some WordNets have already been mapped onto lemon, as described for example in (McCrae et al., 2014).
Following the suggestion made in (Hüning and Schlücker, 2015), we consider MWEs to be "a general term that includes phenomena with different degrees of syntactic fixedness and semantic compositionality". This allows us to treat German compounds in a similar way to OdeNet "phrasal entries", so that the OdeNet entries "Rotkraut" (red kraut or red cabbage) and "rote Bete" (beetroot) can both be considered MWEs. The two terms are, however, associated with different morphological patterns: in the second case the German adjective "rote" also inflects in order to agree with the noun (for example in the genitive singular or in all plural forms), and its form further depends on whether a definite or an indefinite determiner precedes it.
In the next sections, we first give background information on OdeNet and on OntoLex-Lemon. We then describe the mapping of OdeNet to OntoLex-Lemon. We continue with an introduction to MMORPH, followed by a section that describes how the combined use of MMORPH and OntoLex-Lemon supports the linking of MWEs in OdeNet to full morphological descriptions.

OdeNet
An obvious candidate for representing German lexical semantics data in OntoLex-Lemon would be GermaNet, a carefully hand-crafted WordNet resource for German (Hamp et al., 1997). GermaNet has been developed for over 20 years and is very stable and precise. The problem with GermaNet is that it is not available under an open-source license, and this restricted license prevents it from being included in the aforementioned Open Multilingual WordNet. We therefore selected OdeNet as the German lexical semantics resource to work with, also with the aim of publishing the resulting data set as part of the Linguistic Linked Open Data cloud. OdeNet combines two existing resources: the OpenThesaurus German synonym lexicon and the English resource of the Open Multilingual WordNet (OMW), the Princeton WordNet of English (PWN) (Fellbaum, 1998). Integrating OpenThesaurus into OdeNet means making use of a large resource for German that is generated and updated by the crowd. A consequence of this approach is that OdeNet needs to be curated: automatically generated entries generally have a confidence score of "0.7", while manually curated entries get a score of "1.0".

OntoLex-Lemon
The OntoLex-Lemon model was originally developed with the aim of providing a rich linguistic grounding for ontologies, meaning that the natural language expressions used in the description of ontology elements are equipped with an extensive linguistic description. 15 This rich linguistic grounding includes the representation of morphological and syntactic properties of lexical entries as well as the syntax-semantics interface, i.e. the meaning of these lexical entries with respect to an ontology or to specialized vocabularies. The main organizing unit for those linguistic descriptions is the lexical entry, which enables the representation of morphological patterns for each entry (an MWE, a word or an affix). The connection of a lexical entry to an ontological entity is marked mainly by the denotes property or is mediated by the LexicalSense or the LexicalConcept classes, as represented in Figure 1, which displays the core module of the model.
As stated in Section 1, OntoLex-Lemon builds on and extends the lemon model. A major difference is that OntoLex-Lemon includes an explicit way to encode conceptual hierarchies, using the SKOS standard. 16 As can be seen in Figure 1, lexical entries can be linked, via the ontolex:evokes property, to such SKOS concepts, which can represent WordNet synsets. This structure parallels the relation between lexical entries and ontological resources, which is implemented either directly by the ontolex:denotes property or mediated by instances of the ontolex:LexicalSense class. 17 The ontolex:LexicalConcept class seems best suited to model the "sets of cognitive synonyms (synsets)" that Princeton WordNet (PWN) describes, while the ontolex:LexicalSense class is meant to represent the bridge between lexical entries and ontological entities (which do not necessarily have semantic relations between them).
15 See (McCrae et al., 2012), (Cimiano et al., 2016) and also https://www.w3.org/community/ontolex/wiki/Final_Model_Specification.
16 SKOS stands for "Simple Knowledge Organization System". SKOS provides "a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary" (https://www.w3.org/TR/skos-primer/).
17 Quoting from Section 3.6 "Lexical Concept" of https://www.w3.org/2016/05/ontolex/: "We [...] capture the fact that a certain lexical entry can be used to denote a certain ontological predicate. We capture this by saying that the lexical entry denotes the class or ontology element in question. However, sometimes we would like to express the fact that a certain lexical entry evokes a certain mental concept rather than that it refers to a class with a formal interpretation in some model. Thus, in lemon we introduce the class Lexical Concept that represents a mental abstraction, concept or unit of thought that can be lexicalized by a given collection of senses. A lexical concept is thus a subclass of skos:Concept."

Mapping OdeNet to OntoLex-Lemon
A main issue with the original, partly crowd-sourced data for OdeNet was that the crowd had added extra textual information or special characters to the headwords. In order to clean the data, we wrote a Python script that not only filters out noisy data but also maps certain GWN (Global WordNet) codes, such as part of speech (PoS), to the vocabularies used in OntoLex-Lemon, for example the LexInfo vocabulary for PoS and semantic relations. 19 At present, the OntoLex-Lemon encoding of OdeNet contains 120,012 lexical entries, the same number of lexical senses, and 36,192 synsets, which are encoded as instances of the class ontolex:LexicalConcept and included in a SKOS-based conceptual hierarchy that also supports the description of lexical semantic relations between synsets, such as synonymy, hyponymy, etc. It is interesting to note that 44,506 entries contain a blank and can therefore be considered Multiword Expressions.
If we add to this figure the 14,080 compound entries, 20 we see that approximately half of the lexical entries in the OntoLex-Lemon representation can be considered MWEs.
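The cleaning and mapping step described in this section can be sketched as follows. This is only a minimal illustration and not the script we actually used: the noisy headwords are invented, the short PoS codes in POS_TO_LEXINFO are an assumption about the source format, and the function names are hypothetical.

```python
import re

# Hypothetical mapping from short PoS codes to LexInfo terms; the codes
# used in the actual OdeNet source files may differ.
POS_TO_LEXINFO = {
    "n": "lexinfo:noun",
    "v": "lexinfo:verb",
    "a": "lexinfo:adjective",
}

def clean_headword(raw):
    """Strip bracketed remarks and stray symbols from a crowd-edited headword."""
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", raw)  # drop (...) and [...] remarks
    text = re.sub(r"[^\w\s\-']", " ", text)            # drop other special characters
    return " ".join(text.split())                      # normalise whitespace

def map_pos(code):
    """Return the LexInfo PoS term for a code, or None if the code is unknown."""
    return POS_TO_LEXINFO.get(code.strip().lower())

def is_phrasal_mwe(headword):
    """An entry whose written form contains a blank counts as a phrasal MWE."""
    return " " in headword

# Invented noisy input, for illustration only.
entries = [("Kernspaltung (Physik)", "n"), ("rote Bete", "n"), ("geistiges  Eigentum!", "n")]
cleaned = [(clean_headword(h), map_pos(p)) for h, p in entries]
mwe_count = sum(is_phrasal_mwe(h) for h, _ in cleaned)
```

Counting the entries for which is_phrasal_mwe holds corresponds to the "contains a blank" criterion used for the figure reported above; compounds written as one word are not detected by this test and need a separate decomposition step.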
The following listings give some details on the OntoLex-Lemon encoding of the first entry in OdeNet, "Kernspaltung" (nuclear fission). Listing 1 displays the full OntoLex-Lemon entry. One aspect that the reader can immediately notice is the possibility of representing the components of the compound word, which is encoded as an instance of the class ontolex:MultiWordExpression (which in OntoLex-Lemon marks any type of entry that can be segmented, thus including compounds). This possibility demonstrates one of the added values of linking synsets to the (complex) representation of lexical entries, as we can state (see below) semantic relations between the synsets associated with the components of a compound word and the synsets of the compound itself.
19 See https://www.lexinfo.net/ontology/2.0/lexinfo and also (Cimiano et al., 2011).
20 This figure was computed merely by comparison with the list of split nominal compounds offered by the GermaNet project on its web page: http://www.sfs.uni-tuebingen.de/GermaNet/documents/compounds/split_compounds_from_GermaNet13.0.txt. We expect to obtain a larger number of compounds by applying a decomposition algorithm, and not only to nominal entries.
Listing 2 below displays the form information associated with the w1 entry in Listing 1. In this code we can see how a sense can be linked to a synset via the property ontolex:isLexicalizedSenseOf, while the entry itself can be linked to the synset via the property ontolex:evokes, as displayed in Listing 1. The sense itself also links (ontolex:reference) to an ontological entity, here in the form of a Wikidata entry.
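The linking pattern just described (entry to sense, entry to synset, sense to synset, sense to ontological entity) can be summarised in a small sketch. The identifiers :w1_sense1 and :synset1 and the example.org URL are hypothetical placeholders and do not reproduce the actual OdeNet data; only the property names are taken from OntoLex-Lemon.

```python
# Hypothetical identifiers; they do not reproduce the actual OdeNet data.
ENTRY = ":w1"
SENSE = ":w1_sense1"
SYNSET = ":synset1"
# Placeholder standing in for the URL of a Wikidata entry.
REFERENCE = "<http://example.org/ontology/NuclearFission>"

def linking_triples(entry, sense, synset, reference):
    """Return the four triples that tie an entry, its sense and its synset together."""
    return [
        (entry, "ontolex:sense", sense),                  # entry has a sense
        (entry, "ontolex:evokes", synset),                # entry evokes the synset
        (sense, "ontolex:isLexicalizedSenseOf", synset),  # sense lexicalizes the synset
        (sense, "ontolex:reference", reference),          # sense refers to an ontology entity
    ]

def to_turtle(triples):
    """Serialise triples as simple one-line Turtle statements (prefixes omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

triples = linking_triples(ENTRY, SENSE, SYNSET, REFERENCE)
print(to_turtle(triples))
```

Prefix declarations are left out for brevity; a complete Turtle file would have to declare the ontolex prefix and a default namespace.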
Listing 4 displays the representation of the synset associated with both the w1 lexical entry and the w1 1-n sense. There we can also see that this lexical concept (synset) is also "evoked" by other entries/senses, for example by the entries for "Kernfission" or "Atomspaltung", which are synonyms of "Kernspaltung". The lexinfo:hypernym property provides the information on the semantic relation this synset has to another synset. Finally, in Listing 5 we display the "entries" for the components of the compound word "Kernspaltung". Those components point to the lexical entries they are related to: the entry :entry w23527, for example, is the one corresponding to the noun "Spaltung" (split, fission, separation, cleavage, etc.), which again has its own senses and associated synsets. We can here disambiguate the meaning of "Spaltung" as used in the compound as being that of "fission", and the whole compound can then be considered a hyponym of the synset for "fission". In Listing 1 above, we can see the information on the order in which those components occur in this entry. Naturally, those component "entries" can be re-used separately for other compounds, for example for "Atomspaltung". In this way we can collect all the corresponding meanings of a word, including when it is used in compounds and depending on its position in the compound. Details on the decomposition module of OntoLex-Lemon are shown in Figure 2.
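The re-use of component entries across compounds, together with their position, can be illustrated with a small index. This is a sketch under simplifying assumptions: components are identified by their lemma only, and the two-part segmentation of the compounds is hard-coded and not produced by a decomposition algorithm.

```python
from collections import defaultdict

# Each compound lists its components in order. The segmentations are
# hard-coded for illustration; they are not the output of a decomposition tool.
COMPOUNDS = {
    "Kernspaltung": ["Kern", "Spaltung"],
    "Atomspaltung": ["Atom", "Spaltung"],
}

def index_components(compounds):
    """Map each component lemma to the (compound, position) pairs it occurs in."""
    index = defaultdict(list)
    for compound, parts in compounds.items():
        for position, part in enumerate(parts, start=1):
            index[part].append((compound, position))
    return dict(index)

index = index_components(COMPOUNDS)
# index["Spaltung"] lists both compounds, each with "Spaltung" in second position.
```

With such an index, the senses that a word like "Spaltung" takes on inside compounds can be collected per position, which is the kind of information the component "entries" make available.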
In this section we described the current state of the OntoLex-Lemon representation of the data found in the OdeNet resource. We also touched on the possible use of OntoLex-Lemon for bridging WordNet-like resources and full lexical descriptions, concentrating on German compound nouns. In the next section we present the morphological resource we mapped onto OntoLex-Lemon in order to be able to link OdeNet elements to a full morphological description.

MMORPH
As mentioned in Section 1, we work with an updated German version of MMORPH (Petitpierre and Russell, 1995), which also covers English, Spanish, French and Italian morphology. Our German version of MMORPH contains over 2,630,000 full forms and has specifically improved the coverage of compounds compared to the original German version of MMORPH. MMORPH presents its data in a well-structured fashion, as the (simplified) example for the noun "Kernspaltung" (nuclear fission) below demonstrates:
Listing 6: The MMORPH entry for Kernspaltung
"kernspaltung" = "kernspaltung" Noun[gender=fem number=singular case=nom|gen|dat|acc]
"kernspaltungen" = "kernspaltung" Noun[gender=fem number=plural case=nom|gen|dat|acc]
We wrote a script to transform the MMORPH data into OntoLex-Lemon, in its Turtle syntax serialization.
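A transformation of this kind can be sketched as follows. This is not our actual script: it assumes that each MMORPH line has the shape "form" = "lemma" Pos[feature=value ...] with alternative values separated by "|", and the correspondence between MMORPH abbreviations and LexInfo terms in VALUE_MAP is an assumption that would have to be checked against the LexInfo vocabulary.

```python
import re

# Assumed line shape: "form" = "lemma" Pos[feature=value feature=v1|v2 ...]
LINE_RE = re.compile(r'"([^"]+)"\s*=\s*"([^"]+)"\s*(\w+)\s*\[([^\]]*)\]')

# Assumed correspondence between MMORPH abbreviations and LexInfo terms.
VALUE_MAP = {
    "fem": "feminine", "singular": "singular", "plural": "plural",
    "nom": "nominativeCase", "gen": "genitiveCase",
    "dat": "dativeCase", "acc": "accusativeCase",
}

def parse_mmorph_line(line):
    """Parse one full-form line into a dict, or return None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    form, lemma, pos, feats = match.groups()
    features = {}
    for pair in feats.split():
        name, _, value = pair.partition("=")
        features[name] = value.split("|")  # e.g. case -> ["nom", "gen", "dat", "acc"]
    return {"form": form, "lemma": lemma, "pos": pos, "features": features}

def form_to_turtle(parsed, form_id):
    """Serialise one parsed full form as an ontolex:Form in Turtle (prefixes omitted)."""
    lines = [
        f":{form_id} a ontolex:Form ;",
        f'    ontolex:writtenRep "{parsed["form"]}"@de ;',
    ]
    for name, values in parsed["features"].items():
        objects = ", ".join(f"lexinfo:{VALUE_MAP.get(v, v)}" for v in values)
        lines.append(f"    lexinfo:{name} {objects} ;")
    lines[-1] = lines[-1][:-1] + "."  # close the statement: final ";" becomes "."
    return "\n".join(lines)

parsed = parse_mmorph_line(
    '"kernspaltungen" = "kernspaltung" Noun[gender=fem number=plural case=nom|gen|dat|acc]'
)
print(form_to_turtle(parsed, "form_kernspaltungen"))
```

The identifier form_kernspaltungen is hypothetical. Grouping all forms that share a lemma under one lexical entry would be a further step that this sketch does not cover.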

Linking the OdeNet Resource to the MMORPH Resource
We see the use of OntoLex-Lemon for representing WordNets not only as a chance to port information from one format to another (with the possibility of publishing WordNets in the Linguistic Linked Open Data cloud), but also as an opportunity to extend the coverage of WordNet descriptions to more complex lexical phenomena, beyond lemma and PoS considerations. One case we have been studying concerns the morphological specification of MWEs included in OdeNet.
As we have seen, OdeNet contains a significant number of MWEs, either compounds or "phrasal entries", for example "Rotkohl" (purple cabbage or red cabbage), "Rotkraut" (red kraut or red cabbage), "rote Bete" (beetroot), or "geistiges Eigentum" (intellectual property). A note on those examples: while "Rotkraut" and "Rotkohl" essentially refer to the same vegetable, 23 the word "Rotkraut" is typically used only in its singular form. The same holds for the MWE "geistiges Eigentum". 25 There is no way in the original OdeNet (or, in general, in PWN or other WordNets) to explicitly formulate the restriction that an entry can be used only in the singular. It is possible in PWN, though, to obtain the information that a concept is lexicalized only by a plural form, simply by querying for a plural form, for example "peoples". If this plural form is not reduced exclusively to its lemma, then a synset for it will be returned, together with the synsets for the singular form, as can be seen in the following listing, where the plural form is highlighted:
Listing 7: The Synsets for "people" vs. "peoples"
It should be noted, however, that when the user queries for "people", the synset for the plural form "peoples" is not displayed.
23 Wiktionary indicates a plural for "Rotkohl" (https://de.wiktionary.org/wiki/Rotkohl).
25 It is interesting to note that neither Duden nor CanooNet has an entry for MWEs such as "rote Bete" or "geistiges Eigentum", whereas Wiktionary and OdeNet include such MWEs as entries. We presume that one lexicographic tradition is committed to including only words (also those resulting from word formation processes) as entries, while the other dictionary tradition is more closely oriented towards the description of meanings.
The OntoLex-Lemon representation of the German adjective "rot" (red) displayed in Listing 8 is introduced to give an idea of the complexity of the inflectional variants of a German adjective, even though we do not include the form variants that are conditioned by the preceding use of a definite or an indefinite determiner. Even if we can limit the number of forms for the noun "Bete", we have to combine them with the possible forms of "rot", and also consider the possible use of a preceding indefinite or definite determiner. This gives us 32 forms to be considered, compared to a maximum of 8 different forms if we deal with a nominal compound like "Rotkohl". In fact, the OntoLex-Lemon linking mechanisms allow us to specify only the "positive" adjectival forms, as "rot" cannot appear in this MWE as a comparative or superlative.
We have a similar situation with the entry "geistiges Eigentum" (intellectual property) 28 in OdeNet, but with a further restriction: this concept of intellectual property can be used only in the singular form. We select from the list of 24 possible forms of the adjective only the "positive" ones (see Listing 10). The OntoLex-Lemon representation of the MMORPH entry for the noun "Eigentum", with the links to the associated forms (which are not displayed here), is shown in Listing 11.
Listing 11: The MMORPH noun Eigentum (property) with the corresponding form variants
One possibility would be to link the OdeNet entry "geistiges Eigentum" to the relevant forms displayed in both Listings 10 and 11, with additional information on the word order and on the fact that only the singular forms can be selected. This can be ensured by the use of the ontolex:usage property. This solution has the advantage that we do not have to introduce into the MMORPH representation the phrasal MWE corresponding to the OdeNet "geistiges Eigentum", but at the price of introducing some rules, such as the ordering of the words, an agreement rule, or the specific restriction that an OdeNet entry has only singular forms for its lexicalization.
27 The number of forms can be reduced, as all forms in the singular have the same ending, and likewise for the plural, so that we do not need to list all the different grammatical cases.
28 We note that "intellectual property" is also an MWE entry in PWN.
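The rules mentioned for this first possibility (word order, agreement, and the singular-only restriction) can be made concrete with a small sketch. It is heavily simplified: case, gender and the determiner-conditioned adjective variants are ignored, the three adjective forms and two noun forms are a hand-picked illustrative subset and not the MMORPH data, and the function names are hypothetical.

```python
# Illustrative subset of forms; case, gender and determiner-conditioned
# variants are deliberately left out.
ADJ_FORMS = [
    {"form": "geistiges", "degree": "positive", "number": "singular"},
    {"form": "geistige", "degree": "positive", "number": "plural"},
    {"form": "geistigeres", "degree": "comparative", "number": "singular"},
]
NOUN_FORMS = [
    {"form": "Eigentum", "number": "singular"},
    {"form": "Eigentume", "number": "plural"},
]

def select_forms(forms, **required):
    """Keep only the forms whose features match all required feature values."""
    return [f for f in forms if all(f.get(k) == v for k, v in required.items())]

def mwe_forms(adj_forms, noun_forms, number="singular"):
    """Combine positive adjective forms with noun forms that agree in number.

    The adjective is always placed before the noun (ordering rule).
    """
    adjectives = select_forms(adj_forms, degree="positive", number=number)
    nouns = select_forms(noun_forms, number=number)
    return [f'{a["form"]} {n["form"]}' for a in adjectives for n in nouns]

singular_only = mwe_forms(ADJ_FORMS, NOUN_FORMS)
```

The three rules appear here as explicit code: the filter on degree, the shared number argument, and the fixed adjective-noun order. This is the kind of additional machinery that the first possibility requires on top of the two linked resources.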
The other possibility is to introduce an entry for the OdeNet MWE and the corresponding forms, as shown in Listing 12. In this example we included the decomposition information, including the ordering of the components. We also included the line ontolex:usage lexinfo:singular, which can be considered redundant, as we have already selected as ontolex:otherForm all the singular forms, thus discarding the (possible) plural forms.
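What such an explicit MWE entry might look like can be sketched by generating its Turtle serialization. All identifiers (mwe_geistiges_eigentum, comp_geistig, comp_eigentum, the form names) are hypothetical, prefix declarations are omitted, and encoding the component order with rdf:_1, rdf:_2 next to decomp:constituent is an assumption based on the OntoLex-Lemon decomposition module rather than a reproduction of our listing.

```python
def mwe_entry_turtle(entry_id, components, form_ids, usage=None):
    """Build a Turtle description of an MWE entry with ordered components.

    components: component identifiers in their surface order.
    form_ids:   first item is the canonical form, the rest are other forms.
    usage:      optional usage restriction, e.g. "lexinfo:singular".
    """
    lines = [
        f":{entry_id} a ontolex:MultiWordExpression ;",
        f"    ontolex:canonicalForm :{form_ids[0]} ;",
    ]
    for form_id in form_ids[1:]:
        lines.append(f"    ontolex:otherForm :{form_id} ;")
    for position, component in enumerate(components, start=1):
        lines.append(f"    decomp:constituent :{component} ;")
        lines.append(f"    rdf:_{position} :{component} ;")  # position in the MWE
    if usage is not None:
        lines.append(f"    ontolex:usage {usage} ;")
    lines[-1] = lines[-1][:-1] + "."  # close the statement: final ";" becomes "."
    return "\n".join(lines)

entry = mwe_entry_turtle(
    "mwe_geistiges_eigentum",
    ["comp_geistig", "comp_eigentum"],
    ["form_nom_sg", "form_gen_sg"],
    usage="lexinfo:singular",
)
print(entry)
```

Leaving out the usage argument yields the same entry without the restriction line, which mirrors the observation that the restriction is redundant once only singular forms are listed.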
As a purely morphological information source, MMORPH does not have any sense or synset associated with its entries. Linking them to the OdeNet resources thus adds a conceptual view to the MMORPH data. Additionally, one can add reference information by querying DBpedia 29 or Wikidata. 30 This can be done very easily by adding a line with the ontolex:denotes property and the corresponding URL pointing to an ontological reference.

Conclusion
We described our current work on porting a recently developed German WordNet-compliant lexical resource, OdeNet, to OntoLex-Lemon, in order to support its publication in the Linguistic Linked Open Data cloud. While processing those data, we noticed that OntoLex-Lemon can be used for bridging WordNet-type lexical resources to a full description of lexical entries, leading to an extension of the coverage of WordNets beyond the consideration of lemmas and PoS information. In order to test our intuition, we ported an updated version of the German MMORPH morphological analyzer to OntoLex-Lemon and established links between the two new OntoLex-Lemon data sets. We documented our interlinking work with the example of the full morphological representation of components of German compounds and MWEs used in OdeNet, including the expression of usage restrictions.

Acknowledgments
Contributions by Thierry Declerck have been supported in part by the H2020 project "ELEXIS" with Grant Agreement number 731015 and by the H2020 project "Prêt-à-LLOD" with Grant Agreement number 825182.
29 See https://wiki.dbpedia.org/. "DBpedia ... is a project aiming to extract structured content from the information created in the Wikipedia project", quoted from https://en.wikipedia.org/wiki/DBpedia.
30 "Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.", quoted from https://www.wikidata.org/wiki/Wikidata:Main_Page.