Augmenting a German Morphological Database by Data-Intense Methods

This paper deals with the automatic enhancement of a new German morphological database. While there are some databases for flat word segmentation, this is the first available resource which can be directly used for deep parsing of German words. We combine the entries of this morphological database with the morphological tools SMOR and Moremorph and with a context-based evaluation method which builds on a large Wikipedia corpus. We describe the state of the art and the essential characteristics of the database and the context method. The approach is tested on an inflight magazine of Lufthansa. We derive over 5,000 new instances of complex words. The coverage for the lemma types reaches over 99 percent. The precision of the newly found complex splits and monomorphemes lies between 0.93 and 0.99.


Introduction
German is a language with complex processes of word formation, of which the most common are compounding and derivation. Segmentation and analysis of the resulting word forms are challenging, as spelling conventions do not permit spaces as indicators for constituent boundaries, as in (1).
(1) Verkehrsamt 'tourist office'

For long orthographical word forms, many combinatorially possible analyses exist, though usually only one of them has a conventionalized meaning (see Figure 1). For instance, for Verkehrsamt 'traffic office, tourist office', word segmentation tools can yield a wrong split, namely the one with the smaller number of word tokens: Verkehr 'traffic' and Samt 'velvet'. In this case, there is a linking element within the word form which could be wrongly interpreted as part of a morph. Such elements function as morphophonological structure markers. German compounds can consist of derivatives, or compounds can be subject to further derivation. In (1), Verkehr is the result of a conversion process from verkehren 'to run, to fly', which again consists of a prefix and a verb stem (see Figure 2). On each level of morphological segmentation, the number of possible analyses is 2^n. This number can be reduced by excluding implausible constructions such as suffixes at the beginning of a construct. On the other hand, it has to be multiplied by the number of homonyms for the segmented forms. Therefore, automatic segmentations with more than ten possible analyses for one word are not rare.
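The combinatorial growth can be illustrated with a small sketch (our own illustration, not part of any of the tools discussed here): a word with n internal boundary positions admits 2^n candidate segmentations before any linguistic filtering.

```python
from itertools import combinations

def all_segmentations(word):
    """Enumerate every way to cut a word at its internal positions.

    With n internal boundaries there are 2**n candidate segmentations;
    homonymy of the resulting segments multiplies this number further.
    """
    n = len(word) - 1  # internal boundary positions
    for k in range(n + 1):
        for cuts in combinations(range(1, len(word)), k):
            parts, prev = [], 0
            for cut in cuts:
                parts.append(word[prev:cut])
                prev = cut
            parts.append(word[prev:])
            yield parts

# 'Verkehrsamt' has 10 internal positions, hence 2**10 = 1024 candidates,
# among them the correct ['Verkehr', 's', 'amt'] and the misleading
# ['Verkehr', 'samt'].
candidates = list(all_segmentations("Verkehrsamt"))
print(len(candidates))  # 1024
```

In practice, almost all of these candidates are discarded immediately because their parts are not attested morphs; the remaining ambiguity is what the methods below address.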
However, finding the correct segmentations and morphological structures is essential for terminologies and translation (memory) tools, for information retrieval, and as input for textual analyses. Deep parsing of complex morphological structures yields disambiguations such as (Fremde|n|verkehr)|s|amt 'tourism office' instead of the tautological interpretation Fremde|n|(Verkehr|s|amt) 'foreigner tourist office'. Such analyses can help improve the quality of translation and retrieval tasks.
Moreover, counts of morphs and morphological structures are useful for inducing hypotheses about statistical tendencies and quantitative laws, e.g. Menzerath's law (Cramer, 2005) or the Principle of Early Immediate Constituents (Hoffmann, 1999), which has not yet been corroborated for the word level by statistical tests.
In this paper, we apply a hybrid approach for finding the correct splits of words and augmenting a morphological database. In Section 2, we provide a concise overview of previous work in word segmentation and word parsing for German. In Section 3, we introduce two linguistic tools we will be using later. SMOR is a well-known morphological tool; we describe how we modified its lexicon and how we exploited and adjusted its internal results with the add-on module Moremorph. Section 4 introduces our morphological database, which was built on the basis of the linguistic databases CELEX and GermaNet. Section 5 describes the data-intense procedures for the morphological analyses and supervised database enhancements. (The complete structure of Fremdenverkehrsamt 'tourism traffic office, tourist office' is represented in Figure 4.) In Section 6, we test our method on a corpus of an inflight journal. Finally, we discuss our results and give an outlook on future developments.

Related Work
The first developments in morphological segmentation tools for German date back to the 1990s. Most of them are based on finite-state machines. Gertwol (Haapalainen and Majorin, 1995), MORPH (Hanrieder, 1996), Morphy (Lezius, 1996; Lezius et al., 1998) and later SMOR (Schmid et al., 2004) and TAGH (Geyken and Hanneforth, 2006) generate morphological analyses for complex German words, yielding results for derivatives and compounds. All these analyses are flat word splittings and often include dozens of segmentation versions.
There are different ways to tackle this kind of ambiguity, most of which are applied only to compounds and yield flat segmentations on the immediate-constituent level. Cap (2014) and Koehn and Knight (2003) use ranking scores, such as the geometric mean of constituent frequencies, for the different morphological analyses and then choose the segmentation with the highest ranking.
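Such frequency-based ranking can be sketched as follows. The frequencies are invented for illustration only; real splitters additionally handle linking elements and restrict the candidates to splits whose concatenation actually matches the word.

```python
from math import prod

def geometric_mean_score(parts, freq):
    """Geometric mean of the constituents' corpus frequencies,
    the ranking score used by Koehn and Knight (2003)."""
    return prod(freq.get(p, 0) for p in parts) ** (1.0 / len(parts))

def best_split(candidates, freq):
    """Choose the candidate segmentation with the highest score."""
    return max(candidates, key=lambda parts: geometric_mean_score(parts, freq))

# Toy frequencies, invented for illustration only.
freq = {"Verkehr": 5000, "Amt": 3000, "Samt": 40, "Verkehrsamt": 80}
candidates = [["Verkehrsamt"], ["Verkehr", "Samt"], ["Verkehr", "Amt"]]
print(best_split(candidates, freq))  # ['Verkehr', 'Amt']
```

Note that the geometric mean lets the rare constituent Samt drag down the score of the erroneous split, which is exactly the intended effect of this ranking.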
Another approach consists in exploiting the sequence of letters, e.g. by pattern matching with tokens (Henrich and Hinrichs, 2011, 422) or lemmas (Weller-Di Marco, 2017). Ziering and van der Plas (2016) use normalization methods which are combined with ranking by the geometric mean. Ma et al. (2016) apply Conditional Random Field modeling to letter sequences. Daiber et al. (2015) extract candidates of compound splits by string comparisons with corpus data.
Recent approaches exploit semantic information for the ranking of compound splittings. Riedl and Biemann (2016) utilize look-ups of similar terms inside a distributional thesaurus; their ranking score is a modification of the geometric mean. Another line of work uses the cosine as a measure of semantic similarity between compounds and their hypothetical constituents and combines these similarity values by computing geometric means and other scores for each produced split. The scores are then used as factors to be multiplied with the scores of former splits.
One of the few approaches tackling deep morphological analyses considers left-branching compounds consisting of three lexemes. Its distributional semantic modelling often fails to find the correct binary split if the head is too ambiguous to correlate strongly with the first part. But in general, using the semantic context is a sensible disambiguation method. Ziering and van der Plas (2016) develop a splitter which makes use of normalization methods and can be applied recursively by reanalyzing the results of splits. Their evaluation is based on the binary compounds of GermaNet (Hamp and Feldweg, 1997; Henrich and Hinrichs, 2011). Würzner and Hanneforth (2013) use a probabilistic context-free grammar for full morphological parsing, but restrict their approach to derivational adjectives.
Most of these approaches build upon corpus data. Only Henrich and Hinrichs (2011) enrich the output of morphological segmentation with information from the annotated compounds of GermaNet in order to disambiguate such structures. This can, in a further step, yield hierarchical structures, but it presupposes that the entries for the components exist inside the database. Steiner and Ruppenhofer (2018) build on this idea to derive more complex morphological structures from lexical resources. In Section 5, we come back to this and exploit their resource.

SMOR: A Morphological Tool for German and its Add-On Moremorph

SMOR
SMOR is a widely used morphological segmentation tool (e.g. Cap (2014), Henrich and Hinrichs (2011), Steiner and Ruppenhofer (2015)). It is based on two-level morphology (Koskenniemi, 1984) and implemented as a set of finite-state transducers. For German, a large set of lexicons is available. These lexicons contain information about inflection, parts of speech and classes of word formation, e.g. abbreviations and truncations. The tag set used is compatible with the STTS (Stuttgart-Tübingen tag set, Schiller et al. (1995)). SMOR produces different levels of granularity and different representation formats with different transducers and options. Examples (2) and (3) show two simplified outputs of fine-grained analyses for Verkehrsamt 'traffic office, tourist office' and Fremdenverkehrsamt 'foreign-traffic office, tourist office'. For the sake of simplicity, we removed case and number.
(2) Verkehr<NN>Samt<+NN>
    Verkehr<NN>Amt<+NN>
    ver<VPREF>kehren<V>Samt<+NN>

In (2), the word form Verkehrsamt 'tourist office' is analyzed in three different ways, of which two show the erroneous interpretation of the string samt 'velvet' as a noun. (3) shows the same error in three of its five segmentations. The categories consist of parts of speech (<NN>, <V>) for free morphs and the position of bound morphemes (e.g. <VPREF> for 'verbal prefix').

Moremorph
While SMOR is a reliable foundation for the analysis of word forms which have not been found before, it comes with some small drawbacks. Moremorph aims at improving and adjusting the output of SMOR.
As can be seen from the second line of (2), the SMOR output does not indicate if there are filler letters (or interfixes) inside a word.
However, the information exists inherently in intermediate SMOR output, which can be reanalyzed by Moremorph. Therefore, filler letters (FL) can be marked as in (4):

(4) Verkehr s Amt
    NN FL NN <NN>

This annotation shows the morphs on the lexical level, their classes with filler letters, and finally the part of speech of the word form in angle brackets.
(5) presents the Moremorph representation of (3). In the last three analyses, there is one tag more than the number of splits due to the noun conversion of fremd 'foreign' to Fremde 'foreigner'. We also standardized inconsistent analyses for orthographical variants with and without hyphenation and added some more special characters to the inventory of word-structuring means. This leads to consistent analyses for orthographical variants such as in (6). Word forms with some other special characters not covered by SMOR can now also be processed, as in (7).

(6) a. Flughafen Köln-Bonn 'Airport Cologne-Bonn'
    b. Flughafen Köln/Bonn 'Airport Cologne/Bonn'
"Team Lufthansa"-Partner (8) shows the output for (6-b) with the structuring character tagged as HYPHEN. While most morphological analyzers build on the results of word splitters, we decided to take up a hybrid approach which combines the reliable entries of a morphological database with the augmented and further processed analyses of SMOR and Moremorph. Here, also another morphological tool could be chosen. The German morphological tree database extracts its entries from a. the refurbished CELEX database (Baayen et al., 1995;Steiner, 2016) for German morphology (Burnage, 1995;Gulikers et al., 1995) and b. the compound analyses from the GermaNet database (Hamp and Feldweg, 1997;Henrich and Hinrichs, 2011;Steiner, 2017). For both preprocessed datasets, the derivation of complex structures was performed recursively, by combining the GermaNet analyses with the analyses from CELEX.
The tree building tool provides different parameters for the analysis. We chose to enrich the data with information on diachronic derivation and permitted a depth of six levels for the morphological analyses. (9) shows the morphological structures for (9-a) Verkehrsamt 'tourist office', (9-b) Verkehrsanlage 'traffic facility', and (9-c) Verkehrsbehinderung 'traffic obstruction'. (9-b) comprises diachronic derivational information, showing the noun Anlage 'facility/lay out' as derived from the verb anlegen 'lay out'.

Combining Morphological Databases with a Segmenter
In the following, we combine the morphological database with a morphological segmenter and a contextual evaluation process. If the database look-up fails, the time-consuming word splitting and evaluation is started. The output of Moremorph is then analyzed by a contextual method exploiting a very large corpus. If this fails, frequency counts over the whole corpus are the back-off strategy. The new analyses are added to a set of new splits. At the end of each word analysis, all subparts of the word are searched within the database and the new-split set. This leads to incrementally more fine-grained entries. Figure 3 presents an overview. It shows two databases of morphological trees: the German morphological tree database and an incremental database for all newly found morphological analyses. Furthermore, it comprises a set of monomorphemes.
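The control flow of this hybrid pipeline can be summarized in a minimal sketch. All function and parameter names here are invented for illustration; the splitting, scoring, and back-off steps are passed in as callables so the sketch stays self-contained.

```python
def analyze(word, tree_db, monomorphemes, new_splits,
            split_fn, context_score_fn, backoff_fn):
    """Minimal control-flow sketch of the hybrid analysis pipeline:
    1. basic database look-up,
    2. SMOR/Moremorph splitting plus contextual evaluation,
    3. document-frequency back-off,
    4. otherwise: treat the word as a hypothetical monomorpheme."""
    if word in tree_db:                        # 1. basic look-up
        return tree_db[word]
    if word in monomorphemes:
        return word
    candidates = split_fn(word)                # 2. flat Moremorph analyses
    best = context_score_fn(word, candidates)  #    contextual evaluation
    if best is None:
        best = backoff_fn(candidates)          # 3. document-frequency back-off
    if best is None:
        monomorphemes.add(word)                # 4. hypothetical monomorpheme
        return word
    new_splits[word] = best                    # incremental new-split database
    return best

# Toy run: the database already contains an entry for 'Verkehrsamt'.
tree_db = {"Verkehrsamt": "(Verkehr)|s|(Amt)"}
result = analyze("Verkehrsamt", tree_db, set(), {},
                 lambda w: [], lambda w, c: None, lambda c: None)
print(result)  # (Verkehr)|s|(Amt)
```

The direct look-up short-circuits the expensive splitting and corpus search, which is why the database coverage reported in Section 6 matters for runtime as well as for quality.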

Basic Look-Up
As shown in Figure 3, a look-up finds the respective tree or the simplex form for the word within the lexicons. Before this is added to the results, all of its subparts are looked up within the databases and the new splits. These subanalyses are integrated into the new analysis. Old entries within the lexical databases are replaced by the new ones.

Finding Splits
If neither an entry inside the tree lexicons nor one in the list of monomorphemes can be found, the Moremorph analyses are taken as the starting point for the further analysis. For each analysis, e.g. the five different ones of example (5), every possible combination of subtrees has to be built. Some of them can be filtered out because they are linguistically implausible, e.g. when a hypothetical subpart ends with a prefix.
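The combination and filtering step can be sketched as follows. The prefix and suffix inventories are toy sets invented for illustration, and Hausverkauf 'house sale' serves as a stand-in example; the real filter operates on the tag sequences delivered by Moremorph.

```python
PREFIXES = {"ver", "be", "ent"}      # toy inventory, invented
SUFFIXES = {"ung", "heit", "keit"}   # toy inventory, invented

def binary_trees(morphs):
    """All binary bracketings over a flat morph sequence."""
    if len(morphs) == 1:
        return [morphs[0]]
    trees = []
    for i in range(1, len(morphs)):
        for left in binary_trees(morphs[:i]):
            for right in binary_trees(morphs[i:]):
                trees.append((left, right))
    return trees

def leaves(tree):
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def plausible(tree):
    """Reject constituents that begin with a suffix or end with a prefix,
    mirroring the linguistic filter described above."""
    if isinstance(tree, str):
        return True
    ls = leaves(tree)
    if ls[0] in SUFFIXES or ls[-1] in PREFIXES:
        return False
    return plausible(tree[0]) and plausible(tree[1])

trees = binary_trees(["Haus", "ver", "kauf"])
good = [t for t in trees if plausible(t)]
print(good)  # [('Haus', ('ver', 'kauf'))]
```

The bracketing (('Haus', 'ver'), 'kauf') is discarded because its left constituent ends with the prefix ver; only the analysis grouping ver with kauf survives.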
All plausible combinations of strings and tags undergo a contextual analysis if occurrences of all subparts can be found within at least one text of the large corpus. Otherwise, a procedure using the overall document frequencies as a back-off strategy is invoked.

Morphological Segmentation based on Contextual Information
For (unknown) compounds, we presuppose that each component can be found within the same close environments. Therefore, the frequencies of components in texts should be much lower for erroneous splits than the frequencies for correct segmentations.
We chose a large set of texts for the retrieval: the freely available and annotated German Wikipedia corpus of 2015 (Margaretha and Lüngen, 2014). We restricted ourselves to the subcorpus of 1.8 million article texts. The corpus was tokenized by a modified version of the tool from Dipper (2016) and lemmatized by the TreeTagger (Schmid, 1999). Text indices were built both for tokenized and lemmatized forms. For each text, all frequencies of lemmas and tokens were stored.
For each morphological split sp of a word form wf, the intersection of all texts comprising the word form wf and its hypothetical components c_{wf,sp,1..n} is retrieved from the text indices. For every text t which includes the word form wf together with all components c_{wf,sp,1}...c_{wf,sp,n} of a morphological split, the document frequencies (df) of the components are retrieved and added up to the sum-of-text-frequencies score (Stf). Finally, for every hypothetical analysis, the highest value over all texts is chosen, and the morphological analysis with the best score is processed and stored (Equation 1). For Fremdenverkehrsamt, this yields: (*Fremdenverkehr* (*Fremde* fremd|e)|n|(*Verkehr* (*verkehren* ver|kehren)))|s|Amt.
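The Stf computation can be sketched as follows. The index layout, a mapping from text id to per-text lemma frequencies, is an assumption of this sketch, as are all the frequency values.

```python
def stf_score(wf, split, text_index):
    """Sum-of-text-frequencies (Stf): over every text containing the
    word form AND all components of the split, sum the components'
    frequencies; keep the maximum over these texts."""
    best = None
    for freqs in text_index.values():
        if wf in freqs and all(c in freqs for c in split):
            total = sum(freqs[c] for c in split)
            best = total if best is None else max(best, total)
    return best

def best_stf(wf, splits, text_index):
    """Best-Stf: the split with the highest Stf score, or None if no
    text supports any of the hypothetical splits."""
    scored = [(stf_score(wf, s, text_index), s) for s in splits]
    scored = [(v, s) for v, s in scored if v is not None]
    return max(scored)[1] if scored else None

# Toy index of two texts (all frequencies invented for illustration).
index = {
    "t1": {"Verkehrsamt": 1, "Verkehr": 4, "Amt": 2},
    "t2": {"Verkehr": 1, "Samt": 3},
}
print(best_stf("Verkehrsamt",
               [["Verkehr", "Amt"], ["Verkehr", "Samt"]], index))
# ['Verkehr', 'Amt']
```

Here the split into Verkehr and Samt finds no text that also contains the word form Verkehrsamt, so only the correct split receives a score.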

Morphological Segmentation based on Document Frequencies
In case no text can be found which includes the word form wf together with the components of any of the hypothetical analyses, the corpus itself is considered as a textual environment in the widest sense. For each split, the sum of the components' document frequencies is calculated. The hypothetical analysis with the highest value is chosen, and the morphological analysis with this score is processed for storage. In all other cases, the analysis yields the hypothetical analysis of a monomorpheme.
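This back-off step can be sketched as follows; `doc_freq`, a mapping from lemma to the number of corpus documents containing it, and all values are assumptions made for the sketch.

```python
def df_backoff(splits, doc_freq):
    """Back-off over the whole corpus: rank the hypothetical splits by
    the sum of their components' document frequencies. Returns None
    when no component is attested at all, in which case the word is
    treated as a hypothetical monomorpheme."""
    scored = [(sum(doc_freq.get(c, 0) for c in s), s) for s in splits]
    best_score, best = max(scored)
    return best if best_score > 0 else None

# Toy document frequencies, invented for illustration.
doc_freq = {"Verkehr": 120000, "Amt": 80000, "Samt": 900}
print(df_backoff([["Verkehr", "Amt"], ["Verkehr", "Samt"]], doc_freq))
# ['Verkehr', 'Amt']
```

Because this criterion ignores co-occurrence, it is weaker than the contextual Stf score and is only used when the context search fails.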

Substitution of Analyses
Whenever an analysis has been found by the Best-Stf score or another look-up, the analyses for its immediate constituents are searched in the databases. By this, the lexicons can be incrementally enlarged and enriched. Figure 4 shows an example from our test corpus, which we used for the evaluation in Section 6. The results are added to a database of new splits and can be added to the previous database after an evaluation.

Data
For testing the performance, we use the Korpus Magazin Lufthansa Bordbuch (MLD), which is part of the DeReKo-2016-I corpus (Institut für Deutsche Sprache, 2016). It is an in-flight magazine with articles on traveling, consumption and aviation. For the tokenization, we enlarged and customized the tokenizer by Dipper (2016) for our purposes. Multi-word units were automatically identified based on the multi-word dataset which we had augmented before. The resulting data comprises 276 texts with 5,202 paragraphs, 16,046 sentences and 260,114 tokens. The number of word-form types is 38,337. We analyze the lemmatized version of this corpus, which was produced by the TreeTagger (Schmid, 1999); it comprises 27,902 lemma types.

Coverage
15,622 lemma types can be found within the database. 12,280 lemma types are not covered by the databases, so they were re-analyzed by SMOR/Moremorph. We manually checked the results for the first 1,000 lemma types which could not be found in the database. Very often, these are derivatives, rare or nonce words, proper names or words containing proper names as in (10).

(10) a. ordnend 'ordering, regulatory'
     b. Paris-Erfahrung 'Paris experience'
     c. Winterspaß 'winter fun'

The details of the check against the German tree database are included in Table 1, with a coverage of 55.99% for the lemma types. This direct look-up saves a lot of computational effort. Given the quality of the database, which is based on GermaNet and CELEX, the recall is extremely close to these numbers. The remaining 44.01% of all lemma types were evaluated in the following way: we checked every split of the first thousand analyzed words. For ambiguous analyses, we accepted those which included a monomorphemic and a correct derivational analysis, as in (11), with (11-a) showing the segmentation of verb stem and derivational suffix.
(11) a. ordnend ordn end
        V PPres ADJ-SUFF <ADJ>
     b. ordnend ordnend
        V <V>

If one or more splits were erroneous, as in (12-a), the analysis was rejected. The remaining types were re-analysed as hypothetical monomorphemes during the further analysis. Often, these were names of airplane types or similar expressions. Therefore, the number of analyzed lemma types (27,902) corresponds to a full coverage. SMOR/Moremorph on its own was able to process 13,461 lemmas; the rest was classified as unknown. This good coverage is a direct result of the adjustment of the lexicons described in Section 3.2, especially concerning the names lexicons.

Precision
The complete analyses of the hybrid morphological parsing yield 5,307 entries in the new-split database and 5,973 new entries among the monomorphemes. We analyzed the first 1,000 entries of the newly found splits and the first 2,000 entries within the monomorpheme set. In the first set, we found 65 wrongly or imperfectly analyzed word forms. Most of them are three-part compounds such as (13), whose correct components were not found within a single text. The morphemes were identified, but the ambiguity could not be resolved.
(13) (Berg|Regen|Wald) 'mountain rain forest'

Another error type consists of wrong analyses of derivative nouns which start with a verb particle, such as (14-a), which is a derivative of anfahren 'to approach' (14-b) and not a compound of an 'at, to' and Fahrt 'ride'. This mistake is systematic and is caused by the high frequency of the first part, which is usually a homograph of a preposition.
(14) a. An|(*Fahrt* fahren|t) 'approach'
     b. (*anfahren* an|fahren)|t 'approach'

The set of monomorphs comprises many new complex numbers and proper names. All of them were correctly included. Only three assignments are questionable. However, as these are proper names such as Anneliese, which consists of the two proper names Anna and Liese, and/or the analysis in CELEX was monomorphemic too (as for Allerheiligen 'All Saints'), the quality is very high. Therefore, the precision can be considered high for this test corpus: 0.935 for new splits and 0.998 for newly found monomorphs.

Discussion
The results for the first hybrid deep-level morphological analyzer are promising. However, the errors concerning verb particles are systematic. They can be explained by the high frequency of verb particles in texts, which are often homographs of prepositions. For future research, we plan an adjustment by a factor which takes into account the relationship between word length in characters and word frequency, as observed by Zipf and others (Prün, 2005). Köhler (1986) derives this relationship within a synergetic model and corroborates the functional connection between the frequency classes of words and their average length. A measure directly derived from this function would penalize word segmentations with small morphemes and assign more weight to longer (and rarer) components.
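One possible instantiation of such a factor can be sketched as follows. The functional form, the constant gamma, and all frequencies are invented here for illustration; this is not the planned model itself, only a demonstration of the intended effect.

```python
from statistics import geometric_mean

def length_adjusted_score(split, doc_freq, gamma=2.0):
    """Damp the influence of very short, very frequent components
    (e.g. the particle/preposition homograph 'an') by weighting each
    component's frequency with a power of its length."""
    return geometric_mean(
        doc_freq.get(c, 1) * len(c) ** gamma for c in split
    )

# Toy document frequencies, invented for illustration.
doc_freq = {"an": 500000, "Fahrt": 30000, "Anfahrt": 2000}
short = length_adjusted_score(["an", "Fahrt"], doc_freq)
whole = length_adjusted_score(["Anfahrt"], doc_freq)
# Compared with a plain geometric mean of the raw frequencies, the
# length factor narrows the advantage of the short-component split
# over the whole-word analysis.
```

The intended effect is that analyses dominated by a short, extremely frequent first component, such as the erroneous An|Fahrt, lose part of their frequency advantage over analyses with longer components.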

Conclusion and Outlook
This paper demonstrates how linguistic databases for morphological analyses can be updated and exploited. By simple look-up, we reached a coverage of 56% of lemma types. As both underlying databases, CELEX and GermaNet, were manually revised, these analyses are very reliable. The remaining unanalyzed words can mostly be covered by a conventional word segmenter after adjusting its lexicons. These analyses have a flat structure and undergo a procedure of constructing all combinations of possible analyses and a context-based search for the hypothetical constituents in a large corpus. The results for the lemma types are very promising: over 99% of all words were covered by the combined morphological analyses.
New morphological analyses from the treebuilding process can be added to the German tree database after a process of careful evaluation and selection.
The direction of future research is therefore straightforward: it will lead towards creating complex analyses out of existing ones and augmenting the lexical databases.

Acknowledgement
Work for this publication was partially supported by the German Research Foundation (DFG) under grant RU 1873/2-1. I would like to thank the reviewers for their valuable feedback and my colleague Josef Ruppenhofer for making this work possible.