Top a Splitter: Using Distributional Semantics for Improving Compound Splitting

We present a flexible method that re-arranges the ranked output of compound splitters (i


Introduction
Closed nominal compounds (i.e., one-word compounds such as the German Eidotter 'egg yolk') are one of the most productive word formation types in Germanic languages such as German, Dutch or Swedish, and constitute a major class of multi-word expressions (MWEs). Baroni (2002) presents a German corpus study showing that almost half of the corpus types are compounds, while the token frequency of individual compounds is low. This makes it hard to process closed compounds with general-purpose statistical methods and necessitates automatic compound analysis as a principal part of many natural language processing tasks such as statistical machine translation (SMT).
Therefore, previous work has tried to tackle the task of compound splitting (e.g., decomposing Eidotter to Ei 'egg' and Dotter 'yolk'). Most compound splitters follow a generate-and-rank procedure. First, all possible candidate splits are generated, e.g., Ei|dotter, Eid|otter, ..., Eidott|er (Koehn and Knight, 2003), or a knowledge-rich morphological analyzer provides a set of plausible candidate splits (Fritzinger and Fraser, 2010). In a second step, the list of candidate splits is ranked according to statistical features such as constituent frequency (Stymne, 2008; Macherey et al., 2011; Weller and Heid, 2012) or frequency of morphological operations (Ziering and Van der Plas, 2016). By considering each constituent in isolation, approaches limited to frequency neglect the semantic compatibility between a compound and its constituents. For example, while Eidotter is usually understood as the yolk of an egg (i.e., Ei|dotter), the low frequency of Dotter often makes frequency-based splitters rank a less plausible interpretation higher: Eid|otter 'oath otter'.
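The generate-and-rank procedure and its frequency pitfall can be sketched in a few lines of Python. This is a minimal illustration, not any of the cited systems: the function names, the minimum constituent length of two characters, the geometric mean as frequency feature, and all corpus counts are assumptions made for the example.

```python
from math import prod

def candidate_splits(word, min_len=2):
    """Enumerate all binary split points, keeping at least min_len characters per side."""
    return [(word[:i], word[i:]) for i in range(min_len, len(word) - min_len + 1)]

def frequency_score(parts, freq):
    """Geometric mean of constituent counts; unseen constituents count as 0."""
    return prod(freq.get(p, 0) for p in parts) ** (1.0 / len(parts))

# Made-up counts (assumption), chosen only to reproduce the pitfall described above.
freq = {"ei": 500, "dotter": 3, "eid": 400, "otter": 60}

ranked = sorted(
    candidate_splits("eidotter"),
    key=lambda split: frequency_score(split, freq),
    reverse=True,
)
# With these toy counts, eid|otter outranks ei|dotter because "dotter" is rare.
print(ranked[:2])
```

Because each constituent is scored in isolation, nothing in this ranking checks whether the constituents are semantically related to the compound.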
We try to tackle this pitfall by enriching the ranked output of various splitters with a semantic compatibility score. Our method is inspired by recent work on the prediction of compound compositionality using distributional semantics (Reddy et al., 2011; Schulte im Walde et al., 2013). The distributional measures that are used to predict the compositionality of compounds are in fact measuring the semantic similarity between the compound and its constituents. Our assumption is that they can therefore be used readily to rank the candidate constituents a splitter proposes and help to promote more plausible candidate splits (e.g., Eidotter is distributionally more similar to Dotter than to Otter). Previously, Weller et al. (2014) applied compositionality measures to compound splitting as a pre-processing step in SMT. Their intuition is that non-compositional compounds benefit less from splitting prior to SMT. However, they found no improvements in the extrinsic evaluation. Neither did they find improvements from applying distributional semantics directly to the unordered list of candidate splits. We will show in an intrinsic evaluation that distributional semantics, when combined with the initial ranked output of various splitters, does lead to a statistically significant improvement in compound splitting.
Other works that used semantic information for compound splitting include Bretschneider and Zillner (2015), who developed a splitting approach relying on a semantic ontology of the medical domain. They disambiguated candidate splits using semantic relations from the ontology (e.g., Beckenbodenmuskel 'pelvic floor muscle' is split into Beckenboden | muskel using the part-of relation). As a back-off strategy, if the ontology lookup fails, they used constituent frequency. We do not restrict ourselves to a certain domain and related ontology but use distributional semantics in combination with frequency-based split features for the disambiguation. Daiber et al. (2015) developed a compound splitter based on semantic analogy (e.g., bookshop is to shop as bookshelf is to shelf). From word embeddings of compound and head word, they learned prototypical vectors representing the modification. During splitting, they determined the most suitable modifier by comparing the analogy to the prototypes. While Daiber et al. (2015) developed an autonomous splitter and focused on semantic analogy, we present a re-ranker that combines distributional similarity with additional splitting features.
Very recently, Riedl and Biemann (2016) developed a semantic compound splitter that uses a pre-compiled distributional thesaurus for searching semantically similar substrings of a compound subject to decomposition. While their stand-alone method focuses on knowledge-lean split point determination, our approach improves splitters including the task of constituent normalization.
Our contributions are as follows. We are the first to show that distributional semantics information as an additional feature helps in determining the best split among the candidate splits proposed by various compound splitters in an intrinsic evaluation. Moreover, we present an architecture that allows for the addition of distributional similarity scores to any compound splitter by re-ranking a system's output.

Re-ranking based on distributional semantics

Initial split ranking
Our method is applicable to any compound splitter that produces a ranked output of split options with their corresponding ranking score. For example, the target compound Fischerzeugnis 'fish product' is processed by a compound splitter yielding the output as given in Table 1.
The top-ranked candidate split is the result from a falsely triggered normalization rule (i.e., +er is not a valid linking element for Fisch).

Determination of distributional similarity
For each candidate split of a target compound (e.g., Fisch | erzeugnis given Fischerzeugnis), the cosine similarity between the target compound and each candidate constituent is determined; cosine is a standard measure of distributional similarity (DS). In a following step, these cosine values are used to predict the degree of semantic relatedness between the target compound and the candidate modifier (MOD) or head (HEAD), respectively. As proposed by Weller et al. (2014), a possible combination of the candidate constituents' cosine values is the geometric mean (GEO). For example, given cosine values of 0.455 and 0.10 for the two candidate constituents, the GEO DS score for the lexemes derived from Fisch|erzeugnis is $\sqrt{0.455 \cdot 0.10} \approx 0.22$.
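The two steps described above, cosine similarity per constituent followed by the geometric mean, can be sketched as follows. The sparse dictionary representation, the function names, and the co-occurrence counts are assumptions for illustration; clipping negative cosines to zero is likewise an assumption that only matters if vectors can contain negative weights.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {context: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def geo(cos_mod, cos_head):
    """Geometric mean of modifier and head cosine (negative values clipped to 0, an assumption)."""
    return math.sqrt(max(cos_mod, 0.0) * max(cos_head, 0.0))

# Hypothetical co-occurrence counts for a compound and its two candidate constituents.
compound = {"kochen": 4, "huhn": 3, "gelb": 5}
modifier = {"kochen": 6, "huhn": 8, "gelb": 1}
head = {"gelb": 7, "kochen": 2, "fluss": 1}

ds_geo = geo(cosine(compound, modifier), cosine(compound, head))
print(round(ds_geo, 3))
```

The geometric mean drops to zero as soon as one constituent is unrelated to the compound, which penalizes splits where only one side is plausible.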

Combination and re-ranking
In the next step, we multiply the DS scores with the initial split ranking scores and finally re-rank the splits according to the resulting product. Table 2 shows the result from re-ranking the output presented in Table 1 with GEO.
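The combination step amounts to a sort by the product of the two scores. In the sketch below, the candidate list, the initial scores, and two of the three DS scores are hypothetical stand-ins (the values of Table 1 are not reproduced here); only the GEO score of 0.22 for Fisch|erzeugnis is taken from the text.

```python
def rerank(candidates):
    """Sort (split, initial_score, ds_score) triples by initial_score * ds_score, best first."""
    return sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)

# Hypothetical initial ranking (assumption): the falsely normalized split comes first.
initial = [
    ("Fisch(er)|zeugnis", 0.60, 0.05),
    ("Fisch|erzeugnis", 0.40, 0.22),
    ("Fischer|zeugnis", 0.30, 0.08),
]

reranked = rerank(initial)
for split, score, ds in reranked:
    print(f"{split}\t{score * ds:.3f}")
```

With these illustrative numbers, the semantically plausible split Fisch|erzeugnis moves to the top although its initial score is lower.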

Data
We use the German Wikipedia corpus comprising 665M words. We tokenize, lemmatize and PoS-tag using TreeTagger (Schmid, 1995). While we are aware of the fact that there are German corpora larger than Wikipedia which can boost the performance of distributional semantics methods, we decided to use the same corpus as used in previous work for the inspected compound splitters (Ziering and Van der Plas, 2016). By controlling for corpus size, we can contrast the differences in splitting performance with respect to information type (i.e., distributional similarity vs. frequency information) irrespective of corpus size.

Distributional model
In analogy to the distributional model of Weller et al. (2014), we adopt a setting whose parameters are tuned on a development set and prove best for compositionality (Schulte im Walde et al., 2013). It employs corpus-based co-occurrence information extracted from a window of 20 words to the left and 20 to the right of a target word. We restrict ourselves to the 20K most frequent nominal co-occurrents.
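A window-based co-occurrence model of this kind can be sketched as below. This is an illustration under several assumptions: raw counts are used as weights (the weighting scheme is not stated above), nouns are identified by PoS tags starting with "N", and the function name and toy input are invented for the example.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, targets, window=20, vocab_size=20000):
    """tokens: list of (lemma, pos) pairs. Returns {target: Counter of nominal context lemmas}."""
    # Context dimensions: the vocab_size most frequent noun lemmas (tags starting with "N", an assumption).
    noun_counts = Counter(lemma for lemma, pos in tokens if pos.startswith("N"))
    contexts = {lemma for lemma, _ in noun_counts.most_common(vocab_size)}
    vectors = defaultdict(Counter)
    for i, (lemma, _) in enumerate(tokens):
        if lemma not in targets:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            ctx_lemma, ctx_pos = tokens[j]
            if j != i and ctx_pos.startswith("N") and ctx_lemma in contexts:
                vectors[lemma][ctx_lemma] += 1
    return vectors

# Toy lemmatized, PoS-tagged input (hypothetical).
tokens = [("Eidotter", "NN"), ("sein", "VAFIN"), ("gelb", "ADJD"), ("Dotter", "NN"), ("Ei", "NN")]
vectors = cooccurrence_vectors(tokens, {"Eidotter"})
print(vectors["Eidotter"])
```

Restricting the dimensions to frequent nouns keeps the vectors small and comparable across targets.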

Distributional similarity modes
Inspired by Weller et al. (2014), the distributional similarity mode (DS MODE) refers to the selected cosine values, determined with our distributional model. We compare the distributional similarity of both individual constituents (i.e., modifier (MOD) and head (HEAD)) with the geometric mean of them (GEO). Moreover, we use standard arithmetic operations (Widdows, 2008; Mitchell and Lapata, 2010) and combine the vectors of modifier and head by vector addition (ADD) and multiplication (MULT), as shown to be beneficial in Schulte im Walde et al. (2013).
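The five DS modes can be put side by side in a short sketch. It assumes dense vectors of equal length with non-negative weights, that ADD and MULT compose the constituent vectors element-wise, and that the composed vector is then compared with the compound vector by cosine; the function names and toy vectors are illustrative only.

```python
import math

def cos(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ds_score(mode, compound, modifier, head):
    """Distributional similarity of a candidate split under the given DS mode."""
    if mode == "MOD":
        return cos(compound, modifier)
    if mode == "HEAD":
        return cos(compound, head)
    if mode == "GEO":
        return math.sqrt(cos(compound, modifier) * cos(compound, head))
    if mode == "ADD":  # element-wise sum of the constituent vectors (assumed composition)
        return cos(compound, [m + h for m, h in zip(modifier, head)])
    if mode == "MULT":  # element-wise product of the constituent vectors (assumed composition)
        return cos(compound, [m * h for m, h in zip(modifier, head)])
    raise ValueError(f"unknown DS mode: {mode}")

# Toy vectors (hypothetical) over three context dimensions.
compound, modifier, head = [2, 1, 0], [1, 0, 0], [0, 1, 0]
for mode in ("MOD", "HEAD", "GEO", "ADD", "MULT"):
    print(mode, round(ds_score(mode, compound, modifier, head), 3))
```

In this toy case the constituents share no context dimension, so MULT yields a zero vector and a score of zero, while ADD still rewards the split.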

Rankings in comparison
We compare the performance of the initial ranking (INITIAL) of a compound splitter, based on all individual features, with the splitting performance after re-ranking by multiplying the selected DS value with the initial ranking score (RR ALL). Our baseline (RR DS) is inspired by the aggressive splitting mode (DIST) of Weller et al. (2014): we re-rank the unordered list of candidate splits proposed by a splitter according to the DS scores only.

Inspected compound splitters
We inspect three different types of German compound splitters, ranging from knowledge-lean to knowledge-rich. Ziering and Van der Plas (2016) developed a corpus-based approach, where morphological operations are learned automatically from word inflection. Weller and Heid (2012) used a frequency-based approach with a list of PoS-tagged lemmas and an extensive handcrafted set of normalization rules. Fritzinger and Fraser (2010) combined the splitting output of the morphological analyzer SMOR (Schmid et al., 2004) with corpus frequencies.

Evaluation setup
While Weller et al. (2014) did not observe a difference in SMT performance between ranking candidate splits according to frequency and compositionality, we use an intrinsic evaluation measure that actually reveals significant differences. We follow the evaluation approach of Ziering and Van der Plas (2016), who defined splitting accuracy in terms of determining the correct split point (SPAcc) and correctly normalizing the resulting constituents (NormAcc), and use the GermaNet gold standard developed by Henrich and Hinrichs (2011). We remove hyphenated compounds, which should be trivial splitting cases that do not need improvement by re-ranking.
Some of the compound splitters described in Section 3.5 can only process a subset of the gold standard. For example, the approach of Fritzinger and Fraser (2010) is limited to a hand-crafted lexicon (i.e., it misses compounds with unknown constituents such as Barbiepuppe 'Barbie doll'). Moreover, it uses the analyzer SMOR, which considers some gold standard compounds as cases of derivation which are not subject to decomposition (e.g., Unterbesetzung 'understaffing' is primarily derived from the verb unterbesetzen 'to understaff'). Besides, for some compounds, there are no binary splits in a system's ranking. These compounds are excluded from the respective splitter's test set. Table 3 shows the test set sizes and coverage of the inspected compound splitters.

Table 4: Results of split re-ranking; † indicates significantly better than INITIAL
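The two accuracy types can be made concrete with a small sketch. The data layout (a split index plus a pair of normalized lemmas per compound), the function name, and the predictions are assumptions for illustration; in particular, NormAcc is computed here as an exact match of both normalized constituents, which is one plausible reading of the definition above.

```python
def accuracies(predictions, gold):
    """Both arguments map compound -> (split_index, (modifier_lemma, head_lemma)).

    Returns (SPAcc, NormAcc); compounds without a prediction count as wrong.
    """
    total = len(gold)
    sp_hits = sum(c in predictions and predictions[c][0] == g[0] for c, g in gold.items())
    norm_hits = sum(c in predictions and predictions[c][1] == g[1] for c, g in gold.items())
    return sp_hits / total, norm_hits / total

gold = {
    "Eidotter": (2, ("Ei", "Dotter")),
    "Fischerzeugnis": (5, ("Fisch", "Erzeugnis")),
    "Haarwasser": (4, ("Haar", "Wasser")),
}
# Hypothetical system output: one wrong split point, one wrong modifier normalization.
predictions = {
    "Eidotter": (3, ("Eid", "Otter")),
    "Fischerzeugnis": (5, ("Fisch", "Erzeugnis")),
    "Haarwasser": (4, ("haaren", "Wasser")),
}

sp_acc, norm_acc = accuracies(predictions, gold)
print(f"SPAcc={sp_acc:.3f} NormAcc={norm_acc:.3f}")
```

The Haarwasser prediction shows why the two measures diverge: the split point is right, but the modifier is normalized to the wrong lemma.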

Results and discussion
In the following section, we show results on splitting performance of various compound splitters before and after adding our re-ranking method. As shown in Table 3, the systems are evaluated on different test sets. It is not our goal to compare different splitting methods against each other, but to show the universal applicability of our re-ranker for different types of splitters. Table 4 shows the performance numbers for all inspected compound splitters and all DS modes. A first result is that the INITIAL accuracy (both SPAcc and NormAcc) is always outperformed by re-ranking with DS scores as an additional feature (RR ALL) for at least one DS MODE. The baseline of using pure DS scores (RR DS) worsens the INITIAL performance. This is in line with previous work (Weller et al., 2014) and shows that isolated semantic information does not suffice but needs to be introduced as an additional feature. In an error analysis, we observed that the corpus frequency, which is missing for RR DS, is a crucial feature for compound splitting and helps to demote analyses based on typographical errors or unlikely modifier normalization. For example, while RR ALL analyzes the compound Haarwasser 'hair tonic' with the correct and highly frequent modifier Haar 'hair', RR DS selects the morphologically plausible yet unlikely and infrequent verbal modifier haaren 'to molt', which happens to have the higher cosine similarity to Haarwasser.

General trends
Another type of compound analysis that benefits from corpus frequency is binary splitting of left-branched tripartite compounds (i.e., bracketing). For example, the compound Blinddarmoperation 'appendix operation' (lit.: 'blind intestine operation') is correctly split on the basis of frequency into Blinddarm | operation '[appendix] operation', whereas RR DS prefers the right-branched splitting into Blind | darmoperation 'blind [intestine operation]'. Since the rightmost constituent Operation 'surgery/operation' is more ambiguous, it has a smaller cosine similarity to the entire compound than the right-branched compound Darmoperation 'intestinal operation'. In contrast, the high corpus frequency of the non-compositional Blinddarm 'appendix' and of the head Operation makes a frequency-based splitter choose the correct structure. However, bracketing also benefits from cosine similarity. For example, using re-ranking by RR ALL, the wrong compound split Arbeits|platzmangel 'labor [lack of space]' is corrected to Arbeitsplatz|mangel 'job scarcity'. In conclusion, we argue that the combination of corpus frequency and semantic plausibility (in terms of cosine similarity) works best for splitting.
Comparing the accuracy types, we see that the determination of the correct split point is the easier task and achieves an SPAcc of 98.5% (GEO@RR ALL for Fritzinger and Fraser's (2010) splitter). However, there is only a small benefit for SPAcc when adding semantic support. In contrast, constituent normalization (measured as NormAcc) can be improved by +1.6% (GEO@RR ALL for Ziering and Van der Plas' (2016) splitter).
Comparing the DS modes, we see that for NormAcc, the more demanding task that leads to the largest differences in performance between the different modes, the MOD mode outperforms the HEAD mode (for RR ALL). However, the modes that combine head and modifier scores mostly outperform those based on heads or modifiers in isolation. This is in line with tendencies found in previous work on compositionality of compounds (Schulte im Walde et al., 2013). In addition, we find that for NormAcc, the GEO mode outperforms the modes based on vector arithmetic, whereas for SPAcc, the performance of GEO and the vector addition (ADD) is comparable.

Individual splitter improvement
Ziering and Van der Plas (2016) automatically learned constituent transformations taking place during compounding (e.g., s-suffixation) from word inflection. Based on corpus frequency and transformation plausibility, they produced a ranked list of candidate splits. However, misleading inflections can rank false splits high. For example, +ge, as in the participle aufgewachsen 'grown up' (aufwachsen 'grow up'), leads to the falsely top-ranked candidate split Fu(ge)nk | elle 'radio ulna' instead of Fugen | kelle 'filling trowel'. Re-ranking with RR ALL promotes the correct candidate split. We achieve significant improvements for almost all DS MODEs.
Weller and Heid (2012) extended a frequency-based approach (Koehn and Knight, 2003) with a hand-crafted set of morphological rules. Even when restricted to only valid constituent transformations, some rules are falsely triggered and lead to wrong splits. For example, the er-suffix (as in Kinder | buch 'children's book') is used for the compound Text | erkennung 'text recognition' and results in the false split Text(er) | kennung 'text ID'. Our re-ranking method (RR ALL) again helps to promote the correct candidate split. In all DS MODEs, the performance is improved significantly.
For the system of Fritzinger and Fraser (2010), the GEO mode improves the INITIAL splitting accuracy (+0.1%), but we do not achieve statistically significant results (approximate randomization test (Yeh, 2000), p < 0.05). The main reason for this is the lexicon-based morphological analyzer SMOR. Although this system has the smallest coverage on the gold standard, its hand-crafted lexicon results in only correctly triggered transformation rules. This leads to a smaller list of candidate splits. In fact, the average number of analyses provided by Fritzinger and Fraser (2010) is much smaller than for Ziering and Van der Plas (2016), as shown in Table 5.

System       Avg # candidate splits
ZvdP 2016    4.31
WH 2012      2.25
FF 2010      1.11

Table 5: Average number of candidate splits

As a consequence, re-ranking has only a limited impact on the splitting performance. We can conclude that a knowledge-rich morphological resource can mitigate the need for semantic support, albeit at the expense of coverage.
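The significance statements in this section rest on an approximate randomization test (Yeh, 2000). A minimal sketch of a paired, two-sided variant is given below; it assumes per-compound 0/1 correctness values for two rankings of the same test set, and the function name, the number of trials, and the fixed seed are choices made for the example rather than details reported here.

```python
import random

def approximate_randomization(correct_a, correct_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the difference in accuracy of two paired systems.

    correct_a, correct_b: per-item 0/1 correctness on the same test items.
    """
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            # Under the null hypothesis the system labels are exchangeable per item.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)

# Hypothetical correctness vectors for INITIAL and a re-ranked output.
initial = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
reranked = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
print(approximate_randomization(initial, reranked, trials=2000))
```

A difference would be called significant at the level used above if the returned estimate falls below 0.05. With few candidate splits per compound, two rankings rarely disagree, so the observed difference stays small.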

Conclusion
We presented a flexible method for re-arranging the ranked output of a compound splitter, by adding a feature for the semantic compatibility between compound and potential constituents derived from a distributional semantics model. We showed that the addition of distributional similarity significantly improves different types of compound splitters.