Counting What Counts: Decompounding for Keyphrase Extraction

A core assumption of keyphrase extraction is that a concept is more important if it is mentioned more often in a document. Especially in languages like German that form large noun compounds, frequency counts might be misleading as concepts “hidden” in compounds are not counted. We hypothesize that using decompounding before counting term frequencies may lead to better keyphrase extraction. We identified two effects of decompounding: (i) enhanced frequency counts, and (ii) more keyphrase candidates. We created two German evaluation datasets to test our hypothesis and analyzed the effect of additional decompounding for keyphrase extraction.


Introduction
Most approaches for automatic extraction of keyphrases are based on the assumption that the more frequently a term or phrase is mentioned, the more important it is. Consequently, most extraction algorithms apply some kind of normalization, e.g. lemmatization or noun chunking (Hulth, 2003; Mihalcea and Tarau, 2004), in order to arrive at accurate counts. However, especially in Germanic languages the frequent use of noun compounds has an adverse effect on the reliability of frequency counts. Consider for example a German document that talks about Lehrer (Engl.: teacher) without ever mentioning the word "Lehrer" on its own, because it is always part of compounds like Deutschlehrer (Engl.: German teacher) or Gymnasiallehrer (Engl.: grammar school teacher). Thus, we argue that the problem can be solved by splitting noun compounds into meaningful parts, i.e. by performing decompounding. For example, the German compound Deutschlehrer consists of the parts Deutsch (Engl.: German) and Lehrer (Engl.: teacher).
In this paper, we propose a comprehensive decompounding architecture and analyze the performance of four state-of-the-art algorithms. We then perform experiments on three German datasets, of which two have been created particularly for these experiments, in order to analyze the impact of decompounding on standard keyphrase extraction approaches. Decompounding has previously been successfully used in other applications, e.g. in machine translation (Koehn and Knight, 2003), information retrieval (Hollink et al., 2004; Alfonseca et al., 2008b; Alfonseca et al., 2008a), speech recognition (Ordelman, 2003), and word prediction (Baroni et al., 2002). Hasan and Ng (2014) have shown that infrequency errors are a major cause for lower keyphrase extraction results. To the best of our knowledge, we are the first to examine the influence of decompounding on keyphrase extraction.

Decompounding
Decompounding is usually performed in two steps: (i) a splitting algorithm creates candidates, and (ii) a ranking function decides which candidates are best suited for splitting the compound. For example, Aktionsplan has two splitting candidates: Aktion(s)+plan (Engl.: action plan) and Akt+ion(s)+plan (Engl.: nude ion plan). After generating the candidates, the ranking function assigns a score to each splitting candidate, including the original compound. We will now take a closer look at possible splitting algorithms and ranking functions.

Splitting algorithms
Left-to-Right grows a window over the input from left to right. When a word from a dictionary is found, a split is generated. The algorithm is then applied recursively to the rest of the input.
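The Left-to-Right procedure can be illustrated with a minimal sketch. The function name, the toy dictionary, the lowercasing, and the handling of the linking element "s" are assumptions made for illustration; they are not taken from any of the evaluated tools.

```python
def left_to_right_splits(word, dictionary, linkers=("", "s")):
    """Return all splits of `word` into dictionary words, scanning left to right.

    `linkers` lists linking elements that may follow a fragment (assumption:
    only the empty linker and "s" are considered here).
    """
    word = word.lower()
    results = []
    if word in dictionary:
        results.append([word])  # the unsplit word is itself a candidate
    # Grow a window from the left; every dictionary hit triggers a split.
    for end in range(1, len(word)):
        head = word[:end]
        if head not in dictionary:
            continue
        rest = word[end:]
        for linker in linkers:
            if not rest.startswith(linker):
                continue
            tail = rest[len(linker):]
            if not tail:
                continue
            # Apply the same procedure recursively to the remainder.
            for tail_split in left_to_right_splits(tail, dictionary, linkers):
                results.append([head] + tail_split)
    return results


# Toy dictionary (hypothetical), lowercased for lookup.
toy_dictionary = {"akt", "aktion", "ion", "plan"}
candidates = left_to_right_splits("Aktionsplan", toy_dictionary)
```

With this toy dictionary, the sketch produces both candidates mentioned for Aktionsplan, which is why a ranking function is needed afterwards.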
JWord Splitter performs a dictionary look-up from left to right, but continues this process if the remainder of the word is not found in the dictionary. Once it finds words in both parts (left and right), it creates a split and stops.
Banana Splitter searches for the word from the right to the left, and if there is more than one possibility, the one with the longest split on the right side is taken as candidate.
Data Driven counts, for every position in the input, the number of words in a dictionary that contain a split at this position as prefix or suffix. A split is made at the position with the largest difference between prefix and suffix counts (Larson et al., 2000).
ASV Toolbox uses a trained Compact Patricia Tree to recursively split parts from the beginning and end of the word (Biemann et al., 2008). Unlike the other algorithms, it generates only a single split candidate at each recursive step. For that reason, it does not need a ranker. It is also the only supervised approach tested, as it uses lists of existing compounds.

Ranking functions
Table 1: Evaluation results of state-of-the-art decompounding systems.
As stated earlier, the ranking functions are as important as the splitting algorithms, since a ranking function is responsible for assigning scores to each possible decompounding candidate. For the ranking functions, Alfonseca et al. (2008b) use a geometric mean of unigram frequencies (Equation 1), and a mutual information function (Equation 2).
In these equations, N is the number of fragments the candidate has, w is the fragment itself, f(w) is the relative unigram frequency for that fragment w, bigr(w_i, w_j) is the relative bigram frequency for the fragments w_i and w_j, and c is the compound itself without being split.
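Written out with the symbols defined above, the two ranking functions could take the following form. Equation 1 is simply the geometric mean of the fragment frequencies. Equation 2 is one plausible formulation based on pointwise mutual information between adjacent fragments, with a separate case for the unsplit compound; it should be read as an assumption, as should the names r_freq and r_MI.

```latex
\begin{align}
r_{\mathrm{freq}}(c) &= \Bigl(\prod_{i=1}^{N} f(w_i)\Bigr)^{1/N} \tag{1}\\
r_{\mathrm{MI}}(c) &=
\begin{cases}
-f(c)\,\log f(c) & \text{if } N = 1,\\[4pt]
\dfrac{1}{N-1}\displaystyle\sum_{i=1}^{N-1}
\log\dfrac{\mathrm{bigr}(w_i, w_{i+1})}{f(w_i)\,f(w_{i+1})} & \text{otherwise.}
\end{cases} \tag{2}
\end{align}
```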

Decompounding experiments
For evaluation, we use the corpus created by Marek (2006) as a gold standard for the decompounding methods. This corpus contains a list of 158,653 compounds, stating how each compound should be decompounded. The compounds were obtained from issues 01/2000 to 13/2004 of the German computer magazine c't in a semi-automatic approach. Human annotators reviewed the list to identify and correct possible errors. For calculating the required frequencies, we use the Web1T corpus (Brants and Franz, 2006). Koehn and Knight (2003) use a modified version of precision and recall for evaluating decompounding performance. Following Santos (2014), we apply these metrics to measure the performance of the splitting algorithms and the ranking functions. The following counts were used for evaluating the experiments on the compound level: correct split (cs), a split fragment which was correctly identified, and wrong split (ws), a split fragment which was wrongly identified. P_comp and R_comp evaluate decompounding on the level of compounds, and we propose to use P_split = cs / (cs + ws) to evaluate on the level of splits.
As we focus in this work on the influence of decompounding on improving the accuracy of frequency counts, P_split is the best metric in our case. We can see in Table 1 that the ASV Toolbox splitting algorithm is the best performing system with respect to P_split. Thus, we select it as the decompounding algorithm in our keyphrase extraction experiments described in the next section.
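The split-level precision can be sketched as follows. How predicted fragments are matched against the gold standard is not specified above; the sketch assumes exact string matching, with each gold fragment matched at most once, and the function name and toy data are hypothetical.

```python
def split_precision(predicted, gold):
    """Compute P_split = cs / (cs + ws) over all predicted split fragments.

    `predicted` and `gold` map each compound to its list of fragments.
    """
    correct = wrong = 0
    for compound, fragments in predicted.items():
        remaining = list(gold.get(compound, []))
        for fragment in fragments:
            if fragment in remaining:
                remaining.remove(fragment)  # each gold fragment matches once
                correct += 1
            else:
                wrong += 1
    total = correct + wrong
    return correct / total if total else 0.0


# Toy data (hypothetical): one wrong and one correct decomposition.
gold_splits = {
    "Aktionsplan": ["Aktion", "Plan"],
    "Deutschlehrer": ["Deutsch", "Lehrer"],
}
predicted_splits = {
    "Aktionsplan": ["Akt", "Ion", "Plan"],
    "Deutschlehrer": ["Deutsch", "Lehrer"],
}
p_split = split_precision(predicted_splits, gold_splits)  # 3 correct, 2 wrong
```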

Datasets
For our evaluation, we could not rely on English datasets, as there is only very little compounding in English and thus the expected effect of decompounding is small. German is a good choice, as it is infamous for its heavy compounding, e.g. the well-known Donaudampfschifffahrtskapitän (Engl.: captain of a steam ship on the river Danube). For German keyphrase extraction, we can use the peDOCS dataset described in Erbs et al. (2013), and we created two additional datasets consisting of summaries of lesson transcripts (Pythagoras) and posts from a medical forum (MedForum). Table 2 summarizes their characteristics. peDOCS consists of peer-reviewed articles, dissertations, and books from the educational domain published by researchers. The gold standard for this dataset was compiled by professional indexers and should thus be of high quality.
We present two novel keyphrase datasets consisting of German texts. MedForum is composed of posts from a medical forum. To our knowledge, it is the first dataset with keyphrase annotations from user-generated data in German. Two German annotators with university degrees identified a set of keyphrases for every document and, following Nguyen and Kan (2007), the union of both sets forms the final gold keyphrases. The Pythagoras dataset contains summaries of lesson transcripts compiled in the Pythagoras project. Two annotators identified keyphrases after a training phase with discussion of three documents. As in the MedForum dataset, the gold standard consists of the union of lemmatized keyphrases by both annotators. All datasets contain an unranked list of keyphrases.
The peDOCS dataset is by far the largest of the sets, since it has been created over the course of several years. MedForum and Pythagoras contain fewer documents, but each document is annotated by a fixed pair of human annotators. The average number of keyphrases is highest for peDOCS and lowest for MedForum. The length of the document also influences the number of keyphrases, as short documents have fewer keyphrase candidates. Keyphrases in all three datasets are on average very short. Figure 1 gives an example of a rather specific keyphrase which, however, consists of only one token. We believe that keyphrase extraction approaches benefit from decompounding more in cases of short documents. Longer documents provide more statistical data, which reduces the need for additional statistical data obtained with decompounding.

Experimental Setup
For preprocessing, we rely on components from the DKPro Core framework (Eckart de Castilho and Gurevych, 2014) and on DKPro Lab (Eckart de Castilho and Gurevych, 2011) for building experimental pipelines. We use the Stanford Segmenter for tokenization and TreeTagger (Schmid, 1994; Schmid, 1995) for lemmatization and part-of-speech tagging. Finally, we perform stopword removal and decompounding as described in Section 2. It should be noted that in most preprocessing pipelines, decompounding should be the last step, as it heavily influences POS-tagging. We extract all lemmas in the document as keyphrase candidates and rank them according to basic ranking approaches based on frequency counts and the position in the document. We do not use more sophisticated extraction approaches, as we want to examine the influence of decompounding as directly as possible. However, it has been shown that frequency-based heuristics are a very strong baseline (Zesch and Gurevych, 2009), and even supervised keyphrase extraction methods such as KEA (Witten et al., 1999) use term frequency and position as the most important features and will be heavily influenced by decompounding.
We evaluate the following ranking methods:
tf-idf_constant ranks candidates according to their term frequency f(t, d) in the document.
tf-idf decreases the impact of words that occur in most documents. The term frequency count is normalized with the inverse document frequency in the test collection (Salton and Buckley, 1988).
In this formula, |D| is the number of documents and |{d ∈ D : t ∈ d}| is the number of documents mentioning term t. As some document collections may be too small to allow computing reliable frequency estimates, we also evaluated tf-idf_web. Here, the document frequency is approximated by the frequency counts from the Web1T corpus. We take the position of a candidate as a baseline.
The closer the keyword is to the beginning of the text, the higher it is ranked. This is not dependent on frequency counts, but decompounding can also have an influence if a compound that appears early in the document is split into parts that are now also possible keyphrase candidates. We test each of the ranking methods with (w/) and without (w/o) decompounding.
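The ranking methods described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the experimental implementation: the function names and toy documents are hypothetical, the logarithmic form of the inverse document frequency is the common textbook variant and is assumed here, and ties are broken alphabetically.

```python
import math
from collections import Counter


def document_frequencies(collection):
    """Count, for every term, the number of documents that mention it."""
    df = Counter()
    for doc in collection:
        df.update(set(doc))
    return df


def rank_by_tf(doc):
    """Rank candidates by raw term frequency f(t, d)."""
    tf = Counter(doc)
    return sorted(tf, key=lambda t: (-tf[t], t))


def rank_by_tfidf(doc, df, n_docs):
    """Rank candidates by f(t, d) * log(|D| / df(t)) (assumed log form)."""
    tf = Counter(doc)

    def score(term):
        # Terms without a document frequency get df = 1 (assumption).
        return tf[term] * math.log(n_docs / max(df.get(term, 0), 1))

    return sorted(tf, key=lambda t: (-score(t), t))


def rank_by_position(doc):
    """Rank candidates by the index of their first occurrence."""
    first_seen = {}
    for index, term in enumerate(doc):
        first_seen.setdefault(term, index)
    return sorted(first_seen, key=first_seen.get)


# Toy collection of lemmatized documents (hypothetical).
doc = ["schule", "lehrer", "schule", "schule", "lehrer"]
collection = [doc, ["schule", "arzt"], ["schule", "auge"]]
df = document_frequencies(collection)

by_tf = rank_by_tf(doc)
by_tfidf = rank_by_tfidf(doc, df, len(collection))
by_position = rank_by_position(doc)
```

In this toy example, "schule" occurs in every document, so tf-idf ranks it below "lehrer" although it is the most frequent term. For the tf-idf_web variant, the document frequencies and collection size would come from an external resource such as Web1T instead of the test collection.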

Evaluation metrics
For the keyphrase experiments, we compare results in terms of precision and recall of the top-5 keyphrases (P@5), Mean Average Precision (MAP), and R-precision (R-p). MAP is the average precision of extracted keyphrases from 1 to the number of extracted keyphrases, which can be much higher than ten. R-precision is the ratio of true positives in the set of extracted keyphrases when as many keyphrases as there are gold keyphrases are extracted.
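A minimal sketch of these metrics for a single document follows. The function names and toy data are hypothetical. Average precision is normalized by the number of gold keyphrases, which is the common information retrieval convention and an assumption here; MAP is then the mean of this value over all documents.

```python
def precision_at_k(ranked, gold, k=5):
    """Fraction of the top-k extracted keyphrases that are gold keyphrases."""
    return sum(1 for term in ranked[:k] if term in gold) / k


def average_precision(ranked, gold):
    """Average of precision values at each rank where a gold keyphrase occurs."""
    hits = 0
    total = 0.0
    for rank, term in enumerate(ranked, start=1):
        if term in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def r_precision(ranked, gold):
    """Precision when as many keyphrases are extracted as there are gold ones."""
    r = len(gold)
    if r == 0:
        return 0.0
    return sum(1 for term in ranked[:r] if term in gold) / r


# Toy ranking and gold standard (hypothetical).
ranked = ["auge", "schmerz", "arzt", "tropfen", "brille", "allergie"]
gold = {"auge", "arzt", "allergie"}

p_at_5 = precision_at_k(ranked, gold)   # 2 of the top 5 are gold
ap = average_precision(ranked, gold)    # (1/1 + 2/3 + 3/6) / 3
r_prec = r_precision(ranked, gold)      # 2 of the top 3 are gold
```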

Results and discussion
Table 3: Difference of results with decompounding on the MedForum dataset.
In order to assess the influence of decompounding on keyphrase extraction, we evaluate the selected extraction approaches with (w/) and without (w/o) decompounding. The final evaluation results will be influenced by two factors:
Enhanced frequency counts: As we have discussed before, the frequency counts will be more accurate, which should lead to higher quality keyphrases being extracted. This affects frequency-based rankings.
More keyphrase candidates: The number of keyphrase candidates might increase, as it is possible that some of the parts created by the decompounding were not mentioned in the document before. This is the special case of an enhanced frequency count going up from 0 to 1.
We perform experiments to investigate the influence of both effects, first, the enhanced frequency counts, and second, the newly introduced keyphrase candidates.
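Both effects can be made concrete with a small sketch. The split table stands in for the output of a decompounding tool and is hypothetical, as are the function name and the toy document. Keeping the original compound as a candidate alongside its parts is also an assumption of this sketch.

```python
from collections import Counter

# Hypothetical decompounding output; a real system would call a splitter.
SPLITS = {
    "deutschlehrer": ["deutsch", "lehrer"],
    "gymnasiallehrer": ["gymnasial", "lehrer"],
}


def count_terms(lemmas, splits=None):
    """Count lemmas; if `splits` is given, also count the parts of compounds."""
    counts = Counter(lemmas)
    if splits:
        for lemma in lemmas:
            counts.update(splits.get(lemma, []))
    return counts


lemmas = ["deutschlehrer", "schule", "gymnasiallehrer", "deutsch", "schule"]
plain = count_terms(lemmas)
enhanced = count_terms(lemmas, SPLITS)

# Effect (i): counts of terms that were already candidates go up.
raised = {t for t in plain if enhanced[t] > plain[t]}
# Effect (ii): terms that were not candidates before appear.
new_candidates = set(enhanced) - set(plain)
```

Here "deutsch" is already a candidate and its count rises from 1 to 2, while "lehrer" and "gymnasial" only become candidates through decompounding.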

Enhanced frequency counts
In order to isolate this effect, we limit the list of keyphrase candidates to those that are already present in the document without decompounding. We selected the MedForum dataset for this analysis because it contains many compounds and has the shortest documents, which we believe makes it best suited for an additional decompounding step. Table 3 shows improvements of evaluation results for keyphrase extraction approaches on the MedForum dataset. The improvement is measured as the difference between the evaluation metrics obtained with decompounding and those obtained without it. The table therefore shows the change in performance rather than absolute numbers. Absolute values are not comparable to other experimental settings, because all gold keyphrases that do not appear in the text as lemmas are disregarded. We can thus analyze the effect of enhanced frequency counts in isolation. Results show that for tf-idf_constant, tf-idf, and tf-idf_web our decompounding extension increases results on the MedForum dataset when considering only candidates that are extracted without decompounding. Decompounding does not affect results for the position baseline as it is not based on frequency counting. For the frequency-based approaches, the effect is rather small in general but consistent across all metrics and methods. The decompounding extension, however, also has the effect of adding further keyphrase candidates.

More keyphrase candidates
The second effect of decompounding is that new terms are introduced that cannot be found in the original document. Table 4 shows the maximum recall for lemmas with and without decompounding on all German datasets. The maximum recall is obtained by assuming that, given a list of candidates, the best possible set of keyphrases is extracted. Keyphrase extraction with decompounding increases the maximum recall on all datasets by up to 3.8 percentage points. It must be noted that the increase is due to more keyphrase candidates being extracted, which increases the importance of the final ranking. The increase is higher for MedForum while it is lower for Pythagoras. Pythagoras comprises summaries of lesson transcripts for students in the ninth grade, thus teachers are less likely to use complex words which need to be decompounded. The smaller increase for peDOCS compared to MedForum is due to longer peDOCS documents. The longer a document is, the more likely it is that a part of a compound also appears as an isolated token, which limits the increase of maximum recall. peDOCS has a higher maximum recall than collections with shorter documents because documents with more tokens also have more candidates. MedForum comprises forum data, which contains both medical terms and informal descriptions of such terms. Furthermore, gold keyphrases were assigned to assist others in searching. This leads to documents containing terms like Augenschmerzen (Engl.: eye pain) for which the gold keyphrase Auge (Engl.: eye) was assigned.
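Maximum recall as described above can be sketched as the share of gold keyphrases that occur among the candidates at all. The function name and the toy candidate sets are hypothetical; the example mirrors the Augenschmerzen case.

```python
def maximum_recall(candidates, gold):
    """Recall of a perfect ranker: the share of gold keyphrases among candidates."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(candidates)) / len(gold)


# Toy document (hypothetical): lemmas before and after decompounding.
gold_keyphrases = {"auge", "arzt"}
without_decompounding = {"augenschmerz", "arzt", "tropfen"}
with_decompounding = without_decompounding | {"auge", "schmerz"}

recall_without = maximum_recall(without_decompounding, gold_keyphrases)
recall_with = maximum_recall(with_decompounding, gold_keyphrases)
```

In this toy case the gold keyphrase "auge" only becomes reachable after the compound is split, so the upper bound on recall rises, but the ranking now has to separate it from the additional candidate "schmerz".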

Combined results
Having analyzed the effects of decompounding in isolation, we now analyze the combination of enhanced frequency counts and more keyphrase candidates on the overall results. Table 5 shows the complete results for the German datasets and the described keyphrase extraction methods, with and without decompounding. For the peDOCS dataset, we see a negative effect of decompounding. Only the position baseline and tf-idf_constant benefit from decompounding in terms of mean average precision (MAP), while they yield lower results in terms of the other evaluation metrics. The improvement of the position baseline in terms of MAP might be due to several correctly extracted keyphrases beyond the top-5 extracted keyphrases. We have previously discussed that peDOCS has on average the longest documents and most likely contains all gold keyphrases multiple times in the document text. For this reason, frequency-based approaches do not benefit from additional frequency information obtained from compounds. Many compounds are composed of common words, which already appear in the document. On the contrary, more common keyphrases are weighted higher, which hurts results in the case of peDOCS with its highly specialized and longer keyphrases. Depending on the task, this might be an undesired behavior.
The only dataset for which decompounding yields higher results is the MedForum dataset. Results improve with decompounding for tf-idf_constant and tf-idf. As can be seen in Table 4, enhanced frequency counts improve results, and yield a higher maximum recall. Contrary to the other tf-idf configurations, results for tf-idf_web decrease with decompounding. This leads to the observation that, besides the effect of enhanced ranking and more keyphrase candidates, a third effect influences results of keyphrase extraction methods: the ranking of additional keyphrase candidates obtained from decompounding.
These candidates might appear infrequently in isolation and are ranked high if external document frequencies (df values) are used. Compound parts which do not appear in isolation (and hence are not good keyphrases) are ranked high in the case of tf-idf_web because their document frequency from the web is very low. In the case of classic tf-idf they are ranked low because they are normalized with document frequencies from a corpus where decompounding has been applied. In the case of tf-idf_web, no decompounding has been applied. The effect of the poor ranking of newly introduced keyphrase candidates needs to be investigated further by conducting a manual analysis of the decompounding performance and the creation of non-words.
For the Pythagoras dataset, keyphrase extraction approaches yield results similar to those for peDOCS. Decompounding decreases results; only results for tf-idf stay stable. As seen earlier (see Table 4), decompounding does not raise the maximum recall much (only by .002). As before in the case of the MedForum dataset, tf-idf_web is influenced negatively by the decompounding extension. Results for tf-idf_web decrease by .103 in terms of R-precision, which is a reduction of more than 50%. The ranking of keyphrases is hurt by many candidates which appear as parts of compounds. They are ranked high because they infrequently appear as separate words. Considering the characteristics of keyphrases in Pythagoras, we see that keyphrases are rather long with 12.22 characters per keyphrase. This leads to the observation that the style of the keyphrases has an effect on the applicability of decompounding. Datasets with more specific keyphrases are less likely to benefit from decompounding.

Conclusions and future work
We presented a decompounding extension for keyphrase extraction. We created two new datasets to analyze its effects and showed that decompounding has the potential to increase results for keyphrase extraction on shorter German documents. We identified two effects of decompounding relevant for keyphrase extraction: (i) enhanced frequency counts, and (ii) more keyphrase candidates. We find that the first effect slightly increases results when updating the term frequencies, while including the second effect in the evaluation reduces results for two of three datasets. We thus conclude that the effect of decompounding for keyphrase extraction requires further analysis, but may be a useful feature for supervised systems (Berend and Farkas, 2010).
In the future, we propose to further analyze characteristics of good keyphrases and whether they are often compounds. We see the potential for better decompounding approaches, as any improvements on this task may have positive effects on keyphrase extraction. We would also like to investigate other effects that make tasks like keyphrase extraction especially hard. Named entity disambiguation might improve results further, as some concepts are mentioned frequently in a text but always with a different surface form. We make our experimental framework available to the community to foster future research.