A Single Word is not Enough: Ranking Multiword Expressions Using Distributional Semantics

We present a new unsupervised mechanism that ranks word n-grams according to their multiwordness. It relies heavily on a new uniqueness measure that computes, based on a distributional thesaurus, how often an n-gram could be replaced in context by a single-word term. Combined with a downweighting mechanism for incomplete terms, this forms a new measure called DRUID. Results show large improvements over competitive baselines on two small test sets. We demonstrate the scalability of the method to large corpora and the independence of the measure from shallow syntactic filtering.


Introduction
While it seems intuitive to treat certain sequences of tokens as single terms, there is still considerable controversy about what exactly constitutes a multiword expression (MWE). Sag et al. (2001) pinpoint the need to treat MWEs correctly, classify a range of syntactic formations that could form MWEs, and define MWEs as being non-compositional with respect to the meaning of their parts. While the exact requirements on MWEs are bound to specific tasks (such as parsing, keyword extraction, etc.), we operationalize the notion of non-compositionality by using distributional semantics and introduce a new measure that works well for a range of task-based MWE definitions.
Most previous MWE ranking approaches use the following mechanisms to determine multiwordness: part-of-speech (POS) tags, word/multiword frequency and significance of co-occurrence of the parts. In this paper we do not want to introduce "yet another ranking function" but rather an additional mechanism, which performs ranking based on distributional semantics.
Distributional semantics has already been used for MWE identification, but mainly to discriminate between compositional and non-compositional MWEs (Schone and Jurafsky, 2001; Salehi et al., 2014; Hermann and Blunsom, 2014). Here we introduce a new concept to describe the multiwordness of a term by its uniqueness. Using the uniqueness score we measure how likely it is that a term in context can be replaced by a single word. This measure is motivated by the semiotic consideration that, due to parsimony, concepts are often expressed as single words. Furthermore, we implement a context-aware punishment, called incompleteness, which degrades the score of terms that seem incomplete with regard to their contexts. Both concepts are combined into a single score we call DRUID, which is calculated based on a distributional thesaurus. In the following, we show the impact of this new method for French and English and also examine the effect of corpus size on MWE extraction. Additionally, we report on results without using any linguistic preprocessing except tokenization.

Related Work
The generation of MWE dictionaries has drawn much attention in the field of Natural Language Processing (NLP). Early computational approaches (e.g. Justeson and Katz (1995)) use POS sequences as MWE extractors. Other approaches rely on word frequency and statistically test whether the parts of the MWE occur together more often than would be expected by chance (Manning and Schütze, 1999; Evert, 2005; Ramisch, 2012). Among the first measures that consider context information (co-occurrences) are the C-value and the NC-value introduced by Frantzi et al. (1998). These methods first extract candidates using POS information and then compute scores based on the frequency of the MWE and the frequency of nested MWE candidates. The method described by Wermter and Hahn (2005) computes a score by multiplying the frequency of a candidate when placing wildcards for each word. A newer method is introduced by Lossio-Ventura et al. (2014), which reranks scores based on an extension of the C-value that uses a POS-based probability and an inverse document frequency. Using different measures and learning a classifier that predicts multiwordness was first proposed by Pecina (2010), who, however, restricts his experiments to two-word MWEs for the Czech language only. Korkontzelos (2010) comparatively evaluates several MWE ranking measures. The best MWE extractor reported in his work is the scorer by Nakagawa and Mori (2002, 2003), who use the un-nested frequency (called marginal frequency) of each candidate and multiply it by the geometric mean of the numbers of distinct neighbors of each word within the candidate.
Distributional semantics is mostly used to detect the compositionality of MWEs (Salehi et al., 2014; Katz and Giesbrecht, 2006). Most approaches therefore compare the context vector of an MWE with the combined vectors of its constituent words. The similarity between the vectors is then used as the degree of compositionality. In machine translation, words are sometimes considered multiwords if they can be translated as a single term (cf. Bouamor et al., 2012; Anastasiou, 2010). Whereas this follows the same intuition as our uniqueness measure, we do not require any bilingual corpora.
Regarding the evaluation, mostly precision at k (P@k) and recall at k (R@k) are applied (e.g. Evert, 2005; Frantzi et al., 1998; Lossio-Ventura et al., 2014). Another general approach is using the average precision (AP), which is also used in Information Retrieval (IR) (Thater et al., 2009) and has also been applied by .

Baselines and Previous Approaches
We will evaluate our method by comparing our MWE ranking to multiword lists that have been annotated in corpora. Here, we introduce an upper bound and two baseline methods and give a brief description of the competitive methods used in this paper. Most of these methods require a list of candidate terms T, usually extracted with POS sequences (see Section 5).

Upper Bound
We use a perfect ranking as upper bound, where we rank all positive candidates before all negative ones.

Lower Baseline and Frequency Baseline
The ratio between true candidates and all candidates serves as lower baseline, which is also called baseline precision (Evert, 2008). The second baseline is the frequency baseline, which ranks candidate terms t ∈ T according to their frequency freq(t).

C-value/NC-value
The commonly used C-value (see Eq. 1) was developed by Frantzi et al. (1998). The first factor, the logarithm of the term length in words, favors longer MWEs. The second factor is the frequency of the term reduced by the average frequency of all candidate terms in T that nest the term t, i.e. t is a substring of these terms, which we denote as T_t.
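The two factors described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: terms are represented as tuples of words, the logarithm is taken to base 2 (the text only says "logarithm"), and the helper names `is_nested` and `c_value` are hypothetical.

```python
import math


def is_nested(short, long):
    """True if `short` occurs as a contiguous word sequence inside `long`."""
    n = len(short)
    return any(tuple(long[i:i + n]) == tuple(short)
               for i in range(len(long) - n + 1))


def c_value(term, freq):
    """C-value of `term` given `freq`, a dict mapping candidate tuples to counts.

    Assumption: log base 2; the candidate list T is taken to be the keys of `freq`.
    """
    # T_t: all longer candidates that nest the term t
    nesting = [c for c in freq if len(c) > len(term) and is_nested(term, c)]
    # average frequency of the nesting candidates (0 if the term is never nested)
    penalty = sum(freq[c] for c in nesting) / len(nesting) if nesting else 0.0
    return math.log2(len(term)) * (freq[term] - penalty)


# toy counts, invented for illustration only
toy_freq = {
    ("red", "blood"): 10,
    ("blood", "cell"): 7,
    ("red", "blood", "cell"): 6,
}
scores = {t: c_value(t, toy_freq) for t in toy_freq}
```

With these toy counts the nested bigrams are reduced by the frequency of the trigram that contains them, while the trigram keeps its full frequency and is boosted by the length factor.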
An extension of the C-value was proposed by Frantzi et al. (1998) as well and is named NC-value. It takes advantage of context words C_t by assigning weights to them. Only nouns, adjectives and verbs are considered as context words. Context words are weighted with Equation 2, where k denotes the number of times the context word c ∈ C_t occurs with any of the candidate terms. This number is normalized by the number of candidate terms.
The NC-value (Equation 3) is a weighted sum of the C-value and, summed over all context words c, the weight of c multiplied by the frequency of the term t occurring with c, which together form the term t_c.

t-test
The t-test (see e.g. Manning and Schütze, 1999, p. 163) is a statistical test for the significance of co-occurrence of two words. It relies on the probabilities of the term and its single words. The probability of a word p(w) is defined as the frequency of the term divided by the total number of terms of the same length. The t-test statistic is computed using Equation 4 with freq(.) being the total frequency of unigrams.
We then use this score to rank the candidate terms.
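As a rough illustration, the statistic can be sketched as follows. The sketch assumes the usual approximation from Manning and Schütze (1999), in which the sample variance is approximated by the observed n-gram probability; the function name `t_score` and its parameters are hypothetical.

```python
import math


def t_score(ngram_freq, word_freqs, total_ngrams, total_unigrams):
    """Approximate t statistic for an n-gram.

    ngram_freq     -- frequency of the whole n-gram
    word_freqs     -- frequencies of its single words
    total_ngrams   -- total number of n-grams of the same length
    total_unigrams -- total frequency of unigrams, freq(.)
    """
    # observed probability of the n-gram
    p_observed = ngram_freq / total_ngrams
    # expected probability if the words were independent
    p_expected = math.prod(f / total_unigrams for f in word_freqs)
    # variance approximated by the observed probability (assumption)
    return (p_observed - p_expected) / math.sqrt(p_observed / total_unigrams)


# textbook example "new companies" from Manning and Schütze (1999)
t_new_companies = t_score(8, [15828, 4675], 14307668, 14307668)
```

For the textbook example the score is close to 1.0, which is below the usual critical value, so the bigram would not be considered a significant co-occurrence.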

FGM Score
Another method inspired by the C/NC-value is proposed in (Nakagawa and Mori, 2002; Nakagawa and Mori, 2003). The method was developed on a Japanese dataset and outperformed a modified C-value measure. The method is composed of two scoring mechanisms for the candidate term t as shown in Equation 5.
The first term in the equation is a geometric mean GM(.) of the number of distinct direct left l(.) and right r(.) neighboring words for each single word t_i within t.
The neighboring words are extracted directly from the corpus; the method relies neither on candidate lists nor on POS tags. In contrast, the marginal frequency MF(t) relies on the candidate list and the underlying corpus. This frequency counts how often the candidate term occurs within the corpus without being nested in a longer candidate. In Korkontzelos (2010) it was shown that while scoring according to Equation 5 leads to comparatively good results, it is consistently outperformed by MF alone.
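The following sketch shows how the two components could be combined. It assumes that the score is the product of GM and MF and that the neighbor counts are smoothed by adding one before taking the geometric mean, as in the formulation by Nakagawa and Mori; neither detail is spelled out in the text above. The marginal frequency is passed in as a number, and all function names are hypothetical.

```python
from collections import defaultdict


def neighbor_sets(sentences):
    """Collect the distinct direct left and right neighbors of every word."""
    left, right = defaultdict(set), defaultdict(set)
    for tokens in sentences:
        for prev, cur in zip(tokens, tokens[1:]):
            right[prev].add(cur)
            left[cur].add(prev)
    return left, right


def gm(term, left, right):
    """Geometric mean over (l(t_i) + 1) * (r(t_i) + 1) for all words t_i in t."""
    product = 1.0
    for word in term:
        product *= (len(left[word]) + 1) * (len(right[word]) + 1)
    return product ** (1.0 / (2 * len(term)))


def fgm(term, marginal_freq, left, right):
    """FGM score, assumed here to be GM(t) * MF(t)."""
    return gm(term, left, right) * marginal_freq


# toy corpus, invented for illustration only
toy_sentences = [
    ["red", "blood", "cell"],
    ["white", "blood", "cell"],
    ["red", "blood", "count"],
]
toy_left, toy_right = neighbor_sets(toy_sentences)
toy_gm = gm(("red", "blood"), toy_left, toy_right)
```

In the toy corpus, red has no left neighbor and one distinct right neighbor, while blood has two of each, so the product under the root is 2 * 9 = 18.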

Semantic Uniqueness and Incompleteness
We present two new mechanisms relying on a Distributional Thesaurus (DT), which we use to rank terms regarding their multiwordness: A score for the uniqueness of a term and a punishing score that conveys the incompleteness.

Similarity Computation
The DT is computed based on Biemann and Riedl (2013). First we extract n-grams from text and consider the left and the right neighbor of each n-gram as a context feature. Then, we calculate the Lexicographer's Mutual Information (LMI) significance score (Bordag, 2008) between n-grams and features and remove all context features that co-occur with more than 1000 terms, as these features tend to be too general. In the next step we keep for each n-gram only the 1000 context features with the highest LMI score. The similarity score is then computed based on the overlap of features between two terms. Due to pruning, this overlap-based measure is proportional to the Jaccard similarity measure, although we do not apply any normalization. After computing the feature overlap between two terms, we keep for each n-gram the 200 most similar n-grams. Examples of the most similar n-grams for the terms red blood cell and red blood, including their feature overlap, are shown in Table 1.
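The pipeline described above can be sketched in plain Python as follows. This is a small in-memory illustration, not the implementation of Biemann and Riedl (2013): the LMI formula used (joint frequency times the base-2 logarithm of the observed-to-expected ratio) and the function names `ngram_contexts` and `build_dt` are assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict


def ngram_contexts(tokens, n):
    """Yield (n-gram, (left neighbor, right neighbor)) pairs from one sentence."""
    for i in range(1, len(tokens) - n):
        yield " ".join(tokens[i:i + n]), (tokens[i - 1], tokens[i + n])


def build_dt(pairs, max_terms_per_feature=1000, max_features=1000, top_n=200):
    """Build a thesaurus {n-gram: [(similar n-gram, feature overlap), ...]}."""
    pair_freq = Counter(pairs)
    term_freq, feat_freq = Counter(), Counter()
    terms_per_feat = defaultdict(set)
    for (term, feat), n in pair_freq.items():
        term_freq[term] += n
        feat_freq[feat] += n
        terms_per_feat[feat].add(term)
    total = sum(pair_freq.values())

    # score every (term, feature) pair with LMI, skipping overly general features
    scored = defaultdict(list)
    for (term, feat), n in pair_freq.items():
        if len(terms_per_feat[feat]) > max_terms_per_feature:
            continue
        lmi = n * math.log2(n * total / (term_freq[term] * feat_freq[feat]))
        scored[term].append((lmi, feat))

    # keep only the highest-scoring features per term
    kept = {}
    for term, feats in scored.items():
        best = sorted(feats, key=lambda pair: pair[0], reverse=True)[:max_features]
        kept[term] = {feat for _, feat in best}

    # similarity = number of shared features, without normalization
    terms_by_feat = defaultdict(set)
    for term, feats in kept.items():
        for feat in feats:
            terms_by_feat[feat].add(term)
    dt = {}
    for term, feats in kept.items():
        overlap = Counter()
        for feat in feats:
            for other in terms_by_feat[feat]:
                if other != term:
                    overlap[other] += 1
        dt[term] = overlap.most_common(top_n)
    return dt


# toy observations, invented for illustration only
toy_pairs = [
    ("red blood cell", ("the", "count")),
    ("erythrocyte", ("the", "count")),
    ("red blood cell", ("an", "is")),
    ("erythrocyte", ("an", "is")),
    ("dog", ("a", "barks")),
]
toy_dt = build_dt(toy_pairs)
```

In the toy data, red blood cell and erythrocyte share both of their context features and therefore end up as each other's only neighbor, whereas dog shares no feature with any other term.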

Uniqueness Computation
The first mechanism of our MWE ranking method is based on the following hypothesis: n-grams that are MWEs can be substituted by single words and thus have many single words amongst their most similar terms. This is motivated by semiotic considerations: because of parsimony, concepts are usually expressed in single words. When a semantically non-compositional word combination is added to the vocabulary, it expresses a concept that is necessarily similar to other concepts. Hence, if a candidate multiword is similar to many single-word terms, this indicates multiwordness.
To compute the uniqueness score (uq) of an n-gram t, we first extract the n-grams it is similar to using the DT as described in Section 4.1. The function similarities(t) returns the 200 most similar n-grams to the given n-gram t. We then compute the ratio between unigrams and all similar n-grams considered using the formula:
$$uq(t) = \frac{|\{w \in similarities(t) : w \text{ is a unigram}\}|}{|similarities(t)|}$$
We illustrate the computation of our measure based on the MWE red blood cell and the non-MWE red blood. When considering only the ten most similar entries for both n-grams as shown in Table 1, we observe a uniqueness score of 7/10 = 0.7 for both n-grams. If we consider the top 200 similar n-grams, which are also used in our experiments, we obtain 135 unigrams for the candidate red blood cell and 100 unigrams for the n-gram red blood. We will use these counts to illustrate the workings of the method in the remainder.
Table 1: The ten most similar entries for the term red blood cell (left) and red blood (right). Here, seven out of ten terms are single words.
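A minimal sketch of the uniqueness ratio is given below. It assumes that similar n-grams are given as whitespace-tokenized strings; the function name `uniqueness` and the placeholder entries are hypothetical and do not reproduce the entries of Table 1.

```python
def uniqueness(similar_ngrams):
    """Ratio of single-word entries among the similar n-grams of a term."""
    if not similar_ngrams:
        return 0.0
    unigrams = sum(1 for entry in similar_ngrams if len(entry.split()) == 1)
    return unigrams / len(similar_ngrams)


# placeholder list with seven unigrams and three longer n-grams
placeholder_similar = ["w1", "w2", "w3", "w4", "w5", "w6", "w7",
                       "w8 w9", "w10 w11", "w12 w13 w14"]
uq_top10 = uniqueness(placeholder_similar)

# 135 unigrams among 200 similar entries, as reported for red blood cell
uq_top200 = uniqueness(["w"] * 135 + ["w w"] * 65)
```

The first call mirrors the 7/10 case from the running example, the second the 135/200 case used for the top 200 entries.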

Incompleteness Computation
Similar to the C/NC-value method, we also assign a context weighting function that punishes incomplete terms, which we call incompleteness (ic). For this function we extract the 1000 most significant context features using the function context(t), which yields tuples of left and right contexts. These context features are the same as those used for the similarity computation in Section 4.1 and have been ranked according to the LMI measure. For the example term red blood, some of the contexts are ⟨extravasated, cells⟩, ⟨uninfected, cells⟩ and ⟨nucleated, corpuscles⟩. In the next step we split each tuple into its left and right word, including its relative position (left/right) to the candidate term. Applying this to the first context feature results in ⟨extravasated, left⟩ and ⟨cells, right⟩. Then, we sum up the occurrences of each single context, as shown in Table 2 for the two terms. We subsequently select the maximal count and normalize it by the number of features |context(t)| considered, which is 1000. This results in the incompleteness measure ic(t). For our example terms we obtain the values ic(red blood) = 557/1000 and ic(red blood cell) = 48/1000. Whereas the uniqueness scores for the ten most similar entries were equal, we now have a measure that indicates the incompleteness of an n-gram, with higher scores indicating more incomplete terms.
Table 2: Top three most frequent context words for the term red blood cell and red blood in the Medline corpus.
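The counting steps can be sketched as follows, assuming the context features are available as (left, right) tuples. The function name `incompleteness` is hypothetical; the three example contexts are the ones quoted above for red blood.

```python
from collections import Counter


def incompleteness(contexts):
    """Share of context features that contain the most frequent single context.

    contexts -- list of (left word, right word) tuples for one term
    """
    if not contexts:
        return 0.0
    counts = Counter()
    for left, right in contexts:
        # split each tuple and keep the position relative to the term
        counts[(left, "left")] += 1
        counts[(right, "right")] += 1
    return max(counts.values()) / len(contexts)


example_contexts = [
    ("extravasated", "cells"),
    ("uninfected", "cells"),
    ("nucleated", "corpuscles"),
]
ic_example = incompleteness(example_contexts)
```

For these three contexts alone, cells on the right occurs twice, so the sketch yields 2/3; over all 1000 features of red blood the text reports 557/1000.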

Combining Both Measures
As shown in the previous two sections, a high uniqueness score indicates multiwordness and a high incompleteness score should decrease the overall score. In experiments, we found that the best combination is to subtract the incompleteness score from the uniqueness score. This mechanism is inspired by the C-value and motivated by the observation that terms that are often preceded or followed by the same word do not cover the full multiword expression and need to be downranked. This leads to Equation 8, which we call the DistRibutional Uniqueness and Incompleteness Degree (DRUID):
$$DRUID(t) = uq(t) - ic(t)$$
Applying the DRUID score to our example terms (considering the 200 most similar terms) we obtain the scores DRUID(red blood cell) = 135/200 − 48/1000 = 0.627 and DRUID(red blood) = 100/200 − 557/1000 = −0.057. As a higher DRUID score indicates the multiwordness of an n-gram, we can conclude that the n-gram red blood cell is a better MWE than the n-gram red blood.
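The arithmetic of the running example can be checked with a short sketch. The function `druid` and its count-based parameters are hypothetical names chosen for this illustration; the counts are the ones reported above.

```python
def druid(unigram_count, similar_count, max_context_count, context_count):
    """DRUID score from raw counts: uniqueness minus incompleteness."""
    uq = unigram_count / similar_count
    ic = max_context_count / context_count
    return uq - ic


example_scores = {
    "red blood cell": druid(135, 200, 48, 1000),
    "red blood": druid(100, 200, 557, 1000),
}
# rank the candidates from highest to lowest score
example_ranking = sorted(example_scores, key=example_scores.get, reverse=True)
```

Sorting by the score places red blood cell before red blood, in line with the conclusion drawn above.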

Experimental Setting
We examine two experimental settings: First, we compute all measures on a small corpus that has been annotated for MWEs, which serves as the gold standard. In the second setting we compute the measures on a larger in-domain corpus. The evaluation is again performed for the same candidate terms as given by the gold standard. Results for the top k ranked entries are reported using the precision at k, $P@k = \frac{1}{k}\sum_{i=1}^{k} x_i$, where $x_i$ equals 1 if the i-th ranked candidate is annotated as MWE and 0 otherwise. For the overall performance we use the average precision (AP) as defined in Thater et al. (2009): $AP = \frac{1}{|T_{mwe}|}\sum_{k=1}^{|T|} x_k \, P@k$, with $T_{mwe}$ being the set of positive MWEs. When facing tied scores we mix false and true candidates randomly, cf. Cabanac et al. (2010).
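Both evaluation measures and the random handling of ties can be sketched as follows. The sketch assumes that a ranking is given as a list of 0/1 labels in rank order; the function names and the seeded shuffle used to mix tied candidates are choices of this illustration, not a description of the original evaluation code.

```python
import random


def precision_at_k(labels, k):
    """P@k: share of true MWEs among the first k ranked candidates."""
    return sum(labels[:k]) / k


def average_precision(labels):
    """AP: mean of P@k over all ranks k that hold a true MWE."""
    positives = sum(labels)
    if positives == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, x in enumerate(labels, start=1):
        if x:
            hits += 1
            total += hits / k
    return total / positives


def rank_with_random_ties(scored, seed=0):
    """Sort (score, item) pairs by score, descending, mixing ties randomly.

    Shuffling first and then using a stable sort leaves tied items in
    random relative order.
    """
    shuffled = list(scored)
    random.Random(seed).shuffle(shuffled)
    return sorted(shuffled, key=lambda pair: pair[0], reverse=True)


example_labels = [1, 0, 1, 1, 0]
example_p3 = precision_at_k(example_labels, 3)
example_ap = average_precision(example_labels)
```

For the example labels, P@3 is 2/3 and the AP is (1/1 + 2/3 + 3/4) / 3. A perfect ranking, with all positive candidates first, reaches an AP of 1.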

Corpora
For the experiments we consider two annotated (small) corpora and two unannotated (large) corpora.

GENIA corpus and SPMRL 2013: French Treebank
In the first experiments we use two small annotated corpora that serve as the gold standard for MWEs. We use the medical GENIA corpus (Kim et al., 2003), which consists of 1999 abstracts from Medline and encompasses 0.4 million words. This corpus has annotations regarding important and biomedical terms. Single-word terms are also annotated in this data set; we ignore them.
The second small corpus is based on the French Treebank (Abeillé and Barrier, 2004), which was extended for the SPMRL task (Seddah et al., 2013). This version of the corpus also contains compounds annotated as MWEs. In our experiments we use the training data, which covers 0.4 million words.
Whereas the GENIA MWEs target term matching and medical information retrieval, the SPMRL MWEs mainly focus on improving parsing through compound recognition.

Medline Corpus and Est Républican Corpus (ERC)
In a second experiment the scalability to larger corpora is tested. For this, we make use of the entire set of Medline abstracts, which consist of about 1.1 billion words. The Est Républican Corpus (ERC) (Seddah et al., 2012) is our large French corpus. It consists of local French news from the eastern part of France and comprises 150 million words.

Candidate Selection
In the first two experiments, we use POS filters to select candidates. We concentrate on filters that extract noun MWEs and avoid further preprocessing like lemmatization. We use the filter introduced by Justeson and Katz (1995) 7 for the English medical datasets. Considering only terms that appear more than ten times leads to 1,340 candidates for the GENIA dataset and 29,790 candidates for the Medline dataset. According to Table 3, most candidates are bigrams. Whereas trigrams still make up around 20% of the candidates in both corpora, 4-grams are only marginally represented. For the French datasets we apply the filter proposed by Daille et al. (1994) 8, which is suited to match nominal MWEs. Applying the same filtering as for the medical corpora leads to 330 candidate terms for the SPMRL and 7,365 candidate terms for the ERC. Here the ratio between bi- and trigrams is more balanced, but again the 4-grams constitute the smallest class.

[Table 3: number of candidates per corpus, broken down into 2-grams, 3-grams and 4-grams]
In comparison to the Medline dataset, the ratio of multiwords extracted by the POS filter on the French corpus is much lower. The reason is that in the French data many adverbial and prepositional MWEs are annotated, which are not covered by the POS filter.
The third experiment shows the performance of the method in absence of language-specific preprocessing. Thus, we only filter the candidates by frequency and do not make use of POS filtering. As most previous methods rely on POS-filtered data we cannot make use of them in this setting.
For the evaluation, we compute the scores of the competitive methods in two settings: First, we compute the scores based on the full candidate list without any frequency filter and prune low-frequency candidates only for the evaluation (post-prune). In the second setting we filter candidates according to their frequency before the computation of scores (pre-prune). This leads to differences for context-aware measures, since in the pre-pruned case a smaller number of less noisy contexts is used.
7 A regular expression for matching POS tag sequences is summarized by Korkontzelos (2010). Each letter is a truncated POS tag of length one, where J is an adjective, N a noun and P a preposition.
8 Following the same convention as for English, the regular expression can be expressed as N[J]?|NN|NPDN.

Small Corpora Results
First we show the results based on the GENIA corpus (see Table 4). Almost all competitive methods beat the lower baseline. The C/NC-value performs best when the pruning is done after a frequency filter. In line with the findings of Korkontzelos (2010) and in contrast to Frantzi et al. (1998), the AP of the C-value is slightly higher than that of the NC-value. All the FGM-based methods except the GM measure alone outperform the C-value. The results in Table 4 indicate that the best competitive system is the post-pruned FGM system, as it has much higher average precision scores and misses only 50 MWEs in the first 500 entries. A slightly different picture is presented in Figure 1, where the P@k scores are plotted against the number of candidates. Here DRUID performs well for the top-k list for small k, i.e. it finds many good MWEs with high confidence, and thus combines well with MF, which extends to larger k but places too much importance on frequency when used alone. Common errors are frequent chunks such as "in patience" (see Table 9 in Section 7). Whereas for the post-pruned case FGM scores higher than MF, the inverse is observed when pre-pruning. Our DRUID method alone surpasses the FGM method only for the first 300 ranked terms (see Figure 1 and Table 4). When DRUID is multiplied by the logarithmic frequency, the results improve slightly and the best P@100 of 0.97 is achieved. All FGM results are outperformed when combining the post-pruned FGM scores with our measure. According to Figure 1, this combination achieves high precision for the first ranked candidates and still exploits the good performance of the post-pruned FGM-based method for the middle-ranked candidates.
[Table 4: GENIA results; columns Method, P@100, P@500, AP]
Different results are achieved for the SPMRL dataset, as can be seen in Table 5. Whereas the pre-pruned C-value again receives better results than frequency, it scores below the lower baseline. The post-pruned FGM and MF methods also do not exceed the lower baseline. Data analysis revealed that for the French dataset only ten out of the 330 candidate terms are nested within any of the candidates. This is much lower than the 637 terms nested in the 1,340 candidate terms for the GENIA dataset. As both the FGM-based methods and the C/NC-value rely heavily on nested candidates, they cannot profit from the candidates of this dataset and achieve scores similar to ordering candidates according to frequency. Comparing the baselines to our scoring method, this time we obtain the best result for DRUID without additional factors. However, multiplying DRUID with MF or log(frequency) still outperforms the other methods and the baselines.
[Table 5: SPMRL results; columns Scoring, P@100, P@200]

Large Corpora Results
Most MWE evaluations have been performed on rather small corpora. Here we want to inspect the performance of the measures for large corpora, so as to realistically simulate a situation where the MWEs should be found automatically for an entire domain.
Using the Medline corpus, all methods except the GM score outperform the lower baseline and the frequency baseline (see Table 6). Regarding the AP, the best results are obtained when combining our DRUID method with the MF, whereas for P@100 and P@500 the log-frequency-weighted DRUID scores best. Using solely the DRUID method or the combined variation with the log-frequency leads to the best ranking for the first 1000 ranked candidates; beyond that, both are outperformed by the MF-based DRUID variations. In this experiment the C-value achieves the best performance among the competitive methods for P@100 and P@500, followed by the t-test. But the highest AP is reached with the post-pruned MF method, which also slightly outperforms the sole DRUID. Contrary to the GENIA results, the MF scores are consistently higher than the FGM scores.
[Table 6: Medline results; columns Scoring, P@100, P@500]
When using the French ERC we found that no nested terms occur within the candidates. Thus, the post- and pre-pruned settings are equivalent and MF equals frequency. The best results are again obtained with our method with and without the logarithmic frequency weighting (see Table 7). Again the AP of the C-value and of most of the FGM-based methods is inferior to the frequency scoring. Only the t-test and the MF are slightly higher than the frequency. But in contrast to the results based on the smaller SPMRL dataset, the MF, FGM and C-value can outperform the lower baseline. In comparison to the smaller corpora, the performance for the larger corpora is much lower. In particular, terms that are infrequent in the small corpora but frequent in the larger corpora have not been annotated as MWEs.
[Table 7: ERC results; columns Method, P@100, P@500, AP]

Results without POS Filtering
In the last experiment, we apply our method to candidates without any POS filtering and report results using a frequency threshold of ten. As the competitive methods from the previous section rely on POS tags, we use the t-test for comparison. Analysis revealed that the top-scored candidates according to the t-test begin with stop words.
As an additional heuristic for the t-test, we shift MWEs that start or end with one of the ten most frequent words to the last ranks. For the smaller dataset the best results are achieved with the sole DRUID (see Table 8), and the frequency weighting does not seem to be beneficial, as highly frequent n-grams ending with stop words are ranked higher in the absence of POS filtering. This, however, is not observed for larger corpora. Here the best results for Medline are achieved with the frequency-weighted DRUID. Whereas for French the sole DRUID method performs best, the difference between the DRUID and the log-frequency-weighted DRUID is rather small. The low APs throughout can be explained by the large number of considered candidates. The second best scores are achieved with the stop-word-based t-test (t-test + sw). C-value performs on par with frequency.
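The stop-word heuristic for the t-test can be sketched as follows. The sketch assumes that candidates are tuples of words already ordered by their t-test score and that the stop words are simply the most frequent tokens of the corpus; the function name `demote_stopword_ngrams` is hypothetical.

```python
from collections import Counter


def demote_stopword_ngrams(ranked, corpus_tokens, n_stop=10):
    """Move candidates that start or end with a frequent word to the end.

    ranked        -- candidates (tuples of words) in descending score order
    corpus_tokens -- all tokens of the corpus, used to find frequent words
    n_stop        -- number of most frequent words treated as stop words
    """
    stop = {word for word, _ in Counter(corpus_tokens).most_common(n_stop)}
    kept = [t for t in ranked if t[0] not in stop and t[-1] not in stop]
    demoted = [t for t in ranked if t[0] in stop or t[-1] in stop]
    # the relative order inside both groups is preserved
    return kept + demoted


# toy data, invented for illustration only
toy_tokens = "the cell of the blood in the patient".split()
toy_ranked = [("the", "cell"), ("blood", "cell"), ("of", "the"), ("red", "blood")]
toy_reranked = demote_stopword_ngrams(toy_ranked, toy_tokens, n_stop=1)
```

In the toy example only "the" counts as a stop word, so the two candidates containing it at the border are moved behind the remaining candidates.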

Components of DRUID
Here, we show different parameters for DRUID, relying on the English GENIA dataset without POS filtering of MWE candidates and considering only terms with a frequency of 10 or more. Inspecting the two different components of the DRUID measure (see upper graph in Figure 2), we observe that the uniqueness measure contributes most to the DRUID score. The main effect of the incompleteness component is the downranking of a rather small number of terms with high uniqueness scores, which improves the overall ranking. We can also see that for the top-ranked terms the negative incompleteness score does not improve over the frequency baseline but outperforms the frequency for the middle-ranked candidates. Used in DRUID, we observe a slight improvement for the complete ranking. We achieve a P@500 of 0.474 for the uniqueness scoring and 0.498 for the DRUID score. When filtering the similar entries used for the uq scoring by their similarity score (see bottom graph in Figure 2), we observe that the number of similar n-grams considered seems to be more important than the quality of the similar entries: with increasing filtering, the quality of extracted candidate MWEs diminishes.

Discussion and Data Analysis
The experiments confirm that our DRUID measure, either weighted with the MF or alone, works best across two languages and across different corpus sizes. It also achieves the best results in the absence of POS filtering for candidate term extraction. The optimal weighting of DRUID depends on the nestedness of the MWEs: DRUID with the MF should be used when more than 20% of the candidates are nested, and the log-frequency or no frequency weighting when there are almost no nested candidate terms. We show the best-ranked candidates obtained with our method and with the best competitive method in terms of P@100 for the two smaller corpora. Using the GENIA dataset, our log-frequency-based DRUID (see left column in Table 9) ranks only true MWEs within the 15 top-scored candidates.
The right-hand side shows results extracted with the pre-pruned MF method, which yields three non-MWE terms. Whereas that could be a POS error, the MF and also the C-value are not capable of removing terms starting with stop words. The DRUID score alleviates this problem with the uniqueness factor. For the French dataset our method ranks only one false candidate, whereas the MF (post-pruned) ranks eight non-annotated candidates in the top 15 list (see Table 10). Whereas the unweighted DRUID method scores better than its competitors on the large corpora, the best results are achieved when using DRUID with frequency-based weights on the smaller corpora. For a direct comparison we evaluated the small and large corpora using an equal candidate set. We observed that all methods computed on the large corpora achieve slightly worse results than when computed on the small corpora. Data analysis revealed that we would consider many of the highly ranked "false" candidates to be MWEs.
Therefore we extracted the top ten ranked terms that are not annotated as MWEs from the methods with the best P@100 performance, namely the log(freq) DRUID and the pre-pruned C-value methods.
First, we observed that the first "false" candidate appears at rank 26 for our method and at rank 1 for the C-value. Additionally, only ten out of the top 74 candidates are not annotated as MWEs for our method, compared with 48 for the competitor. When searching for the terms in the MeSH dictionary, we find seven of the terms ranked by our method and two for the competitive method.

Conclusion
Uniqueness is a new mechanism in MWE modeling. Whereas frequency and co-occurrence have been captured in many previous approaches (see Manning and Schütze (1999), and Korkontzelos (2010) for a survey), we boost multiword candidates t by their grade of distributional similarity with single-word terms. We implement such contextual substitutability with a model where the term t can consist of multiword tokens and similarity is measured based on the left and right neighboring word between all (single and multiword) terms. Since it is the default to express concepts with single words, a high uniqueness score is given to multiwords that belong to a category just as single words would. E.g. for an English open-domain corpus, hot dog is most similar to the terms food, burger, hamburger, sausage and roadside. Candidates with a low number of single-word similarities also serve the same function, but more frequently we observe single n-grams with function words or modifying adjectives concatenated with content words, e.g. small dog is most similar to "various cat", "large amount of", "large dog", "certain dog", "dog". To be able to kick in, the measure requires a certain minimum frequency for candidates in order to find enough contextual overlap with other terms. Additionally, we demonstrate effective performance on larger corpora and show the applicability of the measure in a completely unsupervised evaluation setting.