Towards a Variability Measure for Multiword Expressions

One of the most outstanding properties of multiword expressions (MWEs), especially verbal ones (VMWEs), important both in theoretical models and applications, is their idiosyncratic variability. Some MWEs are always continuous, while some others admit certain types of insertions. Components of some MWEs are rarely or never modified, while some others admit either specific or unrestricted modification. This unpredictable variability profile of MWEs hinders modeling and processing them as “words-with-spaces” on the one hand, and as regular syntactic structures on the other hand. Since variability of MWEs is a matter of scale rather than a binary property, we propose a 2-dimensional language-independent measure of variability dedicated to verbal MWEs based on syntactic and discontinuity-related clues. We assess its relevance with respect to a linguistic benchmark and its utility for the tasks of VMWE classification and variant identification on a French corpus.


Introduction
Multiword expressions (MWEs), in particular verbal ones (VMWEs), are groups of words whose meaning does not derive from the meaning of their components and from their syntactic structure in a regular way (Gross, 1982), like pay a visit and take the cake 'be the most remarkable of its kind'. MWEs exhibit some degree of variability. On the one hand, they allow internal inflection (paid many visits), insertions (pay annual visits) and syntactic transformations (visits paid last month). On the other hand, they can block variation that is typical of ordinary expressions with the same syntactic structure, such as inflection (#take a turn vs. take turns), diathesis alternation (#he cast the die vs. the die is cast 'the point of no retreat is passed'), or adjunction of modifiers (#take the sweet cake); here and below, # signals a loss of idiomatic meaning. This leads to variation schemes which are specific to subclasses of MWEs, that is, MWE variability is idiosyncratic.
Variability, also known as flexibility, has been considered a key property of MWEs in linguistic studies (Gross, 1988; Tutin, 2016; Nunberg et al., 1994; Sheinfux et al., 2017). It was also highlighted as a major challenge in NLP models and applications (Constant et al., 2017). Variants are pervasive (Jacquemin, 2001) and hinder straightforward search of MWE citation forms in a corpus (Nissim and Zaninello, 2013). They introduce discontinuities which challenge sequence labeling approaches. Even when employing parsers to cope with discontinuities, MWE recognizers can still fail to capture some syntactic transformations such as complex determiners, which can break a direct link between a verb and a noun in a dependency tree (pay a series of visits). These facts have important implications for downstream tasks and applications, e.g. parsers can suffer heavily from incorrectly identified MWEs (Baldwin et al., 2004).
The restricted variability of MWEs as compared to their regular counterparts can also be seen as an advantage in their automatic discovery (Weller and Heid, 2010; Tsvetkov and Wintner, 2014; Buljan and Šnajder, 2017). Substitution-based MWE discovery techniques based on lexico-semantic variability have been widely explored (Pearce, 2001; Farahmand and Henderson, 2016). Morphological and syntactic variability, however, have rarely been studied for MWE discovery (Ramisch et al., 2008) and even less so for in-context identification (Fazly et al., 2009).
Given the importance of MWE variability (Constant et al., 2017) as well as its gradual nature, especially for VMWEs, we suggest that this phenomenon should be subject to measurement. This paper presents measures of VMWE variability based on variant-to-variant similarity, taking syntactic variability and linear discontinuity into account (Sec. 2-3).² Our proposal is evaluated on a French corpus (Sec. 4). We assess the relevance of our measure with respect to a linguistic benchmark (Sec. 5), and we study its usability for VMWE classification (Sec. 6) and variant identification (Sec. 7). Then, we conclude and sketch perspectives to extend our proposal to other languages and to an unsupervised framework (Sec. 8).

Variant-to-variant similarity
To capture the variability of a VMWE, we rely on pairwise comparison of its occurrences. Fig. 1 shows the dependency trees of sentences containing two variants, henceforth $V_1$ and $V_2$, of prendre une décision 'to take a decision'. $V_1$ and $V_2$ exhibit some common and some divergent syntactic and linear properties. For instance, the noun decision governs a determiner (det) and an adjectival modifier (amod) both in $V_1$ and in $V_2$, and a relative clause (acl:relcl) in $V_2$. The verb take governs a nominal subject (nsubj), an object (obj) and adverbial modifiers (adv) in both $V_1$ and $V_2$, and an auxiliary (aux) in $V_2$. External elements are inserted between the lexicalized ones in both variants. Their POS are adv (twice), det and adj in $V_1$, and pron, propn, aux and adv in $V_2$, i.e. one POS (adv) is shared.

In order to measure both these common characteristics and discrepancies, we define the similarity of two VMWE variants on the basis of the similarity of their components and of the external inserted elements. A lexicalized component, or simply a component, of a VMWE $E$ is one which is realized by the same lexeme in any variant of $E$.⁴ All variants of $E$ necessarily have the same number of lexicalized components, which are lemmatized and lexicographically sorted, yielding a canonical form of $E = (C_1, C_2, \ldots, C_n)$ which uniquely represents it.⁵ By $C_i^j$ we denote the form that component $i$ takes in variant $j$. For instance, in Fig. 1, $C_1$ = décision, $C_2$ = prendre, $E$ = (décision, prendre), $C_1^1$ = décision, $C_1^2$ = décisions, $C_2^1$ = prennent and $C_2^2$ = prises.

² Morphological variability is disregarded in this paper, as it did not prove influential in the experiments described here.
⁴ Lexicalized components are highlighted in bold.
⁵ We neglect rare cases of VMWEs sharing a canonical form, e.g. fermer les yeux 'close the eyes'⇒'pretend not to see' vs. fermer l'oeil 'close the eye'⇒'have a nap'.

Similarity of objects (components or VMWEs) is measured by the Sørensen-Dice coefficient, which is defined as

$$S(O_1, O_2) = \frac{2 \times |P(O_1) \cap P(O_2)|}{|P(O_1)| + |P(O_2)|}$$

where $P(O_1)$ and $P(O_2)$ denote the sets of (relevant) properties exhibited by objects $O_1$ and $O_2$. We now define two variant-to-variant similarity measures: syntactic, focusing on the outgoing dependencies, and linear, based on insertions.
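The Sørensen-Dice coefficient over property sets can be illustrated with a minimal Python sketch. The function name `dice` and the treatment of two empty property sets (returned as 1.0) are assumptions made for illustration; the paper does not specify the empty case.

```python
def dice(props_a, props_b):
    """Sørensen-Dice coefficient between two collections of properties.

    Duplicates are ignored: each property counts once per object.
    """
    a, b = set(props_a), set(props_b)
    if not a and not b:
        # Assumption (not stated in the paper): two objects without any
        # property are treated as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Outgoing dependencies of the verb in the two variants described in the text.
verb_v1 = ["nsubj", "obj", "adv", "adv"]
verb_v2 = ["nsubj", "obj", "adv", "aux"]
print(dice(verb_v1, verb_v2))  # 2 * 3 / (3 + 4), i.e. about 0.857
```

Working on sets rather than lists is what makes repeated properties count only once.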

Syntactic similarity
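Syntactic similarity $S_S$ is based on the dependencies between a VMWE and its external elements. It allows us to account for long-distance arguments and modifiers not necessarily included between the lexicalized components. The similarity of each pair of lexicalized components is calculated first, and then averaged for the whole VMWE. For each component, the set of outgoing dependencies is considered and relations of the same type are counted once. In the two sentences given in Fig. 1, with the dependencies listed in Sec. 2, the syntactic similarities of the noun $C_1$ and of the verb $C_2$ are:

$$S_S(C_1^1, C_1^2) = \frac{2 \times |\{\text{amod}, \text{det}\}|}{|\{\text{amod}, \text{det}\}| + |\{\text{acl:relcl}, \text{amod}, \text{det}\}|} = \frac{4}{5}$$

$$S_S(C_2^1, C_2^2) = \frac{2 \times |\{\text{adv}, \text{nsubj}, \text{obj}\}|}{|\{\text{adv}, \text{nsubj}, \text{obj}\}| + |\{\text{adv}, \text{aux}, \text{nsubj}, \text{obj}\}|} = \frac{6}{7}$$

Variant-to-variant syntactic similarity is the weighted average of the per-component scores:

$$S_S(V_1, V_2) = \sum_{i=1}^{n} w_i \, S_S(C_i^1, C_i^2)$$

where weights $w_1, \ldots, w_n$ sum up to 1. For instance, with uniform weights $w_1 = w_2 = \frac{1}{2}$:

$$S_S(V_1, V_2) = \frac{1}{2} \times \frac{4}{5} + \frac{1}{2} \times \frac{6}{7} = \frac{29}{35} \approx 0.83$$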
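The weighted average of per-component scores can be sketched as follows. This is only an illustration under stated assumptions: the function names and the list-of-sets input layout are hypothetical, and the dependency sets are the ones listed in the text for the two variants of prendre une décision.

```python
def dice(props_a, props_b):
    """Sørensen-Dice coefficient between two property collections."""
    a, b = set(props_a), set(props_b)
    if not a and not b:
        return 1.0  # assumption: two empty property sets are identical
    return 2 * len(a & b) / (len(a) + len(b))


def syntactic_similarity(deps_v1, deps_v2, weights):
    """Weighted average of per-component Dice scores.

    deps_v1, deps_v2: one set of outgoing dependency labels per lexicalized
    component, in canonical (lexicographic) component order.
    weights: one weight per component; they must sum to 1.
    """
    if not (len(deps_v1) == len(deps_v2) == len(weights)):
        raise ValueError("one dependency set and one weight per component")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * dice(d1, d2) for w, d1, d2 in zip(weights, deps_v1, deps_v2))


# Components in canonical order: (décision, prendre).
v1 = [{"det", "amod"}, {"nsubj", "obj", "adv"}]
v2 = [{"det", "amod", "acl:relcl"}, {"nsubj", "obj", "adv", "aux"}]

uniform = syntactic_similarity(v1, v2, [0.5, 0.5])    # 29/35, about 0.83
noun_only = syntactic_similarity(v1, v2, [1.0, 0.0])  # 4/5
print(uniform, noun_only)
```

The second call mirrors a weighting in which only the noun contributes, which is one possible choice of the weights $w_i$.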

Linear similarity
Linear similarity $S_L$ is defined for two VMWE variants in terms of the POS of the elements inserted between the lexicalized components. The number of insertions with the same POS is disregarded. In this way we focus on whether an insertion of a certain POS is admitted, rather than on how often it occurs. For example, the two adv insertions in $V_1$ (vraiment 'really' and pas 'NEG') are only counted once:

$$S_L(V_1, V_2) = \frac{2 \times |\{\text{adv}\}|}{|\{\text{adj}, \text{adv}, \text{det}\}| + |\{\text{adv}, \text{aux}, \text{pron}, \text{propn}\}|} = \frac{2}{7}$$

Figure 1: Two POS-tagged and dependency-parsed occurrences of prendre une décision 'take a decision'.
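Linear similarity can be sketched in Python as a Dice score over the POS tags found between the first and the last lexicalized token. The helper names and the `(pos, is_lexicalized)` token layout are assumptions for illustration, and the token order below is hypothetical: only the POS tags of the insertions are taken from the text.

```python
def dice(props_a, props_b):
    """Sørensen-Dice coefficient between two property collections."""
    a, b = set(props_a), set(props_b)
    if not a and not b:
        return 1.0  # assumption: two continuous variants are fully similar
    return 2 * len(a & b) / (len(a) + len(b))


def inserted_pos(tokens):
    """POS tags of non-lexicalized tokens lying between the first and the
    last lexicalized token. tokens: list of (pos, is_lexicalized) pairs in
    sentence order. Repeated tags collapse into one set element."""
    lexicalized = [i for i, (_, is_lex) in enumerate(tokens) if is_lex]
    if not lexicalized:
        return set()
    span = tokens[lexicalized[0]:lexicalized[-1] + 1]
    return {pos for pos, is_lex in span if not is_lex}


def linear_similarity(tokens_v1, tokens_v2):
    return dice(inserted_pos(tokens_v1), inserted_pos(tokens_v2))


# Hypothetical token orders reproducing the insertions reported in the text.
v1 = [("verb", True), ("adv", False), ("adv", False),
      ("det", False), ("adj", False), ("noun", True)]
v2 = [("noun", True), ("pron", False), ("propn", False),
      ("aux", False), ("adv", False), ("verb", True)]
print(linear_similarity(v1, v2))  # 2 * 1 / (3 + 4), i.e. about 0.286
```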

VMWE variability
Given the two similarity measures $S_S$ and $S_L$ between variants $V_1$ and $V_2$ of a VMWE $E$, the rigidity scores of $E$ are the averages over all pairs of $E$'s variants. For example, if take decision occurs 6 times, we average the scores $S_S$ and $S_L$ of $\binom{6}{2} = 15$ pairs:

$$R_Y(E) = \frac{1}{\binom{m}{2}} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S_Y(V_i(E), V_j(E))$$

where $Y \in \{S, L\}$, $m$ is the number of $E$'s variants in the corpus, and $V_i(E)$ is the $i$-th variant. Note that the rigidity measures defined above range from 0 to 1. The variability of $E$ can thus be defined as the complement of rigidity:

$$V_Y(E) = 1 - R_Y(E)$$

Experiments were performed in order to estimate the relevance and utility of these measures. Parameter values were chosen empirically and are presented in Appendix A. In the long run, these parameters should be estimated experimentally, possibly in an application-specific manner.
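Averaging a similarity over all unordered pairs of variants, and taking the complement, can be sketched as below. The function names are hypothetical, the similarity function is passed in as a parameter, and the three toy variants (sets of inserted POS tags) are invented purely to show the arithmetic.

```python
from itertools import combinations


def dice(props_a, props_b):
    """Sørensen-Dice coefficient between two property collections."""
    a, b = set(props_a), set(props_b)
    if not a and not b:
        return 1.0  # assumption: two empty property sets are identical
    return 2 * len(a & b) / (len(a) + len(b))


def rigidity(variants, similarity):
    """Mean pairwise similarity over all unordered pairs of variants."""
    pairs = list(combinations(variants, 2))
    if not pairs:
        raise ValueError("at least two variants are needed")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def variability(variants, similarity):
    """Complement of rigidity; lies in [0, 1] when similarity does."""
    return 1.0 - rigidity(variants, similarity)


# Six occurrences yield 15 pairs, as in the text.
print(len(list(combinations(range(6), 2))))

# Toy variants, each described by its set of inserted POS tags (invented).
toy = [{"det"}, {"det", "adj"}, {"det", "adv"}]
print(variability(toy, dice))  # 1 - (2/3 + 2/3 + 1/2) / 3 = 7/18
```

Note that the number of pairs grows quadratically with the number of occurrences, which matters for very frequent VMWEs.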

Corpus
We use the French part of the PARSEME corpus, manually annotated for VMWEs in 18 languages (Savary et al., 2017). Among its 4 VMWE categories two are particularly relevant: idioms (IDs) and light-verb constructions (LVCs). The VMWE annotations in the corpus are accompanied by morphological and syntactic layers, as shown in Fig. 1. In the morphological layer, lemmas, POS and morphological features are assigned to each token. The syntactic layer represents syntactic dependencies between tokens. Both result from manual annotation and use UD tagsets. The corpus is divided into a training corpus (TrC) and a test corpus (TeC). TrC contains 17,880 sentences, 450,221 tokens, and 4,462 VMWE occurrences, including 1,786 occurrences of 502 unique IDs and 1,362 occurrences of 672 unique LVCs. On average, each ID has 3.6 variants and each LVC has 2 variants. The frequency of individual VMWEs varies greatly (from 1 to 172) and so does the reliability of the variability estimation of each MWE. Hence, only the most frequent VMWEs are considered in Sec. 6.

Linguistic relevance
In order to estimate the relevance of our measures, we refer to an existing corpus study by Tutin (2016). There, 30 French VMWEs of the form Verb-(Det)-Noun are studied with respect to 5 morpho-syntactic variation types. This yields 6 variability levels depending on how many of the 5 variability types a VMWE exhibits. This is illustrated in Tab. 1 with three VMWEs which stand at distinct levels of the variability spectrum.
Tutin's variability types are defined in terms of complex linguistic phenomena, such as admitting passivization and relative constructions, which have to be validated manually. We, conversely, are in need of fully automatic procedures. Therefore we capture the VMWE variability in distinct ways. It is interesting to see how far both approaches agree on their conclusions.

Relative construction: la décision qu'il prend 'the decision which he takes'; #la porte qu'il ferme 'the door which he closes'; #lieu qu'il donne 'place which it gives'
Adjunction of noun modifiers: prendre une grande décision 'take a great decision'; #fermer la grande porte 'close the great door'; #donner un grand lieu 'give a great place'

To this aim, we extract from TrC all occurrences of the 30 VMWEs covered by Tutin and retain those with at least 2 occurrences (measuring similarity requires at least two variants). Tab. 2 shows the distribution of the resulting set $S$ of 18 VMWEs into Tutin's levels. While their corpus frequency is relatively high at levels 0, 1 and 5, it is low at levels 2, 3 and 4. Therefore we aggregate neighboring levels into 3 subsets: $S_{0-1}$, $S_{2-4}$ and $S_5$. For each VMWE in $S$ we calculate $V_L$ and $V_S$ with weight $w_i = 1$ for the noun and 0 for the verb and the determiner (if any). As shown by the corresponding boxplots in Fig. 2 (a-b), $V_L$ tends to increase with Tutin's level. That is to say, the more variable VMWEs are (as judged by an expert linguist on the basis of a manual corpus study), the higher their automatically calculated linear variability value. Tutin's extreme levels 0-1 and 5 are particularly well discriminated by $V_L$.⁷ No interesting tendency could be observed for the syntactic variability of the noun. We hypothesize that different outgoing dependencies have different roles in modeling syntactic variability. For instance, in aller dans le bon sens 'go to the right direction'⇒'evolve positively', the dependency between the noun and the modifier bon 'good' probably tells us more about the rigidity of this VMWE than its case-marking preposition dans 'in' or its determiner le 'the'. In future work, we would like to address experimental estimation of weights for different dependency relations in $S_S$.

⁷ The Wilcoxon-Mann-Whitney (WMW) test confirms that $S_5$ differs from $S_{0-1}$ with significance at α = 0.05.

VMWE classification
LVCs are known to have a relatively regular morphosyntactic behavior as compared to IDs, which tend to be more rigid. We expect our variability measures to help discriminate these categories. We selected those VMWEs whose frequency in TrC was higher than 9, i.e. 12 IDs and 17 LVCs. We then calculated $V_S$ and $V_L$ for each selected VMWE. As shown in Fig. 3, a strong ID vs. LVC discriminative power can be attributed especially to $V_L$, given that the variability of IDs never exceeds 0.3, while it reaches 0.94 for LVCs.
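The observation above suggests a simple decision rule, sketched here only as an illustration and not as the classifier evaluated in the paper. The function names, the expression keys and all numeric inputs are invented; the threshold 0.3 and the frequency cut-off (more than 9 occurrences) are read off the text. Since IDs never exceed 0.3 but LVCs may fall below it, low values are left undecided.

```python
def frequent(counts, min_exclusive=9):
    """Keep VMWEs occurring more than `min_exclusive` times."""
    return {expr for expr, n in counts.items() if n > min_exclusive}


def guess_category(v_l, threshold=0.3):
    """One-sided rule: linear variability above the threshold suggests an
    LVC; at or below it, the rule does not decide."""
    return "LVC" if v_l > threshold else "undecided"


# Invented toy inputs: occurrence counts and linear variability per VMWE.
counts = {"expr_a": 12, "expr_b": 3, "expr_c": 25}
v_linear = {"expr_a": 0.45, "expr_b": 0.80, "expr_c": 0.10}

labels = {e: guess_category(v_linear[e]) for e in sorted(frequent(counts))}
print(labels)
```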

Identification of VMWE variants
As shown by Fazly et al. (2009), English MWEs exhibit lower variability than non-MWEs. Thus, variability measures can help identify MWEs in running text. We test this hypothesis for French using $S_L$ and $S_S$, which model variant similarity differently from this seminal work. To this aim, we adapted the method proposed by Savary and Cordeiro (2018) to consider all VMWEs of the form Verb-(Det)-Noun annotated in TrC and extract their candidate occurrences in TeC. For instance, if TrC contains the expression $e$ = perdre pied 'lose foot'⇒'lose self-confidence', then the extracted TeC candidates, noted Cand($e$), contain true variants of $e$ (e.g. ces obstacles me font perdre pied 'these obstacles make me lose my self-confidence'), literal readings of $e$ (e.g. il a perdu le pied gauche 'he lost his left foot'), and coincidental occurrences of $e$'s components (e.g. traces des pieds de l'enfant perdu 'traces of the lost child's feet'). Our hypothesis is that $S_S$ and $S_L$ should be able to distinguish true VMWEs from literal and accidental occurrences, thus being useful for supervised VMWE identification. More precisely, we hypothesize that the more a candidate resembles a known VMWE occurrence, the more likely it is to be a VMWE.
We extracted 195 candidates $c \in$ Cand($e$) from TeC. For each candidate $c$, we calculated the minimum similarities $S_L(e, c)$, $S_S(e, c)$ and the average of both, $S_{L-S}(e, c)$, over all occurrences of $e$ in TrC. Interesting results were obtained mainly with $S_L$. Fig. 4 shows pairwise comparison of the minimal value of $S_L(e, c)$ when IDs and LVCs are considered jointly (boxplots 1-2), or separately (boxplots 3-6). In each case $S_L$ clearly delimits false from true positives.
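Scoring a candidate by its minimum similarity to the known occurrences of an expression can be sketched as follows. This is a simplified illustration under several assumptions: the function names are hypothetical, only the Dice coefficient is used (whereas Appendix B mentions a mean of four coefficients), two empty insertion sets are treated as fully similar, and the occurrence data are invented.

```python
def dice(props_a, props_b):
    """Sørensen-Dice coefficient between two property collections."""
    a, b = set(props_a), set(props_b)
    if not a and not b:
        return 1.0  # assumption: two continuous occurrences are identical
    return 2 * len(a & b) / (len(a) + len(b))


def min_similarity(candidate, known_occurrences, similarity=dice):
    """Lowest similarity between a candidate and any known occurrence."""
    if not known_occurrences:
        raise ValueError("no known occurrence to compare with")
    return min(similarity(candidate, occ) for occ in known_occurrences)


# Invented data: each occurrence is described by its set of inserted POS.
known = [set(), set()]    # e.g. continuous training occurrences
idiomatic_like = set()    # candidate without insertion
literal_like = {"det"}    # candidate with an inserted determiner

print(min_similarity(idiomatic_like, known))  # 1.0
print(min_similarity(literal_like, known))    # 0.0
```

Taking the minimum is a conservative choice: a single dissimilar training occurrence is enough to lower the score of a candidate.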

Conclusions and future work
We defined syntactic and linear measures of VMWE variability. They use pairwise similarity based on expert linguistic knowledge. We showed that linear variability agrees with a linguistic benchmark, with a statistically significant difference between the benchmark's extreme variability levels. We also found that linear similarity proves useful in VMWE classification and identification, which is particularly interesting in comparison to the seminal work by Fazly et al. (2009), who do not consider this kind of similarity.
These definitions and estimations should be further improved to deal with other MWE categories, not only verb-noun combinations. Our similarity measures rely on language-independent assumptions: they can be applied to any MWE-annotated corpus containing POS tags and dependency trees. If these morphosyntactic annotations use the unified UD tagsets, cross-language MWE variability studies can be carried out. Therefore, our experiments will be extended to all languages accounted for in the PARSEME corpus. Task-specific parameter tuning should show which parameters are shared by all/many languages and/or tasks, and which have to be language- and task-specific. Morphological variability, including both inflection and derivation (as in refaire appel 're-make appeal'⇒'to call on again'), temporarily abandoned for French, could be examined in a multilingual context. Finally, the measures should be adapted to an unsupervised context, to scale them up to larger VMWE vocabularies and languages with no MWE-annotated corpora. For instance, MWE variant candidates could be extracted from automatically parsed text, using lists of known MWE lemmas (Savary and Cordeiro, 2018).
We believe that with these extensions our variability measures will offer a unified framework for describing variability profiles of MWEs, which should be useful both in theoretical and applied research. They could help: (i) disambiguate literal vs. idiomatic readings of VMWEs, (ii) conflate variants of the same MWE to reduce information variation in text, (iii) measure the sensitivity of NLP tools to variability, (iv) define variability-specific evaluation measures in MWE identification to boost the efficient recognition of variants.

B Similarity coefficients used in the variant-to-variant similarity
Similarity between two datasets X and Y is given by the following formulae: card (
The variant-to-variant similarity defined in Sec. 7 uses the arithmetic mean of these four coefficients.