Modeling the Non-Substitutability of Multiword Expressions with Distributional Semantics and a Log-Linear Model

Non-substitutability is a property of Multiword Expressions (MWEs) that often causes lexical rigidity and is relevant for most types of MWEs. Efficient identification of this property can result in the efficient identification of MWEs. In this work we propose using distributional semantics, in the form of word embeddings, to identify candidate substitutions for a candidate MWE and model its substitutability. We use our models to rank MWEs based on their lexical rigidity and study their performance in comparison with association measures. We also study the interaction between our models and association measures. We show that one of our models can significantly improve over the association measure baselines in identifying collocations.


Introduction
Multiword expressions (MWEs), commonly referred to as collocations, are idiosyncratic sequences of words whose idiosyncrasy can be broadly classified into semantic, statistical, and syntactic classes. Semantic idiosyncrasy (also referred to as non-compositionality) means that the meaning of an MWE cannot be inferred from the meaning of its components, as in loan shark. Syntactic idiosyncrasy refers to the situation where the syntax of an MWE does not follow syntactic rules, as in in short. Statistical idiosyncrasy means that the components of an MWE co-occur more often than expected by chance, as in swimming pool. The range of types of idiosyncrasy included in MWEs has been characterized in several other ways (Baldwin and Kim, 2010; Sag et al., 2002). To avoid getting mired in this uncertainty, which mainly emerges when dealing with borderline MWEs that lie between completely idiosyncratic and fully compositional, we subscribe to the viewpoint of McCarthy et al. (2007): we treat idiosyncrasy as a spectrum and focus only on the (very) idiosyncratic end of this spectrum.
MWEs have applications in different areas of NLP and linguistics, for instance statistical machine translation (Ren et al., 2009; Carpuat and Diab, 2010); shallow parsing (Korkontzelos and Manandhar, 2010); language generation (Hogan et al., 2007); opinion mining (Berend, 2011); and corpus linguistics and language acquisition (Ellis, 2008). In general, as Green et al. (2011) point out, "MWE knowledge is useful, but MWEs are hard to identify."
In this work, we propose a method of identifying MWEs based on their non-substitutability. Non-substitutability means that the components of an MWE cannot be replaced with their synonyms (Manning and Schütze, 1999; Pearce, 2001). It implies statistical idiosyncrasy, which is relevant for all kinds of MWEs, so identifying non-substitutability in text results in the identification of a wide range of MWEs.
In MWE research, non-substitutability has been widely considered but never thoroughly studied, except in a few works that present limited, low-coverage models of the concept.
We develop a model that takes into account the semantics of words for identifying statistical idiosyncrasy, but is highly generalizable and does not require supervision or labor-intensive resources. The proposed model uses distributional semantics, in the form of word embeddings, to identify semantically similar words for the components of a candidate MWE. Non-substitutability is then measured for the candidate MWE using log-linear models, which are also computed from word embeddings. Our proposed models result in an improvement over the state of the art.

Syntactic Categories of MWEs
From a syntactic point of view, MWEs are very heterogeneous, including light verb constructions, phrasal verbs, noun compounds, verb-object combinations and others. In this work, however, we focus only on noun compounds for the following reasons: (i) They are the most productive and frequent category of MWEs. (ii) There are more datasets of compounds available for evaluation. (iii) Focusing on one controlled category allows us to focus on modeling and detecting idiosyncrasy in isolation, avoiding complexities such as gappy MWEs. We also focus only on two-word noun compounds, because higher-order ones are relatively rare.

Related Work
Identification of statistical idiosyncrasy of MWEs seems to have been first formally discussed in Choueka et al. (1983), who propose a statistical index to identify collocates. This line of work was further developed into more efficient measures of collocation extraction such as Pointwise Mutual Information (Church and Hanks, 1990), t-score (Church et al., 1991; Manning and Schütze, 1999), and Likelihood Ratio (Dunning, 1993). Smadja (1993) proposes a set of statistical scores that can be used to extract collocations. Evert (2005) and Pecina (2010) study a wide range of association measures that can be employed to rank and classify collocations, respectively. Other work assumes that a word pair is a true MWE if the conditional probability of one word given the other is greater than the conditional probability of that word given synonyms of the other word, and Riedl and Biemann (2015) and Farahmand and Martins (2014) use contextual features to identify MWEs.

Modeling Non-Substitutability
As discussed earlier, we model statistical idiosyncrasy based on an assumption inspired by non-substitutability, which means that the components of an MWE cannot be replaced with their near synonyms. Let $w_1 w_2$ represent a word pair. We make the same assumption as earlier work, namely that $w_1 w_2$ is statistically idiosyncratic if:
$$P(w_2 \mid w_1) > P(w_2 \mid \mathrm{sim}(w_1)) \tag{1}$$
where $\mathrm{sim}(w_i)$ (defined below in Section 3.1) represents the words that are similar to $w_i$. With respect to noun-noun compounds, this inequality roughly means that for an idiosyncratic compound, the probability of the headword ($w_2$) co-occurring with the modifier ($w_1$) is greater than the probability of the headword co-occurring with "synonyms" of the modifier (e.g. climate change is more probable than weather change). This, however, is not the case for non-idiosyncratic or less idiosyncratic compounds (e.g. film director, which is substitutable with movie director).
Farahmand and Nivre (2015) estimate a similar probability, in both directions, with the help of WordNet synsets. They show that the model that considers the probabilities in both directions outperforms the model that considers only one direction (head conditioned on modifier).
To study and model the effects of direction we also consider the following inequality:
$$P(w_1 \mid w_2) > P(w_1 \mid \mathrm{sim}(w_2)) \tag{2}$$
Intuitively, inequality (1) plays a more important role in lexical rigidity than inequality (2), but this is something we study in Section 4. In related work, Pearce (2001) extracts the synonyms of the constituents of a compound, creates new phrases called anti-collocations, and decides whether the candidate MWE is a true MWE based on the number of its anti-collocations.

Modeling Semantically Similar Words
In previous work, WordNet synsets were employed to model the sim() function. The obvious limitation of such an approach is coverage. Other limitations include costliness and labor intensiveness of updating and expanding such a knowledge base. In this work, we use cosine similarity between word embeddings to represent semantically similar words (that include but are not limited to synonyms). This may result in a drop in precision, but the coverage will be immensely improved. Moreover, similarity in the word embedding space is shown to provide a relatively good approximation of synonymy (Chen et al., 2013).
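The similarity lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `cosine` and `most_similar`, the parameter `n`, and the 3-dimensional toy vectors are all assumptions made for the example (the embeddings used in this work have 50 dimensions).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings, n=5):
    """Return the n words whose embeddings are closest to `word` by cosine."""
    target = embeddings[word]
    scored = [(cosine(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:n]]

# Toy embeddings, invented for illustration only.
toy_embeddings = {
    "climate": [0.9, 0.1, 0.0],
    "weather": [0.8, 0.2, 0.1],
    "film":    [0.0, 0.9, 0.3],
    "movie":   [0.1, 0.8, 0.4],
}
similar_to_climate = most_similar("climate", toy_embeddings, n=1)
similar_to_film = most_similar("film", toy_embeddings, n=1)
```

With these toy vectors, the nearest neighbour of climate is weather and the nearest neighbour of film is movie, which mirrors the substitution candidates used in the examples above.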

Ranking with Log-Linear Models
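We estimate the probabilities presented in (1) and (2) using a log-linear model. Let $\phi(w_i)$ represent the word embedding of $w_i$, where $\phi(w_i) \in \mathbb{R}^{50}$.
$$P(w_2 \mid w_1) = \frac{\exp(v_{w_2} \cdot \phi(w_1))}{\sum_{w'_2} \exp(v_{w'_2} \cdot \phi(w_1))} \tag{3}$$
where $v_{w_i}$ is a parameter vector and $v$ is the model's parameter matrix. The analogous equation is used to define $P(w_1 \mid w_2)$. Let $S_{w_i}$ represent the set of the top-$n$ words $w_j$ whose embeddings $\phi(w_j)$ are most similar to $\phi(w_i)$, $S_{w_i} = \{w_j \mid w_j \in \mathrm{nGreatest}(w_i \cdot w_j)\}$. $P(w_2 \mid \mathrm{sim}(w_1))$ can then be estimated as:
$$P(w_2 \mid \mathrm{sim}(w_1)) = \frac{1}{|S_{w_1}|} \sum_{w_j \in S_{w_1}} P(w_2 \mid w_j) \tag{4}$$
where $P(w_2 \mid w_j)$ is defined in (3). Again, the analogous equation defines $P(w_1 \mid \mathrm{sim}(w_2))$.
Combining these gives us the following version of (1), and an analogous version of (2).
$$\frac{\exp(v_{w_2} \cdot \phi(w_1))}{\sum_{w'_2} \exp(v_{w'_2} \cdot \phi(w_1))} > \frac{1}{|S_{w_1}|} \sum_{w_j \in S_{w_1}} \frac{\exp(v_{w_2} \cdot \phi(w_j))}{\sum_{w'_2} \exp(v_{w'_2} \cdot \phi(w_j))} \tag{5}$$
Given that MWEs lie on a continuum of idiosyncrasy, it is natural to treat identification of MWEs as a ranking problem. We therefore define an unsupervised ranking function as follows:
$$\delta_{21} = P(w_2 \mid w_1) - P(w_2 \mid \mathrm{sim}(w_1)) \tag{6}$$
An analogous function $\delta_{12}$ is defined from (2).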
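To make the scoring concrete, here is a minimal sketch of how a log-linear (softmax) estimate of P(w2 | w1) and a substitutability score might be computed. Several things here are assumptions rather than details confirmed by the text: the softmax normalisation over all words in the parameter matrix, averaging P(w2 | wj) over the similar words, scoring by the difference between the two sides of the inequality, and all names (`conditional_prob`, `prob_given_similar`, `delta`) and toy values.

```python
import math

def conditional_prob(target, context, phi, v):
    """Softmax estimate of P(target | context).

    phi maps a context word to its embedding; v maps each candidate
    target word to its parameter vector (one row of the parameter matrix).
    """
    scores = {w: sum(a * b for a, b in zip(v[w], phi[context])) for w in v}
    max_score = max(scores.values())  # subtract the max for numerical stability
    denom = sum(math.exp(s - max_score) for s in scores.values())
    return math.exp(scores[target] - max_score) / denom

def prob_given_similar(target, similar_words, phi, v):
    """Average of P(target | w_j) over the similar words (assumed form)."""
    total = sum(conditional_prob(target, w, phi, v) for w in similar_words)
    return total / len(similar_words)

def delta(target, context, similar_words, phi, v):
    """Margin by which P(target | context) exceeds P(target | sim(context))."""
    return (conditional_prob(target, context, phi, v)
            - prob_given_similar(target, similar_words, phi, v))

# Toy 2-dimensional embeddings and parameters, invented for illustration.
phi = {"climate": [1.0, 0.0], "weather": [0.0, 1.0]}
v = {"change": [2.0, 0.0], "director": [0.0, 0.0]}

p_change_given_climate = conditional_prob("change", "climate", phi, v)
delta_climate_change = delta("change", "climate", ["weather"], phi, v)
```

In this toy setting the score for climate change is positive, so the compound would count as non-substitutable with respect to its modifier; a compound such as film director would be expected to score near or below zero.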

Evaluation
As our evaluation set we used a dataset of 1042 English noun compounds annotated for statistical and semantic idiosyncrasy. Each compound is annotated by four judges with two binary votes, one for its semantic and one for its statistical idiosyncrasy. As our baselines we use three measures that have been widely used as a means of identifying collocations: Pointwise Mutual Information (PMI) (Church and Hanks, 1990; Evert, 2005; Bouma, 2009; Pecina, 2010), t-score (Manning and Schütze, 1999; Church et al., 1991; Evert, 2005; Pecina, 2010), and Log-likelihood Ratio ($LL_r$) (Dunning, 1993; Evert, 2005).
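For reference, two of the baseline association measures can be sketched from raw corpus counts as below. This is an illustrative sketch using the standard textbook definitions of PMI and the t-score approximation for bigrams; the function names and the counts in the example are invented, and the log-likelihood ratio is left out to keep the example short.

```python
import math

def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) of a word pair.

    PMI = log2( P(w1 w2) / (P(w1) * P(w2)) ), with probabilities
    estimated as relative frequencies over `total` observations.
    """
    return math.log2(pair_count * total / (w1_count * w2_count))

def t_score(pair_count, w1_count, w2_count, total):
    """t-score: (observed - expected) / sqrt(observed), in counts."""
    expected = w1_count * w2_count / total
    return (pair_count - expected) / math.sqrt(pair_count)

# Invented counts: the pair occurs 30 times, its words 100 and 200 times,
# in a corpus of one million bigrams.
example_pmi = pmi(30, 100, 200, 1_000_000)
example_t = t_score(30, 100, 200, 1_000_000)
```

Both measures depend only on co-occurrence counts, which is why they are blind to the semantic substitutions that the proposed models consider.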
Since we are concerned with the idiosyncratic end of the spectrum of MWEs, we look at the identification of MWEs as a ranking problem. To evaluate this ranking, we use precision at k (p@k) as the evaluation metric, considering different values of k.
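Precision at k can be computed as in the following sketch. The function name `precision_at_k`, the scores, and the gold set are assumptions made for illustration; any of the ranking functions or association measures could supply the scores.

```python
def precision_at_k(scores, gold, k):
    """Fraction of the k highest-scoring candidates that are gold MWEs.

    scores: dict mapping each candidate compound to its ranking score.
    gold:   set of compounds annotated as MWEs.
    """
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of candidates")
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = sum(1 for compound in ranked[:k] if compound in gold)
    return hits / k

# Invented scores and gold labels.
toy_scores = {
    "loan shark": 0.9,
    "swimming pool": 0.7,
    "film director": 0.4,
    "red car": 0.1,
}
toy_gold = {"loan shark", "swimming pool"}

p_at_2 = precision_at_k(toy_scores, toy_gold, 2)
p_at_4 = precision_at_k(toy_scores, toy_gold, 4)
```

Here both gold compounds sit at the top of the ranking, so precision is perfect at k = 2 and falls to one half once the whole list is included.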

Individual Models
To train the log-linear model, we first extracted all noun-noun compounds with a frequency of at least 5 from a POS-tagged Wikipedia dump (articles only). This resulted in a list of ≈ 560,000 compounds. We created word embeddings of size 50 for all Wikipedia words with a frequency of at least 5 using word2vec. These word embeddings were used both to determine the set of similar words for each word of a compound and to train the log-linear model by stochastic minimization of the cross entropy. We discarded 30 instances of the evaluation set because word embeddings were not available for at least one of their components (its type frequency being below 5).
To measure precision, we assume that those evaluation set instances that were annotated as statistically or semantically idiosyncratic by three or more judges (out of four) are MWEs and that the other instances are not. This results in a total of 369 positive instances. Figure 1 shows the performance of the different models.
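The labeling rule can be sketched as below. This follows one reading of the criterion, namely that a compound is positive if at least three judges marked it statistically idiosyncratic or at least three marked it semantically idiosyncratic; the function name `is_positive` and the votes are invented for illustration.

```python
def is_positive(statistical_votes, semantic_votes, min_judges=3):
    """True if enough judges voted for either kind of idiosyncrasy.

    Each argument is a list of binary votes (1 = idiosyncratic),
    one vote per judge.
    """
    return (sum(statistical_votes) >= min_judges
            or sum(semantic_votes) >= min_judges)

# Invented annotations: (statistical votes, semantic votes) from four judges.
annotations = {
    "loan shark": ([1, 1, 1, 0], [1, 1, 1, 1]),
    "film director": ([0, 1, 0, 0], [0, 0, 0, 0]),
    "swimming pool": ([1, 1, 1, 1], [0, 1, 0, 0]),
}
gold = {compound for compound, (stat, sem) in annotations.items()
        if is_positive(stat, sem)}
```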
At the top of the ranked list, $\delta_{21}$ outperforms one of the baselines (t-score) but performs similarly to the other two baselines, PMI and $LL_r$. It does, however, show a steadier performance up to p@100. As it moves further from the idiosyncratic end of the spectrum, its precision drops further. $\delta_{12}$, on the other hand, shows a weaker performance, although it outperforms t-score for the most part. The best baseline is PMI and the worst is t-score.
Again, considering lexicalization, the main process that MWEs should undergo to become useful for other NLP applications, a high precision at a small (proportional) k is what we should really be concerned about: lexicons cannot grow too large, so every multiword entry should be sufficiently idiosyncratic and lexically rigid. On the other hand, we do not want to limit a model's ability to generalize by lexicalizing every word sequence that appears slightly idiosyncratic.
Looking back at the models, we know that $\delta_{21}$, PMI, and $LL_r$ independently perform well at the top of their ranked lists. We also know that, in theory, $\delta_{21}$ bases its ranking on relatively different criteria from PMI and $LL_r$. The question we seek to answer in the next section is whether merging these criteria (semantic non-substitutability and statistical association) can improve on the best performance.

Combining Non-Substitutability and Association
Our first combined model of non-substitutability integrates both directions (head to modifier and modifier to head). To emphasize precision, we propose a combination function $H_1$ that requires both $\delta_{21}$ and $\delta_{12}$ to be high.
$$H_1 = \min(\delta_{21}, \delta_{12})$$
By ranking according to the minimum of the scores $\delta_{21}$ and $\delta_{12}$, each highly ranked data point must be highly ranked by both individual models. To combine an association measure with our non-substitutability models we chose PMI, because its performance at the top of the ranked list is better than that of the other baselines. The values of PMI and the $\delta$s have different scales. We measured the linear correlation (Pearson's $r$) between PMI and the $\delta$s to see whether we could rescale the $\delta$s by a linear factor. The correlation was very small, almost negligible, so instead of using min() we combined the two rankings into a second function, $H_2$, by taking their element-wise product (denoted •).
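The min-based combination and the correlation check can be sketched as follows. The names `h1` and `pearson_r` and all score values are assumptions made for illustration; only the min() combination and the use of Pearson's r come from the text.

```python
import math

def h1(delta_21, delta_12):
    """Combine both directions by taking the smaller score per compound."""
    return {c: min(delta_21[c], delta_12[c]) for c in delta_21}

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)

# Invented scores for three candidate compounds.
toy_delta_21 = {"a": 0.5, "b": 0.2, "c": -0.1}
toy_delta_12 = {"a": 0.1, "b": 0.3, "c": 0.4}

combined = h1(toy_delta_21, toy_delta_12)
ranking = sorted(combined, key=combined.get, reverse=True)
```

Taking the minimum means a compound ranks highly only when both directions agree, which favours precision over recall: compound a scores highest under $\delta_{21}$ alone but drops below b once the weaker direction is taken into account.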
We perform the same experiments as in Section 4.1 with the combined models and compare their performance with the best models from the previous experiments. The results can be seen in Figure 2. $H_2$ clearly outperforms the other models at the top of the ranked list, reaching a significantly higher precision. This confirms our assumption that, in practice, association measures and semantically motivated substitutability-based models base their decisions on different, complementary pieces of information. The results for $H_1$ also show that combining $\delta_{21}$ and $\delta_{12}$ improves precision at the top of the ranked list and performs similarly to the best individual model ($\delta_{21}$) at lower k.

Conclusions
We presented a method for identifying MWEs based on their semantic non-substitutability. We assumed that non-substitutability implies statistical idiosyncrasy and modeled this property with word embedding representations and a log-linear model. We looked at MWE identification as a ranking problem due to the nature of idiosyncrasy, which is better defined as a continuum than as a binary phenomenon. We showed that our best model can reach the same performance as the best baseline. We showed that joining our models leads to better performance than that of the baselines and the individual models. We also showed that joining our models, which are aware of semantic non-substitutability, with association measures (the baselines) can result in a model whose performance is significantly higher than that of the baselines.