Heuristically Informed Unsupervised Idiom Usage Recognition

Many idiomatic expressions can be interpreted figuratively or literally depending on their contexts. This paper proposes an unsupervised learning method for recognizing the intended usages of idioms. We treat the usages as a latent variable in probabilistic models and train them in a linguistically motivated feature space. Crucially, we show that distributional semantics is a helpful heuristic for distinguishing the literal usage of idioms, giving us a way to formulate a literal usage metric to estimate the likelihood that the idiom is intended literally. This information then serves as a form of distant supervision to guide the unsupervised training process for the probabilistic models. Experiments show that our overall model performs competitively against supervised methods.


Introduction
Many idiomatic expressions may be interpreted either figuratively or literally. Their intended usages depend on how they fit with their contexts. For example, the idiom "spill the beans" is used figuratively in the first instance below, and literally in the second (instances drawn from Twitter: https://twitter.com/BTeboe/status/958792419302100993 and https://twitter.com/DukeRaccoon/status/477530732173471744). This type of ambiguity is commonplace: prior work suggests that about half of a sample of 60 idioms have a clear literal meaning as well as a figurative one (Fazly et al., 2009). Being able to distinguish the intended usage of an idiom in context has been shown to benefit many natural language processing (NLP) applications, e.g., machine translation and sentiment analysis (Salton et al., 2014; Williams et al., 2015).
While supervised models for idiom usage recognition have had some successes, they require appropriately annotated training examples (Peng et al., 2014; Byrne et al., 2013; Liu and Hwa, 2017). A more challenging problem is to recognize idiom usages without a dictionary or annotated examples (Korkontzelos et al., 2013). Some previous unsupervised models tried to exploit linguistic differences in usages. For example, Fazly et al. (2009) observed that an idiom appearing in its canonical form is usually used figuratively; Sporleder and Li (2009) relied on the break in lexical coherence between the idiom and its context to signal a figurative usage. These heuristics, however, are not always applicable because the distinctions they depend upon may not be present or obvious. To generalize across different idioms and usage contexts, we need a more reliable heuristic, and a way to incorporate it appropriately into an unsupervised learning framework.
We propose a heuristic that differentiates usages based on distributional semantics (Harris, 1954;Turney and Pantel, 2010). Our key insight is that when an idiom is used literally, its relationship with its context is more predictable than when it is used figuratively. This is because the literal meaning of an idiom is compositional (Katz and Giesbrecht, 2006), and the constituent words that make up the idiom are also meant literally. For example, in instance (2), spill is meant literally and can take on objects other than beans; moreover, one of the context words, mess, can often be seen to co-occur with spill in other text, even without beans. Our strategy is to represent an idiom's literal usage in terms of the word embeddings of the idiom's constituent words and other words they frequently co-occur with. Then, for any instance in which the idiom's usage is not known, we only need to determine the semantic similarity between that instance and the idiom's literal representation. We define a literal usage metric that estimates the likelihood that an instance would be labeled "literal".
While the literal usage metric captures the distributional semantic information of the context, we find that some other linguistic cues are also significant for usage detection (such as whether the subject of the sentence is a person); therefore, we allow our model to further refine through unsupervised methods. Specifically, we treat the usage (figurative or literal) as a hidden variable in probabilistic latent variable models, and we define a set of features that are linguistically relevant for idiom usage detection as observables. We integrate our literal usage metric with the latent variable models by treating the metric outputs as soft labels to guide the latent variable models toward grouping by usages.
We hypothesize that unsupervised learning in a more linguistically motivated feature space, informed by soft labels from a semantically driven metric, will produce more robust classifiers. We conduct experiments comparing our approach against other supervised and unsupervised baselines. Results suggest that our approach achieves performance competitive with supervised models.

Related Work
Despite the common perception that idioms are mainly used figuratively, many can also be meant literally. A number of models have been proposed in the literature to recognize an idiom's usage in different contexts. Many rely on a specific linguistic property to draw a clear-cut decision boundary between literal and figurative usages. For example, Fazly et al. (2009) proposed a method that relies on the concept of canonical form, based on the observation that while literal usages are less syntactically restricted, figurative usages tend to occur in a small number of canonical form(s). As shown in the examples above, however, this rule of thumb does not always hold. Sporleder and Li (2009) proposed a method that builds a cohesion graph over all content words in the context; if removing the idiom improves cohesion, they assume the instance is figurative. Later, Li and Sporleder (2009) used their cohesion graph method to label a subset of the test data with high confidence. This subset is then passed on as training data to a supervised classifier, which labels the remainder of the dataset.
When manually annotated examples are available, supervised classifiers are effective. Rajani et al. (2014) extracted all non-stop-words in the context and used them as "bag of words" features to train an L2-regularized Logistic Regression (L2LR) classifier (Fan et al., 2008). As the local context of an idiom holds clues for discriminating between its literal and figurative usages, Liu and Hwa (2017) found that context representation also plays a significant role in idiom usage recognition. They took an adaptive approach, applying supervised ensemble learning over three classifiers based on different context representations (Peng et al., 2014; Birke and Sarkar, 2006; Rajani et al., 2014).

Our Approach
Given a target idiomatic expression and a collection of instances in which the idiom occurs, our proposed system (Figure 1) determines whether the idiom in each instance is meant figuratively or literally. We first build a Literal Usage Representation for each idiom by leveraging the distributional semantics of its constituents (Sec 3.1). Given an instance of an idiom, we can determine its usage from the semantic similarity between the instance's context and the Literal Usage Representation. We define a Literal Usage Metric to transform the semantic similarity score into a soft label, i.e., an initial rough estimate of the instance's usage (Sec 3.2). Finally, we treat the soft labels as distant supervision for downstream probabilistic latent variable models, in which the usages are treated as hidden variables over a set of observed features.

Literal Usage Representation
An idiom co-occurs with different sets of words depending on whether it is meant literally or figuratively. For example, when used literally, get wind is more likely to co-occur with words such as rain, storm or weather; in contrast, when used figuratively, it frequently co-occurs with words such as rumor. Comparing the two sets of words associated with the idiom, we see that the literal set of words also tends to co-occur with just wind, a constituent word within the idiom. Therefore, even without annotated data or a dictionary, we may still approximate a representation for the literal meaning of an idiom through the idiom's constituent words and their semantic relationships to other words. To do so, we begin by initializing a literal meaning set to just the idiom's main constituent words 3 ; we then grow the set by adding two types of semantically related words. First, we look for co-occurring words in a large textual corpus (e.g., (David et al., 2005)): for each constituent word w, we randomly sample s sentences that contain w from the corpus, extract the top n most frequent words (excluding stop words), and add them to the literal meaning set. Second, we look for words that are semantically close in a word embedding space: we train a continuous bag-of-words (CBOW) embedding model (Mikolov et al., 2013) and add the t words that are most related to w by cosine similarity.
Altogether, the literal usage representation is a collection of vectors, i.e., the embeddings of the words in the final extended literal meaning set. The size of the set depends on the parameters s, n, and t: if the chosen values are too small, we do not end up with a word collection that is representative enough; if they are too large, we waste computing resources chasing Zipfian tails. Parameter settings are discussed further in the experiment section.
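The construction above can be sketched with toy data. The corpus, embeddings, stop-word list, and the small parameter values (n=2, t=1) below are illustrative stand-ins, not the paper's actual resources:

```python
import math
from collections import Counter

STOP = {"the", "a", "of", "in", "is", "and", "was"}

def top_cooccurring(word, corpus, n):
    """Most frequent non-stop words in sentences containing `word`."""
    counts = Counter()
    for sent in corpus:
        toks = sent.lower().split()
        if word in toks:
            counts.update(t for t in toks if t != word and t not in STOP)
    return [w for w, _ in counts.most_common(n)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_in_embedding(word, embeddings, t):
    """The t words closest to `word` by cosine similarity."""
    ranked = sorted(
        (w for w in embeddings if w != word),
        key=lambda w: cosine(embeddings[word], embeddings[w]),
        reverse=True,
    )
    return ranked[:t]

def literal_meaning_set(constituents, corpus, embeddings, n=2, t=1):
    # Start from the constituent words, then grow the set two ways.
    literal = set(constituents)
    for w in constituents:
        literal.update(top_cooccurring(w, corpus, n))
        literal.update(nearest_in_embedding(w, embeddings, t))
    return literal

corpus = [
    "the wind blew the rain across the field",
    "a strong wind and heavy rain hit the coast",
    "storm winds brought rain overnight",
]
embeddings = {
    "wind": [1.0, 0.1], "storm": [0.9, 0.2],
    "rumor": [0.1, 1.0], "weather": [0.8, 0.3],
}
print(literal_meaning_set(["wind"], corpus, embeddings))
```

In practice the paper trains the embeddings with gensim's CBOW model on a large corpus; the hand-built vectors here only stand in for that step.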

Literal Usage Metrics
Among all the instances to be classified, we expect the context words of the literal cases to be semantically closer to the literal usage representation we just formed. Let L denote the set of words in the literal usage representation for the target idiom. For each instance, let C be the set of non-stop context words in the instance. We calculate s, the semantic similarity score between the context of the instance and the literal usage representation, as follows:

s = \frac{1}{|C||L|} \sum_{c \in C} \sum_{l \in L} \mathrm{sim}(c, l)

where c denotes a word in C, l denotes a word in L, and sim(c, l) refers to the cosine similarity between the word embeddings of c and l. Let S = {s_1, s_2, ..., s_n} be the set of semantic similarity scores for all the instances we wish to classify. Instances with higher scores are more likely to use the idiom literally. A naive literal usage metric would choose a predefined threshold for all idioms and label all instances with scores above the threshold as literal usages. This approach is unlikely to work well in practice: as noted in previous work, idioms have different levels of semantic analyzability (Gibbs et al., 1989; Cacciari and Levorato, 1998). When an idiom has a high degree of semantic analyzability, its contextual words will be semantically closer to the literal usage representation, so a higher threshold is needed.
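A minimal sketch of the similarity score, assuming s is the mean cosine similarity over all (context word, literal-set word) pairs; the toy embeddings are made up for illustration:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def similarity_score(context_words, literal_set, emb):
    """Mean pairwise cosine similarity between context and literal set."""
    sims = [cosine(emb[c], emb[l])
            for c in context_words for l in literal_set
            if c in emb and l in emb]
    return sum(sims) / len(sims) if sims else 0.0

emb = {"rain": [1.0, 0.0], "storm": [0.9, 0.1], "rumor": [0.0, 1.0]}
literal = {"rain", "storm"}
s_lit = similarity_score(["rain"], literal, emb)   # literal-looking context
s_fig = similarity_score(["rumor"], literal, emb)  # figurative-looking context
print(round(s_lit, 3), round(s_fig, 3))
```

As expected, the literal-looking context scores far higher than the figurative-looking one.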
In this work, we select a different decision threshold for each idiom adaptively, based on the distribution of its similarity scores. Most importantly, rather than generating a hard label, we transform these scores into a probabilistic metric, where 0 means the usage in the instance is almost certainly figurative and 1.0 means it is almost certainly literal.
We propose a metric based on the principle of Minimum Variance (MinV): we first sort the scores in S and choose the threshold t (from these scores) that minimizes the sum of variances of the two resulting clusters. For each instance i, we then apply the following metric to estimate the probability that the idiom in instance i is meant literally, based on its semantic similarity score s_i:

Pr(\text{literal} \mid s_i) = \frac{1}{1 + e^{-k(s_i - t)}}

where k is a constant weighting factor and t is the learned threshold. The intuition is that the larger the difference between s_i and the threshold, the more likely instance i is literal; because the probability of literal usage is not linearly correlated with this difference, we use the sigmoid function to account for the non-linearity. We incorporate k to scale the value of the difference, since it is generally very small (close to 0); without k, all the Pr values gravitate toward 0.5, rendering the soft label equivalent to a random guess. We set k to 5 for all idioms based on a development set.
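A minimal sketch of MinV: pick the threshold (from the observed scores) minimizing the sum of cluster variances, then map each score to a soft label with a scaled sigmoid (k=5, as in the text). The score values are made up:

```python
import math
from statistics import pvariance

def minv_threshold(scores):
    """Threshold minimizing the summed variance of the two clusters."""
    best_t, best_cost = None, float("inf")
    for t in sorted(scores):
        lo = [s for s in scores if s <= t]
        hi = [s for s in scores if s > t]
        cost = (pvariance(lo) if len(lo) > 1 else 0.0) + \
               (pvariance(hi) if len(hi) > 1 else 0.0)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

def literal_probability(s, t, k=5.0):
    """Soft label: sigmoid of the scaled distance from the threshold."""
    return 1.0 / (1.0 + math.exp(-k * (s - t)))

scores = [0.10, 0.12, 0.11, 0.40, 0.42, 0.41]
t = minv_threshold(scores)
print(t)  # splits the low cluster from the high one
print(literal_probability(0.42, t), literal_probability(0.10, t))
```

Note how, without the scaling factor k, the small score differences would leave every soft label near 0.5, exactly the failure mode the text describes.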

Heuristically Informed Usage Recognition
The soft label generated by MinV (the literal usage metric) captures the distributional semantic information of the context. In practice, a variety of other linguistic features are also informative of the intended usage of an idiom. We explore probabilistic latent variable models over a collection of features that are linguistically relevant for idiom usage detection; the soft label is integrated into the unsupervised learning of hidden usages as distant supervision. In this section, we describe the proposed features in the latent variable models and how we integrate the soft label into the learning process.

Latent Variable Models
To predict an idiom's usage in an instance, we consider two representative probabilistic latent variable models: Latent Dirichlet Allocation (LDA) (Blei et al., 2003) 4 and unsupervised Naive Bayes (NB). For both models, the latent variable is the idiom usage (figurative vs. literal); the observables are linguistic features that can be extracted from the instances, described below: Subordinate Clause We encode a binary feature indicating whether the target expression is followed by a subordinate clause (the Stanford Parser (Chen and Manning, 2014) is used). This feature is useful for some idioms such as in the dark: a following subordinate clause usually suggests a figurative usage, as in You've kept us totally in the dark about what happened that night.
Selectional Preference Violation of selectional preference is normally a signal of figurative usage (e.g., having an abstract entity as the subject of play with fire). We encode this feature when the head word of the idiom is a verb, focusing on the subject of that verb. We apply the Stanford Named Entity tagger (Finkel et al., 2005) with 3 classes ("Location", "Person", "Organization") to the sentence containing the idiom. If the subject is labeled as an entity, its class is encoded in the feature vector. Pronouns such as "I" and "he" also indicate that the subject is a "Person"; however, they are normally not tagged by the Stanford Named Entity tagger. To overcome this issue, we add the part-of-speech of the subject to the feature vector.
Abstractness Abstract words refer to things that are hard to perceive directly with our senses. Abstractness has been shown to be useful in the detection of metaphor, another type of figurative language (Turney et al., 2011). A figurative usage of an idiomatic phrase may have relatively more abstract contextual words. For example, in the sentence She has lived life in the fast lane, the word life is considered an abstract word; this is a useful indicator that in the fast lane is used figuratively. We use the MRC Psycholinguistic Database Machine Usable Dictionary (Coltheart, 1981), which contains a list of 4295 words with abstractness measures between 100 and 700. We calculate the average abstractness score of all the contextual words (with stop words removed) in the sentence containing the idiom. The score is then transformed into a categorical feature to overcome the sparsity problem, based on the following ranges: concrete (450-700), medium (350-450), abstract (100-350).
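The abstractness binning can be sketched as follows. The tiny score dictionary is a made-up stand-in for the MRC database, and the "medium" fallback for uncovered contexts is an assumption, not something the paper specifies:

```python
# Hypothetical stand-in for the MRC abstractness scores (100-700 scale,
# higher = more concrete, per the ranges given in the text).
MRC = {"life": 300, "water": 600, "shower": 580, "idea": 250}
STOP = {"the", "a", "in", "has"}

def abstractness_category(context):
    """Average MRC score of covered non-stop words, binned to a category."""
    scores = [MRC[w] for w in context if w not in STOP and w in MRC]
    if not scores:
        return "medium"  # assumed fallback when no word is covered
    avg = sum(scores) / len(scores)
    if avg >= 450:
        return "concrete"
    if avg >= 350:
        return "medium"
    return "abstract"

print(abstractness_category(["she", "has", "lived", "life"]))  # abstract
print(abstractness_category(["relax", "in", "the", "water"]))  # concrete
```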
Neighboring Words Words preceding and following the idiomatic expression can be very informative in terms of usage recognition. For example, words such as relax or shower before the idiom in hot water often signal a literal usage.
Part-of-Speech of the Neighboring Words The word classes of the neighboring words can be informative as well. For example, a pronoun preceding dog's age generally indicates a literal usage, as in I think my dog's age is starting to catch up. She sometimes needs help to jump on to my bed, while a determiner usually marks a figurative usage, as in It's been a dog's age since I've used Twitter.

Incorporating Soft Label into Usage Recognition
Given a collection of instances and their features, either LDA or NB can separate the instances into two groups (hopefully, by usage), but it does not associate the right label (i.e., "figurative" or "literal") with the groups. We do not want to rely on any manual annotations for this step. Therefore, we integrate the automatically generated soft labels (based on MinV, our literal usage metric) into the unsupervised learning procedure as a weak form of supervision. Formally, we want to estimate each instance's posterior distribution over (literal/figurative) usages, θ_du, and the usage-feature distribution, φ_uf. For LDA, we derive a Gibbs sampling algorithm that incorporates the soft label into the learning procedure; we refer to it as informed Gibbs sampling (infGibbs). For the unsupervised naive Bayes model, we adapt the classical Expectation-Maximization algorithm to integrate the soft label; we refer to it as informed Expectation-Maximization (infEM).

Informed Gibbs Sampling The Gibbs sampling algorithm (Griffiths and Steyvers, 2004) used in traditional LDA assigns each word token a random initial hidden topic, and the system must interpret the learned topics post-hoc, e.g., by human annotation. In our case, for each feature f in each instance, an initial random usage biased by the instance's soft label is assigned to f (i.e., a Bernoulli trial). Since the soft label explicitly encodes an instance's literal and figurative usage distribution, we do not need to interpret the learned usages at the end of the algorithm. Based on these assignments, we build a feature-usage counting matrix C^FU and an instance-usage counting matrix C^DU, with dimensions |F| x 2 and |D| x 2 respectively (|F| is the number of features and |D| is the number of instances): C^FU_{i,j} is the count of feature i assigned to usage j; C^DU_{d,j} is the count of features assigned to usage j in instance d.
Then, for each feature f in each instance, we resample a new usage for f, and the matrices C^FU and C^DU are updated accordingly. This step is repeated T times. The resampling equation is:

P(u_i = j \mid u_{-i}, f_i) \propto p_j \cdot \frac{C^{FU}_{f_i, j, -i} + \beta}{C^{FU}_{*, j, -i} + |F|\beta} \cdot \frac{C^{DU}_{d_i, j, -i} + \alpha}{C^{DU}_{d_i, *, -i} + |U|\alpha}    (3)

where i indexes features in instance d, j is an index into the literal and figurative usages, * indicates a summation over that dimension, and -i means excluding the current assignment. The first factor, p_j, is the soft label encoding the prior usage distribution. The second factor represents the probability of feature f_i under usage j (C^{FU}_{f_i, j, -i} is the count of feature f_i assigned to usage j, excluding the current usage assignment u_i). The third factor represents the probability of usage j in the current instance (C^{DU}_{d_i, j, -i} is the count of linguistic features assigned to usage j in the current instance, excluding the current feature f_i). The value of |U| is 2, the number of usages (i.e., figurative and literal). α and β are the hyper-parameters of the Dirichlet priors (we set both to 1). The core idea of Equation 3 is to integrate both distributional semantic information (the soft label, the first factor) and linguistically motivated features (the second and third factors) into the inference procedure.
The matrices C^FU and C^DU from the last 10% of the T iterations are averaged and then normalized to approximate the usage-feature distribution φ_uf and the instance-usage distribution θ_du, respectively. The final result is determined by θ_du, i.e., each instance is assigned the usage whose probability is higher than 0.5. We average to obtain a more stable result, because an accidental bad sample would affect our model negatively if we used only the C^FU and C^DU from the last iteration. This step is important for idioms whose feature space is sparse. The iteration number T is set to 500 based on a development set.
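The informed-Gibbs procedure (soft-label-biased initialization, weighted resampling, and averaging over the final iterations) can be sketched on toy data. The instances, feature ids, soft labels, and the 50-pass budget below are illustrative assumptions, not the paper's setup or code:

```python
import random

random.seed(0)
ALPHA, BETA, N_USAGES = 1.0, 1.0, 2  # symmetric priors, 2 usages

def init(instances, soft_labels, n_features):
    """Assign each feature token a usage via a biased Bernoulli trial."""
    C_FU = [[0] * N_USAGES for _ in range(n_features)]
    C_DU = [[0] * N_USAGES for _ in range(len(instances))]
    assign = []
    for d, feats in enumerate(instances):
        row = []
        for f in feats:
            u = 1 if random.random() < soft_labels[d] else 0
            row.append(u)
            C_FU[f][u] += 1
            C_DU[d][u] += 1
        assign.append(row)
    return C_FU, C_DU, assign

def resample_pass(instances, soft_labels, C_FU, C_DU, assign, n_features):
    """One sweep: prior p_j times feature-usage and instance-usage ratios."""
    for d, feats in enumerate(instances):
        prior = (1 - soft_labels[d], soft_labels[d])
        for i, f in enumerate(feats):
            u_old = assign[d][i]
            C_FU[f][u_old] -= 1
            C_DU[d][u_old] -= 1
            weights = []
            for j in range(N_USAGES):
                feat_total = sum(C_FU[x][j] for x in range(n_features))
                p_feat = (C_FU[f][j] + BETA) / (feat_total + n_features * BETA)
                p_doc = (C_DU[d][j] + ALPHA) / (sum(C_DU[d]) + N_USAGES * ALPHA)
                weights.append(prior[j] * p_feat * p_doc)
            u_new = random.choices(range(N_USAGES), weights)[0]
            assign[d][i] = u_new
            C_FU[f][u_new] += 1
            C_DU[d][u_new] += 1

# Two instances with disjoint features; soft labels strongly disagree.
instances = [[0, 1, 0], [2, 3, 2]]
soft = [0.9, 0.1]
C_FU, C_DU, assign = init(instances, soft, n_features=4)
acc = [0.0, 0.0]
T = 50
for it in range(T):
    resample_pass(instances, soft, C_FU, C_DU, assign, n_features=4)
    if it >= T - 5:  # average over the final iterations, as described above
        for d in range(len(instances)):
            acc[d] += C_DU[d][1] / sum(C_DU[d])
print(acc[0] > acc[1])  # instance 0 should lean literal, instance 1 figurative
```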
Informed Expectation Maximization Combining a Naive Bayes classifier with the EM algorithm has been widely used in text classification and word sense disambiguation (Hristea, 2013;Nigam et al., 2000). In our case, we want to construct a model to recover the missing literal and figurative labels of the instances of the target idiom. This section describes two extensions to the basic EM algorithm for idiom usage recognition. The extensions help improve parameter estimation by taking the automatically learned soft labels into consideration.
Our informed EM method extends a basic version for NB (Hristea, 2013), in which the initial parameter values θ_du and φ_uf are chosen randomly. At each iteration, the E-step of the algorithm estimates the expectations of the missing values (i.e., the literal and figurative usages) given the latest iteration of the model parameters; the M-step maximizes the likelihood of the model parameters using the previously-computed expectations of the missing values. As we did when extending Gibbs sampling for LDA, we perform two similar adaptations of conventional EM for NB to incorporate soft labels. First, we assign each instance an initial usage distribution θ_du directly from the soft label, and then initialize the usage-feature distribution φ_uf using these assignments; we refer to this as informed initialization. Second, in the E-step, we multiply the expectation computed by basic EM with the soft label to obtain the new expected usage for each instance (i.e., updating θ_du). The M-step is the same as in basic EM, updating the usage-feature distribution φ_uf.
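A toy sketch of informed EM for the unsupervised NB model: theta is initialized from the soft labels, and each E-step reweights the NB posterior by the soft label before renormalizing. The instances, feature ids, soft labels, and smoothing value are made-up illustrations:

```python
def informed_em(instances, soft, n_features, iters=20, alpha=1.0):
    """instances: lists of feature ids; soft: P(literal) per instance."""
    theta = [[1 - p, p] for p in soft]  # informed initialization
    for _ in range(iters):
        # M-step: usage-feature distribution phi from expected counts
        phi = [[alpha] * n_features for _ in range(2)]
        for d, feats in enumerate(instances):
            for f in feats:
                for j in range(2):
                    phi[j][f] += theta[d][j]
        for j in range(2):
            z = sum(phi[j])
            phi[j] = [v / z for v in phi[j]]
        # E-step: NB likelihood, reweighted by the soft label
        for d, feats in enumerate(instances):
            post = []
            for j in range(2):
                p = soft[d] if j == 1 else 1 - soft[d]
                for f in feats:
                    p *= phi[j][f]
                post.append(p)
            z = sum(post)
            theta[d] = [p / z for p in post]
    return theta

# Two feature patterns; soft labels mildly favor opposite usages.
instances = [[0, 1], [0, 1], [2, 3], [2, 3]]
soft = [0.8, 0.7, 0.2, 0.3]
theta = informed_em(instances, soft, n_features=4)
print([round(t[1], 2) for t in theta])
```

With co-varying features and consistent soft labels, the posteriors sharpen well beyond the initial soft labels over the iterations.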

Evaluation
We conduct experiments to address three questions: 1. How effective is our overall approach? How does it compare against previous work? 2. How effective is our literal usage metric (i.e., MinV) compared to other heuristics?
3. How effective is our literal usage metric at informing downstream learning processes?

Experimental Setup
Models Our main experiments evaluate the two variants of the proposed fully unsupervised model described in Section 3: MinV+infGibbs and MinV+infEM. We report the average performance of our models over 5 runs; multiple runs are necessary because the models involve a sampling process. They are compared with three baseline unsupervised models.

Parameter setting Recall that in order to build the literal usage representation of an idiom, we sample s sentences that contain each constituent word w from an external corpus; extract from them the top n words most frequently co-occurring with w; and separately find the t words most semantically similar to w using word embeddings. To set these parameters to values in reasonable ranges, we evaluated MinV on a small development set: we picked 10 idioms that are different from the evaluation set, scraped 50 instances from the web for each idiom, and labeled them ourselves. We find that s >= 100, n = 10, and t = 5 yield good results.
We use the gensim toolkit (Řehůřek and Sojka, 2010) and train our word embedding model using the continuous bag-of-words model on the Text8 Corpus 6. Negative sampling is applied as the training method; the min count is set to 2. For the other parameters, we use the default settings in gensim.

Evaluative Data Our goal is to compare all the methods on two publicly available corpora: the SemEval 2013 Task 5B corpus (Korkontzelos et al., 2013), which is used by prior supervised methods (Liu and Hwa, 2017; Rajani et al., 2014), and the verb-noun combination (VNC) dataset (Cook et al., 2008), which is used by a prior unsupervised method (Fazly et al., 2009). However, some method-dataset conflicts have to be resolved. Because the idioms in the SemEval dataset are all in their canonical forms, and because they are not restricted to verb-noun combinations, we cannot evaluate the method of Fazly et al. on this dataset (their method is tailored to verb-noun combinations). Conversely, some idioms from the VNC dataset are almost always used figuratively (or literally), which presents a problem for supervised methods. To facilitate full comparisons, we select the subset of idioms from the VNC corpus with more than 10 literal and more than 10 figurative instances. A summary of the two corpora is shown in Table 1. Note that each instance in the SemEval corpus has about 3-5 sentences; for consistency, we use 3 sentences as the context: the sentence with the target idiom and the two neighboring sentences.

Evaluation metric Following the convention in prior work, we report the F-score for the recognition of figurative usages and the overall accuracy.

Table 2 shows the results of our models and the other comparative methods. Our proposed models show consistent performance across the two corpora, outperforming the unsupervised baselines from Sporleder and Li (2009) and Li and Sporleder (2009) and the supervised model from Rajani et al. (2014).
Moreover, there is no statistically significant difference in F-score between the supervised ensemble model of Liu and Hwa (2017) and our models. On the VNC corpus, our models have average scores comparable to those of Fazly et al. (2009), and our scores are more stable across different idioms. While the method of Fazly et al. is nearly perfect for some idioms (0.98 on "take heart"), it performs poorly for others (e.g., 0.33 on "pull leg"). Their algorithm has trouble with idioms whose canonical and non-canonical forms both appear frequently in literal and figurative usages.

Effectiveness of MinV
The core of our approach is MinV, the literal usage metric we developed to generate soft labels to guide the unsupervised learning. This experiment examines its effectiveness by creating usage classifications directly from it (i.e., if MinV predicts a probability of >0.5, predict "literal"). We compare MinV against two alternative heuristics.
MinV is based on two core ideas. First, if an idiom is used figuratively, we expect a large difference (low similarity scores) between its context and the semantic representation of the idiom's literal usage. The idea is similar to that of Sporleder and Li (2009), but they relied on lexical chains instead of distributional semantics. Second, instead of choosing a predefined threshold to separate the raw semantic similarity scores, we select a different decision threshold for each idiom adaptively, based on the distribution of the scores. As an alternative, we therefore compare MinV against a Fixed-Threshold heuristic that labels an instance as "literal" if its raw score is higher than some global threshold (set to 0.346 based on development data).
In Table 3, we observe that MinV outperforms both Sporleder and Li's model and Fixed-Threshold, but using MinV by itself is not sufficient: its performance fluctuates greatly, e.g., the F-score for individual idioms varies from 0.43 to 0.88. Recall that MinV+infGibbs shows smaller fluctuations across different idioms in Table 2. These results suggest that the subsequent learning process is effective.
Through error analysis, we find two major factors contributing to the performance fluctuation. First, the context itself could be misleading. An error case of play ball by MinV is: All 10-year-old Minnie Cruttwell wants to do is play with the boys , but the Football Association are not playing ball. She is a member of a mixed team called Balham Blazers , but the FA say she must join a girls' team when she is 12.
The context words in bold (which are related to the word "ball") mislead MinV to predict a "literal" usage when it is actually a "figurative" usage (since an organization such as the Football Association cannot literally play ball). Second, not all content words in the context are relevant for distinguishing the idiom's usage. A future direction is to prune contextual words more intelligently.

Integration of MinV into Learning
We have argued that an advantage of using a metric with a probabilistic interpretation, instead of a binary-class heuristic, is that its scores can be incorporated into subsequent learning models as soft labels. In this set of experiments, we evaluate the impact of the metric on the learning methods. First, we consider unsupervised learning without input from the literal usage metric: we cluster the instances with the original Gibbs sampling and EM algorithms and then label each cluster with its majority usage. Second, we explore using the information from the literal usage metric as a "noisy gold standard" to perform supervised training of a nearest-neighbors (NN) classifier. Specifically, the literal and figurative instances labeled by MinV with high confidence (top 30%) are used as example sets. For each test instance, we calculate its cosine similarity (in feature space) to the literal and figurative example sets and assign the label of the closest set. We refer to this model as MinV+NN.

Table 4 shows the performances of the new models, which are all worse than our full models MinV+infGibbs and MinV+infEM. This highlights the advantage of integrating distributional semantic information and local features into a single learning procedure. Without the informed prior (encoded by the soft labels), the Gibbs sampling and EM algorithms only seek to maximize the probability of the observed data, and may fail to learn the underlying usage structure.
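The MinV+NN baseline can be sketched as follows. The feature vectors are made up, and taking the mean cosine similarity to each example set as "closeness" is an assumption about an aggregation the text does not spell out:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def mean_sim(x, examples):
    """Mean cosine similarity of x to a set of example vectors."""
    return sum(cosine(x, e) for e in examples) / len(examples)

def nn_label(x, literal_examples, figurative_examples):
    """Assign the label of the closer example set."""
    return ("literal"
            if mean_sim(x, literal_examples) >= mean_sim(x, figurative_examples)
            else "figurative")

# Hypothetical high-confidence example sets from MinV's top 30%.
literal_ex = [[1.0, 0.1], [0.9, 0.2]]
figurative_ex = [[0.1, 1.0], [0.2, 0.9]]
print(nn_label([0.95, 0.15], literal_ex, figurative_ex))
```

As the discussion below notes, mislabeled instances inside these example sets propagate directly into the NN decisions, which is why the full models are more robust.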
The model MinV+NN is not as competitive as our full models; it is too sensitive to the selected instances. Even though the training examples are the instances MinV is most confident about, there are still mislabeled instances, and these "noisy training examples" lead the NN classifier to make unreliable predictions. In contrast, our unsupervised learning is less sensitive to the performance of MinV; it can achieve decent performance for an idiom even when the quality of the soft labels is poor. For example, when using MinV as a stand-alone model for break a leg, its figurative F-score is only 0.43, but through further training, the full model MinV+infGibbs achieves 0.64. Fig. 2 shows the training curve.

Figure 2: The performance of MinV+infGibbs on the idiom "break a leg"

A possible reason for this robustness is that the soft label is integrated into the learning process by biasing the sampling procedure (see Equation 3): we only encourage our model to follow the distributional semantic evidence captured by the soft label, and do not force it. If there is strong evidence encoded by the linguistically motivated features in an instance, the model still has the freedom to overrule the soft label.

Conclusion
We have presented an unsupervised method for idiom usage recognition, built upon the heuristic that instances in which an idiom is used literally are semantically closer to the idiom's constituent words. Experimental results on two different corpora suggest that our models are competitive with supervised methods and prior unsupervised methods.