Idiom-Aware Compositional Distributed Semantics

Idioms are peculiar linguistic constructions that pose great challenges for representing the semantics of language, especially for the current prevailing end-to-end neural models, which assume that the semantics of a phrase or sentence can be composed literally from its constituent words. In this paper, we propose an idiom-aware distributed semantic model that builds sentence representations on the basis of understanding the idioms they contain. Grounded in the literal-first psycholinguistic hypothesis, our models adaptively learn to compose the semantics of a phrase either literally or idiomatically. To better evaluate our models, we also construct an idiom-enriched sentiment classification dataset of considerable scale with abundant idiom peculiarities. Qualitative and quantitative experimental analyses demonstrate the efficacy of our models.


Introduction
Currently, neural network models have achieved great success on many natural language processing (NLP) tasks, such as text classification (Zhao et al., 2015; Liu et al., 2017), semantic matching (Liu et al., 2016a,b), and machine translation. A key factor in these neural models is how to compose a phrase or sentence representation from its constituent words. Typically, a shared compositional function is applied to word vectors recursively until the representation of the phrase or sentence is obtained. The compositional function can take the form of many kinds of neural networks, such as recurrent neural networks (Hochreiter and Schmidhuber, 1997; Chung et al., 2014), convolutional neural networks (Collobert et al., 2011; Kalchbrenner et al., 2014), and recursive neural networks (Socher et al., 2013; Tai et al., 2015; Zhu et al., 2015).
However, these methods show an obvious defect in representing idiomatic phrases, whose semantics are not literal compositions of their individual words. For example, "pulling my leg" is idiomatic, and its meaning cannot be derived directly from a literal combination of its words. Due to its importance, some previous work has focused on the automatic identification of idioms (Katz and Giesbrecht, 2006; Li and Sporleder, 2009; Fazly et al., 2009; Peng et al., 2014; Salton et al., 2016). However, it remains challenging to take idioms into account to improve neural semantic representations of phrases or sentences.
Motivated by the literal-first psycholinguistic hypothesis proposed by Bobrow and Bell (1973), in this paper we propose an end-to-end neural model for idiom-aware distributed semantic representation, in which we adopt a recursive neural architecture (Socher et al., 2013; Tai et al., 2015; Zhu et al., 2015) to learn compositional semantics over a constituency tree. More concretely, we introduce a neural idiom detector that, for each phrase in a sentence, adaptively determines its compositionality: literal or idiomatic. For a literal phrase, we compute its semantics from its constituents, while for an idiomatic phrase, we design two different ways to learn idiom representations grounded in two different linguistic views of idioms (Katz, 1963; Fraser, 1970; Nunberg et al., 1994).
To evaluate our models' ability to understand sentences with idioms, we conduct experiments on the sentiment classification task for the following reasons: 1) Idioms typically imply an affective stance toward something, and they are common in reviews and comments (Williams et al., 2015). 2) Error analyses of sentiment classification results reveal that a large number of errors are caused by idioms (Balahur et al., 2013).
The contributions of this work are summarized as follows:
• We grow the capacity of the recursive neural network, enabling it to model idiomatic phrases and handle the ubiquitous phenomenon of idiomatic variation when learning sentential representations.
• We integrate idiom understanding into a real-world NLP task instead of evaluating idiom detection as a standalone task.
• We construct a new real-world dataset covering abundant idioms in both original and variational forms.
Elaborate qualitative and quantitative experimental analyses show the effectiveness of our models.

Linguistic Interpretation of Idioms
Recently, idioms have attracted considerable attention from linguists, psycholinguists, and lexicographers due to their pervasiveness in daily discourse and their fascinating linguistic properties (Villavicencio et al., 2005; Salton et al., 2014). As peculiar linguistic constructions, idioms have the following three properties:
Invisibility Idioms disguise themselves as ordinary multi-word phrases in sentences. This makes end-to-end training hard, since we must first detect idioms and only then understand them.
Idiomaticity Idioms are semantically opaque; their meanings cannot be derived from their constituent words. Existing compositional distributed approaches fail because they assume that the meaning of any phrase can be composed from the meanings of its constituents.
Flexibility While structurally fixed, idioms allow variation: the words of some idioms can be removed or substituted by other words.
Table 1 shows the three properties of idioms and the resulting challenges for distributed compositional semantics. To address these challenges, two different perspectives have been held on idiom comprehension.
The first perspective treats idioms as long words whose meanings are stipulated arbitrarily and cannot be predicted from their constituents (Katz, 1963; Fraser, 1970). However, many idioms show a certain degree of flexibility in terms of morphology and lexical choice, so this kind of method handles variation poorly and fails to generalize.
The second perspective considers idioms as linguistic expressions (Nunberg et al., 1994) whose meanings are determined by the meanings of their constituent parts, combined by some compositional rules. This fully compositional view may handle lexical variations, but it suffers from the idiomaticity problem, since the meanings of idioms are opaque.

Proposed Models
We propose an end-to-end neural model for idiom-aware distributed semantic representation. Specifically, to address invisibility, we introduce a neural idiom detector to adaptively distinguish the literal and idiomatic meanings of each phrase when learning sentence representations. For a literal phrase, we compute its semantics from its constituents with the Tree-structured LSTM (TreeLSTM) (Tai et al., 2015; Zhu et al., 2015). For an idiomatic phrase, we design two different ways to learn idiom representations grounded in two different linguistic views of idioms, which account for the idiomaticity and flexibility properties. Figure 1 illustrates the framework of our proposed models, which consists of three modules: the literal interpreter, the idiom detector, and the idiomatic interpreter.

Literal Interpreter
The literal interpreter is essentially a compositional semantic model, in which the semantics of a phrase is composed from the literal meanings of its constituent words. Several existing models could serve as the literal interpreter; in this paper, we adopt TreeLSTM (Tai et al., 2015) due to its superior performance.
Formally, given a binary constituency tree T induced by a sentence, each non-leaf node corresponds to a phrase. We refer to h_j and c_j as the hidden state and memory cell of node j. The transition equations of node j are as follows:

i_j = σ(T_{A,b}^{(i)}([x_j; h_j^l; h_j^r]))
f_j^l = σ(T_{A,b}^{(f^l)}([x_j; h_j^l; h_j^r]))
f_j^r = σ(T_{A,b}^{(f^r)}([x_j; h_j^l; h_j^r]))
o_j = σ(T_{A,b}^{(o)}([x_j; h_j^l; h_j^r]))
u_j = tanh(T_{A,b}^{(u)}([x_j; h_j^l; h_j^r]))
c_j = i_j ⊙ u_j + f_j^l ⊙ c_j^l + f_j^r ⊙ c_j^r
h_j = o_j ⊙ tanh(c_j)

where x_j denotes the input vector, which is non-zero if and only if node j is a leaf; the superscripts l and r denote the left and right child respectively; σ is the logistic sigmoid function; ⊙ denotes element-wise multiplication; and T_{A,b} is an affine transformation that depends on the network parameters A and b. Figure 1-(a) gives an illustration of the TreeLSTM unit.
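The transitions above can be sketched in plain NumPy. This is a minimal illustration of a binary TreeLSTM node update, not the paper's implementation; the parameter layout (one affine map per gate, stored in a dictionary) is an assumption made for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_lstm_node(h_l, c_l, h_r, c_r, params, x=None):
    """One binary TreeLSTM transition: compose the left and right children
    (h_l, c_l), (h_r, c_r) into the parent's (h, c).
    `params` maps gate names to (A, b) affine parameters; the input vector
    x is non-zero only at leaf nodes, so it defaults to zeros here."""
    d = h_l.shape[0]
    x = np.zeros(d) if x is None else x
    z = np.concatenate([x, h_l, h_r])                    # shared affine input
    i  = sigmoid(params["i"][0] @ z + params["i"][1])    # input gate
    fl = sigmoid(params["fl"][0] @ z + params["fl"][1])  # left forget gate
    fr = sigmoid(params["fr"][0] @ z + params["fr"][1])  # right forget gate
    o  = sigmoid(params["o"][0] @ z + params["o"][1])    # output gate
    u  = np.tanh(params["u"][0] @ z + params["u"][1])    # candidate update
    c  = i * u + fl * c_l + fr * c_r                     # new memory cell
    h  = o * np.tanh(c)                                  # new hidden state
    return h, c
```

Applied bottom-up over the constituency tree, this yields a representation for every phrase node.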

Idiom Detector
Despite the success of TreeLSTM, a potential weakness remains in the hypothesis that the meaning of a phrase or sentence can be composed from the meanings of its constituents. Previous neural sentence models are poor at learning the meanings of idiomatic phrases, let alone modeling idiomatic variations.
Therefore, we introduce a parameterized idiom detector to decide between literal and idiomatic interpretations. Specifically, if a compositional interpretation is nonsensical in the context of a sentence, the detector is supposed to check whether an idiomatic sense should be taken instead. This literal-first model of idiom comprehension is motivated by the psycholinguistic hypothesis proposed by Bobrow and Bell (1973).
Because it ignores contextual information, TreeLSTM struggles with disambiguation. For example, the phrase "in the bag" is compositional in the sentence "The dictionary is in the bag", while it has an idiomatic meaning in the sentence "The election is in the bag unless the voters find out about my past". To address this problem, we explicitly model the context representation and integrate it into the process of sentence composition.
Context Representation More concretely, for each non-leaf node i and its corresponding phrase p_i, we define C as the set of words surrounding the phrase p_i. The context representation s_i is then obtained as follows:

s_i = f(C; θ),

where f is a function with learnable parameters θ. Here, f is implemented in two ways: NBOW (averaging the embeddings of the words in C) and LSTM (running an LSTM over them).
Detector The detector outputs a scalar α that determines whether the meaning of a phrase is literal or idiomatic. Formally, for phrase i (non-leaf node i) with context representation s_i and literal meaning h_i^(l), we compute the semantic compositional score α_i with a multilayer perceptron with a single hidden layer:

α_i = σ(v_s^T tanh(W_s [h_i^(l); s_i])),

where W_s ∈ R^{d×2d} and v_s ∈ R^d are learnable parameters.
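The context representation (NBOW variant) and the detector score can be sketched as follows. The exact form of the scoring MLP is reconstructed from the stated parameter shapes (W_s ∈ R^{d×2d}, v_s ∈ R^d), so the function names and the tanh hidden layer are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nbow_context(word_vecs):
    """NBOW context representation: the average of the embedding vectors
    of the words surrounding the phrase (a (n, d) array -> a (d,) vector)."""
    return np.mean(word_vecs, axis=0)

def detector_score(h_literal, s_context, W_s, v_s):
    """Compositional score alpha_i = sigmoid(v_s^T tanh(W_s [h; s])).
    A value near 1 favours the literal composition, near 0 the idiomatic
    reading (the direction of the scale is an assumption)."""
    z = np.concatenate([h_literal, s_context])   # vector in R^{2d}
    return sigmoid(v_s @ np.tanh(W_s @ z))       # scalar in (0, 1)
```

The score is a scalar per non-leaf node, so the detector adds only O(d^2) parameters regardless of sentence length.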

Idiomatic Interpreter
Idiomatic phrases pose a clear challenge to current compositional models of language comprehension. However, until recently there has been little investigation into learning idiomatic phrases in a real-world task. Here, based on different views of idioms (Katz, 1963; Fraser, 1970; Nunberg et al., 1994), we propose two idiomatic interpreters to model them.
Direct Look-Up Model Inspired by the direct access theory of idiom comprehension (Glucksberg, 1993), in this model, once a phrase p is detected as an idiom, it is treated as a single long word serving as a key, and its meaning is retrieved directly from an external memory M, a table that stores an idiomatic representation for each idiom, as depicted in the top subfigure of Figure 1-(c). Formally, the idiomatic meaning of the phrase is obtained as

h^(id) = M[k],

where k denotes the index of the corresponding phrase p.
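The look-up interpreter amounts to a table of idiom vectors keyed by the canonical idiom string. The class below is a hypothetical sketch of such a memory M; the class name, the uniform initialization range, and the string-keyed index are all assumptions.

```python
import numpy as np

class IdiomMemory:
    """External memory M: one trainable vector per canonical idiom,
    retrieved by treating the detected phrase as a single long word."""

    def __init__(self, idioms, dim, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        # Map each canonical idiom string to a row index k in M.
        self.index = {idiom: k for k, idiom in enumerate(idioms)}
        self.M = rng.uniform(-0.1, 0.1, size=(len(idioms), dim))

    def lookup(self, phrase):
        """Return the stored idiomatic representation M[k] for the phrase."""
        k = self.index[phrase]
        return self.M[k]
```

Because the table is indexed by the canonical form only, any surface variation of the idiom misses the key, which is exactly the weakness the morphology-sensitive model addresses.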

Morphology-Sensitive Model
Since most idioms enjoy a certain flexibility in morphology, lexicon, and syntax, the above model suffers from the problem of idiom variations. To remedy this, inspired by the compositional view of idioms (Nunberg et al., 1994) and the recent success of character-based models (Kim et al., 2016; Lample et al., 2016; Chung et al., 2016), we propose to use a character-level LSTM (CharLSTM) to directly encode the meaning of a phrase in an idiomatic space and generate an idiomatic representation that is not contaminated by its literal meaning and is sensitive to different variations.
Formally, for each non-leaf node i and its corresponding phrase p_i in a constituency tree, we apply CharLSTM to the character sequence of p_i, as depicted in the bottom subfigure of Figure 1-(c), and utilize the emitted hidden states r_j to represent the idiomatic meaning of the phrase:

r_j = CharLSTM(e_j, r_{j−1}),

where j = 1, 2, ..., m, m is the length of the input phrase, and e_j denotes the embedding of the j-th character. We then take the last hidden state as the idiomatic representation: h_i^(id) = r_m. After obtaining the literal and idiomatic meanings, we compute the final representation of phrase p_i by interpolating them with the detector score:

h_i = α_i · h_i^(l) + (1 − α_i) · h_i^(id).
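A NumPy sketch of the morphology-sensitive interpreter: a plain character-level LSTM whose last hidden state serves as the idiomatic representation, followed by a detector-gated combination with the literal meaning. The stacked-gate weight layout and the interpolation form are assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def char_lstm(chars, emb, W, b, d):
    """Run a plain LSTM over the characters of a phrase and return the
    last hidden state r_m as its idiomatic representation.
    `emb` maps characters to d-dim vectors; W stacks the i, f, o, u
    pre-activation weights into a single (4d, 2d) matrix (an assumption)."""
    h = np.zeros(d)
    c = np.zeros(d)
    for ch in chars:
        z = np.concatenate([emb[ch], h])
        gates = W @ z + b                               # stacked pre-activations
        i, f, o = (sigmoid(g) for g in np.split(gates[:3 * d], 3))
        u = np.tanh(gates[3 * d:])                      # candidate update
        c = f * c + i * u
        h = o * np.tanh(c)
    return h

def phrase_representation(alpha, h_literal, h_idiomatic):
    """Final phrase vector: the detector score alpha interpolates between
    the literal and idiomatic readings (a reconstruction of the gating)."""
    return alpha * h_literal + (1.0 - alpha) * h_idiomatic
```

Because the encoder reads characters, a morphological variant such as "blew a gasket" lands near "blow a gasket" in the idiomatic space instead of missing a dictionary key.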

Analysis of Two Proposed Idiomatic Interpreters
Given a phrase, both interpreters can generate a corresponding semantic representation that is not contaminated by its literal meaning. The difference is that the Look-Up Model takes a totally non-compositional view in which the meanings of idioms are accessed directly from an external dictionary. This straightforward retrieval mechanism is more efficient and can introduce external prior knowledge by utilizing a pre-trained external dictionary. By contrast, the Morphology-Sensitive Model holds that idiomatic meanings can still be composed in an idiomatic space, which allows it to better handle the flexibility of idioms. Besides, this kind of model does not require an extra dictionary.

iSent: A Benchmark Dataset for Idiom-Enriched Sentiment Classification
To evaluate our models, we need a task that heavily depends on the understanding of idioms. In this paper, we choose the sentiment classification task for the following reasons: 1) Idioms typically imply an affective stance toward something, and they are common in reviews and comments (Williams et al., 2015). 2) Error analyses of sentiment classification results reveal that a large number of errors are caused by idioms (Balahur et al., 2013).
In this section, we will first give a brief description of the most commonly used datasets for sentiment classification so as to motivate the need for a new benchmark dataset.

Mainstream Datasets for Sentiment Classification
We list four datasets that are most commonly used for sentiment classification in the NLP community. Additionally, we evaluate our models on these datasets to compare with many recently proposed models. Each dataset is briefly described as follows.
• SST-1 Movie reviews with five classes (negative, somewhat negative, neutral, somewhat positive, positive) from the Stanford Sentiment Treebank (Socher et al., 2013).
• SST-2 Movie reviews with binary classes, also derived from the Stanford Sentiment Treebank.
• MR Movie reviews with two classes (Pang and Lee, 2005).
• SUBJ Subjectivity dataset where the goal is to classify each instance (snippet) as subjective or objective (Pang and Lee, 2004).
Detailed statistics for these four datasets are listed in Table 2.

Reasons for a New Dataset
Differing from previous work, which evaluates idiom detection as a standalone task, we want to integrate idiom understanding into the sentiment classification task. However, most existing sentiment datasets do not cover enough idioms or the related linguistic phenomena. To better evaluate our models on idiom understanding, we construct an idiom-enriched sentiment classification dataset in which each sentence contains at least one idiom.
Additionally, considering that most idioms have a certain flexibility in morphology, lexicon, and syntax, we enrich our dataset by introducing different types of idiom variations so that we can further evaluate the model's ability to handle them. As shown in Table 3, we sum up two types of idiom-variation phenomena and, for each variation, we obtain several corresponding sentences from a large corpus.

Data collection
We crawl rottentomatoes.com to extract movie reviews with corresponding scores, and collect idioms from a dictionary (Flavell and Flavell, 2006). The idiom dictionary contains lexical variations but no morphological variations. To address this problem, we manually annotate the morphological variations of each idiom in terms of verb tense and noun number (plural or singular).
We then filter these movie reviews, ensuring that each sentence contains at least one idiom, and obtain nearly 15K movie reviews covering 1K idioms. To further improve the quality of these idiom-enriched sentences, we apply the following filtering strategies and finally construct 13K idiom-enriched sentences.
• If an idiom occurs fewer than 3 times across all reviews, we discard the idiom and its corresponding reviews.
• We find that some "idioms" in sentences are actually movie names rather than expressions of idiomatic meaning, and we filter out this kind of noise.
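The two filtering strategies above can be sketched as follows. The function name and the (text, idioms) record format are hypothetical, chosen only to illustrate the procedure.

```python
from collections import Counter

def filter_reviews(reviews, min_count=3, movie_names=()):
    """Keep a review only if every idiom it contains occurs at least
    `min_count` times corpus-wide and is not actually a movie title.
    `reviews` is a list of (text, idioms) pairs."""
    # Corpus-wide occurrence count of each idiom.
    counts = Counter(i for _, idioms in reviews for i in idioms)
    kept = []
    for text, idioms in reviews:
        # Drop "idioms" that are really movie names (noise).
        idioms = [i for i in idioms if i not in movie_names]
        if idioms and all(counts[i] >= min_count for i in idioms):
            kept.append((text, idioms))
    return kept
```

Two passes over the corpus suffice: one to count idiom occurrences, one to apply both filters.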

Statistics
The iSent dataset finally contains 9752 training samples, 1020 development samples, and 2003 test samples. Moreover, the development and test sets cover different types of idiom variations, allowing us to test the models' generalization. Table 4 shows the detailed statistics, and Figure 2 shows the distribution of the number of reviews over different lengths.

Experiment
We first evaluate our proposed models on four popular sentiment datasets so that we can compare with a variety of competitors. We then use the newly introduced dataset to conduct more detailed experimental analyses.

Experimental Settings
Loss Function Given a sentence and its label, the output of the neural network is a probability distribution over the classes. The parameters of the network are trained to minimise the cross-entropy between the predicted and true label distributions. To minimize this objective, we use stochastic gradient descent with the diagonal variant of AdaGrad (Duchi et al., 2011).
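The objective can be illustrated with a minimal softmax cross-entropy for a single example; the batch sum and the AdaGrad update itself are omitted, and the function names are illustrative.

```python
import numpy as np

def softmax(logits):
    """Stable softmax: subtract the max before exponentiating."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(logits, gold):
    """Negative log-likelihood of the gold class under the predicted
    distribution; during training this is summed over the batch."""
    return -np.log(softmax(logits)[gold])
```

For example, uniform logits over five classes give a loss of log 5 ≈ 1.609, the entropy of a random guess.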

Initialization and Hyperparameters
In all of our experiments, the word embeddings for all of the models are initialized with the GloVe vectors (Pennington et al., 2014). The other parameters are initialized by randomly sampling from uniform distribution in [−0.1, 0.1].
For each task, we choose the hyperparameters that achieve the best performance on the development set via a small grid search over combinations of the initial learning rate [0.1, 0.01, 0.001] and the l2 regularization weight [0.0, 5E−5, 1E−5]. The final hyperparameters are as follows: the initial learning rate is 0.1 and the regularization weight is 1E−5.
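The grid search described above can be sketched as follows; `train_eval` is a hypothetical callback that trains a model with the given settings and returns its development-set accuracy.

```python
from itertools import product

def grid_search(train_eval, lrs=(0.1, 0.01, 0.001), l2s=(0.0, 5e-5, 1e-5)):
    """Return the (learning_rate, l2_weight) pair with the best
    development accuracy over the full 3x3 grid."""
    return max(product(lrs, l2s), key=lambda cfg: train_eval(*cfg))
```

With the grids in the paper this trains nine configurations per task and keeps the best on the development set.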
We parse all sentences from the five datasets with a constituency parser (Klein and Manning, 2003) to obtain the trees for our models and some competitor models.

Competitor Models
We describe the settings of our models and of several baseline models.

Evaluation over Mainstream Datasets
The experimental results are shown in Table 5.
We can see that Cont-TLSTM outperforms TLSTM on all four tasks, showing the importance of context-sensitive composition. Besides, both iTLSTM-Lo and iTLSTM-Mo achieve better results than TLSTM and Cont-TLSTM, which indicates the effectiveness of our introduced modules (the detector and the idiomatic interpreter). Additionally, compared with iTLSTM-Lo, iTLSTM-Mo performs better, suggesting that its character-based idiomatic interpreter is more powerful. Although the four mainstream datasets are not rich in idioms, we still observe substantial improvements from our models. We attribute this success to the power of the introduced detector in identifying non-compositional collocations other than idioms. We discuss this later.

Evaluation over iSent Dataset
Since iSent is a newly-introduced dataset, there are no existing baselines. Nevertheless, we provide several strong baselines implemented by ourselves, as shown in Table 6, and we can observe the following:
• Differing from the improvements achieved on the mainstream datasets, our proposed models show clearer advantages on idiom-enriched sentences, obtaining more significant improvements.
Table 6: Accuracies of our models on the iSent dataset against typical baselines. BiLSTM represents a bidirectional LSTM.
• Additionally, iTLSTM-Lo performs worse than iTLSTM-Mo while still surpassing the baseline models, which indicates that the variation-sensitive model of idioms (iTLSTM-Mo) can further improve performance.

Analysis
In this section, we will provide more detailed quantitative and qualitative analysis in terms of three properties of idioms described in Table 1: flexibility, invisibility and idiomaticity.
Flexibility Besides the overall accuracies on the test set, we also list the performance achieved by different models over different parts of the test set: original, morphological, and lexical, which represent different types of variations and are described in Table 4. As Figure 3 shows, both idiom-aware models outperform Cont-TLSTM by a large margin on the original part of the test set, which indicates the importance of understanding idiomatic phrases during sentence modelling. Additionally, iTLSTM-Mo outperforms the other two models on the whole test set, suggesting the effectiveness of the morphology-based model for modeling idiom variations.
Invisibility and Idiomaticity The previous experimental results have shown the effectiveness of our models. Here, we want to know how the introduced idiom detector contributes to the performance improvement. Toward this end, we analyze all 157 samples on iSent that our model predicts correctly but the baseline model (Cont-TLSTM) does not, and find that more than 120 of these sentences are assigned the wrong sentiment by Cont-TLSTM because it ignores the figurative meanings of idioms. For example, as shown in Figure 4, we randomly sample a sentence and analyze the changes of the predicted sentiment score at different nodes of the tree. The sentence "The movie enable my friends to blow a gasket" has negative sentiment. Cont-TLSTM gives a wrong prediction because it ignores the information expressed by the idiomatic phrase "blow a gasket". By contrast, our model correctly detects this idiom, whose meaning plays a major role in the final sentiment prediction.
Figure 4: The change of the predicted sentiment score at different nodes of the tree. The red and blue colors represent positive and negative sentiment respectively, where darker indicates higher confidence. Dashed triangle or square boxes denote nodes not selected by the detector.
Non-compositional Phrases Detection Besides idioms, we find that the introduced detector can also pick up other types of non-compositional phrases. We roughly sum up these non-compositional phrases in Table 7. From the table, we can see that most of these phrases either imply an affective stance toward something, e.g., "thumbs down", or are critical to the understanding of sentences, such as the verb phrases and adverb phrases. For example, the sentence "More often than not, this mixed bag hit its mark" has a positive sentiment. Cont-TLSTM pays too much attention to the word "not" without realizing that it belongs to the collocation "more often than not", which expresses neutral emotion. In comparison, our model treats this collocation as a whole with neutral sentiment, which is crucial for the final prediction.

Related Work
Previous work related to idioms has focused on their identification, which falls into two paradigms: idiom type classification (Gedigian et al., 2006; Shutova et al., 2010) and idiom token classification (Katz and Giesbrecht, 2006; Li and Sporleder, 2009; Fazly et al., 2009; Peng et al., 2014; Salton et al., 2016). Different from these works, we integrate idiom understanding into a real-world task and consider different peculiarities of idioms in an end-to-end trainable framework.
Recently, there has been some work exploring the compositionality of various types of phrases (Kartsaklis et al., 2012; Muraoka et al., 2014; Hermann, 2014; Hashimoto and Tsuruoka, 2016). Compared with these works, we focus on how to properly model idioms in the context of sentence representations.
More recently, Zhu et al. (2016) proposed a DAG-structured LSTM to incorporate external semantics, including non-compositional or holistically learned semantics. Its key characteristic is that a DAG needs to be built in advance, which merges some detected n-grams into non-compositional phrases based on external knowledge. Different from this work, we focus on how to integrate the detection and understanding of idioms into a unified end-to-end model, in which an idiom detector is introduced to adaptively control semantic compositionality. In particular, no extra information is given during the whole process to indicate which phrases should be regarded as non-compositional.

Conclusion and Future Work
In this paper, we situate idiom understanding in the context of sentence-level semantic representation, based on two linguistic perspectives. To apply our models to a real-world task, we introduce a sizeable idiom-enriched sentiment classification dataset that covers abundant peculiarities of idioms. We design elaborate experiments and case analyses to evaluate the effectiveness of our proposed models.
In future work, we would like to investigate more complicated idiom-enriched NLP tasks, such as machine translation.