Olive Oil is Made of Olives, Baby Oil is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model

Automatic interpretation of the relation between the constituents of a noun compound, e.g. olive oil (source) and baby oil (purpose), is an important task for many NLP applications. Recent approaches are typically based on either noun-compound representations or paraphrases. While the former initially showed promising results, recent work suggests that the success stems from memorizing single prototypical words for each relation. We explore a neural paraphrasing approach that demonstrates superior performance when such memorization is not possible.


Introduction
Automatic classification of a noun-compound (NC) to the implicit semantic relation that holds between its constituent words is beneficial for applications that require text understanding. For instance, a personal assistant asked "do I have a morning meeting tomorrow?" should search the calendar for meetings occurring in the morning, while for group meeting it should look for meetings with specific participants. The NC classification task is a challenging one, as the meaning of an NC is often not easily derivable from the meaning of its constituent words (Spärck Jones, 1983).
Previous work on the task falls into two main approaches. The first maps NCs to paraphrases that express the relation between the constituent words (e.g. Nakov and Hearst, 2006; Nulty and Costello, 2013), such as mapping coffee cup and garbage dump to the pattern [$w_1$] CONTAINS [$w_2$]. The second approach computes a representation for NCs from the distributional representation of their individual constituents. While this approach yielded promising results, Dima (2016) recently showed that similar performance is achieved by representing the NC as a concatenation of its constituent embeddings, and attributed it to the lexical memorization phenomenon (Levy et al., 2015).
* Work done during an internship at Google.
In this paper we apply lessons learned from the parallel task of semantic relation classification. We adapt HypeNET (Shwartz et al., 2016) to the NC classification task, using its path embeddings to represent paraphrases and combining them with distributional information. We experiment with various evaluation settings, including settings that make lexical memorization impossible. In these settings, the integrated method performs better than the baselines. Even so, the performance is mediocre for all methods, suggesting that the task is difficult and warrants further investigation.

Background
Various tasks have been suggested to address noun-compound interpretation. NC paraphrasing extracts texts explicitly describing the implicit relation between the constituents, for example student protest is a protest LED BY, BE SPONSORED BY, or BE ORGANIZED BY students (e.g. Nakov and Hearst, 2006; Kim and Nakov, 2011; Hendrickx et al., 2013; Nulty and Costello, 2013). Compositionality prediction determines to what extent the meaning of the NC can be expressed in terms of the meaning of its constituents, e.g. spelling bee is non-compositional, as it is not related to bee (e.g. Reddy et al., 2011). In this paper we focus on the NC classification task, which is defined as follows: given a pre-defined set of relations, classify $nc = w_1 w_2$ to the relation that holds between $w_1$ and $w_2$. We review the various features used in the literature for classification.

Compositional Representations
In this approach, classification is based on a vector representing the NC ($w_1 w_2$), which is obtained by applying a function to its constituents' distributional representations: $\vec{v}_{w_1}, \vec{v}_{w_2} \in \mathbb{R}^n$. Various functions have been proposed in the literature.
Mitchell and Lapata (2010) proposed 3 simple combinations of $\vec{v}_{w_1}$ and $\vec{v}_{w_2}$ (additive, multiplicative, dilation). Others suggested representing compositions by applying linear functions, encoded as matrices, over word vectors. Baroni and Zamparelli (2010) focused on adjective-noun compositions (AN) and represented adjectives as matrices, nouns as vectors, and ANs as their multiplication. Matrices were learned with the objective of minimizing the distance between the learned vector and the observed vector (computed from corpus occurrences) of each AN. The full-additive model (Zanzotto et al., 2010; Dinu et al., 2013) is a similar approach that works on any two-word composition, multiplying each word by a square matrix: $\vec{nc} = A \cdot \vec{v}_{w_1} + B \cdot \vec{v}_{w_2}$.
Socher et al. (2012) suggested a non-linear composition model. A recursive neural network operates bottom-up on the output of a constituency parser to represent variable-length phrases. Each constituent is represented by a vector that captures its meaning and a matrix that captures how it modifies the meaning of constituents that it combines with. For a binary NC, $\vec{nc} = g(W \cdot [\vec{v}_{w_1}; \vec{v}_{w_2}])$, where $W \in \mathbb{R}^{n \times 2n}$ and $g$ is a non-linear function.
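The composition functions above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than any of the cited implementations: the function names, the toy dimension, and the random parameters are assumptions, dilation is omitted, and $W$ is given shape $n \times 2n$ so that its product with the $2n$-dimensional concatenation is defined.

```python
import numpy as np

def additive(v1, v2):
    """Element-wise sum of the two constituent vectors."""
    return v1 + v2

def multiplicative(v1, v2):
    """Element-wise product of the two constituent vectors."""
    return v1 * v2

def full_additive(v1, v2, A, B):
    """Full-additive composition: each word is multiplied by its own square matrix."""
    return A @ v1 + B @ v2

def nonlinear(v1, v2, W, g=np.tanh):
    """Non-linear composition over the concatenated constituents (g = tanh is an assumption)."""
    return g(W @ np.concatenate([v1, v2]))

# Toy, randomly initialised inputs; real models would use learned parameters.
rng = np.random.default_rng(0)
n = 4
v_w1, v_w2 = rng.normal(size=n), rng.normal(size=n)
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
W = rng.normal(size=(n, 2 * n))

nc_full_additive = full_additive(v_w1, v_w2, A, B)
nc_nonlinear = nonlinear(v_w1, v_w2, W)
```

With identity matrices for A and B, the full-additive model reduces to the additive one, which is one way to see it as a generalization.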
These representations were used as features in NC classification, often achieving promising results (e.g. Van de Cruys et al., 2013; Dima and Hinrichs, 2015). However, Dima (2016) recently showed that similar performance is achieved by representing the NC as a concatenation of its constituent embeddings, and argued that it stems from memorizing prototypical words for each relation. For example, a classifier may assign any NC with the head oil to the SOURCE relation, regardless of the modifier.

Paraphrasing
In this approach, the paraphrases of an NC, i.e. the patterns connecting the joint occurrences of the constituents in a corpus, are treated as features. For example, both paper cup and steel knife may share the feature MADE OF. Ó Séaghdha and Copestake (2013) leveraged this "relational similarity" in a kernel-based classification approach. They combined the relational information with the complementary lexical features of each constituent separately. Two NCs labeled to the same relation may consist of similar constituents (paper-steel, cup-knife) and may also appear with similar paraphrases. Combining the two information sources has been shown to be beneficial, but it was also noted that the relational information suffered from data sparsity: many NCs had very few paraphrases, and paraphrase similarity was based on n-gram overlap.
Recently, Surtani and Paul (2015) suggested representing NCs in a vector space model (VSM) using paraphrases as features. These vectors were used to classify new NCs based on the nearest neighbor in the VSM. However, the model was only tested on a small dataset and performed similarly to previous methods.

Model
We similarly investigate the use of paraphrasing for NC relation classification. To generate a signal for the joint occurrences of $w_1$ and $w_2$, we follow the approach used by HypeNET (Shwartz et al., 2016). For each NC $w_1 w_2$ in the dataset, we collect all the dependency paths that connect $w_1$ and $w_2$ in the corpus, and learn path embeddings as detailed in Section 3.2. Section 3.1 describes the classification models with which we experimented.

Classification Models
Figure 1 provides an overview of the models: path-based, integrated, and integrated-NC, each of which incrementally adds new features not present in the previous model. In the following sections, $\vec{x}$ denotes the input vector representing the NC. The network classifies the NC to the highest scoring relation: $r = \arg\max_i \mathrm{softmax}(\vec{o})_i$, where $\vec{o}$ is the output layer. All networks contain a single hidden layer whose dimension is $\frac{|\vec{x}|}{2}$. $k$ is the number of relations in the dataset. See Appendix A for additional technical details.
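As a rough illustration of the classifier described above, the following NumPy sketch runs a forward pass through a single hidden layer and picks the highest scoring relation. It is a sketch under stated assumptions, not the authors' code: the hidden size is read as half the input dimension, the tanh activation and the random initialisation are assumptions, and `init_params` and `classify` are hypothetical names.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def init_params(input_dim, num_relations, rng):
    """Random parameters for one hidden layer of size input_dim // 2 (assumed reading)."""
    hidden_dim = input_dim // 2
    return {
        "W1": rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(scale=0.1, size=(num_relations, hidden_dim)),
        "b2": np.zeros(num_relations),
    }

def classify(x, params):
    """Return (index of the highest scoring relation, distribution over relations)."""
    h = np.tanh(params["W1"] @ x + params["b1"])  # activation choice is an assumption
    o = params["W2"] @ h + params["b2"]           # output layer, one score per relation
    probs = softmax(o)
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
x = rng.normal(size=60)                           # toy input vector for one NC
params = init_params(input_dim=x.size, num_relations=12, rng=rng)
relation, probs = classify(x, params)
```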
Path-based. Classifies the NC based only on the paths connecting the joint occurrences of $w_1$ and $w_2$ in the corpus, denoted $P(w_1, w_2)$. We define the feature vector as the average of its path embeddings, where the path embedding $\vec{p}$ of a path $p$ is weighted by its frequency $f_{p,(w_1,w_2)}$:
$$\vec{x} = \vec{v}_{P(w_1,w_2)} = \frac{\sum_{p \in P(w_1,w_2)} f_{p,(w_1,w_2)} \cdot \vec{p}}{\sum_{p \in P(w_1,w_2)} f_{p,(w_1,w_2)}}$$
Integrated. We concatenate $w_1$ and $w_2$'s word embeddings to the path vector, to add distributional information:
$$\vec{x} = [\vec{v}_{w_1}, \vec{v}_{P(w_1,w_2)}, \vec{v}_{w_2}]$$
Potentially, this allows the network to utilize the contextual properties of each individual constituent, e.g. assigning high probability to SUBSTANCE-MATERIAL-INGREDIENT for edible $w_1$s (e.g. vanilla pudding, apple cake).
Integrated-NC. We add the NC's observed vector $\vec{v}_{nc}$ as additional distributional input, providing the contexts in which $w_1 w_2$ occur as an NC:
$$\vec{x} = [\vec{v}_{w_1}, \vec{v}_{nc}, \vec{v}_{P(w_1,w_2)}, \vec{v}_{w_2}]$$
Like Dima (2016), we learn NC vectors using the GloVe algorithm (Pennington et al., 2014), by replacing each NC occurrence in the corpus with a single token.
This information can potentially help cluster NCs that appear in similar contexts despite having low pairwise similarity scores between their constituents. For example, gun violence and abortion rights belong to the TOPIC relation and may appear in similar news-related contexts, while (gun, abortion) and (violence, rights) are dissimilar.
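The three input vectors described in this section can be assembled as follows. This is a minimal sketch with hypothetical function names. The order of the concatenated parts is an assumption, since any fixed order carries the same information for a fully connected network.

```python
import numpy as np

def path_vector(path_embeddings, frequencies):
    """Frequency-weighted average of the path embeddings of one NC."""
    P = np.asarray(path_embeddings, dtype=float)   # shape: (num_paths, path_dim)
    f = np.asarray(frequencies, dtype=float)       # shape: (num_paths,)
    return (f[:, None] * P).sum(axis=0) / f.sum()

def integrated(v_w1, v_w2, v_paths):
    """Path vector plus the word embeddings of both constituents."""
    return np.concatenate([v_w1, v_paths, v_w2])

def integrated_nc(v_w1, v_w2, v_paths, v_nc):
    """Integrated input plus the observed embedding of the NC itself."""
    return np.concatenate([v_w1, v_nc, v_paths, v_w2])

# Toy example: two 2-dimensional path embeddings seen 3 times and once.
v_paths = path_vector([[1.0, 0.0], [0.0, 1.0]], [3, 1])
x_int = integrated(np.ones(3), np.zeros(3), v_paths)
x_int_nc = integrated_nc(np.ones(3), np.zeros(3), v_paths, np.full(3, 2.0))
```

In the toy example the more frequent path dominates the average, which is the intended effect of the frequency weighting.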

Path Embeddings
Following HypeNET, for a path $p$ composed of edges $e_1, ..., e_k$, we represent each edge by the concatenation of its lemma, part-of-speech tag, dependency label and direction vectors: $\vec{v}_e = [\vec{v}_l, \vec{v}_{pos}, \vec{v}_{dep}, \vec{v}_{dir}]$. The edge vectors $\vec{v}_{e_1}, ..., \vec{v}_{e_k}$ are encoded using an LSTM (Hochreiter and Schmidhuber, 1997), and the last output vector $\vec{p}$ is used as the path embedding.
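The edge representation and LSTM encoding can be sketched with a hand-written LSTM cell in NumPy. Everything concrete here is an assumption made for illustration: the toy vocabularies, the example path, the embedding sizes other than the 50-dimensional lemmas, the hidden size, and the random weights. A real implementation would learn these parameters with a deep learning library.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_table(symbols, dim):
    """Hypothetical embedding table: one random vector per symbol."""
    return {s: rng.normal(scale=0.1, size=dim) for s in symbols}

lemma_emb = make_table(["X", "make", "of", "Y"], 50)
pos_emb = make_table(["NOUN", "VERB", "ADP"], 4)
dep_emb = make_table(["nsubjpass", "ROOT", "prep", "pobj"], 5)
dir_emb = make_table(["<", ">", "-"], 3)

def edge_vector(lemma, pos, dep, direction):
    """Concatenate lemma, POS, dependency label and direction vectors."""
    return np.concatenate([lemma_emb[lemma], pos_emb[pos],
                           dep_emb[dep], dir_emb[direction]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_encode(edge_vectors, W, U, b):
    """Run a standard LSTM over the edges and return the last output vector."""
    hidden = U.shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in edge_vectors:
        i, f, o, g = np.split(W @ x + U @ h + b, 4)  # input, forget, output gates, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

# Illustrative path for a pattern like "X is made of Y".
path = [("X", "NOUN", "nsubjpass", "<"), ("make", "VERB", "ROOT", "-"),
        ("of", "ADP", "prep", ">"), ("Y", "NOUN", "pobj", ">")]
edges = [edge_vector(*e) for e in path]
input_dim, hidden_dim = edges[0].size, 16
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)
path_embedding = lstm_encode(edges, W, U, b)
```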
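We use the NC labels as distant supervision. While HypeNET predicts a word pair's label from the frequency-weighted average of the path vectors, we differ from it slightly and compute the label from the frequency-weighted average of the predictions obtained from each path separately:
$$\vec{o} = \frac{\sum_{p \in P(w_1,w_2)} f_{p,(w_1,w_2)} \cdot \mathrm{softmax}(\vec{o}_p)}{\sum_{p \in P(w_1,w_2)} f_{p,(w_1,w_2)}}, \qquad r = \arg\max_i \vec{o}_i$$
where $\vec{o}_p$ denotes the relation scores computed from the path embedding $\vec{p}$ alone. We conjecture that label distribution averaging allows for more efficient training of path embeddings when a single NC contains multiple paths.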
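The difference between averaging path vectors and averaging per-path predictions can be made concrete with a small sketch. It assumes, for illustration only, that a single linear layer `W` maps a path embedding to relation scores; the function names and toy numbers are hypothetical.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, so it works for one vector or a batch of rows."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_from_averaged_paths(P, f, W):
    """HypeNET-style: average the path vectors first, then predict once."""
    v = (f[:, None] * P).sum(axis=0) / f.sum()
    return softmax(W @ v)

def predict_from_averaged_labels(P, f, W):
    """Variant described above: predict from each path, then average the distributions."""
    per_path = softmax(P @ W.T)                    # one label distribution per path
    return (f[:, None] * per_path).sum(axis=0) / f.sum()

P = np.array([[2.0, 0.0], [0.0, 2.0]])             # two toy path embeddings
f = np.array([3.0, 1.0])                           # their corpus frequencies
W = np.eye(2)                                      # toy scoring layer, two relations
avg_paths = predict_from_averaged_paths(P, f, W)
avg_labels = predict_from_averaged_labels(P, f, W)
```

Both variants return a valid distribution, but because softmax is non-linear they generally differ, and in the second variant each path receives its own prediction signal during training.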

Dataset
We follow Dima (2016) and evaluate on the Tratz (2011) dataset, with 19,158 instances and two levels of labels: fine-grained (Tratz-fine, 37 relations) and coarse-grained (Tratz-coarse, 12 relations). We report results on both versions. See Tratz (2011) for the list of relations. Dima (2016) showed that a classifier based only on $\vec{v}_{w_1}$ and $\vec{v}_{w_2}$ performs on par with compound representations, and that the success comes from lexical memorization (Levy et al., 2015): memorizing the majority label of single words in particular slots of the compound (e.g. TOPIC for travel guide, fishing guide, etc.). This memorization paints a skewed picture of the state-of-the-art performance on this difficult task.

Dataset Splits
To better test this hypothesis, we evaluate on 4 different splits of the datasets into train, test, and validation sets: (1) random, in a 75:20:5 ratio, and (2) lexical-full, in which the train, test, and validation sets each consist of a distinct vocabulary. The split was suggested by Levy et al. (2015), and it randomly assigns words to distinct sets, such that, for example, including travel guide in the train set guarantees that fishing guide would not be included in the test set, and the models do not benefit from memorizing that the head guide is always annotated as TOPIC. Given that the split discards many NCs, we experimented with two additional splits: (3) lexical-mod, in which the $w_1$ words are unique in each set, and (4) lexical-head, in which the $w_2$ words are unique in each set. Table 2 displays the sizes of each split.
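One way to realise the lexical splits is to assign each relevant word to a set at random and then place each NC according to its words. The sketch below is an assumption-laden illustration: the function name, the toy data and its labels, and the use of the 75:20:5 ratio for word assignment are not taken from the paper.

```python
import random

def lexical_split(ncs, mode="full", ratios=(0.75, 0.20, 0.05), seed=0):
    """Split (w1, w2, label) triples so that the chosen slot's words never cross sets.

    mode="full": both constituents must fall in the same set, otherwise the NC is discarded.
    mode="head": sets have disjoint heads (w2). mode="mod": disjoint modifiers (w1).
    """
    names = ("train", "test", "val")
    rng = random.Random(seed)
    if mode == "full":
        vocab = sorted({w for w1, w2, _ in ncs for w in (w1, w2)})
    elif mode == "head":
        vocab = sorted({w2 for _, w2, _ in ncs})
    elif mode == "mod":
        vocab = sorted({w1 for w1, _, _ in ncs})
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Each word is randomly assigned to exactly one set.
    assignment = {w: rng.choices(names, weights=ratios)[0] for w in vocab}
    splits = {name: [] for name in names}
    for w1, w2, label in ncs:
        if mode == "full":
            if assignment[w1] == assignment[w2]:
                splits[assignment[w1]].append((w1, w2, label))
            # NCs whose words land in different sets are discarded.
        elif mode == "head":
            splits[assignment[w2]].append((w1, w2, label))
        else:
            splits[assignment[w1]].append((w1, w2, label))
    return splits

# Toy data with illustrative labels.
ncs = [("travel", "guide", "TOPIC"), ("fishing", "guide", "TOPIC"),
       ("olive", "oil", "SOURCE"), ("baby", "oil", "PURPOSE"),
       ("paper", "cup", "MATERIAL"), ("coffee", "cup", "CONTAINS")]
head_split = lexical_split(ncs, mode="head")
full_split = lexical_split(ncs, mode="full")
```

The head and modifier variants keep every NC, while the full variant drops any NC whose two words were assigned to different sets, which is why it yields smaller datasets.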

Baselines
Frequency Baselines. mod freq classifies $w_1 w_2$ to the most common relation in the train set for NCs with the same modifier ($w_1$), while head freq considers NCs with the same head ($w_2$). [4]
Distributional Baselines. Ablation of the path-based component from our models: Dist uses only $w_1$ and $w_2$'s word embeddings, $\vec{x} = [\vec{v}_{w_1}, \vec{v}_{w_2}]$, while Dist-NC also includes the NC embedding, $\vec{x} = [\vec{v}_{w_1}, \vec{v}_{nc}, \vec{v}_{w_2}]$. The network architecture is defined similarly to our models (Section 3.1).
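The frequency baselines amount to a lookup table from a word in a given slot to its most common training relation. The following sketch uses hypothetical function names and toy training data. Unseen words fall back to a random relation, as the footnote to these baselines states.

```python
import random
from collections import Counter, defaultdict

MODIFIER, HEAD = 0, 1  # slot indices into (w1, w2)

def fit_frequency_baseline(train, slot):
    """Map each word seen in the given slot to its most common relation in the train set."""
    counts = defaultdict(Counter)
    for w1, w2, relation in train:
        counts[(w1, w2)[slot]][relation] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def predict(model, nc, slot, relations, rng):
    """Look up the slot word; unseen words are assigned a random relation."""
    word = nc[slot]
    if word in model:
        return model[word]
    return rng.choice(relations)

# Toy training data with illustrative labels.
train = [("travel", "guide", "TOPIC"), ("fishing", "guide", "TOPIC"),
         ("olive", "oil", "SOURCE"), ("olive", "grove", "SOURCE")]
relations = ["TOPIC", "SOURCE", "PURPOSE"]
rng = random.Random(0)

head_freq = fit_frequency_baseline(train, HEAD)
mod_freq = fit_frequency_baseline(train, MODIFIER)
seen_head = predict(head_freq, ("city", "guide"), HEAD, relations, rng)
seen_mod = predict(mod_freq, ("olive", "branch"), MODIFIER, relations, rng)
unseen = predict(head_freq, ("morning", "meeting"), HEAD, relations, rng)
```

This also makes the memorization effect visible: any NC headed by guide is labeled TOPIC regardless of its modifier.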
Compositional Baselines. We re-train Dima's (2016) models, various combinations of NC representations (Zanzotto et al., 2010; Socher et al., 2012) and single word embeddings in a fully connected network.
Table 1 shows the performance of various methods on the datasets. Dima's (2016) compositional models perform best among the baselines, and on the random split, better than all the methods. On the lexical splits, however, the baselines exhibit a dramatic drop in performance, and are outperformed by our methods. The gap is larger in the lexical-full split. Finally, there is usually no gain from the added NC vector in Dist-NC and Integrated-NC.
[3] In practice, in lexical-full this is a random baseline, in lexical-head it is the modifier frequency baseline, and in lexical-mod it is the head frequency baseline.
[4] Unseen heads/modifiers are assigned a random relation.

Analysis
Path Embeddings. To focus on the changes from previous work, we analyze the performance of the path-based model on the Tratz-fine random split. This dataset contains 37 relations and the model performance varies across them. Some relations, such as MEASURE and PERSONAL TITLE, yield reasonable performance ($F_1$ scores of 0.87 and 0.68, respectively). Table 3 focuses on these relations and illustrates the indicative paths that the model has learned for each relation. We compute these by performing the analysis in Shwartz et al. (2016), where each path is fed into the path-based model, and is assigned to its best-scoring relation. For each relation, we consider paths with a score ≥ 0.8.
Other relations achieve very low $F_1$ scores, indicating that the model is unable to learn them at all. Interestingly, the four relations with the lowest performance in our model are also those with the highest error rate in Dima (2016), very likely since they express complex relations. For example, the LEXICALIZED relation contains non-compositional NCs (soap opera) or lexical items whose meanings departed from the combination of the constituent meanings. It is expected that there are no paths that indicate lexicalization. In PARTIAL ATTRIBUTE TRANSFER (bullet train), $w_1$ transfers an attribute to $w_2$ (e.g. bullet transfers speed to train). These relations are not expected to be expressed in text, unless the text aims to explain them (e.g. train as fast as a bullet).
Looking closer at the model confusions shows that it often defaulted to general relations like OBJECTIVE (recovery plan) or RELATIONAL-NOUN-COMPLEMENT (eye shape). The latter is described as "indicating the complement of a relational noun (e.g., son of, price of)", and the indicative paths for this relation indeed contain many variants of "[$w_2$] of [$w_1$]", which potentially can occur with NCs in other relations. The model also confused relations with subtle differences, such as the different topic relations. Given that these relations were conflated to a single relation in the inter-annotator agreement computation in Tratz and Hovy (2010), we can conjecture that even humans find it difficult to distinguish between them.
NC Embeddings. To understand why the NC embeddings did not contribute to the classification, we looked into the embeddings of the Tratz-fine test NCs; 3091/3831 (81%) of them had embeddings. For each NC, we looked for the 10 most similar NC vectors (in terms of cosine similarity), and compared their labels. We found that only 27.61% of the NCs were mostly similar to NCs with the same label. The problem seems to be inconsistency of annotations rather than low embedding quality.
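The neighbor analysis above can be approximated with the sketch below. It reads "mostly similar to NCs with the same label" as "more than half of the k nearest neighbors share the NC's label", which is an assumption, as are the function name and the toy vectors.

```python
import numpy as np

def neighbor_label_agreement(vectors, labels, k=10):
    """Fraction of items whose k nearest neighbors (by cosine) mostly share their label."""
    X = np.asarray(vectors, dtype=float)
    y = np.asarray(labels)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors: dot product = cosine
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)                   # an item is not its own neighbor
    k = min(k, len(X) - 1)
    agree = 0
    for i in range(len(X)):
        neighbours = np.argsort(-sims[i])[:k]         # indices of the k most similar items
        same = np.count_nonzero(y[neighbours] == y[i])
        if same > k / 2:
            agree += 1
    return agree / len(X)

# Toy embeddings forming two tight clusters.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
consistent = neighbor_label_agreement(vectors, ["A", "A", "B", "B"], k=1)
inconsistent = neighbor_label_agreement(vectors, ["A", "B", "A", "B"], k=1)
```

The toy example shows how the same embeddings give high or low agreement depending only on how consistently the clusters are labeled.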

Conclusion
We used an existing neural dependency path representation to represent noun-compound paraphrases, and along with distributional information applied it to the NC classification task. Following previous work, which suggested that distributional methods succeed due to lexical memorization, we show that when lexical memorization is not possible, the performance of all methods is much worse. Adding the path-based component helps mitigate this issue and increases performance.

A Technical Details
To extract paths, we use a concatenation of English Wikipedia and the Gigaword corpus. We consider sentences with up to 32 words and dependency paths with up to 8 edges, including satellites, and keep only 1,000 paths for each noun-compound. We compute the path embeddings in advance for all the paths connecting NCs in the dataset (§3.2), and then treat them as fixed embeddings during classification (§3.1). We use TensorFlow (Abadi et al., 2016) to train the models, fixing the values of the hyperparameters after performing preliminary experiments on the validation set. We set the mini-batch size to 10, use the Adam optimizer (Kingma and Ba, 2014) with the default learning rate, and apply word dropout with probability 0.1. We train up to 30 epochs with early stopping, stopping the training when the $F_1$ score on the validation set drops 8 points below the best performing score.
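The early stopping rule can be illustrated with a small helper that replays a sequence of validation scores. This is a sketch only: the function name is hypothetical, scores are assumed to be on a 0 to 100 scale, and "drops 8 points below" is read as a gap of at least 8 points from the best score so far.

```python
def early_stopping_epoch(val_f1_scores, max_epochs=30, tolerance=8.0):
    """Return (epoch at which training stops, best validation F1 seen so far)."""
    best = float("-inf")
    epoch = 0
    for epoch, f1 in enumerate(val_f1_scores[:max_epochs], start=1):
        best = max(best, f1)
        if best - f1 >= tolerance:   # validation F1 fell too far below the best score
            return epoch, best
    return epoch, best               # ran out of epochs (or scores) without stopping early

stopped = early_stopping_epoch([50.0, 60.0, 55.0, 51.0, 70.0])
finished = early_stopping_epoch([50.0, 60.0, 61.0])
```

In the first call, training stops at epoch 4 because 51 is 9 points below the best score of 60, so the later score of 70 is never reached.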
We initialize the distributional embeddings with the 300-dimensional pre-trained GloVe embeddings (Pennington et al., 2014) and the lemma embeddings (for the path-based component) with the 50-dimensional ones. Unlike HypeNET, we do not update these pre-trained embeddings during training. The dependency label, POS, and direction embeddings are initialized randomly and updated during training. NC embeddings are learned using a concatenation of Wikipedia and Gigaword. Similarly to the original GloVe implementation, we only keep the most frequent 400,000 vocabulary terms, which means that roughly 20% of the noun-compounds do not have vectors and are initialized randomly in the model.