Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations

Revealing the implicit semantic relation between the constituents of a noun-compound is important for many NLP applications. It has been addressed in the literature either as a classification task to a set of pre-defined relations or by producing free text paraphrases explicating the relations. Most existing paraphrasing methods lack the ability to generalize, and have a hard time interpreting infrequent or new noun-compounds. We propose a neural model that generalizes better by representing paraphrases in a continuous space, generalizing for both unseen noun-compounds and rare paraphrases. Our model helps improve performance on both the noun-compound paraphrasing and classification tasks.


Introduction
Noun-compounds hold an implicit semantic relation between their constituents. For example, a 'birthday cake' is a cake eaten on a birthday, while 'apple cake' is a cake made of apples. Interpreting noun-compounds by explicating the relationship is beneficial for many natural language understanding tasks, especially given the prevalence of noun-compounds in English (Nakov, 2013).
The interpretation of noun-compounds has been addressed in the literature either by classifying them to a fixed inventory of ontological relationships (e.g. Nastase and Szpakowicz, 2003) or by generating various free text paraphrases that describe the relation in a more expressive manner (e.g. Hendrickx et al., 2013).
Methods dedicated to paraphrasing noun-compounds usually rely on corpus co-occurrences of the compound's constituents as a source of explicit relation paraphrases (e.g. Wubben, 2010; Versley, 2013). Such methods are unable to generalize for unseen noun-compounds. Yet, most noun-compounds are very infrequent in text (Kim and Baldwin, 2007), and humans easily interpret the meaning of a new noun-compound by generalizing existing knowledge. For example, consider interpreting parsley cake as a cake made of parsley vs. resignation cake as a cake eaten to celebrate quitting an unpleasant job.
We follow the paraphrasing approach and propose a semi-supervised model for paraphrasing noun-compounds. Differently from previous methods, we train the model to predict either a paraphrase expressing the semantic relation of a noun-compound (predicting '[w 2 ] made of [w 1 ]' given 'apple cake'), or a missing constituent given a combination of paraphrase and noun-compound (predicting 'apple' given 'cake made of [w 1 ]'). Constituents and paraphrase templates are represented as continuous vectors, and semantically-similar paraphrase templates are embedded in proximity, enabling better generalization. Interpreting 'parsley cake' effectively reduces to identifying paraphrase templates whose "selectional preferences" (Pantel et al., 2007) on each constituent fit 'parsley' and 'cake'.
A qualitative analysis of the model shows that the top ranked paraphrases retrieved for each noun-compound are plausible even when the constituents never co-occur (Section 4). We evaluate our model on both the paraphrasing and the classification tasks (Section 5). On both tasks, the model's ability to generalize leads to improved performance in challenging evaluation settings.

Background

Noun-compound Classification
Noun-compound classification is the task concerned with automatically determining the semantic relation that holds between the constituents of a noun-compound, taken from a set of pre-defined relations.
Early work on the task leveraged information derived from lexical resources and corpora (e.g. Girju, 2007; Ó Séaghdha and Copestake, 2009; Tratz and Hovy, 2010). More recent work broke the task into two steps: in the first step, a noun-compound representation is learned from the distributional representation of the constituent words (e.g. Mitchell and Lapata, 2010; Zanzotto et al., 2010; Socher et al., 2012). In the second step, the noun-compound representations are used as feature vectors for classification (e.g. Dima and Hinrichs, 2015; Dima, 2016).
The datasets for this task differ in size, number of relations and granularity level (e.g. Nastase and Szpakowicz, 2003; Kim and Baldwin, 2007; Tratz and Hovy, 2010). The decision on the relation inventory is somewhat arbitrary, and consequently, the inter-annotator agreement is relatively low (Kim and Baldwin, 2007). Specifically, a noun-compound may fit into more than one relation: for instance, in Tratz (2011), business zone is labeled as CONTAINED (zone contains business), although it could also be labeled as PURPOSE (zone whose purpose is business).

Noun-compound Paraphrasing
As an alternative to the strict classification to predefined relation classes, Nakov and Hearst (2006) suggested that the semantics of a noun-compound could be expressed with multiple prepositional and verbal paraphrases. For example, apple cake is a cake from, made of, or which contains apples.
The suggestion was embraced and resulted in two SemEval tasks. SemEval 2010 task 9 (Butnariu et al., 2009) provided a list of plausible human-written paraphrases for each noun-compound, and systems had to rank them with the goal of high correlation with human judgments. In SemEval 2013 task 4 (Hendrickx et al., 2013), systems were expected to provide a ranked list of paraphrases extracted from free text.
Various approaches were proposed for this task. Most approaches start with a pre-processing step of extracting joint occurrences of the constituents from a corpus to generate a list of candidate paraphrases. Unsupervised methods apply information extraction techniques to find and rank the most meaningful paraphrases (Kim and Nakov, 2011; Xavier and Lima, 2014; Pasca, 2015; Pavlick and Pasca, 2017), while supervised approaches learn to rank paraphrases using various features such as co-occurrence counts (Wubben, 2010; Li et al., 2010; Surtani et al., 2013; Versley, 2013) or the distributional representations of the noun-compounds (Van de Cruys et al., 2013).
One of the challenges of this approach is the ability to generalize. If one assumes that sufficient paraphrases for all noun-compounds appear in the corpus, the problem reduces to ranking the existing paraphrases. It is more likely, however, that some noun-compounds do not have any paraphrases in the corpus or have just a few. The approach of Van de Cruys et al. (2013) somewhat generalizes for unseen noun-compounds. They represented each noun-compound using a compositional distributional vector (Mitchell and Lapata, 2010) and used it to predict paraphrases from the corpus. Similar noun-compounds are expected to have similar distributional representations and therefore yield the same paraphrases. For example, if the corpus does not contain paraphrases for plastic spoon, the model may predict the paraphrases of a similar compound such as steel knife.
In terms of sharing information between semantically-similar paraphrases, Nulty and Costello (2010) and Surtani et al. (2013) learned "is-a" relations between paraphrases from the co-occurrences of various paraphrases with each other. For example, the specific '[w 2 ] extracted from [w 1 ]' template (e.g. in the context of olive oil) generalizes to '[w 2 ] made from [w 1 ]'. One of the drawbacks of these systems is that they favor more frequent paraphrases, which may co-occur with a wide variety of more specific paraphrases.

Noun-compounds in other Tasks
Noun-compound paraphrasing may be considered as a subtask of the general paraphrasing task, whose goal is to generate, given a text fragment, additional texts with the same meaning. However, general paraphrasing methods do not guarantee to explicate implicit information conveyed in the original text. Moreover, the most notable source for extracting paraphrases is multiple translations of the same text (Barzilay and McKeown, 2001; Ganitkevitch et al., 2013; Mallinson et al., 2017). If a certain concept can be described by an English noun-compound, it is unlikely that a translator chose to translate its foreign language equivalent to an explicit paraphrase instead.
Another related task is Open Information Extraction (Etzioni et al., 2008), whose goal is to extract relational tuples from text. Most systems focus on extracting verb-mediated relations, and the few exceptions that addressed noun-compounds provided partial solutions. Pal and Mausam (2016) focused on segmenting multi-word noun-compounds and assumed an is-a relation between the parts, as in extracting (Francis Collins, is, NIH director) from "NIH director Francis Collins". Xavier and Lima (2014) enriched the corpus with compound definitions from online dictionaries, for example, interpreting oil industry as (industry, produces and delivers, oil) based on the WordNet definition "industry that produces and delivers oil". This method is very limited as it can only interpret noun-compounds with dictionary entries, while the majority of English noun-compounds do not have them (Nakov, 2013).

Paraphrasing Model
As opposed to previous approaches, which focus on predicting a paraphrase template for a given noun-compound, we reformulate the task as a multi-task learning problem (Section 3.1), and train the model to also predict a missing constituent given the paraphrase template and the other constituent. Our model is semi-supervised, and it expects as input a set of noun-compounds and a set of constrained part-of-speech tag-based templates that make valid prepositional and verbal paraphrases. Section 3.2 details the creation of training data, and Section 3.3 describes the model.

Multi-task Reformulation
Each training example consists of two constituents and a paraphrase (w 2 , p, w 1 ), and we train the model on 3 subtasks: (1) predict p given w 1 and w 2 , (2) predict w 1 given p and w 2 , and (3) predict w 2 given p and w 1 . Figure 1 demonstrates the predictions for subtasks (1) (right) and (2) (left) for the training example (cake, made of, apple). Effectively, the model is trained to answer questions such as "what can cake be made of?", "what can be made of apple?", and "what are the possible relationships between cake and apple?".
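The three subtasks can be illustrated with a small sketch that expands one training triple into three (input sequence, target) pairs. This is only an illustration under stated assumptions: the function name `make_subtasks` and the exact string form of the placeholder symbols are hypothetical, chosen to mirror the [w 1 ], [w 2 ] and [p] symbols used in the text.

```python
from typing import List, Tuple

# Placeholder symbols standing in for the missing component
# (string forms are an assumption made for this sketch).
W1, W2, P = "[w1]", "[w2]", "[p]"

def make_subtasks(w2: str, p: str, w1: str) -> List[Tuple[List[str], str]]:
    """Expand one (w2, p, w1) triple into three (input sequence, target) pairs."""
    p_tokens = p.split()
    return [
        ([w2, P, w1], p),               # (1) predict the paraphrase p
        ([w2] + p_tokens + [W1], w1),   # (2) predict the constituent w1
        ([W2] + p_tokens + [w1], w2),   # (3) predict the constituent w2
    ]

examples = make_subtasks("cake", "made of", "apple")
for seq, target in examples:
    print(" ".join(seq), "->", target)
```

Each input sequence keeps the two known items in place and marks the missing one with its placeholder, so one encoder can serve all three subtasks.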
The multi-task reformulation helps learning better representations for paraphrase templates, by embedding semantically-similar paraphrases in proximity. Similarity between paraphrases stems either from lexical similarity and overlap between the paraphrases (e.g. 'is made of' and 'made of'), or from shared constituents. This allows the model to predict a correct paraphrase for a given noun-compound, even when the constituents do not occur with that paraphrase in the corpus.

Training Data
We collect a training set of (w 2 , p, w 1 , s) examples, where w 1 and w 2 are constituents of a noun-compound w 1 w 2 , p is a templated paraphrase, and s is the score assigned to the training instance. We use the 19,491 noun-compounds found in the SemEval tasks datasets (Butnariu et al., 2009; Hendrickx et al., 2013) and in Tratz (2011). To extract patterns of part-of-speech tags that can form noun-compound paraphrases, such as '[w 2 ] VERB PREP [w 1 ]', we use the SemEval task training data, but we do not use the lexical information in the gold paraphrases.
Corpus. Similarly to previous noun-compound paraphrasing approaches, we use the Google Ngram corpus (Brants and Franz, 2006) as a source of paraphrases (Wubben, 2010;Li et al., 2010;Surtani et al., 2013;Versley, 2013). The corpus consists of sequences of n terms (for n ∈ {3, 4, 5}) that occur more than 40 times on the web. We search for n-grams following the extracted patterns and containing w 1 and w 2 's lemmas for some noun-compound in the set. We remove punctuation, adjectives, adverbs and some determiners to unite similar paraphrases. For example, from the 5-gram 'cake made of sweet apples' we extract the training example (cake, made of, apple). We keep only paraphrases that occurred at least 5 times, resulting in 136,609 instances.
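The extraction of a training example from an n-gram can be sketched roughly as follows. This is a minimal illustration, assuming the n-gram has already been lemmatized and tagged with coarse part-of-speech tags; the `Token` layout, the tag names and the function `extract_example` are hypothetical. The sketch drops every determiner, whereas the text removes only some, and it skips the part-of-speech pattern match.

```python
from typing import List, Optional, Tuple

# Coarse POS tags that are stripped to unite similar paraphrases.
# Dropping all determiners is a simplification made for this sketch.
DROPPED_POS = {"ADJ", "ADV", "DET", "PUNCT"}

Token = Tuple[str, str, str]  # (surface form, lemma, coarse POS tag)

def extract_example(ngram: List[Token], w1: str, w2: str) -> Optional[Tuple[str, str, str]]:
    """Return (w2, paraphrase, w1) if the n-gram starts with w2 and ends with w1."""
    kept = [tok for tok in ngram if tok[2] not in DROPPED_POS]
    if len(kept) < 3:
        return None  # no words left between the constituents
    if kept[0][1] != w2 or kept[-1][1] != w1:
        return None  # constituents are matched by lemma
    paraphrase = " ".join(form for form, _, _ in kept[1:-1])
    return (w2, paraphrase, w1)

ngram = [
    ("cake", "cake", "NOUN"),
    ("made", "make", "VERB"),
    ("of", "of", "ADP"),
    ("sweet", "sweet", "ADJ"),
    ("apples", "apple", "NOUN"),
]
example = extract_example(ngram, w1="apple", w2="cake")
print(example)
```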
Weighting. Each n-gram in the corpus is accompanied with its frequency, which we use to assign scores to the different paraphrases. For instance, 'cake of apples' may also appear in the corpus, although with lower frequency than 'cake from apples'. As also noted by Surtani et al. (2013), the shortcoming of such a weighting mechanism is that it prefers shorter paraphrases, which are much more common in the corpus (e.g. count('cake made of apples') ≪ count('cake of apples')). We overcome this by normalizing the frequencies for each paraphrase length, creating a distribution of paraphrases in a given length.
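The per-length normalization can be sketched as below. The counts are invented for illustration only, and two details are assumptions of this sketch rather than statements from the text: length is measured as the number of words in the paraphrase, and the normalization is applied over whatever set of paraphrases is passed in.

```python
from collections import defaultdict
from typing import Dict

def normalize_by_length(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw frequencies into one distribution per paraphrase length."""
    totals: Dict[int, int] = defaultdict(int)
    for paraphrase, count in counts.items():
        totals[len(paraphrase.split())] += count
    return {p: c / totals[len(p.split())] for p, c in counts.items()}

# Illustrative counts (not taken from the corpus).
counts = {"of": 900, "from": 100, "made of": 30, "made from": 10}
scores = normalize_by_length(counts)
for paraphrase, score in scores.items():
    print(paraphrase, round(score, 3))
```

After normalization, a two-word paraphrase competes only with other two-word paraphrases, so 'made of' is no longer penalized merely for being rarer than 'of'.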
Negative Samples. We add 1% of negative samples by selecting random corpus words w 1 and w 2 that do not co-occur, and adding an example (w 2 , [w 2 ] is unrelated to [w 1 ], w 1 , s n ), for some predefined negative samples score s n . Similarly, for a word w i that did not occur in a paraphrase p we add (w i , p, UNK, s n ) or (UNK, p, w i , s n ), where UNK is the unknown word. This may help the model deal with non-compositional noun-compounds, where w 1 and w 2 are unrelated, rather than forcibly predicting some relation between them.
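A rough sketch of the first kind of negative sample follows. Several details are assumptions made for illustration: the value of the negative score `s_n`, reading "1%" as a fraction of the number of positive instances, and the function name `sample_negatives` are all hypothetical.

```python
import random
from typing import List, Set, Tuple

UNRELATED = "[w2] is unrelated to [w1]"

def sample_negatives(vocab: List[str], cooccurring: Set[Tuple[str, str]],
                     n_positive: int, ratio: float = 0.01,
                     s_n: float = 0.1, seed: int = 0) -> List[Tuple[str, str, str, float]]:
    """Sample random word pairs that never co-occur and label them as unrelated."""
    rng = random.Random(seed)
    n_neg = round(ratio * n_positive)
    negatives: List[Tuple[str, str, str, float]] = []
    attempts = 0
    # Bounded loop so the sketch terminates even if few valid pairs exist.
    while len(negatives) < n_neg and attempts < 100 * max(n_neg, 1):
        attempts += 1
        w1, w2 = rng.sample(vocab, 2)
        if (w1, w2) in cooccurring or (w2, w1) in cooccurring:
            continue
        negatives.append((w2, UNRELATED, w1, s_n))
    return negatives

vocab = ["cake", "apple", "wedding", "diamond", "software"]
cooccurring = {("apple", "cake")}
negatives = sample_negatives(vocab, cooccurring, n_positive=200)
print(negatives)
```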

Model
For a training instance (w 2 , p, w 1 , s), we predict each item given the encoding of the other two.
Encoding. We use the 100-dimensional pretrained GloVe embeddings (Pennington et al., 2014), which are fixed during training. In addition, we learn embeddings for the special words [w 1 ], [w 2 ], and [p], which are used to represent a missing component, as in "cake made of [w 1 ]", "[w 2 ] made of apple", and "cake [p] apple".
For a missing component x ∈ {[p], [w 1 ], [w 2 ]} surrounded by the sequences of words v 1:i−1 and v i+1:n , we encode the sequence using a bidirectional long short-term memory (bi-LSTM) network (Graves and Schmidhuber, 2005), and take the ith output vector as representing the missing component:

$$h_x = \mathrm{bLS}(v_{1:i-1}, x, v_{i+1:n})_i$$

In bi-LSTMs, each output vector is a concatenation of the outputs of the forward and backward LSTMs, so the output vector is expected to contain information on valid substitutions both with respect to the previous words v 1:i−1 and the subsequent words v i+1:n .
Prediction. We predict a distribution over the vocabulary of the missing component, i.e. to predict w 1 correctly we need to predict its index in the word vocabulary V w , while the prediction of p is from the vocabulary of paraphrases in the training set, V p . We predict the following distributions:

$$\hat{p} = \mathrm{softmax}(W_p \cdot h_{[p]}), \qquad \hat{w}_1 = \mathrm{softmax}(W_w \cdot h_{[w_1]}), \qquad \hat{w}_2 = \mathrm{softmax}(W_w \cdot h_{[w_2]}) \qquad (1)$$

where $h_{[p]}$, $h_{[w_1]}$ and $h_{[w_2]}$ are the bi-LSTM output vectors at the position of the missing component, $W_w \in \mathbb{R}^{|V_w| \times 2d}$, $W_p \in \mathbb{R}^{|V_p| \times 2d}$, and d is the embeddings dimension.
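The prediction step can be sketched in plain Python as a linear map from the 2d-dimensional bi-LSTM output vector to vocabulary scores, followed by a softmax. This is a toy illustration, not the DyNet implementation: the weights, the vector `h` and the three-entry paraphrase vocabulary are made up, and the softmax is an assumption consistent with the cross-entropy training objective.

```python
import math
from typing import List

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict_distribution(W: List[List[float]], h: List[float]) -> List[float]:
    """W has shape |V| x 2d and h has length 2d; returns a distribution over V."""
    scores = [sum(w_j * h_j for w_j, h_j in zip(row, h)) for row in W]
    return softmax(scores)

# Toy setting: d = 1 (so 2d = 2) and a paraphrase vocabulary of size 3.
V_p = ["[w2] made of [w1]", "[w2] for [w1]", "[w2] of [w1]"]
W_p = [[1.0, 0.5], [0.2, -0.3], [0.0, 0.1]]
h = [0.8, 0.4]  # stands in for the bi-LSTM output at the [p] position

p_hat = predict_distribution(W_p, h)
best = V_p[max(range(len(p_hat)), key=p_hat.__getitem__)]
print(best)
```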
During training, we compute the cross-entropy loss for each subtask using the gold item and the prediction, sum up the losses, and weight them by the instance score. During inference, we predict the missing components by picking the best scoring index in each distribution:

$$\hat{p}_i = \arg\max(\hat{p}), \qquad \hat{w}_{1i} = \arg\max(\hat{w}_1), \qquad \hat{w}_{2i} = \arg\max(\hat{w}_2) \qquad (2)$$

The subtasks share the pre-trained word embeddings, the special embeddings, and the biLSTM parameters. Subtasks (2) and (3) also share W w , the MLP that predicts the index of a word.

Table 1: Examples of top ranked predicted components using the model: predicting the paraphrase given w 1 and w 2 (left), w 1 given w 2 and the paraphrase (middle), and w 2 given w 1 and the paraphrase (right).

Implementation Details. The model is implemented in DyNet (Neubig et al., 2017). We dedicate a small number of noun-compounds from the corpus for validation. We train for up to 10 epochs, stopping early if the validation loss has not improved in 3 epochs. We use Momentum SGD (Nesterov, 1983), and set the batch size to 10 and the other hyper-parameters to their default values.
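The score-weighted training objective described in this section can be sketched as follows. The three toy distributions and the instance score are invented for illustration, and the helper names are hypothetical; the sketch only shows how the per-subtask cross-entropy losses are summed and then scaled by the instance score s.

```python
import math
from typing import List

def cross_entropy(dist: List[float], gold_index: int) -> float:
    """Negative log-probability assigned to the gold item."""
    return -math.log(dist[gold_index])

def instance_loss(dists: List[List[float]], gold_indices: List[int], score: float) -> float:
    """Sum the subtask losses and weight the sum by the instance score."""
    return score * sum(cross_entropy(d, g) for d, g in zip(dists, gold_indices))

# Toy predicted distributions for the three subtasks (p, w1, w2).
dists = [[0.7, 0.2, 0.1], [0.5, 0.5], [0.25, 0.75]]
gold_indices = [0, 1, 1]
loss = instance_loss(dists, gold_indices, score=0.5)
print(round(loss, 4))
```

Weighting by the score means that frequent (w 2 , p, w 1 ) instances contribute more to the parameter updates than rare ones.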

Qualitative Analysis
To estimate the quality of the proposed model, we first provide a qualitative analysis of the model outputs. Table 1 displays examples of the model outputs for each possible usage: predicting the paraphrase given the constituent words, and predicting each constituent word given the paraphrase and the other word.
The examples in the table are from among the top 10 ranked predictions for each component pair. We note that most of the (w 2 , paraphrase, w 1 ) triplets in the table do not occur in the training data, but are rather generalized from similar examples. For example, there is no training instance for "company in the software industry" but there is a "firm in the software industry" and a company in many other industries.
While the frequent prepositional paraphrases are often ranked at the top of the list, the model also retrieves more specific verbal paraphrases. The list often contains multiple semantically-similar paraphrases, such as '[w 2 ] involved in [w 1 ]' and '[w 2 ] in [w 1 ] industry'. This is a result of the model training objective (Section 3) which positions the vectors of semantically-similar paraphrases close to each other in the embedding space, based on similar constituents.
To illustrate paraphrase similarity we compute a t-SNE projection (Van Der Maaten, 2014) of the embeddings of all the paraphrases, and draw a sample of 50 paraphrases in Figure 2. The projection positions semantically-similar but lexically-divergent paraphrases in proximity, likely due to many shared constituents. For instance, 'with', 'from', and 'out of' can all describe the relation between food words and their ingredients.

Evaluation: Noun-Compound Interpretation Tasks
For quantitative evaluation we employ our model for two noun-compound interpretation tasks. The main evaluation is on retrieving and ranking paraphrases (§5.1). For the sake of completeness, we also evaluate the model on classification to a fixed inventory of relations (§5.2), although it was not designed for this task.

Paraphrasing
Task Definition. The general goal of this task is to interpret each noun-compound to multiple prepositional and verbal paraphrases. In SemEval 2013 Task 4, the participating systems were asked to retrieve a ranked list of paraphrases for each noun-compound, which was automatically evaluated against a similarly ranked list of paraphrases proposed by human annotators.
Model. For a given noun-compound w 1 w 2 , we first predict the k = 250 most likely paraphrases: $\hat{p}_1, \ldots, \hat{p}_k = \operatorname{argmax}_k \hat{p}$, where $\hat{p}$ is the distribution of paraphrases defined in Equation 1. While the model also provides a score for each paraphrase (Equation 1), the scores have not been optimized to correlate with human judgments. We therefore developed a re-ranking model that receives a list of paraphrases and re-ranks the list to better fit the human judgments.
We follow Herbrich (2000) and learn a pairwise ranking model. The model determines which of two paraphrases of the same noun-compound should be ranked higher, and it is implemented as an SVM classifier using scikit-learn (Pedregosa et al., 2011). For training, we use the available training data with gold paraphrases and ranks provided by the SemEval task organizers. We extract a set of features for a paraphrase p, the last of which is based on its confidence score under the model. This feature incorporates the original model score into the decision, so that other considerations such as preposition frequency in the training set do not take over.
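Pairwise ranking is commonly reduced to binary classification by training on feature-difference vectors, and the sketch below shows that standard transformation. It is an assumption that the re-ranker is built this way; the two toy features (a word count and a confidence score) and the function name `pairwise_examples` are hypothetical, and the resulting pairs would then be passed to a classifier such as an SVM.

```python
from itertools import combinations
from typing import List, Tuple

def pairwise_examples(features: List[List[float]],
                      gold_ranks: List[int]) -> Tuple[List[List[float]], List[int]]:
    """Build difference vectors for paraphrases of one noun-compound.

    A lower gold rank means a better paraphrase. Label +1 means the first
    item of the pair should be ranked higher, -1 the opposite.
    """
    X: List[List[float]] = []
    y: List[int] = []
    for i, j in combinations(range(len(features)), 2):
        if gold_ranks[i] == gold_ranks[j]:
            continue  # ties carry no preference
        diff = [a - b for a, b in zip(features[i], features[j])]
        label = 1 if gold_ranks[i] < gold_ranks[j] else -1
        X.append(diff)
        y.append(label)
        # Add the mirrored pair so both classes are represented.
        X.append([-d for d in diff])
        y.append(-label)
    return X, y

# Toy features per paraphrase: [number of words, model confidence].
features = [[2, 0.9], [1, 0.4], [3, 0.1]]
gold_ranks = [1, 2, 3]
X, y = pairwise_examples(features, gold_ranks)
print(len(X), y)
```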
During inference, the model sorts the list of paraphrases retrieved for each noun-compound according to the pairwise ranking. It then scores each paraphrase by multiplying its rank with its original model score, and prunes paraphrases with final score < 0.025. The values for k and the threshold were tuned on the training set.
Evaluation Settings. The SemEval 2013 task provided a scorer that compares words and n-grams from the gold paraphrases against those in the predicted paraphrases, where agreement on a prefix of a word (e.g. in derivations) yields a partial scoring. The overall score assigned to each system is calculated in two different ways. The 'isomorphic' setting rewards both precision and recall, and performing well on it requires accurately reproducing as many of the gold paraphrases as possible, and in much the same order. The 'non-isomorphic' setting rewards only precision, and performing well on it requires accurately reproducing the top-ranked gold paraphrases, with no importance to order.
Baselines. We compare our method with the published results from the SemEval task. The SemEval 2013 baseline generates for each noun-compound a list of prepositional paraphrases in an arbitrary fixed order. It achieves a moderately good score in the non-isomorphic setting by generating a fixed set of paraphrases which are both common and generic. The MELODI system performs similarly: it represents each noun-compound using a compositional distributional vector (Mitchell and Lapata, 2010) which is then used to predict paraphrases from the corpus. The performance of MELODI indicates that the system was rather conservative, yielding a few common paraphrases rather than many specific ones. SFS and IIITH, on the other hand, show a more balanced trade-off between recall and precision.
As a sanity check, we also report the results of a baseline that retrieves ranked paraphrases from the training data collected in Section 3.2. This baseline has no generalization abilities, therefore it is expected to score poorly on the recall-aware isomorphic setting.

Table 2
Method                                            isomorphic   non-isomorphic
Baselines
SFS (Versley, 2013)                               23.1         17.9
IIITH (Surtani et al., 2013)                      23.1         25.8
MELODI (Van de Cruys et al., 2013)                13.0         54.8
SemEval 2013 Baseline (Hendrickx et al., 2013)    13

Results. Table 2 displays the performance of the proposed method and the baselines in the two evaluation settings. Our method outperforms all the methods in the isomorphic setting. In the non-isomorphic setting, it outperforms the other two systems that score reasonably on the isomorphic setting (SFS and IIITH) but cannot compete with the systems that focus on achieving high precision.
The main advantage of our proposed model is in its ability to generalize, and that is also demonstrated in comparison to our baseline performance. The baseline retrieved paraphrases only for a third of the noun-compounds (61/181), expectedly yielding poor performance on the isomorphic setting. Our model, which was trained on the very same data, retrieved paraphrases for all noun-compounds. For example, welfare system was not present in the training data, yet the model predicted the correct paraphrases "system of welfare benefits", "system to provide welfare" and others.
Error Analysis. We analyze the causes of the false positive and false negative errors made by the model. For each error type we sample 10 noun-compounds. For each noun-compound, false positive errors are the top 10 predicted paraphrases which are not included in the gold paraphrases, while false negative errors are the top 10 gold paraphrases not found in the top k predictions made by the model. Table 3 displays the manually annotated categories for each error type.
Many false positive errors are actually valid paraphrases that were not suggested by the human annotators (error 1, "discussion by group"). Some are borderline valid with minor grammatical changes (error 6, "force of coalition forces") or too specific (error 2, "life of women in community" instead of "life in community"). Common prepositional paraphrases were often retrieved although they are incorrect (error 3). We conjecture that these errors often stem from an n-gram that does not respect the syntactic structure of the sentence, e.g. a sentence such as "rinse away the oil from baby's head" produces the n-gram "oil from baby".
With respect to false negative examples, they consisted of many long paraphrases, while our model was restricted to 5 words due to the source of the training data (error 1, "holding done in the case of a share"). Many prepositional paraphrases contained determiners, which we conflated with the same paraphrases without determiners (error 2, "mutation of a gene"). Finally, in some paraphrases, the constituents in the gold paraphrase appear in inflectional forms (error 3, "holding of shares" instead of "holding of share").

Classification
Noun-compound classification is defined as a multiclass classification problem: given a pre-defined set of relations, classify w 1 w 2 to the relation that holds between w 1 and w 2 . Potentially, the corpus co-occurrences of w 1 and w 2 may contribute to the classification, e.g. '[w 2 ] held at [w 1 ]' indicates a TIME relation. Tratz and Hovy (2010) included such features in their classifier, but ablation tests showed that these features had a relatively small contribution, probably due to the sparseness of the paraphrases. Recently, Shwartz and Waterson (2018) showed that paraphrases may contribute to the classification when represented in a continuous space.
Model. We generate a paraphrase vector representation par(w 1 w 2 ) for a given noun-compound w 1 w 2 as follows. We predict the indices of the k most likely paraphrases: $\hat{p}_1, \ldots, \hat{p}_k = \operatorname{argmax}_k \hat{p}$, where $\hat{p}$ is the distribution on the paraphrase vocabulary V p , as defined in Equation 1. We then encode each paraphrase using the biLSTM, and average the paraphrase vectors, weighted by their confidence scores in $\hat{p}$:

$$\mathrm{par}(w_1 w_2) = \frac{\sum_{i=1}^{k} s_i \, \vec{v}_i}{\sum_{i=1}^{k} s_i}$$

where $\vec{v}_i$ is the biLSTM encoding of the ith predicted paraphrase and $s_i$ is its confidence score in $\hat{p}$.
We train a linear classifier, and represent w 1 w 2 in a feature vector f (w 1 w 2 ) in two variants: paraphrase: f (w 1 w 2 ) = par(w 1 w 2 ), or integrated: concatenated to the constituent word embeddings f (w 1 w 2 ) = [ par(w 1 w 2 ), w 1 , w 2 ]. The classifier type (logistic regression/SVM), k, and the penalty are tuned on the validation set. We also provide a baseline in which we ablate the paraphrase component from our model, representing a noun-compound by the concatenation of its constituent embeddings f (w 1 w 2 ) = [ w 1 , w 2 ] (distributional).
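The confidence-weighted average and the three feature variants can be sketched as follows. The two-dimensional vectors and the scores are toy values chosen for illustration, and the function name `paraphrase_vector` is hypothetical; in the model, the paraphrase vectors come from the biLSTM and the constituent vectors from the word embeddings.

```python
from typing import List

def paraphrase_vector(vectors: List[List[float]], scores: List[float]) -> List[float]:
    """Average the paraphrase vectors, weighted by their confidence scores."""
    total = sum(scores)
    dim = len(vectors[0])
    return [sum(s * v[j] for v, s in zip(vectors, scores)) / total
            for j in range(dim)]

# Toy encodings of the k = 2 top paraphrases and their confidence scores.
vectors = [[1.0, 0.0], [0.0, 1.0]]
scores = [0.75, 0.25]
par = paraphrase_vector(vectors, scores)

# Toy constituent embeddings for w1 and w2.
w1_vec, w2_vec = [0.1, 0.2], [0.3, 0.4]

f_paraphrase = par                        # paraphrase variant
f_integrated = par + w1_vec + w2_vec      # integrated variant (concatenation)
f_distributional = w1_vec + w2_vec        # distributional baseline
print(par, len(f_integrated))
```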
We report the performance on two different dataset splits to train, test, and validation: a random split in a 75:20:5 ratio, and, following concerns raised by Dima (2016) about lexical memorization (Levy et al., 2015), on a lexical split in which the sets consist of distinct vocabularies. The lexical split better demonstrates the scenario in which a noun-compound whose constituents have not been observed needs to be interpreted based on similar observed noun-compounds, e.g. inferring the relation in pear tart based on apple cake and other similar compounds. We follow the random and full-lexical splits from Shwartz and Waterson (2018).
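One simple way to obtain sets with distinct vocabularies is sketched below: partition the words first, then keep only the compounds whose two constituents fall in the same partition. This is an illustrative assumption and not necessarily the procedure used to build the splits of Shwartz and Waterson (2018); applying the 75:20:5 ratio to the vocabulary and the function name `lexical_split` are hypothetical choices.

```python
import random
from typing import Dict, List, Tuple

def lexical_split(compounds: List[Tuple[str, str]],
                  ratios: Tuple[float, float, float] = (0.75, 0.20, 0.05),
                  seed: int = 0) -> Dict[str, List[Tuple[str, str]]]:
    """Split compounds so that train, test and validation share no words."""
    rng = random.Random(seed)
    vocab = sorted({w for nc in compounds for w in nc})
    rng.shuffle(vocab)
    cut1 = int(ratios[0] * len(vocab))
    cut2 = cut1 + int(ratios[1] * len(vocab))
    part = {}
    for idx, word in enumerate(vocab):
        part[word] = "train" if idx < cut1 else "test" if idx < cut2 else "val"
    splits: Dict[str, List[Tuple[str, str]]] = {"train": [], "test": [], "val": []}
    for w1, w2 in compounds:
        # Compounds that straddle two partitions are discarded.
        if part[w1] == part[w2]:
            splits[part[w1]].append((w1, w2))
    return splits

words = [f"w{i}" for i in range(20)]
compounds = [(a, b) for a in words for b in words if a != b]
splits = lexical_split(compounds)
print({name: len(items) for name, items in splits.items()})
```

The cost of this scheme is that many compounds are dropped, which is one reason lexical splits tend to be smaller and harder than random ones.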
Baselines. We report the results of 3 baselines representative of different approaches:
1) Feature-based (Tratz and Hovy, 2010): we reimplement a version of the classifier with features from WordNet and Roget's Thesaurus.
2) Compositional (Dima, 2016): a neural architecture that operates on the distributional representations of the noun-compound and its constituents. Noun-compound representations are learned with compositional models such as that of Socher et al. (2012). We report the results from Shwartz and Waterson (2018).
3) Paraphrase-based (Shwartz and Waterson, 2018): a neural classification model that learns an LSTM-based representation of the joint occurrences of w 1 and w 2 in a corpus (i.e. observed paraphrases), and integrates distributional information using the constituent embeddings.
Results. Table 4 displays the methods' performance on the two versions of the Tratz (2011) dataset and the two dataset splits. The paraphrase model on its own is inferior to the distributional model; however, the integrated version improves upon the distributional model in 3 out of 4 settings, demonstrating the complementary nature of the distributional and paraphrase-based methods. The contribution of the paraphrase component is especially noticeable in the lexical splits. As expected, the integrated method in Shwartz and Waterson (2018), in which the paraphrase representation was trained with the objective of classification, performs better than our integrated model. The superiority of both integrated models in the lexical splits confirms that paraphrases are beneficial for classification.
Analysis. To analyze the contribution of the paraphrase component to the classification, we focused on the differences between the distributional and integrated models on the Tratz-Coarse lexical split. Examination of the per-relation F 1 scores revealed that the relations for which performance improved the most in the integrated model were TOPICAL (+11.1 F 1 points), OBJECTIVE (+5.5), ATTRIBUTE (+3.8) and LOCATION/PART WHOLE (+3.5). Table 5 provides examples of noun-compounds that were correctly classified by the integrated model while being incorrectly classified by the distributional model. For each noun-compound, we provide examples of top ranked paraphrases which are indicative of the gold label relation.

Compositionality Analysis
Our paraphrasing approach at its core assumes compositionality: only a noun-compound whose meaning is derived from the meanings of its constituent words can be rephrased using them. In §3.2 we added negative samples to the training data to simulate non-compositional noun-compounds, which are included in the classification dataset (§5.2). We assumed that these compounds, more often than compositional ones, would consist of unrelated constituents (spelling bee, sacred cow), and added instances of random unrelated nouns with '[w 2 ] is unrelated to [w 1 ]'. Here, we assess whether our model succeeds in recognizing non-compositional noun-compounds.
We used the compositionality dataset of Reddy et al. (2011) which consists of 90 noun-compounds along with human judgments about their compositionality on a scale of 0-5, 0 being non-compositional and 5 being compositional. For each noun-compound in the dataset, we predicted the 15 best paraphrases and analyzed the errors. The most common error was predicting paraphrases for idiomatic compounds which may have a plausible concrete interpretation or which originated from one. For example, it predicted that silver spoon is simply a spoon made of silver and that monkey business is a business that buys or raises monkeys. In other cases, it seems that the strong prior on one constituent leads to ignoring the other, unrelated constituent, as in predicting "wedding made of diamond". Finally, the "unrelated" paraphrase was predicted for a few compounds, but those are not necessarily non-compositional (application form, head teacher). We conclude that the model does not address compositionality and suggest applying it only to compositional compounds, which may be recognized using compositionality prediction methods as in Reddy et al. (2011).

Conclusion
We presented a new semi-supervised model for noun-compound paraphrasing. The model differs from previous models by being trained to predict both a paraphrase given a noun-compound, and a missing constituent given the paraphrase and the other constituent. This results in better generalization abilities, leading to improved performance in two noun-compound interpretation tasks. In the future, we plan to take generalization one step further, and explore the possibility of using the biLSTM for generating completely new paraphrase templates unseen during training.