Learning Paraphrasing for Multiword Expressions

In this paper, we investigate the impact of context for the paraphrase ranking task, comparing and quantifying results for multi-word expressions and single words. We focus on systematic integration of existing paraphrase resources to produce paraphrase candidates, and later ask human annotators to judge paraphrasability in context. We first conduct a paraphrase-scoring annotation task with and without context for targets that are i) single- and multi-word expressions and ii) verbs and nouns. We quantify how differently annotators score paraphrases when context information is provided. Furthermore, we report on experiments with automatic paraphrase ranking. If we regard the problem as a binary classification task, we obtain F1-scores of 81.56% and 79.87% for multi-word expressions and single words respectively, using a kNN classifier. Approaching the problem as a learning-to-rank task, we attain MAP scores of up to 87.14% and 91.58% for multi-word expressions and single words respectively, using LambdaMART, thus yielding high-quality contextualized paraphrase selection. Further, we provide the first dataset with paraphrase judgments for multi-word targets in context.


Introduction
In this work, we examine the influence of context on the paraphrasing of multi-word expressions (MWEs). Paraphrases are alternative ways of writing text while conveying the same information (Zhao et al., 2007; Burrows et al., 2013). There are several applications where automatic text paraphrasing is desired, such as text shortening (Burrows et al., 2013), text simplification, machine translation (Kauchak and Barzilay, 2006), or textual entailment.
Over the last decade, a large number of paraphrase resources have been released, including PPDB (Pavlick et al., 2015), which is the largest in size. However, PPDB provides only paraphrases without context, which hampers the usage of such a resource in applications. In this paper, we tackle the research question of how to automatically rank paraphrase candidates from abundantly available paraphrase resources. Most existing work on paraphrases focuses on the lexical, phrase, sentence, and document level (Burrows et al., 2013). We primarily focus on the contextualization of paraphrases based on existing paraphrase resources.
Furthermore, we target multi-word paraphrases, since single-word replacements are covered well in lexical substitution datasets, such as (McCarthy and Navigli, 2007; Biemann, 2012). While these datasets contain multi-word substitution candidates, the substitution targets are strictly single words. Multi-word expressions are prevalent in text, constituting roughly as many entries as single words in a speaker's lexicon (Sag et al., 2002), and are important for a number of NLP applications. For example, the work by Finlayson and Kulkarni (2011) shows that detection of multi-word expressions improves the F-score of a word sense disambiguation task by 5 percent. In this paper, we experiment with both MWEs and single words and investigate the difficulty of the paraphrasing task for single words vs. MWEs, using the same contextual features.
Our work, centered on assessing the effect of context on human paraphrase ranking and its automatic prediction, includes the following steps: 1) systematic combination of existing paraphrase resources to produce paraphrase candidates for single- and multi-word expressions, 2) collection of a dataset for the paraphrase ranking/selection annotation task using crowdsourcing, and 3) investigation of different machine learning approaches for automatic paraphrase ranking.
Related Work

Paraphrase Resources and Machine Learning Approaches

Paraphrasing consists mainly of two tasks: paraphrase generation and paraphrase identification. Paraphrase generation is the task of obtaining candidate paraphrases for a given target. Paraphrase identification estimates whether a given paraphrase candidate can replace a paraphrase target without changing the meaning in context. PPDB (Pavlick et al., 2015) is one of the largest collections of paraphrase resources, collected from bilingual parallel corpora. PPDB2 has recently been released with revised ranking scores, based on human judgments for 26,455 paraphrase pairs sampled from PPDB1. The authors apply ridge regression to rank paraphrases, using the features from PPDB1 together with word embeddings.
Kozareva and Montoyo (2006) use a dataset of paraphrases that were generated using monolingual machine translation. In the dataset, sentence pairs are annotated as being paraphrases or not. For the binary classification, they use three machine learning algorithms (SVM, kNN, and MaxEnt). As features they use word overlap, n-gram ratios between targets and candidates, skip-grams, longest common subsequences, POS tags, and proper names. Connor and Roth (2007) develop a global classifier that takes a word v and its context, along with a candidate word u, and determines whether u can replace v in the given context while maintaining the original meaning. Their work focuses on verb paraphrasing. Notions of context include: being either subject or object of the verb, named entities that appear as subject or object, all dependency links connected to the target, all noun phrases in sentences containing the target, or all of the above.
Brockett and Dolan (2005) use annotated datasets and Support Vector Machines (SVMs) to induce larger monolingual paraphrase corpora from a comparable corpus of news clusters found on the World Wide Web. Features include morphological variants, WordNet synonyms and hypernyms, log-likelihood-based word pairings dynamically obtained from baseline sentence alignments, and string features such as word-based edit distance. Bouamor et al. (2011) introduce a targeted paraphrasing system, addressing the task of rewriting a subpart of a sentence to make it easier for automatic translation. They report on experiments on rewriting sentences from the Wikipedia edit history, using existing paraphrase resources and web queries. An SVM classifier was used for evaluation, achieving an accuracy of 70%.
Using a dependency-based context-sensitive vector-space approach, Thater et al. (2009) compute vector-space representations of predicate meaning in context for the task of paraphrase ranking. An evaluation on a subset of the SemEval 2007 lexical substitution task produces better results than the state-of-the-art systems of the time. Zhao et al. (2007) address the problem of context-specific lexical paraphrasing using different approaches. First, similar sentences are extracted from the web and candidates are generated based on syntactic similarities; candidate paraphrases are further filtered using POS tagging. Second, candidate paraphrases are validated using different similarity measures such as co-occurrence similarity and syntactic similarity.
Our work is similar to previous approaches on all-words lexical substitution (Szarvas et al., 2013; Kremer et al., 2014; Hintz and Biemann, 2016) in the sense that we construct delexicalized classifiers for ranking paraphrases: targets, paraphrase candidates, and context are represented without lexical information, which allows us to learn a single classifier/ranker for all potential paraphrasing candidates. However, these approaches are limited to single-word targets (Szarvas et al., 2013) or single-word substitutions (Kremer et al., 2014) only. In this paper, we extend these notions to MWE targets and substitutions, highlight the differences to single-word approaches, and report both on classification and ranking experiments.

Multi-word Expression Resources
While there is some work on the extraction of multi-word expressions and on investigating their impact on different NLP applications, as far as we know, no single work is dedicated to paraphrasing multi-word expressions. Various approaches exist for the extraction of MWEs: Tsvetkov and Wintner (2010) present an approach to extract MWEs from parallel corpora. They align the parallel corpus and focus on misalignments, which typically indicate expressions in the source language that are translated to the target language in a non-compositional way. Frantzi et al. (2000) present a method to extract multi-word terms from English corpora, which combines linguistic and statistical information. The Multi-word Expression Toolkit (MWEtoolkit) extracts MWE candidates based on flat n-grams or specific morphosyntactic patterns (of surface forms, lemmas, POS tags) (Ramisch et al., 2010) and applies different filters, ranging from simple count thresholds to more complex ones such as association measures (AMs). The tool further supports indexing and searching of MWEs, validation, and annotation facilities. Schneider et al. (2014) developed a sequence-tagging-based supervised approach to MWE identification. A rich set of features was used in a linguistically-driven evaluation of the identification of heterogeneous MWEs. Vincze et al. (2011) construct a multi-word expression corpus annotated with different types of MWEs such as compounds, idioms, verb-particle constructions, light verb constructions, and others. In our work, we use a combination of many MWE resources from different sources for both MWE target detection and candidate generation (see Subsection 3.2).

Methods
In this section we describe our approach, which covers the collection of training data, the detection of multi-word paraphrase targets, the annotation of substitutes, and the training of a classifier to rank substitute candidates for a paraphrase target.

Impact of Context on Paraphrasing
In order to validate our intuitively plausible hypothesis that context has an impact on paraphrasing, we conduct experiments using the PPDB2 paraphrase database. PPDB2 was released with better paraphrase ranking than PPDB1 (Pavlick et al., 2015) but does not incorporate context information. Hence, we carry out different paraphrase ranking and selection annotation tasks using the Amazon Mechanical Turk crowdsourcing platform. In the first annotation task, a total of 171 sentences with five paraphrase targets each are selected from the British Academic Written English (BAWE) corpus 1 (Alsop and Nesi, 2009). The targets are selected in such a way that a) they include MWEs as targets where possible (see Subsection 3.2 for how we select targets) and b) the candidates can bear more than one contextual meaning; c) workers can select up to three paraphrases and have to supply their own paraphrase if none of the candidates match. To satisfy condition b), we use the JoBimText DT database API (Ruppert et al., 2015) to obtain single-word candidates with multiple senses according to automatic sense induction.
We conduct this annotation setup twice, with and without showing the original context (3-8 sentences). For both setups, each task is assigned to 5 workers. We incorporate control questions with invalid candidate paraphrases in order to reject unreliable workers. In addition to the control questions, JavaScript functions are embedded to ensure that workers select or supply at least one paraphrase. The results are aggregated by summing the number of workers that agreed on each candidate, yielding scores between 0 and 5. Table 1 shows the Spearman correlation results. We can see that both single-word and MWE targets are context-dependent, as correlations are consistently lower when context is taken into account. Further, we note that correlations are positive but low, indicating that the PPDB2 ranking should not be used as-is for paraphrasing.
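The aggregation and correlation analysis described above can be sketched as follows. This is a minimal illustration assuming toy vote counts and PPDB2 scores for one target; the actual values are not taken from the paper's data.

```python
# Sketch of the correlation analysis: compare the PPDB2 ranking against
# aggregated annotator votes (0-5 workers selecting each candidate),
# once without and once with context shown. All values are illustrative.
from scipy.stats import spearmanr

# Hypothetical candidates for one target.
ppdb2_scores   = [4.1, 3.8, 3.5, 2.9, 2.2]  # resource-provided ranking
votes_no_ctx   = [5, 4, 2, 1, 0]            # annotation without context
votes_with_ctx = [2, 5, 1, 3, 0]            # annotation with context shown

rho_no_ctx, _ = spearmanr(ppdb2_scores, votes_no_ctx)
rho_ctx, _ = spearmanr(ppdb2_scores, votes_with_ctx)

# In the paper's data, the with-context correlation is consistently
# lower, as in this toy example.
print(rho_no_ctx > rho_ctx)
```

The same pattern, computed over all targets, yields the correlations reported in Table 1.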

Paraphrase Dataset Collection using Crowdsourcing
In this subsection, we present the process carried out to collect datasets for the paraphrase ranking task. This includes the selection of documents, identification of target paraphrases, and generation of candidate paraphrases from existing resources. We use 2.8k essay sentences from the ANC 2 and BAWE corpora for the annotation task. Target detection and candidate generation: In order to explore the impact of context on paraphrasing, the first step is to determine possible targets for paraphrasing, as shown in Figure 1. In principle, every word or MWE in a sentence can be a target for paraphrasing. When prototyping the annotation setup, we found that five paraphrase targets are a reasonable amount to be completed in a single Human Intelligence Task (HIT), a single and self-contained unit of work to be completed and submitted by an annotator in return for a reward 3 . We select targets that have at least five candidates in our combined paraphrase resources. The paraphrase resources (S) for candidate generation are composed of collections from PPDB (Pavlick et al., 2015), WordNet, and the JoBimText distributional thesaurus (DT; only for single words).
For MWE paraphrase targets, we use different MWE resources. A total of 79,349 MWEs are collected from WordNet, STREUSLE (Schneider and Smith, 2015; Schneider et al., 2014) 4 , Wiki50 (Vincze et al., 2011) and the MWE project (McCarthy et al., 2003; Baldwin and Villavicencio, 2002) 5 . We consider MWEs from these resources to be paraphrase targets when it is possible to generate paraphrase candidates from our paraphrase resources (S).
Candidate paraphrases for a target (both single-word and MWE) are generated as follows. For each paraphrase target, we retrieve candidates from the resources (S). When more than five candidates are collected: 1) for single words, we select the top candidates that bear different meanings in context, using the automatic sense induction API by Ruppert et al. (2015); 2) for MWEs, we select candidates that occur in multiple resources in S. We present five candidates to the workers, who select the suitable candidates in context. We also allow workers to provide their own alternative candidates when they find that none of the provided candidates are suitable in the current context. Figure 2 shows the Amazon Mechanical Turk user interface for the paraphrase candidate selection task. We discuss the statistics and the quality of the obtained annotations in Section 5.2.
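The candidate generation step for MWE targets can be sketched as below. The resource contents, target phrases, and the function name are hypothetical; the sketch only illustrates pooling candidates across resources, requiring at least five candidates per target, and preferring candidates attested in multiple resources.

```python
# Illustrative sketch of pooling paraphrase candidates from several
# resources (S) and preferring candidates found in multiple resources.
def generate_candidates(target, resources, min_candidates=5):
    """Collect paraphrase candidates for a target phrase.

    `resources` maps a resource name (e.g. 'ppdb', 'wordnet') to a
    dict from target phrase to a set of candidate phrases.
    """
    found = {}  # candidate -> set of resources it came from
    for name, table in resources.items():
        for cand in table.get(target, set()):
            found.setdefault(cand, set()).add(name)
    if len(found) < min_candidates:
        return None  # skip targets with too few candidates
    # Prefer candidates attested in multiple resources.
    ranked = sorted(found, key=lambda c: len(found[c]), reverse=True)
    return ranked[:min_candidates]

# Toy resource tables, not actual PPDB/WordNet contents.
resources = {
    "ppdb": {"give up": {"abandon", "surrender", "renounce", "quit", "waive"}},
    "wordnet": {"give up": {"abandon", "renounce", "forgo"}},
}
print(generate_candidates("give up", resources))
```

Candidates such as "abandon" and "renounce", found in both toy resources, are ranked ahead of single-resource candidates.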

Machine Learning Approaches for Paraphrasing
In this work we investigate two types of machine-learning setups for the paraphrase selection and ranking problems. In the first setup, we tackle the problem as a binary classification task, namely whether a candidate can replace a target in context. All candidates annotated as possible paraphrases are considered positive examples. We follow a 5-fold cross-validation approach to train and evaluate our model. In the second setup, we use a learning-to-rank algorithm to re-rank paraphrase candidates. There are different machine learning methods for the learning-to-rank approach, such as pointwise, pairwise, and listwise ranking. In pointwise ranking, a model is trained to map candidate phrases to relevance scores, for example using a simple regression technique; ranking is then performed by simply sorting the predicted scores. In the pairwise approach, the problem is regarded as a binary classification task where pairs of candidates are compared to each other (Freund et al., 2003). Listwise ranking approaches learn a function by taking individual candidates as instances and optimizing a loss function defined on the predicted instances (Xia et al., 2008). We experiment with different learning-to-rank algorithms from the RankLib 6 Java package of the Lemur project 7 . In this paper, we present the results obtained using LambdaMART. LambdaMART (Burges, 2010) uses gradient boosting to directly optimize learning-to-rank specific cost functions such as Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP).
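The pointwise ranking idea described above (regress relevance scores, then sort) can be sketched in a few lines. The feature vectors, relevance labels, and candidate phrases below are toy values, not the paper's data, and ridge regression stands in for any pointwise scorer.

```python
# Minimal pointwise-ranking sketch: map each candidate's feature vector
# to a relevance score with a regressor, then rank by predicted score.
import numpy as np
from sklearn.linear_model import Ridge

# Toy training data: 2-d feature vectors and graded relevance (0-5).
X_train = np.array([[0.9, 0.8], [0.7, 0.6], [0.4, 0.3], [0.1, 0.2]])
y_train = np.array([5, 4, 1, 0])

model = Ridge(alpha=1.0).fit(X_train, y_train)

# Rank unseen candidates for one (hypothetical) target.
candidates = ["hand over", "surrender", "look at"]
X_test = np.array([[0.85, 0.75], [0.8, 0.7], [0.2, 0.1]])
ranking = sorted(zip(candidates, model.predict(X_test)),
                 key=lambda p: p[1], reverse=True)
print([c for c, _ in ranking])
```

Pairwise and listwise methods such as LambdaMART replace the per-candidate regression with losses defined over candidate pairs or whole candidate lists.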

Features
We model three types of features: a resource-based feature whose value is taken from a lexical resource (F0), four features based on global context, where we use word embeddings to characterize targets and candidates irrespective of context (F1, F2, F3, F4), and four features based on local context, which take the relation of target and candidate to the context into account (F5, F6, F7, F8).
PPDB2 score: We use the PPDB2 score (F0) of each candidate as a baseline feature. This score reflects a context-insensitive ranking as provided by the lexical resource.
First, we describe the features considering global context information. Target and candidate phrases: Note that we do not use word identity as a feature, but use word embeddings instead for the sake of robustness. We use the word2vec python implementation of Gensim (Řehůřek and Sojka, 2010) 8 to generate embeddings from the BNC 9 , Wikipedia, BAWE, and ANC. We train embeddings with 200 dimensions using skip-gram training and a window size of 5. We approximate MWE embeddings by averaging the embeddings of their parts. We use the word embeddings of the target (F1) and the candidate (F2) phrases. Candidate-target similarity: the dot product of the target and candidate embeddings (F3), as described in (Melamud et al., 2015). Candidate-sentence similarity: the dot product between the candidate and the sentence, i.e. the average embedding of all words in the sentence (F4).
The following features use local context information. Target-close context similarity: the dot product between the candidate and the left and right 3-gram (F5) and 5-gram (F6) embeddings, respectively. N-gram features: a normalized frequency for a 2-5-gram context with the target and candidate phrases (F7), based on the Google Web 1T 5-Grams 10 . Language model score: a normalized language model score using the sentence as context with the target and candidate phrases (F8). An n-gram language model (Pauls and Klein, 2011) is built using the BNC and Wikipedia corpora.
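The embedding-based features above reduce to averaging token vectors and taking dot products. The sketch below uses a toy random lookup table in place of the trained word2vec model; the phrases and window strings are hypothetical, and only the feature construction (MWE averaging, F1-F5 style similarities) mirrors the text.

```python
# Sketch of the embedding-based features, assuming a word->vector
# lookup (here a toy random table standing in for trained word2vec
# embeddings). MWE vectors are approximated by averaging their parts.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(200) for w in
       "give up the fight after long battle surrender".split()}

def phrase_vec(phrase):
    """Average the embeddings of a phrase's tokens (MWE approximation)."""
    vecs = [emb[w] for w in phrase.split() if w in emb]
    return np.mean(vecs, axis=0)

target, candidate = "give up", "surrender"
sentence = "give up the fight after long battle"

f1 = phrase_vec(target)             # F1: target embedding
f2 = phrase_vec(candidate)          # F2: candidate embedding
f3 = f1.dot(f2)                     # F3: candidate-target similarity
f4 = f2.dot(phrase_vec(sentence))   # F4: candidate-sentence similarity
f5 = f2.dot(phrase_vec("give up the"))  # F5: candidate vs. local window
print(f1.shape, round(float(f3), 3))
```

The n-gram frequency (F7) and language model (F8) features come from external count and LM resources rather than embeddings, so they are omitted here.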
We also experimented with features that did not improve results, such as the embeddings of the target's n = 5 most similar words, length and length ratios between target and candidate, most similar words and number of shared senses among target and candidate phrases based on the JoBimText DT (Ruppert et al., 2015), and n-gram POS sequences and dependency labels of the target.

10 https://catalog.ldc.upenn.edu/LDC2009T25

Experimental Results
We now discuss the experimental results, using K-Nearest Neighbors (kNN) 11 from the scikit-learn 12 machine learning framework (binary classification setup) and the LambdaMART learning-to-rank algorithm from RankLib (learning-to-rank setup). We use 5-fold cross-validation on 17k data points (2k MWE and 15k single-word) from the crowdsourcing annotation task for both approaches. The cross-validation is conducted in such a way that there is no target overlap between splits, so that our model is forced to learn a delexicalized function that applies to all targets for which substitution candidates are available, cf. (Szarvas et al., 2013). As evaluation metrics, precision, recall, and F-score are used for the first setup. For the second setup we use P@1, Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG). P@1 measures the percentage of correct paraphrases at rank 1, i.e. how often the best-ranked paraphrase is judged as correct. MAP provides a single-figure measure of quality across recall levels. NDCG is a ranking score that compares the optimal ranking to the system ranking, taking into account situations where many resp. very few candidates are relevant (Wang et al., 2013). In the following subsections, we discuss the performance of the two machine learning setups.
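Target-disjoint cross-validation as described above can be realized with grouped splitting, where the paraphrase target serves as the group key. The targets, features, and labels below are toy values; scikit-learn's `GroupKFold` is one way to implement the split.

```python
# Sketch of delexicalized cross-validation: folds are split by target
# so no paraphrase target appears in both training and test data.
from sklearn.model_selection import GroupKFold

# Toy data: each instance is one (target, candidate) pair.
targets = ["give up", "give up", "look at", "look at", "run", "run",
           "carry out", "carry out", "point out", "point out"]
X = [[i, i + 1] for i in range(len(targets))]   # toy feature vectors
y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]              # toy binary labels

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=targets):
    train_targets = {targets[i] for i in train_idx}
    test_targets = {targets[i] for i in test_idx}
    # Delexicalized setting: the model never sees a test target in training.
    assert not train_targets & test_targets
```

This forces the learned function to generalize across targets rather than memorize target-specific substitution preferences.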

Binary Classification
For paraphrase selection, we regard the problem as a binary classification task. If a given candidate is selected by at least one annotator, it is considered a possible substitute and taken as a positive example; otherwise it is considered a negative training example. For this experiment, kNN from the scikit-learn machine learning framework is used. Table 2 shows the evaluation results for the best subsets of feature combinations. The classification experiments obtain maximal F1 scores of 81.56% for MWEs and 79.77% for single words vs. a non-contextual baseline of 69.06% and 71.47% respectively.

11 Parameters: number of neighbors (n_neighbors) = 20, weight function (weights) = distance
12 http://scikit-learn.org/
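The classification setup can be sketched with the kNN parameters stated in footnote 11. The feature matrix and labels below are synthetic stand-ins for the paper's nine features and annotator-derived labels.

```python
# Minimal sketch of the binary-classification setup with the kNN
# parameters reported above (n_neighbors=20, weights='distance').
# The data here is synthetic, not the paper's annotated dataset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 200
X = rng.standard_normal((n, 9))          # 9 features, cf. F0-F8
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # synthetic "substitutable" label

clf = KNeighborsClassifier(n_neighbors=20, weights="distance")
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(round(scores.mean(), 3))
```

In the paper's setting, the cross-validation folds are additionally constrained to be target-disjoint, which plain `cross_val_score` does not enforce.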

Learning to Rank
We now learn to rank paraphrase candidates, using the number of annotators agreeing on each candidate as relevance scores in the range 0 to 5. The average evaluation results on the 5-fold splits are shown in Table 2. The baseline ranking given by F0 is consistently lower than our context-aware classifiers. The best scores are attained with all features enabled (P@1=89.72, NDCG@5=88.82 and MAP=91.58 for single words vs. P@1=84.69, NDCG@5=77.54 and MAP=86.21 for MWEs). A more detailed analysis of the ranking of single-word targets vs. multi-word paraphrases is given in Section 5.3.
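To make the evaluation concrete, the sketch below computes NDCG@5 from graded relevance labels of the kind used here (number of agreeing annotators, 0 to 5). The vote counts and the system's candidate ordering are illustrative, not taken from the experiments.

```python
# Sketch of NDCG@k over annotator-vote relevance labels (0-5).
import math

def dcg(rels):
    """Discounted cumulative gain with log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(system_rels, k=5):
    """NDCG: system DCG divided by the DCG of the ideal ordering."""
    ideal = sorted(system_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(system_rels[:k]) / denom if denom > 0 else 0.0

# Votes for 5 candidates, listed in the order the system ranked them:
system_rels = [3, 5, 0, 2, 1]
score = ndcg_at_k(system_rels)
print(round(score, 4))
```

RankLib computes these metrics internally during LambdaMART training; the sketch only shows what the reported NDCG@5 numbers measure.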

Analysis of the Result
In this section, we interpret the results obtained during the crowdsourcing annotation task and machine learning experimentation.

Correlation with PPDB2 Ranking
As can be seen from Table 1, without context, Spearman correlations of 0.36 and 0.25 are obtained between the workers' judgments and the PPDB2 default rankings for single-word and MWE annotations respectively. However, when context is provided to the workers, the rankings for the same items correlate less, with Spearman correlations of 0.32 and 0.23 for single-word and MWE annotations respectively. This indicates that the provided context has an impact on the ranking of paraphrases. Moreover, we observe that the correlation with the PPDB2 ranking is considerably lower than the 0.71 reported by Pavlick et al. (2015). Data analysis revealed many inconsistent scores within PPDB2. For example, the word pairs (come in, sound) and (look at, okay) have high scores (3.2 and 3.18 respectively). However, they do not seem to be related and are not considered substitutable by our method. The perceived inconsistency is worse for MWE scores, hence the correlation is lower than for single words.

Annotation Agreement
According to Table 3, annotators agree more often on single words than on MWEs. This might be attributed to the fact that single-word candidates are generated with different meanings using the automatic sense induction approach provided by the JoBimText framework (Ruppert et al., 2015).

Table 3: Score distributions and observed annotation agreement (in %). The columns #1 to #5 show the percentage of scores the annotators gave to each class (0-5). The last column provides the observed agreement among the 5 annotators.
Hence, when context is provided, it is much easier to discern the correct candidate paraphrase. For MWEs, on the other hand, their parts disambiguate each other to some extent, so there are fewer candidates with context mismatches. This can be seen from the individual class percentages (MWE candidates are on average scored higher than single-word candidates, especially in the range 2 to 4) and from the overall observed agreement.

Machine Learning
According to the results shown in Table 2, we achieve higher scores in the binary classification for MWEs than for single words. We found that this is due to the fact that we have more positive examples for MWEs (67.6%) than for single words. Intuitively, it is much easier for one of the five candidates to be a correct paraphrase, as most of the MWEs are not ambiguous in meaning (see the recall (R) column in Table 2).
Example 1: this is the reason too that the reader disregards the duke 's point of view , and supports and sympathises with the duchess , acknowledging her innocence. Example 2: this list of verbs describes day-to-day occupations of the young girl , suggesting that she does n't distinguish the graveyard from other locations of her day . Example 3: this is apparent in the case of the priest who tries to vanquish the devil , who is infact mistaken for mouse slayer , the cat ...
Error analysis of the classification results shows that some of the errors are due to annotation mistakes. In Example 1, the annotators do not select the candidate stand while the classifier predicts it correctly. We also found that the classifier wrongly picks antonyms from the candidates: it selected younger man and heaven for Examples 2 and 3 respectively, while the annotators do not. The results for learning to rank show a different trend. Once again, we can see that it is difficult to rank better when the provided candidates (in the case of MWEs) are less ambiguous. This could also be a consequence of the lower agreement on MWE candidate judgments. Analysis of the learning-to-rank results also revealed that the lower result is due to the fact that, more often, the annotators do not agree on a single candidate, as can be seen from Table 4.
Looking at the overall results, it becomes clear that our learning framework can substantially improve contextual paraphrase ranking over the PPDB2-resource-based baseline. The resource-based F0 feature, however, is still important for attaining the highest scores. While the global context features based on word embeddings (cf. F1+2+3 or F1+3) already show very good performance, they are consistently improved by adding one or all of the features that model local context (F5, F6, F7, F8). From this we conclude that all feature types (resource, global context, local context) are important.

Conclusion and Future Directions
In this paper we have quantified the impact of context on the paraphrase ranking task. The direct annotation experiments show that paraphrasing is in fact a context-specific task: while the paraphrase ranking scores provided by PPDB2 were confirmed by a weak correlation with out-of-context judgments, the correlations between resource-provided rankings and judgments in context were consistently lower.
We conducted a classification experiment in a delexicalized setting, i.e. training and testing on disjoint sets of paraphrase targets. For the binary classification setting as well as for ranking, we improved substantially over the non-contextualized baseline provided by PPDB2. F-scores of 81.56% and 79.87% are attained for MWEs and single words using the kNN classifier from scikit-learn. MAP scores of 87.14% and 91.58% are obtained for MWEs and single words using the LambdaMART learning-to-rank algorithm from RankLib.
We recommend using a learning-to-rank framework, as it utilizes features that characterize the paraphrase candidate not only with respect to the target, but also with respect to the context. The most successful features in these experiments are constructed from word embeddings, and the best performance is attained by combining resource-based, global context, and local context features.
Both experiments confirm the generally accepted intuition that paraphrasing, just like lexical substitution of single words, depends on context: while MWEs are less ambiguous than single words, it still does not hold that they can be replaced without taking the context into account. Here, we have quantified the amount of context dependence on a new set of contextualized paraphrase judgments, which is, to our knowledge, the first dataset with multi-word targets 13 .
While our dataset seems of sufficient size to learn a high-quality context-aware paraphrase ranker, we would like to employ usage data from a semantic writing aid to further improve quality, as well as to collect domain- and user-specific paraphrase generation candidates.