Monolingual Phrase Alignment on Parse Forests

We propose an efficient method for conducting phrase alignment on parse forests for paraphrase detection. Unlike previous studies, our method identifies syntactic paraphrases under linguistically motivated grammar. In addition, it allows phrases to align non-compositionally, handling paraphrases with non-homographic phrase correspondences. A dataset that provides gold parse trees and their phrase alignments is created. The experimental results confirm that the proposed method conducts highly accurate phrase alignment relative to human performance.


Introduction
Paraphrase detection is crucial in various applications and has been actively studied for years. Due to difficulties caused by the non-homographic nature of phrase correspondences, the units of correspondence in previous studies are defined as sequences of words, as in (Yao et al., 2013), and not syntactic phrases. On the other hand, syntactic structures are important in modeling sentences, e.g., their sentiments and semantic similarities (Socher et al., 2013; Tai et al., 2015).
In this paper, we present an algorithm to align syntactic phrases in a paraphrased pair of sentences. Our contributions are that (1) the problem of identifying a legitimate set of syntactic paraphrases under linguistically motivated grammar is formalized, (2) dynamic programming à la CKY (Cocke, 1969; Kasami, 1965; Younger, 1967) makes phrase alignment computationally feasible, (3) the alignment quality of phrases can be improved using n-best parse forests instead of 1-best trees, and (4) non-compositional alignment allows non-homographic correspondences of phrases. Motivated by recent findings that syntax is important for phrase embedding (Socher et al., 2013), in which phrasal paraphrases allow semantic similarity to be replicated (Wieting et al., 2015, 2016), we focus on syntactic paraphrase alignment.

Fig. 1 shows a real example of phrase alignments produced by our method (Source: "Whenever I go to the ground floor for a smoke, I always come face to face with them." Target: "Whenever I go down to smoke a cigarette, I come face to face with one of them."). Alignment proceeds in a bottom-up manner using the compositional nature of phrase alignments. First, word alignments are given. Then, phrase alignments are recursively identified through support relations between phrase pairs. Non-compositional alignment is triggered when compositionality is violated, which is common in paraphrasing.

For systematic research on syntactic phrase alignment in paraphrases, we constructed a gold standard dataset of paraphrase sentences with phrase alignments (20,678 phrases in 201 paraphrasal sentence pairs). This dataset will be made public for future research on paraphrase alignment. The experimental results show that our method achieves 83.64% recall and 78.91% precision in terms of alignment pairs, which are 92% and 89% of human performance, respectively.

Related Work
Due to the large number of sentence-level paraphrases collected (Dolan et al., 2004; Cohn et al., 2008; Heilman and Smith, 2010; Yin and Schütze, 2015; Biran et al., 2016), researchers can identify phrasal correspondences for natural language inference (MacCartney et al., 2008; Thadani et al., 2012; Yao et al., 2013). Current methods extend word alignments to phrases in accordance with methods from statistical machine translation. However, phrases are defined as simple sequences of words, which do not conform to syntactic phrases. PPDB (Ganitkevitch et al., 2013) provides syntactic paraphrases similar to synchronous context-free grammar (SCFG). As discussed below, SCFG captures only a fraction of paraphrasing phenomena.
In terms of our approach, parallel parsing is a relevant area. Smith and Smith (2004) related monolingual parses in different languages using word alignments, while Burkett and Klein (2008) employed phrase alignments. Moreover, Das and Smith (2009) proposed a model that generates a paraphrase of a given sentence using quasi-synchronous dependency grammar (Smith and Eisner, 2006). Since they used phrase alignments simply as features, there is no guarantee that the output alignments are legitimate.
Synchronous rewriting in parallel parsing (Kaeshammer, 2013; Maillette de Buy Wenniger and Sima'an, 2013) derives parse trees that conform to discontinuous word alignments. In contrast, our method respects parse trees derived by linguistically motivated grammar while handling non-monotonic phrase alignment.
The synchronous assumption in parallel parsing has been argued to be too rigid to handle parallel sentence pairs or even paraphrasal sentence pairs. Burkett et al. (2010) proposed weakly synchronized parallel parsing to tackle this problem. Although this model increases the flexibility, the obtainable alignments are restricted to conform to inversion transduction grammar (ITG) (Wu, 1997). Similarly, Choe and McClosky (2015) used dependency forests of paraphrasal sentence pairs and allowed disagreements to some extent. However, alignment quality was beyond their scope. Weese et al. (2014) extracted SCFG from paraphrase corpora. They showed that parsing was only successful in 9.1% of paraphrases, confirming that a significant amount of transformations in paraphrases do not conform to compositionality or ITG.

Table 1: Notation used in this paper
- s, t: Source and target sentences
- τ: Phrase in the parse tree
- τ_R, τ_∅: τ_R is the phrase of a root node; τ_∅ is a special phrase with a null span that exists in every parse tree
- φ: Phrase aligned to τ_∅
- ⟨·, ·⟩: Pair of entities; a pair itself can be regarded as an entity
- {·}: Set of entities
- m(·): Derives the mother node of a phrase
- l(·), r(·): Derive the left and right child nodes, respectively
- ds(·): Derives the descendants of a node including itself; τ ∈ ds(τ)
- lca(·, ·): Derives the lowest common ancestor (LCA) of two phrases

Formulation of Phrase Alignment
In this study, we formalize the problem of legitimate phrase alignment. For simplicity, we discuss tree alignment instead of forests, using Fig. 2 as a running example. Table 1 describes the notation used in this paper. We call one sentence of a paraphrased pair the source s and the other the target t; superscripts s and t represent the source and the target, respectively. Specifically, ⟨τ^s, τ^t⟩ is a pair of source and target phrases. We write f_1/f_2/···/f_i(·) to abbreviate f_i(···f_2(f_1(·))···) as an intuitive illustration. Note that the order of the function symbols is reversed, e.g., l/r(τ) (= r(l(τ))) derives the right child of the left child node of τ, and l/ds(τ) derives the left descendants of τ.

Definition of a Legitimate Alignment
A possible parse tree alignment of s and t is represented as a set of aligned phrase pairs {⟨τ^s_i, τ^t_i⟩}, where τ^s_i and τ^t_i are the source and target phrases that constitute the i-th alignment, respectively. Either τ^s_i or τ^t_i can be τ_∅ when a phrase does not correspond to any phrase in the other sentence, which is called a null-alignment. Each phrase alignment can have support relations: either ⟨l/ds(τ^s_i), l/ds(τ^t_i)⟩ and ⟨r/ds(τ^s_i), r/ds(τ^t_i)⟩ exist (a straight support), or ⟨l/ds(τ^s_i), r/ds(τ^t_i)⟩ and ⟨r/ds(τ^s_i), l/ds(τ^t_i)⟩ exist (an inverted support). Pre-terminal phrases are supported by the corresponding word alignments. Support relations are denoted using ⇒ (straight) or ⇒_R (inverted), which represent the order of the supporting phrases, e.g., τ^s_m = l/ds(τ^s_i) and τ^s_n = r/ds(τ^s_i). The number of all possible alignments of s and t, denoted as H, is exponential in the sentence length. However, only a fraction of its subsets constitute legitimate parse tree alignments. For example, a subset in which the same phrase in s is aligned with multiple phrases in t, called competing alignments, is not legitimate as a parse tree alignment. The relationships among phrases in parse trees impose constraints that provide legitimacy to a subset.
Given word alignments W that provide the basis for phrase alignment, a legitimate set W_L ⊂ W consists of 1-to-1 alignments. Starting with W_L, a legitimate set of phrase alignments H_L (⊂ H) with an accompanying set of support relations ∆_L (⊂ ∆) is constructed. A legitimate set of alignments ⟨H_L, ∆_L⟩ can be enlarged only by adding h_i to H_L together with either the support relation ⇒ or ⇒_R added to ∆_L. These two relations assume competing alignments among the child phrases and thus cannot co-exist in the same legitimate set.
h_i can be supported by more than one pair of descendant alignments in ∆_L. ⟨H_L, ∆_L⟩ should satisfy the conditions in Definition 3.2 to be legitimate as a whole. We denote h_i →* h_j when a chain exists in ∆_L that connects h_i to h_j, regardless of the straight or inverted directions of the intermediate supports.

Same-Tree: In H_L, the source-side phrases are subsets of phrases in the same complete parse tree of s (and likewise for t).

Consistency:
In H_L, a phrase (≠ τ_∅) in the source tree is aligned with at most one phrase (≠ τ_∅) in the target tree, and vice versa.

Monotonous:
The Same-Tree condition is required to conduct alignment on forests that consist of multiple trees in a packed representation. The Consistency condition excludes competing alignments. The Monotonous condition is a consequence of compositionality. The Maximum Set condition means that if h_m, h_n ∈ H_L are in positions of a parse tree that can support h_i, then h_i and the support relation should be added to ⟨H_L, ∆_L⟩. Such a strict locality of compositionality is often violated in practice, as discussed in Sec. 2. To tackle this issue, we add another operation to align phrases in a non-compositional way in Sec. 4.3.
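To make the Consistency condition concrete, the following is a minimal sketch (our own illustration, not the paper's implementation) that rejects sets containing competing alignments; phrases are opaque identifiers and `None` stands in for τ_∅, which may legitimately appear in many pairs:

```python
def is_consistent(alignments):
    """Check the Consistency condition: every non-null source phrase
    aligns with at most one target phrase, and vice versa.
    alignments: iterable of (source_phrase, target_phrase) pairs,
    where None represents the null phrase tau_empty."""
    seen_src, seen_tgt = set(), set()
    for src, tgt in alignments:
        if src is not None:                 # tau_empty may repeat freely
            if src in seen_src:
                return False                # competing alignment on source side
            seen_src.add(src)
        if tgt is not None:
            if tgt in seen_tgt:
                return False                # competing alignment on target side
            seen_tgt.add(tgt)
    return True
```

In a full implementation this check would be enforced incrementally while enlarging ⟨H_L, ∆_L⟩ rather than over a finished set.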

Lowest Common Ancestor
The same aligned pair can have more than one support of descendant alignments because there are numerous descendant node combinations. However, the Monotonous and Maximum Set conditions allow ∆_L to be further restricted so that each aligned pair in H_L has only one support.
Let us assume that alignment h_i is supported by more than one pair of descendant alignments. For each h_m ∈ H_m and h_n ∈ H_n, we remove all support relations from ∆_L except for those of the maximum pairs or the pre-terminal alignments. The resultant set ∆_L satisfies the conditions of Definition 3.2. In Fig. 2, τ^s_i is the lowest common ancestor (LCA) of τ^s_m and τ^s_n, and τ^t_i is the LCA of τ^t_m and τ^t_n. Theorem 3.2 constitutes the basis for the dynamic programming (DP) in our phrase alignment algorithm (Sec. 4.2).

Modeling of Phrase Alignment
We formally model the phrase alignment process as illustrated in Fig. 3, where h i is aligned from descendant alignments, i.e., h m and h n .

Probabilistic Model
Similar to probabilistic context-free grammar (PCFG), the inside probability α_i of h_i is determined by the inside probabilities α_m and α_n of the support pairs, together with the probability of the rule, i.e., the way in which h_m and h_n are combined to support h_i, as shown in Fig. 3. The rule is characterized by four paths: π^s_{m,i} (the path from τ^s_m to τ^s_i), π^s_{n,i} (from τ^s_n to τ^s_i), π^t_{m,i} (from τ^t_m to τ^t_i), and π^t_{n,i} (from τ^t_n to τ^t_i). Each path consists of a set of null-aligned phrases ⟨φ, τ_∅⟩ and their mothers; e.g., the path π^s_{m,i} in Fig. 3 is the set of ⟨φ^s_1, m(φ^s_1)⟩, ⟨φ^s_2, m(φ^s_2)⟩, and ⟨φ^s_3, m(φ^s_3)⟩. We assume that each occurrence of a null-alignment is independent (⇒ and ⇒_R are not distinguished here). Thus, the probability of a path is computed as:

β^s_{m,i} = ∏_{⟨φ, τ_∅⟩ ∈ π^s_{m,i}} P_r(φ, τ_∅),

and β^s_{n,i}, β^t_{m,i}, and β^t_{n,i} are computed in the same manner. We abbreviate γ^s_{m,n,i} = β^s_{m,i} β^s_{n,i}, and likewise γ^t_{m,n,i} = β^t_{m,i} β^t_{n,i}. Finally, α_i can be represented as a simple relation:

α_i = P_r(τ^s_i, τ^t_i) γ^s_{m,n,i} γ^t_{m,n,i} α_m α_n,   (1)

where P_r(·, ·) is the alignment probability parameterized in Sec. 5. Since we assume that the structures of the parse trees of s and t are determined by a parser, the values of γ^s_{m,n,i} and γ^t_{m,n,i} are fixed. Therefore, by traversing the parse tree in a bottom-up manner, we can identify an LCA (i.e., τ_i) for phrases τ_m and τ_n while simultaneously computing γ_{m,n,i}.

Figure 4: Alignment pairs and packed supports
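As a small numeric illustration of how these quantities combine (a sketch with hypothetical probability values; the combination simply multiplies the alignment probability, the path factors, and the child inside probabilities, as Eq. (1) states):

```python
import math

def path_factor(null_align_probs):
    """beta for one path: the product of the independent null-alignment
    probabilities P_r(phi, tau_null) along the path; an empty path
    contributes a factor of 1."""
    return math.prod(null_align_probs)

def inside_probability(p_align, gamma_s, gamma_t, alpha_m, alpha_n):
    """Eq. (1): alpha_i = P_r(tau_s_i, tau_t_i) * gamma_s * gamma_t
    * alpha_m * alpha_n."""
    return p_align * gamma_s * gamma_t * alpha_m * alpha_n

# gamma on each side is the product of its two path betas
gamma_s = path_factor([0.9]) * path_factor([0.8, 0.7])  # beta_s_mi * beta_s_ni
gamma_t = path_factor([]) * path_factor([0.95])         # empty path -> beta = 1
alpha_i = inside_probability(0.5, gamma_s, gamma_t, 1.0, 1.0)
```

Because the path structures are fixed by the parser, the `gamma` factors can be precomputed once per candidate LCA during the bottom-up traversal.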

Alignment Algorithm
Algorithm 4.1 depicts our algorithm. Given word alignments W = {⟨w^s_i, w^t_i⟩}, it constructs legitimate sets of aligned pairs in a bottom-up manner. Like the CKY algorithm, Algorithm 4.1 uses DP to efficiently compute all possible legitimate sets and their probabilities in parallel. In addition, null-alignments are allowed when aligning an LCA supported by aligned descendant nodes.
A[·] is indexed by phrases in the parse tree of s and maintains a list of all possible aligned pairs. Furthermore, to deal with non-monotonic alignment (Sec. 4.3), it keeps all competing hypotheses of support relations using packed representations. Specifically, h_i is accompanied by its packed support list, as illustrated in Fig. 4. Depending on the support alignments, h_i has different inside probabilities, i.e., α_1, α_2, and α_3. Since the succeeding alignment process only deals with the LCAs of τ^s_1 and τ^t_1, which are independent of the support alignments, all support relations are packed into a support list by the PACK function, and α_i is computed using Eq. (1).
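The bottom-up chart traversal can be sketched as follows. This is a simplified rendering under our own assumptions, not the paper's Algorithm 4.1: rule probabilities, null-alignment paths, and the packing of per-probability variants are omitted, and the `lca` function over the target tree is supplied as a black box:

```python
from collections import defaultdict

def align_bottom_up(postorder_nodes, children, initial_alignments, lca):
    """postorder_nodes: source phrases in bottom-up (post) order.
    children: source node -> (left, right), or None for pre-terminals.
    initial_alignments: pre-terminal node -> [(target_phrase, prob)]
                        seeded from word alignments.
    lca: (target_phrase, target_phrase) -> their LCA, or None if they
         cannot sit on the same target tree."""
    A = defaultdict(list)  # source phrase -> [(target phrase, prob, supports)]
    for node in postorder_nodes:
        if children.get(node) is None:           # pre-terminal: seed from words
            for tgt, p in initial_alignments.get(node, []):
                A[node].append((tgt, p, []))
            continue
        left, right = children[node]
        packed = {}                               # pack supports per target LCA
        for tm, pm, _ in A[left]:
            for tn, pn, _ in A[right]:
                tgt = lca(tm, tn)
                if tgt is None:
                    continue                      # no common target ancestor
                prob = pm * pn                    # rule probability omitted here
                if tgt not in packed or prob > packed[tgt][0]:
                    packed[tgt] = (prob, [(tm, tn)])   # keep best probability
                else:
                    packed[tgt][1].append((tm, tn))    # pack alternative support
        for tgt, (prob, supports) in packed.items():
            A[node].append((tgt, prob, supports))
    return A
```

As in CKY parsing, each source node is visited once and its cell combines the hypotheses of its two children, so all legitimate sets are explored in parallel without re-deriving shared sub-alignments.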

Non-Compositional Alignment
A monotonic alignment requires τ^t_m ∈ h_m and τ^t_n ∈ h_n to have an LCA, which adheres to compositionality in language. However, previous studies reported that compositionality is violated in monolingual phrase alignment (Burkett et al., 2010; Weese et al., 2014). Heilman and Smith (2010) discuss that complex phrase reordering is prevalent in paraphrases and entailed texts.
A non-monotonic alignment occurs when corresponding phrases have largely different orders, i.e., one of them (e.g., τ^t_m) is an ancestor of the other (e.g., τ^t_n) or the same phrase. Such a case can still be compatible when τ^t_m has null-alignments and all the aligned phrases of τ^t_n fit into these null-alignments; a new alignment ⟨τ^s_i, τ^t_i (= τ^t_m)⟩ is then formed non-monotonically. Fig. 5 shows a real example of non-compositional alignment produced by our method. The target phrase τ^t_n ("through the spirit of teamwork") is null-aligned when aligning τ^s_m and τ^t_m, but the alignment to τ^s_n ("Relying on team spirit") is then allowed by the non-compositional alignment of τ^s_i. Unlike monotonous alignment, we have to verify whether the internal structures of τ^t_m and τ^t_n are compatible. Since their internal structures depend on their supporting alignments, their packed representations in A have to be unpacked, and each pair of supporting alignments for h_m and h_n must be checked for compatibility. Furthermore, since the aligned phrases inside τ^t_m and τ^t_n have their own null-alignments, we need to unpack deeper supporting alignments as well.
Algorithm 4.2 checks whether target phrases τ_m and τ_n ∈ ds(τ_m) are compatible. We use the following notation: [τ_m]_i and [τ_n]_j represent the phrases τ_m and τ_n with their i-th and j-th sets of supporting alignments, respectively. For τ^t_2 in Fig. 4 and the example in Fig. 5, the alignment information is updated at line 5, where the GAP function takes two phrases and returns the set of null-alignments on the path between them. If τ_n is a descendant of a support of τ_m, compatibility is checked recursively (line 7). Otherwise, the compatibility of the supports of τ_n and τ_m is checked recursively by the DOWN function in a similar manner (line 10).
When the TRACE function returns a set of {⟨Ψ_k, Φ_k⟩}, all ψ ∈ Ψ_k are aligned with phrases in the source and their inside probabilities are stored in A. Thus we can compute the inside probability for each ⟨Ψ_k, Φ_k⟩, which is stored in A.

Figure 5: A real example of non-compositional alignment (Source: "Relying on team spirit, expedition members defeated difficulties")

Forest Alignment
Although we have discussed trees for clarity, the alignment is conducted on forests. The alignment process is basically the same; the only difference is that the same pair may have multiple LCAs. Hence, we need to verify whether sub-trees can be on the same tree when identifying their LCAs, since multiple nodes may cover the same span with different derivations. This is critical for non-compositional alignment because whether the internal structures are on the same tree must be confirmed while unpacking them. Our alignment process corresponds to reranking of forests and may derive a tree different from the 1-best, which may resolve ambiguity in parsing. We use a parser trained beforehand because joint parsing and alignment is computationally too expensive.

Parameterization
Next, we parameterize the alignment probability.

Feature-enhanced EM Algorithm
We apply the feature-enhanced EM (Berg-Kirkpatrick et al., 2010) due to its ability to use dependent features without an unrealistic independence assumption. This is preferable because the attributes of phrases largely depend on each other.
Our method is computationally heavy since it handles forests and involves unpacking in the non-compositional alignment process. Thus, we use Viterbi training (Brown et al., 1993) together with a beam search of size μ_b ∈ N on the feature-enhanced EM. Mini-batch training (Liang and Klein, 2009) is also applied. Such approximations for efficiency are common in parallel parsing (Burkett and Klein, 2008; Burkett et al., 2010).
In addition, an alignment supported by distant descendants tends to fail to reach a root-pair alignment. Thus, we restrict the generation gap between a support alignment and its LCA to be less than or equal to µ g ∈ N.

Features
In the feature-enhanced EM, the alignment probability in Eq. (1) is parameterized using features:

P_r(τ^s, τ^t) ∝ exp(w · F(a^s, a^t)),

where a = (a_0, ···, a_n) consists of n attributes of τ, and F(·, ·) and w are vectors of feature functions and their weights, respectively.
In a parse tree, the head of a phrase determines its property. Hence, a lemmatized lexical head a_lex ∈ a combined with its syntactic category a_cat ∈ a is encoded as a feature, as shown below. We use semantic (instead of syntactic) heads to encode semantic relationships in paraphrases.
1. 1(a^s_lex = ·, a^s_cat = ·, a^t_lex = ·, a^t_cat = ·)
2. 1(SurfaceSim(a^s_lex = ·, a^t_lex = ·))
3. 1(WordnetSim(a^s_lex = ·, a^t_lex = ·))
4. 1(EmbeddingSim(a^s_lex = ·, a^t_lex = ·))
5. 1(IsPrepositionPair(a^s_lex = ·, a^t_lex = ·))
6. 1(a^s_cat = ·, a^t_cat = ·)
7. 1(IsSameCategory(a^s_cat = ·, a^t_cat = ·))

The first feature is an indicator invoked only at specific values, while the rest are invoked across multiple values, allowing general patterns to be learned. The second feature is invoked if the two heads are identical or one head is a substring of the other. The third feature is invoked if the two heads are synonyms or derivations extracted from WordNet. The fourth feature is invoked if the cosine similarity between the word embeddings of the two heads is larger than a threshold. The fifth feature is invoked when both heads are prepositions, capturing their different nature from content words. The last two features concern categories: the sixth is invoked for each category pair, while the seventh is invoked if the input categories are the same.
To avoid generating a huge number of features, we reduce the syntactic categories to: content words (N, V, ADJ, and ADV), prepositions, coordinations, null (i.e., for τ_∅), and others.
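The seven feature templates can be sketched as a single extractor. This is our illustration, not the paper's code: the synonym set stands in for the WordNet lookup, the cosine similarity is assumed to be precomputed, and the reduced category label "P" for prepositions is a hypothetical naming choice:

```python
def phrase_pair_features(a_s, a_t, synonyms=None, cos_sim=0.0, threshold=0.5):
    """a_s, a_t: dicts with 'lex' (lemmatized head) and 'cat' (reduced
    syntactic category) of the source and target phrases.
    synonyms: set of frozenset word pairs (stand-in for WordNet).
    cos_sim: precomputed cosine similarity of the two head embeddings."""
    feats = {}
    # 1: fully lexicalized indicator, invoked only at these exact values
    feats[("lex-cat", a_s["lex"], a_s["cat"], a_t["lex"], a_t["cat"])] = 1.0
    # 2: surface similarity (identity or substring)
    if (a_s["lex"] == a_t["lex"]
            or a_s["lex"] in a_t["lex"] or a_t["lex"] in a_s["lex"]):
        feats["surface-sim"] = 1.0
    # 3: WordNet synonym / derivation
    if synonyms and frozenset((a_s["lex"], a_t["lex"])) in synonyms:
        feats["wordnet-sim"] = 1.0
    # 4: embedding similarity above a threshold
    if cos_sim > threshold:
        feats["embedding-sim"] = 1.0
    # 5: both heads are prepositions
    if a_s["cat"] == "P" and a_t["cat"] == "P":
        feats["prep-pair"] = 1.0
    # 6: category pair indicator
    feats[("cat-pair", a_s["cat"], a_t["cat"])] = 1.0
    # 7: same category
    if a_s["cat"] == a_t["cat"]:
        feats["same-cat"] = 1.0
    return feats
```

The weight vector w would then score a pair via the dot product of these sparse features with learned weights.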

Penalty Function
Since our method allows null-alignments, it has a degenerate maximum likelihood solution (Liang and Klein, 2009) that makes every phrase nullalignment. Similarly, a degenerate solution overly conducts non-compositional alignment.
To avoid these issues, a penalty is incorporated into the model, where |·|_φ computes the span of internal null-alignments, and μ_n ≥ 1.0 and μ_c ∈ R+ control the strength of the penalties for null-alignment and non-compositional alignment, respectively. The penalty function is multiplied by Eq. (1) as a soft constraint for re-ranking alignment pairs in Algorithm 4.1.

Combination with Parse Probability
Following the spirit of parallel parsing, which simultaneously parses and aligns sentences, we linearly interpolate the alignment probability with the parsing probability once the parameters have been tuned by EM. When aligning a node pair ⟨τ^s_i, τ^t_i⟩, the overall probability is computed as (1 − μ_p)α_i + μ_p p(τ^s_i) p(τ^t_i), where p(·) gives the marginal probability of a phrase in parsing and μ_p ∈ [0, 1] balances the two probabilities.

Evaluation
As discussed in Sec. 2, previous studies have not conducted syntactic phrase alignment on parse trees. A direct metric does not exist to compare paraphrases that cover different spans, i.e., our syntactic paraphrases and paraphrases of n-grams. Thus, we compared the alignment quality to that of humans as a realistic way to evaluate the performance of our method.
We also evaluated the parsing quality. As with the alignment quality, differences in phrase structures disturb such comparisons (Sagae et al., 2008). Our method applies the HPSG parser Enju to derive parse forests due to its state-of-the-art performance and its ability to provide rich properties of phrases. Hence, we compared our parsing quality to the 1-best parses of Enju.

Language Resources
We used reference translations for machine translation evaluation as sentential paraphrases (Weese et al., 2014). Reference translations of 10 to 30 words were extracted and paired, giving 41K pairs as a training corpus.
We use several kinds of dictionaries to obtain word alignments W as well as to compute feature functions. First, we extract synonyms and words with derivational relationships using WordNet. Then we handcraft derivation rules (e.g., create, creation, creator) and extract potentially derivational words from the training corpus. Finally, we use the prepositions defined in (Srikumar and Roth, 2013) as a preposition dictionary to compute the feature function.
In addition, we extend W using word embeddings; we use the MVLSA word embeddings (Rastogi et al., 2015) given their superior performance in word similarity tasks. Specifically, we compute the cosine similarity of embeddings; words whose similarity exceeds a threshold are regarded as similar. The threshold is empirically set as the 100th highest similarity value between words in the training corpus.
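The embedding-based extension can be sketched as follows (a simplified illustration; the function names and the data layout are our assumptions, not part of the original):

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_threshold(sim_values, k=100):
    """Empirical threshold: the k-th highest pairwise similarity
    observed in the training corpus."""
    return heapq.nlargest(k, sim_values)[-1]

def similar_pairs(vocab_vectors, threshold):
    """All word pairs whose embedding similarity exceeds the threshold;
    these extend the word alignments W."""
    words = list(vocab_vectors)
    return {(w1, w2)
            for i, w1 in enumerate(words)
            for w2 in words[i + 1:]
            if cosine(vocab_vectors[w1], vocab_vectors[w2]) > threshold}
```

Enumerating all pairs is quadratic in vocabulary size; in practice one would restrict candidates to word pairs co-occurring in the paraphrase pairs being aligned.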

Gold-Standard Data
Since no annotated corpus provides phrase alignments on parse trees, we created one through two-phase manual annotation. First, a linguistic expert with rich experience in annotating HPSG trees annotated gold trees for paraphrasal sentence pairs sampled from the training corpus. To diversify the data, only one reference pair per sentence of a source language was annotated. Consequently, 201 paraphrased pairs with gold trees (containing 20,678 phrases) were obtained.
Next, three professional English translators identified paraphrased pairs, including null-alignments, given sets of phrases extracted from the gold trees. These annotators independently annotated the same set, yielding 14,356 phrase alignments that at least one annotator regarded as a paraphrase. All the annotators agreed that 77% of the phrases were paraphrases.
We used 50 sentence pairs for development and another 151 for testing. These pairs were excluded from the training corpus.

Evaluation Metric
Alignment Quality: Alignment quality was evaluated by measuring the extent to which the automatic alignment results agree with those of humans. Specifically, we evaluated how well gold alignments can be replicated by automatic alignment (recall) and how much the automatic alignments overlap with alignments that at least one annotator produced (precision):

recall = |H_a ∩ (G ∩ G′)| / |G ∩ G′|,  precision = |H_a ∩ (G ∪ G′)| / |H_a|,

where H_a is the set of automatic alignments, and G and G′ are the alignment sets that two of the annotators produce. The function |·| counts the elements of a set. There are three combinations for G and G′ because we had three annotators; the final precision and recall values are their averages.
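Under one plausible reading of this metric (the gold set for recall is the pairs both reference annotators produced; the reference for precision is the pairs at least one produced), the averaging over the three annotator combinations can be sketched as:

```python
def alignment_scores(system, annotations):
    """system: set of aligned phrase pairs produced automatically.
    annotations: list of three sets, one per annotator.
    Averages recall/precision over the three ways of picking two
    annotators as the reference (assumes both reference sets and the
    system set are non-empty)."""
    recalls, precisions = [], []
    n = len(annotations)
    for i in range(n):
        others = [annotations[j] for j in range(n) if j != i]
        gold_strict = others[0] & others[1]  # agreed by both reference annotators
        gold_any = others[0] | others[1]     # produced by at least one
        recalls.append(len(system & gold_strict) / len(gold_strict))
        precisions.append(len(system & gold_any) / len(system))
    return sum(recalls) / n, sum(precisions) / n
```

The same function evaluates the pseudo inter-annotator agreement by passing one annotator's set as `system`.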
Parsing Quality: The parsing quality was evaluated using the CONLL-X standard (Buchholz and Marsi, 2006). Dependencies were extracted from the output HPSG trees and evaluated using the official script. Due to this conversion, accuracy on the relation labels is less important; thus, we report only the unlabeled attachment score (UAS). The development and test sets provide 2,371 and 6,957 dependencies, respectively.
Table 2: Roles of hyper-parameters
- μ_n: controls the penalty for null-alignment
- μ_c: controls the penalty for non-compositional alignment
- μ_p: balances alignment and parsing probabilities
- μ_b: beam size at alignment
- μ_g: generation gap to reach an LCA

Since all metrics were computed on sets, approximate randomization (Noreen, 1989; Riezler and Maxwell, 2005) (B = 10K) was used for significance testing; it has been shown to be more conservative than bootstrap resampling (Riezler and Maxwell, 2005).

Results and Discussion
Overall Results: Table 2 summarizes the hyper-parameters, which were tuned to maximize UAS on the development set using Bayesian optimization. For efficiency, we used 2K samples from the training corpus and set the mini-batch size in the feature-enhanced EM to 200, similar to the "rapid training" in (Burkett and Klein, 2008). We also set μ_b = 50 during EM training to manage the training time. Table 3 shows the performance on the test set for variations of our method and for the human annotators. The last column shows the percentage of pairs for which a root pair alignment is reached, called reachability. Our method is denoted as Proposed, while its variations include a method with only monotonic alignment (monotonic), a method without EM (w/o EM), and a method aligning only 1-best trees (1-best tree).
The performance of the human annotators was assessed by considering one annotator as the test and the other two as the gold standard, and then taking the averages, which is the same setting as for our method. We regard this as a pseudo inter-annotator agreement, since conventional inter-annotator agreement is not directly applicable due to variations in the aligned phrases.
Our method significantly outperforms the others, achieving the highest recall and precision for alignment quality. Our recall and precision reach 92% and 89% of those of humans, respectively. Non-compositional alignment is shown to contribute to alignment quality, while the feature-enhanced EM is effective for both alignment and parsing quality. Comparing our method with the one aligning only 1-best trees demonstrates that aligning parse forests largely contributes to alignment quality. Although we confirmed that aligning larger forests slightly improved recall and precision, the improvements were statistically insignificant. The parsing quality was not much affected by phrase alignment, which is further investigated below.
Finally, our method achieved 98% reachability, where 2% of unreachable cases were due to the beam search. While understanding that the reachability depends on experimental data, ours is notably higher than that of SCFG, reported as 9.1% in (Weese et al., 2014). These results show the ability of our method to accurately align paraphrases with divergent phrase correspondences.
Effect of Mini-Batch Size We investigated the effect of the mini-batch size in EM training using the entire training corpus (41K pairs). When increasing the mini-batch size from 200 to 2K, recall, precision, and UAS values are fairly stable. In addition, they are insensitive against the amount of training corpus, showing the comparable values against the model trained on 2K samples. These results demonstrate that our method can be trained with a moderate amount of data.
Observations: Previous studies show that parallel parsing improves parsing quality, while such an effect is insignificant here. We examined the causes through manual observation. The evaluation script indicated that our method corrected 34 errors while introducing 41 new errors. We further analyzed these 75 cases; 12 are ambiguous, as both the gold standard and the output are correct. In addition, 8 cases are due to erroneous original sentences that should be disregarded, e.g., "For two weeks ago,..." and "According to the source, will also meet...". Consequently, our method corrected 32 errors while introducing 23 errors in reality, out of 446 errors in the 1-best trees, which achieves a 2.5% error reduction.
These are promising results for our method to improve parsing quality, especially on PP-attachment (159 errors in the 1-best trees), which accounted for 14 of the 32 corrected errors. Fig. 1 shows a real example; the phrase "for a smoke" in the source was mistakenly attached to "ground floor" in the 1-best tree. This error was corrected as depicted. Duan et al. (2016) showed that paraphrases artificially generated using n-best parses improve parsing quality. One reason for the limited improvement in our experiments may be that structural changes in our natural paraphrases are more dynamic than the level useful for resolving ambiguities. We will investigate this further in future work.

Conclusion
We proposed an efficient method for phrase alignment on parse forests of paraphrased sentences. To increase the amount of collected paraphrases, we plan to extend our method to align comparable paraphrases, i.e., sentence pairs that are only partially paraphrases. In addition, we will apply our method to parallel parsing and to other grammars, e.g., projective dependency trees. Furthermore, we will apply such syntactic paraphrases to phrase embedding.