Parsing Paraphrases with Joint Inference

Treebanks are key resources for developing accurate statistical parsers. However, building treebanks is expensive and time-consuming. For domains requiring deep subject matter expertise, such as law and medicine, treebanking is even more difficult. To reduce annotation costs for these domains, we develop methods to improve cross-domain parsing inference using paraphrases. Paraphrases are easier to obtain than full syntactic analyses as they do not require deep linguistic knowledge, only linguistic fluency. A sentence and its paraphrase may have similar syntactic structures, allowing their parses to mutually inform each other. We present several methods to incorporate paraphrase information by jointly parsing a sentence with its paraphrase. These methods are applied to state-of-the-art constituency and dependency parsers and provide significant improvements across multiple domains.


Introduction
Parsing is the task of reconstructing the syntactic structure from surface text. Many natural language processing tasks use parse trees as a basis for deeper analysis.
The most effective sources of supervision for training statistical parsers are treebanks. Unfortunately, treebanks are expensive, time-consuming to create, and not available for most domains. Compounding the problem, the accuracy of statistical parsers degrades as the domain shifts away from the supervised training corpora (Gildea, 2001; Bacchiani et al., 2006; McClosky et al., 2006b; Surdeanu et al., 2008). Furthermore, for domains requiring subject matter experts, e.g., law and medicine, it may not be feasible to produce large-scale treebanks since subject matter experts generally don't have the necessary linguistic background. It is natural to look for resources that are more easily obtained. In this work, we explore using paraphrases. Unlike parse trees, paraphrases can be produced quickly by humans and don't require extensive linguistic training. While paraphrases are not parse trees, a sentence and its paraphrase may have similar syntactic structures for portions where they can be aligned.

* Work performed during an IBM internship.
We can improve parsers by jointly parsing a sentence with its paraphrase and encouraging certain types of overlaps in their syntactic structures. As a simple example, consider replacing an unknown word in a sentence with a synonym found in the training data. This may help disambiguate the sentence without changing its parse tree. More disruptive forms of paraphrasing (e.g., topicalization) can also be handled by not requiring strict agreement between the parses.
In this paper, we use paraphrases to improve parsing inference within and across domains. We develop methods using dual-decomposition (where the parses of both sentences from a dependency parser are encouraged to agree, Section 3.2) and pair-finding (which can be applied to any n-best parser, Section 3.3). Some paraphrases significantly disrupt syntactic structure. To counter this, we examine relaxing agreement constraints and building classifiers to predict when joint parsing won't be beneficial (Section 3.4). We show that paraphrases can be exploited to improve cross-domain parser inference for two state-of-the-art parsers, especially on domains where they perform poorly.

Paraphrases
While paraphrases are difficult to define rigorously (Bhagat and Hovy, 2013), we only require a loose definition in this work: a pair of phrases that mean approximately the same thing. Paraphrases can be constructed in various ways: replacing words with synonyms, reordering clauses, adding relative clauses, using negation and antonyms, etc. Table 1 lists some example paraphrases.
There are a variety of paraphrase resources produced by humans (Dolan and Brockett, 2005) and automatic methods (Ganitkevitch et al., 2013). Recent work has shown that reliable paraphrases can be crowdsourced at low cost (Negri et al., 2012; Burrows et al., 2013; Tschirsich and Hintz, 2013). Paraphrases have been shown to help summarization (Cohn and Lapata, 2013), question answering (Duboue and Chu-Carroll, 2006; Fader et al., 2013), machine translation (Callison-Burch et al., 2006), and semantic parsing (Berant and Liang, 2014). Paraphrases have also been applied to syntactic tasks, such as prepositional phrase attachment and noun compounding, where the corpus frequencies of different syntactic constructions (approximated by web searches) are used to help disambiguate (Nakov and Hearst, 2005). One method for transforming constructions is to use paraphrase templates.

Bilingual Parsing
The closest task to ours is bilingual parsing, where sentences and their translations are parsed simultaneously (Burkett et al., 2010). While our methods differ from those used in bilingual parsing, the general ideas are the same. Translating and paraphrasing are related transformations since both approximately preserve meaning. While syntax is only partially preserved across these transformations, the overlapping portions can be leveraged with joint inference to mutually disambiguate. Existing bilingual parsing methods typically require parallel treebanks for training and parallel text at runtime, while our methods only require parallel text at runtime. Since we do not have a parallel paraphrase treebank for training, we cannot directly compare to these methods.

Jointly Parsing Paraphrases
With a small number of exceptions, parsers typically assume that the parse of each sentence is independent. There are good reasons for this independence assumption: it simplifies parsing inference and oftentimes it is not obvious how to relate multiple sentences (though see  for one approach). In this section, we present two methods to jointly parse paraphrases without complicating inference steps. Before going into details, we give a high level picture of how jointly parsing paraphrases can help in Figure 1. With the baseline parser, the parse tree of the target sentence is incorrect but its paraphrase (parsed by the same parser) is parsed correctly. We use rough alignments to map words across sentence pairs. Note the similar syntactic relations when they are projected across the aligned words.
Our goal is to encourage an appropriate level of agreement between the two parses across alignments. We start by designing "hard" methods which require complete agreement between the parses. However, since parsers are imperfect and alignments approximate, we also develop "soft" methods which allow for disagreements. Additionally, we build procedures to decide whether to use the original (non-joint) parse or the new joint parse for each sentence, since joint parses may be worse in cases where the sentences are too different and alignment fails.

Objective
In a typical parsing setting, given a sentence (x) and its paraphrase (y), parsers find a*(x) and b*(y) that satisfy the following equation:

$$a^*(x),\; b^*(y) = \operatorname*{argmax}_{a \in T(x),\, b \in T(y)} f(a) + f(b) \qquad (1)$$

where f is a parse-scoring function and T returns all possible trees for a sentence. f can take many forms, e.g., summing the scores of arcs (Eisner, 1996; McDonald et al., 2005) or multiplying probabilities together (Charniak and Johnson, 2005). The argmax over a and b of equation (1) is separable; parsers make two sentence-level decisions.
For joint parsing, we modify the objective so that parsers make one global decision:

$$a^*(x),\; b^*(y) = \operatorname*{argmax}_{a \in T(x),\, b \in T(y),\; c(a,b) = 0} f(a) + f(b) \qquad (2)$$

where c (defined below) measures the syntactic similarity between the two trees. The smaller c(a, b) is, the more similar a and b are. Intuitively, joint parsers must retrieve the most similar pair of trees with the highest sum of scores.

Constraints
The constraint function, c, ties two trees together using alignments as a proxy for semantic information. An alignment is a pair of words from sentences x and y that approximately mean the same thing. For example, in Figure 1, (help_x, help_y) is one alignment and (pestilence_x, disease_y) is another. To simplify joint parsing, we assume the aligned words play the same syntactic roles (which is obviously not always true and should be revisited in future work). c measures the syntactic similarity by computing how many pairs of alignments have different syntactic head relations. For the two trees in Figure 1, two alignment pairs have differing head relations while the rest share the same relation, so c(a, b) = 2. As we'll show in Section 5, the constraints defined above are too restrictive because of this strong assumption. To alleviate the problem, we present ways of appropriately relaxing the constraints later. We now turn to the first method of incorporating constraints into joint parsing.

[Algorithm 1: Dual decomposition for jointly parsing paraphrases (pseudocode). E is the set of all possible edges between any pair of aligned words. Given aligned word pairs, u_ij denotes the dual value of an edge from the ith aligned word to the jth aligned word. δ_k is the step size at the kth iteration.]
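To make the constraint concrete, here is a minimal sketch of c in Python under our own data layout (the paper does not give an implementation): heads_a[i] is the head index of word i in tree a (−1 for the root), and alignments is a list of (i, j) single-word alignment pairs between x and y.

```python
# Count how many aligned word pairs disagree on their head relation.
def constraint_cost(heads_a, heads_b, alignments):
    x_to_y = dict(alignments)
    cost = 0
    for i, j in alignments:
        ha, hb = heads_a[i], heads_b[j]
        if ha == -1 or hb == -1:
            # Root attachments agree only when both aligned words are roots.
            cost += 0 if ha == hb else 1
        else:
            # Project i's head through the alignment; an unaligned head
            # (None) cannot match, so it counts as a disagreement.
            cost += 0 if x_to_y.get(ha) == hb else 1
    return cost
```

Identical trees over fully aligned sentences give c(a, b) = 0; each projected head mismatch adds one to the cost.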

Constraints via Dual Decomposition
Dual decomposition is well-suited for finding the MAP assignment to equation (2). When the parse-scoring function f includes an arc-factored component as in McDonald et al. (2005), it is straightforward to incorporate constraints as shown in Algorithm 1. Essentially, dual decomposition penalizes relations that are different in the two trees by adding/subtracting dual values to/from arc scores. When dual decomposition is applied in Figure 1, the arc score of (help_x → dying) decreases and the score for (natives_x → dying) increases in the second iteration, which eventually leads the algorithm to favor the latter.

[Figure 1: An illustration of jointly parsing a sentence with its paraphrase. Target sentence x: "help some natives dying of pestilence"; paraphrase y: "help some natives who were dying of disease". Unaligned words are gray. Joint parsing encourages structural similarity and allows the parser to correct the incorrect arc.]
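As a hedged illustration of the update loop, the following toy applies the same subgradient idea to two simplified "parsers" that each pick a head for every aligned word from a table of arc scores; real arc-factored decoding, the edge set E, and the paper's exact step-size schedule are abstracted away, and all names are our own.

```python
# Toy sketch of the dual-decomposition loop (in the spirit of Algorithm 1).
def decode(scores, u, sign):
    # For each aligned word w, pick the head h maximizing score(h) + sign * u[w][h].
    return {w: max(cand, key=lambda h: cand[h] + sign * u[w].get(h, 0.0))
            for w, cand in scores.items()}

def dual_decomposition(scores_a, scores_b, max_iter=20, delta0=0.1):
    u = {w: {} for w in scores_a}       # dual values per (aligned word, head) pair
    a = b = None
    for k in range(max_iter):
        a = decode(scores_a, u, +1.0)   # tree for x, biased by the duals
        b = decode(scores_b, u, -1.0)   # tree for y, biased the opposite way
        if a == b:                      # full agreement: a certificate of optimality
            return a, b
        step = delta0 / (k + 1)         # simple decaying step size
        for w in u:
            if a[w] != b[w]:
                # Push a away from its current choice and toward b's, and vice versa.
                u[w][a[w]] = u[w].get(a[w], 0.0) - step
                u[w][b[w]] = u[w].get(b[w], 0.0) + step
    return a, b                         # may still disagree when iterations run out
```

On a Figure 1–style example where x slightly prefers the arc (help → dying) but y strongly prefers (natives → dying), one round of penalties flips both parsers into agreement on the latter.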
We relax the constraints by employing soft dual decomposition (Anzaroot et al., 2014), replacing UPDATE in Algorithm 1 with S-UPDATE from Algorithm 2. The problem with the original constraints is that they force every pair of alignments to have the same relation even when some aligned words certainly play different syntactic roles. The introduced slack variable lets some alignments have different relations when the parsers prefer them. Penalties bounded by the slack tend to fix incorrect arcs without changing correct parses. In this work we use a single slack variable, but it is possible to have a different slack variable for each type of dependency relation.
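A hedged sketch of S-UPDATE: the subgradient step is unchanged, but every dual value is projected back into [0, s], so the slack s bounds how strongly any single agreement constraint can push the parsers. The data structures mirror the toy dual-decomposition sketch and are our own, not the paper's.

```python
# Soft-constraint variant of the dual update with projection onto [0, s].
def s_update(u, a, b, step, s):
    for w in u:
        if a[w] != b[w]:
            for head, delta in ((a[w], -step), (b[w], +step)):
                v = u[w].get(head, 0.0) + delta
                u[w][head] = min(max(v, 0.0), s)   # project into [0, s]
    return u
```

With s = 0 all penalties vanish (constraints are ignored); as s grows, the update approaches the hard version.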

Constraints via Pair-finding
One shortcoming of the dual decomposition approach is that it only applies to parse-scoring functions with an arc-factored component. We introduce another method for estimating equation (2) that applies to all n-best parsers.
Given the n-best parses of x and the m-best parses of y, Algorithm 3 scans through the n×m pairs of trees and chooses the pair that satisfies equation (2). If it finds a pair with c(a, b) = 0, then it has found the answer to the equation. Otherwise, it chooses the pair with the smallest c(a, b), breaking ties using the scores of the parses (f(a) + f(b)). This algorithm is well suited for finding solutions to the equation, but the solutions are not necessarily good trees due to the overly hard constraints.

[Algorithm 2: The new UPDATE function of soft dual decomposition for joint parsing. It projects all dual values between 0 and s ≥ 0, where s is a slack variable that allows the algorithm to avoid satisfying some constraints.]
The algorithm often finds bad trees far down the n-best list because it is mainly interested in retrieving pairs of trees that satisfy all constraints. Parsers find such pairs with low scores if they are allowed to search through unrestricted space. To mitigate the problem, we shrink the search space by limiting n. Reducing the search space relies on the fact that higher ranking trees are more likely to be correct than the lower ranking ones. Note that we decrease n because we are interested in recovering the tree of the target sentence, x. m should also be decreased to improve the parse of its paraphrase, y.
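A minimal sketch of the pair-finding scan described above, with tree representations, f, and c left abstract (any tree object works): prefer the pair with the smallest constraint cost, breaking ties by the summed parse scores.

```python
# Scan all n*m tree pairs, as in the pair-finding scheme (Algorithm 3).
def pair_finding(a_list, b_list, f, c):
    best, best_cost, best_score = None, float('inf'), float('-inf')
    for a in a_list:            # n-best trees of x
        for b in b_list:        # m-best trees of y
            cost, score = c(a, b), f(a) + f(b)
            if cost < best_cost or (cost == best_cost and score > best_score):
                best, best_cost, best_score = (a, b), cost, score
    return best
```

Restricting the search space amounts to truncating a_list and b_list before calling this function.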

Logistic Regression
One caveat of the two previously proposed methods is that they do not know whether the original or joint parse of x is more accurate. Sometimes they increase agreement between the parses at the cost of accuracy. To remedy this problem, we use a classifier (specifically logistic regression) to determine whether a modified tree should be used. The classifier can learn the error patterns produced by each method.

[Algorithm 3: The pair-finding scheme with a constraint function c. a_1:n are the n-best trees of x and b_1:m are the m-best trees of y. Pseudocode fragment: function PAIR-FINDING(a_1:n, b_1:m); set a, b = null, min = ∞, max = −∞; for i = 1 to n do, for j = 1 to m do, …]

Features
Classifier features use many sources of information: the target sentence x and its paraphrase y, the original and new parses of x (a_0 and a), and the alignments between x and y.
Crossing Edges The number of crossing arcs when alignment edges are drawn between the sentence pair, divided by the length of x. This roughly measures how many reorderings are needed to change x into y.
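As a sketch under our reading of the feature description: two alignments (i1, j1) and (i2, j2) cross when i1 < i2 but j1 > j2, and the raw count is normalized by the length of x.

```python
# Crossing-edges feature: count crossing alignment pairs, normalized by len(x).
def crossing_edges(alignments, len_x):
    crossings = sum(1 for (i1, j1) in alignments for (i2, j2) in alignments
                    if i1 < i2 and j1 > j2)
    return crossings / len_x
```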
Non-projective Edges Whether there are more non-projective arcs in the new parse (a) than in the original (a_0).

Sentence Lengths
Whether the length of x is smaller than that of y. This feature exists because baseline parsers tend to perform better on shorter sentences.

Word Overlaps
The number of words in common between x and y, normalized by the length of x.

[Table 2: Feature templates. REL is the dependency relation between the word and its parent; CP is the coarse part-of-speech tag (first two letters) of a word; p and gp select the parent and grandparent of the word, respectively.]
Parse Structure Templates The feature generator goes through every word in {a_0, a} and sets the appropriate boolean features from Table 2. Features are prefixed by whether they come from a_0 or a.

Data and Programs
This section describes our paraphrase dataset, parsers, and other tools used in experiments.

Paraphrase Dataset
To evaluate the efficacy of the proposed methods of jointly parsing paraphrases, we built a corpus of paraphrases where one sentence in each pair has a gold tree. We randomly sampled 4,000 sentences from four gold treebanks: Brown, the British National Corpus (BNC), QuestionBank (QB), and Wall Street Journal section 24 (Francis and Kučera, 1989; Foster and van Genabith, 2008; Judge et al., 2006; Marcus et al., 1993). A linguist provided a paraphrase for each sampled sentence according to these instructions: The paraphrases should more or less convey the same information as the original sentence. That is, the two sentences should logically entail each other. The paraphrases should generally use most of the same words (but not necessarily in the same order). Active/passive transforms, changing words with synonyms, and rephrasings of the same idea are all examples of transformations that paraphrases can use (others can be used too).
They can be as simple as just changing a single word in some cases (though, ideally, a variety of paraphrasing techniques would be used).
We also provided 10 pairs of sentences as examples. We evaluate our methods only on the sampled sentences from the gold corpora because the new paraphrases do not include syntactic trees. The data was divided into development and testing sets such that the two share the same distribution over the four corpora. Paraphrases were tokenized by the BLLIP tokenizer. See Table 3 for statistics of the dataset.[7]

Meteor Word Aligner
We use Meteor, a monolingual word aligner developed by Denkowski and Lavie (2014), to find alignments between paraphrases. It uses exact matches, stems, synonyms, and paraphrases[8] to form these alignments. Because it uses paraphrases, it sometimes aligns multiple words from sentence x to one or more words from sentence y, or vice versa. We ignore these multiword alignments because our methods currently only handle single-word alignments. In pilot experiments, we also tried a simple aligner that required exact word matches. Joint parsing with these simpler alignments improved parsing accuracy, but not as much as with Meteor.[9] Thus, all results in Section 5 use Meteor for word alignment. On average across the four corpora, 73% of the tokens are aligned.
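A sketch of the single-word filter described above: any source or target index that participates in more than one alignment is treated as part of a multiword alignment, and all of its pairs are dropped. This mirrors the described behavior; the paper's exact filtering may differ.

```python
# Keep only one-to-one alignments; drop anything touched by a multiword group.
from collections import Counter

def keep_single_word(alignments):
    src = Counter(i for i, _ in alignments)
    tgt = Counter(j for _, j in alignments)
    return [(i, j) for i, j in alignments if src[i] == 1 and tgt[j] == 1]
```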

Parsers
We use one dependency parser and one constituency parser for our experiments: RBG and BLLIP. The RBG parser (Lei et al., 2014) is a state-of-the-art third-order discriminative dependency parser with low-rank tensors as part of its features.[10] BLLIP (Charniak and Johnson, 2005) is a state-of-the-art constituency parser composed of a generative parser and a discriminative reranker.[11] To train RBG and BLLIP, we used the standard WSJ training set (sections 2-21, about 40,000 sentences).[12] We also used the self-trained BLLIP parsing model, which is trained on an additional two million Gigaword parses generated by the BLLIP parser (McClosky et al., 2006a).

[7] The distribution over the four corpora is skewed because each corpus has a different number of sentences within the length constraints. Samples are collected uniformly over all sentences that satisfy the length criterion.
[8] Here, paraphrase means a single- or multiword phrase that is semantically similar to another single- or multiword phrase.
[9] The pilot was conducted on fewer than 700 sentence pairs before all paraphrases were created. We give Meteor tokenized paraphrases with capitalization. Maximizing accuracy rather than coverage worked better in pilot experiments.
[10] http://github.com/taolei87/RBGParser, 'master' version from June 24th, 2014.

Logistic Regression
We use the logistic regression implementation from Scikit-learn[13] with the hand-crafted features from Section 3.4.1. The classifier decides whether to keep the parse trees from the joint method. When it decides to disregard them, it returns the parse from the baseline parser. We train a separate classifier for each joint method.
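A hedged sketch of this gating classifier using scikit-learn's logistic regression with the settings stated later in the paper (L1 penalty, C = 1). Feature extraction is a placeholder (real features would come from the templates in Section 3.4.1), and 'liblinear' is our solver choice since it supports L1.

```python
# Gate: decide per sentence whether to keep the joint parse or fall back.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_gate(features, joint_was_better):
    # features: (num_sentences, num_features); joint_was_better: 0/1 labels.
    gate = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
    gate.fit(features, joint_was_better)
    return gate

def choose_parse(gate, feats, joint_tree, baseline_tree):
    # Keep the joint parse only when the classifier predicts it helps.
    return joint_tree if gate.predict(feats.reshape(1, -1))[0] else baseline_tree
```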

Experiments
We ran all tuning and model-design experiments on the development set. For the final evaluation, we tuned parameters on the development set and evaluated on the test set. Constituency trees were converted to basic non-collapsed dependency trees using Stanford Dependencies (De Marneffe et al., 2006). We report unlabeled attachment scores (UAS) for all experiments, plus labeled attachment scores (LAS) in the final evaluation, ignoring punctuation. Averages are micro-averages across all sentences.
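A minimal sketch of the evaluation metric: unlabeled attachment score over non-punctuation tokens. Heads are lists of head indices (−1 for the root) and is_punct marks tokens to ignore; the layout is our own.

```python
# UAS ignoring punctuation: fraction of non-punct tokens with the correct head.
def uas(gold_heads, pred_heads, is_punct):
    pairs = [(g, p) for g, p, punct in zip(gold_heads, pred_heads, is_punct)
             if not punct]
    return sum(1 for g, p in pairs if g == p) / len(pairs)
```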

Dual Decomposition
Since BLLIP is not arc-factored, these experiments only use RBG. Several parameters need to be fixed beforehand: the slack constant (s), the learning rate (δ), and the maximum number of iterations (K). We set δ_0 = 0.1 and δ_k = δ_0 / 2^t, where t is the number of times the dual score has increased (Rush et al., 2010), and choose K = 20. These values were chosen from pilot studies. The slack variable (s = 0.5) was tuned with a grid search over values between 0.1 and 1.5 at intervals of 0.1. We chose a value that generalizes well across the four corpora as opposed to a value that does very well on a single corpus. As shown in Table 4, joint parsing with hard dual decomposition performs worse than independent parsing (RBG). This is expected because hard dual decomposition forces every pair of alignments to form the same relation even when they should not. With relaxed constraints (S-Dual), joint parsing performs significantly better than independent parsing. Soft dual decomposition improves across all domains except Brown (where it ties).

[11] http://github.com/BLLIP/bllip-parser
[12] The RBG parser requires predicted POS tags. We used the Stanford tagger (Toutanova et al., 2003) to tag the WSJ and paraphrase datasets. Training data was tagged using 20-fold cross-validation, and the paraphrases were tagged by a tagger trained on all of the WSJ training set.
[13] http://scikit-learn.

[Table 3: Statistics for the four corpora of the paraphrase dataset. Most statistics are counted from sentences with gold trees, including punctuation; marked statistics are from the paraphrased sentences. "Avg. aligned" is the average number of aligned tokens from the original sentences using Meteor. OOV is the percentage of tokens not seen in the WSJ training set.]

Pair-finding
These experiments use the 50-best trees from the BLLIP parser. When converting to dependencies, some constituency trees map to the same dependency tree; in this case, the lower-ranking trees are dropped. Like joint parsing with hard dual decomposition, joint parsing with unrestricted pair-finding (n = 50) allows significantly worse parses to be selected, so we restrict n in the other experiments. Interestingly, each corpus has a different optimal value for n, which suggests we might improve accuracy further if we knew the domain of each sentence.

Logistic Regression
The classifier is trained on development-set sentences where the parse scores (UAS) of the proposed methods are higher or lower than those of the baselines, using leave-one-out cross-validation. We use random greedy search to select specific features from the 15 feature templates defined in Section 3.4.1. Features seen fewer than three times in the development data are thrown out. Separate regression models are built for the three different parsers. The logistic regression classifier uses an L1 penalty with regularization parameter C = 1.
Logistic regression experiments are reported in Table 6. All parsers benefit from employing logistic regression models on top of the paraphrase methods. The BLLIP experiments show a larger improvement than RBG. This may be because BLLIP cannot use soft constraints, so its errors are more pronounced.

[Table 6: Effect of using logistic regression on top of each method (UAS). Leave-one-out cross-validation is performed on the development data. +X means augmenting the above system with X.]

Final Evaluation
We evaluate the three parsers on the test set using the tuned parameters and logistic regression models from above. Joint parsing with paraphrases significantly improves accuracy for all systems (Table 7). Self-trained BLLIP with logistic regression is the most accurate, though RBG with S-Dual provides the most consistent improvements. Joint parsing without logistic regression (RBG + S-Dual) is more accurate than independent parsing (RBG) overall. With the help of logistic regression, the methods do at least as well as their baseline counterparts on all domains with the exception of self-trained BLLIP on BNC. We believe that the drop on BNC is largely due to noise as our BNC test set is the smallest of the four. As on development, logistic regression does not change the accuracy much over the RBG parser with soft dual decomposition.
Joint parsing provides the largest gains on QuestionBank, the domain with the lowest baseline accuracies. This fits with our goal of using paraphrases for domain adaptation -parsing with paraphrases helps the most on domains furthest from our training data.

Error analysis
We analyzed the errors from RBG and BLLIP along several dimensions: by dependency label, sentence length, dependency length, alignment status (whether a token was aligned), percentage of tokens aligned in the sentence, and edit distance between the sentence pairs. Most errors are fairly uniformly distributed across these dimensions and indicate general structural improvements when using paraphrases. BLLIP saw a 2.2% improvement for the ROOT relation, though RBG's improvement here was more moderate. For sentence lengths, BLLIP obtains larger boosts for shorter sentences while RBG's are more uniform. RBG gets a 1.4% UAS improvement on longer dependencies (6 or more tokens) while shorter dependencies are more modestly improved by about 0.3-0.5% UAS. Surprisingly, alignment information provides no signal as to whether accuracy improves.
Additionally, we had our annotator label a portion of our dataset with the set of paraphrasing operations employed. While most paraphrasing operations generally improved performance under joint inference, the largest reliable gains came from lexical replacements (e.g., synonyms).

Conclusions and Future Work
Our methods of incorporating paraphrases improve parsing across multiple domains for state-of-the-art constituency and dependency parsers. We leverage the fact that paraphrases often express the same semantics with similar syntactic realizations. These provide benefits even on top of self-training, another domain adaptation technique.
Since paraphrases are usually not available, our methods may seem limited. However, there are several possible use cases. The best-case scenario is when users can be directly asked to rephrase a question and thereby provide a paraphrase. For instance, question answering systems can ask users to rephrase questions when an answer is marked as wrong. Another option is to use crowdsourcing to quickly create a paraphrase corpus (Negri et al., 2012; Burrows et al., 2013; Tschirsich and Hintz, 2013). As part of future work, we plan to integrate existing larger paraphrase resources, such as WikiAnswers (Fader et al., 2013) and PPDB (Ganitkevitch et al., 2013). WikiAnswers provides rough equivalence classes of questions. PPDB includes phrasal and syntactic alignments which could supplement our existing alignments or be used as proxies for paraphrases.

[Table 7: Final evaluation on testing data. Numbers are unlabeled attachment score (labeled attachment score). +X indicates extending the above system with X. BLLIP-ST is BLLIP using the self-trained model. Coloring indicates a significant difference over the baseline (p < 0.01).]
While these resources are noisy, the quantity of data may provide additional robustness. Lastly, integrating our methods with paraphrase detection or generation systems could help provide paraphrases on demand.
There are many other ways to extend this work. Poor alignments are one of the larger sources of errors and improving alignments could help dramatically. One simple extension is to use multiple paraphrases and their alignments instead of just one. More difficult would be to learn the alignments jointly while parsing and adaptively learn how alignments affect syntax. Our constraints can only capture certain types of paraphrase transformations currently and should be extended to understand common tree transformations for paraphrases, as in (Heilman and Smith, 2010). Finally, and perhaps most importantly, our methods apply only at inference time. We plan to investigate methods which use paraphrases to augment parsing models created at train time.