Compositional Phrase Alignment and Beyond

Phrase alignment is the basis for modelling sentence pair interactions, such as paraphrase and textual entailment recognition. Most phrase alignments are compositional processes such that an alignment of a phrase pair is constructed based on the alignments of their child phrases. Nonetheless, studies have revealed that non-compositional alignments involving long-distance phrase reordering are prevalent in practice. We address the phrase alignment problem by combining an unordered tree mapping algorithm with phrase representation modelling that explicitly embeds the similarity distribution in the sentences onto powerful contextualised representations. Experimental results demonstrate that our method effectively handles compositional and non-compositional global phrase alignments. Our method significantly outperforms that used in a previous study and achieves a performance competitive with that of experienced human annotators.


1 Introduction
Phrase alignment is a fundamental problem in modelling the interactions between a pair of sentences, with applications such as paraphrase identification, textual entailment recognition, and question answering (Das and Smith, 2009; Heilman and Smith, 2010; Wang and Manning, 2010). Phrase alignment generally adheres to compositionality, in which a phrase pair is aligned based on the alignments of their child phrases. Nonetheless, non-compositional alignments involving long-distance phrase reordering are prevalent in practice (Burkett et al., 2010; Heilman and Smith, 2010; Arase and Tsujii, 2017). Figure 1 shows an example of phrase alignment in which phrases of the same colours are alignable, i.e. they are phrasal paraphrases. The alignment of 'antivirus vaccines' and 'vaccines against the virus' is compositional, as it is supported by alignments of their child nodes, although their orderings are reversed. Similarly, the alignment of their parents τ^s_2 and τ^t_2 is compositional. By contrast, the alignment of τ^s_1 and τ^t_1 is non-compositional in relation to the alignment of τ^s_2 and τ^t_2; although τ^t_1 and τ^t_2 are siblings, τ^s_1 is not a sibling of τ^s_2, i.e. it is not in the scope of the parent node of τ^s_2. To treat such a long-distance correspondence in non-compositional alignment, one has to consider candidate phrases outside the local scope and potentially the entire sentence.
In this study, we address the phrase alignment problem by combining a tree mapping algorithm with phrase representation modelling. We treat compositional alignment with an algorithm for unordered tree mapping (Zhang, 1996). For the algorithm to work, the definition of the edit cost (i.e. the dissimilarity between phrases) is crucial. We propose a novel phrase representation, by which the edit cost is defined, based on contextualised representations from the bidirectional encoder representations from transformers (BERT) model (Devlin et al., 2019). The proposed phrase representation models the similarity distribution in the entire sentence, thereby allowing the algorithm to be extended to treat non-compositional global alignments.
Phrase alignment can be difficult even for humans because there is unavoidable subjectivity in acceptable semantic discrepancies between paraphrases. Our experimental results indicate that the proposed method achieves 95.7% of the alignment quality of trained human annotators for phrase alignment in paraphrase sentence pairs.
The contributions of this study are twofold. First, we formalise the compositional phrase alignment problem as an unordered tree mapping. Second, we propose a phrase representation model that allows non-compositional global alignments.
2 Related Work

Tree Mapping and Phrase Alignment
Ordered tree mapping has been employed to estimate the similarity of a pair of sentences for its ability to align syntactic trees (Punyakanok et al., 2004; Alabbas and Ramsay, 2013; Yao et al., 2013; McCaffery and Nederhof, 2016). However, it is too restrictive in that the order of the aligned phrases in the sentences must be the same. Previous studies extended the algorithm to adapt the edit costs (Bernard et al., 2008; Mehdad, 2009; Alabbas and Ramsay, 2013) and edit operations (Heilman and Smith, 2010; Wang and Manning, 2010) to specific tasks. In contrast, the unordered tree mapping that we employ in this study is sufficiently flexible to assure identification of optimal compositional phrase alignments.
Parallel parsing also involves phrase alignment in its parsing process. As the tree isomorphism assumption is too restrictive, previous studies have employed various relaxation techniques that prefer but do not force synchronisation. Burkett et al. (2010) used weakly synchronised grammar, and Das and Smith (2009) used quasi-synchronous grammars (Smith and Eisner, 2006). Choe and McClosky (2015) used dual decomposition to encourage agreement between two parse trees. All of these methods allow excess flexibility beyond compositionality in alignment. Rule extraction for tree transducers also involves phrase alignments (Martínez-Gómez and Miyao, 2016) but disregards phrase boundaries to maximise the coverage of extracted rules. In contrast, the phrase alignment problem addressed in our study adheres to syntactic structures.

Phrase Representation Generation
Researchers have proposed specialised phrase representations for specific tasks (Arase and Tsujii, 2019; Yin et al., 2020) on top of contextualised representations. In this study, we propose dedicated phrase representations for the alignment problem. Before contextualised representations, studies considered word alignment distributions for modelling semantic interactions between a pair of sentences (He and Lin, 2016; Parikh et al., 2016; Chen et al., 2017). We agree with their intuition that pairwise similarities alone are not sufficient to define the cost of alignment. When other similar phrases exist, the pairwise similarities have to be properly adjusted. This adjustment is crucial for treating non-compositional global alignment.

Preliminaries and Notation
We refer to one of the paraphrasal sentences as the source, s, and the other as the target, t. Superscripts s and t represent source and target, respectively. The syntactic trees of the source and target, T^s = {τ^s_i}_i and T^t = {τ^t_j}_j, determine the phrase structures; τ^s_i and τ^t_j are the source and target phrases. The alignments of their phrases are represented as a set of phrase pairs ⟨τ^s_i, τ^t_j⟩. We interchangeably use the subscript of a node as the index of the alignment or the index of the node in a tree whenever the meaning is apparent from the context. A phrase can align to an empty node τ∅ (τ∅ ∉ T), which is called a null alignment.
We define functions to traverse a tree: ds(τ) derives the descendant nodes of τ, and lca(τ_i, τ_j) derives the lowest common ancestor of τ_i and τ_j. Additionally, the function deg(T) computes the maximum degree (number of children over all nodes) of T, and | · | counts the number of elements in a set; e.g. |T| is the number of nodes in T.
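The traversal functions above can be sketched as follows, assuming a phrase tree is represented as a dict mapping each node id to its list of child ids; this data structure and the recursive path search are illustrative choices, not the paper's implementation.

```python
def ds(tree, tau):
    """Derive all descendant nodes of tau (excluding tau itself)."""
    out = []
    stack = list(tree.get(tau, []))
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(tree.get(node, []))
    return out

def lca(tree, root, tau_i, tau_j):
    """Lowest common ancestor of tau_i and tau_j, given the tree root."""
    def path_to(node, target, acc):
        # Depth-first search for the root-to-target path.
        acc = acc + [node]
        if node == target:
            return acc
        for child in tree.get(node, []):
            path = path_to(child, target, acc)
            if path:
                return path
        return None
    pi = path_to(root, tau_i, [])
    pj = path_to(root, tau_j, [])
    common = None
    for a, b in zip(pi, pj):
        if a == b:
            common = a  # deepest shared node on both paths
    return common

def deg(tree):
    """Maximum out-degree (branching factor) over all nodes."""
    return max((len(children) for children in tree.values()), default=0)
```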

Problem Definition
Based on Arase and Tsujii (2017), we re-formalise the legitimacy conditions on a set of compositional phrase alignments H_L.
Definition 3.1. The legitimacy conditions consist of the following. Consistency: In H_L, a phrase (≠ τ∅) in the source tree is aligned with at most one phrase (≠ τ∅) in the target tree, and vice versa. Monotonicity: The ancestor-descendant relation among aligned phrases is retained on the source and target sides. Familiness: Two separate subtrees of T^s are aligned to two separate subtrees of T^t.
The consistency condition ensures one-to-one alignment. The monotonicity condition regulates the retainment of the ancestor-descendant relation on the source and target sides. The familiness condition realises compositionality in the language, constraining two separate subtrees of T^s to be aligned to two separate subtrees of T^t.² In other words, the familiness condition prohibits a node in a source subtree from aligning to a node outside the corresponding target subtree. In Figure 1, ⟨τ^s_1, τ^t_1⟩ violates the familiness condition in relation to ⟨τ^s_2, τ^t_2⟩ and ⟨τ^s_3, τ^t_3⟩ because τ^s_3 is not a proper ancestor¹ of lca(τ^s_1, τ^s_2), whereas τ^t_3 is a proper ancestor of lca(τ^t_1, τ^t_2). We define non-compositional alignments H_nc as alignments that satisfy the legitimacy conditions internally but do not satisfy them against H_L. For example, the alignment ⟨τ^s_1, τ^t_1⟩ in Figure 1 is compositionally composed and satisfies the legitimacy conditions for its internal alignments. However, it does not satisfy the legitimacy conditions against the alignments of ⟨τ^s_2, τ^t_2⟩ and ⟨τ^s_3, τ^t_3⟩ because it violates the familiness condition. We allow H_nc to be added into H_L if it is compatible, i.e. for each non-null alignment ⟨τ^s_i, τ^t_j⟩ in H_nc, both ⟨τ^s_i, τ∅⟩ and ⟨τ∅, τ^t_j⟩ are in H_L. When the compatibility condition is met, H_nc can be safely added to H_L by replacing the null alignments without violating the consistency condition. We implement this process as a simple post-processing step (Section 3.4).
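The compatibility check and merge described above can be sketched as follows, assuming alignments are (source, target) pairs with None standing in for the empty node τ∅; the function names and pair encoding are illustrative, not the paper's implementation.

```python
def is_compatible(H_nc, H_L):
    """H_nc is compatible with H_L when every non-null pair (s, t) in H_nc
    has both (s, None) and (None, t) present as null alignments in H_L."""
    H_L = set(H_L)
    for s, t in H_nc:
        if s is not None and t is not None:
            if (s, None) not in H_L or (None, t) not in H_L:
                return False
    return True

def merge(H_nc, H_L):
    """Add compatible non-compositional alignments to H_L, replacing the
    null alignments they cover (preserves the consistency condition)."""
    merged = set(H_L)
    for s, t in H_nc:
        if s is not None and t is not None:
            merged.discard((s, None))
            merged.discard((None, t))
            merged.add((s, t))
    return merged
```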

Compositional Alignment
Finding the optimal set of legitimate compositional alignments (Definition 3.1) is equivalent to finding the minimum-cost constrained tree mapping (Zhang, 1996), which belongs to the problem of unordered tree mapping (Bille, 2005). The edit operations of re-labelling, deletion, and insertion correspond to the alignment of two nodes, the null alignment of a source node, and the null alignment of a target node, respectively. Although the unordered tree mapping problem is in general MAX SNP-hard (Zhang and Jiang, 1994), the constrained tree edit distance (CTED) algorithm (Zhang, 1996) achieves polynomial time complexity using the familiness condition. In essence, the CTED algorithm reduces the unordered tree mapping problem to a maximum matching problem by the familiness condition. The reduction enables faster dynamic programming of O(|T^s||T^t|(deg(T^s) + deg(T^t)) log(deg(T^s) + deg(T^t))). Details of the CTED algorithm are given in Appendix B.

¹ A proper ancestor of a node i is any node j such that node j is an ancestor of node i and j is not the same node as i.
² Our definition is less constrained than that in Arase and Tsujii (2017), as discussed in Appendix A.
To apply CTED to phrase alignment, the edit cost function γ(·, ·) → ℝ is the key; it should satisfy the properties of a proper distance metric. This function evaluates the dissimilarity of a phrase pair, for which we propose a phrase representation model (Section 4). We use the cosine distance as γ(·, ·) ∈ [0, 2.0] because of its prevalence in measuring dissimilarity between representations. However, it is not a proper distance metric because it does not satisfy the triangle inequality. In future work, we will investigate alternative distance metrics.
We also need to estimate the cost of a null alignment. It is not trivial to generate a representation of such an empty phrase; hence, we decided to use a constant cost λ∅, i.e. γ(τ, τ∅) = γ(τ∅, τ) = λ∅. The appropriate value of λ∅ is determined using a development set.
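The edit cost just described can be sketched as a cosine distance with a constant for null alignments; the value λ∅ = 0.6 below is an arbitrary placeholder, since the paper tunes it on a development set.

```python
import numpy as np

LAMBDA_NULL = 0.6  # placeholder; tuned on a development set in the paper

def gamma(e_s, e_t):
    """Edit cost between two phrase representations.
    None stands in for the empty phrase τ∅ (null alignment)."""
    if e_s is None or e_t is None:
        return LAMBDA_NULL
    cos = np.dot(e_s, e_t) / (np.linalg.norm(e_s) * np.linalg.norm(e_t))
    return 1.0 - cos  # cosine distance, in [0, 2]
```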

Non-compositional Alignment
We designed top-down post-processing for non-compositional alignment so that the legitimacy conditions (Definition 3.1) are maximally satisfied in the final alignments. As Algorithm 3.1 shows, we add a set of alignments H_nc that compose the non-compositional alignments into H_L when they are compatible. Our post-processing aligns all the coloured phrase pairs in Figure 1 by allowing ⟨τ^s_1, τ^t_1⟩ and its descendant alignments. Algorithm 3.1 takes matrices of edit distances and corresponding operations, D and A, as input, which are obtained by CTED. D[i+1][j+1] and A[i+1][j+1] store the total cost and operations, respectively, to compose the alignment of ⟨τ^s_i, τ^t_j⟩. Note that index 0 is reserved for null alignments. The algorithm sorts the null alignments in H_L in descending order of the span covering the source and target phrases (line 2). For each null alignment, the algorithm finds candidate non-compositional alignments achieving the minimum cost (line 5). Then, using the ISCOMPATIBLE function, it checks whether a non-compositional alignment and its descendant alignments are compatible with the current set of alignments. If so, they are added to H_L by the UPDATEALIGNMENTS function, replacing the null alignments ⟨τ^s_i, τ∅⟩ and ⟨τ∅, τ^t_j⟩ in H_L with the non-compositional alignment ⟨τ^s_i, τ^t_j⟩. Our post-processing is a heuristic to maximally satisfy the legitimacy conditions, as finding the best combination of non-compositional alignments is computationally intractable.³ Our method ensures that non-compositional alignments improve the alignment cost by only allowing those with minimum cost.
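A simplified sketch of this top-down heuristic follows, again encoding alignments as (source, target) pairs with None for τ∅. The candidates, cost, and span arguments are hypothetical stand-ins for the information Algorithm 3.1 reads from the D and A matrices produced by CTED.

```python
def post_process(H_L, candidates, cost, span):
    """Greedily add minimum-cost non-compositional alignments to H_L
    when they can replace existing null alignments."""
    H = set(H_L)
    # Null-aligned source phrases, widest span first (cf. line 2).
    nulls = sorted((s for s, t in H if t is None and s is not None),
                   key=span, reverse=True)
    for s in nulls:
        pool = candidates.get(s, [])
        if not pool:
            continue
        # Minimum-cost non-compositional candidate (cf. line 5).
        t = min(pool, key=lambda cand: cost(s, cand))
        # Compatibility check: both null alignments must still be present.
        if (s, None) in H and (None, t) in H:
            H.discard((s, None))
            H.discard((None, t))
            H.add((s, t))
    return H
```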

Phrase Representation for Alignment
We propose a phrase representation model on top of the pre-trained BERT. One of the most common methods for obtaining a phrase representation from BERT is pooling outputs corresponding to tokens in the phrase. However, as we empirically show in Section 6, this method exhibits an unsatisfactory ability for modelling the similarity distribution in a sentence pair. Hence, we propose a novel method for generating phrase representations suitable for the phrase alignment problem.

Problem Statement and Approach
The estimation of a phrase pair's similarity for alignment is unique in that the similarity should depend on the similarities of other phrases in the sentence pair. That is, even if the pairwise similarity of τ^s_i and τ^t_j is high, the similarity score should be lowered if there is a phrase in the source sentence that is more similar to τ^t_j. Hence, we generate a phrase representation that reflects the similarity distribution within the sentence pair; this is particularly important for non-compositional alignments, where a globally plausible alignment pair must be found.
We first generate a representation of the similarity distribution within the sentence pair. We then transform the phrase representation obtained from BERT, referring to the representation of the similarity distribution using an attention mechanism.
Similarity Distribution Modelling We regard the outputs of the last layer of BERT, h ∈ ℝ^b, as token representations, where b is the hidden size determined by the BERT pre-training settings. Using the token representations, we generate a representation of the similarity distribution, e^c ∈ ℝ^b (Figure 2).
We first compute cosine similarities between the token representations of the sentence pair and obtain the similarity matrix. We then encode the similarity matrix using a convolutional neural network (CNN) and obtain e^c, called the SimMatrix representation. Our CNN is shallow, under the assumption that a shallow model is sufficient to capture latent features in SimMatrix. A shallow model also allows training with a smaller corpus while fine-tuning BERT. The CNN consists of a one-channel convolution layer activated by the rectified linear unit function, a max-pooling layer, and a fully connected feed-forward neural network (FFNN).

Representation Generation We obtain a basic representation e^s of τ^s spanning tokens i to j by simply pooling the token representations obtained from BERT: e^s = (1/(j − i + 1)) Σ_{k=i}^{j} h_k. (1) Similarly, a basic representation e^t of the target phrase τ^t is obtained. We then transform e^t to reflect the SimMatrix representation e^c. For this, we use an attention mechanism as shown in Figure 3, which has the same architecture as the Transformer (Vaswani et al., 2017). The attention layer consists of multi-head attention and FFNNs. Our model takes e^c, e^s, and e^t, and transforms e^t into ê^t ∈ ℝ^b.
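The two inputs described above can be sketched in NumPy: the token-level cosine similarity matrix that feeds the SimMatrix CNN, and the mean-pooled basic phrase representation. The shallow CNN and the attention transform are omitted here; shapes and names are illustrative.

```python
import numpy as np

def sim_matrix(H_s, H_t):
    """Cosine similarities between token representations.
    H_s: (len_s, b) source tokens; H_t: (len_t, b) target tokens.
    Returns a (len_s, len_t) matrix with values in [-1, 1]."""
    Hs = H_s / np.linalg.norm(H_s, axis=1, keepdims=True)
    Ht = H_t / np.linalg.norm(H_t, axis=1, keepdims=True)
    return Hs @ Ht.T

def basic_phrase_rep(H, i, j):
    """Mean-pool token representations over the span i..j (inclusive)."""
    return H[i:j + 1].mean(axis=0)
```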

Loss Function
To train the phrase representation model, we use a triplet margin loss: L = max(0, γ(e^s, ê^t_p) − γ(e^s, ê^t_n) + δ), where ê^t_p and ê^t_n are the transformed representations of positive (alignable) and negative (unalignable) pairs, respectively, and δ is a margin. Intuitively, the loss function makes the representations of paraphrase pairs closer, whereas those of non-paraphrase pairs become more distant. For negative examples, we randomly sample phrases that are separated by more than one hop from the alignable pair in T^t. At inference time, we transform the basic representation of a target phrase by our model and compute the cost γ(e^s, ê^t).
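The loss above can be sketched in plain NumPy with the cosine-distance cost γ: pull the alignable target representation toward the source phrase and push the unalignable one away by at least the margin δ. Training would of course use an autograd framework; this is just the forward computation.

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance in [0, 2], used as the cost γ."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_margin_loss(e_s, e_t_pos, e_t_neg, delta=0.4):
    """max(0, γ(e_s, ê_p) − γ(e_s, ê_n) + δ); zero when the positive is
    already closer than the negative by at least the margin."""
    return max(0.0,
               cosine_dist(e_s, e_t_pos) - cosine_dist(e_s, e_t_neg) + delta)
```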
We also tried models that discriminate alignable phrases or minimise the cosine similarity of an alignable pair. However, they were all inferior to the triplet margin loss.

Dataset: ESPADA

To train our phrase representation model, we need a corpus with phrase alignments annotated on sentence pairs. We extended the Syntactic Phrase Alignment Dataset for Evaluation (SPADE) (Arase and Tsujii, 2018), creating the Extended Syntactic Phrase Alignment DAtaset (ESPADA). Following the same annotation scheme, we annotated 1,916 sentence pairs sampled from NIST OpenMT⁴ corpora. ESPADA is now the largest annotated corpus for this problem and will be released by the Linguistic Data Consortium (LDC) soon. A linguist first annotated gold-standard syntactic trees on the paraphrases based on head-driven phrase structure grammar. Then, three native or near-native English speakers annotated the 1,916 paraphrases in parallel to identify phrasal paraphrases; i.e. the total number of annotated sentences is 5,748. Before the formal annotation, there was a training phase to improve annotation agreement; all annotators annotated trial samples.⁵ One of the authors inspected the results and gave advice on any misunderstandings of the annotation guidelines. Appendix C provides further details of the annotation process. Table 1 shows the statistics for ESPADA and SPADE; ~252k phrasal paraphrases were identified, among which ~81k unique pairs were agreed upon by at least two annotators and ~66k unique pairs were agreed upon by all annotators. As the last two rows show, in ESPADA and SPADE, 3.2% to 4.7% of pairs did not satisfy the monotonicity condition, and 1.1% to 1.4% of triplets did not satisfy the familiness condition in alignments agreed upon by at least two annotators.

⁴ https://www.nist.gov/itl/iad/mig/openmt
⁵ These samples were excluded from the formal annotation set.
Note that the monotonicity and familiness conditions are defined on relations of alignment pairs and triplets, respectively; hence, these figures do not mean that these percentages of alignments are non-compositional.

Evaluation Metrics and Upper Bounds
We used SPADE as the evaluation corpus; Table 1 shows statistics for its development (dev) and test sets. As evaluation metrics, we used alignment recall (ALIR), alignment precision (ALIP), and alignment F-measure (ALIF) (Arase and Tsujii, 2017, 2018). ALIR evaluates how well gold-standard alignments can be replicated by automatic alignments, and ALIP measures how well automatic alignments overlap with alignments identified by at least one annotator: ALIR = |H_a ∩ (G ∩ G′)| / |G ∩ G′| and ALIP = |H_a ∩ (G ∪ G′)| / |H_a|, where H_a is the set of automatic alignments, and G and G′ are those obtained by two respective annotators. ALIF computes the harmonic mean of ALIR and ALIP. Because SPADE provides alignments by three annotators, there are three combinations for G and G′. The final ALIR, ALIP, and ALIF values are calculated by taking the averages. Note that these evaluation metrics also count null alignments; hence, ALIP behaves differently from a general precision metric in that stricter models will have lower ALIP scores. This is because a stricter model aligning only a small number of phrases (≠ τ∅) increases the number of null alignments, making |H_a| larger.
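Under the reading given above, the metrics can be sketched over sets of alignment pairs: ALIR takes alignments agreed by both annotators (G ∩ G′) as gold, and ALIP counts overlap with alignments identified by at least one annotator (G ∪ G′). Exact definitions follow Arase and Tsujii (2017, 2018); treat the set operations here as an assumption for illustration.

```python
def alir(H_a, G, G_prime):
    """Recall of alignments agreed upon by both annotators."""
    gold = G & G_prime
    return len(H_a & gold) / len(gold)

def alip(H_a, G, G_prime):
    """Precision against alignments identified by at least one annotator."""
    return len(H_a & (G | G_prime)) / len(H_a)

def alif(H_a, G, G_prime):
    """Harmonic mean of ALIR and ALIP."""
    r, p = alir(H_a, G, G_prime), alip(H_a, G, G_prime)
    return 2 * r * p / (r + p) if r + p else 0.0
```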
The agreement among the human annotators can also be measured using ALIR, ALIP, and ALIF by regarding one annotator as the test and the other two as the gold standard and then taking averages. The scores for the trained annotators were consistent between ESPADA and SPADE, as shown in Table 2. This indicates that phrase alignment is difficult even for humans because acceptable levels of semantic divergence in paraphrases cannot be perfectly controlled. Hence, we regard these human scores as upper bounds for ALIR, ALIP, and ALIF.

Comparison Method
As the state-of-the-art syntactic phrase alignment method for comparison, we used that of Arase and Tsujii (2017). We re-implemented this method and compared performance on aligning gold parse trees.
Additionally, we compared variations of our method via ablation studies. We investigated the effect of CTED by comparing it with alignments by a naive thresholding, which aligns phrases having cosine similarities above a threshold. The threshold was set to maximise the ALIF score on the SPADE development set.
To investigate the effect of our phrase representation model, we compared it with a simply fine-tuned BERT using Equation (1). We also examined the [CLS] representation: BERT defines its input to begin with the special symbol [CLS], whose representation has been commonly used as a representation of a sentence pair (Devlin et al., 2019). The assumption here is that BERT may learn to embed information about the similarity distribution into the [CLS] representation.
As pre-trained models for generating phrase representations, we compared the fine-tuning approach with feature-based approaches, i.e. FastText (Bojanowski et al., 2017) and embeddings from language models (ELMo) (Peters et al., 2018). For all pre-trained models, we used mean-pooling to generate a basic phrase representation, which consistently outperformed max-pooling in our preliminary experiments.
Our attention mechanism had eight heads; the other settings were the same as those of the Transformer (Vaswani et al., 2017). Dropouts of 10% and 50% were applied to the BERT and ELMo outputs, respectively, as recommended in their papers. The CNN had a kernel size of three in the convolution layer and two in the pooling layer. The SimMatrix was padded with zeros for sentences shorter than the maximum sequence length of 128.¹³ All models used AdamW (Loshchilov and Hutter, 2019) as the optimiser, with default settings except for the learning rate. We tuned a few hyper-parameters in our models to maximise the ALIF score on the development set of SPADE by grid search. The value of the null alignment cost λ∅ was searched in the range [0.05, 0.95] at intervals of 0.05, the margin δ in the loss function was searched in [0.2, 1.0] at intervals of 0.2, and the learning rate was chosen from among 1.0e−5, 3.0e−5, and 5.0e−5.

Training Settings
All experiments were conducted on an NVIDIA Tesla V100 GPU. We trained our phrase representation model using ESPADA. We simply used all phrase alignments by the three annotators, regarding all of them as equally reliable, i.e. each sentence pair has three sets of phrase alignments. We split the entire dataset into training and validation sets (90% and 10%, respectively) after randomly shuffling the sentence pairs, which prevents the same sentence pair from appearing in both sets. The batch size was 16. Training was terminated by validation-based early-stopping with patience 5 and minimum delta 0.005.
To alleviate the randomness effects in initialising the neural networks, we trained and evaluated the models 10 times with random seeds and report the means of the evaluation scores with 95% confidence intervals. Further, we tested the significance of differences in the means of the evaluation scores by the randomised test (Efron and Tibshirani, 1994). Throughout the paper, we present the best scores with a significance level of < 1% in a bold font.

¹¹ https://allennlp.org/ (version 0.9.0)
¹² https://networkx.github.io/ (version 2.4)
¹³ None of the sentence pairs exceeded this limit.

6 Experiment Results

Table 3 compares the methods' performance. BERT+SimMatrix+CTED (last row) includes the full feature set; it transforms the phrase representation using the SimMatrix representation and aligns phrases using CTED. This method performed the best overall, achieving an ALIF score of 87.4% with post-processing. This ALIF score is 95.7% of that achieved by humans (Table 2).

Overall Results
We investigated the non-compositional alignments produced by BERT+SimMatrix+CTED with post-processing. We found that 0.1% of alignment pairs did not satisfy the monotonicity condition and 1.2% of alignment triplets did not satisfy the familiness condition. These non-compositional alignments cover 3.5% and 23.2% of those of the gold standard that did not satisfy the monotonicity and familiness conditions, respectively (as shown in Table 1).

Effect of CTED Algorithm and Post-Processing
The middle and last sets of rows compare CTED-based and thresholding-based alignments. Thresholding-based alignment greedily aligns phrases, disregarding compositionality. In contrast, pure CTED-based alignment only allows compositional alignments and makes all non-compositional alignments null. Even though CTED is much stricter than thresholding, it achieved competitive ALIF scores. The scores of CTED-based alignment further improve when non-compositional alignments are allowed by post-processing; ALIR, ALIP, and ALIF improved by 2.2, 3.4, and 2.8 percentage points on average, respectively.

Effect of Phrase Representation Model
The last set of rows shows the performance of alignments by CTED with different phrase representation approaches. BERT+SimMatrix+CTED significantly outperformed BERT+CTED and BERT+[CLS]+CTED. The superiority of the SimMatrix representation over [CLS] was more pronounced on alignments with post-processing. Although the ALIF of BERT+[CLS]+CTED with post-processing achieved 94.4% of the human score, the SimMatrix representation further improved it by 1.2 percentage points.
These results indicate that a phrase representation that explicitly models the similarity distribution is crucial for handling non-compositional alignments. We conjecture that the SimMatrix representation has two effects in phrase alignment. First, it encourages a null alignment in CTED when there is a more similar phrase beyond the local scope. That is, it implicitly relaxes the syntactic constraint when composing compositional alignments, which could otherwise be too restrictive to handle non-compositional alignments. Second, the SimMatrix representation allows the post-processing to find a globally plausible alignment pair considering the entire similarity distribution.

Figure 4 presents ALIR and ALIP by the cost of null alignment λ∅ for BERT+SimMatrix+CTED with and without post-processing. A small λ∅ causes the method to align only a small number of phrases and produce a large number of null alignments. In contrast, a large λ∅ confuses the method by allowing a larger number of possible alignments. Both situations are harmful, but the former has a larger impact. This is because the constraint of CTED only allows a legitimate set of phrase alignments, which effectively prunes away incorrect alignments. Figure 4 empirically confirms that the post-processing is effective in improving ALIR and ALIP scores; these scores with post-processing were always higher than those without. The same trend was also observed for BERT+CTED and BERT+[CLS]+CTED. This occurs because our post-processing only allows non-compositional alignments of minimum cost. Hence, it also improves ALIR and ALIP scores when phrase representations are reliable.

Table 4 shows the effect on performance when CTED is combined with the feature-based approaches: FastText, ELMo, and BERT without fine-tuning.¹⁴ Specifically, we generated a phrase representation by simply mean-pooling token representations generated by these pre-trained models and aligned phrases by CTED or by thresholding.
Note that these methods behave deterministically owing to the absence of neural network training.

Effects on Feature-Based Approaches
BERT w/o fine-tuning+CTED achieved an ALIF score of 84.7% with post-processing, even though it only tunes the hyper-parameter λ∅. Although it scored lower than the proposed method (BERT+SimMatrix+CTED), the result is still encouraging for conducting phrase alignment in domains for which no corpora are available for training our phrase representation model. Improvements in ALIR, ALIP, and ALIF scores by CTED over thresholding were much greater with FastText than with ELMo or BERT; FastText showed average gains of 6.0 to 8.6 percentage points, whereas improvements ranged from −0.8 to 2.7 percentage points for ELMo and from 0.2 to 1.3 percentage points for BERT. The CTED algorithm constrains alignments by the syntactic structures, and FastText representations obviously do not retain such structural information. We conjecture that FastText-based alignment is compensated by the structural constraint imposed by CTED.

Discussion and Future Work
In contrast to previous methods, ours can align phrases not only in paraphrasal sentence pairs but also in partially paraphrasal pairs. We plan to apply it to a comparable corpus of partial paraphrases and investigate the performance, with the aim of creating a large-scale syntactic and phrasal paraphrase dataset. We intend to expand our method to conduct forest alignments, making it robust against parsing errors, which are inevitable in handling large corpora. Further, as our method does not restrict its input to syntactic trees but only assumes tree structures with arbitrary numbering (e.g. left-to-right post-order numbering), we intend to try alignments of chunk-based trees, which are desirable for applications that process text fragments, e.g. those that perform information extraction.

Appendices A Detailed Comparison with Previous Study
Arase and Tsujii (2017) include additional conditions that a legitimate set of compositional alignments should satisfy. One of these is the root-pair containment condition, which requires the root nodes of the trees to be aligned. This constraint restricts their method such that it can only handle a paraphrasal sentence pair as input. Our method, by contrast, can align any pair of sentences, i.e. not only paraphrasal sentences but also sentences that are only partially paraphrasal. Additionally, in their study, the familiness condition is replaced by the maximum set condition. The maximum set condition, together with the monotonicity condition, constrains all the lowest common ancestors (LCAs) of any pair of non-null alignments in H_L to ensure that they are aligned. That is, for all non-null alignments h_m, h_n ∈ H_L, ⟨τ^s_i, τ^t_i⟩ ∈ H_L, where τ^s_i = lca(τ^s_m, τ^s_n) and τ^t_i = lca(τ^t_m, τ^t_n). Owing to this constraint, their method belongs to the class of LCA-preserving distance mappings (Zhang et al., 1995), whose constraint is tighter than that of the constrained edit distance mapping. In phrase alignment, this forces the LCAs of two aligned nodes to be aligned as well, even when the majority of phrases under the LCAs are null alignments. By contrast, CTED allows such LCAs to have null alignments depending on the alignments of the descendant nodes.

B CTED Algorithm
Algorithm B.1 shows the CTED algorithm. For brevity, we denote the ith node in a tree as i and its child nodes as I = {i_1, ..., i_{n_i}}, where n_i is the number of children. The input trees are numbered; the numbers are determined by an arbitrary ordering of the nodes in the tree, such as left-to-right post-order numbering or left-to-right pre-order numbering. The algorithm first computes the minimum cost of H_{i,j}, the alignments of the forests rooted at nodes i and j (line 10). Then (line 11), it computes the minimum costs of ⟨i, j⟩, ⟨i, τ∅⟩, and ⟨τ∅, j⟩. In Equation (2), γ(H_{i,j}) is the summation of the alignment costs between the forests: γ(H_{i,j}) = Σ_{⟨u,v⟩ ∈ H_{i,j}} γ(u, v), where u or v is τ∅ for null alignments.
The algorithm searches for the H_{i,j} that has the minimum cost by solving the minimum cost maximum flow problem on a graph G(V, E), as shown in Figure 5. The vertex set consists of V = {s_0, s_t, τ^i_∅, τ^j_∅} ∪ I ∪ J, where s_0 and s_t are the start and sink nodes, respectively, and τ^i_∅ and τ^j_∅ are null nodes. Each edge in E has a cost and a capacity. Edges (s_0, i_k), (s_0, τ^i_∅), (j_l, s_t), and (τ^j_∅, s_t) have cost zero; (i_k, j_l) has cost D[i_k][j_l]; (τ^i_∅, j_l) has cost D[0][j_l]; (i_k, τ^j_∅) has cost D[i_k][0]; and (τ^i_∅, τ^j_∅) has cost zero. All the edges have capacity one except (s_0, τ^i_∅), (τ^i_∅, τ^j_∅), and (τ^j_∅, s_t), whose capacities are n_j, min(n_i, n_j), and n_i, respectively. Obviously, the maximum flow of G is n_i + n_j, and G is a network with integer capacities and non-negative costs. The minimum cost on the maximum flow of G is proven to agree with min_{H_{i,j}} γ(H_{i,j}) in Zhang (1996).
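For intuition, the quantity this flow network computes can be checked by brute force on tiny child forests: enumerate every way of matching some children of i to children of j and null-aligning the rest. This is an illustrative stand-in for the min-cost max-flow solver, feasible only for small forests; D is the table of precomputed subtree costs, with row/column 0 holding the null-alignment costs as in the text.

```python
import itertools

def min_forest_cost(I, J, D):
    """Minimum total cost of aligning child forests I and J.
    D[i][j] is the subtree alignment cost; D[i][0] and D[0][j] are the
    null-alignment costs of i and j, respectively."""
    # Baseline: null-align everything.
    best = sum(D[i][0] for i in I) + sum(D[0][j] for j in J)
    # Try every injective matching of k children of I to k children of J.
    for k in range(1, min(len(I), len(J)) + 1):
        for sub_i in itertools.combinations(I, k):
            for sub_j in itertools.permutations(J, k):
                matched = sum(D[i][j] for i, j in zip(sub_i, sub_j))
                nulls = (sum(D[i][0] for i in I if i not in sub_i)
                         + sum(D[0][j] for j in J if j not in sub_j))
                best = min(best, matched + nulls)
    return best
```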
Algorithm B.1 only shows the computation of the alignment cost for brevity. However, the corresponding edit operations, i.e. alignments, can be computed simultaneously in the same manner as the edit cost.

C Details of ESPADA Creation
To obtain paraphrasal sentence pairs to annotate, we sampled paraphrases from reference translations in the NIST OpenMT corpora, excluding sentences in SPADE. There are a variety of resources for constructing paraphrases, including reference translations (Weese et al., 2014), news texts (Dolan et al., 2004), and tweets (Lan et al., 2017). Arase and Tsujii (2018) discussed how paraphrases constructed from reference translations are authentic in the sense that they exhibit only paraphrastic phenomena, because they are constrained by the corresponding source sentences. By contrast, paraphrases extracted from other resources tend to have more diverse linguistic phenomena, such as additions and omissions of information and inferences requiring knowledge of the world.
First, we recruited a linguist who is also a native English speaker to annotate gold-standard syntactic trees on the paraphrases based on head-driven phrase structure grammar. Through this process, the linguist identified and discarded ungrammatical and/or non-paraphrasal pairs. The annotated trees were checked automatically for formatting, and the linguist corrected the annotations of trees with errors, such as trees with inconsistent bracketing. We then had three native or near-native English speakers annotate the phrase alignments.

Table 5: ALIR, ALIP, and ALIF scores for our phrase representation model when applied to feature-based models ('BERT w/o FT' stands for 'BERT without fine-tuning')

D Phrase Representation with Feature-Based Approaches

We also applied our phrase representation model to ELMo and BERT, using them as feature generators. We trained only the attention and CNN models using ESPADA. For ELMo, we also trained the scalar weighting parameters. Table 5 shows the results. Unfortunately, all of these methods are inferior to their counterparts that lack our phrase representation model: ELMo+CTED and BERT w/o fine-tuning+CTED, respectively. We conjecture that ESPADA may not be sufficiently large for training our phrase representation model to adapt to a pre-trained model that behaves in a completely independent manner. BERT's ability to adapt quickly to a specific task by fine-tuning is a notable advantage.