Abstract Meaning Representation for Paraphrase Detection

Abstract Meaning Representation (AMR) parsing aims at abstracting away from the syntactic realization of a sentence and denoting only its meaning in a canonical form. As such, it is ideal for paraphrase detection, a problem in which one is required to specify whether two sentences have the same meaning. We show that naïve use of AMR in paraphrase detection is not necessarily useful, and turn to describe a technique based on latent semantic analysis in combination with AMR parsing that significantly advances state-of-the-art results in paraphrase detection for the Microsoft Research Paraphrase Corpus. Our best results in the transductive setting are 86.6% for accuracy and 90.0% for F_1 measure.


Introduction
Abstract Meaning Representation (AMR) parsing focuses on the conversion of natural language sentences into AMR graphs, aimed at abstracting away from the surface realizations of the sentences while preserving their meaning.
We make a first step towards showing that AMR can be used in practice for a task that requires identifying the canonicalization of language: paraphrase detection. In a "perfect world", using AMR to test for a paraphrase relation between two sentences should be simple. It would require finding the AMR parse of each of the two sentences, and then checking whether the parses are identical. Since AMR is aimed at abstracting away from the surface form which is used to express meaning, two sentences should be paraphrases only if they have identical AMRs. For instance, the three sentences
1. He described her as a curmudgeon,
2. His description of her: curmudgeon,
3. She was a curmudgeon, according to his description.
should result in the same AMR graph as shown in Figure 1.
Figure 1: AMR graph for "He described her as a curmudgeon", "His description of her: curmudgeon" and "She was a curmudgeon, according to his description". The root concept describe-01 has three outgoing edges: ARG0 to he, ARG1 to she and ARG2 to curmudgeon.
However, in practice, things are different. First, there are no known AMR parsers that really distil only the meaning in text. For example, predicates which have interchangeable meaning use different AMR concepts, and there are errors that exist because of the machine learning techniques that are used for learning the parsers from data. Moreover, even human annotations do not yield perfect AMRs, as the inter-annotator agreement reported in the literature for AMR is around 80% (Banarescu et al., 2013).
Second, meaning is often contextual, and it is not fully possible to determine the corresponding AMR parse just by looking at a given sentence. Entity mentions denote different entities in different contexts, and similarly predicates and nouns are ambiguous and depend on context. As such, one cannot expect to use AMR in the transparent way mentioned above to identify paraphrase relations. However, we demonstrate in this paper that AMR can be used in a "softer" way to detect such relations.
Evaluation of AMR parsers is traditionally performed using the Smatch score (Cai and Knight, 2013). However, Damonte et al. (2017) argue that more ad-hoc metrics can be useful for advancing AMR research. Paraphrase detection can be seen as a further benchmark for AMR parsers, highlighting their ability to abstract away from syntax and represent the core concepts expressed in the sentence. In order to advance research in AMR and its applications, it is important to have metrics that reflect the ability of AMR graphs to have an impact on subsequent tasks. In this work we therefore use two different AMR parsers, comparing them throughout all experiments.

Background
AMRs are rooted, edge-labeled, node-labeled, directed graphs. They are biased towards the English language and rely on PropBank (Kingsbury and Palmer, 2002) for the definition of the main events in the sentence. Nodes in an AMR graph represent events and concepts, while edges represent the relationships between them. Banarescu et al. (2013) state that AMRs are aimed at canonicalizing multiple ways of expressing the same idea, which could be of great assistance in solving the problem of paraphrase detection. However, this goal is not entirely achieved in practice, and it will take a long time for AMR parsers to mature and achieve such canonicalization. At the moment, for example, even a simple pair of sentences such as "the boy desires the cake" and "the boy wants the cake" would not be given the same canonical form by state-of-the-art AMR parsers.
While some researchers (Fodor, 1975) have doubted the practical possibility of canonicalizing language or finding identical paraphrases in English or otherwise, much work in NLP has been devoted to the problem of paraphrase identification (Mitchell and Lapata, 2010; Baroni and Lenci, 2010; Socher et al., 2011; Guo and Diab, 2012; Ji and Eisenstein, 2013) and, more weakly, finding entailment between sentences and phrases (Dagan et al., 2006; Bos and Markert, 2005; Harabagiu and Hickl, 2006; Lewis and Steedman, 2013). In this work, we use the AMRs parsed for given sentences as a means to extract useful information and train paraphrase detection classifiers on top of them.

Latent Semantic Analysis
Our work falls under the category of distributional methods for paraphrase detection (Turney and Pantel, 2010; Mihalcea et al., 2006; Mitchell and Lapata, 2010; Guo and Diab, 2012; Ji and Eisenstein, 2013), such as those based on latent semantic analysis (LSA, Landauer et al., 1998). The main principle behind this approach is to detect semantic similarity through distributional representations for a given sentence and its potential paraphrase, where these representations are compared against each other according to some similarity metric or used as features with a discriminative classification method (Mihalcea et al., 2006; Guo and Diab, 2012; Ji and Eisenstein, 2013).
LSA is indeed one of the main tools for obtaining such distributional representations for the problem of paraphrase detection. Most often, TF-IDF weighting has been used for building the sentence-term matrix, but Ji and Eisenstein (2013) have shown that a significant improvement can be achieved in detecting similarity if one re-weights the sentence-term matrix differently. Indeed, this is one of our main contributions: we build on previous work on LSA for paraphrase detection and propose a technique to re-weight a sentence-concept matrix based on the AMR graphs for the given sentences. More details on the use of LSA for paraphrase detection appear in Section 4.
Meaning representations are usually evaluated based on their compositionality (construction of a representation based on parts of the text in a consistent way), verifiability (ability to check whether a meaning representation is true in a given model of the world), unambiguity (ability to fully disambiguate text into the representation in a way that does not leave any ambiguity lingering), inference (the existence of a calculus that can be used to infer whether one meaning representation is logically implied by others) and canonicalization (the ability to map several surface forms, such as paraphrases, into a single unique meaning representation). In this paper, we evaluate AMR on its ability to canonicalize language through its assistance in deciding whether two sentences are paraphrases.
We note that this test is masked by the accuracy of the AMR parsers we use, which indeed do not always give fully correct predictions. The errors in our paraphrase detection that are due to the accuracy of the AMR parser are different from those which originate in an inherent difficulty of representing paraphrases using AMR because of the limitations of the formalism and the annotation guidelines that AMR follows.
We experiment with two AMR parsers for which a public version is available. The first is JAMR (Flanigan et al., 2014), which is a graph-based approach to AMR parsing. It works by performing two steps on the input sentence: concept identification and relation identification. The former discovers the concept fragments corresponding to spans of words in the sentence, while the latter finds the optimal spanning connected subgraph from the concepts identified in the first step. The concept identification step has quadratic complexity and the relation identification step is O(|V|^2 log |V|), with V being the set of nodes in the AMR graph.
The second is AMREager (Damonte et al., 2017), which is a transition-based parser that works by scanning the string left-to-right and building the graph as the scan proceeds. This transition-based system is akin to the dependency parsing transition system ArcEager of Nivre (2004), only without constraints that ensure that the resulting structure is a tree. In addition, there are operations that make the system create additional non-projective structures by checking after each transition step whether siblings should be connected together with an edge. The complexity of AMREager is linear in the length of the sentence. AMREager was extended to other languages (Damonte and Cohen, 2018), and we leave it for future work to test the utility of AMR for paraphrase detection in these languages.

Problem Formulation
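Let S be a set of sentences. We are given input data in the form of $(x_1^{(i)}, x_2^{(i)}, b^{(i)})$ for $i \in [n]$, where n is the number of training examples, $x_1^{(i)}, x_2^{(i)} \in S$ are the two sentences of the ith pair and $b^{(i)} \in \{0, 1\}$ indicates whether they are paraphrases of each other. The goal is to learn a classifier that tells, for unseen instances, whether the pair of sentences given as input are paraphrases of each other. We denote by [n] the set {1, . . . , n}.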

Latent Semantic Analysis for Paraphrase Detection
The first step in our approach is the construction of lower-dimensional representations for the sentences in the training data. We use latent semantic analysis to get the sentence representations, which are then used to detect paraphrases using a classifier. More specifically, given a set of sentences we build a sentence-term matrix T such that $T_{k\ell}$ indicates the use of the $\ell$th word in the kth sentence in S. The number of rows is the number of sentences in the dataset and the number of columns is the vocabulary size. This follows previous work on the use of LSA for paraphrasing (Guo and Diab, 2012; Ji and Eisenstein, 2013).
As a baseline, we experiment with two ways of assigning the values to the matrix:
• $T_{k\ell}$ is the count of the $\ell$th word in the kth sentence: $T_{k\ell} = \mathrm{count}(\ell, k)$.
• $T_{k\ell}$ is the term frequency-inverse document frequency (TF-IDF) for the kth sentence with respect to the $\ell$th word. TF-IDF is commonly used in Information Retrieval to score words in a document and combines the frequency of the words in a document with the rarity of the term across documents. With TF-IDF, in order to have a high score, a term must appear in this sentence and not in many others. In that case, $T_{k\ell}$ is defined in terms of $\mathrm{count}(\ell, k)$, the count of the $\ell$th word in the kth sentence, and $\mathrm{csent}(\ell)$, the number of sentences which contain the $\ell$th word.
The AMR-based systems of Section 5 build upon this by re-weighting $T_{k\ell}$ with terms depending on the AMRs of the sentences.
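The two baseline weightings can be sketched as follows. This is a minimal illustration, not the exact formula used in the experiments: it assumes the textbook TF-IDF form, count times log(n / csent), with n the number of sentences, and the function name `build_matrices` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def build_matrices(sentences):
    """Return (vocab, count_matrix, tfidf_matrix) for tokenized sentences."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: j for j, w in enumerate(vocab)}
    n = len(sentences)

    # counts[k][l]: number of occurrences of word l in sentence k.
    counts = [[0] * len(vocab) for _ in range(n)]
    for k, sent in enumerate(sentences):
        for w, c in Counter(sent).items():
            counts[k][index[w]] = c

    # csent[l]: number of sentences that contain word l.
    csent = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

    # Assumed textbook weighting: count * log(n / csent).
    tfidf = [[row[j] * math.log(n / csent[j]) for j in range(len(vocab))]
             for row in counts]
    return vocab, counts, tfidf


sentences = [
    "the boy wants the cake".split(),
    "the boy desires the cake".split(),
    "the girl sleeps".split(),
]
vocab, counts, tfidf = build_matrices(sentences)
```

Under this assumed weighting, a word such as "the" that occurs in every sentence receives weight zero, while "wants", which occurs in a single sentence, keeps a positive weight.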
For paraphrasing, previous work (Ji and Eisenstein, 2013) has also considered the transductive setting (Gammerman et al., 1998), which we also use in our experiments. In the transductive setting, S also includes the sentences on which we expect to perform the final evaluation for the purpose of learning the latent representations. Note that, in this case, the labels $b^{(i)}$ are not used in the process of constructing word representations. In the inductive setting, on the other hand, the sentences in the testing set are not included in training and we instead project them onto the learned latent space, using the LSA projection matrices, to find their representations.
Once we have constructed the matrix T, we perform truncated singular value decomposition (SVD) on it, such that $T \approx U \Sigma V^{\top}$, where $U \in \mathbb{R}^{k \times m}$, $V \in \mathbb{R}^{\ell \times m}$ and $\Sigma \in \mathbb{R}^{m \times m}$ is a diagonal matrix of singular values (here k and $\ell$ denote the number of sentences and the vocabulary size, respectively). The final sentence representations are the rows of the U matrix which range over the sentences and have m dimensions.
The output of this process is a function $f : S \to \mathbb{R}^m$ which attaches to each sentence a representation. The idea behind LSA is that this matrix decomposition will make semantically similar sentences appear close in the latent space, hence alleviating the problem of data sparsity and making it easier to detect when two sentences are paraphrases of each other.
Once we construct the sentence representations from the training data (either in the inductive or the transductive setting) we use the function f to map each pair of sentences $(x_1, x_2)$ from the training data to two vectors, $f(x_1) + f(x_2)$ and $|f(x_1) - f(x_2)|$ (where the absolute value is taken coordinate-wise), and then concatenate them into a feature vector $\phi(x_1, x_2)$, which is then used as input to a support vector machine (SVM) classifier (Ji and Eisenstein, 2013).
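The LSA step and the pair features can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions, not the experimental code: it assumes the two concatenated vectors are the sum and the coordinate-wise absolute difference of the sentence representations, as in Ji and Eisenstein (2013), and it assumes the standard LSA fold-in (multiplying by V and dividing by the singular values) for projecting unseen sentences in the inductive setting. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def lsa_fit(T, m):
    """Truncated SVD of the sentence-term matrix T, keeping m dimensions."""
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # U: sentences x m, sigma: m singular values, V: terms x m.
    return U[:, :m], s[:m], Vt[:m].T


def lsa_project(t_new, sigma, V):
    """Fold an unseen sentence row into the latent space (inductive setting).

    Assumption: the standard LSA fold-in, u = t V Sigma^{-1}.
    """
    return (t_new @ V) / sigma


def pair_features(f1, f2):
    """Concatenate the sum and the coordinate-wise absolute difference."""
    return np.concatenate([f1 + f2, np.abs(f1 - f2)])


# Toy sentence-term matrix: 3 sentences, 4 terms.
T = np.array([
    [1.0, 1.0, 0.0, 2.0],
    [1.0, 0.0, 1.0, 2.0],
    [0.0, 1.0, 1.0, 0.0],
])
U, sigma, V = lsa_fit(T, m=2)
phi = pair_features(U[0], U[1])  # 2m-dimensional input for the classifier
```

Folding a training row back in with `lsa_project` recovers the corresponding row of U, which is a convenient sanity check that the projection and the decomposition agree. The resulting vector `phi` would then be passed to an SVM.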

Abstract Meaning Representation Features
The main hypothesis tested in this work is that AMR can be useful in deciding whether two sentences are paraphrases of each other. We investigate two ways to use AMR information to better inform the classifier: similarity-based and LSA-based.

Graph Similarity and Bag of AMR Concepts
An obvious way to use AMR information is to compute the similarity between the two graphs and use the score as an additional feature. As a score we use Smatch, which computes the overlap in terms of recall, precision and F-score between two unaligned graphs by finding the alignment between the graphs that maximizes the overlap. The alignment step is necessary because in AMR multiple nodes can have the same label and arbitrary variable names are used to distinguish between them. Smatch is the standard metric to evaluate the overlap between AMR graphs. The score returned by Smatch is used as a single additional feature for the SVM.
The amount of overlap in the AMR nodes of the two graphs can also be a good indicator of whether the sentences are paraphrases of each other. To test this hypothesis, we extract the unordered sets of AMR nodes and use the Jaccard similarity coefficient as a feature. This is directly related to the concept identification step of the AMR parsing process, which is concerned with generating and labeling the nodes of the AMR graph. Concept identification is arguably one of the most challenging parts of AMR parsing, as the mapping between word spans and AMR nodes is not trivial (Werling et al., 2015). It is often considered the first stage in the AMR parsing pipeline and it is therefore reasonable to attempt using its intermediate results. We choose Jaccard as a metric for bag of concepts overlap following previous work in paraphrase detection (Achananuparp et al., 2008; Berant and Liang, 2014).
We note that while this approach of using AMR to detect paraphrases may sound plausible, it does not perform very well. As such, we use it as an AMR baseline and contrast it with the approach that makes use of PageRank with TF-IDF reweighting for LSA, as described next.
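The bag of concepts feature can be illustrated with a short sketch. The concept sets below are hand-written for illustration and are not actual parser output, and returning 1.0 for two empty sets is a convention assumed here.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient between two collections of labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical concept sets for two sentences (not real parser output).
concepts_1 = {"own-01", "sell-01", "chain", "date-entity", "1998"}
concepts_2 = {"buy-01", "sell-01", "date-entity", "1995", "1998"}
score = jaccard(concepts_1, concepts_2)  # 3 shared labels out of 7 distinct
```

Because only node labels are compared, no alignment between the two graphs is needed, in contrast to Smatch.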

PageRank and TF-IDF Reweighting for LSA
The main idea is to re-weight the LSA sentence-term matrix T (Section 4) according to a probability distribution over the AMR nodes, which we accomplish by means of PageRank (Page et al., 1999). The utility of re-weighting terms in the sentence-term matrix has been previously demonstrated (Turney and Pantel, 2010). PageRank is a method, originally developed for web pages, for ranking nodes in a graph according to their impact on other nodes. The algorithm works iteratively by adjusting at each iteration the score of each node based on the number and scores of the nodes that are connected to it, until convergence.
Prior to applying PageRank, we merge the two graphs by collapsing the concepts in the two graphs that have the same labels, similarly to Liu et al. (2015), as shown in Figure 2. We then compute the PageRank score for each node in the merged graph and multiply it by the corresponding frequency count of that concept in the sentence-term matrix. The graph merging step is necessary in order to ensure that overlapping concepts obtain high PageRank scores. The PageRank step applied to the merged graph ensures that this importance propagates to nearby nodes.
For a given graph G = (V, E), PageRank takes as input a list of edges between nodes, $E = \{(n_i, m_i)\}$, $\forall i = 0, \ldots, n$, $n = |E|$, and outputs a PageRank score PG(n) for each node n. The scores are obtained by solving a system of equations with respect to PG(·), expressed in terms of I(n), the input edges to node n, and l(m), the number of edges coming out of m.
For each concept of the merged AMR graph, we compute $T_{k\ell}$, the weight for the LSA matrix introduced in Section 4, as $T_{k\ell} = \mathrm{count}(\ell, k) \times PG(\ell, k)$, where $PG(\ell, k)$ is the PageRank of the $\ell$th concept for the kth sentence.
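The merging, ranking and re-weighting steps can be sketched as follows. This is a minimal sketch under several assumptions that are not taken from the text: it uses the common damped PageRank recurrence PG(n) = (1 − d)/|V| + d Σ_{m ∈ I(n)} PG(m)/l(m) with d = 0.85, it spreads the score of nodes without outgoing edges uniformly, and it identifies nodes purely by their label (so duplicate labels within one graph also collapse). The function names and the toy edge lists are hypothetical.

```python
def merge_graphs(edges_a, edges_b):
    """Merge two graphs given as (source_label, target_label) edge lists.

    Nodes are identified by label, so nodes sharing a label collapse.
    """
    return sorted(set(edges_a) | set(edges_b))


def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; damping and dangling handling are assumptions."""
    nodes = sorted({n for e in edges for n in e})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Score held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new[dst] += damping * rank[src] / len(out_links[src])
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


def reweight(concept_counts, rank):
    """Multiply each concept count by the PageRank score of that concept."""
    return {c: cnt * rank.get(c, 0.0) for c, cnt in concept_counts.items()}


# Hand-simplified toy graphs (not real parser output).
edges_a = [("own-01", "Yucaipa"), ("sell-01", "Yucaipa"),
           ("sell-01", "date-entity"), ("date-entity", "1998")]
edges_b = [("buy-01", "Yucaipa"), ("sell-01", "Safeway"),
           ("sell-01", "date-entity"), ("date-entity", "1998")]
merged = merge_graphs(edges_a, edges_b)
rank = pagerank(merged)
weights = reweight({"sell-01": 1, "Yucaipa": 1, "1998": 1}, rank)
```

In this toy example the shared node "Yucaipa" receives score from three predicates, so it ends up ranked above "Safeway", which appears in only one of the two graphs.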
As a baseline for the PageRank system, the TF-IDF re-weighting scheme, as described in Section 4, is also used to re-weight the AMR concepts.

Experiments
We now describe the experiments that we devised to discover whether AMR is useful for paraphrase detection. For AMR parsing, we used the JAMR version published for SemEval 2016 (Flanigan et al., 2016), reporting 0.67 Smatch score on LDC2015E86, and the first and only version available for AMREager, obtaining 0.64 Smatch score on the same dataset. First, we discuss experiments where the AMRs are used as a means to extract additional sparse features for an SVM classifier. Then we turn to LSA to construct a representation of the sentence based on the re-weighting of the AMR nodes achieved through either PageRank or TF-IDF. Results show that the latter, which builds on state-of-the-art systems for this task, is a much more promising approach. Finally, we analyze how performance changes as a function of the number of dimensions used in the truncated matrix.
For evaluation, we use the Microsoft Research Paraphrase Corpus (Dolan et al., 2004). We use 70% of the dataset as training data and 30% as a test set. The total number of sentence pairs in the corpus is 5,801.

Graph Similarity and Bag of AMR Concepts
The bag-of-words (BOW) baseline consists of an SVM that takes into account one single feature: the Jaccard score between the BOW representations for the two sentences, i.e., binary vectors indicating whether each word in the vocabulary is used or not. The use of the single Jaccard feature means that for the linear kernel we just learn a threshold on the score. We note that the addition of the similarity-based features does not suffice to outperform the BOW baseline, as described in Table 1. Unlike Smatch, the bag of concepts feature does not need to find a (possibly wrong) alignment between the two graphs because it considers the node labels only. Interestingly, the addition of the bag of concepts feature is beneficial only for AMREager. It is indeed worth noting the different behaviors of the two parsers: when using the Smatch score only, JAMR reports slightly higher numbers than AMREager. However, when using the bag of concepts feature too, AMREager is considerably better than JAMR, which is unexpected as the concept identification performance of the two parsers is reported to be identical (Damonte et al., 2017). There is also some variability with the kernel used for the SVM classifier. The polynomial kernel does consistently better than the RBF and linear kernels. This means that a low-order interaction between the sentence representations does exist (when trying to determine whether they are paraphrases), but a higher-order interaction, such as the one implied by RBF, need not be modeled.
Figure 2: Visualization of the graph merging procedure for the sentences "Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion." (above) and "Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998." (below). The "date-entity", "sell-01" and "1998" nodes in the two AMR graphs on the left are merged in the resulting graph on the right.

PageRank and TF-IDF Reweighting for LSA
We now turn to experiments involving LSA as a means to represent the candidate paraphrases. In this set of experiments, the baseline consists of using TF-IDF to weight the bag of words in the sentence-term matrix.
We first try to replace the bag of words with the bag of concepts from the AMR graphs, also re-weighted by TF-IDF. Then, we also replace TF-IDF with PageRank, as it is better suited than TF-IDF to re-weighting graph structures. We report experiments for both the inductive setting and the transductive setting (Table 3). Our first finding is that, regardless of the parser, AMR is very helpful in the transductive setting while it is harmful in the inductive setting. When using bag of words, it is easy to project sentences of the test set into the latent space learned on the training set only. However, our experiments indicate that this is not as easy with the AMR concepts produced by the two parsers. On the other hand, when the latent space is learned using also the sentences in the test set, the abstractive power of AMRs is helpful for this task. In the inductive setting, PageRank fails to improve over the TF-IDF scheme and neither of them outperforms the BOW baseline. AMREager outperforms JAMR in this case. In the transductive case, the AMRs provided by JAMR are helpful with both TF-IDF and PageRank, while the graphs provided by AMREager give good results only for the PageRank scheme. The best result is achieved with JAMR, PageRank and a linear kernel for the SVM classifier.
Table 1: Baseline results for paraphrase detection with AMR and with bag-of-words (BOW). "linear," "poly" and "rbf" denote the kernel which is used with a support vector machine classifier. "Smatch" denotes the use of the additional graph similarity feature and "BOC" the use of the additional Jaccard score on the bag of concepts. Best result in each column is in bold.
We wanted to test in our experiments whether the same gains that are achieved with AMR parsing can also be achieved with just a syntactic parser. To test that, we parsed the paraphrase dataset with a dependency parser and reduced the syntactic parse trees to AMR graphs (meaning, we represented the dependency trees as graphs by representing each word as a node and labeled dependency relations as edges). Figure 3 gives an example of such a conversion. As can be seen, the AMR-like representation for the dependency trees retains words such as determiners ("the"). It also uses a different set of relations, as reflected by the edge labels that the dependency parser returns.
We chose to do this reduction instead of directly building a classifier that makes use of the dependency trees to ensure we are conducting a controlled experiment in which we precisely compare the use of syntax for paraphrase detection against the use of semantics. Once the syntactic trees are converted to AMR graphs, the same code is used to run the experiments as in the case of AMR parsing, with both the PageRank and TF-IDF reweighting settings. We used the dependency parser from Stanford CoreNLP (Manning et al., 2014). The results are given in Table 3, under "dep." As can be seen, these results lag behind the bag-of-words model in the inductive case and the AMR models in the transductive case. This could be attributed to AMR parsers better abstracting away from the surface form than dependency parsers.
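The reduction of a dependency tree to an AMR-like graph (each word a node, each labeled dependency relation an edge) can be sketched as follows. The input format, the function name and the hand-written arcs are hypothetical and do not reflect the output format of any particular parser.

```python
def dependency_to_graph(tokens, arcs):
    """Turn dependency arcs into labeled edges between word nodes.

    tokens: list of words.
    arcs: (head_index, relation, dependent_index) triples; a head index
          of -1 marks the artificial root, which gets no node of its own.
    """
    edges = []
    for head, relation, dependent in arcs:
        if head < 0:
            continue
        edges.append((tokens[head], relation, tokens[dependent]))
    return edges


# Hand-written arcs for illustration (not actual parser output).
tokens = ["the", "boy", "wants", "the", "cake"]
arcs = [(-1, "root", 2), (2, "nsubj", 1), (1, "det", 0),
        (2, "dobj", 4), (4, "det", 3)]
edges = dependency_to_graph(tokens, arcs)
```

Unlike an AMR graph, the result keeps function words such as the determiner "the" as nodes, and its edge labels are syntactic relations rather than semantic roles.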

System                        acc.   F_1
Most common class             66.5   79.9
Mitchell and Lapata (2010)    73.0   82.3
Baroni and Lenci (2010)       73.5   82.2
Socher et al. (2011)          76.8   83.6
Guo and Diab (2012)           71.5   NR
Ji and Eisenstein (2013)

Table 3: LSA experiments in the inductive and transductive settings, with two different reweighting schemes: "PageRank" and "TF-IDF". "linear," "poly" and "rbf" denote the kernel for the SVM. "dep." denotes the use of syntactic parsing instead of semantic parsing.

Finally, we analyze how performance changes as a function of the number of dimensions used in the truncated matrix U (Section 4). More specifically, on the x axis of the plots we have m/l, where m is the number of columns in the truncated matrix and l the number of words in the vocabulary. The plot shows that the performance stays stable for inductive inference. With transductive inference, however, performance peaks when m is very close to the vocabulary size. This shows that, in order to achieve good results, it is not necessary to remove a large number of columns from the original sentence-term matrix. The plot gives us more evidence that the inductive setting is not ideal for the AMR-based approach. For the TF-IDF reweighting, the systems that show a considerably different behavior are JAMR with linear and RBF kernels, where we see clear peaks for the transductive case. For PageRank, the AMREager systems with linear and RBF kernels also follow this trend. In general the polynomial kernel is the one least affected by this variable. Table 2 shows that our best result for the transductive case, which we obtain with JAMR and PageRank, outperforms the current state of the art for paraphrase detection in the transductive setting. This is not true for the inductive case, confirming that the AMR-based LSA approach is better suited to the former setting.
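The "Most common class" row (accuracy 66.5, F_1 79.9) can be checked by hand, assuming the most common class is the positive (paraphrase) class. A classifier that labels every pair as a paraphrase has recall 1 and a precision equal to the fraction of positive pairs, which is also its accuracy. The function name below is hypothetical.

```python
def f1_all_positive(positive_rate):
    """F_1 of a classifier that predicts the positive class for every pair."""
    precision = positive_rate  # every prediction is positive
    recall = 1.0               # no positive pair is missed
    return 2 * precision * recall / (precision + recall)


accuracy = 0.665  # reported accuracy of the most-common-class baseline
f1 = f1_all_positive(accuracy)
f1_percent = round(100 * f1, 1)
```

With these inputs the computed F_1 is 2 × 0.665 / 1.665, which rounds to 79.9 and agrees with the reported value.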

Conclusion
We described an approach to incorporate the output of an AMR parser into the detection of paraphrases. Our method works by merging two graphs that need to be tested for a paraphrase relation, and then re-weighting a sentence-term matrix by the PageRank values of the nodes in the merged graph. We find that our method gives significant improvements over the state of the art in paraphrase detection in the transductive setting, showing that AMR is indeed helpful for this task. We further show that the inductive setting is instead not ideal for this type of approach.
We are encouraged by the results, and believe that paraphrase detection can also be used as a proxy test for the performance of an AMR parser: if an AMR parser is close to canonicalizing language, it should be of significant help in detecting paraphrase relations. In our experiments, the overall best result was achieved by JAMR. More generally, our results show that JAMR has been more helpful in the transductive setting and in the first set of experiments when using the Smatch score only, while AMREager wins the comparison in the inductive case as well as in the first set of experiments when using both the Smatch score and the bag of concepts score as additional features.