Generating Syntactic Paraphrases

We study the automatic generation of syntactic paraphrases using four different generation models: data-to-text generation, text-to-text generation, text reduction and text expansion. We derive training data for each of these tasks from the WebNLG dataset and we show (i) that conditioning generation on syntactic constraints effectively permits the generation of syntactically distinct paraphrases for the same input and (ii) that exploiting different types of input (data, text or data+text) further increases the number of distinct paraphrases that can be generated for a given input.


Introduction
The ability to automatically generate paraphrases (alternative phrasings of the same content) has been shown to be useful in many areas of Natural Language Processing such as question answering (Riezler et al., 2007), semantic parsing (Berant and Liang, 2014), machine translation (Kauchak and Barzilay, 2006; Zhou et al., 2006), sentence compression and sentence representation (Wieting et al., 2015). From a linguistic standpoint, the automatic generation of paraphrases is an important task in its own right as it demonstrates the capacity of NLP techniques to handle a key feature of natural language.
In this paper, we focus on the automatic generation of syntactic paraphrases, that is, texts which share the same meaning but differ in their syntax. Our work makes the following contributions. We show that conditioning text generation on syntactic information permits generating distinct syntactic paraphrases for the same input. We provide a systematic exploration of how different types of generation tasks impact paraphrasing and show that exploiting different types of input permits increasing the number of paraphrases produced for a given input. We make available four training corpora for syntactically constrained data- and text-to-text generation, text expansion and text reduction.

Related Work
Previous work on paraphrase generation falls into three main groups. Based mainly on monolingual data, earlier approaches use data-driven (Lin and Pantel, 2001), grammar- or thesaurus-based methods (Madnani et al., 2007; McKeown, 1983; Hassan et al., 2007; Kozlowski et al., 2003; Quirk et al., 2004; Zhao et al., 2008). In contrast, the pivot-based approach exploits bilingual data and machine translation methods to extract and generate paraphrases (Callison-Burch, 2008; Ganitkevitch and Callison-Burch, 2014). Finally, neural approaches build upon the encoder-decoder architecture to learn paraphrase generation models (Prakash et al., 2016; Mallinson et al., 2017). Prakash et al. (2016) use a stacked residual LSTM network, with residual connections between LSTM layers, and show that their model outperforms sequence-to-sequence, attention-based and bi-directional LSTM models on three datasets (PPDB, WikiAnswers and MSCOCO). Mallinson et al. (2017) introduce a neural model for multi-lingual, multi-pivot backtranslation and show that it outperforms a paraphrase model trained with a commonly used Statistical Machine Translation (SMT) system on three tasks, namely correlation with human judgments of paraphrase quality, paraphrase and similarity detection, and sentence-level paraphrase generation. Iyyer et al. (2018) also use backtranslation as a means to provide training data and, in addition, use syntax to control paraphrase generation: given a syntactic template T and an input sentence S, their model first generates a full syntactic parse P T; this parse is then used together with the input sentence to predict a syntactic paraphrase of S which realises the input syntactic template T.

Our approach is closest to Iyyer et al. (2018) but differs from it in that, instead of restricting paraphrase generation to a text rewriting problem, we explore how various sources of input impact the number and type of generated paraphrases. It also differs from the former two approaches (Prakash et al., 2016; Mallinson et al., 2017) in that we focus on syntactic paraphrases and condition generation on syntax. In that sense, our approach also shares similarities with recent models for controllable text generation (Hu et al., 2017; Semeniuta et al., 2017), which use variational autoencoders to model holistic properties of sentences such as style, topic and various other syntactic features. Our work is arguably conceptually simpler, focuses on syntactic paraphrases and introduces a new text production mode based on hybrid "data and text" input.

Table 1. Example outputs produced for the same input meaning by the D2T 5best, T2T 5best and ALL syn models.

D2T 5best:
Aktieselskab is the operating organisation for Aarhus Airport which has a runway name of "10R/28L" with a length of 2777.
The Aktieselskab is the operating organisation for Aarhus Airport which has a runway name of "10R/28L" with a length of 2777.
Aktieselskab is the operating organisation for Aarhus Airport which has a runway length of 2777 and is named "10R/28L".

T2T 5best:
Operated by Aktieselskab, Aarhus Airport has a runway length of 2777 metres and a runway name "10R/28L".
Operated by the Aktieselskab organisation, Aarhus Airport, has a runway length of 2777 metres and a runway named "10R/28L".
Aarhus Airport is operated by Aktieselskab. Its runway name is "10R/28L" and its length is 2777 metres.
Aarhus Airport, operated by Aktieselskab, has a runway length of 2777 with a name of "10R/28L".
Aktieselskab is the operation organisation of Aarhus Airport where the runway "10R/28L" with a length of 2777.
The Aktieselskab is the operating organisation of Aarhus Airport. This airport has a runway length of 2777 metres and a runway named "10R/28L".
The "10R/28L" runway at Aarhus Airport is 2777 in length, which is run by the operating organization of Aktieselskab.
The "10R/28L" runway at Aarhus Airport, operated by the Aktieselskab, is 2777 in length.

ALL syn:
Aarhus Airport has a runway length of 2777.0 metres and is operated by Aktieselskab. The name of the runway is "10L/28R".
Aarhus Airport is operated by Aktieselskab; its runway name is "10L/28R" and its runway length is 2777.0.
Aarhus Airport is operated by Aktieselskab, its runway name is "10R/28L" and has a length of 2777.
Aarhus Airport is operated by the Aktieselskab organisation. Its runway name is "10R/28L" and has a length of 2777.
Aarhus Airport, operated by Aktieselskab, has a runway length of 2777 and the runway name is "10R/28L".
Aarhus Airport, operated by Aktieselskab, has a runway length of 2777 with a name of "10R/28L".
Aarhus Airport, which is operated by the Aktieselskab organisation, has a runway that's 2777.0 long and is named "10L/28R".
Aktieselskab is the operation organisation of Aarhus Airport where the runway "10R/28L" with a length of 2777.
Aktieselskab is the operation organisation of Aarhus Airport, where the runway is named "10R/28L", with a length of 2777.
Aktieselskab is the operation organization for Aarhus Airport, where the runway is named "10L/28R", with a length of 2777.0.
Aktieselskab operates Aarhus Airport which has a runway that is 2777 meters long and the runway name "10R/28L".
Operated by Aktieselskab, Aarhus Airport, has a runway length of 2777 metres and a runway named "10R/28L".
Operated by Aktieselskab, Aarhus Airport, has a runway length of 2777 metres and is named "10R/28L".
Operated by the Aktieselskab organisation, Aarhus Airport has a runway that is 2777.0 metres long. It also has a runway with the name "10L/28R".
The "10R/28L" runway which is 2777 meters long is located in Aarhus Airport which is operated by the Aktieselskab organisation.

Table 2. Example inputs and outputs for text expansion (top) and text reduction (bottom); Ti is the input text, k the syntactic constraint, t the input triple and To the output text.

Ti: Madrid is part of Community of Madrid whose leader party is the Ahora Madrid. The Adolfo Suarez Madrid-Barajas Airport is located there.
k: possessive; t: (Madrid country Spain)
To: Adolfo Suarez Madrid-Barajas Airport is located in Madrid, part of the Community of Madrid in Spain where the leader party is Ahora Madrid.

Ti: Al Asad Airbase is located at "Al Anbar Province, Iraq" and operated by the United States Air Force. The base's runway is called "08/26" and is 3990 meters long.
k: subject relative; t: (Al Asad Airbase operatingOrganisation United States Air Force)
To: Al Asad Airbase (in "Al Anbar Province, Iraq") has a runway named "08/26" and a runway that is 3990 metres long.

Generating Syntactic Paraphrases
In order to generate syntactically distinct paraphrases, we formulate the generation task as a structured prediction task conditioned on both some input I and some syntactic constraint k. In this way, the same input I can be mapped to several outputs T i, each satisfying a different syntactic constraint k i. Table 1 shows some examples.
In addition, we consider different, semantically equivalent, sources of information. That is, we compare the paraphrases obtained when generating text from data, from text or from text and data. For the latter, we consider two subtasks, namely text expansion and text reduction. For each of these two tasks, the input is a text and a data unit. For text expansion, the output is a text verbalising both the input text and the input data. Conversely, for text reduction, the output is a text verbalising the input text minus the text verbalising the input data. Table 2 shows some example inputs and outputs for text expansion and text reduction.
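As a minimal sketch, expansion and reduction instances can be derived from any corpus that maps triple sets to texts, by pairing a triple set with a superset that contains exactly one extra triple. The data structures and toy corpus below are hypothetical illustrations, not the actual WebNLG preprocessing code:

```python
# Sketch of deriving text-expansion and text-reduction instances from a
# mapping of RDF triple sets to texts (toy data, illustration only).
corpus = {
    frozenset({("Al_Asad_Airbase", "runwayName", "08/26")}):
        'Al Asad Airbase has a runway named "08/26".',
    frozenset({("Al_Asad_Airbase", "runwayName", "08/26"),
               ("Al_Asad_Airbase", "operatingOrganisation",
                "United_States_Air_Force")}):
        'Al Asad Airbase, operated by the United States Air Force, '
        'has a runway named "08/26".',
}

def expansion_instances(corpus):
    """Input: (text for triple set T, extra triple t); output: text for T + {t}."""
    out = []
    for small, small_text in corpus.items():
        for big, big_text in corpus.items():
            # proper subset differing by exactly one triple
            if len(big) == len(small) + 1 and small < big:
                (extra,) = big - small
                out.append(((small_text, extra), big_text))
    return out

def reduction_instances(corpus):
    """The converse task: input (text for T + {t}, t); output text for T."""
    return [((big_text, extra), small_text)
            for (small_text, extra), big_text in expansion_instances(corpus)]
```

The same pivoting over subset-related triple sets yields both corpora, which is why the two tasks can share one data-derivation step.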

Training and Test Data
Training data. The WEBNLG dataset (Gardent et al., 2017) associates sets of RDF triples with one or more texts verbalising these sets of triples.
We derive training corpora for syntactically constrained generation from this dataset as follows.
We enrich the WEBNLG texts with labels indicating the syntactic structures realised by these texts by first parsing the texts and then using syntactic templates to identify the target structures occurring in them. We use the following list of syntactic labels: subject relative, object relative, sentence coordination, VP coordination, passive voice, apposition, possessive relative, pied piping, transitive clause, prepositional object, ditransitive clause, predicative clause.
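The template-matching step can be sketched as a search over bracketed constituency parses. The two patterns below (a "be + VBN" cue for passive voice and an "NP containing an SBAR headed by a WHNP" cue for subject relatives) are simplified stand-ins for the templates actually used:

```python
# Sketch: detect syntactic labels in a bracketed constituency parse.
# The cues below are simplified illustrations, not the paper's templates.

def parse_sexpr(s):
    """Parse a bracketed parse string into (label, children) tuples; leaves are words."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(i):
        label = tokens[i + 1]          # tokens[i] is "("
        i, children = i + 2, []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1
    tree, _ = read(0)
    return tree

def subtrees(t):
    yield t
    if isinstance(t, tuple):
        for c in t[1]:
            yield from subtrees(c)

def label(t):
    return t[0] if isinstance(t, tuple) else None

def has_subject_relative(tree):
    # cue: an NP directly containing an SBAR that starts with a WHNP
    for t in subtrees(tree):
        if label(t) == "NP":
            for c in t[1]:
                if label(c) == "SBAR" and c[1] and label(c[1][0]) == "WHNP":
                    return True
    return False

def has_passive(tree):
    # cue: a VP with a "be" auxiliary and an embedded VP headed by a VBN
    BE = {"am", "is", "are", "was", "were", "be", "been", "being"}
    for t in subtrees(tree):
        if label(t) != "VP":
            continue
        kids = [k for k in t[1] if isinstance(k, tuple)]
        has_be = any(label(k) in {"VBZ", "VBD", "VBP", "VB"}
                     and k[1] and k[1][0] in BE for k in kids)
        has_vbn = any(label(k) == "VP" and k[1] and label(k[1][0]) == "VBN"
                      for k in kids)
        if has_be and has_vbn:
            return True
    return False
```

Running each labeller over the parse of every WebNLG text yields the label set attached to that text.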
Based on the resulting syntactically enriched WEBNLG corpus, we then build four training corpora (T2T syn, TX syn, D2T syn, TR syn) using the sets of RDF triples as pivots to relate paraphrases. For data-to-text generation (D2T syn), the input is a linearised and delexicalised version of the set of RDF triples representing the meaning of the output text; for text-to-text generation (T2T syn), the input is a text; and for hybrid data-and-text-to-text generation (TX syn and TR syn), the input is a text and a linearised RDF triple.
For the text-to-text datasets, we additionally require that, for any corpus instance (k, T i, T o), T o differs from T i on exactly one syntactic label.
Test data. For any input (k, I) occurring in the test data, we ensure that (k, I) does not occur in the training data, where I is either a set of RDF triples, a text, or a text and an RDF triple.
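The pivot-based construction of the D2T and T2T corpora can be sketched as follows. The record format, the linearisation scheme and the reading of "differs on exactly one syntactic label" (at most one label differs on each side) are illustrative assumptions:

```python
from itertools import permutations

# Toy records: a triple set mapped to its verbalisations, each paired with
# the set of syntactic labels it realises (hypothetical data).
records = {
    frozenset({("Aarhus_Airport", "operatingOrganisation", "Aktieselskab")}): [
        ("Aarhus Airport is operated by Aktieselskab.", {"passive voice"}),
        ("Aktieselskab operates Aarhus Airport.", {"transitive clause"}),
        ("Aktieselskab is the operating organisation of Aarhus Airport.",
         {"predicative clause"}),
    ],
}

def d2t_instances(records):
    """Data-to-text: (linearised triples, target labels, output text)."""
    out = []
    for triples, texts in records.items():
        lin = " <TSP> ".join(" ".join(t) for t in sorted(triples))
        for text, labels in texts:
            out.append((lin, labels, text))
    return out

def t2t_instances(records):
    """Text-to-text: pairs of texts sharing a triple set whose label sets
    differ on exactly one syntactic label (one possible reading)."""
    out = []
    for _, texts in records.items():
        for (ti, ki), (to, ko) in permutations(texts, 2):
            if ki != ko and len(ki - ko) <= 1 and len(ko - ki) <= 1:
                out.append((ti, ko, to))
    return out
```

Because the triple set acts as the pivot, every pair of texts attached to the same set is a candidate paraphrase pair; the label filter then keeps only pairs exhibiting a controlled syntactic contrast.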

Experimental Setup
Models and Baselines. For each generation task, we aim to learn a model that maximises the likelihood P(T | I, k; θ) of a text T given some input I, some syntactic constraint k and model parameters θ. We use a simple encoder-decoder model with LSTM encoder and decoder, where the encoder is bidirectional and receives as input a sequence including both the input I and the syntactic constraint k.
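In practice, conditioning on k amounts to serialising the constraint into the encoder input sequence. The token scheme below (constraint token, entity placeholders, triple separator) is a hypothetical illustration of such a serialisation, not the exact format used:

```python
def linearise(triples):
    """Linearise and delexicalise a set of RDF triples (illustrative scheme):
    entities become numbered placeholders, triples are separated by <TSP>."""
    placeholders, seq = {}, []
    def delex(entity):
        if entity not in placeholders:
            placeholders[entity] = f"ENTITY-{len(placeholders) + 1}"
        return placeholders[entity]
    for subj, prop, obj in sorted(triples):
        seq += [delex(subj), prop, delex(obj), "<TSP>"]
    return seq, placeholders

def encoder_input(triples, constraint):
    """Full encoder input: a syntactic-constraint token followed by the
    linearised triples."""
    seq, mapping = linearise(triples)
    return [f"<{constraint}>"] + seq, mapping
```

Since the constraint is just another input token, the same sequence-to-sequence architecture serves all four tasks; only the content of the input sequence changes.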
We compare our models with the outputs produced by beam search when no syntactic constraint is applied. For D2T 5best, we take the five best outputs generated from the data. For T2T 5best, there may be several input sentences associated with the same meaning: we take the five best outputs for each of these sentences, hence T2T 5best may (and does) in fact yield more than five outputs per input meaning. Finally, ALL syn groups together all outputs generated by the four syntactically constrained models for a given meaning.

Implementation Details
We use the OpenNMT-py sequence-to-sequence model (Klein et al., 2017) with attention and a bidirectional LSTM encoder. The encoder and decoder have two layers. Models were trained for 13 epochs, with a mini-batch size of 64, a dropout rate of 0.3 and a word embedding size of 500. They were optimised with SGD with a starting learning rate of 1.0.
Evaluation. We assess both the linguistic/syntactic adequacy of the generated texts and the diversity of the paraphrases being generated.
Syntactic and Linguistic Adequacy (BLEU, Synt, BLEU syn). For the syntactically constrained models, given an input syntactic constraint k, the BLEU score is computed with respect to those references which satisfy k. In that way, the BLEU score indicates how close to the syntactic target the generated sentence is and therefore how well the model succeeds in generating the required syntactic constructs. As the number of references varies across inputs, we use BLEU at the sentence level (Papineni et al., 2002). In addition, we compute the proportion of outputs satisfying the input syntactic constraint (Synt) and the BLEU score for those outputs which satisfy it (BLEU syn). The number of outputs satisfying the input syntactic constraint is computed by first parsing the generated outputs and then applying the templates used for the automatic annotation of the training data.
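A minimal sentence-level BLEU with multiple references can be sketched as below; the add-one smoothing of the n-gram precisions is a simplifying assumption, not necessarily the exact smoothing used in the paper's evaluation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU against multiple references, with add-one
    smoothing on the clipped n-gram precisions (a simplified sketch)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngr = ngrams(cand, n)
        max_ref = Counter()                      # max count over all references
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngr.items())
        total = max(sum(cand_ngr.values()), 1)
        log_prec += math.log((clipped + 1) / (total + 1)) / max_n
    # brevity penalty against the closest reference length
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

Restricting the reference list to the references satisfying k, as described above, then yields the constrained BLEU; running the same computation over only the constraint-satisfying outputs yields BLEU syn.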
Diversity (Sim, #Txt/Mg). To measure the level of paraphrasing obtained, we group together inputs which share the same meaning (i.e., inputs that are linked in the WEBNLG dataset to the same set of RDF triples) and we compute the number of distinct texts generated per meaning (# Txt/Mg). We further analyse these sets by computing the average pairwise similarity (Sim) of the texts present in these sets. We use the Ratcliff/Obershelp algorithm (Black, 2004) to compute similarity. A low similarity indicates more diversity across the set of outputs sharing the same meaning.
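Python's standard library implements Ratcliff/Obershelp ("gestalt") matching in difflib.SequenceMatcher, whose ratio() returns 2*M/T for total matching-segment length M and combined length T, so the diversity measure can be sketched as:

```python
from difflib import SequenceMatcher
from itertools import combinations

def avg_pairwise_similarity(texts):
    """Average Ratcliff/Obershelp similarity over all unordered pairs of
    texts generated for one meaning; 0.0 for fewer than two texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```

Lower averages correspond to more syntactically and lexically diverse paraphrase sets.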
Human Evaluation (# SPar). For each model, we manually examined, for 50 meanings, a maximum of 10 randomly chosen outputs and recorded the average number (# SPar) of syntactically correct paraphrases per input. Table 3 summarises the results.

Results
Diversity. The results for ALL syn (aggregating all output texts generated for a given meaning) show that combining different generation models increases diversity (# Txt/Mg: 13.25, Sim: 0.61) while maintaining a good level of linguistic (BLEU: 62.87) and syntactic (Synt: 0.91) adequacy.
The human evaluation further shows that the distinct outputs generated by the ALL syn model are indeed syntactic, not purely lexical, variants. Table 1 shows some example output for ALL syn .
Expansion, Reduction and Generation. Interestingly, the text expansion and reduction models markedly improve on traditional T2T and D2T models both in terms of linguistic adequacy (higher BLEU score) and in terms of diversity (higher number of distinct outputs per meaning, lower similarity between texts generated from the same meaning). The comparison with T2T generation is particularly striking as the training data is 3 to 5 times larger for the T2T syn model than for the TX syn and the TR syn models respectively. Similarly, it is noticeable that although the T2T syn training corpus is 3 times larger than the D2T syn corpus, the T2T syn and D2T syn models show similar results. This is in line with results from Aharoni and Goldberg (2018), who show that rephrasing is a difficult task.
Linguistic Adequacy. Overall the linguistic adequacy of the syntactically constrained models is high with a BLEU score with respect to a single reference ranging from 46.20 (D2T syn ) to 83.87 (TX syn ). Moreover, the generated sentences show close similarity with the reference sentence realising the input constraint (BLEU syn : from 48.16 to 89.32).
(The pairwise similarity is computed as sim(S1, S2) = 2 * match(S1, S2) / (len(S1) + len(S2)), where match(S1, S2), the sum over the matching segments m in overlap(S1, S2) of len(m), is the total length of the matching segments.)

While the baseline models underperform in terms of BLEU scores, the manual evaluation (# SPar/Mg) indicates that they in fact produce acceptable outputs. The low BLEU scores for these models are probably due to the fact that each output is evaluated against a single reference, while the dataset is constructed to maximise the number of paraphrases available for a given input.
Some Examples
Table 1 shows some example outputs illustrating the main differences between the D2T 5best, T2T 5best and ALL syn models. As these examples show, syntactically constrained generation (ALL syn) outputs a much larger number of paraphrases. The difference is due both to the fact that ALL syn groups together the outputs of four (syntactically driven) generation models and to the input syntactic constraint, which ensures greater diversity. Thus, in the example shown, ALL syn yields 15 paraphrases, each with strong syntactic differences, as summarised below.
Sentence Segmentation. The number of verb phrases, clauses and sentences used to verbalise the same input varies. One output text is made of two sentences and one VP coordination, another of three coordinated clauses, and a third of two coordinated clauses and a VP coordination.
Syntax. The same input property is realised by different syntactic structures. For instance, the property operatingOrganisation is alternatively realised by an active verb (operates), a passive verb (is operated by), a participial apposition (, operated by ...,), a subject relative (which is operated by), a nominal predicative construction (is the operation organization) and a preposed participial (Operated by ...,).
Word Order. The same content is verbalised using varying word and clause order. Thus the ALL syn outputs show four different ways of ordering the realisation of the three properties operatingOrganisation (oO), runwayLength (rL) and runwayName (rN) contained in the input, namely rL-oO-rN (once), oO-rN-rL (6 times), oO-rL-rN (6 times) and rN-rL-oO (once).
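The order signatures above can be recovered with a rough lexical heuristic: locate a cue word for each property in the output text and sort the properties by position. The cue lists below are hypothetical and deliberately crude (quoted runway names realised without the word "name", for instance, would be missed):

```python
# Sketch: recover the realisation order of input properties from an output
# text via simple lexical cues (hypothetical cue lists, illustration only).
CUES = {
    "oO": ["operat"],           # operatingOrganisation
    "rL": ["length", "long"],   # runwayLength
    "rN": ["name", "named"],    # runwayName
}

def property_order(text):
    """Return a signature such as 'oO-rL-rN' ordering the properties by the
    position of their first lexical cue in the text."""
    low = text.lower()
    positions = []
    for prop, cues in CUES.items():
        idxs = [low.find(c) for c in cues if low.find(c) != -1]
        if idxs:
            positions.append((min(idxs), prop))
    return "-".join(p for _, p in sorted(positions))
```

Tallying these signatures over a set of outputs yields the ordering counts reported above.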
By contrast, the baseline models output a much smaller range of syntactic paraphrases. The D2T 5best model is particularly weak: among the five best outputs it produces, only three are distinct, and all have almost identical syntax. The T2T 5best model produces more outputs (8 against 3 for the D2T 5best model and 15 for the ALL syn model). One reason for this is that, contrary to the D2T 5best model, which has a single input (namely a set of RDF triples), the T2T 5best model can have several inputs for the same set of RDF triples.

Conclusion
We have proposed new syntactically constrained models for text generation and shown that their use effectively supports the generation of syntactic paraphrases. In future work, we plan to investigate to what extent these methods can be used to support the automatic generation of grammar exercises.