Generating Paraphrases from DBPedia using Deep Learning

Recent deep learning approaches to Natural Language Generation mostly rely on sequence-to-sequence models. In these approaches, the input is treated as a sequence whereas, in most cases, the input to generation is a tree or a graph. In this paper, we describe an experiment showing how enriching a sequential input with structural information improves results and helps support the generation of paraphrases.

A basic feature of these approaches is that both the input and the output data are represented as sequences so that generation can be modeled using a Long Short-Term Memory (LSTM) model or a conditional language model. Mostly, however, the data taken as input by natural language generation systems is tree- or graph-structured, not linear.
In this paper, we investigate a constrained generation approach where the input is enriched with constraints on the syntactic shape of the sentence to be generated. As illustrated in Figure 1, there is a strong correlation between the shape of the input and the shape of the corresponding sentence. The chaining structure T1, where B is shared by two predications (mission and operator), will favour the use of a participial or a passive subject relative clause. In contrast, the tree structure T2 will favour the use of a new clause with a pronominal subject or a coordinated VP.
Using synthetic data, we explore different ways of integrating structural constraints in the training data. We focus on the following two questions.

Does structural information improve performance?
We compare an approach where the structure of the input and of the corresponding paraphrase is made explicit in the training data with one where it is left implicit. We show that training on a corpus which makes this information explicit improves the quality of the generated sentences.

Can structural information be used to generate paraphrases?
Our experiments indicate that training on corpora which make structural information explicit in the input data permits generating not one but several sentences from the same input. In this first case study, we restrict ourselves to input data of the form illustrated in Figure 2 (i.e., input data consisting of three DBPedia triples related by a shared subject (e p1 e1) (e p2 e2) (e p3 e3)) and explore different strategies for learning to generate paraphrases using the sequence-to-sequence model described in (Sutskever et al., 2011).
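To make the input representation concrete, the following sketch (not the authors' code) shows how three subject-sharing DBPedia triples can be flattened into a single token sequence of the kind a sequence-to-sequence model consumes; the triples and the space-separated token convention are illustrative assumptions.

```python
def linearise(triples):
    """Flatten [(subject, property, object), ...] into one token sequence."""
    tokens = []
    for subj, prop, obj in triples:
        tokens.extend([subj, prop, obj])
    return tokens

# Three triples sharing the subject Alan_Bean (illustrative example).
triples = [
    ("Alan_Bean", "occupation", "Test_pilot"),
    ("Alan_Bean", "mission", "Apollo_12"),
    ("Alan_Bean", "birthPlace", "Wheeler_Texas"),
]
print(" ".join(linearise(triples)))
```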

Training Corpus
To learn our sequence-to-sequence models for generation and to test our hypotheses, we build a synthetic data-to-text corpus which consists of 18,397 (data, text) pairs split into 11,039 pairs for training, 7,358 for development and 7,358 for testing.
We build this corpus by extracting data from DBPedia using SPARQL queries and by generating texts with an existing surface realiser. As a result, each training item associates a given input shape (the shape of the RDF tree from DBPedia) with several output shapes (the syntactic shapes of the sentences generated from the RDF data by our surface realiser). Figure 3 shows an example input and the corresponding paraphrases.

Data
RDF triples consist of (subject property object) tuples such as (Alan Bean occupation Test pilot). As illustrated in Figure 1, RDF data can be represented by a graph in which edges are labelled with properties and vertices with subject and object resources.
To construct a corpus of RDF data units which can serve as input for NLG, we retrieve sets of RDF triples from the DBPedia SPARQL endpoint.
Given a DBPedia category (e.g., Astronaut), we define a SPARQL query that searches for all entities of this category which have a given set of properties. The query then returns all sets of RDF triples which satisfy this query. For instance, for the category Astronaut, we use the SPARQL query shown in Figure 4. Using this query, we extract sets of DBPedia triples corresponding to 634 entities (astronauts).
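The query in Figure 4 is not reproduced in this text, so the following is only an illustrative sketch of a query builder of the general kind described: it assembles a SPARQL query retrieving all entities of a category that have a given set of properties. The `dbo:` prefix and the property names are assumptions, not the paper's actual query.

```python
def build_category_query(category, properties):
    """Build a SPARQL query for all entities of `category` having every
    property in `properties` (illustrative reconstruction only)."""
    where = [f"?s a dbo:{category} ."]
    selects = ["?s"]
    for i, prop in enumerate(properties, start=1):
        where.append(f"?s dbo:{prop} ?o{i} .")
        selects.append(f"?o{i}")
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        f"SELECT {' '.join(selects)} WHERE {{\n  "
        + "\n  ".join(where)
        + "\n}"
    )

print(build_category_query("Astronaut", ["occupation", "mission", "birthPlace"]))
```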

Text
To associate data with text, we build lexical entries for DBPedia properties and use a small handwritten grammar to automatically generate text from sets of DBPedia triples using the GenI generator (Gardent and Kow, 2007).
Lexicon. The lexicon is constructed semi-automatically by tokenizing the RDF triples and creating a lexical entry for each RDF resource. Subject and object RDF resources trigger the automatic creation of a noun phrase whose string is simply the name of the corresponding resource (e.g., John E Blaha, San Antonio, ...). For properties, we manually create verb entries and assign each property a given lexicalisation. For instance, the property birthDate is mapped to the lexicalisation was born on.
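A minimal sketch of this lexicalisation step: only the birthDate / "was born on" mapping comes from the text; the other entries in the mapping below are illustrative assumptions, and the realisation itself is of course handled by the grammar and generator, not by string concatenation as here.

```python
# Property-to-lexicalisation mapping. Only birthDate is from the paper;
# the other entries are assumed for illustration.
PROPERTY_LEX = {
    "birthDate": "was born on",
    "occupation": "worked as",
    "mission": "was a crew member of",
}

def np_from_resource(resource):
    """Subject/object resources become noun phrases: just the resource name."""
    return resource.replace("_", " ")

def lexicalise(triple):
    """Render one triple as a canonical clause (toy stand-in for GenI)."""
    subj, prop, obj = triple
    return f"{np_from_resource(subj)} {PROPERTY_LEX[prop]} {np_from_resource(obj)}"

print(lexicalise(("John_E_Blaha", "occupation", "Test_pilot")))
```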
Grammar. We use a simple Feature-Based Lexicalised Tree Adjoining Grammar which captures canonical clauses (1a), subject relative clauses (1b), VP coordination (1c) and sentence coordination (1d). Given this grammar, the lexicon described in the previous section and the RDF triples shown in (1a), the GenI generator generates the five verbalisations shown in (1b-f).
Figure 4: The SPARQL query to the DBPedia endpoint for the Astronaut corpus

To learn a sequence-to-sequence model that can generate sentences from RDF data, we use the neural model described in (Sutskever et al., 2011) and the code distributed by Google Inc.
We experiment with different versions of the training corpus.
Raw corpus (BL). This is our baseline system. Here, the model is trained on the corpus of (data, text) pairs as is; no explicit information about the structure of the output is added to the data.
Raw Corpus+Structure Identifier (R+I). Each input data is associated with a structure identifier corresponding to one of the five syntactic shapes shown in Figure 3.
Raw corpus+Infix Connectors (R+C). The input data is enriched with infix connectors where & specifies conjunction, parentheses indicate a relative clause and "." sentence segmentation. The last line in Figure 3 shows the R+C input for S1.
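The three input encodings above can be sketched as follows. The token conventions are assumptions: the text only specifies that "&" marks conjunction, parentheses a relative clause and "." sentence segmentation, so the exact placement of connectors and of the structure identifier is illustrative.

```python
def bl_input(triples):
    """Raw corpus (BL): the linearised triples, no structural annotation."""
    return " ".join(tok for t in triples for tok in t)

def ri_input(triples, shape_id):
    """R+I: the raw input prefixed with a structure identifier (S1..S5)."""
    return f"{shape_id} {bl_input(triples)}"

def rc_input(triples, connectors):
    """R+C: infix connectors ('&', '(', ')', '.') interleaved with triples."""
    parts = []
    for t, conn in zip(triples, connectors):
        parts.append(" ".join(t))
        if conn:
            parts.append(conn)
    return " ".join(parts)

triples = [
    ("Alan_Bean", "occupation", "Test_pilot"),
    ("Alan_Bean", "mission", "Apollo_12"),
]
print(ri_input(triples, "S1"))
print(rc_input(triples, ["&", "."]))
```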

Evaluation and Results
We evaluate the results by computing the BLEU-4 score of the generated sentences against the reference sentence. Table 1 shows the results.
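For concreteness, here is a minimal BLEU-4 implementation (single reference, uniform weights, no smoothing); a standard toolkit implementation was presumably used for the actual experiments.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(hypothesis, reference):
    """BLEU-4 with brevity penalty; returns 0.0 if any n-gram precision is 0."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, 5):
        hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        if total == 0 or overlap == 0:
            return 0.0  # unsmoothed: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / 4)

print(bleu4("Alan Bean worked as a test pilot",
            "Alan Bean worked as a test pilot"))
```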
The baseline and the R+I models yield very low scores. For the baseline model, this indicates that training on a corpus where the same input is associated with several distinct paraphrases makes it difficult to learn a good data-to-text generation model.
The marked difference between the R+I and the R+C model shows that simply associating each input with an identifier labelling the syntactic structure of the associated sentence is not sufficient to learn a model that predicts different syntactic structures for differently labelled inputs. Interestingly, training on a corpus where the input data is enriched with infix connectors giving indications about the structure of the associated sentence yields much better results.

Conclusion
Using synthetic data, we presented an experiment which suggests that enriching the data input to generation with information about the corresponding sentence structure (i) helps improve performance and (ii) permits generating paraphrases.

System   S1     S2     S3     S4     S5
BL        3.6    5.9    6.6    5.9    7.5
R+I       4.0    6.5    6.9    6.5    8.2
R+C      98.2   91.7   91.6   88.8   89.1

Table 1: BLEU-4 scores

Further work involves three main directions. First, the results obtained in this first case study should be tested for genericity. That is, the synthetic data approach presented here should be tested on a larger scale, taking into account input structures of different types (chaining vs. branching) and different sizes.
Second, the approach should be extended and tested on "real data", i.e., on a training corpus where the DBPedia triples used as input data are associated with sentences produced by humans and where there is, consequently, no direct information about their structure.
Third, we plan to investigate how various deep learning techniques, in particular, recursive neural networks, could be used to capture the correlation between input data and sentence structure.