Category-Driven Content Selection

In this paper, we introduce a content selection method where the communicative goal is to describe entities of different categories (e.g., astronauts or monuments).


Introduction
With the development of the Linked Open Data framework (LOD, http://lod-cloud.net/), a considerable amount of RDF(S) data is now available on the Web. While this data contains a wide range of interesting factual and encyclopedic knowledge, the RDF(S) format in which it is encoded makes it difficult for lay users to access. Natural Language Generation (NLG) would provide a natural means of addressing this shortcoming. It would permit, for instance, enriching existing texts with encyclopedic information drawn from linked data sources such as DBPedia, or automatically creating a Wikipedia stub for an instance of an ontology from the associated linked data. Conversely, because of its well-defined syntax and semantics, the RDF(S) format in which linked data is encoded provides a natural ground on which to develop, test and compare NLG systems.
In this paper, we focus on content selection from RDF data where the communicative goal is to describe entities of various categories (e.g., astronauts or monuments). We introduce a content selection method which, given an entity, retrieves from DBPedia an RDF subgraph that encodes relevant and coherent knowledge about this entity. Our approach differs from previous work in that it leverages the categorial information provided by large-scale knowledge bases about entities of a given type. Using n-gram models of the RDF(S) properties occurring in the RDF(S) graphs associated with entities of the same category, we select, for a given entity of category C, a subgraph with maximal n-gram probability, that is, a subgraph which contains properties that are true of that entity, that are typical of that category and that support the generation of a coherent text.

Method
Given an entity e of category C and its associated DBPedia entity graph G_e, our task is to select a (target) subgraph T_e of G_e such that:
• T_e is relevant: the DBPedia properties contained in T_e are commonly (directly or indirectly) associated with entities of type C;
• T_e maximises global coherence: DBPedia properties that often co-occur for entities of type C are selected together;
• T_e supports local coherence: the set of DBPedia triples contained in T_e captures a sequence of entity-based transitions which supports the generation of locally coherent texts, i.e., texts such that the propositions they contain are related through shared entities.
To provide a content selection process which implements these constraints, we proceed in three main steps.
First, we build n-gram models of properties for DBPedia categories. That is, we define the probability of 1-, 2- and 3-grams of DBPedia properties for a given category.
Second, we extract entity graphs of depth four from DBPedia.
Third, we use the n-gram models of DBPedia properties and Integer Linear Programming (ILP) to identify subtrees of entity graphs with maximal probability. Intuitively, we select subtrees of the entity graph which are relevant (the properties they contain are frequent for that category), locally coherent (the tree constraints ensure that the selected triples are related by entity sharing) and globally coherent (the use of bi- and tri-gram probabilities supports the selection of properties that frequently co-occur in the graphs of entities of that category).

Building n-gram models of DBPedia properties.
To build the n-gram models, we extract from DBPedia the graphs associated with all entities of those categories up to depth 4. Table 1 shows some statistics for these graphs. We build the n-gram models using the SRILM toolkit. To experiment with various orders of n-gram information, we create, for each category, 1-, 2- and 3-grams of DBPedia properties.
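To make the estimation concrete, the sketch below shows a minimal, pure-Python version of the per-category property counts. The linearisation of each entity graph into a property sequence, the toy property names, and the counting code itself are illustrative assumptions only; the models used in the paper are built with the SRILM toolkit, not with this code.

```python
# Sketch: estimating per-category property n-gram statistics (illustrative only).
# Hypothetical input: for each entity of a category, the DBPedia properties found
# in its graph, linearised into a sequence. The paper builds its actual models
# with SRILM; this pure-Python count only illustrates what 1- and 2-gram
# estimates over properties capture.
from collections import Counter
from typing import List, Tuple

def property_ngram_counts(sequences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Return unigram and bigram counts over property sequences."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    return unigrams, bigrams

def unigram_probability(prop: str, unigrams: Counter) -> float:
    """P(p) = count(p, C) / total count of all properties for category C."""
    total = sum(unigrams.values())
    return unigrams[prop] / total if total else 0.0

# Toy usage with made-up property sequences for an Astronaut-like category.
astronaut_sequences = [
    ["birthDate", "birthPlace", "nationality", "mission"],
    ["birthDate", "birthPlace", "mission", "timeInSpace"],
]
uni, bi = property_ngram_counts(astronaut_sequences)
print(unigram_probability("birthPlace", uni))   # relative frequency of the property
print(bi[("birthDate", "birthPlace")])          # how often the pair occurs in sequence
```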

Building Entity Graphs.
For each of the three categories, we then extract from DBPedia the graphs associated with five entities, considering RDF triples up to depth two. Table 2 shows the statistics for each entity depending on the depth of the graph.
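As an illustration of what such an extraction might look like, the sketch below queries the public DBPedia SPARQL endpoint and expands an entity's outgoing triples breadth-first up to a fixed depth. The endpoint URL, the example entity, and the restriction to resource-valued objects are assumptions made for illustration; the paper does not describe its extraction procedure.

```python
# Sketch: extracting a depth-bounded entity graph from the DBPedia SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"  # assumed endpoint, not specified in the paper

def outgoing_triples(entity_uri: str):
    """Return (subject, property, object) triples with entity_uri as subject."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"SELECT ?p ?o WHERE {{ <{entity_uri}> ?p ?o }}")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(entity_uri, r["p"]["value"], r["o"]["value"]) for r in rows]

def entity_graph(entity_uri: str, depth: int = 2):
    """Breadth-first expansion of the entity's outgoing triples up to `depth`."""
    triples, frontier, seen = [], {entity_uri}, set()
    for _ in range(depth):
        next_frontier = set()
        for node in frontier - seen:
            seen.add(node)
            for s, p, o in outgoing_triples(node):
                triples.append((s, p, o))
                # Only follow objects that are themselves DBPedia resources.
                if o.startswith("http://dbpedia.org/resource/"):
                    next_frontier.add(o)
        frontier = next_frontier
    return triples

# Example: a depth-2 graph for one (hypothetically chosen) astronaut entity.
graph = entity_graph("http://dbpedia.org/resource/Buzz_Aldrin", depth=2)
```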

Selecting DBPedia Subgraphs
To retrieve maximally coherent subtrees of the DBPedia entity graphs, we use the following ILP model.
Representing tuples. Given an entity graph G_e for the DBPedia entity e of category C (e.g., Astronaut), for each triple t = (s, p, o) in G_e we introduce a binary variable x^p_{s,o} such that:

x^p_{s,o} = 1 if the triple (s, p, o) is selected, 0 otherwise.

Because we use 2- and 3-grams to capture global coherence (properties that often co-occur together), we also have variables for bigrams and trigrams of triples. For bigrams, these variables capture pairs of triples which share an entity (either the object of one is the subject of the other, or they share the same subject). So for each bigram of triples t_1 = (s_1, p_1, o_1) and t_2 = (s_2, p_2, o_2) in G_e such that o_1 = s_2, o_2 = s_1 or s_1 = s_2, we introduce a binary variable y_{t_1,t_2} such that:

y_{t_1,t_2} = 1 if the bigram of triples is preserved, 0 otherwise.

Similarly, there is a trigram binary variable z_{t_1,t_2,t_3} for each connected set of triples t_1, t_2, t_3 in G_e such that:

z_{t_1,t_2,t_3} = 1 if the trigram of triples is preserved, 0 otherwise.

Maximising Relevance and Coherence. To maximise relevance and coherence, we seek a subtree of the input graph G_e which maximises the following objective function:

max  Σ_{(s,p,o) ∈ G_e} P(p) · x^p_{s,o}  +  Σ_{t_1,t_2} T(t_1, t_2) · y_{t_1,t_2}  +  Σ_{t_1,t_2,t_3} T(t_1, t_2, t_3) · z_{t_1,t_2,t_3}

where P(p), the unigram probability of p in entities of category C, is defined as follows. Let T_C be the set of triples occurring in the entity graphs (depth 2) of all DBPedia entities of category C, let P_C be the set of properties occurring in T_C, and let count(p, C) be the number of times p occurs in T_C; then

P(p) = count(p, C) / Σ_{p' ∈ P_C} count(p', C)

and T(t_1, t_2) and T(t_1, t_2, t_3) are the 2- and 3-gram probabilities P(t_2 | t_1) and P(t_3 | t_1 t_2).

Consistency Constraints
We ensure consistency between the unary and the binary variables so that if a bigram (or trigram) is selected then so are the corresponding triples:

y_{t_1,t_2} ≤ x_{t_1} and y_{t_1,t_2} ≤ x_{t_2} (and similarly for the trigram variables), where x_{t_i} abbreviates x^{p_i}_{s_i,o_i}.    (1)

Ensuring Local Coherence (Tree Shape). Solutions are constrained to be trees by requiring that each object has at most one subject (Eq. 2) and that all triples are connected (Eq. 3):

∀o:  Σ_{(s,p,o) ∈ G_e} x^p_{s,o} ≤ 1    (2)

∀o ∈ X:  Σ_{(s,p,o) ∈ G_e} x^p_{s,o}  −  (1 / |X|) Σ_{(o,p,o') ∈ G_e} x^p_{o,o'}  ≥ 0    (3)

where X is the set of entities that occur in the solution (except the root node). Eq. 3 makes sure that if o has a child then it also has a head: its first part counts the selected triples in which o is the object (its head properties); its second part counts the selected children of o, which could be greater than 1, and is therefore normalised by |X| so that it is less than 1; the difference between the two must then be greater than or equal to 0.
Restricting the size of the resulting tree. Solutions are constrained to contain α triples: Σ_{(s,p,o) ∈ G_e} x^p_{s,o} = α.
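To make the overall optimisation concrete, the following sketch assembles a simplified version of the model with the PuLP solver library. The scoring dictionaries, the omission of the trigram variables, and the exact constraint encodings are our own illustrative choices under the description above, not the paper's formulation.

```python
# Sketch: a simplified version of the subtree-selection ILP, built with PuLP.
import pulp

def select_subtree(triples, root, p_uni, p_bi, alpha=5):
    """triples: list of (s, p, o) tuples; p_uni maps a triple to its unigram
    score P(p); p_bi maps a pair of triples to its bigram score T(t1, t2)."""
    prob = pulp.LpProblem("content_selection", pulp.LpMaximize)
    x = {t: pulp.LpVariable(f"x_{i}", cat="Binary") for i, t in enumerate(triples)}

    # Bigram variables for pairs of triples that share an entity.
    pairs = [(t1, t2) for t1 in triples for t2 in triples
             if t1 != t2 and (t1[2] == t2[0] or t2[2] == t1[0] or t1[0] == t2[0])]
    y = {pr: pulp.LpVariable(f"y_{i}", cat="Binary") for i, pr in enumerate(pairs)}

    # Objective: unigram relevance plus bigram coherence (trigram terms omitted here).
    prob += (pulp.lpSum(p_uni.get(t, 0.0) * x[t] for t in triples)
             + pulp.lpSum(p_bi.get(pr, 0.0) * y[pr] for pr in pairs))

    # Consistency: a selected bigram implies that both of its triples are selected.
    for t1, t2 in pairs:
        prob += y[(t1, t2)] <= x[t1]
        prob += y[(t1, t2)] <= x[t2]

    # Tree-shape and connectivity constraints over the graph's nodes.
    nodes = {n for t in triples for n in (t[0], t[2])}
    for o in nodes:
        incoming = pulp.lpSum(x[t] for t in triples if t[2] == o)
        outgoing = pulp.lpSum(x[t] for t in triples if t[0] == o)
        prob += incoming <= 1                       # each object has at most one head
        if o != root:                               # a non-root node with children needs a head
            prob += incoming - outgoing * (1.0 / len(nodes)) >= 0

    # Size restriction: the solution contains exactly alpha triples.
    prob += pulp.lpSum(x.values()) == alpha

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [t for t in triples if x[t].value() and x[t].value() > 0.5]
```

In this simplified encoding the selected triples are returned directly; scores would come from the per-category n-gram counts sketched earlier, and a trigram term would be added in the same way as the bigram one.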
Discussion
Table 3 shows content selections which illustrate the main differences between four models: a baseline model with uniform n-gram probability versus a unigram, a bigram and a 3-gram model. The baseline model tends to generate solutions with little cohesion between triples. Facts are enumerated, each of which ranges over distinct topics (e.g., birth date and place, place of study, status and death place). It may also include properties such as "source" which are generic rather than specific to the type of entity being described.
The 1-gram model is similar to the baseline in that it often generates solutions which are simple enumerations of facts belonging to various topics (birth place, nationality, place of study, rank in the army, space mission, death place). Contrary to the baseline solutions, however, each selected fact is strongly characteristic of the entity type.
The 2- and 3-gram models tend to yield more coherent solutions in that they often contain sets of topically related properties (e.g., birth date and birth place; death date and death place).

Conclusion
We have presented a method for content selection from DBPedia data which supports the selection of semantically varied content units of different sizes. While the approach yields good results, one shortcoming is that most of the selected subtrees are trees of depth 1 and, moreover, that trees of depth 2 have limited coherence. For instance, the 1-gram model generates the solution shown in Table 4, where the triples about England are not particularly relevant to a description of the Dead Man's Plack monument. More generally, bi- and tri-grams mostly seem to trigger the selection of pairs and triples of facts that are directly related to the target entity rather than chains of triples. We are currently investigating whether the use of interpolated models could help resolve this issue.
Another important point we are currently investigating concerns the creation of a benchmark for Natural Language Generation. Most existing work on data-to-text generation relies on a parallel or comparable data-to-text corpus.
To generate from the frames produced by a dialogue system, (DeVault et al., 2008) describe an approach in which a probabilistic Tree Adjoining Grammar is induced from a training set aligning frames and sentences, and is then used to generate with a beam search that ranks alternative expansions at each step using weighted features learned from the training data.
Creating such data-to-text corpora is, however, difficult, time consuming and not generic. Contrary to parsing, where resources such as the Penn Treebank succeeded in boosting research, natural language generation still suffers from a lack of a common reference on which to train and evaluate generators. Using crowdsourcing and the content selection method presented here, we plan to construct a large benchmark on which data-to-text generators can be trained and tested.