Analysing Data-To-Text Generation Benchmarks

A generation system can only be as good as the data it is trained on. In this short paper, we propose a methodology for analysing data-to-text corpora used for training Natural Language Generation (NLG) systems. We apply this methodology to three existing benchmarks. We conclude by eliciting a set of criteria for the creation of a data-to-text benchmark which could help better support the development, evaluation and comparison of linguistically sophisticated data-to-text generators.


Introduction
Recently, several data-sets associating data to text have been constructed and used to train, mostly neural, data-to-text surface realisers. (Lebret et al., 2016) built a biography data-set using Wikipedia articles and infoboxes. (Wen et al., 2016) crowd-sourced text for dialogue acts bearing on multiple domains (restaurant, laptop, car and TV). (Novikova and Rieser, 2016) present a data-set created by crowd-sourcing text and paraphrases from image-illustrated frames.
In this paper, we examine the adequacy of these data-sets for training high-precision, wide-coverage and linguistically sophisticated surface realisers. We focus on the following questions: Lexical richness: Is the data-set lexically varied? Domain-specific data may lead to highly repetitive text structures due to a small number of lexical items. Syntactic variation: Is the data-set syntactically varied and, in particular, does it include text of varied syntactic complexity? Semantic adequacy: For each data-text pair contained in the data-set, does the text match the information contained in the data? Linguistic adequacy: Language provides many ways of expressing the same content. How much do the available data-sets support the learning of paraphrases?
Using existing tools and metrics, we measure and compare the syntactic complexity and lexical richness of the three data-sets mentioned above. We provide some comparative statistics for size, number of distinct attributes and number of distinct data units. We also report on a human evaluation of their semantic adequacy (data/text match precision). Our analysis reveals different weak points for each of the data-sets. We conclude by eliciting a number of important criteria that should help promote the development of higher-quality data-to-text benchmarks for the training of linguistically sophisticated surface realisers.

Datasets
We examine the three data-sets proposed for data-to-text generation by (Lebret et al., 2016), (Wen et al., 2015; Wen et al., 2016) and (Novikova and Rieser, 2016). (Lebret et al., 2016)'s data-set (WIKIBIO) focuses on biographies and associates each Wikipedia infobox with the first sentence of the corresponding Wikipedia article. Thus in this data-set, texts and data (infoboxes) were authored by Wikipedia contributors. As the data-set is much larger than the other data-sets and is not domain specific, we extract three subsets of it for better comparison: two whose size is similar to the other data-sets (WIKIBIO 16317, WIKIBIO 2647) and one which is domain specific in that all biographies are about astronauts (WIKIBIOASTRO). The other two data-sets were created manually, with humans providing text for dialogue acts in the case of (Wen et al., 2015; Wen et al., 2016)'s data-sets (RNNLG Laptop, RNNLG TV, RNNLG Hotel, RNNLG Restaurant) and producing image descriptions in the case of (Novikova and Rieser, 2016)'s data-set (IMAGEDESC).
We also include a text-only corpus for comparison with the texts contained in our three data-sets. This corpus (GMB) consists of the texts contained in the Groningen Meaning Bank (Version 1.0.0, (Basile et al., 2012)) and covers different genres (e.g., news, jokes, fables). Table 1 gives some descriptive statistics for each of these three data-sets. It shows marked differences in terms of size (WIKIBIO 16317 being the largest and IMAGEDESC the smallest), number of distinct relations (from 16 for IMAGEDESC to 2367 for WIKIBIO 16317) and average number of paraphrases (15.11 for IMAGEDESC against 1 to 3.72 for the other two data-sets). The number of distinct meaning representations (semantic variability) also varies widely (from 77 distinct MRs for the IMAGEDESC corpus to 12527 for RNNLG Laptop). Overall though, the main observation is that the number of distinct attributes is relatively small and that the high number of distinct meaning representations essentially stems from various combinations of a restricted number of attributes.

Linguistic and Computational Adequacy
Lexical Richness To assess the extent to which the three data-sets support the training of lexically rich surface realisers, we used the Lexical Complexity Analyser developed by (Lu, 2012), a system designed to automate the measurement of various dimensions of lexical richness. These include lexical sophistication (LS) and mean segmental type-token ratio (MSTTR). Table 2 summarises the results.
Type-token ratio (TTR) is a measure of diversity defined as the ratio of the number of word types to the number of words in a text. To address the fact that this ratio tends to decrease with the size of the corpus, Mean segmental TTR (MSTTR) is computed by dividing the corpus into successive segments of a given length and then calculating the average TTR of all segments.
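The definition of MSTTR given above can be illustrated with a minimal sketch. It is not the Lexical Complexity Analyser implementation: the function names, the default segment length of 50 tokens and the choice to discard a trailing incomplete segment are all assumptions made for illustration.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct word types over number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(tokens)) / len(tokens)


def msttr(tokens, segment_length=50):
    """Mean segmental TTR over successive, non-overlapping segments.

    Assumption: a trailing segment shorter than `segment_length` is dropped,
    so that every segment contributing to the mean has the same size.
    """
    segments = [
        tokens[start:start + segment_length]
        for start in range(0, len(tokens) - segment_length + 1, segment_length)
    ]
    if not segments:
        raise ValueError("text is shorter than one segment")
    return sum(ttr(segment) for segment in segments) / len(segments)


# Toy example (hypothetical tokens, not taken from any of the data-sets).
toy_tokens = "a b c d a a a a".split()
toy_score = msttr(toy_tokens, segment_length=4)
```

In this toy example the first segment has four distinct types out of four tokens and the second has one, so the segment TTRs are 1.0 and 0.25 and their mean is 0.625, whereas the plain TTR of the whole sequence is 0.5.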
Overall, the WIKIBIO data-set, even when restricted to a single type of entity (namely, astronauts), has a higher MSTTR. This higher lexical variation is probably due to the fact that this data-set also has the highest number of attributes (cf. Table 1): more attributes bring more diversity and thus better lexical range. Indeed, there is a positive correlation between the number of attributes in the data-set and MSTTR (Spearman's rank correlation coefficient rho = +0.385).
Lexical sophistication, also known as lexical rareness, measures the proportion of relatively unusual or advanced word types in the text. In practice, LS is the proportion of lexical word types (lemmas) which are not in the list of the 2,000 most frequent words generated from the British National Corpus. Again, the WIKIBIO data-set has a markedly higher level of lexical sophistication than the other data-sets. This might be because the WIKIBIO texts are not crowd-sourced but edited independently of the input data, thereby leaving more freedom to the authors to include additional information. It may also result from the fact that the WIKIBIO data-set, even though it is restricted to biographies, covers a much more varied set of domains than the other data-sets: people's lives may be very diverse and, consequently, a more varied range of topics may be mentioned than in a domain-restricted data-set.
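For readers unfamiliar with the statistic, Spearman's rank correlation can be sketched as the Pearson correlation of the two rank vectors. The code below is a minimal illustration under that standard definition, with tied values receiving their average rank; the function names and the toy values are hypothetical and are not the figures behind the coefficient reported above.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values: attribute counts and MSTTR scores for three corpora.
toy_attributes = [10, 20, 30]
toy_msttr = [0.50, 0.70, 0.60]
toy_rho = spearman_rho(toy_attributes, toy_msttr)
```

With these toy values the ranks are (1, 2, 3) and (1, 3, 2), which gives rho = 0.5.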
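The LS measure described above reduces to a set computation once the text has been lemmatised and restricted to lexical words. The following sketch assumes that this preprocessing has already been done; the function name is hypothetical and the tiny frequent-word set is a stand-in for the British National Corpus list, which is not reproduced here.

```python
def lexical_sophistication(lexical_lemmas, frequent_words):
    """Proportion of lexical word types that are absent from a frequency list.

    `lexical_lemmas` is assumed to contain only lemmatised lexical words;
    `frequent_words` is assumed to be a set of lower-cased frequent lemmas.
    """
    types = {lemma.lower() for lemma in lexical_lemmas}
    if not types:
        raise ValueError("no lexical word types to analyse")
    sophisticated = {t for t in types if t not in frequent_words}
    return len(sophisticated) / len(types)


# Toy stand-in for the frequency list (an assumption, not the BNC list).
toy_frequent = {"be", "fly"}
toy_lemmas = ["astronaut", "be", "astronaut", "mission", "fly"]
toy_ls = lexical_sophistication(toy_lemmas, toy_frequent)
```

Here four distinct types occur and two of them ("astronaut" and "mission") are missing from the toy list, so the score is 0.5. The measure counts types rather than tokens, which is why the repeated lemma contributes only once.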
Syntactic Diversity To support the training of surface realisers with wide syntactic coverage, a benchmark needs to show a balanced distribution of the various syntactic phenomena present in the target language. To compare the syntactic coverage of the three data-sets, we use the system for automatic measurement of syntactic complexity developed by (Lu, 2010). [...] these sub-trees based on the type of the syntactic constructions detected in it using a set of heuristics. Each sentence is then assigned to a syntactic level based on the scores assigned to the sub-trees it contains, as follows. If all sub-trees found in that sentence are assigned to level zero, the sentence is assigned to level 0; if one and only one non-zero level is assigned to one or more sub-trees, the sentence is assigned to that non-zero level; if two or more different non-zero scores are assigned to two or more of the sub-trees, the sentence is assigned to level 7. When evaluated against a gold standard of 500 sentences, the system achieves a precision of 93.2%, a recall of 93.2% and an F-Score of 93.2% (Lu, 2010).
The system uses the revised D-Level Scale proposed by (Covington et al., 2006), which consists of the following eight levels: (0) simple sentences, including questions; (1) infinitive or -ing complement with subject control; (2) conjoined noun phrases in subject position; conjunctions of sentences, of verbal, adjectival, or adverbial construction; (3) relative or appositional clause modifying the object of the main verb; nominalization in object position; finite clause as object of main verb; subject extraposition; (4) subordinate clauses; comparatives; (5) nonfinite clauses in adjunct positions; (6) relative or appositional clause modifying subject of main verb; embedded clause serving as subject of main verb; nominalization serving as subject of main verb; (7) more than one level of embedding in a single sentence. Figure 1 summarises the results for the various data-sets. There are several noticeable issues.
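The rule for combining sub-tree levels into a sentence level can be written down directly. The sketch below only encodes the three cases stated above; it does not reproduce the parsing or the heuristics that assign a level to each sub-tree, and the function name and input representation (a list of integer levels, one per sub-tree) are assumptions.

```python
def sentence_level(subtree_levels):
    """Combine per-sub-tree D-levels (integers 0..7) into one sentence level."""
    for level in subtree_levels:
        if level not in range(8):
            raise ValueError(f"invalid D-level: {level!r}")
    nonzero = {level for level in subtree_levels if level != 0}
    if not nonzero:
        # Every sub-tree is at level zero: a simple sentence.
        return 0
    if len(nonzero) == 1:
        # Exactly one non-zero level occurs, possibly on several sub-trees.
        return next(iter(nonzero))
    # Two or more different non-zero levels occur in the same sentence.
    return 7


# Hypothetical inputs illustrating the three cases.
simple = sentence_level([0, 0])
single = sentence_level([0, 3, 3])
mixed = sentence_level([2, 5])
```

The three toy calls return 0, 3 and 7 respectively. Note that level 7 can therefore be reached either because a sub-tree is itself scored 7 or because distinct non-zero levels co-occur.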
First, the proportion of simple texts (Level 0) is very high across the board (42% to 68%). In fact, in all data-sets but two, more than half of the sentences are of level 0 (simple sentences). In comparison, only 35% of the GMB corpus sentences are of level 0.
Second, levels 1, 4 and to a lesser extent level 3, are absent or almost absent from the data-sets. We conjecture that this is due to the shape and type of the input data. Infinitival clauses with subject control (level 1) and comparatives (level 4) involve coreferential links and relations between entities which are absent from the simple frames comprising the input data. Similarly, non-finite complements with their own subject (e.g., "John saw Mary leaving the room", Level 3) and relative clauses modifying the object of the main verb (e.g., "The man scolded the boy who stole the bicycle", Level 3) require data where the object of a literal is the subject of some other literal, i.e., data of the form $P_x(x)\,R_1(x, y)\,P_y(y)\,R_2(y, z)\,P_z(z)$. In most cases however, the input data consists of sets of literals predicating facts about a single entity, i.e., literals of the form $P_x(x)\,R_1(x, y_1) \ldots R_n(x, y_n)$ where x is the entity to be described.
Third, the choice of domain seems to impact syntactic complexity. Thus the two data-sets in the restaurant domain, although crowd-sourced using different methods, have a similar distribution. Conversely, the various RNNLG data-sets, although developed using the same method, display different distributions. The mean D-level also seems to be impacted by the length of the meaning representation. Thus, in the RNNLG data-sets where the length of the meaning representations (MR) varies (cf. Table 1), the Spearman's rank correlation between MR length and D-level is +0.9 with p=0.037.
Fourth, data-sets may be more or less varied in terms of syntactic complexity. It is in particular noticeable that, for the WIKIBIO data-set, three levels (1, 3 and 7) cover 84% of the cases. This restricted variety points to stereotyped text with repetitive syntactic structure. Indeed, in WIKIBIO, the texts consist of the first sentence of biographic Wikipedia articles, which typically are of the form "W (date of birth - date of death) was P", where P usually is an arbitrarily complex predicate potentially involving relative clauses modifying the object of the main verb (Level 3) and coordination (Level 7).
Semantic Adequacy A surface realiser ought to express all and only the content captured in the input data. We therefore investigate to what extent the data/text pairs of each data-set match this requirement by manually examining 50 input/output pairs [...] Conversely, it was annotated as "Additional" if the text contained information not present in the data and as "Missing" if the data contained information not present in the text. Each pair was independently rated by two annotators, resulting in a kappa score ranging between 0.566 and 0.691 depending on the data-set. The results shown in Table 3 highlight some important differences. While the RNNLG data-sets have a high percentage of correct entries (82% to 94%), the IMAGEDESC data-set is less precise (44% of correct matches). The WIKIBIO data-set does not contain a single example where data and text coincide. These differences can be traced back to the way in which each resource was created. In most of the cases (70%) annotators indicated both missing and additional content.
The WIKIBIO data-set is created automatically from Wikipedia infoboxes and articles, and semantic adequacy is not checked. As such, this data-set is ill-suited for training precise surface realisers. Moreover, its texts contain both missing and additional information. Due to this mismatch, it cannot be used to train joint models for content selection and surface realisation with standard neural techniques (Sutskever et al., 2014).
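As an illustration of how agreement between two annotators can be quantified, the sketch below computes Cohen's kappa. Which kappa variant underlies the scores reported above is not stated, so using Cohen's kappa here is an assumption; the function name and the toy annotations are hypothetical, and the label "Correct" is an illustrative name for pairs where data and text match.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for four data/text pairs (not the study's data).
annotator_1 = ["Correct", "Correct", "Missing", "Additional"]
annotator_2 = ["Correct", "Missing", "Missing", "Additional"]
toy_kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four pairs (observed agreement 0.75) while chance agreement is 5/16, which yields a kappa of 7/11, roughly 0.64.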
In the IMAGEDESC data-set, the texts are created from images using crowd-sourcing. It seems that this method, while enhancing variety, makes it easier for the crowd-workers to omit some information thereby resulting in several cases where the text fails to express all the information contained in the input data.

Conclusion
Using statistics, metrics and manual evaluation, we compared three surface realisation benchmarks. Unsurprisingly, our analysis shows that existing data-sets have restrictions (large proportion of simple sentences, absence of certain syntactic constructions, limited syntactic and semantic diversity, restricted number of attributes and of distinct inputs, stereotyped text, etc.). On a more positive note, it also suggests several key aspects to take into account when constructing a data-to-text data-set for the development, evaluation and comparison of linguistically sophisticated surface realisers. Lexical richness can be enhanced by including data from different domains, using a large number of distinct attributes and ensuring that the total number of distinct inputs is high. Wide and balanced syntactic coverage is difficult to ensure and probably requires input data of various sizes and shapes, stemming from different domains (biographic text, for instance, has limited syntactic variability). Semantic adequacy can be achieved almost perfectly using crowd-sourcing, which also facilitates the inclusion of paraphrases.