DART: Open-Domain Structured Data Record to Text Generation

We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.


Introduction
Automatically generating textual descriptions from structured data improves the accessibility of knowledge bases to lay users. Such applications include explaining data records to non-experts (Cawsey et al., 1997), writing sports news (Chen and Mooney, 2008), summarizing information in multiple documents (Fan et al., 2019), and generating dialogue responses (Wen et al., 2015).
While significant progress has been made in this field, there are still several issues with existing Data-to-Text datasets. First, they adopt a flat ontology structure of the data, such as slot-value pairs for data records (Lebret et al., 2016; Novikova et al., 2017b) or flat schema for tables (Wiseman et al., 2017; Chen et al., 2020a; Parikh et al., 2020). This flat structure is not powerful enough to encode the rich semantic relationships in the ontology of structured data, especially tables, whose representation can be further improved with this semantic knowledge. Second, some of the datasets focus on only a small number of domains or knowledge graphs, and therefore provide a limited number of predicates and data ontologies; for example, E2E (Novikova et al., 2017b) covers restaurants and WebNLG (Gardent et al., 2017) covers 15 categories from DBPedia. Furthermore, some of them have only loose alignments between data input and sentence, due to the nature of the task (Wiseman et al., 2017) or the automatic generation procedure (Vougiouklis et al., 2018; Elsahar et al., 2018).
* Now at Facebook AI.
To address some of these issues and to encourage further research in natural language generation from structured data, we introduce DART, a large and open-domain structured DAta-Record-to-Text generation corpus. The goal of DART is to harvest the diverse predicates that occur in Wikipedia tables, which are significantly richer than those defined in the domain-specific ontologies that E2E and WebNLG were built on (Table 2). We also introduce a novel tree ontology annotation approach on tables, which converts a flat table schema into a tree-structured semantic frame. The tree ontology reflects the core and auxiliary relations in the table schema, and naturally occurs across many domains. As a result, DART provides high-quality sentence annotations to tree-structured semantic frames extracted from various data sources, including WikiSQL (Zhong et al., 2017) and WikiTableQuestions (Pasupat and Liang, 2015), two open-domain question answering datasets, as well as E2E (Novikova et al., 2017b) and WebNLG (Gardent et al., 2017) (Figure 1). We evaluated several state-of-the-art data-to-text models on DART, and found that while these models achieve impressive performance on domain-specific datasets, their performance suffers on DART due to its open-domain nature and richer semantic structures.
Our contributions are as follows.
(1) We present a large and open-domain corpus for structured data record to text generation, annotated with tree ontologies converted from the table. This hierarchical input differentiates our corpus from existing data-to-text corpora.
(2) We benchmark several state-of-the-art data-to-text models to show that DART introduces new generalization challenges.
(3) We demonstrate that using DART for data augmentation improves the performance of existing models on the WebNLG 2017 dataset. We expect the results to generalize to other data-to-text datasets given the open-domain nature of DART.

DART Data Collection
As shown in Figure 1, DART is constructed from three different sources: (1) human annotation on Wikipedia tables from two table semantic parsing and question answering datasets, WikiSQL and WikiTableQuestions (§ 2.1), (2) automatic conversion of questions in WikiSQL to declarative sentences (§ 2.2), and (3) incorporation of existing datasets including WebNLG 2017 and Cleaned E2E (§ 2.3). After collecting the triple-set, sentence pairs from various data sources, we manually canonicalized the predicates and show that DART covers a broad range of topics (§ 2.4). Finally, we discuss the data split in § 2.5.

Tree Ontology and Sentence Annotation on Tables
Tables are a major source of structured data that contain a wealth of information complementary to text and knowledge graphs. We aim to collect triple-set, sentence pairs from open-domain Wikipedia tables. However, table schemas are flat, so they cannot be used directly to build subject-predicate-object triples that capture rich relationships in the data. As shown in Figure 2, we propose a two-stage annotation process that involves two groups of annotators: internal annotators and Amazon Mechanical Turk (https://www.mturk.com/) workers. In the first stage, skilled internal annotators specify the parent of every column header to construct a tree-structured ontology for each table. In the second stage, both internal and external annotators provide a sentential description of the highlighted cells in a row, which are automatically chosen based on the ontology.
Tree Ontology Annotation For each column in a given table, our internal annotators labeled its ontological parent. In Figure 2, for example, the annotator would provide the sequence {NULL, TEAM, STADIUM, STADIUM, TEAM} as the parent of each column: column TEAM has no parent, STADIUM has parent TEAM, and so on. In many cases, the relationship between a parent column and its child column can be conceptualized as a "has-a" relationship. For tables that are malformed or have duplicate or missing column names (as shown in Figure 5 of the Appendix), annotators either changed or added appropriate column names in order to fit these patterns. For each table we generate an ontology tree whose root is always [TABLECONTEXT]. This root node has either (1) one child node [TITLE], in the cases where the table title is the subject of the entire table, or (2) column header node(s) and a [TITLE] node as children, as shown in Figure 2. This is because in some tables, the table title itself is more appropriate as the root of the ontology tree (example shown in Figure 6 of the Appendix). In these cases, annotators assigned the special token [TITLE] as the parent of the relevant column nodes. For other tables, the title usually provides important context for understanding the table's rows (example shown in Figure 7 of the Appendix). In such cases, [TITLE] is made a child of [TABLECONTEXT] together with the column headers that are appropriate.
We evaluated the quality of the initial tree ontology annotation and made corrections with the following procedure: (1) reject and request corrections from the original annotators if the provided ontology is disconnected or contains a cycle, and (2) verify that all column headers appear as nodes in the tree. For many tables, the determination of an ontology is a subjective process with many "correct" answers; for example, swapping the positions of TEAM and CITY in the tree in Figure 2 produces an equally valid ontology for the referenced table.
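The two checks in this procedure can be sketched in a few lines of Python. This is a minimal illustration and not the authors' tooling: the function name `validate_ontology`, the parent-map representation, and the assumed parents of CAPACITY and CITY are all hypothetical.

```python
from typing import Dict, List, Optional

ROOT = "[TABLECONTEXT]"


def validate_ontology(headers: List[str],
                      parent: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of problems; an empty list means the annotation passes.

    `parent` maps each node to its parent. A parent of None (the annotator's
    NULL) or ROOT means the node hangs directly under the tree root.
    """
    problems = []
    # Check (2): every column header must appear as a node in the tree.
    for header in headers:
        if header not in parent:
            problems.append(f"header not in tree: {header}")
    # Check (1): walking up from any node must reach the root, which rules
    # out both cycles and disconnected pieces.
    for node in parent:
        seen = {node}
        current = parent[node]
        while current is not None and current != ROOT:
            if current in seen:
                problems.append(f"cycle involving: {node}")
                break
            if current not in parent:
                problems.append(f"disconnected: {node} (unknown parent {current})")
                break
            seen.add(current)
            current = parent[current]
    return problems


# Column names taken from the running example; the exact parents of
# CAPACITY and CITY are assumed for illustration.
example = {"TEAM": None, "STADIUM": "TEAM", "CAPACITY": "STADIUM", "CITY": "TEAM"}
print(validate_ontology(list(example), example))
```

An annotation that returns any problem would be sent back to the original annotator under step (1) of the procedure.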
If there are multiple ways to construct an ontology based on annotators' decisions of attribute relationships among column headers, we manually unify the annotations for similar tables (for example, tables about athletes in different sports). The ontologies exhibit a great deal of structural variety. Relevant statistics are summarized in Table 7 and Figure 3 of the Appendix.
Connected Component Extraction After we annotated the ontology, we automatically choose a subset of cells for a selected table row to form the triple set. Randomly selecting cells leads to poor quality annotation as the selected data could lack a subject, lack cohesion, or would require information not encoded in the ontology to form a coherent sentence. For example, in Figure 2, if only two nodes CITY and CAPACITY were highlighted then a coherent sentence cannot be produced as there is no direct logical relationship (functional dependency) between them. To solve these issues, instead of randomly selecting cells in a row, we extract connected components from the ontology.
The extracted components have two controllable properties: size and shape. To create variation in size, we randomly sampled the component size from [2, 5]. The shape is determined by two numbers: the number of sibling node pairs and the number of parent-child node pairs. Increasing the number of sibling node pairs creates a wider tree, while increasing the number of parent-child node pairs creates a deeper tree. We created a sliding scale between width and depth using an expansion parameter, p. With probability p, we recursively visit a child of the current node if it has children; otherwise, we move to a sibling if one exists. If p = 1, the search becomes a DFS, and if p = 0, it becomes a BFS. We found that randomly selecting p from 0.5 to 0.7 created a reasonable variation in extracted component shapes. This ensures a balance between breadth and depth of ontology coverage in the selected cells, and therefore supports the quality of the sentence annotation.
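The following Python sketch shows one way the expansion parameter p could steer the walk between depth and width. It is an illustration under stated assumptions rather than the authors' implementation: the function `extract_component`, the child-list representation of the tree, the rule of always taking the first unvisited candidate, and the fallback used when the walk gets stuck are all hypothetical choices.

```python
import random
from typing import Dict, List

# Hypothetical ontology for the running example (children of each node).
tree: Dict[str, List[str]] = {
    "TEAM": ["STADIUM", "CITY"],
    "STADIUM": ["CAPACITY"],
}


def extract_component(children: Dict[str, List[str]], start: str,
                      size: int, p: float, rng: random.Random) -> List[str]:
    """Grow a connected set of nodes from `start`, going deeper with probability p."""
    parent = {c: n for n, cs in children.items() for c in cs}
    component = [start]
    chosen = {start}
    current = start
    while len(component) < size:
        kids = [c for c in children.get(current, []) if c not in chosen]
        sibs = []
        if current != start:  # siblings of the start node would not be connected
            sibs = [s for s in children[parent[current]] if s not in chosen]
        if kids and (not sibs or rng.random() < p):
            nxt = kids[0]          # go deeper
        elif sibs:
            nxt = sibs[0]          # go wider
        else:
            # Assumed fallback: take any unvisited child of an already chosen node.
            frontier = [c for n in component
                        for c in children.get(n, []) if c not in chosen]
            if not frontier:
                break              # the tree under `start` is exhausted
            nxt = frontier[0]
        component.append(nxt)
        chosen.add(nxt)
        current = nxt
    return component


rng = random.Random(0)
size = rng.randint(2, 5)       # size sampled from [2, 5]
p = rng.uniform(0.5, 0.7)      # expansion parameter sampled from 0.5 to 0.7
print(extract_component(tree, "TEAM", size, p, rng))
```

With p = 1 the walk on this toy tree follows TEAM, STADIUM, CAPACITY, while with p = 0 it takes TEAM, STADIUM, CITY, which mirrors the depth-versus-width contrast described in the text.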
Sentence Annotation Given the table, title, and connected highlighted cells of a row, annotators were asked to write a description of the highlighted cells. We encouraged the annotators to use diverse vocabulary and syntactic structures. To ensure quality, internal annotators reviewed every crowdsourced sentence for correctness. They either rewrote or discarded the sentences that were nonsensical or incorrect. In some cases, they also changed cell highlighting patterns to match the sentence provided.
[Figure caption, opening missing in the source] We collect the parent-child relations between columns from internal annotators (yellow is parent, green is child). Then, we collect a surface realization of the cells highlighted in orange. Middle panel: We use the provided parent-child relations to construct an ontology tree on the columns, then select the nodes corresponding to the highlighted cells. We gather a connected subtree by collecting all nodes leading up to the highlighted cells' lowest common ancestor. Bottom panel: We extract a set of triples from the subtree as shown. This triple-set is paired with the provided realization to form a DART instance.
Build Tripleset-Sentence Pairs Finally, we convert the highlighted cells to triplesets. For a row R, we start with the table's column ontology T. We first place the cell values in R in their corresponding slots in T, e.g. in Figure 2 we fill TEAM with "Amsterdam Admirals". We then check that the nodes of T corresponding to the highlighted cells in R form a connected subtree. If not, we walk up the tree and highlight each traversed node up until the lowest common ancestor of the highlighted nodes (inclusive) to form a connected subtree. For each node N in the tree except the root node, we can extract the triple (parent(N), title(N), N). For example, since STADIUM is highlighted in Figure 2, we extract the triple (Amsterdam Admirals, STADIUM, Olympisch Stadion). A small number of triple-sets contained more than 10 triples. We discarded these because their associated surface realizations were of poor quality. The numbers of tripleset-sentence pairs annotated by different annotators are shown in Table 2.
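As a rough illustration of this step, the Python sketch below closes the highlighted nodes under their lowest common ancestor and then emits one (parent value, column title, cell value) triple per node. It is a sketch under assumptions, not the authors' code: the helper names, the parent-map representation, and the placeholder cell values for CAPACITY and CITY are hypothetical, and the sketch assumes that no triple is emitted for the top node of the connected subtree.

```python
from typing import Dict, List, Set, Tuple

ROOT = "[TABLECONTEXT]"

# Hypothetical column ontology T for the running example (child -> parent).
parent: Dict[str, str] = {
    "TEAM": ROOT, "STADIUM": "TEAM", "CAPACITY": "STADIUM", "CITY": "TEAM",
}
# Row R; the CAPACITY and CITY values are placeholders, not data from the paper.
row: Dict[str, str] = {
    "TEAM": "Amsterdam Admirals", "STADIUM": "Olympisch Stadion",
    "CAPACITY": "<capacity>", "CITY": "<city>",
}


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """Return [node, parent(node), ..., root]."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def connected_subtree(highlighted: List[str],
                      parent: Dict[str, str]) -> Tuple[str, Set[str]]:
    """Add every node between the highlighted nodes and their lowest common ancestor."""
    paths = [path_to_root(n, parent) for n in highlighted]
    common = set(paths[0]).intersection(*paths[1:])
    lca = next(n for n in paths[0] if n in common)  # first shared node walking up
    nodes: Set[str] = set()
    for path in paths:
        nodes.update(path[: path.index(lca) + 1])
    return lca, nodes


def extract_triples(row: Dict[str, str], highlighted: List[str],
                    parent: Dict[str, str]) -> List[Tuple[str, str, str]]:
    lca, nodes = connected_subtree(highlighted, parent)
    triples = []
    for col in row:  # keep the table's column order
        if col in nodes and col != lca:
            p = parent[col]
            # Special nodes such as [TABLECONTEXT] have no cell and stand for themselves.
            triples.append((row.get(p, p), col, row[col]))
    return triples


print(extract_triples(row, ["TEAM", "STADIUM"], parent))
```

Highlighting only CAPACITY and CITY in this toy table pulls in STADIUM and TEAM as well, since both lie on the paths up to the lowest common ancestor.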

Automatically Converting Questions to Declarative Sentences
High-quality natural language questions in open-domain semantic parsing datasets such as WikiSQL, together with the QA2D techniques used to automatically construct NLI datasets (Demszky et al., 2018), offer an attractive opportunity to semi-automatically construct an abundance of declarative sentences and align them to table cells. We used a rule-based QA2D technique to combine WikiSQL questions and SQL-retrieved answers into declarative sentences, and then manually filtered out bad sentences. We only execute SQL queries without aggregate commands, so that the retrieved answers correspond to questions answerable by single rows. An example of such a conversion is as follows: [example not preserved in the source]
Alignment with table cells is done in two stages. We first align sentences with their corresponding rows by changing the SQL commands to SELECT *, and then use string matching to obtain the columns and column headers relevant to the answer and the WHERE condition. After manually filtering out bad sentences, bad alignments, and tables without ontology annotations, we obtained 4,204 sentences. Finally, the corresponding table cells are converted into triples in the same way as described in Section 2.1.
Examples of produced declarative sentences can be found in Figure 10 of the Appendix.
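The row and cell alignment described in this section can be approximated with Python's standard `sqlite3` and `re` modules. The sketch below is only an illustration: the table `stadiums`, its contents, the function `align`, and the regular expressions are assumptions, and the QA2D sentence rewriting itself is not shown.

```python
import re
import sqlite3
from typing import Dict, List, Optional

AGGREGATE = re.compile(r"\b(count|max|min|sum|avg)\s*\(", re.IGNORECASE)


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory table with illustrative (not original) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stadiums (team TEXT, stadium TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO stadiums VALUES (?, ?, ?)",
        [("Amsterdam Admirals", "Olympisch Stadion", "Amsterdam"),
         ("Team B", "Stadium B", "City B")],
    )
    return conn


def align(conn: sqlite3.Connection, sql: str, answer: str,
          where_values: List[str]) -> Optional[Dict[str, str]]:
    """Return {column header: cell} for cells matching the answer or WHERE values."""
    if AGGREGATE.search(sql):
        return None  # skip queries with aggregate commands
    # Stage 1: retrieve the whole row by rewriting the projection to SELECT *.
    star_sql = re.sub(r"^\s*select\s+.*?\s+from\s", "SELECT * FROM ", sql,
                      count=1, flags=re.IGNORECASE | re.DOTALL)
    cursor = conn.execute(star_sql)
    headers = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    if len(rows) != 1:
        return None  # keep only questions answerable by a single row
    # Stage 2: string-match the answer and WHERE values against the row's cells.
    targets = {str(answer)} | {str(v) for v in where_values}
    return {h: v for h, v in zip(headers, rows[0]) if str(v) in targets}


conn = demo_connection()
query = "SELECT stadium FROM stadiums WHERE team = 'Amsterdam Admirals'"
print(align(conn, query, "Olympisch Stadion", ["Amsterdam Admirals"]))
```

The returned columns play the role of the highlighted cells, which would then be converted into triples through the table's ontology.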

Incorporating Existing Datasets
Since they provide a large amount of strictly aligned data-text pairs with high quality sentences, we incorporate the following existing datasets in the same triple-set, sentence pair format with some modifications.
WebNLG 2017 An instance of the WebNLG dataset contains a set of triples extracted from DBpedia and the target text written by humans. We include the WebNLG 2017 dataset, consisting of 27,731 triple-set, sentence pairs with up to 7 RDF triples in a triple set, covering 15 domains.

Predicate Unification
We canonicalized the predicates in our triple sets so that predicates with the same meaning share the same representation. We manually constructed a predicate mapping table to achieve this. As an example, our predicate mapping maps "Hometown," "Home Town," and "Home Town/City" to the unified predicate "HOMETOWN." After unifying predicates, we evaluated the diversity of DART by counting the number of unique predicates in its partitions. As shown in Table 2, the Wikipedia partition of DART contains many more unique predicates than the WebNLG and Cleaned E2E partitions combined, despite having a smaller number of triple-set, sentence pairs. This contributes significantly to the domain diversity of DART. In addition, DART exhibits a great deal of topical variety in terms of the number of unique triples and vocabulary size.
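A predicate mapping table of this kind can be applied with a simple lookup, as in the Python sketch below. Only the three "Hometown" variants come from the text; the normalization rule, the fallback for unmapped predicates, and the function names are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Manually built mapping; only these three variants are taken from the text.
PREDICATE_MAP: Dict[str, str] = {
    "HOMETOWN": "HOMETOWN",
    "HOME TOWN": "HOMETOWN",
    "HOME TOWN/CITY": "HOMETOWN",
}


def normalize(predicate: str) -> str:
    """Collapse whitespace and uppercase so lookups ignore case and spacing."""
    return re.sub(r"\s+", " ", predicate.strip()).upper()


def unify(triples: List[Triple]) -> List[Triple]:
    unified = []
    for subj, pred, obj in triples:
        key = normalize(pred)
        # Assumed fallback: unmapped predicates keep their own uppercased form.
        unified.append((subj, PREDICATE_MAP.get(key, key.replace(" ", "_")), obj))
    return unified


def count_unique_predicates(triples: List[Triple]) -> int:
    return len({pred for _, pred, _ in triples})


sample = [("x", "Hometown", "a"), ("y", "Home Town", "b"),
          ("z", "Home Town/City", "c"), ("w", "Away Team", "d")]
print(count_unique_predicates(unify(sample)))
```

Counting unique predicates after unification is the diversity measure the section reports per partition.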

Dataset Split
For WebNLG 2017 and Cleaned E2E, we use their original data splits. For our annotation on WikiTableQuestions and WikiSQL, random splitting would make the train, dev, and test splits contain similar tables and similar triple-set, sentence examples. Therefore, to increase the generalization challenge, we compare the table title and the table header to find similar tables, and make sure the model is evaluated on test split tables that are least similar to those used for training. We first sample some tables as a seed test set, and then compute the Jaccard similarity with the remaining tables based on the titles and the headers. If a table has a Jaccard similarity greater than 0.5 with any of the tables in the test set, we add it to the test set. A similar process is repeated to create the dev set, and the remaining tables form the training set. This results in 62,659/6,980/12,552 sentences in the train/dev/test sets, respectively.
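The similarity-based split can be sketched as follows in Python. The tokenization (lowercased whitespace tokens of the title and headers), the single pass over the remaining tables, and the toy table titles are assumptions for illustration; the text specifies only the Jaccard measure and the 0.5 threshold.

```python
from typing import Dict, List, Set, Tuple

Table = Dict[str, object]  # {"title": str, "headers": List[str]}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A intersect B| / |A union B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokens(table: Table) -> Set[str]:
    words = str(table["title"]).lower().split()
    for header in table["headers"]:
        words.extend(str(header).lower().split())
    return set(words)


def grow_split(seed: List[Table], pool: List[Table],
               threshold: float = 0.5) -> Tuple[List[Table], List[Table]]:
    """Move every pool table that is similar to a held-out table into the held-out set."""
    held_out = list(seed)
    rest = []
    for table in pool:
        if any(jaccard(tokens(table), tokens(h)) > threshold for h in held_out):
            held_out.append(table)
        else:
            rest.append(table)
    return held_out, rest


# Hypothetical tables, used only to exercise the functions.
seed = [{"title": "NFL Europe stadiums", "headers": ["Team", "Stadium", "City"]}]
pool = [
    {"title": "NFL Europe teams", "headers": ["Team", "Stadium", "City"]},
    {"title": "Chemical elements", "headers": ["Name", "Symbol"]},
]
test_tables, remaining = grow_split(seed, pool)
print(len(test_tables), len(remaining))
```

Running the same function again on the remaining tables with a seed dev set would give the dev split, leaving the rest for training.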

Experimental Results
We conduct experiments on DART and the WebNLG 2017 dataset, with an ablation study on WebNLG to show the benefits of using DART for data augmentation.

Models
We investigate several state-of-the-art Data-to-Text generation models. We report results of the following models on the DART test set: (1) a bidirectional LSTM with attention, for which we use a 2-layer bi-LSTM for the encoder, with 300-dimensional word embeddings (without using pretrained word vectors), 512 hidden units, and a 0.3 dropout rate for the decoder.

Evaluation Metrics
We use a variety of automatic metrics and human evaluation (Section 4) to evaluate the quality of the generated text. We report BLEU, METEOR, and TER, which are used in the official WebNLG challenge. However, these measures have limitations in considering the semantic meanings of words or phrases (Novikova et al., 2017a); we therefore also report MoverScore (Zhao et al., 2019), BERTScore (Zhang et al., 2020), and BLEURT (Sellam et al., 2020), which use contextual embeddings to incorporate semantics rather than surface forms. Furthermore, we include PARENT (Dhingra et al., 2019), which explicitly aligns n-grams from the reference and generated text to the data contents.

Results
DART Our experimental results on DART are summarized in Table 3. The T5-large model has the highest performance among all models, with a BLEU score of 50.66. We attribute this to T5's generalization and transfer learning ability from multi-task pretraining. In general, pretrained models outperform the others by a large margin, and increasing the model size seems to further boost the performance on DART. However, language models such as BART and T5 are pretrained by reconstructing text and, as a result, we found that their output on DART often contains hallucinated words (Parikh et al., 2020; Harkous et al., 2020; Reiter, 2020), as shown in Figure 11. In addition, while pretrained models show better text generation quality due to their generalization ability from pretraining, they do not fully capture the hierarchical ontology of the triple sets in their linearized input, which makes DART more challenging. We suspect that models that are better at exploiting the ontology structure preserved in the input tripleset will achieve better performance on DART.
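To make the point about linearized input concrete, the Python sketch below flattens a tripleset into a single string for a sequence-to-sequence model. The marker tokens `<H>`, `<R>`, and `<T>` and the function `linearize` are assumptions for illustration; the text does not state which linearization format was used in the experiments.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def linearize(triples: List[Triple]) -> str:
    """Concatenate triples into one flat string with assumed marker tokens."""
    parts = []
    for subj, pred, obj in triples:
        parts.append(f"<H> {subj} <R> {pred} <T> {obj}")
    return " ".join(parts)


# Triples from the example entry shown in the appendix material.
example_triples = [
    ("Apertura 2006", "JORNADA_OR_OTHER", "Semifinals Ida"),
    ("Semifinals Ida", "AWAY_TEAM", "América"),
    ("Semifinals Ida", "HOME_TEAM", "Chivas"),
]
print(linearize(example_triples))
```

In such a flat string the tree structure is only implicit: the model has to notice that the object of the first triple reappears as the subject of the next two in order to recover the hierarchy.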
WebNLG Furthermore, we investigate if DART can improve pretrained models' performance on other Data-to-Text generation tasks. To this end, we finetune the baseline transformer model, BART-[base, large] and T5-[small, base, large] on the WebNLG 2017 dataset, and augment the training by adding instances in the DART training set. The experimental results can be found in Table 4. We report performances of some competitive models that are not pretrained, as well as the state-of-the-art performances of pretrained models on the WebNLG 2017 dataset by Ribeiro et al. (2020). On the bottom panel, we include results of experiments augmented with DART instances whose triplesets are generated with table ontology annotation, paired with human written sentences. We are able to achieve new state-of-the-art results on all WebNLG 2017 test set splits (seen, unseen and all) by finetuning T5-large on DART. We observe that using DART for data augmentation consistently improves the performance across all models, including the baseline transformer model that is not pretrained. Furthermore, we observe that more improvement is shown on unseen split of the test set, due to DART's open-domain nature. See Figure 12 of the Appendix for example model outputs aligned with their human references.

Ablation Study
We also conduct an ablation study on the WebNLG dataset to investigate which part of DART contributes most to improving Data-to-Text tasks in general. We report the results of the study in Table 6 of the Appendix. We divide DART into 4 partitions, where the declarative sentence (auto-generated) partition and the human annotated sentence partition contain instances whose triplesets are extracted from Wikipedia tables based on the ontology. The E2E partition contains instances converted from the E2E dataset, and the WebNLG partition keeps the original data format. In general, we observe that adding DART instances that contain human written sentences brings the most improvement, especially on the unseen split. In contrast, adding the E2E partition boosts the scores on the seen test split but deteriorates the performance on the unseen test split. This trend is consistent across all models. Comparing the results of the declarative sentence partition and the human written sentence partition, we see that for most of the models, DART instances with human written sentences are of better quality, as they bring more improvement to the task.

Human Evaluation
In Table 5, we perform human evaluation on DART based on two criteria: (1) fluency, i.e., whether a sentence is natural and grammatical, and (2) semantic faithfulness, i.e., whether a sentence is supported by the input triples. We defined three levels of fluency: fluent, mostly fluent, and not fluent, and the same for semantic faithfulness. We asked 5 internal annotators to evaluate 100 triplesets sampled from the declarative sentence partition and another 100 triplesets sampled from the human written sentence partition. Each tripleset is paired with 3 sentences: one is the reference sentence, and the other two are outputs of the BART-base and T5-base models. The results in Table 5 attest to the high quality of our annotations, since the human written references achieve the highest fluency and faithfulness compared to the outputs of two strong baseline models. The evaluation on faithfulness also demonstrates that there is a considerable gap between the DART reference and the outputs of the state-of-the-art pretrained model, showing that there is substantial room for improvement. We also noticed that the auto-generated declarative sentences are not as fluent or faithful as the model outputs because they are generated with a rule-based system. However, we decided to release this partition along with the other partitions of DART because it demonstrates an economical way to obtain large amounts of DART instances, and it also shows benefits for generalization due to the diverse topics it contains.

Related Work
Data-to-Text Data-to-Text generation aims to produce natural language output from structured input. Applications include generating sports commentaries ( (Chen et al., 2020b). There are also studies of converting tabular data to RDF triples in the Semantic Web community (Kellogg et al., 2015).
Recently, some open-domain table-to-text datasets have been proposed, including WikiTableText (Bao et al., 2018), LogicNLG (Chen et al., 2020a), and ToTTo (Parikh et al., 2020), whose inputs are rows or entire tables. In ToTTo, highlighted cells are also provided as input, and the authors found that using only highlighted cells with flat row and column headers led to higher performance than using the entire table.
In contrast, DART is constructed by first annotating the tree-structured table ontology that encodes the semantic dependencies among table headers, and we could flexibly incorporate additional contexts such as the table title to the ontology tree. We then use an automatic procedure to extract connected components from the tree to form the input of a DART instance. Our annotation framework not only provides a flexible way of incorporating any contexts to the representation of tables, but also encodes hierarchical relationships among table headers and contexts, ensuring the extracted triples are logically consistent and can be described in text without loss of information.
Model Traditional Data-to-Text models break the generation process into different stages such as signal analysis, data interpretation, document planning, microplanning, and realization (Reiter and Dale, 2000; Reiter, 2007).

Conclusion
In this paper, we introduce DART, an open-domain corpus for structured data record to text generation. DART's ontology-preserving representation of data inputs differentiates it from other open-domain Data-to-Text corpora. We found that DART introduces new challenges to several state-of-the-art Data-to-Text models due to its open-domain nature and the ontology structure of its semantic triple input. Furthermore, we found that using it for data augmentation improves other Data-to-Text tasks. For future work, we will explore more controlled, high-fidelity generation that better incorporates the ontology hierarchy of data.

Ethics Statement
Our dataset is constructed by accumulating and processing resources from various existing datasets that are open to the public. In addition, we collect annotations on structure of tabular data and human written sentences that describe data records.
The existing resources that we utilize mainly consist of (1) tabular data from Wikipedia, (2) information about restaurants presented with dialogue-act meaning representations and their textual descriptions (E2E), and (3) information about various entities and their relationships in 15 different categories of DBPedia, which is a knowledge base built on content created in various Wikimedia projects (WebNLG). It is possible that there are biases in these resources, either in the tabular data or in the textual descriptions written by humans.
For additional annotations we collected, we have two groups of annotators participating: internal annotators who are the authors of this work, and external annotators recruited from the Amazon Mechanical Turk platform. On MTurk, we use a pay rate of $15 per hour approximately based on our estimation of the time it takes to complete our annotation tasks. In total, it took 125 hours to complete all tasks on the Amazon Mechanical Turk platform. There are three annotation tasks: (1) Annotators are asked to specify ontological structure of the table by indicating relationship between table column headers, (2) Annotators are asked to write descriptions that are fluent and semantically faithful to the data records presented to them, and (3) Annotators are asked to evaluate sentences that are either references or model generated outputs. We acknowledge that it is also possible to have biases in the sentences written by the annotators, or in the data records that are presented to them.
We conducted experiments on our own dataset and the WebNLG dataset using BART and T5, two large-scale pretrained models. Both models are trained on large amounts of textual data such as news, books, and web text, which may contain various kinds of biases. As a result, these biases may be carried over into the models.
In total, we conducted 43 experiments: 7 on DART and 36 for our ablation study on the WebNLG dataset. We use a single NVIDIA V100 GPU for all experiments, and each experiment took from 5 to 40 hours depending on the model size.

[Figure: example DART entries in XML format; the second entry is truncated in the source]
<entry category="MISC" eid="Id5" size="3">
  <modifiedtripleset>
    <mtriple>Apertura 2006 | JORNADA_OR_OTHER | Semifinals Ida</mtriple>
    <mtriple>Semifinals Ida | AWAY_TEAM | América</mtriple>
    <mtriple>Semifinals Ida | HOME_TEAM | Chivas</mtriple>
  </modifiedtripleset>
  <lex comment="WikiTableQuestions" lid="Id1"> Chivas and América will compete in the semifinals of the Apertura 2006 tournament. </lex>
</entry>
<entry category="MISC" eid="Id76" size="6"> <modifiedtripleset> <mtriple>Terry Jenkins | ROUND | [truncated]

[Figure caption, opening missing in the source] table from WikiTableQuestions. Annotators created a table ontology, and they wrote sentences encapsulating the information in the orange cells for a given row. Whenever a sentence referenced the table title, that sentence was also highlighted green.
Figure 9: An example of collected MTurk-generated sentences for WikiTableQuestions. Internal annotators went through the generated sentences and checked for both sentence coherence and title usage. Below the generated sentences, 'y' meant the sentence references the table title, 'n' meant the sentence did not use the table title, 'x' meant the sentence was nonsensical.
Figure 10: Automatically generated declarative sentences from WikiSQL with human validation. Annotators went through the generated sentences and checked for both sentence coherence and title use. Below the generated sentences, 'y' meant the sentence references the table title, 'n' meant the sentence did not use the table title, 'x' meant the sentence was nonsensical.