ToTTo: A Controlled Table-To-Text Generation Dataset

We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. We present systematic analyses of our dataset and annotation process as well as results achieved by several state-of-the-art baselines. While usually fluent, existing methods often hallucinate phrases that are not supported by the table, suggesting that this dataset can serve as a useful research benchmark for high-precision conditional text generation.


Introduction
Data-to-text generation (Kukich, 1983; McKeown, 1992) is the task of generating a target textual description y conditioned on source content x in the form of structured data such as a table. Examples include generating sentences given biographical data (Lebret et al., 2016), textual descriptions of restaurants given meaning representations (Novikova et al., 2017), and basketball game summaries given boxscore statistics (Wiseman et al., 2017).
Existing data-to-text tasks have provided an important test-bed for neural generation models (Sutskever et al., 2014; Bahdanau et al., 2014). Neural models are known to be prone to hallucination, i.e., generating text that is fluent but not faithful to the source (Vinyals and Le, 2015; Koehn and Knowles, 2017; Tian et al., 2019), and it is often easier to assess the faithfulness of generated text when the source content is structured (Wiseman et al., 2017; Dhingra et al., 2019). Moreover, structured data can also test a model's ability for reasoning and numerical inference (Wiseman et al., 2017) and for building representations of structured objects (Liu et al., 2018), providing an interesting complement to tasks that test these aspects in the NLU setting (Pasupat and Liang, 2015; Chen et al., 2019; Dua et al., 2019).
However, constructing a data-to-text dataset can be challenging on two axes: task design and annotation process. First, tasks with open-ended output like summarization (Mani, 1999; Lebret et al., 2016; Wiseman et al., 2017) lack explicit signals for models on what to generate, which can lead to subjective content and evaluation challenges (Kryściński et al., 2019). On the other hand, data-to-text tasks that are limited to verbalizing a fully specified meaning representation (Gardent et al., 2017b) do not test a model's ability to perform inference and thus remove a considerable amount of challenge from the task.
Secondly, designing an annotation process that yields targets that are both natural and clean is a significant challenge. One strategy employed by many datasets is to have annotators write targets from scratch (Banik et al., 2013; Wen et al., 2015; Gardent et al., 2017a); such targets often lack variety in terms of structure and style (Gururangan et al., 2018; Poliak et al., 2018). An alternative is to pair naturally occurring text with tables (Lebret et al., 2016; Wiseman et al., 2017). While more diverse, naturally occurring targets are often noisy and contain information that cannot be inferred from the source. This can make it problematic to disentangle modeling weaknesses from data noise.
In this work, we propose TOTTO, an open-domain table-to-text generation dataset that introduces a novel task design and annotation process to address the above challenges. First, TOTTO proposes a controlled generation task: given a Wikipedia table and a set of highlighted cells as the source x, the goal is to produce a single sentence description y. The highlighted cells identify portions of potentially large tables that the target sentence should describe, without specifying an explicit meaning representation to verbalize.
For dataset construction, to ensure that targets are natural but also faithful to the source table, we request annotators to revise existing Wikipedia candidate sentences into target sentences, instead of asking them to write new target sentences (Wen et al., 2015; Gardent et al., 2017a). Table 1 presents a simple example from TOTTO to illustrate our annotation process. The table and Original Text were obtained from Wikipedia using heuristics that collect pairs of tables x and sentences y that likely have significant semantic overlap. This method ensures that the target sentences are natural, although they may only be partially related to the table. Next, we create a clean and controlled generation task by requesting annotators to highlight a subset of the table that supports the original sentence and revise the latter iteratively to produce a final sentence (see §5). For instance, in Table 1, the annotator has chosen to highlight a set of table cells (in yellow) that are mentioned in the original text. They then deleted phrases from the original text that are not supported by the table, e.g., for the playoffs first leg, and replaced the pronoun he with an entity, Cristhian Stuani. The resulting final sentence (Final Text) serves as a more suitable generation target than the original sentence. This annotation process makes our dataset well suited for high-precision conditional text generation.
Due to the varied nature of Wikipedia tables, TOTTO covers a significant variety of domains while containing targets that are completely faithful to the source (see Figures 2-6 for more complex examples). Our experiments demonstrate that state-of-the-art neural models struggle to generate faithful results, despite the high quality of the training data. These results suggest that our dataset and the underlying task could serve as a strong benchmark for controllable data-to-text generation models.

Related Work
TOTTO differs from existing datasets in both task design and annotation process as we describe below. A summary is given in Table 2.
Task Design Most existing table-to-text datasets are restricted in topic and schema, such as WEATHERGOV (Liang et al., 2009), ROBOCUP (Chen and Mooney, 2008), Rotowire (Wiseman et al., 2017, basketball), E2E (Novikova et al., 2016, 2017), KBGen (Banik et al., 2013, biology), and Wikibio (Lebret et al., 2016, biographies). In contrast, TOTTO contains tables with various schema spanning various topical categories all over Wikipedia.

Dataset | Train Size | Domain | Target Quality | Target Source | Content Selection
Wikibio (Lebret et al., 2016) | 583K | Biographies | Noisy | Wikipedia | Not specified
Rotowire (Wiseman et al., 2017) | 4.9K | Basketball | Noisy | Rotowire | Not specified
WebNLG (Gardent et al., 2017b) | 25.3K | 15 DBPedia categories | Clean | Annotator Generated | Fully specified
E2E (Novikova et al., 2017) | 50.6K | Restaurants | Clean | Annotator Generated | Partially specified
LogicNLG (Chen et al., 2020) | 28.5K | Wikipedia (open-domain) | Clean | Annotator Generated | Columns via entity linking
TOTTO | 120K | Wikipedia (open-domain) | Clean | Wikipedia (Annotator Revised) | Annotator highlighted

Table 2: Comparison of TOTTO with existing data-to-text datasets.

Moreover, TOTTO takes a different view of content selection compared to existing datasets. Prior to the advent of neural approaches, generation systems typically separated content selection (what to say) from surface realization (how to say it) (Reiter and Dale, 1997). Thus many generation datasets only focused on the latter stage (Wen et al., 2015; Gardent et al., 2017b). However, this decreases the task complexity, since neural systems are already quite powerful at producing fluent text. Some recent datasets (Wiseman et al., 2017; Lebret et al., 2016) have proposed incorporating content selection into the task by framing it as a summarization problem. However, summarization is much more subjective, which can make the task underconstrained and difficult to evaluate (Kryściński et al., 2019). We position TOTTO as a middle ground: the highlighted cells provide some guidance on the topic of the target but still leave a considerable amount of content planning to be done by the model.

Annotation Process
There are various existing strategies to create the reference target y. One strategy employed by many datasets is to have annotators write targets from scratch given a representation of the source (Banik et al., 2013; Wen et al., 2015; Gardent et al., 2017a). While this results in a target that is faithful to the source data, it often lacks variety in terms of structure and style (Gururangan et al., 2018; Poliak et al., 2018). Domain-specific strategies such as presenting an annotator an image instead of the raw data (Novikova et al., 2016) are not practical for some of the complex tables that we consider. Other datasets have taken the opposite approach: finding real sentences on the web that are heuristically selected in a way that they discuss the source content (Lebret et al., 2016; Wiseman et al., 2017). This strategy typically leads to targets that are natural and diverse, but they may be noisy and contain information that cannot be inferred from the source. To construct TOTTO, we ask annotators to revise existing candidate sentences from Wikipedia so that they only contain information that is supported by the table. This enables TOTTO to maintain the varied language and structure found in natural sentences while producing cleaner targets. The technique of editing exemplar sentences has been used in semi-parametric generation models (Guu et al., 2018; Pandey et al., 2018; Peng et al., 2019), and crowd-sourcing small, iterative changes to text has been shown to lead to higher-quality data and a more robust annotation process (Little et al., 2010). However, to our knowledge, we are the first to use this technique to construct generation datasets. Concurrently with this work, Chen et al. (2020) proposed LogicNLG, which also uses Wikipedia tables, although omitting some of the more complex structured ones included in our dataset. Their target sentences are annotator-generated, and their task is significantly more uncontrolled due to the lack of annotator highlighted cells.

Preliminaries
Our tables come from English Wikipedia articles and thus may not be regular grids. For simplicity, we define a table t as a set of cells t = {c_j}, j = 1, ..., τ, where τ is the number of cells in the table. Each cell contains: (1) a string value, (2) whether or not it is a row or column header, (3) the row and column position of the cell in the table, and (4) the number of rows and columns the cell spans.
Let m = (m_page-title, m_section-title, m_section-text) denote the table metadata, i.e., the page title, section title, and up to the first 2 sentences of the section text (if present), respectively. These fields can help provide context for the table's contents. Let s = (s_1, ..., s_η) be a sentence of length η. We define an annotation example d = (t, m, s) as a tuple of table, table metadata, and sentence.
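For concreteness, a minimal sketch of these structures in Python (the class and field names are ours; the released dataset uses its own JSON schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Cell:
    value: str          # (1) string value
    is_header: bool     # (2) whether the cell is a row or column header
    row: int            # (3) row position in the table
    col: int            #     column position in the table
    row_span: int = 1   # (4) number of rows this cell spans
    col_span: int = 1   #     number of columns this cell spans

@dataclass
class Example:
    table: List[Cell]    # t = {c_1, ..., c_tau}
    page_title: str      # m_page-title
    section_title: str   # m_section-title
    section_text: str    # m_section-text (up to the first 2 sentences)
    sentence: List[str]  # s = (s_1, ..., s_eta), tokenized
```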

Dataset Collection
We first describe how we obtain annotation examples d for subsequent annotation. To prevent any overlap with the Wikibio dataset (Lebret et al., 2016), we do not use infobox tables. We employed three heuristics to collect tables and sentences:

Number matching We search for tables and sentences on the same Wikipedia page that overlap on a non-date number with at least 3 non-zero digits. The numbers are extracted by regular expressions that capture most common number patterns, including numbers with commas and decimal points. This approach captures most of the table-sentence pairs that describe statistics (e.g., sports, elections, census data, science, weather).
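The paper does not give the exact expressions, so the following is only a rough sketch of this heuristic; the regular expression and the omission of date filtering are assumptions:

```python
import re

# Matches common number formats, e.g. "1,234", "3.14", "1234".
# A real implementation would also exclude date-like numbers.
NUM_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def significant_numbers(text):
    """Extract numbers with at least 3 non-zero digits."""
    out = set()
    for m in NUM_RE.finditer(text):
        digits = re.sub(r"[^0-9]", "", m.group())
        if sum(ch != "0" for ch in digits) >= 3:
            out.add(digits)
    return out

def number_match(cell_values, sentence):
    """True if the sentence and table share a significant number."""
    table_nums = set()
    for value in cell_values:
        table_nums |= significant_numbers(value)
    return bool(table_nums & significant_numbers(sentence))
```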
Cell matching We extract a sentence if it has tokens matching at least 3 distinct cell contents from the same row in the table. The intuition is that most tables are structured, and a row is usually used to describe a complete event (e.g., a sports game, an election, census data from a certain year), which is likely to have a corresponding sentence description from the same page.
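A simplified sketch of this heuristic; the exact tokenization and string normalization are assumptions, and real cell contents may be multi-word and require fuzzier matching:

```python
def cell_match(rows, sentence, min_matches=3):
    """True if the sentence covers at least `min_matches` distinct
    cell values from a single table row. `rows` is a list of rows,
    each a list of cell strings; matching here is naive and exact."""
    tokens = set(sentence.lower().split())
    for row in rows:
        matched = {c for c in row if c and c.lower() in tokens}
        if len(matched) >= min_matches:
            return True
    return False
```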
Hyperlinks The above heuristics only consider sentences and tables on the same page. We also find examples where a sentence s contains a hyperlink to a page whose title starts with List (these pages typically consist of only a large table). If the table t on that page also has a hyperlink to the page containing s, then we consider this to be an annotation example. Such examples are typically more diverse than those from the other two heuristics, but also noisier, since the sentence may be only distantly related to the table.

Data Annotation Process
The collected annotation examples are noisy since a sentence s may be partially or completely unsupported by the table t. We thus define a data annotation process that guides annotators through small, incremental changes to the original sentence. This allows us to measure annotator agreement at every step of the process, which is atypical in existing generation datasets.

Primary Annotation Task
The primary annotation task consists of the following steps: (1) Table Readability, (2) Cell Highlighting, (3) Phrase Deletion, (4) Decontextualization. Each of these is described below, and more examples are provided in Table 3.

Table Readability If a table is not readable, then the following steps do not need to be completed. This step is only intended to remove fringe cases where the table is poorly formatted or otherwise not understandable (e.g., in a different language). 99.5% of tables are determined to be readable.
Cell Highlighting An annotator is instructed to highlight cells that support the sentence. A phrase is supported by the table if it is either directly stated in the cell contents or metadata, or can be logically inferred from them. Row and column headers do not need to be highlighted. If the table does not support any part of the sentence, then no cell is marked and no other step needs to be completed. 69.7% of examples are supported by the table.
For instance, in Figure 1, the annotator highlighted cells that support the phrases second, 13 November 2013, in Jordan, and 5-0 win. We denote the set of highlighted cells as a subset of the table: t_highlight ⊆ t.
Phrase Deletion This step removes phrases in the sentence that are unsupported by the selected table cells. Annotators are restricted such that they are only able to delete phrases, transforming the original sentence: s → s_deletion. In Table 1, the annotator transforms s by removing an individual word (Charras') and an entire phrase (For the playoffs first leg, finishing Nicolás Lodeiro's cross at close range).
On average, s_deletion differs from s for 85.3% of examples, and while s has an average length of 26.6 tokens, this is reduced to 15.9 tokens for s_deletion. We found that the phrases annotators most often disagreed on corresponded to verbs purportedly supported by the table. For instance, in Table 1, some annotators decided netted is supported by the table since it is about soccer, while others opted to delete it.
Decontextualization A given sentence s may contain pronominal references or other phrases that depend on context. We thus instruct annotators to identify the main topic of the sentence; if it is a pronoun or other ambiguous phrase, we ask them to replace it with a named entity from the table or metadata. To discourage excessive modification, they are instructed to make at most one replacement. This transforms the sentence yet again: s_deletion → s_decontext. In Table 1, the annotator replaced he with Cristhian Stuani.
Since the previous steps can lead to ungrammatical sentences, annotators are also instructed to fix the grammar to improve the fluency of the sentence. We find that s_decontext differs from s_deletion 68.3% of the time, and the average sentence length increases to 17.2 tokens for s_decontext, compared to 15.9 for s_deletion.

Secondary Annotation Task
Due to the complexity of the primary annotation task, the resulting sentence s_decontext may still have grammatical errors, even though annotators were instructed to fix grammar. Thus, a second set of annotators were asked to further correct the sentence; they were shown the table with highlighted cells as additional context, but were not required to use it. They were asked to determine the grammaticality and fluency of the provided sentence and, if it was not fluent or grammatical, to fix the errors. Annotators are also given an option to indicate that the sentence is not fixable.
This results in the final sentence s_final. On average, annotators edited the sentence 27.0% of the time, and the average sentence length increased slightly, from 17.2 to 17.4 tokens. We found that in most cases the table is not necessary to fix the sentence, since the grammatical errors concern surface syntax, such as missing punctuation or a missing determiner. In a few cases a verb may be missing, and in such instances the table is needed to indicate the correct verb to use.

Dataset Analysis
Basic statistics of TOTTO are described in Table 4. Notably, while tables can be quite large, the number of highlighted cells is typically small (median 3). This indicates the importance of the cell highlighting feature of our dataset toward a well-defined text generation task.

Annotator Agreement

Table 5 shows annotator agreement over the development set for each step of the annotation process. We compute annotator agreement and Fleiss' kappa (Fleiss, 1971) for table readability and highlighted cells, and BLEU-4 scores between annotated sentences at different stages, including (1) the sentence after deletion; (2) the sentence after decontextualization; and (3) the final sentence after the secondary grammar correction task. As one can see, the table readability task has an agreement of 99.38%. The cell highlighting task is more challenging: 73.74% of the time, all three annotators agree on the exact same set of cells. The Fleiss' kappa is 0.856, which is regarded as "almost perfect agreement" (0.81-1.00) according to Landis and Koch (1977).
With respect to the sentence revision tasks, we see that agreement slightly degrades as more steps are performed. We compute single-reference BLEU among all pairs of annotators for examples in our development set (which only contains examples where annotators chose t_highlight ≠ ∅). As the sequence of revisions is performed, annotator agreement gradually decreases in terms of BLEU-4: 82.19 → 72.56 → 68.98. This is still considerably higher than the BLEU-4 between the original sentence s and s_final (45.87).
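Agreement of this form can be reproduced with any standard BLEU implementation; below is a sketch using sacrebleu, which is our choice of library, not necessarily the authors':

```python
import itertools
import sacrebleu  # pip install sacrebleu

def pairwise_bleu(annotator_sentences):
    """Average corpus-level BLEU over all ordered pairs of annotators.

    `annotator_sentences` has one entry per annotator; each entry is
    a list of sentences aligned across annotators (a simplification
    of the paper's per-stage setup)."""
    scores = [
        sacrebleu.corpus_bleu(hyp, [ref]).score
        for hyp, ref in itertools.permutations(annotator_sentences, 2)
    ]
    return sum(scores) / len(scores)
```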

Topics and Linguistic Phenomena
We use the Wikimedia Foundation's topic categorization model (Asthana and Halfaker, 2018) to assign topics to the Wikipedia articles our tables come from. Figure 1 presents an aggregated topic analysis of our dataset. We found that the Sports and Countries topics together comprise 56.4% of our dataset, but the remaining 43.6% is composed of a much broader set of topics such as Performing arts, Transportation, and Entertainment.

Training, Development, and Test Splits
Each annotation consists of the set of highlighted cells t_highlight and the modified sentences s_deletion, s_decontext, and s_final. After the annotation process, we only consider examples where the sentence is related to the table, i.e., t_highlight ≠ ∅. This initially results in a training set D_orig-train of size 131,849 that we further filter as described below.
For more robust evaluation, each example in the development and test sets was annotated by three annotators. Since the machine learning task uses t_highlight as an input, it is challenging to use three different sets of highlighted cells in evaluation. Thus, we use a single randomly chosen t_highlight while using the three s_final as references for evaluation. We only use examples where at least 2 of the 3 annotators chose t_highlight ≠ ∅. This results in a development set D_dev of size 7,700 and a test set D_test of size 7,700.
Overlap and Non-Overlap Sets Without any modification, D_orig-train, D_dev, and D_test may contain many similar tables. Thus, to increase the generalization challenge, we filter D_orig-train to remove some examples based on overlap with D_dev and D_test.
For a given example d, let h(d) denote its set of header values and similarly let h(D) be the set of header values for a given dataset D.
We remove examples d from the training set where h(d) is rare in the data but occurs in either the development or test sets. Specifically, D_train is defined as:

D_train = {d ∈ D_orig-train : count(h(d), D_orig-train) ≥ κ or h(d) ∉ h(D_dev) ∪ h(D_test)}

where the count(h(d), D_orig-train) function returns the number of examples in D_orig-train with header h(d), and κ is a threshold hyperparameter.
To choose the hyperparameter κ, we first split the test set as follows:

D_test-overlap = {d ∈ D_test : h(d) ∈ h(D_train)},  D_test-nonoverlap = {d ∈ D_test : h(d) ∉ h(D_train)}

The development set is analogously divided into D_dev-overlap and D_dev-nonoverlap. We then choose κ = 5 so that D_test-overlap and D_test-nonoverlap have similar sizes. After filtering, the size of D_train is 120,761, and D_dev-overlap, D_dev-nonoverlap, D_test-overlap, and D_test-nonoverlap have sizes 3,784, 3,916, 3,853, and 3,847, respectively.
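A sketch of the filtering step implied by the definition above, reusing the Cell structure from the Preliminaries sketch (function names are ours):

```python
from collections import Counter

def header_key(example):
    """Hashable representation of an example's header values h(d)."""
    return frozenset(c.value for c in example.table if c.is_header)

def filter_train(orig_train, dev, test, kappa=5):
    """Keep d unless its header set is both rare in the training
    data and present in the dev/test header sets."""
    counts = Counter(header_key(d) for d in orig_train)
    eval_headers = ({header_key(d) for d in dev}
                    | {header_key(d) for d in test})
    return [
        d for d in orig_train
        if counts[header_key(d)] >= kappa
        or header_key(d) not in eval_headers
    ]
```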

Machine Learning Task Construction
In this work, we focus on the following task: given a table t, related metadata m (page title, section title, table section text), and a set of highlighted cells t_highlight, produce the final sentence s_final. Mathematically, this can be described as learning a function f : x → y where x = (t, m, t_highlight) and y = s_final.
Note that this task differs from what the annotators perform, since they are provided with a starting sentence requiring revision. The model's task is therefore more challenging, as it must generate a new sentence rather than revise an existing one. Since our annotation mechanism has several stages, one could design other machine learning tasks from the data, such as sentence revision or cell highlighting, but we leave these outside the scope of this work.

Experiments
We present baseline results on TOTTO by examining three existing state-of-the-art approaches (see the sketch after this list for one way to instantiate the first):

• BERT-to-BERT (Rothe et al., 2019): A Transformer encoder-decoder model (Vaswani et al., 2017) where the encoder and decoder are both initialized with BERT (Devlin et al., 2018). The original BERT model is pre-trained on both Wikipedia and the Books corpus (Zhu et al., 2015), the former of which contains our (unrevised) test targets. Thus, we also pre-train a version of BERT on the Books corpus only, which we consider a more correct baseline. However, empirically we find that both models perform similarly in practice (Table 7).

• Pointer-Generator (See et al., 2017): A Seq2Seq model with attention and a copy mechanism.

• Puduppully et al. (2019): A Seq2Seq model with an explicit content selection and planning mechanism designed for data-to-text.
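As noted in the first bullet, a comparable BERT-to-BERT baseline can be instantiated with the Hugging Face transformers library. This is our sketch, not the authors' implementation, and the linearized source string is purely illustrative:

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Warm-start both encoder and decoder from a pre-trained BERT checkpoint.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# A linearized (metadata + subtable) source; token names are our own.
source = "<page_title> ... <section_title> ... <cell> ..."
inputs = tokenizer(source, return_tensors="pt")
output_ids = model.generate(inputs.input_ids, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```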
Moreover, we explore different strategies for representing the source content, resembling standard linearization approaches in the literature (Lebret et al., 2016; Wiseman et al., 2017):
• Full Table The simplest approach is to use the entire table as the source, adding special tokens to mark which cells have been highlighted. However, many tables can be very large, and this strategy performs poorly.

• Subtable Another option is to use only the highlighted cells t_highlight ⊆ t, along with heuristically extracted row and column headers for each highlighted cell. This makes it easier for the model to focus only on relevant content, but limits its ability to perform reasoning in the context of the full table structure (see Table 10). Overall, though, we find this representation leads to higher performance.
In all cases, the selected cells are linearized with row and column separator tokens. We also experiment with prepending the table metadata to the source table.
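A minimal sketch of the subtable linearization; the separator token names are our invention, since the paper does not specify them:

```python
def linearize_subtable(highlighted, metadata=None):
    """Flatten highlighted cells into a token string with row and
    column separators. `highlighted` maps row index -> list of
    (col_header, row_header, value) triples for that row."""
    pieces = []
    if metadata is not None:
        pieces += ["<page_title>", metadata["page_title"],
                   "<section_title>", metadata["section_title"]]
    for row_idx in sorted(highlighted):
        pieces.append("<row>")
        for col_header, row_header, value in highlighted[row_idx]:
            pieces += ["<cell>", value,
                       "<col_header>", col_header,
                       "<row_header>", row_header]
    return " ".join(pieces)
```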

Evaluation metrics
The model output is evaluated using two automatic metrics; human evaluation is described in §8.3.

BLEU (Papineni et al., 2002): A widely used metric based on n-gram overlap between the reference y and the prediction ŷ at the corpus level. BLEU does not take the source content x into account.
PARENT (Dhingra et al., 2019): A metric recently proposed specifically for data-to-text evaluation that takes the table into account. PARENT is defined at the instance level as the harmonic mean of a precision term and a recall term. For a given example (x_n, y_n, ŷ_n):

PARENT(x_n, y_n, ŷ_n) = 2 · E_p · E_r / (E_p + E_r)

Here E_p = E_p(x_n, y_n, ŷ_n) is the PARENT precision, computed using the prediction, reference, and table (the last of which is not used in BLEU). E_r = E_r(x_n, y_n, ŷ_n) is the PARENT recall, computed as a geometric average of two terms:

E_r = R(x_n, y_n, ŷ_n)^(1−λ) · R(x_n, ŷ_n)^λ

where R(x_n, y_n, ŷ_n) is a recall term that compares the prediction with both the reference and the table, and R(x_n, ŷ_n) is an extra recall term that gives an additional reward if the prediction ŷ_n contains phrases in the table x_n that are not necessarily in the reference (λ is a hyperparameter).
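For concreteness, a sketch of how these terms combine into the final score; the underlying entailment-based n-gram computations for E_p and the R terms are omitted (see the PARENT paper for full definitions):

```python
def parent_score(e_p, r_ref, r_tab, lam=0.5):
    """Combine PARENT-style precision and recall terms for one example.

    e_p:   entailed precision E_p(x_n, y_n, y_hat_n).
    r_ref: recall against reference and table, R(x_n, y_n, y_hat_n).
    r_tab: extra recall against the table, R(x_n, y_hat_n) -- in our
           modified metric this term is computed over t_highlight
           rather than the full table t.
    lam:   the lambda hyperparameter trading off the recall terms.
    """
    e_r = (r_ref ** (1.0 - lam)) * (r_tab ** lam)  # geometric average
    if e_p + e_r == 0.0:
        return 0.0
    return 2.0 * e_p * e_r / (e_p + e_r)  # harmonic mean
```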
In the original PARENT work, the same table t is used for computing the precision and both recall terms. While this makes sense for most existing datasets, it does not take into account the highlighted cells t_highlight in our task. To incorporate t_highlight, we modify the PARENT metric so that the extra recall term R(x_n, ŷ_n) uses t_highlight instead of t, giving an additional reward only for relevant table information. The other recall term and the precision term still use t.

Results

Table 7 shows our results against multiple references with the subtable input format. Both BERT-to-BERT models perform best, followed by the pointer-generator model. For all models, performance on the non-overlap set is significantly lower than on the overlap set, indicating that this slice of our data poses significant challenges for machine learning models. We also observe that the baseline that separates content selection and planning performs quite poorly. We attribute this to the fact that it is engineered for the Rotowire data format, with fixed-size tables and predefined column names. Table 8 explores the effects of the input representation (subtable vs. full table) on the BERT-to-BERT model. We see that the full table format performs poorly even though it is the most knowledge-preserving representation. Using table metadata significantly helps under both input formats.

Human evaluation
For each of the two top-performing models in Table 7, we take 500 random outputs and perform human evaluation along the following axes:

• Fluency - A candidate sentence is fluent if it is grammatical and natural. The three choices are Fluent, Mostly Fluent, and Not Fluent.

• Faithfulness (Precision) - A candidate sentence is considered faithful if all pieces of information are supported by either the table or one of the references. Any piece of unsupported information makes the candidate unfaithful.

• Covered Cells (Recall) - The percentage of highlighted cells the candidate sentence covers.

• Coverage with Respect to Reference (Recall) - We ask whether the candidate is strictly more or less informative than each reference (or neither, which is referred to as neutral).
In addition to evaluating the model outputs, we compute an oracle upper-bound by treating one of the references as a candidate and evaluating it compared to the table and other references. The results, shown in Table 9, attest to the high quality of our human annotations since the oracle consistently achieves high performance. All the axes demonstrate that there is a considerable gap between the model and oracle performance.
This difference is most easily revealed in the last column, where annotators are asked to directly compare the candidate and reference. As expected, the oracle has similar coverage to the reference (61.7% neutral), but both baselines demonstrate considerably less coverage. According to an independent-sample t-test, this difference is significant at the p < 0.001 level for both baselines. Similarly, we observe a significantly lower percentage of covered cells for the baselines compared to the reference according to a χ² test. Comparing the baselines to each other, we do not observe a significant difference in either coverage metric.
Furthermore, the baselines are considerably less faithful than the reference. The faithfulness of both models is significantly lower than that of the reference (χ² test with p < 0.001). The models do not differ significantly from each other, except in the non-overlap case, where we see a moderate effect. While hallucination has previously been observed with noisy references (Wiseman et al., 2017; Tian et al., 2019), our results indicate it is a problem even when the references are clean.

Model Errors and Challenges
In this section, we examine example decoder outputs from the BERT-to-BERT Books model (Table 10) and discuss specific challenges that existing approaches face with this dataset. In general, the model performs reasonably well at producing grammatical and fluent sentences from the information in the table, as indicated by Table 10. Given the full table, the model is not able to correctly select the information needed to produce the reference, and instead produces an arbitrary sentence with irrelevant information. Note that the model corrects itself given the highlighted cell information (subtable), and learns to use the metadata to improve the sentence. However, we also observe challenges that existing approaches struggle with, which can serve as directions for future research. In particular:

Hallucination As shown in Table 10 (examples 1-4), the model sometimes outputs phrases such as first scottish or third that seem reasonable but are not faithful to the table. This hallucination phenomenon has been widely observed in other existing data-to-text datasets (Lebret et al., 2016; Wiseman et al., 2017). However, the noisy references in these datasets make it difficult to disentangle model incapability from data noise. Our dataset serves as strong evidence that even when the reference targets are faithful to the source, neural models still struggle with faithfulness.
Rare topics Another challenge revealed by the open-domain nature of our task is that models often struggle with rare or complex topics. For instance, example 5 of Table 10 concerns microdrive capacities, which is challenging. As our topic distribution indicates (Figure 1), such topics are underrepresented in the dataset.

Table 9: Human evaluation over references (to compute Oracle) and model outputs. For Fluency, we report the percentage of outputs that were completely fluent. In the last column, X/Y/Z means X% and Z% of the candidates were deemed to be less and more informative than the reference, respectively, and Y% were neutral.

Table 10 (excerpt): References and decoder outputs (with metadata, without metadata, and with the full-table input). For example 6, the reference reads: the 1956 grand prix motorcycle racing season consisted of six grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. The decoder outputs hallucinate the number of races (seven, eight) and, in one configuration, the year (1966). For example 5, a decoder output given the full table reads: cortete's production was 512 megabyte.

Conclusion
In this work, we presented TOTTO, a large English table-to-text dataset that offers both a controlled generation task and a data annotation process based on iterative sentence revision. We also provided several state-of-the-art baselines and demonstrated that TOTTO can be a useful benchmark for modeling research as well as for developing evaluation metrics that better detect model improvements. TOTTO is available at https://github.com/google-research-datasets/totto.