Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity

End-to-end neural data-to-text (D2T) generation has recently emerged as an alternative to pipeline-based architectures. However, it has faced challenges generalizing to new domains and generating semantically consistent text. In this work, we present DataTuner, a neural, end-to-end data-to-text generation system that makes minimal assumptions about the data representation and target domain. We take a two-stage generation-reranking approach, combining a fine-tuned language model with a semantic fidelity classifier. Each component is learnt end-to-end without needing dataset-specific heuristics, entity delexicalization, or post-processing. We show that DataTuner achieves state of the art results on automated metrics across four major D2T datasets (LDC2017T10, WebNLG, ViGGO, and Cleaned E2E), with fluency assessed by human annotators as nearing or exceeding the human-written reference texts. Our generated text has better semantic fidelity than the state of the art on these datasets. We further demonstrate that our model-based semantic fidelity scorer is a better assessment tool compared to traditional heuristic-based measures of semantic accuracy.


Introduction
Data-to-Text generation (D2T) is defined as automatically generating natural language texts from nonlinguistic inputs (Reiter and Dale, 2000). Interest in this task has been driven by its applicability to specialized domains. For instance, D2T has been applied to generating weather reports (Liang et al., 2009), restaurant descriptions (Novikova et al., 2017b), and video game dialogues (Juraska et al., 2019). Recently, researchers have investigated D2T with more diverse domains to arrive at more generalizable text generation (such as works on LDC2017T10 (Knight et al., 2017) and WebNLG (Gardent et al., 2017) datasets).
Traditional approaches to D2T follow a pipeline-based methodology, dividing the problem into several sub-problems (Reiter and Dale, 2000; Gatt and Krahmer, 2018). These include content selection (which information to include in the text), text structuring (the order in which to present the data), sentence aggregation (which information goes in individual sentences), lexicalization (finding the right words and phrases to express the data), referring expression generation (selecting the words and phrases to identify domain objects), and linguistic realization (combining all the generated words and phrases into well-formed sentences).
In recent years, there has been a growing interest in going beyond pipeline-based approaches towards end-to-end (E-to-E) methods driven by recent advancements in deep learning (Lebret et al., 2016; Novikova et al., 2017b; Castro Ferreira et al., 2019; Dušek et al., 2020). Such methods can be trained with (data, text) tuples that can be efficiently collected at scale. In contrast, each step in pipeline-based approaches requires its own setup and training data, such as semantic alignments between sections of text and components of the meaning representation (MR). This makes them more costly and complex to develop and more prone to error propagation.
To date, end-to-end D2T has faced two main challenges: (1) generalization to unseen domains and (2) maintaining semantic fidelity to accurately convey the source data. In a recent comparative study, Castro Ferreira et al. (2019) found that, compared to the best pipeline-based system, E-to-E approaches based on GRU and Transformer architectures scored more than 35 BLEU points lower on unseen domains from the WebNLG dataset, and scored worst for semantic accuracy.
To address these challenges, we introduce DATATUNER, an E-to-E, domain-independent D2T system that makes no assumptions about the generated text's domain or the MR's structure. DATATUNER leverages a pretrained language model and fine-grained state embeddings to achieve strong generalization. It also employs a weakly-supervised Semantic Fidelity Classifier (SFC) to detect and avoid generation errors (such as hallucination and omission). We also leverage this classifier to assess outputs from any D2T system, overcoming the limitations of existing heuristic methods for detecting semantic errors.
In this work, we deliver four main contributions across four major D2T datasets from various domains and MRs.
• We show that DATATUNER pushes the state of the art on automated metrics by significant margins, ranging from 1.2 to 5.9 BLEU points, compared to the best existing pipeline and E-to-E techniques.
• With a crowdsourcing experiment, we demonstrate that DATATUNER generates text with significantly better fluency than existing works. On two datasets, our texts are even judged to be better, on average, than human-written references.
• We show that DATATUNER improves the semantic accuracy of generated texts, with margins ranging from 5.3% to 40% as assessed by crowdsourcing workers.
• With expert annotations, we further show that our model-based semantic accuracy metric is 4.2% to 14.2% more accurate in detecting semantic errors than existing heuristic-based approaches.

Related Work
Pipeline vs. End-to-End Approaches: Within the pipeline-based paradigm, several studies have illustrated that breaking the D2T problem into sub-problems improves overall performance. Moryossef et al. (2019b) showed that separating planning from realization helps achieve better semantic faithfulness compared to an E-to-E neural approach on the WebNLG dataset. Castro Ferreira et al. (2019) conducted a comparative study across a variety of E-to-E and pipeline approaches with WebNLG, concluding that the latter are significantly better at generalizing to unseen domains. However, so far the E-to-E approaches in these studies have been trained from scratch on the task dataset. Our work investigates whether using a pretrained model with strong language generation capabilities raises the performance of E-to-E models.
Structured Representations of the Data: Another thread of research focuses on better encoders for meaning representation languages, exploiting their structural properties. This is particularly relevant to AMR (Damonte and Cohen, 2019; Ribeiro et al., 2019; Zhu et al., 2019; Guo et al., 2019). Damonte and Cohen (2019) showed that replacing sequential encoders with a graph encoder improves text quality as measured by BLEU and METEOR scores. Zhu et al. (2019) proposed using self-attention to better model indirectly connected AMR components. In this work, we are the first to design a system that achieves a strong performance across different data structures, ranging from slot-value pairs to graph-based MR.
We also show that such a system can deliver significant gains compared to existing specialized systems.
Semantic Fidelity Guarantees: To improve semantic fidelity (how accurately the generated text conveys the meaning) in E-to-E architectures, one approach has been to train reverse "Text-to-Data" models (Chisholm et al., 2017; Agarwal et al., 2018). Another approach by Kedzie and McKeown (2019) used data augmentation and a reliable MR parser to reduce semantic errors in the generated text. Nie et al. (2019) focused on fixing training data errors via an iterative data refinement technique using a language understanding module. Nie et al. (2018) tackled the specific case where symbolic operations (e.g. numerical comparisons) are needed, augmenting the encoded input by pre-calculating these inferrable facts. Shen et al. (2019) used techniques from computational pragmatics and modeled the generation task as a game between speakers and listeners. Despite following the generation-reranking paradigm explored previously in the data-to-text domain (Agarwal et al., 2018; Moryossef et al., 2019a; Dušek et al., 2019), and in other domains including machine translation (Shen et al., 2004), dialogue generation (Wen et al., 2015), and ASR (Morbini et al., 2012), our work has several distinctive aspects compared to previous works. First, we do not make extra assumptions, such as availability of precise MR parsers. Second, our system provides improvements even when the data is not the root cause of semantic errors. Third, we go beyond encouraging the model to avoid semantically inconsistent outputs: we aim to also detect with high probability when the generated text still contains such errors. For industrial NLG applications, including in healthcare (Pauws et al., 2019) or news (Leppänen et al., 2017), identifying individual generations that are inaccurate is vital for the system to be useful in practice (Smiley et al., 2017).
This error detection task has commonly relied on handwritten mappings from data values to potential realizations. Such rules were used to compute a Slot Error Rate (SER) metric (Dušek et al., 2019; Juraska et al., 2019; Moryossef et al., 2019a). For instance, Dušek et al. (2019) use SER for reranking beam elements during decoding in an attention-based sequence-to-sequence model on the Cleaned E2E dataset. Juraska et al. (2019) used the approach similarly with a transformer model on the ViGGO dataset. This technique is difficult to scale to new domains or languages, and struggles when the MR is not dominated by values that occur verbatim in the text (e.g. named entities). We aim to tackle that with our model-based semantic fidelity classifier.

Problem Description
The D2T task is formally defined as generating text T from data D that is encoded via a meaning representation MR. We assume that content selection is done prior to the D2T task, an assumption also made in the datasets we use. Therefore, the text T should have semantic fidelity by conveying all the input data, and only the input data.

Datasets
We selected the major datasets that satisfy the task definition above. Each dataset consists of (D, T) pairs with texts in English. The following describes each dataset and our preprocessing/linearization, including special tokens added (highlighted in <bold> below) to guide our models. We provide the datasets' statistics in Table 1 of the appendix, and we show examples from them in Figure 1.
WebNLG: In WebNLG, D is a set of 1-7 DBpedia triples which T verbalizes (Gardent et al., 2017). The test data spans 15 domains, 10 of which are seen in training. We linearize by concatenating triples, adding special tokens for 'subject', 'predicate', and 'object', and converting strings to sentence-case. For fair comparison with the state of the art, we use v1.4 from Castro Ferreira et al. (2018).
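The linearization described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the literal token strings `<subject>`, `<predicate>`, `<object>`, the space separator, the underscore replacement, and the reading of "sentence-case" as Python's `str.capitalize` are all assumptions made for the example.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def sentence_case(value: str) -> str:
    """Replace underscores with spaces and capitalize only the first letter.

    Assumption: "sentence-case" is read here as str.capitalize(), which also
    lowercases the remaining characters.
    """
    return value.replace("_", " ").strip().capitalize()


def linearize_triples(triples: Iterable[Triple]) -> str:
    """Concatenate triples, prefixing each field with a special token.

    The token spellings are hypothetical placeholders for the special tokens
    mentioned in the text.
    """
    parts = []
    for subj, pred, obj in triples:
        parts.append(
            f"<subject> {sentence_case(subj)} "
            f"<predicate> {sentence_case(pred)} "
            f"<object> {sentence_case(obj)}"
        )
    return " ".join(parts)


example = linearize_triples([("Aarhus_Airport", "location", "Denmark")])
```

The same pattern, with one special token per slot type, would apply to the slot-value MRs of the other datasets.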
LDC2017T10: In the LDC2017T10 dataset (Knight et al., 2017), D is an Abstract Meaning Representation (AMR) graph representing "who is doing what to whom" for each sentence in T. The texts include broadcast news and weblogs. We use the preprocessing script from Ribeiro et al. (2019), without lowercasing. We merge leaves that correspond to one entity (e.g. "United States" below). Each role specifier is replaced with a special token.
Cleaned E2E: The Cleaned E2E dataset introduced in (Dušek et al., 2019) is an automatically cleaned version of the original E2E dataset (Novikova et al., 2017b), aiming to eliminate omissions and hallucinations in the human text by fixing the corresponding MR. Each MR consists of 3-8 slot-value pairs in the restaurant domain. We preprocess D by adding special tokens before each slot type.
ViGGO: In ViGGO (Juraska et al., 2019), D is a meaning representation with one of 9 dialogue acts (e.g. give opinion, suggest) and 1-8 slot-value pairs from 14 video game attributes (e.g. NAME, GENRES). Each T is an utterance representing a dialogue turn. We add special tokens at the start and end, representing the dialogue act, and before each slot type.
As we illustrate in Table 1, the datasets vary widely. The LDC2017T10 dataset is not bound to specific domains. Hence, although the AMR format closely describes the text, it is non-trivial to generalize from the training to test data. WebNLG covers a wide but restricted set of domains, only a subset of which are present in the training data. However, it has high lexical diversity. The number of unique words in the test set of WebNLG is 7253 (63% of them capitalized), compared to 5533 (22% capitalized) for LDC2017T10, 2014 (33% capitalized) for ViGGO, and 1966 (29% capitalized) for Cleaned E2E. Measured with the New Dale-Chall readability score (Dale and Chall, 1948), LDC2017T10 has the highest difficulty score (6.49) compared to 1.03, 0.85, and 1.02 for the WebNLG, Cleaned E2E, and ViGGO datasets respectively. In terms of quality, ViGGO was designed with the goal of perfect semantic fidelity, and Cleaned E2E was heavily filtered from the original dataset to achieve that. On the other hand, the versions we use of the other datasets have not undergone such filtering.

DATATUNER Architecture
We designed DATATUNER to be highly generic in order to tackle diverse meaning representations and allow D2T generators to be built for new datasets with minimal work beyond data preprocessing. At a high-level, our text generation system takes a 2-stage approach: generation and reranking. First, we fine-tune a pretrained language model on the D2T task using the task's training data. Next, we build a specialized semantic fidelity classifier trained on an automatically-generated task-specific corpus. Using these models, we construct a customized beam-search decoder that ranks candidates based on the probabilities from the language model, and, at its final stage, reranks them based on the classifier's labels.

Data-to-Text Model Fine-tuning
The fine-tuned Data-to-Text Language Model (D2T-LM) builds on the pretrained OpenAI GPT-2 model (Radford et al., 2019), a multi-layer, autoregressive language model. Each layer is a transformer decoder block (Vaswani et al., 2017) of masked multi-headed attention and a fully connected layer. We provide a full model diagram in Figure 2.
[Figure 1 excerpt, LDC2017T10 example: Linearized D = (respond <:ARG0> (country <:name> (United States)) <:ARG1> (develop <:mod> (that)) <:ARG2> (condemn <:manner> (swift))); T = The United States responded to that development with swift condemnation.]
Inputs: The input sequence is the data D concatenated with the text T: (<data>{D}<text>{T}). The special tokens <data> and <text> are appended to GPT-2's original vocabulary; their embeddings are learnt during fine-tuning. In addition, we append to the vocabulary the MR-dependent special tokens described above. After tokenization, we get a sequence S of subword tokens, which are encoded as vocabulary indices. One interesting feature of GPT-2 is its use of Byte-Pair Encoding (BPE) (Sennrich et al., 2016) on bytes instead of unicode characters. Hence, with a modestly-sized subword vocabulary of around 50K, it can encode any input text and score any output sequence, without suffering from unknown tokens. This is beneficial for our task where named entities are common.
GPT-2 additionally expects positional encodings to help capture the input tokens' order. Our core addition to the model is a third type of input: fine-grained state embeddings. These are analogous to the "Segment Embeddings" introduced in BERT (Devlin et al., 2019) to distinguish between sentence pairs in the next sentence prediction task. However, in our case, the state is defined at a more fine-grained level to give the model a hint on the type of the data being handled. The state vector for S is a vector of tokens with size |S|, with each token ID indicating the type of $s_i$. Our strategy is to decide the state based on the special tokens we inserted in the data processing stage. We use the following rule: the state token ID of any token $s_i$ is the ID of the last special token preceding it (i.e. in the range $(s_0 \ldots s_i)$ inclusively).
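The state-assignment rule above can be sketched in a few lines. The function name `state_ids` and the example token strings are hypothetical; the sketch works on token strings rather than vocabulary indices for readability.

```python
from typing import List, Optional, Set


def state_ids(tokens: List[str], special_tokens: Set[str]) -> List[Optional[str]]:
    """Return, for each token, the last special token at or before it.

    The range is inclusive, so a special token is its own state. Tokens that
    appear before any special token get None; this should not occur when the
    sequence starts with <data>.
    """
    states: List[Optional[str]] = []
    current: Optional[str] = None
    for tok in tokens:
        if tok in special_tokens:
            current = tok
        states.append(current)
    return states


# Illustrative slot-value input; the slot token names are assumptions.
tokens = ["<data>", "<name>", "Aromi", "<food>", "Chinese", "<text>", "Aromi", "serves"]
specials = {"<data>", "<text>", "<name>", "<food>"}
states = state_ids(tokens, specials)
```

With only `<data>` and `<text>` in `special_tokens`, the same function yields a coarse two-state segmentation of the sequence.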
Training: The input embeddings, positional embeddings, and state embeddings are summed together and fed to the first GPT-2 layer. The last GPT-2 layer output is then normalized using "LayerNorm" (Ba et al., 2016) before passing it to a linear layer added on top. The weights of the latter are tied to the input embeddings. Finally, a softmax is applied to the linear layer's output to generate probability distributions of the output tokens. Our training objective is a language modeling one where we aim to find the set of weights θ that minimize the cross-entropy loss $\ell = -\sum_{i=|D|+2}^{|S|} \log P_\theta(s_i \mid s_0, \ldots, s_{i-1})$. Note that, since our task is to generate text given the data, the cross-entropy loss is computed for the text following the input data. We mask the data component in the loss above and sum the loss from index |D| + 2 (i.e., after the <text> token). We use AdamW as an optimizer (Loshchilov and Hutter, 2019).
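To make the masking concrete, the following sketch computes the text-only negative log-likelihood from per-token probabilities. It assumes a 0-indexed layout in which position 0 holds <data>, positions 1 to |D| hold the data tokens, position |D|+1 holds <text>, and the text starts at |D|+2; the function name and the probability values are illustrative only.

```python
import math
from typing import Sequence


def text_only_nll(token_probs: Sequence[float], data_len: int) -> float:
    """Negative log-likelihood summed over the text tokens only.

    token_probs[i] stands for P(s_i | s_0 ... s_{i-1}) under the model.
    Positions before index data_len + 2 (the <data> token, the data tokens,
    and the <text> token) are masked out of the loss.
    """
    start = data_len + 2
    return -sum(math.log(p) for p in token_probs[start:])


# Two data tokens, so the text starts at index 4 (hypothetical probabilities).
probs = [0.9, 0.1, 0.2, 0.8, 0.5, 0.25]
loss = text_only_nll(probs, data_len=2)
```

Only the last two probabilities contribute here, so the low probabilities assigned to the data tokens do not affect the loss.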

Semantic Fidelity Classifier
The Semantic Fidelity Classifier (SFC) provides an additional assessment of how accurately the generated text reflects the input data. A text is deemed to possess semantic fidelity if it accurately conveys all of the input data without omitting any nor adding additional data. Our approach draws parallels between this task and natural language inference (NLI) tasks, where the goal is to determine whether a "hypothesis" is true, false, or undetermined given a "premise". Similarly, in semantic fidelity classification, we aim to determine if the text is "accurate" or contains some "omission", "repetition", "hallucination", or "value errors" given the data. We cast the problem as a sentence-pair classification task for the (Data, Text) pairs, using RoBERTa (Liu et al., 2019) as a base encoder. This formulation has been successfully used for NLI problems before (Devlin et al., 2019).
Training Data Generation: The classifier's training data should consist of semantically faithful and semantically incorrect examples. We generate training data for the SFC automatically from the training data of the main D2T task. We define a set of simple dataset-independent transformations that account for common errors in data-to-text generation. For each tuple $(D_i, T_i)$ in the training data, we split the text $T_i$ into sentences, using the spaCy sentence tokenizer (Honnibal and Montani, 2017). We then generate a set of new tuples for the SFC consisting of $(D_i, T_j, l)$ for each of the labels $l$ below, generated as follows:
• Accurate: This is the text $T_i$.
• Omission: Remove the shortest sentence in $T_i$ (to help detect subtle omissions).
• Repetition: Take a random sentence in $T_i$ and insert it before another random sentence in $T_i$.
• Hallucination: Select a random sentence from another training text $T_{j \neq i}$ and insert it before a random sentence in $T_i$.
A related approach, with a different setup and modeling architecture, has been used before in the context of consistency in abstractive summarization (Kryściński et al., 2019). There, weakly supervised models trained on domain-specific data have been shown to outperform supervised models trained on out-of-domain, human-annotated data.
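The sentence-level transformations listed above (accurate, omission, repetition, hallucination) can be sketched as follows. This is an illustration under stated assumptions: the text is passed in already split into a non-empty list of sentences (the paper uses spaCy for this step), the omission example is skipped for single-sentence texts, and the function name and example sentences are hypothetical.

```python
import random
from typing import List, Tuple

Example = Tuple[str, str, str]  # (data, text, label)


def make_sfc_examples(
    data: str,
    sentences: List[str],
    other_sentences: List[str],
    rng: random.Random,
) -> List[Example]:
    """Build weakly labeled classifier examples from one (data, text) pair."""
    examples = [(data, " ".join(sentences), "accurate")]

    # Omission: drop the shortest sentence (assumption: skip one-sentence texts).
    if len(sentences) > 1:
        k = min(range(len(sentences)), key=lambda idx: len(sentences[idx]))
        examples.append((data, " ".join(sentences[:k] + sentences[k + 1:]), "omission"))

    # Repetition: copy a random sentence and insert it before a random position.
    src = rng.randrange(len(sentences))
    pos = rng.randrange(len(sentences))
    repeated = sentences[:pos] + [sentences[src]] + sentences[pos:]
    examples.append((data, " ".join(repeated), "repetition"))

    # Hallucination: insert a sentence taken from a different training text.
    if other_sentences:
        pos = rng.randrange(len(sentences))
        foreign = rng.choice(other_sentences)
        mixed = sentences[:pos] + [foreign] + sentences[pos:]
        examples.append((data, " ".join(mixed), "hallucination"))

    return examples


sfc_examples = make_sfc_examples(
    data="<name> The Mill <price> cheap <near> river",
    sentences=["The Mill is a pub.", "It is cheap.", "It is near the river."],
    other_sentences=["Aromi serves Chinese food."],
    rng=random.Random(0),
)
```

Passing an explicit `random.Random` instance keeps the generated corpus reproducible across runs.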
Model Input: As shown in Figure 3, we concatenate the data and text tokens, adding the special start (<s>) and end (</s>) tokens used during the training of RoBERTa. In addition to subword token embeddings, we add positional embeddings (representing the token position) and segment embeddings (representing data vs. text types).
Training: The 3 embeddings are summed element-wise to produce the input representation passed to RoBERTa's first encoder layer. Each layer subsequently applies a self-attention followed by a feedforward network. We take the output hidden layer corresponding to the first token (<s>) and pass it through an additional single-layer neural network. The model is trained as a multi-class classifier with a cross-entropy loss as the objective and AdamW as the optimizer.

Decoder
Our decoding algorithm for the D2T-LM is based on beam-search. At each decoding step, items are ranked according to the score $R = \frac{1}{(i-(|D|+2))^{\alpha}} \prod_{j=|D|+2}^{i} P(s_j \mid s_0 \ldots s_{j-1})$. The score multiplies the conditional probabilities' product with a length normalization factor. Low-scoring candidates are dropped once the number of candidates exceeds the beam size.
Compared to traditional beam search, we do not aggregate probabilities from the start of the sequence, but from the start of the text component (index |D|+2). The length normalization is also adjusted to only account for the text component. We do this because we fine-tuned the D2T-LM to generate text given data as context, not to generate the data itself. Hence, we remove the data tokens from the beam-scoring function. In our experiment, we use a value of α = 0.75. At the end of the beam-search, we use the SFC to rerank the complete candidates (terminated with an end-of-sequence token) in the beam. The reranking metric uses the following binary score: $\mathbb{1}_{\mathrm{SFC}(D_i, T_i) = \text{"accurate"}}$.
Hence, we push the text $T_i$ to the top of the beam if our SFC labels the $(D_i, T_i)$ tuple as "accurate". We resolve ties using the original D2T-LM scores. An alternative strategy would be to apply reranking at each decoding stage, but we empirically found this strategy to have negligible accuracy gains while requiring a cost that grows with the text size. In addition to helping surface semantically accurate outputs, the SFC labels can be used to assess whether the generated text is usable in practice. In our experiments, we compare this model-based approach to the heuristic approaches commonly used.
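The final reranking step can be sketched as a two-key sort: candidates labeled "accurate" come first, and ties are broken by the language model score. The classifier is represented by a plain callable here, and the candidate texts, scores, and labels are made up for illustration.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (text, D2T-LM beam score)


def rerank(candidates: List[Candidate], sfc_label: Callable[[str], str]) -> List[Candidate]:
    """Move candidates labeled "accurate" to the top, breaking ties by LM score.

    sfc_label is a stand-in for the semantic fidelity classifier applied to
    the (data, text) pair; the data argument is omitted for brevity.
    """
    return sorted(
        candidates,
        key=lambda cand: (sfc_label(cand[0]) != "accurate", -cand[1]),
    )


# Hypothetical beam contents and classifier outputs.
beam = [("text a", 0.5), ("text b", 0.4), ("text c", 0.3)]
fake_labels = {"text a": "omission", "text b": "accurate", "text c": "accurate"}
reranked = rerank(beam, fake_labels.get)
```

If no candidate is labeled "accurate", the order reduces to the original ranking by language model score.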

Experiments
For each dataset, we generate outputs from three versions of DATATUNER for our ablation studies. DATATUNER NO FC/FS simply relies on the D2T-LM, with no SFC-based reranking and a coarse-grained version of the state embeddings that contains only <data> and <text> tokens (as done by Wolf et al. (2019b)). DATATUNER NO FC adds the fine-grained state embeddings described in Section 4.1 to DATATUNER NO FC/FS. DATATUNER FC adds the SFC-based reranking. For the SFC, we train the model using the RoBERTa-large model (355M parameters) on lower-cased text. On the synthetic test set generated, the classifier has a macro-averaged F1-score (across 5 classes) of 97%, 97%, 98%, and 98% for the LDC2017T10, WebNLG, Cleaned E2E, and ViGGO datasets respectively. We use the models bundled within the HuggingFace Transformers library (Wolf et al., 2019a). The D2T-LM uses the GPT-2-Medium model (with 345M parameters) as its base model. The beam search width during decoding is 5. Training was performed on a single machine (Amazon AWS p3.8xlarge). During inference, text from DATATUNER is generated at an average rate of 11.8 tokens per second on NVIDIA Tesla K80 GPUs.
We evaluate each variant's outputs with automated metrics and crowdsourced fluency and fidelity evaluation. We also quantify the efficacy of our semantic fidelity classifier with expert annotations. We compare against the state of the art systems on each dataset, selected based on BLEU scores. These are a graph-optimized Transformer Seq2seq model by Zhu et al. (2019)

Automated Evaluation
For each test set, we compute BLEU (B) (Papineni et al., 2002), which measures n-gram precision, METEOR (M) (Lavie and Agarwal, 2007), which is based on the harmonic mean of the unigram precision and recall while accounting for stem and synonymy matching, ROUGE-L (R) (Lin, 2004), which calculates recall for the longest common subsequence, and CIDEr (C) (Vedantam et al., 2015), which is based on TF-IDF scoring of n-grams. We used the official evaluation scripts of the E2E challenge. Table 2 compares the results generated by DATATUNER variants against the state of the art.
Improvements from the D2T-LM alone: Comparing the simple DATATUNER NO FC/FS model to the state of the art, we find that it already improves the BLEU score across 2 datasets and the METEOR score across 3 datasets. This indicates that the D2T-LM component of DATATUNER is itself contributing to achieving an end-to-end state-of-the-art system that needs no delexicalization or MR-specific encoding.
Fine-grained state embeddings matter: Across the 4 datasets, adding fine-grained state embeddings boosts performance on these metrics, with improvements ranging from 0.3 (on Cleaned E2E) to 2.0 BLEU points (on ViGGO).
SFC effect on automated metrics: Several studies highlight shortcomings of automated metrics for evaluating semantic adequacy (Novikova et al., 2017a; Shimorina, 2018). In this vein, compared to our DATATUNER NO FC model, we observe slight additional boosts from introducing the SFC classifier in the DATATUNER FC variant. Interestingly, DATATUNER FC always has the highest METEOR score, which was the only metric found by Shimorina (2018) to correlate with semantic adequacy.
Largest boost on the most complex text: DATATUNER had the biggest improvement, 5.9 BLEU points, on the LDC2017T10 dataset. This is interesting given that (1) the text in LDC2017T10 is typically long with more complex sentence structures and that (2) the baseline systems targeting AMR-to-text (Zhu et al., 2019; Guo et al., 2019; Ribeiro et al., 2019) built more sophisticated architectures compared to other datasets (e.g. ViGGO and Cleaned E2E). This illustrates our system's ability to work across a spectrum of data representations and text complexity.

Human Evaluation of Fluency
We conduct human evaluation of fluency for 150 examples sampled at random from each dataset. We use Amazon's MTurk to ask crowd workers how fluent a text is on a 7-point Likert scale using sliders, where "high fluency" is defined as "grammatical, natural, and could have been produced by a skilled native speaker". Following findings from Novikova et al. (2018) and  for acquiring more consistent human ratings, texts generated for the same meaning representation by different systems are presented together in a single task for annotators to score them relative to each other. We include the human-written text, and randomize the texts' order. For fair comparison, we lower-case our generated texts for the LDC2017T10 to match the outputs of Zhu et al. (2019). We also detokenize outputs from that work to avoid these biasing the workers. We restrict to US-based annotators who completed >500 tasks, out of which more than 97% had been accepted.
Improvement on the state of the art: As shown in the column "Flu." of Table 3, compared to the human baseline, our DATATUNER FC model improves the fluency on all four datasets compared to the state of the art systems with statistically significant margins (p < 0.05). For computing significance, we use the pairwise Wilcoxon signed-rank test (Wilcoxon, 1992) with the null hypothesis that the fluency values for each pair of systems come from the same distribution. For LDC2017T10, where DATATUNER FC had the largest gap in BLEU score (+5.9), we observe the widest fluency improvement (+0.82) compared to Zhu et al. (2019). Interestingly, despite the fact that DATATUNER FC scored 0.7 higher on BLEU compared to the pipeline approach in Castro Ferreira et al. (2019) for WebNLG, the difference in fluency is 0.69, which is relatively large. We conjecture that this originates from two main sources. First, semantic errors might be perceived by annotators as breaking the fluency. For example, one text contained the phrase "has a runway length of Shehbaz Sharif". Second, the pipeline approach had a sizeable portion of non-realized outputs (e.g. "PATIENT-1 is made with PATIENT-1 and PATIENT-2."), which were annotated as non-fluent. On the closed-domain datasets (ViGGO and Cleaned E2E), the fluency margins are smaller while still statistically significant. This is expected as these datasets have a narrow set of sentence formulations that are easier to learn.
Improvement on the human baseline: Surprisingly, we find that DATATUNER FC received a higher overall average fluency score on 3 datasets compared to the human baseline. This difference is statistically significant in both Cleaned E2E and ViGGO, with the largest difference being 1.04 points in Cleaned E2E. Investigating, we found several low-scored texts had an informal style and problems in sentence construction. One example contained "It serves Chinese food for less."
One explanation could be that, once fine-tuned on a large enough dataset, our models have less tendency to deviate from common formulations that annotators prefer.
We take a two-step approach for evaluating semantic fidelity with both crowdworkers and expert annotators. We start with a crowdsourcing experiment involving the same 150 randomly sampled examples from each dataset used in the fluency evaluation. We use Amazon's MTurk (with the same restrictions on annotators) and present texts for the same MR together. To avoid requiring non-expert annotators to understand the MR, we present a human reference text against which to compare the system outputs. We ask annotators to make a choice whether each text is "accurate" or "inaccurate", i.e. whether it "conveys all the factual information in the original text" without any "information missing, added or repeated". We ask annotators to ignore grammar quality or style differences, provided that the overall meaning is the same. Three annotators complete each task, and we take the mode result for each text, assuming "inaccurate" in the event of a tie.

Crowdsourced Evaluation of Fidelity
DATATUNER has higher semantic accuracy in human evaluation: The results presented in the "Fid." column of Table 3 show that DATATUNER FC has a superior accuracy to all the other variants as well as the state of the art models. The differences to the state of the art models range from 5.3% (ViGGO) to 40% (WebNLG) and are statistically significant on WebNLG, LDC2017T10, and Cleaned E2E (p < 0.05, as measured by McNemar's test (McNemar, 1947)). The trend among the DATATUNER variants also shows a clear impact of both the fine-grained state embeddings and the SFC on boosting the overall accuracy of the generated text.

Expert Evaluation of Fidelity
Next, we assess the semantic fidelity with experts' annotations. We have two goals in this section: (1) comparing our model-based approach to heuristic-based approaches as automated methods of judging semantic accuracy, and (2) using this comparison outcome to illustrate that DATATUNER delivers higher semantic accuracy as measured by the better fidelity metric.
The baseline method uses heuristics to label each data-text tuple as accurate ($A_H$) or erroneous ($E_H$). For this, we use the heuristics by Shimorina and Gardent (2018) for WebNLG, by Juraska et al. (2018) for ViGGO, and by Dušek et al. (2019) for Cleaned E2E. We are not aware of heuristic-based scripts for LDC2017T10. We compute Heuristic Semantic Accuracy (HSA) of a dataset as the fraction with the label $A_H$. Our method uses the SFC component in DATATUNER to assign accurate ($A_D$) or erroneous ($E_D$) labels to each data-text tuple. We compute DATATUNER Semantic Accuracy (DSA) as the fraction with the label $A_D$. Both metrics are computed per system across each dataset.
To compare the quality of HSA and DSA as measures of semantic accuracy, we manually annotated a sample of data-text tuples. Since the vast majority of texts are expected to be accurate, especially on the cleaner datasets, we designed a sampling methodology to give a balanced representation of semantically accurate and inaccurate texts. To start, we sample 4 indices from the target dataset such that the human baseline outputs for these indices are labeled as: $\{(A_H, E_D), (E_H, A_D), (E_H, E_D), (A_H, A_D)\}$. We do the same with the state of the art system and DATATUNER FC outputs. We continue in a round-robin fashion until we get 24 indices per dataset. For the LDC2017T10 dataset, we sample 24 indices in a similar fashion while ignoring the $A_H$ and $E_H$ labels. Next, two authors were presented with the input meaning representation and the output texts generated by each system (in a random order) for the 24 sampled entries. The authors manually labeled the resulting 480 data-text tuples as accurate ($A_M$) or erroneous ($E_M$). Inter-annotator agreement measured with Cohen's Kappa was 0.81, indicating near-perfect agreement. We use these labels to assess the quality $Q_D$ of the DSA metric as the percentage of cases where the manual label $A_M$ matches $A_D$. Similarly, we evaluate the quality $Q_H$ of the HSA metric as the percentage of cases where $A_M$ matches $A_H$. These percentages are aggregated across systems, obtaining 120 samples per dataset. We present these metrics in Table 3.
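The quantities used in this comparison can be sketched with a few helper functions. The sketch reads "quality" as the fraction of tuples on which an automatic label agrees with the manual label, which is an interpretation of the description above; the label lists are invented, and "A"/"E" stand for accurate/erroneous.

```python
from collections import Counter
from typing import Sequence


def semantic_accuracy(labels: Sequence[str]) -> float:
    """Fraction of tuples labeled accurate ("A"); the form shared by HSA and DSA."""
    return sum(label == "A" for label in labels) / len(labels)


def metric_quality(manual: Sequence[str], automatic: Sequence[str]) -> float:
    """Fraction of tuples where the automatic label agrees with the manual one."""
    return sum(m == a for m, a in zip(manual, automatic)) / len(manual)


def cohens_kappa(first: Sequence[str], second: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    count_a, count_b = Counter(first), Counter(second)
    expected = sum(count_a[l] * count_b[l] for l in set(first) | set(second)) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented labels for four data-text tuples.
manual = ["A", "E", "A", "A"]
dsa_labels = ["A", "E", "E", "A"]
hsa_labels = ["E", "E", "E", "A"]
q_d = metric_quality(manual, dsa_labels)
q_h = metric_quality(manual, hsa_labels)
```

Note that `cohens_kappa` is undefined when both annotators use a single identical label throughout, since the expected agreement is then 1.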
DSA provides higher quality semantic evaluation: We notice that $Q_D$ is 4.2% higher on ViGGO and 14.2% higher on both Cleaned E2E and WebNLG, compared to $Q_H$. These differences are statistically significant (p < 0.05) on WebNLG and Cleaned E2E, as measured by McNemar's test (McNemar, 1947) with the null hypothesis that the marginal probability for each outcome (accurate or erroneous) is the same for both algorithms.
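For readers unfamiliar with McNemar's test, the following sketch computes a two-sided p-value from the two discordant counts. It uses the exact binomial form of the test, which is an assumption: the text does not state which variant was applied. The counts in the example are invented and do not come from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: items where only the first method agrees with the manual label.
    c: items where only the second method agrees with the manual label.
    Under the null hypothesis, each discordant pair is equally likely to
    fall in either cell, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, j) for j in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented counts: 9 tuples where only one metric is right, 1 where only the other is.
p_value = mcnemar_exact(1, 9)
```

Concordant pairs, where both methods are right or both are wrong, do not enter the statistic.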
HSA struggles with open domains: The heuristic-based approach labeled only 41.2% of the human references in WebNLG as accurate, 16.9% lower than the score it assigned to our DATATUNER FC. Since the latter was trained on human references, this difference is more likely to stem from shortcomings of this approach for assessing the semantics. Checking the data, we observed that humans tend to create more diverse formulations, such as converting United Kingdom to UK, which are easy to miss with heuristics. On the contrary, our DSA metric scored the human references higher.
DATATUNER FC delivers higher semantic accuracy: Now that we have established that DSA is a better measure of semantic accuracy compared to HSA, we can see from Table 3 that, across all datasets, DATATUNER FC significantly improves over the state of the art models as measured by the DSA metric (McNemar's test gives p < 0.05). Compared to other DATATUNER variants, DATATUNER FC adds between 0.3% and 11.3% improvements, corroborating the utility of the semantic fidelity classifier. Finally, we note that, since the baseline models for Cleaned E2E and ViGGO use the heuristics for reranking their outputs, they are expected to show higher HSA. However, what our manual annotations prove is that the HSA metric itself is of lower quality compared to the DSA metric.

Conclusion
We presented DATATUNER, an end-to-end data-to-text generation system equipped with a semantic fidelity classifier. DATATUNER records new state of the art results on four different datasets, with significant margins on automated metrics. We also show that our system has a clear fluency advantage over all the previous state of the art models. We further illustrate DATATUNER's strengths for delivering semantically accurate outputs.