Handling Divergent Reference Texts when Evaluating Table-to-Text Generation

Automatically constructed datasets for generating text from semi-structured data (tables), such as WikiBio, often contain reference texts that diverge from the information in the corresponding semi-structured data. We show that metrics which rely solely on the reference texts, such as BLEU and ROUGE, show poor correlation with human judgments when those references diverge. We propose a new metric, PARENT, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall. Through a large-scale human evaluation study of table-to-text models for WikiBio, we show that PARENT correlates with human judgments better than existing text generation metrics. We also adapt and evaluate the information extraction based evaluation proposed by Wiseman et al. (2017), and show that PARENT has comparable correlation to it, while being easier to use. Using the data from the WebNLG challenge, we show that PARENT is also applicable when the reference texts are elicited from humans.


Introduction
The task of generating natural language descriptions of structured data (such as tables) (Kukich, 1983; McKeown, 1985; Reiter and Dale, 1997) has seen a growth in interest with the rise of sequence to sequence models that provide an easy way of encoding tables and generating text from them (Lebret et al., 2016; Wiseman et al., 2017; Novikova et al., 2017b; Gardent et al., 2017).
For text generation tasks, the only gold standard metric is to show the output to humans for judging its quality, but this is too expensive to apply repeatedly anytime small modifications are made to a system. Hence, automatic metrics that compare the generated text to one or more reference texts are routinely used to compare models (Bangalore et al., 2000). For table-to-text generation, automatic evaluation has largely relied on BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). The underlying assumption behind these metrics is that the reference text is gold-standard, i.e., it is the ideal target text that a system should generate. In practice, however, when datasets are collected automatically and heuristically, the reference texts are often not ideal. Figure 1 shows an example from the WikiBio dataset (Lebret et al., 2016). Here the reference contains extra information which no system can be expected to produce given only the associated table. We call such reference texts divergent from the table.
We show that existing automatic metrics, including BLEU, correlate poorly with human judgments when the evaluation sets contain divergent references (§5.4). For many table-to-text generation tasks, the tables themselves are in a pseudo-natural language format (e.g., WikiBio, WebNLG (Gardent et al., 2017), and E2E-NLG (Dušek et al., 2019)). In such cases we propose to compare the generated text to the underlying table as well, to improve evaluation. We develop a new metric, PARENT (Precision And Recall of Entailed Ngrams from the Table) (§3). When computing precision, PARENT effectively uses a union of the reference and the table, to reward correct information missing from the reference. When computing recall, it uses an intersection of the reference and the table, to ignore extra incorrect information in the reference. The union and intersection are computed with the help of an entailment model to decide if a text n-gram is entailed by the table. We show that this method is more effective than using the table as an additional reference.

Figure 1: A table from the WikiBio dataset (right), its reference description and three hypothetical generated texts with scores assigned to them by automatic evaluation metrics. Text which cannot be inferred from the table is in red, and text which can be inferred but isn't present in the reference is in green. PARENT is our proposed metric.

Our main contributions are:
• We conduct a large-scale human evaluation of the outputs from 16 table-to-text models on 1100 examples from the WikiBio dataset, many of which have divergent references (§5.2).
• We propose a new metric, PARENT (§3), and show that it improves correlation with human judgments over existing metrics, both when comparing similar systems (such as different hyperparameters of a neural network) and when comparing vastly different systems (such as template-based and neural models).
• We also develop information extraction based metrics, inspired by Wiseman et al. (2017), by training a model to extract tables from the reference texts (§4). We find that these metrics have comparable correlation to PARENT, with the latter being easier to use out of the box.
• We analyze the sensitivity of the metrics to divergence by collecting labels for which references contain only information also present in the tables. We show that PARENT maintains high correlation as the number of such examples is varied (§5.5).
• We also demonstrate the applicability of PARENT on the data released as part of the WebNLG challenge (Gardent et al., 2017), where the references are elicited from humans, and hence are of high quality (§5.6).

Let $\{(T^i, R^i, G^i)\}_{i=1}^{N}$ denote an evaluation set of tables, references and texts generated from a model $M$, and let $R^i_n$, $G^i_n$ denote the collection of n-grams of order $n$ in $R^i$ and $G^i$, respectively. We use $\#_{R^i_n}(g)$ to denote the count of n-gram $g$ in $R^i_n$, and $\#_{G^i_n, R^i_n}(g)$ to denote the minimum of its counts in $R^i_n$ and $G^i_n$. Our goal is to assign a score to the model, which correlates highly with human judgments of the quality of that model.

Divergent References. In this paper we are interested in the case where reference texts diverge from the tables. In Figure 1, the reference, though technically correct and fluent, mentions information which cannot be gleaned from the associated table. It also fails to mention useful information which a generation system might correctly include (e.g. candidate 3 in the figure). We call such references divergent from the associated table. This phenomenon is quite common: in WikiBio we found that 62% of the references mention extra information (§5.5). Divergence is common in human-curated translation datasets as well (Carpuat et al., 2017; Vyas et al., 2018).
How does divergence affect automatic evaluation? As a motivating example, consider the three candidate generations shown in Figure 1. Clearly, candidate 1 is the worst since it "hallucinates" false information, and candidate 3 is the best since it is correct and mentions more information than candidate 2. However, BLEU and ROUGE, which only compare the candidates to the reference, penalize candidate 3 for both excluding the divergent information in the reference (in red) and including correct information from the table (in green) [4]. PARENT, which compares to both the table and reference, correctly ranks the three candidates.

PARENT
PARENT evaluates each instance $(T^i, R^i, G^i)$ separately, by computing the precision and recall of $G^i$ against both $T^i$ and $R^i$.
Entailment Probability. The table is in a semi-structured form, and hence not directly comparable to the unstructured generated or reference texts. To bridge this gap, we introduce the notion of entailment probability, which we define as the probability that the presence of an n-gram $g$ in a text is "correct" given the associated table. We denote this probability as $w(g) = Pr(g \Leftarrow T^i)$. Estimating this probability is in itself a challenging language understanding task, since the information in the table may be expressed in varied forms in text. Here, we describe two simple models of lexical entailment, inspired by work on the Recognizing Textual Entailment Challenge (Dagan et al., 2006). We found these simple models to be effective; while more sophisticated models may be used if there are complex inferences between the table and text, they are beyond the scope of this paper.

1. Word Overlap Model: Let $\bar{T}^i$ denote all the lexical items present in the table $T^i$, including both attribute names and their values. Then, $w(g) = \sum_{j=1}^{n} \mathbb{1}(g_j \in \bar{T}^i) / n$, where $n$ is the length of $g$, and $g_j$ is the $j$th token in $g$.

2. Co-occurrence Model (Glickman and Dagan, 2005): Originally proposed for the RTE task, this model computes the probability of a term $g_j$ in the n-gram being entailed by the table as the maximum of its probabilities of being entailed by each lexical item $v$ in the table:

$$Pr(g_j \Leftarrow T^i) = \max_{v \in \bar{T}^i} Pr(g_j \Leftarrow v). \quad (1)$$

$Pr(g_j \Leftarrow v)$ is estimated using co-occurrence counts from a training set of table-reference pairs. Then the overall probability of the n-gram being entailed is taken as the geometric average [5]:

$$w(g) = \left( \prod_{j=1}^{n} Pr(g_j \Leftarrow T^i) \right)^{1/n}.$$

We note that these models are not sensitive to paraphrases between the table and text. For tasks where this is important, embedding-based similarities may be used, but those are beyond the scope of this paper. Next we discuss how to compute the precision and recall of the generation.

[4] BLEU is usually computed at the corpus-level, however here we show its value for a single sentence purely for illustration purposes. The remaining BLEU scores in this paper are all at the corpus-level.
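The two entailment models can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the estimator Pr(u ⇐ v) ≈ count(u, v) / count(v) is one plausible reading of "co-occurrence counts", and the `floor` argument is an assumption added only to keep the logarithm finite.

```python
from collections import Counter
from math import exp, log


def word_overlap_prob(ngram, table_tokens):
    """Word overlap model: fraction of the n-gram's tokens found in the table."""
    return sum(tok in table_tokens for tok in ngram) / len(ngram)


def cooccurrence_counts(pairs):
    """Collect counts from (table_tokens, reference_tokens) training pairs.

    item_count[v]      = number of pairs whose table contains v
    pair_count[(u, v)] = number of those pairs whose reference also contains u
    """
    item_count, pair_count = Counter(), Counter()
    for table_tokens, ref_tokens in pairs:
        ref_set = set(ref_tokens)
        for v in set(table_tokens):
            item_count[v] += 1
            for u in ref_set:
                pair_count[(u, v)] += 1
    return item_count, pair_count


def cooccurrence_prob(ngram, table_tokens, item_count, pair_count, floor=1e-5):
    """Co-occurrence model: geometric mean over tokens of max_v Pr(g_j <= v)."""
    log_sum = 0.0
    for tok in ngram:
        # Assumed estimator: Pr(u <= v) ~ count(u, v) / count(v).
        best = max(
            (pair_count[(tok, v)] / item_count[v]
             for v in table_tokens if item_count[v] > 0),
            default=0.0,
        )
        log_sum += log(max(best, floor))  # floor avoids log(0); an assumption
    return exp(log_sum / len(ngram))
```

Either function can then serve as the weight w(g) when computing precision and recall; the word overlap variant needs no training data at all.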
Entailed Precision. When computing precision, we want to check what fraction of the n-grams in $G^i_n$ are correct. We consider an n-gram $g$ to be correct either if it occurs in the reference $R^i_n$ [6], or if it has a high probability of being entailed by the table (i.e. $w(g)$ is high). Let $Pr(g \in R^i_n) = \frac{\#_{G^i_n, R^i_n}(g)}{\#_{G^i_n}(g)}$ denote the probability that an n-gram in $G^i_n$ also appears in $R^i_n$. Then, the entailed precision $E^n_p$ for n-grams of order $n$ is given by:

$$E^n_p = \frac{\sum_{g \in G^i_n} \left[ Pr(g \in R^i_n) + Pr(g \notin R^i_n)\, w(g) \right] \#_{G^i_n}(g)}{\sum_{g \in G^i_n} \#_{G^i_n}(g)}. \quad (2)$$

In words, an n-gram receives a reward of 1 if it appears in the reference, with probability $Pr(g \in R^i_n)$, and otherwise it receives a reward of $w(g)$. Both numerator and denominator are weighted by the count of the n-gram in $G^i_n$. $Pr(g \in R^i_n)$ rewards an n-gram for appearing as many times as it appears in the reference, not more. We combine precisions for n-gram orders 1-4 using a geometric average, similar to BLEU:

$$E_p = \exp\left( \sum_{n=1}^{4} \frac{1}{4} \log E^n_p \right). \quad (3)$$

[5] Glickman and Dagan (2005) used a product instead of geometric mean. Here we use a geometric mean to ensure that n-grams of different lengths have comparable probabilities of being entailed.

[6] It is unlikely that an automated system produces the same extra n-gram as present in the reference, thus a match with the reference n-gram is considered positive. For example, in Figure 1, it is highly unlikely that a system would produce "Silkworm" when it is not present in the table.

Entailed Recall. We compute recall against both the reference ($E_r(R^i)$), to ensure proper sentence structure in the generated text, and the table ($E_r(T^i)$), to ensure that texts which mention more information from the table get higher scores (e.g. candidate 3 in Figure 1). These are combined using a geometric average:

$$E_r = E_r(R^i)^{(1-\lambda)}\, E_r(T^i)^{\lambda}. \quad (4)$$

The parameter $\lambda$ trades off how much the generated text should match the reference, versus how much it should cover information from the table.
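The entailed precision of Eqs. 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions: `w` is any callable mapping an n-gram tuple to its entailment probability, the helper names are hypothetical, and replacing zero precisions by a small `eps` is a simplification of the smoothing scheme described later in this section.

```python
from collections import Counter
from math import exp, log


def ngram_counts(tokens, n):
    """Multiset of n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def entailed_precision_n(gen_tokens, ref_tokens, n, w):
    """Eq. 2 for one order n; w maps an n-gram tuple to its entailment probability."""
    gen, ref = ngram_counts(gen_tokens, n), ngram_counts(ref_tokens, n)
    numerator = denominator = 0.0
    for g, count in gen.items():
        p_ref = min(count, ref[g]) / count  # Pr(g in R_n)
        numerator += count * (p_ref + (1.0 - p_ref) * w(g))
        denominator += count
    # Assumption: a text with no n-grams of this order gets precision 0.
    return numerator / denominator if denominator else 0.0


def entailed_precision(gen_tokens, ref_tokens, w, max_order=4, eps=1e-5):
    """Eq. 3: geometric average over orders 1..max_order, zeros replaced by eps."""
    logs = [log(max(entailed_precision_n(gen_tokens, ref_tokens, n, w), eps))
            for n in range(1, max_order + 1)]
    return exp(sum(logs) / max_order)
```

Note how an n-gram that is absent from the reference still earns credit in proportion to w(g), which is what lets correct table content missing from a divergent reference be rewarded.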
The geometric average, which acts as an AND operation, ensures that the overall recall is high only when both the components are high. We found this necessary to assign low scores to bad systems which, for example, copy values from the table without phrasing them in natural language.
When computing $E_r(R^i)$, divergent references will have n-grams with low $w(g)$. We want to exclude these from the computation of recall, and hence their contributions are weighted by $w(g)$:

$$E^n_r(R^i) = \frac{\sum_{g \in R^i_n} \#_{G^i_n, R^i_n}(g)\, w(g)}{\sum_{g \in R^i_n} \#_{R^i_n}(g)\, w(g)}. \quad (5)$$

Similar to precision, we combine recalls for $n = 1$-$4$ using a geometric average to get $E_r(R^i)$.
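A hedged sketch of the reference recall in Eq. 5, again with hypothetical helper names. Returning 1.0 when no reference n-gram carries any entailment weight is an assumption made here for the degenerate case, and `eps` stands in for the smoothing constant.

```python
from collections import Counter
from math import exp, log


def ngram_counts(tokens, n):
    """Multiset of n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def entailed_recall_ref_n(gen_tokens, ref_tokens, n, w):
    """Eq. 5 for one order n: reference n-grams are weighted by w(g)."""
    gen, ref = ngram_counts(gen_tokens, n), ngram_counts(ref_tokens, n)
    numerator = denominator = 0.0
    for g, count in ref.items():
        weight = w(g)
        numerator += min(count, gen[g]) * weight
        denominator += count * weight
    # Assumption: if no reference n-gram is entailed, there is nothing to recall.
    return numerator / denominator if denominator else 1.0


def entailed_recall_ref(gen_tokens, ref_tokens, w, max_order=4, eps=1e-5):
    """Geometric average of Eq. 5 over orders 1..max_order, zeros replaced by eps."""
    logs = [log(max(entailed_recall_ref_n(gen_tokens, ref_tokens, n, w), eps))
            for n in range(1, max_order + 1)]
    return exp(sum(logs) / max_order)
```

In the usage below, the reference tokens that are not supported by the table receive weight 0, so a generation that omits them is not penalized at the unigram level.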
For computing $E_r(T^i)$, note that a table is a set of records $T^i = \{r_k\}_{k=1}^{K}$. For a record $r_k$, let $\bar{r}_k$ denote its string value (such as "Michael Dahlquist" or "December 22 1965"). Then:

$$E_r(T^i) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\bar{r}_k|} LCS(\bar{r}_k, G^i), \quad (6)$$

where $|\bar{r}_k|$ denotes the number of tokens in the value string, and $LCS(x, y)$ is the length of the longest common subsequence between $x$ and $y$. The LCS function, borrowed from ROUGE, ensures that entity names in $\bar{r}_k$ appear in the same order in the text as the table. Higher values of $E_r(T^i)$ denote that more records are likely to be mentioned in $G^i$.

The entailed precision and recall are combined into an F-score to give the PARENT metric for one instance. The system-level PARENT score for a model $M$ is the average of instance-level PARENT scores across the evaluation set:

$$PARENT(M) = \frac{1}{N} \sum_{i=1}^{N} PARENT(G^i, R^i, T^i). \quad (7)$$

Smoothing & Multiple References. The danger with geometric averages is that if any of the components being averaged become 0, the average will also be 0. Hence, we adopt a smoothing technique from Chen and Cherry (2014) that assigns a small positive value $\epsilon$ to any of $E^n_p$, $E^n_r(R^i)$ and $E_r(T^i)$ which are 0. When multiple references are available for a table, we compute PARENT against each reference and take the maximum as its overall score, similar to METEOR (Denkowski and Lavie, 2014).
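The table recall of Eq. 6 and the final combination can be sketched as below. The function names are hypothetical, record values are assumed to be pre-tokenized lists, the F-score is taken to be the usual harmonic mean of precision and recall, and the weight `lam` is passed in explicitly rather than chosen here.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def table_recall(gen_tokens, record_values, eps=1e-5):
    """Eq. 6: mean over records of LCS(value, text) / |value|, smoothed by eps."""
    total = sum(lcs_length(v, gen_tokens) / len(v) for v in record_values)
    return max(total / len(record_values), eps)


def parent_instance(e_p, e_r_ref, e_r_table, lam):
    """Eq. 4 followed by the F-score of entailed precision and recall."""
    e_r = (e_r_ref ** (1.0 - lam)) * (e_r_table ** lam)
    return 2.0 * e_p * e_r / (e_p + e_r)


def parent_multi_reference(scores_per_reference):
    """With several references, keep the best instance-level score."""
    return max(scores_per_reference)


def parent_system(instance_scores):
    """Eq. 7: system-level score as the mean of instance-level scores."""
    return sum(instance_scores) / len(instance_scores)
```

Because the two recalls are multiplied, a text that copies table values without matching the reference at all still ends up with a low overall recall, which is the AND behaviour described above.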
Choosing $\lambda$ and $\epsilon$. To set the value of $\lambda$ we can tune it to maximize the correlation of the metric with human judgments, when such data is available. When such data is not available, we can use the recall of the reference against the table, using Eq. 6, as the value of $1 - \lambda$. The intuition here is that if the recall of the reference against the table is high, it already covers most of the information, and we can assign it a high weight in Eq. 4. This leads to a separate value of $\lambda$ automatically set for each instance. The smoothing constant $\epsilon$ is set to $10^{-5}$ for all experiments.

Evaluation via Information Extraction
Wiseman et al. (2017) proposed to use an auxiliary model, trained to extract structured records from text, for evaluation. However, the extraction model presented in that work is limited to the closed-domain setting of basketball game tables and summaries. In particular, they assume that each table has exactly the same set of attributes for each entity, and that the entities can be identified in the text via string matching. These assumptions are not valid for the open-domain WikiBio dataset, and hence we train our own extraction model to replicate their evaluation scheme.
Our extraction system is a pointer-generator network (See et al., 2017), which learns to produce a linearized version of the table from the text. The network learns which attributes need to be populated in the output table, along with their values. It is trained on the training set of WikiBio. At test time we parse the output strings into a set of (attribute, value) tuples and compare it to the ground truth table. The F-score of this text-to-table system was 35.1%, which is comparable to other challenging open-domain settings (Huang et al., 2017). More details are included in Appendix A.1.
Given this information extraction system, we consider the following metrics for evaluation, along the lines of Wiseman et al. (2017):
• Content Selection (CS): F-score for the (attribute, value) pairs extracted from the generated text compared to those extracted from the reference.
• Relation Generation (RG): Precision for the (attribute, value) pairs extracted from the generated text compared to those in the ground truth table.
• RG-F: Since our task emphasizes the recall of information from the table as well, we consider another variant which computes the F-score of the extracted pairs to those in the table.
We omit the content ordering metric, since our extraction system does not align records to the input text.
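Once (attribute, value) pairs have been extracted, the three metrics reduce to set comparisons. The sketch below assumes exact matching of pairs and uses hypothetical function names; the extraction model itself is outside its scope.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and F-score between two sets of (attribute, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    p = matched / len(predicted) if predicted else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


def content_selection(gen_pairs, ref_pairs):
    """CS: F-score of pairs from the generated text vs. pairs from the reference."""
    return precision_recall_f(gen_pairs, ref_pairs)[2]


def relation_generation(gen_pairs, table_pairs):
    """RG: precision of pairs from the generated text vs. the ground truth table."""
    return precision_recall_f(gen_pairs, table_pairs)[0]


def relation_generation_f(gen_pairs, table_pairs):
    """RG-F: F-score of pairs from the generated text vs. the ground truth table."""
    return precision_recall_f(gen_pairs, table_pairs)[2]
```

CS never looks at the table, which is why it inherits the weaknesses of reference-only metrics when references diverge.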

Experiments & Results
In this section we compare several automatic evaluation metrics by checking their correlation with the scores assigned by humans to table-to-text models. Specifically, given $l$ models $M_1, \ldots, M_l$, and their outputs on an evaluation set, we show these generated texts to humans to judge their quality, and obtain aggregated human evaluation scores for all the models, $\bar{h} = (h_1, \ldots, h_l)$ (§5.2). Next, to evaluate an automatic metric, we compute the scores it assigns to each model, $\bar{a} = (a_1, \ldots, a_l)$, and check the Pearson correlation between $\bar{h}$ and $\bar{a}$ (Graham and Baldwin, 2014) [9].
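The system-level comparison amounts to a single Pearson correlation between two score vectors of length l. A minimal sketch with a hypothetical function name, written without external libraries:

```python
from math import sqrt


def pearson(h, a):
    """Pearson correlation between human scores h and metric scores a."""
    n = len(h)
    mean_h, mean_a = sum(h) / n, sum(a) / n
    cov = sum((x - mean_h) * (y - mean_a) for x, y in zip(h, a))
    var_h = sum((x - mean_h) ** 2 for x in h)
    var_a = sum((y - mean_a) ** 2 for y in a)
    return cov / sqrt(var_h * var_a)
```

With only a handful of systems per category, each correlation is computed from very few points, which is one reason the later analysis relies on bootstrap resampling.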

Data & Models
Our main experiments are on the WikiBio dataset (Lebret et al., 2016), which is automatically constructed and contains many divergent references. In §5.6 we also present results on the data released as part of the WebNLG challenge.
We developed several models of varying quality for generating text from the tables in WikiBio. This gives us a diverse set of outputs to evaluate the automatic metrics on. Table 1 lists the models along with their hyperparameter settings and their scores from the human evaluation (§5.2). Our focus is primarily on neural sequence-to-sequence methods, since these are the most widely used.

We divide these models into two categories and measure correlation separately for both the categories. The first category, WikiBio-Systems, includes one model each from the four families listed in Table 1. This category tests whether a metric can be used to compare different model families with a large variation in the quality of their outputs. The second category, WikiBio-Hyperparams, includes 13 different hyperparameter settings of PG-Net (See et al., 2017), which was the best performing system overall. Nine of these were obtained by varying the beam size and length normalization penalty of the decoder network (Wu et al., 2016), and the remaining four were obtained by re-scoring beams of size 8 with the information extraction model described in §4. All the models in this category produce high quality fluent texts, and differ primarily on the quantity and accuracy of the information they express. Here we are testing whether a metric can be used to compare similar systems with a small variation in performance. This is an important use-case as metrics are often used to tune hyperparameters of a model.

[9] We observed similar trends for Spearman correlation.

Human Evaluation
We collected human judgments on the quality of the 16 models trained for WikiBio, plus the reference texts. Workers on a crowd-sourcing platform, proficient in English, were shown a table with pairs of generated texts, or a generated text and the reference, and asked to select the one they prefer. Figure 2 shows the instructions they were given. Paired comparisons have been shown to be superior to rating scales for comparing generated texts (Callison-Burch et al., 2007). However, for measuring correlation the comparisons need to be aggregated into real-valued scores, $\bar{h} = (h_1, \ldots, h_l)$, for each of the $l = 16$ models. For this, we use Thurstone's method (Tsukida and Gupta, 2011), which assigns a score to each model based on how many times it was preferred over an alternative.

Figure 2: Instructions to crowd-workers for comparing two generated texts.
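One common way to turn pairwise preferences into real-valued scores is a Thurstone Case V style estimate, sketched below. Which variant of Thurstone's method was used in the study is not stated here, so this is only an assumed illustration: the function name, the `wins` matrix layout and the `clip` parameter are all hypothetical.

```python
from statistics import NormalDist


def thurstone_scores(wins, clip=0.01):
    """Score each model from a matrix of pairwise preference counts.

    wins[i][j] is the number of times model i was preferred over model j.
    Each empirical preference rate is mapped through the inverse normal CDF
    and the resulting z-values are averaged per model.
    """
    inv_cdf = NormalDist().inv_cdf
    scores = []
    for i in range(len(wins)):
        z_values = []
        for j in range(len(wins)):
            total = wins[i][j] + wins[j][i]
            if i == j or total == 0:
                continue  # skip self-comparisons and pairs never compared
            # Clip so that rates of exactly 0 or 1 stay finite (an assumption).
            rate = min(max(wins[i][j] / total, clip), 1.0 - clip)
            z_values.append(inv_cdf(rate))
        scores.append(sum(z_values) / len(z_values) if z_values else 0.0)
    return scores
```

Only the relative order and spacing of the resulting scores matter for the correlation analysis, since Pearson correlation is invariant to shifting and scaling.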
The data collection was performed separately for models in the WikiBio-Systems and WikiBio-Hyperparams categories. 1100 tables were sampled from the development set, and for each table we got 8 different sentence pairs annotated across the two categories, resulting in a total of 8800 pairwise comparisons. Each pair was judged by one worker only which means there may be noise at the instance-level, but the aggregated system-level scores had low variance (cf. Table 1). In total around 500 different workers were involved in the annotation. References were also included in the evaluation, and they received a lower score than PG-Net, highlighting the divergence in WikiBio.

Text & Table:
We compare a variant of BLEU, denoted as BLEU-T, where the values from the table are used as additional references. BLEU-T draws inspiration from iBLEU (Sun and Zhou, 2012), but rewards n-grams which match the table rather than penalizing them. For PARENT, we compare both the word-overlap model (PARENT-W) and the co-occurrence model (PARENT-C) for determining entailment. We also compare versions where a single λ is tuned on the entire dataset to maximize correlation with human judgments, denoted as PARENT*-W/C.

Correlation Comparison
We use bootstrap sampling (500 iterations) over the 1100 tables for which we collected human annotations to get an idea of how the correlation of each metric varies with the underlying data. In each iteration, we sample, with replacement, tables along with their references and all the generated texts for that table. Then we compute aggregated human evaluation and metric scores for each of the models and compute the correlation between the two. We report the average correlation across all bootstrap samples for each metric in Table 2. The distribution of correlations for the best performing metrics are shown in Figure 3.

Table 2 also indicates whether PARENT is significantly better than a baseline metric. Graham and Baldwin (2014) suggest using the Williams test for this purpose, but since we are computing correlations between only 4 or 13 systems at a time, this test has very weak power in our case. Hence, we use the bootstrap samples to obtain a $1 - \alpha$ confidence interval of the difference in correlation between PARENT and any other metric and check whether this is above 0 (Wilcox, 2016).

Correlations are higher for the systems category than the hyperparams category. The latter is a more difficult setting since very similar models are compared, and hence the variance of the correlations is also high. Commonly used metrics which only rely on the reference (BLEU, ROUGE, METEOR, CIDEr) have only weak correlations with human judgments. In the hyperparams category, these are often negative, implying that tuning models based on these may lead to selecting worse models. BLEU performs the best among these, and adding n-grams from the table as references improves this further (BLEU-T).
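The bootstrap procedure can be sketched as follows. This is a simplified illustration, not the study's pipeline: it assumes per-table scores that can simply be averaged into system-level scores (the human scores in the paper are instead aggregated from pairwise comparisons), it uses a two-sided percentile interval, and all function names are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def system_scores(per_table, indices):
    """Average per-table scores (rows: tables, columns: models) over sampled rows."""
    n_models = len(per_table[0])
    return [sum(per_table[i][k] for i in indices) / len(indices) for k in range(n_models)]


def bootstrap_difference_ci(human, metric_a, metric_b, iterations=500, alpha=0.05, seed=0):
    """Percentile interval for corr(human, metric_a) - corr(human, metric_b)."""
    rng = random.Random(seed)
    n_tables = len(human)
    diffs = []
    for _ in range(iterations):
        idx = [rng.randrange(n_tables) for _ in range(n_tables)]  # with replacement
        h = system_scores(human, idx)
        diffs.append(pearson(h, system_scores(metric_a, idx))
                     - pearson(h, system_scores(metric_b, idx)))
    diffs.sort()
    lower = diffs[int(alpha / 2 * iterations)]
    upper = diffs[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper
```

Metric A would be judged significantly better than metric B at level alpha when the lower end of the interval lies above 0.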
Among the extractive evaluation metrics, CS, which also only relies on the reference, has poor correlation in the hyperparams category. RG-F, and both variants of the PARENT metric achieve the highest correlation for both settings. There is no significant difference among these for the hyperparams category, but for systems, PARENT-W is significantly better than the other two. While RG-F needs a full information extraction pipeline in its implementation, PARENT-C only relies on co-occurrence counts, and PARENT-W can be used out-of-the-box for any dataset. To our knowledge, this is the first rigorous evaluation of using information extraction for generation evaluation.
On this dataset, the word-overlap model showed higher correlation than the co-occurrence model for entailment. In §5.6 we will show that for the WebNLG dataset, where more paraphrasing is involved between the table and text, the opposite is true. Lastly, we note that the heuristic for selecting λ is sufficient to produce high correlations for PARENT; however, if human annotations are available, λ can be tuned to produce significantly higher correlations (PARENT*-W/C).

Analysis
In this section we further analyze the performance of PARENT-W under different conditions, and compare to the other best metrics from Table 2.
Effect of Divergence. To study the correlation as we vary the number of divergent references, we also collected binary labels from workers for whether a reference is entailed by the corresponding table. We define a reference as entailed when it mentions only information which can be inferred from the table. Each table and reference pair was judged by 3 independent workers, and we used the majority vote as the label for that pair. Overall, only 38% of the references were labeled as entailed by the table. Fleiss' κ was 0.30, which indicates a fair agreement. We found the workers sometimes disagreed on what information can be reasonably entailed by the table. Figure 4 shows the correlations as we vary the percent of entailed examples in the evaluation set of WikiBio. Each point is obtained by fixing the desired proportion of entailed examples, and sampling subsets from the full set which satisfy this proportion. PARENT and RG-F remain stable and show a high correlation across the entire range, whereas BLEU and BLEU-T vary a lot. In the hyperparams category, the latter two have the worst correlation when the evaluation set contains only entailed examples, which may seem surprising. However, on closer examination we found that this subset tends to omit a lot of information from the tables. Systems which produce more information than these references are penalized by BLEU, but not in the human evaluation. PARENT overcomes this issue by measuring recall against the table in addition to the reference.
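The label aggregation and agreement statistic used for the entailment labels can be sketched as below. Fleiss' κ follows its standard definition; the input layout (one row of per-category vote counts per item) and the function names are assumptions for illustration.

```python
def majority_label(votes):
    """Majority vote over binary labels (True = reference entailed by the table)."""
    return sum(votes) * 2 > len(votes)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings[i][c] is the number of raters who assigned item i to category c.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Overall proportion of assignments falling in each category.
    p_cat = [sum(row[c] for row in ratings) / (n_items * n_raters)
             for c in range(n_categories)]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_bar = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_bar - p_expected) / (1.0 - p_expected)
```

With three workers and two categories, each row would be of the form [entailed votes, not-entailed votes] and sum to 3.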
Ablation Study. We check how different components in the computation of PARENT contribute to its correlation to human judgments. Specifically, we remove the probability $w(g)$ of an n-gram $g$ being entailed by the table from Eqs. 2 and 5 [11]. The average correlation for PARENT-W drops to 0.168 in this case. We also try a variant of PARENT with λ = 0, which removes the contribution of the table to the recall (Eq. 4). [...]

[...] (2018) point out that hill-climbing on an automatic metric is meaningless if that metric has a low instance-level correlation to human judgments. In Table 3 we show the average accuracy of the metrics in making the same judgments as humans between pairs of generated texts. Both variants of PARENT are significantly better than the other metrics; however, the best accuracy is only 60% for the binary task. This is a challenging task, since there are typically only subtle differences between the texts. Achieving higher instance-level accuracies will require more sophisticated language understanding models for evaluation.

Table 3: Accuracy on making the same judgments as humans between pairs of generated texts. * p < 0.01, † p < 0.05, ‡ p < 0.10: accuracy is significantly higher than the next best accuracy to the left, using a paired McNemar's test.

BLEU     BLEU-T    RG-F      PARENT-W   PARENT-C
0.556    0.567*    0.588*    0.598‡     0.606†
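The instance-level comparison can be sketched as a pairwise accuracy together with a paired significance test. The exact McNemar variant used is not specified, so the two-sided exact binomial form below is an assumption, as are the function names and the convention that a tie in metric scores counts as preferring the second text.

```python
from math import comb


def pairwise_accuracy(judgments):
    """Fraction of pairs where the metric prefers the same text as the human.

    judgments: list of (score_a, score_b, human_prefers_a) tuples.
    Ties in metric score are treated as preferring b (an assumption).
    """
    correct = sum((score_a > score_b) == human_prefers_a
                  for score_a, score_b, human_prefers_a in judgments)
    return correct / len(judgments)


def mcnemar_exact(only_first_correct, only_second_correct):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_first_correct + only_second_correct
    if n == 0:
        return 1.0
    k = min(only_first_correct, only_second_correct)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Only the discordant pairs, where exactly one of the two metrics agrees with the human judgment, enter the test.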

WebNLG Dataset
To check how PARENT correlates with human judgments when the references are elicited from humans (and less likely to be divergent), we check its correlation with the human ratings provided for the systems competing in the WebNLG challenge (Gardent et al., 2017). The task is to generate text describing 1-5 RDF triples (e.g. John E Blaha, birthPlace, San Antonio), and human ratings were collected for the outputs of 9 participating systems on 223 instances. These systems include a mix of pipelined, statistical and neural methods. Each instance has up to 3 reference texts associated with the RDF triples, which we use for evaluation. The human ratings were collected on 3 distinct aspects: grammaticality, fluency and semantics, where semantics corresponds to the degree to which a generated text agrees with the meaning of the underlying RDF triples. We report the correlation of several metrics with these ratings in Table 4. Both variants of PARENT are either competitive or better than the other metrics in terms of the average correlation to all three aspects. This shows that PARENT is applicable for high quality references as well.

[11] When computing precision we set w(g) = 0, and when computing recall we set w(g) = 1 for all g.
While BLEU has the highest correlation for the grammar and fluency aspects, PARENT does best for semantics. This suggests that the inclusion of source tables into the evaluation orients the metric more towards measuring the fidelity of the content of the generation. A similar trend is seen comparing BLEU and BLEU-T. As modern neural text generation systems are typically very fluent, measuring their fidelity is of increasing importance. Between the two entailment models, PARENT-C is better due to its higher correlation with the grammaticality and fluency aspects.
Distribution of λ. The λ parameter in the calculation of PARENT decides whether to compute recall against the table or the reference (Eq. 4). [...] the recall of the generated text relies more on the reference.

Related Work
Over the years several studies have evaluated automatic metrics for measuring text generation performance (Callison-Burch et al., 2006; Stent et al., 2005; Belz and Reiter, 2006; Reiter, 2018; Kilickaya et al., 2017; Gatt and Krahmer, 2018). The only consensus from these studies seems to be that no single metric is suitable across all tasks. A recurring theme is that metrics like BLEU and NIST (Doddington, 2002) are not suitable for judging content quality in NLG. Recently, Novikova et al. (2017a) did a comprehensive study of several metrics on the outputs of state-of-the-art NLG systems, and found that while they showed acceptable correlation with human judgments at the system level, they failed to show any correlation at the sentence level. Ours is the first study which checks the quality of metrics when table-to-text references are divergent. We show that in this case even system level correlations can be unreliable.
Hallucination (Rohrbach et al., 2018; Lee et al., 2018) refers to when an NLG system generates text which mentions information beyond what is present in the source from which it is generated. Divergence can be viewed as hallucination in the reference text itself. PARENT deals with hallucination by discounting n-grams which do not overlap with either the reference or the table.
PARENT draws inspiration from iBLEU (Sun and Zhou, 2012), a metric for evaluating paraphrase generation, which compares the generated text to both the source text and the reference.
While iBLEU penalizes texts which match the source, here we reward such texts since our task values accuracy of generated text more than the need for paraphrasing the tabular content (Liu et al., 2010). Similar to SARI for text simplification (Xu et al., 2016) and Q-BLEU for question generation (Nema and Khapra, 2018), PARENT falls under the category of task-specific metrics.

Conclusions
We study the automatic evaluation of table-to-text systems when the references diverge from the table. We propose a new metric, PARENT, which shows the highest correlation with humans across a range of settings with divergent references in WikiBio. We also perform the first empirical evaluation of information extraction based metrics (Wiseman et al., 2017), and find RG-F to be effective. Lastly, we show that PARENT is comparable to the best existing metrics when references are elicited by humans on the WebNLG data.

A.1 Information Extraction System
For evaluation via information extraction (Wiseman et al., 2017) we train a model for WikiBio which accepts text as input and generates a table as the output. Tables in WikiBio are open-domain, without any fixed schema for which attributes may be present or absent in an instance. Hence we employ the Pointer-Generator Network (PG-Net) (See et al., 2017) for this purpose. Specifically, we use a sequence-to-sequence model, whose encoder and decoder are both single-layer bi-directional LSTMs. The decoder is augmented with an attention mechanism over the states of the encoder. Further, it also uses a copy mechanism to optionally copy tokens directly from the source text. We do not use the coverage mechanism of See et al. (2017) since that is specific to the task of summarization they study. The decoder is trained to produce a linearized version of the table where the rows and columns are flattened into a sequence, and separated by special tokens. Figure 6 shows an example.

Figure 6: An input-output pair for the information extraction system. <R> and <C> are special symbols used to separate (attribute, value) pairs and attributes from values, respectively. Input text: michael dahlquist ( december 22 , 1965 - july 14 , 2005 ) was a drummer in the seattle band silkworm .

Clearly, since the references are divergent, the model cannot be expected to produce the entire table, and we see some false information being hallucinated after training. Nevertheless, as we show in §5.4, this system can be used for evaluating generated texts. After training, we can parse the output sequence along the special tokens <R> and <C> to get a set of (attribute, value) pairs. Table 5 shows the precision, recall and F-score of these extracted pairs against the ground truth tables, where the attributes and values are compared using an exact string match.
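Parsing the decoder output back into (attribute, value) pairs can be sketched as below. The exact linearization is an assumption here: the sketch supposes a layout of the form "attribute <C> value <R> attribute <C> value", and the function name and the attribute names in the usage example are hypothetical.

```python
def parse_linearized_table(output):
    """Split a linearized table on <R> and <C> into a set of (attribute, value) pairs."""
    pairs = set()
    for record in output.split("<R>"):
        if "<C>" not in record:
            continue  # skip malformed records without an attribute/value separator
        attribute, value = record.split("<C>", 1)
        attribute, value = attribute.strip(), value.strip()
        if attribute and value:
            pairs.add((attribute, value))
    return pairs
```

For example, `parse_linearized_table("name <C> michael dahlquist <R> occupation <C> drummer")` yields the two pairs for name and occupation, which can then be compared to the ground truth table by exact string match.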

A.2 Hyperparameters
After tuning we found the same set of hyperparameters to work well for both the table-to-text PG-Net, and the inverse information extraction PG-Net. The hidden state size of the biLSTMs was set to 200. The input and output vocabularies were set to the 50,000 most common words in the corpus, with additional special symbols for table attribute names (such as "birth-date"). The embeddings of the tokens in the vocabulary were initialized with GloVe (Pennington et al., 2014). A learning rate of 0.0003 was used during training, with the Adam optimizer, and a dropout of 0.2 was also applied to the outputs of the biLSTM. Models were trained until the loss on the dev set stopped dropping. The maximum length of a decoded text was set to 40 tokens, and that of the tables was set to 120 tokens. Various beam sizes and length normalization penalties were applied for the table-to-text system, which are listed in the main paper. For the information extraction system, we found a beam size of 8 and no length penalty to produce the highest F-score on the dev set. Table 6 shows some sample references and the corresponding predictions from the best performing model, PG-Net for WikiBio.

Table 6: Sample references and predictions from PG-Net with beam size 8. Information which is absent from the reference, but can be inferred from the table is in bold. Information which is present in the reference, but cannot be inferred from the table is in italics.