Towards Comprehensive Description Generation from Factual Attribute-value Tables

Comprehensive descriptions for factual attribute-value tables, which should be accurate, informative and loyal, can be very helpful for end users to understand structured data in this form. However, previous neural generators might suffer from missing key attributes, less informative content and groundless information, which impede the generation of high-quality comprehensive descriptions for tables. To relieve these problems, we first propose a force attention (FA) method that encourages the generator to pay more attention to the uncovered attributes and thus avoid missing key attributes. Furthermore, we propose reinforcement learning for information richness to generate more informative as well as more loyal descriptions for tables. In our experiments, we utilize the widely used WIKIBIO dataset as a benchmark. In addition, we create WB-filter based on WIKIBIO to test our model in simulated user-oriented scenarios, in which the generated descriptions should accord with particular user interests. Experimental results show that our model outperforms the state-of-the-art baselines on both automatic and human evaluation.


Introduction
Generating descriptions for factual attribute-value tables has attracted wide interest among NLP researchers, especially in a neural end-to-end fashion (e.g., Lebret et al. (2016); Bao et al. (2018); Puduppully et al. (2018); Li and Wan (2018)), as shown in Fig 1a. For broader potential applications in this field, we also simulate user-oriented generation, whose goal is to provide comprehensive generation for the attributes selected according to particular user interests, as in Fig 1b. However, we find that previous models might miss key information and generate less informative and groundless content in their descriptions of the source tables.

Table 1:
Attribute | Value
Birthplace | Utah, America
Position | forward (soccer player)
Comprehensive: A Utah soccer player who plays as forward
Missing Key Attri.: A soccer player who plays as forward
Groundless info: A Utah forward in the national team
Less Informative: An American forward

For example, in Table 1, the 'missing key attribute' case doesn't mention where the player comes from (birthplace), while the 'less informative' one chooses American rather than Utah. The case with groundless information contains 'in the national team', which is not mentioned in the source attributes. Although the 'key points missing' problem exists in many text-to-text and data-to-text datasets, for large-scale structured tables with vast heterogeneous attributes such as Wikipedia infoboxes, the 'key attribute missing' and 'less informative' problems might be even more challenging. This is because the key attributes, like the 'position' of a basketball player or the 'political party' of a senator, are very likely to be unique features of particular tables; they usually appear much less frequently and are mentioned less often than common attributes like 'Name' and 'Birthdate'. The 'groundless information' problem, which is also known as the 'hallucination' problem, remains a long-standing problem in NLG.
In this paper, we show that our model can generate more accurate and informative descriptions with less groundless content for tables. Firstly, we design a force-attention (FA) method that encourages the decoder to pay more attention to the uncovered attributes, through both stepwise and global constraints, to avoid missing key attributes.

Figure 1: The end-to-end (a) and user-oriented (b) table-to-text generation for an infobox (left) in WIKIBIO. In panel (b), the attributes selected by users are Name, Current Club and Position, and the generated description follows the pattern 'Name played as a Position in Current Club'.

In addition, we define the 'information richness' of the generated descriptions with respect to the source tables. Based on that, we use reinforcement learning to encourage the generator to cover infrequent and rarely mentioned attributes as well as to generate more informative descriptions with less groundless content. We test our models in two settings:
1) For neural table-to-text generation like Fig 1a, we test our model on WIKIBIO (Lebret et al., 2016), a dataset crawled from Wikipedia with paired infoboxes and associated descriptions. It is a widely used benchmark for description generation from factual attribute-value tables and also a meaningful testbed for real-world scenarios with vast and heterogeneous attributes.
2) To test our model in the user-oriented setting, we filter WIKIBIO to form WB-filter. In this setting, we suppose that all attributes in the source tables of WB-filter are selected by users and should be covered in the corresponding descriptions. We try to make sure that the gold descriptions in WB-filter cover all the attributes of the source tables under this condition. Details are given in Sec 4.
Both automatic and human evaluation show that our model relieves the 3 problems (Table 1) and helps the generator to produce accurate, informative and loyal descriptions. We also achieve the state-of-the-art performance on the end-to-end table description and the user-oriented generation tasks.
The remainder of this paper is organized as follows. We first introduce how we formulate table-to-text generation in the encoder-decoder framework in Sec 2. After that, we discuss the force-attention method (Sec 3.1) and richness-oriented reinforcement learning (Sec 3.2), which are motivated by the three goals we set up for comprehensive table descriptions (Table 1). Then we demonstrate how and why we create WB-filter (Sec 4.1), followed by evaluations (Sec 4.2), experimental configurations (Sec 4.3 and 4.4), case studies and visualizations (Sec 4.5) and error analysis (Sec 4.6).
Background: Table-to-Description

Table Encoder
Given a structured table like Fig 1 (left), we model the attribute-value tuples in the table as a sequence of words with related attribute names. After serializing all the words in the 'Value' columns, for the i-th word $x_i^{a_k}$ in the table, whose attribute is $a_k$ (the k-th attribute), we use the attribute name $a_k$ and the word's position in that tuple to locate the word (Lebret et al., 2016). Specifically, we utilize a triple $z_i^{a_k} = \{a_k, p_{i+}^{a_k}, p_{i-}^{a_k}\}$ to represent the structure information for word $x_i^{a_k}$, in which $p_{i+}^{a_k}$ and $p_{i-}^{a_k}$ are the positions of $x_i^{a_k}$ counted from the beginning and the end of $a_k$, respectively. For example, for the 'Birthplace' attribute in Fig 1 (left), we can use the triples {birthplace, 1, 4} and {birthplace, 4, 1} to represent the structure information for the words 'Durban' and 'Africa'. We concatenate the word $x_t$ and its structure representation $z_t$ at the t-th time step and feed them into an LSTM (Hochreiter and Schmidhuber, 1997) encoder, $h_t = \mathrm{LSTM}([x_t; z_t], h_{t-1})$, where $h_t$ is the t-th hidden state among the encoder states $H = \{h_t\}_{t=1}^{T}$. In the following sections, we might omit the superscript of $x_i^{a_k}$ if it is not necessary.
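As a minimal sketch of the structure triples described above, the snippet below serializes a table into (word, triple) pairs. The function name `structure_triples` and the four-token tokenization of the 'Birthplace' value are assumptions made for illustration; only the triple layout (attribute, position from the beginning, position from the end) comes from the text.

```python
from typing import Dict, List, Tuple

# (attribute name, 1-based position from the start, 1-based position from the end)
Triple = Tuple[str, int, int]


def structure_triples(table: Dict[str, List[str]]) -> List[Tuple[str, Triple]]:
    """Serialize a table into (word, structure triple) pairs, attribute by attribute."""
    serialized = []
    for attribute, words in table.items():
        n = len(words)
        for i, word in enumerate(words, start=1):
            # Position counted from the beginning is i; from the end it is n - i + 1.
            serialized.append((word, (attribute, i, n - i + 1)))
    return serialized


# Hypothetical tokenization of the 'Birthplace' value (four tokens, assumed).
table = {"birthplace": ["Durban", ",", "South", "Africa"]}
pairs = structure_triples(table)
for word, triple in pairs:
    print(word, triple)
```

With this assumed tokenization, the first and last pairs correspond to the {birthplace, 1, 4} and {birthplace, 4, 1} triples mentioned in the example.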

Description Decoder
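For the generated description $y^*$, the generated token $y_t^*$ at the t-th time step is predicted based on all the previously generated tokens $y_{<t}^*$ and the hidden states $H$ of the table encoder:

$$P(y_t^* \mid H, y_{<t}^*) = \mathrm{softmax}(W_s \odot \tanh(W_t [s_t; c_t])) \quad (1)$$

where $\odot$ is the element-wise product and $s_t = \mathrm{LSTM}(y_{t-1}^*, s_{t-1})$ is the t-th hidden state of the decoder. The context vector $c_t$ is the attention-weighted sum of the encoder states:

$$c_t = \sum_{i=1}^{T} \alpha_t^i h_i, \qquad \alpha_t^i = \frac{\exp(g(s_t, h_i))}{\sum_{j=1}^{T} \exp(g(s_t, h_j))}, \qquad g(s_t, h_i) = \tanh(W_p h_i + W_q s_t + b) \quad (2)$$

where $g(s_t, h_i)$ is a relevance score between $s_t$ and $h_i$. We use the Bahdanau-style attention mechanism (Bahdanau et al., 2014) to calculate $g(s_t, h_i)$.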

Comprehensive Table Description
The problems listed in Table 1 not only prevent generators from producing comprehensive descriptions for selected entries in the tables (Fig 1b), but also prevent them from producing informative, accurate and loyal table descriptions (Fig 1a). We therefore propose two methods, force-attention (FA) and richness-oriented reinforcement learning, to produce accurate, informative and loyal descriptions.

Force-Attention Module
For the 'missing key attributes' problem (Table 1), we find that the generator usually focuses on particular attributes while the other attributes have relatively low attention values throughout the decoding procedure. The force attention method is therefore proposed to guide the decoder to pay more attention to the previously uncovered attributes with low attention values, to avoid missing key attributes. Note that the FA method focuses on attribute-level coverage rather than word-level coverage (Tu et al., 2016), as our goal is to reduce the 'missing key attributes' phenomenon instead of building a rigid word-by-word alignment between tables and descriptions.
Stepwise Forcing Attention: We define the attribute-level attention $\beta_t^{a_k} = \mathrm{avg}_{x_i \in a_k}(\alpha_t^i)$ at the t-th step for attribute $a_k$ as the average of the word-level attention values of the words in that attribute. The word-level coverage $\theta_t^i$ is defined as the sum of the attention values on the i-th word before the t-th step (Tu et al., 2016). In a similar way, we define the attribute-level coverage $\gamma_t^{a_k} = \gamma_{t-1}^{a_k} + \beta_t^{a_k}$ as the overall attention for attribute $a_k$ before the t-th time step. The average word-level and attribute-level coverage are $\bar{\theta}_t^i = \theta_t^i / t$ and $\bar{\gamma}_t^{a_k} = \gamma_t^{a_k} / t$, respectively. Then we propose stepwise attention forcing, which explicitly guides the decoder to pay more attention to the uncovered attributes by calculating a new context vector $\tilde{c}_t = \pi c_t + (1 - \pi) v_t$ that compensates for the attributes ignored in the previous time steps. Here $\pi$ is a learnable vector and $v_t$ is a compensation vector for the low-coverage attributes (Eq 3), in which $\zeta_t$ is the modified average word-level coverage that takes the average attribute-level coverage as its upper bound to avoid excessive compensation.
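The attribute-level quantities above can be illustrated with a small sketch. It only covers the attribute-level attention and the averaged attribute-level coverage; the compensation vector is not reproduced here. The function names, the toy attention weights and the mapping `attr_of_word` (which attribute each table word belongs to) are hypothetical.

```python
from typing import Dict, List


def attribute_attention(alpha_t: List[float], attr_of_word: List[str]) -> Dict[str, float]:
    """Attribute-level attention: mean word-level attention over each attribute's words."""
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for attribute, weight in zip(attr_of_word, alpha_t):
        sums[attribute] = sums.get(attribute, 0.0) + weight
        counts[attribute] = counts.get(attribute, 0) + 1
    return {a: sums[a] / counts[a] for a in sums}


def average_attribute_coverage(alphas: List[List[float]], attr_of_word: List[str]) -> Dict[str, float]:
    """Accumulate attribute-level attention over the decoding steps and divide by the step count."""
    gamma: Dict[str, float] = {}
    for alpha_t in alphas:
        for attribute, beta in attribute_attention(alpha_t, attr_of_word).items():
            gamma[attribute] = gamma.get(attribute, 0.0) + beta
    steps = len(alphas)
    return {a: g / steps for a, g in gamma.items()}


# Toy example: three table words, two decoding steps (made-up attention weights).
attr_of_word = ["name", "name", "position"]
alphas = [[0.5, 0.5, 0.0], [0.25, 0.25, 0.5]]
avg_cov = average_attribute_coverage(alphas, attr_of_word)
least_covered = min(avg_cov, key=avg_cov.get)
print(avg_cov, least_covered)
```

In this toy run the 'position' attribute ends up with the lowest average coverage, which is the kind of attribute the stepwise forcing is meant to compensate.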
Fig 2 shows a running example. The motivation is that we want the decoder to pay enough attention to all the attributes over the whole decoding process, which prevents key attributes from being missed because of low attention values on them. Thus we compensate for the previously uncovered attributes (like 'currentclub' and 'position' in Fig 2) with $v_t$ at the t-th time step.
Global Forcing Attention: Inspired by the soft-attention constraint of Xu et al. (2015), which encourages the generator to pay equal attention to every part of the image while generating image captions, we propose global forcing attention to avoid insufficient or excessive attention on certain attributes by adding a loss term (Eq 4) to the primary seq2seq loss. In Eq 4, K is the number of attributes in the table and λ is a hyper-parameter, which is set to 0.3 based on evaluations on the validation data. $\bar{\gamma}_{-1}^{a_k}$ is the average attribute-level coverage for attribute $a_k$ at the last time step.

Reinforced Richness-oriented Learning
We also propose a reinforcement learning framework which encourages the generator to cover rare and seldom mentioned words and attributes in the table. The experiments and case studies show its effectiveness in dealing with the 'groundless information' and 'less informative' problems in Table 1.

Information Richness
The information richness (Eq 5) is the product of the attribute-level and word-level richness of the descriptions with respect to the source tables.
Attribute-level Information Richness: Tables that describe different objects are characterized by their unique attributes. For example, a sportsman often has attributes like 'position' and 'debutyear'. The information in the unique attributes is harder to capture than that in common attributes like 'name' and 'birthdate', as the latter are very frequent in the training set. We define the information richness of an attribute $a_k$ as $f(a_k) = [\mathrm{freq}(a_k)]^{-1}$ by calculating its frequency in the training set.
Word-level Information Richness: The unique words in the tables are more likely to be informative, such as a specific location, name or book. To calculate the word-level information richness, we first lemmatize all the words in the tables and filter them with a stop-word list which includes prepositions, symbols, numbers, etc. Then we randomly sample 5 synonyms of each word from WordNet (Miller, 1995). Finally, we calculate the word-level richness $w(x_i^{a_k})$ for the i-th word in attribute $a_k$ by averaging the tf-idf values of $x_i^{a_k}$ and its synonyms in the training set.
For a generated description $y^*$, we lemmatize all the words in $y^*$ to get $\tilde{y}^*$. Then we calculate the information richness (Eq 5) based on the related source table with T words and on the gold description $y$, respectively. In Eq 5, $\tilde{x}_i^{a_k}$ represents any word among $x_i^{a_k}$ and its synonyms in the table. The information richness measures the ratio of the information in the table that is covered by the description.
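One possible reading of this measure can be sketched as follows. This is only an illustration under stated assumptions, not the paper's exact formula: it assumes that each table word contributes the product of its attribute-level richness (inverse attribute frequency) and its word-level richness, and that the score is the covered share of that total. The function name, the frequencies and the word-level weights are made up.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (attribute, lemmatized table word, sampled synonyms of that word)
TableWord = Tuple[str, str, Set[str]]


def information_richness(
    table: List[TableWord],
    description_lemmas: Iterable[str],
    attr_freq: Dict[str, float],
    word_richness: Dict[str, float],
) -> float:
    """Assumed form: covered weight / total weight, with weight = (1 / freq) * word richness."""
    desc = set(description_lemmas)
    covered = 0.0
    total = 0.0
    for attribute, word, synonyms in table:
        weight = (1.0 / attr_freq[attribute]) * word_richness[word]
        total += weight
        # A table word counts as covered if it or any of its synonyms is in the description.
        if word in desc or desc & synonyms:
            covered += weight
    return covered / total if total > 0 else 0.0


# Hypothetical values loosely based on the Table 1 example.
table = [
    ("birthplace", "utah", set()),
    ("birthplace", "america", {"usa"}),
    ("position", "forward", set()),
]
attr_freq = {"birthplace": 100.0, "position": 10.0}             # assumed training-set frequencies
word_richness = {"utah": 2.0, "america": 1.0, "forward": 1.0}   # assumed averaged tf-idf values

rich_full = information_richness(
    table, ["utah", "soccer", "player", "play", "forward"], attr_freq, word_richness
)
rich_less = information_richness(table, ["american", "forward"], attr_freq, word_richness)
print(rich_full, rich_less)
```

Under these assumptions, the description that mentions 'utah' scores higher than the one that only mentions 'forward', matching the intuition that rarer attributes and words carry more information.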

Reinforcement Learning
Reward Function: Different from previous models, which only measure how well the generated sentences match the target sentences, we design a mixed reward $R_{mix}$ (Eq 6) which contains both the BLEU-4 score and the information richness of the generated descriptions with respect to the source tables. The mixing weight λ in Eq 6 is set to 0.4 and 0.6 for WIKIBIO and WB-filter, respectively, based on evaluations on the validation data. Fig 6 shows how we choose λ.
Training Algorithm: We use the REINFORCE algorithm (Williams, 1992) to learn an agent that maximizes the reward function $R_{mix}$. The training loss of sequence generation is defined as the negative expected reward:

$$L_{rl} = -\mathbb{E}_{y^s \sim P_\phi}[r(y^s)] \quad (7)$$

where $P_\phi(y^s)$ is the agent's policy, i.e., the word distribution of the description decoder (Eq 1), and $r(\cdot)$ is the reward function defined in Eq 6. In the implementation, $y^s = \{y_1^s, y_2^s, \cdots, y_{|Y|}^s\}$ is a sequence sampled from $P_\phi$ by Monte-Carlo sampling. The policy gradient for Eq 7 can be calculated as:

$$\nabla_\phi L_{rl} = -\mathbb{E}_{y^s \sim P_\phi}[r(y^s) \nabla_\phi \log P_\phi(y^s)]$$

We use the self-critical sequence training method (Rennie et al., 2017; Paulus et al., 2017) to reduce the variance of the gradients by subtracting a baseline reward from the mixed reward in Eq 6. For the BLEU part, the reward is $R_{bleu} = B(y^s, y) - B(y^g, y)$, where $B(a, b)$ is the BLEU score of sequence a compared with sequence b, $y$ is the gold description and $y^g$ is a sequence generated by greedy search. To calculate the information richness reward $R_{info}$ for the lemmatized sampled sequence $\tilde{y}^s$, we use the information richness (Eq 5) of the related lemmatized gold description $\tilde{y}$ with respect to the source table as the baseline reward.
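A minimal sketch of the self-critical loss for one sampled sequence is given below. It assumes, as an illustration only, that the mixed reward is a convex combination of the BLEU part and the richness part weighted by λ; the exact form of the mixed reward is not reproduced here. The function names and all numeric values are hypothetical.

```python
import math
from typing import List


def mixed_advantage(
    bleu_sample: float,
    bleu_greedy: float,
    rich_sample: float,
    rich_gold: float,
    lam: float,
) -> float:
    """Baseline-subtracted reward; the lam / (1 - lam) mixing is an assumption."""
    bleu_part = bleu_sample - bleu_greedy   # baseline: BLEU of the greedy sequence
    rich_part = rich_sample - rich_gold     # baseline: richness of the gold description
    return lam * bleu_part + (1.0 - lam) * rich_part


def self_critical_loss(sample_log_probs: List[float], advantage: float) -> float:
    """REINFORCE surrogate loss for one sample: -(advantage) * log P(y^s).

    In an autodiff framework the advantage would be treated as a constant.
    """
    return -advantage * sum(sample_log_probs)


# Made-up numbers for a two-token sampled sequence.
log_probs = [math.log(0.5), math.log(0.25)]
adv = mixed_advantage(bleu_sample=0.30, bleu_greedy=0.25, rich_sample=0.8, rich_gold=0.7, lam=0.4)
loss = self_critical_loss(log_probs, adv)
print(adv, loss)
```

When the sampled sequence beats its baselines, the advantage is positive and minimizing the loss raises the log-probability of that sample; a negative advantage pushes it down.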
For more technical details, we refer interested readers to (Williams, 1992; Ranzato et al., 2015; Rennie et al., 2017).

Figure 3: The 'coverage-frequency' figure (left) (each point represents an attribute) shows that many attributes have very low coverage and low frequency in the WIKIBIO dataset. Due to our filtering, the attributes in WB-filter have 100% Hit-1 coverage (Sec 4.2) and more overlapping words with the original tables as shown in the data statistics (right).

Datasets
We use two datasets to test our model in the context of end-to-end table description generation and of comprehensive generation for selected attributes in the user-oriented scenario.
For end-to-end description generation, we use the WIKIBIO dataset (Lebret et al., 2016) as the benchmark dataset, which contains 728,321 articles from English Wikipedia (Sep 2015) and uses the first sentence of each article as the description.
To test our model in the user-oriented scenario, we filtered the WIKIBIO dataset to form a new dataset, WB-filter. To simulate user interests, we first select the 100 most frequent attributes in WIKIBIO. After that, we manually filter out irrelevant attributes (like 'caption', 'website' or 'signature') and merge identical attributes (like 'article title' and 'name') to avoid repetition. Then we leave out all the remaining attributes in the tables and remove the instances in WIKIBIO whose descriptions cannot cover the selected attributes to form WB-filter. To achieve this, we first lemmatize all the tokens in the infoboxes as well as those in the related gold biographies and filter them with a stop-word list; then we randomly retrieve 5 synonyms for every word in the infoboxes from WordNet. Finally, we make sure that the gold biographies cover at least one word (or its synonym) for every attribute-value tuple among the chosen attributes, and we filter out the unqualified instances in WIKIBIO.
The 'frequency-coverage' figure in Fig 3 shows that 1) the filtering ensures that the WB-filter dataset achieves 100% Hit-1 coverage, and 2) the WIKIBIO dataset suffers from both 'low frequency' and 'low coverage' problems, which means that some key attributes in the tables are seldom mentioned by the descriptions. The cause of the 'low coverage' problem is the loose alignment between the structured data and the related descriptions. The two datasets are divided into training (80%), testing (10%) and validation (10%) sets.

Evaluation Metrics
Automatic Metrics: Following previous work (Lebret et al., 2016), we use BLEU-4 (Papineni et al., 2002) and ROUGE-4 (F measure) (Lin, 2004) for automatic evaluation. Furthermore, to evaluate how well the generated biographies cover the key points in the infoboxes, we also use information richness (Eq 5) as one of our automatic metrics. 'Hit at least 1 word' for an attribute means that a biography has at least one word overlapping with the words (or their synonyms) in that attribute, which are lemmatized and filtered by a stop-word list in the same way as we build WB-filter in Sec 4.1. 'HIT-1 coverage' for an attribute is the ratio of the instances involving that attribute whose biographies 'Hit at least 1 word' in that attribute.
Human Evaluation: Since automatic metrics like BLEU may not be reliable for NLG systems (Callison-Burch et al., 2006; Reiter and Belz, 2009; Reiter, 2018), we also use human evaluation, which involves generation fluency, coverage (how much of the given information in the infobox is mentioned in the related biography) and correctness (how much false or irrelevant information is mentioned in the biography). We first sampled 300 generated biographies from the generators for human evaluation. After that, we hired 3 third-party crowd-workers who are equipped with sufficient background knowledge to rank the given biographies. We present the generated descriptions to the annotators in a randomized order and ask them to be objective and not to guess which system a particular generated case is from. Two biographies may have the same ranking if it is hard to decide which one is better. The Pearson correlations of inter-annotator agreement are 0.76 and 0.71 (Table 3) on WIKIBIO and WB-filter, respectively.
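The 'HIT-1 coverage' metric defined above can be sketched in a few lines. The sketch assumes that lemmatization, stop-word filtering and synonym expansion have already been applied, so each attribute is represented by a set of lemmas (including synonyms) and each biography by a set of lemmas; the function names and the toy instances are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (attribute -> lemmas of its value words plus synonyms, lemmas of the biography)
Instance = Tuple[Dict[str, Set[str]], Set[str]]


def hit_at_least_one(attr_lemmas: Set[str], bio_lemmas: Set[str]) -> bool:
    """True if the biography shares at least one lemma with the attribute."""
    return bool(attr_lemmas & bio_lemmas)


def hit1_coverage(instances: List[Instance]) -> Dict[str, float]:
    """Per attribute: share of the instances containing it whose biography hits it."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for table, bio_lemmas in instances:
        for attribute, attr_lemmas in table.items():
            totals[attribute] = totals.get(attribute, 0) + 1
            hits[attribute] = hits.get(attribute, 0) + int(hit_at_least_one(attr_lemmas, bio_lemmas))
    return {a: hits[a] / totals[a] for a in totals}


# Two toy instances with pre-lemmatized content.
instances = [
    ({"name": {"john", "doe"}, "position": {"forward"}}, {"john", "doe", "soccer", "player"}),
    ({"name": {"jane", "roe"}, "position": {"goalkeeper", "goalie"}}, {"jane", "roe", "goalie"}),
]
coverage = hit1_coverage(instances)
print(coverage)
```

The same per-attribute hit check is the condition an instance must satisfy for every selected attribute to be kept in WB-filter.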

Experimental Details
Following previous work, for WIKIBIO we select the most frequent 20,000 words and 1480 attributes in the training set as the word and attribute vocabularies. We tune the hyper-parameters based on the model performance on the validation set. The dimensions of word embedding, attribute embedding, position embedding and hidden unit are 500, 50, 600, 10 respectively. The batch size, learning rate and optimizer for both datasets are 32, 5e-4 and Adam (Kingma and Ba, 2014), respectively. We use Xavier initialization (Glorot and Bengio, 2010) for all the parameters in our model. The global constraint of force-attention (Eq 4) is adopted after 4 and 1.5 epochs of training for the WIKIBIO and WB-filter datasets, respectively, to avoid hurting the primary loss. Before the richness-oriented reinforced training, the neural generator (with or without the force-attention module) is pre-trained for 8 and 4 epochs on the WIKIBIO and WB-filter datasets, respectively. We replace UNK tokens with the most relevant token in the source table according to the attention matrix (Jean et al., 2015).
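The UNK replacement step can be sketched as below, assuming that 'most relevant' means the source word with the highest attention weight at the step where UNK was generated. The function name `replace_unk`, the `UNK` string and the toy attention matrix are illustrative assumptions.

```python
from typing import List


def replace_unk(
    tokens: List[str],
    attention: List[List[float]],
    source_words: List[str],
    unk: str = "UNK",
) -> List[str]:
    """Replace each UNK with the source word that received the highest attention at that step."""
    output = []
    for token, weights in zip(tokens, attention):
        if token == unk:
            best = max(range(len(weights)), key=weights.__getitem__)
            output.append(source_words[best])
        else:
            output.append(token)
    return output


# Toy example: one attention row per generated token, one column per source word.
source_words = ["Rajendra", "Singh", "Indian"]
tokens = ["Rajendra", "UNK", "is", "Indian"]
attention = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.1, 0.1, 0.8],
]
fixed = replace_unk(tokens, attention, source_words)
print(fixed)
```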

Baselines
KN & Template KN: A template-based Kneser-Ney (KN) language model (Heafield et al., 2013). The extracted template for Table 1 is "name 1 name 2 (born birthdate 1 · · · ". During inference, the decoder is constrained to emit words from the vocabulary or the special tokens in the tables.
Link matrix: a baseline that uses a link matrix to model the order of the attribute-value tuples while generating biographies.
Struct-aware: a structure-aware model that uses a modified LSTM unit and a specific attention mechanism to incorporate the attribute information.
Word & Attribute level Coverage: We also implement the implicit coverage method (Tu et al., 2016) for comparison. For word-level coverage, we replace the relevance score in Eq 2 with $g(s_t, h_i) = \tanh(W_p h_i + W_q s_t + W_m \theta_t + b)$. For attribute-level coverage, we replace it with $g(s_t, h_i) = \tanh(W_p h_i + W_q s_t + W_m \gamma_t + b)$. $\theta_t$ and $\gamma_t$ are the word-level and attribute-level coverage defined in Sec 3.1.

Analysis of Experimental Results
Automatic evaluations are shown in Table 2 for WIKIBIO and WB-filter. The proposed force-attention module achieves 1.11/0.98 and 2.04/1.32 BLEU/ROUGE increases on the WIKIBIO and WB-filter datasets, respectively. Although the proposed force attention method does not outperform the 'struct-aware' method in terms of BLEU and ROUGE on the WIKIBIO dataset, we show its advantages in the user-oriented scenario as well as its ability to cover the key attributes in Table 4 and 5. The richness-oriented reinforced module further enhances the model performance, helping our model outperform the state-of-the-art system by about 0.79 BLEU and 0.58 ROUGE. Note that the BLEU and ROUGE scores are lower on the WB-filter dataset because, firstly, WIKIBIO has a much larger training set and, secondly, the gold biographies might contain information beyond the tables. Although this phenomenon also occurs in WIKIBIO, the filtering of WB-filter magnifies the issue. Human evaluations in Table 3 show that our model achieves better generation coverage and correctness than all the baselines. Table 4 shows the ablation studies of our model.
As demonstrated in Table 5, we select one infobox each from WIKIBIO and WB-filter for case studies. By observing the generated descriptions for WIKIBIO, we find that 1) compared with the vanilla seq2seq model, our force-attention module can cover the information in the 'Notableworks' attribute, and 2) the richness-oriented module further helps our model to cover the 'Alma mater' and 'Notableworks' attributes, as they are infrequent (more informative) attributes in the dataset. Additionally, due to the rareness of the word 'kiev', our model is able to cover the related information. Similarly, the generated description for WB-filter covers the information from 'Organization' and 'Birthplace' with the help of the proposed modules.

Error Analysis
Although the proposed models achieve competitive performance, we also observe some failure cases. The main problem is information in the generated descriptions that is irrelevant to the source tables. For example, a biography about a football player might contain 'in the national football league' although the related infobox does not mention this piece of information, because similar expressions exist in many instances of the training set. Although our model could largely relieve this problem as shown in human evaluation (Table 3), it is still a general problem in NLG.

Table 5: The generated cases in WIKIBIO (above) and WB-filter (below) datasets. The underlined texts, which are the key information of the source tables, are ignored by seq2seq model.

WIKIBIO case
( January 2 , 1882-March 29 , 1972 ) was a Ukrainian cleric , historian , ethnographer, writer , linguist , writer and scolar.
+Force-Attention: Ivan Ohienko Metropolitan Ilarion ( 2 January 1882 in Brusilov -29 march 1972 in Winnipeg ) was a Ukrainian linguist , ethnographer , and scholar , best known for his translation of the bible into ukrainian .
+Richness-oriented RL: Ivan Ohienko Metropolitan Ilarion ( 2 January 1882 , Krusilov , Kiev governorate-29 march 1972 , Winnipeg ) was a Ukrainian cleric, historian , ethnographer , and scholar of Kiev university , best known for his translation of the bible into ukrainian .

WB-filter case
Infobox: Name: Rajendra Singh; Birthdate: 06 August 1959; Birthplace: Daula, Bagpat District, Uttar Pradesh; Nationality: Indian; Organization: Tarun Bharat Sangh; Occupation: water conservationist; Alma mater: Allahabad University
Seq2seq: Rajendra Singh is an Indian water conservationist.
+Force-Attention: Rajendra Singh (born 6 August 1959) is an Indian conservationist and a senior fellow of the Tarun Bharat Sangh.
+Richness-oriented RL: Rajendra Singh (born 6 august 1959, Uttar Pradesh) is an Indian water conservationist and a member of the Tarun Bharat Sangh.
As for the ability to cover important information in the tables, although our model is able to cover much more comprehensive information than the previous models (Table 2 and 3), some implicitly expressed attributes (like whether a person is retired or not) or rarely covered attributes (like 'spouse' or 'high school') in the source tables might still be ignored in the descriptions generated by our model. Furthermore, pieces of information which need some form of inference across several attributes (like a time span) may not be well represented by our model.

Related Work
Data-to-text is a language generation task that generates text for structured data. Table-to-text belongs to data-to-text generation (Reiter and Dale, 2000). Many previous works (Barzilay and Lapata, 2005, 2006; Liang et al., 2009) treated the task as a pipelined system, which viewed content selection and surface realization as two separate tasks. Duboue and McKeown (2002) proposed a clustering approach in the biography domain by scoring the semantic relevance of the text and the paired knowledge base. In a similar vein, Barzilay and Lapata (2005) modeled the dependencies between American football records and identified the bits of information to be verbalized. Liang et al. (2009) and Angeli et al. (2010) extended the work of Barzilay and Lapata (2005) to the soccer and weather domains by learning the alignment between data and text using hidden variable models. Androutsopoulos et al. (2013) and Duma and Klein (2013) focused on generating descriptive language for ontologies and RDF triples. Most recent work utilizes neural networks for data-to-text generation (Mahapatra et al., 2016; Wiseman et al., 2017; Kaffee et al., 2018; Freitag and Roy, 2018; Qader et al., 2018; Dou et al., 2018; Yeh et al., 2018; Jhamtani et al., 2018; Liu et al., 2017b, 2019; Peng et al., 2019; Dušek et al., 2019). Some closely related work also focused on table-to-text generation. Mei et al. (2016) proposed an encoder-aligner-decoder framework for generating weather broadcast. Hachey et al. (2017) used a table-text and text-table autoencoder framework for table-to-text generation. Gated orthogonalization has been proposed to avoid repetitions. Wiseman et al. (2018) used a neural semi-HMM to generate template-like descriptions for structured data. Our work shares similar goals with Kiddon et al. (2016), Tu et al. (2016) and Liu et al. (2017a) in the sense that they emphasize easily ignored (usually less frequent) features or bits of information in the training procedure by smoothing or regularization.
The greatest difference between our work and theirs is that our method is tailored for covering the key information embedded in the attributes (entries) of the key-value tables rather than single words or labels. Although the deficient score of Tu et al. (2016) in Table 2 has demonstrated that word-level coverage oriented methods may not be suitable for structured tables, we assume that other word-level constraints may transfer easily to structured tables without losing efficiency. We leave the identification of potentially applicable word-level constraints to future work. This paper focused on generating one-sentence biographies for infoboxes like many previous works (Lebret et al., 2016; Hachey et al., 2017; Bao et al., 2018; Puduppully et al., 2018; Cao et al., 2018). Perez-Beltrachini and Lapata (2018) used the first paragraph of the Wikipedia pages as the gold biographies, aiming at generating longer biographies. We tried the same setting and unfortunately found that most generated biographies contain too much groundless information compared with the source infoboxes. This is because the related gold biographies taken from the first paragraph contain too much information beyond the source infoboxes.

Conclusion and Future Work
We set up 3 goals for comprehensive description generation from factual attribute-value tables: accurate, informative and loyal. To achieve these goals, we propose the force-attention method, which encourages the generator to pay more attention to previously uncovered attributes to avoid missing key attributes. Richness-oriented reinforcement learning is proposed to cover more informative content in the source tables, which helps the generator to produce informative and accurate descriptions. The experiments on the WIKIBIO and WB-filter datasets show the merits of our model. In the future, we will explore how to represent implicit information in the table, such as whether a man is retired or how long a sportsman's career is given its starting and ending years, by including some inference strategies.