Partially-Aligned Data-to-Text Generation with Distant Supervision

The Data-to-Text task aims to generate human-readable text describing given structured data, enabling more interpretability. However, the typical generation task is confined to a few particular domains since it requires well-aligned data, which is difficult and expensive to obtain. Using partially-aligned data is an alternative way of solving the dataset scarcity problem. This kind of data is much easier to obtain since it can be produced automatically. However, it induces the over-generation problem, which poses difficulties for existing models: they tend to add unrelated excerpts during the generation procedure. In order to effectively utilize automatically annotated partially-aligned datasets, we extend the traditional generation task to a refined task called Partially-Aligned Data-to-Text Generation (PADTG), which is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains. To tackle this new task, we propose a novel distant supervision generation framework. It first estimates the input data's supportiveness for each target word with an estimator, and then applies a supportiveness adaptor and a rebalanced beam search to harness the over-generation problem in the training and generation phases respectively. We also contribute a partially-aligned dataset (the data and source code of this paper can be obtained from https://github.com/fuzihaofzh/distant_supervision_nlg) by sampling sentences from Wikipedia and automatically extracting corresponding KB triples for each sentence from Wikidata. The experimental results show that our framework outperforms all baseline models and verify the feasibility of utilizing partially-aligned data.


Introduction
The Data-to-Text generation task focuses on generating human-readable text corresponding to some

[Figure 1 shows a model taking the input triple ⟨Age of Empires, genre, strategy video game⟩ and generating "Age of Empires is a strategy video game developed in Canada."]

Figure 1: Illustration of the over-generation problem in the partially-aligned data-to-text generation task. In the training set, there is no KB triple corresponding to the text "developed in Canada". The model is likely to bind the text to existing triples incorrectly. As a result, during the testing or operational stage, the model is likely to overly generate this kind of excerpt for similar triples.
given structured data. For example, given the input knowledge base (KB) triple ⟨Company of Heroes, developer, Relic Entertainment⟩, the aim is to generate a text description such as "Company of Heroes is developed by Relic Entertainment.". In recent years, many works have been proposed to give impetus to the Data-to-Text generation task. For instance, Gardent et al. (2017a; 2017b) proposed the WebNLG task aiming at generating description text for the given KB triples. Novikova et al. (2017) proposed the E2E task aiming at generating restaurant reviews according to the given restaurant attributes. Lebret et al. (2016) proposed the WikiBio task in which the biography of each person is generated according to the given Wikipedia infobox.
These typical data-to-text generation tasks are confined to a few particular domains since they require well-aligned data and text pairs, which are difficult and costly to obtain. Specifically, it is required that each input data instance provides exactly the same information as the target text. This requirement makes the dataset difficult to build and confines the task to particular domains where such kind of data (WikiBio, E2E) or human-labeled data (WebNLG) is available. Using partially-aligned data is an alternative way of solving the dataset scarcity problem. Partially-aligned data do not require that each part of the text is exactly aligned with a particular input KB triple. This kind of data is much easier to obtain with automatic methods. Consequently, it can handle much broader kinds of domains. However, it induces the over-generation problem. As shown in Fig. 1, some parts ("developed in Canada") in the generated text for ⟨Age of Empires, genre, strategy video game⟩ have not been mentioned in the input KB triple. Essentially, this is because such unrelated text exists in some training samples. During training, it misleads the model to bind the text "developed in Canada" to some irrelevant KB triples. When similar triples appear in testing, the model is prone to adding over-generated text that is actually unrelated to the given input data. Current generation models fail to be trained on such partially-aligned data because they lack tolerance of the over-generation problem.
In order to effectively utilize automatically annotated partially-aligned datasets for handling more domains, we extend the traditional generation task to a refined task called Partially-Aligned Data-to-Text Generation (PADTG). Like the traditional task, the PADTG task also requires generating text with respect to the given input data. However, for the training data, we only require that the given structured data contain partial information of the corresponding text. This task is more practical since it utilizes the partially-aligned data for training and thus considerably expands the application domains. However, due to the nature of such data, successfully suppressing the over-generation problem is the critical point for proposing an effective model.
We propose a Distant Supervision Generation (DSG) framework to tackle the PADTG task. Our framework can deal with the challenging over-generation problem when training on the partially-aligned data. It first trains an estimator to calculate each word's supportiveness in the target sentence with respect to the input data, i.e. how likely the word is conveyed by the input triples. Then the framework employs a sequence-to-sequence (S2S) neural model to encode the input data and generate the description sentence accordingly. In the training procedure, a supportiveness adaptor is used to adapt the estimated supportiveness into the loss function, while in the generation procedure, a rebalanced beam search is used to generate text augmented with the supportiveness scores.
To prepare the partially-aligned data, we build a new dataset called WITA from two sources, namely Wikipedia and Wikidata. We propose a novel KB extractor to extract KB triples for a piece of text sampled from Wikipedia. The KB extractor first detects named entities with an entity detector. The triple retriever then queries the Wikidata database to find the triples that best match these entities. We filter the results with a matching score to remove unextractable sentences.
Our contributions can be summarized as follows. (1) We propose a new task, namely partially-aligned Data-to-Text generation, which is more practical and extensible to more domains. (2) We propose a distant supervision generation framework that tackles the challenges of the new task, including the over-generation problem. (3) We contribute a sizeable partially-aligned dataset suitable for this task.

Overview
Formally, we denote the input KB triples as K = {⟨h_i, r_i, t_i⟩ | i = 1, …, n}, where h_i, r_i, t_i represent the i-th head, relation, and tail respectively, while n is the number of triples. The corresponding text is denoted as T = (w_1, w_2, …, w_m), in which w_i is the i-th word in T and m is the sentence length. It should be noted that, in the task of Partially-Aligned Data-to-Text Generation (PADTG), T contains some information that K does not have. The target of the task is to train a model that generates text T that exactly describes the KB triples in K.
Our proposed Distant Supervision Generation (DSG) framework contains four components, namely a Supportiveness Estimator (SE), a Sequence-to-Sequence Generator (S2SG), a Supportiveness Adaptor (SA), and a Rebalanced Beam Search (RBS). As illustrated in Fig. 2, in the SE training procedure, we first pre-train the SE component to estimate a supportiveness vector s ∈ R^m indicating whether each target word w_i ∈ T is describing the input triples in K. It adopts a self-supervised mechanism that trains the model to maximize the margin between the target words' scores and negative sampled words' scores. Then, the pre-trained SE component is utilized to estimate a supportiveness vector s in both S2SG training and S2SG generation. In the S2SG training procedure, the S2SG model first calculates the generation loss. Then, SA combines the loss with s to get a refined loss in which the loss is diminished if a target word has lower supportiveness. In the S2SG generation procedure, the RBS component combines s with the probability distribution of candidate words to obtain a better generation result.

Supportiveness Estimator
We concatenate the input KB triples word-by-word as K' = (w_1^{h_1}, …, w_{|h_1|}^{h_1}, w_1^{r_1}, …, w_{|t_1|}^{t_1}, KBSEP, w_1^{h_2}, …), in which h_j, r_j, t_j are the head, relation, and tail entity of the j-th triple. w_i^{h_j} represents the i-th word in the j-th KB triple's head entity. |h_j| stands for the word count of the j-th head entity. KBSEP is a separator between each pair of triples.
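As a minimal sketch of this linearization step (the function name and word-level tokenization by whitespace are our own assumptions, not the paper's implementation):

```python
# Hypothetical sketch: linearize KB triples into the flat sequence K',
# joining head/relation/tail words and separating triples with KBSEP.
def linearize(triples, sep="KBSEP"):
    """triples: list of (head, relation, tail) strings."""
    tokens = []
    for i, (h, r, t) in enumerate(triples):
        if i > 0:
            tokens.append(sep)  # separator between consecutive triples
        tokens.extend(h.split() + r.split() + t.split())
    return tokens

seq = linearize([("Age of Empires", "genre", "strategy video game"),
                 ("Company of Heroes", "developer", "Relic Entertainment")])
```

In practice the real model applies BPE subword segmentation on top of this, so the actual token sequence would be finer-grained.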
Feature Extraction. In SE, a feature extraction component f_K is utilized to extract features for each word, denoted as F_{K'} = f_K(K') ∈ R^{d×|K'|}, the extracted feature matrix for K'. d is the embedding dimension and |K'| is the length of K'. Specifically, in the feature extraction component, K' is first embedded with an embedding layer and normalized with layer normalization, K'_2 = LN(Emb(K')), where LN(x) = γ (x − E[x]) / √(Var[x] + ε) + β, in which E[x] and Var[x] are the mean and variance of the input x, γ and β are learnable parameters, and ε is a small constant usually set to 1.0e−5. K'_2 is then sent into a combination of linear feedforward layers and a ReLU layer as K'_3 = FW_2(ReLU(FW_1(K'_2))), where FW_1 and FW_2 are linear feedforward layers while ReLU stands for the ReLU activation. Afterwards, the feature representation F_{K'} is computed from K'_3. Similarly, the features for each word in the target text are denoted as F_T = f_K(T) ∈ R^{d×m}.

Supportiveness Vector. We calculate the supporting matrix as M = F_{K'}^T F_T, M ∈ R^{|K'|×m}, in which M_{i,j} represents the supportiveness of the i-th word in K' for the j-th word in T. The supportiveness score vector is aggregated from M as s_j = max_i M_{i,j}, where s_j is the j-th element of the vector s ∈ R^m and stands for the input K's supportiveness to the j-th word.
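The supportiveness computation above can be sketched as follows; the feature matrices are random stand-ins for the extractor's output, and the max-aggregation over source words reflects the reconstruction given above rather than a confirmed detail of the paper:

```python
import numpy as np

# Sketch of the supportiveness matrix and vector, assuming feature
# matrices F_K' (d x |K'|) and F_T (d x m) from the feature extractor.
def supportiveness(F_Kp, F_T):
    M = F_Kp.T @ F_T      # M[i, j]: support of i-th source word for j-th target word
    s = M.max(axis=0)     # aggregate: best-supporting source word per target word
    return M, s

rng = np.random.default_rng(0)
M, s = supportiveness(rng.standard_normal((8, 5)),   # d=8, |K'|=5
                      rng.standard_normal((8, 3)))   # d=8, m=3
```

Each target word thus receives one scalar score, regardless of how many source words weakly relate to it.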
Negative Sampling. In order to prevent the model from giving all words a high supportiveness score, we use the negative sampling method to sample some negative sentences. We denote the empirical distribution of the words in the target texts as P_T, where T is the set of all target sentences. We sample words from P_T while avoiding sampling words that appear in T. The sampling procedure can be denoted as w̃_i ∼ P_T, w̃_i ∉ T. The negative sample is composed of the sampled words as T̃ = (w̃_1, …, w̃_m), where w̃_i is the i-th word in T̃, which has the same length as T. The negative sample T̃ is fed to the network in the same way as the original target T. The supportiveness score vector for the negative sample is denoted as s̃ ∈ R^m.
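A minimal sketch of this rejection-style sampling, assuming the empirical distribution P_T is given as word counts (the function and variable names are illustrative only):

```python
import random

# Sketch: draw replacement words from the empirical word distribution P_T
# while rejecting words that already appear in the target sentence T.
def negative_sample(target, vocab_counts, rng=None):
    rng = rng or random.Random(0)
    words, weights = zip(*vocab_counts.items())
    banned = set(target)
    neg = []
    for _ in target:  # negative sample has the same length as T
        w = rng.choices(words, weights=weights)[0]
        while w in banned:  # reject words present in T
            w = rng.choices(words, weights=weights)[0]
        neg.append(w)
    return neg

counts = {"game": 5, "movie": 3, "strategy": 2, "banana": 1}
neg = negative_sample(["strategy", "game"], counts)
```

Frequent corpus words are thus more likely to appear as negatives, making the margin objective harder and more informative than uniform sampling.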
Optimization Target. The overall loss function consists of a margin loss, a word-consistent loss, and a concentration loss. The margin loss is defined over the margin between the supportiveness of the original text and that of the negative sample, L_m = Σ_j max(0, 1 − s_j + s̃_j). Minimizing L_m helps maximize the gap between the positive and the negative samples. The word-consistent loss is used to make the supportiveness from the same word in the input KB larger than the supportiveness from different words. It is defined as L_w = −Σ_{i,j} 1[K'_i = T_j] M_{i,j}, which increases the supportiveness M_{i,j} if the i-th word in K' and the j-th word in T are the same word. The concentration loss is used to avoid one word in K' supporting too many words in T. It is denoted as L_c = Σ_i (Σ_j M_{i,j})². If one word supports too many words, all its corresponding supportiveness scores are penalized. The overall loss function is the weighted sum of these losses, L_SE = L_m + ω_w L_w + ω_c L_c, in which ω_w and ω_c are tunable hyper-parameters.
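The three terms can be sketched numerically as below; the hinge form of the margin loss, the indicator matrix for matched words, and the squared-row-sum concentration penalty are assumed functional forms reconstructed from the surrounding prose, not the paper's verified equations:

```python
import numpy as np

# Sketch of the SE loss, under assumed forms for the three terms.
def se_loss(M, s_pos, s_neg, same_word, margin=1.0, w_w=0.05, w_c=1.0):
    L_m = np.maximum(0.0, margin - s_pos + s_neg).mean()  # margin loss
    L_w = -(M * same_word).sum()        # reward support between identical words
    L_c = (M.sum(axis=1) ** 2).sum()    # penalize one source word supporting many targets
    return L_m + w_w * L_w + w_c * L_c

# Toy example: 2 source words, 2 target words, diagonal word matches.
total = se_loss(M=np.ones((2, 2)),
                s_pos=np.ones(2), s_neg=np.zeros(2),
                same_word=np.eye(2))
```

With these toy inputs the margin is fully satisfied (L_m = 0), so the total is driven by the matched-word reward and the concentration penalty.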

Sequence-to-Sequence Generator
We use the Transformer (Vaswani et al., 2017) structure as our S2S generator. The Transformer is an attention-based structure widely used in many tasks. It contains two major components, namely an encoder and a decoder, both built with several attention layers. The encoder first takes K' as input to generate the feature representation G_{K'} = Enc(K'). The decoder takes the shifted target text T̂ (with the last EOS tag shifted to the beginning) as input and outputs the negative log-likelihood for each word in T as ℓ = Dec(T̂, G_{K'}), ℓ ∈ R^{|T|}, where Dec is the Transformer decoder and |T| is the length of T. We refer readers to Vaswani et al. (2017) for more technical details.
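The per-word negative log-likelihood vector ℓ that the decoder produces can be illustrated in isolation; the probability matrix here is a toy stand-in for the decoder's softmax output:

```python
import numpy as np

# Sketch: per-word NLL from a (|T| x |V|) matrix of predicted word
# probabilities and the gold target word ids.
def per_word_nll(probs, target_ids):
    return -np.log(probs[np.arange(len(target_ids)), target_ids])

# Toy decoder output over a 2-word vocabulary for a 2-word target.
nll = per_word_nll(np.array([[0.7, 0.3],
                             [0.2, 0.8]]), [0, 1])
```

Keeping the loss as a per-word vector (rather than a single scalar) is what allows the supportiveness adaptor of the next section to reweight individual words.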

Supportiveness Adaptor
The supportiveness Adaptor adapts the supportiveness score to S2SG's output.We investigate three methods, namely, Hard Adaptor, Soft Adaptor, and Attention Adaptor.
Hard Adaptor. With the supportiveness scores, we can simply remove words that have low supportiveness. For each word w_i in the target sentence T, we use a uniform random number generator to generate a random number r_i ∈ [0, 1]. We ignore w_i if r_i > s_i and copy it into T' otherwise. Then, T' is used instead of T in the training procedure.
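A minimal sketch of the hard adaptor's stochastic filtering (names are illustrative; the scores here are hand-picked for demonstration):

```python
import random

# Sketch of the hard adaptor: each target word survives only if a uniform
# random draw r_i does not exceed its supportiveness score s_i.
def hard_adapt(words, scores, rng=None):
    rng = rng or random.Random(0)
    return [w for w, s in zip(words, scores) if rng.random() <= s]

# A word with supportiveness 0.0 is always dropped; 1.0 is always kept.
kept = hard_adapt(["developed", "in", "Canada"], [1.0, 1.0, 0.0])
```

The randomness means a mid-scored word is sometimes kept and sometimes dropped across epochs, rather than being removed deterministically.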
Soft Adaptor. Since the hard adaptor directly removes words, it can easily omit essential words and make the model generate unreadable text. We propose a soft adaptor to alleviate this issue. We use the original target text T as input. For the output negative log-likelihood loss vector ℓ, we combine it with the supportiveness vector s to modify the S2SG's loss as L_{S2SG} = Σ_j s_j ℓ_j.

Attention Adaptor. Instead of using SE to estimate the supportiveness vector, the attention adaptor directly aggregates the attention matrix into the supportiveness scores in our proposed DSG model. For each target word, it takes the max attention weight over the source words as the supportiveness score. We use maximization to aggregate the scores instead of considering all scores because all attention weights sum up to 1, and thus irrelevant words can also be assigned some attention; max aggregation avoids counting such irrelevant words. The supportiveness scores are then utilized in the same way as in the soft adaptor.
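The soft adaptor's reweighting amounts to an element-wise product between the loss vector and the score vector; a minimal sketch (toy numbers, illustrative names):

```python
import numpy as np

# Sketch of the soft adaptor: the per-word NLL loss is down-weighted by
# each target word's supportiveness score before summation.
def soft_adapt(nll, scores):
    return float((np.asarray(nll) * np.asarray(scores)).sum())

# The poorly supported third word (score 0.1) barely contributes.
loss = soft_adapt([2.0, 1.0, 4.0], [1.0, 1.0, 0.1])
```

Unlike the hard adaptor, every word still contributes some gradient, so sentence fluency is preserved while unsupported spans are de-emphasized.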

Rebalanced Beam Search
In the generation step, the supportiveness scores can also be utilized to help rebalance the final word probability distribution. We make a pseudo target sequence T_p that contains all words in the vocabulary V. The supportiveness score vector s_p ∈ R^{|V|} for all words is calculated by the same procedure as in training. The traditional beam search outputs a probability p_b ∈ R^{|V|} over the whole vocabulary denoting the possibility for each token in the vocabulary to be chosen as the next word. We rebalance this probability with the supportiveness score vector as p_r = p_b ⊙ s_p^α, where α is a tunable hyper-parameter and ⊙ denotes element-wise multiplication.
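The rebalancing formula can be sketched as below; the renormalization after the product is our assumption for keeping a proper distribution, and the vocabulary here is a three-token toy:

```python
import numpy as np

# Sketch of rebalanced beam search scoring: candidate probabilities are
# scaled by vocabulary-wide supportiveness scores raised to the power alpha.
def rebalance(p_b, s_p, alpha=0.1):
    p_r = p_b * np.power(s_p, alpha)
    return p_r / p_r.sum()  # renormalize over the vocabulary

# The third token is barely supported by the input, so its probability drops.
p = rebalance(np.array([0.5, 0.3, 0.2]),
              np.array([1.0, 1.0, 1e-6]))
```

A small α (the paper selects 0.1) applies the supportiveness as a gentle prior rather than a hard mask, so fluent but weakly supported continuations are demoted, not forbidden.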

WITA: Our Partially-Aligned Dataset
We automatically harvest partially-aligned data from Wikipedia and Wikidata and prepare a dataset called WITA. We select each Wikipedia article's first sentence from the 20190920 Wikipedia dump as the target text. Then, we remove irrelevant tags and links with several predefined rules. We propose a KB extractor, as illustrated in Fig. 3, which takes the selected Wikipedia sentences and extracts the corresponding KB triples. In the KB extractor, named entities are detected by an entity detector. The detected named entities are then combined into pairs by a Cartesian product operation. The triples that mention these entity pairs are retrieved by a triple retriever that searches the corresponding KB triples in the Wikidata database. We use an entity-recall based score to filter inappropriate sentences.

Entity Detector
We use three sub-detectors to recognize named entities and take the union of their outputs. We first use a NER detector based on spaCy's NER tool to recognize named entities. Then we use a noun detector based on spaCy's noun chunk recognition component to identify noun chunks. This detector is used because noun chunks have a high probability of being named entities. Finally, we use a rule-based linking detector to extract entities tagged with internal links. The detected entities for a given sentence c are denoted as E_c = {e_1, …, e_p}, where p is the number of detected entities.
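The union of the three sub-detectors can be sketched abstractly; the lambdas below stand in for spaCy's NER (`doc.ents`), spaCy's noun-chunk recognizer (`doc.noun_chunks`), and the rule-based link extractor, which are not reimplemented here:

```python
# Sketch: combine the three entity sub-detectors by set union.
def detect_entities(sentence, detectors):
    found = set()
    for detect in detectors:
        found |= set(detect(sentence))  # each detector returns entity strings
    return found

# Stand-ins for the NER, noun-chunk, and internal-link detectors.
ner = lambda s: ["Steve Jobs"]
nouns = lambda s: ["Steve Jobs", "Apple"]
links = lambda s: ["Apple Inc."]
ents = detect_entities("Steve Jobs founded Apple.", [ner, nouns, links])
```

Taking the union favors recall: a candidate missed by NER can still be recovered as a noun chunk or an internal link, and spurious candidates are filtered later by the triple retriever and the entity-recall score.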

Triple Retriever
In order to quickly retrieve related triples for given named entities, we first store the Wikidata database in Elasticsearch. We concatenate all possible variant names of an entity in Wikidata as the entity name. For example, "Steve Jobs" has alternative names such as "Steven Paul Jobs" and "Steven Jobs". The entity name is concatenated as "Steve Jobs - Steven Paul Jobs - Steven Jobs". In the Wikidata database, each triple contains a head, a relation, and a tail entity, and we denote the set of all preprocessed triples as D = {⟨h_i, r_i, t_i⟩ | ∀i}, where h_i, r_i, t_i are the head, relation, and tail entity of the i-th triple.
For the detected entities E_c, we make a list of named entity pairs by conducting a Cartesian product as C_e = {⟨e_i, e_j⟩ | ∀e_i ∈ E_c, e_j ∈ E_c, e_i ≠ e_j}. Afterwards, we query the Wikidata database to find a triple that matches a given named entity pair ⟨e_i, e_j⟩ ∈ C_e such that the head entity is close to e_i while the tail entity is close to e_j. It should be noted that the relation may be wrong, i.e. the matching triple may describe a relation different from the one in the input sentence. But in practice, this probability is very small since most entity pairs have only one corresponding relation. For a given named entity pair ⟨e_i, e_j⟩, the query condition can be formally expressed as

⟨h, r, t⟩ = argmax_{⟨h,r,t⟩ ∈ D} d(g(e_i, h) + g(e_j, t)) + (1 − d)(g(e_j, h) + g(e_i, t)),

in which g is a single-term matching score built from the string similarity metric l ranging from 0 to 1, M is a sufficiently large number, d ∈ {0, 1} selects the matching direction, and κ is a threshold preventing the retrieved head and tail from being too different from e_i and e_j.

After we have retrieved triples for all sentences, we calculate a score based on entity-recall to filter wrongly extracted data-text pairs. The entity-recall r_e for the KB triples and the corresponding text is defined as the fraction of the m words in the text that are covered by the n retrieved triples, where m is the length of the sentence and n is the triple number. r_e indicates how much information in the text has been covered by the retrieved triples.

Since WebNLG is the most similar task to our PADTG task, we compare the statistics of our WITA dataset with WebNLG in Table 1. It can be observed that (1) WITA is larger than the WebNLG dataset, making it more practical and easily extendable to more domains. (2) WITA contains more relation types and entity types than WebNLG, indicating that our dataset involves more domains. (3) The vocabulary of the target sentences of WITA is much larger than that of WebNLG, which shows that our dataset is more challenging and more realistic. (4) The entity-recall score of WITA is lower than that of WebNLG. This is because WITA is automatically annotated and some information in the text is not contained in the KB triples. The low entity-recall score causes the over-generation problem, and its specific value measures how serious the problem is.
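A minimal sketch of the entity-recall filter, under the assumption that "covered" means a target word appears (case-insensitively) among the words of the retrieved head or tail entities; the exact matching rule in the paper may differ:

```python
# Sketch: entity-recall r_e, the fraction of target-sentence words covered
# by words of the retrieved triples' head and tail entities.
def entity_recall(triples, sentence_words):
    covered = set()
    for h, r, t in triples:
        for word in (h + " " + t).split():
            covered.add(word.lower())
    hits = sum(1 for w in sentence_words if w.lower() in covered)
    return hits / len(sentence_words)

r_e = entity_recall([("Age of Empires", "genre", "strategy video game")],
                    "Age of Empires is a strategy video game".split())
```

Sentences whose r_e falls below the threshold κ would be discarded; only "is" and "a" are uncovered in this example, so it passes at κ = 0.75.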

Experimental Setup
We split WITA into a training set, a development set, and a testing set of 50,000, 5,000, and 400 records respectively. For the purpose of evaluating the performance of the models, we ask human helpers to annotate the testing set sentences. The human helpers are asked to revise the input KB triples and the corresponding target sentences to make them exactly consistent with each other. We use several evaluation metrics including BLEU (Papineni et al., 2002), ROUGE_L (Lin, 2004), METEOR (Banerjee and Lavie, 2005), NIST (Doddington, 2002), and CIDEr (Vedantam et al., 2015) with the package provided by Novikova et al. (2017). We follow the default setting in ROUGE_L where β is set to 1.2. We build our model based on the Transformer model (Vaswani et al., 2017; Ott et al., 2019) and use Byte Pair Encoding (BPE) (Sennrich et al., 2016) to build the subword dictionary. We use Fairseq (Ott et al., 2019) to build our model and keep all Transformer hyper-parameters unchanged. We set κ = 0.75 from {0.1, 0.25, 0.5, 0.75, 0.9} by extracting samples and asking human helpers to evaluate them. We use grid search to tune hyper-parameters on the development set, choosing ω_w = 0.05 from {0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0}, ω_c = 1.0 from the same set, and α = 0.1 from {0.02, 0.05, 0.1, 0.2, 0.5, 1.0}.
The model has 49M parameters and takes 2.4 hours to train on an NVIDIA TITAN RTX graphics card.
S2ST utilizes the prevalent Transformer model (Vaswani et al., 2017; Ott et al., 2019) equipped with attention. The main results show that both the RBS and the SA components contribute to alleviating the over-generation problem. Specifically, the S2S model performs worse than the other models, showing that the Transformer-based models work better in this task. This observation is consistent with the results observed in many other similar tasks. The S2ST model performs worse than all other Transformer-based models. The reason is that it suffers severely from the over-generation problem and is very likely to generate superfluous content in the generation procedure. The DSG-A model outperforms the models without any adaptor. This is because attention can also be regarded as a kind of supportiveness and can be used to detect over-generated words. However, since the purpose of attention is to assign weights to the input words, it is forced to assign weights even when no input data supports the target word. As a result, it performs worse than our DSG model. Our DSG model with a soft adaptor outperforms the DSG-H model equipped with a hard adaptor. The reason is that when the hard adaptor is used, some words are directly ignored, possibly resulting in an incoherent target sentence. Therefore, though DSG-H outperforms the models without any supportiveness adaption, it fails to exceed our proposed DSG model. The ablation results confirm that both the RBS and SA components contribute to alleviating the over-generation problem: SA mainly alleviates it in the training phase while RBS addresses it in the generation phase. Both are essential components of our model.
Dataset Size Analysis. In order to explore whether our framework is capable of working on small datasets, we conduct a dataset size analysis. The results are shown in Table 4. It can be concluded that as the data size increases, the performance of the models with or without supportiveness improves noticeably, showing that increasing the data size helps improve the overall scores. On the other hand, models equipped with supportiveness scores always outperform models without them, showing that our architecture alleviates the over-generation problem at all scales of data size.
Human Evaluation. We conduct a human evaluation to assess the generation performance. We sample 130 sentences generated by each model and ask human helpers to give an overall score and a match score with respect to the target sentences, ranging from 1 to 10. The results are illustrated in Table 6. It can be concluded from the experiment that the DSG model generates sentences that read better to humans.
Case Study. We provide a case study for several models. As shown in Table 5, the S2ST model frequently generates text accompanied by over-generated content, while our proposed DSG model alleviates this problem significantly and consistently. When comparing the DSG-H model with the DSG model, we find that the DSG-H model can also avoid producing over-generated content. However, it tends to remove many correct words, making the sentence incoherent and unreadable. Take the last case for example: the S2ST model conveys that Gaius Helen Mohiam comes from an American comic book. However, the given input KB triple does not mention this fact. On the other hand, the DSG-H model produces "... created by Frank Herberfor the Dune univer ...", which is not even a human-readable sentence.

Related Works
During the past few years, many tasks have been proposed to generate human-readable text from structured data. WebNLG (Gardent et al., 2017a,b; Ferreira et al., 2019) is proposed to describe KB triples sampled from DBPedia (Auer et al., 2007).
The E2E task (Novikova et al., 2017; Dušek et al., 2020) is proposed for generating restaurant reviews based on given attributes. Lebret et al. (2016) propose the WikiBio task to generate people's biographies based on the given Wikipedia infobox. Fu et al. (2020a) propose to generate text based on event chains. Moreover, Liang et al. (2009) propose to generate weather reports from weather records, and Wiseman et al. (2017), Chen and Mooney (2008), and Puduppully et al. (2019) propose to generate match reports according to match briefings. All these datasets are restricted to a few domains where well-aligned data happens to be available. No existing works focus on handling partially-aligned data. To solve the dataset scarcity problem, Fu et al. (2020c) propose to use dual learning to train generation models based on unaligned text and knowledge triples. The model generates text based on input triples and then predicts the input triples with a dual extraction model. The two models are trained alternately with dual learning. Although Cheng et al. (2020) proposed the ENT-DESC task aiming at generating better text descriptions for a few entities by exploring knowledge from a KB, their focus is more on distilling the useful part of the input knowledge.
Text alignment has been studied for many years. Dyer et al. (2013) propose the Fast Align model, which is a log-linear reparameterization of IBM Model 2. Legrand et al. (2016) propose a new score aggregation method to improve the alignment result. Moreover, attention-based models (Bahdanau et al., 2014) can also be recognized as a kind of alignment. However, these models focus on aligning source words to target words, and no existing model directly calculates supportiveness for generation tasks. In generation systems, Fu et al. (2020b) propose to dynamically align the current generation state with topics to improve the generation performance. However, it still cannot directly align to the input source words.

Conclusions
In this work, we propose a new task, namely partially-aligned Data-to-Text generation, in which we generate human-readable text based on automatically produced training data. This task is more practical and extensible to more domains. We propose a distant supervision generation framework to tackle the task. The experimental results show that our proposed model alleviates the over-generation problem effectively and outperforms all baseline models. Moreover, we contribute a partially-aligned dataset, WITA, produced by our novel automatic annotation framework, which is suitable for this new task.

Figure 3: Our proposed KB extractor for harvesting the partially-aligned data from Wikipedia and Wikidata.

Table 1: Statistics of WITA and WebNLG. For the text length and KB number, the reported values are mean, median, min, and max respectively.

Table 2: Main results.

Table 3: N-gram statistics for over-generation error analysis.

Table 5: Case study. The red font stands for over-generated words while the blue underline indicates incoherent parts.

For generated sentences, we first remove all stopwords and check whether each of the remaining words appears in the given input KB triples. If a word is not contained in the KB triples, we count it as an over-generated word. The statistics are shown in Table 3. It can be observed that DSG-H has the fewest over-generated words. This is because it directly drops all possible over-generated words during training.