Bootstrapping Generators from Noisy Data

A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and associated texts. In this paper we aim to bootstrap generators from large scale datasets where the data (e.g., DBPedia facts) and related texts (e.g., Wikipedia abstracts) are loosely aligned. We tackle this challenging task by introducing a special-purpose content selection mechanism. We use multi-instance learning to automatically discover correspondences between data and text pairs and show how these can be used to enhance the content signal while training an encoder-decoder architecture. Experimental results demonstrate that models trained with content-specific objectives improve upon a vanilla encoder-decoder which solely relies on soft attention.


Introduction
A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and paired texts (Barzilay and Lapata, 2005; Kim and Mooney, 2010; Liang et al., 2009). These correspondences describe how data representations are expressed in natural language (content realisation) but also indicate which subset of the data is verbalised in the text (content selection).
Although content selection is traditionally performed by domain experts, recent advances in generation using neural networks (Bahdanau et al., 2015; Ranzato et al., 2016) have led to the use of large scale datasets containing loosely related data and text pairs. Prime examples are online data sources like DBPedia (Auer et al., 2007) and Wikipedia and their associated texts, which are often independently edited. Other examples are sports databases and related textual resources. Wiseman et al. (2017) recently defined a generation task relating statistics of basketball games with commentaries and a blog written by fans.
1 Our code and data are available at https://github.com/EdinburghNLP/wikigen.
In this paper, we focus on short text generation from such loosely aligned data-text resources. We work with the biographical subset of the DBPedia and Wikipedia resources where the data corresponds to DBPedia facts and texts are Wikipedia abstracts about people. Figure 1 shows an example for the film-maker Robert Flaherty, the Wikipedia infobox, and the corresponding abstract. We wish to bootstrap a data-to-text generator that learns to verbalise properties about an entity from a loosely related example text. Given the set of properties in Figure (1a) and the related text in Figure (1b), we want to learn verbalisations for those properties that are mentioned in the text and produce a short description like the one in Figure (1c).
In common with previous work (Mei et al., 2016; Lebret et al., 2016; Wiseman et al., 2017), our model draws on insights from neural machine translation (Bahdanau et al., 2015; Sutskever et al., 2014) using an encoder-decoder architecture as its backbone. Lebret et al. (2016) introduce the task of generating biographies from Wikipedia data; however, they focus on single sentence generation. We generalize the task to multi-sentence text, and highlight the limitations of the standard attention mechanism which is often used as a proxy for content selection. When exposed to sub-sequences that do not correspond to any facts in the input, the soft attention mechanism will still try to justify the sequence and somehow distribute the attention weights over the input representation (Ghader and Monz, 2017). The decoder will still memorise high frequency sub-sequences in spite of these not being supported by any facts in the input.
Figure 1 (b): Robert Joseph Flaherty, (February 16, 1884 – July 23, 1951) was an American film-maker who directed and produced the first commercially successful feature-length documentary film, Nanook of the North (1922). The film made his reputation and nothing in his later life fully equalled its success, although he continued the development of this new genre of narrative documentary, e.g., with Moana (1926), set in the South Seas, and Man of Aran (1934), filmed in Ireland's Aran Islands. He is considered the "father" of both the documentary and the ethnographic film.
Figure 1 (c): Robert Joseph Flaherty, (February 16, 1884 – July 23, 1951) was an American film-maker. Flaherty was married to Frances H. Flaherty until his death in 1951.
We propose to alleviate these shortcomings via a specific content selection mechanism based on multi-instance learning (MIL; Keeler and Rumelhart, 1992) which automatically discovers correspondences, namely alignments, between data and text pairs. These alignments are then used to modify the generation function during training. We experiment with two frameworks that allow us to incorporate alignment information, namely multi-task learning (MTL; Caruana, 1993) and reinforcement learning (RL; Williams, 1992).
In both cases we define novel objective functions using the learnt alignments. Experimental results using automatic and human-based evaluation show that models trained with content-specific objectives improve upon vanilla encoder-decoder architectures which rely solely on soft attention. The remainder of this paper is organised as follows. We discuss related work in Section 2 and describe the MIL-based content selection approach in Section 3. We explain how the generator is trained in Section 4 and present evaluation experiments in Section 5. Section 7 concludes the paper.

Related Work
Previous attempts to exploit loosely aligned data and text corpora have mostly focused on extracting verbalisation spans for data units. Most approaches work in two stages: initially, data units are aligned with sentences from related corpora using some heuristics and subsequently extra content is discarded in order to retain only text spans verbalising the data. Belz and Kow (2010) obtain verbalisation spans using a measure of strength of association between data units and words, Walter et al. (2013) extract textual patterns from paths in dependency trees, while Mrabet et al. (2016) rely on crowd-sourcing. Perez-Beltrachini and Gardent (2016) learn shared representations for data units and sentences reduced to subject-predicate-object triples with the aim of extracting verbalisations for knowledge base properties. Our work takes a step further: we not only induce data-to-text alignments but also learn generators that produce short texts verbalising a set of facts.
Our work is closest to recent neural network models which learn generators from independently edited data and text resources. Most previous work (Lebret et al., 2016; Chisholm et al., 2017; Sha et al., 2017; Liu et al., 2017) targets the generation of single sentence biographies from Wikipedia infoboxes, while Wiseman et al. (2017) generate game summary documents from a database of basketball games where the input is always the same set of table fields. In contrast, in our scenario, the input data varies from one entity (e.g., athlete) to another (e.g., scientist) and properties might be present or not due to data incompleteness. Moreover, our generator is enhanced with a content selection mechanism based on multi-instance learning. MIL-based techniques have been previously applied to a variety of problems including image retrieval (Maron and Ratan, 1998; Zhang et al., 2002), object detection (Carbonetto et al., 2008; Cour et al., 2011), text classification (Andrews and Hofmann, 2004), image captioning (Wu et al., 2015; Karpathy and Fei-Fei, 2015), paraphrase detection (Xu et al., 2014), and information extraction (Hoffmann et al., 2011). The application of MIL to content selection is novel to our knowledge.
We show how to incorporate content selection into encoder-decoder architectures following training regimes based on multi-task learning and reinforcement learning. Multi-task learning aims to improve a main task by incorporating joint learning of one or more related auxiliary tasks. It has been applied with success to a variety of sequence-prediction tasks focusing mostly on morphosyntax. Examples include chunking, tagging (Collobert et al., 2011; Søgaard and Goldberg, 2016; Bjerva et al., 2016; Plank, 2016), name error detection (Cheng et al., 2015), and machine translation (Luong et al., 2016). Reinforcement learning (Williams, 1992) has also seen popularity as a means of training neural networks to directly optimize a task-specific metric (Ranzato et al., 2016) or to inject task-specific knowledge (Zhang and Lapata, 2017). We are not aware of any work that compares the two training methods directly. Furthermore, our reinforcement learning-based algorithm differs from previous text generation approaches (Ranzato et al., 2016; Zhang and Lapata, 2017) in that it is applied to documents rather than individual sentences.

Bidirectional Content Selection
We consider loosely coupled data and text pairs where the data component is a set P of property-values $\{p_1 : v_1, \dots, p_{|P|} : v_{|P|}\}$ and the related text T is a sequence of sentences $(s_1, \dots, s_{|T|})$. We define a mention span τ as a (possibly discontinuous) subsequence of T containing one or several words that verbalise one or more property-values from P. For instance, in Figure 1, the mention span "married to Frances H. Flaherty" verbalises the property-value {Spouse(s) : Frances Johnson Hubbard}.
In traditional supervised data-to-text generation tasks, data units (e.g., $p_i : v_i$ in our particular setting) are either covered by some mention span $\tau_j$ or do not have any mention span at all in T. The latter is a case of content selection where the generator will learn which properties to ignore when generating text from such data. In this work, we consider text components which are independently edited, and will unavoidably contain unaligned spans, i.e., text segments which do not correspond to any property-value in P. The phrase "from 1914" in the text in Figure (1b) is such an example. Similarly, the last sentence talks about Frances' awards and nominations and this information is not supported by the properties either.
Our model checks content in both directions; it identifies which properties have a corresponding text span (data selection) and also foregrounds (un)aligned text spans (text selection). This knowledge is then used to discourage the generator from producing text not supported by facts in the property set P. We view a property set P and its loosely coupled text T as a coarse level, imperfect alignment. From this alignment signal, we want to discover a set of finer-grained alignments indicating which mention spans in T align to which properties in P. For each pair (P, T), we learn an alignment set A(P, T) which contains property-value word pairs. For example, for the properties spouse and died in Figure 1, we would like to derive the alignments in Table 1.
We formulate the task of discovering finer-grained word alignments as a multi-instance learning problem (Keeler and Rumelhart, 1992). We assume that words from the text are positive labels for some property-values but we do not know which ones. For each data-text pair (P, T), we derive |T| pairs of the form (P, s) where |T| is the number of sentences in T. We encode property sets P and sentences s into a common multi-modal h-dimensional embedding space. While doing this, we discover finer-grained alignments between words and property-values. The intuition is that by learning a high similarity score for a property set P and sentence pair s, we will also learn the contribution of individual elements (i.e., words and property-values) to the overall similarity score. We will then use this individual contribution as a measure of word and property-value alignment. More concretely, we assume the pair is aligned (or unaligned) if this individual score is above (or below) a given threshold. Across examples like the one shown in Figure (1a-b), we expect the model to learn an alignment between the text span "married to Frances H. Flaherty" and the property-value {spouse : Frances Johnson Hubbard}.
In what follows we describe how we encode (P, s) pairs and define the similarity function.
Property Set Encoder As there is no fixed order among the property-value pairs p : v in P, we individually encode each one of them. Furthermore, both properties p and values v may consist of short phrases, for instance, the property cause of death and value cerebral thrombosis in Figure 1. We therefore consider property-value pairs as concatenated sequences pv and use a bidirectional Long Short-Term Memory network (biLSTM; Hochreiter and Schmidhuber, 1997) for their encoding. Note that the same network is used for all pairs. Each property-value pair is encoded into a vector representation:
$$\mathbf{p}_i = \text{biLSTM}_{denc}(pv_i) \tag{1}$$
which is the output of the recurrent network at the final time step. We use addition to combine the forward and backward outputs and generate the encoding $\{\mathbf{p}_1, \dots, \mathbf{p}_{|P|}\}$ for P.
Sentence Encoder We also use a biLSTM to obtain a representation for the sentence $s = w_1, \dots, w_{|s|}$. Each word $w_t$ is represented by the concatenation of the forward and backward outputs of the networks at time step t:
$$\mathbf{w}_t = \text{biLSTM}_{senc}(w_t) \tag{2}$$
and each sentence is encoded as a sequence of vectors $(\mathbf{w}_1, \dots, \mathbf{w}_{|s|})$.
Alignment Objective Our learning objective seeks to maximise the similarity score between property set P and a sentence s (Karpathy and Fei-Fei, 2015). This similarity score is in turn defined on top of the similarity scores among property-values in P and words in s. Equation (3) defines this similarity function using the dot product. The function seeks to align each word to the best scoring property-value:
$$S_{Ps} = \sum_{t=1}^{|s|} \max_{i \in \{1, \dots, |P|\}} \mathbf{p}_i \cdot \mathbf{w}_t \tag{3}$$
Equation (4) defines our objective which encourages related properties P and sentences s to have higher similarity than other P′ ≠ P and s′ ≠ s:
$$\mathcal{L}_{CA} = \max(0, S_{Ps'} - S_{Ps} + 1) + \max(0, S_{P's} - S_{Ps} + 1) \tag{4}$$
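As a rough illustration of the alignment objective, the following Python sketch computes the word-to-best-property similarity of Equation (3) on toy vectors, together with a max-margin ranking loss in the spirit of Karpathy and Fei-Fei (2015). The function names, the toy 2-dimensional vectors, and the margin of 1 are assumptions made for illustration; in the model the vectors come from the biLSTM encoders.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity(props: List[Vector], words: List[Vector]) -> float:
    """S_Ps: score each word against its best-matching property-value
    and sum the per-word maxima over the sentence."""
    return sum(max(dot(p, w) for p in props) for w in words)


def ranking_loss(s_pos: float, s_wrong_sent: float, s_wrong_props: float,
                 margin: float = 1.0) -> float:
    """Hinge loss that is zero once the true pair beats both mismatched
    pairs by at least `margin` (the margin value is an assumption)."""
    return (max(0.0, s_wrong_sent - s_pos + margin)
            + max(0.0, s_wrong_props - s_pos + margin))


# Toy encodings (made-up numbers, for illustration only).
props = [[1.0, 0.0], [0.0, 1.0]]          # encoded property-values of P
sentence = [[0.9, 0.1], [0.2, 0.8]]       # encoded words of a related sentence s
other_sentence = [[-0.5, -0.5]]           # encoded words of an unrelated s'
other_props = [[-1.0, -1.0]]              # encoded property-values of an unrelated P'

s_pos = similarity(props, sentence)
loss = ranking_loss(s_pos,
                    similarity(props, other_sentence),
                    similarity(other_props, sentence))
```

The per-word maxima inside `similarity` are the individual contributions that are later thresholded to decide whether a word is aligned to a property-value.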

Generator Training
In this section we describe the base generation architecture and explain two alternative ways of using the alignments to guide the training of the model.One approach follows multi-task training where the generator learns to output a sequence of words but also to predict alignment labels for each word.The second approach relies on reinforcement learning for adjusting the probability distribution of word sequences learnt by a standard word prediction training algorithm.

Encoder-Decoder Base Generator
We follow a standard attention based encoder-decoder architecture for our generator (Bahdanau et al., 2015; Luong et al., 2015). Given a set of properties X as input, the model learns to predict an output word sequence Y which is a verbalisation of (part of) the input. More precisely, the generation of sequence Y is conditioned on input X:
$$P(Y|X) = \prod_{t=1}^{|Y|} P(y_t | y_{1:t-1}, X) \tag{5}$$
The encoder module constitutes an intermediate representation of the input. For this, we use the property-set encoder described in Section 3 which outputs vector representations $\{\mathbf{p}_1, \dots, \mathbf{p}_{|X|}\}$ for a set of property-value pairs.
The decoder uses an LSTM and a soft attention mechanism (Luong et al., 2015) to generate one word $y_t$ at a time conditioned on the previous output words and a context vector $\mathbf{c}_t$ dynamically created:
$$P(y_t | y_{1:t-1}, X) = \text{softmax}(g(\mathbf{h}_t, \mathbf{c}_t)) \tag{6}$$
where g(·) is a neural network with one hidden layer parametrised by $W_o \in \mathbb{R}^{|V| \times d}$, |V| is the output vocabulary size and d the hidden unit dimension, over $\mathbf{h}_t$ and $\mathbf{c}_t$ composed as follows:
$$g(\mathbf{h}_t, \mathbf{c}_t) = W_o \tanh(W_c [\mathbf{c}_t; \mathbf{h}_t]) \tag{7}$$
where $W_c \in \mathbb{R}^{d \times 2d}$. $\mathbf{h}_t$ is the hidden state of the LSTM decoder which summarises $y_{1:t}$:
$$\mathbf{h}_t = \text{LSTM}(y_t, \mathbf{h}_{t-1}) \tag{8}$$
The dynamic context vector $\mathbf{c}_t$ is the weighted sum of the hidden states of the input property set (Equation (9)); and the weights $\alpha_{ti}$ are determined by a dot product attention mechanism:
$$\mathbf{c}_t = \sum_{i=1}^{|X|} \alpha_{ti} \mathbf{p}_i \tag{9}$$
$$\alpha_{ti} = \frac{\exp(\mathbf{h}_t \cdot \mathbf{p}_i)}{\sum_{i'} \exp(\mathbf{h}_t \cdot \mathbf{p}_{i'})} \tag{10}$$
We initialise the decoder with the averaged sum of the encoded input representations (Vinyals et al., 2016). The model is trained to optimize negative log likelihood:
$$\mathcal{L}_{wNLL} = -\sum_{t=1}^{|Y|} \log P(y_t | y_{1:t-1}, X) \tag{11}$$
We extend this architecture to multi-sentence texts in a way similar to Wiseman et al. (2017). We view the abstract as a single sequence, i.e., all sentences are concatenated. When training, we cut the abstracts into blocks of equal size and perform forward-backward iterations for each block (this includes the back-propagation through the encoder). From one block iteration to the next, we initialise the decoder with the last state of the previous block. The block size is a hyperparameter tuned experimentally on the development set.
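The dot product attention step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: `attention_context` is a hypothetical helper name, the vectors are toy values, and the decoder state and the property encodings are assumed to share the same dimension.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def attention_context(h_t: Vector, props: List[Vector]) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for decoder state h_t.

    Weights are a softmax over dot products between h_t and each encoded
    property-value; the context is the weighted sum of those encodings.
    """
    scores = [sum(h * p for h, p in zip(h_t, prop)) for prop in props]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(props[0])
    context = [sum(a * prop[k] for a, prop in zip(alphas, props)) for k in range(dim)]
    return alphas, context


# Toy decoder state and two encoded property-values (made-up numbers).
alphas, context = attention_context([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```

Note that the weights always sum to one, so some attention mass is distributed over the input even when the word being generated is not supported by any property; this is the limitation of soft attention as a proxy for content selection discussed in the introduction.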

Predicting Alignment Labels
The generation of the output sequence is conditioned on the previous words and the input. However, when certain sequences are very common, the language modelling conditional probability will prevail over the input conditioning. For instance, the phrase from 1914 in our running example is very common in contexts that talk about periods of marriage or club membership, and as a result, the language model will output this phrase often, even in cases where there are no supporting facts in the input. The intuition behind multi-task training (Caruana, 1993) is that it will smooth the probabilities of frequent sequences when trying to simultaneously predict alignment labels.
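Using the set of alignments obtained by our content selection model, we associate each word in the training data with a binary label $a_t$ indicating whether it aligns with some property in the input set. Our auxiliary task is to predict $a_t$ given the sequence of previously predicted words and input X:
$$P(a_t | y_{1:t-1}, X) = \text{sigmoid}(g'(\mathbf{h}_t, \mathbf{c}_t)) \tag{12}$$
$$g'(\mathbf{h}_t, \mathbf{c}_t) = \mathbf{v}_a \cdot \tanh(W_c [\mathbf{c}_t; \mathbf{h}_t]) \tag{13}$$
where $\mathbf{v}_a \in \mathbb{R}^d$ and the other operands are as defined in Equation (7). We optimise the following auxiliary objective function:
$$\mathcal{L}_{aln} = -\sum_{t=1}^{|Y|} \log P(a_t | y_{1:t-1}, X) \tag{14}$$
and the combined multi-task objective is the weighted sum of both word prediction and alignment prediction losses:
$$\mathcal{L}_{MTL} = \lambda \mathcal{L}_{wNLL} + (1 - \lambda) \mathcal{L}_{aln} \tag{15}$$
where λ controls how much model training will focus on each task. As we will explain in Section 5, we can anneal this value during training in favour of one objective or the other.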
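The weighted multi-task objective and the annealing of λ can be illustrated with the following sketch. It assumes that λ weights the word prediction loss and 1 − λ the alignment loss, which is consistent with the training configuration reported later (λ = 0.1 while training focuses on alignment prediction, λ = 0.9 afterwards). The function names and the zero-based epoch counter are assumptions; the probabilities would come from the decoder in practice.

```python
import math
from typing import List


def word_nll(word_probs: List[float]) -> float:
    """Negative log-likelihood of the reference words."""
    return -sum(math.log(p) for p in word_probs)


def alignment_nll(align_probs: List[float], labels: List[int]) -> float:
    """Negative log-likelihood of the binary alignment labels a_t,
    where align_probs[t] is the predicted probability that a_t = 1."""
    return -sum(math.log(p if a == 1 else 1.0 - p)
                for p, a in zip(align_probs, labels))


def mtl_loss(word_probs: List[float], align_probs: List[float],
             labels: List[int], lam: float) -> float:
    """Weighted sum of word prediction and alignment prediction losses."""
    return lam * word_nll(word_probs) + (1.0 - lam) * alignment_nll(align_probs, labels)


def lambda_schedule(epoch: int, switch_epoch: int = 4,
                    early: float = 0.1, late: float = 0.9) -> float:
    """Step schedule: `early` for the first `switch_epoch` epochs
    (epochs counted from zero, an assumption), then `late`."""
    return early if epoch < switch_epoch else late


# One-token toy example with made-up probabilities.
example_loss = mtl_loss([0.5], [0.5], [1], lam=lambda_schedule(0))
```

With a small λ the gradient is dominated by the alignment labels, so the decoder states are first shaped to separate supported from unsupported words before word prediction takes over.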

Reinforcement Learning Training
Although the multi-task approach aims to smooth the target distribution, the training process is still driven by the imperfect target text. In other words, at each time step t the algorithm feeds the previous word $w_{t-1}$ of the target text and evaluates the prediction against the target $w_t$.
Alternatively, we propose a training approach based on reinforcement learning (Williams, 1992) which allows us to define an objective function that does not fully rely on the target text but rather on a revised version of it. In our case, the set of alignments obtained by our content selection model provides a revision for the target text. The advantages of reinforcement learning are twofold: (a) it allows us to exploit additional task-specific knowledge (Zhang and Lapata, 2017) during training, and (b) it enables the exploration of other word sequences through sampling. Our setting differs from previous applications of RL (Ranzato et al., 2016; Zhang and Lapata, 2017) in that the reward function is not computed on the target text but rather on its alignments with the input.
The encoder-decoder model is viewed as an agent whose action space is defined by the set of words in the target vocabulary. At each time step, the encoder-decoder takes action $\hat{y}_t$ with policy $P_\pi(\hat{y}_t | \hat{y}_{1:t-1}, X)$ defined by the probability in Equation (6). The agent terminates when it emits the End Of Sequence (EOS) token, at which point the sequence of all actions taken yields the output sequence $\hat{Y} = (\hat{y}_1, \dots, \hat{y}_{|\hat{Y}|})$. This sequence in our task is a short text describing the properties of a given entity. After producing the sequence of actions $\hat{Y}$, the agent receives a reward $r(\hat{Y})$ and the policy is updated according to this reward.

Reward Function
We define the reward function $r(\hat{Y})$ on the alignment set A(X, Y). If the output action sequence $\hat{Y}$ is precise with respect to the set of alignments A(X, Y), the agent will receive a high reward. Concretely, we define $r(\hat{Y})$ as follows:
$$r(\hat{Y}) = \gamma^{pr} r^{pr} \tag{16}$$
where $\gamma^{pr}$ adjusts the reward value $r^{pr}$ which is the unigram precision of the predicted sequence $\hat{Y}$ and the set of words in A(X, Y).
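A minimal sketch of such a precision-based reward is given below. It assumes that precision is computed over tokens (every predicted token counts, repeated tokens included) and that an empty prediction receives zero reward; both choices, the function names, and the default value of the scaling factor are assumptions made for illustration.

```python
from typing import List, Set


def unigram_precision(predicted: List[str], aligned_words: Set[str]) -> float:
    """Fraction of predicted tokens that appear among the aligned words."""
    if not predicted:
        return 0.0
    hits = sum(1 for token in predicted if token in aligned_words)
    return hits / len(predicted)


def reward(predicted: List[str], aligned_words: Set[str], gamma_pr: float = 1.0) -> float:
    """Scaled precision reward; gamma_pr is the adjustment factor."""
    return gamma_pr * unigram_precision(predicted, aligned_words)


# Words aligned to the spouse property in the running example.
aligned = {"married", "to", "frances", "h.", "flaherty"}
sampled = "flaherty was married to frances h. flaherty from 1914".split()
r = reward(sampled, aligned)
```

In this toy case six of the nine sampled tokens are supported by the alignment set, so unsupported material such as "from 1914" lowers the reward.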
Training Algorithm We use the REINFORCE algorithm (Williams, 1992) to learn an agent that maximises the reward function. As this is a gradient descent method, the training loss of a sequence is defined as the negative expected reward:
$$\mathcal{L}_{RL} = -\mathbb{E}_{(\hat{y}_1, \dots, \hat{y}_{|\hat{Y}|}) \sim P_\pi(\cdot | X)} \left[ r(\hat{y}_1, \dots, \hat{y}_{|\hat{Y}|}) \right] \tag{17}$$
where $P_\pi$ is the agent's policy, i.e., the word distribution produced by the encoder-decoder model (Equation (6)) and r(·) is the reward function as defined in Equation (16). The gradient of $\mathcal{L}_{RL}$ is given by: where

Experimental Setup
Data We evaluated our model on a dataset collated from WIKIBIO (Lebret et al., 2016), a corpus of 728,321 biography articles (their first paragraph) and their infoboxes sampled from the English Wikipedia. We adapted the original dataset in three ways. Firstly, we made use of the entire abstract rather than the first sentence. Secondly, we reduced the dataset to examples with a rich set of properties and multi-sentential text. We eliminated examples with fewer than six property-value pairs and abstracts consisting of one sentence. We also placed a minimum restriction of 23 words on the length of the abstract. We considered abstracts up to a maximum of 12 sentences and property sets with a maximum of 50 property-value pairs. Finally, we associated each abstract with the set of DBPedia properties p : v corresponding to the abstract's main entity. As entity classification is available in DBPedia for most entities, we concatenated class information c (whenever available) with the property value, i.e., p : v c. In Figure 1, the property value spouse : Frances H. Flaherty is extended with class information from the DBPedia ontology to spouse : Frances H. Flaherty Person.
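Pre-processing Numeric date formats were converted to a surface form with month names. Numerical expressions were delexicalised using different tokens created with the property name and position of the delexicalised token on the value sequence. For instance, given the property-value for birth date in Figure (1a), the first sentence in the abstract (Figure (1b)) becomes "Robert Joseph Flaherty, (February DLX birth date 2, DLX birth date 4 - July . . .". Years and numbers in the text not found in the values of the property set were replaced with tokens YEAR and NUMERIC. In a second phase, when creating the input and output vocabularies, $V_I$ and $V_O$ respectively, we delexicalised words w which were absent from the output vocabulary but were attested in the input vocabulary. Again, we created tokens based on the property name and the position of the word in the value sequence. Words not in $V_O$ or $V_I$ were replaced with the symbol UNK. Vocabulary sizes were limited to $|V_I| = 50k$ and $|V_O| = 50k$ for the alignment model and $|V_O| = 20k$ for the generator. We discarded examples where the text contained more than three UNKs (for the content aligner) and five UNKs (for the generator); or more than two UNKs in the property-value (for generation). Finally, we added the empty relation to the property sets. Table 2 summarises the dataset statistics for the generator. We report the number of abstracts in the dataset (size), the average number of sentences and tokens in the abstracts, and the average number of properties and sentence length in tokens (sent.len). For the content aligner (cf. Section 3), each sentence constitutes a training instance, and as a result the sizes of the train and development sets are 796,446 and 153,096, respectively.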
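The delexicalisation of numerical expressions can be sketched as follows. This is only an illustration under stated assumptions: the placeholder spelling (`DLX_<property>_<position>` with underscores), the 1-based position, and the heuristic that a four-digit number is a year are all assumptions, and the helper names are hypothetical. Only the birth date property is mapped here, so the death date in the toy text falls back to NUMERIC and YEAR.

```python
import re
from typing import Dict, List

_DIGITS = re.compile(r"\d+")


def delex_map(prop: str, value_tokens: List[str]) -> Dict[str, str]:
    """Map each numeric token of a property value to a placeholder built
    from the property name and the token's 1-based position in the value."""
    name = prop.replace(" ", "_")
    return {
        tok: f"DLX_{name}_{pos}"
        for pos, tok in enumerate(value_tokens, start=1)
        if _DIGITS.fullmatch(tok)
    }


def delexicalise(text_tokens: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace numbers found in the property values by their placeholders;
    other numbers become YEAR (four digits, an assumption) or NUMERIC."""
    out = []
    for tok in text_tokens:
        if tok in mapping:
            out.append(mapping[tok])
        elif _DIGITS.fullmatch(tok):
            out.append("YEAR" if len(tok) == 4 else "NUMERIC")
        else:
            out.append(tok)
    return out


mapping = delex_map("birth date", ["february", "16", ",", "1884"])
tokens = "( february 16 , 1884 - july 23 , 1951 )".split()
delexed = delexicalise(tokens, mapping)
```

Tying the placeholder to the property name and position lets the generator copy the right value back at realisation time instead of having to learn every individual number.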

Training Configuration
We adjusted all models' hyperparameters according to their performance on the development set.The encoders for both content selection and generation models were initialised with GloVe (Pennington et al., 2014) pre-trained vectors.The input and hidden unit dimension was set to 200 for content selection and 100 for generation.In all models, we used encoder biLSTMs and decoder LSTM (regularised with a dropout rate of 0.3 (Zaremba et al., 2014)) with one layer.Content selection and generation models (base encoder-decoder and MTL) were trained for 20 epochs with the ADAM optimiser (Kingma and Ba, 2014) using a learning rate of 0.001.The reinforcement learning model was initialised with the base encoder-decoder model and trained for 35 additional epochs with stochastic gradient descent and a fixed learning rate of 0.001.Block sizes were set to 40 (base), 60 (MTL) and 50 (RL).Weights for the MTL objective were also tuned experimentally; we set λ = 0.1 for the first four epochs (training focuses on alignment prediction) and switched to λ = 0.9 for the remaining epochs.

Content Alignment
We optimized content alignment on the development set against manual alignments.
Specifically, two annotators aligned 132 sentences to their infoboxes.We used the Yawat annotation tool (Germann, 2008) and followed the alignment guidelines (and evaluation metrics) used in Cohn et al. (2008).The inter-annotator agreement using macro-averaged f-score was 0.72 (we treated one annotator as the reference and the other one as hypothetical system output).
Alignment sets were extracted from the model's output (cf. Section 3) by optimizing the threshold avg(sim) + a * std(sim) where sim denotes the similarity between the set of property values and words, and a is empirically set to 0.75; avg and std are the mean and standard deviation of sim scores across the development set. Each word was aligned to a property-value if their similarity exceeded a threshold of 0.22. Our best content alignment model (Content-Aligner) obtained an f-score of 0.36 on the development set.
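The thresholding rule above can be sketched as follows. The similarity values are made up, the helper names are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev
from typing import Dict, List, Set


def alignment_threshold(dev_sims: List[float], a: float = 0.75) -> float:
    """Threshold avg(sim) + a * std(sim) over development-set similarities."""
    return mean(dev_sims) + a * pstdev(dev_sims)


def aligned_words(word_sims: Dict[str, float], threshold: float) -> Set[str]:
    """Keep the words whose best property-value similarity exceeds the threshold."""
    return {word for word, sim in word_sims.items() if sim > threshold}


# Toy development-set similarities and per-word scores (made-up numbers).
dev_sims = [0.0, 0.2, 0.4]
threshold = alignment_threshold(dev_sims)
selected = aligned_words({"married": 0.5, "the": 0.1}, threshold)
```

Expressing the threshold relative to the score distribution makes the single tunable quantity `a` independent of the absolute scale of the learnt similarities.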
We also compared our Content-Aligner against a baseline based on pre-trained word embeddings (EmbeddingsBL). For each pair (P, s) we computed the dot product between words in s and properties in P (properties were represented by the averaged sum of their words' vectors). Words were aligned to property-values if their similarity exceeded a threshold of 0.4. EmbeddingsBL obtained an f-score of 0.057 against the manual alignments. Finally, we compared the performance of the Content-Aligner at the level of property set P and sentence s similarity by comparing the average ranking position of correct pairs among 14 distractors, namely rank@15. The Content-Aligner obtained a rank of 1.31, while the EmbeddingsBL model had a rank of 7.99 (lower is better).
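The rank@15 measure can be illustrated with a short sketch: for each test case the correct pair is ranked among its distractors by similarity score, and the positions are averaged. The scores below are made up, the function names are hypothetical, and breaking ties in favour of the correct pair is an assumption.

```python
from typing import List, Tuple


def rank_of_correct(correct_score: float, distractor_scores: List[float]) -> int:
    """1-based rank of the correct pair; only strictly higher-scoring
    distractors push it down (ties favour the correct pair)."""
    return 1 + sum(1 for score in distractor_scores if score > correct_score)


def mean_rank(cases: List[Tuple[float, List[float]]]) -> float:
    """Average rank of the correct pair over all test cases."""
    return sum(rank_of_correct(c, d) for c, d in cases) / len(cases)


# Two toy cases, each with one correct pair and 14 distractors.
cases = [
    (0.9, [0.1] * 14),               # correct pair ranked first
    (0.5, [0.6, 0.7] + [0.1] * 12),  # two distractors score higher
]
average = mean_rank(cases)
```

With 15 candidates per case, a scorer that ranks at random would average a position of 8, which gives a reference point for the two reported values.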

Results
We compared the performance of an encoder-decoder model trained with the standard negative log-likelihood method (ED), against a model trained with multi-task learning (ED_MTL) and reinforcement learning (ED_RL). We also included a template baseline system (Templ) in our evaluation experiments.
Table 3: BLEU-4 results using the original Wikipedia abstract (Abstract) as reference and crowd-sourced revised abstracts (RevAbs) for template baseline (Templ), standard encoder-decoder model (ED), and our content-based models trained with multi-task learning (ED_MTL) and reinforcement learning (ED_RL).
The template generator used hand-written rules to realise property-value pairs. As an approximation for content selection, we obtained the 50 most frequent property names from the training set and manually defined content ordering rules with the following criteria. We ordered personal life properties (e.g., birth date or occupation) based on their most common order of mention in the Wikipedia abstracts. Profession dependent properties (e.g., position or genre) were assigned an equal ordering but posterior to the personal properties. We manually lexicalised properties into single sentence templates to be concatenated to produce the final text. The template for the property position and example verbalisation for the property-value position : defender of the entity zanetti are "[NAME] played as [POSITION]." and "Zanetti played as defender." respectively.
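The template baseline described above can be approximated with a few lines of Python. Only the position template is taken from the text; the occupation template, the contents of the ordering list, the helper names, and the example value "footballer" are hypothetical and serve purely as illustration.

```python
from typing import Dict, List, Tuple

TEMPLATES: Dict[str, str] = {
    "position": "[NAME] played as [POSITION].",   # template quoted in the text
    "occupation": "[NAME] was a [OCCUPATION].",   # hypothetical template
}

# Personal-life properties in their assumed order of mention; every other
# property (e.g. position, genre) shares one later rank.
PERSONAL_ORDER: List[str] = ["birth date", "occupation"]


def priority(prop: str) -> int:
    """Ordering key: personal properties first, all others tied afterwards."""
    return PERSONAL_ORDER.index(prop) if prop in PERSONAL_ORDER else len(PERSONAL_ORDER)


def realise(name: str, prop: str, value: str) -> str:
    """Fill the single-sentence template of one property-value pair."""
    slot = "[" + prop.upper() + "]"
    return TEMPLATES[prop].replace("[NAME]", name).replace(slot, value)


def generate(name: str, pairs: List[Tuple[str, str]]) -> str:
    """Order the pairs, realise those that have a template, and concatenate."""
    ordered = sorted(pairs, key=lambda pair: priority(pair[0]))  # stable sort
    return " ".join(realise(name, p, v) for p, v in ordered if p in TEMPLATES)


sentence = realise("Zanetti", "position", "defender")
text = generate("Zanetti", [("position", "defender"), ("occupation", "footballer")])
```

Because every property is realised as its own sentence, such a system verbalises whatever is in the property set, which matches the repetitive outputs shown for Templ in the examples.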
Automatic Evaluation Table 3 shows the results of automatic evaluation using BLEU-4 (Papineni et al., 2002) against the noisy Wikipedia abstracts. Considering these as a gold standard is, however, not entirely satisfactory for two reasons. Firstly, our models generate considerably shorter text and will be penalized for not generating text they were not supposed to generate in the first place. Secondly, the model might try to reproduce what is in the imperfect reference but not supported by the input properties and as a result will be rewarded when it should not. To alleviate this, we crowd-sourced using AMT a revised version of 200 randomly selected abstracts from the test set. Crowdworkers were shown a Wikipedia infobox with the accompanying abstract and were asked to adjust the text to the content present in the infobox. Annotators were instructed to delete spans which did not have supporting facts and rewrite the remaining parts into a well-formed text. We collected three revised versions for each abstract. Inter-annotator agreement was 81.64 measured as the mean pairwise BLEU-4 amongst AMT workers.
Automatic evaluation results against the revised abstracts are also shown in Table 3. As can be seen, all encoder-decoder based models have a significant advantage over Templ when evaluating against both types of abstracts. The model enabled with the multi-task learning content selection mechanism brings an improvement of 1.29 BLEU-4 over a vanilla encoder-decoder model. Performance of the RL trained model is inferior and close to the ED model. We discuss the reasons for this discrepancy shortly.
Table 4
System   1st    2nd    3rd    4th    5th    Rank
Templ    12.17  14.33  10.17  15.50  47.83  3.72
ED       12.83  24.17  24.67  25.17  13.17  3.02
ED_MTL   14.83  26.17  26.17  19.17  13.67  2.90
ED_RL    14.67  25.00  25.50  24.00  10.83  2.91
RevAbs   47.00  14.00  12.67  16.17   9.17  2.27
To provide a rough comparison with the results reported in Lebret et al. (2016), we also computed BLEU-4 on the first sentence of the text generated by our system. Recall that their model generates the first sentence of the abstract, whereas we output multi-sentence text. Using the first sentence in the Wikipedia abstract as reference, we obtained a score of 37.29% (ED), 38.42% (ED_MTL) and 38.1% (ED_RL) which compare favourably with their best performing model (34.7% ± 0.36).

Human-Based Evaluation
We further examined differences among systems in a human-based evaluation study. Using AMT, we elicited 3 judgements for the same 200 infobox-abstract pairs we used in the abstract revision study. We compared the output of the templates, the three neural generators and also included one of the human edited abstracts as a gold standard (reference). For each test case, we showed crowdworkers the Wikipedia infobox and five short texts in random order. The annotators were asked to rank each of the texts according to the following criteria: (1) Is the text faithful to the content of the table? and (2) Is the text overall comprehensible and fluent? Ties were allowed only when texts were identical strings. Table 5 presents examples of the texts (and properties) crowdworkers saw.
Table 5 (excerpt, entity dorsey burnette)
(December 28 , 1932-August 19 , 1979) (December 28 , 1932-August 19 , 1979) was an american singer and songwriter. He was a member of the Rock band the band from YEAR to YEAR.
ED_MTL: Dorothy Burnette (December 28 , 1932-August 19 , 1979) was an american country music singer and songwriter. He was a member of the Rock band Roll.
ED_RL: Burnette Burnette (December 28 , 1932-August 19 , 1979)
Table 4 shows, proportionally, how often crowdworkers ranked each system first, second, and so on. Unsurprisingly, the human authored gold text is considered best (and ranked first 47% of the time). ED_MTL is mostly ranked second and third best, followed closely by ED_RL. The vanilla encoder-decoder system ED is mostly fourth and Templ is fifth. As shown in the last column of the table (Rank), the ranking of ED_MTL is overall slightly better than ED_RL. We further converted the ranks to ratings on a scale of 1 to 5 (assigning ratings 5...1 to rank placements 1...5). This allowed us to perform Analysis of Variance (ANOVA) which revealed a reliable effect of system type. Post-hoc Tukey tests showed that all systems were significantly worse than RevAbs and significantly better than Templ (p < 0.05). ED_MTL is not significantly better than ED_RL but is significantly (p < 0.05) different from ED.
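The rank-to-rating conversion and the mean rank column of Table 4 can be reproduced approximately with the sketch below. The proportions for Templ are copied from Table 4; the function names are hypothetical, and normalising by the sum of the proportions (to absorb rounding in the reported percentages) is an assumption.

```python
from typing import List


def rank_to_rating(rank: int, n_systems: int = 5) -> int:
    """Map rank placements 1..5 to ratings 5..1."""
    return n_systems + 1 - rank


def mean_rank(proportions: List[float]) -> float:
    """Mean rank from the share of times a system was placed 1st, 2nd, ..."""
    total = sum(proportions)
    weighted = sum(pos * share for pos, share in enumerate(proportions, start=1))
    return weighted / total


templ = [12.17, 14.33, 10.17, 15.50, 47.83]   # row "Templ" of Table 4
templ_mean_rank = mean_rank(templ)
ratings = [rank_to_rating(r) for r in range(1, 6)]
```

Because the published proportions are rounded, recomputed mean ranks can differ from the table in the second decimal for some systems.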
Discussion The texts generated by ED_RL are shorter compared to the other two neural systems which might affect BLEU-4 scores and also the ratings provided by the annotators. As shown in Table 5 (entity dorsey burnette), ED_RL drops information pertaining to dates or chooses to just verbalise birth place information. In some cases, this is preferable to hallucinating incorrect facts; however, in other cases outputs with more information are rated more favourably. Overall, ED_MTL seems to be more detail oriented and faithful to the facts included in the infobox (see dorsey burnette, aaron moores, or kirill moryganov).
The template system manages in some specific configurations to verbalise appropriate facts (indrani bose); however, it often fails to verbalise infrequent properties (aaron moores) or focuses on properties which are very frequent in the knowledge base but are rarely found in the abstracts (kirill moryganov).

Conclusions
In this paper we focused on the task of bootstrapping generators from large-scale datasets consisting of DBPedia facts and related Wikipedia biography abstracts. We proposed to equip standard encoder-decoder models with an additional content selection mechanism based on multi-instance learning and developed two training regimes, one based on multi-task learning and the other on reinforcement learning. Overall, we find that the proposed content selection mechanism improves the accuracy and fluency of the generated texts.
In the future, it would be interesting to investigate a more sophisticated representation of the input (Vinyals et al., 2016). It would also make sense for the model to decode hierarchically, taking sequences of words and sentences into account (Zhang and Lapata, 2014; Lebret et al., 2015).

Flaherty was married to writer Frances H. Flaherty from 1914 until his death in 1951. Frances worked on several of her husband's films, and received an Academy Award nomination for Best Original Story for Louisiana Story (1948).

Figure 1: Property-value pairs (a), related biographic abstract (b) for the Wikipedia entity Robert Flaherty, and model verbalisation in italics (c).
Figure (1a), the first sentence in the abstract (Figure (1b)) becomes "Robert Joseph Flaherty, (February DLX birth date 2, DLX birth date 4 -July . . .". Years and numbers in the text not found in the values of the property set were replaced with tokens YEAR and NUMERIC.² In a second phase, when creating the input and output vocabularies, V_I and V_O respectively, we delexicalised words w which were absent from the output vocabulary but were attested in the input vocabulary. Again, we created tokens based on the property name and the position of the word in the value sequence. Words not in V_O or V_I were replaced with the symbol UNK. Vocabulary sizes were limited to |V_I| = 50k and |V_O| = 50k for the alignment model and |V_O| = 20k for the generator. We discarded examples where the text contained more than three UNKs (for the con-
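As a rough illustration of the delexicalisation steps described in this passage, the sketch below replaces tokens with property-and-position placeholders, YEAR, NUMERIC, or UNK. It collapses the two phases into one pass and approximates "attested in the input vocabulary" by membership in the entity's own property values; the underscore token format, the 1-based positions, the four-digit year pattern, and all function and variable names are assumptions made for the example.

```python
import re


def is_number(token):
    """True for integer or decimal tokens such as '23' or '1.83'."""
    return re.fullmatch(r"\d+(\.\d+)?", token) is not None


def is_year(token):
    """Assumption: a year is any four-digit token."""
    return re.fullmatch(r"\d{4}", token) is not None


def delexicalise(text_tokens, properties, out_vocab):
    """Delexicalise a tokenised abstract.

    properties maps a property name to the list of tokens in its value.
    """
    # Placeholder built from the property name and the 1-based position
    # of the token in the value sequence; the first occurrence wins.
    placeholder = {}
    for prop, value_tokens in properties.items():
        for pos, tok in enumerate(value_tokens, start=1):
            placeholder.setdefault(tok, f"DLX_{prop}_{pos}")

    result = []
    for tok in text_tokens:
        if is_number(tok):
            if tok in placeholder:
                result.append(placeholder[tok])
            else:
                result.append("YEAR" if is_year(tok) else "NUMERIC")
        elif tok in out_vocab:
            result.append(tok)
        elif tok in placeholder:
            result.append(placeholder[tok])
        else:
            result.append("UNK")
    return result


def keep_example(tokens, max_unk=3):
    """Keep a text only if it contains at most max_unk UNK symbols."""
    return tokens.count("UNK") <= max_unk


# Hypothetical toy input, not taken from the dataset.
toy_properties = {"birth_date": ["february", "16", ",", "1884"]}
toy_vocab = {"robert", "(", "february", ",", "-", "july", ")"}
toy_text = "robert flaherty ( february 16 , 1884 - july 23 , 1951 )".split()
example = delexicalise(toy_text, toy_properties, toy_vocab)
```

In this toy run the day and year of birth become positional placeholders, the unmatched day and year become NUMERIC and YEAR, and the out-of-vocabulary surname becomes UNK.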

Table 1: Example of word-property alignments for the Wikipedia abstract and facts in Figure 1.

Table 4: Rankings shown as proportions and mean ranks given to systems by human subjects.
propertyset: name= dorsey burnette, date= may 2012, bot= blevintron bot, background= solo singer, birth= december 28 , 1932, birth place= memphis, tennessee, death place= {los angeles; canoga park, california}, death= august 19 , 1979, associated acts= the rock and roll trio, hometown= memphis, tennessee, genre= {rock and roll; rockabilly; country music}, occupation= {composer; singer}, instruments= {rockabilly bass; vocals; acoustic guitar}, record labels= {era records; coral records; smash records; imperial records; capitol records; dot records; reprise records}
RevAbs: Dorsey Burnette was an american early Rockabilly singer. He was a member of the Rock and Roll Trio.
Templ: Dorsey Burnette (DB) was born in December 28 , 1932. DB was born in Memphis, Tennessee. DB died in August 19 , 1979. DB died in August 19 , 1979. DB died in Canoga Park, California. DB died in los angeles. DB was a composer. DB was a singer. DB 's genre was Rock and Roll. The background of DB was solo singer. DB 's genre was Rockabilly. DB worked with the Rock and Roll Trio. DB 's genre was Country music. DB worked with the Rock and Roll Trio.
ED: Dorsey Burnette was an american singer and songwriter. He was born in memphis , Tennessee.

propertyset: name= indrani bose, doctoral advisor= chanchal kumar majumdar, alma mater= university of calcutta, birth= 1951-0-0, birth place= kolkata, field= theoretical physics, work institution= bose institute, birth= august 15 , 1951, honours= fna sc, nationality= india, known for= first recipient of stree sakthi science samman award
RevAbs: Indrani Bose (born 1951) is an indian physicist at the Bose institute. Professor Bose obtained her ph.d. from University of Calcutta
Templ: Indrani Bose (IB) was born in year-0-0. IB was born in August 15 , 1951. IB was born in kolkata. IB was a india. IB studied at University of Calcutta. IB was known for First recipient of Stree Sakthi Science Samman Award.
ED: Indrani UNK (born 15 August 1951) is an indian Theoretical physicist and Theoretical physicist. She is the founder and ceo of UNK UNK.
EDMTL: Indrani Bose (born 15 August 1951) is an indian Theoretical physicist. She is a member of the UNK Institute of Science and technology.
EDRL: Indrani UNK (born 15 August 1951) is an indian Theoretical physicist. She is a member of the Institute of technology ( UNK ).

propertyset: name= aaron moores, coach= sarah paton, club= trowbridge asc, birth= may 16 , 1994, birth place= trowbridge, sport= swimming, paralympics= 2012
RevAbs: Aaron Moores (born 16 May 1994) is a british Paralympic swimmer competing in the s14 category , mainly in the backstroke and breaststroke and after qualifying for the 2012 Summer Paralympics he won a Silver Medal in the 100 m backstroke.
Templ: Aaron Moores (AM) was born in May 16 , 1994. AM was born in May 16 , 1994. AM was born in Trowbridge.
ED: Donald Moores (born 16 May 1994) is a Paralympic swimmer from the United states. He has competed in the Paralympic Games.
EDMTL: Donald Moores (born 16 May 1994) is an english swimmer. He competed at the 2012 Summer Paralympics.
EDRL: Donald Moores (born 16 May 1994) is a Paralympic swimmer from the United states. He competed at the dlx updated 3 Summer Paralympics.

propertyset: name= kirill moryganov, height= 183.0, birth= february 7 , 1991, position= defender, height= 1.83, goals= {0; 1}, clubs= fc torpedo moscow, pcupdate= may 28, 2016, years= {2013; 2012; 2015; 2016; 2010; 2014; 2008; 2009}, team= {fc neftekhimik nizhnekamsk; fc znamya truda orekhovo-zuyevo; fc irtysh omsk; fc vologda; fc torpedo-zil moscow; fc tekstilshchik ivanovo; fc khimki; fc oktan perm, fc ryazan, fc amkar perm}
RevAbs: [...] February 1991) is a russian professional football player. He plays for fc Irtysh Omsk. He is a Central defender.
Templ: Kirill Moryganov (KM) was born in February 7 , 1991. KM was born in February 7 , 1991. The years of KM was 2013. The years of KM was 2013. KM played for fc Neftekhimik Nizhnekamsk. KM played for fc Znamya Truda Orekhovo-zuyevo. KM scored 1 goals. The years of KM was 2013. KM played for fc Irtysh Omsk. The years of KM was 2013. KM played as Defender. KM played for fc Vologda. KM played for fc Torpedo-zil Moscow. KM played for fc Tekstilshchik Ivanovo. KM scored 1 goals. KM 's Club was fc Torpedo Moscow. KM played for fc Khimki. The years of KM was 2013. The years of KM was 2013. The years of KM was 2013. KM played for fc Amkar Perm. The years of KM was 2013. KM played for fc Ryazan. KM played for fc Oktan Perm.
ED: Kirill mikhailovich Moryganov (; born February 7 , 1991) is a russian professional football player. He last played for fc Torpedo armavir.
EDMTL: Kirill Moryganov (; born 7 February 1991) is an english professional footballer who plays as a Defender. He plays for fc Neftekhimik Nizhnekamsk.
EDRL: Kirill viktorovich Moryganov (; born February 7 , 1991) is a russian professional football player. He last played for fc Tekstilshchik Ivanovo.

Table 5: Examples of system output.