Handling Rare Items in Data-to-Text Generation

Neural approaches to data-to-text generation generally handle rare input items using either delexicalisation or a copy mechanism. We investigate the relative impact of these two methods on two datasets (E2E and WebNLG) and using two evaluation settings. We show (i) that rare items strongly impact performance; (ii) that combining delexicalisation and copying yields the strongest improvement; (iii) that copying underperforms for rare and unseen items and (iv) that the impact of these two mechanisms greatly varies depending on how the dataset is constructed and on how it is split into train, dev and test.


Introduction
The input to data-to-text generation often contains rare items, i.e. low-frequency items such as names, locations and dates. This makes it difficult for neural models to predict their verbalisation. To address this issue, neural approaches to data-to-text generation typically resort either to delexicalisation (Wen et al., 2015; Dušek and Jurcicek, 2015; Trisedya et al., 2018) or to a copy mechanism. Character-based encodings (Agarwal and Dymetman, 2017; Deriu and Cieliebak, 2018) and byte pair encodings (Elder, 2017) have also been used. However, when we applied a character-based approach within a standard sequence-to-sequence model to the WebNLG and E2E datasets, the results were low. Hence we chose not to discuss them.
When using delexicalisation, the data is preprocessed to replace rare items with placeholders, and the generated text is post-processed to replace these placeholders with appropriate values based on a mapping between placeholders and initial values built during preprocessing. While this method is often used, it has several drawbacks. It requires additional pre- and post-processing steps. These processing steps must be re-implemented for each new data-to-text application. The matching procedure needed to correctly match a rare input item (e.g., Barack Obama) with the corresponding part in the output text (e.g., the former President of the United States) may be quite complex, which may result in incorrect or incomplete delexicalisations.
In contrast, the copy mechanism standardly used in neural approaches to summarisation (See et al., 2017; Gu et al., 2016; Cheng and Lapata, 2016), paraphrasing, answer generation (He et al., 2017) and data-to-text generation is a generic technique which is easy to integrate in the encoder-decoder framework and can be used independently of the particular domain and application.
In this paper, we investigate the impact of copying and delexicalisation on the quality of generated texts using two sequence-to-sequence models with attention: one using the copy and coverage mechanism of See et al. (2017), the other using delexicalisation. We evaluate their respective output on two data-to-text datasets, namely the E2E (Novikova et al., 2017) and the WebNLG (Gardent et al., 2017a) datasets. We also compare the two methods in two different settings: the original train/dev/test partition produced by the E2E and by the WebNLG challenge vs. a more constrained train/dev/test split which aims to further minimise the amount of redundancy between train, dev and test data. This latter experimental setting is inspired by a recent paper by Aharoni and Goldberg (2018), which shows that the train/dev/test split may have a strong impact on how much the model learns to generalise and how much it memorises.

Table 1: Entry examples of the E2E (first row) and WebNLG (second row) datasets with and without delexicalisation.
E2E
  Original: The Cricketers is a coffee shop that also has Chinese food, located near The Portland Arms. It is not family friendly, and has an average customer rating.
  Delexicalised: NAME is a coffee shop that also has Chinese food, located near NEAR. It is not family friendly, and has an average customer rating.
WebNLG
  Original: [...] Classified as a dessert which can be served warm or cold, its main ingredients are ground almond, jam, butter and eggs.
  Delexicalised: FOOD, also called DISH-VARIATION, originates from the REGION. Classified as a COURSE which can be served SERVINGTEMPERATURE, its main ingredients are MAININGREDIENTS.

Our study suggests the following.
• Rare items strongly impact the performance of data-to-text generation.
• Combining delexicalisation and copying yields the strongest improvements.
• Copying underperforms for items not, or rarely, seen in the training data.
• The content (e.g., distribution and number of named entities) and the partitioning (constraints on the test set) of the training data strongly affect the impact of both copying and delexicalisation.

Datasets
Two recently released corpora for data-to-text generation served as experimental datasets for our study: the E2E (Novikova et al., 2017) and the WebNLG (Gardent et al., 2017a) datasets.
In the E2E dataset, the input to generation is a dialogue act consisting of three to eight slot-value pairs describing a restaurant, while the output is a restaurant recommendation verbalising this input. Table 1 shows an example with an input consisting of six slot-value pairs. On average, each input is associated with 8.1 references. The number of possible values for each slot ranges from two (binary slots) to 34 (restaurant name). Tables 2 and 3 summarise the statistics of the E2E dataset.
In WebNLG, the aim is to verbalise a set of RDF (Resource Description Framework) triples describing entities of different categories. An RDF triple is of the form (subject, property, object), where subject and object denote entities or values and property denotes a binary relation holding between subject and object. The inputs consist of sets of one to seven triples, and the entities belong to fifteen distinct DBpedia categories.
Both dataset releases gave rise to a shared task in NLG in 2017. Note though that for WebNLG, the present study relies on the final release data (version 2), which is a larger dataset than that used for the WebNLG Challenge 2017.

Delexicalising Datasets
We derive delexicalised datasets from the original E2E and WebNLG datasets as follows.
For each dataset, we replicated the delexicalisation procedure which was applied to the baseline systems developed for the E2E (Novikova et al., 2017) and for the WebNLG challenge (Gardent et al., 2017b) respectively. As shown in Table 1, both input data and output text were delexicalised. In E2E, only the name and near slots were delexicalised (because, contrary to the other slots, they have a large number of distinct values). In WebNLG, delexicalisation was done on the subjects and objects of RDF triples. While delexicalisation was flawless in E2E, WebNLG data poses additional challenges, as the subject and object values in the input do not necessarily match the corresponding text fragment in the output. As a result, not all subjects and objects were delexicalised.
In the delexicalised E2E corpus, placeholders constitute 5.7% of all tokens, while they reach 15.7% in the WebNLG data.
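The pre- and post-processing steps described above can be illustrated with a minimal sketch. It is not the baseline systems' actual code: the function names `delexicalise` and `relexicalise`, the dictionary representation of the input and the slot values in the example input are assumptions made for illustration. Only the text, the two delexicalised slots (name and near) and the placeholders NAME and NEAR come from Table 1.

```python
def delexicalise(mr, text, slots=("name", "near")):
    """Replace the values of the given slots with placeholders in input and text.

    Hypothetical helper: a value is only delexicalised when it occurs
    verbatim in the text, which is the easy (E2E-like) case.
    """
    mapping = {}
    delex_mr = dict(mr)
    for slot in slots:
        value = mr.get(slot)
        if value is None or value not in text:
            continue  # nothing to replace, leave the slot lexicalised
        placeholder = slot.upper()
        text = text.replace(value, placeholder)
        delex_mr[slot] = placeholder
        mapping[placeholder] = value
    return delex_mr, text, mapping


def relexicalise(text, mapping):
    """Post-processing: put the original values back in place of placeholders."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


# Assumed input for the E2E text of Table 1 (the slot values are illustrative).
mr = {"name": "The Cricketers", "eatType": "coffee shop", "food": "Chinese",
      "near": "The Portland Arms"}
text = ("The Cricketers is a coffee shop that also has Chinese food, "
        "located near The Portland Arms.")

delex_mr, delex_text, mapping = delexicalise(mr, text)
restored = relexicalise(delex_text, mapping)
```

Exact string matching of this kind is enough when values appear verbatim in the text. It would fail to delexicalise a value that is rephrased in the output, which is the difficulty reported for WebNLG.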

The Copy Mechanism
The copy mechanism is widely used in text production approaches where it is relevant for handling rare input but also, for instance, in text summarisation, for copying input into the output. Thus, copy mechanisms have been used to generate paraphrases, for text summarisation (Gu et al., 2016; Cheng and Lapata, 2016) and for answer generation (He et al., 2017).
Here we use the copy mechanism introduced by See et al. (2017). The decoder uses an extended vocabulary which consists of a predefined target vocabulary (with generation distribution P_vocab) which is dynamically extended at inference time with the tokens contained in the input. At each time step during decoding, the model then decides whether to copy from the input or to generate from the target vocabulary using a probability distribution over the extended vocabulary which is computed based on a generation probability (sampling from the target vocabulary) and on the attention distribution (sampling from the input).
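The attention distribution a^t is calculated as in Bahdanau et al. (2015):

\[ e^t_i = v^\top \tanh(W_h h_i + W_s s_t + b_{attn}), \qquad a^t = \mathrm{softmax}(e^t) \]

with v, W_h, W_s and b_attn parameters to be learned, s_t the decoder state and h_i a variable ranging over the encoder hidden states.
The generation probability p_gen is then defined as:

\[ p_{gen} = \sigma(W_h h_t + W_s s_t + W_x x_t + b_{ptr}) \]

where σ is the sigmoid function, W_h, W_s, W_x and b_ptr are parameters to be learned, x_t is the decoder input and h_t is the context vector produced by the attention mechanism as the weighted sum \sum_{i=1}^{N} a^t_i h_i of the encoder states (with N the number of encoder states).
Finally, the probability distribution over the extended vocabulary is defined as:

\[ P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a^t_i \]

where P_vocab(w) is the probability of generating w from the target vocabulary (zero if w is not in it) and the sum ranges over the input positions i whose token w_i is w.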
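To make the mixture of generating and copying concrete, the following sketch computes a distribution over the extended vocabulary as P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (sum of the attention weights of the input positions holding w), following See et al. (2017). It is a toy illustration in plain Python, not the OpenNMT-py implementation. The function names and the numbers are assumptions, and the attention scores and p_gen are given as inputs rather than computed by a network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """Mix the generation distribution with the copy (attention) distribution.

    p_vocab: dict mapping target-vocabulary words to probabilities.
    attention: one weight per source position, summing to 1.
    source_tokens: the input tokens, aligned with `attention`.
    """
    # Generation part: scale every target-vocabulary probability by p_gen.
    dist = {word: p_gen * prob for word, prob in p_vocab.items()}
    # Copy part: each source token receives its attention mass, so tokens
    # outside the target vocabulary still get a non-zero probability.
    for a_i, w_i in zip(attention, source_tokens):
        dist[w_i] = dist.get(w_i, 0.0) + (1.0 - p_gen) * a_i
    return dist


# Toy values (illustrative only): "Cocum" is not in the target vocabulary.
p_vocab = {"is": 0.5, "a": 0.3, "pub": 0.2}
source_tokens = ["Cocum", "pub"]
attention = softmax([2.0, 0.0])
dist = final_distribution(0.4, p_vocab, attention, source_tokens)
```

In this toy case the out-of-vocabulary token "Cocum" can only be produced through the copy term, so its probability depends entirely on p_gen and on the attention weight of its position.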

Constraining Datasets
The train/dev/test split is often constrained to ensure that there is no overlap in terms of input between the training, the development and the test set. As Aharoni and Goldberg (2018) recently showed, however, this may result in a setup where certain input fragments (in that case, subject and object entities present in the input set of RDF triples) are present so often in the test set that models built on this standard split overfit and memorise rather than learn. Thus, in the split-and-rephrase application they studied, Aharoni and Goldberg (2018) observed that, given some input containing the entity e and some set of facts T(e) about this entity, the model will regularly output a text which mentions e but is unrelated to the set of facts T(e). That is, instead of learning to generate text from data, the model learns to associate a text with an entity.
To better assess the impact of delexicalisation and copying on the output of data-to-text generation models, we therefore consider two ways of partitioning the corpus into train, dev and test: the traditional way (Unconstrained), where the no-overlap constraint applies to entire inputs (i.e., sets of RDF triples in WebNLG and dialogue moves in E2E), and a more challenging split (Constrained), where the no-overlap constraint applies to input fragments (i.e., RDF triples in WebNLG and slot values in E2E). Table 3 shows the statistics for both splits for each dataset.
Unconstrained The unconstrained split is the original split provided by the challenge organisers.
The E2E dataset was split into training, validation and test sets following a 76.5/8.5/15 ratio. It was ensured that the inputs were distinct for all three sets and that a similar distribution of input and reference text lengths was kept. We found 1,430 identical (data, text) pairs in the original E2E data. They were deleted for the subsequent experiments.
In WebNLG, the original split follows an 80/10/10 ratio. As with the E2E dataset, there is a null intersection in terms of input between train, dev and test. In addition, sets of triples of different sizes and sets of triples of different categories were proportionally distributed between training, dev and test sets.
Constrained We consider a second partitioning where we aim to minimise the overlap between train, dev and test in terms of input fragments.
As shown in Table 2, in the E2E dataset, most of the slots have under eight possible values. As these few values appear with a large number of distinct slot-value combinations (49,966 input-text instances), they are unlikely to trigger fact memorisation. We therefore focus on those slots which have a higher number of values and restrict the test set using restaurant names, a slot with 34 possible values. Four restaurant names were selected to occur only in the test data and to support a distribution of input types and text lengths similar to that of the original train/dev/test split (cf. Table 3).
Nonetheless, it is worth noting that the E2E dataset was constructed in such a way that a specific restaurant name may have mutually exclusive values in different inputs, such as low customer rating and high customer rating. This might result in a weak association between restaurant names and specific inputs and therefore in little risk of memorising facts related to a specific restaurant name. As we shall see in Section 3, this intuition is confirmed by the results, which show little difference, for the E2E data, in terms of both automatic and human-based metrics between the Constrained and the Unconstrained setting. Note also that since the E2E Constrained split is defined with respect to a slot value (restaurant names) which is delexicalised, the constrained vs. unconstrained distinction loses its impact in the delexicalised setting.
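The core of the E2E Constrained split, holding out a few restaurant names so that they occur only in the test data, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and the toy instances are hypothetical, the four names actually selected are not listed here, and the additional requirement of keeping input-type and text-length distributions similar to the original split is not modelled.

```python
def split_by_held_out_names(instances, held_out_names):
    """Send every instance whose restaurant name is held out to the test set.

    instances: list of (mr, text) pairs, with mr a dict of slot -> value.
    held_out_names: set of restaurant names reserved for the test data.
    """
    train_dev, test = [], []
    for mr, text in instances:
        target = test if mr.get("name") in held_out_names else train_dev
        target.append((mr, text))
    return train_dev, test


# Toy data (illustrative only).
instances = [
    ({"name": "Fitzbillies", "eatType": "coffee shop"}, "Fitzbillies is a coffee shop."),
    ({"name": "Cocum", "eatType": "coffee shop"}, "Cocum is a coffee shop."),
    ({"name": "Aromi", "eatType": "pub"}, "Aromi is a pub."),
]
train_dev, test = split_by_held_out_names(instances, {"Cocum"})

train_names = {mr["name"] for mr, _ in train_dev}
test_names = {mr["name"] for mr, _ in test}
```

With such a split, a model without delexicalisation never sees the held-out names during training, which is the situation in which copying is put to the test.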
For the WebNLG dataset, the constraint on the train/dev/test partition is in terms of triples. In addition to excluding from the test set all inputs (sets of RDF triples) which occur in either the dev or the train set, we require that no RDF triple occurs in two of these sets. Let t = (s, p, o) be an RDF triple, with p a property and s, o subject and object RDF resources. In the constrained dataset, if t is in the test set, then t may not be in either the dev or the training set, but variants such as (s', p, o), (s, p, o'), (s', p, o') or (s, p', o) may (with s ≠ s', p ≠ p' and o ≠ o'). In this way, models can be trained which must learn to verbalise properties independently of their arguments. Again, care was taken to keep the distribution in terms of input length similar to that of the original split (cf. Table 3).
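The triple-level constraint can be checked mechanically. The sketch below is an illustration rather than the procedure used to build the split: the function names are hypothetical and the triples are abstract placeholders. It tests that no triple (s, p, o) occurs in more than one of the train, dev and test sets, while variants that differ in subject, property or object remain allowed.

```python
def shared_triples(split_a, split_b):
    """Return the RDF triples that occur in inputs of both splits.

    Each split is a list of inputs; each input is a list of (s, p, o) tuples.
    """
    triples_a = {triple for triple_set in split_a for triple in triple_set}
    triples_b = {triple for triple_set in split_b for triple in triple_set}
    return triples_a & triples_b


def is_constrained(train, dev, test):
    """True if no triple occurs in two of the three sets."""
    return not (shared_triples(train, dev)
                or shared_triples(train, test)
                or shared_triples(dev, test))


# Abstract toy inputs (illustrative only).
train = [[("s1", "p1", "o1")], [("s1", "p2", "o1"), ("s2", "p1", "o1")]]
dev = [[("s3", "p1", "o3")]]
test_ok = [[("s1", "p1", "o2")]]    # a variant (s, p, o') of a training triple
test_bad = [[("s1", "p1", "o1")]]   # the training triple itself

ok = is_constrained(train, dev, test_ok)
bad = is_constrained(train, dev, test_bad)
```

The first test set is acceptable because it only shares a subject and a property with the training data, not a whole triple.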

Model Parameters
We trained two types of models: a standard sequence-to-sequence model and the same model augmented with a copy and coverage mechanism (denoted as C in the tables). For the standard sequence-to-sequence model, we used an LSTM encoder-decoder model with attention (Luong et al., 2015) from the OpenNMT-py toolkit (footnote 6), a PyTorch port of OpenNMT (Klein et al., 2017). The default parameters of OpenNMT-py were used for training and decoding. The encoder and decoder both have two layers. Models were trained for 13 epochs, with a mini-batch size of 64, a dropout rate of 0.3, and a word embedding size of 500. They were optimised with SGD with a starting learning rate of 1.0.
Data was not lowercased, nor was it truncated (the maximal sequence length was used in the source and target).
Special options available in OpenNMT-py were used to augment the standard model with the copy and the coverage mechanisms. The OpenNMT-py implementation of training additional copy and coverage attention layers follows See et al. (2017).

Evaluation
Automatic Evaluation Systems were evaluated using four automatic corpus-based metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Denkowski and Lavie, 2014) and ROUGE-L (Lin, 2004). We made use of the scripts used for the E2E Challenge evaluation (footnote 7). The first three metrics were originally developed for machine translation, the last one for summarisation. Roughly speaking, BLEU calculates the n-gram precision; NIST is based on BLEU, but adds more weight to rarer n-grams; METEOR computes the harmonic mean of precision and recall, featuring also stem and synonymy matching; ROUGE-L calculates recall for longest common subsequences in a reference and candidate text. Given our task, handling rare items (or named entities in the corpora in question), we also applied the slot-error rate (SER), which seems more suitable for evaluating the presence of named entities. SER was calculated by exact matching of slot values in the candidate texts:

\[ SER = \frac{S + D + I}{N} \]

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the total number of slots in the reference. The resulting SER is the average of the SER of each prediction. While computing SER for the dialogue slot-based E2E corpus is straightforward (the binary slot familyFriendly was excluded), it results in some noise for WebNLG, where subjects and objects are numerous (3,648 vs. 79 values in E2E) and where they were rephrased in references (cf. also Section 2.2).

6 https://github.com/OpenNMT/OpenNMT-py
7 https://github.com/tuetschek/e2e-metrics
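The slot-error rate described above, (S + D + I) / N per prediction and then averaged over predictions, can be sketched as follows. This is one possible reading of exact matching, not the script used in the experiments. The function names, the value inventory and the way substitutions are told apart from deletions and insertions are assumptions: a missing value counts as a substitution when another known value of the same slot appears in the text, and as a deletion otherwise. The example input and prediction are the Cocum/Fitzbillies case discussed in Section 3.

```python
def slot_error_rate(reference, text, inventory):
    """SER = (S + D + I) / N for one prediction, by exact string matching.

    reference: dict slot -> value for the input.
    text: the predicted text.
    inventory: dict slot -> list of known values (hypothetical resource).
    """
    if not reference:
        raise ValueError("reference must contain at least one slot")
    subs = dels = ins = 0
    for slot, value in reference.items():
        # Other known values of the same slot that appear verbatim in the text.
        others = [v for v in inventory.get(slot, ()) if v != value and v in text]
        if value in text:
            ins += len(others)       # right value present, extras are insertions
        elif others:
            subs += 1                # right value replaced by another one
            ins += len(others) - 1   # any further extras are insertions
        else:
            dels += 1                # value missing altogether
    # Values of slots that are absent from the reference count as insertions.
    for slot, values in inventory.items():
        if slot not in reference:
            ins += sum(1 for v in values if v in text)
    return (subs + dels + ins) / len(reference)


def corpus_ser(examples, inventory):
    """Average the per-prediction SER over (reference, text) pairs."""
    scores = [slot_error_rate(ref, text, inventory) for ref, text in examples]
    return sum(scores) / len(scores)


inventory = {"name": ["Cocum", "Fitzbillies"],
             "eatType": ["coffee shop", "pub"],
             "near": ["The Rice Boat", "The Portland Arms"]}
reference = {"name": "Cocum", "eatType": "coffee shop", "near": "The Rice Boat"}
prediction = "Near The Rice Boat there is a coffee shop called Fitzbillies"
ser = slot_error_rate(reference, prediction, inventory)
```

Here the wrong restaurant name counts as one substitution out of three slots. Substring matching is deliberately naive and would over-count short values that occur inside other words, which is one source of the noise mentioned for WebNLG.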
Manual Annotation To allow comparisons between constrained and unconstrained settings, we intersected inputs of constrained and unconstrained test sets and gathered corresponding predictions from them for all the models. The intersection between the two test sets has 40 inputs in the E2E corpus and 153 in WebNLG. For E2E, we manually evaluated all 40 predictions available for each system (constrained and unconstrained); for WebNLG, we chose 44 predictions, ensuring the presence of different sizes and categories. By manually assessing outputs for the same inputs for all the systems, contrasts between constrained and unconstrained settings are better highlighted. Manual inspection of outputs revealed that most generated predictions did not have issues with grammar or fluency. For this reason, we chose to focus on the semantic adequacy of predicted texts. The evaluation was done by one human judge. After the evaluation was finished, the human annotator confirmed that, except for one system (see Section 3), all system outputs were fluent and grammatical English sentences.
Once presented with an input and a corresponding prediction text, a human judge was asked to evaluate semantic information present in the prediction. A minimal unit of analysis was a slot-value pair in E2E and an RDF triple element (subject, property or object) in WebNLG. WebNLG example annotations were done taking into account the three parts of a triple. If a property was not translated correctly, we considered that the model had missed that information. If a subject or object was not rendered correctly, it was annotated as wrong. All semantic information beyond the initial set of triples was evaluated as added. The WebNLG example in Table 4 received the following scores, the total number of constituents being three: right: 2 (66%), wrong: 1 (33%), missed: 0 (0%), added: 1 (33%). If semantic information was repeated, it was rated as added.
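The scoring of a single annotated prediction can be reproduced with a few lines. The sketch below is illustrative: the function name is hypothetical, and truncating percentages to whole numbers is an assumption chosen to match the figures quoted above (2 out of 3 reported as 66%). Each count is expressed relative to the number of constituents in the input, so added items can push the total above 100%.

```python
def annotation_scores(right, wrong, missed, added, total):
    """Return (count, percentage) per label, relative to the input constituents.

    Percentages are truncated to integers (an assumption about rounding).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    counts = {"right": right, "wrong": wrong, "missed": missed, "added": added}
    return {label: (n, 100 * n // total) for label, n in counts.items()}


# Counts reported for the WebNLG example: three constituents in the input.
scores = annotation_scores(right=2, wrong=1, missed=0, added=1, total=3)
```

Note that right, wrong and missed together cover the input constituents, while added is counted on top of them.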
The human evaluation analysis presented above is modest due to the lack of resources. To justify it, we argue that our focus is solely on semantic adequacy which is a more objective parameter in evaluations than, say, fluency or grammaticality. With no intent to question the documented unreliability of automatic metrics in NLG, we attribute such high correlations to the design of our configurations which cover some extreme cases where models are supposed to show a drastic drop in performance.

Results and Discussion
We compared the output of the sequence-to-sequence model with attention described in Section 2.5 on two datasets (WebNLG and E2E) and considering eight different configurations depending on how rare words are handled (without delexicalisation, with delexicalisation, with a copy-and-coverage mechanism and with both delexicalisation and a copy-and-coverage mechanism) and on how the train/dev/test partition is constructed (unconstrained vs. constrained).
As pointed out in Section 2.6, automatic scores are reported using the whole test sets whereas human evaluation is based on shared MR instances between the non-constrained and constrained test sets (40 instances for E2E and 44 for WebNLG).
The results are summarised in Tables 5 (E2E) and 6 (WebNLG). Some example predictions are shown in Tables 7 and 8. A similar, though weaker, trend can be observed for the other automatic metrics (e.g., ∆BLEU E2E, NIL vs. D+C, unconstrained: −0.12 points).
Delexicalisation, Copying or Both The results show two trends. First, combining copying and delexicalisation yields the best results across the board. Second, while in the unconstrained setting there is not much difference in terms of results between copying and delexicalisation, in the constrained setting copying yields lower results (∆BLEU E2E: −0.15, ∆BLEU WebNLG: −0.10, ∆right E2E: −12.9%, ∆right WebNLG: −28.89%, ∆SER E2E: +11.91, ∆SER WebNLG: +16.43; constrained setting, C vs. D). This suggests that copying only partially captures rare items. Looking at the outputs, copying seems to work better when the item to be copied has been seen in the training data. When an entity was not seen, the network often chooses to generate a frequent entity seen in the source, rather than copying. For instance, for the E2E data, restaurant names (which had not been seen in the training data) were not copied over in the constrained setting. In most cases, the input restaurant name was replaced by a restaurant name that is frequent in the training data. For example, given the MR name[Cocum], eatType[coffee shop], near[The Rice Boat], the text Near The Rice Boat there is a coffee shop called Fitzbillies was generated, where Fitzbillies, a frequently occurring restaurant name in the training data (2,371 instances), was generated instead of the input restaurant name Cocum.
Constrained vs. Unconstrained Setting There is a clear difference in terms of relative performance in the constrained vs. the unconstrained setting between the two datasets.
For the E2E dataset, since the constrained dataset is defined with respect to slot values (name and near) which are delexicalised, the constrained setting is in fact similar to the unconstrained setting. And indeed the scores are similar (e.g., unconstrained vs. constrained, D, E2E: ∆BLEU: −0.05, ∆SER: −1.1 and ∆right: +0.54%). When using copying however, the results are lower in the constrained setting again suggesting that copying underperforms for items that have rarely been seen at training and development time (e.g., unconstrained vs. constrained, C, E2E: ∆BLEU: 0.11, ∆SER: −9.84 and ∆right: 12.9%).
For the WebNLG data, the difference between constrained and unconstrained setting is much stronger for both delexicalisation and copying. For instance, for copying the BLEU score in the unconstrained setting is 0.61 vs. 0.34 in the constrained setting. Semantic adequacy also drops noticeably (unconstrained: 83%, constrained: 41%). This is in line with Aharoni and Goldberg (2018)'s observation that in the unconstrained setting, the model learns to memorise associations between facts and entities and thereby fails to generate text that adequately captures the meaning of the input data. The low scores for the copying mechanism also confirm the observation made above that copying underperforms for rare data fragments.
This difference between datasets is further discussed in the next paragraph.
Semantic Adequacy As mentioned above, the manual and automatic evaluation metrics we used to assess semantic adequacy strongly correlate. They both show that semantic adequacy is much lower for the WebNLG data (higher SER, higher proportion of wrong and missed items). This is not surprising, since the WebNLG dataset contains a much higher number of distinct values (3,648 against 79 in the E2E dataset) and exhibits a greater mismatch between input and output value names. That is, the delta shows that the efficiency of copying and delexicalisation varies depending on the variety and content of the dataset.
The two datasets also differ with respect to the proportion of added slots, which is higher for the E2E dataset and suggests an overfitting effect due to a skewed distribution in favour of inputs containing more than 3 attributes. Thus, the human evaluation shows that the majority of cases with added slots are cases where the input consists of three slots (the minimal number of attributes in E2E). The overgeneration can be explained by the restricted number of three-slot inputs in the E2E dataset (only 2.5% of the MRs in the whole corpus). That claim is also supported by the predictions produced for adversarial examples. When given dialogue moves consisting of 2 slots (a number of attributes that does not occur in E2E), all eight E2E models tend to overgenerate by predicting texts with 3 or 4 slot-value pairs.
Fluency As mentioned in Section 2.6, while annotating the data for semantic adequacy, we found that almost all system outputs were well-formed English sentences. The only exception was the WebNLG model with copy mechanism, where stutterings were spotted in half of the examined instances. Despite those repetitions, it was always possible to detect the subject-predicate-object structure (e.g., 1001 kelvins is an escape velocity of 1001 kelvins; Asterix was created by R. Goscinny and was created by R. Goscinny), so the annotation was not hampered.

Related Work
Delexicalisation remains one of the most popular techniques for handling rare named entities. We analysed the submissions participating in the E2E and WebNLG challenges which used a neural approach. Among them, six teams applied delexicalisation (Davoodi et al., 2018; Juraska et al., 2018; Puzikov and Gurevych, 2018; Trisedya et al., 2018; van der Lee et al., 2017), three resorted to the copy mechanism, two developed character-based systems (Agarwal and Dymetman, 2017; Deriu and Cieliebak, 2018), and another two made use of byte pair encodings (Elder, 2017).
A copy mechanism makes it possible to detect a word in an input sequence and to copy it to an output sequence. The copy mechanism is widely used in text production approaches where it is relevant for handling rare input but also, for instance in text summarisation, for copying input into the output. See et al. (2017), Gu et al. (2016) and Cheng and Lapata (2016) introduced pointer networks (Vinyals et al., 2015) extended with a copy mechanism for text summarisation. Similarly, copy mechanisms have been used to generate paraphrases and to generate answers (He et al., 2017). The copy mechanism is often paired with coverage, which aims to overcome the common problem of repeated or omitted content in neural network outputs. It was used for instance in NMT (Tu et al., 2016) and summarisation (See et al., 2017).
Finally, some approaches apply neither copying nor delexicalisation. In particular, Nayak et al. (2017), working in the restaurant domain for dialogue systems, investigated ways of including slot values directly into the input representation of sequence-to-sequence models.

Conclusion
Example predictions (E2E), by split and model:
  Wildwood is an Italian pub situated in the city centre, near the Raja Indian Cuisine. It is not family-friendly.
  unconstr., NIL: Wildwood is an Italian pub located in the city centre near the Ranch. It is not family-friendly.
  unconstr., C: Wildwood is an Italian pub located in the city centre near Raja Indian Cuisine. It is not family-friendly.
  unconstr., D: Wildwood is an Italian pub located in the city centre near Raja Indian Cuisine. It is not family-friendly.
  unconstr., D+C: Wildwood is an Italian pub located near the Raja Indian Cuisine in the city centre. It is not family-friendly.
  constr., NIL: In the city centre near Raja Indian Cuisine there is an Italian pub called the Aromi. It is not family-friendly.
  constr., C: There is a pub near Raja Indian Cuisine in the city centre called Aromi. It serves Italian food and is not family-friendly.
  constr., D: Wildwood is an Italian pub near Raja Indian Cuisine in the city centre. It is not family-friendly.
  constr., D+C: Wildwood is an Italian pub located near the Raja Indian Cuisine in the city centre. It is not family-friendly.

We investigated the impact of copying and delexicalisation on two datasets and using two different ways of splitting the data into train, dev and test.
The results show some regularities and highlight some interesting differences.
Overall, the results indicate that delexicalisation outperforms copying. Furthermore, they show that copying underperforms on rare items. Since delexicalisation is a somewhat ad hoc process, an interesting direction for future research would be to devise copying methods that are more accurate and that can better handle rare data items.
Another direction for future research would be to further investigate how the content and train/dev/test split of a dataset impact learning. Our results suggest two ways in which these may induce overfitting.
In the WebNLG dataset, strong associations between entities and facts seem to result in generation models that memorise facts with entities rather than generate a text that adequately verbalises the input. This is highlighted in the manual evaluation by the high number of wrong and missed data items observed both in the constrained and in the unconstrained setting.
In the E2E dataset, on the other hand, we saw that added facts are frequent, and manual evaluation suggests that this is due to an overfitting effect whereby, because most inputs consist of more than three slot-value pairs, the models tend to overgenerate by predicting texts that verbalise four or more slot-value pairs.
In both cases, the copy-and-coverage mechanism does not suffice to ensure correct output and the results further decrease in the constrained setting. It would therefore be interesting to see to what extent better methods can be devised both for creating datasets and for devising train/dev/test splits that adequately test the ability of models to generalise.
Another direction for future work is to investigate the capability of byte pair encoding models and subword representations to handle rare input tokens in data-to-text generation.