Neural Generation for Czech: Data and Baselines

We present the first dataset targeted at end-to-end NLG in Czech in the restaurant domain, along with several strong baseline models using the sequence-to-sequence approach. While non-English NLG is under-explored in general, Czech, as a morphologically rich language, makes the task even harder: since Czech requires inflecting named entities, delexicalization or copy mechanisms do not work out-of-the-box, and lexicalizing the generated outputs is non-trivial. In our experiments, we present two different approaches to this problem: (1) using a neural language model to select the correct inflected form while lexicalizing, and (2) a two-step generation setup, where our sequence-to-sequence model generates an interleaved sequence of lemmas and morphological tags, which are then inflected by a morphological generator.


Introduction
While most current neural NLG systems do not explicitly contain language-specific components and are thus capable of multilingual generation in principle, there has been little work to test these capabilities experimentally. This goes hand in hand with the scarcity of non-English training datasets for NLG: the only data-to-text NLG dataset known to us is a small sportscasting Korean dataset (Chen et al., 2010), which only contains a limited number of named entities, reducing the need for their inflection.
Since most generators are only tested on English, they do not need to handle grammar complexities not present in English. A prime example is the delexicalization technique used by most current generators (e.g., Oh and Rudnicky, 2000; Mairesse et al., 2010; Wen et al., 2015a,b; Juraska et al., 2018): it is generally assumed that attribute (slot) values from the input meaning representation (MR) can be replaced by placeholders during generation and inserted into the output verbatim. Delexicalization, or an analogous technique such as a copy mechanism (Gu et al., 2016; Gehrmann et al., 2018), is required in most generation scenarios to allow generalization to unseen entity names: sets of entities are open (potentially infinite and subject to change) while training data is scarce. However, the verbatim insertion assumption does not hold for languages with extensive noun inflection: attribute values need to be inflected to produce fluent outputs (see Figure 1).

This paper presents the following contributions:

• We create a novel dataset for Czech delexicalized generation; this extends the typical task of data-to-text NLG by requiring attribute value inflection (Section 2). We choose Czech as an example of a morphologically complex language (Cotterell et al., 2018) with a large set of NLP tools readily available (e.g., Popel and Žabokrtský, 2010; Straková et al., 2014; Straka and Straková, 2017).
• We present baseline models based on the TGen sequence-to-sequence (seq2seq) system (Dušek and Jurčíček, 2016), with two novel extensions to the model for our task (Section 3):
  - A model for lexicalization, i.e., selecting the correct inflected surface form for a slot value, based on a recurrent neural network language model (RNN LM);
  - A new generation mode, where the seq2seq generator produces interleaved sequences of lemmas (base word forms) and morphological tags that are postprocessed using a morphological generator.
• Using both automatic and manual evaluation in Section 4, we show that our extensions improve over the base model, but do not solve the task completely. We propose improvements for future work in Section 6. Our dataset and all experimental code are released on GitHub.

Dataset
Our goal was to create a dataset comparable in size and domain to existing English data-to-text NLG datasets used in experiments with neural systems. Since there are few to no Czech speakers on crowdsourcing platforms (Pavlick et al., 2014; Dušek et al., 2014), we were not able to use them for data collection. Recruiting freelance translators seemed easier than training annotators; therefore, we turned to localizing and translating an existing dataset instead of creating a new one from scratch. We chose the restaurant dataset of Wen et al. (2015b) due to its manageable, yet non-trivial size and the familiarity of the domain (cf. Mairesse et al., 2010; Dušek et al., 2019). The original dataset contains 5,192 MR-sentence pairs, where MRs come in the form of dialogue acts (DAs). A DA consists of a DA type (e.g., request, confirm, inform) and a list of slots (attributes) and their values (e.g., name, price range, address, area). There are 8 different DA types and 12 slots in the dataset. All slots except the binary kids allowed are delexicalized during generation (cf. Figure 1).
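To make the shape of the data concrete, the following is a minimal Python sketch of the DA structure described above. The class and field names (`DialogueAct`, `Slot`, `kids_allowed`) are illustrative assumptions, not identifiers from our released code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slot:
    name: str
    value: str

@dataclass
class DialogueAct:
    da_type: str   # e.g. "inform", "confirm", "request"
    slots: list    # list of Slot objects

    def delexicalized(self):
        """Replace slot values with "X-" placeholders, as done for generation.

        The binary kids_allowed slot is the only one kept verbatim."""
        return DialogueAct(
            self.da_type,
            [Slot(s.name, s.value if s.name == "kids_allowed"
                  else "X-" + s.name) for s in self.slots])

da = DialogueAct("inform", [Slot("name", "Kočár z Vídně"),
                            Slot("area", "Malá Strana")])
delex = da.delexicalized()
```

A generator trained on delexicalized data only ever sees placeholders such as `X-name`, which is exactly why the inflected surface forms have to be filled in afterwards.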

Localizing the Data
We first needed to localize the dataset, replacing the original setting of San Francisco with a Czech one.
In particular, we aimed at using domestic entity names (DA slot values) that need to be inflected, since foreign names are often kept uninflected in Czech, using less fluent and conspicuous grammatical constructions to avoid inflection. We localized the following slots in both DAs and texts from the dataset: restaurant names, areas, food types, street addresses, and landmarks. We used a list of randomly chosen restaurant names from the Prague city center as well as lists of Prague neighborhoods, streets, and landmarks. The resulting sentences contain mostly factually inaccurate, yet meaningful utterances about restaurants in Prague.
The localized lists are quite short, with just 15 different restaurant names and a similar number of landmarks, streets, and neighborhoods. While much longer lists would be needed for a real-world scenario, this is sufficient to cover the most common classes of names with different inflection patterns and/or syntactic behavior (see Figure 2).

Translation
We recruited six translators and asked them to translate all unique texts in the localized dataset. They were given the following instructions:
• translate the utterances in isolation,
• use fluent, spoken-style Czech,
• strive to preserve the facts, but not necessarily all nuances of the original,
• use varying synonyms (as long as they belong to casual, fluent Czech), including for entity names or slot values (such as price ranges or meal types),
• inflect entity names as needed,
• use formal address (or plural) when addressing the user, and use the female form in the first person for self-references.
All rules but the last one aim at obtaining a varied and fluent dataset; the last rule strives for consistency. Note that the translators were not given the input DAs: these carry no more information than the corresponding English sentences, and we assume that they would only confuse the translators and could hurt the fluency of the results.

Consistency Checks and Deduplication
We checked the translated Czech texts for the presence of all required slot values. We took the following iterative, partially automatic approach:

1. Create a list of possible inflected surface forms for all slot values in the dataset. We used the morphological generator of Straková et al. (2014) to inflect the surface forms automatically and manually checked for errors.

2. Given a DA and a translated sentence, check (using an automatic script) that the sentence contains surface forms for all slots in the DA.

3. Given a sentence found by the script to miss a value, check if it contains an alternative surface form not included in the list from Step 1. If so, add this alternative surface form to the list.

4. If the translated sentence does not contain any mention of the DA value, fix the translation.

5. Repeat from Step 2 until there are no missing DA value mentions in the whole set.

Note that these checks result not only in greater consistency of the dataset, but also in a list of possible surface realizations for all slot values in the dataset. We store this list, including morphological information provided by the tagger (with manually corrected errors), and we use it for lexicalization (see Section 3).

(Footnote: Czech grammar requires a choice between formal and informal address whenever a verb in the 2nd person is used (Naughton, 2005, p. 134ff.). For verbs in the past tense or conditional, gender must also be selected, in any person (Naughton, 2005, p. 140ff.). We opted for a feminine form whenever the system refers to itself, and for formal address, mostly homonymous with plural, when addressing the user.)
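The automatic check in Step 2 can be sketched as a small Python function. The function and variable names are illustrative; the actual script is part of our released code:

```python
def find_missing_slots(da_slots, sentence, surface_forms):
    """Check that the sentence realizes every slot value of the DA.

    da_slots: list of (slot, value) pairs from the input DA
    surface_forms: dict mapping each value to its known inflected forms
    Returns the (slot, value) pairs for which no surface form was found.
    """
    text = sentence.lower()
    missing = []
    for slot, value in da_slots:
        # fall back to the value itself if no inflected forms are listed yet
        forms = surface_forms.get(value, [value])
        if not any(f.lower() in text for f in forms):
            missing.append((slot, value))
    return missing

forms = {"Malá Strana": ["Malá Strana", "Malé Straně", "Malou Stranu"]}
assert find_missing_slots([("area", "Malá Strana")],
                          "Restaurace je na Malé Straně.", forms) == []
```

Sentences flagged by this check are then handled manually (Steps 3 and 4), and newly discovered surface forms are added back into `surface_forms`, which is exactly how the surface realization list grows during the iteration.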

Duplicate Sentence Handling
If the exact lexicalization is not taken into account, the original dataset of Wen et al. (2015b) contains many duplicate texts: the total number of DA-text pairs is 5,192, but only 2,648 are unique. Therefore, we chose to only translate unique texts in order to speed up the translation process and lower the costs, albeit at the cost of a lower-quality result. We ensured that the translations preserve the same number of unique sentences by modifying any duplicate translations, manually replacing selected words or phrases with synonyms.
After the dataset was translated, we expanded it to obtain the same number of instances and the same distribution of different DAs as in the original. Given a delexicalized DA, a list of corresponding translated sentences, and the target number of corresponding sentences to match the original set, we sampled additional copies of the existing translations. To estimate the probabilities of the individual translations for the sampling, we used a 5-gram LM trained on lemmatized and delexicalized translations (see Figure 3): we scored all translations with the LM, used softmax to obtain a probability distribution, and sampled additional copies from this distribution. This ensures that translations using more frequent phrasing are more likely to be used multiple times in the set.

We then relexicalized the sampled copies: we randomly changed DA slot values and replaced their surface forms in the text using the surface form list, checking for roughly corresponding morphology. Since the morphological information used by this approach was rather crude (e.g., noun/adjective gender was not taken into account), disfluencies ensued in some cases. Therefore, we manually corrected all relexicalized sentences, changing inflection or wording where needed.
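The softmax-based sampling of additional copies can be sketched as follows (a minimal sketch assuming per-translation LM log-probabilities are already available; the 5-gram LM itself is not shown):

```python
import math
import random

def sample_copies(translations, logprobs, n_extra, seed=0):
    """Sample n_extra additional copies of existing translations,
    with probabilities given by a softmax over their LM log-probabilities."""
    m = max(logprobs)
    exp = [math.exp(lp - m) for lp in logprobs]  # numerically stable softmax
    total = sum(exp)
    probs = [e / total for e in exp]
    rng = random.Random(seed)
    return rng.choices(translations, weights=probs, k=n_extra)

# the translation with the higher LM score is sampled far more often
copies = sample_copies(["common phrasing", "rare phrasing"], [-1.0, -5.0], 10)
```

Sampling with replacement here is intentional: frequent phrasings should appear multiple times in the expanded set, mirroring the duplicate distribution of the English original.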

Dataset Statistics
The final Czech set contains the same number of instances as the English original, copies the DA distribution of the original, and contains a slightly higher number of unique delexicalized sentences due to post-expansion corrections (see Section 2.4). Statistics of the dataset size are shown in Table 1, with a comparison to the original English set. We can see that while the number of unique word lemmas (disregarding restaurant and place names) is slightly higher in the Czech set, the number of unique inflected word forms is more than twice as high. It is also clear that using slot values verbatim in the text is not possible in the Czech set, as the number of possible lexical realizations for each value is much higher than one.

Data Split
The original dataset of Wen et al. (2015b), which used a sequential 3:1:1 split into training, development and test parts, suffered from a lot of overlap in terms of delexicalized DAs between the sections. This means that a system can perform quite well on this dataset and still be unable to generalize to unseen DAs (Lampouras and Vlachos, 2016). To make testing systems' generalization capabilities possible on our Czech dataset, we opted for a different data split. We roughly keep the same 3:1:1 size proportion (see Table 2), but we make sure no delexicalized DA appears in two different parts.
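A split that keeps all instances sharing a delexicalized DA in the same part might be sketched as below. The greedy group assignment is our illustration of the idea, not the exact procedure used to build the released split:

```python
import random

def split_by_delex_da(instances, ratios=(3, 1, 1), seed=42):
    """Assign whole groups sharing the same delexicalized DA to one part,
    so that no delexicalized DA appears in two different parts."""
    groups = {}
    for inst in instances:
        groups.setdefault(inst["delex_da"], []).append(inst)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    total = sum(ratios)
    parts = [[], [], []]
    target = [len(instances) * r / total for r in ratios]
    for k in keys:
        # greedily put the group into the part furthest below its target size
        i = min(range(3), key=lambda j: len(parts[j]) - target[j])
        parts[i].extend(groups[k])
    return parts

data = [{"delex_da": f"da{i % 6}", "text": str(i)} for i in range(30)]
train_p, dev_p, test_p = split_by_delex_da(data)
```

Because whole groups move together, the 3:1:1 proportion can only be met approximately, which is why we describe the final sizes as "roughly" matching.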
On the other hand, we ensure that most DA types (inform, confirm etc.) are represented in all data parts, so the system has access to all general types of sentences during training.

Baseline System

Our baseline generator, TGen (Dušek and Jurčíček, 2016), is in essence a seq2seq model with attention (Bahdanau et al., 2015) and LSTM cells (Hochreiter and Schmidhuber, 1997). The encoder takes the input DA as a sequence of "DA type - slot - value" triples (the DA type is repeated for each slot-value pair) and produces a sequence of hidden states. The last hidden state is used to initialize the decoder; all hidden states serve as input to the attention model. The attention model produces their weighted combination for each decoder step using a 1-layer fully connected network. The decoder generates output tokens one by one, using the previously generated token and the attention model as inputs.
In addition to the basic seq2seq model, TGen adds beam search and a reranker for the candidate outputs on the generation beam, which checks whether the input semantics is preserved. The reranker encodes a candidate output using an LSTM RNN and produces a binary classification of the DA type and the slot-value pairs present. The number of differences against the input DA is used as a penalty.
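The reranking penalty itself reduces to a set comparison. The sketch below represents the classified DA content as a set of items; this is our simplification for illustration (the actual reranker is the LSTM classifier described above):

```python
def reranker_penalty(input_da, predicted_da):
    """Penalty = number of differences between the DA content classified
    from a candidate output and the true input DA, i.e. the size of the
    symmetric difference of the DA-type and slot-value items."""
    return len(set(input_da) ^ set(predicted_da))

inp = {("da_type", "inform"), ("name", "X-name"), ("area", "X-area")}
pred = {("da_type", "inform"), ("name", "X-name")}  # missed the area slot
assert reranker_penalty(inp, pred) == 1
```

Candidates on the beam are then reordered by their original score minus this penalty, so semantically faithful outputs float to the top.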

Basic Extensions
We added two features fairly standard in seq2seq-based models but absent from TGen:
• Bidirectional encoder (Bahdanau et al., 2015): the input sequence is encoded in both directions and the resulting hidden states are joined. We added this for both the main seq2seq generator and the reranker.
• Dropout (Hinton et al., 2012): this zeroes out certain connections within the network with a given probability during training; it serves as a regularization feature. We use this in the main generator only.

We use these extensions in all our setups, as they improved results in our preliminary experiments.

Lemma-tag Generation Mode
Dušek and Jurčíček (2016) experiment with generating syntactic trees and realizing them using an external surface realizer; they report slightly worse performance than generating tokens directly.
In order to fight the data sparsity arising from the rich morphology of Czech, we decided to explore the middle ground between syntactic trees and full word-form generation: generating base forms (lemmas) and morphological tags that indicate how each form should be inflected. We train TGen to simply generate an interleaved sequence of lemmas and tags (see Figure 4), which is then postprocessed using the dictionary-based morphological generator of Straková et al. (2014) to obtain the inflected word forms.
In the lemma-tag mode, the set of possible output tokens is reduced compared to direct token generation, but the postprocessing step is much simpler than using a full syntactic surface realizer. Moreover, the generated morphological tags following slot placeholders can be used to limit the scope of possible surface forms during lexicalization (see Section 3.3).
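The postprocessing step amounts to walking the interleaved sequence in lemma-tag pairs and querying the morphological generator for each pair. In the sketch below, the `FORMS` dictionary and the tag strings are toy stand-ins for the dictionary-based generator and the Hajič (2004) positional tags, kept deliberately tiny:

```python
def inflect_sequence(tokens, morph_generate):
    """Postprocess an interleaved lemma/tag sequence: each (lemma, tag)
    pair is turned into an inflected form by the morphological generator."""
    out = []
    for lemma, tag in zip(tokens[0::2], tokens[1::2]):
        out.append(morph_generate(lemma, tag))
    return out

# toy stand-in for a dictionary-based morphological generator
FORMS = {("restaurace", "NNFS1"): "restaurace",
         ("restaurace", "NNFS6"): "restauraci",
         ("být", "VB-S3"): "je"}
gen = lambda lemma, tag: FORMS.get((lemma, tag), lemma)  # back off to the lemma

assert inflect_sequence(["být", "VB-S3", "restaurace", "NNFS6"],
                        gen) == ["je", "restauraci"]
```

Falling back to the bare lemma when the generator has no entry is a simple safety net; it keeps the output well-formed even for out-of-dictionary tokens.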
This approach is inspired by similar approaches in phrase-based MT (Bojar, 2007; Toutanova et al., 2008; Fraser, 2009) and was developed in parallel to recent similar experiments with two-step neural MT (Nadejde et al., 2017; Tamchyna et al., 2017). We compare the lemma-tag generation mode against TGen's default direct word-form generation mode in our experiments.

Lexicalization
We experiment with three different approaches for selecting the surface form for a DA slot value placeholder from a set of applicable ones: two very straightforward baselines requiring no training, and our proposed solution based on a neural LM:

• Random baseline. This selects a surface form at random. This approach is certainly not suitable for a real application; we only use it for comparison.

• Most frequent baseline. Here, the applicable surface form that occurs most frequently overall in the training data is selected. This represents a stronger baseline than the random method.

• RNN-based language model. Our main solution attempts to choose the best surface form using a bidirectional LSTM RNN-based LM (Mikolov et al., 2010), trained to predict a token probability distribution given all previous and all following tokens. During decoding, the RNN LM estimates the probabilities of all applicable surface forms, and we select the most probable surface form for the output. When selecting a surface form during direct word-form generation, all possible forms for the given slot value are considered. In the lemma-tag mode (Section 3.2), only forms matching the morphological tag following the slot placeholder are considered (cf. Figure 4): first the ones matching perfectly, with backoffs to coarse part-of-speech or all possible forms.

Figure 4: Example interleaved lemma-tag sequence for the input DA ?confirm(good for meal=breakfast), the first output from Figure 1 (acc = accusative, fem = feminine, sg = singular; cf. Hajič, 2004 for tagset details). Note that the morphological tag for the slot placeholder is included and can be used during lexicalization (cf. Section 3.3).
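The tag-matching backoff used in the lemma-tag mode can be sketched as follows. Here `lm_score` stands in for the bidirectional RNN LM's probability of a form in context, and the tag strings are toy examples; treating the first tag character as the coarse part-of-speech follows the positional tagset convention:

```python
def pick_surface_form(forms, target_tag, lm_score):
    """Select the surface form for a slot placeholder.

    forms: list of (form, tag) candidates for the slot value
    target_tag: morphological tag generated after the placeholder
                (None in direct word-form generation mode)
    lm_score: callable giving the LM probability of a form in context
    Backoff: exact tag match -> coarse part-of-speech -> all forms.
    """
    if target_tag is not None:
        exact = [f for f, t in forms if t == target_tag]
        coarse = [f for f, t in forms if t[:1] == target_tag[:1]]
        candidates = exact or coarse or [f for f, _ in forms]
    else:
        candidates = [f for f, _ in forms]
    return max(candidates, key=lm_score)

forms = [("Strana", "NNFS1"), ("Straně", "NNFS6")]
lm = {"Strana": 0.9, "Straně": 0.1}
# exact tag match overrides the LM preference:
assert pick_surface_form(forms, "NNFS6", lambda f: lm[f]) == "Straně"
```

Without a target tag (direct word-form mode), the LM alone decides among all applicable forms; the tag constraint simply prunes the candidate set before the LM is consulted.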

Lexicalized Input DAs
Some slot values in our dataset may require a certain morphosyntactic structure in their neighborhood. This is the case for restaurant counts: Czech cardinal numerals 2-4 behave as adjectives, while higher numerals behave as nouns and take the counted quantity as a genitive object. The correct nominative forms when counting restaurants are then "2 restaurace", but "5 restaurací" (Naughton, 2005, p. 113ff.). Another example is area names requiring different prepositions for location: the correct form for "in Malá Strana" is "na Malé Straně", but for "in Karlín", it is "v Karlíně" (Naughton, 2005, p. 202).
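The numeral behavior can be captured by a tiny rule, shown here as a simplified sketch for the nominative case only (it ignores compound numerals such as 21, whose agreement follows the last digit word, and all other cases):

```python
def count_restaurants(n):
    """Czech count phrase for restaurants (nominative context):
    1 -> singular, 2-4 -> plural nominative, 5+ -> genitive plural,
    since higher numerals take the counted noun as a genitive object."""
    if n == 1:
        form = "restaurace"   # nominative singular
    elif 2 <= n <= 4:
        form = "restaurace"   # nominative plural (same surface form)
    else:
        form = "restaurací"   # genitive plural
    return f"{n} {form}"

assert count_restaurants(2) == "2 restaurace"
assert count_restaurants(5) == "5 restaurací"
```

A generator that only ever sees a delexicalized count placeholder cannot apply such a rule, which is precisely the motivation for testing lexicalized input DAs below.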
Therefore, inspired by Sharma et al. (2017), we test using fully lexicalized input DAs with the main generator to check if it learns to produce more appropriate structure for concrete values (while still producing delexicalized output). We compare this setup against the default with delexicalized DAs.

Experimental Setup
We test all combinations of the features described in Section 3:
• direct token vs. lemma-tag generation,
• random / most-frequent / RNN LM lexicalizer,
• delexicalized vs. lexicalized input DAs.

We train the resulting 12 model variants using the Adam optimizer (Kingma and Ba, 2015) to minimize cross entropy on the training set; this approach is used for all parts of the system: the main seq2seq generator, the reranker, and the RNN LM lexicalizer. After each training data pass, we validate the models and keep the best-performing parameters. We use BLEU score (Papineni et al., 2002), classification error, and LM perplexity as the respective validation criteria. We set hyperparameters based on TGen defaults for other datasets and a few experiments on the development set.

Training the baseline lexicalizers is trivial: the random baseline does not require any training, it simply uses the list of possible surface forms; the most frequent baseline just memorizes surface form frequencies in the training data.
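The most frequent baseline really is this small; the following is a minimal sketch (class and method names are illustrative):

```python
from collections import Counter

class MostFrequentLexicalizer:
    """Baseline lexicalizer: pick the applicable surface form that was
    seen most often overall in the training data."""
    def __init__(self):
        self.freq = Counter()

    def train(self, training_surface_forms):
        # just memorize overall surface form frequencies
        self.freq.update(training_surface_forms)

    def lexicalize(self, applicable_forms):
        # unseen forms get frequency 0 via the Counter's default
        return max(applicable_forms, key=lambda f: self.freq[f])

lex = MostFrequentLexicalizer()
lex.train(["Straně", "Straně", "Strana"])
assert lex.lexicalize(["Strana", "Straně"]) == "Straně"
```

Note that the selection ignores the sentence context entirely, which is exactly what the RNN LM lexicalizer improves upon.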
To reduce the effect of random initialization, we train five runs using different random seeds and use the results of all of them for evaluation. In addition, we fix the random seeds so that identical seq2seq generators and rerankers are used in setups that only differ in the lexicalization method.

Metrics
We use the suite of word-overlap-based automatic metrics from the E2E NLG Challenge (Dušek et al., 2019), supporting BLEU (Papineni et al., 2002), NIST (Doddington, 2002), ROUGE-L (Lin, 2004), METEOR (Lavie and Agarwal, 2007) and CIDEr (Vedantam et al., 2015). Although multiple texts often correspond to the same delexicalized DA, we treat each instance individually both in training and testing, since the particular slot values influence the shape of the whole sentence (see Sections 2.4 and 3.4). This means that only a single reference output per instance is available to be used with automatic metrics (see Section 4.3).

(Hyperparameter details: The main generator uses embedding and LSTM cell size 200, learning rate 0.005, dropout rate 0.5, and batch size 20. At least 50 and up to 1,000 training data passes are used, with early stopping if the top 10 validation BLEU scores do not change for 50 passes. Beam size 20 is used for decoding. The reranker uses embedding and LSTM cell size 50, no dropout, learning rate 0.001, and batch size 20. Training runs for 100 passes; performance is validated starting with pass 10. The reranker is validated on both training and development data; classification error on the development set is given 10 times more weight than training set error. The RNN LM lexicalizer uses the same parameters as the reranker, with training for 50 passes maximum and validation (on development data only) starting after the first pass.)

Table 3: Automatic metrics results. See Section 4.2 for metrics; scores are averaged over 5 different random initializations; all scores except NIST and CIDEr are percentages. * = significantly better than the corresponding most frequent baseline lexicalizer, † = significantly better than the corresponding word forms mode, ‡ = significantly better than the corresponding (de)lexicalized input DAs. Significance was assessed using pairwise bootstrap resampling (Koehn, 2004), p < 0.01.

Table 4: Manual evaluation results on 100 sampled sentences: absolute numbers of different types of errors (S = semantic errors, R = repetition, F = fluency problems except lexicalization, I = impossible to lexicalize correctly with the given value, L = lexicalization errors). All error types are exemplified in Figure 5.
In addition to word-overlap metrics, we use the slot error rate (SER; Wen et al., 2015b) to evaluate the semantic accuracy of the outputs. This metric counts slot placeholders in the output before lexicalization and compares them to the slots in the input DA. It reliably measures the amount of missed/added content in all delexicalized slots (cf. Section 2), but the non-delexicalized binary kids allowed slot is ignored.
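Counting missed and added placeholders against the input slots might look as follows; this is our sketch of the natural reading of the metric, not code from the SER implementation of Wen et al. (2015b):

```python
from collections import Counter

def slot_error_rate(input_slots, output_placeholders):
    """SER = (missing + added slot placeholders) / number of input slots.
    Computed on the delexicalized output, before lexicalization."""
    inp = Counter(input_slots)
    out = Counter(output_placeholders)
    missing = sum((inp - out).values())  # required but not generated
    added = sum((out - inp).values())    # generated but not required
    return (missing + added) / max(sum(inp.values()), 1)

# one of two required slots is missing from the output:
assert slot_error_rate(["X-name", "X-area"], ["X-name"]) == 0.5
```

Using multiset subtraction (rather than plain set difference) also penalizes repetitions of a slot that should appear only once.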

Results
The automatic metrics scores for all setups are shown in Table 3. In terms of generator mode, using lemma-tag generation significantly improves word-overlap metrics over direct token generation in the delexicalized input setting (BLEU and NIST differences are statistically significant at p < 0.01 according to bootstrap resampling; Koehn, 2004). However, it also leads to an increased SER. The RNN LM brings a significant improvement over both baselines in all setups; the very low performance of the random baseline only documents that inflection indeed matters for slot values. The lexicalized input DAs did not bring improvement over the delexicalized setting: lexicalized setups seem to perform slightly worse in terms of both word-overlap metrics and SER.
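The paired bootstrap test can be sketched as below. For simplicity, this sketch compares mean per-instance scores; Koehn (2004) resamples test sets and recomputes the corpus-level metric (e.g., BLEU) on each resample, but the resampling logic is the same:

```python
import random

def bootstrap_win_rate(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap resampling: resample test instances with
    replacement and count how often system A beats system B on the
    mean score. A win rate >= 0.99 corresponds to p < 0.01."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap resample
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples

# A is consistently better, so it wins every resample
rate = bootstrap_win_rate([0.9] * 50, [0.1] * 50)
```

The pairing matters: both systems are evaluated on the same resampled instances, which removes instance-difficulty variance from the comparison.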

Manual Error Analysis
To obtain a deeper insight into the results and to account for the inaccuracy of automatic metrics (Novikova et al., 2017; Reiter, 2018), we performed a detailed manual error analysis on a sample of 100 outputs produced by all systems except those with the random baseline lexicalizer, which clearly perform poorly. This was a blind annotation of semantic and fluency errors; it is not a preference rating. We categorized multiple error types; the results are shown in Table 4.
The analysis confirmed that lexicalized input DAs cause more semantic errors (both missed slots and repetition). On the other hand, the outputs were more fluent in this setting, which is not apparent from automatic metrics. Lemma-tag generation also improves fluency overall, at the cost of increasing the number of semantic errors. The RNN LM lexicalizer leads to a significant reduction of lexicalization errors compared to the most frequent baseline, especially in combination with lemma-tag generation (see the top example in Figure 5). None of the systems produce perfect output; they seem to struggle especially with DAs that are very different from the ones found in the training set and/or occur less frequently (see the bottom example in Figure 5, cf. Section 2.6). We believe that an increased amount of training data could improve the situation.

Figure 5: Example outputs (cf. Table 4). In the top example, the RNN LM lexicalizer is able to select the correct feminine singular form, while the most frequent baseline selects a neuter form. In the bottom example, systems with lexicalized input DAs make more semantic errors. The lemma-tag mode is able to select a more appropriate syntactic structure for the numeral 218.

Related Work
NLG experiments for non-English languages are relatively rare, and fully trainable approaches are even rarer. Our work is, to our knowledge, the first application of neural NLG to a non-English language for data-to-text generation.
In data-to-text generation, the recent work of Moussallem et al. (2018) is applied to Portuguese, but is largely rule-based. The works of Chen et al. (2010) and Kim and Mooney (2010) represent the only data-to-text end-to-end NLG system with multilingual experiments known to us; they generate English and Korean sports commentary sentences using an inverted (non-neural) semantic parser. Our dataset is ca. 2.5 times larger and more complex, given the slot value inflection.
Conclusion

We presented the first dataset targeted at end-to-end neural non-English NLG, containing Czech texts from the restaurant domain. We show that the task of data-to-text NLG is harder here, as slot values require morphological inflection. We apply to our data the freely available, state-of-the-art TGen NLG system (Dušek and Jurčíček, 2016) based on the seq2seq architecture, and we implement two extensions for Czech: (1) an RNN LM model to select the correct inflected surface form for slot values, and (2) a lemma-tag generation mode, where the generator produces an interleaved sequence of base forms and morphological tags, which are postprocessed by a morphological generator. We also experiment with lexicalized and delexicalized slot values in generator inputs. Using both automatic metrics and manual analysis, we show that the RNN LM brings clear benefits. The lemma-tag mode and lexicalized inputs improve fluency but hurt the semantic accuracy of the outputs. We release our dataset and all experimental code on GitHub.

In future work, we will collect a large unannotated dataset and pretrain the generator (Chen et al., 2019). We believe that this will lead to increased output fluency and accuracy. We are also considering using machine translation to obtain more synthetic training data points.

Figure 1 :
Figure 1: Example of delexicalized generation in Czech. Input MRs are shown in bold blue, corresponding target (delexicalized) outputs in bold black, with "X-" marking slot value placeholders. English glosses are shown below each word in gray. Appropriate inflected forms to be filled into slot placeholders are shown in bold green, with lists of all possible forms along with their morphological tags (Hajič, 2004). Note that the surface form for "X-good for meal" can even have different parts of speech (left column: noun, middle: adjective, right: verb forms).

Table 1: Statistics of our translated Czech dataset and a comparison to the English original of Wen et al. (2015b). The average lexicalizations per slot value shows the number of different surface lexical forms per slot value, as it appears in the dataset. Numerals were disregarded when computing this value.

Figure 3: Lemmatized and delexicalized form of the translations for LM scoring. Top: lemmatized and delexicalized Czech used for the LM; middle: original Czech sentence including lexicalization; bottom: English word-by-word gloss. An English translation is shown below the example ("…restaurant for you. Its name is Kočár z Vídně and you can have Czech cuisine.").