Syntactic Manipulation for Generating more Diverse and Interesting Texts

Natural Language Generation plays an important role in the domain of dialogue systems as it determines how users perceive the system. Recently, deep-learning based systems have been proposed to tackle this task, as they generalize better and require less manual effort to implement for new domains. However, deep learning systems usually adopt a very homogeneous-sounding writing style that expresses little variation. In this work, we present our system for Natural Language Generation where we control various aspects of the surface realization in order to increase the lexical variability of the utterances, such that they sound more diverse and interesting. For this, we use a Semantically Conditioned Long Short-term Memory Network (SC-LSTM), and apply its specialized cell to control various syntactic features of the generated texts. We present an in-depth human evaluation where we show the effects of these surface manipulations on the perception of potential users.


Introduction
In this paper, we describe our end-to-end trainable neural network for producing natural language descriptions of restaurants from meaning representations (MR). Recently, data-driven natural language generation (NLG) systems have shown great promise, especially as they can be easily adapted to new data or domains. End-to-end systems based on deep learning can jointly learn sentence planning and sentence realization from unaligned data. However, a recurring problem that we found with the existing solutions for NLG is that the generated utterances express a very homogeneous writing style. More precisely, most utterances start with the restaurant name, the follow-up sentences usually begin with the pronoun "It", and each attribute-value pair is expressed using the same formulation across different utterances (see Table 1).
Green Man is a family friendly japanese restaurant in riverside near Express by Holiday Inn.
Clowns is a pub near Crowne Plaza Hotel with a customer rating of 5 out of 5.
Wildwood is an italian pub located near Raja Indian Cuisine in the city centre. It is not family-friendly.
The Cricketers provides chinese food in the 20-25 price range. It is located in the riverside. It is near All Bar One. Its customer rating is high.
Table 1: Examples to highlight the homogeneity of the utterances generated by state-of-the-art systems.
The publicly available E2E dataset by (Novikova et al., 2017) provides pairs of Meaning Representations (MRs) and several human-generated reference utterances for the restaurant domain. It is the first dataset to provide large amounts of training data with an open vocabulary, complex syntactic structures, and more variability in expressing the attributes. In this work, we exploit these characteristics of the dataset to generate utterances which express a higher diversity in their writing style. For this, we extend the Semantically Conditioned Long Short-term Memory Network (SC-LSTM) proposed by (Wen et al., 2015b) with surface features to control the manipulation of the surface realization.
Since the data contains a large variety of formulations for an attribute-value pair, a simple delexicalization of the utterance is not possible. This fact also increases the difficulty of evaluating the utterances for their correctness. Thus, we introduce a semantic reranking procedure based on classification algorithms trained to rate whether the attributes are rendered correctly. We evaluate our model on the E2E dataset and report the BLEU, NIST, METEOR, ROUGE-L and CIDEr scores. We measure the diversity of the generated utterances by counting the number of different uni- and bi-grams. Further, to evaluate the correctness of the generated utterances, we employ a soft metric based on the aforementioned classifiers. Finally, we present an in-depth human evaluation where we measured the effects of these more diverse utterances on the perceptions of potential users. More precisely, humans evaluated the quality and naturalness of an utterance, which of the attributes comprehensible, concise, elegant, and professional fit the text, and which of the different systems generated the most preferred outputs. We release the code and all the scripts. 1

Related Work
The task of NLG is usually divided into separate subtasks such as content selection, sentence planning, and surface realization (Stent et al., 2004). Traditionally, the task has been solved by relying on rule-based methods, but these methods do not scale and are hardly adaptable to new domains. Recently, deep learning techniques have become more prominent for NLG. With these techniques, there now exists a large variety of different network architectures, each tackling a different aspect of NLG: (Wen et al., 2015b) propose an extension to the vanilla LSTM (Hochreiter and Schmidhuber, 1997) to control the semantic properties of an utterance, whereas (Hu et al., 2017) use a variational autoencoder (VAE) and generative adversarial networks to control the generation of texts by manipulating the latent space; (Mei et al., 2016) employ an encoder-decoder architecture extended by a coarse-to-fine aligner to solve the problem of content selection; (Wen et al., 2016) apply data counter-fitting to generate out-of-domain training data for pretraining a model where there is little in-domain data available; (Semeniuta et al., 2017; Bowman et al., 2015) use a VAE trained in an unsupervised fashion on large amounts of data to sample texts from the latent space; and (Dušek and Jurcicek, 2016) use a sequence-to-sequence model with attention to generate natural language strings as well as deep syntax dependency trees from dialogue acts. All these approaches solve different aspects of the NLG task.
1 https://github.com/jderiu/e2e nlg
In our work, we tackle the aspect of generating texts that display more complex and diverse syntactic structures. Most work on this topic comes from the dialogue system community, as the end-to-end trainable algorithms tend to produce the same universal answer to each input.
In (Li et al., 2016a) the authors develop a new loss function based on mutual information; (Li et al., 2016b) propose a new decoding algorithm based on a modified beam search, which favors hypotheses from different parent nodes. In (Li et al., 2017) the authors aim to increase the diversity by removing training examples that are similar to the most commonly used utterances. In (Shao et al., 2017) the authors propose a sequence-to-sequence model with an augmented attention mechanism, which takes into account parts of the target sentence. Finally, the authors adapt the beam-search ranking to work at a segment level and thus inject diversity earlier during decoding.

Task Definition
Natural language generation for dialogue systems describes the task of converting a meaning representation (MR) into an utterance in a natural language. The E2E training data consists of 50k instances in the restaurant domain, where one instance is a pair of an MR and an example utterance or reference. The data is split into training, development and test sets in a 76.5%-8.5%-15% ratio. Each MR consists of 3-8 attributes and their values; see Table 2 for the domain ontology. The split ensures that the MRs in the different dataset splits are distinct. The dataset contains an open vocabulary and more complex syntactic structures than other similar datasets, as shown in the dataset definition (Novikova et al., 2017). In particular, it contains various ways of expressing a single value of an attribute: for instance, the value 1 of 5 is expressed in the data as "one star rated", "rated with 1 of 5 stars", or "rated one out of five". In this work, we exploit this variety of formulations to produce utterances that express a more varied writing style.

Model
The goal of our model is to generate a text while providing the ability to control various semantic and syntactic properties of this text. Our model has two components: i) the generator and ii) semantic classifiers that rate the correctness of an utterance. We use the Semantically Conditioned Long Short-term Memory Network (SC-LSTM) proposed by (Wen et al., 2015b) as our generator, which has a specialized cell to process the one-hot encoded MR-vector. The semantic classifiers (SC) are trained for each attribute separately: they classify which value the generator rendered. With this, the correctness of an utterance can be determined, which is relevant when dealing with contradictory constraints during the generation of more diverse texts.

Semantically Conditioned LSTM
The SC-LSTM (Wen et al., 2015b) extends the original LSTM (Hochreiter and Schmidhuber, 1997) cell with a specialized cell, which processes the MR. The MR is represented as a one-hot encoded MR-vector d_0, which represents the value for each attribute. This cell assumes the task of the sentence planner, as it treats the MR-vector as a checklist to ensure that the information is fully represented in the utterance. The cell acts as a forget gate, keeping track of which information has already been consumed. We briefly introduce the SC-LSTM as defined in (Wen et al., 2015b), which we will later modify to meet our needs. Let w_t ∈ R^M be the input vector at time t, d_t ∈ R^D the MR-vector at time t, and N the number of units of an SC-LSTM cell. In the forward pass, σ is the sigmoid function, i_t, f_t, o_t, r_t ∈ [0, 1]^N are the input, forget, output, and MR-reading gates, and h_t, c_t ∈ [0, 1]^N are the hidden state and the cell state. The weights W_{5n,2n} and W_d ∈ R^{D×M} are the model parameters to be learned. The next token is predicted by sampling from the output probability distribution, where W_s ∈ R^{N×M} is a weight matrix to be learned during training. During training, the inputs to the SC-LSTM are the original tokens w_t from the training set. When generating new utterances, on the other hand, we use the previously generated token as input to generate the next token.
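For reference, the forward pass and the token distribution of the SC-LSTM can be sketched as follows, based on the formulation of (Wen et al., 2015b). The stacked-weight notation, the symbol \hat{c}_t for the candidate cell value, the omission of bias terms, and the exact form of the softmax are assumptions chosen to match the symbols named in the text; the original typesetting may differ.

```latex
\begin{align}
\begin{pmatrix} i_t \\ f_t \\ o_t \\ r_t \\ \hat{c}_t \end{pmatrix}
  &= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
     W_{5n,2n} \begin{pmatrix} w_t \\ h_{t-1} \end{pmatrix} \\
d_t &= r_t \odot d_{t-1} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat{c}_t + \tanh(W_d \, d_t) \\
h_t &= o_t \odot \tanh(c_t) \\
P(w_{t+1} \mid w_{0:t}, d_0) &= \operatorname{softmax}(W_s \, h_t)
\end{align}
```

The second line is what makes the MR-vector act as a checklist: the reading gate r_t can only shrink the entries of d_{t-1}, so information that has been rendered is progressively removed.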
Loss. To ensure that the SC-LSTM consumes the MR correctly, two conditions are defined: i) the MR-vector at the last time step d_T has to be zero, which ensures that all the required information has been rendered, and ii) the gate should not consume too much of the dialogue act in one time step, i.e. the difference d_t − d_{t−1} should be minimised. From these criteria, the reconstruction loss is adapted into a sum of three terms: the first term is the reconstruction error, which sums the cross-entropy loss over all time steps, and the following two terms enforce the two criteria defined above.
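A sketch of this three-term loss, following the cost function of (Wen et al., 2015b), is given below. Here p_t denotes the one-hot target distribution and y_t the predicted distribution at time step t; η and ξ are constants of that formulation. The sign convention and the index range are assumptions made to agree with the description above, and the values of η and ξ are not stated in this text.

```latex
\begin{equation}
F(\theta) = -\sum_{t} p_t^{\top} \log(y_t)
            \; + \; \lVert d_T \rVert
            \; + \; \sum_{t=1}^{T} \eta \, \xi^{\lVert d_t - d_{t-1} \rVert}
\end{equation}
```

The second term penalises any part of the MR that is left unrendered at the last step, and the third term grows with the amount of the MR-vector consumed in a single step.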

Semantic Classifiers
For each attribute a we train a CNN-based classifier D a . Each classifier is trained to detect which of the possible values for the attribute a is rendered in the utterance or if the attribute is present in the utterance at all. We train the classifiers on the training set, where the input is the utterance and the output is the value for the attribute a, which is defined in the MR. These classifiers measure the semantic correctness of the produced utterances by comparing the output of the classifier to the MR. If the classifier output corresponds to the value defined in the MR then we regard the attribute as being rendered correctly.
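The comparison between classifier outputs and the MR can be sketched as follows. This is a minimal illustration, not the released implementation: the keyword-matching functions stand in for the trained CNN classifiers D_a, and aggregating the per-attribute decisions into a fraction is an assumption made here for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical interface: a classifier maps an utterance to the value it
# detects for one attribute, or to None if the attribute is not rendered.
Classifier = Callable[[str], Optional[str]]


def correctness_score(utterance: str,
                      mr: Dict[str, str],
                      classifiers: Dict[str, Classifier]) -> float:
    """Fraction of attributes whose detected value agrees with the MR."""
    correct = 0
    for attribute, classifier in classifiers.items():
        expected = mr.get(attribute)  # None: attribute absent from the MR
        if classifier(utterance) == expected:
            correct += 1
    return correct / len(classifiers)


# Toy keyword classifiers (assumptions, used only to make the sketch runnable).
def food_clf(text: str) -> Optional[str]:
    return "French" if "french" in text.lower() else None


def area_clf(text: str) -> Optional[str]:
    lowered = text.lower()
    if "riverside" in lowered:
        return "riverside"
    if "city centre" in lowered:
        return "city centre"
    return None


def family_clf(text: str) -> Optional[str]:
    return "yes" if "family friendly" in text.lower() else None


toy_classifiers = {"food": food_clf, "area": area_clf, "familyFriendly": family_clf}
toy_mr = {"food": "French", "area": "riverside"}

score_ok = correctness_score(
    "Giraffe is a French restaurant in the riverside area.", toy_mr, toy_classifiers)
score_bad = correctness_score(
    "Giraffe is a French restaurant in the city centre.", toy_mr, toy_classifiers)
```

In the toy example the second utterance renders the wrong area, so only two of the three attribute checks agree with the MR.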

Syntactic Control
The utterances produced by the basic model described in Section 4 lack syntactic variety: they all follow the same trivial structure. To control the syntactic expressions of an utterance, we expand the MR-vector with syntax-specific features. More specifically, in this work we control three different surface features: i) the first word of the utterance, ii) the first word of each follow-up sentence in the utterance, and iii) for each attribute-value pair the formulation used to express it. For each of these control mechanisms, we produce one-hot encoded vectors and append these vectors to the MR-vector d_0. Through this mechanism, we provide the SC-LSTM with more prior information on the structure of the utterance. Thus, it learns to render the surface according to the surface information provided. In the following, we describe the three control mechanisms in detail.

First Word Control
Most utterances generated by the vanilla SC-LSTM begin with the restaurant name. The main reason for this behaviour is that 59% of all utterances in the dataset have this characteristic. All the other starting words are used much less frequently: e.g. only 7% of all utterances start with the word "There", which is the second most used word. The model is optimized to generate the utterance that yields the lowest average loss. Without additional information, this equates to the most common structure of utterances found in the training set. The first word used in an utterance greatly impacts how the rest of the utterance is rendered. Thus, using different first words increases the diversity of the rendered utterances. To generate more uncommon utterances, we provide the model with the information about the first word in the utterance during training. For this, we select all the words that appear more than t = 60 times as first word in the training data, which results in a set of n = 20 different words 2. We then extend the MR-vector by adding a one-hot encoded vector u_0 ∈ R^{n+1}, where the vector is set to '1' at the index of the first word in the utterance of the training sample. During training, we use a dummy index at n + 1 in case the first word of the utterance is not present in the list of first words. At test time, the first word is sampled from the set of n first words. To improve the semantic correctness, we use the sampling procedure to over-generate, i.e. m different words are sampled to generate m different utterances. Using the semantic classifiers, the produced utterances are ranked by their correctness score.
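The construction of the first-word vector can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization replaces the tokenizer used in the paper, the function names are hypothetical, and the toy threshold is far smaller than t = 60.

```python
from collections import Counter
from typing import List


def build_first_word_vocab(utterances: List[str], min_count: int) -> List[str]:
    """Words that occur more than `min_count` times as the first word."""
    counts = Counter(u.split()[0] for u in utterances if u.split())
    return sorted(w for w, c in counts.items() if c > min_count)


def encode_first_word(utterance: str, vocab: List[str]) -> List[int]:
    """One-hot vector of length n + 1; the last slot is the dummy index."""
    vec = [0] * (len(vocab) + 1)
    tokens = utterance.split()
    first = tokens[0] if tokens else None
    index = vocab.index(first) if first in vocab else len(vocab)
    vec[index] = 1
    return vec


toy_utterances = (["There is a pub near X-near."] * 3
                  + ["X-name is a pub."] * 4
                  + ["Located near X-near is X-name."])
toy_vocab = build_first_word_vocab(toy_utterances, min_count=2)
vec_known = encode_first_word("There is a pub near X-near.", toy_vocab)
vec_dummy = encode_first_word("Located near X-near is X-name.", toy_vocab)
```

The vector would then be appended to the MR-vector d_0; at test time an index among the first n positions is sampled instead of being read off a reference utterance.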

Follow-up First Word Control
We observe that the follow-up sentences in an utterance produced by the vanilla SC-LSTM also follow the same pattern. More precisely, in cases where the utterance uses multiple sentences, the follow-up sentences usually begin with the pronoun 'It', which refers to the restaurant name mentioned in the first sentence. Similarly to the first-word control, we control the first word of follow-up sentences by using one-hot encoded vectors. The encoding states which word is used as first word of each follow-up sentence. As most utterances are composed of between one and four sentences, we use three vectors to encode the first word of the first three follow-up sentences. There are n = 22 different first words used in follow-up sentences; thus, each vector f_i is of length n + 1, where i ∈ {2, 3, 4} denotes the sentence enumeration. We add an extra dimension to denote the case where the number of sentences is less than i. This representation provides the ability to control the first word used in each follow-up sentence as well as the number of sentences rendered.

Attribute-Value Formulation Control
We observe that the vanilla SC-LSTM learns to use the most common formulation for an attribute-value pair. On average over all the attribute-value pairs, the most common formulation is used in 76% of the cases in the training set. It turns out that the most used formulation for most attribute-value pairs is equivalent to the surface form of the value itself. For example, the value "5 out of 5" is mostly expressed using the formulation: "... with a customer rating of 5 out of 5", instead of "It has an excellent customer rating" or other formulations.
To extract the different formulations of an attribute-value pair, we use a simple TF-IDF approach based on unigrams. For the complete list of formulations refer to Table 11 in Appendix A.
For each attribute, we treat the utterances for each value as one document; thus, the corpus is made of as many documents as there are values for this attribute. The score is computed as 1 + log(tf^a_iv) * log(1 + N / df^a_i), where tf^a_iv is the term frequency of term i for value v, df^a_i is the document frequency of term i in the documents of attribute a, and N is the number of documents for attribute a. We keep only those terms whose score is higher than 3. We apply manual filtering to remove terms that do not describe the attribute-value pair. With this method, we get on average 4.2 terms per attribute-value pair. We extend the MR-vector with one one-hot encoded vector for each attribute-value pair.
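The scoring step can be sketched as follows. Two points are assumptions of this sketch: the formula is read as the usual sublinear weighting (1 + log tf) * log(1 + N / df), and the natural logarithm is used; the text fixes neither the grouping nor the base. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def formulation_scores(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score every term of every value-document of a single attribute.

    `docs` maps each value of the attribute to the tokens of all utterances
    that express this value (one document per value).
    """
    n_docs = len(docs)
    term_freq = {value: Counter(tokens) for value, tokens in docs.items()}
    doc_freq: Counter = Counter()
    for counts in term_freq.values():
        doc_freq.update(counts.keys())  # each document counts a term once
    return {
        value: {
            term: (1 + math.log(freq)) * math.log(1 + n_docs / doc_freq[term])
            for term, freq in counts.items()
        }
        for value, counts in term_freq.items()
    }


toy_docs = {
    "cheap": ["cheap", "food", "cheap"],
    "high": ["expensive", "food"],
}
toy_scores = formulation_scores(toy_docs)
```

Terms that occur in the documents of every value, such as "food" in the toy data, receive a low score, while terms specific to one value are ranked higher; a threshold and manual filtering would then be applied on top.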

Experimental Setting
The goal for our application is to generate descriptions for restaurants. We use the dataset from (Novikova et al., 2017) and report the BLEU, NIST, METEOR (Lavie and Agarwal, 2007), ROUGE-L (Lin, 2004), and CIDEr (Vedantam et al., 2015) scores. Furthermore, we report various measures for lexical diversity: number of different tokens (#tokens), the type-token ratio (TTR) (Chotlos, 1944), the moving average type-token ratio (MSTTR) (Covington and McFall, 2010), and the measure of textual lexical diversity (MTLD) (McCarthy, 2005). Finally, we perform a human evaluation to measure the effect of the proposed manipulations on the user's perception.
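The simplest of these diversity measures can be sketched as follows. Lower-casing and whitespace tokenization are assumptions of this sketch, so the numbers it produces are not directly comparable to those reported in the tables.

```python
from typing import List


def distinct_ngrams(utterances: List[str], n: int) -> int:
    """Number of different n-grams over a set of utterances."""
    seen = set()
    for utterance in utterances:
        tokens = utterance.lower().split()
        seen.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(seen)


def type_token_ratio(utterances: List[str]) -> float:
    """Number of different tokens divided by the total number of tokens."""
    tokens = [t for u in utterances for t in u.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


toy_outputs = ["it is a pub", "it is a bar"]
n_unigrams = distinct_ngrams(toy_outputs, 1)
n_bigrams = distinct_ngrams(toy_outputs, 2)
ttr = type_token_ratio(toy_outputs)
```

A system that always reuses the same sentence template adds few new n-grams per utterance, which is what these counts are meant to expose.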
Preprocessing. Each utterance is treated as a string of characters, where each character is represented as a one-hot encoded vector. We replace the name and near values with the tokens "X-name" and "X-near" respectively. The high diversity of the formulations found for the attribute-value pairs prevents us from replacing other attributes with placeholders. To generate the lexical features, we apply the Spacy-API 3 for word and sentence tokenization.
System Setup. We train the SC-LSTM and the classifiers using AdaDelta (Zeiler, 2012) to optimize the loss function. We apply a softmax with decreasing temperature as proposed in (Hu et al., 2017) to approximate the discrete representation, which is used as input to the LSTM during the decoding stage. For the LSTM cell we use a hidden state of size 1024 and apply dropout as suggested in (Yarin and Ghahramani, 2016). For the classifiers we use a 2-layer CNN with 256 kernels of length 3. We use our character-based version of the SC-LSTM (vanilla) as well as the sequence-to-sequence model by (Dušek and Jurcicek, 2016) (tgen) as baselines. We evaluate different versions of our model: the model where we control only the first word of the utterance (utt-fw), the model where we only control the first words of the follow-up sentences (follow-fw), the model where we only control the formulations of the attribute-value pairs (form), and the model where we control all three factors (full).
3 https://spacy.io/
Output Generation. The input to the system is a meaning representation (MR) which is converted into the MR-vector d_0. For each MR, the system samples the syntactic control values at random, i.e. it samples the first word of the utterance, the first words of each of the follow-up sentences and the formulation for each attribute-value pair randomly from the list of their respective possibilities. Then, these syntactic features are encoded into the one-hot format as described above. The input to the SC-LSTM is composed of both the MR-vector and the syntactic control vector. To ensure that the sampling of the syntactic features did not introduce semantic errors, the system samples 10 different values for each of the three control types and produces one utterance for each combination, e.g. the full system produces 1000 sentences for each MR. We then use the classifiers (previously trained to evaluate if the utterance rendered the MR correctly) to rank the 1000 utterances w.r.t. their correctness. Finally, the system samples the final utterance from the set of utterances with the highest score (as there can be multiple utterances with the same score).
Table 5: Error Rate for each system, best system is highlighted in bold. The sc subscript denotes the scores computed by the classifiers.
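The over-generation and reranking loop described under Output Generation can be sketched as follows. The callables `generate` and `score` are hypothetical stand-ins for the SC-LSTM decoder and the classifier-based correctness score. Sampling without replacement and treating the follow-up control as a single value (rather than three vectors) are simplifying assumptions of this sketch.

```python
import itertools
import random
from typing import Callable, Dict, List, Optional


def generate_and_rerank(mr: Dict[str, str],
                        first_words: List[str],
                        followup_words: List[str],
                        formulations: List[str],
                        generate: Callable[[Dict[str, str], str, str, str], str],
                        score: Callable[[str, Dict[str, str]], float],
                        k: int = 10,
                        rng: Optional[random.Random] = None) -> str:
    """Sample k values per control type, decode every combination, keep a best one."""
    rng = rng or random.Random()
    sampled = [rng.sample(options, min(k, len(options)))
               for options in (first_words, followup_words, formulations)]
    scored = []
    for first, followup, formulation in itertools.product(*sampled):
        utterance = generate(mr, first, followup, formulation)
        scored.append((score(utterance, mr), utterance))
    best = max(s for s, _ in scored)
    top = [u for s, u in scored if s == best]
    return rng.choice(top)  # ties are broken by sampling


# Toy stand-ins, used only to make the sketch runnable.
def toy_generate(mr, first, followup, formulation):
    return f"{first} {mr['name']} . {followup} is {formulation}"


def toy_score(utterance, mr):
    return 1.0 if "cheap" in utterance.split() else 0.0


chosen = generate_and_rerank({"name": "X-name"}, ["There", "In"], ["It", "Prices"],
                             ["cheap", "pricey"], toy_generate, toy_score,
                             rng=random.Random(0))
```

With k = 10 and three control types the product contains 10 × 10 × 10 = 1000 candidates per MR, which matches the count given for the full system and also shows why the procedure is expensive.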

Evaluation Metrics
We report the scores for the automatic evaluation. This includes the metrics BLEU, ROUGE-L, METEOR, NIST, and CIDEr, which rely on the comparison between the predicted utterance and multiple reference utterances. Table 3 shows that the surface manipulation leads to a decrease in all of these scores. The best scores for each metric are achieved by the tgen system. Its BLEU score is 3 points above the score achieved by vanilla.
The full system achieved the lowest scores in each metric. Generally speaking, the deeper the impact of the syntactic manipulation, the lower the word-overlap based score. This behaviour is explained by the fact that the baseline systems generate utterances which are syntactically similar to the most used structure in the gold standard. The other systems generate sentences whose style and structure are much rarer in the gold standard. For example, 59% of the reference utterances start with the standard pattern, whereas only 3% of the sentences generated by the full system follow this pattern. Although there are multiple reference utterances, it is unlikely that one of them follows the syntactic choices of the syntactically controlled systems.
Table 6: Diversity scores for each system and the human texts. The highest score of a system is marked in bold.
Human-written texts display the highest diversity across all scores. The full system achieves the highest scores out of all systems. Furthermore, both the vanilla and the tgen system obtain the lowest scores, thus showing that the syntactic control mechanisms generate more diverse texts.

Classifier Performance
Since we use semantic classifiers to evaluate the correctness of the generated sentences, it is important to assess the quality of these classifiers. Table 4 shows the accuracy score for each of the classifiers on the test set. We note that all classifiers have a score greater than 0.9 except for the customer rating. The errors of the customer rating and the price classifiers stem from the semantic equivalence between the numerical and the verbal values, which were used interchangeably in the references, e.g. when "price range is over £30" is expressed as "high-priced".

Correctness
We evaluate the correctness using a rule-based system. We report the average error rate achieved by a system, as proposed by (Wen et al., 2015a), in Table 5, line ERR_rule. The best error rate is achieved by the full system, followed by utt-fw and form. This shows that our approach to rerank the utterances with the semantic classifiers works very well. For comparison, we also report the error rates when using the semantic classifiers themselves to determine the correctness of an utterance (ERR_sc). It turns out that there is a mismatch between the scores achieved by the two metrics, especially for the tgen and vanilla systems. This is due to the fact that the classifiers are used to filter the incorrect utterances, which biases the scores. Thus, the classifiers themselves are not suitable to compute a correctness score.

Qualitative Evaluation
In Table 8 two representative (cherry picked) examples are shown. For each MR we compare the outputs of all systems. In both examples the tgen and vanilla systems produce utterances which follow the trivial pattern. The utt-fw and full systems produce a different style of utterance by starting the sentence with a preposition. The follow-fw system adds more variability to the utterance by starting the follow-up sentences with verbs (e.g. "Located") or nouns ("Children") instead of pronouns referring to the restaurant name. The form system adds more variability by using different ways of phrasing an attribute-value pair (e.g. replacing "high price range" with "expensive"). We added a list of randomly sampled outputs (Table 13).
Here, * denotes a statistically significant difference between a system and the tgen system, measured with a two-tailed Student's t-test with p < 0.05.

Human Evaluation
To measure the effectiveness of our approach, we performed an extensive human evaluation. For this, we recruited judges from the Figure-Eight 4 platform. For each experiment the sentence is rated by three different judges.
Quality and Naturalness. To show that the syntactic manipulations do not deteriorate the utterances, we evaluated the quality and naturalness of the utterances produced by the different systems.
Here, quality is defined to measure the grammatical correctness, the fluency and the correctness of the content, whereas naturalness measures the likelihood that the utterance was written by a human. For this, we sampled 250 MRs and generated the respective utterances for each system. The judges rated all utterances on a Likert scale from 1 to 5 for quality and on a scale from 1 to 3 for naturalness 5. Table 7 shows the results for both the quality and naturalness evaluation. Statistical significance is measured by means of a two-tailed Student's t-test between the tgen system and the other systems. For quality there is no statistically significant difference between the tgen system and any other system. For naturalness there is no statistically significant difference between tgen and the syntactically controlled systems. However, there is a significant difference between tgen and vanilla.
In fact, the vanilla system is rated significantly higher in terms of naturalness than any other system. For both metrics, the scores of all systems are very high, thus, we conclude that the syntactical control mechanisms do not deteriorate the utterances.

Subjective Analysis
The main goal of the human evaluation is to understand how humans perceive the new utterances. For this, we compare the utterances of tgen and the full system by first sampling an MR, generating the utterance for each system, and letting the human judges decide which of the two utterances they prefer. Since preference is a very subjective measure that might not give complete insight, we asked the judges to also state which utterance they find more comprehensible (is the utterance easier to understand), more concise (does the utterance convey the information clearly with as little text as possible), more elegant (is the utterance more nicely written, more poetic, does it display higher variability) and more professional (could this text be written by an experienced and well-trained writer). Table 9 shows the ratio at which each system was selected for each of the five aforementioned categories alongside the inter-annotator agreement computed with Fleiss' κ.
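For readers unfamiliar with the agreement measure, Fleiss' κ can be computed as in the following sketch. The input layout (one row per item, one column per category, entries counting the judges who chose that category) and the toy ratings are assumptions of this illustration; it presumes the same number of judges per item, as in the setup with three judges described above.

```python
from typing import List


def fleiss_kappa(ratings: List[List[int]]) -> float:
    """Fleiss' kappa; ratings[i][j] = number of judges assigning item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])  # assumed constant across items
    n_categories = len(ratings[0])
    # Share of all assignments that fall into each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_categories)]
    # Observed agreement per item, then averaged.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    # Undefined (division by zero) if every judge always picks the same category.
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: two categories (tgen preferred, full preferred), three judges per item.
kappa_perfect = fleiss_kappa([[3, 0], [0, 3]])
kappa_mixed = fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]])
```

A value of 1 indicates perfect agreement, while values near 0 indicate agreement no better than chance.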
The results show that neither of the two systems is significantly preferred by the judges, nor is either of them rated as being more comprehensible. However, the judges perceive the full system to produce significantly more elegant and professional utterances, i.e. in 71.6% of the comparisons the utterances by the full system were rated as more elegant and in 66.6% as being more professional. On the other hand, the judges rate the utterances of the tgen system as being significantly more concise (75%). There is a moderate correlation between the preference of an utterance and the elegance (Spearman's Rho ρ = 0.557) or professionalism (ρ = 0.569). Furthermore, there is a weak correlation between the comprehensibility of a system and the professionalism (ρ = 0.468). However, we found that there is a strong correlation between the preference and the comprehensibility of an utterance (ρ = 0.719). The evaluation shows that although the utterances by the full system are rated as being more elegant and professional, they are not necessarily preferred. The strongest indicator for preference is, thus, the comprehensibility.

Table 8: Two representative examples with the outputs of all systems.
Example 1
vanilla: The Punter is a pub that serves English food in the high price range and is located in the city centre near Raja Indian Cuisine.
tgen: The Punter is an english pub in the city centre near Raja Indian Cuisine. It has a high price range and is not child friendly.
utt-fw: In the city centre near Raja Indian Cuisine is a pub called The Punter. It serves English food and is not children friendly. The price range is high.
follow-fw: The Punter is a pub that serves English food. Located in the city centre near Raja Indian Cuisine, it is not children friendly and has a high price range.
form: The Punter is an expensive English Pub near Raja Indian Cuisine in the city centre. It is not child friendly.
full: If you are looking for a pub serving English food, try The Punter. It is located in the city centre near Raja Indian Cuisine. Prices are on the higher end and it is not child friendly.
Example 2
MR: name=Giraffe, eatType=restaurant, food=French, area=riverside, familyFriendly=yes, near=Raja Indian Cuisine
vanilla: Giraffe is a family friendly restaurant that serves French food. It is located near Raja Indian Cuisine.
tgen: Giraffe is a family friendly french restaurant near Raja Indian Cuisine in riverside.
utt-fw: A French restaurant called Giraffe is located in the riverside area near Raja Indian Cuisine. It is child friendly.
follow-fw: Giraffe is a restaurant that serves French food. The restaurant is located near Raja Indian Cuisine in the riverside area. Children are welcome.
form: Giraffe is a French restaurant in the riverside area near Raja Indian Cuisine. It is family friendly.
full: In the riverside area there is a French restaurant called Giraffe. You will find it near Raja Indian Cuisine. Yes, it is family friendly.

Native vs. non-native speakers. We observed that the results differed depending on whether the judges were native speakers or not. Thus, we repeated the same experiment by recruiting judges from non-native speaking countries 6. Table 10 shows the results of the evaluation performed by the non-native speaking group. The differences of the ratings are significant. The non-native speakers rate the tgen system as significantly more comprehensible, more concise as well as more professional. There is still a high correlation between the preference and the comprehensibility of an utterance (Spearman's Rho ρ = 0.709). However, for the non-native group there is a significantly higher correlation between the comprehensibility and the professionalism of an utterance (Spearman's Rho ρ = 0.628) and a very high correlation between the preference and the professionalism (Spearman's Rho ρ = 0.714). This shows that the non-native speaking group finds it easier to understand the utterances produced by tgen and rates them as more preferable and more professional. The evaluation shows that the two groups have different preferences and perceptions of the utterances. An in-depth analysis of the reasons behind these differences is left to future work. Our experiments indicate that the differences are due to the differences in language proficiency, as there is a high correlation between the preference and the comprehensibility. However, to test this assumption, more characteristics about the judges need to be known.

Conclusion
In this work, we presented an end-to-end trainable deep-learning based system for the natural language generation task. With a simple control mechanism the utterances can be rendered more diverse and interesting. The human evaluation revealed that this control mechanism does not deteriorate the quality of the utterances in terms of semantic or grammatical errors. It further revealed that more diverse utterances are perceived as being more elegant and professional sounding by native speakers. Not surprisingly, the corpus-based metrics deteriorate when a more diverse vocabulary is used. One major challenge of this approach is the fact that during the generation the syntactic control features have to be sampled randomly to generate many utterances which have to be ranked and filtered. The solution to this inefficiency is part of future work.

20, less, than, under, pounds, inexpensive, below, lower
£20-25: 20, from, between, mid, 20-25, ranging, around
more than £30: 30, more, than, expensive, over, higher, above, costs, euros, costing

Example 1
vanilla: The Punter is a family-friendly restaurant located in the city centre near Rainbow Vegetarian Café. It is cheap and has an average customer rating.
tgen: The Punter is an italian restaurant near Rainbow Vegetarian Café in the city centre. It is family-friendly and has a cheap price range and an average customer rating.
utt: Rainbow Vegetarian Café is a family-friendly restaurant called The Punter that serves Italian food and has an average customer rating. It is located in the city centre.
follow: The Punter is a cheap Italian restaurant in the city centre near Rainbow Vegetarian Café. The Punter is family friendly and has an average customer rating.
form: The Punter is an inexpensive Italian restaurant in the city centre near Rainbow Vegetarian Café. It is family friendly and has an average customer rating.
full: In the city centre is a family-friendly restaurant called The Punter. This is a cheap Italian restaurant near Rainbow Vegetarian Café. It has an average customer rating.
human: There is a cheap, restaurant that serves Italian, named The Punter, in the city centre near Rainbow Vegetarian Café. It has an average customer rating and is family friendly
Example 2
vanilla: The Waterman is a restaurant providing Italian food in the less than 20 price range. It is located in the riverside. It is near Raja Indian Cuisine.
tgen: The Waterman is an italian restaurant in the riverside area near Raja Indian Cuisine. It is not family-friendly and has a price range of less than 20.
utt: Italian restaurant The Waterman is located in the riverside area near Raja Indian Cuisine. It is not family-friendly and has a price range of less than 20.
follow: The Waterman is a restaurant located near Raja Indian Cuisine in the riverside area. The price range is less than 20. They serve Italian food and are not family-friendly.
form: The Waterman is a restaurant providing Italian food in the low price range. It is located in the riverside area near Raja Indian Cuisine. It is not family friendly.
Example 3
vanilla: The Wrestlers is a family friendly pub near Raja Indian Cuisine in the riverside area that serves Italian food for less than 20.
tgen: The Wrestlers is a family-friendly pub near Raja Indian Cuisine in the riverside area. It serves italian food for less than 20.
utt: Italian food is served at The Wrestlers pub located near Raja Indian Cuisine in the riverside area. It is family friendly and has a price range of less than 20.
follow: The Wrestlers is a pub that serves Italian food. They are located in the riverside area near Raja Indian Cuisine. They are family friendly and the price range is less than 20.
form: The Wrestlers is a family friendly pub serving Italian food in the low price range. It is located in the riverside area near Raja Indian Cuisine.
full: On the riverside near Raja Indian Cuisine is a family friendly pub called The Wrestlers. The price range is less than 20 and they serve Italian food.
human: The Wrestlers is a pub in the low price range that serves pasta. It is located near Raja Indian Cuisine and has a public restroom.
Table 13: Randomly sampled output. A meaning representation is sampled at random, the respective utterance from each system is displayed.