Harvesting Creative Templates for Generating Stylistically Varied Restaurant Reviews

Many of the creative and figurative elements that make language exciting are lost in translation in current natural language generation engines. In this paper, we explore a method to harvest templates from positive and negative reviews in the restaurant domain, with the goal of vastly expanding the types of stylistic variation available to the natural language generator. We learn hyperbolic adjective patterns that are representative of the strongly-valenced expressive language commonly used in either positive or negative reviews. We then identify and delexicalize entities, and use heuristics to extract generation templates from review sentences. We evaluate the learned templates against more traditional review templates, using subjective measures of convincingness, interestingness, and naturalness. Our results show that the learned templates score highly on these measures. Finally, we analyze the linguistic categories that characterize the learned positive and negative templates. We plan to use the learned templates to improve the conversational style of dialogue systems in the restaurant domain.


Introduction
The restaurant domain has been one of the most common applications for spoken dialogue systems for at least 25 years (Polifroni et al., 1992; Whittaker et al., 2002; Stent et al., 2004; Devillers et al., 2004; Gasic et al., 2008). There has been a tremendous amount of previous work on natural language generation of recommendations and descriptions for restaurants (Howcroft et al., 2013; Wen et al., 2015; Novikova et al., 2016), some of [...] (Higashinaka et al., 2007b; Mairesse and Walker, 2010; Dethlefs et al., 2014). Given this, it is surprising that previous work has not especially noted that restaurant reviews are a fertile source of creative and figurative language. For example, consider the elaborate descriptions in the restaurant reviews in Table 1, e.g. phrases such as worst thing that ever happened in the history of the known world along with incredibly remarkably terrible (Row 1), eat here everyday if I didn't think I'd end up 400 pounds and sooooooo heavenly (Row 4), and food so amazing you cannot eat [...] anywhere else (Row 5). These phrases express extremely valenced reactions to restaurants, their menu items, and related attributes, using figurative language.

Table 1 (columns: #, Stars, Review; only the first two rows are partially recoverable):
Row 1: This place is probably the worst thing that ever happened to the history of the known world. [...] The food, however, I initially would want to call unremarkable but I can't. I can't call it unremarkable because it is so incredibly remarkably terrible. [...]
Row 2 (2/5 stars): Can't say anything about the food, as we were never served. We never saw a server, even after sitting at our [...]
The creativity exhibited in these user-generated restaurant reviews can be contrasted with natural language generation (NLG) for the restaurant domain. Methods for NLG typically begin with a structured meaning representation (MR), as shown in Table 2, and map these meaning representations into surface language forms, using a range of different methods, including template-based generation, statistically trained linguistically-informed NLG engines, and neural approaches (Bangalore and Rambow, 2000; Walker and Rambow, 2002). These approaches vary in the degree to which they can generate syntactically and semantically correct utterances, but in most cases the stylistic variation they can generate is extremely limited. Table 2 illustrates sample restaurant domain utterances produced by recent statistical/neural natural language generators (Higashinaka et al., 2007a; Wen et al., 2015; Novikova et al., 2016; Dusek and Jurcícek, 2016).
One of the most prominent characteristics of restaurant reviews in the Yelp corpus is the prevalent use of hyperbolic language, such as the phrase "incredibly remarkably terrible" in Table 1. Hyperbole is often found in persuasive language, and is classified as a form of figurative language (McCarthy and Carter, 2004; Cano Mora, 2009). Colston and O'Brien describe how an event or situation evokes a scale, and how hyperbole exaggerates a literal situation, introducing a discrepancy between the "truth" and what is said (Colston and Keller, 1998; Colston and O'Brien, 2000). Hyperbole moves the strength of a statement up and down the scale, away from the literal meaning, where the degree of movement reflects the degree of contrast or exaggeration. Depending on what they modify, adverbial intensifiers like totally, absolutely, and incredibly can shift the strength of the assertion to extreme negative or positive.
Similarly, Kreuz and Roberts (1995) describe a standard frame for hyperbole in English where an adverb modifies an extreme, positive adjective, e.g. "That was absolutely amazing!" or "That was simply the most incredible dining experience in my entire life." Such frames can be seen in the reviews in Table 1, but we also see many other idiomatic hyperbolic expressions such as out of this world (Cano Mora, 2009).
Our goal is to develop a natural language generator for the restaurant domain that can harvest and make use of these types of stylistic variations. We explore a data-driven approach to automatically select stylistically varied utterances in the restaurant review domain as candidates for review construction. We empirically learn hyperbolic adjective patterns that are highly correlated with two classes (positive and negative reviews). Using different resources, we also identify and delexicalize restaurant, cuisine, food, service, and staff entities, and select short, single-entity utterances that are simple to templatize.
Our overall approach is thus similar to Higashinaka et al. (2007a,b), who describe a method for harvesting an NLG dictionary from restaurant reviews. However, our focus on learning expressive language, in particular hyperbole as a type of figurative language, is novel. Our framework consists of the following steps:
1. Collect a large number of strongly positive and strongly negative reviews in the restaurant domain;
2. Use a linguistic pattern learner to identify linguistic frames that use hyperbole;
3. Create generation templates from the identified linguistic patterns and infer their contexts of use;
4. Learn to rank the generation templates for convincingness and quality.
We see Steps 1 to 3 as the overgeneration phase, aimed at vastly expanding the types of stylistic variation possible, while Step 4 is the ranking phase, in a classic overgenerate-and-rank NLG architecture (Langkilde and Knight, 1998; Rambow et al., 2001). We focus in this paper on Steps 1 to 3, expecting to improve these steps before we move on to Step 4. Thus, in this paper, we conduct an evaluation experiment to compare three different types of NLG templates: pre-defined BASIC templates similar to those used in current NLG engines for the restaurant domain (Wen et al., 2015), the basic templates stylized with our learned patterns for more HYPERBOLIC templates, and finally a class of CREATIVE templates that incorporate full sentence templates from user reviews. Our expectation was that many of the CREATIVE templates would fail to be appropriate to their contexts, but that our HYPERBOLIC templates would be both appropriate and more interesting and convincing than the BASIC templates. However, our results show that our creative templates are preferred as more convincing, interesting, and natural across the board. We discuss how we can use quantitative metrics associated with the learned templates for future ranking, and analyze characteristic linguistic categories in each class.

Table 2 (flattened in extraction; recovered cells listed in extraction order, pairing of systems and utterances not fully recoverable):
Emilios decor and service are both decent, but its food quality is nothing short of excellent. It serves Italian food and its in the City Centre.
Seq2Seq NLG (Nayak et al., 2017)
The food is phenomenal and the atmosphere is very unique. Babbo has excellent service. It has the best overall quality among the selected restaurants.
Unsupervised Method for Lexicon Learning (Higashinaka et al., 2007a)

Data
Our restaurant review data comes from the Yelp dataset challenge, which includes 144K businesses with over 4.1M reviews. We randomly select 10K businesses located in the US that are classified as restaurants, resulting in a set of around 40K reviews. The data consists of around 4K 1 stars, 3.8K 2 stars, 5.6K 3 stars, 11.3K 4 stars, and 15K 5 stars. We divide the reviews by stars, and create three datasets: negative (using all of the 1-2 stars), positive (balancing the number of negative reviews using the 5 stars), and neutral (using all of the 3 stars). Table 3 shows our data distribution.
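The three-way split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record keys `stars` and `text`, the function name `split_by_stars`, and the use of uniform random sampling to balance the positive set are all assumptions.

```python
import random
from collections import defaultdict


def split_by_stars(reviews, seed=0):
    """Split reviews into negative, positive, and neutral sets.

    `reviews` is an iterable of dicts with hypothetical keys
    'stars' (int, 1-5) and 'text' (str).
    """
    by_stars = defaultdict(list)
    for review in reviews:
        by_stars[review["stars"]].append(review)

    # Negative: all 1- and 2-star reviews. Neutral: all 3-star reviews.
    negative = by_stars[1] + by_stars[2]
    neutral = list(by_stars[3])

    # Positive: 5-star reviews, down-sampled to match the negative count.
    # Uniform sampling is an assumption; the text only says "balancing".
    five_star = by_stars[5]
    rng = random.Random(seed)
    k = min(len(negative), len(five_star))
    positive = rng.sample(five_star, k)

    return {"negative": negative, "positive": positive, "neutral": neutral}
```

Note that 4-star reviews are left out of all three sets in this sketch, which follows the description above.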

Learning Patterns for Hyperbole
Our goal is to learn patterns that are highly associated with the extreme positive and negative reviews, and that exemplify strong, expressive language. To automatically learn such patterns, we use the AutoSlog-TS weakly-supervised extraction pattern learner (Riloff, 1996). AutoSlog-TS uses a set of syntactic templates to learn lexically-grounded patterns. AutoSlog does not require fine-grained labels on training data: all it requires is that the training data be divided into two distinct classes. Here, we run two separate AutoSlog experiments, one in which the classes are POSITIVE compared to NEUTRAL, and the other where the NEGATIVE class is compared to NEUTRAL. We hypothesize that in this way, we can surface the most commonly used patterns from each class that are not necessarily sentiment-related.
AutoSlog applies the Sundance shallow parser (Riloff and Phillips, 2004) to each sentence of each review, finds all possible matches for its syntactic templates, and then instantiates the syntactic templates with the words in the sentence to produce a specific lexico-syntactic expression. Most importantly, it uses the labels associated with the data to compute statistics for how frequently each pattern occurs in each class. Thus, for each pattern p, we learn P(POSITIVE/NEGATIVE | p), P(NEUTRAL | p), and the pattern's frequency. Table 4 shows examples of the patterns we learn and sample instantiations, with their respective frequency (F) and probabilities (P). In the pattern template column of Table 4, PassVP refers to passive voice verb phrases (VPs), ActVP refers to active voice VPs, InfVP refers to infinitive VPs, and AuxVP refers to VPs where the main verb is a form of to be or to have. Subjects (subj), direct objects (dobj), noun phrases (np), and possessives (genitives) can also be extracted by the patterns. Because we are particularly interested in descriptive patterns, we also use ngram pattern templates, AdjAdj, AdvAdj, AdvAdvAdj, as in related work (Oraby et al., 2015, 2016). Our goal is to find highly reliable patterns without sacrificing linguistic variation. Current statistical methods for training NLG engines typically eliminate linguistic variability by seeking to learn standard, more generic patterns that occur frequently in the data (Liu et al., 2016; Nayak et al., 2017). Since this phase of our work aims to vastly expand the amount of linguistic variation possible, we select instantiations that have a frequency of at least 3, and a probability of at least 0.75 association with the respective class (Oraby et al., 2015, 2016).
We hypothesize that patterns that occur at least 3 times should be fairly reliable, and those that have at least a 75% probability of being associated with the positive or negative class should [...]. We also observe that patterns learned using stricter thresholds (for example, frequency of at least 10 and probability of at least 0.9) also give us useful patterns, and note that we can use the frequencies and probabilities in our future ranking task. For larger coverage, we experiment with our less restrictive thresholds in the current work.
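The frequency and probability thresholds can be illustrated with a small sketch. It does not reproduce AutoSlog-TS or the Sundance parser. It assumes pattern occurrences have already been extracted as strings, takes a pattern's frequency to be its total count over both classes, and estimates P(class | p) as a relative frequency. The function names and the pattern string format are hypothetical.

```python
from collections import Counter


def pattern_stats(class_patterns, neutral_patterns):
    """Return {pattern: (frequency, P(class | pattern))}.

    Each argument is a list with one pattern string per occurrence,
    taken from the valenced class and the NEUTRAL class respectively.
    """
    class_counts = Counter(class_patterns)
    neutral_counts = Counter(neutral_patterns)
    stats = {}
    for pattern in set(class_counts) | set(neutral_counts):
        # Assumption: frequency is the total count across both classes.
        freq = class_counts[pattern] + neutral_counts[pattern]
        stats[pattern] = (freq, class_counts[pattern] / freq)
    return stats


def select_patterns(stats, min_freq=3, min_prob=0.75):
    """Keep patterns that meet both the frequency and probability thresholds."""
    return {
        pattern: (freq, prob)
        for pattern, (freq, prob) in stats.items()
        if freq >= min_freq and prob >= min_prob
    }
```

Calling `select_patterns(stats, min_freq=10, min_prob=0.9)` would correspond to the stricter setting mentioned above, and the retained `(frequency, probability)` pairs are the kind of quantity a later ranking step could use.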

Designing Review Templates
To make use of the descriptive adjective patterns we learned, we first need to identify what entities each of the patterns describes. To do this, we aggregate lexicons for each of five important restaurant entities: restaurant-type, cuisine, food, service, and staff, using Wikipedia and DBpedia. We end up with 14 items for restaurant-types (e.g. "cafe"), 45 for cuisines (e.g. "Italian"), 4,913 for foods and ingredients (e.g. "sushi"), 12 for staff (e.g. "waiters"), and 2 for service (e.g. "customer service").
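Lexicon-based entity identification and delexicalization might look like the sketch below. The tiny lexicons are illustrative stand-ins for the Wikipedia and DBpedia lists, and the regex matching strategy, the function names, and the placeholder format (modeled on the "<FOOD ENTITY>" slots that appear in the templates) are assumptions rather than the authors' implementation.

```python
import re

# Tiny illustrative lexicons; the real ones are much larger.
LEXICONS = {
    "RESTAURANT": ["cafe", "bar"],
    "CUISINE": ["italian", "japanese"],
    "FOOD": ["sushi", "udon", "pizza"],
    "STAFF": ["waiters", "hosts"],
    "SERVICE": ["customer service"],
}


def build_matcher(lexicons):
    """Compile one regex over all lexicon terms and map terms to entity types."""
    term_to_type = {
        term.lower(): etype for etype, terms in lexicons.items() for term in terms
    }
    # Longest terms first so multi-word entries win over shorter ones.
    terms = sorted(term_to_type, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    return pattern, term_to_type


def delexicalize(sentence, pattern, term_to_type):
    """Replace each lexicon match with an entity slot; return text and types found."""
    found = []

    def replace(match):
        etype = term_to_type[match.group(1).lower()]
        found.append(etype)
        return "<" + etype + " ENTITY>"

    return pattern.sub(replace, sentence), found
```

The list of entity types returned alongside the delexicalized sentence makes it easy to check later whether a sentence mentions exactly one entity type.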

Basic Templates
To construct the most basic set of templates, we use simple relationships between adjectives and the entities they describe to define a set of sentences with entity slots, i.e. "They had [adj] (entity).", "The (entity) was|is [adj].", "The (entity) looked|tasted [adj]." We use basic lists of adjectives commonly found in reviews for these baseline templates. To vary the templates, we alternate between using only simple sentences, and sometimes combine related entities into more complex sentences (e.g. service and staff, or restaurant-type and cuisine).
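A minimal sketch of how such slot templates could be realized is given below. The template strings are the ones quoted above; the `realize` helper and its handling of the `a|b` verb alternatives are assumptions made for illustration.

```python
import random

BASIC_TEMPLATES = [
    "They had [adj] (entity).",
    "The (entity) was|is [adj].",
    "The (entity) looked|tasted [adj].",
]


def realize(template, entity, adj, rng):
    """Fill the entity and adjective slots of a basic template."""
    words = []
    for token in template.split():
        # A token such as "was|is" lists alternatives; pick one at random.
        if "|" in token:
            token = rng.choice(token.split("|"))
        words.append(token)
    sentence = " ".join(words)
    return sentence.replace("(entity)", entity).replace("[adj]", adj)
```

For example, `realize(BASIC_TEMPLATES[0], "customer service", "reliable", random.Random(0))` yields "They had reliable customer service."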

Hyperbolic Templates
For our hyperbolic templates, we replace the standard adjectives in the basic templates with adjective patterns learned from the restaurant reviews. To select appropriate adjective patterns for replacement in each basic template, we first delexicalize the sentences that instantiate our learned adjective patterns for each class, and create sets of (entity, adjective pattern) pairs based on the relationship between the adjective and the entity ("is", "was", "tasted", etc.), as above. Using this method, we collect 37 restaurant, 30 cuisine, 247 food, 45 service, and 56 staff patterns for positive and 18 restaurant, 9 cuisine, 221 food, 75 service, and 61 staff patterns for negative. Table 5 shows example patterns in each class for the food and staff entity types.
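As a rough sketch, the substitution step can be viewed as a lookup from (entity type, polarity) to learned adjective patterns, followed by slot filling. The pattern table below is a hypothetical toy example, and `hyperbolize` is an illustrative helper, not the authors' code. It ignores the verb relationship ("is", "was", "tasted") that the full method also conditions on.

```python
import random

# Hypothetical learned adjective patterns, keyed by (entity type, polarity).
HYPERBOLIC_PATTERNS = {
    ("FOOD", "positive"): ["so delicious", "absolutely amazing"],
    ("STAFF", "positive"): ["also very friendly"],
    ("FOOD", "negative"): ["incredibly remarkably terrible"],
}


def hyperbolize(basic_template, entity, entity_type, polarity, rng):
    """Fill a basic template using a learned adjective pattern as the adjective."""
    adjective_pattern = rng.choice(HYPERBOLIC_PATTERNS[(entity_type, polarity)])
    return basic_template.replace("(entity)", entity).replace("[adj]", adjective_pattern)
```

With the toy table above, `hyperbolize("The (entity) is [adj].", "hosts", "STAFF", "positive", random.Random(0))` gives "The hosts is also very friendly."; the agreement error is kept deliberately, since this sketch performs no grammatical repair.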

Creative Templates
Finally, for our creative templates, we sample from our set of delexicalized sentences for each entity type, as long as they:
• contain a single AutoSlog adjective pattern
• contain a single identifiable entity type
• are between 5 and 15 words long
We enforce these limitations to gather simple sentences that are short enough to templatize. Thus, we end up with sentence templates for each entity type for both the positive and negative classes, collecting 146 restaurant, 61 cuisine, 743 food, 90 service, and 144 staff patterns for positive and 45 restaurant, 12 cuisine, 480 food, 126 service, and 89 staff patterns for negative. Table 6 shows examples of our templatized sentences for the positive and negative classes, with their AutoSlog-TS adjective patterns between brackets, and capitalized subject extractions when applicable. To construct a full review of a certain polarity, we randomly select a sentence from the sets for each entity type.
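The three selection constraints and the random assembly of a review can be sketched as below. Counting words by whitespace, the ordering of entity types in the assembled review, and the names `keep_sentence` and `build_review` are assumptions made for this illustration.

```python
import random

# Assumed ordering of entity types within an assembled review.
ENTITY_TYPES = ["restaurant", "cuisine", "food", "staff", "service"]


def keep_sentence(sentence, n_adjective_patterns, entity_types):
    """Apply the three constraints for creative template candidates."""
    # Assumption: sentence length is measured in whitespace-separated tokens.
    n_words = len(sentence.split())
    return (
        n_adjective_patterns == 1
        and len(set(entity_types)) == 1
        and 5 <= n_words <= 15
    )


def build_review(templates_by_type, rng):
    """Pick one sentence template per entity type and join them into a review."""
    return " ".join(rng.choice(templates_by_type[t]) for t in ENTITY_TYPES)
```

Because every sentence is drawn independently, nothing in this sketch enforces coherence between consecutive sentences.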
We hypothesized that the creative templates would optimize stylistic variation and hence interestingness, but that they would also include cases that would require further refinement, or perhaps elimination by a subsequent ranking phase. Since our focus here is on overgeneration, we include these and evaluate their quality. Table 7 shows examples of each template type we create.

Table 5: Sample Learned Adjective Patterns

Evaluating Template Styles
In order to evaluate our template variations, we choose to focus on three particular criteria: convincingness, interestingness, and naturalness. We evaluate convincingness because creative language such as hyperbole is often used in persuasive language, along with other figurative forms (Kreuz and Roberts, 1995). Naturalness is an important concern in generation, so we are also interested in the comparison between the perceived naturalness of each variation style, and we hypothesize that interestingness would increase as we used [...]

Table 7 (examples of each template type):
BASIC: The bar is beautiful. They had authentic japanese cuisine. The udon looked excellent. The hosts is dedicated. They had reliable customer service.
HYPERBOLIC: The bar is also very fresh. They had delicious authentic japanese cuisine. The udon looked so delicious. The hosts is also very friendly. They had such amazing customer service.
CREATIVE: This is by far my favorite bar in town. plus there is a great japanese cuisine grocery store that has tons of stuff. The udon is always fresh, delicious and made to order. Hosts was super friendly, looking forward to coming back and trying more items. The customer service is great and the employees are always super nice!

Our objective is to evaluate whether we can improve upon vanilla-style hand-crafted templates for restaurant reviews by utilizing hyperbolic and creative elements of the organic reviews that we harvest. We set up an annotation experiment on Amazon Mechanical Turk, where each Human Intelligence Task (HIT) presents Turkers with a sample of our three review variations, all of the same polarity and instantiated with the same entities. Turkers are asked to judge the reviews based on three criteria: convincingness (Do you believe the opinion given?), interestingness (Is the review engaging?), and naturalness (Is the review coherent?). Turkers are asked to rate each review on a three-point scale (high, medium, and low) for each criterion. We release 200 variation triples (100 per polarity class) and ask for five judgements per HIT, tagging a review with a quality if the majority of annotators agree on it (i.e. 3 or more Turkers). Average agreement for individual Turkers with the majority is above 73%.

Figure 1 shows the distribution of high, medium, and low scores for each of the variation types for each criterion. From the results, we observe that for all criteria, the CREATIVE class has the largest share of high majority votes. Interestingly, although we hypothesized that the HYPERBOLIC reviews would be better received than the BASIC reviews, we observe that in fact the BASIC reviews receive more high votes on convincingness. We note that for the future ranking, more context information is necessary when selecting appropriate hyperbolic patterns with which to modify the BASIC reviews.
For example, if a learned pattern is OTHER AMAZING, the pattern should only be used when a set of items is being described, and not stand-alone. Similarly, the BASIC reviews are also more natural than the HYPERBOLIC ones, although both variation types score very similar percentages for medium scores.
For the creative reviews, a crucial next step for ranking is to consider context and develop heuristics for finding the most appropriate entities for lexicalization. For example, for very specific creative templates such as: "I also got one that HAD NOT been separated , so it was [JUST HALF] of a <FOOD ENTITY> .", or "The <FOOD ENTITY> were similarly a mix of nearly raw to overly crisp.", it is necessary to select food items similar to the original instantiations, or to characterize and classify entities based on specific properties.
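The majority-vote aggregation used in the annotation study (five judgements per review, a label assigned when 3 or more Turkers agree) can be sketched as follows. The agreement measure shown pools all individual judgements on items that have a majority label; how the reported agreement figure was actually computed is not specified, so that part is an assumption, as are the function names.

```python
from collections import Counter


def majority_label(judgements, min_votes=3):
    """Return the label chosen by at least `min_votes` annotators, else None."""
    label, count = Counter(judgements).most_common(1)[0]
    return label if count >= min_votes else None


def agreement_with_majority(all_judgements):
    """Fraction of individual judgements that match their item's majority label.

    Items without a majority label are skipped (an assumption).
    """
    agree = 0
    total = 0
    for judgements in all_judgements:
        majority = majority_label(judgements)
        if majority is None:
            continue
        agree += sum(1 for j in judgements if j == majority)
        total += len(judgements)
    return agree / total if total else 0.0
```

A review whose five judgements split 2-2-1 receives no label under this rule, so the per-criterion distributions are computed only over reviews with a clear majority.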

Figure 1: Distribution of Template Variations by Evaluation Criteria
Given the high appeal of the CREATIVE reviews on all counts, we are interested in more closely exploring examples in the data. Table 8 shows two examples of CREATIVE reviews: one that received high scores on all criteria, and one that received mostly low scores (no creative review received all lows). It is clear that the biggest disconnect in the low-scoring creative review is the coherence between sentences, which is an important next step to consider as future work given the proof-of-concept presented here. We also note that we can improve the fully high-scoring review by fixing grammatical errors and applying more informed content selection.
To get a better sense of how grammatically correct the review template variations are, we conduct another evaluation study where we present Turkers with the same set of reviews, and ask them to rate each review based on the content (checking subject-verb agreement, plurality, tense, etc.). Similar to the previous study, we gather 5 judgements for each set of three variations, and aggregate results using majority vote. Average agreement for individual Turkers with the majority in this task is above 80%, higher than the more subjective [...]. Figure 2 shows the results of the study. We find that for all three variations, the medium class receives the majority of the votes, but that the BASIC reviews are the most grammatically correct (since the templates are designed, not harvested). Similarly, the HYPERBOLIC reviews have the largest percentage of low scores, since their creation involves modifying templates with learned adjectives. Ranking the best patterns/sentences to use will allow us to improve the grammatical coherence of the templatized utterances for the HYPERBOLIC and CREATIVE classes.
To better understand the linguistic characteristics of the creative reviews by class, we run the Linguistic Inquiry and Word Count (LIWC) tool (Pennebaker et al., 2015) on the full set of 100 POSITIVE and 100 NEGATIVE creative reviews. When comparing the linguistic categories for each class, we find that the differences between the POSITIVE and NEGATIVE reviews are significant (p < 0.05, t-test) for many of the categories. Table 9 shows some of the most interesting categories.
Table 9: Statistically Significant LIWC Categories by Polarity

On average, the POSITIVE templates are characterized by word classes that exemplify achievement (e.g. "even better", "champion") and certainty (e.g. "always excellent", "absolutely amazing", and "definitely my go-to place"). They also feature first-person statements relating to use of the senses: affective processes ("my favorite place to get rice in Las Vegas!"), biological processes ("I just had the most amazingly delicious and freshly prepared couscous!"), and ingestion ("good, tasty comfort pizza").
The NEGATIVE templates contain more oppositional language directed at the second person, often as advice ("you can get a much better pizza elsewhere at far less cost."), with categories like differentiation ("but it's not great"), and strong emotion indicators like anxiety ("horrible service, finally just left") and anger ("I was so angry that I contacted the restaurant manager").

Conclusions
In this paper, we show that we can construct convincing, interesting, and natural restaurant review templates by using a data-driven method to harvest highly descriptive sentences from hyperbolic restaurant reviews. We generate three variations of review templates, ranging from very basic, to hyperbolic, to very creative, and show that the creative ones are more appealing to readers than the others. Future work will focus on ranking the candidate sentence templates we harvest to improve review coherence. As we develop better templates, we will evaluate them against baselines from existing NLG systems to guide our generation of more exciting and expressive stylistically varied reviews.