Generating Summaries of Sets of Consumer Products: Learning from Experiments

We explored the task of creating a textual summary describing a large set of objects characterised by a small number of features using an e-commerce dataset. When a set of consumer products is large and varied, it can be difficult for a consumer to understand how the products in the set differ; consequently, it can be challenging to choose the most suitable product from the set. To assist consumers, we generated high-level summaries of product sets. Two generation algorithms are presented, discussed, and evaluated with human users. Our evaluation results suggest a positive contribution to consumers’ understanding of the domain.


Introduction
When presented with a large amount of data in tabular form, an additional textual summary could aid a reader's comprehension of the otherwise overwhelming information at hand. The task of automatically creating a summary from numerical data is an ongoing research area within Natural Language Generation (NLG). We explored this task in the context of generating a textual summary describing a large set of objects [products] from a large database, where each object is characterised by several product features.
Product set overviews can be written by hand if the category is known beforehand. For example, manually written product reviews often start with an overview paragraph that discusses a wider set of products of which the product is a member. However, when a consumer searches for products with keywords or through filters (e.g. on an ecommerce website), an overview of the returned set of search results would have to be automatically generated.
In this paper, we test the hypothesis that automatically generated textual summaries can be of benefit to customers. This can be seen as a specific instance of Shneiderman's Visual Information Seeking mantra (Shneiderman, 1996) of "Overview first, zoom and filter, then details-on-demand". One of the main ideas presented there is that it is beneficial for a reader to be exposed to an overview of the information before diving into specific details of interest.
There has been related NLG research about sets of objects, although with different goals or focuses: for example, referring to or identifying a set of objects within a larger set (Van Deemter, 2002), performing a data-to-text analysis of tabularised data by records, generating a page title for set items with shared characteristics from existing metadata (Mathur et al., 2017), or addressing the issue of missing data encountered in summarisation (Inglis et al., 2017). In contrast, our work explores summaries that describe commonalities and differences within a set in order to help a user make informed decisions when selecting an object from the set. Our work focuses particularly on the Content Determination step in the NLG pipeline (Reiter and Dale, 2000), including selecting the features and values to be presented.

Analysis of Human-Written Reviews

To observe how reviewers describe sets of products, we analysed the 30 top-ranked pages that contained a list of TVs (not just one single product). We then defined a per-clause tagging scheme to identify aspects that could be generated from product specifications and to systematically observe how reviewers described sets of products. In our scheme, a clause could have multiple tags. One annotator (the first author) carried out the tagging. Our findings are summarised below.
Feature Selection: We analysed how often each product feature was mentioned in the reviews. As shown in Table 1, we found that, besides price, the most frequent features (in descending order) are screen size, resolution, smart/internet feature, brand, backlight technology, ports, and contrast.

[Table 1: frequency with which each product feature is mentioned in the reviews.]

Price Description: The product price is typically mentioned in the reviews only vaguely, using terms like "desirable price", "cheap", "expensive" or "premium". The description is vague even when numbers are involved, e.g. "around £300". When a crisp description is used, it most often states either a starting point, e.g. "you can get a 1080p TV starting at £270", or a maximum, e.g. "Discover the best 32 inch Smart TVs under £300 here".
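The two crisp forms observed in the corpus (a starting point and an "under £X" cap) are easy to derive from a product set's prices. A minimal sketch of this pattern (the function name and the £10 rounding granularity are our own assumptions, not from the paper):

```python
def crisp_price_phrases(prices):
    """Produce the two crisp price forms observed in the reviews:
    a starting point ("starting at £X") and a rounded upper bound
    ("under £Y"). Rounding the cap up to the next £10 is our assumption."""
    start = min(prices)
    # Round the maximum up to the next multiple of £10 for an "under £Y" phrase.
    cap = ((max(prices) // 10) + 1) * 10
    return f"starting at £{start}", f"under £{cap}"
```

For example, a set of 1080p TVs priced at £270, £289 and £299 would yield "starting at £270" and "under £300", mirroring the review phrasings quoted above.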
Description of a Set of Items: In a review, usually only a small number of sentences explicitly describe the set as a whole, for example "Most 32-inch TVs these days are labeled as HD Ready". When they do, they use quantifying words like "most". Numbers are described vaguely, e.g. weight is mentioned as "light" and response time as either "fast" or "slow". Some features, for example screen size, are mentioned both as exact numbers and as vague descriptions.
Price-Features Relationship: The relationship with price is used as a secondary justification for features that the reviewers already consider important, for example, "A TV with a 1920 × 1080 resolution [is] not even that much more expensive" or "good image quality and available smart features [...] carry a price premium." Based on this analysis, we decided that our summaries should describe the shape of the price curve, the important features, and the effect of these features on price.
A large part of the reviews gathered included domain knowledge, for example, descriptions of technical terms and other insights. This part of the reviews clearly could not be produced from the specification table. There were also mentions of features that can be derived, non-trivially, from the table, e.g. picture quality (which can be based on columns such as resolution and contrast).
The Algorithms

Alg1: Summarising a Set of Products

In our previous work (Kuptavanich, 2018), we presented an algorithm (called Alg1 here) that generates summaries consisting of (a) the shape of the price curve, (b) common features within the set, and (c) features that influence price (Figure 1 gives an example of the generated text). The algorithm mainly used the influence of a feature on the product price to determine its importance. Alg1 only included content that could be generated from descriptions of the items in the set being summarised. Following our analysis of the hand-written reviews, we adapted the algorithm. The resulting Alg2 allows sets to be created dynamically through feature filters, and contextualises these sets with respect to the unfiltered wider set, as described below.
Shape of the Price Curve: Alg1 reports the median price and the price range of the set.
Alg2 additionally compares the median price of the filtered set against the median price of the wider category. For instance, the first three lines of Figure 3 show a situation where the user has filtered the set of TVs to those that are 40-59 inches with 4K ultra high definition. The underlined portion is generated only by Alg2.
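The price-curve description in both algorithms reduces to simple order statistics over the set's prices. A minimal sketch of this step, assuming hypothetical function and wording choices (the paper does not publish its implementation):

```python
from statistics import median

def price_curve_summary(filtered_prices, all_prices=None):
    """Describe the price range and median of a product set (as in Alg1),
    optionally contrasting it with the wider category (as in Alg2)."""
    lo, hi = min(filtered_prices), max(filtered_prices)
    med = median(filtered_prices)
    text = (f"Prices range from £{lo:.0f} to £{hi:.0f}, "
            f"with a median of £{med:.0f}.")
    if all_prices is not None:
        # Alg2: contextualise the filtered set against the unfiltered wider set.
        wider_med = median(all_prices)
        direction = "higher" if med > wider_med else "lower"
        text += (f" That median is {direction} than the median of "
                 f"£{wider_med:.0f} across the whole category.")
    return text
```

Calling the function without `all_prices` corresponds to Alg1's output; supplying the wider category's prices produces the additional comparison sentence that only Alg2 generates.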

Description of Important Features:
In the TV domain, the following features occurred most frequently: display size, display resolution, smart/internet feature, supported content service, brand, display technology, connectivity technology (ports) and HDR. We therefore focussed on these features, but generated more detail about them than in Alg1. The description of each feature consisted of two parts. The first used quantifiers to describe the common values for the feature within the set. The second compared the median price of products with the said feature values against the median price of general products in this category and reported feature values that impacted on price (Figure 2).
(a) Quantifiers. For numeric features, such as weight, we report them in the same fashion as the price (i.e. range and median value). Otherwise, we use the quantifiers "most" (more than 50%), "a large proportion" (more than 25%) and "some" (more than 10%).
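The quantifier thresholds above amount to a simple proportion-to-phrase mapping. A sketch under one stated assumption: the paper does not say how values below 10% are phrased, so here they are simply omitted.

```python
def quantifier(proportion):
    """Map the share of products holding a feature value to a quantifier,
    using the thresholds from the paper: "most" (> 50%), "a large
    proportion" (> 25%), "some" (> 10%). Returning None for rarer values
    (i.e. omitting them) is our assumption."""
    if proportion > 0.50:
        return "most"
    if proportion > 0.25:
        return "a large proportion"
    if proportion > 0.10:
        return "some"
    return None  # too rare to mention in the summary
```

So if 60% of the filtered TVs are "HD Ready", the first part of the feature description would open with "most".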
(b) Comparatives and Qualifiers. In the second part, we also use phrases such as "more expensive", "less expensive" or "about the same price" (when the difference is less than 5%). If the difference falls between 5-10%, we qualify it with the word "slightly". This generates text such as "TVs with Smart-Internet Feature are generally slightly more expensive (£475 vs £450)." (Figure 3, in the next section). The process from document structuring through realisation was carried out using a template/schemata approach (McKeown, 1985). The tone of the discourse is primarily to provide a factual product overview, without trying to be persuasive. Both algorithms were implemented using the Jinja2 template engine.
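The comparative phrasing rule above can be sketched as a small function over the two medians; names are hypothetical, but the 5% and 10% thresholds follow the paper:

```python
def price_comparison(feature_median, category_median):
    """Phrase a median-price comparison, as in the second part of a
    feature description: differences under 5% read as "about the same
    price"; differences between 5% and 10% are softened with "slightly"."""
    diff = (feature_median - category_median) / category_median
    if abs(diff) < 0.05:
        return "about the same price"
    phrase = "more expensive" if diff > 0 else "less expensive"
    if abs(diff) <= 0.10:
        phrase = "slightly " + phrase
    return phrase
```

Applied to the paper's own example, £475 against a category median of £450 is a difference of about 5.6%, which yields "slightly more expensive", matching the generated sentence quoted above.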

Evaluation Experiment
Our previous work (Kuptavanich, 2018) revealed difficulties in designing a task that reflected real consumer behaviour in task-based experiments, but showed promising results with human ratings. We therefore decided to focus only on a human-ratings evaluation. The scenario of interest is a consumer searching for products on an e-commerce website. Our laboratory human-ratings evaluation had three goals. First, we wanted to find out whether the text summaries generated by Alg2 were preferred over those generated by Alg1, and also over the static introductory text provided on the e-commerce site. Second, we asked the participants to identify parts of the summaries that were useful, parts that were unnecessary, and what they wanted to see added. Third, we wanted to find out which product features are important in the decision-making process.

Method
Materials: We scraped TV product data from Amazon UK during May-June 2018, obtaining 1,478 products. We used this database to generate the summaries using both Alg1 and Alg2. As our baseline, we used Amazon's static text provided on their TV browsing page. An excerpt is shown in Figure 4; the full text can be found on the Amazon UK TVs page.
We used two product search scenarios on Amazon UK, based on its search filters. Each scenario produced a different set of search results and thus generated different summaries for Alg1 and Alg2.
Participants: Participants were 18 graduate students in the Computing Science and Chemistry departments of the University of Aberdeen, recruited through the departments' internal student mailing lists.
Design and Procedure: There were two pre-determined product search scenarios:
• Scenario 1: 40-59 inch TVs with Ultra HD
• Scenario 2: TVs of any size that are smart TVs
First, the summaries [amz], [alg1], and [alg2] were presented in random order. To ensure that participants engaged with the task, each participant was asked to select one product. They were then asked [Q1]-[Q3], concerning which parts of the summaries were useful, what should be added, and the usefulness of the Amazon text. After the second scenario was completed, we asked the participants to select three products they liked. Then we asked:
[Q4]: "When buying a TV, which feature do you think is most important?"
[Q5]: "What information do you think should be in a summary?"
[Q6]: "What kind of summary would help you choose a good TV?"
For [Q4], participants could choose from a list with the following choices: price, screen size, supported content service, smart/internet, resolution, Freeview, connectivity (ports), and could also specify their own features.

Results

Free Text Answers: Many responses (7 in total) asked for the summary to be short and precise, or even bulleted. To [Q1], most participants found the price range and the relationship between price and features useful, which was supported by the ranking data. For [Q2], participants wanted product ratings and other features, e.g. display frequency, model year or warranty, added to the summary. They also wanted some explanation of the technical terms and/or specifications (e.g. what a smart TV is and what it can do). To [Q3], most participants did not find the Amazon summary useful and thought that it was not necessary. To [Q4], participants emphasised price (14 counts), screen size (11 counts), resolution (10 counts) and smart/internet feature (8 counts) when buying a TV. To [Q5] and [Q6], participants thought that the features and their descriptions (including terminology explanations), how the features impact the price, user reviews, and information about warranty make a good summary.

Discussion and Future Work
Generalisation of our findings, which were based on only a very small set of scenarios, is tricky: we do not know whether they extend to different kinds of products (e.g., groceries or paintings) or to product sets of different sizes (e.g. a set of just 3 products). However, our results suggest that customers find high-level product set summaries of the type we investigated more useful than Amazon's static product category overviews. This was further confirmed by the free text question, where many participants quoted substantial parts of the [alg2] summary as being useful.
In future, we aim to experiment with refinements and extensions of [alg2]. For instance, to extend the algorithm to other product domains, the analysis of hand-written reviews would have to be automated.
Additionally, based on participants' comments, technical information (as canned text) could be included in the summary. Since a number of readers pointed out that the summaries generated by [alg2] were too lengthy, a future version of the summary could be shortened (e.g., by omitting price comparisons in some cases). Some comments proposed that the summary should group together the features that differentiate the products, separately from those the products have in common; this, too, is a promising feature to experiment with next.
In addition, to more closely mimic human-written text, approaches to reducing repetition in the generated text could be considered.
Finally, a more seamless integration of the summary into e-commerce websites could also be considered, perhaps as a browser extension or a website wrapper.