Nutri-bullets Hybrid: Consensual Multi-document Summarization

We present a method for generating comparative summaries that highlight similarities and contradictions in input documents. The key challenge in creating such summaries is the lack of large parallel training data required for training typical summarization systems. To this end, we introduce a hybrid generation approach inspired by traditional concept-to-text systems. To enable accurate comparison between different sources, the model first learns to extract pertinent relations from input documents. The content planning component uses deterministic operators to aggregate these relations after identifying a subset for inclusion into a summary. The surface realization component lexicalizes this information using a text-infilling language model. By separately modeling content selection and realization, we can effectively train them with limited annotations. We implemented and tested the model in the domain of nutrition and health – rife with inconsistencies. Compared to conventional methods, our framework leads to more faithful, relevant and aggregation-sensitive summarization – while being equally fluent.


Introduction
Articles written about the same topic rarely exhibit full agreement. To present an unbiased overview of such material, a summary has to identify points of consensus and highlight contradictions. For instance, in the healthcare domain, where studies often exhibit wide divergence of findings, such comparative summaries are generated by human experts for the benefit of the general public. Ideally, this capacity will be automated, given the large number of relevant articles and the continuous influx of new ones that require a summary update to keep it current. However, standard summarization architectures cannot be utilized for this task since the number of available comparative summaries is not sufficient for their training.
Figure 1: Pubmed studies on Pears and Cancer. The key facts (bold) and consensus (contradiction) are realized in the text generated by our model.
In this paper, we propose a novel approach to multi-document summarization based on a neural interpretation of traditional concept-to-text generation systems. Specifically, our work is inspired by the symbolic multi-document summarization system of Radev and McKeown (1998), which produces summaries that explicitly highlight agreements, contradictions and other relations across input documents. While their system was based on human-crafted templates and thus limited to a narrow domain, our approach learns different components of the generation pipeline from data.
To fully control generated content, we frame the task of comparative summarization as concept-to-text generation. As a pre-processing step, we extract pertinent entity pairs and relations (see Figure 1) from input documents. The content selection component identifies the key tuples to be presented in the final output and establishes their comparative relations (e.g., consensus) via aggregation operators. Finally, the surface realization component utilizes a text-infilling language model to translate these relations into a summary. Figure 1 exemplifies this pipeline, showing selected key pairs (marked in bold), their comparative relation, Contradiction (rows 1 & 3 and rows 4 & 5 conflict), and the final summary (see footnote 3). This generation architecture supports refined control over the summary content, but at the same time does not require large amounts of parallel data for training. The latter is achieved by separately training the content selection and content realization components. Since the content selection component operates over relational tuples, it can be robustly trained to identify salient relations utilizing limited parallel data. Aggregation operators are implemented using simple deterministic rules over the database, where comparative relations between different rows are apparent. On the other hand, to achieve a fluent summary we have to train a language model on large amounts of data, but such data is readily available.
In addition to training benefits, this hybrid architecture enables human writers to explicitly guide content selection. This can be achieved by defining new aggregation operators and including new inference rules into the content selection component. Moreover, this architecture can flexibly support other summarization tasks, such as generation of updates when new information on the topic becomes available.
We apply our method to generating summaries of Pubmed publications on nutrition and health. Typically, a single topic in this domain is covered by multiple studies that often vary in their findings, making the domain particularly appropriate for our model. We perform extensive automatic and human evaluation to compare our method against state-of-the-art summarization and text generation techniques. While seq2seq models receive competent fluency scores, our method performs better on task-specific metrics including relevance, content faithfulness and aggregation cognisance. Compared to the next best baseline in the traditional and update settings, our method produces summaries that score an absolute 20% higher on aggregation cognisance, 7% higher on content relevance and 7% higher on faithfulness to input documents.
Footnote 3: We compare the selected content with other entries in the database, identifying two contradictions.

Related Work
Text-to-text Summarization: Neural sequence-to-sequence models (Rush et al., 2015; Cheng and Lapata, 2016; See et al., 2017) for document summarization have shown promise and have been adapted successfully for multi-document summarization (Lebanoff et al., 2018; Baumel et al., 2018; Amplayo and Lapata, 2019; Fabbri et al., 2019). Despite producing fluent text, these techniques may generate false information which is not faithful to the original inputs (Puduppully et al., 2019; Kryściński et al., 2019), especially in low-resource scenarios. In this work, we are interested in producing faithful and fluent text cognizant of aggregation amongst input documents, where few parallel examples are available.
Recent language modeling approaches (Devlin et al., 2018; Stern et al., 2019; Shen et al., 2020; Donahue et al., 2020) can also be extended for text completion. Our work uses a text-infilling language model in which we generate words in place of relation-specific blanks to produce a faithful summary.
Prior work on text generation (Mueller et al., 2017; Fan et al., 2017; Guu et al., 2018) also controls aspects of the produced text, such as style and length. While these approaches typically utilize tokens to control the modification, using prototypes to generate text is also very common (Guu et al., 2017; Li, 2018; Shah et al., 2019). In this work, we utilize aggregation-specific prototypes to guide aggregation-cognizant surface realization.
Data-to-text Summarization: Traditional approaches for data-to-text generation have operated on symbolic data from databases. McKeown and Radev (1995), Radev and McKeown (1998) and Barzilay et al. (1998) introduce two components: content selection and surface realization. Content selection identifies and aggregates key symbolic data from the database, which can then be realized into text using templates. Unlike modern data-to-text systems (Wiseman et al., 2018; Puduppully et al., 2019; Sharma et al., 2019; Wenbo et al., 2019), these approaches capture document consensus and aggregation cognisance. While the neural approaches alleviate the need for human intervention, they do need an abundance of parallel data, which are typically from one source only. Hence, modern techniques do not deal with input documents' consensus in low-resource settings.
Figure 2: Illustrating the flow of our Nutribullets Hybrid system. In this example, our model takes in four Pubmed studies to produce a database (a). The Content Selection model selects two tuples (bold) and identifies the aggregation operator as Contradiction (b). Finally, the Surface Realization model takes in the tuples and aggregation operator to produce a summary which is faithful to input entities and aggregation cognizant (c).

Method
Our goal is to generate a text summary y for a food from a pool of multiple scientific abstracts X. In this section, we describe the framework of our Nutribullets Hybrid system, illustrated in Figure 2.

Overview
We obtain food-health entity-entity relations, for both the input documents X and the summary y, from entity extraction and relation classification modules trained on corresponding annotations (Table 2). Notations: For N input documents, we collect $X_G = \{G^x_p\}_{p=1}^{N}$, a database of entity-entity relations $G^x_p$. $G_p = \{(e^k_1, e^k_2, r^k)\}_{k=1}^{K}$ is a set of K tuples of two entities $e_1$, $e_2$ and their relation r. Here r represents relations such as the effect of a nutrition entity $e_1$ on a condition $e_2$ (see Table 2; see also footnote 4). We thus have raw text converted into symbolic data.
Similarly, we denote the corpus of summaries as $Y = \{(y_m, G^y_m, O^y_m)\}_{m=1}^{M}$, where $y_m$ is a concise summary, $G^y_m$ is the set of entity-entity relation tuples and $O^y_m$ is the realized aggregation, in M data points.
Footnote 4: We train an entity tagger and relation classifier to predict G and also for computing knowledge based evaluation scores. More details on models and results are shared later.
Modeling: Joint learning of content selection, information aggregation and text generation for multi-document summarization can be challenging. This is further exacerbated in our technical domain, with few parallel examples and varied consensus amongst input documents. To this end, we propose a solution using Content Selection and Aggregation and Surface Realization models.
Raw text from N input documents is converted into a mini-database $X_G$ of relation tuples. The content selection and aggregation model operates on such symbolic data. We use $X_G$ and Y to train the content selection model. During inference, we identify from $X_G$ a subset C of content to present in the final output. In order to produce a summary cognizant of consensus amongst inputs, we identify the aggregation operator O based on C and other relevant tuples in $X_G$. The surface realization model produces a relevant, faithful and aggregation-cognizant output. The model is trained only using Y. During inference, the model realizes text using the selected content C and the aggregation operator O.
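To make the symbolic representation concrete, the following is a minimal Python sketch of a mini-database of relation tuples, one list per input document. The class name `RelationTuple`, the helper `flatten` and the example rows are illustrative assumptions rather than part of the described system; the relation labels are borrowed from the pears example used elsewhere in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RelationTuple:
    """One (e1, e2, r) row: a nutrition entity, a condition, and their relation."""
    e1: str
    e2: str
    relation: str


def flatten(database: List[List[RelationTuple]]) -> List[RelationTuple]:
    """Merge the per-document tuple sets G_p into one list of candidate rows."""
    return [row for document in database for row in document]


# Hypothetical mini-database X_G built from two input documents.
x_g = [
    [RelationTuple("pears", "ovarian cancer", "controls")],
    [RelationTuple("pears", "breast cancer", "decreases"),
     RelationTuple("pears", "ovarian cancer", "controls")],
]

rows = flatten(x_g)
print(len(rows), "candidate tuples")
```

Keeping the per-document grouping is a design choice of this sketch: it preserves which study each tuple came from, which matters later when comparing rows across documents.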

Content Selection and Aggregation
Our content selection model takes a mini-database of entity-entity relation tuples $X_G$ as input, and outputs the key tuples C and the aggregation operator O.
Content selection and aggregation consists of two parts: (i) identifying key content $P(C|X_G)$ and (ii) subsequently identifying the aggregation operator O using C and $X_G$.
Content Selection: Identifying key content involves selecting important, diverse and representative tuples from a database. While clustering and selecting from the database tuples is a possible solution, we model our content selection as a finite Markov decision process (MDP). This allows for an exploration of different tuple combinations while incorporating delayed feedback from various critical sources of supervision (similarity with target tuples, diversity amongst selected tuples, etc.). We consider a multi-objective reinforcement learning algorithm (Williams, 1992) to train the model. Our rewards (Eq. 2) allow for the selection of informative and diverse relation tuples.
The MDP state at step t is $s_t = (t, \{c_1, \ldots, c_t\}, \{z_1, z_2, \ldots, z_{m-t}\})$, where t is the current step, $\{c_1, \ldots, c_t\}$ is the content selected so far and $\{z_1, z_2, \ldots, z_{m-t}\}$ is the set of remaining entity-entity relation tuples in the m-sized database. The action space is all the remaining tuples plus one special token, $Z \cup \{\text{STOP}\}$ (see footnote 5). The number of actions is equal to $|m - t| + 1$. As the number of actions is variable yet finite, we parameterize the policy $\pi_\theta(a|s_t)$ with a model f which maps each action and state $(a, s_t)$ to a score, in turn allowing a probability distribution over all possible actions using softmax. At each step, the probability that the policy selects $z_i$ as a candidate is:
$$\pi_\theta(a = z_i \mid s_t) = \frac{\exp\big(f(t, \hat{z}_i, \hat{c}_{i^*})\big)}{\sum_{j} \exp\big(f(t, \hat{z}_j, \hat{c}_{j^*})\big)}$$
where the sum runs over the available actions, $c_{i^*} = \arg\max_{c_j} \cos(\hat{z}_i, \hat{c}_j)$ is the selected content closest to $z_i$, $\hat{z}_i$ and $\hat{c}_{i^*}$ are the encoded dense vectors, $\cos(u, v) = \frac{u \cdot v}{\|u\| \cdot \|v\|}$ is the cosine similarity of two vectors and f is a feedforward neural network with non-linear activation functions that outputs a scalar score for each action a.
The selection process starts with Z. Our module iteratively samples actions from $\pi_\theta(a|s_t)$ until selecting STOP, ending with selected content C and a corresponding reward. We can even allow for the selection of partitioned tuple sets by adding an extra action of "NEW LIST", which allows the model to include subsequent tuples in a new group.
Footnote 5: STOP and NEW LIST get special embeddings.
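The sampling loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `toy_score` is a hypothetical stand-in for the learned feedforward scorer f (it simply prefers candidates far from already-selected content and ignores the step t), tuples are represented directly by toy dense vectors, and STOP receives a fixed score instead of a learned embedding.

```python
import math
import random


def cos(u, v):
    """Cosine similarity of two dense vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def action_probs(scores):
    """Softmax over one scalar score per available action."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_score(t, z_vec, selected_vecs):
    """Hypothetical scorer standing in for f: penalise similarity to the
    closest already-selected tuple (the step t is unused in this toy)."""
    closest = max((cos(z_vec, c) for c in selected_vecs), default=0.0)
    return -closest


def select_content(candidates, score_fn, stop_score, rng):
    """Sample actions until STOP is drawn or no candidates remain."""
    remaining, selected = list(candidates), []
    t = 0
    while remaining:
        # One score per remaining tuple, plus one for the STOP action.
        scores = [score_fn(t, z, selected) for z in remaining] + [stop_score]
        probs = action_probs(scores)
        choice = rng.choices(range(len(scores)), weights=probs, k=1)[0]
        if choice == len(remaining):  # the last index is STOP
            break
        selected.append(remaining.pop(choice))
        t += 1
    return selected


candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
picked = select_content(candidates, toy_score, stop_score=-0.5,
                        rng=random.Random(0))
print(picked)
```

In training, the log-probabilities of the sampled actions would be weighted by the episode reward in a policy-gradient update (Williams, 1992); that step is omitted here.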
We consider the following individual rewards:
• $R_e = \sum_{c \in C} \left( \cos(\hat{e}_{1c}, \hat{e}_{1y}) + \cos(\hat{e}_{2c}, \hat{e}_{2y}) \right)$ is the cosine similarity of the structures of the selected content C with the structures present in the summary y (each summary structure accounted with only one c), encouraging the model to select relevant content.
• $R_d$ is computed from the similarity between pairs within the selected content C, encouraging the selection of diverse tuples.
• $r_p$ is a small penalty for each action step to encourage concise selection.
The multi-objective reward is computed as
$$R = w_e R_e + w_d R_d - |C| \, r_p \quad (2)$$
where $w_e$, $w_d$ and $r_p$ are hyper-parameters. During training, the model is updated based on the rewards. During inference, the model selects an ordered set of key and diverse relation tuples corresponding to appropriate health conditions.
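As a rough illustration of how these rewards could combine, the sketch below assumes the weighted form R = w_e·R_e + w_d·R_d − |C|·r_p. Two parts are assumptions of this sketch rather than details given in the text: the diversity term is taken to be one minus the maximum pairwise cosine similarity within C, and the "one c per summary structure" constraint is implemented by greedy one-to-one matching. The weights are arbitrary placeholder values.

```python
import math


def cos(u, v):
    """Cosine similarity of two dense vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance_reward(selected, gold):
    """R_e: entity-level similarity between selected and summary tuples.
    Each summary tuple is matched to at most one selected tuple (greedy)."""
    unused = list(range(len(gold)))
    total = 0.0
    for e1c, e2c in selected:
        if not unused:
            break
        sims = {j: cos(e1c, gold[j][0]) + cos(e2c, gold[j][1]) for j in unused}
        best = max(sims, key=sims.get)
        total += sims[best]
        unused.remove(best)
    return total


def diversity_reward(selected):
    """R_d (assumed form): 1 - max pairwise similarity among selected tuples."""
    if len(selected) < 2:
        return 1.0
    sims = [cos(a[0] + a[1], b[0] + b[1])  # concatenate e1 and e2 vectors
            for i, a in enumerate(selected) for b in selected[i + 1:]]
    return 1.0 - max(sims)


def total_reward(selected, gold, w_e=1.0, w_d=0.5, r_p=0.05):
    """Weighted sum of the rewards minus a per-step penalty."""
    return (w_e * relevance_reward(selected, gold)
            + w_d * diversity_reward(selected)
            - len(selected) * r_p)


# Toy example: one selected tuple that exactly matches the single gold tuple.
selected = [([1.0, 0.0], [0.0, 1.0])]
gold = [([1.0, 0.0], [0.0, 1.0])]
print(total_reward(selected, gold))
```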
Consensus Aggregation: Identifying the consensus amongst the input documents is critical in our multi-document summarization task. We model the aggregation operator of our Content Selection using simple one-line deterministic rules, as shown in Table 1. The rules are applied to the key entity-entity relation pairs in C in the context of $X_G$. In our example in Figure 1, O is Contradiction because of rows 1 & 3 and rows 4 & 5 (rows 1 & 3 alone would also make it Contradiction).
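The flavour of such deterministic rules can be sketched as follows. Since Table 1 is not reproduced here, every rule in this snippet is a hypothetical stand-in: it assumes that "increases" and "decreases" are opposing relation labels, that a selected tuple supported by fewer than two documents counts as Under-reported, and it omits Population Scoping because that would require population information not modelled in these toy rows.

```python
# Hypothetical pairs of opposing relation labels (an assumption of this sketch).
OPPOSITES = {("increases", "decreases"), ("decreases", "increases")}


def aggregate(selected, database):
    """Return an aggregation label for the selected tuples.

    selected: list of (e1, e2, relation)
    database: list of (doc_id, e1, e2, relation) rows from all input documents
    """
    supporting_docs = set()
    for e1, e2, rel in selected:
        for doc_id, d1, d2, d_rel in database:
            if (d1, d2) != (e1, e2):
                continue
            if (rel, d_rel) in OPPOSITES:
                # Any conflicting row for the same entity pair is decisive.
                return "Contradiction"
            if d_rel == rel:
                supporting_docs.add(doc_id)
    if len(supporting_docs) < 2:
        return "Under-reported"
    return "Agreement"


database = [
    (1, "pears", "breast cancer", "decreases"),
    (2, "pears", "breast cancer", "increases"),
]
selected = [("pears", "breast cancer", "decreases")]
print(aggregate(selected, database))
```

Because the rules are plain comparisons over rows, new operators can be added without retraining any model, which is the property the hybrid design relies on.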

Surface Realization
The surface realization model $P(y|O, C)$ performs the critical task of generating a summary guided by both the entity-entity relation tuples C and the aggregation operator O. The model allows for robust, diverse and faithful summarization compared to traditional template and modern seq2seq approaches.
We propose to model this process as a prototype-driven text infilling task. The entities from C are used as fixed tokens with relations as special blanks in between these entities. This is prefixed by a prototype summary corresponding to O. For the example shown in Figure 2, we concatenate, using |SEN|, a randomly sampled contradictory summary "Kale contains substances ... help fight cancer ... but the human evidence is mixed ." to C "<blank> pears <controls> ovarian cancer <decreases> breast cancer <blank>". The infilling language model produces text corresponding to relations between entities while maintaining an overall structure which is cognizant of O (see footnote 6). The model is trained on the few sample summaries from the training set using $G^y_m$ and $O^y_m$ to produce $y_m$. Providing aggregation and content guidance during generation alleviates the low-resource issue.
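The input format in the example above can be assembled with a small helper. This is a sketch only: the function name `build_infilling_input` is hypothetical, the separator is assumed to be the literal string `|SEN|`, and the choice to emit a shared first entity once (as in the pears example) and to simply append a new first entity when it changes is an assumption about the serialization.

```python
def build_infilling_input(prototype, tuples, sep="|SEN|"):
    """Concatenate a prototype summary with the blanked-out content C.

    tuples: ordered list of (e1, relation, e2). Entities become fixed tokens
    and each relation becomes a special blank token such as <controls>.
    """
    parts = ["<blank>"]
    previous_e1 = None
    for e1, relation, e2 in tuples:
        if e1 != previous_e1:
            # Emit the first entity only when it changes between tuples.
            parts.append(e1)
            previous_e1 = e1
        parts.append(f"<{relation}>")
        parts.append(e2)
    parts.append("<blank>")
    return f"{prototype} {sep} {' '.join(parts)}"


prototype = ("Kale contains substances ... help fight cancer ... "
             "but the human evidence is mixed .")
content = [("pears", "controls", "ovarian cancer"),
           ("pears", "decreases", "breast cancer")]
model_input = build_infilling_input(prototype, content)
print(model_input)
```

The resulting string would then be passed to the infilling language model, which fills the blanks while following the structure of the prototype.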

Summary and Update Setting
In this section we describe the setting of summary updates. In a real-world setting, we would often receive new input documents, such as scientific studies about the same subject, which necessitate a change in an old summary. In the context of our food and health summarization task, the goal is to update an old summary about a food and health condition on receiving results from new scientific studies from Pubmed. Our model can accommodate this scenario fairly easily. We describe the minor changes to the Content Selection and Aggregation and Surface Realization models for such a setting.
We are provided an original summary and can extract its content C, and we can also construct the mini-database $X_G$ from the text of the new documents. We first identify the aggregation between the new studies' $X_G$ and the original summary's content C. Depending on the aggregation identified, corresponding content C′ is selected from $X_G$. For instance, in case of a contradiction, we are keen on identifying content leading to this contradiction. The subsequent Surface Realization is dependent on O, the selected C′ and the C present in the original summary ($P(y|O, C + C')$).
Footnote 6: Summaries in our training data are labelled with $O^y_m$ as belonging to one of the four categories of Under-reported, Population Scoping, Contradiction or Agreement to accommodate such training.

Experiments
Dataset: We utilize a real-world dataset of Food and Health summaries, crawled from https://www.healthline.com/nutrition (Shah et al., 2021). The Healthline dataset consists of scientific abstracts as inputs and human-written summaries as outputs. The dataset consists of 6640 scientific abstracts from Pubmed, each averaging 327 words. The studies in these abstracts are cited by domain experts when writing summaries in the Healthline dataset, forming natural pairings of parallel data. Individual summaries average 24.5 words and are created using an average of 3 Pubmed abstracts. Each food has multiple bullet summaries, where each bullet typically talks about a different health impact (hydration, diabetes, etc.). We assign each food article randomly to one of the train, development or test splits. Entity tagging and relation classification annotations are provided for the Pubmed abstracts and the Healthline summaries.
Settings: We consider three settings.
1. Single Issue: We use the individual food and health issue summaries as unique instances of the food and single issue setting. We split 1894 instances 80%, 10%, 10% into train, dev and test.

2. Multiple Issues: We group each food article's Pubmed abstract inputs and multiple summary outputs as a single parallel instance. 464 instances are split 80%, 10%, 10% into train, dev and test.
3. Summary Update: We consider two kinds of updates: new information is fused into an existing summary, or new information contradicts an existing summary. For fusion, we consider single issue summaries that have multiple conditions from different Pubmed studies (bananas + low blood pressure from one study and bananas + heart health from another study). We partition the Pubmed studies to simulate an update. The contradictory update setting is where we artificially introduce conflicting results in the input document set so that the aggregation changes from Agreement to Contradiction. We have a total of 103 test instances. All models are trained on the Single Issue data.
Evaluation: We evaluate our systems using the following automatic metrics. Rouge is an automatic metric used to compare the model output with the gold reference (Lin, 2004). KG(G) computes the number of entity-entity pairs with a relation in the gold reference that are generated in the output (see footnote 7). This captures relevance in the context of the reference. KG(I), similarly, computes the number of entity-entity pairs in the output that are present in the input scientific abstracts. This measures faithfulness with respect to the input documents. Aggregation Cognisance (Ag) measures the accuracy of the model in producing outputs which are cognizant of the right aggregation from the input (Under-reported, Contradiction or Agreement). We use a rule-based classifier to identify the aggregation implied by the model output and compare it to the actual aggregation operator based on the input Pubmed studies.
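The two knowledge-based metrics can be illustrated with a simplified sketch. It departs from the described setup in ways that should be read as assumptions: tuples are matched by case-insensitive string equality rather than by embedding-based cosine similarity, scores are normalised to fractions, and the helper name `fraction_found` is hypothetical.

```python
def normalize(tuples):
    """Lower-case and strip every field so that matching ignores casing."""
    return {tuple(field.strip().lower() for field in t) for t in tuples}


def fraction_found(tuples, pool):
    """Fraction of `tuples` that also occur in `pool` (exact match)."""
    wanted, available = normalize(tuples), normalize(pool)
    if not wanted:
        return 0.0
    return len(wanted & available) / len(wanted)


def kg_g(gold_tuples, output_tuples):
    """Relevance: how many gold-reference tuples appear in the output."""
    return fraction_found(gold_tuples, output_tuples)


def kg_i(output_tuples, input_tuples):
    """Faithfulness: how many output tuples are grounded in the inputs."""
    return fraction_found(output_tuples, input_tuples)


gold = [("pears", "breast cancer", "decreases"),
        ("pears", "ovarian cancer", "controls")]
output = [("Pears", "breast cancer", "decreases")]
inputs = [("pears", "breast cancer", "decreases")]

print(kg_g(gold, output), kg_i(output, inputs))
```

In this toy case the output covers one of the two gold tuples and is fully supported by the input, so relevance and faithfulness diverge, which is the distinction the two metrics are meant to capture.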
In addition to automatic evaluation, we have human annotators score our models on relevance and fluency. Given a reference summary, relevance indicates whether the generated text shares similar information. Fluency represents whether the generated text is grammatically correct and written in well-formed English. Annotators rate relevance and fluency on a 1-4 Likert scale (Albaum, 1997). We have 3 annotators score every data point and report the average across the scores.
Baselines: In order to demonstrate the effectiveness of our method, we compare it against text2text and [...]
Footnote 7: We run entity tagging plus relation classification on top of the model output and gold summaries. We match the gold $(e^g_i, e^g_j, r^g)$ tuples using word embedding based cosine similarity with the corresponding entities in the output structures.
Implementation Details: Our policy network is a three-layer feedforward neural network. We use a Transformer (Vaswani et al., 2017) implementation for Surface Realization. We train an off-the-shelf Neural CRF tagger (Yang and Zhang, 2018) for entity extraction. We use BERT-based (Devlin et al., 2018) classifiers, trained using crowdsourced annotations from Shah et al. (2021), to predict the relation between two entities in a text. Further implementation details can be found in Appendix A.

Results
In this section, we describe the performance of our Nutribullets Hybrid system and baselines on summarization and summary updates. We report empirical results and human evaluation, and present sample outputs, highlighting the benefits of our method.
Single and Multi-issues Summarization: We describe the results on the task of generating summaries. Table 3 presents the automatic evaluation results for the food and single issue summarization task. High KG(I) and KG(G) scores for our method indicate that the generated text is faithful to input entities and relevant. In particular, a high Aggregation Cognisance (Ag) score indicates that our model generates summaries which are cognizant of the varying degrees of consensus in the input Pubmed documents. Compared to other baselines we also receive a competitive score on the automatic Rouge metric, beating the Copy-gen, Entity Data2text and GraphWriter baselines while falling short (by 1.7%) of the Transformer baseline. The baselines, especially Transformer, tend to produce similar outputs for different inputs (see Table 4). Since a lot of these patterns are learned from the human summaries, Transformer receives a high Rouge score. However, because the baseline does not completely capture the content and aggregation in the low-resource regime, it fails to get a very high KG(G) or Ag score. A similar trend is observed for the other baselines too, which in this low-resource regime produce a lot of false information, reflected in their low KG(I) scores. Human evaluation, conducted by considering scores on a 1-4 Likert scale from three annotators for each instance, shows the same pattern. Our model is able to capture the most relevant information when compared against the gold summaries, while producing fluent summaries. The Transformer baseline produces fluent summaries, which are not as relevant. The performance is poorer for the Copy-gen, Entity Data2text and GraphWriter models.

Table 4: Example outputs of our model and the Transformer baseline for a multi-issues summary. Trained on limited parallel data, the Transformer baseline produces repetitive text with factual inaccuracies, while our method is able to provide more accurate and diverse summarization.
Transformer (baseline)
* Whole -grain cereals may protect against obesity , diabetes and certain cancers. However , more research is needed .
* Whole grains , such as mozambican grass , are safe to eat with no serious side effects .
* Whole -grain cereals may protect against obesity , diabetes and certain cancers. However , more research is needed .
* Whole grains , such as blueberries , are likely safe to eat with no serious side effects .
* Whole grains are safe to eat. However , people with type 2 diabetes should avoid whole grains .
* Whole grains are lower in carbs than whole grains , making them a good choice for people with type 2 diabetes.
Our Method
* Whole grains has been shown to lower weight gain and improve various type 2 diabetes risk factors .
* Whole grains has been shown to lower insulin resistance and improve various cancer risk factors .
* Whole grains has been linked to several other potential health benefits , such as improved CVD risk , eyesight , and memory. However , more studies are needed to draw stronger conclusions.
* There is some evidence , in both animals and humans , that whole grains can reduce mortality by regulating the hormone ghrelin.
In the multi-issues setting, the baselines access the gold annotations with respect to the input documents' clustering. Our model conducts the extra task of grouping the selected tuples, using the "New List" action. Our model performs better than the baselines on both the KG(I) and KG(G) metrics, as seen in Table 5. Again, the pattern of producing very similar and repetitive sentences hurts the baselines. They fail to cover different issues and tend to produce false information in this low-resource setting. In absolute terms, our model scores 7% higher on KG(G) and 17% higher on KG(I) than the next best performance. Table 4 shows the comparison between the outputs produced by our method and the Transformer baseline on the benefits of whole grains. Our method conveys more relevant, factual and organized information in a concise manner.
Summary Update: We study the efficacy of our model in fusing information into existing summaries on receiving new Pubmed studies. As the KG(G) metric in Table 6 shows, our model is able to select and fuse more relevant information. Table 7 shows two examples of summaries on flaxseeds where our model successfully fuses new information. We also report evaluation results to demonstrate the efficacy of maintaining Aggregation Cognisance (Ag), which is critical when updating summaries on receiving contradictory results. The high performance in this update setting demonstrates the Surface Realization model's ability to produce aggregation-cognizant outputs, in contrast to the baselines, which do not learn this reasoning in a low-resource regime.
Analysis: Information Extraction and Content Aggregation
Information extraction is the critical first step performed on the input documents in order to get symbolic data for content selection and aggregation. To this end, we report the performance of the information extraction system, which is composed of two models: entity extraction and relation classification.
As reported in Table 8, the entity extraction model, a CRF-based sequence tagging model, receives a token-level F1 score of 79%. The relation classification model, a BERT-based text classifier, receives an accuracy of 69%.
The performance of the information extraction models is particularly important for the content aggregation sub-task. In order to analyse this quantitatively, we manually analyse the 179 instances in the dev set and compare them to the system-identified aggregation (information extraction followed by the deterministic rules in Table 1). Given the simplicity of our rules, the system's 78% accuracy in Table 8 is acceptable. Deeper analysis shows that the performance is lowest for Population Scoping and Contradiction, with accuracies of 52% and 56% respectively. The low performance on Population Scoping is predominantly due to the simplicity of the rules. Most mistakes occur when the input studies are review studies that do not mention any population but analyze results from several past works. Contradiction suffers because of errors in the information extraction system, and stronger extraction models should be able to alleviate these errors.
Table 8: Performance of our information extraction system and its impact on content aggregation.

Conclusion
While modern models produce fluent text in multi-document summarization, they struggle to capture the consensus amongst the input documents. This inadequacy, magnified in low-resource domains, is addressed by our model. Our model is able to generate robust summaries which are faithful to content and cognizant of the varying consensus in the input documents. Our approach is applicable to both summarization and textual updates. Extensive experiments with automatic and human evaluation underline its impact over state-of-the-art baselines.