A New Approach to Overgenerating and Scoring Abstractive Summaries

We propose a new approach to generate multiple variants of the target summary with diverse content and varying lengths, then score and select admissible ones according to users' needs. Abstractive summarizers trained on single reference summaries may struggle to produce outputs that achieve multiple desirable properties, i.e., capturing the most important information, being faithful to the original, and being grammatical and fluent. In this paper, we propose a two-stage strategy to generate a diverse set of candidate summaries from the source text in stage one, then score and select admissible ones in stage two. Importantly, our generator gives precise control over the length of the summary, which is especially well-suited when space is limited. Our selectors are designed to predict the optimal summary length and put special emphasis on faithfulness to the original text. Both stages can be effectively trained, optimized and evaluated. Our experiments on benchmark summarization datasets suggest that this paradigm can achieve state-of-the-art performance.


Introduction
The learning objective of a modern abstractive summarizer is to produce system outputs that resemble reference summaries on a word-to-word basis. It does not promote outputs that possess multiple desirable properties, i.e., capturing the most important information, being faithful to the original text, and being grammatical and fluent, though some of these properties are exhibited by system abstracts as a natural outcome of a learned summarizer (See et al., 2017; Takase et al., 2016; Tan et al., 2017; Chen and Bansal, 2018; Celikyilmaz et al., 2018; Gehrmann et al., 2018; Liu and Lapata, 2019; Lebanoff et al., 2019b; Fabbri et al., 2019; Bražinskas et al., 2020). Without direct optimization of desired properties, system abstracts often change the meaning of the original document or fail to convey the main concepts (Kryscinski et al., 2020).

Table 1: Example of alternative summaries generated from the source text. Admissible summaries are marked by ✓. System summaries that fail to preserve the meaning of the source input are marked by ✗.
In this paper, we propose a new approach to overgenerate and select admissible summaries, which allows a summarizer to juggle multiple objectives and strike a good balance between them (Belz and Reiter, 2006). Our approach consists of two stages. Given a source text, a generator explores the space of all possible lengths to produce multiple variants of the target summary that contain diverse content. We then devise selectors to validate the quality of alternative summaries to predict whether they are admissible. Our selection mechanism can be customized to suit particular needs without changing the generation space. Both stages can be effectively trained, optimized and evaluated.
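At a high level, the overgenerate-and-select loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; `generate_with_length` and `score_summary` are hypothetical stand-ins for the generator and the selectors described later.

```python
def overgenerate_and_select(source, generate_with_length, score_summary,
                            min_len=7, max_len=16):
    """Stage 1: overgenerate one candidate summary per target length.
    Stage 2: score every candidate and keep the best one."""
    candidates = [generate_with_length(source, L)
                  for L in range(min_len, max_len + 1)]
    return max(candidates, key=lambda s: score_summary(source, s))
```

Because the scoring step is decoupled from generation, any scoring function with the same interface can be plugged in without changing the generation space.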
Crucially, we take a confidence-driven approach to summary generation rather than using a left-to-right order. Beginning writers and language learners do not write in a strict sequential manner. In a similar vein, our generator produces a summary by "filling in the blanks" with appropriate words. The most confident words are generated first, less vital ones later. With confidence-driven generation, our summarizer learns to dynamically add or remove content, and even paraphrase, to produce a summary of a given length. In Table 2, we show an example illustrating the difference between our method and left-to-right generation. Our method dramatically enhances the capability of the generator, making it possible to explore summaries of varying lengths.

Source Text: A court here Thursday sentenced a 24-year-old man to 10 years in jail after he admitted pummelling his baby son to death to silence him while watching television.

Left-to-Right Generation (1 Summary):
  Man who killed baby to hear television better gets 10 years

Confidence-Driven Generation (4 Summaries):
  Man gets 10 years
  Man who kill the baby gets 10 years
  Man who kill the baby to hear television gets 10 years
  Man who kill the baby to hear television better gets 10 years

Table 2: An example of the difference between left-to-right and confidence-driven summary generation. (LEFT) A single summary is produced in a left-to-right order. (RIGHT) Four summaries are generated in a confidence-driven mode. The most confident words are generated first, less vital ones later. Our generator learns to dynamically add or remove content given a target length to produce summaries of varying lengths (short, medium and long). The output is a diverse set of alternative summaries.
Identifying admissible summaries with desired properties is critical for a summarizer. Summaries of very short lengths may fail to capture the main concepts, and this kind of incomplete or partial information can lead to false assumptions about the original content. Moreover, summaries of moderate lengths may still contain hallucinated content that is nonexistent in the source text (Maynez et al., 2020). We present two summary selectors to combat these issues. Our first selector aims to predict what summary length is most suitable for a source text, whereas a second selector puts special emphasis on the overall quality of the system summary, in particular its faithfulness to the original text (Falke et al., 2019;Durmus et al., 2020).
We introduce a novel dataset in this work, where we associate a source text with multiple summaries and admissible ones are manually labelled by human annotators. Not only can the dataset be used to judge the effectiveness of summary selectors, but it also provides a new testbed for future summarizers to compare their outputs against multiple reference summaries, which is key to improving the reliability of evaluation results (Louis and Nenkova, 2013). We have focused on generating abstractive summaries from single source sentences, but the insights gained from this study could inform the design of summarizers of all forms. Our method also has great potential to incorporate a human in the loop to teach the model to select the best summary. The main contributions of this paper are:
• We propose a new approach to generate multiple variants of the target summary that have varying lengths, then score and select the best summaries according to our needs.
• Our generator offers precise control over the length of the summary, which is especially well-suited when space is limited. Our selectors are designed to predict the optimal summary length and put special emphasis on faithfulness to the original text.
• Our experiments on benchmark summarization datasets suggest that this paradigm can surpass the results of previous studies or rival the state of the art. We conclude with a discussion of our key findings, which have implications for the development of robust abstractive summarizers.

Related Work
It is important for neural abstractive summarizers to produce summaries that are faithful to the original texts (Cao et al., 2017; Kryscinski et al., 2019; Lebanoff et al., 2019a; Dong et al., 2020; Zhang et al., 2020b). However, it remains an open question whether a summarizer must acquire that ability by learning from human reference summaries, or possibly through external resources such as textual entailment predictions (Falke et al., 2019). In this paper, we present a two-stage strategy to overgenerate, then score system summaries externally for faithfulness and overall quality. Previous work has sought to control various aspects of the generated summary, including the style, length and amount of reused text (Kikuchi et al., 2016; Hu et al., 2017; Fan et al., 2018; Makino et al., 2019; Song et al., 2020). In contrast, our generator focuses on producing multiple variants of the target summary that have diverse content and varying lengths. It offers precise control over the length of the summary, which has an important implication for fair comparison between different summarization systems (Napoles et al., 2011; Shapira et al., 2018).
Our methodology allows for greater flexibility in designing summary selectors. The selectors may allow multiple admissible summaries to be identified for any source input according to users' needs.

Figure 1: An illustration of the generation process. A sequence of placeholders ("[MASK]") is placed following the source text. Our model simultaneously predicts the most probable tokens for all positions, rather than predicting only the most probable next token in an autoregressive setting. We obtain the token that has the highest probability, and use it to replace the [MASK] token at that position. Next, the model makes new predictions for all remaining positions, conditioned on the source text and all summary tokens seen thus far. Our generator produces a summary having the exact given length and with a proper endpoint.

By contrast, post-editing of system summaries through a set of basic operations such as insertion and deletion (Gu et al., 2019; Malmi et al., 2019; Dong et al., 2019b; Correia and Martins, 2019) may have intrinsic limitations, as it learns from single reference summaries to produce single outputs. In this paper, we provide a new dataset where each source text is associated with multiple admissible summaries to encourage diverse outputs. Our generator is inspired by unsupervised pretraining of deep neural models (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2019; Yan et al., 2020; Zhang et al., 2020a) and non-autoregressive machine translation (Gu et al., 2018; Ghazvininejad et al., 2019). Distinct from these is our confidence-driven generation, which goes beyond left-to-right order. It uses a denoising objective during training and is conveniently transformed into a semi-autoregressive generator at test time. We introduce a customized beam search algorithm to promote the generation of diverse outputs. In the following section, we describe our two-step strategy in detail.

A Confidence-Driven Generator
We seek to produce a highly diverse set of alternative summaries from any source input, but standard neural language generators with beam search only produce high-likelihood sequences rather than diverse ones (Ippolito et al., 2019). To address this limitation, we devise a new generator that is capable of producing summaries of varying lengths. A long summary can cover more of the important information in the source text, whereas a short summary is easier to read. Moreover, our generator produces a summary with exactly the given length and a proper endpoint. This is achieved by shifting away from left-to-right generation and building a summary using a confidence-driven approach.
Our generator is illustrated in Figure 1. To generate a summary of L tokens, we place a number of [MASK] tokens following the source text, which serve as "placeholders" for summary tokens. Importantly, our generator simultaneously predicts the most probable tokens for all positions, as opposed to predicting only the most probable next token in an autoregressive setting. We obtain the token that has the highest probability across all positions, and use it to replace the [MASK] token at that position. Next, the model continues to make predictions for all remaining positions, conditioned on the source text and the summary tokens seen thus far at varying positions.

Let x be the source text and y = {y_j}_{j=1}^{M} the summary sequence. Our confidence-driven generation process defines a new order of summary tokens o = (o_1, ..., o_M), where o_j is the position filled at the j-th step (Eq. (1)); θ are model parameters to be optimized during training:

  o_j = argmax_i p_θ(y_i | x, y_{o_{<j}})    (1)

Our learning objective is to minimize the negative data log-likelihood (Eq. (2)), predicting the missing token y*_{o_j} conditioned on the source text x and the summary tokens seen thus far y_{o_{<j}}:

  L(θ) = − Σ_{j=1}^{M} log p_θ(y*_{o_j} | x, y_{o_{<j}})    (2)

Our generator is trained with a denoising objective. It consists of a decoder-only architecture with 12 Transformer blocks (Dong et al., 2019a).

Table 3: Example summaries of varying lengths (L) produced by our generator for the input "The Bank of Japan appealed to financial markets to remain calm Friday following the US decision to order Daiwa Bank Ltd. to close its US operations." The numbers indicate the order in which the summary tokens are generated. "BoJ" stands for "Bank of Japan"; it maps to two tokens under Byte Pair Encoding (BPE). Each summary has an ending period, so the last word also maps to two tokens. Our generator can dynamically add or remove content, and paraphrase to produce a summary of a given length.

Given
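The greedy special case of this decoding loop (beam size 1) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; `predict_probs` is a hypothetical stand-in for the model, returning an L x |V| matrix of token probabilities for every position given the source and the tokens filled in so far.

```python
import numpy as np

def confidence_driven_decode(predict_probs, L):
    """Fill L [MASK] slots in confidence order: at each step, take the
    single most probable (position, token) pair over all unfilled slots."""
    tokens = [None] * L  # None marks a slot still holding [MASK]
    for _ in range(L):
        probs = np.asarray(predict_probs(tokens), dtype=float)
        taken = [i for i, t in enumerate(tokens) if t is not None]
        probs[taken, :] = -1.0  # exclude positions already filled
        pos, tok = np.unravel_index(np.argmax(probs), probs.shape)
        tokens[pos] = int(tok)
    return tokens
```

Unlike left-to-right decoding, the position filled at each step is chosen by the model's confidence, so high-confidence words anywhere in the summary are committed first.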
a source text and a summary, we replace a portion of their tokens with the [MASK] token, and the model is trained to reconstruct the original data from the corrupted text. It differs from autoregressive models in that the context of each position can consist of tokens from both left and right: a source word can attend to other source words, and a summary word can attend to source words and summary words seen thus far at varying positions, hence capturing a bidirectional context. The training procedure is thus analogous to that of permutation-based language modeling. Our training schedule begins with masking out 10% of source tokens and linearly decreases this rate to 0% over the course of training. Masking out a portion of source tokens helps the model learn contextualized representations given bidirectional context. On the target side, the schedule begins with masking out 90% of summary tokens and linearly decreases the rate to 60%. It allows the model to learn to predict missing summary tokens and copy source tokens to the summary. When a token is chosen, it is replaced with the [MASK] token 80% of the time, a random token from the vocabulary 10% of the time, and remains unchanged otherwise.
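The corruption scheme and linear masking schedule described above can be sketched as follows; this is an illustrative reconstruction under the stated rates (80/10/10 replacement split, 10%→0% on the source side, 90%→60% on the target side), not the authors' code.

```python
import random

def mask_rate_at(step, total_steps, start, end):
    """Linear schedule, e.g. 0.9 -> 0.6 for summary tokens and
    0.1 -> 0.0 for source tokens over the course of training."""
    frac = step / max(total_steps, 1)
    return start + (end - start) * frac

def corrupt(tokens, mask_rate, vocab, mask_token="[MASK]"):
    """Each selected token becomes [MASK] 80% of the time, a random
    vocabulary token 10% of the time, and stays unchanged otherwise."""
    out = list(tokens)
    for i in range(len(out)):
        if random.random() < mask_rate:
            r = random.random()
            if r < 0.8:
                out[i] = mask_token
            elif r < 0.9:
                out[i] = random.choice(vocab)
    return out
```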
In Table 3, we present example summaries produced by our new confidence-driven generator for a source input. The summaries have varying lengths and levels of detail. Our generator learns to add or remove content, and even paraphrase, to produce a summary of a given length. We adjust the target summary length (L) to produce diverse summaries. Moreover, there exist multiple admissible summaries that capture the important information of the source text while being grammatical and faithful to the original. It is important to note that, to decode the best summary of length L, our generator requires a position-aware beam search algorithm to explore the space of candidate summaries, which is described next.

Position-Aware Beam Search
A position-aware beam of size K not only contains the K-best candidate summaries having the highest log-likelihood at any time step, but also records the positions of summary tokens seen thus far for each candidate summary. The tokens of candidate summaries can be decoded in any order and occur in different positions, marking an important distinction between position-aware and traditional beam search (Meister et al., 2020). The method is realized by associating each candidate summary with a binary matrix M ∈ {0, 1}^{L×|V|}, which records which positions have been filled by which summary tokens and which positions remain available.
Concretely, we use S to denote a candidate summary, score is its data log-likelihood, and M is a binary mask (Line 9). Our generator predicts the token probabilities P ∈ R^{L×|V|} for all positions, conditioned on the source text and the summary tokens seen thus far. The binary mask M indicates which positions remain available (Lines 11-12). We obtain the top-K tokens that have the highest probability scores across all positions, and record their summary hypotheses and likelihood scores. These positions are then marked as taken (Lines 14-18). The decoding process continues until all L positions are filled by summary tokens. This makes our method different from traditional beam search, which terminates when an end-of-sequence symbol [SEP] is generated for the summary. In particular, our method is advantageous as it exerts precise control over the summary length. The model learns to decide what content to include in the summary given the limited space available, yielding summaries with varying levels of detail.

Table 4: Corruption types. A positive instance for the selector consists of a ground-truth summary (marked by ✓) and its source text. A negative instance consists of a corrupted summary (✗) and its source text.
Entity Replacement: replacing a named entity of the ground-truth summary with a random entity.
Negation: negating a ground-truth summary sentence.
Incomplete Summary: replacing the ground-truth summary with one of its sentence constituents to produce a corrupted summary that contains 5 words or fewer.
Search and Replace: swapping the ground-truth summary with a similar summary in the training set that shares 4 or more bigrams.
Swap Segments: splitting the ground-truth summary into two parts of similar length; the parts are swapped to produce an ungrammatical summary.
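The search procedure can be sketched as follows; it is a simplified reconstruction (log-probabilities in place of the mask matrix, and a hypothetical `predict_logprobs` model call), not the paper's exact algorithm.

```python
import numpy as np

def position_aware_beam_search(predict_logprobs, L, K):
    """Each hypothesis is (partial summary, log-likelihood); None marks an
    unfilled position. Decoding stops after all L positions are filled,
    not when an end-of-sequence symbol is produced."""
    beams = [([None] * L, 0.0)]
    for _ in range(L):
        candidates = []
        for tokens, score in beams:
            logp = np.asarray(predict_logprobs(tokens), dtype=float)
            for i, t in enumerate(tokens):
                if t is not None:
                    logp[i, :] = -np.inf  # positions already taken
            top = np.argsort(logp, axis=None)[::-1][:K]  # top-K expansions
            for idx in top:
                pos, tok = np.unravel_index(idx, logp.shape)
                new = list(tokens)
                new[pos] = int(tok)
                candidates.append((new, score + logp[pos, tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return beams[0][0]
```

Because every hypothesis runs for exactly L steps, the summary length is controlled precisely, as described above.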

The Selectors
We present two selectors to respectively assess the overall quality of the summary and predict the optimal summary length. Our selectors assume the role of a responsible agent that, when provided with a source text and multiple alternative summaries, can effectively recognize the admissible ones. They have the potential to incorporate a human in the loop in the future to teach the model to select the best summaries.

Best Overall Quality
Our goal is to build a selector to discern the difference between high- and low-quality summaries. In an ideal scenario, human annotators would vet each source text/summary pair, and the annotated data would be used to train the selector. This process, however, is both expensive and time-consuming. Inspired by Kryściński et al. (2020), we automatically construct a large number of minimally different pairs, where a positive instance comprises the source text and its ground-truth summary, and a negative instance comprises the source text and a corrupted summary. We experiment with various means of generating corrupted summaries from a ground-truth summary. The corruptions should resemble common mistakes made by neural abstractive summarizers, including generating factually incorrect details, failing to convey the main points of the source text, and being ungrammatical. The corruption types experimented with in this paper are illustrated in Table 4.
Distinguishing our work from that of Kryściński et al. (2020) are: (i) Search and Replace, where we swap the ground-truth summary with a similar summary in the training set that shares ≥4 bigrams to form a negative instance; (ii) Swap Segments, which splits a ground-truth summary into two parts of similar length, then swaps them to produce an ungrammatical summary; and (iii) Incomplete Summary, which replaces a ground-truth summary with one of its sentence constituents, yielding a corrupted summary that fails to convey the main ideas. These corruptions are designed to emulate system summaries that are too short to capture the main concepts, or that contain hallucinated content not found in the source text.
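Two of these corruptions are mechanical enough to sketch directly; this is an illustrative reconstruction of Swap Segments and Incomplete Summary, not the authors' code (constituent extraction is assumed to be done elsewhere, e.g. by a parser).

```python
import random

def swap_segments(summary_tokens):
    """Swap Segments: split the ground-truth summary into two halves and
    swap them to produce an ungrammatical negative instance."""
    mid = len(summary_tokens) // 2
    return summary_tokens[mid:] + summary_tokens[:mid]

def incomplete_summary(constituents):
    """Incomplete Summary: replace the ground-truth summary with one of
    its sentence constituents, so the corrupted summary fails to convey
    the main ideas."""
    return random.choice(constituents)
```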
We next build a binary classifier to predict whether a summary is admissible given the source text. To distill information from the source text and the summary, we encode them into hidden vectors using RoBERTa, denoted by h_x and h_y, respectively. We create a vector for the pair, h = h_x ⊕ h_y ⊕ |h_x − h_y| ⊕ (h_x ∗ h_y), consisting of a concatenation of the two hidden vectors, their absolute difference |h_x − h_y| and their element-wise product (h_x ∗ h_y), where ⊕ denotes vector concatenation. The output vector h is expected to capture the gist of the source text and the summary; a similar approach is used for natural language inference. The vector h is fed to a feed-forward layer to predict whether the summary is admissible given the source text. We have chosen to design the selector as a classifier rather than a ranking model because there can exist multiple, equally valid summaries for any source input. The classifier allows us to identify admissible summaries that are not only true to the original but also have the best overall quality.
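The matching feature vector can be sketched as follows; the RoBERTa encodings h_x and h_y are assumed to be computed elsewhere, and the sketch covers only the feature construction, not the classifier itself.

```python
import numpy as np

def pair_features(h_x, h_y):
    """h = h_x (+) h_y (+) |h_x - h_y| (+) (h_x * h_y): concatenation of
    the two encodings, their absolute difference, and their element-wise
    product. The result feeds a feed-forward layer for the label."""
    h_x, h_y = np.asarray(h_x, dtype=float), np.asarray(h_y, dtype=float)
    return np.concatenate([h_x, h_y, np.abs(h_x - h_y), h_x * h_y])
```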

Best Summary Length
Finding a suitable length for the summary is one of the most important open problems in automatic summarization (Shapira et al., 2018; Sun et al., 2019). A summary should be shorter than the original, but long enough to include the most important information. Length normalization rescales the log-likelihood score of a summary, denoted by S(x, y) = log p_θ(y|x), by its length |y| with an exponent p (Eq. (3)):

  S_norm(x, y) = S(x, y) / |y|^p    (3)

It is used by some neural abstractive summarizers (See et al., 2017). However, the method does not consider the density of information in the source text, and it may still generate ultra-short summaries. Instead, we estimate the appropriate length of the summary given a source text, denoted by L_pred, and reward a system summary if it stays close to the estimated length (Huang et al., 2017). Concretely, we assign a per-word reward to the summary, represented by r · min(|y|, L_pred) (Eq. (4)):

  S_rwd(x, y) = S(x, y) + r · min(|y|, L_pred)    (4)

A system summary continues to be rewarded until it reaches the predicted length (|y| ≤ L_pred). Beyond that, increasing the length of the summary does not lead to additional rewards. We obtain the predicted length L_pred using a baseline abstractive summarizer, which takes the source text as input and greedily decodes a summary in a left-to-right manner until an end-of-sequence symbol is predicted; L_pred is the length of the decoded sequence. r is a coefficient that scales the reward and is tuned on the validation data.

Table 5: Results on the Gigaword test set (ROUGE F1 scores; Lin, 2004).
System | R-1 | R-2 | R-L
lvt2k-1sent (Nallapati et al., 2016) | 32.67 | 15.59 | 30.64
SEASS (Zhou et al., 2017) | 36.15 | 17.54 | 33.63
DRGD | 36.27 | 17.57 | 33.62
Pointer-Gen (See et al., 2017) | 34.19 | 16.92 | 31.81
R3Sum | 37.04 | 19.03 | 34.46
EntailGen (Guo et al., 2018) | 35.98 | 17.76 | 33.63
BiSET | 38.45 | 19.53 | 36.04
MASS (Song et al., 2019) | 38.73 | 19.71 | 35.96
UniLM (Dong et al., 2019a) | 38.90 | 20.05 | 36.00
PEGASUS (Zhang et al., 2020a) | 39.[…] | |
Finally, the reward-augmented log-likelihood S_rwd(x, y) is used as a scoring function to rank all summary hypotheses of varying lengths.
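The reward-augmented scoring reduces to a one-line function; the hypothesis format below is an assumption for illustration.

```python
def reward_augmented_score(log_likelihood, summary_len, L_pred, r):
    """S_rwd(x, y) = log p(y|x) + r * min(|y|, L_pred): the per-word
    reward accrues only up to the predicted length, so a summary gains
    nothing by growing past L_pred."""
    return log_likelihood + r * min(summary_len, L_pred)

def select_best(hypotheses, L_pred, r):
    """Rank hypotheses of varying lengths; each is (tokens, log-likelihood)."""
    return max(hypotheses,
               key=lambda h: reward_augmented_score(h[1], len(h[0]), L_pred, r))
```

With r tuned on validation data and L_pred near the median reference length, this favors summaries close to the predicted length over ultra-short, high-likelihood ones.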

Experiments
Datasets We perform extensive experiments on the Gigaword (Parker, 2011) and Newsroom (Grusky et al., 2018) datasets. The goal is to generate an abstractive summary from a lengthy source sentence. For each article, we pair its first sentence with the title to form a summarization instance. Both datasets contain large collections of news articles. Gigaword (1995-2010) contains 3,810,674 / 10,000 / 1,951 instances, respectively, in the train, validation and test splits. Newsroom (1998-2017) contains 199,341 / 21,530 / 21,377 instances, respectively. We conduct experiments on both datasets to demonstrate the generality of our two-stage strategy. Our method generates a diverse set of summaries from a source sentence in stage one, then scores and selects admissible summaries in stage two. The system summaries are evaluated using both automatic metrics (ROUGE; Lin, 2004) and human evaluation of information coverage, grammaticality and faithfulness to the original text. We introduce a new dataset where a source sentence is associated with multiple summaries, and admissible ones are labelled by human annotators (§5.1). The dataset will serve as a useful testbed for future summarization research, where multiple reference summaries are key to improving the reliability of evaluation results (Louis and Nenkova, 2013). This paper focuses on generating abstractive summaries from single source sentences. However, we expect the insights gained from this study to inform the design of future summarizers of different kinds.
Experimental Setup Our generator is initialized with RoBERTa-BASE due to its high performance on generation-related tasks. We use Byte Pair Encoding (Sennrich et al., 2016) with a vocabulary of 50,265 tokens. The model contains 12 Transformer blocks (Vaswani et al., 2017), with a hidden size of 768 and 12 attention heads, for a total of 110M parameters. We fine-tune the model on the train splits of Gigaword and Newsroom, respectively, before applying it to the test sets. The model is fine-tuned for 20 epochs. Each epoch contains 24k / 1.5k batches, respectively, for Gigaword and Newsroom, and our batch size is 128. The model uses 10k / 1k warm-up steps, respectively, for the two datasets. We use the AdamW (Loshchilov and Hutter, 2017) optimizer with an initial learning rate of 1e-4. The momentum parameters are set to 0.9 and 0.999. On a deep learning workstation equipped with 2x Titan RTX GPUs, our model takes 64 and 5.5 hours to fine-tune on Gigaword and Newsroom, respectively. At test time, our beam size is K=20. The model produces summaries ranging from L=7 to 16 tokens for a given source sentence.
Our selector for best overall quality is trained using 1.8M instances automatically constructed from the train split of Gigaword. The set is balanced, with an equal number of positive and negative instances. 226k instances are created using the Search and Replace corruption type, and 400k instances are created using each of the four remaining corruption types. The reward coefficient r is set to 2.0 across all experiments.

Automatic Evaluation
In Table 6, we present results on the Gigaword and Newsroom test sets evaluated by ROUGE (Lin, 2004). We report R-1, R-2 and R-L F1-scores that respectively measure the overlap of unigrams, bigrams, and longest common subsequences between system and reference summaries. For each summarization instance, our generator produces multiple alternative summaries, ranging from L=7 to 16 tokens. E.g., "Daiwa Bank." corresponds to four tokens: 'Dai', 'wa', 'Bank', plus an ending period. Our BEST-QUALITY and BEST-LENGTH selectors each identify a single best summary from the set of alternative summaries for each summarization instance.
We observe that the BEST-LENGTH selector achieves the highest scores. It performs better than using any single target length for all summaries. Among summaries of different lengths, the highest R-2 F1-scores are obtained when the target summary length is set to 11 and 12 tokens, respectively, for Gigaword and Newsroom. This is close to the median length of reference summaries, which is 12 and 13 tokens for these datasets. Our findings show that the target summary length can have a non-negligible impact on automatic evaluation results: system summaries should be long enough to include the most important information to achieve satisfying results.
In Table 5, we report results on the Gigaword test split, which contains 1,951 instances. Our approach is compared against strong neural abstractive systems, including PEGASUS (Zhang et al., 2020a), UniLM (Dong et al., 2019a) and MASS (Song et al., 2019). These systems draw on large-scale unsupervised pretraining to improve the quality of summaries, yielding some of the best reported results. In comparison, our BEST-LENGTH selector either surpasses or performs comparably to these systems. The summaries it selects achieve the highest R-2 F1-score of 20.4%. We further choose the summary that yields the highest score for each instance, creating an oracle set of summaries, which yields an R-2 F1-score of 33.4%. These results indicate that, with better summary selectors, there is great potential to further boost summarization performance.

Table 7: Example annotation interface. A human annotator is instructed to read over the summaries before seeing the source text, to effectively recognize any hallucinated content that is not found in the source text. A native English speaker creates annotations for multiple instances, which are shared with all annotators to provide guidance.
In Figure 2, we investigate the effectiveness of our position-aware beam search (§3.1). The beam size K is set to {1, 5, 10, 15, 20}. We report the average R-2 F1-score across summaries of all lengths. Results show that our position-aware beam search is effective at decoding summaries and works robustly across a range of beam sizes. A larger beam (K=20) tends to give better results.

Human Evaluation
We are interested in a holistic evaluation of the multiple alternative summaries produced by the generator. To accomplish this, we develop a new dataset containing 500 summarization instances randomly sampled from the Gigaword test set. Our generator produces 7 alternative summaries for each instance, with varying lengths that range from L=7 to 13 tokens. We recruit human evaluators to judge the quality of each summary given its source text. (Our annotated dataset is available on Github at https:) Our annotation interface is presented in Table 7. A human annotator is instructed to read over all summaries before seeing the source text. This allows them to effectively recognize any hallucinated content that is not found in the source text. The annotator is asked to answer three yes-no questions: (a) does the summary successfully convey the main points of the source text? (b) is the summary truthful to the meaning of the original? (c) is the summary grammatical? A native speaker creates gold-standard annotations for multiple instances, which are shared with all annotators to provide guidance. Our annotators are recruited using Appen (appen.com), a crowdsourcing platform similar to Amazon Mechanical Turk (mturk.com) that provides strong quality-control mechanisms to ensure high-quality work.
We recruit 5 annotators to judge the quality of each summary. A summary is deemed admissible under a criterion if the majority answer is yes. We observe that 74.2% of summaries produced by our generator are admissible under all three criteria. The results suggest that our generator is able to produce multiple, equally valid summaries for a given source text. We additionally examine the percentage of admissible summaries under each criterion; results are shown in Table 8. Grammaticality has the highest rate (96.5%), followed by truthfulness (82.6%) and content coverage (80.7%). There appears to be room for improvement in the latter two aspects. Moreover, the summaries chosen by our BEST-QUALITY selector demonstrate high admissible rates of 93%, 90.8% and 97%, respectively, for the three criteria, suggesting the effectiveness of the selector. Further, we observe a discrepancy between ROUGE and human judgments (Fabbri et al., 2020), as summaries yielding the highest ROUGE scores are not always deemed admissible by human evaluators. We hope this dataset provides a testbed for future summarizers to be judged on their ability to produce multiple summaries per instance rather than a single summary.
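The admissibility decision from the 5 annotators' answers is a simple majority vote per criterion; a sketch, with yes/no answers encoded as 1/0:

```python
def admissible_under_all(votes_by_criterion):
    """A summary is admissible overall only if, for every criterion
    (coverage, truthfulness, grammaticality), the majority of annotator
    answers is yes (1)."""
    return all(sum(votes) > len(votes) / 2
               for votes in votes_by_criterion.values())
```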
In Table 3, we show example system summaries and the order in which summary tokens are produced. E.g., {2, 5} indicates that the two tokens of "BoJ" (Bank of Japan) are generated in the 2nd and 5th places in the summary. We find that our generator can effectively decide what content should be included in the summary given the limited space available, yielding summaries with varying levels of detail. Important spans such as "calls for calm" tend to be generated first, less vital ones later. Our findings corroborate the hypothesis that a masked language model may enable generation in a flexible word order (Liao et al., 2020). Further, we observe that the order in which tokens are generated is related to their dependencies ("call→for"), which supports the findings of Clark et al. (2019).

Conclusion
We investigate a new approach to neural abstractive summarization that focuses on producing multiple summary hypotheses with varying lengths and levels of detail. Our selectors are designed to identify summaries that have the optimal length and the best overall quality. The approach obtains state-of-the-art results on summarization benchmarks and opens up a potential new avenue for customizing summary selectors to suit users' needs.
Future work includes extending this research to long documents. Our confidence-driven generator and the selectors could potentially be extended to operate on spans of text (Joshi et al., 2020) rather than individual tokens, thus allowing for efficient generation of multiple summary hypotheses and identification of admissible summaries and/or summary segments.