Conditional Neural Generation using Sub-Aspect Functions for Extractive News Summarization

Much progress has been made in text summarization, fueled by neural architectures using large-scale training corpora. However, in the news domain, neural models easily overfit by leveraging position-related features due to the prevalence of the inverted pyramid writing style. In addition, there is an unmet need to generate a variety of summaries for different users. In this paper, we propose a neural framework that can flexibly control summary generation by introducing a set of sub-aspect functions (i.e. importance, diversity, position). These sub-aspect functions are regulated by a set of control codes that decide which sub-aspect to focus on during summary generation. We demonstrate that extracted summaries with minimal position bias are comparable with those generated by standard models that take advantage of position preference. We also show that news summaries generated with a focus on diversity can be preferred by human raters. These results suggest that a more flexible neural summarization framework providing more control options could be desirable in tailoring to different user preferences, which is useful since it is often impractical to articulate such preferences for different applications a priori.


Introduction
Text summarization aims to automatically generate a shorter version of the source content while retaining the most important information. As a straightforward and effective method, extractive summarization creates a summary by selecting and subsequently concatenating the most salient semantic units in a document. Recently, neural approaches, often trained in an end-to-end manner, have achieved favorable improvements on various large-scale benchmarks (Nallapati et al., 2017; Narayan et al., 2018a; Liu and Lapata, 2019).
Despite renewed interest and avid development in extractive summarization, there are still long-standing, unresolved challenges. One major problem is position bias, which is especially common in the news domain, where the majority of summarization research is conducted. In many news articles, sentences appearing earlier tend to be more important for summarization tasks (Hong and Nenkova, 2014), and this preference is reflected in reference summaries of public datasets. However, while this tendency is common due to the classic textbook writing style of the "inverted pyramid" (Scanlan, 1999), news articles can be presented in various ways. Other journalism writing styles include anecdotal lead, question-and-answer format, and chronological organization (Stovall, 1985). Therefore, salient information could also be scattered across the entire article, instead of being concentrated in the first few sentences, depending on the chosen writing style of the journalist.
As the "inverted pyramid" style is widespread in news articles (Kryscinski et al., 2019), neural models easily overfit on position-related features in extractive summarization tasks, because the data-driven learning setup latches onto the features that correlate most with the output. As a result, those models select the sentences at the very beginning of a document as the best candidates without considering the full context, resulting in sub-optimal models with fancy neural architectures that do not generalize well to other domains (Kedzie et al., 2018).
Additionally, according to Nenkova et al. (2007): "Content selection is not a deterministic process (Salton et al., 1997; Marcu, 1997; Mani, 2001). Different people choose different sentences to include in a summary, and even the same person can select different sentences at different times (Rath et al., 1961). Such observations lead to concerns about the advisability of using a single human model ...". These observations suggest that individuals differ on what they consider key information under different circumstances. This reflects the need to generate application-specific summaries, which is challenging without establishing appropriate expectations and knowledge of targeted readers prior to model development and ground-truth construction. However, publicly available datasets only provide one reference summary per document. Without any explicit instructions and targeted applications or user preferences, ground-truth construction for summarization becomes an under-constrained assignment (Kryscinski et al., 2019). Therefore, it is challenging for end-to-end models to generate alternative summaries without proper anchoring from reference summaries, making it harder for such models to reach their full potential.
In this work, we propose a flexible neural summarization framework that is able to provide more explicit control options when automatically generating summaries (see Figure 1). Since summarization has been regarded as a combination of sub-aspect functions (e.g. information, layout) (Carbonell and Goldstein, 1998; Lin and Bilmes, 2012), we follow the spirit of sub-aspect theory and adopt control codes on sub-aspects to condition summary generation. The advantages are two-fold: (1) It provides a systematic approach to investigate and analyze how one might minimize position bias in extractive news summarization in neural modeling. Most, if not all, previous work (e.g. Jung et al., 2019; Kryscinski et al., 2019) only focuses on analyzing the degree and prevalence of position bias. In this work, we take one step further and propose a research methodology to disentangle position bias from important and non-redundant summary content. (2) Text summarization needs are often domain or application specific, and it is difficult to articulate a priori what the user preferences are, thus requiring potential iterations to adapt and refine. However, human ground-truth construction for summarization is time-consuming and labor-intensive. Therefore, a more flexible summary generation framework could minimize manual labor and generate useful summaries more efficiently.
An ideal set of sub-aspect control codes should characterize different aspects of summarization well in a comprehensive manner but at the same time delineate a relatively clear boundary between one another to minimize the set size (Higgins et al., 2017). To achieve this, we adopt the sub-aspects defined in Jung et al. (2019): IMPORTANCE, DIVERSITY, and POSITION, and assess their characterization capability on the CNN/Daily Mail news corpus (Hermann et al., 2015) via quantitative analyses and unsupervised clustering. We utilize control codes based on these three sub-aspect functions to label the training data and implement our conditional generation approach with a neural selector model. Empirical results show that given different control codes, the model can generate output summaries of alternative styles while maintaining performance comparable to the state-of-the-art model; modulation with semantic sub-aspects can reduce systemic bias learned on a news corpus and improve potential generality across domains.

In Relation to Other Work
In text summarization, most benchmark datasets focus on the news domain, such as NYT (Sandhaus, 2008) and CNN/Daily Mail (Hermann et al., 2015), where the human-written summaries are used in both abstractive and extractive paradigms (Gehrmann et al., 2018). To improve the performance of extractive summarization, non-neural approaches explore various linguistic and statistical features such as lexical characteristics (Kupiec et al., 1995), latent topic information (Ying-Lang Chang and Chien, 2009), discourse analysis (Hirao et al., 2015; Liu and Chen, 2019), and graph-based modeling (Erkan and Radev, 2004; Mihalcea and Tarau, 2004). In contrast, neural approaches learn the features in a data-driven manner. Based on recurrent neural networks, SummaRuNNer is one of the earliest neural models (Nallapati et al., 2017). Much development in extractive summarization has been made via reinforcement learning (Narayan et al., 2018b), joint learning of scoring and ranking (Zhou et al., 2018), and deep contextual language models (Liu and Lapata, 2019).
Despite much development in recent neural approaches, there are still challenges such as corpus bias resulting from the prevalent "inverted pyramid" journalism writing style (Lin and Hovy, 1997), and system bias (Jung et al., 2019) stemming from position preference in the ground-truth. However, to date only analysis work has been done to characterize the position-bias problem and its ramifications, such as the inability to generalize across corpora or domains (Kedzie et al., 2018; Kryscinski et al., 2019). Few, if any, have attempted to resolve this long-standing problem of position bias using neural approaches. In this work, we take a first step by introducing sub-aspect functions for conditional extractive summarization. We explore the possibility of disentangling, during the summary generation process, the three sub-aspects that are commonly used to characterize summarization (Jung et al., 2019): POSITION for choosing sentences by their position, IMPORTANCE for choosing relevant and repeating content across the document, and DIVERSITY for ensuring minimal redundancy between summary sentences. In particular, we use these three sub-aspects as control codes for conditional training. To the best of our knowledge, this is the first work to apply auxiliary conditional codes to extractive summary generation.
In other NLP tasks, topic information is used as a conditional signal and applied to dialogue response generation (Xing et al., 2017) and pretraining of large-scale language models (Keskar et al., 2019), while sentiment polarity is used in text style transfer. In image style transfer, codes specifying color or texture are used to train conditional generative models (Mirza and Osindero, 2014; Higgins et al., 2017).

Similarity Metric: Semantic Affinity vs. Lexical Overlap
For benchmark corpora that are widely adopted, e.g. CNN/Daily Mail (Hermann et al., 2015), there are only golden abstractive summaries written by humans with no corresponding extractive oracle summaries. To convert the human-written abstracts to extractive oracle summaries, most previous work used the ROUGE score (Lin, 2004), which counts contiguous n-gram overlap, as the similarity criterion to rank and select sentences from the source content. Since ROUGE scores only conduct lexical matching using word overlapping algorithms, salient sentences from the source content that were paraphrased by human editors could be overlooked because their ROUGE scores would be low, while sentences with a high count of common words could get an inflated ROUGE score (Kryscinski et al., 2019).
Figure 2: Cumulative position distribution of oracles built on ROUGE (blue) and BertScore (orange). The x-axis is the ratio of article length; the y-axis is the cumulative percentage of summary sentences.
To tackle this drawback of ROUGE, we propose to apply the semantic similarity metric BertScore (Zhang et al., 2020) to rank the candidate sentences. BertScore has performed better than ROUGE and BLEU in sentence-level semantic similarity assessment (Zhang et al., 2020). Moreover, BertScore includes recall measures between reference and candidate sequences, which makes it more suitable than distance-based similarity measures (Wieting et al., 2019; Reimers and Gurevych, 2019) for summarization-related tasks, where there is an asymmetrical relationship between the reference and the generated text.

Oracle Construction and Evaluation
To build oracles with semantic similarity, we first segment sentences in source documents and human-written gold summaries. Then we convert the text to a semantically rich distributed vector space. For each sentence in a gold summary, we use BertScore to calculate its semantic similarity with candidates from the source content, then the sentence with the highest recall score is chosen. Candidates with a recall score lower than 0.5 are excluded to streamline the selection process.
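The selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_oracle` and `toy_recall` are hypothetical names, and `toy_recall` is a token-overlap stand-in used only so the example runs without a BERT model. In practice the `recall_fn` argument would be backed by BertScore recall.

```python
from typing import Callable, List, Optional


def toy_recall(reference: str, candidate: str) -> float:
    """Hypothetical stand-in for BertScore recall: the fraction of
    reference tokens that also appear in the candidate sentence."""
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    return sum(tok in cand_tokens for tok in ref_tokens) / len(ref_tokens)


def build_oracle(
    gold_sentences: List[str],
    source_sentences: List[str],
    recall_fn: Callable[[str, str], float],
    min_recall: float = 0.5,
) -> List[int]:
    """For each gold sentence, pick the source sentence with the highest
    recall score; candidates scoring below `min_recall` are excluded."""
    selected = set()
    for gold in gold_sentences:
        best_idx: Optional[int] = None
        best_score = float("-inf")
        for idx, candidate in enumerate(source_sentences):
            score = recall_fn(gold, candidate)
            if score >= min_recall and score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None:
            selected.add(best_idx)
    # Return the oracle in document order (an assumption of this sketch).
    return sorted(selected)


source = [
    "the cat sat on the mat",
    "dogs bark loudly at night",
    "stocks fell sharply on monday",
]
gold = ["a cat sat on a mat", "stocks fell on monday"]
oracle = build_oracle(gold, source, toy_recall)
```

A gold sentence with no candidate above the threshold contributes nothing to the oracle, so the oracle may be shorter than the gold summary.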
We observed that the oracle summaries generated through semantic similarity differ from those chosen by n-gram overlap. The positional distributions of the two schemes are different: early-sentence bias is less pronounced for the BertScore scheme (see Figure 2). To further evaluate the effectiveness of this oracle construction approach, we conducted two kinds of assessment. First, ROUGE scores were computed against the gold summaries. Table 1 shows that the scores of oracle summaries derived from BertScore are comparable to, though slightly lower than, those derived from ROUGE, which is not unexpected given that BertScore is mismatched with the ROUGE metric. Second, we conducted two human evaluations. In the first, we ranked the candidate summary pairs of 50 news samples based on their similarity to human-written gold summaries (Narayan et al., 2018a). Four linguistic analysts were asked to consider two aspects: informativeness and coherence (Radev et al., 2002). The evaluation score represents the likelihood of a higher ranking, and is normalized to [0, 1]. In the second, we adopted the question-answering paradigm (Liu and Lapata, 2019) to evaluate 30 selected samples. For each sentence in the gold summary, questions were constructed based on key information such as events and named entities. Questions where the answer can only be obtained by comprehending the full summary were also included. Human annotators were asked to answer these questions given an oracle summary. The extractive summaries constructed with BertScore score significantly higher in all human evaluations (see Table 1).

Sub-Aspect Features in News Summarization
Conditional generation often uses control codes as an auxiliary vector to adjust pre-defined style features. Classic examples include sentiment polarity in style transfer or physical attributes (e.g. color) in image generation (Higgins et al., 2017). However, for summarization it is challenging to pinpoint such intuitive or well-defined features, as the writing style could vary according to genre, topic, or editor preference.
In this work, we adopt position, importance and diversity as a set of sub-function features to characterize extractive news summarization (Jung et al., 2019). Considerations include: (1) "inverted pyramid" writing style is common in news articles, thus making layout or position a salient sub-aspect for summarization; (2) Importance sub-aspect indicates the assumption that repeatedly occurring content in the source document contains more important information; (3) Diversity sub-aspect suggests that selected salient sentences should maximize the semantic volume in a distributed semantic space (Lin and Bilmes, 2012;Yogatama et al., 2015).

Summary-Level Quantitative Analysis
We apply two methods to evaluate the compatibility and effectiveness of the sub-aspects we choose for extractive news summarization. First, we conduct a quantitative analysis on the CNN/Daily Mail corpus, based on the assumption that the writing style variability of summaries can be characterized through different combinations of sub-aspects (Lin and Bilmes, 2012).
For each source document, we converted all sentences to vector representations with the pre-trained contextual language model BERT (Devlin et al., 2019). For each sentence, we averaged the hidden states of all tokens as the sentence embedding. Similar to Jung et al. (2019), to obtain the subset of sentences which corresponds to the importance sub-aspect, we adopted an N-Nearest method which calculates an averaged Pearson correlation between one sentence and the rest for all source sentence vectors, and collected the top-k candidates with the highest scores (k equals the oracle summary length). To obtain the subset which corresponds to the diversity sub-aspect, we used an implementation of the QuickHull algorithm (Barber et al., 1996) to find vertices, which can be regarded as sentences that maximize the volume size in a projected semantic space. For the subset that corresponds to the position sub-aspect, the first 4 sentences in the source document were chosen.
Figure 4: Autoencoder with adversarial training strategy for unsupervised clustering of the sentence-level distribution of sub-aspect functions.
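As a rough illustration of the importance and position subsets, the sketch below scores each sentence by its average Pearson correlation with all other sentences and keeps the top k. The function names are hypothetical, sentence embeddings are assumed to be given as plain lists of floats, and a zero-variance vector is treated as having correlation 0 (an assumption of this sketch). The diversity subset via QuickHull is omitted to keep the example dependency-free.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: undefined correlation counts as 0
    return cov / sqrt(var_x * var_y)


def importance_subset(embeddings: List[Sequence[float]], k: int) -> List[int]:
    """Indices of the k sentences with the highest average correlation
    to every other sentence in the document."""
    scores = []
    for i, vec in enumerate(embeddings):
        others = [pearson(vec, other)
                  for j, other in enumerate(embeddings) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def position_subset(num_sentences: int, first_n: int = 4) -> List[int]:
    """Indices of the first `first_n` sentences of the document."""
    return list(range(min(first_n, num_sentences)))


# Toy 3-dimensional "embeddings"; sentence 2 is anti-correlated with the rest.
toy_embeddings = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.5],
    [3.0, 2.0, 1.0],
    [1.0, 2.0, 3.1],
]
important = importance_subset(toy_embeddings, k=3)
```

In the toy data, the sentence whose vector runs against the others receives the lowest average correlation and is left out of the importance subset.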
With these three sets of sub-aspects, we quantified the distribution of different sub-aspects on the extractive oracle constructed in Section 3. An oracle summary is mapped to the importance sub-aspect when at least two sentences in the summary are in the subset of the importance sub-aspect, and likewise for the other sub-aspects. For oracle summaries shorter than 3 sentences (19% of the oracle), only one sentence was used to determine which sub-aspect they would be mapped to. Note that the mapping is many to many; i.e. each summary can be mapped to more than one sub-aspect. Figure 3 displays the distribution of the three sub-aspect functions of the oracle summaries, where position occupies the largest area. This visualization shows that the three sub-aspects represent distinct linguistic attributes but could overlap with one another.
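The mapping rule can be written down compactly. The sketch below is an illustration with hypothetical names (`map_summary`, the example index sets); it assumes each sub-aspect subset is given as a set of sentence indices.

```python
from typing import Dict, Iterable, List, Set


def map_summary(oracle_indices: Iterable[int],
                subsets: Dict[str, Set[int]]) -> List[str]:
    """Return every sub-aspect an oracle summary is mapped to.

    A summary of three or more sentences needs at least two sentences in
    a sub-aspect subset; a shorter summary needs only one. The mapping is
    many to many, so several names (or none) may be returned.
    """
    oracle = set(oracle_indices)
    required = 2 if len(oracle) >= 3 else 1
    return sorted(name for name, members in subsets.items()
                  if len(oracle & members) >= required)


# Hypothetical sub-aspect subsets for one document.
example_subsets = {
    "importance": {2, 5, 7},
    "diversity": {1, 5, 9},
    "position": {0, 1, 2, 3},
}
mapped = map_summary([1, 2, 5], example_subsets)
```

An empty result corresponds to a summary that is not mapped to any sub-aspect.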

Sentence-Level Unsupervised Analysis
According to the mapping algorithm in the previous section, 39% of summaries were not mapped to any sub-aspect. This finding motivated us to investigate the distribution of sub-aspect functions at the sentence level. Thus, we conducted unsupervised clustering, assuming that samples within one cluster are most similar to each other and can be represented by the dominant feature.
Figure 5: Sentence-level clustering result labeled with sub-aspect features. The x-axis is the cluster index; the y-axis is the proportion of sub-aspect features in each cluster.
As shown in Figure 4, we use an autoencoder architecture with adversarial training to model the correlation between document and summary sentences in the semantic space. The encoding component receives the source document representation and one summary sentence representation as input, and compresses them to a latent feature vector. Then, the latent vector and document vector are concatenated and fed to the decoding component to reconstruct the sentence vector. To obtain a compact yet effective latent vector representing the correlation between the source and summary, we adopt an adversarial training strategy. More specifically, the adversarial decoder we include aims to reconstruct the sentence vector directly from the latent vector. During the training process, we update the parameters of the autoencoder with an adversarial penalty (see Appendix B for implementation details). After training this autoencoder, we conduct k-means clustering (k = 5) on the latent representation vectors. Then, we analyze the clustering output with the sentence-level labels of sub-aspect functions as defined in Section 4.2. As shown in Figure 5, sentences with the position sub-aspect are distributed relatively equally across the clusters, while importance and diversity dominate in different clusters. Based on the clustering results, we assign the dominant sub-aspect function of each cluster to the unmapped sentences in that cluster. For instance, diversity is assigned to unmapped sentences in clusters 0 and 1, while importance is assigned to those in clusters 3 and 4. By doing this, we reduce approximately 78% of unmapped sentences and further reduce 35% of unmapped summaries using the same criteria as in Section 4.2.
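The label-propagation step that follows clustering can be sketched as below. This is a simplified illustration under stated assumptions: cluster ids are taken as given (for example from k-means on the latent vectors), each sentence carries a possibly empty set of sub-aspect labels, and "dominant" is read as the most frequent label in the cluster. The function name is hypothetical, and the autoencoder itself is not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def assign_dominant(cluster_ids: List[int],
                    labels: List[Set[str]]) -> List[Set[str]]:
    """Give each unmapped sentence (empty label set) the most frequent
    sub-aspect label of the cluster it belongs to."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for cluster, sentence_labels in zip(cluster_ids, labels):
        counts[cluster].update(sentence_labels)

    # Clusters with no labeled sentence at all have no dominant label.
    dominant = {cluster: counter.most_common(1)[0][0]
                for cluster, counter in counts.items() if counter}

    filled = []
    for cluster, sentence_labels in zip(cluster_ids, labels):
        if sentence_labels:
            filled.append(set(sentence_labels))
        elif cluster in dominant:
            filled.append({dominant[cluster]})
        else:
            filled.append(set())
    return filled


toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = [
    {"diversity"}, {"diversity", "position"}, set(),
    {"importance"}, {"importance"}, set(),
]
propagated = assign_dominant(toy_clusters, toy_labels)
```

Already-labeled sentences are left untouched; only sentences with an empty label set inherit the cluster's dominant sub-aspect.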

Conditional Neural Generation
In this section, we construct a set of control codes to specify the three sub-aspect features described in Section 4 and use them to label the oracle summaries constructed in Section 3. We then propose a neural extractive model with a conditional learning strategy for more flexible summary generation.

Control Code Specification Scheme
The control codes are constructed in the form of [importance, diversity, position] to specify sub-aspect features. We can flexibly indicate the 'ON' and 'OFF' state of each sub-aspect by switching its corresponding value to 1 or 0, thus enabling disentanglement of each sub-aspect function. For instance, the control code [1, 0, 0] would tell the model to focus more on importance during sentence scoring and selection, while [0, 1, 1] would focus on both diversity and position. Indeed, switching the position code to 0 would help the model obtain minimal position bias. Note that this does not mean the first few sentences would not be selected, as there is overlap between position, importance and diversity (shown in Figure 3). There are 8 control codes under this specification scheme, and we expect this code design can provide the model with sub-aspect conditions for generating summaries.
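To make the scheme concrete, the short sketch below enumerates the 8 codes and converts between a code and the sub-aspects it switches on. The helper names are hypothetical and only illustrate the [importance, diversity, position] ordering described above.

```python
from itertools import product
from typing import Iterable, List

# Order follows the [importance, diversity, position] convention.
ASPECTS = ("importance", "diversity", "position")


def all_control_codes() -> List[List[int]]:
    """Every ON/OFF combination of the three sub-aspects (2^3 = 8 codes)."""
    return [list(code) for code in product((0, 1), repeat=len(ASPECTS))]


def describe(code: Iterable[int]) -> List[str]:
    """Names of the sub-aspects that a control code switches ON."""
    return [name for name, bit in zip(ASPECTS, code) if bit == 1]


def code_for(active_aspects: Iterable[str]) -> List[int]:
    """Control code with a 1 for every sub-aspect in `active_aspects`."""
    active = set(active_aspects)
    return [1 if name in active else 0 for name in ASPECTS]


codes = all_control_codes()
```

For example, `describe([0, 1, 1])` names diversity and position, and `code_for({"importance"})` gives the importance-only code.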

Neural Extractive Selector
Given a document $D$ containing a number of sentences $[s_0, s_1, \ldots, s_n]$, the content selector assigns a score $y_i \in [0, 1]$ to each sentence $s_i$, indicating its probability of being included in the summary. A neural model can be trained as an extractive selector for text summarization tasks by contextually modeling the source content.
Here, we implemented and adapted the neural extractive selector in a sequence labeling manner (Kedzie et al., 2018). As shown in Figure 6, the model consists of three components: a contextual encoding component, a selection modeling component and an output component. First, we used BERT in the contextual encoding component to obtain feature-rich sentence-level representations. Then, in the training process, we concatenated these sentence embeddings with the pre-calculated control code vector and fed them to the next layer, which models the contextual hidden states with the conditional signals. Next, a linear layer with a Sigmoid function receives the hidden states and produces a score between 0 and 1 for each segment as the probability of extractive selection. While this architecture is straightforward, it has been shown to be competitive when combined with state-of-the-art contextual representations (Liu and Lapata, 2019).
In our setting, sentences were processed by a subword tokenizer (Wu et al., 2016), and their embeddings were initialized with the 768-dimension "base-uncased" BERT (Devlin et al., 2019) and were fixed during training. Lengthy source documents were not truncated. For the selection modeling component, we tried a multi-layer bi-directional LSTM (Schuster and Paliwal, 1997) and a Transformer network (Vaswani et al., 2017), and found empirically that a two-layer Bi-LSTM performed best (see Appendix C for more implementation details). During testing, the sentences with the top-3 selection probabilities were extracted as the output summary, and we used the Trigram Blocking strategy (Paulus et al., 2017) to reduce redundancy.
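The test-time selection step can be sketched as follows. This is an illustration rather than the authors' code: `select_summary` is a hypothetical name, tokens are obtained by lowercased whitespace splitting (the paper uses a subword tokenizer), and the selected sentences are returned in document order, which is an assumption of this sketch.

```python
from typing import List, Sequence, Set, Tuple


def trigrams(sentence: str) -> Set[Tuple[str, ...]]:
    """All word trigrams of a sentence (whitespace tokens, lowercased)."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_summary(sentences: Sequence[str],
                   scores: Sequence[float],
                   top_k: int = 3) -> List[int]:
    """Pick up to `top_k` sentences by descending score, skipping any
    sentence that shares a trigram with one already selected."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    seen: Set[Tuple[str, ...]] = set()
    for idx in ranked:
        candidate = trigrams(sentences[idx])
        if candidate & seen:
            continue  # trigram blocking: redundant with the summary so far
        chosen.append(idx)
        seen |= candidate
        if len(chosen) == top_k:
            break
    return sorted(chosen)


doc = [
    "the market fell sharply today",
    "the market fell sharply again",
    "officials promised a swift response",
    "analysts expect a slow recovery",
    "weather was mild",
]
probabilities = [0.9, 0.8, 0.7, 0.6, 0.1]
summary_indices = select_summary(doc, probabilities)
```

In the toy document, the second-ranked sentence repeats a trigram of the top-ranked one and is skipped in favor of the next candidates.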

Experimental Results and Analysis

Quantitative Analysis
To test the possibility of reducing position bias by conditioning summary generation, we switched the position code to 0 and compared the position of selected sentences in summaries generated by our model to that of the state-of-the-art baseline BertEXT, which is based on fine-tuning BERT (Liu and Lapata, 2019). The results show that BertEXT has a 50% chance of choosing the first 10% of sentences in the document. While the proposed framework still has a stronger tendency to choose sentences from the first 30% of the sentences, its position distribution is flattened compared to that of BertEXT.
Figure 8: Sub-aspect mapping of generated summaries with the importance-focus code [1,0,0]. Left panel: one sentence in the summary belongs to the importance sub-aspect. Right panel: two sentences in the summary belong to the importance sub-aspect. Contour lines denote the number of generated summaries.
We then switched the importance and diversity codes to 1 in turn and categorized the generated summaries into the subset of each sub-aspect function as in Section 4.2. As shown in Figures 8 and 9, the share of summaries falling in the importance and diversity subsets is higher when the corresponding control codes are ON. Together, these results demonstrate the feasibility of our proposed framework, which can generate output summaries of alternative styles when given different control codes.

Automatic Evaluation
We calculated F1 ROUGE scores for generated summaries under the 8 control codes, and compared them with the BertScore oracle (see Section 3), the Lead-3 baseline, which selects the first 3 sentences as the summary, and several competitive extractive models: SummaRuNNer (Nallapati et al., 2017), TransformerEXT and BertEXT (Liu and Lapata, 2019). From Table 2 we observe that: (1) The summary generated from code [0,0,1] is similar to Lead-3, but the model can dynamically learn positional features not limited to the first 3 sentences, while isolating out diversity and importance features. (2) Only focusing on the importance sub-aspect leads to the worst performance, but performance can be improved when considering other sub-aspects. (3) Focusing on the diversity sub-aspect (i.e. code [0,1,0]) can generate results comparable to strong baselines.
Figure 9: Sub-aspect mapping of generated summaries with the diversity-focus code [0,1,0]. Left panel: one sentence in the summary belongs to the diversity sub-aspect. Right panel: two sentences in the summary belong to the diversity sub-aspect. Contour lines denote the number of generated summaries.

Human Evaluation
In addition to automatic evaluation, a human evaluation was conducted by experienced linguistic analysts using Best-Worst Scaling (Louviere et al., 2015). Analysts were given 50 news articles randomly chosen from the CNN/Daily Mail test set and the corresponding summaries from 6 systems: the oracle, BertEXT, three codes disabling the position sub-aspect, and one code enabling position. They were asked to decide the best and the worst summaries for each document in terms of informativeness and coherence (Radev et al., 2002; Narayan et al., 2018a). We collected judgments from 5 human evaluators for each comparison. For each evaluator, the documents were randomized differently. The order of summaries for each document was also shuffled differently for each evaluator. The score of a model was calculated as the percentage of times it was labeled as best minus the percentage of times it was labeled as worst, ranging from -1.0 to 1.0. Since these differences come in pairs, the evaluation scores of all summary types sum to zero. We observed that summaries under the diversity code are more favored than those under importance, and that their combination can produce even better results (see Table 3). These findings resonate with those from the automatic evaluation, suggesting that whether the evaluation metric is lexical overlap (ROUGE) or human judgement, the diversity sub-aspect plays a more salient role than importance. Moreover, both automatic and human evaluations show that summarizing with semantic-related sub-aspect condition codes achieves reasonable summaries. Examples in Appendix D show that generated summaries are not position-biased yet still preserve key information from the source content.
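The Best-Worst Scaling score described above is simple enough to state in code. The sketch below assumes each judgment is a (best, worst) pair of system names; the function name and the toy judgments are hypothetical and are not data from the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def best_worst_scores(judgments: List[Tuple[str, str]],
                      systems: Sequence[str]) -> Dict[str, float]:
    """Score per system: fraction of judgments in which it was chosen as
    best minus the fraction in which it was chosen as worst."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    total = len(judgments)
    return {name: (best[name] - worst[name]) / total for name in systems}


# Made-up judgments for three hypothetical systems A, B and C.
toy_judgments = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "C")]
toy_scores = best_worst_scores(toy_judgments, ["A", "B", "C"])
```

Because every judgment adds one "best" and one "worst" vote, the scores over all systems sum to zero, and each score lies between -1.0 and 1.0.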

Inference on Samples of Shuffled Sentences
To further assess the decoupling between sub-aspect signals and the positional information learned by the model, we conducted an experiment on samples with shuffled sentences, similar to the document shuffle in Kedzie et al. (2018). In our setting, we only introduce the shuffle process in the model inference phase. We shuffled the sentences of all test samples used in Section 6.2, then applied the trained model to generate the predicted summaries. As shown in Table 4, outputs under the position sub-aspect and from BertEXT suffer a significant drop in performance when we shuffle the sentence order. By comparison, there is a far smaller decrease between the in-order and shuffled samples under the diversity and importance control codes. This demonstrates that the latent features of these two semantic-related sub-aspects rely less on position information, and suggests that applying semantic sub-aspects in the training process can reduce the systemic bias learned by the model on a corpus with strong position preference.
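A minimal sketch of the inference-time shuffle is given below. It only shows how a document can be permuted reproducibly and how indices selected on the shuffled document can be traced back to the original order; the function names and the fixed seed are assumptions of this illustration, and the summarization model itself is not shown.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_document(sentences: Sequence[str],
                     seed: int = 0) -> Tuple[List[str], List[int]]:
    """Return the shuffled sentences and the permutation used, where
    shuffled[i] == sentences[order[i]]."""
    rng = random.Random(seed)  # local generator keeps the shuffle reproducible
    order = list(range(len(sentences)))
    rng.shuffle(order)
    return [sentences[i] for i in order], order


def to_original_indices(selected_shuffled: Sequence[int],
                        order: Sequence[int]) -> List[int]:
    """Map indices selected on the shuffled document back to the
    positions of the same sentences in the original document."""
    return sorted(order[i] for i in selected_shuffled)


document = ["s0", "s1", "s2", "s3", "s4", "s5"]
shuffled, permutation = shuffle_document(document, seed=13)
original_positions = to_original_indices([0, 1, 2], permutation)
```

Keeping the permutation makes it possible to compare which original sentences are selected with and without shuffling.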

Inference on AMI Meeting Corpus
We also conducted an inference experiment on a less position-biased corpus. The AMI corpus (Carletta et al., 2005) is a collection of meetings annotated with text transcriptions and human-written summaries. Different from news summarization, meeting summaries are abstractive with extracted keywords. Unlike the previous comparison work in Kedzie et al. (2018), we did not train the model from scratch with the AMI training set. Instead, we only applied the pre-trained model from Section 6 (without any fine-tuning) for summarization inference on the AMI test set (20 meeting transcript-summary pairs). Table 5 shows that summaries under the importance code obtain the highest ROUGE-1 and ROUGE-2 scores, better than the best-reported model in Kedzie et al. (2018). Not surprisingly, summaries under the position code do not perform well, as there is less position bias in AMI. These findings suggest that our models with semantic-related control codes generalize across domains.

Conclusion
We proposed a neural framework for conditional extractive news summarization. In particular, sub-aspect functions of importance, diversity and position are used to condition summary generation. This framework enables us to reduce position bias, a long-standing problem in news summarization, in generated summaries while preserving comparable performance with other standard models. Moreover, our results suggest that with conditional learning, summaries can be more efficiently tailored to different user preferences and application needs.