A Closer Look at Data Bias in Neural Extractive Summarization Models

In this paper, we take stock of the current state of summarization datasets and explore how different factors of datasets influence the generalization behaviour of neural extractive summarization models. Specifically, we first propose several properties of datasets that matter for the generalization of summarization models. Then we build the connection between priors residing in datasets and model designs, analyzing how different properties of datasets influence the choices of model structure design and training methods. Finally, taking a typical dataset as an example, we rethink the process of model design based on the above analysis. We demonstrate that, with a deep understanding of the characteristics of datasets, a simple approach can bring significant improvements over the existing state-of-the-art model.


Introduction
Neural network-based models have achieved great success on summarization tasks (See et al., 2017; Celikyilmaz et al., 2018; Jadhav and Rajan, 2018). Current studies on summarization explore the possibility of optimization in terms of networks' structures (Zhou et al., 2018; Chen and Bansal, 2018; Gehrmann et al., 2018), improvements in terms of training schemas (Narayan et al., 2018; Wu and Hu, 2018; Chen and Bansal, 2018), or the fusion of large pre-trained knowledge (Peters et al., 2018; Devlin et al., 2018; Liu, 2019; Dong et al., 2019). More recently, Zhong et al. (2019) conduct a comprehensive analysis of why existing summarization systems perform so well from the above three aspects. Despite their success, a relatively missing topic is to analyze and understand the impact on the models' generalization ability from a dataset perspective. With the emergence of more and more summarization datasets (Sandhaus, 2008; Nallapati et al., 2016; Cohan et al., 2018; Grusky et al., 2018), the time is ripe to bridge the gap between the insufficient understanding of the nature of datasets themselves and the increasing improvement of learning methods.
In this paper, we take a step towards addressing this challenge by using neural extractive summarization models as an interpretable testbed and investigating how to quantify the characteristics of datasets, so that we can explain the behaviour of our models and design new ones. Specifically, we seek to answer two main questions. Q1: Different summarization datasets present diverse characteristics, so what bias do these dataset choices introduce, and how does it influence a model's generalization ability? We explore two types of factors, constituent factors and style factors, and analyze how each affects the generalization of neural summarization models. These factors can help us diagnose the weaknesses of existing models.
Q2: How do different properties of datasets influence the choices of model structure design and training schemas? We propose several measures and examine their ability to explain how different model architectures, training schemas, and pre-training strategies react to various properties of datasets.
Our contributions can be summarized as follows. Main contributions: 1) For the summarization task itself, we diagnose the weaknesses of existing learning methods in terms of networks' structures, training schemas, and pre-trained knowledge. Some observations could instruct future researchers on what matters for the text summarization task and point the way toward a new state-of-the-art performance. 2) We show that a comprehensive understanding of a dataset's properties guides us to design a more reasonable model. We hope to encourage future research on how characteristics of datasets influence the behaviour of neural networks. We summarize our observations as follows: 1) Existing models under-utilize the nature of the training data. We demonstrate that a simple training method on CNN/DM (dividing the training set based on domain) can achieve a significant improvement. 2) BERT is not a panacea and will fail in some situations; the improvement brought by BERT is related to the style factors defined in this paper. 3) It is difficult to handle the hard cases (defined by the style factors) via architecture design or pre-trained knowledge under the extractive framework. 4) With a sufficient understanding of the nature of datasets, a more reasonable data partitioning method (based on constituent factors) can be mined.

Related Work
We briefly outline connections and differences to the following related lines of research.
Neural Extractive Summarization Recently, neural network-based models have achieved great success in extractive summarization (Celikyilmaz et al., 2018; Jadhav and Rajan, 2018; Liu, 2019). Existing works on text summarization roughly fall into one of three classes: exploring networks' structures with suitable bias (Cheng and Lapata, 2016; Nallapati et al., 2017; Zhou et al., 2018); introducing new training schemas (Narayan et al., 2018; Wu and Hu, 2018; Chen and Bansal, 2018); and incorporating large pre-trained knowledge (Peters et al., 2018; Devlin et al., 2018; Liu, 2019; Dong et al., 2019). Instead of exploring the possibility of a new state-of-the-art along one of the above three lines, in this paper we aim to bridge the gap between the lack of understanding of the characteristics of the datasets and the increasing development of the above three learning methods. Concurrent with our work, Jung et al. (2019) conduct a quite similar analysis of dataset biases and propose three factors that matter for the text summarization task. One major difference between the two works is that we additionally focus on how dataset biases influence the design of models.
Understanding the Generalization Ability of Neural Networks While neural networks have shown superior generalization ability, it remains largely unexplained. Recently, researchers have begun to take steps towards understanding the generalization behaviour of neural networks from the perspective of network architectures or optimization procedures (Schmidt et al., 2018; Baluja and Fischer, 2017; Zhang et al., 2016; Arpit et al., 2017). Different from these works, in this paper we claim that interpreting the generalization ability of neural networks should be built on a good understanding of the characteristics of the data.

Learning Methods
Generally, given a dataset D, different learning methods try to explain the data in diverse ways and show different generalization behaviours. Existing learning methods for extractive summarization systems vary in architecture designs, pre-training strategies, and training schemas.
Architecture Designs Architecturally speaking, most existing extractive summarization systems consist of three major modules: a sentence encoder, a document encoder, and a decoder.
In this paper, our architectural choices vary with two types of document encoders: LSTM (Hochreiter and Schmidhuber, 1997) and Transformer (Vaswani et al., 2017), while we keep the sentence encoder (a convolutional neural network) and the decoder (sequence labelling) unchanged. The base model in all experiments refers to the Transformer equipped with sequence labelling.
Pre-trained Strategies To explore how different pre-training strategies influence the model, we consider two types of pre-trained knowledge: Word2vec (Mikolov et al., 2013) as an exemplar of non-contextualized word embeddings, and BERT (Devlin et al., 2018) as a contextualized word pre-trainer.
Training Schemas In general, we train a monolithic model to fit the dataset, but in particular, when the data itself has some special properties, we can introduce different training methods to fully exploit all the information contained in the data.

1. Multi-domain learning The basic idea of multi-domain learning in this paper is to introduce the domain tag as a low-dimensional vector which augments learned representations. A domain-aware model makes it possible to learn domain-specific features.

2. Meta-learning We also try to make models aware of different distributions via meta-learning. Specifically, for each iteration, we sample several domains as meta-train and the others as meta-test. The meta-test gradients are combined with the meta-train gradients to update the model.
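The meta-train/meta-test update above can be sketched with a simple first-order scheme. The exact combination rule, learning rates, and the toy objective below are illustrative assumptions, not the paper's specification:

```python
import numpy as np

def meta_step(params, domains, loss_grad, lr_inner=0.1, lr_outer=0.1):
    """One meta-iteration: split domains into meta-train / meta-test,
    combine their gradients, and update the shared parameters.
    (First-order sketch; a deterministic split is used for clarity.)"""
    meta_train, meta_test = domains[:-1], domains[-1]
    # Gradient averaged over meta-train domains
    g_train = np.mean([loss_grad(params, d) for d in meta_train], axis=0)
    # Virtual update, then gradient on the held-out meta-test domain
    adapted = params - lr_inner * g_train
    g_test = loss_grad(adapted, meta_test)
    # Combine meta-train and meta-test gradients for the final update
    return params - lr_outer * (g_train + g_test)

# Toy problem: fit a scalar to per-domain targets under squared loss
targets = {"cnn": 1.0, "dm": 1.2, "nyt": 0.8}
grad = lambda w, d: 2 * (w - targets[d])  # d/dw of (w - t)^2
w = 0.0
for _ in range(200):
    w = meta_step(w, ["cnn", "dm", "nyt"], grad)
```

On this toy problem the parameter settles between the meta-train and meta-test targets, illustrating how the held-out domain's gradient regularizes the update rather than being fit last.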

Datasets
We explore four mainstream news summarization datasets (CNN/DM, Newsroom, NYT50, and DUC2002), which vary in their publication sources. We also modify two large-scale scientific paper datasets (arXiv and PubMed) to investigate the characteristics of different domains. Detailed statistics are shown in Table 2.

Quantifying Characteristics of Text Summarization Datasets
In this paper, we present four measures to quantify the characteristics of summarization datasets, which can be grouped into two types: constituent factors and style factors.

Constituent Factors
Motivation When a neural summarization model determines whether a sentence should be extracted, the representation of the sentence consists of two components: a position representation, which indicates the position of the sentence in the document, and a content representation, which contains the semantic information of the sentence. Therefore, we define the position and content information of the sentence as constituent factors, aiming to explore how the selected sentences in the test set relate to the training set in terms of position and content.

Positional Information
Positional Value (P-Value) Given a document D = {s_1, ..., s_n}, for each sentence s_i with label y_i = 1, we introduce the notion of a positional value p_i ∈ {1, ..., K}, whose value is the output of the mapping function p_i = f(i).
Positional Coverage Rate (PCR) Taking the positional value p as a discrete random variable, we can define its discrete probability distribution over a dataset D as P(p = u) = N_u / N_sent, where N_u denotes the number of sentences with p = u and N_sent represents the number of sentences with y_i = 1 in dataset D.
Based on the above definition, for any two datasets D_A and D_B, we can quantify the proximity of their positional value distributions as η_p(D_A, D_B) = exp(−KL(P_A ∥ P_B)), where KL(·) denotes the KL-divergence and P_A and P_B represent the positional value distributions over the two datasets. Datasets with similar positional value distributions have a large PCR η_p.
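As a sketch, the positional value distribution and a KL-based proximity score can be computed as follows. The exp(−KL) mapping and the smoothing constant are assumptions made for illustration:

```python
import math
from collections import Counter

def p_value_dist(p_values, K=5):
    """Empirical distribution of positional values p in {1, ..., K}."""
    counts = Counter(p_values)
    n = len(p_values)
    # Tiny smoothing keeps the KL-divergence finite for empty buckets
    return [(counts.get(u, 0) + 1e-9) / (n + K * 1e-9) for u in range(1, K + 1)]

def pcr(p_a, p_b, K=5):
    """Positional Coverage Rate: large when the two positional value
    distributions are close (mapped through exp(-KL) by assumption)."""
    P, Q = p_value_dist(p_a, K), p_value_dist(p_b, K)
    kl = sum(p * math.log(p / q) for p, q in zip(P, Q))
    return math.exp(-kl)
```

Identical distributions give a PCR of 1, and the score decays toward 0 as the positional distributions diverge.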

Content Information
Table 2: Detailed statistics of the six datasets. Density and Compression are the style factors of Section 4.2. Lead-k indicates the ROUGE score of the first k sentences in the document and Ext-Oracle indicates the ROUGE score of the ground-truth sentences; they represent the lower and upper bound of extractive models, respectively. The figure in parentheses after each dataset denotes the number of sentences extracted in Lead-k, which is close to the average number of Ext-Oracle labels.

Content Value (C-Value) Given a dataset D, we want to find the patterns that appear most frequently in the ground truth of D and score them. For each sentence in the ground truth, we remove stop words and punctuation, replace all numbers with "0", and lemmatize each token. After this pre-processing, we treat n-grams (n > 1) as the patterns of D and calculate a score ϕ(pt_i, D) = N_{pt_i} / Σ_j N_{pt_j} for each pattern, where N_{pt_i} denotes the number of occurrences of the i-th pattern.
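A minimal sketch of the pattern scoring, assuming ϕ is a relative frequency over the pre-processed ground-truth sentences (tokenization, stop-word removal, and lemmatization are presumed already done):

```python
from collections import Counter
from itertools import chain

def patterns(tokens, ns=(2, 3)):
    """All n-grams (n > 1) of a token sequence, used as candidate patterns."""
    return chain.from_iterable(
        (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)) for n in ns)

def pattern_scores(ground_truth_sents):
    """Score phi(pt, D) for each pattern as its relative frequency over the
    ground-truth sentences of D (the normalization is our assumption)."""
    counts = Counter(chain.from_iterable(patterns(s) for s in ground_truth_sents))
    total = sum(counts.values())
    return {pt: c / total for pt, c in counts.items()}
```

The resulting dictionary maps each bigram/trigram pattern to a score in [0, 1], with all scores summing to 1.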

Content Coverage Rate (CCR)
We introduce the notion of η_c to measure the degree of content overlap between the training and test sets, restricted to the sentences carrying ground-truth labels.
η_c(D_tr, D_te) = (1 / |φ(D_te)|) Σ_{pt_i ∈ φ(D_te)} max_{pt_j ∈ φ(D_tr)} Sim(pt_i, pt_j), where φ denotes the set of patterns that are helpful for picking out ground-truth sentences, Sim(·) measures the similarity of two patterns, and D_tr and D_te represent the training set and test set of D, respectively.
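The CCR can be sketched as follows, with Jaccard token overlap standing in for the unspecified Sim(·); the inputs are dicts mapping n-gram tuples to their ϕ scores:

```python
def ccr(train_scores, test_scores, top_k=100):
    """Content Coverage Rate: average best similarity between the top
    patterns of the test set and those of the training set.
    (Jaccard overlap is an assumed stand-in for Sim.)"""
    top = lambda sc: [pt for pt, _ in sorted(sc.items(), key=lambda kv: -kv[1])[:top_k]]
    phi_tr, phi_te = top(train_scores), top(test_scores)
    jac = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
    return sum(max(jac(pt, q) for q in phi_tr) for pt in phi_te) / len(phi_te)
```

A CCR of 1 means every salient test-set pattern appears verbatim among the training set's salient patterns; disjoint pattern vocabularies give 0.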

Style Factors
Motivation Different from the constituent factors, style factors influence the generalization ability of summarization models by modulating the learning difficulty of samples' features.
For this type of factor, we do not propose a new measure, but adopt the indicators DENSITY and COMPRESSION proposed by Grusky et al. (2018), which were originally used to describe the diversity between datasets during the construction of new datasets. We claim that our contribution here is to focus on understanding these metrics and exploring why they affect the performance of summarization models, which is missing from previous work. More importantly, only when we understand how these metrics affect the performance of the models can we use them to explain some of the differences in model generalization. (For the pattern set φ in the CCR, we choose the 100 highest-scoring bigrams and trigrams.)

Density
Density is used to qualitatively measure the degree to which a summary is derivative of a document (Grusky et al., 2018). Specifically, given a document D and its corresponding summary S, Density(D, S) measures how strongly the words of the summary are drawn from contiguous fragments of the document: Density(D, S) = (1/|S|) Σ_{f ∈ F(D,S)} |f|², where |·| denotes the number of words and F(D, S) is the set of extractive fragments, the longest token sequences shared by the document and the summary.
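A direct implementation of the greedy fragment matching and the density score; the greedy strategy below is our reading of the procedure described by Grusky et al. (2018):

```python
def extractive_fragments(doc, summ):
    """Greedily compute the shared fragments F(D, S): at each summary
    position, take the longest token run that also appears contiguously
    in the document, then continue after it."""
    frags, i = [], 0
    while i < len(summ):
        best = 0
        for j in range(len(doc)):
            k = 0
            while (i + k < len(summ) and j + k < len(doc)
                   and summ[i + k] == doc[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            frags.append(summ[i:i + best])
            i += best
        else:
            i += 1
    return frags

def density(doc, summ):
    """DENSITY(D, S) = (1/|S|) * sum of squared fragment lengths."""
    return sum(len(f) ** 2 for f in extractive_fragments(doc, summ)) / len(summ)
```

A fully extracted summary consisting of one fragment has density equal to its own length, while a fully abstractive summary has density 0.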

Compression
Compression is used to characterize the word ratio between the document and the summary (Grusky et al., 2018): Compression(D, S) = |D| / |S|.

For the P-Value, the threshold set can be denoted as {t_0 = 0, t_1, ..., t_K = ∞}. We calculate a position score Pos(i) for each sentence s_i, which considers both the absolute and relative position of the sentence in the document, and define p_i = k if t_{k−1} ≤ Pos(i) < t_k. In the experiments, we set K = 5 and choose {0, 3, 7, 15, 35, ∞} as the threshold set.
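The compression ratio and the threshold-based P-Value bucketing can be sketched as follows; the position score Pos(i) itself is abstracted away, so any numeric score can be passed in:

```python
import bisect

def compression(doc_words, summ_words):
    """COMPRESSION(D, S) = |D| / |S|, the word-count ratio."""
    return len(doc_words) / len(summ_words)

THRESHOLDS = [0, 3, 7, 15, 35]  # t_0..t_4; t_5 is infinity

def positional_value(pos, thresholds=THRESHOLDS):
    """Map a position score Pos(i) to a bucket p in {1, ..., K=5}:
    p = k iff t_{k-1} <= Pos(i) < t_k."""
    return bisect.bisect_right(thresholds, pos)
```

For instance, a position score of 0 falls in bucket 1, a score of 3 in bucket 2, and any score of 35 or above in bucket 5.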
For the C-Value, we calculate a score for each sentence based on the pattern scores from the training set: C(s_i) = Σ_{pt_j ∈ s_i} ϕ(pt_j, D_tr), where s_i denotes a sentence in the ground truth of the test set. The score indicates the degree of overlap between the sentence and the important patterns of the training set. We then sort all sentences in ascending order of score and divide them into five intervals with the same number of sentences.
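A sketch of the sentence-level C-Value scoring and the quintile split; summing the ϕ scores of contained patterns is our reading of the elided formula:

```python
def sentence_score(tokens, train_phi, ns=(2, 3)):
    """C-Value of a sentence: sum of training-set pattern scores phi
    over the n-grams it contains."""
    grams = [tuple(tokens[i:i + n]) for n in ns for i in range(len(tokens) - n + 1)]
    return sum(train_phi.get(g, 0.0) for g in grams)

def quintiles(sents, train_phi):
    """Sort sentences by score ascending, split into five equal intervals."""
    ranked = sorted(sents, key=lambda s: sentence_score(s, train_phi))
    k = len(ranked) // 5
    return [ranked[i * k:(i + 1) * k] for i in range(5)]
```

`train_phi` here is any dict mapping n-gram tuples to their ϕ scores on the training set.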
As shown in Figure 1, when a sentence appears near the front of the document or contains more salient patterns, the model extracts it with higher accuracy. This phenomenon means that our proposed P-Value and C-Value reflect, to a certain extent, the position distribution and content information of a specific dataset, and that the model does learn constituent factors and uses them to decide whether a sentence is selected.

Exp-II: Cross-dataset Generalization
From the above experiments, we can see that P-Value and C-Value are sufficient to characterize some attributes of a specific dataset; beyond that, we seek to understand the differences between mainstream datasets through PCR and CCR.
We calculate the PCR/CCR scores and measure the performance of the base model by ROUGE-2 on five datasets. As Table 3 shows, the training and test sets of the same dataset always have the highest PCR/CCR score, which indicates that their distributions are the closest in terms of constituent factors. Furthermore, model performance follows the same trend. This consistency illustrates, on the one hand, that there are significant shifts between different datasets, which cause performance differences in the cross-dataset setting; on the other hand, it reflects that position distribution and content information are key factors behind such dataset shift.
After verifying the validity of PCR and CCR, we use them to estimate the distance between the true distributions of datasets. For instance, the news datasets (CNN/DM, NYT50, and Newsroom) and the scientific paper datasets (arXiv and PubMed) have low cross-group scores on both metrics; that is, there is a large shift between the two groups, which is in line with our intuition. Based on this estimation, we can understand more deeply the impact of different datasets on the generalization ability of various neural extractive summarization models.

Style Factors
We merge the training, validation, and test sets into a whole, divide it into three parts according to the density or compression of each article, and name them "low", "medium", and "high". For example, articles in "density, high" have a higher density relative to the entire dataset. Based on this partition, we break down the test set and attempt to analyze how style factors influence model performance.
Exploration of Density Density represents the overlap between the summary and the original text, so samples with high density are friendlier to extractive models. Consequently, it is easy to understand that the higher the density, the higher the ROUGE score, as shown in Table 4. However, the F1 value of the prediction is also positively correlated with density, which means that density is closely related to learning difficulty.

Table 5: Experiment on DENSITY. Pct denotes the percentage of ψ(s_i, S) relative to Σ_{s_i ∈ D} ψ(s_i, S). The first three sentences contain more salient information in samples with higher density.
To understand this correlation, we conduct the following experiment. Given an article-summary pair, we assign a score ψ(s_i, S) to each sentence in the article to indicate how much salient information it contains.
ψ(s_i, S) = LCS(s_i, S) / |s_i|, where LCS(s_i, S) denotes the longest-common-subsequence length (not counting stop words and punctuation) between the sentence and the summary. We calculate the percentage of ψ(s_i, S) relative to Σ_{s_i ∈ D} ψ(s_i, S) and present the results for the three highest-scoring sentences in Table 5. Clearly, in samples with high density the salient information is concentrated in a few sentences, making it easier for the model to extract the correct sentences.
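The ψ score can be computed with a standard dynamic-programming LCS; the stop-word/punctuation list is left to the caller:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def psi(sent, summ, stop=frozenset()):
    """psi(s_i, S) = LCS(s_i, S) / |s_i|, ignoring tokens in `stop`
    (stop words and punctuation, supplied by the caller)."""
    s = [w for w in sent if w not in stop]
    t = [w for w in summ if w not in stop]
    return lcs_len(s, t) / len(s) if s else 0.0
```

A sentence whose filtered tokens all appear in order in the summary scores 1.0; a sentence sharing nothing with the summary scores 0.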
Therefore, for datasets with high density, we can try to introduce external knowledge into the model, which helps it better understand semantic information and thus more easily capture sentences with salient patterns. In addition, models with external knowledge should have better generalization ability when transferred to high-density datasets. These inferences are verified in Sections 6.1 and 6.2.1.
Exploration of Compression Documents with high compression tend to have fewer sentences, because summaries usually have similar lengths within the same dataset. The compression results in Table 4 are therefore in line with our expectations: how a model should represent long documents to obtain good performance on the text summarization task remains a challenge (Celikyilmaz et al., 2018).
Unlike the exploration of density, here we attempt to understand how the model extracts sentences when faced with samples of different compression. We utilize an attribution technique called Integrated Gradients (IG) (Sundararajan et al., 2017) to separate the position and content information of each sentence. The settings of the input x and baseline x′ in this paper are close to Mudrakarta et al. (2018), but it is worth noting that our base model adds a positional embedding to each sentence, so both x and x′ contain positional information.
Table 6: Performance of models equipped with different types of knowledge on the CNN/DM dataset. BERT is used here with its gradients frozen, purely as a way of introducing external knowledge.

We take F(x′) to denote the attribution of positional information, and F(x) − F(x′) the attribution of content information when the model makes decisions, where F : R^n → [0, 1] represents the deep network. Figure 2 illustrates that as compression increases, the help provided by positional information gradually decreases and content information becomes more important to the model. In other words, the model can perceive the compression level and decide whether to pay more attention to positional information or to important patterns; this observation helps us design models and study their generalization ability in Section 6.2.1.
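The attribution itself can be illustrated with a plain implementation of Integrated Gradients on a toy differentiable scorer; the network, weights, and baseline below are made up for illustration:

```python
import numpy as np

def integrated_gradients(F, grad_F, x, x_base, steps=100):
    """Approximate IG attributions (Sundararajan et al., 2017):
    (x - x') times the average gradient along the straight-line
    path from the baseline x' to the input x (midpoint rule)."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean([grad_F(x_base + a * (x - x_base)) for a in alphas], axis=0)
    return (x - x_base) * grads

# Toy network F(x) = sigmoid(w . x); baseline keeps only positional info
w = np.array([0.5, -0.3, 0.8])
F = lambda x: 1 / (1 + np.exp(-w @ x))
grad_F = lambda x: F(x) * (1 - F(x)) * w

x = np.array([1.0, 2.0, -1.0])       # full input (position + content)
x_base = np.array([0.2, 0.0, 0.0])   # content features zeroed out
attr = integrated_gradients(F, grad_F, x, x_base)
# Completeness axiom: attributions sum to F(x) - F(x'),
# i.e. the content contribution on top of the positional baseline
assert abs(attr.sum() - (F(x) - F(x_base))) < 1e-4
```

The completeness property is exactly what licenses reading F(x′) as the positional attribution and F(x) − F(x′) as the content attribution.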

Bridge the Gap between Dataset Bias and Model Design Prior
In this section we investigate how different properties of datasets influence the choices of model structures, pre-trained strategies, and training schemas.

Idea of Experiment Design
Through the analysis in Section 4, the constituent factors reflect the relationship between diverse data distributions, while the style factors directly affect the learning difficulty of samples' features. Based on these different attributes, we design the following investigations: for the style factors, we not only investigate the influence of different model architectures and pre-training strategies, but also use them to explain the generalization behaviour of the models. For the constituent factors, we discuss their effects on different training strategies, such as multi-domain learning and meta-learning, because these learning modes are all about better modelling various types of distributions.

Style Factors Bias
In this section, we study whether samples with different learning difficulties, as described by the style factors, can be handled well through improved architectures or the introduction of pre-trained knowledge, or whether we need to extend our models in other ways. Table 6 shows the breakdown of performance on CNN/DM based on DENSITY and COMPRESSION. We observe that: 1) An obvious trend is that LSTM performs better than Transformer as the learning difficulty of the samples increases (low density and high compression). For instance, LSTM performs worse than Transformer on the subset with high density, but surpasses Transformer when the density of the test examples becomes lower. 2) Generally, introducing pre-trained word vectors improves the overall results of the models. However, we find that increasing the learning difficulty of the samples weakens the benefits brought by pre-trained embeddings.
3) The prospects for further gains on these hard cases, as described by the style factors, from novel architecture design and knowledge pre-training seem quite limited, suggesting that perhaps we should explore other directions, such as generating summaries instead of extracting them.

Constituent Factors Bias
We design our experiments to answer the two main questions below.

Table 7: Results of four models under two evaluation settings: IN-DATASET and CROSS-DATASET. Bold indicates the best performance among all models; red indicates the best performance other than BERT.

Table 7 shows the results of the four models under the two evaluation settings, IN-DATASET and CROSS-DATASET, and we have the following findings: 1) In the IN-DATASET setting, comparing the Tag model with the basic models, we find that a very simple method, assigning each sample a domain tag, achieves an improvement. We claim the reason is that a domain-aware model makes full use of the nature of the dataset. 2) For the multi-domain and meta-learning models, we attempt an explanation from the perspective of data distribution. Although meta-learning performs worse in the IN-DATASET setting, it achieves impressive performance in the CROSS-DATASET setting. Concretely, the meta-learning model surpasses the Tag model on three datasets, DUC2002, NYT50, and Newsroom, whose distributions are closer to CNN/DM according to the constituent factors in Table 3. Correspondingly, the Tag model uses a randomly initialized embedding for zero-shot transfer, and we suspect that this perturbation unexpectedly generalizes well on some far-distributed datasets (arXiv and PubMed). 3) BERT shows superior performance and outperforms nearly all competitors. However, the generalization ability of BERT is poor on arXiv, PubMed, and DUC2002 relative to its improvement in the IN-DATASET setting. In contrast, BERT generalizes well when transferring to datasets with high density and compression (NYT50 and Newsroom). As discussed in Section 5.2, samples with high style factors require the model to capture salient patterns, which is exactly what the external knowledge from BERT improves.

Exp-II: Searching for a Good Domain
The second question we study is: what makes a good domain? To answer it, we define the concept of domain not solely based on the dataset, but divide the training set by directly utilizing the constituent factors. Specifically, we explore the following settings: 1) Random tag: each sample is assigned a random "pseudo-domain" tag.
2) Domain: training samples are divided according to the domain (CNN or DM) they belong to.
3) P- and C-Value: each sentence is assigned a tag by its corresponding P-Value and C-Value scores.
We experiment with these tags on our base model and on the current state-of-the-art model of Liu (2019); the results are presented in Table 8. We obtain the following observations: 1) Random partitioning is not meaningful and does not lead to any performance improvement. Conversely, partitions based on the constituent factors obtain a benefit. 2) The simple method of dividing the training set based on domain shows a considerable benefit, which is complementary to the improvement brought by BERT.
3) The division based on the constituent factors (P-Value and C-Value) achieves the best result in the context of BERT, which implies that, for the summarization task, mining the characteristics of the dataset itself plays an important role.

Conclusion
In this paper, we conduct a data-dependent analysis of neural extractive summarization models, exploring how different factors of datasets influence these models and how to make full use of the nature of datasets to design more powerful models. Experiments with in-depth analyses diagnose the weaknesses of existing models and provide guidelines for future research.