Modeling Content Importance for Summarization with Pre-trained Language Models

Modeling content importance is an essential yet challenging task for summarization. Previous work is mostly based on statistical methods that estimate word-level salience and do not consider semantics or larger context when quantifying importance. It is thus hard for these methods to generalize to semantic units that span longer text. In this work, we apply information theory on top of pre-trained language models and define the concept of importance from the perspective of information amount. Our definition considers both semantics and context when evaluating the importance of each semantic unit. With the help of pre-trained language models, it can easily generalize to different kinds of semantic units (n-grams or sentences). Experiments on the CNN/Daily Mail and New York Times datasets demonstrate that our method models the importance of content better than prior work, as measured by F1 and ROUGE scores.


Introduction and Related Work
Text summarization aims to compress long document(s) into a concise summary while maintaining the salient information. It often consists of two critical subtasks: important information identification and natural language generation (for abstractive summarization). With the advancements of large pre-trained language models (PreTLMs) (Devlin et al., 2019; Yang et al., 2019), state-of-the-art results are achieved on both natural language understanding and generation. However, it is still unclear how well these large models can estimate "content importance" for a given document.
Previous studies on modeling importance are either empirical, implicitly encoding importance during document summarization, or theory-based, often lacking support from empirical experiments (Peyrard, 2019). Benefiting from large-scale summarization datasets (Nallapati et al., 2016; Narayan et al., 2018), data-driven approaches (Nallapati et al., 2017; Paulus et al., 2018; Zhang et al., 2019) have made significant progress. Yet most of them conduct information selection implicitly while generating the summaries. Such approaches lack theoretical support and are hard to apply to low-resource domains. In another line of work, structural features (Zheng and Lapata, 2019), such as centrality, position, and title, are employed as proxies for importance. However, the information captured by these features can vary across texts of different genres.
To overcome this problem, theory-based methods (Louis, 2014; Peyrard, 2019; Lin et al., 2006) aim to formalize the concept of importance and to develop general-purpose systems by modeling the background knowledge of readers. This is based on the intuition that humans are good at identifying important content by using their own interpretation of world knowledge. Theoretical models usually rely on information theory (IT) (Shannon, 1948). Louis (2014) uses a Dirichlet distribution to represent the background knowledge and employs Bayesian surprise to find novel information. Peyrard (2019) instead models importance with entropy, assuming that important words should be frequent in the given document but rare in the background.
However, statistical methods give only a rough estimate of informativity and largely ignore the effect of semantics and context. In fact, the information amount of a unit is determined not only by its frequency, but also by its semantic meaning, its context, and the reader's background knowledge. In addition, bag-of-words approaches are difficult to generalize beyond unigrams due to the sparsity of n-grams when n is large.
In this paper, we propose a novel and general-purpose approach to model content importance for summarization. We employ information theory on top of pre-trained language models, which are expected to better capture the information amount of semantic units by leveraging their meanings and context. We argue that important content contains information that cannot be directly inferred from context and background knowledge. Large pre-trained language models are suitable for our study since they are trained on large-scale datasets consisting of diverse documents and thus contain a wide range of knowledge.
We conduct experiments on popular summarization benchmarks of CNN/Daily Mail and New York Times corpora, where we show that our proposed method can outperform prior importance estimation models. We further demonstrate that our method can be adapted to model semantic units of different scales (n-grams and sentences).

Methodology
In this section, we first estimate the amount of information by using information theory with pre-trained language models (§2.1 and §2.2), where we consider both the context and semantic meaning of a given text unit. We then propose a formal definition of importance for text summarization from a perspective of information amount (§2.3).

Information Theory
Information theory (IT), as invented by Shannon (1948), has been used on words to quantify their "informativity". Concretely, IT uses the frequency of a semantic unit $x_i$ to approximate the probability $P(x_i)$ and uses the negative logarithm of the frequency as the measurement for information, which is called self-info:

$$I(x_i) = -\log P(x_i) \tag{1}$$

(The unit of information is "bit", with base of 2. In the rest of this paper, we omit base 2 for brevity.) It approximates the information amount of a unit (e.g. word) in a given corpus. However, traditional IT suffers from the sparsity problem of longer n-grams and also ignores semantics and context. Advanced compression algorithms in IT (Hirschberg and Lelewer, 1992) attempt to model the context to better estimate the information amount. But due to the sparsity, they can only count up to third-order statistics. Statistical methods are nearly impossible to reliably calculate the probability of $x_i$ conditioned on its context, e.g., $P(x_i \mid \cdots, x_{i-1}, x_{i+1}, \cdots)$, as the number of combinations of the context can be explosive.

Figure 1: Information amount evaluation with language models. Here we take a subsequence $x_3 x_4$ as example. [M] denotes mask and PLMs/MLMs/ALMs are three different options for language models, where conditions for different models are omitted for brevity.
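As an illustration of the frequency-based self-information described above, the following minimal sketch estimates the probability of each token by its relative frequency in a toy corpus and takes the negative base-2 logarithm. The function name `self_information` and the toy corpus are assumptions made for this example and do not come from the original work.

```python
import math
from collections import Counter


def self_information(tokens):
    """Map each token to -log2 of its relative frequency in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: -math.log2(c / total) for tok, c in counts.items()}


# Toy corpus (hypothetical): "the" occurs 3 times out of 8 tokens.
corpus = "the cat sat on the mat the end".split()
info = self_information(corpus)

# Frequent tokens carry less information than rare ones.
for tok in ("the", "cat"):
    print(tok, round(info[tok], 3))
```

Because the estimate depends only on counts, it assigns the same value to a token regardless of where it appears, which is the limitation that motivates conditioning on context.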

Using Language Models in Information Theory
With the development of deep learning, neural language models can efficiently predict the probability of a specified unit, such as a word or a phrase, given its context, which makes it feasible to calculate high-order approximation for the information amount.
We thus propose to use neural language models to replace the statistical models for estimating the information amount of a given semantic unit. Language models can be categorized as follows, and we present information estimation method for each as shown in Fig. 1.
Auto-regressive Language Model (ALM) (Bengio et al., 2000) is the most commonly used probabilistic model to depict the distribution of language, and is usually referred to as a unidirectional LM (UniLM). Given a sequence of tokens $x_{0:T} = [x_0, x_1, \cdots, x_T]$, UniLMs use the leftward context to estimate the conditional probability of each token: $P(x_t \mid x_{<t}) = g_{UniLM}(x_{<t})$, where $g_{UniLM}$ denotes a neural network for language modeling and $x_{<t}$ represents the sequence from $x_0$ to $x_{t-1}$. Then the joint probability of a subsequence $x_{m:n}$ is factorized as:

$$P(x_{m:n} \mid x_{<m}) = \prod_{t=m}^{n} P(x_t \mid x_{<t}) \tag{2}$$

After applying Eq. (1) to both sides of Eq. (2), we can obtain the information amount of the subsequence conditioned on its context as:

$$I(x_{m:n} \mid x_{<m}) = -\sum_{t=m}^{n} \log P(x_t \mid x_{<t}) \tag{3}$$

Masked Language Model (MLM) is proposed by Taylor (1953) and combined with pre-training by Devlin et al. (2019) to encode bidirectional context. An MLM masks a certain number of tokens in the input sequence, then predicts these tokens based on the unmasked ones. The conditional probability of a masked token $x_t$ can be estimated as: $P(x_t \mid x_{\neq t}) = g_{MLM}(x_{\neq t})$, where $\neq t$ indicates that the $t$-th token is masked. The information amount of a given subsequence of the input is calculated as:

$$I(x_{m:n} \mid x_{\notin[m,n]}) = -\sum_{t=m}^{n} \log P(x_t \mid x_{\notin[m,n]}) \tag{4}$$

Since MLMs encode both leftward and rightward context, intuitively, they can better estimate the information of the current tokens than UniLMs.

Permutation Language Model (PLM) is proposed by Yang et al. (2019) to combine ALMs and MLMs, by considering the dependency between the masked tokens as well as overcoming the problem caused by the discrepancy between pre-training and fine-tuning in MLMs. It models the dependency of the tokens by maximizing the expected likelihood over all possible permutations of factorization orders. The probability prediction can be formalized as: $P(z_t \mid z_{<t}) = g_{PLM}(z_{<t})$, where $z$ denotes a possible permutation sequence of the input. The information of a subsequence is estimated as:

$$I(x_{m:n} \mid x_{\notin[m,n]}) = -\sum_{t=m}^{n} \log P(x_t \mid x_{\notin[t,n]}) \tag{5}$$
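The auto-regressive case can be sketched as follows: the information amount of a subsequence is the sum of the negative log-probabilities of its tokens, each conditioned on everything to its left. This is only a sketch under stated assumptions. The `cond_prob(context, token)` interface and the toy bigram table stand in for a neural language model and are hypothetical; a masked or permutation model would change only which context is passed to the probability function.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: cond_prob(context, token) -> P(token | context).
CondProb = Callable[[Sequence[str], str], float]


def alm_information(tokens: Sequence[str], m: int, n: int,
                    cond_prob: CondProb) -> float:
    """Information (bits) of tokens[m..n], each token conditioned on its left context."""
    return -sum(math.log2(cond_prob(tokens[:t], tokens[t]))
                for t in range(m, n + 1))


# Toy stand-in for a neural LM: a bigram table with a default probability.
TOY_BIGRAMS = {("new", "york"): 0.5}


def toy_cond_prob(context: Sequence[str], token: str) -> float:
    prev = context[-1] if context else "<s>"
    return TOY_BIGRAMS.get((prev, token), 0.125)


tokens = ["i", "love", "new", "york"]
# "new" is hard to predict after "love" (3 bits), "york" is easy after "new" (1 bit).
bits = alm_information(tokens, 2, 3, toy_cond_prob)
print(bits)
```

Summing per-token terms in log space avoids multiplying many small probabilities, which is the usual reason to work with information amounts rather than joint probabilities directly.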

Modeling Importance with Pre-trained Language Model
We argue that important content should be hard to predict from background knowledge alone, and should also be difficult to infer from its context. Moreover, detecting important content amounts to finding the most informative part of the input. As described in (Shann, 1989), the information amount is a quantification of the uncertainty we have about the semantic units. But the degree of uncertainty is relative to the reader's background knowledge: the less knowledge the reader has, the more uncertainty the source shows. We thus employ pre-trained language models, which contain a wide range of knowledge, to represent background knowledge. If a semantic unit is frequently mentioned in the training corpus, it will get a high probability during inference and thus a low information amount. We further propose a notion of importance as the information amount conditional on the background knowledge:

$$\mathrm{Imp}(x_i) = I(x_i \mid X - x_i, K) = -\log P(x_i \mid X - x_i, K) \tag{6}$$

where $X - x_i$ means the context obtained by excluding the unit $x_i$ from the input $X$, and $K$ denotes the knowledge encoded in the pre-trained model. In practice, when calculating the importance of a semantic unit, we first exclude all its occurrences from the input document, and let the PreTLMs predict the probability of each occurrence, based on which the information amount is calculated. As the same unit may appear at multiple positions in the input, the information amounts of all occurrences are summed to give the final value. Based on our notion of importance, a summarization model is to maximize the overall importance of a subset $x$ of the input $X$ under a length constraint, such as $\sum_{x_i \in x} |x_i| < l_{max}$:

$$\arg\max_{x \subseteq X} \sum_{x_i \in x} \mathrm{Imp}(x_i) \quad \text{s.t.} \quad \sum_{x_i \in x} |x_i| < l_{max} \tag{7}$$

Experimental Setups
Semantic Units and Tasks. Our theory can be generalized to evaluate the importance of semantic units of any scale. To verify the effectiveness of our theory, we instantiate the semantic unit in three common forms: unigram, bigram and sentence.
In this way, our method can also be regarded as a general unsupervised information extraction system, serving as a keyphrase extraction or sentence-level extractive summarization model. As our method exploits existing PreTLMs and needs no additional training, it has the potential to benefit low-resource languages and domains.
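Selecting a subset of units that maximizes total importance under a length constraint can be approximated in several ways. The sketch below uses a simple greedy heuristic, which is an assumption made for illustration: the text does not state which optimization procedure is used. The function name `select_units`, whitespace tokenization, and the example scores are likewise hypothetical.

```python
from typing import List, Tuple


def select_units(units: List[Tuple[str, float]], l_max: int) -> List[str]:
    """Greedily pick units by descending importance while the total
    token count stays strictly below l_max."""
    chosen, used = [], 0
    for text, _importance in sorted(units, key=lambda u: u[1], reverse=True):
        length = len(text.split())  # whitespace tokens (assumption)
        if used + length < l_max:
            chosen.append(text)
            used += length
    return chosen  # ordered by importance, not by document position


# Hypothetical (unit, importance) pairs.
units = [("a b c", 5.0), ("d e", 4.0), ("f g h i", 3.0), ("j", 1.0)]
summary = select_units(units, l_max=7)
print(summary)
```

A greedy pass is not guaranteed to find the optimal subset, since the problem has the structure of a knapsack problem, but it is cheap and easy to inspect.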
In the unigram scenario, we simply instantiate the semantic unit $x_i$ as a token $w_t$ and calculate its importance with $\mathrm{Imp}(w_t) = -\log P(w_t \mid w_{\neq t}, K)$. For evaluation, the top-$k$ most important tokens are selected and the F1 score is calculated by comparing them against the reference, where the value of $k$ is set by grid search. The importance of a bigram, e.g., $x_i = w_t w_{t+1}$, can be represented through the joint probability of its two tokens: $\mathrm{Imp}(w_t w_{t+1}) = -\log P(w_t w_{t+1} \mid w_{\notin[t,t+1]}, K)$. As with unigrams, the F1 score is computed to evaluate the accuracy.
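The top-k selection and F1 evaluation described here can be sketched as follows. The function name `top_k_f1`, the importance scores, and the reference set are invented for the example, and the grid search simply tries every k on this single toy instance; how k is tuned in the actual experiments is not specified beyond "grid search".

```python
from typing import Dict, Set


def top_k_f1(scores: Dict[str, float], reference: Set[str], k: int) -> float:
    """F1 between the k highest-scoring units and the reference set."""
    predicted = set(sorted(scores, key=scores.get, reverse=True)[:k])
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical importance scores and reference keywords.
scores = {"merger": 9.1, "bank": 7.4, "shares": 5.0, "said": 0.3}
reference = {"merger", "shares", "deal"}

f1_at_2 = top_k_f1(scores, reference, k=2)

# Grid search over k on the toy example.
best_k = max(range(1, len(scores) + 1),
             key=lambda k: top_k_f1(scores, reference, k))
print(round(f1_at_2, 3), best_k)
```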
By extending the formula of bigram importance to longer sequences, we get the importance definition for a sentence as: $\mathrm{Imp}(s_i) = I(s_i \mid w_{\notin s_i}, K) = -\log P(s_i \mid w_{\notin s_i}, K)$. For evaluation, we select a subset of sentences with Eq. (7) and calculate the ROUGE scores (Lin, 2004) against the reference summary. The length constraints for CNN/DM and NYT are set to 105 and 95 tokens respectively.

Table 1: Results of importance modeling. UNI./BI. denote unigram and bigram. R-1/R-2/R-L are ROUGE-1, ROUGE-2 and ROUGE-L respectively. Best results per metric are in bold. Among our models (bottom), IMP yields significantly higher scores on all metrics except when using unigrams as semantic unit and with sentences (based on R-1) on NYT (Welch's t-test, p<0.05).
Datasets. We evaluate our method on the test sets of two popular summarization datasets: CNN/Daily Mail (abbreviated as CNN/DM) (Nallapati et al., 2017) and New York Times (Sandhaus, 2008). Following See et al. (2017), we use the non-anonymized version that does not replace named entities, which is most commonly used in recent work. We preprocess them as described in (Paulus et al., 2018). For unigram experiments, we remove all the stop words and punctuation in the reference summaries and treat the remaining content words as the prediction targets. For bigrams, we first collect all the bigrams in the source document and then discard the ones containing stop words or punctuation. The remaining bigrams are used as the prediction targets.
Comparisons. We compare our method with two types of models. (1) Methods that estimate importance for n-grams: we consider TF·IDF, a numerical statistic that reflects how important a term is to a document, and STM (Peyrard, 2019), a simple theoretic model for content importance based on statistical information theory. (2) Unsupervised models for extractive summarization: we adopt the centrality-based models LEXRANK (Erkan and Radev, 2004), TEXTRANK (Mihalcea and Tarau, 2004) and TEXTRANK+BERT (Zheng and Lapata, 2019), the frequency-based model SUMBASIC (Ani Nenkova, 2005), and BAYESIANSR (Louis, 2014), which scores words or sentences with Bayesian surprise.
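For reference, a TF·IDF score of the kind used as an n-gram baseline can be sketched as below. This is one common textbook variant (relative term frequency times the log of inverse document frequency); the exact weighting used in the experiments is not stated, so the formula, the function name `tf_idf`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each term in `doc` by relative term frequency times log(N / df)."""
    n_docs = len(corpus)
    tf = Counter(doc)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log(n_docs / df) if df else 0.0
        scores[term] = (count / len(doc)) * idf
    return scores


# Toy background corpus (hypothetical); the first document is the one scored.
corpus = [["the", "bank", "merger"], ["the", "weather"], ["the", "game"]]
tfidf_scores = tf_idf(corpus[0], corpus)
print(tfidf_scores)
```

A term that appears in every document, such as "the" here, receives a score of zero, which is how this baseline filters out common words without using any context.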
As shown in Table 1, our method IMP consistently outperforms prior models. Among the comparisons, the theory-based methods, STM and BAYESIANSR, achieve better results than the other baselines. This is because they estimate statistics for a background distribution, which helps filter out common words. The significant advantage of our method supports our hypothesis that pre-trained language models better characterize the background knowledge, which in turn allows the importance of each semantic unit to be calculated more precisely. Moreover, our methods show a larger improvement on bigram-level prediction than on unigram-level prediction. This is because IMP-based models overcome the sparsity issue: they can evaluate the importance of a phrase by considering its semantic meaning and context. Surprisingly, our method can also generalize to sentence-level semantic units and serve as an unsupervised extractive model for summarization. Our models achieve significantly higher ROUGE scores than previous work, by 2.02 points on average. This observation suggests a potential future direction for sentence-level importance modeling based on background knowledge as well as context information.
We also compare the performance of PreTLMs in different categories. MLMs, including BERT and DISTILBERT, have the best overall performance, since they are able to encode bidirectional context. The PLM, i.e. XLNet, is slightly inferior to MLMs because the probabilities of the words depend on the order of their permutation, which may hurt importance estimation by our method.

Future Work
In future work, we would like to fine-tune the current language models with the objective $\max P(x_i \mid X - x_i)$ to better align them with the interpretation of information theory. Currently, most PreTLMs mask the text randomly, which still differs from the objective of our method.
Background knowledge also deserves further investigation. The background knowledge of our method comes from the pre-training process of language models, suggesting that the information distribution largely depends on the training data. Meanwhile, most PreTLMs are trained on Wikipedia or books, which may affect the determination of content importance for text with different styles. Domain-specific knowledge, such as genres or topics, could therefore be incorporated in future work.

Conclusion
We propose to use large pre-trained language models to estimate the information amount of given text units, by filtering out the background knowledge as encoded in the large models. We show that the large pre-trained models can be used as unsupervised methods for content importance estimation, where significant improvement over nontrivial baselines is achieved on both keyphrase extraction and sentence-level extractive summarization tasks.