SupMMD: A Sentence Importance Model for Extractive Summarisation using Maximum Mean Discrepancy

Most work on multi-document summarization has focused on generic summarization of information present in each individual document set. However, the under-explored setting of update summarization, where the goal is to identify the new information present in each set, is of equal practical interest (e.g., presenting readers with updates on an evolving news topic). In this work, we present SupMMD, a novel technique for generic and update summarization based on the maximum mean discrepancy from kernel two-sample testing. SupMMD combines both supervised learning for salience and unsupervised learning for coverage and diversity. Further, we adapt multiple kernel learning to make use of similarity across multiple information sources (e.g., text features and knowledge based concepts). We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.


Introduction
Multi-document summarization is the problem of producing condensed digests of salient information from multiple sources, such as articles.Concretely, suppose we are given two sets of articles (denoted set A and set B) on a related topic (e.g., climate change, the COVID-19 pandemic), separated by publication timestamp or geographic region.We may then identify three possible instantiations of multi-document summarization (see Figure 1): (i) generic summarization, where the goal is to summarize a set (A or B) individually.(ii) comparative summarization, where the goal is to summarize a set (B) against another set (A) while highlighting the differences.(iii) update summarization, where the goal is both generic summarization of set A and comparative summarization of set B versus A. Most existing work on this topic has focused on the generic summarization task.However, update summarization is of equal practical interest.
Intuitively, the comparative aspect of this setting aims to inform a user of new information on a topic they are already familiar with.
Multi-document extractive summarization methods can be unsupervised or supervised.Unsupervised methods typically define salience (or coverage) using a global model of sentence-sentence similarity.Methods based on retrieval (Goldstein et al., 1999), centroids (Radev et al., 2004), graph centrality (Erkan and Radev, 2004), or utility maximization (Lin andBilmes, 2010, 2011;Gillick and Favre, 2009) have been well explored.However, sentence salience also depends on surface features (e.g., position, length, presence of cue words); effectively capturing these requires supervised models specific to the dataset and task.A body of work has incorporated such information through supervised learning, for example based on point processes (Kulesza and Taskar, 2012), learning important words (Hong and Nenkova, 2014), graph neural networks (Yasunaga et al., 2017), and support vector regression (Varma et al., 2009).These supervised methods have either a separate model for learning and inference, leading to a disconnect between learning sentence salience and sentence selection (Varma et al., 2009;Yasunaga et al., 2017;Hong and Nenkova, 2014), or are designed specifically for generic summarization (Kulesza and Taskar, 2012).In this work, we propose SupMMD, which has a single model of learning salience and inference and can be applied to generic and comparative summarization.We make the following contributions: (1) We present SupMMD, a novel technique for both generic and update summarization that combines supervised learning for salience and unsupervised learning for coverage and diversity.SupMMD has a single model for learning and inference.
(2) We adapt multiple kernel learning (Cortes et al., 2010) into our model, which allows similarity across multiple information sources (e.g., text features and knowledge based concepts) to be used.
(3) We show that SupMMD meets or exceeds the state-of-the-art in generic and update summarization on the DUC-2004 and TAC-2009 datasets.

Literature Review
Multi-document summarization can be extractive, where salient pieces of the original text such as sentences are selected to form the summary; or abstractive, where a new text is generated by paraphrasing important information.The former is popular as it often creates semantically and grammatically correct summaries (Nallapati et al., 2017).In this work, we focus on generic and update multidocument summarization in the extractive setting.
There has been limited research into update and comparative summarization.Notable prior work includes maximizing concept coverage using ILP (Gillick et al., 2009), learning sentence scores using a support vector regressor (Varma et al., 2009), and temporal content filtering (Zhang et al., 2009).Bista et al. (2019) cast the comparative summarization problem as classification, and use MMD (Gretton et al., 2012).In this work, we adapt their method to learn sentence importances driven by surface features.

Summarization as Classification
We review a perspective introduced by Bista et al. (2019), where summarization is viewed as classification, and provide a brief introduction to Maximum Mean Discrepancy (MMD).Both these ideas form the basis of our subsequent method.

Generic Summarization as Classification
Let {V t } T t=1 be T topics of articles that we wish to summarize.For a topic t, we wish to select summary sentences S t .Bista et al. (2019) formulated summarization as selecting prototypes that minimize the accuracy of a powerful classifier between sentences in the input and summary.The intuition is that a powerful classifier should not be able to distinguish between the sentences from articles and summary sentences.Formally, we pick where S t ⊂ 2 V t : ∀ S ∈S s∈S len(s) ≤ L comprise subsets of V t with upto L words, and Acc(X,Y ) is the accuracy of the best possible classifier that distinguishes between elements in sets X and Y ; we shall shortly realize this using MMD.

Comparative Summarization as Competing Binary Classification
For comparative summarization between two sets A and B, Bista et al. (2019) introduced an additional term into (3.1),giving rise to competing goals for the classifier: it should not be able to distinguish between the summaries and sentences from set B, but should be able to distinguish between the summaries and sentences from set A. Formally, let V t B be the set of sentences in set B, V t A be the sentences in set to compare (set A).Then, for suitable λ > 0, we seek S t , the summary sentences of set B, The hyperparameter λ controls the relative importance of accurately representing articles in set B, versus not representing the articles in set A.

Maximum Mean Discrepancy (MMD)
The MMD is a kernel-based measure of the distance between two distributions.More formally: Definition 3.1.Let H be a Reproducing Kernel Hilbert Space (RKHS) with associated kernel k.
Let F be the set of functions h : X → R in the unit ball of H, where X is a topological space.Then, the MMD between distributions p,q is the maximal difference in expectations of functions from F under p,q (Gretton et al., 2012): A small MMD value indicates that p, q are similar.Given finite samples X ∼ p n and Y ∼ q m , an empirical estimate of the MMD, denoted as MMD 2 F (X,Y ), can be computed as:

MMD for Summarization
The MMD corresponds to the minimal achievable loss of a centroi-based kernel classifier (Sriperumbudur et al., 2009).Consequently, we use MMD 2 F (V,S) to approximate the Acc(V,S) in (3.1) and (3.2), using a suitable kernel k that measures the similarity of two sentences.Intuitively, this selects summaries S which best represent the distribution of original sentences V .
Note that if we expand MMD 2 F (V,S) as per (3.4) and later in §4.6, the first term is irrelevant for optimization.The second and third term capture the coverage and diversity of the summary sentences without any supervision.Hence, this is an unsupervised summarization.

The SupMMD Method
We start by developing a technique for incorporating sentence importance into MMD for the purpose of generic multi-document extractive summarization.We then extend this method to comparative summarization, and incorporate multiple different kernels to use a diverse sets of features.

From MMD to Weighted MMD
Unsupervised MMD (Bista et al., 2019) selects representative sentences that cover relevant concepts while retaining diversity.The notion of representativeness is based on a global model of sentence-sentence similarity; however, this notion of representativeness is not necessarily well matched to the selection of salient information.Salience of a sentence may be determined by surface features such as position in the article, or number of words.For example, news articles are often written such that sentences at the start of a article have the characteristics of a summary (Kedzie et al., 2018).Learning a notion of salience that is specific to the summarization task and dataset requires supervised training.Thus, we extend the MMD model by incorporating supervised sentence importance weighting.
Let v, s ∈ X be independent samples drawn from the distributions of article sentences p and summary sentences q on the space of all sentences X.We define non-negative importance functions f p θ , f q θ parameterized by learnable parameters θ.We restrict these functions so that E p f p θ (v) = 1 and E q f q θ (s) = 1.Equipped with f θ , we may modify MMD such that the importance of sentences which are good summary candidates is increased.Definition 4.1.The weighted MMD MMD F (p,q,θ) between p,q is Note that classic MMD (3.3) is a special case of (4.1) where f θ ≡ 1.
In practice, the supremum over all h is impossible to compute directly.We thus derive an alternative form for Equation 4.1.Lemma 4.1.For h H ≤ 1, (4.1) is equivalently In the above, φ : X → F is a canonical feature mapping of sentences and summaries from X to RKHS.The derivation, which mirrors a similar derivation for MMD (Gretton et al., 2012), is given in the Appendix.

Importance Function
We use log-linear models as importance functions, as they are a common choice of sentence importance (Kulesza and Taskar, 2012) and easy to fit when training data is scarce.Formally, the log-linear importance function is: where ω(v) is the surface features of sentence v.We can define the empirical estimates f nt θ (v), f mt θ (s) of the importance functions f p θ (v) and f q θ (s) as: where nt = |V t | is the number of sentences and mt = |S t | is the number of summary sentences in topic t.

Training: Generic Summarization
The parameters θ of the log-linear importance function must be learned from data, so we define a loss function based on weighted MMD.Let {(V t ,S t )} T t=1 be the T training tuples.Then, the loss of topic t is the square of importance weighted empirical MMD between sentences and summary sentences from within the topic: where MMD 2 F (V t ,S t ,θ) is an empirical estimate of the weighted MMD 2 F (p,q,θ).Applying the kernel trick to Equation 4.4 gives (see Appendix): Equation 4.5 is the loss for a single topic but during training we will instead minimize the average loss over all topics in the training set, i.e., min θ 1 T T t=1 L t (V t , S t , θ).Intuitively, we learn the parameters θ by minimizing an importance weighted distance between sentences and ground truth summary sentences over all topics.

Training: Comparative Summarization
We now extend the learning task to comparative summarization using the competing binary classifiers idea of Bista et al. (2019) (cf. §3.2).Specifically, we replace the accuracy terms in Equation 3.2 with the square of weighted MMD.Given the T comparative training tuples {(V t B ,V t A ,S t )} T t=1 , then the objective is to minimize: Note there are two sets of importance parameters θ B ,θ A one for each of the two document sets.

Multiple Kernel Learning
We employ Multiple Kernel Learning (MKL) to make use of data from multiple sources in our MMD summarization framework.We adapt two stage kernel learning (Cortes et al., 2010), where different kernels are linearly combined to maximize the alignment with the target kernel of the classification problem.Since MMD can be interpreted as classifiability (Sriperumbudur et al., 2009) MKL fits neatly into our MMD based summarization objective.Intuitively, MKL should identify a good combination of kernels for building a classifier that separates summary and non-summary sentences.
Let {k i } p i=1 be p kernel functions.For topic t, let K t i be the kernel matrix according to kernel function k i , and K t i = U nt K t i U nt be the centered kernel matrix, with U nt = I − 11 T /n t .Let y t = {±1} n t be the ground truth summary labels with y t i = +1 iff i ∈ S t .The target kernel y t (y t ) T represents the ideal notion of similarity between sentences.The non-negative kernel weights w which lead to the optimal alignment with the target kernel are given by (Cortes et al., 2010) where The kernel function must be characteristic for MMD to be a valid metric (Muandet et al., 2017).Most popular kernels used for bag of words like text features (including TF-IDF), the linear kernel (k(x, y) = x, y ) and the cosine kernel (k(x,y) = x,y x y ), are not characteristic (Sriperumbudur et al., 2010).Fortunately, the exponential kernel, k(x, y) = exp(γk (x,y)), γ > 0, is characteristic for any kernel k (Steinwart, 2001).
Hence, we use the normalized exponential kernel combined with the cosine kernel, k(x, y) = exp(−γ)exp γ p i=1 w i •cos x (i) ,y (i) .

Inference
Given a learned importance function f θ , we may find the best set of summary sentences St for generic summarization via: Similarly, for the comparative task, with learned importance functions, we seek St as: Both these inference problems are budgeted maximization problems, which are often solved by greedy algorithms (Lin and Bilmes, 2010).The generic unsupervised summarization task is submodular and monotone under certain conditions (Kim et al., 2016), so greedy algorithms have good theoretical guarantees (Nemhauser et al., 1978).While our supervised variants do not have these guarantees, we find that greedy optimization nonetheless leads to good solutions.

Experimental Setup
We include guidance on applying SupMMD and the details required to reproduce our experiments.

Datasets
We use four standard multi-document summarization benchmark datasets: DUC-

Data Preprocessing and Preparation
The DUC and TAC datasets are provided as collections of XML documents, so it is necessary to extract relevant text and then perform sentence and word tokenization.For DUC we clean the text using various regular expressions the details of which are provided in our code release.We train PunktSentenceTokenizer to detect sentence boundaries, and use the standard NLTK (Bird, 2006) word tokenizer.For the TAC dataset, we use the preprocessing pipeline employed by Gillick et al. (2009) 2 .This enables a cleaner comparison with the state-of-the-art ICSI (Gillick et al., 2009) method on the TAC dataset.For all datasets, we keep the sentences between 8 and 55 words per Yasunaga et al. (2017).

Feature Representations
Our method requires two different sets of sentence features: text features, which are used to compute the sentence-sentence similarity as part of the kernel; and surface features, which are used in learning the sentence importance model.

Text Features
Each sentence has three different feature representations: unigrams, bigrams and entities.The unigrams are stemmed words, with stop words from the NLTK english list removed.The bigrams are a combination of stemmed unigrams and bigrams.

Surface Features
We use 10 surface features for the DUC dataset, and 12 for the TAC dataset: position: There are five position features.Four indicators denote the 1 st , 2 nd , 3 rd or a later position of the sentence in the article.The final feature gives the position relative to the length of the article.counts: There are two count features: the number of words and number of nouns.We use the spaCy3 part of speech tagging to find nouns.tfisf: This is the sum of the TS-ISF scores for unigrams composing the sentence.For sentence s, this is w∈s isf(w)•tf(w,s), where isf(w) is the inverse sentence frequency of unigram w, and tf(w,s) is the term frequency of w in s.We report the number of topics in each dataset, along with the number of sentences after preprocessing.We show the ROUGE scores of our oracle method and the one by Liu and Lapata (2019) with average number of sentence in summary from each method.
btfisf: The boosted sum of TS-ISF scores for unigrams composing the sentence.Specifically, we compute w∈s isf(w)•b(w)•tf(w,s), where we boost the score of unigrams w that appear in the first sentence of the article as b(w).In the generic summarization b(w) = 2, for comparative summarization b(w) = 3, as used by Gillick et al. (2009).Unigrams that do not appear in the first sentence of the article have b(w) = 1.lexrank The LexRank score (Erkan and Radev, 2004) computed on the bigrams' cosine similarity.
For the TAC datasets, we additionally use: par_start: An indicator whether the sentence begins a paragraph.This is provided by the preprocessing pipeline from ICSI (Gillick et al., 2009).qsim: The fraction of topic description unigrams present in each sentence; these topic descriptions are only available for TAC.

Oracle Extraction
Both DUC and TAC provide four human written summaries for each topic.Since our goal is extractive summarization with supervised training, we need to know which sentences in the articles could be used to construct the summaries in the training set.The article sentences that best match the abstractive summaries are called the oracles (S t ).
Algorithm 1 Oracle extraction Our extraction algorithm (Algorithm 1), is inspired by Liu and Lapata (2019).We greedily select sentences (s) which provide the maximum gain in extraction score α(S t ,H t ) against the human summaries (H t ) until a word budget (L) is reached.We only include sentences between 8 to 55 words as suggested by Yasunaga et al. (2017), and set a budget of 104 words to ensure our oracle summaries are within 100±4 words, consistent with the evaluation ( §5.6).
In contrast to Liu and Lapata (2019) which uses only ROUGE-2 recall score (Lin, 2004), our method balances both ROUGE-1 and ROUGE-2 recall scores using the harmonic mean and explicitly accounts for sentence length.Grid search on the validation sets shows that the optimal value for r is 0.4 across different datasets and summarization tasks.As reported in Table 1, on average our method produces oracles consisting of more sentences and with higher ROUGE-1 and ROUGE-2 scores compared to oracles from Liu and Lapata (2019).This is consistent across all datasets.

Implementation Details
Supervised variants use an 2 regularized log-linear model of importance ( §4.2) trained using the oracles ( §5.4) as ground truth.We selected the number of training epochs using 5-fold cross validation.We then tune the other hyperparameters on the training set.The hyperparameters of the generic summarization task are: γ, a parameter of the kernel; β, the 2 regularization weight for the log-linear importance function; and r, which defines the length dependent scaling factor in greedy selection (Lin and Bilmes, 2010).The comparative objective (4.6) has an additional hyperparameter λ, which controls the comparativeness.More implementation details are provided in the Appendix.We will make implementation publicly available4 .

Evaluation Settings
To evaluate our methods we use the ROUGE (Lin, 2004) metric, the de facto choice for evaluating both generic summarization (Hong and Nenkova, 2014;Cho et al., 2019;Yasunaga et al., 2017;Kulesza and Taskar, 2012), and update summarization (Varma et al., 2009;Gillick and Favre, 2009;Zhang et al., 2009;Li et al., 2009).ROUGE metrics have been shown to correlate with human judgments (Lin, 2004) in generic summarization task.Our recent work (Bista et al., 2019) shows that human judgments are consistent with the automatic metrics for evaluating comparative summaries.
Both DUC and TAC evaluations use the first 100 words of the generated summary.Our DUC-2004 evaluation setup mirrors Hong et al. (2014).This allows us to compare performance with the state-of-the-art methods they reported and other works also evaluated using this setup 5 .As is standard for the DUC-2004 datasets, we report ROUGE-1 and ROUGE-2 recall scores.
For TAC-2009 datasets (both set A and B) , we adopt the evaluation settings from the TAC-2009 competition 6 so we can compare against the three best performing systems in the competition 7 .As is standard for the TAC-2009 dataset, we report ROUGE-2 and ROUGE-SU4 recall scores.

Baselines
DUC-2004: We select the top performing methods from a recent benchmark paper (Hong et al., 2014) to serve as baselines and report ROUGE scores from the benchmark paper.They are: ICSI: an integer linear programming method that maximizes coverage (Gillick et al., 2009), DPP: a determinantal point process method that learns sentence quality and maximizes diversity (Kulesza and Taskar, 2012), Submodular: a method based on a learned mixture of submodular functions (Lin and Bilmes, 2012), OCCAMS_V: a method base on topic modeling (Conroy et al., 2013), Regsum: a method that focuses on learning word importance (Hong and Nenkova, 2014), Lexrank: a popular graph based sentence scoring method (Erkan and Radev, 2004).
We also include recent deep learning methods evaluated using the same setup as Hong et al. (2014) and report ROUGE scores from the individual papers: DPPSim: an extension to the DPP model which learns the sentence-sentence similarity using a 5 ROUGE 1.5.5 with args -n 4 -m -a -l 100 -x -c 95 -r 1000 -f A -p 0.5 -t 0 6 tac.nist.gov/2009/Summarization 7args -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -a -l 100 capsule network (Cho et al., 2019), HiMAP: a recurrent neural model that employs a modified pointer-generator component (Fabbri et al., 2019), and GRU+GCN: a model that uses a graph convolution network combined with a recurrent neural network to learn sentence saliency (Yasunaga et al., 2017).
TAC-2009: As baselines for the TAC-2009 dataset we use the top three systems in the TAC-2009 competition for each task, resulting in four systems altogether.To the best of our knowledge these systems are the current state-of-the-art.We report the ROUGE scores from the competition.The systems are: ICSI: with two variants: Sys.34 uses integer linear programming to maximize coverage of concepts (Gillick et al., 2009), and Sys.40, which additionally uses sentence compression to generate new candidate sentences, IIT: uses a support vector regressor to predict sentence ROUGE scores (Varma et al., 2009), ICTCAS: a temporal content filtering method (Zhang et al., 2009), and ICL: a manifold ranking based method (Li et al., 2009).

Experimental Results
We compare our methods with the baselines on the DUC-2004, TAC-2009-A and TAC-2009-B datasets.We present several variants of our method to analyze the effects of different components and modeling choices.We report the performance of unsupervised MMD (UnsupMMD) which does not explicitly consider sentence importance.For our supervised method SupMMD, we report the performance with a bigram kernel (SupMMD) and combined kernels (SupMMD + MKL).We also evaluated the impact of our oracle extraction method by replacing it with the extraction method suggested by Liu and Lapata (2019) in SupMMD + alt oracles.Meanwhile, SupMMD + MKL + compress presents the result of applying sentence compression (Gillick et al., 2009) to our model.

Generic Summarization
The performance of our methods on the DUC-2004 generic summarization task are shown in Table 2. On the DUC-2004 dataset all SupMMD variants exceed the state-of-the-art, when evaluated with ROUGE-2, and perform similarly to the best existing methods when evaluated with ROUGE-1.(2019) and implemented in SupMMD (alt oracle) gives lower performance than our technique.For example, on DUC-2004 SupMMD (alt oracle) has a ROUGE-1 of 39.02 and ROUGE-2 of 10.22, while SupMMD has a ROUGE-1 of 39.36 and a ROUGE-2 of 10.31.Thus, the advantages of our proposed oracle extraction method are substantial and consistent across multiple datasets and evaluation metrics.
Multiple Kernel Learning: We observe that combining multiple kernels helps the performance of SupMMD models on the generic summarization task.SupMMD + MKL which combines both bigram and entity kernels has a ROUGE-2 of 10.54 on DUC-2004, while SupMMD only uses the bigrams kernel and scores 10.31 in ROUGE-2.Multiple kernels show even clearer gains in the TAC-2009-A dataset.
Sentence compression incorporated into the post-processing steps of SupMMD + MKL + compress does not clearly improve the results over SupMMD + MKL.On TAC-2009-A, compression clearly reduces performance, and on DUC-2004 SupMMD + MKL + compress has a higher ROUGE-1 score but a lower ROUGE-2 score than SupMMD + MKL.Incorporating compression into the summarization pipeline is an appealing direction for future work.

Comparative Summarization
The results for the comparative summarization task on the TAC-2009-B dataset are shown in Table 4.Our supervised MMD variants SupMMD and Sup-MMD + MKL both outperform the state-of-the-art baseline ICSI in ROUGE-SU4 but fall short in ROUGE-2.It would be hard to claim that either method is superior in this instance; however, it does show that SupMMD -which uses a substantially different approach to that of ICSI -provides an alternative state-of-the-art.Thus SupMMD further TAC-2009-B R2 RSU4 ICSI (Sys.34)(Gillick et al., 2009) 10.39 13.85 ICSI (Sys.40)(Gillick et al., 2009)  maps out the set of techniques that are useful for comparative summarization.As per the generic summarization task, both our supervised training method and oracle extraction method are essential for achieving good performance in ROUGE-2 and ROUGE-SU4.We also identify sentence position and btfisf as important features for sentence salience (see the Appendix).
Multiple kernels as in SupMMD + MKL has relatively little effect, reducing the ROUGE-2 score to 10.24 from the slightly higher 10.28 achieved by SupMMD.A similar small decrease is seen for ROUGE-SU4.Manual inspection shows that the summaries from SupMMD and SupMMD + MKL methods are largely identical with differences primarily on topic D0908, which covers political movements in Nepal.The key entities in this topic are not resolved accurately by DBPedia Spotlight, contributing additional noise and affecting the MKL approach.
Model variants: We have tested an additional variant of our model for comparative summarization, SupMMD 2 , which defines two different importance functions: one for each of the two document sets -A and B (See §4.4 for details).In contrast, SupMMD has a single importance function shared between document sets, i.e., in Equation (4.6), θ A = θ B .SupMMD 2 performed substantially worse than SupMMD in both metrics, for example, SupMMD has a ROUGE-2 of 10.28 while SupMMD 2 has a ROUGE-2 of 9.94.We conjecture that a single importance function performs better when training data is relatively scarce because it reduces the number of parameters and simplifies the learning problem.Techniques for tying together the parameters for both importance functions, such as with a hierarchical Bayesian model, are left as future work.

Conclusion
In this work, we present SupMMD, a novel technique for update summarization based on the maximum mean discrepancy.SupMMD combines supervised learning for salience, and unsupervised learning for coverage and diversity.Further, we adapt multiple kernel learning to exploit multiple sources of similarity (e.g., text features and knowledge based concepts).We show the efficacy of SupMMD in both generic and update summarization tasks on two standard datasets, when compared to the existing approaches.We also show that the importance model we introduce on top of our existing unsupervised MMD (Bista et al., 2019) improves the summarization performance substantially on both generic and comparative summarization tasks.
For future work, we leave the task of incorporating embeddings features such as BERT (Devlin et al., 2019), and evaluating with large generic multi-document summarization dataset Multi-News (Fabbri et al., 2019).

A Background
Theory on Kernels and MMD In this section, we provide a brief overview of kernels and Maximum mean Discrepancy (MMD).
For a detailed overview, we refer readers to to Muandet et al. (2017) and Gretton et al. (2012) from which this brief overview is taken.

A.3 More on MMD
Recall that F is a class of RKHS functions within the unit ball, i.e. h ∈ H, h H ≤ 1. Suppose H admits a feature map φ : X → H.Then, per Gretton et al. (2012), we may solve the supremum in Equation 3.3 as Hence, MMD is computed as the distance between the mean feature embeddings under each distribution, for a suitable kernel-based feature space (Gretton et al., 2012).
Eq. (A.1) involves explicitly evaluating the arbitrarily high-dimensional features.Instead, the kernel trick allows efficient computation of MMD 2 F (p, q) by evaluating just pairwise kernels.Supposing H has induced kernel k, we have

A.4 Characteristic Kernel
For a distribution p, and kernel with feature map φ : X → H, the kernel mean map is A kernel k is characteristic if the map µ : p → µ p is injective.A characteristic kernel ensures MMD is 0 if and only if p = q, i.e., no information is lost in mapping the distribution into the RKHS (Muandet et al., 2017).

B Proof of Lemma 4.1
The weighted MMD MMD F (p,q,θ) ( §4.1), where F contains functions h : X → R within unit ball RKHS H ( h H ≤ 1) is defined as: Recall f θ is a non-negative importance weighting function.Then, according to Patil and Rao (1978), the weighted probability density pθ of p is: ] and similarly qθ for q.

D Training details
We train generic summarization model with full batch LBFGS (Liu and Nocedal, 1989) with learning rate 0.005.We train comparative summarization model using Yogi optimizer (Zaheer et al., 2018), with a mini batch size of 8 topics, learning rate 0.002, and decreasing the learning rate by half every 20 epochs.We choose the number of training epochs by validating across 5 folds with early stopping.We set the patience to 20 epochs for early stopping with LBFGS optimizer and 50 epochs with Yogi optimizer.We tune the other hyperparameters on the training set, and the optimal hyperparameters of best model (SupMMD + MKL) and searched space are shown in Table 5.The kernel combination weights w ( §4.5) are also shown in Table 5.The kernel combination weights (w) are written in order: unigrams, bigrams and entities.

E Additional results
In this section we provide some additional results.We analyze the correlation between normalized ROUGE recall scores of the sentences and sentence scores from SupMMD and Lexrank.The normalized rouge score of each sentence is defined as ROUGE(s) = ROUGE(s) #words(s) .As shown in Table 8, we find that SupMMD has a slightly high correlation with sentence rouge scores.This suggests that SupMMD is better in capturing sentence importance for summarization.

E.2 Feature correlations
We analyze the correlation between various surface features and sentence importance scores from SupMMD and Lexrank (Erkan and Radev, 2004)  shown in table 6, SupMMD has higher correlation with relative position, signifying the importance of position of sentence in summary sentences.Lexrank has a higher correlations with the number of words, number of nouns and TFISF scores of the sentences, which is expected as Lexrank is an eigenvector centrality of sentence-sentence similarity matrix.This suggest SupMMD is able to learn that first few sentences are important in news summarization.Similar result is reported by Kedzie et al. (2018), where they show that the first few sentences are important in creating summary of news articles.

E.3 Example summary
We present the update summaries (Set A and B) of topic D0906, which contains articles about "Rains and mudslides in Southern California" in Table 7.We highlight few phrases in bold which could help us to identify the difference between set A and B. Summaries from ICSI and SupMMD methods suggest that set A contains articles describing events from earlier days of the disaster and set B contains articles from later stage of the disaster.

Figure 1 :
Figure 1: Different summarization tasks: (a) Generic (b) Comparative (c) Update.Two sets of articles (set A and B) are denoted by red and blue circles, respectively.Summary prototypes are bold circles, and information coverage of each tasks is filled with respective colors.

Table 1 :
Dataset statistics and oracle performance.
.1 Positive Definite Kernels and Kernel Trick Definition A.1.A functionk : X × X → R is called positive definite kernel if it is symmetric, i.e. ∀ x,y∈X k(x, y) = k(y, x) and gram matrix is positive definite, i.e. ∀ n∈N ∀ c 1 ,c 2 ,..cn∈R n i,j=1 c i c j k(x i ,x j ) ≥ 0. Theorem A.1.If a kernel is positive definite, there exists a feature map φ : X → H such that ∀ x,y∈X k(x,y) = φ(x), φ(y) H .This is known as the kernel trick in machine learning.The feature space H is called a Reproducing Kernel Hilbert Space (RKHS), and the kernel k is also known as reproducing kernel.An RKHS is a Hilbert space of functions where all function evaluations are bounded, i.e. ∀ x∈X ∀ h∈H ∃ c>0 |h(x)| ≤ c h H . H , where φ : X → H are canonical feature map associated with RKHS H, and φ(x) = k(.,x).A RKHS is fully characterized by its reproducing kernel k, or a positive definite kernel k uniquely determines a RKHS and vice versa .Hence, E p [h(x)] = h, E p [φ(x)] H , which is known as the Riesz representer theorem

Table 5 :
Optimal hyperparameters, their search space and MKL combination weights on each dataset.

Table 8 :
Correlation of sentence importance scores with normalized sentence ROUGE scores.

Table 6 :
Correlation of some features with sentence scores from SupMMD and Lexrank eigenvector centrality.ICSI A fourth day of thrashing thunderstorms began to take a heavier toll on southern California with at least three deaths blamed on the rain, as flooding and mudslides forced road closures and emergency crews carried out harrowing rescue operations.Downtown Los Angeles has had more than 15 inches of rain since Jan. 1, more than its average rainfall for an entire year, including 2.6 inches, a record.Meteorologists say Southern California has not been hit by this much rain in nearly 40 years.The disaster was the latest caused by rain and snow that has battered California since Dec. 25.Californians braced for even more rain as they struggled to recover from storms that have left at least nine people dead, triggered mudslides and tornadoes, and washed away roads and runways.The record, 38.18 inches (96.98 centimeters), was set in 1883-1884.Mudslides forced Amtrak to suspend train service between Los Angeles and Santa Barbara through at least Thursday.A winter storm pummeled Southern California for the third straight day, claiming the lives of three people and raising fears of mudslides, even as homes around the region were evacuated.Staff Writers Rick Orlov and Lisa Mascaro contributed to this story.SupMMD Downtown Los Angeles has had more than 15 inches of rain since Jan. 1, more than its average rainfall for an entire year, including 2.6 inches, a record.A fourth day of thrashing thunderstorms began to take a heavier toll on southern California with at least three deaths blamed on the rain, as flooding and mudslides forced road closures and emergency crews carried out harrowing rescue operations.The roads in Los Angeles County were equally frustrating.Part of a rain-saturated hillside gave way, sending a Mississippi-like torrent of earth and trees onto four blocks of this oceanfront town and killing two men.Storms have caused $52.5 million (euro39.8 million) in damage to Los Angeles County roads and facilities since the beginning of the year.Multi-million-dollar homes collapsed and mudslides trapped residents in their homes as a heavy rains that have claimed three lives pelted Los Angeles for the fifth straight day.In scenes reminiscent of the aftermath of the Northridge Earthquake 11 years ago this month, Los Angeles area residents faced gridlocked freeways and roads Wednesday while cleanup crews cleared mud, rubble and debris left from a two-week siege of rain.A recordshattering storm slammed Southern California for a sixth straight day Tuesday, triggering mudslides and tornadoes and forcing more road closures, but forecasters predicted it would wane Wednesday before a new storm moves in Sunday night.

Table 7 :
Example summaries of topic D0906, containing articles about "Rains and mudslides in Southern California".