Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization

The most important obstacles facing multi-document summarization include excessive redundancy in source descriptions and the looming shortage of training data. These obstacles prevent encoder-decoder models from being used directly, but optimization-based methods such as determinantal point processes (DPPs) are known to handle them well. In this paper we seek to strengthen a DPP-based method for extractive multi-document summarization by presenting a novel similarity measure inspired by capsule networks. The approach measures redundancy between a pair of sentences based on surface form and semantic information. We show that our DPP system with improved similarity measure performs competitively, outperforming strong summarization baselines on benchmark datasets. Our findings are particularly meaningful for summarizing documents created by multiple authors containing redundant yet lexically diverse expressions.


Introduction
Multi-document summarization is arguably one of the most important tools for information aggregation. It seeks to produce a succinct summary from a collection of textual documents created by multiple authors concerning a single topic (Nenkova and McKeown, 2011). The summarization technique has seen growing interest in a broad spectrum of domains that include summarizing product reviews (Gerani et al., 2014;, student survey responses (Luo and Litman, 2015;Luo et al., 2016), forum discussion threads (Ding and Jiang, 2015;Tarnpradab et al., 2017), and news articles about a particular event (Hong et al., 2014). Despite the empirical success, most of the datasets remain small, and the cost of hiring hu-man annotators to create ground-truth summaries for multi-document inputs can be prohibitive.
Impressive progress has been made on neural abstractive summarization using encoder-decoder models (Rush et al., 2015;See et al., 2017;Paulus et al., 2017;Chen and Bansal, 2018). These models, nonetheless, are data-hungry and learn poorly from small datasets, as is often the case with multidocument summarization. To date, studies have primarily focused on single-document summarization (See et al., 2017;Celikyilmaz et al., 2018;Kryscinski et al., 2018) and sentence summarization (Nallapati et al., 2016;Cao et al., 2018; in part because parallel training data are abundant and they can be conveniently acquired from the Web. Further, a notable issue with abstractive summarization is the reliability. These models are equipped with the capability of generating new words not present in the source. With greater freedom of lexical choices, the system summaries can contain inaccurate factual details and falsified content that prevent them from staying "true-to-original." In this paper we instead focus on an extractive method exploiting the determinantal point process (DPP; Kulesza and Taskar, 2012) for multidocument summarization. DPP can be trained on small data, and because extractive summaries are free from manipulation, they largely remain true to the original. DPP selects a set of most representative sentences from the given source documents to form a summary, while maintaining high diversity among summary sentences. It is one of a family of optimization-based summarization methods that performed strongest in previous summarization competitions (Gillick and Favre, 2009;Lin and Bilmes, 2010;Kulesza and Taskar, 2011).
Diversity is an integral part of the DPP model. It is modelled by pairwise repulsion between sentences. In this paper we exploit the capsule net-works (Hinton et al., 2018) to measure pairwise sentence (dis)similarity, then leverage DPP to obtain a set of diverse summary sentences. Traditionally, the DPP method computes similarity scores based on the bag-of-words representation of sentences (Kulesza and Taskar, 2011) and with kernel methods (Gong et al., 2014). These methods, however, are incapable of capturing lexical and syntactic variations in the sentences (e.g., paraphrases), which are ubiquitous in multi-document summarization data as the source documents are created by multiple authors with distinct writing styles. We hypothesize that the recently proposed capsule networks, which learn high-level representations based on the orientational and spatial relationships of low-level components, can be a suitable supplement to model pairwise sentence similarity.
Importantly, we argue that predicting sentence similarity within the context of summarization has its uniqueness. It estimates if two sentences contain redundant information based on both surface word form and their underlying semantics. E.g., the two sentences "Snowstorm slams eastern US on Friday" and "A strong wintry storm was dumping snow in eastern US after creating traffic havoc that claimed at least eight lives" are considered similar because they carry redundant information and cannot both be included in the summary. These sentences are by no means semantically equivalent, nor do they exhibit a clear entailment relationship. The task thus should be distinguished from similar tasks such as predicting natural language inference (Bowman et al., 2015;Williams et al., 2018) or semantic textual similarity (Cer et al., 2017). In this work, we describe a novel method to collect a large amount of sentence pairs that are deemed similar for summarization purpose. We contrast this new dataset with those used for textual entailment for modeling sentence similarity and demonstrate its effectiveness on discriminating sentences and generating diverse summaries. The contributions of this work can be summarized as follows: • we present a novel method inspired by the determinantal point process for multi-document summarization. The method includes a diversity measure assessing the redundancy between sentences, and a quality measure that indicates the importance of sentences. DPP extracts a set of summary sentences that are both representative of the document set and remain diverse; • we present the first study exploiting capsule networks for determining sentence similarity for summarization purpose. It is important to recognize that summarization places particular emphasis on measuring redundancy between sentences; and this notion of similarity is different from that of entailment and semantic textual similarity (STS); • our findings suggest that effectively modeling pairwise sentence similarity is crucial for increasing summary diversity and boosting summarization performance. Our DPP system with improved similarity measure performs competitively, outperforming strong summarization baselines on benchmark datasets.

Related Work
Extractive summarization approaches are the most popular in real-world applications (Carbonell and Goldstein, 1998;Daumé III and Marcu, 2006;Galanis and Androutsopoulos, 2010;Hong et al., 2014;Yogatama et al., 2015). These approaches focus on identifying representative sentences from a single document or set of documents to form a summary. The summary sentences can be optionally compressed to remove unimportant constituents such as prepositional phrases to yield a succinct summary (Knight and Marcu, 2002;Zajic et al., 2007;Martins and Smith, 2009;Berg-Kirkpatrick et al., 2011;Thadani and McKeown, 2013;Wang et al., 2013;Li et al., 2013Li et al., , 2014Filippova et al., 2015;Durrett et al., 2016). Extractive summarization methods are mostly unsupervised or lightly-supervised using thousands of training examples. Given its practical importance, we explore an extractive method in this work for multi-document summarization. It is not uncommon to cast summarization as a discrete optimization problem (Gillick and Favre, 2009;Takamura and Okumura, 2009;Lin and Bilmes, 2010;Hirao et al., 2013). In this formulation, a set of binary variables are used to indicate whether their corresponding source sentences are to be included in the summary. The summary sentences are selected to maximize the coverage of important source content, while minimizing the summary redundancy and subject to a length constraint. The optimization can be performed using an off-the-shelf tool such as Gurobi, IBM CPLEX, or via a greedy approximation algorithm. Notable optimization frameworks include integer linear programming (Gillick and Favre, 2009), determinantal point processes (Kulesza and Taskar, 2012), submodular functions (Lin and Bilmes, 2010), and minimum dominating set (Shen and Li, 2010). In this paper we employ the DPP framework because of its remarkable performance on various summarization problems (Zhang et al., 2016).
Recent years have also seen considerable interest in neural approaches to summarization. In particular, neural extractive approaches focus on learning vector representations of source sentences; then based on these representations they determine if a source sentence is to be included in the summary (Cheng and Lapata, 2016;Yasunaga et al., 2017;Nallapati et al., 2017;Narayan et al., 2018). Neural abstractive approaches usually include an encoder used to convert the entire source document to a continuous vector, and a decoder for generating an abstract word by word conditioned on the document vector (Paulus et al., 2017;Tan et al., 2017;Guo et al., 2018;Kedzie et al., 2018). These neural models, however, require large training data containing hundreds of thousands to millions of examples, which are still unavailable for the multi-document summarization task. To date, most neural summarization studies are performed for single document summarization.
Extracting summary-worthy sentences from the source documents is important even if the ultimate goal is to generate abstracts. Recent abstractive studies recognize the importance of separating "salience estimation" from "text generation" so as to reduce the amount of training data required by encoder-decoder models (Gehrmann et al., 2018;Lebanoff et al., 2018Lebanoff et al., , 2019. An extractive method is often leveraged to identify salient source sentences, then a neural text generator rewrites the selected sentences into an abstract. Our pursuit of the DPP method is especially meaningful in this context. As described in the next section, DPP has an extraordinary ability to distinguish redundant descriptions, thereby avoiding passing redundant content to the abstractor that can cause an encoderdecoder model to fail.

The DPP Framework
Let Y = {1, 2, · · · , N} be a ground set containing N items, corresponding to all sentences of the source documents. Our goal is to identify a subset of items Y ⊆ Y that forms an extractive summary of the document set. A determinantal point pro-cess (DPP; Kulesza and Taskar, 2012) defines a probability measure over all subsets of Y s.t.
where det(·) is the determinant of a matrix; I is the identity matrix; L ∈ R N×N is a positive semidefinite matrix, known as the L-ensemble; L ij measures the correlation between sentences i and j; and L Y is a submatrix of L containing only entries indexed by elements of Y . Finally, the probability of an extractive summary Y ⊆ Y is proportional to the determinant of the matrix L Y (Eq. (1)). Kulesza and Taskar (2012) provide a decomposition of the L-ensemble matrix: L ij = q i · S ij · q j where q i ∈ R + is a positive real number indicating the quality of a sentence; and S ij is a measure of similarity between sentences i and j. This formulation separately models the sentence quality and pairwise similarity before combining them into a unified model. Let Y = {i, j} be a summary containing only two sentences i and j, its probability P(Y ; L) can be computed as Eq.
(3) indicates that, if sentence i is of high quality, denoted by q i , then any summary containing it will have high probability. If two sentences i and j are similar to each other, denoted by S ij , then any summary containing both sentences will have low probability. The summary Y achieving the highest probability thus should contain a set of high-quality sentences while maintaining high diversity among the selected sentences (via pairwise repulsion). det(L Y ) also has a particular geometric interpretation as the squared volume of the space spanned by sentence vectors i and j, where the quality measure indicates the length of the vector and the similarity indicates the angle between two vectors ( Figure 1). We adopt a feature-based approach to compute sentence quality: q i = exp(θ x i ). In particular, x i is a feature vector for sentence i and θ are the feature weights to be learned during training. Kulesza and Taskar (2011) Figure 1: The DPP model specifies the probability of a summary P(Y = {i, j}; L) to be proportional to the squared volume of the space spanned by sentence vectors i and j.
a sentence TF-IDF vector. The model parameters θ are optimized by maximizing the log-likelihood of training data (Eq. (4)) and this objective can be optimized efficiently with subgradient descent. 2 During training, we create the ground-truth extractive summary (Ŷ ) for a document set based on human reference summaries (abstracts) using the following procedure. At each iteration we select a source sentence sharing the longest common subsequence with the human reference summaries; the shared words are then removed from human summaries to avoid duplicates in future selection. Similar methods are exploited by Nallapati et al. (2017) and Narayan et al. (2018) to create ground-truth extractive summaries. At test time, we perform inference using the learned DPP model to obtain a system summary (Y ). We implement a greedy method (Kulesza and Taskar, 2012) to iteratively add a sentence to the summary so that P(Y ; L) yields the highest probability (Eq. (1)), until a summary length limit is reached.
For the DPP framework to be successful, the sentence similarity measure (S ij ) has to accurately capture if any two sentences contain redundant information. This is especially important for multidocument summarization as redundancy is ubiquitous in source documents. The source descriptions frequently contain redundant yet lexically diverse expressions such as sentential paraphrases where people write about the same event using distinct styles (Hu et al., 2019). Without accurately modelling sentence similarity, redundant content can make their way into the summary and further prevent useful information from being included given the summary length limit. Existing cosine similarity measure between sentence TF-IDF vectors can be incompetent in modeling semantic relatedness. In the following section we exploit the recently introduced capsule networks (Hinton et al., 2018) to measure pairwise sentence similarity; it considers if two sentences share any words in common and more importantly the semantic closeness of sentence descriptions.

An Improved Similarity Measure
Our goal is to develop an advanced similarity measure for pairs of sentences such that semantically similar sentences can receive high scores despite that they have very few words in common. E.g., "Snowstorm slams eastern US on Friday" and "A strong wintry storm was dumping snow in eastern US after creating traffic havoc that claimed at least eight lives" have only two words in common. Nonetheless, they contain redundant information and cannot both be included in the summary.
Let {x a , x b } ∈ R E×L denote two sentences a and b. Each consists of a sequence of word embeddings, where E is the embedding size and L is the sentence length with zero-padding to the right for shorter sentences. A convolutional layer with multiple filter sizes is first applied to each sentence to extract local features (Eq. (5)), where x a i:i+k−1 ∈ R kE denotes a flattened embedding for position i with a filter size k, and u a i,k ∈ R d is the resulting local feature for position i; f is a nonlinear activation function (e.g., ReLU); {W u , b u } are model parameters.
We use u a i ∈ R D to denote the concatenation of local features generated using various filter sizes. Following Kim et al. (2014), we employ filter sizes k ∈ {3, 4, 5, 6, 7} with an equal number of filters (d) for each size (D = 5d). After applying maxpooling to local features of all positions, we obtain a representation u a = max-pooling(u a i ) ∈ R D for sentence a; and similarly we obtain u b ∈ R D for sentence b. It is not uncommon for state-ofthe-art sentence similarity classifiers  to concatenate the two sentence vectors, their absolute difference and element-wise product [u a ; u b ; |u a − u b |; u a • u b ], and feed this representation to a fully connected layer to predict if two sentences are similar. Nevertheless, we conjecture that such representation may be insufficient to fully characterize the relationship between components of the sentences in order to model sentence similarity. For example, the term "snowstorm" in sentence a is semantically related to "wintry storm" and "dumping snow" in sentence b; this low-level interaction indicates that the two sentences contain redundant information and it cannot be captured by the above model. Importantly, the capsule networks proposed by Hinton et al. (2018) are designed to characterize the spatial and orientational relationships between low-level components. We thus seek to exploit CapsNet to strengthen the capability of our system for identifying redundant sentences.
∈ R D be low-level representations (i.e., capsules). We seek to transform them to high-level capsules {v j } M j=1 ∈ R B that characterize the interaction between low-level components. Each low-level capsule u i ∈ R D is multiplied by a linear transformation matrix to dedicate a portion of it, denoted byû j|i ∈ R B , to the construction of a high-level capsule j (Eq. (6)); where {W v ij } ∈ R D×B are model parameters. To reduce parameters and prevent overfitting, we further encourage sharing parameters over all lowlevel capsules, yielding W v 1j = W v 2j = · · · , and the same parameter sharing is described in . By computing the weighted sum of u j|i , whose weights c ij indicate the strength of interaction between a low-level capsule i and a highlevel capsule j, we obtain an (unnormalized) capsule (Eq. (7)); we then apply a nonlinear squash function g(·) to normalize the length the vector to be less than 1, yielding v j ∈ R B .
Routing (Sabour et al., 2017;Zhao et al., 2019) aims to adjust the interaction weights (c ij ) using an iterative, EM-like method. Initially, we set {b ij } to be zero for all i and j. Per Eq. (8), c i becomes a uniform distribution indicating a lowlevel capsule i contributes equally to all its upper level capsules. After computingû j|i and v j using Eq. (6-7), the weights b ij are updated according to the strength of interaction (Eq. (9)). Ifû j|i agrees with a capsule v j , their interaction weight will be increased, and decreased otherwise. This process is repeated for r iterations to stabilize c ij .
The high-level capsules {v j } M j=1 effectively encode spatial and orientational relationships of lowlevel capsules. To identify the most prominent interactions, we apply max-pooling to all high-level capsules to produce v = max-pooling j (v j ) ∈ R B . This representation v, aimed to encode interactions between sentences a and b, is concatenated with [u a ; u b ; |u a −u b |; u a •u b ] and binary vectors [z a ; z b ] that indicate if any word in sentence a appears in sentence b and vice versa; they are used as input to a fully connected layer to predict if a pair of sentences contain redundant information. Our loss function contains two components, including a binary cross-entropy loss indicating whether the prediction is correct or not, and a reconstruction loss for reconstructing a sentence a conditioned on u a by predicting one word at a time using a recurrent neural network, and similarly for sentence b. A hyperparameter λ is used to balance contributions from both sides. In Figure 2 we present an overview of the system architecture, and hyperparameters are described in the supplementary.

Datasets
To our best knowledge, there is no dataset focusing on determining if two sentences contain redundant information. It is a nontrivial task in the context of multi-document summarization. Further, we argue that the task should be distinguished from other semantic similarity tasks: semantic textual similarity (STS; Cer et al., 2017) assesses to what degree two sentences are semantically equivalent to each other; natural language inference (NLI; Bowman et al., 2015) determines if one sentence ("hypothesis") can be semantically inferred from the other sentence ("premise"). Nonetheless, redundant sentences found in a set of source documents discussing a particular topic are not necessarily semantically equivalent or express an entailment relationship. We compare different datasets in §6.
Sentence redundancy dataset A novel dataset containing over 2 million sentence pairs is introduced in this paper for sentence redundancy prediction. We hypothesize that it is likely for a summary sentence and its most similar source sentence to contain redundant information. Because humans create summaries using generalization, paraphrasing, and other high-level text operations, a summary sentence and its source sentence can be semantically similar, yet contain diverse expressions. Fortunately, such source/summary sentence pairs can be conveniently derived from singledocument summarization data. We analyze the CNN/Daily Mail dataset (Hermann et al., 2015) that contains a massive collection of single news articles and their human-written summaries. For each summary sentence, we identify its most similar source sentence by calculating the averaged R-1, R-2, and R-L F-scores (Lin, 2004) between a source and summary sentences. We consider a summary sentence to have no match if the score is lower than a threshold. We obtain negative examples by randomly sampling two sentences from a news article. In total, our training / dev / test sets contain 2,084,798 / 105,936 / 86,144 sentence DUC-04 System R-1 R-2 R-SU4 Opinosis (Ganesan et al., 2010) 27.07 5.03 8.63 Extract+Rewrite  28.90 5.33 8.76 Pointer-Gen (See et al., 2017) 31.43 6.03 10.01 SumBasic (Vanderwende et al., 2007) 29.48 4.25 8.64 KLSumm (Haghighi et al., 2009) 31.04 6.03 10.23 LexRank (Erkan and Radev, 2004) 34.44 7.11 11.19 Centroid (Hong et al., 2014) 35.49 7.80 12.02 ICSISumm (Gillick and Favre, 2009) 37.31 9.36 13.12 DPP (Kulesza and Taskar, 2011  pairs and we make the dataset available to advance research on sentence redundancy.

Summarization datasets
We evaluate our DPPbased system on benchmark multi-document summarization datasets. The task is to create a succinct summary with up to 100 words from a cluster of 10 news articles discussing a single topic. The DUC and TAC datasets (Over and Yen, 2004;Dang and Owczarzak, 2008) have been used in previous summarization competitions. In this paper we use DUC-03/04 and TAC-08/09/10/11 datasets that contain 60/50/48/44/46/44 document clusters respectively. Four human reference summaries have been created for each document cluster by NIST assessors. Any system summaries are evaluated against human reference summaries using the ROUGE software (Lin, 2004) 3 , where R-1, -2, and -SU4 respectively measure the overlap of unigrams, bigrams, unigrams and skip bigrams with a maximum distance of 4 words. We report results on DUC-04 (trained on DUC-03) and TAC-11 (trained on TAC-08/09/10) that are often used as standard test sets (Hong et al., 2014).

Experimental Results
In this section we discuss results that we obtained for multi-document summarization and determining redundancy between sentences.

Summarization Results
We compare our system with a number of strong summarization baselines (Table 1 and 2). In particular, SumBasic (Vanderwende et al., 2007) is an extractive approach assuming words occurring fre- Opinosis (Ganesan et al., 2010) 25.15 5.12 8.12 Extract+Rewrite  29.07 6.11 9.20 Pointer-Gen (See et al., 2017) 31.44 6.40 10.20 SumBasic (Vanderwende et al., 2007) 31.58 6.06 10.06 KLSumm (Haghighi et al., 2009) 31.23 7.07 10.56 LexRank (Erkan and Radev, 2004) 33.10 7.50 11.13 DPP (Kulesza and Taskar, 2011  quently in a document cluster are more likely to be included in the summary; KL-Sum (Haghighi and Vanderwende, 2009) is a greedy approach adding a sentence to the summary to minimize KL divergence; and LexRank (Erkan and Radev, 2004) is a graph-based approach computing sentence importance based on eigenvector centrality. We additionally consider abstractive baselines to illustrate how well these systems perform on multi-document summarization: Opinosis (Ganesan et al., 2010) focuses on creating a word cooccurrence graph from the source documents and searching for salient graph paths to create an abstract; Extract+Rewrite  selects sentences using LexRank and condenses each sentence to a title-like summary; Pointer-Gen (See et al., 2017) seeks to generate abstracts by copying words from the source documents and generating novel words not present in the source text.
Our DPP-based framework belongs to a strand of optimization-based methods. In particular, IC-SISumm (Gillick et al., 2009) formulates extractive summarization as integer linear programming; it identifies a globally-optimal set of sentences covering the most important concepts of the source documents; DPP (Kulesza and Taskar, 2011) selects an optimal set of sentences that are representative of the source documents and with maximum diversity, as determined by the determinantal point process. Gong et al. (2014) show that the DPP performs well on summarizing both text and video.
We experiment with several variants of the DPP model: DPP-Capsnet computes the similarity between sentences (S ij ) using the CapsNet described in Sec. §4 and trained using our newly-constructed sentence redundancy dataset, whereas the default DPP framework computes sentence similarity as the cosine similarity of sentence TF-IDF vectors. DPP-Combined linearly combines the cosine sim-ilarity with the CapsNet output using an interpolation coefficient determined on the dev set 4 . Table 1 and 2 illustrate the summarization results we have obtained for the DUC-04 and TAC-11 datasets. Our DPP methods perform superior to both extractive and abstractive baselines, indicating the effectiveness of optimization-based methods for extractive multi-document summarization. The DPP optimizes for summary sentence selection to maximize their content coverage and diversity, expressed as the squared volume of the space spanned by the selected sentences.
Further, we observe that the DPP system with combined similarity metrics yields the highest performance, achieving 10.14% and 10.13% F-scores respectively on DUC-04 and TAC-11. This finding suggests that the cosine similarity of sentence TF-IDF vectors and the CapsNet semantic similarity successfully complement each other to provide the best overall estimate of sentence redundancy. A close examination of the system outputs reveal that important topical words (e.g., "$3 million") that are frequently discussed in the document cluster can be crucial for determining sentence redundancy, because sentences sharing the same topical words are more likely to be considered redundant. While neural models such as the CapsNet rarely explicitly model word frequencies, the TF-IDF sentence representation is highly effective in capturing topical terms.
In Table 3 we show example system summaries and a human-written reference summary. We observe that LexRank tends to extract long and comprehensive sentences that yield high graph centrality; the abstractive pointer-generator networks, despite the promising results, can sometimes fail to generate meaningful summaries (e.g., "a third of all 3-year-olds · · · have been given to a child"). In contrast, our DPP method is able to select a balanced set of representative and diverse summary sentences. We next compare several semantic similarity datasets to gain a better understanding of modeling sentence redundancy for summarization.

Sentence Similarity
We compare three standard datasets used for semantic similarity tasks, including SNLI (Bowman et al., 2015), used for natural language inference, STS-Benchmark (Cer et al., 2017) for semantic

LexRank Summary
• The official, Dr. Charles J. Ganley, director of the office of nonprescription drug products at the Food and Drug Administration, said in an interview that the agency was "revisiting the risks and benefits of the use of these drugs in children" and that "we're particularly concerned about the use of these drugs in children less than 2 years of age." • The Consumer Healthcare Products Association, an industry trade group that has consistently defended the safety of pediatric cough and cold medicines, recommended in its own 156-page safety review, also released Friday, that the FDA consider mandatory warning labels saying that they should not be used in children younger than two.
• Major makers of over-the-counter infant cough and cold medicines announced Thursday that they were voluntarily withdrawing their products from the market for fear that they could be misused by parents.

Pointer-Gen Summary
• Dr. Charles Ganley, a top food and drug administration official, said the agency was "revisiting the risks and benefits of the use of these drugs in children," the director of the FDA's office of nonprescription drug products.
• The FDA will formally consider revising labeling at a meeting scheduled for Oct. 18-19.
• The withdrawal comes two weeks after reviewing reports of side effects over the last four decades, a 1994 study found that more than a third of all 3-year-olds in the United States were estimated to have been given to a child.

DPP-Combined Summary
• Johnson & Johnson on Thursday voluntarily recalled certain infant cough and cold products, citing "rare" instances of misuse leading to overdoses.
• Federal drug regulators have started a broad review of the safety of popular cough and cold remedies meant for children, a top official said Thursday.
• Safety experts for the Food and Drug Administration urged the agency on Friday to consider an outright ban on over-the-counter, multi-symptom cough and cold medicines for children under 6.
• Major makers of over-the-counter infant cough and cold medicines announced Thursday that they were voluntarily withdrawing their products from the market for fear that they could be misused by parents.

Human Reference Summary
• On March 1, 2007, the Food/Drug Administration (FDA) started a broad safety review of children's cough/cold remedies.
• They are particularly concerned about use of these drugs by infants.
• By September 28th, the 356-page FDA review urged an outright ban on all such medicines for children under six.
• Dr. Charles Ganley, a top FDA official said "We have no data on these agents of what's a safe and effective dose in Children." The review also stated that between 1969 and 2006, 123 children died from taking decongestants and antihistimines.
• On October 11th, all such infant products were pulled from the markets. Table 3: Example system summaries and the human reference summary. LexRank extracts long and comprehensive sentences that yield high graph centrality. Pointer-Gen (abstractive) has difficulty in generating faithful summaries (see the last bullet "all 3-year-olds ... have been given to a child"). DPP is able to select a balanced set of representative and diverse sentences.

Dataset
Train Dev Test Accu. (Cer et al., 2017)   equivalence, and our newly-constructed Src-Summ sentence pairs. Details are presented in Table 4. We observe that CapsNet achieves the highest prediction accuracy of 94.8% on the Src-Summ dataset and it yields similar performance on SNLI, indicating the effectiveness of CapsNet on characterizing semantic similarity. STS appears to be a more challenging task, where CapsNet yields 64.7% accuracy. Note that we perform two-way classification on SNLI to discriminate entailment and contradiction. The STS dataset is too small to be used to train CapsNet without overfitting, we thus pre-train the model on Src-Summ pairs, and use the train split of STS to fine-tune parameters. Src-Summ Pairs (a) He ended up killing five girls and wounding five others before killing himself. (b) Nearly four months ago, a milk delivery-truck driver lined up 10 girls in a one-room schoolhouse in this Amish farming community and opened fire, killing five of them and wounding five others before turning the gun on himself. Table 5: Example positive () and negative () sentence pairs from the semantic similarity datasets. Table 5 shows example positive and negative sentence pairs from the STS, SNLI, and Src-Summ datasets. The STS and SNLI datasets are constructed by human annotators to test a system's capability of learning sentence representations. The sentences can share very few words in common but still express an entailment relationship (positive); or the sentences can share a lot of words in common yet they are semantically distinct (negative). These cases are usually not seen in summarization datasets containing clusters of documents discussing single topics. The Src-Summ dataset successfully strike a balance between shar- Figure 3: Heatmaps for topic D31008 of DUC-04 (cropped to 200 sentences) that shows the cosine similarity score of sentence TF-IDF vectors (Cosine, left), and the CapsNet output trained respectively on SNLI (right) and Src-Summ (middle) datasets. The short offdiagonal lines are near-identical sentences found in the document cluster.

STS-Benchmark
ing common words yet containing diverse expressions. It is thus a good fit for training classifiers to detect sentence redundancy. Figure 3 compares heatmaps generated by computing cosine similarity of sentence TF-IDF vectors (Cosine), and training CapsNet on SNLI and Src-Summ datasets respectively. We find that the Cosine similarity scores are relatively strict, as a vast majority of sentence pairs are assigned zero similarity, because these sentences have no word overlap. At the other extreme, CapsNet+SNLI labels a large quantity of sentence pairs as false positives, because its training data frequently contain sentences that share few words in common but nonetheless are positive, i.e., expressing an entailment relationship. The similarity scores generated by CapsNet+SrcSumm are more moderate comparing to CapsNet+SNLI and Cosine, suggesting the appropriateness of using Src-Summ sentence pairs for estimating sentence redundancy.

Conclusion
We strengthen a DPP-based multi-document summarization system with improved similarity measure inspired by capsule networks for determining sentence redundancy. We show that redundant sentences not only have common words but they can be semantically similar with little word overlap. Both aspects should be modelled in calculating pairwise sentence similarity. Our system yields competitive results on benchmark datasets surpassing strong summarization baselines.