Cascaded Attention based Unsupervised Information Distillation for Compressive Summarization

When people recall and digest what they have read in order to write summaries, the important content is more likely to attract their attention. Inspired by this observation, we propose a cascaded attention based unsupervised model to estimate salience information from the text for compressive multi-document summarization. The attention weights are learned automatically by an unsupervised data reconstruction framework and capture sentence salience. By adding sparsity constraints on the number of output vectors, we generate condensed information which can be treated as word salience. Fine-grained and coarse-grained sentence compression strategies are incorporated to produce compressive summaries. Experiments on benchmark datasets show that our framework achieves better results than state-of-the-art methods.


Introduction
The goal of Multi-Document Summarization (MDS) is to automatically produce a succinct summary, preserving the most important information of a set of documents describing a topic (Luhn, 1958; Edmundson, 1969; Goldstein et al., 2000; Erkan and Radev, 2004b; Wan et al., 2007; Nenkova and McKeown, 2012). A topic here represents a real event, e.g., "AlphaGo versus Lee Sedol". Consider the procedure of summary writing by humans: when people read, they remember and forget parts of the content, and information which is more important tends to make a deeper impression. When people recall and digest what they have read in order to write summaries, the important information usually attracts more attention (the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information), since it may appear repeatedly across documents, or be positioned in the beginning paragraphs.

* The work described in this paper is supported by grants from the Research and Development Grant of Huawei Technologies Co. Ltd (YB2015100076/TH1510257) and the Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14203414).
In the context of multi-document summarization, to generate a summary sentence for a key aspect of the topic, we need to find its relevant parts in the original documents, which may attract more attention. The semantic parts with high attention weights plausibly represent and reconstruct the topic's main idea. To this end, we propose a cascaded neural attention model to distill salient information from the input documents in an unsupervised data reconstruction manner, which includes two components: a reader and a recaller. The reader is a gated recurrent neural network (LSTM or GRU) based sentence sequence encoder which maps all the sentences of the topic into a global representation, with a mechanism for remembering and forgetting. The recaller decodes the global representation into significantly fewer diversified vectors for distillation and concentration. A cascaded attention mechanism is designed by incorporating attentions on both the hidden layer (dense distributed representation of a sentence) and the output layer (sparse bag-of-words representation of summary information). It is worth noting that the output vectors of the recaller can be viewed as word salience, and the attention matrix can be used as sentence salience. Both of them are automatically learned by data reconstruction in an unsupervised manner. Thereafter, the word salience is fed into a coarse-grained sentence compression component. Finally, the attention weights are integrated into a phrase-based optimization framework for compressive summary generation.
In fact, the notion of "attention" has gained popularity recently in neural network modeling, and has improved the performance of many tasks such as machine translation (Bahdanau et al., 2015; Luong et al., 2015). However, very few previous works employ the attention mechanism to tackle MDS. Rush et al. (2015) and Nallapati et al. (2016) employed the attention-based sequence-to-sequence (seq2seq) framework only for sentence summarization. Gu et al. (2016), Cheng and Lapata (2016), and Nallapati et al. (2016) also utilized seq2seq based frameworks with attention modeling for short text or single document summarization. Different from their works, our framework conducts multi-document summarization in an unsupervised manner.
Our contributions are as follows: (1) We propose a cascaded attention model that captures salient information in different semantic representations. (2) The attention weights are learned automatically by an unsupervised data reconstruction framework and capture sentence salience; by adding sparsity constraints on the number of output vectors of the recaller, we generate condensed vectors which can be treated as word salience. (3) We thoroughly investigate the performance of combining different attention architectures and cascaded structures. Experimental results on benchmark datasets show that our framework achieves better performance than state-of-the-art models.

Overview
Our framework has two phases, namely, information distillation for finding salient words/sentences, and compressive summary generation. For the first phase, our cascaded neural attention model consists of two components, the reader and the recaller, as shown in Figure 1. The reader component reads in all the sentences in the document set corresponding to the topic/event. The information distillation happens in the recaller component, where only the most important information is preserved. Precisely, the recaller outputs fewer vectors s than the number of input sentences x given to the reader.

Figure 1: Our cascaded attention based unsupervised information distillation framework. X is the original input sentence sequence of a topic. H^v denotes the hidden vectors of the sentences. "Enc" and "Dec" represent the RNN-based encoding and decoding layers respectively. c_g is the global representation for the whole topic. A^h and A^o are the distilled attention matrices for the hidden layer and the output layer respectively, representing the salience of sentences. H^o is the output hidden layer. s_1 and s_2 are the distilled condensed vectors representing the salience of words. Note that they are neither original inputs nor golden summaries.
After the learning of the neural attention model finishes, the obtained salience information is used in the second phase for compressive summary generation. This phase consists of two components: (i) the coarse-grained sentence compression component, which filters out trivial information based on the output vectors S from the neural attention model; (ii) the unified phrase-based optimization method for summary generation, in which the attention matrix A^o is used to conduct fine-grained compression and summary construction.

Reader
In the reader stage, for each topic, we extract all the sentences X = {x_1, x_2, ..., x_m} from the set of input documents and generate a sentence sequence of length m. The sentence order is the same as the original order in the documents. Then the reader reads the whole sequence sentence by sentence. We employ the bag-of-words (BOW) representation as the initial semantic representation of sentences. Assuming that the dictionary size is k, we have x_i ∈ R^k. Sparsity is a common problem for the BOW representation, especially when each vector is generated from a single sentence. Moreover, downstream algorithms might suffer from the curse of dimensionality. To alleviate these problems, we add a hidden layer H^v (v for input layer), a densely distributed representation above the input layer, as shown in Figure 1. Such distributed representations provide better generalization than the BOW representation in many different tasks (Le and Mikolov, 2014; Mikolov et al., 2013). Specifically, the input hidden layer projects the input sentence vector x_j into a new space R^h according to Equation 1:

h^v_j = tanh(W^v_{xh} x_j + b^v_h)    (1)

Then we obtain a new sentence sequence H^v = {h^v_1, h^v_2, ..., h^v_m}, where W^v_{xh} and b^v_h are the weight and bias respectively. The superscript v means that the variables are from the input layer.
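The projection in Equation 1 can be sketched in a few lines of pure Python; the toy vocabulary size, hidden size, and weight values below are illustrative, not the paper's actual parameters:

```python
import math

def project_bow(x, W, b):
    """Project a k-dim BOW vector x into an h-dim dense space: h = tanh(W x + b)."""
    return [math.tanh(sum(W[r][c] * x[c] for c in range(len(x))) + b[r])
            for r in range(len(W))]

# Toy example: k = 4 (vocabulary size), h = 2 (hidden size).
x = [1.0, 0.0, 2.0, 0.0]          # BOW counts for one sentence
W = [[0.1, 0.0, 0.2, 0.0],
     [0.0, 0.3, 0.0, 0.1]]
b = [0.0, 0.0]
h = project_bow(x, W, b)          # dense representation h^v_j
```

In the actual model this projection is applied to every sentence of the topic, yielding the sequence H^v fed to the encoder.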
While reading the sentence sequence, the reader should have the ability to remember and forget. Therefore, we employ RNN models with various gates (input gate, forget gate, etc.) to imitate this remembering and forgetting mechanism. The RNN based neural encoder (the third layer in Figure 1) maps the whole embedding sequence to a single vector c_g, which can be regarded as a global representation for the whole topic. Let t be the index of the sequence state for the sentence x_t; the hidden unit h^e_t (e for encoder RNN) of the RNN encoder is computed as:

h^e_t = f(h^e_{t-1}, h^v_t)    (2)

where the function f(·) computes the current hidden state given the previous hidden state h^e_{t-1} and the sentence embedding h^v_t. The encoder generates hidden states {h^e_t} over all time steps, and the last state h^e_m is extracted as the global representation c_g for the whole topic. The structure of f(·) can be either an LSTM (Hochreiter and Schmidhuber, 1997) or a GRU (Cho et al., 2014).
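The encoding loop can be sketched as follows; for brevity, the gated LSTM/GRU cell is replaced by a simple elementwise tanh recurrence with made-up scalar weights, so this shows only the sequence-to-c_g structure, not the gating itself:

```python
import math

def simple_cell(h_prev, x_t, w_x=0.5, w_h=0.8):
    """Stand-in for an LSTM/GRU cell: h_t = tanh(w_x * x_t + w_h * h_{t-1}), per dimension."""
    return [math.tanh(w_x * x + w_h * h) for x, h in zip(x_t, h_prev)]

def encode(sentence_embeddings, dim):
    """Run the recurrence over the sentence sequence; the last state serves as c_g."""
    h = [0.0] * dim
    states = []
    for x_t in sentence_embeddings:
        h = simple_cell(h, x_t)
        states.append(h)
    return states, h  # ({h^e_t}, c_g)

seq = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three toy sentence embeddings
states, c_g = encode(seq, dim=2)
```

Swapping `simple_cell` for a real LSTM or GRU cell recovers the remembering/forgetting behavior the paper relies on.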

Recaller
The recaller stage is the reverse of the reader stage, but it outputs a smaller number of vectors S, as shown in Figure 1. Given the global representation c_g and the past hidden state h^d_{t-1} (d for decoder RNN) from the decoder layer, an RNN based decoder generates hidden states according to:

h^d_t = f(h^d_{t-1})    (3)

We use c_g to initialize the first decoder hidden state. The decoder generates hidden states {h^d_t} over a pre-defined number of time steps. Then, similar to the reader stage, we add an output hidden layer after the decoder layer:

h^o_t = tanh(W^o_{hh} h^d_t + b^o_h)    (4)

where W^o_{hh} and b^o_h are the weight and bias respectively for the projection from h^d_t to h^o_t. Finally, the output layer maps these hidden vectors to the condensed vectors S = [s_1, s_2, ..., s_n]. Each output vector s_t has the same dimension k as the input BOW vectors and is obtained as follows:

s_t = σ(W^o_{hx} h^o_t + b^o_x)    (5)

For the purpose of distillation and concentration, we restrict n to be very small.
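A minimal sketch of the recaller loop, again with the RNN cell simplified to an elementwise tanh and with toy weight matrices (the names `W_out`, `b_out` are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def recall(c_g, n, W_out, b_out):
    """Decode the global vector c_g into n condensed BOW-sized vectors.
    The decoder recurrence is simplified to h_t = tanh(h_{t-1}) elementwise."""
    h = list(c_g)                       # first decoder state initialized with c_g
    S = []
    for _ in range(n):
        h = [math.tanh(v) for v in h]   # stand-in decoder recurrence
        # output layer: map the hidden state back into the k-dim vocabulary space
        s = [sigmoid(sum(W_out[r][c] * h[c] for c in range(len(h))) + b_out[r])
             for r in range(len(W_out))]
        S.append(s)
    return S

c_g = [0.4, -0.2]
W_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # k = 3, hidden size = 2
S = recall(c_g, n=2, W_out=W_out, b_out=[0.0, 0.0, 0.0])
```

The key structural point is that n is fixed and much smaller than the number of input sentences, which forces the condensation.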

Cascaded Attention Modeling
Salience estimation for words and sentences is a crucial component of MDS, especially in the unsupervised summarization setting. We propose a cascaded attention model for information distillation to tackle this salience estimation task. We add an attention mechanism not only in the hidden layer, but also in the output layer. With this cascaded attention model, we can capture the salience of sentences from two different and complementary vector spaces: the embedding space, which provides better generalization, and the BOW vector space, which captures more nuanced and subtle differences. For each output hidden state h^o_t, we align it with the input hidden states via an attention vector a^h_t ∈ R^m (recall that m is the number of input sentences). Each component a^h_{t,i} is derived by comparing h^o_t with the hidden state h^v_i of input sentence i:

a^h_{t,i} = exp(score(h^o_t, h^v_i)) / Σ_{i'} exp(score(h^o_t, h^v_{i'}))    (6)

where score(·) is a content-based function that captures the relation between two vectors. Several different formulations can be used for score(·), which will be elaborated later. Based on the alignment weights {a^h_{t,i}}, we create a context vector c^h_t by linearly blending the sentence hidden states {h^v_i}:

c^h_t = Σ_{i=1}^{m} a^h_{t,i} h^v_i    (7)

Then the output hidden state is updated based on the context vector. Let h̃^o_t = h^o_t; the original state is updated according to:

h^o_t = tanh(W_c [c^h_t ; h̃^o_t])    (8)

The alignment weights a^h_{t,i} capture which sentences should be attended more in the hidden space when generating the condensed representation for the whole topic.
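The align-then-blend step can be sketched as below. For compactness the sketch uses the dot-product score (the paper actually adopts "concat" for the hidden layer), and the vectors are toy values:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(h_out, sent_hiddens):
    """Align one output hidden state with every input sentence state and
    blend the sentence states into a context vector c^h_t."""
    a = softmax([dot(h_out, h_i) for h_i in sent_hiddens])   # alignment weights
    dim = len(sent_hiddens[0])
    c = [sum(a[i] * sent_hiddens[i][d] for i in range(len(a))) for d in range(dim)]
    return a, c

sent_hiddens = [[1.0, 0.0], [0.0, 1.0]]   # two toy sentence hidden states
a, c = attend([2.0, 0.0], sent_hiddens)   # attends mostly to the first sentence
```

The weights `a` sum to one, so `c` is a convex combination of the sentence hidden states, exactly the "linear blending" described above.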
Besides the attention mechanism on the hidden layer, we also add attention directly on the output BOW layer, which can capture more nuanced and subtle differences in the BOW vector space. The hidden attention weight a^h_{t,i} is integrated with the output attention via a weight λ_a ∈ [0, 1]:

a^o_{t,i} = λ_a a^h_{t,i} + (1 − λ_a) exp(score(s̃_t, x_i)) / Σ_{i'} exp(score(s̃_t, x_{i'}))    (9)

The output context vector is computed as:

c^o_t = Σ_{i=1}^{m} a^o_{t,i} x_i    (10)

To update the output vector s_t in Equation 5, we develop a method different from that of the hidden attention. Specifically, we use a weighted combination of the context vector and the original output with λ_c ∈ [0, 1]. Let s̃_t = s_t; the updated s_t is:

s_t = λ_c c^o_t + (1 − λ_c) s̃_t    (11)

The parameters λ_a and λ_c can also be learned during training. There are several alternatives for the function score(h_1, h_2):

score(h_1, h_2) = h_1^T h_2    (dot)
score(h_1, h_2) = h_1^T W h_2    (tensor)
score(h_1, h_2) = v^T tanh(W [h_1 ; h_2])    (concat)

Considering their behaviors as studied in (Luong et al., 2015), we adopt "concat" for the hidden attention layer, and "dot" for the output attention layer.
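The three score alternatives can be written out directly; the function names and the tiny weight matrices below are illustrative only:

```python
import math

def score_dot(u, v):
    """'dot': u^T v."""
    return sum(a * b for a, b in zip(u, v))

def score_tensor(u, v, W):
    """'tensor' (bilinear) form: u^T W v."""
    return sum(u[i] * sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(u)))

def score_concat(u, v, w, U):
    """'concat': w^T tanh(U [u; v])."""
    z = u + v                                              # concatenation
    hidden = [math.tanh(sum(U[r][c] * z[c] for c in range(len(z)))) for r in range(len(U))]
    return sum(w[r] * hidden[r] for r in range(len(w)))

u, v = [1.0, 0.0], [0.0, 1.0]
s1 = score_dot(u, v)
s2 = score_tensor(u, v, [[0.0, 1.0], [1.0, 0.0]])
s3 = score_concat(u, v, [1.0], [[0.5, 0.5, 0.5, 0.5]])
```

"dot" is parameter-free but requires both vectors to live in the same space, which is why it suits the output layer (both s̃_t and x_i are BOW-sized), while "concat" can relate vectors of different spaces via learned weights.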

Unsupervised Learning
By minimizing the loss of reconstructing the original input sentence vectors from the condensed output vectors, we learn all the parameters:

min_Θ Σ_{i=1}^{m} || x_i − x̂_i ||_2^2 + λ_s Σ_{t=1}^{n} || s_t ||_1    (14)

where Θ denotes all the parameters in our model, and x̂_i = Σ_{t=1}^{n} a^o_{t,i} s_t is the reconstruction of x_i from the condensed output vectors. In order to penalize the unimportant terms in the output vectors, we put a sparsity constraint on the rows of S using l_1-regularization, with the weight λ_s as a scaling constant determining its relative importance.
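The training objective, reconstruction error plus l_1 sparsity, can be computed as below (toy vectors, and an illustrative λ_s):

```python
def l2_sq(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def l1(v):
    """l1 norm of a vector."""
    return sum(abs(x) for x in v)

def loss(X, X_hat, S, lam_s):
    """Reconstruction error between original and reconstructed sentence vectors,
    plus an l1 sparsity penalty on the condensed output vectors S."""
    recon = sum(l2_sq(x, xh) for x, xh in zip(X, X_hat))
    sparsity = lam_s * sum(l1(s) for s in S)
    return recon + sparsity

X = [[1.0, 0.0], [0.0, 1.0]]        # original sentence BOW vectors
X_hat = [[0.9, 0.1], [0.1, 0.9]]    # their reconstructions
S = [[0.5, 0.0]]                    # one condensed output vector
J = loss(X, X_hat, S, lam_s=0.1)
```

The l_1 term drives most vocabulary entries of each s_t toward zero, which is what makes the surviving entries interpretable as salient words.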
Let s̄ ∈ R^k be the magnitude vector computed from the columns of S (S ∈ R^{n×k}). Once the training is finished, each dimension of s̄ can be regarded as the salience score of the corresponding word. According to Equation 14, the vectors s_i ∈ S are used to reconstruct the original sentence space X, and n ≪ m (the number of sentences in X is much larger than the number of vectors in S). Therefore a large value in s̄ means that the corresponding word carries important information about the topic, and s̄ serves as the word salience.
Moreover, the output layer attention matrix A^o can be regarded as containing the sentence salience information. Note that each output vector s_i is generated based on the cascaded attention mechanism. Assume that a^o_i = A^o_{i,:} ∈ R^m is the attention weight vector for s_i. According to Equation 9, a large value in a^o_i conveys that the corresponding sentence should contribute more when generating s_i. We therefore use the magnitude of the columns of A^o to represent the salience of sentences.
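Both salience signals reduce to the same column-magnitude computation, sketched here on toy matrices:

```python
def column_magnitudes(M):
    """Euclidean magnitude of each column of a matrix given as a list of rows."""
    rows, cols = len(M), len(M[0])
    return [sum(M[r][c] ** 2 for r in range(rows)) ** 0.5 for c in range(cols)]

# Word salience: magnitude of each vocabulary column of S (n x k).
S = [[0.9, 0.1, 0.0],
     [0.8, 0.0, 0.2]]
word_salience = column_magnitudes(S)

# Sentence salience: magnitude of each sentence column of A^o (n x m).
A_o = [[0.7, 0.2, 0.1],
       [0.6, 0.3, 0.1]]
sentence_salience = column_magnitudes(A_o)
```

Each column of S corresponds to one dictionary word and each column of A^o to one input sentence, so the two magnitude vectors directly rank words and sentences.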

Coarse-grained Sentence Compression
Using the information distillation result from the cascaded neural attention model, we conduct coarse-grained compression on each individual sentence. Such a strategy has been adopted in several multi-document summarization methods (Li et al., 2013; Wang et al., 2013; Yao et al., 2015). Our coarse-grained sentence compression jointly considers the word salience obtained from the neural attention model and linguistically-motivated rules. The rules are designed based on obvious evidence of uncritical information from the word level to the clause level, including news headers such as "BEIJING, Nov. 24 (Xinhua) -", and intra-sentential attribution such as ", police said Thursday", ", he said", etc. The information matched by the rules is then processed according to the word salience scores: information whose salience score is below a pre-defined threshold ε is removed.
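One way to realize this rule-plus-salience filter is sketched below; the two regex patterns and the salience dictionary are illustrative stand-ins for the paper's actual rule set:

```python
import re

# Hypothetical rule patterns for uncritical spans (illustrative, not the paper's full rules).
RULES = [
    re.compile(r'^[A-Z]+, [A-Za-z]+\. \d+ \(Xinhua\) -- '),          # news header
    re.compile(r', (?:police|he|she|officials) said(?: \w+day)?'),   # attribution clause
]

def compress(sentence, salience, eps=0.005):
    """Drop rule-matched spans whose average word salience falls below eps."""
    for rule in RULES:
        m = rule.search(sentence)
        if m:
            words = m.group(0).strip(' ,-').split()
            avg = sum(salience.get(w.lower(), 0.0) for w in words) / max(len(words), 1)
            if avg < eps:
                sentence = sentence.replace(m.group(0), '', 1)
    return sentence.strip()

s = "Eight people were killed, police said Thursday."
out = compress(s, salience={'eight': 0.2, 'people': 0.1, 'killed': 0.3})
```

Because the attribution words carry no salience in this toy dictionary, the clause is stripped and only the event-bearing core of the sentence survives.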

Phrase-based Optimization for Summary Construction
After the coarse-grained compression of each sentence described above, we design a unified optimization method for summary generation. We refine the phrase-based summary construction model from previous work by adjusting its goal to compressive summarization, considering both the salience information obtained by our neural attention model and the compressed sentences from the coarse-grained compression component. Based on the parsed constituency tree of each input sentence, as described in Section 2.3.1, we extract the noun phrases (NPs) and verb phrases (VPs). The salience S_i of a phrase P_i is defined as:

S_i = a_i × Σ_{t ∈ P_i} tf(t)

where a_i is the salience of the sentence containing P_i, and tf(t) is the frequency of the concept t (unigram/bigram) in the whole topic. Thus, S_i inherits the salience of its sentence and also considers the importance of its concepts. The overall objective for selecting salient NPs and VPs is formulated as an integer linear programming (ILP) problem:

max { Σ_i α_i S_i − Σ_{i<j} α_{ij} R_{ij} }

where α_i is the selection indicator of the phrase P_i, S_i is the salience score of P_i, and α_{ij} and R_{ij} are the co-occurrence indicator and the similarity of the pair of phrases (P_i, P_j) respectively. The similarity is calculated with a Jaccard Index based method. This objective maximizes the salience of the selected phrases, as indicated by the first term, and penalizes the selection of similar phrase pairs.
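On a handful of phrases, the trade-off the ILP encodes can be demonstrated with an exhaustive search (the real system uses an ILP solver; the phrases, salience scores, and word-count budget here are toy values):

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between the word sets of two phrases."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def select_phrases(phrases, salience, budget):
    """Brute-force analogue of the ILP: maximize the summed salience of the
    selected phrases minus pairwise similarity penalties, under a length budget."""
    best, best_score = (), float('-inf')
    for r in range(len(phrases) + 1):
        for subset in combinations(range(len(phrases)), r):
            if sum(len(phrases[i].split()) for i in subset) > budget:
                continue
            score = sum(salience[i] for i in subset)
            score -= sum(jaccard(phrases[i], phrases[j]) for i, j in combinations(subset, 2))
            if score > best_score:
                best_score, best = score, subset
    return [phrases[i] for i in best]

phrases = ["killed eight people", "killing eight people", "opened fire at a mall"]
chosen = select_phrases(phrases, salience=[0.9, 0.85, 0.7], budget=8)
```

The two near-duplicate phrases share a high Jaccard penalty, so the search prefers one of them plus the complementary phrase, mirroring the redundancy term of the objective.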
In order to obtain coherent summaries with good readability, we add constraints to the ILP framework, such as a sentence generation constraint. Let β_k denote the selection indicator of the sentence x_k: if any phrase from x_k is selected, then β_k = 1; otherwise, β_k = 0. For generating a compressed summary sentence, we require that if β_k = 1, at least one NP and at least one VP of the sentence are selected:

Σ_{P_i ∈ NP(x_k)} α_i ≥ β_k,  Σ_{P_i ∈ VP(x_k)} α_i ≥ β_k

Other constraints include the sentence number, the summary length, phrase co-occurrence, etc.; for details, please refer to McDonald (2007) and Woodsend and Lapata (2012). The objective function and constraints are linear, so the optimization can be solved by existing ILP solvers such as the simplex algorithm (Dantzig and Thapa, 2006). In the implementation, we use the package lp_solve.
In the post-processing, the phrases and sentences in a summary are ordered according to their natural order if they come from the same document. Otherwise, they are ordered according to the timestamps of the corresponding documents.

Settings
For text processing, the input sentences are represented as BOW vectors of dimension k. The dictionary is created using unigrams and named entity terms. The word salience threshold ε used in sentence compression is 0.005. For the neural network framework, we set the hidden size to 500. All the neural matrix parameters W in the hidden layers and RNN layers are initialized from a uniform distribution between [−0.1, 0.1]. Adadelta (Zeiler, 2012) is used for gradient based optimization. Gradient clipping is adopted by scaling gradients when the norm exceeds a threshold of 10. The maximum number of epochs in the optimization procedure is 200. We limit the number of distilled vectors to n = 5. The cascaded attention parameters λ_a and λ_c are learned by our model, and the sparsity penalty λ_s in Equation 14 is a fixed scaling constant. We use the ROUGE score as our evaluation metric (Lin, 2004) with standard options; F-measures of ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4) are reported.
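The norm-based gradient clipping mentioned above amounts to a simple rescaling, sketched here for a flat gradient vector:

```python
def clip_by_global_norm(grads, max_norm=10.0):
    """Scale all gradients down uniformly when their global l2 norm exceeds max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([30.0, 40.0], max_norm=10.0)   # norm 50 -> rescaled to norm 10
```

Scaling the whole gradient vector (rather than clipping each component) preserves its direction, which is the usual reason this variant is preferred for RNN training.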

Effect of Existing Salience Models and Different Attention Architectures
We quantitatively evaluate the performance of different variants on the TAC 2010 dataset. The experimental results are shown in Table 1. Note that the summary generation phase is the same for all methods; only the salience estimation methods differ. Commonly used existing methods for salience estimation include concept weight (CW) and sparse coding (SC). As mentioned in Section 2.2.3, there are several alternatives for the attention scoring function score(·): dot, tensor, and concat. Moreover, we also design experiments to show the benefit of our cascaded attention mechanism versus single attention. AttenC denotes the cascaded attention mechanism; AttenH and AttenO denote attention only on the hidden layer or the output layer respectively, without cascaded combination. Among all the methods, the cascaded attention model with the dot structure achieves the best performance. The effect of different RNN models, such as LSTM and GRU, is similar; however, GRU has fewer parameters, which improves training efficiency. Therefore, we choose AttenC-dot-gru as the attention structure of our framework in the subsequent experiments. Moreover, the results without coarse-grained sentence compression are also examined.

Main Results of Compressive MDS
We compare our system C-Attention with several unsupervised summarization baselines and state-of-the-art models. The Random baseline selects sentences randomly for each topic. The Lead baseline (Wasson, 1998) ranks the news chronologically and extracts the leading sentences one by one. TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004a) estimate sentence salience by applying the PageRank algorithm to the sentence graph. PKUTM (Li et al., 2011) employs manifold-ranking for sentence scoring and selection. ABS-Phrase generates abstractive summaries using a phrase-based optimization framework. Three other unsupervised methods based on sparse coding are also compared, namely, DSDR (He et al., 2012), MDS-Sparse (Liu et al., 2015), and RA-MDS. As shown in Table 2, Table 3, and Table 4, our system achieves the best results on all the ROUGE metrics. The reasons are as follows: (1) The attention model can directly capture the salient sentences, which are obtained by minimizing the global data reconstruction error; (2) The cascaded structure of attentions can jointly consider the embedding vector space and the bag-of-words vector space when estimating sentence salience; (3) The coarse-grained sentence compression based on distilled word salience, together with the fine-grained compression via the phrase-based unified optimization framework, generates more concise and salient summaries. It is worth noting that PKUTM used a Wikipedia corpus for providing domain knowledge. The system SWING (Min et al., 2012) is the best system for TAC 2011, and our results are not as good as SWING's. The reason is that SWING employs category-specific features and requires supervised training; these features help it select better category-specific content for the summary. In contrast, our model is basically unsupervised.

Linguistic Quality Evaluation
The linguistic quality of summaries generated by ABS-Phrase, PKUTM, and our model from 20 topics of TAC 2011 is evaluated using the five linguistic quality questions from the Document Understanding Conferences (DUC): grammaticality (Q1), non-redundancy (Q2), referential clarity (Q3), focus (Q4), and coherence (Q5). A Likert scale with five levels is employed, with 5 being very good and 1 being very poor. Each summary was blindly evaluated by three assessors on each question. The results are given in Table 5. PKUTM is an extractive method that picks original sentences; hence it achieves a higher score on Q1 (grammaticality). ABS-Phrase is an abstractive method that can generate new sentences by merging different phrases. The grammaticality of our compression-based framework is better than that of ABS-Phrase, but not as good as PKUTM's. However, our framework performs best on some other metrics such as Q2 (non-redundancy) and Q4 (focus). The reason is that our framework can compress and remove uncritical and redundant content from the original sentences, which leads to better performance on Q2 and Q4.

Case Study: Distilled Word Salience
As mentioned above, the output vectors S in our neural model contain the distilled word salience information. To show the performance of word salience estimation, we select 3 topics (events) from different categories of TAC 2011: "Finland Shooting", "Heart Disease", and "HIV Infection Africa". For each topic, we sort the dictionary terms according to their salience scores and extract the top-10 terms as the salience estimation results, shown in Table 6. We can see that the top-10 terms reveal the most important information of each topic. For the topic "Finland Shooting", the golden summary contains the sentence "A teenager at a school in Finland went on a shooting rampage Wednesday, November 11, 2007, killing 8 people, then himself." It is obvious that the top-10 terms in Table 6 capture this main point.
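Producing such a table is a direct sort over the salience dictionary; the scores below are made-up placeholders, not the model's actual output:

```python
def top_k_terms(salience, k=10):
    """Sort dictionary terms by distilled salience and keep the top k."""
    return [t for t, _ in sorted(salience.items(), key=lambda kv: kv[1], reverse=True)[:k]]

# Hypothetical salience scores for the "Finland Shooting" topic.
salience = {'finland': 0.9, 'shooting': 0.8, 'school': 0.7, 'the': 0.01, 'of': 0.005}
top = top_k_terms(salience, k=3)
```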

Case Study: Attention-based Sentence Salience
In our model, the distilled attention matrix A^o can be treated as a sentence salience estimate. Let a ∈ R^m be the vector of column magnitudes of A^o; then a_i represents the salience of the sentence x_i. We collect the attention vectors for 8 topics of TAC 2011 and display them as an image in Figure 2. The x-axis represents the sentence id (we show at most 100 sentences), and the y-axis represents the topic id. The gray level of the pixels indicates the salience score, where dark represents a high score and light a low score. Note that different topics hold different ranges of salience scores because they have different numbers of sentences, i.e., different m. According to Equation 9, topics containing more sentences distribute the attention over more units, so each sentence receives a relatively smaller attention weight. This does not affect the performance of MDS, since different topics are processed independently. In Figure 2, some chunks in each topic (see Topic 3 as an example) have higher attention weights, which automatically captures one characteristic of MDS: sentence position is an important feature for news summarization. As observed in several previous studies (e.g., Min et al., 2012), the sentences at the beginning of a news document are usually more important and tend to be used for writing model summaries. Manual checking verified that those high-attention chunks correspond to the beginning sentences; our model automatically captures this by assigning the later sentences in each topic lower attention weights. Table 7 shows the summary of the topic "Hawkins Robert Van Maur" in TAC 2011. The summary contains four sentences, all compressed with different compression ratios. Some uncritical information is excluded from the summary sentences, such as "police said Thursday" in S2, "But" in S3, and "he said" in S4.
In addition, the VP "killing eight people" in S2 is also excluded since it duplicates the phrase "killed eight people" in S3. Moreover, from this case we can see that the compression operations did not harm the linguistic quality.

Related Works
According to the machine learning paradigm, summarization models can be divided into supervised and unsupervised frameworks. Several previous works are based on unsupervised models. For example, Mihalcea and Tarau (2004) and Erkan and Radev (2004a) estimated sentence salience by applying the PageRank algorithm to the sentence graph. He et al. (2012), Liu et al. (2015), Li et al. (2015) and Song et al. (2017) employed sparse coding techniques for finding salient sentences as summaries. Salience estimation has also been conducted by jointly considering reconstructions in several different vector spaces generated by a variational auto-encoder framework.
Some recent works utilize attention modeling based recurrent neural networks to tackle the task of single-document summarization. Rush et al. (2015) proposed a sentence summarization framework based on a neural attention model using a supervised sequence-to-sequence neural machine translation model. Gu et al. (2016) combined a copying mechanism with the seq2seq framework to improve the quality of the generated summaries. Nallapati et al. (2016) also employed the typical attention-based seq2seq framework, but utilized a trick to control the vocabulary size to improve training efficiency. However, few previous works employ the attention mechanism to tackle the unsupervised MDS problem. In contrast, our attention-based framework generates summaries for multi-document summarization in an unsupervised manner.

Conclusions
We propose a cascaded neural attention based unsupervised salience estimation method for compressive multi-document summarization. The attention weights for sentences and the salience values for words are both learned by data reconstruction in an unsupervised manner. We thoroughly investigate the performance of combining different attention architectures and cascaded structures. Experimental results on benchmark datasets show that our framework achieves good performance compared with state-of-the-art methods.