Iterative Document Representation Learning Towards Summarization with Polishing

In this paper, we introduce Iterative Text Summarization (ITS), an iteration-based model for supervised extractive text summarization, inspired by the observation that it is often necessary for a human to read an article multiple times in order to fully understand and summarize its contents. Current summarization approaches read through a document only once to generate a document representation, resulting in a sub-optimal representation. To address this issue we introduce a model which iteratively polishes the document representation on many passes through the document. As part of our model, we also introduce a selective reading mechanism that decides more accurately the extent to which each sentence in the model should be updated. Experimental results on the CNN/DailyMail and DUC2002 datasets demonstrate that our model significantly outperforms state-of-the-art extractive systems when evaluated by machines and by humans.


Introduction
A summary is a shortened version of a text document which maintains the most important ideas from the original article. Automatic text summarization is a process by which a machine gleans the most important concepts from an article, removing secondary or redundant concepts. Nowadays as there is a growing need for storing and digesting large amounts of textual data, automatic summarization systems have significant usage potential in society.
Extractive summarization is a technique for generating summaries by directly choosing a subset of salient sentences from the original document to constitute the summary. Most efforts made towards extractive summarization either rely on human-engineered features such as sentence length, word position, and frequency (Cohen, 2002; Radev et al., 2004; Woodsend and Lapata, 2010; Yan et al., 2011a,b, 2012) or use neural networks to automatically learn features for sentence selection (Cheng and Lapata, 2016; Nallapati et al., 2016a).
* Corresponding author: Rui Yan (ruiyan@pku.edu.cn)
Although existing extractive summarization methods have achieved great success, one limitation they share is that they generate the summary after only one pass through the document. However, in real-world human cognitive processes, people read a document multiple times in order to capture the main ideas. Browsing through the document only once often means the model cannot fully get at the document's main ideas, leading to a subpar summarization. We share two examples of this. (1) Consider the situation where we almost finish reading a long article and forget some main points in the beginning. We are likely to go back and review the part that we forget. (2) To write a good summary, we usually first browse through the document to obtain a general understanding of the article, then perform a more intensive reading to select salient points to include in the summary. In terms of model design, we believe that letting a model read through a document multiple times, polishing and updating its internal representation of the document can lead to better understanding and better summarization.
To achieve this, we design a model that we call Iterative Text Summarization (ITS), consisting of a novel "iteration mechanism" and a "selective reading module". ITS is an iterative process, reading through the document many times. There is one encoder, one decoder, and one iterative unit in each iteration. They work together to polish the document representation. The final labeling part uses outputs from all iterations to generate summaries. The selective reading module we design is a modified version of a Gated Recurrent Unit (GRU) network, which can decide how much of the hidden state of each sentence should be retained or updated based on its relationship with the document.
Overall, our contributions include:
1. We propose Iterative Text Summarization (ITS), an iteration-based summary generator which uses a sequence classifier to extract salient sentences from documents.
2. We introduce a novel iterative neural network model which repeatedly polishes the distributed representation of the document instead of generating it once and for all. In addition, we propose a selective reading mechanism, which decides how much information of each sentence should be updated based on its relationship with the polished document representation. Our entire architecture can be trained in an end-to-end fashion.
3. We evaluate our summarization model on the representative CNN/DailyMail corpora and the benchmark DUC2002 dataset. Experimental results demonstrate that our model outperforms state-of-the-art extractive systems when evaluated automatically and by humans.

Related Work
Our research builds on previous works in two fields: summarization and iterative modeling. Text summarization can be classified into extractive summarization and abstractive summarization. Extractive summarization aims to generate a summary by integrating the most salient sentences in the document. Abstractive summarization aims to generate new content that concisely paraphrases the document from scratch.
With the emergence of powerful neural network models for text processing, a vast majority of the literature on document summarization is dedicated to abstractive summarization. These models typically take the form of convolutional neural networks (CNN) or recurrent neural networks (RNN). For example, Rush et al. (2015) propose an encoder-decoder model which uses a local attention mechanism to generate summaries. Nallapati et al. (2016b) further develop this work by addressing problems that had not been adequately solved by the basic architecture, such as keyword modeling and capturing the hierarchy of sentence-to-word structures. In a follow-up work, Nallapati et al. (2017) propose a new summarization model which generates summaries by sampling a topic one sentence at a time, then producing words using an RNN decoder conditioned on the sentence topic. Another related work is by See et al. (2017), where the authors use "pointing" and "coverage" techniques to generate more accurate summaries.
Despite the focus on abstractive summarization, extractive summarization remains an attractive method as it is capable of generating more grammatically and semantically correct summaries. This is the method we follow in this work. In extractive summarization, Cheng and Lapata (2016) propose a general framework for single-document text summarization using a hierarchical article encoder composed with an attention-based extractor. Following this, Nallapati et al. (2016a) propose a simple RNN-based sequence classifier which outperforms or matches the state-of-the-art models at the time. In another approach, Narayan et al. (2018) use a reinforcement learning method to optimize the Rouge evaluation metric for text summarization. The most recent work on this topic is (Wu and Hu, 2018), where the authors train a reinforced neural extractive summarization model called RNES that captures cross-sentence coherence patterns. Because they use a different dataset and have not released their code, we are unable to compare our models with theirs.
The idea of iteration has not been well explored for summarization. One related study is Xiong et al. (2016)'s work on dynamic memory networks, which designs neural networks with memory and attention mechanisms that exhibit certain reasoning capabilities required for question answering. Another related work is (Yan, 2016), which generates poetry with an iterative polishing schema. A similar method can also be applied to couplet generation, as in (Yan et al., 2016). We take some inspiration from their work but focus on document summarization. Another related work is (Singh et al., 2017), where the authors present a deep network called Hybrid MemNet for the single document summarization task, using a memory network as the document encoder. Compared to them, we do not borrow the memory network structure but propose a new iterative architecture.

Problem Formulation
In this work, we propose Iterative Text Summarization (ITS), an iteration-based supervised model for extractive text summarization. We treat the extractive summarization task as a sequence labeling problem, in which each sentence is visited sequentially and a binary label that determines whether or not it will be included in the final summary is generated.
ITS takes as input a list of sentences $s = \{s_1, \dots, s_{n_s}\}$, where $n_s$ is the number of sentences in the document. Each sentence $s_i$ is a list of $n_w$ words, where $n_w$ is the word length of the sentence. The goal of ITS is to generate a score vector $y = \{y_1, \dots, y_{n_s}\}$ with one score per sentence, where each score $y_i \in [0, 1]$ denotes the sentence's extracting probability, that is, the probability that the corresponding sentence $s_i$ will be extracted to be included in the summary. We train our model in a supervised manner, using a corresponding gold summary written by human experts for each document in the training set. We use an unsupervised method to convert the human-written summaries to a gold label vector $y' = \{y'_1, \dots, y'_{n_s}\}$, where $y'_i \in \{0, 1\}$ denotes whether the $i$-th sentence is selected (1) or not (0). Next, during training, the cross-entropy loss between $y$ and $y'$ is calculated and minimized to optimize $y$. Finally, we select the three sentences with the highest scores according to $y$ as the extracted summary. We detail our model below.

Our Model
ITS is depicted in Fig.1. It consists of multiple iterations with one encoder, one decoder, and one iteration unit in each iteration. We combine the outputs of decoders in all iterations to generate the extracting probabilities in the final labeling module.
Our encoder is illustrated in the shaded region in the left half of Fig.1. It takes as input all sentences as well as the document representation from the previous unit, $D^{k-1}$, processes them through several neural networks, and outputs the final state to the iterative unit module, which updates the document representation.
Our decoder takes the form of a bidirectional RNN. It takes the sentence representations generated by the encoder as input, and its initial state is the polished document representation $D^k$. Our last module, the sentence labeling module, concatenates the hidden states of all decoders to generate an integrated score for each sentence.
As we apply supervised training, the objective is to maximize the likelihood of all sentence labels $y' = \{y'_1, \dots, y'_{n_s}\}$ given the input document $s$ and model parameters $\theta$:

$$\log p(y' \mid s, \theta) = \sum_{i=1}^{n_s} \log p(y'_i \mid s, \theta) \quad (1)$$
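The training signal and the final selection step described in this section can be illustrated with a small sketch. It assumes NumPy, averages the cross-entropy over sentences (the text does not say whether the loss is summed or averaged), and uses made-up scores; the function names are hypothetical and are not taken from the released code.

```python
import numpy as np

def cross_entropy_loss(scores, gold, eps=1e-8):
    """Binary cross-entropy between extracting probabilities and 0/1 gold labels.

    Averaging over sentences is an assumption of this sketch.
    """
    scores = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)
    gold = np.asarray(gold, dtype=float)
    per_sentence = gold * np.log(scores) + (1.0 - gold) * np.log(1.0 - scores)
    return float(-np.mean(per_sentence))

def select_top_sentences(scores, k=3):
    """Return the indices of the k highest-scoring sentences, best first."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return [int(i) for i in order[:k]]

# Illustrative values only.
scores = [0.9, 0.2, 0.7, 0.4, 0.8]
gold = [1, 0, 1, 0, 1]
loss = cross_entropy_loss(scores, gold)
picked = select_top_sentences(scores)
```

Minimizing this loss is equivalent to maximizing the log-likelihood of the gold labels under per-sentence Bernoulli outputs.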

Encoder
In this subsection, we describe the encoding process of our model. For brevity, we drop the superscript k when focusing on a particular layer. All the W 's and b's in this section with different superscripts or subscripts are the parameters to be learned.
Sentence Encoder: Given a discrete set of sentences $s = \{s_1, \dots, s_{n_s}\}$, we first map each word to its word embedding. The sentence encoder can be based on a variety of encoding schemes. Simply taking the average of the embeddings of the words in a sentence causes too much information loss, while using GRUs or Long Short-Term Memory (LSTM) requires more computational resources and is prone to overfitting. Considering the above, we select the positional encoding described in (Sukhbaatar et al., 2015) as our sentence encoding method, which yields a sentence representation $\hat{s}_i$ for each sentence.

Note that throughout this study, we use GRUs as our RNN cells since they can alleviate the overfitting problem, as confirmed by our experiments. As our selective reading mechanism (which will be explained later) is a modified version of the original GRU cell, we give the details of the GRU here. The GRU is a gating mechanism in recurrent neural networks, introduced in (Cho et al., 2014). Its performance was found to be similar to that of the LSTM cell (Hochreiter and Schmidhuber, 1997) while using fewer parameters. The GRU cell contains an update gate vector $u_i$ (Equation 2), which enters the computation of the new hidden state (Equation 5).

To further study the interactions and information exchanges between sentences, we establish a Bi-directional GRU (Bi-GRU) network taking the sentence representation as input, where $\hat{s}_i$ is the sentence representation input at time step $i$, $\overrightarrow{s}_i$ is the hidden state of the forward GRU at time step $i$, and $\overleftarrow{s}_i$ is the hidden state of the backward GRU. This architecture allows information to flow back and forth to generate a new sentence representation $\overleftrightarrow{s}_i$.

Document Encoder: We must initialize a document representation before polishing it. Generating the document representation from sentence representations is a process similar to generating the sentence representation from word embeddings. This time we need to compress the whole document, not just a sentence, into a vector. Because the information a vector can contain is limited, rather than using another neural network, we simply apply a non-linear transformation to the average pooling of the concatenated hidden states $[\overrightarrow{s}_i; \overleftarrow{s}_i]$ of the above Bi-GRU to generate the document representation, where '[·;·]' is the concatenation operation.

Selective Reading module: Now we can formally introduce the selective reading module in Fig.1. This module is a bidirectional RNN consisting of modified GRU cells whose input is the sentence representation $\overleftrightarrow{s} = \{\overleftrightarrow{s}_1, \dots, \overleftrightarrow{s}_{n_s}\}$. In the original version of the GRU, the update gate $u_i$ in Equation 2 is used to decide how much of the hidden state should be retained and how much should be updated. However, due to the way $u_i$ is calculated, it is sensitive to the position and ordering of sentences, but loses the information captured by the polished document representation.
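For reference, the standard GRU of Cho et al. (2014) can be written as below. The notation ($x_i$ for the input at step $i$, $h_{i-1}$ for the previous hidden state, $W$, $U$, $b$ for learned parameters) and the equation numbers are assumptions, chosen so that the update gate is Equation 2 and the hidden-state update is Equation 5, as the text refers to them. Formulations in the literature differ on whether $u_i$ or $1 - u_i$ weights the candidate state.

```latex
\begin{align}
u_i &= \sigma(W_u x_i + U_u h_{i-1} + b_u) \tag{2}\\
r_i &= \sigma(W_r x_i + U_r h_{i-1} + b_r) \tag{3}\\
\tilde{h}_i &= \tanh\bigl(W_h x_i + U_h (r_i \odot h_{i-1}) + b_h\bigr) \tag{4}\\
h_i &= u_i \odot \tilde{h}_i + (1 - u_i) \odot h_{i-1} \tag{5}
\end{align}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product. Because $u_i$ depends only on $x_i$ and $h_{i-1}$, it carries no information about the document as a whole.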
Herein, we propose a modified GRU cell that replaces $u_i$ with a newly computed update gate $g_i$. The new cell takes two inputs, the sentence representation and the document representation from the last iteration, rather than merely the sentence representation. For each sentence, the selective network generates an update gate vector $g_i$ from $\overleftrightarrow{s}_i$, the $i$-th sentence representation, and $D^{k-1}$, the document representation from the last iteration. In Equation 5, $g_i$ then takes the place of $u_i$. We use this "selective reading module" to automatically decide to what extent the information of each sentence should be updated based on its relationship with the polished document. In this way, the modified GRU network can grasp more accurate information from the document.
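A minimal sketch of one direction of the selective reading module, assuming NumPy. The reset gate and candidate state follow the standard GRU of Cho et al. (2014). The form of the gate $g_i$ (a sigmoid over a linear map of the concatenated sentence and document vectors), the parameter names, and the zero initial state are assumptions made for illustration, since the text only states that $g_i$ depends on the sentence representation and on $D^{k-1}$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(dim, rng):
    """Randomly initialise the (hypothetical) parameters of one cell."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {
        "W_r": mat(dim, dim), "U_r": mat(dim, dim), "b_r": np.zeros(dim),
        "W_h": mat(dim, dim), "U_h": mat(dim, dim), "b_h": np.zeros(dim),
        # Gate parameters: an assumed parameterisation, not the paper's.
        "W_g": mat(dim, 2 * dim), "b_g": np.zeros(dim),
    }

def selective_gru_step(s_i, h_prev, doc, p):
    """One step of a GRU whose update gate also sees the document vector."""
    r = sigmoid(p["W_r"] @ s_i + p["U_r"] @ h_prev + p["b_r"])
    h_tilde = np.tanh(p["W_h"] @ s_i + p["U_h"] @ (r * h_prev) + p["b_h"])
    # g replaces the usual update gate u_i and depends on the document.
    g = sigmoid(p["W_g"] @ np.concatenate([s_i, doc]) + p["b_g"])
    return g * h_tilde + (1.0 - g) * h_prev

def selective_read(sentences, doc, p):
    """Run the cell over all sentence vectors (forward direction only)."""
    h = np.zeros_like(doc)
    states = []
    for s_i in sentences:
        h = selective_gru_step(s_i, h, doc, p)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
dim = 4
params = init_params(dim, rng)
sentences = rng.normal(size=(5, dim))  # five sentence representations
doc = rng.normal(size=dim)             # document representation D^{k-1}
states = selective_read(sentences, doc, params)
```

The last row of `states` plays the role of the final state $h_{n_s}$. A bidirectional version would run a second cell over the reversed sentence order.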

Iterative Unit
After each sentence passes through the selective reading module, we wish to update the document representation $D^{k-1}$ with the newly constructed sentence representations. The iterative unit (also depicted in Fig.1) is designed for this purpose. We use a $\text{GRU}_{iter}$ cell to generate the polished document representation. Its input is the final state of the selective reading network from the previous iteration, $h_{n_s}$, and its initial state is set to the document representation of the previous iteration, $D^{k-1}$. The updated document representation is computed by:

$$D^k = \text{GRU}_{iter}(h_{n_s}, D^{k-1})$$

Decoder
Next, we describe our decoders, which are depicted in the shaded region in the right part of Fig.1. Following most sequence labeling tasks (Xue and Palmer, 2004; Carreras and Màrquez, 2005), which learn a feature vector for each sentence, we use a bidirectional $\text{GRU}_{dec}$ network in each iteration to output features from which the extracting probabilities are calculated. For the $k$-th iteration, given the sentence representation $\overleftrightarrow{s}$ as input and the document representation $D^k$ as the initial state, our decoder encodes the features of all sentences in the hidden states $h^k = \{h^k_0, \dots, h^k_{n_s}\}$:

$$h^k = \text{GRU}_{dec}(\overleftrightarrow{s}, D^k)$$

Sentence Labeling Module
Next, we use the feature of each sentence to generate the corresponding extracting probability. Since we have one decoder in each iteration, directly transforming the hidden states in each iteration to extracting probabilities would leave us with several scores for each sentence. Either taking the average or summing them with specific weights is inappropriate and inelegant. Hence, we concatenate the hidden states of all decoders and apply a multi-layer perceptron to them to generate the extracting probabilities $y = \{y_1, \dots, y_{n_s}\}$, where $y_i$ is the extracting probability of the $i$-th sentence. In this way, we let the model learn by itself how to utilize the outputs of all iterations and assign to each hidden state a reliable weight. In section 6, we will show that this labeling method outperforms other methods.
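The labeling step can be sketched as follows, assuming NumPy. The text specifies only concatenation followed by a multi-layer perceptron, so the single tanh hidden layer, the sigmoid output, the layer sizes, and all names here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def label_sentences(decoder_states, W1, b1, w2, b2):
    """Map per-iteration decoder states to one extracting probability per sentence.

    decoder_states: list of K arrays of shape (n_s, d), one per iteration.
    """
    features = np.concatenate(decoder_states, axis=-1)  # (n_s, K * d)
    hidden = np.tanh(features @ W1 + b1)                # (n_s, m)
    return sigmoid(hidden @ w2 + b2)                    # (n_s,)

rng = np.random.default_rng(0)
K, n_s, d, m = 5, 6, 8, 16  # iterations, sentences, state size, hidden size
decoder_states = [rng.normal(size=(n_s, d)) for _ in range(K)]
W1 = rng.normal(scale=0.1, size=(K * d, m))
b1 = np.zeros(m)
w2 = rng.normal(scale=0.1, size=m)
b2 = 0.0
probs = label_sentences(decoder_states, W1, b1, w2, b2)
```

Because the first layer sees the states of all iterations side by side, its weights decide how much each iteration contributes, which replaces a fixed average or hand-set weighting.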

Experiment Setup
In this section, we present our experimental setup for training and evaluating our summarization model. We first introduce the datasets used for training and evaluation, and then introduce our experimental details and evaluation protocol.

Datasets
In order to make a fair comparison with our baselines, we used the CNN/DailyMail corpus which was constructed by Hermann et al. (2015). We used the standard splits for training, validation and testing in each corpus (90,266/1,220/1,093 documents for CNN and 196,557/12,147/10,396 for DailyMail). We followed previous studies in using the human-written story highlights in each article as a gold-standard abstractive summary. These highlights were used to generate gold labels for training and testing our model, using a greedy search method similar to that of (Nallapati et al., 2016a).
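A greedy label-generation procedure of this kind can be sketched as below: sentences are added one at a time as long as the score of the selected set against the highlights keeps improving. This is a simplified illustration, not the authors' code. It uses a unigram-overlap F1 as a stand-in for Rouge and whitespace tokenization, and the cap `max_sentences` is a hypothetical parameter.

```python
from collections import Counter

def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, used here as a simple proxy for Rouge."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

def greedy_labels(sentences, highlights, max_sentences=3):
    """Return 0/1 labels by greedily growing the best-scoring sentence set."""
    reference = highlights.lower().split()
    tokenized = [s.lower().split() for s in sentences]
    selected, best = [], 0.0
    while len(selected) < max_sentences:
        candidates = []
        for i in range(len(tokenized)):
            if i in selected:
                continue
            chosen = sorted(selected + [i])  # keep document order
            tokens = [tok for j in chosen for tok in tokenized[j]]
            candidates.append((unigram_f1(tokens, reference), i))
        if not candidates:
            break
        # Highest score wins; ties go to the earlier sentence.
        score, index = max(candidates, key=lambda c: (c[0], -c[1]))
        if score <= best:
            break  # no remaining sentence improves the score
        selected.append(index)
        best = score
    return [1 if i in selected else 0 for i in range(len(sentences))]

sentences = [
    "the cat sat on the mat",
    "stocks fell sharply on monday",
    "the dog chased the cat",
]
highlights = "the cat sat on the mat . the dog chased the cat"
labels = greedy_labels(sentences, highlights)  # -> [1, 0, 1]
```

In the example, adding the second sentence to the selected set would lower the overlap score, so it receives label 0.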
We also tested ITS on an out-of-domain corpus, DUC2002, which consists of 567 documents. Documents in this corpus belong to 59 clusters, each with a unique topic. Each document has two gold summaries of around 100 words written by human experts.

Implementation Details
We implemented our model in Tensorflow (Abadi et al., 2016). The code for our models is available online. We mostly followed the settings in (Nallapati et al., 2016a) and trained the model using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001, annealed by a factor of 0.5 every 6 epochs until reaching 30 epochs. We selected the three sentences with the highest scores as the summary. After preliminary exploration, we found that arranging them according to their scores consistently achieved the best performance. Experiments were performed with a batch size of 64 documents. We used 100-dimensional GloVe (Pennington et al., 2014) embeddings trained on Wikipedia 2014 as our embedding initialization, with the vocabulary size limited to 100k for speed purposes. We initialized out-of-vocabulary word embeddings from a uniform distribution within [-0.2, 0.2]. We also padded or cut sentences to contain exactly 70 words. Each GRU module had 1 layer with 200-dimensional hidden states and either an initial state set up as described above or a random initial state. To prevent overfitting, we used dropout after each GRU network and embedding layer, and also applied L2 loss to all non-bias variables. The iteration number was set to 5 if not specified. A detailed discussion of the iteration number can be found in section 7.
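Two of these settings are easy to misread, so a small sketch may help. It assumes that the annealing is a step schedule (the rate is halved after every 6 full epochs, with epochs counted from 0) and that short sentences are padded on the right with a hypothetical `<pad>` token. Neither detail is stated in the text.

```python
def learning_rate(epoch, initial=0.001, decay=0.5, every=6):
    """Step schedule: multiply the rate by `decay` after every `every` epochs."""
    return initial * decay ** (epoch // every)

def pad_or_cut(tokens, length=70, pad_token="<pad>"):
    """Force a tokenized sentence to contain exactly `length` tokens."""
    return tokens[:length] + [pad_token] * max(0, length - len(tokens))

# Learning rate for each of the 30 training epochs.
schedule = [learning_rate(epoch) for epoch in range(30)]

short = pad_or_cut("a short sentence".split())
long = pad_or_cut(["word"] * 100)
```

Under these assumptions the rate takes five values over training, from 0.001 in epochs 0 to 5 down to 0.0000625 in epochs 24 to 29.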

Baselines
On all datasets we used the Lead-3 method as a baseline, which simply chooses the first three sentences in a document as the summary. On the DailyMail dataset, we report the performance of SummaRuNNer (Nallapati et al., 2016a) and the model in (Cheng and Lapata, 2016), as well as a logistic regression classifier (LReg) that they used as a baseline. We reimplemented the Hybrid MemNet model in (Singh et al., 2017) as one of our baselines since they only reported the performance on 500 samples in their paper. Also, since Narayan et al. (2018) released their code for the REFRESH model, we used it to produce Rouge recall scores on the DailyMail dataset, as they only reported results on the joint CNN/DailyMail dataset. Baselines on the CNN dataset are similar.
On the DUC2002 corpus, we compare our model with several baselines such as Integer Linear Programming (ILR) and LReg. We also report the performance of recent neural network models, including (Nallapati et al., 2016a; Cheng and Lapata, 2016; Singh et al., 2017).

Evaluation
In the evaluation procedure, we used the Rouge scores, i.e. Rouge-1, Rouge-2, and Rouge-L, corresponding to the matches of unigrams, bigrams, and the Longest Common Subsequence (LCS) respectively, to evaluate our model. We obtained our Rouge scores using the standard pyrouge package. To compare with other related works, we used the full-length F1 score on the CNN corpus and the recall score at limited lengths of 75 bytes and 275 bytes on the DailyMail corpus. As for the DUC2002 corpus, following the official guidelines, we examined the Rouge recall score at the length of 75 words. All results in our experiments are statistically significant at the 95% confidence interval as estimated by the Rouge script. Schluter (2017) noted that only using the Rouge metric to evaluate summarization quality can be misleading. Therefore, we also evaluated our model using human evaluation. Five highly educated participants were asked to rank 40 summaries produced by four models: the Lead-3 baseline, Hybrid MemNet, ITS, and human-authored highlights. We chose Hybrid MemNet as one of the human evaluation baselines since its performance is relatively high compared to other baselines. Judging criteria included informativeness and coherence. Test cases were randomly sampled from the DailyMail test set.
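To make the three metrics concrete, here is a simplified sketch of the recall variants and of byte-limited truncation. It is not the pyrouge implementation used in the experiments: it handles a single reference, uses whitespace tokens, applies no stemming, and computes Rouge-L over the whole summary rather than sentence by sentence. All function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref = ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((ngrams(candidate, n) & ref).values()) / total

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

def truncate_bytes(text, limit):
    """Keep at most `limit` bytes of the UTF-8 encoding of `text`."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")

candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()
r1 = rouge_n_recall(candidate, reference, 1)  # 5 of 6 unigrams matched
r2 = rouge_n_recall(candidate, reference, 2)  # 3 of 5 bigrams matched
rl = rouge_l_recall(candidate, reference)     # LCS "the cat on the mat"
```

For a limited-length score, the candidate summary would first be cut with `truncate_bytes(summary, 75)` or `truncate_bytes(summary, 275)` and then scored against the reference.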

Experiment analysis
Table 1 shows the performance comparison of our model with other baselines on the DailyMail dataset with respect to Rouge score at 75 bytes and 275 bytes of summary length. Our model performs consistently and significantly better than other models at 75 bytes, while at 275 bytes the improvement margin is smaller. One possible interpretation is that our model has high precision on top-ranked outputs, but the accuracy is lower for lower-ranked sentences. In addition, Cheng and Lapata (2016) create sentence-level extractive labels to train their model in a different way, while our model uses an unsupervised greedy approximation instead. We also examined the performance of our model on the CNN dataset as listed in Table 2. To compare with other models, we used the full-length Rouge F1 metric as reported by Narayan et al. (2018). Results demonstrate that our model consistently achieves the best performance on different datasets.
In Table 3, we present the performance of ITS on the out-of-domain DUC dataset. Our model outperforms or matches other basic models, including LReg and ILR, as well as neural network baselines such as SummaRuNNer with respect to the ground truth at 75 bytes, which shows that our model can be adapted to different corpora while maintaining high accuracy.
In order to explore the impact of the internal structure of ITS, we also conducted an ablation study, reported in Table 4. The first variation is the same model without the selective reading module. The second one sets the iteration number to one, that is, a model without the iteration process. The last variation applies the MLP to the output of the last iteration only, instead of concatenating the hidden states of all decoders. All other settings and parameters are the same. Performances of these models are worse than that of ITS in all metrics, which demonstrates the preeminence of ITS. More importantly, through this controlled experiment, we can verify the contribution of each module in ITS.

Further discussion
Analysis of iteration number: We did a broad sweep of experiments to further investigate the influence of the iteration process on the quality of the generated summaries. First, we studied the influence of the iteration number. In order to make a fair comparison between models with different iteration numbers, we trained all models for the same number of epochs without tuning. Fig.2 plots the results against the iteration number. The result of training the model for only one epoch outperforms the state-of-the-art in (Singh et al., 2017), which demonstrates that our selective reading module is effective. The fact that continuing this process increases the performance confirms that the iteration idea behind our model is useful in practice. Based on the above observations, we set the default iteration number to 5.
Analysis of polishing process: Next, to fully investigate how the iterative process influences the extracting results, we draw heatmaps of the extracting probabilities for each decoder at each iteration. We pick two representative cases in Fig.3, where the x-axis represents the sentence index and the y-axis the iteration number; x-axis labels are omitted. The darker the color is, the higher the extracting probability is. In Fig.3(a), it can be seen that when the iteration begins, most sentences have similar probabilities. As we increase the number of iterations, some probabilities begin to fall and others saturate. This means that the model already has preferred sentences to select. Another interesting feature we found is that there is a transitivity between iterations, as shown in Fig.3(b). To be specific, the sentences which are not preferred by iteration 3 keep low probabilities in the next two iterations, while sentences with relatively high scores are still preferred by iterations 4 and 5.

Models          1st   2nd   3rd   4th
Lead-3          0.12  0.11  0.25  0.52
Hybrid MemNet   0.24  0.25  0.28  0.23
ITS             0.31  0.34  0.23  0.12
Gold            0.33  0.30  0.24  0.13

Table 5: System ranking comparison with other baselines on DailyMail corpus. Rank 1 is the best and Rank 4 is the worst. Each score represents the percentage of the summary under this rank.
Human Evaluation: We gave human evaluators three system-generated summaries, produced by Lead-3, Hybrid MemNet, and ITS, as well as the human-written gold-standard summary, and asked them to rank these summaries based on informativeness and coherence. Table 5 shows the percentages of summaries of the different models under each rank as scored by human experts. It is not surprising that the gold standard has the most summaries of the highest quality. Our model has the most summaries under the 2nd rank and can thus be considered 2nd best, followed by Hybrid MemNet and Lead-3, which are ranked mostly 3rd and 4th. In our case study, we found that a number of summaries generated by Hybrid MemNet share two of their three sentences with ITS; however, the third, distinct sentence from our model always leads to a better evaluation result considering overall informativeness and coherence. Readers can refer to the appendix to see our case study.

Conclusion
In this work, we introduce ITS, an iteration-based extractive summarization model, inspired by the observation that it is often necessary for a human to read an article multiple times to fully understand and summarize it. Experimental results on the CNN/DailyMail and DUC corpora demonstrate the effectiveness of our model.