Extract with Order for Coherent Multi-Document Summarization

In this work, we aim to develop an extractive summarizer in the multi-document setting. We implement rank-based sentence selection using continuous vector representations along with key-phrases. Furthermore, we propose a model that tackles summary coherence to increase readability. We conduct experiments on the Document Understanding Conference (DUC) 2004 dataset using the ROUGE toolkit. Our experiments demonstrate that the proposed methods bring significant improvements over state-of-the-art methods in terms of informativeness and coherence.


Introduction
The task of automatic document summarization aims at finding the most relevant information in a text and presenting it in a condensed form. A good summary should retain the most important content of the original document or cluster of documents, while being coherent, non-redundant and grammatically readable. There are two types of summarization: abstractive and extractive. Abstractive methods, still a growing field, are highly complex as they need extensive natural language generation to rewrite the sentences. Therefore, the research community is focusing more on extractive summarization, which selects salient (important) sentences from the source documents without any modification to create a summary. Summarization is classified as single-document or multi-document based upon the number of source documents. The information overlap between documents on the same topic makes multi-document summarization more challenging than summarizing single documents.
One crucial step in generating a coherent summary is to order the sentences in a logical manner to increase readability. A wrong order of sentences conveys an entirely different idea to the reader of the summary and also makes it difficult to understand. In a single document, summary information can be presented by preserving the sentence positions of the original document. In multi-document summarization, however, the sentence position in the original document provides no clue to the sentence arrangement, which makes ordering the summary sentences a very challenging task.

Related Work
Over the past decade, several extractive approaches have been developed for automatic summary generation, implementing a number of machine learning, graph-based and optimization techniques. LexRank (Erkan and Radev, 2004) and TextRank (Mihalcea and Tarau, 2004) are graph-based methods of computing sentence importance for text summarization. The RegSum system employs a supervised model for predicting word importance. Treating multi-document summarization as a submodular maximization problem (Lin and Bilmes, 2011) has also proven successful. Unfortunately, none of the above systems considers the coherence of the final extracted summary.
In very recent work using neural networks, (Cheng and Lapata, 2016) proposed an attentional encoder-decoder and (Nallapati et al., 2017) used a simple recurrent-network-based sequence classifier to solve the problem of extractive summarization. However, these approaches are limited to the single-document setting, where sentences are implicitly ordered according to sentence position. Graph-based techniques have also been proposed to tackle coherence, but again only for single-document summarization. Moreover, a recent work (Wang et al., 2016) proposed a multi-document summarization system that combines both coherence and informativeness, but this system is limited to syntactic linkages between entities.
In this paper, we implement rank-based sentence selection using continuous vector representations along with key-phrases. We also model coherence using semantic relations between entities and sentences to increase readability.

Sentence Extraction
We here successively describe each of the steps involved in the sentence extraction process: sentence ranking, sentence clustering, and sentence selection.

Preprocessing
Our system first takes a set of related texts as input and preprocesses them; this includes tokenization, Part-Of-Speech (POS) tagging, removal of stopwords and lemmatization. We use the NLTK toolkit to preprocess each sentence to obtain a more accurate representation of the information.
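A minimal sketch of this preprocessing step, using a tiny illustrative stopword list in place of NLTK's full resources; the actual system also performs POS tagging and lemmatization via NLTK, which are omitted here:

```python
import re

# A tiny stopword list standing in for NLTK's stopword corpus; purely
# illustrative. The real pipeline uses NLTK for tokenization, POS
# tagging, stopword removal and lemmatization.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are"}

def preprocess(sentence):
    """Return the lowercased content-word tokens of a sentence."""
    tokens = re.findall(r"[a-zA-Z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The summary of a document is condensed."))
# -> ['summary', 'document', 'condensed']
```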

Sentence Similarity
We take the pre-trained word embeddings (Mikolov et al., 2013) of all the non-stopwords in a sentence and compute a weighted vector sum according to the term frequency TF(w, S) of a word w in a sentence S. More formally, for a given sentence S in the document D, the weighted sum is

Vec(S) = Σ_{w ∈ S} TF(w, S) · E[idx(w)]

where E is the word embedding model and idx(w) is the index of the word w in it. We then calculate the cosine similarity CosSim(S_i, S_j) between the sentence vectors obtained from the above equation to find the relative distance between S_i and S_j. We also calculate NESim(S_i, S_j) by finding the named entities present in S_i and S_j using the NLTK toolkit and then calculating their overlap.
The overall similarity computation combines both CosSim(S_i, S_j) and NESim(S_i, S_j):

Sim(S_i, S_j) = λ · CosSim(S_i, S_j) + (1 − λ) · NESim(S_i, S_j)    (1)

where 0 ≤ λ ≤ 1 decides their relative contributions to the overall similarity. This standalone similarity function is used throughout this work with different λ values to accomplish different tasks.
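As a concrete illustration, the sketch below computes this combined similarity with toy two-dimensional embeddings. The embedding values are invented, and the Jaccard-style definition of the named-entity overlap is an assumption for illustration, since the paper does not spell out its exact overlap measure:

```python
import math
from collections import Counter

def sentence_vector(tokens, emb):
    """TF-weighted sum of word embeddings for the tokens of one sentence."""
    tf = Counter(tokens)
    dim = len(next(iter(emb.values())))
    vec = [0.0] * dim
    for w, f in tf.items():
        if w in emb:  # out-of-vocabulary words are skipped
            for k in range(dim):
                vec[k] += f * emb[w][k]
    return vec

def cos_sim(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ne_sim(ents_i, ents_j):
    """Jaccard-style overlap of the named-entity sets (assumed measure)."""
    if not ents_i or not ents_j:
        return 0.0
    return len(ents_i & ents_j) / len(ents_i | ents_j)

def sim(tokens_i, tokens_j, ents_i, ents_j, emb, lam=0.3):
    """Equation (1): lam * CosSim + (1 - lam) * NESim."""
    return (lam * cos_sim(sentence_vector(tokens_i, emb),
                          sentence_vector(tokens_j, emb))
            + (1 - lam) * ne_sim(ents_i, ents_j))
```

With λ = 0.3 as in the ranking step, two sentences sharing an entity and near-parallel vectors score close to 1.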

Sentence Ranking
In this section, we rank the sentences by applying the TextRank algorithm (Mihalcea and Tarau, 2004), which involves constructing an undirected graph where sentences are vertices and weighted edges are formed connecting sentences by a similarity metric. TextRank determines the similarity based on the lexical overlap between two sentences. However, this has a serious drawback: if two sentences talk about the same topic without using any overlapping words, there will be no edge between them. Instead, we use the continuous skip-gram model introduced by (Mikolov et al., 2013) to measure semantic similarity along with the entity overlap. We use the similarity function described in Equation (1), setting λ = 0.3.
After we have our graph, we can run the main algorithm on it. This involves initializing a score of 1 for each vertex and repeatedly applying the TextRank update rule until convergence:

Rank(S_i) = (1 − d) + d · Σ_{S_j ∈ N(S_i)} [ Sim(S_i, S_j) / Σ_{S_k ∈ N(S_j)} Sim(S_j, S_k) ] · Rank(S_j)

where Rank(S_i) indicates the importance score assigned to sentence S_i, N(S_i) is the set of neighboring sentences of S_i, and 0 ≤ d ≤ 1 is a damping factor, which the literature suggests setting to 0.85. After reaching convergence, we extract the sentences along with their TextRank scores.
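The update rule above can be sketched as a simple power iteration over a precomputed similarity matrix. The matrix here is an assumed input (produced by the similarity function of Equation (1)), with a zero diagonal so that a sentence is not its own neighbor:

```python
def textrank(sim_matrix, d=0.85, tol=1e-6, max_iter=200):
    """Iterate Rank(S_i) = (1 - d) + d * sum_j [sim(j,i)/sum_k sim(j,k)] * Rank(S_j)
    until the scores stop changing. sim_matrix is symmetric, diagonal 0."""
    n = len(sim_matrix)
    out_sum = [sum(sim_matrix[j]) for j in range(n)]  # total edge weight per node
    rank = [1.0] * n                                  # initial score of 1 per vertex
    for _ in range(max_iter):
        new = []
        for i in range(n):
            s = sum(sim_matrix[j][i] / out_sum[j] * rank[j]
                    for j in range(n) if j != i and out_sum[j] > 0)
            new.append((1 - d) + d * s)
        converged = max(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if converged:
            break
    return rank
```

Sentences that are strongly similar to other well-connected sentences accumulate higher scores, which we later reuse in the ILP objective.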

Sentence Clustering
The sentence clustering step allows us to group similar sentences. We use hierarchical agglomerative clustering (Murtagh and Legendre, 2014) with a complete-linkage criterion. This method proceeds incrementally, starting with each sentence considered as a cluster and merging the most similar pair of clusters at each step, in a bottom-up fashion. The complete-linkage criterion determines the metric used for the merge strategy. In computing the clusters, we use the similarity function described in Equation (1) with λ = 0.4. We set a similarity threshold (τ = 0.5) to stop the clustering process: if we cannot find any cluster pair with a similarity above the threshold, the process stops and the clusters are released. The clusters may be small, but they are highly coherent, as each sentence they contain must be similar to every other sentence in the same cluster.
This sentence clustering step is important for two main reasons: (1) selecting at most one sentence from each cluster of related sentences decreases redundancy on the summary side, and (2) selecting sentences from a diverse set of clusters increases the information coverage on the document side.
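A minimal complete-linkage version of this clustering step over a precomputed similarity matrix might look as follows; the τ = 0.5 stopping rule matches the text, while the similarity values used in the test are invented:

```python
def cluster_sentences(sim, tau=0.5):
    """Agglomerative clustering with complete linkage: clusters merge only
    while some pair has complete-linkage similarity strictly above tau."""
    clusters = [[i] for i in range(len(sim))]
    while True:
        best, pair = tau, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: the *least* similar cross-cluster pair
                link = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if link > best:
                    best, pair = link, (a, b)
        if pair is None:          # no pair above the threshold: release clusters
            return clusters
        a, b = pair
        clusters[a] += clusters[b]
        del clusters[b]
```

Because the merge score is the minimum cross-pair similarity, every sentence in a released cluster is similar to every other sentence in it, matching the coherence property described above.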

Sentence Selection
In this work, we use the concept-based ILP framework introduced in (Gillick and Favre, 2009), with some suitable changes, to select the best subset of sentences. This approach aims to extract sentences that cover as many important concepts as possible while keeping the summary length within a given budget constraint. Unlike (Gillick and Favre, 2009), which uses bigrams as concepts, we use keyphrases as concepts. Keyphrases are the words or phrases that represent the main topics of a document; sentences containing the most relevant keyphrases are important for summary generation. We extract the keyphrases from the document cluster using RAKE (Rose et al., 2010; https://github.com/aneesha/RAKE) and assign each keyphrase a weight using the score returned by RAKE.
Let w_i be the weight of keyphrase i and k_i a binary variable that indicates the presence of keyphrase i in the extracted sentences. Let l_j be the number of words in sentence j, s_j a binary variable that indicates the presence of sentence j in the extracted sentence set, and L the length limit for the set. Let Occ_ij indicate the occurrence of keyphrase i in sentence j. The ILP formulation is:

maximize    Σ_i w_i k_i                        (2)
subject to  Σ_j l_j s_j ≤ L                    (3)
            s_j Occ_ij ≤ k_i,    ∀ i, j        (4)
            Σ_j s_j Occ_ij ≥ k_i,  ∀ i         (5)

We try to maximize the weight of the keyphrases (2) in the extracted sentences, while avoiding repetition of those keyphrases (4, 5) and staying under the maximum number of words allowed for the sentence extraction (3).
In addition to (Gillick and Favre, 2009), we add an extra term to the objective that maximizes the sentence rank scores returned by the sentence ranking step. To ensure that at most one sentence per cluster appears in the extracted set, we add an extra constraint:

Σ_{j ∈ g} s_j ≤ 1,    for each cluster g       (6)

In this process, we extract the optimal combination of sentences that maximizes informativeness while minimizing redundancy (Figure 1 illustrates our sentence extraction process in brief).
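The paper solves this selection problem with an ILP solver. As a hedged illustration, the sketch below instead enumerates sentence subsets by brute force on a toy instance, applying the same objective (keyphrase weights plus rank scores) and the same constraints (length budget, one sentence per cluster). Brute force is only feasible for tiny inputs and is not the solver the system actually uses:

```python
from itertools import combinations

def select_sentences(weights, occ, lengths, ranks, clusters, L):
    """Brute-force stand-in for the ILP: pick the sentence subset that
    maximizes covered-keyphrase weight plus rank scores, subject to the
    word budget L and at most one sentence per cluster.
    occ[i][j] = 1 iff keyphrase i occurs in sentence j."""
    n = len(lengths)
    best_score, best_set = -1.0, ()
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if sum(lengths[j] for j in subset) > L:          # constraint (3)
                continue
            if any(len(set(subset) & set(c)) > 1 for c in clusters):
                continue                                     # constraint (6)
            covered = {i for j in subset for i in range(len(weights))
                       if occ[i][j]}                         # constraints (4, 5)
            score = (sum(weights[i] for i in covered)        # objective (2)
                     + sum(ranks[j] for j in subset))        # extra rank term
            if score > best_score:
                best_score, best_set = score, subset
    return best_set
```

Counting each covered keyphrase once, regardless of how many selected sentences contain it, is exactly what constraints (4, 5) enforce in the ILP, so repeated keyphrases earn no extra credit.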

Sentence Ordering
Classic reordering approaches include inferring order from a weighted sentence graph (Barzilay et al., 2002) or performing a chronological ordering algorithm (Cohen et al., 1999) that sorts sentences based on timestamp and position.
We here propose a simple greedy approach to sentence ordering in multi-document settings. Our assumption is that a good sentence order implies similarity between all adjacent sentences, since word repetition (more specifically, named entity repetition) is one of the formal signs of text coherence (Barzilay et al., 2002). We define the coherence of a document D consisting of sentences S_1 to S_n as

Coherence(D) = Σ_{i=1}^{n−1} Sim(S_i, S_{i+1}) / (n − 1)

For calculating Sim(S_i, S_{i+1}), we use the similarity function described in Equation (1) with λ = 0.5, giving the named entities a little more preference.

Table 1: Results (DUC 2004, Task-2) for the baseline, state-of-the-art and our systems.

Figure 1: Sentence Extraction Process
We propose a greedy algorithm for placing a sentence in a document based on the coherence score discussed above. At the beginning, we randomly select a sentence from the extracted sentences, which carry no position information, and place it in the ordered set D. We then incrementally add each extracted sentence to D using Algorithm 1 to get the final order of summary sentences.
Algorithm 1: Place a sentence in a document
Procedure SentencePositioning(D, Sn)
Data: Input document D, assumed sorted; new sentence Sn to place in D.
Result: New document Dn after placing Sn.
t ← 1; best ← −∞;
while t ≤ |D| + 1 do
    Dtmp ← D with Sn inserted at the t-th position;
    if Coherence(Dtmp) > best then
        best ← Coherence(Dtmp);
        Dn ← Dtmp;
    end
    t ← t + 1;
end
return Dn;
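Under the assumption that Algorithm 1 tries each insertion position and keeps the one with the highest coherence, the ordering procedure can be sketched as follows; the similarity matrix is an assumed input built with Equation (1) at λ = 0.5:

```python
def coherence(order, sim):
    """Average similarity between adjacent sentences in the given order."""
    if len(order) < 2:
        return 0.0
    return sum(sim[a][b] for a, b in zip(order, order[1:])) / (len(order) - 1)

def place_sentence(doc, s, sim):
    """Algorithm 1 sketch: try every insertion position for sentence s
    and keep the ordering with the highest coherence."""
    best_doc, best_c = None, -1.0
    for t in range(len(doc) + 1):
        cand = doc[:t] + [s] + doc[t:]
        c = coherence(cand, sim)
        if c > best_c:
            best_doc, best_c = cand, c
    return best_doc

def order_sentences(sentences, sim):
    """Greedily place each extracted sentence into the ordered document."""
    doc = [sentences[0]]          # seed with an arbitrary first sentence
    for s in sentences[1:]:
        doc = place_sentence(doc, s, sim)
    return doc
```

Because each insertion is chosen to maximize adjacent-sentence similarity, a sentence that bridges two others tends to end up between them.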

Evaluation
We evaluate our system ILPRankSumm (ILP-based sentence selection with TextRank for extractive summarization) using ROUGE (Lin, 2004) on DUC 2004 (Task-2, length limit (L) = 100 words). However, ROUGE scores are biased towards lexical overlap at the surface level and are insensitive to summary coherence. Moreover, sophisticated coherence evaluation metrics are seldom adopted for summarization; thus, many previous systems used human evaluation for measuring readability. For this reason, we evaluate our summary coherence using the model of (Lapata and Barzilay, 2005; Barzilay and Lapata, 2008), which defines coherence probabilities for an ordered set of sentences.

Baseline Systems
We compare our system with baselines (LexRank, GreedyKL) and state-of-the-art systems (Submodular, ICSISumm). LexRank (Erkan and Radev, 2004) represents the input texts as a graph where nodes are the sentences and edges are formed between two sentences if their cosine similarity is above a certain threshold; sentence importance is calculated by running the PageRank algorithm on the graph. GreedyKL (Haghighi and Vanderwende, 2009) iteratively selects the next sentence for the summary that minimizes the KL divergence between the estimated word distributions. (Lin and Bilmes, 2011) treat document summarization as maximizing a submodular function under a budget constraint; they achieve near-optimal information coverage and non-redundancy using a modified greedy algorithm. ICSISumm (Gillick and Favre, 2009), on the other hand, employs a global linear optimization framework, finding the globally optimal summary rather than choosing sentences greedily according to their importance.
The summaries generated by the baselines and the state-of-the-art extractive summarizers on the DUC 2004 dataset were collected from previously published system outputs.

Results
Our results include R-1, R-2, and R-SU4, which count matches in unigrams, bigrams, and skip-bigrams respectively; the skip-bigrams allow up to four words in between. According to Table 1, the R-1 and R-2 scores obtained by our system outperform all the baselines and state-of-the-art systems on the DUC 2004 dataset. One of the main reasons for the improved R-1 and R-2 scores is the use of keyphrases. Moreover, there is no significant difference between our proposed system and Submodular in the case of R-SU4. We also obtain a better coherence probability because of our sentence ordering technique. The system's output for a randomly selected document set (d30015t) from DUC 2004 is shown in Table 2.
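For intuition on the R-SU4 matching unit, a small sketch that enumerates the skip-bigrams of a token sequence, where max_skip is the number of words allowed in between (4 for R-SU4):

```python
def skip_bigrams(tokens, max_skip=4):
    """All ordered token pairs with at most max_skip words between them,
    i.e. the units counted by ROUGE-SU when comparing two texts."""
    pairs = set()
    for i in range(len(tokens)):
        # j - i - 1 is the gap between the two tokens; keep it <= max_skip
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs.add((tokens[i], tokens[j]))
    return pairs
```

R-SU4 then scores the overlap between the skip-bigram sets of the candidate and reference summaries, which is why it tolerates mild reordering better than R-2.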

Limitations
One of the essential properties of text summarization systems is the ability to generate a summary with a fixed length (DUC 2004, Task-2: length limit = 100 words). In previous research, summarizers either truncated the summary at the 100th word or removed the last sentence from the summary set. In this paper, we follow the second approach to produce a grammatical summary. However, while the first approach produces an ungrammatical sentence, the latter can lose a lot of information in the worst case, if the sentences are long. We focus more on the grammaticality of the final summary.

Table 2: System output for document set d30015t from DUC 2004.

Summary Generated (After Sentence Extraction): But U.S. special envoy Richard Holbrooke said the situation in the southern Serbian province was as bad now as two weeks ago. A Western diplomat said up to 120 Yugoslav army armored vehicles, including tanks, have been pulled out. On Sunday, Milosevic met with Russian Foreign Minister Igor Ivanov and Defense Minister Igor Sergeyev, Serbian President Milan Milutinovic and Yugoslavia's top defense officials. To avoid such an attack, Yugoslavia must end the hostilities, withdraw army and security forces, take urgent measures to overcome the humanitarian crisis, ensure that refugees can return home and take part in peace talks, he said.

Summary Generated (After Sentence Ordering): On Sunday, Milosevic met with Russian Foreign Minister Igor Ivanov and Defense Minister Igor Sergeyev, Serbian President Milan Milutinovic and Yugoslavia's top defense officials. But U.S. special envoy Richard Holbrooke said the situation in the southern Serbian province was as bad now as two weeks ago. A Western diplomat said up to 120 Yugoslav army armored vehicles, including tanks, have been pulled out. To avoid such an attack, Yugoslavia must end the hostilities, withdraw army and security forces, take urgent measures to overcome the humanitarian crisis, ensure that refugees can return home and take part in peace talks, he said.

Conclusion and Future Work
In this work, we implemented ILP-based sentence selection along with TextRank scores and keyphrases for extractive multi-document summarization. We further modeled coherence to increase the readability of the generated summary. Evaluation results strongly indicate the benefits of using continuous word vector representations in all the steps of the overall system. In future work, we will focus on jointly extracting the sentences to maximize informativeness and readability while minimizing redundancy using the same ILP model. We will also try to propose a solution to the length limit problem.