Krimping texts for better summarization

Automated text summarization is aimed at extracting essential information from an original text and presenting it in a minimal, often predefined, number of words. In this paper, we introduce a new approach for unsupervised extractive summarization, based on the Minimum Description Length (MDL) principle, using the Krimp dataset compression algorithm (Vreeken et al., 2011). Our approach represents a text as a transactional dataset, with sentences as transactions, and then describes it by itemsets that stand for frequent sequences of words. The summary is then compiled from sentences that compress (and as such, best describe) the document. The problem of summarization is reduced to the maximal coverage problem, following the assumption that a summary that best describes the original text should cover most of the word sequences describing the document. We solve it by a greedy algorithm and present the evaluation results.


Introduction
Many unsupervised approaches for extractive text summarization follow the maximal coverage principle (Takamura and Okumura, 2009; Gillick and Favre, 2009), where the extract that maximally covers the information contained in the source text is selected. Since the exhaustive solution demands an exponential number of tests, approximation techniques, such as a greedy approach or global optimization of a target function, are utilized. It is quite common to measure text informativeness by the frequency of its components: words, phrases, concepts, and so on. A different approach that has received much less attention is based on the Minimum Description Length (MDL) principle, which defines the best summary as the one that leads to the best compression of the text by providing its shortest and most concise description. The MDL principle is widely used in compression techniques for non-textual data, such as summarization of query results for OLAP applications (Lakshmanan et al., 2002; Bu et al., 2005). However, only a few works on text summarization using MDL can be found in the literature. Nomoto and Matsumoto (2001) used K-means clustering extended with the MDL principle for finding diverse topics in the summarized text. Nomoto (2004) also extended the C4.5 classifier with MDL for learning rhetorical relations. Nguyen et al. (2015) formulated the problem of micro-review summarization within the MDL framework: the authors view tips as being encoded by snippets and seek a collection of snippets that produces the encoding with the minimum number of bits. This paper introduces a new MDL-based approach for extracting relevant sentences into a summary. The approach represents documents as a sequential transactional dataset and then compresses it by replacing frequent sequences of words with codes. The summary is then compiled from the sentences that best compress (or describe) the document content.
The intuition behind this approach is that a summary that best describes the original text should cover its most frequent word sequences. As such, the problem of summarization reduces very naturally to the maximal coverage problem. We solve it by a greedy method that ranks sentences by their coverage of the best compressing frequent word sequences and selects the top-ranked sentences into the summary. A few works have applied common data-mining techniques for calculating frequent itemsets from transactional data to the text summarization task (Baralis et al., 2012; Agarwal et al., 2011; Dalal and Zaveri, 2013), but none of them followed the MDL principle. Comparative results on three different corpora show that our approach outperforms other unsupervised state-of-the-art summarizers.

Methodology
The proposed summarization methodology, denoted Gamp, is based on the MDL principle, which is defined formally as follows (Mitchell, 1997). Given a set of models M, a model M ∈ M is considered the best if it minimizes L(M) + L(D|M), where L(M) is the bit length of the description of M and L(D|M) is the bit length of the dataset D encoded with M. In our approach, we first represent an input text as a transactional dataset. Then, using the Krimp dataset compression algorithm (Vreeken et al., 2011), we build a minimum-length description of this dataset from its frequent sequential itemsets (word sequences). The sentences that cover the most frequent word sequences are chosen into the summary. The following subsections describe our methodology in detail.
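The two-part MDL score can be illustrated with a few lines of Python; this is a minimal sketch in which the cost functions (pattern length for the model, repetition count for the data) are illustrative assumptions, not the paper's actual bit-level encoding:

```python
def mdl_best(models, data, l_model, l_data_given_model):
    """Return the model M minimizing L(M) + L(D|M),
    with both lengths measured in a common unit."""
    return min(models, key=lambda M: l_model(M) + l_data_given_model(data, M))

# Toy illustration: describing the string "abcabcabc" with a repeated pattern.
# Model cost = pattern length; data cost = number of pattern repetitions.
models = ["a", "abc"]
best = mdl_best(models, "abcabcabc",
                l_model=len,
                l_data_given_model=lambda d, M: len(d) // len(M))
# "abc" wins: 3 + 3 = 6 bits vs. 1 + 9 = 10 for "a"
```

The same trade-off drives Krimp: a richer coding table costs more to describe but encodes the data more cheaply.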

Problem statement
We are given a single text or a collection of texts about the same topic, composed of a set of sentences S1, . . . , Sn over terms t1, . . . , tm. A word limit W is defined for the final summary. We represent the text as a sequential transactional dataset. Such a dataset consists of transactions (sentences), denoted T1, . . . , Tn, and unique items (terms) I1, . . . , Im. Items are unique across the entire dataset. The number n of transactions is called the size of the dataset. Transaction Ti is a sequence of items from I1, . . . , Im, denoted Ti = (Ii1, . . . , Iik); the same item may appear in different places within the same transaction. The support of an item sequence s in the dataset, supp(s), is the ratio of the number of transactions containing s as a subsequence to the dataset size n. According to the MDL principle, we are interested in the minimal size of the compressed dataset D|CT obtained after the frequent sequences in D are encoded with the compressing-set codes from the coding table CT, where shorter codes are assigned to more frequent sequences. The description length of non-encoded terms is assumed proportional to their length (number of characters). We rank sentences by their coverage of the best compressing set, i.e., the number of CT members occurring in the sentence. The sentences with the highest coverage score are added to the summary as long as its length does not exceed W.
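Under this representation, the support of a word sequence can be computed directly. A minimal sketch (the tokenized sentences below are invented for illustration):

```python
def is_subsequence(seq, transaction):
    """True if seq occurs in transaction in order, with gaps allowed."""
    it = iter(transaction)
    return all(item in it for item in seq)  # each `in` consumes the iterator

def support(seq, transactions):
    """Fraction of transactions containing seq as a subsequence."""
    return sum(is_subsequence(seq, t) for t in transactions) / len(transactions)

# Hypothetical stemmed transactions for three sentences:
sentences = [("hunter", "stalk", "prey", "hunter", "becom", "hunt"),
             ("prey", "becom", "predat"),
             ("predat", "hunter", "fight")]
support(("prey", "becom"), sentences)  # 2/3: a subsequence of the first two
```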

Krimping text
The purpose of the Krimp algorithm (Vreeken et al., 2011) is to use frequent sets (or sequences) to compress a transactional database in order to achieve MDL for that database. Let FreqSeq be the set of all frequent sequences in the database. A collection CT of sequences from FreqSeq (called the coding table) is called best when it minimizes L(CT) + L(D|CT). We are interested in both exact and inexact sequences, allowing a sequence to have gaps inside it as long as the ratio of the sequence length to the sequence length with gaps does not fall below a preset parameter Gap ∈ (0, 1]. Sequences with gaps make sense in text data, as the phrasing of the same fact or entity may differ between sentences. To encode the database, every member s ∈ CT is associated with a binary prefix code c (such as the Huffman codes for 4 members: 0, 10, 110, 111), so that the most frequent member has the shortest code. We use an upper bound C on the number of encoded sequences in the coding table CT in order to limit document compression. Krimp-based extractive summarization (see Algorithm 1) is given a document D with n sentences and m unique terms. The algorithm parameters are described in Table 1:

#  Parameter  Description                                    Affects
1  W          summary word limit                             summary length
2  Supp       minimal support bound: minimal fraction of     number of frequent word
              sentences containing a frequent sequence       sequences |FreqSeq|; compression rate
3  C          maximal number of codes                        as in 2
4  Gap        maximal allowed sequence gap ratio             as in 2

The algorithm consists of the following steps.
1. We find all frequent term sequences in the document using the Apriori-TID algorithm (Agrawal and Srikant, 1994) for the given Supp and Gap and store them in the set FreqSeq, which is kept in Standard Candidate Order. The coding table CT is initialized to contain all single normalized terms and their frequencies. CT is always kept in Standard Cover Order (Steps 1 and 2 in Algorithm 1).
2. We repeatedly choose frequent sequences from the set FreqSeq so that the size of the encoded dataset is minimal, with every selected sequence replaced by its code. Selection is done by computing the decrease in the size of the encoding when each sequence is considered as a candidate for addition to CT (Step 3 in Algorithm 1).
3. The summary is constructed by incrementally adding the sentences with the highest coverage of encoded term sequences that are not covered by previously selected sentences (Step 4 in Algorithm 1). Sentences are selected in a greedy manner as long as the word limit W is not exceeded.
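The candidate selection in step 2 can be approximated by estimating, for each frequent sequence, how much the encoding shrinks when its occurrences are replaced by a single code. The sketch below is a naive simplification: it counts only contiguous occurrences and uses character-length term costs with a fixed code cost, both assumptions rather than the paper's exact model:

```python
def occurrences(seq, transaction):
    """Number of contiguous occurrences of seq in transaction."""
    k = len(seq)
    return sum(tuple(transaction[i:i + k]) == tuple(seq)
               for i in range(len(transaction) - k + 1))

def compression_gain(seq, transactions, code_cost=2):
    """Naive size decrease: each occurrence replaces the characters
    of the sequence's terms with one code of cost code_cost."""
    occ = sum(occurrences(seq, t) for t in transactions)
    return occ * (sum(len(term) for term in seq) - code_cost)

def best_candidate(candidates, transactions):
    """Greedy choice: the candidate with the largest estimated gain."""
    return max(candidates, key=lambda s: compression_gain(s, transactions))

transactions = [("prey", "becom", "predat"), ("prey", "becom", "hunter")]
candidates = [("prey", "becom"), ("becom", "predat")]
best_candidate(candidates, transactions)  # ("prey", "becom"): 2 occurrences
```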
Algorithm 1: Gamp: Krimp-based extractive summarization with gaps

Input: (1) a document containing sentences S1, . . . , Sn after tokenization, stemming, and stop-word removal; (2) normalized terms T1, . . . , Tm; (3) summary size W; (4) limit C on the number of codes to use; (5) minimal support bound Supp; (6) maximal gap ratio Gap.
Output: extractive summary Summary

/* STEP 1: Frequent sequence mining */
FreqSeq ← inexact frequent sequences of terms from {T1, . . . , Tm} appearing in at least a Supp fraction of sentences and having a gap ratio of at least Gap;
Sort FreqSeq according to Standard Candidate Order;
/* STEP 2: Initialize the coding table */
Add all terms T1, . . . , Tm and their support to CT;
Keep CT always sorted according to Standard Cover Order;
Initialize prefix codes according to the order of sequences in CT;
/* STEP 3: Find the best encoding */
CodeCount ← 0;
while CodeCount < C and FreqSeq is not empty do
    EncodedData ← PrefixEncoding({S1, . . . , Sn}, CT);
    BestCode ← the sequence in FreqSeq whose addition to CT yields the largest decrease in the size of EncodedData;
    Add BestCode to CT;
    Remove supersets of BestCode from FreqSeq;
    CodeCount++;
end
/* STEP 4: Build the summary */
Summary ← ∅;
while #words(Summary) < W do
    Find the sentence Si that covers the largest set T of terms in CT and add it to Summary;
    Remove the terms of T from CT;
end
return Summary

Example 2.1 Let dataset D contain the following three sentences (taken from the "House of Cards" TV show):
S1 = A hunter must stalk his prey until the hunter becomes the hunted.
S2 = Then the prey becomes the predator.
S3 = Then the predator and the hunter fight.
After stemming, tokenization, and stop-word removal we obtain the unique (stemmed) terms and can view the sentences as the following sequences of normalized terms:
S1 = (t1, t2, t3, t4, t1, t5, t4)
S2 = (t4, t5, t7)
S3 = (t7, t1, t8)
The initial coding table CT contains all frequent single terms in Standard Cover Order: t5, t1, t7, t4, t8, t2, t3. Let the minimal support bound be 2/3, i.e., to be frequent a sequence must appear in at least 2 sentences, and let the gap ratio be 1/2. Also, let the limit C be 4, meaning that only the first four entries of the coding table are used for encoding. There exists a frequent sequence (t4, t5) that appears twice in the text, once in S1 with a gap and once in S2 without a gap. We add it to the coding table. Now S1 covers 3 out of 4 entries in CT, while S2 and S3 cover 2 entries each. If our summary is to contain just one sentence, we select S1.
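The greedy selection of Step 4 can be sketched as follows. This is a simplified version in which coverage is plain set intersection over whitespace tokens, without the gap-aware sequence matching of the full method; the example sentences are hypothetical:

```python
def greedy_summary(sentences, ct_entries, word_limit):
    """Repeatedly pick the sentence covering the most not-yet-covered
    coding-table entries, stopping before the word limit is exceeded."""
    remaining = set(ct_entries)
    candidates = list(sentences)
    summary, words = [], 0
    while candidates and remaining:
        best = max(candidates, key=lambda s: len(remaining & set(s.split())))
        gain = len(remaining & set(best.split()))
        if gain == 0 or words + len(best.split()) > word_limit:
            break
        summary.append(best)
        words += len(best.split())
        remaining -= set(best.split())
        candidates.remove(best)
    return summary

sentences = ["hunter stalk prey", "prey becom predat", "predat hunter fight"]
greedy_summary(sentences, {"hunter", "prey", "predat"}, word_limit=3)
```

Covered entries are removed from consideration after each pick, which is what steers the selection away from redundant sentences.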

Experimental settings and results
We performed experiments on three corpora from the Document Understanding Conference (DUC): 2002, 2004, and 2007, summarized in Table 2. Summaries were evaluated by the ROUGE-1 and ROUGE-2 (Lin, 2004) recall scores, with the word limit indicated in Table 2, without stemming and stop-word removal. The results of the introduced algorithm (Gamp) may be affected by several input parameters: minimal support (Supp), codes limit (C), and the maximal gap allowed between frequent words (Gap). To find the best algorithm settings for the general case, we performed experiments exploring the impact of these parameters on the summarization results. First, we experimented with different values of the support count in the range [2, 10]. The results show that we get the best summaries using sequences that occur in at least four document sentences. The limit on the number of codes is an additional parameter, and we explored its impact on the quality of the generated summaries. We compared the Gamp algorithm with two well-known unsupervised state-of-the-art summarizers, denoted Gillick (Gillick and Favre, 2009) and McDonald (McDonald, 2007). As a baseline, we used a very simple approach that takes the first sentences into a summary (denoted TopK). It is noteworthy that, in addition to the greedy approach, we also evaluated global optimization with maximal coverage and minimal redundancy using Linear Programming (LP). However, the experimental results did not show any improvement over the greedy approach; therefore, we report only the results of the greedy solution.

Conclusions
In this paper, we introduce a new approach for summarizing text documents based on their Minimum Description Length. We describe documents using frequent sequences of their words; the sentences with the highest coverage of the best compressing set are selected into the summary. The experimental results show that this approach outperforms other unsupervised state-of-the-art methods when summarizing long documents or sets of related documents. We would not recommend our approach for summarizing single short documents, which do not contain enough content to provide a high-quality description. In the future, we intend to apply the MDL method to keyword extraction, headline generation, and other related tasks.