Sparse Optimization for Unsupervised Extractive Summarization of Long Documents with the Frank-Wolfe Algorithm

We address the problem of unsupervised extractive document summarization, especially for long documents. We model the unsupervised problem as sparse auto-regression and approximate the resulting combinatorial problem via a convex, norm-constrained problem, which we solve using a dedicated Frank-Wolfe algorithm. To generate a summary with $k$ sentences, the algorithm only needs to execute $\approx k$ iterations, making it very efficient. We explain how to avoid explicit calculation of the full gradient and how to include sentence embedding information. We evaluate our approach against two other unsupervised methods using both lexical (standard) and semantic (embedding-based) ROUGE scores. Our method achieves better results on both evaluation datasets and works especially well when combined with embeddings for highly paraphrased summaries.


Introduction
With the overwhelming increase of digital information, automatic text summarization has become important for many applications such as financial reviews and medical articles. Manually summarizing this volume of information takes considerable time and effort, which has motivated the study of efficient and reliable automatic text summarization methods. Automatic summarization is the process of generating a condensed version of a text that best describes the original one (Hahn and Mani, 2000; Luhn, 1958). The two mainstream approaches in the field of automatic summarization are extractive and abstractive.
Extractive approaches generate summaries by selecting a subset of informative words, phrases, or sentences directly from the source text. In contrast, abstractive approaches use linguistic methods to decompose and build a semantic representation of the text and use natural language generation techniques to produce a summary (Chopra et al., 2016; Nallapati et al., 2016; Zeng et al., 2016; Rush et al., 2015). In recent years, neural network architectures have made abstractive summarization popular. However, abstractive approaches are generally harder to develop, as they require high-performing natural language generation techniques, which is itself an active research field. Besides these two categories, mixed strategies that combine both extractive and abstractive approaches have also been proposed in recent literature (Peng et al., 2019; Cao et al., 2018; See et al., 2017). Previous work on extractive summarization includes statistical (Saggion and Poibeau, 2012; Das and Martins, 2007; Goldstein et al., 1999; Kupiec et al., 1995; Paice, 1990), graph-based, and optimization-based approaches. Graph-based approaches treat text as a network instead of a simple bag of words and use graph-based ranking methods to generate a summary (Erkan and Radev, 2011; Ouyang et al., 2009; Mihalcea and Tarau, 2004). Optimization-based methods use techniques such as sparse optimization (Yao et al., 2015; Elhamifar and Vidal, 2013), integer linear programming (ILP) (Qian and Liu, 2013; Berg-Kirkpatrick et al., 2011; Woodsend and Lapata, 2011; Gillick and Favre, 2009), and constraint optimization (Durrett et al., 2016; McDonald, 2007) to construct the summary.
In this work, we focus on extractive summarization for long documents. Automatic text summarization for long documents is especially challenging, as obtaining high-quality human summaries for long documents is often quite costly and time-consuming. Recent work on extractive summarization has focused on neural network architectures (Nallapati et al., 2017; Cheng and Lapata, 2016). Although these methods are successful in generating summaries for short documents, they often have difficulties with long input sequences (Shao et al., 2017).
More recent work has started to investigate neural extractive summarization methods for long documents (Xiao and Carenini, 2019; Wang et al., 2017). However, these methods are supervised and require high-quality training data in order to train the neural network models. This creates challenges for domains that do not have massive training datasets. Kedzie et al. (2018) compared recent neural extractive summarization models across different domains including news, personal stories, and medical articles. They found that many sophisticated neural extractive summarizers do not outperform simpler models, and that word embedding averaging performs as well as or better than RNNs or CNNs for sentence embedding. This suggests that a simpler model combined with pre-trained word embeddings shows promise for summarizing long documents in domains with little or no training data.
In this work, we propose an unsupervised method for extractive summarization of long documents based on a sparse optimization framework, which we solve using a dedicated Frank-Wolfe algorithm that can be combined with pre-trained word embeddings to construct a distributed input representation. Our work builds on the previous work of Cheng et al. (2018) but is designed specifically for the summarization task. The proposed framework is an unsupervised model that is efficient and does not require a training corpus, as typical supervised solutions would. We test our method on two datasets that contain long documents, 2019 FINANCIAL OUTLOOKS and CLASSICAL LITERATURE, and compare it against two baselines: sparse subspace clustering (SSC) and TextRank. The experimental results show that our method achieves higher ROUGE scores than the baselines on both datasets. In particular, when combined with sentence embeddings, our method achieves a higher semantic ROUGE score when evaluated on paraphrased summaries. Moreover, we show that our method is computationally more efficient than the alternatives.

Methodologies
Notation We denote $X^{(i)}$ and $X_{(i)}$ as the $i$-th row and $i$-th column of a matrix $X$, respectively. The matrix $X_t$ denotes the value of $X$ at iteration $t$, while $[X_t]^{(i)}$ and $[X_t]_{(i)}$ denote the $i$-th row and $i$-th column of $X_t$. The sum $\sum_{i=1}^{n} \|X^{(i)}\|_2$ of row-wise $L_2$ norms is the group LASSO constraint. The norm $\|\cdot\|_F$ is the Frobenius norm.

Sparse auto-regressive problem
Extractive summarization aims at finding a minimal set of representative sentences of the original document that effectively summarizes the entire document. Let $A \in \mathbb{R}^{d \times n}$ be the data matrix that represents the document, where each column of $A$ represents a sentence in the source document. Here, $d$ is the number of features for each sentence, and $n$ is the number of sentences in the source document. The source document is written as $A = [a_1\ a_2\ \cdots\ a_n]$, where the column vector $a_i$ is a sentence in the source document. Finding the set of representative sentences assumes that the source document $A$ can be approximated by a sparse combination of sentences in the document:

$$a_j \approx \sum_{i=1}^{n} [x_i]_j\, a_i, \quad j = 1, \dots, n.$$

The column vector $x_i$ is a decision variable to be learned, and our goal is to select $k$ sentences whose corresponding decision variables $x_i$ are non-zero. Writing this in matrix form, with $x_i^T$ being the $i$-th row of the matrix variable $X$, we can formulate the above as the auto-regressive problem

$$\min_{X}\ \frac{1}{2} \|A - AX\|_F^2 \quad \text{s.t.}\ \|v\|_0 \le k,\ \ v_i = \|X^{(i)}\|_2,\ \ i = 1, \dots, n, \qquad (1)$$

where $X$ is row-sparse. Here, $X^{(i)}$ represents the $i$-th row of the matrix variable $X$, $v_i$ is its norm, and the constraint is written in terms of the $L_0$-norm (cardinality, or number of non-zero entries) of $v$, effectively forcing at least $n - k$ entire rows of $X$ to be zero, thereby singling out a short list of at most $k$ sentences that well represent the whole data set.
The above problem is non-convex and hard to solve, but it can be well approximated by the so-called $L_1$-norm heuristic, leading to the convex approximation

$$\min_{X}\ \frac{1}{2} \|A - AX\|_F^2 \quad \text{s.t.}\ \sum_{i=1}^{n} \|X^{(i)}\|_2 \le \beta, \qquad (2)$$

where $\beta$ is a hyper-parameter that indirectly controls the row-sparsity (number of non-zero rows). Note that the model simply uses the $L_1$-norm of the vector $v$ in (1) to approximate the cardinality constraint on $v$. If $X^*$ is the solution of problem (2), then the columns of the data matrix $A$ that correspond to the non-zero rows of $X^*$, i.e., the columns $A_{(j)}$ with $[X^*]^{(j)} \neq 0$, are the selected sentences.
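To make the relaxation concrete, problem (2) can be prototyped with an off-the-shelf convex solver before turning to the dedicated Frank-Wolfe method. The following is a minimal sketch under our own assumptions (cvxpy as the solver, toy dimensions, random data); it is not the paper's implementation:

```python
import cvxpy as cp
import numpy as np

n, d = 40, 300                    # toy sizes: n sentences, d features
A = np.random.randn(d, n)         # columns are sentence vectors
beta = 5.0                        # group-LASSO budget (hyper-parameter)
k = 5                             # desired summary length

X = cp.Variable((n, n))
objective = cp.Minimize(0.5 * cp.sum_squares(A - A @ X))
# sum of row-wise L2 norms <= beta: convex surrogate for row-sparsity
constraints = [cp.sum(cp.norm(X, 2, axis=1)) <= beta]
cp.Problem(objective, constraints).solve()

row_norms = np.linalg.norm(X.value, axis=1)
selected = np.argsort(-row_norms)[:k]  # indices of candidate summary sentences
```

A generic solver like this scales poorly in $n$, which is precisely what motivates the dedicated Frank-Wolfe algorithm described next.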

Frank-Wolfe unsupervised extractive summarization
The Frank-Wolfe (FW) or conditional gradient algorithm is an iterative first-order optimization algorithm for constrained convex optimization (Frank and Wolfe, 1956). Although the algorithm was introduced over half a century ago, it has experienced a revival in recent years due to its projection-free iterations and broad applications in machine learning (Jaggi, 2013). The FW algorithm solves a general constrained optimization problem of the form $\min_{x \in \mathcal{D}} f(x)$, where the convex function $f$ is differentiable with $L$-Lipschitz gradient and the domain $\mathcal{D}$ is a convex compact set. At each iteration, the FW algorithm requires solving a linear approximation to the objective function over the domain, often referred to as a linear minimization oracle (LMO), and then updates the solution accordingly. Concretely, at each iteration we first calculate the gradient, solve the LMO to find a descent direction, choose the step size by line search or by the schedule $\frac{2}{t+2}$, and update the estimate. Algorithm 1 summarizes the FW process.
Algorithm 1 Frank-Wolfe algorithm
1: let $x_0 \in \mathcal{D}$
2: for $t = 0, 1, \dots$ do
3: compute $\nabla f(x_t)$
4: $s_t \leftarrow \arg\min_{s \in \mathcal{D}} \langle s, \nabla f(x_t) \rangle$ (LMO)
5: $r_t \leftarrow \frac{2}{t+2}$, or set $r_t$ by line search
6: $x_{t+1} \leftarrow (1 - r_t)\, x_t + r_t\, s_t$
7: end for

Unlike other descent methods for constrained optimization that require a projection step at each iteration, the FW algorithm is projection-free and only needs to solve the LMO. Applying the FW algorithm to our sparse constrained optimization problem (2) results in an unsupervised method for extractive summarization. Algorithm 1 is written in terms of the vector variable $x$; however, it is straightforward to extend it to the matrix variable $X$. The algorithm starts with $X_0 \leftarrow 0$, meaning no sentence is selected at first, and then greedily selects one sentence at each iteration. It terminates once $k$ rows of $X_t$ are non-zero (dense) or when it converges to a row-sparse solution with $k^* < k$ non-zero rows. The complete Frank-Wolfe unsupervised extractive summarization method is outlined in Algorithm 2. Next, we explain the details of the algorithm and provide an efficient gradient calculation scheme.
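As a companion to the pseudocode, the following is a minimal, generic sketch of the FW loop in Python (our illustration, not the authors' implementation; lmo and grad stand for the problem-specific oracles detailed below):

```python
import numpy as np

def frank_wolfe(grad, lmo, X0, max_iter=1000, tol=1e-6):
    """Generic projection-free Frank-Wolfe loop (illustrative sketch).

    grad(X) -> gradient of the objective at X
    lmo(G)  -> feasible point S minimizing the inner product <S, G>
    """
    X = X0
    for t in range(max_iter):
        G = grad(X)
        S = lmo(G)                    # linear minimization oracle
        gap = np.sum(G * (X - S))     # Frank-Wolfe duality gap
        if gap < tol:                 # converged
            break
        r = 2.0 / (t + 2.0)           # standard step-size schedule
        X = (1 - r) * X + r * S       # convex-combination update
    return X
```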
Linear minimization oracle The algorithm requires solving the LMO at each iteration. The solution $S_t$ of the LMO specifies the direction of descent at each step:

$$S_t = \arg\min_{S}\ \langle S, \nabla f(X_t) \rangle \quad \text{s.t.}\ \sum_{i=1}^{n} \|S^{(i)}\|_2 \le \beta.$$

Because of the group LASSO constraint in (2), the solution matrix $S_t$ is a rank-1 matrix with a single non-zero row. The non-zero row is chosen based on the maximum $L_2$ norm among the gradient's rows:

$$j = \arg\max_{i}\ \|[\nabla f(X_t)]^{(i)}\|_2.$$

We denote the non-zero row of $S_t$ at row $j$ as $[S_t]^{(j)}$. Its magnitude is $\beta$, and its direction is chosen to minimize the inner product with the gradient:

$$[S_t]^{(j)} = -\beta\, \frac{[\nabla f(X_t)]^{(j)}}{\|[\nabla f(X_t)]^{(j)}\|_2}.$$

The algorithm produces sparse and low-rank iterates, since at most one extra row of $X$ becomes non-zero in each step by the addition of $r_t S_t$.
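In code, this oracle reduces to a few numpy operations (our sketch; names are illustrative):

```python
import numpy as np

def lmo_group_lasso(G, beta):
    """LMO over the group-LASSO ball {S : sum_i ||S^(i)||_2 <= beta}.

    Returns a rank-1 matrix whose only non-zero row is -beta times the
    normalized gradient row with the largest L2 norm.
    """
    row_norms = np.linalg.norm(G, axis=1)
    j = int(np.argmax(row_norms))        # row with maximum gradient norm
    S = np.zeros_like(G)
    S[j] = -beta * G[j] / row_norms[j]   # magnitude beta, steepest direction
    return S
```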

Efficient gradient calculation
The algorithm requires calculating the gradient at each iteration. The gradient of the objective function in (2) is

$$\nabla f(X) = A^T A X - A^T A = K(X - I),$$

where the matrix $K = A^T A$ can be calculated once and used throughout the algorithm. Explicitly calculating the gradient is expensive due to the matrix-matrix product $KX$ (naively $O(n^3)$). However, the structure of the problem allows us to efficiently calculate $KX$ at each iteration. From line 6 in Algorithm 1, we know that $X_t$ is a weighted average of $X_{t-1}$ and the rank-1 matrix $S_{t-1}$:

$$X_t = (1 - r_{t-1})\, X_{t-1} + r_{t-1}\, S_{t-1}.$$

This implies that $(KX)_t$ is the same weighted average of $(KX)_{t-1}$ and $KS_{t-1} = K_{(j)} [S_{t-1}]^{(j)}$, where $K_{(j)}$ is the $j$-th column of $K$, corresponding to the $j$-th (non-zero) row of $S_{t-1}$. Since $(KX)_{t-1}$ is known at iteration $t$, we are only required to calculate $K_{(j)} [S_{t-1}]^{(j)}$, which is extremely fast (in $O(n)$).
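Putting the oracle and the incremental gradient together, the full loop might look like the sketch below (our illustration under the same assumptions as the earlier snippets; lmo_group_lasso is the function sketched above):

```python
import numpy as np

def fw_summarize(K, beta, k, max_iter=1000, tol=1e-6):
    """Sketch of FW summarization with the incremental KX update.

    K: precomputed kernel A^T A (or any similarity matrix Phi(A)).
    Returns the indices of up to k selected sentences.
    """
    n = K.shape[0]
    X = np.zeros((n, n))
    KX = np.zeros((n, n))                # running value of K @ X
    for t in range(max_iter):
        G = KX - K                       # gradient K(X - I): no matmul needed
        S = lmo_group_lasso(G, beta)     # rank-1 descent direction
        if -np.sum(G * (S - X)) < tol:   # converged with k* < k rows
            break
        r = 2.0 / (t + 2.0)
        j = int(np.argmax(np.linalg.norm(S, axis=1)))
        X = (1 - r) * X + r * S
        KX = (1 - r) * KX + r * np.outer(K[:, j], S[j])  # rank-1 update
        if np.count_nonzero(np.linalg.norm(X, axis=1) > 1e-12) >= k:
            break                        # k sentences selected
    row_norms = np.linalg.norm(X, axis=1)
    return np.argsort(-row_norms)[:k]
```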
Stopping criteria The algorithm terminates once $k$ rows of $X_t$ are non-zero (dense) or when $X_t$ converges to a row-sparse solution such that $\langle -\nabla f(X_t),\ S_t - X_t \rangle < \epsilon$. Once the algorithm terminates, it returns the $k$ sentences that correspond to the non-zero rows of $X_t$ via GetSummary($X_t$, $k$).
Algorithm 2 Frank-Wolfe unsupervised extractive summarization
1: input $\beta$, $k$, $\epsilon$
2: initialize $X_0, (KX)_0, r_0, t \leftarrow 0, 0, 0, 1$
3: compute $K = A^T A$ or $\Phi(A)$
4: for $t = 1, 2, \dots$ do
5: $\nabla f(X_{t-1}) \leftarrow (KX)_{t-1} - K$
6: $S_{t-1} \leftarrow \arg\min_S \langle S, \nabla f(X_{t-1}) \rangle$ s.t. $\sum_{i=1}^{n} \|S^{(i)}\|_2 \le \beta$
7: $r_{t-1} \leftarrow \frac{2}{t+2}$ or line search
8: $X_t \leftarrow (1 - r_{t-1})\, X_{t-1} + r_{t-1}\, S_{t-1}$
9: $(KX)_t \leftarrow (1 - r_{t-1})\, (KX)_{t-1} + r_{t-1}\, K_{(j)} [S_{t-1}]^{(j)}$
10: break if $k$ rows of $X_t$ are non-zero or $X_t$ converges
11: end for
12: return GetSummary($X_t$, $k$)
Sentence similarity measure We note that the gradient of (2) depends only on the kernel (or "Gram") matrix $K = A^T A$ and not on $A$ itself. This matrix is akin to a similarity matrix, with $K_{ij}$ measuring the similarity between sentences $i$ and $j$. If the matrix $A$ is normalized, the $K_{ij}$'s are cosine similarities. As a result, we may replace the matrix $K$ with any matrix $\Phi(A)$ that offers a good similarity measure between sentences. This allows us to incorporate various sentence scoring functions $\Phi(\cdot)$.
In this paper, we experimented with two such similarity measures: 1) TF-IDF-like, and 2) sentence embedding.
For the TF-IDF-like similarity measure, we use Okapi BM25 (Robertson and Zaragoza, 2009) to construct the kernel matrix $K$. BM25 and its variants represent the state of the art among TF-IDF-like sentence scoring functions. Similarly, any sentence embedding technique can be used to embed the document matrix $A$ in a much lower-dimensional space; that is, we can set $K_{ij} = \phi(a_i)^T \phi(a_j)$, with $\phi(a)$ the (low-dimensional) vector representing the sentence $a$. In this work, we use a simple yet effective sentence embedding method called smooth inverse frequency (SIF) (Arora et al., 2017) to measure sentence similarities. Arora et al. (2017) show that SIF, a simple weighted average of word vectors modified by SVD, outperforms complex methods such as RNNs and LSTMs. More sophisticated sentence embedding techniques such as neural architectures can also be used here; however, one should also consider the cost of computing the kernel matrix $K$ with such a technique.
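As an illustration, a simplified SIF-style kernel can be computed as follows (our sketch of the weighted-average-plus-SVD recipe from Arora et al. (2017); the input structures and the smoothing constant $a$ are assumptions of this example):

```python
import numpy as np

def sif_kernel(sentences, word_vecs, word_prob, a=1e-3):
    """Simplified SIF sentence-similarity kernel (illustrative sketch).

    sentences: list of token lists; word_vecs: dict word -> embedding;
    word_prob: dict word -> estimated unigram probability p(w).
    """
    dim = len(next(iter(word_vecs.values())))
    E = np.zeros((len(sentences), dim))
    for s, tokens in enumerate(sentences):
        vecs = [a / (a + word_prob.get(w, 0.0)) * word_vecs[w]
                for w in tokens if w in word_vecs]
        if vecs:
            E[s] = np.mean(vecs, axis=0)     # frequency-weighted average
    u = np.linalg.svd(E, full_matrices=False)[2][0]
    E -= np.outer(E @ u, u)                  # remove the common component
    E /= np.linalg.norm(E, axis=1, keepdims=True) + 1e-12
    return E @ E.T                           # cosine-similarity kernel K
```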
In the following, the acronyms FWSum-BM25 and FWSum-SIF refer to the Frank-Wolfe unsupervised extractive summarization method used in conjunction with the BM25 and SIF similarity kernels, respectively.

Datasets
Dernoncourt et al. (2018) surveyed the current large-scale datasets for summarization. Most of them contain relatively short documents, usually less than 2 pages. To experiment on long documents, we used the recently open-sourced 2019 FINANCIAL OUTLOOKS and CLASSICAL LITERATURE datasets (https://github.com/SumUpAnalytics/goldsum), which contain much longer documents than those surveyed in Dernoncourt et al. (2018).
2019 FINANCIAL OUTLOOKS This corpus contains 10 publicly available reports on finance from a number of large financial institutions. Each report ranges from 10 to 144 pages, with a median length of 33 pages. There are no Gold summaries per se since the data is not annotated by a human. Hence, we chose to define the gold summaries as the collection of sentences or parts of sentences that appear in bold in the content, or any sentences that are highlighted as an insert within the content. This is a reasonable heuristic as these parts are generally prepared by the authors to highlight the takeaway of the content.
CLASSICAL LITERATURE This corpus contains 11 English-language classical books ranging from 53 to 1139 pages, with a median length of 198 pages, along with human-written summaries. The Gold summaries for each chapter of a book are retrieved from WikiSummary (http://wikisum.com/w/Main_Page).

Baselines
We compared our method with two other unsupervised extractive approaches: one uses a sparse optimization-based method and the other a graph-based method.
Sparse subspace clustering (SSC) Sparse subspace clustering (SSC) solves a sparse optimization program on an auto-regressive problem similar to (1), based on the self-expressiveness property of the data (Elhamifar and Vidal, 2013). This property assumes that each data point can be efficiently reconstructed by a combination of other points in the data, and that a sparse representation of each data point exists. The authors consider a convex relaxation, as we do in (2), since solving the original sparse optimization problem is in general NP-hard. SSC uses the Alternating Direction Method of Multipliers (ADMM) to solve the sparse optimization problem; in our work, we instead employ the more efficient Frank-Wolfe algorithm. Subsequent work tried to speed up SSC (You et al., 2016), but at the expense of removing the group LASSO constraint that is crucial for our summarization problem. Our method preserves the group LASSO constraint while obtaining a faster run-time. In our experiments, we used the implementation of Elhamifar and Vidal (2013), which can be found on their website.
TextRank TextRank (Mihalcea and Tarau, 2004) is a commonly used graph-based unsupervised extractive summarization method. It is also very efficient when extracting summaries from a long document. TextRank employs an idea similar to PageRank, where vertices in the graph are sentences in the document and edges between two sentences are weighted as a function of their content overlap.
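For reference, a bare-bones TextRank baseline can be sketched in a few lines (our illustration using networkx; the overlap weighting follows the flavor of Mihalcea and Tarau (2004), but the exact preprocessing is an assumption of this example):

```python
import networkx as nx
import numpy as np

def textrank_summarize(sentences, k):
    """Minimal TextRank sketch: rank sentences by PageRank over a
    word-overlap similarity graph and return the top-k sentence indices."""
    tokens = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            denom = np.log(len(tokens[i]) + 1) + np.log(len(tokens[j]) + 1)
            w = len(tokens[i] & tokens[j]) / denom if denom > 0 else 0.0
            if w > 0:
                G.add_edge(i, j, weight=w)   # edge weight = content overlap
    scores = nx.pagerank(G, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:k]
```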

Lexical and semantic ROUGE scores
We evaluate the systems using ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) so as to account for different summary lengths. The raw ROUGE score only measures the lexical overlap between the generated summaries and the reference summaries. We refer to the raw ROUGE score defined in Lin (2004) as lexical ROUGE and use the implementation in the Python rouge library. When summarizing a long document, humans tend to paraphrase the source document in order to condense and synthesize the information. However, lexical ROUGE scores are unable to measure the quality of paraphrasing. To address this shortcoming when the summaries are paraphrased, word embedding ROUGE scores (Ng and Abrecht, 2015) are also used to evaluate the quality of the generated summaries. Word embedding ROUGE scores measure the semantic similarity of words rather than only lexical overlap. Ng and Abrecht (2015) showed that embedding ROUGE achieves better correlations with human assessments than lexical ROUGE when measured with the Spearman and Kendall rank coefficients on the TAC AESOP summarization dataset. We refer to the word embedding ROUGE scores as semantic ROUGE in our evaluation.
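For concreteness, the lexical scoring step with the rouge package looks like the following (a minimal usage sketch; the example strings are ours):

```python
from rouge import Rouge

hypothesis = "the algorithm selects representative sentences from the document"
reference = "representative sentences are extracted from the source document"

scores = Rouge().get_scores(hypothesis, reference)[0]
# each metric reports recall (r), precision (p), and F1 (f)
print(scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"])
```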

Results and Analysis
In our experiments, we set the number of selected sentences $k$ to be the same as the length of the reference summary for all methods. FWSum-BM25 performs especially well when evaluated with lexical ROUGE, highlighting its ability to capture lexical information (measured by unigram and bigram overlap). When evaluated on the FINANCIAL OUTLOOKS data, FWSum-BM25 and TextRank generally outperform FWSum-SIF, with FWSum-BM25 being the best-performing method. Presumably, this is because the Gold summaries of the FINANCIAL OUTLOOKS data are taken directly from the source document without much paraphrasing, favoring sentence scoring functions that directly measure content overlap.
However, when evaluated by semantic ROUGE on the CLASSICAL LITERATURE data, FWSum-SIF starts to show promise. The Gold summaries of the CLASSICAL LITERATURE data are written by human writers and are highly paraphrased and condensed. As a result, semantic ROUGE is a better measurement for this dataset. As shown in the table, FWSum-SIF outperforms the other methods by a significant margin. This improvement suggests that using embeddings in the sentence scoring function allows for comparisons based on the semantics of word sequences.
These results show that different sentence scoring functions may be chosen based on the nature of the summary. For summaries that are mostly taken from the source document without much paraphrasing, a lexical-overlap or TF-IDF-like kernel matrix may be used. For summaries that are highly paraphrased, an embedding-based kernel matrix may be more suitable. Our method is able to work with both.
Computational complexity Our method requires an up-front cost of calculating the kernel matrix $K$. Each subsequent iteration mostly requires the LMO and the gradient calculation detailed in the Methodologies section. By exploiting the structure of the problem, we avoid explicitly calculating the full gradient. Furthermore, due to the greedy nature of the algorithm, it terminates when $k$ sentences are selected or the solution converges with $k^* < k$ sentences. This means that the algorithm only needs to execute $\approx k$ iterations, each with a cost linear in the problem size. Figure 1 compares the run-times of our method (FWSum-BM25), TextRank, and SSC. As shown in the figure, our method is the most efficient of the three, showing its potential for summarizing long documents.

Conclusion
Unsupervised document summarization is a challenging task, especially on long documents. In this work, we propose an efficient unsupervised extractive summarization model that is suitable for long documents by employing a dedicated Frank-Wolfe algorithm. Our method allows one to incorporate sentence embeddings or any sentence scoring function that is best suited for the dataset or the application. We evaluate our method against two other unsupervised extractive summarization methods on two datasets containing much longer documents than other summarization corpora used in the past. We evaluate the methods on both lexical and semantic ROUGE in order to overcome the shortcomings of lexical ROUGE and to provide a better assessment of summary quality. We observed that our methods (both FWSum-BM25 and FWSum-SIF) achieve the best results on both datasets and that FWSum-SIF works especially well on paraphrased summaries. Our results also motivate the exploration of different kernel functions and embedding methods, which is left as future work.