Discrete Optimization for Unsupervised Sentence Summarization with Word-Level Extraction

Automatic sentence summarization produces a shorter version of a sentence, while preserving its most important information. A good summary is characterized by language fluency and high information overlap with the source sentence. We model these two aspects in an unsupervised objective function, consisting of language modeling and semantic similarity metrics. We search for a high-scoring summary by discrete optimization. Our proposed method achieves a new state of the art for unsupervised sentence summarization according to ROUGE scores. Additionally, we demonstrate that the commonly reported ROUGE F1 metric is sensitive to summary length. Since this is unintentionally exploited in recent work, we emphasize that future evaluation should explicitly group summarization systems by output length brackets.


Introduction
Sentence summarization transforms a long source sentence into a short summary, while preserving key information (Rush et al., 2015). Sentence summarization has wide applications, for example, news headline generation and text simplification.
State-of-the-art sentence summarization systems are based on sequence-to-sequence neural networks (Rush et al., 2015; Nallapati et al., 2016; Wang et al., 2019), which require massive parallel data for training. Therefore, unsupervised sentence summarization has recently attracted increasing interest. Cycle-consistency approaches treat the summary as a discrete latent variable and use it to reconstruct the source sentence (Wang and Lee, 2018; Baziotis et al., 2019). Such latent-space generation fails to explicitly model the resemblance between the source sentence and the target summary. Zhou and Rush (2019) propose a left-to-right beam search approach based on a heuristically defined scoring function. However, beam search is biased towards the first few words of the source.
1 Our code and system outputs are available at: https://github.com/raphael-sch/HC_Sentence_Summarization
Figure 1: Summarizing a sentence $x$ by hill climbing. Each row is a Boolean vector $a_t$ at a search step $t$. A black cell indicates a word is selected, and vice versa. Randomly swapping two values in the Boolean vector yields a new summary that is scored by an objective function that measures language fluency and semantic similarity. If the new summary increases the objective, this summary is accepted as the current best solution. Rejected solutions are not depicted. Summary: "bhp billiton dropping hostile bid for rio tinto"; reference: "bhp billiton drops rio tinto takeover bid".
In this paper, we propose a hill-climbing approach to unsupervised sentence summarization, directly extracting words from the source sentence. This is motivated by the observation that human-written reference summaries exhibit high word overlap with the source sentence, even preserving word order to a large extent. To perform word extraction for summarization, we define a scoring function, similar to Miao et al. (2019) and Zhou and Rush (2019), that evaluates the quality of a candidate summary by language fluency, semantic similarity to the source, and a hard constraint on output length. We search towards our scoring function by first-choice hill climbing (FCHC), shown in Figure 1. We start from a random subset of words of the required output length. For each search step, a new candidate is sampled by randomly swapping a selected word and a non-selected word. We accept the new candidate if its score is higher than the current one. In contrast to beam search (Zhou and Rush, 2019), our summary is not generated sequentially from the beginning of a sentence, and is therefore not biased towards the first few words.
Due to the nature of the search action, our approach is able to explicitly control the length of a summary as a hard constraint. In all previous work, the summary length is weakly controlled by length embeddings or a soft length penalty (Zhou and Rush, 2019; Wang and Lee, 2018; Fevry and Phang, 2018; Baziotis et al., 2019). Thus, the summaries generated by different systems vary considerably in average length, for example, ranging from 9 to 15 on a headline corpus (Section 4.1). Previous work uses ROUGE F1 to compare summaries that might differ in length. We show that ROUGE F1 is unfortunately sensitive to summary output length, in general favoring models that produce longer summaries. Therefore, we argue that controlling the output length should be an integral part of the summarization task and that a fair system comparison can only be conducted between summaries in the same length bracket.
Our model establishes a new state of the art for unsupervised sentence summarization across all commonly used length brackets and different ROUGE metrics on the Gigaword dataset for headline generation (Rush et al., 2015) and on DUC2004 (Over and Yen, 2004).
The main contributions of this paper are:
• We propose a novel method for unsupervised sentence summarization by hill climbing with word-level extraction.
• We outperform current unsupervised sentence summarization systems, including more complex sentence reconstruction models.
• We show that ROUGE F1 is sensitive to summary length and thus emphasize the importance of explicitly controlling summary length for a fair comparison among different summarization systems.
Sentence summarization yields a short summary for a long sentence. Hori and Furui (2004) and Clarke and Lapata (2006) extract single words from the source sentence based on language model fluency and linguistic constraints. They search via dynamic programming with a trigram language model, which restricts the model capacity. The Hedge Trimmer method (Dorr et al., 2003) also uses hand-crafted linguistic rules to remove constituents from a parse tree until a certain length is reached. Rush et al. (2015) propose a supervised abstractive sentence summarization system with an attention mechanism (Bahdanau et al., 2015), and they also introduce a dataset for headline generation derived from Gigaword. Subsequent models for this dataset were also supervised and mostly based on Seq2seq architectures (Nallapati et al., 2016; Chopra et al., 2016; Wang et al., 2019).
Recently, unsupervised approaches for sentence summarization have attracted increasing attention. Fevry and Phang (2018) learn a denoising autoencoder and control the summary length by a length embedding. Wang and Lee (2018) and Baziotis et al. (2019) use cycle-consistency (He et al., 2016) to learn the reconstruction of the source sentence and return the intermediate discrete representation as a summary. Zhou and Rush (2019) use beam search to optimize a scoring function, which considers language fluency and contextual matching.
Our work can be categorized under unsupervised sentence summarization. We accomplish this by word-level extraction from the source sentence.
Constrained Sentence Generation. Neural sentence generation is usually accomplished in an autoregressive way, for example, by recurrent neural networks generating words left-to-right. This is often enhanced by beam search (Sutskever et al., 2014), which keeps a beam of candidates in a partially greedy fashion. A few studies allow hard constraints on this decoding procedure. Hokamp and Liu (2017) use grid-beam search to impose lexical constraints during decoding. Anderson et al. (2017) propose constrained beam search to predict fixed image tags in an image transcription task. Miao et al. (2019) propose a Metropolis-Hastings sampler for sentence generation, where hard constraints can be incorporated into the target distribution. This is further extended to simulated annealing (Liu et al., 2020), or applied to the text simplification task (Kumar et al., 2020). Different from the above concurrent work, this paper applies the stochastic search framework to text summarization, and designs a specific search space and search actions for word extraction.
In previous work on text summarization, length embeddings (Kikuchi et al., 2016; Fan et al., 2018) have been used to indicate the desired summary length. However, these are not hard constraints, because the model may learn to ignore such information.

Proposed Model
Given a source sentence $x = (x_1, x_2, \ldots, x_n)$ as input, our goal is to generate a shorter sentence $y = (y_1, y_2, \ldots, y_m)$ as a summary of $x$. We perform word-level extraction while keeping the original word order intact. Thus, $y$ is a subsequence of $x$. Our word-level extraction optimizes a manually defined objective function $f(y; x, s)$, where the summary length $s$ is predefined ($s < n$) and not subject to optimization. In the remainder of this section, we describe the objective function, the search space, and the search algorithm in detail.

Search Objective
We define an objective function $f(y; x, s)$, which our algorithm maximizes. It evaluates the fitness of a candidate sentence $y$ as the summary of an input $x$, involving three aspects, namely, language fluency $f_{\overleftrightarrow{\mathrm{LM}}}(y)$, semantic similarity $f_{\mathrm{SIM}}(y; x)$, and a length constraint $f_{\mathrm{LEN}}(y, s)$. This is given by
$$f(y; x, s) = f_{\overleftrightarrow{\mathrm{LM}}}(y) \cdot f_{\mathrm{SIM}}(y; x)^{\gamma} \cdot f_{\mathrm{LEN}}(y, s),$$
where the relative weight $\gamma$ balances $f_{\overleftrightarrow{\mathrm{LM}}}(y)$ and $f_{\mathrm{SIM}}(y; x)$. We treat the summary length as a hard constraint, and therefore we do not need a weighting hyperparameter for $f_{\mathrm{LEN}}$.
Language Fluency. The language fluency scorer quantifies how grammatical and idiomatic a candidate summary y is. Our model generates a candidate summary in a non-autoregressive fashion, in contrast to the beam search in Zhou and Rush (2019). Thus, we are able to simultaneously consider forward and backward language models, using the geometric average of their perplexities. Using both forward and backward language models is less biased towards sentence beginnings or endings.
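Our fluency scorer is the inverse of this geometric average:
$$f_{\overleftrightarrow{\mathrm{LM}}}(y) = \left(\overrightarrow{\mathrm{PPL}}(y) \cdot \overleftarrow{\mathrm{PPL}}(y)\right)^{-1/2},$$
where $\overrightarrow{\mathrm{PPL}}(y)$ and $\overleftarrow{\mathrm{PPL}}(y)$ denote the perplexity of $y$ under the forward and backward language models, respectively.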
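The fluency scorer described above can be sketched as follows. This is a minimal illustration rather than the released implementation: it assumes each language model has already produced per-token natural-log probabilities for the candidate summary, and the function names `perplexity` and `fluency_score` are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of one sentence."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def fluency_score(forward_log_probs, backward_log_probs):
    """Inverse of the geometric mean of forward and backward perplexities.

    Both arguments are assumed to come from separately trained forward and
    backward language models scoring the same candidate summary.
    """
    ppl_forward = perplexity(forward_log_probs)
    ppl_backward = perplexity(backward_log_probs)
    return 1.0 / math.sqrt(ppl_forward * ppl_backward)
```

A lower perplexity under either model raises the score, so a candidate has to read well in both directions to rank highly.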
Depending on applications, the language models could be pretrained on a target corpus. In this case, the fluency scorer also measures whether the summary style is consistent with the target language. This could be important in certain applications, e.g., headline generation, where the summary language differs from the input in style.
Semantic Similarity. A semantic similarity scorer ensures that the summary keeps the key information of the input sentence. We adopt the cosine similarity between sentence embeddings as $f_{\mathrm{SIM}}(y; x) = \cos(e(x), e(y))$, where $e$ is a sentence embedding method. In our work, we use unigram word embeddings learned by the sent2vec model (Pagliardini et al., 2018). Then, $e(x)$ is computed as the average of these unigram embeddings, weighted by the inverse document frequency (idf) of the words. We use sent2vec because it is trained in an unsupervised way on individual sentences. By contrast, other unsupervised methods like SiameseCBOW (Kenter et al., 2016) or BERT (Devlin et al., 2019) use adjacent sentences as part of the training signal.
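As an illustration of the similarity scorer, the sketch below computes an idf-weighted average of unigram vectors and the cosine between two such sentence embeddings. It does not call the sent2vec library: `word_vectors` and `idf` are plain dictionaries standing in for trained embeddings and corpus statistics, and skipping out-of-vocabulary words and defaulting to an idf of 1.0 are assumptions of this sketch.

```python
import math


def sentence_embedding(tokens, word_vectors, idf):
    """idf-weighted average of unigram vectors (unknown words are skipped)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vectors:
            continue
        weight = idf.get(tok, 1.0)  # assumed default weight for missing idf
        for k, value in enumerate(word_vectors[tok]):
            total[k] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_score(source_tokens, summary_tokens, word_vectors, idf):
    """cos(e(x), e(y)) for source x and candidate summary y."""
    e_x = sentence_embedding(source_tokens, word_vectors, idf)
    e_y = sentence_embedding(summary_tokens, word_vectors, idf)
    return cosine(e_x, e_y)
```

Because the idf weights down-weight frequent words, dropping a determiner from the summary changes the embedding far less than dropping a rare content word.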
Length Constraint. Our discrete search approach is able to impose the output length as a hard constraint, allowing the model to generate summaries of any given length. Suppose the desired output length is $s$; then our length scorer is
$$f_{\mathrm{LEN}}(y, s) = \begin{cases} 1, & \text{if } |y| = s, \\ -\infty, & \text{otherwise.} \end{cases}$$
In other words, a candidate summary $y$ is infeasible if it does not satisfy the length constraint.
In practice, we implement this hard constraint by searching among feasible solutions only.
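The three scorers can be combined as in the following sketch. It assumes a multiplicative combination in which γ is an exponent on the similarity term and the length scorer is 1 for feasible candidates and −∞ otherwise; this combination, the clamping of negative similarities to zero, and the function names are assumptions made for illustration.

```python
NEG_INF = float("-inf")


def length_score(summary_tokens, target_length):
    """Hard length constraint: 1 if feasible, -inf otherwise."""
    return 1.0 if len(summary_tokens) == target_length else NEG_INF


def objective(fluency, similarity, summary_tokens, target_length, gamma=12.0):
    """Assumed form: f = f_LM * f_SIM**gamma * f_LEN.

    `fluency` and `similarity` are precomputed scorer outputs for the
    candidate; negative similarities are clamped to 0 (an assumption).
    """
    if length_score(summary_tokens, target_length) == NEG_INF:
        return NEG_INF
    return fluency * max(similarity, 0.0) ** gamma
```

Since the search only ever visits candidates of the required length, the infeasible branch acts as a safety check rather than as part of the optimization.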

Search Space
Most sentence generation models choose a word from the vocabulary at each time step, such as autoregressive generation that predicts the next word (Sutskever et al., 2014; Rush et al., 2015), and edit-based generation with deletion or insertion operations (Miao et al., 2019; Dong et al., 2019). In these cases, the search space is $|\mathcal{V}|^s$, given a vocabulary $\mathcal{V}$ and a summary length $s$. However, reference summaries are highly extractive. In the headline generation dataset (Rush et al., 2015), for example, 45% of the words in the reference summary also appear in the source sentence. This yields a ceiling of 45 ROUGE-1 F1 points for a purely extractive method, which is higher than the current state-of-the-art supervised abstractive result of 39 points (Wang et al., 2019). We are thus motivated to propose our word-extraction approach that extracts a subsequence of the input as the summary. Additionally, we arrange the words in the same order as the input, motivated by the monotonicity assumption in summarization (Raffel et al., 2017).
Formally, we define the search space as $a = (a_1, \ldots, a_n) \in \{0, 1\}^n$, where $n$ is the length of the input sentence $x$. The vector $a$ is a Boolean filter over the source words $x$. The summary sequence can then be represented by $y = x_a$, i.e., we sequentially extract words from the source sequence $x$ by the Boolean vector $a$. If $a_i = 1$, then $x_i$ is extracted for the summary; otherwise, it is not.
Further, we only consider the search space of all feasible solutions $\{a : f(x_a; x, s) > -\infty\}$. That is to say, the candidate summary has to satisfy the length constraint in Section 3.1. Equivalently, the output length can be expressed by a constraint on the search space such that $\sum_i a_i = s$.
The above restrictions reduce the search space to $\binom{n}{s}$ solutions. In a realistic setting, our search space is much smaller than that of generating words from the entire vocabulary.
Algorithm 1: First-Choice Hill Climbing. Input: objective function $f(y; x, s)$, source sentence $x$, summary length $s$, number of steps $T$, initial random solution $a_0$, neighbor function $q(a' \mid a)$.
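To make the search space concrete, the sketch below applies a Boolean filter to a tokenized source sentence, checks feasibility, and compares the number of feasible extractive solutions with the size of a vocabulary-based search space. The helper names and the vocabulary size of 50,000 in the usage note are illustrative assumptions.

```python
import math


def extract(source_tokens, mask):
    """y = x_a: keep x_i wherever a_i == 1, preserving source order."""
    return [tok for tok, a in zip(source_tokens, mask) if a == 1]


def is_feasible(mask, target_length):
    """A mask is feasible if it selects exactly `target_length` words."""
    return sum(mask) == target_length


def search_space_sizes(n, s, vocab_size):
    """(number of feasible masks, number of length-s vocabulary sequences)."""
    return math.comb(n, s), vocab_size ** s
```

For a 30-word source and an 8-word summary, `search_space_sizes(30, 8, 50000)` gives 5,852,925 feasible masks, against 50000^8 sequences when every position may hold any vocabulary word.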

Search Algorithm
We optimize our objective function f (y; x, s) by first-choice hill climbing (FCHC, Russell and Norvig, 2016). This is a stochastic optimization algorithm that proposes a candidate solution by local change at every search step. The candidate is accepted if it is better than the current solution. Otherwise, the algorithm keeps the current solution. FCHC maximizes the objective function in a greedy fashion and yields a (possibly local) optimum.
Algorithm 1 shows the optimization procedure of our FCHC. For each search step, a new candidate is sampled from the neighbor function $q(a' \mid a)$. This is accomplished by randomly swapping two actions $a_i$ and $a_j$ with $a_i \neq a_j$, i.e., replacing a word in the summary with a word from the source sentence that is not in the current summary. The order of selected words is kept as in the source sentence. If the candidate solution achieves a higher score, then it is accepted. Otherwise, the candidate is rejected and the algorithm proceeds with the current solution. Our search terminates when it exceeds a predefined budget. The last solution is returned as the summary, as it is also the best-scored candidate due to our greedy algorithm.
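The procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: `score_fn` is a hypothetical callable that maps a list of summary tokens to an objective value, and the sketch requires 0 < s < n so that a swap is always possible.

```python
import random


def extract(source_tokens, mask):
    """Apply the Boolean filter: keep x_i wherever a_i == 1, in source order."""
    return [tok for tok, a in zip(source_tokens, mask) if a == 1]


def random_mask(n, s, rng):
    """Initial solution a_0: s randomly chosen positions out of n."""
    chosen = set(rng.sample(range(n), s))
    return [1 if i in chosen else 0 for i in range(n)]


def propose_swap(mask, rng):
    """Neighbor function: swap one selected and one non-selected position."""
    ones = [i for i, a in enumerate(mask) if a == 1]
    zeros = [i for i, a in enumerate(mask) if a == 0]
    i, j = rng.choice(ones), rng.choice(zeros)
    neighbor = list(mask)
    neighbor[i], neighbor[j] = 0, 1
    return neighbor


def first_choice_hill_climbing(source_tokens, s, score_fn, steps, rng):
    """Run one FCHC search; returns (mask, score) of the final solution."""
    mask = random_mask(len(source_tokens), s, rng)
    score = score_fn(extract(source_tokens, mask))
    for _ in range(steps):
        candidate = propose_swap(mask, rng)
        candidate_score = score_fn(extract(source_tokens, candidate))
        if candidate_score > score:  # accept only strict improvements
            mask, score = candidate, candidate_score
    return mask, score
```

Because every swap exchanges a 1 and a 0, the number of selected words never changes, which is how the hard length constraint is enforced without an explicit check.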
One main potential drawback of hill climbing algorithms is that they may get stuck in a local optimum. To alleviate this problem, we restart the algorithm with multiple random initial word selections $a_0$ and return the overall best solution. We set the number of restarts to $\beta_R \cdot n s^2$ and the number of search steps to $\beta_T \cdot n s^2$, where $\beta_R$ and $\beta_T$ are controlling hyperparameters. We design the formula to encourage more search for longer input sentences, but only with a tractable growth: linear in the input length and quadratic in the summary length. As the summary length is usually much smaller than the input length, quadratic search is possible. Increasing the number of restarts (and search steps) monotonically improves the scoring function, and thus in practice can be set according to the available search budget.
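The budget formula and the restart strategy can be illustrated as below. Rounding to the nearest integer and enforcing a minimum of one restart and one step are assumptions of this sketch, as is the `run_once` callable, which stands for a single hill-climbing run returning a `(solution, score)` pair.

```python
def search_budget(n, s, beta_r=0.035, beta_t=0.1):
    """Restarts and steps proportional to n * s**2 (rounding is assumed)."""
    scale = n * s ** 2
    restarts = max(1, round(beta_r * scale))
    steps = max(1, round(beta_t * scale))
    return restarts, steps


def best_of_restarts(run_once, restarts):
    """Call `run_once` repeatedly and keep the highest-scoring result."""
    best = None
    for _ in range(restarts):
        candidate = run_once()  # expected to return (solution, score)
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best
```

For a 30-word source and an 8-word summary, `search_budget(30, 8)` yields 67 restarts of 192 steps each under these rounding assumptions.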
Other discrete optimization algorithms can be explored for sentence generation, such as simulated annealing (Liu et al., 2020) and genetic algorithms. Our analysis on short sentences (where exhaustive search is tractable) showed that hill climbing with restarts achieves ROUGE scores similar to exhaustive search (Section 5.4).

Evaluation Framework
In this section, we will describe the datasets, evaluation metrics, and a widely used baseline (called Lead). Additionally, we report the observation that the commonly used evaluation metric, ROUGE F1, is sensitive to summary length, preferring longer summaries. Thus, we propose to group models with similar output length during evaluation for fair comparison.

Datasets
We evaluate our models on the dataset provided for DUC2004 Task 1 (Over and Yen, 2004) and a headline generation corpus 5 (Rush et al., 2015), both widely adopted in the summarization literature.
The DUC2004 dataset is designed and used for testing only. It consists of 500 news articles, each paired with four human-written summaries. We follow Rush et al. (2015) and adapt DUC2004 for sentence summarization by using only the first sentence of an article as input. The reference summaries are around 10 words long on average.
The headline generation dataset (Rush et al., 2015) is derived from the Gigaword news corpus. Each headline/title is viewed as the reference summary of the first sentence of an article. The dataset contains 3.8M training instances and 1951 test instances. The average headline contains ∼8 words; the average source sentence contains ∼30 words. We use 500 held-out validation instances for hyperparameter tuning. Note that the training set is only used to train a language model and sent2vec embeddings. The summarization process itself is not trained in our approach.
5 https://github.com/harvardnlp/NAMAS

Lead Baselines
Lead baselines are strong competitors that extract the first few characters or words of the input sentence. The DUC2004 shared task includes a Lead baseline, which extracts the first 75 characters as the summary. We call it Lead-C-75. For the Gigaword dataset, the reference has 8 words on average, and it is common to compare with a Lead variant that chooses the first 8 words. We call this baseline Lead-N-n when we choose n words. For fair comparison with previous work (Baziotis et al., 2019; Fevry and Phang, 2018) in Section 5.2, we further introduce a new variant that returns the first p percent of source words as the summary. We denote this baseline by Lead-P-p.
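The three Lead variants are simple enough to sketch directly. The sketch assumes whitespace tokenization; the rounding rule in `lead_p` and the minimum of one word are assumptions, and `lead_c` may cut a word in half because it truncates by characters.

```python
def lead_c(sentence, num_chars=75):
    """Lead-C: the first `num_chars` characters of the source sentence."""
    return sentence[:num_chars]


def lead_n(sentence, n=8):
    """Lead-N-n: the first n whitespace-separated words."""
    return " ".join(sentence.split()[:n])


def lead_p(sentence, p=50):
    """Lead-P-p: the first p percent of the source words (rounded)."""
    tokens = sentence.split()
    k = max(1, round(len(tokens) * p / 100))
    return " ".join(tokens[:k])
```

Lead-N gives every summary the same length, whereas Lead-P scales with the source sentence, which is why each one is the natural baseline for a different length bracket.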

ROUGE Scores
Summarization systems are commonly evaluated by ROUGE scores (Lin, 2004). The ROUGE-1 (or ROUGE-2) score computes the unigram (or bigram) overlap of a generated summary and the reference. ROUGE-L calculates the longest common subsequence. Depending on the dataset, either ROUGE Recall or ROUGE F1 variant is adopted. Since the ROUGE Recall metric is not normalized with regard to length, DUC2004 standard evaluation truncates the summary at 75 characters. This procedure was also adopted by Rush et al. (2015) for the headline generation task, but later Chopra et al. (2016) proposed to report the "more balanced" ROUGE F1 metric for the Gigaword headline generation dataset and abandoned truncation. We follow previous work and use ROUGE F1 for headline generation and truncated ROUGE Recall for DUC2004.
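For intuition, n-gram overlap precision, recall, and F1 can be computed as in the sketch below. It is a simplified stand-in for the official ROUGE toolkit (no stemming, stopword handling, or multiple references) and assumes whitespace tokenization.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of clipped n-gram overlap."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped counts via multiset min
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Applied to the example from Figure 1, the summary "bhp billiton dropping hostile bid for rio tinto" shares five unigrams with the reference "bhp billiton drops rio tinto takeover bid", giving a precision of 5/8, a recall of 5/7, and an F1 of 2/3. Recall can only grow as more source words are added, which is why Recall is paired with truncation.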

Summary Length
As mentioned, ROUGE F1 was introduced to the evaluation of sentence summarization to better compare models with different output lengths (Chopra et al., 2016; Nallapati et al., 2016). To investigate the effect of summary length on ROUGE F1, we calculate ROUGE F1 scores for the Lead-N-n and Lead-P-p baselines with different length parameters. Figure 2 shows that ROUGE F1 peaks at n ≈ 18 or p ≈ 50. The difference between the maximum performance at n ≈ 18 and the widely adopted baseline (Lead-N-8) is large: 4.2 ROUGE-1 F1 points. A similar effect is observed by Sun et al. (2019) for document summarization. This shows that ROUGE F1 is still sensitive to summary length, and this effect should be considered during evaluation. We propose to report the average output length of a model and only compare models in the same length bracket.

Setup
We conduct experiments with two settings, depending on how the scorers $f_{\overleftrightarrow{\mathrm{LM}}}$ and $f_{\mathrm{SIM}}$ are trained. In the first setting, we train the language model and sent2vec embeddings on the source (article) side of the Gigaword headline generation dataset. This complies with Fevry and Phang (2018) and Baziotis et al. (2019). In the second setting, we train the language model and sent2vec embeddings on the target (title) side like Zhou and Rush (2019). In both settings, we do not need parallel source-target pairs.
For output length, our headline generation experiment sets the desired target length as 8 words, 10 words, and 50% of the input, as these mirror either the average reference summary length or the average output lengths of our competitors (Wang and Lee, 2018; Zhou and Rush, 2019; Fevry and Phang, 2018; Baziotis et al., 2019). For DUC2004, the desired summary length is set to 13 words, because the standard evaluation script truncates after the first 75 characters (roughly 13 words) in the summary.
Our forward and backward language models use long short-term memory units (Hochreiter and Schmidhuber, 1997) and are optimized for 50 epochs by stochastic gradient descent. Embedding and hidden sizes are set to 1024 dimensions.
We tune hyperparameters on the development data of the headline corpus, and set the weighting parameter $\gamma$ to 12 for all models. The search steps and restarts are set to $\beta_T = 0.1$ and $\beta_R = 0.035$, respectively. We see a sharp performance improvement when we do more searching. Thus, we choose $\beta_T$ and $\beta_R$ at the critical values due to efficiency concerns.

Competing Models
Besides the Lead baselines discussed in Section 4.2, we compare our models with state-of-the-art unsupervised sentence summarization systems.
Wang and Lee (2018) use cycle-consistency to reconstruct source sentences from the headline generation corpus (Rush et al., 2015). The latent discrete representation, learned to be similar to (non-parallel) headlines, is used as the summary.
Zhou and Rush (2019) optimize an objective function involving language fluency and contextual matching. Their language modeling scorer is trained on headlines of the Gigaword training set; their contextual matching scorer is based on ELMo embeddings (Peters et al., 2018) trained with the Billion Word corpus (Chelba et al., 2013). Their summary length is controlled by a soft length penalty during beam search.
Fevry and Phang (2018) learn a denoising autoencoder (Vincent et al., 2008) to reconstruct source sentences of the Gigaword training set. Summary length is set to 50% of the input length and is controlled by length embeddings in the decoder.
Baziotis et al. (2019) propose SEQ$^3$, which uses cycle-consistency to reconstruct source sentences from the Gigaword training set. The length is also set to 50% of the input length, controlled by length embeddings in the intermediate decoder.
For the DUC2004 dataset, TOPIARY (Zajic et al., 2004) is the winning system in the competition. They shorten the sentence by rule-based syntax-tree trimming (Dorr et al., 2003), but enhance the resulting summary with learned topics.
Table caption (fragment): billion: the Billion Word corpus (Chelba et al., 2013); twitter: the Twitter corpus (Pagliardini et al., 2018). Our hill-climbing (HC) approaches are named in the format HC data outputLength.

Results
Results for Headline Generation. We first compare with Lead-N-8 (Group A, Table 1). This is a standard baseline in previous work, because the average reference summary contains eight words. Unfortunately, none of the previous papers consider output length during evaluation, making comparisons between their (longer) output summaries and the Lead-N-8 baseline unfair, as discussed in Section 4.4. Our approach, which explicitly controls summary length, considerably outperforms the Lead-N-8 baseline in a fair setting.
Next, we compare with state-of-the-art unsupervised methods, whose output summary has roughly 10 words on average (Group B). In this case, we set our hard length constraint as 10 and include the Lead-N-10 baseline for comparison. Trained on the title side only, our HC title 10 model outperforms these competing methods in all ROUGE F1 scores. In particular, Zhou and Rush (2019) use the target side to train the language model, plus the Billion Word Corpus to pretrain embeddings used in the contextual matching scorer. With the same extra corpus to pretrain our sent2vec embeddings, our HC title+billion 10 variant achieves even better performance, outperforming Zhou and Rush (2019) by 2.32 ROUGE-1 and 1.41 ROUGE-L points.
The Billion Word Corpus, however, includes complete articles, which implicitly yields unaligned parallel data. This could be inappropriate for an unsupervised method. Thus, we further train sent2vec embeddings on the Twitter corpus by Pagliardini et al. (2018). The HC title+twitter 10 variant also performs better than HC title 10 and other competitors.
In Group C, we compare with the models whose summaries have an average length of 50% of the input sentence. We set our desired target length to 50% as well, and include the Lead-P-50 baseline. Previous studies report a performance improvement over the Lead-N-8 baseline, but in fact, Table 1 shows that they do not outperform the appropriate Lead baseline Lead-P-50. Our model is the only unsupervised summarization system that outperforms the Lead-P-50 baseline on this dataset, even though it is trained solely on the article side.
Note that our models trained on the title side (HC title) consistently outperform those trained on the article side (HC article). This is not surprising because the former can generate headlines from the learned target distribution. This shows the importance of learning a summary language model even if we do not have supervision of parallel source-target data.
Results for DUC2004. Table 2 shows the results on the DUC2004 data. As this dataset is for test only, we directly transfer the models HC article and HC title from the headline generation corpus with the same hyperparameters (except for length). As shown in the table, we outperform all previous methods and the Lead-C-75 baseline. The results are consistent with Table 1, showing the generalizability of our approach.
Human Evaluation. We conduct human evaluation via pairwise comparison of system outputs, in the same vein as West et al. (2019). The annotator sees the source sentence along with the headline generated by our system and a competing method, presented in random order. The annotator is asked to compare the fidelity and fluency of the two systems, choosing among three options: (i) the first headline is better, (ii) the second headline is better, and (iii) both headlines are equally good/bad. This task is repeated for 100 instances with 5 annotators each. The final label is selected by majority voting. The inter-annotator agreement (Krippendorff's alpha) is 0.25 when our model is compared with Wang and Lee (2018) and 0.17 with Zhou and Rush (2019).
We report the aggregated score of our system in Table 3. For each sample, we count 1 point if our model wins, 0 points if it ties, and -1 point if it loses. The points are normalized by the number of samples. The results show an advantage of our model over Wang and Lee (2018), especially in fluency. Our model is also on par with Zhou and Rush (2019). Note again that we achieve this with less data.
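The aggregation of annotator judgments can be sketched as follows. The label strings and the rule that a split vote with no unique winner counts as a tie are assumptions of this sketch; the text only states that majority voting is used.

```python
from collections import Counter

POINTS = {"win": 1, "tie": 0, "lose": -1}


def majority_label(votes):
    """Most frequent label among annotators; no unique winner counts as 'tie'."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"  # assumed tie-breaking rule
    return ranked[0][0]


def aggregated_score(votes_per_sample):
    """Average of +1 / 0 / -1 points over all samples."""
    labels = [majority_label(votes) for votes in votes_per_sample]
    return sum(POINTS[label] for label in labels) / len(labels)
```

The resulting score lies in [-1, 1], where a positive value means the system wins more pairwise comparisons than it loses.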

Analysis
In this section, we conduct an in-depth analysis of our model, based on HC title 10 for headline generation.
Search Objective. Table 4 provides an ablation study on our objective function. It shows that both language fluency and semantic similarity play a role in measuring the quality of a summary. The bi-directional language model is also slightly better than a uni-directional language model.
Table 3 (caption fragment): (Wang and Lee, 2018) and ZR (Zhou and Rush, 2019), in terms of average score of fidelity and fluency: 1 (wins), 0 (ties), and -1 (loses).
Search Algorithm. In Figure 3, we compare our FCHC with the theoretical optimum on short sentences where exhaustive search is tractable. For only 3% of the instances with source sentence length between 25 and 30 words, our FCHC algorithm does not find the global optimum. In 21% of those cases, the better objective score leads to a higher ROUGE-L score. This shows that FCHC with restarts is a powerful enough search algorithm for word extraction-based sentence summarization.
Positional Bias. We analyze the positional bias of each algorithm by plotting the normalized frequency of extracted words within four different areas of the source sentence. As shown in Figure 4, the extraction positions of words in the reference headlines are slightly skewed towards the beginning of the source sentence. Our hill-climbing algorithm performs distributed edits over the sentence, which is reflected in the flatter graph across the source sentence areas. By contrast, beam search (Zhou and Rush, 2019) is more biased towards the first quarter of the source sentence. Cycle consistency models (Wang and Lee, 2018; Baziotis et al., 2019) show a strong bias towards the first half of the source sentence. We suspect that the reconstruction decoder is easily satisfied with the beginning of the source sentence as the discrete latent variable, because of its autoregressive decoding.
Figure 4: Positional bias for different systems, calculated for the headline generation test set. The source sentence is divided into four areas: 0-25%, 25-50%, 50-75%, and 75-100% of the sentence. The y-axis shows the normalized frequency of how often a word in the summary is extracted from one of the four source sentence areas.
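For an extractive system, the positional statistic described above can be computed directly from the Boolean selection vectors, as in this sketch. It assumes one 0/1 mask per test sentence; for abstractive systems, summary words would first have to be aligned to source positions, which is not covered here.

```python
def positional_histogram(masks, num_areas=4):
    """Normalized frequency of selected words per relative source area.

    Each mask is a 0/1 list over the words of one source sentence; position
    i of an n-word sentence falls into area floor(i * num_areas / n).
    """
    counts = [0] * num_areas
    for mask in masks:
        n = len(mask)
        for i, a in enumerate(mask):
            if a:
                counts[i * num_areas // n] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * num_areas
    return [c / total for c in counts]
```

A system with no positional preference would produce a histogram close to uniform, while a Lead-style system would put nearly all of its mass in the first area.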
Case Study. We show example summaries generated by our system in Figure 5. We see that the HC title models indeed learn the style of headlines, known as headlinese. As shown, HC title often uses simple tense and drops articles (e.g., "a" and "the"). The summaries generated by HC article tend to waste word slots by including an uninformative determiner.