Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning

While neural sequence learning methods have made significant progress in single-document summarization (SDS), they produce unsatisfactory results on multi-document summarization (MDS). We observe two major challenges when adapting SDS advances to MDS: (1) MDS involves a larger search space and yet more limited training data, which hinders neural methods from learning adequate representations; (2) MDS needs to resolve higher information redundancy among the source documents, which SDS methods are less effective at handling. To close the gap, we present RL-MMR, Maximal Marginal Relevance-guided Reinforcement Learning for MDS, which unifies advanced neural SDS methods and statistical measures used in classical MDS. RL-MMR casts MMR guidance on fewer promising candidates, which restrains the search space and thus leads to better representation learning. Additionally, the explicit redundancy measure in MMR helps the neural representation of the summary to better capture redundancy. Extensive experiments demonstrate that RL-MMR achieves state-of-the-art performance on benchmark MDS datasets. In particular, we show the benefits of incorporating MMR into end-to-end learning when adapting SDS to MDS, in terms of both learning effectiveness and efficiency.


Introduction
Text summarization aims to produce condensed summaries covering salient and non-redundant information in the source documents. Recent studies on single-document summarization (SDS) have made great progress by benefiting from advances in neural sequence learning (Chen and Bansal, 2018; Narayan et al., 2018) as well as pretrained language models (Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2020). However, in multi-document summarization (MDS) tasks, neural models still face challenges and often underperform classical statistical methods built upon handcrafted features (Kulesza and Taskar, 2012).
We observe two major challenges when adapting advanced neural SDS methods to MDS: (1) Large search space. MDS aims at producing summaries from multiple source documents, which exceeds the capacity of neural SDS models and hinders the learning of adequate representations, especially considering that MDS labeled data is more limited. For example, there are 287K training samples (687 words on average) on the CNN/Daily Mail SDS dataset and only 30 on the DUC 2003 MDS dataset (6,831 words). (2) High redundancy. In MDS, the same statement or even the same sentence can spread across different documents. Although SDS models adopt attention mechanisms as implicit measures to reduce redundancy (Chen and Bansal, 2018), they fail to handle the much higher redundancy of MDS effectively (Sec. 4.2.3).
There have been attempts to solve the aforementioned challenges in MDS. Regarding the large search space, prior studies (Zhang et al., 2018) perform sentence filtering using a sentence ranker and only keep the top-ranked K sentences. However, such a hard cutoff of the search space leaves these approaches insufficient in their exploration of the (already scarce) labeled data and limited by the ranker, since most sentences are discarded even though some of them are important and could have been favored. As a result, although these studies perform better than directly applying their base SDS models (Tan et al., 2017) to MDS, they do not outperform state-of-the-art MDS methods (Gillick and Favre, 2009; Kulesza and Taskar, 2012). Regarding the high redundancy, various redundancy measures have been proposed, including heuristic postprocessing such as counting new bi-grams (Cao et al., 2016) and cosine similarity (Hong et al., 2014), or dynamic scoring that compares each source sentence with the current summary, as in Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998). Nevertheless, these methods still use lexical features without semantic representation learning. One extension (Cho et al., 2019) of these studies uses capsule networks to improve redundancy measures. However, its capsule networks are pre-trained on SDS and fixed as feature inputs of classical methods without end-to-end representation learning.
In this paper, we present a deep RL framework, MMR-guided Reinforcement Learning (RL-MMR) for MDS, which unifies advances in SDS and one classical MDS approach, MMR (Carbonell and Goldstein, 1998), through end-to-end learning. RL-MMR addresses the MDS challenges as follows: (1) RL-MMR overcomes the large search space through soft attention. Compared to hard cutoff, our soft attention favors top-ranked candidates of the sentence ranker (MMR). However, it does not discard low-ranked ones, as the ranker is imperfect, and sentences ranked low may also contribute to a high-quality summary. Soft attention restrains the search space while allowing more exploration of the limited labeled data, leading to better representation learning. Specifically, RL-MMR infuses the entire prediction of MMR into its neural module by attending to (restraining itself to) important sentences and downplaying the rest instead of completely discarding them. (2) RL-MMR resolves the high redundancy of MDS in a unified way: the explicit redundancy measure in MMR is incorporated into the neural representation of the current state, and the two modules are coordinated by RL reward optimization, which encourages non-redundant summaries.
We conduct extensive experiments and ablation studies to examine the effectiveness of RL-MMR. Experimental results show that RL-MMR achieves state-of-the-art performance on the DUC 2004 (Paul and James, 2004) and TAC 2011 (Owczarzak and Dang, 2011) datasets (Sec. 4.2.1). A comparison between various combination mechanisms demonstrates the benefits of soft attention in the large search space of MDS (Sec. 4.2.2). In addition, ablation and manual studies confirm that RL-MMR is superior to applying either RL or MMR alone to MDS, and that MMR guidance is effective for redundancy avoidance (Sec. 4.2.3).

Contributions.
(1) We present an RL-based MDS framework that combines the advances of classical MDS and neural SDS methods via end-to-end learning.
(2) We show that our proposed soft attention is better than the hard cutoff of previous methods for learning adequate neural representations. Also, infusing the neural representation of the current summary with explicit MMR measures significantly reduces summary redundancy. (3) We demonstrate that RL-MMR achieves new state-of-the-art results on benchmark MDS datasets.

Problem Formulation
We define D = {D_1, D_2, ..., D_N} as a set of documents on the same topic. Each document set D is paired with a set of (human-written) reference summaries R. For convenience of notation, we denote the j-th sentence in D as s_j when concatenating the documents in D. We focus on extractive summarization, where a subset of the sentences in D is extracted as the system summary E. A desired system summary E covers salient and non-redundant information in D. E is compared with the reference summaries R for evaluation.

The RL-MMR Framework
Overview. At a high level, RL-MMR infuses MMR guidance into end-to-end training of the neural summarization model. RL-MMR uses hierarchical encoding to efficiently encode the sentences in multiple documents and obtain the neural sentence representation A_j. RL-MMR models salience by combining MMR and the sentence representation A_j, and measures redundancy by infusing MMR with the neural summary representation z_t, which together form the state representation g_t. At each time step, one sentence is extracted based on the MMR-guided sentence representation and state representation and compared with the reference; the result (reward) is then back-propagated for the learning of both the neural representation and MMR guidance. An illustration of RL-MMR is shown in Fig. 1.

Figure 1: An overview of the proposed MDS framework RL-MMR. The neural sentence representation A_j is obtained through a sentence-level convolutional encoder and a document-level bi-LSTM encoder. MMR guidance is incorporated into the neural sentence representation A_j and the state representation g_t through soft attention and learned end-to-end through reward optimization.
In the following, we first describe MMR and the neural sentence representation. We then introduce the neural sentence extraction module and how we incorporate MMR guidance into it for better MDS performance. Finally, we illustrate how neural representation and MMR guidance are jointly learned via end-to-end reinforcement learning.

Maximal Marginal Relevance
Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) is a general summarization framework that balances summary salience and redundancy. Formally, MMR defines the score of a sentence s_j at time t as

m_j^t = λ S(s_j, D) − (1 − λ) max_{e ∈ E_t} R(s_j, e),

where λ ∈ [0, 1] is the weight balancing salience and redundancy. S(s_j, D) measures how salient a sentence s_j is, estimated by the similarity between s_j and D; it does not change during the extraction process. E_t consists of the sentences already extracted before time t. The term max_{e ∈ E_t} R(s_j, e) measures the redundancy between s_j and each extracted sentence e and finds the most redundant pair; it is updated as the size of E_t grows. Intuitively, if s_j is similar to any sentence e ∈ E_t, it is deemed redundant and less favored by MMR. There are various options for the choices of S(s_j, D) and R(s_j, e), which we compare in Sec. 4.2.4.
We denote the (index of the) sentence extracted at time t as e_t. MMR greedily extracts one sentence at a time according to the MMR score: e_t = argmax_{s_j ∈ D \ E_t} m_j^t. Heuristic and deterministic algorithms like MMR are rather efficient and work reasonably well in some cases. However, they lack holistic modeling of summary quality and the capability of end-to-end representation learning.
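As a concrete reference, the greedy MMR loop can be sketched in a few lines of Python. The Jaccard word-overlap used below for S(·) and R(·) is a toy stand-in, not the paper's choice (the experiments use TF-IDF cosine, compared in Sec. 4.2.4); any pair of similarity functions can be plugged in.

```python
# Sketch of greedy MMR extraction (Carbonell and Goldstein, 1998).
# `salience` and `redundancy` stand in for S(s_j, D) and R(s_j, e);
# the word-overlap defaults are illustrative toy measures only.

def word_overlap(a, b):
    """Jaccard overlap between two sentences (whitespace-tokenized)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def mmr_extract(sentences, doc_text, k, lam=0.6,
                salience=None, redundancy=None):
    """Greedily pick k sentences maximizing
    m_j = lam * S(s_j, D) - (1 - lam) * max_{e in E} R(s_j, e)."""
    salience = salience or (lambda s: word_overlap(s, doc_text))
    redundancy = redundancy or word_overlap
    extracted, remaining = [], list(range(len(sentences)))
    for _ in range(min(k, len(sentences))):
        def score(j):
            red = max((redundancy(sentences[j], sentences[e])
                       for e in extracted), default=0.0)
            return lam * salience(sentences[j]) - (1 - lam) * red
        best = max(remaining, key=score)
        extracted.append(best)
        remaining.remove(best)
    return [sentences[j] for j in extracted]
```

Setting lam=1 degenerates to pure salience ranking, illustrating why the redundancy term matters once near-duplicate sentences appear across documents.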

Neural Sentence Representation
To embody end-to-end representation learning, we leverage advances in neural sequence learning methods for SDS. Unlike prior studies on adapting SDS to MDS, which concatenate all the documents chronologically and encode them sequentially, we adopt hierarchical encoding for better efficiency and scalability. Specifically, we first encode each sentence s_j via a CNN (Kim, 2014) to obtain its sentence representation. We then separately feed the sentence representations in each document D_i to a bi-LSTM (Huang et al., 2015). The bi-LSTM generates a contextualized representation for each sentence s_j, denoted by h_j. We form an action matrix A using h_j, where the j-th row A_j corresponds to the j-th sentence (s_j) in D. A pseudo sentence indicating the STOP action, whose representation is randomly initialized, is also included in A, and sentence extraction is finalized when the STOP action is taken (Mao et al., 2018, 2019).

Neural Sentence Extraction
We briefly describe the SDS sentence extraction module (Chen and Bansal, 2018) that we base our work on, and elaborate in Sec. 3.4 on how we adapt it for better MDS performance with MMR guidance. The probability of neural sentence extraction is measured through a two-hop attention mechanism. Specifically, we first obtain the neural summary representation z_t by feeding the previously extracted sentences (A_{e_i}) to an LSTM encoder. A time-dependent state representation g_t that considers both the sentence representation A_j and the summary representation z_t is obtained by the glimpse operation (Vinyals et al., 2016):

a_j^t = v_1^T tanh(W_1 A_j + W_2 z_t),  α^t = softmax(a^t),  g_t = Σ_j α_j^t W_1 A_j,  (1)

where W_1, W_2, and v_1 are model parameters and a^t denotes the vector composed of the a_j^t. With z_t, the attention weights α_j^t are aware of previous extraction. Finally, the sentence representation A_j is attended again to estimate the extraction probability:
p_j^t = v_2^T tanh(W_3 A_j + W_4 g_t),  P(e_t = s_j) = softmax_j(p_j^t),  (2)

where W_3, W_4, and v_2 are model parameters and the previously extracted sentences {e_i} are excluded. The summary redundancy here is handled implicitly by g_t: supposedly, a redundant sentence s_j would receive a low attention weight α_j^t after comparing A_j and z_t in Eq. 1. However, we find such latent modeling insufficient for MDS due to its much higher degree of redundancy. For example, when news reports start with semantically similar sentences, using latent redundancy avoidance alone leads to repeated summaries (App. B, Table 8). Such observations motivate us to incorporate MMR, which models redundancy explicitly, to guide the learning of sentence extraction for MDS.
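The two-hop attention described above can be sketched in pure Python. The parameter packing, tiny dimensions, and identity-like weight matrices in the test are illustrative assumptions, not the model's learned parameters; masking previously extracted sentences with a large negative constant stands in for the −∞ exclusion.

```python
# Sketch of the two-hop attention for sentence extraction
# (after Chen and Bansal, 2018; Vinyals et al., 2016).
# W1, W2, W3, W4, v1, v2 stand in for learned parameters.
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def two_hop_attention(A, z, params, extracted):
    """A: list of sentence vectors, z: summary vector.
    Hop 1 (glimpse): a_j = v1 . tanh(W1 A_j + W2 z); g = sum_j alpha_j W1 A_j.
    Hop 2: p = softmax(v2 . tanh(W3 A_j + W4 g)), extracted indices masked out."""
    W1, W2, W3, W4, v1, v2 = params
    a = [sum(vi * math.tanh(h1 + h2) for vi, h1, h2 in
             zip(v1, matvec(W1, Aj), matvec(W2, z))) for Aj in A]
    alpha = softmax(a)
    g = [sum(alpha[j] * h for j, h in enumerate(col))
         for col in zip(*(matvec(W1, Aj) for Aj in A))]
    logits = [sum(vi * math.tanh(h1 + h2) for vi, h1, h2 in
                  zip(v2, matvec(W3, Aj), matvec(W4, g))) for Aj in A]
    logits = [-1e9 if j in extracted else l for j, l in enumerate(logits)]
    return softmax(logits)
```

Because z is re-encoded after each extraction, rerunning this computation at every step yields extraction-aware attention weights.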

MMR-guided Sentence Extraction
In this section, we describe several strategies for incorporating MMR into sentence extraction, which keep the neural representation for expressiveness while restraining the search space to fewer promising candidates for more adequate representation learning under limited training data. Hard Cutoff. One straightforward way of incorporating MMR guidance is to only allow extraction from the top-ranked sentences of MMR. We denote the sentence list ranked by the MMR scores m_j^t as M^t. Given p_j^t, the neural score of sentence extraction before softmax, we set the scores of the sentences after the first K sentences in M^t to −∞. In this way, the low-ranked sentences in MMR are never selected and thus never included in the extracted summary. We denote this variant as RL-MMR HARD-CUT.
There are two limitations of conducting hard cutoff in the hope of adequate representation learning: (L1) Hard cutoff ignores the values of the MMR scores and simply uses them to make binary decisions. (L2) While hard cutoff reduces the search space, the decision of the RL agent is limited as it cannot extract low-ranked sentences, and thus lacks exploration of the (already limited) training data. To tackle L1, a simple fix is to combine the MMR score m_j^t with the extraction probability measured by the neural sentence representation:
p̂_j^t = β FF(m_j^t) + (1 − β) p_j^t,  s_j ∈ M^t_{1:K},

where β ∈ [0, 1] is a constant and FF(·) is a two-layer feed-forward network that enables more flexibility than using raw MMR scores, compensating for the magnitude difference between the two terms. We denote this variant as RL-MMR HARD-COMB.
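A minimal sketch of the two hard variants, assuming `neural_logits` plays the role of the pre-softmax extraction scores p_j^t and using an identity placeholder for the learned FF(·) (an assumption for illustration only):

```python
# Sketch of the hard MMR-guidance variants: HARD-CUT masks everything
# outside the top-K MMR sentences; HARD-COMB additionally blends FF(m_j)
# with the neural score. `ff` is an identity placeholder for the learned
# feed-forward network (an assumption, not the trained layer).
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    return [x / sum(e) for x in e]

def top_k_indices(mmr_scores, k):
    return set(sorted(range(len(mmr_scores)),
                      key=lambda j: -mmr_scores[j])[:k])

def hard_cut(neural_logits, mmr_scores, k):
    """RL-MMR HARD-CUT: only the top-k MMR sentences stay selectable."""
    keep = top_k_indices(mmr_scores, k)
    return softmax([l if j in keep else -1e9
                    for j, l in enumerate(neural_logits)])

def hard_comb(neural_logits, mmr_scores, k, beta=0.5, ff=lambda m: m):
    """RL-MMR HARD-COMB: blend FF(m_j) with the neural score, still top-k only."""
    keep = top_k_indices(mmr_scores, k)
    return softmax([beta * ff(mmr_scores[j]) + (1 - beta) * l
                    if j in keep else -1e9
                    for j, l in enumerate(neural_logits)])
```

Note how both variants zero out the probability mass of low-ranked sentences, which is exactly the exploration limitation (L2) that motivates the soft variants below.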
Soft Attention. To deal with L2, we explore soft variants that do not completely discard the low-ranked sentences but encourage the extraction of top-ranked ones. The first variant, RL-MMR SOFT-COMB, removes the constraint s_j ∈ M^t_{1:K} from the combination above. This variant solves L2 but may re-expose the RL agent to L1, since its MMR module and neural module are loosely coupled, interacting only through a learnable layer in their combination.
Therefore, we design a second variant, RL-MMR SOFT-ATTN, which addresses both L1 and L2 by tightly incorporating MMR into neural representation learning via soft attention. Specifically, the MMR scores are first transformed and normalized, µ^t = softmax(FF(m^t)), and then used to attend the neural sentence representation A_j before the two-hop attention: Ã_j = µ_j^t A_j. The state representation g_t, which captures summary redundancy, is also impacted by MMR through the attention between the summary representation z_t and the MMR-guided sentence representation Ã_j in Eq. 1. L1 is addressed because µ^t represents the extraction probability estimated by MMR. L2 is resolved because the top-ranked sentences in MMR receive high attention, which empirically is enough to restrain the decision of the RL agent, while the low-ranked sentences are downplayed but not discarded, allowing more exploration of the search space.
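The soft-attention variant can be sketched as follows; `ff` again stands in for the learned FF(·) layer (identity here, an assumption), and the key contrast with the hard variants is that every sentence keeps a nonzero weight.

```python
# Sketch of RL-MMR SOFT-ATTN: MMR scores are normalized into attention
# weights mu = softmax(FF(m^t)) and used to rescale sentence
# representations instead of masking sentences out.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    return [x / sum(e) for x in e]

def mmr_soft_attend(A, mmr_scores, ff=lambda m: m):
    """Return (A_tilde, mu) with A_tilde_j = mu_j * A_j.
    Low-ranked sentences are downplayed, never discarded."""
    mu = softmax([ff(m) for m in mmr_scores])
    return [[mu[j] * x for x in Aj] for j, Aj in enumerate(A)], mu
```

The rescaled representations then flow through the two-hop attention unchanged, so MMR influences both the sentence scores and the state representation g_t.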

MDS with Reinforcement Learning
The guidance of MMR is incorporated into neural representation learning through end-to-end RL training. Specifically, we formulate extractive MDS as a Markov Decision Process, where the state is defined by (D \ E_t, g_t). At each time step, one action is sampled from A according to p_j^t, and its reward is measured by comparing the extracted sentence e_t with the reference R via ROUGE (Lin, 2004), i.e., r_t = ROUGE-L_F1(e_t, R). At the final step T, when the STOP action is taken, an overall estimation of the summary quality is measured by r_T = ROUGE-1_F1(E, R). Reward optimization encourages salient and non-redundant summaries: the intermediate rewards focus on the salience of the currently extracted sentence, and the final reward captures the salience and redundancy of the entire summary.
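The per-step reward can be sketched with a simplified, single-reference ROUGE-L F1 computed from longest-common-subsequence length. The experiments use the standard ROUGE package, so this is only an illustrative stand-in.

```python
# Simplified single-reference ROUGE-L F1, the shape of the per-step
# reward r_t = ROUGE-L F1(e_t, R). Illustrative only; the standard
# ROUGE toolkit handles multiple references, stemming, etc.

def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

The final reward r_T swaps LCS for unigram overlap (ROUGE-1 F1) over the whole extracted summary, which can be computed analogously.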
Similar to prior RL-based models on SDS (Paulus et al., 2018;Chen and Bansal, 2018;Narayan et al., 2018), we use policy gradient (Williams, 1992) as the learning algorithm for model parameter updates. In addition, we adopt Advantage Actor-Critic (A2C) optimization -a critic network is added to enhance the stability of vanilla policy gradient. The critic network has a similar architecture to the one described in Sec. 3.2 and uses the sentence representation A to generate an estimation of the discounted reward, which is then used as the baseline subtracted from the actual discounted reward before policy gradient updates.

Experiments
We conduct extensive experiments to examine RL-MMR with several key questions: (Q1) How does RL-MMR perform compared to state-of-the-art methods? (Q2) What are the advantages of soft attention over hard cutoff in learning adequate neural representations under the large search space? (Q3) How crucial is the guidance of MMR for adapting SDS to MDS in the face of high redundancy?

Experimental Setup
Datasets. We take the MDS datasets from the DUC and TAC competitions, which are widely used in prior studies (Kulesza and Taskar, 2012). Following convention (Cao et al., 2017; Li et al., 2017; Zhang et al., 2018; Cho et al., 2019), we measure ROUGE-1/2/SU4 F1 scores (Lin, 2004). The evaluation parameters are set according to Hong et al. (2014), with stemming and stopwords not removed. The output length is limited to 100 words. These setups are the same for all compared methods. Compared Methods. We compare RL-MMR with both classical and neural MDS methods. Note that some previous methods are incomparable due to differences such as length limit (100 words or 665 bytes) and evaluation metric (ROUGE F1 or recall). Details of each method and differences in evaluation can be found in App. A.
We use RL-MMR SOFT-ATTN as our default model unless otherwise mentioned. Implementation details can be found in App. A.4. We also report Oracle, an approximate upper bound that greedily extracts sentences to maximize ROUGE-1 F1 given the reference summaries (Nallapati et al., 2017).

Comparison with the State-of-the-art
To answer Q1, we compare RL-MMR with state-of-the-art summarization methods and list the comparison results in Tables 1 and 2. On DUC 2004, we observe that rnn-ext + RL, which we base our framework on, fails to achieve satisfactory performance even after fine-tuning. The large performance gains of RL-MMR over rnn-ext + RL demonstrate the benefits of guiding SDS models with MMR when adapting them to MDS. A similar conclusion is reached when comparing PG and PG-MMR. However, the hard cutoff in PG-MMR and the lack of in-domain fine-tuning lead to its inferior performance. DPP and DPP-Caps-Comb obtain decent performance but do not outperform RL-MMR due to the lack of end-to-end representation learning. Lastly, RL-MMR achieves new state-of-the-art results, approaching the performance of Oracle, which has access to the reference summaries, especially on ROUGE-2. We observe similar trends on TAC 2011, where RL-MMR again achieves state-of-the-art performance. The improvement over compared methods is especially significant on ROUGE-1 and ROUGE-SU4.

Analysis of RL-MMR Combination
We answer Q2 by comparing the performance of various combination mechanisms for RL-MMR. As shown in Table 3, RL-MMR HARD-COMB performs better than RL-MMR HARD-CUT, showing the effectiveness of using MMR scores instead of degenerating them into binary values. We test RL-MMR SOFT-COMB with different β, but it generally performs much worse than the other variants, which implies that naively incorporating MMR into representation learning through a weighted average may loosen the guidance of MMR, losing the benefits of both modules. Infusing MMR via soft attention over the action space performs the best, demonstrating the effectiveness of MMR guidance in RL-MMR SOFT-ATTN for sentence representation learning. Hard Cutoff vs. Soft Attention. We further compare the extracted summaries of MMR, RL-MMR HARD-CUT, and RL-MMR SOFT-ATTN to verify the assumption that there are high-quality sentences not ranked highly by MMR and thus neglected by the hard cutoff. In our analysis, we find that when performing soft attention, 32% (12%) of the extracted summaries contain low-ranked sentences that are not from M^1_{1:K} when K = 1 (K = 7). We then evaluate those samples with low-ranked sentences extracted and conduct a pairwise comparison. On average, we observe a gain of 18.9% ROUGE-2 F1 of RL-MMR SOFT-ATTN over MMR, and 2.71% over RL-MMR HARD-CUT, which demonstrates the benefits of soft attention.

Degree of RL-MMR Combination.
To study the effect of combining RL and MMR to different degrees, we vary the cutoff K in RL-MMR HARD-CUT and analyze the performance changes. As listed in Table 4, a small K (= 1) imposes tight constraints and practically degrades RL-MMR to vanilla MMR. A large K (= 50) might be too loose to limit the search space effectively, resulting in worse performance than a K (= 7, 10) within the proper range. When K is increased to 100, the impact of MMR further decreases but still positively influences model performance compared to vanilla RL (K = ∞), especially on ROUGE-1.

Effectiveness of MMR Guidance
To answer Q3, we compare RL-MMR with vanilla RL without MMR guidance in terms of both training and test performance. We also inspect details such as the runtime and quality of their extracted summaries (provided in App. B).

Training Performance. To examine whether MMR guidance helps with the learning efficiency of MDS, we plot the learning curves of vanilla RL and RL-MMR HARD-CUT in Fig. 2. RL-MMR receives a significantly better initial reward on the training set because MMR provides prior knowledge to extract high-quality sentences. In addition, RL-MMR has lower variance and achieves faster convergence than RL due to MMR guidance. Note that the final reward of vanilla RL at the plateau is higher than that of RL-MMR, which is somewhat expected since RL can fit the training set better when it has less guidance (constraint). Test Performance. We also compare the test performance of vanilla RL and RL-MMR.

Ablation of MMR
In this section, we conduct further ablation of MMR given its decent performance. We study the balance between salience and redundancy, and the performance of different similarity measures. Specifically, we use TF-IDF and BERT (Devlin et al., 2019) as the sentence (document) representation and measure cosine similarity in S(s_j, D) and R(s_j, e). We also explore whether a semantic textual similarity model, SNN (Latkowski, 2018), is more effective in measuring the redundancy R(s_j, e) than TF-IDF. The TF-IDF features are estimated on the MDS datasets, while the neural models are pre-trained on their corresponding tasks. Balance between Salience and Redundancy. By examining the y-axis in Fig. 3, we observe that considering both salience and redundancy (best λ = 0.5 to 0.8) performs much better than only considering salience (λ = 1) regardless of the specific measures, further indicating the necessity of explicit redundancy avoidance in MDS.
Comparison of Similarity Measures. By varying the x values in Fig. 3, TF-IDF and neural estimations are combined using different weights. Although BERT and SNN (combined with TF-IDF) perform slightly better at times, they often require careful hyper-parameter tuning (of both x and y). Hence, we use TF-IDF as the representation in MMR throughout our experiments.
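A minimal sketch of TF-IDF cosine similarity as used for S(s_j, D) and R(s_j, e). Estimating IDF over just the given texts is a simplifying assumption for illustration (the paper estimates TF-IDF features on the MDS datasets).

```python
# Sketch of TF-IDF cosine similarity for the MMR salience and
# redundancy measures. IDF is estimated over the provided texts
# only, a simplifying assumption.
import math
from collections import Counter

def tfidf_vectors(texts):
    """Map each text to a {term: tf * idf} dict with smoothed IDF."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in d.items()} for d in docs]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

S(s_j, D) is then cosine between a sentence vector and the concatenated-document vector, and R(s_j, e) cosine between two sentence vectors.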

Output Analysis
We analyze the outputs of the best-performing methods in Table 6. DPP-Caps-Comb still seems to struggle with redundancy, as it extracts three sentences with similar semantics ("Turkey wants Italy to extradite Ocalan"). MMR and DPP-Caps-Comb both extract one sentence regarding a hypothesis that "Ocalan will be tortured", which is not found in the reference. RL-MMR produces a more salient and non-redundant summary, as it is trained end-to-end with advances in SDS for sentence representation learning while maintaining the benefits of classical MDS approaches. In contrast, MMR alone only considers lexical similarity, and the redundancy measure in DPP-Caps-Comb is pre-trained on one SDS dataset with weak supervision and fixed during the training of DPP.

Related Work
Multi-document Summarization. Classical MDS methods explore both extractive (Erkan and Radev, 2004; Haghighi and Vanderwende, 2009) and abstractive (Barzilay et al., 1999; Ganesan et al., 2010) approaches. Many neural MDS methods (Zhang et al., 2018) are merely comparable to or even worse than classical methods due to the challenges of a large search space and limited training data. Unlike DPP-Caps-Comb (Cho et al., 2019), which incorporates neural measures into classical MDS as features, RL-MMR opts for the opposite by endowing SDS methods with the capability to conduct MDS, enabling the potential of further improvement with advances in SDS.
Bridging SDS and MDS. Initial trials adapting SDS models to MDS (Zhang et al., 2018) directly reuse SDS models (Tan et al., 2017). To deal with the large search space, a sentence ranker is used in the adapted models for candidate pruning. Specifically, PG-MMR (Lebanoff et al., 2018) leverages MMR (Carbonell and Goldstein, 1998) to rank sentences, allowing only the words in the top-ranked sentences to appear in the generated summary. Similarly, Zhang et al. (2018) use topic-sensitive PageRank (Haveliwala, 2002) and compute attention only for the top-ranked sentences. Unlike RL-MMR, these adapted models use hard cutoff and (or) lack end-to-end training, failing to outperform state-of-the-art methods designed specifically for MDS (Gillick and Favre, 2009; Kulesza and Taskar, 2012).

Conclusion
We present a reinforcement learning framework for MDS that unifies neural SDS advances and Maximal Marginal Relevance (MMR) through end-to-end learning. The proposed framework leverages the benefits of both neural sequence learning and statistical measures, bridging the gap between SDS and MDS. We conduct extensive experiments on benchmark MDS datasets and demonstrate the superior performance of the proposed framework, especially in handling the large search space and high redundancy of MDS. In the future, we will investigate the feasibility of incorporating classical MDS guidance into abstractive models with large-scale pre-training (Gu et al., 2020) and more challenging settings where each document set may contain hundreds or even thousands of documents.

[Dataset statistics table notes: |D_i|, min |D_i|, and max |D_i| denote the average, min, and max number of sentences in a document set; |s_j| denotes the average number of words per sentence.]

A.2 Remarks on Experimental Setup
We note that there are many inconsistencies in previous work on MDS, and some results cannot be directly compared with ours. Specifically, there are three major differences that may lead to incomparable results. First, while the original DUC competitions adopted an output length of 665 bytes, more recent studies mostly use a length limit of 100 words following Hong et al. (2014), and some do not impose any length limit (usually resulting in higher numbers). Second, some papers report ROUGE recall (Cao et al., 2017; Nayeem et al., 2018; Gao et al., 2019) while others (including ours) report ROUGE F1, following the trend on SDS (Zhang et al., 2018; Cho et al., 2019). Third, while DUC 2004 and TAC 2011 are usually used as test sets, the training sets used in different studies often vary. We follow the same setup as the compared methods to ensure a fair comparison.

A.3 Description of Extractive Baselines
SumBasic (Vanderwende et al., 2007) is based on word frequency and hypothesizes that frequently occurring words are likely to be included in the summary. KLSumm (Haghighi and Vanderwende, 2009) greedily extracts sentences as long as they lead to a decrease in KL divergence. LexRank (Erkan and Radev, 2004) computes sentence salience based on eigenvector centrality in a graph-based representation. Centroid (Hong et al., 2014) scores sentences by their similarity to the centroid of the document set.

Figure 4: We use the same shape to denote semantically similar sentences. Directly applying RL to MDS encounters a large search space and high redundancy, resulting in repeated summaries. MMR guides RL by attending to salient and non-redundant candidates.

A.4 Implementation Details
We use TF-IDF-based similarity measures in MMR instead of ROUGE-based measures and advanced neural-based measures (Cho et al., 2019; Devlin et al., 2019; Latkowski, 2018), as they are faster to compute and comparable in performance. We pre-train rnn-ext + RL (Chen and Bansal, 2018) on the CNN/Daily Mail SDS dataset and continue fine-tuning on the in-domain training set. We train RL-MMR using an Adam optimizer with learning rate 5e-4 for RL-MMR SOFT-ATTN and 1e-3 for the other variants, without weight decay. We tested various reward functions, such as different ROUGE metrics, the MMR scores, and intrinsic measures based on sentence representation, and found them comparable to or worse than the current one. One may also use other semantic metrics such as MoverScore (Zhao et al., 2019) and FAR.

B Detailed Analysis of RL-MMR
Additional Illustration. We provide an illustration in Fig. 4 to better elucidate the motivation of RL-MMR. Note that RL-MMR is mostly based on SDS architectures while achieving state-of-the-art performance on MDS, whereas existing combination approaches that achieve decent performance (e.g., DPP-Caps) are based on MDS architectures. Runtime and Memory Usage. RL-MMR is time and space efficient for two reasons. First, its hierarchical sentence encoding is much faster than a word-level sequential encoding mechanism while still capturing global context. Second, the guidance of MMR provides RL-MMR with a "warm-up" effect, leading to faster convergence. In our experiments, one epoch of RL-MMR takes 0.87 to 0.91s on a GTX 1080 GPU with less than 1.2 GB of memory usage. The number of epochs is set to 10,000 and we adopt early stopping: the training process terminates if RL-MMR cannot achieve better results on the validation set after 30 consecutive evaluations. As a result, the runs often terminate before 5,000 epochs, and the overall training time ranges from 40 to 90 minutes. Detailed Examples. In Table 8, we show the extracted summaries of vanilla RL and RL-MMR for the same document set. Without the guidance of MMR, the RL agent is much more likely to extract redundant sentences. In the first example, RL extracts two semantically equivalent sentences from two different documents. These two sentences would have similar sentence representations h_j, and the latent state representation g_t itself might not be enough to avoid redundant extraction. In contrast, RL-MMR selects diverse sentences after extracting the same initial sentence as RL, thanks to the explicit redundancy measure in MMR. In the second example, the issue of redundancy in RL is even more severe: all four extracted sentences of RL cover the same aspect of the news. RL-MMR again balances sentence salience and redundancy better than vanilla RL, favoring diverse sentences.
Such results imply that pure neural representation is insufficient for redundancy avoidance in MDS and that classical approaches can serve as a complement.