Q-learning with Language Model for Edit-based Unsupervised Summarization

Unsupervised methods are promising for abstractive text summarization in that no parallel corpora are required. However, their performance is still far from satisfactory, and research on promising solutions is ongoing. In this paper, we propose a new approach based on Q-learning with an edit-based summarization. The method combines two key modules to form an Editorial Agent and Language Model converter (EALM). The agent predicts edit actions (e.g., delete, keep, and replace), and then the LM converter deterministically generates a summary on the basis of the action signals. Q-learning is leveraged to train the agent to produce proper edit actions. Experimental results show that EALM delivered competitive performance compared with the previous encoder-decoder-based methods, even with truly zero paired data (i.e., no validation set). Defining the task as Q-learning enables us not only to develop a competitive method but also to make the latest techniques in reinforcement learning available for unsupervised summarization. We also conduct a qualitative analysis, providing insights for future study on unsupervised summarizers.


Introduction
Automatic text summarization is an attractive technique for helping humans grasp the content of documents effortlessly. While supervised neural methods have shown good performance (See et al., 2017; Zhang et al., 2019), the unsupervised approach is starting to attract interest due to its advantage of not requiring costly parallel corpora. However, the empirical performance of unsupervised methods is currently behind that of state-of-the-art supervised models (Zhao et al., 2018). Unsupervised text summarization is still developing and is now at the stage where various solutions should be actively explored. One previous unsupervised approach extends neural encoder-decoder modeling to the zero-paired-data scenario, where a model is trained with a paradigm called compression-reconstruction (CR) learning (Miao and Blunsom, 2016; Fevry and Phang, 2018; Zhao et al., 2018). The mechanism is similar to that of back-translation (Sennrich et al., 2016): the model consists of a compressor (i.e., summarizer) and a reconstructor, and they are co-trained so that the reconstructor can recover the original sentence from the summary generated by the compressor (Miao and Blunsom, 2016; the left side of Figure 1). Experimental results showed that such an unsupervised encoder-decoder-based summarizer is able to learn the mapping from a sentence to a summary without paired data (Baziotis et al., 2019; Yang et al., 2020).
Reinforcement learning (RL) is also a potential solution for the no-paired-data situation. In related fields, for example, there are unsupervised methods for text simplification and text compression with policy-gradient learning (Zhang and Lapata, 2017; Zhao et al., 2018). Recent RL techniques take a value-based approach (e.g., Q-learning) such as DQN (Mnih et al., 2015) or a combination of policy- and value-based approaches such as Asynchronous Advantage Actor-Critic (Mnih et al., 2016). A critical requirement for leveraging a value-based method is a value function that represents the goodness of an action on a given state (Sutton et al., 1998). We can naturally define the value function by utilizing the CR-learning paradigm, which makes the latest value-based approaches available for unsupervised text summarization.
In this paper, we propose a new method based on Q-learning and an edit-based summarization (Gu et al., 2019; Malmi et al., 2019; right side of Figure 1). The edit-based summarization generates a summary by operating an edit action (e.g., keep, remove, or replace) on each word in the input sentence. Our method, which we call EALM, implements the editing process with two modules: 1) an Editorial Agent that predicts edit actions, and 2) a Language Model (LM) converter that deterministically decodes a sentence on the basis of the action signals. The CR learning is defined on the Q-learning framework to train the agent to predict edit actions that instruct the LM converter to produce a good summary. Although a vast action space that causes reward sparsity, such as the word generation of an encoder-decoder model, is generally difficult to learn in RL, our method mitigates this issue thanks to its small set of edit actions and the deterministic decoding by a language model. Moreover, the formulation as Q-learning enables us to incorporate the latest techniques in RL.
The main contribution of this paper is that we provide a new solution in the form of an unsupervised edit-based summarization leveraging Q-learning and a language model. Experimental results show that our method achieved performance competitive with encoder-decoder-based methods even with truly no paired data (i.e., no validation set), and qualitative analysis brings insights as to what current unsupervised models are missing. Also, the problem formulation on Q-learning enables us to import the latest techniques in RL, which leads to potential improvements in future research.

Task Definition
We begin by formally defining the problem of unsupervised summarization with the CR learning. The goal of the task is to produce an informative summary y consisting of M words y_1, y_2, ..., y_M for a given input sentence x consisting of N words x_1, x_2, ..., x_N, where M < N. The challenge in this task is to learn the transformation from x to y with only the input sentence x.
To tackle this, the CR learning introduces an additional transformation called reconstruction. The reconstruction requires reproducing the input sentence from the generated summary y, where x̂ is the reproduced sentence consisting of N words x̂_1, x̂_2, ..., x̂_N. In terms of the generated sentences y and x̂, let C be the compression function and R be the reconstruction function:

    y = C(x; θ_C),   x̂ = R(y; θ_R),

where θ_C and θ_R are their respective parameters. Thus, the task can be written as the following optimization problem:

    max_{θ_C, θ_R}  f(x, y) + g(x, x̂),

where f(x, y) and g(x, x̂) are functions that return a higher value for favorable y and x̂ with regard to the input sentence x. According to the CR learning's hypothesis that the summary should contain enough information to guess the original contents, y becomes favorable when the difference between x and x̂ is smaller while y maintains the essential aspects of a summary (e.g., shortness, fluency).

Previous Method
The previous approaches use a generative encoder-decoder model (Sutskever et al., 2014) for the compression and reconstruction functions (Miao and Blunsom, 2016; Fevry and Phang, 2018; Wang and Lee, 2018; Baziotis et al., 2019). Although the objective functions and implementation details differ depending on the study, the underlying motivation entails the same hypothesis as the CR learning. For example, Baziotis et al. (2019) introduced four objective functions, namely the discrepancy of y from a pre-trained language model, the topical distance of x and y, the length of y, and the probability difference of x_i and x̂_i, where the first three can be regarded as f(x, y) and the last one as g(x, x̂). While such an encoder-decoder model has performed well on many generation tasks, it suffers from inherent difficulties related to repetition (See et al., 2017), length control (Kikuchi et al., 2016), and exposure bias (Ranzato et al., 2016). It also runs into convergence problems when co-training multiple generators (Salimans et al., 2016).

Proposed Method
Our proposed method, which we call EALM, consists of two essential modules: the editorial agent and the LM converter. The agent sends action signals (keep, remove, or replace each word in a sentence) to the converter, which then deterministically transforms the input sentence according to the signals. We train the agent to find action signals such that the LM converter produces the sentences demanded by the CR learning. In the following sections, we first share the background of Q-learning (§4.1) and then present how to place the task and our approach within the Q-learning framework (§4.2). We next explain the core algorithmic details (§4.3) and finish with explanations about training and inference (§4.4).

Preliminaries
Q-learning is a popular approach in RL, as represented by Deep Q-Networks (DQN; Mnih et al., 2015). Q-learning leverages an action-value function to estimate the value of a pair of a state and an action with respect to a policy π. The action-value function (i.e., Q-function) is represented as the expected cumulative reward for the state-action pair:

    Q^π(s, a) = E_π [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s, a_0 = a ],

where s is a state, a is an action, r is a reward function for the state-action pair, and γ is the discount factor. Hence, to solve a text summarization task via Q-learning, we first need to appropriately define the state, action, and reward function.
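To make the action-value update concrete, the following is a minimal tabular Q-learning sketch (illustrative only; the function and state names are ours, not the paper's code), moving Q(s, a) toward the Bellman target r + γ max_a' Q(s', a'):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update: move Q(s, a) toward the Bellman target
    r + gamma * max_a' Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q

# toy check: a single rewarding transition pulls the Q-value toward the target
Q = q_update({}, "s0", "Remove", 1.0, "s1", ["Remove", "Keep", "Replace"])
```

With an empty table, the target is 1.0 and the updated value is 0 + 0.1 × (1.0 − 0) = 0.1.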

Unsupervised Edit-based Summarization with Q-learning
In our approach, given an input sentence x, we define a state s_i with regard to each word x_i. An action a_i for the state s_i is chosen from among three options, A = {Remove, Keep, Replace}. The goal of the editorial agent is to provide the optimal action sequence a = {a_1, a_2, ..., a_N} by iteratively making action decisions on each word (§4.3.1). To obtain y and x̂, we propose a deterministic transformation algorithm based on a and the LM converter (§4.3.2). Finally, we define the reward function r to evaluate the action and action sequence in terms of the produced sentences y and x̂ (§4.3.3). The reward function is designed to align with the CR learning paradigm and leads the agent to produce the action sequence that generates appropriate y and x̂.

Algorithms
In this section, we describe three principal algorithms: 1) how to create s_i and predict a_i, 2) how to generate y and x̂ by means of a and the LM converter, and 3) how to compute the reward.

Iterative Action Prediction
The overall flow of iteratively predicting an action for each word is shown in Figure 2. The agent predicts an action for a state (i.e., a word) one by one, so we call one prediction a step and express it with a subscript (t). For example, s_i(t) and a_i(t) respectively denote the state and action for x_i at the t-th step. Note that a_i(t) holds a predicted action if the agent has already made the prediction on x_i by the t-th step; otherwise a_i(t) is Keep. Also, we prepare a Boolean vector u_(t) of length N representing the prediction statuses at the t-th step; u_i(t) is 1 if the prediction on the i-th word has been finished, and otherwise 0. The order in which to predict an action is determined by Q-values. Let s* and a* be the state-action pair to be operated next, which comes from the maximum Q-value over unoperated states:

    (s*, a*) = arg max_{s_i(t) : u_i(t) = 0, a ∈ A} Q(s_i(t), a).

The agent then reiterates the predictions until it finishes determining an action for all words. By defining the state with regard to a single word instead of a whole sentence and asking the agent to determine the prediction order, we can handle the variable sentence lengths of natural language. Note that this is not a left-to-right process; the agent conducts the predictions in order of "confidence". Next we explain how to encode s_i(t). To provide the agent with contextual information, such as the previous decisions, the prediction statuses, and the whole sentence, we dynamically create a state s_i(t) as the concatenation of two encodings: a local encoding l_i(t) and a global encoding g_i(t). To create the two encodings, we first map x_i to a fixed-sized vector e_i with an arbitrary encoder (we use BERT), and e_i is reused throughout the process regardless of the step. Then, we define the local encoding as

    l_i(t) = e_i + b^a_i(t) + b^u_i(t),

where b^a_i(t) and b^u_i(t) are learnable bias vectors for the action and prediction status of the i-th word, respectively.
Next, we create the global encoding in a self-attention fashion as

    g_i(t) = Σ_j w_j(t) l_j(t),

where the attention weight w_j(t) is computed with ReLU:

    w_j(t) = ReLU(l_i(t)^T l_j(t)) / Σ_k ReLU(l_i(t)^T l_k(t)).

Thanks to the bias terms in l_i(t) and the self-attention in g_i(t), s_i(t) is aware of the previous decisions for each word and the interactions between those decisions. In addition, the BERT encoding e_i enables us to take the whole sentence into account.
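The order selection by Q-values described above can be sketched as follows. This is a simplified illustration with NumPy, not the paper's implementation; the array layout and names are our assumptions.

```python
import numpy as np

ACTIONS = ["Remove", "Keep", "Replace"]

def select_next(q_values, operated):
    """Pick the unoperated word with the highest Q-value over all actions,
    returning its index and the argmax action.

    q_values: (N, |A|) array of Q(s_i, a); operated: length-N Booleans (the vector u)."""
    masked = q_values.copy()
    masked[np.array(operated)] = -np.inf      # skip words already operated on
    i = int(np.unravel_index(np.argmax(masked), masked.shape)[0])
    return i, ACTIONS[int(np.argmax(q_values[i]))]

q = np.array([[0.2, 0.1, 0.0],
              [0.9, 0.3, 0.1],
              [0.4, 0.8, 0.2]])
i, a = select_next(q, operated=[False, True, False])  # word 1 is already done
```

Here word 1 is excluded, and word 2 wins with its maximum Q-value of 0.8 for Keep.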

Deterministic Decoding by Language Model with Action Signals
In this section, we explain how to compress and reconstruct sentences in a deterministic manner.

[Footnote 3: We used ReLU(·) instead of the conventional exp(·) because exp(·) caused the exploding gradient in our case.]

The procedure to obtain y and x̂ by using a and a masked language model (MLM) is shown in Figure 3. First, we convert x to a skeleton sequence z consisting of N tokens z_1, z_2, ..., z_N, where z_i is x_i if a_i is Keep and otherwise a null token. We then define our compression and reconstruction functions C̃ and R̃, both of which let the MLM fill in the null tokens: in compression, a word is predicted only for the null tokens given by Replace (those given by Remove are deleted), whereas in reconstruction words are predicted for all positions. Also, we set the original sentence as a prefixed context, which comes from x in compression and from y in reconstruction, to make the MLM aware of the original meaning. An example is shown in Figure 3, where the MLM receives "Machine learning is not perfect .
[MASK] is [MASK] ." as the compression input and predicts words for the [MASK]s. If there are multiple masks, we conduct the prediction in an autoregressive fashion (see Appendix A.1). Note that while any language model can be used for the LM converter, an MLM is advantageous because it utilizes both the left and right contexts, and there is no restriction against looking ahead at upcoming words.
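The skeleton construction can be sketched as follows (a minimal sketch under our reading of the procedure, with [MASK] standing in for the null token; the helper is ours, not the paper's code):

```python
def build_skeleton(words, actions):
    """Keep words marked Keep; replace the rest with a null token. For the
    compression input, Remove slots are dropped and Replace slots stay masked
    for the MLM to fill (our reading of the procedure)."""
    NULL = "[MASK]"
    skeleton = [w if a == "Keep" else NULL for w, a in zip(words, actions)]
    comp_input = [t for t, a in zip(skeleton, actions) if a != "Remove"]
    return skeleton, comp_input

words = "machine learning is not perfect .".split()
actions = ["Replace", "Remove", "Keep", "Remove", "Replace", "Keep"]
skeleton, comp = build_skeleton(words, actions)
```

With these (assumed) actions, the compression input becomes "[MASK] is [MASK] .", matching the example in Figure 3.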

Stepwise Reward Computation
In this section, we explain the reward computation for the chosen action by referring to y and x̂.
As stated in §4.3.1, we have an action sequence a_(t) for every step t. When we apply C̃ and R̃ to every a_(t), we can obtain a list of tuples (s, a, r, x, y, x̂)_(t). A tuple, which we call an experience, enables us to evaluate a state-action pair with respect to a single transition. In this section, we propose three techniques, namely the step reward, the violation penalty, and the summary assessment, to evaluate the agent's behavior with the stepwise experiences. Refer to Table 1 to see how these work in the reward computation with an actual example.
Before moving on to the details, let us define two important notions used throughout this section, the compression rate (cr) and the reconstruction rate (rr):

    cr_(t) = 1 − |y_(t)| / N,   rr_(t) = (1/N) Σ_{i=1}^{N} 1[x_i = x̂_i(t)],

where |y_(t)| is the length of the compressed sentence at the t-th step. The CR learning assumes that higher values of cr and rr are better. We use these for calculating rewards and pruning experiences.
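A sketch of the two rates, assuming cr is the removed fraction of the input and rr the fraction of exactly recovered words (our reading of the definitions; function names are ours):

```python
def compression_rate(x_words, y_words):
    """Fraction of the input removed; higher means a shorter summary."""
    return 1.0 - len(y_words) / len(x_words)

def reconstruction_rate(x_words, x_hat_words):
    """Fraction of input words recovered exactly at the same position."""
    return sum(a == b for a, b in zip(x_words, x_hat_words)) / len(x_words)

x = "the quick brown fox jumps".split()
cr = compression_rate(x, "quick fox jumps".split())               # 1 - 3/5 = 0.4
rr = reconstruction_rate(x, "the quick brown cat jumps".split())  # 4/5 = 0.8
```
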
Step Reward. The task of the agent is to produce an action sequence with which the LM converter generates an appropriately compressed sentence while keeping the reconstruction successful. As such, we define the reward function r as

    r = r_SR + r_SA,

where r_SR is the step reward, designed to encourage the agent to improve the compression and reconstruction rates, and r_SA is an additional score from the qualitative assessment of y, which we explain later. Returning to the step reward, r_SR is a multiplication of r_C and r_R:

    r_SR = r_C · r_R,   r_C = cr_(t) − cr_(t−1),   r_R = 1 if rr_(t) > τ_(t), otherwise −1,

where τ_(t) is the minimum requirement for the reconstruction rate at the t-th step, defined as

    τ_(t) = 1 − (t/N)(1 − τ).

If we set τ = 1, which requests perfect reconstruction, then τ_(t) = 1 regardless of t. However, we need to forgive reconstruction failure to some extent because of the information loss in compression, and τ adjusts the allowed number of failures. For example, τ = 0.5 requests the model to recover at least half of the original sentence correctly.
Let us describe the behavior of the step reward r_SR. First, the reward is 0 when the agent chooses Keep or Replace because r_C = 0 due to there being no change in the length of y. Second, the reward takes a positive value when the agent chooses Remove and satisfies the requirement on the reconstruction rate (rr_(t) > τ_(t)). Third, the reward takes a negative value when the agent chooses Remove but the reconstruction rate is below the requirement. In short, the step reward recommends Remove as long as the agent can recover the original word, and otherwise Keep or Replace.
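The step-reward logic can be sketched as follows. The exact functional forms of r_C, r_R, and τ(t) below are our assumptions, chosen to be consistent with the behavior described in the text (zero for Keep/Replace, signed by the reconstruction threshold for Remove):

```python
def tau_t(t, n, tau):
    """Per-step reconstruction threshold, interpolating from 1 toward tau (assumed form)."""
    return 1.0 - (t / n) * (1.0 - tau)

def step_reward(cr_prev, cr_now, rr_now, t, n, tau=0.5):
    """r_SR = r_C * r_R: zero when the summary length is unchanged (Keep/Replace),
    positive for Remove when rr clears the threshold, negative otherwise."""
    r_c = cr_now - cr_prev                      # > 0 only when a word was removed
    r_r = 1.0 if rr_now > tau_t(t, n, tau) else -1.0
    return r_c * r_r

r_keep = step_reward(0.0, 0.0, 1.0, 1, 6)   # 0.0: no length change
r_good = step_reward(0.0, 1/6, 1.0, 1, 6)   # positive: Remove, rr clears threshold
r_bad = step_reward(0.0, 1/6, 0.0, 6, 6)    # negative: Remove, rr too low
```
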
Violation Penalty. Sequential modeling, including that performed by our agent, essentially suffers from error propagation caused by incorrect predictions at an earlier stage (Collins and Roark, 2004). The violation penalty mitigates this issue by giving a negative reward to the latest problematic action and excluding experiences after the mistake.

Figure 4: Violation penalty for compression (left) and reconstruction (right). The x-axis is the step and the y-axis is each rate. The horizontal lines in the middle are ρ and τ, and the dashed lines represent ρ_(t) and τ_(t). The circles represent a step where the agent breaks the constraints; the experiences after such a step are not used in training.
Here, in addition to τ, we introduce the hyperparameter ρ, which represents the minimum requirement for the compression rate. ρ_(t) denotes its threshold at the t-th step, defined as ρ_(t) = tρ/N, and the agent must satisfy the condition cr_(t) > ρ_(t). As the penalty, we forcibly assign a −1 reward to the state-action pair at the T-th step when the agent breaks either the constraint of τ_(T) or that of ρ_(T). In addition, we ignore the experiences from step (T + 1) onward. If the agent keeps predicting until the end, we define T = N. Figure 4 shows how these constraints work on the experience sequence.
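The per-step constraint check can be sketched as follows; ρ(t) = tρ/N is from the text, while the form of the reconstruction threshold is our assumption:

```python
def violated(cr_now, rr_now, t, n, rho=0.3, tau=0.5):
    """True when the agent breaks either per-step constraint at step t."""
    rho_t = t * rho / n                       # compression threshold, from the text
    tau_thr = 1.0 - (t / n) * (1.0 - tau)     # reconstruction threshold (assumed form)
    return not (cr_now > rho_t and rr_now > tau_thr)

# early steps demand near-perfect reconstruction but only a little compression
ok_early = violated(0.2, 1.0, 1, 10)    # False: both constraints satisfied
bad_late = violated(0.0, 1.0, 10, 10)   # True: cr 0.0 <= rho(10) = 0.3
```
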
Summary Assessment. Although the step reward considers the compression and reconstruction rates, it ignores critical aspects of the generated summary such as replacement with a shorter synonym and fluency as a sentence. Here, we explain the r_SA mentioned in the previous paragraph and describe how to reflect such qualitative assessments in the reward given to the agent.
As the essential properties for y, we take three perspectives into account: informativeness, shortness, and fluency. Informativeness refers to how much y retains the original meaning of x, and shortness and fluency are self-explanatory. To reflect these perspectives in the agent's decisions, we define r_SA as

    r_SA = (T/N) [ cr_(T) · rr_(T) + α · sim(x, y_(T)) + β · llh(y_(T)) ],

where sim computes a similarity score of x and y, and llh computes a log-likelihood of y. α and β are hyperparameters to adjust the importance of sim and llh. In addition to r_SR, we give r_SA to the experiences from the beginning to the T-th step as defined in the step reward paragraph. Let us explain the terms inside the square brackets first. The first term, the multiplication of cr_(T) and rr_(T), aims for shortness and informativeness. It takes a higher value when the agent achieves the right balance of compression and reconstruction. The second term, sim, aims to evaluate the informativeness brought about by Replace. Concretely, sim returns a semantic similarity score in the range of [0, 1] through the sentence vectors of x and y_(T) rather than just checking exact matches of words. The last term, llh, represents fluency via the log-likelihood of y_(T) given by a pre-trained language model (Zhao et al., 2018). We use BERT for the computation of sim and llh (Devlin et al., 2019; Wang and Cho, 2019; see Appendix A.3). Finally, T/N is the ratio of the number of operated words. It becomes closer to 1 when the agent reaches termination, i.e., finishes the prediction on all words while avoiding the violation penalty, which makes r_SA larger. In contrast, an agent that fails at an earlier stage gets a small value of r_SA.
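The assessment score can be sketched as follows; the functional form is inferred from the worked example in Table 1, 0.5 × (0.25 + 0.5 × 0.1 + 1.0 × 0.1) = 0.20. The split of the 0.25 product into cr and rr values below is our assumption for illustration:

```python
def summary_assessment(T, n, cr_T, rr_T, sim_score, llh_score, alpha=0.1, beta=0.1):
    """r_SA = (T/N) * [cr*rr + alpha*sim + beta*llh], matching the Table 1 arithmetic."""
    return (T / n) * (cr_T * rr_T + alpha * sim_score + beta * llh_score)

# reproduces the Table 1 computation: 0.5 * (0.25 + 0.05 + 0.1) = 0.20
r_sa = summary_assessment(T=3, n=6, cr_T=0.5, rr_T=0.5, sim_score=0.5, llh_score=1.0)
```
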

Training and Inference
Training. Leveraging the experiences (s, a, r, x, y, x̂) in the replay buffer (Lin, 1992), the agent learns the policy for summarizing a sentence x within the Q-learning framework. Specifically, we utilize DQN (Mnih et al., 2015) to learn the Q-function Q* corresponding to the optimal policy by minimizing the loss

    L = E [ (ψ − Q*(s, a))² ],   ψ = r + γ max_{a'} Q̄(s', a'),

where Q̄ is a target Q-function whose parameters are periodically updated in accordance with the latest network parameters. During the collection of experiences, RL requires the agent to explore an action on a given state to find a better policy. As a unique point of this work, the agent must explore not only the action but also the prediction order. For both explorations, we use the ε-greedy algorithm (Watkins, 1989), which stochastically forces the agent to ignore the Q-values and behave randomly (see Appendix A.2).
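A minimal sketch of the DQN objective in plain NumPy (q_online and q_target stand in for the two networks; the names and data layout are ours):

```python
import numpy as np

def dqn_loss(q_online, q_target, batch, gamma=0.995):
    """Mean squared TD error against a periodically-updated target network.

    batch: iterable of (s, a, r, s_next); q_online(s) and q_target(s) return
    per-action value vectors."""
    errors = []
    for s, a, r, s_next in batch:
        psi = r + gamma * float(np.max(q_target(s_next)))  # Bellman target
        errors.append((psi - float(q_online(s)[a])) ** 2)
    return float(np.mean(errors))

q_on = lambda s: np.array([0.0, 1.0])
q_tg = lambda s: np.array([0.0, 0.0])
loss = dqn_loss(q_on, q_tg, [("s0", 1, 1.0, "s1")])  # target 1.0, prediction 1.0
```
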
Inference. Our modeling, which provides y and x̂ for each step, has another advantage in terms of inference. For the final output, we use y at the t*-th step that achieves the best balance of the compression and reconstruction rates, where t* = arg max_t {cr_(t) + rr_(t)}. This is based on the trade-off relationship of compression and reconstruction, as seen in a precision-recall curve.

Table 1: An example of stepwise reward computation. The agent breaks the reconstruction constraint at step 3 when removing "force", so r_SR = −1, and no experiences are used after the violation occurred at step 3. r_SA is computed at step 3 as 0.5 × (0.25 + 0.5 × 0.1 + 1.0 × 0.1) = 0.20, and it is used for steps 1 and 2 as well. The hyperparameter settings are τ = 0.5, ρ = 0.3, α = 0.1, and β = 0.1.
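The selection of the final output by t* = arg max_t {cr(t) + rr(t)} amounts to the following (a sketch with our own data layout; ties here break toward the earliest step):

```python
def pick_output(history):
    """history: per-step (cr, rr, summary) tuples; return the summary at the
    step maximizing cr + rr."""
    t_star = max(range(len(history)), key=lambda t: history[t][0] + history[t][1])
    return history[t_star][2]

hist = [(0.1, 1.0, "a b c d"), (0.4, 0.8, "a c"), (0.6, 0.4, "a")]
best = pick_output(hist)   # sums are 1.1, 1.2, 1.0, so the middle step wins
```
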

Experiment
Baselines. We compare our proposed approach with three baselines: Lead-N, which simply takes the first N words as the summary; SEQ3, a recent encoder-decoder model (Baziotis et al., 2019); and CMatch, a new approach without explicit reconstruction learning (Zhou and Rush, 2019). To conduct a qualitative analysis of the generated summaries, we ran the baselines ourselves with a replicated model for SEQ3 [4] and the provided model for CMatch [5]. Also, we test two types of SEQ3 models: one tuned with a validation set (SEQ3+) and the other with the parameters at the last iteration of training (SEQ3−). This is because EALM and CMatch do not need paired data even for validation.
Proposed method. We implemented EALM as follows. The Q-network of the agent consists of a two-layered MLP with 200 units per layer and ReLU. We used the Adam optimizer with a learning rate of 0.001 and applied gradient clipping at 1. For the ε-greedy strategy, we first set the exploration probability to 0.9 and decay it by multiplying by 0.995 every 100 updates until it reaches the minimum exploration rate of 0.03. We set the discount factor γ to 0.995. The size of the replay buffer is 2000, and we sample 128 experiences as a batch for one update. As the final model, we use the parameters at the time when the averaged reward in the replay buffer is maximum, i.e., our model does not need a validation set. The hyperparameters of the step reward (τ, ρ; §4.3.3) are set to 0.5 and 0.3, respectively. The hyperparameters of the summary assessment (α, β; §4.3.3) are both set to 0.1. We train three models with the same configuration and report their averaged score, as Q-learning inherently contains randomness in training.

[4] https://github.com/cbaziotis/seq3. We ran the training script with the same configuration as the original paper except for decreasing the batch size from 128 to 32 due to our GPU limitation. We trained three models and obtained slightly lower scores than the ones reported in the original paper. We report the averaged score among the three models.
[5] https://github.com/jzhou316/Unsupervised-Sentence-Summarization
Dataset. Following Baziotis et al. (2019), we train our model on the Gigaword corpus (GIGA; Rush et al., 2015). However, we used only 30K sentences randomly picked from the sentences with fewer than 50 words for the training of EALM. This is because the whole dataset, 3.8M sentences, is too large for exposing the agent to different experiences from the same sentence. Note that we used the entire set of sentences for the training of the SEQ3 models.
We followed Baziotis et al. (2019) in the evaluation as well, using the test set consisting of the GIGA (1897 sentences) and DUC datasets (DUC3 with 624 sentences, DUC4 with 500 sentences; Over et al. 2007).
All models follow the same tokenization policy: the default tokenization in GIGA, DUC3, and DUC4. Although BERT (which EALM uses) has its own subword vocabulary, we do not apply subwording, in order to keep a single tokenization policy. Therefore, words not in the BERT vocabulary are treated as unknown words, and the ratio of unknown words was around 10% in GIGA.

Evaluation. In our quantitative analysis, we examine the ROUGE scores. To mitigate the bias toward longer sentences in the ROUGE calculation, we capped all summaries at the first 75 bytes. Note that the averaged sentence length of the gold summaries after the capping was 8.58, 9.59, and 10.25 for GIGA, DUC3, and DUC4, respectively. Also, we examine the sentence length (LEN) and the count of new words (NW; the number of words that are used in a generated summary but do not appear in the input sentence). Additionally, we show qualitative comparisons with a manual check of the generated summaries. Although a questionnaire survey is often conducted to assess the deeper quality of summaries such as informativeness and readability, it still hides the exact points of a model's strengths and weaknesses. We consider that specific indications provide insights on future work for the current unsupervised summarizers. We manually checked more than 200 summaries for each model and each dataset and include a few samples in Appendix A.6.
Results.  of new words was the lowest.
CMatch achieved the highest ROUGE scores and meaningful summary lengths on GIGA. Its R-2 and R-L scores were superior to the others by about two points, which means CMatch captured not only salient words but also word co-occurrences. However, for generating summaries, CMatch uses a language model trained on the gold summaries in GIGA. In other words, it may simply store internally the probable word distributions of summary sentences in GIGA. Indeed, its results on DUC3 and DUC4 were not better than those on GIGA. Even though CMatch does not require paired data, it is not practical to collect enough summaries to train a language model for each domain.
SEQ3 showed competitive performance with the other models, but its scores dropped when no validation set was available. The requirement of a validation set is a serious disadvantage because creating input-summary pairs comes at a significant human labor cost.
While almost all of the best scores were given by the statistical models, Lead-15 also performed competitively. This result indicates that unsupervised summarization methods cannot yet overcome this trivial baseline. One significant barrier preventing the progress of unsupervised methods is presumably the difficulty of rephrasing. For writing a good summary, condensing a longer expression into a shorter form is essential. As seen in the NW column in Table 2, the number of new words was less than one in SEQ3+, CMatch, and EALM. The current models tend to operate just by copy-and-paste, which is consistent with the report by Baziotis et al. (2019).
Finally, we manually assessed the summaries produced by each model and sum up their pros and cons in Table 3, with samples in Table 4. First, we found that a summary by SEQ3 was likely to be an exact copy of the beginning of the input sentence, but it kept sentences grammatical and informative. Rephrasing by SEQ3 did not meet our expectations in most cases, such as changing a day of the week (e.g., Wednesday to Thursday) or a common adjective to a possessive pronoun (e.g., astronomical to her). CMatch stably generated fluent summaries on GIGA, as seen in the ROUGE scores. It also generated grammatically correct sentences, e.g., with number agreement (nec agrees ...). On the DUC datasets, however, meaningless summaries increased, such as those containing no important information (e.g., nasa observes). Relatedly, CMatch's summaries on DUC3 and DUC4 were too short, and we found that more than half of the summaries consisted of less than or equal to 5 words. Finally, EALM's outputs tended to be longer due to containing non-informative portions (e.g., nasa officials told ...). They were also likely to be ungrammatical due to leaving only a functional word (e.g., mechanical problems that threaten ...) or deleting required prepositions (e.g., ... agreed (to) join ...). Those failures resulted in lower readability. However, EALM tried to keep keywords from the whole input even when they appear at later positions in a sentence, which is also supported by the relatively higher scores of R-1 and R-L. Although this tendency caused less readable and ungrammatical summaries, refining this behavior of EALM is an interesting research direction.

Conclusion
We brought the Q-learning framework into unsupervised text summarization and proposed a new method, EALM, an edit-based unsupervised summarizer leveraging a Q-learning agent and a language model. The experiments showed that EALM performed competitively with the previous encoder-decoder-based methods. However, in the qualitative analysis, we found that the quality of the summaries generated by any of the unsupervised models was not sufficient, and each model has its individual limitations. These issues must be overcome as a step toward generating practically useful summaries without paired data. For EALM in particular, there is room for improvement by importing the latest techniques in RL. Our work paves the way for further research on bridging Q-learning and unsupervised text summarization.

A.1 Autoregressive Prediction with MLM
Algorithm 1 describes the autoregressive prediction with MLM, which we used when an input contains multiple masks.
Algorithm 1: Autoregressive prediction with MLM. Input: a sentence x that includes [MASK]s. Output: the sentence x after replacing all [MASK]s with predicted words.

A.2 Exploration of Prediction Order
As explained in section 4.3 of the main paper, the editorial agent explores the order in which to predict. While the agent basically chooses the state with the maximum Q-value as the next state, we sometimes pick the most uncertain state instead. We define the uncertainty of a state by the entropy of the action values as

    H(s) = − Σ_{a∈A} Q(s, a) log Q(s, a),

and then s* and a* are selected as

    s* = arg max_{s ∈ s_(t)} H(s),   a* = arg max_{a∈A} Q(s*, a).
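The entropy-based pick can be sketched as follows. Note that the Q-values are used directly in H(s) as in the appendix's equation, which implicitly assumes they are positive; the helper names are ours:

```python
import math

def state_entropy(q_row):
    """H(s) = -sum_a Q(s, a) log Q(s, a); a heuristic since Q-values are not
    probabilities (zero and negative values are skipped here)."""
    return -sum(q * math.log(q) for q in q_row if q > 0)

def most_uncertain(q_values, operated):
    """Among unoperated words, pick the index with maximal action-value entropy."""
    candidates = [i for i in range(len(q_values)) if not operated[i]]
    return max(candidates, key=lambda i: state_entropy(q_values[i]))

q = [[0.98, 0.01, 0.01],   # confident -> low entropy
     [0.34, 0.33, 0.33]]   # near-uniform -> high entropy
idx = most_uncertain(q, operated=[False, False])
```

The near-uniform row is the more uncertain state, so index 1 is chosen.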

A.3 Semantic Similarity and Log-likelihood Computation in Summary Assessment
Semantic Similarity. We use a pre-trained model to predict the semantic similarity of paired sentences with their BERT encodings. The model is trained in a supervised manner with pairs of sentences and their similarity scores. The original library outputs a real-valued score in the range of [0, 5], which we normalize to [0, 1].
Log-likelihood. We compute the log-likelihood of a compressed sentence by using BERT as follows (Wang and Cho, 2019):

    (1/M) Σ_{i=1}^{M} log P(y_i | y_\i).
However, our llh function performs thresholding; namely, it returns 1 if the score is above a threshold and otherwise 0, because the raw log-likelihood score is not scaled with the other rewards. We empirically set the threshold to 0.005.

A.4 Relaxations in rr (t) Calculation
The calculation of the reconstruction rate introduced in section 4.3.3 is based on an exact match of each word of x and x̂. Given the ambiguity of natural language, this is very strict, so the agent rarely acquires rewards. We relax this situation by 1) excluding stop words from the calculation and 2) comparing against the top-k candidates. The equation of rr can therefore be formally rewritten as

    rr = ( Σ_{i : x_i ∉ W} 1[ x_i ∈ L_k(z_\i) ] ) / |{ i : x_i ∉ W }|,

where L_k(z_\i) returns the top-k probable words for the i-th position and W is a pre-defined set of words. We set k = 10. We used common stop words (e.g., him, the) and infrequent words in GIGA for W.
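The relaxed reconstruction rate can be sketched as follows (the helper names and the handling of all-stop-word inputs are our assumptions):

```python
def relaxed_rr(x_words, topk_candidates, stopwords):
    """Reconstruction rate with both relaxations: stop words are excluded, and a
    position counts as recovered if x_i appears among the top-k MLM candidates.

    topk_candidates[i]: the top-k predicted words for position i."""
    indices = [i for i, w in enumerate(x_words) if w not in stopwords]
    if not indices:
        return 1.0   # nothing to recover (our convention)
    hits = sum(x_words[i] in topk_candidates[i] for i in indices)
    return hits / len(indices)

x = ["the", "cat", "sat"]
topk = [["a", "an"], ["cat", "dog"], ["ran", "slept"]]
score = relaxed_rr(x, topk, stopwords={"the"})   # 1 hit out of 2 scored positions
```
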

A.5 Experimental Details
Computing Infrastructure. We ran the models on a machine with the following specifications:
• Ubuntu 18.04
• Intel(R) Xeon(R) @ 2.60GHz
• RAM 120GB
• NVIDIA Tesla P100

Model Size. In EALM, the number of trainable parameters was 348,208 in our experimental setting, all of which belong to the editorial agent. There are no trainable parameters in the language model.
Hyperparameter Search. We did not conduct a hyperparameter search. We empirically determined the values described in the main paper (the "Proposed method" paragraph in §5).
Runtime Speed. EALM processes a sentence in three seconds on average on the above GPU.

A.6 Generated Summaries
Samples of the summaries generated by each model are listed in the tables on the following pages. The examples are taken from the first sentences for GIGA and randomly picked for DUC3 and DUC4. We also include the human-generated summaries (i.e., gold references).

INPUT Human
SEQ3 CMatch EALM endeavour 's astronauts connected the first two building blocks of the international space station on sunday , creating a seven-story tower in the shuttle cargo bay .
First 2 building blocks of international space station successfully joined.
endeavour 's astronauts connected the first two building blocks of the international space station endeavour 's astronauts create a shuttle connected first building blocks of international space on creating tower in shuttle bay in a cocoon of loyal and wealthy supporters , president clinton said friday that he must " live with the consequences " of his mistakes , although he contended that democrats should take pride in the achievements of his presidency and take heart from its possibilities .
Clinton supports candidates, speaks at fundraisers, acknowledges mistakes.
in a her of loyal and wealthy supporters , president clinton said tuesday that must of loyal and wealthy supporters , clinton " . democrats take pride in presidency in a of loyal and wealthy supporters president clinton said that must live with consequences of mistakes although he that should pride in achievements of presidency and heart from possibilities on the eve of a holiday that has been linked to antiabortion violence , the authorities on tuesday were investigating whether a picture of an aborted fetus sent to a canadian newspaper was connected to last month 's fatal shooting of a buffalo , n.y. doctor who provided abortions or four similar attacks in western new york and canada since 1994 .
Anti-abortion flyer in Canada may be related to Buffalo clinic slaying on the eve of a holiday that has been linked to her violence , authorities of holiday that has been linked to her violence , on the eve of a holiday on eve of holiday that has been linked to violence on investigating picture of an sent to canadian newspaper was connected to last month fatal shooting of buffalo , doctor who provided or similar attacks in western new york and canada since 1994 famine-threatened north korea 's harvest will be no better this year than last and could be worse , a senior u.n. aid official said saturday .
World Food Program reports famine may have killed 2 million North Koreans her north korea 's harvest will be no better this year than last worse south korea 's zhan north korea harvest better last could worse senior aid official said matthew wayne shepard , the gay student who was beaten in the dead of night , tied to a fence and left to die alone , was mourned at his funeral friday by 1,000 people , including many who had never met him .
Matthew Shepard eulogized as one who wanted to make people's lives better matthew wayne her , the gay student who was beaten in the dead of night , gay student who was beaten him us ceos , gay who beaten in dead of night tied to fence and left to die alone was at funeral by including who had met a delegation of chilean legislators lobbying against the possible extradition of augusto pinochet to spain to face trial , warned thursday that chile was on the brink of political turmoil .