Deep Reinforcement Learning with Distributional Semantic Rewards for Abstractive Summarization

Deep reinforcement learning (RL) has been a commonly used strategy for the abstractive summarization task to address both the exposure bias and non-differentiable task issues. However, the conventional reward Rouge-L simply looks for exact n-gram matches between candidates and annotated references, which inevitably makes the generated sentences repetitive and incoherent. In this paper, instead of Rouge-L, we explore the practicability of utilizing distributional semantics to measure the degree of matching. With distributional semantics, sentence-level evaluation can be obtained, and semantically correct phrases can also be generated without being limited to the surface form of the reference sentences. Human judgments on the Gigaword and CNN/Daily Mail datasets show that our proposed distributional semantics reward (DSR) is distinctly superior in capturing the lexical and compositional diversity of natural language.

Many innovative deep RL methods (Ranzato et al., 2016; Wu et al., 2016; Paulus et al., 2018; Lamb et al., 2016) are developed to alleviate this issue by providing sentence-level feedback after generating a complete sentence, in addition to optimal transport usage (Napoles et al., 2012). However, commonly used automatic evaluation metrics for generating sentence-level rewards count exact n-gram matches and are not robust to different words that share similar meanings, because they provide no semantic-level reward.
Currently, many studies on contextualized word representations (Peters et al., 2018; Devlin et al., 2019) show that they are highly capable of capturing distributional semantics. In this paper, we propose to use the distributional semantic reward to boost the RL-based abstractive summarization system. Moreover, we design several novel objective functions.
Experimental results show that they outperform the conventional objectives while increasing sentence fluency. Our main contributions are three-fold:
• We are the first to introduce DSR to abstractive summarization and achieve better results than conventional rewards.
• Unlike ROUGE, our DSR does not rely on cross-entropy loss (XENT) to produce readable phrases. Thus, no exposure bias is introduced.
• DSR improves the diversity and fluency of generated tokens while avoiding unnecessary repetitions.

Methodology
Background While sequence models are usually trained using XENT, they are typically evaluated at test time using discrete NLP metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). Therefore, they suffer from both the exposure bias and non-differentiable task metric issues. To solve these problems, many resort to deep RL with sequence-to-sequence models (Paulus et al., 2018; Ranzato et al., 2016; Ryang and Abekawa, 2012), where the learning agent interacts with a given environment. However, RL models have poor sample efficiency, which leads to a very slow convergence rate. Therefore, RL methods usually start from a pretrained policy, which is established by optimizing XENT at each word generation step.
$$L_{XENT} = -\sum_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x). \tag{1}$$
Then, during the RL stage, the conventional approach is to adopt the self-critical strategy and fine-tune on the target evaluation metric.
Distributional Semantic Reward When evaluating the quality of generated sentences, ROUGE looks for exact matches between references and generations, which naturally overlooks the expressive diversity of natural language. In other words, it fails to capture the semantic relation between similar words. Distributional semantic representations are a practical way to solve this problem. Recent works on contextualized word representations, including ELMO (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019), show that distributional semantics can be captured effectively. Building on that, a recent study called BERTSCORE (Zhang et al., 2019) focuses on sentence-level generation evaluation by using pre-trained BERT contextualized embeddings to compute the similarity between two sentences as a weighted aggregation of cosine similarities between their tokens. It has a higher correlation with human evaluation on text generation tasks than existing evaluation metrics.
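In this paper, we introduce it as a DSR for deep RL. BERTSCORE is defined as:
$$R_{BERT} = \frac{\sum_{y_i \in y} \mathrm{idf}(y_i) \max_{\hat{y}_j \in \hat{y}} \mathbf{y}_i^{\top} \hat{\mathbf{y}}_j}{\sum_{y_i \in y} \mathrm{idf}(y_i)} \tag{3}$$
$$P_{BERT} = \frac{\sum_{\hat{y}_j \in \hat{y}} \mathrm{idf}(\hat{y}_j) \max_{y_i \in y} \mathbf{y}_i^{\top} \hat{\mathbf{y}}_j}{\sum_{\hat{y}_j \in \hat{y}} \mathrm{idf}(\hat{y}_j)} \tag{4}$$
$$F_{BERT} = \frac{2 \, P_{BERT} \, R_{BERT}}{P_{BERT} + R_{BERT}} \tag{5}$$
where $\mathbf{y}$ and $\hat{\mathbf{y}}$ represent the BERT contextual embeddings of reference word $y$ and candidate word $\hat{y}$, respectively. The function idf(·) calculates the inverse document frequency (idf). In our DSR, we do not use the idf, because computing it as in Zhang et al. (2019) requires the entire dataset, including the test set. Besides, ROUGE does not use a similar weighting, so we omit idf for consistency.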
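To make the idf-free reward concrete, the following is a minimal sketch of greedy-matching precision, recall and F over token embeddings. It is not the authors' implementation: the function names and the toy two-dimensional vectors are hypothetical, and in practice the embeddings would come from a pre-trained BERT model rather than being written by hand. All vectors are assumed to be non-zero.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length so that dot products equal cosine similarity.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def bert_score_without_idf(ref_embs, cand_embs):
    """Return (precision, recall, f) from greedy cosine matching, with no idf weights.

    ref_embs / cand_embs: lists of non-zero embedding vectors, one per token
    (hypothetical inputs; a real system would obtain them from BERT).
    """
    ref = [_normalize(v) for v in ref_embs]
    cand = [_normalize(v) for v in cand_embs]
    # sim[i][j] = cosine similarity between reference token i and candidate token j.
    sim = [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(row) for row in sim) / len(ref)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(sim[i][j] for i in range(len(ref))) for j in range(len(cand))
    ) / len(cand)
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Toy usage: the candidate covers only one of the two reference tokens.
p, r, f = bert_score_without_idf([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

In this toy case the single candidate token matches one reference token exactly, so precision is 1 while recall is 0.5, which illustrates how the F value penalizes a summary that omits reference content.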

Datasets
Gigaword corpus It is an English sentence summarization dataset based on annotated Gigaword (Napoles et al., 2012). Each short article is paired with a single-sentence summary. We use the version provided by OpenNMT. It contains 3.8M training and 189k development instances. We randomly sample 15k instances as our test data.
CNN/Daily Mail dataset It consists of online news articles and their corresponding multi-sentence abstracts (Hermann et al., 2015; Nallapati et al., 2016). We use the non-anonymized version provided by See et al. (2017), which contains 287k training, 13k validation, and 11k testing examples. We truncate the articles to 400 tokens and limit the summary lengths to 100 tokens.

Pretrain
We first pretrain a sequence-to-sequence model with attention using XENT and then select the best parameters to initialize the models for RL. Our models have 256-dimensional hidden states and 128-dimensional word embeddings, and they also incorporate the pointer mechanism (See et al., 2017) for handling out-of-vocabulary words.

Baseline
In abstractive summarization, ROUGE (Lin, 2004) is a common evaluation metric to provide a sentence-level reward for RL. However, using ROUGE as a pure RL objective may cause too many repetitions and reduced fluency in outputs. Paulus et al. (2018) propose a hybrid learning objective that combines XENT and the self-critical ROUGE reward:
$$L_{Rouge+XENT} = \gamma L_{Rouge} + (1 - \gamma) L_{XENT} \tag{6}$$
where γ is a scaling factor and the F score of ROUGE-L is used as the reward to calculate L_Rouge. In our experiment, we select γ = 0.998 for the Gigaword corpus and γ = 0.9984 for the CNN/Daily Mail dataset. Note that we do not avoid repetition at test time as Paulus et al. (2018) do, because we want to examine the repetition of sentences produced directly after training.
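As an illustration of how a hybrid objective of this kind can be assembled, here is a minimal sketch using plain Python numbers. It assumes the commonly used self-critical form, in which the reward of a greedy (baseline) output is compared with the reward of a sampled output; the function names are hypothetical, and the rewards are treated as constants, as they would be when no gradient flows through the metric.

```python
def self_critical_loss(sample_log_probs, sample_reward, baseline_reward):
    # (r(baseline) - r(sample)) * sum_t log p(sampled token t).
    # When the sample beats the baseline, minimizing this raises its log-probability.
    return (baseline_reward - sample_reward) * sum(sample_log_probs)


def xent_loss(reference_log_probs):
    # Negative log-likelihood of the reference tokens under teacher forcing.
    return -sum(reference_log_probs)


def hybrid_loss(rl_loss, xent, gamma):
    # gamma scales the RL term; (1 - gamma) scales the XENT term.
    return gamma * rl_loss + (1.0 - gamma) * xent


# Toy usage with made-up log-probabilities and rewards.
rl = self_critical_loss([-0.5, -1.0], sample_reward=0.6, baseline_reward=0.4)
xe = xent_loss([-0.2, -0.3])
total = hybrid_loss(rl, xe, gamma=0.998)
```

With γ close to 1, as in the values reported above, the RL term dominates and XENT acts only as a light regularizer toward the reference wording. Swapping the reward function (for example, from ROUGE-L to F_BERT) changes only the two reward arguments.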

Proposed Objective Functions
Inspired by the above objective function (Paulus et al., 2018), we optimize RL models with a loss function similar to equation 2. Instead of ROUGE-L, we incorporate BERTSCORE, a DSR, to provide sentence-level feedback. In our experiment, L_DSR is the self-critical RL loss (equation 2) with F_BERT as the reward. We introduce the following objective functions:
In our experiment, we select γ = 0.5 for both datasets to balance the influence of the two reward functions.
DSR+XENT: F_BERT reward with XENT to make the generated phrases more readable.
$$L_{DSR+XENT} = \gamma L_{DSR} + (1 - \gamma) L_{XENT} \tag{8}$$
In our experiment, we select γ = 0.998 for the Gigaword corpus and γ = 0.9984 for the CNN/Daily Mail dataset.
DSR: Pure F_BERT objective function without any teacher forcing.
[...] and DSR+XENT (equation 8) obtain the best BERTScores, as expected. It is also expected that the ROUGE model obtains worse BERTScores, since simply optimizing ROUGE generates less readable sentences (Paulus et al., 2018). However, the DSR model, without XENT as teacher forcing, improves on the pretrained model in both F_BERT and ROUGE-L. Note that the DSR model's ROUGE-L is high at training time but does not generalize well to the test set, and ROUGE-L is not our target evaluation metric. In the next section, we conduct a human evaluation to analyze the summarization performance of the different reward systems.

Human Evaluation
We perform human evaluation on Amazon Mechanical Turk to verify the benefits of DSR for the coherence and fluency of output sentences. We randomly sample 500 items as an evaluation set [...]. As shown in Table 2, the DSR and DSR+XENT models significantly improve the relevance and fluency of the generated summaries. In addition, pure DSR achieves better performance on the Gigaword corpus and results comparable to DSR+XENT on CNN/Daily Mail. While Paulus et al.'s (2018) objective function requires XENT to make generated sentences readable, our proposed DSR does not require XENT and can limit the exposure bias that originates from it.

Analysis
Diversity Unlike extractive summarization, abstractive summarization allows more degrees of freedom in the choice of words. While simply selecting words from the article makes the task easier to train, a larger action space can provide more paths to potentially better results (Nema et al., 2017). Using the DSR as the deep RL reward encourages models to choose actions that are not n-grams of the article. In Table 4, we list a few generated samples on the Gigaword corpus. In the first example in Table 4, the word "sino-german" provides an interesting and efficient way to express the relation between China and Germany. F_BERT is also improved by this change. In addition, the second example in Table 4 shows that the RL model with DSR corrects the sentence's grammar and significantly improves the F_BERT score by switching "down" to an unseen word, "drops". On the other hand, when optimizing DSR to improve the diversity of generation, some semantically similar words may also be generated and harm the summarization quality, as shown in the third example in Table 4. The new token "wins" reduces the scores of both metrics. We also evaluate the diversity of a model quantitatively by averaging the percentage of out-of-article n-grams in generated sentences. Results can be found in Table 3. The DSR model achieves the highest diversity.
Table 3: Qualitative analysis on repetition (Rep) / diversity (Div). They are calculated as the percentage of repeated/out-of-article n-grams (unigrams for Gigaword and 5-grams for CNN/Daily Mail) in generated sentences.
1) Context: economic links between china and germany will be at the center of talks here next month between li peng and chancellor helmut kohl when the chinese premier makes a week-long visit to germany , sources on both sides said .
Table 4: Qualitative analysis of generated samples on the Gigaword corpus. Generated words that do not appear in the context are marked blue. Repeated words are marked red. The first two examples show that DSR's generated tokens are more diverse. However, it may suffer from problems, as shown in examples 3 and 4.
Repetition Repetition leads to lower F_BERT, as shown in the last example in Table 4. Using DSR reduces the probability of producing repetitions. The average percentage of repeated n-grams in generated sentences is presented in Table 3. As shown in this table, unlike ROUGE, the DSR model can achieve high fluency without XENT; moreover, it produces the fewest repetitions among all the rewards. Table 4 includes an example in which DSR produces a repeated word (example 4), but this does not reflect the overall distribution of repeated word generation across the evaluated models.
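The repetition and diversity percentages discussed above can be sketched as follows. The exact counting rules are not specified in the text, so this sketch makes two assumptions: an n-gram occurrence counts as repeated if the same n-gram has already appeared earlier in the summary, and an n-gram counts as out-of-article if it does not appear anywhere in the source article. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repetition_rate(summary_tokens, n=1):
    # Fraction of n-gram occurrences that duplicate an earlier occurrence (assumed definition).
    grams = ngrams(summary_tokens, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(grams)


def out_of_article_rate(summary_tokens, article_tokens, n=1):
    # Fraction of summary n-grams that never occur in the article (assumed definition).
    grams = ngrams(summary_tokens, n)
    if not grams:
        return 0.0
    article_grams = set(ngrams(article_tokens, n))
    novel = sum(1 for gram in grams if gram not in article_grams)
    return novel / len(grams)


# Toy usage on whitespace-tokenized strings.
summary = "a b a c".split()
article = "a b d".split()
rep = repetition_rate(summary, n=1)
div = out_of_article_rate(summary, article, n=1)
```

Averaging these per-summary rates over a test set, with n = 1 for Gigaword and n = 5 for CNN/Daily Mail as stated in the Table 3 caption, would give corpus-level figures of the kind reported there.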

Conclusion
This paper demonstrates the effectiveness of applying a distributional semantic reward, specifically BERTSCORE, to reinforcement learning for abstractive summarization. Our experimental results show better performance on the Gigaword and CNN/Daily Mail datasets. In addition, the generated sentences have fewer repetitions, and fluency is also improved. Our finding is aligned with a contemporaneous study (Wieting et al., 2019) on leveraging semantic similarity for machine translation.