Inference Time Style Control for Summarization

How can we generate summaries in different styles without requiring corpora in the target styles or training separate models? We present two novel methods that can be deployed during summary decoding on any pre-trained Transformer-based summarization model. (1) Decoder state adjustment instantly modifies decoder final states with externally trained style scorers, to iteratively refine the output toward a target style. (2) Word unit prediction constrains word usage to impose strong lexical control during generation. In experiments on summarization with simplicity control, both automatic evaluation and human judges find that our models produce output in simpler language while remaining informative. We also generate news headlines with various ideological leanings, which humans can distinguish with reasonable accuracy.


Introduction
Generating summaries with different language styles can benefit readers of varying literacy levels (Chandrasekaran et al., 2020) or interests (Jin et al., 2020). Significant progress has been made in abstractive summarization with large pre-trained Transformers (Dong et al., 2019; Lewis et al., 2020; Zhang et al., 2019; Raffel et al., 2019; Song et al., 2019). However, style-controlled summarization is much less studied (Chandrasekaran et al., 2020), and two key challenges have been identified: (1) lack of parallel data, and (2) expensive (re)training, e.g., separate summarizers must be trained or fine-tuned for a pre-defined set of styles (Zhang et al., 2018). Both challenges call for inference time methods built upon trained summarization models, to adjust styles flexibly and efficiently.
To address these challenges, we investigate just-in-time style control techniques that can be directly applied to any pre-trained sequence-to-sequence (seq2seq) summarization model. We study two methods that leverage external classifiers to favor the generation of words for a given style. First, decoder state adjustment is proposed to alter the decoder final states with feedback signaled by style scorers, which are trained to capture global properties. Second, to offer stronger lexical control, we introduce word unit prediction that directly constrains the output vocabulary. Example system outputs are displayed in Fig. 1. Notably, our techniques are deployed at inference time, so the summary style can be adaptively adjusted during decoding.

[Figure 1: Daily Mail article excerpt and example system outputs. Article: ". . . A 16-year-old who was born a girl but identifies as a boy has been granted the opportunity to go through male puberty thanks to hormone treatment. . . . The transgender boy, who has felt as though he is living in the wrong body since he was a child, has been given permission by a Brisbane-based judge to receive testosterone injections . . ." (a) Decoder State Adjustment: "Queensland teen has been granted hormone treatment. The 16-year-old was born a girl but identifies as a boy. . . . A judge has granted the teen permission to receive testosterone injections. . . ." (b) Word Unit Prediction: "A 16-year-old who was born a girl has been given the right to go through male puberty. The transgender boy has lived in a female body since he was a . . ."]
We experiment with two tasks: (1) simplicity control for document summarization with CNN/Daily Mail, and (2) headline generation with various ideological stances on news articles from the SemEval task (Kiesel et al., 2019) and a newly curated corpus of multi-perspective stories from AllSides. In this work, the algorithms are experimented with the BART model (Lewis et al., 2020), though they also work with other Transformer models. Both automatic and human evaluations show that our models produce summaries in simpler language than competitive baselines, with informativeness on par with a vanilla BART. Moreover, headlines generated by our models embody stronger ideological leaning than nontrivial comparisons.


Related Work

Summarizing documents into different styles has mainly been studied on news articles, where one appends style codes as extra embeddings to the encoder (Fan et al., 2018) or connects separate decoders with a shared encoder (Zhang et al., 2018). Similar to our work, Jin et al. (2020) leverage large pre-trained seq2seq models, but they modify the model architecture by adding extra style-specific parameters. Nonetheless, existing work requires training new summarizers for different target styles or modifying the model structure. In contrast, our methods only affect decoder states or lexical choices during inference, allowing on-demand style adjustment for summary generation.

Style-controlled text generation has received significant research attention, especially where parallel data is scant (Lample et al., 2019; Shang et al., 2019; He et al., 2020). Typical solutions involve disentangling style representations from content representations, and are often built upon autoencoders (Hu et al., 2017) with adversarial training objectives (Yang et al., 2018); the target style is then plugged in during generation. Recently, Dathathri et al. (2020) proposed plug-and-play language models (PPLMs), which alter the generation style by modifying all key-value pairs in the Transformer and thus require heavy computation during inference. Krause et al. (2020) then employ a generative discriminator (GeDi) to improve efficiency. Our methods are more efficient, since we only modify the decoder final states or curtail the vocabulary.

Global Characteristic Control via Decoder State Adjustment
Given a style classifier q(z|·) that measures the extent to which the currently generated summary resembles style z, we use its estimate to adjust the final decoder layer's state o_t at step t with gradient descent, as illustrated in Fig. 2. The output token is produced as p(y_t | y_1:t−1, x) = softmax(W_e o_t), where W_e is the embedding matrix.

Concretely, to generate the t-th token, a style score q(z|y_1:t+2) is first computed. In addition to what has been generated up to step t−1, we also sample y_t and two future tokens for style estimation. The decoder state is then updated with gradient steps that increase the style score:

o_t ← o_t + λ · ∂q(z|y_1:t+2) / ∂o_t

where λ is the step size. The gradient update is run for 10 iterations for document summarization and 30 iterations for headline generation. Below, we define one discriminative and one generative style classifier to illustrate the method.
Discriminative Style Scorer. We feed the tokens into a RoBERTa encoder and use the contextualized representation of the BOS token, i.e., h_0, to predict the style distribution as p_sty(z|·) = softmax(W_s h_0), where W_* denotes learnable parameters throughout this paper. At step t of summary decoding, the style score is estimated as q(z|y_1:t+2) = log p_sty(z|y_1:t+2). For the discriminative style scorer, the step size λ is set to 1.0.
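A minimal numpy sketch of the gradient-based state adjustment, assuming (for illustration only) a linear style scorer in place of the RoBERTa-based one, so the gradient of the log-softmax score can be written in closed form; all names and sizes here are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adjust_decoder_state(o_t, W_s, z, lam=1.0, n_iters=10):
    """Gradient steps on the decoder state o_t that raise the style score
    q(z|.) = log softmax(W_s o_t)[z].  For linear logits, the gradient of
    the log-softmax w.r.t. o_t is W_s[z] - sum_k p_k W_s[k]."""
    o_t = o_t.copy()
    for _ in range(n_iters):
        p = softmax(W_s @ o_t)
        grad = W_s[z] - p @ W_s
        o_t += lam * grad
    return o_t

rng = np.random.default_rng(0)
d, n_styles = 16, 2
W_s = rng.normal(size=(n_styles, d))   # toy stand-in for the style scorer
o_t = rng.normal(size=d)               # toy decoder final state
z = 0                                  # target style index
before = softmax(W_s @ o_t)[z]
after = softmax(W_s @ adjust_decoder_state(o_t, W_s, z, lam=0.1))[z]
assert after > before  # the target style becomes more probable
```

With the actual RoBERTa scorer, the gradient would instead be obtained by backpropagating the score through the classifier to o_t.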
Generative Language Model Scorer. We build a class-conditional language model (CC-LM) from texts prepended with special style-indicating tokens. Concretely, the CC-LM yields probabilities p_LM(y_t′ | y_1:t′−1, z) (p_LM(y_t′, z) for short), conditioned on the previously generated tokens y_1:t′−1 and the style z. As the summarizer's output probability p(y_t′) should be close to the language model's estimate, the style score is defined as the negative divergence between the two distributions at the sampled steps:

q(z|y_1:t+2) = − Σ_{t′=t}^{t+2} KL( p(y_t′) ∥ p_LM(y_t′, z) )

Here we use a step size λ of 0.1.
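The "closeness" between the summarizer's next-token distribution and the CC-LM's style-conditioned estimate can be sketched with a plain KL divergence over a toy vocabulary; the distributions below are illustrative stand-ins, not model outputs:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the vocabulary."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

# Toy next-token distributions over a 4-word vocabulary.
p_summ = np.array([0.7, 0.1, 0.1, 0.1])       # summarizer output p(y_t)
p_lm_style = np.array([0.6, 0.2, 0.1, 0.1])   # CC-LM conditioned on style z
p_lm_other = np.array([0.1, 0.1, 0.2, 0.6])   # CC-LM conditioned on another style

# A higher (less negative) score means the summary is closer to style z.
score_z = -kl(p_summ, p_lm_style)
score_other = -kl(p_summ, p_lm_other)
assert score_z > score_other
```

In the full method this score is differentiated with respect to the decoder state o_t, which requires the summarizer's softmax to be computed from o_t as described above.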

Lexical Control via Word Unit Prediction
Lexical control is another tool for managing summary style, as word choice provides a strong signal of language style. Given an input document, our goal is to predict a set of word units (e.g., the subwords used in BART pre-training) that can be used for summary generation. For instance, if the input contains "affix", we will predict "stick" to be used while excluding the original word "affix". A similar idea has been used to expedite sequence generation (Hashimoto and Tsuruoka, 2019), though our goal here is to estimate the probabilities of different lexical choices.
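Constraining generation to a predicted set of word units amounts to masking the logits of all other units before the softmax at each decoding step. A minimal sketch, where the per-unit probabilities are a random stand-in for the predictor's output and the sizes are illustrative:

```python
import numpy as np

def restricted_logits(logits, unit_probs, v, allowed_extra=()):
    """Keep only the top-v word units (plus any extra allowed ids, e.g.
    entity tokens from the input) and set all other logits to -inf."""
    top_v = np.argsort(unit_probs)[-v:]
    allowed = list(set(top_v.tolist()) | set(allowed_extra))
    masked = np.full_like(logits, -np.inf)
    masked[allowed] = logits[allowed]
    return masked

vocab = 10
rng = np.random.default_rng(1)
logits = rng.normal(size=vocab)          # decoder logits at one step
unit_probs = rng.uniform(size=vocab)     # stand-in for the predictor's scores
masked = restricted_logits(logits, unit_probs, v=3, allowed_extra=[0])
assert np.isfinite(masked).sum() <= 4    # at most v units + 1 extra survive
```

Applying the softmax to `masked` then assigns zero probability to every unit outside the predicted vocabulary.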
Concretely, after encoding the input x with RoBERTa, we take the average of all tokens' contextual representations and pass it through a residual block (He et al., 2016) to get the final representation R̄. We then compute a probability vector over all word units in the vocabulary as p_r = sigmoid(W_r R̄). The top v word units with the highest probabilities are selected and combined with entity names from the input to form the new vocabulary, from which the summary is generated. We use v = 1000 in all experiments.

Dynamic Prediction. We also experiment with a dynamic version, where the word unit predictor further considers what has been generated up to a given step. In this way, the new vocabulary is updated every m steps (m = 5 for document summarization, and m = 3 for headline generation).


Simplicity-controlled Document Summarization

We use the paragraph pairs of (2019) for simplicity style scorer and class-conditional language model training. We split the pairs into 86,467, 10,778, and 10,788 for training, validation, and testing, respectively. On the test set, our simplicity style scorer achieves an F1 score of 89.7, and our class-conditional language model achieves a perplexity of 30.35. To learn the word unit predictor, for each paragraph pair, the predictor reads in the normal version and is trained to predict the word units used in the simple version.

For comparison, we consider RERANKING beams based on our style score at the last step. We also use a label-controlled (LBLCTRL) baseline as described in Niu and Bansal (2018), where summaries in the training data are labeled as simple or normal by our scorer. We further compare with GEDI and two pipeline models: a style transfer model (Hu et al., 2017) applied to the output of BART (CTRLGEN), and a normal-to-simple translation model fine-tuned from BART (TRANS), both trained on Wikipedia. Finally, we consider LIGHTLS (Glavaš and Štajner, 2015), a rule-based lexical simplification model.
Automatic Evaluation. Table 1 shows that our models' outputs have significantly better simplicity and readability while preserving fluency and a comparable amount of salient content. Key metrics include the simplicity level estimated by our scorer and Dale-Chall readability (Chall and Dale, 1995). We use GPT-2 perplexity (Radford et al., 2019) to measure fluency, and BERTScore for content preservation. Our inference time style control modules can adaptively change the output style, and thus outperform reranking at the end of generation or using pipeline models. Moreover, by iteratively adjusting the decoder states, our methods deliver stronger style control than GEDI, which only adjusts the probability once per step.

When comparing among our models, we find that word unit prediction is more effective at lexical simplification than updating decoder states, as demonstrated by the higher usage of simple words according to the Dale-Chall list. We believe that strong lexical control is achieved by directly pruning the output vocabulary, whereas decoder state adjustment is better poised to capture global properties, e.g., sentence compression as shown in Fig. 1. Moreover, we compute the edit distance between our style-controlled system outputs and the summaries produced by the fine-tuned BART. Adjusting decoder states with the style scorer and the language model yields edit distances of 45.7 and 47.4, compared to larger distances of 56.7 and 54.3 given by word unit prediction and its dynamic variant.

Human Evaluation. We recruit three fluent English speakers to evaluate system summaries for informativeness (whether the summary covers important information from the input) and fluency (whether the summary is grammatical), on a scale of 1 (worst) to 5 (best). They then rank the summaries by simplicity level (ties are allowed). 50 samples are randomly selected for evaluation, and system summaries are shuffled.
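The edit distances reported above can be computed with a standard token-level Levenshtein distance; a minimal dynamic-programming sketch (the sequences here are illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists),
    using a single-row DP table."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

assert edit_distance("kitten", "sitting") == 3
assert edit_distance(["a", "b"], ["a", "b"]) == 0
```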
As seen in Table 2, summaries by our models are considered simpler than the outputs of BART and GEDI, with better or comparable informativeness.


Ideology-controlled Headline Generation

Each article in the SemEval dataset is labeled with a stance; we combine left and leaning-left articles into one bucket, and similarly for right and leaning-right articles. We use the lead paragraph as the input and the headline as the target generation. The data is processed following Rush et al. (2015), and split into 346,985 samples for training, with 30,000 each for validation and testing. Details of the ideology distribution for SemEval are in Appendix B.

We fine-tune BART and train ideology classifiers on the SemEval training set. First, two binary style scorers are trained on headlines of left and right stances, with F1 scores of 76.1 and 78.0, respectively. One class-conditional language model is trained on headlines with a stance token (left or right) prepended, achieving a perplexity of 54.7. To learn the word unit predictor for the left stance (and similarly for the right), we use samples labeled as left-leaning, treat the lead paragraph as the input, and predict the word units used in the headline. Recalls for our predictors range from 77.8 to 83.5.

Automatic Evaluation with SemEval. Table 3 shows that our decoder state adjustment model with the ideology scorer obtains the highest ideology scores, due to its effectiveness at capturing the global context: stance is often signaled by the joint selection of entities and sentiments.

One might wonder which words are favored for ideology-controlled generation. To that end, we analyze the change in word usage with Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2015). As seen in Fig. 3, word unit prediction-based models generate more "negations", consistent with trends observed in human-written headlines. Meanwhile, models with decoder state adjustment and the baselines all use more "affect" words in both stances, indicating that they find it easier to use explicit sentiments to demonstrate the stances.

Human Evaluation with AllSides. Given the low ideology scores in Table 3, we further study whether humans can distinguish the stances in human-written and system-generated headlines. News clusters from AllSides are used, where each cluster focuses on one story, with multiple paragraph-headline pairs from publishers of left, neutral, and right ideological leanings. We use the lead paragraph as the input, and collect 2,985 clusters with samples written in all three stances. More details of the collection are in Appendix B. We test and report results using lead paragraphs from neutral articles as the input to construct headlines of left and right ideological stances.
We randomly pick 80 samples and include, for each sample, two headlines of different stances generated by each system. Raters first score the relevance of the generated headlines to the neutral paragraph's headline on a scale of 1 to 5. They then read each pair of headlines to decide whether they are written in different stances, and if so, to label them. Stances are distinguishable in of the cases, among which the stance identification accuracy is 73.3%. In comparison, 42.5% of the output pairs by the decoder state adjustment model can be distinguished, significantly higher than those of the baselines (24.5% and 11.6%). Sample outputs from our models are shown in Table 5, with more included in Appendix E.

Conclusion
We present two just-in-time style control methods that can be used with any Transformer-based summarization model. The decoder state adjustment technique modifies decoder final states based on externally trained style scorers. To gain stronger lexical control, word unit prediction directly narrows the vocabulary for generation. Human judges rate our system summaries as simpler and more readable. We are also able to generate headlines with different ideological leanings.

A Training and Decoding Settings
Training. We train our simplicity style scorer and ideology style scorers for 10 epochs. The peak learning rate is 1 × 10 −5 with a batch size of 32.
The class-conditional language models for simplicity and ideology are trained with a peak learning rate of 5 × 10−4 until the perplexity stops dropping on the validation set. We limit the number of tokens in each batch to 2,048.
All word unit predictors are trained with a peak learning rate of 1 × 10 −4 until the loss on the validation set no longer drops. We use a batch size of 32 for training.
Decoding. We use beam search for decoding. A beam size of 5 is used for all models except decoder state adjustment, which uses a beam size of 1 (greedy decoding) to maintain a reasonable running time. Repeated trigrams are disabled during generation in all experiments. As suggested by Lewis et al. (2020) and Yan et al. (2020), length penalties are set to 2.0 and 1.0 for summarization and headline generation, respectively. The minimum and maximum decoding lengths are set to 55 and 140 for summarization, and 0 and 75 for headline generation.
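The trigram-blocking constraint above can be implemented by banning, at each step, any token that would repeat an already generated trigram; a minimal sketch over illustrative token ids:

```python
def banned_next_tokens(prefix):
    """Return the set of token ids that would complete an already seen
    trigram, given the generated prefix (a list of token ids)."""
    if len(prefix) < 3:
        return set()
    seen = {}
    # Record every (bigram -> third token) occurrence in the prefix.
    for i in range(len(prefix) - 2):
        seen.setdefault(tuple(prefix[i:i + 2]), set()).add(prefix[i + 2])
    # Tokens completing a trigram that starts with the final bigram are banned.
    return seen.get(tuple(prefix[-2:]), set())

# Prefix "1 2 3 9 1 2": generating 3 again would repeat the trigram (1, 2, 3).
assert banned_next_tokens([1, 2, 3, 9, 1, 2]) == {3}
```

During beam search, the logits of the banned ids are set to negative infinity before selecting the next token.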

B Statistics on SemEval and AllSides
Each article in the SemEval dataset is labeled with a stance: left, leaning left, neutral, leaning right, or right. Here we combine left and leaning-left articles into one bucket, and similarly for right and leaning-right articles. The ideology distribution for the training, validation, and test splits is shown in Table 6.

In our human evaluation of ideology-controlled headline generation, we use data collected from AllSides. The AllSides news clusters are curated by editors. The stance labels for different publishers are provided by AllSides, synthesized from blind surveys, editorial reviews, third-party analyses, independent reviews, and community feedback. We collected all AllSides news clusters as of April 26, 2020. After removing empty clusters, 4,422 news clusters remain; among them, 2,985 contain articles written in all three stances. For each article in a cluster, we keep the first paragraph and pair it with the headline. We remove the bylines from the first paragraphs.

C Additional Results for Headline Generation
In Table 7, we show the results of ideology-controlled headline generation on SemEval with BART fine-tuned on Gigaword (Napoles et al., 2012). Our methods remain effective, especially decoder state adjustment with style scorers.

D Human Evaluation Guidelines
We include the evaluation guidelines for summarization and headline generation in Figures 4 and 5.

E Sample Outputs
Additional outputs are in Figures 6 and 7.

Article
There was no special treatment for Lewis Ferguson at Paul Nicholls's yard on Thursday morning. The 18-year-old was mucking out the stables as usual, just a cut on the nose to show for the fall which has made him an internet sensation. Ferguson's spectacular double somersault fall from the favourite Merrion Square in the 4.20pm at Wincanton has been watched hundreds of thousands of times online. But he was back riding out and is undeterred from getting back in the saddle. Amateur jockey Lee Lewis Ferguson has just a cut on his nose to show for his ordeal . Teenager Ferguson was flung from his horse in spectacular fashion at Wincanton . 'It was just a blur,' he said. 'I couldn't work out what had happened until I got back to the weighing room and watched the replay. All the other jockeys asked me if I was all right and stuff, they all watched with me and looked away in horror. (....)

Informativeness:
1: Not relevant to the article. e.g., "Paul Nicholl's yard will start its expansion in December. The expansion plan was carried out six months ago."
3: Relevant, but misses the main point. e.g., "Amateur jockey Lee Lewis Ferguson has just a cut on his nose to show for his ordeal. 'It was just a blur,' he said."
5: Successfully captures the main point and most of the important points. e.g., "Lewis Ferguson was mucking out the stables as usual on Thursday. Favourite Merrion Square threw jockey in a freak fall on Wednesday."

Fluency:
1: Summary is full of garbage fragments and is hard to understand. e.g., "18 year old nose. to cut show nose. the horse fashion, as to"
2: Summary contains fragments and missing components but has some fluent segments. e.g., "Lewis Ferguson out on Thursday. threw jockey on Wednesday."
3: Summary contains some grammar errors but is in general fluent. e.g., "Lewis Ferguson was muck out the stables as usual on Thursday. The Merrion Square threw jockey jockey in a freak fall on Wednesday. His spectacular double somersault fall made him internet sensation."
4: Summary has relatively minor grammatical errors. e.g., "Lewis Ferguson was mucking out the stables as usual on in Thursday. Favourite Merrion Square threw jockey in a freak fall on Wednesday. His spectacular double somersault fall made him internet sensation."
5: Fluent summary. e.g., "Lewis Ferguson was mucking out the stables as usual on Thursday. Favourite Merrion Square threw jockey in a freak fall on Wednesday. His spectacular double somersault fall made him internet sensation."

Simplicity:

Bad: The summary uses complex words that can be replaced with simpler ones in almost all sentences, and complex syntax structures (e.g., two or more clauses in a sentence). e.g., "Lewis Ferguson was thrown by Merrion Square and made a spectacular double somersault fall which gathered millions of views online, making him internet sensation. But he was back riding out and is undeterred from getting back in the saddle, just a cut on the nose to show for the fall."
Moderate: The summary uses at most one complex word that can be replaced with a simpler one per sentence, and uses syntax structures with at most one clause in a sentence. e.g., "Lewis Ferguson fell from Merrion Square. His spectacular double somersault fall made him internet sensation. But he was back riding out and is not afraid of getting back in the saddle."
Good: The summary almost always uses simple and common words and simple syntax structures (e.g., no clause or at most one clause in the whole summary). e.g., "Lewis Ferguson fell from his horse on Wednesday. His eye-catching double flip fall made him famous on the Internet. He was back to the yard. He is not afraid of getting back in the saddle."

Paragraph: US President Donald Trump has said he is going to halt funding to the World Health Organization (WHO) because it has "failed in its basic duty" in its response to the coronavirus outbreak.

Relevance:
1: The headline does not contain any information related to the input. e.g., "'a hateful act': what we know about the ft. lauderdale airport shooting"
2: The headline contains some relevant event or person in the paragraph, but the topic is largely irrelevant. e.g., "trump: i don't take questions from cnn"
3: The headline includes the main point of the paragraph, but has a different focus. e.g., "health experts condemn donald trump's who funding freeze: 'crime against humanity'"
4: The headline captures the main point of the paragraph, but contains some information that cannot be inferred from the paragraph. e.g., "trump cuts off u.s. funding to who, pending review"
5: The content of the headline and the paragraph are well aligned. e.g., "coronavirus: us to halt funding to who, says trump"

Example B
Article: . . . Raikkonen's contract finishes at the end of the current Formula One season, although there is an option for 2016 providing both parties are in agreement. The Finn stated this week he has never been happier working with a team in his entire F1 career, although his form to date has not matched that of team-mate Sebastian Vettel. Kimi Raikkonen has been urged to improve his performances if he wants to stay at Ferrari. . . .