Clickbait? Sensational Headline Generation with Auto-tuned Reinforcement Learning

Sensational headlines are headlines that capture people’s attention and generate reader interest. Conventional abstractive headline generation methods, unlike human writers, do not optimize for maximal reader attention. In this paper, we propose a model that generates sensational headlines without labeled data. We first train a sensationalism scorer by classifying online headlines with many comments (“clickbait”) against a baseline of headlines generated from a summarization model. The score from the sensationalism scorer is used as the reward for a reinforcement learner. However, maximizing the noisy sensationalism reward will generate unnatural phrases instead of sensational headlines. To effectively leverage this noisy reward, we propose a novel loss function, Auto-tuned Reinforcement Learning (ARL), to dynamically balance reinforcement learning (RL) with maximum likelihood estimation (MLE). Human evaluation shows that 60.8% of samples generated by our model are sensational, which is significantly better than the Pointer-Gen baseline and other RL models.


Introduction
Headline generation is the process of creating a headline-style sentence for a given input article. The research community has typically treated headline generation as a summarization task (Shen et al., 2017a), ignoring the fundamental differences between headlines and summaries. While summaries aim to contain most of the important information from an article, headlines do not necessarily need to. Instead, a good headline needs to capture people's attention and serve as an irresistible invitation for users to read through the article. For example, the headline "$2 Billion Worth of Free Media for Trump", which gives only an intriguing hint, is considered better than the summarization-style headline "Measuring Trump's Media Dominance", as the former attracted almost three times as many readers as the latter. Generating headlines that attract many clicks is especially important in this digital age, because much of journalism's revenue comes from online advertisements, and attracting more user clicks means being more competitive in the market. However, most existing websites naively generate sensational headlines using only keywords or templates. Instead, this paper aims to learn a model that generates sensational headlines based on an input article without labeled data.
Generating sensational headlines poses two main challenges. Firstly, there is no existing sensationalism scorer to measure how sensational a headline is. Some researchers have manually labeled headlines as clickbait or non-clickbait (Chakraborty et al., 2016; Potthast et al., 2018). However, these human-annotated datasets are usually small and expensive to collect. To capture a large variety of sensationalization patterns, we need a cheap and easy way to collect a large number of sensational headlines. Thus, we propose a distant supervision strategy to collect a sensationalism dataset: we regard headlines receiving many comments as sensational samples and headlines generated by a summarization model as non-sensational samples. Experimental results show that by distinguishing these two types of headlines, we can partially teach the model a sense of being sensational.
Secondly, after training a sensationalism scorer on our sensationalism dataset, a natural way to generate sensational headlines is to maximize the sensationalism score using reinforcement learning (RL). However, the following example shows an RL model maximizing the sensationalism score by generating a very unnatural sentence, to which the sensationalism scorer assigns a very high score of 0.99996: 十个可穿戴产品的设计原则这消息消息可惜说明 (Ten design principles for wearable devices, this message message pity introduction). This happens because the sensationalism scorer can make mistakes, and RL can generate unnatural phrases that fool the scorer. Thus, how to effectively leverage RL with noisy rewards remains an open problem. To deal with the noisy reward, we introduce Auto-tuned Reinforcement Learning (ARL). Our model automatically tunes the ratio between MLE and RL based on how sensational the training headline is. In this way, we effectively take advantage of RL with a noisy reward to generate headlines that are both sensational and fluent.
The major contributions of this paper are as follows: 1) To the best of our knowledge, we propose the first-ever model that tackles the sensational headline generation task with reinforcement learning techniques. 2) Without human-annotated data, we propose a distant supervision strategy to train a sensationalism scorer as a reward function. 3) We propose a novel loss function, Auto-tuned Reinforcement Learning, which assigns dynamic weights to balance MLE and RL. Our code will be released.

Sensationalism Scorer
To evaluate the sensationalism intensity score α_sen of a headline, we collect a sensationalism dataset and then train a sensationalism scorer. For the dataset collection, we choose headlines with many comments from popular online websites as positive samples. For the negative samples, we propose to use headlines generated by a sentence summarization model. Intuitively, the summarization model, which is trained to preserve the semantic meaning, will lose the ability to sensationalize, and thus the generated negative samples will be less sensational than the original headlines, similar to the obfuscation of style after back-translation (Prabhumoye et al., 2018). For example, an original headline like "一趟挣10万？铁总增开申通、顺丰专列" (One trip to earn 100 thousand? China Railway opens new Shentong and Shunfeng special lines) will become "中铁总将增开京广两列快递专列" (China Railway opens two special lines for express) under the baseline model, losing the sensational phrase "一趟挣10万？" (One trip to earn 100 thousand?). We then train the sensationalism scorer to classify sensational versus non-sensational headlines using a one-layer CNN with a binary cross-entropy loss L_sen. Firstly, a 1-D convolution extracts word features from the input embeddings of a headline. This is followed by a ReLU activation layer and a max-pooling layer along the time dimension. All features from different channels are concatenated together and projected to the sensationalism score by a fully connected layer with sigmoid activation. Binary cross-entropy is used to compute the loss L_sen.

Training Details and Dataset
For the CNN model, we use filter sizes of 1, 3, and 5. Adam is used to optimize L_sen with a learning rate of 0.0001. We set the embedding size to 300 and initialize the embeddings from Qiu et al. (2018), trained on the Weibo corpus with word and character features. We fix the embeddings during training.
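To make the scorer architecture concrete, the following is a minimal NumPy sketch of the forward pass described above (1-D convolutions with filter sizes 1, 3, and 5, ReLU, max-pooling over time, concatenation, and a sigmoid output). All weights, as well as the channel count of 8, are illustrative placeholders rather than the trained parameters.

```python
import numpy as np

def sensationalism_scorer(embeddings, filters, W_fc, b_fc):
    """Forward pass of the one-layer CNN scorer sketched above.

    embeddings: (T, 300) word embeddings of a headline.
    filters: dict mapping filter size k -> (num_channels, k, 300) weights.
    W_fc, b_fc: final fully connected layer projecting to a scalar score.
    """
    features = []
    T, d = embeddings.shape
    for k, W in filters.items():
        n_ch = W.shape[0]
        # 1-D convolution along time: one activation per window position
        conv = np.zeros((T - k + 1, n_ch))
        for t in range(T - k + 1):
            window = embeddings[t:t + k]                      # (k, d)
            conv[t] = np.tensordot(W, window, axes=([1, 2], [0, 1]))
        conv = np.maximum(conv, 0.0)                          # ReLU
        features.append(conv.max(axis=0))                     # max-pool over time
    feat = np.concatenate(features)                           # concat channels
    logit = feat @ W_fc + b_fc
    return 1.0 / (1.0 + np.exp(-logit))                       # sigmoid -> (0, 1)
```

With filter sizes (1, 3, 5) and 8 channels each, the concatenated feature vector has 24 dimensions, so `W_fc` has shape (24,).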
For dataset collection, we use the headlines collected by Qin et al. (2018) and Lin et al. (2019a) from Tencent News, one of the most popular Chinese news websites, as the positive samples. We follow the same data split as the original paper. As some of the links are no longer available, we obtain 170,754 training samples and 4,511 validation samples. For the negative training samples, we randomly select generated headlines from a pointer-generator model (See et al., 2017) trained on the LCSTS dataset (Hu et al., 2015), creating a balanced training corpus of 351,508 training samples and 9,022 validation samples. To evaluate the trained classifier, we construct a test set by randomly sampling 100 headlines from the test split of the LCSTS dataset; labels are obtained from 11 human annotators. By majority voting, 52% of the headlines are labeled as positive and 48% as negative (details on the annotation can be found in Section 3.6).

Results and Discussion
Our classifier achieves 0.65 accuracy and a 0.65 averaged F1 score on the test set, while a random classifier would only achieve 0.50 on both. This confirms that the predicted sensationalism score can partially capture the sensationalism of headlines. A more natural choice might be to take headlines with few comments as negative examples. We therefore train a baseline classifier on a crawled balanced sensationalism corpus of 84k headlines, where the positive headlines have at least 28 comments and the negative headlines have fewer than 5 comments. However, this baseline classifier achieves only 60% accuracy on the test set, worse than the proposed classifier (65%). The reason could be that the balanced sensationalism corpus is sampled from a different distribution than the test set, making it hard for the trained model to generalize. Therefore, we choose the proposed classifier as our sensationalism scorer. Our next challenge is to show how to leverage this noisy sensationalism reward to generate sensational headlines.

Sensational Headline Generation
Our sensational headline generation model takes an article as input and outputs a sensational headline. The model consists of a Pointer-Gen headline generator and is trained with ARL. A diagram of ARL can be found in Figure 1.
We denote the input article as x = {x_1, x_2, x_3, ..., x_M} and the corresponding headline as y* = {y*_1, y*_2, y*_3, ..., y*_T}, where M is the number of tokens in the article and T is the number of tokens in the headline.

Pointer-Gen Headline Generator
We choose the Pointer Generator (Pointer-Gen) (See et al., 2017), a widely used summarization model, as our headline generator for its ability to copy words from the input article. It takes a news article as input and generates a headline. Firstly, the tokens of the article, {x_1, x_2, x_3, ..., x_M}, are fed into the encoder one by one, and the encoder produces a sequence of hidden states h_i. At each decoding step t, the decoder receives the embedding of a headline token y_t as input and updates its hidden state s_t. An attention mechanism following Luong et al. (2015) is used:

e_t^i = v^T tanh(W_h h_i + W_s s_t + b_attn),
a_t = softmax(e_t),
h*_t = Σ_i a_t^i h_i,

where v, W_h, W_s, and b_attn are trainable parameters and h*_t is the context vector. s_t and h*_t are then combined to give a probability distribution over the vocabulary through two linear layers:

P_vocab = softmax(V'(V[s_t; h*_t] + b) + b'),

where V, b, V', and b' are trainable parameters. We use a pointer mechanism to enable our model to copy rare/unknown words from the input article, giving the following final word probability:

p_gen = σ(w_{h*}^T h*_t + w_s^T s_t + w_x^T x_t + b_ptr),
P(w) = p_gen P_vocab(w) + (1 − p_gen) Σ_{i: x_i = w} a_t^i,

where x_t is the embedding of the decoder input word, w_{h*}, w_s, w_x, and b_ptr are trainable parameters, and σ is the sigmoid function.
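As an illustration of the copy mechanism, the following NumPy sketch combines the vocabulary distribution and the attention distribution into the final P(w). The variable names are ours, and for brevity it assumes source out-of-vocabulary words already have ids within the (extended) vocabulary.

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attn, src_ids):
    """Combine generation and copy distributions as in the Pointer-Gen formula.

    p_gen:   scalar in (0, 1), probability of generating from the vocabulary.
    p_vocab: (vocab_size,) softmax over the fixed vocabulary.
    attn:    (M,) attention weights a_t over the M source tokens.
    src_ids: (M,) vocabulary id of each source token.
    """
    p_final = p_gen * p_vocab
    # Scatter-add the copy probability of each source position onto its word id;
    # np.add.at handles repeated ids (the same word appearing twice in the source).
    np.add.at(p_final, src_ids, (1.0 - p_gen) * attn)
    return p_final
```

Since both input distributions sum to one, the output is again a valid distribution, and a source word's probability can exceed its vocabulary-only probability whenever it receives attention.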

Training Methods
We first briefly introduce MLE and RL objective functions, and a naive way to mix these two by a hyper-parameter λ. Then we point out the challenge of training with noisy reward, and propose ARL to address this issue.

MLE and RL
A headline generation model can be trained with MLE, RL, or a combination of the two. MLE training minimizes the negative log-likelihood of the training headlines: we feed y* into the decoder word by word and maximize the likelihood of y*. The loss function for MLE is

L_MLE = − Σ_{t=1}^{T} log P(y*_t | y*_1, ..., y*_{t−1}, x).

For RL training, we choose the REINFORCE algorithm (Williams, 1992).
In the training phase, after encoding an article, a headline y^s = {y^s_1, y^s_2, y^s_3, ..., y^s_T} is obtained by sampling from the P(w) given by our generator, and then a reward of sensationalism or ROUGE (RG) is calculated.
We use a baseline reward R̂_t to reduce the variance of the reward, similar to Ranzato et al. (2016). To elaborate, a linear model estimates the baseline reward R̂_t from the t-th decoder state o_t for each timestep t:

R̂_t = W_r o_t + b_r,

where W_r and b_r are trainable parameters, trained by minimizing the mean squared loss L_b between R and R̂_t. To maximize the expected reward, our loss function for RL becomes

L_RL = − Σ_{t=1}^{T} (R − R̂_t) log P(y^s_t | y^s_1, ..., y^s_{t−1}, x).

A naive way to mix these two objective functions with a hyper-parameter λ has been successfully applied to the summarization task. It includes MLE training as a language model to mitigate the readability and quality issues of RL. The mixed loss function is

L_{RL-*} = λ L_RL + (1 − λ) L_MLE,

where * is the reward type. Usually λ is large; prior work used 0.9984.
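A minimal sketch of the RL objective and the fixed-ratio mixed loss, under the assumption that the per-step log-probabilities, the scalar reward R, and the baseline estimates R̂_t have already been computed; in a real implementation these would be differentiable tensors rather than floats.

```python
import math

def rl_loss(log_probs, reward, baselines):
    """REINFORCE with a learned baseline:
    L_RL = -sum_t (R - R_hat_t) * log P(y^s_t | ...).

    log_probs: per-step log-probabilities of the sampled headline y^s.
    reward:    scalar sequence-level reward R (sensationalism or ROUGE).
    baselines: per-step baseline estimates R_hat_t from the linear model.
    """
    return -sum((reward - b) * lp for lp, b in zip(log_probs, baselines))

def mixed_loss(l_rl, l_mle, lam):
    """Fixed-ratio mixing: L_{RL-*} = lam * L_RL + (1 - lam) * L_MLE."""
    return lam * l_rl + (1.0 - lam) * l_mle
```

When the sampled headline earns more than the baseline predicts (R > R̂_t), minimizing `rl_loss` pushes the model to raise the probability of the sampled tokens, and vice versa.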

Auto-tuned Reinforcement Learning
Applying the naive mixed training method with the sensationalism score as the reward is not trivial in our task. The main reason is that our sensationalism reward is notably noisier and more fragile than the ROUGE-L reward or abstractive reward used in the summarization task (Kryściński et al., 2018). A higher ROUGE-L F1 reward in summarization statistically indicates a higher overlap between the generation and the true summary, but our sensationalism reward is a learned score that can easily be fooled by unnatural samples.
To effectively train the model with RL under a noisy sensationalism reward, our idea is to balance RL with MLE. However, we argue that the weighting between MLE and RL should be sample-dependent, instead of fixed for all training samples as in Kryściński et al. (2018). The reason is that RL and MLE have inconsistent optimization objectives: when the training headline is non-sensational, MLE training encourages our model to imitate it (thus generating non-sensational headlines), which counteracts the effect of RL training to generate sensational headlines.
The sensationalism score is therefore used to give dynamic weights to MLE and RL. Our ARL loss function becomes

L_{ARL-SEN} = (1 − α_sen(y*)) L_RL + α_sen(y*) L_MLE.

If α_sen(y*) is high, meaning the training headline is sensational, the loss function encourages our model to imitate the sample via MLE training. If α_sen(y*) is low, the loss function relies on RL training to improve sensationalism. Note that the weight α_sen(y*) is different from our sensationalism reward α_sen(y^s). We call the loss function Auto-tuned Reinforcement Learning because the ratio between MLE and RL is automatically tuned for each sample.
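The ARL weighting itself is a one-liner; the sketch below (function name is ours) contrasts a sensational training headline, which mostly receives MLE training, with a non-sensational one, which mostly receives RL training.

```python
def arl_loss(l_rl, l_mle, alpha_sen_ref):
    """Auto-tuned mixing: L_ARL = (1 - alpha) * L_RL + alpha * L_MLE,
    where alpha = alpha_sen(y*) is the scorer's value on the TRAINING headline,
    not the reward alpha_sen(y^s) on the sampled headline."""
    return (1.0 - alpha_sen_ref) * l_rl + alpha_sen_ref * l_mle

# Sensational reference (alpha = 0.9): the loss is dominated by L_MLE.
# Non-sensational reference (alpha = 0.1): the loss is dominated by L_RL.
```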

Dataset
We use LCSTS (Hu et al., 2015) as the dataset to train the summarization model. The dataset is collected from the Chinese microblogging website Sina Weibo. It contains over 2 million Chinese short texts with corresponding headlines written by the author of each text. The dataset is split into 2,400,591 samples for training, 10,666 samples for validation, and 725 samples for testing. We tokenize each sentence with Jieba, and a vocabulary of 50,000 tokens is kept.

Figure 2: The probability density function (pdf) of the predicted sensationalism score in log scale. Low sensationalism scores have much higher probability density.

Baselines and Our Models
We experiment and compare with the following models:

Pointer-Gen is the baseline model trained by optimizing L_MLE in Equation 8.

Pointer-Gen+Pos is the baseline model trained only on positive examples whose sensationalism score is larger than 0.5.

Pointer-Gen+Same-FT is the model which fine-tunes Pointer-Gen on the training samples whose sensationalism score is larger than 0.1.

Pointer-Gen+Pos-FT is the model which fine-tunes Pointer-Gen on the training samples whose sensationalism score is larger than 0.5.

Test set denotes the headlines from the test set.

Note that we do not compare to Pointer-Gen+ARL-ROUGE, as it reduces to Pointer-Gen. Recall that α_sen(y*) in Equation 15 measures how good y* is under the reward function. Since RG(y*, y*) = 1, the loss function for Pointer-Gen+ARL-ROUGE would be

(1 − RG(y*, y*)) L_RL + RG(y*, y*) L_MLE = L_MLE.

We also tried a text style transfer baseline (Shen et al., 2017b), but the generated headlines were very poor (many unknown words and irrelevant content).

Training Details
MLE training: An Adam optimizer is used with a learning rate of 0.0001 to optimize L_MLE. The batch size is set to 128, and a one-layer bidirectional Long Short-Term Memory (bi-LSTM) model with a hidden size of 512 and an embedding size of 350 is used. Gradients with an l2 norm larger than 2.0 are clipped. We stop training when the ROUGE-L F-score stops increasing.

Hybrid training: An Adam optimizer with a learning rate of 0.0001 is used to optimize L_RL-* (Equation 14) and L_ARL-SEN (Equation 15). When training Pointer-Gen+RL-ROUGE, the best λ is chosen based on the ROUGE-L score on the validation set; in our experiments, λ is set to 0.95. An Adam optimizer with a learning rate of 0.001 is used to optimize L_b. When training Pointer-Gen+ARL-SEN, we do not use the full LCSTS dataset, but only headlines with a sensationalism score larger than 0.1, as we observe that Pointer-Gen+ARL-SEN generates a few unnatural phrases when trained on the full dataset. We believe the reason is the high ratio of RL during training: Figure 2 shows that the probability density near 0 is very high, meaning that in each batch many samples have a very low sensationalism score. In expectation, each sample would receive 0.239 MLE training and 0.761 RL training, letting RL dominate the loss. Thus, we filter samples with a minimum sensationalism score of 0.1, which works very well. For Pointer-Gen+RL-SEN, we also set the minimum sensationalism score to 0.1, and λ is set to 0.5 to remove unnatural phrases, making a fair comparison to Pointer-Gen+ARL-SEN.
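The expected MLE/RL split of a filtered batch is just the mean kept score, since under the ARL loss each sample y* contributes α_sen(y*) weight to MLE and 1 − α_sen(y*) to RL. A small sketch with synthetic scores (illustrative only, not the real distribution in Figure 2):

```python
import numpy as np

def expected_split(scores, min_score=0.1):
    """Average MLE/RL weight over samples kept after filtering at min_score.

    Under ARL, the batch-level MLE weight is the mean kept sensationalism
    score, and the RL weight is its complement. `scores` here are synthetic.
    """
    kept = scores[scores > min_score]
    mle_w = float(kept.mean())
    return mle_w, 1.0 - mle_w
```

With a score distribution heavily concentrated near 0, unfiltered training drives the MLE weight down and lets RL dominate, which is exactly the failure mode the 0.1 filter mitigates.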
We stop training Pointer-Gen+Same-FT, Pointer-Gen+Pos-FT, Pointer-Gen+RL-SEN and Pointer-Gen+ARL-SEN, when α sen stops increasing on the validation set. Beam-search with a beam size of 5 is adopted for decoding in all models.
Human evaluation: We randomly sample 50 articles from the test set and send the generated headlines from all models, along with the corresponding headlines from the test set, to human annotators. We evaluate the sensationalism and fluency of the headlines in two independent annotation tasks, with 10 annotators labeling each headline for each task. For the sensationalism annotation, each annotator is asked one question, "Is the headline sensational?", and has to choose either 'yes' or 'no'. The annotators are not told which system each headline comes from. The process of distributing samples and recruiting annotators is managed by Crowdflower. After annotation, we define the sensationalism score of a model as the proportion of annotations on its generated headlines labeled 'yes'. For the fluency annotation, we repeat the same procedure, except that we ask each annotator "Is the headline fluent?", and define the fluency score analogously. Human annotation instructions are provided in the supplemental material.

Results
We first compare our four models, Pointer-Gen, Pointer-Gen+RL-ROUGE, Pointer-Gen+RL-SEN, and Pointer-Gen+ARL-SEN, to existing models with ROUGE, to validate that our model produces relevant headlines; we leave the sensationalism assessment for human evaluation. Note that we only compare our models to commonly used strong summarization baselines, to validate that our implementation achieves performance comparable to existing work. In our implementation, Pointer-Gen achieves a 34.51 RG-1 score, 22.21 RG-2 score, and 31.68 RG-L score, similar to the results of Gu et al. (2016). Pointer-Gen+ARL-SEN, although optimized for the sensationalism reward, achieves performance similar to our Pointer-Gen baseline, which means that Pointer-Gen+ARL-SEN retains its summarization ability. An example of headlines generated by the different models in Table 2 shows that Pointer-Gen and Pointer-Gen+RL-ROUGE learn to summarize the main point of the article: "The Nikon D600 camera is reported to have black spots when taking photos". Pointer-Gen+RL-SEN makes the headline more sensational by blaming Nikon for attributing the damage to the smog. Pointer-Gen+ARL-SEN generates the most sensational headline by exaggerating the result "Getting a serious trouble!" to maximize the reader's attention.

Table 3: Comparison of sensationalism score and fluency score between different models. Pointer-Gen+ARL-SEN achieves the best sensationalism score among all models. * indicates Pointer-Gen+ARL-SEN is statistically significantly better than the corresponding model.

Figure 3: Comparison of sensationalism score between Pointer-Gen+ARL-SEN and Pointer-Gen+RL-SEN for different test set headlines. The blue bars denote the smaller scores between the two models. Pointer-Gen+ARL-SEN achieves better performance in most cases. Greater improvement is achieved when the test set headline is non-sensational.
We then compare the models using the sensationalism score in Table 3. The Pointer-Gen baseline achieves a 42.6% sensationalism score, the minimum that a typical summarization model achieves. By filtering out low-sensationalism headlines, Pointer-Gen+Same-FT and Pointer-Gen+Pos-FT achieve higher sensationalism scores, which implies the effectiveness of our sensationalism scorer. Our Pointer-Gen+ARL-SEN model achieves the best performance of 60.8%, an absolute improvement of 18.2% over the Pointer-Gen baseline. A Chi-square test confirms that Pointer-Gen+ARL-SEN is statistically significantly more sensational than all the other baseline models, with the largest p-value less than 0.01. Also, we find that the test set headlines achieve a 57.8% sensationalism score, much higher than the Pointer-Gen baseline, which supports our intuition that generated headlines are less sensational than the original ones. On the other hand, we find that Pointer-Gen+Pos is much worse than the other baselines. The reason is that training on sensational samples alone discards around 80% of the training set, which is also helpful for maintaining relevance and a good language model. This shows the necessity of using RL.
In addition, both Pointer-Gen+RL-SEN and Pointer-Gen+ARL-SEN, which use the sensationalism score as the reward, obtain statistically significantly better results than Pointer-Gen+RL-ROUGE and Pointer-Gen, with a p-value less than 0.05 by a Chi-square test. This shows the effectiveness of RL in generating more sensational headlines. Even though our noisy classifier could also learn to classify domains, the generator during RL training cannot increase the reward by shifting domains, due to the consistency constraint between the domains of the headline and the article; it is instead encouraged to generate more sensational headlines. Furthermore, Pointer-Gen+ARL-SEN performs better than Pointer-Gen+RL-SEN, which confirms the superiority of the ARL loss function.
We also visualize in Figure 3 a comparison between Pointer-Gen+ARL-SEN and Pointer-Gen+RL-SEN according to how sensational the test set headlines are. The blue bars denote the smaller score between the two models; for example, a blue bar of 0.6 means that the worse of the two models achieves 0.6. The orange or black color above it indicates the better model and its score. We find that Pointer-Gen+ARL-SEN outperforms Pointer-Gen+RL-SEN in most cases. The improvement is larger when the test set headlines are not sensational (sensationalism score below 0.5), which may be attributed to the higher ratio of RL training on non-sensational headlines.
Apart from the sensationalism evaluation, we measure the fluency of the headlines generated by the different models. Fluency scores in Table 3 show that Pointer-Gen+RL-SEN and Pointer-Gen+ARL-SEN achieve fluency comparable to Pointer-Gen and Pointer-Gen+RL-ROUGE. Test set headlines achieve the best performance among all models, but the difference is not statistically significant. Also, we observe that fine-tuning on sensational headlines hurts performance, both in sensationalism and fluency.

Table 4: Different sensationalization strategies Pointer-Gen+ARL-SEN learns.
After manually checking the outputs, we observe that our model is able to generate sensational headlines using diverse sensationalization strategies. These strategies include, but are not limited to, creating a curiosity gap, asking questions, highlighting numbers, being emotional and emphasizing the user. Examples can be found in Table 4.

Related Work
Our work is related to summarization. An encoder-decoder model was first applied to sentence-level abstractive summarization on the DUC-2004 and Gigaword datasets (Rush et al., 2015). This model was later extended with selective encoding, a coarse-to-fine approach (Tan et al., 2017b), minimum risk training (Shen et al., 2017a), and topic-aware models. As long summaries were recognized as important, the CNN/Daily Mail dataset was used in Nallapati et al. (2016). Graph-based attention (Tan et al., 2017a) and a pointer-generator with coverage loss (See et al., 2017) were further developed to improve the generated summaries. Celikyilmaz et al. (2018) proposed deep communicating agents for representing a long document in abstractive summarization. In addition, many papers (Nallapati et al., 2017; Zhou et al., 2018b) use extractive methods to directly select sentences from articles. However, none of these works considered the sensationalism of the generated outputs. RL is also gaining popularity as it can directly optimize non-differentiable metrics (Pasunuru and Bansal, 2018; Venkatraman et al., 2015). An intra-decoder model combining RL and MLE was proposed to deal with low-quality summaries. RL has also been explored with generative adversarial networks (GANs) (Yu et al., 2017), which have been applied to the summarization task with better performance. Niu and Bansal (2018) tackle the problem of polite generation with a politeness reward. Our work is different in that we propose a novel loss function to balance RL and MLE.
Our task is also related to text style transfer. Implicit methods (Shen et al., 2017b; Fu et al., 2018; Prabhumoye et al., 2018) transfer styles by separating sentence representations into content and style, for example using back-translation (Prabhumoye et al., 2018). However, these methods cannot guarantee content consistency between the original sentence and the transferred output (Xu et al., 2018a). Explicit methods (Zhang et al., 2018b; Xu et al., 2018a) transfer style by directly identifying style-related keywords and modifying them. However, sensationalism is not always restricted to keywords; it can involve the full sentence. By leveraging small human-labeled English datasets, clickbait detection has been well investigated (Chakraborty et al., 2016; Shu et al., 2018; Potthast et al., 2018). However, such human-labeled datasets are not available for other languages, such as Chinese.
Modeling sensationalism is also related to modeling emotion. Emotion has been well investigated at both the word level (Tang et al., 2016) and the sentence level (Felbo et al., 2017; Winata et al., 2018, 2019; Park et al., 2018; Lee et al., 2019). It has also been considered an important factor in engaging interactive systems (Lin et al., 2019b; Winata et al., 2017). Although we observe that sensational headlines contain emotion, it is still not clear which emotions are involved and how emotions influence sensationalism.

Conclusion and Future Work
In this paper, we propose a model that generates sensational headlines without labeled data using reinforcement learning. Firstly, we propose a distant supervision strategy to train the sensationalism scorer, achieving 65% accuracy against human evaluation. To effectively leverage this noisy sensationalism score as the reward for RL, we propose a novel loss function, ARL, to automatically balance RL with MLE. Human evaluation confirms the effectiveness of both our sensationalism scorer and ARL in generating more sensational headlines. Future work includes improving the sensationalism scorer and investigating applications of dynamic balancing between RL and MLE in textGAN (Yu et al., 2017). Our work also raises ethical questions about generating sensational headlines, which can be further explored.