Predicting Clickbait Strength in Online Social Media

Hoping for a large number of clicks and potentially high social shares, journalists at various news media outlets publish sensationalist headlines on social media. These headlines lure readers into clicking on them to close the curiosity gap they create. The low-quality material that clickbaits point to wastes users' time and annoys them. Even for enterprises that publish clickbaits, the practice hurts more than it helps: it erodes user trust, attracts the wrong visitors, and produces negative signals for ranking algorithms. Hence, identifying and flagging clickbait titles is essential. Previous work on clickbaits has largely focused on binary classification of clickbait titles. However, not all clickbaits are equally clickbaity. It is essential not only to identify a clickbait but also to measure its strength. In this work, we model clickbait strength prediction as a regression problem. While previous methods have relied on traditional machine learning or vanilla recurrent neural networks, we rigorously investigate the use of transformers for clickbait strength prediction. On a benchmark dataset with ∼39K posts, our methods outperform all the existing methods in the Clickbait Challenge.


Introduction
Clickbait refers to sensational, provocative or controversial posts that appear to be informative and objective but are designed to entice readers into clicking the link accompanying the post. According to a survey of 53 Stanford students, 96.2 percent of them encounter clickbait articles on the Internet at least once per day. Rony et al. (2017) estimate that 19.46% of headlines were "clickbait" in 2014; 23.73% in 2015; and 25.27% in 2016. Beyond its increasing prevalence, clickbait is also a challenge across multiple modes of data: text, images and even videos.
The economic model of the contemporary online news industry (Dvorkin, 2015) incentivizes more content views. A report by the Columbia Journalism Review highlighted the case of the online magazine Slant, which pays writers $100 per month, plus $5 for every 500 clicks on their stories. This clearly motivates journalists to write catchy and suspenseful headlines. Table 1 lists some of the popular social media outlets publishing clickbait content and their followers. The numbers indicate how readily people fall for the bait.
Cognitively, people tend to close a curiosity gap by clicking on the link. Marketing companies have been using clickbaits to attract and engage more users and thereby get more page views. Websites need more page views to promote their content or to create more opportunities to show advertisements, which increases their revenue. Moreover, clickbaity content is well known to be more likely to be shared socially, leading to even more page views.
Clickbaits play with human psychology and often waste the reader's time. They create a curiosity gap through the short post but make no serious attempt to fill it in the linked article, thereby creating an information void. It is quite annoying to have social feeds spammed by over-promising headlines that lead users to under-delivering half-stories. Even for enterprises that use clickbaits for marketing, clickbaits hurt more than they help for the following reasons: (1) misleading clickbait damages brands and erodes user trust, (2) clickbaits attract the wrong visitors rather than interested ones, (3) user interaction with clickbaits produces negative signals for ranking algorithms, (4) clickbait muddles the website's important data, and (5) sensationalism is now seen as more disappointing by smart users.
Clickbait classification is a very subjective task. While there are some terrible headlines that qualify as clear clickbaits (e.g., "You won't believe what happened!"), there is also an enormous gray area. Since the very purpose of a teaser message is to attract the attention of readers, every message containing a link baits the user into clicking the link. The question is whether this baiting is perceived as immoderate or deceptive by the reader (Potthast et al., 2017). Hence, in this work, we focus on the task of predicting the intensity or degree of clickbaity-ness of a post rather than on vanilla binary classification.
Predicting the degree of clickbaity-ness is challenging because clickbaits are short headlines, often written in obscure ways that require high-order semantic understanding. The strength of a clickbait could be defined as a function of how attention-grabbing the post is and of the gap between what is promised in the headline and what is delivered by the article linked from it. Both of these are difficult to measure automatically. In most cases, we may need to predict clickbaity-ness based only on the content of the post. Using the content alone brings further challenges: (1) it is usually very short, (2) it is often written in convoluted ways, and (3) it requires high-order semantic understanding, often with the support of facts from some knowledge base.
In this paper, we make the following main contributions.
• We build multiple regression models using current state-of-the-art word embeddings and compare the performance of these regressors with the current state-of-the-art methods for clickbait strength prediction.
• We present the first work to investigate application of transformer regression models for the clickbait intensity prediction task.
• We augment transformer-based methods with multiple traditional machine learning regression methods to further improve the regression performance.
• Our experiments with a benchmark dataset establish a new state-of-the-art for the clickbait intensity prediction task.
Related Work

Clickbait Classification
The origin of clickbaits can be traced back to the advent of tabloid journalism. Rowe (2011), Blom and Hansen (2015) and Chen et al. (2015) are some of the earliest studies analysing linguistic aspects of clickbait, but they did not perform automatic classification. Most of the existing work on automated clickbait detection has been done in the context of binary classification, i.e., predicting whether a given news article's title is a clickbait or not. Traditionally, feature engineering based methods have been proposed (Biyani et al., 2016; Chakraborty et al., 2016; Wei and Wan, 2017). Feature sets include content features, textual similarity features between the headline and the body, informality and forward reference features, sentence structure features, word pattern features, clickbait language features and n-gram features. Machine learning methods like Gradient Boosted Decision Trees (Biyani et al., 2016), Support Vector Machine (SVM) classifiers (Chakraborty et al., 2016) and co-training (Wei and Wan, 2017) are then used to leverage these features and train a classifier. Features for clickbait detection can be derived from three sources: the teaser message or the post text, the linked article, and metadata for both. While all reviewed approaches derive features from the teaser message, the linked article and the metadata are considered only by Potthast et al. (2016) and Biyani et al. (2016). Besides the post text, Zheng et al. (2017) additionally took user behaviour information into consideration to improve the performance of clickbait detection on Chinese news articles. Recently, deep learning techniques have also been proposed. Anand et al. (2017) and Rony et al. (2017) use a bidirectional Recurrent Neural Network (RNN) (Schuster and Paliwal, 1997) and fastText (Joulin et al., 2016), respectively, on distributed word representations for clickbait detection.
Most of the initial efforts on clickbait detection focused only on news headlines. Recently, there have been efforts at identifying clickbaits from social media like Twitter. Potthast et al. (2016) trained a random forest classifier by extracting various features from the post texts, linked webpages and associated meta information of tweets, to decide if a tweet was a clickbait. Agrawal (2016) trained a Convolutional Neural Network (CNN) (Kim, 2014), using the post texts only, to detect clickbait posts in Reddit, Facebook and Twitter. Researchers have also analysed the differences in content, sentiment, consumers, etc., between clickbait and non-clickbait tweets.

Clickbait Regression
Binary clickbait classification is not sufficient. Rather, it is useful to predict the fine-grained intensity of the clickbait, which enables ranking of clickbaits and thereby provides a knob for eliminating clickbaits instead of a blanket binary elimination. The Clickbait Challenge (Potthast et al., 2017) was devised in 2017 to enable benchmarking of solutions for the clickbait strength prediction problem. In this challenge, the goal is to predict the intensity of clickbaits rather than just predicting if a particular item is a clickbait or not. Table 2 shows various approaches that have been proposed for the clickbait intensity prediction task. Some approaches use traditional machine learning regression methods with a large set of hand-crafted features, while others use neural architectures (like RNNs, LSTMs, GRUs and CNNs) supported by word embeddings like word2vec and GloVe.

Dataset
The Webis Clickbait Corpus consists of 38517 tweets (Potthast et al., 2018). Restricting to English-language publishers, Potthast et al. (2018) obtain a ranking of the most retweeted news publishers from the NewsWhip social media analytics service. Taking the top 27 publishers, they used Twitter's API to record every tweet these publishers posted in the period from December 1, 2016, through April 30, 2017. They filtered and sampled from this collection of 459541 tweets to obtain a clean dataset of 38517 tweets. Each tweet was annotated with a clickbait intensity label by five different workers from Amazon Mechanical Turk (AMT). A 4-point Likert scale was followed with these values: Not clickbaiting (0.0), Slightly clickbaiting (0.33), Considerably clickbaiting (0.66), Heavily clickbaiting (1.0). Of these, 19538 tweets (of which 4761 are clickbaits) have been released for training with labels. The maximum post size is 25. The post lengths follow a normal distribution around a mean of 12. We split the labeled data in an 80:20 ratio for training and validation. We perform 5-fold cross validation and compile the results on the validation set. The remaining 18979 tweets (of which 4515 are clickbaits) are used by the Clickbait Challenge server for testing the submissions. This test set is private and not publicly accessible. Moreover, only an extremely limited number of test runs can be submitted to the test server.
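The annotation and splitting scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `mean_judgment` and `k_fold_splits` are hypothetical, the regression target is taken to be the plain mean of the five judgments, and the shuffling seed is arbitrary.

```python
import random
from statistics import mean

# The four Likert values used by the annotators, as listed above.
LIKERT_VALUES = (0.0, 0.33, 0.66, 1.0)


def mean_judgment(judgments):
    """Average the annotator judgments of one tweet into a single target score."""
    for judgment in judgments:
        if judgment not in LIKERT_VALUES:
            raise ValueError(f"unexpected judgment value: {judgment}")
    return mean(judgments)


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross validation.

    With k=5, each fold holds out roughly 20% of the labeled data,
    which matches the 80:20 split described above.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary assumption
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation
```

For example, `mean_judgment([0.0, 0.33, 0.33, 0.66, 1.0])` averages five judgments of one tweet, and `list(k_fold_splits(19538))` would produce five train/validation index pairs over the labeled tweets.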
Empirical observations reveal that the field postText (text of the post) in the given dataset contributes the most to deciding the intensity of the clickbait. Hence, although the tweets' metadata is available (the post media, the title of the linked target page, the content paragraphs and keywords of the target page, the time of the post and the caption of every image in the target article), we use only the post text of the tweet to train a machine learning model that predicts the clickbait intensity score of each tweet. We leave the exploration of the other metadata fields to future work.

Evaluation Metrics
The goal of the clickbait intensity prediction task is to develop a model that can predict how clickbaiting a social media post is. The score is a real number between 0 and 1. Mean Squared Error (MSE) with respect to the mean judgments of the annotators is used as the primary evaluation metric. Models whose predictions have the lowest MSE are ranked at the top. Unlike classification tasks, where F1 score or accuracy is the evaluation metric, this challenge focuses on predicting the intensity of the title rather than on classifying the title as clickbait or not. Official evaluation is done on the TIRA platform (Potthast et al., 2014). This platform evaluates the predictions by running the prediction code in a sandboxed virtual machine, which ensures that the test data is kept private and not revealed to the public. Moreover, the number of times one can predict on the test data is limited to ensure that the models are not tuned to overfit the test data. The evaluation platform also computes secondary evaluation metrics such as the Median Absolute Error (MedAE), the F1-Score (F1) and Accuracy (Acc) with respect to the truth class.
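As a rough illustration of these metrics, the sketch below computes MSE, MedAE, accuracy and F1 for a list of predicted scores. The 0.5 cut-off used to turn a score into a binary class is an assumption made here for illustration (the text above does not state how the class is derived), and the function names are hypothetical.

```python
from statistics import mean, median

# Assumed cut-off for turning a score into a binary class; not stated in the text.
CLICKBAIT_THRESHOLD = 0.5


def mse(truth, predicted):
    """Mean squared error between mean annotator judgments and predictions."""
    return mean([(t - p) ** 2 for t, p in zip(truth, predicted)])


def median_absolute_error(truth, predicted):
    """Median of the absolute prediction errors."""
    return median([abs(t - p) for t, p in zip(truth, predicted)])


def accuracy_and_f1(truth, predicted, threshold=CLICKBAIT_THRESHOLD):
    """Accuracy and F1 after thresholding both truth and predicted scores."""
    truth_cls = [t > threshold for t in truth]
    pred_cls = [p > threshold for p in predicted]
    tp = sum(t and p for t, p in zip(truth_cls, pred_cls))
    fp = sum((not t) and p for t, p in zip(truth_cls, pred_cls))
    fn = sum(t and (not p) for t, p in zip(truth_cls, pred_cls))
    tn = sum((not t) and (not p) for t, p in zip(truth_cls, pred_cls))
    accuracy = (tp + tn) / len(truth_cls)
    # F1 is defined as 0 here when there are no true positives.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1
```

Because MSE penalizes large deviations quadratically while MedAE ignores outliers, the two regression metrics can rank models differently.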

Approach
We formulate the problem of clickbait strength prediction as a regression problem. We build multiple regression models using pretrained word embeddings, pretrained transformer representations and finetuned transformer representations for clickbait strength prediction. We experiment with various regression algorithms and rigorously investigate the efficacy of these for clickbait strength prediction.

Word and Sentence Embedding Representations
Word embeddings have been widely used in modern Natural Language Processing applications as they provide semantic vector representations of words. They capture the semantic properties of words and the linguistic relationships between them. Word embeddings have improved the performance of many downstream tasks across many domains, such as text classification and machine comprehension (Camacho-Collados and Pilehvar, 2018). Multiple ways of generating word embeddings exist, such as the Neural Probabilistic Language Model (Bengio et al., 2003), word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), LexVec (Salle et al., 2016), dependency-based embeddings (DepVec) (Levy and Goldberg, 2014) and, more recently, ELMo (Peters et al., 2018).
ELMo can generate different embeddings for the same word depending on its context, that is, on where and how it appears in a sentence. ELMo achieves this by using two deep unidirectional LSTMs (forward and backward) and then computing the embedding for a word as a weighted combination of the hidden layer outputs at that position.
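The weighted combination of layer outputs can be illustrated with a small sketch. It assumes the formulation of Peters et al. (2018), in which the layer weights are softmax-normalized and the result is scaled by a scalar; the function and parameter names (`combine_layers`, `layer_weights`, `gamma`) are hypothetical.

```python
import math


def softmax(weights):
    """Normalize raw layer weights so that they are positive and sum to 1."""
    largest = max(weights)
    exps = [math.exp(w - largest) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_vectors, layer_weights, gamma=1.0):
    """Combine the per-layer vectors of one token position into one embedding.

    layer_vectors: one hidden-state vector per layer for the same token.
    layer_weights: one raw (unnormalized) weight per layer.
    gamma: scalar that rescales the combined vector.
    """
    s = softmax(layer_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s[j] * layer_vectors[j][d] for j in range(len(layer_vectors)))
        for d in range(dim)
    ]
```

With equal raw weights the combination reduces to a plain average of the layers; in practice the weights would be learned for the downstream task.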
Universal Sentence Encoder (Cer et al., 2018) is based on the Transformer encoder (Vaswani et al., 2017) and a deep averaging network. It is trained using unsupervised data from Wikipedia, web news, web question-answer pages and discussion forums, and supervised data from the Stanford Natural Language Inference (SNLI) corpus.
We experiment with two models, ELMo (Peters et al., 2018) and Google's Universal Sentence Encoder (Cer et al., 2018), for transforming the clickbait title into a dense numerical vector representation.

Transformer Representations
After the original Transformer work by Vaswani et al. (2017), several architectures have been proposed, such as BERT (Devlin et al., 2018), RoBERTa, OpenAI's GPT2 (Radford et al., 2019) and T5 (Raffel et al., 2019). The GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) leaderboards indicate the great success of transformer models, which have outperformed previous methods across complex NLP tasks like text classification, textual entailment, machine translation and word sense disambiguation. We present the first work to investigate the application of transformer models to the clickbait intensity prediction task.
Transformer networks follow a non-recurrent architecture with stacked self-attention and fully connected layers for both the encoder and the decoder, each with six layers. They are based on concepts like self-attention, multi-head attention, positional embeddings, residual connections and masked attention. While the original Transformer follows an encoder-decoder architecture, just the encoder or just the decoder has been used to define other popular architectures like BERT and GPT-2.
BERT (Devlin et al., 2018) is essentially a transformer encoder with 12 layers, 12 attention heads and 768 dimensions. We used the pre-trained model, which has been trained on the BooksCorpus and Wikipedia using the masked language model (MLM) and next sentence prediction (NSP) loss functions. The post text sequence is prepended with a "[CLS]" token. The representation C for the "[CLS]" token from the last encoder layer is used for regression by connecting it to an output layer. We also finetune the pre-trained model using labeled training data for the clickbait intensity prediction task. BERTLarge is similar to BERT but has 24 layers, 16 attention heads and 1024 dimensions.
OpenAI's GPT2 (Radford et al., 2019) uses a left-to-right Transformer, where every token can only attend to previous tokens in the self-attention layers. We also finetune the pre-trained model using labeled training data for the clickbait intensity prediction task. The GPT model is almost the same size as the BERT Base model. GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words). The largest GPT-2 variant has 1.5B parameters and can take up more than 6.5 GB of storage space.
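The difference between bidirectional self-attention (as in BERT) and left-to-right self-attention (as in GPT2) can be sketched with a single-head scaled dot-product attention. This is a simplified illustration, not the models' actual implementation: it omits the learned query, key and value projections, multiple heads and positional embeddings, and the function name `attention` is hypothetical.

```python
import math


def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors.

    With causal=True, position i may only attend to positions j <= i,
    which mimics a left-to-right Transformer.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        largest = max(scores)
        exps = [math.exp(s - largest) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs
```

With `causal=True`, the output at the first position depends only on the first token, whereas with `causal=False` every position mixes information from the whole sequence.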
RoBERTa is a robustly optimized method for pretraining natural language processing (NLP) systems that improves on BERT. RoBERTa was trained with much more data: 160GB of text instead of the 16GB dataset originally used to train BERT. It is also trained for a larger number of iterations, up to 500K. Batch sizes for training were 8K instead of 256 in the original BERT base model. Further, it uses a larger byte-pair encoding (BPE) vocabulary with 50K subword units instead of the character-level BPE vocabulary of size 30K used for BERT. Finally, compared to BERT, it removes the next sentence prediction objective from the training procedure and applies a dynamically changing masking pattern to the training data. RoBERTaLarge has a configuration similar to BERTLarge.

Regression Models
We train multiple regression models on various kinds of word, sentence and transformer representations. We experiment with the following regression algorithms.
• Simple Linear Regression (LR): Linear regression is the simplest of the regression algorithms and is typically fitted using the least squares approach. The dependent variable is modeled as a linear combination of the attributes.
• Ridge Regression (RR): Ridge regression is a linear regression model with an L2 penalty as the regularizer.
• Gradient Boosted Regression (GBR): GB regression learns an ensemble of regression trees, each of which has scalar values in its leaves. The ensemble of trees is produced by computing, in each step, a regression tree that approximates the gradient of the loss function, and adding it to the previous tree with coefficients that minimize the loss of the new tree. The output of the ensemble on a given instance is the sum of the tree outputs.
• Random Forest Regression (RFR): A random forest regressor is an ensemble learning algorithm for regression that constructs multiple decision trees at training time and outputs the average of the predictions of the individual trees, thereby reducing over-fitting.
• Adaboost Regression (ABR): The AdaBoost regressor is another ensemble learning algorithm. It begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset, with the weights of instances adjusted according to the error of the current prediction, so that subsequent regressors focus more on the difficult cases.
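To make the boosting idea concrete, here is a minimal sketch of gradient boosting for squared loss on a single feature, using depth-one trees (stumps). It is a simplification under stated assumptions: each tree is added with a fixed learning rate instead of an optimized coefficient, the input must contain at least two distinct feature values, and all names are hypothetical. In practice one would use a library implementation.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the one-threshold split of a 1-D feature with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # requires at least two distinct values
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gradient_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit stumps one after another to the residuals of the current ensemble.

    For squared loss, the residuals are the negative gradient of the loss.
    """
    base = mean(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump_predict(stump, x) for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the scaled outputs of all stumps on top of the constant base prediction."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

A random forest differs in that its trees are fitted independently and averaged, while here each tree corrects the errors left by the trees before it.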

Experiments
In this work, we try various kinds of approaches and investigate how they perform on the clickbait prediction task. We use multiple models to transform the text into features. We train multiple regression models on these features and evaluate the efficacy of each of the pre-trained embeddings (BERT, GPT2 or RoBERTa) for the downstream clickbait prediction task. We experimented with the regression techniques described in the previous section.
First, we experiment with pretrained word and sentence embedding representations. In this setting, we transform the text using the pretrained word embedding or the pretrained sentence encoder. These representations are used to train a regression model. Table 5 shows results using this setting on the validation set.
Next, we experiment with pretrained Transformer representations. In this setting, we transform the train and the test data using the pretrained transformer without any finetuning step. The representation C for the "[CLS]" token from the last encoder layer of the pretrained transformer model is used as the feature vector for these regression methods. Table 3 shows results using this setting on the validation set.
Further, we experiment with finetuned Transformer representations. In this setting, we finetune the pretrained transformer model with the labeled data. After finetuning, we use the finetuned transformer model to transform the input text into vector representations and fit a regression model on these representations. The representation C for the "[CLS]" token from the last encoder layer of the finetuned transformer model is used as the feature vector for these regression methods.
Training is a two-stage process. In the first stage, we finetune the Transformer model using the training data to create the finetuned Transformer model. In the second stage, the finetuned Transformer model is used to generate the representations of the training data. These representations are then used to train a regression model. Figure 2 explains these steps in detail.
To predict the intensity of an unseen sample, we first transform the input post text into features using the finetuned Transformer model. These features are fed into the trained regressor, which predicts a numeric score for the post. The predicted clickbait score is then passed through a rectifier function that ensures that the score remains in the interval [0,1]. The bottom part of Figure 2 shows the flow for prediction.
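The prediction flow can be sketched as follows. The rectifier is assumed here to be a simple clip to [0, 1], which is the most direct way to satisfy the stated constraint; `encode` and `regressor` stand in for the finetuned Transformer and the trained regression model, and the toy versions below are hypothetical placeholders used only to make the example runnable.

```python
def rectify(score):
    """Clip a raw regression output to the interval [0, 1] (assumed rectifier)."""
    return min(1.0, max(0.0, score))


def predict_strength(post_text, encode, regressor):
    """Encode the post text, regress a raw score, then rectify it."""
    features = encode(post_text)
    return rectify(regressor(features))


# Toy stand-ins for illustration only; they are not the models used in this work.
def toy_encode(text):
    return [len(text.split()), text.count("!")]


def toy_regressor(features):
    return 0.05 * features[0] + 0.6 * features[1] - 0.1
```

Calling `predict_strength("You won't believe what happened!", toy_encode, toy_regressor)` returns a score inside [0, 1] even when the raw regressor output falls outside that range.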
For the final, official evaluation, we used the complete training dataset to train the model. This model is used to make predictions on the unseen official test set. As only a limited number of test runs were allowed, to prevent participants from over-fitting the test data, we submitted only a few runs.
Table 4: Results on the validation set using finetuned Transformer model representations.
Table 5 shows results for different machine learning regression methods using word and sentence embeddings. Tables 3 and 4 show results using pretrained transformer representations and finetuned transformer representations, respectively. Among the pretrained word and sentence embedding methods in Table 5, the best MSE/MedAE is obtained using Universal Sentence Encoder with Ridge Regression. The best F1/Acc is obtained using ELMo with Linear Regression. Among the transformer-based methods in Table 4, the best MSE and MedAE are obtained using RoBERTa with GB regression. We also experimented with just the finetuning approach, without any extra regressor on top of the last layer, i.e., using just a neuron in the output layer of the neural network. We call this method neural network regression (NNR). With respect to classification metrics, RoBERTaLarge with NNR performs best. Note that:
(1) the finetuning+ML regressors approach typically provides better results than the NNR method.
(2) finetuned models have lower MSE than the corresponding pretrained models (especially GPT2, where the pretrained-only model performs poorly).
(3) larger Transformer models like RoBERTaLarge and BERTLarge do not lead to lower MSE/MedAE values, probably because the labeled dataset is relatively small.
Finally, Table 7 compares our results on the test set against several baselines; these results are also available on the Clickbait Challenge Leaderboard as of 23-Nov-2019. Note that 6 of the top 7 are our approaches. Details of these approaches are as follows: goldfish 1 is RoBERTa + GBR, goldfish 2 is RoBERTa + RR, goldfish 3 is GPT2 + LR, torpedo19 1 is Universal Sentence Encoder + RR, torpedo19 2 is ELMo + RR, torpedo19 3 is ELMo + LR. Figure 3 shows the histogram of the absolute errors of the model's predictions. We can observe that for most of the posts the error between the actual and the predicted value is less than 0.2. Few samples have an error in the range of 0.2 to 0.45, and very few posts have an error greater than 0.45. Further, Table 6 shows examples of posts where our proposed model gives good predictions as well as those where it fails. A higher score implies a higher degree of clickbait. We show examples for both high and low clickbait strength.
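The error ranges discussed above can be counted with a short helper. This is only a sketch: the bucket edges 0.2 and 0.45 are taken from the text, while the function name `error_buckets` and the choice to place an error equal to an edge in the higher bucket are assumptions.

```python
def error_buckets(truth, predicted, edges=(0.2, 0.45)):
    """Count absolute errors falling below, between and above the given edges."""
    counts = [0] * (len(edges) + 1)
    for t, p in zip(truth, predicted):
        error = abs(t - p)
        # The bucket index is the number of edges the error reaches or exceeds.
        bucket = sum(error >= edge for edge in edges)
        counts[bucket] += 1
    return counts
```

The returned list holds, in order, the number of posts with error below 0.2, between 0.2 and 0.45, and at or above 0.45.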
Finally, Figure 4 visualizes the average attention that the [CLS] token pays to the various words of the post in the last Transformer layer. For the first example, the words "15", "ways", "double" and "income" have high attention values; intuitively, these words indicate clickbaity-ness as well. Similarly, clickbaity words in the second example like "Five" and "style" have high attention values.
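Such a visualization needs one attention value per word. A minimal sketch of the averaging step is shown below, assuming that the average is taken over the attention heads of the last layer and that the attention weights are already available as nested lists; the function name and the input layout `attention[head][query][key]` are hypothetical.

```python
def average_cls_attention(attention, tokens, cls_index=0):
    """Average over heads the attention that the [CLS] position pays to each token.

    attention[h][i][j] is the weight that query token i gives to key token j
    in head h. Returns (token, average weight) pairs, highest weight first.
    """
    n_heads = len(attention)
    averaged = [
        sum(head[cls_index][j] for head in attention) / n_heads
        for j in range(len(tokens))
    ]
    return sorted(zip(tokens, averaged), key=lambda pair: pair[1], reverse=True)
```

Tokens at the top of the returned ranking are the ones that would be highlighted most strongly in a plot like Figure 4.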

Conclusion
In this paper, we proposed various methods for clickbait intensity prediction based on the text of the post. Using a benchmark dataset from the Clickbait Challenge, we evaluated multiple models; we are the first to investigate the effectiveness of Transformer-based models for this task. As of now, we rank at the top of the official leaderboard for the challenge. We plan to work on reducing the model size and improving runtime latency using popular knowledge distillation methods.