Ad Headline Generation using Self-Critical Masked Language Model

For any E-commerce website it is a nontrivial problem to build enduring advertisements that attract shoppers. Many sellers often find it hard to pass the creative quality bar of the website, especially at a large scale. We thus propose a programmatic solution to generate product advertising headlines using retail content. We propose a state of the art application of Reinforcement Learning (RL) Policy gradient methods on Transformer (Vaswani et al., 2017) based Masked Language Models (Devlin et al., 2019). Our method creates the advertising headline by jointly conditioning on multiple products that a seller wishes to advertise. We demonstrate that our method outperforms existing Transformer and LSTM + RL methods in overlap metrics and quality audits. We also show that our model-generated headlines outperform human (advertiser) submitted headlines in terms of both grammar and creative quality as determined by audits.


Introduction
There are various types of ads. Figure 1 shows a set of example ads that showcase products selected by sellers, along with the headlines that advertise them. Sellers create multiple ad campaigns for multiple products, bid in an auction to advertise, and pay for clicks on the ad.
An E-Commerce product catalog may have millions of products which can be advertised. To ease the ad headline writing process, humans resort to programmatically padding keywords, or repasting the retail catalog content in the advertisement.
Templated creatives such as "Save Now on ..." or "Buy more (product) of (brand)" save the creative effort but fail to create any excitement or brand identity in the minds of shoppers. High quality headlines are more attractive to shoppers and offer a better value proposition. In this paper, we describe how we built a Natural Language Generation (NLG) system to generate instantaneous, attractive and brand identity building headlines for advertisements that intend to promote a wide range of products offered by a brand.
The content associated with a retail product has challenging characteristics. Some product titles have poor structure, grammatical issues, or partial phrases. The product titles also include a varying number of product features: some are long, such as "Hyper Tough 18V Cordless Drill, 3/8 inch Chuck, Variable Speed, with 1.2Ah Nickel Cadmium Battery, Charger, Bit Holder LED Light", while others are short, such as "ZIPIT Grillz Backpack, Camo Grey".
The generated headlines need to capture the information present in the retail attributes and at the same time be different and uniquely attractive. Advertisers select multiple related products that are advertised as part of a single ad campaign. The ad campaign headline is then shared across all of these related products. Thus, the headline also needs to generalize the shared characteristics of the products and cannot be specific to a single product within the campaign.
The key contributions of our work are:
• We use a Masked Language Model (MLM) to generate advertisement headlines from multiple products at the same time. Extensive test-set metrics and quality and grammar audits show that the proposed model outperforms all the baselines and the human-submitted headlines in terms of quality and grammar.
• The novel usage of RL for the training of MLM allows us to directly optimize the MLM for improved headline quality metrics without changing inference setup or latency. Our method can also be applied to any other NLG task such as summarization, translation etc.
• Our model reduces the extensive effort and time that is required to manually create headlines and has low latency.
Figure 1: Examples of different product ads from multiple websites across the internet. A variety of ad headlines accompany the products in these ads.

Related Work
Natural Language Understanding (NLU) using Language Models (LMs) has seen great leaps in recent years. LMs have evolved from word-level models (Joulin et al., 2016) to a variety of extensions to the Transformer (Vaswani et al., 2017). BERT (Devlin et al., 2019) employs the Transformer in a pre-training setting and introduced the MLM training objective. Ramachandran et al. (2016) first demonstrated textual generation by using auto-regressive prediction in a seq2seq architecture. Transformer-based auto-regressive methods such as GPT2 (Radford et al., 2019) and BART (Lewis et al., 2019), which predict one word at a time, have also shown good results. Zhu et al. (2020) concatenated BERT representations with the Encoder and Decoder layers of another LM to incorporate a pre-trained LM. Another model (Dong et al., 2019) combines a BERT-based Transformer Encoder with attention masking from the Transformer decoder. Rothe et al. (2019) combined a pre-trained BERT Encoder with a GPT decoder for NLG. Ranzato et al. (2016) framed NLG as an RL problem with the generation quality as the reward. The Self-Critical Sequence Training (SCST) approach (Rennie et al., 2017) replaces the learned baseline from other approaches (Bahdanau et al., 2017) with the model's own inference-time algorithm to normalize the rewards.
For advertising, recent works (Xu et al., 2019; Hughes et al., 2019) have combined an LSTM-based pointer network (See et al., 2017) with RL methods to generate advertisement headlines. While these methods improve the results, they fail to utilize the extensive pre-training of Transformer-based models and their various well-demonstrated advantages.
Our method extends BERT based generation (Dong et al., 2019) by using Self-Critical policy gradient method (Rennie et al., 2017) and jointly conditioning the generated sentence on multiple products at the same time. This allows us to use pre-trained BERT based LMs that can be trained to optimize various inference time metrics that are typically non-differentiable such as BLEU, Rouge, Readability etc.
3 Self-Critical Masked Language Model

Masked Language Model
The BERT model takes an unlabeled input sequence $x = (x_1, x_2, ..., x_{|x|})$ and randomly masks some positions $M_x$ by replacing them with a special mask token [MASK], to produce a sequence like $(x_1, [MASK], ..., x_{|x|})$. All the tokens are embedded and added to special positional embeddings. It then uses N identical Transformer layers to generate contextualized representations, with each layer employing self-attention by taking in the output of the previous layer. To compute self-attention, the output of the previous layer is projected into triplets of vectors named Query, Key and Value (Q, K, V) of dimension d. The attention A is then given as:

$$A = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V \tag{1}$$

After the final Transformer layer, the model uses a feed-forward layer followed by a softmax over the vocabulary to predict the masked tokens. The MLM loss for the sequence x is then calculated as:

$$L_{MLM}(x) = -\sum_{m \in M_x} \log p\left(x_m \mid x_{m'} \in x \setminus M_x\right) \tag{2}$$

where $x_{m'} \in x \setminus M_x$ represents all the tokens in x that are not masked and $m \in M_x$ are all the masked positions.

Figure 2: The sub-tokens from the product titles and headline are embedded and added with other embeddings that encode the positional and segment information. We also optionally add an embedding that represents the category of the product. During training, the masked tokens are predicted using Transformer layers, and the cross-entropy loss (Eq. 2) and the Self-Critical gradient (Eq. 9) are used to optimize the model. During inference, we predict one word at a time (left-to-right) in an auto-regressive manner using Beam Search.
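As an illustration of the two quantities described above, the following minimal sketch computes scaled dot-product attention, softmax(QKᵀ/√d)V, and the negative log-likelihood of the true tokens at the masked positions. It uses plain Python lists rather than a tensor library. The function names and the unnormalized sum in `mlm_loss` are assumptions made for illustration; this is not the training code used in the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def mlm_loss(token_probs, target_ids, masked_positions):
    """Negative log-likelihood of the true tokens, at masked positions only.

    token_probs[m] is the predicted vocabulary distribution at position m.
    Summing (rather than averaging) over positions is an assumption.
    """
    return -sum(math.log(token_probs[m][target_ids[m]]) for m in masked_positions)
```

Positions that are not masked contribute nothing to the loss, which is why only `masked_positions` are iterated.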

Encoding multiple products and common headline for Proposed MLM
During training, we are given an advertising campaign with a headline and a set P of one or more products. Each product p is represented by its title. The titles and the headline are tokenized into sub-word tokens. To encode using the model that only accepts a single product, we simply append '[EOS]' ∈ V to both the title and the headline and concatenate their tokens. The entire concatenated sequence is prepended with '[SOS]' ∈ V.
We encode multiple products by concatenating the tokens from different products using a special token '[P_SEP]' ∈ V. We replace a token '[UNUSED_0]' ∈ V that remains unused during pre-training, with this special token during multi-product fine-tuning. This makes a distinction between different titles as well as the source and target sub-sequences. It also yields individual embeddings for each product for other tasks.
Only the tokens from the headline $x^h$ are randomly masked with the token '[MASK]' ∈ V. We discuss results for the model that additionally masks the source tokens in Section 5.1.
The complete process for an example such that all products in the ad have two tokens and the headline has 4 tokens is illustrated in Figure 2.
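The encoding described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: the exact token layout ('[SOS]', titles joined by '[P_SEP]', '[EOS]', headline, '[EOS]'), the masking probability `mask_prob`, and the decision to let the final '[EOS]' be masked are illustrative choices, and the function names are hypothetical.

```python
import random

def build_sequence(titles, headline):
    """Concatenate tokenized product titles and the headline into one sequence.

    titles: list of token lists, one per product.
    headline: token list.
    Returns the full token list and the index where the headline starts.
    """
    tokens = ["[SOS]"]
    for i, title in enumerate(titles):
        if i > 0:
            tokens.append("[P_SEP]")  # separates consecutive products
        tokens.extend(title)
    tokens.append("[EOS]")            # closes the source (titles) segment
    headline_start = len(tokens)
    tokens.extend(headline)
    tokens.append("[EOS]")            # closes the target (headline) segment
    return tokens, headline_start

def mask_headline(tokens, headline_start, mask_prob=0.5, seed=0):
    """Randomly replace headline tokens with [MASK]; source tokens are untouched."""
    rng = random.Random(seed)
    masked = list(tokens)
    positions = []
    for i in range(headline_start, len(tokens)):
        if rng.random() < mask_prob:
            masked[i] = "[MASK]"
            positions.append(i)
    return masked, positions
```

For two products with two tokens each and a four-token headline, as in the example of Figure 2, `build_sequence` returns a 12-token sequence whose headline segment starts at index 7.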
We also experimented with adding category-based embeddings. The category labels for each product, such as "Cell Phones and Accessories", are tokenized to subword units, encoded using the same embedding matrix as the title tokens, averaged, and added to the title token embeddings.

Generation using Self-Critical Masked Language Model
The BERT MLM framework with multi-directional attention discussed in Section 3.1 cannot be used for auto-regressive generation directly. This is because, during training, the masked headline words may condition on the future words, which are not available during auto-regressive inference. For MLM auto-regressive generation, we employ masked attention (Dong et al., 2019) that modifies the attention from Equation 1 as below:

$$A = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}} + \Phi\right)V$$

where $\Phi_{ij}$ represents the attention mask between the positions i and j. The elements are set to 0 if attention is allowed and −∞ if it is not allowed. Figure 3 illustrates the attention mask for headline generation using multiple input products.
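A minimal sketch of how such a mask Φ could be built is shown below. It assumes the sequence-to-sequence masking pattern of Dong et al. (2019): source (title) positions attend to every source position, while headline positions attend to the whole source and to headline positions up to and including themselves. The function name and the exact pattern are assumptions for illustration; Figure 3 remains the authoritative description.

```python
NEG_INF = float("-inf")

def build_attention_mask(src_len, tgt_len):
    """Return Phi as a list of rows; Phi[i][j] is 0.0 if i may attend to j, else -inf.

    Positions 0..src_len-1 are source (title) tokens,
    positions src_len..src_len+tgt_len-1 are headline tokens.
    """
    n = src_len + tgt_len
    phi = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                # Every position may attend to all source tokens.
                phi[i][j] = 0.0
            elif i >= src_len and j <= i:
                # Headline tokens see only themselves and earlier headline tokens.
                phi[i][j] = 0.0
    return phi
```

Adding this matrix to the scaled scores before the softmax drives the weight of every disallowed pair to zero, since exp(−∞) = 0.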
The BERT MLM uses log-likelihood (Equation 2) of masked words during training to optimize the model parameters. The likelihood is predicted using other ground-truth words during training and other predicted words during inference. This causes exposure bias (Ranzato et al., 2016;Rennie et al., 2017) and accumulates error during inference. Moreover, the training is optimized for log-likelihood, while we actually care about other more evolved measures of headline quality such as overlap metrics BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004).
Figure 3: Masked attention partially restricts attention for some token pairs. It prevents attention to headline tokens that would not be accessible during each step of generation during inference.

To overcome these issues and improve the quality of the generated headlines, we frame the MLM as an RL problem. The model is an 'agent' that takes the 'action' of predicting masked words and updates the 'state', such as the self-attention weights. The MLM follows a policy $\pi_\theta$ defined by the parameters θ of the model. It receives a reward that is proportional to the quality of the generated headline. This quality may either be the overlap with ground truth headlines that have been approved by internal subject-matter-experts or be predicted by another model. Our goal is to maximize the reward corresponding to a generated headline $x^h$ during training, with the tokens at some masked positions $M_{x^h}$ sampled from the model. We thus minimize the negative expected reward defined by any reward function r(·) for headline quality $r(x^h)$ as:

$$L_{RL} = -\mathbb{E}_{x^h \sim \pi_\theta}\left[r(x^h)\right] \tag{5}$$

We can compute the gradient $\nabla_\theta L_{RL}$ using the REINFORCE algorithm (Williams, 1992). It is defined as:

$$\nabla_\theta L_{RL} = -\mathbb{E}_{x^h \sim \pi_\theta}\left[r(x^h)\,\nabla_\theta P\right], \qquad P = \sum_{m \in M_{x^h}} \log p_\theta\left(x^h_m \mid x_{m'} \in \hat{M}_{x^h}\right) \tag{6}$$

where $\hat{M}_{x^h}$ are all the unmasked tokens.

To reduce the variance without changing the expected gradient, the algorithm proposes to use a baseline b that does not depend on the generated headline $x^h$. b is used to normalize the reward along with P from Equation 6 as:

$$\nabla_\theta L_{RL} = -\mathbb{E}_{x^h \sim \pi_\theta}\left[\left(r(x^h) - b\right)\nabla_\theta P\right] \tag{7}$$

A single Monte-Carlo sample for each set of products and headline can be used to approximate the gradient. Using the definition of P from Equation 6, we have the approximate gradient:

$$\nabla_\theta L_{RL} \approx -\left(r(x^h) - b\right)\nabla_\theta P \tag{8}$$

Instead of using other models to estimate the expected baseline reward (Ranzato et al., 2016; Bahdanau et al., 2017), we employ Self-Critical training (Rennie et al., 2017) that involves generating two headlines using the same underlying MLM. The first headline $x^h$ is generated by sampling from the vocabulary distributions generated by the model for the masked tokens. The second headline $\hat{z}^h$ is generated using the inference-time strategy, which uses the token with the maximum probability at each step rather than sampling. The difference in the reward achieved by these two headlines is used to compute the gradient:

$$\nabla_\theta L_{SC\_MLM} \approx -\left(r(x^h) - r(\hat{z}^h)\right)\nabla_\theta P \tag{9}$$

where P is defined by Equation 6. Thus, this method maximizes both the reward of the headlines generated by the MLM and the likelihood of correct words by incorporating both the likelihood and the reward in the loss function.
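The self-critical update can be illustrated with a small sketch. It draws one sampled headline and one greedy headline from the same per-position vocabulary distributions, then forms a scalar surrogate loss: the negative reward difference times the log-likelihood of the sampled tokens. Differentiating this surrogate with respect to the model parameters, with the rewards treated as constants, gives the self-critical gradient. The function names, the toy reward, and the use of fixed distributions (instead of a real model) are assumptions for illustration.

```python
import math
import random

def sample_and_greedy(dists, rng):
    """dists[k] is the vocabulary distribution at the k-th masked position."""
    # Sampled headline: one draw per masked position.
    sampled = [rng.choices(range(len(p)), weights=p)[0] for p in dists]
    # Greedy headline: most probable token per masked position.
    greedy = [max(range(len(p)), key=p.__getitem__) for p in dists]
    return sampled, greedy

def self_critical_loss(dists, sampled, greedy, reward_fn):
    """Surrogate loss -(r(sampled) - r(greedy)) * log p(sampled)."""
    log_prob = sum(math.log(p[t]) for p, t in zip(dists, sampled))
    advantage = reward_fn(sampled) - reward_fn(greedy)  # treated as a constant
    return -advantage * log_prob
```

When the sampled headline scores higher than the greedy one, minimizing the loss raises the probability of the sampled tokens; when it scores lower, their probability is pushed down. If both headlines are identical, the loss is zero and no update is made.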

Inference
During inference, we generate the headline auto-regressively using beam search until we reach the predetermined maximum length or each beam generates the end token. We employ a modified version of Length Normalization (Wu et al., 2016) to better adapt to our headline lengths and training setup. This is necessary because the default beam search setup uses the log probability of each word to select the best headline, which biases the results against longer headlines, as they have a lower probability of generation. We thus select the best headline using normalized scores for each word (Equation 10), where α is the length normalization coefficient and $x^h_i$ is the i-th word of the generated headline in each beam. We also include additional regular-expression-based post-processing to remove extra spaces around various symbols such as '-,+()' etc.
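To make the effect of length normalization concrete, the sketch below uses the standard length penalty of Wu et al. (2016), lp(Y) = ((5 + |Y|) / 6)^α, and not the modified version used in this work, whose exact form is not reproduced here. The post-processing function shows one possible set of regular expressions for tidying spaces around symbols; the specific rules, the default α, and all function names are assumptions for illustration.

```python
import re

def length_penalty(length, alpha=0.6):
    """Standard length penalty from Wu et al. (2016)."""
    return ((5 + length) / 6) ** alpha

def normalized_score(token_log_probs, alpha=0.6):
    """Sum of per-token log probabilities divided by the length penalty."""
    return sum(token_log_probs) / length_penalty(len(token_log_probs), alpha)

def tidy_spacing(text):
    """Remove extra spaces around a few symbols (hypothetical rule set)."""
    text = re.sub(r"\s*([-+])\s*", r"\1", text)  # "all - in" -> "all-in"
    text = re.sub(r"\s+,", ",", text)            # "pack , black" -> "pack, black"
    text = re.sub(r"\(\s+", "(", text)           # "( 2" -> "(2"
    text = re.sub(r"\s+\)", ")", text)           # "pack )" -> "pack)"
    return text
```

With α = 0 the score reduces to the raw sum of log probabilities, which favors shorter beams; a larger α increasingly compensates longer beams for their extra tokens.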

Training and Inference
We used over 500,000 ad campaigns that were created on Amazon by sellers who have signed up for advertising. Each campaign contains a set of related products along with an ad headline. We only selected the campaigns that contained English headlines and products with English titles. They were also de-duplicated to only have unique products-headline pairs. The mean product title length is 19.6 words and the mean headline length is 6.16 words. The entire dataset was divided into train (85%), validation (5%) and test (10%) sets. For training, we only selected the campaigns that comply with ad policies as verified by internal experts.
We use the HuggingFace (Wolf et al., 2020) implementation of the Transformer BERT 'Large' models as the base for our experiments. The models are pre-trained on Wikipedia and BookCorpus (Devlin et al., 2019; Dong et al., 2019). We first fine-tune the pre-trained model for up to 15 epochs with early stopping using $L_{MLM}$ and Adam (Kingma and Ba, 2014). We then further fine-tune the model for another 15 epochs with early stopping using Adam with $\nabla L_{SC\_MLM}$ (Equation 9). We use the Rouge-L F1 (Lin, 2004) overlap with the approved headlines as the headline quality reward. For a fair comparison, the MLM-only model is fine-tuned for up to 30 epochs.
Model training is very time-consuming, with a single fine-tuning sub-experiment of 30 epochs taking over 20 days on an Nvidia V100. We thus only performed the essential experiments that help to determine the contribution of different sub-experiments and proposals. We estimated post-experiment that a single fine-tuning sub-experiment of 30 epochs would consume approximately 150 kWh of energy based on the GPU's power draw.
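For reference, a Rouge-L F1 reward between a generated headline and an approved headline can be computed from the longest common subsequence (LCS) of their tokens. The sketch below is a minimal version that assumes lowercased whitespace tokenization; the function names are hypothetical, and the actual evaluation package and tokenization used in the experiments may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Harmonic mean of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the LCS is a non-differentiable function of the generated tokens, a reward of this kind cannot be optimized through the log-likelihood loss directly, which is what motivates the policy gradient formulation.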

Baseline
We used a Pointer Network (See et al., 2017) based bi-LSTM with intra-decoder and temporal attention. We also used Self-Critical training with the bi-LSTM, similar to other ad headline generation methods (Xu et al., 2019; Hughes et al., 2019), for a fair comparison with Self-Critical MLM.

Ablations
We trained a model with the same architecture, number of parameters and input as the proposed models but without MLM pre-training and separately without Self-Critical loss to study the impact of the proposals.
We also trained a model with MLM pre-training that is fine-tuned using only the primary (first) product from each campaign instead of all the products. This is interesting because some campaigns are fairly cohesive, with similar products, and using only one product improves training time and inference latency.
We also report overlap metrics for a model that does not use the length normalization (Equation 10) and post-processing discussed above. We also include results for a model that uses BERT Base as the base model instead of BERT Large.

Overlap with Approved Headlines
The first evaluation criterion we adopt is overlap (Sharma et al., 2017) of model headlines with subject-matter-experts approved human-submitted headlines from the test set (Table 1).
Masking the source product title words reduces the performance, as the titles and headlines do not follow the same sentence structure and distribution. Adding product category embeddings reduces performance; our hypothesis is that this is because the base model cannot be pre-trained with these embeddings. Using only one title achieves lower but respectable performance, highlighting the efficacy of multi-product conditioning.
"No pre-training of MLM" highlights the advantage of using non-pretrained Transformer based architecture over bi-LSTM. 'Proposed MLM' shows the advantage of using pre-training, BERT Large and only masking the headline. 'Proposed Self-Critical MLM' achieves the best scores across all the metrics and highlights the applicability of our proposed approach.

Quality and Grammar Audits
We also conducted large scale crowd-sourced evaluation studies of the headlines with over 150,000 judgments. All headlines are shuffled and each headline is rated by 3 random and double-blind crowd-sourced auditors. The quality is judged on a 3-point scale of [1. Incorrect or Irrelevant, 2. Correct, 3. Correct and Attractive] and we use the mode of the 3 judgments.
In this double-blind audit, the auditors were not aware of the source of the headlines and we were not aware of the identity or demographics of any auditor. More details about the workforce may be found in the platform documentation (Ground Truth, 2021). In order to determine the compensation for the crowd-sourced workers, we used the guideline provided by the crowd-sourcing platform to "choose a price consistent with the approximate time it takes to complete a task" (Visible in the Console while creating the Labeling (2021) job). We thus first conducted an internal audit by volunteers across our organization to determine the time required to complete the task (average 21.59s) and then used the remuneration recommended for the corresponding time range ($0.12 for 20s-22s).
Table 2 summarizes the quality audits. The SC-biLSTM model performed worse than the human-submitted headlines. The proposed SC-MLM model achieves the highest average rating and the largest number of perfectly rated headlines. Using just a single product does produce correct headlines with 8% faster inference latency but fails to produce attractive headlines due to the lack of input from multiple products.
We also conducted grammar-specific audits (N ≈ 10000) in which the grammar of the headlines is judged independently. 98.13% of SC-MLM and 98.12% of MLM generated headlines were judged to have correct grammar, against 93.14% of human-submitted headlines. Table 3 shows a sample of headlines for campaigns in the blind test set. Excessive keyword stuffing in source product titles does hamper headline quality at times, and post-filtering using the beam search score helps to filter such headlines out. We do observe cases where both models generate the same headline. This is an artifact of the fact that both models share the first 15 epochs. The SC-MLM model generates more descriptive headlines, and both models are able to abstract the product qualities.

Table 3: Some samples of model generated headlines from subsets rated 3, 2 and 1. The frequency of headlines is not indicative of the true distribution of headline quality.

Conclusion
Ad headline generation is a difficult problem owing to the varying nature of retail product attributes. A lot of historical methods focus on template based creation of ad headlines that are not very expressive.
We demonstrated a new NLG-based method to generate headlines for multiple products. Our method achieves the highest scores in overlap metrics, quality audits and grammar audits compared to the baselines and human-submitted headlines. Masked Language Models were relatively unexplored for ad headline generation and we were able to demonstrate their utility. We further improved the performance of the model by using Reinforcement Learning. The method only changes the training procedure without impacting inference latency. Thus, our work contributes to both SOTA and practical business applications.
The approach can also be used for any other NLG task.