Interpretable Neural Architectures for Attributing an Ad’s Performance to its Writing Style

How much does “free shipping!” help an advertisement’s ability to persuade? This paper presents two methods for performance attribution: finding the degree to which an outcome can be attributed to parts of a text while controlling for potential confounders. Both algorithms are based on interpreting the behaviors and parameters of trained neural networks. One method uses a CNN to encode the text and an adversarial objective function to control for confounders, then projects its weights onto its activations to interpret the importance of each phrase towards each output class. The other method leverages residualization to control for confounds and performs interpretation by aggregating over learned word vectors. We demonstrate these algorithms’ efficacy on 118,000 internet search advertisements and their outcomes, finding language indicative of high and low click-through rate (CTR) regardless of who the ad is by or what it is for. Our results suggest the proposed algorithms are high-performing and data-efficient, able to glean actionable insights from fewer than 10,000 data points. We find that quick, easy, and authoritative language is associated with success, while lackluster embellishment is related to failure. These findings agree with the advertising industry’s empirical wisdom, automatically revealing insights which previously required manual A/B testing to discover.


Introduction
A text's style can affect our cognitive responses and attitudes, thereby influencing behavior (Spence, 1983; Van Laer et al., 2013). The predictive relationship between language and behavior has been well studied in applications of NLP to tasks like linking text to sales figures (Ho and Wu, 1999) and voter preference (Luntz, 2007; Ansolabehere and Iyengar, 1995).
In this paper, we are interested in interpreting rather than predicting the relationship between language and behavior. We focus on a specific instance: the relationship between the way a search advertisement is written and internet user behavior as measured by click-through rate (CTR). In this study, CTR is the ratio of clicks to impressions over a 90-day period, i.e., the probability of a click given that a person saw the ad. Our goal is to develop a method for performance attribution in textual advertisements: identifying lexical features (words, phrases, etc.) to which we can attribute the success (or failure) of a search ad, regardless of who created the advertisement or what it is selling.
Identifying linguistic features that are associated with various outcomes is a common activity among machine learning scientists and practitioners. Indeed, it is essential for developing transparent and interpretable machine learning NLP models (Yamamoto, 2012). However, the various forms of regression and association quantifiers like mutual information or log-odds ratio that are the de facto standard for feature weighting and text attribution all have known drawbacks, largely related to problems of multicollinearity (Imai and Kim, 2016; Gelman and Loken, 2014; Wurm and Fisicaro, 2014; Estévez et al., 2009; Szumilas, 2010).
Furthermore, these prior methods of text attribution critically fail to disentangle the explanatory power of the text from that of confounding information which could also explain the outcome. For example, in movie reviews, the actors who star in a film are the most powerful predictors of box office success (Joshi et al., 2010). However, these are words that the film's marketers can't change.
Likewise, the name of a well-known brand in an ad for shoes might boost its effectiveness, but if we attribute the ad's success to the brand terms, we are actually crediting the power of the brand, not necessarily an actionable writing strategy (Ghose and Sundararajan, 2006).
There is an emerging line of work on text understanding for confound-controlled settings (Egami et al., 2017; Pryzant et al., 2018; Li et al., 2018), but these methods are usually concerned with making causal inferences using text. They are limited to word features and can only indicate whether a word is discriminative. Attribution involves the more fine-grained problem of identifying discriminative subsequences of the text and being able to explain which level of the outcome these subsequences support.
We present a pair of new algorithms for solving this problem. Based on the Adversarial and Residualizing models of (Pryzant et al., 2018), these algorithms first train a machine learning model and then analyze the trained parameters on strategically chosen inputs to infer the most important features for each output class. Our first algorithm encodes the text with a convolutional neural network (CNN) and proceeds to predict the outcome and adversarially predict the confounders. We select attributional n-grams by projecting back the weights of the output layer onto the encoder's convolutional feature maps. Our second algorithm uses a bag-of-words text representation and is trained to learn the part of the text's effect that the confounds cannot explain. We get n-grams from this method by tracing back the contribution of each feature towards each outcome class.
We demonstrate these algorithms' efficacy by conducting attribution studies on high- and low-performing search advertisements across three domains: real estate, job listings, and apparel. We find the proposed algorithms lend importance to words that are more predictive and less confound-related than a variety of strong baselines.

Text Attribution
We begin by proposing a methodological framework for text attribution and formalizing the activity into a concrete task.
We have access to a vocabulary V = {v_1, ..., v_m}; text T = (w_1, ..., w_t), represented as a sequence of tokens where each w is an element of V; an outcome variable Y ∈ {1, ..., k}; and confounding variable(s) C. The data consists of (T_i, Y_i, C_i) triples, where the i-th data point includes a passage of text, an outcome, and confounding information that could also explain the outcome. Note that parts of T and C are related because language reflects circumstance (the text T is usually authored within a broader pragmatic context, for example the intent to promote a certain product at a certain price); T and Y are related because language influences behavior; C and Y are related because circumstance also influences behavior. We are interested in isolating the T-Y relationship and finding out which parts of the text act towards each possible outcome. We do so by choosing a lexicon L_1, ..., L_k ⊂ V for each outcome class such that the outcome x in observation (T_i, Y_i = x, C_i) can be credited to T_i ∩ L_x, regardless of C. In other words, observing Y_i = x can always be attributed to the tokens in L_x no matter the circumstances.
Saying that Y_i = x can be attributed to L_x means (1) the words in L_x have a causal effect on Y and (2) these words push Y towards class x, i.e., L_x is associated with class x. Based on the potential outcomes model (Holland et al., 1985; Splawa-Neyman et al., 1990; Rubin, 1974; Pearl, 1999), Pryzant et al. (2018) developed a causal informativeness coefficient which measures the causal effects of a lexicon L on Y: I(L) measures the ability of T ∩ L to explain Y's variability beyond the information already contained in the confounders. One computes I(L) by (1) regressing Y on C, (2) regressing Y on C and L ∩ T, and (3) measuring the difference in cross-entropy error between these two models over a test set. So I(L_x) measures the degree to which L_x influences Y, but it cannot describe the degree to which L_x influences Y towards the specific outcome x. We propose circumventing this issue with a new directed informativeness coefficient

I′(L, x) = lo(L, x) · I(L),

where lo(L, x) is the average strength of association between the tokens in L and outcome x, as measured by log-odds:

lo(L, x) = (1/|L|) Σ_{v ∈ L} log [ p(Y = x | v ∈ T) / p(Y ≠ x | v ∈ T) ].

Intuitively, if I′(L_x, x) is high, then L_x is both highly influential on Y and strongly associated with outcome x.
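As a concrete illustration, the coefficient can be computed as in the following minimal numpy sketch, which takes held-out predicted probabilities from a confound-only model and a confound-plus-lexicon model and combines the cross-entropy difference with the average log-odds (function and variable names are our own; this is not the original implementation):

```python
import numpy as np

def cross_entropy(y, p):
    """Mean negative log-likelihood of binary labels y under predicted probs p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def informativeness(y, p_conf, p_conf_lex):
    """I(L): reduction in held-out cross-entropy from adding the lexicon
    features (L ∩ T) to the confound-only model."""
    return cross_entropy(y, p_conf) - cross_entropy(y, p_conf_lex)

def avg_log_odds(lexicon, texts, y, target=1):
    """Average log-odds of the target class among texts containing each token."""
    scores = []
    for v in lexicon:
        hits = np.array([v in t for t in texts])
        if hits.any():
            p = np.clip((y[hits] == target).mean(), 1e-6, 1 - 1e-6)
            scores.append(np.log(p / (1 - p)))
    return float(np.mean(scores)) if scores else 0.0

def directed_informativeness(lexicon, texts, y, p_conf, p_conf_lex, target=1):
    """I'(L, x) = lo(L, x) * I(L)."""
    return avg_log_odds(lexicon, texts, y, target) * informativeness(y, p_conf, p_conf_lex)
```

A lexicon scores highly only when it both improves predictions beyond the confounds and is consistently associated with the target class.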

Proposed Algorithms
We continue by describing the pair of novel algorithms we are proposing to use for text attribution. Each algorithm consists of two phases: training, where we use T , Y , and C to train a machine learning model, and interpretation, where we analyze the learned parameters to identify attributional language.

Convolutional Adversarial Selector (CA)
Training. We begin by observing that the language we want to attribute should be able to explain the variation in Y and should also be decorrelated from the confounders C. This implies that the features we want to select should be predictive of Y , but not C (e.g. brand name). The Convolutional Adversarial Selector (CA) draws inspiration from this. It adversarially learns encodings of T which are useful for predicting Y but are not useful for predicting C. The model is depicted on the left-hand side of Figure 1.
First, we encode T into e ∈ R^f with the following steps: 1. Embed the tokens of T with word vectors of dimension e. If the input text sequence has length t, the embedded input is a matrix E ∈ R^{e×t}.
2. Slide f convolutional filters of size e × n along the time axis of E, where n ranges over the n-gram size(s) we are interested in attributing during the interpretation stage. This transforms the text T into a set of n-gram features of various sizes: the input is now a feature map F^n ∈ R^{f×(t−n+1)} for each n-gram size n.
3. Perform global average pooling (Lin et al., 2014) on F^n. We now have our encoding e^n ∈ R^f, where each e^n_j = (1/(t−n+1)) Σ_i F^n[j, i].

4. Concatenate all e^n's from every filter width n. This produces the final encoding, e.
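The four encoding steps above can be sketched in numpy as follows (a toy illustration with made-up dimensions, a tiny vocabulary, and random weights, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"free": 0, "shipping": 1, "today": 2, "!": 3}
e_dim, f = 5, 4                                 # embedding dim, number of filters

tokens = ["free", "shipping", "today", "!"]
emb = rng.normal(size=(e_dim, len(vocab)))      # word vectors
E = emb[:, [vocab[w] for w in tokens]]          # step 1: E ∈ R^{e×t}

def encode(E, widths=(1, 2)):
    """Steps 2-4: convolve filters of each width n along the time axis,
    apply ReLU, globally average-pool each feature map, and concatenate."""
    pooled = []
    for n in widths:
        W = rng.normal(size=(f, e_dim * n))     # f filters of size e × n
        windows = np.stack([E[:, i:i + n].ravel()
                            for i in range(E.shape[1] - n + 1)], axis=1)
        F_n = np.maximum(W @ windows, 0.0)      # feature map F^n ∈ R^{f×(t−n+1)}
        pooled.append(F_n.mean(axis=1))         # global average pooling: e^n ∈ R^f
    return np.concatenate(pooled)               # final encoding e

enc = encode(E)
```

Each pooled vector summarizes one n-gram width, so the final encoding carries one slot per (filter, width) pair for the interpretation stage to project onto.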
Armed with e, we proceed to predict Y and C, each with a single linear transformation:

Ŷ = softmax(W_Y e),    Ĉ = softmax(W_C e).

The model receives error signals from both of these "prediction heads" via a cross-entropy loss term,

L = −Σ_i p_i log p̂_i,

where p_i and p̂_i correspond to the ground truth and predicted probabilities for class i, respectively. Last, as gradients backpropagate from the C-prediction head to the encoder, we pass them through a gradient reversal layer in the style of Ganin et al. (2016) and Britz et al. (2017), which multiplies gradients by −1. If the loss of the Y-prediction head is L_Y, and that of the confounders is L_C, then the loss which is implicitly used to train the encoder is L_e = L_Y − L_C. This encourages the encoder to match the distribution of e across values of C, thereby learning representations of the text which are invariant to the confounders (Xie et al., 2017).

Interpretation. Once we've trained a CA model, we interpret its behavior in order to determine the most important n-grams for each level of the outcome. This stage is depicted in the right-hand side of Figure 1.
Inspired by the class activation mapping technique for computer vision (Zhou et al., 2016), we project the weights of W_Y, the output layer, onto F^n, the convolutional feature maps. Since Ŷ = softmax(W_Y e), the weight W_Y[i, k] indicates the importance of e_i for class k. The elements of e are averages of each feature map, so W_Y[i, k] also indicates the importance of the i-th feature map for class k. Each feature map contains one activation per n-gram feature. This means we can quantify the importance of the j-th n-gram feature v^n_j towards each output class k by summing over all feature maps:

M_k(v^n_j) = Σ_i W_Y[i, k] F^n[i, j].

M_k is a mapping between input features and their importance towards class k.
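The projection step reduces to a single matrix product, sketched here in numpy (a minimal illustration with our own names, not the original code):

```python
import numpy as np

def ngram_importance(F_n, W_Y):
    """Project output-layer weights onto a convolutional feature map.
    F_n: feature map of shape (f, positions), one activation per n-gram.
    W_Y: output-layer weights of shape (f, k).
    Returns M of shape (positions, k), where M[j, k] = sum_i W_Y[i, k] * F_n[i, j]
    is the importance of the n-gram at position j toward class k."""
    return F_n.T @ W_Y
```

For example, a feature map where only filter 1 fires at position 1 lends that position importance proportional to filter 1's output weights.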
In order to draw lexicons L_i from our vocabulary V, we perform interpretation over a dataset and map each (n-gram, outcome class) tuple to all of the importance values it was assigned. We then compute the average importance for each n-gram and select the highest-scoring entries for inclusion in the outgoing lexicon.
Note that this algorithm is only interpretable to the extent that there is a single linear combination relating e to Ŷ. With multiple layers at the "decision" stage of the network, the relationship between each dimension of e (and by extension, the rows of F^n) and each output class becomes obfuscated.
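To make the adversarial training signal of this subsection concrete, the following toy sketch shows how a gradient reversal layer turns the encoder's effective loss into L_Y − L_C. We use a scalar encoder and squared-error heads purely for illustration; the actual model uses a CNN encoder and cross-entropy losses:

```python
def encoder_grad(w, wy, wc, x, y, c, reverse=True):
    """Gradient reaching the encoder parameter w when e = w * x feeds two
    heads with squared-error losses L_Y = (wy*e - y)^2 and L_C = (wc*e - c)^2.
    The gradient reversal layer multiplies the confound head's gradient
    by -1 on its way back to the encoder."""
    e = w * x
    g_y = 2 * (wy * e - y) * wy      # dL_Y / de
    g_c = 2 * (wc * e - c) * wc      # dL_C / de
    if reverse:
        g_c = -g_c                    # gradient reversal layer
    return (g_y + g_c) * x            # chain rule: de/dw = x
```

With `reverse=True`, the returned gradient is exactly the derivative of L_Y − L_C with respect to w, so gradient descent on the encoder simultaneously improves Y-prediction and degrades C-prediction.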

Directed Residualization Selector (DR)
Training. Recall from Section 2 that I′(L, x) measures two quantities: (1) the amount by which L can further improve predictions of Y compared to the prediction made from the confounders C alone, and (2) the strength of association between members of L and outcome class x. The Directed Residualization method is directly motivated by this setup. It first predicts Y directly from C as well as possible, and then seeks to fine-tune these predictions using T. This two-stage prediction process lets us control for the confounders C, because T is being used to predict the part of Y that the confounders can't explain. This model is depicted in the left-hand side of Figure 2.
First, the confounders C are converted into one-hot feature vectors that are passed through a feed-forward neural network (FFNN) to obtain a vector of preliminary predictions Ŷ. We then re-predict the outcome with the following steps:

e = W_in^T t    (6)

Ŷ′ = softmax(W_out^T [e; Ŷ])    (7)

where t ∈ {0, 1}^{|V|} is a bag-of-words representation of T, W_in ∈ R^{|V|×f}, e ∈ R^f, W_out ∈ R^{(f+k)×k}, and k is the number of classes in Y. The model receives supervision at both the preliminary and the final predictions, using the same cross-entropy loss function as the Convolutional Adversarial Selector of Section 3.1. Note the similarities between this approach and the popular residualizing regression (RR) attribution technique (Jaeger et al., 2009; Baayen et al., 2010, inter alia). Both use the text to improve an estimate generated from the confounds. RR treats this as two separate regression tasks (using C to predict Y, then T to predict the first model's residuals); we introduce the capacity for nonlinear interactions by backpropagating between RR's steps.
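The two-stage prediction can be sketched in numpy as follows (a toy illustration with hypothetical dimensions; the confound-only FFNN is abstracted into its output probabilities):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def dr_forward(t, y_hat_conf, W_in, W_out):
    """DR re-prediction step. t: bag-of-words vector in {0,1}^|V|;
    y_hat_conf: preliminary class probabilities from the confound-only
    FFNN; W_in: |V| x f; W_out: (f+k) x k."""
    e = W_in.T @ t                       # eq. (6): sum of W_in rows for active words
    z = np.concatenate([e, y_hat_conf])  # concatenate [e; Y-hat]
    return softmax(W_out.T @ z)          # eq. (7): final class probabilities
```

Because the confound-based estimate enters the final layer directly, the text pathway only needs to model the residual signal the confounds cannot explain.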
Interpretation. This stage is depicted in the right-hand side of Figure 2. Once we've trained a DR model, we determine the importance of each feature v_j for each class Y_k by tracing all possible paths between v_j and Y_k, multiplying the weights along those paths, then summing across paths. The resulting importance value, M_k(v_j), is how much Y_k's log-likelihood increases if v_j is added to a text according to the trained model (and thus irrespective of the confounders).
We can derive this procedure by considering the model's parameters. In equation 7, we produce log-likelihood estimates for Y by concatenating e and Ŷ and multiplying the result with W_out. This means the first |e| = f rows of W_out (written as W_out,T) are an output projection transforming e into Ŷ_T, the text's contribution towards the final prediction. So W_out,T[i, k] indicates the importance of e_i for output class k. As per equation 6, e is the sum of all of the rows of W_in that correspond to features in the text. So we can decompose Ŷ_T into a sum of contributions from each text feature v_j:

Ŷ_T = W_out,T^T e = Σ_j t_j (W_out,T^T W_in[j, :]),

and the estimated log-likelihood contribution of any v_j towards class k is

M_k(v_j) = Σ_i W_in[j, i] W_out,T[i, k].

For this algorithm, there is no need to run the model over any data in order to retrieve importance values; we can obtain them directly from the trained parameters. This procedure is depicted in the right-hand side of Figure 2.
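This path tracing reduces to a single matrix product, which can be sketched and sanity-checked in numpy (names and dimensions are illustrative):

```python
import numpy as np

def dr_importance(W_in, W_out, f):
    """Trace all paths from each vocabulary item j to each class k and
    sum the products of the weights along them. Only the first f rows
    of W_out (the text pathway, W_out,T) are used:
        M[j, k] = sum_i W_in[j, i] * W_out[i, k]."""
    return W_in @ W_out[:f, :]
```

A quick sanity check: adding word j to a bag-of-words input shifts the pre-softmax class scores by exactly row j of M, matching the "log-likelihood increase" reading above.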
Last, like the CA algorithm, DR is only interpretable to the extent that there is a single linear combination between e and Ŷ.

Experiments
We demonstrate the efficacy of the proposed algorithms on a dataset of internet advertisements.

Experimental Set-Up
Data. In this setting, our (T, Y, C) data triples consist of:
• T: the header text of sponsored search results in an internet search engine.
• Y: a binary categorical variable which indicates whether the corresponding advertisement was high-performing or low-performing.
• C: a categorical variable which indicates the brand of the ad. We use customer id and the hostname of the landing page the ad points to as a proxy for this.
We collect advertisements across three domains: apparel (16,000 advertisements), job listings (70,000), and real estate (32,000). See section A for more details on these data. We selected pairs of ads where both had the same landing page and targeting, but where one ad was in the 97.5th CTR percentile (high-performing) and its counterpart was in the 2.5th percentile (low-performing). This implies that any performance differences may be attributed to differences in their text.
We tokenized these data with Moses (Koehn et al., 2007) and joined word tokens into n-grams of size 1, 2, 3, and 4 for the n-gram portion of the study.

Implementation. We implemented nonlinear models with the TensorFlow framework (Abadi et al., 2016) and optimized using Adam (Kingma and Ba, 2014) with a learning rate of 0.001. We implemented linear models with the scikit-learn package (Pedregosa et al., 2011). We evaluate each algorithm by selecting lexicons of size |L_i| = 50. We optimized the hyperparameters of all algorithms for each dataset. Complete hyperparameter specifications are provided in the online supplementary materials; for the proposed DR and CA algorithms we set |e| to 8, 32, and 32 for the apparel, job listing, and real estate data, respectively.
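The n-gram joining step of the preprocessing can be sketched as follows (a minimal illustration; Moses tokenization itself is omitted):

```python
def ngrams(tokens, max_n=4):
    """Join word tokens into all contiguous n-grams of size 1..max_n."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out
```

Each ad thus contributes its unigrams through 4-grams as candidate attributional features.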

Baselines.
Along with the Convolutional Adversarial Selector (CA) and Directed Residualization Selector (DR) of Section 3, we compare the following methods: Regression (R), Residualized Regression (RR), Regression with Confound features (RC), and the Adversarial Selection (AS) algorithm of Pryzant et al. (2018), which selects words that are important for a confound-controlled prediction task by considering the attention scores of an adversarially trained RNN.

Experimental results
We begin by investigating whether the proposed methods successfully discovered features that are simultaneously indicative of each CTR status and untangled from the confounding effects of brand (Tables 1, 2, 3).

On the apparel data (Table 1), we find that the proposed algorithms select words that are often both the most influential on CTR (highest I) and the most strongly associated with their target outcome classes (highest lo). It is not surprising that the Adversarial Selector (AS) of Pryzant et al. (2018) had low lo, because the method is only capable of identifying discriminative features while controlling for confounds. AS was also inconsistent in its ability to select words that are predictive of CTR while being unrelated to brand. This may be due to the instability of adversarial learning (Shrivastava et al., 2017) or the complex nonlinear relationship between the model's attention scores and final predictions.
On the job advertisements (Table 2), the proposed DR algorithm performed the best, selecting words that were both more influential on CTR and more strongly associated with its target than  any other algorithm. In general, I values were an order of magnitude larger for n-grams than unigrams, indicating that for job postings on the internet, phrases are more important than the individual words they are composed of. This suggests job seekers may read advertisements more closely than internet shoppers, who are known to "skim" content and are thus more attuned to individual keywords (Campbell and Maglio, 2013;Seda, 2004).
For real estate, Table 3 indicates that, except for the case of weak unigrams, the proposed DR and CA algorithms perform best. In many cases, the regression-based approaches successfully selected words that are strongly related to each target outcome class (lo was relatively high) but failed to choose words whose explanatory power exceeds that of the confounds (I was relatively low). For a plain regression (R) this makes sense; there is no mechanism to control for confounders. For the other regression-based approaches (RC and RR), this may be due to the multicollinearity of confounders and text, which Gelman and Loken (2014) and Wurm and Fisicaro (2014) describe as a fundamental weakness of these attribution algorithms. Again, n-grams performed drastically better than unigrams, implying that phraseology may matter more than vocabulary to prospective homeowners.

Algorithmic Analysis
Ablation Study. We proceed to ablate the mechanism by which each proposed algorithm controls for the confounds. First we toggled the gradient reversal layer of the Convolutional Adversarial Selector (CA). Doing so reduced the algorithm's performance by an average of 0.03 lo and 0.24 I. For the Directed Residualization Selector (DR), we removed the part of the model that uses the confounds to generate preliminary predictions. Doing so resulted in an average increase of 0.02 lo and a decrease of 0.21 I. For both algorithms, only the average difference in I was significant (p < 0.05). From these results, we conclude that these confound-controlling mechanisms bear little impact on the degree to which the selected words are associated with their corresponding outcome classes. However, the mechanisms are important for getting the models to avoid confound-related features.

Visualization. We visualize M_high-CTR and M_low-CTR as computed by a proposed and a baseline method (Figure 3). We see that the regression lends high-CTR importance to the name of a popular real estate company, and low-CTR importance to an unpopular location (which that company happens to specialize in). The Adversarial Selector gives confound-related features less importance. By disabling the reversal layer, we recover some of the regression's confound-relatedness.

Language Analysis
We continue by studying high-scoring words and phrases from the models we experimented with in order to glean useful insights about internet advertising. Please note that this is an illustration of the present algorithm and this study is limited in scope. These are experimental results, not suggestions for real online advertising campaigns.
When comparing the words selected by the proposed and baseline methods, we observe that many of the regression-based methods selected brand names or words that are closely associated with brands, like locations (areas where real estate and staffing agencies specialize) or proper nouns (fashion designers, real estate agents, and so on). Indeed, for apparel, the percent of selected words and phrases which contained the name of a fashion retailer was lower for DR and CA (6.5% and 8.5%) than for AS (9%), R (23%), RC (19%), and RR (13%).
After clustering words and phrases based on the cosine similarity of their GloVe embeddings (Pennington et al., 2014), the authors found semantic classes that include industry best practices (e.g., Schwab, 2013). For example:
• Involvement. This includes language which creates a dialogue with the reader ("your", "you", "we") and portrays a personal experience ("personalized") at the reader's discretion ("compare", "view"). This aligns with growing demand for personalized internet services (Meeker, 2018).
• Authority. This includes appeals to the rhetorical device of ethos, in the form of authoritative framing, such as "official site" and "®".
We also find some semantic classes among weakly performing words and phrases. One notable class includes "filler words" consisting of lackluster embellishment. This aligns with prior psychological research suggesting that words that don't contribute to a topic can have a slightly negative effect on attitude (Fazio et al., 1986;Grush, 1976).
Finally, we note that popular items or categories of items were frequently high-scoring. This comes as no surprise and reflects an important aspect of the proposed methodology: it only controls for the confounders it is given, and we controlled for the brand of an ad, not its content. There are innumerable factors which influence clicking behavior (position, demographics, etc.) that we did not model explicitly in this study; we leave this to future work.

Related Work
Neural Network Interpretability. A variety of work has been done on understanding the relationship between input features and a network's behavior. Attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015) are a popular method for highlighting parts of the input, but the nonlinear relationship between attention scores and outputs makes attention a poor tool for attribution on a per-class basis (as our Adversarial Selector (AS) baseline demonstrates). Dosovitskiy and Brox (2015) and Mahendran and Vedaldi (2015) invert the layers of a neural network to show which input features are being used. Zhou et al. (2016) extend this work to show exactly which parts of the input are being used. Parts of our Convolutional Adversarial Selector draw on this, and to the best of our knowledge, we are the first to adapt class activation maps to language data. Sundararajan et al. (2017) also highlight important parts of the input with a method that is similar to our Directed Residualization Selector. Their method uses gradients to trace influence; because our models' gradients are a composite of signals, only some of which we want to consider while attributing, the method cannot be applied directly to our setting. Ribeiro et al. (2016), Biran and McKeown (2017), and Lei et al. (2016) also use "importance scores" to explain the predictions of neural network-based classifiers.

Causal Inference. Our methods have connections to recent advances in the causal inference literature. Prior work proposes algorithms for causal inference which bear similarity to our Convolutional Adversarial Selector (CA). Imai et al. (2013) advocate a lasso-based method similar to our Directed Residualization (DR), and Egami et al. (2018) explore how to make causal inferences from texts through careful data splitting. Unlike the present study, these papers are largely unconcerned with interpretability. Pryzant et al. (2018) make a foray into causal interpretability, developing the informativeness coefficient metric we use in our evaluations. That work also proposed two algorithms for deconfounded lexicon induction which inspired our proposed CA and DR algorithms.

Conclusion
In this paper, we presented two new algorithms for the analysis of persuasive text. These algorithms are based on interpreting the behaviors and parameters of trained machine learning models. They perform performance attribution: the practice of finding words that are indicative of particular outcomes and are unrelated to confounding information. We used these algorithms to conduct the first public investigation into successful writing styles for internet search advertisements. We find that the proposed methods can automatically identify successful (and unsuccessful) writing styles in advertising. These findings are in line with industry practices built on manual A/B testing and also with previous psychological studies. This is an exciting new direction for NLP research; there are many directions for future work, including core algorithmic innovation and applying the proposed algorithms to new and rich social questions.

Acknowledgments
We are grateful to Emanuel Schorsch, Kristen LeFevre and Jason Baldridge for their helpful comments and suggestions.