Discovery of Treatments from Text Corpora

An extensive literature in computational social science examines how features of messages, advertisements, and other corpora affect individuals' decisions, but these analyses must specify the relevant features of the text before the experiment. Automated text analysis methods can discover features of text, but these methods cannot be used to obtain estimates of causal effects, the quantity of interest for applied researchers. We introduce a new experimental design and statistical model to simultaneously discover treatments in a corpus and estimate causal effects for these discovered treatments. We prove the conditions to identify the treatment effects of texts and introduce the supervised Indian Buffet process to discover those treatments. Our method enables us to discover treatments in a training set using a collection of texts and individuals' responses to those texts, and then estimate the effects of these interventions in a test set of new texts and survey respondents. We apply the model to an experiment about candidate biographies, recovering intuitive features of voters' decisions and revealing a penalty for lawyers and a bonus for military service.


Introduction
Computational social scientists are often interested in inferring how blocks of text, such as messages from political candidates or advertising content, affect individuals' decisions (Ansolabehere and Iyengar, 1995; Mutz, 2011; Tomz and Weeks, 2013). To do so, they typically attempt to estimate the causal effect of the text: they model the outcome of interest, Y, as a function of the block of text presented to the respondent, t, and define the treatment effect of t relative to some other block of text t′ as Y(t) − Y(t′) (Rubin, 1974; Holland, 1986). For example, in industrial contexts researchers design A/B tests to compare two potential texts for a use case. Academic researchers often design one text that has a feature of interest and another text that lacks that feature but is otherwise identical (for example, Albertson and Gadarian, 2015). Both kinds of experiments assume researchers already know the features of text to vary and offer little help to researchers who would like to discover the features to vary.
Topic models and related methods can discover important features in corpora of text data, but they are constructed in a way that makes it difficult to use the discovered features to estimate causal effects (Blei et al., 2003). Consider, for example, supervised latent Dirichlet allocation (sLDA) (Mcauliffe and Blei, 2007). It associates a topic-prevalence vector, θ, with each document, where the estimated topics depend upon both the content of documents and a label associated with each document. If K topics are included in the model, then θ is defined on the (K − 1)-dimensional unit simplex. It is straightforward to define a treatment effect as the difference between two treatments θ and θ′ (or points on the simplex), Y(θ) − Y(θ′). It is less clear how to define the marginal effect of any one dimension. This is because bigger values on some dimensions imply smaller values on other dimensions, making the effect of any one topic necessarily a combination of the differences obtained when averaging across all the dimensions (Aitchison, 1986; Katz and King, 1999). This problem befalls all topic models, because the zero-sum nature of the topic-prevalence vector implies that increasing the prevalence of any one topic necessarily decreases the prevalence of some other topic. The result is that it is difficult (or impossible) to interpret the effect of any one topic marginalizing over the other topics. Other applications of topic models to causal inference treat text as the response, rather than the treatment (Roberts et al., 2016). And still other methods require a difficult-to-interpret assumption about how text might affect individuals' responses (Beauchamp, 2011).
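The simplex constraint can be made concrete with a short numerical illustration (a minimal sketch of ours, not part of the original analysis): perturbing one coordinate of a topic-prevalence vector either leaves the simplex or forces other coordinates to shrink, while a binary feature vector can be toggled one coordinate at a time.

```python
import numpy as np

# Topic proportions live on the simplex: raising one topic's share
# mechanically lowers the others, so "increase topic 0, hold the rest
# fixed" is not a well-defined intervention.
theta = np.array([0.5, 0.3, 0.2])
theta_more_topic0 = np.array([0.7, 0.3, 0.2])
print(theta_more_topic0.sum())  # 1.2 -- no longer a valid point on the simplex

# To stay on the simplex we must take mass from the other topics,
# confounding the "effect of topic 0" with whichever topics shrank.
theta_renorm = theta_more_topic0 / theta_more_topic0.sum()

# Binary feature vectors have no such constraint: toggling one feature
# leaves every other feature untouched.
z = np.array([1, 0, 1])
z_flipped = z.copy()
z_flipped[1] = 1  # add feature 1; features 0 and 2 are unchanged
assert (z_flipped[[0, 2]] == z[[0, 2]]).all()
```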
To facilitate the discovery of treatments and to address the limitations of existing unsupervised learning methods, we introduce a new experimental design, framework, and statistical model for discovering treatments within blocks of text and then reliably inferring the effects of those treatments. By doing so, we combine the utility of discovering important features in a topic model with the scientific value of causal treatment effects estimated in a potential outcomes framework. We present a new statistical model, the supervised Indian Buffet Process, to both discover treatments in a training set and infer the effects of those treatments in a test set (Ghahramani and Griffiths, 2005). We prove that randomly assigning blocks of text to respondents in an experiment is sufficient to identify the effects of the latent treatments that comprise blocks of text.
Our framework provides a first-of-its-kind approach to automatically discovering treatments in text and estimating their effects, building on literatures in both social science and machine learning (Blei et al., 2003; Beauchamp, 2011; Mcauliffe and Blei, 2007; Roberts et al., 2016). The use of training and test sets ensures that this discovery does not come at the expense of credibly inferring causal effects, insulating the research design from concerns about "p-hacking" and overfitting (Ioannidis, 2005; Humphreys et al., 2013; Franco et al., 2014). Critically, our methodology has a theoretical justification: we select our particular approach because it enables us to estimate the causal effects of interest. Rather than demonstrating that our method performs better at some predictive task, we prove that it is able to estimate useful causal effects from the data.
We apply our framework to study how features of a political candidate's background affect voters' decisions. We use a collection of candidate biographies from Wikipedia to automatically discover treatments in the biographies and then infer their effects. This reveals a penalty for lawyers and career politicians and a bonus for military service and advanced degrees. While we describe our procedure throughout the paper, we summarize our experimental protocol and strategy for discovering treatment effects in Table 1. Our goal is to discover a set of features, or treatments, underlying texts and then estimate the effect of those treatments on some response from an individual. We first show that randomly assigning texts to respondents is sufficient to identify treatment effects. We then provide a statistical model that uses both the texts and the responses to discover latent features in the text that affect the response. Finally, we show that we can use the mapping from text to features discovered on a training set to estimate the presence of features in a test set, which allows us to estimate treatment effects in the test set.

Randomizing Texts Identifies Underlying Treatment Effects
When estimating treatment effects, researchers often worry that the respondents who received one treatment systematically differ from those who received some other treatment. In a study of advertising, if all of the people who saw one advertisement were men and all of the people who saw a different advertisement were women, it would be impossible to tell whether differences in their responses were driven by the fact that they saw different advertisements or by their pre-existing differences. Randomized experiments are the gold standard for overcoming this problem (Gerber and Green, 2012). However, in text experiments, individuals are randomly assigned to blocks of text rather than to the latent features of the text that we analyze as the treatments. In this section, we show that randomly assigning blocks of text is sufficient to identify treatment effects.
To establish our result, we suppose we have a corpus of J texts, X. We represent a specific text with X_j ∈ X, where X_j ∈ R^D. Throughout we will assume that we have standardized the variable X_j to be a per-document word usage rate with each column normalized to have mean zero and variance one. We have a sample of N respondents from a population, with the response of individual i to text j[i] given by the potential outcome Y_i(X_{j[i]}). We use the notation j[i] because multiple individuals may be assigned to the same text; if i and i′ are assigned to the same text, then j[i] = j[i′]. We suppose that for each document j there is a corresponding vector of K binary treatments Z_j ∈ Z, where Z contains all 2^K possible combinations of treatments, {0, 1}^K. The function g : X → Z maps from the texts to the set of binary treatments; we will learn this function using the supervised Indian Buffet process introduced in the next section. Note that distinct elements of X may map to the same element of Z.
To establish our identification result, we assume (Assumption 1) that Y_i(X) = Y_i(X_{j[i]}) for all i. This assumption ensures that each respondent's response depends only on her own assigned text, a version of the Stable Unit Treatment Value Assumption (SUTVA) for our application (Rubin, 1986). We also assume (Assumption 2) that Y_i(X_{j[i]}) = Y_i(g(X_{j[i]})) = Y_i(Z_{j[i]}) for all X_{j[i]} ∈ X and all i, or that Z_{j[i]} is sufficient to describe the effect of a document on individual i's response. Stated differently, we assume an individual would respond in the same way to two different texts if those texts have the same latent features. We further suppose (Assumption 3) that texts are randomly assigned to respondents according to some probability measure h(X_{j[i]}) for all X_{j[i]} ∈ X and for all individuals i. This assumption ensures unobserved characteristics of individuals are not confounding inferences about the effects of texts. The random assignment of texts to individuals induces a probability measure on treatment vectors Z, f(Z) = ∫_X 1(Z = g(X)) h(X) dX. Finally, we assume (Assumption 4) that f(Z) > 0 for all Z ∈ Z. This requires that every combination of treatments is possible from the documents in our corpus. In practice, when designing our study we want to ensure that the treatments are not aliased, or perfectly correlated. If perfect correlation exists between factors, we are unable to disentangle the effects of the individual factors.
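The aliasing concern behind Assumption 4 can be checked mechanically on a realized treatment matrix. The following is an illustrative sketch (the helper function and the toy matrix are ours, not from the paper) that flags pairs of features that are perfectly correlated:

```python
import numpy as np

def aliased_pairs(Z):
    """Flag pairs of features that are perfectly correlated (aliased)
    in the realized treatment matrix Z; such pairs violate the
    positivity requirement that every combination of treatments
    occurs with nonzero probability."""
    K = Z.shape[1]
    corr = np.corrcoef(Z, rowvar=False)  # K x K correlation of features
    return [(j, k) for j in range(K) for k in range(j + 1, K)
            if abs(corr[j, k]) > 1 - 1e-12]

# Toy matrix: features 0 and 1 always co-occur, so their individual
# effects cannot be disentangled.
Z = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1],
              [0, 0, 0]])
print(aliased_pairs(Z))  # [(0, 1)]
```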
In this paper we focus on estimating the Average Marginal Component Specific Effect for factor k (AMCE_k) (Hainmueller et al., 2014). The AMCE_k is useful for finding the effect of one feature, k, when k interacts with the other features in some potentially complicated way. It is defined as the difference in outcomes when the feature is present and when it is not present, averaged over the values of all of the other features. Formally,

AMCE_k = Σ_{Z_{−k}} [E(Y | Z_k = 1, Z_{−k}) − E(Y | Z_k = 0, Z_{−k})] m(Z_{−k}),

where m(·) is some analyst-defined density on all elements but k of the treatment vector. For example, m(·) can be chosen as the density of Z_{−k} in the population to obtain the marginal component effect of k in the empirical population. The most commonly used m(·) in applied work is uniform across all Z_{−k}'s, and we follow this convention here.
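Because the treatments are binary and we follow the uniform-m(·) convention, the AMCE_k reduces to a simple average over the 2^(K−1) settings of the other features. A minimal sketch with a hypothetical response function y(·) of our own choosing:

```python
import itertools
import numpy as np

def amce(k, response, K):
    """AMCE for feature k under a uniform density m(.) over the other
    features: average response(Z_k=1, Z_-k) - response(Z_k=0, Z_-k)
    over all 2^(K-1) settings of Z_-k."""
    diffs = []
    for z_rest in itertools.product([0, 1], repeat=K - 1):
        z1 = list(z_rest); z1.insert(k, 1)  # feature k present
        z0 = list(z_rest); z0.insert(k, 0)  # feature k absent
        diffs.append(response(np.array(z1)) - response(np.array(z0)))
    return float(np.mean(diffs))

# Hypothetical response with an interaction between features 0 and 1:
# the effect of feature 0 is 2 when feature 1 is absent and 1 when
# feature 1 is present, so the uniform AMCE averages them.
def y(z):
    return 2.0 * z[0] + 1.0 * z[1] - 1.0 * z[0] * z[1]

print(amce(0, y, K=3))  # -> 1.5
```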
We now prove that Assumptions 1, 2, 3, and 4 are sufficient to identify the AMCE_k for all k.
Proof. To obtain a useful form, we first marginalize over the documents that share a treatment vector. Under Assumptions 1 and 2, Y_i(X) = Y_i(g(X)), and under Assumptions 3 and 4, for any z ∈ Z,

E[Y | Z = z] = (1/f(z)) ∫_X E[Y_i(X)] 1(g(X) = z) h(X) dX,

which is estimable under the randomized assignment of texts. Substituting these conditional expectations into the definition of AMCE_k identifies the effect for each k.

A Statistical Model for Identifying Features
The preceding section shows that if we are able to discover features in the data, we can estimate their AMCEs by randomly assigning texts to respondents. We now present a statistical model for discovering those features. As we argued in the introduction, it is difficult to use the topics obtained from topic models like sLDA because the topic vector exists on the simplex. When we compare the outcomes associated with two different topic vectors, we do not know whether the change in the response is caused by increasing the degree to which the document is about one topic or by decreasing the degree to which it is about another, because the former mathematically entails the latter.
Other models, such as LASSO regression, would necessarily suppose that the presence and absence of individual words are the treatments (Hastie et al., 2001; Beauchamp, 2011). This is problematic substantively, because it is hard to know exactly what the presence or absence of a single word implies as a treatment in text. We therefore develop the supervised Indian Buffet Process (sIBP) to discover features in documents. For our purposes, the sIBP has two essential properties. First, it produces a binary topic vector, avoiding the complications of treatments assigned on the simplex. Second, unlike the Indian Buffet Process upon which it builds (Ghahramani and Griffiths, 2005), it incorporates information about the outcome associated with various texts, and therefore discovers features that explain both the text and the response. Figure 1 describes the posterior distribution for the sIBP, and a summary of the posterior is given in Equation 1. We describe the model in three steps: the treatment assignment process, document creation, and response. The result is a model that creates a link between document content and response through a vector of treatment assignments.
Treatment Assignment We assume that π is a K-vector (where we take the limit as K → ∞) where π_k describes the population proportion of documents that contain latent feature k. We suppose that π is generated by the stick-breaking construction (Doshi-Velez et al., 2009). Specifically, we suppose that η_k ∼ Beta(α, 1) for all k. We set π_1 = η_1 and, for each remaining feature, π_k = ∏_{z=1}^{k} η_z. For document j and feature k, we suppose that z_{j,k} ∼ Bernoulli(π_k), which importantly implies that the occurrences of treatments are not zero-sum. We collect the treatment vector for document j into Z_j and collect all the treatment vectors into Z, an N_texts × K binary matrix, where N_texts refers to the number of unique documents. Throughout we will assume that N_texts = N, or that the number of documents and responses are equal, and index the documents with i.
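The stick-breaking construction above is straightforward to simulate. The following sketch (with an illustrative truncation level K and concentration α of our own choosing) draws π and a binary treatment matrix Z:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ibp_treatments(n_docs, K, alpha, rng):
    """Stick-breaking construction: eta_k ~ Beta(alpha, 1),
    pi_1 = eta_1, pi_k = prod_{z=1}^{k} eta_z, and
    z_{j,k} ~ Bernoulli(pi_k). K is a finite truncation of the
    K -> infinity limit."""
    eta = rng.beta(alpha, 1.0, size=K)
    pi = np.cumprod(eta)                       # pi_k = pi_{k-1} * eta_k
    Z = rng.binomial(1, pi, size=(n_docs, K))  # each feature drawn independently
    return pi, Z

pi, Z = sample_ibp_treatments(n_docs=100, K=10, alpha=2.0, rng=rng)
assert np.all(np.diff(pi) <= 0)  # later features are successively rarer
# Row sums vary freely: treatments are not zero-sum, unlike simplex topics.
print(Z.sum(axis=1)[:5])
```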

Document Creation
We suppose that the documents are created as a combination of latent factors. For feature k we suppose that A_k is a D-dimensional vector that maps latent features onto observed text. We collect the vectors into A, a K × D matrix, and suppose that X_i ∼ MVN(Z_i A, σ²_X I_D), where X_{i,d} is the standardized number of times word d appears in document i. While it is common to model texts as draws from multinomial distributions, the multivariate normal model is better suited to our standardized word rates.

We note that there is a different model also called the supervised Indian Buffet Process (Quadrianto et al., 2013). There are fundamental differences between the model presented here and the sIBP in (Quadrianto et al., 2013): their outcome is a preference-relation tuple, while ours is a real-valued scalar. This leads to a distinct data generating process, model inference procedure, and inference of features on the test set. To leverage the analogy between LDA and sLDA vis-à-vis the IBP and sIBP, we overload the term sIBP in our paper; we expect that in future applications of the sIBP it will be clear from context which sIBP is being employed.

Response to Treatment Vector

We assume that a K-vector of parameters β describes the relationship between the treatment vector and the response. Specifically, we use a standard parameterization and suppose that τ ∼ Gamma(a, b), β ∼ MVN(0, τ^{-1} I_K), and Y_i ∼ Normal(Z_i β, τ^{-1}). (1)
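Putting the three steps together, the full generative process can be simulated as follows. This is an illustrative sketch: the dimensions, hyperparameters, and the prior scale σ_A on the loadings A are arbitrary choices for the demonstration, not values from our study.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, D = 200, 4, 50                     # respondents/docs, features, vocab size
alpha, sigma_A, sigma_X = 2.0, 1.0, 0.5  # illustrative hyperparameters
a, b = 1.0, 1.0                          # Gamma(a, b) prior on the precision tau

# Treatment assignment: stick-breaking pi_k, then z_{i,k} ~ Bernoulli(pi_k).
pi = np.cumprod(rng.beta(alpha, 1.0, size=K))
Z = rng.binomial(1, pi, size=(N, K))

# Document creation: feature loadings A (K x D), standardized word rates
# X_i ~ MVN(Z_i A, sigma_X^2 I_D), drawn row by row.
A = rng.normal(0.0, sigma_A, size=(K, D))
X = Z @ A + rng.normal(0.0, sigma_X, size=(N, D))

# Response: tau ~ Gamma(a, b) (b treated as a rate), beta ~ MVN(0, tau^{-1} I_K),
# Y_i ~ Normal(Z_i beta, tau^{-1}).
tau = rng.gamma(a, 1.0 / b)
beta = rng.normal(0.0, np.sqrt(1.0 / tau), size=K)
Y = Z @ beta + rng.normal(0.0, np.sqrt(1.0 / tau), size=N)

assert X.shape == (N, D) and Y.shape == (N,)
```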

Inference for the Supervised Indian Buffet Process
We approximate the posterior distribution with a variational approximation, building on the algorithm introduced in (Doshi-Velez et al., 2009). We approximate the nonparametric posterior by setting K to be large and use a factorized approximating distribution,

q(π, Z, A, β, τ) = q(π) q(A) q(Z) q(β, τ) ≈ p(π, Z, A, β, τ | X, Y, α, σ²_A, σ²_X, a, b).

A standard derivation that builds on (Doshi-Velez et al., 2009) yields the distributional family and update step for each factor, where the typical element of E[Z^T] is E[Z^T]_{j,k} = ν_{j,k} and the updates involve the digamma function ψ(·). We repeat the algorithm until the change in the parameter vector drops below a threshold.
To select the final model using the training-set data, we perform a two-dimensional grid search over values of α and σ_X. We then run the model several times for each combination of values of α and σ_X to evaluate the output at several different local modes. To create a candidate set of models, we use a quantitative measure, CE, that balances coherence and exclusivity (Mimno et al., 2011; Roberts et al., 2014). Let I_k be the set of documents for which ν_{i,k} ≥ 0.5, let I^C_k be the complement of this set, and define N_k = Σ_{i=1}^{N} I{ν_{i,k} ≥ 0.5}. We identify the top ten words for intervention k, t_k, as the ten words with the largest values in A_k. We then compute CE for a particular model from these quantities, where X_{I_k,l} refers to the l-th column and the I_k rows of X. We make a final model selection based on the model that provides the most substantively clear treatments.
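Two of the quantities used in model selection, the top-ten word list t_k and the document count N_k, can be computed directly from the fitted parameters. A toy sketch with made-up loadings and ν values (the vocabulary and numbers are illustrative only):

```python
import numpy as np

def top_words(A_k, vocab, n=10):
    """t_k: the n words with the largest loadings in A_k."""
    return [vocab[i] for i in np.argsort(A_k)[::-1][:n]]

def feature_doc_count(nu, k, threshold=0.5):
    """N_k: number of documents whose posterior feature probability
    nu_{i,k} meets the threshold (the set I_k)."""
    return int(np.sum(nu[:, k] >= threshold))

vocab = ["army", "served", "law", "degree", "state", "elected",
         "school", "veteran", "university", "county", "firm", "house"]
A_k = np.linspace(1.0, -1.0, len(vocab))   # illustrative loadings, decreasing
nu = np.array([[0.9, 0.1],
               [0.6, 0.7],
               [0.2, 0.8]])                # illustrative nu for 3 docs, 2 features

print(top_words(A_k, vocab, n=3))  # ['army', 'served', 'law']
print(feature_doc_count(nu, k=0))  # 2
```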

Inferring Treatments and Estimating Effects in Test Set
To discover the treatment effects, we first suppose that we have randomly assigned a set of respondents a text-based treatment X_i according to some probability measure h(·) and that we have observed their response Y_i. We collect the assigned texts into X and the responses into Y. As we describe below, we will often assign each respondent her own distinctive message, with the probability of receiving any one message equal to 1/N for all respondents and messages. We use the sIBP model trained on our training-set documents and responses to infer the treatments present among the test-set documents. Separating the documents and responses into training and test sets ensures that Assumption 1, SUTVA, holds. We learn the mapping from texts to binary vectors in the training set, ĝ(·), and then apply this mapping to the test set to infer the latent treatments present in the test-set documents, without considering the test-set responses. Dividing texts and responses into training and test sets provides a solution to SUTVA violations present in other attempts at causal inference in text analysis (Roberts et al., 2014).
We approximate the posterior distribution for the treatment vectors using the variational approximation from the training-set parameters (λ, φ, Φ, m, S, c, d, σ²_X, σ²_A) and a modified update step on q(z^test_{i,k}). In this modified update step, we remove the component of the update that incorporates information about the outcome. For each text in the test set we repeat this update on ν^test_{i,k} several times until ν^test has converged. Note that for the test set we have excluded the component of the model that links the latent features to the response, ensuring that SUTVA holds.
With the approximating distribution q(Z^test) we then measure the effect of the treatments in the test set. Using the treatments, the most straightforward model to estimate assumes that there are no interactions between the components. Under the no-interactions assumption, we estimate the effects of the treatments and infer confidence intervals using the following bootstrapping procedure, which incorporates uncertainty both from the estimation of the treatments and about the effects of those treatments: 1) for each respondent i and component k, we sample a treatment assignment z̃_{i,k} ∼ Bernoulli(ν^test_{i,k}); 2) given the matrix Z̃, we sample (Y^test, Z̃) with replacement and for each sample estimate the regression Y^test = Z̃β^test + ε.
We repeat the bootstrap steps 1,000 times, keeping β^test from each iteration. The result of the procedure is a point estimate of the effects and a confidence interval for the treatments under no interactions. Technically, it is possible to estimate the treatment effects within our variational approximation. We instead estimate the effects in a second-stage regression because variational approximations tend to understate uncertainty, because the bootstrap provides a straightforward method for including uncertainty from the estimation of both the latent features and their effects, and because it ensures that SUTVA is not violated.
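The two-step bootstrap can be sketched as follows. The data below are simulated for illustration, and the implementation details (intercept handling, OLS via least squares) are our assumptions rather than the paper's exact code:

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_effects(nu_test, Y_test, n_boot=1000, rng=rng):
    """Bootstrap point estimates and 95% CIs under no interactions:
    (1) draw Z-tilde by sampling z_{i,k} ~ Bernoulli(nu_{i,k}),
    (2) resample (Y, Z-tilde) rows with replacement and run OLS."""
    N, K = nu_test.shape
    betas = np.empty((n_boot, K))
    for b in range(n_boot):
        Z_tilde = rng.binomial(1, nu_test)                # step 1
        idx = rng.integers(0, N, size=N)                  # step 2: resample rows
        Zb = np.column_stack([np.ones(N), Z_tilde[idx]])  # add intercept
        coef, *_ = np.linalg.lstsq(Zb, Y_test[idx], rcond=None)
        betas[b] = coef[1:]                               # drop the intercept
    est = betas.mean(axis=0)
    lo, hi = np.percentile(betas, [2.5, 97.5], axis=0)
    return est, lo, hi

# Toy test-set data: one feature with near-certain nu and a true effect of 3.
nu = np.where(rng.random((300, 1)) < 0.5, 0.95, 0.05)
Z_true = rng.binomial(1, nu)
Y = 3.0 * Z_true[:, 0] + rng.normal(0.0, 1.0, size=300)
est, lo, hi = bootstrap_effects(nu, Y, n_boot=200)
```

Resampling Z̃ anew inside each bootstrap iteration is what propagates the uncertainty in the inferred treatments into the interval for β^test.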

Application: Voter Evaluations of an Ideal Candidate
We demonstrate our method in an experiment that assesses how features of a candidate's background affect respondents' evaluations of the candidate. There is a rich literature in political science about the ideal attributes of political candidates (Canon, 1990; Popkin, 1994; Carnes, 2012; Campbell and Cowley, 2014). We build on this literature and use a collection of candidate biographies to discover features of candidates' backgrounds that voters find appealing. To uncover the features of candidate biographies that voters are responsive to, we acquired a collection of 1,246 Congressional candidate biographies from Wikipedia. We then anonymized the biographies, replacing names and removing other identifiable information, to ensure that the only information available to the respondent was explicitly present in the text. In Section 2.1 we show that a necessary condition for this experiment to uncover latent treatments is that each vector of treatments has nonzero probability of occurring. This is equivalent to assuming that none of the treatments are aliased, or perfectly correlated (Hainmueller et al., 2014). Aliasing would be more likely if only a few distinct texts were provided to participants in our experiment. Therefore, we assign each respondent in each evaluation round a distinct candidate biography. To bolster our statistical power, we ask our respondents to evaluate up to four distinct candidate biographies, resulting in each respondent evaluating 2.8 biographies on average. After presenting the respondents with a candidate's biography, we ask each respondent to rate the candidate using a feeling thermometer: a well-established social science scale that runs from 0 when a respondent is "cold" toward a candidate to 100 when a respondent is "warm" toward the candidate.
We recruited a sample of 1,886 participants using Survey Sampling International (SSI), an online survey platform. Our sample is census-matched to reflect US demographics on sex, age, race, and education. From this sample we obtain 5,303 total observations. We assign 2,651 responses to the training set and 2,652 to the test set. We then apply the sIBP to the training data. To apply the model, we standardize the feeling thermometer to have mean zero and standard deviation one. We set K to a relatively low value (K = 10), reflecting a quantitative and qualitative search over K. We then select the final model by varying the parameters and evaluating the CE score. Table 2 provides the top words for each of the ten treatments the sIBP discovered in the training set. We selected the ten treatments using a combination of guidance from the sIBP, assessment using CE scores, and our own qualitative assessment of the models (Grimmer and Stewart, 2013). While it is true that our final selection depends on human input, some reliance on human judgment at this stage is appropriate. If one set includes a treatment about military service but not legal training and another set includes a treatment about legal training but not military service, then model selection is tantamount to deciding which hypotheses are most worthy of investigation. Our CE scores identify sets of treatments that are most likely to be interesting, but the human analyst should make the final decision about which hypotheses he would like to test. However, it is extremely important for the analyst to select a set of treatments first and only afterwards estimate the effects of those treatments. If the analyst observes the effects of some treatments and then decides he would like to test other sets, then the integrity of any p-values he might calculate is undermined by the multiple-testing problem.
A key feature of our procedure is that it draws a clear line between the selection of hypotheses to test (which leverages human judgment) and the estimation of effects (which is purely mechanical).
The estimated treatments cover salient features of Congressional biographies from the time period that we analyze. For example, Treatments 6 and 10 capture a candidate's military experience, Treatments 5 and 7 are about previous political experience, and Treatments 3 and 9 refer to a candidate's educational background. Clearly, many features of a candidate's background are missing, but the discovered treatments provide a useful set of dimensions for assessing how voters respond to a candidate's background. Further, the discovered treatments are a combination of those that are both prevalent in the biographies and have an effect on the thermometer rating. The absence of biographical features that we might think matter for candidate evaluation could be because there are few such biographies in our data set, or because the respondents were unresponsive to those features. After training the model on the training set, we apply it to the test set to infer the treatments in the biographies. We assume there are no interactions between the discovered treatments in order to estimate their effects. Figure 2 shows the point estimates and 95-percent confidence intervals, which take into account uncertainty both in inferring the treatments from the texts and in the relationship between those treatments and the response.
The treatment effects reveal intuitive, though interesting, features of candidate biographies that affect respondents' evaluations. For example, Figure 2 reveals a distaste for political and legal experience, even though a large share of Congressional candidates have previous political experience and a law degree. Treatment 5, which describes a candidate's previous political experience, causes a 2.26-point reduction in feeling thermometer evaluation (95-percent confidence interval, [-4.26, -0.24]). Likewise, Treatment 9 shows that respondents dislike lawyers, with the presence of legal experience causing a 2.34-point reduction in the feeling thermometer (95-percent confidence interval, [-4.28, -0.29]). The aversion to lawyers is not, however, an aversion to education. Treatment 3, which describes advanced degrees, causes a 2.43-point increase in feeling thermometer evaluations (95-percent confidence interval, [0.49, 4.38]).
In contrast, Figure 2 shows a consistent bonus for military experience, in line with the intuition among political observers that the public supports veterans. For example, Treatment 6, which describes a candidate's military record, causes a 3.21-point increase in feeling thermometer rating (95-percent confidence interval, [1.34, 5.12]), and Treatment 10 causes a 4.00-point increase (95-percent confidence interval, [1.53, 6.45]).
Because simultaneously discovering treatments from labeled data and estimating their average marginal component effects is a novel task, we cannot compare the performance of our framework against any benchmark. Even so, one natural question is whether the user could obtain much more coherent topics by forsaking the estimation of causal effects and using a more traditional topic modeling method. We provide the topics discovered by sLDA in Table 3. The sIBP discovered most of the same features sLDA did: both find military service, legal training, political background, and higher education. The Greek-life feature is less coherent in the sIBP than in sLDA, and sLDA finds business and ancestry features that the sIBP does not. Both have a few incoherent treatments. This comparison suggests that the sIBP does almost as well as sLDA at identifying coherent latent features, while also facilitating the estimation of marginal treatment effects.

Conclusion
We have presented a methodology for discovering treatments in text and then inferring the effects of those treatments on respondents' decisions. We prove that randomizing texts is sufficient to identify the underlying treatment effects and introduce the supervised Indian Buffet process for discovering the treatments. The use of a training and test set ensures that our method provides accurate confidence intervals and avoids the problems of overfitting and "p-hacking" in experiments. In an application to candidate biographies, we discover a penalty for political and legal experience and a bonus for military service and non-legal advanced degrees.
Our methodology has a wide variety of applications. It accommodates numerous alternative experimental designs, providing a methodology that computational social scientists could use widely to discover and then confirm the effects of messages in numerous domains, including images and other high-dimensional data. The methodology is also useful for observational data, for example for studying the effects of complicated treatments such as how a legislator's roll-call voting record affects his or her electoral support.