Deconfounded Lexicon Induction for Interpretable Social Science

NLP algorithms are increasingly used in computational social science to take linguistic observations and predict outcomes like human preferences or actions. Making these social models transparent and interpretable often requires identifying features in the input that predict outcomes while also controlling for potential confounds. We formalize this need as a new task: inducing a lexicon that is predictive of a set of target variables yet uncorrelated with a set of confounding variables. We introduce two deep learning algorithms for the task. The first uses a bifurcated architecture to separate the explanatory power of the text and confounds. The second uses an adversarial discriminator to force confound-invariant text encodings. Both elicit lexicons from learned weights and attentional scores. We use them to induce lexicons that are predictive of timely responses to consumer complaints (controlling for product), enrollment from course descriptions (controlling for subject), and sales from product descriptions (controlling for seller). In each domain our algorithms pick words that are associated with narrative persuasion; these words are more predictive and less confound-related than those of standard feature weighting and lexicon induction techniques like regression and log odds.


Introduction
Applications of NLP to computational social science and data science increasingly use lexical features (words, prefixes, etc.) to help predict nonlinguistic outcomes like sales, stock prices, hospital readmissions, and other human actions or preferences. Lexical features are useful beyond predictive performance. They enhance interpretability in machine learning because they let practitioners see why their system works. Lexical features can also be used to understand the subjective properties of a text.
For social models, we need to be able to select lexical features that predict the desired outcome(s) while also controlling for potential confounders. For example, we might want to know which words in a product description lead to greater sales, regardless of the item's price. Words in a description like "luxury" or "bargain" might increase sales but also interact with our confound (price). Such words don't reflect the unique part of the text's effect on sales and should not be selected. Similarly, we might want to know which words in a consumer complaint lead to speedy administrative action, regardless of the product being complained about, or which words in a course description lead to higher student enrollment, regardless of the course topic. These instances are associated with narrative persuasion: language that is responsible for altering cognitive responses or attitudes (Spence, 1983; Van Laer et al., 2013).
In general, we want words which are predictive of their targets yet decorrelated from confounding information. The lexicons constituted by these words are useful in their own right (to develop causal domain theories or for linguistic analysis) but also as interpretable features for downstream modeling. Such work could help widely in applications of NLP to tasks like linking text to sales figures (Ho and Wu, 1999), to voter preference (Luntz, 2007; Ansolabehere and Iyengar, 1995), to moral belief (Giles et al., 2008; Keele et al., 2009), to police respect (Voigt et al., 2017), to financial outlooks (Grinblatt and Keloharju, 2001; Chatelain and Ralf, 2012), to stock prices, and even to restaurant health inspections (Kang et al., 2013).
Identifying linguistic features that are indicative of such outcomes and decorrelated with confounds is a common activity among social scientists, data scientists, and other machine learning practitioners. Indeed, it is essential for developing transparent and interpretable machine learning NLP models. Yet there is no generally accepted and rigorously evaluated procedure for the activity. Practitioners have conducted it on a largely ad-hoc basis, applying various forms of logistic and linear regression, confound-matching, or association quantifiers like mutual information or log-odds to achieve their aims, all of which have known drawbacks (Imai and Kim, 2016; Gelman and Loken, 2014; Wurm and Fisicaro, 2014; Estévez et al., 2009; Szumilas, 2010).
We propose to overcome these drawbacks via two new algorithms that consider the causal structure of the problem. The first uses its architecture to learn the part of the text's effect which the confounds cannot explain. The second uses an adversarial objective function to match text encoding distributions regardless of confound treatment. Both elicit lexicons by considering learned weights or attentional scores. In summary, we
1. Formalize the problem into a new task.
2. Propose a pair of well-performing neural network based algorithms.
3. Conduct the first systematic comparison of algorithms in the space, spanning three domains: consumer complaints, course enrollments, and e-commerce product descriptions.
The techniques presented in this paper will help scientists (1) better interpret the relationship between words and real-world phenomena, and (2) render their NLP models more interpretable.¹

Deconfounded Lexicon Induction
We begin by formalizing this language processing activity into a task. We have access to text(s) $T$, target variable(s) $Y$, and confounding variable(s) $C$. The goal is to pick a lexicon $L$ such that when words in $T$ belonging to $L$ are selected, the resulting set $L(T)$ is related to $Y$ but not $C$. There are two types of signal at play: the part of $Y$ that $T$ can explain, and that explainable by $C$. These signals often overlap because language reflects circumstance, but we are interested in the part of $T$'s explanatory power which is unique to $T$, and hope to choose $L$ accordingly.
So if $\mathrm{Var}[\mathbb{E}[Y \mid L(T), C]]$ is the information in $Y$ explainable by both $L(T)$ and $C$, then our goal is to choose $L$ such that this variance is maximized after $C$ has been fixed. With this in mind, we formalize the task of deconfounded lexicon induction as finding a lexicon $L$ that maximizes an informativeness coefficient,
$$I(L) = \mathbb{E}\big[\mathrm{Var}\big[\mathbb{E}[Y \mid L(T), C] \mid C\big]\big], \tag{1}$$
which measures the explanatory power of the lexicon beyond the information already contained in the confounders $C$. Thus, highly informative lexicons cannot simply collect words that reflect the confounds. Importantly, this coefficient is only valid for comparing different lexicons of the same size, because in terms of maximizing this criterion, using the entire text will trivially make for the best possible lexicon. Our coefficient $I(L)$ can also be motivated via connections to the causal inference literature: in Section 7, we show that, under assumptions often used to analyze causal effects in observational studies, the coefficient $I(L)$ can correspond exactly to the strength of $T$'s causal effects on $Y$.
Finally, note that by expanding out an ANOVA decomposition for $Y$, we can rewrite this criterion as
$$I(L) = \mathbb{E}\big[(Y - \mathbb{E}[Y \mid C])^2\big] - \mathbb{E}\big[(Y - \mathbb{E}[Y \mid L(T), C])^2\big], \tag{2}$$
i.e., $I(L)$ measures the performance improvement $L(T)$ affords to optimal predictive models that already have access to $C$. We use this fact for evaluation in Section 4.
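To make the "performance improvement" reading of the coefficient concrete, the following is a minimal sketch of a plug-in estimate on toy data. It assumes a single categorical confound and a single binary lexicon feature (word present or absent), so that the optimal predictors are just group means; the function names and the data are hypothetical and are not taken from the paper's implementation.

```python
from collections import defaultdict

def group_means(keys, y):
    """Mean of y within each group defined by keys (a stand-in for E[Y | key])."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, v in zip(keys, y):
        sums[k] += v
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

def informativeness(lex_feature, confound, y):
    """Plug-in estimate: MSE of E[Y|C] minus MSE of E[Y|L(T),C]."""
    joint = list(zip(lex_feature, confound))
    mean_c = group_means(confound, y)
    mean_lc = group_means(joint, y)
    n = len(y)
    mse_c = sum((yi - mean_c[c]) ** 2 for yi, c in zip(y, confound)) / n
    mse_lc = sum((yi - mean_lc[k]) ** 2 for yi, k in zip(y, joint)) / n
    return mse_c - mse_lc

# Toy data (assumed): the confound shifts the outcome by 10, the word by 2.
confound = ["a", "a", "a", "a", "b", "b", "b", "b"]
word = [1, 1, 0, 0, 1, 1, 0, 0]           # varies within each confound group
confound_like = [1, 1, 1, 1, 0, 0, 0, 0]  # merely mirrors the confound
y = [3.0, 3.0, 1.0, 1.0, 13.0, 13.0, 11.0, 11.0]

i_word = informativeness(word, confound, y)
i_confound_like = informativeness(confound_like, confound, y)
```

In this toy setting the word that varies within confound groups receives a positive score, while the feature that only mirrors the confound adds nothing once $C$ is known, which is the behavior the coefficient is meant to reward.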

Proposed Algorithms
We now describe the two algorithms we propose for deconfounded lexicon induction.

Deep Residualization (DR)
Motivation. Our first method is directly motivated by the setup from Section 2. Recall that $I(L)$ measures the amount by which $L(T)$ can improve predictions of $Y$ made from the confounders $C$. We accordingly build a neural network architecture that first predicts $Y$ directly from $C$ as well as possible, and then seeks to fine-tune those predictions using $T$.
Description. First we pass the confounds through a feed-forward neural network (FFNN) to obtain preliminary predictions $\hat{Y}'$. We also encode the text into a continuous vector $e \in \mathbb{R}^d$ via two alternative mechanisms:
1. DR+ATTN: the text is converted into a sequence of embeddings and fed into Long Short-Term Memory (LSTM) cell(s) (Hochreiter and Schmidhuber, 1997) followed by an attention mechanism inspired by Bahdanau et al. (2015). If the words of a text have been embedded as vectors $x_1, x_2, \ldots, x_n$ then $e$ is calculated as a weighted average of hidden states, where the weights are decided by a FFNN whose parameters are shared across timesteps:
$$h_t = \mathrm{LSTM}(x_t, h_{t-1}), \qquad l_t = \mathrm{FFNN}(h_t), \qquad p_t = \frac{\exp(l_t)}{\sum_{i=1}^{n} \exp(l_i)}, \qquad e = \sum_{t=1}^{n} p_t h_t$$
2. DR+BOW: the text is converted into a vector of word frequencies, which is compressed with a two-layer feedforward neural network (FFNN).
We then concatenate $e$ with $\hat{Y}'$ and feed the result through another neural network to generate final predictions $\hat{Y}$. If $Y$ is continuous we compute loss with the squared error $\lVert \hat{Y} - Y \rVert^2$. If $Y$ is categorical we use $-\log p^*$, where $p^*$ corresponds to the predicted probability of the correct class. The errors from $\hat{Y}$ are propagated through the whole model, but the errors from $\hat{Y}'$ are only used to train its progenitor (Figure 1).
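The attention-based encoding described above (a weighted average of hidden states, with weights produced by a scorer shared across timesteps) can be sketched in a few lines. This is an illustrative sketch only: the hidden states and scores below are made-up numbers standing in for LSTM outputs and FFNN scores, and `attention_pool` is a hypothetical helper name.

```python
import math

def attention_pool(hidden_states, scores):
    """Softmax the per-timestep scores, then average the hidden states with those weights."""
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    dim = len(hidden_states[0])
    encoding = [
        sum(w * h[j] for w, h in zip(weights, hidden_states))
        for j in range(dim)
    ]
    return weights, encoding

# Assumed toy inputs: three timesteps with two-dimensional hidden states.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 1.0]
weights, e = attention_pool(hidden, scores)
```

The returned `weights` form a distribution over the words of the text, which is the quantity later reused when eliciting a lexicon from attention-based models.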
Note the similarities between this model and the popular residualizing regression (RR) technique (Jaeger et al., 2009; Baayen et al., 2010, inter alia). Both use the text to improve an estimate generated from the confounds. RR treats this as two separate regression tasks, by regressing the confounds against the variables of interest, and then using the residuals as features, while our model introduces the capacity for nonlinear interactions by backpropagating between RR's steps.
Lexicon Induction. We elicit lexicons from +ATTN style models by (1) running inference on a test set, but rather than saving those predictions, saving the attentional distribution over each source text, and (2) mapping each word to its average attentional score and selecting the k highest-scoring words.
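As an illustration of the two-step elicitation for +ATTN models, the sketch below averages each word's attention weight over all of its occurrences and keeps the k best. The inputs are assumed toy values rather than real model output, and `attention_lexicon` is a hypothetical name; ties are broken alphabetically here, which is an arbitrary choice.

```python
from collections import defaultdict

def attention_lexicon(texts, attention, k):
    """texts: list of token lists; attention: matching list of per-token weights."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tokens, weights in zip(texts, attention):
        for token, weight in zip(tokens, weights):
            totals[token] += weight
            counts[token] += 1
    # Average attentional score per word type.
    average = {tok: totals[tok] / counts[tok] for tok in totals}
    ranked = sorted(average, key=lambda tok: (-average[tok], tok))
    return ranked[:k]

# Assumed toy attention distributions saved at inference time.
texts = [["please", "fix", "this"], ["please", "help"]]
attention = [[0.6, 0.3, 0.1], [0.5, 0.5]]
lexicon = attention_lexicon(texts, attention, k=2)
```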
For +BOW style models, we take the matrix that compresses the text's word frequency vector, then score each word by computing the $\ell_1$ norm of the column that multiplies it, with the intuition that important words are dotted with big vectors in order to be a large component of $e$.
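The column-norm scoring for +BOW models can be sketched as follows. The sketch assumes the compressing matrix is stored with one row per hidden unit and one column per vocabulary word, so that column j is the vector multiplying word j's frequency; the matrix values, vocabulary, and function name are hypothetical.

```python
def bow_lexicon(weight_matrix, vocab, k):
    """Score each word by the l1 norm of its column and return the top k words."""
    scores = {
        word: sum(abs(row[j]) for row in weight_matrix)
        for j, word in enumerate(vocab)
    }
    ranked = sorted(vocab, key=lambda word: (-scores[word], word))
    return ranked[:k], scores

# Assumed toy first-layer weights: 2 hidden units x 3 vocabulary words.
W = [
    [0.1, -2.0, 0.3],
    [0.0, 1.5, -0.2],
]
vocab = ["the", "multiple", "ago"]
lexicon, scores = bow_lexicon(W, vocab, k=2)
```

Taking absolute values means a word is scored by how strongly it moves the encoding in any direction, not by the sign of its effect on the outcome.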

Adversarial Selector (A)
Motivation. We begin by observing that a desirable $L$ can explain $Y$, but is unrelated to $C$, which implies it should struggle to predict $C$. The Adversarial Selector draws inspiration from this. It learns adversarial encodings of $T$ which are useful for predicting $Y$, but not useful for predicting $C$. It is depicted in Figure 2.
Description. First, we encode $T$ into $e \in \mathbb{R}^d$ via the same mechanisms as the Deep Residualizer of Section 3.1. $e$ is then passed to a series of FFNNs ("prediction heads") which are trained to predict each target and confound with the same loss functions as that of Section 3.1. As gradients backpropagate from the confound prediction heads to the encoder, we pass them through a gradient reversal layer in the style of Ganin et al. (2016) and Britz et al. (2017), which multiplies gradients by −1. If the cumulative loss of the target variables is $\mathcal{L}_t$ and that of the confounds is $\mathcal{L}_c$, then the loss which is implicitly used to train the encoder is $\mathcal{L}_e = \mathcal{L}_t - \mathcal{L}_c$, thereby encouraging the encoder to learn representations of the text which are not useful for predicting the confounds.
Lexicons are elicited from this model via the same mechanism as the Deep Residualizer of Section 3.1.
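The effect of the gradient reversal layer on the encoder can be checked by hand in a deliberately tiny setting. The sketch below assumes a scalar linear encoder e = w·x, scalar linear prediction heads, and squared-error losses; none of these simplifications come from the paper, and all names are hypothetical. It shows that flipping the sign of the confound head's gradient gives the encoder the gradient of the target loss minus the confound loss.

```python
def losses(w, a, b, x, y, c):
    """Target and confound squared errors for encoder e = w*x and heads a*e, b*e."""
    e = w * x
    return (a * e - y) ** 2, (b * e - c) ** 2

def encoder_gradient(w, a, b, x, y, c):
    """Gradient reaching the encoder weight w when the confound path is reversed."""
    e = w * x
    d_target_de = 2.0 * (a * e - y) * a      # gradient from the target head
    d_confound_de = 2.0 * (b * e - c) * b    # gradient from the confound head
    reversed_confound = -1.0 * d_confound_de # gradient reversal layer
    return (d_target_de + reversed_confound) * x  # chain rule through e = w*x

def numeric_gradient(w, a, b, x, y, c, eps=1e-6):
    """Finite-difference gradient of (target loss - confound loss) with respect to w."""
    lt_hi, lc_hi = losses(w + eps, a, b, x, y, c)
    lt_lo, lc_lo = losses(w - eps, a, b, x, y, c)
    return ((lt_hi - lc_hi) - (lt_lo - lc_lo)) / (2.0 * eps)

# Assumed toy values.
params = dict(w=0.5, a=2.0, b=-1.0, x=3.0, y=1.0, c=2.0)
analytic = encoder_gradient(**params)
numeric = numeric_gradient(**params)
```

Note that the prediction heads themselves still receive their ordinary (unreversed) gradients; only the signal passed back into the encoder is negated.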

Experiments
We evaluate the approaches described in Sections 3 and 5 by generating and evaluating deconfounded lexicons in three domains: financial complaints, e-commerce product descriptions, and course descriptions. In each case the goal is to find words which can always help someone net a positive outcome (fulfillment, sales, enrollment), regardless of their situation. This involves finding words associated with narrative persuasion: predictive of human decisions or preferences but decorrelated from non-linguistic information which could also explain things. We analyze the resulting lexicons, especially with respect to the classic Aristotelian modes of persuasion: logos, pathos, and ethos.
We compare the following algorithms: Regression (R), Regression with Confound features (RC), Mixed effects Regression (M), Residualizing Regressions (RR), Log-Odds Ratio (OR), Mutual Information (MI), and MI/OR with regression (R+MI and R+OR). See Section 5 for a discussion of these baselines, and the online supplementary information for implementation details. We also compare the proposed algorithms: Deep Residualization using word frequencies (DR+BOW) and embeddings (DR+ATTN), and Adversarial Selection using word frequencies (A+BOW) and embeddings (A+ATTN).
In Section 2 we observed that $I(L)$ measures the improvement in predictive power that $L(T)$ affords a model already having access to $C$. Thus, we evaluate each algorithm by (1) regressing $Y$ on $C$, (2) drawing a lexicon $L$, (3) regressing $Y$ on $C + L(T)$, and (4) measuring the size of the gap in test prediction error between the models of steps (1) and (3). For classification problems, we measured error with cross-entropy (XE):
$$\mathrm{XE} = -\sum_{i} p_i \log \hat{p}_i,$$
where $p_i$ and $\hat{p}_i$ are the true and predicted probabilities of class $i$. For regression, we computed the mean squared error (MSE):
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(Y_i - \hat{Y}_i\big)^2.$$
Because we fix lexicon size but vary lexicon content, lexicons with good words will score highly under this metric, yielding large performance improvements when combined with $C$.
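The two error measures and the resulting performance gap can be written down directly. The sketch below is illustrative: it averages cross-entropy over examples (an assumption about how per-example values are aggregated), and the predictions are invented numbers standing in for the outputs of the confound-only and confound-plus-lexicon models.

```python
import math

def cross_entropy(true_dists, pred_dists):
    """Mean over examples of -sum_i p_i * log(p_hat_i)."""
    total = 0.0
    for p, p_hat in zip(true_dists, pred_dists):
        total -= sum(pi * math.log(qi) for pi, qi in zip(p, p_hat) if pi > 0)
    return total / len(true_dists)

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted outcomes."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def performance_gain(error_c, error_lc):
    """Gap between the confound-only error and the confound-plus-lexicon error."""
    return error_c - error_lc

# Assumed held-out outcomes and predictions from the two models.
y_test = [1.0, 2.0, 3.0, 4.0]
pred_c = [2.0, 2.0, 3.0, 3.0]    # model with access to C only
pred_lc = [1.0, 2.0, 3.0, 3.0]   # model with access to C and L(T)
gain = performance_gain(mse(y_test, pred_c), mse(y_test, pred_lc))
```

A larger gain indicates that the lexicon contributes more predictive power beyond what the confounds already provide.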
We also report the average strength of association between words in $L$ and $C$. For categorical confounds, we measure Cramér's V ($V$) (Cramér, 2016), and for continuous confounds, we use the point-biserial correlation coefficient ($r_{pb}$) (Glass and Hopkins, 1970). Note that $r_{pb}$ is mathematically equivalent to Pearson correlation in bivariate settings. Here the best lexicons will score the lowest.
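Both association measures are simple to compute for a single word. The sketch below assumes each word is represented by a binary presence indicator per document and uses the textbook definitions (chi-squared based Cramér's V without bias correction, and Pearson correlation for the point-biserial coefficient); the helper names and data are hypothetical, and each variable is assumed to take at least two distinct values.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences."""
    n = len(xs)
    joint, count_x, count_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    chi2 = 0.0
    for a in count_x:
        for b in count_y:
            expected = count_x[a] * count_y[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(count_x), len(count_y)) - 1)))

def point_biserial(binary, values):
    """Pearson correlation between a 0/1 indicator and a continuous variable."""
    n = len(values)
    mean_b, mean_v = sum(binary) / n, sum(values) / n
    cov = sum((b - mean_b) * (v - mean_v) for b, v in zip(binary, values))
    spread_b = math.sqrt(sum((b - mean_b) ** 2 for b in binary))
    spread_v = math.sqrt(sum((v - mean_v) ** 2 for v in values))
    return cov / (spread_b * spread_v)

# Assumed toy data: word presence per document, a categorical confound, and a price.
present = [1, 1, 0, 0]
seller = ["a", "a", "b", "b"]
price = [1.0, 2.0, 3.0, 4.0]
v_score = cramers_v(present, seller)
r_score = point_biserial([0, 0, 1, 1], price)
```

Averaging these per-word scores over a lexicon gives one number per lexicon, with lower values indicating less confound-related words.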
We implemented neural models with the TensorFlow framework (Abadi et al., 2016) and optimized using Adam (Kingma and Ba, 2014). We implemented linear models with the scikit-learn package (Pedregosa et al., 2011). We implemented mixed models with the lme4 R package (Bates et al., 2014). We refer to the online supplementary materials for per-experiment hyperparameters.
For each dataset, we constructed vocabularies from the 10,000 most frequently occurring tokens, and randomly selected 2,000 examples for evaluation. We then conducted a wide hyperparameter search and used lexicon performance on the evaluation set to select final model parameters. We then used these parameters to induce lexicons from 500 random train/test splits. Significance is estimated with a bootstrap procedure: we counted the number of trials each algorithm "won" (i.e. had the largest $\mathrm{error}_{C} - \mathrm{error}_{L(T),C}$). We also report the average performance and correlation of all the lexicons generated from each split. We ran these experiments using lexicon sizes of k = 50, 150, 250, and 500 and observed similar behavior. The results reported in the following sections are for k = 150, and the words in Tables 1, 2, and 3 are from randomly selected lexicons (other lexicons had similar characteristics).
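The "win" counting over random splits can be sketched as below. This is one assumed reading of the procedure: for each split, the algorithm with the largest error gap is credited with a win, and the fraction of wins is reported. The per-split gains are invented numbers, and ties go to whichever algorithm is listed first, which is an arbitrary choice.

```python
def win_rates(gains_by_algorithm):
    """gains_by_algorithm: name -> list of per-split gains (error_C - error_{L(T),C})."""
    names = list(gains_by_algorithm)
    n_splits = len(gains_by_algorithm[names[0]])
    wins = {name: 0 for name in names}
    for i in range(n_splits):
        # The winner of a split is the algorithm with the largest gain on it.
        best = max(names, key=lambda name: gains_by_algorithm[name][i])
        wins[best] += 1
    return {name: wins[name] / n_splits for name in names}

# Assumed toy gains over four splits for two algorithms.
gains = {
    "DR+BOW": [0.30, 0.20, 0.40, 0.10],
    "RR": [0.10, 0.25, 0.20, 0.05],
}
rates = win_rates(gains)
```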

Consumer Financial Protection Bureau (CFPB) Complaints
Setup. We consider 189,486 financial complaints publicly filed with the Consumer Financial Protection Bureau (CFPB).² The CFPB is a product of Dodd-Frank legislation which solicits and addresses complaints from consumers regarding a variety of financial products: mortgages, credit reports, etc. Some submissions are handled on a timely basis (< 15 days) while others languish. We are interested in identifying salient words which help push submissions through the bureaucracy and obtain timely responses, regardless of the specific nature of the complaint. Thus, our target variable is a binary indicator of whether the complaint obtained a timely response. Our confounds are twofold: (1) a categorical variable tracking the type of issue (131 categories), and (2) a categorical variable tracking the financial product (18 categories). For the proposed DR+BOW, DR+ATTN, A+BOW, and A+ATTN models, we set $|e|$ to 1, 64, 1, and 256, respectively.
² These data can be obtained from https://www.consumerfinance.gov/data-research/consumer-complaints/
Results. In general, this seems to be a tractable classification problem, and the confounds alone are moderately predictive of timely response ($\mathrm{XE}_C = 1.06$). The proposed methods appear to perform the best, and DR+BOW achieved the largest performance/correlation ratio (Figure 3). We obtain further evidence upon examining the lexicons selected by four representative algorithms: proposed (DR+BOW), a well-performing baseline (RR), and two naive baselines (R, MI) (Table 1). MI's words appear unrelated to the confounds, but don't seem very persuasive, and our results corroborate this: these words failed to add predictive power over the confounds (Figure 3).
On the opposite end of the spectrum, R's words appear somewhat predictive of timely response, but are confound-related: they include the FDCPA (Fair Debt Collection Practices Act) and HIPAA (Health Insurance Portability and Accountability Act), which are directly related to the confound of financial product.
The top-scoring words in RR's lexicon include numbers ("6", "150.00") and words that suggest that the issue is ongoing ("being", "starting"). On the other hand, the words of DR+BOW draw on the rhetorical devices of ethos by respecting the reader's authority ("ma'am", "honor"), and logos by suggesting that the writer has been proactive about solving the issue ("multiple", "submitted", "xx/xx/xxx", "ago"). These are narrative qualities that align with two of the persuasion literature's "weapons of influence": reciprocation and commitment (Kenrick et al., 2005). Several algorithms implicitly favored longer (presumably more detailed) complaints by selecting common punctuation.

University Course Descriptions
Setup. We consider 141,753 undergraduate and graduate course offerings over a 6-year period (2010-2016) at Stanford University. We are interested in how the writing style of a description convinces students to enroll. We therefore choose log(enrollment) as our target variable and control for non-linguistic information which students also use when making enrollment decisions: course subject (227 categories), course level (26), number of requirements satisfied (7), whether there is a final (3), the start time, and the combination of days the class meets (26). All except start time are modeled as categorical variables. For the proposed DR+BOW, DR+ATTN, A+BOW, and A+ATTN models, we set $|e|$ to 1, 100, 16, and 64, respectively.
Results. This appears to be a tractable regression problem; the confounds alone are highly predictive of course enrollment ($\mathrm{MSE}_C = 3.67$) (Figure 4). A+ATTN performed the best, and in general, the proposed techniques produced the most-predictive and least-correlated lexicons. Interestingly, Residualization (RR) and Regression with Confounds (RC) appear to outperform the Deep Residualization selector.
In Table 2 we observe stark differences between the highest-scoring words of a proposed technique (A+ATTN) and two baselines with opposing characteristics (R, OR). Words chosen via Regression (R) appear predictive of enrollment, but also related to the confounds of subject ("programming", "computer", "management", "chemical", "clinical") and level ("required", "prerequisites", "introduction"). Words chosen via log-odds (OR) appear unrelated to both the confounds and enrollment. The Adversarial Selector (A+ATTN) selected words which are both confound-decorrelated and predictive of enrollment. Its words appeal to the concept of variety ("or", "guest"), and to pathos, in the form of universal student interests ("future", "eating", "sexual"). Notably, the A+ATTN words are also shorter (mean length of 6.2) than those of R (9.3) and OR (9.0), which coincides with intuition (students often skim descriptions) and prior research (short words are known to be more persuasive in some settings (Pratkanis et al., 1988)). The lexicon also suggests that students prefer courses with research project components ("research", "project").

eCommerce Descriptions
Setup. We consider 59,487 health product listings on the Japanese e-commerce website Rakuten.³ These data originate from a December 2012 snapshot of the Rakuten marketplace. They were tokenized with the JUMAN morphological analyzer (Kurohashi and Nagao, 1999). We are interested in identifying words which advertisers could use to increase their sales, regardless of the nature of the product. Therefore, we set log(sales) as our target variable, and control for an item's price (continuous) and seller (207 categories). The category of an item (i.e. toothbrush vs. supplement) is not included in these data. In practice, sellers specialize in particular product types, so this may be indirectly accounted for. For the proposed DR+BOW, DR+ATTN, A+BOW, and A+ATTN models, we set $|e|$ to 4, [...]
³ These data can be obtained from https://rit.rakuten.co.jp/data_release/
Results. This appears to be a more difficult prediction task, and the confounds are only slightly predictive of sales ($\mathrm{MSE}_C = 116.34$) (Figure 5). Again, lexicons obtained via the proposed methods were the most successful, achieving the highest performance with the lowest correlation (Table 3). When comparing the words selected by A+BOW (proposed) and RR (widely used and well performing), we find that both draw on the rhetorical element of logos and demonstrate informativeness ("nutrition", "size", etc.). A+BOW also draws on ethos by identifying word stems associated with politeness. This quality draws on the authority of shared cultural values, and has been shown to appeal to Japanese shoppers. On the other hand, RR selected several numbers and failed to avoid brand indicators: "nichiban", a large company which specializes in medical adhesives, is one of the highest-scoring words.

Related Work
There are three areas of related work which we draw on. We address these in turn.
Lexicon induction. Some work in lexicon induction is intended to help interpret the subjective properties of a text or make machine learning models more interpretable, i.e. so that practitioners can know why their system works. For example, Taboada et al. (2011) and Hamilton et al. (2016) induce sentiment lexicons, and Mohammad and Turney (2010) and Hu et al. (2009) induce emotion lexicons. Practitioners often get these words by considering the high-scoring features of regressions trained to predict an outcome (McFarland et al., 2013; Chahuneau et al., 2012; Ranganath et al., 2013; Kang et al., 2013). They account for confounds through manual inspection, residualizing (Jaeger et al., 2009; Baayen et al., 2010), hierarchical modeling (Bates, 2010; Gustarini, 2016; Schillebeeckx et al., 2016), log-odds (Szumilas, 2010; Monroe et al., 2008), mutual information (Berg, 2004), or matching (Tan et al., 2014; DiNardo, 2010). Many of these methods are manual processes or have known limitations, mostly due to multicollinearity (Imai and Kim, 2016; Chatelain and Ralf, 2012; Wurm and Fisicaro, 2014). Furthermore, these methods have not been tested in a comparative setting: this work is the first to offer an experimental analysis of their abilities.
Causal inference. Our methods for lexicon induction have connections to recent advances in the causal inference literature. In particular, Johansson et al. (2016) propose an algorithm for counterfactual inference which bears similarities to our Adversarial Selector (Section 3.2), Imai et al. (2013) advocate a lasso-based method related to our Deep Residualization (DR) method (Section 3.1), and Egami et al. (2017) explore how to make causal inferences from text through careful data splitting. Unlike us, these papers are largely unconcerned with the underlying features and algorithmic interpretability. Athey (2017) has a recent survey of machine learning problems where causal modeling is important.
Persuasion. Our experiments touch on the mechanism of persuasion, which has been widely studied. Most of this prior work uses lexical, syntactic, discourse, and dialog interactive features (Stab and Gurevych, 2014; Habernal and Gurevych, 2016; Wei et al., 2016), power dynamics (Rosenthal and Mckeown, 2017; Moore, 2012), or diction (Wei et al., 2016) to study discourse persuasion as manifested in argument. We study narrative persuasion as manifested in everyday decisions. This important mode of persuasion is understudied because researchers have struggled to isolate the "active ingredient" of persuasive narratives (Green, 2008; De Graaf et al., 2012), a problem that the formal framework of deconfounded lexicon induction (Section 2) may help alleviate.

Conclusion
Computational social scientists frequently develop algorithms to find words that are related to some information but not other information. We encoded this problem into a formal task, proposed two novel methods for it, and conducted the first principled comparison of algorithms in the space. Our results suggest the proposed algorithms offer better performance than those which are currently in use. Upon linguistic analysis, we also find the proposed algorithms' words better reflect the classic Aristotelian modes of persuasion: logos, pathos, and ethos. This is a promising new direction for NLP research, one that we hope will help computational (and non-computational!) social scientists better interpret linguistic variables and their relation to outcomes. There are many directions for future work. These include algorithmic innovation, theoretical bounds for performance, and investigating rich social questions with these powerful new techniques.

Appendix: Causal Interpretation of the Informativeness Coefficient
Recall the definition of $I(L)$:
$$I(L) = \mathbb{E}\big[\mathrm{Var}\big[\mathbb{E}[Y \mid L(T), C] \mid C\big]\big].$$
Here, we discuss how under standard (albeit strong) assumptions that are often made to identify causal effects in observational studies, we can interpret $I(L)$ with $L(T) = T$ as a measure of the strength of the text's causal effect on $Y$.
Following the potential outcomes model of Rubin (1974) we start by imagining potential outcomes $Y(t)$ corresponding to the outcome we would have observed given text $t$ for any possible text $t \in \mathcal{T}$; then we actually observe $Y = Y(T)$. With this formalism, the causal effect of the text is clear, e.g., the effect of using text $t'$ versus $t$ is simply $Y(t') - Y(t)$.
Suppose that $T$, our observed text, takes on values in $\mathcal{T}$ with a distribution that depends on $C$. We also assume that the observed text $T$ is independent of the potential outcomes $\{Y(t)\}_{t \in \mathcal{T}}$, conditioned on the confounders $C$ (Rosenbaum and Rubin, 1983). So we know what would happen with any given text, but don't yet know which text will get selected (because $T$ is a random variable). Now if we fix $C$ and there is any variance remaining in $Y(T)$ (i.e. $\mathbb{E}\big[\mathrm{Var}\big[Y(T) \mid C, \{Y(t)\}_{t \in \mathcal{T}}\big]\big] > 0$) then the text has a causal effect on $Y$. Now we assume that $Y(t) = f_c(t) + \epsilon$, meaning that the difference in effects of one text $t$ relative to another text $t'$ is always the same given fixed confounders. For example, in a bag of words model, this would imply that switching from using the word "eating" versus "homework" in a course description would always have the same impact on enrollment (conditionally on confounders). With this assumption in hand, the causal effects of $T$, $\mathbb{E}\big[\mathrm{Var}\big[Y(T) \mid C, \{Y(t)\}_{t \in \mathcal{T}}\big]\big]$, match $I(L)$ as described in equation (1) (Imbens and Rubin, 2015). In other words, given the same assumptions often made in observational studies, the informativeness coefficient of the full, uncompressed text in fact corresponds to the amount of variation in $Y$ due to the causal effects of $T$.
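The matching step can be spelled out in a few lines. The following is a sketch of one way to see it rather than a reproduction of the cited argument; it assumes, as in the text, that $\epsilon$ does not depend on $t$, so that $\epsilon$ is determined by $C$ and the potential outcomes and is therefore independent of $T$ given $C$.

```latex
\begin{align*}
% Given C and the potential outcomes, epsilon is fixed and only T is random:
\mathrm{Var}\big[Y(T) \mid C, \{Y(t)\}_{t \in \mathcal{T}}\big]
  &= \mathrm{Var}\big[f_C(T) \mid C, \{Y(t)\}_{t \in \mathcal{T}}\big]
   = \mathrm{Var}\big[f_C(T) \mid C\big], \\
% The conditional mean of Y given (T, C) differs from f_C(T) by a term depending on C only:
\mathbb{E}[Y \mid T, C]
  &= f_C(T) + \mathbb{E}[\epsilon \mid T, C]
   = f_C(T) + \mathbb{E}[\epsilon \mid C], \\
% Hence both quantities reduce to the same expression:
I(L)\big|_{L(T) = T}
  &= \mathbb{E}\big[\mathrm{Var}\big[\mathbb{E}[Y \mid T, C] \mid C\big]\big]
   = \mathbb{E}\big[\mathrm{Var}\big[f_C(T) \mid C\big]\big]
   = \mathbb{E}\big[\mathrm{Var}\big[Y(T) \mid C, \{Y(t)\}_{t \in \mathcal{T}}\big]\big].
\end{align*}
```

The second equality in the first line and the second equality in the second line both use the conditional independence of $T$ and the potential outcomes given $C$.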