Topics to Avoid: Demoting Latent Confounds in Text Classification

Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author’s native language is Swedish). We propose a method that represents the latent topical confounds and a model which “unlearns” confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.


Introduction
Text classification systems based on neural networks are biased towards learning frequent spurious correlations in the training data that may be confounds in the actual classification task (Leino et al., 2019). A major challenge in building such systems is to discover features that are not just correlated with the signals in the training data, but are true indicators of these signals, and therefore generalize well.
For example, Kiritchenko and Mohammad (2018) found that sentiment analysis systems implicitly overfit to demographic confounds, systematically amplifying the intensity ratings of posts written by women. Visual semantic role labeling models have likewise been shown to implicitly capture actions stereotypically associated with men or women (e.g., women are cooking and men are fixing a faucet), and in cases of higher model uncertainty to assign stereotypical labels to actions and objects, thereby amplifying social biases found in the training data.
We focus on the task of native language identification (L1ID), which aims at automatically identifying the native language (L1) of an individual based on their language production in a second language (L2, English in this work). The aim of this task is to discover stylistic features present in the input that are indicative of the author's L1. However, a model trained to predict L1 is likely to predict that a person is, say, a native Greek speaker, if the texts authored by that person mention Greece, because the training data exhibits such topical correlations (§2).
This problem is the focus of our work, and we address it in two steps. First, we introduce a novel method for representing latent confounds. Recent relevant work in the area of domain adaptation (Ganin et al., 2016) and deconfounding for text classification (Pryzant et al., 2018; Elazar and Goldberg, 2018) assumes that the set of confounds is known a priori, and that their values are given as part of the training data. This is an unrealistic setting that limits the applicability of such models in real-world scenarios. In contrast, we introduce a new method, based on log-odds ratio with Dirichlet prior (Monroe et al., 2008), for identifying and representing latent confounds as probability distributions (§3). Second, we propose a novel alternating learning procedure with multiple adversarial discriminators, inspired by adversarial learning (Goodfellow et al., 2014), that demotes latent confounds and results in textual representations that are invariant to the confounds (§4).
Note that these two proposals are task-independent and can be extended to a vast array of text classification tasks where confounding factors are not known a priori. For concreteness, however, we evaluate our approach on the task of L1ID (§5). We experiment with two different datasets: a small corpus of student-written essays and a large and noisy dataset of Reddit posts. We show that classifiers trained on these datasets without any intervention learn spurious topical correlations that are not indicative of style, and that our proposed deconfounded classifiers alleviate this problem (§6). We present an analysis of the features discovered after demoting these confounds in §7.
The main contributions of this work are:
1. We introduce a novel method for representing and identifying variables which are confounds in text classification tasks.
2. We propose a classification model and an algorithm aimed at learning textual representations that are invariant to the confounding variable.
3. We introduce a novel approach to adversarial training with multiple adversaries, to alleviate the problem of drifting parameters during alternating classifier-adversary optimization.
4. Finally, we analyze some linguistic features that are not only predictive of the author's L1 but are also devoid of topical bias.

Motivation
We study the general effect of topical confounds in text classification. To motivate the need to demote them, we introduce as a case study the L1ID task, in which the goal is to predict the native language of a writer given their texts in L2.
We begin with the L2-Reddit corpus, consisting of Reddit posts by authors with 23 different L1s, most of them European languages. Some of the posts come from Europe-related forums (e.g. r/Europe, r/AskEurope, r/EuropeanCulture), whereas others are from unrelated forums. We view the latter as out-of-domain data and use them to evaluate the generalization of our models. We use a subset of this corpus, with only the 10 most frequent L1s, to guarantee a large enough balanced training set. We remove all the posts with fewer than 50 words and sample the dataset to obtain a balanced distribution of labels: from this balanced dataset, we randomly sample 20% of examples from each class and divide them equally to create development and test sets. In total, there are around 260,000 examples in the training set and 32,000 examples each in the development set, the in-domain test set, and the out-of-domain test set. We trained a standard (non-adversarial) classifier, with a bidirectional LSTM encoder followed by two feedforward layers with a tanh activation function and a softmax in the final layer (full experimental details are given in §5.2). We refer to this model as NO-ADV. The results are shown in Table 1. Notice the large drop in accuracy on the out-of-domain data, which indicates that the model is learning topical features.
To further verify this claim, we used log-odds ratio with Dirichlet prior (Monroe et al., 2008), a common way to identify words that are statistically overrepresented in a particular population compared to others, to identify the top-K words that were most strongly associated with a specific L1 in the training set. (We refer the reader to Monroe et al. (2008) for the details of the algorithm.) We experimented with K ∈ {20, 50, 100, 200}. Table 2 shows the top-10 words in each class; observe that almost all of these words are geographical (hence, topical) terms that have nothing to do with the L1.
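As a concrete illustration of this scoring step, the following is a minimal Python sketch of the z-scored log-odds ratio with an informative Dirichlet prior, following our reading of Monroe et al. (2008). The function names, the one-vs-rest comparison, and the use of whole-corpus counts as the prior are assumptions made for illustration; they are not details confirmed by the description above.

```python
import math
from collections import Counter


def log_odds_dirichlet(target_counts, other_counts, prior_counts):
    """Return {word: z-score} for one class against the rest.

    target_counts: word counts in the class of interest.
    other_counts:  word counts in all remaining classes.
    prior_counts:  pseudo-counts for the Dirichlet prior (assumed here to be
                   the whole-corpus counts); must cover at least two words.
    """
    n_t = sum(target_counts.values())
    n_o = sum(other_counts.values())
    a_0 = sum(prior_counts.values())
    scores = {}
    for word, a_w in prior_counts.items():
        y_t = target_counts.get(word, 0)
        y_o = other_counts.get(word, 0)
        # Difference of smoothed log-odds of the word in the two populations.
        delta = (math.log((y_t + a_w) / (n_t + a_0 - y_t - a_w))
                 - math.log((y_o + a_w) / (n_o + a_0 - y_o - a_w)))
        # Approximate variance of the difference, used to z-score it.
        variance = 1.0 / (y_t + a_w) + 1.0 / (y_o + a_w)
        scores[word] = delta / math.sqrt(variance)
    return scores


def top_k(scores, k):
    """Words with the k highest scores, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Under these assumptions, calling `log_odds_dirichlet` once per L1 (with the posts of all other L1s pooled as `other_counts`) and then `top_k` would give one word list per class.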
Next, we masked such topical words (by replacing them with a special token) and evaluated the trained classifier on the masked test sets. Accuracy (Table 1) degrades on both the in-domain and out-of-domain sets, even when only 20 words are removed. The drop in accuracy on the out-of-domain dataset is smaller since these data do not include many instances where the presence of topical words would help in identifying the label. These experiments confirm our hypothesis that the baseline classifier is primarily learning topical correlations, and motivate the need for a deconfounded classification approach, which we describe next.
Latent Dirichlet allocation (LDA; Blei et al., 2003) is a probabilistic generative model for discovering abstract topics that occur in a collection of documents. Under LDA, each document can be considered a mixture of a small (fixed) number of topics (each represented as a distribution over words), and each word's presence is assumed to be attributed to one of the document's topics. More precisely, LDA assigns each document a probability distribution over a fixed number of topics K. LDA topics are known to be poor features for classification (McAuliffe and Blei, 2008), indicating that they do not encode all the topical information. Moreover, they can encode information which is not actually topical and can be a useful L1 marker. Motivated by our case study (§2), we propose a novel method to represent topic distributions, based on log-odds scores (Monroe et al., 2008), and compare it to LDA as a baseline.

For each class label y and each word type w, we calculate a log-odds score lo(w, y) ∈ R. The higher this score, the stronger the association between the class and the word. As we saw in §2, the highest scored words are mostly topical and hence constitute superficial features which we want the classification model to "unlearn." We therefore define a distribution which assigns high probability to a document containing these high scoring words. For a label y ∈ Y and an input document x = w_1, . . . , w_n, we define p(y | x):

p(y \mid x) \propto p(y) \, p(x \mid y) = p(y) \prod_{i=1}^{n} p(w_i \mid y)

The above expansion assumes a bag-of-words representation. When the dataset is balanced, p(y) is equal for each label and can be omitted. Finally, we define p(w_i | y) ∝ σ(lo(w_i, y)), where σ(.) is the sigmoid function, which squashes the log-odds scores (whose values are in R) to the range [0, 1]. We normalize the sigmoid values over the vocabulary to convert them to a probability distribution. In this distribution, the number of "topics" equals the number of labels, m.
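The construction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names are hypothetical, classes are assumed balanced (so p(y) is dropped), and tokens without a log-odds score are simply skipped.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def word_distributions(log_odds):
    """Map {label: {word: lo(w, y)}} to {label: {word: p(w | y)}}.

    Each score is squashed with a sigmoid and then normalized over the
    vocabulary so that the values for one label sum to 1.
    """
    dists = {}
    for label, scores in log_odds.items():
        squashed = {w: sigmoid(s) for w, s in scores.items()}
        total = sum(squashed.values())
        dists[label] = {w: v / total for w, v in squashed.items()}
    return dists


def topic_distribution(tokens, dists):
    """p(y | x) proportional to the product of p(w_i | y), in log space."""
    log_scores = {
        label: sum(math.log(p_w[t]) for t in tokens if t in p_w)
        for label, p_w in dists.items()
    }
    shift = max(log_scores.values())  # avoid underflow on long documents
    unnorm = {y: math.exp(s - shift) for y, s in log_scores.items()}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}
```

A document that mentions a word with a high log-odds score for one label receives most of its probability mass on that label, while a document containing only words scored equally across labels receives a uniform distribution.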

Deconfounded Text Classification
We now formalize the task setup and the classification model. We are given N labeled documents in the training set {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)}. For each document x_i, we represent latent (topical) confounds, that is, domain-specific and superficial document features, as a K-dimensional multinomial distribution t_i ∈ {(t_1, . . . , t_K) | Σ_{j=1}^{K} t_j = 1}. In our task, the confounds are topics, so that each t_j represents the proportion of document i associated with topic j, but these topics are not given a priori. In this work, the number of topics K equals the number of labels m, but the methods presented are valid for any number of topics.
Our goal is to train a classifier f, parameterized by θ, which learns to accurately predict the target label while ignoring superficial topical correlations present in the training set. That is, for a text x we wish to predict ŷ = f_θ(x) such that the prediction does not encode any information about t. Following Ganin et al. (2016), Pryzant et al. (2018), and Elazar and Goldberg (2018), we input x to an encoder neural network h(x; θ_h) to obtain a hidden representation h_x (see Figure 1), followed by two feedforward networks: (1) c(h(x); θ_c) to predict the label y; and (2) an adversary network adv(h(x); θ_a) to predict the topics. Departing from prior work which used predefined binary confounds, our adversary predicts the topic distribution t. If h_x does not encode any information to predict t, then c(h(x)) will not depend on t. Concretely, we want to optimize an objective with two competing cross-entropy (CE) terms: the classification loss between c(h(x)) and y, which is minimized, and the topic prediction loss between adv(h(x)) and t, which the encoder seeks to maximize. This objective seeks a representation h_x which is maximally predictive of the class label but not of the topical distribution (ideally, the adversary should output a uniform topic distribution for every input).

Table 2: Words most strongly associated with each L1 in the training set, by log-odds score.
English: ireland irish british britain russia scotland england states american london brexit
Finnish: finland finnish finns helsinki swedish finn nordic sweden sauna nokia estonian
French: french france paris sarkozy macron fillon hollande gaulle hamon marine valls breton
German: german germany austria merkel refugees asylum germans bavaria austrian berlin also
Greece: greek greece greeks syriza macedonia athens turkey macedonians fyrom turkish ancient
Dutch: dutch netherlands amsterdam wilders rotterdam holland rutte belgium bike hague
Polish: poland polish poles warsaw lithuanian lithuania judges jews ukranians imho tusk
Romanian: romania romanian romanians moldova bucharest hungarian hungarians transistria
Spanish: spain catalan spanish catalonia catalans madrid barcelona independence spaniards
Swedish: sweden swedish swedes stockholm swede malmo danish nordic denmark finland

Learning Schedule: Alternating Optimization of Classifier and Adversary
In practice, this optimization is done in an alternating fashion by minimizing the following two quantities: (1) with θ_h fixed, the adversary's topic prediction loss CE(adv(h(x)), t), with respect to θ_a; and (2) with θ_a fixed, the classification loss plus the cross-entropy between the adversary's output and a uniform topic distribution, CE(c(h(x)), y) + CE(adv(h(x)), u) with u = (1/K, . . . , 1/K), with respect to θ_h and θ_c. The training schedule is critical in adversarial setups where the loss has two competing terms (Mescheder et al., 2018; Arjovsky and Bottou, 2017; Roth et al., 2017); here, these terms minimize classification loss while maximizing the topic prediction loss. Algorithm 1 details our proposed alternating learning procedure. Inspired by generative adversarial networks (GANs; Goodfellow et al., 2014), the training procedure alternates between training the classifier and the adversary (see Figure 1). First (pretraining), we train the encoder along with the classifier using only classification loss, until convergence. After pretraining, h_x has encoded topical information which it uses for classification (as shown in our analysis in §2). Now, we train only adv(h(x)) to (accurately) predict t, keeping the parameters of h(.) fixed. Once adv(.) is trained, it should be able to successfully extract a topic distribution from h_x (topic training, see Figure 1a).
The goal now is to modify h_x in such a way that adv(h_x) produces a uniform distribution (that is, fooling the adversary, similar to fooling the discriminator in GANs). We do that by keeping the weights of adv(.) fixed, and training the network to produce the class label and a uniform topic distribution (topic forgetting, see Figure 1b). We then repeat this procedure for a fixed number of steps, which was tuned using the validation set.
Algorithm 1: Alternating optimization of classifier and adversary.
Result: θ_h, θ_c, θ_a1, . . . , θ_aT
Randomly initialize θ_h, θ_c;
while not converged do
    Sample a minibatch of b training samples;
    Update θ_h and θ_c using gradients with respect to the classification loss;
end
for j = 1, . . . , T do
    Randomly initialize θ_aj;
    for t steps do
        Sample a minibatch of b training samples;
        Fix θ_h and θ_c, update θ_aj using gradients with respect to the topic prediction loss of adv_j;
    end
    for t steps do
        Sample a minibatch of b training samples;
        Fix θ_au for u ∈_R {1, . . . , j} and update θ_c and θ_h using gradients with respect to the classification loss plus the cross-entropy between the output of adv_u and the uniform topic distribution;
    end
end
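To make the control flow of Algorithm 1 easier to follow, here is a minimal Python sketch of the phase schedule after pretraining. It only enumerates which phase runs at each step and which adversary is involved; no model is trained. The function name, the phase labels, and the assumption that both phases run for the same number of steps are illustrative choices, not details taken from the algorithm.

```python
import random


def alternating_schedule(num_rounds, steps_per_phase, seed=0):
    """Yield (phase, adversary_index) pairs in the order they are visited.

    Round j first trains a fresh adversary j ("topic_training"), then
    updates the encoder and classifier against an adversary drawn
    uniformly at random from adversaries 1..j ("topic_forgetting").
    """
    rng = random.Random(seed)
    for j in range(1, num_rounds + 1):
        for _ in range(steps_per_phase):
            # Encoder and classifier are frozen; only adversary j is updated.
            yield ("topic_training", j)
        for _ in range(steps_per_phase):
            # All adversaries are frozen; one of them is sampled per step.
            yield ("topic_forgetting", rng.randint(1, j))
```

In a full implementation, each yielded pair would select which parameters receive gradient updates and which loss is minimized on the current minibatch.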

Multiple Adversaries
In our experiments, we observe that after every "topic forgetting" stage, adv(.) does end up producing a uniform distribution, but in the next "topic training" phase, adv(.) is able to reproduce the topical distribution accurately. This is because, during "topic forgetting," the classifier does not really forget the topics in h_x; it just encodes them in a different way. This is a general problem in setups with alternating classifier-adversary optimization. To solve this issue, we propose using multiple adversaries, inspired by the "experience replay" approach used in reinforcement learning (O'Neill et al., 2010; Mnih et al., 2015). During the ith "topic training" phase, we train a new adversary adv_i (with parameters θ_ai) instead of retraining a single adversary over and over again. In the next "topic forgetting" phase, at each training step we pick adv_j at random from the pool of previously learned adversaries, with j drawn uniformly from {1, . . . , i}. By using multiple adversaries, we make it difficult for the classifier to encode topical information anywhere.

Datasets
We evaluate our topical confound demotion method on the L1ID task. We show experiments with two datasets where L2 is English: the L2-Reddit dataset described in §2, and TOEFL17, a collection of essays authored by non-native English speakers who apply for academic studies in the US. This corpus reflects eleven L1s: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. The training data include 11,000 essays (1,000 per L1) and the development set has 1,100 essays (100 per L1). We evaluate on the development set. Each essay is also marked with the ID of the prompt that the author was given for writing the essay. There are 8 prompts in total, based on which we construct 8 versions of the train and test sets. In each version, we remove essays marked with one of the prompts from both the train and the development sets, and consider the removed essays from the development set an "out-of-domain" test set. We refer to the version where prompt "PK" is out-of-domain as "-PK" in the results (Table 3), K ∈ {0, . . . , 7}.
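The construction of the eight "-PK" versions can be sketched as follows. This is a minimal illustration: the function name and the representation of each essay as a `(text, l1, prompt_id)` tuple are assumptions made for the example.

```python
def prompt_held_out_splits(train, dev, prompts):
    """Build one train/test version per held-out prompt.

    train, dev: lists of (text, l1, prompt_id) tuples.
    prompts:    iterable of prompt IDs, e.g. "P0" .. "P7".
    """
    splits = {}
    for held_out in prompts:
        splits["-" + held_out] = {
            # Essays on the held-out prompt are removed from training.
            "train": [ex for ex in train if ex[2] != held_out],
            # The remaining development essays form the in-domain test set.
            "in_domain": [ex for ex in dev if ex[2] != held_out],
            # The removed development essays form the out-of-domain test set.
            "out_of_domain": [ex for ex in dev if ex[2] == held_out],
        }
    return splits
```

Out-of-domain accuracy for the dataset as a whole is then the average over the `out_of_domain` sets of all versions.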

Implementation Details
We tokenized and lowercased all the text using spaCy. Limiting our vocabulary to the 30,000 most frequent words in the training data, we replaced all out-of-vocabulary words with "UNK." We encoded each word using a word embedding layer (initialized at random and learned) and passed these embeddings to a bidirectional LSTM encoder (one layer for each direction) with attention (h(x); Pryzant et al., 2018). Each LSTM layer had a hidden dimension of 128. For both c(.) and adv(.), we used two-layer feedforward networks with a tanh activation function in the middle layer (of size 256), followed by a softmax in the final layer.
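The vocabulary truncation step can be sketched as below. This is a minimal example that operates on already tokenized, lowercased documents; the function names are hypothetical, and tie-breaking among equally frequent words follows `Counter.most_common`, which may differ from the original preprocessing.

```python
from collections import Counter


def build_vocab(tokenized_docs, max_size=30000):
    """Return the set of the max_size most frequent tokens in the training data."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {word for word, _ in counts.most_common(max_size)}


def replace_oov(tokens, vocab, unk="UNK"):
    """Replace every token outside the vocabulary with the UNK symbol."""
    return [tok if tok in vocab else unk for tok in tokens]
```

The vocabulary is built from the training set only and then applied unchanged to the development and test sets.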

Baselines
We consider several baselines that are intended to capture the stylistic features of the texts, explicitly avoiding content.
Linear classifier with content-independent features (LR) Replicating Goldin et al. (2018), we trained a logistic regression classifier with three types of features: function words, POS trigrams, and sentence length, all of which are reflective of the style of writing. We deliberately avoided using content features (e.g., word frequencies).
Classification with no adversary on masked texts (LO-TOP-K) We mask the top-K words (based on log-odds scores) in both the train and the test sets (as in §2); we train the classification model again without training adv(.). After masking the top words, we expect patterns of writing style (and, therefore, L1) to become more apparent.
Adversarial training with gradient reversal (GR-LO) A common method of learning a confound-invariant representation is to use a gradient reversal layer (Beutel et al., 2017; Ganin et al., 2016; Pryzant et al., 2018; Elazar and Goldberg, 2018). The output of the encoder, h_x, is passed through this layer before applying adv(.). In the forward pass, this layer acts as identity, whereas in the backward pass it multiplies the gradient values by −λ, essentially reversing the gradients before they go into the encoder; λ controls the intensity of the reversal (we used λ = 0.2). This training setup usually proves too difficult to optimize, and often results in poor performance. That is, even if the performance of adv(.) is weak, h_x still ends up leaking information about the confound (Lample et al., 2019; Elazar and Goldberg, 2018).
LDA topics as confounds (ALT-LDA) We trained LDA on the training set and for each example in the training set, generated a probability distribution (over 50 topics), and used it as topical confound with our proposed learning setup, alternating classifier-adversary training.
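For readers unfamiliar with the gradient reversal layer used by the GR-LO baseline, its behavior can be sketched without any deep learning framework. The class below is a hypothetical, framework-free illustration that operates on plain lists of floats; in practice the layer would be implemented as a custom operation in an automatic differentiation library.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=0.2):
        self.lam = lam

    def forward(self, hidden):
        # The encoder output is passed to the adversary unchanged.
        return list(hidden)

    def backward(self, grad_output):
        # Gradients flowing back from the adversary are reversed and scaled,
        # so the encoder is pushed to increase the adversary's loss.
        return [-self.lam * g for g in grad_output]
```

With λ = 0.2, a gradient of 0.5 arriving from the adversary reaches the encoder as −0.1.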

TOEFL17 Dataset
We begin with experiments on the TOEFL17 dataset, where predicting L1 is an easier task due to the lower proficiency of the authors. Table 3 reports the accuracy of our proposed model, denoted ALT-LO, compared to the logistic regression baseline (LR), and two adversarial baselines: one demotes latent log-odds-based topics via gradient reversal (GR-LO), and another uses our proposed novel learning procedure but demotes baseline LDA topics (ALT-LDA). We report both in-domain accuracy and out-of-domain results; the latter are obtained by averaging the accuracy of each set "-PK" over K ∈ {0, . . . , 7}. Our model strongly outperforms all baselines that demote confounds, in both classification setups. We observe in our experiments that gradient reversal is especially unstable and hyperparameter sensitive: it has been shown to work well with categorical confounds like domain type or binary gender, but we observe that it is not effective in demoting continuous outputs like a topic distribution. The proposed alternating training with multiple discriminators obtains better results, and replacing LDA with log-odds-based topics also improves both in-domain and (much more substantially) out-of-domain predictions, confirming the effectiveness of our proposed innovations.

A vanilla classifier without demoting confounds (denoted in §2 as NO-ADV) yields in-domain and out-of-domain accuracies of 62.0 and 58.3, respectively. We would expect that the better generalization power of our proposed model would come at the price of lower in-domain accuracy, and this is indeed what we observe. Our goal is to capture the true signals of L1, rather than superficial patterns that are more frequent in the data and artificially boost the performance in NO-ADV settings.
For example, the text ". . . i agree with you on the prolonged war if the plc heartland (poland proper) was not as rich as it was i dont really see how we would been . . . " in the dataset is labeled as "Polish" instead of the gold label "Swedish" by the NO-ADV classifier, likely because of the mention of the term "poland", but the ALT-LO model predicts it correctly since it likely picks up on other features that indicate non-fluency, like "we would been". Such naive classification errors become especially costly in making predictions about people's demographic attributes: ethnicity, which often correlates with L1, but also gender, race, religion, and others (Hardt et al., 2016; Beutel et al., 2017).

L2-Reddit Dataset
Next, we experiment with L2-Reddit, a larger and more challenging dataset (since many speakers in the dataset are highly fluent, and the signal of their native language is weaker). The performance of the simple baselines on this dataset is shown in Table 4. The accuracy of the linear classifier is poor (compared to Table 1), perhaps because it fails to capture some contextual features learned by the neural network models. With LO-TOP-20, the performance on both test sets improves. It slightly degrades when more words are removed, perhaps because some words indicative of L1 are also removed.

Finally, we evaluate the impact of our novel training procedure and the quality of our proposed topical confound identification method. We compare our proposed solution, denoted ALT-LO, with two alternatives, as before: one with a different learning setup (GR-LO) and one with a different confound representation (ALT-LDA). Table 5 summarizes the results: our proposed learning procedure ALT-LO performs better than both alternatives. Unsurprisingly, the model trained with gradient reversal (GR-LO) performs particularly poorly; this was our primary motivation to explore better learning techniques.

To further confirm that the ALT-LO model is not learning topical features, we repeat the experiment presented in Table 1 (masking the top K topical words, based on log-odds scores, from the test sets, without retraining the models), now with our proposed model ALT-LO. Table 6 shows that, in contrast to standard models that do not demote topical confounds (as in Table 1), there is less degradation in the performance of ALT-LO. We conjecture that our model is robust to masking topical words because it learns relevant stylistic features, rather than spurious correlations.
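The test-time masking used in this experiment amounts to a simple token replacement, sketched below. The function name and the mask symbol `<MASK>` are assumptions made for illustration; the set of topical words would be the top-K words by log-odds score for each class.

```python
def mask_topical_words(tokens, topical_words, mask_token="<MASK>"):
    """Replace every token found in topical_words with a special mask token."""
    return [mask_token if tok in topical_words else tok for tok in tokens]
```

Because the model is not retrained, any drop in accuracy on the masked test set reflects how much the trained classifier relied on the masked words.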


Analysis
We present an analysis of what the models are learning, based on the words they attend to for classification. We focus on the L2-Reddit dataset. Following Pryzant et al. (2018), we generated a lexicon of the most attended words by (1) running the model on the test set and saving the attention score for each word; and (2) for each word, computing its average attention score and selecting the top-k words based on this score.
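The two-step lexicon construction can be sketched as follows. This is a minimal illustration which assumes that the attention scores have already been extracted from the model; the function name and the input format (one list of tokens and one parallel list of scores per test example) are hypothetical.

```python
from collections import defaultdict


def attention_lexicon(examples, k):
    """Return the k words with the highest average attention score.

    examples: iterable of (tokens, attention_scores) pairs, where the two
              lists are aligned position by position.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, scores in examples:
        for tok, score in zip(tokens, scores):
            totals[tok] += score
            counts[tok] += 1
    averages = {tok: totals[tok] / counts[tok] for tok in totals}
    return sorted(averages, key=averages.get, reverse=True)[:k]
```

The same routine applies to any other per-word importance score once it has been computed for every token in the test set.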
What emerges from this lexicon (Table 7) is a dramatic difference between the top indicative words in the various models. Whereas in the baseline model all the most indicative words are proper nouns, the ALT-LO model highlights exclusively function words. The proper nouns in the baseline model are all geographical terms directly associated with the L1s reflected in the L2-Reddit dataset: they are easy giveaways of the authors' L1s, but they are meaningless linguistically. In contrast, the function words highlighted in the ALT-LO model are mostly prepositions and determiners; it is well known that nonnative speakers are challenged by the use of prepositions (in any L2, English included). The distribution of determiners is also a challenge for nonnatives, and the correct usage of the in particular is quite hard for learners to master. These challenges are evident from the most indicative words of our model. Observe also that the LO-TOP-50 model is somewhere in the middle: it includes some proper nouns (including geographical terms such as eu or us) but also several function words. A more detailed analysis of these observations is left for future work.
Recently, there has been a debate on whether attention can be used to explain model decisions (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019). We thus present an additional analysis of our proposed method based on saliency maps (Ding et al., 2019). Saliency maps have been shown to better capture word alignment than attention probabilities in neural machine translation. This method is based on computing the gradient of the probability of the predicted label with respect to each word in the input text and normalizing the gradient to obtain probabilities. We use saliency maps to generate lexicons similar to the ones generated using attention. As shown in Table 8, the top indicative words for the baseline and LO-TOP-50 follow a similar pattern to the ones obtained with attention scores. In line with the results in Table 7, salient words for ALT-LO are determiners and prepositions. However, saliency maps also reveal that our proposed approach still attends to some geographical terms that were not demoted by our classifier.

Related Work
Controlling for confounds in text Controlling for confounds is an active field of research, especially in the medical domain, where the common solution is to use randomized trials or propensity score matching (Rosenbaum and Rubin, 1985). Paul (2017) tackled the problem of learning causal associations between word features and class labels using propensity matching for the task of sentiment analysis. This method is not scalable to large text datasets as it involves training a logistic regression model for every word type. Tan et al. (2014) built models to estimate the number of retweets of Twitter messages and addressed confounding factors by matching tweets of the same author and topic. Reis and Culotta (2018) applied Pearl's back-door adjustment (Pearl, 2009), a statistical technique, to text classification. All these works focused on a bag-of-words model with lexical features only.
Adversarial training in text Much recent work focuses on learning textual representations that are invariant to selected properties of the text. This line of work has been used for domain adaptation and transfer learning (Ganin et al., 2016; Tzeng et al., 2014; Xie et al., 2017), to remove sensitive attributes such as demographic information (Li et al., 2018; Elazar and Goldberg, 2018; Beutel et al., 2017), and to understand customer behavior for social science applications (Pryzant et al., 2018). Most of the work in this area, however, focuses on cases where these confounds are known in advance and their values are given along with the training data.
Native language identification The L1ID task was introduced by Koppel et al. (2005), who worked on the International Corpus of Learner English (Granger, 2003). The same experimental setup was adopted by several other authors (Tsur and Rappoport, 2007; Dras, 2009, 2011). Since the release of nonnative TOEFL essays by the Educational Testing Service, the task has gained popularity and this dataset has been used for two L1ID Shared Tasks. Malmasi and Dras (2017) report that the state of the art is a linear classifier with character n-grams and lexical and morphosyntactic features.
The best accuracy under cross-validation on the TOEFL17 dataset, which includes 11 native languages (with a rather diverse distribution of language families), was 85.2%.
The above works all identify the L1 of learners. Identifying the native language of advanced, fluent speakers is a much harder task. Goldin et al. (2018) addressed this task, using the L2-Reddit dataset with as many as 23 different L1s, all of them European and many of which are typologically close, which makes the task even harder. They experimented with a variety of features, using logistic regression as the classifier, and achieved results as high as 69% accuracy with cross-validation; however, when testing their classifier outside the domain it was trained on (Reddit forums focusing on European issues), accuracy dropped to 36%.

Conclusion
We introduced a method to represent unknown confounds in text classification using topic models and log-odds scores, and a new general method with alternating optimization to learn textual representations which are invariant to confounds. We evaluated the proposed solution on the task of native language identification, and showed that it learns to make predictions using stylistic features, rather than focusing on topical information.
The learning procedure we presented is general and applicable to other tasks that require learning invariant representations with respect to some attribute of text (some of which are discussed in §8). In future work, we plan to evaluate our proposed solution on other tasks where topics can be latent confounds, like predicting gender bias (Voigt et al., 2018).

Table 7: The highest scoring words in lexicons generated using attention scores.
NO-ADV: sweden france greece finland poland spain greek germany french eu romania polish dutch german spanish swedish netherlands finnish
LO-TOP-50: eu 's 're 'm ' & uk us because 've am its nt english these usa nt here 'll especially correct pis de within
ALT-LO: the in to of that a i is and 't as from with by ? on but & they are about at because like was would have you

Table 8: The highest scoring words in lexicons generated using saliency maps.
NO-ADV: poland greek romania greece france spain french sweden finland polish dutch spanish netherlands finnish german
LO-TOP-50: on 're even 'd up less things 'll doesn living majority sense talk level 've rights took number north
ALT-LO: the of to i a in greece romania france finland that for is french & you 't finnish