Inflating Topic Relevance with Ideology: A Case Study of Political Ideology Bias in Social Topic Detection Models

We investigate the impact of political ideology biases in training data. Through a set of comparison studies, we examine the propagation of biases in several widely-used NLP models and their effect on the overall retrieval accuracy. Our work highlights the susceptibility of large, complex models to propagating the biases from human-selected input, which may lead to a deterioration of retrieval accuracy, and the importance of controlling for these biases. Finally, to mitigate the bias, we propose learning a text representation that is invariant to political ideology while still judging topic relevance.


Introduction
Due to the extensive reach of its network and the breadth of information enmeshed in it, social media has become an invaluable data source for empirical studies in the social sciences. Yet, identifying all and only relevant information in a vast data stream remains an untamed challenge. While topic detection methods may help researchers extract some relevant text about a topic of interest (e.g., immigration policies), they may miss other equally relevant text while including some irrelevant text. Crucially, because most topic detection methods are trained, they may unintentionally contain or propagate certain biases (e.g., extracting more instances written by women where gender balance is expected), resulting in a skewed data collection that may lead social scientists to draw incorrect conclusions. This paper explores the interactions between social biases and automatic topic detection models, and their impact on the resulting data collection. Our goal is to help social scientists gain insights into biases in text analytics so as to mitigate such biases in their data collections.
More specifically, we examine the role of political ideology biases (liberal-leaning, denoted as Blue, and conservative-leaning, denoted as Red) in the process of collecting data about certain social topics (immigration and gun control) from Twitter. We observe that biases may be introduced at three major junctures of the data collection pipeline. First, bias may be introduced in the data source itself (e.g., certain forums may have strong political leanings), but social scientists typically choose their data intentionally and are aware of pre-existing biases therein (Malik et al., 2015; Kosinski et al., 2015; Cihon and Yasseri, 2016). Second, biases may be introduced in the way in which "topic relevance" is defined. For example, domain experts may be consulted to identify a set of keywords or sample instances that are indicative or representative of the topic of interest. Thus, any unconscious bias on the part of the domain experts would be encoded into these keywords and examples (King et al., 2017), which would then serve as a noisy training corpus for developing a topic classifier. Third, the choice of the computational models for performing relevance classification may amplify or mitigate the impact of the biases.
Through a suite of empirical analyses, this work studies the effect of biased keywords (Blue-leaning, Red-leaning) on downstream training and retrieval: 1) To what extent does a trained classifier propagate the bias seen in the training data? Can it learn to generalize and blunt some of the bias? 2) To what extent do biases in the training corpus degrade the overall retrieval ability of the classifier? Specifically, we generate strongly Blue-leaning and Red-leaning noisy training sets, and we compare the impact of these training sets on three common off-the-shelf models: GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018), and BERT (Devlin et al., 2018). The three models are chosen to span a range of model sizes and representational power. We find that of the three off-the-shelf models, BERT more frequently suffers a significant drop in retrieval quality and propagates more bias when trained on biased data.
We then propose a method to mitigate the bias. That is, we want a classifier that is oblivious to an instance's group affiliation to Blue or Red, yet still performs the main task of judging the instance's relevance to the topic. Our approach adapts Domain-Adversarial Training (Ganin et al., 2016) for the three off-the-shelf models. Experimental results show that the proposed approach mitigates the unintended bias at little or no cost in retrieval accuracy compared to the original models; in fact, the retrieval accuracy of the modified BERT is slightly boosted. The code and data for this project are available.1

Political Ideology Bias on Social Topic Detection
We investigate the impact of political ideology biases on extracting tweets relevant to specific social topics. Unlike gender or racial bias, which have been widely studied in language representation, machine translation and relation extraction (Stanovsky et al., 2019; Gaut et al., 2020; Blodgett and O'Connor, 2017), there is far less work on political ideology biases. Political ideology biases in social topic detection may arise from differences in language use between political ideological groups. Prior studies have compared language use across political ideological groups such as conservatives and liberals in the US. Such differences are reflected in general linguistic patterns such as language complexity (Schoonvelde et al., 2019) and the emotions associated with language (Wojcik et al., 2015). Moreover, even when discussing the same topic, the groups often use different language devices, such as specific types of metaphors, which are associated with their distinct political backgrounds and moral concerns (Dehghani et al., 2011; Lakoff, 1995). These observations indicate that information producers come from diverse political ideological backgrounds, and that the selection of keywords is critical for obtaining balanced and representative data points. Therefore, we first examine the political ideology biases introduced by human-selected keywords.

Data Source: Twitter
We focus on Twitter because it is a widely-used space for people to express their views on social topics. For our study, we rely on prior work (Yang et al., 2017) that collected data from publicly posted tweets using official Twitter APIs during a time-frame close to the 2016 U.S. Presidential election. Two groups of users are identified, Clinton-supporters (Blue) and Trump-supporters (Red), that are likely to have distinct political and ideological preferences. Group membership is defined by exclusive following; i.e., Twitter users who followed only one presidential election candidate but not the other. In their study, Yang et al. (2017) validated that exclusive followers make good proxies for group affiliations. Our final raw corpus for this study consists of over 7 million tweets. More details of our data collection are given in Section 4.1.

Quantifying Bias in Keywords
For any controversial topic, some useful keywords are necessarily going to be biased toward one group or the other. Even taken as a set, the keywords that a human expert comes up with may reflect that expert's bias. It would therefore be useful to quantify the level of bias in the keywords: such a measure could inform experimenters whether they should recruit additional, more diverse experts to expand their keyword set.
Since we have the ground truth for the political group (Red/Blue) of each tweet, we could simply use the ratio between the numbers of tweets containing keyword x in one group versus the other as a metric, but a better choice is a Chi-square test, because it takes the deviation into consideration when estimating the probability. More specifically, we evaluate the bias of each keyword with a two-tailed Pearson Chi-square test (details in Appendix A). The root challenge of this methodology is that, due to the sheer size of the raw corpus, we do not have the full ground truth for topic relevance (i.e., whether each tweet is about gun control or immigration). A direct consequence is that we cannot identify all the high-precision keywords by brute force; thus, our study also relies on human-chosen keywords. Even though keywords chosen by a single expert may be biased, an ensemble of keywords from diverse experts is much less likely to be biased (King et al., 2017). Therefore, we approximate the ground truth for topic relevance by the ensemble of human-chosen keywords when running the Chi-square test on each single keyword. We selected the topics of gun control and immigration from the ProCon website2 because both have engaged enthusiastic political debates, with extremely conflicting stances and opinions from opposing political camps. Keywords are collected from diverse experts who are familiar with or have worked on these social topics in tweet corpora. There are 29 keywords for the immigration topic and 34 keywords for the gun control topic. We assign each keyword to the Blue-leaning, Red-leaning or unbiased (neutral) group by setting the confidence level of the Chi-square test to 99%. Table 1 shows the number of keywords in each group as well as some examples, revealing that most (around 75%) expert-selected keywords actually carry political ideology bias.
Moreover, some keywords are extremely biased, such as "#NoBanNoWall" for the immigration topic and "#NoBillNoBreak" for the gun control topic (refer to Appendix B, which shows the exact Z-test scores for each keyword). Our findings support the hypothesis that different political ideology groups often use language differently, even when talking about the same topic. These observations suggest that a perfectly balanced selection of keywords, or a fully representative set of data points across diverse political ideological camps, may not be achievable in practice. Therefore, there is a pressing need to study how biases propagate through topic detection models when they are trained on biased keywords.
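To make the test concrete, the per-keyword bias check can be sketched as a two-proportion z-test, whose statistic is the signed square root of the 2x2 Pearson Chi-square statistic; the sign then gives the bias direction. The tweet counts below are hypothetical, and at a two-tailed 99% confidence level the critical value is |z| > 2.576.

```python
import math

def keyword_bias(n_blue_kw, n_red_kw, n_blue_total, n_red_total, z_crit=2.576):
    """Two-proportion z-test for one keyword.

    For a 2x2 table (keyword present/absent x Blue/Red), z**2 equals the
    Pearson Chi-square statistic, so the sign of z gives the bias direction.
    z_crit = 2.576 corresponds to a two-tailed 99% confidence level.
    """
    p_blue = n_blue_kw / n_blue_total          # keyword rate among Blue tweets
    p_red = n_red_kw / n_red_total             # keyword rate among Red tweets
    p_pool = (n_blue_kw + n_red_kw) / (n_blue_total + n_red_total)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_blue_total + 1 / n_red_total))
    z = (p_blue - p_red) / se
    if z > z_crit:
        return z, "Blue-leaning"
    if z < -z_crit:
        return z, "Red-leaning"
    return z, "neutral"

# Hypothetical counts: a keyword used far more often by Blue users.
z, label = keyword_bias(900, 100, 10_000, 10_000)
```

With the hypothetical counts above, z is large and positive, so the keyword would be assigned to the Blue-leaning group.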

Bias Propagation through Models
Unlike well-curated and annotated benchmark datasets, raw social media data is sprawling and unorganized. Contributors come from diverse backgrounds, with different racial origins, personalities, education levels, etc.; they may hold many kinds of implicit biases, some of which may not have been identified by the social scientists carrying out the experiment. In this setting, prior work for addressing bias and ethical issues, such as data statements (Bender and Friedman, 2018), may not be applicable. Nonetheless, data sources such as Twitter remain a powerful resource that researchers are willing to tap into. Therefore, it is important to compare how different NLP systems perform on potentially biased training data and to develop approaches for mitigating bias propagation through models.
We consider bias propagation in two dimensions: 1) To what extent does a model trained on biased examples tend to detect more instances with the same bias? 2) How does the learned bias interact with relevance? (Does a biased classifier simply retrieve fewer instances of the other group, or does it actually retrieve less relevant instances for that group?) We also want to determine whether certain types of NLP systems are more likely to propagate the bias. Given that NLP models are built with a diverse set of architectures (transformers, RNNs, etc.) and that their numbers of trainable parameters vary from hundreds to billions, we characterize NLP systems by their context representations and their sizes.
Prior work shows that complex models, such as BERT, do quite well on many NLP applications. Multi-head attention allows BERT to capture complex and fine-grained patterns for the target prediction. On the other hand, big complex models with numerous trainable parameters are more likely to overfit when there is not enough training data (Yin and Shen, 2018). Therefore, it is not obvious what happens when large complex models are trained on biased data: do they succeed in using "real" patterns for the target task (e.g., predicting relevant tweets in our case), or do they exploit the bias seen in the training data (e.g., capitalize on superficial patterns in the biased data) to reach a minimum loss? Our work aims to answer this question by examining three representative NLP models under multiple, differently biased training sets.

Comparison between Different Off-the-shelf NLP Models
Our study covers two state-of-the-art NLP models, representing high-performance approaches, and one simpler model, serving as the benchmark. We compare three topic detection models built with BERT, ELMo, and GloVe, respectively. For predicting relevant tweets of a target topic, we fine-tune the BERT model with just one additional output layer. When building topic detection models with ELMo and GloVe, we add a Bi-LSTM layer after ELMo/GloVe as the text encoder, then feed its output through one output layer for predicting relevance. These three text encoder models have different architectures and sizes, as described below.
BERT (Devlin et al., 2018) A language representation model based on deep bidirectional transformers and pre-trained on a large-scale unlabeled text corpus. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference. The base BERT model has 110M trainable parameters.
ELMo (Peters et al., 2018) A large-scale pre-trained deep contextualized word representation. Contextual word vectors are learned functions of the internal states of a deep bidirectional language model. At the time of release, these representations significantly improved the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. Our topic detection model built with ELMo has 3M trainable parameters.
GloVe (Pennington et al., 2014) A traditional distributed word representation learned with a global log-bilinear regression model over a word-word co-occurrence matrix. It captures fine-grained semantic and syntactic regularities through vector arithmetic. Our topic detection model built with GloVe has 16M trainable parameters (the number of parameters is linearly correlated with the vocabulary size).
We intentionally experiment with these three NLP models in order to answer the question of how model complexity relates to robustness to bias, because they are good representatives of bidirectional transformers, bidirectional LSTMs and single word vectors, respectively. Moreover, their parameter counts differ by orders of magnitude.
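As a rough sketch of the ELMo/GloVe-style models described above, the following PyTorch module stacks an embedding layer (standing in for pre-trained GloVe vectors or contextual ELMo states), a Bi-LSTM encoder, and a single output layer; all names and dimensions here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from torch import nn

class BiLSTMTopicClassifier(nn.Module):
    """Embedding -> Bi-LSTM encoder -> one output layer for relevance."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128, num_classes=2):
        super().__init__()
        # In practice the embedding would be initialized from GloVe vectors
        # (or replaced by contextual ELMo states).
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.output = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)            # (batch, seq, emb_dim)
        _, (h_n, _) = self.encoder(emb)            # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        features = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.output(features)               # (batch, num_classes) logits
```

The BERT counterpart simply replaces the embedding and Bi-LSTM with the pre-trained transformer encoder and fine-tunes the whole stack.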

Proposed Approach for Mitigating Bias Propagation
One promising explanation for bias propagation in ML models is that the inductive bias of gradient descent methods leads to overestimating the importance of moderately-predictive "weak" features when training data is biased and insufficient (Jayakumar et al., 2019). Due to the differences in language use between political ideological groups, topic detection models are biased towards learning frequent spurious correlations in the training data instead of learning true indicators of relevance. For example, if most immigration-relevant tweets in the training data are posted by Blue users while Blue and Red users are evenly distributed among non-relevant tweets, a text classification system may overestimate the importance of Blue users' language features as signals of relevance. This can result in a loss of retrieval accuracy for both Blue and Red tweets, and especially a low recall for Red tweets at test time.
Bias propagation in ML models is closely related to domain adaptation, in which training and test data come from similar but different distributions. Ben-David (2007) suggests that a good representation for cross-domain transfer is one from which an algorithm cannot learn to identify the domain of origin of the input observation. We expect something similar here: an ideal representation of tweets should be invariant to group affiliation (Blue/Red) while remaining discriminative for topic relevance. With this in mind, we propose an approach inspired by a domain adaptation technique, Domain-Adversarial Neural Networks (Ganin et al., 2016), as a way to mitigate bias propagation. Prior work uses adversarial feature learning for demoting latent confounds in native language identification (Kumar et al., 2019), for interpreting computational social science with deconfounded lexicon induction (Pryzant et al., 2018), and for preserving privacy by removing demographic attributes (Li et al., 2018). Xie (2017) demonstrates the effectiveness of adversarial feature learning for fair classification. However, the datasets they used (for predicting the savings, credit ratings and health conditions of individuals) contain no natural language text input. Our work focuses on state-of-the-art NLP models (such as BERT) and applies adversarial feature learning to mitigate political ideology bias.
Our goal is to train a classifier that learns to accurately predict relevance while ignoring the superficial, politically-biased patterns present in the training set. The architecture of our proposed model is shown in Figure 1. The input tweet x first goes through a text encoder e(x; θ_e) to obtain a feature vector f_x as the text representation. The encoder can be BERT, ELMo+BiLSTM or GloVe+BiLSTM, exactly the same as described in Section 3.1. The feature vector f_x is then fed into two one-layer feed-forward neural networks: 1) r(f_x; θ_r) (FF in orange in Figure 1) for predicting whether x is relevant or not; 2) g(f_x; θ_g) (FF in yellow in Figure 1) for predicting whether x is posted by a Blue or Red user. As gradients back-propagate from the group prediction head g to the encoder, we pass them through a gradient reversal layer (Ganin et al., 2016), which multiplies gradients by −1. If the cumulative loss of the relevance prediction is L_r and that of the group classification is L_g, then the loss implicitly used to train the encoder is L_e = L_r − αL_g (with the group loss weighted by α), thereby encouraging the encoder to learn representations of the text that are not useful for predicting the political group. We use cross-entropy loss to compute L_r and L_g. In this work we use the three off-the-shelf NLP models as text encoders to compare directly with Section 3.1, but the proposed approach is applicable to other encoder models as well.
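The two-head architecture with a gradient reversal layer can be sketched in PyTorch as follows; the encoder is a placeholder for BERT or an ELMo/GloVe Bi-LSTM, and the class and dimension names are illustrative assumptions.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -alpha backward."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient for x; no gradient for the alpha scalar.
        return grad_output.neg() * ctx.alpha, None

def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)

class AdversarialTopicModel(nn.Module):
    """Shared encoder, relevance head, and group head behind a reversal layer.

    Minimizing L_r + L_g over the whole network implicitly trains the
    encoder on L_e = L_r - alpha * L_g because of the reversed gradients.
    """

    def __init__(self, encoder, feat_dim, alpha=1.0):
        super().__init__()
        self.encoder = encoder
        self.alpha = alpha
        self.relevance_head = nn.Linear(feat_dim, 2)  # relevant vs. not
        self.group_head = nn.Linear(feat_dim, 2)      # Blue vs. Red

    def forward(self, x):
        f = self.encoder(x)
        rel_logits = self.relevance_head(f)
        grp_logits = self.group_head(grad_reverse(f, self.alpha))
        return rel_logits, grp_logits
```

Training then applies ordinary cross-entropy to both heads and back-propagates once; the reversal layer alone turns the group head into an adversary of the encoder.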

Experiments
To address the central questions raised in this work (how biases are propagated in several widely-used NLP models and how they affect the overall retrieval accuracy), we conduct experiments to quantify the impact of biased training. To do so, we need to generate training sets for which we can measure the degree of bias along the Blue-Red spectrum. We also need metrics for determining the quality and bias of the retrieved tweets. With this evaluation framework, we compare how different NLP models perform under multiple, differently biased training sets on two social topics, immigration and gun control. We then evaluate the effectiveness of our proposed bias-mitigating approach under the same setting.

Experimental Setup
Data: In this study, we build on data acquired from prior work by Yang et al. (2017). They collected over 7 million tweets posted by the exclusive followers of Trump and Clinton within a nine-month period (between June 2016 and February 2017). We pre-process this tweet corpus by removing emojis, website links and usernames. Then we split it into training and test sets with a 9:1 ratio. Topic detection models are trained and validated on the training corpus, and retrieval quality and retrieval bias are evaluated on the test set.
Training Set Settings: For each topic we collect a set of keywords (referred to as K_total) from experts and compute their bias scores (Section 2.2). Keywords that do not pass the Chi-square test at the 99% confidence level are considered biased. If the z-score of a biased keyword is positive, it is biased towards Blue; otherwise, it is biased towards Red. We refer to the set of Blue-leaning keywords as K_blue and the set of Red-leaning keywords as K_red. For the keywords in K_blue (respectively K_red), we extract all tweets containing any of them from the training corpus as "relevant" examples; we then randomly select an equal number of tweets that do not contain any keyword as "irrelevant" examples. In this way we construct a noisy training dataset biased towards Blue (respectively Red). Similarly, we construct a full training dataset with K_total. We report the number of relevant tweets found by this keyword approach in the training and test sets in Table 2. Notice that for the immigration topic, the Blue-leaning training set is more than twice the size of the Red-leaning training set.
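The training-set construction described above can be sketched as follows; the tweets, keywords, and helper names are hypothetical, and matching here is a simple lowercase substring search rather than the paper's actual extraction pipeline.

```python
import random

def build_noisy_training_set(tweets, leaning_keywords, all_keywords, seed=0):
    """Label tweets containing a leaning's keywords (e.g. K_blue) as relevant (1)
    and sample an equal number of keyword-free tweets as irrelevant (0)."""
    def contains_any(text, keywords):
        lowered = text.lower()
        return any(kw.lower() in lowered for kw in keywords)

    relevant = [t for t in tweets if contains_any(t, leaning_keywords)]
    # Negatives must contain no keyword from the full expert set K_total.
    pool = [t for t in tweets if not contains_any(t, all_keywords)]
    random.Random(seed).shuffle(pool)
    irrelevant = pool[:len(relevant)]
    return [(t, 1) for t in relevant] + [(t, 0) for t in irrelevant]
```

Running the same routine with K_blue, K_red, or K_total as the leaning set yields the Blue-leaning, Red-leaning, and full training sets, respectively.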
Model Settings: We use the publicly available versions of BERT (bert-base-uncased3), ELMo (original4) and GloVe (Common Crawl 840B 300d5) with the recommended parameter settings. The loss weight α for the adversarial training gradually increases from 0 to 1 as the number of training batches grows. When training our bias-mitigating approach, we keep most hyper-parameters the same as those of the corresponding original model, e.g., learning rate, number of epochs, batch size, maximum sequence length, etc. For ELMo+ADV we tune the dropout parameter and report the variant without RNN input dropout in Section 4.3.
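The ramp-up of the adversarial loss weight α can be implemented with a simple schedule. The paper only states that α grows from 0 to 1 over training batches, so the linear ramp below is an assumption, with the smooth schedule from Ganin et al. (2016) shown as an alternative.

```python
import math

def alpha_linear(step, total_steps):
    """Linear ramp of the adversarial loss weight from 0 to 1."""
    return min(1.0, step / max(1, total_steps - 1))

def alpha_ganin(step, total_steps, gamma=10.0):
    """Smooth schedule from Ganin et al. (2016): 2 / (1 + exp(-gamma*p)) - 1,
    where p is the training progress in [0, 1]."""
    p = step / max(1, total_steps - 1)
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0
```

Both schedules start the adversary gently, letting the encoder learn relevance features before the group head exerts full pressure.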
Evaluation Metrics: While our data contains the ground truth for whether an instance belongs to Blue or Red, we do not know a priori whether it is relevant to a topic. To determine model performance on the dimension of relevance, we use precision at N (P@N), which computes the precision score of the instances with the top N prediction scores. Relevance is judged by crowdsourcing workers via Amazon Mechanical Turk. To evaluate retrieval quality, we need to choose a reasonable N for P@N: if N is too large or too small, all models will have a very low or very high precision score, making comparisons between them uninformative. We find that the number of relevant tweets extracted by K_total from the test set is a good candidate for N. It is not too large, because we expect there to be at least this many relevant tweets in the test set; and it is not too small, since the keyword sets K_red and K_blue used for generating the training sets are subsets of K_total. As Table 2 shows, there are 3007 and 1049 tweets containing keywords from K_total for immigration and gun control, respectively. Therefore, we use P@3000 and P@1000 (rounding 3007 and 1049 to the nearest hundred) as the evaluation metrics for immigration and gun control. To reduce annotation cost, we randomly select 100 samples from the top 3000 or 1000 for human annotation, which is sufficient to represent model performance. To determine the level of bias in a model's predictions, we compute the Blue-versus-Red Log Odds Ratio, which measures how likely a retrieved instance (one of the top N) is to come from Blue rather than Red. The first odds is the ratio of Blue to Red instances in the top N; the second odds is the same ratio among the non-retrieved instances.
Table 3: Retrieval accuracy (Immigration P@3000, Gun control P@1000) of different topic detection models trained on Blue-leaning or Red-leaning training sets. Columns "All", "Blue" and "Red" respectively show the accuracy for all retrieved tweets, retrieved tweets posted by Blue users and retrieved tweets posted by Red users. Within each "All" column, the best model is bolded; if a model's performance is over 10% worse than the best one, it is marked in red.
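The Blue-versus-Red Log Odds Ratio described above can be computed directly from the four group counts; the counts in the example below are hypothetical.

```python
import math

def blue_red_log_odds_ratio(blue_top, red_top, blue_rest, red_rest):
    """Log of (Blue:Red odds among the retrieved top-N tweets) over
    (Blue:Red odds among the non-retrieved tweets).
    0 means no bias; positive leans Blue, negative leans Red."""
    odds_retrieved = blue_top / red_top
    odds_rest = blue_rest / red_rest
    return math.log(odds_retrieved / odds_rest)

# Hypothetical counts: Blue tweets are twice as common in the retrieved set
# as in the rest of the corpus, giving a positive (Blue-leaning) ratio.
lor = blue_red_log_odds_ratio(200, 100, 500, 500)
```

A ratio near 0 indicates that retrieval did not amplify either group relative to the background distribution.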
Annotation Process: The ground-truth relevance of retrieved tweets is annotated by crowdsourcing workers via Amazon Mechanical Turk. Selected workers are well trained and carefully evaluated with qualification tests to make sure they know the coverage of each social topic (refer to Appendix C); for example, immigration covers a broad set of sub-topics ranging from specific policies (e.g., DACA) and border security to birthright citizenship and the labor market. We add gold standard instances to each annotation batch to monitor annotation quality. In addition, each instance is annotated by two annotators, and a third person is involved if they disagree. The average accuracy and inter-annotator agreement of our annotations are both above 90%.

Results of Comparing Off-the-shelf NLP Models
The retrieval quality of the different topic detection models trained on biased training sets is shown in Table 3. Models built with GloVe, ELMo and BERT are trained on the Blue-leaning and Red-leaning training sets. In addition to the three off-the-shelf models, we include a naive keyword approach as a baseline, which only retrieves tweets containing training keywords. We first look at the "All" columns, which show the accuracy over all retrieved tweets. For both topics, it is not surprising that the trained NLP models generally outperform the keyword-extraction baseline. This means that the models are able to learn patterns beyond the keywords and generalize to tweets that do not contain any keyword. To ease comparison between models, the best model within each column is bolded; if a model's performance is over 10% worse than the best one, it is marked in red. Our experimental results show that the ELMo-based model has the best overall retrieval quality, while the BERT-based model is the most negatively affected by the training bias. Next, we compare the models' accuracy on retrieved tweets posted by Blue users and by Red users in the "Blue" and "Red" columns. In general, models have a better retrieval accuracy for tweets from the group towards which the training set is biased (as the second, third and fourth column blocks show), with the exception of the first block (when trained on the Blue-leaning set for the immigration topic, models have a better retrieval accuracy for Red tweets than for Blue). We also report the retrieval accuracy for models trained on the full training set (constructed from K_total) in Table 4. The performances of the different models are close to each other.
Next, we evaluate to what extent the political bias is propagated to the retrieved (predicted top N) tweets by the different NLP models. The Blue-versus-Red Log Odds Ratios of the topic detection models trained on Blue-leaning or Red-leaning training sets are shown in Table 5. We use the bias in each training set as the baseline. The closer the Log Odds Ratio is to 0, the less the political bias is propagated; positive values lean Blue and negative values lean Red. Experimental results show that for both topics, all NLP models mitigate some of the political ideology bias in the training data. In particular, for models trained on the Blue-leaning set of the gun control topic, where the training set is highly biased towards Blue with a Log Odds Ratio of 2.08, the NLP models produce retrieved sets with around 61% less bias. To ease comparison between models, the best model within each column is bolded. We find that the ELMo and GloVe-based models propagate the least bias, while the BERT-based model propagates the most of the bias seen in the training data. Taking both retrieval accuracy and retrieval bias into consideration, we conclude that the ELMo-based model is the most robust to training bias, while the BERT-based model is the most negatively affected by it. Our findings suggest that practitioners should choose the more robust model when training data is biased. Moreover, it is important to develop new approaches for mitigating the impact of bias, especially for BERT-based models.

Results of the Bias Mitigating Approach
We compare our bias-mitigating approach with the original models on both retrieval accuracy and retrieval bias metrics. We report its performance in a more realistic training scenario, where models are trained on the full training data, as well as in the extremely biased cases, where models are trained on our generated strongly Blue-leaning or Red-leaning training sets. The full training dataset is slightly biased towards Blue, with a Blue-versus-Red Log Odds Ratio of 0.51 for the immigration topic and 0.61 for the gun control topic. Table 6 shows that our proposed BERT-based model (denoted BERT+ADV) improves the retrieval accuracy compared with the original one; for ELMo+ADV and GloVe+ADV the retrieval accuracy is slightly reduced in most cases. Table 7 shows that our proposed models are very effective at mitigating bias propagation. In sum, the experimental results demonstrate that our proposed approach succeeds in mitigating bias propagation with little or no drop in retrieval accuracy. It works especially well for BERT-based models, mitigating the bias while also increasing the retrieval accuracy.

Conclusion
We have studied the impact of political ideology biases in different types of topic detection models and demonstrated a domain adaptation approach as an effective way of mitigating the bias. Our experimental results suggest that an ELMo-based model is more robust to training bias, while a BERT-based model is more negatively affected by the training bias. Since the ELMo-based model has nearly 40 times fewer trainable parameters than BERT, we conjecture that big complex models are more likely to propagate the bias seen in the training set. Although we have found the proposed adaptation architecture to be helpful for the three models, especially BERT, in terms of mitigating some of the training bias, the approach still relies on some knowledge of the existence of the bias. This work offers a comparison point for future studies to evaluate the effect of bias in various predictive models and opens the door for further reducing the bias in topic detection applications.

Appendix B Political ideology bias of each keyword
The Z-test scores for each keyword are respectively shown in Table 8 for immigration and gun control.