Adaptive Ensembling: Unsupervised Domain Adaptation for Political Document Analysis

Insightful findings in political science often require researchers to analyze documents of a certain subject or type, yet these documents are usually contained in large corpora that do not distinguish between pertinent and non-pertinent documents. In contrast, we can find corpora that label relevant documents but have limitations (e.g., from a single source or era), preventing their use for political science research. To bridge this gap, we present adaptive ensembling, an unsupervised domain adaptation framework, equipped with a novel text classification model and time-aware training to ensure our methods work well with diachronic corpora. Experiments on an expert-annotated dataset show that our framework outperforms strong benchmarks. Further analysis indicates that our methods are more stable, learn better representations, and extract cleaner corpora for fine-grained analysis.


Introduction
Recent progress in natural language processing and computational social science has pushed political science research into new frontiers. For example, scholars have studied language use in presidential elections (Acree et al., 2018), legislative text in Congress (de Marchi et al., 2018), and similarities in national constitutions (Elkins and Shaffer, 2019). However, datasets used by political scientists are mostly homogeneous in terms of subject (e.g., immigration) or document type (e.g., constitutions). Labeled corpora with pertinent documents usually stem from a single source; this makes it difficult to generalize conclusions derived from them to other sources. On the other hand, corpora spanning multiple decades and sources tend to be unlabeled. These corpora are largely untouched by political scientists; to illustrate some problems that arise with studying such data, Table 1 shows a sample of topics generated by Latent Dirichlet Allocation (LDA) (Blei et al., 2003), a popular topic model in social science, trained on 60,000 documents sampled from the Corpus of Historical American English (COHA) (Davies, 2008). The generated topics are extremely vague and not specific to politics. This paper bridges the gap between labeled and unlabeled corpora by framing the problem as one of domain adaptation.

Table 1: Randomly sampled topics and top keywords derived from a 50-topic LDA model trained on a sample of COHA documents. Topic modeling results on a political subset of COHA are presented in Table 5; topic model hyperparameters are detailed in Appendix A.

  Topic 1: like, day, would, a.m., center
  Topic 2: two, samour, family, veronica, son
  Topic 3: would, hospital, also, car, hyundai
  Topic 4: said, people, one, years, think
  Topic 5: city, 6-4, last, wine, york
We develop adaptive ensembling, an unsupervised domain adaptation framework that learns from a single-source, labeled corpus (the source domain) and utilizes these representations effectively to obtain labels for a multi-source, unlabeled corpus (the target domain). Our method draws upon consistency regularization, a popular technique that stabilizes model predictions under input or weight perturbations (Athiwaratkun et al., 2019). At the framework-level, we introduce an adaptive, feature-specific approach to optimization; at the model-level, we develop a novel text classification model that works well with our framework. To better handle the diachronic nature of our corpora, we also incorporate time-aware training and representations.
Our experiments use the New York Times Annotated Corpus (NYT) (Sandhaus, 2008) as our source domain corpus and COHA as our target domain corpus. Concretely, we construct two classification tasks: a binary task to determine whether a document is political or non-political; and a multi-label task to categorize a document under three major areas of political science in the US: American Government, Political Economy, and International Relations (Goodin, 2009). We subsequently introduce an expert-labeled test set from COHA to evaluate our methods.
Our framework, equipped with our best model, significantly outperforms existing domain adaptation algorithms on our tasks.
In particular, adaptive ensembling achieves gains of 11.4 and 10.1 macro-averaged F1 on the binary and multi-label tasks, respectively.
Qualitatively, adaptive ensembling conditions the optimization process, learns smoother latent representations, and yields precise but diverse topics, as demonstrated by LDA on an extracted political subcorpus of COHA. We release our code and datasets at http://github.com/shreydesai/adaptive-ensembling.

Motivation from Political Science
Quantitative studies of American public opinion over time have mostly been restricted to surveys such as the American National Election Survey (Baldassarri and Gelman, 2008; Campbell et al., 1980). However, surveys often do not pose well-formed questions, reflect true voter opinion, or capture mass public opinion (Zaller et al., 1992; Bishop, 2004). Therefore, researchers often seek to compare survey findings with those of mass media, as the relationship between public opinion and the media has been widely established (Baum and Potter, 2008; McCombs, 2018). Press media, one form of mass media, manifests itself in large, diachronic collections of newspaper articles; such corpora provide a promising avenue for studying public opinion and testing theories, provided scholars can be confident that the measures they obtain over time are substantively invariant (Davidov et al., 2014). However, as alluded to earlier, such diachronic corpora are often unlabeled; political scientists cannot draw conclusions from these corpora in their raw form as they are unable to distinguish between political and non-political articles. We frame this problem as an exchange between two domains: a source, labeled corpus with modern articles (NYT) and a target, unlabeled corpus with decades of articles originating from a multitude of news sources (COHA). Using domain adaptation methods, we can extract a political subcorpus from COHA that is amenable to the study of public opinion over time.

Unsupervised Domain Adaptation
In this section, we detail the core concepts behind our unsupervised domain adaptation framework. We describe the problem setup ( §3.1), an overview of self-ensembling and consistency regularization ( §3.2- §3.4), and our novel contributions to this framework ( §3.5- §3.6).

Problem Setup
Let X and Y denote the input and output spaces, respectively. We have access to labeled samples {(x_s^(i), y_s^(i))} drawn from the source domain and unlabeled samples {x_t^(j)} drawn from the target domain. The goal of unsupervised domain adaptation is to learn a function f : X → Y that maximizes the likelihood of the target domain samples while only leveraging supervision from the source domain samples. We also assume the existence of a small amount of labeled target domain samples in order to create a development set, following existing work in unsupervised domain adaptation (Glorot et al., 2011; Chen et al., 2012; French et al., 2018; Zhang et al., 2017).

Self-Ensembling
Our unsupervised domain adaptation framework builds on top of self-ensembling (Laine and Aila, 2017), a semi-supervised learning algorithm based on consistency regularization, whereby models are trained to be robust against injected noise (Athiwaratkun et al., 2019).
Self-ensembling is an interplay between two neural networks: a student network f(x; θ) and a teacher network f(x; φ). The inputs to both networks are perturbed separately, and the objective is to measure the consistency of the student network's predictions against the teacher's. Both networks share the same base model architecture and initial parameter values, but follow different training paradigms (Laine and Aila, 2017). In particular, the student network is updated via backpropagation, then the teacher network is updated with an exponential average of the student network's parameters (Tarvainen and Valpola, 2017). The networks are trained in an alternating fashion until they converge. During test time, the teacher network is used to infer the labels for target domain samples. Figure 1 visualizes the overall training procedure. Further intuition behind self-ensembling is available in Appendix B.

Student Training
The student network uses labeled samples from the source domain and unlabeled samples from the target domain to learn domain-invariant features. This is realized with multiple loss functions, each with its own objective. The supervised loss is simply the cross-entropy loss of the student network's outputs on source domain samples:

L_sup = −E_{(x,y)∼source} [ log f_y(x; θ) ]

However, the supervised loss alone prevents the student network from learning anything useful about the target domain. To address this, Laine and Aila (2017) introduce an unsupervised loss to ensure that the student and teacher networks have similar predictions for target domain samples. French et al. (2018) only enforce the consistency constraint for target domain samples, but we propose using both source and target domain samples with separately perturbed inputs x′ and x″; this provides a balanced source of supervision to train our adaptive constants, discussed in §3.5:

L_unsup = E_x [ || f(x′; θ) − f(x″; φ) ||² ]

The overall objective is a combination of the two loss functions:

L = L_sup + λ L_unsup
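As a minimal numpy sketch of these objectives (assuming mean-squared error as the consistency measure and a scalar weight on the unsupervised term, both our illustrative assumptions):

```python
import numpy as np

def cross_entropy(probs, labels):
    # Supervised loss: mean negative log-likelihood of the gold labels
    # under the student's predicted class distributions.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def consistency(student_probs, teacher_probs):
    # Unsupervised loss: mean squared difference between the student's
    # and teacher's class distributions for the same inputs under
    # separate perturbations.
    return np.mean((student_probs - teacher_probs) ** 2)

def total_loss(src_probs, src_labels, student_probs, teacher_probs, weight=1.0):
    # Overall objective: supervised loss on labeled source samples plus
    # a weighted consistency term over source and target samples.
    return cross_entropy(src_probs, src_labels) + weight * consistency(student_probs, teacher_probs)
```

In practice the class distributions would come from two forward passes (student and teacher) over separately perturbed copies of the same mini-batch.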

Fixed Ensembling
The teacher network's parameters form an ensemble of the student network's parameters over the course of training:

φ_t = α φ_{t−1} + (1 − α) θ_t

where α is a smoothing factor that controls the magnitude of the parameter updates. Since the labels for the target domain samples are inherently unknown, ensembling parameters in the presence of noise helps the teacher network's predictions converge to the true label (Tarvainen and Valpola, 2017).
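A minimal sketch of this exponential-moving-average update over the teacher's parameters:

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    # Fixed ensembling: every teacher parameter is an exponential
    # moving average of the corresponding student parameter, smoothed
    # by a single fixed constant alpha.
    return alpha * teacher + (1.0 - alpha) * student
```

Applied once per training step, this pulls the teacher slowly toward the student, so the teacher acts as an ensemble of recent student snapshots.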
Limitations Empirically, we find that the highly unstable loss surface presented by textual datasets causes large instabilities in the optimization process. One of the key insights of this paper is that these instabilities are due to the dynamics of the unsupervised loss. Because the unsupervised loss effectively regularizes the source domain representations to work well in the target domain (Laine and Aila, 2017), performance degrades rapidly if this loss fails to converge. This is a strong indicator that self-ensembling fails to learn useful, shared representations for knowledge transfer between textual domains. Qualitative evidence of the unsupervised loss' instability is shown in Figure 6a and further discussed in §7.

Adaptive Ensembling
We hypothesize that smoothing with a fixed hyperparameter α is responsible for said instabilities. For any given weight matrix (or bias vector), each hidden unit can be conceptualized as controlling one highly specific feature or attribute (Bau et al., 2019). These units may need to be updated to varying degrees throughout the course of training; therefore, smoothing each unit with a fixed constant severely overlooks dynamics at the parameter level. We propose modifying fixed ensembling by introducing trainable smoothing constants for each unit, hereafter termed adaptive constants, as opposed to using a fixed smoothing constant:

φ_t = C ⊙ φ_{t−1} + (1 − C) ⊙ θ_t

where a matrix of adaptive constants C is applied element-wise to φ and θ at each step.

Example Assume we are training an arbitrary weight matrix W ∈ R^{m×n} in the kth layer of a fixed network architecture. Both the student and teacher network have their own copy of W, denoted as W_STU and W_TEA, respectively. To ensure each parameter W_ij has a corresponding adaptive constant α_ij, C shares the same dimensionality as W_STU and W_TEA. The previous equation can then be written as:

W_TEA ← C ⊙ W_TEA + (1 − C) ⊙ W_STU

Supervision Because the adaptive constants are designed to stabilize training, it is a natural fit to train them using the unsupervised loss:

C ← C − η ∂L_unsup/∂C

This forms a crucial difference between self-ensembling and adaptive ensembling: in the former method, the teacher network has no say in how its parameters are modified. Adaptive ensembling equips the teacher network with fine-grained control over gradient updates, making it far easier to align activations under a noisy setting.
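A minimal sketch of the adaptive update, where every parameter gets its own smoothing constant. Squashing trainable logits through a sigmoid to keep each constant in (0, 1) is our illustrative assumption, not a detail fixed by the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_update(teacher, student, c_logits):
    # C shares the shape of the weight matrix, so each entry W_ij gets
    # its own smoothing constant alpha_ij. The logits c_logits would be
    # trained by backpropagating the unsupervised (consistency) loss.
    C = sigmoid(c_logits)
    return C * teacher + (1.0 - C) * student
```

With large positive logits a unit's teacher parameter barely moves; with large negative logits it tracks the student closely, giving the teacher per-unit control over how much it absorbs from the student.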

Temporal Curriculum
Diachronic datasets important in political science can be difficult to adapt to given the minimal vocabulary overlap between the source and target domain documents. Source and target articles mention named entities and events that, for the most part, do not appear across both datasets. To ease the difficulty of domain adaptation, we exploit the temporal information in our datasets to introduce a curriculum (Bengio et al., 2009).
In particular, each article comes with metadata that includes the year in which the article was published. Figure 2 shows that COHA articles written closer to the time of NYT articles have a larger vocabulary overlap than those written in the distant past. Intuitively, it is easier to learn features from target domain samples that are more like the source domain samples. Hence, we sort the target domain mini-batches by year; the learning task becomes progressively harder as opposed to confusing the models during the early stages of training.
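The batching scheme above can be sketched as follows, assuming each document carries a `year` metadata field and that the source corpus is modern (so more recent target documents come first):

```python
def temporal_curriculum(examples, batch_size):
    # Sort unlabeled target documents so that training starts with the
    # years closest to the (modern) source domain and moves backward in
    # time, making the consistency task progressively harder.
    ordered = sorted(examples, key=lambda ex: ex["year"], reverse=True)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```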

Model
In this section, we introduce a new convolutional neural network (CNN) as the plug-in model for our unsupervised domain adaptation framework. We motivate the use of CNNs ( §4.1), formalize the model input ( §4.2), and introduce several novel components for our task ( §4.3).

Motivation
CNNs have emerged as strong baselines for text classification in NLP (Kim, 2014). CNNs are desirable candidates for our framework as they exhibit a high degree of parameter sharing, significantly reducing the number of parameters to train. In addition, they can be designed to solely optimize the log-likelihood of the training data. Experimentally, we find that models that optimize other distributions (e.g., attention distributions in Transformers (Vaswani et al., 2017) or Hierarchical Attention Networks (Yang et al., 2016)) do not work well with this framework.

Model Input
Given a discrete input x = [w_1, · · · , w_n] and vocabulary V, an embedding matrix E ∈ R^{|V|×d} replaces each word w_i with its respective d-dimensional embedding. The resulting embeddings are stacked row-wise to obtain an input matrix X ∈ R^{n×d}. Following the notion of input perturbation used in consistency regularization algorithms (Athiwaratkun et al., 2019), we design several methods to inject noise into the input layer. Each input is perturbed with additive, isotropic Gaussian noise: X̃ = X + ε, where ε ∼ N(0, I). Then, we apply dropout on the perturbed inputs to eliminate dependencies on any one word: X′ = X̃ ⊙ M, where M ∈ R^{n×d} is a Bernoulli mask applied element-wise to the input matrix.
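A minimal sketch of the two input perturbations, with unit-variance noise and a 0.5 drop rate as illustrative defaults:

```python
import numpy as np

def perturb(X, drop_rate=0.5, rng=None):
    # Additive isotropic Gaussian noise followed by an element-wise
    # Bernoulli (dropout) mask, matching the two perturbations above.
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = X + rng.normal(0.0, 1.0, size=X.shape)
    mask = (rng.random(X.shape) >= drop_rate).astype(X.dtype)
    return noisy * mask
```

The student and teacher would each call this with their own random stream, yielding the separately perturbed copies used by the consistency loss.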

Model Architecture
Background: 1D Convolutions CNNs for text classification generally use 2D convolutions over the input matrix (Kim, 2014), but architectures using 1D convolutions have also been explored in other contexts, e.g., sequence modeling (Bai et al., 2018), machine translation (Kalchbrenner et al., 2016), and text generation (Yang et al., 2017). Our model draws upon the latter approach for political document classification. CNNs utilizing 1D convolutions are typically autoregressive in nature; that is, each output y_t only depends on the inputs x_{<t} to avoid information leakage into the future. Two approaches have been proposed to achieve this: history-padding (Bai et al., 2018, 2019) and masked convolutions (Kalchbrenner et al., 2016). Further, each successive convolution uses an exponentially increasing dilation factor, reducing the depth of the network significantly. Below, we elaborate on the components of our model:

Sequence Squeezing Given a model with ℓ layers, previous approaches (Bai et al., 2018, 2019) history-pad the input with Σ_{i=1}^{ℓ} d^{(i−1)}(f − 1) zeros to obtain an output of length n, where d is the dilation factor and f is the filter size. However, we propose history-padding the input with (Σ_{i=1}^{ℓ} d^{(i−1)}(f − 1)) − n + 1 zeros to ensure the convolutions compress the sequence down to one output unit. Formally, this produces an output feature map of dimension B × C × 1, where B is the batch size and C is the number of channels; one can use a simple squeeze() operation to obtain the compact feature matrix B × C. Though this is a subtle difference, our approach yields much richer representations for classification.
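The padding arithmetic can be checked with a short sketch. Each unpadded causal convolution at layer i shortens the sequence by (f − 1)·d^(i−1) positions, so:

```python
def total_shrinkage(num_layers, filter_size, dilation_base=2):
    # Total positions removed by a stack of unpadded causal
    # convolutions with dilations d**(i - 1) for layers i = 1..L.
    return sum((filter_size - 1) * dilation_base ** (i - 1)
               for i in range(1, num_layers + 1))

def squeeze_padding(n, num_layers, filter_size, dilation_base=2):
    # Zeros to prepend so the stack compresses a length-n input down to
    # a single output position (the formula in the text).
    return total_shrinkage(num_layers, filter_size, dilation_base) - n + 1
```

For example, with 8 layers, f = 3, and dilations 2^(i−1), a length-200 input needs 311 history-padded zeros: the padded length of 511 shrinks by 510, leaving exactly one output position.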

State Connections
In each layer i, a kernel W_i convolves across an intermediate sequence, inducing a feature map A_i. Because the input is presented as a sequence, the application of W_i along a one-dimensional axis encourages A_i to encode temporal features, similar to how the hidden state is formed by applying shared weights across a sequence in recurrent architectures. Further, because the receptive field grows exponentially, the convolutions build hierarchical representations of the input, implying A_{i+1} builds a more abstract representation of the input than A_i. We exploit this stateful information by pooling each activation map A_i into a vector and concatenating them row-wise to create a state matrix:

S = [pool(A_1); pool(A_2); · · · ; pool(A_ℓ)]

To the best of our knowledge, our paper is the first to explicitly use the temporal state embedded in causal 1D convolution activations as representations for an end task.
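The state matrix construction can be sketched as follows, using average pooling over the time axis (the pooling choice used in our settings):

```python
import numpy as np

def state_matrix(activation_maps):
    # Pool each layer's feature map (channels x time) into a single
    # vector over the time axis, then stack the vectors row-wise into
    # the state matrix S. Time lengths may differ across layers; the
    # channel count must match.
    return np.stack([A.mean(axis=-1) for A in activation_maps], axis=0)
```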
Time Embedding To make our model time aware, we learn representations for the years of the documents (available as metadata in COHA). Such time representations allow the model to reason about content as it appears in different decades.
Given a year y (e.g., 1954), we normalize it to the closed unit interval [0, 1] and linearly transform it into a low-dimensional embedding e:

e = W_e · (y − min_y) / (max_y − min_y) + b_e

where max_y and min_y represent the maximum and minimum observed years in the training dataset, respectively.
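A minimal sketch of the time embedding, where the year bounds and parameter values are hypothetical:

```python
import numpy as np

def time_embedding(year, min_year, max_year, W_e, b_e):
    # Normalize the year to [0, 1], then linearly project the scalar to
    # a low-dimensional learned embedding. W_e and b_e are vectors of
    # the embedding dimensionality (10 in our settings).
    t = (year - min_year) / (max_year - min_year)
    return W_e * t + b_e
```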
Overall Architecture We concatenate the various components of our model, [X′; S; e], to create a collective representation for classification, and use a 1D convolution (f = 1 and d = 1) to project this representation to k classes. We did not observe any performance advantages from using a fully-connected layer to perform the projection, so we opt for a fully-convolutional architecture to minimize the number of parameters (Long et al., 2015). Finally, we apply softmax to the output vector y ∈ R^k to obtain a valid probability distribution over the classes. An example of our model architecture is depicted in Figure 3.

Datasets
We present a dataset for identifying political documents with manual annotation from political science graduate students. The dataset is constructed for binary and multi-label tasks: (1) identifying whether a document is political (i.e. containing notable political content) and (2) if so, the area(s) among three major political science subfields in the US: American Government, Political Economy, and International Relations (Goodin, 2009).
Source We use NYT as the source dataset as it contains fine-grained descriptors of article content. We sample 4,800 articles with the descriptor US POLITICS & GOVERNMENT. To obtain non-political articles, we sample 4,800 documents whose descriptors do not overlap with an exhaustive list of political descriptors identified by a political science graduate student. For our multi-label task, the annotator grouped descriptors in NYT that belong to each area label we consider (these descriptors are available in Appendix C).
Target Our target data are historical documents from COHA, which contains a large collection of news articles since the 1800s. To ensure our dataset is useful for diachronic analysis (e.g., public opinion over time), we sample only from news sources that consistently appear across the decades. Further, we ensure there are at least 8,000 total documents in each decade group; this narrows our time span to seven decades. From this subset, we sample ∼250 documents from each decade for annotation. Two political science graduate students each annotated a subset of the data. To train our unsupervised domain adaptation framework, we use 9,600 unlabeled target examples (the same number as NYT). The expert-annotated dataset is divided into three subsets: a training set of 984 documents (only for training the In-Domain classifier discussed in §6.2), a development set of 246 documents, and a test set of 350 documents (50 per decade; the news sources used and label distributions are available in Appendix D).

Settings
Our CNN has 8 layers, each with 256 channels, f = 3, d = 2^i (for the ith layer), and ReLU activations. We enforce a maximum sequence length of 200 and select the minimum word count from [1, 2, 3] to build the vocabulary. The embedding matrix uses 300-D GloVe embeddings (Pennington et al., 2014) with a dropout rate of 0.5 (Srivastava et al., 2014). We history-pad our input with a zero vector, obtain the state connections using average pooling, and use a time embedding with a dimensionality of 10. The model is optimized with Adam (Kingma and Ba, 2015), with the learning rate selected from [10^−4, 5·10^−5, 10^−5] and the mini-batch size from [16, 32]. Hyperparameters were discovered using a grid search on the held-out development set.

Table 2: Framework results for the binary label task (left) and multi-label task (right). For the binary task, we show micro- and macro-averaged F1 scores. For the multi-label task, we show macro-averaged precision, recall, and F1 scores.

Framework Results
Using our best model, we benchmark our unsupervised domain adaptation framework against established methods: (1) Marginalized Stacked Denoising Autoencoders (mSDA): denoising autoencoders that marginalize out noise, enabling learning over infinitely many corrupted training samples (Chen et al., 2012); (2) Domain-Adversarial Neural Networks (DANN): networks trained adversarially to learn domain-invariant representations (Ganin et al., 2016); and (3) Self-Ensembling (SE): the fixed-ensembling framework described in §3.2-§3.4 (French et al., 2018).

Table 3: Model results with adaptive ensembling for the binary label task (left) and multi-label task (right). For the binary task, we show micro- and macro-averaged F1 scores. For the multi-label task, we show macro-averaged precision, recall, and F1 scores.
Results are presented in Table 2. Our method achieves the highest F1 scores for both tasks. The temporal curriculum further improves our results by a large margin, validating its effectiveness for domain adaptation on diachronic corpora. Although DANN achieves higher precision on the multi-label task, its recall largely suffers.

Model Results
Next, we ablate the various components of our model and evaluate several other strong text classification baselines under our framework: (1) Logistic Regression (LR): we average the word embeddings of each token in the sequence, then use these features to train a logistic regression classifier; and (2) 2D CNN: a standard text classification CNN with 2D convolutions (Kim, 2014).
Model ablations and results are presented in Table 3. Our full model achieves the highest F1 scores on both the binary and multi-label tasks, and each component consistently contributes to the overall F1 score. The 2D CNN also has decent F1 scores, showing that our framework works with standard CNN models. Further, the time embedding significantly improves both F1 scores, indicating the model effectively utilizes the unique temporal information present in our corpora.

Analysis
In this section, we pose and qualitatively answer numerous probing questions to further understand the strong performance of adaptive ensembling. We analyze several characteristics of the overall framework ( §7.1), then qualitatively inspect its performance on our datasets ( §7.2).

Framework
Are the adaptive constants different across hidden units? We randomly sample five adaptive constants and track their value trajectories over the course of training. Figure 4 shows that all of them sharply converge to and bounce around the same general neighborhood. This is strong evidence that we cannot use a fixed hyperparameter α to smooth each parameter; rather, we need per-parameter smoothing constants to account for the functionality and behavior of each unit.
How do the adaptive constants change by layer? Figure 5 shows the distribution of weight and bias parameters of adaptive constants for a top, middle, and bottom layer of our CNN. For the weight parameters, the teacher relies heavily on the student (α is skewed towards smaller smoothing rates) in the top layer, but gradually reduces its dependence by learning target domain features in the lower layers (α is skewed towards larger smoothing rates). For the bias parameters, the teacher prominently shifts the student features to work for the target domain in the top layer, but reduces its dependence on the student in the lower layers. This shows why using a fixed hyperparameter α does not account for layer-wise dynamics, i.e. each layer requires a specific distribution of α values to achieve strong performance.
Do adaptive constants benefit training and latent representations? Figure 6a depicts the unsupervised loss trajectories for self-ensembling (SE) and adaptive ensembling (AE). Compared to SE, the adaptive constants significantly stabilize the unsupervised loss. Next, Figure 6b shows the general training curves for AE and domain-adversarial neural networks (DANN). The DANN loss oscillates uncontrollably as the adversarial weight increases, but increasing the unsupervised loss weight for AE does not result in as much instability. We also compare the latent representations learned by SE and AE in Figure 7. While SE shows evidence of feature alignment, AE learns a much smoother manifold where source and target domain representations are intertwined.

Figure 5: Distribution of teacher network adaptive constants for a top, middle, and bottom layer. We display adaptive constants for both weight (top) and bias (bottom) parameters. The x-axis is shared for both the weight and bias distributions.

Datasets
Does adaptive ensembling yield better topics? In Table 1, we showed that applying LDA directly on COHA yields noisy, unrecognizable topics. Here, we use the SOURCE ONLY model and the adaptive ensembling framework to obtain labels for the unlabeled pool of COHA documents. We extract the political documents, run a topic model on the political subcorpus, and randomly sample topics. The SOURCE ONLY results are shown in Table 4 and the adaptive ensembling results are shown in Table 5. The SOURCE ONLY model has poor recall, as most of the extracted topics are vague and not inherently political in nature. In contrast, our framework is able to extract a wide range of clean, identifiable political topics. For example, the first topic reflects documents related to the Vietnam conflict while the third topic reflects documents related to important court proceedings.

Table 4: Randomly sampled topics extracted with the SOURCE ONLY model.

  Topic 1: dr, women, week, medical, doctors
  Topic 2: city, police, street, car, avenue
  Topic 3: trial, years, police, prison, court
  Topic 4: union, strike, workers, lewis, service
  Topic 5: like, man, years, little, week

Table 5: Randomly sampled topics extracted with adaptive ensembling.

  Topic 1: vietnam, hanoi, atomic, bombing, south
  Topic 2: germany, britain, france, europe, soviet
  Topic 3: court, justice, commission, law, attorney
  Topic 4: tax, oil, prices, petroleum, industry
  Topic 5: coal, union, strike, workers, miners

Figure 7: PCA performed on the latent representations of the teacher network in self-ensembling (left) and adaptive ensembling (right). We show representations for both source domain samples (green) and target domain samples (blue). Best viewed in color.

Does adaptive ensembling preserve the integrity of the original corpus? In order for political scientists to effectively study latent variables (such as political polarization) over time, the extracted political subcorpus must retain the integrity of the original corpus. That is, the subcorpus' distribution of documents across years and sources must closely match that of the original corpus. First, we analyze the document counts for each decade bin, shown in Figure 8. The political subcorpus shows a relatively consistent count across the decades, notably also capturing salient peaks from the 1920-1930s. Next, we analyze the document counts for each news source. Once again, the political subcorpus features documents from all sources that appear in the original corpus. In addition, the varied distribution across sources is also captured; Time Magazine (TM) has the most documents whereas Wall Street Journal (WSJ) has the fewest. Together, these results show that the resulting subcorpus is amenable for political science research, as it exhibits important characteristics of the original COHA corpus.

Related Work
Early approaches for unsupervised domain adaptation use shared autoencoders to create cross-domain representations (Glorot et al., 2011; Chen et al., 2012). More recently, Ganin et al. (2016) introduce a new paradigm that creates domain-invariant representations through adversarial training. This has gained popularity in NLP (Zhang et al., 2017; Fu et al., 2017; Chen et al., 2018); however, the difficulties of adversarial training are well established (Salimans et al., 2016; Arjovsky and Bottou, 2017). Consistency regularization methods (e.g., self-ensembling) outperform adversarial methods on visual semi-supervised and domain adaptation tasks (Athiwaratkun et al., 2019), but have rarely been applied to textual data (Ko et al., 2019). Finally, Huang and Paul (2018) establish the feasibility of using domain adaptation to label documents from discrete time periods. Our work departs from previous work by proposing an adaptive, time-aware approach to consistency regularization provisioned with causal convolutional networks.

Conclusion
We present adaptive ensembling, an unsupervised domain adaptation framework capable of identifying political texts for a multi-source, diachronic corpus by only leveraging supervision from a single-source, modern corpus. Our methods outperform strong benchmarks on both binary and multi-label classification tasks. We release our system, as well as an expert-annotated set of political articles from COHA, to facilitate domain adaptation research in NLP and political science research on public opinion over time.

A LDA Topic Model
We experimented with a range of hyperparameters to ensure the Latent Dirichlet Allocation (LDA) model was well suited to our datasets, leveraging the Gensim library. In particular, we removed all stopwords and extremely rare words (the tail 10-20% of the unigram distribution), and set the number of topics to 50.
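The exact pruning thresholds beyond "tail 10-20%" are not specified; a minimal pure-Python sketch of the vocabulary pruning step, under that assumption, might look like:

```python
from collections import Counter

def prune_vocabulary(docs, stopwords, rare_fraction=0.1):
    # Drop stopwords, then drop the rarest `rare_fraction` of the
    # remaining vocabulary types (the tail of the unigram distribution)
    # before handing the documents to the topic model.
    counts = Counter(w for doc in docs for w in doc if w not in stopwords)
    ranked = sorted(counts, key=counts.get)  # rarest types first
    dropped = set(ranked[:int(len(ranked) * rare_fraction)])
    return [[w for w in doc if w not in stopwords and w not in dropped]
            for doc in docs]
```

In practice the same effect is achieved with Gensim's dictionary filtering utilities; this sketch just makes the preprocessing explicit.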

B Self-Ensembling
The core intuition behind consistency regularization is that ensembled predictions are more likely to be correct than single predictions (Laine and Aila, 2017;Tarvainen and Valpola, 2017). To this end, Laine and Aila (2017) introduce a student and teacher network that yield single predictions and ensembled predictions, respectively.
After learning from labeled samples, the student may produce varying, dissimilar predictions for unlabeled samples due to the stochastic nature of optimization. One potential solution is to ensemble predictions across time to converge at the most likely prediction (Laine and Aila, 2017). Tarvainen and Valpola (2017) improve upon this method by showing that ensembling parameters (as opposed to predictions) results in better predictions. Because the teacher's parameters are smoothed with the student's learned parameters at each iteration, the teacher effectively becomes an ensemble of the student across time.
Further, to ensure that the features learned from the labeled samples are compatible with the unlabeled samples, Laine and Aila (2017), Tarvainen and Valpola (2017), and French et al. (2018) motivate a consistency-enforcing approach to bring the student and teacher's predictions closer together. In essence, if a feature learned from samples in the labeled domain is incompatible with samples in the unlabeled domain, the consistency (unsupervised) loss penalizes its incompatibility. Therefore, the interplay between these two networks creates a robust, domain-invariant feature space that characterizes both labeled and unlabeled samples (French et al., 2018). A detailed visualization of the training procedure is presented in Figure 1 in the main body of this paper.

C NYT Descriptors
We build a list of "political" descriptors in NYT to determine (a) which labels we can or cannot sample non-political documents from; and (b) which descriptors fall under the three areas of political science we consider for our multi-label task (American Government, Political Economy, and International Relations).
Because documents can be tagged with multiple descriptors, we build a list of descriptors whose documents have significant overlap with US POLITICS & GOVERNMENT. The second author, a political science graduate student, filtered this list to 57 descriptors that are political in nature.
For (a), we sample 4,600 non-political documents whose descriptors do not overlap with the 57 political descriptors described above. For (b), the same political science graduate student assigns each descriptor with one or more area labels. We use this label information to build an NYT dataset for our tasks. The 57 political descriptors and their corresponding area labels are tabulated in Table 7.

D Expert-Annotated Dataset
To create an initial COHA subcorpus of 56,000 documents (8,000 per decade), we sample from the following news sources that consistently appear across the decades: Chicago Tribune, Christian Science Monitor, New York Times, Time Magazine, and Wall Street Journal. Note that these NYT articles (up to 1986) do not appear in the NYT annotated corpus (Sandhaus, 2008), which starts in 1987 and which we use as our source training dataset.
From this subcorpus, we perform additional steps to create an expert-annotated dataset ( §5). Label distributions for our dataset are presented in Table 6. Although political economy (PE) is severely underrepresented, we experimentally find that these documents have salient features and are not as difficult to classify. In addition, we employ class imbalance penalties to prevent our model from ignoring these documents.
The source dataset (NYT) was already annotated; to ensure label agreement with our target dataset (COHA), we sampled documents from the source dataset and had our political science graduate students label them to compare against the original labels. There were minimal problems here: because NYT has fine-grained labels for its documents, the politically labeled articles were clearly political and vice versa.
The target dataset (COHA) was divided into halves, and each political science graduate student annotated one half. Prior to annotation, they agreed upon a set of rules to minimize bias in the annotation process. In addition, both of them worked side-by-side during all annotation periods, so they were able to ask each other's opinion in case of confusion. We also took measures to ensure label correctness after annotation was completed: each political science graduate student sampled a batch of their political and non-political annotations and sent it to the other to evaluate. Again, there was not much disagreement, as the rules decided upon in the beginning were sufficient to cover most edge cases. Quantitatively, Cohen's κ = 0.95 as calculated on a mutually annotated subset (Cohen, 1960).