Open Extraction of Fine-Grained Political Statements

Text data has recently been used as evidence in estimating the political ideologies of individuals, including political elites and social media users. While inferences about people are often the intrinsic quantity of interest, we draw inspiration from open information extraction to identify a new task: inferring the political import of propositions like OBAMA IS A SOCIALIST. We present several models that exploit the structure that exists between people and the assertions they make to learn latent positions of people and propositions at the same time, and we evaluate them on a novel dataset of propositions judged on a political spectrum.


Introduction
Over the past few years, much work has focused on inferring political preferences of people from their behavior, both in unsupervised and supervised settings. Classical ideal point models (Poole and Rosenthal, 1985; Martin and Quinn, 2002) estimate the political ideologies of legislators through their observed voting behavior, possibly paired with the textual content of bills (Gerrish and Blei, 2012) and debate text (Nguyen et al., 2015); other unsupervised models estimate ideologies of politicians from their speeches alone (Sim et al., 2013). Twitter users have also been modeled in a similar framework, using their observed following behavior of political elites as evidence to be explained (Barberá, 2015). Supervised models, likewise, have not only been used for assessing the political stance of sentences (Iyyer et al., 2014) but are also very popular for predicting the holistic ideologies of everyday users on Twitter (Rao et al., 2010; Pennacchiotti and Popescu, 2011; Al Zamal et al., 2012; Cohen and Ruths, 2013; Volkova et al., 2014), Facebook (Bond and Messing, 2015) and blogs (Jiang and Argamon, 2008), where training data is relatively easy to obtain, either from user self-declarations, political following behavior, or third-party categorizations.
Aside from their intrinsic value, estimates of users' political ideologies have been useful for quantifying the orientation of news media sources (Park et al., 2011;Zhou et al., 2011). We consider in this work a different task: estimating the political import of propositions like OBAMA IS A SOCIALIST.
In focusing on propositional statements, we draw on a parallel, but largely independent, strand of research in open information extraction. IE systems, from early slot-filling models with predetermined ontologies (Hobbs et al., 1993) to the large-scale open-vocabulary systems in use today (Fader et al., 2011; Mitchell et al., 2015), have worked toward learning type-level propositional information from text, such as BARACK OBAMA IS PRESIDENT. To a large extent, the ability to learn these facts from text is dependent on having data sources that are either relatively factual in their presentation (e.g., news articles and Wikipedia) or are sufficiently diverse to average over conflicting opinions (e.g., broad, random samples of the web).
Many of the propositional statements that individuals make online are, of course, not objective descriptions of reality at all, but rather reflect their own beliefs, opinions and other private mental states (Wiebe et al., 2005). While much work has investigated methods for establishing the truth content of individual sentences, whether from the perspective of veridicality (de Marneffe et al., 2012), fact assessment (Nakashole and Mitchell, 2014), or subjectivity analysis (Wiebe et al., 2003; Wilson, 2008), the structure that exists between users and their assertions gives us an opportunity to situate them both in the same political space. In this work we operate at the level of subject-predicate propositions, and present models that capture not only the variation in which subjects (e.g., OBAMA, ABORTION, GUN CONTROL) individual communities are more likely to discuss, but also the variation in what predicates different communities assert of the same subject (e.g., GLOBAL WARMING IS A HOAX vs. IS A FACT).
The contributions of this work are as follows:
• We present a new evaluation dataset of 766 propositions judged according to their positions in a political spectrum.
• We present and evaluate several models for estimating the ideal points of subject-predicate propositions, and find that unsupervised methods perform best (on sufficiently partisan data).

Task and Data
The task that we propose in this work is assessing the political import of type-level propositions: on average, are liberals or conservatives more likely to claim that GLOBAL WARMING IS A HOAX? To support this task, we create a benchmark of political propositions, extracted from politically partisan data, paired with human judgments (details in §2.3). We define a proposition to be a tuple composed of a subject and predicate, each consisting of one or more words, such as ⟨global warming, is a hoax⟩. We adopt an open vocabulary approach where each unique predicate defines a unary relation.

Data
In order to extract propositions that are likely to be political in nature and exhibit variability according to ideology, we collect data from a politically volatile source: comments on partisan blogs. We draw data from NPR, Mother Jones and Politico, all listed by Pew Research as news sources most trusted by those with consistently liberal views; Breitbart, most trusted by those with consistently conservative views; and the Daily Caller, Young Conservatives and the Independent Journal Review, all popular among conservatives (Kaufman, 2014).
We gather comments using the Disqus API; as a comment hosting service, Disqus allows users to post to different blogs using a single identity. Table 1 lists the total number of articles, user comments, unique users and tokens extracted from each blog source. In total, we extract 28 million comments (1.2 billion tokens) posted by 621,231 unique users.

Extracting Propositions
The blog comments in table 1 provide raw data from which to mine propositional assertions. In order to extract structured ⟨subject, predicate⟩ propositions from text, we first parse all comments using the collapsed dependencies (de Marneffe and Manning, 2008) of the Stanford parser (Manning et al., 2014), and identify all subjects as those that hold an nsubj or nsubjpass relation to their head. In order to balance the tradeoff between generality and specificity in the representation of assertions, we extract three representations of each predicate.
1. Exact strings, which capture verbatim the specific nuance of the assertion. This includes all subjects paired with their heads and all descendants of that head. Tense and number are preserved.
   Example: ⟨Reagan, gave amnesty to 3 million undocumented immigrants⟩
2. Reduced syntactic tuples, which provide a level of abstraction by lemmatizing word forms and including only specific syntactic relationships. This includes propositions defined as nominal subjects paired with their heads and children of that head that are negators, modal auxiliaries (can, may, might, shall, could, would), particles and direct objects. All word forms are lemmatized, removing tense information on verbs and number on nouns.
   Example: ⟨Reagan, give amnesty⟩
3. Subject-verb tuples, which provide a more general layer of abstraction by only encoding the relationship between a subject and its main action. In this case, a proposition is defined as the nominal subject and its lemmatized head.
   Example: ⟨Reagan, give⟩
The human benchmark defined in §2.3 below considers only verbatim predicates, while all models proposed in §3 and all baselines in §4 include the union of all three representations as data.
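As a minimal sketch of how the three representations relate to one another, the snippet below derives them from a small hand-built dependency parse. The `Token` structure, the relation labels on the toy tokens, and the helper names are hypothetical illustrations; they are not the Stanford parser's output format, and this is not the extraction code used for the experiments.

```python
from dataclasses import dataclass

@dataclass
class Token:
    idx: int    # position in the sentence (also its index in the token list)
    word: str   # surface form
    lemma: str  # lemmatized form
    head: int   # idx of the head token, -1 for the root
    rel: str    # dependency relation to the head

MODALS = {"can", "may", "might", "shall", "could", "would"}
# Children of the head kept in the reduced tuple (assumed label names).
KEPT_RELS = {"neg", "aux", "prt", "dobj"}

def descendants(tokens, head_idx):
    """All tokens dominated by head_idx, excluding the head itself."""
    found, frontier = [], [head_idx]
    while frontier:
        current = frontier.pop()
        for tok in tokens:
            if tok.head == current:
                found.append(tok)
                frontier.append(tok.idx)
    return found

def join(toks, attr):
    """Join the chosen attribute of the tokens in sentence order."""
    return " ".join(getattr(t, attr) for t in sorted(toks, key=lambda t: t.idx))

def extract(tokens):
    """Return the three predicate representations for each nominal subject."""
    out = []
    for subj in tokens:
        if subj.rel not in ("nsubj", "nsubjpass"):
            continue
        head = tokens[subj.head]
        subj_tree = {subj.idx} | {t.idx for t in descendants(tokens, subj.idx)}
        # Exact string: the head and everything below it, minus the subject.
        exact = [head] + [t for t in descendants(tokens, head.idx)
                          if t.idx not in subj_tree]
        # Reduced tuple: head plus negators, modal auxiliaries, particles, objects.
        reduced = [head] + [
            t for t in tokens
            if t.head == head.idx and t.rel in KEPT_RELS
            and (t.rel != "aux" or t.lemma in MODALS)
        ]
        out.append({
            "subject": subj.word,
            "exact": join(exact, "word"),
            "reduced": join(reduced, "lemma"),
            "subject_verb": head.lemma,
        })
    return out

# Toy parse of the running example (labels are illustrative only).
toy = [
    Token(0, "Reagan", "Reagan", 1, "nsubj"),
    Token(1, "gave", "give", -1, "root"),
    Token(2, "amnesty", "amnesty", 1, "dobj"),
    Token(3, "to", "to", 1, "prep"),
    Token(4, "3", "3", 5, "number"),
    Token(5, "million", "million", 7, "num"),
    Token(6, "undocumented", "undocumented", 7, "amod"),
    Token(7, "immigrants", "immigrant", 3, "pobj"),
]
propositions = extract(toy)
```

On the toy parse this yields one proposition whose three predicate forms match the examples in the list above.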
Here, syntactic structure not only provides information in the representation of propositions, but also allows us to define criteria by which to exclude predicates. Since we are looking to extract propositions that are directly asserted by an author of a blog comment (and not second-order reporting), we exclude all propositions dominated by an attitude predicate (Republicans think that Obama should be impeached) and all those contained within a conditional clause (If Obama were impeached...). We also exclude all assertions drawn from questions (i.e., sentences containing a question mark) and all assertions extracted from quoted text (i.e., surrounded by quotation marks).
In total, from all 28 million comments across all seven blogs, we extract all propositions defined by the criteria above, yielding a total of 61 million propositions (45 million unique).

Human Benchmark
From all propositions with a verbatim predicate extracted from the entire dataset, we rank the most frequent subjects and manually filter out noncontent terms (like that, one, someone, anyone, etc.) to yield a set of 138 target topics, the most frequent of which are obama, democrats, bush, hillary, and america.
For each proposition containing one of these topics as its subject and mentioned by at least 5 different people across all blogs, we randomly sampled 1,000 in proportion to their frequency of use (so that sentences that appear more frequently in the data are more likely to be sampled); the sentences selected in this random way contain a variety of politically charged viewpoints. We then presented them to workers on Amazon Mechanical Turk for judgments on the extent to which they reflect a US liberal vs. conservative political worldview.
For each sentence, we paid 7 annotators in the United States to (a) confirm that the extracted sentence was a well-formed assertion and (b) rate "the most likely political belief of the person who would say it" on a five-point scale: very conservative/Republican (−2), slightly conservative/Republican (−1), neutral (0), slightly liberal/Democrat (1), and very liberal/Democrat (2).
We keep all sentences that at least six annotators have marked as meaningful (those excluded by this criterion include sentence fragments like bush wasn't and those that are difficult to understand without context, such as romney is obama) and where the standard deviation of the responses is under 1 (which excludes sentences with flat distributions such as government does nothing well and those with bimodal distributions, such as christie is done). After this quality control, we average the responses to create a dataset of 766 propositions paired with their political judgments.
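The quality-control step above can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure's actual code: the function name and data layout are hypothetical, the ratings of all annotators are used, and the population standard deviation (`statistics.pstdev`) is assumed because the text does not say which estimator was applied. The toy ratings are invented.

```python
from statistics import mean, pstdev

def aggregate(judgments, min_meaningful=6, max_std=1.0):
    """Filter and average annotator ratings.

    judgments maps a sentence id to a list of (is_meaningful, rating) pairs,
    one per annotator, with ratings on the -2..2 scale.
    """
    kept = {}
    for sentence, answers in judgments.items():
        n_meaningful = sum(1 for ok, _ in answers if ok)
        ratings = [rating for _, rating in answers]
        # Keep only well-formed sentences on which annotators roughly agree.
        if n_meaningful >= min_meaningful and pstdev(ratings) < max_std:
            kept[sentence] = mean(ratings)
    return kept

# Invented toy judgments for three placeholder sentences.
toy_judgments = {
    "s1": [(True, r) for r in (-2, -2, -1, -2, -1, -2, -2)],   # consistent
    "s2": [(True, r) for r in (-2, -2, -2, 2, 2, 2, 0)],       # bimodal
    "s3": [(True, -1)] * 5 + [(False, -1)] * 2,                # too few "meaningful"
}
benchmark = aggregate(toy_judgments)
```

Only the first sentence survives both criteria; the second fails the standard-deviation threshold and the third fails the six-annotator threshold.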

Models
The models we introduce to assess the political import of propositions are based on two fundamental ideas. First, users' latent political preferences, while unobserved, can provide an organizing principle for inference about propositions in an unsupervised setting. Second, by decoupling the variation in subjects discussed by different communities (e.g., liberals may talk more about global warming while conservatives may talk more about gun rights) from variation in what statements are predicated of those subjects (e.g., liberals may assert that global warming is a fact while conservatives may be more likely to assert that it is a hoax), we are able to have a more flexible and interpretable parameterization of observed textual behavior that allows us to directly measure both. We present two models below: one that represents users and propositions as real-valued points, and another that represents each as categorical variables. For both models, the input is a set of users paired with a list of ⟨subject, predicate⟩ tuples they author; the variables of interest we seek are representations of those users, subjects, and predicates that explain the coupling between users and propositions we see.

Additive Model
The first model we present (fig. 1) represents each user, subject, and predicate as a real-valued point in K-dimensional space. In the experiments that follow, we consider the simple case where K = 1 but present the model in more general terms below.
In this model, we parameterize the generative probability of a subject (like Obama) as used by an individual $u$ as the exponentiated sum of a background log frequency of that subject in the corpus overall ($m_{sbj}$) and $K$ additive effects, normalized over the space of $S$ possible subjects, as a real-valued analogue to the SAGE model of Eisenstein et al. (2011):

$$P(sbj \mid u, \eta, \beta, m_{sbj}) = \frac{\exp\left(m_{sbj} + \sum_{k=1}^{K} \eta_{u,k}\,\beta_{k,sbj}\right)}{\sum_{sbj'} \exp\left(m_{sbj'} + \sum_{k=1}^{K} \eta_{u,k}\,\beta_{k,sbj'}\right)} \qquad (1)$$

While the background term controls the overall frequency of a subject in the corpus, $\beta \in \mathbb{R}^{K \times S}$ mediates the relative increase or decrease in probability of a subject for each latent dimension. Intuitively, when both $\eta_{u,k}$ and $\beta_{k,sbj}$ (for a given user $u$, dimension $k$, and subject $sbj$) are the same sign (either both positive or both negative), the probability of that subject under that user increases; when they differ, it decreases. $\beta_{\cdot,sbj}$ is a $K$-dimensional representation of subject $sbj$, and $\eta_{u,\cdot}$ is a $K$-dimensional representation of user $u$.
Likewise, we parameterize the generative probability of a predicate (conditioned on a subject) in the same way; for $S$ subjects, each of which contains (up to) $P$ predicates, $\psi \in \mathbb{R}^{S \times K \times P}$ captures the relative increase or decrease in probability for a given predicate conditioned on its subject, relative to its background frequency in the corpus overall, $m_{pred|sbj}$:

$$P(pred \mid u, sbj, \eta, \psi, m_{pred|sbj}) = \frac{\exp\left(m_{pred|sbj} + \sum_{k=1}^{K} \eta_{u,k}\,\psi_{sbj,k,pred}\right)}{\sum_{pred'} \exp\left(m_{pred'|sbj} + \sum_{k=1}^{K} \eta_{u,k}\,\psi_{sbj,k,pred'}\right)} \qquad (2)$$

The full generative story for this model runs as follows. For a vocabulary of subjects of size $S$, where each subject $s$ has $P$ predicates:
- For each dimension $k$, draw subject coefficients $\beta_k \in \mathbb{R}^{S} \sim \mathrm{Norm}(\mu_s, \sigma_s I)$
- For each subject $s$:
  - For each dimension $k$, draw subject-specific predicate coefficients $\psi_{s,k} \in \mathbb{R}^{P} \sim \mathrm{Norm}(\mu_p, \sigma_p I)$
- For each user $u$:
  - Draw user representation $\eta \in \mathbb{R}^{K} \sim \mathrm{Norm}(\mu, \sigma I)$
  - For each proposition ⟨sbj, pred⟩ made by $u$:
    - Draw sbj according to eq. 1
    - Draw pred according to eq. 2
The unobserved quantities of interest in this model are $\eta$, $\beta$ and $\psi$. In the experiments reported below, we set the prior distributions on $\eta$, $\beta$ and $\psi$ to be standard normals ($\mu = 0$, $\sigma = 1$) and perform maximum a posteriori inference with respect to $\eta$, $\beta$ and $\psi$ in turn for a total of 25 iterations.
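To make the parameterization concrete, the following sketch computes the normalized exponential of a background log frequency plus K additive user effects, which is the form described above for both subjects and predicates. The function name, argument layout and toy numbers are assumptions made for illustration; MAP inference is not shown.

```python
import math

def additive_probs(background, effects, eta_u):
    """P(item | user), proportional to exp(m_item + sum_k eta_u[k] * effects[k][item]).

    background: list of background log frequencies, one per item
    effects:    K lists of per-item coefficients (beta, or psi for a fixed subject)
    eta_u:      the user's K-dimensional representation
    """
    scores = [
        m + sum(e * row[i] for e, row in zip(eta_u, effects))
        for i, m in enumerate(background)
    ]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setting with K = 1 and two subjects of equal background frequency.
m_sbj = [math.log(0.5), math.log(0.5)]
beta = [[1.0, -1.0]]                       # subject 0 leans positive, subject 1 negative

p_positive_user = additive_probs(m_sbj, beta, [2.0])
p_negative_user = additive_probs(m_sbj, beta, [-2.0])
p_neutral_user = additive_probs(m_sbj, beta, [0.0])
```

A user whose coordinate shares the sign of a subject's coefficient is more likely to use that subject, while a user at the origin simply reproduces the background frequencies.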
While $\beta$ and $\psi$ provide scores for the political import of subjects and of predicates conditioned on fixed subjects, respectively, we can recover a single ideological score for both a subject and its predicate by adding their effects together. In the evaluation given in §5, let the PREDICATE SCORE for ⟨subject, predicate⟩ be that given by $\psi_{subject,\cdot,predicate}$, and let the PROPOSITION SCORE be $\beta_{\cdot,subject} + \psi_{subject,\cdot,predicate}$.

Single Membership Model
While the additive model above represents each user and proposition as a real-valued point in K-dimensional space, we can also represent those values as categorical variables in an unsupervised naïve Bayes parameterization; in this case, a user is not defined as a mixture of different effects, but rather belongs to a single unique community. The generative story for this model (shown in fig. 2) is as follows:
- Draw population distribution over categories $\theta \sim \mathrm{Dir}(\alpha)$
- For each category $k$, draw distribution over subjects $\phi_k \sim \mathrm{Dir}(\gamma)$
- For each category $k$ and subject $s$:
  - Draw distribution over subject-specific predicates $\xi_{k,s} \sim \mathrm{Dir}(\gamma_s)$
- For each user $u$:
  - Draw user type index $z \sim \mathrm{Cat}(\theta)$
  - For each proposition ⟨sbj, pred⟩ made by $u$:
    - Draw subject $sbj \sim \mathrm{Cat}(\phi_z)$
    - Draw predicate $pred \sim \mathrm{Cat}(\xi_{z,sbj})$
We set $K = 2$ in an attempt to recover a distinction between liberal and conservative users. For the experiments reported below, we run inference using collapsed Gibbs sampling (Griffiths and Steyvers, 2004) for 100 iterations, performing hyperparameter optimization on $\alpha$, $\gamma$ and $\gamma_s$ (all asymmetric) every 10 iterations using the fixed-point method of Minka (2003).
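The generative story can be simulated directly, which is a useful sanity check before running inference. The sketch below is a forward sampler only and does not implement collapsed Gibbs sampling or hyperparameter optimization. For brevity it assumes symmetric Dirichlet hyperparameters (the experiments use asymmetric ones), and all function and parameter names are hypothetical.

```python
import random

def dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(n_users, n_props, K, S, P, alpha=1.0, gamma=1.0, gamma_s=1.0, seed=0):
    """Sample users and their (subject, predicate) index pairs."""
    rng = random.Random(seed)
    theta = dirichlet([alpha] * K, rng)                    # categories
    phi = [dirichlet([gamma] * S, rng) for _ in range(K)]  # subjects per category
    xi = [[dirichlet([gamma_s] * P, rng) for _ in range(S)]
          for _ in range(K)]                               # predicates per (category, subject)
    users = []
    for _ in range(n_users):
        z = categorical(theta, rng)                        # single community membership
        props = []
        for _ in range(n_props):
            sbj = categorical(phi[z], rng)
            pred = categorical(xi[z][sbj], rng)
            props.append((sbj, pred))
        users.append((z, props))
    return users

sample = generate(n_users=5, n_props=4, K=2, S=3, P=4)
```

Each sampled user carries exactly one category index, and every proposition that user makes is drawn from that category's subject and predicate distributions.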
In order to compare the subject-specific predicate distributions across categories, we first calculate the posterior predictive distribution by taking a single sample of all latent variables $z$ to estimate the following (Asuncion et al., 2009):

$$\hat{\zeta}_{z,v} = \frac{c(z,v) + \gamma_v}{\sum_{v'} \left(c(z,v') + \gamma_{v'}\right)}$$

where $\hat{\zeta}_{z,v}$ is the $v$th element of the $z$th multinomial being estimated, $c(z,v)$ is the count of element $v$ associated with category $z$ and $\gamma_v$ is the associated Dirichlet hyperparameter for that element. Given this smoothed distribution, for each proposition we assign it a real-valued score, the log-likelihood ratio between its value in these two distributions. In the evaluation that follows, writing 1 and 2 for the two categories, let the PREDICATE SCORE for a given ⟨subject, predicate⟩ under this model be:

$$\log \hat{\xi}_{1,sbj,pred} - \log \hat{\xi}_{2,sbj,pred}$$

Let the PROPOSITION SCORE be:

$$\log\left(\hat{\phi}_{1,sbj}\,\hat{\xi}_{1,sbj,pred}\right) - \log\left(\hat{\phi}_{2,sbj}\,\hat{\xi}_{2,sbj,pred}\right)$$

Figure 2: Single membership model with decoupled subjects and predicates. z is the latent category identity of a user (e.g., liberal or conservative); φ is a distribution over subjects for each category; and ξ is a distribution of predicates given subject s.
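A short sketch of this scoring step, assuming the smoothed estimate is the count plus its Dirichlet hyperparameter, normalized over all elements, and that the score is a difference of logs between the two categories. Function names, the category labels `a` and `b`, and all counts are invented for illustration; which category is the minuend is arbitrary in an unsupervised model.

```python
import math

def smoothed(counts, gamma):
    """Posterior predictive estimate (c + gamma) / sum(c' + gamma') for one multinomial."""
    total = sum(c + g for c, g in zip(counts, gamma))
    return [(c + g) / total for c, g in zip(counts, gamma)]

def predicate_score(xi_a, xi_b, pred):
    """Log-likelihood ratio of a predicate (given its subject) across two categories."""
    return math.log(xi_a[pred]) - math.log(xi_b[pred])

def proposition_score(phi_a, phi_b, sbj, xi_a, xi_b, pred):
    """Log-likelihood ratio of the joint subject-and-predicate probability."""
    return (math.log(phi_a[sbj] * xi_a[pred])
            - math.log(phi_b[sbj] * xi_b[pred]))

# Invented counts: two subjects, and two predicates for subject 0.
phi_a = smoothed([10, 30], [1.0, 1.0])   # subject counts in category a
phi_b = smoothed([10, 10], [1.0, 1.0])   # subject counts in category b
xi_a = smoothed([8, 2], [1.0, 1.0])      # predicate counts for subject 0 in a
xi_b = smoothed([1, 9], [1.0, 1.0])      # predicate counts for subject 0 in b

pred_llr = predicate_score(xi_a, xi_b, pred=0)
prop_llr = proposition_score(phi_a, phi_b, 0, xi_a, xi_b, 0)
```

In the toy numbers, the predicate is strongly associated with category a, but the subject itself is relatively more common in category b, so the proposition-level score is smaller than the predicate-level one.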

Comparison
The two models described in §3 are unsupervised methods for estimating the latent political positions of users along with propositional assertions. We compare with three other models, a mixture of unsupervised, supervised, and semi-supervised methods. Unlike our models, these were not designed for the task described in §2.

Principal Component Analysis
To compare against another purely unsupervised model, we evaluate against principal component analysis (PCA), a latent linear model that minimizes the average reconstruction error between an original data matrix $X \in \mathbb{R}^{n \times p}$ and a low-dimensional approximation $ZW^\top$, where $Z \in \mathbb{R}^{n \times K}$ can be thought of as a $K$-dimensional latent representation of the input and $W \in \mathbb{R}^{p \times K}$ contains the eigenvectors of the $K$ largest eigenvalues of the covariance matrix $X^\top X$, providing a $K$-dimensional representation for each feature. We perform PCA with $K = 1$ on two representations of our data: (a) counts, where the input data matrix contains the counts for each proposition for each user, and (b) frequencies, where we normalize those counts for each user to unit length. While the input data is sparse, we must center each column to have a 0 mean (resulting in a dense matrix) and perform PCA through a singular value decomposition of that column-centered data using the method of Halko (2011). In using SVD for PCA, the right singular vectors correspond to the principal directions; from these we directly read off a $K = 1$ dimensional score for each proposition in our data.
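For intuition, the K = 1 case can be sketched without any linear algebra library. Note the substitution: the snippet uses plain power iteration on the column-centered matrix to find the leading right singular vector, not the randomized SVD cited above, and it works on small dense lists rather than sparse data. The function name and toy counts are assumptions.

```python
import math
import random

def first_principal_direction(X, n_iter=100, seed=0):
    """Leading right singular vector of column-centered X via power iteration."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # center each column
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(p)]                # random start
    for _ in range(n_iter):
        z = [sum(r[j] * w[j] for j in range(p)) for r in Xc]               # Xc w
        w = [sum(Xc[i][j] * z[i] for i in range(n)) for j in range(p)]     # Xc^T (Xc w)
        norm = math.sqrt(sum(v * v for v in w))                # assumes non-constant data
        w = [v / norm for v in w]
    return w

# Toy user-by-proposition counts: two users use only proposition 0, two only proposition 1.
counts = [[3, 0], [2, 0], [0, 3], [0, 2]]
direction = first_principal_direction(counts)
```

The entries of `direction` are the one-dimensional proposition scores; in the toy matrix the two propositions receive scores of opposite sign. The overall sign of a principal direction is arbitrary, which is one reason the evaluation compares rankings by the absolute value of the correlation.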

ℓ2-Regularized Logistic Regression
While unsupervised methods potentially allow us to learn interesting structure in data, they are often eclipsed in prediction tasks by the addition of any form of supervision. While purely supervised models give more control over the exact decision boundary being learned, they can suffer by learning from a much smaller training set than unsupervised methods have access to. To evaluate this tradeoff, we compare against a supervised model trained using naturally occurring data: users who self-declare themselves in their profiles to be liberal, conservative, democrat, or republican. We randomly sampled 150 users who self-identify as liberals and 150 who identify as conservatives. We do not expect these users to be a truly random sample of the population, since those who self-declare their political affiliation are more likely to engage with political content differently from those who do not (Sandvig, 2015; Hargittai, 2015), but this is a method that has been used for political prediction tasks in the past (Cohen and Ruths, 2013).
We build a predictive model using two classes of features: (a) binary indicators of the most frequent 25,000 unigrams and multiword expressions in the corpus overall; and (b) features derived from user posting activity to the seven blogs shown in table 1 (binary indicators of the blogs posted to, and the identity of the most frequent blog). In a tenfold cross-validation (using ℓ2-regularized logistic regression), this classifier attains an accuracy rate of 76.7% (with a standard error of ±1.7 across the ten folds).
In order to establish real-valued scores for propositions, we follow the same method as for the single membership model described above, using the log likelihood ratio of the probability of the proposition under each condition, where that probability is given as the count of the proposition among users classified as (e.g.) liberals (plus some small smoothing factor) divided by the total number of propositions used by them overall.

Co-Training
Since the features we use for the supervised model provide two roughly independent views of the data, we also evaluate against the semi-supervised method of co-training (Blum and Mitchell, 1998). Here, we train two different logistic regression classifiers, one with access only to the unigrams and multiword expressions employed by the user ($h_{words}$) and the other only to binary indicators of the blogs posted to and the identity of the most frequent blog ($h_{blogs}$). For ten iterations, we pick a random sample $U'$ of 1,000 data points from the full dataset $U$ and classify each using the two classifiers; each classifier then adds up to 100 of the highest-confidence predictions to the training set, retaining the class distribution balance of the initial training set. After training, the final predictive probability for an item is the product of the probabilities given by the two trained classifiers. In a tenfold cross-validation, co-training yielded a slightly higher (but not statistically significant) accuracy over pure supervision (77.0% ±1.8). We calculate scores for propositions in the same way as for the fully supervised case above.
sitions whose subject falls within the evaluation benchmark; and include only propositions used by at least five different users, and only users who make at least five different assertions, yielding a total dataset of 40,803 users and 1.9 million propositions (81,728 unique), containing the union of all three kinds of extracted propositions from §2.2.
Each of the automatic methods that we discuss above assigns a real-valued score to propositions like OBAMA IS A SOCIALIST. Our goal in evaluation is to judge how well those model scores recover those assigned by humans in our benchmark. Since each method may make different assumptions about the distribution of scores (and normalizing them may be sensitive to outliers), we do not attempt to model them directly, but rather use two nonparametric tests: Spearman's rank correlation coefficient and cluster purity.
Spearman's rank correlation coefficient. The set of scores in the human benchmark and as output by a model each defines a ranked list of propositions; Spearman's rank correlation coefficient (ρ) is a nonparametric test of the Pearson correlation coefficient measured over the ranks of items in two lists (rather than their values). We use the absolute value of ρ to compare the degree to which the ranked propositions of two lists are linearly correlated; a perfect correlation would have ρ = 1.0; no correlation would have ρ = 0.0.
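As a sketch of this metric, the snippet below converts each list of scores to ranks (averaging ranks within ties, an assumption since the text does not say how ties are handled) and then computes the Pearson correlation over those ranks. The function names and toy scores are invented.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed over ranks."""
    return pearson(average_ranks(x), average_ranks(y))

human = [-2, -1, 0, 1, 2]                 # invented benchmark scores
model_flipped = [10, 8, 5, 3, 1]          # same order, opposite sign convention
model_noisy = [0.1, 0.3, 0.2, 0.9, 0.5]   # two adjacent swaps

rho_flipped = spearman(human, model_flipped)
rho_noisy = spearman(human, model_noisy)
```

Taking the absolute value matters for the unsupervised models: a model that recovers the ordering but with the opposite sign convention scores ρ = −1, which is as informative as ρ = 1.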
Purity. While Spearman's rank correlation coefficient gives us a nonparametric estimate of the degree to which the exact order of two sequences are the same, we can also soften the exact ordering assumption and evaluate the degree to which a ranked proposition falls on the correct side of the political continuum (i.e., not considering whether OBAMA IS A SOCIALIST is more or less conservative than OBAMA IS A DICTATOR but rather that it is more conservative than liberal). For each ranked list, we form two clusters of propositions, split at the midpoint: all scores below the midpoint define one cluster, and all scores above or equal define a second. For $N = 766$ propositions, given gold clusters $G = \{g_1, g_2\}$ and model clusters $C = \{c_1, c_2\}$ (each containing 383 propositions), we calculate purity as the average overlap for the best alignment between the two gold clusters and their model counterparts:

$$\text{Purity} = \frac{1}{N} \sum_{k} \max_{j} \left| g_k \cap c_j \right| \qquad (7)$$

A perfect purity score (in which all items from each cluster in $C$ are matched to the same cluster in $G$) is 1.0; given that all clusters are identically sized (being defined as the set falling on each half of a midpoint), a random assignment would yield a score of 0.50 in expectation.
Table 3 presents the results of this evaluation. For both of the models described in §3, we present results for scoring a proposition like OBAMA IS A SOCIALIST based only on the conditional predicate score (PRED.) and on a score that includes variation in the subject as well (PROP.). Since both models are fit using approximate inference with a non-convex objective function, we run five models with different random initializations and present the average across all five.
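As a sketch of the purity metric defined above, the snippet below splits each list of scores into a lower and an upper half and sums, for each gold cluster, its largest overlap with a model cluster. Splitting by rank (so that the halves are equal-sized, with ties broken by sort order) is an assumption made for the illustration, as are the function names and toy scores.

```python
def split_at_midpoint(scores):
    """Return (lower_half, upper_half) as sets of item indices, split by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    half = len(order) // 2
    return set(order[:half]), set(order[half:])

def purity(gold_scores, model_scores):
    """Average overlap between gold clusters and their best-matching model clusters."""
    gold = split_at_midpoint(gold_scores)
    model = split_at_midpoint(model_scores)
    n = len(gold_scores)
    return sum(max(len(g & c) for c in model) for g in gold) / n

gold = [-2, -1, 1, 2]                        # invented benchmark scores
purity_same = purity(gold, [0.1, 0.2, 0.8, 0.9])
purity_flipped = purity(gold, [0.9, 0.8, 0.2, 0.1])
purity_mixed = purity(gold, [0.1, 0.9, 0.2, 0.8])
```

Because each gold cluster is matched to whichever model cluster overlaps it most, a model whose scale runs in the opposite direction is not penalized.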
We estimate confidence intervals using the block jackknife (Quenouille, 1956; Efron and Stein, 1981), calculating purity and Spearman's ρ over 76 resampled subsets of the full 766 elements, each leaving out 10. For both metrics, the two best performing models show statistically significant improvement over all other models, but are not significantly different from each other.
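A block jackknife of this kind can be sketched as follows. The text does not give the variance formula, so the grouped-jackknife form (g − 1)/g · Σ(θ_i − θ̄)² over g leave-one-block-out estimates is an assumption here, as are the function name and the handling of leftover items (with 766 items and blocks of 10 there are 76 blocks, and the final 6 items are never left out in this sketch).

```python
import math

def block_jackknife(items, statistic, block_size):
    """Return (mean of leave-one-block-out estimates, jackknife standard error)."""
    n_blocks = len(items) // block_size
    estimates = []
    for b in range(n_blocks):
        lo, hi = b * block_size, (b + 1) * block_size
        # Recompute the statistic with one contiguous block removed.
        estimates.append(statistic(items[:lo] + items[hi:]))
    g = len(estimates)
    center = sum(estimates) / g
    variance = (g - 1) / g * sum((e - center) ** 2 for e in estimates)
    return center, math.sqrt(variance)

def mean(values):
    return sum(values) / len(values)

# Sanity check on a case with a known answer: with blocks of size 1 the
# jackknife standard error of the mean equals s / sqrt(n).
estimate, std_err = block_jackknife([1, 2, 3, 4, 5], mean, block_size=1)
```

For the evaluation, `statistic` would be a function that computes purity or Spearman's ρ over the retained (human score, model score) pairs.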
We draw two messages from these results:
For heavily partisan data, unsupervised methods are sufficient. In drawing on comments on politically partisan blogs, we are able to match human judgments of the political import of propositions quite well (both of the unsupervised models described in §3 outperform their supervised and semi-supervised counterparts by a large margin), which suggests that the easiest structure to find in this particular data is the affiliation of users with their political ideologies. Both unsupervised models are able to exploit the natural structure without being constrained by a small amount of training data that may be more biased (e.g., in its class balance) than helpful. The two generative models also widely outperform PCA, which may reflect a mismatch between its underlying assumptions and the textual data we observe; PCA treats data sparsity as structural zeros (not simply missing data) and so must model not only the variation that exists between users, but also the variation that exists in their frequency of use; other latent component models may be a better fit for this kind of data.
Joint information is important. For both models, including information about the full joint probability of a subject and predicate together yields substantial improvements for both purity and the Spearman correlation coefficient compared to scores calculated from variation in the conditional predicate alone. While we might have considered variation in the predicate to be sufficient in distinguishing between political parties, we see that this is simply not the case; variation in the subject may help anchor propositions in the spectrum relative to each other.

Convergent Validity
The primary quantity of interest that we are trying to estimate in the models described above is the political position of an assertion; a user's latent political affiliation is only a helpful auxiliary variable in reaching this goal. We can, however, also measure the correlation of those variables themselves with other variables of interest, such as users' self-declarations of political affiliation and audience participation on the different blogs. Both provide measures of convergent validity that confirm the distinction being made in our models is indeed one of political ideology.

Correlation with Self-declarations
One form of data not exploited by the unsupervised models described above is users' self-declarations; we omit these above in order to make the models as general as possible (requiring only text and not metadata), but they can provide an independent measure of the distinctions our unsupervised models are learning. (The supervised baselines in contrast are able to draw on this profile information for training data.) Approximately 12% of the users in the data input to our models (4,718 of 40,804) have affiliated self-declared profile information; the most frequent of these include retired, businessman, student, and patriot. For all of these users, we regress binary indicators of the top 25,000 unigrams in their profiles against the MAP estimate of their political affiliation in the single-membership model. Across all 5 folds, the features with the highest predictive weights for one class were patriot, conservative, obama, and god while the highest predictive weights for the other were progressive, voter, liberal, and science.

Estimating Media Audience
We can also use users' latent political ideologies to estimate the overall ideological makeup of a blog's active audience. If we assign each post to our estimate of the political ideology of its author, we find that Mother Jones has the highest fraction of comments by estimated liberals at 80.4%, while Breitbart has the highest percentage of comments by conservatives (79.5%). This broadly accords with the Pew Research survey, which finds that among the blogs in our dataset, consistently liberal respondents trust NPR and Mother Jones most, while consistent conservatives trust Breitbart most and NPR and Mother Jones the least.
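The audience estimate amounts to attributing each comment to its author's inferred category and normalizing per blog. A minimal sketch, with hypothetical function and variable names and invented toy data (the blog names and labels below are placeholders, not results):

```python
from collections import Counter, defaultdict

def audience_share(comments, user_label):
    """Fraction of each blog's comments attributed to each estimated category.

    comments:   iterable of (blog, user) pairs, one per comment
    user_label: dict mapping a user to the category estimated by the model
    """
    per_blog = defaultdict(Counter)
    for blog, user in comments:
        if user in user_label:            # skip users the model did not score
            per_blog[blog][user_label[user]] += 1
    return {
        blog: {label: n / sum(counts.values()) for label, n in counts.items()}
        for blog, counts in per_blog.items()
    }

toy_labels = {"u1": "liberal", "u2": "conservative", "u3": "liberal"}
toy_comments = [
    ("blog_a", "u1"), ("blog_a", "u1"), ("blog_a", "u3"), ("blog_a", "u2"),
    ("blog_b", "u2"), ("blog_b", "u2"), ("blog_b", "u4"),
]
shares = audience_share(toy_comments, toy_labels)
```

Counting comments rather than distinct users means that prolific commenters weigh more heavily, which matches the description of a blog's active audience.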

Conclusion
We introduce the task of estimating the political import of propositions such as OBAMA IS A SOCIALIST; while much work in open information extraction has focused on learning facts such as OBAMA IS PRESIDENT from text, we are able to exploit structure in the users and communities who make such assertions in order to align them all within the same political space. Given sufficiently partisan data (here, comments on political blogs), we find that the unsupervised generative models presented here are able to outperform other models, including those given access to supervision.
One natural downstream application of this work is fine-grained opinion polling; while existing work has leveraged social media data on Twitter for uncovering correlations with consumer confidence, political polls (O'Connor et al., 2010), and flu trends (Paul and Dredze, 2011), our work points the way toward identifying fine-grained, interpretable propositions in public discourse and estimating latent aspects (such as political affiliation) of the communities who assert them. Data and code to support this work can be found at http://people.ischool.berkeley.edu/~dbamman/emnlp2015/.