Multi-view Models for Political Ideology Detection of News Articles

A news article’s title, content and link structure often reveal its political ideology. However, most existing works on automatic political ideology detection only leverage textual cues. Drawing inspiration from recent advances in neural inference, we propose a novel attention-based multi-view model to leverage cues from all of the above views to identify the ideology evinced by a news article. Our model draws on advances in representation learning in natural language processing and network science to capture cues from both textual content and the network structure of news articles. We empirically evaluate our model against a battery of baselines and show that our model outperforms the state of the art by 10 percentage points in F1 score.


Introduction
Many issues covered or discussed by the media and politicians today are so subtle that even word choice may require one to adopt a particular ideological position (Iyyer et al., 2014). For example, conservatives tend to use the term tax reform, while liberals use tax simplification. Though objectivity and unbiased reporting remain a cornerstone of professional journalism, several scholars argue that the media displays ideological bias (Gentzkow and Shapiro, 2010; Groseclose and Milyo, 2005; Iyyer et al., 2014). Even if one were to argue that such bias does not reflect a lack of objectivity, prior research (Dardis et al., 2008; Card et al., 2015) notes that the framing of topics can significantly influence policy.
Since manual detection of political ideology is challenging at a large scale, there has been extensive work on developing computational models for automatically inferring the political ideology of articles, blogs, statements, and congressional speeches (Gentzkow and Shapiro, 2010; Iyyer et al., 2014; Preoţiuc-Pietro et al., 2017; Sim et al., 2013). In this paper, we consider the detection of ideological bias at the news article level, in contrast to recent work by Iyyer et al. (2014), who focus on the sentence level, or the work of Preoţiuc-Pietro et al. (2017), who focus on inferring the ideological bias of social media users. Prior research exists on detecting ideological biases of news articles or documents (Gentzkow and Shapiro, 2010; Gerrish and Blei, 2011; Iyyer et al., 2014). However, these works generally model only the text of the news article. In the online world, news articles do not just contain text but have a rich structure to them. Such an online setting influences the article in subtle ways: (a) the choice of the title, since this is what is seen in snippet views online, (b) links to other news media and sources in the article, and (c) the actual textual content itself. Except for the textual content, prior models ignore the rest of these cues. Figure 1 shows an example from The New York Times. Note the presence of hyperlinks in the text, which link to other sources like The Intercept (Figure 1a). We hypothesize that such a link structure is reflective of homophily between news sources sharing similar political ideology, homophily which can be exploited to build improved predictive models (see Figure 1b). Building on this insight, we propose a new model, MVDAM (Multi-view document attention model), to detect the ideological bias of news articles by leveraging cues from multiple views: the title, the link structure, and the article content. Specifically, our contributions are:
1. We propose a generic framework MVDAM to incorporate multiple views of the news article and show that our model outperforms the state of the art by 10 percentage points on the F1 score.
2. We propose a method to estimate the ideological proportions of sources and rank them by the degree to which they lean towards a particular ideology.
3. Finally, differing from most works, which typically focus on congressional speeches, we conduct ideology detection of news articles by assembling a large-scale diverse dataset spanning more than 50 sources.

Related Work
Several works study the detection of political ideology through the lens of computational linguistics and natural language processing (Laver et al., 2003; Monroe and Maeda, 2004; Thomas et al., 2006; Lin et al., 2008; Carroll et al., 2009; Ahmed and Xing, 2010; Gentzkow and Shapiro, 2010; Gerrish and Blei, 2011; Sim et al., 2013). Gentzkow and Shapiro (2010) first attempt to rate the ideological leaning of news sources by proposing a measure called "slant index", which captures the degree to which a particular newspaper uses partisan terms or collocations. Gerrish and Blei (2011) predict the voting patterns of Congress members based on supervised topic models. Other works use topic models to analyze bias in news articles, blogs, and political speeches (Ahmed and Xing, 2010; Lin et al., 2008). Sim et al. (2013) propose a novel HMM-based model to infer the ideological proportions of the rhetoric used by political candidates in their campaign speeches, which relies on a fixed lexicon of bigrams associated with ideologies. The works most closely related to ours are those of Iyyer et al. (2014) and Preoţiuc-Pietro et al. (2017). Iyyer et al. (2014) use recursive neural networks to predict the political ideology of congressional debates and articles in the ideological book corpus (IBC) and demonstrate the importance of compositionality in predicting ideology where modifier phrases and punctuality affect the political ideological position. Preoţiuc-Pietro et al. (2017) propose models to infer the political ideology of Twitter users based on their everyday language. Most crucially, they also show how to effectively use the relationship between user groups to improve prediction accuracy. Our work draws inspiration from both of these works but differs from them in the following aspects: We leverage the structure of a news article by noting that an article is not just free-form text, but has a rich structure to it.
In particular, we model cues from the title, the inferred network, and the content in a joint generic neural variational inference framework to yield improved models for this task. Furthermore, differing from Iyyer et al. (2014), we also incorporate attention mechanisms in our model which enables us to inspect which sentences (or words) have the most predictive power as captured by our model. Finally, since we work with news articles (which also contain hyperlinks), naturally our setting is different from all other previous works in general (which mostly focus on congressional debates) and in particular from Iyyer et al. (2014) where only textual content is modeled or Preoţiuc-Pietro et al. (2017) which focuses on social media users.

Dataset Construction
News Sources. We rely on the data released by ALLSIDES.COM to obtain a list of 59 US-based news sources along with their political ideology ratings: LEFT, CENTER or RIGHT, which specify our target label space. While we acknowledge that there is no "perfect" measure of political ideology, ALLSIDES.COM is an apt choice for two main reasons. First, and most importantly, the ratings are based on a blind survey, where readers are asked to rate news content without knowing the identity of the news source or the author being rated. This is also precisely the setting in which our proposed computational models operate (where the models have access to the content but are agnostic of the source itself), thus seeking to mirror human judgment closely. Second, these ratings are normalized by ALLSIDES to ensure they closely reflect popular opinion and the political diversity present in the United States. These ratings also correlate with independent measurements made by the PEW RESEARCH CENTER. All these observations suggest that these ratings are fairly robust and generally "reflective of the average judgment of the American People".

Content Extraction
Given the set of news sources selected above, we extract the article content for these news sources. We control for time by obtaining article content over a fixed time period for all sources. Specifically, we spider several news sources and perform data cleaning. In particular, the spidering component collates the raw HTML of news sources into a storage engine (MongoDB). We track thousands of US-based news outlets, including nationally popular news sources as well as many local and state news outlets like the Boston Herald. However, in this paper, we consider only the 59 US news sources for which we can derive ground truth labels for political ideology. For each of the news sources considered, we extract the title, the cleaned pre-processed content, and the hyperlinks within the article that reveal the network structure. The label for each article is the label assigned to its source as obtained from ALLSIDES. We choose a random sample of 120,000 articles and create 3 independent splits for training (100,000), validation (10,000) and test (10,000) with a roughly balanced label distribution.

Figure 1: (a) A sample news article. Note the presence of hyperlinks to other sources like The Intercept. (b) Homophily in link structure (viewed in color) of various news sources, which can be observed by noting the presence of clusters corresponding to political ideologies. The blue, orange and green clusters correspond to left, right and center leaning sources respectively. Figure 1a shows a sample article from the New York Times. The presence of such links can provide informative signals for predictive tasks like ideology detection, primarily due to homophily (Figure 1b).

Data Pre-processing and Cleaning. Since the labels were derived from the source, we are careful to remove any systematic features in each article which are trivially reflective of the source, since that would result in over-fitting. In particular, we perform the following operations:
(a) Remove source link mentions. When modeling the link structure of an article, we explicitly remove any link to the source itself. We also explicitly remove any systematic link structures in articles that are source specific. In particular, some sources may always have links to other domains (like their own franchisees or social media sites). These links are identified by their high frequency and removed.
(b) Remove headers, footers, advertisements. News sources systematically introduce headers, footers, and advertisements, which we remove explicitly. For example, every article of The Daily Beast has the footer "You can subscribe to the Daily Beast here", which we filter out.
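The link-cleaning step described above can be sketched as follows. This is only an illustration under stated assumptions: the article record layout (`source`, `links`), the helper names, and the `max_share` threshold of 0.9 are hypothetical, since the text only says that systematic links are detected by their high frequency.

```python
from collections import Counter
from urllib.parse import urlparse


def domain(url):
    """Return the host of a URL, lower-cased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def clean_links(articles, max_share=0.9):
    """Return, for every article, the linked domains that survive cleaning.

    articles: list of dicts with 'source' (a bare domain) and 'links' (URLs).
    A linked domain is dropped if it is the source itself, or if more than
    `max_share` of that source's articles link to it (assumed threshold).
    """
    articles_per_source = Counter(a["source"] for a in articles)
    # In how many articles of each source does each linked domain appear?
    domain_counts = Counter()
    for a in articles:
        for d in {domain(u) for u in a["links"]}:
            domain_counts[(a["source"], d)] += 1

    cleaned = []
    for a in articles:
        src = a["source"]
        kept = []
        for u in a["links"]:
            d = domain(u)
            if d == src:
                continue  # link back to the publishing source
            if domain_counts[(src, d)] / articles_per_source[src] > max_share:
                continue  # systematic, source-specific link
            kept.append(d)
        cleaned.append(kept)
    return cleaned


# Hypothetical toy data: every article of the source links to twitter.com.
toy = [
    {"source": "example-news.com",
     "links": ["https://www.example-news.com/a", "https://twitter.com/x",
               "https://theintercept.com/s"]},
    {"source": "example-news.com",
     "links": ["https://twitter.com/y", "https://reuters.com/z"]},
    {"source": "example-news.com", "links": ["https://twitter.com/w"]},
]
cleaned_toy = clean_links(toy)
```

With a frequency rule like this, a source represented by very few articles would lose most of its links, so in practice the threshold would need to be applied only to sources with enough articles.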

Problem Formulation
Given X = {X_title, X_net, X_content}, which represents a set of multi-modal features of news articles, and a label set Y = {LEFT, CENTER, RIGHT}, we would like to model Pr(Y|X).
Overview of MVDAM. We consider a Bayesian approach with stochastic attention units to effectively model textual cues. Bayesian approaches with stochastic attention have been noted to be quite effective at modeling ambiguity as well as avoiding over-fitting, especially in the case of small training datasets (Miao et al., 2016). In particular, we assume a latent representation h learned from the multiple modalities in X, which is then mapped to the label space Y. In the most general setting, instead of learning a deterministic encoding h given X, we posit a latent distribution Pr(h|X) over the hidden representation h to model the overall document, where Pr(h|X) is parameterized by a diagonal Gaussian distribution N(h | µ(X), σ²(X)).
Specifically, consider the distribution Pr(Y|X), which can be written as follows:

$$\Pr(Y \mid X) = \int_h p(Y \mid h)\, p(h \mid X)\, dh \tag{1}$$

As noted by Miao et al. (2016), computing the exact posterior is in general intractable. Therefore, we posit a variational distribution q_φ(h) and maximize the evidence lower bound L ≤ log Pr(Y|X), namely,

$$\mathcal{L} = \mathbb{E}_{q_\phi(h)}\left[\log p(Y \mid h)\right] - D_{KL}\left(q_\phi(h) \,\|\, p(h \mid X)\right) \tag{2}$$

where p(Y|h) denotes a probability distribution over Y given the latent representation h, and p(h|X) denotes the probability distribution over h conditioned on X.
Equation 2 can be interpreted as consisting of three components, each of which can be modeled separately: (a) Discriminator: p(Y|h) can be viewed as a discriminator given the hidden representation h. Maximizing the first term is thus equivalent to minimizing the cross-entropy loss between the model's prediction and the true labels. (b) The second term, the KL divergence term, consists of two components: (1) Approximate Posterior: the term q_φ(h), also known as the approximate posterior, parameterizes the latent distribution which encodes the multi-modal features X of a document. (2) Prior: the term p(h|X) can be viewed as a prior, which can be uninformative (a standard Gaussian prior in the most general case) or any other prior model based on other features. We now discuss how we model each of these components in detail.

Discriminator
We use a simple feed-forward network with a linear layer that accepts as input the latent hidden representation of X, followed by a ReLU for non-linearity followed by a linear layer and a final softmax layer to model this component.
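A minimal sketch of such a discriminator, written with the Python standard library only so that it stays self-contained. The layer sizes, the random initialization and the helper names are illustrative assumptions, not values taken from the paper.

```python
import math
import random


def linear(x, W, b):
    """Affine map; W is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (assumed initialization)."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def discriminator(h, params):
    """linear -> ReLU -> linear -> softmax over the three labels."""
    (W1, b1), (W2, b2) = params
    return softmax(linear(relu(linear(h, W1, b1)), W2, b2))


rng = random.Random(0)
d, hidden, n_labels = 8, 16, 3  # illustrative sizes
params = (init_layer(d, hidden, rng), init_layer(hidden, n_labels, rng))
probs = discriminator([rng.gauss(0.0, 1.0) for _ in range(d)], params)
```

The output `probs` is a distribution over LEFT, CENTER and RIGHT; the order of the labels in the vector is a convention one would have to fix.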

Approximate Posterior
Here we model the approximate posterior q_φ(h) by an inference network shown succinctly in Figure 2b. The inference network takes as input the features X and learns a corresponding hidden representation h. More specifically, it outputs two components (µ, ς), corresponding to the mean and log-variance of the Gaussian parameterizing the hidden representation h. We model this using a "multi-view" network which incorporates hidden representations learned from multiple modalities into a joint representation. Specifically, given d-dimensional hidden representations z_title, z_network, and z_content corresponding to the multiple modalities, the model first concatenates these representations into a single 3d-dimensional representation z_concat, which is then passed through a 2-layer feed-forward network to output a d-dimensional mean vector µ and a d-dimensional log-variance vector ς that parameterize the latent distribution governing h. We now discuss the models used for capturing each view.
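The fusion step can be sketched as below. The exact layout of the 2-layer network is not spelled out in the text, so this sketch assumes one shared hidden layer with a tanh non-linearity followed by two linear output heads; the sampling function uses the standard Gaussian re-parameterization h = µ + σ·ε with ε drawn from N(0, I).

```python
import math
import random


def linear(x, W, b):
    """Affine map; W is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_in, n_out, rng):
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def fuse_views(z_title, z_network, z_content, params):
    """Concatenate three d-dimensional views and map them to (mu, log_var)."""
    z_concat = z_title + z_network + z_content  # 3d-dimensional
    (W1, b1), (W_mu, b_mu), (W_lv, b_lv) = params
    hidden = [math.tanh(a) for a in linear(z_concat, W1, b1)]  # assumed tanh
    mu = linear(hidden, W_mu, b_mu)
    log_var = linear(hidden, W_lv, b_lv)
    return mu, log_var


def sample_h(mu, log_var, rng):
    """Draw h = mu + sigma * eps, where sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)
d = 4  # illustrative; the experiments use 128
params = (init_layer(3 * d, d, rng), init_layer(d, d, rng), init_layer(d, d, rng))
views = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
mu, log_var = fuse_views(*views, params)
h = sample_h(mu, log_var, rng)
```

Because the noise enters only through `eps`, gradients can flow through `mu` and `log_var`, which is what makes end-to-end training possible.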

Modeling the Title
We learn a latent representation of the title of an article by using a convolutional network. Convolutional networks have been shown to be very effective for modeling short sentences like titles of news articles. In particular, we use the same architecture proposed by Kim (2014). The input words of the title are mapped to word embeddings, concatenated, and passed through convolutional filters of varying window sizes. This is then followed by max-over-time pooling (Collobert et al., 2011). The outputs of this layer are input to a fully connected layer of dimension d with drop-out, which outputs z_title, the latent representation of the title.
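To make the convolution and max-over-time pooling concrete, here is a small sketch for a single title. It assumes a ReLU after each filter and represents a filter as a list of per-position weight vectors; the final fully connected layer with drop-out is omitted, and all names are illustrative.

```python
def conv_max_over_time(embeddings, filt, bias):
    """Apply one filter of window size len(filt) at every position of the
    title, pass each activation through a ReLU, and keep the maximum.

    embeddings: list of word vectors; filt: list of weight vectors, one per
    window position. Titles shorter than the window yield 0.0.
    """
    window = len(filt)
    best = 0.0  # a ReLU output is never below zero
    for start in range(len(embeddings) - window + 1):
        act = bias
        for k in range(window):
            act += sum(w * x for w, x in zip(filt[k], embeddings[start + k]))
        best = max(best, act)
    return best


def title_features(embeddings, filter_bank):
    """One pooled feature per (filter, bias) pair in the bank."""
    return [conv_max_over_time(embeddings, f, b) for f, b in filter_bank]


# Toy title with three 1-dimensional word embeddings and two window-2 filters.
toy_title = [[1.0], [2.0], [-1.0]]
toy_bank = [([[1.0], [1.0]], 0.0), ([[-1.0], [-1.0]], 0.0)]
toy_features = title_features(toy_title, toy_bank)
```

In the toy example the first filter sees the window sums 3 and 1 and keeps 3, while the second filter only produces negative activations, so the ReLU clips its pooled value to 0.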

Modeling the Network Structure of articles
Capturing the network structure of an article consists of two steps: (a) learning a network representation of each source based on its social graph G, and (b) using the learned representation of each source to capture the link structure of a particular article.
We use a state-of-the-art network representation learning algorithm to learn representations of nodes in a social network. In particular, we use Node2Vec (Grover and Leskovec, 2016), which learns a d-dimensional representation of each source given the hyperlink structure graph G. Node2Vec seeks to maximize the log likelihood of observing the neighborhood N(u) of a node, given the node u. Let F be a matrix of size (|V|, d), where V is the set of nodes of G and F(u) represents the embedding of node u. We then maximize the following likelihood function:

$$\max_{F} \sum_{u \in V} \log \Pr(N(u) \mid u)$$

We model the above likelihood similar to the Skipgram architecture (Mikolov et al., 2013) by assuming that the likelihood of observing a node v ∈ N(u) is conditionally independent of any other node in the neighborhood given u. That is,

$$\log \Pr(N(u) \mid u) = \sum_{v \in N(u)} \log \Pr(v \mid u)$$

We then model

$$\Pr(v \mid u) = \frac{e^{F(u) \cdot F(v)}}{\sum_{v' \in V} e^{F(u) \cdot F(v')}}$$

Having fully specified the log likelihood function, we can now optimize it using stochastic gradient ascent.
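The softmax likelihood described here can be evaluated exactly on a tiny graph, as in the sketch below. It is not the Node2Vec implementation: the actual algorithm builds neighborhoods with biased random walks and approximates the normalizer with negative sampling, whereas this sketch sums over all nodes directly. The embedding table `F` is a plain dictionary with made-up values.

```python
import math


def neighbor_prob(F, u, v):
    """Pr(v | u): softmax over dot products between F[u] and every node."""
    scores = {w: sum(a * b for a, b in zip(F[u], F[w])) for w in F}
    m = max(scores.values())  # shift scores for numerical stability
    normalizer = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[v] - m) / normalizer


def neighborhood_log_likelihood(F, u, neighbors):
    """log Pr(N(u) | u) under the conditional independence assumption."""
    return sum(math.log(neighbor_prob(F, u, v)) for v in neighbors)


# Hypothetical 2-dimensional embeddings of three sources.
F = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}
p_ab = neighbor_prob(F, "a", "b")
p_ac = neighbor_prob(F, "a", "c")
ll = neighborhood_log_likelihood(F, "a", ["b", "c"])
```

Sources whose embeddings point in the same direction as `F[u]` receive a higher probability of appearing in the neighborhood of `u`, which is how link homophily ends up encoded in the embedding space.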
Having learned the embedding matrix F for each source node, we now model the link structure of any given article A simply by the average of the network embedding representations for each link l referenced in A. In particular, we compute z_network as:

$$z_{network} = \frac{1}{|A|} \sum_{l \in A} F(l)$$
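The averaging step is simple enough to show directly. The handling of links to sources without an embedding, and of articles without any usable link, is not described in the text; skipping unknown sources and returning a zero vector are assumptions made here for the sake of a complete example.

```python
def network_view(linked_sources, F, d):
    """Average the embeddings of the sources linked from one article.

    linked_sources: source identifiers referenced by the article.
    F: dict mapping a source identifier to its d-dimensional embedding.
    """
    vectors = [F[s] for s in linked_sources if s in F]  # skip unknown sources
    if not vectors:
        return [0.0] * d  # assumed fallback for articles without known links
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical embeddings for two sources.
F = {"theintercept.com": [1.0, 3.0], "reuters.com": [3.0, 1.0]}
z_network = network_view(["theintercept.com", "reuters.com", "unknown.org"], F, 2)
z_empty = network_view([], F, 2)
```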

Modeling the Content of articles
To model the content of an article, we use a hierarchical approach with attention. In particular, we compute attention at both levels: (a) words and (b) sentences. We closely follow the approach of Yang et al. (2016), which learns a latent representation of a document using both word and sentence attention models.
We model the article A hierarchically, by first representing each sentence i with a hidden representation s i . We model the fact that not all words contribute equally in the sentence through a word level attention mechanism. We then learn the representation of the article A by composing these individual sentence level representations with a sentence level attention mechanism.
Learning sentence representations. We first map each word to its embedding through a lookup embedding matrix W. We then learn a hidden representation h_it of the given sentence i centered around word w_it by embedding the sentence through a bi-directional GRU as described by Bahdanau et al. (2014). Since not all words contribute equally to the representation of the sentence, we introduce a word level attention mechanism which attempts to extract relevant words that contribute to the meaning of the sentence. Specifically, we learn the weights of a word level attention matrix W_w as α_it ∝ exp(W_w h_it + b_w), s_i = Σ_t α_it h_it, where s_i is the latent representation of sentence i.

Composing sentence representations. We follow a similar method to learn a latent representation of an article. Given the embedding s_i of each sentence in the article, we learn a hidden representation h_i of the article centered around s_i by embedding the list of sentences through a bi-directional GRU as described by Bahdanau et al. (2014). Once again, since not all sentences contribute equally to the representation of the article, we introduce a sentence level attention mechanism which attempts to extract relevant sentences that contribute to the meaning of the article. Specifically, we learn the weights of a sentence level attention matrix W_s as α_i ∝ exp(W_s h_i + b_s), z_content = Σ_i α_i h_i, where z_content is the latent representation of the article. In this case we let the hidden representation of the sentence be a stochastic representation similar to the work by Miao et al. (2016) and use the Gaussian re-parameterization trick to enable training via end-to-end gradient based methods. Such techniques have been shown to be useful in modeling ambiguity and also generalize well to small training datasets (Miao et al., 2016).
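Both attention levels share the same pooling pattern: score each hidden state, normalize the scores with a softmax, and take the weighted sum. The sketch below illustrates that pattern under the simplifying assumption that the attention parameters reduce to a weight vector `w` and a scalar bias `b`, so each hidden state receives a scalar score; the bi-directional GRU that would produce the hidden states is not included.

```python
import math


def attention_pool(hidden_states, w, b):
    """Return (alphas, pooled) with alpha_i proportional to exp(w . h_i + b).

    hidden_states: list of equally sized vectors (word or sentence level).
    """
    scores = [sum(wj * hj for wj, hj in zip(w, h)) + b for h in hidden_states]
    m = max(scores)  # shift scores for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    size = len(hidden_states[0])
    pooled = [sum(a * h[j] for a, h in zip(alphas, hidden_states))
              for j in range(size)]
    return alphas, pooled


states = [[1.0, 0.0], [0.0, 1.0]]
uniform_alphas, uniform_pooled = attention_pool(states, [0.0, 0.0], 0.0)
peaked_alphas, _ = attention_pool(states, [10.0, 0.0], 0.0)
```

The returned `alphas` are the quantities one would inspect to see which words or sentences the model weighted most heavily.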

Prior
The prior models p(h|X) in Equation 2. Note that our proposed framework is general and can be used to incorporate a variety of priors. Here, we assume the prior is drawn from a Gaussian distribution with diagonal covariances. The KL divergence term in Equation 2 can thus be analytically computed. In particular, the KL divergence between two K-dimensional Gaussian distributions A, B with means µ_A, µ_B and diagonal covariances κ_A, κ_B is:

$$KL(A \,\|\, B) = \frac{1}{2} \sum_{k=1}^{K} \left[ \log \frac{\kappa_{B,k}}{\kappa_{A,k}} + \frac{\kappa_{A,k} + (\mu_{A,k} - \mu_{B,k})^2}{\kappa_{B,k}} - 1 \right]$$

Parameter Estimation. Having described the models for each of the components in Equation 2, we can reformulate the maximization of the variational lower bound as the minimization of the following loss function J(θ) over the set of all learnable model parameters θ:

$$J(\theta) = NLL(\hat{y}, y) + \lambda \, D_{KL}\left(q_\phi(h) \,\|\, p(h \mid X)\right)$$

where NLL is the negative log likelihood loss computed between the predicted label and the true label, and λ is a hyper-parameter that controls the amount of regularization offered by the KL divergence term. We use ADADELTA to minimize this loss function.
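The closed-form KL divergence between two diagonal Gaussians and the resulting loss can be sketched as follows. The sketch treats κ as per-dimension variances and assumes a standard Gaussian prior (zero mean, unit variance), which is the uninformative case mentioned earlier; the function names and the default λ = 1.0 are illustrative.

```python
import math


def kl_diag_gaussians(mu_a, var_a, mu_b, var_b):
    """KL(A || B) for Gaussians with diagonal covariances (variances given)."""
    return 0.5 * sum(
        math.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0
        for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b)
    )


def loss(probs, true_index, mu, log_var, lam=1.0):
    """J = NLL + lam * KL(q || prior), with an assumed standard Gaussian prior."""
    nll = -math.log(probs[true_index])
    var = [math.exp(lv) for lv in log_var]
    kl = kl_diag_gaussians(mu, var, [0.0] * len(mu), [1.0] * len(mu))
    return nll + lam * kl


same = kl_diag_gaussians([0.3, -1.0], [2.0, 0.5], [0.3, -1.0], [2.0, 0.5])
shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
example_loss = loss([0.5, 0.25, 0.25], 0, [0.0, 0.0], [0.0, 0.0])
```

When the approximate posterior equals the prior the KL term vanishes and the loss reduces to the negative log likelihood, which is why `example_loss` equals log 2 for a predicted probability of 0.5.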

Experiments
We evaluate our model against several competitive baselines, each of which models only a single view, to place our model in context:
1. Chance Baseline: We consider a simple baseline that returns a draw from the label distribution as the prediction.
2. Logistic Regression LR (Title): We consider a bag of words classifier using Logistic Regression that can capture linear relationships in the feature space and use the words of the title as the feature set.
3. CNN (Title): We consider a convolutional net classifier based on exactly the same architecture as Kim (2014), which uses the title of the news article. Convolutional nets have been shown to be extremely effective at classifying short pieces of text and can capture non-linearities in the feature space (Kim, 2014).
4. FNN (Network): We also consider a simple fully-connected feed forward neural network using only the network features to characterize the predictive power of the network alone.
5. HDAM Model (Content): We use the state of the art hierarchical document attention model proposed by Yang et al. (2016), which models the content of the article using both word and sentence level attention mechanisms.
We consider three different flavors of our proposed model, which differ in the subset of modalities used: (a) Title and Network, (b) Title and Content, and (c) Full model: Title, Network, and Content. We train all of our models and the baselines on the training data set, choosing all hyper-parameters using the validation set. We report the performance of all models on the held-out test set.

Experimental Settings
We set the latent embedding dimension captured by each view to 128, including the final latent representation obtained by fusing multiple modalities. For the CNNs, we consider three convolutional filters of window sizes 3, 4 and 5, each yielding a 100 dimensional feature map, followed by max-over-time pooling, which is then passed through a fully connected layer to yield the output. In all the neural models, we use AdaDelta with an initial learning rate of 1.0 to learn the parameters of the model via back-propagation.

Table 1: Precision, Recall, and F1 scores of our model MVDAM on the test set compared with several baselines. All flavors of our model significantly outperform baselines and yield state of the art performance.

Results and Analysis
Quantitative Results. Table 1 shows the results of the evaluation. First, note that the logistic regression classifier and the CNN model using the title outperform the CHANCE classifier significantly (F1: 59.12, 59.24 vs 34.53). Second, modeling only the network structure yields an F1 of 55.10, which is still significantly better than the chance baseline. This confirms our intuition that modeling the network structure can be useful in the prediction of ideology. Third, note that modeling the content (HDAM) significantly outperforms all previous baselines (F1: 68.92). This suggests that content cues can be very strong indicators of ideology. Finally, all flavors of our model outperform the baselines. Specifically, observe that incorporating the network cues outperforms all uni-modal models that only model either the title, the network, or the content. It is also worth noting that without the network, combining the title and the content yields only a small improvement over the best performing baseline (69.54 vs 68.92), suggesting that the network yields cues distinct from both the title and the content. Finally, the best performing model effectively uses all three modalities to yield an F1 score of 79.67, outperforming the state of the art baseline by 10 percentage points. Altogether, our results suggest the superiority of our model over competitive baselines. In order to obtain deeper insights into our model, we also perform a qualitative analysis of our model's predictions.
Visualizing Attention Scores. Figure 3 shows a visualization of sentences based on their attention scores. Note that for a left-leaning article (see Figure 3a), the model focuses on sentences involving gun-control, feminists, and transgender. In contrast, a visualization of sentence attention scores for an article which the model predicted as "right-leaning" (see Figure 3b) reveals a focus on words like god, religion etc. These observations qualitatively suggest that the model is able to effectively pick up on content cues present in the article. By examining the distribution over the sentence indices corresponding to the maximum attention scores, we noted that in only about half the instances does the model focus its greatest attention on the beginning of the article, suggesting that the ability to selectively focus on sentences in the news article contributes to the superior performance.
Challenging Cases. In Table 2, we highlight some of the challenges of our model. In particular, our model finds it quite challenging to identify the political ideology of the source for articles that are non-political and related to global events or entertainment. Examples include instances like Tourist dies hiking in Australia Outback heat or Juan Williams makes the 'case for Oprah'. We also note that articles with "click-baity" titles like We are all Just Overclocked Chimpanzees are not necessarily discriminative of the underlying ideology. In summary, while our proposed model significantly advances the state of the art, it also suggests scope for further improvement, especially in identifying political ideologies of articles in topics like Entertainment or Sports. For example, prior research suggests that engagement in particular sports is correlated with political leanings (Hoberman, 1977), which suggests that improved models might need to capture deeper linguistic and contextual cues.
Ideological Proportions of News Sources. Finally, we compute the expected proportion of an ideology in a given source based on the probability estimates output by our model for the various articles. While one might expect that the expected degree of "left-ness" (or "right-ness") for a given source can easily be computed by taking a simple mean of the prediction probability for the given ideology over all articles belonging to the source, such an approach can be inaccurate because the probability estimates output by the model are not necessarily calibrated and therefore cannot be interpreted as a confidence value. We therefore use isotonic regression to calibrate the probability scores output by the model. Having calibrated the probability scores, we now compute the degree to which a particular news source leans toward an ideology by simply computing the mean calibrated score over all articles corresponding to the source. Table 3 shows the top 10 sources ranked according to their proportions for each ideology. We note that sources like CNN, Buzz Feed, and SF Chronicle are considered more left-leaning than the Washington Post. Similarly, we note that NPR and Reuters are considered to be the most center-aligned, while Breitbart, Infowars and Blaze are considered to be the most right-aligned by our model. These observations are moderately aligned with survey results that place news sources on the ideology spectrum based on the political beliefs of their consumers.
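The calibrate-then-average procedure can be illustrated with a small pool-adjacent-violators implementation of isotonic regression. This is a sketch under assumptions the text does not confirm: calibration is done one ideology at a time on binary labels (1 if the article carries that ideology), the calibration data would come from held-out labeled articles, and the fitted function is applied as a step function.

```python
from collections import defaultdict


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit; returns [(upper_score, value), ...] with
    non-decreasing values, one entry per pooled block."""
    blocks = []  # each block is [label_sum, count, highest_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            blocks[-1][0] += y  # tied scores always share one block
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Merge backwards while the block means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, high = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = high
    return [(high, total / count) for total, count, high in blocks]


def calibrate(score, model):
    """Map a raw score to the value of the first block that covers it."""
    for high, value in model:
        if score <= high:
            return value
    return model[-1][1]


def source_leaning(article_sources, article_scores, model):
    """Mean calibrated score per source for one ideology."""
    per_source = defaultdict(list)
    for src, score in zip(article_sources, article_scores):
        per_source[src].append(calibrate(score, model))
    return {src: sum(v) / len(v) for src, v in per_source.items()}


# Toy calibration data: raw model scores and whether the label was LEFT.
model = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
leaning = source_leaning(["A", "A", "B"], [0.25, 0.4, 0.05], model)
```

Ranking sources by the values in `leaning` gives an ordering of the kind reported for each ideology; a library implementation of isotonic regression would typically interpolate between blocks rather than use a step function.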

Conclusion
We proposed a model to leverage cues from multiple views in the predictive task of detecting the political ideology of news articles. We show that incorporating cues from the title, the link structure and the content significantly beats the state of the art. Finally, using the predicted probabilities of our model, we draw on methods for probability calibration to rank news sources by their ideological proportions, a ranking which moderately correlates with existing surveys on the ideological placement of news sources. To conclude, our proposed framework effectively leverages cues from multiple views to yield state of the art, interpretable performance and sets the stage for future work which can easily incorporate other modalities like audio, video and images.