Hierarchical Structured Model for Fine-to-Coarse Manifesto Text Analysis

Election manifestos document the intentions, motives, and views of political parties. They are often used for analysing a party’s fine-grained position on a particular issue, as well as for coarse-grained positioning of a party on the left–right spectrum. In this paper we propose a two-stage model for automatically performing both levels of analysis over manifestos. In the first step we employ a hierarchical multi-task structured deep model to predict fine- and coarse-grained positions, and in the second step we perform post-hoc calibration of coarse-grained positions using probabilistic soft logic. We empirically show that the proposed model outperforms state-of-the-art approaches at both granularities using manifestos from twelve countries, written in ten different languages.


Introduction
The adoption of NLP methods has led to significant advances in the field of computational social science (Lazer et al., 2009), including political science (Grimmer and Stewart, 2013). Among a myriad of data sources, election manifestos are a core artifact in political analysis. One of the most widely used datasets by political scientists is the Comparative Manifesto Project (CMP) dataset (Volkens et al., 2017), which contains manifestos in various languages, covering over 1000 parties across 50 countries, from elections dating back to 1945.
In CMP, a subset of the manifestos has been manually annotated at the sentence level with one of 57 political themes, divided into 7 major categories (see https://manifesto-project.wzb.eu/coding_schemes/mp_v5). Such categories capture party positions (FAVORABLE, UNFAVORABLE or NEITHER) on fine-grained policy themes, and are also useful for downstream tasks including calculating manifesto-level (policy-based) left-right position scores (Budge et al., 2001; Lowe et al., 2011; Däubler and Benoit, 2017). An example sentence from the Green Party of England and Wales 2015 election manifesto where they take an UNFAVORABLE position on MILITARY is: We would: Ensure that ... less is spent on military research.
Elsewhere, they take a FAVORABLE position on WELFARE STATE: Double Child Benefit.
Such manual annotations are labor-intensive and prone to annotation inconsistencies.
In order to overcome these challenges, supervised sentence classification approaches have been proposed (Verberne et al., 2014;Subramanian et al., 2017).
Other than the sentence-level labels, the manifesto text also has a document-level score that quantifies its position on the left-right spectrum. Different approaches have been proposed to derive this score, based on alternate definitions of "left-right" (Slapin and Proksch, 2008;Benoit and Laver, 2007;Lo et al., 2013;Däubler and Benoit, 2017). Among these, the RILE index is the most widely adopted (Merz et al., 2016;Jou and Dalton, 2017), and has been shown to correlate highly with other popular scores (Lowe et al., 2011). RILE is defined as the difference between RIGHT and LEFT positions on (pre-determined) policy themes across sentences in a manifesto (Volkens et al., 2013); for instance, UNFAVORABLE position on MILITARY is categorized as LEFT. RILE is popular in CMP in particular, as mapping individual sentences to LEFT/RIGHT/NEUTRAL categories has been shown to be less sensitive to systematic errors than other sentence-level class sets (Klingemann et al., 2006;Volkens et al., 2013).
Finally, expert survey scores are gaining popularity as a means of capturing manifesto-level political positions, and are considered to be context- and time-specific, unlike RILE (Volkens et al., 2013; Däubler and Benoit, 2017). We use the Chapel Hill Expert Survey (CHES) (Bakker et al., 2015), which comprises aggregated expert surveys on the ideological position of various political parties. Although CHES is more subjective than RILE, the CHES scores are considered to be the gold standard in the political science domain.
In this work, we address both fine- and coarse-grained multilingual manifesto text policy position analysis, through joint modeling of sentence-level classification and document-level positioning (or ranking) tasks. We employ a two-level structured model, in which the first level captures the structure within a manifesto, and the second level captures context and temporal dependencies across manifestos. Our contributions are as follows:
• we employ a hierarchical sequential deep model that encodes the structure in manifesto text for the sentence classification task;
• we capture the dependency between the sentence- and document-level tasks, and also utilize additional label structure (categorization into LEFT/RIGHT/NEUTRAL: Volkens et al. (2013)) using a joint-structured model;
• we incorporate contextual information (such as political coalitions) and encode temporal dependencies to calibrate the coarse-level manifesto position using probabilistic soft logic (Bach et al., 2015), which we evaluate on the prediction of the RILE index or expert survey party position score.

Related Work
Analysing manifesto text is a relatively new application at the intersection of political science and NLP. One line of work in this space has been on sentence-level classification, including classifying each sentence according to its major political theme (1-of-7 categories) (Zirn et al., 2016; Glavaš et al., 2017a), its position on various policy themes (Verberne et al., 2014; Biessmann, 2016; Subramanian et al., 2017), or its relative disagreement with other parties (Menini et al., 2017). Recent approaches (Glavaš et al., 2017a; Subramanian et al., 2017) have also handled multilingual manifesto text (given that manifestos span multiple countries and languages; see Section 5.1) using multilingual word embeddings. At the document level, there has been work on using label count aggregation of (manually-annotated) fine-grained policy positions, as features for inductive analysis (Lowe et al., 2011; Däubler and Benoit, 2017).
Text-based approaches have used dictionary-based supervised methods, unsupervised factor analysis based techniques and graph propagation based approaches (Hjorth et al., 2015; Bruinsma and Gemenis, 2017; Glavaš et al., 2017b). A recent paper closely aligned with our work is Subramanian et al. (2017), who address both sentence- and document-level tasks jointly in a multilingual setting, showing that a joint approach outperforms previous approaches. However, they do not exploit the structure of the text and use a much simpler model architecture (averages of word embeddings, versus our bi-LSTM encodings), and they do not leverage domain information and temporal regularities that can influence policy positions (Greene, 2016). This work will act as a baseline in our experiments in Section 5.
Policy-specific position classification can be seen as related to target-specific stance classification (Mohammad et al., 2017), except that the target is not explicitly mentioned in most cases. Secondly, manifestos have both fine- and coarse-grained positions, similar to sentiment analysis (McDonald et al., 2007). Finally, manifesto text is well structured within and across documents (based on coalition), has temporal dependencies, and is multilingual in nature.

Proposed Approach
In this section, we detail the first step of our two-stage approach. We use a hierarchical bidirectional long short-term memory ("bi-LSTM") model (Hochreiter and Schmidhuber, 1997;Graves et al., 2013;Li et al., 2015) with a multi-task objective for the sentence classification and document-level regression tasks. A post-hoc calibration of coarse-grained manifesto position is given in Section 4.
Let $D$ be the set of manifestos, where a manifesto $d \in D$ is made up of $L$ sentences, and a sentence $s_i$ has $T$ words: $w_{i1}, w_{i2}, \ldots, w_{iT}$. The set $D_s \subset D$ is annotated at the sentence level with positions on fine-grained policy issues (57 classes). The task here is to learn a model that can: (a) classify sentences according to policy issue classes; and (b) score the overall document on the policy-based left-right spectrum (RILE), in an inter-dependent fashion.
Word encoder: We initialize word vector representations using a multilingual word embedding matrix, $W_e$. We construct $W_e$ by aligning the embedding matrices of all the languages to English, in a pair-wise fashion. Bilingual projection matrices are built using pre-trained FastText monolingual embeddings (Bojanowski et al., 2017) and a dictionary $D$ constructed by translating 5000 frequent English words using Google Translate. Given a pair of embedding matrices $E$ (English) and $O$ (Other), we use the singular value decomposition of $O^\top D E$ (which is $U \Sigma V^\top$) to get the projection matrix ($W^* = U V^\top$), since it also enforces monolingual invariance (Artetxe et al., 2016; Smith et al., 2017). Finally, we obtain the aligned embedding matrix, $W_e$, as $O W^*$.
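The alignment step can be illustrated with a small NumPy sketch. It assumes that the dictionary is encoded as a binary matrix `D` whose entry (i, j) is 1 when word i of the other language translates to English word j; that encoding, the function name `align_to_english` and the toy data are illustrative assumptions, not details given in the text.

```python
import numpy as np

def align_to_english(O, E, D):
    """Align embeddings O (other language) to E (English).

    O: (n_other, dim), E: (n_english, dim),
    D: (n_other, n_english) binary dictionary matrix (assumed encoding).
    Returns the aligned matrix O @ W* and the projection W* = U V^T,
    where U S V^T is the SVD of O^T D E.
    """
    U, _, Vt = np.linalg.svd(O.T @ D @ E)
    W_star = U @ Vt  # orthogonal by construction
    return O @ W_star, W_star

# Toy check: E is an exact rotation of O, and every word is in the dictionary.
rng = np.random.default_rng(0)
dim, n_words = 4, 50
O = rng.normal(size=(n_words, dim))
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
E = O @ Q
D = np.eye(n_words)

aligned, W_star = align_to_english(O, E, D)
```

Because `W_star` is orthogonal, distances and dot products within the projected language are unchanged, which is one way to read the monolingual invariance property mentioned above.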
We use a bi-LSTM to derive a vector representation of each word in context. The bi-LSTM traverses the sentence $s_i$ in both the forward and backward directions, and the encoded representation for a given word $w_{it} \in s_i$ is defined by concatenating its forward ($\overrightarrow{h}_{it}$) and backward ($\overleftarrow{h}_{it}$) hidden states, $t \in [1, T]$.
Sentence model: Similarly, we use a bi-LSTM to generate a sentence embedding from the word-level bi-LSTM, where each input sentence $s_i$ is represented using the last hidden state of both the forward and backward LSTMs. The sentence embedding is obtained by concatenating the hidden representations of the sentence-level bi-LSTM in both directions, $h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$. With this representation, we perform fine-grained classification (to one-of-57 classes), using a softmax output layer for each sentence. We minimize the cross-entropy loss for this task, over the sentence-level labeled set $D_s \subset D$. This loss is denoted $L_S$.
Document model: To represent a document $d$ we use average-pooling over the sentence representations $h_i$ and predicted output distributions ($y_i$) of individual sentences, i.e., $V_d = \frac{1}{L} \sum_{i \in d} [y_i ; h_i]$. (Preliminary experiments suggested that this representation performs better than using either hidden representations or just the output distribution.) The document-level target is the RILE score, which we scale to the range [−1, 1], and model using a final tanh layer. We minimize the mean-squared error loss function between the predicted $\hat{r}_d$ and actual RILE score $r_d$, which is denoted as $L_D$:
$$L_D = \frac{1}{|D|} \sum_{d \in D} (\hat{r}_d - r_d)^2 \tag{1}$$
Overall, the loss function for the joint model (Figure 1), combining $L_S$ and $L_D$, is:
$$L_J = \alpha L_S + (1 - \alpha) L_D \tag{2}$$
where $0 \le \alpha \le 1$ is a hyper-parameter which is tuned on a development set.
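As a rough illustration of the document model and the multi-task objective, the sketch below average-pools the concatenation of sentence output distributions and hidden vectors, and combines the two losses as a convex combination $\alpha L_S + (1 - \alpha) L_D$. That weighting is an assumption, chosen because it is consistent with $\alpha = 1$ being described later as the sentence-only model; the function names and the toy values (3 classes rather than 57) are hypothetical.

```python
import numpy as np

def document_vector(y, h):
    """Average-pool [y_i; h_i] over the sentences of one document."""
    y, h = np.asarray(y, dtype=float), np.asarray(h, dtype=float)
    return np.concatenate([y, h], axis=1).mean(axis=0)

def cross_entropy(probs, labels):
    """Mean negative log-probability of the gold sentence labels (L_S)."""
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

def mse(pred, gold):
    """Mean-squared error between predicted and gold scaled RILE (L_D)."""
    pred, gold = np.asarray(pred, dtype=float), np.asarray(gold, dtype=float)
    return float(np.mean((pred - gold) ** 2))

def joint_loss(l_s, l_d, alpha):
    """Assumed convex combination of sentence and document losses."""
    return alpha * l_s + (1.0 - alpha) * l_d

# Toy document with two sentences, 3 classes, 2-dimensional hidden states.
y = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
h = np.array([[0.5, -0.5], [0.1, 0.3]])
v_d = document_vector(y, h)

l_s = cross_entropy(y, [0, 1])
l_d = mse([0.2], [0.1])
loss = joint_loss(l_s, l_d, alpha=0.3)
```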

Joint-Structured Model
The RILE score is calculated directly from the sentence labels, based on mapping each label according to its positioning on policy themes, as LEFT, RIGHT and NEUTRAL (Volkens et al., 2013). Specifically, 13 out of 57 classes are categorized as LEFT, another 13 as RIGHT, and the rest as NEUTRAL. We employ an explicit structured loss which minimizes the deviation between sentence-level LEFT/RIGHT/NEUTRAL polarity predictions $p$ and the document-level RILE score. The motivation to do this is two-fold: (a) enabling interaction between the sentence- and document-level tasks with a homogeneous target space (polarity and RILE); and (b) since we have more documents with just RILE and no sentence-level labels, augmenting an explicit semi-supervised learning objective could propagate down the RILE label to generate sentence labels that concord with the document score.
For the sentence-level polarity prediction (shown in Figure 1), we use cross-entropy loss over the sentence-level labeled set $D_s \subset D$, which is denoted as $L_{SP}$. The explicit structured sentence-document loss is given as:
$$L_{struc} = \frac{1}{|D|} \sum_{d \in D} \Big( \frac{1}{L_d} \sum_{i \in d} \big( p_i^{right} - p_i^{left} \big) - r_d \Big)^2 \tag{3}$$
where $p_i^{right}$ and $p_i^{left}$ are the predicted RIGHT and LEFT class probabilities for a sentence $s_i$ ($\in d$), $r_d$ is the actual RILE score for the document $d$, and $L_d$ is the length (number of sentences) of document $d \in D$.
We augment the joint model's loss function (Equation (2)) with $L_{SP}$ and $L_{struc}$ to generate a regularized multi-task loss:
$$L_T = \alpha L_S + (1 - \alpha) L_D + \beta L_{SP} + \gamma L_{struc} \tag{4}$$
where $\beta, \gamma \ge 0$ are hyper-parameters which are, once again, tuned on the development set. We refer to the model trained with Equation (2) as "Joint", and that trained with Equation (4) as "Joint_struc".
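The structured loss can be sketched as follows, under the assumption that it is the squared deviation between a document's mean sentence-level polarity (RIGHT probability minus LEFT probability) and its scaled RILE score, averaged over documents. The exact form and the way the terms are weighted in the total loss are a reading of the description above, and all names and numbers are illustrative.

```python
import numpy as np

def structured_loss(p_right, p_left, rile):
    """Squared gap between mean sentence polarity and document RILE.

    p_right, p_left: one sequence of per-sentence probabilities per document.
    rile: scaled RILE score in [-1, 1] per document.
    """
    doc_polarity = np.array([
        np.mean(np.asarray(r, dtype=float) - np.asarray(l, dtype=float))
        for r, l in zip(p_right, p_left)
    ])
    return float(np.mean((doc_polarity - np.asarray(rile, dtype=float)) ** 2))

def total_loss(l_s, l_d, l_sp, l_struc, alpha, beta, gamma):
    """Assumed weighting: joint loss plus beta * L_SP plus gamma * L_struc."""
    return alpha * l_s + (1.0 - alpha) * l_d + beta * l_sp + gamma * l_struc

# Two toy documents: the first matches its RILE exactly, the second does not.
p_right = [[0.6, 0.2], [0.1]]
p_left = [[0.2, 0.2], [0.5]]
rile = [0.2, 0.0]
l_struc = structured_loss(p_right, p_left, rile)
```

Note that this term needs only document-level RILE scores, which is why it can also be computed for manifestos that lack sentence-level labels.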

Manifesto Position Re-ranking
We leverage party-level information to enforce smoothness and regularity in manifesto positioning on the left-right spectrum (Greene, 2016). For example, manifestos released by parties in a coalition are more likely to be closer in RILE score, and a party's position in an election is often a relative shift from its position in earlier elections, so temporal information can provide smoother estimations.

Probabilistic Soft Logic
To address this, we propose an approach using hinge-loss Markov random fields ("HL-MRFs"), a scalable class of continuous, conditional graphical models (Bach et al., 2013). HL-MRFs have been used for many tasks including political framing analysis on Twitter (Johnson et al., 2017) and user stance classification on socio-political issues (Sridhar et al., 2014). These models can be specified using Probabilistic Soft Logic ("PSL") (Bach et al., 2015), a weighted first-order logical template language. An example of a PSL rule is
$$\lambda : P(a) \wedge Q(a, b) \rightarrow R(b)$$
where P, Q, and R are predicates, a and b are variables, and λ is the weight associated with the rule. PSL uses soft truth values for predicates in the interval [0, 1]. The degree of ground rule satisfaction is determined using the Lukasiewicz t-norm and its corresponding co-norm as the relaxation of the logical AND and OR, respectively. The weight of the rule indicates its importance in the HL-MRF probabilistic model, which defines a probability density function of the form:
$$P(Y \mid X) \propto \exp\Big( -\sum_{r} \lambda_r \, \phi_r(Y, X) \Big), \qquad \phi_r(Y, X) = \max\{ l_r(Y, X), 0 \}^{\rho_r} \tag{5}$$
where $\phi_r(Y, X)$ is a hinge-loss potential corresponding to an instantiation of a rule, and is specified by a linear function $l_r$ and optional exponent $\rho_r \in \{1, 2\}$. Note that the hinge-loss potential captures the distance to satisfaction (footnote 4).
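To make the notion of distance to satisfaction concrete, here is a minimal sketch for a single ground rule of the form P ∧ Q → R, using the Lukasiewicz relaxations. The function names and the example truth values are illustrative; the weighted potential shows only one term of the exponent in an HL-MRF, not a full inference procedure.

```python
def lukasiewicz_and(a, b):
    """Lukasiewicz t-norm: soft AND over truth values in [0, 1]."""
    return max(a + b - 1.0, 0.0)

def lukasiewicz_or(a, b):
    """Lukasiewicz co-norm: soft OR over truth values in [0, 1]."""
    return min(a + b, 1.0)

def distance_to_satisfaction(p, q, r):
    """Distance to satisfaction of the ground rule P AND Q -> R."""
    return max(p + q - r - 1.0, 0.0)

def weighted_potential(p, q, r, weight, rho=1):
    """One weighted hinge-loss term: weight * max(l_r, 0) ** rho."""
    return weight * distance_to_satisfaction(p, q, r) ** rho

# The body is strongly true but the head is weakly true: the rule is violated.
violated = distance_to_satisfaction(0.9, 0.8, 0.3)
# The body is only weakly true: the rule is fully satisfied.
satisfied = distance_to_satisfaction(0.5, 0.4, 0.0)
```

With `rho=2` the penalty grows quadratically in the violation, so small violations are penalized less sharply than with `rho=1`.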

PSL Model
Here we elaborate on our PSL model (given in Table 1) based on coalition information, manifesto content-based features (manifesto similarity and right-left ratio), and temporal dependency. Our target pos (calibrated RILE) is a continuous variable in [0, 1], where 1 indicates that a manifesto occupies an extreme right position, 0 denotes an extreme left position, and 0.5 indicates center. Each instance of a manifesto and its party affiliation are denoted by the predicates Manifesto and Party.
Table 1 (excerpt): PSL model. Recoverable entries include the rule Manifesto(x) ∧ Party(x, a) ∧ Manifesto(y) ∧ Party(y, b) ∧ Recent(x, y) ∧ EUCoalition(a, b) ∧ ¬pos(x) → ¬pos(y), a group of Transitivity rules, and PSL_esim, the similarity-based relational feature (Figure 1). Except for pos, other values are fixed in the network. The domain of y for SameElec(x, y) is within the country, and for Recent(x, y) covers all the countries. ¬ denotes negation. Distance to satisfaction for each ground rule is obtained using a hinge-loss potential, which is then used inside the HL-MRF model (Equation (5)), where pos is Y.
Footnote 4: The degree of satisfaction for the example PSL rule r, ¬P ∨ ¬Q ∨ R, using the Lukasiewicz co-norm is given as min{2 − P − Q + R, 1}. From this, the distance to satisfaction is given as max{P + Q − R − 1, 0}, where P + Q − R − 1 indicates the linear function $l_r$.
Coalition: We model multi-relational networks based on regional coalitions within a given country (RegCoalition), and also cross-country coalitions in the European parliament (EUCoalition; http://www.europarl.europa.eu). We set the scope of interaction between manifestos (x and y) from a country to the same election (SameElec). For manifestos across countries, we consider only the most recent manifesto (Recent) from each party (y), released within 4 years relative to x. We use a logistic transformation of the number of times two parties have been in a coalition in the past (to get a value between 0 and 1), for both RegCoalition and EUCoalition. We also construct rules based on transitivity for both the relational features, i.e., parties which have had common coalition partners, even if they were not allies themselves, are likely to have similar policy positions.
Manifesto similarity: Manifestos that are similar in content are expected to have similar RILE scores (and associated sentence-level label distributions), similar to the modeling intuition captured by Burford et al. (2015) in the context of congressional debate vote prediction. For a pair of recent manifestos (Recent) we use the cosine similarity (Similarity) between their respective document vectors $V_d$ (Figure 1).
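The two relational features described above might be computed along the following lines. The cosine similarity is standard; the logistic transformation of coalition counts is shown as a plain sigmoid, which is an assumption since the text does not give its parameterization, and both function names are hypothetical.

```python
import math
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two document vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def coalition_strength(n_past_coalitions):
    """Squash a coalition count into (0, 1) with a plain sigmoid (assumed form)."""
    return 1.0 / (1.0 + math.exp(-n_past_coalitions))

similarity = cosine_similarity([0.4, 0.5, 0.1], [0.3, 0.6, 0.1])
strength = coalition_strength(3)
```

Under this assumed form, parties that have never governed together receive a value of 0.5, so a shifted or scaled sigmoid may be preferable in practice if such pairs should map close to 0.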

Right-left ratio: For a given manifesto, we compute the ratio of sentences categorized under RIGHT to all sentences ($\frac{\#\text{RIGHT}}{\#\text{RIGHT} + \#\text{LEFT} + \#\text{NEUTRAL}}$), where the categorization for sentences is obtained using the joint-structured model (Equation (4)). We also encode the location of sentence $l_s$ in a document, by weighting the count of sentences for each class $C$ by its location value $\sum_{s \in C} \log(l_s + 1)$ (referred to as loc_lr). The intuition here is that the beginning parts of a manifesto tend to contain generic information such as a preamble, compared to later parts which are more policy-dense. We perform a logistic transformation of loc_lr to derive the LwRightLeftRatio.
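A small sketch of the location-weighted ratio follows. It assumes that the location of a sentence is its 1-based index in the manifesto and that the weighted ratio replaces each class count with the sum of log(location + 1) over the sentences of that class; both are readings of the description above rather than confirmed details, and the function names are hypothetical.

```python
import math

def right_left_ratio(polarities):
    """Share of sentences predicted as RIGHT."""
    return polarities.count("RIGHT") / len(polarities)

def location_weighted_ratio(polarities):
    """RIGHT share where each sentence counts log(position + 1), position 1-based."""
    weights = {"RIGHT": 0.0, "LEFT": 0.0, "NEUTRAL": 0.0}
    for position, label in enumerate(polarities, start=1):
        weights[label] += math.log(position + 1)
    return weights["RIGHT"] / sum(weights.values())

# Predicted polarities in document order: RIGHT sentences occur late.
polarities = ["NEUTRAL", "LEFT", "RIGHT", "RIGHT"]
plain = right_left_ratio(polarities)
weighted = location_weighted_ratio(polarities)
```

In this toy document the RIGHT sentences appear towards the end, so the weighted ratio is higher than the plain one, reflecting the intuition that later sentences are more policy-dense.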
Temporal dependency: We capture the temporal dependency between a party's current manifesto position and its previous manifesto position (PreviousManifesto).
Other than for the look-up based random variables, the network is instantiated with predictions (for Similarity, LwRightLeftRatio and pos) from the joint-structured model (Figure 1). All the random variables, except pos (which is the target variable), are fixed in the network. These values are then used inside a PSL model for collective probabilistic reasoning, where the first-order logic given in Table 1 is used to define the graphical model (HL-MRF) over the random variables detailed above. Inference on the HL-MRF is used to obtain the most probable interpretation such that it satisfies most ground rule instances, i.e., considering the relational and temporal dependencies.

Experimental Setup
As our dataset, we use manifestos from CMP for European countries only, as in Section 5.5 we will validate the manifesto's overall position on the left-right spectrum, using the Chapel Hill Expert Survey (CHES), which is only available for European countries (Bakker et al., 2015). From this, we sample 1004 manifestos from 12 European countries, written in 10 different languages: Danish (Denmark), Dutch (Netherlands), English (Ireland, United Kingdom), Finnish (Finland), French (France), German (Austria, Germany), Italian (Italy), Portuguese (Portugal), Spanish (Spain), and Swedish (Sweden). Out of the 1004 manifestos, 272 are annotated with both sentence-level labels and RILE scores, and the remainder only have RILE scores (see Table 2 for further statistics).
In a small number of cases, a natural sentence is segmented into sub-sentences and annotated with different classes (Däubler et al., 2012). Hence we use the NLTK sentence tokenizer followed by heuristics from Däubler et al. (2012) to obtain sub-sentences. Consistent with previous work (Subramanian et al., 2017), we present results with manually segmented and annotated test documents.
Sentence-level baseline approaches include:
• AE-NN: MLP model with average multilingual word embeddings as the sentence representation (Subramanian et al., 2017).
• Bi-LSTM: Simple bi-LSTM over multilingual word embeddings, where the last hidden units are concatenated to form the sentence representation, and fed directly into a softmax sentence-level layer. We evaluate two scenarios: (1) with a trainable embedding matrix $W_e$ (Bi-LSTM(+up)); and (2) without a trainable $W_e$.
Document-level baseline approaches include:
• BoC: Bag-of-centroids (BoC) document representation based on clustering the word embeddings (Lebret and Collobert, 2014), fed into a neural network regression model.
• HCNN: Hierarchical CNN, where we encode both the sentence and document using stacked CNN layers.
• HNN: State-of-the-art hierarchical neural network model of Subramanian et al. (2017), based on average embedding representations for sentences and the document.
We present results evaluated under two different settings: (a) an 80-20% random split averaged across 10 runs to validate the hierarchical model (Section 5.3 and Section 5.4); and (b) a temporal setting, where the training and test sets are split chronologically, to validate both the hierarchical deep model and, in particular, the PSL approach, since we encode temporal dependencies (Section 5.5).

Hierarchical Sentence- and Document-level Model
We present sentence-level results with an 80-20% random split in Table 3, stratified by country, averaged across 10 runs. For Bi-LSTM, we found the setting with a trainable embedding matrix (Bi-LSTM(+up)) to perform better than the non-trainable case (Bi-LSTM). Hence we use a similar setting for Joint and Joint_struc. We show the effect of α (from Equation (2)) in Figure 2a, based on which we set α = 0.3 hereafter. With the chosen model, we study the effect of the structured loss (Equation (4)), by varying γ with fixed β = 0.1, as shown in Figure 2b. We observe that γ = 0.7 gives the best performance, and varying β with γ at 0.7 does not result in any further improvement (see Figure 2c).
Sentence-level results measured using F-measure, for baseline approaches and the proposed models selected from Figure 2, are given in Table 3. We also evaluate the special case of α = 1, in the form of the sentence-only model Joint_sent. For the document-level task, results for overall manifesto positioning measured using Pearson's correlation (r) and Spearman's rank correlation (ρ) are given in Table 4. We also evaluate the hierarchical bi-LSTM model with the document-level objective only, Joint_doc.
We observe that hierarchical modeling (Joint_sent, Joint and Joint_struc) gives the best performance for sentence-level classification for all the languages except Portuguese, on which it performs slightly worse than Bi-LSTM(+up). Also, Joint_struc does not improve over Joint_sent. We perform further analysis to see the effect of the joint-structured model on the sentence-level task under sparsely-labeled conditions in Section 5.4. On the other hand, for the document-level task, the joint model (Joint) performs better than Joint_doc and all the baseline approaches. Lastly, the joint-structured model (Joint_struc) provides further improvement over Joint.
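For reference, the two document-level evaluation measures can be computed as in the sketch below. Spearman's ρ is implemented here as Pearson's r over ranks and, as a simplification, does not average the ranks of tied values; the scores are made-up numbers, not results from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient between two score lists."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(values):
    """1-based ranks; ties are broken by order of appearance (simplification)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation as Pearson's r over the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical gold and predicted (scaled) positions for four manifestos.
gold = [-0.5, -0.1, 0.2, 0.6]
pred = [-0.4, 0.0, 0.1, 0.9]
r = pearson_r(gold, pred)
rho = spearman_rho(gold, pred)
```

Here the predictions preserve the gold ordering exactly, so ρ is 1 even though the values themselves differ, which is why both measures are reported: r is sensitive to the predicted magnitudes, ρ only to the ranking.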

Analysis of Joint-Structured Model for Sentence-level Task
To understand the utility of joint modeling, especially given that there are more manifestos with document-level labels only than with both sentence- and document-level labels, we compare the following two settings: (1) Joint_struc, which uses additional manifestos with document-level supervision (RILE); and (2) Joint_sent, which uses manifestos with sentence-level supervision only. We vary the proportion of labeled documents at the sentence level, from 10% to 80%, to study the effect under sparsely-labeled conditions. Note that 80% is the maximum labeled training data under the cross-validation setting. In other cases, a subset (say 10%) is randomly sampled for training. Figure 3 shows that having more manifestos with document-level supervision demonstrates the advantage of semi-supervised learning, especially when the sentence-level supervision is sparse (≤ 40%), where Joint_struc performs better than Joint_sent.

Manifesto Position Re-ranking using PSL
Finally, we present the results using PSL, which calibrates the overall manifesto position on the left-right spectrum, obtained using the joint-structured model (Joint_struc). As we evaluate the effect of temporal dependency, we use manifestos before 2008-09 for training (868 in total) and the later ones (until 2015, 136 in total) for testing. This test set covers one recent set of election manifestos for most countries, and two for the Netherlands, Spain and United Kingdom. To avoid variance in right-to-left ratio and the target variable (pos, initialized using Joint_struc) between the training and test sets, we build a stacked network (Fast and Jensen, 2008), whereby we estimate values for the training set using cross-validation across the training partition, and estimate values for the test set with a model trained over the entire training data. Note that we build the Joint_struc model afresh using the chronologically split training set, and the parameters are tuned again using an 80-20 random split of the training set. For a consistent view of results for both the tasks (and stages), we provide micro-averaged results for sentence classification with the competing approaches (from Table 3): AE-NN (Subramanian et al., 2017), Bi-LSTM(+up), and Joint_struc. Results are presented in Table 5, noting that the results for a given method will differ from earlier due to the different data split. For the document-level regression task, we also evaluate other approaches based on manifesto similarity and automated scaling with sentence-level policy positions:
• Cross-lingual scaling (CLS): A recent unsupervised approach for cross-lingual political speech text scoring (Glavaš et al., 2017b), based on TF-IDF weighted average word embeddings to represent documents, and a graph constructed using pair-wise document similarity. Given two pivot texts (for left and right), a label propagation approach is used to position other documents.
• PCA: Apply principal component analysis (Gabel and Huber, 2000) on the distribution of sentence-level policy positions (56 classes, without 000), and use the projection on its principal component to explain maximum variance in its sentence-level positions, as a latent manifesto-level position score.
• Joint_struc: We evaluate the scores obtained using Joint_struc, which we calibrate using PSL.
Table 6: Manifesto regression task using the two-stage approach. Best scores are given in bold.
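The PCA baseline listed above can be sketched with a plain SVD. Each row is assumed to hold one manifesto's distribution over sentence-level classes (3 toy classes here rather than 56), and the score is the projection of the centered row onto the first principal component; the function name and data are illustrative.

```python
import numpy as np

def first_pc_scores(label_distributions):
    """Project centered label distributions onto the first principal component."""
    X = np.asarray(label_distributions, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[0]

# Four toy manifestos, each a distribution over three classes.
X = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.1, 0.4, 0.5],
])
scores = first_pc_scores(X)
```

The sign of a principal component is arbitrary, so the resulting scores order the manifestos along a latent axis but do not by themselves say which end is left and which is right.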
We validate the calibrated position scores using both RILE and CHES (footnote 7) scores. We use CHES 2010-14, and map the manifestos to the closest survey year (with respect to the election date). CHES scores are used only for evaluation and not during training. We provide results in Table 6 by augmenting features for the PSL model (Table 1) incrementally. We observed that the coalition-based feature, and the polarity of sentences with their position information, improve the overall ranking (r, ρ). The document similarity based relational feature provides only mild improvement (similarly to Burford et al. (2015)), and temporal dependency provides further improvement against CHES. That is, combining content, network and temporal features provides the best results.
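Mapping a manifesto to its closest survey year is a one-line lookup, sketched below. Reading "CHES 2010-14" as two survey waves, 2010 and 2014, is an assumption, as is resolving an exact tie in favour of the earlier wave; the function name is hypothetical.

```python
def closest_survey_year(election_year, survey_years=(2010, 2014)):
    """Return the survey year nearest to the election year.

    On an exact tie, min() keeps the first candidate, i.e. the earlier wave.
    """
    return min(survey_years, key=lambda year: abs(year - election_year))

mapping = {year: closest_survey_year(year) for year in (2009, 2011, 2013, 2015)}
```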

Conclusion and Future Work
This work has been targeted at both fine- and coarse-grained manifesto text position analysis. We have proposed a two-stage approach, where in the first step we use a hierarchical multi-task deep model to handle the sentence- and document-level tasks together. We also utilize additional information on label structure, to augment an auxiliary structured loss. Since the first step places the manifesto on the left-right spectrum using text only, we leverage context information, such as coalition and temporal dependencies, to calibrate the position further using PSL. We observed that: (a) a hierarchical bi-LSTM model performs best for the sentence-level classification task, offering a 10% improvement over the state-of-the-art approach (Subramanian et al., 2017); (b) modeling the document-level task jointly, and also augmenting the structured loss, gives the best performance for the document-level task and also helps the sentence-level task under sparse supervision scenarios; and (c) the inclusion of a calibration step with PSL provides significant gains in performance against both RILE and CHES, in the form of an increase from ρ = 0.42 to 0.61 with respect to CHES survey scores.
Footnote 7: https://www.chesdata.eu/
There are many possible extensions to this work, including: (a) learning multilingual word embeddings with domain information; and (b) modeling other policy-related scores from text, such as "support for EU integration".