Policy Preference Detection in Parliamentary Debate Motions

Debate motions (proposals) tabled in the UK Parliament contain information about the stated policy preferences of the Members of Parliament who propose them, and are key to the analysis of all subsequent speeches given in response to them. We attempt to automatically label debate motions with codes from a pre-existing coding scheme developed by political scientists for the annotation and analysis of political parties’ manifestos. We develop annotation guidelines for the task of applying these codes to debate motions at two levels of granularity and produce a dataset of manually labelled examples. We evaluate the annotation process and the reliability and utility of the labelling scheme, finding that inter-annotator agreement is comparable with that of other studies conducted on manifesto data. Moreover, we test a variety of ways of automatically labelling motions with the codes, ranging from similarity matching to neural classification methods, and evaluate them against the gold standard labels. From these experiments, we note that established supervised baselines are not always able to improve over simple lexical heuristics. At the same time, we detect a clear and evident benefit when employing BERT, a state-of-the-art deep language representation model, even in classification scenarios with over 30 different labels and limited amounts of training data.


Introduction
Commonly known as the Hansard record, transcripts of debates that take place in the House of Commons of the United Kingdom (UK) Parliament are of interest to scholars of political science as well as the media and members of the public who wish to monitor the actions of their elected representatives. Debate motions (the proposals tabled for debate) are expressions of the policy positions taken by the governments, political parties, and individual Members of Parliament (MPs) who propose them. As all speeches given and all votes cast in the House are responses to one of these proposals, the motions are key to any understanding and analysis of the opinions and positions expressed in the subsequent speeches given in parliamentary debates.
By definition, debate motions convey the stated policy preferences of the MPs or parties who propose them. They therefore express polarity (positive or negative) towards some target, such as a piece of legislation, policy, or state of affairs. As noted by Thomas et al. (2006), the polarity of a debate proposal can strongly affect the language used by debate participants to either support or oppose it, effectively acting as a polarity shifter on the ensuing speeches. Analysis of debate motions is therefore a key first step in automatically determining the positions presented and opinions expressed by all speakers in the wider debates.
Additionally, there are further challenges associated with this task that differentiate it from the forms of sentiment analysis typically performed in other domains. Under Parliament's Rules of Behaviour, 1 debate participants use an esoteric speaking style that is not only laden with opaque procedural language and parliamentary jargon, but is also indirect, containing few explicitly negative words or phrases, even where negative positions are being expressed (Abercrombie and Batista-Navarro, 2018a).
The topics discussed in these debates revolve around policies and policy domains. Topic modelling or detection methods, which tend to produce coarse overviews and output neutral topics such as 'education' or 'transport' (as in Menini et al. (2017), for instance), are therefore not suitable for our purposes. Rather, we seek to find the position or policy preference of a motion's proposer towards each topic: in other words, an opinion-topic. Topic labels do exist for the Hansard transcripts, such as those produced by the House of Commons Library or parliamentary monitoring organisations such as Public Whip. 2 However, these are unsuitable: the former incorporate no opinion or policy preference information, while the latter are unsystematic, insufficient in both quantity and coverage of the topics that appear in Hansard, and not future-proof (that is, they do not cover unseen topics that may arise (Abercrombie and Batista-Navarro, 2018b)).
In this paper, we use the coding scheme devised by the Manifesto Project, 3 because: (a) it is systematic, having been developed by political scientists over a 40-year period, (b) it is comprehensive and designed to cover any policy preference that may be expressed by any political party in the world, (c) it has been devised to cover any policies that may arise in the future, and (d) there exist many expert-coded examples of manifestos, which we can use as reference documents and/or for validation purposes.
We approach automatic policy preference labelling at both the motion and (quasi-)sentence levels (see Section 2). We envisage that the output could therefore be used for downstream tasks, such as sentiment and stance analysis and agreement assessment of debate speeches, which may be performed at different levels of granularity.
Our contributions
This paper makes the following contributions to the literature surrounding natural language processing of political documents and civic technology applications:
1. We develop a corpus of English language debate motions from the UK Parliament, annotated with policy position labels at two levels of granularity. We also produce annotation guidelines for this task, analysis of inter-annotator agreement rates, and further evaluation of the difficulty of the task on data from both parliamentary debates and the manifestos. We make these resources publicly available for the research community.
2. We test and evaluate two different ways of automatically labelling debate motions with Manifesto Project codes: lexical similarity matching and supervised classification. For the former, we compare a baseline of unigram overlap with cosine similarity measurement of vector representations of the texts. For the latter, we test a range of established baselines and state-of-the-art deep learning methods.

Background
Rather than being forums in which speakers attempt to persuade one another of their points of view, as the word 'debate' may imply, parliamentary speeches are displays of position-taking that MPs use to communicate their policy preferences to 'other members within their own party, to members of other parties, and, most important, to their voters' (Proksch and Slapin, 2015). Debate motions are proposals put forward in Parliament, and as such are the objects of all votes and decisions made by MPs, and, in theory at least, of all speeches and utterances made in the House. 4 Each parliamentary debate begins with such a motion, and may include further amendment motions (usually designed to alter or reverse the meaning of the original) as it progresses.
Motions routinely begin with the words 'I beg to move That this House ...', and may include multiple parts, as in Example 1, 5 which consists of two clauses, and appears to take a positive position towards international peace:
I beg to move That this House notes the worsening humanitarian crisis in Yemen; and calls upon the Government to take a lead in passing a resolution at the UN Security Council that would give effect to an immediate ceasefire in Yemen. (1)
The concept of policy preferences is widely used in the political science literature (e.g. Budge et al., 2001; Lowe et al., 2011; Volkens et al., 2013) to represent the positions of political actors expressed in text or speech. The Manifesto Project is an ongoing venture that spans four decades of work in this area and consists of a collection of party political documents annotated by trained experts with codes (labels) representing such preferences. Organised under seven 'domains', the coding scheme comprises 57 policy preference codes, all but one of which (408: Economic goals) are 'positional', encoding a positive or negative position towards a policy issue (Mikhaylov et al., 2008).
Indeed, many of these codes exist in polar opposite pairs, such as 504: Welfare State Expansion and 505: Welfare State Limitation. The included manifestos are coded at the quasi-sentence level, that is, in units of text that span a sentence or part of a sentence, and which have been judged by the annotators to contain 'exactly one statement or "message"' (Werner et al., 2011), as in Example 2, in which a single sentence has been annotated as four quasi-sentences: 6 To secure your first job we will create 3 million new apprenticeships; 411: Technology and Infrastructure take everyone earning less than 12,500 out of Income Tax

Related work
There exists a large body of work concerning the analysis of opinions and policy positions in the related domains of legislative debate transcripts (for a survey, see Abercrombie and Batista-Navarro, 2019) and party political manifestos (see Volkens et al., 2015). Inspired by work on analysis of text from other domains, such as product reviews and social media, much of the computer science research in this area has concentrated on classifying the sentiment polarity of individual speeches (e.g. Burford et al., 2015; Thomas et al., 2006; Yogatama et al., 2015). Political scientists, meanwhile, have tended to focus on position scaling, the task of placing the combined contributions of a political actor on a (usually) one-dimensional scale, such as Left-Right (e.g. Glavaš et al., 2017b; Laver et al., 2003; Nanni et al., 2019a; Proksch and Slapin, 2010). In either case, the majority of this work does not take into consideration the topics or policy areas addressed in the speeches.
Supervised classification approaches to opinion-topic identification have been explored in a number of papers. Abercrombie and Batista-Navarro (2018b) obtain good performance in classifying debate motions as belonging to one of 13 'policies' or opinion-topics. However, this approach is somewhat limited in that they use a set of pre-existing labelled examples which does not extend to cover the whole Hansard corpus or any new policies that may arise in the future. A similar setting to ours is that of Herzog et al. (2018), who use labels from the Comparative Agendas Project (CAP). 7 However, while they seek to discover latent topics present in the corpus, we wish to determine the policy-topic of each individual debate/motion. Rather than employ labelled manifesto data, as we do, they use the descriptions of the CAP codes.
Concerning policy identification in party political manifestos, previous studies have focused on topical segmentation and classification of sentences into the seven coarse-grained policy domains (Glavaš et al., 2017a; Zirn et al., 2016). Meanwhile, Subramanian et al. (2018) recently presented a deep learning model that classifies manifesto sentences with the finer-grained code-level scheme of the Manifesto Project, as well as placing them on a Left-Right scale. In order to contribute to these research efforts and following recent advancements in deep language representation models (Devlin et al., 2018; Peters et al., 2018), we test the potential of BERT (Bidirectional Encoder Representations from Transformers) for policy-topic classification on both debate motions and manifestos.
There is also a growing body of research on the evaluation of annotations for this domain. While the Manifesto Project relies on trained individual annotators to label manifestos, Mikhaylov et al. (2008) report the results of experiments which show that agreement between annotators is difficult to achieve, casting doubts on the reliability of the Project's codes. However, in similar experiments, Lacewell and Werner (2013) report greater inter-annotator agreement, and claim that with ongoing training, annotators can produce reliable labels. An extended analysis of the validity and reproducibility of the coding scheme is offered by Gemenis (2013), who remarks on the fact that 'the problem of unreliability does not lie with the coders but with the complex nature of the CMP (Comparative Manifesto Project) coding scheme'. Aware of such challenges, and in order to offer an additional comparison to these previous studies, in this work we provide a detailed analysis of the agreement rates of our annotators on both manifestos and debate motions.

Data
In the experimental section we report on the use of codes from the Manifesto Project as policy preference labels, with the goal of applying them to debate motions. These labels are convenient because: (a) like debate transcripts, they have been collected over time; and (b) the Project is ongoing, meaning that new example manifestos will continue to be added to it, mitigating potential concept drift problems (in which the language used to refer to aspects of different policy areas may change diachronically).
To construct our corpus, we made use of the data sources described below:

The Manifesto Project
We used annotated manifestos (1) as reference texts for labelling debate motions by similarity matching, and (2) for training a neural network for cross-domain classification of the motions. We downloaded all fifteen of the annotated United Kingdom (including Northern Ireland) manifestos from the Manifesto Corpus Version 2018-1 (Krause et al., 2018), that is, those that have been coded under version 4 of the coding scheme. 8 In this subset, the number of UK manifesto quasi-sentences labelled with codes in each domain varies considerably (see Table 1). These manifestos were written by a variety of political parties for elections over an 18-year period (Table 2). The most prevalent code in these manifestos is 504: Welfare State Expansion (2,691 examples), and the least used is 103: Anti-Imperialism (3 examples). Two codes, 102: Foreign Special Relationships: Negative and 415: Marxist Analysis: Positive, do not appear at all in manifestos from the United Kingdom.

Debate transcripts
The Hansard record of House of Commons debates is available for each day on which debates have taken place from 1919 to the present day in XML format at https://www.theyworkforyou.com, where it is updated daily with the most recent debates. As the record is more complete for recent years, we downloaded all files from May 7th 1997 (the start of that year's session of Parliament) to February 28th 2019. From these we extracted 1,156 motions together with the titles of the debates and the dates on which they were tabled. We manually removed procedural motions (those concerned solely with the workings of Parliament) from the dataset as these do not concern policy preferences and have no equivalents in political manifestos.
In order to approximate the format of the data in the Manifesto Project, and to investigate policy preference detection at different levels of granularity, we divided each motion into smaller units. For convenience, we approximated quasi-sentences in the Hansard data by automatically dividing motions into clauses, which are separated by semicolons in the transcripts.
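The clause-level segmentation described above can be sketched as follows. This is a minimal illustration that assumes a motion arrives as a single string and that semicolons are the only delimiter; the function name `split_motion` is hypothetical and not taken from the authors' code.

```python
def split_motion(motion: str) -> list[str]:
    """Approximate quasi-sentences by splitting a motion at semicolons.

    Hypothetical helper: empty fragments are dropped and surrounding
    whitespace is trimmed, but no other normalisation is attempted.
    """
    return [clause.strip() for clause in motion.split(";") if clause.strip()]


# Example 1 from the Background section, used here as test input.
example = (
    "I beg to move That this House notes the worsening humanitarian "
    "crisis in Yemen; and calls upon the Government to take a lead in "
    "passing a resolution at the UN Security Council that would give "
    "effect to an immediate ceasefire in Yemen."
)
units = split_motion(example)
for unit in units:
    print(unit)
```

Splitting on punctuation alone is cheap and reproducible, but the resulting units only approximate the hand-segmented quasi-sentences of the Manifesto Project.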

Annotation
We adapt the Project's Coding Instructions (Werner et al., 2011) to provide guidelines for the annotation of debate motions. We use version 4 of these instructions because, although a more recent, more finely grained version exists, there are as yet few example manifestos coded under the newer scheme. To complete the annotation task, we recruited three Political Science Master's students from the University of Mannheim, who worked for a total of 40 hours each over a two-month period.

Debate motions
Annotations were carried out in two stages: an initial training phase, followed by labelling of the main dataset. We used the coding instructions of version 4 of the Manifesto Project handbook 9 supplemented by debate motion-specific guidelines including notes based on the annotators' discussions during training. 10 For the training phase, after being introduced to the data and the coding instructions, the annotators individually labelled three batches of motions and their quasi-sentences. In addition to labelling each of these with one of the codes, they were instructed to note examples which they found difficult to decide upon. Between each batch we met to discuss these instances, as well as other examples on which the annotators disagreed, adding notes to the annotation guidelines based on the observations made. Inter-annotator agreement during training ranged from 'fair' to 'substantial', following the common interpretation of Fleiss' kappa scores (Landis and Koch, 1977) (see Table 3).
The final corpus includes 386 hand-annotated motions and 1,683 quasi-sentences. 11 The majority of these have been labelled by two of the three annotators. Inter-annotator agreement is within the ranges generally interpreted as being 'moderate' to 'substantial' (see Table 4). The slightly higher agreement at the quasi-sentence level than on overall motion labels suggests that it may be difficult in some cases to select a single policy preference code for a whole motion. A subsection of the corpus (41 motions, 180 quasi-sentences) was labelled by all three annotators. Fleiss' kappa scores for this subsection are 0.46 at both levels, which indicates 'moderate' agreement. Following Pustejovsky and Stubbs (2012), the gold standard label for each example is obtained by adjudication, which was carried out by the first author.
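For readers unfamiliar with the agreement statistic reported here, the following is a minimal sketch of Fleiss' kappa for items that have each been labelled by the same number of annotators. The function name, the input layout (one list of labels per item), and the toy labels are assumptions made for illustration; this is not the authors' evaluation code.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for items that each received the same number of ratings.

    ratings[i] is the list of labels assigned to item i (hypothetical layout).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        observed += agreeing / (n_raters * (n_raters - 1))
    observed /= n_items

    total = n_items * n_raters
    expected = sum((t / total) ** 2 for t in category_totals.values())
    if expected == 1.0:
        # Only one label was ever used; kappa is undefined, and this
        # sketch treats the case as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data: three annotators, four quasi-sentences, invented labels.
toy = [
    ["504", "504", "504"],
    ["504", "505", "504"],
    ["105", "105", "105"],
    ["411", "411", "504"],
]
print(round(fleiss_kappa(toy), 3))
```

Under the Landis and Koch (1977) interpretation cited above, values between 0.41 and 0.60 are read as 'moderate' and values between 0.61 and 0.80 as 'substantial' agreement.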

Manifestos
To validate our labelling procedure, and for comparison with other work, we also asked the annotators to label a small quantity (120) of quasi-sentences from the Manifesto Project. We calculate Fleiss' kappa for these annotations to be 0.48, which is comparable to that obtained on the main dataset of debate motions, and higher than those reported by Mikhaylov et al. (2008) on manifestos.
Again, we asked the annotators to mark any examples which they considered to be difficult to decide upon. Agreement (Fleiss' kappa) on these 'difficult' cases is only 0.17, with only one example marked as such by all three annotators. In this case, two of them used the 'correct' Manifesto Project gold label, while the third annotator applied a different code from the same domain.
10 These guidelines are available along with the corpus.
11 These constitute examples with 'gold standard' labels. The corpus also includes examples labelled by a sole annotator ('silver standard') and further unlabelled motions (see Table 5).
Overall, all three annotators agree on 47 examples (39.2%), and on 36 of these they also agree with the gold label (30% of the total). Domain-level agreement is 0.56, which is also similar to that achieved on the debate motions.

The Motion Policy Preference Corpus
We make the corpus available for download at https://madata.bib.uni-mannheim.de/308. The number of labelled and unlabelled examples it contains can be seen in Table 5. For the gold-labelled data, motions range in length from one to 13 quasi-sentences (mean = 4.3), with each of these consisting of between four and 163 tokens (mean = 28.7).

Automatic Labelling Methods
We investigated two ways of automatically labelling debate motions with the codes from the Manifesto Project: (1) similarity matching and (2) supervised classification. We tested both at the quasi-sentence level, and we additionally experimented with similarity matching methods at the whole motion level, where the lack of sufficient training data prevents application of supervised learning methods. In pre-processing, we filtered out any motions with gold standard labels that appear fewer than ten times in the corpus, leaving 370 motions and 1,634 quasi-sentences, each annotated with one of the 32 remaining class labels.
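The rare-label filtering step can be sketched as below. The data layout (a list of `(text, label)` pairs) and the function name `filter_rare_labels` are assumptions for illustration only.

```python
from collections import Counter


def filter_rare_labels(examples, min_count=10):
    """Keep only examples whose gold label occurs at least min_count times.

    examples: list of (text, label) pairs (hypothetical layout).
    """
    counts = Counter(label for _, label in examples)
    return [(text, label) for text, label in examples if counts[label] >= min_count]


# Toy data: one frequent label and one rare label.
toy = [(f"motion {i}", "504") for i in range(10)] + [
    ("motion a", "103"),
    ("motion b", "103"),
]
kept = filter_rare_labels(toy)
print(len(kept), sorted({label for _, label in kept}))
```

Counting is done once over the full set before filtering, so removing one rare class cannot push another class below the threshold.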

Similarity matching
We tested two methods of matching debate motions to codes from the Manifesto Project, comparing a baseline of unigram overlap scores with cosine similarity measurement. In each case, we measured the similarity of the list of tokens A = A_1, A_2, ..., A_n in each motion or quasi-sentence text and the list of tokens in each collection of concatenated manifesto extracts B = B_1, B_2, ..., B_n. For unigram overlap, we simply counted the union of the sets of tokens from A and B. For the latter method, each text was represented by its term frequency-inverse document frequency vector (tf-idf), and cosine similarity calculated as:
\[
\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
\]
where A_i and B_i denote the i-th components of the tf-idf vectors of the two texts.
With both of these approaches, we explored the use of the following combinations of sources of textual unigram features: the debate titles, which have been shown to be highly predictive of a motion's opinion-topic in a supervised classification setting (Abercrombie and Batista-Navarro, 2018b), the debate motions themselves, and both the titles and motions together.
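The two matching strategies can be illustrated with a small standard-library sketch. Several details are assumptions rather than the authors' implementation: whitespace tokenisation, a smoothed idf weighting, and all function names. The text describes the overlap count as the 'union' of the two token sets; the sketch assumes the intended quantity is the number of shared unigrams (the set intersection), since that is what measures overlap. The reference texts are invented toy strings, not real manifesto extracts.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lower-cased whitespace tokenisation.
    return text.lower().split()


def unigram_overlap(a_tokens, b_tokens) -> int:
    # Assumption: overlap = number of distinct tokens shared by both texts.
    return len(set(a_tokens) & set(b_tokens))


def tfidf_vectors(docs):
    """Sparse tf-idf vectors (dicts) for a list of token lists."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Assumption: smoothed idf, so that no weight is zero or undefined.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_code(text: str, references: dict) -> str:
    """Return the code whose concatenated reference text is most similar."""
    codes = list(references)
    docs = [tokenize(references[c]) for c in codes] + [tokenize(text)]
    vectors = tfidf_vectors(docs)
    query = vectors[-1]
    scores = {c: cosine(query, vectors[i]) for i, c in enumerate(codes)}
    return max(scores, key=scores.get)


# Invented toy reference texts keyed by code.
references = {
    "504": "expand welfare state benefits and pensions",
    "105": "reduce military spending and armed forces",
}
print(best_code("this house calls for higher pensions and welfare benefits", references))
```

Concatenating all extracts for a code into one reference document, as described above, means that each code is represented by a single tf-idf vector.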

Supervised Classification
We tested a range of supervised machine learning algorithms for the policy preference classification task, ranging from traditional approaches to recently developed pre-trained deep language representation models. We were particularly interested in assessing the performance of such approaches: (1) despite the limited training data available (1.6k motion quasi-sentences); and (2) in a cross-domain application (training on over 16k manifesto quasi-sentences, and testing on the motion quasi-sentences).
First, we examined the performance of Support Vector Machines (SVM) trained using lexical (tf-idf) or word embedding (w-emb) features, which act as strong traditional baselines. We tested both pre-trained general purpose word embeddings from https://fasttext.cc (Mikolov et al., 2018) and in-domain vectors generated on the Hansard transcripts from Nanni et al. (2019b).
We also report the results of a widely adopted neural network baseline for topic classification (see for instance Glavaš et al. (2017a) and Subramanian et al. (2018) in the context of manifesto quasi-sentence classification): a Convolutional Neural Network (CNN) with a single convolutional layer and a single max-pooling layer. We again tested the CNN with general purpose and in-domain embeddings.
As final skyline comparisons, we present the performance of (1) a pre-trained BERT (large, cased) model (Devlin et al., 2018), with a final soft-max layer; and (2) the same pre-trained BERT model, with a CNN and max-pooling layers before the soft-max layer. We additionally experimented with these two models in a fine-tuning setting: after training on manifestos, they were further fine-tuned on motions.
We tested all approaches with an 80/20 split of the dataset, and trained all the neural models for three iterations.
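An 80/20 split of this kind can be sketched as follows. The text does not say whether the split was random, stratified, or fixed, so the seeded random shuffle and the function name below are assumptions made purely for illustration.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and hold out test_fraction of the examples.

    Assumption: a simple (non-stratified) random split.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 1,634 stand-in items, matching the number of filtered quasi-sentences.
items = list(range(1634))
train, test = train_test_split(items)
print(len(train), len(test))
```

With over 30 classes and some of them close to the ten-example minimum, a stratified split would give more stable per-class estimates; that variant is not shown here.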

Results
We evaluated the predicted labels of each experimental model against the gold standard labels produced by the annotation process. For the machine learning methods, we report F1 scores with both macro and micro weightings in order to offer an understanding of the quality overall, as well as for the different classes.
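The difference between the two weightings can be made concrete with a short sketch for single-label classification. The function name and the toy labels are hypothetical, and averaging the macro score over every label seen in either the gold or the predicted set is an assumption about the evaluation setup.

```python
from collections import Counter


def f1_scores(gold, predicted):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = set(gold) | set(predicted)
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: F1 over pooled counts, dominated by frequent classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


gold = ["504", "504", "504", "105"]
pred = ["504", "504", "105", "105"]
macro, micro = f1_scores(gold, pred)
print(round(macro, 3), round(micro, 3))
```

In the single-label setting the micro-averaged F1 equals accuracy, which is why the two scores diverge most when class frequencies are highly skewed, as they are in this corpus.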

Motions: Similarity Matching
We evaluate labelling of motions by similarity matching at two levels of granularity: quasi-sentence and whole motion. Cosine similarity matching comfortably outperforms the baseline at both levels of granularity and at both the policy and domain levels (see Figure 1).
Unlike the findings of Abercrombie and Batista-Navarro (2018b), in most settings, we do not find the debate titles to be as powerful indicators of class labels as features derived from the texts of the motions, perhaps due to our larger set of class labels containing more similar (same domain) policy preference codes.
Best performances at both policy and domain levels (F1 macro = 0.59) are obtained using tf-idf features derived from both motion titles and texts, although performance using the texts only is comparable. For most combinations of feature input and similarity measurement method, F1 scores are around twice as good at the domain level as at the policy level.
Figure 1: F1 macro scores for unigram overlap and cosine similarity matching at the policy and domain levels using textual features from whole motions. Use of cosine similarity leads to markedly better performance than unigram overlap, and the best performance is achieved using features derived from both the titles and motion texts at policy and domain levels.

Motions: Quasi-sentence Classification
We tested the supervised pipelines at the quasi-sentence level and at the two levels of class label granularity (policy and domain), which allows us to compare the results with previous work on the Manifesto Project (e.g., Zirn et al. (2016)). As can be seen in Table 6, the use of machine learning methods generally (but not always) leads to a substantial improvement (especially for Micro F1), in comparison to the heuristics that we have discussed above. Concerning the SVM and CNN baselines, training the classifiers on the large collection of annotated manifestos and then applying them to the motions does not lead to improvements in comparison to the performance of the same architectures on the motions alone. Similarly, we notice that in most cases the use of in-domain embeddings does not improve the results. These two findings might be due to the fact that the style of communication and vocabulary of the employed resources are very different. The size of the training data may also play a role, as can be noticed in particular with the weak performances of the CNNs, especially in comparison to more traditional approaches; in the next section, we return to this issue.
Finally, we obtain a clear improvement over all other systems when employing BERT, trained on manifesto quasi-sentences and further fine-tuned on motions. This further confirms the large potential of this state-of-the-art architecture, even in tasks which involve many labels, a lack of training data, and a very specific style of communication.

Manifestos: Quasi-sentence Classification
As a final comparison of the presented systems for quasi-sentence classification, we report their performance on the corpus of 16k manifesto quasi-sentences, again with an 80/20 train-test split. The results (see Table 7) are consistent with the performance of supervised pipelines on the Manifesto Corpus presented in previous literature (Glavaš et al., 2017a; Subramanian et al., 2018; Zirn et al., 2016) and in line with the performances we obtained on the motion corpus in Table 6.
Interestingly, we once again notice the weak performances of the CNNs on the collection, even with ten times as much training data. This could be due to a necessity to extend the architecture (for example, by adding more convolutional layers) rather than a simple lack of training data. Conversely, traditional SVM baselines offer reasonable results, and we achieve state-of-the-art performances when employing BERT.

Discussion and Conclusion
Through this work we have been able to make a number of observations about the validity and reliability of the annotations produced and the difficulty of the tasks of labelling both debate motions and manifestos.
In labelling the manifestos, our annotators agreed with each other to roughly the same extent that they agreed with the gold labels provided by the Manifesto Project's expert annotators. This level of agreement is also similar to that reported in Mikhaylov et al. (2008), though not as good as that of MARPOR 12 itself (Lacewell and Werner, 2013).
The task does seem to be transferable to parliamentary debate motions, with our inter-annotator agreement scores comparable on both domains. Although automatic labelling with lexical similarity matching is more successful at the quasi-sentence level than at the motion level, the annotators do not seem to find the coarser grained task much easier.
Overall, this is a hard task for humans. However, despite the issue of annotation reproducibility, political scientists continue to find these labels useful, as evidenced by Volkens et al. (2015), who find 230 articles that use this data in the eight journals they examine. With comparable reliability (inter-annotator agreement), the labelled motions could prove equally suitable for many automatic analysis applications.
Concerning automation of the labelling process, we can derive three general findings. The first is that a very simple approach (matching debate motions to coded manifestos using cosine similarity measurement) appears to produce potentially useful outputs, particularly at the domain level, with supervised baselines not necessarily offering consistently better results (especially the CNN architectures). The second is that cross-domain applications (from manifestos to motions) seem to necessitate a further fine-tuning step, perhaps due to the very different styles of communication involved. The third is the significant contribution that the use of BERT makes to our supervised pipelines, which are able to achieve state-of-the-art performance on both the motions and manifesto quasi-sentences.
The generated dataset of topically labelled motions along with the trained BERT+CNN classifier can now pave the way for further work at the intersection of natural language processing and political science, which can benefit from these fine-grained policy position annotations: from analysing the sentiment of the motions to measuring the level of disagreement between members of the same party, and up to full-blown argumentation mining of each debate.