Target Based Speech Act Classification in Political Campaign Text

We study pragmatics in political campaign text, through analysis of speech acts and the target of each utterance. We propose a new annotation schema incorporating domain-specific speech acts, such as commissive-action, and present a novel annotated corpus of media releases and speech transcripts from the 2016 Australian election cycle. We show how speech acts and target referents can be modeled as sequential classification, and evaluate several techniques, exploiting contextualized word representations, semi-supervised learning, task dependencies and speaker meta-data.


Introduction
Election campaign text is a core artifact in political analysis. Campaign communication can influence a party's reputation, credibility, and competence, which are primary factors in voter decision making (Fernandez-Vazquez, 2014). Moreover, modeling the discourse is key to measuring a party's role in constructive democracy, that is, its engagement in constructive discussion with other parties (Gibbons et al., 2017).
Speech act theory (Austin, 1962; Searle, 1976) can be used to study such pragmatics in political campaign text. Traditional speech act classes have been used to analyze how people engage with elected members (Hemphill and Roback, 2014), and how elected members engage in discussions (Shapiro et al., 2018), with a particular focus on pledges (Artés, 2011; Naurin, 2011, 2014; Gibbons et al., 2017). Also, election manifestos have been analyzed for prospective and retrospective messages (Müller, 2018). In this work, we combine traditional speech acts with those proposed by political scientists to study political discourse, such as specific pledges, which can also help to verify the pledges' fulfilment after an election (Thomson et al., 2010).
In addition to speech acts, it is important to identify the target of each utterance, that is, the political party referred to in the text, in order to determine the discourse structure. Here, we study the effect of jointly modeling the speech act and target referent of each utterance, in order to exploit the dependencies between the two tasks. That is, this paper is an application of discourse analysis to the pragmatics-rich domain of political science, to determine the intent of every utterance made by politicians, and in part, to automatically extract pledges at varying levels of specificity from campaign speeches and press releases.
We assume that each utterance is associated with a unique speech act (similar to Zhao and Kawahara (2017)) and target party, meaning that a sentence with multiple speech acts and/or targets must be segmented into component utterances. Take the following example, from the Labor Party: (1) [Labor will contribute $43 million towards the Roe Highway project] and [we call on the WA Government to contribute funds to get the project underway].
The example is made up of two utterances (bracketed), belonging to speech act types commissive-action-specific and directive, and referring to different parties (LABOR and LIBERAL), respectively. In our initial experiments, we perform target based speech act classification (i.e. joint speech act classification and determination of the target of the utterance) over gold-standard utterance data (Section 6), but return to perform automatic utterance segmentation along with target based speech act classification (Section 7). While speech act classification has been applied to a wide range of domains, its application to political text is relatively new. Most speech act analyses in the political domain have relied exclusively on manual annotation, and no labeled data has been made available for training classifiers. As it is expensive to obtain large-scale annotations, in addition to developing a novel annotated dataset, we also experiment with a semi-supervised approach that utilizes unlabeled text, which is easy to obtain. The contributions of this paper are as follows: (1) we introduce the novel task of target based speech act classification to the analysis of political discourse; (2) we develop and release a dataset (available at https://github.com/shivashankarrs/Speech-Acts) based on political speeches and press releases from the two major parties, Labor and Liberal, in the 2016 Australian federal election cycle; and (3) we propose a semi-supervised learning approach to the problem by augmenting the training data with in-domain unlabeled text.

Related Work
The recent adoption of NLP methods has led to significant advances in the field of computational social science (Lazer et al., 2009), including political science (Grimmer and Stewart, 2013). With the increasing availability of datasets and computational resources, large-scale comparative political text analysis has gained the attention of political scientists (Lucas et al., 2015). One task of particular importance is the analysis of the functional intent of utterances in political text. Though it has received notable attention from many political scientists (see Section 1), the primary focus of almost all work has been to derive insights from manual annotations, and not to study computational approaches to automate the task.
Another related task in the political communication domain is reputation defense, in terms of party credibility. Recently, Duthie and Budzynska (2018) proposed an approach to mine ethos support/attack statements from UK parliamentary debates, while Naderi and Hirst (2018) focused on classifying sentences from Question Time in the Canadian parliament as defensive or not. In this work, our source data is speeches and press releases in the lead-up to a federal election, where we expect there to be rich discourse and interplay between political parties.
Speech act theory is fundamental to the study of such discourse and pragmatics (Austin, 1962; Searle, 1976). A speech act is an illocutionary act of conversation and reflects the shallow discourse structure of language. Due to its predominantly small-data setting, speech act classification approaches have generally relied on bag-of-words models (Qadir and Riloff, 2011; Vosoughi and Roy, 2016), although recent approaches have used deep-learning models through data augmentation (Joty and Hoque, 2016) and learning word representations for the target domain (Joty and Mohiuddin, 2018), outperforming traditional bag-of-words approaches.
Another technique that has been applied to compensate for the sparsity of labeled data is semi-supervised learning, making use of auxiliary unlabeled data, as done previously for speech act classification in e-mail and forum text (Jeong et al., 2009). Zhang et al. (2012) also used semi-supervised methods for speech act classification over Twitter data. They used transductive SVM and graph-based label propagation approaches to annotate unlabeled data using a small seed training set. Joty and Mohiuddin (2018) leveraged out-of-domain labeled data based on a domain adversarial learning approach. In this work, we focus on target based speech act analysis (with a custom class-set) for political campaign text and use a deep-learning approach incorporating contextualized word representations (Peters et al., 2018) and a cross-view training framework to leverage in-domain unlabeled text.

Problem Statement
Target based speech act classification requires the segmentation of sentences into utterances, and the labelling of those utterances according to speech act and target party. In this work we focus primarily on speech act and target party classification. Our speech act coding schema comprises: assertive, commissive, directive, expressive, past-action, and verdictive. An assertive commits the speaker to something being the case. With a commissive, the speaker commits to a future course of action. Following the work of Artés (2011) and Naurin (2011), we distinguish between action and outcome commissives. Action commissives (commissive-action) are those in which an action is to be taken, while outcome commissives (commissive-outcome) can be defined as a description of reality or goals. Secondly, similar to Naurin (2014), we also classify action commissives into vague (commissive-action-vague) and specific (commissive-action-specific), according to their specificity. This distinction is also related to text specificity analysis work in the news (Louis and Nenkova, 2011) and classroom discussion (Lugini and Litman, 2017) domains. A directive occurs when the speaker expects the listener to take action in response. In an expressive, the speaker expresses their psychological state, while a past-action denotes a retrospective action of the target party, and a verdictive refers to an assessment of prospective or retrospective actions.
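The resulting eight-way speech act class set and three-way target set can be written down as a small lookup structure; this is only a sketch, and the one-line glosses paraphrase the definitions above rather than the full annotation guidelines:

```python
# Sketch of the annotation schema; glosses paraphrase the paper's definitions.
SPEECH_ACTS = {
    "assertive": "commits the speaker to something being the case",
    "commissive-action-specific": "specific commitment to a future action",
    "commissive-action-vague": "vague commitment to a future action",
    "commissive-outcome": "commitment stated as a description of reality or goals",
    "directive": "speaker expects the listener to take action in response",
    "expressive": "speaker expresses their psychological state",
    "past-action": "retrospective action of the target party",
    "verdictive": "assessment of prospective or retrospective actions",
}

TARGET_PARTIES = {"LABOR", "LIBERAL", "NONE"}
```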
Examples of the eight speech act classes are given in Table 1, along with the target party (LABOR, LIBERAL, or NONE), indicating the party towards which the speech act is directed, and the "speaker" party making the utterance (information which is provided for every utterance).

Utterance Segmentation
Sentences are segmented in the context of both speech act and target party: a sentence is split when its utterances belong to more than one speech act and/or more than one target. For example, the following sentence conveys a pledge (commissive-outcome) followed by the party's belief (assertive), with the utterance boundary indicated by ||: (2) We will save Medicare || because Medicare is more than just a standard of health.
Further, the following (from the Labor party) has segments comparing LABOR and LIBERAL: (3) Our party is united -the Liberals are not united.

Election Campaign Dataset
We collected media releases and speeches from the two major Australian political parties, Labor and Liberal, from the 2016 Australian federal election campaign. A statistical breakdown of the dataset is given in Table 2. We compute agreement over 15 documents, annotated by two independent annotators, with disagreements resolved by a third annotator. The remaining documents were annotated by the two main annotators without redundancy. Agreement between the two annotators for utterance segmentation, based on exact boundary match using Krippendorff's alpha (Krippendorff, 2011), is 0.84. Agreement statistics for the classification tasks (Cohen, 1960; Carletta, 1996) are given in Tables 3 and 4.

Proposed Approach
Our approach to labeling utterances for speech act and target party classification is as follows. Utterances are first represented as a sequence of word embeddings, which is then encoded using a bidirectional Gated Recurrent Unit ("biGRU": Cho et al. (2014)). The representation of each utterance is the concatenation of the last hidden states of the forward and backward GRUs, which is fed into a softmax output layer. This network is trained for both the speech act (eight class) and target party (three class) classification tasks, minimizing cross-entropy losses denoted L S and L T respectively.
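As a minimal pure-Python sketch of this classifier head: the hidden states, weights, and dimensions below are hypothetical placeholders (the real model learns them end-to-end over much larger vectors), but the concatenation, softmax, and cross-entropy steps mirror the description above.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, gold_index):
    # negative log-likelihood of the gold class
    return -math.log(probs[gold_index])

# Hypothetical last hidden states of the forward and backward GRUs
h_fwd = [0.2, -0.1, 0.4]
h_bwd = [0.0, 0.3, -0.2]
utterance_repr = h_fwd + h_bwd  # concatenation, as in the model

# Hypothetical softmax-layer weights for a 3-class task (e.g. target party)
W = [[0.1] * 6, [0.2] * 6, [0.0] * 6]
b = [0.0, 0.0, 0.0]
logits = [sum(w_i * x_i for w_i, x_i in zip(row, utterance_repr)) + b_k
          for row, b_k in zip(W, b)]
probs = softmax(logits)
loss = cross_entropy(probs, gold_index=1)
```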
Our approach has the following components: ELMo word embeddings ("biGRU ELMo "): As word embeddings we use a 1024d learned linear combination of the internal states of a bidirectional language model (Peters et al., 2018).
Semi-supervised Learning: We employ a cross-view training approach to leverage a larger volume of unlabeled text. Cross-view training is a kind of teacher-student method, whereby the model "teaches" another "student" model to classify unlabeled data. The student sees only a limited form of the data, e.g., through application of noise (Sajjadi et al., 2016; Wei et al., 2018), or a different view of the input, as used herein. This procedure regularizes the learning of the teacher to be more robust, as well as increasing its exposure to unlabeled text.
We augment our dataset with over 36k sentences from Australian Prime Minister candidates' election speeches. On these unlabeled examples, the model's probability distribution over labels p_θ(y|s) is used to fit auxiliary model(s) p_ω(y|s), by minimizing the Kullback-Leibler (KL) divergence, KL(p_θ(y|s), p_ω(y|s)). This consensus loss component, denoted L unsup , is added to the supervised training objective (L S or L T ).
The intuition is that the student models only have access to restricted views of the data on which the teacher network is trained, and therefore this acts as a regularization factor over the unlabeled data when learning the teacher model.
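The consensus loss can be sketched as follows; the teacher and student distributions below are hypothetical stand-ins for p_θ(y|s) and p_ω(y|s) over the three target classes:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions over the same classes;
    # eps guards against log(0) for zero-probability classes
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical teacher (full view) and student (restricted view) distributions
p_teacher = [0.7, 0.2, 0.1]  # p_theta(y|s)
p_student = [0.5, 0.3, 0.2]  # p_omega(y|s)

# Consensus loss on one unlabeled example; added to L_S or L_T in training
L_unsup = kl_divergence(p_teacher, p_student)
```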
Multi-task Learning ("biGRU Multi "): For speech act classification, target party classification is considered an auxiliary task, and vice versa. Accordingly, a separate model is built for each task, with the other task as an auxiliary task, in each case using a linearly weighted objective L S + αL T , where α ≥ 0 is tuned separately in each application. The intuition here is to capture the dependencies between the tasks, e.g., a commissive is relevant to the Speaker party only.
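The weighted objective and its tuning loop can be sketched as below; the candidate α values and the validation scores are entirely hypothetical, purely to illustrate selecting α on held-out data:

```python
def combined_loss(loss_main, loss_aux, alpha):
    # linearly weighted multi-task objective, e.g. L_S + alpha * L_T
    return loss_main + alpha * loss_aux

# Hypothetical validation macro-F1 per candidate alpha; in practice alpha
# is tuned on the held-out validation split, separately for each task.
dev_scores = {0.0: 0.61, 0.1: 0.64, 0.5: 0.63, 1.0: 0.59}
best_alpha = max(dev_scores, key=dev_scores.get)
```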
Meta-data (biGRU Meta ): We concatenate a binary flag encoding the speaker party (m i ) alongside the utterance embedding h i , i.e., [h i , m i ]. This representation is passed through a hidden layer with ReLU activation, then projected onto an output layer with softmax activation for both classification tasks.

Evaluation
We compare the models presented in Section 5 with the following baseline approaches: • Support Vector Machine ("SVM BoW ") with unigram term-frequency representation.
• Using speaker party as the predicted target party ("Meta naive ").
We average results across 10 runs with 90%/10% training/test random splits. Hyperparameters are tuned over a 10% validation set randomly sampled and held out from the training set. We evaluate using accuracy and macro-averaged F-score, to account for class imbalance. We compare the baseline approaches against our proposed approach (with components as given in Section 5).
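Both evaluation measures can be sketched in a few lines of pure Python; the gold and predicted labels below are illustrative only:

```python
def macro_f1(gold, pred):
    # per-class F1, averaged uniformly over classes, so that rare classes
    # weigh as much as frequent ones (robust to class imbalance)
    labels = set(gold) | set(pred)
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative labels for the three-class target party task
gold = ["LABOR", "LABOR", "LIBERAL", "NONE"]
pred = ["LABOR", "LIBERAL", "LIBERAL", "NONE"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```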
We evaluate the effect of each component by adding it to the base model (biGRU ELMo ); e.g., the biGRU model with ELMo embeddings and the word-level dropout based semi-supervised approach is denoted biGRU ELMo + CVT worddrop . Results for speech act and target party classification are given in Table 5. The corresponding class-wise performance for both tasks with our approach (biGRU ELMo + CVT worddrop + Meta ), compared against the strongest competing approach from Table 5, is given in Tables 6 and 7 respectively (and also discussed further in Section 8). All the approaches are evaluated with the gold-standard segmentation. Utterance segmentation is discussed in Section 7.
From the results in Table 5, we observe that the biGRU performs better than the other approaches, and that ELMo contextual embeddings (biGRU ELMo ) boost performance appreciably. Apart from ELMo, the semi-supervised learning methods (biGRU ELMo + CVT worddrop ) provide a boost in performance for the target party task (with respect to accuracy) using all the training data. biGRU ELMo + CVT worddrop and biGRU ELMo + CVT fwd provide gains in performance for the speech act task, especially with fewer training examples (≤ 50% of training data, see Figure 1). Performance of semi-supervised learning models with cross-view training (which leverage in-domain unlabeled text) is compared against biGRU ELMo , which is a supervised approach. Results across different training ratio settings are given in Figure 1. From this, we can see that biGRU ELMo + CVT worddrop and biGRU ELMo + CVT fwd perform better than biGRU ELMo + CVT fwdbwd in almost all cases. With a training ratio ≤ 50%, biGRU ELMo + CVT worddrop achieves comparable performance to biGRU ELMo + CVT fwd .

Figure 1: Classification performance across different training ratios. Note that 90% uses all the training data, as 10% is used for validation.
Multi-task learning (biGRU ELMo + CVT worddrop + Multi ) provides only small improvements for the speech act task.
Further, when we add speaker party meta-data (biGRU ELMo + CVT worddrop + Meta ), it provides large gains in performance for the target party task.
Overall, the proposed approach (biGRU ELMo + CVT worddrop + Meta ) provides the best performance for the target party task. Its performance is better than that of the biGRU ELMo + Meta model, which does not leverage the additional unlabeled text using semi-supervised learning and achieves 0.70 accuracy and 0.65 macro-F1. Also, ELMo and the semi-supervised methods (biGRU ELMo + CVT worddrop and biGRU ELMo + CVT fwd ) provide significant improvements for the speech act task, especially under sparse supervision scenarios (see Figure 1, for training ratio ≤ 50%).

Segmentation Results
In the previous experiments we used gold-standard utterance data, but next we experiment with automatic segmentation. We use sentences as input, based on the NLTK sentence tokenizer (Bird et al., 2009), and automatically segment sentences into utterances via token-level segmentation, framed as a BI binary sequence classification task using a CRF model (Hernault et al., 2010). We use the following set of features for each word: token, word shape (capitalization, punctuation, digits), Penn POS tags based on spaCy, ClearNLP dependency labels (Choi and Palmer, 2012), relative position in the sentence, and features for the adjacent words (based on this same feature representation). We compute segmentation accuracy (SA: Zimmermann et al. (2006)), which measures the percentage of segments that are correctly segmented, i.e. both the left and right boundary match the reference boundaries. SA for the CRF model is 0.87. Secondly, to evaluate the effect of segmentation on classification, we compute joint accuracy (JA). It is similar to SA, but additionally requires correctness of the speech act and target party. In cascaded style, JA using the CRF model for segmentation and biGRU ELMo + CVT worddrop + Meta for speech act and target party classification is 0.60 and 0.64 respectively. Here, segmentation errors lead to a small drop in performance.

Table 8 (Utterance / Target party / Speaker):
"Our new Tourism Infrastructure Fund will bring more visitor dollars and more hospitality jobs to Cairns, Townsville and the regions." / LABOR / LABOR
"Just as he sold out 35,000 owner-drivers in his deal with the TWU to bring back the "Road Safety Remuneration Tribunal"." / LABOR / LIBERAL
"Then in 2022, we will start construction of the first of 12 regionally superior submarines, the single biggest investment in our military history."
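Segmentation accuracy over BI tag sequences can be sketched as follows; the tag sequences and helper names here are illustrative, not the paper's implementation:

```python
def segments(bi_tags):
    # convert a BI tag sequence into (start, end) token spans; each "B" opens
    # a new segment, and the final segment runs to the end of the sequence
    spans, start = [], 0
    for i, tag in enumerate(bi_tags):
        if tag == "B" and i > 0:
            spans.append((start, i - 1))
            start = i
    spans.append((start, len(bi_tags) - 1))
    return spans

def segmentation_accuracy(gold_tags, pred_tags):
    # fraction of gold segments whose left AND right boundaries are both
    # matched by a predicted segment (exact span match)
    gold = segments(gold_tags)
    pred = set(segments(pred_tags))
    return sum(1 for span in gold if span in pred) / len(gold)

gold = ["B", "I", "I", "B", "I"]  # two gold utterances
pred = ["B", "I", "I", "B", "B"]  # second utterance wrongly split in two
sa = segmentation_accuracy(gold, pred)
```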

Error Analysis
We analyze the class-wise performance and confusion matrix for our best performing approach (biGRU ELMo + CVT worddrop + Meta ). Speech act and target party class-wise performance is given in Tables 6 and 7 respectively. We can see that the proposed approach provides improvements across all classes except directive, for which performance is comparable. Recognizing commissive-outcome can be seen to be harder than the other classes. In addition, we analyze the results to identify cases where having "Speaker" party information is beneficial for predicting the target party of sentences. Some of those scenarios are given in Table 8, where the meta-data enables predicting the target party correctly even when there is no explicit reference to the party or its leaders. Confusion matrices for the speech act and target party classification tasks are given in Figure 2. Some observations from the confusion matrices are: (a) assertive and verdictive are often misclassified as each other; (b) commissive-action-vague utterances are often misclassified as commissive-action-specific; and (c) LABOR and LIBERAL are often confused with each other in the target party classification task.

Qualitative Analysis
Here we provide the policy-wise speech act distribution for both parties, which indicates the difference in their predilection for the indicated six policy areas (Figure 3). We provide results for the six most frequent policy categories; the campaign text is first classified into one of the policy areas relevant to Australian politics, using a Logistic Regression classifier trained on data obtained from ABC Fact Check. Some observations (based on Figure 3) are as follows:
• The incumbent government (LIBERAL) uses more directive, expressive, verdictive, and past-action utterances than the opposition (LABOR).
• LIBERAL's text has relatively more pledges (commissive-action-vague, commissive-action-specific and commissive-outcome) on the economy compared to LABOR, whereas LABOR has more pledges on social services and education. This is as expected for right- and left-wing parties respectively. Other policy areas have a comparable number of pledges from both parties.
Overall, party-wise salience towards these policy areas correlates highly with the relative breakdowns in the Comparative Manifesto Project (Volkens et al., 2017): the relative share of sentences from the LABOR and LIBERAL manifestos for welfare state (health and social services) is 22:7, education is 9:6, economy is 11:23, and technology & infrastructure (communication, infrastructure) is 17:19.

Conclusion and Future Work
In this work we presented a new dataset of election campaign texts, based on a class schema of speech acts specific to the political science domain. We studied the associated problems of identifying the referent political party and segmenting utterances. We showed that the task is feasible to annotate, and presented several models for automating it. We used a pre-trained language model and also leveraged auxiliary unlabeled text with a semi-supervised learning approach for the target based speech act classification task. Our results are promising, with the best method being a semi-supervised biGRU with ELMo embeddings for the speech act task, and the same model additionally incorporating speaker meta-data for the target party task. We provided qualitative analysis of speech acts across major policy areas, and in future work aim to expand this analysis further with fine-grained policies and ideology-related analysis.