Joint Modeling of Content and Discourse Relations in Dialogues

We present a joint modeling approach to identify salient discussion points in spoken meetings as well as to label the discourse relations between speaker turns. We also discuss a variation of our model in which discourse relations are treated as latent variables. Experimental results on two popular meeting corpora show that our joint model can outperform state-of-the-art approaches for both phrase-based content selection and discourse relation prediction tasks. We also evaluate our model on predicting the consistency among team members' understanding of their group decisions. Classifiers trained with features constructed from our model achieve significantly better predictive performance than the state-of-the-art.


Introduction
Goal-oriented dialogues, such as meetings, negotiations, or customer service transcripts, play an important role in our daily life. Automatically extracting the critical points and important outcomes from dialogues would facilitate generating summaries for complicated conversations, understanding the decision-making process of meetings, or analyzing the effectiveness of collaborations.
We are interested in a specific type of dialogue, spoken meetings, which are a common way to collaborate and share ideas. Previous work (Kirschner et al., 2012) has shown that discourse structure can be used to capture the main discussion points and arguments put forward during problem-solving and decision-making processes in meetings. Indeed, the content of different speaker turns does not occur in isolation, and should be interpreted within the context of discourse. Meanwhile, content can also reflect the purpose of speaker turns, and thus facilitate the understanding of discourse relations. Take the meeting snippet from the AMI corpus (Carletta et al., 2006) in Figure 1 as an example. This discussion is annotated with discourse structure based on the Twente Argumentation Schema (TAS) by Rienks et al. (2005), which focuses on argumentative discourse information. As can be seen, meeting participants evaluate different options by showing doubt (UNCERTAIN), bringing up an alternative solution (OPTION), or giving feedback. The discourse information helps with the identification of the key discussion point, i.e., "which type of battery to use", by revealing the discussion flow.
To date, most efforts to leverage discourse information to detect salient content from dialogues have focused on encoding gold-standard discourse relations as features for use in classifier training (Murray et al., 2006; Galley, 2006; McKeown et al., 2007; Bui et al., 2009). However, automatic discourse parsing in dialogues is still a challenging problem (Perret et al., 2016). Moreover, acquiring human annotation on discourse relations is a time-consuming and expensive process, and does not scale to large datasets.
In this paper, we propose a joint modeling approach to select salient phrases reflecting key discussion points as well as to label the discourse relations between speaker turns in spoken meetings. We hypothesize that leveraging the interaction between content and discourse has the potential to yield better prediction performance on both phrase-based content selection and discourse relation prediction. Specifically, we utilize argumentative discourse relations as defined in the Twente Argumentation Schema (TAS) (Rienks et al., 2005), where discussions are organized into tree structures with discourse relations labeled between nodes (as shown in Figure 1). Algorithms for joint learning and joint inference are proposed for our model. We also present a variation of our model that treats discourse relations as latent variables when true labels are not available for learning. We envision that the salient phrases extracted by our model can be used as input to abstractive meeting summarization systems (Wang and Cardie, 2013; Mehdad et al., 2014). Combined with the predicted discourse structure, a visualization tool can be used to display conversation flow to support intelligent meeting assistant systems.
To the best of our knowledge, our work is the first to jointly model content and discourse relations in meetings. We test our model with two meeting corpora: the AMI corpus (Carletta et al., 2006) and the ICSI corpus (Janin et al., 2003). Experimental results show that our model yields an accuracy of 63.2 on phrase selection, which is significantly better than a classifier based on Support Vector Machines (SVM). Our discourse prediction component also obtains better accuracy than a state-of-the-art neural network-based approach (59.2 vs. 54.2). Moreover, our model trained with latent discourse outperforms SVMs on both AMI and ICSI corpora for phrase selection. We further evaluate the usage of selected phrases as extractive meeting summaries. Results evaluated by ROUGE (Lin and Hovy, 2003) demonstrate that our system summaries obtain a ROUGE-SU4 F1 score of 21.3 on the AMI corpus, which outperforms non-trivial extractive summarization baselines and a keyword selection algorithm proposed in Liu et al. (2009).
Moreover, since both content and discourse structure are critical for building shared understanding among participants (Mulder et al., 2002; Mercer, 2004), we further investigate whether our learned model can be utilized to predict the consistency among team members' understanding of their group decisions. This task was first defined as consistency of understanding (COU) prediction by Kim and Shah (2016), who labeled a portion of AMI discussions with consistency or inconsistency labels. We construct features from our model predictions to capture different discourse patterns and word entrainment scores for discussions with different COU levels. Results on AMI discussions show that SVM classifiers trained with our features significantly outperform the state-of-the-art results (Kim and Shah, 2016) (F1: 63.1 vs. 50.5) and non-trivial baselines.
The rest of the paper is structured as follows: we first summarize related work in Section 2. The joint model is presented in Section 3. Datasets and experimental setup are described in Section 4, which is followed by experimental results (Section 5). We then study the usage of our model for predicting consistency of understanding in groups in Section 6. We finally conclude in Section 7.

Related Work
Our model is inspired by research that leverages discourse structure for identifying salient content in conversations, which is still largely reliant on features derived from gold-standard discourse labels (McKeown et al., 2007; Murray et al., 2010; Bokaei et al., 2016). For instance, adjacency pairs, which are paired utterances with question-answer or offer-accept relations, are found to frequently appear in meeting summaries together and thus are utilized to extract summary-worthy utterances by Galley (2006). There is much less work that jointly predicts the importance of content along with the discourse structure in dialogues. Oya and Carenini (2014) employ a Dynamic Conditional Random Field to recognize sentences in email threads for use in summaries as well as their dialogue acts. Only local discourse structures from adjacent utterances are considered. Our model is built on tree structures, which capture more global information.
Our work is also in line with keyphrase identification or phrase-based summarization for conversations. Due to the noisy nature of dialogues, recent work focuses on identifying summary-worthy phrases from meetings (Fernández et al., 2008; Riedhammer et al., 2010) or email threads (Loza et al., 2014). For instance, Wang and Cardie (2012) treat the problem as an information extraction task, where summary-worthy content represented as indicator and argument pairs is identified by an unsupervised latent variable model. Our work also targets the detection of salient phrases from meetings, but focuses on the joint modeling of critical discussion points and the discourse relations that hold between them.
In the area of discourse analysis in dialogues, a significant amount of work has been done on predicting local discourse structures, such as recognizing dialogue acts or social acts of adjacent utterances from phone conversations (Stolcke et al., 2000; Kalchbrenner and Blunsom, 2013; Ji et al., 2016), spoken meetings (Dielmann and Renals, 2008), or emails (Cohen et al., 2004). Although discourse information from non-adjacent turns has been studied in the context of online discussion forums (Ghosh et al., 2014) and meetings (Hakkani-Tur, 2009), none of these studies models the effect of discourse structure on content selection, which is a gap that this work fills.

The Joint Model of Content and Discourse Relations
In this section, we first present our joint model in Section 3.1. The algorithms for learning and inference are described in Sections 3.2 and 3.3, followed by feature description (Section 3.4).

Model Description
Our proposed model learns to jointly perform phrase-based content selection and discourse relation prediction by making use of the interaction between the two sources of information. Assume that a meeting discussion is denoted as $x$, where $x$ consists of a sequence of discourse units $x = \{x_1, x_2, \ldots, x_n\}$. Each discourse unit can be a complete speaker turn or a part of it. As demonstrated in Figure 1, a tree-structured discourse diagram is constructed for each discussion with each discourse unit $x_i$ as a node of the tree. In this work, we consider the argumentative discourse structure defined by the Twente Argumentation Schema (TAS) (Rienks et al., 2005). Each node $x_i$ is attached to another node $x_{i'}$ ($i' < i$) in the discussion, and a discourse relation $d_i$ holds on the link $\langle x_i, x_{i'} \rangle$ ($d_i$ is empty if $x_i$ is the root). Let $t$ denote the set of links $\langle x_i, x_{i'} \rangle$ in $x$. Following previous work on discourse analysis in meetings (Rienks et al., 2005; Hakkani-Tur, 2009), we assume that the attachment structure between discourse units is given during both training and testing.
A set of candidate phrases is extracted from each discourse unit $x_i$, from which salient phrases that contain gist information will be identified. We obtain constituent and dependency parses for utterances using the Stanford parser (Klein and Manning, 2003). We restrict an eligible candidate to be a noun phrase (NP), verb phrase (VP), prepositional phrase (PP), or adjective phrase (ADJP) with at most 5 words, and its head word cannot be a stop word. If a candidate is a parent of another candidate in the constituent parse tree, we only keep the parent. We further merge a verb and a candidate noun phrase into one candidate if the latter is the direct object or subject of the verb. For example, from the utterance "let's use a rubber case as well as rubber buttons", we can identify the candidates "use a rubber case" and "rubber buttons". For $x_i$, the set of candidate phrases is denoted as $c_i = \{c_{i,1}, c_{i,2}, \ldots, c_{i,m_i}\}$, where $m_i$ is the number of candidates. $c_{i,j}$ takes a value of 1 if the corresponding candidate is selected as a salient phrase; otherwise, $c_{i,j}$ is equal to 0. All candidate phrases in discussion $x$ are represented as $c$.
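The candidate restrictions above can be illustrated with a minimal sketch. It assumes the constituents have already been read off a parse as (label, span, head word) records; the `Constituent` type, the toy stop-word list, and the use of span containment as a stand-in for the parent relation are assumptions made here for illustration, and the verb-object merging step is not covered.

```python
from dataclasses import dataclass

ELIGIBLE_TYPES = {"NP", "VP", "PP", "ADJP"}
STOP_WORDS = {"the", "a", "an", "it", "that", "this", "of", "to"}  # toy list (assumption)
MAX_WORDS = 5


@dataclass(frozen=True)
class Constituent:
    label: str  # phrase type taken from the constituent parse
    start: int  # token offset, inclusive
    end: int    # token offset, exclusive
    head: str   # head word of the phrase


def is_eligible(c):
    """Phrase type, length, and head-word restrictions from the text."""
    return (c.label in ELIGIBLE_TYPES
            and (c.end - c.start) <= MAX_WORDS
            and c.head.lower() not in STOP_WORDS)


def select_candidates(constituents):
    """Keep eligible constituents that are not nested inside another eligible one."""
    eligible = [c for c in constituents if is_eligible(c)]

    def nested_in_other(c):
        # Strict span containment approximates "has a candidate as parent".
        return any(p.start <= c.start and c.end <= p.end
                   and (p.start, p.end) != (c.start, c.end)
                   for p in eligible)

    return [c for c in eligible if not nested_in_other(c)]
```

For the tokens of "let's use a rubber case as well as rubber buttons", a VP over "use a rubber case" suppresses the NP "a rubber case" nested inside it, while the nine-word VP covering the rest of the utterance is rejected by the length limit.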
We then define a log-linear model with feature parameters $w$ for the candidate phrases $c$ and discourse relations $d$ in $x$ as:

$$
\begin{aligned}
p(c, d \mid x, w) &\propto \exp\left[ w \cdot \Phi(c, d, x) \right] \\
&\propto \exp\left[ w \cdot \sum_{i=1}^{n} \phi(c_i, d_i, d_{i'}, x) \right] \\
&\propto \exp\left[ \sum_{i=1}^{n} \left( w_c \cdot \sum_{j=1}^{m_i} \phi_c(c_{i,j}, x) + w_d \cdot \phi_d(d_i, d_{i'}, x) + w_{cd} \cdot \sum_{j=1}^{m_i} \phi_{cd}(c_{i,j}, d_i, x) \right) \right]
\end{aligned}
$$

Here $\Phi(\cdot)$ and $\phi(\cdot)$ denote feature vectors. We utilize three types of feature functions: (1) content-only features $\phi_c(\cdot)$, which capture the importance of phrases, (2) discourse-only features $\phi_d(\cdot)$, which characterize the (potentially higher-order) discourse relations, and (3) joint features of content and discourse $\phi_{cd}(\cdot)$, which model the interaction between the two. $w_c$, $w_d$, and $w_{cd}$ are the corresponding feature parameters. Detailed feature descriptions can be found in Section 3.4.
Discourse Relations as Latent Variables. As mentioned in the introduction, acquiring labeled training data for discourse relations is a time-consuming process since it requires human annotators to inspect the full discussions. Therefore, we further propose a variation of our model that treats the discourse relations as latent variables, so that $p(c \mid x, w) = \sum_{d} p(c, d \mid x, w)$. Its learning algorithm is slightly different, as described in the next section.
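To make the factorization concrete, the following minimal sketch computes the unnormalized log-score of a configuration as a sum of content, discourse, and joint terms, and the unnormalized latent-variable marginal by brute-force summation over relation assignments. The sparse-dictionary feature representation, the function signatures, and the exhaustive enumeration are assumptions for illustration only; enumeration is exponential in the number of units and is not how the model is trained.

```python
import math
from itertools import product


def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def score(c, d, parent, x, w, phi_c, phi_d, phi_cd):
    """Unnormalized log-score: per-unit content, discourse, and joint terms.

    c[i][j] is the 0/1 label of candidate j in unit i, d[i] the relation on
    the link from unit i to parent[i] (None for the root).
    """
    total = 0.0
    for i in range(len(c)):
        d_parent = d[parent[i]] if parent[i] is not None else None
        total += dot(w["d"], phi_d(d[i], d_parent, x, i))
        for j, c_ij in enumerate(c[i]):
            total += dot(w["c"], phi_c(c_ij, x, i, j))
            total += dot(w["cd"], phi_cd(c_ij, d[i], x, i, j))
    return total


def latent_marginal(c, parent, x, w, relations, feats):
    """Unnormalized sum over d of exp(score(c, d)), by exhaustive enumeration."""
    non_root = [i for i, p in enumerate(parent) if p is not None]
    total = 0.0
    for labels in product(relations, repeat=len(non_root)):
        d = [None] * len(c)
        for i, label in zip(non_root, labels):
            d[i] = label
        total += math.exp(score(c, d, parent, x, w, *feats))
    return total
```

Here `feats` is the tuple `(phi_c, phi_d, phi_cd)` of hypothetical feature functions, each returning a dictionary from feature name to value.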

Joint Learning for Parameter Estimation
For learning the model parameters $w$, we employ an algorithm based on SampleRank (Rohanimanesh et al., 2011), which is a stochastic structure learning method. In general, the learning algorithm constructs a sequence of configurations for sample labels as a Markov chain Monte Carlo (MCMC) chain based on a task-specific loss function, where stochastic gradients are distributed across the chain.
The full learning procedure is described in Algorithm 1. To start with, the feature weights $w$ are initialized with each value randomly drawn from $[-1, 1]$. Multiple epochs are run through all samples. For each sample, we randomly initialize the assignment of candidate phrase labels $c$ and discourse relations $d$. Then an MCMC chain is constructed with a series of configurations $\sigma = (c, d)$: at each step, it first samples a discourse structure $d'$ based on the proposal distribution $q(d' \mid d, x)$, and then samples phrase labels $c'$ conditional on the new discourse relations and previous phrase labels based on $q(c' \mid c, d', x)$. Local search is used for both proposal distributions. The new configuration $\sigma' = (c', d')$ is accepted if it improves on the score given by $\omega(\sigma')$. The parameters $w$ are updated accordingly.
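The flavor of this procedure can be conveyed with a simplified SampleRank-style loop. This is a sketch under stated assumptions, not a reproduction of Algorithm 1: the perceptron-like update rule, the zero weight initialization (the text draws initial weights from $[-1, 1]$), and the caller-supplied `init`, `propose`, `features`, and `omega` functions are all hypothetical choices made for illustration.

```python
import random


def dot(w, feats):
    return sum(w.get(name, 0.0) * value for name, value in feats.items())


def add_scaled(w, feats, scale):
    for name, value in feats.items():
        w[name] = w.get(name, 0.0) + scale * value


def samplerank(samples, init, propose, features, omega,
               epochs=3, steps=30, lr=1.0, seed=0):
    """Simplified SampleRank-style training.

    samples: list of (x, gold) pairs; init/propose build configurations;
    features(sigma, x) returns a sparse feature dict; omega(sigma, gold)
    is the task-specific scorer (higher is better).
    """
    rng = random.Random(seed)
    w = {}  # zero-initialized here for determinism (assumption)
    for _ in range(epochs):
        for x, gold in samples:
            sigma = init(x, rng)
            for _ in range(steps):
                cand = propose(sigma, x, rng)
                gain = omega(cand, gold) - omega(sigma, gold)
                margin = dot(w, features(cand, x)) - dot(w, features(sigma, x))
                # Update only when the model ranks the pair differently
                # from the scorer.
                if gain > 0 and margin <= 0:
                    add_scaled(w, features(cand, x), lr)
                    add_scaled(w, features(sigma, x), -lr)
                elif gain < 0 and margin >= 0:
                    add_scaled(w, features(sigma, x), lr)
                    add_scaled(w, features(cand, x), -lr)
                if gain > 0:  # accept configurations that improve the score
                    sigma = cand
    return w
```

In the joint setting, `propose` would first change a discourse relation and then a phrase label, mirroring the two proposal distributions described above.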
For the scorer $\omega$, we use a weighted combination of F1 scores of phrase selection ($F1_c$) and discourse relation prediction ($F1_d$). When discourse relations are treated as latent, we initialize discourse relations for each sample with a label in $\{1, 2, \ldots, K\}$ if there are $K$ relations indicated, and we only use $F1_c$ as the scorer.
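A scorer of this kind might look like the sketch below. The convex-combination form with a single weight `alpha`, its default value, and the use of macro-averaged F1 for the multi-class discourse labels are assumptions made for illustration; the text only states that a weighted combination of the two F1 scores is used.

```python
def f1_for_label(pred, gold, label):
    """F1 of one label, treating it as the positive class."""
    tp = sum(p == label and g == label for p, g in zip(pred, gold))
    fp = sum(p == label and g != label for p, g in zip(pred, gold))
    fn = sum(p != label and g == label for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pred, gold):
    """Unweighted mean of per-label F1 over labels seen in pred or gold."""
    labels = sorted(set(pred) | set(gold))
    return sum(f1_for_label(pred, gold, lab) for lab in labels) / len(labels)


def scorer(c_pred, c_gold, d_pred=None, d_gold=None, alpha=0.5, latent=False):
    """Weighted combination of phrase F1 and discourse F1.

    alpha is a hypothetical mixing weight. With latent=True only the
    phrase-selection F1 is used, as described for the latent variant.
    """
    f1_c = f1_for_label(c_pred, c_gold, 1)
    if latent:
        return f1_c
    return alpha * f1_c + (1 - alpha) * macro_f1(d_pred, d_gold)
```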

Joint Inference for Prediction
Given a new sample $x$ and learned parameters $w$, we predict phrase labels and discourse relations as $\arg\max_{c,d} p(c, d \mid x, w)$.
Dynamic programming can be employed to carry out joint inference; however, it would be time-consuming since our objective function has a large search space for both content and discourse labels. Hence we propose an alternating optimization algorithm to search for $c$ and $d$ iteratively. Concretely, in each iteration, we first optimize over $d$ by maximizing $\sum_{i=1}^{n} \left( w_d \cdot \phi_d(d_i, d_{i'}, x) + w_{cd} \cdot \sum_{j=1}^{m_i} \phi_{cd}(c_{i,j}, d_i, x) \right)$. Message passing (Smith and Eisner, 2008) is used to find the best $d$.
In the second step, we search for $c$ that maximizes $\sum_{i=1}^{n} \left( w_c \cdot \sum_{j=1}^{m_i} \phi_c(c_{i,j}, x) + w_{cd} \cdot \sum_{j=1}^{m_i} \phi_{cd}(c_{i,j}, d_i, x) \right)$. We believe that candidate phrases based on the same concepts should have the same predicted label. Therefore, candidates of the same phrase type that share the same head word are grouped into one cluster. We then cast our task as an integer linear programming problem (see footnote 3). We optimize our objective function under two constraints: (1) $c_{i,j} = c_{i',j'}$ if $c_{i,j}$ and $c_{i',j'}$ are in the same cluster, and (2) $c_{i,j} \in \{0, 1\}, \forall i, j$.
The inference process is the same for models trained with latent discourse relations.
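The alternating scheme can be sketched as coordinate ascent over the two label sets. In this illustration both steps use exhaustive search over tiny label spaces as a stand-in for the dedicated solvers named above, and the cluster constraint is enforced by assigning one binary variable per cluster; the `score` callable, the cluster representation, and the choice to give every unit (including the root) a relation label are assumptions.

```python
from itertools import product


def alternating_inference(n_units, clusters, relations, score, max_iters=10):
    """Alternate between the best d given c and the best c given d.

    clusters: list of lists of (i, j) candidate ids that must share a label.
    score(c, d): objective, with c a dict from (i, j) to 0/1 and d a list.
    """
    def expand(z):
        # Every member of cluster k receives the cluster's label z[k].
        return {cand: z[k] for k, members in enumerate(clusters) for cand in members}

    d = [relations[0]] * n_units
    z = [0] * len(clusters)
    best = score(expand(z), d)
    for _ in range(max_iters):
        # Step 1: optimize d with c fixed (exhaustive search stand-in).
        d_new = max((list(a) for a in product(relations, repeat=n_units)),
                    key=lambda a: score(expand(z), a))
        # Step 2: optimize cluster labels with d fixed (exhaustive search stand-in).
        z_new = max((list(a) for a in product((0, 1), repeat=len(clusters))),
                    key=lambda a: score(expand(a), d_new))
        new = score(expand(z_new), d_new)
        if new <= best:  # no improvement: a coordinate-wise optimum is reached
            break
        z, d, best = z_new, d_new, new
    return expand(z), d
```

Like any coordinate ascent, this can stop at a local optimum that depends on the initial assignment.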

Features
We use features that characterize content, discourse relations, and the combination of both.
Content Features. For modeling the salience of content, we calculate the minimum, maximum, and average of TF-IDF scores of words and the number of content words in each phrase, based on the intuition that important phrases tend to have more content words with high TF-IDF scores (Fernández et al., 2008). We also consider whether the head word of the phrase has been mentioned in the preceding turn, which implies the focus of a discussion. The size of the cluster each phrase belongs to is also included. The numbers of POS tags and phrase types are counted to characterize the syntactic structure. Previous work (Wang and Cardie, 2012) has found that a discussion usually ends with decision-relevant information. We thus identify the absolute and relative positions of the turn containing the candidate phrase in the discussion. Finally, we record whether the candidate phrase is uttered by the main speaker, who speaks the most words in the discussion.
Discourse Features. For each discourse unit, we collect the dialogue act types of the current unit and its parent node in the discourse tree, whether there is any adjacency pair held between the two nodes (Hakkani-Tur, 2009), and the Jaccard similarity between them. We record whether two turns are uttered by the same speaker; for example, ELABORATION is commonly observed between turns from the same participant. We also calculate the number of candidate phrases, based on the observation that OPTION and SPECIALIZATION tend to contain more informative words than POSITIVE feedback. The length of the discourse unit is also relevant. Therefore, we compute the time span and number of words. To incorporate global structure features, we encode the depth of the node in the discourse tree and the number of its siblings. Finally, we include an order-2 discourse relation feature that encodes the relation between the current discourse unit and its parent, and the relation between the parent and its grandparent if it exists.
Joint Features. For modeling the interaction between content and discourse, the discourse relation is added to each content feature to compose a joint feature. For example, if candidate $c$ in discussion $x$ has a content feature $\phi_{[avg\text{-}TFIDF]}(c, x)$ with a value of 0.5, and its discourse relation $d$ is POSITIVE, then the joint feature takes the form of $\phi_{[avg\text{-}TFIDF,\,Positive]}(c, d, x) = 0.5$.
Footnote 3: We use lpsolve: http://lpsolve.sourceforge.net/5.5/.
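As an illustration of how such features compose, the sketch below computes the TF-IDF statistics and content-word count for a phrase and then derives joint features by pairing each content feature with a discourse relation. The feature names, the precomputed `tfidf` lookup, and the `content_words` set are hypothetical; only a few of the content features listed above are covered.

```python
def content_features(phrase_tokens, tfidf, content_words):
    """TF-IDF statistics and content-word count for a non-empty phrase."""
    scores = [tfidf.get(token, 0.0) for token in phrase_tokens]
    return {
        "min-TFIDF": min(scores),
        "max-TFIDF": max(scores),
        "avg-TFIDF": sum(scores) / len(scores),
        "num-content-words": float(sum(t in content_words for t in phrase_tokens)),
    }


def joint_features(content_feats, relation):
    """Pair every content feature with the unit's discourse relation."""
    return {"%s,%s" % (name, relation): value
            for name, value in content_feats.items()}
```

With this construction a content feature with value 0.5 under relation POSITIVE yields a joint feature with the same value, matching the example in the text.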

Datasets and Experimental Setup
Meeting Corpora. We evaluate our joint model on two meeting corpora with rich annotations: the AMI meeting corpus (Carletta et al., 2006) and the ICSI meeting corpus (Janin et al., 2003). The AMI corpus consists of 139 scenario-driven meetings, and the ICSI corpus contains 75 naturally occurring meetings. Both corpora are annotated with dialogue acts, adjacency pairs, and topic segmentation. We treat each topic segment as one discussion, and remove discussions with fewer than 10 turns or labeled as "opening" or "chitchat". This yields 694 discussions from AMI and 1139 discussions from ICSI; these two datasets are henceforth referred to as AMI-FULL and ICSI-FULL.
Acquiring Gold-Standard Labels. Both corpora contain human-constructed abstractive summaries and extractive summaries at the meeting level. Short abstracts, usually one sentence long, are constructed by meeting participants (participant summaries) and by external annotators (abstractive summaries). Dialogue acts that contribute to important output of the meeting, e.g., decisions, are identified and used as extractive summaries, and some of them are also linked to the corresponding abstracts.
Since the corpora do not contain phrase-level importance annotation, we induce gold-standard labels for candidate phrases based on the following rule. A candidate phrase is considered a positive sample if its head word is contained in any abstractive summary or participant summary. On average, 71.9 candidate phrases are identified per discussion for AMI-FULL with 31.3% labeled as positive, and 73.4 for ICSI-FULL with 24.0% of them as positive samples.
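The labeling rule can be written down in a few lines. This sketch assumes exact, case-insensitive token matching after stripping surrounding punctuation; whether stemming or lemmatization is applied when matching head words is not stated, so that choice is an assumption, as is the `(phrase, head_word)` input format.

```python
def induce_labels(candidates, summaries):
    """Label a candidate 1 if its head word occurs in any summary, else 0.

    candidates: list of (phrase, head_word) pairs.
    summaries: abstractive and participant summaries as plain strings.
    """
    summary_words = {
        word.lower().strip('.,;:!?"()')
        for summary in summaries
        for word in summary.split()
    }
    return [1 if head.lower() in summary_words else 0 for _, head in candidates]
```

For instance, with the abstract "What sort of battery to use." the candidate "a strong enough battery" (head word "battery") is labeled positive, while "watching a dvd" (head word "dvd") is labeled negative.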
Furthermore, a subset of discussions in AMI- [...] (Fernández et al., 2008; Wang and Cardie, 2013) and discourse parsing for formal text (Hernault et al., 2010). Therefore, we compare with linear SVM-based classifiers, trained with the same feature set of content features or discourse features. We fix the trade-off parameter to 1.0 for all SVM-based experiments. For discourse relation prediction, we use a one-vs-rest strategy to build multiple binary classifiers. We also compare with a state-of-the-art discourse parser (Ji et al., 2016), which employs a neural language model to predict discourse relations.
Experimental Results

Phrase Selection and Discourse Labeling
Here we present the experimental results on phrase-based content selection and discourse relation prediction. We experiment with two variations of our joint model: one is trained on gold-standard discourse relations, and the other is trained with discourse relations treated as latent variables. We also investigate whether joint learning and joint inference can produce better prediction performance. We consider joint learning with separate inference, where only content features or discourse features are used for prediction (Separate-Inference). We further study learning separate classifiers for content selection and discourse relations without joint features (Separate-Learn).
Footnote 4: There are 9 types of relations in TAS.
We first show the phrase selection and discourse relation prediction results on AMI-SUB in Tables 1 and 2. As shown in Table 1, our models, trained with gold-standard discourse relations or latent ones with true attachment structure, yield significantly better accuracy and F1 scores than SVM-based classifiers trained with the same feature sets for phrase selection (paired t-test, p < 0.05). Our joint learning model with separate inference also outperforms the neural network-based discourse parsing model (Ji et al., 2016) in Table 2.
Moreover, Tables 1 and 2 demonstrate that joint learning usually produces better performance on both tasks than separate learning. Combined with joint inference, our model obtains the best accuracy and F1 on phrase selection. This indicates that leveraging the interplay between content and discourse boosts prediction performance. Similar results are achieved on AMI-FULL and ICSI-FULL in Table 3, where latent discourse relations without true attachment structure are employed for training.

Phrase-Based Extractive Summarization
We further evaluate whether the predictions of the content selection component can be used for summarizing the key points at the discussion level. For each discussion, salient phrases identified by our model are concatenated in sequence for use as the summary. We consider two types of gold-standard summaries. One is the utterance-level extractive summary, which consists of human-labeled summary-worthy utterances. The other is the abstractive summary, for which we collect human abstracts with at least one link from summary-worthy utterances.
We calculate scores based on ROUGE (Lin and Hovy, 2003), which is a popular tool for evaluating text summarization (Gillick et al., 2009; Liu and Liu, 2010). ROUGE-1 (unigrams) and ROUGE-SU4 (skip-bigrams with at most 4 words in between) are used. Following previous work on meeting summarization (Riedhammer et al., 2010; Wang and Cardie, 2013), we consider two dialogue act-level summarization baselines: (1) LONGEST DA, where the longest dialogue act in each discussion is selected as the summary, and (2) CENTROID DA, the one with the highest TF-IDF similarity with all DAs in the discussion. We also compare with an unsupervised keyword extraction approach by Liu et al. (2009), where word importance is estimated by its TF-IDF score, POS tag, and the salience of its corresponding sentence. With the same candidate phrases as in our model, we extend Liu et al. (2009) by scoring each phrase based on the average score of its words. The top-scoring phrases, equal in number to the phrases output by our model, are included in the summaries. Finally, we compare with summaries consisting of salient phrases predicted by an SVM classifier trained with our content features.
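For intuition about what ROUGE-1 measures, the following sketch computes clipped unigram overlap between a system summary and a single reference. It is not the official ROUGE toolkit: whitespace tokenization, lowercasing, and the absence of stemming or stop-word handling are simplifying assumptions, and ROUGE-SU4 is not covered.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram precision, recall, and F1 with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-word minimum of the two counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A phrase-based summary is scored by concatenating its phrases into one string before calling `rouge_1`.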
From the results in Table 4, we can see that phrase-based extractive summarization methods can yield better ROUGE scores for recall and F1 than baselines that extract whole sentences. Meanwhile, our system significantly outperforms the SVM-based classifiers when evaluated on ROUGE recall and F1, while achieving comparable precision. Compared to Liu et al. (2009), our system also yields better results on all metrics.

Meeting Clip:
D: can we uh power a light in this? can we get a strong enough battery to power a light?
A: um i think we could because the lcd panel requires power, and the lcd is a form of a light so that. . .
D: . ..it's gonna have to have something high-tech about it and that's gonna take battery power. . .
D: illuminate the buttons. yeah it glows.
D: well m i'm thinking along the lines of you're you're in the dark watching a dvd and you um you find the thing in the dark and you go like this . . . oh where's the volume button in the dark, and uh y you just touch it . . . and it lights up or something.
Abstract by Human: What sort of battery to use. The industrial designer presented options for materials, components, and batteries and discussed the restrictions involved in using certain materials.
Longest DA: well m i'm thinking along the lines of you're you're in the dark watching a dvd and you um you find the thing in the dark and you go like this.
Centroid DA: can we uh power a light in this?
Our Method:
- power a light, a strong enough battery
- requires power, a form
- a really good battery, battery power
- illuminate the buttons, glows
- watching a dvd, the volume button, lights up or something
Sample summaries by our model along with two baselines are displayed in Figure 2. Utterance-level extract-based baselines unavoidably contain disfluency and unnecessary details. Our phrase-based extractive summary is able to capture the key points from both the argumentation process and important outcomes of the conversation. This implies that our model output can be used as input for an abstractive summarization system. It can also facilitate the visualization of decision-making processes.

Further Analysis and Discussions
Features Analysis. We first discuss salient features with top weights learned by our joint model. For content features, the main speaker tends to utter more salient content. Higher TF-IDF scores also indicate important phrases. If a phrase is mentioned in the previous turn and repeated in the current turn, it is likely to be a key point. For discourse features, structure features matter the most. For instance, jointly modeling the discourse relation of the parent node along with the current node can lead to better inference. An example is that giving more details on the proposal (ELABORATION) tends to lead to POSITIVE feedback. Moreover, REQUEST usually appears close to the root of the argument diagram tree, while POSITIVE feedback is usually observed on leaves. Adjacency pairs also play an important role in discourse prediction. For joint features, features that combine "phrase mentioned in previous turn" with the relation POSITIVE feedback or REQUEST receive higher weights, and are indicators for both key phrases and discourse relations. We also find that main speaker information combined with ELABORATION and UNCERTAIN is associated with high weights.
Error Analysis and Potential Directions. Taking a closer look at our prediction results, one major source of incorrect predictions for phrase selection is that similar concepts might be expressed in different ways, and our model predicts inconsistently for the different variations. For example, participants use both "thick" and "two centimeters" to talk about the desired shape of a remote control. However, our model does not group them into the same cluster and later makes different predictions. For future work, semantic similarity with context information can be leveraged to produce better clustering results. Furthermore, identifying discourse relations in dialogues is still a challenging task. For instance, "I wouldn't choose a plastic case" should be labeled as OPTION EXCLUSION if the previous turns talk about different options. Otherwise, it can be labeled as NEGATIVE. Therefore, models that better handle semantics and context need to be considered.

Predicting Consistency of Understanding
As discussed in previous work (Mulder et al., 2002; Mercer, 2004), both content and discourse structure are critical for building shared understanding among discussants. In this section, we test whether our joint model can be utilized to predict the consistency among team members' understanding of their group decisions, which is defined as consistency of understanding (COU) in Kim and Shah (2016). Kim and Shah (2016) establish gold-standard COU labels on a portion of AMI discussions by comparing participant summaries to determine whether participants report the same decisions. If all decision points are consistent, the associated topic discussion is labeled as consistent; otherwise, the discussion is identified as inconsistent. Their annotation covers the AMI-SUB dataset. Therefore, we run the prediction experiments on AMI-SUB using the same annotation. Out of a total of 129 discussions in AMI-SUB, 86 discussions are labeled as consistent and 43 as inconsistent.
We construct three types of features by using our model's predicted labels. Firstly, we learn two versions of our model based on the "consistent" discussions and the "inconsistent" ones in the training set, with learned parameters $w_{con}$ and $w_{incon}$. For a discussion in the test set, these two models output two probabilities $p_{con} = \max_{c,d} P(c, d \mid x, w_{con})$ and $p_{incon} = \max_{c,d} P(c, d \mid x, w_{incon})$. We use $p_{con} - p_{incon}$ as a feature.
Furthermore, we consider discourse relations of length one and two from the discourse structure tree. Intuitively, some discourse relations, e.g., ELABORATION followed by multiple POSITIVE feedback, imply consistent understanding.
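One way to read "discourse relations of length one and two" is as counts of single relations and of parent-child relation pairs along tree edges. The sketch below follows that reading, which is an interpretation made here for illustration; the list-based tree encoding (`parent[i]` is the index of the parent unit, `None` for the root) is also hypothetical.

```python
from collections import Counter


def relation_patterns(relations, parent):
    """Count relation unigrams and (parent relation, child relation) pairs.

    relations[i] is the relation on the link from unit i to its parent
    (None for the root); parent[i] is the parent's index or None.
    """
    counts = Counter()
    for i, rel in enumerate(relations):
        if rel is None:
            continue
        counts[(rel,)] += 1
        p = parent[i]
        if p is not None and relations[p] is not None:
            counts[(relations[p], rel)] += 1
    return counts
```

Under this encoding, an ELABORATION turn that receives two POSITIVE replies contributes the pattern (ELABORATION, POSITIVE) twice.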
The third feature is based on word entrainment, which has been shown to correlate with task success for groups (Nenkova et al., 2008). Using the formula in Nenkova et al. (2008), we compute the average word entrainment between the main speaker, who utters the most words, and all the other participants. The content words in the salient phrases predicted by our model are considered for entrainment computation.
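As a rough illustration, entrainment over a word set can be measured as the negated sum of absolute differences between two speakers' relative word frequencies, so that values closer to zero indicate more similar usage. The sketch below uses that form; treating it as the exact formula applied here, as well as the tokenized inputs and the simple averaging over the other participants, are assumptions.

```python
from collections import Counter


def entrainment(words, tokens_a, tokens_b):
    """Negated sum of absolute differences in relative word frequency."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = max(len(tokens_a), 1), max(len(tokens_b), 1)
    return -sum(abs(counts_a[w] / total_a - counts_b[w] / total_b) for w in words)


def average_entrainment(words, main_tokens, other_speakers_tokens):
    """Mean entrainment between the main speaker and each other participant."""
    scores = [entrainment(words, main_tokens, tokens)
              for tokens in other_speakers_tokens]
    return sum(scores) / len(scores)
```

Here `words` would be the content words of the salient phrases predicted for the discussion.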
Results. Leave-one-out cross-validation is used for the experiments. For training, our features are constructed from gold-standard phrase and discourse labels. Labels predicted by our model are used for constructing features during testing. An SVM-based classifier is used for experimenting with different sets of features output by our model. A majority class baseline is constructed as well. We also consider an SVM classifier trained with ngram features (unigrams and bigrams). Finally, we compare with the state-of-the-art method in Kim and Shah (2016), where discourse-relevant features and head gesture features are utilized in Hidden Markov Models to predict the consistency label.
The results are displayed in Table 5. All SVMs trained with our features surpass the ngrams-based baseline. In particular, the discourse features, the word entrainment feature, and the combination of the three all significantly outperform the state-of-the-art system by Kim and Shah (2016).

Conclusion
We presented a joint model for performing phrase-level content selection and discourse relation prediction in spoken meetings. Experimental results on the AMI and ICSI meeting corpora showed that our model can outperform state-of-the-art methods for both tasks. Further evaluation on the task of predicting consistency of understanding in meetings demonstrated that classifiers trained with features constructed from our model output produced superior performance compared to the state-of-the-art model. This provides evidence that our model can be successfully applied to other prediction tasks in spoken meetings.

Figure 1: A sample clip from the AMI meeting corpus. B, C, and D denote different speakers. Here we highlight salient phrases (in italics) that are relevant to the major topic discussed, i.e., "which type of battery to use for the remote control". Arrows indicate discourse structure between speaker turns. We also show some of the discourse relations for illustration.

Figure 2: Sample summaries output by different systems for a meeting clip from the AMI corpus (less relevant utterances in between are removed). Salient phrases output by our system are displayed for each turn of the clip, with duplicated phrases removed for brevity.

Table 1: Phrase-based content selection performance on AMI-SUB with accuracy (acc) and F1. We display results of our models trained with gold-standard discourse relation labels and with latent discourse relations. For the latter, we also show results based on True Attachment Structure, where the gold-standard attachments are known, and without the True Attachment Structure. Our models that significantly outperform the SVM-based model are highlighted with * (p < 0.05, paired t-test). The best result for each column is in bold.

Table 2: Discourse relation prediction performance on AMI-SUB. Our models that significantly outperform the SVM-based model and Ji et al. (2016) are highlighted with * (p < 0.05, paired t-test). The best result for each column is in bold.

Table 4: ROUGE scores for phrase-based extractive summarization evaluated against human-constructed utterance-level extractive summaries and abstractive summaries. Our models that statistically significantly outperform SVM and Liu et al. (2009) are highlighted with * (p < 0.05, paired t-test). The best ROUGE score for each column is in bold.

Table 5: Consistency of understanding (COU) prediction results on AMI-SUB. Results that statistically significantly outperform the ngrams-based baseline and Kim and Shah (2016) are highlighted with * (p < 0.05, paired t-test). For reference, we also show the prediction performance based on gold-standard discourse relations and phrase selection labels.