Learning Sentence Ordering for Opinion Generation of Debate

We propose a sentence ordering method to help compose persuasive opinions for debating. In debate texts, support of an opinion, such as evidence and reasons, typically follows the main claim. We focused on this claim-support structure to order sentences and developed a two-step method. First, we select from among the candidate sentences a first sentence that is likely to be a claim. Second, we order the remaining sentences by using a ranking-based method. We tested the effectiveness of the proposed method by comparing it with a general-purpose sentence ordering method and found through experiments that it improves the accuracy of first sentence selection by about 19 percentage points and performs better on all metrics. We also applied the proposed method to a constructive speech generation task.


Introduction
There are increasing demands for information structuring technologies that support decision making using large amounts of data. Argumentation in debate, which composes texts in a persuasive manner, is a research target well suited to such information structuring. In this paper, we discuss sentence ordering for constructive speech generation in debate.
The following is an excerpt of a constructive speech that provides an affirmative opinion on the banning of gambling (from the International Debate Education Association). Motion: This House should ban gambling.
(1) Poor people are more likely to gamble, in the hope of getting rich.
(2) In 1999, the National Gambling Impact Commission in the United States found that 80 percent of gambling revenue came from lower-income households.
We can observe a typical structure of constructive speech in this example. The first sentence describes a claim that is the main statement of the opinion, and the second sentence supports the main statement. In this paper, we focus on this claim-support structure to order sentences. Regarding the structures of arguments, we can find research on the modeling of arguments (Freeley and Steinberg, 2008) and on recognition tasks such as claim detection (Aharoni et al., 2014). To the best of our knowledge, there is no research that examines the claim-support structure of debate texts for the sentence ordering problem. Most of the previous works on sentence ordering (Barzilay et al., 2002; Lapata, 2003; Bollegala et al., 2006; Tan et al., 2013) focus on the sentence order of news articles and do not consider the structures of arguments. These methods mingle claim and supportive sentences together, which decreases the persuasiveness of generated opinions.
In this paper, we propose a sentence ordering method in which a motion and a set of sentences are given as input. Ordering all the paragraphs of a debate text at once is quite difficult, so we simplified the task by assuming that all input sentences stand for a single viewpoint regarding the motion.
We use this claim-support structure as a cue for sentence ordering. We employ two-step ordering based on machine learning, as shown in Fig. 1. First, we select a first sentence that corresponds to a claim, and second, we order the supportive sentences of the claim in terms of consistency. For each step, we design machine learning features to capture the characteristics of sentences in terms of the claim-support structure. The dataset for training and testing is made up of content from an online debate site.
The remainder of this paper is structured as follows. The next section describes related works dealing with sentence ordering. In the third section, we examine the characteristics of debate texts. Next, we describe our proposed method, explain the experiments we performed to evaluate the performance, and discuss the results. After that, we describe our application of the proposed sentence ordering to automated constructive speech generation. We conclude the paper with a summary and a brief mention of future work.

Related Works
Previous research on sentence ordering has been conducted as a part of multi-document summarization. There are four major feature types to order sentences: publication dates of source documents, topical similarity, transitional association cues, and rhetorical cues.
Arranging sentences in order of the publication dates of the source documents is known as chronological ordering (Barzilay et al., 2002). It is effective for news article summarization because descriptions of a certain event tend to follow the order of publication. It is, however, not suitable for opinion generation, which requires statements and evidence rather than a simple summarization of an event.
Topical similarity is based on the assumption that neighboring sentences have a higher similarity than non-neighboring ones. For example, bag-of-words-based cosine similarities of sentence pairs are used in (Bollegala et al., 2006; Tan et al., 2013). Another method, the lexical chain, models the semantic distances of word pairs on the basis of synonym dictionaries such as WordNet (Barzilay and Elhadad, 1997; Chen et al., 2005). The effectiveness of this feature depends highly on the method used to calculate similarity.
Transitional association is used to measure the likelihood of two consecutive sentences. Lapata proposed a sentence ordering method based on a probabilistic model (Lapata, 2003). This method uses conditional probability to represent transitional probability from the previous sentence to the target sentence.
Dias et al. used rhetorical structures to order sentences (de S. Dias et al., 2014). Rhetorical structure theory (RST) (Mann and Thompson, 1988) explains textual organization, such as background and cause-effect relations, which can be useful for determining sentence order. For example, causes are likely to precede results. However, it is important to restrict the types of rhetorical relations because the original RST defines many relations, and a large amount of data is required for accurate estimation.
There has also been research on integrating different types of features. Bollegala et al. proposed a machine learning-based integration of different kinds of features (Bollegala et al., 2006), using a binary classifier to determine whether the order of a given sentence pair is acceptable. Tan et al. formulated sentence ordering as a ranking problem over sentences (Tan et al., 2013). Their experimental results showed that the ranking-based method outperformed classification-based methods.

Characteristics of Debate Texts
Topical similarity can be measured by the word overlap between two sentences. This metric assumes that the closer a sentence pair is, the more their words overlap. To examine this assumption, we compared the characteristics of debate texts with those of news articles from the Annotated Gigaword corpus (Napoles et al., 2012). We randomly selected 80,000 articles and extracted the seven leading sentences of each article. Overall, we found less word overlap in debate texts than in news articles for both neighboring and non-neighboring pairs. This is mainly because debaters usually try to add as much information as possible. We assume from this result that conventional topical similarity is less effective for debate texts, and we have therefore focused on the claim-support structure of debate texts.
We also examined the occurrence of named entities (NEs) in each sentence. Most of the sentences in news articles contain NEs, while far fewer sentences in debate texts do. This suggests that debate texts deal more with general opinions and related examples, while news articles describe specific events.

Two-Step Ordering
In this study, we focused on a simple but common style of constructive speech. We assumed that a constructive speech item has a claim and one or more supporting sentences. The flow of the proposed ordering method is shown in Fig. 2. The method receives a motion and a set of sentences as input and outputs ordered sentences. First, syntactic parsing is applied to the input texts, and features for the machine learning models are extracted from the results. Second, we select the first sentence, which is likely to be the claim sentence, from the candidate sentences. This problem is formulated as a binary classification problem in which the first sentences of constructive speech items are positive examples and all others are negative. Third, we order the remaining sentences on the basis of the connectivity of sentence pairs. This problem is formulated as a ranking problem, similarly to (Tan et al., 2013).

Feature Extraction
We obtained the parts of speech, lemmas, syntactic parse trees, and NEs of each input sentence using Stanford CoreNLP (Manning et al., 2014). The following features, which are commonly used in sentence ordering methods to measure local coherence (Bollegala et al., 2006; Tan et al., 2013; Lapata, 2003), are then extracted.
Sentence similarity: Cosine similarity between sentences u and v. We simply counted the frequency of each word to compute the cosine similarity. In addition, we measured the cosine similarity between the latter half of u (denoted latter(u)) and the former half of v (denoted former(v)). A sentence is split at the comma closest to its center (if one exists) or at its middle word (if no comma exists).
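As an illustration, the split-at-center heuristic and the frequency-based cosine similarity might be sketched as follows (a minimal sketch; the tokenization and the example sentences are our own, not from the paper's pipeline):

```python
from collections import Counter
import math

def cosine(u_words, v_words):
    """Cosine similarity between two term-frequency vectors."""
    cu, cv = Counter(u_words), Counter(v_words)
    dot = sum(cu[w] * cv[w] for w in cu)
    nu = math.sqrt(sum(c * c for c in cu.values()))
    nv = math.sqrt(sum(c * c for c in cv.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def split_halves(tokens):
    """Split a sentence at the comma closest to its center,
    or at the middle word if it contains no comma."""
    commas = [i for i, t in enumerate(tokens) if t == ","]
    mid = len(tokens) // 2
    cut = min(commas, key=lambda i: abs(i - mid)) if commas else mid
    return tokens[:cut], tokens[cut:]

# Hypothetical example sentences (not from the paper's dataset).
u = "gambling traps poor people , draining their income".split()
v = "their income is spent on lottery tickets".split()
sim_full = cosine(u, v)                # whole-sentence similarity
_, latter_u = split_halves(u)
former_v, _ = split_halves(v)
sim_half = cosine(latter_u, former_v)  # similarity of latter(u) and former(v)
```

The half-sentence similarity is meant to reward pairs where the end of one sentence leads into the start of the next.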
Overlap: Commonly shared words of u and v.
Let overlap_j(u, v) be the number of words shared by u and v, for j = 1, 2, 3 representing lemmatized nouns, verbs, and adjectives or adverbs, respectively. We calculated overlap_j(u, v)/min(|u|, |v|) and overlap_j(latter(u), former(v))/overlap_j(u, v), where |u| is the number of words in sentence u. The value is set to 0 if the denominator is 0.
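The first overlap ratio can be sketched as follows (a simplified illustration assuming pre-lemmatized tokens with coarse part-of-speech tags; the tag names are hypothetical, and the second ratio over sentence halves is omitted for brevity):

```python
def overlap_features(u_pos, v_pos):
    """overlap_j(u, v) / min(|u|, |v|) for each word class j.
    u_pos, v_pos: lists of (lemma, tag) pairs with tag in
    {"NOUN", "VERB", "ADJ_ADV"} (a hypothetical coarse tag set)."""
    feats = []
    for tag in ("NOUN", "VERB", "ADJ_ADV"):
        u_set = {w for w, t in u_pos if t == tag}
        v_set = {w for w, t in v_pos if t == tag}
        shared = len(u_set & v_set)
        denom = min(len(u_pos), len(v_pos))
        feats.append(shared / denom if denom else 0.0)
    return feats

# Toy example: the pair shares the noun "people" and the adjective "poor".
u_pos = [("gambling", "NOUN"), ("trap", "VERB"),
         ("poor", "ADJ_ADV"), ("people", "NOUN")]
v_pos = [("people", "NOUN"), ("gamble", "VERB"), ("poor", "ADJ_ADV")]
feats = overlap_features(u_pos, v_pos)
```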
Expanded sentence similarity: Cosine similarity between candidate sentences expanded with synonyms. We used WordNet (Miller, 1995) to expand the nouns and verbs into synonyms.
Word transitional probability: Conditional probability P(w_v | w_u), where w_u and w_v denote words in sentences u and v, respectively. For the first sentence, we used P(w_u). The probabilities were estimated with a model based on Lapata's method (Lapata, 2003).
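A minimal sketch of such a transition model, under the simplifying assumption that the sentence-level likelihood is the smoothed mean of the word-pair probabilities (Lapata's original model combines them differently), might look like:

```python
from collections import Counter
from itertools import product

def train_transitions(documents):
    """Count word transitions over adjacent sentence pairs.
    documents: list of documents, each a list of tokenized sentences."""
    pair_counts, prev_counts = Counter(), Counter()
    for doc in documents:
        for prev, nxt in zip(doc, doc[1:]):
            for wu, wv in product(set(prev), set(nxt)):
                pair_counts[(wu, wv)] += 1
                prev_counts[wu] += 1
    return pair_counts, prev_counts

def transition_prob(u, v, pair_counts, prev_counts, alpha=0.01, vocab=10000):
    """Smoothed mean of P(w_v | w_u) over word pairs of sentences u, v."""
    probs = [(pair_counts[(wu, wv)] + alpha) / (prev_counts[wu] + alpha * vocab)
             for wu, wv in product(set(u), set(v))]
    return sum(probs) / len(probs) if probs else 0.0

# Toy corpus (hypothetical): transitions seen in training score higher.
docs = [[["cheap", "fun"], ["money", "lost"]], [["cheap"], ["money"]]]
pair_counts, prev_counts = train_transitions(docs)
p_seen = transition_prob(["cheap"], ["money"], pair_counts, prev_counts)
p_unseen = transition_prob(["cheap"], ["zebra"], pair_counts, prev_counts)
```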
The following features are used to capture the characteristics of claim sentences.
Motion similarity: Cosine similarity between the motion and the target sentence. This feature examines the existence of the motion keywords.
Expanded motion similarity: Cosine similarity of the target sentence to the motion expanded with synonyms.
Value relevance: Ratio of value expressions. In this study, we defined human values as topics that are obviously considered positive or negative and are highly relevant to people's values, and we created a dictionary of value expressions. For example, health, education, and the environment are considered positive for people's values, while crime, pollution, and high costs are considered negative.
Sentiment: Ratio of positive or negative words. The dictionary of sentiment words is from (Hu and Liu, 2004). This feature examines whether the stance of the target sentence is positive, negative, or neutral.

Concreteness features: The ratio of tokens that are capitalized words, numerical expressions, or NEs (organization, person, location, or temporal expressions). These features are used to capture the characteristics of supporting sentences and to measure the concreteness of the support.
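The dictionary-ratio features above (value relevance, sentiment, and a rough stand-in for concreteness) can be sketched as follows; the tiny lexica here are hypothetical illustrations, not the dictionaries used in the paper:

```python
def ratio_in_lexicon(tokens, lexicon):
    """Ratio of tokens that appear in a given lexicon."""
    return sum(t in lexicon for t in tokens) / len(tokens) if tokens else 0.0

def concreteness(tokens):
    """Ratio of capitalized or numeric tokens -- a rough stand-in for the
    NE-based concreteness features (the paper uses NE tags from CoreNLP)."""
    if not tokens:
        return 0.0
    return sum(t[:1].isupper() or t[:1].isdigit() for t in tokens) / len(tokens)

# Tiny hypothetical lexica; the paper's dictionaries are far larger.
VALUE_WORDS = {"health", "education", "environment", "crime", "pollution"}
SENTIMENT_WORDS = {"benefit", "safe", "improve", "harm", "risk", "addiction"}

sent = "gambling is a serious risk to health and a cause of crime".split()
value_rel = ratio_in_lexicon(sent, VALUE_WORDS)      # hits: health, crime
sentiment = ratio_in_lexicon(sent, SENTIMENT_WORDS)  # hits: risk
conc = concreteness(
    "In 1999 the National Gambling Impact Commission reported".split())
```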
We use the estimated results of the first step as a feature of the second step.
Estimated first sentence similarity: Cosine similarity between the target sentence and the estimated first sentence.

First Step: First Sentence Selection
In the first step, we choose a first sentence from input sentences. This task can be formulated as a binary classification problem. We employ a machine learning approach to solve this problem.
In the training phase, we extract N feature vectors from the N sentences in a document and train a binary classification function f_first defined by

  f_first(s_i) = +1 if the i-th sentence is the first sentence, and −1 otherwise,

where s_i denotes the feature vector corresponding to the i-th sentence.
In the prediction phase, we applied f_first to all sentences and selected as the first sentence the one with the maximum posterior probability of f_first(s_i) = +1.
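A minimal sketch of this first-step selection, using a tiny numpy-only logistic regression as a stand-in for the actual classifier (the paper does not commit to a specific model here) and synthetic feature vectors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_first_selector(docs, lr=0.1, epochs=200):
    """Tiny logistic regression trained by gradient descent (a stand-in for
    the paper's classifier). docs: list of (feature_matrix, first_index);
    the first sentence of each document is the positive class."""
    X = np.vstack([np.asarray(f, dtype=float) for f, _ in docs])
    y = np.concatenate([(np.arange(len(f)) == first).astype(float)
                        for f, first in docs])
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
    return w

def select_first(w, feats):
    """Choose the sentence with the maximum posterior of the positive class."""
    Xb = np.hstack([np.asarray(feats, dtype=float), np.ones((len(feats), 1))])
    return int(np.argmax(sigmoid(Xb @ w)))

# Synthetic documents: the first sentence's features are clearly shifted.
rng = np.random.default_rng(0)
train_docs = []
for _ in range(20):
    feats = rng.normal(0.0, 1.0, (5, 3))
    feats[0] += 4.0
    train_docs.append((feats, 0))
w = train_first_selector(train_docs)
```

Selecting the argmax of the posterior, rather than thresholding at 0.5, guarantees exactly one first sentence per document even when the classifier is biased toward the negative class.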

Second Step: Ranking-Based Ordering
In the second step, we assume that the first sentence has already been determined. The number of sentences in this step is N_second = N − 1. We use the ranking-based framework proposed by Tan et al. (2013) to order the sentences.
In the training phase, we generate N_second(N_second − 1) ordered pairs of sentences from the N_second sentences in a document and train an association strength function f_pair that maps a sentence pair to a rank value. For forward-direction pairs (the i-th sentence preceding the j-th sentence, i < j), the rank value is set to N − (j − i), so the shorter the distance between the pair, the larger the rank value. For backward-direction pairs, the rank value is set to 0.
In the prediction phase, the total ranking value of a sentence permutation ρ is defined by

  Rank(ρ) = Σ_{ρ(u) > ρ(v)} f_pair(u, v),

where ρ(u) > ρ(v) denotes that sentence u precedes sentence v in ρ. A learning-to-rank algorithm based on Support Vector Machines (Joachims, 2002) is used as the machine learning model; we used Classias (http://www.chokkan.org/software/classias/) for the first-step classification and svm_rank (http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html) to implement the training and the prediction of f_pair. We used the sentence similarity, the expanded sentence similarity, the overlap, and the transitional probability in addition to the same features as the first-step classification. These additional features are defined over a sentence pair (u, v). We applied the feature normalization proposed by Tan et al. (2013) to each additional feature. The normalization functions are defined as

  f_i(u, v)                                (4)
  f_i(u, v) − f_i(v, u)                    (5)
  f_i(u, v) − (1/|S|) Σ_{w∈S} f_i(u, w)    (6)
  f_i(u, v) − (1/|S|) Σ_{w∈S} f_i(w, v)    (7)

where f_i is the i-th feature function, S is the set of candidate sentences, and |S| is the number of sentences in S. Equation (4) is the original value of the i-th feature function. Equation (5) examines the priority of (u, v) over its inversion (v, u). Equation (6) measures the priority of (u, v) over the sentence pairs that have u as the first element. Equation (7) measures the priority of (u, v) over the sentence pairs that have v as the second element, similarly to Equation (6).
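The rank-value assignment and the search for the permutation with the maximum total ranking value can be sketched as follows; exhaustive search over permutations stands in for whatever search procedure is used in practice, and a toy f_pair built from the true order replaces the learned SVM-rank model:

```python
from itertools import permutations

def rank_value(i, j, n):
    """Target rank value for an ordered pair of positions (i, j):
    n - (j - i) for forward pairs, 0 for backward pairs."""
    return n - (j - i) if j > i else 0

def best_order(items, f_pair):
    """Return the permutation maximizing the total ranking value, i.e. the
    sum of f_pair(u, v) over all pairs with u placed before v. Exhaustive
    search is feasible only for short items like these speech excerpts."""
    def total(perm):
        return sum(f_pair(perm[a], perm[b])
                   for a in range(len(perm)) for b in range(a + 1, len(perm)))
    return max(permutations(items), key=total)

# Toy f_pair scored against a known true order (0, 1, 2, 3): the shuffled
# input should be restored, since every forward pair contributes positively.
recovered = best_order([2, 0, 3, 1], lambda u, v: rank_value(u, v, 4))
```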

Reconstructing Shuffled Sentences
We evaluated the proposed method by reconstructing the original order from randomly shuffled texts. We compared the proposed method with the Random method, a baseline that orders sentences randomly, and the Ranking method, a variant of Tan et al.'s method (Tan et al., 2013) that arranges sentences using the same procedure as the second step of the proposed method but excludes the estimated first sentence similarity feature.

Dataset
We created a dataset of constructive speech items from Debatabase to train and evaluate the proposed method. Each speech item in this dataset is a whole turn of affirmative/negative constructive speech consisting of several ordered sentences. Details of the dataset are shown in Table 3. The dataset has 501 motions related to 14 themes (e.g., politics, education) and contains a total of 3,754 constructive speech items. The average number of sentences per item is 7.2. Each constructive speech item has a short title sentence from which we extract the value (e.g., "health", "crime") of the item.

Metrics
The overall performance of ordering sentences is evaluated by Kendall's τ , Spearman Rank Correlation, and Average Continuity.
Kendall's τ is defined by

  τ = 1 − 2 n_inv / (N(N − 1)/2),    (8)

where N is the number of sentences and n_inv is the number of inversions of sentence pairs. The metric ranges from −1 (inverse order) to 1 (identical order). Kendall's τ measures the effort required of human readers to correct wrong sentence orders. Spearman Rank Correlation is defined by

  r_s = 1 − 6 Σ_{i=1}^{N} d(i)² / (N(N² − 1)),    (9)

where d(i) is the difference between the correct rank and the answered rank of the i-th sentence. Spearman Rank Correlation takes the distance of wrong answers directly into account.
Average Continuity is based on the number of matched n-grams and is defined using P_n. P_n is defined by

  P_n = m / (N − n + 1),    (10)

where m is the number of matched n-grams. P_n measures the ratio of correct n-grams in a sequence. Average Continuity is then defined by

  AC = exp( (1/(k − 1)) Σ_{n=2}^{k} log(P_n + α) ),    (11)

where k is the maximum n of the n-grams and α is a small positive value to prevent divergence of the score.
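Under the standard forms of these three metrics, a minimal implementation might look like the following (here a predicted order is a list of original sentence indices, so the identity list is a perfect ordering):

```python
import math
from itertools import combinations

def kendall_tau(pred):
    """pred: predicted order as a list of original indices (0 = first)."""
    n = len(pred)
    n_inv = sum(1 for a, b in combinations(range(n), 2) if pred[a] > pred[b])
    return 1.0 - 2.0 * n_inv / (n * (n - 1) / 2.0)

def spearman(pred):
    n = len(pred)
    pos = {s: k for k, s in enumerate(pred)}  # answered rank of each sentence
    d2 = sum((pos[i] - i) ** 2 for i in range(n))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def average_continuity(pred, k=4, alpha=0.01):
    n = len(pred)
    logs = []
    for size in range(2, k + 1):
        windows = n - size + 1
        # An n-gram matches if its sentences are consecutive in the original.
        m = sum(1 for s in range(windows)
                if all(pred[s + t] + 1 == pred[s + t + 1]
                       for t in range(size - 1)))
        logs.append(math.log(m / windows + alpha))
    return math.exp(sum(logs) / (k - 1))
```

A perfect ordering scores 1 on the two correlations and just above 1 (by α) on Average Continuity, which is why Average Continuity is sensitive to locally correct runs rather than global position.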

Results
We applied 5-fold cross validation to each ordering method. The machine learning models were trained on 3,003 constructive speech items and evaluated on 751 items.
The results of first sentence estimation are shown in Table 4. The accuracy of the proposed method is higher than that of Ranking, which represents the sentence ranking technique without the first sentence selection, by 19.3 percentage points. Although the proposed method showed the best accuracy, we observed that f_first(s_0) tended to be −1 rather than +1. This is mainly because the two classes were unbalanced: the number of negative examples in the training data was 6.2 times larger than that of positive ones. We need to address this class imbalance problem for further improvement (Chawla et al., 2004).
The results of overall sentence ordering are shown in Table 5. We carried out a one-way analysis of variance (ANOVA) to examine the effects of the different sentence ordering algorithms. The ANOVA revealed reliable effects on all metrics (p < 0.01). We then performed a Tukey Honest Significant Differences (HSD) test to compare the differences among the algorithms. In terms of Kendall's τ and Spearman Rank Correlation, the Tukey HSD test revealed that the proposed method was significantly better than the rest (p < 0.01). In terms of Average Continuity, it was also significantly better than the Random method, whereas it was not significantly different from the Ranking method. These results show that the proposed two-step ordering is also effective for overall sentence ordering. However, the small difference in Average Continuity indicates that the ordering improvement is only local.
Each ordering was awarded one of four grades: Perfect, Acceptable, Poor or Unacceptable. The criteria of these grades are the same as those of (Bollegala et al., 2006). A perfect text cannot be improved by re-ordering. An acceptable text makes sense and does not require revision although there is some room for improvement in terms of readability. A poor text loses the thread of the story in some places and requires amendment to bring it up to an acceptable level. An unacceptable text leaves much to be improved and requires overall restructuring rather than partial revision.
The results of our subjective evaluation are shown in Figure 3. We observed that about 70% of randomly ordered sentences were graded perfect or acceptable. This is mainly because the target documents contain only 3.87 sentences on average, and such short documents remain comprehensible even when randomly shuffled.
There are four documents containing more than six sentences among the targets. The numbers of unacceptably ordered documents for the Random method, the Ranking method, and the proposed method are 4, 3, and 1, respectively. We observed that the proposed method selected the claim sentences successfully and then arranged the sentences related to the claim sentences. These are the expected effects of the first sentence classification and of the estimated first sentence similarity feature in the second step. These results show that the selection of the first sentence plays an important role in making opinions comprehensible.
On the other hand, we did not observe an improvement in the number of perfectly ordered documents. We found that the proposed method misclassified some final sentences as first sentences. Such final sentences describe conclusions similar to the claim sentences. We need to extend the structure of constructive speech to handle conclusions correctly.

Structures of Constructive Speech
We confirmed our assumption that claims are more likely to be described in the first sentence than elsewhere by manually examining constructive speech items. We selected seven motions from the top 100 debates in Debatabase. These motions contain a total of 56 constructive speech items. A human annotator assigned claim tags and support tags to the sentences. The results are shown in Table 6.
Here, we can see that about two-thirds of the claim sentences appeared at the beginning of constructive speech items, and that more than 90% of the supportive sentences appeared as the second sentence or later. This means that a claim is followed by evidence in more than half of all constructive speech items.

Case Analysis
A typical example ordered correctly by the proposed method is shown in Table 7. This constructive speech item supports free admission at museums. It has a clear claim-support structure: it first makes a claim about the contributions of government funding and then gives three examples. The first sentence has no NEs, while the second and later sentences contain NEs that give details about actual museums and countries. Neighboring sentences are connected by common words such as "museum," "charge," and "government funding."

Application to Automated Constructive Speech Generation
We applied the proposed sentence ordering method to automated constructive speech generation.

System Description
The flowchart of constructive speech generation is shown in Fig. 4. Here, we give a brief overview of the system. The system is based on sentence extraction and sentence ordering, which we explain with the example motion "This House should ban smoking in public spaces." First, a motion analysis component extracts keywords such as "smoking" and "public spaces" from the motion. Second, a value selection component searches for related sentences with the motion keywords and human value information. More specifically, it generates pairs of motion keywords and values (such as (smoking, health), (smoking, education), and (smoking, crime)) and uses them as search queries. Then, it selects the values of constructive speech in accordance with the number of related sentences to values. In the third step, a sentence extraction component examines the relevancy of each sentence with textual annotation such as promote/suppress relationship and positive/negative relationship. Finally, a sentence ordering component arranges the extracted sentences for each value.

Ordering Results
The system outputs three paragraphs per motion. Each paragraph is composed of seven sentences. Currently, its performance is limited: 49 out of the 150 generated paragraphs are understandable. To focus on the effect of sentence ordering, we manually extracted relevant sentences from the generated constructive speech and then applied the proposed ordering method to them. The results are shown in Table 8. We can observe that the first sentence mentions the health problem of smoking, while the second and third sentences provide support for the problem, i.e., the names of authorities such as spokesmen and institutes. The proposed ordering method successfully ordered this type of opinion, which has a clear claim-support structure.

Conclusion
In this paper, we discussed sentence ordering for debate texts. We proposed a sentence ordering method that employs a two-step approach based on the claim-support structure. We then constructed a dataset from an online debate site to train and evaluate the ordering method. The evaluation results of reconstruction from shuffled constructive speech showed that our proposed method outperformed a general-purpose ordering method. The subjective evaluation showed that our proposed method is suitable for constructive speech containing explicit claim sentences and supporting examples.
In this study, we focused on a very simple structure, i.e., claims and support. We will extend this structure to handle different types of arguments in the future. More specifically, we plan to take conclusion sentences into account as a component of the structure.