Argument Generation with Retrieval, Planning, and Realization

Automatic argument generation is an appealing but challenging task. In this paper, we study the specific problem of counter-argument generation, and present a novel framework, CANDELA. It consists of a powerful retrieval system and a novel two-step generation model, where a text planning decoder first decides on the main talking points and a proper language style for each sentence, then a content realization decoder reflects the decisions and constructs an informative paragraph-level argument. Furthermore, our generation model is empowered by a retrieval system indexed with 12 million articles collected from Wikipedia and popular English news media, which provides access to high-quality content with diversity. Automatic evaluation on a large-scale dataset collected from Reddit shows that our model yields significantly higher BLEU, ROUGE, and METEOR scores than the state-of-the-art and non-trivial comparisons. Human evaluation further indicates that our system arguments are more appropriate for refutation and richer in content.


Introduction
Counter-argument generation aims to produce arguments of a different stance, in order to refute the given proposition on a controversial issue (Toulmin, 1958; Damer, 2012). A system that automatically constructs counter-arguments can effectively present alternative perspectives along with associated evidence and reasoning, and thus facilitate a more comprehensive understanding of complicated problems when controversy arises.
Nevertheless, constructing persuasive arguments is a challenging task, as it requires an appropriate combination of credible evidence, rigorous logical reasoning, and sometimes emotional appeal (Walton et al., 2008; Wachsmuth et al., 2017a; Wang et al., 2017). A sample counter-argument for a pro-death penalty post is shown in Figure 1. As can be seen, a sequence of talking points on the "imperfect justice system" is presented: it starts with the fundamental concept, then follows up with a more specific evaluative claim and a supporting fact. Although retrieval-based methods have been investigated to construct counter-arguments (Sato et al., 2015; Reisert et al., 2015), they typically produce a collection of sentences from disparate sources, and thus fall short of coherence and conciseness. Moreover, humans often deploy stylistic language with specific argumentative functions to promote persuasiveness, such as making a concessive move (e.g., "In theory I agree with you"). This further requires the generation system to have better control of the language style.
Our goal is to design a counter-argument generation system to address the above challenges and produce paragraph-level arguments with rich-yet-coherent content. To this end, we present CANDELA, a novel framework to generate Counter-Arguments with two-step Neural Decoders and ExternaL knowledge Augmentation. Concretely, CANDELA has three distinctive features. First, it is equipped with two decoders: one for text planning (selecting talking points to cover for each sentence to be generated), the other for content realization (producing a fluent argument to reflect decisions made by the text planner). This enables our model to produce longer arguments with richer information.
Furthermore, multiple objectives are designed for our text planning decoder to both handle content selection and ordering, and select a proper argumentative discourse function of a desired language style for each sentence generation.
Lastly, the input to our argument generation model is augmented with keyphrases and passages retrieved from a large-scale search engine, which indexes 12 million articles from Wikipedia and four popular English news media of varying ideological leanings. This ensures access to reliable evidence, high-quality reasoning, and diverse opinions from different sources, as opposed to recent work that mostly considers a single origin, such as Wikipedia (Rinott et al., 2015) or online debate portals (Wachsmuth et al., 2018b).
We experiment with argument and counter-argument pairs collected from the Reddit /r/ChangeMyView group. Automatic evaluation shows that the proposed model significantly outperforms our prior argument generation system (Hua and Wang, 2018) and other non-trivial comparisons. Human evaluation further suggests that our model produces more appropriate counter-arguments with richer content than other automatic systems, while maintaining a fluency level comparable to human-constructed arguments.

Related Work
To date, the majority of the work on automatic argument generation relies on rule-based models, e.g., designing operators that reflect strategies from argumentation theory (Reed et al., 1996; Carenini and Moore, 2000). Information retrieval systems have recently been developed to extract arguments relevant to a given debate motion (Sato et al., 2015). Although content ordering has been investigated (Reisert et al., 2015; Yanase et al., 2015), the output arguments are usually a collection of sentences from heterogeneous information sources, thus lacking coherence and conciseness. Our work aims to close the gap by generating eloquent and coherent arguments, assisted by an argument retrieval system.
Recent progress in sequence-to-sequence (seq2seq) text generation models has delivered both fluent and content-rich outputs by explicitly conducting content selection and ordering (Gehrmann et al., 2018; Wiseman et al., 2018), which is a promising avenue for enabling end-to-end counter-argument construction (Le et al., 2018). In particular, our prior work (Hua and Wang, 2018) leverages passages retrieved from Wikipedia to improve the quality of generated arguments, yet Wikipedia itself has the limitation of containing mostly facts. By leveraging Wikipedia and popular news media, our proposed pipeline can enrich the factual evidence with high-quality opinions and reasoning.
Our work is also in line with argument retrieval research, where prior effort mostly considers a single-origin information source (Rinott et al., 2015; Levy et al., 2018; Wachsmuth et al., 2017b, 2018b). Recent work by Stab et al. (2018) indexes all web documents collected in Common Crawl, which inevitably incorporates noisy, low-quality content. Besides, existing work treats individual sentences as arguments, disregarding their crucial discourse structures and logical relations with adjacent sentences. Instead, we use multiple high-quality information sources, and construct paragraph-level passages to retain the context of arguments.

Overview of CANDELA
Our counter-argument generation framework, as shown in Figure 2, has two main components: an argument retrieval model (§ 4) that takes the input statement and a search engine and outputs relevant passages and keyphrases, and an argument generation model (§ 5) that uses them as input to produce a fluent and informative argument.
Concretely, the argument retrieval component retrieves a set of candidate passages from Wikipedia and news media (§ 4.1), then further selects passages according to their stances towards the input statement (§ 4.3). A keyphrase extraction module distills the refined passages into a set of talking points, which comprise the keyphrase memory as additional input for generation (§ 4.2). The argument generation component first runs the text planning decoder (§ 5.2) to produce a sequence of hidden states, each corresponding to a sentence-level representation that encodes the selection of keyphrases to cover, as well as the predicted argumentative function for a desired language style. The content realization decoder (§ 5.3) then generates the argument conditioned on the sentence representations.

Information Sources and Indexing
We aim to build a search engine from diverse information sources with factual evidence and varied opinions of high quality. To achieve that, we use Common Crawl to collect a large-scale online news dataset covering four major English news media: The New York Times (NYT), The Washington Post (WaPo), Reuters, and The Wall Street Journal (WSJ). HTML files are processed using the open-source tool jusText (Pomikálek, 2011) to extract article content. We deduplicate articles and remove the ones with fewer than 50 words. We also download a Wikipedia dump. About 12 million articles are processed in total, with basic statistics shown in Table 1. We segment articles into passages with a sliding window of three sentences, with a step size of two. We further constrain the passages to have at least 50 words. For shorter passages, we keep adding subsequent sentences until reaching the length limit. Per Table 1, 120 million passages are preserved and indexed with Elasticsearch (Gormley and Tong, 2015) as done in Stab et al. (2018).
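The passage segmentation step can be illustrated with a minimal sketch. It assumes sentences are already split and that words are whitespace-separated tokens; the function names are hypothetical, and dropping a trailing window that cannot reach the minimum length is our assumption rather than a documented choice.

```python
def word_count(sentences):
    """Total number of whitespace-separated tokens in a list of sentences."""
    return sum(len(s.split()) for s in sentences)


def segment_passages(sentences, window=3, step=2, min_words=50):
    """Split an article (list of sentence strings) into overlapping passages."""
    passages = []
    for start in range(0, len(sentences), step):
        end = min(start + window, len(sentences))
        # Grow a short window with subsequent sentences until it is long enough.
        while end < len(sentences) and word_count(sentences[start:end]) < min_words:
            end += 1
        # Assumption: a trailing window that cannot reach min_words is dropped.
        if word_count(sentences[start:end]) >= min_words:
            passages.append(" ".join(sentences[start:end]))
    return passages
```

With a step size of two and a window of three, consecutive passages share one sentence, which keeps some context across passage boundaries.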
Query Formulation. For an input statement with multiple sentences, one query is constructed per sentence, if it has more than 5 content words (10 for questions), and at least 3 are distinct. For each query, the top 20 passages ranked by BM25 (Robertson et al., 1995) are retained, per medium. All passages retrieved for the input statement are merged and deduplicated, and they will be ranked as discussed in § 4.3.
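The sentence-level query filter can be sketched as below. The stopword list, the punctuation stripping, and the use of a trailing question mark to detect questions are illustrative assumptions; the query is taken to be the sentence itself.

```python
# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "i", "you", "it",
             "we", "they", "with", "of", "to", "in", "on", "and", "or",
             "that", "this", "for"}


def content_words(sentence):
    """Lowercased tokens with punctuation stripped and stopwords removed."""
    tokens = [t.strip(".,!?;:\"'()").lower() for t in sentence.split()]
    return [t for t in tokens if t and t not in STOPWORDS]


def is_valid_query(sentence):
    """More than 5 content words (10 for questions), at least 3 of them distinct."""
    words = content_words(sentence)
    min_content = 10 if sentence.strip().endswith("?") else 5
    return len(words) > min_content and len(set(words)) >= 3


def build_queries(sentences):
    """One query per sentence of the input statement that passes the filter."""
    return [s for s in sentences if is_valid_query(s)]
```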

Keyphrase Extraction
Here we describe a keyphrase extraction procedure for both input statements and retrieved passages, which will be utilized for passage ranking as detailed in the next section.
For an input statement, our goal is to identify a set of phrases representing the issues under discussion, such as "death penalty" in Figure 1. We thus first extract the topic signature words (Lin and Hovy, 2000) for input representation, and expand them into phrases that better capture semantic meanings.
Concretely, topic signature words of an input statement are calculated against all input statements in our training set with log-likelihood ratio test. In order to cover phrases with related terms, we further expand this set with their synonyms, hyponyms, hypernyms, and antonyms based on WordNet (Miller, 1994). The statements are first parsed with Stanford part-of-speech tagger (Manning et al., 2014). Then regular expressions are applied to extract candidate noun phrases and verb phrases (details in Appendix A.1). A keyphrase is selected if it contains: (1) at least one content word, (2) no more than 10 tokens, and (3) at least one topic signature word or a Wikipedia article title.
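The three selection criteria can be expressed as a simple predicate. This is a sketch: the candidate phrase is assumed to be whitespace-tokenized, `topic_signatures`, `wiki_titles`, and `stopwords` are hypothetical lowercase sets supplied by the caller, and matching a Wikipedia title as a contiguous token span is our interpretation of "contains".

```python
def contains_title(tokens, wiki_titles):
    """True if any contiguous token span equals a (lowercased) article title."""
    n = len(tokens)
    return any(" ".join(tokens[i:j]) in wiki_titles
               for i in range(n) for j in range(i + 1, n + 1))


def is_keyphrase(phrase, topic_signatures, wiki_titles, stopwords, max_tokens=10):
    """Apply criteria (1)-(3) to one candidate noun or verb phrase."""
    tokens = phrase.lower().split()
    has_content_word = any(t not in stopwords for t in tokens)       # (1)
    short_enough = len(tokens) <= max_tokens                          # (2)
    anchored = (any(t in topic_signatures for t in tokens)
                or contains_title(tokens, wiki_titles))               # (3)
    return has_content_word and short_enough and anchored
```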
For retrieved passages, their keyphrases are extracted using the same procedure as above, except that the input statement's topic signature words are used as references again.

Passage Ranking and Filtering
We merge the retrieved passages from all media and rank them based on the number of words in overlapping keyphrases with the input statement. To break a tie, with the input as the reference, we further consider the number of its topic signature words that are covered by the passage, then the coverage of non-stopword bigrams and unigrams. In order to encourage diversity, we discard a passage if more than 50% of its content words are already included by a higher ranked passage. In the final step, we filter out passages if they have the same stance as the input statement for given topics. We determine the stances of passages by adopting the stance scoring model proposed by Bar-Haim et al. (2017). More details can be found in Appendix A.2.
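The diversity filter can be sketched as follows, assuming the passages are already sorted by rank. Two details are assumptions on our part: content words are compared as sets of lowercased tokens, and a passage is compared against each higher-ranked passage that was kept, one at a time.

```python
def diversity_filter(ranked_passages, stopwords, max_overlap=0.5):
    """Drop a passage if more than max_overlap of its content words
    already appear in a single higher-ranked passage that was kept."""
    kept = []  # list of (passage, content-word set) pairs
    for passage in ranked_passages:
        words = {w for w in passage.lower().split() if w not in stopwords}
        if not words:
            continue
        redundant = any(len(words & prev) / len(words) > max_overlap
                        for _, prev in kept)
        if not redundant:
            kept.append((passage, words))
    return [p for p, _ in kept]
```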

Task Formulation
Given an input statement $X = \{x_i\}$, a set of passages, and a keyphrase memory $M$, our goal is to generate a counter-argument $Y = \{y_t\}$ of a different stance from $X$, where $x_i$ and $y_t$ are tokens at timestamps $i$ and $t$. Built upon the sequence-to-sequence (seq2seq) framework with input attention (Sutskever et al., 2014; Bahdanau et al., 2015), the input statement and the passages selected in § 4 are encoded by a bidirectional LSTM (biLSTM) encoder into a sequence of hidden states $h_i$. The last hidden state of the encoder is used as the first hidden state of both the text planning decoder and the content realization decoder.
As depicted in Figure 2, the counter-argument is generated as follows. A text planning decoder (§ 5.2) first calculates a sequence of sentence representations $s_j$ (for the $j$-th sentence) by encoding the keyphrases selected from the previous timestamp $j - 1$. During this step, an argumentative function label is predicted to indicate a desired language style for each sentence, and a subset of the keyphrases is selected from $M$ (content selection) for the next sentence. In the second step, a content realization decoder (§ 5.3) generates the final counter-argument conditioned on previously generated tokens and the corresponding sentence representation $s_j$.

Text Planning Decoder
Text planning is an important component for natural language generation systems to decide on content structure for the target generation (Lavoie and Rambow, 1997; Reiter and Dale, 2000). We propose a text planner with two objectives: selecting talking points from the keyphrase memory $M$, and choosing a proper argumentative function per sentence. Concretely, we train a sentence-level LSTM network $f$ that learns to generate a sequence of sentence representations $\{s_j\}$ given the selected keyphrase set $C(j)$ as input for the $j$-th sentence (Eq. 1). Here $e_k$ is the embedding of a selected phrase, represented by summing up all its words' GloVe embeddings (Pennington et al., 2014) in our experiments.
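One way to write the planner recurrence that is consistent with the definitions above is sketched below. Feeding the previous sentence representation back into $f$ follows standard LSTM practice, and aggregating the selected phrase embeddings by summation is an assumption.

```latex
s_j = f\Big(s_{j-1}, \sum_{e_k \in C(j)} e_k\Big)
```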
Content Selection $C(j)$. We propose an attention mechanism to conduct content selection and yield $C(j)$ from the representation of the previous sentence $s_{j-1}$ to encourage topical coherence. To allow the selection of multiple keyphrases, we use the sigmoid function to calculate the score $\alpha_{jm}$ of each keyphrase in $M$ (Eq. 2), with trainable parameters $W^{pa}$. Keyphrases with $\alpha_{jm} > 0.5$ are included in $C(j)$, and the keyphrase with the top attention value is always selected. We further prohibit a keyphrase from being chosen more than once across sentences. For the first sentence $s_0$, $C(0)$ only contains <start>, whose embedding is randomly initialized. During training, the true labels of $C(j)$ are constructed as follows: a keyphrase in $M$ is selected for the $j$-th gold-standard argument sentence if the two overlap in any content word.
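The selection rule can be illustrated with a small pure-Python sketch. The bilinear score (keyphrase embedding, weight matrix, previous sentence state) is an assumed form; only the sigmoid, the 0.5 threshold, the always-keep-the-top rule, and the no-reuse constraint come from the description above. All names are hypothetical.

```python
import math


def select_keyphrases(prev_state, memory, weight, used):
    """Score each unused keyphrase and pick the set for the next sentence.

    prev_state: list of floats (previous sentence representation)
    memory:     dict mapping keyphrase -> embedding (list of floats)
    weight:     matrix as a list of rows, shape (emb_dim, state_dim)
    used:       set of keyphrases already selected for earlier sentences
    """
    # Project the previous sentence state: W * s_{j-1}.
    projected = [sum(w * s for w, s in zip(row, prev_state)) for row in weight]
    scores = {}
    for name, emb in memory.items():
        if name in used:  # a keyphrase may be chosen only once
            continue
        logit = sum(e * p for e, p in zip(emb, projected))
        scores[name] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    if not scores:
        return [], scores
    selected = [name for name, alpha in scores.items() if alpha > 0.5]
    top = max(scores, key=scores.get)
    if top not in selected:  # the top-scoring keyphrase is always kept
        selected.append(top)
    return selected, scores
```

Using a sigmoid per keyphrase rather than a softmax over the memory lets several keyphrases pass the threshold for the same sentence.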
Argumentative Function Prediction $y^p_j$. As shown in Figure 1, humans often deploy stylistic language to achieve better persuasiveness, e.g., agreement as a concessive move. We aim to inform the realization decoder about the choice of style, and thus distinguish between two types of argumentative functions: an argumentative content sentence, which delivers the critical ideas (e.g., "unreliable evidence is used when there is no witness"), and an argumentative filler sentence, which contains stylistic language or general statements (e.g., "you can't bring dead people back to life").
Since we do not have argumentative function labels, during training we use the following rules to label each sentence automatically. A sentence is labeled as a content sentence if it has at least 10 words (20 for questions) and satisfies one of the following conditions: (1) it has at least two topic signature words of the input statement or a gold-standard counter-argument, or (2) it has at least one topic signature word and a discourse marker at the beginning of the sentence. When calculating topic signatures for gold-standard arguments, all replies in the training set are used as background. If the first three words in a content sentence contain a pronoun, the previous sentence is labeled as a content sentence too. Discourse markers are selected from PDTB discourse connectives (e.g., as a result, eventually, or in contrast). The full list is included in Appendix A.3. All other sentences become filler sentences. In future work, we will consider utilizing learning-based methods, e.g., Hidey et al. (2017), to predict richer argumentative functions.
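The labeling rules can be sketched as a short function. The pronoun set and the marker list below are small illustrative subsets, the tokenizer is a simple whitespace split, and applying the pronoun rule in a single pass over the rule-based labels is our assumption.

```python
# Illustrative subsets only; the full marker list is given in Appendix A.3.
PRONOUNS = {"it", "this", "that", "they", "these", "those", "he", "she"}
MARKERS = ("as a result", "in contrast", "however", "therefore")


def tokenize(sentence):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    tokens = [t.strip(".,!?;:\"'") for t in sentence.lower().split()]
    return [t for t in tokens if t]


def is_content(sentence, signatures):
    """Length check plus condition (1) or (2) on topic signature words."""
    tokens = tokenize(sentence)
    min_len = 20 if sentence.strip().endswith("?") else 10
    if len(tokens) < min_len:
        return False
    hits = len(set(tokens) & signatures)
    text = " ".join(tokens)
    starts_with_marker = any(text == m or text.startswith(m + " ")
                             for m in MARKERS)
    return hits >= 2 or (hits >= 1 and starts_with_marker)


def label_sentences(sentences, signatures):
    """Return 'content' or 'filler' for each sentence of an argument."""
    base = [is_content(s, signatures) for s in sentences]
    labels = list(base)
    for j in range(1, len(sentences)):
        # A content sentence opening with a pronoun pulls in its predecessor.
        if base[j] and PRONOUNS & set(tokenize(sentences[j])[:3]):
            labels[j - 1] = True
    return ["content" if flag else "filler" for flag in labels]
```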
The argumentative function label $y^p_j$ for the $j$-th sentence is calculated from $c_j$, the attention-weighted context vector built from the alignment scores $\alpha_{jm}$ computed as in Eq. 2, with trainable parameters $w_p$, $W^{po}$, and $b_p$.

Content Realization Decoder
The content realization decoder generates the counter-argument word by word, with another LSTM network $f^w$. We denote the sentence id of the $t$-th word in the argument as $J(t)$; the sentence representation $s_{J(t)}$ from the text planning decoder, together with the embedding of the previously generated token $y_{t-1}$, is fed as input to calculate the hidden state $z_t$. The conditional probability of the next token $y_t$ is then computed over a standard softmax, with an attention mechanism applied on the encoder hidden states $h_i$ to obtain the context vector $c^w_t$.
Reranking-based Beam Search. Our content realization decoder utilizes beam search enhanced with a reranking mechanism, where we sort the beams at the end of each sentence by the number of selected keyphrases that are generated. We also discard beams with n-gram repetition for n ≥ 4.
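The reranking step at a sentence boundary can be sketched as below. Beams are assumed to be (tokens, log-probability) pairs, keyphrases are matched as whole-token spans, and breaking ties by log-probability is our assumption. Checking only 4-grams is sufficient for the repetition rule, since any repeated longer n-gram contains a repeated 4-gram.

```python
def has_repeated_ngram(tokens, n=4):
    """True if some n-gram occurs more than once in the token list."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


def rerank_beams(beams, selected_keyphrases):
    """Drop repetitive beams, then sort by keyphrase coverage.

    beams: list of (tokens, log_prob) pairs for partial hypotheses
    selected_keyphrases: phrases chosen by the text planner for this sentence
    """
    def coverage(tokens):
        text = " " + " ".join(tokens) + " "
        return sum(1 for kp in selected_keyphrases if " " + kp + " " in text)

    survivors = [b for b in beams if not has_repeated_ngram(b[0])]
    # Assumption: ties in coverage are broken by model log-probability.
    return sorted(survivors, key=lambda b: (coverage(b[0]), b[1]), reverse=True)
```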

Training Objective
Given all model parameters $\theta$, our mixed objective $L(\theta)$ combines three terms: the target argument loss $L_{arg}(\theta)$, the argumentative function type loss $L_{func}(\theta)$, and the next sentence keyphrase selection loss $L_{sel}(\theta)$. Here $D$ is the training corpus, $(X, Y)$ are input statement and counter-argument pairs, and $Y^p$ are the sentence function labels. $\alpha_{jm}$ are keyphrase selection labels as computed in Eq. 2. For simplicity, we set the weights $\gamma$ and $\eta$ to 1.0 in our experiments, though they can be further tuned as hyper-parameters.
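A form of the mixed objective consistent with this description is sketched below. Which weight is attached to which auxiliary term, and the use of negative log-likelihood for the argument and function losses, are assumptions.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{arg}(\theta)
  + \gamma \cdot \mathcal{L}_{func}(\theta)
  + \eta \cdot \mathcal{L}_{sel}(\theta)

\mathcal{L}_{arg}(\theta) = -\sum_{(X, Y) \in D} \log P(Y \mid X; \theta)
\qquad
\mathcal{L}_{func}(\theta) = -\sum_{(X, Y^p) \in D} \log P(Y^p \mid X; \theta)
```

With $\gamma = \eta = 1.0$, the three terms contribute equally to the gradient.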

Data Collection and Preprocessing
We use the same methodology as in our prior work (Hua and Wang, 2018) to collect an argument generation dataset from Reddit /r/ChangeMyView. To construct input statement and counter-argument pairs, we treat the original post (OP) of each thread as the input. We then consider the high quality root replies, defined as the ones awarded with ∆s or with more upvotes than downvotes (i.e., karma > 0). It is observed that each paragraph often makes a coherent argument. Therefore, these replies are broken down into paragraphs, and a paragraph is retained as a target argument to the OP if it has more than 10 words and at least one argumentative content sentence.
We then identify threads in the domains of politics and policy, and remove posts with offensive language. The most recent threads are used as the test set. As a result, we have 11,356 threads or OPs (217,057 arguments) for training, 1,774 (33,318 arguments) for validation, and 1,703 (36,777 arguments) for test. They are split into sentences and then tokenized by the Stanford CoreNLP toolkit (Manning et al., 2014).
Training Data Construction for Passages and Keyphrase Memory. Since no gold-standard annotation is available for the input passages and keyphrases, we acquire training labels by constructing queries from the gold-standard arguments as described in § 4.1, and reranking retrieved passages based on the following criteria in order: (1) coverage of topic signature words in the input statement; (2) a weighted summation of the coverage of n-grams in the argument; (3) the magnitude of stance score, where we keep the passages of the same polarity as the argument; (4) content word overlap with the argument; and (5) coverage of topic signature words in the argument.

System and Oracle Retrieved Passages
For evaluation, we employ both system-retrieved passages (i.e., constructing queries from the OP) with the corresponding keyphrase memory (KM) (§ 4), and oracle-retrieved passages (i.e., constructing queries from the target argument) with KM as described in training data construction. Statistics on the final dataset are listed in Table 2.

Comparisons
In addition to a RETRIEVAL model, where the top ranked passage is used as counter-argument, we further consider four systems for comparison. (1) A standard SEQ2SEQ model with attention, where we feed the OP as input and train the model to generate counter-arguments. Regular beam search with the same beam size as our model is used for decoding.
(2) A SEQ2SEQAUG model with additional input of the keyphrase memory and ranked passages, both concatenated with OP to serve as the encoder input. The reranking-based decoder in our model is also implemented for SEQ2SEQAUG to enhance the coverage of input keyphrases.
(3) An ablated SEQ2SEQAUG model where the passages are removed from the input. (4) We also reimplement the argument generation model in our prior work (Hua and Wang, 2018) (H&W) with PyTorch (Paszke et al., 2017), which is used for CANDELA implementation. H&W takes as input the OP and ranked passages, and then uses two separate decoders to first generate all keyphrases and then the counter-argument. For our model, we also implement a variant where the input only contains the OP and the keyphrase memory.

Training Details
For all models, we use a two-layer LSTM for all encoders and decoders with a dropout probability of 0.2 between layers (Gal and Ghahramani, 2016). All layers have 512-dimensional hidden states. We limit the input statement to 500 tokens, the ranked passages to 400 tokens, and the target counter-argument to 120 tokens. Our vocabulary has 50K words for both input and output, with 300-dimensional word embeddings initialized with GloVe (Pennington et al., 2014). We also pre-train a biLSTM for the encoder based on all OPs from the training set, and an LSTM for the content realization decoder based on two sources of data: 353K counter-arguments that are high quality root reply paragraphs extended with posts of non-negative karma, and 2.4 million retrieved passages randomly sampled from the training set. Both are trained as done in Bengio et al. (2003). We then use the first layer's parameters to initialize all models, including our comparisons.
Under system setup, our model CANDELA statistically significantly outperforms all comparisons and the retrieval model in all metrics, based on a randomization test (Noreen, 1989; p < 0.0005). Furthermore, our model generates longer sentences whose lengths are comparable with human arguments, both with about 22 words per sentence. This also results in longer arguments. Under oracle setup, all models are notably improved due to the higher quality of reranked passages, and our model achieves statistically significantly better BLEU scores. Interestingly, we observe a decrease of ROUGE and METEOR, but a marginal increase of BLEU-2 by removing passages from our model input. This could be because the passages introduce divergent content, albeit probably on-topic, that cannot be captured by BLEU.
Content Diversity. We further measure whether our model is able to generate diverse content. First, borrowing the diversity measurement from dialogue generation research (Li et al., 2016), we report the average number of distinct n-grams per argument under system setup in Figure 3. Our system generates more unique unigrams and bigrams than other automatic systems, underscoring its capability of generating diverse content.
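For concreteness, the distinct n-gram count can be computed as in the sketch below, assuming each argument is a list of tokens. The helper names are hypothetical; the type-token ratio helper is included because it follows from the same counts.

```python
def distinct_ngrams(tokens, n):
    """Number of unique n-grams in one tokenized argument."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})


def avg_distinct_ngrams(arguments, n):
    """Average number of distinct n-grams per argument."""
    if not arguments:
        return 0.0
    return sum(distinct_ngrams(a, n) for a in arguments) / len(arguments)


def type_token_ratio(tokens, n=1):
    """Unique n-grams divided by total n-grams in one argument."""
    total = len(tokens) - n + 1
    return distinct_ngrams(tokens, n) / total if total > 0 else 0.0
```

Longer outputs tend to have more distinct n-grams but a lower type-token ratio, which is why the two measures are read together.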
Our model also maintains a comparable type-token ratio (TTR) compared to systems that generate shorter arguments, e.g., a 0.79 for bigram TTR of our model versus 0.83 and 0.84 for SEQ2SEQAUG and SEQ2SEQ. RETRIEVAL, containing top ranked passages of human-edited content, produces the most distinct words.
Next, we compare how each system generates content beyond the common words. As shown in Figure 4, human-edited text, including gold-standard arguments (HUMAN) and retrieved passages, tends to have higher usage of uncommon words than automatic systems, suggesting the gap between human vs. system arguments. Among the four automatic systems, our prior model (Hua and Wang, 2018) generates a significantly higher portion of uncommon words, yet further inspection shows that the output often includes more off-topic information.
Table 3: Main results on argument generation. We report BLEU-2 (B-2), BLEU-4 (B-4), ROUGE-2 (R-2) recall, METEOR (MTR), and average number of words per argument and per sentence. Best scores are in bold. *: statistically significantly better than all comparisons (randomization approximation test (Noreen, 1989), p < 0.0005). Input is the same for SEQ2SEQ for both system and oracle setups.

Human Evaluation
Human judges are asked to rate arguments on a Likert scale of 1 (worst) to 5 (best) on the following three aspects: grammaticality, which denotes language fluency; appropriateness, which indicates whether the output is on-topic and on the opposing stance; and content richness, which measures the amount of distinct talking points. In order to promote consistency of annotation, we provide descriptions and sample arguments for each scale. For example, an appropriateness score of 3 means the counter-argument contains relevant words and is likely to be on a different stance. The judges are then asked to rank all arguments for the same input based on their overall quality.
We randomly sampled 43 threads from the test set, and hired three native or proficient English speakers to evaluate arguments generated by SEQ2SEQAUG, our prior argument generation model (H&W), and our model, along with the HUMAN arguments and the RETRIEVAL output.
Results. The first 3 examples are used only for calibration, and the remaining 40 are used to report results in Table 4. Inter-annotator agreement scores (Krippendorff's α) of 0.44, 0.58, and 0.49 are achieved for the three aspects, implying general consensus to intermediate agreement.
Our system obtains the highest appropriateness and content richness among all automatic systems. This confirms the previous observation that our model produces more informative arguments than other neural models. SEQ2SEQAUG has a marginally better grammaticality score, likely due to the fact that our arguments are longer, and tend to contain less fluent generation towards the end.
Furthermore, we see that human arguments are ranked as the best in about 76% of the evaluation, followed by RETRIEVAL. Our model is more likely to be ranked top than any other automatic model. In particular, our model is rated better than either HUMAN or RETRIEVAL, i.e., human-edited text, in 39.2% of the evaluations, compared to 34.2% for SEQ2SEQAUG and 13.3% for our prior model.

Sample Arguments and Discussions
We show sample outputs of different systems alongside the human-constructed counter-argument in Figure 5. As can be seen, our system arguments cover many relevant talking points, including the idea of "taking care of children" that is also used by the human. It further illustrates the effectiveness of our retrieval system and the usage of keyphrase selection as part of text planning to guide argument generation. Moreover, we also observe that our model generation is able to switch between argumentative content sentences and filler sentences, though better control is needed to improve coherence. Meanwhile, SEQ2SEQ frequently echoes words from the OP, and both SEQ2SEQ and SEQ2SEQAUG suffer from the problems of "hallucination" (e.g., the first sentence in SEQ2SEQAUG) and repetition (e.g., the second and third sentences in SEQ2SEQ).
Nonetheless, there is substantial room for improvement. First, our model tends to overuse negation, such as "this is not to say that it is unfair...". This is likely due to overfitting on specific stylistic language, e.g., negation is often observed for refutation in debates (Wang et al., 2017). Second, human arguments have significantly better organization and often deploy complicated argumentation strategies (Wachsmuth et al., 2018a), which so far are not well captured by any automatic system. Both points inspire future work on (1) controlling language styles and the corresponding content, and (2) mining argumentation structures for use in guiding generation with better planning.

Conclusion
We present a novel counter-argument generation framework, CANDELA. Given an input statement, it first retrieves arguments of different perspectives from millions of high-quality articles collected from diverse sources. An argument generation component then employs a text planning decoder to conduct content selection and specify a suitable language style at sentence-level, followed by a content realization decoder to produce the final argument. Automatic evaluation and human evaluation indicate that our model generates more proper arguments with richer content than non-trivial comparisons, with comparable fluency to human-edited content.

Figure 5 (sample outputs of different systems alongside the human-constructed counter-argument):
OP: The wage gap isn't about gender. (...) So one factor definitely the fact that there are some government jobs that are seriously dominated by women and happen to pay really low (...)
Human: Children are the first factor. I work in a traditionally male-dominated field, and questions are asked to determine familial status, even though those are illegal. Employers know single moms have to stay home if their child is sick. They know single moms can only work during daycare hours. They know single moms are unavailable for weekend and late night work. They know single moms cannot travel. The workplace is very family-unfriendly.
Seq2seq: the problem with this is that there are a lot of people who do n't want to hire women . there are plenty of women who do n't have the money to pay for it . the problem is that women do n't need to pay for their work . they do n't have to worry about it . it 's a matter of money .
Seq2seqAug: men and women are not 39.5 % of the pay gap . the problem is not that women are more likely to be victims of sexism , but rather that they are more natural good-looking/attractive action . this is not the case .
CANDELA: the problem with this argument is that the wage gap does not have to do with the gender pay gap . it is a fact that women are more likely to be able to take care of their children than their male counterparts . this is not to say that it is unfair to assume that women are being paid less than men , but that does not mean that it is not the case that women are discriminated against . it is not a matter of the wage gap , it is a matter of opinion . it is the job of the employer to make sure that the job is not the same as the other
Keyphrase Memory: wage gap; discrimination; gender pay gaps; raise the child; male colleagues; paid maternity leave; underlying gender discrimination . . .

A.2 Stance Scoring Model
Our stance scoring model calculates the score by aggregating the sentiment words surrounding the opinion targets. Here we choose the keyphrases of input statement as opinion targets, denoted as T.
We then tally sentiment words, collected from Hu and Liu (2004), towards targets in $T$, with positive words counted as +1 and negative words as −1. Each score is discounted by $d_{\tau,l}^{-5}$, with $d_{\tau,l}$ being the distance between the sentiment word $l$ and the target $\tau \in T$. The stance score $Q(psg, T)$ of a text psg (an input statement or a retrieved passage) towards opinion targets $T$ is the sum of these discounted scores over all sentiment words and targets. In our experiments, we only keep passages with a stance score of the opposite sign to that of the input statement, and with a magnitude greater than 5, i.e., $|Q(psg, T)| > 5$ (determined by manual inspection on the training set).
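The scoring and filtering steps can be sketched as follows. For simplicity, the sketch assumes single-token targets and measures distance in token positions; multi-word keyphrase targets and the exact distance definition are not specified here, so both are assumptions, as are the function names.

```python
def stance_score(tokens, targets, positive, negative, power=5):
    """Sum of sentiment polarities, each discounted by distance ** -power."""
    target_positions = [i for i, tok in enumerate(tokens) if tok in targets]
    score = 0.0
    for i, tok in enumerate(tokens):
        polarity = 1 if tok in positive else -1 if tok in negative else 0
        if polarity == 0:
            continue
        for pos in target_positions:
            distance = abs(i - pos)
            if distance > 0:  # skip a target that is itself a sentiment word
                score += polarity * distance ** (-power)
    return score


def keep_passage(passage_score, input_score, min_magnitude=5.0):
    """Keep passages of the opposite sign with a large enough magnitude."""
    return passage_score * input_score < 0 and abs(passage_score) > min_magnitude
```

Because of the fifth-power discount, a sentiment word two tokens away from a target contributes only 1/32 of an adjacent one, so the score is dominated by sentiment words right next to the targets.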

A.3 List of Discourse Markers
As described in § 5.2 in the main paper, we use a list of discourse markers together with topic signature words to label argumentative content sentences. The following list of discourse markers is manually selected from Appendix B in Prasad et al. (2008).
• Contrast: although, though, even though, by comparison, by contrast, in contrast, however, nevertheless, nonetheless, on the contrary, regardless, whereas
• Restatement/Equivalence/Generalization: eventually, in short, in sum, on the whole, overall
• Result: accordingly, as a result, as it turns out, consequently, finally, furthermore, hence, in fact, in other words, in short, in the end, in turn, therefore, thus, ultimately

A.4 Human Evaluation Guideline
Each human annotator is presented with 43 short argumentative text statements, where the first 3 statements are used for annotator calibration and excluded from the final study. The annotators are asked to evaluate 5 counter-arguments for every statement. For each counter-argument, they rate on a scale of 1 to 5 for the following aspects, and also specify the ranking among the 5 counter-arguments based on the overall quality. We display sample statements and score-level explanations in Table 5. Three aspects of arguments are evaluated:
• Grammaticality: whether the counter-argument is fluent and has no grammar errors.
• Appropriateness: whether the counter-argument is on topic and on the right stance.
• Content Richness: how many distinct talking points the counter-argument conveys.

A.5 Sample Output
In Figure 6 we show two sample snippets of our model outputs, where reused keyphrases are highlighted in color. Notice that even though at test time we disallow the same keyphrase from being selected more than once as input to the text planning decoder, the content realization decoder can still generate the same keyphrase multiple times.
Statement: Legislative bodies should be required to explain, in formal writing, why they voted a certain way when it comes to legislation.

Grammaticality:
1: With plenty of grammatical errors and not readable at all, e.g., "the way the way etc. 'm not 's important"
3: With a noticeable amount of grammatical errors but generally readable, e.g., "is a good example. i don't think should be the case. i're not going to talk whether or not it's a bad thing."
5: With no grammatical errors at all and clear to read, e.g., "i agree that the problem lies in the fact that too many representatives do n't understand the issues or have money influencing their decisions."

Appropriateness:
1: Not relevant to the prompt at all, e.g., "i don't think it 's fair to say that people should n't be able to care for their children"
2: Remotely relevant to the prompt, or relevant but poses an unclear stance, or contains obvious contradictions, e.g., "the problem with the current system is that there are many people who don't want to vote and they also don't want to vote."
3: Relevant to the prompt but stance is unclear, e.g., "i don't agree with you and i think legislative bodies do need to explain why they vote that way"
4: Relevant to the prompt and overall on the opposing stance, with minor logical contradictions, e.g., "while i agree with you but i don't think it's a good idea for house reps to explain it because they have other work to do."
5: Relevant to the prompt and on the opposing stance, with no unnatural repetitions or logical contradictions, e.g., "there are hundreds of votes a year . how do you decide which ones are worth explaining ? so many votes are bipartisan if not nearly unanimous . do those all need explanations ? they only have two years right now and i do n't want them spending less time legislating ."

Content Richness:
1: Generic response with no useful information about the topic, e.g., "i do n't agree with your point about legislation but i 'm not going to change your view."
3: With one or two pieces of key information that are useful as a counter-argument, e.g., "i agree that this is a problem for congress term because currently it is too short."
5: With sufficient key information that is useful as a counter-argument, e.g., "congressional terms are too short and us hourse reps have to spend half of their time compaigning and securing campaign funds. they really have like a year worth of time to do policy and another year to meet with donors and do favors."

Table 5: Sample statement with explanations on aspect scales. Due to the likely ambiguity in Appropriateness, we provide explanations on every possible score. Example counter-arguments are also given alongside explanations.
In this example, the phrase "death penalty" is used three times across two sentences, yet it reads as natural and relevant in context. We further include three complete sample outputs from different systems, alongside the reranked passages, in Figures 7 to 11.
Input: The wage gap isn't about gender. (...) So one factor definitely the fact that there are some government jobs that are seriously dominated by women and happen to pay really low (...)

Passage 1 Source: Wikipedia Stance: -24.65
Research has found that women steer away from STEM fields because they believe they are not qualified for them; the study suggested that this could be fixed by encouraging girls to participate in more mathematics classes. One of the factors behind girls' lack of confidence might be unqualified or ineffective teachers. Teachers' gendered perceptions on their students' capabilities can create an unbalanced learning environment and deter girls from pursuing further STEM education. They can also pass these stereotyped beliefs unto their students. Studies have also shown that student-teacher interactions affect girls' engagement with STEM. Teachers often give boys more opportunity to figure out the solution to a problem by themselves while telling the girls to follow the rules.
Passage 2 Source: The New York Times Stance: -24.01
How are the these pressures different for girls and women than they are for boys and men? If you could change one thing about typical and/or stereotypical gender roles, what would it be? 2. As a class, read and discuss the article "Girls Will be Girls" , focusing on the following questions: a. What does the author, Peggy Orenstein, mean when she says that many women are "struggling to find an ideal mix of feminism and femininity"? Do you agree? Why or why not? b. Why did some people get upset about the implicit "Girls Keep Out" sign on the cover of the "Dangerous Book for Boys"?
Passage 3 Source: The New York Times Stance: -7.91
Poverty is becoming defeminized because the working conditions of many men are becoming more feminized. Whether they realize it or not, men now have a direct stake in policies that advance gender equity. Most of the wage gap between women and men is no longer a result of blatant male favoritism in pay and promotion. Much of it stems from general wage inequality in society at large. IN most countries, women tend to be concentrated in lower-wage jobs. The United States actually has a higher proportion of skilled and highly paid female workers than countries like Sweden and Norway. Yet as a whole, Swedish and Norwegian women earn a higher proportion of the average male wage than American women because the gap between high and low wages is much smaller in those countries.
Passage 4 Source: The New York Times Stance: -21.75
Site Navigation Site Mobile Navigation Women and the Pay Gap In "How to Attack the Gender Wage Gap? Speak Up" (Dec. 16), a solution is proposed for the problem of pay inequality: make women stronger negotiators in securing their own salaries. But we should always remember that employers have an obligation to follow the law in the first place, and to pay men and women working in the same jobs the same pay. Fifty years ago, Congress decided as much by passing the Equal Pay Act -but since then the wage gap has narrowed little. There's nothing wrong in women honing their negotiating skills -and some will succeed in getting higher pay.
Passage 5 Source: The Washington Post Stance: -6.89
But, under this metric for people with a college degree, there is virtually no pay gap at all.") To be specific, "The [Bureau of Labor Standards] reports that single women who have never married earned 96% of men's earnings in 2012," two conservative economists relate. (The Post's Wonkblog comes up with 91 percent.) Job selection, risk taking ("Nearly all the most dangerous occupations, such as loggers or iron workers, are majority male and 92% of work-related deaths in 2012 were to men. Dangerous jobs tend to pay higher salaries to attract workers"), the choice to work part-time and factors other than discrimination explain the differences: While the BLS [Bureau of Labor Statistics] reports that full-time female workers earned 81% of full-time males, that is very different than saying that women earned 8% of what men earned for doing the same jobs, while working the same hours, with the same level of risk, with the same educational background and the same years of continuous, uninterrupted work experience, and assuming no gender differences in family roles like child care.

Seq2seqAug: as far as i can tell , drug use is the source of the use of drug use . in fact , there is a lot of evidence to support that . drug use can be used as a way to reduce crime . the problem is that drug use does n't make sense . it 's like saying " if you do n't like it , it 's illegal . "
CANDELA: there is a lot of research that suggests that there is a substantial amount of drugs that can be used to cause harm to others . however , there is a significant amount of research that can lead to a lot of drug addiction in the us . there are a lot of things that can be attributed to drug use , drug use , substance abuse , etc . i do n't think it 's fair to say that drug use should be illegal , but i think it 's important to note that it 's not a gateway drug . it 's just a symptom of the root cause of the drug addiction problem . it 's not just about the
CANDELA input w/o passages: if you want to look at the effects of drug use , you 'll find that there are a lot of people who are addicted to drugs like meth , heroin , and so on . in fact , there are some studies that show that there is a correlation between drug use and the availability of drugs in the us . in other words , there is a significant correlation between the effects of marijuana and drug use , and the correlation between marijuana and marijuana use . this is not to say that drug use is bad , but it is also a symptom of drug use . there are many studies that show a correlation between marijuana use and marijuana are more likely to result in a decrease in violent crime . this is a study that shows that it is not the case that the study found out that it was a result of drug use and it was found that it was found in the study .

Input: all drugs should be decriminalized. I am a firm believer that MDMA and LSD can be very therapeutic and eye opening. They can expand your consciousness and allow you to see things from a different perspective and have the ability to alter your life. (...)
Passage 1 Source: Wikipedia Stance: -28.88
In 2010, DrugScience produced a ranking of drug harms in the UK, the results of which garnered significant media attention. Drugs were assessed on two metrics -harm to users and harms to society. The report found heroin, crack cocaine, and methamphetamine to be the most harmful drugs to individuals, with alcohol, heroin, and crack cocaine as the most harmful to others. Overall, alcohol was the most harmful drug, with heroin and crack cocaine in second and third places.
Most notably, it found the legal status of most drugs bears little relation to the harms associated with them -several class A drugs including ecstacy (MDMA), LSD and magic mushrooms featured at the very bottom of the list. Similar findings were found by a Europe-wide study conducted by 40 drug experts in 2015.
Passage It wasn't MDMA after all, but methamphetamine. A new review board quickly signed on to support Mithoefer's study, but the irony of the wasted year wasn't lost on him: The misidentified drug that had been deemed too toxic to evaluate for medical use, the drug that was far more toxic than MDMA, was already a prescription drug. Meanwhile, in the four years the MDMA study lingered between concept and reality, Donna Kilgore had been driven to the brink. She took "every anti-depressant you can name," tried a dozen therapists and an almost equal number of therapeutic approaches. But nothing made that numbness, panic and rage recede.
Passage 5 Source: The New York Times Stance: -5.67
For alcohol, the safety margin is 10 (330 divided by 33 equals 10). In other words, it takes 10 times as much alcohol to kill you as it does to give you a buzz. Note: All such calculations are very rough estimates, and severe toxic reactions can occur at much lower doses depending on the health of the individual. Based on my research, the safety margin of recreational substances as normally used is: 6 for heroin; 10 for alcohol; 15 for cocaine; 16 for MDMA; 20 for codeine; and 1,000 for LSD or marijuana. Cigarettes have little immediate risk of death because most of the very lethal nicotine is destroyed in the smoke.

Figure 9: Top 5 retrieved passages after ranking for example in Figure 8.
Input: the presidents new plan to restrict welfare beneficiaries from obtaining a green card is a bad policy . (...) legal immigrants who receive any significant form of Welfare or public benefits are given a large "black mark" on their record that is heavily factored in should they attempt to apply for a green card. (...)
Human: The policy is impacting individuals, not the group. Each individual is judged based on the merit they bring forward. If an individual takes government assistance, they are most likely not contributing positively to the US.
Seq2Seq: i do n't think it 's fair to say that the government should be able to do this . if you want to get rid of illegal immigration , you need to be able to find a way to do it . you need to make sure that you do n't want to get a job . it 's just a matter of if you do n't have a problem with the government .
Seq2seqAug: there are a lot of people who do n't want to have children , but they do n't have to be able to pay for it . it 's the government 's job to make sure that the immigrants do n't have a job . they do n't need to be deported . they have to go through the process of becoming a citizen .
CANDELA: i 'm not going to try to change your view , but i 'd like to point out that there are a number of undocumented immigrants in the united states , and i 'm not sure what you 're talking about , but i do n't think it 's fair to say that undocumented immigrants are going to be more likely to be able to get the benefits of citizenship . i 'm not a fan of this , but i 'm going to focus on the fact that there is a significant amount of people who do n't have access to welfare programs in the us , and that 's not the case . it 's not like
CANDELA input w/o passages: a lot of illegal immigrants in the united states are not in the same position as they are in the us . in fact , the majority of illegal immigrants do not want to pay for the services provided by the federal government . it 's not that they are undocumented , it 's that they do n't have to worry about the fact that they 're undocumented . they 're not going to be able to get a job because they ca n't afford to live in a society where they ca n't get a job . if you 're talking about illegal immigrants , it 's not like they 're in the same boat as you

Input: the presidents new plan to restrict welfare beneficiaries from obtaining a green card is a bad policy . (...) legal immigrants who receive any significant form of Welfare or public benefits are given a large "black mark" on their record that is heavily factored in should they attempt to apply for a green card. (...)
Passage 1 Source: The Wall Street Journal Stance: -6.57
Who wrote this? Is that you Obama? 11:00 pm February 2, 2011 Welfare Worker in Washington State wrote: I think it is erroneous to assume that a large percentage of welfare recipients are people of color. I see more white people with their hands out in our area. We do have a large percentage of illegals with US born children and immigrants from Russia here in Washington that are receiving benefits.
The Russian immigrants bring in their parents and extended family that get SSI benefits for the first 5 years if they are 65 or older. They have large families -and even if they are working have a tendency to get close to 1000 in food benefits each month. The family of the Boston bombers (although here legally as "refugees") collected significant amounts of food stamps, housing subsidies, college subsidies and the like. At the same time, they had sufficient funds to travel to their homeland and other European destinations, despite having reported only modest earnings. Last week, it was reported that a Pakistani owner (presumably a legal immigrant) of a chain of 7-11's was importing illegals to work in his stores. We can have a welfare society or open borders but we can't have both. Any immigration reform has to stipulate that immigrants cannot receive any taxpayer funded benefits (federal, state or local) until after they have achieved citizenship.
Passage 4 Source: The New York Times Stance: -6.16
They are not entitled to a passport or a green card because they bypassed the legal mechanisms for obtaining such documents. In any other country they would be promptly deported, justifiably. I support immigration reform and personally do not feel any economic competition from illegal aliens. That being said, there is a difference between immigrants who have applied for and received citizeship or green cards and those who have not. There should be a fast track naturalization system for children of illegals, such as this student. He grew up here because of his parents' actions, not his own. This is similar to being born here, which has historically entailed citizenship.
Passage 5 Source: The Washington Post Stance: -13.81
In 1996, Congress enacted a requirement that legal immigrants be present for five years before becoming eligible for benefits. But we have never categorically excluded immigrants from receiving public benefits. Until now. If approved, the new policy would effectively deter legal immigrants from using public benefits for which they are eligible, lest they later be denied a green card or be removed. The DHS could also apply "public charge" to legal immigrants who use benefits for their children (such as CHIP), even if the children are U.S. citizens. The Migration Policy Institute estimates that the new policy could have a chilling effect on some 18 million noncitizens and 9 million U.S.-citizen children who reside in families where at least one person uses Medicaid/CHIP, welfare, food stamps or SSI.

Figure 11: Top 5 retrieved passages after ranking for example in Figure 10.