From Surrogacy to Adoption; From Bitcoin to Cryptocurrency: Debate Topic Expansion

When debating a controversial topic, it is often desirable to expand the boundaries of discussion. For example, we may consider the pros and cons of possible alternatives to the debate topic, make generalizations, or give specific examples. We introduce the task of Debate Topic Expansion - finding such related topics for a given debate topic, along with a novel annotated dataset for the task. We focus on relations between Wikipedia concepts, and show that they differ from well-studied lexical-semantic relations such as hypernyms, hyponyms and antonyms. We present algorithms for finding both consistent and contrastive expansions and demonstrate their effectiveness empirically. We suggest that debate topic expansion may have various use cases in argumentation mining.


Introduction
Recent years have seen substantial advances in Debating Technologies - computational technologies developed directly to enhance, support, and engage with human debating (Gurevych et al., 2016). A recent milestone in this field is IBM's Project Debater, the first demonstration of a live competitive debate between an AI system and a human debate champion.
When debating a controversial topic, it is often desirable to expand the boundaries of the discussion, and bring up arguments about related topics.
For example, when discussing the pros and cons of the presidential system, it is natural to contrast them with those of the parliamentary system. When debating alternative medicine, we may discuss specific examples, such as homeopathy and naturopathy. Conversely, when discussing bitcoin, we can speak more broadly about cryptocurrency.
Consider the use of debating technologies for decision support, where the pros and cons of a given proposal are extracted from a large corpus, summarized and presented to the user. Current methods for topic-related, corpus-wide argument mining only specify the given debate topic in their search queries (Levy et al., 2017, 2018; Wachsmuth et al., 2017; Stab et al., 2018a,b). As a result, much of the relevant argumentative content is left out of their reach. Alternatively, context-independent argument mining can exhaustively extract argumentative content from a corpus (Lippi and Torroni, 2015), but it cannot tell which arguments are actually relevant for the topic in question.
In this work we take a step towards closing this gap, by introducing the task of Debate Topic Expansion - finding related topics that can enrich our arguments and strengthen our case when debating a given topic. Following previous work (Levy et al., 2017, 2018), we focus on topics that are Wikipedia concepts (article titles in Wikipedia).
Two types of expansions are studied: consistent and contrastive (Bar-Haim et al., 2017). Arguing in favor or against a consistent expansion may support the same stance towards the original topic, whereas for contrastive expansions the stance is reversed.
For example, Bitcoin⇒Cryptocurrency and Alternative medicine⇒Homeopathy are consistent expansions, while Presidential system⇒Parliamentary system is a contrastive expansion, since we may support the presidential system by criticizing the parliamentary system.
While these relations may seem reminiscent of hypernyms/hyponyms, antonyms and co-hyponyms, we show that they differ from these well-studied relations.
We propose a three-step method for debate topic expansion. First, expansion candidates are extracted from a large corpus using a set of predefined patterns. Each expansion type makes use of a different type of corpus and patterns. In the second step, we apply a set of filters to the extraction results. Candidates that pass these filters are manually annotated as good/bad expansions, resulting in the first dataset for this task. The labeled dataset is utilized in the final step, where we employ supervised classification to identify good expansions amongst the candidates. We explore two approaches: (i) traditional feature-based classification, for which we introduce a novel set of features; (ii) a deep neural network, which is trained by distant supervision. Experiments with hundreds of unseen topics show promising results, and the best performance is achieved by combining both classification approaches.

Task Description
Let DC (debate concept) be a Wikipedia concept representing a debate topic. Our goal is to find Wikipedia concepts that represent consistent and contrastive expansions of DC, as defined in the previous section. Table 1 lists several positive and negative examples of candidates extracted for each expansion type. The examples are taken from our labeled dataset, described in the next sections.
Consistent expansions in our dataset include both broader concepts (examples 1,8,10 in Table 1) and more specific concepts (examples 3,5,7). While some of these expansions are strict hypernyms (1,8) or hyponyms (5,7), others are not (3,10). Moreover, broader/narrower concepts do not necessarily make relevant expansions. For example, while Vegetarianism is a type of Diet (2), arguing about diets in general does not seem relevant for a debate about vegetarianism, in particular since such a debate typically contrasts vegetarianism with other types of diet.
Contrastive expansions involve diverse semantic relations and subtle distinctions. For example, opposites may be relevant expansions in some cases, e.g. (19), but irrelevant in others (15). Contrastive expansions are often co-hyponyms of the DC. For instance, Democracy and Dictatorship (11) are both forms of government. However, co-hyponyms are not always appropriate as contrastive expansions. When debating about Boxing (e.g., whether it should be banned), we would not contrast it with Wrestling (17), despite both being combat sports, as most arguments for and against boxing equally apply to wrestling.
The above examples illustrate some of the challenges in this task. The criterion for a good expansion -its usefulness in a debate -requires some knowledge and understanding of the possible contexts in which the given DC may be debated. Moreover, such judgments are, to some extent, inherently subjective.

Candidate Extraction
The first step in expanding a given DC is extracting expansion concepts for each expansion type. An expansion concept (EC) is a Wikipedia concept that co-occurs with the DC in some predefined pattern that is matched in a corpus. We use a wikification tool that identifies different surface variations of the same concept in the text. Below we describe the patterns and corpora used for each expansion type. For both types we require at least two pattern matches for each expansion concept.
Consistent expansions. Our list of patterns for extracting consistent expansions includes some of the well-known Hearst patterns for extracting hyponyms (Hearst, 1992), as well as some additional patterns. Some examples are 'X such as Y', 'X is a Y', and 'X and other Y', where X matches DC and Y matches EC, or vice versa. Despite the differences between hypernyms/hyponyms and consistent expansions, these patterns provide a reasonable starting point for our algorithm. The patterns are matched in a corpus C_N of news articles, comprising about 10 billion sentences. The sentences undergo wikification and indexing, which allows efficient pattern search for a given concept.
Contrastive expansions. As previously observed by Bar-Haim et al. (2017), queries to web search engines often contain contrastive expressions, e.g. "why is renewable energy better than fossil fuels", and are typically succinct and easy to parse. We used a corpus C_Q of 1.2 billion queries (450 million distinct queries) from the Blekko search engine. Sample patterns include 'X * vs * Y', 'X * better than * Y', and 'difference between * X * and * Y', where '*' matches any number of non-concept tokens.
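To make the extraction step concrete, the following is a minimal sketch of surface-pattern matching. It is not the actual implementation: the system matches patterns over wikified concept mentions in indexed corpora, whereas this toy version fills hypothetical {X}/{Y} slots with raw concept strings and scans plain text with regular expressions.

```python
import re
from collections import Counter

# Simplified pattern templates; {X} and {Y} are slots for concept strings.
CONSISTENT_TEMPLATES = ["{X} such as {Y}", "{X} is a {Y}", "{X} and other {Y}"]
CONTRASTIVE_TEMPLATES = [r"{X} \S* ?vs\.? \S* ?{Y}",
                         r"{X} .*better than.* {Y}",
                         r"difference between .*{X}.* and .*{Y}"]

def count_matches(texts, dc, ec, templates):
    """Count pattern matches of (dc, ec), trying both slot orders."""
    n = 0
    for t in texts:
        for tpl in templates:
            for x, y in ((dc, ec), (ec, dc)):
                pat = tpl.replace("{X}", re.escape(x)).replace("{Y}", re.escape(y))
                if re.search(pat, t, flags=re.IGNORECASE):
                    n += 1
    return n

def extract_candidates(texts, dc, concepts, templates, min_matches=2):
    """Keep expansion concepts with at least `min_matches` pattern matches."""
    counts = Counter({ec: count_matches(texts, dc, ec, templates)
                      for ec in concepts if ec != dc})
    return [ec for ec, c in counts.items() if c >= min_matches]
```

For instance, running the contrastive templates over a handful of queries mentioning renewable energy and fossil fuels yields 'fossil fuels' as a candidate, since it passes the two-match threshold.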

Candidate Filtering
Candidate extraction is followed by a candidate filtering step, in which we apply a set of filters to each extracted pair (DC,EC). The filters for each expansion type are described below.

Filters for Consistent Expansions
Directionality. We aim to determine whether a consistent expansion EC is a generalization or a specialization of the DC, based on the number of times EC is matched in each role across the patterns. If the direction is not clearly determined, i.e. the majority role is matched in less than 80% of the cases, the expansion is discarded.
Named Entity. Only the more specific concept amongst DC and EC (according to the determined direction) may be a named entity, where a Wikipedia concept is considered a named entity if its page type is defined as person, organization or location.

Frequency Ratio. Incompatible frequencies of DC and EC may indicate a bad expansion. Accordingly, this filter restricts the ratio between the frequencies of DC and EC in C_N. We require min(Freq(DC)/Freq(EC), Freq(EC)/Freq(DC)) ≥ 0.2.

Distributional Similarity. Consistent expansions are expected to occur in contexts similar to those of the DC. This is captured by measuring the distributional similarity s_D between DC and EC. We derived concept-level word2vec vectors (Mikolov et al., 2013a) from C_N, where each wikified mention of a concept C was considered an occurrence of C. s_D(DC, EC) is then defined as the cosine similarity between the representations of DC and EC, and we require s_D(DC, EC) ≥ 0.5. This filter may also remove an EC that is too broad or too narrow with respect to DC.
Substring. We require that EC is not a substring of DC and vice versa. This filter discards expansions such as (Private university, University) and (Marriage, Gay Marriage).
Additional filters. We also filter out concepts containing the phrases 'Anti-', 'List of' and 'Lists of', as well as pairs (DC, EC) that co-occur in the same sentence in C_N fewer than 10 times.
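The filters above can be combined into a single predicate, sketched below. The thresholds (80% majority role, frequency ratio 0.2, cosine similarity 0.5) are taken from the text, while the inputs - per-role pattern-match counts, corpus frequencies, and embedding vectors - are hypothetical stand-ins for the corpus statistics and concept-level word2vec vectors. The named-entity and phrase filters are omitted for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def passes_filters(dc, ec, broader_count, narrower_count,
                   freq_dc, freq_ec, vec_dc, vec_ec):
    # Directionality: one role must account for >= 80% of pattern matches.
    total = broader_count + narrower_count
    if total == 0 or max(broader_count, narrower_count) / total < 0.8:
        return False
    # Substring: neither concept may contain the other.
    if dc.lower() in ec.lower() or ec.lower() in dc.lower():
        return False
    # Frequency ratio: min(f1/f2, f2/f1) >= 0.2.
    if min(freq_dc / freq_ec, freq_ec / freq_dc) < 0.2:
        return False
    # Distributional similarity: cosine >= 0.5.
    return cosine(vec_dc, vec_ec) >= 0.5
```

A pair like (Marriage, Gay Marriage) is rejected by the substring check before any corpus statistics are consulted.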

Filters for Contrastive Expansions
Named Entity. Neither DC nor EC may be named entities.
Substring. Same as for consistent expansions.
Semantic relatedness. We found that, unlike consistent relations, contrastive relations correlate better with semantic relatedness than with distributional similarity. We use WORT (Ein Dor et al., 2018), a semantic relatedness tool for Wikipedia concepts, as our relatedness measure. We denote this measure s_R, and require that s_R ≥ 0.4.

The Debate Topic Expansion Dataset
Based on our candidate acquisition method, we created the DTE (Debate Topic Expansion) dataset, comprising about 3,000 annotated pairs of debate concepts and their expansion candidates. The dataset contains positive and negative examples of both consistent and contrastive expansions. The construction of the dataset is described below.
We manually collected a diverse set of 632 debate concepts from a variety of sources, including the iDebate website (https://idebate.org/debatabase). For each debate concept, we performed candidate extraction and filtering for consistent and contrastive expansions, as described in the previous section. Each of the resulting (DC, EC) pairs was assessed by five annotators, and was labeled as either positive (good expansion) or negative (bad expansion), based on the majority labeling.
One intriguing subtlety we noticed early on is that for contrastive expansions, whether EC is a good expansion depends in part on our stance towards the DC. If we argue against the DC (Con stance), we may choose any plausible alternative as our EC, following a line of argument such as "EC is a better alternative to DC". However, when we argue in favor of the DC (Pro stance), the typical argument changes to "if we don't choose DC then we are left with EC", which requires EC to be the "default" alternative to DC. For example, when arguing against atheism, one may argue that Christians are happier than atheists; however, when taking a pro-atheism stance, it is better to argue against religion in general than specifically against Christianity. The annotators were therefore asked to assess contrastive expansions for both Pro and Con stances. However, developing a classifier that can make such fine distinctions falls outside the scope of the current work. Instead, we take the union of good expansions for each stance as our positive instances, while keeping the per-stance annotations in the dataset for future research.

Table 2 provides some statistics on the resulting dataset. Our candidate acquisition method was found to be applicable to a significant portion of the topics: one or more good consistent expansions were found for 43% of the topics, and good contrastive expansions were found for 19% of the topics. Precision, however, is low, even after applying our filters: 49% for consistent expansions, and 19% for contrastive expansions. This motivates an additional supervised classification step, presented in the next section. These statistics suggest that identifying contrastive expansions is considerably more challenging than finding consistent expansions.
Fleiss' κ is 0.45 for consistent expansions and 0.43 for the unified contrastive expansions, which corresponds to moderate agreement. This level of agreement reflects the complexity and inherent subjectivity of the task, as discussed in Section 2, and is comparable to previous results for annotation tasks in argumentation mining. For example, Aharoni et al. (2014) report κ of 0.39-0.4 for claim and evidence annotation in Wikipedia articles.

Supervised Candidate Classification
We experimented with two complementary supervised classification methods: feature-based classification, which integrates diverse types of evidence from various sources, and a distantly-supervised neural network, which learns to discriminate between positive and negative pairs based on the contexts in which they co-occur.

Feature-Based Classification
We train a logistic regression classifier for each expansion type. The classifiers make use of novel sets of features designed for this task. Most features are shared by both classifiers, and a few additional features were developed specifically for each task. Below we give an overview of the features extracted for a given (DC,EC) pair. A more detailed and complete description is found in Appendix A.
Similarity & relatedness. The similarity and relatedness measures s_D and s_R, defined in Section 3.2.
Wikipedia. The following features take advantage of DC and EC both being Wikipedia titles, and make use of information found in their respective pages: (i) Number of Wikipedia categories shared by DC and EC; (ii) Count of occurrences of DC in EC's categories or EC in DC's categories, up to two category levels; (iii) Count of shared Wikipedia outlinks of DC and EC.
WordNet. Whether DC is a hypernym, hyponym, synonym or co-hyponym of EC in WordNet (Miller, 1995); four binary features.
Sentiment. Consistent expansions are expected to have the same sentiment polarity as the DC, whereas opposite polarities may indicate contrastive expansions (e.g., Democracy vs. Dictatorship). Similar to Iyyer et al. (2015), we train a linear SVM classifier on the sentiment lexicon of Hu and Liu (2004), using the word2vec word embeddings computed over the C_N corpus as the features and the word polarities as the labels. Word polarity can then be determined by the sign of the classifier's output score, and the sentiment strength by its magnitude. We take the product of the classifier's scores for DC and EC as a single sentiment feature.
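As an illustration of this feature, the sketch below trains a linear classifier on a toy polarity lexicon. A simple perceptron stands in for the linear SVM used in the paper, and the two-dimensional toy vectors stand in for word2vec embeddings.

```python
def train_linear(lexicon, dim=2, epochs=50):
    """Train a perceptron on (vector, polarity) pairs; polarity is +1/-1."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, polarity in lexicon:
            score = sum(wi * xi for wi, xi in zip(w, vec)) + b
            if polarity * score <= 0:  # misclassified: perceptron update
                w = [wi + polarity * xi for wi, xi in zip(w, vec)]
                b += polarity
    return w, b

def sentiment_score(w, b, vec):
    # Sign = polarity, magnitude = sentiment strength.
    return sum(wi * xi for wi, xi in zip(w, vec)) + b

def sentiment_feature(w, b, vec_dc, vec_ec):
    # Positive product: same polarity; negative product: opposite polarities.
    return sentiment_score(w, b, vec_dc) * sentiment_score(w, b, vec_ec)
```

With any such linear model, a negative product flags a potentially contrastive pair, while a positive product is consistent with same-polarity concepts.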
Corpus statistics. Simple corpus-based features are derived from the number of co-occurrences of DC and EC in the same sentence in C_N or in the same query in C_Q. These features are normalized to [0, 1] by setting for each feature an upper threshold k on the count: counts in the range [0, k] are linearly transformed to [0, 1], and counts above k are set to 1. We also consider other corpus-based measures, such as the pointwise mutual information (PMI) between DC and EC. For the consistent expansions classifier we also use the frequency ratio measure, defined in Section 3.2, as a feature.
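The capped normalization and the PMI feature are straightforward to sketch; the counts and the cap k used in the example below are illustrative, not the paper's actual values.

```python
import math

def normalize_count(count, k):
    """Map counts in [0, k] linearly to [0, 1]; counts above k saturate at 1."""
    return min(count, k) / k

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of x and y over `total` sentences/queries."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))
```

For example, with k = 1000 a co-occurrence count of 250 maps to 0.25, and any count above 1000 saturates at 1.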
Other corpus-based features are based on pattern matching. For instance, we define a set of contrastive patterns, e.g., 'X vs Y' and 'X instead of Y', and derive features such as the (normalized) count of (DC, EC) matches for these patterns, and the PMI of DC and EC in the subset of sentences/queries matching the patterns.
Overall, the consistent expansions classifier uses 15 features, and the contrastive expansions classifier uses 22.

Distantly Supervised Neural Network
The other classification approach we experimented with is based on distant supervision (Mintz et al., 2009). As before, we train two separate classifiers, one for consistent and one for contrastive expansions, using their respective training sets. For each pair (DC, EC) in the training set, we retrieve from the C_N index up to 10,000 sentences that contain mentions of both DC and EC. The retrieved sentences are all labeled with the pair's label, positive or negative. These labels are noisy, since not every co-occurrence of DC and EC in a sentence is indicative of the relation between them. Our hope, however, is that the large number of training sentences collected this way compensates for the noisy labels. The mentions of DC and EC in each sentence are replaced with generic symbols, DC and EC, to facilitate generalization over specific instances. We found that for consistent expansions it is better to keep only the text between these two symbols, while for contrastive expansions using the whole sentence works better. We balance the dataset to have an equal number of positive and negative training instances for each type.
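A minimal sketch of how such training instances might be constructed. Real mentions come from the wikified corpus, whereas here simple string replacement stands in, and the generic symbols are written literally as DC and EC; both mentions are assumed to occur in the sentence.

```python
def make_instance(sentence, dc_mention, ec_mention, label, expansion_type):
    """Build one distant-supervision instance from a co-occurrence sentence."""
    s = sentence.replace(dc_mention, "DC").replace(ec_mention, "EC")
    if expansion_type == "consistent":
        # Keep only the span between the two symbols (inclusive).
        i, j = s.find("DC"), s.find("EC")
        start, end = min(i, j), max(i, j)
        s = s[start:end + 2]  # +2 covers the two-character symbol
    return s, label
```

For the sentence "Cryptocurrencies such as bitcoin are volatile.", the consistent-expansion instance reduces to "EC such as DC", while a contrastive instance would keep the whole symbol-substituted sentence.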
The sentences collected for the whole training set are then used to train a neural network. Essentially, the network aims to determine whether a given sentence provides positive or negative evidence for the target relation (consistent or contrastive expansion) between DC and EC. When applying the classifier to a new pair, we collect up to 500 sentences for that pair and average the classifier's predictions over these sentences.
Neural network description. Our network is a bi-directional LSTM (Graves and Schmidhuber, 2005) with an additional attention layer (Yang et al., 2016). The models are all trained with a dropout of 0.85, using a single dropout mask across all timesteps, as proposed by Gal and Ghahramani (2016). The cell size in the LSTM layers is 128, and the attention layer is of size 100. We use Adam as the optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. Words are represented using the 300-dimensional GloVe embeddings trained on 840B Common Crawl tokens, and are left untouched during training (Pennington et al., 2014).

Experimental Setup
We assess the performance of our method on the following practical task: given a debate concept DC, find one good expansion concept EC for each expansion type. Recall that our dataset includes annotations for all the expansion candidates found for each DC by the candidate acquisition algorithm. Here we compare different methods for choosing one good expansion from these candidates. For each expansion type, we assume a scoring function f (X, Y ) over a pair of concepts, which predicts the likelihood of the target relation holding between X and Y . We further assume a threshold α representing the minimum score for a good expansion. Given a debate concept DC, we choose its highest scoring expansion, if its score exceeds the threshold. If no expansion was found, or all the expansion scores are below the threshold, we make no prediction. By modifying the threshold, we can explore the tradeoff between the number of predictions we make and their quality.
The following scoring functions were assessed: the feature-based logistic regression classifier (LR), the distantly-supervised neural network (DNN), their combination (LR+DNN), and unsupervised baselines, among them distributional similarity (SIM) and co-occurrence frequency in the query corpus (FREQ-Q).

Due to the small number of instances in the development set, we did not use it to tune the LR classifier, but rather used both the train and the development sets to train the classifier. Together they contain 983/626 consistent/contrastive expansion candidates, respectively.

Performance measures. Let N be the total number of debate concepts in the test set, let C be the number of correct predictions, let P be the number of predictions made, and let R be the number of debate concepts in the test set for which good expansions exist. We define the following measures: (i) Precision@1 = C/P; (ii) Recall@1 = C/R; and (iii) Coverage = P/N.

Figure 1 compares the above candidate scoring methods for both consistent (a) and contrastive (b) expansions. For each configuration, the Precision@1 vs. Recall@1 curve is obtained by varying the threshold α. For readability, only the best-performing baseline for each expansion type is shown. Both the LR and the LR+DNN configurations outperform the strongest baseline by a large margin. This result illustrates the importance of supervised learning for this task. For consistent expansions, LR+DNN is clearly the best-performing configuration.
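The evaluation protocol can be sketched as follows; the candidate-list format is a hypothetical stand-in for the labeled data, with each candidate carrying its score and gold label.

```python
def evaluate(data, alpha):
    """Precision@1, Recall@1 and Coverage for threshold alpha.

    `data` maps each debate concept to a list of (candidate, score, is_good)
    triples (illustrative format). For each concept, the highest-scoring
    candidate is predicted if its score is at least alpha.
    """
    n = len(data)  # debate concepts in the test set
    r = sum(any(good for _, _, good in cands) for cands in data.values())
    correct = predicted = 0
    for cands in data.values():
        if not cands:
            continue  # no candidates found: no prediction
        best = max(cands, key=lambda c: c[1])
        if best[1] >= alpha:  # make a prediction
            predicted += 1
            correct += best[2]
    return {"precision@1": correct / predicted if predicted else 0.0,
            "recall@1": correct / r if r else 0.0,
            "coverage": predicted / n}
```

Raising alpha trades Coverage (and Recall@1) for Precision@1, which is the tradeoff explored in the figures.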

Results
For contrastive expansions, LR+DNN outperforms the LR classifier in high-precision/low-recall areas (Recall@1 < 0.53), while in high-recall/low-precision areas the LR classifier performs better. As one may expect, the performance for consistent expansions is better than for contrastive expansions, as the latter seems a more challenging task. Interestingly, for consistent expansions SIM is the strongest baseline, whereas for contrastive expansions the best baseline is FREQ-Q. The performance of the DNN for consistent expansions is comparable to the best baseline, but for contrastive expansions it is much higher. Again, this may be attributed to the difference in difficulty between the two tasks, with contrastive expansions requiring more powerful methods.
We now take a closer look at the results for the LR+DNN configuration. To illustrate its performance, Table 3 includes sample data points for this configuration, for each expansion type. So far we have used the Recall@1 measure to compare the coverage of different scoring methods with respect to the given set of candidates; the coverage of the candidate acquisition step was not taken into account in this assessment. In order to assess the end-to-end performance of our system, we next consider the tradeoff between Precision@1 and Coverage, as the latter measures the fraction of debate concepts for which we make a prediction out of all debate concepts in the test set.
The LR+DNN results for both expansion types are shown in Figure 2, and sample values are shown in Table 4. For example, by setting the threshold appropriately, we can find consistent expansions for 38.3% of the debate concepts with (at least) 80% precision. Precision and coverage for contrastive expansions are lower; for example, when requiring precision of 70%, we can make predictions for nearly 20% of the topics.

Related Work
There is a vast body of research work on identifying semantic relations between a pair of terms. Most studied relations include hyponyms/hypernyms, synonyms, antonyms, and meronyms. The main approaches applied to this task are summarized below.
Pattern-based methods. A fundamental type of evidence for detecting such relations is based on the co-occurrence of the two terms in some text, typically in the same sentence. Pattern-based methods define lexico-syntactic contexts containing slots to be filled by instances of the target relation. Patterns can be defined over surface forms, or over syntactic representations such as paths in a dependency parse. Hearst (1992) introduced a pattern-based method for hyponym extraction, using a small set of manually-constructed textual patterns (for example "NP1 such as NP2"). Similar methods were used by Berland and Charniak (1999) for extracting meronyms, and by Lin et al. (2003) for identifying non-synonyms among semantically similar words.
Snow et al. (2005) developed a method to automatically learn new path-based patterns and used these patterns as features for hypernym classification; they later expanded this method for taxonomy construction (Snow et al., 2006). Schulte im Walde and Köper (2013) used automatically acquired word patterns to distinguish between antonyms, synonyms and hypernyms in German.
Distant supervision. Mintz et al. (2009) introduced the concept of distant supervision for relation extraction. The idea is to use an external knowledge base as a source of supervision instead of using labeled text (Riedel et al., 2010). The distant supervision paradigm assumes that any sentence containing an entity pair of a known relation is likely to express that relation in the text. Since this assumption leads to noisy data and features, researchers have developed multi-instance approaches to deal with invalid sentences and wrong labels (Zeng et al., 2015; Riedel et al., 2010; Surdeanu et al., 2012). Another solution to the noisy data problem is using a sentence-level attention model (Lin et al., 2016).
Distributional methods. These methods aim to determine the relation between two terms by independently modeling the contexts in which each term occurs. Lin (1998), and later Weeds and Weir (2003), developed distributional similarity measures and showed that they can be used to predict hypernymy relations over WordNet terms.
Translation in a word embedding space may capture various syntactic and semantic relations between words. This was demonstrated by Mikolov et al. (2013b) and Pennington et al. (2014) on the task of solving word analogies. Word embeddings were used for various relation extraction tasks, for example by taking the difference of the two terms' vectors (Roller et al., 2014).

Taxonomy induction from Wikipedia. Apart from relation extraction, taxonomy induction over Wikipedia concepts and categories is another line of research related to the current work. Examples are WikiTaxonomy (Ponzetto and Strube, 2007), YAGO (Hoffart et al., 2013), and the Wikipedia Bitaxonomy project (Flati et al., 2014). As described by Gupta et al. (2016), these works utilize information about Wikipedia concepts, the category network and the link structure.
The current work makes the following contributions with respect to previous relation extraction work. First and foremost, it introduces and studies a new relation extraction task -finding consistent and contrastive expansions for a given debate topic. To address this challenging task, we propose a hybrid architecture that combines diverse knowledge sources and techniques. Another contribution of this work is a novel set of patterns, filters and features designed specifically for this task.
Stance classification. Consistent and contrastive relations were previously discussed in the stance classification literature. Somasundaran et al. (2009) refer to these relations as same/alternative, and use them in conjunction with discourse relations to improve the prediction of opinion polarity. However, they do not attempt to identify these relations, but rather take them from a labeled dataset. Bar-Haim et al. (2017), as part of their work on claim stance classification, developed a classifier that aims to distinguish consistent from contrastive relations defined between the sentiment targets of a claim and the debate proposition. By contrast, our work addresses both candidate acquisition and classification, and most candidates are neither consistent nor contrastive expansions.

Conclusion
This work introduced a new task, debate topic expansion, along with a corresponding benchmark dataset, which we plan to make publicly available. We presented a working solution for this challenging task that achieved promising empirical results. The best results are obtained by combining diverse methods and techniques: pattern-based extraction, a novel set of filters and classification features, and a distantly-supervised neural network.
Debate topic expansion may be highly valuable for argumentation mining. For instance, topic-related argument mining has many potential use cases, such as helping individuals and organizations make better decisions, enhancing civic discourse by identifying arguments raised in the media, and promoting critical thinking among students. Debate topic expansion can enhance the coverage of existing argument mining methods by matching relevant arguments that do not mention the given topic explicitly. In addition, distinguishing consistent from contrastive expansions may improve argument stance classification. We plan to pursue these research directions in future work.

A Features

2. C_Q VS COUNT: count of DC, EC co-occurrences in C_Q queries matching a VS extraction pattern (see Section B.2).
3. C_Q PMI: pointwise mutual information (PMI) of DC, EC over all C_Q queries (Equation (1)).
4. C_Q VS PMI: PMI of DC, EC over C_Q queries matching some VS extraction pattern. Same definition as Equation (1), but with the following quantities:
p(DC) = C_Q DC VS COUNT / C_Q VS SIZE
p(EC) = C_Q EC VS COUNT / C_Q VS SIZE
p(DC, EC) = C_Q VS COUNT / C_Q VS SIZE

2. C_Q VS SIZE: total count of C_Q queries matching a VS pattern.

3. C_Q DC COUNT: count of DC occurrences in C_Q queries.
4. C_Q EC COUNT: count of EC occurrences in C_Q queries.
5. C_Q DC VS COUNT: count of DC occurrences in C_Q queries matching a VS pattern.
6. C_Q EC VS COUNT: count of EC occurrences in C_Q queries matching a VS pattern.
7. C_N TOT DC: count of DC occurrences in C_N sentences, over all surface forms.
8. C_N TOT EC: count of EC occurrences in C_N sentences, over all surface forms.

B Patterns
In the patterns listed in this section, (X, Y) stands for either (DC, EC) or (EC, DC), '*' matches any number of non-concept tokens, and [] indicates optional characters.